SelectVariants and VariantFiltration methods

bwubb Member Posts: 54
edited January 2013 in Ask the GATK team

Hi, I wanted to double-check my methods for some targeted capture data. I ran 96 samples through UG (UnifiedGenotyper) to produce a multisample VCF. I split SNPs and indels into separate files using SelectVariants, and applied filters:

For SNPs
"QD < 2.0", "MQ < 40.0", "FS > 60.0", "HaplotypeScore > 13.0", "MQRankSum < -12.5", "ReadPosRankSum < -8.0"

For indels
"QD < 2.0", "ReadPosRankSum < -20.0", "InbreedingCoeff < -0.8", "FS > 200.0"

I then went back through with SelectVariants, pulling out each sample one at a time into its own filtered VCF.

My results are... let's say, wrong. I am wondering if it would be better practice to select each sample first and then apply the filters, or if it does not matter and my errors lie elsewhere. Thank you.
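
For illustration only, here is a minimal Python sketch of what the thresholds listed above mean when applied to a site's INFO annotations. It is not how VariantFiltration is implemented. The helper names (`parse_info`, `failed_filters`), the example INFO strings, and the choice to skip a missing annotation rather than fail the site are all assumptions made for this sketch. Because these annotations live in the site-level INFO field rather than in a per-sample column, the sketch never looks at any individual sample.

```python
import operator

# Thresholds copied from the post; a site FAILS a filter when the expression is true.
SNP_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("MQ", operator.lt, 40.0),
    ("FS", operator.gt, 60.0),
    ("HaplotypeScore", operator.gt, 13.0),
    ("MQRankSum", operator.lt, -12.5),
    ("ReadPosRankSum", operator.lt, -8.0),
]
INDEL_FILTERS = [
    ("QD", operator.lt, 2.0),
    ("ReadPosRankSum", operator.lt, -20.0),
    ("InbreedingCoeff", operator.lt, -0.8),
    ("FS", operator.gt, 200.0),
]


def parse_info(info_field):
    """Turn 'QD=1.5;FS=3.2;DB' into {'QD': '1.5', 'FS': '3.2', 'DB': True}."""
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def failed_filters(info, filters):
    """Return the annotation names whose filter expression is true.

    Assumption for this sketch: a missing annotation is skipped,
    not counted as a failure.
    """
    failed = []
    for key, compare, threshold in filters:
        if key not in info:
            continue
        if compare(float(info[key]), threshold):
            failed.append(key)
    return failed


# Hypothetical INFO strings, made up for the example.
snp_info = parse_info("QD=1.2;MQ=55.0;FS=75.3;MQRankSum=-0.4")
print(failed_filters(snp_info, SNP_FILTERS))      # ['QD', 'FS']

indel_info = parse_info("QD=12.0;FS=150.0;ReadPosRankSum=-1.1")
print(failed_filters(indel_info, INDEL_FILTERS))  # []
```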

Answers

  • bwubb Member Posts: 54

    This is still an issue for me. I was under the impression my method was in line with "Best Practices".

    I ran UG on a list of sample.bams to produce a multi-sample vcf.
    Next, I ran SelectVariants to pick out SNPs and indels for each sample into sample.snp.vcf and sample.indel.vcf, respectively.
    I then applied filter criteria (I even broke up || statements into individual filters).

    The big issue is that all of the vcf files have the same number of variants for each sample. I compared these with old results from GATK-1.3, where I ran UG individually on each sample.bam, and this definitely did not happen previously. I could go back to that, but I was under the impression that GATK was supposed to be able to handle a list of bams and maintain sample individuality. Am I incorrect? Thank you.

  • pdexheimer Member, Dev Posts: 544 ✭✭✭✭
    Accepted Answer

    Just because a site is listed in the vcf doesn't necessarily mean it's variant in (any of) the sample(s) - a genotype of homozygous reference is perfectly valid. SelectVariants has an option for removing variants that are reference in all selected samples, you may want to try that.

  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie Posts: 11,631 admin
    Accepted Answer

    Yes @bwubb, the GATK does maintain sample individuality when doing multisample calling. To further clarify what @pdexheimer is saying: the "same number of variants in each sample's VCF" thing is not a cause for alarm. When you are calling variants on your multisample dataset, any site that is called as variant in at least one sample will therefore be called/genotyped for every other sample, even if it is hom ref in those samples. When you separate out the variant calls by sample, you will necessarily get the same number of calls for every sample, but a large number of those may be hom ref. You can indeed filter those out if you want.

    Geraldine Van der Auwera, PhD
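
The point made in the two answers above can be sketched on a toy example. The Python below uses a made-up three-site, three-sample VCF body (the sample names `S1` to `S3`, the positions, and the genotypes are all hypothetical) and two illustrative helpers that are not part of GATK. It shows that a plain per-sample subset keeps every site, so all samples get the same record count, while dropping hom-ref and no-call genotypes gives each sample its own set of variant sites. In GATK 3 the SelectVariants option for this is, to the best of my knowledge, `--excludeNonVariants` (`-env`); check the tool documentation for your version.

```python
# Toy multi-sample VCF body (made-up data): 3 sites, 3 samples.
HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3"
RECORDS = [
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\t0/0",
    "1\t200\t.\tC\tT\t60\tPASS\t.\tGT\t0/0\t1/1\t0/0",
    "1\t300\t.\tG\tA\t70\tPASS\t.\tGT\t0/1\t0/1\t./.",
]


def sample_genotypes(header, records, sample):
    """Return (pos, GT) for one sample at every site, like a plain per-sample subset."""
    col = header.split("\t").index(sample)
    out = []
    for line in records:
        fields = line.split("\t")
        gt_index = fields[8].split(":").index("GT")  # locate GT within FORMAT
        out.append((int(fields[1]), fields[col].split(":")[gt_index]))
    return out


def is_variant(gt):
    """True if the genotype carries at least one non-reference allele."""
    alleles = gt.replace("|", "/").split("/")
    return any(a not in ("0", ".") for a in alleles)


for sample in ("S1", "S2", "S3"):
    calls = sample_genotypes(HEADER, RECORDS, sample)
    variant_sites = [pos for pos, gt in calls if is_variant(gt)]
    # Every sample has 3 records, but the variant sites differ per sample.
    print(sample, len(calls), variant_sites)
# S1 3 [100, 300]
# S2 3 [200, 300]
# S3 3 []
```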

  • bwubb Member Posts: 54

    Thank you both. When @pdexheimer posted his comment, the light-bulb clicked on in my brain. I just needed to re-run my scripts to confirm. Hopefully I get it right this time. Thanks again!
