SelectVariants and VariantFiltration methods

bwubb Posts: 50 Member
edited January 2013 in Ask the GATK team

Hi, I wanted to double-check my methods for some targeted capture data. I ran 96 samples through UG (UnifiedGenotyper) to produce a multisample VCF. I separated SNPs and indels into separate files using SelectVariants, and applied filters:

For SNPs: "QD < 2.0", "MQ < 40.0", "FS > 60.0", "HaplotypeScore > 13.0", "MQRankSum < -12.5", "ReadPosRankSum < -8.0"

For indels: "QD < 2.0", "ReadPosRankSum < -20.0", "InbreedingCoeff < -0.8", "FS > 200.0"

I then went back through with SelectVariants, pulling out each sample one at a time into their own filtered VCF.

My results are... let's say, wrong. I am wondering whether it would be better practice to select each sample first and then apply the filters, or whether it does not matter and my errors lie elsewhere. Thank you.
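
The hard filters listed in the question can be read as simple threshold tests on each site's annotations. The Python sketch below is only an illustration of that logic, not how VariantFiltration is implemented: the function name `failed_filters` and the example site are hypothetical, and skipping a filter when its annotation is missing is an assumption made for this sketch.

```python
# Thresholds copied from the question; each entry is (annotation, operator, cutoff).
SNP_FILTERS = [
    ("QD", "<", 2.0),
    ("MQ", "<", 40.0),
    ("FS", ">", 60.0),
    ("HaplotypeScore", ">", 13.0),
    ("MQRankSum", "<", -12.5),
    ("ReadPosRankSum", "<", -8.0),
]
INDEL_FILTERS = [
    ("QD", "<", 2.0),
    ("ReadPosRankSum", "<", -20.0),
    ("InbreedingCoeff", "<", -0.8),
    ("FS", ">", 200.0),
]


def failed_filters(info, filters):
    """Return the filter expressions that this site's annotations trip."""
    failed = []
    for name, op, cutoff in filters:
        value = info.get(name)
        if value is None:
            # Assumption for this sketch: a missing annotation trips nothing.
            continue
        if (op == "<" and value < cutoff) or (op == ">" and value > cutoff):
            failed.append(f"{name} {op} {cutoff}")
    return failed


# Hypothetical site annotations, for illustration only.
site = {"QD": 1.4, "MQ": 55.0, "FS": 75.2}
print(failed_filters(site, SNP_FILTERS))  # ['QD < 2.0', 'FS > 60.0']
```

Note that these tests look only at site-level annotations, which are shared by all 96 samples, so they say nothing about which individual samples actually carry the variant.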

Post edited by Geraldine_VdAuwera on

Best Answers

  • pdexheimer Posts: 373 Member, GSA Collaborator ✭✭✭
    Answer ✓

    Just because a site is listed in the VCF doesn't necessarily mean it's variant in (any of) the sample(s): a genotype of homozygous reference is perfectly valid. SelectVariants has an option for removing variants that are reference in all selected samples; you may want to try that.

  • Geraldine_VdAuwera Posts: 6,672 Administrator, GATK Developer
    Answer ✓

    Yes @bwubb, the GATK does maintain sample individuality when doing multisample calling. To further clarify what @pdexheimer is saying: the "same number of variants in each sample's VCF" thing is not a cause for alarm. When you call variants on a multisample dataset, any site that is called as variant in at least one sample is also genotyped in every other sample, even if it is hom ref in those samples. When you separate out the variant calls by sample, you will necessarily get the same number of records for every sample, but a large number of those may be hom ref. You can indeed filter those out if you want.


  • bwubb Posts: 50 Member

    This is still an issue for me. I was under the impression my method was in line with "Best Practices".

    I ran UG on a list of sample.bams to produce a multi-sample VCF. Next, I ran SelectVariants to pick out SNPs and indels for each sample into a sample.snp.vcf and a sample.indel.vcf, respectively. I then applied filter criteria (I even broke up || statements into individual filters).

    The big issue is that all of the VCF files have the same number of variants for each sample. I compared these results with old ones from GATK-1.3, where I ran UG individually on each sample.bam, and this definitely did not happen previously. I could go back to that, but I was under the impression that GATK was supposed to be able to handle a list of BAMs and maintain sample individuality. Am I incorrect? Thank you.

  • bwubb Posts: 50 Member

    Thank you both. When @pdexheimer posted his comment, the light-bulb clicked on in my brain. I just needed to re-run my scripts to confirm. Hopefully I get it right this time. Thanks again!
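
The point made in the two answers can be checked with a small script. The following is a minimal Python sketch, assuming a toy three-site VCF with hypothetical samples S1 to S3; the helper names `is_variant` and `count_sites` are made up for illustration and are not part of the GATK. It counts, for each sample, how many records are listed and how many of those carry at least one non-reference allele.

```python
# Toy multi-sample VCF: three sites, three hypothetical samples.
VCF_TEXT = """\
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\tS3
chr1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t0/0\t0/0
chr1\t200\t.\tC\tT\t60\tPASS\t.\tGT\t0/0\t1/1\t0/0
chr1\t300\t.\tG\tA\t70\tPASS\t.\tGT\t0/1\t0/0\t./.
"""


def is_variant(genotype):
    """True if the GT subfield has at least one non-reference allele."""
    gt = genotype.split(":")[0]
    alleles = gt.replace("|", "/").split("/")
    # "0" is the reference allele and "." is a no-call; neither is a variant.
    return any(a not in ("0", ".") for a in alleles)


def count_sites(vcf_text):
    """Return {sample: (records_listed, records_variant)}."""
    samples = []
    counts = {}
    for line in vcf_text.splitlines():
        if line.startswith("##") or not line.strip():
            continue  # skip meta-information and blank lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample columns start after FORMAT
            counts = {s: [0, 0] for s in samples}
            continue
        for sample, genotype in zip(samples, fields[9:]):
            counts[sample][0] += 1
            if is_variant(genotype):
                counts[sample][1] += 1
    return {s: tuple(c) for s, c in counts.items()}


print(count_sites(VCF_TEXT))
# {'S1': (3, 2), 'S2': (3, 1), 'S3': (3, 0)}
```

Every sample lists the same three records, yet the number of truly variant records differs per sample, which is the behaviour described above. In GATK releases of that era, the SelectVariants option for dropping the hom-ref records was, as far as I know, `--excludeNonVariants` (`-env`); check the documentation for the version in use.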
