Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Using Variant Filtration properly

nikhil_joshinikhil_joshi Member
edited June 2013 in Ask the GATK team

So I have tried using the Variant Recalibrator with my horse data, but it seems I don't have enough known SNPs for it to work properly, so I am going to go the Variant Filtration route. First I used vcftools to filter out the indels so I am left with only SNPs. Then, I ran Variant Filtration on my data using this command:

java -Xmx2g -jar /share/apps/gatk/GenomeAnalysisTK.jar -T VariantFiltration -R /data/horse/reference/eqcab2/eqCab2.all_chr.fa --variant horse.output.raw.snps_only.vcf.recode.vcf -o horse.output.filtered.snps_only.vcf --filterExpression "QD < 2.0 || MQ < 40.0 || FS > 60.0 || HaplotypeScore > 13.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filterName "horseSNPFilters"

After which I get this error for every line:

WARN 16:56:11,946 Interpreter - ![38,52]: 'QD < 2.0 || MQ < 40.0 || FS > 60.0 || HaplotypeScore > 13.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0;' undefined variable HaplotypeScore

So I tried various things like adding the HaplotypeScore annotation, and that didn't work. I then tried taking out that part of the filter, and I got this error for every line:

WARN 16:56:30,115 Interpreter - ![38,47]: 'QD < 2.0 || MQ < 40.0 || FS > 60.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0;' undefined variable MQRankSum

... which is very odd since the file clearly has a MQRankSum value in the INFO field for every line. I got the same error with ReadPosRankSum as well. When I was finally down to "QD < 2.0 || MQ < 40.0 || FS > 60.0", it ran without error, but it didn't seem like it filtered anything. So then I tried just using "QD < 2.0", which did nothing... I tried "QD < 50" which did nothing despite there clearly being SNPs with QD < 50. So my question is, what am I doing wrong and why isn't any filtering happening? Also, how do I add HaplotypeScore annotation to every line so that I can filter on it? Finally, this is a WGS experiment so it would also be nice to get some advice about the filtration parameters, because it seems like that according to the website, the above filter values are for exome sequencing, which my data is not. Any help is highly appreciated! Thanks!

  • Nik.

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Nik,

    Are you getting the warning for every single line or just some of them? Some annotations like MQRankSum can only be calculated for heterozygous calls (as noted in the tech docs) so it's expected that homs won't have the annotation.

    By the way, when you want to separate out SNPs from indels, you can use the GATK's SelectVariants tools and select on variant type (see doc).

    We don't currently provide hard filtering recommendations for WGS data, as we mostly work with exomes. You'll need to experiment with the parameters based on the exome recommendations, and figure out what works best for your data.

  • nikhil_joshinikhil_joshi Member

    Hi Geraldine,

    Thanks for your reply. So you are right... I mistakenly thought that MQRankSum occurred on every line, but it does not. However, the filtering still doesn't seem to be working.... I looked at the file and there is definitely a QD for every line, and I also made sure there were QDs above and below 30, and then I tried using a filter of "QD < 30.0". This didn't filter anything even though there are clearly SNPs that have a QD < 30.0 in the input file. The output file still has those SNPs. What am I doing wrong?

    • Nik.
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Nik,

    What do you mean by "didn't filter anything"? Do you see any PASS / FILTER annotations in the filter info field in the VCF lines? Keep in mind that filtering doesn't remove any variants from the file, it only annotates them. To actually select/remove variants, you use SelectVariants.

  • nikhil_joshinikhil_joshi Member

    So for example, here is the 2nd SNP line of the output vcf file after using "QD < 30.0" (as a test):

    chr8 79782250 . G A 571.25 PASS AC=5;AF=0.625;AN=8;BaseQRankSum=0.067;DP=21;FS=0.000;MLEAC=5;MLEAF=0.625;MQ=60.00;MQ0=0;MQRankSum=-0.762;QD=29.45;ReadPosRankSum=1.179 GT:AD:GQ:PL 1/1:0,4:12:201,12,0 0/1:1,4:12:209,0,12 0/0:7,0:18:0,18,156 1/1:0,5:12:200,12,0

    It says it PASSed the filter, but the QD is clearly less than 30. Here is the full command I used:

    java -Xmx2g -jar /share/apps/gatk/GenomeAnalysisTK.jar -T VariantFiltration -R /data/horse/reference/eqcab2/eqCab2.all_chr.fa --variant horse.output.raw.snps_only.withHaplo.vcf.recode.vcf -o horse.output.filtered.snps_only.withHaplo.vcf --filterExpression "QD < 30.0" --filterName "horseSNPFilters2"

    • Nik.
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Oh, I see -- that is weird. Are you not getting any variants at all marked FILTER? Can you tell me what version of GATK you're using?

  • nikhil_joshinikhil_joshi Member
    edited June 2013

    I think I found the problem and it is really strange.... probably a bug. And I am running the latest GATK, version v2.5-2-gf57256b. So when I run the command like this:

    java -Xmx2g -jar /share/apps/gatk/GenomeAnalysisTK.jar -T VariantFiltration -R /data/horse/reference/eqcab2/eqCab2.all_chr.fa --variant horse.output.raw.snps_only.vcf.recode.vcf -o horse.output.filtered.snps_only.vcf --filterExpression "QD < 30.0" --filterName "horseSNPFiltersQD"

    I get the correct output for this SNP:

    chr8 79782452 . G T 357.12 horseSNPFiltersQD AC=5;AF=0.625;AN=8;BaseQRankSum=-0.457;ClippingRankSum=-2.024;DP=31;FS=0.000;MLEAC=5;MLEAF=0.625;MQ=59.56;MQ0=0;MQRankSum=0.849;QD=27.47;ReadPosRankSum=-1.501 GT:AD:GQ:PL 1/1:0,3:18:162,18,0 0/1:9,2:30:30,0,160 0/0:10,0:24:0,24,208 1/1:0,7:21:205,21,0

    However, if I run the exact same command except changing the "30.0" to a "30" (yes, really), like this:

    java -Xmx2g -jar /share/apps/gatk/GenomeAnalysisTK.jar -T VariantFiltration -R /data/horse/reference/eqcab2/eqCab2.all_chr.fa --variant horse.output.raw.snps_only.vcf.recode.vcf -o horse.output.filtered.snps_only.vcf --filterExpression "QD < 30" --filterName "horseSNPFiltersQD"

    then I get this output for that same SNP:

    chr8 79782452 . G T 357.12 PASS AC=5;AF=0.625;AN=8;BaseQRankSum=-0.457;ClippingRankSum=-2.024;DP=31;FS=0.000;MLEAC=5;MLEAF=0.625;MQ=59.56;MQ0=0;MQRankSum=0.849;QD=27.47;ReadPosRankSum=-1.501 GT:AD:GQ:PL 1/1:0,3:18:162,18,0 0/1:9,2:30:30,0,160 0/0:10,0:24:0,24,208 1/1:0,7:21:205,21,0

    That seems somewhat ridiculous. I almost feel like I'm doing something wrong, but I tried it multiple times and the change is evident. I'm guessing it is a bug?

    • Nik.
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah no, in fact that's the expected behavior. As stated in the documentation on JEXL expressions, you have to pass in the correct variable type, e.g. if you're filtering on an annotation that has decimal values, you have to write the filtering value as a decimal in your filter expression, not as an integer. Otherwise the value comparison isn't made properly.

    I didn't pick this up as the cause of your problem because in your previous post you wrote that your expression was "QD < 30.0", written correctly. This is why it's important to post the exact command you ran, not an approximation, otherwise I don't have the necessary information to diagnose the problem.

  • nikhil_joshinikhil_joshi Member

    Hi Geraldine,

    So I must have copied a command that I tried somewhere in my command line... however, it seems to me that if JEXL has a problem with 30 vs. 30.0, that is a problem in JEXL. I'm a programmer and that doesn't seem like good coding. In any case, I looked at this page and on there it says that Java will throw an exception if the type of the value is different from the type in the expression. However, when I ran it using an integer (rather than a floating point) value, it didn't complain at all. It ran as if nothing was wrong. At the very least, shouldn't VariantFiltration give a warning or an error if that happens?

    • Nik.
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmm, it used to be the normal behavior to error out on the type mismatch. Not sure what changed -- we're drawing on external libraries for much of the JEXL stuff. I'll see if we want to put in our own check for type mismatches.

  • angezouangezou Member

    Just with regards to the comment below, what should I look for in order to find the best parameters to hard filter the variants? I am fairly new to WGS and I'm not sure how to experiment with variant filtration and how to adjust the parameters accordingly. Thanks!

    @Geraldine_VdAuwera said:

    We don't currently provide hard filtering recommendations for WGS data, as we mostly work with exomes. You'll need to experiment with the parameters based on the exome recommendations, and figure out what works best for your data.

  • SheilaSheila Broad InstituteMember, Broadie admin

    @angezou
    Hi,

    You can plot the annotations from the hard filtering recommendations and see what the distributions look like. I have a document in progress that describes more in detail how to do this, but it is not quite complete yet. Have a look at Geraldine's answer from January 7th in this thread.

    -Sheila

Sign In or Register to comment.