We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

VariantFiltration: how to filter samples where less then 95% of reads agree with the called genotype

I've been looking over the documentation for VariantFilteration and jexl (http://gatkforums.broadinstitute.org/gatk/discussion/1255/using-jexl-to-apply-hard-filters-or-select-variants-based-on-annotation-values) to figure out how to do this, but I can seem to find an answer.

I would like to filter snips where there are many reads that disagree with the called genotype (eg FORMAT 1:20,60:80:99:1900,0).

In pseudo code, I'd like to write something like "AD[GT]/DP<0.95", where the allelic depth (AD) for the called genotype (GT) divided by the total depth (DP) < 0.95. However, the docs indicate that it is not possible to access GT:
"For now, you can filter based on most fields as normal (e.g. GQ < 5.0), but the GT (genotype) field is an exception"

Is there another way to accomplish what I want using VariantFiltration? It seems like a common sense filter, and it's also something I frequently see in the literature for snips in haploid organisms.

Best Answer


  • rsleerslee Member

    Hi there,

    I'm hoping someone can advise on this. I've been trying to do something very similar to the above, except selecting only sites with AD_ALT / DP > 0.9. Only variant sites are in my vcf.

    However, using the following code, I find that only SNP loci that have NO reference alleles present are exported - despite the fact that I have many sites where the reference allele is present but in under 10% of the reads.

    java -Xmx2g -jar GenomeAnalysisTK.jar -T SelectVariants -R ref.fa -V SNPs_only.vcf -select 'vc.getGenotype("sample").getAD().1 / vc.getGenotype("sample").getDP() > 0.9' -o filtered.vcf

    My JEXL expression seems in line with what's above, and with the example here https://gatkforums.broadinstitute.org/gatk/discussion/comment/28654#Comment_28654, so I'm not sure what's happening.

    Thank you in advance!


  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭

    Hi R,

    Yea, I am not sure what is happening here. Your command looks fine. Which version of GATK are you using? Can you try with some other values than 0.9? Does that change the result and how?


  • The problem is that Java treats the genotype fields as integers. Therefore, it thinks you are dividing two integers, and performing a/b where b>a will always result in a value of 0. You need to multiply the numerator and denominator by 1.0 to convert the integers into real numbers. Then, Java will work as you want it to. The other issue is that the getAD().1 suffix means take the allelic depth of the first alternate allele. However, there is no guarantee that the genotype of your sample is that of the first alternate allele. Furthermore, dividing by getDP() is not correct since there is no guarantee that the genotype depth will equal the sum of the allelic depths (if reads are uninformative, they count towards the genotype depth, but not the allelic depth).

    I do not think there is any way to do exactly what you want, unless there is some guaranteed way to extract out the allelic depth of the alternate allele that matches the genotype. In general, when I am filtering in this way, what I have instead done is export the VCF to table format, do the filtering on the table, export the positions of the filtered positions to a bed file, and then filter on that bed file using the --mask function in GATK VariantFiltration. However, even this is not perfect since filtering on bed files is done on any positions overlapping the positions in the bed file and not just those that start at the position (this can be an issue if there are deletions in your VCF file). An additional line of code needs to be run to remove the filter from those positions that overlap the bed file, but do not start at positions in the bed file.
Sign In or Register to comment.