To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Recalculate AF, AC and AN after applying sample/genotype-level filtering

xiaolicbsxiaolicbs Broad InstituteMember
edited March 2016 in Ask the GATK team

Dear GATK team,

I am trying to use VariantFiltration and SelectVariants to mask individual calls with GQ < 20 or DP < 10 in my VCF; however, as I succeed masking individuals who didn't pass the QC with './.', the AC, AN and AF in the INFO field stayed the same. Just want to make sure that I did everything correctly:

I first use VariantFiltration to mark filtered individuals (I am using GATK version: 3.5-0-g36282e4):

java -jar GenomeAnalysisTK.jar -R hg19.fasta -V test.vcf -o test.vf.vcf -G_filter 'GQ<20' --genotypeFilterName 'lowGQ' -G_filter 'DP<10' --genotypeFilterName 'lowDP' -V test.vcf -o test.out1.vcf

and then use SelectVariants to set those filtered individuals:

java -jar GenomeAnalysisTK.jar -T SelectVariants --removeUnusedAlternates --excludeNonVariants --setFilteredGtToNocall -V test.out1.vcf -o test.out2.vcf

Here is a toy example which might help you debug:

the original input was:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 1 1000 . A T 2000 PASS AC=3;AF=0.375;AN=8 GT:DP:GQ 0/0:20:30 0/1:20:30 0/1:20:30 0/1:5:10

After VariantFiltration:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 1 1000 . A T 2000 PASS AC=3;AF=0.375;AN=8 GT:DP:FT:GQ 0/0:20:PASS:30 0/1:20:PASS:30 0/1:20:PASS:30 0/1:5:lowDP;lowGQ:10

After SelectVariants:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 1 1000 . A T 2000 PASS AC=3;AF=0.375;AN=8 GT:DP:FT:GQ 0/0:20:PASS:30 0/1:20:PASS:30 0/1:20:PASS:30 ./.:5:lowDP;lowGQ:10

It seems to me that selectVariants should update the AC, AF and AN, as after we masking the individual's genotypes. I am not sure if I am doing this with the correct command. I've been reading previous threads about variantFiltration and selectVariants on individual genotypes, but not find cases like this.

Hope to hear from you. Thanks a lot.

Xiao

Answers

  • xiaolicbsxiaolicbs Broad InstituteMember
    edited March 2016

    The version I've being using is: 3.5-0-g36282e4

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @xiaolicbs
    Hi Xiao,

    Are you asking for the AC, AF, and AN fields to be ./.? Or, do you simply want to remove the entire record from the VCF?

    Thanks,
    Sheila

  • xiaolicbsxiaolicbs Broad InstituteMember
    edited March 2016

    Hi Sheila,

    Sorry I should make it more clear. I am not asking for AC, AN and AF fields to be ./.. I am hoping selectVariants to recalculate the AC, AN and AF after applying filtering on specific genotype/sample. The issue is currently all INFO fields stayed the same, making the genotype fields inconsistent with INFO.

    In the toy case that I showed, the correct AC, AN and AF field should be:

    AC=2;AF=0.333333;AN=6

    Does that make sense to you?

    Xiao

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @xiaolicbs Unfortunately what you are asking for is not currently possible, but I agree it leads to inconsistency. If you can submit a test case we can see if it might be possible to implement an option to recalculate these genotype-specific values. See https://www.broadinstitute.org/gatk/guide/article?id=1894 for instructions on submitting a test snippet.

  • xiaolicbsxiaolicbs Broad InstituteMember

    @Geraldine_VdAuwera Thanks for considering solve this problem. There are lots of use cases, such as doing variant QC for eQTL / GWAS analysis, where people want to apply genotype-specific filters for further downstream analysis.

    Currently I write a code to fix the INFO/ALT fields, but it gets complicated for multi-allelic sites, where after genotype-specific filtering, one of the alt alleles are missing. In that case we need to fix allele specific annotations, such as AD, PL, PID and PGT. Now I just discard PL, PID and PGT field (if applicable) for simplicity, but it would be good to do it completely right.

    Another issue is, when there are InDels spanning SNPs, after GQ/DP filtering (or sample subsetting), the SNP is removed. There will be a record with ALT field asterisk mark

    BTW, for PL, could you tell me how it was ordered for multi allelic site? Let's say we have three alleles, one ref and two alt, what's the orders for PL? Are they 0/0, 0/1, 0/2, 1/1, 1/2, 2/2?

    I will be generating few snippets where there are needs to remove alleles after genotype/sample specific filtering.

    Thanks a lot.

    Xiao

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @xiaolicbs
    Hi Xiao,

    Have a look at the VCF spec for how the PL field is ordered. You will have to look at the description of the GL field under "1.4.2 Genotype fields".

    For your example, the PLs are in this order:

    0/0,0/1,1/1,0/2,1/2,2/2.

    -Sheila

  • xiaolicbsxiaolicbs Broad InstituteMember

    @Sheila said:
    @xiaolicbs
    Hi Xiao,

    Have a look at the VCF spec for how the PL field is ordered. You will have to look at the description of the GL field under "1.4.2 Genotype fields".

    For your example, the PLs are in this order:

    0/0,0/1,1/1,0/2,1/2,2/2.

    -Sheila

    Thanks Sheila, that helps a lot. :)

  • xiaolicbsxiaolicbs Broad InstituteMember

    @Geraldine_VdAuwera said:
    @xiaolicbs Unfortunately what you are asking for is not currently possible, but I agree it leads to inconsistency. If you can submit a test case we can see if it might be possible to implement an option to recalculate these genotype-specific values. See https://www.broadinstitute.org/gatk/guide/article?id=1894 for instructions on submitting a test snippet.

    Hi Geraldine,

    I have uploaded a file named: Xiao_SelectVariantsNotUpdateINFO.tar.gz to the ftp you pointed me to. In the file, you can find three vcfs, toy.vcf, toy.VariantFiltration.vcf, toy.selectVariants.vcf, which correspond to input, output of VariantFiltration and SelectVariants step. There is also a shell script which contained the GATK command that I used to generate those three files.

    Let me know if you need anything. Hope this is helpful fixing the bug.

    Xiao

    Issue · Github
    by Sheila

    Issue Number
    724
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • @xiaolicbs I just saw this post, and wanted to let you know that we are also experiencing the same issue. The way we've worked around it is through a call to VariantAnnotator after the SelectVariants call. Our general workflow is as follows:

    java -jar GATK.jar -T VariantFiltration -V in.vcf -o filtered.vcf ... java -jar GATK.jar -T SelectVariants -V filtered.vcf -o masked .vcf --setFilteredGtToNocall -trimAlts ... java -jar GATK.jar -T VariantAnnotator -V masked.vcf -o masked.anno.vcf -A ChromosomeCounts ...

    We've also noted that the DP annotation is not recalculated with SelectVariants. We're looking at adding the Coverage annotation in the VariantAnnotator, but if that requires the raw BAM files, we should be able to use the DepthPerAlleleBySample annotation to get something approximating the DP.

  • xiaolicbsxiaolicbs Broad InstituteMember

    @johnwallace123 said:
    @xiaolicbs I just saw this post, and wanted to let you know that we are also experiencing the same issue. The way we've worked around it is through a call to VariantAnnotator after the SelectVariants call. Our general workflow is as follows:

    java -jar GATK.jar -T VariantFiltration -V in.vcf -o filtered.vcf ... java -jar GATK.jar -T SelectVariants -V filtered.vcf -o masked .vcf --setFilteredGtToNocall -trimAlts ... java -jar GATK.jar -T VariantAnnotator -V masked.vcf -o masked.anno.vcf -A ChromosomeCounts ...

    We've also noted that the DP annotation is not recalculated with SelectVariants. We're looking at adding the Coverage annotation in the VariantAnnotator, but if that requires the raw BAM files, we should be able to use the DepthPerAlleleBySample annotation to get something approximating the DP.

    Hi John,

    Thanks for helping out. I looked into the VariantAnnotator, and seems like AF, AC and AN are not part of VCF INFO annotation that GATK could help generate. I tested out and seemed like it cannot update AF, AC and AN fields. But there seems other INFO annotation that would help you use case. I am trying to write my own code to solve this issue, but it became complicated when involving removal of deletions with SNPs inside its interval (the '*' cases). Thanks for help anyway. :)

    Xiao

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @xiaolicbs
    Hi Xiao,

    I have uploaded a bug report. You can keep track of it here.

    -Sheila

  • AmandaAmanda North CarolinaMember

    Hi Everyone,
    It seems that I also am running into this same bug. I'm also trying to filter using AN, but it is not being recalculated as everyone else has mentioned. This seems like a giant oversight in the pipeline as it would be the only way to involve a filter for missing data which is important for mapping and other population based studies, particularly where you cannot use the variant re-calibration.

    Xiao - if you happen to have a way of recalculating this, please let me know, I'd be certainly grateful if you could share.

    Amanda

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi everyone, thanks for letting us know that this is such a big problem to you all. Considering how "popular" this issue is turning out to be, we're going to try to prioritize it. We'll update this thread when we have news.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Amanda @xiaolicbs @johnwallace123
    Hi everyone,

    A fix is in! You should be able to use the latest nightly build that contains the fix :smile:

    -Sheila

  • AmandaAmanda North CarolinaMember

    Hello Sheila,
    Can you please clarify, will INFO field now be recalculated in both SelectVariants and VariantFiltration steps? Thank you for fixing this issue, I was just coming back to see if there was a way around it, so very happy to see that you were able to fix the bug.
    Amanda

  • everestial007everestial007 GreensboroMember

    It looks like there is still issues with updating of the INFO field values. I am using GATK 3.7 and 3.8 and this issue still exits.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @everestial007
    Hi,

    Can you tell us the exact command you ran and post some example outputs that are incorrect?

    Also, does this happen in GATK 4.0.0.0?

    Thanks,
    Sheila

Sign In or Register to comment.