Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Incorrect AD values in HC-called vcf and combined gvcf

seruseru BergenMember ✭✭

Hi,

I am using GATK v3.2.2 following the recommended practices (...HC -> CombineGVCFs -> GenotypeGVCFs ...) and while looking through suspicious variants I came across a few hetz with AD=X,0. Tracing them back I found two inconsistencies (bugs?);

1) Reordering of genotypes when combining gvcfs while the AD values are kept intact, which leads to an erronous AD for a heterozygous call. Also, I find it hard to understand why the 1bp insertion is emitted in the gvcf - there is no reads supporting it:

  • single sample gvcf
    1 26707944 . A AG,G,<NON_REF> 903.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/2:66,0,36,0:102:99:1057,1039,4115,0,2052,1856,941,3051,1925,2847:51,15,27,9

  • combined gvcf
    1 26707944 . A G,AG,<NON_REF> . . [INFO] GT:AD:DP:MIN_DP:PL:SB [other_samples] ./.:66,0,36,0:102:.:1057,0,1856,1039,2052,4115,941,1925,3051,2847:51,15,27,9 [other_samples]

  • vcf
    1 26707944 . A G 3169.63 . [INFO] [other_samples] 0/1:66,0:102:99:1057,0,1856 [other_samples]

2) Incorrect AD is taken while genotyping gvcf files:

  • single sample gvcf:
    1 1247185 rs142783360 AG A,<NON_REF> 577.73 . [INFO] GT:AD:DP:GQ:PL:SB 0/1:13,20,0:33:99:615,0,361,654,421,1075:7,6,17,3
  • combined gvcf
    1 1247185 rs142783360 AG A,<NON_REF> . . [INFO] [other_samples] ./.:13,20,0:33:.:615,0,361,654,421,1075:7,6,17,3 [other_samples]

  • vcf
    1 1247185 . AG A 569.95 . [INFO] [other_samples] 0/1:13,0:33:99:615,0,361 [other_samples]

I have found multiple such cases here, and no errors nor warnings in the logs. I checked also with calls that I had done before on these samples, but in a smaller batch. There the AD values were correct, but there were plenty of other hetz with AD=X,0... I haven't looked closer into those.

Are these bugs that have been fixed in 3.3? Or maybe my brain is not working properly today and I miss sth obvious?

Best regards,
Paweł

Comments

  • SheilaSheila Broad InstituteMember, Broadie admin

    @seru‌

    Hi Pawel,

    We had a similar issue that was fixed in version 3.3. This is not exactly the same issue, but can you please try with the latest version and see if it works?

    If not, I will need you to submit snippets of your files.

    Thanks,
    Sheila

  • seruseru BergenMember ✭✭

    I tried GATK 3.3 now and can report the following. For the first problem (reordering of AD caused by erronous gvcf entry) the strange record in gvcf is gone after switching to latest GATK:
    $ gatk2 -T HaplotypeCaller -I $BAM -L 1:26706944-26708944 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -minPruning 4 -stand_call_conf 50.0 2>/dev/null | grep 26707944
    1 26707944 . A AG,G,<NON_REF> 903.73 . BaseQRankSum=2.706;ClippingRankSum=-1.803;DP=113;MLEAC=0,1,0;MLEAF=0.00,0.500,0.00;MQ=53.59;MQ0=0;MQRankSum=1.978;ReadPosRankSum=1.835 GT:AD:DP:GQ:PL:SB 0/2:66,0,36,0:102:99:1057,1039,4115,0,2052,1856,941,3051,1925,2847:51,15,27,9

    $ gatk3 -T HaplotypeCaller -I $BAM -L 1:26706944-26708944 --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 -minPruning 4 -stand_call_conf 50.0 2>/dev/null | grep 26707944
    1 26707944 . A G,<NON_REF> 1071.77 . BaseQRankSum=2.402;ClippingRankSum=-1.495;DP=108;MLEAC=1,0;MLEAF=0.500,0.00;MQ=55.16;MQ0=0;MQRankSum=1.603;ReadPosRankSum=-0.192 GT:AD:DP:GQ:PL:SB 0/1:58,48,0:106:99:1100,0,1404,1271,1543,2814:43,15,36,12

    I assume it solves the problem down the line. Note that I tried only this particular variant, but hope that it would work for other similar cases as well. I also noticed that the depth has changed between versions of GATK, so I am guessing there has been more substantial changes in the HC algorithm...

    The second problem (incorrect AD after haplotypecalling) is also gone, but only for this particular variant. I can still see similar issue for other variants, e.g.
    gvcf record like this
    1 1159182 . G C,<NON_REF> . . [INFO] GT:AD:DP:MIN_DP:PL:SB [other_samples] ./.:21,10,0:31:.:354,0,2602,420,2632,3052:11,10,10,0 [other_samples]
    turned into
    1 1159182 . G C 26279.18 . [INFO] GT:AD:DP:GQ:PL [other_samples] 0/1:10,0:10:54:54,0,1158 [other_samples]

    Here, AD from position 1 is lost (for the variant where gatk 3.3 resolved the problem, AD from position 2 was skipped/missed). I find these problematic calls looking for heterozygous calls with 0 alt allele depth, e.g. "0/1:[0-9][0-9],0:". You should be able to identify them too, it is not seldom it happens. If you can't reproduce it though, I have isolated single-record gvcf files for two samples that alone can cause the reported problem (with GATK 3.3-0).

    Best,
    Paweł

  • SheilaSheila Broad InstituteMember, Broadie admin

    @seru
    Hi Pawel,

    Please do submit a bug report, so we can debug locally. http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • seruseru BergenMember ✭✭

    Hi,

    I have uploaded discussion5088_bug.tar.gz file.

    Best, Paweł

  • SheilaSheila Broad InstituteMember, Broadie admin

    @seru
    Hi Pawel,

    I have to apologize. I did not get to your bug until now, and because we did some spring cleaning of the server, your uploaded files are gone. Please upload them again.

    Sorry for the inconvenience. I will process it as soon as you confirm the upload.

    Thanks,
    Sheila

  • seruseru BergenMember ✭✭

    Hi,

    I have uploaded the file again. The name is the same: discussion5088_bug.tar.gz

    Cheers,
    Paweł

  • SheilaSheila Broad InstituteMember, Broadie admin

    @seru
    Hi Pawel,

    Thanks. I see you uploaded the GVCFs and the final vcf. Can you please upload snippets of the bam files? You can just use Print Reads to get snippets for the two samples. I need about 500 bases before and after the site of interest.

    Sorry for the hassle!

    -Sheila

  • seruseru BergenMember ✭✭

    Sorry, I haven't gotten any notification about Developments in this thread. Do you still have access to the uploaded archive? I could look up the bam file then

    Pawel

  • SheilaSheila Broad InstituteMember, Broadie admin

    @seru
    Hi Pawel,

    Yes, I still have your original discussion5088_bug.tar.gz files. I am just missing the bam files. If you can upload those either as a new .gz file or simply re-upload the same .gz file with the bam files, I can proceed with the bug report.

    However, before you do upload anything, can you please test out if the issue still occurs with the latest version? There have been a few versions of GATK since we last discussed this issue.

    Thanks,
    Sheila

  • seruseru BergenMember ✭✭

    Yepp, I will try to dig out the data and test with 3.5.

  • seruseru BergenMember ✭✭

    Hi again,

    I have the files ready, but your FTP seems to be under high traffic and refuses connection (max 20 users), so I am uploading it here.
    Let me know if you have any questions.

    Paweł

    Issue · Github
    by Sheila

    Issue Number
    482
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • SheilaSheila Broad InstituteMember, Broadie admin

    @seru
    Hi Pawel,

    I got the file. Thanks.

    I will process the bug report soon.

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie admin

    @seru
    Hi Pawel,

    I just tested your files on position 1:1159182, and it seems like everything is working fine with the latest version. The AD values are correctly transferred from the GVCF to the VCF. However, I would like to point out that I don't think the variant is real. If you look at the original bam file and bamout file, only the forward reads carry the C variant. It looks like the sequencer has made a mistake within the repeat region, and I suspect the site will be filtered out by VQSR.

    -Sheila

  • seruseru BergenMember ✭✭

    Hi again,

    Thank you for looking into this. Yes, the variant is clearly in a problematic region and the call may well be an error. But this is not so important. The AD field is transferred properly from GVCF indeed, in both 3.3 and 3.5. I don't know now how I generated the GVCF quoted in the initial bug report, but if this was using version 3.2, the case is closed.

    However, what is still surprising me is a call 0/1 with 0 reads supporting the ALT allele (AD=10,0). How is that explained?

    Paweł

  • SheilaSheila Broad InstituteMember, Broadie admin

    @seru
    Hi Pawel,

    When I get the final VCF from GenotypeGVCFs, the AD field shows 10,5, not 10,0. Perhaps that was also fixed in the latest version. Are you using the latest version (3.5)?

    -Sheila

Sign In or Register to comment.