Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

GenotypeGVCFs wrongly genotypes positions spanning deletions

When we run GenotypeGVCFs without the combine step, we are getting wrong genotypes for positions spanning deletions. We are also missing the AD values for the deletion allele. For example see these calls:

1   6529182 rs375111412 TTCCTCC T,TTCC  7506.41 .   AC=1,1;AF=0.009615,0.221;AN=4;BaseQRankSum=0.096;ClippingRankSum=0.292;DB;DP=3437;ExcessHet=18.9343;FS=0;InbreedingCoeff=-0.2993;MLEAC=1,23;MLEAF=0.009615,0.221;MQ=60.04;MQRankSum=0.042;QD=4.27;ReadPosRankSum=0.585;SOR=0.674    GT:AD:DP:GQ:PGT:PID:PL  0/2:165,4,22:191:99:.:.:408,769,11860,0,7242,6888   0/1:24,38,2:64:99:.:.:1513,0,1384,1504,993,2521
1   6529183 .   T   .   .   .   AN=4;DP=3786    GT:AD:DP:PGT:PID:RGQ    0/0:165:191:.:.:99  0/0:24:64:.:.:99

We are using GATK 3.5. We had 52 samples in this VCF and sliced it with bcftools for easier reading.

This is the same output if we do run CombineGVCFs on the individual gVCF files first.

1   6529182 rs375111412 TTCCTCC TTCC,T  7506.41 .   AC=1,1;AF=0.221,0.009615;AN=4;BaseQRankSum=0.096;ClippingRankSum=0.292;DB;DP=3437;ExcessHet=18.9343;FS=0;InbreedingCoeff=-0.2993;MLEAC=23,1;MLEAF=0.221,0.009615;MQ=60.04;MQRankSum=0.042;QD=4.27;ReadPosRankSum=0.585;SOR=0.674    GT:AD:DP:GQ:PGT:PID:PL  0/1:165,22,4:191:99:.:.:408,0,6888,769,7242,11860   0/2:24,2,38:64:99:.:.:1513,1504,2521,0,993,1384
1   6529183 .   T   *   .   .   AC=0;AF=0;AN=4;DP=3786  GT:AD:DP:PGT:PID:RGQ    0/0:165,22:191:.:.:99   0/0:24,38:64:.:.:99

Should GenotypeGVCFs properly genotype positions spanning deletions? Or is running CombineGVCFs a required step?

Thanks,
Carlos

Tagged:

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Carlos, running CombineGVCFs should not be required, and GenotypeGVCFs should do the right thing. Would you be able to put together a test case and share the data with us so we can troubleshoot this locally? Instructions for submitting a bug report are here: https://www.broadinstitute.org/gatk/guide/article?id=1894

  • CarlosBorrotoCarlosBorroto Member ✭✭

    Hi @Geraldine_VdAuwera, thanks a lot for looking into this issue. I uploaded test files I think will be sufficient to reproduce this issue. Please look for file 6678.zip.

    Also, we noticed another related issue. For the case where we run combine then genotype. While we get the deletion label, and the allele counts are mostly correct*, the genotype calls themselves are not. We added another sample to the files we uploaded to better show this. Please see the uploaded files and this snippet:

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  sample01    sample02    sample03
    1   6529182 .   TTCCTCC T,TTCC  1889.16 .   AC=1,1;AF=0.167,0.167;AN=6;BaseQRankSum=1.75;ClippingRankSum=0.312;DP=324;ExcessHet=3.01;FS=0.000;MLEAC=1,1;MLEAF=0.167,0.167;MQRankSum=0.540;QD=7.41;RAW_MQ=768100;ReadPosRankSum=2.02;SOR=0.709   GT:AD:DP:GQ:PL  0/0:43,0,0:43:89:0,89,984,89,984,984    0/2:165,4,22:191:99:408,769,11860,0,7242,6888   0/1:24,38,2:64:99:1513,0,1384,1504,993,2521
    1   6529183 .   T   *   .   .   AC=0;AF=0.00;AN=6;DP=330    GT:AD:DP:RGQ    0/0:49,0:49:76  0/0:165,22:191:99   0/0:24,38:64:99
    1   6529184 .   C   *   .   .   AC=0;AF=0.00;AN=6;DP=332    GT:AD:DP:RGQ    0/0:51,0:51:82  0/0:165,22:191:99   0/0:24,38:64:99
    1   6529185 .   C   *   .   .   AC=0;AF=0.00;AN=6;DP=320    GT:AD:DP:RGQ    0/0:39,0:39:76  0/0:165,22:191:99   0/0:24,38:64:99
    1   6529186 .   T   *   .   .   AC=0;AF=0.00;AN=6;DP=320    GT:AD:DP:RGQ    0/0:39,0:39:76  0/0:165,22:191:99   0/0:24,38:64:99
    1   6529187 .   C   *   .   .   AC=0;AF=0.00;AN=6;DP=324    GT:AD:DP:RGQ    0/0:43,0:43:99  0/0:165,22:191:99   0/0:24,38:64:99
    1   6529188 .   C   *   .   .   AC=0;AF=0.00;AN=6;DP=324    GT:AD:DP:RGQ    0/0:43,0:43:99  0/0:165,22:191:99   0/0:24,38:64:99
    

    Here we believe for the positions with the deletion label sample02 and sample03 should be 0/1 and not 0/0.

    This is why we say the allele counts are *mostly correct. For sample03 the deletion position has allele counts 165,4,22. We believe then the first 3 positions after should have 187,4 follow by 3 more with 165,26. Something similar can be said for sample02. However we think this is minor issue and should probably be discussed in a separated post.

    Best,
    Carlos

    Issue · Github
    by Sheila

    Issue Number
    460
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    chandrans
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thanks Carlos, we'll try to look into this very soon.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @CarlosBorroto
    Hi Carlos,

    I am unable to reproduce the issue with the latest version. I get the correct genotypes and AD values for position 1:6529182. However, position 1:6529183 is not present in any of the final VCFs. Perhaps this is because there are not enough samples in the bug report to emit the site?

    -Sheila

  • CarlosBorrotoCarlosBorroto Member ✭✭

    Hi @Sheila,

    Thanks for looking into this issue. It seems there is a miscommunications. Let me try to focus this report into one issue only and put it as clear as possible.

    Using the files I uploaded if you run this command (assuming the jar and reference are in the same path):

    java -Xmx4g -jar GenomeAnalysisTK.jar -T GenotypeGVCFs -R human_g1k_v37.fasta -V sample01.g.vcf -V sample02.g.vcf -V sample03.g.vcf -L 1:6529091-6529311 -writeFullFormat -allSites -o samples.genotyped.vcf
    

    You get this call for position 1:6529183:

    1   6529183 .   T   .   .   .   AN=6;DP=330 GT:AD:DP:RGQ    0/0:.:49:76 0/0:165:191:99  0/0:24:64:99
    

    We expect something more like:

    1   6529183 .   T   *   .   .   AC=0;AF=0.00;AN=6;DP=330    GT:AD:DP:RGQ    0/0:49,0:49:76  0/1:165,22:191:99   0/1:24,38:64:99
    

    I'm leaving RGQ untouched as I don't know how the PL would look if present at all. The main point we want to make here is how GTs and ADs are wrong for samples 2 and 3.

    I hope this make the issue we are reporting more defined and easier to reproduce.

    Please let us know if you need more details.

    Thanks,
    Carlos

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @CarlosBorroto
    Hi Carlos,

    I see what you are saying. I forgot to add the -allSites argument.

    I am about to put in a bug report. You can track it here.

    Thanks
    Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @CarlosBorroto
    Hi Carlos,

    A quick update: The team will have a meeting to discuss this in a couple of weeks. Fixes will start shortly after.

    -Sheila

  • CarlosBorrotoCarlosBorroto Member ✭✭

    Great to hear, thanks!

Sign In or Register to comment.