Missing bases when running genotypeGVCFs

BAGBAG TorontoMember
edited August 2014 in Ask the GATK team

Hello,
We have been using GenotypeGVCFs to combine many GVCFs into one file. When doing this we find that a very small fraction of bases are missing in the final output, on the order of about 40 bases per 10 kb. We have been unable to determine why. The GVCFs that are being combined were created using Haplotype Caller on emit all sites, and they contain all of the bases, including the ones that are missing from the combined GVCF at the end. There does not seem to be any default filtering or other settings for GenotypeGVCFs that could be causing this. It does seem that some of the missing sites are associated with indels. Any ideas why we are having this issue? Does it have to do with the recalling of sites from the combined data? This seems similar to previoius posts on combining GVCFs, but not quite the same.

Combined GVCF, missing base 51:

1 49 . A . . . DP=278;NCC=34 GT:AD:DP 0/0:13:13 0/0:8:8 0/0:9:9 0/0:15:15 0/0:19:19 0/0:10:10 0/0:9:9 0/0:6:6 0/0:10:12 0/0:3:3 0/0:9:9 0/0:4:40/0:1:1 0/0:3:3 0/0:10:10 0/0:4:4 0/0:5:5 0/0:10:11 0/0:15:15 0/0:8:8 0/0:7:7 0/0:12:14 0/0:3:3 0/0:7:7 0/0:7:7 0/0:5:5 0/0:8:8 0/0:4:4 0/0:9:9 0/0:13:13 0/0:5:5 0/0:6:8 0/0:11:11 0/0:3:3

1 50 . A . . . DP=282;NCC=34 GT:AD:DP 0/0:13:13 0/0:8:8 0/0:9:9 0/0:14:15 0/0:19:19 0/0:10:10 0/0:9:9 0/0:6:6 0/0:10:14 0/0:2:2 0/0:8:8 0/0:4:40/0:1:1 0/0:3:3 0/0:10:10 0/0:4:4 0/0:6:6 0/0:11:14 0/0:15:15 0/0:8:8 0/0:7:7 0/0:13:15 0/0:2:2 0/0:6:6 0/0:7:7 0/0:5:5 0/0:8:8 0/0:4:4 0/0:9:9 0/0:13:13 0/0:5:5 0/0:7:9 0/0:11:11 0/0:3:3

1 52 . T . . . DP=285;NCC=34 GT:AD:DP 0/0:14:15 0/0:5:6 0/0:8:9 0/0:14:15 0/0:15:15 0/0:9:10 0/0:10:10 0/0:6:6 0/0:5:14 0/0:2:3 0/0:7:80/0:4:4 0/0:1:1 0/0:4:4 0/0:10:11 0/0:2:4 0/0:6:6 0/0:9:13 0/0:13:15 0/0:4:9 0/0:7:7 0/0:4:15 0/0:0:2 0/0:7:8 0/0:6:6 0/0:4:5 0/0:8:8 0/0:4:4 0/0:9:9 0/0:12:13 0/0:5:5 0/0:4:9 0/0:11:11 0/0:5:5

But the individual gvcfs (34 of them) have this base, one individual with a large deletion:
for file in *_cmb.vcf; do head -82 $file | tail -1 ;done
1 51 . A . . . GT:AD:DP:GQ:PL 0/0:13,1:14:19:0,19,408

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:14,1:15:2:0,2,445

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:16,0:16:30:0,30,450

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:5,1:6:0:0,0,203

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:9,1:10:0:0,0,220

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:9,0:9:24:0,24,360

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:6,0:6:18:0,18,220

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:5,10:15:0:0,0,0

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:3,1:4:0:0,0,86

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:9,1:10:5:0,5,266

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:7,0:7:21:0,21,218

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:4,0:4:12:0,12,177

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:10,1:11:7:0,8,298

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:2,2:4:0:0,0,15

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:6,0:6:15:0,15,225

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:11,3:14:0:0,0,265

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:14,1:15:0:0,0,428

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:5,3:8:0:0,0,85

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:1,0:1:3:0,3,46

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:3,1:4:0:0,0,84

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:7,0:7:18:0,18,270

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:6,8:14:0:0,0,0

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:1,2:3:0:0,0,0

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:7,1:8:0:0,0,195

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:8,0:8:21:0,21,315

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:4,0:4:12:0,12,165

1 51 . ATCCCTAAATCTTTAAATCCTACATCCATGAATCCCTAAATACCTAAT A, 0.05 . BaseQRankSum=0.727;ClippingRankSum=0.727;DP=7;MLEAC=1,0;MLEAF=0.500,0.00;MQ=57.66;MQ0=0;MQRankSum=0.727;ReadPosRankSum=-0.727 GT:AD:DP:GQ:PL:SB 0/1:3,1,0:4:19:19,0,123,28,126,154:3,0,1,0

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:4,0:4:9:0,9,135

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:9,0:9:27:0,27,348

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:13,1:14:0:0,0,402

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:5,0:5:14:0,15,142

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:4,4:8:0:0,0,26

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:11,0:11:24:0,24,360

1 51 . A . . . GT:AD:DP:GQ:PL 0/0:4,0:4:11:0,12,105

Any tips much appreciated.

Thanks,

Billie

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @BAG‌

    Hi Billie,

    Can you please confirm that you are using the latest version of GATK (version 3.2-2)?

    Thanks,
    Sheila

  • BAGBAG TorontoMember

    Hi, Sheila. Yes we are running 3.2-2.

    Header line from the file:

    GATKCommandLine=<ID=GenotypeGVCFs,Version=3.2-2-gec30cee,

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @BAG‌

    Hi Billie,

    Can you share a snippet of the data that is causing the error with us? We would like to replicate it ourselves.

    You can find instructions on how to do so here: http://gatkforums.broadinstitute.org/discussion/1894/how-do-i-submit-a-detailed-bug-report

    Thanks,
    Sheila

  • BAGBAG TorontoMember

    Hi Sheila, Here are the commands we used:

    ~/apps/jre1.7.0_51/bin/java
    -Djava.io.tmpdir=/scratch/w/wrighste/youngwha/tmp/ -Xmx14g -jar
    ~/apps/GenomeAnalysisTK.jar -T GenotypeGVCFs -R
    ~/references/TAIR10_chr_all.fa -V $SCRATCH/Billie_34Samples/all.list
    -allSites --max_alternate_alleles 36 -nt 12 -o
    /scratch/w/wrighste/youngwha/Billie_34Samples/combined_gvcf.vcf >
    /scratch/w/wrighste/youngwha/Billie_34Samples/combined_gvcf.log 2>
    /scratch/w/wrighste/youngwha/Billie_34Samples/combined_gvcf.err &

    I am uploading the individual files as an archive to the Broad ftp site. "34genomes.tar.gz"

    Unfortunately I don't currently have access to the .log and .err files generated from this run right now, but if they are critical, I can do some digging, retrieve and send them in about two weeks. I remember scanning through the .err file and there was nothing significant there. The program runs to completion, the output is just missing some bases, about 0.61% of the genome.

    Thanks very much for your help.

    Billie

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @BAG

    Hi Billie,

    I was able to replicate this issue, and I submitted a bug report to the developers.

    Hopefully this will be fixed soon, but I cannot guarantee an exact time.

    Please do note that if you are asked to submit a bug report next time, a small snippet of the region where the bug is occurring is sufficient. The files you submitted are huge and cause my computer to slow down quite a bit.

    Thanks,
    Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @BAG‌

    Hi,

    This has been fixed today. Please download the nightly build tonight.

    https://www.broadinstitute.org/gatk/nightly

    -Sheila

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @BAG‌

    Sorry I got a little too excited about this fix. It probably will not be fixed in the nightly until Monday or Tuesday of next week. Please check back then.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    This is fixed in the new release (v3.3).

  • BillieGBillieG TorontoMember

    Hi Geraldine and Sheila,
    Thank you for working to fix this issue with the previous version of GATK. I know this was a while ago, but can you tell me what was the nature of the sites that were being skipped by the GenotypeGVCFs function? Did most of the sites contain large indels? I have to work with the SNP set called using the old version and I need to think about whether the small number of sites that are missing from the output will strongly influence my results.
    Thanks again.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @BillieG It seems that this affected indels, but we don't have detailed information about how often this would occur or whether the indels that were skipped had any particular set of characteristics.

Sign In or Register to comment.