HaplotypeCaller error: bad variant representation and missing variant in gvcf and vcf

Hi all,
I'm using GATK v.3.6 for a multi-sample analysis. I followed best practices and this is the command line for HaplotypeCaller:

    java -Xmx64g -jar $GATK -T HaplotypeCaller \
    -R $REF \
    -I $PROCESSING/5_BQSR/${filename%.*}.bam \
    -o $PROCESSING/6_Variant/GATK/${filename%.*}.g.vcf \
    -ERC GVCF \
    --doNotRunPhysicalPhasing \
    -bamout $PROCESSING/6_Variant/GATK/${filename%.*}.g.vcf.bam \
    -L $TARGET

Looking at the final VCF (after GenotypeGVCFs, but the same is true in the .g.vcf file), I found this multi-allelic record:

chr21 44477938 . CGGGCACCCGTTTGAGCTGCCTGTAGGTGACCGGGCACCCGTTTGAGCTGCCTGTAGGTGACT TGGGCACCCGTTTGAGCTGCCTGTAGGTGACCGGGCACCCGTTTGAGCTGCCTGTAGGTGACT,C 6064.54 PASS AC=3,1;AF=0.125,0.042;AN=24;BaseQRankSum=1.00;ClippingRankSum=0.092;DP=3685;ExcessHet=0.6070;FS=5.275;InbreedingCoeff=0.4000;MLEAC=3,1;MLEAF=0.125,0.042;MQ=57.73;MQRankSum=-4.330e-01;QD=10.40;ReadPosRankSum=-5.700e-02;SOR=0.577 GT:AD:DP:GQ:PL 0/0:229,0,0:229:99:0,120,1800,120,1800,1800 0/1:177,131,0:308:99:2363,0,5706,2891,6106,8997 0/0:246,0,0:246:99:0,120,1800,120,1800,1800 0/0:223,0,0:223:99:0,120,1800,120,1800,1800 0/1:123,80,0:203:99:1422,0,3124,1790,3357,5147 0/0:393,0,0:393:99:0,120,1800,120,1800,1800 0/0:311,0,35:346:61:0,913,12646,61,11795,11566 0/0:461,0,0:461:99:0,120,1800,120,1800,1800 1/2:1,35,36:72:99:2338,868,1163,1274,0,3057 0/0:374,0,0:374:99:0,120,1800,120,1800,1800 0/0:37,0,0:37:99:0,99,1239,99,1239,1239 0/0:356,0,0:356:99:0,120,1800,120,1800,1800

Looking at this genotype, 1/2:1,35,36:72:99:2338,868,1163,1274,0,3057, it seems like there is a large deletion at this site (the 63 bp REF allele is replaced by a single C, i.e. 62 bp are removed).
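As a quick sanity check on how to read that record, here is a minimal sketch that reports the REF length, the reference span it covers, and how each ALT allele differs in length. The function name `allele_summary` is made up for illustration, and it assumes the standard VCF column order (CHROM, POS, ID, REF, ALT); only the first five columns of the record above are used.

```shell
# Summarise REF/ALT lengths for VCF data lines read from stdin.
# Assumes whitespace-separated columns in the order CHROM POS ID REF ALT.
allele_summary() {
  awk '{
    pos = $2; ref = $4
    n = split($5, alt, ",")
    printf "REF length=%d span=%s:%d-%d\n", length(ref), $1, pos, pos + length(ref) - 1
    for (i = 1; i <= n; i++) {
      d = length(alt[i]) - length(ref)
      kind = (d == 0) ? "same-length" : (d < 0 ? "deletion" : "insertion")
      printf "ALT%d length=%d delta=%d (%s)\n", i, length(alt[i]), d, kind
    }
  }'
}

# First five columns of the record quoted above.
record='chr21 44477938 . CGGGCACCCGTTTGAGCTGCCTGTAGGTGACCGGGCACCCGTTTGAGCTGCCTGTAGGTGACT TGGGCACCCGTTTGAGCTGCCTGTAGGTGACCGGGCACCCGTTTGAGCTGCCTGTAGGTGACT,C'
printf '%s\n' "$record" | allele_summary
```

The first ALT has the same length as REF (it differs only at the first base), while the second ALT is the single base C, so the record spans chr21:44477938-44478000 and the second ALT removes 62 bases.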
Now look at the BAM file screenshot for this sample: there is no deletion at that site, only 3 SNPs. In fact, using other variant callers (VarScan and FreeBayes), I found these variants but no deletion:

chr21 44477938 . C T
chr21 44477971 . G C
chr21 44478000 . T C

However, GATK does call the last of these SNPs (chr21:44478000 T>C) as a separate record:

chr21 44478000 . T C 56720.78 PASS AC=14;AF=0.636;AN=22;BaseQRankSum=0.692;ClippingRankSum=-2.390e-01;DP=3859;ExcessHet=1.1475;FS=0.000;InbreedingCoeff=0.2143;MLEAC=14;MLEAF=0.636;MQ=57.12;MQRankSum=0.139;QD=21.15;ReadPosRankSum=0.667;SOR=0.702 GT:AD:DP:GQ:PL ./.:234,0:234 1/1:0,287:287:99:8615,853,0 1/1:0,252:252:99:7501,751,0 1/1:0,188:188:99:5121,531,0 1/1:0,229:229:99:5490,588,0 0/0:393,0:393:99:0,120,1800 0/1:133,190:323:99:4813,0,490 1/1:3,464:467:99:14732,1315,0 0/1:56,35:91:99:1155,0,1414 0/1:155,280:435:99:5265,0,3405 0/0:37,0:37:99:0,99,1239 0/1:167,243:410:99:4096,0,310

In addition, consider that VarScan and FreeBayes found the SNP at position chr21:44477971 in several samples:

0/1:255:254:245:136:106:43,44%:4,17E-39:34:32:53:83:60:46 0/0:452:294:286:279:5:1,75%:3,07E-2:33:31:130:149:3:2 0/0:430:242:230:230:0:0%:1E0:32:0:101:129:0:0 0/0:334:216:206:200:3:1,46%:1,2407E-1:32:33:100:100:2:1 0/0:304:200:188:185:3:1,6%:1,24E-1:32:31:84:101:1:2 0/0:557:357:342:337:5:1,46%:3,0793E-2:32:23:173:164:2:3 0/1:255:384:374:241:133:35,56%:5,2547E-47:33:31:112:129:69:64 0/0:734:416:391:391:0:0%:1E0:33:0:183:208:0:0 0/1:187:123:119:67:52:43,7%:1,6526E-19:34:32:29:38:28:24 0/0:589:340:326:325:1:0,31%:5E-1:33:31:152:173:0:1 0/0:399:242:233:229:2:0,86%:2,4946E-1:32:20:113:116:2:0 0/0:498:332:317:309:6:1,89%:1,5254E-2:32:28:151:158:4:2

In conclusion, GATK seems to call a deletion and, in doing so, loses the information about the second variant (chr21:44477971) for all the samples.
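To make the comparison between callers concrete, the sketch below looks up which base the REF allele and the first ALT allele of the GATK record carry at each of the three SNP positions reported by VarScan and FreeBayes. The helper `base_at` is a hypothetical name; the allele strings are copied from the record above.

```shell
# Print the base that ALLELE carries at genomic position QUERY_POS,
# given that the allele starts at RECORD_POS.
# usage: base_at ALLELE RECORD_POS QUERY_POS
base_at() {
  awk -v a="$1" -v start="$2" -v q="$3" 'BEGIN { print substr(a, q - start + 1, 1) }'
}

start=44477938
ref='CGGGCACCCGTTTGAGCTGCCTGTAGGTGACCGGGCACCCGTTTGAGCTGCCTGTAGGTGACT'
alt1='TGGGCACCCGTTTGAGCTGCCTGTAGGTGACCGGGCACCCGTTTGAGCTGCCTGTAGGTGACT'

for pos in 44477938 44477971 44478000; do
  printf '%s REF=%s ALT1=%s\n' "$pos" \
    "$(base_at "$ref" "$start" "$pos")" "$(base_at "$alt1" "$start" "$pos")"
done
```

All three positions fall inside the record, and the REF bases (C, G, T) agree with the REF bases the other callers report. The first ALT carries only the C>T change at 44477938, so the G>C change at 44477971 is not represented by either ALT allele of this record.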
I hope everything is clear. Thank you in advance for your help!


Best Answer


  • valentin ✭✭ · Cambridge, MA · Member, Dev

    Hi @Matteodigg,

    Yeah, it may well be that the three SNPs are the right answer in this case, as they are typically considered more parsimonious than such a big deletion.

    However, I think that in order to have the whole picture here you need to zoom out a bit, ideally including unique flanking sequences before and after the region that you are showing; it seems that the reference sequence is a series of identical and quasi-identical repeats of a long unit (~34 bp).
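The repeat structure can be checked on the one piece of reference sequence that is visible in this thread, the 63 bp REF allele of the record in the question. This is only a sketch: `smallest_period` is a hypothetical helper, and it says nothing about the flanking reference, which is not shown here and is what the ~34 bp estimate refers to.

```shell
# Print the smallest p such that the string repeats with period p,
# or the full length if no shorter period exists.
smallest_period() {
  awk -v s="$1" 'BEGIN {
    n = length(s)
    for (p = 1; p < n; p++)
      if (substr(s, 1, n - p) == substr(s, p + 1)) { print p; exit }
    print n
  }'
}

ref='CGGGCACCCGTTTGAGCTGCCTGTAGGTGACCGGGCACCCGTTTGAGCTGCCTGTAGGTGACT'
# Drop the final base (T) and test the remaining 62 bases.
smallest_period "${ref%?}"
```

For the first 62 bases of the REF allele this prints 31, meaning that stretch is two exact copies of a 31 bp unit. An aligner or assembler has more than one equally plausible way to place reads in such a region.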

    Also, since you are generating the "bamout" with the reads realigned against the haplotypes, that may help to clarify why HC is calling the deletion. Could you please also plot it, coloring by the "HC" tag?
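One way to zoom out is to pad the coordinates of the record on both sides and use the resulting locus string in the viewer. The sketch below builds such a string; `padded_interval` and the 500 bp padding are illustrative choices, not anything prescribed by GATK.

```shell
# Build a CHROM:START-END locus string padded by PAD bases on each side.
# usage: padded_interval CHROM START END PAD
padded_interval() {
  pi_start=$(( $2 - $4 ))
  if [ "$pi_start" -lt 1 ]; then pi_start=1; fi   # clamp to the first base
  printf '%s:%d-%d\n' "$1" "$pi_start" "$(( $3 + $4 ))"
}

# The record in the question spans chr21:44477938-44478000.
region=$(padded_interval chr21 44477938 44478000 500)
echo "$region"

# Hypothetical follow-up, not run here ($BAMOUT is a placeholder for the
# indexed -bamout file):
#   samtools view -b "$BAMOUT" "$region" > zoomed.bam
```

The printed string (chr21:44477438-44478500) can be pasted into a genome browser's locus box, or used to subset the bamout file before loading it.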

    Thanks, V.

  • Matteodigg · Italy · Member

    Hi @valentin,

    I checked the .g.vcf.bam; I had forgotten that GATK v.3.6 includes realignment in the HC step. Indeed, as you can see from the screenshot, GATK seems to call a deletion. I will try to confirm whether it actually exists. Thank you in advance for your help.


  • Sheila · Broad Institute · Member, Broadie, Moderator, admin

    Hi Matteo,

    Looking forward to your results!

