HaplotypeCaller error: bad variant representation and missing variant in gvcf and vcf

Hi all,
I'm using GATK v3.6 for a multi-sample analysis. I followed the Best Practices, and this is my HaplotypeCaller command line:

    java -Xmx64g -jar $GATK -T HaplotypeCaller \
    -R $REF \
    -I $PROCESSING/5_BQSR/${filename%.*}.bam \
    -o $PROCESSING/6_Variant/GATK/${filename%.*}.g.vcf \
    -ERC GVCF \
    --doNotRunPhysicalPhasing \
    -bamout $PROCESSING/6_Variant/GATK/${filename%.*}.g.vcf.bam \
    -L $TARGET

Looking at the final VCF (after GenotypeGVCFs, but the same record is also in the .g.vcf file), I found this multi-allelic record:

chr21 44477938 . CGGGCACCCGTTTGAGCTGCCTGTAGGTGACCGGGCACCCGTTTGAGCTGCCTGTAGGTGACT TGGGCACCCGTTTGAGCTGCCTGTAGGTGACCGGGCACCCGTTTGAGCTGCCTGTAGGTGACT,C 6064.54 PASS AC=3,1;AF=0.125,0.042;AN=24;BaseQRankSum=1.00;ClippingRankSum=0.092;DP=3685;ExcessHet=0.6070;FS=5.275;InbreedingCoeff=0.4000;MLEAC=3,1;MLEAF=0.125,0.042;MQ=57.73;MQRankSum=-4.330e-01;QD=10.40;ReadPosRankSum=-5.700e-02;SOR=0.577 GT:AD:DP:GQ:PL 0/0:229,0,0:229:99:0,120,1800,120,1800,1800 0/1:177,131,0:308:99:2363,0,5706,2891,6106,8997 0/0:246,0,0:246:99:0,120,1800,120,1800,1800 0/0:223,0,0:223:99:0,120,1800,120,1800,1800 0/1:123,80,0:203:99:1422,0,3124,1790,3357,5147 0/0:393,0,0:393:99:0,120,1800,120,1800,1800 0/0:311,0,35:346:61:0,913,12646,61,11795,11566 0/0:461,0,0:461:99:0,120,1800,120,1800,1800 1/2:1,35,36:72:99:2338,868,1163,1274,0,3057 0/0:374,0,0:374:99:0,120,1800,120,1800,1800 0/0:37,0,0:37:99:0,99,1239,99,1239,1239 0/0:356,0,0:356:99:0,120,1800,120,1800,1800

Looking at this genotype, 1/2:1,35,36:72:99:2338,868,1163,1274,0,3057, it seems like there is a large deletion at this site: the second ALT allele replaces the 63 bp REF allele with a single C.
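
To make the two ALT alleles of that record easier to read, here is a minimal Python sketch that trims each REF/ALT pair down to its shortest form. The alleles and position are copied from the record above; `trim_allele_pair` and `describe` are hypothetical helper names for illustration, not GATK functions, and the trimming is a simplified version of what dedicated normalization tools do (it does not left-align against the reference).

```python
# Alleles copied from the chr21:44477938 record quoted above.
POS = 44477938
REF = "CGGGCACCCGTTTGAGCTGCCTGTAGGTGACCGGGCACCCGTTTGAGCTGCCTGTAGGTGACT"
ALTS = [
    "TGGGCACCCGTTTGAGCTGCCTGTAGGTGACCGGGCACCCGTTTGAGCTGCCTGTAGGTGACT",
    "C",
]


def trim_allele_pair(pos, ref, alt):
    """Reduce one REF/ALT pair to a minimal representation.

    Shared trailing bases are removed first, then shared leading bases,
    always leaving at least one base in each allele.
    """
    # Trim the common suffix.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim the common prefix, shifting the position accordingly.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


def describe(pos, ref, alt):
    """Classify one trimmed REF/ALT pair (hypothetical helper)."""
    pos, ref, alt = trim_allele_pair(pos, ref, alt)
    if len(ref) == 1 and len(alt) == 1:
        return f"SNP {ref}>{alt} at {pos}"
    if len(alt) == 1 and ref[0] == alt[0]:
        # The first base is the anchor; the remaining REF bases are deleted.
        return f"deletion of {len(ref) - 1} bp ({pos + 1}-{pos + len(ref) - 1})"
    if len(ref) == 1 and alt[0] == ref[0]:
        return f"insertion of {len(alt) - 1} bp after {pos}"
    return f"complex substitution at {pos}"


for number, alt in enumerate(ALTS, start=1):
    print(f"ALT{number}: {describe(POS, REF, alt)}")
```

Read this way, the first ALT allele is simply a C>T SNP at 44477938 written against the long REF allele, while the second is a deletion that keeps the anchor base and removes the 62 bases from 44477939 to 44478000.
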
Now look at the BAM screenshot for this sample: there is no deletion at that site, only 3 SNPs. In fact, using other variant callers (VarScan and FreeBayes), I found these variants but no deletion:

chr21 44477938 . C T
chr21 44477971 . G C
chr21 44478000 . T C

However, GATK does call the last of these variants as a separate record at its own position:

chr21 44478000 . T C 56720.78 PASS AC=14;AF=0.636;AN=22;BaseQRankSum=0.692;ClippingRankSum=-2.390e-01;DP=3859;ExcessHet=1.1475;FS=0.000;InbreedingCoeff=0.2143;MLEAC=14;MLEAF=0.636;MQ=57.12;MQRankSum=0.139;QD=21.15;ReadPosRankSum=0.667;SOR=0.702 GT:AD:DP:GQ:PL ./.:234,0:234 1/1:0,287:287:99:8615,853,0 1/1:0,252:252:99:7501,751,0 1/1:0,188:188:99:5121,531,0 1/1:0,229:229:99:5490,588,0 0/0:393,0:393:99:0,120,1800 0/1:133,190:323:99:4813,0,490 1/1:3,464:467:99:14732,1315,0 0/1:56,35:91:99:1155,0,1414 0/1:155,280:435:99:5265,0,3405 0/0:37,0:37:99:0,99,1239 0/1:167,243:410:99:4096,0,310

In addition, note that VarScan and FreeBayes report the SNP at chr21:44477971 in different samples:

0/1:255:254:245:136:106:43,44%:4,17E-39:34:32:53:83:60:46 0/0:452:294:286:279:5:1,75%:3,07E-2:33:31:130:149:3:2 0/0:430:242:230:230:0:0%:1E0:32:0:101:129:0:0 0/0:334:216:206:200:3:1,46%:1,2407E-1:32:33:100:100:2:1 0/0:304:200:188:185:3:1,6%:1,24E-1:32:31:84:101:1:2 0/0:557:357:342:337:5:1,46%:3,0793E-2:32:23:173:164:2:3 0/1:255:384:374:241:133:35,56%:5,2547E-47:33:31:112:129:69:64 0/0:734:416:391:391:0:0%:1E0:33:0:183:208:0:0 0/1:187:123:119:67:52:43,7%:1,6526E-19:34:32:29:38:28:24 0/0:589:340:326:325:1:0,31%:5E-1:33:31:152:173:0:1 0/0:399:242:233:229:2:0,86%:2,4946E-1:32:20:113:116:2:0 0/0:498:332:317:309:6:1,89%:1,5254E-2:32:28:151:158:4:2
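
The per-sample columns above are easier to compare once they are split into named fields. The following is a rough Python sketch that assumes the columns follow VarScan's usual FORMAT layout (GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR); the post does not show the FORMAT string, so that key order is an assumption, and `parse_varscan_sample` is a hypothetical helper name. It also converts the decimal commas used in the pasted values.

```python
# Assumed key order: VarScan's usual FORMAT string (not shown in the post).
VARSCAN_KEYS = "GT:GQ:SDP:DP:RD:AD:FREQ:PVAL:RBQ:ABQ:RDF:RDR:ADF:ADR".split(":")


def parse_varscan_sample(column):
    """Split one sample column into a dict keyed by the assumed FORMAT keys."""
    values = column.split(":")
    if len(values) != len(VARSCAN_KEYS):
        raise ValueError(f"expected {len(VARSCAN_KEYS)} fields, got {len(values)}")
    record = dict(zip(VARSCAN_KEYS, values))
    # Depth fields are plain integers.
    for key in ("DP", "RD", "AD"):
        record[key] = int(record[key])
    # FREQ is pasted with a decimal comma and a percent sign, e.g. "43,44%".
    record["FREQ"] = float(record["FREQ"].rstrip("%").replace(",", "."))
    return record


# First two sample columns copied from the post.
columns = [
    "0/1:255:254:245:136:106:43,44%:4,17E-39:34:32:53:83:60:46",
    "0/0:452:294:286:279:5:1,75%:3,07E-2:33:31:130:149:3:2",
]
for index, column in enumerate(columns, start=1):
    sample = parse_varscan_sample(column)
    print(index, sample["GT"], sample["RD"], sample["AD"], sample["FREQ"])
```

Under that assumption, each sample's genotype can be read next to its reference-supporting reads (RD), variant-supporting reads (AD), and variant allele frequency.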

In conclusion, GATK seems to call a deletion and, in doing so, loses the information about the second variant (chr21:44477971) for all the samples.
I hope everything is clear. Thank you in advance for your help!


Best Answer


  • valentin, Cambridge, MA (Member, Dev)

    Hi @Matteodigg,

    Yeah, it may well be that the three SNPs are the right answer in this case, as they are typically considered more parsimonious than such a big deletion.

    However, I think that to get the whole picture here you need to zoom out a bit, ideally including unique flanking sequence before and after the region you are showing; it seems that the reference sequence is a series of long-unit (~34 bp) identical and quasi-identical repeats.

    Also, since you are generating the "bamout" with the reads realigned against the haplotypes, that may help to clarify why HC is calling the deletion. Could you please also plot it with the reads colored by the "HC" tag?

    Thanks, V.
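
Besides coloring by tag in a viewer, the same grouping can be tallied as text. Below is a minimal Python sketch that counts bamout reads per "HC" tag value, assuming the BAM region has first been converted to SAM text (for example with `samtools view`). The function names are hypothetical, and the sketch does not interpret the tag values; it only groups reads that share one.

```python
from collections import Counter


def hc_tag(sam_line):
    """Return the value of the HC optional tag in one SAM record, or None."""
    fields = sam_line.rstrip("\n").split("\t")
    # Optional fields start at column 12 and look like TAG:TYPE:VALUE.
    for field in fields[11:]:
        if field.startswith("HC:"):
            return field.split(":", 2)[2]
    return None


def count_by_haplotype(sam_lines):
    """Count alignment records per HC tag value, skipping headers and blanks."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        counts[hc_tag(line)] += 1
    return counts
```

For instance, `count_by_haplotype(open("region.sam"))` (with `region.sam` standing in for whatever file holds the SAM text) returns a `Counter` whose keys are HC values. Reads without the tag are counted under `None`.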

  • Matteodigg, Italy (Member)

    Hi @valentin,

    I checked the .g.vcf.bam; I had forgotten that GATK v3.6 includes realignment in the HC step. Indeed, as you can see from the screenshot, GATK seems to call a deletion. I will run a confirmation to verify whether it really exists. Thank you in advance for your help.


  • Sheila, Broad Institute (Member, Broadie)

    Hi Matteo,

    Looking forward to your results!

