HaplotypeCaller misses a true variant

flapa (Bologna) Posts: 3, Member

I'm updating my pipeline for exome sequencing analysis, so I'm trying out HaplotypeCaller. I analyzed the same sample with both the UnifiedGenotyper walker and HC, compared the two output VCF files, and found something worrying: HC failed to call a true novel variant! I know this is a true variant because I validated it with Sanger sequencing after the first round of calling with UG.

I ran UG using GATK version 1.6-11-g3b2fab9. This is the VCF line for the variant:

chr7 45123943 . A T 3436.17 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=2.043;DP=114;Dels=0.00;FS=2.678;HRun=1;HaplotypeScore=0.0000;MQ=42.18;MQ0=1;MQRankSum=2.152;QD=30.14;ReadPosRankSum=-0.781;SB=-1010.47 GT:AD:DP:GQ:PL 1/1:8,105:114:99:3436,253,0

I ran HC using GATK version 2.7-4-g6f46d11 in both single-sample and multi-sample mode, but there is no trace of this variant in the VCF output.
I also noticed that, along with this novel variant, HC missed two other variants upstream of it; these are their VCF lines:

chr7 45123881 rs61740891 C T 654.25 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=0.205;DB;DP=43;DS;Dels=0.00;FS=65.862;HRun=0;HaplotypeScore=2.2312;MQ=30.27;MQ0=1;MQRankSum=3.254;QD=15.22;ReadPosRankSum=-3.921;SB=-3.02 GT:AD:DP:GQ:PL 0/1:18,25:43:99:684,0,176

chr7 45123888 . C T 161.90 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=-2.293;DP=26;DS;Dels=0.00;FS=49.656;HRun=2;HaplotypeScore=0.0000;MQ=23.81;MQ0=1;MQRankSum=0.425;QD=6.23;ReadPosRankSum=-3.821;SB=-3.00 GT:AD:DP:GQ:PL 0/1:17,9:26:99:192,0,249

How is this possible?

Many thanks in advance

Best, Flavia
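
For readers who want to compare the three UG records quoted in the post above, here is a minimal Python sketch that extracts the alternate-allele fraction, FS and MQ from each line. The helper names (`parse_record`, `flags`) are made up for illustration, and the cut-offs (FS > 60, MQ < 40) are assumptions chosen to resemble commonly cited hard-filter values for SNPs; they are not settings used anywhere in this thread. The records are split on whitespace because they were pasted with spaces rather than tabs.

```python
# The three UnifiedGenotyper records exactly as quoted in the post.
RECORDS = [
    "chr7 45123943 . A T 3436.17 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=2.043;DP=114;Dels=0.00;FS=2.678;HRun=1;HaplotypeScore=0.0000;MQ=42.18;MQ0=1;MQRankSum=2.152;QD=30.14;ReadPosRankSum=-0.781;SB=-1010.47 GT:AD:DP:GQ:PL 1/1:8,105:114:99:3436,253,0",
    "chr7 45123881 rs61740891 C T 654.25 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=0.205;DB;DP=43;DS;Dels=0.00;FS=65.862;HRun=0;HaplotypeScore=2.2312;MQ=30.27;MQ0=1;MQRankSum=3.254;QD=15.22;ReadPosRankSum=-3.921;SB=-3.02 GT:AD:DP:GQ:PL 0/1:18,25:43:99:684,0,176",
    "chr7 45123888 . C T 161.90 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=-2.293;DP=26;DS;Dels=0.00;FS=49.656;HRun=2;HaplotypeScore=0.0000;MQ=23.81;MQ0=1;MQRankSum=0.425;QD=6.23;ReadPosRankSum=-3.821;SB=-3.00 GT:AD:DP:GQ:PL 0/1:17,9:26:99:192,0,249",
]


def parse_record(line):
    """Parse one single-sample VCF data line into a small summary dict."""
    chrom, pos, vid, ref, alt, qual, flt, info, fmt, sample = line.split()

    # INFO is a ';'-separated list of KEY=VALUE pairs and bare flags (e.g. DB).
    info_map = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        info_map[key] = value if value else True

    # FORMAT keys and sample values are ':'-separated and line up one to one.
    sample_map = dict(zip(fmt.split(":"), sample.split(":")))
    ref_depth, alt_depth = (int(x) for x in sample_map["AD"].split(","))

    return {
        "pos": int(pos),
        "gt": sample_map["GT"],
        "alt_fraction": alt_depth / (ref_depth + alt_depth),
        "FS": float(info_map["FS"]),
        "MQ": float(info_map["MQ"]),
    }


def flags(rec, fs_max=60.0, mq_min=40.0):
    """Return warning labels; both thresholds are illustrative assumptions."""
    out = []
    if rec["FS"] > fs_max:
        out.append("high_FS")
    if rec["MQ"] < mq_min:
        out.append("low_MQ")
    return out


parsed = [parse_record(line) for line in RECORDS]

if __name__ == "__main__":
    for rec in parsed:
        print(rec["pos"], rec["gt"], round(rec["alt_fraction"], 3),
              rec["FS"], rec["MQ"], flags(rec))
```

Under these assumed cut-offs the homozygous call at 45123943 raises no flag, while the two upstream heterozygous calls are flagged for low mapping quality (and 45123881 also for strand bias). That is only a hint that the region deserves a closer look, not evidence for or against any of the calls.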

  • flapa (Bologna) Posts: 3, Member

    Hi Geraldine,

    I've just uploaded my data to the FTP server in a file named flapa_data.tar.gz; I created a BAM file for the whole of chr7, where the uncalled variants fall.

    I hope this can be helpful!


  • Geraldine_VdAuwera Posts: 10,395, Administrator, Dev admin

    I was able to reproduce your issue, so I'm now passing this on to the devs for in-depth debugging.

    Geraldine Van der Auwera, PhD

  • ebanks (Broad Institute) Posts: 698, Member, Administrator, Broadie, Moderator, Dev admin

    Hi Flavia,

    I've taken a look at your example and would like to explain what's happening. If you look carefully at the HC call in that region, you'll notice that HC assembles the region into a very large (120 bp) deletion, with 90% of your reads supporting that call. HC believes that those "SNPs" aren't real, but rather are artifacts of misalignment around the deletion.

    I've attached a screenshot of your data that illustrates it quite nicely. The upper half shows the nice clean HC re-alignments around the deletion. The lower half shows the original reads; notice that the coverage drops dramatically over the deletion and that those "SNPs" occur near the breakpoints. These are classic signs of misalignment.

    Is it possible that the Sanger sequencing validation could be interpreted in this way too?

    Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

  • flapa (Bologna) Posts: 3, Member

    Hi Eric,

    Thank you so much for your very clear answer.
    The gene sequence is very repetitive, so after your explanation I think the Sanger sequencing could also be interpreted this way.
    I'm now trying a more specific PCR, and I'll let you know whether I can replicate the validation.

    Thanks for your help!
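
The two signs Eric describes, a sharp coverage drop over the deletion and "SNPs" sitting near its breakpoints, can also be checked programmatically. The sketch below is a rough Python illustration on made-up toy data: the depth profile, the coordinates, the function names and the `fraction`, `min_len` and `window` parameters are all assumptions for demonstration and are not taken from Flavia's BAM file.

```python
from statistics import median


def find_low_coverage_runs(depths, start_pos, fraction=0.25, min_len=20):
    """Return inclusive (start, end) intervals where depth stays below
    fraction * median depth for at least min_len consecutive bases."""
    cutoff = fraction * median(depths)
    runs = []
    run_start = None
    for offset, depth in enumerate(depths):
        if depth < cutoff:
            if run_start is None:
                run_start = offset  # a low-coverage run begins here
        elif run_start is not None:
            if offset - run_start >= min_len:
                runs.append((start_pos + run_start, start_pos + offset - 1))
            run_start = None
    # Close a run that extends to the end of the profile.
    if run_start is not None and len(depths) - run_start >= min_len:
        runs.append((start_pos + run_start, start_pos + len(depths) - 1))
    return runs


def snps_near_breakpoints(snp_positions, runs, window=10):
    """Return SNP positions lying within `window` bases of either edge
    of any low-coverage run (the putative deletion breakpoints)."""
    suspicious = []
    for pos in snp_positions:
        for start, end in runs:
            if abs(pos - start) <= window or abs(pos - end) <= window:
                suspicious.append(pos)
                break
    return suspicious


# Toy per-base depth profile starting at coordinate 1000: normal coverage,
# then a 120-base stretch at one tenth of the depth, then normal again.
toy_depths = [100] * 200 + [10] * 120 + [100] * 200
toy_snps = [1195, 1322, 1450]  # hypothetical SNP calls

runs = find_low_coverage_runs(toy_depths, start_pos=1000)
suspicious = snps_near_breakpoints(toy_snps, runs)

if __name__ == "__main__":
    print("low-coverage runs:", runs)
    print("SNPs near breakpoints:", suspicious)
```

On the toy profile this reports one low-coverage run and marks the two SNPs next to its edges as suspicious, leaving the distant one alone. With real data the depth profile would have to come from the BAM file itself, and a flagged SNP would still need inspection of the reads, as in Eric's screenshot.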

