
HaplotypeCaller misses a true variant

Hi,
I'm updating my pipeline for exome sequencing analysis, so I'm trying out the HaplotypeCaller (HC). I analyzed the same sample with both the UnifiedGenotyper (UG) walker and HC, compared the two output VCF files, and found something worrying: HC failed to call a true novel variant! I know this is a true variant because I validated it with Sanger sequencing after the first round of calling with UG.

I have run UG using GATK version 1.6-11-g3b2fab9. This is the VCF line of the variant:

chr7 45123943 . A T 3436.17 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=2.043;DP=114;Dels=0.00;FS=2.678;HRun=1;HaplotypeScore=0.0000;MQ=42.18;MQ0=1;MQRankSum=2.152;QD=30.14;ReadPosRankSum=-0.781;SB=-1010.47 GT:AD:DP:GQ:PL 1/1:8,105:114:99:3436,253,0

I ran HC using GATK version 2.7-4-g6f46d11 in both single-sample and multi-sample mode, but there is no trace of this variant in the VCF output.
I also noticed that, along with this novel variant, HC missed two other variants upstream of it; these are their VCF lines:

chr7 45123881 rs61740891 C T 654.25 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=0.205;DB;DP=43;DS;Dels=0.00;FS=65.862;HRun=0;HaplotypeScore=2.2312;MQ=30.27;MQ0=1;MQRankSum=3.254;QD=15.22;ReadPosRankSum=-3.921;SB=-3.02 GT:AD:DP:GQ:PL 0/1:18,25:43:99:684,0,176

chr7 45123888 . C T 161.90 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=-2.293;DP=26;DS;Dels=0.00;FS=49.656;HRun=2;HaplotypeScore=0.0000;MQ=23.81;MQ0=1;MQRankSum=0.425;QD=6.23;ReadPosRankSum=-3.821;SB=-3.00 GT:AD:DP:GQ:PL 0/1:17,9:26:99:192,0,249

How is this possible?

Many thanks in advance

Best, Flavia

Answers

  • flapa, Bologna (Member)

    Hi Geraldine,

    I've just uploaded my data to the FTP server in a file named flapa_data.tar.gz; I created a BAM file covering the whole of chr7, where the uncalled variants fall.

    I hope this can be helpful!

    Flavia

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    I was able to reproduce your issue, so I'm now passing this on to the devs for in-depth debugging.

  • ebanks, Broad Institute (Member, Broadie, Dev)

    Hi Flavia,

    I've taken a look at your example and would like to explain what's happening. If you look carefully at the HC call in that region, you'll notice that HC assembles the reads into a very large (120 bp) deletion, with 90% of your reads supporting that call. HC believes that those "SNPs" aren't real, but rather are artifacts of misalignment around the deletion.

    I've attached a screenshot of your data that illustrates this quite nicely. The upper half shows the clean HC realignments around the deletion. The lower half shows the original reads; notice that the coverage drops dramatically over the deletion and that those "SNPs" occur near the breakpoints. These are classic signs of misalignment.

    Is it possible that the Sanger sequencing validation could be interpreted in this way too?

  • flapa, Bologna (Member)

    Hi Eric,

    Thank you so much for your very clear answer.
    The gene sequence is very repetitive, so after your explanation I think the Sanger sequencing could also be interpreted in this way.
    I'm now trying to perform a more specific PCR, and I'll let you know whether I can replicate the validation.

    Thanks for your help!
    Flavia
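
For readers who want to run the same kind of sanity check on their own calls, here is a minimal Python sketch of the reasoning in Eric's reply: pull the position and INFO annotations out of each UG record, then ask whether a SNP sits close to a breakpoint of a deletion. It is only an illustration, not part of GATK. The helper names (`parse_vcf_line`, `near_breakpoint`) and the 20 bp window are assumptions made for this example, and the thread does not give the deletion's coordinates, so those must be taken from the HC output.

```python
from typing import NamedTuple


class Snp(NamedTuple):
    chrom: str
    pos: int
    info: dict


def parse_vcf_line(line: str) -> Snp:
    """Parse CHROM, POS and the INFO column of one VCF data line.

    Splits on any whitespace so it also works on records pasted into a
    forum post, where the tabs have been turned into spaces.
    """
    fields = line.split()
    info = {}
    for item in fields[7].split(";"):
        key, _, value = item.partition("=")
        info[key] = value if value else True  # flags such as DB carry no value
    return Snp(fields[0], int(fields[1]), info)


def near_breakpoint(snp: Snp, del_start: int, del_end: int, window: int = 20) -> bool:
    """True if the SNP lies within `window` bp of either end of the deletion.

    `window` is an arbitrary illustrative threshold, not a GATK default.
    """
    return min(abs(snp.pos - del_start), abs(snp.pos - del_end)) <= window


# The three UG records quoted in the original post.
UG_RECORDS = [
    "chr7 45123943 . A T 3436.17 PASS AC=2;AF=1.00;AN=2;BaseQRankSum=2.043;DP=114;Dels=0.00;FS=2.678;HRun=1;HaplotypeScore=0.0000;MQ=42.18;MQ0=1;MQRankSum=2.152;QD=30.14;ReadPosRankSum=-0.781;SB=-1010.47 GT:AD:DP:GQ:PL 1/1:8,105:114:99:3436,253,0",
    "chr7 45123881 rs61740891 C T 654.25 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=0.205;DB;DP=43;DS;Dels=0.00;FS=65.862;HRun=0;HaplotypeScore=2.2312;MQ=30.27;MQ0=1;MQRankSum=3.254;QD=15.22;ReadPosRankSum=-3.921;SB=-3.02 GT:AD:DP:GQ:PL 0/1:18,25:43:99:684,0,176",
    "chr7 45123888 . C T 161.90 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=-2.293;DP=26;DS;Dels=0.00;FS=49.656;HRun=2;HaplotypeScore=0.0000;MQ=23.81;MQ0=1;MQRankSum=0.425;QD=6.23;ReadPosRankSum=-3.821;SB=-3.00 GT:AD:DP:GQ:PL 0/1:17,9:26:99:192,0,249",
]

snps = [parse_vcf_line(rec) for rec in UG_RECORDS]

# Depth, strand bias and mapping quality side by side for each call.
for snp in snps:
    print(snp.chrom, snp.pos, "DP=" + snp.info["DP"], "FS=" + snp.info["FS"], "MQ=" + snp.info["MQ"])
```

To apply the breakpoint test, call `near_breakpoint(snp, del_start, del_end)` with the start and end of the deletion as reported in your own HC VCF. A SNP that falls near a breakpoint and coincides with a sharp drop in coverage is worth inspecting in a genome browser before trusting it.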
