GATK licensing moves to direct-through-Broad model -- read about it on the GATK blog

Bug in HaplotypeCaller: GT is called 0/1, but AD is 206,0

wxiewxie ExeterPosts: 4Member

HI, I'd like to report a weird result from HaplotypeCaller.

We have a patient sequenced by targeted sequencing on HNF1B. This patient has been confirmed to have a whole gene deletion of HNF1B so we used this patient as a positive control. We expected to see no heterozygous variants called in HNF1B.

However, HaplotypeCaller called two heterozygous variants: one deletion (it didn't pass the FS strand bias filter and the ReadPosRankSumTest filter) and one substitution (this one passed all the quality filters). Both these two variants were not called by UnifiedGenotyper (and the variants called by UnifiedGenotyper in HNF1B region were all homozygous as what we expected)

Please see the VCF table:

There are three things I want to highlight:
1, The deletion is only 10 bases upstream of the substitution, but the FS score is enormous for the deletion whereas very low for the substitution. If there's a strand bias, it must affect both variants if they are so close to each other.
2, The score of ReadPosRankSumTest of the deletion didn't pass the filter because it's always called near the end of the reads. The ReadPosRankSumTest score for the substitution is missing.
3, The genotype was called 0/1 for the substitution, but if we look at the AD, there are 206 reads supporting the ref allele and 0 read supporting the alt allele. Going by the AD, it's clearly a homozygous ref/ref genotype.

Then I looked into the bam files. It turns out the all the alternative alleles of the substitution were from the ends of bad reads, and there are not too many of them after all. So the reads in the bam file also support a homozygous ref/ref genotype.

Therefore I'm really confused why the substitution has 0/1 genotype called by the HaplotypeCaller and why it passed the filter.

Many Thanks

Betty

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    Hi Betty, can you tell me what version you're using?

    Geraldine Van der Auwera, PhD

  • wxiewxie ExeterPosts: 4Member

    @Geraldine_VdAuwera said:
    Hi Betty, can you tell me what version you're using?

    Hi, I'm using GenomeAnalysisTK-3.1-1

    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    Thanks -- this looks like it might be a bug, I'll pass it on to the HC dev for evaluation. Would you be able to share a snippet of data so that we can debug this locally? We would of course treat it as confidential and delete it when we're done.

    Geraldine Van der Auwera, PhD

  • wxiewxie ExeterPosts: 4Member

    @Geraldine_VdAuwera said:
    Thanks -- this looks like it might be a bug, I'll pass it on to the HC dev for evaluation. Would you be able to share a snippet of data so that we can debug this locally? We would of course treat it as confidential and delete it when we're done.

    Thanks!
    I've extracted reads in the HNF1B region from the bam file and uploaded them on a google drive

    Here are the download links of the bam and bai files:
    https://drive.google.com/file/d/0B-MrTTlr9P8UNjBsLTB4ZG1DQTg/edit?usp=sharing
    https://drive.google.com/file/d/0B-MrTTlr9P8UX3FuOTkzMi1XSUk/edit?usp=sharing
    https://drive.google.com/file/d/0B-MrTTlr9P8Ua2p1emhrTkY1dkE/edit?usp=sharing
    https://drive.google.com/file/d/0B-MrTTlr9P8UN0N0ME5lblY0Zms/edit?usp=sharing

    HNF1B.bam is the bam file after the step of RealignerTargetCreator;
    HNF1B.HC-3.1-1.bam is the bam file after HaplotypeCaller (with --bamOutput)

    Please let me know if there's any problem with the download link or if you need more data :)

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    Thanks for sharing the data. Can I ask you to upload the files to our FTP server rather than provide links? This makes it much easier for us to process incoming user files. Sorry I forgot to mention that initially; here are the instructions: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

  • wxiewxie ExeterPosts: 4Member

    @Geraldine_VdAuwera said:
    Thanks for sharing the data. Can I ask you to upload the files to our FTP server rather than provide links? This makes it much easier for us to process incoming user files. Sorry I forgot to mention that initially; here are the instructions: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Hi I've uploaded those bam files to your FTP server under the username gsapubftp
    The file name is HaplotypeCaller_GT_is_called_0_1_but_AD_is_206_0.tar.gz

    Thanks

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    Thanks @wxie, I'll process your files tomorrow, and hopefully we can get this fixed promptly.

    Geraldine Van der Auwera, PhD

  • JTeerJTeer Posts: 11Member

    I'm seeing the same issue, and although I'm not completely sure, it does look like this is happening when SoftClipped reads are nearby. @wxie, do you know if you have softclipped reads near or at the variant location? (I see lots of reads lining up in your screenshot, but can't tell if it's clipping in addition to depth.)

  • jpofmarsjpofmars Posts: 5Member

    Hi,

    do you have fiexd this issu ?

    I have the same problem :
    chr13 52593258 rs147743710 C T 1407.77 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=-2.797;DB;DP=99;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=59.85;MQ0=0;MQRankSum=0.387;QD=14.22;ReadPosRankSum=1.794;set=variant GT:AD:FT:GQ:PL 0/1:1343,0:het:99:1436,6,0

    On IGV I have A=1,C=43,T=56 for this position.

    For information, I used Haplotype GenomeAnalysisTK-2.8.1. I think reanalyse with 3.1.1 but if there are the same bug...

    Thank you for your help !!

  • SheilaSheila Broad InstitutePosts: 1,241Member, GATK Dev, Broadie, Moderator, DSDE Member admin

    @jpofmars

    Hi,

    Please try running the latest nightly build on a small section to see if it works.

    Thanks,
    Sheila

  • valentinvalentin Cambridge, MAPosts: 19Member, GATK Dev mod

    @jpofmars, also, please post the raw HC output that shows the error (i.e. with a contradictory Likelihoods, AD or GT values) together with the command line used; I'm guessing that the extract that you posted is the output from a larger pipeline including CombineVariants (where does the info field "set" comes from otherwise?) thus the error could come from some other variant calling run so it is better if we factor that out by just looking to the culpable HC's output.

    In any case, I find it quite surprising that you have such a low counts looking at the input bam with IGV but such a large number (and just ref) in the AD annotation. Is it possible that those values somehow come from another BAM through another combined VCF (thanks to CombineVariants)?

    And as Sheila has said you need to use a more up-to-date version (being nightly build the best); 2.8 is kind of old as we have changed HC quite a bit since then.

    Cheers, V.

  • jpofmarsjpofmars Posts: 5Member

    Thank you for your answers.

    I use only HaplotypeCaller, but i split raw vcf file to separate snp and indels for the filtering step.
    The problem came well from CombineVariants when I combine snp filtered file with indel filtered file.
    With the new version the problem is resolved.

    Thank you again !

    JP

Sign In or Register to comment.