Haplotype Caller incorrectly calling Blocks of Variants Heterozygous

aeonsimaeonsim Posts: 64Member ✭✭
edited November 2012 in Ask the GATK team

Hi

I seem to have found a bit of an issue with the Haplotype caller. Looking at variants called with it I've come across a number of small blocks in the genome where the Haplotype caller has called every individual (50 individuals) either RA or RR, which seemed a bit odd considering the population.

Looking at the BAMs and VCFs from SAMtools and the Unified Genotyper these blocks of snps clearly contain all three states as I'd expect RR/RA/AA. Looking at the BAM the reads are of decent quality and have no nearby insertions or deletions to complicate things, and the variants have been called correctly by Samtools and UG.

Any idea what's causing this? Attached is an IGV image showing one of the regions in question, Top VCF is the Haplotype Caller (showing all calls as RA or RR, which is incorrect), Second is UG (showing a mix of RR/RA/AA which is correct). The First BAM shows one of the Animals HC is calling incorrectly as RA for the 5 SNPs shown, while the Second is an Animal that HC is calling RA correctly.

Note these incorrect calls from the HC also passed VQSR. I believe the version of GATK is one of the 2.1 releases.

Screen Shot 2012-11-13 at 5.32.44 PM.png
1680 x 1050 - 157K
Post edited by Geraldine_VdAuwera on

Best Answer

Answers

  • rpoplinrpoplin Posts: 122GATK Developer mod
    edited November 2012

    Hi there,

    This is very curious. Would you be able to share the data from this region with us to help us debug the problem?

    You can use PrintReads (with -L chr9:52,315,067-52,315,498, for example) to extract an interval of your BAM file and upload it to our FTP server: http://gatkforums.broadinstitute.org/discussion/1215/how-can-i-access-the-gsa-public-ftp-server

    Thanks for your help with this!

    Post edited by rpoplin on
  • aeonsimaeonsim Posts: 64Member ✭✭

    I've uploaded a small region around the problem zone mentioned above (HC-het-issue.tar.gz). I've included the HC and UG calls for the region and a bam containing that region for 4 individuals. Two are Hom Alt's which have been called incorrectly as Het by HC while the remaining two are examples of a Hom Ref and an actual Het.

  • aeonsimaeonsim Posts: 64Member ✭✭

    Quick note: The version of GATK used was 2.1-8-g5efb575.

  • rpoplinrpoplin Posts: 122GATK Developer mod

    Thank you very much for sending these files. Would you be able to also provide the B. taurus reference files you are using?

    Thanks!

  • aeonsimaeonsim Posts: 64Member ✭✭

    It's the standard BosTau6 ref (same as in IGV) but I'll upload a copy to the FTP for you.

  • aeonsimaeonsim Posts: 64Member ✭✭

    @rpoplin said: Thank you very much for sending these files. Would you be able to also provide the B. taurus reference files you are using?

    I've uploaded it as bosTau6.fasta.gz, though the ftp app may have played up a little at the end, if it's corrupted let me know and I'll upload it again.

    Thanks

  • rpoplinrpoplin Posts: 122GATK Developer mod
    edited November 2012

    Hmmm, when I run this now with the latest version (2.2-10) I'm not able to replicate the problem.

    I used this command line:

    java -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -L chr9:52,315,067-52,315,498 -I HCissue.bam -R bosTau6.fasta -o output.vcf -debug

    and it calls the block of variants in your screenshot as:

    BE091642881 - hom ref

    BE155317935 - het

    BE432902494 - hom var

    BE832676583 - hom var

    which looks correct to me.

    Can you run this with the latest version and see if it looks correct to you now?

    Thanks for your help,

    Post edited by rpoplin on
  • aeonsimaeonsim Posts: 64Member ✭✭

    @rpoplin I've had a go with 2.2-15 and the results are variable. If I only use a few animals or a very small region of the genome I get the correct results, for example this small target is called correctly.

    -L chr9:52,315,067-52,315,498

    However if I use a larger region say 2Kb then it goes straight back to calling all the variant HetRA.

    java -jar ../../tools/GenomeAnalysisTK-2.2-15-g9214b2f/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../../refs/bosTau6.fasta -I 50-anmls.bam -o 50-anmls.bam.vcf -L chr9:52,314,000-52,316,000

    The attached ss shows this. I'll upload a larger BAM containing 50 individuals and more seq around that region to the file server. The ss shows in order VCFs for (all VCFs HC unless stated otherwise):

    • Small targeted region 4 animals: Correct
    • Small targeted region 50 animals.: 3 Vars Correct 4th Wrong!
    • Original Wrong VCF
    • Original Unified Genotyper VCF: Correct
    • Larger Region 2kb using command above: Incorrect (all het)
    • Larger Region 10Kb using same settings as above: Incorrect all Het.

    File is: 50-anmls-issue-complete.tar.gz

    Screen Shot 2012-11-29 at 12.04.21 PM.png
    1914 x 1057 - 95K
  • aeonsimaeonsim Posts: 64Member ✭✭

    Thanks, looks like we'll just stick with the Unified Genotyper for a couple more releases then retest the Haplotype caller. This is a rather disconcerting bug though, incorrectly called Het/Hom genotypes can cause significant issues for some of our work.

    Do you have an automated test that looks for Genotypes changing between versions of GATK, and/or compare simple clean SNP (no nearby reps, indels etc) genotype calls between the outputs of UG and HC? It may be worth investigating as it would probably catch cases like this.

    Thanks

  • thondeboerthondeboer Posts: 15Member

    I also see this problem crop up in the version I am using which unfortunately is still version 2012.1-2.2.16-2-g9513dd0 since that is the one that is only available for us commercial customers...Can someone indicate if this has been fixed and in which version? The commercial release is due out and I am sure it is some very old version again, so want to make sure we get a fixed version...Not sure why I should bother with the commercial version if the only real advantage is HaplotypeCaller and that was buggy in release 1

    This is what I see:

    0/1:0,37:99:2790,0,148 This was the original line 17 63534781 . C A 2761.77 . AC=1;AF=0.500;AN=2;ActiveRegionSize=295;DP=39;EVENTLENGTH=0;FS=0.000;HaplotypeScore=40.9590;MLEAC=1;MLEAF=0.500;MQ=53.23;MQ0=0;NVH=5;NumHapAssembly=12;NumHapEval=3;QD=70.81;QDE=14.16;TYPE=SNP;extType=SNP GT:AD:GQ:PL 0/1:0,37:99:2790,0,148

    Thanks,

    Thon

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,822Administrator, GATK Developer admin

    Hi Thon,

    I'm almost certain that this has been fixed since version 2.2. We have made a lot of improvements to the HC since then. Keep in mind that it started out as a highly experimental tool and has been under intense development. These types of bugs will occur less as the tool matures.

    I believe the next commercial release will synch up to our upcoming 2.4 release (but can't guarantee that since it's not my turf).

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.