The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

#### ☞ Did you remember to?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Surround blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block.
GATK 3.7 is here! Be sure to read the Version Highlights and optionally the full Release Notes.

# Haplotype Caller incorrectly calling Blocks of Variants Heterozygous

Member Posts: 68 ✭✭✭
edited November 2012

Hi

I seem to have found a bit of an issue with the Haplotype caller. Looking at variants called with it I've come across a number of small blocks in the genome where the Haplotype caller has called every individual (50 individuals) either RA or RR, which seemed a bit odd considering the population.

Looking at the BAMs and VCFs from SAMtools and the Unified Genotyper these blocks of snps clearly contain all three states as I'd expect RR/RA/AA. Looking at the BAM the reads are of decent quality and have no nearby insertions or deletions to complicate things, and the variants have been called correctly by Samtools and UG.

Any idea what's causing this? Attached is an IGV image showing one of the regions in question, Top VCF is the Haplotype Caller (showing all calls as RA or RR, which is incorrect), Second is UG (showing a mix of RR/RA/AA which is correct). The First BAM shows one of the Animals HC is calling incorrectly as RA for the 5 SNPs shown, while the Second is an Animal that HC is calling RA correctly.

Note these incorrect calls from the HC also passed VQSR.
I believe the version of GATK is one of the 2.1 releases.

Post edited by Geraldine_VdAuwera on
Tagged:

• Dev Posts: 122 ✭✭✭
edited November 2012

Hi there,

This is very curious. Would you be able to share the data from this region with us to help us debug the problem?

You can use PrintReads (with -L chr9:52,315,067-52,315,498, for example) to extract an interval of your BAM file and upload it to our FTP server: http://gatkforums.broadinstitute.org/discussion/1215/how-can-i-access-the-gsa-public-ftp-server

Thanks for your help with this!

• Member Posts: 68 ✭✭✭

I've uploaded a small region around the problem zone mentioned above (HC-het-issue.tar.gz). I've included the HC and UG calls for the region and a bam containing that region for 4 individuals. Two are Hom Alt's which have been called incorrectly as Het by HC while the remaining two are examples of a Hom Ref and an actual Het.

• Member Posts: 68 ✭✭✭

Quick note: The version of GATK used was 2.1-8-g5efb575.

• Member Posts: 18
• Dev Posts: 122 ✭✭✭

Thank you very much for sending these files. Would you be able to also provide the B. taurus reference files you are using?

Thanks!

• Member Posts: 68 ✭✭✭

It's the standard BosTau6 ref (same as in IGV) but I'll upload a copy to the FTP for you.

• Member Posts: 68 ✭✭✭

@rpoplin said:
Thank you very much for sending these files. Would you be able to also provide the B. taurus reference files you are using?

I've uploaded it as bosTau6.fasta.gz, though the ftp app may have played up a little at the end, if it's corrupted let me know and I'll upload it again.

Thanks

• Dev Posts: 122 ✭✭✭
edited November 2012

Hmmm, when I run this now with the latest version (2.2-10) I'm not able to replicate the problem.

I used this command line:

java -jar dist/GenomeAnalysisTK.jar -T HaplotypeCaller -L chr9:52,315,067-52,315,498 -I HCissue.bam -R bosTau6.fasta -o output.vcf -debug

and it calls the block of variants in your screenshot as:

BE091642881 - hom ref

BE155317935 - het

BE432902494 - hom var

BE832676583 - hom var

which looks correct to me.

Can you run this with the latest version and see if it looks correct to you now?

• Member Posts: 68 ✭✭✭

@rpoplin
I've had a go with 2.2-15 and the results are variable. If I only use a few animals or a very small region of the genome I get the correct results, for example this small target is called correctly.

-L chr9:52,315,067-52,315,498

However if I use a larger region say 2Kb then it goes straight back to calling all the variant HetRA.

java -jar ../../tools/GenomeAnalysisTK-2.2-15-g9214b2f/GenomeAnalysisTK.jar -T HaplotypeCaller -R ../../refs/bosTau6.fasta -I 50-anmls.bam -o 50-anmls.bam.vcf -L chr9:52,314,000-52,316,000

The attached ss shows this. I'll upload a larger BAM containing 50 individuals and more seq around that region to the file server.
The ss shows in order VCFs for (all VCFs HC unless stated otherwise):

• Small targeted region 4 animals: Correct
• Small targeted region 50 animals.: 3 Vars Correct 4th Wrong!
• Original Wrong VCF
• Original Unified Genotyper VCF: Correct
• Larger Region 2kb using command above: Incorrect (all het)
• Larger Region 10Kb using same settings as above: Incorrect all Het.

File is:
50-anmls-issue-complete.tar.gz

• Member Posts: 68 ✭✭✭

Thanks, looks like we'll just stick with the Unified Genotyper for a couple more releases then retest the Haplotype caller. This is a rather disconcerting bug though, incorrectly called Het/Hom genotypes can cause significant issues for some of our work.

Do you have an automated test that looks for Genotypes changing between versions of GATK, and/or compare simple clean SNP (no nearby reps, indels etc) genotype calls between the outputs of UG and HC? It may be worth investigating as it would probably catch cases like this.

Thanks

• Redwood City, CA, USAMember Posts: 17

I also see this problem crop up in the version I am using which unfortunately is still version 2012.1-2.2.16-2-g9513dd0 since that is the one that is only available for us commercial customers...Can someone indicate if this has been fixed and in which version? The commercial release is due out and I am sure it is some very old version again, so want to make sure we get a fixed version...Not sure why I should bother with the commercial version if the only real advantage is HaplotypeCaller and that was buggy in release 1

This is what I see:

 0/1:0,37:99:2790,0,148 
This was the original line
 17 63534781 . C A 2761.77 . AC=1;AF=0.500;AN=2;ActiveRegionSize=295;DP=39;EVENTLENGTH=0;FS=0.000;HaplotypeScore=40.9590;MLEAC=1;MLEAF=0.500;MQ=53.23;MQ0=0;NVH=5;NumHapAssembly=12;NumHapEval=3;QD=70.81;QDE=14.16;TYPE=SNP;extType=SNP GT:AD:GQ:PL 0/1:0,37:99:2790,0,148 `

Thanks,

Thon