It looks like you're new here. If you want to get involved, click one of these buttons!
I've been experiencing some apparent errors with HaplotypeCaller that I think could be related to how it chooses candidate haplotypes when performing multi-sample calling. Please see the example files I've uploaded to the server (cooketho_20130103.tar.gz). For instance if you look at position 3511 in sample 2, there are 14 non-reference reads and 0 reference reads. When HaplotypeCaller is run with just this sample, it calls this locus homozygous non-reference, which seems to me to be the correct behavior. But when run with all 14 samples, it doesn't call a SNP at this locus. Repeating the run in debug mode shows that the (immediate) cause is that there were 11 candidate haplotypes found, and not a single one of them had the non-reference allele at position 3511. Why?
I came across an earlier post that suggested in some cases increasing the --minPruning value can be of use, but I tried this to no avail.
http://gatkforums.broadinstitute.org/discussion/1764/haplotypecaller-in-cohorts
My organism is a plant, and is is considerably more heterozygous than human, but changing the --heterozygosity value did not appear to help either. Double check me on this if you like.
Can you please suggest a fix, or perhaps release some documentation on how HaplotypeCaller selects candidate haplotypes?
P.S. Any idea of when the source will be released to the public, or when a more comprehensive manual will be released? Would be very helpful for figuring out what is going on in cases like this.
Thanks! Tom
rpoplin
Posts: 92 mod
Hi Tom,
Thank you for the beautifully laid out examples files. Very easy to follow. You are exactly correct that the problem is that the algorithm we use to choose candidate haplotypes isn't really expecting to see a population with heterozygosity as high as you have here. We are in the middle of rewriting this piece of the code to be more general but in the mean time I've added for the next release a new command line argument that will help you. One of the blocking issues for you here is that there is a hard limit to the maximum number of haplotypes that the program will consider and this hard limit was chosen with human populations in mind. By changing that limit with the new --maxNumHaplotypesInPopulation command line argument that I just added I was able to call the mutations in your example bam with this command in our latest internal development version:
java -Xmx10G -jar ${gatk}/GenomeAnalysisTK.jar \
-T HaplotypeCaller \
-R scaffold_1.masked.fa \
-I example.bam \
-L scaffold_1:3500-4000 \
-maxNumHaplotypesInPopulation 20 \
-minPruning 2 \
--debug \
-o hc.vcf > debug.txt
Cheers,
Geraldine_VdAuwera
Posts: 2,239 admin
Answers
Thanks Ryan! When will a version of GATK be available that has this new option? Would it be possible to send me a copy of the internal development version in the interim--just for testing purposes? Tom
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •