Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Haplotype depends on kmer size - how to get just 2 haplos from haplotypeCaller

bdorseybdorsey Huntington Botanical GardensMember

Hello,
I am using HaplotypeCaller in order to get haplotype sequences from individual samples (several samples per species) for gene tree/species tree analysis. The reads are from an exome capture experiment. Because I am running individual samples I have limited the max # of haplotypes to 2. However, the default behavior of using two kmer size (10 and 25) results in up to four haplotypes per exon (interval) in the bamout file. I have found that if I supply a kmerSize parameter I get only 2 haplotypes but these differ depending on the kmer I supply. The difference is not only subsetting of the snps found with multiple kmer sizes but distinct snps called with different kmer sizes as well. I would like to run the analysis with multiple kmerSizes specified and have the caller only output the two most likely haplotypes. Is this possible and, if so, how can I do it? Or, am I misunderstanding how the caller works?

I think I understand why different kmer sizes would result in different snps called but if anyone could explain it to me I'd love confirmation.

Here is my original command line before experimenting with kmer sizes:
java -jar /opt/local/NGS/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T HaplotypeCaller -R /Users/bdorsey/Documents/Dioon/Capture_seqs_assembly/captured_seqs_uniq.fa -I /Volumes/HD2/Capture_assembly/Dioon1/contigs/Dioon1_m1n350r.10x.sp5.bam -L /Volumes/HD2/Capture_assembly/Dioon1/exonsCov10sp5.list --activeRegionIn /Volumes/HD2/Capture_assembly/Dioon1/exonsCov10sp5.list --maxNumHaplotypesInPopulation 2 --minReadsPerAlignmentStart 5 -out_mode EMIT_ALL_SITES -ERC BP_RESOLUTION --forceActive --dontTrimActiveRegions --activeRegionMaxSize 10000 -bamWriterType CALLED_HAPLOTYPES --disableOptimizations -bamout /Volumes/HD2/Capture_assembly/Dioon1/haplo/Dioon1.haplos.bam -o /Volumes/HD2/Capture_assembly/Dioon1/haplo/Dioon1.haplos.g.vcf

Thanks very much for any help.
Cheers,
Brian D

Answers

  • SheilaSheila Broad InstituteMember, Broadie admin

    @bdorsey
    Hi Brian D,

    It is definitely possible for Haplotype Caller to get different haplotypes depending on the kmer size. Have a look at this article for more information on how the graph of haplotypes is built: https://www.broadinstitute.org/gatk/guide/article?id=4146

    Does the kmer size produce a different set of haplotypes for every single interval or just some intervals? We have noticed in regions of repeats, the kmer size can affect the graph output.

    -Sheila

  • bdorseybdorsey Huntington Botanical GardensMember

    Hi Sheila,
    Thanks for the response. I've checked a set of my intervals (exons) and it doesn't seem that the alternate haplotypes are occurring at repeat regions. I re-read the article you referenced but I still have a question or two. While the max# of haplotypes per kmerSize (and therefore per graph) can be set - which I set to 2 - there doesn't seem to be a way to only output the 2 best supported haplotypes from the union of those from the two graphs. In other words, I sometimes get four haplotypes when I run with default kmer sizes of 10 and 25 (2 from each graph) but I would like to be able to choose the two most likely from this set of four. Since the caller keeps track of the likelihood of each haplotype, given the reads, is it possible to ask it to choose the two most likely? The bamout file does not include any measure of support for the haplotypes so I don't know how I could choose myself. Lastly, could you please confirm my understanding that a larger kmer size will be more specific and therefor more conservative when inferring haplotypes?

    Thanks much!
    Brian

  • SheilaSheila Broad InstituteMember, Broadie admin

    @bdorsey
    Hi Brian,

    You are correct that if you set the -maxHaplotypes to 2, the 2 most likely haplotypes will be chosen. A larger kmer size will be more specific and also more conservative.

    However, I am curious what you bam file looks like to be having such variation depending on the kmer size. Can you post an IGV screenshot of some regions that produce different haplotypes depending on kmer size? Also, please post the bamout file of the regions as well.

    Thanks,
    Sheila

  • bdorseybdorsey Huntington Botanical GardensMember

    Hi Sheila,
    I am not sure that I explained my results well enough. I understand that setting -maxHaplotypes to 2 is supposed to output only the 2 most likely haplotypes but that is not what I am getting when using the default kmer settings (i.e. k=10 and then k=25). It seems that the program chooses the 2 most likely haplos from each graph, not the two most likely from all haplos resulting from the two graphs. This is easy to see in the screen shots I've posted. In each, the track labeled Dioon1_haplos.gvcf.bam is from the default settings. The txt file below has the bamout lines from the default analysis. You can see that there are four haplytypes in these. The other tracks are from runs with only one kmer size specified, resulting in two haplotypes. You will also notice the haplotypes from each single kmer size track differ and that they are not just subsets of the 2-kmer haplos but distinct at some loci. Of course, each track also has the reference sequence as well. I've included the original bam file track in two of the images as well (Dioon1_m1n350.bam).

    I really appreciate your help with this question. Please let me know if I can provide more info.

    Best,
    Brian

  • SheilaSheila Broad InstituteMember, Broadie admin

    @bdorsey
    Hi Brian,

    Sorry for the late response. Yes, I see what you are asking. For each kmer size you input, the most likely haplotype on the forward and reverse strands are given. The reference is always output. Most of the cases have either 5 haplotypes (2 kmers input) or 3 haplotypes (1 kmer input). The Dioon1.kmer35.bam only has two haplotypes, because one of the most likely haplotypes is the reference.

    I hope this answers your question.

    Also, because there is such low depth at the sites you are looking at, it is better to add -minPruning 1 and -minDanglingBranchLength 1. With these extra parameters, you may not need to play around with kmer size. They will help to recover any potential variants that may otherwise be lost. https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--minPruning
    https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php#--minDanglingBranchLength

    -Sheila

  • NadiaNadia Member

    I am calling Haplotypes in a family of two parents and their progeny (F1 in plants). Both parents are heterozygous, so should the --maxNumHaplotypesInPopulation be 4?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @Nadia,

    That is technically correct but it's not necessary to set this parameter unless you really need to optimize performance.

Sign In or Register to comment.