Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Problems generating phased haplotypes with HaplotypeCaller
I am trying to use the HaplotypeCaller to phase alleles for nuclear genes in a diploid plant species. I have reads from one species mapped to the reference genome from another closely related species. What I want to end up with are contiguous sequences for separate alleles that I can use for phylogenetic analysis.
I have generated a gvcf file using the --emitRefConfidence BP_RESOLUTION option. In the output, some of the bases appear phased, but I don't really understand some of the results. For example, for some positions the phasing seems to refer to the same base. Here is an example:
scaffold11251 3338 . A G, 1549.77 . DP=39;MLEAC=2,0;MLEAF=1.00,0.00;MQ=42.18 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,39,0:39:99:0|1:3338_A_G:1578,117,0,1578,117,1578:0,0,16,23
Also, in this case, the genotype is 1/1 so I don't understand why this position appears to be phased.
I have also looked at the bam output format. I have tried setting options so that I only get a maximum of 2 haplotypes (I read this post http://gatkforums.broadinstitute.org/discussion/4527/can-i-retrieve-the-haplotypes-assembled-from-haplotypecaller) using this option:
but for most active regions I still get more than 2 haplotypes (most often 3 or 4, but sometimes 5).
I've read through the HaplotypeCaller documentation + the description of the gvcf format but I still don't fully understand how to interpret the results or successfully generate the output I need, i.e. contiguous sequences of separate phased alleles.
I would be very grateful of any advice!