Problems generating phased haplotypes with HaplotypeCaller
I am trying to use the HaplotypeCaller to phase alleles for nuclear genes in a diploid plant species. I have reads from one species mapped to the reference genome from another closely related species. What I want to end up with are contiguous sequences for separate alleles that I can use for phylogenetic analysis.
I have generated a gvcf file using the --emitRefConfidence BP_RESOLUTION option. In the output, some of the bases appear phased, but I don't really understand some of the results. For example, for some positions the phasing seems to refer to the same base. Here is an example:
scaffold11251 3338 . A G, 1549.77 . DP=39;MLEAC=2,0;MLEAF=1.00,0.00;MQ=42.18 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,39,0:39:99:0|1:3338_A_G:1578,117,0,1578,117,1578:0,0,16,23
Also, in this case, the genotype is 1/1 so I don't understand why this position appears to be phased.
I have also looked at the bam output format. I have tried setting options so that I only get a maximum of 2 haplotypes (I read this post http://gatkforums.broadinstitute.org/discussion/4527/can-i-retrieve-the-haplotypes-assembled-from-haplotypecaller) using this option:
but for most active regions I still get more than 2 haplotypes (most often 3 or 4, but sometimes 5).
I've read through the HaplotypeCaller documentation + the description of the gvcf format but I still don't fully understand how to interpret the results or successfully generate the output I need, i.e. contiguous sequences of separate phased alleles.
I would be very grateful of any advice!