Problems generating phased haplotypes with HaplotypeCaller

Laura_KellyLaura_Kelly UKMember
edited October 2015 in Ask the GATK team

Hello

I am trying to use the HaplotypeCaller to phase alleles for nuclear genes in a diploid plant species. I have reads from one species mapped to the reference genome from another closely related species. What I want to end up with are contiguous sequences for separate alleles that I can use for phylogenetic analysis.

I have generated a gvcf file using the --emitRefConfidence BP_RESOLUTION option. In the output, some of the bases appear phased, but I don't really understand some of the results. For example, for some positions the phasing seems to refer to the same base. Here is an example:
scaffold11251 3338 . A G, 1549.77 . DP=39;MLEAC=2,0;MLEAF=1.00,0.00;MQ=42.18 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,39,0:39:99:0|1:3338_A_G:1578,117,0,1578,117,1578:0,0,16,23

Also, in this case, the genotype is 1/1 so I don't understand why this position appears to be phased.

I have also looked at the bam output format. I have tried setting options so that I only get a maximum of 2 haplotypes (I read this post http://gatkforums.broadinstitute.org/discussion/4527/can-i-retrieve-the-haplotypes-assembled-from-haplotypecaller) using this option:
-maxNumHaplotypesInPopulation 2

but for most active regions I still get more than 2 haplotypes (most often 3 or 4, but sometimes 5).

I've read through the HaplotypeCaller documentation + the description of the gvcf format but I still don't fully understand how to interpret the results or successfully generate the output I need, i.e. contiguous sequences of separate phased alleles.

I would be very grateful of any advice!

Laura

Best Answer

Answers

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Laura_Kelly
    Hi Laura,

    Can you please post the surrounding variant records? Specifically, the PID field is the phase identifier that all the same phased variants will have. If you can post all the variant records with the same PID, it would be really helpful. Can you also post the bamout file at those sites?

    Thanks,
    Sheila

  • Laura_KellyLaura_Kelly UKMember

    Hi Shelia

    Thanks for your reply. I've uploaded the vcf and bam files - is this what you need?

    Laura

  • Laura_KellyLaura_Kelly UKMember

    Hi Sheila

    I have tried a couple of times to add a comment with the vcf and bam info in the body of the text in case there is a problem with the files I uploaded yesterday, but my comment is not showing up. The bam info was a link to an image file I uploaded, so maybe that is causing a problem?

    Please let me know if need me to provide more info!

    Many thanks
    Laura

  • Laura_KellyLaura_Kelly UKMember

    Hi Sheila

    I am still having problems with physical phasing of haplotypes in HaplotypeCaller. It seems that most of the sites that are being phased in the genes I am analysing are 1/1 variants. Here is another example, these are all the lines for PID '3929_C_T':
    scaffold11251 3923 . C . . . GT:AD:DP:GQ:PL 0/0:29,0:29:75:0,75,1125
    scaffold11251 3924 . T . . . GT:AD:DP:GQ:PL 0/0:29,0:29:75:0,75,1125
    scaffold11251 3925 . C . . . GT:AD:DP:GQ:PL 0/0:29,0:29:75:0,75,1125
    scaffold11251 3926 . A . . . GT:AD:DP:GQ:PL 0/0:29,0:29:69:0,69,1035
    scaffold11251 3927 . G . . . GT:AD:DP:GQ:PL 0/0:27,0:27:69:0,69,1035
    scaffold11251 3928 . A . . . GT:AD:DP:GQ:PL 0/0:27,0:27:69:0,69,1035
    scaffold11251 3929 . C T, 951.77 . DP=26;MLEAC=2,0;MLEAF=1.00,0.00;MQ=43.13 GT:AD:DP:GQ:PGT:PID:PL:SB 1/1:0,26,0:26:78:0|1:3929_C_T:980,78,0,980,78,980:0,0,12,14
    scaffold11251 3930 . A . . . GT:AD:DP:GQ:PL 0/0:27,0:27:69:0,69,1035
    scaffold11251 3931 . T . . . GT:AD:DP:GQ:PL 0/0:26,0:26:69:0,69,1035
    scaffold11251 3932 . T . . . GT:AD:DP:GQ:PL 0/0:27,0:27:72:0,72,1080
    scaffold11251 3933 . G . . . GT:AD:DP:GQ:PL 0/0:27,0:27:72:0,72,1080
    scaffold11251 3934 . A . . . GT:AD:DP:GQ:PL 0/0:26,0:26:72:0,72,1080
    scaffold11251 3935 . A . . . GT:AD:DP:GQ:PL 0/0:26,0:26:72:0,72,1080
    scaffold11251 3936 . T . . . GT:AD:DP:GQ:PL 0/0:27,0:27:75:0,75,1125
    scaffold11251 3937 . A . . . GT:AD:DP:GQ:PL 0/0:28,0:28:78:0,78,1170
    scaffold11251 3938 . C . . . GT:AD:DP:GQ:PL 0/0:28,0:28:78:0,78,1170
    scaffold11251 3939 . A . . . GT:AD:DP:GQ:PL 0/0:29,0:29:81:0,81,1215
    scaffold11251 3940 . T . . . GT:AD:DP:GQ:PL 0/0:28,0:28:81:0,81,1215
    scaffold11251 3941 . T . . . GT:AD:DP:GQ:PL 0/0:28,0:28:81:0,81,1215
    scaffold11251 3942 . T . . . GT:AD:DP:GQ:PL 0/0:25,1:26:60:0,60,939
    scaffold11251 3943 . T . . . GT:AD:DP:GQ:PL 0/0:27,0:27:78:0,78,1170
    scaffold11251 3944 . T . . . GT:AD:DP:GQ:PL 0/0:28,0:28:81:0,81,1215
    scaffold11251 3945 . G . . . GT:AD:DP:GQ:PL 0/0:27,1:28:42:0,42,987
    scaffold11251 3946 . A . . . GT:AD:DP:GQ:PL 0/0:28,0:28:78:0,78,1170
    scaffold11251 3947 . A . . . GT:AD:DP:GQ:PL 0/0:28,0:28:78:0,78,1170
    scaffold11251 3948 . G . . . GT:AD:DP:GQ:PL 0/0:28,0:28:78:0,78,1170
    scaffold11251 3949 . T . . . GT:AD:DP:GQ:PL 0/0:29,0:29:81:0,81,1215
    scaffold11251 3950 . A . . . GT:AD:DP:GQ:PL 0/0:29,0:29:78:0,78,1170
    scaffold11251 3951 . G . . . GT:AD:DP:GQ:PL 0/0:29,0:29:78:0,78,1170
    scaffold11251 3952 . T . . . GT:AD:DP:GQ:PL 0/0:28,0:28:78:0,78,1170
    scaffold11251 3953 . A . . . GT:AD:DP:GQ:PL 0/0:28,0:28:78:0,78,1170
    scaffold11251 3954 . T . . . GT:AD:DP:GQ:PL 0/0:28,0:28:72:0,72,1080
    scaffold11251 3955 . T . . . GT:AD:DP:GQ:PL 0/0:28,0:28:72:0,72,1080
    scaffold11251 3956 . C . . . GT:AD:DP:GQ:PL 0/0:27,0:27:72:0,72,1080
    scaffold11251 3957 . A . . . GT:AD:DP:GQ:PL 0/0:27,0:27:69:0,69,1035
    scaffold11251 3958 . T . . . GT:AD:DP:GQ:PL 0/0:27,0:27:69:0,69,1035
    scaffold11251 3959 . G . . . GT:AD:DP:GQ:PL 0/0:27,0:27:69:0,69,1035
    scaffold11251 3960 . A . . . GT:AD:DP:GQ:PL 0/0:26,0:26:69:0,69,1035
    scaffold11251 3961 . A . . . GT:AD:DP:GQ:PL 0/0:27,0:27:72:0,72,1080
    scaffold11251 3962 . A . . . GT:AD:DP:GQ:PL 0/0:28,0:28:75:0,75,1125
    scaffold11251 3963 . G . . . GT:AD:DP:GQ:PL 0/0:28,0:28:75:0,75,1125
    scaffold11251 3964 . G . . . GT:AD:DP:GQ:PL 0/0:27,0:27:75:0,75,1125
    scaffold11251 3965 . A . . . GT:AD:DP:GQ:PL 0/0:28,0:28:75:0,75,1125
    scaffold11251 3966 . G . . . GT:AD:DP:GQ:PL 0/0:28,0:28:75:0,75,1125
    scaffold11251 3967 . A . . . GT:AD:DP:GQ:PL 0/0:29,0:29:78:0,78,1170
    scaffold11251 3968 . T . . . GT:AD:DP:GQ:PL 0/0:29,0:29:78:0,78,1170
    scaffold11251 3969 . G . . . GT:AD:DP:GQ:PL 0/0:29,0:29:78:0,78,1170
    scaffold11251 3970 . A . . . GT:AD:DP:GQ:PL 0/0:28,0:28:78:0,78,1170
    scaffold11251 3971 . A . . . GT:AD:DP:GQ:PL 0/0:28,0:28:78:0,78,1170
    scaffold11251 3972 . A . . . GT:AD:DP:GQ:PL 0/0:28,0:28:78:0,78,1170
    scaffold11251 3973 . T . . . GT:AD:DP:GQ:PL 0/0:28,1:29:69:0,70,1040
    scaffold11251 3974 . A . . . GT:AD:DP:GQ:PL 0/0:28,1:29:69:0,70,1030
    scaffold11251 3975 . C . . . GT:AD:DP:GQ:PL 0/0:29,0:29:87:0,87,1137
    scaffold11251 3976 . T . . . GT:AD:DP:GQ:PL 0/0:29,0:29:86:0,87,1107
    scaffold11251 3977 . C . . . GT:AD:DP:GQ:PL 0/0:29,0:29:86:0,87,1076
    scaffold11251 3978 . T . . . GT:AD:DP:GQ:PL 0/0:29,0:29:84:0,84,1260
    scaffold11251 3979 . T . . . GT:AD:DP:GQ:PL 0/0:29,0:29:84:0,84,1260
    scaffold11251 3980 . C . . . GT:AD:DP:GQ:PL 0/0:29,0:29:84:0,84,1260
    scaffold11251 3981 . T . . . GT:AD:DP:GQ:PL 0/0:28,1:29:74:0,75,1031
    scaffold11251 3982 . T . . . GT:AD:DP:GQ:PL 0/0:29,0:29:81:0,81,1215
    scaffold11251 3983 . T . . . GT:AD:DP:GQ:PL 0/0:28,1:29:74:0,75,1096
    scaffold11251 3984 . A . . . GT:AD:DP:GQ:PL 0/0:26,1:27:68:0,69,974
    scaffold11251 3985 . G . . . GT:AD:DP:GQ:PL 0/0:29,0:29:81:0,81,1215
    scaffold11251 3986 . G A, 470.77 . BaseQRankSum=0.268;ClippingRankSum=-0.708;DP=27;MLEAC=1,0;MLEAF=0.500,0.00;MQ=43.37;MQRankSum=1.195;ReadPosRankSum=0.024 GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:12,15,0:27:99:0|1:3929_C_T:499,0,367,535,412,947:6,6,11,4
    scaffold11251 3987 . G . . . GT:AD:DP:GQ:PL 0/0:30,0:30:81:0,81,1215
    scaffold11251 3988 . T . . . GT:AD:DP:GQ:PL 0/0:31,0:31:78:0,78,1170
    scaffold11251 3989 . A . . . GT:AD:DP:GQ:PL 0/0:30,0:30:78:0,78,1170
    scaffold11251 3990 . T . . . GT:AD:DP:GQ:PL 0/0:29,1:30:72:0,73,1117
    scaffold11251 3991 . G . . . GT:AD:DP:GQ:PL 0/0:30,0:30:78:0,78,1170
    scaffold11251 3992 . A . . . GT:AD:DP:GQ:PL 0/0:29,0:29:78:0,78,1170
    scaffold11251 3993 . T . . . GT:AD:DP:GQ:PL 0/0:30,0:30:81:0,81,1215
    scaffold11251 3994 . T . . . GT:AD:DP:GQ:PL 0/0:32,0:32:87:0,87,1305
    scaffold11251 3995 . T . . . GT:AD:DP:GQ:PL 0/0:32,0:32:87:0,87,1305

    There are only two variants for this haplotype block and only one of them is heterozygous.

    I would be very grateful for any further advice you could give me on using HaplotypeCaller for physical phasing.

    Many thanks
    Laura

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Laura, the phasing that HC emits in the GVCF is not meant to be interpreted directly, it's only an intermediate notation that will be finalized appropriately when you run GenotypeGVCFs.

  • Laura_KellyLaura_Kelly UKMember

    Hi Geraldine

    Thanks for your reply. I actually only have one sample that I have mapped to my reference, so I didn't plan to run the GenotypeGVCFs tool. What I am trying to do is map reads from one species to the reference of another and then extract phased sequences for separate alleles. I would like to extract contiguous sequences, rather than just the variants. I thought that this might be possible by physically phasing variants with HaplotypeCaller. Are you saying that it is not possible to do this?

    Also, I thought that is would be possible to select the most likely haplotypes for each active region from the bam output. However, as I mentioned in my original post, when I try to set the options to only select the two most likely haplotypes for each active region I still end up with additional possible haplotypes for each active region (e.g. 5 or more for some regions).

    Many thanks
    Laura

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @Laura_Kelly
    Hi Laura,

    The intermediate BP_RESOLUTION GVCF is not meant to be used as the final analysis file. Even if you have one sample, you need to run GenotypeGVCFs to get the final VCF. Once you get the final VCF, can you please post the records for positions 3986 and 3929? Please also post IGV screenshots of the region from the original bam file and bamout file. I cannot view the bam file you posted because I do not have the reference.

    Thanks,
    Sheila

  • Laura_KellyLaura_Kelly UKMember

    Dear Sheila and Geraldine

    Thanks for your help. I have run the GentoypeGVCFs tool and here are the records for positions 3929 & 3986 (+ surrounding lines):
    scaffold11251 3922 . C . . . AN=2;DP=31 GT:AD:DP:RGQ 0/0:31:31:78
    scaffold11251 3923 . C . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:75
    scaffold11251 3924 . T . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:75
    scaffold11251 3925 . C . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:75
    scaffold11251 3926 . A . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:69
    scaffold11251 3927 . G . . . AN=2;DP=27 GT:AD:DP:RGQ 0/0:27:27:69
    scaffold11251 3928 . A . . . AN=2;DP=27 GT:AD:DP:RGQ 0/0:27:27:69
    scaffold11251 3929 . C T 951.77 . AC=2;AF=1.00;AN=2;DP=26;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=43.13;QD=29.09;SOR=0.846 GT:AD:DP:GQ:PGT:PID:PL 1/1:0,26:26:78:1|1:3929_C_T:980,78,0
    scaffold11251 3930 . A . . . AN=2;DP=27 GT:AD:DP:RGQ 0/0:27:27:69
    scaffold11251 3931 . T . . . AN=2;DP=26 GT:AD:DP:RGQ 0/0:26:26:69
    scaffold11251 3932 . T . . . AN=2;DP=27 GT:AD:DP:RGQ 0/0:27:27:72
    scaffold11251 3933 . G . . . AN=2;DP=27 GT:AD:DP:RGQ 0/0:27:27:72
    scaffold11251 3934 . A . . . AN=2;DP=26 GT:AD:DP:RGQ 0/0:26:26:72
    scaffold11251 3935 . A . . . AN=2;DP=26 GT:AD:DP:RGQ 0/0:26:26:72
    scaffold11251 3936 . T . . . AN=2;DP=27 GT:AD:DP:RGQ 0/0:27:27:75
    scaffold11251 3937 . A . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:78
    scaffold11251 3938 . C . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:78
    scaffold11251 3939 . A . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:81
    scaffold11251 3940 . T . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:81
    scaffold11251 3941 . T . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:81
    scaffold11251 3942 . T . . . AN=2;DP=26 GT:AD:DP:RGQ 0/0:25:26:60
    scaffold11251 3943 . T . . . AN=2;DP=27 GT:AD:DP:RGQ 0/0:27:27:78
    scaffold11251 3944 . T . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:81
    scaffold11251 3945 . G . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:27:28:42
    scaffold11251 3946 . A . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:78
    scaffold11251 3947 . A . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:78
    scaffold11251 3948 . G . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:78
    scaffold11251 3949 . T . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:81
    scaffold11251 3950 . A . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:78
    scaffold11251 3951 . G . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:78
    scaffold11251 3952 . T . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:78
    scaffold11251 3953 . A . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:78
    scaffold11251 3954 . T . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:72
    scaffold11251 3955 . T . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:72
    scaffold11251 3956 . C . . . AN=2;DP=27 GT:AD:DP:RGQ 0/0:27:27:72
    scaffold11251 3957 . A . . . AN=2;DP=27 GT:AD:DP:RGQ 0/0:27:27:69
    scaffold11251 3958 . T . . . AN=2;DP=27 GT:AD:DP:RGQ 0/0:27:27:69
    scaffold11251 3959 . G . . . AN=2;DP=27 GT:AD:DP:RGQ 0/0:27:27:69
    scaffold11251 3960 . A . . . AN=2;DP=26 GT:AD:DP:RGQ 0/0:26:26:69
    scaffold11251 3961 . A . . . AN=2;DP=27 GT:AD:DP:RGQ 0/0:27:27:72
    scaffold11251 3962 . A . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:75
    scaffold11251 3963 . G . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:75
    scaffold11251 3964 . G . . . AN=2;DP=27 GT:AD:DP:RGQ 0/0:27:27:75
    scaffold11251 3965 . A . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:75
    scaffold11251 3966 . G . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:75
    scaffold11251 3967 . A . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:78
    scaffold11251 3968 . T . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:78
    scaffold11251 3969 . G . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:78
    scaffold11251 3970 . A . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:78
    scaffold11251 3971 . A . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:78
    scaffold11251 3972 . A . . . AN=2;DP=28 GT:AD:DP:RGQ 0/0:28:28:78
    scaffold11251 3973 . T . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:28:29:69
    scaffold11251 3974 . A . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:28:29:69
    scaffold11251 3975 . C . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:87
    scaffold11251 3976 . T . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:86
    scaffold11251 3977 . C . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:86
    scaffold11251 3978 . T . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:84
    scaffold11251 3979 . T . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:84
    scaffold11251 3980 . C . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:84
    scaffold11251 3981 . T . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:28:29:74
    scaffold11251 3982 . T . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:81
    scaffold11251 3983 . T . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:28:29:74
    scaffold11251 3984 . A . . . AN=2;DP=27 GT:AD:DP:RGQ 0/0:26:27:68
    scaffold11251 3985 . G . . . AN=2;DP=29 GT:AD:DP:RGQ 0/0:29:29:81
    scaffold11251 3986 . G A 470.77 . AC=1;AF=0.500;AN=2;BaseQRankSum=0.268;ClippingRankSum=-7.080e-01;DP=27;FS=5.907;MLEAC=1;MLEAF=0.500;MQ=43.37;MQRankSum=1.20;QD=17.44;ReadPosRankSum=0.024;SOR=1.911 GT:AD:DP:GQ:PGT:PID:PL 0/1:12,15:27:99:0|1:3929_C_T:499,0,367
    scaffold11251 3987 . G . . . AN=2;DP=30 GT:AD:DP:RGQ 0/0:30:30:81
    scaffold11251 3988 . T . . . AN=2;DP=31 GT:AD:DP:RGQ 0/0:31:31:78
    scaffold11251 3989 . A . . . AN=2;DP=30 GT:AD:DP:RGQ 0/0:30:30:78
    scaffold11251 3990 . T . . . AN=2;DP=30 GT:AD:DP:RGQ 0/0:29:30:72
    scaffold11251 3991 . G . . . AN=2;DP=30 GT:AD:DP:RGQ 0/0:30:30:78

    I had a problem getting IGV to run, but I have uploaded screen shots from Tablet for the variants in the original input bam file and the bamout file with the selected haplotypes from HC. I used the options "--bamWriterType CALLED_HAPLOTYPES" and "-maxNumHaplotypesInPopulation 2" when generating the bam output file.

    Best wishes
    Laura

    Issue · Github
    by Sheila

    Issue Number
    297
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • ianwianw Bangor, UKMember

    Hi @Geraldine_VdAuwera and @Sheila,

    Can I belatedly follow up on this and ask why @Laura_Kelly was getting phasing for her homozygous non-reference site, please? I've been getting similar results. I also have a number of sites which, overall, across 200 samples, are called as SNPs but which, for individual samples, are recorded as 0/0 or ./. yet those samples' phasing information still record the site as 0|1. I hope that makes sense!

    Many thanks,

    Ian

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @ianw
    Hi Ian,

    Can you please post an example? I'm not sure I understand your question.

    Thanks,
    Sheila

  • ianwianw Bangor, UKMember

    Hi @Sheila

    Having looked again, I think I've solved the problem, so don't worry about it :smile: Thanks for getting back to me though!

    Cheers,

    Ian

Sign In or Register to comment.