We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Using GATK haplotype caller and SHAPEIT genotype likelihoods

AnniqueCAnniqueC Groningen, NetherlandsMember
edited March 2016 in Ask the GATK team

Dear all,

I have a question regarding the use of the SHAPEIT genotype calling by consecutive use of GATK, BEAGLE, and SHAPEIT. I have used the GATK haplotype caller, giving rise to an output of .vcf files with a ‘PL’ (normalised phred-scaled likelihood) column, and no ‘GL’ (genotype log10 likelihood) column. BEAGLE can handle PL as input, and the initial genotype calling works fine.

The next step would be to use SHAPEIT for phasing, as described here: [https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#gcall]

There is a C++ script described on this page, called prepareGenFromBeagle4, which will convert the BEAGLE .vcf files to the correct SHAPEIT input. However, this script looks for a ‘GL’ column in the .vcf file, which is unavailable since GATK only outputs ‘PL’. The script therefore crashes and I cannot proceed to the phasing step.

I have considered and tested a simple conversion script from PL to GL, where
GL = PL/-10
However, since the PL is normalised (genotype with highest likelihood is set to 0), there is some loss of information here.

For example for a given genotype AA/AB/BB, if the original GLs (these are not in the GATK output but say they were) were
-0.1 / -1 / -10
The corresponding PLs should be
1 / 10 / 100
And the normalised PLs (GATK output) would be
0 / 10 / 100
Giving rise to these converted GLs after the simple conversion
0 / -1 / -10

The converted GLs are used to then calculate genotype probabilities required for the SHAPEIT input. The issue that I have, is that all three genotype probabilities in the SHAPEIT input need to add up to 1.00 in total. The GL values are somehow scaled to calculate the genotype probabilities. Therefore, the final genotype probabilities in the SHAPEIT input file would turn out differently if I used the 'original' GL values (which I do not have) in comparison to the converted GL values. I am afraid this will introduce bias into the genotype probabilities used by SHAPEIT.

Apologies for the long post, I would be grateful hearing your thoughts on this issue. Have you used GATK and SHAPEIT consecutively before and run into this problem? Is there a reason to be weary of the potential bias here?

Many thanks in advance,
Annique Claringbould

Issue · Github
by Sheila

Issue Number
Last Updated
Closed By


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Annique,

    We've never used SHAPEIT so we can't comment on this directly. But it sounds like what it produces is physical phasing. If you follow our Best Practices pipeline you already get physical phasing from HaplotypeCaller so you don't need to run SHAPEIT as far as I can tell. What are you trying to get out of it?

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @AnniqueC I believe you are working with low coverage data, which necessitates refinement and phasing after calling, right?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Tommy, you're alive! Haven't seen you around in a bit.

    It's true that we are not used to dealing with very low coverage data so our recommendations may not apply. Tommy, if you can share some insights from your own experience that's appreciated as always.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Geraldine_VdAuwera I am very much alive and I very much miss being part of the GATK forum. Things very busy and I've had to just click "mark all read" on my RSS feed. I might be back with lots of RNAseq questions in the near future...

    I recently pushed a slow mini Python script for doing the PL to GL conversion to GitHub:

    This functionality doesn't seem to have been added to the tag2tag plugin of bcftools yet:

    @AnniqueC I haven't checked the source code for prepareGenFromBeagle4, but I suspect you will end up with almost identical probabilities in your gen file for the GLs -0.1,-1,-10 and 0,-1,-10. Have you tried it?

    Please note that the SHAPEIT website says this about prepareGenFromBeagle4:

    Moreover, we used 0.995 as the threshold meaning that all genotypes with a posterior above 0.995 are directly fixed and will only need phasing in the SHAPEIT step.

    Furthermore the default threshold of SHAPEIT2 (--input-thr) is 0.995 according to the website, so it shouldn't be an issue.

    I hope this is somewhat useful and has made you more calm. Should I be more nervous? Have I overlooked something?

  • AnniqueCAnniqueC Groningen, NetherlandsMember

    Dear both,

    After reading the following post

    I realised that in my example above, I made a mistake. The normalised PL scores would not be 0/10/100, but rather 0/9/99. This solves the issue of scalability to later genotype probabilities, so indeed, a simple conversion from GL to PL should be sufficient.

    Thanks for your replies and hope that this clears things up.


Sign In or Register to comment.