Using GATK haplotype caller and SHAPEIT genotype likelihoods
I have a question about the SHAPEIT genotype calling workflow, which uses GATK, BEAGLE, and SHAPEIT consecutively. I ran the GATK HaplotypeCaller, which produced .vcf files with a ‘PL’ (normalised Phred-scaled likelihood) column and no ‘GL’ (genotype log10 likelihood) column. BEAGLE can handle PL as input, and the initial genotype calling works fine.
The next step would be to use SHAPEIT for phasing, as described here: [https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html#gcall]
There is a C++ script described on this page, called prepareGenFromBeagle4, which will convert the BEAGLE .vcf files to the correct SHAPEIT input. However, this script looks for a ‘GL’ column in the .vcf file, which is unavailable since GATK only outputs ‘PL’. The script therefore crashes and I cannot proceed to the phasing step.
I have considered and tested a simple conversion script from PL to GL, where
GL = PL/-10
However, since the PL is normalised (genotype with highest likelihood is set to 0), there is some loss of information here.
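A minimal sketch of the conversion GL = PL/-10 in Python, assuming the PL field is a comma-separated string of three integers as it appears in the sample column of the .vcf file. The function name `pl_to_gl` is hypothetical, and missing values (".") are not handled.

```python
def pl_to_gl(pl_field):
    """Convert a VCF PL string such as "0,10,100" to a list of log10 likelihoods."""
    # GL = PL / -10; adding 0.0 turns the -0.0 of the best genotype into 0.0.
    return [int(pl) / -10 + 0.0 for pl in pl_field.split(",")]


print(pl_to_gl("0,10,100"))
```

The converted values keep whatever normalisation and integer rounding the PLs already carry; the conversion itself cannot restore them.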
For example, for the genotypes AA/AB/BB, suppose the original GLs (which are not in the GATK output) were
-0.1 / -1 / -10
The corresponding PLs should be
1 / 10 / 100
And the normalised PLs (GATK output) would be
0 / 10 / 100
Giving rise to these converted GLs after the simple conversion
0 / -1 / -10
The converted GLs are then used to calculate the genotype probabilities required for the SHAPEIT input. The issue is that the three genotype probabilities in the SHAPEIT input need to add up to 1.00, so the GL values are somehow rescaled when the probabilities are calculated. The final genotype probabilities in the SHAPEIT input file would therefore turn out differently if I used the 'original' GL values (which I do not have) rather than the converted GL values. I am afraid this will introduce bias into the genotype probabilities used by SHAPEIT.
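To get a feel for the size of this effect, here is a minimal sketch, assuming the genotype probabilities are obtained by rescaling 10^GL so that the three values sum to 1. Whether prepareGenFromBeagle4 does exactly this is not confirmed here, and `gl_to_probs` is a hypothetical helper. It compares the 'original' GLs from the example with the converted ones, and with the original GLs shifted by a constant.

```python
def gl_to_probs(gls):
    """Rescale log10 likelihoods to probabilities that sum to 1."""
    best = max(gls)
    # Subtracting the maximum first avoids underflow and does not change the result.
    weights = [10 ** (gl - best) for gl in gls]
    total = sum(weights)
    return [w / total for w in weights]


original = [-0.1, -1.0, -10.0]           # 'original' GLs from the example (not in the GATK output)
converted = [0.0, -1.0, -10.0]           # GLs converted from the PLs 0 / 10 / 100
shifted = [gl + 0.1 for gl in original]  # original GLs with a constant offset applied

print(gl_to_probs(original))
print(gl_to_probs(converted))
print(gl_to_probs(shifted))
```

Under this assumption the first and third results agree, because a constant offset cancels in the rescaling. The second result differs because in the example the PLs 1 / 10 / 100 became 0 / 10 / 100 rather than being shifted uniformly. If the normalisation instead subtracts the smallest PL from all three values (which would give 0 / 9 / 99 here), only the integer rounding of the PLs would remain as a source of difference.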
Apologies for the long post. I would be grateful to hear your thoughts on this issue. Have you used GATK and SHAPEIT consecutively before and run into this problem? Is there a reason to be wary of the potential bias here?
Many thanks in advance,