Using CombineVariants to update GT fields with phasing information

steps372 (South Africa; Member)

Hi GATK Team

I was wondering whether you have developed a capability for updating GT fields with phasing information contained in another VCF. For instance, I generated phased genotypes using Beagle, and now simply want to update the GT fields of the unphased VCF (which contains valuable FORMAT annotations that are lost after running Beagle). So I want to keep the annotations and the phasing information in one VCF.

I tried to do this using CombineVariants (with prioritization); however, this only outputs the FORMAT fields of the prioritized VCF rather than merging them.

I realize you provide limited support to VCFs processed by non-GATK tools, but it would be helpful to know whether there is such capability that I am missing in the documentation.

Thanks for all your help!

Answers

  • steps372 (South Africa; Member)

    Just an update

    I managed to combine the information from the phased and unphased VCFs using a Python script. I realise you don't recommend passing GATK-generated VCFs through a custom Python parsing script, but this was the only option I could find. If you know of any alternatives, please let me know.
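
The script itself was not posted, so the following is only a minimal sketch of how such a merge could look, not the poster's actual code. It assumes that both VCFs list the samples in the same order, that records can be matched on CHROM, POS, REF and ALT, and that a phased genotype should only replace the original GT when it contains the same alleles (so calls changed or imputed by the phasing tool are left alone). The function names `load_phased_genotypes`, `same_alleles` and `merge_phase` are hypothetical.

```python
import re


def load_phased_genotypes(phased_lines):
    """Map (CHROM, POS, REF, ALT) -> list of GT strings, one per sample."""
    phased = {}
    for line in phased_lines:
        if line.startswith("#"):
            continue  # skip meta-information and header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 10:
            continue  # no sample columns
        fmt = cols[8].split(":")
        if "GT" not in fmt:
            continue
        gt_idx = fmt.index("GT")
        key = (cols[0], cols[1], cols[3], cols[4])
        phased[key] = [sample.split(":")[gt_idx] for sample in cols[9:]]
    return phased


def same_alleles(gt_a, gt_b):
    """True if two GT strings carry the same alleles, ignoring order and phasing."""
    return sorted(re.split(r"[/|]", gt_a)) == sorted(re.split(r"[/|]", gt_b))


def merge_phase(unphased_lines, phased):
    """Yield the unphased records with GT swapped for the phased GT where safe."""
    for line in unphased_lines:
        line = line.rstrip("\n")
        cols = line.split("\t")
        if line.startswith("#") or len(cols) < 10 or "GT" not in cols[8].split(":"):
            yield line
            continue
        gt_idx = cols[8].split(":").index("GT")
        new_gts = phased.get((cols[0], cols[1], cols[3], cols[4]))
        # Only touch the record if the phased file has it with the same sample count.
        if new_gts is not None and len(new_gts) == len(cols) - 9:
            for i, new_gt in enumerate(new_gts, start=9):
                fields = cols[i].split(":")
                if same_alleles(fields[gt_idx], new_gt):
                    fields[gt_idx] = new_gt  # all other FORMAT values stay as they were
                    cols[i] = ":".join(fields)
        yield "\t".join(cols)
```

In practice the two line iterables would be open file handles for the phased and unphased VCFs, and the merged records would be written to a new file. Because only the GT sub-field is rewritten, the remaining FORMAT annotations of the unphased VCF are kept unchanged.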

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)
    Hey @steps372, you can actually use VariantAnnotator to copy annotations from a "resource" VCF. Although normally this is meant for INFO level annotations, it might be possible to use it for genotype annotations. It wouldn't overwrite the existing genotype, but would copy it as a separate annotation (which we prefer, for bookkeeping purposes).
  • everestial007 (Greensboro; Member)
    edited January 2017

    @steps372 @Geraldine_VdAuwera

    I had a similar problem, where I wanted to transfer the phase information from the PGT field to the GT field. I suggest using awk rather than Python because the awk script will be short. Python obviously gives you more control, but for this purpose awk is a better choice.

    This answer can serve as a reference if someone has a similar problem in the future.

    Below is the awk script:

    awk 'BEGIN{FS=OFS="\t"} {for (i=10;i<=NF;i++) { split($i,f,/:/); if (f[5]~/\|/) sub(/^[^:]+/,f[5],$i) } }1' DNA_samples.passed_variants.vcf > DNA_samples.passed_PGTphase_Updated_variants.vcf

    Description:

    • The field separator should always be set to TAB, for both input (FS) and output (OFS).
    • I wanted to copy the phase state from the PGT field (i.e. the 5th FORMAT field in my VCF) to the GT field (i.e. the 1st FORMAT field in my VCF) for every SAMPLE column. The SAMPLE columns always start at the 10th column, which is why the for-loop starts at i=10.

    Additional notes:

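    • If your PGT or other phase-state field (such as PS) is the 3rd FORMAT field, change f[5] to f[3].
    • If the phase state should be updated for only one specific SAMPLE (say, only the sample in the 12th column), restrict the loop to that column, e.g. for (i=12;i<=12;i++), rather than just removing i++, which would leave the loop without an end.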

    The output *.vcf files from the awk script are compatible with downstream GATK operations (checked and verified).
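
For anyone who prefers the extra control of Python mentioned above, here is a minimal sketch of the same idea that looks up the position of the phase field by name in the FORMAT column instead of hardcoding it as the 5th field. It is an illustration rather than a tested replacement for the awk one-liner: the function name `copy_phase_to_gt` and its `source_tag` parameter are hypothetical, and like the awk script it copies the value only when it contains a `|`.

```python
def copy_phase_to_gt(line, source_tag="PGT"):
    """Return one VCF line with GT replaced by the phased value of source_tag."""
    line = line.rstrip("\n")
    if line.startswith("#"):
        return line  # meta-information and header lines pass through unchanged
    cols = line.split("\t")
    if len(cols) < 10:
        return line  # no sample columns to update
    keys = cols[8].split(":")  # FORMAT column, e.g. GT:AD:DP:GQ:PGT:PID
    if "GT" not in keys or source_tag not in keys:
        return line
    gt_idx, src_idx = keys.index("GT"), keys.index(source_tag)
    for i in range(9, len(cols)):  # sample columns start at the 10th column
        fields = cols[i].split(":")
        # Trailing fields may be dropped for some samples, so check the length.
        if src_idx < len(fields) and "|" in fields[src_idx]:
            fields[gt_idx] = fields[src_idx]
            cols[i] = ":".join(fields)
    return "\t".join(cols)
```

Applying the function to every line of the input VCF and writing the results to a new file gives the same kind of output as the awk command, without depending on where PGT sits in the FORMAT column of a particular record.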
