Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

no phased genotypes in readback phased vcf in v3.1-1

Hello,

I am working with v3.1-1 and cannot get ReadBackPhasing to produce phased genotypes (i.e., 0|0, 0|1, 1|1) on a multi-sample vcf. Here is my command line:

$ java -Xmx6g -jar /share/apps/gatk/3.1-1/GenomeAnalysisTK.jar \
-T ReadBackedPhasing \
-R ${REF} \
-I ${INBAM} \
--variant ${INVCF} \
-L ${INVCF} \
-o ${OUTVCF} \
--phaseQualityThresh 20.0

When I use and older version (v2.6-4), the same commandline yields phased genotypes on the same input vcf.

The logging information indicates that many sites were successfully phased for each sample but $ grep "|" <phased.vcf> yields nothing.

It looks like a bug in v3.1-1 or am I missing something?

Jonathan

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Jonathan,

    Read-backed phasing now uses a different notation instead of the transmission phasing. The author is prepping some documentation on this topic; in the meantime look for an annotation with square brackets.

  • flowflow Member
    edited April 2014

    Hi Geraldine,

    I'll be interested to learn more about the format change, but I don't see square brackets. I do see the HP genotype tag that was not in the previous version but not sure how to interpret it:

    FORMAT=<ID=HP,Number=.,Type=String,Description="Read-backed phasing haplotype identifiers">

    Here is one record from the v3.1-1 output:

    chromosome_1 53280 . G A 2625.76 . AC=4;AF=0.100;AN=40;BaseQRankSum=-4.586;DP=1519;Dels=0.00;FS=6.669;HaplotypeScore=22.9977;InbreedingCoeff=-0.1111;MLEAC=4;MLEAF=0.100;MQ=52.13;MQ0=0;MQRankSum=-7.679;QD=7.63;ReadPosRankSum=-1.759;SB=-1.266e+03 GT:AD:DP:GQ:HP:PL 0/0:55,1:55:99:.:0,160,2094 0/0:90,0:90:99:.:0,271,3594 0/0:46,0:46:99:.:0,138,1799 0/0:49,1:49:99:.:0,141,1881 0/0:99,0:99:99:.:0,286,3786 0/0:68,0:68:99:.:0,199,2601 0/0:81,0:81:99:.:0,229,2900 0/0:60,0:60:99:.:0,159,1977 0/0:60,0:60:99:.:0,150,1875 0/0:82,0:82:99:.:0,241,3054 0/0:80,0:80:99:.:0,235,2904 0/0:71,0:71:99:.:0,144,1698 0/1:53,32:81:99:53280-1,53280-2:673,0,924 0/1:50,27:74:99:53280-1,53280-2:610,0,740 0/1:67,37:100:99:53280-1,53280-2:930,0,1121 0/1:51,26:74:99:53280-1,53280-2:474,0,859 0/0:81,0:81:99:.:0,235,3091 0/0:92,0:92:99:.:0,271,3607 0/0:78,0:78:99:.:0,220,2863 0/0:79,0:79:99:.:0,223,2834

    Here is the same record from 2.6-4 output

    chromosome_1 53280 . G A 2625.76 . AC=4;AF=0.100;AN=40;BaseQRankSum=-4.586;DP=1519;Dels=0.00;FS=6.669;HaplotypeScore=22.9977;InbreedingCoeff=-0.1111;MLEAC=4;MLEAF=0.100;MQ=52.13;MQ0=0;MQRankSum=-7.679;QD=7.63;ReadPosRankSum=-1.759;SB=-1.266e+03 GT:AD:DP:GQ:PL 0|0:55,1:55:99:0,160,2094 0|0:90,0:90:99:0,271,35940|0:46,0:46:99:0,138,1799 0|0:49,1:49:99:0,141,1881 0|0:99,0:99:99:0,286,3786 0|0:68,0:68:99:0,199,2601 0|0:81,0:81:99:0,229,2900 0|0:60,0:60:99:0,159,1977 0|0:60,0:60:99:0,150,1875 0|0:82,0:82:99:0,241,3054 0|0:80,0:80:99:0,235,2904 0|0:71,0:71:99:0,144,1698 0/1:53,32:81:99:673,0,924 0/1:50,27:74:99:610,0,740 0/1:67,37:100:99:930,0,1121 0/1:51,26:74:99:474,0,859 0|0:81,0:81:99:0,235,3091 0|0:92,0:92:99:0,271,3607 0|0:78,0:78:99:0,220,2863 0|0:79,0:79:99:0,223,2834

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Oh yes it's the HP tag. For some reason I remembered it coming with brackets but I guess I was wrong. I don't deal with those very often. The numbers refer to the phased haplotypes, but I can't tell you anything more about how to interpret them, sorry...

  • MaxMax Member

    Hm, I would assume that all SNPs with the same HP tags are on the same haplotype (if the PQ field is available. Otherwise the site was not phased)

    So in this example, everything between position 151527327 and 151527457 is inherited together. However I'm unsure how to exactly interpret the 151527268-1,151527268-2 Identifier.

    POS REF ALT GT:HP:PQ

    151527268 C T 0/1:151527268-1,151527268-2

    151527327 A C 0/1:151527268-1,151527268-2:1916.58

    151527330 T C 0/1:151527268-1,151527268-2:3197.73

    151527376 C T 0/1:151527268-1,151527268-2:2520.91

    151527394 C T 0/1:151527268-1,151527268-2:2389.63

    151527445 T A 0/1:151527268-1,151527268-2:2389.63

    151527457 C T 0/1:151527268-1,151527268-2:2501.22

    151531019 C A 0/1:151531019-1,151531019-2

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I believe you're correct, and I assume the -1/-2 indicates which allele is on which haplotype. I expect that if one of the alt alleles was on the -1 haplotype, the notation would be e.g. 151527457 C T 0/1:151527268-2,151527268-1:2501.22.

  • MaxMax Member

    Seems reasonable, thanks ;)

    Do you also have an idea how to interpret the identifiers ? In my example and in all cases that I've seen in my test file, the identifier always matches the position of the last SNP which is not on the haplotype.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Max, my reading is that the identifier is the position of the first SNP in the haplotype. The first SNP doesn't need a PQ because it is objectively on that haplotype by definition -- its presence defines the haplotype. Then every other variant is phased relative to that first one, and the PQ tells you how confident we are that they are indeed on that haplotype. Does that make sense?

  • flowflow Member

    So there appears to be two changes then. The use of the HP tag and the abandonment of the "|" in the GT subfield. As far as I can tell however, this is not consistent with the VCF formatting standard for phased genotypes (indeed downstream software that relies on the "|" breaks in our analysis pipelines).

    Also, for the example I provided in my original post, the variant in question does not appear to be phased although it was in the version 2.6-4 output (note the HP tag has "." in all genotypes"). I'm unclear why this might be.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @flow, this is a single underlying change -- the HP tag replaces the pipe character. My understanding of the rationale is that we wanted to differentiate physical (read-backed) phasing from pedigree-based phasing (which I believe is what the spec intends to convey). Sometimes the two disagree, so if you use the same notation for both, you have to choose at some point which you want to trust over the other, and you lose information. In contrast, having separate notations allows you to retain all the information and to know where the info comes from in each case. You don't necessarily want to treat the phasing information the same way whether it's physical or pedigree-based. I'm sorry if that breaks some downstream software but perhaps that means the software needs to be smarter about how it treats information.

  • flowflow Member

    Thanks Geraldine. Obviously we are using the GATK tool for convenience. We are happy to adjust our software to handle the output--which is trivial compared to developing our own phasing tool.

    It helps to understand the rational for the decision, which makes sense. Thanks again.

  • MaxMax Member

    @Geraldine_VdAuwera, Thanks for your help! This seems plausible ;)

  • flowflow Member

    @Geraldine_VdAuwera Another possible way to not lose information and not restrict the "|" to a pedigree-based phase definition (that may or may not have been the original intent in the vcf spec--but the spec itself does not make this distinction), would be to use a tag for each type of phasing that had been applied and indicate whether that method of phasing supported the "|" in a given genotype.

    Thanks again for your help.

  • TechnicalVaultTechnicalVault Cambridge, UKMember ✭✭✭

    @Geraldine_VdAuwera Is there any chance you guys could put down a pull request to the VCF spec to explain the HP tag? Downstream software can only be smarter about tags that they know about. At present the only post I can find on the vcf-spec mailing list mentioning an HP FORMAT tag is the one by Adam Auton about a completely separate "Inferred Heteroplasmy Frequency"

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @TechnicalVault‌ Fair point, didn't mean to trivialize the difficulties of keeping different tools in sync in a pipeline. I'll see what I can do to get this documented -- latest I heard the dev responsible was working on docs, but there's been radio silence for a while. Will ping him.

  • MaxMax Member

    Is there any reason why sites with multiple alternate alleles dont get phased ? I just saw it in my multi-sample vcf file, that none of those specific sites could successfully be phased (so there wasnt even a HP tag)

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @‌Max

    Hi,

    It only works for biallelic sites currently.

    -Sheila

Sign In or Register to comment.