Service notice: Several of our team members are on vacation so service will be slow through at least July 13th, possibly longer depending on how much backlog accumulates during that time. This means that for a while it may take us more time than usual to answer your questions. Thank you for your patience.

Version highlights for GATK version 3.3

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

Another season, another GATK release. Personally, Fall is my favorite season, and while I don’t want to play favorites with versions (though unlike with children, you’re allowed to say that the most recent one is the best --and you can tell I was a youngest child) this one is pretty special to me.

Because -ploidy! Yeah, that’s really all I need to say about that. I was a microbiologist once. And I expect many plant people will be happy too.

Other cool stuff detailed below includes: full functionality for the genotype refinement workflow tools; physical phasing and appropriate handling of dangly bits by HaplotypeCaller (must… resist… jokes…); a wealth of new documentation for variant annotations; and a slew of bug fixes that I won’t go over but are listed in the release notes.


Genotype refinement workflow with all the trimmings

As announced earlier this week, we recently developed a workflow for refining genotype calls, intended for researchers who need highly accurate genotype information as well as preliminary identification of possible de novo mutations (see the documentation for details). Although all the tools involved were already available in GATK 3.2, some functionalities were not, so we’re very happy to finally make all of them available in this new version. Plus, we like the new StrandOddsRatio annotation (which sort of replaces FisherStrand for estimating strand bias) so much that we made it a standard one, and it now gets annotated by default.


Non-diploids, rejoice!

This is also a feature that was announced a little while ago, but until now was only fully available in the nightly builds, which are technically unsupported unless we tell you to use them to get past a bad bug. In this new release, both HaplotypeCaller and GenotypeGVCFs are able to deal with non-diploid organisms (whether haploid or exotically polyploid). In the case of HaplotypeCaller, you need to specify the ploidy of your non-diploid sample with the -ploidy argument. HC can only deal with one ploidy at a time, so if you want to process different chromosomes with different ploidies (e.g. to call X and Y in males) you need to run them separately. On the bright side, you can combine the resulting files afterward. In particular, if you’re running the -ERC GVCF workflow, you’ll find that both CombineGVCFs and GenotypeGVCFs are able to handle mixed ploidies (between locations and between samples). Both tools are able to correctly work out the ploidy of any given sample at a given site based on the composition of the GT field, so they don’t require you to specify the -ploidy argument.


HaplotypeCaller gets physical

You know how HC performs a complete reassembly of reads in an ActiveRegion? (If you don’t, go read this now. Go on, we’ll wait for you.) Well, this involves building an assembly graph, of course (of course!), and it produces a list of haplotypes. Fast-forward a couple of steps, and you end up with a list of variants. That’s great, but until now, those variants were unphased, meaning the HC didn’t give you any information about whether any two variants’ alleles were on the same haplotype (meaning, on the same physical piece of DNA) or not. For example, you’d want to know whether you had this:

image

or this:

image

But HC wouldn’t tell you which it was in its output. Which was a shame, because the HC sees that information! It took a little tweaking to get it to talk, but now it emits physical phasing by default in its GVCF output (both banded GVCF and BP_RESOLUTION).

In a nutshell, phased records will look like this:

1   1372243  .  T  <NON_REF>  .  .  END=1372267  <snip>  <snip>
1   1372268  .  G  A,<NON_REF>  .  .  <snip>  GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,40,0:70:99:0|1:1372268_G_A:<snip>
1   1372269  .  G  T,<NON_REF>  .  .  <snip>  GT:AD:DP:GQ:PGT:PID:PL:SB 0/1:30,41,0:71:99:0|1:1372268_G_A:<snip>
1   1372270  .  C  <NON_REF>  .  .  END=1372299  <snip>  <snip>

You see that the phasing info is encoded in two new sample-level annotations, PID (for phase identifier) and PGT (phased genotype). More than two variants can be phased in a group with the same PID, and that can include mixed types of variants (e.g. SNPs and indels).

image

The one big caveat related to the physical phasing output by HC in GVCFs is that, like the GVCF itself, it is not intended to be used directly for analysis! You must run your GVCFs through GenotypeGVCFs in order to get the finalized, properly formatted, ready-for-analysis calls.


Heads or tails

Speaking of HaplotypeCaller getting more helpful all the time, here’s some more of that. This still has to do with the graph assembly, and specifically, with how HC handles the bits at the edges of the graph, which are called dangling heads and dangling tails. Without going too far into the details, let’s just say that sometimes you have a variant that’s near the edge of a covered region, and due to technical reasons (cough kmer size cough) the end of the variant path can’t be tied back into the reference path, so it just dangles there (like, say, Florida) and gets trimmed off in the next step (rising ocean levels). And thus the variant is lost (boo).

image

We originally started paying attention to this because it often happens at the edge of exons near splice junctions in RNAseq data, but it can also happen in DNA data. The solution was to give HC the ability to recover these cliff-dwelling variants by merging the dangling ends back into the graph using special logic tailored for those situations. If you have been using our RNAseq Best Practices, then you may recognize this as the logic invoked by the --recoverDanglingHeads argument. In the new version, the functionality has been improved further and is now enabled by default for all variant calling (so you no longer need to specify that argument for RNAseq analysis). The upshot is that sensitivity is improved, especially for RNAseq data but also for DNA.


Variant annotations finally make sense

Finally, I want to attract everyone’s attention to the Variant Annotations section of the Tool Documentation, which has just undergone a comprehensive overhaul. All annotations now have some kind of documentation outlining their general purpose, output, interpretation, caveats and some notes about how they’re calculated where applicable. Tell us what you think; we are feedback junkies.

Comments

  • I used v3.3-0-g37228af and here is my command:

    java -Xmx10g -jar $gatk -T HaplotypeCaller \
     -R $refGenome \
     --dbsnp $dbSNP \
     -o s.10.g.vcf.gz \
     -I s.bam \
     -L 10 \
     --emitRefConfidence GVCF --variant_index_type LINEAR --variant_index_parameter 128000 &&
    

    but there is no PGT in the output:

    10      93181   .       T       <NON_REF>       .       .       END=93182       GT:DP:GQ:MIN_DP:PL      0/0:10:18:10:0,18,270
    10      93183   .       C       <NON_REF>       .       .       END=93183       GT:DP:GQ:MIN_DP:PL      0/0:10:0:10:0,0,325
    
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @blueskypy Reference blocks are unphased, by definition -- there is no way to distinguish separate haplotypes.

  • I see, so I need to use BP_RESOLUTION.
    Also how to put phased genotype at the GT? Because other programs, for example, Beagle, takes GT?

  • blueskypyblueskypy Member
    edited November 2014

    so does it mean that for banded GVCF, only some sites will be phased after GenotypeGVCFs; but for BP_RESOLUTION GVCFs, all sites, except those no calls, will be phased after GenotypeGVCFs?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Why would you want to have phasing info at hom-ref sites? Again, by definition there is nothing that can be phased there. Even if you use BP_RESOLUTION, there will be no PGT tag at hom-ref sites because they can't be phased...

  • sorry, I forgot the band is only for hom sites ( had a long day!).

    is there a way to to put phased genotype at the GT? Because other programs, for example, Beagle, takes GT? (I can write a script to do that, just wonder if GATK can do that too)

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hah, no worries, I figured :)

    Nope, to avoid collisions with the output of other methods, the PGT stays in PGT. Script away :)

  • could you explain the following lines in a gvcf file?
    why no PGT in this line?

    10      93945   rs4431953       G       A,<NON_REF>     47.77   .       BaseQRankSum=-1.868;ClippingRankSum=-0.104;DB;DP=32;MLEAC=
    1,0;MLEAF=0.500,0.00;MQ=48.09;MQ0=0;MQRankSum=0.156;ReadPosRankSum=0.467        GT:AD:DP:GQ:PL:SB       0/1:27,5,0:32:76:76,0,887,
    157,902,1059:7,20,0,5
    

    why the GT and PGT are different?

    10      94784   .       C       G,<NON_REF>     0       .       DP=14;MLEAC=0,0;MLEAF=0.00,0.00;MQ=58.80;MQ0=0  GT:AD:DP:GQ:PGT:PI
    D:PL:SB 0/0:14,0,0:14:42:0|1:94784_C_G:0,42,838,42,838,838:8,6,0,0
    10      879756  rs11818325      G       A,<NON_REF>     99.18   .       DB;DP=3;MLEAC=2,0;MLEAF=1.00,0.00;MQ=60.00;MQ0=0        GT
    :AD:DP:GQ:PGT:PID:PL:SB 1/1:0,3,0:3:9:0|1:879756_G_A:126,9,0,126,9,126:0,0,2,1
    1
    
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @blueskypy‌

    Hi,

    The PGT has to do with phased genotyping. Please have a look at this thread for more information: http://gatkforums.broadinstitute.org/discussion/4682/pgt-and-pid-in-vcf

    -Sheila

  • Thanks Sheila! But that thread does not answer my question.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @‌blueskypy

    To clarify the scope of PGT, the physical phasing done by HaplotypeCaller is limited to the range of individual active regions, so any variant that is by itself within an active region will be unphased. To phase variants across active regions, you need to use complementary phasing tools such as Phase By Transmission and ReadBackedPhasing.

    Regarding the second line, this is a slight oddity that is linked to the intermediate representation of phasing in GVCFs. You should ignore it and only look at the phasing output by GenotypeGVCFs.

  • Thanks Geraldine! Did you mean "phased" in this sentence "so any variant that is by itself within an active region will be unphased"?

    what's the difference between phasing done by HC and ReadBackedPhasing since they read the same bam file?

    For sites that can be phased by both imputation and HC or ReadBackedPhasing, I guess the latter is more accurate since it's based on bam file. Is my understanding correct?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    No, I do mean that any variant that is by itself within an active region will be unphased (hence no PGT in your first line) since there is no other variant relative to which it could be phased. Phasing is always relative. The difference between HC's phasing and ReadBackedPhasing is that HC emits the phasing that corresponds to the called haplotypes within discrete regions, while ReadBackedPhasing is not limited to active regions, and instead will extend phasing as far as can be supported by the data.

    Imputation is based on external data and certain assumptions relative to those data, so you can argue that it's potentially less accurate than physical phasing, yes.

  • siantornosiantorno Cambridge, UKMember

    Hi Geraldine, is PhaseByTransmission able to handle and properly phase polyploid GT calls?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @siantorno I believe it's only designed to handle diploids but could be wrong. The author, @Laurent Francioli, may jump in to tell you more.

  • LaurentLaurent Member, Broadie

    Hi @siantorno, indeed at the moment PhaseByTransmission only handles diploid calls properly, sorry! Note that we have an internal release which handles the X-chromosome correctly but we still need to polish a few things before release.

    Cheers,
    Laurent

  • Dear All,

    I thank GATK developers for implementing phasing features.
    I wonder if we can perform ReadBackedPhasing for a multiple sample (e.g. N samples) merged vcf file.

    For instance,

    ...
    -T ReadBackedPhasing \
    -I sample_1.bam \
    -I sample_2.bam \
    ...
    -I sample_N.bam \
    -V N_samples_merged.vcf \
    -o N_samples_merged_phased.vcf

    And, same questions for 'PhaseByTransmission'

    Best wishes,
    Luke

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @lukehur, I think that should work but the simplest way to know for sure is to try it on a small region and see if it works as you expect or not.

  • YingLiuYingLiu ChinaMember

    HI , @Geraldine_VdAuwera
    "More than two variants can be phased in a group with the same PID, and that can include mixed types of variants (e.g. SNPs and indels)",you mean that GVCF can phase both indels and SNPs ? but I know RBP only can phase SNP !

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @YingLiu
    Hi,

    Yes, HaplotypeCaller phases both SNPs and indels.

    -Sheila

  • YingLiuYingLiu ChinaMember

    @Sheila said:
    @YingLiu
    Hi,

    Yes, HaplotypeCaller phases both SNPs and indels.

    -Sheila

    @Sheila
    HI,
    could you provide detailed introduction with using HaplotypeCaller(GVCF mode) to phase SNPs and indels ?
    I feel provding related phase alorgthim paper to us is helpful .

    Best

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @YingLiu
    Hi,

    There is no white paper, but you can have a look at the Methods and Algorithms section for a detailed explanation of the tool.

    -Sheila

Sign In or Register to comment.