Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Physical Phasing Information HaplotypeCaller


I am looking to use HaplotypeCaller to call germline variants, and I am particularly interested in the orientation of these variants relative to one another (cis- or trans-). There seems to be reference to physical phasing in the (HaplotypeCaller documentation)[https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_haplotypecaller_HaplotypeCaller.php#--do-not-run-physical-phasing], but I cannot find any physical phasing information in my VCF file.

For instance, I would expect the two variants below:

1 1647722 . G T 307.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-2.861;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=53.28;MQRankSum=-5.260;QD=10.61;ReadPosRankSum=-0.098;SOR=0.155 GT:AD:DP:GQ:PL 0/1:21,8:29:99:315,0,841
1 1647725 . G A 304.60 . AC=1;AF=0.500;AN=2;BaseQRankSum=-1.277;DP=29;ExcessHet=3.0103;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=52.38;MQRankSum=-5.262;QD=10.50;ReadPosRankSum=-0.448;SOR=0.204 GT:AD:DP:GQ:PL 0/1:20,9:29:99:312,0,883

to be in the cis- orientation because they share nearly identical read counts, but I cannot find a corresponding annotation in the VCF file that says as much.

My command to call HaplotypeCaller is as below:

$gatk_launcher --java-options -Xmx${mem}g HaplotypeCaller \
-R $reference \
-I $bam_file \
-O $out_file \
-L $intervals_split &>> $log_file

Thank you for the help!!


  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
    edited April 19


    I checked with our dev team and this is what they had to say:

    Haplotypecaller and M2 automatically emit phased genotypes (i.e the PGT and PID tags and 0|1 instead of 0/1 genotypes) in any mode. However, the phasing algorithm is very conservative and a huge amount of phasing is not found.

    The way the phasing algorithm decides to phase is by checking whether two variants always occur on the same haplotype or always occur on a different haplotype. The excess haplotypes severely dilute the signal.

    This raises the question of whether we could do better, and the answer is yes, easily. The current code is very naive. The dev team is looking into making this process better.

    Post edited by bhanuGandham on
  • nukaemonnukaemon JapanMember
    edited September 4

    > HaplotypeCaller and M2 automatically emit phased genotypes (i.e the PGT and PID tags and 0|1 instead of 0/1 genotypes) in any mode.

    Could you clarify if this is true when running HaplotypeCaller without -ERC option?
    I ran HC(of v4.1.3.0) on my WES data in default single-sample mode(i.e. without -ERC GVCF or BP_RESOLUTION) and found that all the variant's genotypes were unphased (only 0/1, 1/1, or 1/2 shown) while when running with -ERC GVCF in the same input, I saw several phased genotypes.
    There are more than 400K sites called and passed with FilterVariantTranches, so I'm sure there must be a few variants close enough to be physically phased even though the phasing algorithm is very conservative as you mentioned.
    Post edited by bshifaw on
  • bshifawbshifaw Member, Broadie, Moderator admin

    @nukaemon, I may have to double check with my team. In the meantime let me know if this forum post answers your question. Thread

  • nukaemonnukaemon JapanMember
    @bshifaw, thank you for your response and the link.

    I looked for old documents or posts that may have answered to this and found many saying, though all about GATK 3, that allele-specific handling in HaplotypeCaller can be performed in reference confidence mode which I guess means running with gVCF option.
    (ex. #9622, sorry that markdown doesn't seem to work in my environment.)

    So, my guess is that this is still the same to the current version of HaplotypeCaller, and thus outputting VCF spec-compliant phased variants is limited only when HC is run with -ERC GVCF or BP_RESOLUTION,
    but if someone in your team clarifies this, I appreciate.
  • bshifawbshifaw Member, Broadie, Moderator admin

    Just confirmed with the team.

    Correct, variant phasing information is emitted only when -ERC GVCF is included in the command parameters.

    The engine automatically turns off phasing unless the user has requested GVCF mode.

    Source code for this is located here.

Sign In or Register to comment.