Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

UnifiedGenotyper Genotype Calling Oddity

I encountered a strange issue with GATK calling. For instance, a mutation below has VAF of ~49%, but somehow was called by GATK as HOM.

This happens quite a few times where a 45% mutation is sometimes is called as WildType etc. I understand that GATK chooses the genotype that has the highest probability likelihood, so a 50% VAF wildtype is in theory possible, but I don’t see how that can happen so often.Have you experienced this type of problem before? Below is the cmd line I used. Any suggestions of what may have happened? Thanks much in advance.

java -Djava.io.tmpdir=MK1029_PN011_20140314/Temp/ -Xmx20g -jar GenomeAnalysisTK-2014.3-17-g0583018/GenomeAnalysisTK.jar -nt 5 --allSitePLs -L MK1029rs48snp.bed -glm SNP -R unmasked_hg19.fasta -T UnifiedGenotyper --output_mode EMIT_ALL_SITES -D dbSNP137.hg19.vcf -stand_call_conf 0.0 -rf MappingQuality -mmq 20 -stand_emit_conf 0.0 -dcov 100000 -o all.vcf -I bamfile1 –I bamfile2 –bamfile3 etc.

aa.png 11.7K


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there,

    We no longer support using UnifiedGenotyper for calling variants outside of some exceptional cases; you should try using HaplotypeCaller instead. If the problem persists with HaplotypeCaller, then we can try to help you figure it out.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Geraldine_VdAuwera For version 3.4 do you recommend using UG for low coverage data? I have a few more low coverage datasets to be called. I was about to test out the sensitivity of HC3.4 compared to UG, but maybe you can save me the effort? Thank you very much either way.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I really wouldn't recommend using UG anymore at all. For low coverage data, based on what we've seen recently, I think you may get some sensitivity boost by relaxing some of HC's parameters like --minPruning and --minDanglingBranchLength. You can also consider lowering the base and mapping quality requirements. It will probably noticeably increase your false positive discovery rate, but there's not really any way around that (aside from getting more coverage...).

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited July 2015

    I finally got around to running HC with --minPruning and minDanglingBranchLength 2/4, 2/1, 1/4, 1/1. Initially those ROC curves do not look good for HC compared to UG; I will attach tomorrow. UG performs better in terms of sensitivity. This is low coverage (4x-8x) data. Any other suggestions as to what I can try? I want to bury UG once and for all, but in the words of Schwarzenegger in Terminator Genisys it is "old but not obsolete". Can you just have the devs add an --perform_as_well_as_UG_on_low_coverage_data option to HC ASAP? :) Thanks!

    I didn't lower the base and mapping quality requirements. I need to check the HC manual how to do that.

    Sorry for bringing an old thread to live, but this is an annoying pebble to carry in my shoe. I'm throwing half of it into your shoe @Geraldine_VdAuwera .

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    LOL nice one, Tommy.

    Looking fwd to the roc curves... Does this include indels?

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited July 2015

    @Geraldine_VdAuwera Just don't tell your spouse I'm throwing stones after you...

    I have done the ROC curves for indels. HaplotypeCaller3.4 does just as bad as UnifiedGenotyper3.4 as one would expect for low coverage data. That's all fine. I'm not concerned about this.

    However, HC only calls 43k of 63k SNPs on chrom20 for sample NA12878 at low coverage, whereas UG calls 54k. Below are some of the 11k variants that are not called correctly by HC in GVCF mode, but are called by UG.

    I have attached the ROC curves after running HC with different parameters. I'm still waiting for the -mmq 10 and --minPruning 1 runs to finish. I'm very happy to receive additional suggestions for parameter tweaks that will increase the sensitivity.

    UG 229978   rs448030    G   C   GT:AD:DP:GQ:PL  0/1:4,2:6:59:59,0,145
    HC 229978   rs448030    G   C   GT:AD:DP:GQ:PGT:PID:PL  0/0:6,0:6:0:.:.:0,0,94
    UG 231312   rs386387    A   C   GT:AD:DP:GQ:PL  0/1:5,2:7:50:50,0,151
    HC 231312   rs386387    A   C   GT:AD:DP:GQ:PGT:PID:PL  0/0:7,0:7:0:.:.:0,0,135
    UG 347009   rs55913983  C   A   GT:AD:DP:GQ:PL  0/1:6,3:9:41:41,0,221
    HC 347009   rs55913983  C   A   GT:AD:DP:GQ:PGT:PID:PL  0/0:7,0:7:0:.:.:0,0,214
    UG 429722   rs35381164  G   A   GT:AD:DP:GQ:PL  0/1:4,2:6:39:39,0,100
    HC 429722   rs35381164  G   A   GT:AD:DP:GQ:PGT:PID:PL  0/0:6,0:6:0:.:.:0,0,86
    UG 531221   rs204659    C   T   GT:AD:DP:GQ:PL  0/1:4,2:6:51:51,0,119
    HC 531221   rs204659    C   T   GT:AD:DP:GQ:PL  0/0:6,0:6:0:0,0,71
    UG 753739   rs668988    G   A   GT:AD:DP:GQ:PL  0/1:5,2:7:61:61,0,190
    HC 753739   rs668988    G   A   GT:AD:DP:GQ:PL  0/0:7,0:7:0:0,0,141
    UG 765485   rs11698159  A   G   GT:AD:DP:GQ:PGT:PID:PL  0/0:9,0:9:0:.:.:0,0,328
    HC 765788   rs7264641   C   T   GT:AD:DP:GQ:PGT:PID:PL  0/0:7,0:7:0:.:.:0,0,186
    UG 1111614  rs2143226   T   C   GT:AD:DP:GQ:PL  0/1:4,2:6:52:52,0,144
    HC 1111614  rs2143226   T   C   GT:AD:DP:GQ:PL  0/0:6,0:6:0:0,0,101
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hah, indeed. Have I mentioned she may have access to drones?

    Hmm, those 0,0,X PLs are pretty eloquent... HC is definitely being more conservative when confronted with low evidence. Let me know how it goes with the mmq and pruning tweaks. On my end I'll ask the devs if there are any plans to improve sensitivity at low depth.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Geraldine_VdAuwera I'm not sure I want to cycle to work anymore, if I spot a drone above me...

    Yes, lots of 0,0,X PLs after HC and GG. But what puzzles me is that UG seems to get the AD right, whereas HC/GG seems to get it wrong all the time with no support for the alternate allelele. How is that even possible?

    The last few jobs ran to completion; also the --mmq 10 one. I have attached the plot with all the ROC curves. They all lie on top of each other. UG wins by quite a margin.

    Our conclusion is to use samtools and UG for low coverage data. I need to ask @TechnicalVault if samtools will be able to do some kind of joint calling similar to HC/GG in the near future.

    Thank you for all your suggestions and I'm sorry I'm not able to present some wonderful results for HC. I'm removing the pebble from both our shoes.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    You're killing me Smalls, you're killing me. (with apologies for the US cultural reference)

    OK, that's fair. Maybe look into bcbio's implementations of pseudo-joint calling with non-HC callers. Pains me to point you to the competition (and yes, we consider UG to be competition, at this point), but ultimately what I care about is enabling good science, and it sounds like this is what you need for your use case, until we can show some substantial improvements in HC's sensitivity at low coverage. Which may not be complete science fiction; as part of the push for GATK4 (which I may or may not have mentioned previously...), the HC is getting rewritten with some planned improvements; in part to efficiency, but some performance gains may be forthcoming.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    Thanks @Geraldine_VdAuwera! HC4 sounds exciting! I don't think low coverage is where the puck is going to be though (with apologies for the Gretzky reference). I'll put a smile on your face some other time :) I'm very sad myself about not being able to use the HC GVCF workflow, because these bams might be included in another study with high coverage WGS data, and I currently have no good ideas on how to include them. Thank you for your time!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited July 2015


  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    @Geraldine_VdAuwera I was a bit impatient, because I need this project to move forward fast (as you might have guessed from my 3am GMT posts), so I didn't wait for the --minPruning 1 ROC curve to be generated. Please see the attachment with the inclusion of it. HC is still not quite as good as UG, but there is less of a difference with --minPruning 1. I would be very happy to test other settings for you/us/me. This result is sufficiently good, that I will recommend we use these HC3.4 --minPruning 1 calls for joint GenotypeGVCFs calling with the high coverage data. Thanks a million!

    It's somewhat slower (~10-20%) than the default --minPruning 2. I'll try with --minPruning 0 now. I'll post the ROC curve in a few days. I currently don't want this competing too much with my other high priority jobs.

    P.S. Here explanation of abbreviations in the plot:
    "pipeline_HC3.4_minPruning2_minDanglingBranchLength4_incl1000Gp3CEU" - African populations along with the ~100 1000G phase 3 CEU samples
    "pipeline_HC3.4_minPruning2_minDanglingBranchLength4_1000Gp3CEU" - The CEU samples only.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Whoo! Well that's good news to wake up to. I'm not seeing your latest plot though.

    I'm not sure --minPruning 0 is going to be considered a valid value; or if it is, that it will have any effect different from --minPruning 1. The meaning of the parameter is "minimum number of reads needed to support a node for it not to be pruned from the graph". But no node will be created with supporting reads = 0... You'll always have at minimum 1 read. So that test might be a waste of time and compute, to be honest.

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    Thanks so much for this Geraldine! It's only chrom20, but it's still waste. I switched my brain on now :wink:

    I have attached the plot to this post as well. The hard link is:

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    And I see the graph now, thanks! Nice. Do you have an equivalent graph for indels, by any chance?

  • tommycarstensentommycarstensen United KingdomMember ✭✭✭

    Sorry @Geraldine_VdAuwera, but I didn't get a notification about your post despite following this thread. Attached is the figure for indels on chrom20.

    Not shown on the graph is the fact that the sensitivity increases, when HC is run on NA12878 and CEU without ~2000-3000 Africans. Unfortunately I deleted this data to speed up an rsync process and I think I trashed the figure. You just have to take my word for it :/

    Issue · Github
    by Sheila

    Issue Number
    Last Updated
    Closed By
  • tommycarstensentommycarstensen United KingdomMember ✭✭✭
    edited July 2015

    @Geraldine_VdAuwera Last post to this thread. Oddly enough -minPruning 0 worked, but it produced results identical to those of -minPruning 1:

    P.S. @Sheila Your GitHub issues are not visible to the public (e.g. issue 77 above). Not sure if they are supposed to be. Thought I would let you know.

  • SheilaSheila Broad InstituteMember, Broadie admin

    Hi Tommy,

    Thanks for checking on the minPruning 0. We are planning to document the sensitivity and specificity soon.

    As for Github issues, I don't think the content is supposed to be visible to the public. I think the issue itself (in the blue box) is supposed to be visible so users know some kind of status on it. In this case, it is closed.


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Yeah the issue link itself is for our internal use only. I'm planning to put in some logic to make the links only visible to us, to minimize confusion. For the record, any issues deemed of public interest are visible here: https://www.broadinstitute.org/gatk/guide/issue-tracker

Sign In or Register to comment.