HaplotypeCaller Multisample Variant Calling

Hey there!

I've been using HaplotypeCaller as part of a new whole genome variant calling pipeline I'm working on and I had a question about the number of samples to use. From my tests, it seems like increasing the number of samples that HaplotypeCaller runs on simultaneously improves the accuracy no matter how many samples I add (I've tried 1, 4, and 8 samples at a time). Before I try 16 samples, I was wondering if you could tell me if there's a point of diminishing returns for adding samples to HaplotypeCaller. It seems like with every sample I add, the time per sample increases, so I don't want to keep adding samples if it's not going to result in an improved call set. If it does improve the results, I'll live with the longer run times. I should note that I'm making this pipeline for an experiment where there will be up to 50 individuals, and of those, there are family groups of 3-4 people. If running HaplotypeCaller on all 50 simultaneously would result in the best call set, that's what I'll do. Thanks! (By the way, I love the improvements you made with 2.5!)

  • Grant

Best Answers


  • Thanks a ton Geraldine,

    This was really helpful. I guess I'll have to experiment a bit more. I'm usually working with around 20x coverage so I was wondering if that 100 sample approximation was with similar coverage. If so, that should work out well for the short term and I look forward to what comes in 2.6!

  • Thank you again for your suggestions. For now it looks like I can just keep increasing sample counts for a while, but if I hit any hiccups I'll tweak those defaults :)

  • I've begun testing the rate of diminishing returns for my data and I have a question. How do you determine the quality of a call set produced by HaplotypeCaller? I've noticed that some figures (like the ones on this page) are just labeled "True positive rate" or "False positive rate", but it's not clear (at least to me) how you derived those values. I know of some QC metrics you can use, like Ti/Tv ratios, but I was wondering what you use at Broad to evaluate these tools so I know if I'm heading in the right direction. Sorry to bother you again, and thanks for all of the help so far.
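For readers unfamiliar with the Ti/Tv metric mentioned in this post, here is a minimal Python sketch of how a transition/transversion ratio can be computed from a list of SNPs. The function names and the (ref, alt) tuple input are assumptions made for illustration; this is not a GATK tool and not how the GATK team computes it. In practice the pairs would be parsed from a VCF.

```python
# Hypothetical helper names; input is an iterable of (ref, alt) base pairs.
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """True if ref -> alt stays within purines or within pyrimidines."""
    pair = {ref.upper(), alt.upper()}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(snps):
    """Ratio of transitions to transversions over biallelic SNPs.

    Anything that is not a single-base A/C/G/T substitution is skipped.
    Returns NaN when there are no transversions.
    """
    ti = tv = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip indels, N bases, and non-variant records
        if is_transition(ref, alt):
            ti += 1
        else:
            tv += 1
    return ti / tv if tv else float("nan")


if __name__ == "__main__":
    example = [("A", "G"), ("C", "T"), ("A", "C"), ("AT", "A")]
    print(titv_ratio(example))
```

Whole-genome human call sets are commonly reported with a Ti/Tv of roughly 2.0 to 2.1, so a markedly lower value is often read as a sign of excess false positives. Treat that as a rough sanity check rather than a measure of accuracy.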

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hi Grant,

    Call set quality evaluation is a complex topic. The basic way we calculate false vs. true positives is to compare calls to a database of highly curated calls which we use as "truth" data. Here, the selection of the truth data is key to the validity of the comparison, of course. We have some internal resources for this, as well as some public resources such as the datasets provided in our resource bundle. They are described (with an estimate of their reliability) in the FAQ article on VQSR training/truth datasets.
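The comparison against truth data described above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the Broad's actual evaluation code: the `Variant` tuple and `compare_to_truth` function are hypothetical names, variants are matched by exact (chrom, pos, ref, alt) identity, and genotypes are ignored.

```python
from typing import Dict, Iterable, NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant key: site plus alleles."""
    chrom: str
    pos: int
    ref: str
    alt: str


def compare_to_truth(calls: Iterable[Variant],
                     truth: Iterable[Variant]) -> Dict[str, float]:
    """Count calls that match (TP), calls absent from truth (FP),
    and truth variants that were not called (FN)."""
    call_set, truth_set = set(calls), set(truth)
    tp = len(call_set & truth_set)
    fp = len(call_set - truth_set)
    fn = len(truth_set - call_set)
    nan = float("nan")
    return {
        "tp": tp,
        "fp": fp,
        "fn": fn,
        # Fraction of truth variants recovered (true positive rate).
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        # Fraction of calls confirmed by the truth set.
        "precision": tp / (tp + fp) if (tp + fp) else nan,
    }


if __name__ == "__main__":
    calls = [Variant("1", 100, "A", "G"), Variant("1", 200, "C", "T"),
             Variant("2", 50, "G", "A")]
    truth = [Variant("1", 100, "A", "G"), Variant("1", 200, "C", "T"),
             Variant("1", 300, "T", "C")]
    print(compare_to_truth(calls, truth))
```

One caveat with this kind of sketch: a call that is missing from the truth set is counted as a false positive even if it is a real variant, so the result is only as good as the truth data and the regions it covers.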

  • Maguelonne (Paris; Member)


    Time increases when you add samples, but what about virtual memory used?!

  • Sheila (Broad Institute; Member, Broadie admin)



    I am assuming you are asking about RAM. RAM demand does increase with the number of samples, because more data needs to be loaded into memory for processing. This is one of the reasons why the single-sample/GVCF workflow is better than classic multisample calling. Please read more about it here:

