# HaplotypeCaller Multisample Variant Calling

Hey there!

I've been using HaplotypeCaller as part of a new whole genome variant calling pipeline I'm working on and I had a question about the number of samples to use. From my tests, it seems like increasing the number of samples to run HaplotypeCaller on simultaneously improves the accuracy no matter how many samples I add (I've tried 1, 4, and 8 samples at a time). Before I tried 16 samples, I was wondering if you could tell me if there's a point of diminishing returns for adding samples to HaplotypeCaller. It seems like with every sample I add, the time/sample increases, so I don't want to keep adding samples if it's not going to result in an improved call set, but if it does improve the results I'll deal with it and live with the longer run times. I should note that I'm making this pipeline for an experiment where there will be up to 50 individuals, and of those, there are family groups of 3-4 people. If running HaplotypeCaller on all 50 simultaneously would result in the best call set, that's what I'll do. Thanks! (By the way, I love the improvements you made with 2.5!)

• Grant

• Member Posts: 6

Thanks a ton Geraldine,

This was really helpful. I guess I'll have to experiment a bit more. I'm usually working with around 20x coverage so I was wondering if that 100 sample approximation was with similar coverage. If so, that should work out well for the short term and I look forward to what comes in 2.6!

• Member Posts: 6

Thank you again for your suggestions. For now it looks like I can just keep increasing sample counts for a while, but if I hit any hiccups I'll tweak those defaults.

• Member Posts: 6

I've begun testing the rate of diminishing returns for my data and I have a question. How do you determine the quality of a call set produced by HaplotypeCaller? I've noticed that in some figures (like the ones on this page) you just put "True positive rate" or "False positive rate", but it's not clear (at least to me) how you derived those values. I know of some QC metrics you can use, like Ti/Tv ratios, but I was wondering what you use at Broad to evaluate these tools so I know if I'm heading in the right direction. Sorry to bother you again, and thanks for all of the help so far.
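
For readers unfamiliar with the Ti/Tv metric mentioned above, here is a minimal sketch of how the ratio can be computed from a list of biallelic SNP calls. The function names and the `(ref, alt)` tuple input are assumptions made for illustration; this is not how GATK computes it internally, and parsing a real VCF is left out.

```python
# Minimal sketch: transition/transversion (Ti/Tv) ratio for biallelic SNPs.
# Input format (a list of (ref, alt) base pairs) is a hypothetical choice
# made for this example; it is not a GATK data structure.

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def is_transition(ref, alt):
    """Return True for purine<->purine or pyrimidine<->pyrimidine changes."""
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        return False
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def titv_ratio(snps):
    """Compute Ti/Tv over an iterable of (ref, alt) pairs.

    Records that are not simple single-base substitutions are skipped.
    """
    transitions = 0
    transversions = 0
    for ref, alt in snps:
        ref, alt = ref.upper(), alt.upper()
        if ref not in BASES or alt not in BASES or ref == alt:
            continue  # skip indels, symbolic alleles, and non-variant records
        if is_transition(ref, alt):
            transitions += 1
        else:
            transversions += 1
    if transversions == 0:
        raise ValueError("no transversions found; Ti/Tv is undefined")
    return transitions / transversions


if __name__ == "__main__":
    example = [("A", "G"), ("C", "T"), ("A", "C"), ("G", "A")]
    print(titv_ratio(example))
```

The ratio is only a coarse sanity check on a call set as a whole; it says nothing about whether any individual call is correct.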

Hi Grant,

Call set quality evaluation is a complex topic. The basic way we calculate false vs. true positives is to compare calls to a database of highly curated calls which we use as "truth" data. Here, the selection of the truth data is key to the validity of the comparison, of course. We have some internal resources for this, as well as some public resources such as the datasets provided in our resource bundle. They are described (with an estimate of their reliability) in the FAQ article on VQSR training/truth datasets.

Geraldine Van der Auwera, PhD
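
The comparison against truth data described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the Broad's actual evaluation procedure: variants are keyed by a hypothetical `(chrom, pos, ref, alt)` tuple, the function name `compare_to_truth` is made up for this example, and calls absent from the truth set are simply counted as false positives even though a real truth set may be incomplete. Note also that the second metric below is the fraction of calls not found in the truth set, which is not a strict false positive rate (that would require knowing the true negatives).

```python
# Minimal sketch: compare a call set to a curated "truth" set.
# Variant keys of the form (chrom, pos, ref, alt) are an assumption made
# for this example; real comparisons also need allele normalization.


def compare_to_truth(calls, truth):
    """Return basic concordance counts and rates for two variant sets."""
    calls = set(calls)
    truth = set(truth)

    true_positives = calls & truth    # called and present in truth
    false_positives = calls - truth   # called but absent from truth
    false_negatives = truth - calls   # in truth but not called

    # Fraction of truth variants that were recovered.
    sensitivity = len(true_positives) / len(truth) if truth else float("nan")
    # Fraction of calls that are not supported by the truth set.
    unsupported = len(false_positives) / len(calls) if calls else float("nan")

    return {
        "tp": len(true_positives),
        "fp": len(false_positives),
        "fn": len(false_negatives),
        "sensitivity": sensitivity,
        "unsupported_fraction": unsupported,
    }


if __name__ == "__main__":
    calls = {("1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 50, "G", "A")}
    truth = {
        ("1", 100, "A", "G"),
        ("2", 50, "G", "A"),
        ("2", 75, "T", "C"),
        ("3", 10, "A", "C"),
    }
    print(compare_to_truth(calls, truth))
```

Running the same comparison on call sets produced from 1, 4, 8, or more samples would give one way to see where adding samples stops improving the results, provided the truth set covers the sample being evaluated.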

• ParisMember Posts: 9

Hi,

Time increases when you add samples, but what about virtual memory used?!