The current GATK version is 3.7-0





# How many genomes do you run UnifiedGenotyper on?

Member Posts: 27

I'm running UnifiedGenotyper on a large number of post-processed BAMs, which of course makes UnifiedGenotyper run slower than calling individual genomes. There's an accuracy advantage to calling large numbers of genomes together, but the returns diminish as n increases.

My question is: when you have a very large number of 4x whole genomes (not exomes) available, say several thousand, at what point is the advantage of including another genome in a single variant-calling run outweighed by the disadvantages of longer runtime, higher chance of failure, a gigantic VCF output file, etc.? Where is your cutoff point? Of course it depends on the speed and reliability of your computing system, but experience on any system would be useful. Do you yourselves call variants on at most 50 BAMs at a time? 500? 5000?
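For concreteness, one common way to run UG on a large cohort is to pass a `.list` file of BAM paths instead of hundreds of `-I` flags. A minimal sketch, assuming hypothetical paths and illustrative memory/thread settings (the `if` guard simply makes the snippet a no-op on a machine without the GATK jar):

```shell
# Collect the cohort's post-processed BAMs into a list file,
# one path per line. The bams/ directory is a placeholder.
ls bams/*.bam > bams.list 2>/dev/null || true

# Joint-call the whole cohort in a single UG run.
# -Xmx and -nt (data threads) are illustrative, not recommendations.
if [ -f GenomeAnalysisTK.jar ]; then
  java -Xmx16g -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R ref.fasta \
    -I bams.list \
    -glm BOTH \
    -nt 8 \
    -o cohort.vcf
fi
```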

The 1000 Genomes (G1K) methodology doesn't hint at where they set their run-size cutoff, but both they and UK10K seem to publish their VCFs in batches of roughly a thousand genomes per file. I'm tempted to take that as a hint, but I wanted to ask the smart people of the community first.

• Member Posts: 27

Cool @ami, thanks for the quick reply.

So, 60k for exomes, and 2k for 4x WGS, no problem. And to be clear, 2k @4x is the largest you've run.

And there was no point at which you wanted to batch the 2k @4x run?

In my own case, I occasionally have system-wide "blips" (nothing to do with GATK; it's the compute cluster) that sabotage long-running processes. That's why I've been tempted to sacrifice a bit of accuracy for higher process reliability. I'd also like to be able to parallelize better. But I guess you don't have this problem; my hardware cluster is somewhat dated. ^_^

Does your cluster support Queue? Using Queue to scatter-gather your jobs would solve your issue with system blips: it can chop your UG jobs into small intervals so individual processes don't run as long, combine the results appropriately, and automatically re-run any jobs that do fail.
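A hedged sketch of what launching such a Queue run can look like. The QScript name `MyJointCalling.scala` is hypothetical: the QScript itself defines the inputs, reference, and `scatterCount`, and the job runner must match your scheduler.

```shell
# Write the Queue launch command into a small wrapper script so it
# can be reviewed before submission; flag values are illustrative.
cat > run_queue.sh <<'EOF'
#!/bin/sh
# -S: your QScript (hypothetical name; defines inputs and scatterCount)
# -jobRunner: match your cluster (e.g. GridEngine, Lsf706, Drmaa)
# -retry: automatically resubmit shards that fail
# Omit -run to get a dry run that only prints the job plan.
java -jar Queue.jar -S MyJointCalling.scala \
  -jobRunner GridEngine -retry 2 -run
EOF
chmod +x run_queue.sh
```

Doing a dry run first (without `-run`) is a cheap way to sanity-check how Queue will shard the work before occupying the cluster.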

Geraldine Van der Auwera, PhD

• Dev Posts: 50

Just as a technical note: we do run it all in parallel. For the 26K exomes, we ran joint calling on each chromosome and scattered each chromosome across 1000 jobs (since UG can work on each locus separately). You can do this relatively easily with Queue (there is a workshop next month that will teach how).
As to your question: we are sure that running the 2k @4x in batches is not the best (or even a good) way to run it.
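Without Queue, the same idea can be approximated by hand: scatter one UG job per chromosome with `-L`, then gather the pieces with `CatVariants`, which concatenates position-sorted, non-overlapping VCFs. A sketch assuming b37-style chromosome names and hypothetical file paths (guarded so it is a no-op without the GATK jar):

```shell
# Chromosome list for a b37-style reference; adjust for your own.
seq 1 22 > chroms.txt
printf 'X\nY\nMT\n' >> chroms.txt

if [ -f GenomeAnalysisTK.jar ]; then
  # Scatter: one joint-calling job per chromosome, in the background.
  while read -r chr; do
    java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper \
      -R ref.fasta -I bams.list -L "$chr" -o "chr${chr}.vcf" &
  done < chroms.txt
  wait

  # Gather: concatenate the per-chromosome VCFs in reference order.
  V_ARGS=""
  while read -r chr; do
    V_ARGS="$V_ARGS -V chr${chr}.vcf"
  done < chroms.txt
  java -cp GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants \
    -R ref.fasta $V_ARGS -out cohort.vcf -assumeSorted
fi
```

In practice a real scheduler submission (one job per chromosome, or finer `-L` intervals for the 1000-job granularity mentioned above) replaces the backgrounded loop; the scatter/gather shape stays the same.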

• Member Posts: 27

I've been using the local queueing system, but it's probably less clever about parallelization than Queue (it parallelizes by thread, but not across nodes, I believe).

OK, I'm going to parallelize with Queue, and then report back to you both on the performance improvement.