The current GATK version is 3.8-0

Got a problem?

1. Search using the upper-right search box, e.g. using the error message.
2. Try the latest version of tools.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.
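The format checks in item 7 can be sketched as shell commands. This is only an illustration: the jar paths and input file names are placeholders, and `RUN=echo` previews each command line instead of executing it.

```shell
# Sketch of the format checks in item 7; jar paths and input files are
# placeholders. RUN=echo previews each command instead of executing it.
RUN=echo
PICARD_JAR=picard.jar            # hypothetical path to the Picard jar
GATK_JAR=GenomeAnalysisTK.jar    # hypothetical path to the GATK 3.x jar

# Validate a BAM with Picard's ValidateSamFile.
$RUN java -jar "$PICARD_JAR" ValidateSamFile I=input.bam MODE=SUMMARY

# Validate a VCF against its reference with GATK's ValidateVariants.
$RUN java -jar "$GATK_JAR" -T ValidateVariants -R reference.fasta -V input.vcf
```

Once the paths point at real files, drop `RUN=echo` (or set `RUN=` empty) to actually run the checks and paste the output into your post.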

Did we ask for a bug report?

Then follow instructions in Article#1894.

Formatting tip!

Wrap blocks of code, error messages, and BAM/VCF snippets (especially content containing hashes, #) in lines of three backticks ( ``` ) to make a code block, as demonstrated here.

GATK version 4.beta.3 (i.e. the third beta release) is out. See the GATK4 beta page for download and details.

How many genomes do you run UnifiedGenotyper on?

I'm running UnifiedGenotyper on a large number of post-processed BAMs, which of course makes UnifiedGenotyper run a bit slower than when you run it on individual genomes. There's an accuracy advantage to calling on large numbers of genomes at a time, but the returns diminish as n increases.

My question is: when you have a very large number of 4x full genomes (not exomes) available, say several thousand, at what point is the advantage of including another genome in a single variant-calling run outweighed by the disadvantages of longer runtime, higher chance of failure, a gigantic VCF output file, etc.? Where is your cutoff point? Of course it depends on the speed and reliability of your computing system, but experience on any system would be useful. Do you yourself call variants on at most 50 BAMs at a time? 500? 5000?

The G1K (1000 Genomes Project) methodology doesn't hint at where they've set their run-size cutoff, but both they and UK10K seem to split their VCFs into files covering roughly a thousand genomes each. I'm tempted to take that as a hint, but I want to ask the smart people of the community first.

Best Answer


  • Cool @ami, thanks for the quick reply.

    So, 60k for exomes and 2k for 4x WGS, no problem. And to be clear, 2k @4x is the highest you've done.

    No indication that you wanted to batch the 2k @4x run?

    In my own case, I occasionally have system-wide "blips" (nothing to do with GATK; it's the computer cluster) sabotage processes that run for a long time. That's why I've been tempted to sacrifice a bit of accuracy for higher process reliability. I also want to be able to parallelize better. But I guess you don't have this problem; my hardware cluster is sort of dated. ^_^

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Does your cluster support Queue? Using Queue to scatter-gather your jobs would solve your issue with system blips: it can chop your UG jobs into small intervals so the processes don't run as long, combine the results appropriately, and auto-manage re-running any jobs that do fail.

  • Just as a technical note, we do run it all in parallel: for the 26K exomes, we ran joint calling on each chromosome and scattered each chromosome across 1,000 jobs (since UG can work on each locus separately). You can do this relatively easily with Queue (there is a workshop next month that will teach how).
    As to your question, we are sure that running 2k @4x in batches is not the best (or even a good) way to run it.

  • I've been using the local queueing system, but it's probably less clever about parallelization than Queue (it parallelizes by thread, but not by node, I believe).

    OK, I'm going to parallelize with Queue and then report back to you both on the performance improvement.
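The per-chromosome interval chopping discussed in this thread can be illustrated with a little shell arithmetic. The chromosome name, length, and job count below are made-up example values; a real Queue script handles this bookkeeping (and the gather) for you.

```shell
# Sketch of a scatter: split one chromosome into N equal -L intervals so
# each UnifiedGenotyper job runs on a small region. Values are examples only.
CHROM=chr20
LEN=63025520   # example chromosome length in bp
N=4            # number of scatter jobs for this chromosome
i=0
while [ $i -lt $N ]; do
  start=$(( i * LEN / N + 1 ))
  end=$(( (i + 1) * LEN / N ))
  echo "-L ${CHROM}:${start}-${end}"
  i=$(( i + 1 ))
done
# Prints:
# -L chr20:1-15756380
# -L chr20:15756381-31512760
# -L chr20:31512761-47269140
# -L chr20:47269141-63025520
```

Each printed `-L` interval would be passed to a separate UnifiedGenotyper invocation, so no single process has to run for very long.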
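After a scattered run, the per-interval VCFs need to be gathered back into one file. A sketch using GATK 3.x's CatVariants (which is invoked via `-cp` rather than `-T`); the jar path and file names are placeholders, and `RUN=echo` previews the command line.

```shell
# Sketch of the gather step: concatenate per-interval VCFs back into one
# file with CatVariants. File names are placeholders; RUN=echo previews
# the command instead of executing it.
RUN=echo
GATK_JAR=GenomeAnalysisTK.jar    # hypothetical path to the GATK 3.x jar

# -assumeSorted is reasonable here because the scatter intervals were
# contiguous and non-overlapping, so the inputs are already in genomic order.
$RUN java -cp "$GATK_JAR" org.broadinstitute.gatk.tools.CatVariants \
    -R reference.fasta \
    -V chr20.scatter.0.vcf \
    -V chr20.scatter.1.vcf \
    -out chr20.vcf \
    -assumeSorted
```

Again, Queue does this scatter and gather automatically; the sketch just shows what it amounts to.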
