Resource requirements for large variant calling job (is this possible?)

jmartin Member
edited January 2013 in Ask the GATK team

I'm trying to call variants on metagenomic data using the UnifiedGenotyper. I know the diploid genotype calls and likelihoods will not be valid, since my data is not diploid, but I want to use the VCF output to sum up base frequencies at detected variant loci.
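For illustration, the base-frequency summation described above could look something like the following. This is only a sketch, assuming the VCF carries a per-sample `AD` (allele depth) entry in the FORMAT column; the function names `sum_allele_depths` and `allele_frequencies` are made up for this example and are not part of GATK.

```python
from collections import defaultdict


def sum_allele_depths(vcf_lines):
    """Sum per-allele read depths (FORMAT/AD) across all samples at each site.

    Returns {(chrom, pos): {allele: total_depth}}.
    Assumption: each sample column has an AD entry listing REF first, then ALTs.
    """
    totals = {}
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip header and blank lines
        fields = line.split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this line
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        alleles = [ref] + alt.split(",")
        format_keys = fields[8].split(":")
        if "AD" not in format_keys:
            continue
        ad_index = format_keys.index("AD")
        site = defaultdict(int)
        for sample in fields[9:]:
            parts = sample.split(":")
            if ad_index >= len(parts):
                continue  # no-call such as "./." with trailing fields dropped
            depths = parts[ad_index].split(",")
            if len(depths) != len(alleles):
                continue  # malformed or missing AD for this sample
            for allele, depth in zip(alleles, depths):
                if depth != ".":
                    site[allele] += int(depth)
        totals[(chrom, pos)] = dict(site)
    return totals


def allele_frequencies(counts):
    """Convert summed depths into fractions of the total depth at a site."""
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()} if total else {}
```

Called on the lines of a multi-sample VCF, `sum_allele_depths` gives pooled read counts per allele at each site, and `allele_frequencies` turns one site's counts into fractions. Note that these counts only reflect the reads the caller reported in `AD`, which may differ from raw pileup counts.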

I mapped 100+ samples against a database of 671 bacterial reference genomes. Each sample is ~2 Illumina GA2 lanes of data, which after host filtering usually leaves about 20-40 million reads. Each reference can be in multiple parts, so the reference database probably holds tens of thousands of sequence records across the 671 genomes, around 2.2 Gb in total. I am then feeding the resulting 100+ BAM files to the UnifiedGenotyper.

After some initial mistakes on my part (yes, I have entered the future and am using GATK 2.2-5 now :) ) I've started a run in proper fashion, but after a couple of hours it dies with a message that the Java application has run out of memory:

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.2-5-g3bf5e3f):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions
ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java
ERROR ------------------------------------------------------------------------------------------

I had set -Xmx60g for that failed run, so now I'm wondering whether it's possible to estimate how much memory this job would need. Do you think a job of this size is even possible with the UG? Is it the number of references that is killing me here, or the number of samples?



  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie

    Hi there,

    Welcome to the future, and sorry for the delay in answering! It was due to a time differential adjustment from your recent temporal acceleration ;)

    If your references are draft genomes with lots of contigs, then yes, that's going to be a big problem. We haven't run into it ourselves, but a user recently posted a similar problem on this forum. As far as we know, they solved it by obtaining a more completely assembled version of their organism's reference. If you can't do something like that, you might want to try batching your reference genomes rather than using them all at once.

    I'll transfer this to "Ask the Community"; hopefully someone out there will have a better idea of how to do this.
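As a rough illustration of the batching idea mentioned above, the sketch below plans batches of sequence records from a samtools-style `.fai` index (tab-separated, with the sequence name in column 1 and its length in column 2). It only decides which records go together; it assumes the parts of one genome sit next to each other in the FASTA, and the names `read_fai`, `batch_records`, and `max_bases` are hypothetical. Nothing in this thread confirms how much memory a given batch size would save.

```python
def read_fai(lines):
    """Parse .fai index lines into (sequence_name, length) pairs."""
    records = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            records.append((fields[0], int(fields[1])))
    return records


def batch_records(records, max_bases):
    """Group consecutive records into batches of at most max_bases in total.

    Records stay in their original order, so adjacent parts of one genome
    tend to land in the same batch. A single record longer than max_bases
    is placed in a batch of its own.
    """
    batches, current, current_bases = [], [], 0
    for name, length in records:
        # Start a new batch when adding this record would exceed the cap.
        if current and current_bases + length > max_bases:
            batches.append(current)
            current, current_bases = [], 0
        current.append(name)
        current_bases += length
    if current:
        batches.append(current)
    return batches
```

Each resulting batch is a list of sequence names that could be used to build a smaller reference and to run the mapping and calling steps per batch; the right batch size would have to be found by experiment.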
