# Resource requirements for large variant calling job (is this possible?)

Member Posts: 1
edited January 2013

I'm trying to call variants on metagenomic data using the UnifiedGenotyper. I know that the diploid genotype calls and likelihoods will not be valid since my data is not diploid, but I want to use the VCF output to sum up base frequencies at detected variant loci.
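As a rough illustration of the tallying described above, here is a minimal sketch that sums per-allele read depths across samples at each site of a multi-sample VCF. It assumes the VCF carries a per-sample `AD` (allele depth) FORMAT field. The function names `sum_allele_depths` and `allele_frequencies` and the example records are hypothetical, not part of any GATK API or of the actual data set.

```python
from collections import defaultdict


def sum_allele_depths(vcf_lines):
    """Sum per-allele read depths (AD) across all samples at each site.

    Returns {(chrom, pos): {allele: total_depth}}.
    Assumes tab-separated VCF records with an AD entry in FORMAT.
    """
    totals = {}
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this record
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        alleles = [ref] + alt.split(",")
        keys = fields[8].split(":")
        if "AD" not in keys:
            continue  # this sketch relies on AD being present
        ad_idx = keys.index("AD")
        site = defaultdict(int)
        for sample in fields[9:]:
            parts = sample.split(":")
            if len(parts) <= ad_idx or parts[ad_idx] in (".", ""):
                continue  # no-call sample, e.g. "./."
            for allele, depth in zip(alleles, parts[ad_idx].split(",")):
                if depth != ".":
                    site[allele] += int(depth)
        totals[(chrom, pos)] = dict(site)
    return totals


def allele_frequencies(depths):
    """Convert summed depths into frequencies; empty dict if no reads."""
    total = sum(depths.values())
    return {a: d / total for a, d in depths.items()} if total else {}


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\ts3",
    "refA\t100\t.\tA\tG\t50\t.\t.\tGT:AD:DP\t0/1:12,8:20\t1/1:0,10:10\t./.",
]
depths = sum_allele_depths(example)
print(depths[("refA", 100)])
print(allele_frequencies(depths[("refA", 100)]))
```

Because this reads only the depth counts and ignores the `GT` field, the invalid diploid genotype calls do not affect the result.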

I mapped 100+ samples (each about 2 Illumina GA2 lanes of data, which after host filtering usually amounts to about 20-40 million reads per sample) against a database of 671 bacterial reference genomes. Each reference can be in multiple parts, so my reference database probably holds tens of thousands of sequence records spanning the 671 genomes, around 2.2 Gb in total. I am then feeding the resulting 100+ BAM files to the UnifiedGenotyper.

After some initial mistakes on my part (yes, I have entered the future and am using GATK 2.2-5 now), I've started a run in proper fashion, but after a couple of hours it's dying with the message that the Java application has run out of memory:

`##### ERROR ------------------------------------------------------------------------------------------`

I had set `-Xmx60g` for that failed run, so now I'm wondering if it's possible to estimate how much memory this job would need. Do you think a job of this size is even possible with the UG? Is it the number of references that is killing me here, or the number of samples?

Post edited by Geraldine_VdAuwera on