
Resource requirements for large variant calling job (is this possible?)

jmartin · Posts: 1 · Member
edited January 2013 in Ask the team

I'm trying to call variants on metagenomic data using the UnifiedGenotyper. I know that the diploid genotype calls & likelihoods will not be valid since my data is not diploid, but I want to use the VCF output to sum up base frequencies at detected variant loci.

I mapped 100+ samples (each being ~2 Illumina GA2 lanes of data that, after host filtering, usually contain about 20-40 million reads per sample) against a database of 671 bacterial reference sequences. Each reference can be in multiple parts, so I probably have tens of thousands of sequence records in my ref db spanning the 671 reference genomes, around 2.2 Gb in total size. I am then feeding the resulting 100+ BAM files to the UnifiedGenotyper.

After some initial mistakes on my part (yes, I have entered the future and am using GATK 2.2-5 now :) ) I've now started a run in proper fashion, but after a couple of hours it's dying with the message that the Java application has run out of memory:

ERROR ------------------------------------------------------------------------------------------
ERROR A USER ERROR has occurred (version 2.2-5-g3bf5e3f):
ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
ERROR Please do not post this error to the GATK forum
ERROR
ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: There was a failure because you did not provide enough memory to run this program. See the -Xmx JVM argument to adjust the maximum heap size provided to Java
ERROR ------------------------------------------------------------------------------------------

I had set -Xmx60g for that failed run, so now I'm wondering if it's possible to estimate how much memory would be needed for the job I'm trying to run. Do you think a job of this size is even possible with the UG? Is it the number of references that is killing me here? Or the number of samples?
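For reference, the heap size is controlled by the `-Xmx` flag passed to the JVM itself, before `-jar`, not to the GATK argument parser. A minimal sketch of the kind of invocation being described (file names, heap size, and the use of a `.list` file of BAM paths are illustrative, not the poster's actual command):

```shell
# -Xmx must precede -jar so it is consumed by the JVM, not by GATK.
# bams.list is a hypothetical text file with one BAM path per line,
# which GATK accepts in place of repeated -I arguments.
java -Xmx60g -jar GenomeAnalysisTK.jar \
  -T UnifiedGenotyper \
  -R reference.fasta \
  -I bams.list \
  -o calls.vcf
```

Note that if you also use the `-nt` data-threading option, each thread carries its own copy of much of the engine state, so memory use scales up with thread count.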


Answers

  • Geraldine_VdAuwera · Posts: 5,260 · Administrator, GSA Member, admin

    Hi there,

    Welcome to the future, and sorry for the delay in answering! It was due to a time differential adjustment from your recent temporal acceleration ;)

    If your references are draft genomes with lots of contigs, then yes, that's going to be a big problem. We haven't run into this ourselves, but a user recently posted a similar issue on this forum; as far as we know, they solved it by obtaining a more fully assembled version of their organism's reference. If you can't do something like that, you might want to try batching your reference genomes rather than using them all at once.
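    One way to sketch the batching idea, assuming the combined reference stays as a single FASTA: restrict each run to a subset of contigs with `-L`, then merge the per-batch VCFs afterwards. The file names and batch size below are illustrative, and this assumes GNU coreutils `split` (for `--additional-suffix`) and that contig names from the `.fai` index are acceptable as interval entries:

    ```shell
    # Split the contig names from the FASTA index into batches of ~1000,
    # giving each batch file a .intervals extension so GATK recognizes it.
    cut -f1 reference.fasta.fai | \
      split -l 1000 --additional-suffix=.intervals - contigs_batch_

    # One UnifiedGenotyper run per batch; bams.list is a hypothetical
    # file listing the 100+ BAM paths, one per line.
    for batch in contigs_batch_*.intervals; do
      java -Xmx16g -jar GenomeAnalysisTK.jar \
        -T UnifiedGenotyper \
        -R reference.fasta \
        -I bams.list \
        -L "$batch" \
        -o "calls_${batch%.intervals}.vcf"
    done
    ```

    Each batch then needs far less memory than loading the dictionary-wide job at once, at the cost of merging the per-batch VCFs at the end.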

    I'll transfer this to "Ask the Community", hopefully someone out there will have some better idea of how to do this.

    Geraldine Van der Auwera, PhD
