CombineVariants for over 4000 PON, memory issue

shenglai, Chicago, Member

Hi GATK,
I'm trying to build a panel of normals (PON) from ~4000 normal samples from TCGA. The version I'm using is vnightly-2016-02-25-gf39d340.
I first tried passing the .vcf files directly and letting CombineVariants create an .idx for each of them automatically.
I'm using the following cmd:

cmd = ['java', '-Djava.io.tmpdir=/tmp/job_tmp', '-d64', '-jar', '-Xmx128G', gatk_path, '-nt', str(thread_count), '-T', 'CombineVariants', '-R', reference_fasta_path, '-minN', '2', '--setKey', 'null', '--filteredAreUncalled', '--filteredrecordsmergetype', 'KEEP_IF_ANY_UNFILTERED', '-o', output_vcf]
for vcf_path in vcf_files:
    cmd.extend(['-V', vcf_path])

However, it always failed while creating the .idx files after ~1.5k samples. So I manually ran the cmd a couple of times to get all the .idx files, and then ran it with --disable_auto_index_creation_and_locking_when_reading_rods.
Then I got the error ERROR MESSAGE: java.lang.reflect.InvocationTargetException. After searching the forum, I found this post, http://gatkforums.broadinstitute.org/gatk/discussion/6094/error-message-java-lang-reflect-invocationtargetexception, which suggests that it might be caused by "bad" .idx files.
So I used bgzip and tabix to re-index all the files as .vcf.gz, fed the .vcf.gz files (with their .tbi indexes) to GATK, and ran the cmd again. Then I got:

##### ERROR MESSAGE: An error occurred because you did not provide enough memory to run this program. You can use the -Xmx argument (before the -jar argument) to adjust the maximum heap size provided to Java. Note that this is a JVM argument, not a GATK argument.
##### ERROR ------------------------------------------------------------------------------------------
INFO  12:44:40,035 ProgressMeter -    chr1:7995100     68374.0    13.9 h       8.5 d        0.3%    32.1 w      32.0 w 
INFO  12:46:29,151 ProgressMeter -    chr1:7995100     68374.0    13.9 h       8.5 d        0.3%    32.2 w      32.1 w 
INFO  12:47:50,165 ProgressMeter -    chr1:7995100     68374.0    13.9 h       8.5 d        0.3%    32.2 w      32.2 w 
INFO  12:49:58,354 ProgressMeter -    chr1:7995100     68374.0    14.0 h       8.5 d        0.3%    32.3 w      32.2 w 
INFO  12:50:58,603 ProgressMeter -    chr1:7995100     68374.0    14.0 h       8.5 d        0.3%    32.4 w      32.3 w
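
For reference, the bgzip/tabix re-indexing step described above can be sketched roughly as follows. This is an illustrative sketch, not the exact script that was run: it assumes bgzip and tabix are on the PATH and that the inputs are uncompressed .vcf files, and the helper names reindex_commands and reindex are hypothetical.

```python
import subprocess


def reindex_commands(vcf_path):
    """Build the bgzip and tabix commands for one uncompressed VCF (hypothetical helper)."""
    gz_path = vcf_path + '.gz'
    # bgzip -c writes the compressed stream to stdout and leaves the input file in place.
    bgzip_cmd = ['bgzip', '-c', vcf_path]
    # tabix -p vcf writes a .tbi index next to the .vcf.gz; -f overwrites an existing index.
    tabix_cmd = ['tabix', '-f', '-p', 'vcf', gz_path]
    return gz_path, bgzip_cmd, tabix_cmd


def reindex(vcf_path):
    """Compress and index one VCF, returning the path of the new .vcf.gz."""
    gz_path, bgzip_cmd, tabix_cmd = reindex_commands(vcf_path)
    with open(gz_path, 'wb') as out:
        subprocess.check_call(bgzip_cmd, stdout=out)
    subprocess.check_call(tabix_cmd)
    return gz_path
```

Calling reindex(vcf_path) for each entry in vcf_files and passing the returned .vcf.gz paths to -V would then replace the automatically created .idx files with .tbi indexes.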

I have three main questions:
1. Did I do anything stupid?
2. How many resources would you suggest for my case? Right now it runs on a single VM. I could try it at the cluster level.
3. Are there any other ways to do this? I didn't run it in batches, because of -minN 2.
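
To make the concern in question 3 concrete, here is a toy sketch of why applying -minN 2 per batch is not the same as applying it to all samples at once. It does not call GATK; it only models -minN as "keep a site seen in at least N inputs", and the sites, sample sets, and function name are made up for illustration.

```python
from collections import Counter


def sites_passing_min_n(samples, min_n):
    """Return the sites observed in at least min_n samples (toy model of -minN)."""
    counts = Counter(site for sample in samples for site in set(sample))
    return {site for site, n in counts.items() if n >= min_n}


# Hypothetical data: each sample is the set of variant sites it contains.
samples = [
    {'chr1:100'},              # batch 1
    {'chr1:200'},              # batch 1
    {'chr1:100', 'chr1:300'},  # batch 2
    {'chr1:300'},              # batch 2
]

# Threshold applied once over all samples.
all_at_once = sites_passing_min_n(samples, 2)

# Threshold applied separately within each batch, then the results merged.
batched = sites_passing_min_n(samples[:2], 2) | sites_passing_min_n(samples[2:], 2)

# Sites that meet the threshold overall but are dropped by per-batch filtering.
lost = all_at_once - batched
```

In this toy case chr1:100 occurs once in each batch, so it passes the threshold over all samples but is dropped by both per-batch runs and ends up in lost.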

Best
SL
