CombineVariants for over 4000 PON, memory issue

shenglai (Chicago, Member)

I'm trying to build a panel of normals (PON) from ~4000 normal samples from TCGA. The version I'm using is vnightly-2016-02-25-gf39d340.
I first tried passing the .vcf files directly and letting CombineVariants create the .idx file for each of them automatically.
I'm building the command in Python as follows:

cmd = [
    'java', '-Djava.io.tmpdir=/tmp/job_tmp', '-d64', '-jar', '-Xmx128G', gatk_path,
    '-nt', str(thread_count),
    '-T', 'CombineVariants',
    '-R', reference_fasta_path,
    '-minN', '2',
    '--setKey', 'null',
    '--filteredAreUncalled',
    '--filteredrecordsmergetype', 'KEEP_IF_ANY_UNFILTERED',
    '-o', output_vcf,
]
# one -V argument per input VCF
for vcf_path in vcf_files:
    cmd.extend(['-V', vcf_path])

However, it always failed while creating the .idx files after ~1.5k samples. So I manually ran the command a couple of times until all the .idx files existed, and then ran it with --disable_auto_index_creation_and_locking_when_reading_rods.
Then I got the error ERROR MESSAGE: java.lang.reflect.InvocationTargetException. After searching the forum, I found this post, http://gatkforums.broadinstitute.org/gatk/discussion/6094/error-message-java-lang-reflect-invocationtargetexception, which suggests that it might be caused by "bad" .idx files.
So I used bgzip and tabix to re-index all the .vcf.gz files, fed the .vcf.gz and .tbi files to GATK, and ran the command again. Then I got:

##### ERROR MESSAGE: An error occurred because you did not provide enough memory to run this program. You can use the -Xmx argument (before the -jar argument) to adjust the maximum heap size provided to Java. Note that this is a JVM argument, not a GATK argument.
##### ERROR ------------------------------------------------------------------------------------------
INFO  12:44:40,035 ProgressMeter -    chr1:7995100     68374.0    13.9 h       8.5 d        0.3%    32.1 w      32.0 w 
INFO  12:46:29,151 ProgressMeter -    chr1:7995100     68374.0    13.9 h       8.5 d        0.3%    32.2 w      32.1 w 
INFO  12:47:50,165 ProgressMeter -    chr1:7995100     68374.0    13.9 h       8.5 d        0.3%    32.2 w      32.2 w 
INFO  12:49:58,354 ProgressMeter -    chr1:7995100     68374.0    14.0 h       8.5 d        0.3%    32.3 w      32.2 w 
INFO  12:50:58,603 ProgressMeter -    chr1:7995100     68374.0    14.0 h       8.5 d        0.3%    32.4 w      32.3 w
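
For readers who want to reproduce the bgzip/tabix re-indexing step mentioned above, here is a minimal sketch. It assumes plain-text .vcf inputs and that `bgzip` and `tabix` are on the PATH. The helper names `reindex_commands` and `reindex` are hypothetical, and the exact invocations used in the original run are not stated in the post.

```python
import subprocess
from pathlib import Path


def reindex_commands(vcf_path):
    """Return the output path plus the bgzip and tabix command lines.

    `bgzip -c` writes the compressed stream to stdout, so the caller is
    expected to redirect it into the returned .vcf.gz path.
    """
    vcf = Path(vcf_path)
    gz = vcf.with_name(vcf.name + ".gz")
    bgzip_cmd = ["bgzip", "-c", str(vcf)]
    tabix_cmd = ["tabix", "-p", "vcf", str(gz)]  # writes <gz>.tbi next to gz
    return gz, bgzip_cmd, tabix_cmd


def reindex(vcf_path):
    """Compress one VCF with bgzip and build its tabix index."""
    gz, bgzip_cmd, tabix_cmd = reindex_commands(vcf_path)
    with open(gz, "wb") as out:
        subprocess.run(bgzip_cmd, stdout=out, check=True)
    subprocess.run(tabix_cmd, check=True)
    return gz
```

Calling `reindex(path)` for each input leaves a `.vcf.gz` and a `.vcf.gz.tbi` beside the original file. Using `check=True` makes a failed compression or indexing step raise immediately instead of leaving a partial index behind.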

I have three main questions:
1. Did I do anything stupid?
2. How many resources would you suggest for my case? Right now it runs on a single VM. I could try running it on a cluster.
3. Is there a different way to do this? I didn't run the samples in batches because of -minN 2.
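
The concern behind question 3 can be illustrated with a toy model. This is only a sketch under stated assumptions: sites are plain strings, each sample is a set of sites, and the function names are hypothetical. It does not model CombineVariants itself, and whether per-batch counts can be carried through real merged VCFs is not covered here.

```python
from collections import Counter


def sites_with_min_n(samples, min_n):
    """Sites seen in at least min_n samples (single pass over all samples)."""
    counts = Counter(site for sample in samples for site in set(sample))
    return {site for site, n in counts.items() if n >= min_n}


def batched_naive(samples, batch_size, min_n):
    """Apply the threshold inside each batch, then take the union."""
    kept = set()
    for start in range(0, len(samples), batch_size):
        kept |= sites_with_min_n(samples[start:start + batch_size], min_n)
    return kept


def batched_with_counts(samples, batch_size, min_n):
    """Sum per-batch counts first and apply the threshold once at the end."""
    total = Counter()
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        total.update(site for sample in batch for site in set(sample))
    return {site for site, n in total.items() if n >= min_n}


# Toy data: chr1:200:G>C occurs once in each batch when batch_size is 2.
samples = [
    {"chr1:100:A>T", "chr1:200:G>C"},
    {"chr1:100:A>T"},
    {"chr1:200:G>C", "chr1:300:T>G"},
    {"chr1:400:C>A"},
]
```

With a batch size of 2, `batched_naive` drops `chr1:200:G>C` because it appears only once per batch, while `batched_with_counts` matches the single-pass result. In this toy model, batching is only safe if the threshold is deferred until the counts from all batches have been combined.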


