CombineVariants for a PON of over 4000 samples: memory issue

shenglai, Chicago, Member

I'm trying to build a panel of normals (PON) from ~4000 normal samples from TCGA. The GATK version I'm using is vnightly-2016-02-25-gf39d340.
I first tried passing the .vcf files directly and letting CombineVariants create an .idx for each of them automatically.
I'm using the following command:

cmd = ['java', '-Djava.io.tmpdir=/tmp/job_tmp', '-d64', '-jar', '-Xmx128G', gatk_path, '-nt', str(thread_count), '-T', 'CombineVariants', '-R', reference_fasta_path, '-minN', '2', '--setKey', 'null', '--filteredAreUncalled', '--filteredrecordsmergetype', 'KEEP_IF_ANY_UNFILTERED', '-o', output_vcf]
for vcf_path in vcf_files:
    cmd.extend(['-V', vcf_path])

However, index creation always failed after ~1.5k samples, so I reran the command manually a few times until all the .idx files existed, and then ran it with --disable_auto_index_creation_and_locking_when_reading_rods.
Then I got the error "ERROR MESSAGE: java.lang.reflect.InvocationTargetException". After searching the forum, I found this post, http://gatkforums.broadinstitute.org/gatk/discussion/6094/error-message-java-lang-reflect-invocationtargetexception, which suggests that it might be caused by "bad" .idx files.
So I used bgzip and tabix to compress and re-index all the files as .vcf.gz, fed the .vcf.gz and .tbi files to GATK, and ran the command again. Then I got:

##### ERROR MESSAGE: An error occurred because you did not provide enough memory to run this program. You can use the -Xmx argument (before the -jar argument) to adjust the maximum heap size provided to Java. Note that this is a JVM argument, not a GATK argument.
##### ERROR ------------------------------------------------------------------------------------------
INFO  12:44:40,035 ProgressMeter -    chr1:7995100     68374.0    13.9 h       8.5 d        0.3%    32.1 w      32.0 w 
INFO  12:46:29,151 ProgressMeter -    chr1:7995100     68374.0    13.9 h       8.5 d        0.3%    32.2 w      32.1 w 
INFO  12:47:50,165 ProgressMeter -    chr1:7995100     68374.0    13.9 h       8.5 d        0.3%    32.2 w      32.2 w 
INFO  12:49:58,354 ProgressMeter -    chr1:7995100     68374.0    14.0 h       8.5 d        0.3%    32.3 w      32.2 w 
INFO  12:50:58,603 ProgressMeter -    chr1:7995100     68374.0    14.0 h       8.5 d        0.3%    32.4 w      32.3 w
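
For reference, here is a minimal sketch of the bgzip/tabix re-indexing step mentioned above. It is not the exact script that was run: the helper names reindex_commands and reindex are made up for illustration, and it assumes bgzip and tabix are on the PATH and that each input is an uncompressed, coordinate-sorted .vcf.

```python
import subprocess


def reindex_commands(vcf_path):
    """Build the bgzip and tabix invocations for one uncompressed VCF.

    Hypothetical helper: returns the output path and the two command lists
    without running anything, so they can be inspected or logged first.
    """
    gz_path = vcf_path + ".gz"
    # bgzip -c writes the compressed stream to stdout and leaves the input in place.
    bgzip_cmd = ["bgzip", "-c", vcf_path]
    # tabix -p vcf writes gz_path + ".tbi"; -f overwrites an existing index.
    tabix_cmd = ["tabix", "-f", "-p", "vcf", gz_path]
    return gz_path, bgzip_cmd, tabix_cmd


def reindex(vcf_path):
    """Compress one VCF with bgzip, index it with tabix, return the .vcf.gz path."""
    gz_path, bgzip_cmd, tabix_cmd = reindex_commands(vcf_path)
    with open(gz_path, "wb") as out:
        subprocess.check_call(bgzip_cmd, stdout=out)
    subprocess.check_call(tabix_cmd)
    return gz_path
```

With something like this, the -V arguments would point at the returned .vcf.gz paths instead of the original .vcf files.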

I have three main questions:
1. Did I do anything stupid?
2. How many resources would you suggest for my case? Right now it runs on a single VM; I could try it at the cluster level.
3. Is there a different way to do this? I didn't run the samples in batches, because -minN 2 has to be applied across all inputs together.
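
To make the batching concern in question 3 concrete, here is a toy sketch in plain Python (not GATK code). It assumes a site is identified by a (chrom, pos, ref, alt) tuple, which is a simplification of however CombineVariants actually matches records, and the function name sites_passing_min_n is made up for the example.

```python
from collections import Counter


def sites_passing_min_n(inputs, min_n):
    """Keep sites that appear in at least min_n of the given inputs.

    inputs is a list of sets of site keys, one set per input file.
    This mimics the idea behind -minN, not GATK's implementation.
    """
    counts = Counter()
    for sites in inputs:
        counts.update(set(sites))  # each input counts a site at most once
    return {site for site, n in counts.items() if n >= min_n}


# Toy data: each set stands in for the sites of one normal sample.
samples = [
    {("chr1", 100, "A", "T")},
    {("chr1", 200, "G", "C")},
    {("chr1", 100, "A", "T")},
    {("chr1", 200, "G", "C")},
]

# Applying the threshold once over all inputs.
global_pass = sites_passing_min_n(samples, 2)

# Applying the same threshold inside two batches of two samples each.
batches = [samples[:2], samples[2:]]
batched_pass = set()
for batch in batches:
    batched_pass |= sites_passing_min_n(batch, 2)
```

In this toy case each site occurs twice overall but only once per batch, so global_pass contains both sites while batched_pass is empty. That is why filtering with -minN 2 inside each batch would not give the same result as filtering across the full set.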



