How to speed up CombineGVCFs - seems unfeasibly slow?
I hope you had a good weekend! My question to start this week (hopefully the only thread!) is how to use CombineGVCFs. As you may remember, I currently have ~100 WES samples for joint genotyping, and I have been testing GenotypeGVCFs directly and VQSR. However, in the near future I am going to have several hundred more samples to hand, so I understand that making batches using CombineGVCFs is the way forward, as this is the only way to allow for subsequent merging of new gVCFs once the sample size passes 200.
So I have been trying to run it this afternoon on a node that has 16 CPUs and 128 GB of RAM. My initial attempt was with -nt 14, but this gave the following message:
ERROR MESSAGE: Invalid command line: Argument nt has a bad value: The analysis CombineGVCFs currently does not support parallel execution with nt. Please run your analysis without the nt option.
I therefore started it without any threading, but after running for almost 1 hr it appears it is going to take ~75 hrs by its own estimate.
Is this really how long it should take? I have been trying to decipher the various possible issues looking at old threads on the forum, but I am not sure if there is any way I should be able to speed this up (short of splitting out the chromosomes and running them all individually)?
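To make the "split out the chromosomes" idea concrete, this is roughly what I have in mind. It is only a sketch: it prints one CombineGVCFs command per chromosome, restricted with -L, and does not run anything. The -Xmx8g heap size, the combined.<chrom>.g.vcf output names, the GVCF_ARGS variable and the per_chrom_cmds function are all my own assumptions for illustration, not anything taken from the documentation.

```shell
#!/bin/sh
# Dry-run sketch: print one CombineGVCFs command per chromosome.
# Nothing is executed here; the commands are only written to stdout.
# GVCF_ARGS is a hypothetical variable holding the "-V file" arguments.
GATK_JAR=${GATK_JAR:-/apps/GATK/3.3-0/GenomeAnalysisTK.jar}
REF=${REF:-hsapiens.hs37d5.fasta}
GVCF_ARGS=${GVCF_ARGS:--V /path/File000001.g.vcf.gz -V /path/File000100.g.vcf.gz}

# per_chrom_cmds CHROM...: emit one command line for each chromosome given.
per_chrom_cmds() {
    for chrom in "$@"; do
        # -Xmx8g is an assumed per-job heap so several jobs fit in 128 GB.
        printf 'java -Xmx8g -jar %s -T CombineGVCFs -R %s %s -L %s -o combined.%s.g.vcf\n' \
            "$GATK_JAR" "$REF" "$GVCF_ARGS" "$chrom" "$chrom"
    done
}

# hs37d5 names its contigs 1..22, X, Y and MT (no "chr" prefix).
per_chrom_cmds $(seq 1 22) X Y MT
```

The printed lines could then be submitted as separate jobs, or run a few at a time on the 16-CPU node. I have not tested whether this is actually faster overall, and the per-chromosome outputs would still have to be handled downstream.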
Also, I have seen it mentioned somewhere that the slowness may be because the files have been zipped in an incompatible manner. However, the input gVCFs are exactly the same ones that were successfully passed through GenotypeGVCFs, so this seems unlikely. I saw on one old thread that CombineGVCFs was super-slow, but I don't know if this is still the situation? For reference, my command was as shown below:
java -Xmx100000m -Djava.io.tmpdir=$TMPDIR -jar /apps/GATK/3.3-0/GenomeAnalysisTK.jar \
    -T CombineGVCFs \
    -R hsapiens.hs37d5.fasta \
    -V /path/File000001.g.vcf.gz \
    ....
    -V /path/File000100.g.vcf.gz \
    -nt 14 \
    -o AllgVCFsCombined.nt14.g.vcf
Thanks in advance, as always.