GenotypeGVCFs with draft quality reference genome
I am using the GATK pipeline to call variants by aligning reads to a draft quality reference genome that is ~367000 scaffolds. I split the scaffolds up into 50 intervals and successfully (and pretty quickly) generated GVCFs for 25 individuals using the -L option. However, I am having the worst of times with GenotypeGVCFs. After running for nearly 2 days on the first interval list, GenotypeGVCFs has not even output a file. Based on another post in the forum, I removed the scaffolds that are NOT in the interval from the GVCF header, and that sped up the process slightly - I have a combined VCF file with just the header generated after about 18 hours. Not sure how much longer the process has as the progress meter doesn't seem to be making any sense.
Is there any known way(s) to optimize this process?
Currently using the following command:
java -Djava.io.tmpdir=/data/lwwvd/genoGVCF.tmp -XX:ParallelGCThreads=4 -Xmx15g -jar /usr/local/bin/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -nt 16 -T GenotypeGVCFs -R ../ref_genomes/bbu_ref_UMD_CASPUR_WB_2.0.fa -L interval_lists/bbub.refctgs.49.interval_list -V ./1095/1095.49.g.vcf.gz -V ./189/189.49.g.vcf.gz -V ./190/190.49.g.vcf.gz -V ./196/196.49.g.vcf.gz -V ./246/246.49.g.vcf.gz -V ./337/337.49.g.vcf.gz -V ./581/581.49.g.vcf.gz -V ./583/583.49.g.vcf.gz -V ./662/662.49.g.vcf.gz -V ./701/701.49.g.vcf.gz -V ./850/850.49.g.vcf.gz -V ./92764/92764.49.g.vcf.gz -V ./92765/92765.49.g.vcf.gz -V ./92766/92766.49.g.vcf.gz -V ./92767/92767.49.g.vcf.gz -V ./92768/92768.49.g.vcf.gz -V ./92769/92769.49.g.vcf.gz -V ./92770/92770.49.g.vcf.gz -V ./92771/92771.49.g.vcf.gz -V ./92774/92774.49.g.vcf.gz -V ./92775/92775.49.g.vcf.gz -V ./92776/92776.49.g.vcf.gz -V ./92777/92777.49.g.vcf.gz -V ./92778/92778.49.g.vcf.gz -V ./92795/92795.49.g.vcf.gz -o BBUB.combined.49.vcf