Running GenotypeGVCFs on ~4000 human exomes: stuck on "ProgressMeter - Starting"
I am running GenotypeGVCFs on ~4000 human exomes. To speed up the process, I split exome.interval_list into sub-interval lists, each covering ~100 kb of regions, and submitted one GenotypeGVCFs job per sub-interval list in parallel. For example:
java -Xmx32g -jar /GATK/3.6/jar-bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 2 -L /home/jjduan/scatter_interval_list/interval_list.sub000000.interval_list -D /home/jjduan/ref_b37/dbsnp_138.b37.vcf -R /home/jjduan/ref_b37/human_g1k_v37.fasta --variant /home/jjduan/mergedGVCF/chr_19_mergedGVCF.list -o /home/jjduan/genotypedVCF/chr_19_sub000000.vcf
java -Xmx32g -jar /GATK/3.6/jar-bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 2 -L /home/jjduan/scatter_interval_list/interval_list.sub000001.interval_list -D /home/jjduan/ref_b37/dbsnp_138.b37.vcf -R /home/jjduan/ref_b37/human_g1k_v37.fasta --variant /home/jjduan/mergedGVCF/chr_19_mergedGVCF.list -o /home/jjduan/genotypedVCF/chr_19_sub000001.vcf
...
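The scatter step described above can be sketched roughly as follows. This is a minimal illustration in Python, not the script actually used: the function names `split_intervals` and `build_command` are hypothetical, the intervals are assumed to be 1-based inclusive (contig, start, end) triples as in a Picard-style interval list, and the chromosome label in the output name is assumed to be fixed. It also does not start a new chunk at a contig boundary, which a real splitter may want to do.

```python
from typing import List, Tuple

# (contig, start, end), 1-based inclusive, as in a Picard-style interval list
Interval = Tuple[str, int, int]


def split_intervals(intervals: List[Interval],
                    max_bases: int = 100_000) -> List[List[Interval]]:
    """Pack intervals into chunks of at most max_bases bases each.

    An interval that does not fit in the remaining room of the current
    chunk is cut, and the remainder continues in the next chunk.
    """
    chunks: List[List[Interval]] = []
    current: List[Interval] = []
    used = 0  # bases already placed in the current chunk
    for contig, start, end in intervals:
        while start <= end:
            room = max_bases - used
            piece_end = min(end, start + room - 1)
            current.append((contig, start, piece_end))
            used += piece_end - start + 1
            start = piece_end + 1
            if used >= max_bases:
                chunks.append(current)
                current, used = [], 0
    if current:
        chunks.append(current)
    return chunks


def build_command(index: int, chromosome: str = "19") -> List[str]:
    """Build the GenotypeGVCFs argument list for one sub-interval file.

    Paths and options are copied from the commands in this post; only the
    zero-padded chunk index changes between jobs.
    """
    name = f"sub{index:06d}"
    return [
        "java", "-Xmx32g", "-jar", "/GATK/3.6/jar-bin/GenomeAnalysisTK.jar",
        "-T", "GenotypeGVCFs", "-nt", "2",
        "-L", f"/home/jjduan/scatter_interval_list/interval_list.{name}.interval_list",
        "-D", "/home/jjduan/ref_b37/dbsnp_138.b37.vcf",
        "-R", "/home/jjduan/ref_b37/human_g1k_v37.fasta",
        "--variant", f"/home/jjduan/mergedGVCF/chr_{chromosome}_mergedGVCF.list",
        "-o", f"/home/jjduan/genotypedVCF/chr_{chromosome}_{name}.vcf",
    ]


if __name__ == "__main__":
    # Made-up example intervals, for illustration only.
    example = [("19", 1, 150_000), ("19", 200_001, 250_000)]
    for i, chunk in enumerate(split_intervals(example)):
        print(chunk)
        print(" ".join(build_command(i)))
```

Each chunk would be written to its own interval list file and the matching command submitted as a separate job, so every job reads the same merged gVCF list but only a small genomic region.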
However, the log kept showing "ProgressMeter - Starting" for hours without any variants being output.
INFO 00:09:31,580 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 00:09:31,581 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
WARN 00:09:32,292 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bi
WARN 00:09:32,295 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bi
INFO 00:09:32,295 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant f
INFO 00:10:01,605 ProgressMeter - Starting 0.0 30.0 s 49.6 w 100.0% 30.0 s 0.0 s
INFO 00:10:31,606 ProgressMeter - Starting 0.0 60.0 s 99.2 w 100.0% 60.0 s 0.0 s
INFO 00:11:01,608 ProgressMeter - Starting 0.0 90.0 s 148.9 w 100.0% 90.0 s 0.0 s
INFO 00:11:31,611 ProgressMeter - Starting 0.0 120.0 s 198.5 w 100.0% 120.0 s 0.0 s
INFO 00:12:01,613 ProgressMeter - Starting 0.0 2.5 m 248.1 w 100.0% 2.5 m 0.0 s
I have read this thread and noticed that this happens with reference genomes that have millions of contigs. But my data is human, with far fewer contigs, so I do not think it is the same case.
I know WDL/Cromwell supports scatter/gather to speed things up. However, as I understand it, the principle of scatter/gather is the same as what I did here, so even with WDL the parallel jobs would get stuck in the same way. Is that right?
Is there anything else I can do to get this to run at all, or to run faster, or should I just wait?
Thanks a lot for any input!