Running GenotypeGVCFs on ~4000 human exomes: stuck on "ProgressMeter - Starting"
I am running GenotypeGVCFs on ~4000 human exomes. To speed up the process, I split exome.interval_list into smaller sub_interval_list files, each covering ~100 kb of regions. I then submitted one GenotypeGVCFs job per sub_interval_list in parallel, e.g.
java -Xmx32g -jar /GATK/3.6/jar-bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 2 -L /home/jjduan/scatter_interval_list/interval_list.sub000000.interval_list -D /home/jjduan/ref_b37/dbsnp_138.b37.vcf -R /home/jjduan/ref_b37/human_g1k_v37.fasta --variant /home/jjduan/mergedGVCF/chr_19_mergedGVCF.list -o /home/jjduan/genotypedVCF/chr_19_sub000000.vcf
java -Xmx32g -jar /GATK/3.6/jar-bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 2 -L /home/jjduan/scatter_interval_list/interval_list.sub000001.interval_list -D /home/jjduan/ref_b37/dbsnp_138.b37.vcf -R /home/jjduan/ref_b37/human_g1k_v37.fasta --variant /home/jjduan/mergedGVCF/chr_19_mergedGVCF.list -o /home/jjduan/genotypedVCF/chr_19_sub000001.vcf
...
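For reference, the per-interval commands above can be generated with a small shell loop. This is only a sketch: the function names `build_cmd` and `print_all_cmds` are made up for illustration, the paths and options are copied from the commands above, and it prints the commands rather than submitting them because the job scheduler is not specified here.

```shell
#!/usr/bin/env bash
# Sketch: print one GenotypeGVCFs command per sub-interval file.
# build_cmd and print_all_cmds are hypothetical helper names; the GATK
# options and paths are copied from the commands shown above.

build_cmd() {
    local interval_file="$1"
    local out_dir="$2"
    local tag
    # interval_list.sub000000.interval_list -> sub000000
    tag="$(basename "$interval_file")"
    tag="${tag#interval_list.}"
    tag="${tag%.interval_list}"
    printf '%s\n' "java -Xmx32g -jar /GATK/3.6/jar-bin/GenomeAnalysisTK.jar -T GenotypeGVCFs -nt 2 -L ${interval_file} -D /home/jjduan/ref_b37/dbsnp_138.b37.vcf -R /home/jjduan/ref_b37/human_g1k_v37.fasta --variant /home/jjduan/mergedGVCF/chr_19_mergedGVCF.list -o ${out_dir}/chr_19_${tag}.vcf"
}

print_all_cmds() {
    local interval_dir="$1"
    local out_dir="$2"
    local f
    for f in "$interval_dir"/interval_list.sub*.interval_list; do
        # Skip the literal pattern when no file matches.
        [ -e "$f" ] || continue
        build_cmd "$f" "$out_dir"
    done
}

# When called with two arguments, print the commands for that directory.
if [ "$#" -eq 2 ]; then
    print_all_cmds "$1" "$2"
fi
```

Called with the interval directory and the output directory as its two arguments, the script prints one command line per sub-interval file, which can then be passed to whatever submission mechanism is in use.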
However, the log kept showing "ProgressMeter - Starting" for hours without any variants being output.
INFO 00:09:31,580 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 00:09:31,581 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
WARN 00:09:32,292 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bi
WARN 00:09:32,295 StrandBiasTest - StrandBiasBySample annotation exists in input VCF header. Attempting to use StrandBiasBySample values to calculate strand bi
INFO 00:09:32,295 GenotypeGVCFs - Notice that the -ploidy parameter is ignored in GenotypeGVCFs tool as this is automatically determined by the input variant f
INFO 00:10:01,605 ProgressMeter - Starting 0.0 30.0 s 49.6 w 100.0% 30.0 s 0.0 s
INFO 00:10:31,606 ProgressMeter - Starting 0.0 60.0 s 99.2 w 100.0% 60.0 s 0.0 s
INFO 00:11:01,608 ProgressMeter - Starting 0.0 90.0 s 148.9 w 100.0% 90.0 s 0.0 s
INFO 00:11:31,611 ProgressMeter - Starting 0.0 120.0 s 198.5 w 100.0% 120.0 s 0.0 s
INFO 00:12:01,613 ProgressMeter - Starting 0.0 2.5 m 248.1 w 100.0% 2.5 m 0.0 s
I have read this thread and noticed that this happens with reference genomes that have millions of contigs. My data is human, with far fewer contigs, so I do not think it is the same case.
I know WDL/Cromwell supports a scatter/gather approach to speed things up. However, as I understand it, the principle of scatter/gather is the same as what I did here, so even with WDL the parallel jobs would get stuck in the same way. Is that right?
Is there anything else I can do to get this to run at all, or to run faster, or should I just wait?
Thanks a lot for any input!