Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
GenomicsDBImport run slowly with multiple samples
Since I took the GATK4 training early this year, I switched my resequencing analysis from GATK 3.6 to GATK 4.0. Everything worked fine except the GVCF consolidate step. GenomicsDBImport takes much longer time than the traditional CombineGVCFs. On a pilot run of three samples, it took 40 hours while CombineGVCFs only took one hour.
Now I am expanding to 140 samples. I gave GenomicsDBImport a test run on 1 Mb on one chromosome, it took 5 hours, while CombineGVCFs took 10 minutes. I realized after a few runs that the running time is linear to my sample size and interval size. Therefore, I tried out a few parameters as suggested, including,
1. simply run gatk --java-options "-Xmx4g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenomicsDBImport --genomicsdb-workspace-path my_db --intervals chr2:1-10000000 -R ref.fa -V gvcf.list.
2. use --batch-size 10, which took 5% longer than the first solution
3. use --batch-size 10 --reader-threads 5, no difference from solution 2
4. use --L intervals.list, which split 10 Mb into 10 intervals but spent the same time as solution 2
It turns out that GenomicsDBImport without any other parameters runs the fastest, although it's still too long. I've used CombineGVCFs to generate a consolidate GVCF file for genotypeGVCFs, although I noticed that GenomicsDBImport and CombineGVCFs resulted different variants. I would like to compare the variants side by side.
So my questions are,
1. What is the criteria to split one chromosome into a few dozens of intervals without causing potential problems? For example, if a gene is split into two intervals, does it affect the following steps?
2. What is the following procedure after GenomicsDBImport is done with consecutive intervals because GenotypeGVCFs only runs on individual database? I’m testing it now so I will have an answer this week.
After all, I didn’t expect that GenomicsDBImport runs so slowly because I read CombineGVCFs is slower in the GATK4 forum. I would appreciate if its slowness is fixed. Otherwise, I have to split chromosomes into small intervals.