We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
GenomicsDBImport run slowly with multiple samples
Since I took the GATK4 training early this year, I switched my resequencing analysis from GATK 3.6 to GATK 4.0. Everything worked fine except the GVCF consolidate step. GenomicsDBImport takes much longer time than the traditional CombineGVCFs. On a pilot run of three samples, it took 40 hours while CombineGVCFs only took one hour.
Now I am expanding to 140 samples. I gave GenomicsDBImport a test run on 1 Mb on one chromosome, it took 5 hours, while CombineGVCFs took 10 minutes. I realized after a few runs that the running time is linear to my sample size and interval size. Therefore, I tried out a few parameters as suggested, including,
1. simply run gatk --java-options "-Xmx4g -DGATK_STACKTRACE_ON_USER_EXCEPTION=true" GenomicsDBImport --genomicsdb-workspace-path my_db --intervals chr2:1-10000000 -R ref.fa -V gvcf.list.
2. use --batch-size 10, which took 5% longer than the first solution
3. use --batch-size 10 --reader-threads 5, no difference from solution 2
4. use --L intervals.list, which split 10 Mb into 10 intervals but spent the same time as solution 2
It turns out that GenomicsDBImport without any other parameters runs the fastest, although it's still too long. I've used CombineGVCFs to generate a consolidate GVCF file for genotypeGVCFs, although I noticed that GenomicsDBImport and CombineGVCFs resulted different variants. I would like to compare the variants side by side.
So my questions are,
1. What is the criteria to split one chromosome into a few dozens of intervals without causing potential problems? For example, if a gene is split into two intervals, does it affect the following steps?
2. What is the following procedure after GenomicsDBImport is done with consecutive intervals because GenotypeGVCFs only runs on individual database? I’m testing it now so I will have an answer this week.
After all, I didn’t expect that GenomicsDBImport runs so slowly because I read CombineGVCFs is slower in the GATK4 forum. I would appreciate if its slowness is fixed. Otherwise, I have to split chromosomes into small intervals.