GenomicsDBImport or CombineGVCFs on WGS dataset
I'm currently setting up a pipeline for my group to make master VCFs (VCFs containing all our samples) for each species we work with. For each (non-model) species we have sample sizes ranging from 50 to nearly 200. The goal is for all of us to use the same VCF for our (bioinformatic) questions, so that we aren't each running computationally intensive programs for the exact same purpose. I've been adapting your Best Practices pipeline (with some minor changes). I'm currently at the GenomicsDBImport step; I like the datastore concept very much and would like to implement it.
However, when running the program on 155 samples across the whole genome (about 1 Gb, in roughly 100 Super-Scaffolds), the estimated run time (with batches of about 50) is roughly 120 days, which is unacceptable.
My command is:
java -Xms180G -Xmx180G -jar /data/biosoftware/GATK/gatk-220.127.116.11/gatk-package-18.104.22.168-local.jar GenomicsDBImport --genomicsdb-workspace-path database_species --batch-size 52 --sample-name-map cohort.sample_map -L Super-Scaffold.list --reader-threads 5 --overwrite-existing-genomicsdb-workspace true
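For context, the file passed to --sample-name-map (cohort.sample_map) is a tab-separated list with one line per sample: the sample name, then the path to that sample's gVCF. Below is a minimal sketch of how such a file could be generated. The function name make_sample_map, the directory layout, and the convention that the sample name equals the file name minus ".g.vcf.gz" are assumptions for illustration, not necessarily how my files are organised.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GATK): print a sample map for
# GenomicsDBImport, one "sample<TAB>path" line per gVCF in a directory.
# Assumption: each file is named <sample>.g.vcf.gz.
make_sample_map() {
    local gvcf_dir="$1"
    local f
    for f in "$gvcf_dir"/*.g.vcf.gz; do
        [ -e "$f" ] || continue   # skip when the glob matched nothing
        printf '%s\t%s\n' "$(basename "$f" .g.vcf.gz)" "$f"
    done
}

# Example usage (the directory path is a placeholder):
#   make_sample_map /path/to/gvcfs > cohort.sample_map
```

Each gVCF also needs its index file next to it; this sketch does not check for that.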
I completely understand that WGS is a lot bigger than WES (part of the reason why I increased the memory), but I might be doing something wrong. Any tips on how to speed this up are welcome, preferably without splitting too much (I would love to have everything in one datastore and not, for example, one per chromosome, as I'm hoping that adding samples to an existing datastore will be supported soon).
I decided to run CombineGVCFs, and it takes considerably less time (though a lot of memory), even though you claim that GenomicsDBImport should take less time.
At the moment my group is considering continuing with the old (deprecated) CombineGVCFs pipeline. But before we do, I would like to gather advice and recommendations on which program is best (also in the future) for WGS data.
(As a side note, I would like to point out that "--overwrite-existing-genomicsdb-workspace true" doesn't work, so why is it available?)
Thank you in advance and best,