

GenomicsDBImport or CombineGvcfs on WGS dataset

Hi,

I'm currently setting up a pipeline for my group to build a master VCF (a VCF containing all our samples) for each species we work with. Per (non-model) species we have sample sizes ranging from 50 to nearly 200. The goal of these VCFs is that we all use the same VCF for our analyses, so that we aren't each running computationally intensive programs for the exact same purpose. I've been adapting your Best Practices pipeline (with some minor changes) and am currently at the GenomicsDBImport step; I like the datastore concept very much and would like to implement it.
However, when running the program on 155 samples across the whole genome (about 1 Gb, in roughly 100 Super-Scaffolds), the estimated run time with batches of ~50 is roughly 120 days, which is rather unacceptable.

My command is:

java -Xms180G -Xmx180G -jar /data/biosoftware/GATK/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar GenomicsDBImport \
    --genomicsdb-workspace-path database_species \
    --batch-size 52 \
    --sample-name-map cohort.sample_map \
    -L Super-Scaffold.list \
    --reader-threads 5 \
    --overwrite-existing-genomicsdb-workspace true

I completely understand that WGS is much larger than WES (part of the reason I increased the memory), but I might be doing something wrong. Any tips on how to speed this up are welcome, preferably without splitting too much; I would love to keep everything in one datastore rather than, for example, one per chromosome, since I'm hoping that adding samples to an existing datastore will be supported soon.
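A common workaround for this kind of slowdown (not from the post itself; a sketch under my own assumptions about the file layout) is to scatter the import per interval and run the pieces concurrently, since a single workspace covering all scaffolds imports them serially. The helper below only prints the commands rather than running them; the GATK path, flag values, and interval-list format mirror the post, but the per-job memory is a placeholder you would need to tune so that the concurrent JVMs fit on your node.

```shell
# Sketch: emit one GenomicsDBImport command per Super-Scaffold so imports can
# run in parallel (e.g. via xargs -P or GNU parallel) instead of serially.
# GATK_JAR and most flag values mirror the original post.
GATK_JAR=/data/biosoftware/GATK/gatk-4.1.2.0/gatk-package-4.1.2.0-local.jar

gen_import_cmds() {
    # Reads one interval name per line on stdin; prints one command per line.
    while read -r interval; do
        printf 'java -Xms20G -Xmx20G -jar %s GenomicsDBImport --genomicsdb-workspace-path database_species_%s --batch-size 52 --sample-name-map cohort.sample_map -L %s --reader-threads 5\n' \
            "$GATK_JAR" "$interval" "$interval"
    done
}
```

Running, say, `gen_import_cmds < Super-Scaffold.list | xargs -P 4 -I CMD sh -c "CMD"` would import four scaffolds at a time, and each resulting workspace can then be genotyped independently with GenotypeGVCFs.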

I decided to run CombineGVCFs, and it takes considerably less time (though a lot of memory), even though you claim that GenomicsDBImport should take less time.

At the moment my group is considering continuing with the old (deprecated) CombineGVCFs pipeline. But before doing so, I would like to gather advice on which tool is the better choice for WGS data, now and going forward.
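For reference, the CombineGVCFs route being weighed here would look roughly like this; the reference genome and GVCF file names are illustrative placeholders, not from the post:

```shell
# Illustrative only: merge per-sample GVCFs into one multi-sample GVCF,
# then joint-genotype the result. File names are placeholders.
gatk CombineGVCFs \
    -R reference.fasta \
    -V sample1.g.vcf.gz \
    -V sample2.g.vcf.gz \
    -O combined.g.vcf.gz

gatk GenotypeGVCFs \
    -R reference.fasta \
    -V combined.g.vcf.gz \
    -O species_master.vcf.gz
```

The trade-off described in the post matches this layout: CombineGVCFs holds the merged records in memory (hence the high RAM use), while GenomicsDBImport writes a per-interval datastore on disk.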

(As a side note, I would like to point out that "--overwrite-existing-genomicsdb-workspace true" doesn't seem to work, so why is it available?)

Thank you in advance and best,


Answers

  • ABours, Member

    Hi @bhanuGandham,
    Thanks so much for your answer. I think it would be helpful to more people to have a note for this in the documentation, is that possible?

    Best,

  • bhanuGandham, Cambridge MA (Member, Administrator, Broadie, Moderator)

    Hi @ABours

    I will look into it. Thank you for bringing this to our attention.
