Workflow for GenomicsDBImport and GenotypeGVCFs across multiple intervals

mpj5142 (Penn State, Member)

Hello,

We have ~250 WGS samples that I am trying to run through the GATK Best Practices pipeline for germline calls. I'm currently running GATK v4.0.11.0, although the GVCFs were previously generated using HaplotypeCaller v3.7.

I'm trying to set up a pipeline for GenomicsDBImport and GenotypeGVCFs, as CombineGVCFs under GATK 3.7 was simply not able to process the large number of WGS samples. I have already read through a couple of posts on this topic, but am still confused on a couple of points in putting the pipeline together.

First, I see that the recommendation for using intervals is to use the same number of intervals as the number of samples (250 in my case). Are these splits for the whole genome or for individual chromosomes? That is, should I split each chromosome into 250 intervals, or split the entire genome into 250 intervals (~10 per chromosome)? Also, should any of the intervals overlap each other to cover variants that may fall right at an interval boundary?
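For reference, the "split the entire genome into ~250 intervals" option can be sketched as below. This is only an illustration under stated assumptions: `split_genome` is a hypothetical helper, the contig lengths are toy values rather than a real reference, and the intervals are non-overlapping and never span two contigs. The thread does not settle whether overlap at the boundaries is needed, so non-overlap here is an assumption, not a recommendation.

```python
import math


def split_genome(contig_lengths, n_intervals):
    """Cut contigs into chunks of about total_length / n_intervals bases.

    Returns 1-based, inclusive interval strings ("contig:start-end").
    Intervals do not overlap and never cross a contig boundary, so the
    final count can be slightly higher than n_intervals.
    """
    total = sum(contig_lengths.values())
    target = math.ceil(total / n_intervals)
    intervals = []
    for contig, length in contig_lengths.items():
        start = 1
        while start <= length:
            end = min(start + target - 1, length)
            intervals.append(f"{contig}:{start}-{end}")
            start = end + 1
    return intervals


# Toy contig lengths (hypothetical, not a real genome build).
toy_contigs = {"chr1": 1000, "chr2": 600, "chr3": 400}
intervals = split_genome(toy_contigs, 10)
print(len(intervals), intervals[0], intervals[-1])
```

With a real reference, the contig lengths would come from the sequence dictionary or `.fai` index, and each resulting string can be passed as a single `-L` interval.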

Second, from what I understand, both GenomicsDBImport and GenotypeGVCFs should be run separately on each interval, creating 250 (or 250 × 24, if splitting per chromosome) separate databases and VCF files. For the downstream steps in the Best Practices pipeline (i.e. VariantRecalibrator), would a simple merge of the 250 VCF files into one VCF file after GenotypeGVCFs suffice, or are there any other special processing steps recommended?

Lastly, our cluster has 1.25 TB of RAM and 160 cores. Do you have any rough estimates on how much memory to allocate per interval run (which would also determine how many intervals we can run at once)? I can try out a few different values here if you don't have any particular recommendations.

Thank you very much for your time and help!

Regards,

Matthew

Answers

  • bshifaw (moon; Member, Broadie, Moderator, admin)

    Have you had a chance to look through the workflows provided in the gatk-workflows GitHub organization? One of the repos, gatk4-germline-snps-indels, has a WDL script that executes both GenomicsDBImport and GenotypeGVCFs along with the VQSR step.

  • mpj5142 (Penn State, Member)

    Hi @bshifaw ,

    I looked through the joint-discovery-gatk4-local.wdl pipeline, and see that there is a GatherVcfs task after VQSR, which looks to be the step for merging VCFs from different genome segments. Right now the GATK implementation I'm working with is shell-scripted, so I'll have to study up on WDL to implement this particular pipeline.

    In the meantime, I started running GenomicsDBImport and GenotypeGVCFs on 250 genome segments (not by chromosome), and they are running pretty smoothly (<48 hours per segment while running 12 segments in parallel with 100 GB of memory per segment). Hopefully I can merge the resulting VCFs after this step for VQSR, but if not I will merge after VQSR as shown in the WDL pipeline.

    Thanks for your help!
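The per-segment setup described in this thread can be sketched roughly as follows. The numbers (1.25 TB, 160 cores, 100 GB per segment) come from the posts above; the helper names, file names, and workspace naming scheme are hypothetical, and setting `-Xmx` to the full per-job allocation is a simplification that leaves no headroom for memory used outside the Java heap. The sketch only builds the command lines and does not run them.

```python
def max_parallel_jobs(total_mem_gb, mem_per_job_gb, total_cores, cores_per_job=1):
    """Jobs that fit at once, limited by whichever of memory or cores runs out first."""
    return min(total_mem_gb // mem_per_job_gb, total_cores // cores_per_job)


def workspace_for(interval, workdir):
    # Hypothetical naming scheme: one GenomicsDB workspace per interval.
    return f"{workdir}/genomicsdb_" + interval.replace(":", "_").replace("-", "_")


def import_command(interval, gvcfs, workdir, mem_gb):
    cmd = ["gatk", "--java-options", f"-Xmx{mem_gb}g", "GenomicsDBImport",
           "--genomicsdb-workspace-path", workspace_for(interval, workdir),
           "-L", interval]
    for gvcf in gvcfs:
        cmd += ["-V", gvcf]  # one -V per sample GVCF
    return cmd


def genotype_command(interval, reference, workdir, mem_gb):
    workspace = workspace_for(interval, workdir)
    return ["gatk", "--java-options", f"-Xmx{mem_gb}g", "GenotypeGVCFs",
            "-R", reference, "-V", f"gendb://{workspace}",
            "-L", interval, "-O", f"{workspace}.vcf.gz"]


def gather_command(vcfs_in_genomic_order, merged_vcf):
    # GatherVcfs expects its inputs in genomic order.
    cmd = ["gatk", "GatherVcfs", "-O", merged_vcf]
    for vcf in vcfs_in_genomic_order:
        cmd += ["-I", vcf]
    return cmd


# 1.25 TB treated as 1250 GB; 100 GB per segment, as in the post above.
jobs = max_parallel_jobs(1250, 100, 160)
example_import = import_command("chr1:1-200", ["s1.g.vcf.gz", "s2.g.vcf.gz"], "work", 100)
example_genotype = genotype_command("chr1:1-200", "ref.fasta", "work", 100)
print(jobs)
print(" ".join(example_import))
print(" ".join(example_genotype))
```

Under these assumptions memory is the limiting resource, not cores, which matches the 12 parallel segments reported above. Whether the gather step belongs before or after VQSR is left open here, as in the thread.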
