GenotypeGVCFs on whole genomes taking too long
Dear GATK team members and forum users,
I am analysing 200 germline whole genomes following the GATK Best Practices. I am having trouble with GenotypeGVCFs: its runtime grows much faster than linearly as the number of samples (gVCFs) increases.
For context, I have 200 germline whole genomes in BAM format. They are high coverage, so file sizes range from 40 to 130 GB. After recalibration, the BAM files roughly double in size. The recalibrated BAMs are the input to HaplotypeCaller. So far I have run ~100 of these BAMs and obtained the gVCFs.
Now I want to perform joint genotyping with GenotypeGVCFs. When I had only 22 samples, running GenotypeGVCFs on those 22 gVCFs took around 4.5 hours. Now that I want to re-run it with 100 samples, the same single command takes around 1 week. I am running the pipeline on an HPC cluster with a maximum walltime of 1 week, so GenotypeGVCFs is killed before it finishes.
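To put numbers on the slowdown, here is a rough back-of-the-envelope check using only the two timings above. It is a minimal sketch that assumes runtime follows a power law in the number of samples, which two data points cannot confirm. The function names are hypothetical helpers written for this post, and because the 100-sample job was killed at the walltime limit, the 1-week figure is only a lower bound.

```python
import math


def scaling_exponent(n1, t1, n2, t2):
    """Exponent k in runtime ~ n**k implied by two (samples, hours) points.

    Assumption: runtime follows a power law in the sample count.
    """
    return math.log(t2 / t1) / math.log(n2 / n1)


def extrapolate_hours(n_target, n_ref, t_ref, k):
    """Project the runtime at n_target samples from one reference point."""
    return t_ref * (n_target / n_ref) ** k


# Timings from this post: 22 samples took ~4.5 h; 100 samples ran for
# at least 1 week (168 h) before being killed by the scheduler.
k = scaling_exponent(22, 4.5, 100, 7 * 24)
hours_200 = extrapolate_hours(200, 100, 7 * 24, k)

print(f"implied exponent: {k:.2f}")
print(f"projected runtime for 200 samples: {hours_200 / 24:.0f} days")
```

Under that assumption the implied exponent is roughly 2.4, which would put a 200-sample run at well over a month for a single command. Since the 100-sample timing is a lower bound, the true figures could be worse.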
The gVCFs are compressed with bgzip and indexed with tabix. The .g.vcf.gz files range from 1.9 to 7 GB and are the input to GenotypeGVCFs. I am giving Java 230 GB of memory. The exact command I am running is the following:
java -Xmx230g -Djava.io.tmpdir=/tmp \
-jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R reference.fa \
--dbsnp bundle2_8/b37/dbsnp_138.b37.vcf \
--variant sample1.g.vcf.gz --variant sample2.g.vcf.gz ... --variant sample100.g.vcf.gz \
I am not using the -nt option because it fails with "error MESSAGE: Code exception".
The GATK version I am using is 3.7.
I also tried combining the 100 gVCFs into 2 batches of 50 each, but this also takes too long: around 3 days per batch (6 days in total).
I wonder whether this runtime is normal and what approach would be suitable for handling this amount of data. I am really concerned because I don't know how I am going to manage this once I have all 200 gVCFs.
All answers will be appreciated.