GenotypeGVCFs on whole genomes taking too long

zek12zek12 LondonMember
edited March 2017 in Ask the GATK team

Dear GATK team members and forum users,

I am analysing 200 germline whole genomes following the GATK best practises. I am experiencing issues with GenotypeGVCFs, whose runtime is increasing exponentially as the number of samples (gVCFs) increases.

To set you in context, I have 200 germline whole genomes in BAM format. These are high coverage, so their size ranges between 40-130GB. After recalibration, the size of these BAM files increases around 2-fold. The recalibrated BAMs are the input of HaplotypeCaller. I have run ~100 of these BAMs and got the gVCFs.

Now I want to perform joint-genotyping with GenotypeGVCFs. I remember having only 22 samples and running GenotypeGVCFs with these 22 gVCFs did not take long (around 4.5h), but now that I want to re-run with 100 samples this single command takes too long (around 1 week). Actually I am running the pipeline on an HPC, which has a maximum walltime of 1 week, hence GenotypeGVCFs is killed before finishing.

The gVCFs are compressed using bgzip + tabix. The .g.vcf.gz weight between 1.9-7GB. These are used to feed GenotypeGVCFs. I am using 230Gb memory. The exact command I am running is the following:

java -Xmx230g -Djava.io.tmpdir=/tmp \
-jar GenomeAnalysisTK.jar \
-T GenotypeGVCFs \
-R reference.fa \
--dbsnp bundle2_8/b37/dbsnp_138.b37.vcf \
--variant sample1.g.vcf.gz --variant sample2.g.vcf.gz ... --variant sample100.g.vcf.gz \
-o joint.vcf

The reason why I am not using -nt option is that it gives an "error MESSAGE: Code exception".
The GATK version I am using is 3.7

I also tried combining the 100 gVCFs into 2 batches of 50 each, but this also takes too long, around 3 days for each batch (6 days in total).

I wonder what approach would be suitable to handle this amount of data and whether this is normal. I am really concerned because I don't know how I am going to manage this once I have the 200 gVCFs.

All answers will be appreciated.

Thanks,
Ezequiel

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi Ezequiel,

    In our production pipelines we use a sort of database developed by Intel called GenomicDB to make the process much more scalable, but unfortunately right now it's fairly difficult to set up. However there is development work being done to make it a lot easier. For now combining GVCFs into batches is probably the most helpful thing you can do. I know it's a heavy upfront investment, but as you add more samples it will make a difference.

    One other thing I can tell you is that one of our developers did some profiling/benchmarking one some regions where the process was unusually slow, and found that for the "slow" region analyzed at least, the new QUAL calculator (-newQual advanced argument introduced in 3.7, see release notes) made the process roughly 4 to 5 times faster. It's unlikely these gains apply equally across the genome, but it may make a substantial difference nevertheless.

  • zek12zek12 LondonMember

    Dear Geraldine,

    Ok, will do what you suggest. I'll let you know if these changes made a difference.

    Thanks!

  • zek12zek12 LondonMember
    edited March 2017

    Dear Geraldine,

    I have another question. Can I combine a one-sample gVCF with a multi-sample gVCF (using CombineGVCFs) every time I get a new sample gVCF? My idea is to, every time I genotype a new sample, add it to a already-combined gVCF, so I would have an accummulative combined.g.vcf which will eventually be the input for joint-genotyping. Is this supposed to be faster than combining all the one-sample gVCFs?

    Thanks

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @zek12
    Hi Ezequiel.

    Sure, you can combine GVCFs with different numbers of samples using CombineGVCFs. I think inputting one combined GVCF into GenotypeGVCFs will be faster than inputting many combined GVCFs. But, I have not confirmed this with tests. Let us know how things go!

    -Sheila

  • zek12zek12 LondonMember

    Great, thanks, will let you know Sheila.

    Many thanks to both :)

  • zek12zek12 LondonMember
    edited March 2017

    Dear Geraldine and Sheila

    Sorry for my late post. After doing some research and tests, I've found out that performing joint-genotyping parallelised by chromosome did the trick and is way more efficient than batching samples. I am using -L option for this:

    java -jar GenomeAnalysisTK.jar \
    -T GenotypeGVCFs \
    -R reference.fasta \
    --dbsnp dbsnp_138.b37.vcf \
    --variant sample1.g.vcf.gz ... sampleN.g.vcf.gz \
    -L $chr \
    -newQual \
    -o joint_$chr.vcf

    Splitting the job into 25 parts (one per chromosome) turned out to take a total of 12hs, instead of the 7 days it used to take.
    What do you think of this approach?

    Besides, although in the best practises you recommend batching samples when we have more that ~200 samples, I guess this only applies when we have whole-exomes?

    Thanks in advance.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @zek12,

    Indeed, there are multiple things you can do to speed up processing and/or minimize requirements. Batching GVCFs is done to deal with the problem of having to store the index of too many files into memory. Parallelizing over intervals has a different effect, namely reducing the time it takes from the user's perspective (it still takes a lot of time in actuality, if you add up the runtime of all the parallel jobs). In production we parallelize over calling intervals, which allows us to parallelize more than just at the chromosome level.

  • zek12zek12 LondonMember

    Thank you very much for your answer Geraldine, you both have been very helpful!

    Ezequiel

  • init_jsinit_js Member

    Sorry to be late to the party, but I see that -newQual has become --new-qual in Gatk 4.0.1.2, and it's still turned off by default. Is there an article that discusses the tradeoffs? I'm considering it.

    There's also an undocumented TMP_DIR option for GenotypeGVCFs in the usage help text. Would it help if that was set to a location on tmpfs or a fast disk?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator
    edited May 1

    @init_js
    Hi,

    This ticket should have some helpful insight.

    For TMP_DIR, this thread and perhaps this ticket will help.

    -Sheila

    EDIT: I will see if we can change the documentation for TMP_DIR.

Sign In or Register to comment.