HaplotypeCaller 4.beta.6 gVCF performance

Hi, ever since the 4.beta.4 release, I've noticed a significant increase in the memory requirements and execution time of HaplotypeCaller in gVCF mode. I tested the 4.beta.2 and 4.beta.6 versions of HaplotypeCaller with an NA12878 BAM, aligned with BWA 0.7.13, at approximately 30x coverage. 4.beta.2 completed after roughly 5 h with 2 GB of memory, while 4.beta.6 completed after roughly 30 h with 15 GB of memory. 4.beta.6 failed with an out-of-memory exception when given less memory.
Both versions were run with the same settings (--interval_set_rule UNION --genotyping_mode DISCOVERY --createOutputVariantIndex --emitRefConfidence GVCF) and parallelized over intervals from a custom BED file.
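For reference, one scattered job with those settings looks roughly like the sketch below, written with the hyphenated GATK4-release flag spellings. The reference, BAM, and interval paths are placeholders, and the command is only assembled and printed here rather than executed:

```shell
# Sketch of one scattered HaplotypeCaller gVCF job (GATK4-style flag spellings).
# REF, BAM, and INTERVALS are placeholders for the actual pipeline inputs.
REF=Homo_sapiens_assembly38.fasta
BAM=NA12878.30x.bam
INTERVALS=scatter/chunk_01.bed

CMD="gatk --java-options -Xmx4g HaplotypeCaller \
  -R $REF -I $BAM -L $INTERVALS \
  --interval-set-rule UNION \
  -ERC GVCF \
  --create-output-variant-index true \
  -O chunk_01.g.vcf.gz"

# Print instead of running, so the command can be inspected:
echo "$CMD"
```

Each BED file in the scatter directory would get its own job of this shape, with the outputs merged afterwards.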
From my understanding of the release notes, the versions from 4.beta.4 onwards have a bug fix that corrects the results of HaplotypeCaller in gVCF mode. Is the performance difference to be expected?
Thank you,
Teodora
Best Answer
-
Geraldine_VdAuwera Cambridge, MA admin
Hi @teodora_aleksic, the longer runtime is not a consequence of the bug fix, it's due to another change that was made in that release. It's sort of an artifact of our current development constraints, which forced us to remove some key optimizations while we're refining and evaluating equivalence of results with the older version. I'm writing up a blog post about this that will go out next week, since it has emerged as a concern for many users. The good news is that we should be able to restore the optimizations in the near future.
Answers
Hi @Geraldine_VdAuwera, that's good to hear. Thank you for the quick response.
Hi @Geraldine_VdAuwera , I started testing the full release of GATK 4. So far, I've noticed the same performance issues as with 4.beta.6. Is this behavior still expected?
No that’s not expected, can you please provide some details of what you are doing and what issues you observe?
Hi @Geraldine_VdAuwera , thanks for the quick response!
We used CCLE WES FASTQs for the purpose of testing (6GB per paired end). We scattered the tool using a custom BED file with 86 intervals.
4.0.0.0 failed due to memory with 2 GB per job, but completed with 4 GB per job; each job lasted approximately 1 h 10 min. 4.beta.2 completed with 2 GB per job, and each job lasted approximately 5 min.
This is in gVCF mode, with the same settings as described above. All input files, settings and hardware are the same between versions.
Out of curiosity, did you check the validity of the results you got with the beta.2?
We have tested the validity of a whole-genome GATK 4.beta.2 workflow (in gVCF mode, with VQSR) against a number of different library preps of the GIAB HG001-HG005 samples. All of the scores were at expected levels, for example: HG001-50x SNP (precision=99.91, recall=99.75), INDEL (precision=99.40, recall=98.65); HG002-50x SNP (precision=99.90, recall=99.71), INDEL (precision=99.47, recall=98.54).
Also, intersecting the VCFs produced by 4.beta.2 and 4.0.0.0 (with the same whole-genome workflow configuration) on the aforementioned CCLE WES files yielded about 100,000 matching variants, while 200 differed.
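For readers unfamiliar with these benchmark metrics: precision is TP/(TP+FP) and recall is TP/(TP+FN), reported here as percentages. A quick sketch of the arithmetic with made-up counts (the post does not include the actual GIAB tallies):

```shell
# precision = TP/(TP+FP), recall = TP/(TP+FN), as percentages.
# The counts below are illustrative only, not the real benchmark tallies.
TP=3650000; FP=3300; FN=9100
PREC=$(awk -v tp="$TP" -v fp="$FP" 'BEGIN {printf "%.2f", 100 * tp / (tp + fp)}')
REC=$(awk -v tp="$TP" -v fn="$FN" 'BEGIN {printf "%.2f", 100 * tp / (tp + fn)}')
echo "precision=$PREC recall=$REC"   # prints: precision=99.91 recall=99.75
```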
@teodora_aleksic
Hi Teodora,
Thanks for the information. I will pass this on to the team, and someone will get back to you soon.
-Sheila
@teodora_aleksic
Hi again Teodora,
Can you provide us with a test case to reproduce this? The developers say their own profiling shows that the 4.0 version of HaplotypeCaller is much faster than the later betas. Can you also try running with GATK3 and letting us know if GATK4 is faster?
The 4.beta.2 version of HaplotypeCaller does NOT give correct output at all, while the 4.0 version gives output that is almost identical to GATK3's. HaplotypeCaller in 4.beta.2 was not a complete version of the tool, so it is not an appropriate baseline to compare the release against.
-Sheila
Hi everyone,
We found the reason why HaplotypeCaller 4.0.0.0 performed worse than 4.beta.2. We scatter the tools using intervals from a custom BED file. Previously, each instance of HaplotypeCaller received a BAM file that ApplyBQSR had produced for a specific interval, but not the interval itself. This worked for 4.beta.2 but caused poor performance in 4.0.0.0. Now that we pass both the BAM and the interval to HaplotypeCaller 4.0.0.0, it performs just as well as 4.beta.2 with the same amount of memory.
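The before/after difference can be sketched as follows (file names are placeholders, and the commands are only assembled as strings for inspection): the slow configuration handed HaplotypeCaller the interval-subset BAM alone, while the fix also forwards the interval via -L so the tool restricts its traversal to that region.

```shell
# Before (slow in 4.0.0.0): the BAM had already been subset to one interval
# by ApplyBQSR, but the interval itself was not forwarded to HaplotypeCaller.
BEFORE="gatk HaplotypeCaller -R ref.fasta -I chunk_01.bam -ERC GVCF -O chunk_01.g.vcf.gz"

# After (fast, same memory as 4.beta.2): forward the interval with -L as well.
AFTER="gatk HaplotypeCaller -R ref.fasta -I chunk_01.bam -L chunk_01.bed -ERC GVCF -O chunk_01.g.vcf.gz"

echo "$AFTER"
```

Without -L, HaplotypeCaller has no way of knowing the BAM only covers one region and sets up its traversal for the whole reference, which matches the memory blow-up described above.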
@teodora_aleksic
Hi Teodora,
Thank you for reporting your findings.
I will pass this along to the team.
-Sheila