Using Interval List with HaplotypeCaller
I have a ~30 GB BAM to pass to HaplotypeCaller, and my knowledge of how to make this run as quickly as possible is limited. I support a team of CBs who are currently unable to use our GATK workflow effectively because of how long it takes us to process BAMs of this size.
First, my understanding is that if an interval list is passed to HaplotypeCaller, some kind of parallel processing is done. Is that correct? If so, and given that I'm executing this through a WDL workflow running on the cloud, will specifying more cores increase parallelization?
Also, for my test data set, I'm passing an interval list file formatted like so:
X:1-1500000
X:1500001-3000000
X:3000001-4500000
X:4500001-6000000
X:6000001-7500000
This file was generated by chunking the reference sequence into 1.5 Mb windows. Is this an acceptable approach, or is there a canonical way of doing it?
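For reference, the chunking I describe can be sketched in Python roughly as follows. This is a minimal illustration and not a GATK tool: the function name `chunk_intervals`, the contig length of 7,500,000, and the chunk size of 1,500,000 are assumptions chosen only to reproduce the intervals listed above, and coordinates are treated as 1-based and inclusive.

```python
def chunk_intervals(contig, contig_length, chunk_size):
    """Return 1-based, inclusive 'contig:start-end' strings covering the contig.

    Hypothetical helper for illustration; it is not part of GATK.
    """
    if contig_length < 1 or chunk_size < 1:
        raise ValueError("contig_length and chunk_size must be positive")
    intervals = []
    # Step through the contig in fixed-size windows, starting at position 1.
    for start in range(1, contig_length + 1, chunk_size):
        # Clamp the final window so it never runs past the end of the contig.
        end = min(start + chunk_size - 1, contig_length)
        intervals.append(f"{contig}:{start}-{end}")
    return intervals


if __name__ == "__main__":
    # Assumed values: contig "X", 7.5 Mb covered in 1.5 Mb chunks.
    for interval in chunk_intervals("X", 7_500_000, 1_500_000):
        print(interval)
```

Running the script prints one interval per line, which can be redirected to a file. The last window is clamped to the contig length, so a contig that is not an exact multiple of the chunk size still gets a valid final interval.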
Lastly, is there any other documentation, or are there any suggestions, on how to speed up HaplotypeCaller processing of large BAM files such as this one?