Parallelizing base quality score recalibration
I'm attempting to use the BaseRecalibrator tool on 30-50x depth whole-genome datasets, with BAM files of around 100-150 GB. However, it is very computationally demanding, so I'd really like to distribute the processing over many cores on our cluster. I've done this for the indel realignment step by running it on each chromosome separately, as described in the now-retired guidelines on "Parallelism with the GATK" (I think a new version is due to be issued at some point). It's less clear, to me at least, how to do this for the BaseRecalibrator.
For example, is it possible to combine GATKReports for the recalibration data generated for separate chromosomes? Or should I run the on-the-fly recalibration with PrintReads and the -BQSR option using the recalibration data for each chromosome separately? If the latter, does it matter that for some of the smaller unplaced/unlocalized chromosomes the recalibration tables will contain values for covariates generated with only a few observations? The documentation on the Base Quality Score Recalibrator seems to suggest that the recalibration tables need to be calculated over the whole genome.
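To make the per-chromosome scatter concrete, here is a rough sketch of how the command lines could be generated. It only builds the strings and runs nothing. The flags are GATK 3-style as I understand them (`-T`, `-R`, `-I`, `-knownSites`, `-L`, `-o`), and the jar path, file names, and the `scatter_commands` helper are all hypothetical placeholders. Whether the resulting per-chromosome tables can be used or combined afterwards is exactly the open question above.

```python
import shlex


def scatter_commands(contigs, bam, reference, known_sites,
                     jar="GenomeAnalysisTK.jar"):
    """Build one BaseRecalibrator command line per contig.

    Nothing is executed; the strings are meant to be handed to a
    cluster scheduler. All paths are placeholders (assumptions).
    """
    commands = []
    for contig in contigs:
        args = [
            "java", "-jar", jar,
            "-T", "BaseRecalibrator",
            "-R", reference,
            "-I", bam,
            "-knownSites", known_sites,
            "-L", contig,                    # restrict this job to one contig
            "-o", f"recal.{contig}.table",   # one recalibration table per contig
        ]
        commands.append(" ".join(shlex.quote(a) for a in args))
    return commands


if __name__ == "__main__":
    for cmd in scatter_commands(["chr1", "chr2"], "sample.bam",
                                "ref.fasta", "dbsnp.vcf"):
        print(cmd)
```

Each string would be submitted as its own cluster job, leaving one recalibration table per chromosome.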