BaseRecalibrator - Trade-Off between runtime and accuracy
We are working with Illumina HiSeq 2000 paired-end data, and over time our lanes are yielding more and more reads.
We process data at the lane BAM level (a single read group per file). The procedure includes, among other steps, BWA mapping, indel realignment, duplicate flagging and base quality recalibration. As expected, the whole process takes a long time, but base recalibration is by far the longest stage, especially for lanes with many reads. We are using the QualityScoreCovariate, ReadGroupCovariate, ContextCovariate and CycleCovariate covariates.
For instance, two of our larger lanes:
1 lane of 140,000,000 pairs (280,000,000 reads): ~36 hours for recalibration
1 lane of 185,000,000 pairs (370,000,000 reads): ~48 hours for recalibration
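To put those two figures in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes that recalibration run time scales linearly with the number of reads, which we have not verified, and the function names are ours, for illustration only.

```python
# Lane sizes and recalibration run times quoted above.
lanes = [
    {"reads": 280_000_000, "hours": 36},
    {"reads": 370_000_000, "hours": 48},
]


def reads_per_hour(reads, hours):
    """Observed recalibration throughput for one lane."""
    return reads / hours


def estimated_hours(reads, rate):
    """Estimated run time, ASSUMING run time is linear in read count."""
    return reads / rate


rates = [reads_per_hour(lane["reads"], lane["hours"]) for lane in lanes]
mean_rate = sum(rates) / len(rates)

for lane, rate in zip(lanes, rates):
    print(f"{lane['reads']:,} reads: {rate / 1e6:.2f} M reads/hour")

# Example only: a hypothetical subset of 5,000,000 reads.
subset_hours = estimated_hours(5_000_000, mean_rate)
print(f"5,000,000 reads: about {subset_hours * 60:.0f} minutes")
```

Both lanes come out at a little under 8 million reads per hour, so under that linearity assumption the run time seems to be driven almost entirely by read count.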
We would obviously like to reduce this run time, and I found a short section on the topic (at the very end of the page) at the following link:
We are keen to downsample our BAM files to reduce run time, but we also want the recalibration to be as accurate as possible, for instance to help lower the false-positive substitution rate. If the wait is worth it, we will wait.
However, in the plot shown on that page, the x axis stops at 5,000,000 reads, where the RMSE appears to have reached a plateau.
1) Is there an empirical read-count threshold above which the accuracy of the recalibration no longer improves?
2) If such a threshold exists: I cannot find the '--process_nth_locus' switch described on the page linked above. Should I use the '-dt', '-dfrac' or '-dcov' options to downsample instead?
3) Does '--num_threads' work with the BaseRecalibrator walker? If so, up to how many threads?
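Regarding question 2, this is a minimal sketch of how we would work out the fraction to pass to a fractional downsampling option for our two lanes. The 5,000,000-read target is only the point where the x axis of the plot ends, not a validated threshold, and whether BaseRecalibrator honours such an option is exactly what we are asking. The helper name is ours.

```python
def downsample_fraction(total_reads, target_reads):
    """Fraction of reads to keep so that roughly target_reads remain.

    Capped at 1.0 so that small lanes are never 'upsampled'.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return min(1.0, target_reads / total_reads)


# Hypothetical target: where the plot's x axis stops, NOT a confirmed threshold.
TARGET_READS = 5_000_000

for total in (280_000_000, 370_000_000):
    fraction = downsample_fraction(total, TARGET_READS)
    print(f"{total:,} reads -> keep fraction {fraction:.4f}")
```

For our lanes this gives fractions of roughly 1 to 2 percent, which is why we would like to know whether accuracy really stops improving at that point before discarding so much data.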
Thanks a lot,
PS: The GATK version used is v2.0-23-ge9a19be