Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

parallelizing PrintReads

Hi Team,

So looking back at a previous analysis, I noticed that BaseRecalibrator took about 16 hours, but then PrintReads took another 48, which seems egregious.

My plan is to run PrintReads separately on different loci, each using the "merged" table from BaseRecalibrator, and then merge the resulting BAM files.

I'd be surprised if this isn't equivalent to running PrintReads on the whole genome at once, but I just wanted to check with the experts here.

Thanks,
Neil

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Neil,

    I don't think you're going to gain much if anything that way, since PrintReads is a whole lot of I/O and you still need to write out all the info to one big bam at the end. I guess if the point is to speed up the overall pipeline, you could data-parallelize the PrintReads step by chromosome, so you'd end up with a separate BAM for each chromosome, but then not gather them into a single BAM at the end. Instead, just pass the list of separate per-chromosome BAMs to the next tool, and the engine should take care of putting everything back together for you. Does that make sense?

  • neilwneilw Member

    Thanks, that does make sense. But I do expect a win because merging at the end using picard tools is an order of magnitude cheaper than PrintReads, for some reason. The real question is how many simultaneous PrintReads can run before I've saturated I/O on the disk being hit. I'm guessing that PrintReads taking 48 hours to blast through a single sample 50x dataset can't possibly be optimal, especially if Picard can merge in four hours.

    It's surprising that PrintReads would be so slow, and maybe I'll find out that it was simply strange network conditions several months ago when I did this.

    If it works out, I'll let you know.

    Regards,
    Neil

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah, fair point. We'd certainly be interested to hear about the results you get, so please do let us know either way.

  • mglclinicalmglclinical USAMember

    it would save time if PrintReads could be parallelized

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Work in that direction is happening in development of GATK4

  • mglclinicalmglclinical USAMember
    edited January 2016

    Hi @Geraldine_VdAuwera ,

    I have parallelized IndelRealignment step (RealignmentTargetCreator RTC + IndelRealigner IR) i.e. by running RTC and IR per chromosome concurrently and then merged the individual per chromosome realigned bam files into one file i.e. Realigned_AllChr_Merged.bam. I do know that the next step, BQSR, can take all individual per chromosome realigned bam files as input at once with the -I parameter, but I instead merged individual bam files into 1 merged bam file Realigned_AllChr_Merged.bam

    For the next 2 steps , BaseRecalibration & PrintReads, I gave the same file Realigned_AllChr_Merged.bam as input to both 2 steps. The output from BaseRecalibration is recal_data.table. For PrintReads step, I applied the recalibration with the option -BQSR recal_data.table. BaseRecalibration step took ~15 minutes, where as ~PrintReads took 45 minutes. PrintReads does not show increase in speed when the number of threads (-nct option) is increased.

    So, I have recalibrated on the entire genome and created the output recal_data.table file. I just want to restate that the recal_data.table file is calculated with the whole genome, it is not calculated with per chromosome bam file.

    Now, Can I split the bam file Realigned_AllChr_Merged.bam file into 25 bam files (Realigned_Chr1.bam, Realigned_Chr2.bam, Realigned_Chr3.bam ......Realigned_Chr22.bam, Realigned_ChrX.bam, Realigned_ChrY.bam, Realigned_ChrM.bam), one for each human chromosome, and then run PrintReads on all above 25 bam files simultaneously with the same "recal_data.table" file as -BQSR option ?

    By parallelizing the PrintReads step by chromosome, I am hoping to reduce the run-time of PrintReads significantly , Please let me know if this approach is recommended ?

    I believe I do not need to split the "recal_data.table" file, and it can be given as an input as a whole to PrintReads -BQSR option for every input bam file Realigned_Chr1.bam, Realigned_Chr2.bam etc. ?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Yes you can absolutely do that. You can even pass the resulting per-chromosome recalibrated files to HaplotypeCaller without merging them if you want to save on I/O.

    Another way to save on turnaround time when you're in a rush is to skip the PrintReads step entirely, and feed the realigned bams directly to HC along with -BQSR recal_data.table (because -BQSR is an engine argument that can be applied on the fly to any bam-eating tool). You can always generate the recalibrated bam on the side for archival purposes, but that step doesn't have to slow you down if you're looking to shorten time to having actionable calls for e.g. clinical purposes, where I understand turnaround time can be especially important.

  • newGATKusernewGATKuser CaseMember

    Hi @Geraldine_VdAuwera ,

    Thank you for the helpful tip. I was wondering if the same could work for MuTect (i.e., feed realigned bars directly to MuTect with the -BQSR option)? If so, is there a way to specify the BQSR for both the tumor and normal .bam in the command?

    For MuTect1, my command right now looks like this:
    java -Xmx4g -jar mutect-1.1.7.jar --analysis_type MuTect --reference_sequence GRCh37-lite.fa --intervals loci.bed --input_file:normal $in_normal_bam --input_file:tumor $in_tumor_bam --out call_stats.out

    Would the new command be something like:
    java -Xmx4g -jar mutect-1.1.7.jar --analysis_type MuTect --reference_sequence GRCh37-lite.fa --intervals loci.bed --input_file:normal $in_normal_bam -BQSR recal_data.normal.table --input_file:tumor $in_tumor_bam -BQSR recal_data.tumor.table --out call_stats.out

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    You can't specify two different BQSR recal files but you could simply run BaseRecalibrator on your tumor and normal together. The output recal file will contain information for each, as long as your read groups are correctly identified.

    We haven't ever tested the on the fly recalibration with Mutect2 but there's no technical reason for it not to work.

  • newGATKusernewGATKuser CaseMember

    That's wonderful! I didn't know I could run BaseRecalibrator with multiple files. This should cut down my processing time by a lot. Thanks again

  • mglclinicalmglclinical USAMember

    @Geraldine_VdAuwera , thanks for your earlier reply and for confirming the workflow that I described

Sign In or Register to comment.