To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Second run of BaseRecalibrator in parallel with HaplotypeCaller

SvyatoslavSidorovSvyatoslavSidorov St. Petersburg, RussiaMember
edited March 2016 in Ask the GATK team

Dear GATK team,

I'd like to ask a question about the possibility of HaplotypeCaller and AnalyzeCovariates running in parallel: I developed a QScript that runs indel realignment, BQSR, variant calling (obtaining gVCF file as a result) and then -- BaseRecalibrator for the second time followed by AnalyzeCovariates. Looking in QScript jobreport PDF file I noticed that the second run of BaseRecalibrator was performed after HaplotypeCaller, though as I can understand, HaplotypeCaller and the second run of BaseRecalibrator are independent regarding data and potentially can run in paralle.

Can I run them in parallel somehow in order to save time?

Issue · Github
by Sheila

Issue Number
671
State
open
Last Updated
Assignee
Array
Milestone
Array

Best Answer

Answers

  • SvyatoslavSidorovSvyatoslavSidorov St. Petersburg, RussiaMember

    Sorry, failed to attach a plot of jobs.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Sure, the QC portion of BQSR can be done independently of HaplotypeCaller and the rest of the pipeline.

  • SvyatoslavSidorovSvyatoslavSidorov St. Petersburg, RussiaMember
    edited March 2016

    Geraldine, thank you for your answer! Can I somehow make sure of this parallelization in my Queue script? (It doesn't do this parallelization automatically, as we can see from the job plot.)

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Not really; queue will line up jobs based on when their inputs become available, but you can't specifically control the order.

  • SvyatoslavSidorovSvyatoslavSidorov St. Petersburg, RussiaMember

    I went through my QScript again and failed to see any dependencies that could prevent Queue from running HaplotypeCaller and BaseRecalibrator (second time) in parallel. Here is the corresponding snippet of my QScript:

    ` // Generate the second recalibrarion table for BQSR quality control
    val baseRecalibrator2 = new BaseRecalibrator
    baseRecalibrator2.input_file :+= indelRealigner.out
    baseRecalibrator2.intervals :+= qscript.processIntervalsFile
    baseRecalibrator2.reference_sequence = qscript.referenceFile
    baseRecalibrator2.knownSites = knownFilesBQSR
    baseRecalibrator2.BQSR = baseRecalibrator1.out
    baseRecalibrator2.out = qscript.recalTableFile2
    if (qscript.sgBR != 1) {
    baseRecalibrator2.scatterCount = qscript.sgBR
    }
    baseRecalibrator2.nct = qscript.nctBR
    add(baseRecalibrator2)

         // Perform QC with AnalyzeCovariates
         val analyzeCovariates = new AnalyzeCovariates
         analyzeCovariates.intervals :+= qscript.processIntervalsFile
         analyzeCovariates.reference_sequence = qscript.referenceFile
         analyzeCovariates.beforeReportFile = baseRecalibrator1.out
         analyzeCovariates.afterReportFile = baseRecalibrator2.out
         analyzeCovariates.plotsReportFile = qscript.plotsReportFile
         analyzeCovariates.intermediateCsvFile = qscript.intermediateCsvFile
         add(analyzeCovariates)
    
         // Call variants with HaplotypeCaller and obtain a VCF/gVCF file for the sample
         val haplotypeCaller = new HaplotypeCaller
         haplotypeCaller.input_file :+= printReads.out
         haplotypeCaller.intervals :+= qscript.processIntervalsFile
         haplotypeCaller.reference_sequence = qscript.referenceFile
    
         if (qscript.pcrModel == "CONSERVATIVE") {
             haplotypeCaller.pcr_indel_model = PCR_ERROR_MODEL.CONSERVATIVE
         } else if (qscript.pcrModel == "NONE") { // for PCR-free data
             haplotypeCaller.pcr_indel_model = PCR_ERROR_MODEL.NONE
         } else if (qscript.pcrModel == "AGGRESSIVE") {
             haplotypeCaller.pcr_indel_model = PCR_ERROR_MODEL.AGGRESSIVE
         } else if (qscript.pcrModel == "HOSTILE") {
             haplotypeCaller.pcr_indel_model = PCR_ERROR_MODEL.HOSTILE
         }
    
         haplotypeCaller.dbsnp = qscript.dbsnpFile
    
         if (qscript.emitRefConfidence == "GVCF") {
             haplotypeCaller.emitRefConfidence = ReferenceConfidenceMode.GVCF
             haplotypeCaller.variant_index_type = GATKVCFIndexType.LINEAR
             haplotypeCaller.variant_index_parameter = Some(128000)
         } else if (qscript.emitRefConfidence == "NONE") {
             haplotypeCaller.emitRefConfidence = ReferenceConfidenceMode.NONE
         } else if (qscript.emitRefConfidence == "BP_RESOLUTION") {
             haplotypeCaller.emitRefConfidence = ReferenceConfidenceMode.BP_RESOLUTION
         }
    
         haplotypeCaller.out = qscript.gvcfFile
    
         if (qscript.nctHC != 1) {
             haplotypeCaller.nct = qscript.nctHC
         }
    
         if (qscript.sgHC != 1) {
             haplotypeCaller.scatterCount = qscript.sgHC
         }
    
         add(haplotypeCaller)`
    

    This is the end of the script function. (Sorry for code formatting inconsistency...)

    Maybe Queue can't run independent jobs in parallel because I have no job runner on my server?

  • SvyatoslavSidorovSvyatoslavSidorov St. Petersburg, RussiaMember

    Geraldine, thank you very much for your answer! I'll go through the release notes for 3.5!

  • SvyatoslavSidorovSvyatoslavSidorov St. Petersburg, RussiaMember

    Yes, that was the case! I just added -jobRunner ParallelShell to my Queue command, and all independent tasks began to run in parallel. Thanks again!

Sign In or Register to comment.