Data processing steps using a lot of memory

prussell · Denver, CO · Member

Hi,

I'm using a QScript to run the GATK best practices on LSF. I'm trying to run the data processing steps now. RealignerTargetCreator and IndelRealigner are using so much memory that my jobs are being killed by LSF. Usually they don't even make it through RealignerTargetCreator, but the ones that do make it have a memory explosion in IndelRealigner and get killed at that step.

I'm testing the pipeline with a very small dataset: 3 bam files of about 10 million reads each. Even when I downsample them to 1 million reads each, the memory still explodes. I've tried requesting up to 32g of memory. Higher memory requests allow the jobs to run longer, but they eventually outgrow even 32g.
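
(For reference, a quick way to produce a downsampled test bam like the ones described here, if samtools is available — the file names below are just placeholders:)

```shell
# Keep roughly 10% of reads; in "-s 42.1" the integer part (42) is the
# random seed and the fractional part (.1) is the fraction of reads kept
samtools view -b -s 42.1 B11.sorted.rg.bam > B11.sorted.rg.downsample.bam

# Index the result so GATK can use it
samtools index B11.sorted.rg.downsample.bam
```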

A few notes:

  • I'm using GATK/Queue 3.5-0.
  • The "-memLimit" option is working correctly so that my jobs are submitted with the correct memory request. LSF is reporting that the actual memory usage is indeed extremely high.
  • I've validated my bam files with the Picard validation tool.
  • I'm using 1000G_phase1.indels.hg19.vcf and dbsnp_137.hg19.vcf from the GATK resource bundle downloaded in early 2013.
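
For context, the pipeline is launched through Queue roughly as follows. The jar and file paths are placeholders for our local setup; `-memLimit` is Queue's per-job memory limit in gigabytes, and `Lsf706` is Queue's built-in LSF job runner:

```shell
# Submit the QScript's jobs to LSF via Queue
java -Xmx2g -jar Queue.jar \
  -S GatkBestPractices.scala \
  -R ucsc.hg19.fasta \
  -I B11.sorted.rg.downsample.bam \
  -KNOWN_INDELS 1000G_phase1.indels.hg19.vcf \
  -KNOWN_VARIANTS dbsnp_137.hg19.vcf \
  -od output_dir \
  -op test_run \
  -jobRunner Lsf706 \
  -memLimit 32 \
  -run
```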

Do you have any idea what could be happening?

Here is an example of the output from LSF:

Sender: LSF System <hpcadmin@c27>
Subject: Job 57251: <RealignerTargetCreator: B11.sorted.rg.downsample.interval_list> Exited

Job <RealignerTargetCreator: B11.sorted.rg.downsample.interval_list> was submitted from host <c15> by user <russellp> in cluster <compbio_cluster1>.
Job was executed on host(s) <c27>, in queue <shared>, as user <russellp> in cluster <compbio_cluster1>.
</home/russellp> was used as the home directory.
</home/russellp/Projects/IPF_resequencing/GATK_testing> was used as the working directory.
Started at Fri May 20 16:10:00 2016
Results reported at Fri May 20 16:11:03 2016

Your job looked like:

------------------------------------------------------------
# LSBATCH: User input
sh /home/russellp/Projects/IPF_resequencing/GATK_testing/.queue/tmp/.exec5219091543152698331
------------------------------------------------------------

TERM_MEMLIMIT: job killed after reaching LSF memory usage limit.
Exited with exit code 130.

Resource usage summary:

    CPU time   :     79.19 sec.
    Max Memory :      7951 MB
    Max Swap   :     34733 MB

    Max Processes  :         4
    Max Threads    :        29

Here is my QScript:

package qscripts

import org.broadinstitute.gatk.queue.QScript
//import org.broadinstitute.gatk.queue.extensions.picard.MarkDuplicates
import org.broadinstitute.gatk.queue.extensions.gatk._

/**
  * Created by prussell on 3/3/16.
  * Implements GATK best practices
  */
class GatkBestPractices extends QScript {

  /**
    * *********************************************************************
    *                               INPUTS
    * *********************************************************************
    */

  /**
    * Reference genome fasta
    */
  @Input(doc="Reference genome for the bam files", shortName="R", fullName="REF_FASTA", required=true)
  var referenceFile: File = null

  /**
    * Bam files
    */
  @Input(doc="One or more bam files", shortName="I", fullName="INPUT_BAM", required=true)
  var bamFiles: List[File] = Nil

  /**
    * VCF files
    */
  @Input(doc="VCF file(s) with known indels", fullName="KNOWN_INDELS", required=true)
  var knownIndels: List[File] = Nil
  @Input(doc="Database of known variants e.g. dbSNP", fullName="KNOWN_VARIANTS", required=true)
  var knownPolymorphicSites: List[File] = Nil

  /**
    * Output directory
    */
  @Input(doc="Output directory", shortName="od", fullName="OUT_DIR", required=true)
  var outputDirectory : File = null

  /**
    * Output file prefix not including directory
    */
  @Input(doc="Output prefix not including directory", shortName="op", fullName="OUT_PREFIX", required=true)
  var outputPrefix : File = null

  /**
    * Common arguments
    */
  trait CommonArguments extends CommandLineGATK {
    this.reference_sequence = referenceFile
    // TODO other common arguments?
  }

  def script() = {    

    /**
      * Container for the processed bam files
      */
    var processedFiles = Seq.empty[File]


    /**
      * Data processing
      */
    for(bam <- bamFiles) {

      /**
        * *********************************************************************
        * LOCAL REALIGNMENT AROUND INDELS
        * https://www.broadinstitute.org/gatk/guide/article?id=38
        * https://www.broadinstitute.org/gatk/guide/article?id=2800
        * *********************************************************************
        */

      /**
        * Local realignment around indels
        * Step 1 of 2: RealignerTargetCreator: Define intervals to target for local realignment
        * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php
        */
      val realignerTargetCreator = new RealignerTargetCreator with CommonArguments
      realignerTargetCreator.input_file +:= bam
      realignerTargetCreator.known = knownIndels
      realignerTargetCreator.out = swapExt(bam, "bam", "interval_list")
      realignerTargetCreator.maxIntervalSize = int2intOption(500) // Default 500
      realignerTargetCreator.minReadsAtLocus = int2intOption(4) // Default 4
      realignerTargetCreator.mismatchFraction = double2doubleOption(0.0) // Default 0.0
      realignerTargetCreator.windowSize = int2intOption(10) // Default 10
      add(realignerTargetCreator)

      /**
        * Local realignment around indels
        * Step 2 of 2: IndelRealigner: Perform local realignment of reads around indels
        * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php
        */
      val indelRealigner = new IndelRealigner with CommonArguments
      indelRealigner.targetIntervals = realignerTargetCreator.out
      indelRealigner.input_file +:= bam
      indelRealigner.knownAlleles = knownIndels
      indelRealigner.out = swapExt(bam, "bam", "realign.bam")
      indelRealigner.consensusDeterminationModel = null
      indelRealigner.LODThresholdForCleaning = double2doubleOption(5.0) // Default 5.0
      indelRealigner.nWayOut = null
      indelRealigner.entropyThreshold = double2doubleOption(0.15) // Default 0.15
      indelRealigner.maxConsensuses = int2intOption(30) // Default 30
      indelRealigner.maxIsizeForMovement = int2intOption(3000) // Default 3000
      indelRealigner.maxPositionalMoveAllowed = int2intOption(200) // Default 200
      indelRealigner.maxReadsForConsensuses = int2intOption(120) // Default 120
      indelRealigner.maxReadsForRealignment = int2intOption(20000) // Default 20000
      indelRealigner.maxReadsInMemory = int2intOption(150000) // Default 150000
      indelRealigner.noOriginalAlignmentTags = false
      add(indelRealigner)

      /**
        * *********************************************************************
        * BASE QUALITY SCORE RECALIBRATION
        * https://www.broadinstitute.org/gatk/guide/article?id=44
        * https://www.broadinstitute.org/gatk/guide/article?id=2801
        * *********************************************************************
        */

      /**
        * Base quality score recalibration
        * Step 1 of 3: BaseRecalibrator: Generate base recalibration table to compensate for systematic errors in basecalling confidences
        * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_bqsr_BaseRecalibrator.php
        */

      // Generate the first pass recalibration table file
      val baseRecalibratorBefore = new BaseRecalibrator with CommonArguments
      baseRecalibratorBefore.input_file +:= indelRealigner.out
      baseRecalibratorBefore.out = swapExt(indelRealigner.out, "realign.bam", "base_recalibrator_first_pass.out")
      baseRecalibratorBefore.knownSites = knownPolymorphicSites
      baseRecalibratorBefore.indels_context_size = int2intOption(3) // Default 3
      baseRecalibratorBefore.maximum_cycle_value = int2intOption(500) // Default 500
      baseRecalibratorBefore.mismatches_context_size = int2intOption(2) // Default 2
      baseRecalibratorBefore.solid_nocall_strategy = null
      baseRecalibratorBefore.solid_recal_mode = null
      baseRecalibratorBefore.list = false
      baseRecalibratorBefore.lowMemoryMode = false
      baseRecalibratorBefore.no_standard_covs = false
      baseRecalibratorBefore.sort_by_all_columns = false
      baseRecalibratorBefore.binary_tag_name = null
      baseRecalibratorBefore.bqsrBAQGapOpenPenalty = double2doubleOption(40.0) // Default 40.0
      baseRecalibratorBefore.deletions_default_quality = int2byteOption(45) // Default 45
      baseRecalibratorBefore.insertions_default_quality = int2byteOption(45) // Default 45
      baseRecalibratorBefore.low_quality_tail = int2byteOption(2) // Default 2
      baseRecalibratorBefore.mismatches_default_quality = int2byteOption(-1) // Default -1
      baseRecalibratorBefore.quantizing_levels = int2intOption(16) // Default 16
      baseRecalibratorBefore.run_without_dbsnp_potentially_ruining_quality = false
      add(baseRecalibratorBefore)

      // Generate the second pass recalibration table file
      val baseRecalibratorAfter = new BaseRecalibrator with CommonArguments
      baseRecalibratorAfter.BQSR = baseRecalibratorBefore.out
      baseRecalibratorAfter.input_file +:= indelRealigner.out
      baseRecalibratorAfter.out = swapExt(indelRealigner.out, "realign.bam", "base_recalibrator_second_pass.out")
      baseRecalibratorAfter.knownSites = knownPolymorphicSites
      baseRecalibratorAfter.indels_context_size = int2intOption(3) // Default 3
      baseRecalibratorAfter.maximum_cycle_value = int2intOption(500) // Default 500
      baseRecalibratorAfter.mismatches_context_size = int2intOption(2) // Default 2
      baseRecalibratorAfter.solid_nocall_strategy = null
      baseRecalibratorAfter.solid_recal_mode = null
      baseRecalibratorAfter.list = false
      baseRecalibratorAfter.lowMemoryMode = false
      baseRecalibratorAfter.no_standard_covs = false
      baseRecalibratorAfter.sort_by_all_columns = false
      baseRecalibratorAfter.binary_tag_name = null
      baseRecalibratorAfter.bqsrBAQGapOpenPenalty = double2doubleOption(40.0) // Default 40.0
      baseRecalibratorAfter.deletions_default_quality = int2byteOption(45) // Default 45
      baseRecalibratorAfter.insertions_default_quality = int2byteOption(45) // Default 45
      baseRecalibratorAfter.low_quality_tail = int2byteOption(2) // Default 2
      baseRecalibratorAfter.mismatches_default_quality = int2byteOption(-1) // Default -1
      baseRecalibratorAfter.quantizing_levels = int2intOption(16) // Default 16
      baseRecalibratorAfter.run_without_dbsnp_potentially_ruining_quality = false
      add(baseRecalibratorAfter)

      /**
        * Base quality score recalibration
        * Step 2 of 3: AnalyzeCovariates: Create plots to visualize base recalibration results
        * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_bqsr_AnalyzeCovariates.php
        */
      val analyzeCovariates = new AnalyzeCovariates with CommonArguments
      analyzeCovariates.beforeReportFile = baseRecalibratorBefore.out
      analyzeCovariates.afterReportFile = baseRecalibratorAfter.out
      analyzeCovariates.plotsReportFile = new File(outputDirectory.getAbsolutePath + "/" + outputPrefix + "_BQSR.pdf")
      analyzeCovariates.intermediateCsvFile = new File(outputDirectory.getAbsolutePath + "/" + outputPrefix + "_BQSR.csv")
      analyzeCovariates.ignoreLMT = false
      add(analyzeCovariates)

      /**
        * Base quality score recalibration
        * Step 3 of 3: PrintReads: Write out sequence read data (for filtering, merging, subsetting etc)
        * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_readutils_PrintReads.php
        */
      val printReads = new PrintReads with CommonArguments
      printReads.input_file +:= indelRealigner.out // Recalibrate the realigned bam, not the original input
      printReads.BQSR = baseRecalibratorBefore.out // Apply the first-pass table; the second-pass table is only for the AnalyzeCovariates plots
      printReads.out = swapExt(bam, "bam", "recalibrated.bam")
      add(printReads) // Without this call the PrintReads job is never scheduled

      processedFiles +:= printReads.out

    }


    /**
      * *********************************************************************
      *                        VARIANT DISCOVERY
      * https://www.broadinstitute.org/gatk/guide/bp_step.php?p=2
      * *********************************************************************
      */

    /**
      * Variant discovery
      * Step 1 of 6: HaplotypeCaller: Call germline SNPs and indels via local re-assembly of haplotypes
      * https://www.broadinstitute.org/gatk/guide/article?id=2803
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_haplotypecaller_HaplotypeCaller.php
      * https://www.broadinstitute.org/gatk/guide/article?id=3893
      */
    // TODO

    /**
      * Variant discovery
      * Step 2 of 6: CombineGVCFs: Combine per-sample gVCF files produced by HaplotypeCaller into a multi-sample gVCF file
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_CombineGVCFs.php
      */
    // TODO

    /**
      * Variant discovery
      * Step 3 of 6: GenotypeGVCFs: Perform joint genotyping on gVCF files produced by HaplotypeCaller
      * https://www.broadinstitute.org/gatk/guide/article?id=3893
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeGVCFs.php
      */
    // TODO

    /**
      * Variant discovery
      * Step 4 of 6: VariantFiltration: Filter variant calls based on INFO and FORMAT annotations
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_filters_VariantFiltration.php
      */
    // TODO

    /**
      * Variant discovery
      * Step 5 of 6: VariantRecalibrator: Build a recalibration model to score variant quality for filtering purposes
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_VariantRecalibrator.php
      * https://www.broadinstitute.org/gatk/guide/article?id=39
      * https://www.broadinstitute.org/gatk/guide/article?id=2805
      *
      */
    // TODO

    /**
      * Variant discovery
      * Step 6 of 6: ApplyRecalibration: Apply a score cutoff to filter variants based on a recalibration table
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantrecalibration_ApplyRecalibration.php
      * https://www.broadinstitute.org/gatk/guide/article?id=2806
      */
    // TODO

    /**
      * *********************************************************************
      *                        CALLSET REFINEMENT
      * https://www.broadinstitute.org/gatk/guide/bp_step.php?p=3
      * *********************************************************************
      */

    /**
      * Callset refinement
      * Step 1 of 8: CalculateGenotypePosteriors: Calculate genotype posterior likelihoods given panel data
      * https://www.broadinstitute.org/gatk/guide/article?id=4727
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_CalculateGenotypePosteriors.php
      */
    // TODO

    /**
      * Callset refinement
      * Step 2 of 8: VariantFiltration: Filter variant calls based on INFO and FORMAT annotations
      * https://www.broadinstitute.org/gatk/guide/article?id=4727
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_filters_VariantFiltration.php
      */
    // TODO

    /**
      * Callset refinement
      * Step 3 of 8: VariantAnnotator: Annotate variant calls with context information
      * https://www.broadinstitute.org/gatk/guide/article?id=4727
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_annotator_VariantAnnotator.php
      */
    // TODO

    /**
      * Callset refinement
      * Step 4 of 8: SelectVariants: Select a subset of variants from a larger callset
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_SelectVariants.php
      */
    // TODO

    /**
      * Callset refinement
      * Step 5 of 8: CombineVariants: Combine variant records from different sources
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php
      */
    // TODO

    /**
      * Callset refinement
      * Step 6 of 8: VariantEval: General-purpose tool for variant evaluation (% in dbSNP, genotype concordance, Ti/Tv ratios, and a lot more)
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_varianteval_VariantEval.php
      */
    // TODO

    /**
      * Callset refinement
      * Step 7 of 8: VariantsToTable: Extract specific fields from a VCF file to a tab-delimited table
      * https://www.broadinstitute.org/gatk/guide/tooldocs/org_broadinstitute_gatk_tools_walkers_variantutils_VariantsToTable.php
      */
    // TODO

    /**
      * Callset refinement
      * Step 8 of 8: GenotypeConcordance (Picard version): Genotype concordance
      * https://broadinstitute.github.io/picard/command-line-overview.html#GenotypeConcordance
      */
    // TODO



  }

}

Issue · GitHub
by Sheila

Issue Number: 918
State: closed
Closed By: chandrans
Answers

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @prussell
    Hi,

    It looks like you are specifying a lot of advanced parameters. Some of those could significantly increase memory requirements. Can you comment out all the lines specifying non-required parameters (so the defaults are used) and see if the memory drain is still a problem?

    Thanks,
    Sheila

  • prussell · Denver, CO · Member

    Hi Sheila,

    Thanks for your reply. I tried commenting out the non-required parameters and still saw the same memory issues. I should mention that I'm working with targeted DNA-seq data, so we only have coverage for a handful of regions genome-wide. Could that be affecting the memory usage?

    Thanks,
    Pam

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @prussell
    Hi Pam,

    Sorry for the delay in responding. Can you tell us what happens to memory usage if you just run the indel realignment tools individually, not through Queue? This will help us determine whether the problem is at the Queue pipeline level, or at the tools and/or dataset level.
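
    For instance, the two realignment steps can be run standalone along these lines (file names are placeholders; the `-Xmx` value should match your LSF memory request):

    ```shell
    # Step 1: define intervals to target for local realignment
    java -Xmx8g -jar GenomeAnalysisTK.jar \
      -T RealignerTargetCreator \
      -R ucsc.hg19.fasta \
      -I B11.sorted.rg.downsample.bam \
      -known 1000G_phase1.indels.hg19.vcf \
      -o B11.intervals

    # Step 2: realign reads over those intervals
    java -Xmx8g -jar GenomeAnalysisTK.jar \
      -T IndelRealigner \
      -R ucsc.hg19.fasta \
      -I B11.sorted.rg.downsample.bam \
      -targetIntervals B11.intervals \
      -known 1000G_phase1.indels.hg19.vcf \
      -o B11.realign.bam
    ```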

    Thanks,
    Sheila

  • prussell · Denver, CO · Member
    edited June 2016

    Hi Sheila,

    I've tried running the tools individually and they run with no memory explosion. I'm submitting them to LSF with the same memory requests I've instructed Queue to use. (I've verified that the memory requests from Queue are being received by LSF.) I think the problem must be either within Queue, or in the interface between Queue and LSF. Thanks again for your help with this.

    Thanks,
    Pam

  • prussell · Denver, CO · Member
    edited June 2016

    Hi Sheila,

    I'm thinking this might be a problem on our end, as the memory explosions seem to be nondeterministic. Unless you have an idea of what could be happening, I will talk to our cluster administrators about this.

    Thanks,
    Pam
