Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

counting bams in a bam list with Queue

flescaiflescai Posts: 53Member ✭✭
edited January 2013 in Ask the GATK team

Hi there, I wanted to reproduce in my variant calling Queue script the same conditional you have in MethodsDevelopmenCallingPipeline, i.e. including InbreedingCoeff depending on the number of samples. However, in that script the number of samples is passed to the Target object as an integer, and I would like to count it from the bam file list passed as an input to the script.

Therefore I followed the method in DataProcessingPipeline, i.e.

import org.broadinstitute.sting.queue.util.QScriptUtils
[...]
@Input(doc="input BAM file - or list of BAM files", fullName="input", shortName="I", required=true)
var bamFile: File = _
[...]
val bamFilesList = QScriptUtils.createSeqFromFile(bamFile)
val sampleNo = bamFilesList.size

But unfortunately, despite DataProcessingPipeline works just fine, when I put these lines in my other script I get the following error:

INFO  12:48:08,616 HelpFormatter - Date/Time: 2012/11/08 12:48:08 
INFO  12:48:08,616 HelpFormatter - ---------------------------------------------------------------------- 
INFO  12:48:08,616 HelpFormatter - ---------------------------------------------------------------------- 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
org.broadinstitute.sting.utils.exceptions.DynamicClassResolutionException: Could not create module HaplotypeCallerStep because      Cannot instantiate class (Invocation failure) caused by exception null
at org.broadinstitute.sting.utils.classloader.PluginManager.createByType(PluginManager.java:306)
at org.broadinstitute.sting.utils.classloader.PluginManager.createAllTypes(PluginManager.java:317)
at org.broadinstitute.sting.queue.QCommandLine.execute(QCommandLine.scala:126)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:236)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:146)
at org.broadinstitute.sting.queue.QCommandLine$.main(QCommandLine.scala:62)
at org.broadinstitute.sting.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-2-gf44cc4e):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR Visit our website and forum for extensive documentation and answers to 
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Could not create module HaplotypeCallerStep because Cannot instantiate class (Invocation failure) caused by exception null
##### ERROR ------------------------------------------------------------------------------------------

I tried several alternatives looking at the imports in DataProcessingPipeline but maybe I am missing something. Could you please advise?

thanks very much Francesco

Post edited by Geraldine_VdAuwera on
Tagged:

Best Answer

  • Johan_DahlbergJohan_Dahlberg Posts: 85 ✭✭✭
    Answer ✓

    The problem is that you are calling QScriptUtils.createSeqFromFile(bamFile) in the body of the script. I think (my knowledge of Scala is not as complete as I would like it to be) this causes the scala compiler to try to include the code in the constructor, and thus calling createSeqFromFile with null, as this variable is only set at runtime. This causes the script to fail. If you put the function call in the script portion you should be fine. :) There is a minimal example of this which works on my end:

    package org.broadinstitute.sting.queue.qscripts
    
    import org.broadinstitute.sting.queue.QScript
    import org.broadinstitute.sting.queue.util.QScriptUtils
    
    
    class HaplotypeCallerStep extends QScript {
        // Create an alias 'qscript' to be able to access variables
        // in the HaplotypeCallerStep.
        // 'qscript' is now the same as 'HaplotypeCallerStep.this'
        qscript =>
    
    
        // Required arguments.  All initialized to empty values.
    
        @Input(doc="input BAM file - or list of BAM files", fullName="input", shortName="I", required=true)
        var bamFile: File = _
    
        /***************************************************
        * main script
        ***************************************************/
    
        def script() {
    
            val bamFilesList = QScriptUtils.createSeqFromFile(bamFile)
            val sampleNo = bamFilesList.size
    
            System.err.println("------ samples ------")
            System.err.println(sampleNo)
            System.err.println("======================")
        }
    
    }
    

    Hope this solution works for you.

Answers

  • ebanksebanks Posts: 679GATK Developer mod

    Hi Francesco,

    Unfortunately, we just don't have the resources to help people with programming questions. Perhaps you'll have better luck posting this in the Ask the Community section.

    I will say that it looks like your problem is unrelated and has something to do with the Haplotype Caller. Perhaps you are using GATK lite? Good luck!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • flescaiflescai Posts: 53Member ✭✭

    Hi Eric, I assumed QScriptUtils was part of Queue and createSeqFromFile not to be a general scala method, and therefore that's why I posted here the question.

    I am using GATK2-2.2 full version, not the light one.

  • Johan_DahlbergJohan_Dahlberg Posts: 85Member ✭✭✭

    QScriptUtils and createSeqFromFile are part of Queue. But I'm 99% sure that inn't what is causing your problem. As Eric points out, it is more likely something wrong in the haplotype caller part. As stated in the error message: "Could not create module HaplotypeCallerStep because Cannot instantiate class (Invocation failure) caused by exception null"

    If you can post your entire script here, or somewhere else, it would be easier to give an answer as to what is causing the problem. My guess is that it is caused by a problem in your extension class for the Haplotype caller.

  • flescaiflescai Posts: 53Member ✭✭

    Thanks Johan, that's really very kind of you.

    Apart the few lines I wrote above, I didn't modify the extension of the class in other parts, and without reference to import org.broadinstitute.sting.queue.util.QScriptUtils it works (if I don't add the method of course)

    This is my script, it might be useful to other people as well. Thanks

        package org.broadinstitute.sting.queue.qscripts
    
        import org.broadinstitute.sting.queue.QScript
        import org.broadinstitute.sting.queue.extensions.gatk._
    
        import org.broadinstitute.sting.queue.util.QScriptUtils
    
    
        /**
         *  TARGETS HAVE BEEN SUBSTITUTED WITH CLASSES IN ORDER TO BETTER IMPORT
         * THE NAMES OF THE DIFFERENT FILES, FOLLOWING THE EXAMPLE OF THE
         * methodsDevelopment script
         */
    
    
        class HaplotypeCallerStep extends QScript {
          // Create an alias 'qscript' to be able to access variables
          // in the HaplotypeCallerStep.
          // 'qscript' is now the same as 'HaplotypeCallerStep.this'
          qscript =>
    
    
          // Required arguments.  All initialized to empty values.
    
          @Input(doc="input BAM file - or list of BAM files", fullName="input", shortName="I", required=true)
          var bamFile: File = _
    
    
          @Input(doc="The reference file for the bam files.", shortName="R", required=true)
          var referenceFile: File = _ // _ is scala shorthand for null
    
          @Argument(doc="the project name determines the final output base name", fullName="project", shortName="p", required=true)
          var projectName: String = _
    
          // The following arguments are all optional.
    
          @Input(doc="An optional file with a list of intervals to proccess.", shortName="L", required=false)
          var intervals: File = _
    
          @Argument(doc="A optional list of filter names.", shortName="filter", required=false)
          var filterNames: List[String] = Nil // Nil is an empty List, versus null which means a non-existent List.
    
          @Argument(doc="An optional list of filter expressions.", shortName="filterExpression", required=false)
          var filterExpressions: List[String] = Nil
    
    
        /*********************************************************
        * definitions of names and files within a Target class
        **********************************************************/
    
    
        class Target {
    
            // DEFINE file names
    
            val clusterFile = new File(qscript.projectName + "hc.clusters")
            val rawVCF = new File(qscript.projectName + ".raw.vcf")
            val annoVCF = new File(qscript.projectName + ".annotated.vcf")
            val recalibratedVCF = new File(qscript.projectName + ".HaplotypeCaller.recalibrated.vcf")
            val tranchesFile = new File(qscript.projectName + ".tranches")
            val vqsrRscript = new File(qscript.projectName + ".vqsr.r")
            val recalFile = new File(qscript.projectName + ".tranches.recal")
    
    
        }
    
            // DEFINE resources location
    
            val bamFilesList = QScriptUtils.createSeqFromFile(bamFile)
            val sampleNo = bamFilesList.size
            System.err.println("------ samples ------")
            System.err.println(sampleNo)
            System.err.println("======================")
    
            val b37 = "/share/apps/genomics/reference/human_g1k_v37.fasta"
            val dbSNP_b37 = "/share/apps/genomics/reference/gatkresources_hg19_1.5/ftp.broadinstitute.org/bundle/1.5/b37/dbsnp_135.b37.vcf" 
            val hapmap_b37 = "/share/apps/genomics/reference/gatkresources_hg19_1.5/ftp.broadinstitute.org/bundle/1.5/b37/hapmap_3.3.b37.sites.vcf"
            val omni_b37 = "/share/apps/genomics/reference/gatkresources_hg19_1.5/ftp.broadinstitute.org/bundle/1.5/b37/1000G_omni2.5.b37.sites.vcf"
            val mills_b37 = "/share/apps/genomics/reference/gatkresources_hg19_1.5/ftp.broadinstitute.org/bundle/1.5/b37/Mills_and_1000G_gold_standard.indels.b37.sites.vcf"
    
    
        /*********************************************************
        * extension of the arguments
        *********************************************************/
    
          // This trait allows us set the variables below in one place,
          // and then reuse this trait on each CommandLineGATK function below.
    
          trait UNIVERSAL_GATK_ARGS extends CommandLineGATK {
            logging_level = "INFO";
          }
    
          class HaplotypeCallerArguments extends HaplotypeCaller with UNIVERSAL_GATK_ARGS {
            this.reference_sequence = qscript.referenceFile
            this.intervals = if (qscript.intervals == null) Nil else List(qscript.intervals)
            // Set the memory limit to 6 gigabytes on each command.
            this.memoryLimit = 6
            this.input_file :+= qscript.bamFile
            this.D = qscript.dbSNP_b37
          }
    
            class EvalArguments extends VariantEval with UNIVERSAL_GATK_ARGS {
            this.reference_sequence = qscript.referenceFile
            // Set the memory limit to 6 gigabytes on each command.
            this.memoryLimit = 6
            this.D = qscript.dbSNP_b37
          }
    
          class AnnotationArguments extends VariantAnnotator with UNIVERSAL_GATK_ARGS {
            this.reference_sequence = qscript.referenceFile
            // Set the memory limit to 7 gigabytes on each command.
            this.memoryLimit = 7
            this.input_file :+= qscript.bamFile
            //this.annotation ++= List("QD", "QDE", "HaplotypeScore", "MQRankSum", "ReadPosRankSum", "FS", "MQ")
            //this.useAllAnnotations // DISABLED, in order to annotate only with rs identifiers
            this.D = qscript.dbSNP_b37
          }
    
          class VQSRArguments (t: Target) extends VariantRecalibrator with UNIVERSAL_GATK_ARGS {
            this.reference_sequence = qscript.referenceFile
            this.intervals = if (qscript.intervals == null) Nil else List(qscript.intervals)
            // Set the memory limit to 4 gigabytes on each command.
            this.memoryLimit = 4
            this.input :+= t.rawVCF
            this.maxGaussians = 6
            this.resource :+= new TaggedFile( hapmap_b37, "hapmap,known=false,training=true,truth=true,prior=15.0" )
            this.resource :+= new TaggedFile( omni_b37, "omni,known=false,training=true,truth=false,prior=12.0" )
            this.resource :+= new TaggedFile( dbSNP_b37, "dbsnp,known=true,training=false,truth=false,prior=6.0" )
            this.resource :+= new TaggedFile( mills_b37, "mills,known=true,training=true,truth=true,prior=12.0" )
            this.use_annotation ++= List("QD", "MQRankSum", "ReadPosRankSum", "FS", "MQ", "InbreedingCoeff", "ClippingRankSum")
            this.tranches_file = t.tranchesFile
            this.recal_file = t.recalFile
            this.rscript_file = t.vqsrRscript
            this.mode = org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorArgumentCollection.Mode.BOTH
            this.analysisName = qscript.projectName + "_VQSRs"
          }
    
          class ApplyArguments (t: Target) extends ApplyRecalibration with UNIVERSAL_GATK_ARGS {
            this.reference_sequence = qscript.referenceFile
            this.intervals = if (qscript.intervals == null) Nil else List(qscript.intervals)
            // Set the memory limit to 4 gigabytes on each command.
            this.memoryLimit = 4
            this.input :+= t.rawVCF
            this.tranches_file = t.tranchesFile
            this.recal_file = t.recalFile
            this.ts_filter_level = 97.0
            this.mode = org.broadinstitute.sting.gatk.walkers.variantrecalibration.VariantRecalibratorArgumentCollection.Mode.BOTH
          }
    
    
        /***************************************************
        * main script
        ***************************************************/
    
          def script() {
    
            // Create the four functions that we may run depending on options.
            val target = new Target
            val genotyper = new HaplotypeCallerArguments
            val evalUnfiltered = new EvalArguments
    
            genotyper.scatterCount = 300
            genotyper.out = target.rawVCF
            genotyper.jobName = "HaplotypeCaller"
    
            evalUnfiltered.eval :+= genotyper.out
            evalUnfiltered.out = swapExt(genotyper.out, "vcf", "eval")
            evalUnfiltered.jobName = "UnfiltEval"
    
            // execute genotyping and eval
    
            add(genotyper, evalUnfiltered)
    
    
            // sets the VQSR
    
            val recalibrator = new VQSRArguments(target)
            recalibrator.tranches_file = target.tranchesFile
            recalibrator.jobName = "VQSR"
    
            // sets Appply recalibration
    
            val applyrecal = new ApplyArguments(target)
            applyrecal.tranches_file = target.tranchesFile
            applyrecal.recal_file = target.recalFile
            applyrecal.out = target.recalibratedVCF
            applyrecal.jobName = "ApplyRecal"
    
            // add the jobs
    
            //add (annotator, recalibrator, applyrecal) //no need to annotate
            add(recalibrator, applyrecal)
    
            // eval the results
    
            val evalFiltered = new EvalArguments
            evalFiltered.eval :+= applyrecal.out
            evalFiltered.out = swapExt(applyrecal.out, "vcf", "eval")
            evalFiltered.jobName = "EvalFilt"
    
            // annotate the variants for the RS numbers because HaplotypeCaller doesn't yet
    
            val annotator = new AnnotationArguments
    
            annotator.variant = applyrecal.out
            annotator.out = swapExt(applyrecal.out, "vcf", "anno.vcf")
            annotator.scatterCount = 100
            annotator.jobName = "rsAnnotation"
    
            add (evalFiltered, annotator)
    
    
          }
    
    
    
        }
    
  • pdexheimerpdexheimer Posts: 335Member, GSA Collaborator ✭✭✭

    Hi Francesco -

    Is it failing when you specify a "list of BAM files"? Because I would expect that to cause problems right here:

      class HaplotypeCallerArguments extends HaplotypeCaller with UNIVERSAL_GATK_ARGS {
        [...]
        this.input_file :+= qscript.bamFile
        [...]
      }
    

    You should be giving it bamFilesList (the actual bams), not bamFile (the file containing a list of filenames). The same problem is in AnnotationArguments. I don't know if that would lead to the error you're seeing, though...

  • flescaiflescai Posts: 53Member ✭✭

    thanks, but the script works perfectly that way if I don't add the lines I reported initially. the sintax you refer to is present in ExampleUnifiedGenotyper.scala provided with Queue examples

     genotyper.input_file :+= qscript.bamFile
    

    the bamFilesList is something I create here for the sole purpose of calculating the size, i.e. the number of samples

  • flescaiflescai Posts: 53Member ✭✭

    Just to clarify, if I comment out these three lines

    import org.broadinstitute.sting.queue.util.QScriptUtils
    [...]
    val bamFilesList = QScriptUtils.createSeqFromFile(bamFile)
    val sampleNo = bamFilesList.size
    

    the script I posted works just fine.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,079Administrator, GATK Developer admin

    FYI, I have moved this thread to "Ask the community". Carry on, folks; and thanks for jumping in to help.

    Geraldine Van der Auwera, PhD

  • pdexheimerpdexheimer Posts: 335Member, GSA Collaborator ✭✭✭

    Hmm, didn't realize the input would work that way. That's pretty cool.

    In my own script, I walk through the BAMs and check the RGs to get a count of samples. Here's my code, I'm pretty sure I adapted countSamples from DataProcessingPipeline:

    import net.sf.samtools.SAMFileReader
    [...]
    def countSamples(bamFiles: Seq[File]): Int = {
      val sampleSet = scala.collection.mutable.Set.empty[String]
      for (bam <- bamFiles) {
        val samReader = new SAMFileReader(bam)
        val header = samReader.getFileHeader
        val readGroups = header.getReadGroups
    
        for (rg <- readGroups) {
          sampleSet += rg.getSample
        }
      }
      sampleSet.size
    }
    
    [...]
    
    val bams = QScriptUtils.createSeqFromFile(bamFile)
    val nSamples = countSamples(bams)
    
  • flescaiflescai Posts: 53Member ✭✭

    Indeed, that's precisely where I found the method createSeqFromFile

    In the count function, you pass a countSamples(bamFiles: Seq[File]) which is defined as a Seq.

    Because it's been converted by that method I'm interested in.

    val bams = QScriptUtils.createSeqFromFile(bamFile)
    val nSamples = countSamples(bams)
    

    You can see it is a method of QScriptUtils

    In order to be able to use it, you need to import it with

     import org.broadinstitute.sting.queue.util.QScriptUtils
    

    So I'm doing precisely the same as in DataProcessingPipeline, except for the .size method, which is anyway a method of Seq.

    However, as soon as I put that import into the HaplotypeCaller class, it seems not capable to instantiate it anymore.

    GATK team members say it has nothing to do with their code... but then I don't understand why the very same code of just three lines conflicts with a script working otherwise.

  • pdexheimerpdexheimer Posts: 335Member, GSA Collaborator ✭✭✭

    Can you recreate the problem in a minimal script? Perhaps your copy of Queue is corrupt

  • flescaiflescai Posts: 53Member ✭✭

    Thanks Johan that's great! which confirms my knowledge of scala proximal to zero :-)

    I've simply moved the code into the class Target now, and it works.

    class Target {
    
        // DEFINE file names
    
        val clusterFile = new File(qscript.projectName + "hc.clusters")
        val rawVCF = new File(qscript.projectName + ".raw.vcf")
        val annoVCF = new File(qscript.projectName + ".annotated.vcf")
        val recalibratedVCF = new File(qscript.projectName + ".HaplotypeCaller.recalibrated.vcf")
        val tranchesFile = new File(qscript.projectName + ".tranches")
        val vqsrRscript = new File(qscript.projectName + ".vqsr.r")
        val recalFile = new File(qscript.projectName + ".tranches.recal")
    
        val bamFilesList = QScriptUtils.createSeqFromFile(qscript.bamFile)
        val nSamples = bamFilesList.size
    
    }
    

    this way I can add a conditional on the number of samples anywhere I want in the other classes, outside the main script.

    thanks a lot!!

    Francesco

Sign In or Register to comment.