The current GATK version is 3.5-0

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# counting bams in a bam list with Queue

Posts: 67Member ✭✭
edited January 2013

Hi there,
I wanted to reproduce in my variant calling Queue script the same conditional you have in MethodsDevelopmenCallingPipeline, i.e. including InbreedingCoeff depending on the number of samples.
However, in that script the number of samples is passed to the Target object as an integer, and I would like to count it from the bam file list passed as an input to the script.

Therefore I followed the method in DataProcessingPipeline, i.e.

import org.broadinstitute.sting.queue.util.QScriptUtils
[...]
@Input(doc="input BAM file - or list of BAM files", fullName="input", shortName="I", required=true)
var bamFile: File = _
[...]
val bamFilesList = QScriptUtils.createSeqFromFile(bamFile)
val sampleNo = bamFilesList.size


But unfortunately, despite DataProcessingPipeline works just fine, when I put these lines in my other script I get the following error:

INFO  12:48:08,616 HelpFormatter - Date/Time: 2012/11/08 12:48:08
INFO  12:48:08,616 HelpFormatter - ----------------------------------------------------------------------
INFO  12:48:08,616 HelpFormatter - ----------------------------------------------------------------------
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
org.broadinstitute.sting.utils.exceptions.DynamicClassResolutionException: Could not create module HaplotypeCallerStep because      Cannot instantiate class (Invocation failure) caused by exception null
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 2.2-2-gf44cc4e):
##### ERROR
##### ERROR Please visit the wiki to see if this is a known problem
##### ERROR If not, please post the error, with stack trace, to the GATK forum
##### ERROR
##### ERROR MESSAGE: Could not create module HaplotypeCallerStep because Cannot instantiate class (Invocation failure) caused by exception null
##### ERROR ------------------------------------------------------------------------------------------


I tried several alternatives looking at the imports in DataProcessingPipeline but maybe I am missing something.

thanks very much
Francesco

Post edited by Geraldine_VdAuwera on
Tagged:

• Posts: 94Member ✭✭✭

The problem is that you are calling QScriptUtils.createSeqFromFile(bamFile) in the body of the script. I think (my knowledge of Scala is not as complete as I would like it to be) this causes the scala compiler to try to include the code in the constructor, and thus calling createSeqFromFile with null, as this variable is only set at runtime. This causes the script to fail. If you put the function call in the script portion you should be fine. There is a minimal example of this which works on my end:

package org.broadinstitute.sting.queue.qscripts

class HaplotypeCallerStep extends QScript {
// Create an alias 'qscript' to be able to access variables
// in the HaplotypeCallerStep.
// 'qscript' is now the same as 'HaplotypeCallerStep.this'
qscript =>

// Required arguments.  All initialized to empty values.

@Input(doc="input BAM file - or list of BAM files", fullName="input", shortName="I", required=true)
var bamFile: File = _

/***************************************************
* main script
***************************************************/

def script() {

val bamFilesList = QScriptUtils.createSeqFromFile(bamFile)
val sampleNo = bamFilesList.size

System.err.println("------ samples ------")
System.err.println(sampleNo)
System.err.println("======================")
}

}


Hope this solution works for you.

Hi Francesco,

Unfortunately, we just don't have the resources to help people with programming questions. Perhaps you'll have better luck posting this in the Ask the Community section.

I will say that it looks like your problem is unrelated and has something to do with the Haplotype Caller. Perhaps you are using GATK lite? Good luck!

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Posts: 67Member ✭✭

Hi Eric,
I assumed QScriptUtils was part of Queue and createSeqFromFile not to be a general scala method, and therefore that's why I posted here the question.

I am using GATK2-2.2 full version, not the light one.

• Posts: 94Member ✭✭✭

QScriptUtils and createSeqFromFile are part of Queue. But I'm 99% sure that inn't what is causing your problem. As Eric points out, it is more likely something wrong in the haplotype caller part. As stated in the error message: "Could not create module HaplotypeCallerStep because Cannot instantiate class (Invocation failure) caused by exception null"

If you can post your entire script here, or somewhere else, it would be easier to give an answer as to what is causing the problem. My guess is that it is caused by a problem in your extension class for the Haplotype caller.

• Posts: 67Member ✭✭

Thanks Johan, that's really very kind of you.

Apart the few lines I wrote above, I didn't modify the extension of the class in other parts, and without reference to
it works (if I don't add the method of course)

This is my script, it might be useful to other people as well. Thanks

    package org.broadinstitute.sting.queue.qscripts

/**
*  TARGETS HAVE BEEN SUBSTITUTED WITH CLASSES IN ORDER TO BETTER IMPORT
* THE NAMES OF THE DIFFERENT FILES, FOLLOWING THE EXAMPLE OF THE
* methodsDevelopment script
*/

class HaplotypeCallerStep extends QScript {
// Create an alias 'qscript' to be able to access variables
// in the HaplotypeCallerStep.
// 'qscript' is now the same as 'HaplotypeCallerStep.this'
qscript =>

// Required arguments.  All initialized to empty values.

@Input(doc="input BAM file - or list of BAM files", fullName="input", shortName="I", required=true)
var bamFile: File = _

@Input(doc="The reference file for the bam files.", shortName="R", required=true)
var referenceFile: File = _ // _ is scala shorthand for null

@Argument(doc="the project name determines the final output base name", fullName="project", shortName="p", required=true)
var projectName: String = _

// The following arguments are all optional.

@Input(doc="An optional file with a list of intervals to proccess.", shortName="L", required=false)
var intervals: File = _

@Argument(doc="A optional list of filter names.", shortName="filter", required=false)
var filterNames: List[String] = Nil // Nil is an empty List, versus null which means a non-existent List.

@Argument(doc="An optional list of filter expressions.", shortName="filterExpression", required=false)
var filterExpressions: List[String] = Nil

/*********************************************************
* definitions of names and files within a Target class
**********************************************************/

class Target {

// DEFINE file names

val clusterFile = new File(qscript.projectName + "hc.clusters")
val rawVCF = new File(qscript.projectName + ".raw.vcf")
val annoVCF = new File(qscript.projectName + ".annotated.vcf")
val recalibratedVCF = new File(qscript.projectName + ".HaplotypeCaller.recalibrated.vcf")
val tranchesFile = new File(qscript.projectName + ".tranches")
val vqsrRscript = new File(qscript.projectName + ".vqsr.r")
val recalFile = new File(qscript.projectName + ".tranches.recal")

}

// DEFINE resources location

val bamFilesList = QScriptUtils.createSeqFromFile(bamFile)
val sampleNo = bamFilesList.size
System.err.println("------ samples ------")
System.err.println(sampleNo)
System.err.println("======================")

val b37 = "/share/apps/genomics/reference/human_g1k_v37.fasta"

/*********************************************************
* extension of the arguments
*********************************************************/

// This trait allows us set the variables below in one place,
// and then reuse this trait on each CommandLineGATK function below.

trait UNIVERSAL_GATK_ARGS extends CommandLineGATK {
logging_level = "INFO";
}

class HaplotypeCallerArguments extends HaplotypeCaller with UNIVERSAL_GATK_ARGS {
this.reference_sequence = qscript.referenceFile
this.intervals = if (qscript.intervals == null) Nil else List(qscript.intervals)
// Set the memory limit to 6 gigabytes on each command.
this.memoryLimit = 6
this.input_file :+= qscript.bamFile
this.D = qscript.dbSNP_b37
}

class EvalArguments extends VariantEval with UNIVERSAL_GATK_ARGS {
this.reference_sequence = qscript.referenceFile
// Set the memory limit to 6 gigabytes on each command.
this.memoryLimit = 6
this.D = qscript.dbSNP_b37
}

class AnnotationArguments extends VariantAnnotator with UNIVERSAL_GATK_ARGS {
this.reference_sequence = qscript.referenceFile
// Set the memory limit to 7 gigabytes on each command.
this.memoryLimit = 7
this.input_file :+= qscript.bamFile
//this.annotation ++= List("QD", "QDE", "HaplotypeScore", "MQRankSum", "ReadPosRankSum", "FS", "MQ")
//this.useAllAnnotations // DISABLED, in order to annotate only with rs identifiers
this.D = qscript.dbSNP_b37
}

class VQSRArguments (t: Target) extends VariantRecalibrator with UNIVERSAL_GATK_ARGS {
this.reference_sequence = qscript.referenceFile
this.intervals = if (qscript.intervals == null) Nil else List(qscript.intervals)
// Set the memory limit to 4 gigabytes on each command.
this.memoryLimit = 4
this.input :+= t.rawVCF
this.maxGaussians = 6
this.resource :+= new TaggedFile( hapmap_b37, "hapmap,known=false,training=true,truth=true,prior=15.0" )
this.resource :+= new TaggedFile( omni_b37, "omni,known=false,training=true,truth=false,prior=12.0" )
this.resource :+= new TaggedFile( dbSNP_b37, "dbsnp,known=true,training=false,truth=false,prior=6.0" )
this.resource :+= new TaggedFile( mills_b37, "mills,known=true,training=true,truth=true,prior=12.0" )
this.use_annotation ++= List("QD", "MQRankSum", "ReadPosRankSum", "FS", "MQ", "InbreedingCoeff", "ClippingRankSum")
this.tranches_file = t.tranchesFile
this.recal_file = t.recalFile
this.rscript_file = t.vqsrRscript
this.analysisName = qscript.projectName + "_VQSRs"
}

class ApplyArguments (t: Target) extends ApplyRecalibration with UNIVERSAL_GATK_ARGS {
this.reference_sequence = qscript.referenceFile
this.intervals = if (qscript.intervals == null) Nil else List(qscript.intervals)
// Set the memory limit to 4 gigabytes on each command.
this.memoryLimit = 4
this.input :+= t.rawVCF
this.tranches_file = t.tranchesFile
this.recal_file = t.recalFile
this.ts_filter_level = 97.0
}

/***************************************************
* main script
***************************************************/

def script() {

// Create the four functions that we may run depending on options.
val target = new Target
val genotyper = new HaplotypeCallerArguments
val evalUnfiltered = new EvalArguments

genotyper.scatterCount = 300
genotyper.out = target.rawVCF
genotyper.jobName = "HaplotypeCaller"

evalUnfiltered.eval :+= genotyper.out
evalUnfiltered.out = swapExt(genotyper.out, "vcf", "eval")
evalUnfiltered.jobName = "UnfiltEval"

// execute genotyping and eval

// sets the VQSR

val recalibrator = new VQSRArguments(target)
recalibrator.tranches_file = target.tranchesFile
recalibrator.jobName = "VQSR"

// sets Appply recalibration

val applyrecal = new ApplyArguments(target)
applyrecal.tranches_file = target.tranchesFile
applyrecal.recal_file = target.recalFile
applyrecal.out = target.recalibratedVCF
applyrecal.jobName = "ApplyRecal"

//add (annotator, recalibrator, applyrecal) //no need to annotate

// eval the results

val evalFiltered = new EvalArguments
evalFiltered.eval :+= applyrecal.out
evalFiltered.out = swapExt(applyrecal.out, "vcf", "eval")
evalFiltered.jobName = "EvalFilt"

// annotate the variants for the RS numbers because HaplotypeCaller doesn't yet

val annotator = new AnnotationArguments

annotator.variant = applyrecal.out
annotator.out = swapExt(applyrecal.out, "vcf", "anno.vcf")
annotator.scatterCount = 100
annotator.jobName = "rsAnnotation"

}

}

• Posts: 533Member, Dev ✭✭✭✭

Hi Francesco -

Is it failing when you specify a "list of BAM files"? Because I would expect that to cause problems right here:

  class HaplotypeCallerArguments extends HaplotypeCaller with UNIVERSAL_GATK_ARGS {
[...]
this.input_file :+= qscript.bamFile
[...]
}


You should be giving it bamFilesList (the actual bams), not bamFile (the file containing a list of filenames). The same problem is in AnnotationArguments. I don't know if that would lead to the error you're seeing, though...

• Posts: 67Member ✭✭

thanks, but the script works perfectly that way if I don't add the lines I reported initially.
the sintax you refer to is present in ExampleUnifiedGenotyper.scala provided with Queue examples

 genotyper.input_file :+= qscript.bamFile


the bamFilesList is something I create here for the sole purpose of calculating the size, i.e. the number of samples

• Posts: 67Member ✭✭

Just to clarify, if I comment out these three lines

import org.broadinstitute.sting.queue.util.QScriptUtils
[...]
val bamFilesList = QScriptUtils.createSeqFromFile(bamFile)
val sampleNo = bamFilesList.size


the script I posted works just fine.

FYI, I have moved this thread to "Ask the community". Carry on, folks; and thanks for jumping in to help.

Geraldine Van der Auwera, PhD

• Posts: 533Member, Dev ✭✭✭✭

Hmm, didn't realize the input would work that way. That's pretty cool.

In my own script, I walk through the BAMs and check the RGs to get a count of samples. Here's my code, I'm pretty sure I adapted countSamples from DataProcessingPipeline:

import net.sf.samtools.SAMFileReader
[...]
def countSamples(bamFiles: Seq[File]): Int = {
val sampleSet = scala.collection.mutable.Set.empty[String]
for (bam <- bamFiles) {

sampleSet += rg.getSample
}
}
sampleSet.size
}

[...]

val bams = QScriptUtils.createSeqFromFile(bamFile)
val nSamples = countSamples(bams)

• Posts: 67Member ✭✭

Indeed, that's precisely where I found the method createSeqFromFile

In the count function, you pass a countSamples(bamFiles: Seq[File]) which is defined as a Seq.

Because it's been converted by that method I'm interested in.

val bams = QScriptUtils.createSeqFromFile(bamFile)
val nSamples = countSamples(bams)


You can see it is a method of QScriptUtils

In order to be able to use it, you need to import it with

 import org.broadinstitute.sting.queue.util.QScriptUtils


So I'm doing precisely the same as in DataProcessingPipeline, except for the .size method, which is anyway a method of Seq.

However, as soon as I put that import into the HaplotypeCaller class, it seems not capable to instantiate it anymore.

GATK team members say it has nothing to do with their code... but then I don't understand why the very same code of just three lines conflicts with a script working otherwise.

• Posts: 533Member, Dev ✭✭✭✭

Can you recreate the problem in a minimal script? Perhaps your copy of Queue is corrupt

• Posts: 94Member ✭✭✭

The problem is that you are calling QScriptUtils.createSeqFromFile(bamFile) in the body of the script. I think (my knowledge of Scala is not as complete as I would like it to be) this causes the scala compiler to try to include the code in the constructor, and thus calling createSeqFromFile with null, as this variable is only set at runtime. This causes the script to fail. If you put the function call in the script portion you should be fine. There is a minimal example of this which works on my end:

package org.broadinstitute.sting.queue.qscripts

class HaplotypeCallerStep extends QScript {
// Create an alias 'qscript' to be able to access variables
// in the HaplotypeCallerStep.
// 'qscript' is now the same as 'HaplotypeCallerStep.this'
qscript =>

// Required arguments.  All initialized to empty values.

@Input(doc="input BAM file - or list of BAM files", fullName="input", shortName="I", required=true)
var bamFile: File = _

/***************************************************
* main script
***************************************************/

def script() {

val bamFilesList = QScriptUtils.createSeqFromFile(bamFile)
val sampleNo = bamFilesList.size

System.err.println("------ samples ------")
System.err.println(sampleNo)
System.err.println("======================")
}

}


Hope this solution works for you.

• Posts: 67Member ✭✭

Thanks Johan that's great!
which confirms my knowledge of scala proximal to zero :-)

I've simply moved the code into the class Target now, and it works.

class Target {

// DEFINE file names

val clusterFile = new File(qscript.projectName + "hc.clusters")
val rawVCF = new File(qscript.projectName + ".raw.vcf")
val annoVCF = new File(qscript.projectName + ".annotated.vcf")
val recalibratedVCF = new File(qscript.projectName + ".HaplotypeCaller.recalibrated.vcf")
val tranchesFile = new File(qscript.projectName + ".tranches")
val vqsrRscript = new File(qscript.projectName + ".vqsr.r")
val recalFile = new File(qscript.projectName + ".tranches.recal")

val bamFilesList = QScriptUtils.createSeqFromFile(qscript.bamFile)
val nSamples = bamFilesList.size

}


this way I can add a conditional on the number of samples anywhere I want in the other classes, outside the main script.

thanks a lot!!

Francesco