Scatter/Gather in a Custom Wrapper for Queue

Hi guys,
I've been trying to do something that should be simple: annotating a VCF file with a custom annotation, using Queue with a custom wrapper.
I followed the instructions here
https://www.broadinstitute.org/gatk/events/3391/GATKw1310-Q-4-Advanced_Queue.pdf
However, since I'm working with a VCF file, I thought I could distribute the job better by scattering/gathering the input, taking advantage of Queue's functionality.
Following this presentation
https://www.broadinstitute.org/gatk/events/3391/GATKw1310-Q-3-Queue_Parallelism.pdf
I expected .scatterCount to be available natively when extending CommandLineFunction, but I get a message saying it's not a member of my class.

Could you please suggest how I can scatter/gather a VCF file if I have to process it with a custom wrapper?
I haven't found this question answered before, but I'm happy to read elsewhere if it already has been.

This is my script

package org.broadinstitute.gatk.queue.qscripts

import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.gatk._
import org.broadinstitute.gatk.queue.util.QScriptUtils
import org.broadinstitute.gatk.utils.commandline._
import org.broadinstitute.gatk.queue.function.scattergather._
import collection.JavaConversions._



class customAnnotation extends QScript {
  // Create an alias 'qscript' to be able to access variables
  qscript =>


  // Required arguments.  All initialized to empty values.

  @Input(doc="VCF file to be annotated", fullName="vcf", shortName="V", required=true)
  var inVcf: File = _


/*********************************************************
* definitions of names
**********************************************************/

    val baseName = swapExt(qscript.inVcf, "vcf", "anno")
    val myOut = new File( baseName + ".customAnno.vcf")
    val annotationOut = new File( baseName + ".parsed.vcf")
    val testFile = new File( baseName + ".TEST")



/*********************************************************
* CUSTOM annotation as command line
**********************************************************/     

    class MyAnnotation extends CommandLineFunction {

        @Input(doc = "input VCF file")
        val input: File = qscript.inVcf

        @Output(doc = "output VCF file")
        val output: File = qscript.myOut

        this.jobNativeArgs = Seq("--mem=12000")

        this.jobNativeArgs ++= Seq("--time=12:00:00")
        // job name
        override def jobRunnerJobName = "myAnno"

        this.scatterCount = 30

        override def commandLine = required("perl ~/tools/customAnno.pl") +
            required("-i", input) +
            required("-o", output)

    }


/***************************************************
* main script
***************************************************/

  def script() {

    val myanno = new MyAnnotation
    add(myanno)


  }



}

and this is the error I get:

Picked up _JAVA_OPTIONS: -XX:ParallelGCThreads=1
INFO  12:01:38,767 QScriptManager - Compiling 1 QScript 
ERROR 12:01:40,198 QScriptManager - testAnno.scala:56: value scatterCount is not a member of customAnnotation.this.MyAnnotation 
ERROR 12:01:40,200 QScriptManager -         this.scatterCount = 30 
ERROR 12:01:40,200 QScriptManager -              ^ 
ERROR 12:01:40,225 QScriptManager - two errors found 
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace 
org.broadinstitute.gatk.queue.QException: Compile of /home/lescai/pipeline/annotation/testAnno.scala failed with 2 errors
    at org.broadinstitute.gatk.queue.QScriptManager.loadScripts(QScriptManager.scala:79)
    at org.broadinstitute.gatk.queue.QCommandLine.org$broadinstitute$gatk$queue$QCommandLine$$qScriptPluginManager$lzycompute(QCommandLine.scala:95)
    at org.broadinstitute.gatk.queue.QCommandLine.org$broadinstitute$gatk$queue$QCommandLine$$qScriptPluginManager(QCommandLine.scala:93)
    at org.broadinstitute.gatk.queue.QCommandLine.getArgumentSources(QCommandLine.scala:230)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:205)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.queue.QCommandLine$.main(QCommandLine.scala:62)
    at org.broadinstitute.gatk.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------

Thanks for helping an inexperienced Scala user :)
Francesco

Answers

  • pdexheimer Member ✭✭✭✭

    Hi Francesco -

    The scatter/gather stuff is all in the ScatterGatherableFunction trait, which you should be able to mix into your class. I've never written a custom function with s/g, but I'm pretty sure you'll have to also set the scatterClass field.

    A word of warning: the scatter classes that exist are written specifically for GATK, and so make use of all of the various interval arguments defined in the engine. I'm not sure whether it's going to be easier to write a new scatter class or to add all of those arguments to your custom class (and underlying perl script), but I'm pretty sure you're going to have to do one of those. If you haven't already, take a look at the existing scatter code. The s/g "engine" is in ScatterGatherableFunction and ScatterFunction, while the existing classes are all in the org.broadinstitute.gatk.queue.extensions.gatk package, in the gatk-queue-extensions-public maven module.

  • flescai Member ✭✭
    edited February 2015

    Thanks very much @pdexheimer
    I've added the scatterClass to the code, like this:

            this.scatterClass = classOf[LocusScatterFunction]
    
            @Input(doc = "input VCF file")
            val input: File = qscript.inVcf
    
            @Output(doc = "output VCF file")
            @Gather(classOf[VcfGatherFunction])
            val output: File = qscript.myOut
    

    I placed it before the input definition, following the example of the MuTect.scala script.
    I suppose the key point is the interval arguments you mention, which go well beyond my experience with GATK code. I went through MuTect.scala but couldn't really see where they are added.
    And of course, without them, my script generates an error like this:

        ERROR 10:29:20,858 PluginManager - Couldn't initialize the plugin. Typically this is because of wrong global class variable initializations. 
        INFO  10:29:20,858 QCommandLine - Done with errors 
        ##### ERROR ------------------------------------------------------------------------------------------
        ##### ERROR stack trace 
        org.broadinstitute.gatk.utils.exceptions.DynamicClassResolutionException: Could not create module customAnnotation because Cannot instantiate class (Invocation failure) caused by exception null
            at org.broadinstitute.gatk.utils.classloader.PluginManager.createByType(PluginManager.java:312)
            at org.broadinstitute.gatk.utils.classloader.PluginManager.createAllTypes(PluginManager.java:323)
            at org.broadinstitute.gatk.queue.QCommandLine.execute(QCommandLine.scala:146)
            at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
            at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
            at org.broadinstitute.gatk.queue.QCommandLine$.main(QCommandLine.scala:62)
            at org.broadinstitute.gatk.queue.QCommandLine.main(QCommandLine.scala)
        ##### ERROR ------------------------------------------------------------------------------------------
    

    I'll keep looking at LocusScatterFunction.scala and the others to try to understand what I'm missing.

  • flescai Member ✭✭

    While polishing the code and going through some of the examples, I noticed that if I assign a predefined value to the output file, I get the "Couldn't initialize the plugin" error.
    If instead I leave the output as _, the error goes away and the dry run completes successfully, but nothing gets scattered...

    I'm a bit puzzled, but this goes beyond my knowledge of Queue
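
One possible explanation for the "Couldn't initialize the plugin" error, offered as an assumption rather than a confirmed diagnosis: if Queue constructs the QScript object before it fills in the @Input fields, then any val computed from qscript.inVcf in the class body (such as baseName in the first script) sees a null and throws during construction, which would match the "caused by exception null" in the stack trace. The sketch below reproduces that behaviour in plain Scala, without Queue; EagerScript, DeferredScript and InitOrderDemo are hypothetical names used only for illustration.

```scala
import java.io.File

// Stand-in for a QScript. Assumption: the framework creates the object first
// and only afterwards assigns the command-line arguments to its fields.
class EagerScript {
  var inVcf: File = _                                   // still null during construction
  val outName: String = inVcf.getName + ".anno.vcf"     // evaluated immediately, so it throws
}

class DeferredScript {
  var inVcf: File = _
  lazy val outName: String = inVcf.getName + ".anno.vcf" // evaluated on first access only
}

object InitOrderDemo {
  // True if constructing EagerScript fails with a NullPointerException.
  def eagerFails: Boolean =
    try { new EagerScript; false } catch { case _: NullPointerException => true }

  // Construct first, then simulate the framework injecting the -V argument.
  def deferredName: String = {
    val s = new DeferredScript
    s.inVcf = new File("sample.vcf")
    s.outName
  }
}
```

If this is indeed the cause, building the file names inside script(), or declaring them as lazy val or def, would defer the computation until after the arguments have been set.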

    package org.broadinstitute.gatk.queue.qscripts
    
    import org.broadinstitute.gatk.queue.QScript
    import java.io.File
    import org.broadinstitute.gatk.utils.commandline.Argument
    import org.broadinstitute.gatk.utils.commandline.Gather
    import org.broadinstitute.gatk.utils.commandline.Input
    import org.broadinstitute.gatk.utils.commandline.Output
    import org.broadinstitute.gatk.queue.function.scattergather.ScatterGatherableFunction
    import org.broadinstitute.gatk.queue.extensions.gatk.{TaggedFile, VcfGatherFunction, LocusScatterFunction}
    
        class customAnnotation extends QScript {
          // Create an alias 'qscript' to be able to access variables
          qscript =>
    
          @Input(doc="VCF file to be annotated", fullName="vcf", shortName="V", required=true)
          var inVcf: File = _
    
    
        /*********************************************************
        * CUSTOM annotation as command line
        **********************************************************/     
    
            class MyAnnotation extends org.broadinstitute.gatk.queue.extensions.gatk.CommandLineGATK with ScatterGatherableFunction {
    
                analysisName = "myAnno"
                analysis_type = "myAnno"
                scatterClass = classOf[LocusScatterFunction]
    
    
                @Output(doc = "output VCF file")
                @Gather(classOf[VcfGatherFunction])
                var output: File = _
    
                @Input(doc = "input VCF file")
                var input = qscript.inVcf
    
    
                this.jobNativeArgs = Seq("--mem=12000")
    
                this.jobNativeArgs ++= Seq("--time=12:00:00")
                // job name
                override def jobRunnerJobName = "myAnno"
    
                this.scatterCount = 30
    
                override def commandLine = required("perl ~/tools/customAnno.pl") +
                    required("-i", input) +
                    required("-o", output)
    
            }
    
    
        /***************************************************
        * main script
        ***************************************************/
    
          def script() {
            val myanno = new MyAnnotation
            add(myanno)
    
    
          }
    
    
    
        }
    

    If anybody has a good suggestion, it would be much appreciated.
    I think the simple task of splitting a VCF file into chunks, processing them (even just with a command line) and gathering them back together is a native feature of Queue, and it would be interesting to apply it more broadly.
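
To make the idea concrete, here is a minimal sketch in plain Scala of what scattering and gathering a VCF by records amounts to. It does not use the Queue API: VcfChunks, scatter and gather are hypothetical names, and the assumption is that every chunk must carry the full header so that the per-chunk tool sees a valid VCF. A real Queue scatter class would have to do the equivalent on files.

```scala
// Hypothetical helper, not part of Queue: works on VCF lines held in memory.
object VcfChunks {

  // Separate header lines (starting with '#') from record lines, dropping empty lines.
  def splitHeader(lines: Seq[String]): (Seq[String], Seq[String]) =
    lines.filter(_.nonEmpty).partition(_.startsWith("#"))

  // Scatter: cut the records into at most `count` contiguous chunks,
  // each one prefixed with the complete header.
  def scatter(lines: Seq[String], count: Int): Seq[Seq[String]] = {
    require(count > 0, "count must be positive")
    val (header, records) = splitHeader(lines)
    if (records.isEmpty) Seq(header)
    else {
      val chunkSize = math.ceil(records.size.toDouble / count).toInt
      records.grouped(chunkSize).map(chunk => header ++ chunk).toList
    }
  }

  // Gather: keep the header of the first chunk and append the records
  // of every chunk in their original order.
  def gather(chunks: Seq[Seq[String]]): Seq[String] = {
    val header = chunks.headOption.map(c => splitHeader(c)._1).getOrElse(Seq.empty)
    header ++ chunks.flatMap(c => splitHeader(c)._2)
  }
}
```

The gather step takes the header from the first chunk, on the assumption that the annotation tool writes the same header for every chunk.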
