Holiday Notice:
The Frontline Support team will be slow to respond December 17-18 due to an institute-wide retreat and offline December 22- January 1, while the institute is closed. Thank you for your patience during these next few weeks. Happy Holidays!

GATK Queue and BQSR

Hi,

I am trying to speed up BaseRecalibrator by using GATK Queue. However, it does not seem to play well with scatter-gather. When I run it with the -BQSR option, it fails with an error message which seems to apply to GATK Queue itself and not (at least not directly) to something in my script.

My script is just a wrapper:

import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.gatk._

class MyBaseRecalibrator extends QScript {

    @Input(shortName="R")
    var referenceFile: File = _

    @Input(shortName="I")
    var bamFile: File = _

    @Input(shortName="L", required=false)
    var targetIntervals: File = _

    @Argument(shortName="interval_padding", required=false)
    var intervalPadding: Int = _

    @Input(shortName="knownSites")
    var knownSites: List[File] = Nil

    @Input(shortName="BQSR", required=false)
    var bqsrFile: File = _

    @Output(shortName="o")
    var outFile: File = _

    @Argument(shortName="scatterCount")
    var scatterCount: Int = _

    def script() {

        val br = new BaseRecalibrator

        br.R = referenceFile
        br.I :+= bamFile

        br.L :+= targetIntervals
        br.interval_padding = intervalPadding

        br.BQSR = bqsrFile

        br.knownSites = knownSites

        br.o = outFile

        br.scatterCount = scatterCount
        br.memoryLimit = 8

        add(br)

    }

}

When the BQSR option is left out, everything works great, but when it is included, the error message is as follows:

##### ERROR ------------------------------------------------------------------------------------------
##### ERROR stack trace
java.lang.NullPointerException
    at org.broadinstitute.gatk.queue.extensions.gatk.GATKScatterFunction$$anonfun$intervalFilesExist$1.apply(GATKScatterFunction.scala:90)
    at org.broadinstitute.gatk.queue.extensions.gatk.GATKScatterFunction$$anonfun$intervalFilesExist$1.apply(GATKScatterFunction.scala:90)
    at scala.collection.LinearSeqOptimized$class.exists(LinearSeqOptimized.scala:80)
    at scala.collection.immutable.List.exists(List.scala:84)
    at org.broadinstitute.gatk.queue.extensions.gatk.GATKScatterFunction$class.intervalFilesExist(GATKScatterFunction.scala:90)
    at org.broadinstitute.gatk.queue.extensions.gatk.ContigScatterFunction.intervalFilesExist(ContigScatterFunction.scala:35)
    at org.broadinstitute.gatk.queue.extensions.gatk.ContigScatterFunction.scatterCount(ContigScatterFunction.scala:37)
    at org.broadinstitute.gatk.queue.function.scattergather.ScatterGatherableFunction$class.generateFunctions(ScatterGatherableFunction.scala:131)
    at org.broadinstitute.gatk.queue.extensions.gatk.BaseRecalibrator.generateFunctions(BaseRecalibrator.scala:10)
    at org.broadinstitute.gatk.queue.engine.QGraph$$anonfun$fillGraph$1.apply(QGraph.scala:182)
    at org.broadinstitute.gatk.queue.engine.QGraph$$anonfun$fillGraph$1.apply(QGraph.scala:179)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at org.broadinstitute.gatk.queue.engine.QGraph.fillGraph(QGraph.scala:179)
    at org.broadinstitute.gatk.queue.engine.QGraph.run(QGraph.scala:140)
    at org.broadinstitute.gatk.queue.QCommandLine.execute(QCommandLine.scala:170)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:248)
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(CommandLineProgram.java:155)
    at org.broadinstitute.gatk.queue.QCommandLine$.main(QCommandLine.scala:61)
    at org.broadinstitute.gatk.queue.QCommandLine.main(QCommandLine.scala)
##### ERROR ------------------------------------------------------------------------------------------
##### ERROR A GATK RUNTIME ERROR has occurred (version 3.4-46-gbc02625):
##### ERROR
##### ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
##### ERROR If not, please post the error message, with stack trace, to the GATK forum.
##### ERROR Visit our website and forum for extensive documentation and answers to
##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
##### ERROR
##### ERROR MESSAGE: Code exception (see stack trace for error itself)
##### ERROR ------------------------------------------------------------------------------------------
INFO  11:26:09,842 QCommandLine - Shutting down jobs. Please wait...

None of the lines seem to refer to my code. Any clues?

Thanks,
Michael

Tagged:

Issue · Github
by Sheila

Issue Number
179
State
closed
Last Updated
Assignee
Array
Milestone
Array
Closed By
vdauwera

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Michael, when you say

    When the BQSR option is left out,

    do you mean that you remove the argument from the script, or that you don't specify the argument in the command line?

    Can you show the full commands that you are running in both cases?

  • micknudsenmicknudsen DenmarkMember

    Hi Geraldine,

    Thanks for your reply. Here is what I mean by leaving BQSR out. (Note that I have changed the script name to QueueBaseRecalibrator.scala and included memoryLimit and GATK key options since my last post).

    When I run the following (which does not include BQSR), everything works great:

    java -Djava.io.tmpdir=tmp -jar /com/extra/GATKQueue/{gatk_queue_version}/jar-bin/Queue.jar -S scala/QueueBaseRecalibrator.scala \
                -I {realigned_bam_file} \
                -R {reference_genome} \
                -L {target_intervals} \
                -interval_padding 100 \
                -o {recalibrated_data_table} \
                -knownSites {dbsnp} \
                -knownSites {indel_calls} \
                -knownSites {g1k} \
                -K {key_file} \
                -jobRunner Drmaa \
                -scatterCount 16 \
                -memoryLimit 16 \
                -jobNative "--mem=16384" \
                -run
    

    However, in the next step of my analysis, I do the following:

    java -Djava.io.tmpdir=tmp -jar /com/extra/GATKQueue/{gatk_queue_version}/jar-bin/Queue.jar -S scala/QueueBaseRecalibrator.scala \
                -I {realigned_bam_file} \
                -R {reference_genome} \
                -BQSR {recalibrated_data_table} \
                -o {post_recalibrated_data_table} \
                -knownSites {dbsnp} \
                -knownSites {indel_calls} \
                -knownSites {g1k} \
                -K {key_file} \
                -jobRunner Drmaa \
                -scatterCount 16 \
                -memoryLimit 16 \
                -jobNative "--mem=16384" \
                -run
    

    and this is when the error occurs. If I set scatterCount=1, the script runs fine (but the there is no point in using Queue).

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah, I ran a couple of tests and it looks like it's not the -BQSR argument at fault, it's -L. Check it out: in your first command, you use -L, but then you don't use it in your second. I actually realized this was the important difference because I first ran the first command without -L and got that same error despite not having specified -BQSR.

    This seems to be a quirk of the BQSR custom gatherer. You could call it a bug but we don't have resources to address it right now, so my advice is to just pass in intervals as a workaround.

  • micknudsenmicknudsen DenmarkMember

    Thank you so much. Everything works great now!

  • buddejbuddej St. LouisMember

    Is BQSR limited to 24 scatter jobs? Whenever I tell Queue to scatter wider, it only does a 24-split. And there intervals aren't evenly divided, they way they are when scattering HC or GGVCFs -- it seems to be one per chromosome. I'm guessing this is by design, but just wanted to confirm.

    Using the latest Queue-3.6-0, haven't tested with any other version.

  • davidallanwrightdavidallanwright MinnesotaMember

    Hello. Is Queue in GATK 4.0?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @davidallanwright
    Hi,

    No, Queue is being replaced with WDL.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    @buddej Yes, by default BQSR is scattered per chromosome. It's possible to change that but I wouldn't recommend it.
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
    @davidallanwright No, there are no plans to include Queue in GATK 4. We are moving to a different pipelining system that offers us more flexibility.
  • buddejbuddej St. LouisMember

    @Geraldine_VdAuwera Thanks! I figured it was done for a reason, but I wanted to make sure it wasn't a coding error on my end.

Sign In or Register to comment.