To celebrate the release of GATK 4.0, we are giving away free credits for running the GATK4 Best Practices pipelines in FireCloud, our secure online analysis portal. It’s first come first serve, so sign up now to claim your free credits worth $250. Sponsored by Google Cloud. Learn more at https://software.broadinstitute.org/firecloud/documentation/freecredits

Questions about the UnifiedGenotyper Queue script

zivaziva SwedenMember
edited May 2015 in Ask the GATK team

Hi,
I have several questions about the function "UnifiedGenotyper" and its corresponding Qscript.

  1. I first test this function on a single chromosome of my chicken sample (bam file is about 800M) to call SNPs, and set the ploidy value to 30. It took me half a day to finish. Is it normal that it took so long for a single chromosome?

As a result of this, I tried to use the scatter-gather to parallel my job in order to get the full use of my multi-node server. I built my own script based on modifying the ExampleUnifiedGenotyper.scala.
I defined "ploidy" as an argument in my Qscript, then I assign a value of 30 to ploidy in my bash command line.

java -Djava.io.tmpdir=$tmp_dir \
    -jar /sw/apps/bioinfo/GATK-Queue/3.2.2/Queue.jar \
    -S /path/to/ug.scala \
    -R $ref_file \
    -I /path/to/bamfile.bam \
    -o /path/to/my.vcf  \
    -ploidy 30 \
    -run

2.The script ran successfully, but the output VCF file was based sample_ploidy=2. It seems that it does not respond to the argument "ploidy". Then if I add one command line in the def script() part, which is" genotyper.sample_ploidy = qscript.sample_ploidy ". GATK will give warning of errors "sample_ploidy is not a member of TestUnifiedGenotyper.this.UnifiedGenotyperArguments".
Could you please help point out where I did wrong?

Also, I have some questions about how to assign values to argument such as " this.memoryLimit" and "genotyper.scatterCount".

  1. Based on the explanation given in the example scala file, argument "this.memoryLimit" sets the memory limit for each command.
    So in my case, I have only one command in my script, so that I can assign as much memory as I have to it?
    I am running on a server that has multiple-node, each node with 16 cores, and each core has 8G RAM. So in my understanding, if I run this job on one node with 16 cores, can I assign "this.memoryLimit=128" (16*8)? Do I understand correctly?

  2. about the scatterCount, can I assign it with the number of cores I have in the server? That is, based on the example I listed above, 16.

Thank you very much for your help and time!
Attached is my Qscript in relating with my question.

My Qscript is as follows:

package org.broadinstitute.gatk.queue.qscripts.examples

import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.gatk._

class TestUnifiedGenotyper extends QScript {
      qscript =>
     @Input(doc="The reference file for the bam files.", shortName="R")
      var referenceFile: File = _ 

      @Input(doc="Bam file to genotype.", shortName="o")
     var outputFile: File = _

      @Input(doc="Bam file to genotype.", shortName="I")
      var bamFile: File = _

      @Input(doc="An optional file with dbSNP information.", shortName="D", required=false)
       var dbsnp: File = _

      @Argument(doc="Ploidy (number of chromosomes) per sample.", shortName="ploidy")
      var sample_ploidy: List[String] = Nil 

      trait UnifiedGenotyperArguments extends CommandLineGATK {
           this.reference_sequence = qscript.referenceFile
           this.memoryLimit = 2
      }

  def script() {
    val genotyper = new UnifiedGenotyper with UnifiedGenotyperArguments

    genotyper.scatterCount = 3
    genotyper.input_file :+= qscript.bamFile
   // genotyper.sample_ploidy = qscript.sample_ploidy
    genotyper.out = swapExt(outputFile, qscript.bamFile, "bam", "unfiltered.vcf")

    add(genotyper)
  }
}
Post edited by Geraldine_VdAuwera on

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hmm, I'm not sure what's going on here -- can you try using a different name (like samplePloidy) for defining your ploidy variable? I wonder if it's a naming collision of some kind.

    Note that Queue scatter-gather depends on having a job scheduler set up to dispatch jobs to different nodes. Do you have a job scheduler set up.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie
  • zivaziva SwedenMember

    Hi,

    Thank you both for your reply!
    @thibault I tried your way. The script runs smoothly, but still the output is based on sample_ploidy=2.
    As you suggested, I added the line "import org.broadinstitute.gatk.utils.commandline.ClassType" as follows:

    package org.broadinstitute.gatk.queue.qscripts.examples

    import org.broadinstitute.gatk.queue.QScript
    import org.broadinstitute.gatk.queue.extensions.gatk._
    import org.broadinstitute.gatk.utils.commandline.ClassType

    class TestUnifiedGenotyper extends QScript {
    qscript =>
    .......

    Then define the argument ploidy as integer as
    "
    @Argument(doc="Ploidy (number of chromosomes) per sample.", shortName="ploidy")
    @ClassType(classOf[Int])
    var sample_ploidy: Option[Int] = None
    "

    And I ran with the same bash command as I posted above, it still did not respond to my ploidy value assignment.

    Please help me in pointing out where I did wrong! Thanks a lot!

  • zivaziva SwedenMember

    @Geraldine_VdAuwera
    About the job schedular, the server I work on has a system called "The Slurm Workload Manager". Is this a type of job schedular that scatter-gather requires?

    I tried different values to the two arguments "this.memoryLimit" and "genotyper.scatterCount".
    I did not find any obvious difference in applying different values to " this.memoryLimit".
    As for "genotyper.scatterCount", I do find that the higher value I assigned to it, the more "jobs" it can be splited into.
    However, no matter what value it had assigned to, the "jobs" seem to run one after the other, rather than parallel.
    The following information were extracted from the log file,

    "INFO 09:27:08,210 QGraph - 18 Pend, 1 Run, 0 Fail, 0 Done
    .......
    INFO 09:28:32,013 QGraph - 17 Pend, 1 Run, 0 Fail, 1 Done
    .......
    INFO 09:29:21,840 QGraph - 16 Pend, 1 Run, 0 Fail, 2 Done
    "

    Last, based on your experience, I need 12hours on 16 cores (with the help of options "-nct" and "-nt") to complete variant calling on a 800M bam file, is it a normal case?

    Thanks a lot for your help and time!

  • thibaultthibault Broad InstituteMember, Broadie, Dev

    Queue does not run anything in parallel by itself. You need to tell it to submit jobs to a job scheduler. SLURM is an example of this, so it sounds like you're in luck! We don't support SLURM specifically (and we don't run it) but here are some user-contributed suggestions: http://gatkforums.broadinstitute.org/discussion/2781/running-gatk-queue-in-slurm

    Please post the current version of your script here. When I tested the approach I suggested to you, I did see the ploidy correctly applied to subtasks. Maybe something else is wrong?

  • zivaziva SwedenMember

    Thank you @thibault for your answer! It works and I need to use the commandline "genotyper.sample_ploidy = qscript.sample_ploidy" in the def script(){...} as well.

    And thank you for pointing me to the thread that answered my other question!

    Thank you @thibault and @Geraldine_VdAuwera both for your help and time!

Sign In or Register to comment.