If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra

Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Questions about the UnifiedGenotyper Queue script

zivaziva SwedenMember
edited May 2015 in Ask the GATK team

I have several questions about the function "UnifiedGenotyper" and its corresponding Qscript.

  1. I first test this function on a single chromosome of my chicken sample (bam file is about 800M) to call SNPs, and set the ploidy value to 30. It took me half a day to finish. Is it normal that it took so long for a single chromosome?

As a result of this, I tried to use the scatter-gather to parallel my job in order to get the full use of my multi-node server. I built my own script based on modifying the ExampleUnifiedGenotyper.scala.
I defined "ploidy" as an argument in my Qscript, then I assign a value of 30 to ploidy in my bash command line.

java$tmp_dir \
    -jar /sw/apps/bioinfo/GATK-Queue/3.2.2/Queue.jar \
    -S /path/to/ug.scala \
    -R $ref_file \
    -I /path/to/bamfile.bam \
    -o /path/to/my.vcf  \
    -ploidy 30 \

2.The script ran successfully, but the output VCF file was based sample_ploidy=2. It seems that it does not respond to the argument "ploidy". Then if I add one command line in the def script() part, which is" genotyper.sample_ploidy = qscript.sample_ploidy ". GATK will give warning of errors "sample_ploidy is not a member of TestUnifiedGenotyper.this.UnifiedGenotyperArguments".
Could you please help point out where I did wrong?

Also, I have some questions about how to assign values to argument such as " this.memoryLimit" and "genotyper.scatterCount".

  1. Based on the explanation given in the example scala file, argument "this.memoryLimit" sets the memory limit for each command.
    So in my case, I have only one command in my script, so that I can assign as much memory as I have to it?
    I am running on a server that has multiple-node, each node with 16 cores, and each core has 8G RAM. So in my understanding, if I run this job on one node with 16 cores, can I assign "this.memoryLimit=128" (16*8)? Do I understand correctly?

  2. about the scatterCount, can I assign it with the number of cores I have in the server? That is, based on the example I listed above, 16.

Thank you very much for your help and time!
Attached is my Qscript in relating with my question.

My Qscript is as follows:

package org.broadinstitute.gatk.queue.qscripts.examples

import org.broadinstitute.gatk.queue.QScript
import org.broadinstitute.gatk.queue.extensions.gatk._

class TestUnifiedGenotyper extends QScript {
      qscript =>
     @Input(doc="The reference file for the bam files.", shortName="R")
      var referenceFile: File = _ 

      @Input(doc="Bam file to genotype.", shortName="o")
     var outputFile: File = _

      @Input(doc="Bam file to genotype.", shortName="I")
      var bamFile: File = _

      @Input(doc="An optional file with dbSNP information.", shortName="D", required=false)
       var dbsnp: File = _

      @Argument(doc="Ploidy (number of chromosomes) per sample.", shortName="ploidy")
      var sample_ploidy: List[String] = Nil 

      trait UnifiedGenotyperArguments extends CommandLineGATK {
           this.reference_sequence = qscript.referenceFile
           this.memoryLimit = 2

  def script() {
    val genotyper = new UnifiedGenotyper with UnifiedGenotyperArguments

    genotyper.scatterCount = 3
    genotyper.input_file :+= qscript.bamFile
   // genotyper.sample_ploidy = qscript.sample_ploidy
    genotyper.out = swapExt(outputFile, qscript.bamFile, "bam", "unfiltered.vcf")

Post edited by Geraldine_VdAuwera on

Best Answer


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hmm, I'm not sure what's going on here -- can you try using a different name (like samplePloidy) for defining your ploidy variable? I wonder if it's a naming collision of some kind.

    Note that Queue scatter-gather depends on having a job scheduler set up to dispatch jobs to different nodes. Do you have a job scheduler set up.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
  • zivaziva SwedenMember


    Thank you both for your reply!
    @thibault I tried your way. The script runs smoothly, but still the output is based on sample_ploidy=2.
    As you suggested, I added the line "import org.broadinstitute.gatk.utils.commandline.ClassType" as follows:

    package org.broadinstitute.gatk.queue.qscripts.examples

    import org.broadinstitute.gatk.queue.QScript
    import org.broadinstitute.gatk.queue.extensions.gatk._
    import org.broadinstitute.gatk.utils.commandline.ClassType

    class TestUnifiedGenotyper extends QScript {
    qscript =>

    Then define the argument ploidy as integer as
    @Argument(doc="Ploidy (number of chromosomes) per sample.", shortName="ploidy")
    var sample_ploidy: Option[Int] = None

    And I ran with the same bash command as I posted above, it still did not respond to my ploidy value assignment.

    Please help me in pointing out where I did wrong! Thanks a lot!

  • zivaziva SwedenMember

    About the job schedular, the server I work on has a system called "The Slurm Workload Manager". Is this a type of job schedular that scatter-gather requires?

    I tried different values to the two arguments "this.memoryLimit" and "genotyper.scatterCount".
    I did not find any obvious difference in applying different values to " this.memoryLimit".
    As for "genotyper.scatterCount", I do find that the higher value I assigned to it, the more "jobs" it can be splited into.
    However, no matter what value it had assigned to, the "jobs" seem to run one after the other, rather than parallel.
    The following information were extracted from the log file,

    "INFO 09:27:08,210 QGraph - 18 Pend, 1 Run, 0 Fail, 0 Done
    INFO 09:28:32,013 QGraph - 17 Pend, 1 Run, 0 Fail, 1 Done
    INFO 09:29:21,840 QGraph - 16 Pend, 1 Run, 0 Fail, 2 Done

    Last, based on your experience, I need 12hours on 16 cores (with the help of options "-nct" and "-nt") to complete variant calling on a 800M bam file, is it a normal case?

    Thanks a lot for your help and time!

  • thibaultthibault Broad InstituteMember, Broadie, Moderator, Dev admin

    Queue does not run anything in parallel by itself. You need to tell it to submit jobs to a job scheduler. SLURM is an example of this, so it sounds like you're in luck! We don't support SLURM specifically (and we don't run it) but here are some user-contributed suggestions:

    Please post the current version of your script here. When I tested the approach I suggested to you, I did see the ploidy correctly applied to subtasks. Maybe something else is wrong?

  • zivaziva SwedenMember

    Thank you @thibault for your answer! It works and I need to use the commandline "genotyper.sample_ploidy = qscript.sample_ploidy" in the def script(){...} as well.

    And thank you for pointing me to the thread that answered my other question!

    Thank you @thibault and @Geraldine_VdAuwera both for your help and time!

Sign In or Register to comment.