Service Notice: Due to the blizzard currently hammering the US Northeast, the Broad is shut down and the GATK forum will be mostly unattended while we hunker down and sip hot cocoa with marshmallows. Assuming the power stays on and we're able to dig ourselves out of the snow when it's all over, normal service should resume Wednesday or Thursday.

Scatter/Gather utilizing multiple cores in the same node/job

jesseerdmannjesseerdmann Posts: 2Member
edited April 2013 in Ask the GATK team

At the Minnesota Supercomputing Institute, our environment requires that jobs on our high performance clusters reserve an entire node. I have implemented my own Torque Manager/Runner for our environment based on the Grid Engine Manager/Runner. The way I have gotten this to work in our environment is to set the nCoresRequest for the scatter/gather method to the minimum required of eight. My understanding is that for the InDelRealigner, for example, the job reserves a node with eight cores, but only uses one. That means our users would have their compute time allocation consumed eight times faster than is necessary.

What I am wondering is are there options that I am missing where some number of the scatter/gather requests can be grouped into a single job submission? If I were writing this as a PBS script for our environment and I wanted to use 16 cores in a scatter/gather implementation, I would write two jobs, each with eight commands. They would look something like the following:

#PBS Job Configuration stuff
pbsdsh -n 0 java -jar ... &
pbsdsh -n 1 java -jar ... &
pbsdsh -n 2 java -jar ... &
pbsdsh -n 3 java -jar ... &
pbsdsh -n 4 java -jar ... &
pbsdsh -n 5 java -jar ... &
pbsdsh -n 6 java -jar ... &
pbsdsh -n 7 java -jar ... &
wait

Has anyone done something similar in Queue? Any pointers? Thanks in advance!

Post edited by Geraldine_VdAuwera on

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,957Administrator, GATK Developer admin

    Hi there,

    Unfortunately, for tools like IndelRealigner that cannot be multi-threaded, there is nothing we can do for you, since the nCoresRequest parameter is a property of your cluster software. There is no built-in option to do what you want within Queue. Perhaps other users may suggest a workaround...

    Geraldine Van der Auwera, PhD

  • ryanabashbashryanabashbash Texas A&M UniversityPosts: 10Member

    I seem to have run into something similar when trying to run a QScript on a cluster. When writing the QScript, I tested it on a single machine that was running SGE, and it happily ran many Scatter/Gather jobs on the single node in parallel. However, once I moved over to the cluster (running UGE), it's happy to run multiple BWA jobs from the QScript on a single node, but it only launches one GATK job (e.g. IndelRealigner) on a single node.

    Here is the class I use for BWA alignment that has no problem launching multiple instances on a single node in the cluster.

    class BWAalignment(BWApath: File, BWAnumThreads: Int, readGroup: String, BWAindex: File, input: File, output: File) extends CommandLineFunction {
       @Input(doc="Input FASTQ")
       var inputFastq: File = input
       @Output(doc="Output SAM")
       var outputSAM: File = output
       this.jobName = "BWA_alignment"
       this.analysisName = "BWA_alignment"
       this.nCoresRequest = BWAnumThreads
       this.residentRequest = 6000
       this.jobEnvironmentNames = List("mpi " + BWAnumThreads)  //get smp_pe not valid errors without this.
    
       def commandLine = {
          BWApath + " mem " + " -t " + BWAnumThreads + " -R \"" + readGroup + "\" " + BWAindex + " " + inputFastq + " > " + outputSAM
       }
    }
    

    This is what I use a little later on in the script to define the IndelRealigner, but only one job ever launches on a single node.

    var indelRealignment = new IndelRealigner
    indelRealignment.reference_sequence = referenceFile
    indelRealignment.input_file = List(PicardOutput)
    indelRealignment.targetIntervals = targetCreation.out
    indelRealignment.out = swapExt(PicardOutput, "bam", "realigned.bam")
    indelRealignment.scatterCount = GATKnumScatter
    indelRealignment.jobEnvironmentNames = List("mpi " + "1")  //get smp_pe not valid errors without this.
    

    Given that the CommandLineFunction is able to launch multiple instances on a single node (e.g. 4 instances of BWA, each consuming 4 processors on a 16 processor node), is there a way to launch multiple IndelRealigners on a single node (e.g. 16 instances of IndelRealigner on a 16 processor node)? It seems I'm fundamentally misunderstanding something since things worked fine on the single machine running SGE.

Sign In or Register to comment.