pwhite
What is the best way to get Queue to optimize utilization of a given number of cores in an SGE cluster? The DataProcessingPipeline.scala has a hidden parameter "scatter_gather" which sets the nContigs variable. Is it safe to use this option? For example, if you had 100 cores available in your cluster could you set the option to 100? Is there any advantage to setting it higher?
Without setting it, Queue appears to set the nContigs value based on the number of chromosomes in the BAM input: 25 for a whole-genome BAM, 1 for your example Chr20 data, and 0 for an unaligned BAM. So if starting with unaligned data, does it run on a single core?
Geraldine_VdAuwera
As I recall, Queue will not "over-scatter" your jobs, i.e. it won't forcibly split up the data further than makes sense for the analysis. The rest is up to the script you use. The DataProcessingPipeline (DPP) handles this correctly, if that's what you're using. As for changing the scatter count, it should be OK, but I would recommend that you read the pipeline script to check where that parameter is used and verify that it matches your expectations.
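To make the "no over-scatter" idea concrete, here is a minimal Python sketch of the capping logic as I understand it. This is a toy model, not Queue's actual code: the function name `effective_scatter_count` and the rule that zero splittable units falls back to a single job are assumptions made for illustration.

```python
def effective_scatter_count(requested: int, splittable_units: int) -> int:
    """Toy model (hypothetical, not Queue's API): cap the requested scatter
    count by the number of indivisible data units, and never go below one job."""
    if requested < 1:
        raise ValueError("requested scatter count must be at least 1")
    # Assumption: with nothing to split on (0 units), the work runs as one job.
    return max(1, min(requested, splittable_units))


print(effective_scatter_count(100, 25))  # whole-genome BAM split by contig -> 25
print(effective_scatter_count(100, 1))   # single-chromosome data -> 1
print(effective_scatter_count(100, 0))   # unaligned BAM, no contigs -> 1
```

Under these assumptions, asking for 100 on a whole-genome BAM that is split per contig still yields 25 jobs, which is why it is worth checking in the script what unit the data is actually split on.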
Geraldine Van der Auwera, PhD
Answers
The merits of increasing the scatter-gather count depend a lot on what kind of jobs you're sending out, and also a little on how your cluster is set up. Some jobs can't be scattered beyond a certain count because the data simply can't be divided up further. For example, if the smallest unit of data your job can operate on is the chromosome, there is no point in increasing the scatter-gather count beyond the number of chromosomes.

If the smallest unit is a gene interval and you have 20,000 intervals, technically you can set the scatter count to the number of intervals, 20,000. The advantage is that the scattered jobs will have very short runtimes, which is good if you have a fast-moving "short jobs" queue in your cluster setup. But having so many jobs adds processing overhead, so at some point increasing the scatter count will stop yielding any performance gains and start costing you.

Finally, if you have unaligned data and the job is to align it, it will run on a single node by default because the aligner job needs access to all the data. As far as I know, you can't just split up the data and send the pieces to be aligned on different machines.

Keep in mind that scatter-gather is different from multithreading. We'll have a new documentation article on that topic out very soon.
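The trade-off between shorter jobs and per-job overhead can be sketched with a rough back-of-the-envelope model in Python. Everything here is an assumption for illustration: the function names, the idea that jobs run in "waves" of at most `cores` at a time, the equal split of work, and the example numbers. It ignores queue wait times and the cost of the gather step, so it does not capture the benefit of a fast "short jobs" queue.

```python
import math


def estimated_wall_time(total_work_s: float, scatter_count: int,
                        cores: int, per_job_overhead_s: float) -> float:
    """Hypothetical estimate: equal-sized jobs run in waves of `cores` jobs,
    and each job pays a fixed overhead (scheduling, startup, I/O)."""
    waves = math.ceil(scatter_count / cores)
    per_job_s = total_work_s / scatter_count + per_job_overhead_s
    return waves * per_job_s


def best_scatter_count(total_work_s: float, max_units: int,
                       cores: int, per_job_overhead_s: float) -> int:
    """Scatter count (1..max_units) with the lowest estimated wall time."""
    return min(
        range(1, max_units + 1),
        key=lambda n: estimated_wall_time(total_work_s, n, cores,
                                          per_job_overhead_s),
    )


# Made-up numbers: 10 hours of total work, 100 cores, 60 s overhead per job,
# and 20,000 intervals as the finest possible split.
for n in (1, 25, 100, 1000, 20000):
    print(n, round(estimated_wall_time(36000, n, 100, 60)))
print("best:", best_scatter_count(36000, 20000, 100, 60))
```

With these made-up inputs the estimate is lowest at a scatter count equal to the number of cores, and it gets worse again as the count approaches the number of intervals because the fixed per-job overhead dominates. Real clusters will behave differently, so treat this as a way to reason about the shape of the curve rather than as a tuning rule.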
Geraldine Van der Auwera, PhD
Thanks Geraldine. So if I am running Queue on whole human genome resequencing data to perform alignment, realignment, dedup and recalibration, is it smart enough to know that realignment can be run across multiple intervals, that dedup must run on a single genome BAM, and that recalibration should calculate the covariates over multiple intervals, sum them for the genome, apply them to the intervals, and finally merge everything into a single processed BAM? Can I safely increase the scatter_gather variable to over 25 (the number of chromosomes, which it defaults to if given an aligned BAM as input)?