scatterContigIntervals produces highly uneven division of labour
Is there a particular reason that IntervalUtils.scatterContigIntervals (from https://github.com/broadgsa/gatk-protected/blob/master/public/gatk-framework/src/main/java/org/broadinstitute/sting/utils/interval/IntervalUtils.java), called from the ContigScatterFunction, gives each of the first N-1 contigs encountered its own set, where N is the number of scattered interval sets desired, and assigns all remaining contigs to the last set? This often produces a highly imbalanced workload.
For example, in the common case where I'm scatter-gathering a whole-genome analysis with 4 workers at my disposal, the scatter produces:
Worker 1: chromosome 1
Worker 2: chromosome 2
Worker 3: chromosome 3
Worker 4: chromosomes 4 ... 22, X, Y, ...
Whilst I could set scatterCount == number of chromosomes, this makes the job take longer than it should, because each individual scattered tasklet has a fair amount of startup and shutdown overhead. I have replaced scatterContigIntervals with a jury-rigged patch that distributes intervals round-robin. This works for Queue's scatter-gather usage, but it would break any callers that require scatterContigIntervals to maintain interval order, if such callers exist. If they do, I suggest instead counting the total breadth of the given intervals and breaking the sequence into roughly even pieces. I'd be happy to submit a patch if you could clarify the method's contract:
- Is the input interval list guaranteed sorted?
- If not, and I get input intervals like
do I need to ensure both the X intervals go to the same worker?
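To make the two options concrete, here is a minimal sketch of both: round-robin distribution, and an order-preserving split by cumulative breadth. It is not the actual patch and does not use the GATK's own types; the class ScatterSketch, its nested Interval class and both method names are hypothetical stand-ins for illustration. The breadth-based version assumes the input is already in the desired order and makes no attempt to keep same-contig intervals together, which is exactly what the questions above are meant to settle.

```java
import java.util.ArrayList;
import java.util.List;

final class ScatterSketch {
    /** Hypothetical stand-in for a genomic interval (1-based, inclusive). */
    static final class Interval {
        final String contig;
        final long start;
        final long stop;

        Interval(String contig, long start, long stop) {
            this.contig = contig;
            this.start = start;
            this.stop = stop;
        }

        long size() {
            return stop - start + 1;
        }
    }

    /** Round-robin: interval i goes to set (i % scatterCount). Sets are not contiguous. */
    static List<List<Interval>> roundRobin(List<Interval> intervals, int scatterCount) {
        List<List<Interval>> sets = new ArrayList<>();
        for (int i = 0; i < scatterCount; i++) {
            sets.add(new ArrayList<>());
        }
        for (int i = 0; i < intervals.size(); i++) {
            sets.get(i % scatterCount).add(intervals.get(i));
        }
        return sets;
    }

    /**
     * Order-preserving split: a set is closed once the cumulative breadth reaches
     * its proportional share of the total. Returns at most scatterCount sets.
     */
    static List<List<Interval>> splitByBreadth(List<Interval> intervals, int scatterCount) {
        long total = 0;
        for (Interval iv : intervals) {
            total += iv.size();
        }
        List<List<Interval>> sets = new ArrayList<>();
        List<Interval> current = new ArrayList<>();
        long cumulative = 0;
        for (int i = 0; i < intervals.size(); i++) {
            Interval iv = intervals.get(i);
            current.add(iv);
            cumulative += iv.size();

            int remainingIntervals = intervals.size() - i - 1;
            int setsStillNeeded = scatterCount - sets.size() - 1;
            // Cumulative breadth at which the set currently being filled should end.
            long target = total * (sets.size() + 1) / scatterCount;
            boolean reachedTarget = cumulative >= target;
            // Close early if every remaining interval is needed to fill a remaining set.
            boolean mustClose = remainingIntervals == setsStillNeeded;

            if (setsStillNeeded > 0 && (reachedTarget || mustClose)) {
                sets.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {
            sets.add(current);
        }
        return sets;
    }
}
```

Because whole intervals are never cut, the balance can only be as good as the interval sizes allow: one very large contig still ends up as a set of its own. Splitting within a contig would need a different approach.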