scatterContigIntervals produces highly uneven division of labour
Is there a particular reason that IntervalUtils.scatterContigIntervals (from https://github.com/broadgsa/gatk-protected/blob/master/public/gatk-framework/src/main/java/org/broadinstitute/sting/utils/interval/IntervalUtils.java), called from ContigScatterFunction, gives each of the first N - 1 contigs encountered its own set, where N is the number of scattered interval sets desired, and assigns all remaining contigs to the last set? This often produces a highly imbalanced workload.
For example, in the common case where I'm scatter-gathering a whole-genome analysis with 4 workers at my disposal, the scatter produces:
Worker 1: chromosome 1
Worker 2: chromosome 2
Worker 3: chromosome 3
Worker 4: chromosomes 4 ... 22, X, Y, ...
Whilst I could set scatterCount == number of chromosomes, this makes the job take longer than it should, because each individual scattered tasklet has a considerable startup and shutdown cost. I have replaced scatterContigIntervals with a jury-rigged patch that distributes intervals round-robin. This works for Queue's scatter-gather usage, but it might break any callers that require scatterContigIntervals to maintain interval order, if there are any such callers. If there are, I suggest instead summing the total breadth of the given intervals and splitting the sequence into contiguous, roughly equal parts. I'd be happy to submit a patch if you could clarify the method's contract:
- Is the input interval list guaranteed sorted?
- If not, and I get input intervals like
do I need to ensure both the X intervals go to the same worker?
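To make the two options concrete, here is a minimal Java sketch of both strategies. It is not my actual patch and not GATK code: `Interval` is a hypothetical stand-in for GenomeLoc, and the names `ScatterSketch`, `scatterRoundRobin` and `scatterByBreadth` are made up for illustration. It assumes the input is already in the desired order, and neither method splits an individual interval, so one very large interval can still leave some parts empty or unbalanced.

```java
import java.util.ArrayList;
import java.util.List;

class ScatterSketch {
    // Hypothetical stand-in for GATK's GenomeLoc (not the real class).
    static final class Interval {
        final String contig;
        final long start; // inclusive
        final long stop;  // inclusive
        Interval(String contig, long start, long stop) {
            this.contig = contig;
            this.start = start;
            this.stop = stop;
        }
        long size() { return stop - start + 1; }
    }

    private static List<List<Interval>> emptyParts(int parts) {
        List<List<Interval>> out = new ArrayList<>();
        for (int i = 0; i < parts; i++) out.add(new ArrayList<>());
        return out;
    }

    // Round-robin: interval i goes to part (i % parts).
    // Balances interval counts, but interleaves intervals across parts.
    static List<List<Interval>> scatterRoundRobin(List<Interval> intervals, int parts) {
        List<List<Interval>> out = emptyParts(parts);
        for (int i = 0; i < intervals.size(); i++) {
            out.get(i % parts).add(intervals.get(i));
        }
        return out;
    }

    // Order-preserving: cut the sequence into contiguous runs of roughly equal breadth.
    static List<List<Interval>> scatterByBreadth(List<Interval> intervals, int parts) {
        long total = 0;
        for (Interval iv : intervals) total += iv.size();

        List<List<Interval>> out = emptyParts(parts);
        long seen = 0; // breadth already assigned
        int part = 0;
        for (Interval iv : intervals) {
            // Advance once the assigned breadth reaches the current part's share of the total.
            while (part < parts - 1 && seen >= total * (part + 1) / parts) part++;
            out.get(part).add(iv);
            seen += iv.size();
        }
        return out;
    }

    public static void main(String[] args) {
        List<Interval> in = new ArrayList<>();
        in.add(new Interval("1", 1, 100));
        in.add(new Interval("2", 1, 50));
        in.add(new Interval("3", 1, 50));
        in.add(new Interval("4", 1, 100));
        in.add(new Interval("5", 1, 100));
        for (List<Interval> p : scatterByBreadth(in, 2)) {
            long breadth = 0;
            for (Interval iv : p) breadth += iv.size();
            System.out.println(p.size() + " intervals, breadth " + breadth);
        }
    }
}
```

The breadth-based version keeps each part a contiguous slice of the input, so it should be safe for callers that rely on interval order; the round-robin version makes no such promise.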