scatterContigIntervals produces highly uneven division of labour
Is there a particular reason that IntervalUtils.scatterContigIntervals (from https://github.com/broadgsa/gatk-protected/blob/master/public/gatk-framework/src/main/java/org/broadinstitute/sting/utils/interval/IntervalUtils.java), called from the ContigScatterFunction, gives each of the first N-1 contigs encountered its own set, where N is the number of scattered interval sets desired, and assigns all remaining contigs to the last set? This often produces a highly imbalanced workload.
For example, in the common case where I'm scatter-gathering a whole-genome analysis with 4 workers at my disposal, the scatter produces:
Worker 1: chromosome 1
Worker 2: chromosome 2
Worker 3: chromosome 3
Worker 4: chromosomes 4 ... 22, X, Y, ...
Whilst I could set scatterCount to the number of chromosomes, this makes the job take longer than it should, because each scattered tasklet has a fair amount of startup and shutdown overhead. I have replaced scatterContigIntervals with a jury-rigged patch that distributes intervals round-robin. This works for Queue's scatter-gather usage, but might break any callers that require scatterContigIntervals to maintain interval order, if any such callers exist. If they do, I suggest instead summing the total breadth of the given intervals and breaking the sequence into roughly even parts. I'd be happy to submit a patch if you could clarify the method's contract:
- Is the input interval list guaranteed sorted?
- If not, and I get input intervals like
do I need to ensure both the X intervals go to the same worker?
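
To make the breadth-based suggestion concrete, here is a minimal stand-alone sketch in Java. It is not the GATK code and does not use the GATK interval classes: the class BreadthScatterSketch, its nested Interval type and the method splitByBreadth are hypothetical names made up for illustration, and coordinates are assumed to be 1-based and inclusive. It keeps the input order, never splits an individual interval, and places each cut at the interval boundary nearest to an even share of the total breadth.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone sketch; not part of GATK's IntervalUtils.
class BreadthScatterSketch {

    // Minimal stand-in for a genomic interval (assumed 1-based, inclusive).
    static final class Interval {
        final String contig;
        final long start;
        final long stop;

        Interval(String contig, long start, long stop) {
            this.contig = contig;
            this.start = start;
            this.stop = stop;
        }

        long size() {
            return stop - start + 1;
        }
    }

    // Splits the intervals, in their given order, into at most scatterCount
    // consecutive sets of roughly equal total breadth.
    static List<List<Interval>> splitByBreadth(List<Interval> intervals, int scatterCount) {
        if (scatterCount < 1) {
            throw new IllegalArgumentException("scatterCount must be at least 1");
        }

        long total = 0;
        for (Interval interval : intervals) {
            total += interval.size();
        }

        List<List<Interval>> parts = new ArrayList<List<Interval>>();
        for (int p = 0; p < scatterCount; p++) {
            parts.add(new ArrayList<Interval>());
        }

        long before = 0; // breadth of all intervals already assigned
        for (Interval interval : intervals) {
            long size = interval.size();
            // Assign by the interval's midpoint within the cumulative breadth:
            // part = floor(midpoint * scatterCount / total), in integer arithmetic.
            int part = (int) ((2 * before + size) * scatterCount / (2 * total));
            parts.get(Math.min(part, scatterCount - 1)).add(interval);
            before += size;
        }

        // A very large interval can leave some sets empty; drop those.
        parts.removeIf(List::isEmpty);
        return parts;
    }
}
```

With six equal-sized contigs and a scatterCount of 3, this yields three sets of two contigs each. It does nothing to keep separated intervals of the same contig together, so if the input is not sorted, two X intervals could still land on different workers, which is why the second question matters.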