Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

scatterContigIntervals produces highly uneven division of labour

smowtonsmowton Manchester, UKMember

Is there a particular reason that IntervalUtils.scatterContigIntervals (from https://github.com/broadgsa/gatk-protected/blob/master/public/gatk-framework/src/main/java/org/broadinstitute/sting/utils/interval/IntervalUtils.java), called from the ContigScatterFunction, assigns all but the first N contigs encountered, where N is the number of scattered interval sets desired, to the last set and thus often produces a highly imbalanced workload?

For example, in the common case that I'm scatter-gathering a whole-genome analysis, and have 4 workers at my disposal, it will scatter producing:

Worker 1: chromosome 1
Worker 2: chromosome 2
Worker 3: chromosome 3
Worker 4: chromosomes 4 ... 22, X, Y, ...

Whilst I could set scatterCount == number of chromosomes, this makes the job take longer than it should because each individual scattered tasklet has a fair startup and shutdown time. I have replaced scatterContigIntervals with a jury-rigged patch that round-robin distributes intervals, which works for Queue's scatter-gather usage, but might break any callers that require scatterContigIntervals to maintain interval order if there are any such callers. If so then I suggest counting the total breadth of the given intervals, and trying to break the sequence roughly-evenly. I'd be happy to submit a patch if you could clarify the method's contract:

  • Is the input interval list guaranteed sorted?
  • If not, and I get input intervals like

X:1-100
Y:1-100
X:101-200

do I need to ensure both the X intervals go to the same worker?

Chris

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi Chris,

    Your observation is correct -- right now the distribution of work for scatter count << contig number is suboptimal. We'd love to look at a patch that improves this! I'll get @kshakir to confirm the method's contract.

  • kshakirkshakir Broadie, Dev ✭✭

    Contract of the scatterContigIntervals:

    Pre conditions:

    • locs[1..x] are sorted by contig then start position
    • locs[x].stop() < locs[x+1].start(), aka no overlapping locs
    • locs.size() <= scatterParts.size()
    • scatterParts[1..y] are in a specific order

    Post conditions:

    • locs from the same contig belong to only one scatter part
    • locs in each scatterPart are still sorted
    • ( locs in scatterParts[y] ) are all < ( locs in scatterParts[y+1] )

    The post conditions ensure that a) no locs from the same contig end up in separate scatter jobs, and b) the gather may concatenate the job outputs together.

    Besides that it's up to the implementer as to when the actual scatterPart files are written to disk, etc. If I've forgotten anything or it's not clear, please let us know. As Geraldine said, a patch with a better balancing algorithm would be happily reviewed.

  • smowtonsmowton Manchester, UKMember
    edited July 2014

    OK, in that case I suggest the attached patch -- basically we try to split optimally, with the constraints that each boundary point gets walked forward to find a contig boundary, and we force each scatter part to take at least one contig.

    Test cases tried:

    • One scatterPart
    • Number of scatterParts equal to number of locs
    • Multiple intervals with same chromosome across an ideal split point
    • Ideal split point that must not be used because we would run out of contigs for the later scatter parts

    One caveat: you'd prob want to change the generic StingExceptions into something more appropriate.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @smowton,

    Sorry for the long delay, the engineers were dealing with some high priority tasks. They will look at your proposed patch now but they need you to make a pull request for it against the public github repository here: https://github.com/broadgsa/gatk

  • smowtonsmowton Manchester, UKMember

    Just noticed this now, sorry. There's probably a way to get the forum software to send me email... anyhow I'll get a pull request in tomorrow.

  • smowtonsmowton Manchester, UKMember

    Pull request created.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thanks @smowton‌, I'll make sure this gets processed.

    FYI we have a FAQ about turning on forum notifications here: https://www.broadinstitute.org/gatk/guide/article?id=27

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @smowton‌,

    Your patch has been merged into our repository (with minor modifications) and the functionality will be available in the next nightly build and in the 3.3 release (which we hope to have out by the end of the week).

Sign In or Register to comment.