distribution of scatter intervals in Queue

Hi,
I am using Queue to run BaseRecalibrator (partition type of READ). I am testing the performance of the system at different scatter counts, but I am confused by the interval files Queue generates. Am I misunderstanding them? Here is a version of my ../temp_25_of_80/scatter.intervals, edited for readability:
@HD VN:1.4
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
. . .
@SQ SN:chrUn_gl000246 LN:38154
@SQ SN:chrUn_gl000247 LN:36422
@SQ SN:chrUn_gl000248 LN:39786
@SQ SN:chrUn_gl000249 LN:38502
@SQ SN:chrDecoy LN:35477943
chrM 1 16571 + interval_25

The first 86 lines of the interval file don't appear to change across temp directories; the remaining lines indicate what I understand to be the actual region used in that particular scattered run. So in the above example, temp_25 runs on all of chrM. The first 79 temp files in this example have one chromosome per directory, so temp_01_of_80 corresponds to chr1, temp_02_of_80 corresponds to chr2, etc., until temp_80_of_80, which has:
chrUn_gl000245 1 36651 + interval_80
chrUn_gl000246 1 38154 + interval_81
chrUn_gl000247 1 36422 + interval_82
chrUn_gl000248 1 39786 + interval_83
chrUn_gl000249 1 38502 + interval_84
chrDecoy 1 35477943 + interval_85
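
To compare how much work each scattered run receives, the interval records can be tallied per file. The following is only a minimal Python sketch, assuming header lines start with `@` and each record has the five whitespace-separated fields shown above; the function name `bases_in_interval_list` is hypothetical and not part of GATK or Queue.

```python
def bases_in_interval_list(text):
    """Sum the bases covered by the interval records in an interval-list text."""
    total = 0
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("@"):
            continue  # skip blank lines and @HD/@SQ header lines
        contig, start, end, strand, name = line.split()
        total += int(end) - int(start) + 1  # coordinates are 1-based, inclusive
    return total


# Records copied from the two scatter.intervals files quoted in the post.
temp_25 = """\
@HD VN:1.4
@SQ SN:chrDecoy LN:35477943
chrM 1 16571 + interval_25
"""

temp_80 = """\
chrUn_gl000245 1 36651 + interval_80
chrUn_gl000246 1 38154 + interval_81
chrUn_gl000247 1 36422 + interval_82
chrUn_gl000248 1 39786 + interval_83
chrUn_gl000249 1 38502 + interval_84
chrDecoy 1 35477943 + interval_85
"""

print(bases_in_interval_list(temp_25))  # 16571
print(bases_in_interval_list(temp_80))  # 35667458
```

By this count, temp_80_of_80 covers roughly 2,000 times as many bases as temp_25_of_80, almost all of it from chrDecoy.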

Is this how scatter works? Does it just match one chromosome to each scattered run until the last one, into which it shoves any remaining regions? Isn't there more of an attempt to balance the size of each scattered element?

My testing has been in 2.7.
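
The pattern described above (one contig per scattered run, with everything left over going to the last run) can be reproduced with a few lines of Python. This is a sketch of the observed behavior only, not Queue's actual code; `scatter_by_contig` and the placeholder contig names are hypothetical.

```python
def scatter_by_contig(contigs, scatter_count):
    """Give each of the first scatter_count - 1 pieces one contig, in order,
    and put all remaining contigs into the last piece."""
    if not 1 <= scatter_count <= len(contigs):
        raise ValueError("scatter_count must be between 1 and the number of contigs")
    head = [[contig] for contig in contigs[:scatter_count - 1]]
    tail = contigs[scatter_count - 1:]
    return head + [tail]


# 85 placeholder contigs scattered 80 ways, as in the post.
contigs = [f"interval_{i}" for i in range(1, 86)]
pieces = scatter_by_contig(contigs, 80)

print(len(pieces))   # 80
print(pieces[24])    # ['interval_25']
print(pieces[-1])    # interval_80 through interval_85
```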

Answers

  • thibault (Broad Institute; Member, Broadie, Moderator, Dev admin)
    edited February 2014

    This is how scattering works for partition type READ (as well as partition type CONTIG). More balanced partitions are possible for LOCUS and INTERVAL partition types.

    Some examples can be found here: http://gatkforums.broadinstitute.org/discussion/1310/pipelining-the-gatk-with-queue

  • adouble2 (Member)

    Yeah, I get that, but I am just wondering if Queue attempts to balance the size of scattered elements at all. In my example, chrDecoy is significantly bigger than chrUn_gl000244, chrUn_gl000245, etc., so it would make a bit more sense to see temp_79_of_80 have interval_80 through interval_84, and temp_80_of_80 have interval_85. I realize that regardless of the arrangement, chr1 will still be the largest interval, so maybe it's just decided that this optimization of scatter size is moot. I am just trying to confirm if there is some configuration where Queue would try to optimize which intervals get assigned to a temp file.

  • kshakir (Broadie, Dev)

    The READ and CONTIG scatters are not currently optimized based on contig size, but could be. See IntervalUtils for the current implementation of scatterContigIntervals, and the unit tests for example invocations.
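
For illustration, the kind of balancing asked about in this thread can be sketched as an order-preserving split that keeps contigs whole and minimizes the size of the largest piece. This is a hypothetical Python sketch, not what Queue or IntervalUtils does; the function name `balanced_contiguous_split` is made up here, and the sizes are the contig lengths quoted in the question.

```python
from itertools import accumulate


def balanced_contiguous_split(sizes, parts):
    """Split sizes into `parts` non-empty contiguous runs so that the largest
    run total is as small as possible. Returns (start, end) index pairs."""
    n = len(sizes)
    if not 1 <= parts <= n:
        raise ValueError("parts must be between 1 and len(sizes)")
    prefix = [0] + list(accumulate(sizes))
    inf = float("inf")
    # best[k][i]: smallest achievable maximum when the first i items form k runs
    best = [[inf] * (n + 1) for _ in range(parts + 1)]
    cut = [[0] * (n + 1) for _ in range(parts + 1)]
    best[0][0] = 0
    for k in range(1, parts + 1):
        for i in range(k, n + 1):
            for j in range(k - 1, i):
                cost = max(best[k - 1][j], prefix[i] - prefix[j])
                if cost < best[k][i]:
                    best[k][i] = cost
                    cut[k][i] = j
    # Walk the recorded cut points back from the end to recover the runs.
    bounds = []
    i = n
    for k in range(parts, 0, -1):
        j = cut[k][i]
        bounds.append((j, i))
        i = j
    bounds.reverse()
    return bounds


# chrUn_gl000245 .. chrUn_gl000249, then chrDecoy, split into two pieces.
sizes = [36651, 38154, 36422, 39786, 38502, 35477943]
print(balanced_contiguous_split(sizes, 2))  # [(0, 5), (5, 6)]
```

On these six contigs the split puts the five small chrUn contigs together and leaves chrDecoy on its own, which is the arrangement suggested above. As noted in the thread, the largest piece can never be smaller than the largest single contig (chr1) as long as contigs are kept whole.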
