distribution of scatter intervals in Queue
I am using Queue to run BaseRecalibrator (partition type of READ). I am testing the performance of the system at different scatter counts, but I am confused by the interval files Queue generates. Am I misunderstanding them? Here is a version of my ../temp_25_of_80/scatter.intervals, edited for readability:
@HD VN:1.4
@SQ SN:chr1 LN:249250621
@SQ SN:chr2 LN:243199373
@SQ SN:chr3 LN:198022430
. . .
@SQ SN:chrUn_gl000246 LN:38154
@SQ SN:chrUn_gl000247 LN:36422
@SQ SN:chrUn_gl000248 LN:39786
@SQ SN:chrUn_gl000249 LN:38502
@SQ SN:chrDecoy LN:35477943
chrM 1 16571 + interval_25
The first 86 lines of the interval file don't appear to change across temp directories; the remaining lines indicate what I understand to be the actual region used in that particular scattered run. In the example above, temp_25 runs on all of chrM. The first 79 temp directories in this example each hold one chromosome, so temp_01_of_80 corresponds to chr1, temp_02_of_80 corresponds to chr2, and so on, until temp_80_of_80, which has:
chrUn_gl000245 1 36651 + interval_80
chrUn_gl000246 1 38154 + interval_81
chrUn_gl000247 1 36422 + interval_82
chrUn_gl000248 1 39786 + interval_83
chrUn_gl000249 1 38502 + interval_84
chrDecoy 1 35477943 + interval_85
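To make the pattern I am seeing concrete, here is a minimal Python sketch of it. This is only a model of the observed output, not Queue's actual code: the function name `observed_scatter` and the shortened contig list are made up for illustration, and the contig lengths are copied from the header above.

```python
def observed_scatter(contigs, scatter_count):
    """Model of the observed pattern only (hypothetical helper, not Queue's code).

    Each of the first scatter_count - 1 contigs gets its own piece;
    every remaining contig is placed in the final piece.
    """
    if scatter_count < 1:
        raise ValueError("scatter_count must be at least 1")
    if len(contigs) <= scatter_count:
        # Fewer contigs than pieces: one contig per piece, nothing left over.
        return [[c] for c in contigs]
    head = [[c] for c in contigs[:scatter_count - 1]]
    tail = [list(contigs[scatter_count - 1:])]
    return head + tail


# Shortened, illustrative contig list (name, length).
contigs = [
    ("chr1", 249250621),
    ("chr2", 243199373),
    ("chr3", 198022430),
    ("chrUn_gl000248", 39786),
    ("chrUn_gl000249", 38502),
    ("chrDecoy", 35477943),
]

pieces = observed_scatter(contigs, 4)
# Total bases covered by each piece, to show how uneven the split is.
sizes = [sum(length for _, length in piece) for piece in pieces]

for i, (piece, size) in enumerate(zip(pieces, sizes), start=1):
    names = ", ".join(name for name, _ in piece)
    print(f"piece {i}: {names} ({size} bp)")
```

Under this model the piece sizes follow the contig lengths directly, so the chr1 piece covers far more bases than the last piece, which is the imbalance I am asking about.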
Is this how scatter works? Does it just match one chromosome to each scattered run until the last one, into which it shoves any remaining regions? Isn't there more of an attempt to balance the size of each scattered element?
My testing has been in 2.7.