SplitIntervals: option to specify a minimum genomic distance for intervals at the edges of splits

mack812 (Spain, Member)
edited February 26 in Ask the GATK team

Hi,

I detected an issue when doing scatter-gather by interval splits with my WDL script, developed to analyze WES data from tumor samples (SNP/indel variant discovery workflow). The issue is explained in detail in a previous post: https://gatkforums.broadinstitute.org/gatk/discussion/23486/out-of-order-read-after-markduplicatespark-baserecalibrator-applybqsr

I am bringing this up again in this new post because I am afraid that by replying to myself in the previous one I probably made it less visible to the GATK team (sorry for that).

As described in the previous post, I was able to solve the issue by editing the scattered interval files in a text editor and manually rearranging the intervals at the edges of the splits. So my question/request is:

  • Would it be possible to use the SplitIntervals tool in such a way that the intervals at the edges of the splits (the last interval in a given split and the first interval in the following one) are separated by a specified minimum genomic distance?
    • As an example: an option in SplitIntervals, let's call it "edge distance", that, given a value of 2 kb (--edge-distance 2000), selects the intervals at the edges of every split so that the end position of the last interval in split n (in file 000n-scattered.interval_list) is at least 2 kb away from the start position of the first interval in split n+1 (in file 000n+1-scattered.interval_list). For WES experiments, given the size of the average intron, this requirement is very easy to meet (i.e. most intervals are separated by more than 1 or 2 kb).
      This option would be used together with the option --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION
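To make the request concrete, here is a rough Python sketch of what I currently do by hand. It is not part of GATK: `enforce_edge_distance` and `edge_gap` are names I made up, and it assumes the intervals are sorted, non-overlapping, and already grouped into splits in genomic order. It simply moves the first intervals of split n+1 into split n until the edge gap reaches the requested minimum.

```python
from typing import List, Tuple

# (contig, start, end), 1-based inclusive as in interval_list files
Interval = Tuple[str, int, int]


def edge_gap(prev_last: Interval, next_first: Interval) -> float:
    """Distance in bp from the end of one interval to the start of the next.

    Intervals on different contigs are treated as infinitely far apart.
    """
    if prev_last[0] != next_first[0]:
        return float("inf")
    return next_first[1] - prev_last[2]


def enforce_edge_distance(
    splits: List[List[Interval]], min_distance: int
) -> List[List[Interval]]:
    """Move intervals across split boundaries until each edge gap >= min_distance.

    Hypothetical helper: assumes sorted, non-overlapping intervals and splits
    given in genomic order. A split is never emptied, so a boundary can stay
    too close if the following split has only one interval left.
    """
    result = [list(s) for s in splits]
    for n in range(len(result) - 1):
        current, following = result[n], result[n + 1]
        # Pull leading intervals of split n+1 into split n while the edge is too close.
        while (
            current
            and len(following) > 1
            and edge_gap(current[-1], following[0]) < min_distance
        ):
            current.append(following.pop(0))
    return result


splits = [
    [("chr1", 100, 200), ("chr1", 1000, 1100)],
    [("chr1", 1150, 1300), ("chr1", 5000, 5100)],
]
fixed = enforce_edge_distance(splits, min_distance=2000)
```

In this toy example the interval chr1:1150-1300 is moved from the second split into the first one, so the edge gap grows from 50 bp to 3700 bp. The splits become slightly less balanced, which I think is an acceptable trade-off for WES data.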

As described in my previous post, by doing this manually I was able to prevent Mutect2 from crashing with an "Attempting to add a read to ActiveRegion out of order w.r.t. other reads" error. I traced this error back to duplicate reads present in the merged recalibrated BAM; in all cases these were reads spanning intervals at the edges of contiguous splits that were separated by a distance smaller than the read length (150 bp). I think this was causing the duplication of these "edge-crossing reads" during the scatter step.
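In case it helps anyone check their own scattered files, this is a small sketch of how the problematic edges could be flagged. The function names are my own, and it assumes the usual interval_list layout (header lines starting with "@", then tab-separated contig, start, end, strand, name). It only compares coordinates against the read length and ignores clipping and indels, so treat the result as approximate.

```python
from typing import List, Tuple

Interval = Tuple[str, int, int]  # (contig, start, end), 1-based inclusive


def parse_interval_list(text: str) -> List[Interval]:
    """Parse the body of an interval_list file, skipping '@' header lines."""
    intervals = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.split("\t")
        intervals.append((fields[0], int(fields[1]), int(fields[2])))
    return intervals


def risky_edges(
    split_texts: List[str], read_length: int = 150
) -> List[Tuple[int, Interval, Interval]]:
    """Return (split index, last interval, first interval of next split) for
    every boundary where a single read of `read_length` could overlap both."""
    parsed = [parse_interval_list(t) for t in split_texts]
    risky = []
    for n in range(len(parsed) - 1):
        if not parsed[n] or not parsed[n + 1]:
            continue
        last, first = parsed[n][-1], parsed[n + 1][0]
        # Same contig, and start-to-end distance shorter than one read.
        if last[0] == first[0] and first[1] - last[2] < read_length:
            risky.append((n, last, first))
    return risky


split_0 = "@HD\tVN:1.6\nchr1\t100\t200\t+\texon_a\nchr1\t1000\t1100\t+\texon_b\n"
split_1 = "@HD\tVN:1.6\nchr1\t1150\t1300\t+\texon_c\nchr1\t5000\t5100\t+\texon_d\n"
flagged = risky_edges([split_0, split_1], read_length=150)
```

Here the boundary between the two splits is flagged because chr1:1000-1100 and chr1:1150-1300 are only 50 bp apart, so a 150 bp read can overlap both intervals and be picked up by both scatter jobs.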

Thank you in advance
