Out of order read after MarkDuplicatesSpark + BaseRecalibrator/ApplyBQSR

Hi,

I am building a workflow for discovery of somatic SNVs + indels that is pretty much the Broad's Best Practices, but incorporating MarkDuplicatesSpark and a couple of other minor changes. Today I was running a normal-tumor pair of samples from WES experiments on GCP, and everything was going well until the workflow failed during Mutect2. In one of the shards (I am scattering the M2 step across 12 splits of the exome bedfile) I got this error:

    13:53:46.994 INFO  ProgressMeter -       chr19:18926479             20.2                 22440           1112.1
    13:53:51.138 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.589863008
    13:53:51.145 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 415.78724766500005
    13:53:51.147 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 82.56 sec
    13:53:52.161 INFO  Mutect2 - Shutting down engine
    [February 19, 2019 1:53:52 PM UTC] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 20.68 minutes.
    Runtime.totalMemory()=1132453888
    java.lang.IllegalArgumentException: Attempting to add a read to ActiveRegion out of order w.r.t. other reads: lastRead SRR3270880.37535587 chr19:19227104-19227253 at 19227104 attempting to add SRR3270880.23592400 chr19:19226999-19227148 at 19226999
        at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:730)
        at org.broadinstitute.hellbender.engine.AssemblyRegion.add(AssemblyRegion.java:338)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.fillNextAssemblyRegionWithReads(AssemblyRegionIterator.java:230)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.loadNextAssemblyRegion(AssemblyRegionIterator.java:194)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:135)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:34)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:286)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
        at org.broadinstitute.hellbender.Main.main(Main.java:291)
    Using GATK jar /gatk/gatk-package-4.1.0.0-local.jar

The other 11 shards finished without errors and produced the expected output.

I checked the bam from the tumor sample and indeed the read mentioned in the error is out of order. It is the second read from the end in the following snippet (pasting here only the first 9 columns from the bam file):

    SRR3270880.37535587 163 chr19   19227104    60  150M    =   19227395    441
    SRR3270880.46694860 147 chr19   19227106    60  150M    =   19226772    -484
    SRR3270880.60287639 1171    chr19   19227106    60  150M    =   19226772    -484
    SRR3270880.68448188 83  chr19   19227106    60  150M    =   19226611    -645
    SRR3270880.70212050 1171    chr19   19227106    60  150M    =   19226772    -484
    SRR3270880.23592400 163 chr19   19226999    60  150M    =   19227232    383
    SRR3270880.21876644 1171    chr19   19227001    60  150M    =   19226793    -358

The read does not have any bad quality flags, and it appears twice in the bam; the first occurrence is in the correct order (second read in the following snippet):

    SRR3270880.61849825 147 chr19   19226995    60  150M    =   19226895    -250
    SRR3270880.23592400 163 chr19   19226999    60  150M    =   19227232    383
    SRR3270880.21876644 1171    chr19   19227001    60  150M    =   19226793    -358
    SRR3270880.47062210 147 chr19   19227001    60  150M    =   19226625    -526

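For reference, this is roughly how I checked it (a sketch; it assumes the gathered, recalibrated bam is indexed and is called tumor.recal.bam, which is not its real name):

    samtools view tumor.recal.bam chr19:19226999-19227000 \
        | awk '$1 == "SRR3270880.23592400"' | wc -l
    # prints 2: the same alignment record appears twice in the gathered bam
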
The workflow does not include a SortSam step after MarkDuplicatesSpark, as MarkDuplicatesSpark's output is supposed to be coordinate-sorted. From the bam's header: @HD VN:1.6 GO:none SO:coordinate

Prior to Mutect2, the BaseRecalibrator-GatherBQSRReports-ApplyBQSR-GatherBamFiles steps (non-Spark versions) finished without any errors. These steps are also scattered across interval splits of the exome bedfile.

Strikingly, the start-end positions of this out-of-order read span from the last interval of interval split 6 into the first interval of interval split 7. Perhaps the read was included in two contiguous splits of the bam at the same time, which would explain why it appears twice in the bam after the merge done by GatherBamFiles. (Last interval of split 6: chr19 19226311 19227116; first interval of split 7: chr19 19227145 19228774.)

Intervals in my workflow are split with the SplitIntervals tool (GATK 4.1.0.0). I am currently including the argument --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION and suspect this could be related to the error...
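For reference, the call looks roughly like this (paths are illustrative; the scatter count matches the 12 splits mentioned above):

    gatk SplitIntervals \
      -R Homo_sapiens_assembly38.fasta \
      -L S03723314_Padded_hg38_collapsed.interval_list \
      --scatter-count 12 \
      --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION \
      -O interval-files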

Any ideas on how this issue can be solved?

Thank you in advance

Answers

  • mack812 (Spain)

    Hi again,

    I have repeated the process with the exome interval file split into a different number of sub-interval files. The error happened again, with the out-of-order read at a different location. I can confirm that the wrong order is not a product of MarkDuplicatesSpark; it happens during the scatter-gather steps of BaseRecalibrator-ApplyBQSR.

    This time I split the exome interval file into a different number of subsets with the following command (no --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION option this time):

    gatk --java-options "-Djava.io.tmpdir=/tmp -Xms2000m" SplitIntervals \
      -R /RefGenomes/GRCh38_refs/Homo_sapiens_assembly38.fasta \
      -L  ~/S03723314_Padded_hg38_collapsed.interval_list \
      --scatter-count 10 \
      -O interval-files-10-splits
    

    With this set of interval files I again scattered the BaseRecalibrator-ApplyBQSR steps, followed by GatherBamFiles (all GATK 4.1.0.0).
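    In outline, each shard runs the recalibration pair on one of those interval files, and the per-shard bams are then merged back together (file and resource names below are illustrative; in the real wdl the recalibration tables are also gathered with GatherBQSRReports between the two tools, which is omitted here for brevity):

    # one shard of the scatter; the same pair of commands runs for every
    # *-scattered.interval_list produced above
    gatk BaseRecalibrator \
      -R Homo_sapiens_assembly38.fasta \
      -I tumor.markdup.bam \
      -L interval-files-10-splits/0000-scattered.interval_list \
      --known-sites Homo_sapiens_assembly38.dbsnp138.vcf \
      -O tumor.0000.recal.table

    gatk ApplyBQSR \
      -R Homo_sapiens_assembly38.fasta \
      -I tumor.markdup.bam \
      -L interval-files-10-splits/0000-scattered.interval_list \
      --bqsr-recal-file tumor.0000.recal.table \
      -O tumor.0000.recal.bam

    # gather: concatenate the per-shard outputs into a single bam
    gatk GatherBamFiles \
      -I tumor.0000.recal.bam \
      -I tumor.0001.recal.bam \
      -O tumor.recal.bam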

    Then, when I run Mutect2 (GATK 4.1.0.0), I get the same error, this time for a read at a different location, but again within the last interval of one of the interval files generated by SplitIntervals and used in the scatter process.

    The error:

    12:06:00.627 INFO  ProgressMeter -       chr1:247451348             17.1                 22440           1310.1
    12:06:10.240 INFO  VectorLoglessPairHMM - Time spent in setup for JNI call : 0.38375621000000004
    12:06:10.245 INFO  PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 238.902958062
    12:06:10.253 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 75.00 sec
    12:06:12.293 INFO  Mutect2 - Shutting down engine
    [February 20, 2019 12:06:12 PM UTC] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 17.78 minutes.
    Runtime.totalMemory()=1059901440
    java.lang.IllegalArgumentException: Attempting to add a read to ActiveRegion out of order w.r.t. other reads: lastRead SRR3270879.4027480 chr1:248099349-248099498 at 248099349 attempting to add SRR3270880.5831077 chr1:248099301-248099450 at 248099301
        at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:730)
        at org.broadinstitute.hellbender.engine.AssemblyRegion.add(AssemblyRegion.java:338)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.fillNextAssemblyRegionWithReads(AssemblyRegionIterator.java:230)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.loadNextAssemblyRegion(AssemblyRegionIterator.java:194)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:135)
        at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:34)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:286)
        at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267)
        at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
        at org.broadinstitute.hellbender.Main.main(Main.java:291)
    

    Again, the read producing the error (SRR3270880.5831077) is present twice as read 1 in the recalibrated bam, but only once in the bam output from MarkDuplicatesSpark, right before the scatter step.

    This is the wrongly ordered occurrence of the read in the recalibrated bam (the read in the middle row):

    SRR3270880.6778821  99  chr1    248099350   60  150M    =   248099507   307
    SRR3270880.5831077  99  chr1    248099301   60  150M    =   248099565   414
    SRR3270880.18649293 163 chr1    248099308   60  150M    =   248099633   475
    

    And this is the occurrence in the right order (in the same recalibrated bam file):

    SRR3270880.39962045 99  chr1    248099267   60  79M71S  =   248099378
    SRR3270880.5831077  99  chr1    248099301   60  150M    =   248099565
    SRR3270880.18649293 163 chr1    248099308   60  150M    =   248099633
    

    Again, the bam before recalibration (output from MarkDuplicatesSpark) only has one occurrence of this read as read 1.

    Note that, as in my previous post, the read producing the error starts within the last interval of one of the interval file splits and ends within the first interval of the following interval file split.

    Troublesome read mapping coordinates:
    SRR3270880.5831077 chr1:248099301-248099450
    Last interval from split "0000-scattered.interval_list":
    chr1 248099212 248099357
    First interval from split "0001-scattered.interval_list":
    chr1 248099358 248100544

    Therefore, it is clear that the read got duplicated during the scatter-gather process, since it appears only once in the bam before scattering. The read starts at 248099301, inside the last interval of split 0000 (which ends at 248099357), and ends at 248099450, inside the first interval of split 0001 (which starts at 248099358), so it overlaps both edge intervals. It seems that, during the splitting of the bam file, the read was included in two contiguous splits at the same time and is therefore duplicated upon merging by GatherBamFiles.

    Of course I could include a SortSam step after each scatter-gather, but that would make the workflow considerably slower and more expensive, and the read would probably still be duplicated.

  • mack812 (Spain)

    I forgot to mention: this happens in all shards, always in the last interval of every split of the intervals file.

  • mack812 (Spain)

    I have solved this issue by running SplitIntervals locally with --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION and then manually fixing the output: I rearrange the intervals at the border of every split so that the genomic distance between the last interval of each split and the first interval of the next contiguous split is at least 1 kilobase. This way no read (150 bp in the samples I am working with) can span the intervals at the "border" of two splits. Therefore, when the bam output of MDSpark is scattered by intervals prior to the recalibration step, the reads at the "borders" of the scattered intervals go to only one of the splits, avoiding both the duplication and the out-of-order issue during somatic variant calling.
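    As a rough sketch of the check behind this fix (assuming the scattered files are named *-scattered.interval_list, as SplitIntervals writes them, and a 150 bp read length), one can flag split borders that a read could still span:

    # report borders between consecutive splits that are closer than one read length
    READ_LEN=150
    prev_last=""
    for f in interval-files-10-splits/*-scattered.interval_list; do
        first=$(grep -v '^@' "$f" | head -n 1)
        if [ -n "$prev_last" ]; then
            printf '%s\t%s\n' "$prev_last" "$first" | awk -v rl="$READ_LEN" '
                $1 == $6 && $7 - $3 < rl {
                    printf "gap of %d bp between %s:%d and %s:%d\n", $7 - $3, $1, $3, $6, $7 }'
        fi
        prev_last=$(grep -v '^@' "$f" | tail -n 1)
    done

    Any border flagged this way is one where I move the split position by one interval, as described above.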

    By doing this and repeating exactly the same process (no changes in the wdl other than running the SplitIntervals step locally on my laptop and then feeding the scattered intervals to the wdl as an Array[File] input) with the same samples (same bam files), Mutect2 finished without any errors, and so did the rest of the steps of the wdl script up to the final filtered vcf.

    The shortcoming of this manual fix is that I have to take any SplitIntervals steps (used to generate scattered intervals during the analysis run) out of my wdl scripts, do this step manually on my laptop, upload the manually "fixed" scattered interval files to my bucket before the run, and give them as an input to the wdl workflow. Every time I need to change the number of splits/scatters I will have to locally produce and fix a new set of intervals.

    Is there an option to do something similar with the SplitIntervals tool? If not, it would be great if there were an option to space the last interval of each split at least X base pairs away from the first interval of the following contiguous split.

  • bhanuGandham (Cambridge, MA; Moderator)

    Hi @mack812

    There was a recent change to the interval list format for the SplitIntervals tool. Can you please try running it with 4.0.11.0 and see if the problem persists? And if so, can you please check whether the intervals output by SplitIntervals overlap?

  • mack812 (Spain)

    Hi @bhanuGandham,

    Thanks for your reply. I am using GATK 4.1.0.0, which is a more up-to-date version than 4.0.11.0.

    Regarding the overlapping of intervals, the intervals at the edges never overlap, at least when using this version of GATK (4.1.0.0).

    a) If I run the SplitIntervals tool without the option --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION, the intervals at the edges are always the product of an original interval being split into two sub-intervals. They look like this:

    chr1    203850966   203851332   +   .  ---- Last interval from scattered interval file 0000
    chr1    203851333   203851341   +   .  ---- First interval from scattered interval file 0001
    

    Interval in the original interval list file that was split:

    chr1    203850965       203851341
    

    The distance between edge intervals is always 1 bp when using the tool this way, as shown here.

    b) If run WITH the option --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION, the intervals at the edges are never split in two; they are just contiguous intervals from the original interval list file. They look like this:

    chr1    203850544   203850863   +   .  ---- Last interval from scattered interval file 0000
    chr1    203850966   203851341   +   .  ---- First interval from scattered interval file 0001
    

    In this case the distance between edge intervals varies, being 103 bp in this specific example.

    If I use the intervals produced as shown in example a) for scatter-gather in the base recalibration/ApplyBQSR section of my wdl, I get the "out of order read" error shown in my previous post in ALL shards when running Mutect2 (itself also run in scatter-gather mode).

    If I use the intervals produced as shown in example b) for scatter-gather in the base recalibration/ApplyBQSR section of my wdl, I only get the "out of order read" error in SOME of the shards when running Mutect2. More specifically, the error happens in those shards in which the distance between edge intervals is < 150 bp (the read length of the dataset analyzed), as in example b).

    In all cases, if I search for the read reported in the error (see post above) in the bam file produced after the gather step (i.e. after GatherBamFiles), I always find it duplicated. However, the same read is never duplicated in the bam before the scatter step (the output of MarkDuplicatesSpark).

    This is the way that I manually fix the intervals produced as shown in example b) to prevent the "out of order" error from happening later on during Mutect2:

    chr1    203850966   203851341   +   .  ---- New last interval in scattered interval file 0000
    chr1    203861540   203861915   +   .  ---- New first interval in scattered interval file 0001
    

    As you can see, I just moved the split position down by one interval in the original interval list file. Now the distance between edge intervals is 10199 bp, making it impossible for a 150 bp read to "cross the edge" (i.e. to overlap both intervals at the edges of the split).

    I know it sounds confusing, but this solution really did solve the issue completely, so I was thinking that maybe it could be included in the SplitIntervals tool as an option, as explained in my other thread.

  • bhanuGandham (Cambridge, MA; Moderator)

    Hi @mack812

    I will have the dev team look into this issue. I will get back to you.

  • bhanuGandham (Cambridge, MA; Moderator)

    Hi @mack812

    This is an error in a user-designed workflow: combining the scattering from SplitIntervals with ApplyBQSR produces an ill-formed bam. The Mutect2 workflow itself is not responsible, as far as I can tell. ApplyBQSR should not be scattered. Here is a thread that will help you understand this better: https://gatkforums.broadinstitute.org/gatk/discussion/comment/47995/#Comment_47995

  • mack812 (Spain)

    Hi @bhanuGandham,

    Thanks for your reply.

    Even if I go ahead and do not run ApplyBQSR in scatter-gather mode, as you are suggesting, there are other steps in the wdl scripts from the Best Practices collection that are run in scatter-gather mode. Hence, I think the true question here, the way I should have formulated it from the beginning (I am afraid my previous posts were quite confusing, sorry for that), is the following:

    • How are the reads that overlap both edges of each split handled during the scatter process? I mean generically, in any scatter step, not just the ones involving specific tools like ApplyBQSR.

    For the sake of clarity, here is an illustration of these "edge-crossing" reads, which were the ones that got duplicated in the scatter-gather step and that subsequently caused the error during Mutect2, as explained in my previous posts:

    In more detail, my question is: what does the scatter step do with the reads represented in orange in the figure? I can only think of the following, mutually exclusive, options:

    1.- They will all be processed in shard0 as reads belonging to the first sub-list of intervals (i.e. run with option -L 0000-scattered.interval_list in the command)
    2.- They will all be processed in shard1 as reads belonging to the second sub-list of intervals (-L 0001-scattered.interval_list in the command)
    3.- They will be processed as part of the sub-interval they overlap the most, therefore part of them will be processed in shard0 and the rest of them in shard1.
    4.- They will be ignored and not processed because they overlap both sub-lists of intervals
    5.- They will be processed in both shard0 and shard1, and therefore they will be processed twice, because they overlap both sub-lists of intervals.

    According to the observations I described in my previous posts in this thread, it seems that option 5 is what is happening.
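    If that is right, it should be reproducible directly on the pre-scatter bam with the same -L files, something like the following sketch (file names are illustrative, and it assumes -L keeps any read that overlaps the given intervals):

    # extract the reads each shard would see, using the same interval files
    gatk PrintReads -R Homo_sapiens_assembly38.fasta -I tumor.markdup.bam \
        -L 0000-scattered.interval_list -O shard0.bam
    gatk PrintReads -R Homo_sapiens_assembly38.fasta -I tumor.markdup.bam \
        -L 0001-scattered.interval_list -O shard1.bam

    # if option 5 is what happens, the boundary-spanning alignment record
    # shows up once in EACH shard, hence twice after GatherBamFiles
    samtools view shard0.bam | awk '$1 == "SRR3270880.5831077" && $4 == 248099301' | wc -l
    samtools view shard1.bam | awk '$1 == "SRR3270880.5831077" && $4 == 248099301' | wc -l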

    The figure above illustrates how the edges look when SplitIntervals is run without the option --subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION, with the intervals at the edges only 1 bp away from each other. The SplitIntervals tool is used like this by default (if no "extra args" are specified) in the wdl from the gatk4-somatic-snvs-indels best practice workflow, line 477 of the script. The interval file used as input belongs to a customised commercial exome gene panel (line 7 in the json file). This is pretty much what I am doing, just with a different commercial hybrid-capture exome panel.

    Without knowing exactly what the scatter-gather process does with these "edge-crossing" reads, I will stick to my manual fix of the interval sub-lists produced by SplitIntervals for every scatter-gather step in my wdl scripts. As explained previously, this simple solution prevented the "out of order" error and the read-duplication issue.

  • bhanuGandham (Cambridge, MA; Moderator)

    @mack812

    Here is a document that explains Scatter-Gather Parallelism.

  • mack812 (Spain)

    Hi @bhanuGandham. I have already read that document (several times). I am afraid I probably explained this issue very poorly. As I wrote before, I will continue using my manual fix and check for updates to the SplitIntervals tool in the future. Thanks for your help.
