Out of order read after MarkDuplicateSpark + BaseRecalibrator/ApplyBQSR
I am building a workflow for discovery of somatic snvs + indels that is pretty much the Broad's Best Practice but incorporating MarkDuplicatesSpark and a couple of other minor changes. Today I was running a normal-tumor pair of samples from WES experiments in GCP, and everything was going great until the workflow failed during Mutect2. In one of the shards (I am scattering the M2 step through 12 splits of the exome bedfile) I got this error:
13:53:46.994 INFO ProgressMeter - chr19:18926479 20.2 22440 1112.1 13:53:51.138 INFO VectorLoglessPairHMM - Time spent in setup for JNI call : 0.589863008 13:53:51.145 INFO PairHMM - Total compute time in PairHMM computeLogLikelihoods() : 415.78724766500005 13:53:51.147 INFO SmithWatermanAligner - Total compute time in java Smith-Waterman : 82.56 sec 13:53:52.161 INFO Mutect2 - Shutting down engine [February 19, 2019 1:53:52 PM UTC] org.broadinstitute.hellbender.tools.walkers.mutect.Mutect2 done. Elapsed time: 20.68 minutes. Runtime.totalMemory()=1132453888 java.lang.IllegalArgumentException: Attempting to add a read to ActiveRegion out of order w.r.t. other reads: lastRead SRR3270880.37535587 chr19:19227104-19227253 at 19227104 attempting to add SRR3270880.23592400 chr19:19226999-19227148 at 19226999 at org.broadinstitute.hellbender.utils.Utils.validateArg(Utils.java:730) at org.broadinstitute.hellbender.engine.AssemblyRegion.add(AssemblyRegion.java:338) at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.fillNextAssemblyRegionWithReads(AssemblyRegionIterator.java:230) at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.loadNextAssemblyRegion(AssemblyRegionIterator.java:194) at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:135) at org.broadinstitute.hellbender.engine.AssemblyRegionIterator.next(AssemblyRegionIterator.java:34) at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.processReadShard(AssemblyRegionWalker.java:286) at org.broadinstitute.hellbender.engine.AssemblyRegionWalker.traverse(AssemblyRegionWalker.java:267) at org.broadinstitute.hellbender.engine.GATKTool.doWork(GATKTool.java:966) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191) at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210) at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162) at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205) at org.broadinstitute.hellbender.Main.main(Main.java:291) Using GATK jar /gatk/gatk-package-188.8.131.52-local.jar
The other 11 shards finished without errors and produced the expected output.
I checked the bam from the tumor sample and indeed the read mentioned in the error is out of order. It is the second read from the end in the following snippet (pasting here only the first 9 columns from the bam file):
SRR3270880.37535587 163 chr19 19227104 60 150M = 19227395 441 SRR3270880.46694860 147 chr19 19227106 60 150M = 19226772 -484 SRR3270880.60287639 1171 chr19 19227106 60 150M = 19226772 -484 SRR3270880.68448188 83 chr19 19227106 60 150M = 19226611 -645 SRR3270880.70212050 1171 chr19 19227106 60 150M = 19226772 -484 SRR3270880.23592400 163 chr19 19226999 60 150M = 19227232 383 SRR3270880.21876644 1171 chr19 19227001 60 150M = 19226793 -358
The read does not have any bad quality flags and it appears twice in the bam, being in the correct order in its first occurrence (second read in the following snippet):
SRR3270880.61849825 147 chr19 19226995 60 150M = 19226895 -250 SRR3270880.23592400 163 chr19 19226999 60 150M = 19227232 383 SRR3270880.21876644 1171 chr19 19227001 60 150M = 19226793 -358 SRR3270880.47062210 147 chr19 19227001 60 150M = 19226625 -526
The workflow does not include SortSam after MarkDuplicatesSpark as MDSpark's output is supposed to be coordinate sorted. From the bam's header:
@HD VN:1.6 GO:none SO:coordinate
Previous to Mutect2, BaseRecalibrator-GatherBqsrReport-ApplyBQSR-GatherBamFiles (non-Spark versions) finished without any errors. These steps are also scattered through interval splits of the exome bedfile.
Strikingly, the start-end positions of this out of order read span from the last interval of interval split 6 to the first interval of interval split 7. Maybe the read was included in two contiguous splits of the bam file at the same time and that is why it appears twice in the bam file after the merge done by GatherBamFiles. (Last interval from split 6:
chr19 19226311 19227116 ; first interval from interval split 7 :
chr19 19227145 19228774 )
Intervals in my workflow are split by "SplitIntervals" tool (gatk184.108.40.206). I am currently including the argument
--subdivision-mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION and feel that this could have to do with the error...
Any ideas of how this issue can be solved?
Thank you in advance