
MarkDuplicatesSpark multiple inputs sort order

paalmbj (Oslo, Member)

I am using GATK. Each sample has been run on several lanes of sequencing, and thus has multiple pairs of FASTQ files.

The plan is to run BWA on each pair of FASTQ files, specifying -R to set the read group, and to pipe the output into samtools view to convert it to BAM format. Then merge the BAM files, mark duplicates and sort them using MarkDuplicatesSpark. When I run MarkDuplicatesSpark with multiple BAM input files, however, I get an error:
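A minimal sketch of the per-lane alignment step described above, written as a helper that only builds the shell pipeline string. The `mem` subcommand, the file names, and the read group values are assumptions for illustration; the post only states that BWA is run with -R and piped into samtools view.

```python
import shlex


def build_alignment_pipeline(ref, fq1, fq2, read_group, out_bam):
    """Return a shell pipeline: bwa (with -R) piped into samtools view -b.

    Assumption: the 'mem' subcommand is used; the post does not say which.
    """
    bwa = ["bwa", "mem", "-R", read_group, ref, fq1, fq2]
    # '-' tells samtools view to read SAM from standard input.
    view = ["samtools", "view", "-b", "-o", out_bam, "-"]
    return " | ".join(shlex.join(cmd) for cmd in (bwa, view))


# Hypothetical lane-level inputs; one call per FASTQ pair.
pipeline = build_alignment_pipeline(
    "ref.fa",
    "sample1_L001_R1.fastq.gz",
    "sample1_L001_R2.fastq.gz",
    r"@RG\tID:sample1.L001\tSM:sample1",
    "sample1_L001.bam",
)
print(pipeline)
```

Building the string with `shlex.join` keeps the read group, which contains backslashes, quoted as a single shell argument.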

"Multiple inputs to MarkDuplicatesSpark detected but input XXX.bam was sorted in unsorted order"

The input files are in a well-defined order determined by the query name, but that order is not lexicographic. It is the default output order of bcl2fastq: lane and tile, then Y coordinate, then X coordinate. So, two questions: Is this output sufficiently sorted for MarkDuplicatesSpark, and if so, is there a way to force MarkDuplicatesSpark to accept these files as sorted?
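To illustrate the difference between the two orderings, here is a small sketch that compares the order described above with plain string order. It assumes the common Illumina read name layout `instrument:run:flowcell:lane:tile:x:y`; the example read names are made up.

```python
def bcl2fastq_order_key(read_name):
    """Sort key for the order described in the post: lane, tile, then Y, then X.

    Assumption: colon-separated Illumina-style names with lane, tile, x, y
    in fields 4 to 7.
    """
    fields = read_name.split(":")
    lane, tile, x, y = (int(f) for f in fields[3:7])
    return (lane, tile, y, x)


def is_sorted(names, key=None):
    """Return True if names are in non-decreasing order under key."""
    keys = [key(n) if key else n for n in names]
    return all(a <= b for a, b in zip(keys, keys[1:]))


# Hypothetical read names: same lane and tile, increasing Y.
names = [
    "M01:7:FC1:1:1101:9000:1500",
    "M01:7:FC1:1:1101:10000:1600",
]
print(is_sorted(names, bcl2fastq_order_key))  # ordered by lane/tile/Y/X
print(is_sorted(names))                       # plain string comparison
```

The first check passes and the second fails, because "9000" compares greater than "10000" as a string.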


  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)

    Hi @paalmbj

    Please take a look at this Best Practices Document for Data preprocessing: https://software.broadinstitute.org/gatk/best-practices/workflow?id=11165

  • paalmbj (Oslo, Member)

    Hi bhanuGandham!

    Thanks for the link. So the way I read this is:
    Input: The reference implementation uses a uBAM, but it doesn't say that's a requirement for the best practice.

    1. Map to Reference

    "This first processing step is performed per-read group". Consistent with my approach, but no mention of on-line conversion from SAM to BAM. Anyway, this is just an optimisation to reduce I/O.

    2. Mark Duplicates

    "This second processing step is performed per-sample [...]" Also consistent.

    BQSR is outside the scope of my question.

    None of the reference implementations use MarkDuplicatesSpark (yet). The "Prod* germline short variant per-sample calling" does pipe the output of bwa into MergeBamAlignment, which will convert to BAM just like samtools would (among other things). I also note that they pass SORT_ORDER="unsorted" to MergeBamAlignment. It will be interesting to see how all of this is handled once the reference implementations start to use MarkDuplicatesSpark.

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)

    Hi @paalmbj

    I am sorry I didn't quite get what the question here is.

  • paalmbj (Oslo, Member)
    edited June 2019

    Thanks for noting that. I was going to perform some tests, so it took a while to reply.

    The question is how to input multiple files into MarkDuplicatesSpark by specifying "-I" several times. Normally it won't accept them, saying it can't have multiple inputs with unsorted order. This is what I found recently:

    The SAM data are actually query-grouped when output by bwa, as bwa operates on a read-by-read basis, and the query names of the reads are unique. I added a header line to the SAM before the bwa output, to set GO:query. Then MarkDuplicatesSpark will accept multiple inputs, and produce a correct output.
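One way the header step could look, as a sketch rather than the exact method used in the post: a filter that makes sure the SAM stream starts with an `@HD` line carrying `GO:query`. The `VN:1.6` and `SO:unsorted` values and the function name are assumptions; the post only mentions setting GO:query.

```python
def add_query_grouping(sam_lines, version="1.6"):
    """Yield SAM lines whose @HD header line declares GO:query.

    If the stream already starts with @HD, any existing GO field is replaced.
    Otherwise a new @HD line is emitted first (VN and SO values are assumed).
    """
    lines = iter(sam_lines)
    first = next(lines, None)
    if first is not None and first.startswith("@HD"):
        fields = [f for f in first.rstrip("\n").split("\t")
                  if not f.startswith("GO:")]
        yield "\t".join(fields + ["GO:query"]) + "\n"
    else:
        yield f"@HD\tVN:{version}\tSO:unsorted\tGO:query\n"
        if first is not None:
            yield first
    # Pass all remaining header and alignment lines through unchanged.
    yield from lines
```

In a pipeline this would sit between bwa and samtools view, for example by writing `add_query_grouping(sys.stdin)` to `sys.stdout` line by line.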

    However I found that the I/O demands of MarkDuplicatesSpark are too great for our modest HDD-based storage, even if I give it as much RAM as possible in every way I could find. So I will stick with coordinate-sorting the outputs as they come out of bwa, then running good old Picard MarkDuplicates, for now.

    Perhaps MarkDuplicatesSpark is faster and less I/O heavy on data that are actually query-sorted. I plan to do a test with query-sorted (SO:queryname) data, first sorting them with SortSam, and see how much faster it is.

    You can probably consider this one solved, as I have found that MarkDuplicatesSpark doesn't really suit our infrastructure well.
