MarkDuplicatesSpark multiple inputs sort order
I am using GATK 188.8.131.52. Each sample was sequenced on several lanes, so it has multiple pairs of FASTQ files.
The plan is to run BWA on each pair of FASTQ files, specifying -R to set the read group, and pipe the output into samtools view to convert it to BAM format. MarkDuplicatesSpark would then merge the per-lane BAM files, mark duplicates and sort the result. When I run MarkDuplicatesSpark with multiple BAM input files, however, I get an error:
"Multiple inputs to MarkDuplicatesSpark detected but input XXX.bam was sorted in unsorted order"
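For reference, here is a rough sketch of the pipeline described above, written as a small Python script that only prints the command lines rather than running them. The file names, sample name and read group IDs are made up for illustration, and the read group is reduced to ID and SM; the flags used (bwa mem -R, samtools view -b, and repeated -I with -O for MarkDuplicatesSpark) are the standard ones, but please treat this as an approximation of my setup and not the exact commands.

```python
import shlex

def bwa_cmd(ref, fq1, fq2, rg_id, sample):
    # -R sets the read group header line; bwa expects literal "\t" separators.
    read_group = f"@RG\\tID:{rg_id}\\tSM:{sample}"
    return ["bwa", "mem", "-R", read_group, ref, fq1, fq2]

def view_cmd(out_bam):
    # -b writes BAM; the trailing "-" reads SAM from standard input.
    return ["samtools", "view", "-b", "-o", out_bam, "-"]

def markdup_cmd(lane_bams, out_bam):
    # One -I per lane-level BAM, followed by a single -O for the output.
    cmd = ["gatk", "MarkDuplicatesSpark"]
    for bam in lane_bams:
        cmd += ["-I", bam]
    return cmd + ["-O", out_bam]

# Hypothetical sample "S1" sequenced on two lanes.
lanes = ["L001", "L002"]
for lane in lanes:
    align = bwa_cmd("ref.fa", f"S1_{lane}_R1.fastq.gz", f"S1_{lane}_R2.fastq.gz",
                    f"S1.{lane}", "S1")
    print(shlex.join(align), "|", shlex.join(view_cmd(f"S1_{lane}.bam")))

print(shlex.join(markdup_cmd([f"S1_{lane}.bam" for lane in lanes], "S1.markdup.bam")))
```

The BAM files produced this way keep the read order of the FASTQ files, which is why the sort order question comes up.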
The input files are sorted in a well-defined order determined by the query name, but not lexicographically; this is the default output order of bcl2fastq. Reads are ordered by lane and tile, then by Y coordinate, then by X coordinate. So, two questions: is this order sufficiently sorted for MarkDuplicatesSpark, and if so, is there a way to force MarkDuplicatesSpark to accept these files as sorted?
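To make the difference between the two orders concrete, here is a minimal Python sketch. It assumes the usual Illumina read name layout (instrument:run:flowcell:lane:tile:x:y), and the three read names are invented for illustration. It also assumes that "lexicographic" means plain character-by-character string comparison; I have not confirmed that this is exactly the comparison MarkDuplicatesSpark expects.

```python
def illumina_key(name):
    # Assumed layout: instrument:run:flowcell:lane:tile:x:y
    fields = name.split(":")
    lane, tile, x, y = (int(f) for f in fields[3:7])
    # Order described above: lane, tile, then Y, then X.
    return (lane, tile, y, x)

def is_sorted(names, key=None):
    # With no key, names are compared as plain strings (lexicographically).
    keys = [key(n) if key else n for n in names]
    return all(a <= b for a, b in zip(keys, keys[1:]))

# Invented read names, listed in lane/tile/Y/X order.
names = [
    "M01:7:FC1:1:1101:9000:1000",
    "M01:7:FC1:1:1101:10000:1000",
    "M01:7:FC1:1:1101:2000:1500",
]

print(is_sorted(names, illumina_key))  # coordinate order
print(is_sorted(names))                # lexicographic order
```

The first check passes and the second fails, because "9000" compares greater than "10000" as a string even though it is smaller as a number. Each query name still appears at exactly one position in the file, so both reads of a pair stay adjacent.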