MarkDuplicatesSpark multiple inputs sort order
I am using GATK 126.96.36.199. Each sample was sequenced across several lanes, so it has multiple pairs of FASTQ files.
My plan is to run BWA on each pair of FASTQ files, using -R to set the read group, and pipe the output into samtools view to convert it to BAM. Finally, I merge the BAM files, mark duplicates, and sort them using MarkDuplicatesSpark. However, when I run MarkDuplicatesSpark with multiple BAM input files, I get this error:
"Multiple inputs to MarkDuplicatesSpark detected but input XXX.bam was sorted in unsorted order"
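For reference, the workflow that leads to this error can be sketched roughly as below. This is a dry run that only prints the commands. The reference name, sample name, lane identifiers, file names, and read-group values are all hypothetical placeholders, not the actual ones from my run.

```shell
# Dry-run sketch: prints the commands instead of running them.
# REF, SAMPLE, LANES and every file name below are hypothetical placeholders.
REF="reference.fasta"
SAMPLE="sample1"
LANES="L001 L002"

# Print the alignment command for one lane: bwa mem with a read group (-R),
# piped into samtools view to write an unsorted BAM.
align_cmd() {
  a_lane="$1"
  # bwa expects a literal backslash-t between read-group fields.
  rg="@RG\tID:${SAMPLE}.${a_lane}\tSM:${SAMPLE}\tPL:ILLUMINA"
  printf '%s\n' "bwa mem -R '${rg}' ${REF} ${SAMPLE}_${a_lane}_R1.fastq.gz ${SAMPLE}_${a_lane}_R2.fastq.gz | samtools view -b -o ${SAMPLE}_${a_lane}.bam -"
}

# Print the MarkDuplicatesSpark command with one -I argument per lane BAM.
markdup_cmd() {
  inputs=""
  for m_lane in $LANES; do
    inputs="${inputs} -I ${SAMPLE}_${m_lane}.bam"
  done
  printf '%s\n' "gatk MarkDuplicatesSpark${inputs} -O ${SAMPLE}.markdup.bam"
}

# Emit the whole plan: one alignment per lane, then a single duplicate-marking step.
for lane in $LANES; do
  align_cmd "$lane"
done
markdup_cmd
```

The printed lines can be inspected and then run by hand. The last one, with several -I inputs, is the step that produces the error.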
The input files are in a consistent order determined by query name, but it is not a lexicographic order (it is the default output order of bcl2fastq). Reads are ordered by lane and tile, then Y coordinate, then X coordinate. So, two questions: is this ordering sufficient for MarkDuplicatesSpark, and if so, is there a way to force MarkDuplicatesSpark to accept these files as sorted?
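To illustrate the difference between the two orderings, here is a small sketch using two made-up Illumina-style read names. It assumes the name fields are instrument:run:flowcell:lane:tile:X:Y. It uses a plain byte-wise `sort` as a stand-in for "lexicographic"; whether that matches the comparator MarkDuplicatesSpark expects is an assumption on my part.

```shell
# Two hypothetical read names, listed in the order described above:
# same lane and tile, ascending Y (last field), then X (second-to-last field).
names="$(printf '%s\n' \
  'INST:1:FLOWCELL:1:1101:2000:9999' \
  'INST:1:FLOWCELL:1:1101:1000:10000')"

# Succeeds if the names are in lane, tile, Y, X order (numeric comparison per field).
is_lane_tile_yx_sorted() {
  printf '%s\n' "$1" | LC_ALL=C sort -c -t: -k4,4n -k5,5n -k7,7n -k6,6n 2>/dev/null
}

# Succeeds if the names are in plain byte-wise (lexicographic) order.
is_lexicographically_sorted() {
  printf '%s\n' "$1" | LC_ALL=C sort -c 2>/dev/null
}

if is_lane_tile_yx_sorted "$names"; then
  echo "lane/tile/Y/X order: sorted"
else
  echo "lane/tile/Y/X order: NOT sorted"
fi

if is_lexicographically_sorted "$names"; then
  echo "lexicographic order: sorted"
else
  echo "lexicographic order: NOT sorted"
fi
```

The names pass the numeric lane/tile/Y/X check but fail the lexicographic one, because "1000:10000" compares before "2000:9999" as a string even though Y=9999 comes before Y=10000 numerically.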