MarkDuplicatesSpark error: Multiple mark duplicate record objects corresponding to read with name

florian_huber (Switzerland, Member)
Hi,
I am working on WES data and trying to follow GATK's Best Practices guidelines. I want to switch from MarkDuplicates + SortSam to MarkDuplicatesSpark.
My problem is that MarkDuplicatesSpark terminates with an error message:

"ERROR TaskSetManager: Task 21 in stage 4.0 failed 1 times; aborting job"

Here is the more detailed error message:

"org.apache.spark.SparkException: Job aborted due to stage failure: Task 21 in stage 4.0 failed 1 times, most recent failure: Lost task 21.0 in stage 4.0 (TID 1705, localhost, executor driver):
org.broadinstitute.hellbender.exceptions.GATKException: Detected multiple mark duplicate records objects corresponding to read with name 'NB551494:142:HC5KTBGXC:1:11309:14684:9603',
this could be the result of the file sort order being incorrect or that a previous tool has let readnames span multiple partitions"

I believe that this indicates that I am doing something wrong in the process upstream but I could not fix that issue.
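
The error message names two possible causes: a wrong sort order, or records with the same read name not sitting next to each other. One rough way to check both is sketched below, assuming samtools and awk are available. The helper names header_sort_order and names_are_grouped are hypothetical (they are not part of GATK, Picard or samtools), and the second helper only tests that each read name forms one contiguous run, not that the names are in any particular order.

```shell
# Sketch only: helper names are hypothetical, not part of any toolkit.

# Print the sort order declared on the @HD header line (SAM header text on stdin).
header_sort_order() {
    awk -F'\t' '
        $1 == "@HD" {
            for (i = 2; i <= NF; i++)
                if ($i ~ /^SO:/) { print substr($i, 4); found = 1 }
        }
        END { if (!found) print "unknown" }'
}

# Succeed only if every read name on stdin (one per line) forms a single
# contiguous run. Keeps all distinct names in memory, so it is slow on large files.
names_are_grouped() {
    awk '
        $0 != prev {
            if ($0 in seen) { print "split read name: " $0; bad = 1; exit }
            seen[$0] = 1
            prev = $0
        }
        END { exit (bad ? 1 : 0) }'
}

# Possible usage on the merged file from this post:
#   samtools view -H "${fq}.merged.bam" | header_sort_order
#   samtools view "${fq}.merged.bam" | cut -f1 | names_are_grouped
```

If the header reports queryname but the second check fails, the file would be labeled as queryname-sorted without actually being grouped by read name.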

Currently, here is the pipeline that I have in place: the first steps are to run bwa-mem on the fastq files and (in parallel) convert the fastq files to the ubam format with Picard FastqToSam:

bwa mem -t 8 -M $bwaIndex ${fq}_R1.fastq.gz ${fq}_R2.fastq.gz \
> ${fq}.sam

java -Xmx8G -jar $PICARD_PATH/picard.jar FastqToSam \
FASTQ=${fq}_R1.fastq.gz \
FASTQ2=${fq}_R2.fastq.gz \
OUTPUT=${fq}.unmapped.bam \
READ_GROUP_NAME=${rgid} \
SAMPLE_NAME=${rgsm} \
LIBRARY_NAME=${rglb} \
PLATFORM_UNIT=${rgpu} \
PLATFORM=${rgpl} \
SEQUENCING_CENTER=${rgcn} \
DESCRIPTION=${rgds}


Then I use MergeBamAlignment to merge the BWA-aligned SAM with the ubam file as follows:

java -Xmx8G -jar $PICARD_PATH/picard.jar MergeBamAlignment \
R=${fasta} \
ALIGNED=${fq}.sam \
UNMAPPED=${fq}.unmapped.bam \
SORT_ORDER=queryname \
O=${fq}.merged.bam

Finally I run MarkDuplicatesSpark:

java -Xmx42g -Djava.io.tmpdir=$tmpDir -jar $GATK_PATH/GenomeAnalysisTK.jar MarkDuplicatesSpark \
-I ${fq}.merged.bam \
-O ${fq}.s.dedup.bam \
-M ${fq}.dup.metrics.out \
--read-validation-stringency LENIENT \
--conf 'spark.executor.cores=8' \
--conf "spark.local.dir=${tmpDir}"

Any help would be highly appreciated. Thank you in advance!

Best regards,

Florian

Best Answers

  • bhanuGandham (Cambridge MA, admin)
    Accepted Answer

    Hi @florian_huber

    1. Can you please try running SortSam after MergeBamAlignment and then running MarkDuplicatesSpark? My suspicion is that MergeBamAlignment isn't actually sorting by queryname, i.e. it is not doing what it is supposed to.
    2. Please validate O=${fq}.merged.bam using ValidateSamFile. This is to test whether the SAM generated by MergeBamAlignment is mislabeled as queryname-sorted.
  • florian_huber (Switzerland)
    Accepted Answer

    Hi @bhanuGandham

    I came to the same conclusion as you. In the end I decided to:

    1. Convert the fastq files to BAM with FastqToSam, specifying SORT_ORDER=queryname explicitly although this is the default value
    2. Use the piped command (SamToFastq + bwa mem + samtools view) for the alignment
    3. Run MergeBamAlignment with the unmapped.bam and the mapped.bam
    4. Run MarkDuplicatesSpark on the merged.bam

    This does not significantly increase the computation time, since the mapped.bam is now sorted by queryname and MergeBamAlignment therefore does not need to pre-sort the aligned SAM by queryname.
    Moreover, I believe that this strategy is more robust (the alignment and the merging derive from the same file), and MarkDuplicatesSpark now runs correctly.

    Thanks for your help
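
Step 2 of the list above could look roughly like the following. This is a sketch, not the exact command used in this thread: the function name align_from_ubam is hypothetical, $PICARD_PATH follows the convention of the original question, and the thread count and -M flag are copied from the original bwa mem call. SamToFastq's INTERLEAVE=true and bwa mem's -p (interleaved paired input) are used so that both mates travel through a single pipe.

```shell
# Sketch only: align_from_ubam is a hypothetical helper; paths are assumptions.
# Reverts a queryname-sorted uBAM to interleaved FASTQ, aligns it, and writes BAM.
# bwa mem should emit records in input order, so reads stay grouped by name.
align_from_ubam() {
    local ubam=$1 bwa_index=$2 out_bam=$3
    java -Xmx8G -jar "$PICARD_PATH/picard.jar" SamToFastq \
        I="$ubam" FASTQ=/dev/stdout INTERLEAVE=true \
    | bwa mem -t 8 -M -p "$bwa_index" /dev/stdin \
    | samtools view -b -o "$out_bam" -
}

# Possible usage with the variable names from the question:
#   align_from_ubam "${fq}.unmapped.bam" "$bwaIndex" "${fq}.mapped.bam"
```

Because the alignment is produced from the same unmapped BAM that is later passed to MergeBamAlignment, both inputs to the merge should share the same read names and the same ordering.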

Answers

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)

    Hi @florian_huber

    I want to run a test to dig deeper into this issue. Can you please share these files with me:

    R=${fasta}
    ALIGNED=${fq}.sam
    UNMAPPED=${fq}.unmapped.bam

    You can share data with us using instructions provided here: https://software.broadinstitute.org/gatk/guide/article?id=1894

  • florian_huber (Switzerland, Member)

    Hi @bhanuGandham,

    Unfortunately, I am not allowed to share these data. I am really sorry about that.

    Is there anything else that I could do to help you with that issue?

    What I can say is that when I ran that script, MergeBamAlignment had to sort the .sam file, and the error occurred afterwards.

    Best,

    Florian

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)

    @florian_huber That is perfectly fine. Thank you for the update!
