Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

MergeBamAlignment warning in the best practice pipeline

minimaxminimax Member
edited May 13 in Ask the GATK team

I am running GATK best practice pipeline gatk4-data-processing locally (on my University's linux server) without using docker (so I also modified the codes).

When I first run the pipeline for the first time, it failed due to the warning from MergeBamAlignment that Underlying iterator is not queryname sorted. Then I added SortSam to first sort the ubam to queryname, then merge. This time it succeed, however, I actually also have the same warning (from the stderr.background file; although the rc is 0):

WARNING SamAlignmentMerger Exception merging bam alignment - attempting to sort aligned reads and try again: Underlying iterator is not queryname sorted: HH7JWCCXY180531:8:2122:23997:28048 2/2 151b aligned to 1:10016-10131. > HH7JWCCXY180531:6:1103:9820:7568 1/2 151b aligned to 4:191044144-191044249.

So, should I ignore this warning or I need to address it before I can proceed? Thank you very much!
For your reference, below are the codes that I used (modified base on the best practice pipeline codes):

######### work flow definition

workflow test_pre_data {

  # declare workflow-level variables:
  File flowcell_unmapped_bams_list
  String unmapped_bam_suffix
  Array[File] flowcell_unmapped_bams = read_lines(flowcell_unmapped_bams_list)

  File ref_fasta
  File ref_fasta_index
  File ref_dict

  String? bwa_commandline_override
  String bwa_commandline = select_first([bwa_commandline_override, "bwa mem -K 100000000 -p -v 3 -t 16 -Y $bash_ref_fasta"]) 
  Int compression_level

  String bwa_path
  String picard_path
  String gatk_path
  String samtools_path


  # Get the version of BWA to include in the PG record in the header of the BAM produced 
  # by MergeBamAlignment. 
  call GetBwaVersion{
    input:
      bwa_path = bwa_path
  }

  # Align flowcell-level unmapped input bams in parallel
  scatter (unmapped_bam in flowcell_unmapped_bams) {

    # Get the basename, i.e. strip the filepath and the extension
    String bam_basename = basename(unmapped_bam, unmapped_bam_suffix)

    # sort unmapped BAM reads b/c "MergeBamAlignment" requires input bams in queryname order
    call SortuBAM {
      input:
        input_bam = unmapped_bam,
        gatk_path = gatk_path,
        compression_level = compression_level,
        output_bam_basename = bam_basename + ".sorted"
    }

    # Map reads in unmapped bam to reference
    call SamToFastqAndBwaMem {
      input:
        input_bam = unmapped_bam,
        bwa_commandline = bwa_commandline,
        output_bam_basename = bam_basename + ".unmerged",
        ref_fasta = ref_fasta,
        ref_fasta_index = ref_fasta_index,
        ref_dict = ref_dict,
        bwa_path = bwa_path,
        gatk_path = gatk_path,
        samtools_path = samtools_path,
        compression_level = compression_level
    }

    # Merge original uBAM (queryname sorted) and BWA-aligned BAM 
    call MergeBamAlignment {
      input:
        unmapped_bam = SortuBAM.output_bam,
        bwa_commandline = bwa_commandline,
        bwa_version = GetBwaVersion.version,
        aligned_bam = SamToFastqAndBwaMem.output_bam,
        output_bam_basename = bam_basename + ".aligned.unsorted",
        ref_fasta = ref_fasta,
        ref_fasta_index = ref_fasta_index,
        ref_dict = ref_dict,
        gatk_path = gatk_path,
        compression_level = compression_level
    }
  }
}


######### TASK DEFINITIONS

# Get version of BWA <-- success
task GetBwaVersion {

  String bwa_path

  command {
    ${bwa_path}/bwa 2>&1 | \
    grep -e '^Version' | \
    sed 's/Version: //'
  }
  output {
    String version = read_string(stdout())    
  }
}
# test: File outfile = stdout()

# sort unmapped BAM reads to queryname order
task SortuBAM{
  File input_bam

  Int compression_level

  String output_bam_basename
  String gatk_path
  String java_opt

  command {
    ${gatk_path}/gatk  --java-options "-Dsamjdk.compression_level=${compression_level} ${java_opt}" \
    SortSam \
    --INPUT=${input_bam} \
    --OUTPUT=${output_bam_basename}.bam \
    --SORT_ORDER=queryname \
    --COMPRESSION_LEVEL=${compression_level} \
    --VALIDATION_STRINGENCY SILENT 
  }
  output {
    File output_bam = "${output_bam_basename}.bam"
  }
}

# Read unmapped BAM, convert on-the-fly to FASTQ and stream to BWA MEM for alignment
task SamToFastqAndBwaMem {
  File input_bam
  String bwa_commandline
  String output_bam_basename
  File ref_fasta
  File ref_fasta_index
  File ref_dict

  # This is the .alt file from bwa-kit (https://github.com/lh3/bwa/tree/master/bwakit), 
  # listing the reference contigs that are "alternative". Leave blank in JSON for legacy 
  # references such as b37 and hg19.
  File? ref_alt
  File ref_amb
  File ref_ann
  File ref_bwt
  File ref_pac
  File ref_sa

  Int compression_level

  String bwa_path
  String gatk_path
  String samtools_path
  String java_opt

  # can also use gatk to perform SamToFastq
  command <<<
    set -o pipefail
    set -e

    bash_ref_fasta=${ref_fasta}

    ${gatk_path}/gatk  --java-options "-Dsamjdk.compression_level=${compression_level} ${java_opt}" \
      SamToFastq \
            -INPUT=${input_bam} \
            -FASTQ=/dev/stdout \
            -INTERLEAVE=true \
            -NON_PF=true \
    | \
        ${bwa_path}/${bwa_commandline} /dev/stdin -  2> >(tee ${output_bam_basename}.bwa.stderr.log >&2) \
    | \
        samtools view -1 - > ${output_bam_basename}.bam

  >>>
  output {
    File output_bam = "${output_bam_basename}.bam"
    File bwa_stderr_log = "${output_bam_basename}.bwa.stderr.log"
  }
}

# Merge original input uBAM file with BWA-aligned BAM file
# note, the input unmapped BAM must be in queryname order (gatk SortSam)!
task MergeBamAlignment {
  File unmapped_bam
  String bwa_commandline
  String bwa_version
  File aligned_bam
  String output_bam_basename
  File ref_fasta
  File ref_fasta_index
  File ref_dict

  Int compression_level

  String gatk_path
  String java_opt

  command {
    bash_ref_fasta=${ref_fasta}
    ${gatk_path}/gatk --java-options "-Dsamjdk.compression_level=${compression_level} ${java_opt}" \
      MergeBamAlignment \
      --VALIDATION_STRINGENCY SILENT \
      --EXPECTED_ORIENTATIONS FR \
      --ATTRIBUTES_TO_RETAIN X0 \
      --ALIGNED_BAM ${aligned_bam} \
      --UNMAPPED_BAM ${unmapped_bam} \
      --OUTPUT ${output_bam_basename}.bam \
      --REFERENCE_SEQUENCE ${ref_fasta} \
      --PAIRED_RUN true \
      --SORT_ORDER "unsorted" \
      --IS_BISULFITE_SEQUENCE false \
      --ALIGNED_READS_ONLY false \
      --CLIP_ADAPTERS false \
      --MAX_RECORDS_IN_RAM 2000000 \
      --ADD_MATE_CIGAR true \
      --MAX_INSERTIONS_OR_DELETIONS -1 \
      --PRIMARY_ALIGNMENT_STRATEGY MostDistant \
      --PROGRAM_RECORD_ID "bwamem" \
      --PROGRAM_GROUP_VERSION "${bwa_version}" \
      --PROGRAM_GROUP_COMMAND_LINE "${bwa_commandline}" \
      --PROGRAM_GROUP_NAME "bwamem" \
      --UNMAPPED_READ_STRATEGY COPY_TO_TAG \
      --ALIGNER_PROPER_PAIR_FLAGS true \
      --UNMAP_CONTAMINANT_READS true
  }
  output {
    File output_bam = "${output_bam_basename}.bam"
  }
}

Best Answer

Answers

  • minimaxminimax Member

    @AdelaideR said:
    Hi @minimax

    If the read count is 0 for this error, then I think it would be safe to ignore this warning.

    It is definitely important to sort the sam file first, so that was the correct action to take.

    Thanks for your reply, @AdelaideR !

    I modified the codes again to SortSam both the ubam and the aligned bam inside the SamToFastqAndBwaMem task (i.e., replace the samtools in the official pipeline to SortSam commands), then the warning is gone.

    The MergeBamAlignment actually requires the input ubam to be in queryname order. Don't know why the official pipeline does not include SortSam to make sure of this.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi @minimax - I do see it in the WDL as part of the regular processing steps:


    ${gatk_path} --java-options "-Dsamjdk.compression_level=${compression_level} ${java_opt_sort}" \ SortSam \ --INPUT ${input_bam} \ --OUTPUT /dev/stdout \ --SORT_ORDER "coordinate" \ --CREATE_INDEX false \ --CREATE_MD5_FILE false \

    Is your question whether this should be required in the files that are entered into the WDL that contains this code? It seems that doing it twice might be redundant.

  • minimaxminimax Member

    That's for a different purpose and the --SORT_ORDER is coordinate whereas the MergeBamAlignment requires queryname ordered input.

  • AdelaideRAdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Okay, I guess I am not sure why it is not required, then,

Sign In or Register to comment.