Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

MergeBamAlignment – Select primary alignment

micknudsenmicknudsen DenmarkMember ✭✭

Hi,

In the current best practices workflow gatk4-data-processing, you recommend using uBAMs instead of FASTQ files. Great idea! However, when it comes to merging with the BWA alignment BAM, there is something that puzzles me.

Here is an example of a paired-end read mapped by BWA:

XXXXXXXX:412:YYYYYYYYY:1:11101:10001:10497  83  chr16   1229894 0   149M    =   1229833 -210    GGGCCGCGTAGGCGCGGCTCGCCAGGACGGGCAGCGCCAGCAGCAGCAGATTCAGCATCTGGGGAGCAAGGAGGAGCATCGTGGGCCTGGCCGGGCCTCACAGGGCAGGGCTGGGGGCTACAGATTGTGGGGTGAAGAATGGAGCTGAG   AAAAA/E<EEAA</A/<EA<<EEEEEEEE/EEEAAEEAEE/EAEAAEEEEEEEEEEEAEEAAEEAEAEAAEEEEEEEEEEEEAAEEEEAE6EAEEEEEEEE/EEEEEEE/EE/AEAAEEEEEEEEEAAEEEEEEEEEEEEEEEEAAAAA   XA:Z:chr16,+1240848,149M,1;chr16,+1256211,149M,6;   MC:Z:150M   MD:Z:147G1  RG:Z:NS500158.1 NM:i:1  AS:i:147    XS:i:147
XXXXXXXX:412:YYYYYYYYY:1:11101:10001:10497  163 chr16   1229833 0   150M    =   1229894 210 CCAGGCCCTGACCTGTGGAATGTGGTGAGGGGCAGGGTGGACCCCGGCTGGGACTCACCAGGGGCCGCGTAGGCGCGGCTCGCCAGGACGGGCAGCGCCAGCAGCAGCAGATTCAGCATCTGGGGAGCAAGGAGGAGCATCGTGGGCCTG  AAAAAEEEEEEEEEEEEAE6EEEAEEEEEEEEEEEEEEAE/EEEEEEEEEEA/AEAEEEEEEEEEAEAE<EEE6A/EEAAAEEEA/EEAAEEAEEE/AAAAEEEEEEEAE/EEEEEEEEEEAEEEEEEAEEEAEE6EAEEAE<</AAA<6  XA:Z:chr16,-1240908,150M,0; MC:Z:149M   MD:Z:150    RG:Z:NS500158.1 NM:i:0  AS:i:150    XS:i:150

Note that BWA has suggested an alternative alignment given in the XA tag. When using MergeBamAlignment as in the best practices pipeline, the alignment in XA is chosen. I have tried modifying the --PRIMARY_ALIGNMENT_STRATEGY parameter, but is doesn't change anything.

In the old days before uMAPs, you worked directly with FASTQ files and hence used the primary alignment selected by BWA. What is the motivation for changing that?

Answers

  • akovalskakovalsk Member, Broadie, Moderator admin

    Hi @micknudsen, thanks for your question. Would you mind providing the output from MergeBamAligment for the problematic inputs, and the aligner's name and version (bwa mem vs bwa), with the full command line?

  • micknudsenmicknudsen DenmarkMember ✭✭

    Hi @akovalsk. I lost track of where I got the example from, but I have found another run. When I run bwa-mem directly on the FASTQ files, I get the following:

    NS500158:412:HKCWJAFXY:1:21206:18469:3621   1105    chr1    149475938   0   11S139M =   197137052   47660977    CAAAGATTCTGGGTATCAGCCGGCCCTTGGTCCAGTGAGAAGGCAGAGATGAACATTCTAGAAATCAACGAGACATTGCGCCCCCAGCTGGCAGAGAAGAAACAGCAGTTCAGAAACCTCAAAGAGAAATGTTTTCTAACTCAACTGGCC  [email protected]=B>DDED?CDBDEDCCEDDDDCDGDEDDBBDDAFBBCE?EDDDBCFDA<DEDBFBDDD<[email protected]@CCBDD>  XA:Z:chr1,+148595574,139M11S,0;chr1,+146144626,139M11S,1;chr1,+149075686,139M11S,4;chr1,-120805596,11S139M,4;   MC:Z:150M   MD:Z:139    PG:Z:MarkDuplicates RG:Z:NS500158.1 NM:i:0  AS:i:139    XS:i:139
    NS500158:412:HKCWJAFXY:1:21206:18469:3621   1185    chr1    197137052   59  150M    =   149475938   -47660977   GATAATTTTTATAAGGTAGATATTATTTTTAGATATGGAATAATATTGGTGATTTCAATTTTAATAATATTGGATTAAGATGAAAGAATGAGAAGATAAAGGTCCCTCAGCAATATAACTCACAAACATGTTCAGAAGCAGTAAGAAGTC  [email protected]@BD>@DDDD>[email protected]/@@DB+BDFDAF?BDDBCDBDDD>[email protected]@[email protected]/ECBFCD/EC/AF0EC/[email protected]/EDA1,,[email protected]+EDDD.?B/?-?.FAE:DFD/F>CFB+/G10GAD  XA:Z:chr7,+55794215,52M6D98M,11;    MC:Z:11S139M    MD:Z:67C82  PG:Z:MarkDuplicates RG:Z:NS500158.1 NM:i:1  AS:i:145    XS:i:117
    

    but when I use the gatk4-data-processing version with MergeBamAlignment, the following alignment is chosen:

    NS500158:412:HKCWJAFXY:1:21206:18469:3621   65  chr1    148595574   0   139M11S =   197137052   48541479    GGCCAGTTGAGTTAGAAAACATTTCTCTTTGAGGTTTCTGAACTGCTGTTTCTTCTCTGCCAGCTGGGGGCGCAATGTCTCGTTGATTTCTAGAATGTTCATCTCTGCCTTCTCACTGGACCAAGGGCCGGCTGATACCCAGAATCTTTG  >[email protected]@ECDDDDCBDDDDCEDDFCEEADDDDFCDEEFADFADDDDEDDDDFBCDFECFDDDDD<DDDBFBDED<ADFCBDDDE?ECBBFADDBBDDEDGDCDDDDECCDEDBDC?DEDD>[email protected]  MC:Z:150M   MD:Z:139    PG:Z:MarkDuplicates RG:Z:NS500158.1 NM:i:0  MQ:i:59 UQ:i:0  AS:i:139
    NS500158:412:HKCWJAFXY:1:21206:18469:3621   129 chr1    197137052   59  150M    =   148595574   -48541479   GATAATTTTTATAAGGTAGATATTATTTTTAGATATGGAATAATATTGGTGATTTCAATTTTAATAATATTGGATTAAGATGAAAGAATGAGAAGATAAAGGTCCCTCAGCAATATAACTCACAAACATGTTCAGAAGCAGTAAGAAGTC  [email protected]@BD>@DDDD>[email protected]/@@DB+BDFDAF?BDDBCDBDDD>[email protected]@[email protected]/ECBFCD/EC/AF0EC/[email protected]/EDA1,,[email protected]+EDDD.?B/?-?.FAE:DFD/F>CFB+/G10GAD  MC:Z:139M11S    MD:Z:67C82  PG:Z:MarkDuplicates RG:Z:NS500158.1 NM:i:1  MQ:i:0  UQ:i:36 AS:i:145
    

    That is, one of the XA reads is preferred. I now realize that in both this and the previous example, the read has 0 mapping quality, and I have not been able to find examples where this is not the case. Still, I find it a little strange why one would prefer one (ambiguous) mapping over the other.

  • akovalskakovalsk Member, Broadie, Moderator admin

    Thanks for sharing this info, @micknudsen
    MergeBamAlignment uses BestMapqPrimaryAlignmentSelectionStrategy to choose which alignment to use. For paired-end reads, it chooses the alignment that gives the best sum of mapping qualities. In the event that there is a tie, one alignment is chosen randomly (with a fixed seed)

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Hi @akovalsk

    Looks like the best practices workflows still use MostDistant approach for selecting primary alignments.

    Is this still true or has anything changed at best practices since recently?

  • akovalskakovalsk Member, Broadie, Moderator admin

    Hi @SkyWarrior would you mind sharing the context where exactly that screenshot comes from?

  • akovalskakovalsk Member, Broadie, Moderator admin

    Hi @SkyWarrior that repo is indeed out of date, and users should be looking here instead: https://github.com/gatk-workflows/five-dollar-genome-analysis-pipeline/tree/master

    The default alignment strategy for MergeBamAlignment has been BestMapq for some time.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭
    edited October 21

    Hi @akovalsk

    In this git repo I see that

    https://github.com/gatk-workflows/five-dollar-genome-analysis-pipeline/blob/master/tasks/Alignment.wdl

    still uses MostDistant as the strategy?

    Can you clarify that?

    Also is this BestMapq for hg38 based alignments only or does that cover hg19 as well?

    command <<<
        set -o pipefail
        set -e
    
        # set the bash variable needed for the command-line
        bash_ref_fasta=~{reference_fasta.ref_fasta}
        # if reference_fasta.ref_alt has data in it,
        if [ -s ~{reference_fasta.ref_alt} ]; then
          java -Xms1000m -Xmx1000m -jar /usr/gitc/picard.jar \
            SamToFastq \
            INPUT=~{input_bam} \
            FASTQ=/dev/stdout \
            INTERLEAVE=true \
            NON_PF=true | \
          /usr/gitc/~{bwa_commandline} /dev/stdin - 2> >(tee ~{output_bam_basename}.bwa.stderr.log >&2) | \
          java -Dsamjdk.compression_level=~{compression_level} -Xms1000m -Xmx1000m -jar /usr/gitc/picard.jar \
            MergeBamAlignment \
            VALIDATION_STRINGENCY=SILENT \
            EXPECTED_ORIENTATIONS=FR \
            ATTRIBUTES_TO_RETAIN=X0 \
            ATTRIBUTES_TO_REMOVE=NM \
            ATTRIBUTES_TO_REMOVE=MD \
            ALIGNED_BAM=/dev/stdin \
            UNMAPPED_BAM=~{input_bam} \
            OUTPUT=~{output_bam_basename}.bam \
            REFERENCE_SEQUENCE=~{reference_fasta.ref_fasta} \
            PAIRED_RUN=true \
            SORT_ORDER="unsorted" \
            IS_BISULFITE_SEQUENCE=false \
            ALIGNED_READS_ONLY=false \
            CLIP_ADAPTERS=false \
            MAX_RECORDS_IN_RAM=2000000 \
            ADD_MATE_CIGAR=true \
            MAX_INSERTIONS_OR_DELETIONS=-1 \
            PRIMARY_ALIGNMENT_STRATEGY=MostDistant \
            PROGRAM_RECORD_ID="bwamem" \
            PROGRAM_GROUP_VERSION="~{bwa_version}" \
            PROGRAM_GROUP_COMMAND_LINE="~{bwa_commandline}" \
            PROGRAM_GROUP_NAME="bwamem" \
            UNMAPPED_READ_STRATEGY=COPY_TO_TAG \
            ALIGNER_PROPER_PAIR_FLAGS=true \
            UNMAP_CONTAMINANT_READS=true \
            ADD_PG_TAG_TO_READS=false
    
          grep -m1 "read .* ALT contigs" ~{output_bam_basename}.bwa.stderr.log | \
          grep -v "read 0 ALT contigs"
    
        # else reference_fasta.ref_alt is empty or could not be found
        else
          exit 1;
        fi
      >>>
      runtime {
        docker: "us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.4.1-1540490856"
        preemptible: preemptible_tries
        memory: "14 GiB"
        cpu: "16"
        disks: "local-disk " + disk_size + " HDD"
      }
      output {
        File output_bam = "~{output_bam_basename}.bam"
        File bwa_stderr_log = "~{output_bam_basename}.bwa.stderr.log"
      }
    }
    
    task SamSplitter {
      input {
        File input_bam
        Int n_reads
        Int preemptible_tries
        Int compression_level
      }
    
      Float unmapped_bam_size = size(input_bam, "GiB")
      # Since the output bams are less compressed than the input bam we need a disk multiplier that's larger than 2.
      Float disk_multiplier = 2.5
      Int disk_size = ceil(disk_multiplier * unmapped_bam_size + 20)
    
Sign In or Register to comment.