Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

MergeBamAlignment - what are all the exact steps it performs?

Hi, I have a question about the MergeBamAlignment tool
I tried reading through the documentation and through the couple of blog posts that I found on the GATK website, but I still have a couple of things that I could use help clearing out.

Basically, I ran the following tests:
1. Starting from an unmapped BAM file with multiple read groups, I ran the GATK data pre-processing Best Practices WDL
2. Starting from the same uBAM, but with read group information removed using AddOrReplaceReadGroups, I ran the GATK data pre-processing Best Practices WDL
3. Starting from the uBAM without readgroup information, I ran the Data pre-processing pipeline where I removed the MergeBamAlignment step

After this, with the resulting BAM files, I ran the GATK generic germline variant calling Best Practices WDL.

Between the first two cases, for the samples that I was testing with, I found 2.1% differences in variants called.
I understand here that MergeBamAlignments adds the missing readgroup information from the uBAM, which in turn is used during MarkDuplications, BaseRecalibrator and ApplyBQSR steps, which would lead to a different BAM than in the case if I didn't have the read group information (test 2), and so the variant calling would also be different.

But, between test 2 (uBAM with no readgroups and MBA exists) and test 3 (uBAM with no readgroups and MBS does not exist), I also noted differences in variants called - the difference was 0.18%, so albeit small, it still exists.
My understanding is that MergeBamAlignment also performs more actions in additions to just merging readgroups and read-level tag information.
From one post, I understood that MBA turns hardclipped reads (by BWA, usually some chimeric reads) back into softclipped reads.
Does anyone have more info on this? Or on what exactly MBA does?
Should I expect these small differences, or not?

Best Answer

Answers

  • florian_huberflorian_huber SwitzerlandMember
    Hi @john156,

    I believe that you used HaplotypeCaller, which itself is not deterministic if you use the intel acceleration (pairHmm algorithm).
    I don't know if MergeBamAlignment does anything else than adding readgroups information (and would be interseted to know if it does) bt I believe that HaplotypeCaller itself might be the cause of the variations that you see.
    Did you try to run haplotypecaller twice on the same test and see if both runs lead to the exact same results?

    If you want to keep it deterministic and skip the intel acceleration, you can use "-pairHMM LOGLESS_CACHING" as parameter in HaplotypeCaller.

    Best,

    Florian
  • john156john156 Member

    Hi Florian
    thanks for the answer
    I made sure that my HaplotypeCaller is kept deterministic.

    From this post
    https://software.broadinstitute.org/gatk/documentation/article?id=6483 (paragraph 3C)
    I found out what other operations MergeBamAlignment performs (i.e. it's more than just merging read groups).

    Seeing that MBA performs these additional calculations, my final question (and the reason behind my original post) would be:
    If I'm starting from FASTQ files, I have two options:
    1. Convert the FASTQs to uBAM and then run the GATK data pre-processing Best Practices WDL
    2. Edit the WDL to not include the MBA step, and run directly from FASTQs (reducing the runtime a bit)

    Originally I thought that going with the second option would be preferable, keeping in mind the reduced runtime (due to omitting MBA) and no RG information to be merged seeing that I'm starting from FASTQ files.
    But knowing now that MBA does additional computations (like changing hard-clipped reads into soft-clipped), I'm leaning more towards the solution that my pipeline should convert FASTQs into uBAMs and then run the Best Practices WDL, even at the cost of additional runtime.

    If someone from BROAD could confirm my reasoning, that would be great :)

  • florian_huberflorian_huber SwitzerlandMember
    Hi John,

    Thanks for the link. Personnaly, I do the first step of converting fastq to uBAM in parallel with BWA alignment and then use MergeBamAlignment.

    Although this does not fully answer your question, you can have a look at this question: https://gatkforums.broadinstitute.org/gatk/discussion/11694/why-is-converting-from-fastq-to-ubam-nesessary-before-preprocessing

    As mentionned by @SChaluvadi, because MergeBamAlignment keeps the data separated by lanes, BQSR will be able to take into account technical biases for each lane leaing to a better base recalibration. This is likely to affect the downstream process of variant calling.

    Best,

    Florian
  • john156john156 Member
    edited October 8

    Hi Florian
    thanks for the answer once again, and for your link! :)

    It seems that converting the FASTQs to uBAM would be the optimal solution, and doing is parallel as you've said also seems to be the best option.
    One thing - this way you're just creating one uBAM file. From your link, I also got that if the FASTQ files contain multiple read groups, there are scripts to split them by RGs/lanes, in which case you should scatter the BWA by RG, and create multiple uBAM, each per read group.

    I have not run the tests and checked the results, how the analysis would differ if you supplied just one uBAM containing multiple RGs to the Best Practices WDL, or if you'd previously split that uBAM by RG and then supplied multiple uBAM, relying on the scatter feature. Is this just for the speed increase, or does it incur some more computations?
    This is a different question now than my original post, but it seems worth it to continue this discussion now.

    Also, related to my original question - from the link you posted, someone answered that the main reason that you want to use uBAMs is to recover lost information due to hardclipping. From the post:

    BWA hardclips reads if there is a significant discordance between the best matching kmer and the read. These hardclips may end up costing you a particular structural variant or a true indel call. Merging unmapped bam and initial alignment restores the hardclips which I know of no solution for that in BWA parameters.

    Does this hard-clipping happen even if the -Y parameter is turned on in BWA (make supplementary reads soft-clipped instead of hard-clipped?

  • florian_huberflorian_huber SwitzerlandMember
    Hi John,

    You're welcome. Since I did not test for the differences between a run with uBAMs separated by RG and a run with a single uBAM, I can't answer this question.

    Regarding your second question, MergeBamAlignments says:

    "Many alignment tools (still!) require fastq format input. The unmapped bam may contain useful information that will be lost in the conversion to fastq (meta-data like sample alias, library, barcodes, etc., and read-level tags.) This tool takes an unaligned bam with meta-data, and the aligned bam produced by calling SamToFastq and then passing the result to an aligner/mapper. It produces a new SAM file that includes all aligned and unaligned reads and also carries forward additional read attributes from the unmapped BAM (attributes that are otherwise lost in the process of converting to fastq)"

    Therefore you don't only recover lost information due to hard clipping but also meta-data information. However, to be honest, I don't know if hard-clipping happens if the -Y parameter is used in BWA (and I believe that it should not).

    What I can say is that, based on the Broad's code (https://github.com/gatk-workflows/gatk4-data-processing/blob/master/processing-for-variant-discovery-gatk4.wdl), they use both bwa with the Y option and MergeBamAlignment.

    Again, any comment on that by someone from the Broad is welcome!
  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin
  • john156john156 Member

    Hi thanks for the answer
    what Florian and I were wondering was the following
    Since the best practice includes the -Y parameter in BWA (to disallow hard clipping), and one of the main points of the MergeBamAlignment tool is to turn harclipped reads bak into softclipped, what is the point of MergeBamAlignment in this case? Does it do anything else?
    Considering the tests I described above, it seems so, but I'd appreciate a confirmation :)
    I couldn't find this info in the thread you linked above.
    Thanks again for any answers!

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi all,

    I checked with the dev team and this is what they said:
    This tool will combine the aligned and unaligned data as it's name suggests. However, the tool assumes that the unaligned BAM and the aligned BAM contain the same data. It does update hard clips to soft clips, but it can do that because it has the reference and original unaligned reads. Primarily this seems to merge the attributes / bam tags that have been added by our pipelines during processing and alignment into the output file.

  • john156john156 Member

    Hi @bhanuGandham thank you for the answer!
    However, I am still not 100% in the clear about this
    My question was mostly about MergeBamAlignment in the context of the GATK Best Practice's for Data preprocessing.

    So, to rephrase the question:

    If I do not have any attributes in my original unaligned BAM file and if I turn on the -Y parameter in BWA (to disallow hard clipping), is there a purpose for the MergeBamAlignment tool in this case?

    Intuitively, I'd say no, but from the tests that I described in my original point, it seems that MergeBamAlignment performs something else other than merging attributes and turning hard clipped reads back into soft clipped reads.
    But please correct me if I've done something else incorrectly with my tests, I tried to describe them as well as I could in my original post.

    Thank you again for the answers!

    Issue · Github
    by bhanuGandham

    Issue Number
    1421
    State
    open
    Last Updated
    Assignee
    Array
  • manolismanolis Member ✭✭
    edited October 29

    Hi, just a related question, maybe I misunderstanding something in this post ...:

    in the best practice pipeline BWA run with the -Y option.

    -Y Use soft clipping CIGAR operation for supplementary alignments. By default, BWA-MEM uses soft clipping for the primary alignment and hard clipping for supplementary alignments.

    This means that both primary and secondary alignments are soft clipped.

    If I run the second option of @john156

    1. Edit the WDL to not include the MBA step, and run directly from FASTQs (reducing the runtime a bit)

    The second option above means that I have to use directly the original two FASTQs (R1 and R2) as input into BWA step without using nowhere the uBAM files and related tools ...

    I ask this because I read from a post of @john156 this. I hope that it is a general consideration (without the -Y option included)

    BWA hardclips reads if there is a significant discordance between the best matching kmer and the read. These hardclips may end up costing you a particular structural variant or a true indel call. Merging unmapped bam and initial alignment restores the hardclips which I know of no solution for that in BWA parameters.

    In the post that you linked @bhanuGandham, @SkyWarrior explain why is important to merge BAM with uBAM... but I still do not understand if SkyWarrior take account of the -Y option in the BWA step...

    If my considerations are correct and I'm going to work in this way:

    original FASTQs -> BWA -Y -> BAM

    I'm not going to lose possible structural variations or possible real indels... is correct that?

    Many thanks

  • bhanuGandhambhanuGandham Cambridge MAMember, Administrator, Broadie, Moderator admin

    Hi all,

    There seems to be some confusion around this topic. Let me look into this some more and get back to you.

  • john156john156 Member

    Thank you @bhanuGandham, this is satisfactory :smile:

Sign In or Register to comment.