Original bam file vs -bamout bam file: which one should I rely on?

Adam_U0 Member
edited October 21 in Ask the GATK team

Dear GATK authors and other scientists,

I wonder which bam file is the 'correct' one. Let me explain. I have to select some interesting variants from my vcf files (called by HaplotypeCaller), and then I'm going to confirm them in the wet lab. I'd like to avoid false positives, so I prepared several filtering strategies with strict conditions. First of all, I'd like to see my variant in the bam file (in IGV). However, sometimes the variant of interest is present only in the bam from -bamout, and there is no alteration at that position in the original bam file.
Yes, I know there is a similar question here:
https://gatkforums.broadinstitute.org/gatk/discussion/6129/ad-in-vcf-doesnt-match-bam

And Sheila responded that it's the result of the reassembly done by HaplotypeCaller, which may change the positions of the reads. I understand this: you are using a de Bruijn graph to reconstruct the haplotypes and select the one with the best likelihood.

However, should I take a variant like this into consideration (example below) or treat it as a false positive? What do you think? I have dozens of variants like this one.
Original bam:

Bamout bam:

Same position; the variant is present in the vcf file with a good score. I see the 'variant pattern' in some reads in the bamout one. I think it's pretty suspicious and may be a group of false positives. What do you think?
Actually, I see an 'ArtificialHaplotypeRG' section in the IGV view of the bamout bam file with this group of variants, so should I ignore them?
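
For readers who want to check how many bamout records belong to the 'ArtificialHaplotypeRG' group mentioned in the question, here is a minimal Python sketch. It assumes the bamout has already been converted to SAM text (for example with samtools view -h) and that the artificial haplotype records carry the read-group tag RG:Z:ArtificialHaplotypeRG, as the IGV grouping in the question suggests. The function names and the toy records are hypothetical and only illustrate the idea; they are not part of GATK.

```python
# Sketch only: splits SAM-format alignment lines into real reads and
# artificial haplotype records, based on the RG tag (an assumption taken
# from the read-group name quoted in the question).

ARTIFICIAL_RG = "ArtificialHaplotypeRG"


def read_group(sam_line):
    """Return the RG tag value of one SAM alignment line, or None if absent."""
    fields = sam_line.rstrip("\n").split("\t")
    for tag in fields[11:]:  # optional tags start at the 12th column
        if tag.startswith("RG:Z:"):
            return tag[5:]
    return None


def split_bamout(sam_lines):
    """Return (real_reads, artificial_haplotypes) as two lists of SAM lines."""
    real, artificial = [], []
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        if read_group(line) == ARTIFICIAL_RG:
            artificial.append(line)
        else:
            real.append(line)
    return real, artificial


# Toy records (hypothetical data, not taken from the thread).
example = [
    "@HD\tVN:1.6\tSO:coordinate",
    "hap0\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\t*\tRG:Z:ArtificialHaplotypeRG",
    "read1\t0\tchr1\t102\t60\t8M\t*\t0\t0\tGTACGTAC\tIIIIIIII\tRG:Z:sampleA",
    "read2\t16\tchr1\t103\t60\t7M\t*\t0\t0\tTACGTAC\tIIIIIII\tRG:Z:sampleA",
]
real, artificial = split_bamout(example)
print(len(real), "real reads;", len(artificial), "artificial haplotype records")
```

Separating the two groups this way makes it easier to see whether a variant in the bamout is supported by realigned reads from the sample or only appears in the assembled haplotype records.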

Answers

  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator

    Hi @Adam_U0

    The bamout contains "the assembled haplotypes and locally realigned reads". When comparing the input bam to the bamout bam, there should be less noise in the latter due to the higher quality of the local alignment.
    If you want to get a better understanding of bamout, you may try running through one of the tutorials here, which walks you through the IGV steps of exploring a bamout file. Here is another resource. Note: these docs use GATK3, but this shouldn't stop you from running the IGV steps and getting a good understanding of the concept.

  • Adam_U0 Member

    Dear bhanuGandham,

    thank you very much for your time, response and help! I've read both articles and my conclusion is: HaplotypeCaller is much more sensitive (thanks to local realignment) at detecting variants that are close to each other, so we are able to find more interesting variants that are absent in the BWA MEM .bam (original bam). So, as you can see in the picture that I provided previously, there are several variants: a deletion, an insertion and two SNPs. Are they true positives? What about this huge insertion at the end of several reads? Actually, I doubt the real existence of these variants.
    Furthermore, please take a look.
    Original bams:

    Bamout bams:

    These are my seven samples, colored by read group; that's why each sample has a different color in the original bams (each sample was run on a different flow-cell lane). This is the same position, and in the bamout file many accumulated indels have appeared (red circles). Based on your experience, do you think that they are real variants, not false positives?

    Best regards

  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator

    Hi @Adam_U0

    The real variants and the confidence with which HaplotypeCaller calls them are provided in the vcf file. The purpose of a bamout is to validate those variants and understand the rationale behind HaplotypeCaller's calls. So if you want to visualize your variants in IGV, I suggest comparing your original input bam, output vcf and output bamout. Each serves a different purpose:
    original input bam -> alignment produced by bwa mem
    output vcf -> variants called by HaplotypeCaller, with information on the quality of the variants
    bamout -> to understand why HaplotypeCaller called a particular variant

    I hope this helps. For more information please refer to our documentation.
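
The comparison suggested in the last answer (original bam vs. bamout at one variant position) can also be done outside IGV. The following is a rough Python sketch, not a GATK tool: it assumes both files have been dumped to SAM text, walks each read's CIGAR string to find what the read shows at a given 1-based reference position, and tallies the result. Records tagged RG:Z:ArtificialHaplotypeRG are skipped on the assumption that they are assembled haplotypes rather than sequenced reads. All function names and toy records are hypothetical.

```python
import re
from collections import Counter

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def base_at(pos, start, cigar, seq):
    """Return the read base aligned to 1-based reference position `pos`,
    '-' if `pos` falls inside a deletion or skipped region of the read,
    or None if the read does not cover `pos`."""
    ref, idx = start, 0  # current reference position and read offset
    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op in "M=X":  # consumes both reference and read
            if ref <= pos < ref + n:
                return seq[idx + (pos - ref)]
            ref += n
            idx += n
        elif op in "DN":  # consumes reference only
            if ref <= pos < ref + n:
                return "-"
            ref += n
        elif op in "IS":  # consumes read only
            idx += n
        # H and P consume neither reference nor read bases
    return None


def allele_counts(sam_lines, chrom, pos, skip_rg="ArtificialHaplotypeRG"):
    """Count what each read shows at chrom:pos, ignoring records in `skip_rg`."""
    counts = Counter()
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        f = line.rstrip("\n").split("\t")
        if f[2] != chrom or f[5] == "*":
            continue  # other contig, or no alignment
        if "RG:Z:" + skip_rg in f[11:]:
            continue  # assumed artificial haplotype record
        base = base_at(pos, int(f[3]), f[5], f[9])
        if base is not None:
            counts[base] += 1
    return counts


# Toy data (hypothetical): read2 is soft-clipped in the original alignment
# and shows a 2-bp deletion after realignment in the bamout.
original = [
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tRG:Z:s1",
    "read2\t0\tchr1\t100\t60\t6M4S\t*\t0\t0\tACGTACGGGG\tIIIIIIIIII\tRG:Z:s1",
]
bamout = [
    "hap1\t0\tchr1\t100\t60\t4M2D6M\t*\t0\t0\tACGTACGGGG\t*\tRG:Z:ArtificialHaplotypeRG",
    "read1\t0\tchr1\t100\t60\t10M\t*\t0\t0\tACGTACGTAC\tIIIIIIIIII\tRG:Z:s1",
    "read2\t0\tchr1\t100\t60\t4M2D6M\t*\t0\t0\tACGTACGGGG\tIIIIIIIIII\tRG:Z:s1",
]
print("original:", dict(allele_counts(original, "chr1", 104)))
print("bamout:  ", dict(allele_counts(bamout, "chr1", 104)))
```

A position where the two tallies differ is one where local realignment changed the evidence, which is the situation described in the question; whether such a call is real still has to be judged from the vcf annotations and, ultimately, wet-lab confirmation.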
