Bamout file shows a consistent deletion that is not reported in VCF

Dear GATK team

I ran HaplotypeCaller on a bam file, which is the alignment of a single bacteria sample to its reference genome. To understand the calling process I wanted to compare the resulting VCF and bamout files, so I selected a small region.

java -jar -Xmx8g $BASE/GenomeAnalysisTK.jar -T HaplotypeCaller -R AE014075.fasta -I dedup_3_S3.bam -ploidy 1 -stand_call_conf 30 -stand_emit_conf 10 -o S3_part.vcf -bamout newbamout389088.bam -L AE014075.1:389000-389190

Then I opened three tracks in IGV: the VCF file (S3_part.vcf) on top, then the original bam file (dedup_3_S3.bam), and lastly the bamout file (newbamout389088.bam). As you can see, the original bam file thinks there is a deletion at position 389087 and a SNP at 389091, supported by 861 out of ~1000 reads (MAPQ > 30). The bamfile, after re-alignment and re-assembly, thinks there is an insertion at 389088 and a 3-nt deletion at 389090, supported by almost 100% of the reads.

It is important to note that in the bamout track, all artificial haplotypes (pink reads) show the same insertion and deletions. Actually if you scroll down you will see that all reads from the original bam file (blue) contain exactly the same variants too, although not all of them are informative.

Now if you go to the top VCF track, the insertion at 389088 is called while the deletion at 389090 is not. This is confusing because, didn't the bamout file just say all assembled haploptypes contain a deletion at 389090? And almost all the re-aligned reads from the original bam file also contain this variant?

Also maybe a silly question: since the genome's ploidy is 1 and there's only 1 sample here, shouldn't there be only 1 haplotype by definition? Why bamout has ~10 haplotypes (when by default bamWriterType is CALLED_HAPLOTYPES)?

Thanks a lot!

Comments

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @zhihuali
    Hi,

    Can you please run HaplotypeCaller on that region in GVCF mode? I would like to see the GVCF record for the sites with the deletion. Please post the GVCF records for the positions with the insertion and deletion. Can you also post the same IGV screenshot, but zoomed out (including ~1000 bases before and after the sites of interest)?

    Theoretically, for a ploidy 1 sample, you would have only 1 haplotype, but that may not happen in messy regions. The artificial haplotypes contain all possible combinations of variants in the active region. When there are many variants in a region, the number of artificial haplotypes increases.

    -Sheila

  • zhihualizhihuali USAMember

    Thanks Sheila.

    I did what you suggested. Now it seems to come together: in the first screen shot with zoomed-in view (11 bp as before), I saw that the 3-nt deleton shown in bamout has quality -10 as shown in the gVCF file. Probably that's why it's not reported. Although I'm not sure why the deletions here have such low quality if all artificial haplotypes and almost all re-aligned reads contain this deletion?

    Also now I was wondering as a user how to determine the variants in this small region after manual inspection. The original bam says there's a deletion at 389087 (863 vs 263 reads in terms of ALT vs REF) and a SNP at 389091 (857 vs 5 reads). The bamout says there's an insertion at 389088 with multiple possible ALTs (inserted nts), and a 3-nt deletion at 389090 but with low quality. I kinda think the original bam's idea about this region makes more sense intuitively, although I know sometimes my eyes can be cheating... ...

    Issue · Github
    by Sheila

    Issue Number
    1884
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @zhihuali
    Hi,

    Wow, that region is super messy. I am checking with the team on a few things and will get back to you.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hmm, the messiness of this region suggests something is going on. You could try BLATting the sequence to see if it shows up somewhere else; it could be the reads mapped here actually belong elsewhere. In bacteria, repeated sequences caused a lot of difficulty for the aligner. Or there could be a larger structural event at play. Unfortunately cases like these are going to be difficult to diagnose and it may not be possible to get clarity.

Sign In or Register to comment.