Insertion Homozygous but heterozygous in BAM files

jpdesvignesjpdesvignes Marseille, FranceMember

Hello,

I have some troubles with insertion variation. In my VCF file it is flagged as homozygous (1/1) but on my BAM file it appears heterozygous.
I generated BAM output file from HaplotypeCaller but it is the same thing.
I join the IGV snapshot. The top track is bam before HC et the bottom track is bam after HC.
Why if the depth is 34 and the insertion is counted 23 this variant is homozygous ?
For information I use GATK 3.5 and follow the guidelines.

The line in the VCF :
17 8245561 rs145968386 T TTTTA 1098.73 . AC=2;AF=1.00;AN=2;DB;DP=34;ExcessHet=3.0103;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.95;MQ0=0;QD=28.15;SOR=1.708 GT:AD:DP:GQ:PL 1/1:0,23:23:78:1136,78,0

Thank you !

Jean-Pierre

Comments

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @jpdesvignes
    Hi Jean-Pierre,

    It looks like some of the reads are filtered out. Can you check the base qualities and mapping qualities of the reads that do not contain the insertion?

    Also, in the bamout file, some of the reads are the ArtificialHaplotypes created by HaplotypeCaller. Can you shade the reads by sample and repost the screenshot?

    Thanks,
    Sheila

  • jpdesvignesjpdesvignes Marseille, FranceMember

    Hi Sheila,

    thanks for you answer.
    I join the IGV snapshot with color sample.
    Indeed there are 4 reads from HaplotypeCaller (ArtificialHaplotypes), 2 of which contains insertion.
    For the sample I always see reads without insertion.

    In IGV I remove all filtered reads.
    For score quality, the mapping (~60) score and base quality score (~30) are equivalent for reads with or without insertion.

    Thanks

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @jpdesvignes
    Hi Jean-Pierre,

    I'm not sure why the reads are being filtered out. Can you submit a bug report so I can test out your files locally? Instructions are here.

    Thanks,
    Sheila

  • jpdesvignesjpdesvignes Marseille, FranceMember

    Dear Sheila,

    I put on your ftp data for testing. (folder jpdesvignes_for_sheila)
    It is bam extract from whole genome (start of chr17).
    There are bam, bai and vcf.

    Thank you !

    JP

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @jpdesvignes
    Hi JP,

    It seems you have uploaded the bamout file. I need to original bam file. Can you please upload that as well?

    Thanks,
    Sheila

  • jpdesvignesjpdesvignes Marseille, FranceMember

    Hi Sheila,

    I uploaded bam file for region chr17:8200000-8300000 in the same folder.

    thank you again !

    JP

  • robert123robert123 GermanyMember
    edited July 2017

    Hello, it's been more than a year: has this issue been tracked down to its cause?
    I have the same problem: my insertion is covered by 68 reads in the original bam file. I have generated the bamout file to analyse the problem. In the bamout file there are 60 reads covering the position. Out of these 60 reads, 38 have the insertion and 22 do not. The call that I get is:
    1/1:0,37
    I have no idea why GATK throws away all 22 reads that do not have the insertion and keeps 37 of the 38 reads that have it.
    I see no difference in MAPQ and base quality that would explain this behaviour. I can upload data if needed, but I think it's the same issue as with jpdesvignes' data.

    EDIT:
    Maybe I should add the quality scores.

    For the 38 reads that have the insertion
    MAPQ:
    60,60,70,70,60,60,70,60,60,60,70,60,70,60,29,60,29,70,60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,29,29,60,29,70
    Base Qualities (extracted from the bamout via mpileup):
    ABA5BBBBABE5EED8EEEBBB5BABEAADDBD:AC=A

    For the 22 reads that do not have the insertion
    MAPQ:
    60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,60,60
    Base Qualities (extracted from the bamout via mpileup):
    BA?BA5AB>BDDC;E<>EEDBC

  • shleeshlee CambridgeMember, Broadie, Moderator admin

    Hi @robert123,

    It would be helpful to see an IGV screenshot of your locus. There are instructions in the HaplotypeCaller tool doc, under the BAMOUT parameter description, on how to color your BAMOUT. Remember that some of the seeming alignments in the BAMOUT are artificial haplotypes.

    The reads are written out containing an "HC" tag (integer) that encodes which haplotype each read best matches according to the haplotype caller's likelihood calculation. The use of this tag is primarily intended to allow good coloring of reads in IGV. Simply go to "Color Alignments By > Tag" and enter "HC" to more easily see which reads go with these haplotype. You can also tell IGV to group reads by sample, which will separate the potential haplotypes from the reads. These features are illustrated in this screenshot. Note that only reads that are actually informative about the haplotypes are emitted with the HC tag. By informative we mean that there's a meaningful difference in the likelihood of the read coming from one haplotype compared to the next best haplotype. When coloring reads by HC tag in IGV, uninformative reads will remain grey. Note also that not every input read is emitted to the bam in this mode. To include all trimmed, downsampled, filtered and uninformative reads, add the --emitDroppedReads argument. If multiple BAMs are passed as input to the tool (as is common for MuTect2), then they will be combined in the -bamout output and tagged with the appropriate sample names.

    See if following the above directions helps you sort out why some reads are dropped.

Sign In or Register to comment.