We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

BAM file generated by GATK HaplotypeCaller shows duplicates, not fixed by Picard FixMateInformation

elcorteganoelcortegano University of EdinburghMember
I have some BAM files that are being processed with GATK to call variants, following a schedule like the one below:

BAM -> HaplotypeCaller + CombineGVCFs + GenotypeGVCFs -> VCF

However, when I examine the variants found (VCF) in IGV software I see some inconsistencies with these BAM files (eg. variants that do not match the position). I understand that this is likely being originated by HaplotypeCaller doing some local realignments.

Because of that, I'm running again HaplotypeCaller with -bamout --force-active --disable-optimizations options in order to generate a BAM file (BAM*) that accounts for these local realignments.

HaplotypeCaller () -> BAM*

However, when I examine these BAM* files in IGV, the number of reads has increased, many of them looking like duplicates. This is weird however, because the original BAM file was processed with Picard AddOrReplaceReadGroups as well as by MarkDuplicates, resulting in no errors by ValdiateSamFile.

The BAM* generated by HaplotypeCaller shows some mate read errors when ValidateSamFile is used. When Picard FixMateInforation is used however, the resulting file is of size 0 (it's empty). Any ideas on why is this happening and how to solve it?

Version numbers are 2.21.1 for Picard and for GATK


  • bshifawbshifaw Member, Broadie, Moderator admin

    Hi @elcortegano,

    Please upload a screenshot of the IGV results with the VCF, BAM, and BAM* and the Haplotypecaller command used to generate the results.

  • elcorteganoelcortegano University of EdinburghMember
    The files are the following:

    - Insertion VCF shows a section in the VCF file. The first and last line correspond to two insertions.
    - Insertion BAM is a IGV visualization including that region. It corresponds to the BAM file used to call variants with HaplotypeCaller.
    - The two remaining files are the IGV visualization of the same region, using the BAM file generated by HaplotypeCaller.

    First of all, it can be seen that the BAM files generated by GATK have much more reads, which I do not understand why. Second, the positions do not match exactly between BAM files.

    Variants are generated as follows, starting from BAM files:

    gatk HaplotypeCaller -I file.bam -O file.g.vcf -R reference.fq.gz -ploidy 1 -ERC BP_RESOLUTION -stand-call-conf 10.0 -bamout file_gatk.bam --force-active --disable-optimizations

    The above will generate what I first called the BAM* file. Because I'm doing this for different samples (27 as it can be seen in the VCF snapshot), the final VCF file is actually generated with CombineGVCFs and GenotypeGVCFs:

    gatk CombineGVCFs -R reference.fq.gz -O combined.g.vcf --variant file1.g.vcf --variant file2.g.vcf ...

    gatk GenotypeGVCFs -R reference.fa.qz /V combined.g.vcf -O variants.vcf -ploidy 1 -all-sites
  • bshifawbshifaw Member, Broadie, Moderator admin

    Thanks @elcortegano ,

    I was speaking with the dev team about this and the asked if you could share a snippet of the BAM, BAM* and VCF so that we could look into the issue a bit further. They mentioned Haplotypecaller inserts artificial haplotypes that may be confused as duplicate reads, it's possible to distinguish these reads by separating the reads in IGV by Readgroups.

  • elcorteganoelcortegano University of EdinburghMember
    edited November 2019
    I have prepared a small dataset with small BAM (<20KB) and VCF (<50KB) files, but it seems these cannot be attached here in the forum, or at least I cannot drop them as I did for the images above. Do you have an email or another way I can send you these files?

    Thank you very much, and sorry for the inconveniences
  • bshifawbshifaw Member, Broadie, Moderator admin

    @elcortegano ,
    I've verified your account which should allow you to post to the forum, mind trying again. If that doesn't work you can upload files to our ftp site mentioned here

  • elcorteganoelcortegano University of EdinburghMember
    The error I was getting is "File format is not allowed". I can submit them by changing just the extnsion to .txt.

    The files attached are the following:

    - C0001.bam (submitted as C0001.txt) corresponds to a BAM file containing one of the scaffolds in our original BAM data file (taken using samtools view).
    - C0001_gatk.bam (C0001_gatk.txt) contains data for the same scaffold, but taken from the BAM file generated by GATK.
    - C0001.vcf (C0001_vcf.txt) is the VCF file filtered for that scaffold (using tabix).

    It can be seen for example, that in position C0001:1997644 C0001.bam contains 16 reads, but that number increases to 43 for C0001_gatk.bam.
Sign In or Register to comment.