Bug in HaplotypeCaller

evolvedmicrobeevolvedmicrobe MGHPosts: 16Member

Hi, I wanted to place what I believe is a bug report.

Bug: If the haplotype caller assembles reads into identical haplotypes, then depending on arbitrary factors, it will emit different variant events for this haplotype, meaning that gVCFs from samples with the same genotype will appear to have different genotypes when their VCFs are viewed.

Longer Explanation : Consider the alignment shown below or a reference genome (top) and a reconstructed haplotype aligned to it (aligned as in either the first or second case):

GACCCCGAGGGGGCCCGGGGCCGCGTCCCTGGGCCCTCC
GACCCCGA--GGGCC---------------GGGCCCTCC
GACCCCGA---------GGGCC--------GGGCCCTCC

This alignment is tricky because the "GGGCC" can be positioned in one of two locations, as shown above, and both of these are equivalently optimal with an affine scoring matrix. Which alignment should be returned is thus questionable, but presumably the GATK should always give the same alignment.

However, the GATK does not, and because of this one sample will have a 2D5M15D mutation, while the other can have a 9D5M8D mutation listed, even if both have the exact same genotype. In particular, if the active region or length of the assembled regions change, which can easily happen, then even if every other base is matched, different variants will appear at this locus, based on how the SW alignment is done.

I have tried to track this bug down, and it seems due to how the SW aligner class selects between equally optimal alignments (I believe while constructing the backtrack matrix but have not verified).

Reproducible Example: The higher level problem is that VCFs emitted from identical samples but with slightly different reads/intervals show different variants at this site despite having the same underlying genotypes and assembled haplotypes. Tracking showed this was due to the length of assembled regions varying a bit. A simple example which simply shows the basic flaw in the relevant code is given below.

Does this seem like the type of thing that could be an easy fix? I can also put this code as a branch on gsa_unstable if it would make things easier. I am also assuming that the CombineGVCF step will not fix this error, but have not validated the behavior and for my purposes it would still be problematic.

Warm wishes,
N

//Two examples of an assmebled haplotype and reference genome,
//exactly the same save the window size
byte[] fullRef="GCGTCGCAGTCTTAAGGCCCCGCCTTTTCAGACAGCTTCCGCTGGGCCTGGGCCGCTGCGGGGCGGTCACGGCCCCTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGAGGGGGCCCGGGGCCGCGTCCCTGGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGACCGGGCCGAGCCGGGGGAAGGGCTCCGGTGACT".getBytes();
    byte[] fullHap = "GCGTCGCAGTCTTAAGGCCCCGCCTTTTCAGACAGCTTCCGCTGGGCCTGGGCCGCTGCGGGGCGGTCACGGCCCCTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGA--GGGCC---------------GGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGACCGGGCCGAGCCGGGGGAAGGGCTCCGGTGACT".replace("-","").getBytes();
    byte[] longRef =                                                                            "CTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGAGGGGGCCCGGGGCCGCGTCCCTGGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGA".getBytes();
    byte[] longHap =                                                                           "CTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGA---------GGGCC--------GGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGA".replace("-","").getBytes();
    //a simplified version of the getCigar routine in the haplotype caller
    final String SW_PAD = "NNNNNNNNNN";
    final String paddedRef = SW_PAD + new String(fullRef) + SW_PAD;
    final String paddedPath = SW_PAD + new String(fullHap) + SW_PAD;
    final SmithWaterman alignment = new SWPairwiseAlignment( paddedRef.getBytes(), paddedPath.getBytes(), CigarUtils.NEW_SW_PARAMETERS );
    final SmithWaterman alignment2 = new SWPairwiseAlignment( fullRef, fullHap, CigarUtils.NEW_SW_PARAMETERS );
    //These cigars are not the same, but should be
    Cigar rawPadded = alignment.getCigar();
    Cigar notPadded= alignment2.getCigar();

Best Answer

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,273Administrator, GATK Dev admin

    Hi @evolvedmicrobe‌,

    That's interesting, thanks for bringing it up. I don't expect that running CombineGVCFs would fix this issue. Would you be able to share a snippet of data (bam file) that we could use to debug & test this issue on?

    Geraldine Van der Auwera, PhD

  • evolvedmicrobeevolvedmicrobe MGHPosts: 16Member

    Woops, hit enter too soon, anyway, yes would be more than happy to offer a BAM and an example shell script that can produce two distinct VCFs from the same BAM file. Just let me know where on the server it would be good to drop.

    I also think it might save some time to just see the bug directly though. I just added a branch to the git repository (gsaunstable/Git https://github.com/broadinstitute/gsa-unstable/SWBug), it's the latest version of the code trunk, with one additional file put in that shows the alignment, and clearly demonstrates the problem with a simple main method that produces differing cigars for this indel in question. This should make the problem/example easy to see.

    Cheers,
    N

  • evolvedmicrobeevolvedmicrobe MGHPosts: 16Member

    @Geraldine_VdAuwera‌ An example BAM file and a python script that can produce the bug (BugExample.py) is available at: /humgen/gsa-scr1/pub/incoming/BugExample.tgz

  • evolvedmicrobeevolvedmicrobe MGHPosts: 16Member

    @Geraldine_VdAuwera‌ Woops, hit the tag early again, anyway, have been in touch with Valentin, and this should not be an issue going forward, thanks!

Sign In or Register to comment.