The current GATK version is 3.8-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Got a problem?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks (  ) each to make a code block as demonstrated here.

GATK version 4.beta.3 (i.e. the third beta release) is out. See the GATK4 beta page for download and details.

# Bug in HaplotypeCaller

MGHMember

Hi, I wanted to place what I believe is a bug report.

Bug: If the haplotype caller assembles reads into identical haplotypes, then depending on arbitrary factors, it will emit different variant events for this haplotype, meaning that gVCFs from samples with the same genotype will appear to have different genotypes when their VCFs are viewed.

Longer Explanation : Consider the alignment shown below or a reference genome (top) and a reconstructed haplotype aligned to it (aligned as in either the first or second case):

GACCCCGAGGGGGCCCGGGGCCGCGTCCCTGGGCCCTCC
GACCCCGA--GGGCC---------------GGGCCCTCC
GACCCCGA---------GGGCC--------GGGCCCTCC


This alignment is tricky because the "GGGCC" can be positioned in one of two locations, as shown above, and both of these are equivalently optimal with an affine scoring matrix. Which alignment should be returned is thus questionable, but presumably the GATK should always give the same alignment.

However, the GATK does not, and because of this one sample will have a 2D5M15D mutation, while the other can have a 9D5M8D mutation listed, even if both have the exact same genotype. In particular, if the active region or length of the assembled regions change, which can easily happen, then even if every other base is matched, different variants will appear at this locus, based on how the SW alignment is done.

I have tried to track this bug down, and it seems due to how the SW aligner class selects between equally optimal alignments (I believe while constructing the backtrack matrix but have not verified).

Reproducible Example: The higher level problem is that VCFs emitted from identical samples but with slightly different reads/intervals show different variants at this site despite having the same underlying genotypes and assembled haplotypes. Tracking showed this was due to the length of assembled regions varying a bit. A simple example which simply shows the basic flaw in the relevant code is given below.

Does this seem like the type of thing that could be an easy fix? I can also put this code as a branch on gsa_unstable if it would make things easier. I am also assuming that the CombineGVCF step will not fix this error, but have not validated the behavior and for my purposes it would still be problematic.

Warm wishes,
N

//Two examples of an assmebled haplotype and reference genome,
//exactly the same save the window size
byte[] fullRef="GCGTCGCAGTCTTAAGGCCCCGCCTTTTCAGACAGCTTCCGCTGGGCCTGGGCCGCTGCGGGGCGGTCACGGCCCCTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGAGGGGGCCCGGGGCCGCGTCCCTGGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGACCGGGCCGAGCCGGGGGAAGGGCTCCGGTGACT".getBytes();
byte[] fullHap = "GCGTCGCAGTCTTAAGGCCCCGCCTTTTCAGACAGCTTCCGCTGGGCCTGGGCCGCTGCGGGGCGGTCACGGCCCCTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGA--GGGCC---------------GGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGACCGGGCCGAGCCGGGGGAAGGGCTCCGGTGACT".replace("-","").getBytes();
byte[] longRef =                                                                            "CTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGAGGGGGCCCGGGGCCGCGTCCCTGGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGA".getBytes();
byte[] longHap =                                                                           "CTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGA---------GGGCC--------GGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGA".replace("-","").getBytes();
//a simplified version of the getCigar routine in the haplotype caller
final SmithWaterman alignment2 = new SWPairwiseAlignment( fullRef, fullHap, CigarUtils.NEW_SW_PARAMETERS );
//These cigars are not the same, but should be
`
Tagged:

That's interesting, thanks for bringing it up. I don't expect that running CombineGVCFs would fix this issue. Would you be able to share a snippet of data (bam file) that we could use to debug & test this issue on?

• MGHMember

Woops, hit enter too soon, anyway, yes would be more than happy to offer a BAM and an example shell script that can produce two distinct VCFs from the same BAM file. Just let me know where on the server it would be good to drop.

I also think it might save some time to just see the bug directly though. I just added a branch to the git repository (gsaunstable/Git https://github.com/broadinstitute/gsa-unstable/SWBug), it's the latest version of the code trunk, with one additional file put in that shows the alignment, and clearly demonstrates the problem with a simple main method that produces differing cigars for this indel in question. This should make the problem/example easy to see.

Cheers,
N

• MGHMember

@Geraldine_VdAuwera‌ An example BAM file and a python script that can produce the bug (BugExample.py) is available at: /humgen/gsa-scr1/pub/incoming/BugExample.tgz

• MGHMember

@Geraldine_VdAuwera‌ Woops, hit the tag early again, anyway, have been in touch with Valentin, and this should not be an issue going forward, thanks!