It looks like you're new here. If you want to get involved, click one of these buttons!
Hi, I wanted to place what I believe is a bug report.
Bug: If the haplotype caller assembles reads into identical haplotypes, then depending on arbitrary factors, it will emit different variant events for this haplotype, meaning that gVCFs from samples with the same genotype will appear to have different genotypes when their VCFs are viewed.
Longer Explanation : Consider the alignment shown below or a reference genome (top) and a reconstructed haplotype aligned to it (aligned as in either the first or second case):
This alignment is tricky because the "GGGCC" can be positioned in one of two locations, as shown above, and both of these are equivalently optimal with an affine scoring matrix. Which alignment should be returned is thus questionable, but presumably the GATK should always give the same alignment.
However, the GATK does not, and because of this one sample will have a 2D5M15D mutation, while the other can have a 9D5M8D mutation listed, even if both have the exact same genotype. In particular, if the active region or length of the assembled regions change, which can easily happen, then even if every other base is matched, different variants will appear at this locus, based on how the SW alignment is done.
I have tried to track this bug down, and it seems due to how the SW aligner class selects between equally optimal alignments (I believe while constructing the backtrack matrix but have not verified).
Reproducible Example: The higher level problem is that VCFs emitted from identical samples but with slightly different reads/intervals show different variants at this site despite having the same underlying genotypes and assembled haplotypes. Tracking showed this was due to the length of assembled regions varying a bit. A simple example which simply shows the basic flaw in the relevant code is given below.
Does this seem like the type of thing that could be an easy fix? I can also put this code as a branch on gsa_unstable if it would make things easier. I am also assuming that the CombineGVCF step will not fix this error, but have not validated the behavior and for my purposes it would still be problematic.
//Two examples of an assmebled haplotype and reference genome,
//exactly the same save the window size
byte fullHap = "GCGTCGCAGTCTTAAGGCCCCGCCTTTTCAGACAGCTTCCGCTGGGCCTGGGCCGCTGCGGGGCGGTCACGGCCCCTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGA--GGGCC---------------GGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGACCGGGCCGAGCCGGGGGAAGGGCTCCGGTGACT".replace("-","").getBytes();
byte longRef = "CTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGAGGGGGCCCGGGGCCGCGTCCCTGGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGA".getBytes();
byte longHap = "CTTTAAGCCTGAGCCCCGCCCCCTGGCTCCCCGCCCCCTCTTCTCCCCTCCCCCAAGCCAGCACCTGGTGCCCCGGCGGGTCGTGCGGCGCGGCGCTCCGCGGTGAGCGCCTGACCCCGA---------GGGCC--------GGGCCCTCCCCACCCTTGCGGTGGCCTCGCGGGTCCCAGGGGCGGGGCTGGAGCGGCAGCAGGGCCGGGGAGATGGGCGGTGGGGAGCGCGGGAGGGA".replace("-","").getBytes();
//a simplified version of the getCigar routine in the haplotype caller
final String SW_PAD = "NNNNNNNNNN";
final String paddedRef = SW_PAD + new String(fullRef) + SW_PAD;
final String paddedPath = SW_PAD + new String(fullHap) + SW_PAD;
final SmithWaterman alignment = new SWPairwiseAlignment( paddedRef.getBytes(), paddedPath.getBytes(), CigarUtils.NEW_SW_PARAMETERS );
final SmithWaterman alignment2 = new SWPairwiseAlignment( fullRef, fullHap, CigarUtils.NEW_SW_PARAMETERS );
//These cigars are not the same, but should be
Cigar rawPadded = alignment.getCigar();
Cigar notPadded= alignment2.getCigar();