CombineVariants incorrectly(?) complains about badly formed variant when merging multiallelic site

Hi GATK team,

I am attempting to combine a HaplotypeCaller-generated VCF with some indels called by pindel, using the following arguments (GATK v3.3-0-g37228af):

-R /data/shared/ref/b37/human_g1k_v37.fasta -T CombineVariants --variant:GATK var.HiSeqDecember.raw.vcf --variant:pindel pindel_combined.vcf -o var.HiSeqDecember.pindel.raw.vcf -genotypeMergeOptions PRIORITIZE -priority GATK,pindel

However, I get the following error:

ERROR MESSAGE: Badly formed variant context at location 1:157718231; getEnd() was 157718235 but this VariantContext contains an END key with value 157718231

The variants in question are (from GATK):

1 157718231 . CAAAT C,CAAATAAAT 2533.56 PASS AC=3,13;AF=0.125,0.542;AN=24;BaseQRankSum=1.762;ClippingRankSum=-0.327;DP=126;FS=0.000;HOMLEN=39;HOMSEQ=AAATAAATAAATAAATAAATAAATAAATAAATAAATAAA;InbreedingCoeff=-0.1260;MLEAC=3,13;MLEAF=0.125,0.542;MQ=70.00;MQ0=0;MQRankSum=0.920;QD=22.22;ReadPosRankSum=-0.893;SOR=0.976;SVLEN=4;SVTYPE=INS;set=Intersection GT:DP:GQ 0/0:10:30 0/2:9:18 2/2:6:18 2/2:5:15 0/1:10:99 0/2:14:99 2/2:8:24 2/2:6:18 2/2:7:21 0/2:17:99 0/1:5:75 0/1:6:27

and (from pindel):

1 157718231 . C CAAAT . PASS AC=2;AF=0.143;AN=14;END=157718231;HOMLEN=39;HOMSEQ=AAATAAATAAATAAATAAATAAATAAATAAATAAATAAA;SVLEN=4;SVTYPE=INS;set=variant3-variant4-variant6-variant7-variant8-variant9-variant10 GT:AD ./. ./. 0/0:0,7 0/0:0,6 ./. 0/0:0,9 0/0:0,8 0/0:0,7 0/0:0,8 1/1:0,12 ./. ./.

It is worth noting that the pindel VCF here was merged together from several pindel-generated VCFs using CombineVariants without any complaint from the GATK. It looks to me as though the END key is correct for the pindel variant (a simple insertion), but the GATK is confused by the mixed deletion/insertion variant generated by the HaplotypeCaller at the same position (without an END key).

If I strip all END tags from the pindel VCF and rerun the command, it completes successfully, so this is not a showstopper for me. I assume this is a bug(?), though, and if so it would be great if there were a fix.

Cheers,

Dave
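
For readers who want to reproduce the workaround described above (stripping END tags from the pindel VCF before merging), here is a minimal Python sketch. It is not part of GATK or pindel. The function name `strip_end_tag` is hypothetical, and the code assumes a standard tab-delimited VCF with INFO in the eighth column. Header lines, including any `##INFO=<ID=END...>` declaration, are passed through unchanged.

```python
def strip_end_tag(vcf_line: str) -> str:
    """Return the VCF line with any END=... entry removed from the INFO column.

    Hypothetical helper for illustration; assumes tab-delimited VCF text.
    """
    if vcf_line.startswith("#"):
        return vcf_line  # header lines pass through unchanged

    newline = "\n" if vcf_line.endswith("\n") else ""
    fields = vcf_line.rstrip("\n").split("\t")

    # INFO is the 8th column (index 7); entries are separated by ';'.
    kept = [entry for entry in fields[7].split(";") if not entry.startswith("END=")]
    fields[7] = ";".join(kept) if kept else "."  # '.' marks an empty INFO field

    return "\t".join(fields) + newline


def strip_end_tags(lines):
    """Apply strip_end_tag to every line of an iterable (e.g. an open file)."""
    for line in lines:
        yield strip_end_tag(line)
```

It could be applied by opening the pindel VCF, writing `strip_end_tags(infile)` to a new file, and passing that new file to CombineVariants in place of the original.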

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi Dave,

    I think the underlying issue is that your records have different REF alleles. Assuming you're using the reference that matches the GATK record, what CombineVariants sees as the REF allele is CAAAT, so that variant's endpoint is indeed 157718235, and so it is correctly choking on the pindel record's shorter end tag.

    If there is a bug at all here it's that CombineVariants is not checking that the REF alleles match in the two records...

  • Hi Geraldine,

    Thanks for the quick response.

    I am using the same reference for both files. I realise the REF allele is represented differently so that the HaplotypeCaller file can represent both a deletion and an insertion at the same site, and that for that file the END tag (if it were present) should indeed be 157718235. However, one of those ALT alleles represents the same variant as the pindel VCF: CAAAT vs CAAATAAAT in the HaplotypeCaller VCF is the same variant as C vs CAAAT in the pindel VCF, just represented slightly differently. So my assumption was that CombineVariants would realise they represent the same variant and merge them accordingly (which it does fine after removing the END tags from the pindel VCF). If the END value for the pindel variant is correct when viewed on its own, should CombineVariants not recognise that it is correct for that representation of the allele rather than choke on it? Or is merging simply not supported when the REF alleles are represented differently, even when (one of) the underlying variant(s) is the same in both records?

    Thanks

    Dave

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    I understand what you're saying about different representations of the same thing, but the fact of the matter is that the REF and ALT fields are not interchangeable. This is because the tools are set up to interpret all the information relative to the reference, which serves as a fixed point. I don't know why pindel is emitting the alleles flipped like that if you're using the same reference file, but it shouldn't -- it violates the VCF format specification, which we depend upon to formulate analysis results that are consistent and portable.

    What you're doing (successfully merging after removing the END tag) is working kind of by accident, but it's not safe to do this -- you could potentially run into a situation where an allele gets dropped or switched with another one.

  • I disagree, Geraldine. The pindel call looks perfectly fine to me - it's just the second allele reported from the GATK call, left-aligned and trimmed. The only reason it differs from the GATK representation is that the AAATdel allele is not present in the pindel set. To put it another way, I would expect GATK to represent it the same way if the first alternate allele were missing.

    I think the problem is that GATK (explicitly or implicitly) left-aligns all variants, but you actually have to shift the pindel variant to the right in order for it to be compatible with the GATK variant.

    I suspect you could make this case work by manually right-aligning the pindel call, i.e. giving it a REF/ALT of CAAAT/CAAATAAAT. This also violates at least the spirit of the VCF spec, but I think GATK will handle it.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hmm, fair point -- I was looking at it as if it was e.g. REF:A ALT:T,C vs REF:T ALT:A,C. But you're right that this is an indel alignment issue. Sorry guys! I'm going to blame Friday-cross-eyed-from-staring-at-variants-all-week syndrome.

    @pdexheimer Do you think it would make sense to put in some logic to handle this sort of case intelligently and explicitly?

  • No worries Geraldine, pdexheimer has explained the issue much more clearly than I was able to.

    So, if I remove the END tags from the pindel variants as a workaround (easier than right-aligning pindel variants on an ad hoc basis), this should be safe as long as pindel is writing the VCF format correctly, right?

    Removing the END tags would certainly avoid the error (as you've already seen). I think that the existing merging code will work correctly, but it would probably be worthwhile to double-check. If you've got time to mess with it, you could iteratively remove individual END tags from variants that throw this error, so that you build up a list to check by hand. Once you have five or so, do a blanket removal of all the tags so that the tool actually runs to completion, then go back and verify those problematic sites.
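
The representation issue discussed in this thread can be checked with a few lines of Python. This is only a sketch of the arithmetic, not GATK's actual code. The helper names `implied_end` and `trim_alleles` are hypothetical, and the relation end = POS + len(REF) - 1 is an assumption that matches the numbers in the error message (157718231 + 5 - 1 = 157718235). Trimming the shared bases from the HaplotypeCaller allele pair CAAAT/CAAATAAAT yields the pindel pair C/CAAAT, which is why the two records describe the same insertion even though their implied end positions differ.

```python
def implied_end(pos: int, ref: str) -> int:
    """End coordinate implied by POS and the REF allele length (assumed relation)."""
    return pos + len(ref) - 1


def trim_alleles(pos: int, ref: str, alt: str):
    """Trim bases shared by REF and ALT, keeping at least one base in each.

    The shared suffix is removed first, then the shared prefix (which shifts POS).
    """
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


pos = 157718231

# HaplotypeCaller record: REF is CAAAT, so the implied end is POS + 4.
hc_end = implied_end(pos, "CAAAT")

# pindel record: REF is C, so the implied end equals POS (matching its END tag).
pindel_end = implied_end(pos, "C")

# The second HaplotypeCaller ALT allele, trimmed, gives the pindel representation.
trimmed = trim_alleles(pos, "CAAAT", "CAAATAAAT")

print(hc_end, pindel_end, trimmed)
```

A check like this could also help with the hand verification suggested above: for each site that throws the error, compare the pindel alleles against the trimmed form of each HaplotypeCaller ALT allele before trusting the merged record.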
