Possible LeftAlignVariants bug for multi-allelic indel variant

jsorenson Posts: 0 Member

We have a complex VCF record that doesn't appear to be properly treated by LeftAlignVariants, and I couldn't find evidence that this behavior has been reported anywhere else.

The record is

17      19561175        .       GGTTTGT G,GTTTGT        49      PASS    AC=1,1;AF=0.50,0.50;AN=2;DP=117;DS;MQ=60;MQ0=0;source=Locus     GT:AB:AD:DP     1/2:0.925:68,49:117

Admittedly, the GGTTTGT>GTTTGT variation is odd because it's better specified as GG>G, but there's nothing semantically wrong with this record as written. (If you're wondering where this came from, it came from simulated data.)
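
To illustrate why GGTTTGT>GTTTGT reduces to GG>G, here is a minimal Python sketch that strips bases shared at the right end of both alleles. The helper name `trim_common_suffix` is hypothetical and is not part of GATK; it only shows the idea of trimming.

```python
def trim_common_suffix(ref, alt):
    """Drop shared trailing bases, keeping at least one base in each allele."""
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    return ref, alt


# Second ALT of the record: five trailing bases (GTTTGT minus its first G) are shared.
print(trim_common_suffix("GGTTTGT", "GTTTGT"))  # ('GG', 'G')

# First ALT of the record: already minimal, so nothing is trimmed.
print(trim_common_suffix("GGTTTGT", "G"))  # ('GGTTTGT', 'G')
```

In a multi-allelic record both ALT alleles share one REF, which is why the second allele cannot be written in its trimmed form without splitting the record.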

The challenge, however, is that the left-aligned version of this variant is AG>A. So I would expect the following output from LeftAlignVariants:

17      19561174        .       AGGTTTGT AG,AGTTTGT

Rather ugly, but I think that's the right way to write the original complex variation after left-alignment. Alternatively a separated and phased representation would achieve the same:

17      19561174        .       AG  A             .... 0/1:...
17      19561175        .       GGTTTGT G   .... 1|0:...

But I bet that would introduce all types of problems in LeftAlignVariants if you tried to make that happen.
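
The representations discussed above can be sanity-checked by applying each allele to a reference string and comparing the resulting haplotypes. This is a minimal Python sketch under stated assumptions: in the toy reference only the leading A and the GGTTTGT stretch come from the post, while the trailing C and the helper name `apply_allele` are made up for illustration.

```python
def apply_allele(reference, pos, ref, alt):
    """Substitute alt for ref at 0-based pos and return the resulting haplotype."""
    assert reference[pos:pos + len(ref)] == ref, "REF does not match the reference"
    return reference[:pos] + alt + reference[pos + len(ref):]


# Hypothetical context: A (19561174) + GGTTTGT (19561175 onward) + an invented C.
toy_reference = "AGGTTTGTC"

# Original record, anchored on the first G (0-based offset 1).
original = [apply_allele(toy_reference, 1, "GGTTTGT", alt) for alt in ("G", "GTTTGT")]

# Padded multi-allelic record, anchored one base earlier (offset 0).
padded = [apply_allele(toy_reference, 0, "AGGTTTGT", alt) for alt in ("AG", "AGTTTGT")]

# Separated records: GGTTTGT>G at offset 1 and AG>A at offset 0.
separated = [
    apply_allele(toy_reference, 1, "GGTTTGT", "G"),
    apply_allele(toy_reference, 0, "AG", "A"),
]

print(original)   # ['AGC', 'AGTTTGTC']
print(original == padded == separated)  # True
```

All three forms describe the same pair of haplotypes on this toy sequence; they differ only in where the alleles are anchored.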

I think it's a really hard problem to solve in general. I just wanted to post here to (1) check whether you agree that it behaves this way and (2) help anyone else who might be seeing something like this.

Answers

  • brianherb Posts: 1 Member

    As a follow-up to this, I would find it useful for LeftAlignVariants to left-align multi-allelic indels, but I have found that this function does not maintain the order of the variants.

    For example, a call like:

    java -jar GenomeAnalysisTK.jar -R all.fa -T LeftAlignAndTrimVariants --variant multi_indel.vcf --splitMultiallelics --trimAlleles -o multi_indel_left.vcf

    takes input like:

    chr1 8598 . T - TAA,TA

    and outputs:

    chr1 8598 . T - TA

    chr1 8598 . T - TAA

    where the order of the variants is reversed. I know that this example is one that was not left-aligned, but I see this behavior in left-aligned variants as well, and it seems to be random which variants get reversed. Is there a way for this function to maintain the order of multi-allelic variants so that I can keep track of them?

  • Geraldine_VdAuwera Posts: 6,412 Administrator, GATK Developer

    I'm not sure I understand what you mean by maintaining the order -- are you saying that if you have

    T - TAA, TA
    

    you want it to always output

    T - TAA
    T - TA
    

    ? I'd have to check the code to see if there is a rationale for outputting one or the other first, but I'm going to guess it's just related to the type of data structure we're using to store the alleles and how we're retrieving them. Are you sure the ordering is random, not alphabetical ordering?

    Geraldine Van der Auwera, PhD
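
For comparison, splitting in listed order versus alphabetical order can be sketched in a few lines of Python. This is not how GATK implements `--splitMultiallelics`; the function `split_multiallelic` is a hypothetical helper that only shows what an order-preserving split would return for the example above.

```python
def split_multiallelic(chrom, pos, ref, alts):
    """Yield one (chrom, pos, ref, alt, alt_index) tuple per ALT, in listed order."""
    for index, alt in enumerate(alts.split(","), start=1):
        yield chrom, pos, ref, alt.strip(), index


records = list(split_multiallelic("chr1", 8598, "T", "TAA,TA"))
alts_in_order = [record[3] for record in records]

print(alts_in_order)          # ['TAA', 'TA']  (order given in the input)
print(sorted(alts_in_order))  # ['TA', 'TAA']  (alphabetical order)
```

The reversed output reported in the example (TA before TAA) matches the alphabetical order, so this single record cannot distinguish alphabetical ordering from random ordering. Carrying the original allele index along, as the last tuple field does here, is one way to keep track of which ALT each split record came from.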
