VariantAnnotator and multiple records in resources


I'm using VariantAnnotator to add annotations to variants from a bunch of sources. One issue that I have is that for some variants, there are multiple annotations in a supplied resource. In the docs, I read

"Note that if there are multiple records in the resource file that overlap the given position, one is chosen randomly."

Can this behaviour be altered? I need to output all annotations for a record, either on a single line, or on multiple.

In the case I'm working on, one line has the annotation "CLNSIG=5" (i.e. a known pathogenic variant) and the other (likely an older record) has "CLNSIG=1", i.e. a variant of unknown significance. I need to output both so I can filter downstream (using SelectVariants) to select those where "CLNSIG=5".
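
For illustration, the requested behaviour (keep every value from the overlapping resource records, then filter on any of them) can be sketched outside of GATK. This is a minimal Python sketch, not VariantAnnotator code: the toy record tuples, the helper names `collect_clnsig` and `has_pathogenic`, and the `|` separator for joined values are all assumptions made for the example.

```python
from collections import defaultdict

# Toy resource records: (chrom, pos, ref, alt, clnsig).
# The CLNSIG values are illustrative, not copied from ClinVar.
RESOURCE = [
    ("13", 32890627, "A", "AT", "5"),
    ("13", 32890627, "A", "AT", "1"),
]

def collect_clnsig(records):
    """Group every CLNSIG value by (chrom, pos, ref, alt), keeping all of them."""
    grouped = defaultdict(list)
    for chrom, pos, ref, alt, clnsig in records:
        grouped[(chrom, pos, ref, alt)].append(clnsig)
    # Join with "|" so all values fit in one INFO field (separator is an assumption).
    return {key: "|".join(values) for key, values in grouped.items()}

def has_pathogenic(joined, wanted="5"):
    """True if any of the joined values equals the wanted significance code."""
    return wanted in joined.split("|")

annotations = collect_clnsig(RESOURCE)
key = ("13", 32890627, "A", "AT")
print(annotations[key])
print(has_pathogenic(annotations[key]))
```

Splitting on the separator before comparing matters here: a plain substring test would also match a value such as "15".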



  • dklevebring Member
    edited February 2015

    After a night of sleep I can further expand on the issue. In the resource VCF (from ClinVar), I have this:

    #13      32890627        rs80359400      A       AT      .       .       ASP;CLNACC=RCV000113041.1;CLNALLE=1;CLNDBN=Bre
    #13      32890627        rs80359393      A       AT      .       .       ASP;CLNACC=RCV000044248.2|RCV000082917.3;CLNAL
    #13      32890627        rs80359399      AT      A       .       .       ASP;CLNACC=RCV000044247.2|RCV000113038.1;CLNAL

    and dbSNP:

    13  32890627    rs80359399  AT  A   .   .   ASP;GENEINFO=BRCA2:675;LSD;NSF;OTHERKG;PM;REF;RS=80359399;RSPOS=32890633;SAO=0;SLO;SSR=0;VC=DIV;VP=0x050160001205000002100200;WGT=1;dbSNPBuildID=132
    13  32890627    rs80359393  A   AT  .   .   ASP;GENEINFO=BRCA2:675;LSD;NSF;OM;OTHERKG;PM;REF;RS=80359393;RSPOS=32890633;SAO=0;SLO;SSR=0;VC=DIV;VP=0x050160001205000002110200;WGT=1;dbSNPBuildID=132

    And my variants to be annotated are these:

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  L12982N_panel_v1
    13  32890627    .   A   AT  1500    .   AB=0.45 GT:AO:DP:PL:QA:QR:RO    0/1:86:170:100,0,100:2492:2390:78
    13  32890627    .   AT  A   1500    .   AB=0.45 GT:AO:DP:PL:QA:QR:RO    0/1:86:170:100,0,100:2492:2390:78

    I run VariantAnnotator, like so:

    java -jar GenomeAnalysisTK.jar -T VariantAnnotator -R $REF -V $V --resource:clinvar $CLINVAR --expression clinvar.CLNSIG -L $V -E clinvar.CLNACC

    In the results, it's clear that VariantAnnotator did two (kind of weird) things:

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  L12982N_panel_v1
    13  32890627    .   A   AT  1500    .   AB=0.45;clinvar.CLNACC=RCV000113041.1;clinvar.CLNSIG=5  GT:AO:DP:PL:QA:QR:RO    0/1:86:170:100,0,100:2492:2390:78
    13  32890627    .   AT  A   1500    .   AB=0.45;clinvar.CLNACC=RCV000113041.1;clinvar.CLNSIG=5  GT:AO:DP:PL:QA:QR:RO    0/1:86:170:100,0,100:2492:2390:78
    1. VA ignores one of the annotation lines for the insertion (A->AT) variant. This matches the docs, but it is still questionable behaviour.
    2. The deletion variant (AT->A) is annotated with data from the insertion record in the resource file. See the CLNACC annotation, which for AT->A should be RCV000044247.2|RCV000113038.1.

    Interestingly though, when I turn on dbSNP annotation with --dbsnp $DBSNP:

    #CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  L12982N_panel_v1
    13  32890627    rs80359393  A   AT  1500    .   AB=0.45;DB;clinvar.CLNACC=RCV000113041.1;clinvar.CLNSIG=5   GT:AO:DP:PL:QA:QR:RO    0/1:86:170:100,0,100:2492:2390:78
    13  32890627    rs80359399  AT  A   1500    .   AB=0.45;DB;clinvar.CLNACC=RCV000113041.1;clinvar.CLNSIG=5   GT:AO:DP:PL:QA:QR:RO    0/1:86:170:100,0,100:2492:2390:78

    This adds the dbSNP rsids, and does so correctly for both the insertion and deletion. This behaviour is different from that of the resource annotation (point 2 above).

    I assume that 2) is a bug, and 1) is the correct behaviour. I do think, however, that 1) is questionable. Merging records in the resource would be one way around this, but running CombineVariants merges the deletion and insertion variants (all three rows) into a single row, thereby losing the connection between ALT allele and rsID. Keeping the two on separate rows would handle this.
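
The difference between the two observations can be modelled with the three ClinVar records quoted in this comment. The following is a minimal Python sketch of position-only versus allele-aware matching. It only models the behaviour described in this thread and is not taken from the VariantAnnotator source; the function names are hypothetical.

```python
# ClinVar records quoted above: (chrom, pos, rsid, ref, alt, CLNACC).
CLINVAR = [
    ("13", 32890627, "rs80359400", "A", "AT", "RCV000113041.1"),
    ("13", 32890627, "rs80359393", "A", "AT", "RCV000044248.2|RCV000082917.3"),
    ("13", 32890627, "rs80359399", "AT", "A", "RCV000044247.2|RCV000113038.1"),
]

def lookup_by_position(chrom, pos):
    """Position-only matching: every record at the site is a candidate."""
    return [r for r in CLINVAR if (r[0], r[1]) == (chrom, pos)]

def lookup_by_allele(chrom, pos, ref, alt):
    """Allele-aware matching: REF and ALT must agree as well."""
    return [r for r in CLINVAR if (r[0], r[1], r[3], r[4]) == (chrom, pos, ref, alt)]

# Position-only matching offers all three records for either variant,
# so an arbitrary pick for the deletion can land on an insertion record.
print(len(lookup_by_position("13", 32890627)))

# Allele-aware matching keeps the deletion and insertion records apart.
print([r[5] for r in lookup_by_allele("13", 32890627, "AT", "A")])
print([r[5] for r in lookup_by_allele("13", 32890627, "A", "AT")])
```

Even with the allele check, the insertion still matches two resource records, so point 1 (which one to report, or whether to report both) remains a separate question.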


  • Geraldine_VdAuwera Cambridge, MA Member, Administrator, Broadie admin

    I think your interpretations are largely correct.

    I agree 1) is awkward. I'm not sure that behavior can be improved directly (there are many potential complications), but what would you think of this as a workaround: CombineVariants could be told to merge variant records if and only if the REF and ALT alleles are identical, and otherwise keep them separate, with the ability to choose which variant-level annotations to keep when they conflict (which I think should be feasible with the existing merge priority machinery)?

    Regarding 2), I believe this is the same behavior as 1) but with even less justification. The problem is that, iirc, VA checks only the position, not the alleles. It may be possible to put in such a check; would you be able to submit a bug report with file snippets that we could use in a feature enhancement request?
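
The workaround proposed above (merge only when REF and ALT are identical, and resolve conflicting annotations by priority) can be sketched as follows. This is a minimal Python sketch under stated assumptions, not CombineVariants behaviour: the function `merge_same_alleles`, its `priority` parameter, the source names, and the INFO values are all hypothetical.

```python
def merge_same_alleles(records, priority):
    """Merge records only when chrom, pos, REF and ALT are all identical.

    records:  list of dicts with keys chrom, pos, ref, alt, source, info
    priority: source names, highest priority first (hypothetical parameter)
    """
    rank = {name: i for i, name in enumerate(priority)}
    merged = {}
    # Visit lowest-priority records first so higher-priority values overwrite them.
    for rec in sorted(records, key=lambda r: rank[r["source"]], reverse=True):
        key = (rec["chrom"], rec["pos"], rec["ref"], rec["alt"])
        merged.setdefault(key, {}).update(rec["info"])
    return merged

# Toy input: source names and INFO values are made up for illustration.
records = [
    {"chrom": "13", "pos": 32890627, "ref": "A", "alt": "AT",
     "source": "old", "info": {"CLNSIG": "1"}},
    {"chrom": "13", "pos": 32890627, "ref": "A", "alt": "AT",
     "source": "new", "info": {"CLNSIG": "5"}},
    {"chrom": "13", "pos": 32890627, "ref": "AT", "alt": "A",
     "source": "new", "info": {"CLNSIG": "5"}},
]

merged = merge_same_alleles(records, priority=["new", "old"])
for key, info in sorted(merged.items()):
    print(key, info)
```

With this keying, the insertion and the deletion stay on separate rows, and only the two A->AT records collapse into one. Choosing a single winner per annotation loses the other value, so a variant that joins conflicting values instead may suit the original question better.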
