Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

VariantEval/GenotypeConcordance stratify by depth-of-coverage, genotype quality, or other metric?

nilshomernilshomer Cambridge, MAPosts: 3Member

I would like to evaluate variant calls to produce a plot (psuedo-ROC) of sensitivity vs. specificity (or concordance, etc) when I condition on a minimum/maximum value for a particular metric (coverage, genotype quality, etc.). I can do this by running VariantEval or GenotypeConcordance multiple times, once for each cutoff value, but this is inefficient, since I believe I should be able to compute these values in one pass. Alternatively, if there was a simple tool to annotate each variant as concordance or discordant, I could tabulate the results myself. I would like to rely upon GATK's variant comparison logic to compare variants (especially indels). Any thoughts on if current tools can be parameterized, or adapted for these purposes?

Thanks for your help in advance,

N

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin
    edited June 2013

    Hi Nils,

    The simplest way to do what you want is to use CombineVariants with -V:eval eval.vcf -V:comp comp.vcf, then inspect the set= key. It will be the standard one of

    1) set=eval
    2) set=comp
    3) set=Intersection
    4) set=eval-filteredIncomp
    5) set=filteredIneval-comp
    6) set=filteredInAll
    7) set=filterIneval
    8) set=filterIncomp

    If "comp" is a truth set:

    FP = (1)+(4)
    TP = (3)
    FN = (2)+(5)
    TN = (6)+(7)+(8)

    Does that sound like what you need?

    Post edited by Geraldine_VdAuwera on

    Geraldine Van der Auwera, PhD

  • nilshomernilshomer Cambridge, MAPosts: 3Member

    Close, but I would like the FP/TP/FN/TN when I vary by minimum coverage (say the "DP" field in the VCF). So I would have a table for each of those values conditioned on the minimum coverage. I could then plot a ROC curve: http://en.wikipedia.org/wiki/Receiver_operating_characteristic

    I could run CombineVariants multiple times (or VariantEval or GenotypeConcordance), with each time using a new VCF filtered by my own criteria (say depth of coverage), but this is hugely inefficient, and so instead I would like to run CombineVariants (or more likely the others) and get a table of values. The key is to know how sensitivity relates an arbitrary metric (say coverage).

    My second related question was is there a function in the GATK API that compares a called variant against a true variant (say via VariantContexts)? I could then write my own walker (I think) over two variant files.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    Hmm, I see -- but no, there isn't anything built in to do that. I think the approach you hint at with your second question would work well though.

    I would recommend looking at the CombineVariants walker code to get a better idea of the existing functions that you could use to compare calls.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.