What do the VariantEval modules do?

Geraldine_VdAuwera Posts: 6,643 Administrator, GATK Developer
edited March 2013 in FAQs

VariantEval accepts two types of modules: stratification and evaluation modules.

  • Stratification modules stratify (group) the variants based on certain properties.
  • Evaluation modules compute certain metrics for the variants.

CpG

CpG is a three-state stratification:

  • The locus is a CpG site ("CpG")
  • The locus is not a CpG site ("non_CpG")
  • The locus is either a CpG or not a CpG site ("all")

A CpG site is defined as a site where the reference base at a locus is a C and the adjacent reference base in the 3' direction is a G.
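
The definition above can be sketched in a few lines of Python. This is only an illustration of the rule as stated here, not the GATK implementation; the function names and the 0-based string indexing are assumptions made for the example.

```python
def is_cpg(reference: str, index: int) -> bool:
    """True if the reference base at `index` is C and the next base (3' side) is G.

    `reference` is a plain string of bases and `index` is 0-based; both are
    assumptions for this sketch, not part of VariantEval.
    """
    if index < 0 or index + 1 >= len(reference):
        return False  # no 3' neighbour to inspect
    return reference[index].upper() == "C" and reference[index + 1].upper() == "G"


def cpg_states(reference: str, index: int) -> list:
    """Every locus falls into exactly one specific state plus the catch-all 'all'."""
    return ["CpG" if is_cpg(reference, index) else "non_CpG", "all"]
```

For example, `cpg_states("ACGT", 1)` places the C at index 1 in both "CpG" and "all".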

EvalRod

EvalRod is an N-state stratification, where N is the number of eval rods bound to VariantEval.

Sample

Sample is an N-state stratification, where N is the number of samples in the eval files.

Filter

Filter is a three-state stratification:

  • The locus passes QC filters ("called")
  • The locus fails QC filters ("filtered")
  • The locus either passes or fails QC filters ("raw")
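
A minimal Python sketch of how a record might be assigned to these states from its VCF FILTER column. Treating an unset filter (".") the same as "PASS" is an assumption of this example, and `filter_states` is a hypothetical helper, not a GATK function.

```python
def filter_states(filter_field: str) -> list:
    """Return the Filter stratification states for one VCF FILTER value.

    Assumption: both "PASS" and "." (no filters applied) count as passing.
    """
    passes = filter_field in ("PASS", ".")
    # Each record lands in one specific state and in the catch-all "raw" state.
    return ["called" if passes else "filtered", "raw"]
```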

FunctionalClass

FunctionalClass is a four-state stratification:

  • The locus is a synonymous site ("silent")
  • The locus is a missense site ("missense")
  • The locus is a nonsense site ("nonsense")
  • The locus is of any functional class ("any")

CompRod

CompRod is an N-state stratification, where N is the number of comp tracks bound to VariantEval.

Degeneracy

Degeneracy is a six-state stratification:

  • The underlying base position in the codon is 1-fold degenerate ("1-fold")
  • The underlying base position in the codon is 2-fold degenerate ("2-fold")
  • The underlying base position in the codon is 3-fold degenerate ("3-fold")
  • The underlying base position in the codon is 4-fold degenerate ("4-fold")
  • The underlying base position in the codon is 6-fold degenerate ("6-fold")
  • The underlying base position in the codon is degenerate at any level ("all")

See the Wikipedia page on degeneracy (http://en.wikipedia.org/wiki/Genetic_code#Degeneracy) for more information.
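
As a rough illustration, the sketch below labels a codon by the number of codons in the standard genetic code that encode the same amino acid. This amino-acid-level reading is an assumption (it is the only one under which a "6-fold" state can occur); the post does not spell out how VariantEval computes the value, and the names below are hypothetical.

```python
from collections import Counter

# Standard genetic code, with codons enumerated in TCAG order for each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Number of synonymous codons per amino acid (stop codons are keyed by "*").
CODONS_PER_AMINO_ACID = Counter(CODON_TABLE.values())


def degeneracy_label(codon: str) -> str:
    """Return a label such as '4-fold' for the amino acid encoded by `codon`."""
    amino_acid = CODON_TABLE[codon.upper()]
    return f"{CODONS_PER_AMINO_ACID[amino_acid]}-fold"
```

Under this reading, ATG (Met) is "1-fold", GTT (Val) is "4-fold" and CTG (Leu) is "6-fold".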

JexlExpression

JexlExpression is an N-state stratification, where N is the number of JEXL expressions supplied to VariantEval. See "Using JEXL expressions".

Novelty

Novelty is a three-state stratification:

  • The locus overlaps the knowns comp track (usually the dbSNP track) ("known")
  • The locus does not overlap the knowns comp track ("novel")
  • The locus either overlaps or does not overlap the knowns comp track ("all")
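
In code, the overlap test amounts to a position lookup. The sketch below assumes loci are represented as (chromosome, position) tuples and that the knowns track has been loaded into a set; both are choices made for this example only.

```python
def novelty_states(locus: tuple, known_loci: set) -> list:
    """Return the Novelty states for one eval locus.

    `locus` is a (chrom, pos) tuple and `known_loci` is a set of such tuples
    taken from the knowns comp track (hypothetical representation).
    """
    return ["known" if locus in known_loci else "novel", "all"]
```

With `known = {("1", 10327)}`, the locus `("1", 10327)` is "known" and `("1", 10400)` is "novel"; both also count toward "all".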

CountVariants

CountVariants is an evaluation module that computes the following metrics:

Metric                  Definition
nProcessedLoci          Number of processed loci
nCalledLoci             Number of called loci
nRefLoci                Number of reference loci
nVariantLoci            Number of variant loci
variantRate             Variants per locus rate
variantRatePerBp        Number of variants per base
nSNPs                   Number of SNP loci
nInsertions             Number of insertions
nDeletions              Number of deletions
nComplex                Number of complex loci
nNoCalls                Number of no-call loci
nHets                   Number of het loci
nHomRef                 Number of hom ref loci
nHomVar                 Number of hom var loci
nSingletons             Number of singletons
heterozygosity          Heterozygosity per locus rate
heterozygosityPerBp     Heterozygosity per base pair
hetHomRatio             Heterozygosity to homozygosity ratio
indelRate               Indel rate (insertion count + deletion count)
indelRatePerBp          Indel rate per base pair
deletionInsertionRatio  Deletion to insertion ratio
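
To make a few of the derived metrics concrete, here is a small Python sketch. The formulas are inferred from the definitions above rather than taken from the GATK source, so treat them as assumptions; the input counts are the "Intersection / all" row of the CountVariants report quoted in the comments below, and the results agree with that row to the precision shown there.

```python
def safe_ratio(numerator: float, denominator: float) -> float:
    """Divide, returning 0.0 instead of raising when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def count_variants_ratios(n_processed_loci, n_variant_loci, n_hets, n_hom_var):
    """Derive three CountVariants ratios from raw counts (inferred formulas)."""
    return {
        "variantRate": safe_ratio(n_variant_loci, n_processed_loci),
        "heterozygosity": safe_ratio(n_hets, n_processed_loci),
        "hetHomRatio": safe_ratio(n_hets, n_hom_var),
    }


ratios = count_variants_ratios(
    n_processed_loci=3137161264,
    n_variant_loci=43389,
    n_hets=46622,
    n_hom_var=40156,
)
print(round(ratios["hetHomRatio"], 2))  # 1.16
```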

CompOverlap

CompOverlap is an evaluation module that computes the following metrics:

Metric           Definition
nEvalSNPs        Number of eval SNP sites
nCompSNPs        Number of comp SNP sites
novelSites       Number of eval sites outside of comp sites
nVariantsAtComp  Number of eval sites at comp sites (that is, sharing the same locus as a variant in the comp track, regardless of whether the alternate allele is the same)
compRate         Percentage of eval sites at comp sites
nConcordant      Number of concordant sites (that is, for the sites that share the same locus as a variant in the comp track, those that have the same alternate allele)
concordantRate   The concordance rate

Understanding the output of CompOverlap

A SNP in the detection set is said to be 'concordant' if the position exactly matches an entry in dbSNP and the allele is the same. To understand this and other output of CompOverlap, we shall examine a detailed example. First, consider a fake dbSNP file (headers are suppressed so that one can see the important things):

 $ grep -v '##' dbsnp.vcf
 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO
 1       10327   rs112750067     T       C       .       .       ASP;R5;VC=SNP;VP=050000020005000000000100;WGT=1;dbSNPBuildID=132

Now, a detection set file with a single sample, where the variant allele is the same as listed in dbSNP:

 $ grep -v '##' eval_correct_allele.vcf
 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT            001-6
 1       10327   .       T       C       5168.52 PASS    ...     GT:AD:DP:GQ:PL    0/1:357,238:373:99:3959,0,4059

Finally, a detection set file with a single sample, but the alternate allele differs from that in dbSNP:

 $ grep -v '##' eval_incorrect_allele.vcf
 #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT            001-6
 1       10327   .       T       A       5168.52 PASS    ...     GT:AD:DP:GQ:PL    0/1:357,238:373:99:3959,0,4059

Running VariantEval with just the CompOverlap module:

 $ java -jar $STING_DIR/dist/GenomeAnalysisTK.jar -T VariantEval \
        -R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta \
        -L 1:10327 \
        -B:dbsnp,VCF dbsnp.vcf \
        -B:eval_correct_allele,VCF eval_correct_allele.vcf \
        -B:eval_incorrect_allele,VCF eval_incorrect_allele.vcf \
        -noEV \
        -EV CompOverlap \
        -o eval.table

We find that the eval.table file contains the following:

 $ grep -v '##' eval.table | column -t 
 CompOverlap  CompRod  EvalRod                JexlExpression  Novelty  nEvalVariants  nCompVariants  novelSites  nVariantsAtComp  compRate      nConcordant  concordantRate
 CompOverlap  dbsnp    eval_correct_allele    none            all      1              1              0           1                100.00000000  1            100.00000000
 CompOverlap  dbsnp    eval_correct_allele    none            known    1              1              0           1                100.00000000  1            100.00000000
 CompOverlap  dbsnp    eval_correct_allele    none            novel    0              0              0           0                0.00000000    0            0.00000000
 CompOverlap  dbsnp    eval_incorrect_allele  none            all      1              1              0           1                100.00000000  0            0.00000000
 CompOverlap  dbsnp    eval_incorrect_allele  none            known    1              1              0           1                100.00000000  0            0.00000000
 CompOverlap  dbsnp    eval_incorrect_allele  none            novel    0              0              0           0                0.00000000    0            0.00000000

As you can see, the detection set variant was listed under nVariantsAtComp (meaning the variant was seen at a position listed in dbSNP), but only the eval_correct_allele dataset is shown to be concordant at that site, because the allele listed in this dataset and dbSNP match.
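
The same logic can be sketched in Python. This is a simplified illustration, not the CompOverlap code: variants are assumed to be (chrom, pos, ref, alt) tuples, and the rate formulas (compRate as a share of eval sites, concordantRate as a share of the sites found at comp) are inferred from the metric definitions above.

```python
def comp_overlap(eval_variants, comp_variants):
    """Compute CompOverlap-style metrics for lists of (chrom, pos, ref, alt) tuples."""
    # Index comp alternate alleles by locus so overlap is a position lookup.
    comp_alts = {}
    for chrom, pos, _ref, alt in comp_variants:
        comp_alts.setdefault((chrom, pos), set()).add(alt)

    at_comp = [v for v in eval_variants if (v[0], v[1]) in comp_alts]
    concordant = [v for v in at_comp if v[3] in comp_alts[(v[0], v[1])]]

    n_eval = len(eval_variants)
    return {
        "novelSites": n_eval - len(at_comp),
        "nVariantsAtComp": len(at_comp),
        "compRate": 100.0 * len(at_comp) / n_eval if n_eval else 0.0,
        "nConcordant": len(concordant),
        "concordantRate": 100.0 * len(concordant) / len(at_comp) if at_comp else 0.0,
    }


dbsnp = [("1", 10327, "T", "C")]
correct = comp_overlap([("1", 10327, "T", "C")], dbsnp)
incorrect = comp_overlap([("1", 10327, "T", "A")], dbsnp)
```

Both calls report one variant at a comp site, but only `correct` has `nConcordant` equal to 1, mirroring the "all" rows of the table above.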

TiTvVariantEvaluator

TiTvVariantEvaluator is an evaluation module that computes the following metrics:

Metric             Definition
nTi                Number of transition loci
nTv                Number of transversion loci
tiTvRatio          The transition to transversion ratio
nTiInComp          Number of comp transition sites
nTvInComp          Number of comp transversion sites
TiTvRatioStandard  The transition to transversion ratio for comp sites
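
For reference, a transition is a purine-to-purine (A<->G) or pyrimidine-to-pyrimidine (C<->T) change, and every other SNP is a transversion. A minimal Python sketch of the counts and ratio follows; the function names are hypothetical and returning 0.0 when there are no transversions is a choice made for this example.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref: str, alt: str) -> bool:
    """True for A<->G or C<->T substitutions; any other SNP is a transversion."""
    pair = {ref.upper(), alt.upper()}
    return pair == PURINES or pair == PYRIMIDINES


def ti_tv(snps):
    """Return (nTi, nTv, tiTvRatio) for a list of (ref, alt) SNPs."""
    n_ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    n_tv = len(snps) - n_ti
    ratio = n_ti / n_tv if n_tv else 0.0  # assumption: report 0.0 when nTv is zero
    return n_ti, n_tv, ratio
```

In the CompOverlap example above, T>C is a transition and T>A is a transversion.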

Geraldine Van der Auwera, PhD

Comments

  • Laurent Posts: 35 Member, GSA Collaborator

    This is such a great and helpful page, thanks a lot! A small question regarding the FunctionalClass stratification: what annotation will it read?

  • Geraldine_VdAuwera Posts: 6,643 Administrator, GATK Developer

    I'm glad you find it useful. FunctionalClass reads annotations such as those imported from SnpEff -- see the SnpEff annotation documentation for more details. There's also a presentation on this topic here (see "Functional annotation" toward the end of the page): http://www.broadinstitute.org/gatk/guide/events?id=2038

    Geraldine Van der Auwera, PhD

  • myoglu Posts: 39 Member

    Silly question maybe, but how did you make the nice plots and tables? I have the report as ".txt", but that does not look nearly as nice.

    Thanks!

  • Geraldine_VdAuwera Posts: 6,643 Administrator, GATK Developer

    We have some custom Rscripts to plot the report data. We currently don't make them available to the public though, sorry!

    Geraldine Van der Auwera, PhD

  • SCR California Posts: 2 Member

    Hi,

    I am using VariantEval to compare variant calls between two vcfs, and I noticed that in the CountVariants table, the values for nCalledLoci and nNoCalls are the same within the rows displaying calls unique to each set. For example, for set 1, nCalledLoci=551 and nNoCalls=551. Logically this seems incorrect - any explanations as to why this is happening?

    Thanks!

  • Geraldine_VdAuwera Posts: 6,643 Administrator, GATK Developer

    Hmm. Can you please post the full table?

    Geraldine Van der Auwera, PhD

  • SCR California Posts: 2 Member
    edited July 28

    Hi @Geraldine_VdAuwera,

    Thanks for getting back to me. The full table is quite unwieldy just as text, but I will post it below. Here is a link to a more readable version in dropbox: https://www.dropbox.com/s/y3r4rc5uqlka22q/GATKReport_nCalledLoci_nNoCalls_troubleshooting.xlsx

    #:GATKTable:30:21:%s:%s:%s:%s:%s:%d:%d:%d:%d:%.8f:%.8f:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%d:%.2e:%.2f:%.2f:%.2e:%.2f:%.2f:;
    #:GATKTable:CountVariants:Counts different classes of variants in the sample
    CountVariants   CompRod EvalRod JexlExpression  Novelty nProcessedLoci  nCalledLoci     nRefLoci        nVariantLoci    variantRate     variantRatePerBp        nSNPs   nMNPs   nInsertions     nDeletions      nComplex        nSymbolic       nMixed  nNoCalls        nHets   nHomRef nHomVar nSingletons     nHomDerived     heterozygosity  heterozygosityPerBp     hetHomRatio     indelRate       indelRatePerBp  insertionDeletionRatio
    CountVariants   dbsnp   eval    FilteredInAll   all     3137161264      0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0.00E+00        0       0       0.00E+00        0       0
    CountVariants   dbsnp   eval    FilteredInAll   known   3137161264      0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0.00E+00        0       0       0.00E+00        0       0
    CountVariants   dbsnp   eval    FilteredInAll   novel   3137161264      0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0       0.00E+00        0       0       0.00E+00        0       0
    CountVariants   dbsnp   eval    InPDX_P0-FilteredInPDX_PT       all     3137161264      301     0       301     0.0000001       10422462        219     0       12      57      13      0       0       0       530     0       72      0       0       1.69E-07        5919172 7.36    2.61E-08        38258064        0.21
    CountVariants   dbsnp   eval    InPDX_P0-FilteredInPDX_PT       known   3137161264      250     0       250     0.00000008      12548645        186     0       6       47      11      0       0       0       432     0       68      0       0       1.38E-07        7261947 6.35    2.04E-08        49018144        0.13
    CountVariants   dbsnp   eval    InPDX_P0-FilteredInPDX_PT       novel   3137161264      51      0       51      0.00000002      61512965        33      0       6       10      2       0       0       0       98      0       4       0       0       3.12E-08        32011849        24.5    5.74E-09        174286736       0.6
    CountVariants   dbsnp   eval    InPDX_PT-FilteredInPDX_P0       all     3137161264      494     0       494     0.00000016      6350528 400     0       44      37      13      0       0       0       846     0       142     0       0       2.70E-07        3708228 5.96    3.00E-08        33374056        1.19
    CountVariants   dbsnp   eval    InPDX_PT-FilteredInPDX_P0       known   3137161264      436     0       436     0.00000014      7195324 355     0       40      29      12      0       0       0       733     0       139     0       0       2.34E-07        4279892 5.27    2.58E-08        38730385        1.38
    CountVariants   dbsnp   eval    InPDX_PT-FilteredInPDX_P0       novel   3137161264      58      0       58      0.00000002      54088987        45      0       4       8       1       0       0       0       113     0       3       0       0       3.60E-08        27762489        37.67   4.14E-09        241320097       0.5
    CountVariants   dbsnp   eval    Intersection    all     3137161264      43389   0       43389   0.00001383      72303   40019   0       1614    1655    101     0       0       0       46622   0       40156   0       0       1.49E-05        67289   1.16    1.07E-06        930908  0.98
    CountVariants   dbsnp   eval    Intersection    known   3137161264      42396   0       42396   0.00001351      73996   39307   0       1492    1496    101     0       0       0       44896   0       39896   0       0       1.43E-05        69876   1.13    9.85E-07        1015591 1
    CountVariants   dbsnp   eval    Intersection    novel   3137161264      993     0       993     0.00000032      3159276 712     0       122     159     0       0       0       0       1726    0       260     0       0       5.50E-07        1817590 6.64    8.96E-08        11164274        0.77
    CountVariants   dbsnp   eval    PDX_P0  all     3137161264      551     0       551     0.00000018      5693577 450     0       44      56      1       0       0       551     377     0       174     311     0       1.20E-07        8321382 2.17    3.22E-08        31061002        0.79
    CountVariants   dbsnp   eval    PDX_P0  known   3137161264      355     0       355     0.00000011      8837073 297     0       18      39      1       0       0       355     192     0       163     161     0       6.12E-08        16339381        1.18    1.85E-08        54088987        0.46
    CountVariants   dbsnp   eval    PDX_P0  novel   3137161264      196     0       196     0.00000006      16005924        153     0       26      17      0       0       0       196     185     0       11      150     0       5.90E-08        16957628        16.82   1.37E-08        72957238        1.53
    CountVariants   dbsnp   eval    PDX_PT  all     3137161264      1523    0       1523    0.00000049      2059856 1262    0       131     125     5       0       0       1523    1224    0       299     1025    0       3.90E-07        2563040 4.09    8.32E-08        12019774        1.05
    CountVariants   dbsnp   eval    PDX_PT  known   3137161264      1292    0       1292    0.00000041      2428143 1100    0       87      100     5       0       0       1292    1011    0       281     870     0       3.22E-07        3103027 3.6     6.12E-08        16339381        0.87
    CountVariants   dbsnp   eval    PDX_PT  novel   3137161264      231     0       231     0.00000007      13580784        162     0       44      25      0       0       0       231     213     0       18      155     0       6.79E-08        14728456        11.83   2.20E-08        45466105        1.76
    CountVariants   dbsnp   eval    none    all     3137161264      46258   0       46258   0.00001475      67818   42350   0       1845    1930    133     0       0       2074    49599   0       40843   1336    0       1.58E-05        63250   1.21    1.25E-06        802753  0.96
    CountVariants   dbsnp   eval    none    known   3137161264      44729   0       44729   0.00001426      70137   41245   0       1643    1711    130     0       0       1647    47264   0       40547   1031    0       1.51E-05        66375   1.17    1.11E-06        900448  0.96
    CountVariants   dbsnp   eval    none    novel   3137161264      1529    0       1529    0.00000049      2051773 1105    0       202     219     3       0       0       427     2335    0       296     305     0       7.44E-07        1343538 7.89    1.35E-07        7398965 0.92
    
  • Geraldine_VdAuwera Posts: 6,643 Administrator, GATK Developer

    Hi @SCR,

    Thanks, this is fine -- just wanted to check that the table looks sane, which it does if you have multiple samples in your callset. The first set of fields, such as nCalledLoci, are properties that are evaluated per variant site. The next set of fields, including nNoCalls, nHets, etc., are evaluated per sample, since they are genotype properties. So you can have 551 variant calls (nCalledLoci) with 551 no-genotype-calls (nNoCalls) over one or more samples. Since it is a bit odd that you'd have exactly the same number, I'm wondering if one of your samples has all no-calls at the sites you're looking at. You can stratify this table by sample to find out.

    Geraldine Van der Auwera, PhD

  • tinu Posts: 31 Member
    edited August 11

    Hi Geraldine,

    I used the following command

    java -Xmx6G -jar /GenomeAnalysisTK.jar -R /hs37d5.fa -T VariantEval -eval INPUT.vcf -o INPUT.gatkreport --dbsnp dbsnp_137.b37.vcf

    GATKTable:CompOverlap:The overlap between eval and comp sites
    CompOverlap  CompRod  EvalRod  JexlExpression  Novelty  nEvalVariants  novelSites  nVariantsAtComp  compRate  nConcordant  concordantRate
    CompOverlap  dbsnp    eval     none            all      64970          1680        63290            97.41     63201        99.86
    CompOverlap  dbsnp    eval     none            known    63290          0           63290            100       63201        99.86
    CompOverlap  dbsnp    eval     none            novel    1680           1680        0                0         0            0
    

    My VCF has 66932 variants with 63621 SNPs, 3047 INDELs and 264 multiallelic variants. My question is: why is VariantEval reporting just 64970 variants in total?

    Thanks, Tinu

  • Geraldine_VdAuwera Posts: 6,643 Administrator, GATK Developer

    @tinu

    How did you count the number of variants in your file?

    Geraldine Van der Auwera, PhD
