
VariantEval Evaluation Modules Glossary

KateNKateN Cambridge, MAMember, Broadie, Moderator
edited September 2017 in Methods and Algorithms

Table of Contents

Default modules:

  • CompOverlap: gives concordance metrics based on the overlap between the evaluation and comparison file
  • CountVariants: counts different types (SNP, insertion, complex, etc.) of variants present within your evaluation file and gives related metrics
  • IndelLengthHistogram: gives a table of values for plotting a histogram of indel lengths found in your evaluated variants.
  • IndelSummary: gives metrics related to insertions and deletions (count, multiallelic sites, het-hom ratios, etc.)
  • MultiallelicSummary: gives metrics relevant to multiallelic variant sites, including counts, ratios, and TiTv
  • TiTvVariantEvaluator: gives the number and ratio of transition and transversion variants for your evaluation file, comparison file, and ancestral alleles
  • ValidationReport: details the sensitivity and specificity of your callset, given follow-up validation assay data
  • VariantSummary: gives a summary of metrics related to SNPs and indels

Other available modules:

  • MendelianViolationEvaluator: detects and counts Mendelian violations, given data from parent samples.
  • PrintMissingComp: returns the number of variant sites present in your callset that were not found in the truth set.
  • ThetaVariantEvaluator: computes different estimates of theta based on variant sites and genotypes
  • MetricsCollection: includes all minimum metrics discussed in this article (link to follow; document in progress). Runs by default if CompOverlap, IndelSummary, TiTvVariantEvaluator, CountVariants, and MultiallelicSummary are also run. (Included in the nightly build for immediate use, or in the next release of GATK.)
    * At the time of writing, the listed modules were present. To check which modules are present in your specific GATK version, use the -list command.

General

Each table has a few columns that are the same across multiple evaluation modules. To avoid listing them repeatedly, they are described once here.

Example Output *
image

  • CompOverlap- In the example above, the first column is CompOverlap. The first column is always the name of the evaluation module you are viewing: IndelSummary will say "IndelSummary", CountVariants will say "CountVariants", and so on.
  • CompRod- shows which file is being compared to the eval for that row.
    By default, this is dbsnp, but you can specify additional comparison files using -comp and name them with the : notation, e.g. -comp:name \path\to\file.vcf, where name is the label you want to appear in the CompRod column and \path\to\file.vcf is your comparison file. If left unnamed, these additional comparison files default to "comp" in the CompRod column.

  • EvalRod- shows which file is being evaluated.
    This is useful when specifying multiple eval files. They can be named using the : notation as above. When unnamed, they will default to "eval" in the EvalRod column.

  • JexlExpression- a Jexl query that was applied to the file. For details, see the GATK documentation on Jexl expressions.

  • Novelty- has three possible values: all, known, and novel. "Novel" includes anything seen in the eval but not in the comp. "Known" includes anything seen in both the eval and the comp. "All" is the sum of "Novel" and "Known".
    By default, the comp used to determine novelty is dbsnp. To change this, pass -knownName with the name of the comparison file you have supplied.
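
To make the three Novelty values concrete, here is a minimal Python sketch of the bookkeeping. It assumes, purely for illustration, that a variant is identified by its (chromosome, position) pair; the function name and the example sites are hypothetical, and VariantEval's own matching logic may differ.

```python
def classify_novelty(eval_sites, comp_sites):
    """Count eval sites that are known (also in comp) or novel (eval only).

    Sites are (chromosome, position) tuples; matching on position alone
    is an assumption of this sketch, not a statement about VariantEval.
    """
    eval_set = set(eval_sites)
    comp_set = set(comp_sites)
    known = eval_set & comp_set   # seen in both eval and comp
    novel = eval_set - comp_set   # seen in eval only
    return {"all": len(eval_set), "known": len(known), "novel": len(novel)}


# Hypothetical example data
eval_sites = [("1", 100), ("1", 250), ("2", 40)]
comp_sites = [("1", 100), ("2", 40), ("3", 7)]
counts = classify_novelty(eval_sites, comp_sites)
print(counts)
```

Note that the comp-only site ("3", 7) does not contribute to any row: novelty is always described from the point of view of the eval file.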

*Output from a rare variant association study with >1500 whole genome sequenced samples


CompOverlap

Example Output *
image

  • nEvalVariants- the number of variants in the eval file
  • novelSites- the number of variants in the eval considered to be novel in comparison to dbsnp (same as novel row of nEvalVariants column)
  • nVariantsAtComp- the number of variants present in eval that match the location of a variant in the comparison file (same as known row of nEvalVariants)
  • compRate- nVariantsAtComp divided by nEvalVariants
  • nConcordant- the number of variants present in eval that exactly match the genotype present in the comparison file
  • concordantRate- nConcordant divided by nVariantsAtComp
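
The two rates above are simple ratios of the count columns. The following Python sketch shows the arithmetic with made-up counts; the function name is hypothetical, and the report itself may present these values on a different scale (for example as percentages), so treat the plain fractions here as an assumption.

```python
def comp_overlap_rates(n_eval_variants, n_variants_at_comp, n_concordant):
    """Return (compRate, concordantRate) as plain fractions.

    Returns NaN for a rate whose denominator is zero (a choice made
    for this sketch).
    """
    nan = float("nan")
    comp_rate = n_variants_at_comp / n_eval_variants if n_eval_variants else nan
    concordant_rate = n_concordant / n_variants_at_comp if n_variants_at_comp else nan
    return comp_rate, concordant_rate


# Hypothetical counts: 1000 eval variants, 800 at comp sites, 760 concordant
comp_rate, concordant_rate = comp_overlap_rates(1000, 800, 760)
print(comp_rate, concordant_rate)
```

Note that concordantRate is relative to nVariantsAtComp, not to nEvalVariants: it only describes the variants that overlap the comparison file in the first place.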



CountVariants

Example Output *
image

  • nProcessedLoci- the number of loci iterated over in the reference file (also found in MultiallelicSummary)
  • nCalledLoci- the number of loci called in the eval file
  • nRefLoci- the number of loci in eval that matched the reference file
  • nVariantLoci- the number of loci in eval that did not match the reference file
  • variantRate- nVariantLoci divided by nProcessedLoci
  • variantRatePerBp- nProcessedLoci divided by nVariantLoci (a truncated integer)
  • nSNPs- the number of variants determined to be single-nucleotide polymorphisms
  • nMNPs- the number of variants determined to be multi-nucleotide polymorphisms
  • nInsertions- the number of variants determined to be insertions
  • nDeletions- the number of variants determined to be deletions
  • nComplex- the number of variants determined to be complex (both insertions and deletions)
  • nSymbolic- the number of variants determined to be symbolic
  • nMixed- the number of variants determined to be mixed (cannot be determined to be SNPs, MNPs, or indels)
  • nNoCalls- the number of sites at which there was no variant call made
  • nHets- the number of heterozygous loci
  • nHomRef- the number of homozygous reference loci
  • nHomVar- the number of homozygous variant loci
  • nSingletons- the number of variants determined to be singletons (occur only once)
  • nHomDerived- the number of homozygous derived variants; an ancestor had a variant at that site, but the descendant in question no longer has a variant at that site and is now homozygous reference.
  • heterozygosity- nHets divided by nProcessedLoci
  • heterozygosityPerBp- nProcessedLoci divided by nHets (a truncated integer)
  • hetHomRatio- nHets divided by nHomVar
  • indelRate- nInsertions plus nDeletions plus nComplex all divided by nProcessedLoci
  • indelRatePerBp- nProcessedLoci divided by the sum of nInsertions, nDeletions, and nComplex (a truncated integer)
  • insertionDeletionRatio- nInsertions divided by nDeletions
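
The derived CountVariants columns can be recomputed from the raw counts. The Python sketch below follows the definitions listed above, including the truncated-integer "per bp" values; the function names, the handling of zero denominators, and the example counts are assumptions made for illustration only.

```python
def ratio(numerator, denominator):
    """Plain division; NaN when the denominator is zero (sketch's choice)."""
    return numerator / denominator if denominator else float("nan")


def per_bp(n_processed_loci, count):
    """'One event per N bp' as a truncated integer; 0 when count is zero."""
    return n_processed_loci // count if count else 0


def count_variants_metrics(c):
    """Derive the ratio columns from a dict of raw CountVariants counts."""
    loci = c["nProcessedLoci"]
    n_indels = c["nInsertions"] + c["nDeletions"] + c["nComplex"]
    return {
        "variantRate": ratio(c["nVariantLoci"], loci),
        "variantRatePerBp": per_bp(loci, c["nVariantLoci"]),
        "heterozygosity": ratio(c["nHets"], loci),
        "heterozygosityPerBp": per_bp(loci, c["nHets"]),
        "hetHomRatio": ratio(c["nHets"], c["nHomVar"]),
        "indelRate": ratio(n_indels, loci),
        "indelRatePerBp": per_bp(loci, n_indels),
        "insertionDeletionRatio": ratio(c["nInsertions"], c["nDeletions"]),
    }


# Hypothetical raw counts
counts = {
    "nProcessedLoci": 1_000_000,
    "nVariantLoci": 1500,
    "nHets": 900,
    "nHomVar": 600,
    "nInsertions": 60,
    "nDeletions": 80,
    "nComplex": 10,
}
metrics = count_variants_metrics(counts)
for name, value in metrics.items():
    print(name, value)
```

The "rate" and "PerBp" columns carry the same information in reciprocal form: with these counts, a variantRate of 0.0015 corresponds to roughly one variant every 666 bp.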



IndelSummary

Example Output *
image

  • n_SNPs- the number of SNPs (multiallelic SNPs are counted once for each allele)
  • n_singleton_SNPs- the number of SNP singleton loci (SNPs seen only once)
  • n_indels- the number of indels (multiallelic indels are counted once for each allele)
  • n_singleton_indels- the number of indel singleton loci (indels seen only once)
  • n_indels_matching_gold_standard- the number of indel loci that match indels in the gold standard (must pass in a -gold parameter)
  • gold_standard_matching_rate- n_indels_matching_gold_standard divided by n_indels
  • n_multiallelic_indel_sites- the number of indel sites that are multiallelic
  • percent_of_sites_with_more_than_2_alleles- n_multiallelic_indel_sites divided by the total number of indel sites
  • SNP_to_indel_ratio- n_SNPs divided by n_indels
  • SNP_to_indel_ratio_for_singletons- n_singleton_SNPs divided by n_singleton_indels
  • n_novel_indels- number of indels considered to be novel in comparison to dbsnp (the novel row of the n_indels column gives the same information)
  • indel_novelty_rate- n_novel_indels divided by n_indels
  • n_insertions- the number of insertion variants
  • n_deletions- the number of deletion variants
  • insertion_to_deletion_ratio- n_insertions divided by n_deletions
  • n_large_deletions- number of deletions with a length greater than 10
  • n_large_insertions- number of insertions with a length greater than 10
  • insertion_to_deletion_ratio_for_large_indels- n_large_insertions divided by n_large_deletions
  • n_coding_indels_frameshifting- the number of indels within the coding regions of the genome which cause a frameshift
  • n_coding_indels_in_frame- the number of indels within the coding regions of the genome which do not cause a frameshift
  • frameshift_rate_for_coding_indels- n_coding_indels_frameshifting divided by the sum of n_coding_indels_frameshifting and n_coding_indels_in_frame
  • SNP_het_to_hom_ratio- the number of heterozygous SNPs divided by the number of homozygous variant SNPs
  • indel_het_to_hom_ratio- the number of heterozygous indels divided by the number of homozygous variant indels
  • ratio_of_1_and_2_to_3_bp_insertions- the sum of one and two base pair insertions divided by three base pair insertions
  • ratio_of_1_and_2_to_3_bp_deletions- the sum of one and two base pair deletions divided by three base pair deletions
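
Several of the IndelSummary columns depend only on indel length. As a rough Python sketch, assume each indel is represented by a signed length (positive for insertions, negative for deletions) and that every indel in the list lies in a coding region; both the representation and the function name are hypothetical. An indel is treated as frameshifting when its length is not a multiple of three.

```python
def indel_length_metrics(lengths, large_threshold=10):
    """Length-based indel metrics from signed lengths (+ins, -del).

    Assumes all indels are coding, so frameshift status depends only on
    whether the length is divisible by 3. Ratios with a zero denominator
    are returned as NaN (a choice made for this sketch).
    """
    def ratio(a, b):
        return a / b if b else float("nan")

    insertions = [n for n in lengths if n > 0]
    deletions = [-n for n in lengths if n < 0]
    frameshifting = sum(1 for n in insertions + deletions if n % 3 != 0)
    in_frame = len(insertions) + len(deletions) - frameshifting

    def short_to_3bp(sizes):
        # (1 bp + 2 bp events) divided by 3 bp events
        return ratio(sum(1 for n in sizes if n in (1, 2)),
                     sum(1 for n in sizes if n == 3))

    return {
        "n_insertions": len(insertions),
        "n_deletions": len(deletions),
        "insertion_to_deletion_ratio": ratio(len(insertions), len(deletions)),
        "n_large_insertions": sum(1 for n in insertions if n > large_threshold),
        "n_large_deletions": sum(1 for n in deletions if n > large_threshold),
        "n_coding_indels_frameshifting": frameshifting,
        "n_coding_indels_in_frame": in_frame,
        "frameshift_rate_for_coding_indels": ratio(frameshifting, frameshifting + in_frame),
        "ratio_of_1_and_2_to_3_bp_insertions": short_to_3bp(insertions),
        "ratio_of_1_and_2_to_3_bp_deletions": short_to_3bp(deletions),
    }


# Hypothetical signed indel lengths
lengths = [1, 2, 3, 12, -1, -3, -3, -15, 1, -2]
summary = indel_length_metrics(lengths)
for name, value in summary.items():
    print(name, value)
```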



TiTvVariantEvaluator

Example Output *
image

  • nTi- number of transition variants in eval (A↔G or T↔C)
  • nTv- number of transversion variants in eval (A↔T or G↔C or A↔C or G↔T)
  • tiTvRatio- nTi divided by nTv
  • nTiInComp- number of transition variants present in the comparison file
  • nTvInComp- number of transversion variants present in the comparison file
  • TiTvRatioStandard- nTiInComp divided by nTvInComp
  • nTiDerived- number of transition variants derived from ancestral alleles
  • nTvDerived- number of transversion variants derived from ancestral alleles
  • tiTvDerivedRatio- nTiDerived divided by nTvDerived
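
The transition/transversion split above follows from base chemistry: a substitution between two purines (A, G) or between two pyrimidines (C, T) is a transition, and anything else is a transversion. A minimal Python sketch of that classification, with hypothetical function names and made-up SNPs, could look like this:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def is_transition(ref, alt):
    """True for A<->G or C<->T, False for any other single-base change."""
    ref, alt = ref.upper(), alt.upper()
    bases = PURINES | PYRIMIDINES
    if ref not in bases or alt not in bases or ref == alt:
        raise ValueError(f"not a single-base substitution: {ref}>{alt}")
    pair = {ref, alt}
    return pair <= PURINES or pair <= PYRIMIDINES


def ti_tv(snps):
    """Return (nTi, nTv, tiTvRatio) for a list of (ref, alt) pairs."""
    n_ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    n_tv = len(snps) - n_ti
    ratio = n_ti / n_tv if n_tv else float("nan")
    return n_ti, n_tv, ratio


# Hypothetical SNPs: four transitions and two transversions
snps = [("A", "G"), ("C", "T"), ("T", "C"), ("G", "A"), ("A", "C"), ("G", "T")]
n_ti, n_tv, ratio = ti_tv(snps)
print(n_ti, n_tv, ratio)
```

The same function applies to all three ratios in this module; only the set of variants fed in changes (eval, comparison file, or derived alleles).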



MultiallelicSummary

Example Output *
image

  • nProcessedLoci- number of loci iterated over in the reference file (also found in CountVariants)
  • nSNPs- number of SNPs (multiallelic SNPs are only counted once overall)
  • nMultiSNPs- number of multiallelic SNPs (again, only counted once per locus)
  • processedMultiSnpRatio- nMultiSNPs divided by nProcessedLoci
  • variantMultiSnpRatio- nMultiSNPs divided by nSNPs
  • nIndels- number of indels (multiallelic indels are only counted once overall)
  • nMultiIndels- number of multiallelic indels (again, only counted once per locus)
  • processedMultiIndelRatio- nMultiIndels divided by nProcessedLoci
  • variantMultiIndelRatio- nMultiIndels divided by nIndels
  • nTi- number of transition variants at multiallelic sites
  • nTv- number of transversion variants at multiallelic sites
  • TiTvRatio- nTi divided by nTv
  • knownSNPsPartial- the number of loci at which at least one allele in eval was found in the known comparison file (applies only to multiallelic sites)
  • knownSNPsComplete- the number of loci at which all alleles in eval were also found in the known comparison file (applies only to multiallelic sites)
  • SNPNoveltyRate- the fraction of multiallelic SNP sites that are novel: nMultiSNPs minus the sum of knownSNPsPartial and knownSNPsComplete, divided by nMultiSNPs
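
The per-locus counting used in this module differs from the per-allele counting described for IndelSummary's n_SNPs. The Python sketch below contrasts the two on a small, made-up set of SNP sites; the data layout (a dict from site to its list of alternate alleles) and the function name are assumptions for illustration.

```python
def multiallelic_snp_counts(sites, n_processed_loci):
    """Per-locus SNP counts and ratios from {site: [alt alleles]}.

    Each site is counted once regardless of how many alternate alleles
    it carries; a site with more than one alternate allele is
    multiallelic.
    """
    n_snps = len(sites)
    n_multi = sum(1 for alts in sites.values() if len(alts) > 1)
    return {
        "nSNPs": n_snps,
        "nMultiSNPs": n_multi,
        "processedMultiSnpRatio": n_multi / n_processed_loci,
        "variantMultiSnpRatio": n_multi / n_snps if n_snps else float("nan"),
        # For contrast only: counting once per alternate allele instead
        "per_allele_count": sum(len(alts) for alts in sites.values()),
    }


# Hypothetical SNP sites keyed by (chromosome, position)
sites = {
    ("1", 100): ["G"],
    ("1", 200): ["C", "T"],
    ("2", 50): ["A"],
    ("2", 75): ["G", "T"],
}
result = multiallelic_snp_counts(sites, n_processed_loci=1000)
print(result)
```

With these four sites, the per-locus count is 4 while the per-allele count is 6, which is why SNP totals can differ between MultiallelicSummary and IndelSummary for the same callset.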


Post edited by Geraldine_VdAuwera on
