Service note: Geraldine is on vacation this week; other members of GSA will be responding to questions, but they have a lot of work besides this, so be aware that responses may be a little slower than usual. Thank you for your patience.

Using VariantEval

delangeldelangel Posts: 61GSA Official Member mod

For a complete, detailed argument reference, refer to the technical documentation page.

Modules

You can find detailed information about the various modules here.

Stratification modules

  • AlleleFrequency
  • AlleleCount
  • CompRod
  • Contig
  • CpG
  • Degeneracy
  • EvalRod
  • Filter
  • FunctionalClass
  • JexlExpression
  • Novelty
  • Sample

Evaluation modules

  • CompOverlap
  • CountVariants

Note that the GenotypeConcordance module has been rewritten as a separate walker tool (see its Technical Documentation page).

A useful analysis using VariantEval

image

We in GSA often find ourselves performing an analysis of 2 different call sets. For SNPs, we often show the overlap of the sets (their "venn") and the relative dbSNP rates and/or transition-transversion ratios. The picture provided is an example of such a slide and is easy to create using VariantEval. Assuming you have 2 filtered VCF callsets named 'foo.vcf' and 'bar.vcf', there are 2 quick steps.

Combine the VCFs

java -jar GenomeAnalysisTK.jar \
    -R ref.fasta \
    -T CombineVariants \
    -V:FOO foo.vcf \
    -V:BAR bar.vcf \
    -priority FOO,BAR \
    -o merged.vcf

Run VariantEval

java -jar GenomeAnalysisTK.jar \
     -T VariantEval \
     -R ref.fasta \
     -D dbsnp.vcf \
     -select 'set=="Intersection"' -selectName Intersection \
     -select 'set=="FOO"' -selectName FOO \
     -select 'set=="FOO-filterInBAR"' -selectName InFOO-FilteredInBAR \
     -select 'set=="BAR"' -selectName BAR \
     -select 'set=="filterInFOO-BAR"' -selectName InBAR-FilteredInFOO \
     -select 'set=="FilteredInAll"' -selectName FilteredInAll \
     -o merged.eval.gatkreport \
     -eval merged.vcf \
     -l INFO

Checking the possible values of 'set'

It is wise to check the actual values for the set names present in your file before writing complex VariantEval commands. An easy way to do this is to extract the value of the set fields and then reduce that to the unique entries, like so:

java -jar GenomeAnalysisTK.jar -T VariantsToTable -R ref.fasta -V merged.vcf -F set -o fields.txt
grep -v 'set' fields.txt | sort | uniq -c

This will provide you with a list of all of the possible values for 'set' in your VCF so that you can be sure to supply the correct select statements to VariantEval.

Reading the VariantEval output file

The VariantEval output is formatted as a GATKReport.

Understanding Genotype Concordance values from Variant Eval

The VariantEval genotype concordance module emits information the relationship between the eval calls and genotypes and the comp calls and genotypes. The following three slides provide some insight into three key metrics to assess call sensitivity and concordance between genotypes.

##:GATKReport.v0.1 GenotypeConcordance.sampleSummaryStats : the concordance statistics summary for each sample
GenotypeConcordance.sampleSummaryStats  CompRod   CpG      EvalRod  JexlExpression  Novelty  percent_comp_ref_called_var  percent_comp_het_called_het  percent_comp_het_called_var  percent_comp_hom_called_hom  percent_comp_hom_called_var  percent_non-reference_sensitivity  percent_overall_genotype_concordance  percent_non-reference_discrepancy_rate
GenotypeConcordance.sampleSummaryStats  compOMNI  all      eval     none            all      0.78                         97.65                        98.39                        99.13                        99.44                        98.80                              99.09                                 3.60

The key outputs:

  • percent_overall_genotype_concordance
  • percent_non_ref_sensitivity_rate
  • percent_non_ref_discrepancy_rate

All defined below.

image

image

image

image

ExampleDiagram.png
720 x 540 - 87K
GenotypeConcordanceGenotypeErrorRate.png
2999 x 2249 - 413K
GenotypeConcordanceTraditionalGenotypeConcordance.png
2999 x 2249 - 485K
GenotypeConcordanceVarSens.png
2999 x 2249 - 381K
GenotypeExample.png
2999 x 2249 - 441K
Post edited by Geraldine_VdAuwera on

Comments

  • chongmchongm Posts: 23Member

    If I am using comp to evaluate genotype concordance with chip data and the ref/alt alleles for my two vcf files are different. Let's say an "A" is called for a single sample but for my seq vcf this is the alternate allele and for my chip vcf this is the ref allele, will this genotype still be in "concordance?"

    Thanks,

    MC

  • chartlchartl Posts: 8GSA Official Member mod

    Hi chongm-

    We have replaced the GenotypeConcordance module of VariantEval with its own walker: GenotypeConcordance. See:

    http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_GenotypeConcordance.html

    I'm not sure what the old behavior of VariantEval was for the situation you describe, but GenotypeConcordance was written under the assumption that the VCFs passed in adhere to the specification, and that the reference allele represents the bases of the reference used to build the VCF; and while the GATK engine no longer automatically checks VCF reference bases against the actual reference, I'm not sure any behavior can be guaranteed if this is not the case.

    For this particular walker the site will be treated as follows:

    • The site itself will be tabulated as ALLELES_DO_NOT_MATCH in the site table

    • The genotypes will nevertheless be evaluated, and count against you

    The reason the genotypes will be evaluated is to allow for the following case:

    EVAL: A T 0/0 1/0

    COMP: A C 0/0 1/0

    Despite the fact that the second site has a different alternate allele, the first sample is still concordant (A/A vs A/A), while the second sample has a mismatching allele. In your case:

    EVAL: A T 0/0 1/0

    COMP: T A 1/1 1/0

    the first sample (whose alleles are "A/A") still matches the alleles in the EVAL context (on the genotype level), even though the alternate alleles do not match, and is therefore not tabulated under the allele mismatch column, but rather "hom var called as hom ref."

    In conclusion, it's best to fix or remove sites that break ValidateVariants.

    Post edited by chartl on
    chongm
  • chongmchongm Posts: 23Member

    What do the fields "UNAVAILABLE_HOM_REF ... HOM_VAR... _HET" mean? I seem to have a high proportion for both my chip data and my seq data (> 0.95 for these fields). I am thinking that if my chip & seq data do not have a high degree of overlapping sites though the sites they do share seem to be concordant (as calculated by just using varianteval - comp, > 98% concordance), all of these non-shared sites end up in the proportion of sites that are "unavailable"?

    Thanks,

    MC

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 2,239Administrator, GSA Official Member admin

    Hi MC,

    Did you use the --ignoreFilters flag?

    As explained in the doc for GenotypeConcordance, if you use it:

    Filters will be ignored. The FILTER field of the eval and comp VCFs will be ignored. If this flag is not included, all FILTER sites will be treated as not being present in the VCF. (That is, the genotypes will be assigned UNAVAILABLE, as distinct from NO_CALL).

    http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_GenotypeConcordance.html

    Geraldine Van der Auwera, PhD

  • chongmchongm Posts: 23Member

    Oh sorry I didn't see that thank you Geraldine!

    Thanks,

    MC

Sign In or Register to comment.