Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

Using VariantEval

delangeldelangel Posts: 71GSA Member mod
edited March 2013 in Methods and Workflows

For a complete, detailed argument reference, refer to the technical documentation page.

Modules

You can find detailed information about the various modules here.

Stratification modules

  • AlleleFrequency
  • AlleleCount
  • CompRod
  • Contig
  • CpG
  • Degeneracy
  • EvalRod
  • Filter
  • FunctionalClass
  • JexlExpression
  • Novelty
  • Sample

Evaluation modules

  • CompOverlap
  • CountVariants

Note that the GenotypeConcordance module has been rewritten as a separate walker tool (see its Technical Documentation page).

A useful analysis using VariantEval

image

We in GSA often find ourselves performing an analysis of 2 different call sets. For SNPs, we often show the overlap of the sets (their "venn") and the relative dbSNP rates and/or transition-transversion ratios. The picture provided is an example of such a slide and is easy to create using VariantEval. Assuming you have 2 filtered VCF callsets named 'foo.vcf' and 'bar.vcf', there are 2 quick steps.

Combine the VCFs

java -jar GenomeAnalysisTK.jar \
    -R ref.fasta \
    -T CombineVariants \
    -V:FOO foo.vcf \
    -V:BAR bar.vcf \
    -priority FOO,BAR \
    -o merged.vcf

Run VariantEval

java -jar GenomeAnalysisTK.jar \
     -T VariantEval \
     -R ref.fasta \
     -D dbsnp.vcf \
     -select 'set=="Intersection"' -selectName Intersection \
     -select 'set=="FOO"' -selectName FOO \
     -select 'set=="FOO-filterInBAR"' -selectName InFOO-FilteredInBAR \
     -select 'set=="BAR"' -selectName BAR \
     -select 'set=="filterInFOO-BAR"' -selectName InBAR-FilteredInFOO \
     -select 'set=="FilteredInAll"' -selectName FilteredInAll \
     -o merged.eval.gatkreport \
     -eval merged.vcf \
     -l INFO

Checking the possible values of 'set'

It is wise to check the actual values for the set names present in your file before writing complex VariantEval commands. An easy way to do this is to extract the value of the set fields and then reduce that to the unique entries, like so:

java -jar GenomeAnalysisTK.jar -T VariantsToTable -R ref.fasta -V merged.vcf -F set -o fields.txt
grep -v 'set' fields.txt | sort | uniq -c

This will provide you with a list of all of the possible values for 'set' in your VCF so that you can be sure to supply the correct select statements to VariantEval.

Reading the VariantEval output file

The VariantEval output is formatted as a GATKReport.

Understanding Genotype Concordance values from Variant Eval

The VariantEval genotype concordance module emits information the relationship between the eval calls and genotypes and the comp calls and genotypes. The following three slides provide some insight into three key metrics to assess call sensitivity and concordance between genotypes.

##:GATKReport.v0.1 GenotypeConcordance.sampleSummaryStats : the concordance statistics summary for each sample
GenotypeConcordance.sampleSummaryStats  CompRod   CpG      EvalRod  JexlExpression  Novelty  percent_comp_ref_called_var  percent_comp_het_called_het  percent_comp_het_called_var  percent_comp_hom_called_hom  percent_comp_hom_called_var  percent_non-reference_sensitivity  percent_overall_genotype_concordance  percent_non-reference_discrepancy_rate
GenotypeConcordance.sampleSummaryStats  compOMNI  all      eval     none            all      0.78                         97.65                        98.39                        99.13                        99.44                        98.80                              99.09                                 3.60

The key outputs:

  • percent_overall_genotype_concordance
  • percent_non_ref_sensitivity_rate
  • percent_non_ref_discrepancy_rate

All defined below.

image

image

image

image

ExampleDiagram.png
720 x 540 - 87K
GenotypeConcordanceGenotypeErrorRate.png
2999 x 2249 - 413K
GenotypeConcordanceTraditionalGenotypeConcordance.png
2999 x 2249 - 485K
GenotypeConcordanceVarSens.png
2999 x 2249 - 381K
GenotypeExample.png
2999 x 2249 - 441K
Post edited by Geraldine_VdAuwera on

Comments

  • chongmchongm Posts: 33Member

    If I am using comp to evaluate genotype concordance with chip data and the ref/alt alleles for my two vcf files are different. Let's say an "A" is called for a single sample but for my seq vcf this is the alternate allele and for my chip vcf this is the ref allele, will this genotype still be in "concordance?"

    Thanks,

    MC

  • chartlchartl Posts: 11GSA Member mod
    edited March 2013

    Hi chongm-

    We have replaced the GenotypeConcordance module of VariantEval with its own walker: GenotypeConcordance. See:

    http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_GenotypeConcordance.html

    I'm not sure what the old behavior of VariantEval was for the situation you describe, but GenotypeConcordance was written under the assumption that the VCFs passed in adhere to the specification, and that the reference allele represents the bases of the reference used to build the VCF; and while the GATK engine no longer automatically checks VCF reference bases against the actual reference, I'm not sure any behavior can be guaranteed if this is not the case.

    For this particular walker the site will be treated as follows:

    • The site itself will be tabulated as ALLELES_DO_NOT_MATCH in the site table

    • The genotypes will nevertheless be evaluated, and count against you

    The reason the genotypes will be evaluated is to allow for the following case:

    EVAL: A T 0/0 1/0

    COMP: A C 0/0 1/0

    Despite the fact that the second site has a different alternate allele, the first sample is still concordant (A/A vs A/A), while the second sample has a mismatching allele. In your case:

    EVAL: A T 0/0 1/0

    COMP: T A 1/1 1/0

    the first sample (whose alleles are "A/A") still matches the alleles in the EVAL context (on the genotype level), even though the alternate alleles do not match, and is therefore not tabulated under the allele mismatch column, but rather "hom var called as hom ref."

    In conclusion, it's best to fix or remove sites that break ValidateVariants.

    Post edited by chartl on
  • chongmchongm Posts: 33Member

    What do the fields "UNAVAILABLE_HOM_REF ... HOM_VAR... _HET" mean? I seem to have a high proportion for both my chip data and my seq data (> 0.95 for these fields). I am thinking that if my chip & seq data do not have a high degree of overlapping sites though the sites they do share seem to be concordant (as calculated by just using varianteval - comp, > 98% concordance), all of these non-shared sites end up in the proportion of sites that are "unavailable"?

    Thanks,

    MC

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    Hi MC,

    Did you use the --ignoreFilters flag?

    As explained in the doc for GenotypeConcordance, if you use it:

    Filters will be ignored. The FILTER field of the eval and comp VCFs will be ignored. If this flag is not included, all FILTER sites will be treated as not being present in the VCF. (That is, the genotypes will be assigned UNAVAILABLE, as distinct from NO_CALL).

    http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_GenotypeConcordance.html

    Geraldine Van der Auwera, PhD

  • chongmchongm Posts: 33Member

    Oh sorry I didn't see that thank you Geraldine!

    Thanks,

    MC

  • CarolCarol Posts: 58Member

    Where could the files cited in GATKwh0-BP-8-variant_analysis.pdf like 1000G_ALL_Sites.vcf.gz, 1000G_62_FIN_Genotypes.bcf, OMNI_MONO_SITES omni_mono.vcf be downloaded?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    These files are not publicly available; they are used in example command lines for purposes of demonstration only.

    Geraldine Van der Auwera, PhD

  • beezonebeezone Posts: 8Member
    edited July 2013

    would you please kindly explain me how can I generate ROC plot from vcf file for comparing variant calling performance.

    Post edited by beezone on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    Sorry @beezone, we don't provide support for that. You'll need to develop your own analysis.

    Geraldine Van der Auwera, PhD

  • beezonebeezone Posts: 8Member

    Thank you @Geraldine, Would you please explain me how can I extract gatk.report from single vcf file and how to convert it into csv format. Is there any tools in GATK to perform these tasks?

  • beezonebeezone Posts: 8Member
    edited July 2013

    I got the GATKReport file (shown below) but I am really surprised to see null value for most cases. I did supply dbsnp file but didn't get any concordatRate.

    GATKReport.v0.2 CompOverlap : The overlap between eval and comp sites
    CompOverlap  CompRod  EvalRod  JexlExpression  Novelty  nEvalVariants  novelSites  nVariantsAtComp  compRate  nConcordant  concordantRate
    CompOverlap  dbsnp    input_0  none            all              98294       98294                0      0.00            0            0.00
    CompOverlap  dbsnp    input_0  none            known            98140       98140                0      0.00            0            0.00
    CompOverlap  dbsnp    input_0  none            novel              154         154                0      0.00            0            0.00
    
    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    Hi @beezone, can you tell me what version you're using and post your command line?

    Geraldine Van der Auwera, PhD

  • beezonebeezone Posts: 8Member

    Hello@ Geraldine, the version that I am using is GenomeAnalysisTK-2.5-2 (latest version from GATK), and the command line was :

    java -jar GenomeAnalysisTK.jar \ -T VariantEval \ -R ref.fasta \ -D dbsnp.vcf \ -select 'set=="Intersection"' -selectName Intersection \ -select 'set=="FOO"' -selectName FOO \ -select 'set=="FOO-filterInBAR"' -selectName InFOO-FilteredInBAR \ -select 'set=="BAR"' -selectName BAR \ -select 'set=="filterInFOO-BAR"' -selectName InBAR-FilteredInFOO \ -select 'set=="FilteredInAll"' -selectName FilteredInAll \ -o merged.eval.gatkreport \ -eval merged.vcf \ -l INFO

    (same as mentioned in this issue)

    later I tried given command line for single vcf file:

    java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantEval -R hg19.fa -o variant_eval_gatkreport --eval input.vcf --comp dbSNP.vcf

    Both of these command lines generate similar result as I had posted ( with null values for concordance rate and others).

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    I see, thanks. Can you maybe have a look at some positions in your callset that are in dbsnp, since it looks like most of your calls are knowns? Perhaps comparing the records will give an indication of why they are not getting counted as overlapping.

    Geraldine Van der Auwera, PhD

  • pradeepnpradeepn MGHPosts: 6Member

    I tried using this to compare my variant callsets from UnifiedGenotyper and HaplotypeCaller. Both show similar metrics when evaluated individually and are highly concordant with dbSNP and 1000G but I'm having difficulty generating the comparison between the two using the protocol after combing them described above. It appears that the expressions for each set (ie "Intersection" , "FilteredInAll", "UG", "HC", "HC-filterInUG", "filterInHC-UG") are not being recognized. I checked the possible values of 'set' as described above and also inspected the vcf file to confirm that there are variants that apply to each expression.

    I tried using various inputs for the expression (examples below): -select 'set=="HC-filterInUG"' -selectName InHC-filterInUG -select 'set="HC-filterInUG"' -selectName InHC-filterInUG -select 'set=HC-filterInUG' -selectName InHC-filterInUG

    But I keep getting the following output: CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants novelSites nVariantsAtComp compRate nConcordant concordantRate CompOverlap dbsnp eval FilteredInAll all 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval FilteredInAll known 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval FilteredInAll novel 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval HC all 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval HC known 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval HC novel 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval InHC-filterInUG all 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval InHC-filterInUG known 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval InHC-filterInUG novel 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval InUG-filterInHC all 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval InUG-filterInHC known 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval InUG-filterInHC novel 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval Intersection all 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval Intersection known 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval Intersection novel 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval UG all 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval UG known 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval UG novel 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval none all 5520427 225607 5294820 95.91 5143352 97.14 CompOverlap dbsnp eval none known 5294820 0 5294820 100.00 5143352 97.14 CompOverlap dbsnp eval none novel 225607 225607 0 0.00 0 0.00

    Any thoughts on how to make this comparison? Thanks in advance!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    Hi @pradeepn, can you please post the command line you used for combining the callsets?

    Geraldine Van der Auwera, PhD

  • pradeepnpradeepn MGHPosts: 6Member

    java -jar ../tools/GATK/GenomeAnalysisTK-nightly-2013-07-08-g05eadc3/GenomeAnalysisTK.jar \ -T CombineVariants \ -R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta \ -V:HC vcf/PCAD_01.HC.hg19.recal.SNPs.indels.vcf \ -V:UG vcf/PCAD_01.UG.hg19.recal.SNPs.indels.vcf \ -priority HC,UG \ -o vcf/PCAD_01.merged.hg19.recal.SNPs.indels.vcf

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    Thanks, that all looks okay -- can you post a few lines from the merged VCF too?

    Geraldine Van der Auwera, PhD

  • pradeepnpradeepn MGHPosts: 6Member

    Here is an example of a few lines from the merged VCF. It seems to have the appropriate "set=" annotation.

    1 726924 . T TGAATGGAATGGAATC 55.50 PASS AC=2;AF=0.125;AN=16;BaseQRankSum=0.486;ClippingRankSum=0.618;DP=296;FS=0.000;MLEAC=2;MLEAF=0.125;MQ=42.78;MQ0=0;MQRankSum=0.302;QD=0.06;ReadPosRankSum=1.067;VQSLOD=3.57;culprit=FS;set=HC GT:AD:DP:GQ:PL 0/0:9,0:9:43:0,43,505 0/0:7,0:7:40:0,40,359 0/1:5,3:8:73:73,0,301 0/0:9,0:9:38:0,38,509 0/0:7,0:7:39:0,39,346 0/0:7,0:7:41:0,41,407 0/0:8,0:8:33:0,33,386 0/1:4,2:6:31:31,0,177

    1 726929 . G GGAATGGAATC 115.54 PASS AC=4;AF=0.250;AN=16;BaseQRankSum=-1.505;ClippingRankSum=-0.517;DP=279;FS=1.736;MLEAC=4;MLEAF=0.250;MQ=42.19;MQ0=0;MQRankSum=-0.701;QD=0.07;ReadPosRankSum=2.332;VQSLOD=2.54;culprit=FS;set=HC GT:AD:DP:GQ:PL 0/0:9,0:9:44:0,44,507 0/1:4,2:6:40:40,0,205 0/0:10,0:10:52:0,52,596 0/1:3,3:6:94:94,0,140 0/1:3,2:5:26:26,0,180 0/1:4,1:5:9:9,0,256 0/0:6,0:6:28:0,28,315 0/0:10,0:10:39:0,39,497

    1 726934 . G GGAATC 569.20 PASS AC=9;AF=0.563;AN=16;ClippingRankSum=-0.737;DP=971;MQ0=0;NEGATIVE_TRAIN_SITE;set=HC-filterInUG GT:AD:DP:GQ:PL 1/1:0,6:6:7:212,7,0 0/1:8,2:10:49:49,0,341 0/1:7,3:10:98:98,0,329 0/1:10,2:12:12:12,0,466 0/1:6,3:9:47:47,0,331 0/1:6,3:9:80:80,0,297 0/1:4,2:6:66:66,0,149 0/1:8,2:10:57:57,0,389

    1 726939 rs4520358 G C 3156.33 VQSRTrancheSNP99.90to100.00 AC=6;AF=0.375;AN=16;BaseQRankSum=-8.091;DB;DP=747;Dels=0.00;FS=2.401;HaplotypeScore=68.7126;MLEAC=6;MLEAF=0.375;MQ=31.14;MQ0=119;MQRankSum=8.041;QD=5.59;ReadPosRankSum=6.997;VQSLOD=-2.923e+01;culprit=HaplotypeScore;set=FilteredInAll GT:AD:DP:GQ:PL 0/1:64,25:87:99:345,0,1098 0/1:67,28:94:99:488,0,1072 0/0:80,7:86:99:0,174,1862 0/1:64,42:102:99:851,0,693

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    @pradeepn, would you mind uploading a snippet of your file to our FTP? I'd like to try to debug this locally. Instructions are here: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

  • lmeleelmelee Los Angeles, CAPosts: 3Member

    I am trying to create a comparison of the uniquely called and overlapping SNPs between two .vcfs (they are labeled "wex" and "wgs"). Specifically, to eventually create a Venn as in the VariantEval main page, I assume I would be taking the "all Intersection" and the "all wex" and "all wgs." However, the GATKReport indicates that the intersection is larger than the wgs variants. How is this possible? Any light you might be able to shed on this would be greatly appreciated.

    CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants novelSites nVariantsAtComp compRate nConcordant concordantRate
    CompOverlap dbsnp eval FilteredInAll all 0 0 0 0 0 0
    CompOverlap dbsnp eval FilteredInAll known 0 0 0 0 0 0
    CompOverlap dbsnp eval FilteredInAll novel 0 0 0 0 0 0
    CompOverlap dbsnp eval Intersection all 93805 8058 85747 91.41 82032 95.67
    CompOverlap dbsnp eval Intersection known 85747 0 85747 100 82032 95.67
    CompOverlap dbsnp eval Intersection novel 8058 8058 0 0 0 0
    CompOverlap dbsnp eval Inwex-FilteredInwgs all 0 0 0 0 0 0
    CompOverlap dbsnp eval Inwex-FilteredInwgs known 0 0 0 0 0 0
    CompOverlap dbsnp eval Inwex-FilteredInwgs novel 0 0 0 0 0 0
    CompOverlap dbsnp eval Inwgs-FilteredInwex all 0 0 0 0 0 0
    CompOverlap dbsnp eval Inwgs-FilteredInwex known 0 0 0 0 0 0
    CompOverlap dbsnp eval Inwgs-FilteredInwex novel 0 0 0 0 0 0
    CompOverlap dbsnp eval none all 493138 65291 427847 86.76 418234 97.75
    CompOverlap dbsnp eval none known 427847 0 427847 100 418234 97.75
    CompOverlap dbsnp eval none novel 65291 65291 0 0 0 0
    CompOverlap dbsnp eval wex all 382044 51559 330485 86.5 326075 98.67
    CompOverlap dbsnp eval wex known 330485 0 330485 100 326075 98.67
    CompOverlap dbsnp eval wex novel 51559 51559 0 0 0 0
    CompOverlap dbsnp eval wgs all 17289 5674 11615 67.18 10127 87.19
    CompOverlap dbsnp eval wgs known 11615 0 11615 100 10127 87.19
    CompOverlap dbsnp eval wgs novel 5674 5674 0 0 0 0

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    The numbers given for "wgs" and for "wex" are the numbers of sites unique to the callsets, respectively, not the total number of variants in the callsets. So this just means that most of your wgs variants are also in your wex callset. I'm a little surprised you would have more unique variants in wex than in wgs, but assuming this is before filtering, it doesn't necessarily mean much.

    Geraldine Van der Auwera, PhD

  • lmeleelmelee Los Angeles, CAPosts: 3Member

    Thanks for the clarification. Could you tell me the difference between the nEvalVariants in CompOverlap and the nSNPs in CountVariants? Since there's also an "Intersection" in CountVariants, I'm not sure which to use to depict unique and overlapping SNP numbers.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,276Administrator, GSA Member admin

    Can you post the table you get from CountVariants, just so I'm sure I'm commenting on the right thing?

    Geraldine Van der Auwera, PhD

  • lmeleelmelee Los Angeles, CAPosts: 3Member

    Please see attached.

    xlsx
    xlsx
    for GATK.xlsx
    54K
Sign In or Register to comment.