The current GATK version is 3.2-2

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Bug Bulletin: The GenomeLocPArser error in SplitNCigarReads has been fixed; if you encounter it, use the latest nightly build.

# Using VariantEval

Posts: 71GATK Developer mod
edited March 2013

For a complete, detailed argument reference, refer to the technical documentation page.

## Modules

You can find detailed information about the various modules here.

### Stratification modules

• AlleleFrequency
• AlleleCount
• CompRod
• Contig
• CpG
• Degeneracy
• EvalRod
• Filter
• FunctionalClass
• JexlExpression
• Novelty
• Sample

### Evaluation modules

• CompOverlap
• CountVariants

Note that the GenotypeConcordance module has been rewritten as a separate walker tool (see its Technical Documentation page).

## A useful analysis using VariantEval

We in GSA often find ourselves performing an analysis of 2 different call sets. For SNPs, we often show the overlap of the sets (their "venn") and the relative dbSNP rates and/or transition-transversion ratios. The picture provided is an example of such a slide and is easy to create using VariantEval. Assuming you have 2 filtered VCF callsets named 'foo.vcf' and 'bar.vcf', there are 2 quick steps.

#### Combine the VCFs

java -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T CombineVariants \
-V:FOO foo.vcf \
-V:BAR bar.vcf \
-priority FOO,BAR \
-o merged.vcf


#### Run VariantEval

java -jar GenomeAnalysisTK.jar \
-T VariantEval \
-R ref.fasta \
-D dbsnp.vcf \
-select 'set=="Intersection"' -selectName Intersection \
-select 'set=="FOO"' -selectName FOO \
-select 'set=="FOO-filterInBAR"' -selectName InFOO-FilteredInBAR \
-select 'set=="BAR"' -selectName BAR \
-select 'set=="filterInFOO-BAR"' -selectName InBAR-FilteredInFOO \
-select 'set=="FilteredInAll"' -selectName FilteredInAll \
-o merged.eval.gatkreport \
-eval merged.vcf \
-l INFO


### Checking the possible values of 'set'

It is wise to check the actual values for the set names present in your file before writing complex VariantEval commands. An easy way to do this is to extract the value of the set fields and then reduce that to the unique entries, like so:

java -jar GenomeAnalysisTK.jar -T VariantsToTable -R ref.fasta -V merged.vcf -F set -o fields.txt
grep -v 'set' fields.txt | sort | uniq -c


This will provide you with a list of all of the possible values for 'set' in your VCF so that you can be sure to supply the correct select statements to VariantEval.

## Reading the VariantEval output file

The VariantEval output is formatted as a GATKReport.

### Understanding Genotype Concordance values from Variant Eval

The VariantEval genotype concordance module emits information the relationship between the eval calls and genotypes and the comp calls and genotypes. The following three slides provide some insight into three key metrics to assess call sensitivity and concordance between genotypes.

##:GATKReport.v0.1 GenotypeConcordance.sampleSummaryStats&#160;: the concordance statistics summary for each sample
GenotypeConcordance.sampleSummaryStats  CompRod   CpG      EvalRod  JexlExpression  Novelty  percent_comp_ref_called_var  percent_comp_het_called_het  percent_comp_het_called_var  percent_comp_hom_called_hom  percent_comp_hom_called_var  percent_non-reference_sensitivity  percent_overall_genotype_concordance  percent_non-reference_discrepancy_rate
GenotypeConcordance.sampleSummaryStats  compOMNI  all      eval     none            all      0.78                         97.65                        98.39                        99.13                        99.44                        98.80                              99.09                                 3.60


The key outputs:

• percent_overall_genotype_concordance
• percent_non_ref_sensitivity_rate
• percent_non_ref_discrepancy_rate

All defined below.

Post edited by Geraldine_VdAuwera on
Tagged:

• Posts: 33Member

If I am using comp to evaluate genotype concordance with chip data and the ref/alt alleles for my two vcf files are different. Let's say an "A" is called for a single sample but for my seq vcf this is the alternate allele and for my chip vcf this is the ref allele, will this genotype still be in "concordance?"

Thanks,

MC

• Posts: 11GATK Developer mod
edited March 2013

Hi chongm-

We have replaced the GenotypeConcordance module of VariantEval with its own walker: GenotypeConcordance. See:

I'm not sure what the old behavior of VariantEval was for the situation you describe, but GenotypeConcordance was written under the assumption that the VCFs passed in adhere to the specification, and that the reference allele represents the bases of the reference used to build the VCF; and while the GATK engine no longer automatically checks VCF reference bases against the actual reference, I'm not sure any behavior can be guaranteed if this is not the case.

For this particular walker the site will be treated as follows:

• The site itself will be tabulated as ALLELES_DO_NOT_MATCH in the site table

• The genotypes will nevertheless be evaluated, and count against you

The reason the genotypes will be evaluated is to allow for the following case:

EVAL: A T 0/0 1/0

COMP: A C 0/0 1/0

Despite the fact that the second site has a different alternate allele, the first sample is still concordant (A/A vs A/A), while the second sample has a mismatching allele. In your case:

EVAL: A T 0/0 1/0

COMP: T A 1/1 1/0

the first sample (whose alleles are "A/A") still matches the alleles in the EVAL context (on the genotype level), even though the alternate alleles do not match, and is therefore not tabulated under the allele mismatch column, but rather "hom var called as hom ref."

In conclusion, it's best to fix or remove sites that break ValidateVariants.

Post edited by chartl on
• Posts: 33Member

What do the fields "UNAVAILABLE_HOM_REF ... HOM_VAR... _HET" mean? I seem to have a high proportion for both my chip data and my seq data (> 0.95 for these fields). I am thinking that if my chip & seq data do not have a high degree of overlapping sites though the sites they do share seem to be concordant (as calculated by just using varianteval - comp, > 98% concordance), all of these non-shared sites end up in the proportion of sites that are "unavailable"?

Thanks,

MC

Thanks,

MC

Hi MC,

Did you use the --ignoreFilters flag?

As explained in the doc for GenotypeConcordance, if you use it:

Filters will be ignored. The FILTER field of the eval and comp VCFs will be ignored. If this flag is not included, all FILTER sites will be treated as not being present in the VCF. (That is, the genotypes will be assigned UNAVAILABLE, as distinct from NO_CALL).

Geraldine Van der Auwera, PhD

• Posts: 33Member

Oh sorry I didn't see that thank you Geraldine!

Thanks,

MC

• Posts: 43Member

Where could the files cited in GATKwh0-BP-8-variant_analysis.pdf like 1000G_ALL_Sites.vcf.gz, 1000G_62_FIN_Genotypes.bcf, OMNI_MONO_SITES omni_mono.vcf be downloaded?

These files are not publicly available; they are used in example command lines for purposes of demonstration only.

Geraldine Van der Auwera, PhD

• Posts: 8Member
edited July 2013

would you please kindly explain me how can I generate ROC plot from vcf file for comparing variant calling performance.

Post edited by beezone on

Sorry @beezone, we don't provide support for that. You'll need to develop your own analysis.

Geraldine Van der Auwera, PhD

• Posts: 8Member

Thank you @Geraldine, Would you please explain me how can I extract gatk.report from single vcf file and how to convert it into csv format. Is there any tools in GATK to perform these tasks?

• Posts: 8Member
edited July 2013

I got the GATKReport file (shown below) but I am really surprised to see null value for most cases. I did supply dbsnp file but didn't get any concordatRate.

GATKReport.v0.2 CompOverlap : The overlap between eval and comp sites
CompOverlap  CompRod  EvalRod  JexlExpression  Novelty  nEvalVariants  novelSites  nVariantsAtComp  compRate  nConcordant  concordantRate
CompOverlap  dbsnp    input_0  none            all              98294       98294                0      0.00            0            0.00
CompOverlap  dbsnp    input_0  none            known            98140       98140                0      0.00            0            0.00
CompOverlap  dbsnp    input_0  none            novel              154         154                0      0.00            0            0.00

Post edited by Geraldine_VdAuwera on

Hi @beezone, can you tell me what version you're using and post your command line?

Geraldine Van der Auwera, PhD

• Posts: 8Member

Hello@ Geraldine, the version that I am using is GenomeAnalysisTK-2.5-2 (latest version from GATK), and the command line was :

java -jar GenomeAnalysisTK.jar \ -T VariantEval \ -R ref.fasta \ -D dbsnp.vcf \ -select 'set=="Intersection"' -selectName Intersection \ -select 'set=="FOO"' -selectName FOO \ -select 'set=="FOO-filterInBAR"' -selectName InFOO-FilteredInBAR \ -select 'set=="BAR"' -selectName BAR \ -select 'set=="filterInFOO-BAR"' -selectName InBAR-FilteredInFOO \ -select 'set=="FilteredInAll"' -selectName FilteredInAll \ -o merged.eval.gatkreport \ -eval merged.vcf \ -l INFO

(same as mentioned in this issue)

later I tried given command line for single vcf file:

java -Xmx4g -jar GenomeAnalysisTK.jar -T VariantEval -R hg19.fa -o variant_eval_gatkreport --eval input.vcf --comp dbSNP.vcf

Both of these command lines generate similar result as I had posted ( with null values for concordance rate and others).

I see, thanks. Can you maybe have a look at some positions in your callset that are in dbsnp, since it looks like most of your calls are knowns? Perhaps comparing the records will give an indication of why they are not getting counted as overlapping.

Geraldine Van der Auwera, PhD

• MGHPosts: 6Member

I tried using this to compare my variant callsets from UnifiedGenotyper and HaplotypeCaller. Both show similar metrics when evaluated individually and are highly concordant with dbSNP and 1000G but I'm having difficulty generating the comparison between the two using the protocol after combing them described above. It appears that the expressions for each set (ie "Intersection" , "FilteredInAll", "UG", "HC", "HC-filterInUG", "filterInHC-UG") are not being recognized. I checked the possible values of 'set' as described above and also inspected the vcf file to confirm that there are variants that apply to each expression.

I tried using various inputs for the expression (examples below):  -select 'set=="HC-filterInUG"' -selectName InHC-filterInUG   -select 'set="HC-filterInUG"' -selectName InHC-filterInUG   -select 'set=HC-filterInUG' -selectName InHC-filterInUG 

But I keep getting the following output:  CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants novelSites nVariantsAtComp compRate nConcordant concordantRate CompOverlap dbsnp eval FilteredInAll all 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval FilteredInAll known 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval FilteredInAll novel 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval HC all 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval HC known 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval HC novel 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval InHC-filterInUG all 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval InHC-filterInUG known 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval InHC-filterInUG novel 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval InUG-filterInHC all 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval InUG-filterInHC known 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval InUG-filterInHC novel 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval Intersection all 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval Intersection known 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval Intersection novel 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval UG all 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval UG known 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval UG novel 0 0 0 0.00 0 0.00 CompOverlap dbsnp eval none all 5520427 225607 5294820 95.91 5143352 97.14 CompOverlap dbsnp eval none known 5294820 0 5294820 100.00 5143352 97.14 CompOverlap dbsnp eval none novel 225607 225607 0 0.00 0 0.00 

Any thoughts on how to make this comparison? Thanks in advance!

Hi @pradeepn, can you please post the command line you used for combining the callsets?

Geraldine Van der Auwera, PhD

• MGHPosts: 6Member

 java -jar ../tools/GATK/GenomeAnalysisTK-nightly-2013-07-08-g05eadc3/GenomeAnalysisTK.jar \ -T CombineVariants \ -R /seq/references/Homo_sapiens_assembly19/v1/Homo_sapiens_assembly19.fasta \ -V:HC vcf/PCAD_01.HC.hg19.recal.SNPs.indels.vcf \ -V:UG vcf/PCAD_01.UG.hg19.recal.SNPs.indels.vcf \ -priority HC,UG \ -o vcf/PCAD_01.merged.hg19.recal.SNPs.indels.vcf 

Thanks, that all looks okay -- can you post a few lines from the merged VCF too?

Geraldine Van der Auwera, PhD

• MGHPosts: 6Member

Here is an example of a few lines from the merged VCF. It seems to have the appropriate "set=" annotation.

1 726924 . T TGAATGGAATGGAATC 55.50 PASS AC=2;AF=0.125;AN=16;BaseQRankSum=0.486;ClippingRankSum=0.618;DP=296;FS=0.000;MLEAC=2;MLEAF=0.125;MQ=42.78;MQ0=0;MQRankSum=0.302;QD=0.06;ReadPosRankSum=1.067;VQSLOD=3.57;culprit=FS;set=HC GT:AD:DP:GQ:PL 0/0:9,0:9:43:0,43,505 0/0:7,0:7:40:0,40,359 0/1:5,3:8:73:73,0,301 0/0:9,0:9:38:0,38,509 0/0:7,0:7:39:0,39,346 0/0:7,0:7:41:0,41,407 0/0:8,0:8:33:0,33,386 0/1:4,2:6:31:31,0,177

1 726929 . G GGAATGGAATC 115.54 PASS AC=4;AF=0.250;AN=16;BaseQRankSum=-1.505;ClippingRankSum=-0.517;DP=279;FS=1.736;MLEAC=4;MLEAF=0.250;MQ=42.19;MQ0=0;MQRankSum=-0.701;QD=0.07;ReadPosRankSum=2.332;VQSLOD=2.54;culprit=FS;set=HC GT:AD:DP:GQ:PL 0/0:9,0:9:44:0,44,507 0/1:4,2:6:40:40,0,205 0/0:10,0:10:52:0,52,596 0/1:3,3:6:94:94,0,140 0/1:3,2:5:26:26,0,180 0/1:4,1:5:9:9,0,256 0/0:6,0:6:28:0,28,315 0/0:10,0:10:39:0,39,497

1 726934 . G GGAATC 569.20 PASS AC=9;AF=0.563;AN=16;ClippingRankSum=-0.737;DP=971;MQ0=0;NEGATIVE_TRAIN_SITE;set=HC-filterInUG GT:AD:DP:GQ:PL 1/1:0,6:6:7:212,7,0 0/1:8,2:10:49:49,0,341 0/1:7,3:10:98:98,0,329 0/1:10,2:12:12:12,0,466 0/1:6,3:9:47:47,0,331 0/1:6,3:9:80:80,0,297 0/1:4,2:6:66:66,0,149 0/1:8,2:10:57:57,0,389

1 726939 rs4520358 G C 3156.33 VQSRTrancheSNP99.90to100.00 AC=6;AF=0.375;AN=16;BaseQRankSum=-8.091;DB;DP=747;Dels=0.00;FS=2.401;HaplotypeScore=68.7126;MLEAC=6;MLEAF=0.375;MQ=31.14;MQ0=119;MQRankSum=8.041;QD=5.59;ReadPosRankSum=6.997;VQSLOD=-2.923e+01;culprit=HaplotypeScore;set=FilteredInAll GT:AD:DP:GQ:PL 0/1:64,25:87:99:345,0,1098 0/1:67,28:94:99:488,0,1072 0/0:80,7:86:99:0,174,1862 0/1:64,42:102:99:851,0,693

Geraldine Van der Auwera, PhD

• Los Angeles, CAPosts: 4Member

I am trying to create a comparison of the uniquely called and overlapping SNPs between two .vcfs (they are labeled "wex" and "wgs"). Specifically, to eventually create a Venn as in the VariantEval main page, I assume I would be taking the "all Intersection" and the "all wex" and "all wgs." However, the GATKReport indicates that the intersection is larger than the wgs variants. How is this possible? Any light you might be able to shed on this would be greatly appreciated.

CompOverlap CompRod EvalRod JexlExpression Novelty nEvalVariants novelSites nVariantsAtComp compRate nConcordant concordantRate
CompOverlap dbsnp eval FilteredInAll all 0 0 0 0 0 0
CompOverlap dbsnp eval FilteredInAll known 0 0 0 0 0 0
CompOverlap dbsnp eval FilteredInAll novel 0 0 0 0 0 0
CompOverlap dbsnp eval Intersection all 93805 8058 85747 91.41 82032 95.67
CompOverlap dbsnp eval Intersection known 85747 0 85747 100 82032 95.67
CompOverlap dbsnp eval Intersection novel 8058 8058 0 0 0 0
CompOverlap dbsnp eval Inwex-FilteredInwgs all 0 0 0 0 0 0
CompOverlap dbsnp eval Inwex-FilteredInwgs known 0 0 0 0 0 0
CompOverlap dbsnp eval Inwex-FilteredInwgs novel 0 0 0 0 0 0
CompOverlap dbsnp eval Inwgs-FilteredInwex all 0 0 0 0 0 0
CompOverlap dbsnp eval Inwgs-FilteredInwex known 0 0 0 0 0 0
CompOverlap dbsnp eval Inwgs-FilteredInwex novel 0 0 0 0 0 0
CompOverlap dbsnp eval none all 493138 65291 427847 86.76 418234 97.75
CompOverlap dbsnp eval none known 427847 0 427847 100 418234 97.75
CompOverlap dbsnp eval none novel 65291 65291 0 0 0 0
CompOverlap dbsnp eval wex all 382044 51559 330485 86.5 326075 98.67
CompOverlap dbsnp eval wex known 330485 0 330485 100 326075 98.67
CompOverlap dbsnp eval wex novel 51559 51559 0 0 0 0
CompOverlap dbsnp eval wgs all 17289 5674 11615 67.18 10127 87.19
CompOverlap dbsnp eval wgs known 11615 0 11615 100 10127 87.19
CompOverlap dbsnp eval wgs novel 5674 5674 0 0 0 0

The numbers given for "wgs" and for "wex" are the numbers of sites unique to the callsets, respectively, not the total number of variants in the callsets. So this just means that most of your wgs variants are also in your wex callset. I'm a little surprised you would have more unique variants in wex than in wgs, but assuming this is before filtering, it doesn't necessarily mean much.

Geraldine Van der Auwera, PhD

• Los Angeles, CAPosts: 4Member

Thanks for the clarification. Could you tell me the difference between the nEvalVariants in CompOverlap and the nSNPs in CountVariants? Since there's also an "Intersection" in CountVariants, I'm not sure which to use to depict unique and overlapping SNP numbers.

Can you post the table you get from CountVariants, just so I'm sure I'm commenting on the right thing?

Geraldine Van der Auwera, PhD

• Los Angeles, CAPosts: 4Member

• ParisPosts: 22Member

Hi,

I have a vcf file containing variants positions for 2 samples (generated with the tool GenotypeGVCFs: first lines of my vcf file attached as an example). I would like to get the basics stats given by VariantEval per sample.

First, I just used the parameter --sample "sample1" --sample "sample2" but it doesn't work: samples don't appear in the set list in the report

Then, I tried --sample "sample1" --sample "sample1" -selectName sample1 -selectName sample1 but I get an error message:

##### ERROR MESSAGE: Inconsistent number of provided filter names and expressions: names=[sample1, sample2] exps=[]

Finally I tried -select 'set==vc.getGenotype("sample1")' -selectName sample1 -select 'set==vc.getGenotype("sample2")' -selectName sample. It compiles, sample1 and sample2 are in the set list in the tables of the GATK reports but with O counts for all fields, as if the nature of these set names isn't recognized. I attached the first two tables of the report I get with this command.

In the documentation Tool, it is written that "Some stratifications and evaluators are incompatible with each other due to their respective memory requirements, such as AlleleCount and VariantSummary, or Sample and VariantSummary". But I didn't get any error message.

Could you help me specifying to stratify the evaluation by sample ?