It looks like you're new here. If you want to get involved, click one of these buttons!
For a complete, detailed argument reference, refer to the technical documentation page.
You can find detailed information about the various modules here.
Note that the GenotypeConcordance module has been rewritten as a separate walker tool (see its Technical Documentation page).

We in GSA often find ourselves performing an analysis of 2 different call sets. For SNPs, we often show the overlap of the sets (their "venn") and the relative dbSNP rates and/or transition-transversion ratios. The picture provided is an example of such a slide and is easy to create using VariantEval. Assuming you have 2 filtered VCF callsets named 'foo.vcf' and 'bar.vcf', there are 2 quick steps.
java -jar GenomeAnalysisTK.jar \
-R ref.fasta \
-T CombineVariants \
-V:FOO foo.vcf \
-V:BAR bar.vcf \
-priority FOO,BAR \
-o merged.vcf
java -jar GenomeAnalysisTK.jar \
-T VariantEval \
-R ref.fasta \
-D dbsnp.vcf \
-select 'set=="Intersection"' -selectName Intersection \
-select 'set=="FOO"' -selectName FOO \
-select 'set=="FOO-filterInBAR"' -selectName InFOO-FilteredInBAR \
-select 'set=="BAR"' -selectName BAR \
-select 'set=="filterInFOO-BAR"' -selectName InBAR-FilteredInFOO \
-select 'set=="FilteredInAll"' -selectName FilteredInAll \
-o merged.eval.gatkreport \
-eval merged.vcf \
-l INFO
It is wise to check the actual values for the set names present in your file before writing complex VariantEval commands. An easy way to do this is to extract the value of the set fields and then reduce that to the unique entries, like so:
java -jar GenomeAnalysisTK.jar -T VariantsToTable -R ref.fasta -V merged.vcf -F set -o fields.txt
grep -v 'set' fields.txt | sort | uniq -c
This will provide you with a list of all of the possible values for 'set' in your VCF so that you can be sure to supply the correct select statements to VariantEval.
The VariantEval output is formatted as a GATKReport.
The VariantEval genotype concordance module emits information the relationship between the eval calls and genotypes and the comp calls and genotypes. The following three slides provide some insight into three key metrics to assess call sensitivity and concordance between genotypes.
##:GATKReport.v0.1 GenotypeConcordance.sampleSummaryStats : the concordance statistics summary for each sample
GenotypeConcordance.sampleSummaryStats CompRod CpG EvalRod JexlExpression Novelty percent_comp_ref_called_var percent_comp_het_called_het percent_comp_het_called_var percent_comp_hom_called_hom percent_comp_hom_called_var percent_non-reference_sensitivity percent_overall_genotype_concordance percent_non-reference_discrepancy_rate
GenotypeConcordance.sampleSummaryStats compOMNI all eval none all 0.78 97.65 98.39 99.13 99.44 98.80 99.09 3.60
The key outputs:
percent_overall_genotype_concordance percent_non_ref_sensitivity_rate percent_non_ref_discrepancy_rate All defined below.




Comments
If I am using comp to evaluate genotype concordance with chip data and the ref/alt alleles for my two vcf files are different. Let's say an "A" is called for a single sample but for my seq vcf this is the alternate allele and for my chip vcf this is the ref allele, will this genotype still be in "concordance?"
Thanks,
MC
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi chongm-
We have replaced the GenotypeConcordance module of VariantEval with its own walker: GenotypeConcordance. See:
http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_GenotypeConcordance.html
I'm not sure what the old behavior of VariantEval was for the situation you describe, but GenotypeConcordance was written under the assumption that the VCFs passed in adhere to the specification, and that the reference allele represents the bases of the reference used to build the VCF; and while the GATK engine no longer automatically checks VCF reference bases against the actual reference, I'm not sure any behavior can be guaranteed if this is not the case.
For this particular walker the site will be treated as follows:
The site itself will be tabulated as ALLELES_DO_NOT_MATCH in the site table
The genotypes will nevertheless be evaluated, and count against you
The reason the genotypes will be evaluated is to allow for the following case:
EVAL: A T 0/0 1/0
COMP: A C 0/0 1/0
Despite the fact that the second site has a different alternate allele, the first sample is still concordant (A/A vs A/A), while the second sample has a mismatching allele. In your case:
EVAL: A T 0/0 1/0
COMP: T A 1/1 1/0
the first sample (whose alleles are "A/A") still matches the alleles in the EVAL context (on the genotype level), even though the alternate alleles do not match, and is therefore not tabulated under the allele mismatch column, but rather "hom var called as hom ref."
In conclusion, it's best to fix or remove sites that break ValidateVariants.
- Spam
- Abuse
- Troll
1 • Off Topic Disagree Agree 1Like WTF •What do the fields "UNAVAILABLE_HOM_REF ... HOM_VAR... _HET" mean? I seem to have a high proportion for both my chip data and my seq data (> 0.95 for these fields). I am thinking that if my chip & seq data do not have a high degree of overlapping sites though the sites they do share seem to be concordant (as calculated by just using varianteval - comp, > 98% concordance), all of these non-shared sites end up in the proportion of sites that are "unavailable"?
Thanks,
MC
Thanks,
MC
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi MC,
Did you use the --ignoreFilters flag?
As explained in the doc for GenotypeConcordance, if you use it:
http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_sting_gatk_walkers_variantutils_GenotypeConcordance.html
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Oh sorry I didn't see that thank you Geraldine!
Thanks,
MC
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •