# Adding Genomic Annotations Using SnpEff and VariantAnnotator

Posts: 71Dev mod
edited March 16 in Archive

### This article is out of date and no longer applicable. At this time, we do not provide support for performing functional annotation. Programs that we are aware of and that our collaborators use successfully include Oncotator and Variant Effect Predictor (VEP).

Our testing has shown that not all combinations of snpEff/database versions produce high-quality results. Be sure to read this document completely to familiarize yourself with our recommended best practices BEFORE running snpEff.

### Introduction

Until recently we were using an in-house annotation tool for genomic annotation, but the burden of keeping the database current and our inability to annotate indels have led us to use a third-party tool instead. After reviewing many external tools (including annoVar, VAT, and Oncotator), we decided that SnpEff best meets our needs, as it accepts VCF files as input, can annotate a full exome callset (including indels) in seconds, and provides continually updated transcript databases. We have implemented support in the GATK for parsing the output from the SnpEff tool and annotating VCFs with the information provided in it.

### SnpEff Setup and Usage

Download the SnpEff core program. If you want to be able to run VariantAnnotator on the SnpEff output, you'll need to download a version of SnpEff that VariantAnnotator supports from this page (currently supported versions are listed below). If you just want the most recent version of SnpEff and don't plan to run VariantAnnotator on its output, you can get it from here.

After unzipping the core program, open the file snpEff.config in a text editor, and change the "database_repository" line to the following:

database_repository = http://sourceforge.net/projects/snpeff/files/databases/

Then download the database you need using SnpEff's built-in download command:

java -jar snpEff.jar download GRCh37.64


You can find a list of available databases here. The human genome databases have GRCh or hg in their names. You can also download the databases directly from the SnpEff website, if you prefer.

The download command by default puts the databases into a subdirectory called data within the directory containing the SnpEff jar file. If you want the databases in a different directory, you'll need to edit the data_dir entry in the file snpEff.config to point to the correct directory.

Run SnpEff on the file containing your variants, and redirect its output to a file. SnpEff supports many input file formats including VCF 4.1, BED, and SAM pileup. Full details and command-line options can be found on the SnpEff home page.

### Supported SnpEff Versions

If you want to take advantage of SnpEff integration in the GATK, you'll need to run SnpEff version **2.0.5**. Note: newer versions are currently unsupported by the GATK, as we haven't yet had the resources to test them.

### Current Recommended Best Practices When Running SnpEff

These best practices are based on our analysis of various snpEff/database versions, as described in detail in the Analyses of SnpEff Annotations Across Versions section below.

• We recommend using only the GRCh37.64 database with SnpEff 2.0.5. The more recent GRCh37.65 database produces many false-positive Missense annotations due to a regression in the ENSEMBL Release 65 GTF file used to build the database. This regression has been acknowledged by ENSEMBL and is supposedly fixed as of 1-30-2012; however, as we have not yet tested the fixed version of the database, we continue to recommend using only GRCh37.64 for now.

• We recommend always running with -onlyCoding true with human databases (e.g., the GRCh37.* databases). Setting -onlyCoding false causes snpEff to report all transcripts as if they were coding (even if they're not), which can lead to nonsensical results. The -onlyCoding false option should only be used with databases that lack protein coding information.

• Do not trust annotations from versions of snpEff prior to 2.0.4. Older versions of snpEff (such as 2.0.2) produced many incorrect annotations due to the presence of a certain number of nonsensical transcripts in the underlying ENSEMBL databases. Newer versions of snpEff filter out such transcripts.

### Analyses of SnpEff Annotations Across Versions

See our analysis of the SNP annotations produced by snpEff across various snpEff/database versions here.

• Both snpEff 2.0.2 + GRCh37.63 and snpEff 2.0.5 + GRCh37.65 produce an abnormally high Missense:Silent ratio, with elevated levels of Missense mutations across the entire spectrum of allele counts. They also have a relatively low (~70%) level of concordance with the 1000G Gencode annotations when it comes to Silent mutations. This suggests that these combinations of snpEff/database versions incorrectly annotate many Silent mutations as Missense.

• snpEff 2.0.4 RC3 + GRCh37.64 and snpEff 2.0.5 + GRCh37.64 produce a Missense:Silent ratio in line with expectations, and have a very high (~97%-99%) level of concordance with the 1000G Gencode annotations across all categories.

See our comparison of SNP annotations produced using the GRCh37.64 and GRCh37.65 databases with snpEff 2.0.5 here

• The GRCh37.64 database gives good results on the condition that you run snpEff with the -onlyCoding true option. The -onlyCoding false option causes snpEff to mark all transcripts as coding, and so produces many false-positive Missense annotations.

• The GRCh37.65 database gives results that are as poor as those you get with the -onlyCoding false option on the GRCh37.64 database. This is due to a regression in the ENSEMBL release 65 GTF file used to build snpEff's GRCh37.65 database. The regression has been acknowledged by ENSEMBL and is due to be fixed shortly.

See our analysis of the INDEL annotations produced by snpEff across snpEff/database versions here

• snpEff's indel annotations are highly concordant with those of a high-quality set of genomic annotations from the 1000 Genomes project. This is true across all snpEff/database versions tested.

### Example SnpEff Usage with a VCF Input File

Below is an example of how to run SnpEff version 2.0.5 with a VCF input file and have it write its output in VCF format as well. Notice that you need to explicitly specify the database you want to use (in this case, GRCh37.64). This database must be present in a directory of the same name within the data_dir as defined in snpEff.config.

java -Xmx4G -jar snpEff.jar eff -v -onlyCoding true -i vcf -o vcf GRCh37.64 1000G.exomes.vcf > snpEff_output.vcf


In this mode, SnpEff aggregates all effects associated with each variant record together into a single INFO field annotation with the key EFF. The general format is:

EFF=Effect1(Information about Effect1),Effect2(Information about Effect2),etc.


And here is the precise layout with all the subfields:

EFF=Effect1(Effect_Impact|Effect_Functional_Class|Codon_Change|Amino_Acid_Change|Gene_Name|Gene_BioType|Coding|Transcript_ID|Exon_ID),Effect2(etc...
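
To make the layout concrete, here is a minimal Python sketch of how an EFF value could be split into one record per effect. The helper name `parse_eff` and the sample value are illustrative assumptions, not part of SnpEff or the GATK, and the sketch assumes that subfield values never contain commas or parentheses.

```python
import re

# Subfield names, in the order given by the layout above.
EFF_SUBFIELDS = [
    "Effect_Impact", "Effect_Functional_Class", "Codon_Change",
    "Amino_Acid_Change", "Gene_Name", "Gene_BioType", "Coding",
    "Transcript_ID", "Exon_ID",
]

# One effect has the shape NAME(sub1|sub2|...).
_EFFECT_RE = re.compile(r"([^,()]+)\(([^()]*)\)")


def parse_eff(eff_value):
    """Split the value of an EFF annotation into a list of dicts (hypothetical helper)."""
    effects = []
    for name, body in _EFFECT_RE.findall(eff_value):
        values = body.split("|")
        # Pad with empty strings in case trailing subfields are missing.
        values += [""] * (len(EFF_SUBFIELDS) - len(values))
        record = dict(zip(EFF_SUBFIELDS, values))
        record["Effect"] = name
        effects.append(record)
    return effects


# Made-up EFF value, for illustration only.
example = (
    "NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|Gtt/Att|V12I|GENE1|protein_coding|CODING|TX1|exon_1),"
    "UPSTREAM(MODIFIER||||GENE2|protein_coding|CODING|TX2|)"
)
for effect in parse_eff(example):
    print(effect["Effect"], effect["Effect_Impact"], effect["Gene_Name"])
```

Empty subfields come back as empty strings, so downstream code can treat "missing" and "not applicable" the same way.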


It's also possible to get SnpEff to output in a (non-VCF) text format with one Effect per line. See the SnpEff home page for full details.

### Adding SnpEff Annotations using VariantAnnotator

Once you have a SnpEff output VCF file, you can use the VariantAnnotator walker to add SnpEff annotations based on that output to the input file you ran SnpEff on.

There are two different options for doing this:

#### Option 1: Annotate with only the highest-impact effect for each variant

NOTE: This option works only with supported SnpEff versions as explained above. VariantAnnotator run as described below will refuse to parse SnpEff output files produced by other versions of the tool, or which lack a SnpEff version number in their header.

The default behavior when you run VariantAnnotator on a SnpEff output file is to parse the complete set of effects resulting from the current variant, select the most biologically-significant effect, and add annotations for just that effect to the INFO field of the VCF record for the current variant. This is the mode we plan to use in our Production Data-Processing Pipeline.

When selecting the most biologically-significant effect associated with the current variant, VariantAnnotator does the following:

• Prioritizes the effects according to the categories (in order of decreasing precedence) "High-Impact", "Moderate-Impact", "Low-Impact", and "Modifier", and always selects one of the effects from the highest-priority category. For example, if there are three moderate-impact effects and two high-impact effects resulting from the current variant, the annotator will choose one of the high-impact effects and add annotations based on it. See below for a full list of the effects arranged by category.

• Within each category, ties are broken using the functional class of each effect (in order of precedence: NONSENSE, MISSENSE, SILENT, or NONE). For example, if there is both a NON_SYNONYMOUS_CODING (MODERATE-impact, MISSENSE) and a CODON_CHANGE (MODERATE-impact, NONE) effect associated with the current variant, the annotator will select the NON_SYNONYMOUS_CODING effect. This is to allow for more accurate counts of the total number of sites with NONSENSE/MISSENSE/SILENT mutations. See below for a description of the functional classes SnpEff associates with the various effects.

• Effects that are within a non-coding region are always considered lower-impact than effects that are within a coding region.
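
The three rules above can be sketched as a single sort key. This is only one plausible reading of how the rules combine (coding status first, then impact, then functional class); it is not taken from the VariantAnnotator source, and the dictionary layout and function names are assumptions made for illustration.

```python
# Rank tables: a lower number means a higher priority.
IMPACT_RANK = {"HIGH": 0, "MODERATE": 1, "LOW": 2, "MODIFIER": 3}
CLASS_RANK = {"NONSENSE": 0, "MISSENSE": 1, "SILENT": 2, "NONE": 3}


def priority_key(effect):
    """Sort key for an effect dict (hypothetical structure).

    Assumed ordering: effects in coding regions outrank non-coding ones,
    then impact category decides, then functional class breaks ties.
    """
    return (
        0 if effect["coding"] else 1,
        IMPACT_RANK[effect["impact"]],
        CLASS_RANK[effect["functional_class"]],
    )


def most_significant(effects):
    """Return the effect with the smallest key; on a full tie the first one listed wins."""
    return min(effects, key=priority_key)


# The tie-breaking example from the text, plus one lower-impact effect.
effects = [
    {"name": "CODON_CHANGE", "impact": "MODERATE", "functional_class": "NONE", "coding": True},
    {"name": "NON_SYNONYMOUS_CODING", "impact": "MODERATE", "functional_class": "MISSENSE", "coding": True},
    {"name": "SYNONYMOUS_CODING", "impact": "LOW", "functional_class": "SILENT", "coding": True},
]
print(most_significant(effects)["name"])  # NON_SYNONYMOUS_CODING
```

Encoding the rules as a tuple keeps the precedence explicit: changing the order of the tuple elements changes which rule dominates.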

Example Usage:

java -jar dist/GenomeAnalysisTK.jar \
-T VariantAnnotator \
-R /humgen/1kg/reference/human_g1k_v37.fasta \
-A SnpEff \
--variant 1000G.exomes.vcf \
--snpEffFile snpEff_output.vcf \
-L 1000G.exomes.vcf \
-o out.vcf

Here --variant is the file to annotate, and --snpEffFile is the SnpEff VCF output file generated by running SnpEff on the file to annotate.


VariantAnnotator adds some or all of the following INFO field annotations to each variant record:

• SNPEFF_EFFECT - The highest-impact effect resulting from the current variant (or one of the highest-impact effects, if there is a tie)

• SNPEFF_IMPACT - Impact of the highest-impact effect resulting from the current variant (HIGH, MODERATE, LOW, or MODIFIER)

• SNPEFF_FUNCTIONAL_CLASS - Functional class of the highest-impact effect resulting from the current variant (NONE, SILENT, MISSENSE, or NONSENSE)
• SNPEFF_CODON_CHANGE - Old/New codon for the highest-impact effect resulting from the current variant
• SNPEFF_AMINO_ACID_CHANGE - Old/New amino acid for the highest-impact effect resulting from the current variant
• SNPEFF_GENE_NAME - Gene name for the highest-impact effect resulting from the current variant
• SNPEFF_GENE_BIOTYPE - Gene biotype for the highest-impact effect resulting from the current variant
• SNPEFF_TRANSCRIPT_ID - Transcript ID for the highest-impact effect resulting from the current variant
• SNPEFF_EXON_ID - Exon ID for the highest-impact effect resulting from the current variant

Example VCF records annotated using SnpEff and VariantAnnotator:

1   874779  .   C   T   279.94  . AC=1;AF=0.0032;AN=310;BaseQRankSum=-1.800;DP=3371;Dels=0.00;FS=0.000;HRun=0;HaplotypeScore=1.4493;InbreedingCoeff=-0.0045;
SNPEFF_EFFECT=SYNONYMOUS_CODING;SNPEFF_EXON_ID=exon_1_874655_874840;SNPEFF_FUNCTIONAL_CLASS=SILENT;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;
SNPEFF_IMPACT=LOW;SNPEFF_TRANSCRIPT_ID=ENST00000342066

1   874816  .   C   CT  2527.52 .   AC=15;AF=0.0484;AN=310;BaseQRankSum=-11.876;DP=4718;FS=48.575;HRun=1;HaplotypeScore=91.9147;InbreedingCoeff=-0.0520;
SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_GENE_BIOTYPE=protein_coding;SNPEFF_GENE_NAME=SAMD11;SNPEFF_IMPACT=HIGH;SNPEFF_TRANSCRIPT_ID=ENST00000342066
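
As a rough illustration of how these annotations can be read back out of a record, the Python sketch below pulls the SNPEFF_* entries from an INFO string. The helper name `snpeff_annotations` is an assumption made for this example; the INFO value is the one from the first record above, joined into a single line.

```python
def snpeff_annotations(info):
    """Return the SNPEFF_* key/value pairs from a VCF INFO string (hypothetical helper)."""
    result = {}
    for entry in info.split(";"):
        if not entry:
            continue  # tolerate a trailing semicolon
        key, _, value = entry.partition("=")
        if key.startswith("SNPEFF_"):
            result[key] = value
    return result


# INFO column of the first example record above, joined into one line.
info = (
    "AC=1;AF=0.0032;AN=310;BaseQRankSum=-1.800;DP=3371;Dels=0.00;FS=0.000;HRun=0;"
    "HaplotypeScore=1.4493;InbreedingCoeff=-0.0045;"
    "SNPEFF_EFFECT=SYNONYMOUS_CODING;SNPEFF_EXON_ID=exon_1_874655_874840;"
    "SNPEFF_FUNCTIONAL_CLASS=SILENT;SNPEFF_GENE_BIOTYPE=protein_coding;"
    "SNPEFF_GENE_NAME=SAMD11;SNPEFF_IMPACT=LOW;SNPEFF_TRANSCRIPT_ID=ENST00000342066"
)
annotations = snpeff_annotations(info)
print(annotations["SNPEFF_GENE_NAME"], annotations["SNPEFF_IMPACT"])
```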


#### Option 2: Annotate with all effects for each variant

VariantAnnotator also has the ability to take the EFF field from the SnpEff VCF output file containing all the effects aggregated together and copy it verbatim into the VCF to annotate.

Here's an example of how to do this:

java -jar dist/GenomeAnalysisTK.jar \
-T VariantAnnotator \
-R /humgen/1kg/reference/human_g1k_v37.fasta \
-E resource.EFF \
--variant 1000G.exomes.vcf \
--resource snpEff_output.vcf \
-L 1000G.exomes.vcf \
-o out.vcf

Here --variant is the file to annotate, and --resource is the SnpEff VCF output file generated by running SnpEff on the file to annotate.


Of course, in this case you can also use the VCF output by SnpEff directly, but if you are using VariantAnnotator for other purposes anyway the above might be useful.

### List of Genomic Effects

Below are the possible genomic effects recognized by SnpEff, grouped by biological impact. Full descriptions of each effect are available on this page.

#### High-Impact Effects

• SPLICE_SITE_ACCEPTOR

• SPLICE_SITE_DONOR

• START_LOST
• EXON_DELETED
• FRAME_SHIFT
• STOP_GAINED
• STOP_LOST

#### Moderate-Impact Effects

• NON_SYNONYMOUS_CODING

• CODON_CHANGE (note: this effect is used by SnpEff only for MNPs, not SNPs)

• CODON_INSERTION
• CODON_CHANGE_PLUS_CODON_INSERTION
• CODON_DELETION
• CODON_CHANGE_PLUS_CODON_DELETION
• UTR_5_DELETED
• UTR_3_DELETED

#### Low-Impact Effects

• SYNONYMOUS_START

• NON_SYNONYMOUS_START

• START_GAINED
• SYNONYMOUS_CODING
• SYNONYMOUS_STOP
• NON_SYNONYMOUS_STOP

#### Modifiers

• NONE

• CHROMOSOME

• CUSTOM
• CDS
• GENE
• TRANSCRIPT
• EXON
• INTRON_CONSERVED
• UTR_5_PRIME
• UTR_3_PRIME
• DOWNSTREAM
• INTRAGENIC
• INTERGENIC
• INTERGENIC_CONSERVED
• UPSTREAM
• REGULATION
• INTRON

### Functional Classes

SnpEff assigns a functional class to certain effects, in addition to an impact:

• NONSENSE: assigned to point mutations that result in the creation of a new stop codon

• MISSENSE: assigned to point mutations that result in an amino acid change, but not a new stop codon

• SILENT: assigned to point mutations that result in a codon change, but not an amino acid change or new stop codon
• NONE: assigned to all effects that don't fall into any of the above categories (including all events larger than a point mutation)

The GATK prioritizes effects with functional classes over effects of equal impact that lack a functional class when selecting the most significant effect in VariantAnnotator. This is to enable accurate counts of NONSENSE/MISSENSE/SILENT sites.
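
For example, once each record carries a single SNPEFF_FUNCTIONAL_CLASS annotation, per-class site counts and a Missense:Silent ratio can be tallied with a few lines of Python. This is a sketch under stated assumptions: the function name is hypothetical and the INFO strings are made up for illustration rather than taken from a real callset.

```python
from collections import Counter


def functional_class_counts(info_strings):
    """Count SNPEFF_FUNCTIONAL_CLASS values across VCF INFO strings (hypothetical helper)."""
    counts = Counter()
    for info in info_strings:
        for entry in info.split(";"):
            key, _, value = entry.partition("=")
            if key == "SNPEFF_FUNCTIONAL_CLASS":
                counts[value] += 1
    return counts


# Made-up INFO strings, one per variant record.
infos = [
    "AC=1;SNPEFF_FUNCTIONAL_CLASS=SILENT;SNPEFF_IMPACT=LOW",
    "AC=2;SNPEFF_FUNCTIONAL_CLASS=MISSENSE;SNPEFF_IMPACT=MODERATE",
    "AC=1;SNPEFF_FUNCTIONAL_CLASS=MISSENSE;SNPEFF_IMPACT=MODERATE",
    "AC=5;SNPEFF_FUNCTIONAL_CLASS=NONE;SNPEFF_IMPACT=MODIFIER",
]
counts = functional_class_counts(infos)
if counts["SILENT"]:  # avoid dividing by zero when no silent sites are present
    print("Missense:Silent ratio =", counts["MISSENSE"] / counts["SILENT"])
```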


• Posts: 27Member

We recommend using only the GRCh37.64 database with SnpEff 2.0.5. The more recent GRCh37.65 database produces many false-positive Missense annotations...

Does this simply mean "of the GRCh37 versions, we only recommend GRCh37.64 and not GRCh37.65" or the stronger statement "snpEff works well with GRCh37.64 but we do not recommend using it with hg19 or any other reference genome"?

I believe the former interpretation ("of the GRCh37 versions, we only recommend GRCh37.64 and not GRCh37.65") is the correct one. To be absolutely sure I would recommend contacting the authors of SnpEff directly (see the project's website for details on contacting them).

Geraldine Van der Auwera, PhD

• Posts: 27Member
edited September 2012

The snpEff authors discourage hg19 in their FAQ:

WARNING: Usage of hg19 genome is deprecated and discouraged, you should use GRChXX.YY instead

Reference sequence and annotations are made for an organism, version and sub-version. For example, human genome, version 37, sub-version 63 would be called GRCh37.63 (or hg37.63, aka hg19.63).

UCSC doesn't specify sub-version. They just say hg19. This annoying sub-version problem appeared often and, having reproducibility of results in mind, I dropped UCSC annotations in favor of ENSEMBL ones (they have clear versioning).

• Posts: 27Member

Given that snpEff does not offer it, what do you recommend as the best way to get other additional annotations such as 1000 genomes minor allele frequency, SIFT prediction, phyloP conservation score, etc?

• Posts: 9Member

I have run SnpEff 2.0.5 (GRCh37.64 ref used) on the variants coming from UnifiedGenotyper (ucsc.hg19 ref used).

I'm ready to run VariantAnnotator using the SnpEff output...

Should I use GRCh37.64 or ucsc.hg19 as the -R argument in VariantAnnotator? Does it matter?

Thanks:)
Leukogenom

VariantAnnotator wants the reference that was used to call the variants ... So, hg19.

Geraldine Van der Auwera, PhD

• Posts: 2Member

snpEff is now at version 3.0, revision 'f' (2012-08-23). Is it still recommended to use 2.0.5?

• Posts: 2Member

@ericminikel said:

Given that snpEff does not offer it, what do you recommend as the best way to get other additional annotations such as 1000 genomes minor allele frequency, SIFT prediction, phyloP conservation score, etc?

SIFT prediction: can that be solved by annotating with dbNSFP? See

http://snpeff.sourceforge.net/SnpSift.html#dbnsfp

• Posts: 61Member

I have been having a go with using snpEff and dbNSFP. No problem getting it to work, but as far as I can tell dbNSFP only contains annotations for non-synonymous SNPs. Very logical given its stated goal! But it would be nice to also get the 1000G frequencies for variants that are not coding. The GATK provides several solutions for annotating with dbSNP rs IDs. Is there something similar for 1000G frequency?

You can use the Variant Annotator to transfer annotations (e.g. the AC/AF fields from the 1000G VCF in your case) from one VCF into another.

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• Posts: 7Member

Hello,

I am using this pipeline to annotate a VCF 4.1 output from GATK 2.3.9 UnifiedGenotyper. I have successfully applied the preliminary annotations to each variant using snpEff, but I have been unable to generate the final annotation using VariantAnnotator. Ultimately, I would like to continue using GATK programs to analyze whole-exome data across many samples to detect low-level somatic mosaicism. Below is the latest command I have been using:

java -jar /groups/warman/gatk2.3.9/GenomeAnalysisTK.jar
-T VariantAnnotator
-R /groups/warman/Kyle/REF/hg19.fasta
-A SnpEff
-I AACT_samtoolsview.VA_mut_candidates.VAL.pcr.rg.bam.reduced.bam
-V AACT.VA_candidates.UG.ploidy20.vcf.trim
-snpEffFile AACT.VA_candidates.UG.ploidy20.vcf.trim.snpEff_output.vcf
-o AACT.VA_candidates.UG.ploidy20.vcf.trim.snpEff_output.annotate.vcf

The most recent error I receive says the following:

ERROR MESSAGE: Invalid command line: No tribble type was provided on the command line and the type of the file could not be determined dynamically. Please add an explicit type tag :NAME listing the correct type from among the supported types:

##### ERROR ------------------------------------------------------------------------------------------

Due to discrepancies between this "Adding Genomic Annotations Using SnpEff and VariantAnnotator" page, the VariantAnnotator documentation itself, and the help function within GATK, I have been unable to know for certain which arguments/parameters need to be inputted to successfully run VariantAnnotator.

Any insight would be greatly appreciated. Thanks!

Re: your error, it looks like one of your vcf files is not being recognized properly. You should validate all your files to make sure they are not corrupted or malformed. Also, make sure that the extension name is appropriate for each file. I see at least one named *.vcf.trim; that might be confusing the system. I would suggest renaming it to .trim.vcf.

Geraldine Van der Auwera, PhD

• Posts: 1Member

Hi,
Does snpEff support dependent variants?
E.g. two SNPs in the same codon, or two frameshifts in a transcript leading to an in-frame insertion/deletion?

Another question: in the case of an insertion (-/AGC), is the strand a mandatory input for finding matching transcripts, which could be on the forward or reverse strand? What does snpEff assume for an insertion in VCF input, that it is always on the forward strand? If so, why do we need strand as an input at all for annotations?

Thanks
Yuval

Please direct any questions about snpEff to the developers of that tool. We only provide support for tools that convert data to and from snpEff, not the tool itself.

Geraldine Van der Auwera, PhD

• Posts: 33Member

Will GATK be updated to support later versions of SnpEff (version 3)? If so, is there an ETA? The newest version of SnpEff seems to have a handy -o gatk argument, so I'm assuming newer versions of GATK and VariantAnnotator will be compatible with these outputs?

Thanks,

MC

We'd very much like to support this, but we don't have time to update the code. We'd accept any patch to the system to do it, though.

--
Mark A. DePristo, Ph.D.
Co-Director, Medical and Population Genetics
Broad Institute of MIT and Harvard

• Posts: 261Member ✭✭

Using latest version of snpEff with -o gatk option produces vcf files compatible with GATK.

• Posts: 43Member

I am using the latest version of snpEff and using -o gatk doesn't seem to work for me. The resulting vcf is not compatible with GATK VariantAnnotator. Here are the commands:

java -jar snpEff.jar eff -c snpEff.config -v GRCh37.68 -o gatk temp.haplotypeCaller.vcf > temp.snpEff.vcf

GenomeAnalysisTKLite -T VariantAnnotator -R human_g1k_v37.fa -A SnpEff -o temp.annotated.vcf --variant temp.haplotypeCaller.vcf --snpEffFile temp.snpEff.vcf -L temp.haplotypeCaller.vcf

Hi Kath,

I see that you're using GATK Lite. It is probably not compatible with the newer SnpEff functions. If so, you'll need to use a more recent version of GATK for the interaction between the two programs to work properly.

Geraldine Van der Auwera, PhD

• Posts: 43Member

Silly me! It works with the full version.
Thanks very much!

• Posts: 3Member

I am running into difficulty running GATK (v. 3.1-1) with VariantAnnotator along with snpEff (v. 3.6b).

java -Xmx4 -jar ~/bin/GenomeAnalysisTK/GenomeAnalysisTK.jar -R /projects/ref_genome/ucsc_chromosomes/hg19.fa -T VariantAnnotator -A SnpEFF --variant sample1.vcf --snpEffFile sample1_eff.vcf -o sample1_eff_annotated.vcf

I want to filter for only the most egregious effect in my sample, however I get the following error.

##### ERROR MESSAGE: Invalid command line: Argument annotation has a bad value: Annotation SnpEFF was not found; please check that you have specified the annotation name correctly

How do I properly configure SnpEff with GATK? The -A option doesn't seem well documented.

Thanks,

• Posts: 3Member


figured it out.

typo: should be SnpEff and not SnpEFF

Wow! way too many hours! ( Slap forehead)

Time to move away from the keyboard and get some fresh air

Geraldine Van der Auwera, PhD

• Posts: 57Member
edited September 2014

Hi, I'm using GATK v3.2.2.
I annotated a VCF file with snpEff (SnpEff 3.6c, build 2014-05-20) using the -o gatk option, and then ran VariantAnnotator with -A SnpEff.
It all worked fine, except for the annotation of splice site regions, which appears not to be recognised; i.e. I get the error below, which means those annotations are ignored. Is this a GATK issue or a snpEff issue (is this something you can fix, or do I need to contact snpEff)? I assume it's because, based on this documentation, GATK is expecting SPLICE_SITE_ACCEPTOR or SPLICE_SITE_DONOR...

WARN 21:26:39,712 SnpEff - Skipping malformed SnpEff effect field at X:21794. Error was: "SPLICE_SITE_REGION is not a recognized effect type". Field was: "SPLICE_SITE_REGION(MODIFIER||||CG17636|protein_coding|CODING|FBtr0112921|3)"


Hi @prepagam,

Sorry for the late response. This looks like it's a version problem. I think recent versions of SnpEff emit slightly different output, which GATK doesn't understand. Unfortunately we don't have a solution for this at the moment, and we don't have any resources to devote to updating the GATK SnpEff tool.

Geraldine Van der Auwera, PhD

• PekingPosts: 3Member

Hi there,
I plan to use VariantAnnotator with SnpEff.

I am just wondering whether all the versions on this page, http://sourceforge.net/projects/snpeff/files/, are supported by VariantAnnotator. Or do I still have to download version 2.0.5, which I can't find on that page?

@Gangpao
Hi,

We only support version 2.0.5 of SnpEff. However, you can look into using Oncotator instead. http://www.broadinstitute.org/oncotator/

-Sheila

• torontoPosts: 14Member

Hi,

I have a quick question about annotating variants from external sources like 1000GP. At some point I'm going to have to filter my VCF to eliminate common variants (those with an allele frequency > 1% in the population) using the 1000GP or other sources. I'd like to annotate variants with their allele frequencies or presence in other data sets and was wondering when to do this. It seems like something to do after HaplotypeCaller and GenotypeGVCFs have been run. Is that true? Thanks