Filtering based on annotations under ANN within the INFO field


I am brand new to GATK and have been assigned the task of filtering a VCF file provided to me. I have already managed to filter on the AF value and other numerical fields. However, the remaining tags I need to filter on are included (or rather, buried) within an extremely long "ANN=..." field inside the INFO column, and I'm having trouble extracting that information.

Basically, what I need is something like
-filter "ANN=has_a_string_somewhere_inside_this_extremely_long_field" -filter-name "HAS_THIS_ANNOTATION"

An example line from my VCF:

chr1    9784423 COSM3751466;COSM3751467 C   T   1563.0  PASS    DP=223;AF=0.340807;SB=3;DP4=62,85,28,48;ANN=T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|ENST00000361110.6_1|protein_coding|21/23|c.2880C>T|p.Y960Y|2995/3508|2880/3207|960/1068||,T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|ENST00000536656.5_1|protein_coding|23/25|c.2880C>T|p.Y960Y|3088/5483|2880/3207|960/1068||,T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|ENST00000628140.2_1|protein_coding|22/24|c.2880C>T|p.Y960Y|3088/5483|2880/3207|960/1068||,T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|ENST00000377346.8_1|protein_coding|22/24|c.2808C>T|p.Y936Y|3003/5203|2808/3135|936/1044||,T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|ENST00000543390.2_1|protein_coding|22/24|c.2880C>T|p.Y960Y|2995/3503|2880/3207|960/1068||,T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|transcript|ENST00000377298.8_1|protein_coding||c.*6143G>A|||||4661|,T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|transcript|ENST00000477264.1_1|processed_transcript||n.*4663G>A|||||4663|,T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|transcript|ENST00000435891.5_1|protein_coding||c.*6143G>A|||||4663|WARNING_TRANSCRIPT_NO_START_CODON,T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|transcript|ENST00000361311.4_1|protein_coding||c.*6143G>A|||||4663|;AA=p.Y936Y,p.Y960Y;CDS=c.2808C>T,c.2880C>T;CNT=2,2;GENE=PIK3CD,PIK3CD_ENST00000536656;STRAND=+,+   GT:GQ:DP:AD 0/1:70:223:147,76

As you can see, there is a lot of information inside "ANN=", and what I would like to use in filter expressions are tags like "synonymous_variant".
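To make the structure of the field concrete, here is a minimal Python sketch of how the ANN value breaks down: comma-separated entries (one per transcript), each with pipe-separated sub-fields, where the second sub-field holds the effect term. It assumes the layout seen in the example line above (the SnpEff-style ANN convention); the function names `parse_info` and `ann_effects` are made up for illustration and are not part of GATK.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flag-style keys map to True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")  # split on the first '=' only
        fields[key] = value if sep else True
    return fields


def ann_effects(info):
    """Collect the effect terms (2nd sub-field) from every ANN entry.

    Assumes the ANN layout shown in the example line: entries separated
    by ',', sub-fields by '|', combined effects joined with '&'.
    """
    ann = parse_info(info).get("ANN", "")
    effects = set()
    if not isinstance(ann, str):
        return effects
    for entry in ann.split(","):
        parts = entry.split("|")
        if len(parts) > 1:
            effects.update(parts[1].split("&"))
    return effects


# Shortened version of the INFO column from the example line.
info = ("DP=223;AF=0.340807;ANN=T|synonymous_variant|LOW|PIK3CD|"
        "ENSG00000171608.15_2|transcript|ENST00000361110.6_1|protein_coding|"
        "21/23|c.2880C>T|p.Y960Y|2995/3508|2880/3207|960/1068||,"
        "T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|"
        "transcript|ENST00000377298.8_1|protein_coding||c.*6143G>A|||||4661|"
        ";AA=p.Y936Y,p.Y960Y")

print("synonymous_variant" in ann_effects(info))
```

This only illustrates what "contains this annotation" means for such a field; it is not a replacement for a JEXL expression in VariantFiltration.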

I tried simply matching it as a string, because that is the only way I have seen it done, like:

./gatk VariantFiltration -V my.snps.vcf -R ref.fasta -filter "ANN=='non_coding_transcript_exon_variant'" --filter-name "EXONIC" -filter "AF < 0.01" --filter-name "AF-FAIL" -O filtered_for_EXON_AF.vcf

I also tried a regular expression like ".*non_coding_transcript_exon_variant.*", but neither approach returns any results. When I use -invfilter, the filter name is added to every single line, so I guess I just need to find the proper way to describe the string I'm searching for.

I assume the biggest issue is my lack of experience with JEXL, but I have yet to find a simple tutorial on how to match such a value with these expressions, i.e. ANN=(any_context)STRING_I_WANT(any_context).

I would also welcome any additional tips on dealing with monstrous ANN fields like this in general, since I would also like to run SelectVariants on the file and, once the filtering works, add the individual annotations as separate entries in my future .table file.
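For the table part, one possible approach outside GATK is to flatten each ANN entry into its own row. The sketch below is only an illustration in Python: it assumes tab-separated VCF data lines and the 16 sub-field ANN layout visible in the example line, and the column names in `ANN_KEYS` as well as the function `ann_rows` are hypothetical labels chosen here, not names defined by GATK.

```python
# Hypothetical column labels for the 16 pipe-separated ANN sub-fields,
# in the order they appear in the example line.
ANN_KEYS = ["allele", "effect", "impact", "gene", "gene_id", "feature_type",
            "feature_id", "biotype", "rank", "hgvs_c", "hgvs_p",
            "cdna_pos", "cds_pos", "aa_pos", "distance", "messages"]


def ann_rows(vcf_line):
    """Yield one dict per ANN entry of a single tab-separated VCF data line."""
    cols = vcf_line.rstrip("\n").split("\t")
    chrom, pos, info = cols[0], cols[1], cols[7]  # INFO is the 8th column
    for item in info.split(";"):
        if not item.startswith("ANN="):
            continue
        for entry in item[len("ANN="):].split(","):
            row = dict(zip(ANN_KEYS, entry.split("|")))
            row["chrom"], row["pos"] = chrom, pos
            yield row


# Shortened version of the example line, joined with tabs.
line = "\t".join([
    "chr1", "9784423", "COSM3751466;COSM3751467", "C", "T", "1563.0", "PASS",
    "DP=223;ANN=T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|"
    "transcript|ENST00000361110.6_1|protein_coding|21/23|c.2880C>T|p.Y960Y|"
    "2995/3508|2880/3207|960/1068||,T|downstream_gene_variant|MODIFIER|"
    "CLSTN1|ENSG00000171603.16_2|transcript|ENST00000377298.8_1|"
    "protein_coding||c.*6143G>A|||||4661|;AA=p.Y936Y,p.Y960Y",
])

for row in ann_rows(line):
    # One tab-separated output row per transcript-level annotation.
    print("\t".join([row["chrom"], row["pos"], row["gene"],
                     row["effect"], row["impact"], row["feature_id"]]))
```

Writing one row per ANN entry keeps the transcript-level detail, at the cost of repeating the variant position on several rows.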

I have gatk- and openjdk version "1.8.0_171".

