Filtering based on annotations under ANN within the INFO field

Hello!

I am brand new to GATK and have been assigned the task of filtering a VCF file. I have already succeeded in filtering on the AF value and other numerical fields. However, the remaining tags I need to filter on are included (or rather, buried) within an extremely long "ANN=..." field inside the INFO column, and I'm having trouble extracting the information from it.

Basically, what I need is something like
-filter "ANN=has_a_string_somewhere_inside_this_extremely_long_field" --filter-name "HAS_THIS_ANNOTATION"

An example line from my VCF:

chr1    9784423 COSM3751466;COSM3751467 C   T   1563.0  PASS    DP=223;AF=0.340807;SB=3;DP4=62,85,28,48;ANN=T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|ENST00000361110.6_1|protein_coding|21/23|c.2880C>T|p.Y960Y|2995/3508|2880/3207|960/1068||,T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|ENST00000536656.5_1|protein_coding|23/25|c.2880C>T|p.Y960Y|3088/5483|2880/3207|960/1068||,T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|ENST00000628140.2_1|protein_coding|22/24|c.2880C>T|p.Y960Y|3088/5483|2880/3207|960/1068||,T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|ENST00000377346.8_1|protein_coding|22/24|c.2808C>T|p.Y936Y|3003/5203|2808/3135|936/1044||,T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|ENST00000543390.2_1|protein_coding|22/24|c.2880C>T|p.Y960Y|2995/3503|2880/3207|960/1068||,T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|transcript|ENST00000377298.8_1|protein_coding||c.*6143G>A|||||4661|,T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|transcript|ENST00000477264.1_1|processed_transcript||n.*4663G>A|||||4663|,T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|transcript|ENST00000435891.5_1|protein_coding||c.*6143G>A|||||4663|WARNING_TRANSCRIPT_NO_START_CODON,T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|transcript|ENST00000361311.4_1|protein_coding||c.*6143G>A|||||4663|;AA=p.Y936Y,p.Y960Y;CDS=c.2808C>T,c.2880C>T;CNT=2,2;GENE=PIK3CD,PIK3CD_ENST00000536656;STRAND=+,+   GT:GQ:DP:AD 0/1:70:223:147,76

As you can see, there is a lot of information inside "ANN=", and what I would like to use in my filter expressions are tags like "synonymous_variant".
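To make the goal concrete, the matching could be sketched outside GATK in plain Python as follows. This is only an illustration of the logic, not a GATK feature: it assumes ANN follows the usual SnpEff layout (comma-separated records, "|"-separated subfields, with the effect term in the second subfield), and the function name ann_effects is made up for this example. The INFO string is a shortened version of the line above.

```python
def ann_effects(info):
    """Return the set of effect terms found in the ANN entry of an INFO string."""
    effects = set()
    for entry in info.split(";"):
        if not entry.startswith("ANN="):
            continue
        # Each comma-separated record describes one transcript.
        for record in entry[len("ANN="):].split(","):
            fields = record.split("|")
            if len(fields) > 1:
                # Subfield 2 is the effect; several effects may be joined with '&'.
                effects.update(fields[1].split("&"))
    return effects


# Shortened INFO column from the example line (two of the nine ANN records).
info = (
    "DP=223;AF=0.340807;"
    "ANN=T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|"
    "ENST00000361110.6_1|protein_coding|21/23|c.2880C>T|p.Y960Y|"
    "2995/3508|2880/3207|960/1068||,"
    "T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|"
    "transcript|ENST00000377298.8_1|protein_coding||c.*6143G>A|||||4661|;"
    "GENE=PIK3CD"
)

# True when any transcript of this variant carries the requested effect.
print("synonymous_variant" in ann_effects(info))
```

Matching on the parsed effect subfield rather than on the raw string avoids accidental hits where the search term happens to appear in another part of the record.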

I tried simply including it as a string, because that is the only way I have seen it done, like:

./gatk VariantFiltration -V my.snps.vcf -R ref.fasta -filter "ANN=='non_coding_transcript_exon_variant'" --filter-name "EXONIC" -filter "AF < 0.01" --filter-name "AF-FAIL" -O filtered_for_EXON_AF.vcf

I also tried a regular expression like ".*non_coding_transcript_exon_variant.*", but neither approach returns any results. When I use -invfilter, the filter name is added to every single line, so I guess I just need to find the right way to describe the string I'm searching for...

I assume the biggest issue is my lack of experience with JEXL, but I have yet to find a simple tutorial on how to describe such a value with these expressions, i.e. ANN=(any_context)STRING_I_WANT(any_context)...

I would also welcome any additional tips on how to deal with monstrous ANN fields like this in general, as I would also like to use SelectVariants on them and add the annotations separately to my future .table once I manage to filter the file.
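As an illustration of the kind of table meant here, a rough Python sketch that flattens ANN into one row per transcript could look like this. It is independent of GATK and rests on assumptions: the data line is tab-separated as in a real VCF, the leading ANN subfields follow the standard SnpEff order, and the names ANN_KEYS and ann_table_rows are hypothetical. The ANN records in the sample line are truncated for brevity.

```python
# Leading ANN subfields, assuming the standard SnpEff order.
ANN_KEYS = ["Allele", "Annotation", "Annotation_Impact", "Gene_Name",
            "Gene_ID", "Feature_Type", "Feature_ID", "Transcript_BioType"]


def ann_table_rows(vcf_line):
    """Yield one dict per ANN record of a tab-separated VCF data line."""
    cols = vcf_line.rstrip("\n").split("\t")
    chrom, pos, info = cols[0], cols[1], cols[7]  # INFO is the 8th column
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            for record in entry[len("ANN="):].split(","):
                # zip stops at the shorter input, so extra subfields are ignored.
                row = dict(zip(ANN_KEYS, record.split("|")))
                row["CHROM"], row["POS"] = chrom, pos
                yield row


# Sample data line with two truncated ANN records.
line = "\t".join([
    "chr1", "9784423", ".", "C", "T", "1563.0", "PASS",
    "DP=223;ANN=T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|"
    "transcript|ENST00000361110.6_1|protein_coding|21/23,"
    "T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|"
    "transcript|ENST00000377298.8_1|protein_coding|",
])

for row in ann_table_rows(line):
    print(row["CHROM"], row["POS"], row["Gene_Name"],
          row["Annotation"], row["Annotation_Impact"], sep="\t")
```

Each variant then contributes as many rows as it has annotated transcripts, which keeps the per-transcript detail that a single ANN column would hide.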

I have gatk-4.0.6.0 and openjdk version "1.8.0_171".
