Filtering based on annotations under ANN within the INFO field


I am brand new to using GATK and was assigned the task of filtering a VCF file provided to me. I have already managed to filter on the AF value and other numerical fields. However, the remaining tags I need to filter on are included (or rather, buried) within an extremely long "ANN=..." field inside the INFO column, and I'm having trouble extracting that information.

Basically, what I need is something like
-filter "ANN=has_a_string_somewhere_inside_this_extremely_long_field" -filter-name "HAS_THIS_ANNOTATION"

An example line from my VCF:

chr1    9784423 COSM3751466;COSM3751467 C   T   1563.0  PASS    DP=223;AF=0.340807;SB=3;DP4=62,85,28,48;ANN=T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|ENST00000361110.6_1|protein_coding|21/23|c.2880C>T|p.Y960Y|2995/3508|2880/3207|960/1068||,T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|ENST00000536656.5_1|protein_coding|23/25|c.2880C>T|p.Y960Y|3088/5483|2880/3207|960/1068||,T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|ENST00000628140.2_1|protein_coding|22/24|c.2880C>T|p.Y960Y|3088/5483|2880/3207|960/1068||,T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|ENST00000377346.8_1|protein_coding|22/24|c.2808C>T|p.Y936Y|3003/5203|2808/3135|936/1044||,T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|ENST00000543390.2_1|protein_coding|22/24|c.2880C>T|p.Y960Y|2995/3503|2880/3207|960/1068||,T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|transcript|ENST00000377298.8_1|protein_coding||c.*6143G>A|||||4661|,T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|transcript|ENST00000477264.1_1|processed_transcript||n.*4663G>A|||||4663|,T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|transcript|ENST00000435891.5_1|protein_coding||c.*6143G>A|||||4663|WARNING_TRANSCRIPT_NO_START_CODON,T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|transcript|ENST00000361311.4_1|protein_coding||c.*6143G>A|||||4663|;AA=p.Y936Y,p.Y960Y;CDS=c.2808C>T,c.2880C>T;CNT=2,2;GENE=PIK3CD,PIK3CD_ENST00000536656;STRAND=+,+   GT:GQ:DP:AD 0/1:70:223:147,76

As you can see, there is a lot of information inside "ANN=", and what I would like to use in filter expressions are tags like "synonymous_variant".
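To make the layout of the ANN field concrete, here is a minimal Python sketch that works outside GATK. It is not a JEXL expression and not a GATK feature. It assumes what the example line suggests: ANN entries are separated by commas, subfields by "|", and the effect term (such as "synonymous_variant") is the second subfield. The function names `parse_info` and `ann_effects` are hypothetical, and the INFO string is a shortened copy of the example above.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; keys without '=' map to ''."""
    out = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        out[key] = value
    return out


def ann_effects(info):
    """Collect the effect terms (second '|' subfield) of every ANN entry."""
    ann = parse_info(info).get("ANN", "")
    effects = set()
    for entry in ann.split(","):
        fields = entry.split("|")
        if len(fields) > 1:
            # Assumption: several effects on one entry are joined with '&'.
            effects.update(fields[1].split("&"))
    return effects


# Shortened INFO value built from two ANN entries of the example line.
info = (
    "DP=223;AF=0.340807;ANN="
    "T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|"
    "ENST00000361110.6_1|protein_coding|21/23|c.2880C>T|p.Y960Y|"
    "2995/3508|2880/3207|960/1068||,"
    "T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|"
    "transcript|ENST00000377298.8_1|protein_coding||c.*6143G>A|||||4661|"
)

print(sorted(ann_effects(info)))
# ['downstream_gene_variant', 'synonymous_variant']
print("synonymous_variant" in ann_effects(info))  # True
```

Testing the effect subfield rather than searching the whole ANN string avoids accidental hits in other subfields, at the cost of relying on the assumed field order.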

I was trying to simply include it as a string, because that's the only way I have seen it done, like:

./gatk VariantFiltration -V my.snps.vcf -R ref.fasta -filter "ANN=='non_coding_transcript_exon_variant'" --filter-name "EXONIC" -filter "AF < 0.01" --filter-name "AF-FAIL" -O filtered_for_EXON_AF.vcf

I also tried a regular expression like ".*non_coding_transcript_exon_variant.*", but neither approach produces any matches. When I use -invfilter, the filter name is added to every single line, so I guess I really just need to find a way to properly describe the string I'm searching for...

I assume the biggest issue could be my lack of experience with JEXL, but I have yet to find a simple tutorial on how to match such a value with these expressions, i.e. ANN=(any_context)STRING_I_WANT(any_context)...

I would also welcome any additional tips on how to deal with monstrous ANN fields like this in general, as I would also like to use SelectVariants on them and add them separately to my future .table once I manage to filter the file.
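As one possible way to get ANN values into separate table columns, the sketch below flattens a VCF record into one row per ANN entry and writes the rows as tab-separated text. It is plain Python, not a GATK tool. The column names in `ANN_COLUMNS` are an assumption based on the usual SnpEff-style layout and should be checked against the `##INFO=<ID=ANN...>` header line of the actual file. The helper names `ann_rows` and `to_tsv` are hypothetical.

```python
import csv
import io

# Assumed subfield order; verify against the ANN description in the VCF header.
ANN_COLUMNS = [
    "Allele", "Annotation", "Annotation_Impact", "Gene_Name", "Gene_ID",
    "Feature_Type", "Feature_ID", "Transcript_BioType", "Rank", "HGVS.c",
    "HGVS.p", "cDNA.pos/length", "CDS.pos/length", "AA.pos/length",
    "Distance", "ERRORS",
]
SITE_COLUMNS = ["CHROM", "POS", "REF", "ALT"]


def ann_rows(vcf_line):
    """Return one dict per ANN entry of a tab-separated VCF record."""
    cols = vcf_line.rstrip("\n").split("\t")
    site = dict(zip(SITE_COLUMNS, (cols[0], cols[1], cols[3], cols[4])))
    ann = ""
    for item in cols[7].split(";"):  # column 8 is INFO
        if item.startswith("ANN="):
            ann = item[len("ANN="):]
    rows = []
    for entry in ann.split(",") if ann else []:
        row = dict(site)
        row.update(zip(ANN_COLUMNS, entry.split("|")))
        rows.append(row)
    return rows


def to_tsv(rows):
    """Render the rows as tab-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=SITE_COLUMNS + ANN_COLUMNS,
                            delimiter="\t", restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Shortened record built from the example line (two of its ANN entries).
line = "\t".join([
    "chr1", "9784423", "COSM3751466;COSM3751467", "C", "T", "1563.0", "PASS",
    "DP=223;AF=0.340807;ANN="
    "T|synonymous_variant|LOW|PIK3CD|ENSG00000171608.15_2|transcript|"
    "ENST00000361110.6_1|protein_coding|21/23|c.2880C>T|p.Y960Y|"
    "2995/3508|2880/3207|960/1068||,"
    "T|downstream_gene_variant|MODIFIER|CLSTN1|ENSG00000171603.16_2|"
    "transcript|ENST00000377298.8_1|protein_coding||c.*6143G>A|||||4661|",
    "GT:GQ:DP:AD", "0/1:70:223:147,76",
])

rows = ann_rows(line)
print(to_tsv(rows))
```

With one row per transcript annotation, the resulting table can be filtered on the Annotation or Annotation_Impact column using ordinary tools, although a variant then appears on several rows.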

I have gatk- and openjdk version "1.8.0_171".

