The current GATK version is 3.2-2

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

# What are JEXL expressions and how can I use them with the GATK?

edited July 2013 in FAQs

### 1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.

### 2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.

JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:

"QUAL > 30.0"

• QUAL is a key: the name of the annotation we want to look at
• 30.0 is a value: the threshold that we want to use to evaluate variant quality against
• > is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"


### 3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"


You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"


where && is the logical "AND".

Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"


where || is the logical "OR".

### 4. Important caveats

#### Sensitivity to case and type

• Case

Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.

• Type

The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).

#### Complex queries

We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.

### 5. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

#### Accessing the underlying VariantContext directly

If you are familiar with the VariantContext, Genotype and its associated classes and methods, you can directly access the full range of capabilities of the underlying objects from the command line. The underlying VariantContext object is available through the vc variable.

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'


Groovy, right? Now here's a more sophisticated example of JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, is not filtered, and is non-reference in 01-0263 sample:

! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263


#### Using the VariantContext to evaluate boolean values

The classic way of evaluating a boolean goes like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'


But you can also use the VariantContext object like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'


### 6. Using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'

Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Tagged:

• Posts: 335Member, GSA Collaborator ✭✭✭

For those trying to access the underlying VariantContext, the class of interest is org.broadinstitute.sting.utils.variantcontext.VariantContext (and Genotype in the same package). Source is here: https://github.com/broadgsa/gatk/blob/master/public/java/src/org/broadinstitute/sting/utils/variantcontext/VariantContext.java

• Posts: 18Member

You state "It is very important to note that the JEXL evaluation subprogram cannot correctly handle cases where the annotations requested by the JEXL expression are missing for some variants in a VCF record". GaTK documentation says there is a standard set of annotations output by Unified Genotyper but when I view my VCF file - not all the standard annotations are there for all variants (e.g. ReadePosRankSum). Is this because they can't be computed? Is there a way to add them back in?

It is possible that that some annotations were left out because they could not be computed. Usually if they could not be computed in the original run, it is not possible to add them afterwards either.

Geraldine Van der Auwera, PhD

• Posts: 64Member ✭✭
edited April 2013

Hi

Is the source code for VariantContext no longer available? Either way is there a JAVA doc page or similar for VariantContext? As I'm looking at building some fairly complex filters using SelectVariants and it would be very useful to find a full list of the various fields that can be queried from vc.

Otherwise can you answer these questions:

Is there a wildcard operator or some other way of specifying multiple samples say if I want to keep all samples that have a GQ score of greater than 20 & a AD where the Alt is >= 2?

Post edited by aeonsim on

VC has been transferred to a different dev team, see this doc article for instructions to access their public repo:

Geraldine Van der Auwera, PhD

• Posts: 9Member

I was just testing VariantFiltration with some files that have missing annotations and was suprised to have the filter expression "MQ < 60.0 || QD < 2.0" evaluate correctly (all the variants with MQ < 60.0 were filtered) even though QD was not defined for some variants. I also tested the expression with a file where no QD values were present and QD was not defined in the header and had the same result.

Command run:

java -Xmx2g -jar GenomeAnalysisTK.jar -T VariantFiltration -R ucsc.hg19.fasta --variant test.vcf
--filterExpression "MQ < 60.0 || QD < 2.0" --filterName "FAIL" -o ./test.hardfiltered.vcf


Subset of results (chromosome locations/genotypes removed):

1 C T   8249.77 PASS    AC=2;AF=1.00;AN=2;DP=285;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=28.95
2 G A   7401.77 FAIL    AC=2;AF=1.00;AN=2;DP=137;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.00;MQ0=0;QD=54.03
3 G A   3445.77 FAIL    AC=2;AF=1.00;AN=2;DP=120;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=1.44
5 A T   5440.77 FAIL    AC=2;AF=1.00;AN=2;DP=188;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.00;MQ0=0;QD=0.94
6 T C   34  PASS    DP=4;DP4=1,1,1,1;FQ=32.3;MQ=60;PV4=1,1,1,1;VDB=0.0073
7 C T,A 66.30   PASS    DP=6;DP4=0,0,6,0;FQ=-42;MQ=60;VDB=0.0000
8 G C   32.80   FAIL    DP=2;DP4=0,0,2,0;FQ=-33;MQ=43;VDB=0.0066
9   G   A   32.80   FAIL    DP=2;DP4=0,0,0,2;FQ=-33;MQ=51;VDB=0.0066


Is this caveat no longer applicable in GATK 2.4-9 or did I misunderstand the meaning of the caveat?

Ack, not a JEXL expressions question ;-)

I'm not sure -- I think your interpretation is correct but JEXL always keeps me guessing. I'm consulting the team and will try to get you a definitive answer asap about the expected behavior. On the bright side it looks like you're getting the right answers, so that's a good problem to have.

Geraldine Van der Auwera, PhD

• Posts: 9Member

Definitely a good problem to have! Sorry- I wasn't sure where to put the question and it was prompted by the caveat in this post, so I thought it might belong here. Thanks for your help!

No worries, this is as good a place as any :)

Geraldine Van der Auwera, PhD

OK, it seems that the behavior as been changed so that the caveat no longer applies -- will check and edit the docs as appropriate. Thanks for pointing this out!

Geraldine Van der Auwera, PhD

• Posts: 31Member

Is there a list of available keys available somewhere? For VCF walkers, you can look at your VCF file and see what is available. However, I am using SomaticIndelDetector and I am not sure what my options are except for the one example that is listed.

I realize SomaticIndelDetector is no longer part of GATK, but it's still there in GATK 2.3.

We do not support SomaticIndelDetector anymore, please contact the Broad's cancer group directly. They have a new tool that has substituted it entirely.

• Posts: 31Member

@Carneiro said: We do not support SomaticIndelDetector anymore, please contact the Broad's cancer group directly. They have a new tool that has substituted it entirely.

So there is no old list of available keys available somewhere?

What is the BCG tool called? I searched their software page, but I could not find anything that looks like SomaticIndelDetector.

I think the new tool is called Indelocator

• Posts: 31Member

@Carneiro said: I think the new tool is called Indelocator

It looks like that is a special version of GATK that includes Indelocator, packaged by the cancer group. We'll work with them to clarify this; in the meantime please ask them for documentation and help running it.

Geraldine Van der Auwera, PhD

• Posts: 8Member

Im trying to filter indels of 20 and greater length from a vcf file. I have an example from a forum but it retrieves indels under 20 length.

What does "vc.getIndelLengths().0" do? Does it just get the length of an indel because changing the "<" around to get those greater than 19 returns no results when I know there are indels of 19 and above.

-select 'vc.getIndelLengths().0 > -19 && vc.getIndelLengths().0 < 19'

• Posts: 8Member

Got it thanks, insertions larger than 19 but smaller than 1000 -select 'vc.getIndelLengths().0 > 19 && vc.getIndelLengths().0 < 1000'

• Montreal, QCPosts: 15Member
edited September 2013

Hello there, I'm working with this tool presently and I'm trying to extract all the polymophic sites from one of my sample. I'm completely able to do it for biallelic sites, but when I have triallelic sites, I'm not able to find the right expression to bypass the problem.

Here's my JEXL expression :

The one that is working with biallelic sites :

java -Xmx1g -jar /home/apps/Logiciels/GATK/GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T SelectVariants -R Ref.fa --variant INPUT.vcf -select 'vc.getGenotype("SAMPLE").getAD().0 > 1 && vc.getGenotype("SAMPLE").getAD().1 > 1' -o test.vcf


The one that I'm trying to do :

java -Xmx1g -jar /home/apps/Logiciels/GATK/GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T SelectVariants -R Ref.fa --variant INPUT.vcf -select '(vc.getGenotype("SAMPLE").getAD() > 2 && (vc.getGenotype("SAMPLE").getAD().1 > 1 && vc.getGenotype("SAMPLE").getAD().2 > 1) || (vc.getGenotype("SAMPLE").getAD().0 > 1 && vc.getGenotype("SAMPLE").getAD().2 > 1)) || vc.getGenotype("SAMPLE").getAD() == 2 && vc.getGenotype("SAMPLE").getAD().0 > 1 && vc.getGenotype("SAMPLE").getAD().1 > 1' -o test.vcf


So, it's more like a if/else case, could this be possible to do?

Thanks a lot.

Post edited by Geraldine_VdAuwera on
• Posts: 335Member, GSA Collaborator ✭✭✭

I would use the built-in --excludeNonVariants option rather than JEXL. -sn SAMPLE -env will give you only the variants that are polymorphic in SAMPLE, if you want the data for all samples across those variants I think you could do a second SelectVariants across the original file with -L polymorphicInSAMPLE.vcf

• Montreal, QCPosts: 15Member

Would it give me as well ref/ref calls where I have at least a coverage of one on the alternative allele?

Thanks

• Montreal, QCPosts: 15Member

I just tried the command but do not give me what I'm exactly looking for. For sur it only gives me variant sites. But what is different in my case is that I want to extract positions where the positions could be whatever genotype (I'm doing my calling on multiple samples, so 1 sample can be 0/0 ). The additionnal filtering step that I need is that the site itself must contain at least one of the two allele ( so DP >=2 ). The problem that I have resides in the fact that I'm working with bacterias (mutation rate higher) so I have multiple sites with more than 2 alleles. Sometimes, I have only allele 1 and 2 giving me 1/2 genotypes and it's causing problems with my JEXL command. It's causing problems because of the size of the array itself that I want to look at (on biallelic positions).

Hi @JCGrenier,

Could you please give a concrete example to show what you want to do? E.g. post a few variant records and point out which ones you want to extract vs. not extract.

Geraldine Van der Auwera, PhD

• Posts: 335Member, GSA Collaborator ✭✭✭

I'm afraid I don't understand what you're trying to do. DP refers to the number of reads covering a site, it has nothing to do with alleles. And -env should capture the 1/2 case without any problems. If you want to consider sites that are polymorphic in one of several samples, you'd specify additional -sn parameters.

• Montreal, QCPosts: 15Member
edited September 2013

I'll give you a concrete example only giving you GT:AD (I don't care about the other terms).

Here's a list of calls :

1 0/0:10,0
2 0/0:9,1
3 0/0:9,0,1
4 0/0:9,1,0
5 0/1:3,7
6 0/1:3,6,1
7 1/1:0,10
8 1/1:1,9
9 1/1:0,9,1
10 0/2:4,0,6
11 1/2:0,4,6
12 2/2:0,0,10
13 2/2:0,1,9
14 2/2:1,0,9


So, what I want to extract here are the positions 2, 3, 4, 5, 6, 8, 9, 10, 11, 13 and 14 where there's a sign of polymorphism within my sample. I don't care about the other ones for the moment. And as the SNP calling was done using multiple samples, that's why I can have such cases as the position #4.

So my first command line that I showed you in the other post was able to catch all the ones with a coverage of at least 1 in the two first alleles.

Post edited by Geraldine_VdAuwera on
• Montreal, QCPosts: 15Member

Hi folks,

I end up using awk to find out the lines that I want. Here's the command that I used :

awk '{ct=0; split($10, geno, ":") ; split(geno[2], AD, ","); for (i in AD) if(AD[i]>=1) ct++; if ($0~/^#/||ct>=2) print}' INPUT.vcf > OUTPUT.polymorph.vcf

For sure, it works only if I have one sample in my vcf file. It would have been tricky to do so using multiple samples anyway.

Thanks for your help! I'm interested if you find the solution using GATK anyway!

• Posts: 2Member

"To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial."

I would actually love to see this happen since I am currently struggling to make this variant context work. Is this still somewhere on the TODO list?

Hi @FrancoisW,

Yes, it is still very much planned, and we are moving closer to finding time to do it. Hang on in there :)

Geraldine Van der Auwera, PhD

• NetherlandsPosts: 29Member

I'd appreciate more explaination or a long list of good examples(that do not produce warnings for too many of your variants.... for troubleshooting). For example how do you check for the an array.index(vc.hasAttributes('RPA').2) returning a valid value (at least not undefined)? This could be used for many uses the most important is getting data out of multiallelic variants. just like "vc.hasAttributes('RPA') && vc.hasAttributes('RPA').2.something()" will first check if the variable is actually present and then will check for the index number being existent.

Alright, it's on the wishlist.

Geraldine Van der Auwera, PhD

• NetherlandsPosts: 29Member
edited February 12

##### ERROR ------------------------------------------------------------------------------------------

@Geraldine_VdAuwera what do you think of this???ps it is still present in gatk 2.8.1

• Posts: 39Member

@Geraldine_VdAuwera said: You're hitting multi-allelic variants. For multi-allelic variants the AF is not a double - it's a list of doubles. So the expression "AF > 0.0057" is invalid in those cases. You'll need to add a condition to only process variants with 1 ALT allele. Unfortunately there is currently no way to filter over AF for multi-allelic variants.

I see. Thank you for the explanation

@mmterpstra I am using 2.8-1

@Geraldine_VdAuwera Is there somewhere I could take a look at the vc class methods? I only know what is available to me based on these posts. It seems my next course of action needs to be separating out the Multiallelic SNPs as suggested and interrogating them manually (ie. make a student do it MWAHAHAHAHA). What would me the class method equivlent of --restrictAllelesTo BIALLELIC? Thank you very much.

Hi @bwubb,

We don't have explicit docs for that so the quickest way to find out is to look at the code, which is available on github.

my next course of action needs to be separating out the Multiallelic SNPs as suggested and interrogating them manually (ie. make a student do it MWAHAHAHAHA)

Geraldine Van der Auwera, PhD

• Posts: 194Member

hi, Geraldine, I like your MWAHAHAHAHA! :) And thank you so much for your patience in answering all those questions. I wonder if you can help me with this. Here is a line of entry in dbSNP vcf format: 1 54353 rs140052487 C A . . RS=140052487;RSPOS=54353;dbSNPBuildID=134;SSR=0;SAO=0;VP=0x050000000001000016000100;WGT=1;VC=SNV;KGPhase1;KGPROD;OTHERKG;CAF=[0.9927,0.007346];COMMON=1

How to extract those with the 2nd value of CAF > 0.01?

Thanks!

Hi @blueskypy,

I don't know a way to do that directly with JEXL, if that's what you're asking (although others may); I would probably dump the info to a table using VariantsToTable at that point, and query that in R.

Geraldine Van der Auwera, PhD

• NetherlandsPosts: 29Member
edited March 11

I just copied '(vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP()' from the manual above

and ran a minute or less and then the program screamed and died:

ERROR MESSAGE: Invalid JEXL expression detected for select-0 with message ![50,59]: '(vc.getID() == null || vc.getID().equals('.')) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP();' > error

from my post on 21 februari shouldn't this be fixed in the Documentation/Code?

Post edited by mmterpstra on

@mmterpstra, do you mean to make a feature request to get a warning instead of a blocking error when the AF query is made on multiallelic sites?

Geraldine Van der Auwera, PhD

• NetherlandsPosts: 29Member

@Geraldine_VdAuwera‌: That would be sensible because then the AF is stil usable. The other solution will be to update the current docs instructing first to filter out the multiallelic variants. Thinking back on the comment this is a question of beginner friendliness v.s. user safety.

Thinking back on the comment this is a question of beginner friendliness v.s. user safety.

I agree, that's a fair point. I'll see what we can do to make this better.

Geraldine Van der Auwera, PhD

• Posts: 8Member

How to filter depending on the GT Attribute? I try: java -Xmx30G -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg19.fasta --variant variants.vcf -o variants_filtered --genotypeFilterExpression "vc.getGenotype().equals('1/1')" --genotypeFilterName "homozygote" but this Warning occur: WARN Interpreter - ![3,16]: 'vc.getGenotype().equals('1/1');' attempting to call method on null

Thanks.

• NetherlandsPosts: 29Member
edited April 11

did you read the manual? This is a plain copy/paste from the manual:'vc.getGenotype("NA12878").isHomRef()' . where "NA12878" is your genotype that's is present in your vcf for that genotype. '.isHomRef()' soft filters out the homozygotes and only works on a single genotype.

Post edited by mmterpstra on
• Posts: 2Member

The only way I found to filter on AC (which is a list/array of integers) is:

--filterExpression "vc.getAttribute('AC').equals('40')" --filterName AChigh

I have tried many combinations and this is the only one realy working without warning!

If it could help...

Frédéric BIGEY, PhD

• Posts: 8Member

@mmterpstra I read the manual many times, but i didn't recognize how to use isHomRef and at the first trials I overread it sorry. I try it out in this form: isHomRef() but it didn't work well. Now I can use this filter, after some trials with isHomRef == 1. 'Thank you for the hint.

• NetherlandsPosts: 29Member
edited April 29

Explanation of code: 'vc.getGenotype("NA12878").isHomRef()'

The object that stores the information concerning the variant (also known as 'Variant Context'):

 vc


The next piece of code extracts the variant information specific to the genotype (The genotype fields probably, or a new vc object with only a single genotype):

 .getGenotype("NA12878")


Then the following will test that single genotype if it is homozygous for the reference allelle by extracting the 'GT' field from the 'vc..getGenotype("NA12878")' statement and comparing this with the actual homref genotype. This will return a boolean value TRUE or FALSE. also it migth have some checks that the input it is a single genotype and thus may only work for a single genotype.

 .isHomRef()


A small note: Object Oriented code is somewhat difficult to write at the terminal that's why the previous statements are prone to errors. I'm no moderator and my java skills are only basic. This I learned from reading a java Object Oriented tutorial, letting the GATK crash :) and manual variant evaluation. for every question here most of them i did not write any factual testing unless otherwise noted.

Post edited by mmterpstra on
• MontpellierPosts: 1Member

Hi everyone, I'm wondering if there is a wildcard to deal with ALL (or ANY) samples at the same time ?

1) I'd like to discard variants for which each sample has a missing genotype (./.) !! Because I removed others samples. 2) I'd like also to keep only variant having no missing genotypes at all.

I can't think of a straightforward way to do that, sorry; I'll ask the team if they have any ideas.

Geraldine Van der Auwera, PhD

@laRfet3

So, the recommendation from the team is to run the VariantAnnotator with -A ChromosomeCounts. That will update the AC and AN (and AF) fields in the INFO section. Then you can easily use JEXLs like: AC==0 (case #1) and AN != x (case #2, where x is equal to 2 x the number of samples).

Geraldine Van der Auwera, PhD

• Posts: 39Member

Is there an annotation I could use for selection which is simply the non-reference frequency across samples? I am trying to extract variants that are seen in at least 95% of my vcf samples? I do not care if they are het or hom_alt, but annotations such as allele count might not be applicable (tumor data and whatnot). Thank you.

@bwubb I would dump AD and DP to a table and query them in R...

Geraldine Van der Auwera, PhD

• Posts: 39Member

@Geraldine_VdAuwera So once more I must work with my arch-nemesis...

Hah :) Personally, I dislike R less than JEXL...

Geraldine Van der Auwera, PhD

• Posts: 39Member

Agreed. I ended up using old reliable python and hopefully I did things adequately (quick and dirty).

Line by line parsing; incrementing counts for each 0/1 or 1/1 and dividing by the total samples; write to output if >= 0.95