Using JEXL to apply hard filters or select variants based on annotation values

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,857Administrator, Dev admin
edited March 16 in Methods and Algorithms

1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.


2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.

JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:

"QUAL > 30.0"
  • QUAL is a key: the name of the annotation we want to look at
  • 30.0 is a value: the threshold that we want to use to evaluate variant quality against
  • > is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"

3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"

You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"

where && is the logical "AND".

Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"

where || is the logical "OR".


4. Filtering on sample/genotype-level properties

You can also filter individual samples/genotypes in a VCF based on information from the FORMAT field. Variant Filtration will add the sample-level FT tag to the FORMAT field of filtered samples. Note however that this does not affect the record's FILTER tag. This is still a work in progress and isn't quite as flexible and powerful yet as we'd like it to be. For now, you can filter based on most fields as normal (e.g. GQ < 5.0), but the GT (genotype) field is an exception. We have put in convenience methods to enable filtering out heterozygous calls (isHet == 1), homozygous-reference calls (isHomRef == 1), and homozygous-variant calls (isHomVar == 1).


5. Important caveats

Sensitivity to case and type

You're probably used to case being important (whether letters are lowercase or UPPERCASE) but now you need to also pay attention to the type of value that is involved -- for example, numbers are differentiated between integers and floats (essentially, non-integers). These points are especially important to keep in mind:

  • Case

Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.

  • Type

The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).

Complex queries

We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.


6. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

Introducing the VariantContext object

When you use SelectVariants with JEXL, what happens under the hood is that the program accesses something called the VariantContext, which is a representation of the variant call with all its annotation information. The VariantContext is technically not part of GATK; it's part of the variant library included within the Picard tools source code, which GATK uses for convenience.

The reason we're telling you about this is that you can actually make more complex queries than what the GATK offers convenience functions for, provided you're willing to do a little digging into the VariantContext methods. This will allow you to leverage the full range of capabilities of the underlying objects from the command line.

In a nutshell, the VariantContext is available through the vc variable, and you just need to add method calls to that variable in your command line. The bets way to find out what methods are available is to read the VariantContext documentation on the Picard tools source code repository (on SourceForge), but we list a few examples below to whet your appetite.

Examples using VariantContext directly

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'

Groovy, right? Now here's a more sophisticated example of JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, is not filtered, and is non-reference in 01-0263 sample:

! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263

Examples using the VariantContext to evaluate boolean values

The classic way of evaluating a boolean goes like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'

But you can also use the VariantContext object like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'

Example using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'
Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,857Administrator, Dev admin
  • shangqianxieshangqianxie chinaPosts: 3Member

    Hi Geraldine_VdAuwera,
    Thanks for the detail about JEXL expression .I used SelectVariants -select 'vc.getGenotype("SampleN").isHet' or .isHomVar for selecting the known genotype like "0/1 or 1/1". But in my VCF file, there are some missing genotype like "." , and I also want to select those unknow genotype. Can you give me some comments for this. Thanks.

    Best,
    Xie

  • SheilaSheila Broad InstitutePosts: 3,084Member, Broadie, Moderator, Dev admin

    @shangqianxie‌

    Hi Xie,

    To select the unknown genotypes (./.) you can select for anything that is not homref, het, or homvar.

    -Sheila

  • shangqianxieshangqianxie chinaPosts: 3Member

    Hi Sheila, thank you for you timely reply. I select by vc.getGenotype('sample08').isnotHomVar() ..., but there is error:
    "Invalid JEXL expression detected for select-0 with message ![72,85]: vc.getGenotype('sample08').isnotHomVar() && vc.getGenotype('sample08').isnotHet();' unknown, ambiguous or inaccessible method isnotHomVar"
    I think the problem is incorrect JEXL expression. So are there some JEXL expressions for select setting ?
    Thanks,
    Xie

  • SheilaSheila Broad InstitutePosts: 3,084Member, Broadie, Moderator, Dev admin

    @shangqianxie‌

    Hi Xie,

    Instead of using vc.getGenotype('sample08').isnotHomVar(), try using
    ! vc.getGenotype('sample08').isHomVar(). The ! means not.

    -Sheila

  • shangqianxieshangqianxie chinaPosts: 3Member

    Hi Sheila,
    Thanks again for your reply, It makes sense.

    Xie

  • jullflyjullfly CanadaPosts: 6Member
    edited January 2015

    Hi,
    I want to be able to select the sites that have a DP >= 5 for all my samples. I have 24 samples, all of them "C??" (i.e. C01, C04, C12, etc). I tried using the * wildcard character like this: 'vc.getGenotype("C*").getDP() >= 5' , but this gave me an error. Is there a way to do this, or will I have to write the command individually for each sample (separated by &&)? Thanks for the help.

    Post edited by jullfly on
  • SheilaSheila Broad InstitutePosts: 3,084Member, Broadie, Moderator, Dev admin

    @jullfly‌

    Hi,

    Unfortunately, you cannot iterate over sample names using the variant context operations. You will have to make a script to run on each sample name.

    -Sheila

  • jullflyjullfly CanadaPosts: 6Member

    Thanks for your response @Sheila. It's ok, I can script it for all sample names but it's just not as neat and elegant of a code as it would be if there was a way to iterate through. Oh well, it will do what I need it to do nevertheless!

  • jullflyjullfly CanadaPosts: 6Member
    edited January 2015

    Hi, sorry but I have another question. I keep getting errors with this filtering expression, and I can't figure out what I am doing wrong.
    (Note: this is a snippet of my script, $java, $GATK, and $ref variables are defined elsewhere in the script).

    $java -Xmx2g -jar $GATK \
    -T SelectVariants \
    -R $ref \
    --restrictAllelesTo BIALLELIC \
    --variant Variants.SNPs.filtered1.vcf.C01.5.vcf \
    --select_expressions 'vc.getGenotype("C01").isHet()' \
    --select_expressions '((double)(vc.getGenotype("C01").getAD().0) / (vc.getGenotype("C01").getAD().1 + vc.getGenotype("C01").getAD().0)) > 0.25' \
    --select_expressions '((double)(vc.getGenotype("C01").getAD().0) / (vc.getGenotype("C01").getAD().1 + vc.getGenotype("C01").getAD().0)) < 0.75' \
    -o Variants.Het.SNPs.C01.5.vcf
    

    This is the error message:
    java.lang.IllegalArgumentException: Argument select-2has a bad value. Invalid expression used.

    I double-checked my casting rules in Java for casting the int of AD into a double, and it seems fine to me. I even tested an analogous situation with simple numbers and it works.

    boolean x = ((double) (1) / (2+3)) < 1; //x is true
    

    Help would be greatly appreciated!

    Post edited by Geraldine_VdAuwera on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,857Administrator, Dev admin

    Hmm, I'm not sure JEXL allows you to do type casting -- I've never seen it used in a JEXL expression. To be honest I would probably just dump the data using VariantsToTable and do the selection in R, which is more flexible for complex queries.

    Geraldine Van der Auwera, PhD

  • jullflyjullfly CanadaPosts: 6Member

    Thanks @Geraldine_VdAuwera, I will keep that option in mind as I continue filtering my variants. But actually I was able to get the above script to work by converting it to a float instead of a double. Others may find this useful:
    vc.getGenotype("C01").getAD().0.floatValue()

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,857Administrator, Dev admin

    Oh nice, glad to hear you got it to work. And thanks for reporting your solution for other users' benefit.

    Geraldine Van der Auwera, PhD

  • mmterpstrammterpstra NetherlandsPosts: 40Member

    Because this has moved a few times since my last post.

    For people who can read (a little) java, here is the link to the implementation:
    https://github.com/samtools/htsjdk/blob/c63464d69dea17b684257b7d981a302126871355/src/java/htsjdk/variant/variantcontext/VariantContext.java#

    I'm not sure this link works, so alternatively go to:
    https://github.com/samtools/htsjdk/ and find it below /src/java/htsjdk/variant/variantcontext/VariantContext.java

    It has all the cool stuffs like vc.isPolymorphicInSamples() vc.isIndel()

  • newbie16newbie16 Posts: 38Member
    edited September 2015

    Hi

    I have a vcf which in FORMAT field has a tag HQ which is an array of integers (e.g. 40,200)

    CHROM POS ID REF ALT QUAL FILTER INFO FORMAT sampleIdGoesHere
    1 815233 . TC CA 540 PASS AN=2;AC=1;RPM=L1M3|L1|31.4;SEGDUP=8 GT:VT:FT:HFT:GQ:HQ:PL:DP:AD 0/1:ref,delins:PASS:PASS,PASS:341:341,488:540,0,364:249:46,200
    1 815521 . T . 358 PASS AN=2;END=815522 GT:VT:FT:HFT:GQ:HQ:DP:AD:PS 0|0:ref,ref:PASS:PASS,PASS:358:358,410:91:91:815520

    I want to filter records such that either of the HQ values are >200
    Is there a way to do that ?

    Post edited by newbie16 on
  • SheilaSheila Broad InstitutePosts: 3,084Member, Broadie, Moderator, Dev admin
    edited September 2015

    @newbie16
    Hi,

    I think using JEXL should work. Have a look at this document: http://gatkforums.broadinstitute.org/discussion/1255/variant-filtering-methods-involving-jexl-queries The last section "Example using JEXL to evaluate arrays" should help.

    -Sheila

    Post edited by Sheila on
  • bbimberbbimber HomePosts: 32Member

    Hello,

    I'd like to perform site-level select or filtering on a VCF, using a criteria based on genotype attributes. Specifically, I'd like do things like select/filter sites where at least one genotype has depth > X, or where all genotypes have depth < Y. There is a post above about evaluating arrays, which I think is analogous to what I'm after. However, that example seems to be subsetting a specific genotype: 'vc.getGenotype("NA12878").getAD().0 > 10', as opposed to doing some sort of loop or aggregate (for example, max(DP) could accomplish what I need). Is this sort of thing possible? Thanks in advance.

    -Ben

  • SheilaSheila Broad InstitutePosts: 3,084Member, Broadie, Moderator, Dev admin

    @bbimber
    Hi Ben,

    Unfortunately, you will need to to write your own script to do what you want. The evaluating arrays section refers only to evaluating annotations that are arrays of values (like AD).

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,857Administrator, Dev admin

    @bbimber As a workaround I would recommend looking into VariantsToTable to output a table of the properties you're interested in, set up your evaluation logic in an Rscript (or a civilized scripting language) and use that to output a list of positions that you want to filter or keep, then use that to SelectVariants the subset that you want. Bit roundabout but less risky than scripting something to parse the VCF directly, in my opinion.

    Geraldine Van der Auwera, PhD

  • bbimberbbimber HomePosts: 32Member

    That last comment about AD actually brings up another question. I am hoping to apply genotype filters based on AD. I was trying something like:

    -G_filter "isHet && g && g.hasAd() && g.getAD().1 < 5"

    I would have assumed the Genotype was in scope (like vc); however, after reviewing the code, I dont think that's true. Could I confirm that?

    Second, I think AD gets omitted. From what I can see, when the Genotype is put into the JEXLMap, a handful of fields are explicitly added, and then it defers to getExtendedAttributes(). However, if you look at FastGenotype, AD wont be part of getExtendedAttributes():

    https://github.com/samtools/htsjdk/blob/bd92747fd3672635de96473fea6d4b38e8635c8e/src/java/htsjdk/variant/variantcontext/JEXLMap.java

    Is this an oversight?

    Issue · Github
    by Sheila

    Issue Number
    705
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,857Administrator, Dev admin

    The syntax for getting at AD values (e.g. -select 'vc.getGenotype("NA12878").getAD().0 > 10') is given in the document above, see the last point in the doc.

    Geraldine Van der Auwera, PhD

  • bbimberbbimber HomePosts: 32Member

    sorry if the post wasnt clear, but i'm asking about genotype filters, not site-level filters. As noted above, from what I see in the code, the JEXLMap is populated through different codepaths.

    doing a genotype-level filter on a hard-coded samplename wouldnt make sense, for example. i'm pretty sure 'vc' isnt put in scope for the genotype filters either.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,857Administrator, Dev admin

    Ah, I'm afraid we don't currently provide support for using advanced JEXL in genotype filters, sorry. It's just not something we use internally, so we've never explored this. What's currently on this page is all we can offer to help with at this time. If you figure out a way to do what you want, please share it so that others may benefit.

    Geraldine Van der Auwera, PhD

  • bbimberbbimber HomePosts: 32Member

    yeah, that's understandable. i wonder if you can answer this: the relevant code here is under HTS-JDK. It wouldnt be hard to submit a pull request to improve the situation (maybe genotype filters use a JEXLMap more like site filters). is this worth my time? will gatk4 use a similar or a completely different codepath?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,857Administrator, Dev admin

    I can't judge whether it would be worth your time because I don't know how much this capability is worth to you, but I can reassure you that GATK4 will continue to use htsjdk and use the same codepaths to enable JEXL queries for the foreseeable future. And if you do submit this with an example of how it can be leveraged from GATK, we will support your pull request (sometimes things get lost in htsjdk).

    Geraldine Van der Auwera, PhD

  • MatteodiggMatteodigg ItalyPosts: 18Member

    Hi,
    I have a question for you. I have a multi-sample vcf. I'd like to apply a filter where a variant PASS if at least one sample in that position has AD/DP > 0.2
    Is there any way to do it?

    Thanks for your help and your great job!

    Matteo

  • bbimberbbimber HomePosts: 32Member

    One slightly awkward workaround might be to do something like:

    1) in the first pass apply a genotype filter. this could work with the caveat that I'm not sure you can get AD (see my above post)
    2) use setFilteredGenotypeToNoCall (or whatever the parameter is called)
    3) in pass 2, apply a site filter based on # samples called, # not-called, etc.

    Or as a one off, I think you could use a JEXL expression that hard codes all your samples names and strings together a big OR expression:

    -select 'vc.getGenotype("Sample1").getAD().0 / vc.getGenotype("Sample1").getDP() > 0.2 || vc.getGenotype("Sample2").getAD().0 / vc.getGenotype("Sample2").getDP() > 0.2 ... etc. '

    I didnt check my JEXL syntax on the above, so it probably needs to be touched up. Hopefully the general idea makes sense.

  • MatteodiggMatteodigg ItalyPosts: 18Member

    @bbimer,
    thank you for your tip, but it doesn't work... Using that JEXL expression, all my sites results as FILTER and I have problems with homo GT field like 1/1 or 2/2. Just to be clear, I'd like to tag a site in my vcf as PASS if at least one sample has AD/DP > 0.2 for a variant, otherwise, FILTER.
    Any other suggestions?

    Thanks,

    Matteo

  • bbimberbbimber HomePosts: 32Member

    Matteo - sorry, that expression was just typed into the browser, not tested in any way. a couple guesses:

    1) if you want the first alternate allele, should it be AD.1, not .0?
    2) you might want to include checks like isHet or isNoCall. A homozygous ref animal will only have one value for AD, i think.
    3) could you rewrite your ratio to be based off the reference depth? every genotype should have a value for AD.0.
    4) I think JEXL is pretty tolerant to null values (for example, asking for AD.1 on a homozygote), but I am not all that familiar with how these are handled. I'd just be careful. In my experience GATK is pretty verbose about logging issues w/ JEXL filters.

    good luck.

  • bbimberbbimber HomePosts: 32Member

    here's the PR for the change to improve genotype filtering. this should allow proper filtering on AD and other genotype-level attributes:

    https://github.com/samtools/htsjdk/pull/532

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,857Administrator, Dev admin

    Got it, thanks @bbimber.

    Geraldine Van der Auwera, PhD

  • bbimberbbimber HomePosts: 32Member

    BTW - you asked for an example. The key differences are:

    1) you can access the genotype object in genotype filters:

    -G_filter "isHet && g.hasAd() && g.getAD().1 < 5"

    in my opinion, this is much more useful than the above example for site filters that hard-codes specific sampleIDs. When you want to ask question like 'filter sites where X% of genotypes have Y', you could do one pass where you perform genotype filters and setFilteredGtToNoCall. after this, do a site filter on N_CALLED.

    2) It fixes when I think was a bug in the original implementation, which should have omitted AD + PL from the JEXL map. This is because they were not included in Genotype.getExtendedAttributes().

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,857Administrator, Dev admin

    I agree that makes a lot of sense.

    I would recommend include this example explicitly in your pull request thread as it will help convince the gatekeepers of htsjdk that this is worthwhile, in case they don't look up this thread.

    Geraldine Van der Auwera, PhD

  • lindenblindenb Posts: 31Member

    FYI : there is now a javascript-based engine in picard FilterVcf . It's, IMHO, much more powerful than JEXL . See :

    https://github.com/samtools/htsjdk/blob/master/testdata/htsjdk/variant/variantFilter02.js

    for an example.

    Issue · Github
    by Sheila

    Issue Number
    732
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,857Administrator, Dev admin

    @lindenb Ah I saw that go through into htsjdk but must have missed the Picard FilterVcf update. That does look very powerful (thanks for contributing that!) but I do think we should take in @bbimber's PR to shore up our JEXL offering in GATK, as for reasonably simple queries the usage is more straightforward compared to writing javascript.

    Geraldine Van der Auwera, PhD

  • bbimberbbimber HomePosts: 32Member

    Hello - the PR to HTSJDK was accepted, which should allow much better genotype filtering using GATK. My question is: how frequently does GATK update the HTSJDK version it uses? If I read github correctly, GATK is currently pegged to htsjdk version 1.141. These changes are in 2.2.2. Am I reading this wrong? Thanks in advance.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 9,857Administrator, Dev admin

    That's correct. We typically update htsjdk when we cut a new release, which nowadays is about 3 times a year -- so you're in luck because the next release, GATK 3.6 is slated for mid-May. In fact there's a PR that just went in for updating htsjdk to 2.2.2 so it should be available in the nightly builds within a few days. Assuming all tests pass.

    Geraldine Van der Auwera, PhD

  • bbimberbbimber HomePosts: 32Member

    good news, thanks.

Sign In or Register to comment.