Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

What are JEXL expressions and how can I use them with the GATK?

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin
edited July 2013 in FAQs

1. JEXL in a nutshell

JEXL stands for Java EXpression Language. It's not a part of the GATK as such; it's a software library that can be used by Java-based programs like the GATK. It can be used for many things, but in the context of the GATK, it has one very specific use: making it possible to operate on subsets of variants from VCF files based on one or more annotations, using a single command. This is typically done with walkers such as VariantFiltration and SelectVariants.

2. Basic structure of JEXL expressions for use with the GATK

In this context, a JEXL expression is a string (in the computing sense, i.e. a series of characters) that tells the GATK which annotations to look at and what selection rules to apply.

JEXL expressions contain three basic components: keys and values, connected by operators. For example, in this simple JEXL expression which selects variants whose quality score is greater than 30:

"QUAL > 30.0"
  • QUAL is a key: the name of the annotation we want to look at
  • 30.0 is a value: the threshold that we want to use to evaluate variant quality against
  • > is an operator: it determines which "side" of the threshold we want to select

The complete expression must be framed by double quotes. Within this, keys are strings (typically written in uppercase or CamelCase), and values can be either strings, numbers or booleans (TRUE or FALSE) -- but if they are strings the values must be framed by single quotes, as in the following example:

"MY_STRING_KEY == 'foo'"

3. Evaluation on multiple annotations

You can build expressions that calculate a metric based on two separate annotations, for example if you want to select variants for which quality (QUAL) divided by depth of coverage (DP) is below a certain threshold value:

"QUAL / DP < 10.0"

You can also join multiple conditional statements with logical operators, for example if you want to select variants that have both sufficient quality (QUAL) and a certain depth of coverage (DP):

"QUAL > 30.0 && DP == 10"

where && is the logical "AND".

Or if you want to select variants that have at least one of several conditions fulfilled:

"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0"

where || is the logical "OR".

4. Important caveats

Sensitivity to case and type

  • Case

Currently, VCF INFO field keys are case-sensitive. That means that if you have a QUAL field in uppercase in your VCF record, the system will not recognize it if you write it differently (Qual, qual or whatever) in your JEXL expression.

  • Type

The types (i.e. string, integer, non-integer or boolean) used in your expression must be exactly the same as that of the value you are trying to evaluate. In other words, if you have a QUAL field with non-integer values (e.g. 45.3) and your filter expression is written as an integer (e.g. "QUAL < 50"), the system will throw a hissy fit (aka a Java exception).

Complex queries

We highly recommend that complex expressions involving multiple AND/OR operations be split up into separate expressions whenever possible to avoid confusion. If you are using complex expressions, make sure to test them on a panel of different sites with several combinations of yes/no criteria.

5. More complex JEXL magic

Note that this last part is fairly advanced and not for the faint of heart. To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial.

Accessing the underlying VariantContext directly

If you are familiar with the VariantContext, Genotype and its associated classes and methods, you can directly access the full range of capabilities of the underlying objects from the command line. The underlying VariantContext object is available through the vc variable.

For example, suppose I want to use SelectVariants to select all of the sites where sample NA12878 is homozygous-reference. This can be accomplished by assessing the underlying VariantContext as follows:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").isHomRef()'

Groovy, right? Now here's a more sophisticated example of JEXL expression that finds all novel variants in the total set with allele frequency > 0.25 but not 1, is not filtered, and is non-reference in 01-0263 sample:

! vc.getGenotype("01-0263").isHomRef() && (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() -o 01-0263.high_freq_novels.vcf -sn 01-0263

Using the VariantContext to evaluate boolean values

The classic way of evaluating a boolean goes like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'DB'

But you can also use the VariantContext object like this:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.hasAttribute("DB")'

6. Using JEXL to evaluate arrays

Sometimes you might want to write a JEXL expression to evaluate e.g. the AD (allelic depth) field in the FORMAT column. However, the AD is technically not an integer; rather it is a list (array) of integers. One can evaluate the array data using the "." operator. Here's an example:

java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R b37/human_g1k_v37.fasta --variant my.vcf -select 'vc.getGenotype("NA12878").getAD().0 > 10'
Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Comments

  • pdexheimerpdexheimer Posts: 335Member, GSA Collaborator ✭✭✭

    For those trying to access the underlying VariantContext, the class of interest is org.broadinstitute.sting.utils.variantcontext.VariantContext (and Genotype in the same package). Source is here: https://github.com/broadgsa/gatk/blob/master/public/java/src/org/broadinstitute/sting/utils/variantcontext/VariantContext.java

  • prepagamprepagam Posts: 18Member

    You state "It is very important to note that the JEXL evaluation subprogram cannot correctly handle cases where the annotations requested by the JEXL expression are missing for some variants in a VCF record". GaTK documentation says there is a standard set of annotations output by Unified Genotyper but when I view my VCF file - not all the standard annotations are there for all variants (e.g. ReadePosRankSum). Is this because they can't be computed? Is there a way to add them back in?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    It is possible that that some annotations were left out because they could not be computed. Usually if they could not be computed in the original run, it is not possible to add them afterwards either.

    Geraldine Van der Auwera, PhD

  • aeonsimaeonsim Posts: 64Member ✭✭
    edited April 2013

    Hi

    Is the source code for VariantContext no longer available? Either way is there a JAVA doc page or similar for VariantContext? As I'm looking at building some fairly complex filters using SelectVariants and it would be very useful to find a full list of the various fields that can be queried from vc.

    Otherwise can you answer these questions:

    Is there a wildcard operator or some other way of specifying multiple samples say if I want to keep all samples that have a GQ score of greater than 20 & a AD where the Alt is >= 2?

    Post edited by aeonsim on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    VC has been transferred to a different dev team, see this doc article for instructions to access their public repo:

    http://www.broadinstitute.org/gatk/guide/article?id=2194

    Geraldine Van der Auwera, PhD

  • nhvanlienhvanlie Posts: 9Member

    I was just testing VariantFiltration with some files that have missing annotations and was suprised to have the filter expression "MQ < 60.0 || QD < 2.0" evaluate correctly (all the variants with MQ < 60.0 were filtered) even though QD was not defined for some variants. I also tested the expression with a file where no QD values were present and QD was not defined in the header and had the same result.

    Command run:

    java -Xmx2g -jar GenomeAnalysisTK.jar -T VariantFiltration -R ucsc.hg19.fasta --variant test.vcf
    --filterExpression "MQ < 60.0 || QD < 2.0" --filterName "FAIL" -o ./test.hardfiltered.vcf
    

    Subset of results (chromosome locations/genotypes removed):

    1 C T   8249.77 PASS    AC=2;AF=1.00;AN=2;DP=285;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=28.95    
    2 G A   7401.77 FAIL    AC=2;AF=1.00;AN=2;DP=137;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.00;MQ0=0;QD=54.03    
    3 G A   3445.77 FAIL    AC=2;AF=1.00;AN=2;DP=120;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=1.44 
    5 A T   5440.77 FAIL    AC=2;AF=1.00;AN=2;DP=188;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.00;MQ0=0;QD=0.94 
    6 T C   34  PASS    DP=4;DP4=1,1,1,1;FQ=32.3;MQ=60;PV4=1,1,1,1;VDB=0.0073   
    7 C T,A 66.30   PASS    DP=6;DP4=0,0,6,0;FQ=-42;MQ=60;VDB=0.0000
    8 G C   32.80   FAIL    DP=2;DP4=0,0,2,0;FQ=-33;MQ=43;VDB=0.0066
    9   G   A   32.80   FAIL    DP=2;DP4=0,0,0,2;FQ=-33;MQ=51;VDB=0.0066    
    

    Is this caveat no longer applicable in GATK 2.4-9 or did I misunderstand the meaning of the caveat?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    Ack, not a JEXL expressions question ;-)

    I'm not sure -- I think your interpretation is correct but JEXL always keeps me guessing. I'm consulting the team and will try to get you a definitive answer asap about the expected behavior. On the bright side it looks like you're getting the right answers, so that's a good problem to have.

    Geraldine Van der Auwera, PhD

  • nhvanlienhvanlie Posts: 9Member

    Definitely a good problem to have! Sorry- I wasn't sure where to put the question and it was prompted by the caveat in this post, so I thought it might belong here. Thanks for your help!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    No worries, this is as good a place as any :)

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    OK, it seems that the behavior as been changed so that the caveat no longer applies -- will check and edit the docs as appropriate. Thanks for pointing this out!

    Geraldine Van der Auwera, PhD

  • igorigor Posts: 31Member

    Is there a list of available keys available somewhere? For VCF walkers, you can look at your VCF file and see what is available. However, I am using SomaticIndelDetector and I am not sure what my options are except for the one example that is listed.

    I realize SomaticIndelDetector is no longer part of GATK, but it's still there in GATK 2.3.

  • CarneiroCarneiro Posts: 275Administrator, GATK Developer admin

    We do not support SomaticIndelDetector anymore, please contact the Broad's cancer group directly. They have a new tool that has substituted it entirely.

  • igorigor Posts: 31Member

    @Carneiro said: We do not support SomaticIndelDetector anymore, please contact the Broad's cancer group directly. They have a new tool that has substituted it entirely.

    So there is no old list of available keys available somewhere?

    What is the BCG tool called? I searched their software page, but I could not find anything that looks like SomaticIndelDetector.

  • CarneiroCarneiro Posts: 275Administrator, GATK Developer admin

    I think the new tool is called Indelocator

  • igorigor Posts: 31Member

    @Carneiro said: I think the new tool is called Indelocator

    But Indelocator documentation points back to GATK: http://www.broadinstitute.org/cancer/cga/indelocator_download

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    It looks like that is a special version of GATK that includes Indelocator, packaged by the cancer group. We'll work with them to clarify this; in the meantime please ask them for documentation and help running it.

    Geraldine Van der Auwera, PhD

  • rob123kingrob123king Posts: 8Member

    Im trying to filter indels of 20 and greater length from a vcf file. I have an example from a forum but it retrieves indels under 20 length.

    What does "vc.getIndelLengths().0" do? Does it just get the length of an indel because changing the "<" around to get those greater than 19 returns no results when I know there are indels of 19 and above.

    -select 'vc.getIndelLengths().0 > -19 && vc.getIndelLengths().0 < 19'

  • rob123kingrob123king Posts: 8Member

    Got it thanks, insertions larger than 19 but smaller than 1000 -select 'vc.getIndelLengths().0 > 19 && vc.getIndelLengths().0 < 1000'

  • JCGrenierJCGrenier Montreal, QCPosts: 15Member
    edited September 2013

    Hello there, I'm working with this tool presently and I'm trying to extract all the polymophic sites from one of my sample. I'm completely able to do it for biallelic sites, but when I have triallelic sites, I'm not able to find the right expression to bypass the problem.

    Here's my JEXL expression :

    The one that is working with biallelic sites :

    java -Xmx1g -jar /home/apps/Logiciels/GATK/GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T SelectVariants -R Ref.fa --variant INPUT.vcf -select 'vc.getGenotype("SAMPLE").getAD().0 > 1 && vc.getGenotype("SAMPLE").getAD().1 > 1' -o test.vcf
    

    The one that I'm trying to do :

    java -Xmx1g -jar /home/apps/Logiciels/GATK/GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T SelectVariants -R Ref.fa --variant INPUT.vcf -select '(vc.getGenotype("SAMPLE").getAD() > 2 && (vc.getGenotype("SAMPLE").getAD().1 > 1 && vc.getGenotype("SAMPLE").getAD().2 > 1) || (vc.getGenotype("SAMPLE").getAD().0 > 1 && vc.getGenotype("SAMPLE").getAD().2 > 1)) || vc.getGenotype("SAMPLE").getAD() == 2 && vc.getGenotype("SAMPLE").getAD().0 > 1 && vc.getGenotype("SAMPLE").getAD().1 > 1' -o test.vcf
    

    So, it's more like a if/else case, could this be possible to do?

    Thanks a lot.

    Post edited by Geraldine_VdAuwera on
  • pdexheimerpdexheimer Posts: 335Member, GSA Collaborator ✭✭✭

    I would use the built-in --excludeNonVariants option rather than JEXL. -sn SAMPLE -env will give you only the variants that are polymorphic in SAMPLE, if you want the data for all samples across those variants I think you could do a second SelectVariants across the original file with -L polymorphicInSAMPLE.vcf

  • JCGrenierJCGrenier Montreal, QCPosts: 15Member

    Thanks for the answer.

    Would it give me as well ref/ref calls where I have at least a coverage of one on the alternative allele?

    Thanks

  • JCGrenierJCGrenier Montreal, QCPosts: 15Member

    I just tried the command but do not give me what I'm exactly looking for. For sur it only gives me variant sites. But what is different in my case is that I want to extract positions where the positions could be whatever genotype (I'm doing my calling on multiple samples, so 1 sample can be 0/0 ). The additionnal filtering step that I need is that the site itself must contain at least one of the two allele ( so DP >=2 ). The problem that I have resides in the fact that I'm working with bacterias (mutation rate higher) so I have multiple sites with more than 2 alleles. Sometimes, I have only allele 1 and 2 giving me 1/2 genotypes and it's causing problems with my JEXL command. It's causing problems because of the size of the array itself that I want to look at (on biallelic positions).

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    Hi @JCGrenier,

    Could you please give a concrete example to show what you want to do? E.g. post a few variant records and point out which ones you want to extract vs. not extract.

    Geraldine Van der Auwera, PhD

  • pdexheimerpdexheimer Posts: 335Member, GSA Collaborator ✭✭✭

    I'm afraid I don't understand what you're trying to do. DP refers to the number of reads covering a site, it has nothing to do with alleles. And -env should capture the 1/2 case without any problems. If you want to consider sites that are polymorphic in one of several samples, you'd specify additional -sn parameters.

  • JCGrenierJCGrenier Montreal, QCPosts: 15Member
    edited September 2013

    I'll give you a concrete example only giving you GT:AD (I don't care about the other terms).

    Here's a list of calls :

    1 0/0:10,0
    2 0/0:9,1
    3 0/0:9,0,1
    4 0/0:9,1,0
    5 0/1:3,7
    6 0/1:3,6,1
    7 1/1:0,10
    8 1/1:1,9
    9 1/1:0,9,1
    10 0/2:4,0,6
    11 1/2:0,4,6
    12 2/2:0,0,10
    13 2/2:0,1,9
    14 2/2:1,0,9
    

    So, what I want to extract here are the positions 2, 3, 4, 5, 6, 8, 9, 10, 11, 13 and 14 where there's a sign of polymorphism within my sample. I don't care about the other ones for the moment. And as the SNP calling was done using multiple samples, that's why I can have such cases as the position #4.

    So my first command line that I showed you in the other post was able to catch all the ones with a coverage of at least 1 in the two first alleles.

    Thanks for your help.

    Post edited by Geraldine_VdAuwera on
  • JCGrenierJCGrenier Montreal, QCPosts: 15Member

    Hi folks,

    I end up using awk to find out the lines that I want. Here's the command that I used :

    awk '{ct=0; split($10, geno, ":") ; split(geno[2], AD, ","); for (i in AD) if(AD[i]>=1) ct++; if ($0~/^#/||ct>=2) print}' INPUT.vcf > OUTPUT.polymorph.vcf

    For sure, it works only if I have one sample in my vcf file. It would have been tricky to do so using multiple samples anyway.

    Thanks for your help! I'm interested if you find the solution using GATK anyway!

  • FrancoisWFrancoisW Posts: 2Member

    "To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial."

    I would actually love to see this happen since I am currently struggling to make this variant context work. Is this still somewhere on the TODO list?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    Hi @FrancoisW,

    Yes, it is still very much planned, and we are moving closer to finding time to do it. Hang on in there :)

    Geraldine Van der Auwera, PhD

  • mmterpstrammterpstra NetherlandsPosts: 29Member

    I'd appreciate more explaination or a long list of good examples(that do not produce warnings for too many of your variants.... for troubleshooting). For example how do you check for the an array.index(vc.hasAttributes('RPA').2) returning a valid value (at least not undefined)? This could be used for many uses the most important is getting data out of multiallelic variants. just like "vc.hasAttributes('RPA') && vc.hasAttributes('RPA').2.something()" will first check if the variable is actually present and then will check for the index number being existent.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    Alright, it's on the wishlist.

    Geraldine Van der Auwera, PhD

  • mmterpstrammterpstra NetherlandsPosts: 29Member
    edited February 12

    Here is a good one as well (Although completly replaceable by the AB/FS annotattions): '((Double.valueOf(vc.getGenotype('$samplename').getAD().1) / Double.valueOf(vc.getAttribute('DP'))) < 0.01)' Also Jexl Seems to work like java (Beginner level at best), as far as I tried the commands they are all there ('.Equals()' and 'Double.valueOf()' ). So if you want to use Jexl you should try to think like 'How would a Java programmer solve this' also. PS this will get you a lot of warnings.......

    Post edited by mmterpstra on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    So if you want to use Jexl you should try to think like 'How would a Java programmer solve this' also.

    That is exactly right, and it's because the JEXL is basically a way of accessing the Java objects directly. Which makes it difficult to use for people with no Java background of course.

    Producing a tutorial on this is still on the to-do list (but it's a long list and jexl is not at the top)

    Geraldine Van der Auwera, PhD

  • bwubbbwubb Posts: 39Member

    I have noticed that I get a stack trace error when I include AF > 0.0057 in my JEXL select command. So far I can not determine the cause; it goes through several variants before I get the error. As far as I can tell every variant has a AF= in there somewhere.

    Is there any possible reason for this or is there another way to select that type of criteria I could try? I can provide more details if need be. I didnt think this was a bug, but rather some sort of user error. I am, however, stumped.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    Hi @bwubb, can you post the stack trace?

    Geraldine Van der Auwera, PhD

  • bwubbbwubb Posts: 39Member

    Absolutely. My command -select 'vc.getID().equals(".") && vc.isSNP() && vc.isNotFiltered() && AF > 0.0057' works if I remove the AF portion, but when it is included I recieve this after some number of variants.

    ERROR stack trace

    java.lang.IllegalArgumentException: Invalid JEXL expression detected for select-0 with message ![62,73]: 'vc.getID().equals('.') && vc.isSNP() && vc.isNotFiltered() && AF > 0.0057;' > error at org.broadinstitute.variant.variantcontext.JEXLMap.evaluateExpression(JEXLMap.java:183) at org.broadinstitute.variant.variantcontext.JEXLMap.get(JEXLMap.java:140) at org.broadinstitute.variant.variantcontext.JEXLMap.get(JEXLMap.java:23) at org.broadinstitute.variant.variantcontext.VariantContextUtils.match(VariantContextUtils.java:276) at org.broadinstitute.sting.gatk.walkers.variantutils.SelectVariants.map(SelectVariants.java:518) at org.broadinstitute.sting.gatk.walkers.variantutils.SelectVariants.map(SelectVariants.java:182) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    You're hitting multi-allelic variants. For multi-allelic variants the AF is not a double - it's a list of doubles. So the expression "AF > 0.0057" is invalid in those cases. You'll need to add a condition to only process variants with 1 ALT allele. Unfortunately there is currently no way to filter over AF for multi-allelic variants.

    Geraldine Van der Auwera, PhD

  • mmterpstrammterpstra NetherlandsPosts: 29Member

    @bwubb I'm not sure this migth be a real GATK error (please post your version number): I just copied '(vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP()' from the manual above. The expected behaviour is that it produces a warning for each multiallelic site, but it crashes with the following output / error:

    INFO 15:33:20,956 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:33:20,958 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-2-g6bda569, Compiled 2013/08/28 16:30:29 INFO 15:33:20,958 HelpFormatter - Copyright (c) 2010 The Broad Institute INFO 15:33:20,958 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk INFO 15:33:20,962 HelpFormatter - Program Args: -T SelectVariants -R human_g1k_v37.fasta --variant Annotator.vcf -o output.vcf -select (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP() INFO 15:33:20,962 HelpFormatter - Date/Time: 2014/02/21 15:33:20 INFO 15:33:20,962 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:33:20,963 HelpFormatter - -------------------------------------------------------------------------------- INFO 15:33:20,971 ArgumentTypeDescriptor - Dynamically determined type of PriVsMeta_S12_193/vcf/PriVsMeta_S12_193.Annotator.vcf to be VCF INFO 15:33:21,496 GenomeAnalysisEngine - Strictness is SILENT INFO 15:33:21,583 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000 INFO 15:33:21,603 RMDTrackBuilder - Loading Tribble index from disk for file PriVsMeta_S12_193/vcf/PriVsMeta_S12_193.Annotator.vcf INFO 15:33:21,743 GenomeAnalysisEngine - Preparing for traversal INFO 15:33:21,758 GenomeAnalysisEngine - Done preparing for traversal INFO 15:33:21,759 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] INFO 15:33:21,759 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining INFO 15:33:23,159 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    java.lang.IllegalArgumentException: Invalid JEXL expression detected for select-0 with message ![50,59]: '(vc.getID() == null || vc.getID().equals('.')) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP();' > error at org.broadinstitute.variant.variantcontext.JEXLMap.evaluateExpression(JEXLMap.java:183) at org.broadinstitute.variant.variantcontext.JEXLMap.get(JEXLMap.java:140) at org.broadinstitute.variant.variantcontext.JEXLMap.get(JEXLMap.java:23) at org.broadinstitute.variant.variantcontext.VariantContextUtils.match(VariantContextUtils.java:250) at org.broadinstitute.sting.gatk.walkers.variantutils.SelectVariants.map(SelectVariants.java:518) at org.broadinstitute.sting.gatk.walkers.variantutils.SelectVariants.map(SelectVariants.java:182) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274) at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92) at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48) at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99) at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313) at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245) at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152) at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 2.7-2-g6bda569):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Invalid JEXL expression detected for select-0 with message ![50,59]: '(vc.getID() == null || vc.getID().equals('.')) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP();' > error
    ERROR ------------------------------------------------------------------------------------------

    @Geraldine_VdAuwera what do you think of this???ps it is still present in gatk 2.8.1

  • bwubbbwubb Posts: 39Member

    @Geraldine_VdAuwera said: You're hitting multi-allelic variants. For multi-allelic variants the AF is not a double - it's a list of doubles. So the expression "AF > 0.0057" is invalid in those cases. You'll need to add a condition to only process variants with 1 ALT allele. Unfortunately there is currently no way to filter over AF for multi-allelic variants.

    I see. Thank you for the explanation

    @mmterpstra I am using 2.8-1

    @Geraldine_VdAuwera Is there somewhere I could take a look at the vc class methods? I only know what is available to me based on these posts. It seems my next course of action needs to be separating out the Multiallelic SNPs as suggested and interrogating them manually (ie. make a student do it MWAHAHAHAHA). What would me the class method equivlent of --restrictAllelesTo BIALLELIC? Thank you very much.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    Hi @bwubb,

    We don't have explicit docs for that so the quickest way to find out is to look at the code, which is available on github.

    my next course of action needs to be separating out the Multiallelic SNPs as suggested and interrogating them manually (ie. make a student do it MWAHAHAHAHA)

    image

    Geraldine Van der Auwera, PhD

  • blueskypyblueskypy Posts: 194Member

    hi, Geraldine, I like your MWAHAHAHAHA! :) And thank you so much for your patience in answering all those questions. I wonder if you can help me with this. Here is a line of entry in dbSNP vcf format: 1 54353 rs140052487 C A . . RS=140052487;RSPOS=54353;dbSNPBuildID=134;SSR=0;SAO=0;VP=0x050000000001000016000100;WGT=1;VC=SNV;KGPhase1;KGPROD;OTHERKG;CAF=[0.9927,0.007346];COMMON=1

    How to extract those with the 2nd value of CAF > 0.01?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    Hi @blueskypy,

    I don't know a way to do that directly with JEXL, if that's what you're asking (although others may); I would probably dump the info to a table using VariantsToTable at that point, and query that in R.

    Geraldine Van der Auwera, PhD

  • mmterpstrammterpstra NetherlandsPosts: 29Member
    edited March 11

    I just copied '(vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP()' from the manual above

    and ran a minute or less and then the program screamed and died:

    ERROR MESSAGE: Invalid JEXL expression detected for select-0 with message ![50,59]: '(vc.getID() == null || vc.getID().equals('.')) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP();' > error

    from my post on 21 februari shouldn't this be fixed in the Documentation/Code?

    Post edited by mmterpstra on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    @mmterpstra, do you mean to make a feature request to get a warning instead of a blocking error when the AF query is made on multiallelic sites?

    Geraldine Van der Auwera, PhD

  • mmterpstrammterpstra NetherlandsPosts: 29Member

    @Geraldine_VdAuwera‌: That would be sensible because then the AF is stil usable. The other solution will be to update the current docs instructing first to filter out the multiallelic variants. Thinking back on the comment this is a question of beginner friendliness v.s. user safety.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    Thinking back on the comment this is a question of beginner friendliness v.s. user safety.

    I agree, that's a fair point. I'll see what we can do to make this better.

    Geraldine Van der Auwera, PhD

  • airtimeairtime Posts: 8Member

    How to filter depending on the GT Attribute? I try: java -Xmx30G -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg19.fasta --variant variants.vcf -o variants_filtered --genotypeFilterExpression "vc.getGenotype().equals('1/1')" --genotypeFilterName "homozygote" but this Warning occur: WARN Interpreter - ![3,16]: 'vc.getGenotype().equals('1/1');' attempting to call method on null

    Thanks.

  • mmterpstrammterpstra NetherlandsPosts: 29Member
    edited April 11

    did you read the manual? This is a plain copy/paste from the manual:'vc.getGenotype("NA12878").isHomRef()' . where "NA12878" is your genotype that's is present in your vcf for that genotype. '.isHomRef()' soft filters out the homozygotes and only works on a single genotype.

    Post edited by mmterpstra on
  • bigeybigey Posts: 2Member

    The only way I found to filter on AC (which is a list/array of integers) is:

    --filterExpression "vc.getAttribute('AC').equals('40')" --filterName AChigh

    I have tried many combinations and this is the only one realy working without warning!

    If it could help...

    Frédéric BIGEY, PhD

  • airtimeairtime Posts: 8Member

    @mmterpstra I read the manual many times, but i didn't recognize how to use isHomRef and at the first trials I overread it sorry. I try it out in this form: isHomRef() but it didn't work well. Now I can use this filter, after some trials with isHomRef == 1. 'Thank you for the hint.

  • mmterpstrammterpstra NetherlandsPosts: 29Member
    edited April 29

    Explanation of code: 'vc.getGenotype("NA12878").isHomRef()'

    The object that stores the information concerning the variant (also known as 'Variant Context'):

     vc
    

    The next piece of code extracts the variant information specific to the genotype (The genotype fields probably, or a new vc object with only a single genotype):

     .getGenotype("NA12878")
    

    Then the following will test that single genotype if it is homozygous for the reference allelle by extracting the 'GT' field from the 'vc..getGenotype("NA12878")' statement and comparing this with the actual homref genotype. This will return a boolean value TRUE or FALSE. also it migth have some checks that the input it is a single genotype and thus may only work for a single genotype.

     .isHomRef()
    

    A small note: Object Oriented code is somewhat difficult to write at the terminal that's why the previous statements are prone to errors. I'm no moderator and my java skills are only basic. This I learned from reading a java Object Oriented tutorial, letting the GATK crash :) and manual variant evaluation. for every question here most of them i did not write any factual testing unless otherwise noted.

    Post edited by mmterpstra on
  • laRfet3laRfet3 MontpellierPosts: 1Member

    Hi everyone, I'm wondering if there is a wildcard to deal with ALL (or ANY) samples at the same time ?

    1) I'd like to discard variants for which each sample has a missing genotype (./.) !! Because I removed others samples. 2) I'd like also to keep only variant having no missing genotypes at all.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    I can't think of a straightforward way to do that, sorry; I'll ask the team if they have any ideas.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    @laRfet3

    So, the recommendation from the team is to run the VariantAnnotator with -A ChromosomeCounts. That will update the AC and AN (and AF) fields in the INFO section. Then you can easily use JEXLs like: AC==0 (case #1) and AN != x (case #2, where x is equal to 2 x the number of samples).

    Geraldine Van der Auwera, PhD

  • bwubbbwubb Posts: 39Member

    Is there an annotation I could use for selection which is simply the non-reference frequency across samples? I am trying to extract variants that are seen in at least 95% of my vcf samples? I do not care if they are het or hom_alt, but annotations such as allele count might not be applicable (tumor data and whatnot). Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    @bwubb I would dump AD and DP to a table and query them in R...

    Geraldine Van der Auwera, PhD

  • bwubbbwubb Posts: 39Member

    @Geraldine_VdAuwera So once more I must work with my arch-nemesis...

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    Hah :) Personally, I dislike R less than JEXL...

    Geraldine Van der Auwera, PhD

  • bwubbbwubb Posts: 39Member

    Agreed. I ended up using old reliable python and hopefully I did things adequately (quick and dirty).

    Line by line parsing; incrementing counts for each 0/1 or 1/1 and dividing by the total samples; write to output if >= 0.95

    Still appreciate your help!.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,046Administrator, GATK Developer admin

    Ah, good old python. Also my language of choice if I had to pick one -- except for plotting, because ggplot, yay...

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.