Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We appreciate your help!

Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

Questions about JEXL expressions for selecting variants according to specific reuqirements

Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin
This discussion was created from comments split from: What are JEXL expressions and how can I use them with the GATK?.

Comments

  • pdexheimerpdexheimer Member ✭✭✭✭

    For those trying to access the underlying VariantContext, the class of interest is org.broadinstitute.sting.utils.variantcontext.VariantContext (and Genotype in the same package). Source is here: https://github.com/broadgsa/gatk/blob/master/public/java/src/org/broadinstitute/sting/utils/variantcontext/VariantContext.java

  • prepagamprepagam Member

    You state "It is very important to note that the JEXL evaluation subprogram cannot correctly handle cases where the annotations requested by the JEXL expression are missing for some variants in a VCF record". GaTK documentation says there is a standard set of annotations output by Unified Genotyper but when I view my VCF file - not all the standard annotations are there for all variants (e.g. ReadePosRankSum). Is this because they can't be computed? Is there a way to add them back in?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    It is possible that that some annotations were left out because they could not be computed. Usually if they could not be computed in the original run, it is not possible to add them afterwards either.

  • aeonsimaeonsim Member ✭✭✭
    edited April 2013

    Hi

    Is the source code for VariantContext no longer available? Either way is there a JAVA doc page or similar for VariantContext? As I'm looking at building some fairly complex filters using SelectVariants and it would be very useful to find a full list of the various fields that can be queried from vc.

    Otherwise can you answer these questions:

    Is there a wildcard operator or some other way of specifying multiple samples say if I want to keep all samples that have a GQ score of greater than 20 & a AD where the Alt is >= 2?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    VC has been transferred to a different dev team, see this doc article for instructions to access their public repo:

    http://www.broadinstitute.org/gatk/guide/article?id=2194

  • nhvanlienhvanlie Member

    I was just testing VariantFiltration with some files that have missing annotations and was suprised to have the filter expression "MQ < 60.0 || QD < 2.0" evaluate correctly (all the variants with MQ < 60.0 were filtered) even though QD was not defined for some variants. I also tested the expression with a file where no QD values were present and QD was not defined in the header and had the same result.

    Command run:

    java -Xmx2g -jar GenomeAnalysisTK.jar -T VariantFiltration -R ucsc.hg19.fasta --variant test.vcf
    --filterExpression "MQ < 60.0 || QD < 2.0" --filterName "FAIL" -o ./test.hardfiltered.vcf
    

    Subset of results (chromosome locations/genotypes removed):

    1 C T   8249.77 PASS    AC=2;AF=1.00;AN=2;DP=285;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=28.95    
    2 G A   7401.77 FAIL    AC=2;AF=1.00;AN=2;DP=137;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.00;MQ0=0;QD=54.03    
    3 G A   3445.77 FAIL    AC=2;AF=1.00;AN=2;DP=120;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=1.44 
    5 A T   5440.77 FAIL    AC=2;AF=1.00;AN=2;DP=188;FS=0.000;MLEAC=2;MLEAF=1.00;MQ=50.00;MQ0=0;QD=0.94 
    6 T C   34  PASS    DP=4;DP4=1,1,1,1;FQ=32.3;MQ=60;PV4=1,1,1,1;VDB=0.0073   
    7 C T,A 66.30   PASS    DP=6;DP4=0,0,6,0;FQ=-42;MQ=60;VDB=0.0000
    8 G C   32.80   FAIL    DP=2;DP4=0,0,2,0;FQ=-33;MQ=43;VDB=0.0066
    9   G   A   32.80   FAIL    DP=2;DP4=0,0,0,2;FQ=-33;MQ=51;VDB=0.0066    
    

    Is this caveat no longer applicable in GATK 2.4-9 or did I misunderstand the meaning of the caveat?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ack, not a JEXL expressions question ;-)

    I'm not sure -- I think your interpretation is correct but JEXL always keeps me guessing. I'm consulting the team and will try to get you a definitive answer asap about the expected behavior. On the bright side it looks like you're getting the right answers, so that's a good problem to have.

  • nhvanlienhvanlie Member

    Definitely a good problem to have! Sorry- I wasn't sure where to put the question and it was prompted by the caveat in this post, so I thought it might belong here. Thanks for your help!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    No worries, this is as good a place as any :)

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    OK, it seems that the behavior as been changed so that the caveat no longer applies -- will check and edit the docs as appropriate. Thanks for pointing this out!

  • igorigor New YorkMember ✭✭

    Is there a list of available keys available somewhere? For VCF walkers, you can look at your VCF file and see what is available. However, I am using SomaticIndelDetector and I am not sure what my options are except for the one example that is listed.

    I realize SomaticIndelDetector is no longer part of GATK, but it's still there in GATK 2.3.

  • CarneiroCarneiro Charlestown, MAMember admin

    We do not support SomaticIndelDetector anymore, please contact the Broad's cancer group directly. They have a new tool that has substituted it entirely.

  • igorigor New YorkMember ✭✭

    @Carneiro said:
    We do not support SomaticIndelDetector anymore, please contact the Broad's cancer group directly. They have a new tool that has substituted it entirely.

    So there is no old list of available keys available somewhere?

    What is the BCG tool called? I searched their software page, but I could not find anything that looks like SomaticIndelDetector.

  • CarneiroCarneiro Charlestown, MAMember admin

    I think the new tool is called Indelocator

  • igorigor New YorkMember ✭✭

    @Carneiro said:
    I think the new tool is called Indelocator

    But Indelocator documentation points back to GATK:
    http://www.broadinstitute.org/cancer/cga/indelocator_download

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    It looks like that is a special version of GATK that includes Indelocator, packaged by the cancer group. We'll work with them to clarify this; in the meantime please ask them for documentation and help running it.

  • Im trying to filter indels of 20 and greater length from a vcf file. I have an example from a forum but it retrieves indels under 20 length.

    What does "vc.getIndelLengths().0" do? Does it just get the length of an indel because changing the "<" around to get those greater than 19 returns no results when I know there are indels of 19 and above.

    -select 'vc.getIndelLengths().0 > -19 && vc.getIndelLengths().0 < 19'

  • Got it thanks, insertions larger than 19 but smaller than 1000
    -select 'vc.getIndelLengths().0 > 19 && vc.getIndelLengths().0 < 1000'

  • JCGrenierJCGrenier Montreal, QCMember
    edited September 2013

    Hello there, I'm working with this tool presently and I'm trying to extract all the polymophic sites from one of my sample. I'm completely able to do it for biallelic sites, but when I have triallelic sites, I'm not able to find the right expression to bypass the problem.

    Here's my JEXL expression :

    The one that is working with biallelic sites :

    java -Xmx1g -jar /home/apps/Logiciels/GATK/GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T SelectVariants -R Ref.fa --variant INPUT.vcf -select 'vc.getGenotype("SAMPLE").getAD().0 > 1 && vc.getGenotype("SAMPLE").getAD().1 > 1' -o test.vcf
    

    The one that I'm trying to do :

    java -Xmx1g -jar /home/apps/Logiciels/GATK/GenomeAnalysisTK-2.4-9-g532efad/GenomeAnalysisTK.jar -T SelectVariants -R Ref.fa --variant INPUT.vcf -select '(vc.getGenotype("SAMPLE").getAD() > 2 && (vc.getGenotype("SAMPLE").getAD().1 > 1 && vc.getGenotype("SAMPLE").getAD().2 > 1) || (vc.getGenotype("SAMPLE").getAD().0 > 1 && vc.getGenotype("SAMPLE").getAD().2 > 1)) || vc.getGenotype("SAMPLE").getAD() == 2 && vc.getGenotype("SAMPLE").getAD().0 > 1 && vc.getGenotype("SAMPLE").getAD().1 > 1' -o test.vcf
    

    So, it's more like a if/else case, could this be possible to do?

    Thanks a lot.

  • pdexheimerpdexheimer Member ✭✭✭✭

    I would use the built-in --excludeNonVariants option rather than JEXL. -sn SAMPLE -env will give you only the variants that are polymorphic in SAMPLE, if you want the data for all samples across those variants I think you could do a second SelectVariants across the original file with -L polymorphicInSAMPLE.vcf

  • JCGrenierJCGrenier Montreal, QCMember

    Thanks for the answer.

    Would it give me as well ref/ref calls where I have at least a coverage of one on the alternative allele?

    Thanks

  • JCGrenierJCGrenier Montreal, QCMember

    I just tried the command but do not give me what I'm exactly looking for. For sur it only gives me variant sites. But what is different in my case is that I want to extract positions where the positions could be whatever genotype (I'm doing my calling on multiple samples, so 1 sample can be 0/0 ). The additionnal filtering step that I need is that the site itself must contain at least one of the two allele ( so DP >=2 ).
    The problem that I have resides in the fact that I'm working with bacterias (mutation rate higher) so I have multiple sites with more than 2 alleles. Sometimes, I have only allele 1 and 2 giving me 1/2 genotypes and it's causing problems with my JEXL command. It's causing problems because of the size of the array itself that I want to look at (on biallelic positions).

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @JCGrenier,

    Could you please give a concrete example to show what you want to do? E.g. post a few variant records and point out which ones you want to extract vs. not extract.

  • pdexheimerpdexheimer Member ✭✭✭✭

    I'm afraid I don't understand what you're trying to do. DP refers to the number of reads covering a site, it has nothing to do with alleles. And -env should capture the 1/2 case without any problems. If you want to consider sites that are polymorphic in one of several samples, you'd specify additional -sn parameters.

  • JCGrenierJCGrenier Montreal, QCMember
    edited September 2013

    I'll give you a concrete example only giving you GT:AD (I don't care about the other terms).

    Here's a list of calls :

    1 0/0:10,0
    2 0/0:9,1
    3 0/0:9,0,1
    4 0/0:9,1,0
    5 0/1:3,7
    6 0/1:3,6,1
    7 1/1:0,10
    8 1/1:1,9
    9 1/1:0,9,1
    10 0/2:4,0,6
    11 1/2:0,4,6
    12 2/2:0,0,10
    13 2/2:0,1,9
    14 2/2:1,0,9
    

    So, what I want to extract here are the positions 2, 3, 4, 5, 6, 8, 9, 10, 11, 13 and 14 where there's a sign of polymorphism within my sample. I don't care about the other ones for the moment. And as the SNP calling was done using multiple samples, that's why I can have such cases as the position #4.

    So my first command line that I showed you in the other post was able to catch all the ones with a coverage of at least 1 in the two first alleles.

    Thanks for your help.

  • JCGrenierJCGrenier Montreal, QCMember

    Hi folks,

    I end up using awk to find out the lines that I want.
    Here's the command that I used :

    awk '{ct=0; split($10, geno, ":") ; split(geno[2], AD, ","); for (i in AD) if(AD[i]>=1) ct++; if ($0~/^#/||ct>=2) print}' INPUT.vcf > OUTPUT.polymorph.vcf

    For sure, it works only if I have one sample in my vcf file. It would have been tricky to do so using multiple samples anyway.

    Thanks for your help! I'm interested if you find the solution using GATK anyway!

  • FrancoisWFrancoisW Member

    "To be frank, it's also explained rather more briefly than the topic deserves. But if there's enough demand for this level of usage (click the "view in forum" link and leave a comment) we'll consider producing a full-length tutorial."

    I would actually love to see this happen since I am currently struggling to make this variant context work. Is this still somewhere on the TODO list?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @FrancoisW,

    Yes, it is still very much planned, and we are moving closer to finding time to do it. Hang on in there :)

  • mmterpstrammterpstra NetherlandsMember ✭✭

    I'd appreciate more explaination or a long list of good examples(that do not produce warnings for too many of your variants.... for troubleshooting).
    For example how do you check for the an array.index(vc.hasAttributes('RPA').2) returning a valid value (at least not undefined)?
    This could be used for many uses the most important is getting data out of multiallelic variants.
    just like "vc.hasAttributes('RPA') && vc.hasAttributes('RPA').2.something()" will first check if the variable is actually present and then will check for the index number being existent.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Alright, it's on the wishlist.

  • mmterpstrammterpstra NetherlandsMember ✭✭
    edited February 2014

    Here is a good one as well (Although completly replaceable by the AB/FS annotattions): '((Double.valueOf(vc.getGenotype('$samplename').getAD().1) / Double.valueOf(vc.getAttribute('DP'))) < 0.01)'
    Also Jexl Seems to work like java (Beginner level at best), as far as I tried the commands they are all there ('.Equals()' and 'Double.valueOf()' ). So if you want to use Jexl you should try to think like 'How would a Java programmer solve this' also.
    PS this will get you a lot of warnings.......

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    So if you want to use Jexl you should try to think like 'How would a Java programmer solve this' also.

    That is exactly right, and it's because the JEXL is basically a way of accessing the Java objects directly. Which makes it difficult to use for people with no Java background of course.

    Producing a tutorial on this is still on the to-do list (but it's a long list and jexl is not at the top)

  • bwubbbwubb Member ✭✭

    I have noticed that I get a stack trace error when I include AF > 0.0057 in my JEXL select command. So far I can not determine the cause; it goes through several variants before I get the error. As far as I can tell every variant has a AF= in there somewhere.

    Is there any possible reason for this or is there another way to select that type of criteria I could try? I can provide more details if need be. I didnt think this was a bug, but rather some sort of user error. I am, however, stumped.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @bwubb, can you post the stack trace?

  • bwubbbwubb Member ✭✭

    Absolutely. My command -select 'vc.getID().equals(".") && vc.isSNP() && vc.isNotFiltered() && AF > 0.0057' works if I remove the AF portion, but when it is included I recieve this after some number of variants.

    ERROR stack trace

    java.lang.IllegalArgumentException: Invalid JEXL expression detected for select-0 with message ![62,73]: 'vc.getID().equals('.') && vc.isSNP() && vc.isNotFiltered() && AF > 0.0057;' > error
    at org.broadinstitute.variant.variantcontext.JEXLMap.evaluateExpression(JEXLMap.java:183)
    at org.broadinstitute.variant.variantcontext.JEXLMap.get(JEXLMap.java:140)
    at org.broadinstitute.variant.variantcontext.JEXLMap.get(JEXLMap.java:23)
    at org.broadinstitute.variant.variantcontext.VariantContextUtils.match(VariantContextUtils.java:276)
    at org.broadinstitute.sting.gatk.walkers.variantutils.SelectVariants.map(SelectVariants.java:518)
    at org.broadinstitute.sting.gatk.walkers.variantutils.SelectVariants.map(SelectVariants.java:182)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    You're hitting multi-allelic variants. For multi-allelic variants the AF is not a double - it's a list of doubles. So the expression "AF > 0.0057" is invalid in those cases. You'll need to add a condition to only process variants with 1 ALT allele. Unfortunately there is currently no way to filter over AF for multi-allelic variants.

  • mmterpstrammterpstra NetherlandsMember ✭✭

    @bwubb
    I'm not sure this migth be a real GATK error (please post your version number):
    I just copied '(vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP()' from the manual above. The expected behaviour is that it produces a warning for each multiallelic site, but it crashes with the following output / error:

    INFO 15:33:20,956 HelpFormatter - --------------------------------------------------------------------------------
    INFO 15:33:20,958 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-2-g6bda569, Compiled 2013/08/28 16:30:29
    INFO 15:33:20,958 HelpFormatter - Copyright (c) 2010 The Broad Institute
    INFO 15:33:20,958 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
    INFO 15:33:20,962 HelpFormatter - Program Args: -T SelectVariants -R human_g1k_v37.fasta --variant Annotator.vcf -o output.vcf -select (vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP()
    INFO 15:33:20,962 HelpFormatter - Date/Time: 2014/02/21 15:33:20
    INFO 15:33:20,962 HelpFormatter - --------------------------------------------------------------------------------
    INFO 15:33:20,963 HelpFormatter - --------------------------------------------------------------------------------
    INFO 15:33:20,971 ArgumentTypeDescriptor - Dynamically determined type of PriVsMeta_S12_193/vcf/PriVsMeta_S12_193.Annotator.vcf to be VCF
    INFO 15:33:21,496 GenomeAnalysisEngine - Strictness is SILENT
    INFO 15:33:21,583 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 15:33:21,603 RMDTrackBuilder - Loading Tribble index from disk for file PriVsMeta_S12_193/vcf/PriVsMeta_S12_193.Annotator.vcf
    INFO 15:33:21,743 GenomeAnalysisEngine - Preparing for traversal
    INFO 15:33:21,758 GenomeAnalysisEngine - Done preparing for traversal
    INFO 15:33:21,759 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
    INFO 15:33:21,759 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
    INFO 15:33:23,159 GATKRunReport - Uploaded run statistics report to AWS S3

    ERROR ------------------------------------------------------------------------------------------
    ERROR stack trace

    java.lang.IllegalArgumentException: Invalid JEXL expression detected for select-0 with message ![50,59]: '(vc.getID() == null || vc.getID().equals('.')) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP();' > error
    at org.broadinstitute.variant.variantcontext.JEXLMap.evaluateExpression(JEXLMap.java:183)
    at org.broadinstitute.variant.variantcontext.JEXLMap.get(JEXLMap.java:140)
    at org.broadinstitute.variant.variantcontext.JEXLMap.get(JEXLMap.java:23)
    at org.broadinstitute.variant.variantcontext.VariantContextUtils.match(VariantContextUtils.java:250)
    at org.broadinstitute.sting.gatk.walkers.variantutils.SelectVariants.map(SelectVariants.java:518)
    at org.broadinstitute.sting.gatk.walkers.variantutils.SelectVariants.map(SelectVariants.java:182)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:267)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano$TraverseLociMap.apply(TraverseLociNano.java:255)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(NanoScheduler.java:274)
    at org.broadinstitute.sting.utils.nanoScheduler.NanoScheduler.execute(NanoScheduler.java:245)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:144)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:92)
    at org.broadinstitute.sting.gatk.traversals.TraverseLociNano.traverse(TraverseLociNano.java:48)
    at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:99)
    at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
    at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
    at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
    at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 2.7-2-g6bda569):
    ERROR
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Invalid JEXL expression detected for select-0 with message ![50,59]: '(vc.getID() == null || vc.getID().equals('.')) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP();' > error
    ERROR ------------------------------------------------------------------------------------------

    @Geraldine_VdAuwera what do you think of this???ps it is still present in gatk 2.8.1

  • bwubbbwubb Member ✭✭

    @Geraldine_VdAuwera said:
    You're hitting multi-allelic variants. For multi-allelic variants the AF is not a double - it's a list of doubles. So the expression "AF > 0.0057" is invalid in those cases. You'll need to add a condition to only process variants with 1 ALT allele. Unfortunately there is currently no way to filter over AF for multi-allelic variants.

    I see. Thank you for the explanation

    @mmterpstra I am using 2.8-1

    @Geraldine_VdAuwera Is there somewhere I could take a look at the vc class methods? I only know what is available to me based on these posts. It seems my next course of action needs to be separating out the Multiallelic SNPs as suggested and interrogating them manually (ie. make a student do it MWAHAHAHAHA). What would me the class method equivlent of --restrictAllelesTo BIALLELIC? Thank you very much.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @bwubb,

    We don't have explicit docs for that so the quickest way to find out is to look at the code, which is available on github.

    my next course of action needs to be separating out the Multiallelic SNPs as suggested and interrogating them manually (ie. make a student do it MWAHAHAHAHA)

    image

  • blueskypyblueskypy Member ✭✭

    hi, Geraldine,
    I like your MWAHAHAHAHA! :) And thank you so much for your patience in answering all those questions. I wonder if you can help me with this. Here is a line of entry in dbSNP vcf format:
    1 54353 rs140052487 C A . . RS=140052487;RSPOS=54353;dbSNPBuildID=134;SSR=0;SAO=0;VP=0x050000000001000016000100;WGT=1;VC=SNV;KGPhase1;KGPROD;OTHERKG;CAF=[0.9927,0.007346];COMMON=1

    How to extract those with the 2nd value of CAF > 0.01?

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @blueskypy,

    I don't know a way to do that directly with JEXL, if that's what you're asking (although others may); I would probably dump the info to a table using VariantsToTable at that point, and query that in R.

  • mmterpstrammterpstra NetherlandsMember ✭✭
    edited March 2014

    I just copied '(vc.getID() == null || vc.getID().equals(".")) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP()' from the manual above

    and ran a minute or less and then the program screamed and died:

    ERROR MESSAGE: Invalid JEXL expression detected for select-0 with message ![50,59]: '(vc.getID() == null || vc.getID().equals('.')) && AF > 0.25 && AF < 1.0 && vc.isNotFiltered() && vc.isSNP();' > error

    from my post on 21 februari
    shouldn't this be fixed in the Documentation/Code?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @mmterpstra, do you mean to make a feature request to get a warning instead of a blocking error when the AF query is made on multiallelic sites?

  • mmterpstrammterpstra NetherlandsMember ✭✭

    @Geraldine_VdAuwera‌: That would be sensible because then the AF is stil usable. The other solution will be to update the current docs instructing first to filter out the multiallelic variants. Thinking back on the comment this is a question of beginner friendliness v.s. user safety.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Thinking back on the comment this is a question of beginner friendliness v.s. user safety.

    I agree, that's a fair point. I'll see what we can do to make this better.

  • airtimeairtime Member

    How to filter depending on the GT Attribute?
    I try:
    java -Xmx30G -jar GenomeAnalysisTK.jar -T VariantFiltration -R hg19.fasta --variant variants.vcf -o variants_filtered --genotypeFilterExpression "vc.getGenotype().equals('1/1')" --genotypeFilterName "homozygote"
    but this Warning occur:
    WARN Interpreter - ![3,16]: 'vc.getGenotype().equals('1/1');' attempting to call method on null

    Thanks.

  • mmterpstrammterpstra NetherlandsMember ✭✭
    edited April 2014

    did you read the manual? This is a plain copy/paste from the manual:'vc.getGenotype("NA12878").isHomRef()' . where "NA12878" is your genotype that's is present in your vcf for that genotype. '.isHomRef()' soft filters out the homozygotes and only works on a single genotype.

  • bigeybigey Montpellier, FranceMember

    The only way I found to filter on AC (which is a list/array of integers) is:

    --filterExpression "vc.getAttribute('AC').equals('40')" --filterName AChigh

    I have tried many combinations and this is the only one realy working without warning!

    If it could help...

  • airtimeairtime Member

    @mmterpstra I read the manual many times, but i didn't recognize how to use isHomRef and at the first trials I overread it sorry.
    I try it out in this form: isHomRef() but it didn't work well. Now I can use this filter, after some trials with isHomRef == 1.
    'Thank you for the hint.

  • mmterpstrammterpstra NetherlandsMember ✭✭
    edited April 2014

    Explanation of code:
    'vc.getGenotype("NA12878").isHomRef()'

    The object that stores the information concerning the variant (also known as 'Variant Context'):

     vc
    

    The next piece of code extracts the variant information specific to the genotype (The genotype fields probably, or a new vc object with only a single genotype):

     .getGenotype("NA12878")
    

    Then the following will test that single genotype if it is homozygous for the reference allelle by extracting the 'GT' field from the 'vc..getGenotype("NA12878")' statement and comparing this with the actual homref genotype. This will return a boolean value TRUE or FALSE. also it migth have some checks that the input it is a single genotype and thus may only work for a single genotype.

     .isHomRef()
    

    A small note: Object Oriented code is somewhat difficult to write at the terminal that's why the previous statements are prone to errors. I'm no moderator and my java skills are only basic. This I learned from reading a java Object Oriented tutorial, letting the GATK crash :) and manual variant evaluation. for every question here most of them i did not write any factual testing unless otherwise noted.

  • laRfet3laRfet3 MontpellierMember

    Hi everyone,
    I'm wondering if there is a wildcard to deal with ALL (or ANY) samples at the same time ?

    1) I'd like to discard variants for which each sample has a missing genotype (./.) !! Because I removed others samples.
    2) I'd like also to keep only variant having no missing genotypes at all.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I can't think of a straightforward way to do that, sorry; I'll ask the team if they have any ideas.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @laRfet3

    So, the recommendation from the team is to run the VariantAnnotator with -A ChromosomeCounts. That will update the AC and AN (and AF) fields in the INFO section. Then you can easily use JEXLs like: AC==0 (case #1) and AN != x (case #2, where x is equal to 2 x the number of samples).

  • bwubbbwubb Member ✭✭

    Is there an annotation I could use for selection which is simply the non-reference frequency across samples? I am trying to extract variants that are seen in at least 95% of my vcf samples? I do not care if they are het or hom_alt, but annotations such as allele count might not be applicable (tumor data and whatnot). Thank you.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @bwubb I would dump AD and DP to a table and query them in R...

  • bwubbbwubb Member ✭✭

    @Geraldine_VdAuwera So once more I must work with my arch-nemesis...

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hah :) Personally, I dislike R less than JEXL...

  • bwubbbwubb Member ✭✭

    Agreed. I ended up using old reliable python and hopefully I did things adequately (quick and dirty).

    Line by line parsing; incrementing counts for each 0/1 or 1/1 and dividing by the total samples; write to output if >= 0.95

    Still appreciate your help!.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Ah, good old python. Also my language of choice if I had to pick one -- except for plotting, because ggplot, yay...

Sign In or Register to comment.