We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

SelectVariants from FILTER column GATK4

I have a vcf annotated for several different tranches labelled in the FILTER column. I'm trying to use SelectVariants to pick sites from different tranches, but I can't get it to work. I'm using GATKv4.0.0.0.

Example VCF before filtering:

CHROM POS ID REF ALT QUAL FILTER INFO FORMAT ANN0801 ANN0802 ANN0803 ANN0804 ANN0805 ANN0

HanXRQChr17 774 . G A 65.46 VQSRTrancheSNP99.00to100.00 AC=1;AF=3.704e-03;AN=270;Bas
HanXRQChr17 812 . G A 5839.94 VQSRTrancheSNP50.00to70.00 AC=53;AF=0.212;AN=250;BaseQR
HanXRQChr17 823 . A G 5378.42 VQSRTrancheSNP90.00to99.00 AC=49;AF=0.201;AN=244;BaseQR
HanXRQChr17 830 . C T 648.72 VQSRTrancheSNP99.00to100.00 AC=5;AF=0.021;AN=240;BaseQRa
HanXRQChr17 845 . T C 345.33 VQSRTrancheSNP99.00to100.00 AC=4;AF=0.020;AN=204;DP=428;
HanXRQChr17 852 . CA C 3030.40 PASS AC=38;AF=0.173;AN=220;BaseQRankSum=1.01;ClippingRank
HanXRQChr17 866 . T G 89.60 VQSRTrancheSNP99.00to100.00

Code I've tried:
gatk SelectVariants --variant tmp.vcf.gz -O tmp.tranche100.vcf.gz -select "FILTER == VQSRTrancheSNP99.00to100.00"
This crashes gatk because of an JEXL error

gatk SelectVariants --variant tmp.vcf.gz -O tmp.tranche100.vcf.gz -select "FILTER == 'VQSRTrancheSNP99.00to100.00'"
This runs, but produces a vcf file with zero sites.

Ultimately I'd like to select multiple different values from the FILTER column using '||' but for now I'm trying to get it to select one.

Thanks

Answers

  • Greg_OwensGreg_Owens Member

    On further testing, there are some issues with VQSRTrancheSNP99.00to100.00 containing numbers. The reason I can't use
    gatk SelectVariants --variant tmp.vcf.gz -O tmp.tranche100.vcf.gz -select "FILTER == VQSRTrancheSNP99.00to100.00"
    is the numbers in the string. That being said, I can't get it to search for any string correctly.

    gatk SelectVariants --variant tmp.vcf.gz -O tmp.tranche100.vcf.gz -select "FILTER == PASS"
    produces an empty vcf, even though there are sites with with PASS in the FILTER column.

  • SkyWarriorSkyWarrior TurkeyMember ✭✭✭

    Can you use grep to filter PASS?

    You can reheader the vcf later with bcftools.

  • Greg_OwensGreg_Owens Member

    Right now I'm using grep in my stream
    gatk SelectVariants ...| grep "#\|VQSRTrancheSNP99.00to100.00
    This does work, although it seems like a bug that the select expression doesn't work in SelectVariants.

  • SheilaSheila Broad InstituteMember, Broadie ✭✭✭✭✭
    edited July 2018

    @Greg_Owens
    Hi,

    I think the GATK tools mostly ignore filtered sites (sites without PASS or .). I agree it would be nice to choose sites within each tranche, but since you can easily grep for them, the team probably won't dedicate any time to this.

    -Sheila

  • init_jsinit_js Member
    edited July 2018

    I think the GATK tools mostly ignore filtered sites (sites without PASS or .). I agree it would be nice to choose sites within each tranche, but since you can easily grep for them, the team probably won't dedicate any time to this.

    I would like to clarify that SelectVariants is not part of the set of such tools. i.e. SelectVariants runs filter expressions on all sites in the VCF. And, it looks like the team already has dedicated a large amount of time to this issue, but it's not covered in the docs. The gatk JEXL page points to the variantcontext java file, but there are no examples of selecting tranches or based on filter values.

    The FILTER column is processed separately from other columns. It is indeed recognized as a string column however, but for some reason, it's not possible to match against it in any useful way with basic operators directly. If I pretend it is a boolean column, for instance:

    SelectVariants --select "FILTER"
    

    I get java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Boolean. So it's misleading that it allows you to form string expressions with it.

    When building the variantcontext, the FILTER column value gets thrown in the CommonInfo bag, and then added to a Set<String> object. The data structure would support having more than one FILTER value per site, although tools such as ApplyVQSR won't make use of that feature.

    One can obtain all sites that are PASS (and all sites of VCFs that have no filter information), using:

     --select "vc.isNotFiltered()"
    

    Another equivalent way of obtaining all PASS entries is:

    --select "vc.getFilters().size() == 0"
    

    In the API, there is a separate check to ensure that filtering information is available. I assume that if your file doesn't have any filter column, or no filter information, the previous check will evaluate true -- i.e. vc.getFilters() always evaluates to an empty set, or a non-empty set, but never null. To ensure that your VCF file indeed has filtering information, one could use as a precondition:

    --select "vc.filtersWereApplied()"
    

    Disclaimer: I've not actually tested whether a VCF without a FILTER column, and a VCF with no ##FILTER entries in the headers are both treated as having no filter info. ymmv

    In the case of a VCF produced by ApplyVQSR you can select for the exact tranche name (using the associated filter value) by checking if it's part of the set. (Note: If the value you want is PASS, you can't go this route, you have to use one of the checks I mentioned above. The string "PASS" will not be a member of that set), e.g.:

    --select "vc.getFilters().contains('VQSRTrancheSNP99.00to100.00')"
    

    I'm not sure if there's a more concise JEXL way to check if one or more values are in a set, but you could check for multiple values with a disjunction '||'. e.g:

    --select "vc.getFilters().contains('FOO') || vc.getFilters().contains('BAR')"
    

    If you want to select all entries that are not PASS in one go, you can use:

    --select "vc.isFiltered()"
    

    Alternatively, if your VCF has been produced by ApplyVQSR, there's only ever one element in the set, so you could use vc.getFilters().size() > 0. If there's no filter information at all in the VCF, getFilters() returns an empty set. For ApplyVQSR, in fact, vc.getFilters().size() > 0 and vc.getFilters().size() == 1 seem to be equivalent.

    why a set?

    It's not clear to me why filters are stored in a Set, and not just a plain string column. It could be a VCF format standard which allows for comma-separated values or such, I'm not sure. It does allow to differentiate between null, 0, and 1 element easily however, which might not be possible if "" (empty string) were a valid FILTER name.

    See Variant Context Code

    Tested on gatk4.0.0.0

  • Hello,
    You can try the below GATK command to filter variants by 'PASS':

    gatk --java-options "-Xmx20G -XX:+UseParallelGC -XX:ParallelGCThreads=8" SelectVariants -R reference.fasta -V snps.vcf.gz --exclude-filtered true -O filter.vcf.gz

    Alternatively, you can use awk commands to do the same as below:
    awk -F '\t' '{if($0 ~ /#/) print; else if($7 == "PASS") print}' snps.vcf > filter.vcf

    Hope this answer your questions

    -Lakshman

Sign In or Register to comment.