We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

Selecting Variants by searching the FORMAT field array

Hello all,

First post. Thank you for these amazing tools. I have spent two days pulling my hair out, trying all enumerations, searching the documentation and forums, and in the end I come to you for help. Please forgive me if these topics have been covered elsewhere.

I have several VCFs generated by SomaticSniper that I'd like to filter based on the SomaticScore (SSC in the FORMAT field). I was working with VariantFiltration and SelectVariants, and trying to use different options, as well as regular expressions, to select those calls with a SSC over 40. I have been unable to do so. I also looked into trying to figure out JEXL, and using the last command listed on the documentation page, about using the VariantContext feature to drill into an array. I cannot get it to recognize the SSC column of the FORMAT field and then filter for the TUMOR sample.

Using VariantFiltration I was using -select (but I understand now that this searches the INFO field only). I was then using the --genotypeFilterExpression, but it would not add the FT tag to the FORMAT field as it said it would, it would just apply PASS to everything.

java -Xmx4-jar GenomeAnalysisTK.jar -T VariantFiltration -R ~/Documents/reference/human_g1k_v37.fasta -V '/home/registry3/Desktop/merged/104024sniperRAWSNPS.vcf' --G_filter "SSC < 40.0" --G_filterName "myFilter" -o '/home/registry3/Desktop/merged/104024sniperFILTEREDSNPS.vcf'

Using SelectVariants, I was using -sn to select the TUMOR sample and then using -select_expressions, but I guess this also only works on the INFO field. I had been trying to use --sample_expression which gives the ability to use a regular expression, but then I never had good results; it wouldn't do any filtering, and output the entire input file. Does the regular expression only apply to the sample name, and not the content of each line? Trying to select SSC over 40 from a line like this

1   10177   0   A   C   0   0   AC=1;AF=0.500;AN=2;DP=62    GT:AMQ:BCOUNT:BQ:DP:DP4:GQ:IGT:MQ:SS:SSC:VAQ    0/1:16,15:40,22,0,0:28,25:62:31,9,10,12:37:0/1:16:2:19:37

I used a line such as this, to look at the second to last number in the FORMAT field based on : dividers

java -Xmx4-jar GenomeAnalysisTK.jar -R ~/Documents/reference/human_g1k_v37.fasta -T SelectVariants --variant '/home/registry3/Desktop/merged/104024sniperSNPs.vcf' -o '/home/registry3/Desktop/merged/104024sniperSSC40.vcf' -se ".*:[4,5,6,7,8,9][1,2,3,4,5,6,7,8,9]:[1234567890]{2,3}$"

I am not a coder, as you can probably see, but I'm trying to get this worked out. This output the entire file still, with SSC values above and below 40.

Looking into use the vc.getGenotype array access, I could not find much documentation about VariantContext; I was looking through the files on github, looking through the code and looking for samples, since the .getAD() from the documentation seems to work, but alas, there is no .getSCC() available. Is using vc. the best way to drill into an array (the FORMAT field) and search for what I want?

I didn't post all the code and output, trying to keep this as short as possible. I can post pastebin outputs if that would be helpful. Thank you, David

Best Answers


  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi David,

    Well we certainly can't fault you for not trying! Variant filtration using JEXLs is one of the most vexing topics people often ask us about, so please don't feel bad.

    You are on the right track, using VariantFiltration with a genotype filter expression is indeed the way to go. I'm not entirely certain why what you tried already didn't work. However... from the look of it, this SSC annotation is a single integer value per sample, correct? Not an array and not a float? So you shouldn't need to query the vc context at least. Maybe it's a typing issue. In the example you posted above you have -G_filter "SSC < 40.0". Have you tried the same but with the integer representation, ie "SSC < 40"?

  • ddarcyddarcy Member

    Hello and thank you for getting back so quickly!


    java -Xmx4g -jar GenomeAnalysisTK.jar -R /Homo_sapiens_assembly19.fasta -T VariantFiltration -V /sniper/104024-T--104024-N.snv.soticsniper.vcf -o /sniper/104024-T--104024-N.snv.somaticsniper.overSCC40.vcf -G_filter "SSC < 40" -G_filterName "SSC40"

    I get thousands of:

    WARN 14:25:59,784 Interpreter - ![0,3]: 'SSC < 40;' undefined variable SSC
    WARN 14:25:59,784 Interpreter - ![0,3]: 'SSC < 40;' undefined variable SSC

    And it outputs all lines. "SSC < 40.0" does the same.

    It is a single integer per sample; So it is not an array, even though it is in the Sample column as number:number:string:number? I figured an array referred to a number referenced by the FORMAT column that was in a SAMPLE column buried between colons. The header contains

    ##FORMAT=<ID=SSC,Number=1,Type=Integer,Description="Somatic Score">

    So I don't know why it's undefined...

    Thank you,

  • ddarcyddarcy Member
    edited September 2013

    In the VCF, all lines have the SSC field defined in the FORMAT column. The normal sample had a . for this value ( as in 22:1:.:44 etc) but the tumor samples all had a number in this spot. I tried filtering just the sample TUMOR, so the the VCF FORMAT field had SCC and every sample had a value for that, but to no avail (underfined variable SSC). Regarding the FT annotation, as far as I understand that would be added to the FORMAT field, correct? And yes, DP does work, but I thought this might be because DP is defined in the INFO column as well. I can't get it to find any terms in the FORMAT column (SSC, AMQ, DP4, etc). Perhaps I could post a small sample? Would it be possible to do so without the headers as to keep study data private?

    Thank you,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi David,

    If you'd like you can upload a file snippet to our FTP server (see FAQs for instructions); no one else but us will be able to access it. Feel free to edit out any sensitive information from the headers, as long as you leave the parts of the header that are required for it to be a valid vcf.

  • ddarcyddarcy Member

    I tried filtering just the sample TUMOR (so just there wasn't a NORMAL sample which had a . for the SSC field) and then FT annotation worked. I thought I had done this before, but maybe I had gone from 4 samples to 2. I guess it was due to having two samples in the VCF and one sample didn't have that data (it was normal, and didn't have the SomaticSniper SSC score).

    However after the FT annotation happens, then I can't get SelectVariants to just pull out those variants. I have selected and tagged the SSC > 40 with "SCCover40", but pulling those lines out to a new file...? The SelectVariants -select function only seems to mine the INFO field. How do you select out variants to a new file by the FORMAT field?

    Thank you, David

  • Lisa999Lisa999 UKMember


    I am working on a multi sample VCF file and trying to use SelectVariants to filter out all variants with a repeat length of less that 11bp. I am trying to select for RPA which is an array, and so far I am not succeeding.

    I have tried the following:

    java -Xmx3g -jar $GATK -T SelectVariants -R $REF_FA \
    --variant $IN_VCF -o $SEL_VCF -select 'vc.getGenotype("sample1").getRPA().0 > 10' \
    -log $LOG_FILE -l INFO

    This does not work, but if I swap AD for RPA the code does work:

    java -Xmx3g -jar $GATK -T SelectVariants -R $REF_FA \
    --variant $IN_VCF -o $SEL_VCF -select 'vc.getGenotype("sample1").getAD().0 > 10' \
    -log $LOG_FILE -l INFO

    I have also tried the following without any luck:

    java -Xmx3g -jar $GATK -T SelectVariants -R $REF_FA \
    --variant $IN_VCF -o $SEL_VCF -select 'vc.hasAttribute("RPA").getRPA().0 > 10' \
    -log $LOG_FILE -l INFO

    I would be very grateful if someone could suggest a better method.


  • pdexheimerpdexheimer Member ✭✭✭✭

    Hi Lisa -

    This is essentially what was discussed in another thread, and I think that answer will apply (though you'll probably getAttributeAsInt) - but I still haven't actually tested it

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @Lisa999,

    The AD annotation is special-cased, currently there is no equivalent method that is specific for getting the RPA value. But you can try one of the generic "get" methods; getAnyAttribute("RPA") might work.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Actually the link @pdexheimer posted is a good source to start from -- you may need to experiment with the various get methods. getExtendedAttribute is also a good one to try, though as he mentions type conversion may be an issue. Let us know how it turns out.

  • eflanneryeflannery San DiegoMember

    I have a vcf file that I've called variants on using Variant Filtration. Variant Filtration annotates the format field with the correct expressions for samples that do not pass, but I still get PASS in the FILTER field, and thus cannot filter with SelectVariants. Is there another way I am supposed to be doing this?


    java -Xmx4g -jar ~/bin/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar \
        -R ~/Dropbox/Genomes/PlasmoDB-13.0_Pfalciparum3D7_Genome.fasta \
        -T VariantFiltration \
        -L /Users/eflannery/Dropbox/data/SNV_falciparum/fal_nomultigene.interval.list \
        --variant ${input} \
        -o ${input}_variantFiltration.vcf \
        -window 10 \
        -filter "ReadPosRankSum > 7.0" -filterName "HighRPRS" \
        -filter "ReadPosRankSum < -7.0" -filterName "LowRPRS" \
        -filter "QD < 22.0" -filterName "LowQD" \
        -filter "SOR > 4.2" -filterName "HighSOR" \
        -filter "MQ < 50.0" -filterName "LowMQ" \
        -filter "DP > 2131" -filterName "HighDP" \
        -filter "MQRankSum > 6.0" -filterName "HighMQRS" \
        -filter "MQRankSum < -6.0" -filterName "LowMQRS" \
        --genotypeFilterExpression "DP < 6" --genotypeFilterName "LowFormatDP" \
        --genotypeFilterExpression "GQ < 30.0" --genotypeFilterName "LowGQ" \
        1>> ${input}_err 2>> ${input}_err

    Pf3D7_04_v3 96641 . A AATATATATAT 1644.06 PASS AC=14;AF=0.875;AN=16;DP=70;FS=0.000;MLEAC=11;MLEAF=0.688;MQ=60.62;MQ0=0;QD=24.54;SOR=2.303 GT:AD:DP:FT:GQ:PL 0/0:0,0:0:LowFormatDP;LowGQ:0:0,0,1 1/1:0,3:3:LowFormatDP;LowGQ:12:179,12,0 1/1:0,1:1:LowFormatDP;LowGQ:3:28,3,0 1/1:0,8:8:LowGQ:26:316,26,0...etc there are a few other samples, but they are all LowFormat and/or LowGQ.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @eflannery When you apply genotype-level filters for which individual samples all fail, that doesn't affect the filter status of the variant site itself, so what you're seeing is expected. To filter the variants themselves based on genotypes you need to use methods that access the variant context using JEXL. It's a little more complicated but others have posted useful examples of how to do this.

  • everestial007everestial007 GreensboroMember ✭✭

    Hi @Geraldine_VdAuwera

    I am trying to filter the variants file by GT. I read the all the questions in this post and find that the post by @pdexheimer would be helpful in selecting the variants by FORMAT.
    But, for now I am trying pull the variants that are only GT: 1/2.

    I used the following script using -G_filter with combinations of hets ("isHet == 1"), refs ("isHomRef == 1"), or homs ("isHomVar == 1") but non of the expression is able to get this exact genotype.

    For eg:
    java -jar GenomeAnalysisTK.jar -T VariantFiltration -R dmel_refGenome_r6.09.fasta -V raw01_variants_S1-forTest.vcf --genotypeFilterExpression "isHomVar == 1" --genotypeFilterName "not-hom_var" -invG_filter -o S1_hets-only.vcf
    This filters all the variants except GT:1/1

    hets ("isHet == 1") - filters sites 0/1 and 1/2
    ("isHomRef == 1") - filters sites 0/0 which is actually non-existent in a vcf file with variant site called.
    homs ("isHomVar == 1") - filters sites 1/1

    I tired "isHet == 2" assuming it would filter site having 2nd allele and then inverse the selection but this doesn't work either.


    Issue · Github
    by Sheila

    Issue Number
    Last Updated
    Closed By
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi there, the problem is that the "isFoo" queries give a yes/no answer (typically 0 means no and 1 means yes). I can't think of a way to do that directly in GATK with JEXL, but you may want to look into the Picard FilterVcf, which can now take javascript functions as filters (example here).

Sign In or Register to comment.