Selecting Variants by searching the FORMAT field array

ddarcyddarcy Posts: 4Member

Hello all,

First post. Thank you for these amazing tools. I have spent two days pulling my hair out, trying all enumerations, searching the documentation and forums, and in the end I come to you for help. Please forgive me if these topics have been covered elsewhere.

I have several VCFs generated by SomaticSniper that I'd like to filter based on the SomaticScore (SSC in the FORMAT field). I was working with VariantFiltration and SelectVariants, and trying to use different options, as well as regular expressions, to select those calls with a SSC over 40. I have been unable to do so. I also looked into trying to figure out JEXL, and using the last command listed on the documentation page, about using the VariantContext feature to drill into an array. I cannot get it to recognize the SSC column of the FORMAT field and then filter for the TUMOR sample.

Using VariantFiltration I was using -select (but I understand now that this searches the INFO field only). I was then using the --genotypeFilterExpression, but it would not add the FT tag to the FORMAT field as it said it would, it would just apply PASS to everything.

java -Xmx4-jar GenomeAnalysisTK.jar -T VariantFiltration -R ~/Documents/reference/human_g1k_v37.fasta -V '/home/registry3/Desktop/merged/104024sniperRAWSNPS.vcf' --G_filter "SSC < 40.0" --G_filterName "myFilter" -o '/home/registry3/Desktop/merged/104024sniperFILTEREDSNPS.vcf'

Using SelectVariants, I was using -sn to select the TUMOR sample and then using -select_expressions, but I guess this also only works on the INFO field. I had been trying to use --sample_expression which gives the ability to use a regular expression, but then I never had good results; it wouldn't do any filtering, and output the entire input file. Does the regular expression only apply to the sample name, and not the content of each line? Trying to select SSC over 40 from a line like this

1   10177   0   A   C   0   0   AC=1;AF=0.500;AN=2;DP=62    GT:AMQ:BCOUNT:BQ:DP:DP4:GQ:IGT:MQ:SS:SSC:VAQ    0/1:16,15:40,22,0,0:28,25:62:31,9,10,12:37:0/1:16:2:19:37

I used a line such as this, to look at the second to last number in the FORMAT field based on : dividers

java -Xmx4-jar GenomeAnalysisTK.jar -R ~/Documents/reference/human_g1k_v37.fasta -T SelectVariants --variant '/home/registry3/Desktop/merged/104024sniperSNPs.vcf' -o '/home/registry3/Desktop/merged/104024sniperSSC40.vcf' -se ".*:[4,5,6,7,8,9][1,2,3,4,5,6,7,8,9]:[1234567890]{2,3}$"

I am not a coder, as you can probably see, but I'm trying to get this worked out. This output the entire file still, with SSC values above and below 40.

Looking into use the vc.getGenotype array access, I could not find much documentation about VariantContext; I was looking through the files on github, looking through the code and looking for samples, since the .getAD() from the documentation seems to work, but alas, there is no .getSCC() available. Is using vc. the best way to drill into an array (the FORMAT field) and search for what I want?

I didn't post all the code and output, trying to keep this as short as possible. I can post pastebin outputs if that would be helpful. Thank you, David

Best Answers


  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,414Administrator, GATK Developer admin

    Hi David,

    Well we certainly can't fault you for not trying! Variant filtration using JEXLs is one of the most vexing topics people often ask us about, so please don't feel bad.

    You are on the right track, using VariantFiltration with a genotype filter expression is indeed the way to go. I'm not entirely certain why what you tried already didn't work. However... from the look of it, this SSC annotation is a single integer value per sample, correct? Not an array and not a float? So you shouldn't need to query the vc context at least. Maybe it's a typing issue. In the example you posted above you have -G_filter "SSC < 40.0". Have you tried the same but with the integer representation, ie "SSC < 40"?

    Geraldine Van der Auwera, PhD

  • ddarcyddarcy Posts: 4Member

    Hello and thank you for getting back so quickly!


    java -Xmx4g -jar GenomeAnalysisTK.jar -R /Homo_sapiens_assembly19.fasta -T VariantFiltration -V /sniper/104024-T--104024-N.snv.soticsniper.vcf -o /sniper/104024-T--104024-N.snv.somaticsniper.overSCC40.vcf -G_filter "SSC < 40" -G_filterName "SSC40"

    I get thousands of:

    WARN 14:25:59,784 Interpreter - ![0,3]: 'SSC < 40;' undefined variable SSC
    WARN 14:25:59,784 Interpreter - ![0,3]: 'SSC < 40;' undefined variable SSC

    And it outputs all lines. "SSC < 40.0" does the same.

    It is a single integer per sample; So it is not an array, even though it is in the Sample column as number:number:string:number? I figured an array referred to a number referenced by the FORMAT column that was in a SAMPLE column buried between colons. The header contains

    ##FORMAT=<ID=SSC,Number=1,Type=Integer,Description="Somatic Score">

    So I don't know why it's undefined...

    Thank you,

  • ddarcyddarcy Posts: 4Member
    edited September 2013

    In the VCF, all lines have the SSC field defined in the FORMAT column. The normal sample had a . for this value ( as in 22:1:.:44 etc) but the tumor samples all had a number in this spot. I tried filtering just the sample TUMOR, so the the VCF FORMAT field had SCC and every sample had a value for that, but to no avail (underfined variable SSC). Regarding the FT annotation, as far as I understand that would be added to the FORMAT field, correct? And yes, DP does work, but I thought this might be because DP is defined in the INFO column as well. I can't get it to find any terms in the FORMAT column (SSC, AMQ, DP4, etc). Perhaps I could post a small sample? Would it be possible to do so without the headers as to keep study data private?

    Thank you,

    Post edited by ddarcy on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,414Administrator, GATK Developer admin

    Hi David,

    If you'd like you can upload a file snippet to our FTP server (see FAQs for instructions); no one else but us will be able to access it. Feel free to edit out any sensitive information from the headers, as long as you leave the parts of the header that are required for it to be a valid vcf.

    Geraldine Van der Auwera, PhD

  • ddarcyddarcy Posts: 4Member

    I tried filtering just the sample TUMOR (so just there wasn't a NORMAL sample which had a . for the SSC field) and then FT annotation worked. I thought I had done this before, but maybe I had gone from 4 samples to 2. I guess it was due to having two samples in the VCF and one sample didn't have that data (it was normal, and didn't have the SomaticSniper SSC score).

    However after the FT annotation happens, then I can't get SelectVariants to just pull out those variants. I have selected and tagged the SSC > 40 with "SCCover40", but pulling those lines out to a new file...? The SelectVariants -select function only seems to mine the INFO field. How do you select out variants to a new file by the FORMAT field?

    Thank you, David

  • Lisa999Lisa999 UKPosts: 1Member


    I am working on a multi sample VCF file and trying to use SelectVariants to filter out all variants with a repeat length of less that 11bp. I am trying to select for RPA which is an array, and so far I am not succeeding.

    I have tried the following:

    java -Xmx3g -jar $GATK -T SelectVariants -R $REF_FA \
    --variant $IN_VCF -o $SEL_VCF -select 'vc.getGenotype("sample1").getRPA().0 > 10' \
    -log $LOG_FILE -l INFO

    This does not work, but if I swap AD for RPA the code does work:

    java -Xmx3g -jar $GATK -T SelectVariants -R $REF_FA \
    --variant $IN_VCF -o $SEL_VCF -select 'vc.getGenotype("sample1").getAD().0 > 10' \
    -log $LOG_FILE -l INFO

    I have also tried the following without any luck:

    java -Xmx3g -jar $GATK -T SelectVariants -R $REF_FA \
    --variant $IN_VCF -o $SEL_VCF -select 'vc.hasAttribute("RPA").getRPA().0 > 10' \
    -log $LOG_FILE -l INFO

    I would be very grateful if someone could suggest a better method.


  • pdexheimerpdexheimer Posts: 444Member, GSA Collaborator ✭✭✭✭

    Hi Lisa -

    This is essentially what was discussed in another thread, and I think that answer will apply (though you'll probably getAttributeAsInt) - but I still haven't actually tested it

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,414Administrator, GATK Developer admin

    Hi @Lisa999,

    The AD annotation is special-cased, currently there is no equivalent method that is specific for getting the RPA value. But you can try one of the generic "get" methods; getAnyAttribute("RPA") might work.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,414Administrator, GATK Developer admin

    Actually the link @pdexheimer posted is a good source to start from -- you may need to experiment with the various get methods. getExtendedAttribute is also a good one to try, though as he mentions type conversion may be an issue. Let us know how it turns out.

    Geraldine Van der Auwera, PhD

Sign In or Register to comment.