The current GATK version is 3.2-2

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

# Selecting Variants by searching the FORMAT field array

Posts: 4Member

Hello all,

First post. Thank you for these amazing tools. I have spent two days pulling my hair out, trying all enumerations, searching the documentation and forums, and in the end I come to you for help. Please forgive me if these topics have been covered elsewhere.

I have several VCFs generated by SomaticSniper that I'd like to filter based on the SomaticScore (SSC in the FORMAT field). I was working with VariantFiltration and SelectVariants, and trying to use different options, as well as regular expressions, to select those calls with a SSC over 40. I have been unable to do so. I also looked into trying to figure out JEXL, and using the last command listed on the documentation page, about using the VariantContext feature to drill into an array. I cannot get it to recognize the SSC column of the FORMAT field and then filter for the TUMOR sample.

Using VariantFiltration I was using -select (but I understand now that this searches the INFO field only). I was then using the --genotypeFilterExpression, but it would not add the FT tag to the FORMAT field as it said it would, it would just apply PASS to everything.

java -Xmx4-jar GenomeAnalysisTK.jar -T VariantFiltration -R ~/Documents/reference/human_g1k_v37.fasta -V '/home/registry3/Desktop/merged/104024sniperRAWSNPS.vcf' --G_filter "SSC < 40.0" --G_filterName "myFilter" -o '/home/registry3/Desktop/merged/104024sniperFILTEREDSNPS.vcf'

Using SelectVariants, I was using -sn to select the TUMOR sample and then using -select_expressions, but I guess this also only works on the INFO field. I had been trying to use --sample_expression which gives the ability to use a regular expression, but then I never had good results; it wouldn't do any filtering, and output the entire input file. Does the regular expression only apply to the sample name, and not the content of each line? Trying to select SSC over 40 from a line like this

#CHROM   POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TUMOR
1   10177   0   A   C   0   0   AC=1;AF=0.500;AN=2;DP=62    GT:AMQ:BCOUNT:BQ:DP:DP4:GQ:IGT:MQ:SS:SSC:VAQ    0/1:16,15:40,22,0,0:28,25:62:31,9,10,12:37:0/1:16:2:19:37

I used a line such as this, to look at the second to last number in the FORMAT field based on : dividers

java -Xmx4-jar GenomeAnalysisTK.jar -R ~/Documents/reference/human_g1k_v37.fasta -T SelectVariants --variant '/home/registry3/Desktop/merged/104024sniperSNPs.vcf' -o '/home/registry3/Desktop/merged/104024sniperSSC40.vcf' -se ".*:[4,5,6,7,8,9][1,2,3,4,5,6,7,8,9]:[1234567890]{2,3}$" I am not a coder, as you can probably see, but I'm trying to get this worked out. This output the entire file still, with SSC values above and below 40. Looking into use the vc.getGenotype array access, I could not find much documentation about VariantContext; I was looking through the files on github, looking through the code and looking for samples, since the .getAD() from the documentation seems to work, but alas, there is no .getSCC() available. Is using vc. the best way to drill into an array (the FORMAT field) and search for what I want? I didn't post all the code and output, trying to keep this as short as possible. I can post pastebin outputs if that would be helpful. Thank you, David Tagged: ## Best Answers ## Answers • Posts: 5,988Administrator, GATK Developer admin Hi David, Well we certainly can't fault you for not trying! Variant filtration using JEXLs is one of the most vexing topics people often ask us about, so please don't feel bad. You are on the right track, using VariantFiltration with a genotype filter expression is indeed the way to go. I'm not entirely certain why what you tried already didn't work. However... from the look of it, this SSC annotation is a single integer value per sample, correct? Not an array and not a float? So you shouldn't need to query the vc context at least. Maybe it's a typing issue. In the example you posted above you have -G_filter "SSC < 40.0". Have you tried the same but with the integer representation, ie "SSC < 40"? Geraldine Van der Auwera, PhD • Posts: 4Member Hello and thank you for getting back so quickly! From: java -Xmx4g -jar GenomeAnalysisTK.jar -R /Homo_sapiens_assembly19.fasta -T VariantFiltration -V /sniper/104024-T--104024-N.snv.soticsniper.vcf -o /sniper/104024-T--104024-N.snv.somaticsniper.overSCC40.vcf -G_filter "SSC < 40" -G_filterName "SSC40" I get thousands of: WARN 14:25:59,784 Interpreter - ![0,3]: 'SSC < 40;' undefined variable SSC WARN 14:25:59,784 Interpreter - ![0,3]: 'SSC < 40;' undefined variable SSC And it outputs all lines. "SSC < 40.0" does the same. It is a single integer per sample; So it is not an array, even though it is in the Sample column as number:number:string:number? I figured an array referred to a number referenced by the FORMAT column that was in a SAMPLE column buried between colons. The header contains ##FORMAT=<ID=SSC,Number=1,Type=Integer,Description="Somatic Score"> So I don't know why it's undefined... Thank you, DD • Posts: 4Member edited September 2013 In the VCF, all lines have the SSC field defined in the FORMAT column. The normal sample had a . for this value ( as in 22:1:.:44 etc) but the tumor samples all had a number in this spot. I tried filtering just the sample TUMOR, so the the VCF FORMAT field had SCC and every sample had a value for that, but to no avail (underfined variable SSC). Regarding the FT annotation, as far as I understand that would be added to the FORMAT field, correct? And yes, DP does work, but I thought this might be because DP is defined in the INFO column as well. I can't get it to find any terms in the FORMAT column (SSC, AMQ, DP4, etc). Perhaps I could post a small sample? Would it be possible to do so without the headers as to keep study data private? Thank you, David Post edited by ddarcy on • Posts: 5,988Administrator, GATK Developer admin Hi David, If you'd like you can upload a file snippet to our FTP server (see FAQs for instructions); no one else but us will be able to access it. Feel free to edit out any sensitive information from the headers, as long as you leave the parts of the header that are required for it to be a valid vcf. Geraldine Van der Auwera, PhD • Posts: 4Member I tried filtering just the sample TUMOR (so just there wasn't a NORMAL sample which had a . for the SSC field) and then FT annotation worked. I thought I had done this before, but maybe I had gone from 4 samples to 2. I guess it was due to having two samples in the VCF and one sample didn't have that data (it was normal, and didn't have the SomaticSniper SSC score). However after the FT annotation happens, then I can't get SelectVariants to just pull out those variants. I have selected and tagged the SSC > 40 with "SCCover40", but pulling those lines out to a new file...? The SelectVariants -select function only seems to mine the INFO field. How do you select out variants to a new file by the FORMAT field? Thank you, David • UKPosts: 1Member Hi, I am working on a multi sample VCF file and trying to use SelectVariants to filter out all variants with a repeat length of less that 11bp. I am trying to select for RPA which is an array, and so far I am not succeeding. I have tried the following: java -Xmx3g -jar$GATK -T SelectVariants -R $REF_FA \ --variant$IN_VCF -o $SEL_VCF -select 'vc.getGenotype("sample1").getRPA().0 > 10' \ -log$LOG_FILE -l INFO

This does not work, but if I swap AD for RPA the code does work:

java -Xmx3g -jar $GATK -T SelectVariants -R$REF_FA \ --variant $IN_VCF -o$SEL_VCF -select 'vc.getGenotype("sample1").getAD().0 > 10' \ -log $LOG_FILE -l INFO I have also tried the following without any luck: java -Xmx3g -jar$GATK -T SelectVariants -R $REF_FA \ --variant$IN_VCF -o $SEL_VCF -select 'vc.hasAttribute("RPA").getRPA().0 > 10' \ -log$LOG_FILE -l INFO

I would be very grateful if someone could suggest a better method.

Thanks, Lisa

• Posts: 330Member, GSA Collaborator ✭✭✭

Hi Lisa -

This is essentially what was discussed in another thread, and I think that answer will apply (though you'll probably getAttributeAsInt) - but I still haven't actually tested it

Hi @Lisa999,

The AD annotation is special-cased, currently there is no equivalent method that is specific for getting the RPA value. But you can try one of the generic "get" methods; getAnyAttribute("RPA") might work.

Geraldine Van der Auwera, PhD