The current GATK version is 3.2-2

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Bug Bulletin: The recent 3.2 release fixes many issues. If you run into a problem, please try the latest version before posting a bug report, as your problem may already have been solved.

# Using SelectVariants

Posts: 71GATK Developer mod
edited April 2013

Introduction

SelectVariants is a GATK tool used to subset a VCF file by many arbitrary criteria listed in the command line options below. The output VCF wiil have the AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) annotations updated as necessary to accurately reflect the file's new contents.

Select Variants operates on VCF files (ROD Tracks) provided in the command line using the GATK's built in --variant option. You can provide multiple tracks for Select Variants but at least one must be named 'variant' and this will be the file all your analysis will be based of. Other tracks can be named as you please. Options requiring a reference to a ROD track name will use the track name provided in the -B option to refer to the correct VCF file (e.g. --discordance / --concordance ). All other analysis will be done in the 'variant' track.

Often, a VCF containing many samples and/or variants will need to be subset in order to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls; extracting variant or non-variant loci that meet certain requirements, displaying just a few samples in a browser like IGV, etc.). SelectVariants can be used for this purpose. Given a single VCF file, one or more samples can be extracted from the file (based on a complete sample name or a pattern match). Variants can be further selected by specifying criteria for inclusion, i.e. "DP > 1000" (depth of coverage greater than 1000x), "AF < 0.25" (sites with allele frequency less than 0.25). These JEXL expressions are documented here in the FAQ article on JEXL expressions; it is particularly important to note the section on working with complex expressions.

### Command-line arguments

For a complete, detailed argument reference, refer to the GATK document page here.

### How do the AC, AF, AN, and DP fields change?

Let's say you have a file with three samples. The numbers before the ":" will be the genotype (0/0 is hom-ref, 0/1 is het, and 1/1 is hom-var), and the number after will be the depth of coverage.

BOB        MARY        LINDA
1/0:20     0/0:30      1/1:50


In this case, the INFO field will say AN=6, AC=3, AF=0.5, and DP=100 (in practice, I think these numbers won't necessarily add up perfectly because of some read filters we apply when calling, but it's approximately right).

Now imagine I only want a file with the samples "BOB" and "MARY". The new file would look like:

BOB        MARY
1/0:20     0/0:30


The INFO field will now have to change to reflect the state of the new data. It will be AN=4, AC=1, AF=0.25, DP=50.

Let's pretend that MARY's genotype wasn't 0/0, but was instead "./." (no genotype could be ascertained). This would look like

BOB        MARY
1/0:20     ./.:.


with AN=2, AC=1, AF=0.5, and DP=20.

### Subsetting by sample and ALT alleles

SelectVariants now keeps (r5832) the alt allele, even if a record is AC=0 after subsetting the site down to selected samples. For example, when selecting down to just sample NA12878 from the OMNI VCF in 1000G (1525 samples), the resulting VCF will look like:

1       82154   rs4477212       A       G       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       T       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942


Although NA12878 is 0/0 at the first sites, ALT allele is preserved in the VCF record. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant().

• isVariant => is there an ALT allele?

• isPolymorphic => is some sample non-ref in the samples?

In part this is complicated as the semantics of sites-only VCFs, where ALT = . is used to mean not-polymorphic. Unfortunately, I just don't think there's a consistent convention right now, but it might be worth at some point to adopt a single approach to handling this.

For clarity, in previous versions of SelectVariants, the first two monomorphic sites lose the ALT allele, because NA12878 is hom-ref at this site, resulting in VCF that looks like:

1       82154   rs4477212       A       .       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       .       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942


If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting.

### Known issues

Some VCFs may have repeated header entries with the same key name, for instance:

##fileformat=VCFv3.3
##FILTER=ABFilter,&quot;AB &gt; 0.75&quot;
##FILTER=HRunFilter,&quot;HRun &gt; 3.0&quot;
##FILTER=QDFilter,&quot;QD &lt; 5.0&quot;
##UG_bam_file_used=file1.bam
##UG_bam_file_used=file2.bam
##UG_bam_file_used=file3.bam
##UG_bam_file_used=file4.bam
##UG_bam_file_used=file5.bam
##source=UnifiedGenotyper
##source=VariantFiltration
##source=AnnotateVCFwithMAF
...


Here, the "UG_bam_file_used" and "source" header lines appear multiple times. When SelectVariants is run on such a file, the program will emit warnings that these repeated header lines are being discarded, resulting in only the first instance of such a line being written to the resulting VCF. This behavior is not ideal, but expected under the current architecture.

For information on how to construct regular expressions for use with this tool, see the "Summary of regular-expression constructs" section here.

Post edited by Geraldine_VdAuwera on
Tagged:

• Posts: 28Member

I'm trying to find the concordant variants between two vcf's where each were generated with 2 different reference genomes(ucsc and 1000 genomes) using the --concordance option. Since ucsc genome has the chromosomes prefixed by "chr", how can I find the concordant calls. I was wondering if removing the "chr" would do or would the header have to be replaced too somehow.

You should be very careful when comparing variants called with different references because the references don't just differ by their contig names, there are also some differences in the sequence. The safest way to do this is to "lift over" variants. See this document for more details on how to do this:

Geraldine Van der Auwera, PhD

• Posts: 28Member

Thank you. I checked the 2.3-9 version of the GATK2 download and the resource bundle at ftp.broadinstitute.org for the "liftOverVCF.pl" script. Where would this be?

Geraldine Van der Auwera, PhD

• Posts: 18Member

I'm trying to debug why my JEXL select statement is not working. When invoking GATK from the shell, I get this error:

##### ERROR ------------------------------------------------------------------------------------------

I have no idea what the actual error is based on this output. When I debug GATK in Eclipse, I get a different error output:

##### ERROR Invalid argument value '0.01)'' at position 11.

This is much more helpful at least point to me to where in the select statement the problem is. Is there any way to get this more descriptive error when I invoke GATK from the shell?

You can try using -l DEBUG to get more detailed information.

Geraldine Van der Auwera, PhD

• Posts: 8Member

I am trying to extract variants from a two-genotype VCF, which are polymorphic between genotypes. Is there a way to do it with SelectVariants or any other GATK tool?

what do you mean by two-genotype VCF? two samples?

• Posts: 8Member

I am sorry for the confusion. Yes, it should read "two-sample".

Yup, SelectVariants allows you to select by sample. Take a look at the documentation.

• Posts: 8Member

Thank you for the link. I read it through and could not find the answer. I have got a vcf file with SNPs called for two different samples. I want to select only those SNPs, which distinguish between these two samples, e.g. sample1: 1/1; sample 2: 0/0.

Hi @ArtemPankin,

I think you'll need to compose a JEXL expression to do what you want. See this documentation page.

Geraldine Van der Auwera, PhD

• Posts: 35Member

Hi, there. I have 10 vcf result files from 10 samples, I want to extract the concordance of the variants of the 10 samples, can SelectVariants help me do this ? Can the argument "--variants" be used more than once?

• Posts: 35Member

Can I use the "--concordance" argument to select the intersect of the variants of the 10 samples?

Hi Jack,

It is easier to first merge the VCFs (using CombineVariants) then use VariantEval to look at the intersection as described here: http://www.broadinstitute.org/gatk/guide/article?id=48

Geraldine Van der Auwera, PhD

• Posts: 8Member

Hi,

I'm trying to subset a multisample vcf file to single sample files. I use the following commandline to extract variant locations from a VQSR recalibrated file , both for snp and indel:

java -Xmx1500m -jar ../../binaries/gatk/2.4.9-g532efad/bin/GenomeAnalysisTK.jar -T SelectVariants -R ../../References/hg19/samtools/0.1.19/hg19.fasta -V UnifiedGenotyper.multi.vcf -o single.sample.vcf -sn 'lq-hl-85-c4-01' -nt 4 -env

most locations work fine, but there seems to be an issue with multiallelic indel locations.

for example: the multi-sample input file (trimmed down):

chr1 7828173 . TAAAAA T,TAAA,TAA,TAAAA 2156.23 PASS AC=12,13,17,12;AF=0.188,0.203,0.266,0.188;AN=64;BaseQRankSum=0.754;DP=729;FS=18.902;InbreedingCoeff=0.7206;MLEAC=9,12,11,12;MLEAF=0.141,0.188,0.172,0.188;MQ=59.06;MQ0=0;MQRankSum=2.746;QD=1.09;RPA=18,13,16,15,17;RU=A;ReadPosRankSum=4.206;STR;VQSLOD=0.578;culprit=FS GT:AD:DP:GQ:PL 1/1:0,6,0,0,0:27:50:982,51,0,711,51,664,552,50,545,503,864,51,705,551,823 1/4:0,3,1,1,0:59:29:750,29,448,521,76,534,376,165,432,498,626,0,493,352,587 0/3:0,0,3,0,0:14:3:118,129,368,72,130,104,0,78,45,51,78,115,74,3,91 ...(20 more)

when I select the third sample, the result is :

chr1 7828173 . TAAAAA TAA 2156.23 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.754;DP=14;FS=18.902;InbreedingCoeff=0.7206;MQ=59.06;MQ0=0;MQRankSum=2.746;QD=1.09;RPA=18,13,16,15,17;RU=A;ReadPosRankSum=4.206;STR;VQSLOD=0.578;culprit=FS GT:DP:GQ 0/1:14:3

This seems incorrect in two ways: - original : homozygous, third allele ; selectvariants : heterozygous (allele is correct) - selectVariant file is missing AD:GQ:PL fields.

Is this default behaviour for some reason?

Best,

Geert

• Posts: 678GATK Developer mod

Hi Geert,

Your original VCF shows the 3rd sample as being heterozygous (0/3) so the "selected" VCF looks perfectly fine to me. The other tags are removed when you select down from many alternate alleles to just one for technical reasons, as discussed in other threads on this forum.

Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

• Posts: 8Member

oops, my mistake. It is indeed heterozygous. In the meantime, I've found some posts about the removed tags. Thanks for the pointer.

• Posts: 191Member

I'd like to subset 500 samples from a vcf file of 2000 samples, do I have to list -sn 500 times since there isn't a consist pattern in the names of the 500 samples?

Hi @blueskypy‌,

If there's no consistent pattern in the names, I can't think of any way to do this without listing -sn 500 times, sorry. Someone else may be aware of a trick to do this more easily but I don't know of any.

Geraldine Van der Auwera, PhD

• Posts: 324Member, GSA Collaborator ✭✭✭

@blueskypy‌ - You could write the sample names to a file and use the --sample_file/-sf argument to SelectVariants