Using SelectVariants

delangeldelangel Posts: 71GATK Developer mod
edited April 2013 in Methods and Workflows

Introduction

SelectVariants is a GATK tool used to subset a VCF file by many arbitrary criteria listed in the command line options below. The output VCF wiil have the AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) annotations updated as necessary to accurately reflect the file's new contents.

Select Variants operates on VCF files (ROD Tracks) provided in the command line using the GATK's built in --variant option. You can provide multiple tracks for Select Variants but at least one must be named 'variant' and this will be the file all your analysis will be based of. Other tracks can be named as you please. Options requiring a reference to a ROD track name will use the track name provided in the -B option to refer to the correct VCF file (e.g. --discordance / --concordance ). All other analysis will be done in the 'variant' track.

Often, a VCF containing many samples and/or variants will need to be subset in order to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls; extracting variant or non-variant loci that meet certain requirements, displaying just a few samples in a browser like IGV, etc.). SelectVariants can be used for this purpose. Given a single VCF file, one or more samples can be extracted from the file (based on a complete sample name or a pattern match). Variants can be further selected by specifying criteria for inclusion, i.e. "DP > 1000" (depth of coverage greater than 1000x), "AF < 0.25" (sites with allele frequency less than 0.25). These JEXL expressions are documented here in the FAQ article on JEXL expressions; it is particularly important to note the section on working with complex expressions.

Command-line arguments

For a complete, detailed argument reference, refer to the GATK document page here.

How do the AC, AF, AN, and DP fields change?

Let's say you have a file with three samples. The numbers before the ":" will be the genotype (0/0 is hom-ref, 0/1 is het, and 1/1 is hom-var), and the number after will be the depth of coverage.

BOB        MARY        LINDA
1/0:20     0/0:30      1/1:50

In this case, the INFO field will say AN=6, AC=3, AF=0.5, and DP=100 (in practice, I think these numbers won't necessarily add up perfectly because of some read filters we apply when calling, but it's approximately right).

Now imagine I only want a file with the samples "BOB" and "MARY". The new file would look like:

BOB        MARY
1/0:20     0/0:30

The INFO field will now have to change to reflect the state of the new data. It will be AN=4, AC=1, AF=0.25, DP=50.

Let's pretend that MARY's genotype wasn't 0/0, but was instead "./." (no genotype could be ascertained). This would look like

BOB        MARY
1/0:20     ./.:.

with AN=2, AC=1, AF=0.5, and DP=20.

Subsetting by sample and ALT alleles

SelectVariants now keeps (r5832) the alt allele, even if a record is AC=0 after subsetting the site down to selected samples. For example, when selecting down to just sample NA12878 from the OMNI VCF in 1000G (1525 samples), the resulting VCF will look like:

1       82154   rs4477212       A       G       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       T       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

Although NA12878 is 0/0 at the first sites, ALT allele is preserved in the VCF record. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant().

  • isVariant => is there an ALT allele?

  • isPolymorphic => is some sample non-ref in the samples?

In part this is complicated as the semantics of sites-only VCFs, where ALT = . is used to mean not-polymorphic. Unfortunately, I just don't think there's a consistent convention right now, but it might be worth at some point to adopt a single approach to handling this.

For clarity, in previous versions of SelectVariants, the first two monomorphic sites lose the ALT allele, because NA12878 is hom-ref at this site, resulting in VCF that looks like:

1       82154   rs4477212       A       .       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       .       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting.

Known issues

Some VCFs may have repeated header entries with the same key name, for instance:

##fileformat=VCFv3.3
##FILTER=ABFilter,&quot;AB &gt; 0.75&quot;
##FILTER=HRunFilter,&quot;HRun &gt; 3.0&quot;
##FILTER=QDFilter,&quot;QD &lt; 5.0&quot;
##UG_bam_file_used=file1.bam
##UG_bam_file_used=file2.bam
##UG_bam_file_used=file3.bam
##UG_bam_file_used=file4.bam
##UG_bam_file_used=file5.bam
##source=UnifiedGenotyper
##source=VariantFiltration
##source=AnnotateVCFwithMAF
...

Here, the "UG_bam_file_used" and "source" header lines appear multiple times. When SelectVariants is run on such a file, the program will emit warnings that these repeated header lines are being discarded, resulting in only the first instance of such a line being written to the resulting VCF. This behavior is not ideal, but expected under the current architecture.

Additional information

For information on how to construct regular expressions for use with this tool, see the "Summary of regular-expression constructs" section here.

Post edited by Geraldine_VdAuwera on

Comments

  • vyellapavyellapa Posts: 31Member

    I'm trying to find the concordant variants between two vcf's where each were generated with 2 different reference genomes(ucsc and 1000 genomes) using the --concordance option. Since ucsc genome has the chromosomes prefixed by "chr", how can I find the concordant calls. I was wondering if removing the "chr" would do or would the header have to be replaced too somehow.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,163Administrator, GATK Developer admin

    You should be very careful when comparing variants called with different references because the references don't just differ by their contig names, there are also some differences in the sequence. The safest way to do this is to "lift over" variants. See this document for more details on how to do this:

    http://www.broadinstitute.org/gatk/guide/article?id=63

    Geraldine Van der Auwera, PhD

  • vyellapavyellapa Posts: 31Member

    Thank you. I checked the 2.3-9 version of the GATK2 download and the resource bundle at ftp.broadinstitute.org for the "liftOverVCF.pl" script.
    Where would this be?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,163Administrator, GATK Developer admin

    You need to download it from the source code repo:

    https://github.com/broadgsa/gatk/tree/master/public/perl

    Geraldine Van der Auwera, PhD

  • golharamgolharam Posts: 23Member ✭✭

    I'm trying to debug why my JEXL select statement is not working. When invoking GATK from the shell, I get this error:

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.3-9-gdcdccbb):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Invalid command line: Invalid JEXL expression detected for select-0 with message ![1,43]: '(vc.getAttribute('1000g2012apr_ALL') < 0.01);' < error
    ERROR ------------------------------------------------------------------------------------------

    I have no idea what the actual error is based on this output. When I debug GATK in Eclipse, I get a different error output:

    ERROR MESSAGE: Invalid argument value '<' at position 10.
    ERROR Invalid argument value '0.01)'' at position 11.

    This is much more helpful at least point to me to where in the select statement the problem is. Is there any way to get this more descriptive error when I invoke GATK from the shell?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,163Administrator, GATK Developer admin

    You can try using -l DEBUG to get more detailed information.

    Geraldine Van der Auwera, PhD

  • ArtemPankinArtemPankin Posts: 9Member

    I am trying to extract variants from a two-genotype VCF, which are polymorphic between genotypes. Is there a way to do it with SelectVariants or any other GATK tool?

  • CarneiroCarneiro Posts: 274Administrator, GATK Developer admin

    what do you mean by two-genotype VCF? two samples?

  • ArtemPankinArtemPankin Posts: 9Member

    I am sorry for the confusion. Yes, it should read "two-sample".

  • CarneiroCarneiro Posts: 274Administrator, GATK Developer admin

    Yup, SelectVariants allows you to select by sample. Take a look at the documentation.

  • ArtemPankinArtemPankin Posts: 9Member

    Thank you for the link. I read it through and could not find the answer. I have got a vcf file with SNPs called for two different samples. I want to select only those SNPs, which distinguish between these two samples, e.g. sample1: 1/1; sample 2: 0/0.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,163Administrator, GATK Developer admin

    Hi @ArtemPankin,

    I think you'll need to compose a JEXL expression to do what you want. See this documentation page.

    Geraldine Van der Auwera, PhD

  • JackJack Posts: 36Member

    Hi, there. I have 10 vcf result files from 10 samples, I want to extract the concordance of the variants of the 10 samples, can SelectVariants help me do this ? Can the argument "--variants" be used more than once?

  • JackJack Posts: 36Member

    Can I use the "--concordance" argument to select the intersect of the variants of the 10 samples?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,163Administrator, GATK Developer admin

    Hi Jack,

    It is easier to first merge the VCFs (using CombineVariants) then use VariantEval to look at the intersection as described here: http://www.broadinstitute.org/gatk/guide/article?id=48

    Geraldine Van der Auwera, PhD

  • geertvandeweyergeertvandeweyer Posts: 11Member

    Hi,

    I'm trying to subset a multisample vcf file to single sample files. I use the following commandline to extract variant locations from a VQSR recalibrated file , both for snp and indel:

    java -Xmx1500m -jar ../../binaries/gatk/2.4.9-g532efad/bin/GenomeAnalysisTK.jar -T SelectVariants -R ../../References/hg19/samtools/0.1.19/hg19.fasta -V UnifiedGenotyper.multi.vcf -o single.sample.vcf -sn 'lq-hl-85-c4-01' -nt 4 -env

    most locations work fine, but there seems to be an issue with multiallelic indel locations.

    for example: the multi-sample input file (trimmed down):

    chr1 7828173 . TAAAAA T,TAAA,TAA,TAAAA 2156.23 PASS AC=12,13,17,12;AF=0.188,0.203,0.266,0.188;AN=64;BaseQRankSum=0.754;DP=729;FS=18.902;InbreedingCoeff=0.7206;MLEAC=9,12,11,12;MLEAF=0.141,0.188,0.172,0.188;MQ=59.06;MQ0=0;MQRankSum=2.746;QD=1.09;RPA=18,13,16,15,17;RU=A;ReadPosRankSum=4.206;STR;VQSLOD=0.578;culprit=FS GT:AD:DP:GQ:PL 1/1:0,6,0,0,0:27:50:982,51,0,711,51,664,552,50,545,503,864,51,705,551,823 1/4:0,3,1,1,0:59:29:750,29,448,521,76,534,376,165,432,498,626,0,493,352,587 0/3:0,0,3,0,0:14:3:118,129,368,72,130,104,0,78,45,51,78,115,74,3,91 ...(20 more)

    when I select the third sample, the result is :

    chr1 7828173 . TAAAAA TAA 2156.23 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.754;DP=14;FS=18.902;InbreedingCoeff=0.7206;MQ=59.06;MQ0=0;MQRankSum=2.746;QD=1.09;RPA=18,13,16,15,17;RU=A;ReadPosRankSum=4.206;STR;VQSLOD=0.578;culprit=FS GT:DP:GQ 0/1:14:3

    This seems incorrect in two ways:
    - original : homozygous, third allele ; selectvariants : heterozygous (allele is correct)
    - selectVariant file is missing AD:GQ:PL fields.

    Is this default behaviour for some reason?

    Best,

    Geert

  • ebanksebanks Posts: 684GATK Developer mod

    Hi Geert,

    Your original VCF shows the 3rd sample as being heterozygous (0/3) so the "selected" VCF looks perfectly fine to me. The other tags are removed when you select down from many alternate alleles to just one for technical reasons, as discussed in other threads on this forum.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • geertvandeweyergeertvandeweyer Posts: 11Member

    oops, my mistake. It is indeed heterozygous. In the meantime, I've found some posts about the removed tags. Thanks for the pointer.

  • blueskypyblueskypy Posts: 228Member

    I'd like to subset 500 samples from a vcf file of 2000 samples, do I have to list -sn 500 times since there isn't a consist pattern in the names of the 500 samples?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,163Administrator, GATK Developer admin

    Hi @blueskypy‌,

    If there's no consistent pattern in the names, I can't think of any way to do this without listing -sn 500 times, sorry. Someone else may be aware of a trick to do this more easily but I don't know of any.

    Geraldine Van der Auwera, PhD

  • pdexheimerpdexheimer Posts: 417Member, GSA Collaborator ✭✭✭✭

    @blueskypy‌ - You could write the sample names to a file and use the --sample_file/-sf argument to SelectVariants

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,163Administrator, GATK Developer admin

    Ah, there you go! @pdexheimer to the rescue as usual :)

    Geraldine Van der Auwera, PhD

  • jhl667jhl667 OregonPosts: 3Member

    Using v3.1, I'm getting the following error:

    Line 581694: there aren't enough columns for line Total compute time in PairHMM computeLikelihoods() : 0.0 (we expected 9 tokens, and saw 1 )

    My command line is:

    java -jar -Xmx2g $GATK -R $REF -T SelectVariants -V $PROJ/output/$CHROM.vcf -selectType SNP -o $PROJ/snps/$CHROM_snps.vcf

    I've not run in to this problem before, so am at a loss for what I may be overlooking.

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,163Administrator, GATK Developer admin

    @jhl667 I believe that was a bug in 3.2 (where log output ended up in the data output stream) that has been fixed since.

    Geraldine Van der Auwera, PhD

  • jhl667jhl667 OregonPosts: 3Member

    @Geraldine_VdAuwera‌ I suspected that might be the case, so I went back and tried the same command line with v3.2-2. I am seeing the same result.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,163Administrator, GATK Developer admin

    The problem is in your VCF file, so you need to regenerate it (or fix it) to allow SelectVariants to work properly.

    Geraldine Van der Auwera, PhD

  • jhl667jhl667 OregonPosts: 3Member

    @Geraldine_VdAuwera‌ Alright, I will keep working on it. Thanks.

Sign In or Register to comment.