Selecting variants of interest from a callset

delangel (Broad Institute), Member
edited May 2015 in Methods and Algorithms

This document describes why you might want to extract a subset of variants from a callset and how you would achieve this.

Often, a VCF containing many samples and/or variants will need to be subset in order to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls, extracting variant or non-variant loci that meet certain requirements, or displaying just a few samples in a browser like IGV). The GATK tool that we use the most for subsetting calls in various ways is SelectVariants; it enables easy and convenient subsetting of VCF files according to many criteria.

SelectVariants operates on VCF files (also sometimes referred to as ROD in our documentation, for Reference Ordered Data) provided at the command line using the GATK's built-in --variant option. You can provide multiple VCF files to SelectVariants, but at least one must be named 'variant'; this is the file (or set of files) from which variants will be selected. Other files can be used to modify the selection based on concordance or discordance between the callsets (see the --discordance / --concordance arguments in the tool documentation).

There are many options for setting the selection criteria, depending on what you want to achieve. For example, given a single VCF file, one or more samples can be extracted from the file, based either on a complete sample name, or on a pattern match. Variants can also be selected based on annotated properties, such as depth of coverage or allele frequency. This is done using JEXL expressions; make sure to read the linked document for details, especially the section on working with complex expressions.

Note that in the output VCF, some annotations such as AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) are recalculated as appropriate to accurately reflect the composition of the subset callset. See further below for an explanation of how that works.

Command-line arguments

For a complete, detailed argument reference, refer to the GATK document page here.

Subsetting by sample and ALT alleles

SelectVariants now keeps (r5832) the alt allele, even if a record is AC=0 after subsetting the site down to selected samples. For example, when selecting down to just sample NA12878 from the OMNI VCF in 1000G (1525 samples), the resulting VCF will look like:

1       82154   rs4477212       A       G       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       T       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

Although NA12878 is 0/0 at the first two sites, the ALT allele is preserved in the VCF record. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant().

  • isVariant => is there an ALT allele?

  • isPolymorphic => is some sample non-ref in the samples?
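
As a rough illustration of the distinction, here is a minimal Python sketch applied to two of the records above. It is not GATK code: the function names `is_variant` and `is_polymorphic` and the simplified string handling are assumptions made for this example only.

```python
def is_variant(alt_field):
    """True if the record lists at least one ALT allele ('.' means none)."""
    return alt_field not in (".", "")


def is_polymorphic(genotypes):
    """True if any of the selected samples carries a non-reference allele."""
    for gt in genotypes:
        # Treat phased (|) and unphased (/) genotypes the same way.
        alleles = gt.replace("|", "/").split("/")
        if any(allele not in ("0", ".") for allele in alleles):
            return True
    return False


# rs4477212 after selecting NA12878: ALT=G is kept, genotype is 0/0.
print(is_variant("G"), is_polymorphic(["0/0"]))  # True False
# SNP1-555149: ALT=T, genotype is 1/1.
print(is_variant("T"), is_polymorphic(["1/1"]))  # True True
```

In these terms, the first two records are still variant (an ALT allele is listed) but no longer polymorphic within the selected sample.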

For clarity, in previous versions of SelectVariants, the first two monomorphic sites lost the ALT allele because NA12878 is hom-ref at these sites, resulting in a VCF that looks like:

1       82154   rs4477212       A       .       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       .       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting (-env / --excludeNonVariants).

How do the AC, AF, AN, and DP fields change?

Let's say you have a file with three samples. The numbers before the ":" will be the genotype (0/0 is hom-ref, 0/1 is het, and 1/1 is hom-var), and the number after will be the depth of coverage.

BOB        MARY        LINDA
1/0:20     0/0:30      1/1:50

In this case, the INFO field will say AN=6, AC=3, AF=0.5, and DP=100 (in practice, I think these numbers won't necessarily add up perfectly because of some read filters we apply when calling, but it's approximately right).

Now imagine I only want a file with the samples "BOB" and "MARY". The new file would look like:

BOB        MARY
1/0:20     0/0:30

The INFO field will now have to change to reflect the state of the new data. It will be AN=4, AC=1, AF=0.25, DP=50.

Let's pretend that MARY's genotype wasn't 0/0, but was instead "./." (no genotype could be ascertained). This would look like

BOB        MARY
1/0:20     ./.:.

with AN=2, AC=1, AF=0.5, and DP=20.
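
The arithmetic in the three examples above can be reproduced with a short Python sketch. This is an illustration under simplifying assumptions, not the GATK implementation: the function name `recalc_info` is made up for this example, the site is assumed to be biallelic, and DP is taken as the plain sum of the per-sample depths.

```python
def recalc_info(samples):
    """samples maps sample name -> (genotype string, depth or None)."""
    an = ac = dp = 0
    for gt, depth in samples.values():
        for allele in gt.replace("|", "/").split("/"):
            if allele == ".":
                continue  # no-call alleles do not count toward AN
            an += 1
            if allele != "0":
                ac += 1  # any non-reference allele counts toward AC
        if depth is not None:
            dp += depth
    af = ac / an if an else None
    return {"AN": an, "AC": ac, "AF": af, "DP": dp}


all_three = {"BOB": ("1/0", 20), "MARY": ("0/0", 30), "LINDA": ("1/1", 50)}
print(recalc_info(all_three))
# {'AN': 6, 'AC': 3, 'AF': 0.5, 'DP': 100}

bob_and_mary = {name: all_three[name] for name in ("BOB", "MARY")}
print(recalc_info(bob_and_mary))
# {'AN': 4, 'AC': 1, 'AF': 0.25, 'DP': 50}

print(recalc_info({"BOB": ("1/0", 20), "MARY": ("./.", None)}))
# {'AN': 2, 'AC': 1, 'AF': 0.5, 'DP': 20}
```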

Additional information

For information on how to construct regular expressions for use with this tool, see the method article on variant filtering with JEXL, or the "Summary of regular-expression constructs" section here for more hardcore reading.

Post edited by Geraldine_VdAuwera on


  • I'm trying to find the concordant variants between two VCFs that were generated with two different reference genomes (UCSC and 1000 Genomes) using the --concordance option. Since the UCSC genome has its chromosomes prefixed by "chr", how can I find the concordant calls? I was wondering whether removing the "chr" would do, or whether the header would have to be replaced too somehow.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie

    You should be very careful when comparing variants called with different references because the references don't just differ by their contig names, there are also some differences in the sequence. The safest way to do this is to "lift over" variants. See this document for more details on how to do this:

  • Thank you. I checked the 2.3-9 version of the GATK2 download and the resource bundle at for the "" script.
    Where would this be?

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie

    You need to download it from the source code repo:

  • I'm trying to debug why my JEXL select statement is not working. When invoking GATK from the shell, I get this error:

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.3-9-gdcdccbb):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions
    ERROR MESSAGE: Invalid command line: Invalid JEXL expression detected for select-0 with message ![1,43]: '(vc.getAttribute('1000g2012apr_ALL') < 0.01);' < error
    ERROR ------------------------------------------------------------------------------------------

    I have no idea what the actual error is based on this output. When I debug GATK in Eclipse, I get a different error output:

    ERROR MESSAGE: Invalid argument value '<' at position 10.
    ERROR Invalid argument value '0.01)'' at position 11.

    This is much more helpful; at least it points me to where in the select statement the problem is. Is there any way to get this more descriptive error when I invoke GATK from the shell?

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie

    You can try using -l DEBUG to get more detailed information.

  • I am trying to extract variants from a two-genotype VCF, which are polymorphic between genotypes. Is there a way to do it with SelectVariants or any other GATK tool?

  • Carneiro (Charlestown, MA), Member

    what do you mean by two-genotype VCF? two samples?

  • I am sorry for the confusion. Yes, it should read "two-sample".

  • Carneiro (Charlestown, MA), Member

    Yup, SelectVariants allows you to select by sample. Take a look at the documentation.

  • Thank you for the link. I read it through and could not find the answer. I have got a vcf file with SNPs called for two different samples. I want to select only those SNPs, which distinguish between these two samples, e.g. sample1: 1/1; sample 2: 0/0.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie

    Hi @ArtemPankin,

    I think you'll need to compose a JEXL expression to do what you want. See this documentation page.
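
To make the selection logic in this question concrete, here is a minimal Python sketch that works on the VCF text directly, outside of GATK. It is not a JEXL expression and not how SelectVariants does it. The function name `differing_sites`, the sample names, and the example records are all hypothetical, and the sketch assumes a tab-delimited VCF with a #CHROM header line and a GT entry in the FORMAT column.

```python
def differing_sites(vcf_lines, sample_a, sample_b):
    """Yield data lines where the two samples have different called genotypes."""
    idx_a = idx_b = None
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Locate the two sample columns from the header line.
            idx_a = fields.index(sample_a)
            idx_b = fields.index(sample_b)
            continue
        gt_pos = fields[8].split(":").index("GT")
        genotypes = []
        for idx in (idx_a, idx_b):
            gt = fields[idx].split(":")[gt_pos].replace("|", "/")
            genotypes.append(sorted(gt.split("/")))  # 0/1 and 1/0 compare equal
        if "." in genotypes[0] or "." in genotypes[1]:
            continue  # skip sites where either sample is a no-call
        if genotypes[0] != genotypes[1]:
            yield line


example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\tsample2",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT\t1/1\t0/0",
    "1\t200\t.\tC\tT\t50\tPASS\t.\tGT\t0/1\t1/0",
    "1\t300\t.\tG\tA\t50\tPASS\t.\tGT\t./.\t1/1",
]
for record in differing_sites(example, "sample1", "sample2"):
    print(record.split("\t")[1])  # prints 100
```

Only the first record is reported: the second has the same heterozygous genotype in both samples, and the third has a no-call.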

  • Jack, Member

    Hi, there. I have 10 vcf result files from 10 samples, I want to extract the concordance of the variants of the 10 samples, can SelectVariants help me do this ? Can the argument "--variants" be used more than once?

  • Jack, Member

    Can I use the "--concordance" argument to select the intersect of the variants of the 10 samples?

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie

    Hi Jack,

    It is easier to first merge the VCFs (using CombineVariants) then use VariantEval to look at the intersection as described here:

  • Hi,

    I'm trying to subset a multisample vcf file to single sample files. I use the following commandline to extract variant locations from a VQSR recalibrated file , both for snp and indel:

    java -Xmx1500m -jar ../../binaries/gatk/2.4.9-g532efad/bin/GenomeAnalysisTK.jar -T SelectVariants -R ../../References/hg19/samtools/0.1.19/hg19.fasta -V UnifiedGenotyper.multi.vcf -o single.sample.vcf -sn 'lq-hl-85-c4-01' -nt 4 -env

    most locations work fine, but there seems to be an issue with multiallelic indel locations.

    for example: the multi-sample input file (trimmed down):

    chr1 7828173 . TAAAAA T,TAAA,TAA,TAAAA 2156.23 PASS AC=12,13,17,12;AF=0.188,0.203,0.266,0.188;AN=64;BaseQRankSum=0.754;DP=729;FS=18.902;InbreedingCoeff=0.7206;MLEAC=9,12,11,12;MLEAF=0.141,0.188,0.172,0.188;MQ=59.06;MQ0=0;MQRankSum=2.746;QD=1.09;RPA=18,13,16,15,17;RU=A;ReadPosRankSum=4.206;STR;VQSLOD=0.578;culprit=FS GT:AD:DP:GQ:PL 1/1:0,6,0,0,0:27:50:982,51,0,711,51,664,552,50,545,503,864,51,705,551,823 1/4:0,3,1,1,0:59:29:750,29,448,521,76,534,376,165,432,498,626,0,493,352,587 0/3:0,0,3,0,0:14:3:118,129,368,72,130,104,0,78,45,51,78,115,74,3,91 ...(20 more)

    when I select the third sample, the result is :

    chr1 7828173 . TAAAAA TAA 2156.23 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.754;DP=14;FS=18.902;InbreedingCoeff=0.7206;MQ=59.06;MQ0=0;MQRankSum=2.746;QD=1.09;RPA=18,13,16,15,17;RU=A;ReadPosRankSum=4.206;STR;VQSLOD=0.578;culprit=FS GT:DP:GQ 0/1:14:3

    This seems incorrect in two ways:

    • original : homozygous, third allele ; selectvariants : heterozygous (allele is correct)
    • selectVariant file is missing AD:GQ:PL fields.

    Is this default behaviour for some reason?



  • ebanks (Broad Institute), Member, Broadie, Dev

    Hi Geert,

    Your original VCF shows the 3rd sample as being heterozygous (0/3) so the "selected" VCF looks perfectly fine to me. The other tags are removed when you select down from many alternate alleles to just one for technical reasons, as discussed in other threads on this forum.

  • oops, my mistake. It is indeed heterozygous. In the meantime, I've found some posts about the removed tags. Thanks for the pointer.

  • I'd like to subset 500 samples from a VCF file of 2000 samples. Do I have to list -sn 500 times, since there isn't a consistent pattern in the names of the 500 samples?

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie

    Hi @blueskypy,

    If there's no consistent pattern in the names, I can't think of any way to do this without listing -sn 500 times, sorry. Someone else may be aware of a trick to do this more easily but I don't know of any.

  • pdexheimer, Member, Dev

    @blueskypy - You could write the sample names to a file and use the --sample_file/-sf argument to SelectVariants
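
One way to prepare such a file is sketched below in Python. The helper name `write_sample_file`, the example sample names, and the file name `samples.txt` are hypothetical. The sketch assumes the file should hold one sample name per line, and writes to a temporary directory so that it can be run anywhere.

```python
import os
import tempfile
from pathlib import Path


def write_sample_file(sample_names, path):
    """Write one sample name per line, skipping blanks and duplicates."""
    unique_names = []
    for name in sample_names:
        name = name.strip()
        if name and name not in unique_names:
            unique_names.append(name)
    Path(path).write_text("\n".join(unique_names) + "\n")
    return len(unique_names)


# Hypothetical sample names; in practice these would come from your own list.
names = ["sampleA", "sampleB ", "sampleA", ""]
out_path = os.path.join(tempfile.mkdtemp(), "samples.txt")
print(write_sample_file(names, out_path))  # 2
print(Path(out_path).read_text(), end="")
```

The resulting file can then be passed with -sf in place of the repeated -sn arguments.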

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie

    Ah, there you go! @pdexheimer to the rescue as usual :)

  • jhl667 (Oregon), Member

    Using v3.1, I'm getting the following error:

    Line 581694: there aren't enough columns for line Total compute time in PairHMM computeLikelihoods() : 0.0 (we expected 9 tokens, and saw 1 )

    My command line is:

    java -jar -Xmx2g $GATK -R $REF -T SelectVariants -V $PROJ/output/$CHROM.vcf -selectType SNP -o $PROJ/snps/${CHROM}_snps.vcf

    I've not run in to this problem before, so am at a loss for what I may be overlooking.


  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie

    @jhl667 I believe that was a bug in 3.2 (where log output ended up in the data output stream) that has been fixed since.

  • jhl667 (Oregon), Member

    @Geraldine_VdAuwera I suspected that might be the case, so I went back and tried the same command line with v3.2-2. I am seeing the same result.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie

    The problem is in your VCF file, so you need to regenerate it (or fix it) to allow SelectVariants to work properly.
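
To locate stray lines like the one quoted in the error above, a small scan of the file can help before regenerating it. The Python sketch below is only an illustration. The function name `find_short_lines` and the example records are made up, and the default threshold of 8 reflects the eight fixed VCF columns (use 9 or more for files with genotype columns, as in the error message).

```python
def find_short_lines(lines, min_columns=8):
    """Return (line_number, text) for data lines with too few tab-separated columns."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        if not text or text.startswith("#"):
            continue  # skip blank lines and header lines
        if len(text.split("\t")) < min_columns:
            bad.append((number, text))
    return bad


example = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\tPASS\tDP=10",
    "Total compute time in PairHMM computeLikelihoods() : 0.0",
]
print(find_short_lines(example))
# [(4, 'Total compute time in PairHMM computeLikelihoods() : 0.0')]
```

For a real file, the same function can be called on an open file handle, for example `find_short_lines(open(path))`, where `path` is your own VCF.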

  • jhl667 (Oregon), Member

    @Geraldine_VdAuwera Alright, I will keep working on it. Thanks.

  • eflannery (San Diego), Member
    edited April 2015

    Hi, I am using the following command line to run SelectVariants:

    java -Xmx4g -jar ~/bin/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar \
        -R ~/Dropbox/Genomes/PlasmoDB-13.0_Pfalciparum3D7_Genome.fasta \
        -T VariantFiltration \
        --variant ${input} \
        -o ${input}_variantFiltration.vcf \
        -filter "ReadPosRankSum > 12.4" -filterName "HighRPRS" \
        -filter "ReadPosRankSum < -9.4" -filterName "LowRPRS" \
        -filter "QD < 18.0" -filterName "LowQD" \
        -filter "SOR > 5.8" -filterName "HighSOR" \
        -filter "MQ < 42.0" -filterName "LowMQ" \
        -filter "DP > 3456" -filterName "HighDP" \
        -filter "MQRankSum > 9.7" -filterName "HighMQRS" \
        -filter "MQRankSum < -6.5" -filterName "LowMQRS" \
        --genotypeFilterExpression "DP < 6" --genotypeFilterName "LowFormatDP" \
        --genotypeFilterExpression "GQ < 30" --genotypeFilterName "LowGQ" 

    In the header of my input file the chromosome order is the same as the chromosome order in the .dict file and the fasta file.
    input vcf header:


    Yet I still get this error:

    ERROR MESSAGE: Input files variant and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe.
    ERROR variant contigs = [M76611, PFC10_API_IRAB, Pf3D7_01_v3, Pf3D7_02_v3, Pf3D7_03_v3, Pf3D7_04_v3, Pf3D7_05_v3, Pf3D7_06_v3, Pf3D7_07_v3, Pf3D7_08_v3, Pf3D7_09_v3, Pf3D7_10_v3, Pf3D7_11_v3, Pf3D7_12_v3, Pf3D7_13_v3, Pf3D7_14_v3]
    ERROR reference contigs = [Pf3D7_04_v3, Pf3D7_05_v3, Pf3D7_02_v3, Pf3D7_09_v3, Pf3D7_12_v3, Pf3D7_06_v3, Pf3D7_14_v3, Pf3D7_03_v3, Pf3D7_07_v3, Pf3D7_13_v3, Pf3D7_08_v3, Pf3D7_01_v3, Pf3D7_10_v3, Pf3D7_11_v3, M76611, PFC10_API_IRAB]

    It's saying the variant contigs are in an order they are not in. Any ideas?


  • Sheila (Broad Institute), Member, Broadie, Moderator


    It looks like your VCF is not sorted in the same order as the reference. You should be able to use Picard's SortVcf to fix this.
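
The kind of check behind the "relative ordering of overlapping contigs" message can be approximated as follows. This Python sketch illustrates the idea and is not GATK's implementation. The function name `same_relative_order` is made up, and the contig lists are shortened excerpts of the two lists in the error message above.

```python
def same_relative_order(variant_contigs, reference_contigs):
    """True if the contigs shared by both lists appear in the same order in each."""
    shared = set(variant_contigs) & set(reference_contigs)
    in_variant = [c for c in variant_contigs if c in shared]
    in_reference = [c for c in reference_contigs if c in shared]
    return in_variant == in_reference


variant = ["M76611", "PFC10_API_IRAB", "Pf3D7_01_v3", "Pf3D7_02_v3"]
reference = ["Pf3D7_02_v3", "Pf3D7_01_v3", "M76611", "PFC10_API_IRAB"]
print(same_relative_order(variant, reference))  # False
print(same_relative_order(variant, variant))    # True
```

The contig lists GATK compares come from what it reads alongside the files (headers, sequence dictionary, index), so a mismatch can be reported even when the records themselves look sorted.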


  • eflannery (San Diego), Member

    @Sheila Hi Sheila, I did use SortVcf to make the vcf file and I used the same .dict file as the reference. You can see in the .vcf header that the file is sorted in that order, and indeed the variants are in that order. The error does not say the correct order that the .vcf is actually sorted in.


  • pdexheimer, Member, Dev

    @eflannery - What happens when you delete the vcf.idx file?

  • eflannery (San Diego), Member

    Thank-you @pdexheimer, that was the problem!

  • everestial007 (Greensboro), Member

    I am having a problem while doing select variants.

    • GATK (HaplotypeCaller) was used to generate vcf for several samples jointly.
    • The vcf file was given to the phASER program along with generated vcf to do phasing. The phased vcf has extra fields and I don't see any other major changes.
    • Now, I want to select individual variants (to separate them) for downstream analyses, but I am getting the following error:

    $ java -jar /home/everestial007/GenomeAnalysisTK-3.6/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V phaser_2ms01e_test04.vcf -sn 2ms02g -o phaser_2ms02g_only_test01.vcf
    INFO 11:18:13,845 HelpFormatter - --------------------------------------------------------------------------------
    INFO 11:18:13,847 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.6-0-g89b7209, Compiled 2016/06/01 22:27:29
    INFO 11:18:13,847 HelpFormatter - Copyright (c) 2010-2016 The Broad Institute
    INFO 11:18:13,847 HelpFormatter - For support and documentation go to
    INFO 11:18:13,847 HelpFormatter - [Thu Aug 25 11:18:13 EDT 2016] Executing on Linux 4.4.0-34-generic amd64
    INFO 11:18:13,847 HelpFormatter - OpenJDK 64-Bit Server VM 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14 JdkDeflater
    INFO 11:18:13,850 HelpFormatter - Program Args: -T SelectVariants -R lyrata_genome.fa -V phaser_2ms01e_test04.vcf -sn 2ms02g -o phaser_2ms02g_only_test01.vcf
    INFO 11:18:13,854 HelpFormatter - Executing as everestial007@everestial007-Inspiron-3647 on Linux 4.4.0-34-generic amd64; OpenJDK 64-Bit Server VM 1.8.0_91-8u91-b14-3ubuntu1~16.04.1-b14.
    INFO 11:18:13,855 HelpFormatter - Date/Time: 2016/08/25 11:18:13
    INFO 11:18:13,855 HelpFormatter - --------------------------------------------------------------------------------
    INFO 11:18:13,855 HelpFormatter - --------------------------------------------------------------------------------
    INFO 11:18:13,870 GenomeAnalysisEngine - Strictness is SILENT
    INFO 11:18:14,202 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
    INFO 11:18:14,409 GenomeAnalysisEngine - Preparing for traversal
    INFO 11:18:14,412 GenomeAnalysisEngine - Done preparing for traversal
    INFO 11:18:14,413 ProgressMeter - | processed | time | per 1M | | total | remaining
    INFO 11:18:14,413 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
    INFO 11:18:14,418 SelectVariants - Including sample '2ms02g'

    ERROR --
    ERROR stack trace

    java.lang.NumberFormatException: For input string: ""
    at java.lang.NumberFormatException.forInputString(
    at java.lang.Integer.parseInt(
    at java.lang.Integer.valueOf(
    at htsjdk.variant.vcf.AbstractVCFCodec.createGenotypeMap(
    at htsjdk.variant.vcf.AbstractVCFCodec$LazyVCFGenotypesParser.parse(
    at htsjdk.variant.variantcontext.LazyGenotypesContext.decode(
    at htsjdk.variant.variantcontext.LazyGenotypesContext.ensureSampleNameMap(
    at htsjdk.variant.variantcontext.GenotypesContext.getSampleNames(
    at htsjdk.variant.variantcontext.VariantContext.getSampleNames(
    at htsjdk.variant.variantcontext.VariantContext.subContextFromSamples(
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano$TraverseLociMap.apply(
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.executeSingleThreaded(
    at org.broadinstitute.gatk.utils.nanoScheduler.NanoScheduler.execute(
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(
    at org.broadinstitute.gatk.engine.traversals.TraverseLociNano.traverse(
    at org.broadinstitute.gatk.engine.executive.LinearMicroScheduler.execute(
    at org.broadinstitute.gatk.engine.GenomeAnalysisEngine.execute(
    at org.broadinstitute.gatk.engine.CommandLineExecutable.execute(
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(
    at org.broadinstitute.gatk.utils.commandline.CommandLineProgram.start(
    at org.broadinstitute.gatk.engine.CommandLineGATK.main(

    ERROR ------------------------------------------------------------------------------------------
    ERROR A GATK RUNTIME ERROR has occurred (version 3.6-0-g89b7209):
    ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
    ERROR If not, please post the error message, with stack trace, to the GATK forum.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions
    ERROR MESSAGE: For input string: ""
    ERROR ------------------------------------------------------------------------------------------
    • The error message in the seventh-from-last line suggests that it may be a bug, but I think it is more likely a formatting issue. I think there is something wrong with the VCF file, but I am not sure what it is.


  • pdexheimer, Member, Dev

    @everestial007 -

    Based purely on the stack trace, I would guess that one of the genotypes in your file is blank. There's a ValidateVariants program somewhere, can't remember if it's GATK or Picard, that might shed more light. I wonder if your phasing program dropped single alleles that it didn't trust, resulting in a GT of something like "0/" or "/1". I think I remember seeing that in the spec once as a valid construct, but I wouldn't be too surprised if it's not handled correctly by GATK
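
If the guess above is right, the offending genotypes can be found by scanning the VCF for GT values with an empty allele. The Python sketch below shows one way to do that. The function name `find_blank_alleles` and the example records are hypothetical, and the sketch assumes a tab-delimited VCF with FORMAT and sample columns.

```python
import re


def find_blank_alleles(vcf_lines):
    """Return (line_number, sample_number, genotype) for GTs with an empty allele."""
    problems = []
    for line_number, line in enumerate(vcf_lines, start=1):
        text = line.rstrip("\n")
        if not text or text.startswith("#"):
            continue
        fields = text.split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this line
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue
        gt_pos = format_keys.index("GT")
        for sample_number, sample in enumerate(fields[9:], start=1):
            parts = sample.split(":")
            if gt_pos >= len(parts):
                continue
            gt = parts[gt_pos]
            # "0/" splits into ["0", ""]; the empty string is the blank allele.
            if any(allele == "" for allele in re.split(r"[/|]", gt)):
                problems.append((line_number, sample_number, gt))
    return problems


example = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2",
    "1\t100\t.\tA\tG\t50\tPASS\t.\tGT:DP\t0/1:12\t0/:7",
    "1\t200\t.\tC\tT\t50\tPASS\t.\tGT:DP\t1|1:9\t./.:.",
]
print(find_blank_alleles(example))  # [(2, 2, '0/')]
```

A regular no-call such as ./. is not reported, since each allele is written as a dot rather than left empty.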

  • shlee (Cambridge), Member, Broadie, Moderator
  • Hi,
    I'm trying to select a set of variants present in a subset.vcf file in a master.vcf file. Basically I want to create a vcf file from the master.vcf containing only those calls present in subset.vcf.
    Could SelectVariants --concordance be of help in this case?

    Many thanks for your help!


  • Sheila (Broad Institute), Member, Broadie, Moderator

    Hi Angelica,

    Yes, that is correct. Have a look at the tool doc for an example use case.

