GATK licensing moves to direct-through-Broad model -- read about it on the GATK blog

Selecting variants of interest from a callset

delangeldelangel Posts: 71GATK Dev mod

This document describes why you might want to extract a subset of variants from a callset and how you would achieve this.


Often, a VCF containing many samples and/or variants will need to be subset in order to facilitate certain analyses (e.g. comparing and contrasting cases vs. controls; extracting variant or non-variant loci that meet certain requirements, displaying just a few samples in a browser like IGV, etc.). The GATK tool that we use the most for subsetting calls in various ways is SelectVariants; it enables easy and convenient subsetting of VCF files according to many criteria.

Select Variants operates on VCF files (also sometimes referred to as ROD in our documentation, for Reference Ordered Data) provided at the command line using the GATK's built in --variant option. You can provide multiple VCF files for Select Variants, but at least one must be named 'variant' and this will be the file (or set of files) from which variants will be selected. Other files can be used to modify the selection based on concordance or discordance between the callsets (see --discordance / --concordance arguments in the tool documentation).

There are many options for setting the selection criteria, depending on what you want to achieve. For example, given a single VCF file, one or more samples can be extracted from the file, based either on a complete sample name, or on a pattern match. Variants can also be selected based on annotated properties, such as depth of coverage or allele frequency. This is done using JEXL expressions; make sure to read the linked document for details, especially the section on working with complex expressions.

Note that in the output VCF, some annotations such as AN (number of alleles), AC (allele count), AF (allele frequency), and DP (depth of coverage) are recalculated as appropriate to accurately reflect the composition of the subset callset. See further below for an explanation of how that works.


Command-line arguments

For a complete, detailed argument reference, refer to the GATK document page here.


Subsetting by sample and ALT alleles

SelectVariants now keeps (r5832) the alt allele, even if a record is AC=0 after subsetting the site down to selected samples. For example, when selecting down to just sample NA12878 from the OMNI VCF in 1000G (1525 samples), the resulting VCF will look like:

1       82154   rs4477212       A       G       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       T       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

Although NA12878 is 0/0 at the first sites, ALT allele is preserved in the VCF record. This is the correct behavior, as reducing samples down shouldn't change the character of the site, only the AC in the subpopulation. This is related to the tricky issue of isPolymorphic() vs. isVariant().

  • isVariant => is there an ALT allele?

  • isPolymorphic => is some sample non-ref in the samples?

For clarity, in previous versions of SelectVariants, the first two monomorphic sites lose the ALT allele, because NA12878 is hom-ref at this site, resulting in VCF that looks like:

1       82154   rs4477212       A       .       .       PASS    AC=0;AF=0.00;AN=2;CR=100.0;DP=0;GentrainScore=0.7826;HW=1.0     GT:GC   0/0:0.7205
1       534247  SNP1-524110     C       .       .       PASS    AC=0;AF=0.00;AN=2;CR=99.93414;DP=0;GentrainScore=0.7423;HW=1.0  GT:GC   0/0:0.6491
1       565286  SNP1-555149     C       T       .       PASS    AC=2;AF=1.00;AN=2;CR=98.8266;DP=0;GentrainScore=0.7029;HW=1.0   GT:GC   1/1:0.3471
1       569624  SNP1-559487     T       C       .       PASS    AC=2;AF=1.00;AN=2;CR=97.8022;DP=0;GentrainScore=0.8070;HW=1.0   GT:GC   1/1:0.3942

If you really want a VCF without monomorphic sites, use the option to drop monomorphic sites after subsetting.


How do the AC, AF, AN, and DP fields change?

Let's say you have a file with three samples. The numbers before the ":" will be the genotype (0/0 is hom-ref, 0/1 is het, and 1/1 is hom-var), and the number after will be the depth of coverage.

BOB        MARY        LINDA
1/0:20     0/0:30      1/1:50

In this case, the INFO field will say AN=6, AC=3, AF=0.5, and DP=100 (in practice, I think these numbers won't necessarily add up perfectly because of some read filters we apply when calling, but it's approximately right).

Now imagine I only want a file with the samples "BOB" and "MARY". The new file would look like:

BOB        MARY
1/0:20     0/0:30

The INFO field will now have to change to reflect the state of the new data. It will be AN=4, AC=1, AF=0.25, DP=50.

Let's pretend that MARY's genotype wasn't 0/0, but was instead "./." (no genotype could be ascertained). This would look like

BOB        MARY
1/0:20     ./.:.

with AN=2, AC=1, AF=0.5, and DP=20.


Additional information

For information on how to construct regular expressions for use with this tool, see the method article on variant filtering with JEXL, or "Summary of regular-expression constructs" section here for more hardcore reading.

Post edited by Geraldine_VdAuwera on

Issue · Github
by Geraldine_VdAuwera

Issue Number
31
State
open
Last Updated

Comments

  • vyellapavyellapa Posts: 31Member

    I'm trying to find the concordant variants between two vcf's where each were generated with 2 different reference genomes(ucsc and 1000 genomes) using the --concordance option. Since ucsc genome has the chromosomes prefixed by "chr", how can I find the concordant calls. I was wondering if removing the "chr" would do or would the header have to be replaced too somehow.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    You should be very careful when comparing variants called with different references because the references don't just differ by their contig names, there are also some differences in the sequence. The safest way to do this is to "lift over" variants. See this document for more details on how to do this:

    http://www.broadinstitute.org/gatk/guide/article?id=63

    Geraldine Van der Auwera, PhD

  • vyellapavyellapa Posts: 31Member

    Thank you. I checked the 2.3-9 version of the GATK2 download and the resource bundle at ftp.broadinstitute.org for the "liftOverVCF.pl" script.
    Where would this be?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    You need to download it from the source code repo:

    https://github.com/broadgsa/gatk/tree/master/public/perl

    Geraldine Van der Auwera, PhD

  • golharamgolharam Posts: 27Member ✭✭

    I'm trying to debug why my JEXL select statement is not working. When invoking GATK from the shell, I get this error:

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.3-9-gdcdccbb):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Invalid command line: Invalid JEXL expression detected for select-0 with message ![1,43]: '(vc.getAttribute('1000g2012apr_ALL') < 0.01);' < error
    ERROR ------------------------------------------------------------------------------------------

    I have no idea what the actual error is based on this output. When I debug GATK in Eclipse, I get a different error output:

    ERROR MESSAGE: Invalid argument value '<' at position 10.
    ERROR Invalid argument value '0.01)'' at position 11.

    This is much more helpful at least point to me to where in the select statement the problem is. Is there any way to get this more descriptive error when I invoke GATK from the shell?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    You can try using -l DEBUG to get more detailed information.

    Geraldine Van der Auwera, PhD

  • ArtemPankinArtemPankin Posts: 9Member

    I am trying to extract variants from a two-genotype VCF, which are polymorphic between genotypes. Is there a way to do it with SelectVariants or any other GATK tool?

  • CarneiroCarneiro Posts: 274Administrator, GATK Dev admin

    what do you mean by two-genotype VCF? two samples?

  • ArtemPankinArtemPankin Posts: 9Member

    I am sorry for the confusion. Yes, it should read "two-sample".

  • CarneiroCarneiro Posts: 274Administrator, GATK Dev admin

    Yup, SelectVariants allows you to select by sample. Take a look at the documentation.

  • ArtemPankinArtemPankin Posts: 9Member

    Thank you for the link. I read it through and could not find the answer. I have got a vcf file with SNPs called for two different samples. I want to select only those SNPs, which distinguish between these two samples, e.g. sample1: 1/1; sample 2: 0/0.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    Hi @ArtemPankin,

    I think you'll need to compose a JEXL expression to do what you want. See this documentation page.

    Geraldine Van der Auwera, PhD

  • JackJack Posts: 36Member

    Hi, there. I have 10 vcf result files from 10 samples, I want to extract the concordance of the variants of the 10 samples, can SelectVariants help me do this ? Can the argument "--variants" be used more than once?

  • JackJack Posts: 36Member

    Can I use the "--concordance" argument to select the intersect of the variants of the 10 samples?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    Hi Jack,

    It is easier to first merge the VCFs (using CombineVariants) then use VariantEval to look at the intersection as described here: http://www.broadinstitute.org/gatk/guide/article?id=48

    Geraldine Van der Auwera, PhD

  • geertvandeweyergeertvandeweyer Posts: 11Member

    Hi,

    I'm trying to subset a multisample vcf file to single sample files. I use the following commandline to extract variant locations from a VQSR recalibrated file , both for snp and indel:

    java -Xmx1500m -jar ../../binaries/gatk/2.4.9-g532efad/bin/GenomeAnalysisTK.jar -T SelectVariants -R ../../References/hg19/samtools/0.1.19/hg19.fasta -V UnifiedGenotyper.multi.vcf -o single.sample.vcf -sn 'lq-hl-85-c4-01' -nt 4 -env

    most locations work fine, but there seems to be an issue with multiallelic indel locations.

    for example: the multi-sample input file (trimmed down):

    chr1 7828173 . TAAAAA T,TAAA,TAA,TAAAA 2156.23 PASS AC=12,13,17,12;AF=0.188,0.203,0.266,0.188;AN=64;BaseQRankSum=0.754;DP=729;FS=18.902;InbreedingCoeff=0.7206;MLEAC=9,12,11,12;MLEAF=0.141,0.188,0.172,0.188;MQ=59.06;MQ0=0;MQRankSum=2.746;QD=1.09;RPA=18,13,16,15,17;RU=A;ReadPosRankSum=4.206;STR;VQSLOD=0.578;culprit=FS GT:AD:DP:GQ:PL 1/1:0,6,0,0,0:27:50:982,51,0,711,51,664,552,50,545,503,864,51,705,551,823 1/4:0,3,1,1,0:59:29:750,29,448,521,76,534,376,165,432,498,626,0,493,352,587 0/3:0,0,3,0,0:14:3:118,129,368,72,130,104,0,78,45,51,78,115,74,3,91 ...(20 more)

    when I select the third sample, the result is :

    chr1 7828173 . TAAAAA TAA 2156.23 PASS AC=1;AF=0.500;AN=2;BaseQRankSum=0.754;DP=14;FS=18.902;InbreedingCoeff=0.7206;MQ=59.06;MQ0=0;MQRankSum=2.746;QD=1.09;RPA=18,13,16,15,17;RU=A;ReadPosRankSum=4.206;STR;VQSLOD=0.578;culprit=FS GT:DP:GQ 0/1:14:3

    This seems incorrect in two ways:
    - original : homozygous, third allele ; selectvariants : heterozygous (allele is correct)
    - selectVariant file is missing AD:GQ:PL fields.

    Is this default behaviour for some reason?

    Best,

    Geert

  • ebanksebanks Broad InstitutePosts: 684Member, Administrator, GATK Dev, Broadie, Moderator, DSDE Member, GP Member admin

    Hi Geert,

    Your original VCF shows the 3rd sample as being heterozygous (0/3) so the "selected" VCF looks perfectly fine to me. The other tags are removed when you select down from many alternate alleles to just one for technical reasons, as discussed in other threads on this forum.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • geertvandeweyergeertvandeweyer Posts: 11Member

    oops, my mistake. It is indeed heterozygous. In the meantime, I've found some posts about the removed tags. Thanks for the pointer.

  • blueskypyblueskypy Posts: 229Member

    I'd like to subset 500 samples from a vcf file of 2000 samples, do I have to list -sn 500 times since there isn't a consist pattern in the names of the 500 samples?

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    Hi @blueskypy‌,

    If there's no consistent pattern in the names, I can't think of any way to do this without listing -sn 500 times, sorry. Someone else may be aware of a trick to do this more easily but I don't know of any.

    Geraldine Van der Auwera, PhD

  • pdexheimerpdexheimer Posts: 463Member, GATK Dev, DSDE Member mod

    @blueskypy‌ - You could write the sample names to a file and use the --sample_file/-sf argument to SelectVariants

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    Ah, there you go! @pdexheimer to the rescue as usual :)

    Geraldine Van der Auwera, PhD

  • jhl667jhl667 OregonPosts: 3Member

    Using v3.1, I'm getting the following error:

    Line 581694: there aren't enough columns for line Total compute time in PairHMM computeLikelihoods() : 0.0 (we expected 9 tokens, and saw 1 )

    My command line is:

    java -jar -Xmx2g $GATK -R $REF -T SelectVariants -V $PROJ/output/$CHROM.vcf -selectType SNP -o $PROJ/snps/$CHROM_snps.vcf

    I've not run in to this problem before, so am at a loss for what I may be overlooking.

    Thanks!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    @jhl667 I believe that was a bug in 3.2 (where log output ended up in the data output stream) that has been fixed since.

    Geraldine Van der Auwera, PhD

  • jhl667jhl667 OregonPosts: 3Member

    @Geraldine_VdAuwera‌ I suspected that might be the case, so I went back and tried the same command line with v3.2-2. I am seeing the same result.

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,781Administrator, GATK Dev admin

    The problem is in your VCF file, so you need to regenerate it (or fix it) to allow SelectVariants to work properly.

    Geraldine Van der Auwera, PhD

  • jhl667jhl667 OregonPosts: 3Member

    @Geraldine_VdAuwera‌ Alright, I will keep working on it. Thanks.

  • eflanneryeflannery San DiegoPosts: 7Member
    edited April 20

    Hi, I am using the following command line to run SelectVariants:

    java -Xmx4g -jar ~/bin/GenomeAnalysisTK-3.3-0/GenomeAnalysisTK.jar \
        -R ~/Dropbox/Genomes/PlasmoDB-13.0_Pfalciparum3D7_Genome.fasta \
        -T VariantFiltration \
        --variant ${input} \
        -o ${input}_variantFiltration.vcf \
        -filter "ReadPosRankSum > 12.4" -filterName "HighRPRS" \
        -filter "ReadPosRankSum < -9.4" -filterName "LowRPRS" \
        -filter "QD < 18.0" -filterName "LowQD" \
        -filter "SOR > 5.8" -filterName "HighSOR" \
        -filter "MQ < 42.0" -filterName "LowMQ" \
        -filter "DP > 3456" -filterName "HighDP" \
        -filter "MQRankSum > 9.7" -filterName "HighMQRS" \
        -filter "MQRankSum < -6.5" -filterName "LowMQRS" \
        --genotypeFilterExpression "DP < 6" --genotypeFilterName "LowFormatDP" \
        --genotypeFilterExpression "GQ < 30" --genotypeFilterName "LowGQ" 
    

    In the header of my input file the chromosome order is the same as the chromosome order in the .dict file and the fasta file.
    input vcf header:

    ##contig=<ID=Pf3D7_04_v3,length=1200490>
    ##contig=<ID=Pf3D7_05_v3,length=1343557>
    ##contig=<ID=Pf3D7_02_v3,length=947102>
    ##contig=<ID=Pf3D7_09_v3,length=1541735>
    ##contig=<ID=Pf3D7_12_v3,length=2271494>
    ##contig=<ID=Pf3D7_06_v3,length=1418242>
    ##contig=<ID=Pf3D7_14_v3,length=3291936>
    ##contig=<ID=Pf3D7_03_v3,length=1067971>
    ##contig=<ID=Pf3D7_07_v3,length=1445207>
    ##contig=<ID=Pf3D7_13_v3,length=2925236>
    ##contig=<ID=Pf3D7_08_v3,length=1472805>
    ##contig=<ID=Pf3D7_01_v3,length=640851>
    ##contig=<ID=Pf3D7_10_v3,length=1687656>
    ##contig=<ID=Pf3D7_11_v3,length=2038340>
    ##contig=<ID=M76611,length=5967>
    ##contig=<ID=PFC10_API_IRAB,length=34242>
    ##reference=file:///home/eflannery/Genomes/PlasmoDB-13.0_Pfalciparum3D7_Genome.fasta
    

    Yet I still get this error:

    ERROR MESSAGE: Input files variant and reference have incompatible contigs: Relative ordering of overlapping contigs differs, which is unsafe.
    ERROR variant contigs = [M76611, PFC10_API_IRAB, Pf3D7_01_v3, Pf3D7_02_v3, Pf3D7_03_v3, Pf3D7_04_v3, Pf3D7_05_v3, Pf3D7_06_v3, Pf3D7_07_v3, Pf3D7_08_v3, Pf3D7_09_v3, Pf3D7_10_v3, Pf3D7_11_v3, Pf3D7_12_v3, Pf3D7_13_v3, Pf3D7_14_v3]
    ERROR reference contigs = [Pf3D7_04_v3, Pf3D7_05_v3, Pf3D7_02_v3, Pf3D7_09_v3, Pf3D7_12_v3, Pf3D7_06_v3, Pf3D7_14_v3, Pf3D7_03_v3, Pf3D7_07_v3, Pf3D7_13_v3, Pf3D7_08_v3, Pf3D7_01_v3, Pf3D7_10_v3, Pf3D7_11_v3, M76611, PFC10_API_IRAB]

    It's saying the variant contigs are in an order they are not in. Any ideas?

    Thanks!!

    Post edited by eflannery on
  • SheilaSheila Broad InstitutePosts: 1,241Member, GATK Dev, Broadie, Moderator, DSDE Member admin

    @eflannery
    Hi,

    It looks likes your vcf is not sorted in the same order as the reference. You should be able to use Picard's SortVcf to fix this. http://broadinstitute.github.io/picard/command-line-overview.html#SortVcf

    -Sheila

  • eflanneryeflannery San DiegoPosts: 7Member

    @Sheila Hi Sheila, I did use SortVcf to make the vcf file and I used the same .dict file as the reference. You can see in the .vcf header that the file is sorted in that order, and indeed the variants are in that order. The error does not say the correct order that the .vcf is actually sorted in.

    Erika

  • pdexheimerpdexheimer Posts: 463Member, GATK Dev, DSDE Member mod

    @eflannery - What happens when you delete the vcf.idx file?

  • eflanneryeflannery San DiegoPosts: 7Member

    Thank-you @pdexheimer, that was the problem!

Sign In or Register to comment.