It looks like you're new here. If you want to get involved, click one of these buttons!
This tool combines VCF records from different sources. Any (unique) name can be used to bind your rod data and any number of sources can be input. This tool currently supports two different combination types for each of variants (the first 8 fields of the VCF) and genotypes (the rest)
For a complete, detailed argument reference, refer to the GATK document page here.
CombineVariants will include a record at every site in all of your input VCF files, and annotate which input ROD bindings the record is present, pass, or filtered in in the set attribute in the INFO field (see below). In effect, CombineVariants always produces a union of the input VCFs. However, any part of the Venn of the N merged VCFs can be exacted using JEXL expressions on the set attribute using SelectVariants. If you want to extract just the records in common between two VCFs, you would first CombineVariants the two files into a single VCF, and then run SelectVariants to extract the common records with -select 'set == "Intersection"', as worked out in the detailed example below.
The -filteredRecordsMergeType argument determines how CombineVariants handles sites where a record is present in multiple VCFs, but it is filtered in some and unfiltered in others, as described in the Tech Doc page for the tool.
The set INFO field indicates which call set the variant was found in. It can take on a variety of values indicating the exact nature of the overlap between the call sets. Note that the values are generalized for multi-way combinations, but here we describe only the values for 2 call sets being combined.
set=Intersection : occurred in both call sets, not filtered out
set=NAME : occurred in the call set NAME only
set=NAME1-filteredInNAME : occurred in both call sets, but was not filtered in NAME1 but was filtered in NAME2
set=filteredInAll : occurred in both call sets, but was filtered out of both
For three or more call sets combinations, you can see records like NAME1-NAME2 indicating a variant occurred in both NAME1 and NAME2 but not all sets.
You can use -setKey foo to change the set=XXX tag to foo=XXX in your output. Additionally, -setKey null stops the set tag=value pair from being emitted at all.
Add the -minimalVCF argument to CombineVariants if you want to eliminate unnecessary information from the INFO field and genotypes. The only fields emitted will be GT:GQ for genotypes and the keySet for INFO
An even more extreme output format is -sites_only, a general engine capability, where the genotypes for all samples are completely stripped away from the output format. Enabling this option results in a significant performance speedup as well.
Add the -minN (or --minimumN) command, followed by an integer if you want to only output records present in at least N input files. Useful, for example in combining several data sets where we only want to keep sites present in for example at least 2 of them (in which case -minN 2 should be added to the command line).
In the following example, we use CombineVariants and SelectVariants to obtain only the sites in common between the OMNI 2.5M and HapMap3 sites in the GSA bundle.
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R bundle/b37/human_g1k_v37.fasta -L 1:1-1,000,000 -V:omni bundle/b37/1000G_omni2.5.b37.sites.vcf -V:hm3 bundle/b37/hapmap_3.3.b37.sites.vcf -o union.vcf
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T SelectVariants -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -L 1:1-1,000,000 -V:variant union.vcf -select 'set == ";Intersection";' -o intersect.vcf
This results in two vcf files, which look like:
==> union.vcf <==
1 990839 SNP1-980702 C T . PASS AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1 990882 SNP1-980745 C T . PASS CR=99.79873;GentrainScore=0.7403;HW=0.005225421;set=omni
1 990984 SNP1-980847 G A . PASS CR=99.76005;GentrainScore=0.8406;HW=0.26163524;set=omni
1 992265 SNP1-982128 C T . PASS CR=100.0;GentrainScore=0.7412;HW=0.0025895447;set=omni
1 992819 SNP1-982682 G A . id50 CR=99.72961;GentrainScore=0.8505;HW=4.811053E-17;set=FilteredInAll
1 993987 SNP1-983850 T C . PASS CR=99.85935;GentrainScore=0.8336;HW=9.959717E-28;set=omni
1 994391 rs2488991 G T . PASS AC=1936;AF=0.69341;AN=2792;CR=99.89378;GentrainScore=0.7330;HW=1.1741E-41;set=filterInomni-hm3
1 996184 SNP1-986047 G A . PASS CR=99.932205;GentrainScore=0.8216;HW=3.8830226E-6;set=omni
1 998395 rs7526076 A G . PASS AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
1 999649 SNP1-989512 G A . PASS CR=99.93262;GentrainScore=0.7965;HW=4.9767335E-4;set=omni
==> intersect.vcf <==
1 950243 SNP1-940106 A C . PASS AC=826;AF=0.29993;AN=2754;CR=97.341675;GentrainScore=0.7311;HW=0.15148845;set=Intersection
1 957640 rs6657048 C T . PASS AC=127;AF=0.04552;AN=2790;CR=99.86667;GentrainScore=0.6806;HW=2.286109E-4;set=Intersection
1 959842 rs2710888 C T . PASS AC=654;AF=0.23559;AN=2776;CR=99.849;GentrainScore=0.8072;HW=0.17526293;set=Intersection
1 977780 rs2710875 C T . PASS AC=1989;AF=0.71341;AN=2788;CR=99.89077;GentrainScore=0.7875;HW=2.9912625E-32;set=Intersection
1 985900 SNP1-975763 C T . PASS AC=182;AF=0.06528;AN=2788;CR=99.79926;GentrainScore=0.8374;HW=0.017794203;set=Intersection
1 987200 SNP1-977063 C T . PASS AC=1956;AF=0.70007;AN=2794;CR=99.45917;GentrainScore=0.7914;HW=1.413E-42;set=Intersection
1 987670 SNP1-977533 T G . PASS AC=2485;AF=0.89196;AN=2786;CR=99.51427;GentrainScore=0.7005;HW=0.24214932;set=Intersection
1 990417 rs2465136 T C . PASS AC=1113;AF=0.40007;AN=2782;CR=99.7599;GentrainScore=0.8750;HW=8.595538E-5;set=Intersection
1 990839 SNP1-980702 C T . PASS AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1 998395 rs7526076 A G . PASS AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
Comments
I am trying to use a variety of the GATK variant validation tools including CombineVariants and SelectVariants. However, I have been unable to get either tool to work. I have posted the first few lines of one of my VCFs below:
CHROM POS ID REF ALT QUAL FILTER INFO
chr1 762084 . T C . . . chr1 762136 . A C . . . chr1 762189 . A C . . . chr1 762192 . T C . . . chr1 762195 . C A . . . chr1 762196 . T C . . .
As you can see, all I care about is the exact chromosomal location (based on the hg19 ref). I want to start by merging two different VCFs using the following code:
java -jar GenomeAnalysisTK.jar -T CombineVariants -R hg19.fasta -V:ABC,abc.vcf -V:XYZ,xyz.vcf -o merged_abc_xyz.vcf -minimalVCF
After, I will be extracting variants that intersect both sets using SelectVariants.
The program completes without error, however, my merged VCF output does not populate with any data beyond the header (ie. no variants are listed). The program only runs for under a second and the completes and shuts off. What could be my issues? Please help. Thanks!!
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •What is the output to the console? Is there an error or other message?
Geraldine Van der Auwera, PhD
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •are abc.vcf and xyz.vcf well formed? How were they generated?
Mauricio Carneiro, PhD http://www.broadinstitute.org/~carneiro/
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Hi Geraldine and Carneiro,
Thanks for the responses. It turns out, the VCF files were slightly malformed on a couple of lines. They were not generated using a GATK variant caller which is probably why CombineVariants wasn't working. For now, I think I have everything under control after some manual reformatting.
Will get back to you if anything else comes up.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •It is a nice tool, thank you. Suppose I have a vcf file containing the genotype of 5 samples, and want to find variants occurring in at lease 3 of them. To get the "set=" attribute, do I need to split the vcf file into 5 ones first, by using SelectVariant, and then merge the 5 vcfs again using CombineVariants? Is there a better way to do that?
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •@ecyeh use SelectVariants to find the variants on at least 3 of them. If you only have 5 samples, there aren't too many combinations. If you had more samples than that, I'd write a walker to do it.
Mauricio Carneiro, PhD http://www.broadinstitute.org/~carneiro/
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •Indeed I have more than 5 samples to compare. Thank you for the clarification. Now I feel so lucky to be a programmer.
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •fantastic, go ahead and write a quick RODWalker to solve that one. You can base yourself on SelectVariants.
Mauricio Carneiro, PhD http://www.broadinstitute.org/~carneiro/
- Spam
- Abuse
- Troll
0 • Off Topic Disagree Agree Like WTF •