Using CombineVariants

delangeldelangel Posts: 71GATK Developer mod
edited October 2013 in Methods and Workflows

1. About CombineVariants

This tool combines VCF records from different sources. Any (unique) name can be used to bind your rod data and any number of sources can be input. This tool currently supports two different combination types for each of variants (the first 8 fields of the VCF) and genotypes (the rest)

For a complete, detailed argument reference, refer to the GATK document page here.

2. Logic for merging records across VCFs

CombineVariants will include a record at every site in all of your input VCF files, and annotate which input ROD bindings the record is present, pass, or filtered in in the set attribute in the INFO field (see below). In effect, CombineVariants always produces a union of the input VCFs. However, any part of the Venn of the N merged VCFs can be exacted using JEXL expressions on the set attribute using SelectVariants. If you want to extract just the records in common between two VCFs, you would first CombineVariants the two files into a single VCF, and then run SelectVariants to extract the common records with -select 'set == "Intersection"', as worked out in the detailed example below.

3. Handling PASS/FAIL records at the same site in multiple input files

The -filteredRecordsMergeType argument determines how CombineVariants handles sites where a record is present in multiple VCFs, but it is filtered in some and unfiltered in others, as described in the Tech Doc page for the tool.

4. Understanding the set attribute

The set INFO field indicates which call set the variant was found in. It can take on a variety of values indicating the exact nature of the overlap between the call sets. Note that the values are generalized for multi-way combinations, but here we describe only the values for 2 call sets being combined.

  • set=Intersection : occurred in both call sets, not filtered out

  • set=NAME : occurred in the call set NAME only

  • set=NAME1-filteredInNAME : occurred in both call sets, but was not filtered in NAME1 but was filtered in NAME2

  • set=filteredInAll : occurred in both call sets, but was filtered out of both

For three or more call sets combinations, you can see records like NAME1-NAME2 indicating a variant occurred in both NAME1 and NAME2 but not all sets.

5. Changing the set key

You can use -setKey foo to change the set=XXX tag to foo=XXX in your output. Additionally, -setKey null stops the set tag=value pair from being emitted at all.

6. Minimal VCF output

Add the -minimalVCF argument to CombineVariants if you want to eliminate unnecessary information from the INFO field and genotypes. The only fields emitted will be GT:GQ for genotypes and the keySet for INFO

An even more extreme output format is -sites_only, a general engine capability, where the genotypes for all samples are completely stripped away from the output format. Enabling this option results in a significant performance speedup as well.

7. Combining Variant Calls with a minimum set of input sites

Add the -minN (or --minimumN) command, followed by an integer if you want to only output records present in at least N input files. Useful, for example in combining several data sets where we only want to keep sites present in for example at least 2 of them (in which case -minN 2 should be added to the command line).

8. Example: intersecting two VCFs

In the following example, we use CombineVariants and SelectVariants to obtain only the sites in common between the OMNI 2.5M and HapMap3 sites in the GSA bundle.

java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R bundle/b37/human_g1k_v37.fasta -L 1:1-1,000,000 -V:omni bundle/b37/1000G_omni2.5.b37.sites.vcf -V:hm3 bundle/b37/hapmap_3.3.b37.sites.vcf -o union.vcf
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T SelectVariants -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -L 1:1-1,000,000 -V:variant union.vcf -select 'set == "Intersection";' -o intersect.vcf

This results in two vcf files, which look like:

==> union.vcf <==
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       990882  SNP1-980745     C       T       .       PASS    CR=99.79873;GentrainScore=0.7403;HW=0.005225421;set=omni
1       990984  SNP1-980847     G       A       .       PASS    CR=99.76005;GentrainScore=0.8406;HW=0.26163524;set=omni
1       992265  SNP1-982128     C       T       .       PASS    CR=100.0;GentrainScore=0.7412;HW=0.0025895447;set=omni
1       992819  SNP1-982682     G       A       .       id50    CR=99.72961;GentrainScore=0.8505;HW=4.811053E-17;set=FilteredInAll
1       993987  SNP1-983850     T       C       .       PASS    CR=99.85935;GentrainScore=0.8336;HW=9.959717E-28;set=omni
1       994391  rs2488991       G       T       .       PASS    AC=1936;AF=0.69341;AN=2792;CR=99.89378;GentrainScore=0.7330;HW=1.1741E-41;set=filterInomni-hm3
1       996184  SNP1-986047     G       A       .       PASS    CR=99.932205;GentrainScore=0.8216;HW=3.8830226E-6;set=omni
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
1       999649  SNP1-989512     G       A       .       PASS    CR=99.93262;GentrainScore=0.7965;HW=4.9767335E-4;set=omni

==> intersect.vcf <==
1       950243  SNP1-940106     A       C       .       PASS    AC=826;AF=0.29993;AN=2754;CR=97.341675;GentrainScore=0.7311;HW=0.15148845;set=Intersection
1       957640  rs6657048       C       T       .       PASS    AC=127;AF=0.04552;AN=2790;CR=99.86667;GentrainScore=0.6806;HW=2.286109E-4;set=Intersection
1       959842  rs2710888       C       T       .       PASS    AC=654;AF=0.23559;AN=2776;CR=99.849;GentrainScore=0.8072;HW=0.17526293;set=Intersection
1       977780  rs2710875       C       T       .       PASS    AC=1989;AF=0.71341;AN=2788;CR=99.89077;GentrainScore=0.7875;HW=2.9912625E-32;set=Intersection
1       985900  SNP1-975763     C       T       .       PASS    AC=182;AF=0.06528;AN=2788;CR=99.79926;GentrainScore=0.8374;HW=0.017794203;set=Intersection
1       987200  SNP1-977063     C       T       .       PASS    AC=1956;AF=0.70007;AN=2794;CR=99.45917;GentrainScore=0.7914;HW=1.413E-42;set=Intersection
1       987670  SNP1-977533     T       G       .       PASS    AC=2485;AF=0.89196;AN=2786;CR=99.51427;GentrainScore=0.7005;HW=0.24214932;set=Intersection
1       990417  rs2465136       T       C       .       PASS    AC=1113;AF=0.40007;AN=2782;CR=99.7599;GentrainScore=0.8750;HW=8.595538E-5;set=Intersection
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
Post edited by Geraldine_VdAuwera on

Comments

  • mpviverompvivero Posts: 7Member

    I am trying to use a variety of the GATK variant validation tools including CombineVariants and SelectVariants. However, I have been unable to get either tool to work. I have posted the first few lines of one of my VCFs below:

    CHROM POS ID REF ALT QUAL FILTER INFO

    chr1 762084 . T C . . . chr1 762136 . A C . . . chr1 762189 . A C . . . chr1 762192 . T C . . . chr1 762195 . C A . . . chr1 762196 . T C . . .

    As you can see, all I care about is the exact chromosomal location (based on the hg19 ref). I want to start by merging two different VCFs using the following code:

    java -jar GenomeAnalysisTK.jar -T CombineVariants -R hg19.fasta -V:ABC,abc.vcf -V:XYZ,xyz.vcf -o merged_abc_xyz.vcf -minimalVCF

    After, I will be extracting variants that intersect both sets using SelectVariants.

    The program completes without error, however, my merged VCF output does not populate with any data beyond the header (ie. no variants are listed). The program only runs for under a second and the completes and shuts off. What could be my issues? Please help. Thanks!!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,643Administrator, GATK Developer admin

    What is the output to the console? Is there an error or other message?

    Geraldine Van der Auwera, PhD

  • CarneiroCarneiro Posts: 274Administrator, GATK Developer admin

    are abc.vcf and xyz.vcf well formed? How were they generated?

  • mpviverompvivero Posts: 7Member

    Hi Geraldine and Carneiro,

    Thanks for the responses. It turns out, the VCF files were slightly malformed on a couple of lines. They were not generated using a GATK variant caller which is probably why CombineVariants wasn't working. For now, I think I have everything under control after some manual reformatting.

    Will get back to you if anything else comes up.

  • ecyehecyeh Posts: 8Member

    It is a nice tool, thank you. Suppose I have a vcf file containing the genotype of 5 samples, and want to find variants occurring in at lease 3 of them. To get the "set=" attribute, do I need to split the vcf file into 5 ones first, by using SelectVariant, and then merge the 5 vcfs again using CombineVariants? Is there a better way to do that?

  • CarneiroCarneiro Posts: 274Administrator, GATK Developer admin

    @ecyeh use SelectVariants to find the variants on at least 3 of them. If you only have 5 samples, there aren't too many combinations. If you had more samples than that, I'd write a walker to do it.

  • ecyehecyeh Posts: 8Member

    Indeed I have more than 5 samples to compare. Thank you for the clarification. Now I feel so lucky to be a programmer.

  • CarneiroCarneiro Posts: 274Administrator, GATK Developer admin

    fantastic, go ahead and write a quick RODWalker to solve that one. You can base yourself on SelectVariants.

  • chongmchongm Posts: 33Member

    Hi, I'm combining different samples (each VCF has a "batch" of samples that were processed together and I just wanted to have my variants in one VCF for convenient's sake - i.e. I'm not combining variants in order to merge samples that were spread out on different runs). I was wondering why the PL changes for each genotype when I'm simply putting all my variants in one place.

    Thanks,

    MC

  • weberATillinoiseduweberATillinoisedu Posts: 4Member
    edited October 2013

    I think that there is a typo in Example 8. "-select 'set == ";Intersection";'" should lose that first semi-colon. With the semicolon, I get 0 VCF records produced. When I use "grep" there are 2300+ in my union.vcf data set. After removal, the resulting intersection.vcf file contains the expected number of records. I expect that because ";" is the separator char, it shouldn't be in the pattern. This was tested using v2.7-2 and v2.7-4.

    I hope this helps anyone else who was tripped up by it.

    Post edited by weberATillinoisedu on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,643Administrator, GATK Developer admin

    Hi @chongm,

    First, be careful when you merge VCFs containing variants that were called separately. If you're doing it to compare results of different calling iterations, evaluate concordance etc, that's perfectly fine. But combining them for convenience is dangerous because if later, you want to filter them, there are some annotations that you shouldn't use to filter them together, because the values are relative, not absolute, and will not be on the same scale between different sets. I'm not saying you shouldn't do it, but if you do it, you should be careful.

    The PLs changing is a known issue that occurs when combining variants with more than one alternate allele. CombineVariants currently does not handle multiallelic variants well.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,643Administrator, GATK Developer admin

    @weberATillinoisedu, you're absolutely correct. I will fix the typo in the article. Thanks for pointing it out!

    Geraldine Van der Auwera, PhD

  • chongmchongm Posts: 33Member

    Hi @Geraldine_VdAuwera, okay thanks for the warning. I will filter each VCF separately then. Once all batches are complete though, I'm going to call variants simultaneously for all batches which should result in one VCF. This is the best way to call variants right? Will some less frequent variants be filtered out if I call them using the entire cohort of samples vs. run by run?

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 6,643Administrator, GATK Developer admin

    Yes, joint calling followed by VQSR (variant recalibration) is indeed what we recommend. The process will increase your discovery power for difficult variants, but should not negatively affect the calling of rare variants.

    Geraldine Van der Auwera, PhD

  • evakoeevakoe Posts: 24Member

    It would be great if you could extend the -setKey argument so that one can not only specify the key, but also the values. If I combine three VCF files, let's say a.vcf, b.vcf, and c.vcf, the INFO field might look like set=variant-variant1. I assume "variant" corresponds to the first file given with -V and "variant1" to the second file given with -V, but of course to help me still know this in 1 month it would be great if I could specify other strings for "variant" and "variant1". Or to make thinks easier, you could also just take the file prefixes a, b, and c. Thanks. Eva

  • evakoeevakoe Posts: 24Member

    Actually, looking closer at my new VCF file combined from three VCF files, I now see that the set argument has four different values, even though I merged only three VCF files. These values are variant, variant1, variant2, and variant3.

    I used the option -setKey source and what I see now are: source=variant3, source=variant2, source=variant-variant2, but no source=variant1, source=variant or source=variant-variant1. Is it possible that there is a bug with the naming of the sets when it comes to the combinations?

    Thank you Eva

  • ebanksebanks Posts: 684GATK Developer mod

    Hi Eva,

    You just need to name your tracks. E.g. "-V:foo first.vcf -V:bar second.vcf", then the set values will be 'first' and 'second'.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • evakoeevakoe Posts: 24Member

    Hi Eric, thank you very much for this info, I did not know about this option. When I name the input VCFs, I really only observe the three values that I specified.

    If you want to take a closer look at why there are four different values when you don't name the inputs I can send you the three input vcfs I used.

Sign In or Register to comment.