Combining variants from different files into one

delangeldelangel Posts: 71GATK Dev mod

Solutions for combining variant callsets depending on purpose

There are three main reasons why you might want to combine variants from different files into one, and the tool to use depends on what you are trying to achieve.

  1. The most common case is when you have been parallelizing your variant calling analyses, e.g. running HaplotypeCaller per-chromosome, producing separate VCF files (or gVCF files) per-chromosome. For that case, you can use a tool called CatVariants to concatenate the files. There are a few important requirements (e.g. the files should contain all the same samples, and distinct intervals) which you can read about on the tool's documentation page.

  2. The second case is when you have been using HaplotypeCaller in -ERC GVCF or -ERC BP_RESOLUTION to call variants on a large cohort, producing many gVCF files. We recommend combining the output gVCF in batches of e.g. 200 before putting them through joint genotyping with GenotypeGVCFs (for performance reasons), which you can do using CombineGVCFs, which is specific for handling gVCF files.

  3. The third case is when you want to combine variant calls that were produced from the same samples but using different methods, for comparison. For example, if you're evaluating variant calls produced by different variant callers, different workflows, or the same but using different parameters. This produces separate callsets for the same samples, which are then easier to compare if you combine them into a single file. For that purpose, you can use CombineVariants, which is capable of merging VCF records intelligently, treating the same samples as separate or not as desired, combining annotations as appropriate. This is the case that requires the most preparation and forethought because there are many options that may be used to adapt the behavior of the tool.

There is also one reason you might want to combine variants from different files into one, that we do not recommend following. That is, if you have produced variant calls from various samples separately, and want to combine them for analysis. This is how people used to do variant analysis on large numbers of samples, but we don't recommend proceeding this way because that workflow suffers from serious methodological flaws. Instead, you should follow our recommendations as laid out in the Best Practices documentation.


Merging records across VCFs with CombineVariants

Here we provide some more information and a worked out example to illustrate the third case because it is less straightforward than the other two.

A key point to understand is that CombineVariants will include a record at every site in all of your input VCF files, and annotate in which input callsets the record is present, pass, or filtered in in the set attribute in the INFO field (see below). In effect, CombineVariants always produces a union of the input VCFs. Any part of the Venn of the N merged VCFs can then be extracted specifically using JEXL expressions on the set attribute using SelectVariants. If you want to extract just the records in common between two VCFs, you would first CombineVariants the two files into a single VCF, and then run SelectVariants to extract the common records with -select 'set == "Intersection"', as worked out in the detailed example below.

Handling PASS/FAIL records at the same site in multiple input files

The -filteredRecordsMergeType argument determines how CombineVariants handles sites where a record is present in multiple VCFs, but it is filtered in some and unfiltered in others, as described in the tool documentation page linked above.

Understanding the set attribute

The set property of the INFO field indicates which call set the variant was found in. It can take on a variety of values indicating the exact nature of the overlap between the call sets. Note that the values are generalized for multi-way combinations, but here we describe only the values for 2 call sets being combined.

  • set=Intersection : occurred in both call sets, not filtered out

  • set=NAME : occurred in the call set NAME only

  • set=NAME1-filteredInNAME : occurred in both call sets, but was not filtered in NAME1 but was filtered in NAME2

  • set=filteredInAll : occurred in both call sets, but was filtered out of both

For three or more call sets combinations, you can see records like NAME1-NAME2 indicating a variant occurred in both NAME1 and NAME2 but not all sets.

You specify the NAME of a callset is by using the following syntax in your command line: -V:omni 1000G_omni2.5.b37.sites.vcf.

Emitting minimal VCF output

You can add the -minimalVCF argument to CombineVariants if you want to eliminate unnecessary information from the INFO field and genotypes. In that case, the only fields emitted will be GT:GQ for genotypes and the keySet for INFO.

An even more extreme output format is -sites_only (a general engine capability listed in the CommandLineGATK documentation) where the genotypes for all samples are completely stripped away from the output format. Enabling this option results in a significant performance speedup as well.

Requiring sites to be present in a minimum number of callsets

Sometimes you may want to combine several data sets but you only keep sites that are present in at least 2 of them. To do so, simply add the -minN (or --minimumN) command, followed by an integer if you want to only output records present in at least N input files. In our example, you would add -minN 2 to the command line.

Example: intersecting two VCFs

In the following example, we use CombineVariants and SelectVariants to obtain only the sites in common between the OMNI 2.5M and HapMap3 sites in the GSA bundle.

# combine the data
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R bundle/b37/human_g1k_v37.fasta -L 1:1-1,000,000 -V:omni bundle/b37/1000G_omni2.5.b37.sites.vcf -V:hm3 bundle/b37/hapmap_3.3.b37.sites.vcf -o union.vcf

# select the intersection
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T SelectVariants -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -L 1:1-1,000,000 -V:variant union.vcf -select 'set == "Intersection";' -o intersect.vcf

This results in two vcf files, which look like:

# contents of union.vcf
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       990882  SNP1-980745     C       T       .       PASS    CR=99.79873;GentrainScore=0.7403;HW=0.005225421;set=omni
1       990984  SNP1-980847     G       A       .       PASS    CR=99.76005;GentrainScore=0.8406;HW=0.26163524;set=omni
1       992265  SNP1-982128     C       T       .       PASS    CR=100.0;GentrainScore=0.7412;HW=0.0025895447;set=omni
1       992819  SNP1-982682     G       A       .       id50    CR=99.72961;GentrainScore=0.8505;HW=4.811053E-17;set=FilteredInAll
1       993987  SNP1-983850     T       C       .       PASS    CR=99.85935;GentrainScore=0.8336;HW=9.959717E-28;set=omni
1       994391  rs2488991       G       T       .       PASS    AC=1936;AF=0.69341;AN=2792;CR=99.89378;GentrainScore=0.7330;HW=1.1741E-41;set=filterInomni-hm3
1       996184  SNP1-986047     G       A       .       PASS    CR=99.932205;GentrainScore=0.8216;HW=3.8830226E-6;set=omni
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
1       999649  SNP1-989512     G       A       .       PASS    CR=99.93262;GentrainScore=0.7965;HW=4.9767335E-4;set=omni

# contents of intersect.vcf
1       950243  SNP1-940106     A       C       .       PASS    AC=826;AF=0.29993;AN=2754;CR=97.341675;GentrainScore=0.7311;HW=0.15148845;set=Intersection
1       957640  rs6657048       C       T       .       PASS    AC=127;AF=0.04552;AN=2790;CR=99.86667;GentrainScore=0.6806;HW=2.286109E-4;set=Intersection
1       959842  rs2710888       C       T       .       PASS    AC=654;AF=0.23559;AN=2776;CR=99.849;GentrainScore=0.8072;HW=0.17526293;set=Intersection
1       977780  rs2710875       C       T       .       PASS    AC=1989;AF=0.71341;AN=2788;CR=99.89077;GentrainScore=0.7875;HW=2.9912625E-32;set=Intersection
1       985900  SNP1-975763     C       T       .       PASS    AC=182;AF=0.06528;AN=2788;CR=99.79926;GentrainScore=0.8374;HW=0.017794203;set=Intersection
1       987200  SNP1-977063     C       T       .       PASS    AC=1956;AF=0.70007;AN=2794;CR=99.45917;GentrainScore=0.7914;HW=1.413E-42;set=Intersection
1       987670  SNP1-977533     T       G       .       PASS    AC=2485;AF=0.89196;AN=2786;CR=99.51427;GentrainScore=0.7005;HW=0.24214932;set=Intersection
1       990417  rs2465136       T       C       .       PASS    AC=1113;AF=0.40007;AN=2782;CR=99.7599;GentrainScore=0.8750;HW=8.595538E-5;set=Intersection
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
Post edited by Geraldine_VdAuwera on

Comments

  • mpviverompvivero Posts: 7Member

    I am trying to use a variety of the GATK variant validation tools including CombineVariants and SelectVariants. However, I have been unable to get either tool to work. I have posted the first few lines of one of my VCFs below:

    CHROM POS ID REF ALT QUAL FILTER INFO

    chr1 762084 . T C . . .
    chr1 762136 . A C . . .
    chr1 762189 . A C . . .
    chr1 762192 . T C . . .
    chr1 762195 . C A . . .
    chr1 762196 . T C . . .

    As you can see, all I care about is the exact chromosomal location (based on the hg19 ref). I want to start by merging two different VCFs using the following code:

    java -jar GenomeAnalysisTK.jar -T CombineVariants -R hg19.fasta -V:ABC,abc.vcf -V:XYZ,xyz.vcf -o merged_abc_xyz.vcf -minimalVCF

    After, I will be extracting variants that intersect both sets using SelectVariants.

    The program completes without error, however, my merged VCF output does not populate with any data beyond the header (ie. no variants are listed). The program only runs for under a second and the completes and shuts off. What could be my issues? Please help. Thanks!!

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,273Administrator, GATK Dev admin

    What is the output to the console? Is there an error or other message?

    Geraldine Van der Auwera, PhD

  • CarneiroCarneiro Posts: 274Administrator, GATK Dev admin

    are abc.vcf and xyz.vcf well formed? How were they generated?

  • mpviverompvivero Posts: 7Member

    Hi Geraldine and Carneiro,

    Thanks for the responses. It turns out, the VCF files were slightly malformed on a couple of lines. They were not generated using a GATK variant caller which is probably why CombineVariants wasn't working. For now, I think I have everything under control after some manual reformatting.

    Will get back to you if anything else comes up.

  • ecyehecyeh Posts: 8Member

    It is a nice tool, thank you.
    Suppose I have a vcf file containing the genotype of 5 samples, and want to find variants occurring in at lease 3 of them. To get the "set=" attribute, do I need to split the vcf file into 5 ones first, by using SelectVariant, and then merge the 5 vcfs again using CombineVariants? Is there a better way to do that?

  • CarneiroCarneiro Posts: 274Administrator, GATK Dev admin

    @ecyeh use SelectVariants to find the variants on at least 3 of them. If you only have 5 samples, there aren't too many combinations. If you had more samples than that, I'd write a walker to do it.

  • ecyehecyeh Posts: 8Member

    Indeed I have more than 5 samples to compare. Thank you for the clarification. Now I feel so lucky to be a programmer.

  • CarneiroCarneiro Posts: 274Administrator, GATK Dev admin

    fantastic, go ahead and write a quick RODWalker to solve that one. You can base yourself on SelectVariants.

  • chongmchongm Posts: 33Member

    Hi, I'm combining different samples (each VCF has a "batch" of samples that were processed together and I just wanted to have my variants in one VCF for convenient's sake - i.e. I'm not combining variants in order to merge samples that were spread out on different runs). I was wondering why the PL changes for each genotype when I'm simply putting all my variants in one place.

    Thanks,

    MC

  • weberATillinoiseduweberATillinoisedu Posts: 4Member
    edited October 2013

    I think that there is a typo in Example 8. "-select 'set == ";Intersection";'" should lose that first semi-colon. With the semicolon, I get 0 VCF records produced. When I use "grep" there are 2300+ in my union.vcf data set. After removal, the resulting intersection.vcf file contains the expected number of records. I expect that because ";" is the separator char, it shouldn't be in the pattern. This was tested using v2.7-2 and v2.7-4.

    I hope this helps anyone else who was tripped up by it.

    Post edited by weberATillinoisedu on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,273Administrator, GATK Dev admin

    Hi @chongm,

    First, be careful when you merge VCFs containing variants that were called separately. If you're doing it to compare results of different calling iterations, evaluate concordance etc, that's perfectly fine. But combining them for convenience is dangerous because if later, you want to filter them, there are some annotations that you shouldn't use to filter them together, because the values are relative, not absolute, and will not be on the same scale between different sets. I'm not saying you shouldn't do it, but if you do it, you should be careful.

    The PLs changing is a known issue that occurs when combining variants with more than one alternate allele. CombineVariants currently does not handle multiallelic variants well.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,273Administrator, GATK Dev admin

    @weberATillinoisedu, you're absolutely correct. I will fix the typo in the article. Thanks for pointing it out!

    Geraldine Van der Auwera, PhD

  • chongmchongm Posts: 33Member

    Hi @Geraldine_VdAuwera, okay thanks for the warning. I will filter each VCF separately then. Once all batches are complete though, I'm going to call variants simultaneously for all batches which should result in one VCF. This is the best way to call variants right? Will some less frequent variants be filtered out if I call them using the entire cohort of samples vs. run by run?

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 8,273Administrator, GATK Dev admin

    Yes, joint calling followed by VQSR (variant recalibration) is indeed what we recommend. The process will increase your discovery power for difficult variants, but should not negatively affect the calling of rare variants.

    Geraldine Van der Auwera, PhD

  • evakoeevakoe Posts: 26Member

    It would be great if you could extend the -setKey argument so that one can not only specify the key, but also the values. If I combine three VCF files, let's say a.vcf, b.vcf, and c.vcf, the INFO field might look like set=variant-variant1. I assume "variant" corresponds to the first file given with -V and "variant1" to the second file given with -V, but of course to help me still know this in 1 month it would be great if I could specify other strings for "variant" and "variant1". Or to make thinks easier, you could also just take the file prefixes a, b, and c. Thanks. Eva

  • evakoeevakoe Posts: 26Member

    Actually, looking closer at my new VCF file combined from three VCF files, I now see that the set argument has four different values, even though I merged only three VCF files. These values are variant, variant1, variant2, and variant3.

    I used the option -setKey source and what I see now are: source=variant3, source=variant2, source=variant-variant2, but no source=variant1, source=variant or source=variant-variant1. Is it possible that there is a bug with the naming of the sets when it comes to the combinations?

    Thank you
    Eva

  • ebanksebanks Broad InstitutePosts: 689Member, Administrator, GATK Dev, Broadie, Moderator, DSDE Dev, GP Member admin

    Hi Eva,

    You just need to name your tracks. E.g. "-V:foo first.vcf -V:bar second.vcf", then the set values will be 'first' and 'second'.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • evakoeevakoe Posts: 26Member

    Hi Eric,
    thank you very much for this info, I did not know about this option. When I name the input VCFs, I really only observe the three values that I specified.

    If you want to take a closer look at why there are four different values when you don't name the inputs I can send you the three input vcfs I used.

Sign In or Register to comment.