The current GATK version is 3.8-0
Examples: Monday, today, last week, Mar 26, 3/26/04

Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

Get notifications!


You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

Formatting tip!


Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ``` ) each to make a code block as demonstrated here.

Jump to another community
Download the latest Picard release at https://github.com/broadinstitute/picard/releases.
GATK version 4.beta.3 (i.e. the third beta release) is out. See the GATK4 beta page for download and details.

Combining variants from different files into one

delangeldelangel Broad InstituteMember
edited May 2015 in Methods and Algorithms

Solutions for combining variant callsets depending on purpose

There are three main reasons why you might want to combine variants from different files into one, and the tool to use depends on what you are trying to achieve.

  1. The most common case is when you have been parallelizing your variant calling analyses, e.g. running HaplotypeCaller per-chromosome, producing separate VCF files (or gVCF files) per-chromosome. For that case, you can use a tool called CatVariants to concatenate the files. There are a few important requirements (e.g. the files should contain all the same samples, and distinct intervals) which you can read about on the tool's documentation page.

  2. The second case is when you have been using HaplotypeCaller in -ERC GVCF or -ERC BP_RESOLUTION to call variants on a large cohort, producing many gVCF files. We recommend combining the output gVCF in batches of e.g. 200 before putting them through joint genotyping with GenotypeGVCFs (for performance reasons), which you can do using CombineGVCFs, which is specific for handling gVCF files.

  3. The third case is when you want to combine variant calls that were produced from the same samples but using different methods, for comparison. For example, if you're evaluating variant calls produced by different variant callers, different workflows, or the same but using different parameters. This produces separate callsets for the same samples, which are then easier to compare if you combine them into a single file. For that purpose, you can use CombineVariants, which is capable of merging VCF records intelligently, treating the same samples as separate or not as desired, combining annotations as appropriate. This is the case that requires the most preparation and forethought because there are many options that may be used to adapt the behavior of the tool.

There is also one reason you might want to combine variants from different files into one, that we do not recommend following. That is, if you have produced variant calls from various samples separately, and want to combine them for analysis. This is how people used to do variant analysis on large numbers of samples, but we don't recommend proceeding this way because that workflow suffers from serious methodological flaws. Instead, you should follow our recommendations as laid out in the Best Practices documentation.


Merging records across VCFs with CombineVariants

Here we provide some more information and a worked out example to illustrate the third case because it is less straightforward than the other two.

A key point to understand is that CombineVariants will include a record at every site in all of your input VCF files, and annotate in which input callsets the record is present, pass, or filtered in in the set attribute in the INFO field (see below). In effect, CombineVariants always produces a union of the input VCFs. Any part of the Venn of the N merged VCFs can then be extracted specifically using JEXL expressions on the set attribute using SelectVariants. If you want to extract just the records in common between two VCFs, you would first CombineVariants the two files into a single VCF, and then run SelectVariants to extract the common records with -select 'set == "Intersection"', as worked out in the detailed example below.

Handling PASS/FAIL records at the same site in multiple input files

The -filteredRecordsMergeType argument determines how CombineVariants handles sites where a record is present in multiple VCFs, but it is filtered in some and unfiltered in others, as described in the tool documentation page linked above.

Understanding the set attribute

The set property of the INFO field indicates which call set the variant was found in. It can take on a variety of values indicating the exact nature of the overlap between the call sets. Note that the values are generalized for multi-way combinations, but here we describe only the values for 2 call sets being combined.

  • set=Intersection : occurred in both call sets, not filtered out

  • set=NAME : occurred in the call set NAME only

  • set=NAME1-filteredInNAME : occurred in both call sets, but was not filtered in NAME1 but was filtered in NAME2

  • set=filteredInAll : occurred in both call sets, but was filtered out of both

For three or more call sets combinations, you can see records like NAME1-NAME2 indicating a variant occurred in both NAME1 and NAME2 but not all sets.

You specify the NAME of a callset is by using the following syntax in your command line: -V:omni 1000G_omni2.5.b37.sites.vcf.

Emitting minimal VCF output

You can add the -minimalVCF argument to CombineVariants if you want to eliminate unnecessary information from the INFO field and genotypes. In that case, the only fields emitted will be GT:GQ for genotypes and the keySet for INFO.

An even more extreme output format is -sites_only (a general engine capability listed in the CommandLineGATK documentation) where the genotypes for all samples are completely stripped away from the output format. Enabling this option results in a significant performance speedup as well.

Requiring sites to be present in a minimum number of callsets

Sometimes you may want to combine several data sets but you only keep sites that are present in at least 2 of them. To do so, simply add the -minN (or --minimumN) command, followed by an integer if you want to only output records present in at least N input files. In our example, you would add -minN 2 to the command line.

Example: intersecting two VCFs

In the following example, we use CombineVariants and SelectVariants to obtain only the sites in common between the OMNI 2.5M and HapMap3 sites in the GSA bundle.

# combine the data
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R bundle/b37/human_g1k_v37.fasta -L 1:1-1,000,000 -V:omni bundle/b37/1000G_omni2.5.b37.sites.vcf -V:hm3 bundle/b37/hapmap_3.3.b37.sites.vcf -o union.vcf

# select the intersection
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T SelectVariants -R ~/Desktop/broadLocal/localData/human_g1k_v37.fasta -L 1:1-1,000,000 -V:variant union.vcf -select 'set == "Intersection";' -o intersect.vcf

This results in two vcf files, which look like:

# contents of union.vcf
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       990882  SNP1-980745     C       T       .       PASS    CR=99.79873;GentrainScore=0.7403;HW=0.005225421;set=omni
1       990984  SNP1-980847     G       A       .       PASS    CR=99.76005;GentrainScore=0.8406;HW=0.26163524;set=omni
1       992265  SNP1-982128     C       T       .       PASS    CR=100.0;GentrainScore=0.7412;HW=0.0025895447;set=omni
1       992819  SNP1-982682     G       A       .       id50    CR=99.72961;GentrainScore=0.8505;HW=4.811053E-17;set=FilteredInAll
1       993987  SNP1-983850     T       C       .       PASS    CR=99.85935;GentrainScore=0.8336;HW=9.959717E-28;set=omni
1       994391  rs2488991       G       T       .       PASS    AC=1936;AF=0.69341;AN=2792;CR=99.89378;GentrainScore=0.7330;HW=1.1741E-41;set=filterInomni-hm3
1       996184  SNP1-986047     G       A       .       PASS    CR=99.932205;GentrainScore=0.8216;HW=3.8830226E-6;set=omni
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
1       999649  SNP1-989512     G       A       .       PASS    CR=99.93262;GentrainScore=0.7965;HW=4.9767335E-4;set=omni

# contents of intersect.vcf
1       950243  SNP1-940106     A       C       .       PASS    AC=826;AF=0.29993;AN=2754;CR=97.341675;GentrainScore=0.7311;HW=0.15148845;set=Intersection
1       957640  rs6657048       C       T       .       PASS    AC=127;AF=0.04552;AN=2790;CR=99.86667;GentrainScore=0.6806;HW=2.286109E-4;set=Intersection
1       959842  rs2710888       C       T       .       PASS    AC=654;AF=0.23559;AN=2776;CR=99.849;GentrainScore=0.8072;HW=0.17526293;set=Intersection
1       977780  rs2710875       C       T       .       PASS    AC=1989;AF=0.71341;AN=2788;CR=99.89077;GentrainScore=0.7875;HW=2.9912625E-32;set=Intersection
1       985900  SNP1-975763     C       T       .       PASS    AC=182;AF=0.06528;AN=2788;CR=99.79926;GentrainScore=0.8374;HW=0.017794203;set=Intersection
1       987200  SNP1-977063     C       T       .       PASS    AC=1956;AF=0.70007;AN=2794;CR=99.45917;GentrainScore=0.7914;HW=1.413E-42;set=Intersection
1       987670  SNP1-977533     T       G       .       PASS    AC=2485;AF=0.89196;AN=2786;CR=99.51427;GentrainScore=0.7005;HW=0.24214932;set=Intersection
1       990417  rs2465136       T       C       .       PASS    AC=1113;AF=0.40007;AN=2782;CR=99.7599;GentrainScore=0.8750;HW=8.595538E-5;set=Intersection
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
Post edited by Geraldine_VdAuwera on

Comments

  • I am trying to use a variety of the GATK variant validation tools including CombineVariants and SelectVariants. However, I have been unable to get either tool to work. I have posted the first few lines of one of my VCFs below:

    CHROM POS ID REF ALT QUAL FILTER INFO

    chr1 762084 . T C . . .
    chr1 762136 . A C . . .
    chr1 762189 . A C . . .
    chr1 762192 . T C . . .
    chr1 762195 . C A . . .
    chr1 762196 . T C . . .

    As you can see, all I care about is the exact chromosomal location (based on the hg19 ref). I want to start by merging two different VCFs using the following code:

    java -jar GenomeAnalysisTK.jar -T CombineVariants -R hg19.fasta -V:ABC,abc.vcf -V:XYZ,xyz.vcf -o merged_abc_xyz.vcf -minimalVCF

    After, I will be extracting variants that intersect both sets using SelectVariants.

    The program completes without error, however, my merged VCF output does not populate with any data beyond the header (ie. no variants are listed). The program only runs for under a second and the completes and shuts off. What could be my issues? Please help. Thanks!!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    What is the output to the console? Is there an error or other message?

  • CarneiroCarneiro Charlestown, MAMember

    are abc.vcf and xyz.vcf well formed? How were they generated?

  • Hi Geraldine and Carneiro,

    Thanks for the responses. It turns out, the VCF files were slightly malformed on a couple of lines. They were not generated using a GATK variant caller which is probably why CombineVariants wasn't working. For now, I think I have everything under control after some manual reformatting.

    Will get back to you if anything else comes up.

  • ecyehecyeh Member

    It is a nice tool, thank you.
    Suppose I have a vcf file containing the genotype of 5 samples, and want to find variants occurring in at lease 3 of them. To get the "set=" attribute, do I need to split the vcf file into 5 ones first, by using SelectVariant, and then merge the 5 vcfs again using CombineVariants? Is there a better way to do that?

  • CarneiroCarneiro Charlestown, MAMember

    @ecyeh use SelectVariants to find the variants on at least 3 of them. If you only have 5 samples, there aren't too many combinations. If you had more samples than that, I'd write a walker to do it.

  • ecyehecyeh Member

    Indeed I have more than 5 samples to compare. Thank you for the clarification. Now I feel so lucky to be a programmer.

  • CarneiroCarneiro Charlestown, MAMember

    fantastic, go ahead and write a quick RODWalker to solve that one. You can base yourself on SelectVariants.

  • Hi, I'm combining different samples (each VCF has a "batch" of samples that were processed together and I just wanted to have my variants in one VCF for convenient's sake - i.e. I'm not combining variants in order to merge samples that were spread out on different runs). I was wondering why the PL changes for each genotype when I'm simply putting all my variants in one place.

  • edited October 2013

    I think that there is a typo in Example 8. "-select 'set == ";Intersection";'" should lose that first semi-colon. With the semicolon, I get 0 VCF records produced. When I use "grep" there are 2300+ in my union.vcf data set. After removal, the resulting intersection.vcf file contains the expected number of records. I expect that because ";" is the separator char, it shouldn't be in the pattern. This was tested using v2.7-2 and v2.7-4.

    I hope this helps anyone else who was tripped up by it.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Hi @chongm,

    First, be careful when you merge VCFs containing variants that were called separately. If you're doing it to compare results of different calling iterations, evaluate concordance etc, that's perfectly fine. But combining them for convenience is dangerous because if later, you want to filter them, there are some annotations that you shouldn't use to filter them together, because the values are relative, not absolute, and will not be on the same scale between different sets. I'm not saying you shouldn't do it, but if you do it, you should be careful.

    The PLs changing is a known issue that occurs when combining variants with more than one alternate allele. CombineVariants currently does not handle multiallelic variants well.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @weberATillinoisedu, you're absolutely correct. I will fix the typo in the article. Thanks for pointing it out!

  • Hi @Geraldine_VdAuwera, okay thanks for the warning. I will filter each VCF separately then. Once all batches are complete though, I'm going to call variants simultaneously for all batches which should result in one VCF. This is the best way to call variants right? Will some less frequent variants be filtered out if I call them using the entire cohort of samples vs. run by run?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Yes, joint calling followed by VQSR (variant recalibration) is indeed what we recommend. The process will increase your discovery power for difficult variants, but should not negatively affect the calling of rare variants.

  • It would be great if you could extend the -setKey argument so that one can not only specify the key, but also the values. If I combine three VCF files, let's say a.vcf, b.vcf, and c.vcf, the INFO field might look like set=variant-variant1. I assume "variant" corresponds to the first file given with -V and "variant1" to the second file given with -V, but of course to help me still know this in 1 month it would be great if I could specify other strings for "variant" and "variant1". Or to make thinks easier, you could also just take the file prefixes a, b, and c. Thanks. Eva

  • Actually, looking closer at my new VCF file combined from three VCF files, I now see that the set argument has four different values, even though I merged only three VCF files. These values are variant, variant1, variant2, and variant3.

    I used the option -setKey source and what I see now are: source=variant3, source=variant2, source=variant-variant2, but no source=variant1, source=variant or source=variant-variant1. Is it possible that there is a bug with the naming of the sets when it comes to the combinations?

    Thank you
    Eva

  • ebanksebanks Broad InstituteMember, Broadie, Dev

    Hi Eva,

    You just need to name your tracks. E.g. "-V:foo first.vcf -V:bar second.vcf", then the set values will be 'first' and 'second'.

  • Hi Eric,
    thank you very much for this info, I did not know about this option. When I name the input VCFs, I really only observe the three values that I specified.

    If you want to take a closer look at why there are four different values when you don't name the inputs I can send you the three input vcfs I used.

  • varshavarsha FloridaMember

    Hi @ebanks, I saw some previous documentation on using -genotypeMergeOptions REQUIRE_UNIQUE and UNIQUIFY and also -­‐filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED -­‐filteredAreUncalled for combining variants but I am not sure which of these is ideal for generating PON vcf with only 1 record for each variant (record positions with the same variant in the same location only once). Please let me know where I can find documentation regarding this. Sorry for the trouble, your help is much appreciated. I have looked for the info in these threads-

    https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php
    http://gatkforums.broadinstitute.org/discussion/53/combining-variants-from-different-files-into-one
    http://gatkforums.broadinstitute.org/discussion/4641/build-a-panel-of-normal-for-mutect

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @varsha
    Hi,

    Are you trying to merge VCFs with the same sample names? -genotypeMergeOptions REQUIRE_UNIQUE makes sure all the sample names are different in the input VCFs. -genotypeMergeOptions UNIQUIFY adds a suffix to any sample names that are the same in the VCFs so you can distinguish them. Both options will create a single record for any site that exists in your VCFs.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    It doesn't matter how you choose to merge genotypes because when MuTect uses the PON, it ignores the genotype information and only looks at site-level information.

  • varshavarsha FloridaMember

    Hi @Sheila, @Geraldine_VdAuwera, Thank you for your help. I did give all different sample names for the normal vcfs to be merged. I ended up using --genotypemergeoption UNSORTED and it seemed to work fine. I am currently using this PON.vcf generated to run MuTect on my unpaired samples using --normal_panel PON.vcf. I am waiting to see if this works without any errors, I will get back with any updates/ issues. Thanks again.

  • varshavarsha FloridaMember

    Hi would you be able to specify the filteredrecordsmergetype to be used for combining the vcfs for PON? I am unable to narrow down the exact syntax from the forum posts. Thank you for your help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Just use --filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED

  • varshavarsha FloridaMember

    Hi @Geraldine_VdAuwera, I used both --genotypemergeoption UNSORTED --filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED to generate my PON vcf successfully , thanks again!

  • everestial007everestial007 GreensboroMember

    @Carneiro said:
    @ecyeh use SelectVariants to find the variants on at least 3 of them. If you only have 5 samples, there aren't too many combinations. If you had more samples than that, I'd write a walker to do it.

    @Geraldine_VdAuwera @Sheila
    I am in a similar situation. I have vcf from 6 different samples and want to select variants that are present in at least 3 of them. I have used combine variants (with -minN 3) to merge and retain the variants that are present in at least 3 different samples. The resulting *.vcf file using this trick should be equivalent to the one if I had first merged the variants and then selected the ones present in at least 3 samples, rite??
    But, I want to know how to used select variants to find the variants that are present in at least 3 samples. I have used -select 'set == "Intersection";' to obtain the variants that are in all 6 samples. But, finding the right flag for selecting variants from at least 3 sample has been a problem. I have checked the select variants tool page but don't find any thing useful.

    Thank you in advance,

  • everestial007everestial007 GreensboroMember
    edited December 2015

    After I combine variants (or select variants), I want to use the combined variants.vcf for VQSR. If I want to remove any field and/or info that is not useful for VQSR, what/how can I do it? Also, I am not sure what fields should be retained so the vcf is useful for VQSR analyses.

    To remove or retain the field of interest I came across some complex commands using awk and/or python. Is there any command in GATK or vcf tools which could be useful for the purpose.

    Thank you !

  • everestial007everestial007 GreensboroMember
    edited December 2015

    @everestial007 said:

    @Carneiro said:
    @ecyeh use SelectVariants to find the variants on at least 3 of them. If you only have 5 samples, there aren't too many combinations. If you had more samples than that, I'd write a walker to do it.

    @Geraldine_VdAuwera @Sheila
    I am in a similar situation. I have vcf from 6 different samples and want to select variants that are present in at least 3 of them. I have used combine variants (with -minN 3) to merge and retain the variants that are present in at least 3 different samples. The resulting *.vcf file using this trick should be equivalent to the one if I had first merged the variants and then selected the ones present in at least 3 samples, rite??
    But, I want to know how to used select variants to find the variants that are present in at least 3 samples. I have used -select 'set == "Intersection";' to obtain the variants that are in all 6 samples. But, finding the right flag for selecting variants from at least 3 sample has been a problem. I have checked the select variants tool page but don't find any thing useful.

    Thank you in advance,

    Looks like the above question had some typos which made the question unclear. I am sorry for that. I have reworded the question again:

    I am in a similar situation. I have vcf from 6 different samples and want to select variants that are present in at least 3 of them. I have used combine variants (with -minN 3) to merge and retain the variants that were present in at least 3 different samples.
    java -Xmx4g -jar GenomeAnalysisTK.jar -T CombineVariants -R lyrata_genome.fa --variant:varMA605 commonVARiantsMA605PRIORITY.vcf --variant:varMA611 commonVARiantsMA611PRIORITY.vcf --variant:varMA622 commonVARiantsMA622PRIORITY.vcf --variant:varMA625 commonVARiantsMA625PRIORITY.vcf --variant:varMA629 commonVARiantsMA629PRIORITY.vcf --variant:varNcm8 commonVARiantsNcm8PRIORITY.vcf -o unionVARiantsMAYODANmin3.vcf -minN 3
    I used this combine variants method as I was not able to use select variants to pull the variants that were present in at least 3 sample. I used the following command. java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V unionVARiantsMAYODAN.vcf -select 'set == "minN 3";' -o commonVARiantsMAYODANminN3.vcf
    which did not work. I know that the used of select 'set== "minN 3" is not appropriate in here but not sure how I can do it. let me now if there is way.

    But, I think the resulting *.vcf file using this trick (combine variants with minN 3) should be equivalent to the one if I had first merged the variants and then selected the ones present in at least 3 samples, rite??

    Thanks in advance !

    • Bishwa K.
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @everestial007 Yes the method you used is correct and no further manipulation of the file should be necessary.

    I believe SelectVariants has an option to drop non-essential fields of the vcf -- have a look at the tool doc for a complete list.

  • everestial007everestial007 GreensboroMember

    It is surprising to see that the ouput of the GenotypeGVCFs from several samples is often small compared to the g.vcf (for each sample we used). My several g.vcf files are above 2 gb but the genotypedGVCF from six samples is only 1 gb. Is this normal?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    This is expected, as the gVCFs include data for many sites that are not called variant, whereas by default the output of GGVCFs only contains sites where a variant has been called. Also, the lines corresponding to each genome position from each gVCF get condensed into a single line in the final output.

  • everestial007everestial007 GreensboroMember

    @Geraldine_VdAuwera said:
    @everestial007 Yes the method you used is correct and no further manipulation of the file should be necessary.

    I believe SelectVariants has an option to drop non-essential fields of the vcf -- have a look at the tool doc for a complete list.

    Hi @Geraldine_VdAuwera
    After looking through the available flags my thought was that -IDs and -excludeIDs are the appropriate flags to select/exclude the field of interest. But, for some reason I am not finding a proper way of doing it. I created a fileKeep document and typed in the strings (as is in the vcf file) for the fields I am interested in. The script runs and outputs a file but with out any information on it except the headers (but there seems to be no selection). So, probably my selection of the flag is right but I am not able to find a right way of using it.
    Please let me know if there is somewhere else I should be looking for.

    Thanks,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    Ah, my bad, it's CombineVariants that has a -minimalVCF argument. And there's a general engine capability called -sites_only that goes even further.

  • everestial007everestial007 GreensboroMember

    @Geraldine_VdAuwera
    I checked and found that -minimalVCF requires a boolean logic. I am not sure what value should I give. Providing true/false didnot help and i found no example to help myself. Also, I found no -sites_only argument for CombineVariants.
    Could you please provide me with an example that could be helpful.

    Thank you,

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    For booleans, just including the flag (without specifying a value) is sufficient to set it to True.

    -sites_only is an engine capability so it's documented with the rest of the engine arguments, not with any particular tool's.

  • everestial007everestial007 GreensboroMember

    @Geraldine_VdAuwera said:
    For booleans, just including the flag (without specifying a value) is sufficient to set it to True.

    -sites_only is an engine capability so it's documented with the rest of the engine arguments, not with any particular tool's.

    Thank you !!!

  • GuillefriisGuillefriis SpainMember

    Hi,

    I'm trying to intersect a GATK called vcf file with other vcf called with a different genotyper (I've tried with samtools and TASSEL so far) to use the intersect as reliable SNP dataset to perform the BQSR since I work with a non-model avian genus and there is no other datasets available. I'm having a lot of problems, like incoherences with respect the reference genoe (even when I've used exactely the same one) or related with lacking the the vcf.idx (##### ERROR MESSAGE: Problem detecting index type) at the first step, when using th CombineVariants tool. Has somebody tried this before? Suggestions? Thanksd a lot.

    Guillermo

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    @Guillefriis You'll need to post the details of each problem separately, as we can't give you a one-size-fits-all solution.

  • GuillefriisGuillefriis SpainMember

    Hi @Geraldine_VdAuwera

    Let's leave aside the intersection with the TASSEL calling dataset.

    I performed a SNPcalling following the GATK best practices workflow till the BQSR step using Genotyping-by-sequencing (GBS) data for around four hundred individuals. Since I work with emberizids, I used the Zebra Finch genome as reference.
    I reproduced the GATK SNPcalling with strict quality thresholds and did an alternative SNPcalling using the mpileup tool from SAMTOOLS in order to recover those SNPs detected by both workflows and use them in the BQSR, assuming they would be more reliable. I'm using the script detailed here:

    "
    module load GATK/3.3-0

    gatk=/gpfs/res_apps/GATK/3.3-0/GenomeAnalysisTK.jar
    genome=~/data/Borja/finch_genome/ZEFI_gatkindex/Taeniopygia_guttata.taeGut3.2.4.dna.toplevel.fa
    samtools_snps=~/scratch/PIPELINE/GBS/Final_Workflow/T_samtools_calling/out_strict.vcf

    java -jar $gatk \
    -T CombineVariants \
    -R $genome \
    -V JUNCOsnps_ZEFI_HQ_intersectBQSR.vcf \
    -V $samtools_snps \
    -o tassel_gatk.vcf

    java -jar $gatk \
    -T SelectVariants \
    -V tassel_gatk.vcf \
    -R $genome \
    -o Intersect_ZEFI_BQSR.vcf \
    -nt 16 \
    -select 'set == "Intersection";'
    "

    However the CombineVariants tool gives an error somehow related with the index, my guess the lacking samtools snps vcf index:

    "

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.3-0-g37228af):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Problem detecting index type
    ERROR ------------------------------------------------------------------------------------------

    "

    I couldn't find a way to produce and index for the file. I also checked related posts (http://gatkforums.broadinstitute.org/gatk/discussion/2283/in-regards-to-intersecting-vcf-files) but couldn't find a solution. Maybe is not doable? Thanks for your attention Geraldine.

    Cheers
    Guillermo

  • SheilaSheila Broad InstituteMember, Broadie, Moderator

    @Guillefriis
    Hi Guillermo,

    I suspect the issue is with the non-GATK generated VCF. Can you try running ValidateVariants on your VCFs? If they pass, you can try deleting the VCF indices and GATK will regenerate them for you.

    -Sheila

  • SumuduSumudu Sri LankaMember

    Hi,

    I have a multi-sample (6 samples) vcf file and want to select sites present in at least 3 of them. Someone has asked the same question previously and the answer was to use SelectVariants. But I can't still figure out how to use it. Should I use VariantContext or a script to do that. Appreciate if you would clarify it a bit more.

    Thanks
    -Sumu

  • everestial007everestial007 GreensboroMember
    edited January 3

    @Sumudu:
    I have used CombineVariants to select variants present in at least 3 samples like this:

    java -Xmx4g -jar /home/everestial007/GenomeAnalysisTK-3.4-46/GenomeAnalysisTK.jar -T CombineVariants -R lyrata_genome.fa --variant:varSNPsMA605 filteredHCVariantsSNPsMA605.vcf --variant:varSNPsMA611 filteredHCVariantsSNPsMA611.vcf --variant:varSNPsMA622 filteredHCVariantsSNPsMA622.vcf --variant:varSNPsMA625 filteredHCVariantsSNPsMA625.vcf --variant:varSNPsMA629 filteredHCVariantsSNPsMA629.vcf --variant:varSNPsNcm8 filteredHCVariantsSNPsNcm8.vcf -o unionSNPsvariantMAYODANmin3.vcf -minN 3.

    I think the -minN 3 is an universal walker so just try it with SelectVariants. If not you will have to separate the samples and then combine it.

  • SumuduSumudu Sri LankaMember

    Thanks @everestial007 . I have a multi sample vcf and I'm looking for a more convenient way of selecting variants without separating my samples in to individual vcfs. If any other suggestion would be appreciated.

    Sumudu

  • everestial007everestial007 GreensboroMember
    edited January 4

    @Sumudu : I am not sure if you found the solution yet. But, I tested --maxNOCALLnumber with SelectVariants and it worked well. Unforutnately, -minN isn't a universal walker and doesn't work with SelectVariants.

    Since, you want sites called at least in 3 samples you can use it in following format:

    java -Xmx4g -jar /home/everestial007/G enomeAnalysisTK-3.7/GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V MY.phased_variants.vcf --maxNOCALLnumber 3 -o MY.phased_variants.minN3.vcf

    --maxNOCALLnumber 3 - this option tells the walker to select the sites from the input *.vcf where maximum of 3 samples don't have any genotype called. You can also use the fraction using --maxNOCALLfraction

    Also, make sure to read all the possible option on the GATK tool for any task -T and try it out on small vcf samples.

    Issue · Github
    by Sheila

    Issue Number
    1582
    State
    closed
    Last Updated
    Assignee
    Array
    Milestone
    Array
    Closed By
    vdauwera
  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie

    The maxNOCALLnumber approach will probably not work on a proper joint-called multisample VCF since any samples that are not variant but have enough data present at the sites of interest will have hom-ref calls. There are some decently simple ways to retrieve variants present at a certain frequency in the population, but embarrassingly we don't have a straightforward way to simply select variants based on how many samples are variant or not variant. Which is not to say it can be done; there are several circuitous ways you could do it, including doing a first pass to filter out hom-ref genotypes (using VariantFiltration and the dreaded variant context), then a second to set these to no-calls (still with VariantFiltration, then finally use the maxNOCALLnumber trick proposed by @everestial007.

    My recommendation? Use VariantsToTable to output a table of your variants with contig, position and genotypes, and use your scripting language of choice (heck, even Excel) to determine which you want to keep. Then simply produce an intervals list using the contig and position, and use that as -L value in SelectVariants. It's cleaner and while you're at it you can collect some preliminary stats on what your filtering is going to produce.

  • SumuduSumudu Sri LankaMember
    edited January 9

    @Geraldine_VdAuwera

    Thank you for the detailed clear answer. Actually, I used SnpSift tool using the expression "countRef() < 4" to select variants present in at least 3 samples (Het or Homo Var) out of my 6 samples, in order to obtain a consistent variant set. I hope this method is also ok.

    Sumudu.

  • everestial007everestial007 GreensboroMember

    @Geraldine_VdAuwera @Sumudu

    HI Geraldine,

    Rather than using first pass method combined with other process, I think it would help - splitting the vcf file into multiple files by samples and then combining with them using -minN option.

    Thanks,

  • SumuduSumudu Sri LankaMember

    @everestial007,

    Thanks. But what if you have >200 samples or so? Splitting them in to separate samples will be a tedious task.
    Sumudu.

  • everestial007everestial007 GreensboroMember
    edited January 10

    @Sumudu
    You have to realize that there are no perfect tools that do all. If you are not from informatics background, I sugggest you start looking into other forums and tools, which might have easier solution than GATK.

    Solutions

    1:

    For, splitting you vcf samples by samples check this:
    https://www.biostars.org/p/224702/ - look for the answer by WouterDeCoster Using GNU parallels. Other answers are python based (I think).

    After this you can CombineVariants from GATK with -minN option. Don't use uniquify or prioritize.
    I think with lots of sample you can create a file containing list of vcf and submit it in the GATK script, but I am not sure about it @Geraldine_VdAuwera @Sheila may be able to confirm.

    2:

    Another useful tool would be
    VCFtools http://vcftools.sourceforge.net/man_latest.html
    Reading through the description, I think SITE FILTERING OPTIONS > Genotype value filtering > (--maxmissing or --maxmissingCount) would probably work.

    3:

    Or, BCFtools https://samtools.github.io/bcftools/bcftools.html
    Look into bcftools isec option

    I have not personally verified the use of vcftools bcftools but they are pretty good tools for variants filtering. Another issue that you need to be careful is to look at the vcf file (or the regions within it) after filtering visually using Vim Editior on linux or IGV.
    Also, you may apply all three methods: CombineVariants with GATK, vcftools, bcftools and see how well each of these tool/method has worked. It would be nice if you could let us know about your experience !

    Thanks,

  • SumuduSumudu Sri LankaMember

    @ everestial007,
    Thanks for your answer. Let me go through what you have suggested and will let you know about my findings.
    Sumudu.

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @everestial007 and @Sumudu,

    For many tool arguments, in GATK3 you can use the -args, aka --arg_file <arg_file> option within the command. The arguments must be on the same line, e.g. -V file1.vcf -V file2.vcf ... -V fileN.vcf. I just happened to test an arg_file last week with CombineVariants and it works well.

    In GATK4, you will be able to provide a file listing the paths to each input file per line, e.g. -V one-per-line-directory-paths.txt.

    I hope this is helpful.

  • zejunyanzejunyan EdinburghMember

    @shlee

    Hi, thank you for you reply on population prior.
    I got a new question:
    I am trying to combine my callset with dbSNP in vcf format (public data from NCBI) using CombineVariants. two questions:
    first, when combining, I have to make sure these callsets were produced based on the same reference genome, right? The problem is that these two callsets were produced using different version of chicken ref genomes, mine used version-5, the dbSNP used version-4.

    Second, if I used the same ref genome as the dbSNP, the problem is that the #CHROM column in my callset does not match with that in dbSNP. For example, chromosome 22 in my callset is named 'NC_006109.4', but it is named as '22' in the dbSNP.vcf. I guess this does matter, and it does not allow me to combine two vcf files. Right?

    The solution to this could be that if I can change NC_006109.4 to 22 in my callset vcf file. Do you know a good approach to do this, possible?

    Cheers

    Dr yan

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi Dr. Yan (@zejunyan),

    If the reference assembly coordinates shifted for the chicken genome between v4 and v5, then for the dbSNP resource file you can use something we call liftover. Lifting over coordinates is not an ideal thing to do but given the cost of regenerating the resource file (by realigning and calling on the new reference), we make do with liftover. Liftover requires a chain file and you can obtain these, e.g. from UCSC (Genome Browser), NCBI/GRC and ENSEMBLE. UCSC also offers a web-based liftOver application that is quite popular. Picard also offers LiftoverVcf as a tool. I suggest you compare the liftover results using different chains/functions and decide which is best for you. A month ago I actually inquired about acceptable rates of rejection and was told by an expert in the field that ~1% is the cutoff.

    Lifting over will resolve differences in contig nomenclature. For GATK and Picard tools, contig names must match exactly.

    Dr. Lee

  • zejunyanzejunyan EdinburghMember

    @shlee
    Thank you very much!
    I will give a try

    Dr yan

  • zejunyanzejunyan EdinburghMember

    @shlee
    Hi, shlee
    I am trying this liftover with commandline:
    java -jar ../tools/picard-tools-1.141/picard.jar LiftoverVcf I=AVSEQ1300078.variants.g.vcf.gz O=lifted_over.vcf CHAIN=galGal4ToGalGal5.over.chain.gz REJECT=rejected_variants.vcf R=/mnt/Pella/Processing/NGS-working/zyan/galGal5/galGal5_TopLevel/Gallus_gallus.Gallus_gallus-5.0.dna.toplevel.fa
    However, I did not see the chromosome coordinates in galgal4 changed to that in galgal5. In fact, all the records in my input vcf (ASEQ1300078.variants.g.vcf) are rejected into rejected.vcf, The lifted_over.vcf is empty.

    I am nor sure where I input the cut-off value for the rate of rejection?

    Cheers

    Dr yan

  • shleeshlee CambridgeMember, Broadie, Moderator

    Hi @zejunyan,

    It looks like you are using an older version of Picard. We are on Picard v2.10.5 currently. See how the latest version handles your data and also see how UCSC's liftOver handles your data. The latter allows a cut-off value for the rate of rejection. Let me know how these runs go.

  • zejunyanzejunyan EdinburghMember

    @shlee
    Hi, shlee,

    When I tried to do BQSR using the snp database for chicken as the input for the knownSite, it always gave me this error message. Is there a way to fix this problem throughout this input vcf file?
    ** ERROR MESSAGE: The provided VCF file is malformed at approximately line number 5466413: The VCF specification does not allow for whitespace in the INending field value was "RSPOS=67905532;GENEINFO=101749622:SERPINB10 CPOX|420895:SERPINB6;dbSNPBuildID=138;SAO=0;VC=snp;VLD;VP=0500000C0005000000000100", foe: /gensys/projects/NGS/zyan/snp_db/00-All.sorted.vcf.gz**

    Bests

    Dr Yan

  • shleeshlee CambridgeMember, Broadie, Moderator

    @zejunyan,

    Sounds like the known sites data file has some sort of formatting issue.

    ERROR MESSAGE: The provided VCF file is malformed at approximately line number 5466413

    Look into the file to see what is causing the error and if this type of formatting issue is systematic. Then see if you can correct the formatting issue.

Sign In or Register to comment.