The current GATK version is 3.6-0

Combining variants from different files into one

delangeldelangel Dev Posts: 71
edited May 2015 in Methods and Algorithms

Solutions for combining variant callsets depending on purpose

There are three main reasons why you might want to combine variants from different files into one, and the tool to use depends on what you are trying to achieve.

  1. The most common case is when you have been parallelizing your variant calling analyses, e.g. running HaplotypeCaller per-chromosome, producing separate VCF files (or gVCF files) per-chromosome. For that case, you can use a tool called CatVariants to concatenate the files. There are a few important requirements (e.g. the files should contain all the same samples, and distinct intervals) which you can read about on the tool's documentation page.

  2. The second case is when you have been using HaplotypeCaller in -ERC GVCF or -ERC BP_RESOLUTION mode to call variants on a large cohort, producing many gVCF files. For performance reasons, we recommend combining the output gVCFs in batches of e.g. 200 before putting them through joint genotyping with GenotypeGVCFs. You can do this with CombineGVCFs, which is designed specifically for handling gVCF files.

  3. The third case is when you want to combine variant calls that were produced from the same samples but using different methods, for comparison. For example, you may be evaluating variant calls produced by different variant callers, by different workflows, or by the same caller run with different parameters. This produces separate callsets for the same samples, which are easier to compare once you combine them into a single file. For that purpose, you can use CombineVariants, which is capable of merging VCF records intelligently, treating the same samples as separate or not as desired, and combining annotations as appropriate. This is the case that requires the most preparation and forethought because there are many options that may be used to adapt the behavior of the tool.

There is also one reason you might want to combine variants from different files into one, that we do not recommend following. That is, if you have produced variant calls from various samples separately, and want to combine them for analysis. This is how people used to do variant analysis on large numbers of samples, but we don't recommend proceeding this way because that workflow suffers from serious methodological flaws. Instead, you should follow our recommendations as laid out in the Best Practices documentation.


Merging records across VCFs with CombineVariants

Here we provide some more information and a worked out example to illustrate the third case because it is less straightforward than the other two.

A key point to understand is that CombineVariants will include a record at every site in all of your input VCF files, and annotate, in the set attribute of the INFO field (see below), the input callsets in which the record is present, passing, or filtered. In effect, CombineVariants always produces a union of the input VCFs. Any region of the Venn diagram of the N merged VCFs can then be extracted specifically with SelectVariants, using JEXL expressions on the set attribute. If you want to extract just the records in common between two VCFs, you would first run CombineVariants on the two files to produce a single VCF, and then run SelectVariants to extract the common records with -select 'set == "Intersection"', as worked out in the detailed example below.
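To see how records are distributed across the regions of the Venn diagram before selecting any of them, one option is to tally the values of the set attribute in the union VCF. The following is a minimal Python sketch, not part of GATK; the function name tally_set_values is hypothetical, and the sample records are abbreviated versions of lines from the example output further down.

```python
from collections import Counter


def tally_set_values(vcf_lines, key="set"):
    """Count records per value of the INFO attribute named `key`."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        info = line.rstrip("\n").split("\t")[7]  # INFO is the 8th VCF column
        for entry in info.split(";"):
            name, _, value = entry.partition("=")
            if name == key:
                counts[value] += 1
    return counts


# Abbreviated records (INFO shortened) in the style of union.vcf
records = [
    "1\t990839\tSNP1-980702\tC\tT\t.\tPASS\tAC=150;set=Intersection",
    "1\t990882\tSNP1-980745\tC\tT\t.\tPASS\tCR=99.79873;set=omni",
    "1\t998395\trs7526076\tA\tG\t.\tPASS\tAC=2234;set=Intersection",
]
print(tally_set_values(records))
```

To use it on a real file, pass an open file handle instead of the list, e.g. tally_set_values(open("union.vcf")).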

Handling PASS/FAIL records at the same site in multiple input files

The -filteredRecordsMergeType argument determines how CombineVariants handles sites where a record is present in multiple VCFs, but it is filtered in some and unfiltered in others, as described in the tool documentation page linked above.

Understanding the set attribute

The set property of the INFO field indicates which call set the variant was found in. It can take on a variety of values indicating the exact nature of the overlap between the call sets. Note that the values are generalized for multi-way combinations, but here we describe only the values for 2 call sets being combined.

  • set=Intersection : occurred in both call sets, not filtered out

  • set=NAME : occurred in the call set NAME only

  • set=NAME1-filteredInNAME2 : occurred in both call sets, not filtered in NAME1 but filtered in NAME2

  • set=FilteredInAll : occurred in both call sets, but was filtered out of both

For combinations of three or more call sets, you can see records like NAME1-NAME2, indicating a variant that occurred in both NAME1 and NAME2 but not in all sets.
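The values described above can be decoded programmatically. The sketch below is a hypothetical Python helper, not part of GATK, written for the two-callset case. It assumes that callset names contain no hyphens and that the filtered marker is a prefix on the callset name. The list above writes this prefix as filteredIn while the example output further down shows filterIn, so the sketch accepts both spellings. With more than two callsets, FilteredInAll may not imply presence in every callset, so treat that branch with caution.

```python
FILTER_PREFIXES = ("filteredIn", "filterIn")


def parse_set(value, callsets):
    """Map each callset name to 'pass', 'filtered' or 'absent'."""
    if value == "Intersection":
        return {name: "pass" for name in callsets}
    if value.lower() == "filteredinall":
        return {name: "filtered" for name in callsets}
    status = {name: "absent" for name in callsets}
    for token in value.split("-"):  # assumes names contain no hyphens
        state = "pass"
        for prefix in FILTER_PREFIXES:
            if token.startswith(prefix):
                token, state = token[len(prefix):], "filtered"
                break
        if token not in status:
            raise ValueError(f"unknown callset {token!r} in set={value!r}")
        status[token] = state
    return status


# Values taken from the example output in this article
for value in ("Intersection", "omni", "filterInomni-hm3", "FilteredInAll"):
    print(value, parse_set(value, ["omni", "hm3"]))
```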

You specify the NAME of a callset by using the following syntax in your command line: -V:omni 1000G_omni2.5.b37.sites.vcf.

Emitting minimal VCF output

You can add the -minimalVCF argument to CombineVariants if you want to eliminate unnecessary information from the INFO field and genotypes. In that case, the only fields emitted will be GT:GQ for genotypes and the keySet for INFO.

An even more extreme output format is -sites_only (a general engine capability listed in the CommandLineGATK documentation) where the genotypes for all samples are completely stripped away from the output format. Enabling this option results in a significant performance speedup as well.
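To make concrete what a sites-only file contains, the sketch below mimics the shape of such output in Python by keeping only the first eight VCF columns (CHROM through INFO) and dropping FORMAT and the per-sample columns. It only illustrates the resulting layout. It is not how GATK implements -sites_only, and the function name to_sites_only is hypothetical.

```python
def to_sites_only(line):
    """Drop FORMAT and sample columns from one VCF line."""
    line = line.rstrip("\n")
    if not line or line.startswith("##"):
        return line  # meta-information lines are left untouched
    # The column header (#CHROM ...) and data records are both tab-separated;
    # the first 8 columns are CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO.
    return "\t".join(line.split("\t")[:8])


# Made-up record with two samples, for illustration only
record = "1\t990839\tSNP1-980702\tC\tT\t.\tPASS\tAC=150\tGT:GQ\t0/1:99\t0/0:50"
print(to_sites_only(record))
```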

Requiring sites to be present in a minimum number of callsets

Sometimes you may want to combine several data sets but keep only the sites that are present in at least N of them. To do so, add the -minN (or --minimumN) argument, followed by an integer N, so that only records present in at least N input files are output. For example, to keep only sites present in at least 2 input files, you would add -minN 2 to the command line.
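If you already have a union VCF and want to apply a similar threshold after the fact, the set attribute carries enough information to count the callsets in which a record is present and unfiltered. The Python sketch below is a hypothetical helper, not part of GATK. It counts only unfiltered occurrences, which may differ from what -minN counts, and it assumes hyphen-free callset names (varA, varB and varC are made up for illustration).

```python
FILTER_PREFIXES = ("filteredIn", "filterIn")


def n_unfiltered_callsets(set_value, n_callsets):
    """Number of callsets in which the record is present and unfiltered."""
    if set_value == "Intersection":
        return n_callsets  # present and unfiltered everywhere
    if set_value.lower() == "filteredinall":
        return 0  # filtered wherever it is present
    tokens = set_value.split("-")  # assumes names contain no hyphens
    return sum(1 for token in tokens if not token.startswith(FILTER_PREFIXES))


def keep_record(set_value, n_callsets, min_n):
    """True if the record is unfiltered in at least min_n callsets."""
    return n_unfiltered_callsets(set_value, n_callsets) >= min_n


values = ["Intersection", "varA-varB", "filterInvarA-varB", "varC"]
print([v for v in values if keep_record(v, n_callsets=3, min_n=2)])
```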

Example: intersecting two VCFs

In the following example, we use CombineVariants and SelectVariants to obtain only the sites in common between the OMNI 2.5M and HapMap3 sites in the GSA bundle.

# combine the data
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T CombineVariants -R bundle/b37/human_g1k_v37.fasta -L 1:1-1,000,000 -V:omni bundle/b37/1000G_omni2.5.b37.sites.vcf -V:hm3 bundle/b37/hapmap_3.3.b37.sites.vcf -o union.vcf

# select the intersection
java -Xmx2g -jar dist/GenomeAnalysisTK.jar -T SelectVariants -R bundle/b37/human_g1k_v37.fasta -L 1:1-1,000,000 -V:variant union.vcf -select 'set == "Intersection";' -o intersect.vcf

This results in two VCF files, whose records look like this:

# contents of union.vcf
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       990882  SNP1-980745     C       T       .       PASS    CR=99.79873;GentrainScore=0.7403;HW=0.005225421;set=omni
1       990984  SNP1-980847     G       A       .       PASS    CR=99.76005;GentrainScore=0.8406;HW=0.26163524;set=omni
1       992265  SNP1-982128     C       T       .       PASS    CR=100.0;GentrainScore=0.7412;HW=0.0025895447;set=omni
1       992819  SNP1-982682     G       A       .       id50    CR=99.72961;GentrainScore=0.8505;HW=4.811053E-17;set=FilteredInAll
1       993987  SNP1-983850     T       C       .       PASS    CR=99.85935;GentrainScore=0.8336;HW=9.959717E-28;set=omni
1       994391  rs2488991       G       T       .       PASS    AC=1936;AF=0.69341;AN=2792;CR=99.89378;GentrainScore=0.7330;HW=1.1741E-41;set=filterInomni-hm3
1       996184  SNP1-986047     G       A       .       PASS    CR=99.932205;GentrainScore=0.8216;HW=3.8830226E-6;set=omni
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
1       999649  SNP1-989512     G       A       .       PASS    CR=99.93262;GentrainScore=0.7965;HW=4.9767335E-4;set=omni

# contents of intersect.vcf
1       950243  SNP1-940106     A       C       .       PASS    AC=826;AF=0.29993;AN=2754;CR=97.341675;GentrainScore=0.7311;HW=0.15148845;set=Intersection
1       957640  rs6657048       C       T       .       PASS    AC=127;AF=0.04552;AN=2790;CR=99.86667;GentrainScore=0.6806;HW=2.286109E-4;set=Intersection
1       959842  rs2710888       C       T       .       PASS    AC=654;AF=0.23559;AN=2776;CR=99.849;GentrainScore=0.8072;HW=0.17526293;set=Intersection
1       977780  rs2710875       C       T       .       PASS    AC=1989;AF=0.71341;AN=2788;CR=99.89077;GentrainScore=0.7875;HW=2.9912625E-32;set=Intersection
1       985900  SNP1-975763     C       T       .       PASS    AC=182;AF=0.06528;AN=2788;CR=99.79926;GentrainScore=0.8374;HW=0.017794203;set=Intersection
1       987200  SNP1-977063     C       T       .       PASS    AC=1956;AF=0.70007;AN=2794;CR=99.45917;GentrainScore=0.7914;HW=1.413E-42;set=Intersection
1       987670  SNP1-977533     T       G       .       PASS    AC=2485;AF=0.89196;AN=2786;CR=99.51427;GentrainScore=0.7005;HW=0.24214932;set=Intersection
1       990417  rs2465136       T       C       .       PASS    AC=1113;AF=0.40007;AN=2782;CR=99.7599;GentrainScore=0.8750;HW=8.595538E-5;set=Intersection
1       990839  SNP1-980702     C       T       .       PASS    AC=150;AF=0.05384;AN=2786;CR=100.0;GentrainScore=0.7267;HW=0.0027632264;set=Intersection
1       998395  rs7526076       A       G       .       PASS    AC=2234;AF=0.80187;AN=2786;CR=100.0;GentrainScore=0.8758;HW=0.67373306;set=Intersection
Post edited by Geraldine_VdAuwera on

Comments

  • mpviverompvivero Member Posts: 7

    I am trying to use a variety of the GATK variant validation tools including CombineVariants and SelectVariants. However, I have been unable to get either tool to work. I have posted the first few lines of one of my VCFs below:

    CHROM POS ID REF ALT QUAL FILTER INFO

    chr1 762084 . T C . . .
    chr1 762136 . A C . . .
    chr1 762189 . A C . . .
    chr1 762192 . T C . . .
    chr1 762195 . C A . . .
    chr1 762196 . T C . . .

    As you can see, all I care about is the exact chromosomal location (based on the hg19 ref). I want to start by merging two different VCFs using the following code:

    java -jar GenomeAnalysisTK.jar -T CombineVariants -R hg19.fasta -V:ABC,abc.vcf -V:XYZ,xyz.vcf -o merged_abc_xyz.vcf -minimalVCF

    After, I will be extracting variants that intersect both sets using SelectVariants.

    The program completes without error; however, my merged VCF output does not populate with any data beyond the header (i.e. no variants are listed). The program runs for under a second, then completes and shuts off. What could be my issue? Please help. Thanks!!

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,684 admin

    What is the output to the console? Is there an error or other message?

    Geraldine Van der Auwera, PhD

  • CarneiroCarneiro Administrator, Dev Posts: 274 admin

    are abc.vcf and xyz.vcf well formed? How were they generated?

  • mpviverompvivero Member Posts: 7

    Hi Geraldine and Carneiro,

    Thanks for the responses. It turns out, the VCF files were slightly malformed on a couple of lines. They were not generated using a GATK variant caller which is probably why CombineVariants wasn't working. For now, I think I have everything under control after some manual reformatting.

    Will get back to you if anything else comes up.

  • ecyehecyeh Member Posts: 8

    It is a nice tool, thank you.
    Suppose I have a vcf file containing the genotypes of 5 samples, and want to find variants occurring in at least 3 of them. To get the "set=" attribute, do I need to split the vcf file into 5 separate files first, using SelectVariants, and then merge the 5 vcfs again using CombineVariants? Is there a better way to do that?

  • CarneiroCarneiro Administrator, Dev Posts: 274 admin

    @ecyeh use SelectVariants to find the variants on at least 3 of them. If you only have 5 samples, there aren't too many combinations. If you had more samples than that, I'd write a walker to do it.

  • ecyehecyeh Member Posts: 8

    Indeed I have more than 5 samples to compare. Thank you for the clarification. Now I feel so lucky to be a programmer.

  • CarneiroCarneiro Administrator, Dev Posts: 274 admin

    fantastic, go ahead and write a quick RODWalker to solve that one. You can base yourself on SelectVariants.

  • chongmchongm Member Posts: 33

    Hi, I'm combining different samples (each VCF has a "batch" of samples that were processed together and I just wanted to have my variants in one VCF for convenience's sake - i.e. I'm not combining variants in order to merge samples that were spread out on different runs). I was wondering why the PL changes for each genotype when I'm simply putting all my variants in one place.

    Thanks,

    MC

  • weberATillinoiseduweberATillinoisedu Member Posts: 4
    edited October 2013

    I think that there is a typo in Example 8. "-select 'set == ";Intersection";'" should lose that first semi-colon. With the semicolon, I get 0 VCF records produced. When I use "grep" there are 2300+ in my union.vcf data set. After removal, the resulting intersection.vcf file contains the expected number of records. I expect that because ";" is the separator char, it shouldn't be in the pattern. This was tested using v2.7-2 and v2.7-4.

    I hope this helps anyone else who was tripped up by it.

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,684 admin

    Hi @chongm,

    First, be careful when you merge VCFs containing variants that were called separately. If you're doing it to compare results of different calling iterations, evaluate concordance etc, that's perfectly fine. But combining them for convenience is dangerous because if later, you want to filter them, there are some annotations that you shouldn't use to filter them together, because the values are relative, not absolute, and will not be on the same scale between different sets. I'm not saying you shouldn't do it, but if you do it, you should be careful.

    The PLs changing is a known issue that occurs when combining variants with more than one alternate allele. CombineVariants currently does not handle multiallelic variants well.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,684 admin

    @weberATillinoisedu, you're absolutely correct. I will fix the typo in the article. Thanks for pointing it out!

    Geraldine Van der Auwera, PhD

  • chongmchongm Member Posts: 33

    Hi @Geraldine_VdAuwera, okay thanks for the warning. I will filter each VCF separately then. Once all batches are complete though, I'm going to call variants simultaneously for all batches which should result in one VCF. This is the best way to call variants right? Will some less frequent variants be filtered out if I call them using the entire cohort of samples vs. run by run?

    Thanks,

    MC

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,684 admin

    Yes, joint calling followed by VQSR (variant recalibration) is indeed what we recommend. The process will increase your discovery power for difficult variants, but should not negatively affect the calling of rare variants.

    Geraldine Van der Auwera, PhD

  • evakoeevakoe Member Posts: 26

    It would be great if you could extend the -setKey argument so that one can specify not only the key, but also the values. If I combine three VCF files, let's say a.vcf, b.vcf, and c.vcf, the INFO field might look like set=variant-variant1. I assume "variant" corresponds to the first file given with -V and "variant1" to the second file given with -V, but of course to help me still know this in 1 month it would be great if I could specify other strings for "variant" and "variant1". Or to make things easier, you could also just take the file prefixes a, b, and c. Thanks. Eva

  • evakoeevakoe Member Posts: 26

    Actually, looking closer at my new VCF file combined from three VCF files, I now see that the set argument has four different values, even though I merged only three VCF files. These values are variant, variant1, variant2, and variant3.

    I used the option -setKey source and what I see now are: source=variant3, source=variant2, source=variant-variant2, but no source=variant1, source=variant or source=variant-variant1. Is it possible that there is a bug with the naming of the sets when it comes to the combinations?

    Thank you
    Eva

  • ebanksebanks Broad InstituteMember, Administrator, Broadie, Moderator, Dev Posts: 698 admin

    Hi Eva,

    You just need to name your tracks. E.g. "-V:foo first.vcf -V:bar second.vcf", then the set values will be 'foo' and 'bar'.

    Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

  • evakoeevakoe Member Posts: 26

    Hi Eric,
    thank you very much for this info, I did not know about this option. When I name the input VCFs, I really only observe the three values that I specified.

    If you want to take a closer look at why there are four different values when you don't name the inputs I can send you the three input vcfs I used.

  • varshavarsha FloridaMember Posts: 37

    Hi @ebanks, I saw some previous documentation on using -genotypeMergeOptions REQUIRE_UNIQUE and UNIQUIFY and also --filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED --filteredAreUncalled for combining variants but I am not sure which of these is ideal for generating PON vcf with only 1 record for each variant (record positions with the same variant in the same location only once). Please let me know where I can find documentation regarding this. Sorry for the trouble, your help is much appreciated. I have looked for the info in these threads-

    https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_CombineVariants.php
    http://gatkforums.broadinstitute.org/discussion/53/combining-variants-from-different-files-into-one
    http://gatkforums.broadinstitute.org/discussion/4641/build-a-panel-of-normal-for-mutect

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,284 admin

    @varsha
    Hi,

    Are you trying to merge VCFs with the same sample names? -genotypeMergeOptions REQUIRE_UNIQUE makes sure all the sample names are different in the input VCFs. -genotypeMergeOptions UNIQUIFY adds a suffix to any sample names that are the same in the VCFs so you can distinguish them. Both options will create a single record for any site that exists in your VCFs.

    -Sheila

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,684 admin

    It doesn't matter how you choose to merge genotypes because when MuTect uses the PON, it ignores the genotype information and only looks at site-level information.

    Geraldine Van der Auwera, PhD

  • varshavarsha FloridaMember Posts: 37

    Hi @Sheila, @Geraldine_VdAuwera, Thank you for your help. I did give all different sample names for the normal vcfs to be merged. I ended up using --genotypemergeoption UNSORTED and it seemed to work fine. I am currently using this PON.vcf generated to run MuTect on my unpaired samples using --normal_panel PON.vcf. I am waiting to see if this works without any errors, I will get back with any updates/ issues. Thanks again.

  • varshavarsha FloridaMember Posts: 37

    Hi would you be able to specify the filteredrecordsmergetype to be used for combining the vcfs for PON? I am unable to narrow down the exact syntax from the forum posts. Thank you for your help.

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,684 admin

    Just use --filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED

    Geraldine Van der Auwera, PhD

  • varshavarsha FloridaMember Posts: 37

    Hi @Geraldine_VdAuwera, I used both --genotypemergeoption UNSORTED --filteredrecordsmergetype KEEP_IF_ANY_UNFILTERED to generate my PON vcf successfully , thanks again!

  • everestial007everestial007 GreensboroMember Posts: 65

    @Carneiro said:
    @ecyeh use SelectVariants to find the variants on at least 3 of them. If you only have 5 samples, there aren't too many combinations. If you had more samples than that, I'd write a walker to do it.

    @Geraldine_VdAuwera @Sheila
    I am in a similar situation. I have VCFs from 6 different samples and want to select variants that are present in at least 3 of them. I have used CombineVariants (with -minN 3) to merge and retain the variants that are present in at least 3 different samples. The resulting *.vcf file using this trick should be equivalent to the one I would get if I had first merged the variants and then selected the ones present in at least 3 samples, right?
    But I want to know how to use SelectVariants to find the variants that are present in at least 3 samples. I have used -select 'set == "Intersection";' to obtain the variants that are in all 6 samples. But finding the right flag for selecting variants from at least 3 samples has been a problem. I have checked the SelectVariants tool page but didn't find anything useful.

    Thank you in advance,

  • everestial007everestial007 GreensboroMember Posts: 65
    edited December 2015

    After I combine variants (or select variants), I want to use the combined variants.vcf for VQSR. If I want to remove any field and/or info that is not useful for VQSR, what/how can I do it? Also, I am not sure what fields should be retained so the vcf is useful for VQSR analyses.

    To remove or retain the field of interest I came across some complex commands using awk and/or python. Is there any command in GATK or vcf tools which could be useful for the purpose.

    Thank you !

  • everestial007everestial007 GreensboroMember Posts: 65
    edited December 2015

    Looks like the above question had some typos which made the question unclear. I am sorry for that. I have reworded the question again:

    I am in a similar situation. I have vcf from 6 different samples and want to select variants that are present in at least 3 of them. I have used combine variants (with -minN 3) to merge and retain the variants that were present in at least 3 different samples.
    java -Xmx4g -jar GenomeAnalysisTK.jar -T CombineVariants -R lyrata_genome.fa --variant:varMA605 commonVARiantsMA605PRIORITY.vcf --variant:varMA611 commonVARiantsMA611PRIORITY.vcf --variant:varMA622 commonVARiantsMA622PRIORITY.vcf --variant:varMA625 commonVARiantsMA625PRIORITY.vcf --variant:varMA629 commonVARiantsMA629PRIORITY.vcf --variant:varNcm8 commonVARiantsNcm8PRIORITY.vcf -o unionVARiantsMAYODANmin3.vcf -minN 3
    I used this CombineVariants method because I was not able to use SelectVariants to pull the variants that were present in at least 3 samples. I used the following command:
    java -Xmx4g -jar GenomeAnalysisTK.jar -T SelectVariants -R lyrata_genome.fa -V unionVARiantsMAYODAN.vcf -select 'set == "minN 3";' -o commonVARiantsMAYODANminN3.vcf
    which did not work. I know that using -select 'set == "minN 3"' is not appropriate here, but I am not sure how else to do it. Let me know if there is a way.

    But I think the resulting *.vcf file using this trick (CombineVariants with -minN 3) should be equivalent to the one I would get if I had first merged the variants and then selected the ones present in at least 3 samples, right?

    Thanks in advance !

    • Bishwa K.
  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,684 admin

    @everestial007 Yes the method you used is correct and no further manipulation of the file should be necessary.

    I believe SelectVariants has an option to drop non-essential fields of the vcf -- have a look at the tool doc for a complete list.

    Geraldine Van der Auwera, PhD

  • everestial007everestial007 GreensboroMember Posts: 65

    It is surprising to see that the output of GenotypeGVCFs from several samples is often small compared to the g.vcf (for each sample we used). Several of my g.vcf files are above 2 GB but the genotyped VCF from six samples is only 1 GB. Is this normal?

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,684 admin

    This is expected, as the gVCFs include data for many sites that are not called variant, whereas by default the output of GGVCFs only contains sites where a variant has been called. Also, the lines corresponding to each genome position from each gVCF get condensed into a single line in the final output.

    Geraldine Van der Auwera, PhD

  • everestial007everestial007 GreensboroMember Posts: 65

    @Geraldine_VdAuwera said:
    @everestial007 Yes the method you used is correct and no further manipulation of the file should be necessary.

    I believe SelectVariants has an option to drop non-essential fields of the vcf -- have a look at the tool doc for a complete list.

    Hi @Geraldine_VdAuwera
    After looking through the available flags, my thought was that -IDs and -excludeIDs are the appropriate flags to select/exclude the fields of interest. But for some reason I am not finding a proper way of doing it. I created a fileKeep document and typed in the strings (as they appear in the vcf file) for the fields I am interested in. The script runs and outputs a file, but without any information in it except the headers (and there seems to be no selection). So my choice of flag is probably right, but I am not able to find the right way of using it.
    Please let me know if there is somewhere else I should be looking for.

    Thanks,

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,684 admin

    Ah, my bad, it's CombineVariants that has a -minimalVCF argument. And there's a general engine capability called -sites_only that goes even further.

    Geraldine Van der Auwera, PhD

  • everestial007everestial007 GreensboroMember Posts: 65

    @Geraldine_VdAuwera
    I checked and found that -minimalVCF requires a boolean value. I am not sure what value I should give. Providing true/false did not help and I found no example to help myself. Also, I found no -sites_only argument for CombineVariants.
    Could you please provide me with an example that could be helpful.

    Thank you,

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,684 admin

    For booleans, just including the flag (without specifying a value) is sufficient to set it to True.

    -sites_only is an engine capability so it's documented with the rest of the engine arguments, not with any particular tool's.

    Geraldine Van der Auwera, PhD

  • everestial007everestial007 GreensboroMember Posts: 65

    @Geraldine_VdAuwera said:
    For booleans, just including the flag (without specifying a value) is sufficient to set it to True.

    -sites_only is an engine capability so it's documented with the rest of the engine arguments, not with any particular tool's.

    Thank you !!!

  • mglclinicalmglclinical USAMember Posts: 78

    nice post

  • GuillefriisGuillefriis SpainMember Posts: 16

    Hi,

    I'm trying to intersect a GATK-called vcf file with another vcf called with a different genotyper (I've tried samtools and TASSEL so far) to use the intersection as a reliable SNP dataset for BQSR, since I work with a non-model avian genus and there are no other datasets available. I'm having a lot of problems, such as inconsistencies with respect to the reference genome (even though I've used exactly the same one) or a missing vcf.idx (##### ERROR MESSAGE: Problem detecting index type) at the first step, when using the CombineVariants tool. Has anybody tried this before? Suggestions? Thanks a lot.

    Guillermo

  • Geraldine_VdAuweraGeraldine_VdAuwera Administrator, Dev Posts: 10,684 admin

    @Guillefriis You'll need to post the details of each problem separately, as we can't give you a one-size-fits-all solution.

    Geraldine Van der Auwera, PhD

  • GuillefriisGuillefriis SpainMember Posts: 16

    Hi @Geraldine_VdAuwera

    Let's leave aside the intersection with the TASSEL calling dataset.

    I performed a SNPcalling following the GATK best practices workflow till the BQSR step using Genotyping-by-sequencing (GBS) data for around four hundred individuals. Since I work with emberizids, I used the Zebra Finch genome as reference.
    I reproduced the GATK SNPcalling with strict quality thresholds and did an alternative SNPcalling using the mpileup tool from SAMTOOLS in order to recover those SNPs detected by both workflows and use them in the BQSR, assuming they would be more reliable. I'm using the script detailed here:

    "
    module load GATK/3.3-0

    gatk=/gpfs/res_apps/GATK/3.3-0/GenomeAnalysisTK.jar
    genome=~/data/Borja/finch_genome/ZEFI_gatkindex/Taeniopygia_guttata.taeGut3.2.4.dna.toplevel.fa
    samtools_snps=~/scratch/PIPELINE/GBS/Final_Workflow/T_samtools_calling/out_strict.vcf

    java -jar $gatk \
    -T CombineVariants \
    -R $genome \
    -V JUNCOsnps_ZEFI_HQ_intersectBQSR.vcf \
    -V $samtools_snps \
    -o tassel_gatk.vcf

    java -jar $gatk \
    -T SelectVariants \
    -V tassel_gatk.vcf \
    -R $genome \
    -o Intersect_ZEFI_BQSR.vcf \
    -nt 16 \
    -select 'set == "Intersection";'
    "

    However, the CombineVariants tool gives an error somehow related to the index; my guess is that the index for the samtools SNPs vcf is missing:

    "

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 3.3-0-g37228af):
    ERROR
    ERROR This means that one or more arguments or inputs in your command are incorrect.
    ERROR The error message below tells you what is the problem.
    ERROR
    ERROR If the problem is an invalid argument, please check the online documentation guide
    ERROR (or rerun your command with --help) to view allowable command-line arguments for this tool.
    ERROR
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR Please do NOT post this error to the GATK forum unless you have really tried to fix it yourself.
    ERROR
    ERROR MESSAGE: Problem detecting index type
    ERROR ------------------------------------------------------------------------------------------

    "

    I couldn't find a way to produce an index for the file. I also checked related posts (http://gatkforums.broadinstitute.org/gatk/discussion/2283/in-regards-to-intersecting-vcf-files) but couldn't find a solution. Maybe it is not doable? Thanks for your attention, Geraldine.

    Cheers
    Guillermo

  • SheilaSheila Broad InstituteMember, Broadie, Moderator, Dev Posts: 4,284 admin

    @Guillefriis
    Hi Guillermo,

    I suspect the issue is with the non-GATK generated VCF. Can you try running ValidateVariants on your VCFs? If they pass, you can try deleting the VCF indices and GATK will regenerate them for you.

    -Sheila
