Heads up:
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
We’re moving the GATK website, docs and forum to a new platform. Read the full story and breakdown of key changes on this blog.
Notice:
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
If you happen to see a question you know the answer to, please do chime in and help your fellow community members. We encourage our fourm members to be more involved, jump in and help out your fellow researchers with their questions. GATK forum is a community forum and helping each other with using GATK tools and research is the cornerstone of our success as a genomics research community.We appreciate your help!
Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
Attention:
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!
We will be out of the office for a Broad Institute event from Dec 10th to Dec 11th 2019. We will be back to monitor the GATK forum on Dec 12th 2019. In the meantime we encourage you to help out other community members with their queries.
Thank you for your patience!
CombineVariants in GATK4

Is it planned to add CombineVariants tool into GATK4.0 toolkit (it existed in previous GATK versions)? The only similar tool currently available in GATK4.0 Beta is GatherVCFs which has very limited possibility and cannot concatenate unsorted VCFs or merge different INFO fields correctly.
Thanks!
Best Answers
-
bhanuGandham Cambridge MA admin
Hi @nketchi
The developers got back to me and mentioned that porting CombineVariants to GTAK4 is a work in progress. I have let them know that this is a feature that the users wants and hence will be prioritized.
For now you can use it from GATK3.
I hope this helps.Regards
Bhanu
Answers
@Vladimir_Kovacevic
Hi.
There is a tool called MergeVcfs which you can use instead of CombineVariants. It looks like there is no documentation for it yet, but if you use
--list
with gatk-launch, it lists the available tools. We should have better documentation within the coming months when GATK4 is out of beta.-Sheila
Hi @Sheila!
Thank you for your response and suggestion. We tried MergeVcfs and unfortunately it failed with two VCFs that pass with CombineVariants. Here is the error:
Using GATK jar /GATK/gatk-4.beta.2/gatk-package-4.beta.2-local.jar
Running:
java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=1 -Dsnappy.disable=true -jar /GATK/gatk-4.beta.2/gatk-package-4.beta.2-local.jar MergeVcfs --input reheadered_subset.vcf --input tp.fp.subset.vcf --output output.vcf
12:00:35.965 INFO NativeLibraryLoader - Loading libgkl_compression.dylib from jar:file:/GATK/gatk-4.beta.2/gatk-package-4.beta.2-local.jar!/com/intel/gkl/native/libgkl_compression.dylib
[September 19, 2017 12:00:35 PM CEST] MergeVcfs --input reheadered_subset.vcf --input tp.fp.subset.vcf --output output.vcf --VALIDATION_STRINGENCY STRICT --COMPRESSION_LEVEL 1 --MAX_RECORDS_IN_RAM 500000 --CREATE_INDEX true --CREATE_MD5_FILE false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_inflater false
[September 19, 2017 12:00:35 PM CEST] Executing as [email protected] on Mac OS X 10.12.6 x86_64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_121-b13; Version: 4.beta.2
12:00:41.015 INFO MergeVcfs - HTSJDK Defaults.COMPRESSION_LEVEL : 1
12:00:41.016 INFO MergeVcfs - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
12:00:41.016 INFO MergeVcfs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
12:00:41.016 INFO MergeVcfs - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
12:00:41.016 INFO MergeVcfs - Deflater: IntelDeflater
12:00:41.016 INFO MergeVcfs - Inflater: IntelInflater
12:00:41.016 INFO MergeVcfs - Initializing engine
12:00:41.016 INFO MergeVcfs - Done initializing engine
12:00:41.573 INFO MergeVcfs - Processed 10,000 records. Elapsed time: 00:00:00s. Time for last 10,000: 0s. Last read position: 3:189,995,416
12:00:41.785 INFO MergeVcfs - Processed 20,000 records. Elapsed time: 00:00:00s. Time for last 10,000: 0s. Last read position: 8:132,077,728
12:00:41.957 INFO MergeVcfs - Shutting down engine
[September 19, 2017 12:00:41 PM CEST] org.broadinstitute.hellbender.tools.picard.vcf.MergeVcfs done. Elapsed time: 0.10 minutes.
Runtime.totalMemory()=352845824
java.lang.IllegalStateException: The elements of the input Iterators are not sorted according to the comparator htsjdk.variant.variantcontext.VariantContextComparator
at htsjdk.samtools.util.MergingIterator.next(MergingIterator.java:113)
at org.broadinstitute.hellbender.tools.picard.vcf.MergeVcfs.doWork(MergeVcfs.java:126)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:115)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:170)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgram.instanceMain(PicardCommandLineProgram.java:62)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:131)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:152)
at org.broadinstitute.hellbender.Main.main(Main.java:230)
Do you have any more suggestions?
FYI @teodora_aleksic
@Vladimir_Kovacevic
Hi,
Hmm. Can you confirm the VCFs pass ValidateVariants?
Also, can you try deleting the VCF indices and re-generating them?
Thanks,
Sheila
I am moving to GATK4 and had the same question. It worked for me
Thank you,
Said MM
It appears the MergeVcfs is built on top of Picard (GATK 4.0 does mentions this too). So, if you run into issues with this command go to Picard and look at what they recommend to do (files must be sorted the same, output file has a file type designated (vcf etc). Just skimming the error message above I think the error is caused by the files not being sorted the same.
Here is a similar command using straight Picard
java -jar picard MergeVcfs \
I=sample_8_filtered_raw_SNPs.vcf \
I=filtered_sample_8_raw_indels.vcf \
O=combined_Filtered_Variants_4-4-2018.vcf
While Picard's MergeVcfs can be used to combine variants from VCFs containing the same samples, this covers only one mode (UNION) of two (MERGE, UNION) that CombineVariants offered.
The MERGE mode is more akin to bcftools merge, combining info about variant loci from multiple samples, each contained in their individual VCFs.
Is there a tool in GATK4 that covers the MERGE mode? (I would like to use it to combine results from MuTect for multiple samples.)
Cheers,
@hdetering
Hi Harry,
You should be able to use CombineVariants to merge VCFs from different samples. Is there a reason you are against using it?
-Sheila
@hdetering Apologies for being nosy but implementing a multi-sample mode or workflow is one of our top priorities for Mutect2 in the coming months. What features would this have to have to suit your needs?
Is there a specific reason why CombineVariants is not in GATK 4? I am also interested in a CombineVariants port to GATK 4 (eg. for combining variants of the same sample from different callers for side-by-side comparison of variants), in addition to a multi-sample workflow for Mutect2 (for genotyping the same variant across multiple samples)
@FPBarthel
Hi,
There is MergeVCFs for your first case. For the second (combining different sample VCFs), I think the team discourages this because we have the GVCF workflow and don't recommend merging single sample VCFs.
-Sheila
Hi @Sheila,
Admittedly I have not tested this but from the documentation
MergeVCFs
does not have a feature that allows for variant intersection? Eg. supplyMergeVCFs
with three input VCF files and have it output only variants present in at least 2/3 input files? Something like this? That would be what I am looking for for the first case.The second case would be to take the union of multiple VCF files from different samples, discarding the FORMAT columns, followed by a second step of genotyping the variant set across multiple samples, thus reintroducing the FORMAT column for multiple samples.
Floris
@FPBarthel
Hi Floris,
In this case, you would have to first merge the VCFs, then use SelectVariants to select the sites of interest.
-Sheila
Hi @Sheila,
I am a bit confused, after following the GATK Best Practices and reaching till here https://software.broadinstitute.org/gatk/documentation/article.php?id=2806, obtaining the INDEL.vcf and SNP.vcf from a sample, how should we continue, analyzing each separately or could we combine them using MergeVCFs?
My intentions are annotations and GenotypePosteriori and VariantsToTable.
PS: (I am using WES).
Best,
Marius.
If you want to proceed with
CalculateGenotypePosteriors
you would want to combine the SNP and indel vcfs withMergeVcfs
. If you're usingVariantsToTable
it depends on whether having separate tables for SNPs and indels is more convenient.Thanks, @davidben it worked with MergeVcfs perfectly
Why do we need to merge vcfs? Do we have to or can we keep them separate?
@kmmahan At this point we're talking about analyses downstream of the GATK. Sometimes it will be convenient to keep SNPs and indels together; sometimes it won't. I will say, however, that if the variants do somehow end up as the input to a GATK tool, it would most likely expect a single merged vcf. That's just the usual GATK style.
Hi everyone,
I am faced with a similar problem. Having gotten to the point where I am happy with my vcf file filtering, I would like to use CombineVariants to select only SNPs which occur across all of my populations. Before with gatk3, we did it like this:
gatk --java-options "-Xmx10g" -l INFO -T CombineVariants -R $reference \
--variant $maxmiss_filtered.pop1.recode.vcf \
--variant $maxmiss_filtered.pop2.recode.vcf \
--variant $maxmiss_filtered.pop3.recode.vcf \
--variant $maxmiss_filtered.pop4.recode.vcf \
--variant $maxmiss_filtered.pop5.recode.vcf \
--variant $maxmiss_filtered.pop6.recode.vcf \
--minimumN 6 -o $maxmiss_filtered.merged.vcf
Can someone please tell me which steps I would have to take using mergevcf and selectvariants to achieve a similar result? (using gatk4)
I need a merged vcf across my populations for all downstream analyses!
I am still wondering why this tool was removed? (it was so useful!)
Thanks,
Josie
Hi @nketchi
1) We are discussing with the developers why CombineVariants has not been ported to GATK4 and I will get back to you with an answer.
2)If it helps your case, you can continue using CombineVariants from GATK3 in the mean while.
Regards
Bhanu
Hi @nketchi
The developers got back to me and mentioned that porting CombineVariants to GTAK4 is a work in progress. I have let them know that this is a feature that the users wants and hence will be prioritized.
For now you can use it from GATK3.
I hope this helps.
Regards
Bhanu
Thank you very much, @bhanuGandham !
Hopefully this is done fast... I would like to use CombineVariants in GATK4 as well.
Thanks @bhanuGandham !
Just thought I'd add my vote to bring CombineVariants into GATK4. Will be using GATK3 for this alone until then.
Hi @aschoenr @Vladimir_Kovacevic @nketchi @hdetering @Rosmaninho @FPBarthel
Thank you all for bringing this to our notice, I am trying to see if our team can either port CombineVariants or have those features included in MergeVcfs in GATK4.
What would be very helpful is you could tell me some specific features that are useful to you from CombineVariants that are not possible with MergeVcfs ? This information will be very helpful. Thank you in advance.
Regards
Bhanu
Union of variants from two or more somatic callers (Tumor-Normal VCF), where tumor normal pair are from the same person. If I were you I would would test different set of mostly used, for example Strelka, Vardict, Mutect2, Varscan, Somatic Sniper, Rufus etc.
I did not even use Mergevcfs because I I didn't want to Mergevcfs.
I used CombineVcfs because of the -minN option where it only outputs sites observed in a minimum number of samples from a certain number of samples.
For example, I have a set of 10 samples with a phenotype in common and I want a vcf file with the sites that are present in a minimum of 5 of those 10 samples.
This way I got a vcf with variants that are present in a significant number of my samples and might be significant but I am not being extremely restrictive and asking for variants present in all my samples.
@Rosmaninho, I did not even know about this feature of CombineVariants. Well, what can I say except, it is one hell of a tool!
Well, I was looking specifically for a tool to do this so i found it in CombineVariants...
If you were looking for this I'm sure you would find about it as well. :P
I need CombineVariants to merge two VCFs with priority. Seems this can't be done in MergeVcfs, so I'm adding my vote for GATK4 CombineVariants as well. Thanks so much to the developers.
Hi @bhanuGandham, sorry, I didn't see this until now. I needed to use CombineVariants in my implementation of the Germline short variant discovery workflow. Specifically, I implemented the joint genotyping in parallel, processing each chromosome individually. Since I needed a single VCF for VQSR, I needed to combine all of the VCFs I produced. I tried another method and I ended up with far too many entries (columns) in the resulting VCF. Since the samples were the same (and in the same order) in all of my VCFs, I think I just needed to basically concatenate the files together while maintaining the header, and CombineVariants was the tool I found mentioned on the GATK forums to do this. Please let me know if I can achieve this with another tool in GATK4. Like I said, this is probably the simplest case in which VCFs need to be combined.
Thanks for all the feedback!
@bhanuGandham
I need it also, for example, if I want to combine vcf from freebayes and vcf from vqsr or vcf from streakle2 together, CombineVariants tool will be helpful.
Thank you.
Hi @xiucz We are tracking this issue, and it would be helpful to find out which feature of CombineVariants is needed. In other words, what do you need that MergeVcfs does not currently provide. This information can be used to improve the next version, the GATK3 version was sometimes inconsistent in performance and that is why it was not ported over in its entirety.
Those 2 tools don't even work the same way.
When 2 vcfs from different callers were merged using mergevcfs same positions are not handled as one but CombineVariants can handle this scenario using uniquify option
An alternative could be using bcftools however it messes up with FORMAT fields in the final output when 2 callers have different FORMAT options.
I guess everybody needs this feature to be ported to GATK4 and picard. Maintaining an older GATK version in workflows is really getting confusing sometimes.
Here is an example. One vcf is from HaplotypeCaller and the other is from UnifiedGenotyper from the same sample.
MergeVcfs output for the same position
And here is the output from CombineVariants with uniquify option for the same position
Finally here is the output from bcftools merge
CombineVariants is also not perfect and looks like missing INFO fields that conflict for values however since this step is usually done after individual VCFs are filtered it is not much of a problem in the final output.
This may be viewed as a non-canonical and even non-standard way of handling vcf files for HTS however this particular practice is quite common when you are using multiple callers and try not to loose potentially important variants due to algorithmic differences.
Hi @SkyWarrior
Thank you for the input and we are working on finding a solution for this. Your input is very helpful.
@SkyWarrior Thanks for your comment, and it is really helpful. Let us wait for the new features of this tool.
@bhanuGandham And thank you and GATK team. I think it is more reasonable to combine different vcfs from different callers at vcf-level.
@SkyWarrior, here is an example command, any guidance is greatly appreciated.
@xiucz in your case you may need to use UNIQUIFY option to see all callers' results in a single line for one site.
Hi,
@SkyWarrior thanks for your promotion advice. Strangely, I didn't get any information in the second FORMAT column, the sample.freebayes column only contains "./." .
After regexing the INFO column, I got these set tags:
set=FilteredInAll means union, set=freebayes and set=VQSR means the variant is only in their own dataset. What does the "set=filterInVQSR-freebayes" tag mean?
Thank you.
Your combined result is from a different position. Can you check the proper position that you posted in the bottom?
@SkyWarrior
Sorry, I posted the combined result wrongly , it looks fine now. Thank you.
@SkyWarrior Thank you for helping out with this issue. We appreciate your help.
I was running into this, largely because the Mutect2 documentation that shows up first on Google is the documentation for the GATK 3.8 version. for the current 4.x version, we instead need to be running
gatk CreateSomaticPanelOfNormals
which appears to replace the use ofCombineVariants
for that purpose.Mentioning that here might help others in the future.
Any news on getting
CombineVariants
working in GATK4?@olavur
We will not be porting CombineVariants to GATK4(for now).
@bhanuGandham ,
you said in October 2018:
"The developers got back to me and mentioned that porting CombineVariants to GTAK4 is a work in progress. I have let them know that this is a feature that the users wants and hence will be prioritized.
For now you can use it from GATK3.
I hope this helps.".
What happen? Is the CombineVariants that tough nut to crack?
HI @Vladimir_Kovacevic
We have hit some roadblocks with that. We will let GATK users know when we know more. Currently the team is focused on other goals.
@bhanuGandham
CombineVariants is really useful, we can combine different callers' results together.