Picard MergeVcfs vs GATK CatVariants

I've noticed the WDL scripts are using Picard's MergeVcfs immediately following a scattered GATK HaplotypeCaller to do what CatVariants was seemingly designed for. Is there any benefit to using one over the other? In the tests I've run with the same compute node, inputs, and resource allocation, Picard MergeVcfs performs only negligibly slower than CatVariants.
Is there a reason for this shift, or is it just personal preference by the coder in question?
Does Picard's MergeVcfs not suffer from the same file IO issues CatVariants has when working on network shares?
Does CatVariants have some other issue I'm unaware of?
Picard MergeVcfs:
Merges multiple VCF or BCF files into one VCF file. Input files must be sorted by their contigs and, within contigs, by start position. The input files must have the same sample and contig lists. An index file is created and a sequence dictionary is required by default.
vs
GATK CatVariants:
The main purpose of this tool is to speed up the gather function when using scatter-gather parallelization. This tool concatenates the scattered output VCF files. It assumes that:
All the input VCFs (or BCFs) contain the same samples in the same order.
The variants in each input file are from non-overlapping (scattered) intervals.
When the input files are already sorted based on the intervals start positions, use -assumeSorted.
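Both descriptions rest on the same ordering assumptions, so it can help to check the scattered shards before merging. The following is only a minimal sketch: the function name `check_shards`, the `(contig, start, end)` tuple layout, and the 1-based inclusive coordinates are assumptions made for illustration and are not part of either tool.

```python
from typing import List, Tuple

# (contig, start, end) of each scattered shard, in the order the files
# would be passed to the merge tool. 1-based inclusive coordinates assumed.
Interval = Tuple[str, int, int]


def check_shards(intervals: List[Interval], contig_order: List[str]) -> bool:
    """Return True if shards are sorted by contig, then start, and never overlap."""
    # Rank contigs by their position in the sequence dictionary order.
    rank = {name: i for i, name in enumerate(contig_order)}
    keyed = [(rank[contig], start, end) for contig, start, end in intervals]
    for (c1, s1, e1), (c2, s2, _e2) in zip(keyed, keyed[1:]):
        if (c2, s2) < (c1, s1):
            return False  # shard is out of order
        if c1 == c2 and s2 <= e1:
            return False  # shard overlaps the previous one
    return True
```

If a check like this passes, the shards can be handed to the merge step in list order; if it fails, sorting or re-scattering is needed first.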
Best Answer
KateN Cambridge, MA admin
Historically, we have used Picard over GATK as the go-to toolkit for utility tools, such as merging VCFs. As we are moving towards joining GATK and Picard into one toolkit, there are inevitably overlaps like this where two tools appear to accomplish the same function.
However, the fastest answer is to use neither of the two tools you've mentioned. In the case of our recently published public WDL pipeline, the authors chose to use Picard's MergeVcfs because it supports index creation when running on gzipped files. A faster way to accomplish the same thing would be to run GatherVcfs, then run Tabix afterward to create an index.
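As a rough illustration of the GatherVcfs-then-Tabix approach, the sketch below only builds the two command lines and does not run them. The helper name `gather_commands`, the jar path, and the file names are hypothetical. The `I=`/`O=` arguments use Picard's legacy KEY=value syntax, and newer Picard releases may expect `-I`/`-O` instead, so check this against your installed version. GatherVcfs expects the shards to be listed in genomic order.

```python
import shlex
from typing import List


def gather_commands(shards: List[str], output: str,
                    picard_jar: str = "picard.jar") -> List[List[str]]:
    """Build (but do not run) a GatherVcfs command followed by a tabix command.

    shards must already be in genomic order; picard_jar is a hypothetical path.
    """
    gather = ["java", "-jar", picard_jar, "GatherVcfs"]
    gather += [f"I={shard}" for shard in shards]  # one I= argument per shard
    gather.append(f"O={output}")
    # tabix -p vcf writes a .tbi index next to the bgzipped output.
    index = ["tabix", "-p", "vcf", output]
    return [gather, index]


if __name__ == "__main__":
    # Hypothetical shard names, printed as shell-ready command lines.
    for cmd in gather_commands(["shard_0.vcf.gz", "shard_1.vcf.gz"], "merged.vcf.gz"):
        print(shlex.join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.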
Answers
Thanks! I'll take a look at Picard's GatherVcfs. If it runs at a similar speed to samtools cat, then that'll be a nice way to cut down the 2 hours or so that CatVariants takes.