A new tool has been released!
Check out the documentation at CombineVariants.
Is there a good reason that this tool requires -R ref.fasta? It doesn't seem like it should be necessary.
Danny Park, PhD -- Broad Institute, IDI, Sabeti Lab
I believe this is meant to provide a sanity check -- you wouldn't want to be combining variants called with different references or reference versions.
Geraldine Van der Auwera, PhD
But the fasta doesn't add any information that doesn't already exist in the REF columns of the input files (except for positions that don't exist in the input files--which we don't care about anyway).
The reference .dict file (which is automatically loaded with the fasta) does provide the information that Geraldine refers to. By design every GATK tool requires a reference for this reason.
Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT
If the answer is that it's simply a limitation of the implementation, or a cheap way to gain some speed, that's fine. But I think we can agree that, for the purposes of CombineVariants, the fasta and dict files are informationally redundant: they contain no data that is not already in the VCFs. From the point of view of someone running the black box from the command line, I find myself wondering why it needs this extra file. For example, vcftools' merge operation does not ask for a reference fasta, though it is also orders of magnitude slower. The first four columns of all the VCF files provide all the necessary information that Geraldine refers to. I suppose the same question could be asked of VariantEval, though I haven't used that tool enough to know; maybe one of its options legitimately requires information that only a ref.fasta could provide.
Hmm, no, I do not agree that the dict file is "informationally redundant." Just because two VCF records have exactly the same values in the first four columns does not mean that they represent the same variation, e.g. if the two files were created from different genome builds or even different organisms. The dict file is compared against the VCF indexes to confirm genome build concordance. In any case this is really a moot point, because every GATK tool always requires a reference file (and we have no plans to change that)!
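To make the build-concordance idea concrete, here is a minimal Python sketch of that kind of check. It is not GATK's actual implementation. It assumes the reference `.dict` file carries `@SQ` lines with `SN` (name) and `LN` (length) tags, and that the VCF header carries `##contig=<ID=...,length=...>` lines. The function names `parse_dict_contigs`, `parse_vcf_contigs`, and `find_mismatches` are hypothetical, and the contig lengths in the example are illustrative values.

```python
import re


def parse_dict_contigs(dict_text):
    """Return {contig_name: length} from the @SQ lines of a .dict file."""
    contigs = {}
    for line in dict_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each tab-separated field after "@SQ" looks like TAG:value.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs


def parse_vcf_contigs(vcf_header_text):
    """Return {contig_name: length} from ##contig lines of a VCF header."""
    contigs = {}
    for line in vcf_header_text.splitlines():
        if not line.startswith("##contig=<"):
            continue
        name = re.search(r"ID=([^,>]+)", line)
        length = re.search(r"length=(\d+)", line)
        if name and length:
            contigs[name.group(1)] = int(length.group(1))
    return contigs


def find_mismatches(ref_contigs, vcf_contigs):
    """List contigs in the VCF that are missing from, or differ in length from, the reference."""
    problems = []
    for name, length in vcf_contigs.items():
        if name not in ref_contigs:
            problems.append(f"{name}: not present in reference")
        elif ref_contigs[name] != length:
            problems.append(f"{name}: length {length} != reference {ref_contigs[name]}")
    return problems


# Illustrative inputs: the reference and the VCF disagree on the length of chr1.
ref_dict = "@HD\tVN:1.5\n@SQ\tSN:chr1\tLN:249250621\n@SQ\tSN:chr2\tLN:243199373\n"
vcf_header = (
    "##fileformat=VCFv4.1\n"
    "##contig=<ID=chr1,length=248956422>\n"
    "##contig=<ID=chr2,length=243199373>\n"
)

problems = find_mismatches(parse_dict_contigs(ref_dict), parse_vcf_contigs(vcf_header))
for p in problems:
    print(p)
```

A record such as `chr1  1000  .  A  G` would look identical in two VCFs made against these two different chr1 sequences, which is why the first four columns alone cannot reveal the mismatch while a contig-level comparison can.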