
CombineVariants

System (Posts: 226; Administrator)
edited July 2012 in Tool Bulletin

A new tool has been released!

Check out the documentation at CombineVariants.

Comments

  • dpark (Posts: 9; Member)

    Is there a good reason that this tool requires -R ref.fasta? It doesn't seem like it should be necessary.

    Danny Park, PhD -- Broad Institute, IDI, Sabeti Lab

  • Geraldine_VdAuwera (Posts: 6,192; Administrator, GATK Developer)

    I believe this is meant to provide a sanity check -- you wouldn't want to be combining variants called with different references or reference versions.

    Geraldine Van der Auwera, PhD

  • dpark (Posts: 9; Member)

    But the fasta doesn't add any information that doesn't already exist in the REF columns of the input files (except for positions that don't exist in the input files--which we don't care about anyway).

    Danny Park, PhD -- Broad Institute, IDI, Sabeti Lab

  • ebanks (Posts: 683; GATK Developer, mod)

    The reference .dict file (which is automatically loaded with the fasta) does provide the information that Geraldine refers to. By design every GATK tool requires a reference for this reason.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • dpark (Posts: 9; Member)

    If the answer is that it's simply a limitation of the implementation or a cheap way to provide some speedup, that's fine, but I think we can agree that the fasta and dict files are informationally redundant and contain no new data that is not already in the VCFs (for the purposes of CombineVariants). From the point of view of someone running the black box from the command line, I find myself wondering why it needs this extra file (for example, vcftools's merge operation does not ask for a reference fasta, but it's also orders of magnitude slower). The first four columns of all the VCF files provide all the necessary information that Geraldine refers to. I suppose the same question could be asked of VariantEval (though I haven't used that tool enough to know; maybe one of its options might legitimately require information that only a ref.fasta could provide).

    Danny Park, PhD -- Broad Institute, IDI, Sabeti Lab

  • ebanks (Posts: 683; GATK Developer, mod)

    Hmm, no I do not agree that the dict file is "informationally redundant." Just because 2 VCF records have the exact same values for the first 4 columns does not at all mean that they represent the same variation, e.g. if the 2 files are created from different genome builds or even different organisms. The dict file is used to compare against the VCF indexes to confirm genome build concordance. In any case this is really a moot point because every GATK tool always requires a reference file (and we have no plans to change that)!

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT
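
The build-concordance check described in the last comment can be illustrated with a rough Python sketch. This is not GATK's implementation, which the thread does not show. It assumes the contig names and lengths are taken from the @SQ lines of the reference .dict file and from the ##contig lines of a VCF header, and it uses a deliberately naive header parser. The function names, the sample header text, and the UR path are hypothetical.

```python
import re

def parse_dict_contigs(dict_text):
    """Return {contig_name: length} from the @SQ lines of a .dict file's text."""
    contigs = {}
    for line in dict_text.splitlines():
        if not line.startswith("@SQ"):
            continue
        # Each field looks like TAG:value; split on the first colon only.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        contigs[fields["SN"]] = int(fields["LN"])
    return contigs

CONTIG_RE = re.compile(r"^##contig=<(.*)>\s*$")

def parse_vcf_contigs(vcf_header_text):
    """Return {contig_name: length} from ##contig lines that declare a length.

    Naive parsing: assumes no commas inside quoted attribute values.
    """
    contigs = {}
    for line in vcf_header_text.splitlines():
        match = CONTIG_RE.match(line)
        if not match:
            continue
        fields = dict(kv.split("=", 1) for kv in match.group(1).split(",") if "=" in kv)
        if "ID" in fields and "length" in fields:
            contigs[fields["ID"]] = int(fields["length"])
    return contigs

def find_mismatches(ref_contigs, vcf_contigs):
    """List VCF contigs that are missing from, or differ in length from, the reference."""
    problems = []
    for name, length in vcf_contigs.items():
        if name not in ref_contigs:
            problems.append(f"{name}: not in reference dictionary")
        elif ref_contigs[name] != length:
            problems.append(f"{name}: length {length} != reference {ref_contigs[name]}")
    return problems

# Hypothetical inputs: an hg19-style dictionary and a VCF header whose chr1
# length matches GRCh38 instead.
dict_text = (
    "@HD\tVN:1.0\tSO:unsorted\n"
    "@SQ\tSN:chr1\tLN:249250621\tUR:file:/ref/hg19.fasta\n"
    "@SQ\tSN:chr2\tLN:243199373\n"
)
vcf_header_text = (
    "##fileformat=VCFv4.1\n"
    "##contig=<ID=chr1,length=248956422>\n"
    "##contig=<ID=chr2,length=243199373>\n"
    "#CHROM\tPOS\tID\tREF\tALT\n"
)

problems = find_mismatches(parse_dict_contigs(dict_text), parse_vcf_contigs(vcf_header_text))
for problem in problems:
    print(problem)
```

Two records with identical CHROM, POS, REF and ALT values would pass a column-by-column comparison here, yet the differing chr1 length shows that the files were produced against different builds.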
