Merge VCF Files


BACKGROUND: I am working with a public data set that consists of VCF files (I cannot go back upstream in the process). The VCF files are split by patient sample. The 0/0 calls, which list NON_REF as the ALT, are split further by chromosome, while the variant calls (0/1, 1/1, and so forth) are in a separate genome-wide VCF file for each patient. I concatenated all the files for each patient, so within each patient's file the ALT for a 0/0 call is NON_REF and the ALT for a variant call is always an actual allele, such as "G" or "TT". Now I want to merge my 5000 patient samples into a single VCF file.

1. I went back to an older version of GATK (3.5), used CombineVariants, and got this error:

ERROR MESSAGE: CombineVariants should not be used to merge gVCFs produced by the HaplotypeCaller; use CombineGVCFs instead

2. I also tried GATK4 with CombineGVCFs and got this error:

ERROR MESSAGE: The list of input alleles must contain <NON_REF> as an allele but that is not the case at position 15274; please use the Haplotype Caller with gVCF output to generate appropriate records
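
For anyone wanting to see which records trigger the CombineGVCFs message, here is a minimal Python sketch that lists the records whose ALT column does not include <NON_REF>. It assumes plain-text, tab-delimited VCF lines. The function name `records_missing_non_ref` and the sample records are made up for illustration; only position 15274 is taken from the error above.

```python
def records_missing_non_ref(lines):
    """Yield (chrom, pos, alt) for VCF data lines whose ALT lacks <NON_REF>.

    `lines` is any iterable of text lines, e.g. an open uncompressed VCF file.
    """
    for line in lines:
        # Skip blank lines and header lines (## meta lines and the #CHROM line).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a complete VCF record; ignore it
        chrom, pos, alt = fields[0], fields[1], fields[4]
        # ALT can hold several comma-separated alleles, e.g. "G,<NON_REF>".
        if "<NON_REF>" not in alt.split(","):
            yield chrom, int(pos), alt


# Made-up sample: one reference block and one plain variant call.
sample = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n",
    "1\t15000\t.\tA\t<NON_REF>\t.\t.\tEND=15273\tGT\t0/0\n",
    "1\t15274\t.\tA\tG\t50\t.\t.\tGT\t0/1\n",
]

for chrom, pos, alt in records_missing_non_ref(sample):
    print(f"{chrom}\t{pos}\t{alt}")
```

On a real file, passing an open file handle in place of `sample` would show how many records in each per-patient file have this form. This only diagnoses the mismatch; it does not make the files valid gVCFs.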

QUESTION: How do I solve this and merge my files? Is there a VCF merge function that can handle a mix of calls that sometimes list NON_REF as the ALT and sometimes list an actual value for ALT?

P.S. bcftools will not let me do this. VCFtools merge does handle it, but it is very slow, and I am hoping to use GATK.

Jim Kozubek
