Merge VCF Files


BACKGROUND: I am working with a public data set that consists of VCF files (I cannot go back upstream in the process). The files are broken out by patient sample. The 0/0 calls, which list NON_REF as the ALT, are further broken out by chromosome. The variant calls (0/1, 1/1, and so forth) are in a separate VCF file for each patient that covers the entire genome. I concatenated all the files for each patient, so in each patient's file the ALT for a 0/0 call is NON_REF and the ALT for a variant call is always an actual allele, such as "G" or "TT". Now I wish to merge my 5000 patient samples into a single VCF file.

1. I went back to an older version of GATK (3.5), ran CombineVariants, and got this error:

ERROR MESSAGE: CombineVariants should not be used to merge gVCFs produced by the HaplotypeCaller; use CombineGVCFs instead

2. I also tried GATK4 with CombineGVCFs and got this error:

ERROR MESSAGE: The list of input alleles must contain <NON_REF> as an allele but that is not the case at position 15274; please use the Haplotype Caller with gVCF output to generate appropriate records
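
To show how many records are affected, here is a minimal Python sketch that counts data records by whether their ALT column includes the NON_REF symbol. It rests on several assumptions: the input is uncompressed, tab-delimited VCF text; the symbol is written as `<NON_REF>` in the files (it is passed as a parameter in case it is not); and the function name `count_alt_kinds` and the example records are made up for illustration, not taken from my data.

```python
from collections import Counter


def count_alt_kinds(lines, non_ref="<NON_REF>"):
    """Count VCF data records by whether ALT includes the non_ref symbol.

    `lines` is any iterable of VCF text lines (e.g. an open file handle).
    `non_ref` is an assumption about how the symbolic allele is spelled.
    """
    counts = Counter()
    for line in lines:
        # Skip blank lines and header lines (## meta lines and the #CHROM line).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # not a well-formed data record; ignore it
        alts = fields[4].split(",")  # ALT is the 5th column, comma-separated
        if non_ref in alts:
            counts["with_non_ref"] += 1
        else:
            counts["without_non_ref"] += 1
    return counts


# Made-up records: one reference block and one variant call without NON_REF.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n",
    "1\t15200\t.\tA\t<NON_REF>\t.\t.\tEND=15273\tGT\t0/0\n",
    "1\t15274\t.\tA\tG\t50\t.\t.\tGT\t0/1\n",
]
print(count_alt_kinds(example))
```

On a real file this would be called as `count_alt_kinds(open("patient.vcf"))`, where the file name is a placeholder. Every record counted under `without_non_ref` is one that would presumably trigger the CombineGVCFs error above.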

QUESTION: How do I solve this and merge my files? Is there a VCF merge function that can handle a mix of records, some listing NON_REF as the ALT and others listing an actual allele?

P.S. Bcftools will not let me do this. vcf-tools merge does handle it, but it is very slow. I am hoping to use GATK.

Jim Kozubek
