We have a pile of full-genome sequences from fish, and we're trying to call variants so that we can look for P-G associations. Given the genetic distance between some of our samples, we're trying to genotype in two ways: using HaplotypeCaller alone (genotype & call), and using HaplotypeCaller followed by GenotypeGVCFs. We hope that comparing the two approaches will help us work out which is doing the better job.
My problem is that I have to parallelise in order to get my jobs to run on a shared computing cluster. I've been doing this by running HaplotypeCaller & GenotypeGVCFs on ~1 Mbp 'slices' of the genome, specified using the '-L' intervals flag, then 'sticking' together a full-genome VCF file using the CatVariants tool. However, I'm seeing error messages like the one below when I then try to use the VariantsToBinaryPed tool to convert our VCFs for analysis using PLINK.
##### ERROR MESSAGE: The provided VCF file is malformed at approximately line number 24252683: there are 62 genotypes while the header requires that 63 genotypes be present for all records at Scaffold167:15
I can track down the 'slice' on which the genotype is missing, remake that file, then re-run CatVariants to rebuild the big VCF, but then I get another similar error when I get back to the VariantsToBinaryPed step. Since slicing the genome gives me ~2000 VCF files to concatenate with CatVariants, and I've been iterating this error-discovery process for a few days now, I'd really like to be able to programmatically discover errors in the VCFs. However, the ValidateVariants tool does not seem to catch them (I suspect that it might not be able to, since we have no a priori 'dbsnp' file to provide).
Can anyone suggest another way to flag these errors, or perhaps to make VariantsToBinaryPed more error-tolerant?
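For reference, the kind of check I have in mind could be sketched roughly as below. This is a minimal, untested-on-real-data sketch in plain Python, not a GATK feature: the function names (`open_vcf`, `find_bad_records`) are my own, and it assumes standard tab-delimited VCF with nine fixed columns (CHROM through FORMAT) before the sample columns. It only compares the number of genotype columns in each record against the `#CHROM` header line, which is the mismatch the error message above reports.

```python
import gzip
import sys


def open_vcf(path):
    """Open a plain-text or gzip-compressed VCF for reading as text."""
    if path.endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def find_bad_records(lines):
    """Yield (line_number, chrom, pos, found, expected) for every record
    whose number of genotype columns differs from the #CHROM header."""
    n_fixed = 9  # CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT
    expected = None
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # The header line defines how many samples each record needs.
            expected = len(fields) - n_fixed
            continue
        if expected is None:
            raise ValueError(
                "record found before #CHROM header at line %d" % line_number
            )
        found = len(fields) - n_fixed
        if found != expected:
            chrom = fields[0]
            pos = fields[1] if len(fields) > 1 else "?"
            yield (line_number, chrom, pos, found, expected)


if __name__ == "__main__":
    # Usage: python check_vcf_columns.py slice_0001.vcf slice_0002.vcf ...
    for path in sys.argv[1:]:
        with open_vcf(path) as handle:
            for line_number, chrom, pos, found, expected in find_bad_records(handle):
                print(
                    "%s: line %d (%s:%s) has %d genotypes, header requires %d"
                    % (path, line_number, chrom, pos, found, expected)
                )
```

Running something like this over each per-slice VCF before concatenating would list every short record in one pass, rather than finding them one at a time at the VariantsToBinaryPed step.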