Troubleshooting: ERROR - variant files have inconsistent references for the same position.
I am hoping to get some help troubleshooting a frustrating error I am hitting while trying to genotype a large data set. The source data is nearly 12,000 WES samples, which were sequenced by a third-party company, so I am assuming it is worth the money that was spent. I know they followed best practices and used the same reference file for all samples. I have the gVCF files for the entire set, and I have successfully genotyped the entire set of WES intervals, as well as subset the gVCF files for 74 genes and successfully genotyped those. All of this was done with GATK v3.7.
I now have a third set of intervals (SXP) I am trying to process. SelectVariants with this interval set works fine. I create 40 cohort.g.vcf files with roughly 290 samples in each, and this process has worked without any errors in all three use cases.
However, with these SXP cohorts, I get about 2.5% of the way through GenotypeGVCFs and receive this error:
##### ERROR MESSAGE: The provided variant file(s) have inconsistent references for the same position(s) at 1:62732364, A* vs. G*
I identified a single cohort that has this ref anomaly. I looked for it in the individual SXP subset g.vcfs of all the samples in that cohort, but I cannot find a single sample with that reference allele at that position; I have no idea where it comes from. I tried removing that position from the cohort g.vcf. I then receive the same error at a different position, in a different cohort, but I notice that it's technically happening in the same gene as the original error.
I removed that gene from my interval list, re-subset the entire sample set, and made the same cohorts from the modified data; I receive the same error at a different position, in a different cohort, in a different gene.
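For anyone who wants to repeat the per-sample check, here is a rough sketch of the kind of scan described above. It is not a GATK tool. It assumes standard tab-delimited g.vcf files (plain text or gzip/bgzip), and the function names `records_covering` and `ref_bases_by_file` are made up for illustration. It lists every record overlapping a position, including reference blocks that reach it through `END=`, and groups files by the first REF base of any record starting exactly there.

```python
import gzip
from collections import defaultdict


def records_covering(lines, chrom, pos):
    """Yield (start, ref, end) for every VCF record that overlaps chrom:pos."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8 or fields[0] != chrom:
            continue
        start, ref, info = int(fields[1]), fields[3], fields[7]
        # Default span is the REF allele length; gVCF reference blocks
        # override it with an END= entry in the INFO column.
        end = start + len(ref) - 1
        for item in info.split(";"):
            if item.startswith("END="):
                end = int(item[4:])
        if start <= pos <= end:
            yield start, ref, end


def ref_bases_by_file(paths, chrom, pos):
    """Map first REF base -> files with a record starting exactly at pos."""
    seen = defaultdict(list)
    for path in paths:
        opener = gzip.open if path.endswith(".gz") else open
        with opener(path, "rt") as handle:
            for start, ref, _end in records_covering(handle, chrom, pos):
                if start == pos:
                    seen[ref[0].upper()].append(path)
    return dict(seen)
```

If `ref_bases_by_file(sample_paths, "1", 62732364)` comes back with more than one key, the listed files disagree on the reference base. If it comes back with only one key, as in my case, the conflict is presumably being introduced somewhere after the per-sample files.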
I can find no evidence that these data had any sort of inconsistent reference when they were created, and again I have used them successfully a couple of times already, and so have other researchers working with the data files.
I do not understand where these genotypes are coming from. From my understanding, I cannot run ValidateVariants on gVCFs and get anything meaningful. Is there anything else I can do to find the issue, or is there a way to make GenotypeGVCFs move past these error positions? I think the only thing I haven't tried is upgrading to GATK 4.0, but I doubt it will make a difference. Thank you!