Troubleshooting: ERROR - variant files have inconsistent references for the same position.

bwubbbwubb Member ✭✭
edited January 31 in Ask the GATK team

Greetings,
I am hoping to get some help troubleshooting a frustrating error I am having trying to genotype a large set of data. The source data is nearly 12000 WES samples, which were sequenced by a 3rd party company, so I am assuming it is worth the money that was spent . I know they followed best practices and used the same reference file for all samples. I have the gvcf files for the entire set, and I have successfully genotyped the entire WES intervals, as well as subset the gvcf files for 74 genes and successfully genotyped those. All of this with GATK v3.7.

I now have a third set of intervals (SXP) I am trying to process. SelectVariants with this interval set works fine. I create 40 cohort.g.vcf files with roughly ~290 samples in each, and this process has worked without any errors in all three use cases.

However, now with these SXP cohorts, I get about 2.5% through GenotypeGVCF and will receive an error

##### ERROR MESSAGE: The provided variant file(s) have inconsistent references for the same position(s) at 1:62732364, A* vs. G*

I identify that a single cohort has this ref anomaly. I looked for it in the individual SXP subset g.vcfs of all the samples in that cohort, but cannot find a single sample with that position as such; I have no idea where it comes from. I tried removing that position from the cohort g.vcf. I receive the same error, at a different position, in a different cohort, but I notice that its technically happening in the same gene as the original error.

I removed that gene from my interval list, re-subset the entire sample set and made the same cohorts from the modified data; receive the same error, at a different position, in a different cohort, in a different gene.

I can find no evidence that these data had any sort of inconsistent reference when they were created, and again I have used them successfully a couple of times already, and so have other researchers working with the data files.

I do not understand where these genotypes are coming from. From my understanding I can not run ValidateVariants on gvcfs and get anything meaningful. Is there anything else I can be doing to find the issue or is there a way to GenotypeGVCFs move passed these error positions? I think they only thing I havent tried is upgrading to GATK 4.0, but I am dubious it will make a difference. Thank you!

-bwubb

Answers

  • bwubbbwubb Member ✭✭

    Further confusion. I tried looking at the position of the error from removing the originally "troublesome" gene, and I do not see the inconsistent reference as described.

    ##### ERROR MESSAGE: The provided variant file(s) have inconsistent references for the same position(s) at 1:27087991, C* vs. A*

    So I went looking.

    for i in data/cohort_work/*.vcf; do echo $i; awk '{ print $1,$2,$4,$5 }' $i | grep 27087991; done
    data/cohort_work/cohort.10.g.vcf
    1 27087991 C <NON_REF>
    data/cohort_work/cohort.11.g.vcf
    1 27087991 C <NON_REF>
    data/cohort_work/cohort.12.g.vcf
    1 27087991 C <NON_REF>
    data/cohort_work/cohort.13.g.vcf
    data/cohort_work/cohort.14.g.vcf
    data/cohort_work/cohort.15.g.vcf
    1 27087991 C <NON_REF>
    data/cohort_work/cohort.16.g.vcf
    data/cohort_work/cohort.17.g.vcf
    data/cohort_work/cohort.18.g.vcf
    1 27087991 C *,<NON_REF>
    data/cohort_work/cohort.19.g.vcf
    data/cohort_work/cohort.1.g.vcf
    1 27087991 C <NON_REF>
    data/cohort_work/cohort.20.g.vcf
    data/cohort_work/cohort.21.g.vcf
    data/cohort_work/cohort.22.g.vcf
    data/cohort_work/cohort.23.g.vcf
    data/cohort_work/cohort.24.g.vcf
    data/cohort_work/cohort.25.g.vcf
    1 27087991 C <NON_REF>
    data/cohort_work/cohort.26.g.vcf
    1 27087991 C <NON_REF>
    data/cohort_work/cohort.27.g.vcf
    1 27087991 C <NON_REF>
    data/cohort_work/cohort.28.g.vcf
    1 27087991 C <NON_REF>
    data/cohort_work/cohort.29.g.vcf
    data/cohort_work/cohort.2.g.vcf
    1 27087991 C <NON_REF>
    data/cohort_work/cohort.30.g.vcf
    data/cohort_work/cohort.31.g.vcf
    data/cohort_work/cohort.32.g.vcf
    data/cohort_work/cohort.33.g.vcf
    1 27087991 C <NON_REF>
    data/cohort_work/cohort.34.g.vcf
    data/cohort_work/cohort.35.g.vcf
    data/cohort_work/cohort.36.g.vcf
    1 27087991 C <NON_REF>
    data/cohort_work/cohort.37.g.vcf
    data/cohort_work/cohort.38.g.vcf
    data/cohort_work/cohort.39.g.vcf
    data/cohort_work/cohort.3.g.vcf
    data/cohort_work/cohort.40.g.vcf
    1 27087991 C <NON_REF>
    data/cohort_work/cohort.4.g.vcf
    1 27087991 C <NON_REF>
    data/cohort_work/cohort.5.g.vcf
    data/cohort_work/cohort.6.g.vcf
    data/cohort_work/cohort.7.g.vcf
    data/cohort_work/cohort.8.g.vcf
    data/cohort_work/cohort.9.g.vcf
    1 27087991 C <NON_REF>
    data/cohort_work/original_cohort.36.g.vcf
    1 27087991 C <NON_REF>
    

    So I do not see an 'A' in any of those, so I am not sure which variant file is actually inconsistent? The only one that looks different is 'data/cohort_work/cohort.18.g.vcf' with '1 27087991 C *,'

    I am not sure what else I can be doing to look for the cause of this issue.

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin
    edited February 2

    @bwubb
    Hi bwubb,

    It may just be the case that you are trying to split the GVCFs at unnatural breakpoints in the chromosome. You say GenotypeGVCFs works well on exome intervals. Is it possible to run on each chromosome, then subset to the intervals you are interested in? That is what we recommend.

    Also, have a look at this thread for more information.

    -Sheila

  • bwubbbwubb Member ✭✭

    @Sheila
    Hi Sheila,

    What would be considered a natural breakpoint? I took my target regions and expanded them to be the whole gene regions for a set genes. Well defined human genes. At 30.7% I receive the same type of error (new cohort, new position on a "new" chromosome). While this technically an improvement in one regard, I still cant understand why this is happening for this particular use case.

    Where do the ref/alt from combineGVCFs come from? Shouldnt at least one sample in a cohort have the offending REF allele somehow?

    In regards to previous uses cases I was able to take a subset of ~3000 sample g.vcfs and combine them with WES exomes of mine for joint-calling. I was also able to take this full sample set (same Im trying to do now) and make calls in a set of 74 gene regions. Have I been lucky twice or just unlucky once?

  • SheilaSheila Broad InstituteMember, Broadie, Moderator admin

    @bwubb
    Hi,

    Natural breakpoints would be at the ends of chromosomes :smile: or at stretches of Ns in the genome. They could also be in messy regions where we know we can never make confident calls.

    I think this issue can be solved by running on appropriate intervals then subsetting to the intervals that you are interested in. Please do have a look at that thread I linked to before. It contains some useful information. Let us know if that works, and the error message goes away.

    Thanks,
    Sheila

Sign In or Register to comment.