
Problem with validating resource bundle

I have been having trouble with the VCF files generated by GATK: when I run ValidateVariants I get either an error or a warning saying that the reference allele does not match. I went back and re-downloaded the resource bundle from the Broad, ran ValidateVariants on the dbSNP VCF, and got an error that "the REF allele is incorrect for the record at position 1:10054, fasta says CTA vs. VCF says CAA". I think the trouble I have been having is due to a discrepancy between the various VCF files within the bundle.

Answers

  • Sheila, Broad Institute (Member, Broadie)

    @cooperjam
    Hi,

    What version of GATK are you using and can you tell us the exact command you ran?

    Thanks
    Sheila

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Did you validate the dbsnp against the reference from the bundle, or another reference?

  • cooperjam, NIH (Member)

    java -Xmx10g -Djava.io.tmpdir=/scratch/cooperjam/temp -jar $GATK_HOME/GenomeAnalysisTK.jar -T ValidateVariants -V /data/cooperjam/directGATKresource/b37/dbsnp_138.b37.vcf -R /data/cooperjam/directGATKresource/b37/human_g1k_v37.fasta --dbsnp /data/cooperjam/directGATKresource/b37/dbsnp_138.b37.vcf

    The reference and dbsnp were both from the bundle that I downloaded this morning from the Broad server.

    The error is ##### ERROR MESSAGE: File /data/cooperjam/directGATKresource/b37/dbsnp_138.b37.vcf fails strict validation: the REF allele is incorrect for the record at position 1:10054, fasta says CTA vs. VCF says CAA

    This error was with version 3.5, but I also tried 3.4 and 3.3 and got the same error.

    Issue · Github, by Sheila: Issue Number 442, State open

  • cooperjam, NIH (Member)

    The other thing to mention is that the original reason I was doing this is that I have been having a problem getting VariantRecalibrator/ApplyRecalibration to work properly. Neither tool reports an error, but the SNP and indel VCFs that they output are truncated. When I look at the log files, it looks like only a small fraction of the genome is being covered. For example, this is part of the ApplyRecalibration log for indels:

    INFO 07:19:44,729 ApplyRecalibration - Keeping all variants in tranche Tranche ts=99.90 minVQSLod=-7.1423 known=(34466 @ 0.0000) novel=(10619 @ 0.0000) truthSites(18420 accessible, 18401 called), name=VQSRTrancheINDEL99.00to99.90]
    INFO 07:20:14,708 ProgressMeter - 12:83081593 310198.0 30.0 s 96.0 s 65.6% 45.0 s 15.0 s
    INFO 07:20:30,204 ProgressMeter - done 498594.0 45.0 s 91.0 s 98.1% 45.0 s 0.0 s

  • cooperjam, NIH (Member)

    Sorry to bring this up again, but have you had a chance to run ValidateVariants on the resource bundle dbsnp and reference to see whether there is a discrepancy between the two?

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    We have, and you were right. I'm not sure what happened - sounds like a liftover gone wrong. I'll look into it.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    @cooperjam Can you please open a separate discussion for your VQSR issue? It sounds unrelated to the dbsnp file problem, and since it's going to take us a while before we can look at that in detail, I don't want to leave you blocked at VQSR.

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    We discussed the dbsnp issue today. We did confirm that there are indeed positions where the dbsnp file has the wrong REF allele. This is a legacy file, so we're unsure how the problem arose, but we suspect it was through an incomplete liftover from b36 where those sites should have been pruned out but were not. We'll try to figure out whether that is actually the case by spot-checking whether those sites exist in the "official" dbsnp. In the meantime, we think the problem is probably harmless and can be safely ignored, because we have been using this file for years without any errors. But you could also choose to get a clean dbsnp file from the dbsnp database if you want to be absolutely sure (which is the right thing to do scientifically, of course).

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    @cooperjam We confirmed that the sites that ValidateVariants is complaining about are indeed sites that should have been dropped. If you look up the sites at dbsnp you get the following:

    rs376643643 was deleted on Feb 11, 2015 due to mapping or clustering errors. The submitted snp(ss) listed below were removed from the Reference SNP (rs) cluster. We will reevaluate the mapping positions for these ss in future builds and either assign them to existing RS or to a new RS.
    

    We don't have the time right now to remove the sites from the file but we have determined that these errors can be safely ignored.
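
For readers who want to see which records are affected in their own copy of a VCF, the kind of check discussed in this thread (REF allele versus the reference FASTA) can be sketched in a few lines of Python. This is a minimal sketch and not GATK's implementation. The function names (`load_fasta`, `ref_mismatch`, `find_ref_mismatches`, `drop_mismatched_records`) are hypothetical, the toy contig and records are made up for illustration, and the whole FASTA is held in memory, so it suits small test files rather than a full human reference. It compares only the REF column and does none of the other checks that strict validation performs.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Mismatch = Tuple[str, int, str, str]  # (chrom, pos, vcf_ref, fasta_ref)


def load_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA text into {contig: sequence}. In-memory, small files only."""
    parts: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Contig name is the first word after '>'
            name = line[1:].split()[0]
            parts[name] = []
        elif name is not None:
            parts[name].append(line.upper())
    return {contig: "".join(chunks) for contig, chunks in parts.items()}


def ref_mismatch(record: str, reference: Dict[str, str]) -> Optional[Mismatch]:
    """Return mismatch details if the record's REF disagrees with the FASTA."""
    fields = record.rstrip("\n").split("\t")
    chrom, pos, vcf_ref = fields[0], int(fields[1]), fields[3].upper()
    start = pos - 1  # VCF POS is 1-based; Python slices are 0-based
    fasta_ref = reference.get(chrom, "")[start:start + len(vcf_ref)]
    if fasta_ref != vcf_ref:
        return chrom, pos, vcf_ref, fasta_ref
    return None


def find_ref_mismatches(vcf_lines: Iterable[str],
                        reference: Dict[str, str]) -> Iterator[Mismatch]:
    """Yield one tuple per data record whose REF does not match the FASTA."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        mismatch = ref_mismatch(line, reference)
        if mismatch is not None:
            yield mismatch


def drop_mismatched_records(vcf_lines: Iterable[str],
                            reference: Dict[str, str]) -> Iterator[str]:
    """Yield header lines and only those records whose REF matches the FASTA."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            yield line
        elif ref_mismatch(line, reference) is None:
            yield line


if __name__ == "__main__":
    # Toy data (made up): a 9-base contig and two records, one with a bad REF.
    toy_reference = load_fasta([">1 toy contig", "ACGT", "CTAGG"])
    toy_vcf = [
        "##fileformat=VCFv4.1\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "1\t2\t.\tC\tT\t.\t.\t.\n",
        "1\t5\t.\tCAA\tC\t.\t.\t.\n",
    ]
    for chrom, pos, vcf_ref, fasta_ref in find_ref_mismatches(toy_vcf, toy_reference):
        print(f"{chrom}:{pos} fasta says {fasta_ref} vs. VCF says {vcf_ref}")
```

On the toy data, the record at position 5 is reported because the contig has CTA there while the record says CAA. `drop_mismatched_records` would pass through the header lines and the first record only. For a real bundle-sized reference, an indexed FASTA reader would be the more practical choice than loading everything into a dictionary.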
