VariantsToVCF bug

mforde84 (chicago, Member)

I'm trying to convert an hg19 snp138.txt annotation to VCF. I downloaded the annotation from iGenomes UCSC.

Head of the file looks like:

585 chr1 10019 10020 rs376643643 0 + A A -/A genomic deletion unknown 0 0 near-gene-5 exact 1 1 SSMP, 0
585 chr1 10055 10055 rs373328635 0 + AA AA -/A genomic in-del unknown 0 0 near-gene-5 between 1 ObservedMismatch 1 SSMP, 0 observed-mismatch
585 chr1 10108 10109 rs376007522 0 + A A A/T genomic single unknown 0 0 near-gene-5 exact 1 1 BILGI_BIOE, 0
585 chr1 10138 10139 rs368469931 0 + A A A/T genomic single unknown 0 0 near-gene-5 exact 1 1 BILGI_BIOE, 0
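
For reference, here is a minimal Python sketch of how rows like the ones above could map onto VCF-style fields. It assumes the file is tab-separated and follows the UCSC snp138 table layout (bin, chrom, chromStart, chromEnd, name, score, strand, refNCBI, refUCSC, observed, molType, class, ...) with a 0-based chromStart. It handles only class "single" records on the + strand and skips everything else. The function name `snp_row_to_vcf` is hypothetical, and this is not a description of what VariantsToVCF does internally.

```python
# Leading columns of the UCSC snp138 table (assumed layout; later columns are ignored).
SNP_COLUMNS = [
    "bin", "chrom", "chromStart", "chromEnd", "name", "score",
    "strand", "refNCBI", "refUCSC", "observed", "molType", "class",
]


def snp_row_to_vcf(line):
    """Turn one snp138.txt row into a tab-separated VCF-style record.

    Hypothetical helper: returns None for anything that is not a simple
    single-base substitution on the + strand.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < len(SNP_COLUMNS):
        return None  # malformed or truncated row
    row = dict(zip(SNP_COLUMNS, fields))

    if row["class"] != "single" or row["strand"] != "+":
        return None  # indels and minus-strand records need extra handling

    ref = row["refUCSC"].upper()
    observed = row["observed"].upper().split("/")
    if len(ref) != 1 or ref not in observed:
        return None  # reference allele missing from the observed alleles
    alts = [a for a in observed if a != ref]
    if not alts or any(len(a) != 1 or a not in "ACGT" for a in alts):
        return None

    pos = int(row["chromStart"]) + 1  # 0-based start -> 1-based VCF position
    return "\t".join(
        [row["chrom"], str(pos), row["name"], ref, ",".join(alts), ".", ".", "."]
    )
```

Called on the third row above (with real tabs between fields), this yields a record at chr1 position 10109 with REF A and ALT T; the deletion and in-del rows return None.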

VariantsToVCF outputs:

$ java -Xmx8g -Djava.io.tmpdir=./tmp -jar $gatk -T VariantsToVCF -V:OLDDBSNP snp138.txt -R /glusterfs/users/mforde/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa -o snp138.vcf
INFO 11:23:03,573 HelpFormatter - --------------------------------------------------------------------------------
INFO 11:23:03,580 HelpFormatter - The Genome Analysis Toolkit (GATK) v3.3-0-g37228af, Compiled 2014/10/24 01:07:22
INFO 11:23:03,580 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 11:23:03,581 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 11:23:03,589 HelpFormatter - Program Args: -T VariantsToVCF -V:OLDDBSNP snp138.txt -R /glusterfs/users/mforde/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa -o snp138.vcf
INFO 11:23:03,599 HelpFormatter - Executing as [email protected] on Linux 3.13.0-32-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.7.0_65-b17.
INFO 11:23:03,600 HelpFormatter - Date/Time: 2015/04/08 11:23:03
INFO 11:23:03,601 HelpFormatter - --------------------------------------------------------------------------------
INFO 11:23:03,601 HelpFormatter - --------------------------------------------------------------------------------
INFO 11:23:07,699 GenomeAnalysisEngine - Strictness is SILENT
INFO 11:23:07,904 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 1000
WARN 11:29:32,411 IndexDictionaryUtils - Track /glusterfs/users/mforde/Homo_sapiens/UCSC/hg19/Annotation/Archives/archive-2014-06-02-13-47-56/Variation/snp138.txt doesn't have a sequence dictionary built in, skipping dictionary validation
INFO 11:29:32,448 RMDTrackBuilder - Writing Tribble index to disk for file /glusterfs/users/mforde/Homo_sapiens/UCSC/hg19/Annotation/Archives/archive-2014-06-02-13-47-56/Variation/snp138.txt.idx
INFO 11:29:32,978 GenomeAnalysisEngine - Preparing for traversal
INFO 11:29:33,006 GenomeAnalysisEngine - Done preparing for traversal
INFO 11:29:33,007 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 11:29:33,008 ProgressMeter - | processed | time | per 1M | | total | remaining
INFO 11:29:33,008 ProgressMeter - Location | sites | elapsed | sites | completed | runtime | runtime
INFO 11:29:33,993 ProgressMeter - done 0.0 0.0 s 11.4 d 100.0% 0.0 s 0.0 s
INFO 11:29:33,994 ProgressMeter - Total runtime 0.99 secs, 0.02 min, 0.00 hours

and I end up with empty vcf files.

$ ll
total 22021120
drwxrwxr-x 3 mforde mforde 0 Apr 8 11:29 ./
drwxrwxr-x 5 mforde mforde 0 Jun 2 2014 ../
-rwxrwxr-x 1 mforde mforde 10228584083 Jun 2 2014 snp137.txt*
-rwxrwxr-x 1 mforde mforde 12318134451 Jun 2 2014 snp138.txt*
-rw-rw-r-- 1 mforde mforde 751 Apr 8 11:29 snp138.txt.idx
-rw-rw-r-- 1 mforde mforde 0 Apr 8 11:29 snp138.vcf
-rw-rw-r-- 1 mforde mforde 751 Apr 8 11:29 snp138.vcf.idx
drwxrwxr-x 2 mforde mforde 0 Apr 8 10:47 tmp/

Answers

  • mforde84 (chicago, Member)

    Any suggestions? I'm kinda screwed up to this point. I'm assuming it has to do with the header of the snp annotation file, but I really have no idea how to proceed from here, and I've tried troubleshooting it myself without luck. Could really use the help.

  • mforde84 (chicago, Member)

    Anyone at all?

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Hmm, have you tried not specifying :OLDDBSNP?

    You could also try to find an original vcf version of the file. I can't imagine that there's not someone out there who has one. Are you particularly attached to using v138 of dbsnp? We have other versions (in vcf) in our resource bundle.

  • mforde84 (chicago, Member)

    Yes, when I don't specify --variant:type it errors out. I've tried all of the other types as well; the only one that runs without breaking is OLDDBSNP. The reason I want to use this annotation is that it came bundled with the iGenomes UCSC hg19 package, which came with prebuilt bwa indexes. So for consistency in processing, I was hoping to use the annotation provided. It's a very common annotation file, so I'm just puzzled that it fails outright.

    Either way, I just downloaded NCBI's prebuilt GRCh37.p13 vcf annotation, so hopefully that works.

  • mforde84 (chicago, Member)

    Case in point. I used the new annotation, and GATK says there are no overlapping contigs! My god, this software is beyond frustrating.

    ERROR MESSAGE: Input files /glusterfs/users/mforde/Homo_sapiens/UCSC/hg19/Annotation/Variation/snps.vcf and reference have incompatible contigs: No overlapping contigs found.
    ERROR /glusterfs/users/mforde/Homo_sapiens/UCSC/hg19/Annotation/Variation/snps.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT]
    ERROR reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY]

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Have you tried reading the FAQs about input files? From the doc:

    Important note about human genome reference versions

    If you are using human data, your reads must be aligned to one of the official b3x (e.g. b36, b37) or hg1x (e.g. hg18, hg19) references. The names and order of the contigs in the reference you used must exactly match the canonical ordering of one of the official references. These are defined by historical karyotyping of largest to smallest chromosomes, followed by X, Y, and MT for the b3x references; the order is thus 1, 2, 3, ..., 10, 11, 12, ... 20, 21, 22, X, Y, MT. The hg1x references differ in that the chromosome names are prefixed with "chr" and chrM appears first instead of last.

    You are trying to use an hg19 resource file with a b37 reference build.

    The really frustrating thing here is the parallel existence of subtly different reference builds, which has nothing to do with our software.
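
The naming difference described above can be sketched in Python as follows. This is an illustration only: the two contig lists are copied from the error message earlier in the thread, and the helper names `overlapping_contigs` and `b37_to_hg19_name` are hypothetical. It assumes the primary chromosome sequences match between the two builds, which should be verified before relying on a rename. The mitochondrial contig is deliberately not mapped, because hg19's chrM and b37's MT are different sequences.

```python
# Contig names as reported in the GATK error message above.
B37_CONTIGS = [str(n) for n in range(1, 23)] + ["X", "Y", "MT"]
HG19_CONTIGS = ["chrM"] + ["chr" + str(n) for n in range(1, 23)] + ["chrX", "chrY"]


def overlapping_contigs(first, second):
    """Return the names from `first` that also appear in `second`, in order."""
    second_names = set(second)
    return [name for name in first if name in second_names]


def b37_to_hg19_name(name):
    """Hypothetical helper: map a b37-style contig name to its hg19-style name.

    Returns None for MT, since the mitochondrial sequences differ between
    the builds and a plain rename would be wrong there.
    """
    if name == "MT":
        return None
    return "chr" + name


# No names are shared as-is, which is what the error message reports.
shared = overlapping_contigs(B37_CONTIGS, HG19_CONTIGS)

# After renaming, every contig except the mitochondrial one lines up.
renamed = [b37_to_hg19_name(name) for name in B37_CONTIGS]
shared_after_rename = overlapping_contigs(
    [name for name in renamed if name is not None], HG19_CONTIGS
)
```

Note that renaming only addresses the names; the contig order in a renamed file would still need to follow the hg19 reference.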

  • mforde84 (chicago, Member)

    Agreed. But couldn't this issue be avoided if I could convert the snp annotations provided with hg19 to vcf? Any idea what's going on with VariantsToVCF in this instance? Like I said, this specific annotation is very common.

  • mforde84 (chicago, Member)

    Really, the prospect of realigning 3000 exomes doesn't sound very appealing to me.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Actually I got it wrong, you are trying to use a b37 resource file with an hg19 reference build. Well, same difference I guess.

    No need to realign your exomes. You can either lift over your dbsnp file, or get the hg19 version of dbsnp from our resource bundle if you're not married to a specific version of dbsnp. All the info you need to do this is in our documentation.

  • mforde84 (chicago, Member)

    Yeah, I'm trying out the hg19.vcf file provided by Broad. If that doesn't work, I'll try liftover. Crosses fingers.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie admin)

    Good luck, keep cool, and don't smash in your monitor if it fails. We'll help you get there.
