Test-drive the GATK tools and Best Practices pipelines on Terra


Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.

VCFHeaderLine error in UnifiedGenotyper

williarjwilliarj University of TorontoMember

Hi,

I'm trying to run UnifiedGenotyper on some data, and I'm getting an error I can't make heads nor tails of. It seems to be complaining that I have an equal sign in the sample name, which as far as I can tell I do not. Bellow is the command I've executed, the stack trace, and part of the header from the sam file I'm trying to call data from.

Command:
java -jar /GenomeAnalysisTK-2.7-4-g6f46d11/GenomeAnalysisTK.jar -T UnifiedGenotyper -R ../reference/Spolyrrhiza.current.genomic.fasta -I SRR072269_reorder_readgroup_sorted.bam --output_mode EMIT_ALL_SITES -o robertTest.vcf

Stack:
Spolyrrhiza.current.genomic.fasta -I SRR072269_reorder_readgroup_sorted.bam --output_mode EMIT_ALL_SITES -o Test.vcf
INFO 13:02:22,959 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:02:22,964 HelpFormatter - The Genome Analysis Toolkit (GATK) v2.7-4-g6f46d11, Compiled 2013/10/10 17:27:51
INFO 13:02:22,964 HelpFormatter - Copyright (c) 2010 The Broad Institute
INFO 13:02:22,964 HelpFormatter - For support and documentation go to http://www.broadinstitute.org/gatk
INFO 13:02:22,973 HelpFormatter - Program Args: -T UnifiedGenotyper -R ../reference/Spolyrrhiza.current.genomic.fasta -I SRR072269_reorder_readgroup_sorted.bam --output_mode EMIT_ALL_SITES -o robertTest.vcf
INFO 13:02:22,974 HelpFormatter - Date/Time: 2014/01/15 13:02:22
INFO 13:02:22,974 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:02:22,974 HelpFormatter - --------------------------------------------------------------------------------
INFO 13:02:24,330 GenomeAnalysisEngine - Strictness is SILENT
INFO 13:02:25,935 GenomeAnalysisEngine - Downsampling Settings: Method: BY_SAMPLE, Target Coverage: 250
INFO 13:02:25,950 SAMDataSource$SAMReaders - Initializing SAMRecords in serial
INFO 13:02:26,529 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.58
INFO 13:02:27,069 GenomeAnalysisEngine - Preparing for traversal over 1 BAM files
INFO 13:02:27,188 GenomeAnalysisEngine - Done preparing for traversal
INFO 13:02:27,189 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING]
INFO 13:02:27,190 ProgressMeter - Location processed.sites runtime per.1M.sites completed total.runtime remaining
INFO 13:02:28,450 GATKRunReport - Uploaded run statistics report to AWS S3

ERROR ------------------------------------------------------------------------------------------
ERROR stack trace

java.lang.IllegalArgumentException: VCFHeaderLine: ID cannot contain an equals sign
at org.broadinstitute.variant.vcf.VCFSimpleHeaderLine.initialize(VCFSimpleHeaderLine.java:80)
at org.broadinstitute.variant.vcf.VCFSimpleHeaderLine.(VCFSimpleHeaderLine.java:71)
at org.broadinstitute.variant.vcf.VCFContigHeaderLine.(VCFContigHeaderLine.java:52)
at org.broadinstitute.variant.vcf.VCFUtils.makeContigHeaderLine(VCFUtils.java:165)
at org.broadinstitute.variant.vcf.VCFUtils.makeContigHeaderLines(VCFUtils.java:156)
at org.broadinstitute.variant.vcf.VCFUtils.withUpdatedContigsAsLines(VCFUtils.java:128)
at org.broadinstitute.variant.vcf.VCFUtils.withUpdatedContigsAsLines(VCFUtils.java:114)
at org.broadinstitute.variant.vcf.VCFUtils.withUpdatedContigs(VCFUtils.java:110)
at org.broadinstitute.sting.utils.variant.GATKVCFUtils.withUpdatedContigs(GATKVCFUtils.java:175)
at org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub.writeHeader(VariantContextWriterStub.java:232)
at org.broadinstitute.sting.gatk.walkers.genotyper.UnifiedGenotyper.initialize(UnifiedGenotyper.java:304)
at org.broadinstitute.sting.gatk.executive.LinearMicroScheduler.execute(LinearMicroScheduler.java:83)
at org.broadinstitute.sting.gatk.GenomeAnalysisEngine.execute(GenomeAnalysisEngine.java:313)
at org.broadinstitute.sting.gatk.CommandLineExecutable.execute(CommandLineExecutable.java:113)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:245)
at org.broadinstitute.sting.commandline.CommandLineProgram.start(CommandLineProgram.java:152)
at org.broadinstitute.sting.gatk.CommandLineGATK.main(CommandLineGATK.java:91)

ERROR ------------------------------------------------------------------------------------------
ERROR A GATK RUNTIME ERROR has occurred (version 2.7-4-g6f46d11):
ERROR
ERROR This might be a bug. Please check the documentation guide to see if this is a known problem.
ERROR If not, please post the error message, with stack trace, to the GATK forum.
ERROR Visit our website and forum for extensive documentation and answers to
ERROR commonly asked questions http://www.broadinstitute.org/gatk
ERROR
ERROR MESSAGE: VCFHeaderLine: ID cannot contain an equals sign
ERROR ------------------------------------------------------------------------------------------

Sam header:
@HD VN:1.4 SO:coordinate
...
@RG ID:1 PL:illumina PU:flowcell_1 LB:library_1 SM:SRR072269$@PG ID:dvtgm PN:stampy VN:1.0.21_(r1713) CL:-g ../reference/spirodela -h ../reference/spirodela -M ../rawdata/[email protected] TM:Fri, 10 Jan 2014 17:43:03 EST WD:/data/home/learnspirodela/stampy_out HN:server.ca UN:williarj

Also, I've noticed that the reads in the files I'm using do have "=" in the read headers:

@SRR072258.sra.2 GL1KSOJ02I7J10 length=93
GACTGGACCTTCTGAGCCTTGCACCAGCGCATTGATGCGGACCTTGAACTCCTCATACTCCNCTGNACCGCCAGGCCAGCAGGGNTAGGGNNN
+SRR072258.sra.2 GL1KSOJ02I7J10 length=93
BBBBAA:[email protected]:655BBAA??<9:::441-/----411446<=?=>=<<44!34<!<;9722/..997774111!43,,,!!!

Could that be where my mystery equal sign problem originates?

Answers

  • williarjwilliarj University of TorontoMember

    I've also found that my reference genome has "=" in the scaffold names:

    Spipo0001S00360|feature=genomic|len=338

    ATGAAGGAGGCAGTTGAAGAGAATACAGCACAACTGAGCGGGATCTCAGCTACGAATCTATCCCGGGTCTCCTCTGCCT

    Maybe this is the issue?

  • pdexheimerpdexheimer Member ✭✭✭✭

    The presence of VCFContigHeaderLine in your stack trace makes me think that your second idea is the right one - the contig names in your reference contain an equals, which is then propagated into the @SQ lines in your VCF. As the error says, those equals signs aren't permitted in that field of the VCF

    The best solution is to fix "dvtgm" so that it doesn't emit those equals signs (or at least throws an error when it encounters them), but the more realistic short to medium term solution might be to modify the names in your reference and realign. With careful editing, it's possible in principle to modify the reference/VCF/etc in place, but there's a bunch of places that need to be carefully fixed. Safer and probably easier to just redo it

  • pdexheimerpdexheimer Member ✭✭✭✭

    Sorry, got bam and vcf mixed up. These are the "##contig" lines in a vcf.

    Wait a minute, why are you getting VCF validation errors in a UG run with no input VCF?

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    I think the error might be happening at the stage of writing out the output, i.e. when the new VCF header is being written.

  • ebanksebanks Broad InstituteMember, Broadie, Dev ✭✭✭✭

    Yes, the '=' in your contig name is causing problems with the contig lines (because it isn't allowed for in the VCF spec). I'm actually not sure how to handle this and will need to think it over...

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    Hi @williarj,

    It looks like we're not going to be able to solve this issue anytime soon, so my recommendation would be to rename the contigs to not include characters like = that break the spec. Sorry for the bad news.

Sign In or Register to comment.