GenotypeAndValidate

Sophia (Member)
edited July 2012 in Ask the GATK team
Dear all,
I have been struggling with this for a while without success, so here is my question: has anyone managed to run the GATK GenotypeAndValidate analysis?

I keep getting this error:
##### ERROR MESSAGE: File associated with name /project/production/DAT/projects/APPLE/validated_sorted.vcf.gz is malformed: Problem reading the interval file caused by Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file.

So far I have tried three different VCF files as the -alleles and -L inputs (with a suitable BAM file based on the same reference as the -I input), and as far as I can see they all DO have a correct header; in particular, they have the header line mentioned in the error. I assume the error really points to another problem with the command line. I am curious to know whether any of you ran into the same error and got the tool to work.

Below are the (abbreviated) headers of the files I tried.

1.
A vcf file output from GATK (UnifiedGenotyper) itself for a human sample:
##fileformat=VCFv4.1
##FORMAT=
[...]
##FORMAT=
##INFO=
[...]
##INFO=
##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=
[...]
##contig=
[...]
##contig=
##reference=file:///project/production/Genomes/fasta/hsapiens_coordsort_v37.fa
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT D999

2.
A vcf file output by samtools | vcftools for a non-human sample:
##fileformat=VCFv4.1
##samtoolsVersion=0.1.18 (r982:295)
##INFO=
[...]
##INFO=
##FILTER=
[...]
##FILTER=
##FORMAT=
[...]
##FORMAT=
##ID=
[...]
##ID=
##INFO=
##source_20120425.1=/apps/VCFTOOLS/0.1.7/bin/vcf-annotate -a /project/production/DAT/projects/APPLE/IRSC_9K_apple_SNPs_annotations.txt.gz -d key=INFO,ID=IRSC,Number=1,Type=String,Description=This position is also found in the IRSC_9K_apple_SNP list -c CHROM,FROM,TO,INFO/IRSC
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT DELICIOUS_CNAG

3.
A "self-made" vcf file with minimal header:
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 112151 . C T 100 PASS .

Any comments will be very much appreciated.

Cheers,
Sophia
Post edited by Carneiro

Answers

  • ebanks (Broad Institute; Member, Broadie, Dev)
    Is that header line tab-delimited (good) or whitespace-delimited (bad)?
  • Sophia (Member)
    My examples 1 and 2 were used unedited, exactly as the previous programs wrote them.
    In the home-made example 3, the header fields are definitely tab-delimited.
  • mmterpstra (Netherlands; Member)
    edited July 2013

    I have had this error too, with versions 2.4-9-g532efad and 2.5-2-gf57256b. After I removed the switch -nt 4, it worked. (The crash occurs with --dbsnp dbsnp_137.b37.vcf from the resource bundle, using UG.)

    Output (I do not want to post the command line, but I hope this helps):

        INFO  18:02:00,576 HelpFormatter - Date/Time: 2013/07/04 18:02:00 
        INFO  18:02:00,576 HelpFormatter - -------------------------------------------------------------------------------- 
        INFO  18:02:00,576 HelpFormatter - -------------------------------------------------------------------------------- 
        INFO  18:02:00,616 ArgumentTypeDescriptor - Dynamically determined type of /path/to/dbsnp_137.b37.vcf to be VCF 
        INFO  18:02:01,212 GenomeAnalysisEngine - Strictness is SILENT 
        INFO  18:02:01,299 GenomeAnalysisEngine - Downsampling Settings: No downsampling 
        INFO  18:02:01,305 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
        INFO  18:02:01,341 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04 
        INFO  18:02:01,353 RMDTrackBuilder - Loading Tribble index from disk for file /path/to/dbsnp_137.b37.vcf 
        INFO  18:02:03,931 IntervalUtils - Processing 50621019 bp from intervals 
        INFO  18:02:03,952 MicroScheduler - Running the GATK in parallel mode with 4 total threads, 1 CPU thread(s) for each of 4 data thread(s), of 16 processors available on this machine 
        INFO  18:02:03,994 GenomeAnalysisEngine - Creating shard strategy for 5 BAM files 
        INFO  18:02:05,702 GenomeAnalysisEngine - Done creating shard strategy 
        INFO  18:02:05,702 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
        INFO  18:02:05,702 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
        INFO  18:02:05,874 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
        INFO  18:02:05,895 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 
        INFO  18:02:05,896 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
        INFO  18:02:05,911 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
        INFO  18:02:05,911 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
        INFO  18:02:05,924 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
        INFO  18:02:06,016 RMDTrackBuilder - Loading Tribble index from disk for file /path/to/dbsnp_137.b37.vcf 
        INFO  18:02:06,082 RMDTrackBuilder - Loading Tribble index from disk for file /path/to/dbsnp_137.b37.vcf 
        INFO  18:02:06,143 RMDTrackBuilder - Loading Tribble index from disk for file /path/to/dbsnp_137.b37.vcf 
        INFO  18:02:15,090 GATKRunReport - Uploaded run statistics report to AWS S3 
        ##### ERROR ------------------------------------------------------------------------------------------
        ##### ERROR A USER ERROR has occurred (version 2.4-9-g532efad): 
        ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
        ##### ERROR Please do not post this error to the GATK forum
        ##### ERROR
        ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
        ##### ERROR Visit our website and forum for extensive documentation and answers to 
        ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
        ##### ERROR
        ##### ERROR MESSAGE: Unable to parse header with error: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file, for input source: /tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub2091328271715881240.tmp
        ##### ERROR ------------------------------------------------------------------------------------------
    
  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    @mmterpstra, can you try again with the latest version (2.6) and let us know if you still get the same error?

  • mmterpstra (Netherlands; Member)
    edited July 2013

    No, it seems fixed in the latest version. These log lines also changed, from:

    INFO 18:02:03,994 GenomeAnalysisEngine - Creating shard strategy for 5 BAM files 
    INFO 18:02:05,702 GenomeAnalysisEngine - Done creating shard strategy

    to:

    INFO  09:39:14,917 GenomeAnalysisEngine - Preparing for traversal over 5 BAM files 
    INFO  09:39:15,487 GenomeAnalysisEngine - Done preparing for traversal 

    With the old versions, the error is still reproducible.

    Conclusion: use 2.6-4, or be careful with the -nt flag in versions 2.4-9-g532efad and 2.5-2-gf57256b.
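
For the tab-versus-whitespace question raised earlier in the thread, a quick check of the #CHROM line can be sketched in Python. This is only a minimal sketch and not part of GATK: the function names `open_text` and `check_vcf_header` are hypothetical, and it assumes the file is either plain text or gzip-compressed (detected by its first two bytes). It only inspects the header line, so it cannot detect problems such as the -nt issue described above.

```python
import gzip

# The eight fixed columns every VCF header line must start with.
REQUIRED = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def open_text(path):
    """Open a plain or gzip-compressed file as text (gzip detected by magic bytes)."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def check_vcf_header(path):
    """Return (ok, message) describing the #CHROM header line of a VCF file."""
    with open_text(path) as fh:
        for line in fh:
            if line.startswith("##"):
                continue  # meta-information line, keep scanning
            if not line.startswith("#CHROM"):
                return False, "no #CHROM line before the first record"
            # Split strictly on tabs first, then loosely on any whitespace.
            fields = line.rstrip("\r\n").split("\t")
            if fields[:8] == REQUIRED:
                return True, "header line is tab-delimited"
            if line.split()[:8] == REQUIRED:
                return False, "header line is whitespace-delimited, not tab-delimited"
            return False, "header line has unexpected columns"
    return False, "no #CHROM line found"
```

A call such as `check_vcf_header("validated_sorted.vcf.gz")` (the path here is only an example) returns a pair like `(True, "header line is tab-delimited")` when the header line uses tabs.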
