
GenotypeAndValidate

Sophia Member
edited July 2012 in Ask the GATK team
Dear all,
I have been struggling with this for a while without success, so here goes my question: has anyone managed to run the GATK GenotypeAndValidate analysis?

I keep getting the following error:
##### ERROR MESSAGE: File associated with name /project/production/DAT/projects/APPLE/validated_sorted.vcf.gz is malformed: Problem reading the interval file caused by Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file.

Meanwhile, I have tried this with three different VCF files as the -alleles and -L inputs (and a suitable BAM file based on the same reference as the -I input), and as far as I can see they all DO have a correct header; in particular, they contain the header line mentioned in the error. I assume the error actually points to some other problem with my command line. I am curious whether any of you have run into the same thing and got it to work.

Below are the (abbreviated) headers of the files I tried.

1. A VCF file output by GATK (UnifiedGenotyper) itself for a human sample:
##fileformat=VCFv4.1
##FORMAT=
[...]
##FORMAT=
##INFO=
[...]
##INFO=
##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=
[...]
##contig=
[...]
##contig=
##reference=file:///project/production/Genomes/fasta/hsapiens_coordsort_v37.fa
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT D999

2. A VCF file output by samtools | vcftools for a non-human sample:
##fileformat=VCFv4.1
##samtoolsVersion=0.1.18 (r982:295)
##INFO=
[...]
##INFO=
##FILTER=
[...]
##FILTER=
##FORMAT=
[...]
##FORMAT=
##ID=
[...]
##ID=
##INFO=
##source_20120425.1=/apps/VCFTOOLS/0.1.7/bin/vcf-annotate -a /project/production/DAT/projects/APPLE/IRSC_9K_apple_SNPs_annotations.txt.gz -d key=INFO,ID=IRSC,Number=1,Type=String,Description=This position is also found in the IRSC_9K_apple_SNP list -c CHROM,FROM,TO,INFO/IRSC
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT DELICIOUS_CNAG

3. A "self-made" VCF file with a minimal header:
##fileformat=VCFv4.1
#CHROM POS ID REF ALT QUAL FILTER INFO
chr1 112151 . C T 100 PASS .

Any comments will be very much appreciated.

Cheers,
Sophia

Answers

  • ebanks (Broad Institute) Member, Broadie, Dev
    Is that header line tab-delimited (good) or whitespace-delimited (bad)?
  • Sophia Member
    My examples 1 and 2 were used unedited, exactly as they came from the previous programs.
    In the home-made example 3, the fields are definitely tab-delimited.
  • mmterpstra (Netherlands) Member
    edited July 2013

    I have also had this error, with versions 2.4-9-g532efad and 2.5-2-gf57256b. After I removed the -nt 4 switch, it worked. I hope this helps. (The crash occurs with --dbsnp dbsnp_137.b37.vcf from the resource bundle, using UG.)

    Output (I do not want to post the command line, but I hope this helps):

        INFO  18:02:00,576 HelpFormatter - Date/Time: 2013/07/04 18:02:00 
        INFO  18:02:00,576 HelpFormatter - -------------------------------------------------------------------------------- 
        INFO  18:02:00,576 HelpFormatter - -------------------------------------------------------------------------------- 
        INFO  18:02:00,616 ArgumentTypeDescriptor - Dynamically determined type of /path/to/dbsnp_137.b37.vcf to be VCF 
        INFO  18:02:01,212 GenomeAnalysisEngine - Strictness is SILENT 
        INFO  18:02:01,299 GenomeAnalysisEngine - Downsampling Settings: No downsampling 
        INFO  18:02:01,305 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
        INFO  18:02:01,341 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.04 
        INFO  18:02:01,353 RMDTrackBuilder - Loading Tribble index from disk for file /path/to/dbsnp_137.b37.vcf 
        INFO  18:02:03,931 IntervalUtils - Processing 50621019 bp from intervals 
        INFO  18:02:03,952 MicroScheduler - Running the GATK in parallel mode with 4 total threads, 1 CPU thread(s) for each of 4 data thread(s), of 16 processors available on this machine 
        INFO  18:02:03,994 GenomeAnalysisEngine - Creating shard strategy for 5 BAM files 
        INFO  18:02:05,702 GenomeAnalysisEngine - Done creating shard strategy 
        INFO  18:02:05,702 ProgressMeter - [INITIALIZATION COMPLETE; STARTING PROCESSING] 
        INFO  18:02:05,702 ProgressMeter -        Location processed.sites  runtime per.1M.sites completed total.runtime remaining 
        INFO  18:02:05,874 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
        INFO  18:02:05,895 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.02 
        INFO  18:02:05,896 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
        INFO  18:02:05,911 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
        INFO  18:02:05,911 SAMDataSource$SAMReaders - Initializing SAMRecords in serial 
        INFO  18:02:05,924 SAMDataSource$SAMReaders - Done initializing BAM readers: total time 0.01 
        INFO  18:02:06,016 RMDTrackBuilder - Loading Tribble index from disk for file /path/to/dbsnp_137.b37.vcf 
        INFO  18:02:06,082 RMDTrackBuilder - Loading Tribble index from disk for file /path/to/dbsnp_137.b37.vcf 
        INFO  18:02:06,143 RMDTrackBuilder - Loading Tribble index from disk for file /path/to/dbsnp_137.b37.vcf 
        INFO  18:02:15,090 GATKRunReport - Uploaded run statistics report to AWS S3 
        ##### ERROR ------------------------------------------------------------------------------------------
        ##### ERROR A USER ERROR has occurred (version 2.4-9-g532efad): 
        ##### ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
        ##### ERROR Please do not post this error to the GATK forum
        ##### ERROR
        ##### ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
        ##### ERROR Visit our website and forum for extensive documentation and answers to 
        ##### ERROR commonly asked questions http://www.broadinstitute.org/gatk
        ##### ERROR
        ##### ERROR MESSAGE: Unable to parse header with error: Your input file has a malformed header: We never saw the required CHROM header line (starting with one #) for the input VCF file, for input source: /tmp/org.broadinstitute.sting.gatk.io.stubs.VariantContextWriterStub2091328271715881240.tmp
        ##### ERROR ------------------------------------------------------------------------------------------
    
  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie

    @mmterpstra, can you try again with the latest version (2.6) and let us know if you still get the same error?

  • mmterpstra (Netherlands) Member
    edited July 2013

    No, the error no longer occurs with the new version, so it seems to be fixed. These log lines also changed, from:

    INFO 18:02:03,994 GenomeAnalysisEngine - Creating shard strategy for 5 BAM files 
    INFO 18:02:05,702 GenomeAnalysisEngine - Done creating shard strategy
    

    to:

    INFO  09:39:14,917 GenomeAnalysisEngine - Preparing for traversal over 5 BAM files 
    INFO  09:39:15,487 GenomeAnalysisEngine - Done preparing for traversal 
    

    With the old versions, the error is still reproducible.

    Conclusion: use 2.6-4, or be careful with the -nt flag in versions 2.4-9-g532efad and 2.5-2-gf57256b.
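
For anyone who wants to rule out the tab-versus-whitespace question raised earlier in this thread, here is a minimal Python sketch that looks for the #CHROM header line and reports whether its eight fixed columns are tab-delimited. The function names (`open_text`, `check_chrom_line`) and the result strings are made up for this illustration and are not part of GATK. The check only looks at the header, so it cannot say anything about the -nt issue described above.

```python
import gzip
import sys

# The eight fixed column names a VCF #CHROM line starts with.
REQUIRED = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def open_text(path):
    """Open a plain or gzip-compressed file as text (hypothetical helper)."""
    with open(path, "rb") as fh:
        magic = fh.read(2)
    if magic == b"\x1f\x8b":  # gzip magic bytes
        return gzip.open(path, "rt")
    return open(path, "r")


def check_chrom_line(lines):
    """Classify the #CHROM header line found in an iterable of text lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if line.startswith("##"):
            continue  # meta-information line, keep scanning
        if not line.startswith("#CHROM"):
            return "missing: first non-## line is not the #CHROM line"
        if line.split("\t")[:8] == REQUIRED:
            return "ok"
        if line.split()[:8] == REQUIRED:
            return "not tab-delimited"
        return "unexpected columns"
    return "missing: no #CHROM line found"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python check_vcf_header.py some_file.vcf[.gz]
    with open_text(sys.argv[1]) as handle:
        print(check_chrom_line(handle))
```

Passing the path of a VCF (plain or gzipped) as the only argument prints one of the result strings. A result of "not tab-delimited" means the column names are present but separated by spaces rather than tabs.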
