Picard FindMendelianViolations: "Malformed header" error when specifying output directory
I am relatively new to NGS analysis, and especially to GATK. I am currently analyzing a small set of exome-seq data from a small family (3 generations, 2 individuals per generation) and wanted to check for Mendelian errors using Picard FindMendelianViolations (plus filtering the variants for a minimum coverage of 30x to avoid false calls at sparsely covered intronic SNPs). The data was generated at BGI on a HiSeq Ten X and processed using GATK (as far as I can tell from the VCF header).
The FindMendelianViolations program works fine when using the command
java -jar /opt/picard/picard.jar FindMendelianViolations I=../../variant_files/vcf/combine.snp.vcf.gz PED=../../../0_pedigree/trio.ped OUTPUT=mendelian_trio.DP30b.txt MIN_DP=30
However, when I add an output folder (VCF_DIR), the tool first runs through the VCF but then stops with the error:
"Your input file has a malformed header: BUG: VCF header has duplicate sample names". The error appears only when I specify an output folder (which appears quite weird to me), but I could reproduce the error several times. I could not figure out what exactly happens. The output folder remains empty, although it seems that the tool attempts to write a file named 1.vcf.
$ java -jar /opt/picard/picard.jar FindMendelianViolations I=../../variant_files/vcf_reheader/combine.snp.reheader-out.vcf.gz PED=../../../0_pedigree/trio_nospaces.ped OUTPUT=mendelian_trio.DP30-2.txt MIN_DP=30 VCF_DIR=vcf_violations30/
INFO 2019-06-21 20:30:14 FindMendelianViolations
********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
********** FindMendelianViolations -I ../../variant_files/vcf_reheader/combine.snp.reheader-out.vcf.gz -PED ../../../0_pedigree/trio_nospaces.ped -OUTPUT mendelian_trio.DP30-2.txt -MIN_DP 30 -VCF_DIR vcf_violations30/
**********
20:30:15.252 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/picard/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Jun 21 20:30:15 CEST 2019] FindMendelianViolations INPUT=../../variant_files/vcf_reheader/combine.snp.reheader-out.vcf.gz TRIOS=../../../0_pedigree/trio_nospaces.ped OUTPUT=mendelian_trio.DP30-2.txt MIN_DP=30 VCF_DIR=vcf_violations30 MIN_GQ=30 MIN_HET_FRACTION=0.3 SKIP_CHROMS=[MT, chrM] MALE_CHROMS=[chrY, Y] FEMALE_CHROMS=[chrX, X] PSEUDO_AUTOSOMAL_REGIONS=[chrX:10000-2781479, X:10001-2649520, chrX:155701382-156030895, X:59034050-59373566] THREAD_COUNT=1 TAB_MODE=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Fri Jun 21 20:30:15 CEST 2019] Executing as [email protected] on Linux 4.15.0-51-generic amd64; OpenJDK 64-Bit Server VM 11.0.3+7-Ubuntu-1ubuntu218.04.1; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.20.2-SNAPSHOT
INFO 2019-06-21 20:30:15 FindMendelianViolations Loading and filtering trios.
WARNING 2019-06-21 20:30:15 FindMendelianViolations Removing trio due to the following missing samples in VCF:
WARNING 2019-06-21 20:30:15 FindMendelianViolations Removing trio due to the following missing samples in VCF:
WARNING 2019-06-21 20:30:15 FindMendelianViolations Removing trio due to the following missing samples in VCF:
INFO 2019-06-21 20:30:16 FindMendelianViolations variants analyzed 10,000 records. Elapsed time: 00:00:01s. Time for last 10,000: 0s. Last read position: chr1:62,594,480
[ ... omitted ... ]
INFO 2019-06-21 20:30:20 FindMendelianViolations variants analyzed 240,000 records. Elapsed time: 00:00:05s. Time for last 10,000: 0s. Last read position: chr22:44,368,204
INFO 2019-06-21 20:30:20 FindMendelianViolations Writing family violation VCFs to /media/q005sc/WINDOWS/ngs_analysis/exome/2_analysis/recomb_TL/picard/vcf_violations30/
INFO 2019-06-21 20:30:20 FindMendelianViolations Writing 1 violation VCF to /media/q005sc/WINDOWS/ngs_analysis/exome/2_analysis/recomb_TL/picard/vcf_violations30/1.vcf
[Fri Jun 21 20:30:20 CEST 2019] picard.vcf.MendelianViolations.FindMendelianViolations done. Elapsed time: 0.09 minutes.
Runtime.totalMemory()=206569472
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: BUG: VCF header has duplicate sample names
    at htsjdk.variant.vcf.VCFHeader.<init>(VCFHeader.java:142)
    at picard.vcf.MendelianViolations.FindMendelianViolations.writeAllViolations(FindMendelianViolations.java:288)
    at picard.vcf.MendelianViolations.FindMendelianViolations.doWork(FindMendelianViolations.java:262)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)
However, the header seems fine to me (AXX to TXX are the six samples):
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT AXX01 EXX01 GXX01 NXX01 OXX01 TXX01
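For anyone who wants to confirm the "no duplicates" reading of the header without eyeballing it, here is a minimal Python sketch. It is only an independent sanity check of the input file, not what Picard or htsjdk do internally. The function names are made up for illustration, and it assumes a tab-separated #CHROM line with sample columns starting at column 10, as in a standard VCF.

```python
import gzip
from collections import Counter


def read_sample_names(vcf_path):
    """Return the sample columns of the #CHROM line of a plain or gzipped VCF."""
    opener = gzip.open if vcf_path.endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first 9 columns are fixed (CHROM ... FORMAT); samples follow.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached the records without seeing a #CHROM line
    raise ValueError("no #CHROM header line found")


def duplicated(names):
    """Return the sorted list of names that occur more than once."""
    return sorted(name for name, count in Counter(names).items() if count > 1)


# The six sample names from the header line posted above.
posted = ["AXX01", "EXX01", "GXX01", "NXX01", "OXX01", "TXX01"]
print(duplicated(posted))  # prints []
```

Calling duplicated(read_sample_names(path)) on the input file would list any repeated names. An empty list here suggests the duplicate is introduced when the per-family output header is built rather than being present in the input, though this check alone cannot prove that.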
The input ped file looks like this. It does not contain all samples because we are only interested in generations 2 and 3, but the error also appears when all samples are included in the ped file:
1 OXX01 0 0 1 1
1 NXX01 0 0 2 0
1 TXX01 OXX01 NXX01 1 1
1 EXX01 0 0 2 0
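Since the log reports trios being removed for "missing samples", it may help to see which trios the pedigree actually defines and whether each member is present in the VCF. The sketch below is a hypothetical helper, not Picard's own PED parser. It assumes the usual six whitespace-separated PED columns (family, individual, father, mother, sex, phenotype) and treats "0" as "parent unknown".

```python
def parse_ped(lines):
    """Parse PED rows into dicts using the first six whitespace-separated columns."""
    keys = ("family", "individual", "father", "mother", "sex", "phenotype")
    rows = []
    for line in lines:
        fields = line.split()
        if line.startswith("#") or len(fields) < 6:
            continue  # skip comments and incomplete rows
        rows.append(dict(zip(keys, fields[:6])))
    return rows


def complete_trios(rows):
    """Return (child, father, mother) for every row with both parents given."""
    return [
        (row["individual"], row["father"], row["mother"])
        for row in rows
        if row["father"] != "0" and row["mother"] != "0"
    ]


def missing_from_vcf(trio, vcf_samples):
    """Return the members of a trio that are absent from the VCF sample list."""
    return [sample for sample in trio if sample not in vcf_samples]


# Pedigree rows and VCF sample names as posted in this thread.
ped_lines = [
    "1 OXX01 0 0 1 1",
    "1 NXX01 0 0 2 0",
    "1 TXX01 OXX01 NXX01 1 1",
    "1 EXX01 0 0 2 0",
]
vcf_samples = ["AXX01", "EXX01", "GXX01", "NXX01", "OXX01", "TXX01"]

for trio in complete_trios(parse_ped(ped_lines)):
    print(trio, missing_from_vcf(trio, vcf_samples))
```

With the rows above, the only complete trio is TXX01 with parents OXX01 and NXX01, and none of its members is missing from the VCF. The three rows without parents are not trios under this reading, which might be related to the three warnings in the log, but that is a guess.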
The output of ValidateVariants is as follows (run from the Docker image):
[email protected]:/gatk# gatk --version
The Genome Analysis Toolkit (GATK) v220.127.116.11
HTSJDK Version: 2.19.0
Picard Version: 2.19.0
[email protected]:/gatk# gatk ValidateVariants --variant combine.snp.reheader-out.vcf.gz
Using GATK jar /gatk/gatk-package-18.104.22.168-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-22.214.171.124-local.jar ValidateVariants --variant combine.snp.reheader-out.vcf.gz
15:16:38.147 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-126.96.36.199-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jun 22, 2019 3:16:39 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
15:16:39.895 INFO ValidateVariants - ------------------------------------------------------------
15:16:39.896 INFO ValidateVariants - The Genome Analysis Toolkit (GATK) v188.8.131.52
15:16:39.896 INFO ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
15:16:39.896 INFO ValidateVariants - Executing as [email protected] on Linux v4.15.0-51-generic amd64
15:16:39.897 INFO ValidateVariants - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12
15:16:39.897 INFO ValidateVariants - Start Date/Time: June 22, 2019 3:16:38 PM UTC
15:16:39.897 INFO ValidateVariants - ------------------------------------------------------------
15:16:39.897 INFO ValidateVariants - ------------------------------------------------------------
15:16:39.897 INFO ValidateVariants - HTSJDK Version: 2.19.0
15:16:39.897 INFO ValidateVariants - Picard Version: 2.19.0
15:16:39.897 INFO ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:16:39.898 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:16:39.898 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:16:39.898 INFO ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:16:39.898 INFO ValidateVariants - Deflater: IntelDeflater
15:16:39.898 INFO ValidateVariants - Inflater: IntelInflater
15:16:39.898 INFO ValidateVariants - GCS max retries/reopens: 20
15:16:39.898 INFO ValidateVariants - Requester pays: disabled
15:16:39.898 INFO ValidateVariants - Initializing engine
15:16:40.150 INFO FeatureManager - Using codec VCFCodec to read file file:///gatk/combine.snp.reheader-out.vcf.gz
15:16:40.268 INFO ValidateVariants - Done initializing engine
15:16:40.269 INFO ProgressMeter - Starting traversal
15:16:40.269 INFO ProgressMeter - Current Locus Elapsed Minutes Variants Processed Variants/Minute
15:16:41.899 INFO ProgressMeter - chrX:142605437 0.0 245817 9048478.5
15:16:41.900 INFO ProgressMeter - Traversal complete. Processed 245817 total variants in 0.0 minutes.
15:16:41.900 INFO ValidateVariants - Shutting down engine
I was not able to tell from the output above whether my VCF is OK or not. No report file was written to the directory (executed in /gatk).
I would be very grateful for any help to figure out what is happening! Thank you very much!