
Picard FindMendelianViolations: "Malformed header" error when specifying output directory

StefanC (Austria, Member)
edited June 2019 in Ask the GATK team

Hi all

I am relatively new to NGS analysis and especially to using GATK. I am currently analyzing a small set of exome-seq data from a small family (3 generations, 2 individuals per generation) and wanted to check for Mendelian errors using Picard FindMendelianViolations (plus filtering the variants for a minimum coverage of 30x to avoid false calls at sparsely covered intronic SNPs). The data was generated at BGI on a HiSeq X Ten and processed using GATK (as far as I can tell from the VCF header).

The FindMendelianViolations program works fine when using the command

java -jar /opt/picard/picard.jar FindMendelianViolations I=../../variant_files/vcf/combine.snp.vcf.gz PED=../../../0_pedigree/trio.ped OUTPUT=mendelian_trio.DP30b.txt MIN_DP=30

However, when I add an output folder (VCF_DIR), the tool first runs through the VCF but then stops with the error:
"Your input file has a malformed header: BUG: VCF header has duplicate sample names". The error appears only when I specify an output folder (which appears quite weird to me), but I could reproduce the error several times. I could not figure out what exactly happens. The output folder remains empty, although it seems that the tool attempts to write a file named 1.vcf.

$ java -jar /opt/picard/picard.jar FindMendelianViolations I=../../variant_files/vcf_reheader/combine.snp.reheader-out.vcf.gz PED=../../../0_pedigree/trio_nospaces.ped OUTPUT=mendelian_trio.DP30-2.txt MIN_DP=30 VCF_DIR=vcf_violations30/
INFO    2019-06-21 20:30:14 FindMendelianViolations 

********** NOTE: Picard's command line syntax is changing.
**********
********** For more information, please see:
********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)
**********
********** The command line looks like this in the new syntax:
**********
**********    FindMendelianViolations -I ../../variant_files/vcf_reheader/combine.snp.reheader-out.vcf.gz -PED ../../../0_pedigree/trio_nospaces.ped -OUTPUT mendelian_trio.DP30-2.txt -MIN_DP 30 -VCF_DIR vcf_violations30/
**********


20:30:15.252 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/picard/picard.jar!/com/intel/gkl/native/libgkl_compression.so
[Fri Jun 21 20:30:15 CEST 2019] FindMendelianViolations INPUT=../../variant_files/vcf_reheader/combine.snp.reheader-out.vcf.gz TRIOS=../../../0_pedigree/trio_nospaces.ped OUTPUT=mendelian_trio.DP30-2.txt MIN_DP=30 VCF_DIR=vcf_violations30    MIN_GQ=30 MIN_HET_FRACTION=0.3 SKIP_CHROMS=[MT, chrM] MALE_CHROMS=[chrY, Y] FEMALE_CHROMS=[chrX, X] PSEUDO_AUTOSOMAL_REGIONS=[chrX:10000-2781479, X:10001-2649520, chrX:155701382-156030895, X:59034050-59373566] THREAD_COUNT=1 TAB_MODE=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
[Fri Jun 21 20:30:15 CEST 2019] Executing as [email protected] on Linux 4.15.0-51-generic amd64; OpenJDK 64-Bit Server VM 11.0.3+7-Ubuntu-1ubuntu218.04.1; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.20.2-SNAPSHOT
INFO    2019-06-21 20:30:15 FindMendelianViolations Loading and filtering trios.
WARNING 2019-06-21 20:30:15 FindMendelianViolations Removing trio due to the following missing samples in VCF: [0]
WARNING 2019-06-21 20:30:15 FindMendelianViolations Removing trio due to the following missing samples in VCF: [0]
WARNING 2019-06-21 20:30:15 FindMendelianViolations Removing trio due to the following missing samples in VCF: [0]
INFO    2019-06-21 20:30:16 FindMendelianViolations variants analyzed        10,000 records.  Elapsed time: 00:00:01s.  Time for last 10,000:    0s.  Last read position: chr1:62,594,480

[ ... omitted ... ]

INFO    2019-06-21 20:30:20 FindMendelianViolations variants analyzed       240,000 records.  Elapsed time: 00:00:05s.  Time for last 10,000:    0s.  Last read position: chr22:44,368,204
INFO    2019-06-21 20:30:20 FindMendelianViolations Writing family violation VCFs to /media/q005sc/WINDOWS/ngs_analysis/exome/2_analysis/recomb_TL/picard/vcf_violations30/
INFO    2019-06-21 20:30:20 FindMendelianViolations Writing 1 violation VCF to /media/q005sc/WINDOWS/ngs_analysis/exome/2_analysis/recomb_TL/picard/vcf_violations30/1.vcf
[Fri Jun 21 20:30:20 CEST 2019] picard.vcf.MendelianViolations.FindMendelianViolations done. Elapsed time: 0.09 minutes.
Runtime.totalMemory()=206569472
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.tribble.TribbleException$InvalidHeader: Your input file has a malformed header: BUG: VCF header has duplicate sample names
    at htsjdk.variant.vcf.VCFHeader.<init>(VCFHeader.java:142)
    at picard.vcf.MendelianViolations.FindMendelianViolations.writeAllViolations(FindMendelianViolations.java:288)
    at picard.vcf.MendelianViolations.FindMendelianViolations.doWork(FindMendelianViolations.java:262)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:103)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:113)

However, the header seems fine to me (AXX to TXX are the six samples):

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  AXX01   EXX01   GXX01   NXX01   OXX01   TXX01
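A quick way to double-check the sample columns for duplicates is sketched below. This is only a minimal illustration: the helper names `sample_names` and `duplicate_samples` are hypothetical (not part of Picard, GATK, or htsjdk), and the header string is the line above retyped with tabs as the assumed delimiter.

```python
from collections import Counter

# #CHROM line copied from the post; tab delimiters are an assumption here.
HEADER = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    "\tAXX01\tEXX01\tGXX01\tNXX01\tOXX01\tTXX01"
)


def sample_names(header_line):
    """Return the sample columns, i.e. everything after the 9 fixed VCF columns."""
    return header_line.rstrip("\n").split("\t")[9:]


def duplicate_samples(header_line):
    """Return the sample names that occur more than once, sorted."""
    counts = Counter(sample_names(header_line))
    return sorted(name for name, n in counts.items() if n > 1)


print(sample_names(HEADER))
print(duplicate_samples(HEADER))
```

For the header shown, the second list comes back empty, which fits the observation that the input header itself lists six distinct samples.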

The input PED file looks like this. It does not contain all samples because we are interested only in generations 2 and 3, but the error also appears when all samples are included in the PED file:

1   OXX01   0  0  1  1
1   NXX01   0  0  2  0
1   TXX01   OXX01  NXX01  1  1
1   EXX01   0  0  2  0
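To see which PED rows describe a trio whose members are all present in the VCF, a rough cross-check can be sketched as follows. The function names `parse_ped` and `missing_members` are hypothetical, and the sketch assumes that each PED row is treated as a candidate trio (child, father, mother); whether Picard does exactly this is not confirmed here. Under that assumption, the three founder rows with `0` parents would line up with the three "Removing trio ... [0]" warnings in the log above.

```python
# Sample names from the #CHROM line and PED rows, both copied from the post.
VCF_SAMPLES = {"AXX01", "EXX01", "GXX01", "NXX01", "OXX01", "TXX01"}
PED_TEXT = """\
1 OXX01 0 0 1 1
1 NXX01 0 0 2 0
1 TXX01 OXX01 NXX01 1 1
1 EXX01 0 0 2 0
"""


def parse_ped(text):
    """Parse whitespace-separated PED rows into dicts (first four columns only)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        family, child, father, mother = fields[:4]
        rows.append(
            {"family": family, "child": child, "father": father, "mother": mother}
        )
    return rows


def missing_members(row, vcf_samples):
    """Return the child/father/mother IDs that are not sample columns in the VCF."""
    members = {row["child"], row["father"], row["mother"]}
    return sorted(members - set(vcf_samples))


for row in parse_ped(PED_TEXT):
    missing = missing_members(row, VCF_SAMPLES)
    status = "complete trio" if not missing else f"incomplete, missing {missing}"
    print(row["child"], status)
```

Only the TXX01 row (parents OXX01 and NXX01) comes out as a complete trio in this check.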

The output of ValidateVariants is as follows (run from the Docker image):

[email protected]:/gatk# gatk --version
The Genome Analysis Toolkit (GATK) v4.1.2.0
HTSJDK Version: 2.19.0
Picard Version: 2.19.0
[email protected]:/gatk# gatk ValidateVariants --variant combine.snp.reheader-out.vcf.gz 
Using GATK jar /gatk/gatk-package-4.1.2.0-local.jar
Running:
    java -Dsamjdk.use_async_io_read_samtools=false -Dsamjdk.use_async_io_write_samtools=true -Dsamjdk.use_async_io_write_tribble=false -Dsamjdk.compression_level=2 -jar /gatk/gatk-package-4.1.2.0-local.jar ValidateVariants --variant combine.snp.reheader-out.vcf.gz
15:16:38.147 INFO  NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/gatk/gatk-package-4.1.2.0-local.jar!/com/intel/gkl/native/libgkl_compression.so
Jun 22, 2019 3:16:39 PM shaded.cloud_nio.com.google.auth.oauth2.ComputeEngineCredentials runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
15:16:39.895 INFO  ValidateVariants - ------------------------------------------------------------
15:16:39.896 INFO  ValidateVariants - The Genome Analysis Toolkit (GATK) v4.1.2.0
15:16:39.896 INFO  ValidateVariants - For support and documentation go to https://software.broadinstitute.org/gatk/
15:16:39.896 INFO  ValidateVariants - Executing as [email protected] on Linux v4.15.0-51-generic amd64
15:16:39.897 INFO  ValidateVariants - Java runtime: OpenJDK 64-Bit Server VM v1.8.0_191-8u191-b12-0ubuntu0.16.04.1-b12
15:16:39.897 INFO  ValidateVariants - Start Date/Time: June 22, 2019 3:16:38 PM UTC
15:16:39.897 INFO  ValidateVariants - ------------------------------------------------------------
15:16:39.897 INFO  ValidateVariants - ------------------------------------------------------------
15:16:39.897 INFO  ValidateVariants - HTSJDK Version: 2.19.0
15:16:39.897 INFO  ValidateVariants - Picard Version: 2.19.0
15:16:39.897 INFO  ValidateVariants - HTSJDK Defaults.COMPRESSION_LEVEL : 2
15:16:39.898 INFO  ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
15:16:39.898 INFO  ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
15:16:39.898 INFO  ValidateVariants - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
15:16:39.898 INFO  ValidateVariants - Deflater: IntelDeflater
15:16:39.898 INFO  ValidateVariants - Inflater: IntelInflater
15:16:39.898 INFO  ValidateVariants - GCS max retries/reopens: 20
15:16:39.898 INFO  ValidateVariants - Requester pays: disabled
15:16:39.898 INFO  ValidateVariants - Initializing engine
15:16:40.150 INFO  FeatureManager - Using codec VCFCodec to read file file:///gatk/combine.snp.reheader-out.vcf.gz
15:16:40.268 INFO  ValidateVariants - Done initializing engine
15:16:40.269 INFO  ProgressMeter - Starting traversal
15:16:40.269 INFO  ProgressMeter -        Current Locus  Elapsed Minutes    Variants Processed  Variants/Minute
15:16:41.899 INFO  ProgressMeter -       chrX:142605437              0.0                245817        9048478.5
15:16:41.900 INFO  ProgressMeter - Traversal complete. Processed 245817 total variants in 0.0 minutes.
15:16:41.900 INFO  ValidateVariants - Shutting down engine

I was not able to tell from the output above whether my VCF is OK or not. No report file was written to the directory (executed in /gatk).

I would be very grateful for any help to figure out what is happening! Thank you very much!

Stefan


Answers

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)
    edited June 2019

    Hi @StefanC

    Take a look at this thread: https://gatkforums.broadinstitute.org/gatk/discussion/4277/error-message-your-input-file-has-a-malformed-header


  • StefanC (Austria, Member)

    Hi @bhanuGandham

    Thank you very much for your reply. I checked the link. It suggests that the error occurs when the header is separated by spaces instead of tabs. Unfortunately this is not the case for my file. Both the columns and the header use tabs. See below the output of cat -T. Notepadqq also shows only tabs.

    $ cat -T combine.snp.reheader-out.vcf | grep "#CHROM" -A2
    #CHROM^IPOS^IID^IREF^IALT^IQUAL^IFILTER^IINFO^IFORMAT^IAXX01^IEXX01^IGXX01^INXX01^IOXX01^ITXX01
    chr1^I14653^I.^IC^IT^I925.34^IPASS^IAC=6;AF=0.5;AN=12;BaseQRankSum=0.234;ClippingRankSum=0;DP=287;ExcessHet=14.6052;FS=13.082;MLEAC=6;MLEAF=0.5;MQ=40.44;MQRankSum=-0.756;QD=3.24;ReadPosRankSum=-0.395;SOR=1.633^IGT:AD:DP:GQ:PL^I0/1:34,14:48:99:264,0,871^I0/1:46,6:52:48:48,0,1317^I0/1:39,8:47:99:114,0,1085^I0/1:36,11:47:99:192,0,991^I0/1:42,7:49:95:95,0,1194^I0/1:31,12:43:99:249,0,828
    chr1^I14677^I.^IG^IA^I291.15^IPASS^IAC=1;AF=0.083;AN=12;BaseQRankSum=-0.639;ClippingRankSum=0;DP=343;ExcessHet=3.0103;FS=4.993;MLEAC=1;MLEAF=0.083;MQ=71.77;MQRankSum=-1.597;QD=5.29;ReadPosRankSum=-0.544;SOR=1.376^IGT:AD:DP:GQ:PL^I0/1:38,17:55:99:324,0,1152^I0/0:54,3:57:91:0,91,1715^I0/0:58,0:58:99:0,120,1800^I0/0:64,0:64:99:0,120,1800^I0/0:60,2:62:99:0,119,1802^I0/0:45,0:45:99:0,120,1800
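The same delimiter check can be scripted, as a minimal sketch. The function name `header_delimiter_report` is hypothetical, and the test strings are the #CHROM line from this thread retyped with tabs and, for contrast, with spaces.

```python
def header_delimiter_report(header_line):
    """Summarise how a #CHROM line is delimited."""
    line = header_line.rstrip("\n")
    return {
        "tab_fields": len(line.split("\t")),  # number of tab-separated columns
        "has_spaces": " " in line,            # any literal space in the line
    }


# A correctly tab-delimited header and a space-delimited variant for comparison.
tabbed = (
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT"
    "\tAXX01\tEXX01\tGXX01\tNXX01\tOXX01\tTXX01"
)
spaced = tabbed.replace("\t", " ")

print(header_delimiter_report(tabbed))
print(header_delimiter_report(spaced))
```

A tab-delimited header with six samples should report 15 tab-separated fields and no spaces; a space-delimited one collapses to a single field.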
    

    Do you have any idea what else might be the problem?

    best
    Stefan

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)

    Hi @StefanC

    Can you please post the header for this file: combine.snp.reheader-out.vcf.gz

  • StefanC (Austria, Member)

    Hi @bhanuGandham

    sure. It is:


    Sequencing and file generation was done at the BGI.

    best regards
    Stefan

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator)

    Hi @StefanC

    It looks like this is a bug in FindMendelianViolations. I have created an issue ticket for the dev team and we are looking into it. You can follow the progress on this issue here: https://github.com/broadinstitute/picard/issues/1354
