
BAM file fails with ReadsPipelineSpark in version 4.1.3 but works in version 4.0.10.0

Hi all.
When running the ReadsPipelineSpark version 4.1.3 on my BAM file, I got the following exception:

A USER ERROR has occurred: Failed to read bam header from hdfs://cloudera08/gatk-test2/WES2019-024_S5_rgok.BAM
Caused by: Cannot find format extension for hdfs://cloudera08/gatk-test2/WES2019-024_S5_rgok.BAM


org.broadinstitute.hellbender.exceptions.UserException: Failed to read bam header from hdfs://cloudera08/gatk-test2/WES2019-024_S5_rgok.BAM
Caused by:Cannot find format extension for hdfs://cloudera08/gatk-test2/WES2019-024_S5_rgok.BAM
at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(ReadsSparkSource.java:188)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReads(GATKSparkTool.java:562)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(GATKSparkTool.java:541)
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:531)
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: Cannot find format extension for hdfs://cloudera08/gatk-test2/WES2019-024_S5_rgok.BAM
at org.disq_bio.disq.HtsjdkReadsRddStorage.read(HtsjdkReadsRddStorage.java:153)
at org.disq_bio.disq.HtsjdkReadsRddStorage.read(HtsjdkReadsRddStorage.java:123)
at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(ReadsSparkSource.java:185)
... 22 more

The same file runs to completion with the same tool in version 4.0.10.

This is the header I see with samtools:
[[email protected] temp]# samtools-1.7/samtools view -H WES2019-024-40044/WES2019-024_S5_rgok.BAM
@HD VN:1.5 SO:queryname
@RG ID:A LB:WES2019-024_S5 PL:illumina SM:WES2019-024_S5 PU:L1

I don't understand what the 4.1.3 version of the tool considers wrong in this header that the 4.0.10 version does not.

Thanks a lot.
Alessandro


Answers

  • SkyWarrior (Turkey; Member ✭✭✭)

    Does your BAM header include a sequence dictionary? Maybe that's why 4.1.3 is failing.
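    To make the question concrete: the sequence dictionary is the set of @SQ lines in the SAM header, which samtools view -H shows when present. A minimal Python sketch of the check (illustrative only, not GATK code):

```python
def has_sequence_dictionary(header_text):
    # The sequence dictionary is the set of @SQ lines in a SAM header
    # (samtools view -H prints them when present).
    return any(line.startswith("@SQ") for line in header_text.splitlines())

# The header posted above has only @HD and @RG lines, so no dictionary:
header = "@HD\tVN:1.5\tSO:queryname\n@RG\tID:A\tLB:WES2019-024_S5\tPL:illumina\tSM:WES2019-024_S5\tPU:L1"
print(has_sequence_dictionary(header))  # False
```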

  • Many thanks for your response.
    This is the output of the samtools view -H command:

    -bash-4.1$ ../samtools-1.7/samtools view -H WES2019-022_S4.BAM
    @HD VN:1.5 SO:coordinate
    @RG ID:WES2019-022 SM:WES2019-022_S4 PL:illumina PU:L1

    There is no sequence dictionary.
    I created that BAM using the FastqToSam tool because I only have fastq files, and I want to use the ReadsPipelineSpark tool, which doesn't accept fastq as input.

    This is the command I used:
    /opt/gatk/gatk-4.1.3.0/gatk FastqToSam --FASTQ WES2019-022_S4_R1_001.fastq.gz --FASTQ2 WES2019-022_S4_R2_001.fastq.gz --SAMPLE_NAME WES2019-022_S4 --OUTPUT WES2019-022_S4.BAM --CREATE_INDEX true --SORT_ORDER coordinate --READ_GROUP_NAME WES2019-022 --PLATFORM illumina --PLATFORM_UNIT L1

    Could you please tell me what I am doing wrong here?

    Thanks a lot for your help.
    Alessandro

  • SkyWarrior (Turkey; Member ✭✭✭)

    Since there is no mapping, there is no need for SO to be set to coordinate. Can you change it to queryname, or leave the parameter untouched so SO defaults to queryname? That could also be the problem: since the header says coordinate, the tool may be looking for a sequence dictionary.

  • The problem persists even with a BAM created without setting the sort order in the FastqToSam tool:

    -bash-4.1$ ../samtools-1.7/samtools view -H WES2019-022_S4.BAM
    @HD VN:1.5 SO:queryname
    @RG ID:WES2019-022 SM:WES2019-022_S4 PL:illumina PU:L1

    Again the same error in the ReadsPipelineSpark execution:

    A USER ERROR has occurred: Failed to read bam header from hdfs://cloudera08/gatk-test2/WES2019-022_S4.BAM
    Caused by:Cannot find format extension for hdfs://cloudera08/gatk-test2/WES2019-022_S4.BAM

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator, admin)
    edited September 2019
  • Hi @bhanuGandham, thanks for your response.
    I got no errors from the ValidateSamFile tool. I used the latest version of picard.jar.

    -bash-4.1$ java -jar ../picard.jar ValidateSamFile I=WES2019-022_S4.BAM MODE=SUMMARY
    INFO 2019-09-24 11:35:23 ValidateSamFile

    ********** NOTE: Picard's command line syntax is changing.


    ********** For more information, please see:
    ********** https://github.com/broadinstitute/picard/wiki/Command-Line-Syntax-Transition-For-Users-(Pre-Transition)


    ********** The command line looks like this in the new syntax:


    ********** ValidateSamFile -I WES2019-022_S4.BAM -MODE SUMMARY


    11:35:23.450 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/opt/temp/picard.jar!/com/intel/gkl/native/libgkl_compression.so
    [Tue Sep 24 11:35:23 CEST 2019] ValidateSamFile INPUT=WES2019-022_S4.BAM MODE=SUMMARY MAX_OUTPUT=100 IGNORE_WARNINGS=false VALIDATE_INDEX=true INDEX_VALIDATION_STRINGENCY=EXHAUSTIVE IS_BISULFITE_SEQUENCED=false MAX_OPEN_TEMP_FILES=8000 SKIP_MATE_VALIDATION=false VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json USE_JDK_DEFLATER=false USE_JDK_INFLATER=false
    [Tue Sep 24 11:35:23 CEST 2019] Executing as [email protected] on Linux 2.6.32-754.17.1.el6.x86_64 amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13; Deflater: Intel; Inflater: Intel; Provider GCS is not available; Picard version: 2.20.8-SNAPSHOT
    WARNING 2019-09-24 11:35:23 ValidateSamFile NM validation cannot be performed without the reference. All other validations will still occur.
    INFO 2019-09-24 11:36:12 SamFileValidator Validated Read 10,000,000 records. Elapsed time: 00:00:48s. Time for last 10,000,000: 48s. Last read position: /
    INFO 2019-09-24 11:37:00 SamFileValidator Validated Read 20,000,000 records. Elapsed time: 00:01:37s. Time for last 10,000,000: 48s. Last read position: /
    INFO 2019-09-24 11:37:49 SamFileValidator Validated Read 30,000,000 records. Elapsed time: 00:02:25s. Time for last 10,000,000: 48s. Last read position: /
    INFO 2019-09-24 11:38:37 SamFileValidator Validated Read 40,000,000 records. Elapsed time: 00:03:14s. Time for last 10,000,000: 48s. Last read position: /
    INFO 2019-09-24 11:39:26 SamFileValidator Validated Read 50,000,000 records. Elapsed time: 00:04:02s. Time for last 10,000,000: 48s. Last read position: /
    INFO 2019-09-24 11:40:14 SamFileValidator Validated Read 60,000,000 records. Elapsed time: 00:04:50s. Time for last 10,000,000: 48s. Last read position: /
    INFO 2019-09-24 11:41:03 SamFileValidator Validated Read 70,000,000 records. Elapsed time: 00:05:39s. Time for last 10,000,000: 48s. Last read position: /
    INFO 2019-09-24 11:41:52 SamFileValidator Validated Read 80,000,000 records. Elapsed time: 00:06:28s. Time for last 10,000,000: 49s. Last read position: /
    INFO 2019-09-24 11:42:41 SamFileValidator Validated Read 90,000,000 records. Elapsed time: 00:07:17s. Time for last 10,000,000: 48s. Last read position: /
    INFO 2019-09-24 11:43:29 SamFileValidator Validated Read 100,000,000 records. Elapsed time: 00:08:06s. Time for last 10,000,000: 48s. Last read position: /
    INFO 2019-09-24 11:44:18 SamFileValidator Validated Read 110,000,000 records. Elapsed time: 00:08:54s. Time for last 10,000,000: 48s. Last read position: /
    INFO 2019-09-24 11:45:07 SamFileValidator Validated Read 120,000,000 records. Elapsed time: 00:09:43s. Time for last 10,000,000: 48s. Last read position: /
    INFO 2019-09-24 11:45:55 SamFileValidator Validated Read 130,000,000 records. Elapsed time: 00:10:32s. Time for last 10,000,000: 48s. Last read position: /
    INFO 2019-09-24 11:46:44 SamFileValidator Validated Read 140,000,000 records. Elapsed time: 00:11:20s. Time for last 10,000,000: 48s. Last read position: /
    INFO 2019-09-24 11:47:33 SamFileValidator Validated Read 150,000,000 records. Elapsed time: 00:12:09s. Time for last 10,000,000: 48s. Last read position: /
    No errors found
    [Tue Sep 24 11:47:43 CEST 2019] picard.sam.ValidateSamFile done. Elapsed time: 12.34 minutes.
    Runtime.totalMemory()=1207435264

    Do you have any hints?
    Many thanks.

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator, admin)
    edited September 2019

    Hi @asammarco

    Can you please post the ReadsPipelineSpark command you are using? Does it look like this for uBAM input?
    gatk ReadsPipelineSpark -I gs://my-gcs-bucket/unaligned_reads.bam -R gs://my-gcs-bucket/reference.fasta --known-sites gs://my-gcs-bucket/sites_of_variation.vcf --align -O gs://my-gcs-bucket/output.vcf -- --sparkRunner GCS --cluster my-dataproc-cluster

    See tool docs: https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_spark_pipelines_ReadsPipelineSpark.php

  • It is a little more detailed than yours, but as I told you, the tool works with GATK version 4.0.10.0, so it seems the newer version of the pipeline included in 4.1.3 checks something more in the BAM header than previous versions did.

    /opt/gatk/gatk-4.0.10.0/gatk ReadsPipelineSpark --spark-runner SPARK --spark-master yarn --spark-submit-command spark2-submit -I hdfs://cloudera08/gatk-test2/WES2019-022_S4.BAM -O hdfs://cloudera08/gatk-test2/WES2019-022_S4_out.gvcf -R hdfs://cloudera08/gatk-test1/ucsc.hg19.fasta -L hdfs://cloudera08/gatk-test2/RefGene_exons.interval_list --known-sites hdfs://cloudera08/gatk-test1/dbsnp_150_hg19.vcf.gz --known-sites hdfs://cloudera08/gatk-test1/Mills_and_1000G_gold_standard.indels.hg19.vcf.gz --align true --emit-ref-confidence GVCF --conf deploy-mode=cluster --conf "spark.driver.memory=2g" --conf "spark.executor.memory=18g" --conf "spark.storage.memoryFraction=1" --conf "spark.akka.frameSize=200" --conf "spark.default.parallelism=100" --conf "spark.core.connection.ack.wait.timeout=600" --conf "spark.yarn.executor.memoryOverhead=4096" --conf "spark.yarn.driver.memoryOverhead=400"

    Thanks for your support.

  • bhanuGandham (Cambridge MA; Member, Administrator, Broadie, Moderator, admin)

    @asammarco I will look into it and get back to you.

  • asammarco (Member)

    Thanks it worked!
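    As the reply below explains, what worked was renaming the input so its extension is the lowercase .bam; the stack trace points at disq's HtsjdkReadsRddStorage, which resolves the on-disk format from the file extension. A minimal Python sketch of a case-sensitive, extension-based check of this kind — an assumption about the behavior, not disq's actual code:

```python
def find_format_extension(path):
    # Case-sensitive, extension-based format detection (illustrative):
    # only an exact lowercase suffix matches, so ".BAM" is rejected.
    for ext in (".bam", ".cram", ".sam"):
        if path.endswith(ext):
            return ext
    raise ValueError("Cannot find format extension for " + path)

print(find_format_extension("WES2019-024_S5_rgok.bam"))  # .bam
# find_format_extension("WES2019-024_S5_rgok.BAM") raises ValueError
```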

  • asammarco (Member)

    Hi @cnorman, as I said before, ReadsPipelineSpark started after I changed the extension of the file to lowercase .bam, but it stopped after about 3 hours of execution due to the following error:

    Serialization trace:
    initializer (htsjdk.samtools.util.Lazy)
    dictionary (htsjdk.samtools.reference.IndexedFastaSequenceFile)
    sequenceFile (org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile)
    val$taskReferenceSequenceFile (org.broadinstitute.hellbender.tools.HaplotypeCallerSpark$1)
    at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:101)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:508)
    at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:575)
    at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:79)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:508)
    at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:575)
    at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:79)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:508)
    at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:575)
    at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:79)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:508)
    at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:651)
    at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:241)
    at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:140)
    at org.apache.spark.serializer.SerializerManager.dataSerializeStream(SerializerManager.scala:174)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1$$anonfun$apply$7.apply(BlockManager.scala:1174)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1$$anonfun$apply$7.apply(BlockManager.scala:1172)
    at org.apache.spark.storage.DiskStore.put(DiskStore.scala:69)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1172)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1156)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:1091)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1156)
    at org.apache.spark.storage.BlockManager.putIterator(BlockManager.scala:914)
    at org.apache.spark.storage.BlockManager.putSingle(BlockManager.scala:1481)
    at org.apache.spark.broadcast.TorrentBroadcast.writeBlocks(TorrentBroadcast.scala:123)
    at org.apache.spark.broadcast.TorrentBroadcast.<init>(TorrentBroadcast.scala:88)
    at org.apache.spark.broadcast.TorrentBroadcastFactory.newBroadcast(TorrentBroadcastFactory.scala:34)
    at org.apache.spark.broadcast.BroadcastManager.newBroadcast(BroadcastManager.scala:62)
    at org.apache.spark.SparkContext.broadcast(SparkContext.scala:1489)
    at org.apache.spark.api.java.JavaSparkContext.broadcast(JavaSparkContext.scala:650)
    at org.broadinstitute.hellbender.tools.HaplotypeCallerSpark.assemblyRegionEvaluatorSupplierBroadcastFunction(HaplotypeCallerSpark.java:265)
    at org.broadinstitute.hellbender.tools.HaplotypeCallerSpark.assemblyRegionEvaluatorSupplierBroadcast(HaplotypeCallerSpark.java:245)
    at org.broadinstitute.hellbender.tools.HaplotypeCallerSpark.callVariantsWithHaplotypeCallerAndWriteOutput(HaplotypeCallerSpark.java:303)
    at org.broadinstitute.hellbender.tools.spark.pipelines.ReadsPipelineSpark.runTool(ReadsPipelineSpark.java:224)
    at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(GATKSparkTool.java:533)
    at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(SparkCommandLineProgram.java:31)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:139)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:191)
    at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:210)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
    at org.broadinstitute.hellbender.Main.main(Main.java:291)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:849)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Caused by: java.lang.RuntimeException: Could not serialize lambda
    at com.esotericsoftware.kryo.serializers.ClosureSerializer.write(ClosureSerializer.java:69)
    at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:575)
    at com.esotericsoftware.kryo.serializers.ObjectField.write(ObjectField.java:79)
    ... 53 more
    Caused by: java.lang.NoSuchMethodException: htsjdk.samtools.reference.AbstractFastaSequenceFile$$Lambda$92/1666456788.writeReplace()
    at java.lang.Class.getDeclaredMethod(Class.java:2130)
    at com.esotericsoftware.kryo.serializers.ClosureSerializer.write(ClosureSerializer.java:60)
    ... 55 more

    So this is another difference in the 4.1.3 version, because the execution completes without errors in version 4.0.10.
    Could you please help me understand why I get this error now?

    Thanks a lot.
    Alessandro
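    The trailing cause (NoSuchMethodException: ...writeReplace()) is the key: Kryo's ClosureSerializer can only handle lambdas that are serializable, i.e. ones that provide the writeReplace hook, and the htsjdk Lazy initializer lambda in the trace does not. The same class of failure can be illustrated in Python with pickle in place of Kryo (purely an analogy, not GATK code):

```python
import pickle

# A bare lambda, like the Lazy initializer in the trace above: it carries
# no serialization hook, so the serializer rejects it.
initializer = lambda: "reference sequence dictionary"

try:
    pickle.dumps(initializer)
    print("serialized")
except Exception as exc:
    print("cannot serialize lambda:", type(exc).__name__)
```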

  • asammarco (Member)

    Hi @cnorman, thanks for your response. Could you please tell me when the next release (with the fix) will be available?
    Thanks
    Thanks

  • cnorman (United States; Member, Broadie, Dev ✭✭)

    @asammarco I expect there will be a release within the next week or so.
