BAM file is not good for ReadsPipelineSpark version 4.1.3 while it's good for version

Hi all.
When running the ReadsPipelineSpark version 4.1.3 on my BAM file, I got the following exception:

A USER ERROR has occurred: Failed to read bam header from hdfs://cloudera08/gatk-test2/WES2019-024_S5_rgok.BAM
Caused by:Cannot find format extension for hdfs://cloudera08/gatk-test2/WES2019-024_S5_rgok.BAM

org.broadinstitute.hellbender.exceptions.UserException: Failed to read bam header from hdfs://cloudera08/gatk-test2/WES2019-024_S5_rgok.BAM
Caused by:Cannot find format extension for hdfs://cloudera08/gatk-test2/WES2019-024_S5_rgok.BAM
at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeReads(
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.initializeToolInputs(
at org.broadinstitute.hellbender.engine.spark.GATKSparkTool.runPipeline(
at org.broadinstitute.hellbender.engine.spark.SparkCommandLineProgram.doWork(
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(
at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(
at org.broadinstitute.hellbender.Main.runCommandLineProgram(
at org.broadinstitute.hellbender.Main.mainEntry(
at org.broadinstitute.hellbender.Main.main(
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(
at sun.reflect.DelegatingMethodAccessorImpl.invoke(
at java.lang.reflect.Method.invoke(
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:195)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:86)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:924)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:933)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.IllegalArgumentException: Cannot find format extension for hdfs://cloudera08/gatk-test2/WES2019-024_S5_rgok.BAM
at org.broadinstitute.hellbender.engine.spark.datasources.ReadsSparkSource.getHeader(
... 22 more

The same file runs to completion with the same tool in version 4.0.10.
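
The message complains about a "format extension" rather than about the header contents, so one thing that may be worth ruling out is the upper-case `.BAM` suffix. The sketch below only illustrates that check. The list of accepted suffixes and the case-sensitive comparison are assumptions about how the 4.1.3 reader picks a format, not something confirmed in this thread, and both helper names are hypothetical.

```python
import posixpath

# Assumption: the reader matches these suffixes case-sensitively.
ASSUMED_EXTENSIONS = (".bam", ".cram", ".sam")


def has_recognized_extension(path: str) -> bool:
    """Return True if path ends with one of the assumed suffixes, exactly as written."""
    return path.endswith(ASSUMED_EXTENSIONS)


def suggest_lowercase_name(path: str) -> str:
    """Return path with its final suffix lower-cased (hypothetical helper)."""
    root, ext = posixpath.splitext(path)
    return root + ext.lower()


if __name__ == "__main__":
    original = "hdfs://cloudera08/gatk-test2/WES2019-024_S5_rgok.BAM"
    print(has_recognized_extension(original))
    print(suggest_lowercase_name(original))
```

If the suffix does turn out to matter, renaming the file on HDFS (for example with `hdfs dfs -mv`) and rerunning the tool would confirm it.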

This is the header as shown by samtools:
[[email protected] temp]# samtools-1.7/samtools view -H WES2019-024-40044/WES2019-024_S5_rgok.BAM
@HD VN:1.5 SO:queryname
@RG ID:A LB:WES2019-024_S5 PL:illumina SM:WES2019-024_S5 PU:L1

I don't understand what version 4.1.3 of the tool considers wrong in this header when version 4.0.10 accepts it.

Thanks a lot.
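
A header like the one above can also be inspected programmatically. The following is a minimal sketch that parses the text printed by `samtools view -H` and reports the declared sort order, the number of `@SQ` (sequence dictionary) lines, and the read group IDs. The function name `summarize_header` is illustrative, and the sketch assumes whitespace-separated fields as they appear in this post (real SAM headers are tab-separated, which `split()` also handles).

```python
def summarize_header(header_text: str) -> dict:
    """Summarize sort order, @SQ count and read-group IDs from SAM header text."""
    summary = {"sort_order": None, "sq_count": 0, "read_groups": []}
    for line in header_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        record_type = fields[0]
        # Tags look like "SO:queryname"; split on the first colon only.
        tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
        if record_type == "@HD":
            summary["sort_order"] = tags.get("SO")
        elif record_type == "@SQ":
            summary["sq_count"] += 1
        elif record_type == "@RG":
            summary["read_groups"].append(tags.get("ID"))
    return summary


if __name__ == "__main__":
    header = (
        "@HD VN:1.5 SO:queryname\n"
        "@RG ID:A LB:WES2019-024_S5 PL:illumina SM:WES2019-024_S5 PU:L1\n"
    )
    print(summarize_header(header))
```

For the header shown here, this reports a queryname sort order, no `@SQ` lines, and a single read group. An unaligned BAM produced from FASTQ files would normally have no `@SQ` lines.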


  • SkyWarrior Turkey Member ✭✭✭

    Does your BAM header include a sequence dictionary? Maybe that's why 4.1.3 fails.

  • asammarco Member

    Many thanks for your response.
    This is the output of the samtools view -H command:

    -bash-4.1$ ../samtools-1.7/samtools view -H WES2019-022_S4.BAM
    @HD VN:1.5 SO:coordinate
    @RG ID:WES2019-022 SM:WES2019-022_S4 PL:illumina PU:L1

    There is no sequence dictionary.
    I created that BAM using the FastqToSam tool because I have only FASTQ files and I want to use the ReadsPipelineSpark tool, which doesn't accept FASTQ as input.

    This is the command I used:
    /opt/gatk/gatk- FastqToSam --FASTQ WES2019-022_S4_R1_001.fastq.gz --FASTQ2 WES2019-022_S4_R2_001.fastq.gz --SAMPLE_NAME WES2019-022_S4 --OUTPUT WES2019-022_S4.BAM --CREATE_INDEX true --SORT_ORDER coordinate --READ_GROUP_NAME WES2019-022 --PLATFORM illumina --PLATFORM_UNIT L1

    Could you please tell me what I am doing wrong here?

    Thanks a lot for your help.

  • SkyWarrior Turkey Member ✭✭✭

    Since there is no mapping, there is no need for SO to be set to coordinate. Can you change it to queryname, or leave that parameter untouched so that SO is set to queryname? That could also be a problem: since the header says coordinate, the tool may be looking for a sequence dictionary.

  • asammarco Member

    The problem persists even with a BAM created without setting the sort order in the FastqToSam tool:

    -bash-4.1$ ../samtools-1.7/samtools view -H WES2019-022_S4.BAM
    @HD VN:1.5 SO:queryname
    @RG ID:WES2019-022 SM:WES2019-022_S4 PL:illumina PU:L1

    The ReadsPipelineSpark execution fails again with the same error:

    A USER ERROR has occurred: Failed to read bam header from hdfs://cloudera08/gatk-test2/WES2019-022_S4.BAM
    Caused by:Cannot find format extension for hdfs://cloudera08/gatk-test2/WES2019-022_S4.BAM

  • bhanuGandham Cambridge MA Member, Administrator, Broadie, Moderator admin
    edited September 21