Wrong path to file after executing method to index a BAM in another google bucket

akmanning United States Member

Hello --
We created a very simple WDL which indexes a BAM file. The BAM is located in another google bucket, and the index file is created within the Workspace google bucket. But the resulting path to the bai is not correct -- it points to the google bucket in which the BAM file sits, not to the index file in the Workspace google bucket.

Issue · Github
by KateN

Issue Number: 2096
State: closed
Closed By: knoblett

Answers

  • akmanning United States Member
    edited May 2017

    Here is the exec:

    #!/bin/bash
    tmpDir=$(mktemp -d /cromwell_root/tmp.XXXXXX)
    chmod 777 $tmpDir
    export _JAVA_OPTIONS=-Djava.io.tmpdir=$tmpDir
    export TMPDIR=$tmpDir
    
    (
    cd /cromwell_root
    samtools index /cromwell_root/genomics-public-data/1000-genomes/bam/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
    )
    echo $? > /cromwell_root/indexer-rc.txt.tmp
    (
    cd /cromwell_root
    
    )
    sync
    mv /cromwell_root/indexer-rc.txt.tmp /cromwell_root/indexer-rc.txt
    
  • akmanning United States Member

    And here are the monitor details:

    Workflow ID: ee162a7e-9975-47ac-9ad9-13b1b3dc8281
    Status:
    Call Caching: Disabled
    Submitted: May 16, 2017, 5:29 PM (1 day ago)
    Started: May 16, 2017, 5:29 PM (1 day ago)
    Ended: May 16, 2017, 5:44 PM (1 day ago)
    Inputs:
    indexer_wf.indexer.bam → gs://genomics-public-data/1000-genomes/bam/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam
    Outputs:
    indexer_wf.indexer.index → gs://genomics-public-data/1000-genomes/bam/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
    Workflow Log: workflow.ee162a7e-9975-47ac-9ad9-13b1b3dc8281.log
    Calls:
    indexer_wf.indexer
    
  • esalinas Broad Member, Broadie ✭✭✭

    @akmanning I witnessed this issue along with you. In fact, I launched the WDL which created the index. We observed in the data tab "gs://genomics-public-data/....../*.bai" (which is a valid GSURL, but the file is NOT there) instead of "gs://fc-....../..../....bai" where the file was actually delocalized to.

  • esalinas Broad Member, Broadie ✭✭✭

    The tool run is just a BAM indexer (samtools):

    task indexer
    {
    
    File bam
    String output_file_name = sub(bam, "\\.bam$", ".bam.bai")
    
    command <<<
    samtools index ${bam}
    >>>
    
    runtime {
        docker : "gsaksena/samtools_filter:1"
        disks: "local-disk 200 HDD"
    }
    
    output {
        File index="${output_file_name}"
    }
    
    }
    
    workflow indexer_wf {
    
        call indexer
    }
    

    The output_file_name variable was modeled after the example here: https://github.com/broadinstitute/wdl/blob/develop/SPEC.md#string-substring-string-string

    Should an additional "sub" be called to strip the leading "gs://..." part?

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    My gut says this is a bug, but I will be investigating this more before filing away a ticket.

  • Thib Cambridge Member, Broadie, Dev ✭✭

    Hi !

    This is failing because sub(bam, "\\.bam$", ".bam.bai") only swaps the extension of your input path and leaves the rest of the path untouched.
    For example, if your input bam is located at gs://my_bucket/mybam.bam, then output_file_name will be gs://my_bucket/mybam.bam.bai, which is a full bucket URL rather than a filename.
    The output section of the task expects File values to be local paths to files that the tool created and that should be delocalized.
    As @esalinas suggested, to get the filename we can nest 2 subs:

    String output_file_name = sub(sub(bam, "gs://.*/", ""), "\\.bam$", ".bam.bai")

    Assuming that samtools index ${bam} creates a file with the same name as the bam plus a .bai suffix (that is, ending in .bam.bai), this should work.

    Note that Cromwell will very soon support a basename WDL function that will hopefully make this more intuitive!
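
    To see what the two expressions above evaluate to, here is a minimal shell sketch that uses sed as a stand-in for WDL's sub. This is an assumption for illustration only: WDL evaluates its own regex engine, but these two patterns are simple enough that sed -E should behave the same way. The variable names single and nested are hypothetical.

    ```shell
    #!/bin/bash
    # Illustration only: sed stands in for WDL's sub() on the example path from this answer.
    bam="gs://my_bucket/mybam.bam"

    # One substitution: only the extension changes, the gs:// prefix is kept.
    single=$(printf '%s\n' "$bam" | sed -E 's/\.bam$/.bam.bai/')

    # Two substitutions: drop everything up to the last slash, then swap the extension.
    nested=$(printf '%s\n' "$bam" | sed -E 's#gs://.*/##' | sed -E 's/\.bam$/.bam.bai/')

    echo "$single"   # gs://my_bucket/mybam.bam.bai
    echo "$nested"   # mybam.bam.bai
    ```

    Because .* is greedy, the first pattern removes the bucket name and any intermediate directories in one go. Whether the index is then found under that bare filename still depends on where samtools index writes it, so it is worth checking the call directory after a run.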

  • esalinas Broad Member, Broadie ✭✭✭

    @Thib @KateN the task ended with a success code and delocalized the file to the workspace's bucket here:

    wm8b1-75c:test_11 esalinas$ gsutil ls gs://fc-8448fd62-67c7-47dd-965b-2d0167bd5ce6/39e398f0-3873-4215-993a-3450b0d0c15a/**|egrep  '\.bai$'
    gs://fc-8448fd62-67c7-47dd-965b-2d0167bd5ce6/39e398f0-3873-4215-993a-3450b0d0c15a/indexer_wf/e6a1d4ee-fdaf-4cda-9b18-a31650d39847/call-indexer/genomics-public-data/1000-genomes/bam/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
    wm8b1-75c:test_11 esalinas$ 
    
    

    Note the "genomics-public-data/1000-genomes/bam" in the "call-indexer" directory.
    The source of the BAM was: gs://genomics-public-data/1000-genomes/bam/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam

    The data-entity model was updated with:
    gs://genomics-public-data/1000-genomes/bam/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam.bai
    which points to the "genomics-public-data" bucket rather than to the workspace's bucket.
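
The relationship between the three paths in this thread can be sketched in shell. This is inferred only from the paths quoted above (the exec script, the gsutil listing, and the data model entry), not from Cromwell documentation, and the variable names are hypothetical.

```shell
#!/bin/bash
# Sketch inferred from the paths observed in this thread; not an official description of Cromwell.
bam_url="gs://genomics-public-data/1000-genomes/bam/HG00096.mapped.ILLUMINA.bwa.GBR.low_coverage.20120522.bam"
call_root="gs://fc-8448fd62-67c7-47dd-965b-2d0167bd5ce6/39e398f0-3873-4215-993a-3450b0d0c15a/indexer_wf/e6a1d4ee-fdaf-4cda-9b18-a31650d39847/call-indexer"

rel="${bam_url#gs://}"                 # bucket name plus object path
local_bam="/cromwell_root/${rel}"      # path passed to samtools in the exec script
actual_bai="${call_root}/${rel}.bai"   # where gsutil ls found the index
reported_bai="${bam_url}.bai"          # what the data model recorded

echo "localized BAM:  $local_bam"
echo "index written:  $actual_bai"
echo "index reported: $reported_bai"
```

The last two lines of output differ only in their prefix: the index exists under the call directory in the workspace bucket, while the recorded URL keeps the prefix of the input BAM.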
