
Delocalize file to a specific GCS storage bucket location

I'm using Cromwell 31.1 and a simple WDL workflow to upload a bgzipped gVCF to a GCS bucket and then generate a tabix index. Cromwell is not delocalizing the output tabix index file to the location I expected. Here is my workflow:

workflow load_sample {

    String source_dir
    String bucket
    Array[String] sample_names

    scatter (name in sample_names) {
        call upload { input: source_dir=source_dir, sample_name=name, bucket=bucket }
        call create_index { input: bucket_vcf=upload.output_vcf }
    }

    output {
        create_index.bucket_vcf_index
    }
}

task upload {
    String source_dir
    String sample_name
    String bucket

    String source_vcf = source_dir + sample_name + ".original.gvcf.gz"
    String bucket_vcf = bucket + sample_name + ".g.vcf.gz"

    command {
        gsutil -m -o GSUtil:parallel_composite_upload_threshold=150M cp ${source_vcf} ${bucket_vcf}
    }
    runtime {
        backend: "Local"
    }

    output {
        String output_vcf = bucket_vcf
    }
}

task create_index {
    File bucket_vcf

    command {
        /bin/tabix ${bucket_vcf}
    }

    output {
        File bucket_vcf_index = bucket_vcf + ".tbi"
    }

    runtime {
        docker: "gcr.io/my-project/htslib"
    }
}

The value of upload.output_vcf is gs://my-project/samples/sample-1.g.vcf.gz and the value of create_index.bucket_vcf_index is gs://my-project/samples/sample-1.g.vcf.gz.tbi. I expected the tabix index file to be delocalized to that GCS location. Instead, it is delocalized to gs://my-project/workflows/upload/<UUID>/call-create_index/shard-0/my-project/samples/sample-1.g.vcf.gz.tbi, which is under the bucket location I have set for the jes_gcs_root option.

Is there a way to have Cromwell put the tabix index file in the location specified by bucket_vcf_index?

Answers

  • Ruchi (Member, Broadie, Moderator, Dev admin)

    Hey @mmundy,

    Instead of declaring a File output on the create_index task, you could add a gsutil cp command inside the task that copies the index to a specified bucket location (passed in as a task input). As for the output: if you're interested in capturing the final location as a string, bucket_vcf_index can also be the task's output.

    For example:

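    task create_index {
        File bucket_vcf
        # Destination gs:// path for the index. Declared as a String task input
        # so that Cromwell does not try to localize it.
        String bucket_vcf_index

        # tabix writes the index next to the localized VCF, with ".tbi" appended.
        # This assumes gsutil is available inside the Docker image.
        command {
            /bin/tabix ${bucket_vcf}

            gsutil -m -o GSUtil:parallel_composite_upload_threshold=150M cp ${bucket_vcf}.tbi ${bucket_vcf_index}
        }

        # Report the final location as a String rather than a File, so Cromwell
        # has nothing to delocalize. The output name output_vcf_index is only
        # illustrative; the workflow output would need to reference it.
        output {
            String output_vcf_index = bucket_vcf_index
        }

        runtime {
            docker: "gcr.io/my-project/htslib"
        }
    }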
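
    To connect this to the original workflow, the call could pass the index destination explicitly. This is only a sketch: it assumes create_index accepts bucket_vcf_index as a String input that can be set from the call, and that the Docker image can run gsutil with credentials allowed to write to the bucket. Neither assumption is confirmed in this thread.

    ```wdl
    workflow load_sample {
        String source_dir
        String bucket
        Array[String] sample_names

        scatter (name in sample_names) {
            call upload { input: source_dir=source_dir, sample_name=name, bucket=bucket }

            # The destination is built from the uploaded VCF path and passed as
            # a String, so Cromwell treats it as plain text and does not try to
            # localize or delocalize it.
            call create_index {
                input:
                    bucket_vcf=upload.output_vcf,
                    bucket_vcf_index=upload.output_vcf + ".tbi"
            }
        }
    }
    ```

    Passing the path as a String keeps the copy under the task's control, at the cost that Cromwell no longer tracks the index as a File output.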
    
    