Using a participant with an "array of files" annotation as input to a workflow

aryee Member, Broadie

I have a WDL (https://portal.firecloud.org/#methods/aryeelab/preprocess_hic/1) that runs locally and operates on two Array[File] inputs. (The samples were sequenced multiple times and thus have multiple FASTQs associated with them.) The WDL takes as input:

        Array[File] r1_fastq
        Array[File] r2_fastq

The corresponding input json contains:

  "preprocess_hic.hicpro.r1_fastq": ["test_data/hic/imr90_small/SRR1658672_1.small.fastq.gz", "test_data/hic/imr90_small/SRR1658673_1.small.fastq.gz"],
  "preprocess_hic.hicpro.r2_fastq": ["test_data/hic/imr90_small/SRR1658672_2.small.fastq.gz", "test_data/hic/imr90_small/SRR1658673_2.small.fastq.gz"],

How should I specify these Array[File] inputs in the metadata TSV I upload to FireCloud?

Best Answers

  • aryee
    Accepted Answer

    Thanks @Tiffany_at_Broad and @esalinas! We're now successfully using Tiffany's split task. We pass an input String containing a comma-separated list of files. The split task turns this into an array of Files that's then used as the input for the next task.
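
The split task itself is not reproduced in this thread. As a rough sketch of the idea only, the shell below shows the kind of command such a task could run: it turns a comma-separated string into one path per line. Inside a WDL task, output like this could be picked up with read_lines(stdout()). The function name split_file_list is made up for illustration, and the sketch assumes the paths contain no commas.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print each entry of a comma-separated list on its own line.
# Hypothetical helper; assumes no path contains a comma.
split_file_list() {
    local file_list="$1"
    # tr replaces every comma with a newline
    printf '%s\n' "$file_list" | tr ',' '\n'
}

# Example value taken from the question in this thread
split_file_list "gs://bucket/file1.bam,gs://bucket/file2.bam"
```

Each printed line would then become one element of the array passed to the next task.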

Answers

  • aryee Member, Broadie

    Instead of using unmapped_bams_file_of_file_names as the first input, could I use a string instead? For example:

    String unmapped_bam_file_names
    

    where unmapped_bam_file_names = "gs://bucket/file1.bam,gs://bucket/file2.bam"

    Is there a way to split this string using a separator (like "," in this case) and assign the output to Array[File]? This would let me avoid making and uploading two separate file_lists for each sample.

  • Tiffany_at_Broad Cambridge, MA Member, Administrator, Broadie, Moderator admin
    edited October 2017

    A WDL resource that can help:
    Read_lines WDL function explained (the first thing I recommended)

    I've attached a task that creates an Array[String] output.

  • esalinas Broad Member, Broadie ✭✭✭
    edited October 2017

    Hi @aryee, one thing you could try is a FOF (file-of-files). See the WDL, the example input, and the stdout.

    Here is the input, shown by reading the files in the bucket:

    wm8b1-75c:fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3 esalinas$ for BUCKET_FILE in `gsutil ls gs://fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/*.txt`; do echo "Found bucket file $BUCKET_FILE" ; echo "Its contents : " ; gsutil cat $BUCKET_FILE ; done ;
    Found bucket file gs://fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/bucket_fof.txt
    Its contents : 
    gs://fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/f1.txt
    gs://fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/f2.txt
    gs://fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/f3.txt
    Found bucket file gs://fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/f1.txt
    Its contents : 
    I am a file and this is my data.
    Found bucket file gs://fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/f2.txt
    Its contents : 
    I am a different file and here is my other data.
    Found bucket file gs://fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/f3.txt
    Its contents : 
    oh my data!
    wm8b1-75c:fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3 esalinas$ 
    

    Here is the WDL (see the usage of read_lines and write_lines). Note that there is a single input: the FOF, whose lines are GS URLs. It is the "bucket_fof.txt" file shown in the bucket listing above.

    task fof_usage_task {
        File fof
        Array[File] my_files = read_lines(fof)

        command <<<
        # increase verbosity and adjust error tolerance
        set -eux -o pipefail

        # show files (localized paths)
        FOF_LOCALIZED=${write_lines(my_files)}
        cat $FOF_LOCALIZED

        # same thing, but has bucket URLs
        cat ${fof}

        # read contents of files
        for F in `cat $FOF_LOCALIZED`; do
            echo "To show contents of $F"
            cat $F
        done
        >>>

        runtime {
            docker: "ubuntu:16.04"
            disks: "local-disk 1 HDD"
            memory: "0.6GB"
        }
    }

    workflow fof_usage_wf {
        call fof_usage_task
    }
    

    Here is the output of the WDL. Note the FOF in localized-file (cromwell_root) form and in bucket-URL (gs) form:

    wm8b1-75c:fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3 esalinas$ gsutil cat gs://fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/8400399e-06a2-4394-978e-75a3ec40edc5/fof_usage_wf/67e9534d-a1fb-43d0-8c14-7b994fde61b6/call-fof_usage_task/fof_usage_task-stdout.log
    /cromwell_root/fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/f1.txt
    /cromwell_root/fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/f2.txt
    /cromwell_root/fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/f3.txt
    gs://fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/f1.txt
    gs://fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/f2.txt
    gs://fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/f3.txt
    To show contents of /cromwell_root/fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/f1.txt
    I am a file and this is my data.
    To show contents of /cromwell_root/fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/f2.txt
    I am a different file and here is my other data.
    To show contents of /cromwell_root/fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3/f3.txt
    oh my data!
    wm8b1-75c:fc-4b9df97f-1c1f-4a4c-9617-3493695dcbd3 esalinas$ 
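
One caveat about the loop in the task's command block: `for F in` over backtick-cat output splits on any whitespace, so it only works when no path contains a space. A hedged alternative, in case that assumption might not hold, is a `while read` loop. The function name and the demo files below are hypothetical and stand in for the localized files.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Print a header line and the contents of every file listed in a FOF
# (one path per line). Hypothetical helper for illustration.
show_fof_contents() {
    local fof="$1"
    local path
    # IFS= and -r keep each line intact, including spaces and backslashes
    while IFS= read -r path; do
        # skip blank lines
        [ -n "$path" ] || continue
        echo "To show contents of $path"
        cat "$path"
    done < "$fof"
}

# Small local demonstration; one file name deliberately contains a space
demo_dir="$(mktemp -d)"
echo "I am a file and this is my data." > "$demo_dir/f1.txt"
echo "oh my data!" > "$demo_dir/f 3.txt"
printf '%s\n' "$demo_dir/f1.txt" "$demo_dir/f 3.txt" > "$demo_dir/local_fof.txt"

show_fof_contents "$demo_dir/local_fof.txt"
```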
    
    
    
    

    It seems that 1) using two FOFs like this could achieve the desired objective, and 2) it would not require much editing of any WDL.
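
For the two-FOF route, a minimal sketch of preparing the files for one sample might look like the following. The bucket path gs://my-bucket/hic and the helper write_fof are placeholders, not anything from the original workflow; the FASTQ file names are the ones from the question. The upload is left as a comment because it needs gsutil and a real bucket.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write one path per line to a FOF.
# First argument is the FOF to create, the remaining arguments are paths.
write_fof() {
    local fof="$1"
    shift
    printf '%s\n' "$@" > "$fof"
}

# Hypothetical bucket location; replace with the workspace bucket
BUCKET="gs://my-bucket/hic"
workdir="$(mktemp -d)"

# One FOF per read direction for this sample
write_fof "$workdir/r1_fof.txt" \
    "$BUCKET/SRR1658672_1.small.fastq.gz" \
    "$BUCKET/SRR1658673_1.small.fastq.gz"
write_fof "$workdir/r2_fof.txt" \
    "$BUCKET/SRR1658672_2.small.fastq.gz" \
    "$BUCKET/SRR1658673_2.small.fastq.gz"

cat "$workdir/r1_fof.txt" "$workdir/r2_fof.txt"

# Upload step, not run here:
# gsutil cp "$workdir/r1_fof.txt" "$workdir/r2_fof.txt" "$BUCKET/"
```

Each FOF's bucket URL would then be the single File value for that sample, read inside the WDL with read_lines as in the task above.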

    @Tiffany_at_Broad if FC can write an array to a data entity then a user should be able to do so as well.

