Is there a way to avoid redundant inputs for each task run in my workflow?

Hi

Sorry if this is a dumb question. I am a newbie when it comes to Cromwell and WDL.

I have a workflow that seems to be working ok

However for many of the tasks some files are reused (such as the genome reference file)

Looking at the Cromwell execution folder, it seems it copies these files as inputs for EVERY task that requires them.
This is even worse when a task is run with scatter, where each shard gets its own inputs folder containing the very same files.

This balloons the disk usage. Is there a way to configure cromwell to not cause this amount of redundancy?

Many thanks

Duarte

Answers

  • EADG (Kiel), Member ✭✭✭

    Hi @Pepetideo ,

    Normally Cromwell doesn't copy the files; it tries to create a hard link or soft link for each file that is needed as an input. Only if link creation fails does Cromwell start copying the files. If this happens, you get a warning in your log file.

    Greetings EADG
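To make the fallback order described above concrete, here is a minimal Python sketch of "try a hard link, then a soft link, then copy". It is only an illustration of the idea, not Cromwell's actual implementation; the `localize` helper, the message text, and the file names are hypothetical.

```python
import os
import shutil
import tempfile


def localize(source, destination, strategies=("hard-link", "soft-link", "copy")):
    """Try each strategy in order and return the name of the first that works.

    Hypothetical helper for illustration only (not Cromwell code).
    """
    actions = {
        "hard-link": os.link,      # fails across file systems
        "soft-link": os.symlink,   # works across file systems
        "copy": shutil.copy2,      # always duplicates the data
    }
    source = os.path.abspath(source)
    for name in strategies:
        try:
            actions[name](source, destination)
            return name
        except OSError as error:
            # Report the failure and fall through to the next strategy.
            print(f"{name} failed ({error}); trying the next strategy")
    raise OSError(f"could not localize {source}")


# Small demonstration inside a throwaway directory.
with tempfile.TemporaryDirectory() as workdir:
    reference = os.path.join(workdir, "ref.fasta")
    with open(reference, "w") as handle:
        handle.write(">chr1\nACGT\n")
    used = localize(reference, os.path.join(workdir, "input_ref.fasta"))
    print("localized via", used)
```

Dropping "copy" from `strategies` makes the helper raise instead of silently duplicating data, which is the same trade-off as leaving "copy" out of the localization list in a config.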

  • Pepetideo, Member ✭✭

    Many thanks... I did get a few warnings for some files, like this one:

    [2018-11-01 23:35:19,87] [warn] Localization via hard link has failed: ~/gatk_germline_test/cromwell-executions/my_workflow/f9e21a61-be63-4b45-ba11-41250shard-8/inputs/-1145978593/Mills_and_1000G_gold_standard.indels.vcf.gz.tbi -> /nfs/rodney/department/compbio/bcbio/genomes/Hsapiens/hg38/variation/Mills_and_1000G_gold_standard.indels.vcf.gz.tbi: Invalid cro

    but I checked those files in the inputs folders of those tasks, and they are indeed soft links.

    However, I get no such warning for my genome reference files, and my base recalibration task (run in parallel using scatter):

    # Perform Base Quality Score Recalibration (BQSR) on the sorted BAM in parallel
    scatter (subgroup in CreateSequenceGroupingTSV.sequence_grouping) {
      # Generate the recalibration model by interval
      call BaseRecalibrator {
        input:
          input_bam = SortAndFixTags.output_bam,
          input_bam_index = SortAndFixTags.output_bam_index,
          recalibration_report_filename = bam_basename + ".recal_data.csv",
          sequence_group_interval = subgroup,
          dbSNP_vcf = dbSNP_vcf,
          dbSNP_vcf_index = dbSNP_vcf_index,
          known_indels_sites_VCFs = known_indels_sites_VCFs,
          known_indels_sites_indices = known_indels_sites_indices,
          ref_dict = ref_dict,
          ref_fasta = ref_fasta,
          ref_fasta_index = ref_fasta_index,
          gatk_path = gatk_path,
      }
    }

    copies the reference files as inputs for every shard produced.

    The same is true for my task that takes in the unmapped BAM and aligns it with BWA-MEM:

    # Map reads to reference
    call SamToFastqAndBwaMem {
      input:
        input_bam = unmapped_bam,
        bwa_commandline = bwa_commandline,
        output_bam_basename = bam_basename + ".unmerged",
        ref_fasta = ref_fasta,
        ref_fasta_index = ref_fasta_index,
        ref_dict = ref_dict,
        bwa_path = bwa_path,
        picard_path = picard_path,
        compression_level = compression_level
    }

    The reference files are present in the inputs folder not as soft links but as full files.
    (I could not find any warning about them in my log.)

  • EADG (Kiel), Member ✭✭✭

    Hi,

    It looks like you are running Cromwell on an NFS file share and your reference files are on a (local) machine, is that right?

    Can you copy your reference files to that share? It would help if they were on the same file system as the Cromwell run. Alternatively, run your tasks locally (in the same place where the reference files are) and copy the output to the share. You can declare an output directory when you start Cromwell, or set it for every job by using an options.json.

    Greets,

    EADG

  • Pepetideo, Member ✭✭

    No... both the reference files and Cromwell are on the same file system and share.
    In fact, the reference FASTA and its BWA index are in the same folder where Cromwell is producing its results (cromwell-executions and cromwell-logs).

  • EADG (Kiel), Member ✭✭✭

    Hm, OK. I was curious about the /nfs/rodney/department/ path.

    Can you try to create soft links and hard links for your reference files manually?
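One way to run that manual check is a short Python script. This is a sketch under assumptions: the helper names `same_filesystem` and `can_hard_link` and the probe file name are made up for illustration, and comparing `st_dev` is only a heuristic (on some setups, such as bind mounts, a hard link can still fail even when the device numbers match), so the second helper simply attempts a real link.

```python
import os
import tempfile


def same_filesystem(path_a, path_b):
    """Heuristic: both paths report the same device number (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev


def can_hard_link(source, target_dir):
    """Try to create (and then remove) a hard link to source in target_dir."""
    probe = os.path.join(target_dir, ".hardlink_probe")  # hypothetical name
    try:
        os.link(source, probe)
    except OSError:
        return False
    os.remove(probe)
    return True


# Demonstration with a throwaway file; replace the paths with a real
# reference file and the cromwell-executions folder to test a real setup.
with tempfile.TemporaryDirectory() as workdir:
    reference = os.path.join(workdir, "ref.fasta")
    with open(reference, "w") as handle:
        handle.write(">chr1\nACGT\n")
    print("same file system:", same_filesystem(reference, workdir))
    print("hard link possible:", can_hard_link(reference, workdir))
```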

  • Pepetideo, Member ✭✭

    Sorry... I should have explained a bit better.

    Yes... Some of these files are elsewhere, and for those I did get a warning message (the one I shared above). However, the files for which I got a warning message have been soft-linked in the inputs folder in the way you described.

    My issue is with the reference FASTA files that are contained in the same folder as the WDL workflow.

    These are definitely in the same filestore as Cromwell, and nevertheless they appear to be copied instead of soft-linked, leading to the large disk space usage.

  • EADG (Kiel), Member ✭✭✭

    Hi,

    Strange... which version of Cromwell do you use?

    I would try two things. First, move the reference files into a different folder than the WDL workflow.
    Second, add this to your config:

        local {
          # Try to hard link (ln), then soft-link (ln -s), and if both fail, then copy the files.
          localization: [
            "hard-link", "soft-link", "copy"
          ]
        }

    but without "copy" in the list, so Cromwell will only try to hard-link or soft-link your files.

    Don't know if this helps but I think it is worth a try ;)

  • Pepetideo, Member ✭✭

    I used cromwell 3.6

    I will try your suggestion... many thanks for your help

  • Pepetideo, Member ✭✭

    OK... after reading your answer a bit more carefully, I think I know what is going on.

    You pointed out that Cromwell first tries to create hard links.

    I was calculating the size of the execution folder using du.

    And hard links are counted like any other file towards the total disk space.

    You are correct... the files I thought were copies are in fact hard links.
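For anyone who wants to verify this on their own run, the sketch below tells hard links, soft links, and copies apart by comparing device and inode numbers, and sums a folder while counting each inode only once. The helper names `link_kind` and `unique_bytes` are hypothetical, and the total uses the apparent size (`st_size`) rather than allocated blocks, so it will not match du exactly. Note also that GNU du already counts a multiply linked file once within a single invocation; what it cannot know is that the original lives outside the folder being measured.

```python
import os


def link_kind(path, original):
    """Classify path relative to original: soft-link, hard-link, or copy."""
    if os.path.islink(path):
        return "soft-link"
    a, b = os.stat(path), os.stat(original)
    if (a.st_dev, a.st_ino) == (b.st_dev, b.st_ino):
        return "hard-link"  # same inode: no extra data on disk
    return "copy"


def unique_bytes(root):
    """Sum apparent file sizes under root, counting each inode once."""
    seen = set()
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            info = os.lstat(os.path.join(dirpath, name))  # do not follow links
            key = (info.st_dev, info.st_ino)
            if key not in seen:
                seen.add(key)
                total += info.st_size
    return total
```

Calling `link_kind` on a file in a shard's inputs folder together with the path of the original reference file shows which localization strategy was actually used for it.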

    I would actually prefer to use soft links instead.

    Is there any problem with adapting your snippet to configure it this way?

    local {
      # Try to soft-link (ln -s), then hard link (ln), and if both fail, then copy the files.
      localization: [
        "soft-link", "hard-link", "copy"
      ]
    }

  • EADG (Kiel), Member ✭✭✭
    edited November 5

    Hi,

    I don't think so. The snippet comes from the standard application.conf:
    https://github.com/broadinstitute/cromwell/blob/develop/cromwell.examples.conf

    Switching the order of hard link and soft link should work :)
