Cromwell/WDL script can't find bwa index

I have a WDL script that is just supposed to do alignment using bwa mem, but I'm getting this error in the stderr log:
[E::bwa_idx_load_from_disk] fail to locate the index files

My script:

workflow bwamem {
call bwa_mem_tool
}

task bwa_mem_tool {
Int threads
Int min_seed_length
File reference
String reads
File reads1
File reads2

command {
bwa mem \
-t ${threads} \
-k ${min_seed_length} \
${reference} \
${reads1} ${reads2} > ${reads}.sam
}

output {
File bamout = "${reads}.sam"
}
}

My input json:
{
"bwamem.bwa_mem_tool.reads": "mother",
"bwamem.bwa_mem_tool.reads1": "/home/campus.ncl.ac.uk/njss3/GATK/Test/fastq/mother_R1.fastq",
"bwamem.bwa_mem_tool.reads2": "/home/campus.ncl.ac.uk/njss3/GATK/Test/fastq/mother_R2.fastq",
"bwamem.bwa_mem_tool.min_seed_length": 16,
"bwamem.bwa_mem_tool.threads": 4,
"bwamem.bwa_mem_tool.reference": "/home/campus.ncl.ac.uk/njss3/GATK/Test/ref/ref.fasta"
}

Result from cromwell console:
[2017-11-13 15:47:28,43] [error] WorkflowManagerActor Workflow 4d844dd9-158c-4dbe-94f2-6d91b79f2e79 failed (during ExecutingWorkflowState): Job bwamem.bwa_mem_tool:NA:1 exited with return code 1 which has not been declared as a valid return code. See 'continueOnReturnCode' runtime attribute for more details.
Check the content of stderr for potential additional information: /home/campus.ncl.ac.uk/njss3/bin/cromwell-executions/bwamem/4d844dd9-158c-4dbe-94f2-6d91b79f2e79/call-bwa_mem_tool/execution/stderr

I have looked in the 4d844dd9-158c-4dbe-94f2-6d91b79f2e79 directory and ref.fasta is there but not the index files. However, the index files are in the working directory specified in the json file.

I would really appreciate it if someone can point me in the right direction to correct this error.

Answers

  • EADG (Kiel, Member)

    Hi @jss ,

    you have to add the index files to your task and input.json too. Otherwise the actual task won't "see" them.

    Greetings EADG
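
    A minimal sketch of what this could look like in the task from the question. The `reference_*` input names are made up for illustration. bwa mem is never told about these files: declaring them as File inputs only makes Cromwell localize them, and since the local backend appears to keep the original directory layout under the task's inputs directory, they should end up next to the localized ref.fasta.

    ```wdl
    task bwa_mem_tool {
      Int threads
      Int min_seed_length
      String reads
      File reads1
      File reads2
      File reference

      # Hypothetical inputs: declared only so that Cromwell localizes
      # the index files; the command below never names them.
      File reference_amb
      File reference_ann
      File reference_bwt
      File reference_pac
      File reference_sa

      command {
        bwa mem \
          -t ${threads} \
          -k ${min_seed_length} \
          ${reference} \
          ${reads1} ${reads2} > ${reads}.sam
      }

      output {
        File bamout = "${reads}.sam"
      }
    }
    ```

    Each new input then needs its own entry in the input json, for example a "bwamem.bwa_mem_tool.reference_bwt" key pointing at the .bwt file that sits beside ref.fasta (assuming the index files are named after the fasta, such as ref.fasta.bwt).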

  • Hi @EADG

    I figured that would be the case, but I'm not sure how. I can define a variable, but how do I tell the script which variable to use? bwa does not have a parameter for the index files; it usually finds them in the same place as the reference fasta.

    Regards
    jss

  • @jss
    I ran into the same problem. The issue is that bwa assumes the index files for reference.fasta are in the same folder and carry the same name plus an extension, e.g. reference.fasta.bwt.

    I use the following task to automatically create all the required index files and expose them as outputs. (You might have to change the commands for picard, samtools and bwa to match your setup.)

    task reference {
        File ref
    
        String basename = basename(ref)
        String refname  = basename(ref,".fasta")
    
        command {
            cp "${ref}" "${basename}"
            java -jar /usr/gitc/picard.jar CreateSequenceDictionary R="${basename}" O="${refname}.dict"
            /usr/local/bin/samtools faidx "${basename}"
            /usr/gitc/bwa index "${basename}"
        }
    
        output {
            File dict   = "${refname}.dict"
            File fasta  = "${basename}"
            File amb    = "${basename}.amb"
            File ann    = "${basename}.ann"
            File bwt    = "${basename}.bwt"
            File fai    = "${basename}.fai"
            File pac    = "${basename}.pac"
            File sa     = "${basename}.sa"
        }
    }
    

    And then you can pass the required files to the bwa task using reference.fai, reference.fasta etc.
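
    A rough sketch of that wiring, assuming the `reference` task above is in the same file and that bwa_mem_tool declares a `reference` File input plus one File input per index file (the `reference_*` names are hypothetical). The remaining bwa_mem_tool inputs (threads, reads and so on) would still come from the input json.

    ```wdl
    workflow bwamem {
      File ref

      # Build the dictionary and the indices once.
      call reference {
        input: ref = ref
      }

      # All outputs of the reference task live in one execution directory,
      # so bwa should find the index files next to the fasta.
      call bwa_mem_tool {
        input:
          reference     = reference.fasta,
          reference_amb = reference.amb,
          reference_ann = reference.ann,
          reference_bwt = reference.bwt,
          reference_pac = reference.pac,
          reference_sa  = reference.sa
      }
    }
    ```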

  • @Redmar_van_den_Berg
    Thanks for your feedback.

    However, I'm a bit lost. I don't understand why I would need to run the above script. My reference fasta and its indices have the same name and they all are in the same folder.

    What I have done is to add a variable of type File that is called ref_fasta_index. It points to the directory with the reference files (fasta and indices). It seems to work but I don't know why because it is not used anywhere.

    I now have another problem. The output file is created in the cromwell execution directory. I want it to be in a specific directory but when I try to specify the directory I get errors again. How can I add a path to my output file?

    I would like to specify a path and append that to String reads, identified in my script above and then append .sam to that. I'm not clear whether to use File or String. I tried both but both result in errors.

  • @Redmar_van_den_Berg
    But what if I want the resultant file moved to somewhere specific? How do I get cromwell to do that?

  • EADG (Kiel, Member)

    Hi @jss ,

    Sorry, at first I was assuming that you use Cromwell as well... but luckily @Redmar_van_den_Berg could help out :)

    To make Cromwell put the results in a specific directory, you have to use the workflow options. Take a look at the Cromwell documentation: https://cromwell.readthedocs.io/en/develop/wf_options/Overview/#output-copying

    You have to create an options.json like:

    {
        "final_workflow_outputs_dir": "/Users/michael_scott/cromwell/outputs",
        "final_workflow_log_dir": "/Users/michael_scott/cromwell/wf_logs",
        "final_call_logs_dir": "/Users/michael_scott/cromwell/call_logs"
    }
    

    (this quote is stolen from the docs) :)

    and then start your workflow like:

    java -jar /path/to/cromwell.jar run your.wdl -i your.input.json -o options.json
    

    Ah, and don't forget to specify which outputs you want copied at the end, like:

    workflow bwamem {
      call bwa_mem_tool
      call do_something {
        input: samfile=bwa_mem_tool.bamout
      }
      output {
        do_something.*
      }
    }
    

    Greetings EADG
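
    If there is no second task, the same idea can be sketched with an explicit declaration in the workflow's output section. This assumes the bwa_mem_tool task from the question with its bamout output; `aligned_sam` is a made-up name. As far as I understand, once an output section is present only the outputs listed there count as workflow outputs and get copied to final_workflow_outputs_dir.

    ```wdl
    workflow bwamem {
      call bwa_mem_tool

      output {
        # Expose the SAM file written by the task as a workflow output.
        File aligned_sam = bwa_mem_tool.bamout
      }
    }
    ```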

  • Hi @EADG and @Redmar_van_den_Berg
    Thank you very much for all your help. I really do appreciate it.

    The script now works, but I must have done something wrong in the next bit, where I try to use the output from task bwa_mem_tool. I feel a bit stupid because I'm sure it is something silly that I'm just not spotting:

    workflow germline {
      File ref_fasta
      File ref_fasta_index
      File sample_name
    
      call bwa_mem_tool {
        input:
          ref_fasta = ref_fasta,
          ref_fasta_index = ref_fasta_index,
          sample_name = sample_name
      }
    
      call samtools_2bam {
        input:
          input_sam = bwa_mem_tool.output_sam,
          sample_name = sample_name
      }
    }
    
    task bwa_mem_tool {
      File sample_name
      File reads1
      File reads2
      Int min_seed_length
      Int threads
      File ref_fasta
      File ref_fasta_index
    
      command {
        bwa mem  \
          -t ${threads} \
          -k ${min_seed_length} \
        ${ref_fasta} \
        ${reads1} ${reads2} > "${sample_name}.sam"
      }
    
      output {
        File output_sam = "${sample_name}.sam"
      }
    }
    
    task samtools_2bam {
      File sample_name
      File input_sam
    
      command {
        samtools view \
          -bS ${input_sam} > "${sample_name}.bam"
      }
    
      output {
        File output_bam = "${sample_name}.bam"
      }
    }
    
  • EADG (Kiel, Member)

    Hi @Jss,

    try removing the double quotes from "${sample_name}.sam" in the command section of bwa_mem_tool.

    Also, you can pipe the output of bwa mem directly into samtools, so you don't actually need two separate tasks to create a BAM file. Like:

    bwa mem \
       -t ${threads} \
       -k ${min_seed_length} \
        ${ref_fasta} \
        ${reads1} ${reads2} | samtools view -1 - > ${sample_name}.bam
    

    Greetings EADG
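
    Wrapped in a full task, the piped version might look like the sketch below. The task name `bwa_mem_to_bam` is made up, sample_name is declared as a String because it is only a label, and ref_fasta_index is kept purely so the index files get localized. The `set -e -o pipefail` line is my own addition and assumes the command runs under bash: without it the exit code of the pipeline is that of samtools alone, so a failing bwa mem could go unnoticed.

    ```wdl
    task bwa_mem_to_bam {
      String sample_name
      File reads1
      File reads2
      Int min_seed_length
      Int threads
      File ref_fasta
      # Not used in the command; declared so the index files are localized.
      File ref_fasta_index

      command {
        # Fail the task if any stage of the pipe fails (bash only).
        set -e -o pipefail
        bwa mem \
          -t ${threads} \
          -k ${min_seed_length} \
          ${ref_fasta} \
          ${reads1} ${reads2} | samtools view -1 - > ${sample_name}.bam
      }

      output {
        File output_bam = "${sample_name}.bam"
      }
    }
    ```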

  • jss (Member)
    edited November 2017

    @EADG
    I know it can be done with a pipe but I'm just trying to learn how to use cromwell and wdl, that is why I'm doing it in a roundabout way.

    I removed the quotes but it still doesn't work. The following is from the error message on the console (if that will help):

    [2017-11-15 14:43:30,98] [error] BackgroundConfigAsyncJobExecutionActor [85403420germline.bwa_mem_tool:NA:1]: Error attempting to Execute
    cromwell.backend.standard.StandardAsyncExecutionActor$$anonfun$$nestedInanonfun$commandLinePreProcessor$1$1$$anon$1: :
    Could not localize mother -> /home/n3/bin/cromwell-executions/germline/85403420-5d01-4eb5-b516-845f706d1d28/call-bwa_mem_tool/inputs/home/n3/bin/mother:
        mother doesn't exists
        File not found /home/n3/bin/cromwell-executions/germline/85403420-5d01-4eb5-b516-845f706d1d28/call-bwa_mem_tool/inputs/home/n3/bin/mother -> /home/n3/bin/mother
        File not found mother
        File not found /home/n3/bin/mother
    

    I don't understand why it is looking in /home/n3/bin/ for the file

  • EADG (Kiel, Member)
    Accepted Answer

    Hi @jss,

    Argh, I think I know what's wrong. Can you change the type of sample_name from File to String in both the workflow and the tasks?

    The problem is that Cromwell/WDL assumes that your sample_name "mother" is a file and tries to copy it from the directory you launched Cromwell in to the inputs directory of your bwa_mem_tool task. Because there is no file with the name mother, Cromwell/WDL throws this exception.

    Greetings EADG
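
    A stripped-down sketch of the difference, with a made-up task `name_output` standing in for bwa_mem_tool. Because sample_name is a String, Cromwell only substitutes the text and never looks for a file called mother:

    ```wdl
    workflow germline {
      # A plain label, not a path, so String rather than File.
      String sample_name

      call name_output {
        input: sample_name = sample_name
      }
    }

    task name_output {
      String sample_name

      command {
        # Placeholder for the real bwa mem call.
        echo "placeholder" > ${sample_name}.sam
      }

      output {
        File output_sam = "${sample_name}.sam"
      }
    }
    ```

    With "germline.sample_name": "mother" in the input json, this would write mother.sam in the task's execution directory.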

  • Hi @EADG
    I just managed to get it working by adding a path to the sample. That is not ideal though so I'm going to give your suggestion a go.

    Thanks.

  • EADG (Kiel, Member)

    Hi @jss ,

    my suggestion is based on the WDL and input.json you already posted. If you have changed them and wouldn't mind reposting them, I will take a look to get a better understanding of the problem.

    Greets EADG

  • Hi @EADG
    It now works perfectly. Here is the wdl and json:

    WDL

    workflow germline {
      File ref_fasta
      File ref_fasta_index
      File sample_name
    
      call bwa_mem_tool {
        input:
          ref_fasta = ref_fasta,
          ref_fasta_index = ref_fasta_index,
          sample = sample_name
      }
    
      call samtools_2bam {
        input:
          input_sam = bwa_mem_tool.output_sam,
          sample = sample_name
      }
    }
    
    task bwa_mem_tool {
      String sample
      File reads1
      File reads2
      Int min_seed_length
      Int threads
      File ref_fasta
      File ref_fasta_index
    
      command {
        bwa mem  \
          -t ${threads} \
          -k ${min_seed_length} \
        ${ref_fasta} \
        ${reads1} ${reads2} > "${sample}.sam"
      }
    
      output {
        File output_sam = "${sample}.sam"
      }
    }
    
    task samtools_2bam {
      String sample
      File input_sam
    
      command {
        samtools view \
          -bS ${input_sam} > "${sample}.bam"
      }
    
      output {
        File output_bam = "${sample}.bam"
      }
    }
    

    JSON

    {
      "germline.bwa_mem_tool.reads1": "/home/n3/GATK/Test/fastq/mother_R1.fastq",
      "germline.bwa_mem_tool.min_seed_length": 16,
      "germline.bwa_mem_tool.threads": 4,
      "germline.bwa_mem_tool.reads2": "/home/n3/GATK/Test/fastq/mother_R2.fastq",
      "germline.sample_name": "mother",
      "germline.ref_fasta": "/home/n3/GATK/Test/ref/ref.fasta",
      "germline.ref_fasta_index": "/home/n3/GATK/Test/ref/"
    }
    
  • jss (Member)
    edited November 2017

    The script above runs perfectly if I run it in command line mode but now it fails when I run it in server mode. I get the same error as right in the beginning: [E::bwa_idx_load_from_disk] fail to locate the index files

    Now it works again. Don't know what changed?!?!

  • Sheila (Broad Institute; Member, Broadie, Moderator)

    @jss
    Hi,

    I just moved this to the WDL section where @KateN can help you.

    -Sheila
