Samtools fails with an error

nrashi, DC - District of Columbia, Member

Hi,

I've been trying to run a samtools WDL file locally, and it keeps ending in a failed state. Attached is a screenshot for reference. Can someone please spot the issue and help me run this locally before I push it to FireCloud? Also, can a user push or run a tool on FireCloud if it hasn't been tested locally using Cromwell?

Thanks,

RN

Answers

  • birger, Member, Broadie, CGA-mod ✭✭✭

    Hello Rashi,

    Could you share with me your samtools.WDL and the samtools.json?

    And to answer your second question: Yes, you can push a tool to FireCloud that hasn't been first tested locally running cromwell. We recommend, however, first testing locally as the test/debug/fix code development cycle is faster running your test code locally compared with running on the cloud (note that we are working to address this with a FireCloud Developers Workbench that we are currently assembling).

    -Chet

  • nrashi, DC - District of Columbia, Member

    Hi Chet,

    Attached are the WDL and JSON files. For some reason the output is not being generated in the directory defined. In addition, when I tried to import this WDL on FireCloud, it got uploaded, but nothing happens when I try to import it into a workspace. I'm not sure what the issue is there.

    Thanks for helping!

    I wasn't able to attach a WDL file here, so I'm pasting it below; the JSON follows the WDL:

    task samtools {
      File INBAM

      command {
        echo pwd
        samtools index ${INBAM}
        echo pwd
        cp ${INBAM}.bai .
      }
      runtime {
        docker: "stevetsa/samtools:v1.3.1"
      }
      output {
        File response_star = stdout()
        File outbam = "sm2G28029.bam.bai"
      }
    }

    workflow BamIndex {
      File INBAM

      call samtools {
        input:
          INBAM = INBAM
      }
    }

    JSON

    {
      "BamIndex.INBAM": "sm2G28029.bam"
    }

    Thanks for helping!

    Rashi

  • esalinas, Broad, Member, Broadie ✭✭✭
    edited November 2016

    I've been able to push the WDL into FireCloud and make a method config from it.
    FireCloud has been updated since your attempt, so retrying the upload and/or the workspace import may now succeed.
    After importing the WDL into a workspace, however, I ran it on a different BAM, and the delocalization was unsuccessful.

    I changed the WDL (as seen below) so that the name of the BAM is incorporated into the index output name, and it then ran successfully.

    In the attached figure you can see how I set up the method config with the BAM and its name to allow successful execution.

    task samtools {
      File INBAM
      # File name of the input BAM, including the .bam extension
      String bamName

      command {
        echo `pwd`
        samtools index ${INBAM}
        echo `pwd`
        cp ${INBAM}.bai .
      }
      runtime {
        docker: "stevetsa/samtools:v1.3.1"
      }
      output {
        File response_star = stdout()
        # Built from the input name, so it matches whatever BAM is supplied
        File index = "${bamName}.bai"
      }
    }

    workflow BamIndex {
      File INBAM
      String bamName

      call samtools {
        input:
          INBAM = INBAM,
          bamName = bamName
      }
    }
    
  • nrashi, DC - District of Columbia, Member

    Hi Eddie,

    Thanks for the solution; I was able to import this into a workspace this time. However, when I try to enter a BAM file in the expression box as gs://console.cloud.google.com/storage/browser/fc-d24fc78b-8aa4-43fe-a6fb-0519d6bfed3a/sm2G28029.bam, FireCloud still shows a red error on the expression. Can I add you to the workspace so you can have a look? I'm probably entering the bucket address wrong. As far as I know, the format should be gs://[Google_Bucket]/[file_name].

    Thanks for helping,

    Rashi

  • abaumann, Broad DSDE, Member, Broadie ✭✭✭

    This should work as long as you put double quotes around the bucket URL. If that is not working, can you take a screenshot of the method config section and its error?

  • esalinas, Broad, Member, Broadie ✭✭✭

    @nrashi
    Alex Baumann is correct that you need quotes (as seen in the figure I uploaded). You are also right that the bucket address you entered is wrong.
    You need to enter the bucket URL, not the URL used to view the bucket in the browser-based console.
    The bucket URL looks like this: "gs://fc-1234-abcd-5678-efgh/some/path/to/your/data.bam"

  • esalinas, Broad, Member, Broadie ✭✭✭

    You need to put the URL as "gs://fc-d24fc78b-8aa4-43fe-a6fb-0519d6bfed3a/sm2G28029.bam" in quotes.
    Your bucket name is in your post and I generated the URL from it using the "rule" as I noted in the post from 2:45
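
    The rule described above can be sketched in Python. This is only an illustration, not something FireCloud provides: the function name console_url_to_gs is made up here, and it assumes the console address has the form console.cloud.google.com/storage/browser/[bucket]/[path], as in the address Rashi posted.

```python
from urllib.parse import urlparse

CONSOLE_HOST = "console.cloud.google.com"
BROWSER_PREFIX = "/storage/browser/"


def console_url_to_gs(url: str) -> str:
    """Turn a Cloud Console bucket-browser address into a gs:// bucket URL."""
    # Accept the address with https://, with a mistaken gs://, or with no scheme.
    if "://" not in url:
        url = "https://" + url
    parsed = urlparse(url)
    if parsed.netloc != CONSOLE_HOST or not parsed.path.startswith(BROWSER_PREFIX):
        raise ValueError("not a console bucket-browser address: " + url)
    # Everything after /storage/browser/ is "<bucket>/<object path>".
    return "gs://" + parsed.path[len(BROWSER_PREFIX):]


posted = ("gs://console.cloud.google.com/storage/browser/"
          "fc-d24fc78b-8aa4-43fe-a6fb-0519d6bfed3a/sm2G28029.bam")
fixed = console_url_to_gs(posted)
print(fixed)
```

    In the method config, the resulting URL still has to be entered inside double quotes.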

  • nrashi, DC - District of Columbia, Member

    Thanks @abaumann and @esalinas - adding quotes worked! :)

  • nrashi, DC - District of Columbia, Member

    Hi @esalinas, you mentioned that when you tried to import the WDL I provided, the delocalization was unsuccessful. Do you mean that it did not get imported into the workspace because of some issue in the WDL, or that a system update was blocking this WDL from being imported into a workspace? Also, how did the changes you made to the WDL make it work? I'm just trying to understand the issue in case it comes up again.

    Thanks !

    Rashi

  • esalinas, Broad, Member, Broadie ✭✭✭

    I was able to successfully import the WDL into the method repo and then into the workspace as a method configuration. I was able to run it too.

    The input BAM I used had a different name than "sm2G28029.bam", so the output index also had a different name and was not "sm2G28029.bam.bai". The WDL, however, specifies "sm2G28029.bam.bai" as the file to be de-localized. Because that file did not exist, de-localization failed with a file-not-found error.

    De-localization is the process of copying files from a VM to a cloud-storage bucket so they are not lost when the VM is shut down.

    My WDL update takes the name of the BAM as an input, so the output name can be determined for any BAM and the index is de-localized without a file-not-found error.

    -eddie
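
    To make the failure concrete, here is a toy Python model of it. The delocalize function is a stand-in written for this example, not Cromwell's or FireCloud's actual code, and other_sample.bam is a hypothetical input name. It relies only on the fact that samtools index names the index after the BAM (file.bam gives file.bam.bai), which the cp line in the WDL also expects.

```python
import os
import tempfile


def index_name(bam_path: str) -> str:
    """Name that samtools index gives the index of bam_path: '<file>.bam.bai'."""
    return os.path.basename(bam_path) + ".bai"


def delocalize(workdir: str, declared_outputs: list) -> list:
    """Toy stand-in for de-localization: every declared output must exist."""
    missing = [name for name in declared_outputs
               if not os.path.exists(os.path.join(workdir, name))]
    if missing:
        raise FileNotFoundError("declared outputs not found: " + ", ".join(missing))
    return declared_outputs


with tempfile.TemporaryDirectory() as workdir:
    bam = "other_sample.bam"  # hypothetical input with a different name
    # Simulate what "samtools index" plus "cp" leave in the working directory.
    open(os.path.join(workdir, index_name(bam)), "w").close()

    try:
        delocalize(workdir, ["sm2G28029.bam.bai"])  # hard-coded output name
        hard_coded_ok = True
    except FileNotFoundError:
        hard_coded_ok = False

    derived = delocalize(workdir, [index_name(bam)])  # name derived from input

print(hard_coded_ok, derived)
```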

  • nrashi, DC - District of Columbia, Member

    @esalinas, Thanks Eddie, this helps!

  • nrashi, DC - District of Columbia, Member

    @esalinas Hi Eddie,

    I was hoping to scale this tool up in FireCloud by running it on the existing TCGA BAM files in the workspaces/cohorts already present in FireCloud. How can I do this? I cloned the workspace from which I'm going to pick my BAM files and imported the method config. How do I pick only about 10 BAM files from this workspace? Does this require downloading all the BAM files to my local system first (which is not possible because of their size) and then uploading them to a Google bucket? Is there a way to avoid that kind of download? And if I don't download them, what would the Google bucket address be for these files in the TSV files and input expressions?

    Please do point me to any related documentation if I missed instructions for running a user tool on TCGA data in the cloud. I'm referring to this documentation: https://docs.google.com/document/d/1X7q4zYAb16Py8raxGhP_HPzp5KRjrNfTeSR0wIRrzQU/edit#heading=h.mi1b4aia0p1c

    Thanks for helping!

    Rashi

  • esalinas, Broad, Member, Broadie ✭✭✭

    Note that there are already index files for the TCGA bams under the column "*_bai_path" (ending in "_bai_path") for TCGA samples with a BAM.

    You might want to create a pair_set or sample_set of samples of BAMs you want to use. https://docs.google.com/document/d/1X7q4zYAb16Py8raxGhP_HPzp5KRjrNfTeSR0wIRrzQU/edit#heading=h.u8oxsgxbusxx

    -eddie

  • nrashi, DC - District of Columbia, Member
    edited November 2016

    @esalinas Hi Eddie, yes, there are BAI files present already, but this is just a test of a user tool on existing TCGA data in FireCloud.
    Coming back to my previous question: the documentation mentions membership files, load files, and updating a metadata file. Is there an example somewhere showing what the pair_set or sample_set tables look like? Also, what Google bucket address should be used in these TSV files when I am using the existing files in FireCloud? I'm perhaps not clear on how to scale this up; is there a how-to video on this?

    Thanks for helping!

    Rashi

  • esalinas, Broad, Member, Broadie ✭✭✭

    @nrashi If you create a TSV and want to refer to TCGA data, you will have to use the gs:// URLs to point to the BAMs. If you go to the TCGA workspaces you can find the URLs there under the "Data" tab. You might also want to download the TSV data using the "Download 'participant' data" and "Download 'sample' data" links to see an example of load files for participants and samples.

    If you go to the link https://docs.google.com/document/d/1X7q4zYAb16Py8raxGhP_HPzp5KRjrNfTeSR0wIRrzQU/edit#heading=h.u8oxsgxbusxx
    and scroll to the section "Set Entity Load Files" you can see an example load file for defining a set of participants. The load file for samples is the same except replace "participant" with "sample".

  • nrashi, DC - District of Columbia, Member

    Thank you @esalinas, that makes sense now: how to add only a subset of files from the data set, and what address to give in the TSV files. It also seems that an input expression would be needed for each file being used. Is that understanding correct? For example, when setting a tool up for a sample_set or participant_set, how do I add additional files and expression boxes?

  • esalinas, Broad, Member, Broadie ✭✭✭

    If a method configuration has "sample" as its root entity type but you want to run it on a "sample set", click "Launch", select the sample set you want to run the workflow on, and type "this.samples" in the expression box there. The WDL of root entity type "sample" then runs on each sample in the selected sample set.

  • esalinas, Broad, Member, Broadie ✭✭✭

    Note that you can import data entities into one workspace from another by using the "Import Data" button and then "Copy from Another Workspace".

    -eddie

  • nrashi, DC - District of Columbia, Member

    @esalinas Hi Eddie, if I'm making changes to a method configuration (something that a user uploaded) after I have imported it into a workspace, does the system also apply those changes in the original method configuration in the method repository as well?

    Thanks,
    Rashi

  • nrashi, DC - District of Columbia, Member
    edited November 2016

    Also when I'm trying to import data from another workspace, it is not showing me all the data in the workspace copied for a sample; not all columns are displayed as present in the original TCGA workspace after I copy a sample in my workspace, however I did check the settings for the columns to hide and show and don't see a scroll bar either to go left or right. Is there another setting to this? Attached a screenshot here.

    Thanks,
    Rashi

  • nrashi, DC - District of Columbia, Member

    @esalinas While I'm trying to run the samtoolsindex tool on a TCGA file, the task failed and also when I click on the workspace's associated bucket, it fails to load as well. Do I need to add TSV files even if the data is copied from an already existing TCGA data workspace? Given that TSV data contains metadata, I can understand the reason they're mandatory when uploading your own data though.

  • esalinas, Broad, Member, Broadie ✭✭✭

    @nrashi You wrote "Hi Eddie, if I'm making changes to a method configuration (something that a user uploaded) after I have imported it into a workspace, does the system also apply those changes in the original method configuration in the method repository as well?"

    The answer to this question is no. Making a change in the workspace method config won't affect any published method config.

    You wrote "Also when I'm trying to import data from another workspace, it is not showing me all the data in the workspace copied for a sample; not all columns are displayed as present in the original TCGA workspace after I copy a sample in my workspace, however I did check the settings for the columns to hide and show and don't see a scroll bar either to go left or right. Is there another setting to this? Attached a screenshot here."

    I'm not sure I understand your question. Please note that it is possible to resize columns by clicking and dragging the border between two columns in the header, much like the similar feature in Excel.

    You wrote "While I'm trying to run the samtoolsindex tool on a TCGA file, the task failed and also when I click on the workspace's associated bucket, it fails to load as well. Do I need to add TSV files even if the data is copied from an already existing TCGA data workspace? Given that TSV data contains metadata, I can understand the reason they're mandatory when uploading your own data though."

    At the present moment, we are experiencing technical difficulties. I have not been able to run any workflow myself.

  • nrashi, DC - District of Columbia, Member
    edited November 2016

    @esalinas Hi Eddie, thanks for answering my questions. Sorry for not being clearer earlier about seeing all the data:
    I created a protected data workspace and imported a sample from the TCGA_LUAD_ControlledAccess_V1-0_DATA workspace into it. Ideally I should see the same data columns for a sample that TCGA_LUAD_ControlledAccess_V1-0_DATA has. In my workspace, however, I do not see any data after the WXS bam column for the copied sample. There is also no scroll bar at the bottom for sliding the frame to reveal more data on the right; the additional data columns are simply not present. The screenshot of this issue is attached again.

    As you can see in this screenshot, these are mostly tumor samples. LUAD samples have additional data columns for RNAseq BAMs, protein expression, SNP, etc. I do not see any of that additional data in my workspace when I copy one of these samples from the LUAD controlled access workspace.

    Also, are TSV files needed even when the data used is from existing workspaces on FireCloud?

    Thanks for helping,

    Rashi

  • esalinas, Broad, Member, Broadie ✭✭✭

    If you click the "gear" button in the upper right, you can bring columns into and out of view.

    Does that help?

  • nrashi, DC - District of Columbia, Member

    I did try that, but the columns already shown are the only ones in that list too. Attached is a screenshot. I tried it in a new workspace I created as well; it works fine elsewhere, though.

  • esalinas, Broad, Member, Broadie ✭✭✭

    If the 6 samples you imported have nothing for those columns then they might not be displayed.

    Is that the issue?

    -eddie

  • nrashi, DC - District of Columbia, Member

    Most of those samples have data in additional columns in the original workspace they were copied from (TCGA_LUAD_ControlledAccess_V1-0_DATA), which is why I expected those columns to show up in the destination workspace too.

  • nrashi, DC - District of Columbia, Member

    Hi Eddie,

    Have the technical issues that were preventing tasks from running on FireCloud been resolved?

    Thanks,

    Rashi

  • esalinas, Broad, Member, Broadie ✭✭✭

    I believe so, yes. I have been running submissions this morning and afternoon.

  • nrashi, DC - District of Columbia, Member

    Hi Eddie, I want to run my tool on a number of BAM files. What I understand from the documentation is that this needs a couple of TSV files, including sample_set_membership.txt, whose columns are sample_set_id and sample_id; sample_id is where I list all the samples I want to run. When I try to upload the sample_set_entity.tsv file, I get an error saying that the first column header should end in _id. My first column header does end in _id, so I am not sure why it will not upload. Am I doing something wrong? The screenshots are attached.

    Thanks,

    Rashi

  • esalinas, Broad, Member, Broadie ✭✭✭

    Please check out the "FireCloud Basics" document

    See link : https://docs.google.com/document/d/1X7q4zYAb16Py8raxGhP_HPzp5KRjrNfTeSR0wIRrzQU/edit

    There you can find a section called "Set Entity Load Files"

    An example set membership TSV (for participants rather than samples) is given there; it is similar to this one:

    membership:participant_set_id participant_id
    TCGA_COAD TCGA-5M-AAT4
    TCGA_COAD TCGA-NH-A8F8
    TCGA_BRCA TCGA-A8-A07L

    Can you try modeling your TSV after this one from the document?
    Try changing "participant" to "sample", changing "TCGA_COAD" and "TCGA_BRCA" to whatever you want to name your sample set, and using your own sample IDs.
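
    As a rough sketch, a membership file like this can also be generated with a few lines of Python rather than typed by hand. The helper name, the set name my_luad_set and the sample IDs are hypothetical; the only things taken from the example above are the header layout and the one-row-per-member format.

```python
import csv
import io


def membership_tsv(set_type: str, members: dict) -> str:
    """Build a set membership load file with one row per (set, member) pair."""
    if not set_type.endswith("_set"):
        raise ValueError("set_type should look like 'sample_set'")
    member_type = set_type[:-len("_set")]  # "sample_set" -> "sample"
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    # Header mirrors "membership:participant_set_id  participant_id".
    writer.writerow(["membership:%s_id" % set_type, "%s_id" % member_type])
    for set_id, member_ids in members.items():
        for member_id in member_ids:
            writer.writerow([set_id, member_id])
    return out.getvalue()


# Hypothetical set name and sample IDs, for illustration only.
tsv_text = membership_tsv("sample_set", {"my_luad_set": ["SAMPLE-A-TP", "SAMPLE-B-TP"]})
print(tsv_text)
```

    Writing the file through the csv module keeps the columns tab-separated, which is easy to lose when copying an example out of a document.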

  • nrashi, DC - District of Columbia, Member
    edited November 2016

    Hi Eddie, I already tried that. Those are the errors I get when uploading files in the format given in the documentation. Attached are screenshots of the TSV files I'm trying to upload. I know these TSV files have to be uploaded in a certain order; in that particular workspace, I uploaded participant.tsv > sample.tsv > sample_set_entity.tsv > sample_set_membership.tsv. I get an error when I load the third file in this sequence. I also get an error when I try to upload sample_set_membership.tsv before sample_set_entity.tsv (which I believe is not the correct order anyway). Does it make a difference that participant.tsv and sample.tsv are already uploaded in the workspace?

    Thanks,

    Rashi

  • jneff, Boston, Member, Broadie, Moderator admin

    Hi Rashi,

    In image 2, the issue may be that you list the same sample set, "TCGA_LUAD", twice. I wonder what would happen if you deleted row 3 and tried to re-upload.

    Let me know the result. I can create another example for the documentation that only uses one sample set with multiple samples.

    Thanks,
    Jason

  • nrashi, DC - District of Columbia, Member
    edited November 2016

    @jneff Hi Jason, I assumed that two rows were needed since I am trying to upload two samples here, just as other TSV files need one participant per row. But if the sample set only has to be mentioned once, I can make that change and try again. I think this is either not mentioned in the documentation or I missed it, and that may be the issue. In the documentation, the example lists the same participant_set_id (TCGA_COAD) on two lines, but I take it that is not the case for the sample_set_id. I will let you know if that works.

    In addition, just to understand this better: all metadata files (TSV files) are called load files, and there is one for each entity type. Membership files are just another type of load file. Do these additional load files also have to be uploaded in a certain order? And does it make a difference when the participant.tsv and sample.tsv files already exist in the workspace?

    Thanks for helping!

    Rashi

  • jneff, Boston, Member, Broadie, Moderator admin

    Hi Rashi. There is a very brief note that states "Multiple rows for the same set entity are not permitted in Entity load files." I will add an example to make this clearer. The current example may also be confusing because I only show two rows with two different sample set names.

    Please do let me know if you have success as I want to make sure FireCloud is functioning as described in the documentation.

  • nrashi, DC - District of Columbia, Member

    Thanks Jason, I will try this out and see if removing one line from that file works.

  • jneff, Boston, Member, Broadie, Moderator admin

    Rashi,

    To clarify, the Participants load file must be uploaded before the Samples load file, but you do not need to upload the Sample Set Entity file at all unless you want to specify attributes for your sample set.

    For this reason, the order suggested in the documentation is actually "participants > samples > sample set membership > sample set entity".

  • nrashi, DC - District of Columbia, Member

    Hi Jason,

    Thanks for clarifying the order of the files as well. I loaded participant.tsv > sample.tsv > sample_set_membership.tsv > sample_set_entity.tsv. I removed the extra TCGA_LUAD line from my sample_set_entity file as you suggested, but I still get the same errors. My understanding is that, since I'm trying to run my samtoolsindex tool on only two TCGA samples from the LUAD controlled data workspace, these samples have to be listed in these TSV files. I created the four TSV files accordingly. The participant.tsv and sample.tsv files uploaded successfully; however, with only those two I'm not able to select more than one sample when I launch an analysis. That is why I tried to upload the sample_set_membership.tsv and sample_set_entity.tsv files, in which I listed only two samples to try it out.

    Are you saying that the participant and sample TSV files alone should be enough to use the tool with only two samples from the LUAD data set? If so, I wasn't able to select more than one sample when launching an analysis.

    Errors:
    On uploading sample_set_membership.tsv
    "Could not resolve some entity references"

    On uploading sample_set_entity.tsv
    "Invalid first column header, entity type should end in _id"

    Attached are the screenshots from these two tsv files and their respective errors.

    Thanks for helping,

    Rashi

  • jneff, Boston, Member, Broadie, Moderator admin
    edited November 2016

    Hi Rashi,

    I was saying the Sample Set Entity file should not be necessary unless you want to specify attributes for your sample set. The issue regarding the Sample Set Membership file could be that you appended "_sample" to the end of each sample name. These sample names should correspond to the names of the samples in your workspace. In the TCGA LUAD workspace, I believe the sample entities you are referencing end in "TP". I took a screenshot and attached an example file that might help.

    Jason
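
    A mismatch like this can be caught before uploading. The Python sketch below compares the sample IDs in a membership file against the sample IDs already in the workspace (for example, taken from the downloaded sample TSV). The function name and all IDs are invented for illustration, and it is only an assumption that this kind of mismatch is what the "Could not resolve some entity references" error reports.

```python
import csv
import io


def unresolved_members(membership_tsv: str, workspace_ids) -> list:
    """Return member IDs in a membership load file that the workspace lacks."""
    known = set(workspace_ids)
    reader = csv.reader(io.StringIO(membership_tsv), delimiter="\t")
    next(reader)  # skip the header row
    return [row[1] for row in reader if row and row[1] not in known]


workspace_samples = ["LUAD-0001-TP", "LUAD-0002-TP"]  # hypothetical IDs
tsv = (
    "membership:sample_set_id\tsample_id\n"
    "my_luad_set\tLUAD-0001-TP_sample\n"  # "_sample" suffix does not exist
    "my_luad_set\tLUAD-0002-TP\n"
)
problems = unresolved_members(tsv, workspace_samples)
print(problems)
```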

  • nrashi, DC - District of Columbia, Member

    Hi Jason,
    I ran the tool after making the changes to these metadata files and it seems to be running fine, using the sample_set as the entity. However, while setting the expressions on the inputs and outputs on the method config edit page, how does a user add an additional input expression box? For example, I wanted to add the input BAM addresses for a couple of BAM files. Right now, although I ran it on a sample set of two samples, it gives me a single BAI file in the output; ideally there should be two. Is there something I'm missing?

    Thanks,
    Rashi

  • jneff, Boston, Member, Broadie, Moderator admin

    Hi Rashi,

    That's great. I'm curious: is sample set the root entity? Because the BAMs are associated with samples in our pre-loaded TCGA data, it might be easier to make sample the root entity for your method config. You could then still select a sample set at runtime and enter this.samples in the Define Expression field.

    Regarding your question about method config input boxes, these are ultimately determined by the WDL, so editing the WDL could be the solution. Before doing so, we can also take a look at your method config and workspace if you share them with us.

    Thanks,
    Jason

  • nrashi, DC - District of Columbia, Member

    Hi Jason, I just added you to the project leidos-firecloud-eval/samtoolsindex_upscale5; perhaps you can help me choose the right entity type. I was under the impression that making a sample_set out of two samples would serve my purpose, as Eddie suggested earlier. As for the WDL, it isn't hard-coded for a particular BAM, so what changes do you suggest?

    Thanks for taking a look!

    Rashi

  • jneff, Boston, Member, Broadie, Moderator admin
    edited November 2016

    No problem. Can you also grant us READER access to the samtoolsindex method? Method Repo > Click on method > Permissions > Add New > Add [email protected] and [email protected]. Thanks

  • nrashi, DC - District of Columbia, Member

    Hi Jason, I just added you and Eddie on the method config! Thanks!

    Rashi

  • jneff, Boston, Member, Broadie, Moderator admin

    Hi Rashi,

    Our apologies for the delay! I'm in the process of testing your WDL with a root entity type of sample. Hopefully, this change will produce the two BAI file outputs.

    Thanks,
    Jason

  • nrashi, DC - District of Columbia, Member

    Thanks Jason! :)

  • jneff, Boston, Member, Broadie, Moderator admin

    Hi Rashi,

    We are still working on resolving your issue. I shared with you a new workspace with an updated WDL: broad-firecloud-testing/samtoolsindex_upscale5-test-11-29. Currently, we are running into issues with disk space.

    I will let you know once we get both your samples to run.

    Thanks,
    Jason

  • jneff, Boston, Member, Broadie, Moderator admin

    Hi Rashi,

    Sorry for the delay! This took us a bit longer to test as Google recently made an update that caused temporary issues with disk space.

    It appears that both samples ran successfully in the shared workspace: broad-firecloud-testing/samtoolsindex_upscale5-test-11-29. Please take a look at the most recent analysis in the Monitor tab: November 29, 2016 5:48 PM.

    To make this work, we changed the root entity to sample and updated the WDL. You can run this WDL on a sample set by entering this.samples in the Define Expression field at runtime.

    Let us know of any questions!

    -Jason

  • nrashi, DC - District of Columbia, Member

    Hi Jason,

    Thanks for testing. So was this not working earlier because of the Google updates?

    Also, I was trying to make it work with the root entity type set to sample_set. How is this different from setting it to sample, given that in both cases I was trying to give it two BAM files?

    It is a little misleading for a user reading the documentation; it is easy to assume that sample_set would be the choice of root entity when there is a large number of samples. I'm just trying to understand the difference and the reason for choosing one entity type over the other.

    I will try this WDL today with more files and see how it goes.

    Thanks for helping,

    Rashi

  • jneff, Boston, Member, Broadie, Moderator admin

    Hi Rashi,

    As of yesterday afternoon, the failures were likely attributable to the Google updates. Previously, the failures were related to the WDL.

    As to your question about root entity, both the sample and sample set could work. It's simply a matter of preference and how you set up your WDL and method config. Because your WDL listed File as an input, FireCloud expects a single value (single BAM versus two BAMs). In that case, sample was the correct root entity to choose. If you had instead listed an Array[File] as an input, you could have used the sample set root entity. In that case, you would also need to enter this.samples.BAM in the method config input.

    I will update the documentation to describe the different options, and any tradeoffs. Arguably, the sample root entity is more flexible as you can run your method config on both individual samples and a sample set using the this.samples expression at runtime.

    For an example of a correctly configured WDL/method config with a sample set as the root entity, see the RNA-Seq workspace > aggregate_rsem_results_v1-0_BETA_cfg.

    Best,
    Jason
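
    The difference between the two setups can be pictured with a small Python model. It is a simplification written for this thread, not how FireCloud evaluates expressions internally; the attribute name BAM follows the this.samples.BAM expression above, while the bucket paths and the input name INBAMS are hypothetical.

```python
def plan_runs(root_entity: str, sample_set: dict) -> list:
    """Toy model of the inputs each workflow run would receive."""
    samples = sample_set["samples"]
    if root_entity == "sample":
        # Launched on the set with "this.samples": one run per sample,
        # each run receiving a single File (that sample's BAM).
        return [{"INBAM": s["BAM"]} for s in samples]
    if root_entity == "sample_set":
        # Input expression "this.samples.BAM": one run for the whole set,
        # receiving all BAMs at once, which needs an Array[File] input.
        return [{"INBAMS": [s["BAM"] for s in samples]}]
    raise ValueError("unsupported root entity: " + root_entity)


sample_set = {
    "samples": [
        {"name": "sample_1", "BAM": "gs://example-bucket/sample_1.bam"},
        {"name": "sample_2", "BAM": "gs://example-bucket/sample_2.bam"},
    ]
}
per_sample = plan_runs("sample", sample_set)
per_set = plan_runs("sample_set", sample_set)
print(len(per_sample), "runs with root entity sample")
print(len(per_set), "run with root entity sample_set")
```

    With a File input and sample_set as the root entity there is only one run, which fits the single BAI output Rashi saw.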
