How to run an analysis with GDC (TCGA) mRNA BAM files?

Hi! I want to run an analysis on TCGA mRNA BAM data, but I don't know how to import such data. Additionally, TCGA data has been moved to GDC since June. Could someone explain how to import mRNA BAM data from GDC? Thank you!

BTW, I have TCGA controlled data access authorization.


Best Answer


  • jemimalwh, China, Member

    @jneff said:
    In the future, FireCloud will support integrations to GDC data. Currently, FireCloud's pre-loaded TCGA workspaces refer to Google Cloud Storage buckets that exist independently of GDC. The storage costs for these buckets are shared between the Broad Institute and ISB.

    If you were willing to pay for storage costs, you could download GDC data and upload it to a Google Cloud Storage bucket. My suggestion would be to first browse through FireCloud's available TCGA data. If FireCloud has the TCGA data you're looking for, you can shallow copy this data into your workspace without incurring storage costs.
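
    As a rough illustration of the "download GDC data" step, the sketch below only builds the query parameters one might send to the GDC files endpoint to list RNA-Seq BAM files for one project. It is an assumption-laden example, not part of FireCloud: the endpoint URL is the current public GDC API address (it may differ from the one in use when this thread was written), the helper name build_gdc_bam_query is hypothetical, and the project ID TCGA-LUAD is chosen only to match the workspace mentioned in this answer. No request is sent.

```python
import json

# Assumed endpoint: the public GDC API "files" search endpoint.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_gdc_bam_query(project_id, size=100):
    """Build query parameters for the GDC files endpoint (hypothetical helper).

    Nothing is sent over the network; the function only returns a dict.
    """
    def is_in(field, values):
        # One "field is in values" clause in the GDC filter syntax.
        return {"op": "in", "content": {"field": field, "value": values}}

    filters = {
        "op": "and",
        "content": [
            is_in("cases.project.project_id", [project_id]),
            is_in("data_format", ["BAM"]),
            is_in("experimental_strategy", ["RNA-Seq"]),
        ],
    }
    return {
        "filters": json.dumps(filters),  # the API expects a JSON string here
        "fields": "file_id,file_name,file_size",
        "format": "JSON",
        "size": str(size),
    }


params = build_gdc_bam_query("TCGA-LUAD")
print(json.dumps(json.loads(params["filters"]), indent=2))
```

    These parameters could then be passed to an HTTP client of your choice. Listing files does not download them: controlled-access BAMs still require an authentication token (for example with the GDC Data Transfer Tool), and the downloaded files would then have to be copied to your own bucket (for example with gsutil cp), where you would pay the storage costs.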

    To find TCGA controlled access data in FireCloud, you can search for "controlled" in the workspace list. Then click on a workspace, e.g., TCGA_LUAD_ControlledAccess_V1-0_DATA, and browse through the Data tab. Filtering in the sample entity will generally display the BAMs and mRNA data. I included a screenshot.

    If you want to shallow copy this data to your own workspace (i.e., copy the pointers to TCGA buckets), you can use the Import Data... feature in the Data tab.

    The basic steps are:

    1. Create a new workspace and select Workspace intended to contain NIH protected data.
    2. Go to the Data tab and click Import Data...
    3. Click Copy from another workspace. Search for "controlled" and select a TCGA workspace, e.g., TCGA_LUAD_ControlledAccess_V1-0_DATA.
    4. Choose the pair_set entity type. Select an Entity name and click Copy. Note that this will import all of the TCGA data for the LUAD workspace. It may take a minute or two.

    Afterwards, you could reference the TCGA LUAD data in method configs within your workspace.

    Thank you for your reply. Now I can import sample, participant, and sample_set data from TCGA. Another problem is that I want to analyze hundreds of samples, but I can only launch one sample at a time. Launching the analyses one by one is really time-consuming.

    Is there another way to launch the analyses as a batch so that this goes faster? Thank you!

  • jneff, Boston, Member, Broadie, Moderator

    Let us know if the response here addresses your question. If not, we can provide further support.

