
New file paths for TARGET and TCGA workspaces

Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

Heads up if you use the TARGET and/or TCGA data workspaces: we're planning an update that will change the paths to the data files. This will disable access to the original files in any clones of the original workspaces that retain the old paths, and it will affect call caching in any new clones of the original workspaces that you create after the update rolls out in a few weeks.

Read on to understand what this change entails and why it's an important improvement that is worth making.

What's changing?

The file paths will have a new URL structure. After the update, file paths will no longer start with “gs://” followed by a bucket path. Instead, they will follow this structure: “dos://”.

What are the consequences for your work?

If you are working in clones of the original master workspaces, the original data files will no longer be available at the old "gs://" paths. So if you want to run new analyses in those previously existing workspaces, you will need to update the metadata to use the new "dos://" paths. However, any output files you previously derived from analyzing those datasets will be unaffected and will remain accessible.

The first time you run workflows on the data using the new paths, in any workspace (old or new), call caching will NOT kick in even if you have previously run those workflows on the same data. The workflows will have to run in full; this is because call caching uses the file paths as part of the algorithm that identifies whether a given computation has already been run with the same starting conditions.
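To illustrate why a path change alone defeats the cache, here is a minimal sketch. It assumes a cache key computed by hashing the task definition together with its input file paths. This is not the actual call caching algorithm; the `cache_key` function, the bucket path, and the "dos://" identifier are all hypothetical placeholders.

```python
import hashlib
from typing import Sequence


def cache_key(task_definition: str, input_paths: Sequence[str]) -> str:
    """Hypothetical cache key: a hash over the task text and its input paths."""
    digest = hashlib.sha256()
    digest.update(task_definition.encode("utf-8"))
    for path in sorted(input_paths):
        # A separator byte keeps adjacent paths from running together.
        digest.update(b"\0")
        digest.update(path.encode("utf-8"))
    return digest.hexdigest()


# Placeholder task and paths, for illustration only.
task = "task align { ... }"
old_key = cache_key(task, ["gs://example-bucket/sample1.bam"])
new_key = cache_key(task, ["dos://example-identifier"])
rerun_key = cache_key(task, ["dos://example-identifier"])

# Same data, different path: the keys differ, so the first run is a cache miss.
cache_hit_after_update = old_key == new_key
# Same path on a later run: the keys match, so caching can apply again.
cache_hit_on_rerun = new_key == rerun_key
```

Under these assumptions, only the first run with the new paths pays the full cost; later runs against the same "dos://" paths produce matching keys.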

Why are we making this change?

We understand that this update has the potential to disrupt your work, so please rest assured we are not making this decision lightly. In a nutshell, the switch in file path structure brings substantial benefits that we believe are worth risking some disruption.

In the past, the Genomic Data Commons (GDC) only released new datasets a few times a year, but the frequency at which we receive data has been increasing steadily, which has been making it more difficult to manage the new content and make it available in a timely way. This update will enable us to respond more efficiently and in a standardized manner to new datasets released by the GDC. As a result, you can expect to see updates to TARGET and TCGA workspaces happen much more quickly than in the past, and we'll also be able to provide access to new GDC datasets from our platform as they become available.

For context, this move is part of a larger shift within the GDC towards location-agnostic URLs, which allow the physical data to relocate without changing references to those data. This follows changes in the standards for how datasets are stored across the Data Commons Framework (DCF), which we expect will provide a more standardized and streamlined experience going forward. We are keen to align with the GDC on this effort and we look forward to seeing its benefits materialize for everyone.

Let us know if you have any concerns or questions; as always we are here to help.



  • birger · Member, Broadie, CGA-mod ✭✭✭


    Does this apply to just the hg38 TCGA and TARGET workspaces, which didn't reference files via gs:// URLs but rather GDC UUIDs, or does it also apply to the hg19 TCGA workspaces?


  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Hi Chet, that's a good question -- @abaumann can you shed some light on that?

  • birger · Member, Broadie, CGA-mod ✭✭✭

    I'm a bit confused because the hg38 workspaces never contained gs:// URLs. They just contained GDC UUIDs. As a temporary solution we provided workflows for retrieving files from the GDC based on the UUIDs, but the plan was for the GDC and FireCloud to implement UUID to URL resolution. It looks like instead of doing that you and the GDC are introducing a new type of location-independent URL (which should really be called a URI, or uniform resource identifier). Will these "dos" paths incorporate the GDC file UUIDs? Will they be included in any file manifests downloaded from the GDC?

  • abaumann · Broad DSDE · Member, Broadie ✭✭✭

    For hg38, you are correct: those workspaces never had gs:// URLs. They had the UUIDs, and your downloader retrieved the data. For the hg19 workspaces, the gs:// URLs would be replaced with dos:// URIs.

    For the hg38 workspaces, these UUIDs would be replaced with dos:// URIs. These are not location dependent: they resolve to Google bucket URLs (or files in AWS, etc.). There are different resolvers out there that can all take these UUIDs and resolve them to URLs.

    The UUIDs in these DOS URIs are the same UUIDs your workspaces had. Looking through your workspaces, though, they also included the file names, which the new workspaces will not; they would only have dos://. Does this answer your question about manifests, given that you will still have the same UUIDs (just in DOS URI form)? We put in the actual URI so that we know how to handle it in the UI, for instance to click and see a preview. Without the dos:// prefix, we couldn't distinguish DOS UUIDs from other UUIDs (such as the workflow, submission, and other UUIDs that our system also uses).
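As a rough illustration of the point about telling DOS identifiers apart from other UUIDs, the sketch below classifies a metadata value by its URI scheme. The function name `classify_reference` and all sample values are hypothetical, and the exact form of what follows "dos://" is not specified in this thread, so the all-zero UUID is only a placeholder. This is not FireCloud's actual logic.

```python
import uuid
from urllib.parse import urlparse


def classify_reference(value: str) -> str:
    """Label a metadata value as 'gs', 'dos', 'bare-uuid', or 'other'."""
    scheme = urlparse(value).scheme
    if scheme in ("gs", "dos"):
        # An explicit scheme says how the value should be handled.
        return scheme
    try:
        uuid.UUID(value)
    except ValueError:
        return "other"
    # A well-formed UUID with no scheme could be a file, workflow, or submission ID.
    return "bare-uuid"


placeholder_uuid = "00000000-0000-0000-0000-000000000000"
labels = {
    "bucket": classify_reference("gs://example-bucket/sample1.bam"),
    "dos": classify_reference("dos://" + placeholder_uuid),
    "uuid": classify_reference(placeholder_uuid),
    "filename": classify_reference("sample1.bam"),
}
```

With the scheme present, a value can be routed to a resolver or a preview; a bare UUID on its own gives no hint about what kind of object it identifies.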

  • birger · Member, Broadie, CGA-mod ✭✭✭

    I included the file names in the hg38 workspaces to support the file downloaders. The file names should not be required once we move to "dos" URIs. Will the manifest provide dos URIs, or will it just have the UUIDs, leaving us to construct the dos URIs from them?

  • birger · Member, Broadie, CGA-mod ✭✭✭

    Could you provide an example of a dos:// URL? I tried adding the URL "dos://0141f91f-b350-45df-bbb8-983007daf27c", where the body of the URL is the UUID of a file on the GDC, to a FireCloud workspace. The FireCloud GUI identifies it as a file URL, but fails to resolve it to the file on cloud storage. I suspect I'm using the incorrect syntax for incorporating a file's GDC UUID into the dos URL. Thanks.
