
(howto) Import metadata

Tiffany_at_Broad, Cambridge, MA (Member, Administrator, Broadie, Moderator)
edited April 2018 in Tutorials

You can import metadata into your workspace's data model by either copying from an existing workspace or importing a file.

Copying from an existing workspace

  1. Choose the workspace you want to import metadata from. Note that you can only import data from workspaces whose Authorization Domain is compatible with the one set on your workspace.
  2. Pick the participants, samples, pairs, or sets you want. Importing sets will bring over all the data required for the set. For example, if you import a sample set, the sample and participant data linked to the set will also be copied over.
    • Import conflicts can occur if your workspace already contains a participant, sample, or pair with the same ID as one you are importing. FireCloud will notify you that the entity already exists in the workspace.
    • Copying metadata from another workspace does not copy any linked files into your workspace bucket. Instead, your data model will refer to file paths in the bucket of the workspace you copied from. If that workspace bucket is deleted, those paths in your data model will no longer resolve.

Importing a file

You import metadata for each entity type -- Participant, Sample, or Pair -- by uploading load files in tab-separated-value format, a type of text file (.tsv or .txt). A separate file must be used for each entity type. The first line of each file must contain the appropriate field names as column headers. See the individual entity entries for examples of load files.
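As a sketch, a participant load file and a matching sample load file might look like this (the IDs shown are hypothetical; columns are tab-separated, and the `entity:<type>_id` header names the entity type):

```
entity:participant_id
HCC1143
HCC1954
```

```
entity:sample_id	participant_id
HCC1143_tumor	HCC1143
HCC1954_tumor	HCC1954
```

The second file references the participant IDs from the first, which is why participants must be uploaded before samples.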

Note that for each of the basic entities, the data model also supports set entities, which are essentially lists of the basic entity type:

  • Participant Set
  • Sample Set
  • Pair Set

In set load files, each line lists the membership of a non-set entity (e.g., participant) in a set (e.g., participant set). The first column contains the identifier of the set entity and the second column contains a key referencing a member of that set. For example, a load file for a participant set looks like this:

membership:participant_set_id participant_id

Note that multiple rows in a set load file may have the same set entity id (e.g. TCGA_COAD).
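For instance, a participant set load file that places two participants into the set TCGA_COAD could look like this (the participant IDs are hypothetical; columns are tab-separated):

```
membership:participant_set_id	participant_id
TCGA_COAD	TCGA-AA-3518
TCGA_COAD	TCGA-AA-3552
```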

Order for uploading load files

Load files must be imported in a strict order due to references to other entities.

The order is as follows ("A > B" means entity type A must precede B in order of upload):

  • participants > samples
  • samples > pairs
  • participants > participant sets
  • samples > sample sets
  • pairs > pair sets
  • set membership > set entity, e.g., participants > samples > sample set membership > sample set entity.
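For example, to load samples grouped into sample sets, the order above could be satisfied with an upload sequence like this (file names are hypothetical; the headers follow the load-file conventions described above):

```
participant.tsv            # first:  entity:participant_id
sample.tsv                 # second: entity:sample_id <tab> participant_id
sample_set_membership.tsv  # last:   membership:sample_set_id <tab> sample_id
```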

Uploading an array of files or strings

You may have multiple files or strings of metadata that belong to one participant, sample, pair, or set of these. For example, suppose you have been given genotyping files in VCF format for a collection of samples, for a total of twenty-two files per sample set. Creating a new data table column for each file would be time-consuming, and you would also have to launch the analysis in FireCloud repeatedly to run on each file. Instead, you want to build a WDL that takes an array of VCF files as input, so your tools can run on each item in the array without manual intervention.

To get the array into your data model, you can write WDL code that outputs a file of file paths or strings as an array. This requires, as input, a file containing one file path or string per line. A task in your WDL can read the lines of that file and output them to your data model as an array; you can then use the method configuration to assign the array to a workspace attribute (“workspace.X”) or to an attribute of the participant, sample, pair, or set you are running on (“this.X”).

Here are two examples that can be adapted for your use case. In both examples, the input is a file that lists VCF file paths, one per line, in “gs://” format.
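Such an input file is plain text, one path per line. A hypothetical example (the bucket and file names are made up):

```
gs://my-workspace-bucket/genotyping/chr1.vcf
gs://my-workspace-bucket/genotyping/chr2.vcf
gs://my-workspace-bucket/genotyping/chr3.vcf
```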

Example 1 leaves the command section blank so that you can manipulate the array if you wish. This WDL copies your files to the virtual machine the task spins up, which makes sense if you are manipulating the array of files further. The 50 GB disk size accounts for the files being copied to the virtual machine and should be adjusted for your use case. If you do not need to manipulate the array, see Example 2.

Example 1:

Example 1’s Method and Method Configuration are published in the Methods Repository.

workflow fof_usage_wf {
   File file_of_files
   call fof_usage_task {
     input:
       fof = file_of_files
   }
   output {
     Array[File] array_output = fof_usage_task.array_of_files
   }
}

task fof_usage_task {
   File fof
   Array[File] my_files = read_lines(fof)
   command {
     # do stuff with the array of files here
   }
   runtime {
       docker: "ubuntu:16.04"
       disks: "local-disk 50 HDD"
       memory: "2 GB"
   }
   output {
     Array[File] array_of_files = my_files
   }
}

Example 2:

workflow fof_usage_wf {
   File file_of_files
   Array[File] array_of_files = read_lines(file_of_files)

   output {
     Array[File] array_output = array_of_files
   }
}
Importing arrays into your data model directly with a TSV is not currently available. We are working on functionality to make this easier to do in the web interface.

Post edited by KateN.

