Metadata formatting

System Administrator admin
This discussion was created from comments split from: Getting error when uploading the genomic data model.

Comments

  • Hi Tiffany,
    I have a metadata file that is passed to a task as its input file (please see the attached file). I was wondering if we could set up a Skype meeting to make sure I am formatting it correctly?

    The workspace "paired-fastq-to-unmapped-bam" ran successfully with 3 sample fastq files. I am trying to prepare a metadata file for my 10 new fastq files and then 900 other fastq files.

    I have attached the metadata file I am referring to. My questions are:

    1- Is the H06HDADXX130110.1.ATCACGAT a multiplex adaptor and should I adjust it to my sequencer machine?

    2- Should I keep the other columns the same (except the Bucket address)?

    3- Within the WDL where the run time is specified ( Please see below), what is the best practice for number of cpu and disk space for about 500 bam files?

    runtime {
      docker: docker
      memory: select_first([machine_mem_gb, 10]) + " GB"
      cpu: "1"
      disks: "local-disk " + select_first([disk_space_gb, 100]) + " HDD"
      preemptible: select_first([preemptible_attempts, 3])
    }

    Best,

  • KateN Cambridge, MA Member, Broadie, Moderator admin

    Hi @smehr, I've created a new discussion from your comment so that we can better answer your question. @bshifaw had some ideas, so I'm going to pass your question over to him.

  • bshifaw Member, Broadie, Moderator admin
    edited August 2018

    1- Is the H06HDADXX130110.1.ATCACGAT a multiplex adaptor and should I adjust it to my sequencer machine?

    No, you shouldn't have to. If you need further guidance in preparing samples, you can refer to the first step in the "Map and clean up short read sequence data efficiently" doc.

    2- Should I keep the other columns the same (except the Bucket address)?

    The values in each column should be changed for each sample. The column headers are listed below:
    readgroup | fastq_pair1 | fastq_pair2 | sample_name | library_name | platform_unit | run_date | platform_name | sequencing_center
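    For 10 or 900 fastq pairs it may be easier to generate the file than to edit it by hand. Below is a minimal sketch in Python, assuming a tab-separated file with the nine headers listed above. The helper names (make_row, rows_to_tsv), the bucket path, the fastq naming scheme, and all sample values are hypothetical placeholders; copy the exact header line and file layout from the metadata file that already worked for your 3 samples.

```python
import csv
import io

# Column headers as listed in the reply above; verify the exact spelling
# against the metadata file that already works in your workspace.
HEADERS = [
    "readgroup", "fastq_pair1", "fastq_pair2", "sample_name", "library_name",
    "platform_unit", "run_date", "platform_name", "sequencing_center",
]


def make_row(readgroup, bucket, sample_name, library_name, platform_unit,
             run_date, platform_name, sequencing_center):
    """Build one metadata row as a dict keyed by HEADERS.

    The "<bucket>/<readgroup>_1.fastq.gz" naming scheme is an assumption
    made for illustration; substitute your real file paths.
    """
    return {
        "readgroup": readgroup,
        "fastq_pair1": f"{bucket}/{readgroup}_1.fastq.gz",
        "fastq_pair2": f"{bucket}/{readgroup}_2.fastq.gz",
        "sample_name": sample_name,
        "library_name": library_name,
        "platform_unit": platform_unit,
        "run_date": run_date,
        "platform_name": platform_name,
        "sequencing_center": sequencing_center,
    }


def rows_to_tsv(rows):
    """Serialize rows to tab-separated text with a single header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=HEADERS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Hypothetical example values for a single read group.
    row = make_row(
        readgroup="sampleA_lane1",
        bucket="gs://my-bucket/fastq",
        sample_name="sampleA",
        library_name="libA",
        platform_unit="FLOWCELL.1.BARCODE",
        run_date="2018-08-01",
        platform_name="illumina",
        sequencing_center="MyCenter",
    )
    print(rows_to_tsv([row]), end="")
```

    Building the rows in a loop over your fastq pairs keeps the column order consistent across all samples.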

    3- Within the WDL where the run time is specified ( Please see below), what is the best practice for number of cpu and disk space for about 500 bam files?

    We don’t currently have best practices for the runtime parameters relative to the number of input files. If you would like to calculate the disk space yourself, the best way is to base it on the size of your largest fastq pair.
    Ideally, and maybe we could add this to the workflow later, the task would set the disk space by calculating the size of the input files before creating the VM.
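    The sizing idea above can be sketched in Python as follows. This is only an illustration: the function names, the 2.0 multiplier (room for the output BAM and temporary files), and the 20 GB headroom are assumptions, not recommended values, so tune them against a real run.

```python
import math

BYTES_PER_GB = 1024 ** 3


def estimate_disk_gb(pair_sizes_bytes, multiplier=2.0, headroom_gb=20):
    """Estimate a disk size in GB from the largest fastq pair.

    pair_sizes_bytes: list of (fastq1_bytes, fastq2_bytes) tuples.
    multiplier and headroom_gb are illustrative assumptions.
    """
    largest_pair_bytes = max(r1 + r2 for r1, r2 in pair_sizes_bytes)
    largest_pair_gb = largest_pair_bytes / BYTES_PER_GB
    return math.ceil(largest_pair_gb * multiplier) + headroom_gb


def disks_string(disk_gb):
    """Format the value the way the runtime block's disks attribute expects."""
    return "local-disk " + str(disk_gb) + " HDD"


if __name__ == "__main__":
    # Hypothetical sizes: one 10 GB + 10 GB pair and one 4 GB + 4 GB pair.
    sizes = [
        (10 * BYTES_PER_GB, 10 * BYTES_PER_GB),
        (4 * BYTES_PER_GB, 4 * BYTES_PER_GB),
    ]
    print(disks_string(estimate_disk_gb(sizes)))
```

    The resulting number can be supplied as the disk_space_gb input, which the runtime block already prefers over its default of 100.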

    Did this answer your question?
