samtools view slice of cloud storage bam not working

ekofman Member, Broadie
edited June 2018 in Ask the FireCloud Team

Hi,

I am working with WGS data, and since the BAMs are huge (upwards of 300 GB in some cases), I'd like to avoid localizing the entire BAM for each scatter instance when I scatter across many instances. Instead, I'd like to operate only on the portion of the BAM corresponding to the interval I've assigned to each scatter instance. To this end, I'm trying to use samtools to view only certain parts of the BAM. I'm trying to follow the instructions listed here, but can't get it to work: http://isb-cancer-genomics-cloud.readthedocs.io/en/latest/sections/data/data2/data_in_GCS.html

These are the commands being run in my instance, along with the resulting output and error messages.

>> gcloud auth print-access-token
+ GCS_OAUTH_TOKEN=*******[redacted]********

>> samtools view gs://fc-47b16dc3-db04-48f5-a26a-ddec3c09c578/workspace_name/RP-1476/WGS/MSK-004_T_P1/v7/MSK-004_T_P1.bam 1:1-15000000
open: No such file or directory
[main_samview] fail to open "gs://fc-47b16dc3-db04-48f5-a26a-ddec3c09c578/workspace_name/RP-1476/WGS/MSK-004_T_P1/v7/MSK-004_T_P1.bam" for reading.
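
One thing that may be worth checking here, offered as a sketch rather than a confirmed fix: htslib reads GCS_OAUTH_TOKEN from the environment, so the variable has to be exported rather than just assigned, and gs:// URLs only open if samtools is linked against an htslib built with libcurl and GCS support. The helper name slice_bam and the DRY_RUN switch below are hypothetical, and the sketch assumes the .bai index sits next to the BAM in the bucket.

```shell
# Hypothetical helper: slice one region out of a BAM stored in GCS.
# Assumes samtools/htslib were built with libcurl and GCS support.
# Set DRY_RUN=1 to print the commands instead of running them.
slice_bam() {
    local bam_url="$1" region="$2" out_bam="$3"
    local run=""
    if [ "${DRY_RUN:-0}" = "1" ]; then
        run="echo"
    fi

    # htslib looks up the token in the environment, so it must be exported.
    if [ -z "$run" ]; then
        GCS_OAUTH_TOKEN="$(gcloud auth print-access-token)"
        export GCS_OAUTH_TOKEN
    fi

    # -b writes BAM; without it samtools view emits SAM text,
    # which samtools index cannot index.
    $run samtools view -b -o "$out_bam" "$bam_url" "$region"
    $run samtools index "$out_bam"
}
```

It would be called as slice_bam "$bam_url" 1:1-15000000 bam_section.bam. In a WDL task the BAM would likely need to be declared as a String rather than a File, since File inputs are localized in full before the command runs.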

And this is the WDL task that generated those commands:

task ProportionalCoverage_WGS_Task {
    File reference
    File referenceDict
    File referenceIndex
    File inputBamLocation
    String sampleID
    Int memoryGb
    Int diskSpaceGb
    File targetsIntervalList
    Int preemptible

    command <<<
        samtools view ${inputBamLocation} $(head -n1 ${targetsIntervalList}) >> bam_section.bam
        samtools index bam_section.bam

        java -jar /gatk/gatk.jar CalculateTargetCoverage \
        -L ${targetsIntervalList} \
        --output ${sampleID}.pcov \
        --groupBy SAMPLE \
        --transform PCOV \
        --input bam_section.bam \
        --reference ${reference}
    >>>

    output {
        File pcov = "${sampleID}.pcov"
    }

    runtime {
        docker: "broadinstitute/gatk:4.beta.6"
        memory: "${memoryGb} GB"
        cpu: "1"
        disks: "local-disk ${diskSpaceGb} HDD"
        preemptible: preemptible
    }
}

How can I do this? This will save me countless hours while developing my workflows for WGS, and I'm sure would be very useful to others in the community.

Thanks,

Eric

Answers

  • Tiffany_at_Broad Cambridge, MA Member, Administrator, Broadie, Moderator

    Hi @ekofman, I know we use a feature called NIO in our $5 genome pipeline to localize only parts of the BAM (blog post about it). I don't know the answer, but I will follow up with a colleague and get back to you.
