failing to get size of file

birger Member, Broadie, CGA-mod ✭✭✭

The QC workflow that we run on BAMs before running MuTect is failing on a particular data set. The data files in this set reside in a non-workspace, protected bucket that I have read access to (broad-ibmwatson-broad_private_data-bucket). The failure appears in the first task of the workflow, which takes as input the sizes of several files; i.e.,

call QC_Prepare_Task {
    input:
        preemptible=preemptible,
        tBamBytes=size(tumorBam),
        tBaiBytes=size(tumorBamIdx),
        nBamBytes=size(normalBam),
        nBaiBytes=size(normalBamIdx),
        regionFileBytes=size(regionFile),
        rgBLBytes=size(readGroupBlackList),
        capNormDBZipBytes=size(captureNormalsDBRCLZip),
        fastaBytes=size(refFasta),
        fastaDictBytes=size(refFastaDict),
        fastaIdxBytes=size(refFastaIdx),
        exomeIntervalsBytes=size(exomeIntervals),
        snpSixBytes=size(SNP6Bed),
        hapMapVCFBytes=size(HapMapVCF),
        hapDBForCCBytes=size(HaplotypeDBForCrossCheck),
        dbSNPVCFBytes=size(DB_SNP_VCF),
        dbSNPVCFIDXBytes=size(DB_SNP_VCF_IDX),
        picardHapMapVCFBytes=size(picardHapMap),
        picardTargetIntervalsBytes=size(picardTargetIntervals),
        picardBaitIntervalsBytes=size(picardBaitIntervals)
}

The files that reside on the private bucket are tumorBam, tumorBamIdx, normalBam and normalBamIdx.

This prepare task fails with the following message:
message: Couldn't resolve all inputs for QC_Workflow.QC_Prepare_Task at index None.
causedBy:
  message: Input evaluation for Call QC_Workflow.QC_Prepare_Task failed.
  causedBy:
    message: nBamBytes
    causedBy:
      message: fc-a903aa03-a935-463b-bdff-bf782a05a55a/ac0ebcbc-4a9a-43b1-abdb-d5fc82c89986/QC_Workflow/c5d05e91-cec0-4734-ad04-5b7d45656417/call-QC_Prepare_Task/gs:/broad-ibmwatson-broad_private_data-bucket/seq/picard_aggregation/RP-897/Exome/05246_CCPM_030102_Blood/v5/05246_CCPM_030102_Blood.bam
    message: tBaiBytes
    causedBy:
      message: fc-a903aa03-a935-463b-bdff-bf782a05a55a/ac0ebcbc-4a9a-43b1-abdb-d5fc82c89986/QC_Workflow/c5d05e91-cec0-4734-ad04-5b7d45656417/call-QC_Prepare_Task/gs:/broad-ibmwatson-broad_private_data-bucket/xchip/bloodbiopsy/data/cell_free_DNA/27Feb17SR_cfDNA_IBM_WES/mergedBamFilesV2/FC19270072.markDuplicates.bai
    message: nBaiBytes
    causedBy:
      message: fc-a903aa03-a935-463b-bdff-bf782a05a55a/ac0ebcbc-4a9a-43b1-abdb-d5fc82c89986/QC_Workflow/c5d05e91-cec0-4734-ad04-5b7d45656417/call-QC_Prepare_Task/gs:/broad-ibmwatson-broad_private_data-bucket/seq/picard_aggregation/RP-897/Exome/05246_CCPM_030102_Blood/v5/05246_CCPM_030102_Blood.bai
    message: tBamBytes
    causedBy:
      message: fc-a903aa03-a935-463b-bdff-bf782a05a55a/ac0ebcbc-4a9a-43b1-abdb-d5fc82c89986/QC_Workflow/c5d05e91-cec0-4734-ad04-5b7d45656417/call-QC_Prepare_Task/gs:/broad-ibmwatson-broad_private_data-bucket/xchip/bloodbiopsy/data/cell_free_DNA/27Feb17SR_cfDNA_IBM_WES/mergedBamFilesV2/FC19270072.markDuplicates.bam
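
A stripped-down test along these lines (task and variable names hypothetical) should isolate the size() evaluation from the rest of the QC workflow:

task report_size {
    Float bamBytes

    command {
        # The size has already been computed by the engine at input
        # evaluation time; the task only reports it.
        echo "tumor BAM is ${bamBytes} bytes"
    }
    output {
        String report = read_string(stdout())
    }
}

workflow size_repro {
    File tumorBam  # e.g. a BAM in the protected bucket

    call report_size {
        input: bamBytes=size(tumorBam)
    }
}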

However, if I run a simple single-task workflow on the same entity that takes one of these files as input (rather than its size), the file is localized successfully (from the task log):

2017/06/12 21:00:26 I: Running command: sudo gsutil -q -m cp gs://broad-ibmwatson-broad_private_data-bucket/xchip/bloodbiopsy/data/cell_free_DNA/27Feb17SR_cfDNA_IBM_WES/mergedBamFilesV2/FC19270072.markDuplicates.bai /mnt/local-disk/gs:/broad-ibmwatson-broad_private_data-bucket/xchip/bloodbiopsy/data/cell_free_DNA/27Feb17SR_cfDNA_IBM_WES/mergedBamFilesV2/FC19270072.markDuplicates.bai
2017/06/12 21:00:28 I: Done copying files.
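
That working workflow is roughly this shape (names hypothetical); the key difference is that the file is passed as a File input, so it is localized by the job itself rather than sized by Cromwell during input evaluation:

task stat_file {
    File inputFile  # localized to the worker before the command runs

    command {
        ls -l ${inputFile}
    }
    output {
        String listing = read_string(stdout())
    }
}

workflow localize_wf {
    File tumorBamIdx

    call stat_file {
        input: inputFile=tumorBamIdx
    }
}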

I believe it is Cromwell that retrieves the sizes of these files from the bucket, possibly by making a gsutil stat call or the equivalent storage API request. What credentials does Cromwell use to make this request? If it is not using the user's credentials, that might explain the failure.
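
One way to narrow this down (a sketch only; task and workflow names are hypothetical) would be to stat the object from inside a job, which runs with the pipeline's credentials, and compare the result with running the same gsutil stat command locally under my own credentials:

task stat_object {
    String gsUrl

    command {
        # Stat the object with whatever credentials the job runs under
        gsutil stat ${gsUrl}
    }
    output {
        String statOutput = read_string(stdout())
    }
    runtime {
        docker: "google/cloud-sdk"
    }
}

workflow stat_wf {
    call stat_object
}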

Note that the gs:// URLs in the failure message are corrupted (gs:/ instead of gs://). This might hint at a bug, or it could simply be a formatting issue in the generation of the error message. Regardless, this is blocking progress on our collaboration project.

Issue · GitHub
by Geraldine_VdAuwera

Issue Number: 2160
State: open

Answers

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    This does sound like a Cromwell problem, I'll have the team look into it. Can you just confirm that the same workflow works fine if you run it on data located in a public bucket?

  • mcovarr (Cambridge, MA) Member, Broadie, Dev ✭✭

    I'm looking into this now. One thing that jumps out at me is this chimeric-looking path:

    fc-a903aa03-a935-463b-bdff-bf782a05a55a/ac0ebcbc-4a9a-43b1-abdb-d5fc82c89986/QC_Workflow/c5d05e91-cec0-4734-ad04-5b7d45656417/call-QC_Prepare_Task/gs:/broad-ibmwatson-broad_private_data-bucket/seq/picard_aggregation/RP-897/Exome/05246_CCPM_030102_Blood/v5/05246_CCPM_030102_Blood.bam

    There appears to be an inappropriate prefix in front of the GCS path (the gs:/ versus gs:// thing is, I believe, a formatting-related red herring).

  • birger Member, Broadie, CGA-mod ✭✭✭

    After copying the files to the workspace's bucket, it worked fine. I can also successfully run the same workflow on cell-line data, which resides in a public bucket.

  • birger Member, Broadie, CGA-mod ✭✭✭

    @mcovarr @Geraldine_VdAuwera

    Miguel and Geraldine: This bug is impacting a critical deadline. @mhanna, @Kara_Slowik and I would appreciate frequent status updates.

    Does the Cromwell service account need to be granted read access to the private bucket?

  • Geraldine_VdAuwera (Cambridge, MA) Member, Administrator, Broadie admin

    @mcovarr is looking into this now and will update you as the case develops.

  • birger Member, Broadie, CGA-mod ✭✭✭

    Thank you... really appreciate it.

  • mcovarr (Cambridge, MA) Member, Broadie, Dev ✭✭

    tl;dr This may not be a permissions problem but possibly an issue with the way Cromwell handles underscores in a bucket name. Can you rename your bucket to not have underscores?

    In the refresh-token mode FireCloud uses, Cromwell uses the user's refresh token for these size operations. I've tested this locally and it seems to work as expected: with a specific user's refresh token I can call size() on only those files to which the user has access. When I intentionally mismatch the refresh token and the file permissions, I get explicit auth errors like the following, which look quite different from what you have here:

            "failures": [
              {
                "causedBy": [
                  {
                    "causedBy": [
                      {
                        "causedBy": [
                          {
                            "causedBy": [
                              {
                                "causedBy": [],
                                "message": "403 Forbidden\n{\n  \"code\" : 403,\n  \"errors\" : [ {\n    \"domain\" : \"global\",\n    \"message\" : \"Caller does not have storage.objects.get access to object my-project-dev/my_file.txt.\",\n    \"reason\" : \"forbidden\"\n  } ],\n  \"message\" : \"Caller does not have storage.objects.get access to object my-project-dev/my_file.txt.\"\n}"
                              }
                            ],
                            "message": "Caller does not have storage.objects.get access to object my-project-dev/my_file.txt."
                          }
                        ],
                        "message": "sz"
                      }
                    ],
                    "message": "Input evaluation for Call size_wf.size_task failed."
                  }
                ],
                "message": "Couldn't resolve all inputs for size_wf.size_task at index None."
              }
            ],
    

    Thanks

    Miguel

  • birger Member, Broadie, CGA-mod ✭✭✭

    You can't rename a bucket once it's been created. The TCGA bucket names don't have underscores in them, so we do have evidence that the workflow works without underscores. I'll run some positive and negative tests and get back to you.

    @mhanna and @Kara_Slowik ended up copying all the files to a workspace bucket to work around this issue and complete their analysis runs. Nevertheless, I'll try to get my test results to you by tomorrow.

    If the issue is with underscores in the bucket name, I would consider that a Cromwell bug: Google supports underscores in bucket names, and a workspace should be able to reference files in any Google bucket.

    Thanks again for your help.

  • mcovarr (Cambridge, MA) Member, Broadie, Dev ✭✭

    Although this was new to me, apparently this is a known issue in Cromwell. I'll see if I can get to this during my current bug rotation.

  • mcovarr (Cambridge, MA) Member, Broadie, Dev ✭✭

    Just to confirm what we already suspected, I repeated my tests using a bucket name with underscores in it and I now see failures consistent with those reported here.

  • birger Member, Broadie, CGA-mod ✭✭✭

    Thanks @mcovarr. I don't think there is any need, then, for me to run my own tests... you have already confirmed that it is the use of underscores in the bucket name that is causing the problem.
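
For anyone who lands on this thread later, the positive/negative test discussed above reduces to something like the following sketch (bucket and file names hypothetical): the call against the hyphenated bucket should succeed, while the call against the bucket with underscores fails at input evaluation as described above.

task report_size {
    Float bytes

    command {
        echo "input is ${bytes} bytes"
    }
    output {
        String report = read_string(stdout())
    }
}

workflow underscore_test {
    # Hypothetical paths: the same object copied to two buckets.
    File hyphenBucketBam      # e.g. gs://my-test-bucket/test.bam
    File underscoreBucketBam  # e.g. gs://my_test_bucket/test.bam

    call report_size as hyphen_size {
        input: bytes=size(hyphenBucketBam)
    }
    call report_size as underscore_size {
        input: bytes=size(underscoreBucketBam)
    }
}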
