
uploading large datasets?

esalinasesalinas BroadMember, Broadie ✭✭✭

I’m uploading large datasets and I got a suggestion to make the upload faster. How do I implement this?

Answers

  • esalinasesalinas BroadMember, Broadie ✭✭✭
    edited August 2016

    Running "gsutil help cp" shows the help for copying files, including the "-m" switch for multithreading.

    A Google search for "parallel composite upload" leads to
    https://cloud.google.com/storage/docs/gsutil/commands/cp

    which says:

    To try parallel composite uploads you can run the command:
    gsutil -o GSUtil:parallel_composite_upload_threshold=150M cp bigfile gs://your-bucket
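
The -o flag sets the option for a single invocation only. The gsutil note quoted later in this thread says the same value can instead be set in the .boto configuration file. A minimal sketch of that, assuming the target file has no [GSUtil] section yet; the helper name set_pcu_threshold and the scratch file are made up for illustration:

```shell
# Append a [GSUtil] section carrying the threshold to a boto-style config file.
# Assumption: the file has no [GSUtil] section yet. If it already has one,
# edit that section by hand instead of appending a duplicate.
set_pcu_threshold() {
  boto_file="$1"
  threshold="$2"
  printf '[GSUtil]\nparallel_composite_upload_threshold = %s\n' "$threshold" >> "$boto_file"
}

# Demo against a scratch file so that no real configuration is modified.
scratch="$(mktemp)"
set_pcu_threshold "$scratch" 150M
cat "$scratch"
```

To apply it for real, the target would be the .boto file that gsutil reads (by default in the home directory). Because composite objects need a compiled crcmod on the download side, it may be worth running "gsutil version -l" first and checking the "compiled crcmod" line.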
    

    An example usage with some timing data is below:

    wm8b1-75c:pcu esalinas$ ls -alht
    total 2048000
    drwxr-xr-x+ 196 esalinas  CHARLES\Domain Users   6.5K Aug 11 17:09 ..
    -rw-r--r--    1 esalinas  CHARLES\Domain Users   1.0G Aug 11 17:07 1gb.dat
    drwxr-xr-x    3 esalinas  CHARLES\Domain Users   102B Aug 11 17:06 .
    wm8b1-75c:pcu esalinas$ 
    
    wm8b1-75c:pcu esalinas$ time gsutil cp 1gb.dat gs://fc-97a9614f-f6af-4551-9e1c-27e65f700b6e 
    Copying file://1gb.dat [Content-Type=application/octet-stream]...
    ==> NOTE: You are uploading one or more large file(s), which would run
    significantly faster if you enable parallel composite uploads. This
    feature can be enabled by editing the
    "parallel_composite_upload_threshold" value in your .boto
    configuration file. However, note that if you do this large files will
    be uploaded as `composite objects
    <https://cloud.google.com/storage/docs/composite-objects>`_,which
    means that any user who downloads such objects will need to have a
    compiled crcmod installed (see "gsutil help crcmod"). This is because
    without a compiled crcmod, computing checksums on composite objects is
    so slow that gsutil disables downloads of composite objects.
    
    Resuming upload for file://1gb.dat
    Uploading   .../fc-97a9614f-f6af-4551-9e1c-27e65f700b6e/1gb.dat: 1000 MiB/1000 MiB      
    
    real    2m55.588s
    user    0m22.443s
    sys 0m5.018s
    wm8b1-75c:pcu esalinas$ time gsutil -o GSUtil:parallel_composite_upload_threshold=150M -m cp 1gb.dat gs://fc-97a9614f-f6af-4551-9e1c-27e65f700b6e/1gb.dat.pcu.m.dat
    Copying file://1gb.dat [Content-Type=application/octet-stream]...
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_11: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_6: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_19: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_15: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_17: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_3: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_12: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_2: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_8: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_7: 50 MiB/50 MiB    
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_0: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_18: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_5: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_14: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_16: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_1: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_10: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_13: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_4: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_9: 50 MiB/50 MiB       
    
    real    1m20.800s
    user    0m24.071s
    sys 0m5.197s
    wm8b1-75c:pcu esalinas$ 
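
For a quick comparison of the two runs above, the "real" times can be converted to seconds and divided. This is only a sketch; the helper names to_seconds and speedup are made up for illustration, and a single pair of runs says little about how much variation to expect.

```shell
# Convert a bash `time` value such as "2m55.588s" to seconds.
to_seconds() {
  echo "$1" | awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Ratio of two wall-clock times (baseline divided by candidate).
speedup() {
  awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
    'BEGIN { printf "%.2f\n", a / b }'
}

# Plain cp versus parallel composite upload with -m, from the logs above.
speedup 2m55.588s 1m20.800s   # prints 2.17
```

So in this one measurement the 1 GB upload was a little over twice as fast with parallel composite uploads and "-m".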
    
  • pradeepnpradeepn MGHMember
    edited August 2016

    This works for me:

    gsutil cp ./chr21.vcf.gz gs://mybucket
    

    However I get the error "Failure: can't start new thread." with this:

    gsutil -o GSUtil:parallel_composite_upload_threshold=150M -m cp ./chr21.vcf.gz gs://mybucket
    
  • esalinasesalinas BroadMember, Broadie ✭✭✭
    edited August 2016

    I ran the same kind of command without any error, so I have not been able to reproduce the problem so far.

    Have you tried without "-m"?

    I was able to upload without "-m" and observe similar timing data:

    wm8b1-75c:pcu esalinas$ time gsutil -o GSUtil:parallel_composite_upload_threshold=150M  cp 1gb.dat gs://fc-97a9614f-f6af-4551-9e1c-27e65f700b6e/1gb.dat.pcu.dat
    Copying file://1gb.dat [Content-Type=application/octet-stream]...
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_18: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_8: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_19: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_6: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_2: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_1: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_5: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_11: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_15: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_14: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_16: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_4: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_12: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_3: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_10: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_7: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_0: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_17: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_13: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_9: 50 MiB/50 MiB       
    
    real    1m16.922s
    user    0m23.593s
    sys 0m7.620s
    wm8b1-75c:pcu esalinas$ 
    
    
    
  • pradeepnpradeepn MGHMember

    I get the same error in both cases:
    without -m but with -o GSUtil:parallel_composite_upload_threshold=150M
    and
    without -o GSUtil:parallel_composite_upload_threshold=150M but with -m

    Am I supposed to change something in gsutil config first?
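
One thing that might be worth trying for "can't start new thread" is lowering gsutil's parallelism, on the assumption (not confirmed in this thread) that the machine is hitting a per-user process or thread limit. The sketch below uses the parallel_process_count and parallel_thread_count options from the [GSUtil] config section; the values 1 and 2 are arbitrary, and the run/try_upload helpers are made up for illustration. By default it only prints the command.

```shell
# Dry run by default: print the command. Set RUN=1 to actually execute it.
run() {
  if [ "${RUN:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Upload with reduced parallelism (illustrative values, not recommendations).
try_upload() {
  run gsutil \
    -o "GSUtil:parallel_process_count=1" \
    -o "GSUtil:parallel_thread_count=2" \
    -o "GSUtil:parallel_composite_upload_threshold=150M" \
    cp "$1" "$2"
}

try_upload ./chr21.vcf.gz gs://mybucket
```

In bash, "ulimit -u" shows the current limit on user processes, which may help tell whether such a limit is unusually low on the machine that fails.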

  • esalinasesalinas BroadMember, Broadie ✭✭✭
    edited August 2016

    I am unable to reproduce your issue at the moment, but I am able to run the command without error using the broadinstitute/firecloud-cli docker image:

    wm8b1-75c:pcu esalinas$ ls -alht
    total 2048000
    drwxr-xr-x+ 197 esalinas  CHARLES\Domain Users   6.5K Aug 12 10:41 ..
    -rw-r--r--    1 esalinas  CHARLES\Domain Users   1.0G Aug 11 17:07 1gb.dat
    drwxr-xr-x    3 esalinas  CHARLES\Domain Users   102B Aug 11 17:06 .
    wm8b1-75c:pcu esalinas$ pwd
    /Users/esalinas/pcu
    wm8b1-75c:pcu esalinas$ time docker run -it  -v "$HOME"/.config:/.config   -v  "$PWD":/working broadinstitute/firecloud-cli   gsutil -m -o GSUtil:parallel_composite_upload_threshold=150M -m cp 1gb.dat gs://fc-97a9614f-f6af-4551-9e1c-27e65f700b6e/1gb.dat.pcu.m.cli_image.dat
    Copying file://1gb.dat [Content-Type=application/x-ns-proxy-autoconfig]...
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_5: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_1: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_4: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_7: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_6: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_2: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_9: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_3: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_0: 50 MiB/50 MiB       
    Uploading   ...sutil_help_cp/ba970c6914aaaba30e72e58a1cc9c481_8: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_18: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_14: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_19: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_17: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_16: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_15: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_12: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_11: 50 MiB/50 MiB       
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_13: 50 MiB/50 MiB      
    Uploading   ...util_help_cp/ba970c6914aaaba30e72e58a1cc9c481_10: 50 MiB/50 MiB       
    
    real    1m48.954s
    user    0m0.040s
    sys 0m0.049s
    wm8b1-75c:pcu esalinas$ 
    
    
    

    Note that "gcloud auth login" must have been run beforehand in order to log in, perhaps also inside the broadinstitute/firecloud-cli image, as shown in the README of the firecloud-cli GitHub repo: https://github.com/broadinstitute/firecloud-cli

    docker run --rm -it -v "$HOME"/.config:/.config broadinstitute/firecloud-cli gcloud auth login
    
  • birgerbirger Member, Broadie, CGA-mod ✭✭✭

    Note that Eddie demonstrated the use of the firecloud-cli docker image because that image includes an installation of the Google Cloud SDK, not because firecloud-cli plays any role in the use of the gsutil cp command.
