Using gsutil -m: how do I specify multiple sources and multiple destinations?

I want to download 15,000 files from my FireCloud bucket, all of which have the same filename but different paths, e.g. gs://fc-mybucket/MyTool/*/MySubtask/output.vcf

It's too slow to make 15,000 separate calls to gsutil like this one:
gsutil cp gs://fc-mybucket/MyTool/ID1/MySubtask/output.vcf ID1.output.vcf

Using the -m option as follows doesn't work:
gsutil -m cp gs://fc-mybucket/MyTool/*/MySubtask/output.vcf mydir/

because it keeps copying all the source files into a single file mydir/output.vcf.

I also tried using -r:
gsutil -m cp -r gs://fc-mybucket/MyTool/*/MySubtask/output.vcf mydir/
But this also fails to create the necessary subdirectories.

I would like to specify two input files: a list of 15,000 source paths and a list of 15,000 destination paths.

Any suggestions of how to download and rename all my files in parallel?

Answers

  • scalvo Member

    Or alternatively -- how can I create a tar file of all my */MySubtask/output.vcf in the cloud? Then I would only need to download a single tar file.

  • bshifaw Member, Broadie, Moderator admin

    Hi @scalvo,

    Mind testing whether gsutil ls gs://fc-mybucket/MyTool/**/MySubtask/output.vcf lists the files you want to copy, and only those files? This command uses two asterisks, which match across directory boundaries (mentioned here). If it lists exactly the files you would like to download, perhaps you could use the same pattern with the gsutil -m cp command.

  • scalvo Member

    I don't understand your suggestion. I already have a list of all the source URLs I want to copy, which all have different URLs but end in the same basename (BLAHBLAH/output.vcf). The problem is that if I try "cat MYFILELIST | gsutil -m cp -I mydir/" then it only outputs a single file, mydir/output.vcf.

  • bshifaw Member, Broadie, Moderator admin
    edited March 9

    Most likely this is because every file has the same basename across all subdirectories, so gsutil overwrites the file it's copying each time. Unfortunately gsutil does not have a --parent option, which would allow you to preserve the directory structure from the remote directory (issue ticket). I'll check with my team to see if they have any suggestions.
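
One possible workaround, offered only as a rough sketch: since the overwrite comes from every object sharing the basename output.vcf, each source URL can be paired with a unique local name built from its ID directory, and the pairs handed to separate gsutil cp calls that run in parallel. The helper names dest_name and make_pairs below are hypothetical, and the sketch assumes the URL layout from the question (gs://bucket/tool/ID/subtask/file, so the ID is the fifth "/"-separated field) and that no path contains whitespace.

```shell
#!/usr/bin/env bash
# Sketch: build unique local names for objects that share a basename.
# dest_name and make_pairs are hypothetical helpers, not part of gsutil.

# dest_name URL -> "<ID>.<basename>"
# Assumes the layout gs://<bucket>/<tool>/<ID>/<subtask>/<file>, so the ID
# is the 5th "/"-separated field of the URL.
dest_name() {
  local url="$1"
  local id
  id="$(printf '%s\n' "$url" | awk -F/ '{print $5}')"
  printf '%s.%s\n' "$id" "$(basename "$url")"
}

# make_pairs [DESTDIR]: read source URLs on stdin and print one
# "source destination" pair per line. DESTDIR defaults to ".".
make_pairs() {
  local destdir="${1:-.}"
  local url
  while IFS= read -r url; do
    [ -n "$url" ] || continue   # skip blank lines
    printf '%s %s/%s\n' "$url" "$destdir" "$(dest_name "$url")"
  done
}
```

With these functions defined, something like `mkdir -p mydir && make_pairs mydir < MYFILELIST | xargs -n 2 -P 8 gsutil cp` would run up to eight copies at a time, each writing to its own file such as mydir/ID1.output.vcf. This still starts one gsutil process per file, so the parallelism here comes from xargs -P rather than from gsutil -m; the value 8 is an arbitrary starting point.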