
BaseRecalibratorSpark double copying of reference input files


I am testing BaseRecalibratorSpark on GCP through wdl-runner (i.e. "gcloud alpha genomics pipelines run"), since I would like to include the Spark version of this tool in my future WDL scripts once it is out of beta. However, I have encountered the following issue:

When I allocate disk space using dynamic sizing, the tool fails because it runs out of disk space; to avoid this I need to double the local disk size. Reading the info written to stderr, I found that the Spark engine copies all input files (including the reference FASTA and the dbSNP VCF, which are pretty large) to a temp folder under /cromwell_root, even when the files are already in another sub-folder of /cromwell_root (they are copied there when the VM is created). This double copying makes running the tool on the cloud more expensive, both because of the substantial extra disk space required and because of the extra copying time.
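For reference, this is the kind of workaround I am applying for now in the task's runtime section (a minimal sketch; the variable names are my own, and the 2x multiplier is there only to leave room for the extra copies Spark makes under /cromwell_root/spark-*/userFiles-*):

```wdl
runtime {
  docker: gatk_docker
  memory: machine_mem + " MB"
  # disk_size is computed from the sizes of the inputs; doubling it
  # accommodates the second copy of the reference and known-sites files
  # that the Spark engine places in its own temp folder.
  disks: "local-disk " + (2 * disk_size) + " HDD"
}
```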

An example of this:

19/02/12 12:39:20 INFO SparkContext: Added file /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta at file:/cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta with timestamp 1549975160856
19/02/12 12:39:20 INFO Utils: Copying /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta to /cromwell_root/spark-639f237f-74bc-41f4-86c7-af2f294f9c0c/userFiles-7a595035-e731-49fd-a493-f129381ff1e7/Homo_sapiens_assembly38.fasta


19/02/12 12:39:49 INFO SparkContext: Added file /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf at file:/cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf with timestamp 1549975189045
19/02/12 12:39:49 INFO Utils: Copying /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf to /cromwell_root/spark-639f237f-74bc-41f4-86c7-af2f294f9c0c/userFiles-7a595035-e731-49fd-a493-f129381ff1e7/Homo_sapiens_assembly38.dbsnp138.vcf

This also happens when run locally on my laptop.

Is there a way to configure the tool to prevent this "double copying" of files? Can the NIO implementation be used here to access the large reference files directly from the Broad's public buckets (i.e. without copying them to the VM at all)?

This is the command used:

gatk --java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:+PrintFlagsFinal -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:gc_log.log -Xms${command_mem}m" \
      BaseRecalibratorSpark \
      -R ${ref_fasta} \
      -I ${input_bam} \
      --use-original-qualities \
      -O ${recalibration_report_filename} \
      --known-sites ${dbSNP_vcf} \
      --known-sites ${sep=" --known-sites " known_indels_sites_VCFs} \
      -L ${sequence_group_interval} \
      -- --spark-master 'local[*]'

Thanks in advance

Best Answers

  • SChaluvadi (admin)
    Accepted Answer

    A ticket has been filed for this improvement, so you can follow its progress here!


  • mack812 (Spain, Member)

    Thanks for your reply @SChaluvadi, I will try this option soon.

    Regarding NIO, is there any way to know which GATK4 tools are able to work in "NIO mode"? I have tried Mutect2 and it works nicely without copying anything onto local disks, simply by declaring the input files in the task section of the WDL as "String" instead of "File". Which other GATK4 tools accept this?
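    For what it's worth, this is the pattern I used with Mutect2 (a minimal sketch; the task and variable names are my own):

```wdl
task Mutect2ViaNIO {
  input {
    # Declared as String rather than File, so Cromwell passes the
    # gs:// path straight through to the tool, which then reads it
    # via NIO instead of localizing it to the VM's disk first.
    String ref_fasta
    String tumor_bam
    File intervals
  }
  # ...
}
```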

