BaseRecalibratorSpark double copying of reference input files

Hi,

I am testing BaseRecalibratorSpark on GCP through wdl-runner (i.e. "gcloud alpha genomics pipelines run"), because I would like to include the Spark version of this tool in my future WDL scripts once it is out of beta. However, I have run into the following issue:

When I allocate disk space with dynamic sizing, the tool fails because it runs out of disk space. To avoid this I have to double the size of the local disk. From the info written to stderr, I found that the Spark engine copies all input files (including the reference FASTA file and the dbSNP VCF file, which are both large) to a temp folder under /cromwell_root, even though the files are already in another subfolder of /cromwell_root (they are copied there when the VM is created). This double copying makes the tool more expensive to run on the cloud, both because of the substantial extra disk space required and because of the extra copying time.

An example of this:

19/02/12 12:39:20 INFO SparkContext: Added file /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta at file:/cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta with timestamp 1549975160856
19/02/12 12:39:20 INFO Utils: Copying /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta to /cromwell_root/spark-639f237f-74bc-41f4-86c7-af2f294f9c0c/userFiles-7a595035-e731-49fd-a493-f129381ff1e7/Homo_sapiens_assembly38.fasta

Or:

19/02/12 12:39:49 INFO SparkContext: Added file /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf at file:/cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf with timestamp 1549975189045
19/02/12 12:39:49 INFO Utils: Copying /cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.dbsnp138.vcf to /cromwell_root/spark-639f237f-74bc-41f4-86c7-af2f294f9c0c/userFiles-7a595035-e731-49fd-a493-f129381ff1e7/Homo_sapiens_assembly38.dbsnp138.vcf

This also happens when I run the tool locally on my laptop.
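
For anyone who wants to check their own runs, the copies reported in the log lines above can be listed with a small script. This is only a minimal sketch: it assumes the stderr format matches the "INFO Utils: Copying ... to ..." lines shown here and that paths contain no spaces, and the function name find_spark_copies is made up for illustration.

```python
import re

# Matches the "Utils: Copying <src> to <dst>" lines seen in the stderr above.
# Assumption: paths contain no whitespace.
COPY_LINE = re.compile(r"INFO Utils: Copying (?P<src>\S+) to (?P<dst>\S+)")


def find_spark_copies(log_text):
    """Return (source, destination) pairs for each file Spark reports copying."""
    return [(m.group("src"), m.group("dst")) for m in COPY_LINE.finditer(log_text)]


# One of the log lines from this post, used as sample input.
sample_log = (
    "19/02/12 12:39:20 INFO Utils: Copying "
    "/cromwell_root/broad-references/hg38/v0/Homo_sapiens_assembly38.fasta to "
    "/cromwell_root/spark-639f237f-74bc-41f4-86c7-af2f294f9c0c/"
    "userFiles-7a595035-e731-49fd-a493-f129381ff1e7/Homo_sapiens_assembly38.fasta\n"
)

for src, dst in find_spark_copies(sample_log):
    print(f"{src} -> {dst}")
```

Running it over the full stderr of a task should show every input that ends up on disk twice.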

Is there a way to configure the tool to prevent this "double copying" of files? Can the NIO implementation be used here to access the big reference files directly from the Broad's public buckets (i.e. without copying them to the VM at all)?

This is the command used:

gatk --java-options "-XX:GCTimeLimit=50 -XX:GCHeapFreeLimit=10 -XX:+PrintFlagsFinal -XX:+PrintGCTimeStamps -XX:+PrintGCDateStamps -XX:+PrintGCDetails -Xloggc:gc_log.log -Xms${command_mem}m" \
      BaseRecalibratorSpark \
      -R ${ref_fasta} \
      -I ${input_bam} \
      --use-original-qualities \
      -O ${recalibration_report_filename} \
      --known-sites ${dbSNP_vcf} \
      --known-sites ${sep=" --known-sites " known_indels_sites_VCFs} \
      -L ${sequence_group_interval} \
      -- --spark-master 'local[*]'
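
As a stopgap, the disk sizing can account for the second copy explicitly instead of doubling everything. The sketch below only illustrates the arithmetic: the function name, the padding value and the file sizes are hypothetical and are not taken from any real workflow, and it assumes that only the files listed in spark_copied_gb are copied again by Spark.

```python
import math


def estimate_disk_gb(input_sizes_gb, spark_copied_gb, padding_gb=20):
    """Estimate the local disk size (GB) for a task.

    input_sizes_gb:  sizes of all inputs localized to the VM
    spark_copied_gb: sizes of the inputs that Spark copies a second time
    padding_gb:      hypothetical safety margin for outputs and temp files
    """
    total = sum(input_sizes_gb) + sum(spark_copied_gb) + padding_gb
    return math.ceil(total)


# Made-up sizes, for illustration only: reference, dbSNP VCF, input BAM.
ref_gb, dbsnp_gb, bam_gb = 3.0, 10.0, 30.0

# Here only the reference and the dbSNP VCF are assumed to be copied twice.
disk_gb = estimate_disk_gb([ref_gb, dbsnp_gb, bam_gb], [ref_gb, dbsnp_gb])
print(disk_gb)
```

Counting only the duplicated files keeps the extra disk (and cost) proportional to what Spark actually copies, rather than to the whole input set.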

Thanks in advance

Best Answers

  • SChaluvadi admin
    Accepted Answer

    @mack812
    A ticket was opened for this improvement, so you can follow its progress here!

Answers

  • mack812 (Spain, Member)

    Thanks for your reply @SChaluvadi, I will try this option soon.

    Regarding NIO, is there any way to know which GATK4 tools are able to work in "NIO mode"? I have tried Mutect2 and it works nicely without copying anything to local disk, just by declaring the input files in the task section of the WDL as "String" instead of "File". Which other GATK4 tools accept this?
