Failure while processing a large unmapped BAM file as input to Cromwell

Hi Team,

We are running the five-dollar pipeline with Cromwell (v39) on AWS. When we run the pipeline on small files (~300 MB), processing completes fine. But when we provide a large file (48 GB to 68 GB), it fails with the error below.

We observed that "SplitLargeReadGroup.SamSplitter" is only triggered for large files, not for smaller files in the ~MB range.

Do we need any special configuration to handle large files, or is there something wrong in our configuration that is causing this pipeline to fail?

Exception:

AwsBatchAsyncBackendJobExecutionActor [5f712db2SplitLargeReadGroup.SamSplitter:NA:1]: set -e
mkdir output_dir

total_reads=$(samtools view -c /cromwell_root/cromwelleast/references/broad-references/macrogen_NA12878_full.bam)

java -Dsamjdk.compression_level=2 -Xms3000m -jar /usr/gitc/picard.jar SplitSamByNumberOfReads \
  INPUT=/cromwell_root/cromwellbucket/references/broad-references/macrogen_NA12878_full.bam \
  OUTPUT=output_dir \
  SPLIT_TO_N_READS=48000000 \
  TOTAL_READS_IN_INPUT=$total_reads
[2019-04-18 20:50:06,31] [error] AwsBatchAsyncBackendJobExecutionActor [5f712db2SplitLargeReadGroup.SamSplitter:NA:1]: Error attempting to Execute
cromwell.engine.io.IoAttempts$EnhancedCromwellIoException: [Attempted 1 time(s)] - FileSystemException: /tmp/temp-s3-538074772416833219ce_WholeGenomeGermlineSingleSample_91352b21-b271-443c-b332-0a25b27ec894_call-UnmappedBamToAlignedBam_UnmappedBamToAlignedBam_b2858ebe-2463-48b7-bfc8-f83a786e5247_call-SplitRG_shard-0_SplitLargeReadGroup_5f712db2-4f6e-9955-7feeb03af894_call-SamSplitter_script: File name too long
Caused by: java.nio.file.FileSystemException: /tmp/temp-s3-538074772416833219ce_WholeGenomeGermlineSingleSample_91352b21-b271-443c-b332-0a25b27ec894_call-UnmappedBamToAlignedBam_UnmappedBamToAlignedBam_b2858ebe-2463-48b7-bfc8-f83a786e5247_call-SplitRG_shard-0_SplitLargeReadGroup_5f712db2-4f6e-9955-7feeb03af894_call-SamSplitter_script: File name too long
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102)
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107)
        at sun.nio.fs.UnixFileSystemProvider.newByteChannel(UnixFileSystemProvider.java:214)
        at java.nio.file.Files.newByteChannel(Files.java:361)
        at java.nio.file.Files.createFile(Files.java:632)
        at java.nio.file.TempFileHelper.create(TempFileHelper.java:138)
        at java.nio.file.TempFileHelper.createTempFile(TempFileHelper.java:161)
        at java.nio.file.Files.createTempFile(Files.java:897)
        at org.lerch.s3fs.S3SeekableByteChannel.<init>(S3SeekableByteChannel.java:52)
        at org.lerch.s3fs.S3FileSystemProvider.newByteChannel(S3FileSystemProvider.java:360)
        at java.nio.file.spi.FileSystemProvider.newOutputStream(FileSystemProvider.java:434)
        at java.nio.file.Files.newOutputStream(Files.java:216)
        at java.nio.file.Files.write(Files.java:3292)
        at better.files.File.writeByteArray(File.scala:270)
        at better.files.File.write(File.scala:280)
        at cromwell.core.path.BetterFileMethods.write(BetterFileMethods.scala:179)
        at cromwell.core.path.BetterFileMethods.write$(BetterFileMethods.scala:178)
        at cromwell.filesystems.s3.S3Path.write(S3PathBuilder.scala:158)
        at cromwell.core.path.EvenBetterPathMethods.writeContent(EvenBetterPathMethods.scala:99)
        at cromwell.core.path.EvenBetterPathMethods.writeContent$(EvenBetterPathMethods.scala:97)
        at cromwell.filesystems.s3.S3Path.writeContent(S3PathBuilder.scala:158)
        at cromwell.engine.io.nio.NioFlow.$anonfun$write$1(NioFlow.scala:89)
        at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:12)
        at cats.effect.internals.IORunLoop$.cats$effect$internals$IORunLoop$$loop(IORunLoop.scala:87)
        at cats.effect.internals.IORunLoop$RestartCallback.signal(IORunLoop.scala:351)
        at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:372)
        at cats.effect.internals.IORunLoop$RestartCallback.apply(IORunLoop.scala:312)
        at cats.effect.internals.IOShift$Tick.run(IOShift.scala:36)
        at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:40)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:44)

Answers

  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi there -

    I am curious whether increasing the size of the /tmp/ directory would help with this.

    Also, this seems to refer back to a previous fix found here.

    The Cromwell team has set up some documentation that is specific to AWS here.

  • ssb_cromwell Member

    @AdelaideR , Thanks for your response!

    I'm just trying to understand which part of the task you want us to increase memory for.

    I'm a bit confused here: should we increase memory: "3.75 GB",

    or am I missing some other component that should be updated? Please help.

    task SamSplitter {
      input {
        File input_bam
    
    #    File input_bam_index
        Int n_reads
        Int preemptible_tries
        Int compression_level
      }
    
      Float unmapped_bam_size = size(input_bam, "GB")
      # Since the output bams are less compressed than the input bam we need a disk multiplier that's larger than 2.
      Float disk_multiplier = 2.5
      Int disk_size = ceil(disk_multiplier * unmapped_bam_size + 20)
    
      command {
        set -e
        mkdir output_dir
    
        total_reads=$(samtools view -c ~{input_bam})
    
        java -Dsamjdk.compression_level=~{compression_level} -Xms3000m -jar /usr/gitc/picard.jar SplitSamByNumberOfReads \
          INPUT=~{input_bam} \
          OUTPUT=output_dir \
          SPLIT_TO_N_READS=~{n_reads} \
          TOTAL_READS_IN_INPUT=$total_reads
      }
      output {
        Array[File] split_bams = glob("output_dir/*.bam")
      }
      runtime {
    #    docker: "us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.4.1-1540490856"
        docker: "us.gcr.io/broad-gotc-prod/genomes-in-the-cloud:2.4.2-1552931386"
        preemptible: preemptible_tries
        memory: "3.75 GB"
    #    disks: "local-disk " + disk_size + " HDD"
        disks: "local-disk"
      }
    }
    
  • AdelaideR Unconfirmed, Member, Broadie, Moderator admin

    Hi @ssb_cromwell

    You might need to try a few different parameters for [SplitSamByNumberOfReads](https://software.broadinstitute.org/gatk/documentation/tooldocs/4.0.6.0/picard_sam_SplitSamByNumberOfReads.php)

    You currently have the Java heap's starting size set with -Xms3000m. You could raise the maximum heap to 10G (-Xmx10g) and additionally provide a temporary directory that the tool can spill to. That directory can be set with, e.g., TMP_DIR=/tmp if calling from the Picard jar, or --TMP_DIR /tmp if calling the tool from the GATK4 jar.
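    For example, the command block of the SamSplitter task quoted above could be adjusted along these lines. The 10G maximum heap and the /tmp spill directory are illustrative values, not tested settings; size them to the instance the task actually runs on:

```wdl
command {
  set -e
  mkdir output_dir

  total_reads=$(samtools view -c ~{input_bam})

  # -Xmx10g and TMP_DIR=/tmp are illustrative values; tune them to
  # the memory and local disk available on your AWS Batch instance.
  java -Dsamjdk.compression_level=~{compression_level} -Xms3000m -Xmx10g \
    -jar /usr/gitc/picard.jar SplitSamByNumberOfReads \
    INPUT=~{input_bam} \
    OUTPUT=output_dir \
    SPLIT_TO_N_READS=~{n_reads} \
    TOTAL_READS_IN_INPUT=$total_reads \
    TMP_DIR=/tmp
}
```

    If you raise the heap, the runtime block's memory: "3.75 GB" would also need to grow so the container can accommodate it.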

    See this thread for some discussion around TMP_DIR.

    The other option is to try the --USE_JDK_DEFLATER option, which makes the tool compress its output with the JDK's deflater instead of the Intel deflater.
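    In Picard jar syntax that flag is passed like the other arguments, e.g. (a sketch only, appended to the existing invocation from the task above):

```wdl
  java -Dsamjdk.compression_level=~{compression_level} -Xms3000m \
    -jar /usr/gitc/picard.jar SplitSamByNumberOfReads \
    INPUT=~{input_bam} \
    OUTPUT=output_dir \
    SPLIT_TO_N_READS=~{n_reads} \
    TOTAL_READS_IN_INPUT=$total_reads \
    USE_JDK_DEFLATER=true
```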
