
Workaround for the GenomicsDBImport + Draft Assemblies + Firecloud = World of Pain Equation

jasongallant1 (East Lansing, MI), Member

OK, so my enthusiasm is a bit reserved now that I've gotten to the final phase, joint calling! My naive approach was to hack up the lovely featured WDLs to ignore BQSR, etc., to produce first-round VCFs. What I neglected to realize is that FireCloud can't handle 4,000 scatter jobs gracefully. My first run failed, and I can't even load the page to see what happened.

Scratching my head, I went back to the drawing board and the message boards. One way of solving this would be to remake a 'pseudo-genome' with my smaller contigs linked by runs of Ns, but that would require realignment and a bunch of other stuff I didn't feel like doing. Instead, I spent the past two days banging my head against the keyboard and came up with a workflow in which each "task" loops over an interval list, so that each process handles about 16 Mb of the genome. WDLs require some really special care to avoid bash variable expansion issues, so it took a lot of iteration to get something working.

The obvious solution to this would be to get GenomicsDBImport to handle more than one interval at a time, which I hear is in development. Any updates on when this may happen?

Here's my general approach. The contortions to get bash to play nice with WDL were a bit tricky, so if someone has suggestions for a 'better way' I'm all ears.

Currently testing on FireCloud; I can report back if things get dicey. So far I'm getting reasonable runtimes. I'm documenting all of this, FWIW, on GitHub: https://github.com/msuefishlab/pkings_firecloud.

The general workflow goes something like this:

Run SplitIntervals on the original genome (using the appropriate subdivision mode to get equal-sized chunks for best scatter performance):

./gatk SplitIntervals \
  -R draft_assembly.fasta \
  -L ./intervals.list \
  -scatter 50 \
  -O interval-files \
  -mode BALANCING_WITHOUT_INTERVAL_SUBDIVISION_WITH_OVERFLOW
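The workflow below takes `list_of_scatter_files`, a file of file names pointing at the per-shard interval lists. One hypothetical way to build that input from the SplitIntervals output (the file and list names here are my own placeholders, not from the post):

```shell
# Hypothetical: collect the per-shard interval lists emitted by SplitIntervals
# into a file-of-file-names for the workflow's list_of_scatter_files input.
# The placeholder interval files below stand in for real SplitIntervals output.
mkdir -p interval-files
touch interval-files/000{0..4}-scattered.interval_list
ls interval-files/*.interval_list > scatter_files.list
wc -l < scatter_files.list  # one line per shard; here 5
```

Each line of `scatter_files.list` then becomes one element of `scatter_array` via `read_lines()`.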

Run the following WDL:

## Simplified Joint Genotyping With Contig 'Batching'
## Jason Gallant
## Modified From gatk/joint-discovery-gatk4 snapshot6 on Firecloud ([email protected])

workflow JointGenotyping {
  File list_of_scatter_files

  String callset_name
  File sample_name_map

  File ref_fasta
  File ref_fasta_index
  File ref_dict

  String gatk_docker
  String gatk_path
  String python_docker

  Int small_disk
  Int medium_disk
  Int huge_disk


  Array[File] scatter_array = read_lines(list_of_scatter_files)


  #scatter_array should list 50 interval files of approximately equal total length (here ~15 Mbp each)
  scatter (idx in range(length(scatter_array))) {

  call ImportGVCFs {
      input:
        sample_name_map = sample_name_map,
        interval_list = scatter_array[idx],
        workspace_dir_name = "genomicsdb",
        disk_size = medium_disk,
        docker_image = gatk_docker,
        gatk_path = gatk_path,
        batch_size = 50
    }

  call GenotypeGVCFs {
    input:
      workspace_tar = ImportGVCFs.output_genomicsdb,
      interval_list = scatter_array[idx],
      ref_fasta = ref_fasta,
      ref_fasta_index = ref_fasta_index,
      ref_dict = ref_dict,
      disk_size = medium_disk,
      docker_image = gatk_docker,
      gatk_path = gatk_path
  }

  } #end scatter

call GatherVcfs as FinalGatherVcf {
  input:
    input_vcfs_fofn = write_lines(GenotypeGVCFs.output_vcf),
    output_vcf_name = callset_name + ".raw.vcf.gz",
    disk_size = medium_disk,
    docker_image = gatk_docker,
    gatk_path = gatk_path
}

output {
  # outputs from the small callset path through the wdl
  FinalGatherVcf.output_vcf
  FinalGatherVcf.output_vcf_index
}

} # end workflow

task ImportGVCFs {
  File sample_name_map
  File interval_list

  String workspace_dir_name

  String java_opt
  String gatk_path

  String docker_image
  Int disk_size
  String mem_size
  Int preemptibles
  Int batch_size
  String dollar = "$"

  command <<<
    i=0 && \
    grep -v '^@' ${interval_list} | while read -r line ; do
      let "i++"
      the_interval=$(echo $line | awk '{printf "%s:%s-%s\n", $1,$2,$3}')
      echo "working on file $i, interval $the_interval..."
            ${gatk_path} --java-options "${java_opt}" \
            GenomicsDBImport \
            --genomicsdb-workspace-path ${workspace_dir_name}_$i \
            --batch-size ${batch_size} \
            -L $the_interval \
            --sample-name-map ${sample_name_map} \
            --reader-threads 5 \
            -ip 500

            tar -cf ${workspace_dir_name}_$i.tar ${workspace_dir_name}_$i
    done

  >>>

  output {
          Array[File] output_genomicsdb = glob("${workspace_dir_name}_*.tar")
  }
  runtime {
    docker: docker_image
    memory: mem_size
    cpu: "2"
    disks: "local-disk " + disk_size + " HDD"
    preemptible: preemptibles
  }
}

task GenotypeGVCFs {
  Array[File] workspace_tar
  File interval_list
  String gatk_path
  String java_opt

  File ref_fasta
  File ref_fasta_index
  File ref_dict

  String docker_image
  Int disk_size
  String mem_size
  Int preemptibles
  String dollar = "$"


  command <<<

        i=0 && \
        grep -v '^@' ${interval_list} | while read -r line ; do
        let "i++"
        the_interval=$(echo $line | awk '{printf "%s:%s-%s\n", $1,$2,$3}')
        the_wkspc=$(cat ${write_lines(workspace_tar)} | sed "${dollar}{i}q;d")
        echo "working on file $i, interval $the_interval, workspace $the_wkspc"
        tar -xf $the_wkspc
        WORKSPACE=$(basename $the_wkspc .tar)

        ${gatk_path} --java-options "${java_opt}" \
        GenotypeGVCFs \
        -R ${ref_fasta} \
        -O output_$i.vcf.gz \
        -G StandardAnnotation \
        --only-output-calls-starting-in-intervals \
        -new-qual \
        -V gendb://$WORKSPACE \
        -L $the_interval

      done
  >>>
  runtime {
    docker: docker_image
    memory: mem_size
    cpu: "2"
    disks: "local-disk " + disk_size + " HDD"
    preemptible: preemptibles
  }
  output {
    Array[File] output_vcf = glob("output_*.vcf.gz")
    Array[File] output_vcf_index = glob("output_*.vcf.gz.tbi")
  }
}
task GatherVcfs {
  File input_vcfs_fofn
  String output_vcf_name

  String gatk_path
  String java_opt

  String docker_image
  Int disk_size
  String mem_size
  Int preemptibles

  command <<<

    # the FOFN may arrive with several paths tab-separated on one line;
    # normalize to one path per line for GatherVcfsCloud
    tr '\t' '\n' < ${input_vcfs_fofn} > inputs.args

    # --ignore-safety-checks makes a big performance difference, so we include it in our invocation
    ${gatk_path} --java-options "${java_opt}" \
    GatherVcfsCloud \
    --ignore-safety-checks \
    --gather-type BLOCK \
    --input inputs.args \
    --output ${output_vcf_name}

    ${gatk_path} --java-options "-Xmx6g -Xms6g" \
    IndexFeatureFile \
    --feature-file ${output_vcf_name}
  >>>
  runtime {
    docker: docker_image
    memory: mem_size
    cpu: "1"
    disks: "local-disk " + disk_size + " HDD"
    preemptible: preemptibles
  }
  output {
    File output_vcf = "${output_vcf_name}"
    File output_vcf_index = "${output_vcf_name}.tbi"
  }
}
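The two bash idioms the tasks lean on can be sanity-checked outside of FireCloud. A standalone sketch (the interval list and file names below are invented sample data, and plain `${i}` replaces the WDL `${dollar}{i}` escaping trick since there's no WDL interpolation here):

```shell
# 1) Parse a Picard-style interval list (header lines start with '@';
#    fields are contig/start/end/strand/name, whitespace-separated here,
#    which awk's default splitting handles the same as tabs) into GATK
#    -L strings like "contig:start-end".
cat > test.interval_list <<'EOF'
@HD VN:1.6
contig_1 1 16000000 + .
contig_2 1 9500000 + .
EOF
grep -v '^@' test.interval_list | while read -r line; do
  echo "$line" | awk '{printf "%s:%s-%s\n", $1, $2, $3}'
done > parsed_intervals.txt
cat parsed_intervals.txt

# 2) Pull the i-th line from a file-of-file-names, as GenotypeGVCFs does
#    to pair each interval with its workspace tar.
printf 'genomicsdb_1.tar\ngenomicsdb_2.tar\n' > workspaces.txt
i=2
sed "${i}q;d" workspaces.txt  # prints genomicsdb_2.tar

# 3) Normalize a possibly tab-separated FOFN to one path per line,
#    as GatherVcfs does before handing inputs.args to GatherVcfsCloud.
printf 'a.vcf.gz\tb.vcf.gz\nc.vcf.gz\n' > fofn.txt
tr '\t' '\n' < fofn.txt > inputs.args
wc -l < inputs.args  # 3
```

None of this needs GATK installed, which makes it a quick way to debug the loop logic before burning cloud hours.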

Answers

  • jasongallant1 (East Lansing, MI), Member

    Update: preemption can be a big problem for long-running jobs. Most jobs complete pretty rapidly (< 2 h), but a handful run for 8 h, particularly those with long interval lists (i.e., many short contigs). Although each shard covers the same number of bp, there seems to be a fixed amount of overhead per interval when reading the files in: the number of intervals predicts job length to some degree, even when total bp is held constant. FC finished importing and genotyping 48/50 intervals in a few hours' time, but is churning on the last file, and on an unfortunate middle scaffold that keeps getting preempted. I suppose I could pony up the additional $$ for non-preemptible instances, but with no idea how long it actually takes, I'm hesitant to do this!

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Hi Jason, sorry to hear your enthusiasm has waned, let me see what we can do to revive it :)

    First -- the system should in fact be able to handle that submission size; we recently ran some joint calling tests on 76K samples and that went fine. Would you be able to share your workspace with our engineers ([email protected]) so they can troubleshoot?

    Re: GenomicsDB, no update yet on multi-interval support, but I'll see if anyone on the team can advise on getting this to work for you. Will have to be after the weekend though.

  • jasongallant1 (East Lansing, MI), Member

    Thanks, Geraldine. I've just shared the workspace with the engineers. The problem is a 504 error when trying to load the workflow detail page that lists all of the job outputs in FC. The method configuration (there are a lot in there) is gatk4-germline-snps-indels-simple_O-d9pUmppY8, run on February 27, 2018, 10:24 PM.

  • jasongallant1 (East Lansing, MI), Member
    edited March 2018

    Hi @Geraldine_VdAuwera, thanks for the response on this one. I tried poking around with the API tool originally, but was clearly sniffing around the wrong operation. Looks like you have your authentication issue solved, because I was able to get the whole JSON file just now. My issue was that I couldn't even see which jobs failed, and I didn't feel like skimming through 4,000+ folders to find them.

    I found the source of the issue in the stderr log from a failed job. Any insights? It seems like a transient Google issue (see https://gatkforums.broadinstitute.org/firecloud/discussion/9657/com-google-cloud-storage-storageexception-500-internal-server-error-backend-error). Gonna try to restart the job using call caching.

    A USER ERROR has occurred: Couldn't read file. Error was: Failure while waiting for FeatureReader to initialize  with exception: com.google.cloud.storage.StorageException: 502 Bad Gateway
    <!DOCTYPE html>
    <html lang=en>
      <title>Error 502 (Server Error)!!1</title>
      <p><b>502.</b> <ins>That's an error.</ins>
      <p>The server encountered a temporary error and could not complete your request.<p>Please try again in 30 seconds.  <ins>That's all we know.</ins>
    
  • jasongallant1 (East Lansing, MI), Member

    I was able to survey the damage, and it looks like only TWO files failed, both with the same error detailed above. They failed within 20 minutes of each other, but apparently not at the same time (according to the date stamp on the stderr file, which may be a poor indicator). Both occurred during the 'batch import' phase of GenomicsDBImport, one about 0.37 minutes in, the other after 23 minutes.

  • Geraldine_VdAuwera (Cambridge, MA), Member, Administrator, Broadie admin

    Hah, yes that looks like a transient error alright. We have auto retries for some of those but it's not perfect. Let me know how the re-run goes.

  • jasongallant1 (East Lansing, MI), Member

    Yes, it was indeed a transient error! Good lord, two weeks down the drain trying to build a better mousetrap! :blush:
