Genome STRiP interim release 1.04.1162 (and upgrade instructions)

bhandsaker (Member, Broadie, Moderator)

Today we made available a new interim release of Genome STRiP (1.04.1162).

This release requires an upgrade procedure if you want to use it with existing metadata directories (see below).

Some features of this release:

  • The effective genome sizes are now computed during pre-processing and stored in the metadata directory (in file genome_sizes.txt).
    You no longer need to edit your configuration file to set the genome sizes based on your reference sequence and genome mask.

  • Support for accessing bam files over http, ftp and s3.
    Currently, this is mostly useful for small-scale workflows, like genotyping a small number of sites against the 1000 Genomes data.

  • Enhancements to the IntensityRankSum annotator to support duplications as well as deletions.

  • Various performance and stability enhancements and bug fixes.

Upgrade Procedure (excerpted from UPGRADE.txt)

During the preprocessing phase, Genome STRiP produces a metadata directory containing summary information about your data set.
Most releases of Genome STRiP are backwards compatible with metadata computed from earlier versions.
However, if you are upgrading to a release newer than 1.04.1068 with metadata generated by release 1.04.1068 or earlier, then you need to upgrade your metadata as described below.

In releases newer than 1.04.1068, we have automated the computation of the effective genome sizes and have changed this computation slightly.
The effective genome sizes are now calculated as part of SVPreprocess and are stored in the metadata directory in a file called genome_sizes.txt.
These effective genome sizes are no longer read from the configuration file (the old values can remain there, but will be ignored).
This means you no longer need to edit your configuration file to change these values based on your reference genome and genome mask.

If you have a metadata directory created by a previous version of Genome STRiP, you will need to update the genome_sizes.txt file and possibly other metadata.
The process is slightly different depending on whether your data set contains data aligned to your entire reference genome or data from only a portion of the reference
(for example, if your input bams contain only reads aligned to one chromosome).

If your data covers your entire reference genome, then to update the genome_sizes.txt file, it should be sufficient to rerun SVPreprocess.q with the same settings you used originally.
The Queue script should only run a single command, ComputeGenomeSizes, to create metadata/genome_sizes.txt.
It is recommended that you do a Queue dry run (no -run argument) first to make sure that SVPreprocess.q will not try to recreate all of your metadata.
Note that the results you get using the new genome sizes may be slightly different (hopefully better!) than the results from previous releases of Genome STRiP.
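
As a rough illustration of the dry-run advice above, here is a minimal shell sketch. The function names without_run_flag and has_genome_sizes are hypothetical helpers (they are not part of Genome STRiP or Queue), and the sketch assumes you still have the original SVPreprocess.q command line you used for preprocessing.

```shell
# Hypothetical helper (not part of Genome STRiP): print the given command
# line one argument per line with any -run argument removed, so you can
# review what a Queue dry run would be invoked with.
without_run_flag() {
    for arg in "$@"; do
        [ "$arg" = "-run" ] || printf '%s\n' "$arg"
    done
}

# Hypothetical helper: succeed if the metadata directory given as $1
# contains a non-empty genome_sizes.txt (useful after the real rerun).
has_genome_sizes() {
    [ -s "$1/genome_sizes.txt" ]
}
```

You could pass your original arguments to without_run_flag to see the dry-run form of the command, and call has_genome_sizes on your metadata directory after the real rerun to confirm the file was created.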

If your data covers a subset of your reference genome, then you need to create genome_sizes.txt and you need to recompute your GC-bias profiles.
To do this, you need to first remove the file metadata/gcprofile/.reference.gcprof.zip.done (where metadata is your metadata directory) and then rerun SVPreprocess.q.
In addition, you should use two additional arguments to SVPreprocess.q when you rerun it:

  -useMultiStep
  -computeSizesInterval <interval>

The -useMultiStep argument should prevent all of your metadata from being recomputed (only the GC-bias profiles should be recomputed).
The -computeSizesInterval argument specifies the interval on the reference covered by your bam files (only a single co-linear interval is supported).
For example, if your bam files cover only chromosome 20, you would add "-computeSizesInterval 20".
It is recommended that you do a dry run first to make sure that the only commands being run by SVPreprocess.q are ComputeGenomeSizes, ComputeGCProfiles and MergeGCProfiles.
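
The steps for the partial-genome case can be sketched in shell as follows. The function name prepare_subset_upgrade is a hypothetical helper, not something shipped with Genome STRiP; it only removes the marker file named above and prints the two extra arguments, leaving it to you to append them to your original SVPreprocess.q command line.

```shell
# Hypothetical helper (not part of Genome STRiP).
#   $1 = your metadata directory
#   $2 = the single interval covered by your bam files (for example, 20)
# Removes the GC-profile marker file so the profiles are recomputed, then
# prints the two extra SVPreprocess.q arguments, one per line.
prepare_subset_upgrade() {
    metadata_dir=$1
    interval=$2
    rm -f "$metadata_dir/gcprofile/.reference.gcprof.zip.done"
    printf '%s\n' "-useMultiStep" "-computeSizesInterval" "$interval"
}
```

For example, prepare_subset_upgrade metadata 20 would remove metadata/gcprofile/.reference.gcprof.zip.done and print the arguments to add for data covering only chromosome 20. A dry run is still advisable before the real rerun.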
