ComputeGenomeMask

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 7,145Administrator, GATK Developer admin
edited January 2013 in GenomeSTRiP Documentation

1. Introduction

The ComputeGenomeMask utility determines the alignability of each base in the
reference genome.

Mask files are generated based on a fixed read length L. A base is considered
alignable if a window of length L centered on the base is unique within the
reference sequence.

The ComputeGenomeMask utility works by dividing the input genome into
sequences of length L for every base position and then aligning these
sequences against the reference genome (using BWA) and looking for unique
matches.

The implementation of genome masking in Genome STRiP is likely to change in
the future.

2. Inputs / Arguments

  • -R <fasta-file> : Reference sequence. : An indexed fasta file containing
    the reference sequence to mask. : The fasta file must have been indexed using
    BWA in preparation for BWA alignment. The fasta file must also be indexed
    with 'samtools faidx' or the equivalent.

  • -readLength <length> : The size of the window to use for determining
    alignability.

  • -sequence <sequenceName> : The name of the sequence to process (default is
    to process the entire genome). This can be used to parallelize the
    computation by chromosome.

3. Outputs

  • -O <mask-file> : Output genome mask file (fasta format). Default is to
    write to stdout. The mask file contains '0' at alignable positions and '1'
    at non-alignable positions.

4. Running

ComputeGenomeMask is part of Genome STRiP, which is a third-party GATK library.

An example invocation is shown below:

export LD_LIBRARY_PATH=${SV_DIR}/bwa:${LD_LIBRARY_PATH}

java -Xmx2g -cp SVToolkit.jar:GenomeAnalysisTK.jar \
    org.broadinstitute.sv.apps.ComputeGenomeMask \ 
    -R Homo_sapiens_assembly18.fasta \ 
    -O Homo_sapiens_assembly18.mask.chr1.36.fasta \ 
    -readLength 36 \ 
    -sequence chr1 

To generate a mask for the entire genome, it is generally preferably to
compute the mask for each chromosome separately and then concatenate the
output files in the correct order.

The following script provides an example of generating the appropiate indexes
and running ComputeGenomeMask in parallel on each chromosome. This is an
example and not a general purpose script; it will likely need to be modified
for your environment. After each chromosome has been run, the resulting files
must be concatenated together in the same order as the reference sequence to
create the final mask.

#!/bin/bash

outdir=example readLength=75 reference=Homo_sapiens_hg19.fasta
export SV_DIR=/humgen/cnp04/bobh/svtoolkit/stable

#These executables must be on your path.
which java > /dev/null || exit 1
which bwa > /dev/null || exit 1
#The directory containing libbwa.so must be on your LD_LIBRARY_PATH
export LD_LIBRARY_PATH=${SV_DIR}/bwa:${LD_LIBRARY_PATH}
classpath="${SV_DIR}/lib/SVToolkit.jar:${SV_DIR}/lib/gatk/GenomeAnalysisTK.jar"

mkdir -p ${outdir}/work

localReference=${outdir}/work/`echo ${reference} | awk -F / '{ print $NF }'`
if [ ! -e ${localReference} ]; then ln ${reference} ${localReference} || exit 1 fi

java -cp ${classpath} -Xmx4g \ 
    org.broadinstitute.sv.apps.IndexFastaFile \ 
    -I ${localReference} \ 
    -O ${localReference}.fai \
    || exit 1

bwa index -a bwtsw ${localReference} || exit 1

chroms=`cat ${localReference}.fai | cut -f 1`
for chr in ${chroms}; do
    bsub -o ${outdir}/work/svmask_${chr}.log \ 
    -R "rusage[mem=5000]" \ 
    java -cp ${classpath} -Xmx4g \ 
        org.broadinstitute.sv.apps.ComputeGenomeMask \ 
        -R ${localReference} \ 
        -O ${outdir}/work/svmask_${chr}.fasta \ 
        -readLength ${readLength} \ 
        -sequence ${chr} \
        || exit 1
done
Post edited by bhandsaker on

Geraldine Van der Auwera, PhD

Sign In or Register to comment.