# Genome Mask Files

edited September 2012

### 1. Introduction

Genome STRiP makes use of mask files that identify portions of the reference
sequence that are not reliably alignable.

Genome mask files are FASTA files that contain the same number of sequences as
the reference, each with the same length as the corresponding reference
sequence. In a genome mask file, a base position
is marked with a 0 if it is reliably alignable and 1 if it is not. Each genome
mask file is specific to the reference sequence and to the parameters used to
determine alignability.
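
As a rough illustration of the format described above, the following Python sketch checks a mask against its reference and reports the fraction of bases marked 1. It is not part of Genome STRiP. The function names `read_fasta` and `check_mask` are hypothetical, and the requirement that sequence names match in the same order is an assumption that goes beyond what is stated here.

```python
from typing import Dict, Iterable, List


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA text lines into {sequence name: sequence}, keeping file order."""
    records: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the sequence name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def check_mask(reference: Dict[str, str], mask: Dict[str, str]) -> float:
    """Validate a mask against its reference and return the fraction of bases marked 1."""
    # Assumption: the mask uses the same sequence names, in the same order.
    if list(reference) != list(mask):
        raise ValueError("mask and reference sequence names differ")
    masked = total = 0
    for name, ref_seq in reference.items():
        mask_seq = mask[name]
        if len(mask_seq) != len(ref_seq):
            raise ValueError(f"length mismatch for sequence {name}")
        if set(mask_seq) - {"0", "1"}:
            raise ValueError(f"unexpected mask character in sequence {name}")
        masked += mask_seq.count("1")
        total += len(mask_seq)
    return masked / total if total else 0.0


if __name__ == "__main__":
    reference = read_fasta([">chr1", "ACGTAC", "GT", ">chr2", "NNAC"])
    mask = read_fasta([">chr1", "00110000", ">chr2", "1100"])
    print(f"fraction not reliably alignable: {check_mask(reference, mask):.3f}")
```

For real genomes you would pass open file handles to `read_fasta`. The sketch holds whole sequences in memory, which is acceptable for a sanity check but not tuned for large references.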

The current generation of mask files is based on fixed read lengths. A base
is assigned a 0 if the N-base sequence centered on that base is unique within
the reference genome. You should use a genome mask with a value of N that
corresponds to the read lengths of your input data set. For example, if you
have data that is a uniform set of Illumina paired-end data with 101bp reads,
then you should use (or generate) a genome mask with a read length of 101. If
your data is a mixture of read lengths, one viable strategy is to use a
"lowest common denominator" approach and use a mask length corresponding to
the shortest reads in your input data set. Using the smallest read length will
cause a small additional fraction of the genome to be marked inaccessible, but
will give the best specificity. Alternatively, you can use a larger N, which
should modestly improve sensitivity at the cost of a modest increase in false
discovery rate and a modest decrease in genotyping accuracy.
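
To make the uniqueness rule concrete, here is a toy Python sketch that marks a base 0 when the N-base window centered on it occurs exactly once in a short sequence. It is not the ComputeGenomeMask algorithm. The function name `toy_mask` is hypothetical, and the sketch makes several simplifying assumptions: exact matches only, a single strand (no reverse complement), odd N, and bases whose window runs off either end are marked 1.

```python
from collections import Counter


def toy_mask(seq: str, n: int) -> str:
    """Return a 0/1 string: 0 where the n-base window centered on a base is unique in seq."""
    if n < 1 or n % 2 == 0:
        raise ValueError("this sketch assumes an odd, positive window length")
    half = n // 2
    # Count every n-base window in the sequence once, up front.
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    mask = []
    for pos in range(len(seq)):
        start = pos - half
        if start < 0 or start + n > len(seq):
            # Assumption: a window that runs off the sequence is treated as not alignable.
            mask.append("1")
            continue
        window = seq[start:start + n]
        mask.append("0" if counts[window] == 1 else "1")
    return "".join(mask)


if __name__ == "__main__":
    sequence = "ACGTACGTTT"
    for window_length in (3, 5):
        print(window_length, toy_mask(sequence, window_length))
```

On this toy sequence the shorter window marks more interior bases as 1 than the longer one, which mirrors the trade-off described above: a mask built for shorter reads excludes somewhat more of the genome.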

### 2. Resources

Some precomputed mask files for a variety of reference sequences and read

### 3. Generating your own genome mask

The ComputeGenomeMask command line utility is available
to generate genome mask files, but queue scripts to automate the process have
not been written. A reasonable strategy is to compute the genome mask in
parallel chromosome-by-chromosome and then merge the resulting fasta files into
a final genome-wide mask file.
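
The merge step can be sketched in Python as a plain concatenation of the per-chromosome FASTA files. This is only one possible approach, with assumptions: the function name `merge_masks` and the file names in the usage note are hypothetical, each input is assumed to hold complete FASTA records, and the caller is assumed to list the files in the same order as the sequences in the reference.

```python
from pathlib import Path
from typing import Sequence


def merge_masks(per_chrom_files: Sequence[Path], merged: Path) -> None:
    """Concatenate per-chromosome mask FASTA files, in the order given, into one file."""
    with open(merged, "wb") as out:
        for path in per_chrom_files:
            last = b"\n"
            with open(path, "rb") as src:
                # Copy in 1 MiB chunks so large masks are never fully loaded into memory.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    last = chunk[-1:]
            if last != b"\n":
                # Keep the next file's header on its own line.
                out.write(b"\n")
```

A call might look like `merge_masks([Path("mask.chr1.fasta"), Path("mask.chr2.fasta")], Path("mask.genome.fasta"))`, where the paths are placeholders for your own per-chromosome outputs.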

### 4. Planned Enhancements

The implementation of mask files will be replaced in a future release.

Mask files are being converted from textual fasta files to binary files and
are being enhanced to better support input data sets with multiple read
lengths (so the use of a "lowest common denominator" strategy will no longer
be necessary).

Post edited by Geraldine_VdAuwera on

• HomeMember

Hello - is this still an area of active development, and have there been improvements since this post? Thanks.