Sampling and filtering reads

Geraldine_VdAuwera Posts: 7,364, Administrator, GATK Developer
edited October 2012 in Developer Zone

1. Introduction

Reads can be filtered out of traversals either by pileup size, through one of our downsampling methods, or by read property, through our read filtering mechanism.
Both techniques are described below.

2. Downsampling

Normal sequencing and alignment protocols can often yield pileups with vast numbers of reads aligned to a single section of the genome in otherwise well-behaved datasets. Because of the frequency of these 'speed bumps', the GATK now downsamples pileup data unless explicitly overridden.

Defaults

The GATK's default downsampler exhibits the following properties:

  • The downsampler treats data from each sample independently, so that high coverage in one sample won't negatively impact calling in other samples.

  • The downsampler attempts to downsample uniformly across the range spanned by the reads in the pileup.

  • The downsampler's memory consumption is proportional to the sampled coverage depth rather than the full coverage depth.

By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run.

Customizing

From the command line:

  • To disable the downsampler, specify -dt NONE.

  • To change the default per-sample coverage, pass the desired coverage to the -dcov option.
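For example, a run-level override might look like the following (the jar path, tool, and file names are placeholders, not a prescribed invocation):

```shell
# Limit each sample's pileups to 250x for this run:
java -jar GenomeAnalysisTK.jar \
    -T UnifiedGenotyper \
    -R reference.fasta \
    -I sample.bam \
    -o calls.vcf \
    -dcov 250

# Or disable downsampling entirely for the run:
#   java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper ... -dt NONE
```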

To modify the walker's default behavior:

  • Add the @Downsample annotation to the top of your walker. Override the downsampling type by changing by=<value>, and override the downsampling depth by changing toCoverage=<value>.
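To illustrate the annotation pattern, here is a self-contained sketch. The @Downsample annotation and DownsampleType enum below are hypothetical stand-ins declared locally (the real ones live in the GATK engine); a walker class carries the annotation and the engine reads it back via reflection:

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;

// Stand-ins for the GATK's downsampling annotation and type enum,
// declared here only so the pattern is self-contained.
enum DownsampleType { NONE, BY_SAMPLE }

@Retention(RetentionPolicy.RUNTIME)
@interface Downsample {
    DownsampleType by();
    int toCoverage();
}

// A walker overriding the engine defaults: downsample each sample to 250x.
@Downsample(by = DownsampleType.BY_SAMPLE, toCoverage = 250)
class MyLocusWalker { }

public class DownsampleAnnotationDemo {
    public static void main(String[] args) {
        // The engine would read the walker's annotation via reflection.
        Downsample d = MyLocusWalker.class.getAnnotation(Downsample.class);
        System.out.println(d.by() + " toCoverage=" + d.toCoverage());
    }
}
```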

Algorithm details

The downsampler algorithm is designed to maintain uniform coverage while preserving a low memory footprint in regions of especially deep data.
Given an already established pileup at a single-base locus and a pile of incoming reads whose alignment start is that locus + 1, the outline of the
algorithm is as follows:

For each sample:

  • Select reads with the next alignment start.

  • While the number of existing reads + the number of incoming reads is greater than the target sample size:

    Walk backward through each set of reads having the same alignment start. If the count of reads having the same alignment start is > 1, throw out one randomly selected read.

  • If we have n slots available, where n >= 1, randomly select n of the incoming reads and add them to the pileup.

  • Otherwise, we have zero slots available. Choose the read in the existing pileup with the earliest alignment start, throw it out, and add one randomly selected incoming read.
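The per-sample loop above can be sketched in plain Java. This is a minimal illustration of the algorithm's shape, assuming reads can be modeled as bare alignment-start integers; it is not the GATK's actual downsampler:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Single-sample sketch of the downsampling pass described above. */
public class DownsamplerSketch {
    private final int target;                               // per-sample pileup budget
    private final Random rng;
    private final List<Integer> pileup = new ArrayList<>(); // kept sorted by start

    public DownsamplerSketch(int target, long seed) {
        this.target = target;
        this.rng = new Random(seed);
    }

    /** Offer a batch of incoming reads that share one alignment start. */
    public void addBatch(int alignmentStart, int incomingCount) {
        List<Integer> incoming = new ArrayList<>();
        for (int k = 0; k < incomingCount; k++) incoming.add(alignmentStart);

        // While over budget, walk backward through runs of reads sharing an
        // alignment start, evicting one random read from any run larger than one.
        boolean evicted = true;
        while (pileup.size() + incoming.size() > target && evicted) {
            evicted = false;
            int i = pileup.size() - 1;
            while (i >= 0 && pileup.size() + incoming.size() > target) {
                int runEnd = i;
                int start = pileup.get(i);
                while (i >= 0 && pileup.get(i) == start) i--;
                int runSize = runEnd - i;
                if (runSize > 1) {
                    pileup.remove(i + 1 + rng.nextInt(runSize));
                    evicted = true;
                }
            }
        }

        int slots = target - pileup.size();
        if (slots >= 1) {
            // Room left: keep a random subset of the incoming reads.
            Collections.shuffle(incoming, rng);
            pileup.addAll(incoming.subList(0, Math.min(slots, incoming.size())));
        } else if (!incoming.isEmpty()) {
            // Zero slots: evict the read with the earliest alignment start
            // and admit one randomly chosen newcomer in its place.
            pileup.remove(0);
            pileup.add(incoming.get(rng.nextInt(incoming.size())));
        }
        Collections.sort(pileup);
    }

    public int size() { return pileup.size(); }
}
```

Note how memory stays proportional to the sampled depth: the pileup never grows past the target, no matter how many reads are offered.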

3. Read filtering

To selectively filter out reads before they reach your walker, implement one or more net.sf.picard.filter.SamRecordFilter classes and attach them to your walker as follows:

@ReadFilters({Platform454Filter.class, ZeroMappingQualityReadFilter.class})

4. Command-line arguments for read filters

You can add command-line arguments for filters with the @Argument tag, just as with walkers. Here's an example of our new max read length filter:

import net.sf.picard.filter.SamRecordFilter;
import net.sf.samtools.SAMRecord;
import org.broadinstitute.sting.commandline.Argument;

public class MaxReadLengthFilter implements SamRecordFilter {
    @Argument(fullName = "maxReadLength", shortName = "maxRead", doc="Discard reads with length greater than the specified value", required=false)
    private int maxReadLength;

    // Return true to discard the read from the traversal.
    public boolean filterOut(SAMRecord read) { return read.getReadLength() > maxReadLength; }
}

Adding this filter to the top of your walker using the @ReadFilters attribute will add a new command-line argument, maxReadLength, which will filter out reads longer than maxReadLength before your walker is called.

Note that when you specify a read filter, you need to strip the Filter part of its name off! E.g. in the example above, if you want to use MaxReadLengthFilter, you need to call it like this:

--read_filter MaxReadLength

5. Adding filters dynamically using command-line arguments

The --read_filter argument will allow you to apply whatever read filters you'd like to your dataset before the reads reach your walker. To add the MaxReadLength filter above to PrintReads, you'd add these command-line parameters:

--read_filter MaxReadLength --maxReadLength 76

You can add as many filters as you like by using multiple copies of the --read_filter parameter:

--read_filter MaxReadLength --maxReadLength 76 --read_filter ZeroMappingQualityRead
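Conceptually, stacked --read_filter arguments compose so that a read is discarded if any filter in the chain rejects it. A minimal sketch of that composition, using hypothetical stand-in predicates rather than real GATK filter classes:

```java
import java.util.List;
import java.util.function.Predicate;

public class FilterChainSketch {
    // Minimal stand-in for a read; not a GATK/Picard class.
    record ReadStub(int length, int mappingQuality) {}

    // Stand-ins for MaxReadLength (with --maxReadLength 76) and
    // ZeroMappingQualityRead; each returns true to discard the read.
    static final List<Predicate<ReadStub>> CHAIN = List.of(
            r -> r.length() > 76,
            r -> r.mappingQuality() == 0);

    // A read is filtered out if any filter in the chain rejects it.
    static boolean filteredOut(ReadStub read) {
        return CHAIN.stream().anyMatch(f -> f.test(read));
    }

    public static void main(String[] args) {
        System.out.println(filteredOut(new ReadStub(76, 30)));   // within both limits: kept
        System.out.println(filteredOut(new ReadStub(100, 30)));  // too long: dropped
        System.out.println(filteredOut(new ReadStub(50, 0)));    // mapping quality 0: dropped
    }
}
```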

Geraldine Van der Auwera, PhD

Comments

  • annat Posts: 21, Member

    Hi,
    We are having problems running gatk commands with read filters:

    $ java -Xmx50g -jar /software/additional/GenomeAnalysisTK-2.0-39-gd091f72/GenomeAnalysisTK.jar -T RealignerTargetCreator -I /software/additional/GenomeAnalysisTK-2.0-39-gd091f72/resources/exampleBAM.bam -R /software/additional/GenomeAnalysisTK-2.0-39-gd091f72/resources/exampleFASTA.fasta -o testGATK.intervals --read_filter UnmappedReadFilter

    ERROR ------------------------------------------------------------------------------------------
    ERROR A USER ERROR has occurred (version 2.0-39-gd091f72):
    ERROR The invalid arguments or inputs must be corrected before the GATK can proceed
    ERROR Please do not post this error to the GATK forum
    ERROR
    ERROR See the documentation (rerun with -h) for this tool to view allowable command-line arguments.
    ERROR Visit our website and forum for extensive documentation and answers to
    ERROR commonly asked questions http://www.broadinstitute.org/gatk
    ERROR
    ERROR MESSAGE: Could not find filter with name: UnmappedReadFilter
    ERROR ------------------------------------------------------------------------------------------

    Are there other programs etc that we need in our path?

  • Geraldine_VdAuwera Posts: 7,364, Administrator, GATK Developer

    Hi Anna, welcome to the forum!

    When using read filters, you need to strip the Filter part of the name when passing it as an argument. So UnmappedReadFilter is passed as:

      --read_filter UnmappedRead
    

    Sorry if that is not clear in the documentation, we will add a note to clarify this usage.

    Geraldine Van der Auwera, PhD

  • jgould Posts: 9, Member

    It seems as though filters depend on the type of walker used. For example if I pass the arguments " --read_filter MappingQuality --min_mapping_quality_score 60" to a walker that extends ReadWalker I get the info message "106568 reads (5.35% of total) failing MappingQualityFilter", but if I pass the same arguments to a walker that extends LocusWalker, no reads are filtered.

  • Geraldine_VdAuwera Posts: 7,364, Administrator, GATK Developer

    That shouldn't happen. Could you please upload a small test file to our FTP so we can try to reproduce that locally? Instructions here if needed: http://www.broadinstitute.org/gatk/guide/article?id=1894

    Geraldine Van der Auwera, PhD

  • jgould Posts: 9, Member

    I uploaded walker-filters.zip. It contains two walkers, one in which the MappingQualityFilter is applied and one in which it is not. Thanks.

  • Geraldine_VdAuwera Posts: 7,364, Administrator, GATK Developer

    Thanks for the test files, I'll have a look at them today and let you know what we find in this thread.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera Posts: 7,364, Administrator, GATK Developer

    @jgould, I just realized you sent us the walkers themselves -- what we need is the data you were running the walkers on. Can you please upload a bam snippet that reproduces the error?

    Geraldine Van der Auwera, PhD

  • jgould Posts: 9, Member

    I uploaded the data to walker-filter-data.zip. Thanks.

  • Geraldine_VdAuwera Posts: 7,364, Administrator, GATK Developer

    I don't see your file in our incoming directory, can you please confirm that your upload was successful?

    Geraldine Van der Auwera, PhD

  • jgould Posts: 9, Member

    I uploaded the file again; sorry, I don't know what happened to the previous upload. Thanks.

  • Geraldine_VdAuwera Posts: 7,364, Administrator, GATK Developer

    Thanks @jgould, I was able to reproduce the error; not sure yet if the read filters are not getting applied or if the counts are not getting updated. We'll look into it and let you know what we find in this thread.

    Geraldine Van der Auwera, PhD

  • Geraldine_VdAuwera Posts: 7,364, Administrator, GATK Developer

    OK, we found the problem -- it was indeed an issue with incrementing the counts, which wasn't being done properly. The filters were actually applied, just not reported correctly by the LocusWalker. We have a proposed fix for this which is now in code review, and should get into the nightly build within a day or two.

    Thanks for reporting this issue and contributing your test files!

    Geraldine Van der Auwera, PhD

  • fjrossello Posts: 12, Member

    Hi Geraldine,

    In regards to this issue, and as @jgould reported, I noticed that even though duplicates were accurately marked and properly identified when using PrintReads with the DuplicateReads filter, both UC and HC LocusWalkers failed to report the right number of reads filtered.
    I am using GATK version 2.6-5-gba531bd.
    Could you please let me know if this issue has been solved? Should I update to the nightly build?

    Thanks in advance.

    Cheers,

    Fernando

  • Geraldine_VdAuwera Posts: 7,364, Administrator, GATK Developer

    Hi Fernando,

    This is indeed fixed in the nightly builds, and the fix will be in release 2.7 (which is imminent).

    Geraldine Van der Auwera, PhD

  • fjrossello Posts: 12, Member

    Hi Geraldine,

    Perfect, just finished testing the latest nightly build (August 19th) and this issue has been totally fixed.
    Thanks for your support, it's great.

    Cheers,

    Fernando
