Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)
edited December 2015 in Dictionary

Downsampling is a process by which read depth is reduced, either at a particular position or within a region.

Normal sequencing and alignment protocols can often yield pileups with vast numbers of reads aligned to a single section of the genome in otherwise well-behaved datasets. Because of the frequency of these 'speed bumps', the GATK now downsamples pileup data unless explicitly overridden.

Note that there is also a proportional "downsample to fraction" mechanism that is mostly intended for testing the effect of different overall coverage means on analysis results.

See below for details of how this is implemented and controlled in GATK.

1. Downsampling to a target coverage

The principle of this downsampling type is to cap read depth at a given coverage threshold. Its purpose is to get rid of excessive coverage, because above a certain depth, additional data is not informative and imposes unreasonable computational costs.

The downsampling process takes two different forms depending on the type of analysis it is used with. For locus-based traversals (LocusWalkers like UnifiedGenotyper and ActiveRegionWalkers like HaplotypeCaller), downsample_to_coverage (-dcov) controls the maximum depth of coverage at each locus. For read-based traversals (ReadWalkers like BaseRecalibrator), it controls the maximum number of reads sharing the same alignment start position. For ReadWalkers you will typically need to use much lower -dcov values than you would with LocusWalkers to see an effect.

Note that this downsampling option does not produce an unbiased random sampling from all available reads at each locus: instead, the primary goal of the to-coverage downsampler is to maintain an even representation of reads from all alignment start positions when removing excess coverage. For a truly unbiased random sampling of reads, use -dfrac instead. Also note that the coverage target is an approximate goal that is not guaranteed to be met exactly: under some circumstances the downsampling algorithm will retain slightly more or less coverage than requested.
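
To make the read-based form concrete, here is a minimal Python sketch of capping the number of reads that share an alignment start. It is an illustration only, not GATK code: the function name cap_reads_per_start and the (read_name, alignment_start) tuple layout are assumptions made for this example.

```python
import random
from collections import defaultdict


def cap_reads_per_start(reads, max_per_start, seed=0):
    """Keep at most max_per_start reads for each alignment start position.

    reads: iterable of (read_name, alignment_start) tuples (hypothetical layout).
    Returns the retained reads, ordered by alignment start.
    """
    rng = random.Random(seed)

    # Group the reads by their alignment start position.
    by_start = defaultdict(list)
    for name, start in reads:
        by_start[start].append((name, start))

    kept = []
    for start in sorted(by_start):
        group = by_start[start]
        # Only groups above the cap are thinned; smaller groups pass through intact.
        if len(group) > max_per_start:
            group = rng.sample(group, max_per_start)
        kept.extend(group)
    return kept


# 50 reads start at position 100, 3 reads start at position 101.
reads = [(f"a{i}", 100) for i in range(50)] + [(f"b{i}", 101) for i in range(3)]
kept = cap_reads_per_start(reads, max_per_start=5)
print(len(kept))
```

Because the cap applies per start position rather than per locus, a small value can still leave deep coverage: with 100-base reads and a cap of 10 reads per start, up to about 1000 reads could overlap a single locus. This is one way to see why ReadWalkers need much lower values than LocusWalkers.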


The GATK's default downsampler (invoked by -dcov) exhibits the following properties:

  • The downsampler treats data from each sample independently, so that high coverage in one sample won't negatively impact calling in other samples.
  • The downsampler attempts to downsample uniformly across the range spanned by the reads in the pileup.
  • The downsampler's memory consumption is proportional to the sampled coverage depth rather than the full coverage depth.
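
The first and third properties can be illustrated with reservoir sampling, a standard technique that keeps a fixed-size random subset of a stream without storing the whole stream. The Python sketch below is an assumption-laden illustration, not the GATK implementation: the function name downsample_per_sample and the (sample, read_name) tuple layout are made up for this example, and it does not attempt the positional uniformity described in the second property.

```python
import random


def downsample_per_sample(reads, target, seed=0):
    """Keep at most `target` reads per sample using one reservoir per sample.

    reads: iterable of (sample, read_name) tuples (hypothetical layout).
    Memory use is bounded by target * number_of_samples, not by the input depth.
    """
    rng = random.Random(seed)
    kept = {}  # sample -> retained reads (never longer than target)
    seen = {}  # sample -> number of reads seen so far for that sample

    for sample, name in reads:
        n = seen.get(sample, 0)
        bucket = kept.setdefault(sample, [])
        if n < target:
            # The reservoir is not full yet: keep the read.
            bucket.append(name)
        else:
            # Replace a random slot with probability target / (n + 1).
            j = rng.randint(0, n)  # inclusive on both ends
            if j < target:
                bucket[j] = name
        seen[sample] = n + 1
    return kept


# Sample A is very deep, sample B is shallow.
reads = [("A", f"a{i}") for i in range(5000)] + [("B", f"b{i}") for i in range(20)]
kept = downsample_per_sample(reads, target=100)
print({sample: len(names) for sample, names in kept.items()})
```

Each sample has its own reservoir, so the 5000 reads in sample A do not reduce what is retained for sample B.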

By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run.


From the command line:

  • To disable the downsampler, specify -dt NONE.
  • To change the default coverage per-sample, specify the desired coverage to the -dcov option.

To modify the walker's default behavior:

  • Add the @Downsample annotation to the top of your walker. Override the downsampling type by changing by=<value>. Override the downsampling depth by changing toCoverage=<value>.

Algorithm details

The downsampler algorithm is designed to maintain uniform coverage while preserving a low memory footprint in regions of especially deep data. Given an already established pileup, a single-base locus, and a pile of reads with an alignment start of single-base locus + 1, the outline of the algorithm is as follows:

For each sample:

  • Select reads with the next alignment start.
  • While the number of existing reads + the number of incoming reads is greater than the target sample size:
      Walk backward through each set of reads having the same alignment start. If the count of reads having the same alignment start is > 1, throw out one randomly selected read.

  • If we have n slots available where n is >= 1, randomly select n of the incoming reads and add them to the pileup.
  • Otherwise, we have zero slots available. Choose the read from the existing pileup with the least alignment start. Throw it out and add one randomly selected read from the new pileup.
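
The outline above can be sketched in Python for a single sample as follows. This is one reading of the outline, not the GATK source: the function names, the dict-based pileup, and the rule of stopping the thinning loop once no alignment start has more than one read are all assumptions made for this example. It also assumes a target of at least 1 and alignment starts that arrive in increasing order.

```python
import random


def total_reads(pileup):
    """Number of reads currently held in the pileup."""
    return sum(len(group) for group in pileup.values())


def add_next_start(pileup, start, incoming, target, rng):
    """Merge the reads at the next alignment start into the pileup (in place).

    pileup: dict mapping alignment start -> list of retained reads.
    """
    if not incoming:
        return

    # Thin the existing pileup while existing + incoming exceeds the target.
    while total_reads(pileup) + len(incoming) > target:
        removed_any = False
        # Walk backward through the sets of reads sharing an alignment start.
        for s in sorted(pileup, reverse=True):
            if total_reads(pileup) + len(incoming) <= target:
                break
            group = pileup[s]
            if len(group) > 1:
                group.pop(rng.randrange(len(group)))
                removed_any = True
        if not removed_any:
            break  # every start is down to one read; nothing left to thin

    slots = target - total_reads(pileup)
    if slots >= 1:
        # Fill the free slots with randomly selected incoming reads.
        pileup[start] = rng.sample(incoming, min(slots, len(incoming)))
    else:
        # No free slots: evict a read at the least alignment start,
        # then admit one randomly selected incoming read.
        oldest = min(pileup)
        victims = pileup[oldest]
        victims.pop(rng.randrange(len(victims)))
        if not victims:
            del pileup[oldest]
        pileup[start] = [rng.choice(incoming)]


def downsample_stream(groups, target, seed=0):
    """groups: iterable of (alignment_start, reads) in increasing start order."""
    rng = random.Random(seed)
    pileup = {}
    for start, incoming in groups:
        add_next_start(pileup, start, list(incoming), target, rng)
    return pileup


# 50 consecutive start positions with 30 reads each, target of 20 reads.
groups = [(s, [f"r{s}_{i}" for i in range(30)]) for s in range(1, 51)]
pileup = downsample_stream(groups, target=20)
print(total_reads(pileup), min(pileup), max(pileup))
```

In this sketch the pileup never holds more than the target number of reads, which matches the low-memory goal stated above, and older start positions are thinned to make room for newer ones rather than being sampled uniformly at random.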

2. Downsampling to a fraction of the coverage

Reads will be downsampled so the specified fraction remains; e.g. if you specify -dfrac 0.25, three-quarters of the reads will be removed, and the remaining one quarter will be used in the analysis. This method of downsampling is truly unbiased and random. It is typically used to simulate the effect of generating different amounts of sequence data for a given sample. For example, you can use this in a pilot experiment to evaluate how much target coverage you need to aim for in order to obtain enough coverage in all loci of interest.
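
As an illustration of the idea, one simple way to get an unbiased fractional sample is to keep each read independently with probability equal to the requested fraction. The Python sketch below assumes that approach; the function name downsample_to_fraction is hypothetical and the code is not taken from GATK.

```python
import random


def downsample_to_fraction(reads, fraction, seed=0):
    """Keep each read independently with probability `fraction`."""
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)
    # rng.random() is uniform on [0, 1), so fraction=1.0 keeps every read
    # and fraction=0.0 keeps none.
    return [read for read in reads if rng.random() < fraction]


# Simulate a pilot experiment: how many reads remain at several fractions?
reads = [f"read{i}" for i in range(100000)]
for fraction in (0.25, 0.5, 0.75):
    kept = downsample_to_fraction(reads, fraction)
    print(fraction, len(kept))
```

The number of retained reads is close to, but not exactly, fraction × total, because each read is decided independently. The expected mean coverage scales the same way, so a sample sequenced at 40x and downsampled with a fraction of 0.25 would be expected to average about 10x.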



  • jgould, Gould (Member)

    It seems as though filters depend on the type of walker used. For example, if I pass the arguments "--read_filter MappingQuality --min_mapping_quality_score 60" to a walker that extends ReadWalker, I get the info message "106568 reads (5.35% of total) failing MappingQualityFilter", but if I pass the same arguments to a walker that extends LocusWalker, no reads are filtered.

  • mlaylwar, Canada (Member)

    For testing purposes I am trying to downsample to only one read per sample per site. Currently, when I use the downsampler, I receive an error message saying that the minimum coverage to downsample to is 200. Do you know if there is any way to get around this error?

  • Sheila, Broad Institute (Member, Broadie, Moderator)


    Are you using HaplotypeCaller? We do not recommend changing the downsampling settings in the tool.


  • prasundutta87, Edinburgh (Member)


    I was just comparing some variants using the bcftools mpileup command, and I used --max-depth 1000000 (max per-file depth; avoids excessive memory usage) because I am dealing with RNAseq data and the coverage is not uniform after alignment to a reference genome. In GATK's HaplotypeCaller, the downsampling is set to 500 by default. Is this parameter the same as --max-depth in mpileup?

  • Sheila, Broad Institute (Member, Broadie, Moderator)


    Yes, that is the same. However, note there are some quirks to the way HaplotypeCaller downsamples, so results may not be exactly the same when considering depth. We have made the downsampling more straightforward in GATK4.


  • prasundutta87, Edinburgh (Member)
    edited June 2017

    Thanks for the reply, Sheila. So if I want to compare variant calling outputs between mpileup and GATK, what would be the best value of the downsampling parameter to match --max-depth in mpileup?

    I also came across --maxReadsInRegionPerSample (default: 10000). Should I change this parameter or -dcov to mimic --max-depth in mpileup?

  • Sheila, Broad Institute (Member, Broadie, Moderator)


    I think the best thing to do is change the mpileup downsampling, as we do not recommend fiddling with HaplotypeCaller's downsampling arguments.

    Perhaps try setting --max-depth 500 in mpileup.

