Downsampling

Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie
edited December 2015 in Dictionary

Downsampling is a process by which read depth is reduced, either at a particular position or within a region.

Normal sequencing and alignment protocols can often yield pileups with vast numbers of reads aligned to a single section of the genome in otherwise well-behaved datasets. Because of the frequency of these 'speed bumps', the GATK now downsamples pileup data unless explicitly overridden.

Note that there is also a proportional "downsample to fraction" mechanism that is mostly intended for testing the effect of different overall coverage means on analysis results.

See below for details of how this is implemented and controlled in GATK.


1. Downsampling to a target coverage

The principle of this downsampling type is to downsample reads to a given capping threshold coverage. Its purpose is to get rid of excessive coverage, because above a certain depth, having additional data is not informative and imposes unreasonable computational costs.

The downsampling process takes two different forms depending on the type of analysis it is used with. For locus-based traversals (LocusWalkers like UnifiedGenotyper and ActiveRegionWalkers like HaplotypeCaller), downsample_to_coverage controls the maximum depth of coverage at each locus. For read-based traversals (ReadWalkers like BaseRecalibrator), it controls the maximum number of reads sharing the same alignment start position. For ReadWalkers you will typically need to use much lower dcov values than you would with LocusWalkers to see an effect.

Note that this downsampling option does not produce an unbiased random sampling from all available reads at each locus: the primary goal of the to-coverage downsampler is to maintain an even representation of reads from all alignment start positions when removing excess coverage. For a truly unbiased random sampling of reads, use -dfrac instead. Also note that the coverage target is an approximate goal that is not guaranteed to be met exactly: under some circumstances the downsampling algorithm will retain slightly more or less coverage than requested.
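
To make the distinction concrete, here is a sketch of two GATK 3 style invocations; the file names and the dcov values are placeholders, and the right values for your data will vary:

    # Locus-based traversal: -dcov caps per-sample depth at each locus.
    java -jar GenomeAnalysisTK.jar \
        -T UnifiedGenotyper \
        -R reference.fasta \
        -I sample.bam \
        -dcov 500 \
        -o calls.vcf

    # Read-based traversal: -dcov caps reads sharing an alignment start,
    # so a much lower value is typically needed to see an effect.
    java -jar GenomeAnalysisTK.jar \
        -T BaseRecalibrator \
        -R reference.fasta \
        -I sample.bam \
        -knownSites known_variants.vcf \
        -dcov 50 \
        -o recal_data.table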

Defaults

The GATK's default downsampler (invoked by -dcov) exhibits the following properties:

  • The downsampler treats data from each sample independently, so that high coverage in one sample won't negatively impact calling in other samples.
  • The downsampler attempts to downsample uniformly across the range spanned by the reads in the pileup.
  • The downsampler's memory consumption is proportional to the sampled coverage depth rather than the full coverage depth.

By default, the downsampler is limited to 1000 reads per sample. This value can be adjusted either per-walker or per-run.

Customizing

From the command line:

  • To disable the downsampler, specify -dt NONE.
  • To change the default per-sample coverage, pass the desired value to the -dcov option (see the examples below).
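
For example (GATK 3 style command lines; file names and values are placeholders):

    # Disable downsampling entirely for this run.
    java -jar GenomeAnalysisTK.jar \
        -T UnifiedGenotyper \
        -R reference.fasta \
        -I sample.bam \
        -dt NONE \
        -o calls.vcf

    # Raise the per-sample coverage cap above the default.
    java -jar GenomeAnalysisTK.jar \
        -T UnifiedGenotyper \
        -R reference.fasta \
        -I sample.bam \
        -dcov 2000 \
        -o calls.vcf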

To modify the walker's default behavior:

  • Add the @Downsample annotation above your walker class. Override the downsampling type by setting by=<value>, and override the downsampling depth by setting toCoverage=<value>, as sketched below.
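
A minimal sketch of what that looks like, assuming the GATK 3.x package layout (import paths differed across releases, and MyExampleWalker is a hypothetical walker):

    import org.broadinstitute.gatk.engine.walkers.Downsample;
    import org.broadinstitute.gatk.engine.walkers.LocusWalker;
    import org.broadinstitute.gatk.utils.downsampling.DownsampleType;

    // Ask the engine to downsample each sample to at most 250x per locus.
    @Downsample(by = DownsampleType.BY_SAMPLE, toCoverage = 250)
    public abstract class MyExampleWalker extends LocusWalker<Integer, Long> {
        // map()/reduceInit()/reduce() bodies omitted; the annotation,
        // not the traversal logic, is the point of this sketch.
    }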

Algorithm details

The downsampler algorithm is designed to maintain uniform coverage while preserving a low memory footprint in regions of especially deep data. Given an already established pileup at a single-base locus, and an incoming batch of reads whose alignment start is the next position (locus + 1), the outline of the algorithm is as follows (a code sketch follows the list):

For each sample:

  • Select the incoming reads with the next alignment start.
  • While the number of existing reads plus the number of incoming reads is greater than the target sample size, walk backward through each set of kept reads sharing the same alignment start; whenever a set contains more than one read, throw out one randomly selected read.
  • If there are n slots available, where n >= 1, randomly select n of the incoming reads and add them to the pileup.
  • Otherwise, with zero slots available, choose the read from the existing pileup with the least alignment start, throw it out, and add one randomly selected read from the incoming batch.
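
The following is a minimal, illustrative Java sketch of that outline for a single sample. It is not the GATK implementation: the Read record, the class name, and the data structures are stand-ins chosen for readability.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;
    import java.util.TreeMap;

    /** Illustrative per-sample "downsample to coverage" sketch; not GATK code. */
    class ToCoverageDownsampler {
        /** Stand-in read record; real reads carry far more state. */
        record Read(int alignmentStart, int id) {}

        private final int targetSampleSize;  // e.g. the per-sample dcov value
        private final Random rng = new Random();
        // Kept reads for ONE sample, grouped by alignment start (sorted).
        private final TreeMap<Integer, List<Read>> pileup = new TreeMap<>();
        private int kept = 0;

        ToCoverageDownsampler(int targetSampleSize) {
            this.targetSampleSize = targetSampleSize;
        }

        /** Offer the batch of incoming reads sharing the next alignment start. */
        void addReadsAtNextStart(List<Read> incoming) {
            List<Read> pending = new ArrayList<>(incoming);

            // While over target, walk backward through the alignment-start
            // groups and evict one random read from any group holding > 1.
            while (kept + pending.size() > targetSampleSize && thinOneRead()) {
                // loop body intentionally empty: thinOneRead() does the work
            }

            int slots = targetSampleSize - kept;
            if (slots >= 1) {
                // Randomly select up to `slots` incoming reads to keep.
                Collections.shuffle(pending, rng);
                for (Read r : pending.subList(0, Math.min(slots, pending.size()))) {
                    keep(r);
                }
            } else if (!pending.isEmpty()) {
                // Zero slots: evict a read with the least alignment start,
                // then admit one randomly chosen incoming read.
                evictFromEarliestStart();
                keep(pending.get(rng.nextInt(pending.size())));
            }
        }

        /** Evict one random read from the latest group holding > 1 read. */
        private boolean thinOneRead() {
            for (Integer start : pileup.descendingKeySet()) {
                List<Read> group = pileup.get(start);
                if (group.size() > 1) {
                    group.remove(rng.nextInt(group.size()));
                    kept--;
                    return true;
                }
            }
            return false; // every alignment start is down to a single read
        }

        private void evictFromEarliestStart() {
            Map.Entry<Integer, List<Read>> first = pileup.firstEntry();
            first.getValue().remove(0);
            if (first.getValue().isEmpty()) pileup.remove(first.getKey());
            kept--;
        }

        private void keep(Read r) {
            pileup.computeIfAbsent(r.alignmentStart(), k -> new ArrayList<>()).add(r);
            kept++;
        }
    }

A driver would instantiate one such object per sample and feed it each batch of reads as they arrive, which mirrors the per-sample independence described under Defaults.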

2. Downsampling to a fraction of the coverage

Reads will be downsampled so the specified fraction remains; e.g. if you specify -dfrac 0.25, three-quarters of the reads will be removed, and the remaining one quarter will be used in the analysis. This method of downsampling is truly unbiased and random. It is typically used to simulate the effect of generating different amounts of sequence data for a given sample. For example, you can use this in a pilot experiment to evaluate how much target coverage you need to aim for in order to obtain enough coverage in all loci of interest.
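
For example, a GATK 3 style command line (file names are placeholders) that writes a new BAM keeping a random ~25% of the reads:

    java -jar GenomeAnalysisTK.jar \
        -T PrintReads \
        -R reference.fasta \
        -I sample.bam \
        -dfrac 0.25 \
        -o downsampled_quarter.bam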


Comments

  • jgould · Member

    It seems as though filters depend on the type of walker used. For example, if I pass the arguments "--read_filter MappingQuality --min_mapping_quality_score 60" to a walker that extends ReadWalker, I get the info message "106568 reads (5.35% of total) failing MappingQualityFilter", but if I pass the same arguments to a walker that extends LocusWalker, no reads are filtered.

  • mlaylwar · Canada · Member

    For testing purposes I am trying to downsample to only one read per sample per site. Currently, when I use the downsampler, I receive an error message saying that the minimum coverage to downsample to is 200. Do you know if there is any way to overcome this error?

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @mlaylwar
    Hi,

    Are you using HaplotypeCaller? We do not recommend changing the downsampling settings in the tool.

    -Sheila

  • prasundutta87 · Edinburgh · Member

    Hi,

    I was just comparing some variants using the bcftools mpileup command, and I used --max-depth 1000000 (max per-file depth; avoids excessive memory usage) because I am dealing with RNA-seq data and the coverage is not the same across the genome after alignment to a reference. In GATK's HaplotypeCaller, the downsampling is set to 500 by default. Is this parameter the same as --max-depth in mpileup?

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @prasundutta87
    Hi,

    Yes, that is the same. However, note there are some quirks to the way HaplotypeCaller downsamples, so results may not be exactly the same when considering depth. We have made the downsampling more straightforward in GATK4.

    -Sheila

  • prasundutta87 · Edinburgh · Member
    edited June 2017

    Thanks for the reply, Sheila. So if I want to compare variant-calling outputs between mpileup and GATK, what would be the best value of the downsampling parameter to match --max-depth in mpileup?

    I also came across --maxReadsInRegionPerSample (default: 10000). Should I change this parameter or -dcov to mimic --max-depth in mpileup?

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @prasundutta87
    Hi,

    I think the best thing to do is change the mpileup downsampling, as we do not recommend fiddling with HaplotypeCaller's downsampling arguments.

    Perhaps try setting --max-depth 500 in mpileup.

    -Sheila

  • biojiangke

    Hi,

    I'm working on some amplicon data. Naturally it comes with very high coverage (up to ~8000X). I'm trying to run the analyses with and without downsampling and found some interesting discrepancies. The workflow is: Picard add read groups, GATK HaplotypeCaller, GATK CombineGVCFs, and GATK GenotypeGVCFs. For the HaplotypeCaller step, I tried the default and the option "--max-reads-per-alignment-start 0" to disable downsampling. In addition, I wrote a custom script to directly tally reference and alternative allele bases at the queried sites, as an additional comparison. Here is what I found: for some samples, HaplotypeCaller without downsampling gave incorrect ref/alt depths and, in consequence, incorrect genotype calls later. See samples 15, 17, 18, 19, 21, 22, 23, 26, 27, and 28 in the following table for one SNP site:

    sample ref_no_downsampling alt_no_downsampling GT_no_downsampling ref_default_downsampling alt_default_downsampling GT_default_downsampling ref_sam alt_sam
    sample1 7800 0 0/0 97 0 0/0 7657 9
    sample2 3962 0 0/0 97 0 0/0 3910 4
    sample3 5641 0 0/0 106 0 0/0 5546 5
    sample4 4498 0 0/0 101 0 0/0 4403 5
    sample5 4684 0 0/0 77 0 0/0 4608 3
    sample6 2024 0 0/0 61 0 0/0 1989 3
    sample7 7581 0 0/0 104 0 0/0 7453 11
    sample8 7376 0 0/0 109 0 0/0 7250 5
    sample9 5464 0 0/0 87 0 0/0 5399 4
    sample10 18 0 0/0 18 0 0/0 18 0
    sample11 3487 0 0/0 82 0 0/0 3415 1
    sample12 1370 0 0/0 59 0 0/0 1336 2
    sample13 2995 2702 0/1 49 40 0/1 3002 2704
    sample14 84 88 0/1 24 27 0/1 84 88
    sample15 3754 0 0/0 28 38 0/1 1970 1721
    sample16 21 3277 1/1 4 73 1/1 15 3279
    sample17 5496 0 0/0 49 53 0/1 2927 2568
    sample18 4088 0 ./. 0 86 1/1 3 4028
    sample19 3444 0 0/0 42 36 0/1 1814 1626
    sample20 6 3405 1/1 0 68 1/1 8 3356
    sample21 3621 0 0/0 40 38 0/1 1872 1747
    sample22 3428 0 ./. 2 79 1/1 3 3422
    sample23 3038 0 0/0 32 39 0/1 1522 1463
    sample24 7 3758 1/1 0 77 1/1 8 3760
    sample25 12 4927 1/1 1 97 1/1 12 4927
    sample26 3726 0 ./. 0 83 1/1 19 3706
    sample27 4839 0 ./. 0 83 1/1 14 4823
    sample28 4106 0 0/0 51 39 0/1 2174 1864
    sample29 3455 0 0/0 72 0 0/0 3396 4
    sample30 8360 0 0/0 99 0 0/0 7988 13
    sample31 4182 0 0/0 109 0 0/0 4082 7
    sample32 2880 0 0/0 76 0 0/0 2826 3

    Without downsampling, GATK seems to give null or completely opposite read depths and genotype calls. The default downsampled GATK workflow, however, seems to give the right genotype calls in the end, but it drastically underuses the coverage generated from the amplicons. I wonder what is behind this and what would be the right way to deal with amplicon data like this.

    Thanks!

    Ke

  • biojiangke

    I did a little experimentation with different "--max-reads-per-alignment-start" values. For my specific case, only the default (50) gave 100% correct allele counts and genotype calls (based on pysam allele depths). Increasing this value to 100, 200, 500, or 1000 generated several inconsistent allele counts and genotype calls. In particular, when the site is homozygous or heterozygous for the alternate allele, GATK tends to count everything as reference, ignoring the reads supporting alternative alleles altogether, and thus calls the wrong genotype of 0/0. Here are some examples (each cell is ref/alt counts):

    sample pysam default(50) 100_reads 200_reads 500_reads 1000_reads no_downsampling
    sample15 1970/1721 28/38 61/55 116/100 520/0 1020/0 3754/0
    sample16 15/3279 4/73 3/124 3/224 527/0 1027/0 21/3277
    sample17 2927/2568 49/53 156/0 256/0 556/0 1056/0 5496/0
    sample18 3/4028 0/86 0/136 0/236 1/535 1036/0 4088/0
    sample19 1814/1626 42/36 65/63 127/101 263/264 528/496 3444/0
    sample20 8/3356 0/68 0/118 0/218 0/518 0/1017 6/3405
    sample21 1872/1747 40/38 63/65 122/106 528/0 1028/0 3621/0
    sample22 3/3422 2/79 2/130 3/229 4/527 1/1031 3428/0
    sample23 1522/1463 32/39 60/61 96/124 261/260 503/517 3038/0
    sample26 19/3706 0/83 1/133 2/231 534/0 1034/0 3726/0
    sample27 14/4823 0/83 0/133 0/233 2/531 2/1031 4839/0
    sample28 2174/1864 51/39 143/0 130/110 306/232 1043/0 4106/0

    This behavior seems to be random, but tends to happen more often when the threshold is set higher. A value of 1000 gave many more wrong allele counts and genotype calls than 100 or 200.

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie

    @biojiangke Can you tell us which version you used? The downsampling code was significantly changed in GATK4, so what you observed here may no longer apply.

  • biojiangke

    gatk-package-4.0.3.0-local.jar

  • Sheila · Broad Institute · Member, Broadie, Moderator

    @biojiangke
    Hi Ke,

    Can you try with the very latest version? I may need you to submit a bug report.

    -Sheila
