
Structure in the “random” samples produced by the LevelingDownSampler class

evolvedmicrobe MGH · Posts: 14 · Member

I have noticed some strange structure in the samples produced by the LevelingDownSampler class and was hoping someone might be nice enough to explain what is going on and how to avoid it.

I was attempting to use the UnifiedGenotyper to infer the frequencies of heteroplasmic mutations in mitochondrial DNA by setting a high ploidy level to account for the multiplicity of mtDNA copies in a sample. However, when test datasets with different mitochondrial sequences were combined at known frequencies, the frequency inferred by the GATK (the fraction of the ploidy assigned to each allele) was consistently different from the known mixing frequency. In other words, the recovered frequencies were biased.

Because the bias shrank as the -dcov parameter was increased, I looked at the downsampler class to check for issues. It appeared that the levelGroups function was not sampling reads randomly: there was a lot of structure both in the groups it was leveling and in the way it chose to level them.

I created a toy dataset by placing a SNP in a 300 bp window with high coverage and then merging two BAM files with samtools. At the end of the GATK analysis I had the class report, for the reads returned by the consumeFinalizedItems method, how many came from each original alignment position and from each source file. Two issues stood out:

First, the downsampler appeared to sample too many reads at the start of the genomic segment, level off to the correct depth through the middle, and then drop again at the end of the segment, even though plenty of reads were available at both boundaries. Is it possible to avoid this?

Second, although read counts were uniformly distributed across the middle of the segment, their sources in the BAM file were not. At low downsampling settings (below about 75) there were bursts in which one stretch of reads came entirely from one of the source BAM files, followed by another burst in which all the reads came from the other file, even though the files had been combined in the merge. Does spatial structure in the BAM file appear as spatial structure in the downsampled reads? Is there any way to avoid this behavior?
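For anyone who wants to reproduce this kind of tally, here is a minimal sketch using htsjdk; it counts reads by alignment start and by read group, assuming the merge preserved a distinct read group per source file. This is an illustrative stand-in, not the instrumented downsampler output described above.

    import htsjdk.samtools.SAMRecord;
    import htsjdk.samtools.SamReader;
    import htsjdk.samtools.SamReaderFactory;
    import java.io.File;
    import java.util.Map;
    import java.util.TreeMap;

    // Count reads per alignment start and per read group in a merged BAM.
    // Assumes each source file kept its own read group through the merge.
    public class MergedBamTally {
        public static void main(String[] args) throws Exception {
            Map<Integer, Integer> byStart = new TreeMap<>();
            Map<String, Integer> bySource = new TreeMap<>();
            try (SamReader reader = SamReaderFactory.makeDefault().open(new File(args[0]))) {
                for (SAMRecord read : reader) {
                    byStart.merge(read.getAlignmentStart(), 1, Integer::sum);
                    String source = read.getReadGroup() == null
                            ? "unknown" : read.getReadGroup().getId();
                    bySource.merge(source, 1, Integer::sum);
                }
            }
            System.out.println("Reads per alignment start: " + byStart);
            System.out.println("Reads per source (read group): " + bySource);
        }
    }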



Answers

  • droazen · Posts: 51 · GATK Developer, mod

    First, some background on what the LevelingDownsampler is doing: given a set of "groups" (in the case of the UG, these groups would be stacks of reads sharing the same alignment start position), it tries to eliminate roughly the same number of items from each group such that the sum of the sizes of all groups is less than or equal to a target size (in the case of UG: your -dcov value).

    This is why the levelGroups() method you mentioned does not do any random sampling directly -- it just determines the right number of items to eliminate from each group to bring the overall total under the target value, then hands off to the downsampleOneGroup() method to eliminate the items from each group in an unbiased fashion via MathUtils.sampleIndicesWithoutReplacement(). So, within each group the elimination is unbiased, but the selection of the actual number of items to eliminate from each group is not random at all, since the goal is to eliminate roughly the same number of items from each group.
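    To make that concrete, here is a minimal sketch of the leveling idea in plain Java. The class below is a simplified stand-in, not the actual GATK source: it removes one item at a time round-robin rather than computing removal counts up front.

        import java.util.List;
        import java.util.concurrent.ThreadLocalRandom;

        // Simplified illustration of leveling: shrink each group (a stack of
        // reads sharing an alignment start) by roughly the same amount until
        // the total across all groups is <= targetSize. WHICH items get
        // removed from a group is uniformly random; HOW MANY are removed
        // from each group is deterministic.
        public final class LevelingSketch {
            public static <T> void levelGroups(List<List<T>> groups, int targetSize) {
                int total = groups.stream().mapToInt(List::size).sum();
                boolean removedSomething = true;
                while (total > targetSize && removedSomething) {
                    removedSomething = false;
                    for (List<T> group : groups) {
                        if (total <= targetSize) break;
                        if (group.size() > 1) { // keep at least one item per group
                            // unbiased within-group choice (stand-in for
                            // MathUtils.sampleIndicesWithoutReplacement)
                            group.remove(ThreadLocalRandom.current().nextInt(group.size()));
                            total--;
                            removedSomething = true;
                        }
                    }
                }
            }
        }

    The real implementation computes the per-group removal counts in levelGroups() and delegates the unbiased selection to downsampleOneGroup(), but the division of labor is the same: deterministic "how many", random "which ones".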

    This approach works well if you assume that there are always more reads forthcoming in the stream and you want reads from each start position to be roughly equally represented at each locus. However, it is a known issue with the GATK's position-based downsampling that you can get artifacts at, for example, the end of a contig, where no additional reads are forthcoming. At such boundary points it's possible to see a drop in coverage that is purely an artifact of the downsampling process.

    Having said that, you definitely shouldn't be seeing any artifacts near your start positions, or bias towards particular samples/BAM files, as each sample is downsampled independently by the GATK engine by default, and the -dcov value is by default a per-sample limit rather than a global limit. You're not using the -dt ALL_READS argument to override this behavior, are you?
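    If it helps, a typical invocation that keeps the default per-sample downsampling looks something like the following; the file names and numeric values are placeholders, and the flags are the standard engine arguments discussed in this thread:

        java -jar GenomeAnalysisTK.jar \
            -T UnifiedGenotyper \
            -R reference.fasta \
            -I merged.bam \
            -ploidy 100 \
            -dcov 250 \
            -o calls.vcf

        # Adding -dt ALL_READS would make -dcov a shared cap across all
        # samples at a locus instead of a per-sample limit -- avoid it here.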
