The current GATK version is 3.7-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

You can opt in to receive email notifications, for example when your questions get answered or when there are new announcements, by following the instructions given here.

#### ☞ Got a problem?

1. Search using the upper-right search box, e.g. using the error message.
3. Include tool and Java versions.
4. Tell us whether you are following GATK Best Practices.
5. Include relevant details, e.g. platform, DNA- or RNA-Seq, WES (+capture kit) or WGS (PCR-free or PCR+), paired- or single-end, read length, expected average coverage, somatic data, etc.
6. For tool errors, include the error stacktrace as well as the exact command.
7. For format issues, include the result of running ValidateSamFile for BAMs or ValidateVariants for VCFs.
8. For weird results, include an illustrative example, e.g. attach IGV screenshots according to Article#5484.
9. For a seeming variant that is uncalled, include results of following Article#1235.

#### ☞ Formatting tip!

Wrap blocks of code, error messages and BAM/VCF snippets--especially content with hashes (#)--with lines with three backticks ( ` ) each to make a code block as demonstrated here.

Picard 2.10.4 has MAJOR CHANGES that impact throughput of pipelines. Default compression is now 1 instead of 5, and Picard now handles compressed data with the Intel Deflator/Inflator instead of JDK.
GATK version 4.beta.2 (i.e. the second beta release) is out. See the GATK4 BETA page for download and details.

# Structure in the “random” samples produced by the LevelingDownSampler class

MGHMember

I have noticed some strange structure in the samples produced by the LevelingDownSampler class and was hoping someone might be nice enough to explain what is going on and how to avoid it.

I was attempting to use the unified genotyper to infer frequencies of heteroplasmic mutations in mitochondrial DNA by setting a high ploidy level to account for the multiplicity of mtDNA in a sample. However, when test datasets with different mitochondrial sequences were combined at known frequencies, the frequency inferred by the GATK (alleles at different levels of ploidy) was consistently different from the known mixing frequency. In other words, the recovered frequencies were biased.

As the bias was reduced as the dcov parameter was increased, I looked at the downsampler class to check for issues. It appeared that the levelGroups function was not randomly sampling reads, as there was a lot of structure in the groups it was leveling as well as the way it chose to level them.

I created a toy dataset by placing a SNP in a 300 bp window with high coverage and then merging two BAM files with samtools. I then had the class output at the end of the GATK analysis the number of reads that came from each original alignment position and from each file (or those reads returned by the consumeFinalizedItems method). The two issues were:

First, the downsampler appeared to sample too many reads at the start of the genome segment before leveling off to correct level, before dropping again at the end of the segment being sequenced (though plenty of reads were available). Is it possible to avoid this?
Second, although in the middle of the genomic segment the read counts were uniformly distributed, their source in the BAM file was not. At low downsampling settings (<75) there would be bursts where one section of reads came entirely from one of the source bam files, followed by another burst where all the reads came from a different bam file (though these files were combined in the merge). Does anyone know if it is the case that spatial structure in the BAM file will appear as spatial structure in the downsampled reads? Is there any way to avoid this behavior?

Tagged: