# Only minor reduction in BAM file size when running ReduceReads

San Diego, California · Member

I am putting together a class on NGS data analysis and am working with one of Illumina's "Platinum" data sets (http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=ERR262996). This is a result set for a long range mate-pair experiment with about 33x coverage. When I run ReduceReads on the mapped, aligned, and indel re-aligned BAM file using default parameters, I get only about a 20% reduction in file size. Closer inspection shows that none of the reads were actually removed. I tried with both the complete results and a chromosome 20 subset.

What am I missing?

Hi there,

When you say none of the reads have been removed, how did you determine this? 20% is not much reduction but it does indicate that some of the data was compressed.

What version of GATK are you using, by the way?

• San Diego, California · Member
edited September 2013

I looked at the results using IGV and it seemed as if all the reads were still there. To confirm, I just ran samtools idxstats. That actually showed that about 4% of reads had been removed from chr20, for example. For other chromosomes, the numbers ranged from about 2% to 15% with the exception of chrM (97%) and chrY (69%).

I tried both v2.5-2-gf57256b and v2.5-428-g6bda569, with identical results.

Thank you very much for your assistance!
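For anyone wanting to repeat the per-chromosome comparison described above, here is a minimal sketch in Python. It assumes you have saved the output of `samtools idxstats` for the original and the reduced BAM as text, and it counts mapped plus unmapped records per contig (an assumption about what "reads removed" means here). The function names and the numbers in the demo are made up for illustration and are not taken from the data set discussed in this thread.

```python
def parse_idxstats(text):
    """Return {contig: record count} from `samtools idxstats` output.

    Each line is tab-separated: name, length, mapped reads, unmapped reads.
    The count used here is mapped + unmapped (an assumption of this sketch).
    """
    counts = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _length, mapped, unmapped = line.split("\t")[:4]
        counts[name] = int(mapped) + int(unmapped)
    return counts


def percent_removed(before_text, after_text):
    """Percentage of records removed per contig between two idxstats outputs."""
    before = parse_idxstats(before_text)
    after = parse_idxstats(after_text)
    result = {}
    for contig, n_before in before.items():
        if n_before == 0:
            continue  # nothing to compare for empty contigs
        n_after = after.get(contig, 0)
        result[contig] = 100.0 * (n_before - n_after) / n_before
    return result


if __name__ == "__main__":
    # Hypothetical idxstats output, for illustration only.
    original = "chr20\t63025520\t1000\t0\nchrM\t16571\t200\t0\n*\t0\t0\t0\n"
    reduced = "chr20\t63025520\t960\t0\nchrM\t16571\t6\t0\n*\t0\t0\t0\n"
    for contig, pct in percent_removed(original, reduced).items():
        print(f"{contig}\t{pct:.1f}% removed")
```

In practice the two input strings would come from files written with `samtools idxstats original.bam > original.idxstats.txt` and the same for the reduced BAM (file names here are placeholders).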

• San Diego, California · Member
edited September 2013

Unfortunately, the result is almost identical (6.8% reduction in reads for chr20).

Hmm, I see. I can't think of anything else at this point but I'll pass your question on to the lead developer of ReduceReads, @Carneiro, who will be able to advise you better on this. Note that he is currently on vacation until Wednesday so it may be a few days before he can get back to you.

• San Diego, California · Member

Sure:

    java -Djava.io.tmpdir=/storage/tmp/ -Xmx48g -jar GenomeAnalysisTK.jar \
        -T ReduceReads \
        -R /storage/ucsc.hg19.standard.fasta \
        -I ERR262996_mem_sorted.REF_chr20.realigned.bam \
        -o ERR262996_mem_sorted.REF_chr20.rr.bam \
        -L chr20

Looks fine to me. Next request: could you load the two BAMs into IGV and take a screenshot of, say, a 500 bp region on chr20? It's always easier to visualize these things.

• San Diego, California · Member
edited October 2013

Here is the screenshot.

EDIT: the original had the wrong reference.

Post edited by helgew on

Can you upload the snippet of BAM to our FTP? We'll have a look at it.

• San Diego, California · Member
edited October 2013

I just uploaded ERR262996_mem_sorted.REF_chr20.snippet.bam, which is chr20:13807413-13837680. It also does not behave well when subjected to the ReduceReads walker.

In the meantime, I have processed ERR194146, and there the reduction seems to have worked. The resulting file is a little less than 20% of the original size, and the number of reads dropped from 32844550 to 4143647. ERR194146 is from a "regular paired reads" experiment with higher coverage (http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=ERR194146).