The current GATK version is 3.6-0
Examples: Monday, today, last week, Mar 26, 3/26/04

#### Howdy, Stranger!

It looks like you're new here. If you want to get involved, click one of these buttons!

# Only minor reduction in BAM file size when running ReduceReads

San Diego, CaliforniaMember Posts: 6

I am putting together a class on NGS data analysis and am working with one of Illumina's "Platinum" data sets (http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=ERR262996). This is a result set for a long range mate-pair experiment with about 33x coverage. When I run ReduceReads on the mapped, aligned, and indel re-aligned BAM file using default parameters, I get only about a 20% reduction in file size. Closer inspection shows that none of the reads were actually removed. I tried with both the complete results and a chromosome 20 subset.

What am I missing?

Tagged:

Hi there,

When you say none of the reads have been removed, how did you determine this? 20% is not much reduction but it does indicate that some of the data was compressed.

What version of GATK are you using, by the way?

Geraldine Van der Auwera, PhD

• San Diego, CaliforniaMember Posts: 6
edited September 2013

I looked at the results using IGV and it seemed as if all the reads were still there. To confirm, I just ran samtools idxstats. That actually showed that about 4% of reads had been removed from chr20, for example. For other chromosomes, the numbers ranged from about 2% to 15% with the exception of chrM (97%) and chrY (69%).

I tried both with versions v2.5-2-gf57256b and v2.5-428-g6bda569 with identical results.

Thank you very much for your assistance!

Geraldine Van der Auwera, PhD

• San Diego, CaliforniaMember Posts: 6
edited September 2013

Unfortunately, the result is almost identical (6.8% reduction in reads for chr20).

Hmm, I see. I can't think of anything else at this point but I'll pass your question on to the lead developer of ReduceReads, @Carneiro, who will be able to advise you better on this. Note that he is currently on vacation until Wednesday so it may be a few days before he can get back to you.

Geraldine Van der Auwera, PhD

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• San Diego, CaliforniaMember Posts: 6

Sure:

java -Djava.io.tmpdir=/storage/tmp/ -Xmx48g -jar GenomeAnalysisTK.jar \ -T ReduceReads \ -R /storage/ucsc.hg19.standard.fasta \ -I ERR262996_mem_sorted.REF_chr20.realigned.bam \ -o ERR262996_mem_sorted.REF_chr20.rr.bam \ -L chr20

Looks fine to me. Next request: could you possibly load the 2 bams into IGV and take a screenshot of say a 500bp region on chr20? Always easier to visualize these things.

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT

• San Diego, CaliforniaMember Posts: 6
edited October 2013

Here is the screen shot.

EDIT: the original had the wrong reference.

Post edited by helgew on

Can you upload the snippet of BAM to our FTP? We'll have a look at it.

Geraldine Van der Auwera, PhD

• San Diego, CaliforniaMember Posts: 6
edited October 2013

I just uploaded ERR262996_mem_sorted.REF_chr20.snippet.bam, which is chr20:13807413-13837680. It also does not behave well when subjected to the ReduceReads walker.

In the meantime, I have worked up ERR194146 and the reduction seems to have worked. The resulting file size is a little less than 20% of the original and the number of reads was reduced from 32844550 to 4143647. ERR194146 is from a "regular paired reads" experiment with higher coverage (http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=ERR194146).

Hi there,

Sorry for the long delay in getting to this, but Reduce Reads isn't a terribly high priority for us these days (we are working on an improved pipeline that will obviate the need for this tool).

I took a closer look at your bam file and figured out the problem. It seems that the adaptor sequences were SOFT-clipped in your bam file instead of being HARD-clipped out. This is very noticable when you enable viewing of soft-clipped bases in IGV (you'll see the same sequence over and over again in your reads). Reduce Reads tries to preserve soft-clips (because they often signify real events) so I am not surprised that your bam file is not getting reduced.

Please note that this bam file is not usable for variant calling even in the un-reduced state. Nearly any variant caller will see those consistent soft-clips and call insertions throughout the genome. Not good at all...

I'm not sure whether you got this bam elsewhere or processed it yourself, but either way I'd recommend not using the bam as is in your class. Sorry if this comes too late in the school year.

Eric Banks, PhD -- Director, Data Sciences and Data Engineering, Broad Institute of Harvard and MIT