Only minor reduction in BAM file size when running ReduceReads

helgew (San Diego, California; Member)

I am putting together a class on NGS data analysis and am working with one of Illumina's "Platinum" data sets ( This is a result set for a long range mate-pair experiment with about 33x coverage. When I run ReduceReads on the mapped, aligned, and indel re-aligned BAM file using default parameters, I get only about a 20% reduction in file size. Closer inspection shows that none of the reads were actually removed. I tried with both the complete results and a chromosome 20 subset.

What am I missing?



  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hi there,

    When you say none of the reads have been removed, how did you determine this? 20% is not much reduction but it does indicate that some of the data was compressed.

    What version of GATK are you using, by the way?

  • helgew (San Diego, California; Member)
    edited September 2013

    I looked at the results using IGV and it seemed as if all the reads were still there. To confirm, I just ran samtools idxstats. That actually showed that about 4% of reads had been removed from chr20, for example. For other chromosomes, the numbers ranged from about 2% to 15% with the exception of chrM (97%) and chrY (69%).

    I tried with both v2.5-2-gf57256b and v2.5-428-g6bda569, with identical results.

    Thank you very much for your assistance!
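
The per-chromosome comparison described in this post can be sketched as follows. This is a minimal sketch, assuming the text output of `samtools idxstats` (tab-separated columns: reference name, sequence length, mapped reads, unmapped reads) has been saved for both the original and the reduced BAM. The function names and the read counts in the example are hypothetical and are not taken from the thread.

```python
def parse_idxstats(text):
    """Parse `samtools idxstats` text output into {reference: mapped_read_count}."""
    counts = {}
    for line in text.strip().splitlines():
        name, _length, mapped, _unmapped = line.split("\t")
        if name != "*":  # the "*" row holds reads with no reference assigned
            counts[name] = int(mapped)
    return counts


def percent_removed(before, after):
    """Return {reference: percentage of mapped reads missing from `after`}."""
    result = {}
    for name, n_before in before.items():
        if n_before == 0:
            continue  # avoid division by zero for empty references
        n_after = after.get(name, 0)
        result[name] = 100.0 * (n_before - n_after) / n_before
    return result


# Made-up counts, for illustration only.
original = "chr20\t63025520\t1000000\t0\nchrM\t16571\t20000\t0\n*\t0\t0\t500\n"
reduced = "chr20\t63025520\t960000\t0\nchrM\t16571\t600\t0\n*\t0\t0\t500\n"

removed = percent_removed(parse_idxstats(original), parse_idxstats(reduced))
for name, pct in removed.items():
    print(f"{name}\t{pct:.1f}% removed")
```

Comparing mapped-read counts this way gives a more reliable picture than eyeballing IGV or comparing file sizes, since file size also depends on compression.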

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Ah, you may benefit from trying again with the latest version (2.7) which includes several important updates and bugfixes for ReduceReads.

  • helgew (San Diego, California; Member)
    edited September 2013

    Unfortunately, the result is almost identical (6.8% reduction in reads for chr20).

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hmm, I see. I can't think of anything else at this point but I'll pass your question on to the lead developer of ReduceReads, @Carneiro, who will be able to advise you better on this. Note that he is currently on vacation until Wednesday so it may be a few days before he can get back to you.

  • ebanks (Broad Institute; Member, Broadie, Dev)

    Can you please post your command line?

  • helgew (San Diego, California; Member)

    Can you please post your command line?


    java -Xmx48g -jar GenomeAnalysisTK.jar \
        -T ReduceReads \
        -R /storage/ucsc.hg19.standard.fasta \
        -I ERR262996_mem_sorted.REF_chr20.realigned.bam \
        -o ERR262996_mem_sorted.REF_chr20.rr.bam \
        -L chr20

  • ebanks (Broad Institute; Member, Broadie, Dev)

    Looks fine to me. Next request: could you load the two BAMs into IGV and take a screenshot of, say, a 500 bp region on chr20? It's always easier to visualize these things.

  • helgew (San Diego, California; Member)
    edited October 2013

    Here is the screen shot.

    EDIT: the original had the wrong reference.

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Can you upload the snippet of BAM to our FTP? We'll have a look at it.

  • helgew (San Diego, California; Member)
    edited October 2013

    I just uploaded ERR262996_mem_sorted.REF_chr20.snippet.bam, which is chr20:13807413-13837680. It also does not behave well when subjected to the ReduceReads walker.

    In the meantime, I have worked up ERR194146 and the reduction seems to have worked. The resulting file size is a little less than 20% of the original and the number of reads was reduced from 32844550 to 4143647. ERR194146 is from a "regular paired reads" experiment with higher coverage (

  • ebanks (Broad Institute; Member, Broadie, Dev)

    Hi there,

    Sorry for the long delay in getting to this, but Reduce Reads isn't a terribly high priority for us these days (we are working on an improved pipeline that will obviate the need for this tool).

    I took a closer look at your BAM file and figured out the problem. It seems that the adaptor sequences were soft-clipped in your BAM file instead of being hard-clipped out. This is very noticeable when you enable viewing of soft-clipped bases in IGV (you'll see the same sequence over and over again in your reads). Reduce Reads tries to preserve soft clips (because they often signify real events), so I am not surprised that your BAM file is not getting reduced.

    Please note that this bam file is not usable for variant calling even in the un-reduced state. Nearly any variant caller will see those consistent soft-clips and call insertions throughout the genome. Not good at all...

    I'm not sure whether you got this bam elsewhere or processed it yourself, but either way I'd recommend not using the bam as is in your class. Sorry if this comes too late in the school year. :(
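
One way to check for the soft-clipping pattern described above without opening IGV is to measure what fraction of reads carry soft clips. The following is a minimal sketch, assuming the CIGAR strings have already been extracted from the BAM (for example, column 6 of `samtools view` output). The function names and the example CIGAR strings are hypothetical, and no particular threshold is implied by the thread.

```python
import re

# One CIGAR operation: a length followed by an operation code from the SAM spec.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def soft_clipped_bases(cigar):
    """Return the number of soft-clipped (S) bases in a CIGAR string."""
    if cigar == "*":  # CIGAR unavailable, e.g. an unmapped read
        return 0
    return sum(int(length) for length, op in CIGAR_OP.findall(cigar) if op == "S")


def soft_clip_fraction(cigars):
    """Return the fraction of reads that have at least one soft-clipped base."""
    cigars = list(cigars)
    if not cigars:
        return 0.0
    clipped = sum(1 for c in cigars if soft_clipped_bases(c) > 0)
    return clipped / len(cigars)


# Made-up CIGAR strings, for illustration only.
example = ["100M", "60M40S", "35S65M", "101M", "20S50M31S"]
print(soft_clipped_bases("20S50M31S"))
print(soft_clip_fraction(example))
```

A high fraction of soft-clipped reads across the genome, especially with clips of similar length, would be consistent with adaptor sequence that was soft-clipped rather than removed.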
