Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

Only minor reduction in BAM file size when running ReduceReads

helgewhelgew San Diego, CaliforniaPosts: 6Member

I am putting together a class on NGS data analysis and am working with one of Illumina's "Platinum" data sets (http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=ERR262996). This is a result set for a long range mate-pair experiment with about 33x coverage. When I run ReduceReads on the mapped, aligned, and indel re-aligned BAM file using default parameters, I get only about a 20% reduction in file size. Closer inspection shows that none of the reads were actually removed. I tried with both the complete results and a chromosome 20 subset.

What am I missing?

Tagged:

Answers

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hi there,

    When you say none of the reads have been removed, how did you determine this? 20% is not much reduction but it does indicate that some of the data was compressed.

    What version of GATK are you using, by the way?

    Geraldine Van der Auwera, PhD

  • helgewhelgew San Diego, CaliforniaPosts: 6Member
    edited September 2013

    I looked at the results using IGV and it seemed as if all the reads were still there. To confirm, I just ran samtools idxstats. That actually showed that about 4% of reads had been removed from chr20, for example. For other chromosomes, the numbers ranged from about 2% to 15% with the exception of chrM (97%) and chrY (69%).

    I tried both with versions v2.5-2-gf57256b and v2.5-428-g6bda569 with identical results.

    Thank you very much for your assistance!

    Post edited by helgew on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Ah, you may benefit from trying again with the latest version (2.7) which includes several important updates and bugfixes for ReduceReads.

    Geraldine Van der Auwera, PhD

  • helgewhelgew San Diego, CaliforniaPosts: 6Member
    edited September 2013

    Unfortunately, the result is almost identical (6.8% reduction in reads for chr20).

    Post edited by helgew on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Hmm, I see. I can't think of anything else at this point but I'll pass your question on to the lead developer of ReduceReads, @Carneiro, who will be able to advise you better on this. Note that he is currently on vacation until Wednesday so it may be a few days before he can get back to you.

    Geraldine Van der Auwera, PhD

  • ebanksebanks Posts: 671GSA Member mod

    Can you please post your command line?

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • helgewhelgew San Diego, CaliforniaPosts: 6Member

    Can you please post your command line?

    Sure:

    java -Djava.io.tmpdir=/storage/tmp/ -Xmx48g -jar GenomeAnalysisTK.jar \ -T ReduceReads \ -R /storage/ucsc.hg19.standard.fasta \ -I ERR262996_mem_sorted.REF_chr20.realigned.bam \ -o ERR262996_mem_sorted.REF_chr20.rr.bam \ -L chr20

  • ebanksebanks Posts: 671GSA Member mod

    Looks fine to me. Next request: could you possibly load the 2 bams into IGV and take a screenshot of say a 500bp region on chr20? Always easier to visualize these things.

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

  • helgewhelgew San Diego, CaliforniaPosts: 6Member
    edited October 2013

    Here is the screen shot.

    EDIT: the original had the wrong reference.

    Post edited by helgew on
  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin

    Can you upload the snippet of BAM to our FTP? We'll have a look at it.

    Geraldine Van der Auwera, PhD

  • helgewhelgew San Diego, CaliforniaPosts: 6Member
    edited October 2013

    I just uploaded ERR262996_mem_sorted.REF_chr20.snippet.bam, which is chr20:13807413-13837680. It also does not behave well when subjected to the ReduceReads walker.

    In the meantime, I have worked up ERR194146 and the reduction seems to have worked. The resulting file size is a little less than 20% of the original and the number of reads was reduced from 32844550 to 4143647. ERR194146 is from a "regular paired reads" experiment with higher coverage (http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=viewer&m=data&s=viewer&run=ERR194146).

    Post edited by helgew on
  • ebanksebanks Posts: 671GSA Member mod

    Hi there,

    Sorry for the long delay in getting to this, but Reduce Reads isn't a terribly high priority for us these days (we are working on an improved pipeline that will obviate the need for this tool).

    I took a closer look at your bam file and figured out the problem. It seems that the adaptor sequences were SOFT-clipped in your bam file instead of being HARD-clipped out. This is very noticable when you enable viewing of soft-clipped bases in IGV (you'll see the same sequence over and over again in your reads). Reduce Reads tries to preserve soft-clips (because they often signify real events) so I am not surprised that your bam file is not getting reduced.

    Please note that this bam file is not usable for variant calling even in the un-reduced state. Nearly any variant caller will see those consistent soft-clips and call insertions throughout the genome. Not good at all...

    I'm not sure whether you got this bam elsewhere or processed it yourself, but either way I'd recommend not using the bam as is in your class. Sorry if this comes too late in the school year. :(

    Eric Banks, PhD -- Senior Group Leader, MPG Analysis, Broad Institute of Harvard and MIT

Sign In or Register to comment.