
ReduceReads compressing exome BAMs at 3X... any way to get it to 20-100X?

stevepiccolo Member
edited December 2012 in Ask the GATK team

I'm working with exome-sequencing BAM files (GATK v2.4). My pipeline (indel realignment, base quality recalibration, ReduceReads) is working well and reduces the BAMs to a third of their original size, which is really helpful. The Best Practices guide suggests it may be possible to compress them further, perhaps to 20-100X. I believe the parameters I'm using are pretty standard (see below), but please correct me if not. Is there something else I can or should do to get the BAMs smaller?

echo Creating realigned target file
java -Xmx22g -jar $gatkDir/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R $genomeFastaFile \
  -I $bamFile \
  --known $knownVcfFile \
  -o $bamFile.intervals

echo Performing Indel realignment on $bamFile
java -Xmx22g -jar $gatkDir/GenomeAnalysisTK.jar \
  -I $bamFile \
  -R $genomeFastaFile \
  -T IndelRealigner \
  -targetIntervals $bamFile.intervals \
  -known $knownVcfFile \
  -LOD 0.4 \
  -o ${bamFile}.tmp.bam

echo Performing base recalibration
java -Xmx22g -jar $gatkDir/GenomeAnalysisTK.jar \
  -I ${bamFile}.tmp.bam \
  -R $genomeFastaFile \
  -T BaseRecalibrator \
  -knownSites $knownVcfFile \
  -nct 4 \
  -o ${bamFile}.recal.grp

echo Printing recalibrated reads
java -Xmx22g -jar $gatkDir/GenomeAnalysisTK.jar \
  -T PrintReads \
  -R $genomeFastaFile \
  -I ${bamFile}.tmp.bam \
  -BQSR ${bamFile}.recal.grp \
  -nct 4 \
  -o $bamFile

echo Reindexing BAM
$samToolsDir/samtools index $bamFile

echo Reducing BAM
java -Xmx22g -jar $gatkDir/GenomeAnalysisTK.jar \
  -T ReduceReads \
  -R $genomeFastaFile \
  -I $bamFile \
  -o $outBamFile

Answers

  • ebanks (Broad Institute), Member, Broadie, Dev ✭✭✭✭

    Hi Steve,

    The 20-100x compression is relative to the file size before the reducing step, not to the original file size before recalibration (which adds new tags to the file).
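
    To see which ratio you're actually measuring, it may help to compare the recalibrated BAM against the reduced BAM directly. A minimal sketch, reusing the variable names from the pipeline above and assuming GNU stat (on BSD/macOS it would be stat -f %z instead of stat -c %s):

    # $bamFile is the recalibrated (pre-reduction) BAM, i.e. the correct baseline;
    # $outBamFile is the reduced BAM produced by ReduceReads
    before=$(stat -c %s $bamFile)
    after=$(stat -c %s $outBamFile)
    echo "Compression ratio: $(echo "scale=1; $before / $after" | bc)x"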

  • Unless I'm totally missing something, we are not getting anything close to 20-100x compression, no matter which way I look at it. Perhaps my code (above) is missing an important step or I am doing something else wrong?

  • Mark_DePristo (Broad Institute), Member, admin

    A few possibilities I can think of.

    First, what sequencing tech are you using? Anything with a high indel error rate -- Ion Torrent, PacBio, 454 -- is going to cause problems with reducing. I'm not sure what happens with SOLiD, which we aren't going to support since it's a deprecated technology. Our RR experience is with Illumina data.

    Second, perhaps you have a ton of off-target reads? If 30% of your reads are off-target and scattered all over the genome, we are going to have a very hard time reducing them. One way around this is to include only the targeted intervals in the RR step (see the sketch at the end of this answer); that will deal with the off-target reads.

    Finally, if you just have low-coverage exomes, we won't be able to do as well overall.
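
    A minimal sketch of that interval-restricted ReduceReads call, reusing the variables from the original pipeline; $targetBedFile is a hypothetical variable naming a BED file of capture targets, and -L is the standard GATK intervals argument (it accepts BED files):

    # $targetBedFile (hypothetical) points to the exome capture targets in BED format
    java -Xmx22g -jar $gatkDir/GenomeAnalysisTK.jar \
      -T ReduceReads \
      -R $genomeFastaFile \
      -I $bamFile \
      -L $targetBedFile \
      -o $outBamFile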

  • Thanks for your reply. These are Illumina reads. I just reran the ReduceReads step using a BED file of my target regions, and that helped: the file before RR was 14 GB, and after RR it was 3 GB (roughly 4.7X). I believe the average coverage is at least 50X, but I'll let you know if I find otherwise.
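
    One way to verify that coverage estimate is GATK's DepthOfCoverage walker, which reports mean coverage over a set of intervals. A minimal sketch, again using the hypothetical $targetBedFile for the capture targets; the per-sample mean appears in the .sample_summary output file:

    # Summarize coverage over the capture targets only
    # ($targetBedFile is hypothetical; --omitDepthOutputAtEachBase skips the per-base table)
    java -Xmx22g -jar $gatkDir/GenomeAnalysisTK.jar \
      -T DepthOfCoverage \
      -R $genomeFastaFile \
      -I $bamFile \
      -L $targetBedFile \
      --omitDepthOutputAtEachBase \
      -o ${bamFile}.coverage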
