Performance of ReduceReads

Laurent · Member, Broadie ✭✭
edited August 2012 in Ask the GATK team

Hi all,

I have a set of 250 BAM files containing whole-genome sequence data for trios (3 samples per BAM) at 12x coverage. Each BAM file is currently about 200-250 GB.
I am currently processing a number of these BAMs with ReduceReads prior to calling with the UnifiedGenotyper, and was wondering whether the run times I'm getting are in line with what I should expect. If they are, do you think the time spent on ReduceReads will be gained back when calling with the UG?

I've tried running ReduceReads in 2 different ways:

  1. Per trio: each BAM is input and output as a whole. This gives estimated run times of ~5 days.

    java -Xmx4g -Djava.io.tmpdir=/local -jar ~/tools/GenomeAnalysisTK-2.1-8-g5efb575/GenomeAnalysisTK.jar \
    -T ReduceReads \
    -R human_g1k_v37.fa \
    -I  A4.human_g1k_v37.trio_realigned.bam \
    -o A4.human_g1k_v37.trio_realigned.reduced.bam
    
  2. Per individual: each BAM is input with `-rgbl` options blacklisting all read groups not belonging to that individual, so I run 3 ReduceReads processes per trio BAM. Each of these gives an estimated run time of ~2 days.

    java -Xmx4g -Djava.io.tmpdir=/local -jar ~/tools/GenomeAnalysisTK-2.1-8-g5efb575/GenomeAnalysisTK.jar \
    -T ReduceReads \
    -R human_g1k_v37.fa \
    -I A4.human_g1k_v37.trio_realigned.bam \
    -o A4a.human_g1k_v37.trio_realigned.reduced.bam \
    -rgbl ID:L3 \
    -rgbl ID:L5 \
    -rgbl ID:L6.1 \
    -rgbl ID:L6.2 \
    -rgbl ID:L7.1 \
    -rgbl ID:L7.2 \
    -rgbl ID:L7.3 \
    -rgbl ID:L7.4
    
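As a side note on approach 2: the read-group-to-sample mapping needed to build each `-rgbl` blacklist can be read out of the BAM header (`samtools view -H trio.bam | grep '^@RG'`, assuming samtools is available). A minimal sketch of turning those `@RG` lines into flags — the header, sample names, and read-group IDs below are invented for illustration, not Laurent's actual ones:

```shell
#!/bin/sh
# Sketch: given the @RG lines of a trio BAM header (hypothetical example
# inlined here; in practice: samtools view -H trio.bam | grep '^@RG'),
# emit the -rgbl flags that exclude every read group NOT belonging to
# the sample we want to keep.
KEEP=NA12878   # hypothetical sample to keep
HEADER='@RG ID:L1.1 SM:NA12878
@RG ID:L3 SM:NA12891
@RG ID:L5 SM:NA12892'

# For each @RG line, pull the ID: and SM: tags; print a -rgbl flag for
# every read group whose sample is not the one we are keeping.
RGBL=$(printf '%s\n' "$HEADER" | awk -v keep="$KEEP" '
    { id=""; sm="";
      for (i = 1; i <= NF; i++) {
          if ($i ~ /^ID:/) id = substr($i, 4);
          if ($i ~ /^SM:/) sm = substr($i, 4);
      }
      if (sm != keep) printf "-rgbl ID:%s ", id;
    }')
echo "$RGBL"
```

The generated `$RGBL` string can then be appended to the ReduceReads command line for that individual, instead of typing the eight `-rgbl` flags by hand.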

Thanks a lot!
Laurent

Best Answer

  • Mark_DePristo · Broad Institute admin
    Accepted Answer

    Hi Laurent,

    I would expect reduced reads to take quite some time on a WGS data file. We routinely see 6 hours for an exome, which is like 10x less data than a WGS data set. So a few days is reasonable. Running the trio BAM together is fine, in my opinion, as right now each sample is reduced independently so it's just easier, really. If you want to reduce latency in the pipeline you can always process each chromosome independently, which would finish for chr1 in like 10 hours, and then merge the resulting BAM files together, which will take a bit of time itself. So it's more CPU cost for shorter wall time.

    As for the performance of calling with reduced reads -- we've seen ~10x speed-ups for calling, perhaps more in your case. It's hard to tell because we've never tried it with 12x WGS BAMs. It also vastly reduces the IO costs of calling, so across many jobs you may get an even bigger boost, depending on your infrastructure. Finally, one advantage of the reduced BAMs is that you can move your original full BAMs to a lower tier of storage, which can be vastly cheaper than the Isilon or equivalent infrastructure needed to support many jobs calling off many BAM files.
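Mark's scatter-by-chromosome suggestion could be scripted roughly as follows. This is a sketch under assumptions (paths, output names, and the merge step are illustrative, reusing the jar and files from the question); it only prints the commands, so the output can be piped to a scheduler such as xargs, bsub, or qsub:

```shell
#!/bin/sh
# Sketch: emit one ReduceReads command per chromosome (GRCh37 naming,
# chromosomes 1-22 plus X and Y), so each can run as an independent job.
GATK_JAR=$HOME/tools/GenomeAnalysisTK-2.1-8-g5efb575/GenomeAnalysisTK.jar
REF=human_g1k_v37.fa
BAM=A4.human_g1k_v37.trio_realigned.bam

CMDS=""
for CHR in $(seq 1 22) X Y; do
    CMD="java -Xmx4g -jar $GATK_JAR -T ReduceReads -R $REF -I $BAM -L $CHR -o A4.chr$CHR.reduced.bam"
    CMDS="$CMDS$CMD
"
    echo "$CMD"
done

# Once all 24 jobs finish, merge the per-chromosome outputs back into a
# single BAM (samtools merge shown; Picard MergeSamFiles also works).
echo "samtools merge A4.reduced.bam A4.chr*.reduced.bam"
```

As Mark notes, the merge step itself takes time, so this trades extra CPU and IO for shorter wall-clock latency.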

Answers

  • Laurent · Member, Broadie ✭✭

    Thanks a lot for your answer Mark! I might try the per-chromosome approach and see how much performance gain I get on one chromosome before starting the process over the whole set.
