Performance of ReduceReads
I have a set of 250 BAM files containing whole-genome sequence data for trios (3 samples per BAM) at 12x coverage. Each BAM file is currently about 200-250 GB.
I am currently trying to process a number of BAMs with ReduceReads prior to using them with the UnifiedGenotyper (UG), and was wondering whether the run times I'm getting are in line with what I should expect. If they are, do you think the time spent on ReduceReads would then be regained when calling with the UG?
I've tried running ReduceReads in 2 different ways:
Per trio: Each BAM is input/output as a whole. This gives estimated run times of ~5 days.
java -Xmx4g -Djava.io.tmpdir=/local -jar ~/tools/GenomeAnalysisTK-2.1-8-g5efb575/GenomeAnalysisTK.jar \
    -T ReduceReads \
    -R human_g1k_v37.fa \
    -I A4.human_g1k_v37.trio_realigned.bam \
    -o A4.human_g1k_v37.trio_realigned.reduced.bam
Per individual: Each BAM is input with the -rgbl option blacklisting all read groups that do not belong to the individual, so I would run 3 ReduceReads processes on each trio BAM. Each of these gives an estimated run time of 2 days.
java -Xmx4g -Djava.io.tmpdir=/local -jar ~/tools/GenomeAnalysisTK-2.1-8-g5efb575/GenomeAnalysisTK.jar \
    -T ReduceReads \
    -R human_g1k_v37.fa \
    -I A4.human_g1k_v37.trio_realigned.bam \
    -o A4a.human_g1k_v37.trio_realigned.reduced.bam \
    -rgbl ID:L3 \
    -rgbl ID:L5 \
    -rgbl ID:L6.1 \
    -rgbl ID:L6.2 \
    -rgbl ID:L7.1 \
    -rgbl ID:L7.2 \
    -rgbl ID:L7.3 \
    -rgbl ID:L7.4
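With 250 trio BAMs, the per-individual -rgbl lists could be generated from the BAM header rather than typed by hand. The following is only a minimal sketch: the function name rgbl_args_for_sample is made up for illustration, and it assumes that every @RG header line carries both an ID: and an SM: tag and that the SM: value identifies the individual.

```shell
#!/usr/bin/env bash
# Sketch: read SAM header text on stdin and print one "-rgbl ID:<id>" argument
# for every read group whose SM: tag differs from the sample to keep.
# Assumption: @RG lines are tab-separated and contain both ID: and SM: tags.
rgbl_args_for_sample() {
    local keep_sample="$1"   # sample name (SM: value) whose read groups are kept
    awk -v keep="$keep_sample" -F'\t' '
        $1 == "@RG" {
            id = ""; sm = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^ID:/) id = substr($i, 4)
                if ($i ~ /^SM:/) sm = substr($i, 4)
            }
            # Blacklist read groups that belong to any other sample
            if (id != "" && sm != keep) {
                printf "%s-rgbl ID:%s", sep, id
                sep = " "
            }
        }
        END { printf "\n" }
    '
}

# Possible usage (the sample name "A4a" is hypothetical):
#   samtools view -H A4.human_g1k_v37.trio_realigned.bam | rgbl_args_for_sample A4a
```

The printed arguments could then be appended to the ReduceReads command line for that individual. It is worth checking the output against the header by eye before launching multi-day jobs.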
Thanks a lot!