Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
GATK2 ReduceReads gets stuck on large (100Gb) Bam Files after a few hours
I have been running GATK2 ReduceReads on a large (100Gb) Bam file, and even though at the very beginning it runs very smoothly and predicts a week for finishing the task, after a few hours it gets totally stock. We first thought that it could be a garbage collection (or java memory allocation issue), but the logs show that the garbage collection works well.
The command is (similar behavior for smaller Xms and Xmx values) java -Xmx30g -Xms30g -XX:+PrintGCTimeStamps -XX:+UseParallelOldGC -XX:+PrintGCDetails -Xloggc:gc.log -verbose:gc -jar $path $ref -T ReduceReads -I input.bam -o output.bam
The first few lines of the log file are
INFO 01:12:21,541 TraversalEngine - chr1:1094599 5.89e+05 9.9 m 16.8 m 0.0% 19.4 d 19.4 d
INFO 01:13:21,628 TraversalEngine - chr1:2112411 9.44e+05 10.9 m 11.6 m 0.1% 11.2 d 11.2 d
INFO 01:14:22,065 TraversalEngine - chr1:3051535 1.29e+06 11.9 m 9.3 m 0.1% 8.5 d 8.5 d
INFO 01:15:22,297 TraversalEngine - chr1:4084547 1.59e+06 12.9 m 8.1 m 0.1% 6.9 d 6.9 d
INFO 01:16:24,130 TraversalEngine - chr1:4719991 1.82e+06 13.9 m 7.7 m 0.2% 6.4 d 6.4 d
but after a short while it gets totally stock, and even in the location 121485073 of chromosome 1, there is almost no progress at all, and the estimated finish time goes over 11 weeks, and still increasing.
Any idea what the reason for this could be, and how we can solve the problem? The same command runs successfully on small (less than 5gig) Bam files though
Thanks in advance.