
# RealignerTargetCreator appears to take more time when multithreaded using the -nt flag

Member Posts: 6
edited August 2012

Hi all,

We're doing some analysis on quite big data and time is an issue, so I did a bit of scaling testing on a subset of the data before beginning. The results were unexpected.

When I run GATK RealignerTargetCreator with -nt 8 and give it 8 cores to work with, it actually takes about 2.5 times LONGER than if I just run it single-threaded. I don't mean that the user or CPU time goes up; the real (wall-clock) time goes up. In the -nt 8 case, the 8 cores were on a single node of our cluster with shared memory.

I tried testing on two different kinds of subsets of the data and both performed worse when multithreaded. I first tried restricting the input data by genomic region, i.e. just analysing chr22. When multithreading didn't seem to be working as expected in this test, I thought that maybe GATK was trying to parallelise over genomic regions, so I instead tried testing on a single lane of input data (a 9.6 GB BAM file spread over the whole genome). This also ran more slowly when multithreaded.
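For anyone wanting to repeat this kind of scaling test, it could be scripted roughly as below. This is only a sketch under assumptions: the jar name and all file paths are placeholders, and the helper names (`build_rtc_command`, `time_command`, `run_scaling_test`, `speedup_table`) are made up for illustration. It measures wall-clock time only, since that is the number in question here.

```python
import subprocess
import sys
import time


def build_rtc_command(reference, bam, out_intervals, threads, interval=None,
                      gatk_jar="GenomeAnalysisTK.jar"):
    """Build a RealignerTargetCreator command line.

    All paths and the jar name are placeholders (assumptions), not
    values taken from the thread.
    """
    cmd = ["java", "-jar", gatk_jar,
           "-T", "RealignerTargetCreator",
           "-R", reference,
           "-I", bam,
           "-o", out_intervals,
           "-nt", str(threads)]
    if interval is not None:
        # Restrict the run to one region, e.g. "chr22".
        cmd += ["-L", interval]
    return cmd


def time_command(cmd):
    """Run cmd to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def run_scaling_test(thread_counts, make_cmd):
    """Time make_cmd(n) for each thread count n; return {n: seconds}."""
    return {n: time_command(make_cmd(n)) for n in thread_counts}


def speedup_table(wall_times):
    """Return {n: speedup} relative to the single-threaded (n == 1) run.

    Values below 1.0 mean the multithreaded run was slower.
    """
    baseline = wall_times[1]
    return {n: baseline / t for n, t in wall_times.items()}
```

A call such as `run_scaling_test([1, 2, 6, 8], lambda n: build_rtc_command("ref.fasta", "lane1.bam", "out_%d.intervals" % n, n, interval="chr22"))` would then give one wall time per thread count. One caveat: repeated runs over the same BAM may be served partly from the operating system's file cache, so the order of runs can skew the comparison.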

So my question is: should I use -nt 8 in my real analysis even though it was a bad option in testing? Is it possible that multithreading will be bad for small amounts of data, but good in the large-data case? Or, does this indicate that I'm doing something wrong when trying to run RealignerTargetCreator multithreaded?

I really would like to use the fastest option for the real data as it will be very big. Any help much appreciated.

Thanks,
Clare

Member Posts: 6

I should have said, this is GATK 1.6-7.

This can happen when you are IO-limited: by specifying -nt 8 you are just causing the machine's IO system to thrash. We note that in our infrastructure, which is Isilon-backed and so very high throughput, -nt 8 is about the max we can use without seeing diminishing returns. Also, RealignerTargetCreator is an extremely inexpensive operation outside of the IO, so it's not easy to get a boost from -nt. Have you tried -nt 2 or another value? If you are really ambitious, you can copy the BAM locally or into a ramcache and run multi-threaded against that. It's much more efficient.
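Staging the BAM on local storage before a multithreaded run might look something like the sketch below. The function name `stage_bam_locally` and the scratch directory are hypothetical; on many Linux systems a RAM-backed location such as `/dev/shm` could play the role of the ramcache, but that is an assumption about the machine, as is having enough free space for the file.

```python
import os
import shutil


def stage_bam_locally(bam_path, scratch_dir):
    """Copy a BAM, plus its index if one is found, into scratch_dir.

    Returns the path of the local BAM copy. scratch_dir is a
    placeholder for node-local disk or a RAM-backed filesystem.
    """
    os.makedirs(scratch_dir, exist_ok=True)
    local_bam = os.path.join(scratch_dir, os.path.basename(bam_path))
    shutil.copy2(bam_path, local_bam)

    # The index is commonly named either sample.bam.bai or sample.bai.
    candidates = (bam_path + ".bai", os.path.splitext(bam_path)[0] + ".bai")
    for index_path in candidates:
        if os.path.exists(index_path):
            shutil.copy2(index_path,
                         os.path.join(scratch_dir, os.path.basename(index_path)))
    return local_bam
```

The returned path would then be passed to GATK via -I in place of the original one, and the scratch copy deleted once the run finishes.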

Also, we have an outline for a more efficient implementation of -nt that will manage IO and CPU parallelism separately, but that's months away.

Member Posts: 6

Thanks Mark!

The system we're on actually has very fast IO too; it's designed for life sciences. I did also originally try -nt 6 and it was the worst of the three options, slightly worse than -nt 8.

However, since making my post I've tried again with a much larger dataset (a different sample though unfortunately) and this time the multithreaded (8 core) run was a lot faster. Actually, the single-thread run hasn't finished yet, but the 8-core run has finished the ~400GB bam file between yesterday and today. So this is great, but I can't explain my earlier test results which I'm pretty certain of. Maybe it really is low coverage that makes multithreading inefficient?