Memory-map reference files to increase parallel performance?
I am a systems engineer trying to improve the efficiency of some of our processing.
GATK tools do not appear to memory-map the reference genome (or other read-only reference files), and I am wondering why. Is it unnecessary for a reason I have missed? I suspect that the memory use of multiple data threads or multiple processes would be reduced by holding only a single copy of the reference files in memory. Memory-mapping achieves that for free: sharing the pages becomes the OS's problem, not the application's.
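To illustrate what I mean, here is a minimal sketch (in Python rather than GATK's Java, and using a tiny stand-in file instead of a real reference FASTA): a read-only mapping is backed by the file in the OS page cache, so every process that maps the same file shares one physical copy of those pages.

```python
import mmap
import os
import tempfile

# Create a tiny stand-in "reference" file (hypothetical; any
# read-only reference file works the same way).
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b">chr1\nACGT\n")
    path = f.name

with open(path, "rb") as fh:
    # Read-only shared mapping: the kernel keeps one copy of these
    # pages in the page cache, shared by every process that maps
    # the same file, rather than one heap copy per process.
    mm = mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ)
    print(mm[:5])  # b'>chr1' -- slices read straight from the mapping
    mm.close()

os.unlink(path)
```

In Java the equivalent would be `FileChannel.map(MapMode.READ_ONLY, ...)` returning a `MappedByteBuffer`; the point is the same, that duplicate in-heap copies of the reference are replaced by shared, file-backed pages.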
Ordinarily I would simply look at the source code to understand the implementation and experiment with changes to see if performance can be improved (or, specifically here, whether memory use can be reduced so we can run more processes in parallel).
An example I'm currently looking at is BaseRecalibrator. The reference files total around 3 GB, yet the resident (active) memory of each process is around 9 GB.
pmap suggests that no memory-mapping of the reference files is performed. It does show large anonymous mappings that look like file buffers; I presume these are some form of caching of the data to be written out at the end.
Version is: GenomeAnalysisTK-2.5-2-gf57256b