
Memory-map reference files to increase parallel performance?

I am a systems engineer trying to optimize the efficiency of some of our processing.

GATK tools appear not to memory-map the reference genome (or other read-only reference files). I am wondering why this is the case — perhaps it is unnecessary for a reason I have missed? I suspect that the memory use of multiple data threads or multiple processes would be reduced by holding only a single copy of the reference files in memory. Memory-mapping achieves that for free: sharing becomes a problem for the OS to handle, not the application.
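To illustrate the idea (this is a minimal sketch, not something GATK currently does): mapping the reference read-only via `FileChannel.map` puts the data in the OS page cache, where every process that maps the same file shares one physical copy. The tiny temporary FASTA here stands in for a real reference.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MappedReference {
    // Map a file read-only; the OS page cache then holds a single physical
    // copy shared by every process that maps the same file.
    static MappedByteBuffer mapReadOnly(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            // Note: one MappedByteBuffer is limited to 2GB, so a multi-GB
            // reference would need several mapped windows in practice.
            return ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
        }
    }

    public static void main(String[] args) throws IOException {
        // Tiny stand-in for a reference FASTA (illustrative only).
        Path fasta = Files.createTempFile("ref", ".fasta");
        Files.writeString(fasta, ">chr1\nACGTACGT\n");
        MappedByteBuffer buf = mapReadOnly(fasta);
        // Reads come straight from shared page-cache pages, not the Java heap.
        System.out.println("mapped " + buf.capacity() + " bytes, header starts with '"
                + (char) buf.get(0) + "'");
        Files.delete(fasta);
    }
}
```

In `pmap` output such a region would show up as a named file-backed mapping rather than anonymous memory, and its resident pages would not count against the JVM heap.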

Ordinarily I would simply look at the source code to understand the implementation of the algorithm and experiment with changes to see whether performance can be lifted (or, specifically here, whether memory use can be reduced so we can run more processes in parallel).

An example I'm currently looking at is BaseRecalibrator. The reference files total around 3GB. The active memory section of each process is around 9GB.

pmap suggests that no memory-mapping of reference files is performed. It does show large anonymous mappings being used as file-like buffers; I presume this is some form of caching of the data to be written out at the end.

Version is: GenomeAnalysisTK-2.5-2-gf57256b


  • Mark_DePristo (Broad Institute, Member, admin)

    This is an interesting question. The GATK uses a CachingIndexedFastaReader that holds 1MB of reference sequence in memory, each managed using a ThreadLocal variable. Our experience and profiling are that loading the reference sequence is essentially free, relative to all other costs in the GATK.
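    The per-thread caching pattern described above can be sketched roughly as follows. All class and method names here are hypothetical, not GATK's actual API; the point is only that each thread keeps its own small window of recently fetched sequence, so threads never contend on a shared cache.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class PerThreadSequenceCache {
    private static final int WINDOW = 1 << 20; // ~1MB window per thread

    // Stand-in for a real reference lookup: materializes a window of bases
    // on demand (a real reader would seek into an indexed FASTA here).
    private static byte[] load(String contig, int windowStart) {
        byte[] bases = new byte[WINDOW];
        Arrays.fill(bases, (byte) 'A');
        return bases;
    }

    // Each thread gets its own (contig:windowStart -> bases) map, so no
    // synchronization is needed on lookups.
    private static final ThreadLocal<Map<String, byte[]>> cache =
            ThreadLocal.withInitial(HashMap::new);

    public static byte getBase(String contig, int pos) {
        int windowStart = (pos / WINDOW) * WINDOW;
        String key = contig + ":" + windowStart;
        byte[] window = cache.get()
                .computeIfAbsent(key, k -> load(contig, windowStart));
        return window[pos - windowStart];
    }
}
```

    Repeated lookups within the same 1MB window hit the thread's private map with no locking; the trade-off, as the question notes, is that every thread (and every process) holds its own copy of the cached sequence rather than sharing one.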

    Additionally, we have had some bad experiences with memory-mapped files, in particular (I believe) with growth of permanent generation space; this happened in Picard with memory-mapped indices.

    All of that said, we haven't explored memory-mapped FASTA sequences. It would be interesting to hear about your experiences, and if the approach proves better we could certainly incorporate it into our master engine branch.


  • RobM (Member)

    Thanks for your reply Mark. That is helpful background on the current state and what has previously been attempted. I will get in touch if I can allocate time to experiment in this space. Big files and memory-mapping may be an issue if 32-bit support continues to be desired (i.e., a fallback would then be needed for 32-bit clients).

    My first priority is to look at all the tools we are currently using and:

    • identify where we are not using the existing tool most efficiently (disk I/O / RAM / CPU balance)
    • identify tools which can theoretically be improved, and
    • identify the tools that might be improved for best overall gain to the entire chain.

    I suspect that while GATK is using more RAM than strictly necessary, other processing we are doing may be unavoidably RAM-intensive. If that is the case, the simplest way to improve my tool chain may be to increase the ratio of RAM to CPU cores; I currently average 8GB per physical core. There is no point adding more processing nodes if I can't max out the CPUs as it is: I need more physical RAM, or alternatively all of my tools need to use less RAM.
