Bug Bulletin: we have identified a bug that affects indexing when producing gzipped VCFs. This will be fixed in the upcoming 3.2 release; in the meantime you need to reindex gzipped VCFs using Tabix.

How to make a walker compatible with multi-threading

Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,235Administrator, GSA Member admin
edited June 2013 in Developer Zone

This document provides an overview of what are the steps required to make a walker multi-threadable using the -nct and the -nt arguments, which make use of the NanoSchedulable and TreeReducible interfaces, respectively.


NanoSchedulable / -nct

Providing -nct support requires that you certify that your walker's map() method is thread-safe -- eg., if any data structures are shared across map() calls, access to these must be properly synchronized. Once your map() method is thread-safe, you can implement the NanoSchedulable interface, an empty interface with no methods that just marks your walker as having a map() method that's safe to parallelize:

/**
 * Root parallelism interface.  Walkers that implement this
 * declare that their map function is thread-safe and so multiple
 * map calls can be run in parallel in the same JVM instance.
 */
public interface NanoSchedulable {
}

TreeReducible / -nt

Providing -nt support requires that both map() and reduce() be thread-safe, and you also need to implement the TreeReducible interface. Implementing TreeReducible requires you to write a treeReduce() method that tells the engine how to combine the results of multiple reduce() calls:

public interface TreeReducible<ReduceType> {
    /**
     * A composite, 'reduce of reduces' function.
     * @param lhs 'left-most' portion of data in the composite reduce.
     * @param rhs 'right-most' portion of data in the composite reduce.
     * @return The composite reduce type.
     */
    ReduceType treeReduce(ReduceType lhs, ReduceType rhs);
}

This method differs from reduce() in that while reduce() adds the result of a single map() call onto a running total, treeReduce() takes the aggregated results from multiple map/reduce tasks that have been run in parallel and combines them. So, lhs and rhs might each represent the final result from several hundred map/reduce calls.

Example treeReduce() implementation from the UnifiedGenotyper:

public UGStatistics treeReduce(UGStatistics lhs, UGStatistics rhs) {
    lhs.nBasesCallable += rhs.nBasesCallable;
    lhs.nBasesCalledConfidently += rhs.nBasesCalledConfidently;
    lhs.nBasesVisited += rhs.nBasesVisited;
    lhs.nCallsMade += rhs.nCallsMade;
    return lhs;
}
Post edited by Geraldine_VdAuwera on

Geraldine Van der Auwera, PhD

Sign In or Register to comment.