Parallelism - Multithreading - Scatter Gather
- The concept of parallelism
- Parallel computing in practice (sort of)
- Levels of computing
- Multi-threading (+Spark)
- Scatter-gather (+Cromwell)
- Compare and combine
1. The concept of parallelism
Parallelism is a way to make a program finish faster by performing several operations in parallel, rather than sequentially (i.e. waiting for each operation to finish before starting the next one).
Imagine you need to cook rice for sixty-four people, but your rice cooker can only make enough rice for four people at a time. If you have to cook all the batches of rice sequentially, it's going to take all night. But if you have eight rice cookers that you can use in parallel, you can finish up to eight times faster.
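To put numbers on the rice example, here is a minimal sketch in Python. The function name is made up for illustration, and it assumes that every batch takes the same amount of time to cook.

```python
import math

def cooking_rounds(people, people_per_batch, cookers):
    """Number of back-to-back cooking rounds needed to feed everyone."""
    batches = math.ceil(people / people_per_batch)   # 64 people / 4 per batch = 16 batches
    return math.ceil(batches / cookers)              # batches are shared among the cookers

sequential = cooking_rounds(64, 4, cookers=1)   # one cooker: 16 rounds
parallel = cooking_rounds(64, 4, cookers=8)     # eight cookers: 2 rounds
speedup = sequential / parallel                 # 8.0, i.e. eight times faster
```

The speedup is "up to" eight times because it only reaches that value when the batches divide evenly among the cookers and nothing else slows things down.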
This is a very simple idea but it has a key requirement: you have to be able to break down the job into smaller tasks that can be done independently. It's easy enough to divide portions of rice because rice itself is a collection of discrete units. In contrast, let's look at a case where you can't make that kind of division: it takes one pregnant woman nine months to grow a baby, but you can't do it in one month by having nine women share the work.
The good news is that most GATK runs are more like rice than like babies. Because GATK tools are built to use the Map/Reduce method (see doc for details), most GATK runs essentially consist of a series of many small independent operations that can be parallelized.
A quick warning about tradeoffs
Parallelism is a great way to speed up processing on large amounts of data, but it has "overhead" costs. Without getting too technical at this point, let's just say that parallelized jobs need to be managed, you have to set aside memory for them, regulate file access, collect results and so on. So it's important to balance the costs against the benefits, and avoid dividing the overall work into too many small jobs.
Going back to the introductory example, you wouldn't want to use a million tiny rice cookers that each boil a single grain of rice. They would take way too much space on your countertop, and the time it would take to distribute each grain then collect it when it's cooked would negate any benefits from parallelizing in the first place.
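If you'd like to see that tradeoff in numbers, here is a toy cost model in Python. Everything in it is an assumption made for illustration: the function name, the arbitrary time units, and the idea that all jobs run at the same time while a single coordinator sets up and collects them one after another. It is not a measurement of any real program.

```python
import math

def total_time(n_items, n_jobs, time_per_item=1, overhead_per_job=50):
    """Toy model: jobs run simultaneously, but each job has a fixed
    management cost that is paid one job after another."""
    work = math.ceil(n_items / n_jobs) * time_per_item   # duration of the longest job
    overhead = n_jobs * overhead_per_job                 # serial setup and collection cost
    return work + overhead

# Compare a few ways of dividing one million items (made-up numbers).
for n_jobs in (1, 8, 1_000, 1_000_000):
    print(n_jobs, total_time(1_000_000, n_jobs))
```

With these made-up numbers, a thousand jobs beat eight jobs, but a million single-item jobs end up far slower than not parallelizing at all, because the management cost swamps the actual work.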
2. Parallel computing in practice (sort of)
OK, parallelism sounds great (despite the tradeoffs caveat), but how do we get from cooking rice to executing programs? What actually happens in the computer?
Consider that when you run a program like the GATK, you're just telling the computer to execute a set of instructions.
Let's say we have a text file and we want to count the number of lines in it. The set of instructions to do this can be as simple as:
open the file, count the number of lines in the file, tell us the number, close the file
Here, "tell us the number" can mean writing it to the console, or storing it somewhere for use later on.
Now let's say we want to know the number of words on each line. The set of instructions would be:
open the file, read the first line, count the number of words, tell us the number, read the second line, count the number of words, tell us the number, read the third line, count the number of words, tell us the number
And so on until we've read all the lines, and finally we can close the file. It's pretty straightforward, but if our file has a lot of lines, it will take a long time, and it will probably not use all the computing power we have available.
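As a rough sketch, the sequential set of instructions above could look like this in Python. The function name is hypothetical, and "tell us the number" is interpreted here as storing each count in a list.

```python
def count_words_per_line(path):
    """Read the file one line at a time and count the words on each line."""
    counts = []
    with open(path, encoding="utf-8") as handle:   # open the file
        for line in handle:                        # read the next line
            counts.append(len(line.split()))       # count the words, store the number
    return counts                                  # the `with` block has closed the file
```

Each line is handled only after the previous one is finished, which is exactly what makes this version sequential.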
So to parallelize this program and save time, we just cut up this set of instructions into separate subsets like this:
open the file, index the lines
read the first line, count the number of words, tell us the number
read the second line, count the number of words, tell us the number
read the third line, count the number of words, tell us the number
[repeat for all lines]
collect final results and close the file
The "read the Nth line" steps can be performed in parallel, because they are all independent operations.
You'll notice that we added a step, "index the lines". That's a little bit of preliminary work that allows us to perform the "read the Nth line" steps in parallel (or in any order we want) because it tells us how many lines there are and where to find each one within the file. It makes the whole process much more efficient. As you may know, the GATK requires index files for the main data files (reference, BAMs and VCFs); the reason is essentially to have that indexing step already done.
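To make the indexing step concrete, here is a minimal sketch in Python that records where each line starts, then uses that information to jump straight to any line. The function names are hypothetical, and this only illustrates the idea; it says nothing about the actual formats of the index files that the GATK uses.

```python
def index_lines(path):
    """Return the byte offset at which each line of the file starts."""
    offsets = []
    position = 0
    with open(path, "rb") as handle:
        for line in handle:
            offsets.append(position)   # this line starts here
            position += len(line)      # the next line starts right after it
    return offsets

def read_line(path, offsets, n):
    """Jump straight to line n (0-based) without reading the lines before it."""
    with open(path, "rb") as handle:
        handle.seek(offsets[n])
        return handle.readline().decode("utf-8")
```

The length of the list of offsets tells us how many lines there are, and each offset tells us where to find a line, so the lines can be read in any order.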
Anyway, that's the general principle: you transform your linear set of instructions into several subsets of instructions. There's usually one subset that has to be run first and one that has to be run last, but all the subsets in the middle can be run at the same time (in parallel) or in whatever order you want.
3. Levels of computing
Before we get into the different ways of parallelizing work, we need to talk about levels of computing, and by that I mean the computing units in terms of hardware: the core, the machine (or CPU) and the cluster or cloud.
- Core: the level below the machine.
On your laptop or desktop, the CPU (central processing unit, or processor) contains one or more cores. If you have a recent machine, your CPU probably has at least two cores, and is therefore called dual-core. If it has four, it's a quad-core, and so on. High-end consumer machines like the latest Mac Pro have up to twelve-core CPUs (which should be called dodeca-core if we borrow the Greek prefix) but the CPUs on some professional-grade machines can have tens or hundreds of cores.
- Machine: the middle of the scale.
For most of us in our day-to-day, the machine is the laptop or desktop computer. Really we should refer to the CPU specifically, since that's the relevant part that does the processing, but the most common usage is to say machine. Except if the machine is part of a cluster, in which case it's called a node.
- Cluster or cloud: the level above the machine. This is a high-performance computing structure made of a bunch of machines (usually called nodes) networked together to some extent. If you have access to a cluster, chances are it either belongs to your institution, or your company is renting time on it. A cluster can also be called a server farm or a load-sharing facility. Oh, and technically a cloud is not so much a networked cluster as a bunch of computers that you can rent, but pot-ay-to, pot-ah-to.
Parallelism can be applied at all three of these levels, but in different ways of course, and under different names. Parallelism takes the name of multi-threading at the core and machine levels, and scatter-gather at the cluster/cloud level.
4. Multi-threading
In computing, a thread of execution is a set of instructions that the program issues to the processor to get work done. In single-threading mode, a program only sends a single thread at a time to the processor and waits for it to be finished before sending another one. In multi-threading mode, the program may send several threads to the processor at the same time.
Not making sense? Let's go back to our earlier example, in which we wanted to count the number of words in each line of our text document. Hopefully it is clear that the first version of our little program (one long set of sequential instructions) is what you would run in single-threaded mode. And the second version (several subsets of instructions) is what you would run in multi-threaded mode, with each subset forming a separate thread. You would send out the first thread, which performs the preliminary work; then once it's done you would send the "middle" threads, which can be run in parallel; then finally once they're all done you would send out the final thread to clean up and collect final results.
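Here is a minimal sketch of that three-stage structure in Python, using the standard library's thread pool. The function names and the default of four threads are assumptions made for illustration. Note that in standard CPython, threads do not actually speed up CPU-bound work like this, so the sketch shows the shape of a multi-threaded program rather than a real performance gain, and it is not how the GATK or Spark implement multi-threading.

```python
from concurrent.futures import ThreadPoolExecutor

def count_words(line):
    """The work done by each 'middle' task: count the words on one line."""
    return len(line.split())

def count_words_threaded(path, max_threads=4):
    # First step: preliminary work, done on its own before anything else.
    with open(path, encoding="utf-8") as handle:
        lines = handle.readlines()

    # Middle steps: one independent task per line, handed to a pool of threads.
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        counts = list(pool.map(count_words, lines))   # map returns results in input order

    # Final step: runs only once every middle task is done.
    return counts
```

Because the middle tasks are independent, the pool is free to run them in any order; the results still come back in line order.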
If you're still having a hard time visualizing what the different threads are like, just imagine that you're doing cross-stitching. If you're a regular human, you're working with just one hand. You're pulling a needle and thread (a single thread!) through the canvas, making one stitch after another, one row after another. Now try to imagine an octopus doing cross-stitching. She can make several rows of stitches at the same time using a different needle and thread for each. Multi-threading in computers is surprisingly similar to that. No, really.
Multi-threading with Spark in GATK
Earlier versions of the GATK used some home-grown, hand-rolled code to produce multithreading. If you're an old-timer you may be familiar with engine arguments such as -nct. In GATK4, we instead take advantage of an open-source industry-standard software library called Spark, produced by the Apache Software Foundation.
Our advice about Spark: don't google it, that's just a great way to give yourself a headache. We have some dedicated documentation in the works specifically about all the stuff you might care about as far as GATK is concerned. Start with the highlights by reading this doc.
5. Scatter-gather
If you google it, you'll find that the term scatter-gather can refer to a lot of different things, including strategies to get the best price quotes from online vendors, methods to control memory allocation and… an indie-rock band. What all of those things have in common (except possibly the band) is that they involve breaking up a task into smaller, parallelized tasks (scattering) then collecting and integrating the results (gathering). That should sound really familiar to you by now, since it's the general principle of parallel computing.
So yes, "scatter-gather" is really just another way to say we're parallelizing things. OK, but how is it different from multithreading, and why do we need yet another name?
As you know by now, multithreading specifically refers to what happens internally when the one program you're running (in our case, the GATK) sends several sets of instructions to the processor to carry out the instructions that you originally gave it in a single command line. In contrast, the scatter-gather strategy involves separate copies of the program, each proceeding independently of the others, under the control of some kind of pipelining or "workflow execution" program. Our preferred pipelining solution that we support for scatter-gathering GATK analysis consists of a workflow engine called Cromwell, which runs scripts written in WDL (and soon, CWL). The two languages are quite different, but from a script in either one the engine is able to generate separate GATK jobs (each with its own command line) to achieve the instructions given in the script.
So based on the instructions you provide in the workflow script, the execution engine will create separate GATK commands that will each run that tool on a portion of the input data (= the scatter step). The results of each run will be stored in temporary files. Then once all the runs are done, the engine will collate all the results into the final output files, as if the tool had been run as a single command (= the gather step).
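The scatter and gather steps can be sketched in a few lines of Python. To be clear about the assumptions: this is not Cromwell, WDL or a GATK command. The "tool" is a stand-in one-liner that counts the words on each line of a file, and the function name, shard file names and working directory are all hypothetical. What the sketch does show is the pattern: separate copies of a program, each with its own command line, each working on a portion of the input, with temporary outputs collated at the end.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

# Stand-in "tool": prints the word count of each line of the file it is given.
TOOL = "import sys\nfor line in open(sys.argv[1]): print(len(line.split()))"

def scatter_gather(lines, n_shards, workdir):
    workdir = Path(workdir)
    shard_size = max(1, -(-len(lines) // n_shards))   # ceiling division, at least 1

    # Scatter: write each portion of the input to its own file and start a
    # separate copy of the tool on it, each with its own command line.
    jobs = []
    for start in range(0, len(lines), shard_size):
        shard_in = workdir / f"shard_{start}.txt"
        shard_out = workdir / f"shard_{start}.out"   # temporary result file
        shard_in.write_text("".join(lines[start:start + shard_size]))
        out_handle = open(shard_out, "w")
        command = [sys.executable, "-c", TOOL, str(shard_in)]
        jobs.append((subprocess.Popen(command, stdout=out_handle), out_handle, shard_out))

    # Gather: wait for every run to finish, then collate the temporary
    # outputs in their original order.
    results = []
    for process, out_handle, shard_out in jobs:
        return_code = process.wait()
        out_handle.close()
        if return_code != 0:
            raise RuntimeError(f"job failed: {shard_out}")
        results.extend(int(value) for value in shard_out.read_text().split())
    return results

with tempfile.TemporaryDirectory() as tmp:
    counts = scatter_gather(["a b\n", "c\n", "d e f\n"], n_shards=2, workdir=tmp)
```

The gathered result is the same as if the tool had been run once on the whole input, which is the point of the gather step.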
Note that Cromwell has additional capabilities, such as managing the use of multiple tools (not just GATK) in a dependency-aware manner to run complex pipelines, but that is outside the scope of this article. You can learn more about using Cromwell to run GATK workflows here.
6. Compare and combine
So you see that scatter-gather is a very different process from multi-threading because the parallelization happens outside of the program itself. The big advantage is that this opens up the upper level of computing: the cluster level. Remember, the GATK program is limited to dispatching threads to the processor of the machine on which it is run – it cannot by itself send threads to a different machine. But an execution engine like Cromwell can dispatch scattered GATK jobs to different machines in a computing cluster or on a cloud platform by interfacing with the appropriate job management software.
That being said, multithreading has the great advantage that all the cores on a machine have access to that machine's shared memory with very high bandwidth capacity. In contrast, the multiple machines on a network used for scatter-gather are fundamentally limited by network costs.
The good news is that you can combine scatter-gather and multithreading: use Cromwell to scatter GATK jobs to different nodes on your cluster or cloud platform, then use the GATK's internal multithreading capabilities to parallelize the jobs running on each node.
Going back to the rice-cooking example, it's as if instead of cooking the rice yourself, you hired a catering company to do it for you. The company assigns the work to several people, who each have their own cooking station with multiple rice cookers. Now you can feed a lot more people in the same amount of time! And you don't even have to clean the dishes.