
Runtime Performance of UnifiedGenotyper for Increasing Data Size

Cindy Posts: 19 Member

Hello there,

I ran GATK's UnifiedGenotyper, version 2.7-4 (SNP calling), on multiple data sets of growing size (10 sets ranging from 8.5 GB to 88 GB), with 36 cpu threads and 8 GB of heap assigned. I noticed that the UnifiedGenotyper scaled much better than linearly: runtime increased only 4.6-fold for the largest data set, although it is 10 times larger than the smallest. Now I wonder how GATK achieves this. What part of the processing makes it scale so well, even better than linearly? Is it the SNP calling itself that is so performant, or the way the data is preprocessed/filtered?
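
For reference, my invocation was of roughly this form (file names are placeholders):

    java -Xmx8g -jar GenomeAnalysisTK.jar \
        -T UnifiedGenotyper \
        -R reference.fasta \
        -I input.bam \
        -nt 36 \
        -o output.vcf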

Best,

Cindy

Answers

  • Geraldine_VdAuwera Posts: 5,228 Administrator, GSA Member

    Hi Cindy,

    What you're seeing is due to how the UnifiedGenotyper processes data. Internally, the tool applies a map/reduce approach, where the mapping is done per position in the genome (i.e. the calculations are done independently for each slice of data per position). So, assuming that the size increase in your files is due to having more depth of coverage (more data per position), not a longer list of positions to process, the additional burden is only that each calculation takes a little longer. There are no additional separate calculations to do, so efficiency goes up. Does that make sense?
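
    To sketch the idea (this is illustrative only, not the actual GATK source; all names here are made up):

        import java.util.Arrays;
        import java.util.List;

        // Illustrative sketch -- not GATK source. Shows why per-locus map/reduce
        // scales with the number of positions rather than raw data volume.
        public class PerLocusSketch {

            // "map": one independent calculation per genomic position.
            // More depth only makes this single call a little slower.
            static double mapLocus(int[] pileupBases) {
                double likelihood = 0;
                for (int base : pileupBases) likelihood += base; // stand-in for the real math
                return likelihood;
            }

            public static void main(String[] args) {
                // Two positions with different depth: the deeper pileup costs more
                // per call, but there is still exactly one map call per position.
                List<int[]> pileups = Arrays.asList(
                        new int[]{1, 2, 3},
                        new int[]{1, 2, 3, 4, 5, 6});
                double total = 0;
                for (int[] p : pileups) total += mapLocus(p); // "reduce": fold the results
                System.out.println(total);
            }
        }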

    Geraldine Van der Auwera, PhD

  • Cindy Posts: 19 Member

    Hi Geraldine,

    Yes, I understand that better now. But actually, the files with increasing data size covered more positions in the genome. I split up the data so that the smallest data set includes all reads covering positions 0 - 25,000,000, the next data set includes reads covering positions 0 - 50,000,000, and so on, so that the largest file contains all reads covering positions 0 - 250,000,000 (of chromosome 1, actually).
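
    For reference, a split like that can be produced with GATK's -L interval argument, along these lines (file names are placeholders; the exact command may differ):

        java -jar GenomeAnalysisTK.jar \
            -T PrintReads \
            -R reference.fasta \
            -I full.bam \
            -L chr1:1-25000000 \
            -o subset.bam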

    Ok, so if I get it right: a file that is larger and covers more positions can be parallelized better than a file that is smaller and covers fewer regions? How is the data split up for the map phase, then? Are there fixed regions, or does it depend on the number of available threads? I still do not get why the performance is even better than linear (I would have expected linear in that case).

    Nevertheless, regarding my other question, this still does not explain why runtime for processing the same file increases when I assign more cpu threads. I would now expect the slices processed in the map phase to be smaller and handled in parallel, so performance should improve.

  • Geraldine_VdAuwera Posts: 5,228 Administrator, GSA Member

    Ah, interesting -- that wasn't my expectation, but ultimately the real answer is that it depends a lot on what your data looks like. One major factor is the depth of coverage you have; I'm not sure where the tradeoff shifts between the added burden of making the existing calculations harder vs. adding more calculations (which are trivial if they involve very little data). The shift won't happen in the same place with different data profiles (coverage, number of samples run together, etc.). And keep in mind that it will also be different for other tools that apply different traversal methods (e.g. where UG processes data per locus, or position, other tools process it per read, per variant, or per other basic data units).

    There are several layers to this, and each comes with a bit of processing overhead (so parallelizing bigger jobs should be more efficient than parallelizing smaller jobs, all other things being equal). Initially the dataset is split into a big pile of shards (basically sets of I/O instructions to load regions of the genome). This is done the same way whether multithreading is used or not. If you have several cpu threads, the subsystem called Microscheduler spawns several (#nt) instances of traversal engines working in parallel. Each gets to work on a shard; when it's done, it moves on to the next shard in the pile. Within a single cpu thread / traversal engine, the subsystem called Nanoscheduler spawns several instances of the walker (the tool you're using, e.g. UnifiedGenotyper) and directs each to process data from the pile loaded from the shard that the traversal engine is working on. And finally, within the walker itself, there is a map/reduce operation.
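
    Very roughly, the layering looks like this (a toy sketch with made-up structure, just to show the nesting; the real engine is much more involved):

        import java.util.List;
        import java.util.concurrent.ExecutorService;
        import java.util.concurrent.Executors;

        // Toy sketch of the layering described above -- not GATK source.
        public class SchedulingSketch {
            public static void main(String[] args) {
                // The dataset is first split into shards (I/O instructions for
                // regions of the genome), regardless of threading settings.
                List<String> shards = List.of("chr1:0-1000000", "chr1:1000000-2000000");

                int nt = 2; // -nt: number of cpu threads / traversal engines
                ExecutorService microscheduler = Executors.newFixedThreadPool(nt);

                for (String shard : shards) {
                    microscheduler.submit(() -> {
                        // One traversal engine works on one shard at a time; inside it,
                        // the Nanoscheduler would hand slices of the loaded data to
                        // several walker instances, each doing its own map/reduce.
                        System.out.println(Thread.currentThread().getName()
                                + " processing " + shard);
                    });
                }
                microscheduler.shutdown(); // engines finish remaining shards, then exit
            }
        }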

    Hopefully now you can see why the interactions/tradeoffs in performance are not trivial to predict...?

    Geraldine Van der Auwera, PhD

  • Cindy Posts: 19 Member

    Hm, wow ok. I'll have to think about that...

    Another thing I just wanted to mention: when executing GATK with multiple threads, I only ever see a maximum workload of 3 to 4 CPUs being used. Could that have to do with the Java VM that GATK is running on? I have now even increased the heap size to 16 GB, but nothing changed in runtime performance. When using multiple threads, I would expect to see that reflected in the system's workload overview.

  • Cindy Posts: 19 Member

    And I also forgot to mention: I have reads with a length of 250 bp and an average coverage of 66x...

  • Cindy Posts: 19 Member

    Hi pdexheimer,

    I'll try that out, thanks for the hint. Currently I access the data from an SSD drive, so I would expect that to be very performant. Nevertheless, I'll talk to the administration colleagues who set up the system.

    Also, this would explain why GATK gets slower with more threads: more of them are then waiting for data... wouldn't it?

  • pdexheimer Posts: 297 Member ✭✭✭

    I think so. SSDs are very fast compared to spinning disks, but they're still slow compared to CPUs or RAM - so it's certainly not going to be a cure-all.

    The other thing to try is to reduce the BAMs. I actually just did that on one of my datasets - the HC time (split 50 ways with Queue) dropped from ~4-5 days to ~12 hours.
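
    If it helps, the ReduceReads invocation is along these lines (file names are placeholders; check the tool docs for the exact arguments):

        java -jar GenomeAnalysisTK.jar \
            -T ReduceReads \
            -R reference.fasta \
            -I input.bam \
            -o input.reduced.bam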

  • Geraldine_VdAuweraGeraldine_VdAuwera Posts: 5,228Administrator, GSA Member admin

    Reducing the BAMs can lead to pretty big wins. We never run HC on unreduced BAMs anymore.

    Geraldine Van der Auwera, PhD

  • Cindy Posts: 19 Member

    Ok, I tested with smaller thread counts (1, 2, 3, 4) and it turns out that your guess was right, @pdexheimer. GATK performance increased with growing thread counts up to 4 threads.

    So, to conclude the answers to my questions: 1.) Running GATK with multiple threads is I/O-bound here, with the threshold at about 3 to 4 threads. When increasing the thread count above this threshold, runtime performance decreases because more threads have to be served with data by the I/O processes.

    2.) Running GATK with increasing file sizes -- same coverage, more positions covered -- performs so well (seemingly better than linear) because of the several layers of parallelization: the Micro- and Nanoscheduler, and then the map/reduce within the walker.

    Thanks all for the help, and have a happy Halloween! :)

    Best regards,

    Cindy
