Test-drive the GATK tools and Best Practices pipelines on Terra
Check out this blog post to learn how you can get started with GATK and try out the pipelines in preconfigured workspaces (with a user-friendly interface!) without having to install anything.
What does each data thread stand for in HaplotypeCaller
I'm using multi-threading for HaplotypeCaller by setting the nct option.
But actually, I found that the speedup it gains isn't in proportional to the increase of the number of data threads.
I tried nct as 8,12,16,24 on my machine, and gained a speedup of 4.1x, 4.2x, 4.2x, 4.2x. Seems that there is an upper bound of performance gains when enabling mult-threading for HaplotypeCaller.
I'm wondering what each data thread stands for in HaplotypeCaller. We need to use PairHMM to calculate the likelihood array in each active region. Are we distributing each read-haplotype pair in the region as one data thread and map it to a CPU thread? Or are we distributing the calculation in each region as one data thread?