How to Initiate Scatter Gather on One Machine

benine · earth · Member
edited March 2015 in Ask the GATK team

The HaplotypeCaller documentation recommends using Queue rather than -nct to parallelize HaplotypeCaller, so I've been attempting to do that. However, I can't get Queue to do any kind of parallel processing. I'm working on a machine with 8 cores, and Queue runs consistently, but only single-threaded. I don't have access to a distributed computing environment, but I don't see why Queue couldn't parallelize on one machine with multiple cores, and I've found no documentation indicating that parallelism in Queue is only available in distributed computing environments.

What I've done is make a minimal modification to the ExampleUnifiedGenotyper.scala script so that it runs HaplotypeCaller. I ran it a couple of times with just the reference file and mapping file as input, and a couple of times with an intervals file listing each chromosome as a separate interval. Every time, it ran single-threaded.

I've found several articles and comments saying that Queue should be used to Scatter/Gather a job, some of which even explain how Scatter/Gather works. I therefore assumed that this is simply what Queue does, and that it would use a multi-core system to its full advantage. That has not been my experience, and I don't see anything in the documentation that explains why. If someone could explain either what I'm doing wrong when running the command, or why Queue can't be used to parallelize on one machine, I would be very grateful.


Best Answer


  • benine · earth · Member

    OK, thank you. Do you know whether the issues with running HaplotypeCaller with -nct have been addressed in version 3.3-0?

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    No, they have not been addressed, and probably will not be. We are moving away from multithreading in favor of scatter-gather, which is a much more stable method of parallelism and less prone to race conditions.

  • benine · earth · Member

    Are there plans to incorporate some kind of job scheduler into GATK for those of us who do not have access to a distributed computing environment, but who want to use HaplotypeCaller without waiting several weeks for one job and without setting up our own job scheduler on a single machine?

  • Geraldine_VdAuwera · Cambridge, MA · Member, Administrator, Broadie admin

    Not as such, no, because our priority is to enable operational scales far greater than single machine setups. But we are looking into making GATK more cloud-friendly so that people will be able to take advantage of commercial cloud computing platforms.
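
For readers in the same position as the original poster, the scatter-gather idea discussed in this thread can be approximated by hand on a single multi-core machine. The sketch below is not Queue and not an official GATK workflow. It assumes a GATK 3.x jar named GenomeAnalysisTK.jar, placeholder file names (ref.fasta, input.bam), one interval per chromosome, and the CatVariants tool for the gather step. All of these are assumptions to check against the documentation for your GATK version. The helper names (scatter_command, run_scatter, gather_command) are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Placeholder paths (assumptions, not taken from the thread).
JAR = "GenomeAnalysisTK.jar"
REF = "ref.fasta"
BAM = "input.bam"


def scatter_command(interval, ref=REF, bam=BAM, jar=JAR):
    """Build one HaplotypeCaller command restricted to a single interval."""
    out = f"calls.{interval}.vcf"
    cmd = ["java", "-jar", jar, "-T", "HaplotypeCaller",
           "-R", ref, "-I", bam, "-L", interval, "-o", out]
    return cmd, out


def run_scatter(intervals, workers=8, runner=subprocess.check_call):
    """Run one job per interval, at most `workers` at a time.

    Threads are enough here because each job is an external process;
    the Python side only waits. `runner` can be swapped for a dry run.
    """
    jobs = [scatter_command(i) for i in intervals]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # list() waits for every job and re-raises the first failure.
        list(pool.map(runner, [cmd for cmd, _ in jobs]))
    return [out for _, out in jobs]


def gather_command(vcfs, ref=REF, jar=JAR, out="calls.vcf"):
    """Build the merge command (assumes GATK 3.x CatVariants).

    The per-interval VCFs must be passed in reference order.
    """
    cmd = ["java", "-cp", jar, "org.broadinstitute.gatk.tools.CatVariants",
           "-R", ref, "-out", out, "-assumeSorted"]
    for vcf in vcfs:
        cmd += ["-V", vcf]
    return cmd


# Example (would launch real java processes):
#   vcfs = run_scatter(["chr1", "chr2", "chr3"], workers=3)
#   subprocess.check_call(gather_command(vcfs))
```

Each scattered job is a separate JVM, so the number of workers is limited by available memory as well as by core count. Passing runner=print to run_scatter shows the commands without executing them.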
