Effect of Queue scatter+gather on HaplotypeCaller?


I have a question about choosing the number of scatter jobs when running the HaplotypeCaller in Queue.

Basically, is there a hard and fast rule about how small you can split up the job? From what I understand of HC, given it does local reconstruction of haplotypes anyway, splitting into more jobs shouldn't affect the results.

(My current dataset is mouse whole-genome data with 24 samples, and even scattered into 250 jobs, the longest jobs still took ~6d to run... I'd love to be able to speed it up if I have to re-run HC by splitting into more jobs. As long as it doesn't affect the results!)


Best Answer


  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie admin)

    Hi @mfletcher,

    Are you running on all 24 samples simultaneously? You might see a more significant speedup by switching over to our new recommended workflow, in which you call variants with HC per sample in GVCF mode, then do a joint analysis on all the GVCFs. This takes away much of the exponential increase in runtime caused by multi-sample processing. Have a look at the doc here: http://www.broadinstitute.org/gatk/guide/article?id=3893
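
A rough sketch of the two-step workflow described above, written as a Python helper that only builds the command lines (it does not run them). The jar path, file names, and the `.g.vcf` naming are assumptions for illustration; the arguments follow GATK 3.x conventions as best recalled, and some GATK 3 releases need extra indexing arguments for GVCF output, so check the linked article for the exact invocation for your version.

```python
import os
from typing import List

# Assumed location of the GATK 3.x jar; adjust for your installation.
GATK_JAR = "GenomeAnalysisTK.jar"


def per_sample_gvcf_cmd(reference: str, bam: str, out_gvcf: str) -> List[str]:
    """Command for step 1: run HaplotypeCaller on one sample in GVCF mode."""
    return [
        "java", "-jar", GATK_JAR,
        "-T", "HaplotypeCaller",
        "-R", reference,
        "-I", bam,
        "--emitRefConfidence", "GVCF",
        "-o", out_gvcf,
    ]


def joint_genotyping_cmd(reference: str, gvcfs: List[str], out_vcf: str) -> List[str]:
    """Command for step 2: joint analysis over all per-sample GVCFs."""
    cmd = ["java", "-jar", GATK_JAR, "-T", "GenotypeGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]
    cmd += ["-o", out_vcf]
    return cmd


def build_workflow(reference: str, bams: List[str], out_vcf: str) -> List[List[str]]:
    """Return one HaplotypeCaller command per BAM, then one joint-genotyping command."""
    # Hypothetical naming scheme: sample.bam -> sample.g.vcf
    gvcfs = [os.path.splitext(bam)[0] + ".g.vcf" for bam in bams]
    cmds = [per_sample_gvcf_cmd(reference, b, g) for b, g in zip(bams, gvcfs)]
    cmds.append(joint_genotyping_cmd(reference, gvcfs, out_vcf))
    return cmds
```

Because each per-sample command is independent, the 24 HaplotypeCaller runs can be submitted as separate jobs, and each of those can still be scattered over intervals.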

  • mfletcher, DE (Member)

    Hi @Geraldine_VdAuwera,

    Yes, I've been running HC on all 24 samples in one go.

    I'll try the new per-sample GVCF workflow, as you suggest (anything that will speed up HC runtime is good!). However, I'm still interested in whether running HC as a single, big job vs scatter+gather makes a difference in the results. I've got HC calls for one sample run both as a single job and with scatter+gather, and about 1% of the variants in each callset are unique to that callset.

    Would this be due to the scatter+gather? (i.e. if the HC is being run on a specific genomic interval, as specified by Queue scatter+gather, when it reconstructs the local haplotypes will HC look outside the genomic interval specified?) Or is there generally variation between HC runs?
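
One way to quantify the kind of difference described above is to compare the two callsets site by site. The following is a minimal sketch, assuming plain-text (uncompressed) VCF lines and treating a variant as the tuple (chromosome, position, REF, ALT); the function names are made up for illustration, and it ignores genotypes, filters, and differing representations of the same indel.

```python
from typing import Dict, Iterable, Set, Tuple

# A variant is identified by chromosome, position, REF allele, and one ALT allele.
Site = Tuple[str, int, str, str]


def variant_sites(vcf_lines: Iterable[str]) -> Set[Site]:
    """Collect variant keys from VCF text lines, skipping headers and blanks."""
    sites: Set[Site] = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        # Split multi-allelic records so each ALT allele is compared separately.
        for alt in alts.split(","):
            sites.add((chrom, pos, ref, alt))
    return sites


def compare_callsets(a_lines: Iterable[str], b_lines: Iterable[str]) -> Dict[str, int]:
    """Count variants shared by both callsets and those unique to each."""
    a, b = variant_sites(a_lines), variant_sites(b_lines)
    return {"shared": len(a & b), "only_a": len(a - b), "only_b": len(b - a)}
```

For example, with files opened as `open("single_job.vcf")` and `open("scattered.vcf")` (hypothetical names), the `only_a` and `only_b` counts give the callset-specific variants. Checking whether those sites cluster near the scatter interval boundaries would help separate an interval-edge effect from run-to-run variation.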


  • mfletcher, DE (Member)

    Thanks for the info @Geraldine_VdAuwera!

    It makes sense that the differences could be due to downsampling, but I haven't looked at them in any detail. Given my low-coverage data, I'd expect most of them to be low-confidence calls, so they'd be filtered out by VQSR anyway.

