Effect of Queue scatter+gather on HaplotypeCaller?

Gidday,

I have a question about choosing the number of scatter jobs when running the HaplotypeCaller in Queue.

Basically, is there a hard and fast rule about how small you can split up the job? From what I understand of HC, given it does local reconstruction of haplotypes anyway, splitting into more jobs shouldn't affect the results.

(My current dataset is mouse whole-genome data with 24 samples, and even scattered into 250 jobs, the longest jobs still took ~6d to run... I'd love to be able to speed it up if I have to re-run HC by splitting into more jobs. As long as it doesn't affect the results!)

Thanks!

Answers

  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hi @mfletcher,

    Are you running on all 24 samples simultaneously? You might see a more significant speedup by switching over to our new recommended workflow, in which you call variants with HC per-sample in GVCF mode then do a joint analysis on all the GVCFs. This takes away much of the exponential increase of time caused by multi-sample processing. Have a look at the doc here: http://www.broadinstitute.org/gatk/guide/article?id=3893

  • mfletcher, DE (Member)

    Hi @Geraldine_VdAuwera,

    Yes, I've been running HC on all 24 samples in one go.

    I'll try the new per-sample GVCF workflow, as you suggest (anything that will speed up HC runtime is good!). However, I'm still interested in whether running HC as a single, big job vs scatter+gather makes a difference in the results. I've got HC calls for one sample run both as a single job and with scatter+gather, and about 1% of the variants in each callset are unique to that callset.

    Would this be due to the scatter+gather? (i.e. if the HC is being run on a specific genomic interval, as specified by Queue scatter+gather, when it reconstructs the local haplotypes will HC look outside the genomic interval specified?) Or is there generally variation between HC runs?

    Thanks!

  • mfletcher, DE (Member)

    Thanks for the info @Geraldine_VdAuwera!

    The differences being potentially due to downsampling makes sense, but I haven't looked at them in any detail. Given my low coverage data I'd expect most of them to be of low confidence, so they'd be filtered out following VQSR anyway.
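
For anyone wanting to repeat the single-job vs scatter+gather comparison discussed in this thread, here is a minimal sketch in Python of how the "about 1% unique variants in each callset" figure could be computed. It is not the method used by the poster, which the thread does not describe. It assumes that two calls count as the same variant when CHROM, POS, REF and ALT all match, it does no normalization of indel representation, and the function names (`load_variant_keys`, `compare_callsets`) and the toy VCF records are hypothetical, made up for illustration only.

```python
import io

# Toy VCF content (hypothetical data, not taken from the thread).
SINGLE_JOB_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
    "chr1\t200\t.\tC\tT\t30\tPASS\t.\n"
    "chr2\t300\t.\tG\tA,C\t40\tPASS\t.\n"
)
SCATTERED_VCF = (
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
    "chr2\t300\t.\tG\tA,C\t40\tPASS\t.\n"
    "chr2\t400\t.\tT\tC\t20\tPASS\t.\n"
)


def load_variant_keys(handle):
    """Return the set of (chrom, pos, ref, alt) keys found in a VCF text stream.

    Assumption: a variant is identified by these four columns only.
    Multi-allelic records are split into one key per ALT allele.
    """
    keys = set()
    for line in handle:
        # Skip header lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):
            if alt == ".":  # no alternate allele at this site
                continue
            keys.add((chrom, pos, ref, alt))
    return keys


def compare_callsets(keys_a, keys_b):
    """Count shared and callset-specific variants, with per-callset fractions."""
    only_a = keys_a - keys_b
    only_b = keys_b - keys_a
    return {
        "shared": len(keys_a & keys_b),
        "only_a": len(only_a),
        "only_b": len(only_b),
        "frac_only_a": len(only_a) / len(keys_a) if keys_a else 0.0,
        "frac_only_b": len(only_b) / len(keys_b) if keys_b else 0.0,
    }


single_keys = load_variant_keys(io.StringIO(SINGLE_JOB_VCF))
scattered_keys = load_variant_keys(io.StringIO(SCATTERED_VCF))
summary = compare_callsets(single_keys, scattered_keys)
print(summary)
```

With real data, the two `io.StringIO` objects would be replaced by open file handles on the two VCFs. Listing the positions of the callset-specific variants (rather than only counting them) would also show whether they cluster near the boundaries of the scatter intervals, which is one way to check whether the scatter itself is involved.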
