We've moved!
This site is now read-only. You can find our new documentation site and support forum for posting questions here.
Be sure to read our welcome blog!

What are the smallest units I can break whole human genomes into, for scatter-gather?

mcg255mcg255 San FranciscoMember

Hi, and thank you so much for the wonderful tools and support! :smile:

For our current project, we'd like to run 2000+ whole genomes from FASTQ to VCF using GATK best practices.

I'd like to optimize the runtime, in particular of GenotypeGVCFs.

Previously, we have used -nt with GenotypeGVCFs for parallelism.
With GATK 3.7, using threads (-nt) with GenotypeGVCFs has always crashed due to what appears to be lack of thread safety.
From what I've read on this forum, this is a known issue, and users are urged to use a scatter-gather approach.

If I understand correctly, scatter-gather for GenotypeGVCFs would entail splitting the combined, multi-sample, Whole-Genome gVCFs into, say, combined, multi-sample, Per-Chromosome gVCFs. And then executing GenotypeGVCFs on each multi-sample, chromosomal gVCF, on a cluster, in single-threaded mode?

Please correct me if this understanding is not accurate. I have read the Parallelism and Scatter-Gather pages on the forums.

If my understanding of scatter-gather is accurate, then it seems that to get the best performance when scaling out, you would want to subset the multi-sample, whole-genome gVCFs into as-small-as-reasonably-possible gVCFs, so that you could run hundreds or thousands of them in parallel on the cluster.

E.g. Partition the multi-sample, whole-genome gVCFs into, say, 10kb regions over each chromosome, yielding ~300,000 multi-sample gVCFs. Then you submit those to your batch/queue system and run each as its own invocation of GenotypeGVCFs with -nt 1.

However, the recommendations on this forum, for whole-genome data, tend to be to split at the chromosome level.
This would limit your parallelism to 22 if your were running the human autosomes.
And if there are a large number of high-coverage samples, and you're forced into single thread mode, this will not be efficient.

So, what is the smallest unit one can break-up whole genome data for GenotypeGVCFs?

Best Answers


  • mcg255mcg255 San FranciscoMember
    edited May 2017

    Hi @Geraldine_VdAuwera. Thank you so much for your reply.
    Yes, eagerly awaiting GATK4 on Spark.

    In terms of subsetting a large whole-genome gVCF, sounds like breaking it up into smaller files is a no-go.

    What about keeping the large whole-genome gVCF as-is, and then invoking GenotypeGVCFs with the -L flag for each interval?

    In this case, if there were an event that spanned an interval breakpoint, certainly the data for that event would be available to the GATK process. But unsure whether GenotypeGVCFs would try to make use of it, or instead only call events that fit exactly in the interval.

  • mcg255mcg255 San FranciscoMember

    Awesome! Thank you for such a thorough answer, @Geraldine_VdAuwera

  • willpitcherswillpitchers Michigan State UMember

    Thank you for this post – looking forward to version 4!

    I'd be grateful for your thoughts on the use of scatter-gather approaches to parallelism with non-model-organism data. We are working with a comparatively small number of samples (139 individuals) in an electric fish system, and running GATK on a shared HPC cluster.

    In order to balance speed with being good citizens in our resource use, I've been keeping sequencing individual/library files separate through trimming, alignment, deduplication and base recalibration, then merging ..bam files by individual. Once at that stage I'm then calling variants with HaplotypeCaller and GenotypeGVCFs, but pleasantly-parallel-scatter-gathering ~5000 jobs, each dealing with an ~20KB region of genome (our reference is in super-scaffolds, but not resolved to chromosome-level) assigned with the -L flag.

    Before we can deploy version 4, with its GenomicsDB goodness, do you have any recommendations for how we can minimise the downsides of this approach...? We're concerned about missing variants that may turn out to be important, but are not sure how to know where is/is not a 'good' spot to break the sequence up...

    Many thanks!

    Issue · Github
    by shlee

    Issue Number
    Last Updated
    Closed By
  • shleeshlee CambridgeMember, Broadie ✭✭✭✭✭

    Hi @willpitchers,

    I've asked someone on the team to follow up.

  • willpitcherswillpitchers Michigan State UMember

    Thanks shlee!

  • Geraldine_VdAuweraGeraldine_VdAuwera Cambridge, MAMember, Administrator, Broadie admin

    @willpitchers Do you mean the ~20kb segments are the super-scaffolds, and that's what you're scattering over? If so that seems very safe; assuming the performance works out well I don't think you need to worry. If you mean you're scattering within the super-scaffolds, then I go back to my earlier response -- it's all about finding natural boundaries where you know the sequence is interrupted (eg by a stretch of N bases), or crazy regions with a ton of short repeats where calls will be bad anyway.

  • willpitcherswillpitchers Michigan State UMember

    Hey @Geraldine_VdAuwera, what I had been doing was splitting up those scaffolds that were longer than 20kb so that each parallel job could run for approximately the same length of time...

    Thank you for clarifying things for me. We are going to have to look into finding natural boundaries, but in the meantime I'm going to alter my scripts to do variant calling on ≥1 scaffold at a time.

Sign In or Register to comment.