parallel running in GATK

In HaplotypeCaller (HC), CombineGVCFs, and GenotypeGVCFs, besides running each chromosome separately in parallel, can I also break a chromosome into smaller sections and run each section in parallel?




  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    That should be fine for GenotypeGVCFs. I'm not sure about HC and CombineGVCFs, as you may run into edge cases at the starts and ends of the sections.

  • mcg255, San Francisco (Member)

    Hi, just posted a similar question, but specific to GenotypeGVCFs. Apologies, it's taken a lot of searching to get here! :smile:

    So this is an affirmative answer that we can use sub-chromosome splits, say 1Mb, for scatter-gather of GenotypeGVCFs?

    If so, two more follow-ups :smile:

    1. If we had a large multi-sample, whole-genome combined gVCF that we wanted to run this strategy on, would we...

    A. Split that gVCF into many, many small, interval-only gVCFs and invoke GATK separately on all of them? Like the following pseudocode recipe?

    # make small, interval gVCFs from whole genome gVCFs
    SelectVariants -L 1:1-1,000,000 -V CohortWholeGenome.gvcf -o Split1.gvcf
    SelectVariants -L 1:1,000,001-2,000,000 -V CohortWholeGenome.gvcf -o Split2.gvcf
    SelectVariants -L 22:... -V CohortWholeGenome.gvcf -o SplitN.gvcf
    # invoke GenotypeGVCFs on each interval gVCF
    GenotypeGVCFs -L 1:1-1,000,000 -V Split1.gvcf -o Split1.vcf
    GenotypeGVCFs -L 1:1,000,001-2,000,000 -V Split2.gvcf -o Split2.vcf
    GenotypeGVCFs -L 22:... -V SplitN.gvcf -o SplitN.vcf

    B. Run GenotypeGVCFs once for each interval, using just the large, combined whole-genome gVCF, passing the interval of interest with -L?

    # invoke GenotypeGVCFs on each interval, using whole genome gVCF
    GenotypeGVCFs -L 1:1-1,000,000 -V CohortWholeGenome.gvcf -o Split1.vcf
    GenotypeGVCFs -L 1:1,000,001-2,000,000 -V CohortWholeGenome.gvcf -o Split2.vcf
    GenotypeGVCFs -L 22:... -V CohortWholeGenome.gvcf -o SplitN.vcf
    2. Are there any caveats to our choice of intervals?
      In a similar question about using small intervals for HaplotypeCaller scatter-gather, rpoplin mentioned that the intervals should not overlap. Does anything like that apply here?
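
To make the non-overlap point concrete, here is a minimal sketch of how contiguous, non-overlapping windows could be generated for use with -L. It assumes 1-based, inclusive coordinates as in the examples above. The helper name `split_contig` is hypothetical (it is not part of GATK), and the contig length in the usage lines is illustrative only, not a real chromosome size.

```python
def split_contig(contig, length, window=1_000_000):
    """Return contiguous, non-overlapping 1-based inclusive intervals for one contig.

    Hypothetical helper for illustration; not part of GATK.
    """
    if length < 1 or window < 1:
        raise ValueError("length and window must be positive")
    intervals = []
    start = 1
    while start <= length:
        # Clamp the last window to the end of the contig.
        end = min(start + window - 1, length)
        intervals.append(f"{contig}:{start}-{end}")
        # The next window starts one base after this one ends, so no base
        # is covered twice and none is skipped.
        start = end + 1
    return intervals


# Illustrative contig length only, not a real chromosome size.
for interval in split_contig("1", 2_500_000):
    print(interval)
```

Each printed string could then be passed as the -L value of one scatter job. Because every window starts exactly one base after the previous one ends, the per-interval outputs should concatenate without duplicated or missing positions, under the assumption that the tool itself emits only records starting inside its interval.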
  • Geraldine_VdAuwera, Cambridge, MA (Member, Administrator, Broadie)

    Hah, this thread was old! We've learned a lot since then. Let's continue the discussion in your other thread where I just responded.

