parallel running in GATK

In HaplotypeCaller (HC), CombineGVCFs, and GenotypeGVCFs, besides running each chromosome separately in parallel, can I also break a chromosome into smaller sections and run each section in parallel?

Thanks!

Answers

  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    That should be fine for GenotypeGVCFs. Not sure about HC and CombineGVCFs as you may run into edge cases at the section starts and ends.

  • mcg255 (San Francisco; Member)

    Hi, just posted a similar question, but specific to GenotypeGVCFs. Apologies, it's taken a lot of searching to get here! :smile:

    So this is an affirmative answer that we can use sub-chromosome splits, say 1Mb, for scatter-gather of GenotypeGVCFs?

    If so, two more follow-ups :smile:

    1. If we had a large multi-sample, whole-genome combined gVCF that we wanted to run this strategy on, would we...

    A. Split that gVCF into many, many small, interval-only gVCFs and invoke GATK separately on all of them? Like the following pseudocode recipe?

    # make small, interval gVCFs from whole genome gVCFs
    
    SelectVariants -L 1:1-1,000,000 -V CohortWholeGenome.gvcf -o Split1.gvcf
    SelectVariants -L 1:1,000,001-2,000,000 -V CohortWholeGenome.gvcf -o Split2.gvcf
    ... 
    SelectVariants -L 22:... -V CohortWholeGenome.gvcf -o SplitN.gvcf
    
    # invoke GenotypeGVCFs on each interval gVCF
    
    GenotypeGVCFs -L 1:1-1,000,000 -V Split1.gvcf -o Split1.vcf
    GenotypeGVCFs -L 1:1,000,001-2,000,000 -V Split2.gvcf -o Split2.vcf
    ...
    GenotypeGVCFs -L 22:... -V SplitN.gvcf -o SplitN.vcf
    

    B. Run GenotypeGVCFs once for each interval, using just the large, combined whole-genome gVCF, passing the interval of interest with -L?

    # invoke GenotypeGVCFs on each interval, using whole genome gVCF
    
    GenotypeGVCFs -L 1:1-1,000,000 -V CohortWholeGenome.gvcf -o Split1.vcf
    GenotypeGVCFs -L 1:1,000,001-2,000,000 -V CohortWholeGenome.gvcf -o Split2.vcf
    ...
    GenotypeGVCFs -L 22:... -V CohortWholeGenome.gvcf -o SplitN.vcf
    
    2. Are there any caveats to our choice of intervals?
      In a similar question about using small intervals for HaplotypeCaller scatter-gather, rpoplin mentioned that the intervals should not overlap. Does anything like that apply here?
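
For anyone wanting to script option B, here is a minimal Python sketch that tiles each contig into non-overlapping windows and prints one GenotypeGVCFs command per window. It only builds command strings in the same pseudocode form used above (`-L`, `-V`, `-o`); it does not run GATK. The function names, the 1 Mb default window, and the contig lengths in the example are all assumptions made for illustration, so substitute the real lengths from your reference.

```python
def tile_intervals(contig, length, step=1_000_000):
    """Return non-overlapping, 1-based inclusive intervals that cover a contig."""
    intervals = []
    start = 1
    while start <= length:
        # Clip the last window to the end of the contig.
        end = min(start + step - 1, length)
        intervals.append(f"{contig}:{start}-{end}")
        # The next window starts one base after this one ends, so nothing overlaps.
        start = end + 1
    return intervals


def genotype_commands(contig_lengths, gvcf, step=1_000_000):
    """Build one pseudocode GenotypeGVCFs command per interval (option B)."""
    commands = []
    for contig, length in contig_lengths.items():
        for interval in tile_intervals(contig, length, step):
            n = len(commands) + 1  # running index for the output file name
            commands.append(
                f"GenotypeGVCFs -L {interval} -V {gvcf} -o Split{n}.vcf"
            )
    return commands


if __name__ == "__main__":
    # Hypothetical contig lengths, NOT real chromosome sizes.
    toy_lengths = {"1": 2_500_000, "22": 1_200_000}
    for cmd in genotype_commands(toy_lengths, "CohortWholeGenome.gvcf"):
        print(cmd)
```

Each window ends exactly one base before the next begins, which keeps the intervals non-overlapping as rpoplin suggested for HaplotypeCaller. Whether that is sufficient for GenotypeGVCFs is the open question here, not something this sketch settles.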
  • Geraldine_VdAuwera (Cambridge, MA; Member, Administrator, Broadie)

    Hah, this thread was old! We've learned a lot since then. Let's continue the discussion in your other thread where I just responded.