
GATK Queue and the Data Processing Pipeline

pwhite Posts: 3 Member

What is the best way to get Queue to optimize utilization of a given number of cores in an SGE cluster? The DataProcessingPipeline.scala has a hidden parameter "scatter_gather" which sets the nContigs variable. Is it safe to use this option? For example, if you had 100 cores available in your cluster could you set the option to 100? Is there any advantage to setting it higher?

Without setting it, Queue appears to set the nContigs value based on the number of chromosomes in the BAM input: 25 for a whole-genome BAM, 1 for your example Chr20 data, and 0 for an unaligned BAM. So if starting from unaligned data, does it run on a single core?
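
For reference, here is a minimal sketch of how that hidden argument tends to be wired up in a Queue QScript. The names follow DataProcessingPipeline.scala, but the annotations' exact packages and the contig-counting helper are assumptions that may differ across GATK versions:

    import org.broadinstitute.sting.queue.QScript

    class DataProcessingSketch extends QScript {
      @Input(doc = "input BAM file", shortName = "i")
      var input: File = _

      @Hidden
      @Argument(doc = "how many ways to scatter/gather", fullName = "scatter_gather", shortName = "sg", required = false)
      var nContigs: Int = -1

      def script() {
        // If -sg was not given, fall back to the number of contigs in the BAM
        // header: ~25 for a whole-genome BAM, 1 for single-chromosome (Chr20)
        // data, and 0 for an unaligned BAM with no sequence dictionary.
        if (nContigs < 0)
          nContigs = countContigsInBamHeader(input)
        // ...each walker added below would receive nContigs as its scatterCount...
      }

      // Hypothetical stand-in for the real lookup, which reads the BAM's
      // sequence dictionary via the samtools/Picard API.
      def countContigsInBamHeader(bam: File): Int = 0
    }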


Best Answer

  • Geraldine_VdAuwera Posts: 5,235 admin

    As I recall, Queue will not "over-scatter" your jobs, i.e. it won't forcibly split the data further than makes sense for the analysis. The rest is up to the script you use; the DPP does those things correctly, if that's what you're using. As for changing the scatter count, it should be OK, but I would recommend reading the pipeline script to check where that parameter is used and verify that it matches your expectations.
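
    As a hedged illustration of what to look for when reading the script: each GATK function in the pipeline typically takes its scatterCount from the shared nContigs value, roughly like this (the exact class and field names depend on your GATK version):

        val realigner = new IndelRealigner
        realigner.input_file :+= inputBam     // BAM from the previous step
        realigner.targetIntervals = targets   // output of RealignerTargetCreator
        realigner.scatterCount = nContigs     // Queue splits this job up to nContigs ways
        add(realigner)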

    Geraldine Van der Auwera, PhD

Answers

  • Geraldine_VdAuwera Posts: 5,235 admin

    The merits of increasing the scatter-gather count depend a lot on what kind of jobs you're sending out, and a little on how your cluster is set up. Some jobs can't be scattered beyond a certain count because the data simply can't be divvied up any further. For example, if the smallest unit of data your job can operate on is the chromosome, there is no point in increasing the scatter-gather count beyond the number of chromosomes. If the smallest unit is a gene interval and you have 20,000 intervals, you can technically set the scatter count to 20,000.

    The advantage is that the scattered jobs will have very short runtimes, which is good if you have a fast-moving "short jobs" queue in your cluster setup. But having so many jobs adds processing overhead, so at some point increasing the scatter count will stop yielding performance gains and start costing you.

    Finally, if you have unaligned data and the job is to align it, it will run on a single node by default, because the aligner needs access to all the data -- as far as I know you can't just split up the data and send it to be aligned on different machines. Keep in mind that scatter-gather is different from multithreading; we'll have a new documentation article on that topic out very soon.
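
    To make the tradeoff concrete, here is a back-of-the-envelope sketch assuming perfect parallelism and a fixed per-job scheduling overhead; the function names and numbers are illustrative, not part of Queue:

        // Scattering past the number of divisible units buys nothing.
        def effectiveScatter(requested: Int, units: Int): Int =
          math.min(requested, units)

        // With perfect parallelism, each scattered job runs for
        // totalWork / scatter plus a fixed scheduling overhead, so past
        // some point the overhead term dominates the wall-clock time.
        def wallClockMinutes(totalWork: Double, scatter: Int, overhead: Double): Double =
          totalWork / scatter + overhead

        // e.g. with 20,000 gene intervals, scatter = 100 already yields short
        // jobs; scatter = 20,000 makes each job tiny but pays the scheduler
        // (and the gather step) 20,000 times.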

    Geraldine Van der Auwera, PhD

  • pwhite Posts: 3 Member

    Thanks Geraldine. So if I am running Queue on whole-human-genome resequencing data to perform alignment, realignment, dedup, and recalibration, is it smart enough to know that realignment can be run across multiple intervals, that dedup must run on a single genome-wide BAM, and that it should calculate the covariates over multiple intervals, sum them for the genome, apply them to the intervals, and finally merge everything into a single processed BAM? Can I safely increase the scatter_gather variable beyond 25 (the number of chromosomes, which it defaults to when given an aligned BAM as input)?
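
    For what it's worth, here is a hedged sketch of the per-step behavior being asked about, with class names borrowed from Queue's GATK and Picard extensions (the fields and gather details are assumptions for your GATK version):

        val realign = new IndelRealigner
        realign.scatterCount = nContigs   // intervals are independent, so this scatters safely

        val dedup = new MarkDuplicates    // needs to see the whole BAM at once,
                                          // so no scatterCount is set: one job

        val recal = new BaseRecalibrator  // covariates are counted per scattered piece,
        recal.scatterCount = nContigs     // then gathered into a single recalibration
                                          // table before being applied genome-wide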
