
GATK Queue and the Data Processing Pipeline

pwhite Posts: 3 Member

What is the best way to get Queue to optimize utilization of a given number of cores in an SGE cluster? DataProcessingPipeline.scala has a hidden parameter, "scatter_gather", which sets the nContigs variable. Is it safe to use this option? For example, if you had 100 cores available in your cluster, could you set the option to 100? Is there any advantage to setting it higher?

Without setting it, Queue appears to set the nContigs value based on the number of chromosomes in the BAM input: with a whole-genome BAM it's 25, with your example Chr20 data it's 1, and with an unaligned BAM it's 0. So if starting with unaligned data, does it run on a single core?

Best Answer

  • Geraldine_VdAuwera Posts: 6,192 admin
    Answer ✓

    As I recall, Queue will not "over-scatter" your jobs, i.e. it won't forcefully split up the data further than makes sense for the analysis. The rest is up to the script you use. The DPP does those things correctly, if that's what you're using. As for changing the scatter count, it should be OK, but I would recommend you read the pipeline script to check where that parameter is used and verify that it matches your expectations.
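
    For orientation, the pattern to look for in the script is a user-supplied scatter count that gets copied onto each scatter-gatherable GATK function. Here is a minimal, purely illustrative sketch of that pattern -- the argument names and the walker are made up for the example and are not the actual DataProcessingPipeline code:

        import org.broadinstitute.sting.queue.QScript
        import org.broadinstitute.sting.queue.extensions.gatk._

        class ScatterCountSketch extends QScript {
          // Illustrative stand-ins, not the DPP's real argument names
          @Input(doc = "BAM file to process") var inputBam: File = _
          @Input(doc = "Reference fasta") var reference: File = _
          @Argument(doc = "Scatter-gather count", required = false) var userScatterCount: Int = 0

          def script() {
            val rtc = new RealignerTargetCreator
            rtc.input_file :+= inputBam
            rtc.reference_sequence = reference
            rtc.out = new File("cleaning.intervals")
            // Queue splits this job into at most this many pieces; it will not
            // scatter the data further than makes sense for the walker.
            rtc.scatterCount = if (userScatterCount > 0) userScatterCount else 1
            add(rtc)
          }
        }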

    Geraldine Van der Auwera, PhD

Answers

  • Geraldine_VdAuwera Posts: 6,192 Administrator, GATK Developer admin

    The merits of increasing the scatter-gather count depend a lot on what kind of jobs you're sending out, and also a little on how your cluster is set up. Some jobs can't be scattered beyond a certain count because the data simply can't be divvied up further. For example, if the smallest unit of data your job can operate on is the chromosome, there is no point in increasing the scatter-gather count beyond the number of chromosomes.

    If the smallest unit is a gene interval and you have 20,000 intervals, technically you can set the scatter count to the number of intervals, 20,000. The advantage is that the scattered jobs will have very short runtimes, which is good if you have a fast-moving "short jobs" queue in your cluster setup. But having so many jobs adds processing overhead, so at some point increasing the scatter count will stop yielding any performance gains and start costing you.

    Finally, if you have unaligned data and the job is to align it, it will run on a single node by default, because the aligner job needs access to all the data -- as far as I know you can't just split up the data and send the pieces to be aligned on different machines. Keep in mind that scatter-gather is different from multithreading. We'll have a new documentation article on that topic out very soon.
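
    As a back-of-the-envelope illustration of that trade-off (plain Scala with made-up numbers; this is not how Queue actually models it):

        // Made-up figures: 6000 CPU-minutes of real analysis work, plus roughly
        // half a minute of scheduling and startup overhead per scattered job.
        val totalWorkMinutes  = 6000.0
        val overheadPerJobMin = 0.5

        for (jobs <- Seq(25, 100, 400, 20000)) {
          val wallClockPerJob = totalWorkMinutes / jobs + overheadPerJobMin
          val totalCpuMinutes = totalWorkMinutes + overheadPerJobMin * jobs
          println(f"$jobs%6d jobs: ~$wallClockPerJob%8.2f min each, ~$totalCpuMinutes%8.0f CPU-minutes in total")
        }

    Past a certain point the per-job overhead dominates the total, which is where the extra scattering starts costing you.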

    Geraldine Van der Auwera, PhD

  • pwhite Posts: 3 Member

    Thanks Geraldine. So if I am running Queue on whole human genome resequencing data to perform alignment, realignment, dedup and recalibration, is it smart enough to know that realignment can be run across multiple intervals, that dedup must happen on a single genome-wide BAM, and that the covariates should be calculated over multiple intervals, summed for the genome and applied to the intervals, before finally merging into a single processed BAM? Can I safely increase the scatter_gather variable to over 25 (the number of chromosomes, which it defaults to if given an aligned BAM as input)?

  • byb121 UK Posts: 21 Member

    @Geraldine_VdAuwera

    Hi, after a search of the forum, I think this is the best place to ask similar questions. What will happen if I split a HaplotypeCaller job into more than 25 jobs (the number of chromosomes in my BAM files)? Will variants that happen to fall on the break points of the genome be missed? I routinely scatter HC jobs into 400 small jobs and hadn't thought about this problem until I found that a genuine variant was not called at all. The variant has good coverage and an excellent ref/alt allele ratio as well (94 C = wild type, 131 T = variant). To summarize, my questions are:

    1) In terms of overall speed and quality of variant calls, is the number of chromosomes in the BAM files the best number of jobs to split HC into?

    2) If splitting the job into more (say 100) jobs can speed up the process, will this affect the variant calls (could variants on break points be missed, for example)? I am also wondering how these jobs are split: is each chromosome split into 4 even pieces, or is the whole genome split into 100 even pieces?

    3) Could the variant I mentioned above have gone undetected for some other reason?

    Thanks in advance.

  • Geraldine_VdAuwera Posts: 6,192 Administrator, GATK Developer admin

    Hi @byb121,

    1. It is the most straightforward and safest way to do it, but not the fastest since it limits how much you can parallelize.

    2. This is a little tricky, but I believe you can scatter within chromosomes, on the condition that you use the interval padding function (see the sketch after this list). This will ensure that you don't lose variants on the edges, without actually calling variants in the padding regions. I would recommend padding of 100 bases. Be sure to test this and check that it's behaving as expected, as it's not part of our official recommendations.

    3. Hard to say without seeing the data. When troubleshooting a "difficult" variant, the first thing to do is to run on the surrounding interval with the -bamOut argument and see what the region looks like after re-assembly. Also look at soft-clips, mapping qualities, etc.
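
    To make points 2 and 3 concrete, here is an unofficial sketch of a QScript that restricts HaplotypeCaller to the suspect region, pads it, and writes out the re-assembled reads. The interval_padding and bamOutput field names come from Queue's auto-generated GATK wrappers, so double-check them (and their types) against the extensions shipped with your version:

        import org.broadinstitute.sting.queue.QScript
        import org.broadinstitute.sting.queue.extensions.gatk._

        class DebugRegionSketch extends QScript {
          @Input(doc = "BAM to call on") var bam: File = _
          @Input(doc = "Reference fasta") var ref: File = _

          def script() {
            val hc = new HaplotypeCaller
            hc.input_file :+= bam
            hc.reference_sequence = ref
            // Restrict the run to the interval around the suspect variant
            hc.intervalsString :+= "chr1:1000000-1010000"
            // Pad the interval so calls near the edges are not lost; this wraps
            // the engine's --interval_padding / -ip argument (your wrapper may
            // declare it as Option[Int], in which case use Some(100))
            hc.interval_padding = 100
            // Write the re-assembled reads for inspection in IGV; this wraps
            // the walker's bam output argument (-bamout on the command line)
            hc.bamOutput = new File("debug_region.bam")
            hc.out = new File("debug_region.vcf")
            add(hc)
          }
        }

    If your wrapper doesn't expose those fields, running the same single padded interval on a plain GATK command line works just as well for debugging.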

    Geraldine Van der Auwera, PhD

  • byb121 UK Posts: 21 Member

    Hi @Geraldine_VdAuwera,

    Thanks, I will try the interval padding and visualise the difficult variant. Will let you know the result.

  • byb121 UK Posts: 21 Member

    Hi @Geraldine_VdAuwera,

    I tried the interval padding function (added a few lines to my Scala script for padding of 100 bp), but judging from the intervals recorded in the "scatter.intervals" files in the temp_ folders, I don't see a 200 bp overlap between intervals, only non-overlapping consecutive regions. From the "FunctionEdge" lines in the Queue log, I don't see any extra options being added. An example is shown below:

    DEBUG 12:41:30,314 FunctionEdge - Starting: /mnt/lustre/users/a5907529/Haplocall_Sophie_A1969_b1b2/ForIntervals/.queue/scatterGather/HaplotypeCaller-1-sg/temp_100_of_100 > 'java' '-Xmx12288m' '-XX:+UseParallelOldGC' '-XX:ParallelGCThreads=4' '-XX:GCTimeLimit=50' '-XX:GCHeapFreeLimit=10' '-Djava.io.tmpdir=/users/a5907529/lustre/Haplocall_Sophie_A1969_b1b2/ForIntervals/HaplotypeCaller_tmp' '-cp' '/opt/gridware/pkg/apps/gatk+queue/3.1.1/noarch/Queue.jar' 'org.broadinstitute.sting.gatk.CommandLineGATK' '-T' 'HaplotypeCaller' '-I' '/users/a5907529/lustre/Sophie_unsovled_cases/A1969/D129266/GATK/D129266_nodups.sorted.realigned.Recal.bam' '-I' '/users/a5907529/lustre/Sophie_unsovled_cases/A1969/D134130/GATK/D134130_nodups.sorted.realigned.Recal.bam' '-L' '/mnt/lustre/users/a5907529/Haplocall_Sophie_A1969_b1b2/ForIntervals/.queue/scatterGather/HaplotypeCaller-1-sg/temp_100_of_100/scatter.intervals' '-R' '/users/a5907529/data/GATK/bundle2.8/hg19/ucsc.hg19.YAOBO.fasta' '-o' '/mnt/lustre/users/a5907529/Haplocall_Sophie_A1969_b1b2/ForIntervals/.queue/scatterGather/HaplotypeCaller-1-sg/temp_100_of_100/HaplotyperCaller.vcf'

    Does this mean that my interval padding failed? Or is there somewhere else I need to look? Thanks.

    By the way, the variant that I mentioned last time does appear with the latest version of GATK; I'm not sure of the exact cause of it being missed before and will look further into it. I am also wondering if you could give me some hints on how to produce a BAM file for each scattered HaplotypeCaller job, just to check what is going on at both ends of an interval.

    Best wishes.

  • Geraldine_VdAuwera Posts: 6,192 Administrator, GATK Developer admin

    @byb121 Upon further thought, I realize now that Queue will not add the interval padding to scatter intervals. Sorry for the bad recommendation, I had not thought that through. But actually Queue should natively know the right way to scatter HC jobs, which is to do it by ActiveRegion. In fact, I believe you can set the scatter count to any value above the contig count, and Queue will do the right thing.
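
    In QScript terms that just means asking for more pieces than there are contigs; a minimal fragment (inside script(), with bam and ref declared as @Input fields as in the earlier sketch, and 400 being an arbitrary example rather than a recommendation):

        // Same HaplotypeCaller wrapper as before, but run genome-wide and
        // scattered into many more pieces than there are contigs.
        val hc = new HaplotypeCaller
        hc.input_file :+= bam
        hc.reference_sequence = ref
        hc.out = new File("cohort.vcf")
        hc.scatterCount = 400  // Queue decides how to divide the work into 400 pieces
        add(hc)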

    To look at the scattered BAM files you'll need to look in the directory where Queue creates the temporary files. I'll have to check with @kshakir what the recommended method is for using those files for debugging.

    Geraldine Van der Auwera, PhD

  • byb121 UK Posts: 21 Member
    edited May 7

    @Geraldine_VdAuwera,

    Thanks a lot. I think I will have to visualize some of the BAM files and compare them to a sample's BAM. This will be the slowest way, I guess; I'd be glad to hear if there is a more efficient method.

    One more thing that just popped into my mind: if I use scattered HC jobs to call variants from whole genome data, will the ActiveRegion mechanism still handle the ends of the scattered intervals correctly?

  • miked Posts: 17 Member
    edited May 23

    @Geraldine_VdAuwera,

    I made a post to this thread yesterday and got a message that it was "awaiting moderator approval". It still does not appear on this thread. Should I repost the questions that I have? It could have been thrown into the spam folder, as I included links to other GATK threads.

    Thanks.

  • Sheila Broad Institute Posts: 428 Member, GATK Developer, Broadie, Moderator admin

    @miked

    Hi.

    I just verified your account, so now you are able to post without being flagged as spam!

    -Sheila
