HaplotypeCaller scattered job run time

byb121 (UK, Member)

Hi,

I just managed to use HaplotypeCaller with the latest version of Queue to call variants on 40 human exomes. The HaplotypeCaller job was scattered into 50 sub-jobs and spread across our cluster with Sun Grid Engine.

The problem is that the sub-jobs take very different amounts of time to finish, from 5 hours to 80 hours, with the majority finishing in under 55 hours, so the whole job is held up by just a few long-running sub-jobs. I know that part of the difference is certainly due to the performance of the cluster node running each sub-job, but I think the main cause lies in how the job was split. The qscript I used is adapted from here (without the filtering part), and I cannot work out from it how the job is split. So I am wondering if anyone could tell me on what basis (genomic regions?) the HaplotypeCaller job is actually scattered, and how I can split it more evenly so that most of the sub-jobs finish at about the same time.

Thanks in advance,

Best,

Yaobo

Answers

  • brspurri (New York, Member)

    I'm running into this exact problem. I have my own interval list, but I don't know how to separate the interval list out into chunks and send a different chunk to each scatter node. Is this even possible? The way it looks to me, scatter manages which portions of the intervals are sent to each scatter node (which is wildly disproportionate - at least the way I know how to use it). Anyone have any advice?
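For anyone who wants to try pre-splitting an interval list by hand, here is a minimal sketch in plain Python. It does not use Queue or any GATK API and is not how Queue itself scatters. It assumes intervals are written as `contig:start-end` (1-based, inclusive), and it balances chunks by total base pairs, which is only a rough proxy for HaplotypeCaller run time (coverage and local complexity matter too). The function names `parse_intervals`, `balance_intervals`, and `chunk_size` are made up for illustration.

```python
import heapq
from typing import Iterable, List, Tuple

# (contig, start, end), 1-based inclusive coordinates (an assumption of this sketch)
Interval = Tuple[str, int, int]


def parse_intervals(lines: Iterable[str]) -> List[Interval]:
    """Parse 'contig:start-end' strings, skipping blank lines."""
    intervals = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        contig, span = line.rsplit(":", 1)
        start, end = span.split("-")
        intervals.append((contig, int(start), int(end)))
    return intervals


def interval_length(interval: Interval) -> int:
    """Number of bases covered by one inclusive interval."""
    return interval[2] - interval[1] + 1


def chunk_size(chunk: List[Interval]) -> int:
    """Total number of bases covered by a chunk."""
    return sum(interval_length(iv) for iv in chunk)


def balance_intervals(intervals: List[Interval], n_chunks: int) -> List[List[Interval]]:
    """Greedy balancing: take intervals largest first and always add the
    next one to the chunk that currently holds the fewest bases."""
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    # Min-heap of (bases assigned so far, chunk index).
    heap = [(0, i) for i in range(n_chunks)]
    heapq.heapify(heap)
    members: List[List[int]] = [[] for _ in range(n_chunks)]
    order = sorted(range(len(intervals)),
                   key=lambda i: interval_length(intervals[i]),
                   reverse=True)
    for i in order:
        total, idx = heapq.heappop(heap)
        members[idx].append(i)
        heapq.heappush(heap, (total + interval_length(intervals[i]), idx))
    # Within each chunk, restore the order the intervals had in the input.
    return [[intervals[i] for i in sorted(m)] for m in members]


if __name__ == "__main__":
    example = parse_intervals([
        "chr1:1-100", "chr1:201-250", "chr2:1-50",
        "chr2:101-130", "chr3:1-30", "chr3:101-120",
    ])
    for n, chunk in enumerate(balance_intervals(example, 2)):
        print(n, chunk_size(chunk), chunk)
```

Each chunk could then be written to its own interval file and passed to a separate HaplotypeCaller run with `-L`. Very long intervals are not subdivided here, so a single huge target can still dominate one chunk; splitting those first would be a natural extension.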
