HaplotypeCaller scattered job run time
I just managed to use HaplotypeCaller with the latest version of Queue to call variants on 40 human exomes. The HaplotypeCaller job was scattered into 50 sub-jobs and spread across our cluster with Sun Grid Engine.
The problem I found is that the sub-jobs take very different amounts of time to finish, from 5 hours to 80 hours, with the majority below 55 hours, so the whole job was slowed down by just a few long-running sub-jobs. I know that part of the difference was caused by the performance of the cluster node running each sub-job, but I think the major cause lies in how the job was split. The qscript I used is adapted from here (without the filtering part), and I cannot figure out from it how the job was split. Hence, I am wondering if anyone could tell me on what basis (genomic regions?) the HaplotypeCaller job was scattered, and how I can split the job more evenly so that most of the sub-jobs finish at about the same time.
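To make clear what I mean by "more evenly", here is a minimal Python sketch of the kind of balancing I have in mind. This is not how Queue scatters jobs (that is exactly what I don't know); it is only an illustration. It assumes the exome targets are available as (contig, start, end) tuples with 1-based inclusive coordinates, and it uses the number of bases per group as a rough proxy for run time, which may not hold if coverage or region complexity varies a lot. The function name balance_intervals is made up for this example.

```python
import heapq


def balance_intervals(intervals, n_jobs):
    """Greedily assign intervals to n_jobs groups with similar total bases.

    intervals: list of (contig, start, end), 1-based inclusive (assumed format).
    Returns a list of n_jobs lists of intervals.
    """
    def length(iv):
        return iv[2] - iv[1] + 1

    # Place the largest intervals first (longest-processing-time heuristic).
    ordered = sorted(intervals, key=length, reverse=True)

    # Min-heap of (total_bases, group_index): the lightest group is on top.
    heap = [(0, i) for i in range(n_jobs)]
    heapq.heapify(heap)
    groups = [[] for _ in range(n_jobs)]

    for iv in ordered:
        total, idx = heapq.heappop(heap)
        groups[idx].append(iv)
        heapq.heappush(heap, (total + length(iv), idx))
    return groups


# Toy example with four hypothetical target regions split across two jobs.
targets = [("chr1", 1, 100), ("chr1", 201, 250), ("chr2", 1, 60), ("chr2", 101, 190)]
for i, group in enumerate(balance_intervals(targets, 2)):
    print(i, sum(end - start + 1 for _, start, end in group), group)
```

Each group could then be written out as its own interval list and given to one sub-job, if the scatter step allows that.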
Thanks in advance,