GATK Queue GridEngine Job submission slows dramatically over time

jeremylp2 Member
edited July 2015 in Ask the GATK team

I'm using Queue 3.4 with a Univa GridEngine cluster. Everything runs fine, and at first the time between job submissions is only a few seconds. But the rate at which Queue submits jobs to the cluster then slows to a crawl: up to 40 seconds per job after a few hundred submissions, and still slowly increasing. Any idea what might be happening here? With a scatterCount of 1000, it can take a full day just to submit all the jobs to the cluster for a HaplotypeCaller run.

Answers

  • I've pasted some lines from the log here so that you can see what this slowdown looks like.

    INFO 15:45:32,484 QGraph - 964 Pend, 24 Run, 0 Fail, 24 Done
    INFO 15:46:06,011 QGraph - 954 Pend, 34 Run, 0 Fail, 24 Done
    INFO 15:46:38,589 QGraph - 946 Pend, 42 Run, 0 Fail, 24 Done
    INFO 15:47:11,023 QGraph - 939 Pend, 49 Run, 0 Fail, 24 Done
    INFO 15:47:42,397 QGraph - 933 Pend, 55 Run, 0 Fail, 24 Done
    INFO 15:48:14,216 QGraph - 928 Pend, 60 Run, 0 Fail, 24 Done
    INFO 15:48:47,888 QGraph - 923 Pend, 65 Run, 0 Fail, 24 Done
    INFO 15:49:22,128 QGraph - 918 Pend, 69 Run, 0 Fail, 25 Done
    INFO 15:49:55,408 QGraph - 914 Pend, 73 Run, 0 Fail, 25 Done
    INFO 15:50:28,256 QGraph - 910 Pend, 74 Run, 0 Fail, 28 Done
    INFO 15:51:01,664 QGraph - 906 Pend, 76 Run, 0 Fail, 30 Done
    INFO 15:51:41,260 QGraph - 902 Pend, 72 Run, 0 Fail, 38 Done
    INFO 15:52:18,828 QGraph - 898 Pend, 70 Run, 0 Fail, 44 Done
    INFO 15:52:54,121 QGraph - 895 Pend, 68 Run, 0 Fail, 49 Done
    INFO 15:53:33,503 QGraph - 891 Pend, 71 Run, 0 Fail, 50 Done
    INFO 15:54:07,489 QGraph - 888 Pend, 71 Run, 0 Fail, 53 Done
    INFO 15:54:42,231 QGraph - 885 Pend, 73 Run, 0 Fail, 54 Done
    INFO 15:55:16,136 QGraph - 882 Pend, 71 Run, 0 Fail, 59 Done
    INFO 15:55:48,958 QGraph - 879 Pend, 74 Run, 0 Fail, 59 Done
    INFO 15:56:22,879 QGraph - 876 Pend, 73 Run, 0 Fail, 63 Done
    INFO 15:56:57,497 QGraph - 873 Pend, 75 Run, 0 Fail, 64 Done
    INFO 15:57:33,289 QGraph - 870 Pend, 76 Run, 0 Fail, 66 Done
    INFO 15:58:10,100 QGraph - 867 Pend, 74 Run, 0 Fail, 71 Done
    INFO 15:58:50,624 QGraph - 864 Pend, 74 Run, 0 Fail, 74 Done
    INFO 15:59:29,533 QGraph - 861 Pend, 73 Run, 0 Fail, 78 Done
    INFO 16:00:09,765 QGraph - 858 Pend, 75 Run, 0 Fail, 79 Done
    INFO 16:00:40,019 QGraph - 856 Pend, 72 Run, 0 Fail, 84 Done
    INFO 16:01:10,057 QGraph - 854 Pend, 71 Run, 0 Fail, 87 Done
    INFO 16:01:42,362 QGraph - 852 Pend, 70 Run, 0 Fail, 90 Done
    INFO 16:02:15,883 QGraph - 849 Pend, 71 Run, 0 Fail, 92 Done
    INFO 16:02:48,671 QGraph - 847 Pend, 67 Run, 0 Fail, 98 Done
    INFO 16:03:22,405 QGraph - 845 Pend, 69 Run, 0 Fail, 98 Done
    INFO 16:03:56,440 QGraph - 843 Pend, 69 Run, 0 Fail, 100 Done
    INFO 16:04:32,423 QGraph - 841 Pend, 69 Run, 0 Fail, 102 Done
    INFO 16:05:06,594 QGraph - 839 Pend, 69 Run, 0 Fail, 104 Done
    INFO 16:05:38,903 QGraph - 837 Pend, 69 Run, 0 Fail, 106 Done
    INFO 16:06:13,956 QGraph - 835 Pend, 69 Run, 0 Fail, 108 Done
    INFO 16:06:51,736 QGraph - 833 Pend, 65 Run, 0 Fail, 114 Done
    INFO 16:07:26,260 QGraph - 831 Pend, 65 Run, 0 Fail, 116 Done
    INFO 16:08:02,062 QGraph - 829 Pend, 64 Run, 0 Fail, 119 Done
    INFO 16:08:38,973 QGraph - 827 Pend, 66 Run, 0 Fail, 119 Done
    INFO 16:09:16,786 QGraph - 825 Pend, 63 Run, 0 Fail, 124 Done
    INFO 16:09:52,354 QGraph - 823 Pend, 62 Run, 0 Fail, 127 Done
    INFO 16:10:29,158 QGraph - 821 Pend, 64 Run, 0 Fail, 127 Done
    INFO 16:11:06,211 QGraph - 819 Pend, 66 Run, 0 Fail, 127 Done
    INFO 16:11:45,325 QGraph - 817 Pend, 66 Run, 0 Fail, 129 Done
    INFO 16:12:22,746 QGraph - 815 Pend, 68 Run, 0 Fail, 129 Done
    INFO 16:13:01,561 QGraph - 813 Pend, 66 Run, 0 Fail, 133 Done
    INFO 16:13:43,926 QGraph - 811 Pend, 67 Run, 0 Fail, 134 Done
    INFO 16:14:25,277 QGraph - 809 Pend, 66 Run, 0 Fail, 137 Done
    INFO 16:15:09,395 QGraph - 807 Pend, 68 Run, 0 Fail, 137 Done
    INFO 16:15:51,624 QGraph - 805 Pend, 68 Run, 0 Fail, 139 Done
    INFO 16:16:34,159 QGraph - 803 Pend, 68 Run, 0 Fail, 141 Done
    INFO 16:17:16,764 QGraph - 801 Pend, 70 Run, 0 Fail, 141 Done
    INFO 16:18:01,186 QGraph - 799 Pend, 71 Run, 0 Fail, 142 Done
    INFO 16:18:49,611 QGraph - 797 Pend, 73 Run, 0 Fail, 142 Done
    INFO 16:19:39,305 QGraph - 794 Pend, 75 Run, 0 Fail, 143 Done
    INFO 16:20:23,610 QGraph - 792 Pend, 76 Run, 0 Fail, 144 Done
    INFO 16:21:10,820 QGraph - 790 Pend, 74 Run, 0 Fail, 148 Done
    INFO 16:22:02,183 QGraph - 788 Pend, 72 Run, 0 Fail, 152 Done
    INFO 16:22:48,848 QGraph - 786 Pend, 73 Run, 0 Fail, 153 Done
    INFO 16:23:42,993 QGraph - 783 Pend, 74 Run, 0 Fail, 155 Done
    INFO 16:24:34,003 QGraph - 781 Pend, 75 Run, 0 Fail, 156 Done
    INFO 16:25:23,470 QGraph - 779 Pend, 76 Run, 0 Fail, 157 Done
    INFO 16:26:17,262 QGraph - 777 Pend, 78 Run, 0 Fail, 157 Done
    INFO 16:27:09,964 QGraph - 775 Pend, 78 Run, 0 Fail, 159 Done
    INFO 16:28:04,408 QGraph - 773 Pend, 78 Run, 0 Fail, 161 Done
    INFO 16:28:56,902 QGraph - 771 Pend, 79 Run, 0 Fail, 162 Done
    INFO 16:29:50,830 QGraph - 769 Pend, 79 Run, 0 Fail, 164 Done
    INFO 16:30:45,044 QGraph - 767 Pend, 78 Run, 0 Fail, 167 Done
    INFO 16:31:43,173 QGraph - 765 Pend, 78 Run, 0 Fail, 169 Done
    INFO 16:32:40,758 QGraph - 763 Pend, 78 Run, 0 Fail, 171 Done
    INFO 16:33:11,266 QGraph - 762 Pend, 76 Run, 0 Fail, 174 Done
    INFO 16:33:43,015 QGraph - 761 Pend, 75 Run, 0 Fail, 176 Done
    INFO 16:34:13,405 QGraph - 760 Pend, 76 Run, 0 Fail, 176 Done
    INFO 16:35:10,533 QGraph - 758 Pend, 78 Run, 0 Fail, 176 Done
    INFO 16:36:08,890 QGraph - 756 Pend, 77 Run, 0 Fail, 179 Done
    INFO 16:36:46,069 QGraph - 755 Pend, 73 Run, 0 Fail, 184 Done
    INFO 16:37:18,403 QGraph - 754 Pend, 73 Run, 0 Fail, 185 Done
    INFO 16:38:15,227 QGraph - 752 Pend, 74 Run, 0 Fail, 186 Done
    INFO 16:39:13,975 QGraph - 750 Pend, 74 Run, 0 Fail, 188 Done
    INFO 16:39:44,905 QGraph - 749 Pend, 73 Run, 0 Fail, 190 Done
    INFO 16:40:15,247 QGraph - 748 Pend, 70 Run, 0 Fail, 194 Done
    INFO 16:40:48,191 QGraph - 747 Pend, 68 Run, 0 Fail, 197 Done
    INFO 16:41:47,257 QGraph - 745 Pend, 66 Run, 0 Fail, 201 Done
    INFO 16:42:19,451 QGraph - 744 Pend, 67 Run, 0 Fail, 201 Done
    INFO 16:42:51,291 QGraph - 743 Pend, 67 Run, 0 Fail, 202 Done
    INFO 16:43:25,003 QGraph - 742 Pend, 68 Run, 0 Fail, 202 Done
    INFO 16:43:58,416 QGraph - 741 Pend, 67 Run, 0 Fail, 204 Done
    INFO 16:44:31,374 QGraph - 740 Pend, 65 Run, 0 Fail, 207 Done
    INFO 16:45:03,266 QGraph - 739 Pend, 66 Run, 0 Fail, 207 Done
    INFO 16:45:35,121 QGraph - 738 Pend, 64 Run, 0 Fail, 210 Done
    INFO 16:46:08,261 QGraph - 737 Pend, 65 Run, 0 Fail, 210 Done
    INFO 16:46:39,858 QGraph - 736 Pend, 66 Run, 0 Fail, 210 Done
    INFO 16:47:11,807 QGraph - 735 Pend, 66 Run, 0 Fail, 211 Done
    INFO 16:47:44,993 QGraph - 734 Pend, 66 Run, 0 Fail, 212 Done
    INFO 16:48:18,841 QGraph - 732 Pend, 65 Run, 0 Fail, 215 Done
    INFO 16:48:55,241 QGraph - 731 Pend, 61 Run, 0 Fail, 220 Done
    INFO 16:49:30,527 QGraph - 730 Pend, 60 Run, 0 Fail, 222 Done
    INFO 16:50:04,162 QGraph - 729 Pend, 59 Run, 0 Fail, 224 Done
    INFO 16:50:38,730 QGraph - 728 Pend, 60 Run, 0 Fail, 224 Done
    INFO 16:51:09,959 QGraph - 727 Pend, 60 Run, 0 Fail, 225 Done
    INFO 16:51:46,934 QGraph - 726 Pend, 61 Run, 0 Fail, 225 Done
    INFO 16:52:23,325 QGraph - 725 Pend, 58 Run, 0 Fail, 229 Done
    INFO 16:52:57,194 QGraph - 724 Pend, 58 Run, 0 Fail, 230 Done
    INFO 16:53:31,460 QGraph - 723 Pend, 59 Run, 0 Fail, 230 Done
    INFO 16:54:07,855 QGraph - 721 Pend, 59 Run, 0 Fail, 232 Done
    INFO 16:54:41,157 QGraph - 720 Pend, 59 Run, 0 Fail, 233 Done
    INFO 16:55:14,564 QGraph - 719 Pend, 60 Run, 0 Fail, 233 Done
    INFO 16:55:48,413 QGraph - 718 Pend, 59 Run, 0 Fail, 235 Done
    INFO 16:56:23,202 QGraph - 717 Pend, 59 Run, 0 Fail, 236 Done
    INFO 16:57:01,981 QGraph - 716 Pend, 56 Run, 0 Fail, 240 Done
    INFO 16:57:38,568 QGraph - 715 Pend, 53 Run, 0 Fail, 244 Done
    INFO 16:58:13,460 QGraph - 713 Pend, 51 Run, 0 Fail, 248 Done
    INFO 16:58:47,182 QGraph - 712 Pend, 50 Run, 0 Fail, 250 Done
    INFO 16:59:21,609 QGraph - 711 Pend, 50 Run, 0 Fail, 251 Done
    INFO 16:59:55,928 QGraph - 710 Pend, 51 Run, 0 Fail, 251 Done
    INFO 17:00:33,243 QGraph - 708 Pend, 51 Run, 0 Fail, 253 Done
    INFO 17:01:08,153 QGraph - 707 Pend, 48 Run, 0 Fail, 257 Done
    INFO 17:01:41,577 QGraph - 706 Pend, 48 Run, 0 Fail, 258 Done
    INFO 17:02:14,976 QGraph - 705 Pend, 46 Run, 0 Fail, 261 Done
    INFO 17:02:47,821 QGraph - 704 Pend, 47 Run, 0 Fail, 261 Done
    INFO 17:03:21,541 QGraph - 703 Pend, 48 Run, 0 Fail, 261 Done
    INFO 17:03:54,358 QGraph - 702 Pend, 48 Run, 0 Fail, 262 Done
    INFO 17:04:28,234 QGraph - 701 Pend, 48 Run, 0 Fail, 263 Done
    INFO 17:05:01,858 QGraph - 699 Pend, 48 Run, 0 Fail, 265 Done
    INFO 17:05:35,316 QGraph - 698 Pend, 45 Run, 0 Fail, 269 Done
    INFO 17:06:08,599 QGraph - 697 Pend, 45 Run, 0 Fail, 270 Done
    INFO 17:06:41,929 QGraph - 696 Pend, 46 Run, 0 Fail, 270 Done
    INFO 17:07:17,453 QGraph - 695 Pend, 45 Run, 0 Fail, 272 Done
    INFO 17:07:53,964 QGraph - 693 Pend, 46 Run, 0 Fail, 273 Done
    INFO 17:08:30,574 QGraph - 692 Pend, 46 Run, 0 Fail, 274 Done
    INFO 17:09:07,190 QGraph - 691 Pend, 45 Run, 0 Fail, 276 Done
    INFO 17:09:42,813 QGraph - 690 Pend, 44 Run, 0 Fail, 278 Done
    INFO 17:10:19,051 QGraph - 689 Pend, 44 Run, 0 Fail, 279 Done
    INFO 17:10:54,900 QGraph - 688 Pend, 44 Run, 0 Fail, 280 Done
    INFO 17:11:34,032 QGraph - 687 Pend, 42 Run, 0 Fail, 283 Done
    INFO 17:12:11,456 QGraph - 686 Pend, 41 Run, 0 Fail, 285 Done
    INFO 17:12:45,846 QGraph - 685 Pend, 38 Run, 0 Fail, 289 Done
    INFO 17:13:20,197 QGraph - 684 Pend, 39 Run, 0 Fail, 289 Done
    INFO 17:13:57,220 QGraph - 683 Pend, 39 Run, 0 Fail, 290 Done
    INFO 17:14:34,229 QGraph - 682 Pend, 40 Run, 0 Fail, 290 Done
    INFO 17:15:13,735 QGraph - 681 Pend, 39 Run, 0 Fail, 292 Done
    INFO 17:15:51,577 QGraph - 679 Pend, 39 Run, 0 Fail, 294 Done
    INFO 17:16:29,515 QGraph - 678 Pend, 39 Run, 0 Fail, 295 Done
    INFO 17:17:06,909 QGraph - 677 Pend, 38 Run, 0 Fail, 297 Done
    INFO 17:17:45,886 QGraph - 675 Pend, 38 Run, 0 Fail, 299 Done
    INFO 17:18:24,922 QGraph - 674 Pend, 38 Run, 0 Fail, 300 Done
    INFO 17:19:03,189 QGraph - 673 Pend, 38 Run, 0 Fail, 301 Done
    INFO 17:19:41,434 QGraph - 672 Pend, 39 Run, 0 Fail, 301 Done
    INFO 17:20:20,376 QGraph - 671 Pend, 39 Run, 0 Fail, 302 Done
    INFO 17:20:58,628 QGraph - 670 Pend, 38 Run, 0 Fail, 304 Done
    INFO 17:21:37,853 QGraph - 669 Pend, 38 Run, 0 Fail, 305 Done
    INFO 17:22:14,169 QGraph - 668 Pend, 38 Run, 0 Fail, 306 Done
    INFO 17:22:53,445 QGraph - 667 Pend, 39 Run, 0 Fail, 306 Done
    INFO 17:23:32,054 QGraph - 666 Pend, 38 Run, 0 Fail, 308 Done
    INFO 17:24:12,710 QGraph - 665 Pend, 38 Run, 0 Fail, 309 Done
    INFO 17:24:51,930 QGraph - 664 Pend, 36 Run, 0 Fail, 312 Done
    INFO 17:25:32,810 QGraph - 663 Pend, 35 Run, 0 Fail, 314 Done
    INFO 17:26:12,938 QGraph - 661 Pend, 35 Run, 0 Fail, 316 Done
    INFO 17:26:52,326 QGraph - 660 Pend, 34 Run, 0 Fail, 318 Done
    INFO 17:27:32,283 QGraph - 659 Pend, 35 Run, 0 Fail, 318 Done
    INFO 17:28:12,348 QGraph - 658 Pend, 36 Run, 0 Fail, 318 Done
    INFO 17:28:52,366 QGraph - 657 Pend, 37 Run, 0 Fail, 318 Done
    INFO 17:29:33,246 QGraph - 656 Pend, 35 Run, 0 Fail, 321 Done
    INFO 17:30:13,248 QGraph - 655 Pend, 35 Run, 0 Fail, 322 Done
    INFO 17:30:53,896 QGraph - 654 Pend, 36 Run, 0 Fail, 322 Done
    INFO 17:31:34,971 QGraph - 653 Pend, 36 Run, 0 Fail, 323 Done
    INFO 17:32:15,068 QGraph - 652 Pend, 36 Run, 0 Fail, 324 Done
    INFO 17:32:54,741 QGraph - 651 Pend, 36 Run, 0 Fail, 325 Done
    INFO 17:33:32,897 QGraph - 650 Pend, 37 Run, 0 Fail, 325 Done
    INFO 17:34:13,144 QGraph - 649 Pend, 38 Run, 0 Fail, 325 Done
    INFO 17:34:53,701 QGraph - 648 Pend, 38 Run, 0 Fail, 326 Done
    INFO 17:35:35,290 QGraph - 647 Pend, 38 Run, 0 Fail, 327 Done
    INFO 17:36:16,571 QGraph - 646 Pend, 37 Run, 0 Fail, 329 Done
    INFO 17:36:58,410 QGraph - 645 Pend, 37 Run, 0 Fail, 330 Done
    INFO 17:37:39,679 QGraph - 644 Pend, 37 Run, 0 Fail, 331 Done
    INFO 17:38:21,946 QGraph - 643 Pend, 35 Run, 0 Fail, 334 Done

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @jeremylp2
    Hi,

    Unfortunately, I am not sure how to help you at this point. Perhaps you can ask your IT support to check the health of the cluster nodes.

    -Sheila

  • Hi Sheila,

    Thanks for responding. Our cluster nodes are fine and work great in all other situations, so unfortunately our IT cannot help; submitting jobs with anything other than Queue always takes about a second. Note that the jobs from Queue run to completion once they are submitted; the only problem is that Queue is extremely slow to submit them. Any idea where to look for the problem, at least? I am not the first person to experience this, and in another thread no solution was found either: http://gatkforums.broadinstitute.org/discussion/4077/queue-job-submission-rate-slows-over-time. Unfortunately, this problem means that in some cases Queue takes longer to submit the jobs than the jobs take to run.

    Thanks,

    Jeremy

  • Sheila, Broad Institute (Member, Broadie, Moderator)

    @jeremylp2
    Hi Jeremy,

    This sounds like a problem with the infrastructure, and unfortunately, there is really nothing we can do at this time. Hopefully some of the other users in the other thread will report some solutions.

    -Sheila
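
For anyone comparing their own runs against the QGraph status lines pasted earlier in this thread, the slowdown can be put into numbers by dividing the elapsed time between two status lines by the drop in the "Pend" count. The following is only a rough sketch and not part of Queue: the function names `parse_qgraph` and `seconds_per_submission` are made up for illustration, and it assumes that every decrease in "Pend" is one job submission and that the log does not cross midnight.

```python
import re

# Matches status lines such as:
#   INFO 15:45:32,484 QGraph - 964 Pend, 24 Run, 0 Fail, 24 Done
STATUS_RE = re.compile(
    r"INFO\s+(\d{2}):(\d{2}):(\d{2}),(\d{3})\s+QGraph - (\d+) Pend,"
)


def parse_qgraph(lines):
    """Return (seconds_since_midnight, pending_jobs) for each status line."""
    rows = []
    for line in lines:
        match = STATUS_RE.search(line)
        if match is None:
            continue  # skip anything that is not a QGraph status line
        hours, minutes, seconds, millis, pend = (int(g) for g in match.groups())
        stamp = hours * 3600 + minutes * 60 + seconds + millis / 1000
        rows.append((stamp, pend))
    return rows


def seconds_per_submission(rows):
    """Average seconds per submitted job between the first and last row.

    Assumption: the drop in the Pend count equals the number of jobs
    submitted in that interval.
    """
    (start, pend_start), (end, pend_end) = rows[0], rows[-1]
    submitted = pend_start - pend_end
    if submitted <= 0:
        raise ValueError("no jobs were submitted in this interval")
    return (end - start) / submitted


# First two and last two status lines from the log in this thread.
early = [
    "INFO 15:45:32,484 QGraph - 964 Pend, 24 Run, 0 Fail, 24 Done",
    "INFO 15:46:06,011 QGraph - 954 Pend, 34 Run, 0 Fail, 24 Done",
]
late = [
    "INFO 17:37:39,679 QGraph - 644 Pend, 37 Run, 0 Fail, 331 Done",
    "INFO 17:38:21,946 QGraph - 643 Pend, 35 Run, 0 Fail, 334 Done",
]

print(f"{seconds_per_submission(parse_qgraph(early)):.2f}")  # 3.35
print(f"{seconds_per_submission(parse_qgraph(late)):.2f}")   # 42.27
```

On these four lines the per-job submission time goes from about 3.4 seconds to about 42 seconds, in line with the "up to 40 seconds per job" described in the original question. Feeding the function a whole log file (for example `parse_qgraph(open(path))`, with `path` being your own log) gives the average over the full run.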
