Scheduler

Overview

This paper introduces the structure of the Grid Engine scheduler. Note that the scheduler was originally a separate daemon but is now a thread in the qmaster. Some documentation may still refer to the separate "schedd".

For additional source code documentation on scheduling questions, see also the description of the scheduling library, libsched.

A scheduling run begins by making a copy of the scheduler's internal data. This copy allows the data to be modified during a scheduling run without changing the state that resulted from the event update. Then, beginning with the most important job (see the following sections), the scheduler tries to dispatch all pending jobs to queues and generates a so-called order for each of these decisions. Besides these dispatch orders, the scheduler also prepares some other orders, e.g. to implement the so-called suspend_thresholds (see the queue_conf(5) manual page).
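The run described above can be sketched as follows. This is a minimal illustration with made-up types and function names (job_t, scheduling_run(), etc. are not the real CULL-based data structures the scheduler operates on); the point is only the copy-then-dispatch shape of a run.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified types -- the real scheduler works on CULL lists. */
typedef struct { int job_id; int tickets; int dispatched; } job_t;
typedef struct { job_t *jobs; int njobs; } sched_data_t;

/* A scheduling run works on a private copy of the event-driven data,
 * so dispatch decisions never mutate the master state directly. */
static sched_data_t copy_sched_data(const sched_data_t *src) {
    sched_data_t dst;
    dst.njobs = src->njobs;
    dst.jobs = malloc(sizeof(job_t) * src->njobs);
    memcpy(dst.jobs, src->jobs, sizeof(job_t) * src->njobs);
    return dst;
}

static int by_tickets_desc(const void *a, const void *b) {
    return ((const job_t *)b)->tickets - ((const job_t *)a)->tickets;
}

/* Returns the number of orders generated (one per dispatch decision). */
int scheduling_run(const sched_data_t *master) {
    sched_data_t local = copy_sched_data(master);   /* work on the copy */
    qsort(local.jobs, local.njobs, sizeof(job_t), by_tickets_desc);
    int orders = 0;
    for (int i = 0; i < local.njobs; i++) {
        local.jobs[i].dispatched = 1;  /* pretend every job fits */
        orders++;                      /* one dispatch order per decision */
    }
    free(local.jobs);
    return orders;
}
```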

The default scheduler

Prioritization of jobs

The order in which runnable jobs are dispatched depends on the policies selected by the administrator.

Jobs receive tickets according to four high-level policies. The following is just a very brief overview.


Share tree policy

Individual users and projects have relative entitlements to the overall resources of a Grid Engine cluster: user A might be entitled to receive 30% of all resources, user B 60%, and user C 10%. Such entitlements are relative and are meant to be granted over time. They are relative in the sense that if user C never uses its 10% share, those 10% are distributed among user A and user B in proportion to their 30:60 ratio. Granting over time means that Grid Engine tries to achieve the entitlements over a configurable sliding window of time, e.g. over two months. Such entitlements are so-called long-term entitlements. In order to grant them, Grid Engine has to look at the resource usage which users (or projects) have accumulated in the past, and it also has to compensate for over- or underutilization of resources as compared to the long-term entitlements. Such compensations are done by assigning short-term entitlements to users and projects, which can differ from the long-term resource allocation.
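The redistribution in the A/B/C example can be written down as a small calculation. The helper below is purely illustrative (effective_shares() is not a Grid Engine function): it redistributes the shares of inactive users among the active ones in proportion to their configured entitlements.

```c
#include <stddef.h>

/* Hypothetical helper: compute effective shares when some users are idle.
 * Example from the text: A=30%, B=60%, C=10%; if C is idle, A and B
 * split C's 10% in their 30:60 ratio. */
void effective_shares(const double *configured, const int *active,
                      double *effective, size_t n) {
    double active_sum = 0.0;
    for (size_t i = 0; i < n; i++)
        if (active[i])
            active_sum += configured[i];
    for (size_t i = 0; i < n; i++)
        effective[i] = active[i] ? configured[i] / active_sum : 0.0;
}
```

With the 30/60/10 configuration and user C idle, A ends up with 30/90 ≈ 33% and B with 60/90 ≈ 67%, preserving the 30:60 proportion.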


Long-term entitlements are defined in a hierarchical fashion in a so-called share tree. The share tree can represent the organizational structure of a site which uses Grid Engine. The leaves of the share tree are always individual projects and/or individual users. The hierarchy level above consists of groups of users or projects, then another level of further grouping, and so on up to the root of the tree, which represents the entire organization having access to the cluster.
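A share tree node might be sketched as below. This is an assumption-laden simplification (the real implementation uses CULL lists in libsched, and share_node_t is an invented name); it only shows how an entitlement propagates from the root down to a leaf.

```c
/* Hypothetical share tree node: leaves are users or projects,
 * inner nodes are groups, the root is the whole organization. */
typedef struct share_node {
    const char         *name;
    double              shares;     /* relative weight among siblings */
    struct share_node **children;   /* NULL for leaves */
    int                 nchildren;
} share_node_t;

/* A node's overall entitlement is its fraction of the siblings' shares,
 * multiplied by the parent's entitlement, all the way down from the root. */
double subtree_entitlement(const share_node_t *node, double parent_ent,
                           double sibling_shares_sum) {
    return parent_ent * node->shares / sibling_shares_sum;
}
```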


Functional policy

Here, too, entitlements are defined, but they are based on functional attributes such as being a certain user or being associated with a certain project or user group. The entitlements are static in the sense that Grid Engine does not look at accumulated past usage. So there is no need to compensate for under- or overutilization, and thus no short-term entitlements exist. Functional entitlements represent a kind of relative priority.


Deadline policy

Jobs can be submitted with a deadline. The entitlement of deadline jobs grows automatically as they approach their deadline.


Override policy

The automatic policies above can be overridden manually via this policy. It allows an administrator, for instance, to double the entitlement which a job (but also an entire project or a single user) currently receives. It is also often used to assign a constant level of entitlement to a user, a user group, or a project.


The entitlements assigned to a particular job by each of the four policies above are translated into so-called tickets. Each policy can assign a number of tickets to a job; all those tickets are added together, thereby combining the four policies. The job with the highest number of tickets is investigated by the Grid Engine scheduler first. The -p priority is a means for a user to increase the tickets of one of their jobs and to decrease the tickets of another.
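The ticket combination can be sketched as a sum-and-sort step. The names here (tjob_t, rank_jobs(), the policy enum) are invented for illustration and do not match the real ticket code:

```c
#include <stdlib.h>

/* The four high-level policies described above. */
enum { POL_SHARE_TREE, POL_FUNCTIONAL, POL_DEADLINE, POL_OVERRIDE, NPOLICIES };

typedef struct {
    int    job_id;
    double tickets[NPOLICIES];  /* per-policy ticket amounts */
    double total;               /* sum over all policies */
} tjob_t;

static int by_total_desc(const void *a, const void *b) {
    double d = ((const tjob_t *)b)->total - ((const tjob_t *)a)->total;
    return (d > 0) - (d < 0);
}

/* Sum each job's tickets over all policies, then order jobs so the
 * job with the highest total is examined first. */
void rank_jobs(tjob_t *jobs, int n) {
    for (int i = 0; i < n; i++) {
        jobs[i].total = 0.0;
        for (int p = 0; p < NPOLICIES; p++)
            jobs[i].total += jobs[i].tickets[p];
    }
    qsort(jobs, n, sizeof(tjob_t), by_total_desc);
}
```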


Note that the order in which the scheduler attempts to dispatch jobs is often not identical to the order in which jobs are started. The job with the highest priority, for instance, may not fit on any of the resources which are currently available, so the second job in line may get started first. The code handling jobs in priority order can be found in dispatch_jobs() in scheduler.c.
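The effect can be shown with a tiny, made-up fits-check (first_startable() and pjob_t are not real Grid Engine names): jobs are tried in priority order, but a job that does not fit on the currently free resources is skipped, so a lower-priority job can start first.

```c
typedef struct { int job_id; int slots_needed; } pjob_t;

/* Jobs are assumed to be sorted by descending priority already.
 * Returns the id of the first job that fits, or -1 if none does. */
int first_startable(const pjob_t *jobs, int n, int free_slots) {
    for (int i = 0; i < n; i++)
        if (jobs[i].slots_needed <= free_slots)
            return jobs[i].job_id;
    return -1;
}
```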


Selection of queues

Usually there is more than one queue suitable for a certain job. The queue_sort_method in sched_conf(5) decides which queue is occupied first. If queue_sort_method is set to load, the queue residing at the host with the lowest load is selected first; the queue sequence number (seq_no in queue_conf(5)) is considered only as a second criterion, in case two hosts have the same load index. If, in contrast, seqno is used as queue_sort_method, the queue's sequence number is the first criterion and the load index the second, considered only if two queues have the same sequence number. In both cases the load_formula in sched_conf(5) specifies how the load index is computed. A user can override this selection scheme by using so-called -soft resource requests (see the qsub(1) manual page). The code handling queue selection order can be found in sge_replicate_queues_suitable4job() in libs/sched/sge_select_queue.c.
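The two sort orders amount to swapping the primary and secondary comparison keys. The comparators below are a sketch under that assumption (queue_t and the function names are invented, and the real code computes the load value via load_formula):

```c
#include <stdlib.h>

typedef struct { const char *name; double load; int seq_no; } queue_t;

static int cmp_double(double a, double b) { return (a > b) - (a < b); }

/* queue_sort_method == load: load first, seq_no breaks ties. */
int cmp_by_load(const void *pa, const void *pb) {
    const queue_t *a = pa, *b = pb;
    int c = cmp_double(a->load, b->load);
    return c ? c : (a->seq_no - b->seq_no);
}

/* queue_sort_method == seqno: seq_no first, load breaks ties. */
int cmp_by_seqno(const void *pa, const void *pb) {
    const queue_t *a = pa, *b = pb;
    int c = a->seq_no - b->seq_no;
    return c ? c : cmp_double(a->load, b->load);
}
```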


Dispatch strategies

The function sge_replicate_queues_suitable4job() implements two different approaches to find the best-suited queue(s) with the lowest effort. For straight sequential jobs, the first matching queue is fine. For parallel jobs, however, and for jobs with soft requests, the scheduler has to do more sophisticated, and usually more expensive, work before a decision can be taken.
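The cost difference between the two approaches can be illustrated as first-match versus best-match scans. The scoring below (squeue_t, a soft-request match count) is an invented stand-in for the real suitability logic:

```c
/* Hypothetical per-queue view: is it suitable at all, and how many
 * of the job's soft requests does it satisfy? */
typedef struct { int suitable; int soft_matches; } squeue_t;

/* Sequential job: stop at the first suitable queue (cheap). */
int pick_first_match(const squeue_t *qs, int n) {
    for (int i = 0; i < n; i++)
        if (qs[i].suitable)
            return i;
    return -1;
}

/* Job with soft requests: scan all queues and keep the one that
 * satisfies the most soft requests (more expensive). */
int pick_best_match(const squeue_t *qs, int n) {
    int best = -1;
    for (int i = 0; i < n; i++)
        if (qs[i].suitable &&
            (best < 0 || qs[i].soft_matches > qs[best].soft_matches))
            best = i;
    return best;
}
```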

Alternative Schedulers

Classifying Alternative Schedulers

Alternative schedulers can be classified into three different categories, and it is worth becoming aware of the category of your scheduler before you start a project:


Scheduling API

We know that many people are eager to write their own schedulers and hope to find easy-to-use interfaces for this purpose. As of today, there is no interface which at the same time


1. offers access to all potentially interesting information of the system,

2. has the potential for a well-performing scheduler dispatching many thousands of jobs in a short time frame, and which

3. stays stable from release to release.


But it should be possible to find compromises and to design an interface which is usable for the majority of problems. The challenge here is to keep the interface stable.

Scheduling Framework

The present scheduler framework supports the implementation of alternative schedulers even though a standardized and stable scheduler interface is not yet available. The scheduler code is organized in a way that allows alternative schedulers to be implemented in the same manner as the default scheduler. This gives you the ability to reuse the existing framework.


The first thing you will have to do is add your own scheduler function pointer to the sched_funcs table in sge_schedd.c. Two samples have already been added there to point out the spots to touch when starting with an alternative scheduler. Schedulers implemented in the way suggested by the samples can be activated at runtime by changing the sched_conf(5) parameter 'algorithm'.
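A plausible shape of such a table is sketched below; the real sched_funcs table in sge_schedd.c differs in its entry signature and fields, and "my_algorithm" is an invented name. The idea is simply that the 'algorithm' parameter selects an entry by name:

```c
#include <string.h>
#include <stddef.h>

typedef int (*sched_fn_t)(void *sched_data);

static int default_sched(void *d) { (void)d; return 0; }
static int my_sched(void *d)      { (void)d; return 0; }  /* your scheduler */

/* Hypothetical dispatch table mapping algorithm names to entry points. */
static const struct { const char *name; sched_fn_t fn; } sched_funcs[] = {
    { "default",      default_sched },
    { "my_algorithm", my_sched      },
};

/* Look up the scheduler selected via the 'algorithm' parameter. */
sched_fn_t lookup_scheduler(const char *algorithm) {
    for (size_t i = 0; i < sizeof sched_funcs / sizeof sched_funcs[0]; i++)
        if (!strcmp(sched_funcs[i].name, algorithm))
            return sched_funcs[i].fn;
    return NULL;
}
```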


The picture below classifies the default algorithm and the two code samples:




We expect that some of the functionality implemented for the 'default' scheduler can be reused in many other schedulers, and it should be possible to accumulate a collection of well-performing functions shared between different scheduler implementations. But the 'default' scheduler also uses many functions, like available_slots_at_queue(), that are very specific to its approach. In such cases it is better to reuse only parts of the implementation, or to use them merely as a pattern for your own implementation.


Copyright 2001 Sun Microsystems, Inc. All rights reserved.