Scheduling is a problem that is grounded in many different levels of computer science and computer hardware engineering. Various scheduling and sequencing problems have been addressed since the 1950's by researchers in computer science, operations research and discrete mathematics [64].
The scheduling problem in its most general form is known to be NP-complete [147], as is the creation of optimal execution schedules under a number of conditions. This is due to the large number of inter-related factors that directly and indirectly contribute to the execution time of the individual tasks. Consequently, many heuristics have been developed to generate adequate (but sub-optimal) schedules [56].
The problem that we consider is how to distribute (or schedule) processes among processing elements to achieve performance goal(s), such as minimising execution time, minimising communication delays, and/or maximising resource utilizations [34]. In a system consisting of real-time transactions, each of which requires computational, communication and data resources to processed, scheduling is the problem of allocating resources to satisfy the requirements of those transactions [142].
We recognise four main levels of instruction scheduling: machine-code instruction scheduling; interpreted program code (converted at run-time to machine-code); the scheduling of threads within a program; and the scheduling of programs within a distributed system. These are listed in order of increasing granularity and abstraction from the physical code that is executed by processors.
The systems described in chapter
operate at
the last two levels: the scheduling of threads within a program; and,
the scheduling of programs within a distributed system. This thesis
is concerned primarily with the scheduling of coarse-grained programs
across distributed systems.
In this chapter we review some of the research in the area of
scheduling, across all levels of granularity. We provide discussion of
the applicability of the research to scheduling coarse-grained
programs. The problem of scheduling on a local and global scale is
considered. Different models of scheduling, including static, dynamic
and hybrid approaches are discussed in
section
. The scheduling of independent
programs is discussed in section
and
of dependent programs in section
. A
discussion of the state information that is available to schedulers is
presented in section
, and an historical
review of scheduling is presented in
section
. The applicability of the
existing research to the problem we consider is discussed in
section
. The review is summarised
in section
.
At the highest level, a distinction is drawn between local and global scheduling. The local scheduling discipline determines how the processes resident on a single CPU are allocated and executed; a global scheduling policy uses information about the system to allocate processes to multiple processors with the view of optimising a system-wide performance objective. Under the classification of global scheduling, the problem domain is further broken into the cases where the tasks to be scheduled are independent or related [7].
When executing a complex job in a distributed system, scheduling
occurs in many places and on many levels. Proceeding in a top-down
manner, the program is submitted into a queue of waiting jobs. At some
point, the scheduler will select the job to run. The program (or
current job) is broken into a number of tasks to be executed, each of
which is scheduled, possibly on different machines. When each task
arrives at the target machine, if it involves parallel computation,
then each of those parallel components is then distributed to target
processors. Once the code arrives onto the processor where it will be
executed, it is then scheduled by the operating system. The scheduling
of code onto a processor is done in conjunction with system processes
and quite probably with other users' processes. Shown in
figure
, this is the manner in which
most of the batch queueing systems discussed in
chapter
operate.

Figure 9: Graphical
representation of scheduling. The user's program is submitted to a
queue. When a program is selected to run, the scheduler allocates it
to a target machine, where the local scheduler controls the local
execution. If the target machine is has a parallel architecture, the
program may be distributed across the local processors.
Choosing the processors on which the tasks that comprise a program will run is a difficult decision. There are many factors, from the viewpoints of both the application and the system as a whole, that effect the decision, including: the number of tasks that comprise the program; the priority of the program; the current load of the system's nodes; whether all of the nodes can execute each task; and the resource owner's usage policies.
There are a number of reasons why scheduling programs, or the tasks that comprise the programs, is important to both the users and owners of machines. From the user's point of view, it is important that the programs they wish to run are executed as quickly as possible, using the resources that are best suited to the problem, which are also reasonably accessible to the user. In contrast, the machine owners wish to make the best utilizations of their resource. The term resource does not only refer to the machines that make compute cycles available, but also includes clusters of computers, communications link utilizations and storage services that may be offered, as well as whatever other peripherals each may have attached. The two objectives, fast turnaround time and optimal resource utilizations, are not always complementary. Owners are not usually willing to let a single user utilise all their resources, and users are not usually willing to wait an arbitrarily long time before they are allowed access to particular resources. Scheduling, from both points of view, is the process by which each party achieves a satisfactory quality of service, where the time spent waiting for a number of dedicated nodes on a supercomputer that is connected to a massive storage engine may be traded off against the actual opportunity to use the resources.
There is a complex trade-off between the user using resources that an owner may charge little (or nothing) to use, but may be of lower performance or have fewer desirable attributes (such as storage systems or high-bandwidth interconnections to other resources), and a larger resource which costs more to use but on which the problem will be solved more quickly and will be able to utilise any specialised peripherals that are attached to the system.
There is no single set of terminology that is used throughout the scheduling literature. For example, the terms program, job and task, and the terms host, processor and node are often used interchangeably in the literature.
When we use the term program, we refer to code that is directly executable by the user of a system, such as a command shell or a compiled application. We use the term job to refer to a collection of programs that are run concurrently or sequentially, for a single purpose. An example of a job is a shell script that calls several programs. The input- and output-dependencies between the programs that comprise a job are sometimes termed the job's processing pipeline. We use the term task to refer to the code that runs on a single processor. The task may consist of a number of threads of control.
As multi-processor workstations become commodity items, a new distinction is being made between the term processor and the terms host and node. When only uniprocessor machines were available, the terms were equivalent; now one must be careful to use the correct terminology. We use the term node, or processing entity (PE), to refer to a processor which is not normally accessed by the user. For example, as the processors that comprise the parallel processing array in a Thinking Machines CM-5 are not directly accessible by the user, they are termed nodes. We use the term hosts to describe the processors to which users usually have direct access.
There are different approaches to the selection of processors onto which programs will be placed for execution. The models range from static, where each of the programs is assigned once to a processor before execution of the program commences, to dynamic, where a program may be reassigned to different processors before being executed. Finally, we describe a hybrid approach, which combines those taken by the static and dynamic models.
One of the first taxonomies of scheduling in distributed computing was performed by Casavant and Kuhl [34]. They divided the problem domain into static and dynamic scheduling. Static scheduling involves assigning the programs that comprise the job to processors and then not allowing them to move. This minimises turnaround time for the user but is not responsive to changes in the execution environment. Dynamic scheduling, or load balancing, involves the averaging of load over the chosen processors, in an effort to maximise average resource utilizations and reduce turnaround time for the user.
Many theoretical studies consider off-line systems, and search for optimal solutions using the assumption that everything is known a priori. They often also assume nothing changes in the system's environment. Real systems, however, operate in a dynamic on-line environment, and need to contend with unpredictable arrivals of new work [62].
In the static model, every task comprising the job is assigned once to a (possibly different) node. Thus, the placement of a program is static, and a firm estimate of the cost of the computation can be made. Heuristic models for static task scheduling are discussed in [206].
One of the major benefits of the static model is that it is easier to program from a scheduling and placement view. The placement of tasks is fixed a priori; it is easy to monitor the progress of the computation and hence termination of processing is simplified. By the same reasoning, estimating the cost of jobs is simplified. Processors can give estimates of the time that will be spent processing programs. On completion of the program, the processor can be instructed to supply the precise time that was spent in processing. This allows the cost to the user to be updated, as well as any internal representations that are used for making performance estimates of new programs.
Using this model, it is also possible to reserve resources. Thus, instead of the scheduler that is creating the placement information simply looking up a (dynamic) table of host names and the programs available on those hosts, the system could send a reservation message to the host, enquiring as to its willingness to run a program.
When scheduling jobs in a static fashion, only the scheduler which creates the job placement and cost estimates needs to know the existence and relative costs of the other hosts in the system. Although other hosts may know about the programs that another host can execute, all they must know is how to send messages to the next (or previous) host in the job's processing pipeline. Using the static model, each host is told from where to get the input parameters that are needed, or where to send the outputs of the programs.
The static model allows for a `global view' of programs and costs. Heuristics are used to decide whether to incur slightly higher processing costs in order to keep all the programs involved in a job on the same or tightly-coupled nodes, or to search for lower computational costs and be penalised with slightly higher communication costs.
Unfortunately, the static model does not allow for the very real possibility that one of the nodes selected to perform a service may have failed, be isolated from the system due to network failure, or at least so heavily laden with jobs that it is not responding.
General-purpose dynamic scheduling operates on two levels: the local scheduling discipline, and a load distribution strategy. The load distributing strategy determines how programs will be placed on remote machines. It uses an information policy to determine which information is to be collected from each machine in the processor pool, at what frequency, and also how often the local information should be exchanged with other machines. Scheduling may be approached from many different aspects, including the viewpoints of the user and the resource owner.
In a strictly dynamic scheduling model, the tasks that comprise a parallel or distributed job are assigned to processors based on whether a processor predicts that it can provide an adequate quality of service to a task. The meaning of quality of service is dependent on the application. The term includes: whether a maximum bound can be placed on the time a job will have to wait before starting execution; the minimum time quanta that a given job will be able to execute without interruption; and, the relative speed of a processor when compared to others in the processing pool. If the processor is assigned too many tasks, it may invoke a transfer policy to decide whether to transfer some tasks, and which tasks to transfer. A location policy determines which processor(s) will receive the tasks. There are two important models for location policies: sender-initiated and receiver-initiated. These depend on whether the sender polls potential targets for task transfer, or whether processors that are willing to receive extra tasks advertise the fact, respectively.
The advantage of dynamic load balancing over static scheduling is that the system need not be aware of the run-time behaviour of the application before execution. Dynamic load balancing is particularly useful in a system where the primary performance goal is the maximising of processor utilizations as opposed to the minimisation of runtime for individual jobs [207, ]. This is often found in systems consisting of networks of workstations.
The main consideration involved in mapping threads to PEs is the requirement to balance the loads. It has been shown that load balancing is crucial to achieving adequate performance [53, ]. Load balancing can be performed in one of two fundamental ways, where the decision of where to place the new thread is either made by a local queue or global queue. In a local queue, each PE makes its own decision; in a global queue there is a single point from where the threads are dispatched.
There are a number of methods that have been proposed for mapping a thread to a processor (from Feitelson [62] section 4.1.1) using local queues, for example:
Global queues are easy to implement in shared memory machines; they are not realistic options for distributed memory machines due to the need to maintain coherent lists of available tasks [62]. In most systems that use a global queue, tasks are allocated to a PE, execute for a time quantum, and are returned to the queue for re-allocation. The main advantage of using a global queue is that of load sharing, in which all processors get a roughly equal proportion of the workload. The disadvantages are the contention for the shared global queue, and the lack of memory locality.
In the dynamic model, no placement information is assigned to the job. Consider the case in which a computation can be divided into a number of dependent parts. Initially, a node advertising the job's first service will be passed the job and once the task is complete, the node will consult its own records to determine the location (and cost information) of a server that can perform the next task, and will dispatch the job to that host.
This approach forces all hosts to be aware of, if not all, then at least a large part of the system. It does, however, provide for the truly dynamic nature of the system, in which nodes and services may or may not be available, and communication links may be down. This suggests that there may be problems with the dissemination of server (and service) information to the remainder of the system. Problems may arise if the network becomes segmented.
The server estimates the cost of each part of the job's processing pipeline, and assigns each part to the processors in a manner that minimises the total execution time. Thus, the assignment of computation parts to processors is a static assignment. Problems arise if, after assignment, one of the servers selected to participate in the computation is aware of an alternative server that can perform the part at less expense. If the original schedule is used for execution, the computation will take longer than it could have through using the alternate server.
The onus, therefore, is on the current server to find the next, most appropriate server to continue the computation. It may be advisable to configure the system with facilities to `backtrack' servers so if a server does not know about any appropriate servers to continue the computation, the job may be passed back to the last server to host it. The system would have to be carefully monitored, and trained not to follow local cost minima, which allow the computation to be passed to more remote, but more communications-costly nodes when it comes to return the results to the user.
This model runs into difficulties where the computation forks into a number of concurrent processing stages. One possible solution to this problem is to require that concurrent computations send periodic updates to the server which was used before the concurrent execution was started, to synchronise and continue the computation.
There needs to be some semblance of a priority system. To prevent users from abusing the priority system by requesting everything be run at the highest priority, it may be decided to take the solution that has been adopted in CCS [193], where the user is charged for the priority level at which they submit their job.
The performance of dynamic load
balancing models in a heterogeneous system was studied by Chow and
Kohler [42], where they considered deterministic
and non-deterministic routing strategies. The non-deterministic
strategy uses a probability,
of a process being routed to
processor i, which is either chosen arbitrarily or based on a
function of the system parameters.
Deterministic routing involves a job dispatcher which routes jobs to processors according to some policy. Three deterministic policies are investigated: minimum response time, minimum system time, and maximum throughput. Chow and Kohler's study concludes that the maximum throughput policy, which uses information on the arrival rate as well as service rate and system state, is more optimal than those policies that base their decisions on the service rate and system state only.
We discuss the topic of independent tasks further in
chapter
. A tool is presented that allows the
comparison of different static and dynamic scheduling algorithms for
the placement of independent jobs. We use a number of simple placement
algorithms and also a variant of Chow and Kohler's algorithm.
There are a number of different ways in which schedules for the execution of related tasks can be generated in the situation where the resources that are to be used are dedicated. Feitelson and Rudolph [63] recognise a number of different static approaches to parallel job scheduling, including critical path methods, list scheduling methods, and the partitioning of a DAG into clusters of nodes. As creating an optimal schedule is NP-complete, heuristic algorithms based on list schedules and clustering techniques are common [4, , ].
List scheduling is relatively straightforward to implement. Most commonly, list scheduling is used to place the tasks that comprise a parallel program [4]. Tasks comprising a program are assigned priorities and placed in a list ordered in decreasing priority. Whenever tasks contend for processors, the highest priority task that is immediately executable is assigned first. List schedules may be preemptive or non-preemptive; if there are two tasks with the same priority that can execute, one is chosen randomly.
List scheduling algorithms rely on the knowledge of task execution
times, and precedence relationships. They are most often used to
schedule dependent tasks on clusters of homogeneous
processors [4]. Given a set of tasks
, which has the precedence graph G, we let the
execution time of task
be
. The length of a directed path
is defined as the sum of all the weights along the path including the
initial and final vertex. The level of a task within a program
is defined as the length from a vertex
to an exit node (a node
which has no successors). Similarly, the co-level of a task is
defined as the maximum distance from the vertex
to the entry
vertex (which has no predecessors). A table can therefore be created
in which each task is assigned a level and
co-level. Figure
shows a precedence
graph and a table detailing each task's level and co-level. The level
and co-level are used to order the tasks for execution. There is no
allowance made for tasks not to be able to execute on any processor in
the pool.
The trade-off between cost and accuracy of list schedules is discussed by Adam, Chandy and Dickson [4], where they consider the problem of scheduling two or more homogeneous processors to minimise the execution time of a program which consists of partially-ordered tasks. Independent tasks are not considered. They found that when the tasks that comprise a program are organised into a precedence graph, the most optimal schedules are found when using the levels of tasks, incorporating the task execution times to increase the estimate of the priority which to assign the task. In this case, the generated list schedules were found to be within 4.4% of the optimal execution time. Communications time between tasks was not considered in this study.
Clustering, or processor assignment, involves the collection of tasks that exchange a large amount of data onto the same processors, while at the same time distributing the tasks in order to achieve good load balancing [183]. Heuristics have been suggested for clustering [86].
Yang and Gerasoulis' dominant sequence clustering (DSC) algorithm [240] is one part of a multi-part graph scheduling system [84]. The algorithm clusters the dominant sequence of a directed acyclic graph (DAG), which is the critical path of a DAG whose nodes have been allocated to processors. The critical path of a DAG is the path connecting the programs which dominate the execution time of the DAG. If the programs on the critical path are not optimally scheduled, the resulting execution time for the job will be non-minimal. Once the dominant sequence of the DAG has been clustered, the remainder of the DAG is placed so as to minimise total execution time. The work on DSC was later applied to iterative task graphs [77, , ].
An example of clustering is shown in
figure
. In this figure, nodes 1, 2 and 4
have been clustered together and assigned to host 1; nodes 3 and 5
have been clustered and assigned to host 2; and nodes 6 and 7 have
been clustered and assigned to host 3. The decision of which nodes to
cluster together are based on heuristics, including the amount of data
to be shared between the nodes, and the relative speeds of each of the
hosts. Clustering is discussed in the context of our model in
chapter
. We use an algorithm based
loosely on DSC.
The situation in which everything is known about the nodes
participating in a distributed system is called complete system state
information. This concept is shown in
figure
, where each node is aware
of all operations that each of the others can support. In the
case in which all nodes are under dedicated control of a master node,
complete system state is not an unreasonable assumption. In a real
distributed system, characteristics including the load on each machine,
and the number of tasks awaiting execution comprise the reported
system information.
In practice, the use of complete system state information means that
the algorithm performing task scheduling does not need to estimate any
of the characteristics of remote nodes. This allows a scheduler to
make accurate predictions on the behaviour of a remote node. Most
resource management systems (see section
) assume they
have complete system information, which is updated either by the
master explicitly polling the slave machines; the slave machines
regularly reporting their state information to the master; or as found
in [172], there may be a hierarchical arrangement
through which complete information is propagated and can be
searched.
If the nodes are regular in behaviour and use, maintaining complete system state information may not be difficult. In the case where nodes are dynamically joining and leaving the distributed system, or are not dedicated to the use of the distributed environment, such maintenance is a non-trivial problem. In the case where maintaining complete system state information is too difficult or impractical, partial system state information may be still be of value.

Figure 12: Complete system state
information. Each node has complete, accurate information about all
other nodes. For example: node A can support operations F and G; node
B can support operations H and I; node C can support operations J and
K; and, node D can support operations L and M. This information is
replicated across all nodes in the distributed system.

Figure 13: Consequences of partial system state information, When
the state information maintained by a distributed system is not
complete, some nodes may know about resources that others do not. i)
Initially, each node only knows about itself and the operations it
supports; ii) nodes B and C exchange knowledge; iii) nodes C and D
exchange knowledge; iv) nodes A and B exchange knowledge. These
exchanges result in partial system state knowledge being maintained
about the system. If node A needs to know where an operation, L, is
supported, then it can ask those nodes that it knows about, in
the hope that they know of such a node. In the case of static
information, complete system state information is attained quickly;
the dynamic case is more complicated as information needs to be
updated periodically.
Partial system state information allows nodes to keep their own view
of the complete system state. This idea is illustrated in
figure
.
Figure
i) shows the initial state of
the distributed system where each node is unaware of the other nodes.
Through some mechanism, nodes B and C may need to communicate. The
results of this communication are shown in
figure
ii). This may occur as: a
product of a multi-cast request by either node; a manager at one of the
nodes may be aware that the other is now online; or, after restoring
some saved state information, either node may see some information
that references the other, and seek to update its information.
The problem of partial system state information is similar to that of assuring consistency amongst distributed databases. An example of this problem is that found in the Internet Domain Name System (DNS) [160, ]. Machines connected to the Internet always have a unique address, called the IP address [221], which is used for all communications. In addition, they usually have a human-readable names. For example, the host named opal.cs.adelaide.edu.au has the IP address 129.127.8.80. Within DNS, each logical group of computers is termed a domain. Domains can be split into further sub-domains, and many domains can be part of a larger domain in a tree-like fashion [9]. The information describing the mapping is stored locally to a domain. DNS exists as a mechanism for the exchange of local information between distributed domains. It relies on a central authority to delegate the authority to create domain names, and because of its hierarchical nature, when a hostname is unable to be resolved by the local domain, the request is passed to the larger, encapsulating, domain [199]. The design of DNS is not a directly useful approach to the problem we consider in the remainder of this thesis, as we do not impose a hierarchical ordering of logical domains in the DISCWorld.
Although not as accurate as complete system state information, partial
information is easier to maintain. Schedulers can use partial
system state information when constructing placement schedules. We
impose the restriction of partial system state information in the
model and algorithm developed in chapter
.
Static scheduling is simpler to implement and it seems to be the more widely-used approach to scheduling. The literature features many examples of static scheduling, where the authors have made different assumptions about such details as:
This section presents an historical review of scheduling literature. Of course, this is not an exhaustive list of all the literature; it encompasses what we believe to be important studies. We are not interested in the complexity of the models - just the models themselves.

Table 3: Comparison of models found in the scheduling
literature. The type of scheduling can be either Static, Dynamic or
Hybrid (static then dynamic). Task allocation can either be
Restricted, Semi-Restricted or Unrestricted. The model inputs can be
an Arbitrary Graph, a Tree, a Precedence Graph, a DAG, or Unrelated
Tasks. Task execution time is either Known or Unknown, and inter-task
communication time in the model is either included as a constant
constant (Y), included but variable (V), or not included (N). The
processors used by the model are either homogeneous or heterogeneous.
Table
shows a comparison of models
found in the scheduling literature. The majority of the systems
studied were formulated in the context of parallel program execution
on tightly-clustered homogeneous processors. It can be seen that most
of the systems do not incorporate the concept of restricted program
placement between nodes - they assume that all programs can
be placed on any processor in the pool. In addition, the input to most
of the models is a tree, or graph, of programs (or tasks). Tasks
organized into a tree have precedence relations; they are instances of
task graphs with multiple source or sink nodes. In most cases, an
arbitrary task graph, or a precedence graph, is implicitly
directed. If the directed graph does not contain cycles, it is termed
a directed acyclic graph (DAG). DAGs are used to represent processing
requests in the model and algorithm developed in
chapter
. Some of the systems featured in
table
are described in the remainder
of this section.
Stone [215] used a modified commodity flow algorithm and cutsets for scheduling an arbitrary graph over two and three heterogeneous processors. Each task had known, but not identical execution time. Each branch in the network had an associated amount of information to be transferred. The capacity of the network, was used to measure the actual amount of information flowing between the tasks at runtime. Cutsets divide the graph into information sources and sinks. The model incorporated variable communication time between processors. It also allowed concurrent execution, but neither parallel execution or loops and conditionals in the input graph. The model had the main characteristic of the input task graph being undirected, thus representing two-way communication between modules; the cost of moving a computation to another processor is traded off against communication costs of transferring the results between processors. This model was one of the first to consider semi-restricted tasks. Some tasks were assigned to processors; others were able to `float' between processors during program execution.
Kaufman [138] used a longest path method to statically schedule a tree of tasks onto n homogeneous processors, in an unrestricted fashion. The longest path method is equivalent to a list scheduling algorithm, ordering by decreasing task level. Trees of tasks have precedence relations between tasks. Tasks were assumed to have known but non-identical execution times, but communication time was not incorporated into the cost model. Concurrent execution of tasks was supported, but not parallel execution, or loops and conditionals in the input tree. This was the first study to consider tasks with non-homogeneous execution times.
List schedule algorithms were compared in [4],
across
homogeneous processors, using a
precedence graph as input. Tasks were assumed to be unrestricted in
assignment, and had random, but deterministic execution
times. Communication time was not incorporated into the cost
model. Concurrent execution time was allowed, but not parallel
execution, and loops and conditionals were not supported. If tasks
are treated as programs, and task intercommunication is ignored
(i.e. tasks are independent), list scheduling is equivalent to the
first-come first-serve scheduling algorithm used in the tool
described in chapter
.
Bokhari [28] featured a static then
dynamic scheduling algorithm, using unrestricted placement of a tree
of tasks, across
heterogeneous processors. Tasks had
known but non-identical execution time, and variable communication
time was incorporated into the cost model. This study was one of the
first to consider variable communication times. Parallelism between
tasks, loops and conditionals in the input tree were not
considered. All processors are assumed to be dissimilar, but are able
to execute any task in the tree. The shortest tree algorithm is used
to minimise the sum of execution time and inter-processor
communication time. The algorithm suggested by this study exhaustively
iterates through all possible assignments of modules to processors,
using the shortest tree. This algorithm is similar to the model that
we develop in chapter
.
Chou and Abraham [41] suggested seven program descriptors for dynamic execution on distributed systems and policy iteration. The descriptors are: execution time of a task; the communication time for the results of a task; probability a task fails on a processor; the time to create a checkpoint for a task; time to restart a failed task; time to initiate a set of concurrent tasks on a processor; and, the communication time for the results of a set of concurrent tasks. The paper presents an algorithm for optimal task assignment on n heterogeneous processors, featuring probabilistic conditional branches in the arbitrary input task graph. The study was one of the first to consider parallel as well as concurrent execution of application tasks, and although communication was considered in the cost model, the communication cost between processors was fixed.
Towsley [224] extended Bokhari's shortest path method, and incorporated the concept of processor reliability from [41] to produce run-time estimations by loop-unravelling. An arbitrary task graph is scheduled for execution on n heterogeneous processors, where tasks are either allowed to execute on any processor, or they are all assigned to certain processors. Task execution time is known but non-identical, and communications time is incorporated into the cost model, albeit fixed. This research is significant because it supports conditionals and loops within the arbitrary task graph as input.
Chen and Eshaghian [40] present a fast recursive mapping algorithm for parallel applications on parallel systems. Their system involves the clustering of the algorithm's task graph, and the separate clustering of the parallel system's graph. The algorithm provides for the mapping between the two clustered graphs in O(MN) time, where M is the number of task modules and N is the number of underlying processors. In this study, the authors assume that each module (comprising the application) executes for a single time unit, and communicates a uniform amount of data to any child modules. They further assume that each processor is the same speed and that all have the same network characteristics (transmission rate and latency). Although this study is not applicable in the situation we consider for this thesis, it is applicable to the case where only a single variable (either the program graph or the architecture of the parallel machine) is changing. Thus, many program tasks, for example, could be mapped to a single parallel architecture graph.
The automatic heterogeneous supercomputing (AHS) system [49] uses a semi-dynamic strategy for scheduling user application programs. The system maintains a file for each application program containing an estimate of the execution time of that application on each of the available machines; when the user invokes the program, a machine is chosen such as to minimise the turnaround time, which is a function of the current load on the machine and the expected execution time on that machine. The system is semi-dynamic in that once the application is started on a given machine, there is no intervention from the scheduler. The amount of information that this scheduling algorithm uses is minimal, and while it benefits from low scheduling overhead, it does fail to take into account any issues arising from the input data locality.
The self-adjusting dynamic scheduling (SADS)
algorithm [101] develops a detailed cost model
which is used to trade-off effects of processor load imbalance,
interprocessor communication delays and scheduling and synchronisation
overhead. It creates partial schedules, minimising the cost function
by using an algorithm similar to
branch-and-bound [236]. The time that is taken to create
the partial schedules is bound by the speed with which least-loaded
processor in the pool can complete its tasks. Once the
least-loaded processor completes its tasks, the computed partial
schedules are send to all the machines in the pool, and the process
begins again. This algorithm suffers from the need to have a dedicated
processor to create the partial schedules, whether this processor is
the same all the time, or whether it is chosen at execution time. This
is extended in the self-adjusting scheduling for heterogeneous systems
(SASH) algorithm [102]. The SADS algorithm
is similar to the continual adaptive scheduling algorithm for
independent programs that is considered in
chapter
.
The scheduling scheme suggested by Shirazi, Chen and Marquis [205] involves the clustering of a program's task graph into linear tasks, and then statically scheduling them after optimisation. Linear clustering of nodes is used to decrease the effects of inter-processor communications, while parallelising the nodes reduces the execution time. The implementation of this algorithm has the drawback that one of the assumptions that it makes is that the task graph that is to be used as input has a single root node, that an unbounded number of target processors are available, and they are homogeneous. No consideration is given to contention that may arise from running multiple programs at once, or the effects of introducing heterogeneity into the processor pool.
Weissman and Grimshaw [233] present a framework for partitioning data parallel computations in heterogeneous environments, which is implemented in Mentat. This framework is designed to effectively utilise clusters of workstations and supercomputers with different communications topologies, with a view to reducing the total elapsed time as seen by a program. Different communication topologies are characterised by different cost functions. This study includes the notion of router delays and per-byte data coercion (between data formats) costs. The values used for communications costs in the study are for an ideal system in stand-alone mode, and do not consider the effects of real-time machine load. The process placement algorithms used are communication topology-dependent, and rely on callbacks from the tasks whilst they are in computation and communication phases to provide the information that is needed to make intelligent placement decisions.
Shirazi, Wang and Pathak [206] consider the effects of three static scheduling algorithms in the context of a multiprocessor environment, where the execution time of each of the components of the input are known and invariant. Their study concentrates on achieving the fastest turn-around time for an input DAG. The algorithms they consider are the critical path method, and two novel algorithms, a heaviest-node first method, where the longest-executing program nodes are assigned to the least-loaded processors first, and a weighted length algorithm which extends the critical path algorithm by incorporating such information as the branching factor of each node, the number of children, and their weights. Although novel, and effective in terms of computational complexity, the models do not take into consideration the delay effects of interprocessor communication or the effects of any processor heterogeneity.
El-Rewini and Lewis consider the problem of scheduling parallel
program tasks onto arbitrary
machines [57] in the presence of
contention between processing nodes. They use a scheduling heuristic,
MH, which produces a static schedule based on list scheduling. A
task graph and a undirected graph, representing the target machine,
are used by the mapper algorithm, which produce a Gantt chart of the
resulting schedule. Contention on interconnection links is modeled as
a extra delay which is constant for a given link. Although the effects
of contention on interconnection links is ignored, we consider the
effects of assigning multiple tasks to a single processor in the model
we develop in chapter
.
Malloy, Lloyd and Soffa [156] extended existing research based on the critical path method to the fine-grained parallelism found in program instruction streams, and implemented their model on a two-processor homogeneous system. This research is too fine-grained to be applicable to the problem that is under consideration.
While a variety of different scheduling and placement models have been
studied in the literature, there is very little evidence of most of
the models being used in practice. Although scheduling and process
placement are vital parts of a distributed system, most of the cluster
computing environments described in chapter
do not seem to have been designed with any algorithms in mind.
There is a great deal of existing research that focuses on scheduling DAGs across a bounded or unbounded number of target nodes. Although useful, results which only consider a homogeneous collection of nodes, such as [203, , ], are not helpful in the case of heterogeneous nodes. Still more research [156] focuses on exploiting fine-grained parallelism found in instruction streams. That too, is not helpful in the case where a processing pool is comprised of heterogeneous nodes separated by a wide-area broadband network.
Unfortunately the existing research (eg [4, , , , , , , , ]) that considers a collection of heterogeneous nodes all assume that either:
The general case that we consider has the following characteristics, for which an adequate solution does not yet exist:
The scheduling literature provides solutions for a number of special cases. However, we believe that the previous results are not completely applicable for the system model that is found in some wide-area distributed systems, such as our DISCWorld prototype metacomputing environment. They do not address the fundamental constraints and features found in our system: lack of up-to-date global system information; restricted placement and mobility of data and services; global naming strategy for data and services, by which data may be re-used; and the single assignment nature of data.
There are a number of popular approaches to this problem, most of which are related to the critical path method. We consider services that can be executed on a number of different nodes within the distributed environment, each with a different set of transfer and execution costs. Due to the sheer computational complexity of computing a critical path for every possible node assignment, we feel the classical approaches of list scheduling [4] and clustering [201, 240] are not applicable. In addition, the above techniques fail to take into account the probable need to transfer input data to each service before it can be run - which is different from the data created as the output of a previous processing step. Finally, Sarkar's two-stage algorithm [201] and DSC only consider the case of homogeneous compute nodes, which is too restrictive for the more general case we consider.
A number of heuristics and partial solutions may be combined to form a solution to the problem we consider. We use the general idea of clustering, and in particular, a modified version of Yang and Gerasoulis' DSC algorithm to minimise the total execution time of an individual job.
We consider a processing request to be made up of a number of programs,
which may have inter-dependencies. When all processors in a
distributed system are controlled, and assigned programs by a single
front-end processor, static scheduling, as described in
section
, is best. This is because the
complete state of the system is knowable. There are two major
drawbacks to using a centralised scheduler. Firstly, the use of a
centralised scheduler may cause a processing bottleneck, and secondly,
the centralised scheduler introduces a single point of failure. This
means that if the node hosting the scheduler suffers a failure, either
of the physical machine or the network interconnecting the scheduler
with the remainder of the processors in the pool, no more jobs will be
able to be scheduled.
If all processors have the ability to assign programs to run in the distributed system, then dynamic scheduling, or load balancing, is more appropriate because of the lack of control by an individual node over the whole system. Most dynamic scheduling algorithms seek to obtain the best performance for individual programs under consideration, in the assumption that by doing so, the complete processing request will be efficiently executed; they perform no forward-planning.
In chapter
we present a tool we have developed
that allows quantifiable analysis of static and dynamic scheduling
algorithms when used to execute independent programs injected to a
cluster of computers. When the distributed environment is relatively
stable, we find that in the case where the programs to be run are
easily characterised, a static scheduling algorithm provides good
execution time; when the programs are not easily characterised, a
dynamic algorithm is needed.
In the situation where the node from which a processing request is submitted has information on the system state (even if incomplete), there is benefit in using that information to produce a heuristically good execution schedule. If the execution schedule is created with static task-to-processor assignment, and then modified at execution time, we call this model hybrid static-dynamic. Thus, the node creates the best static schedule that is heuristically possible using its current system state information as an estimate. Since other nodes may have different system state information, at execution time they may modify the schedule to make it more efficient for the processing request's execution.
Some of the systems we review in chapter
provide solutions which incorporate subsets of the conditions which
characterise our system. In chapter
we develop a
model for scheduling in wide-area systems. The DISCWorld model is
designed for jobs that consist of inter-dependent high-level programs,
which we subsequently describe in chapter
.