next up previous
Next: Scheduling Independent Tasks Up: Scheduling in Metacomputing Systems Previous: Cluster Computing

A Review of Scheduling

 

Scheduling is a problem that is grounded in many different levels of computer science and computer hardware engineering. Various scheduling and sequencing problems have been addressed since the 1950's by researchers in computer science, operations research and discrete mathematics [64].

The scheduling problem in its most general form is known to be NP-complete [147], as is the creation of optimal execution schedules under a number of conditions. This is due to the large number of inter-related factors that directly and indirectly contribute to the execution time of the individual tasks. Consequently, many heuristics have been developed to generate adequate (but sub-optimal) schedules [56].

The problem that we consider is how to distribute (or schedule) processes among processing elements to achieve performance goal(s), such as minimising execution time, minimising communication delays, and/or maximising resource utilizations [34]. In a system consisting of real-time transactions, each of which requires computational, communication and data resources to processed, scheduling is the problem of allocating resources to satisfy the requirements of those transactions [142].

We recognise four main levels of instruction scheduling: machine-code instruction scheduling; interpreted program code (converted at run-time to machine-code); the scheduling of threads within a program; and the scheduling of programs within a distributed system. These are listed in order of increasing granularity and abstraction from the physical code that is executed by processors.

The systems described in chapter gif operate at the last two levels: the scheduling of threads within a program; and, the scheduling of programs within a distributed system. This thesis is concerned primarily with the scheduling of coarse-grained programs across distributed systems.

In this chapter we review some of the research in the area of scheduling, across all levels of granularity. We provide discussion of the applicability of the research to scheduling coarse-grained programs. The problem of scheduling on a local and global scale is considered. Different models of scheduling, including static, dynamic and hybrid approaches are discussed in section gif. The scheduling of independent programs is discussed in section gif and of dependent programs in section gif. A discussion of the state information that is available to schedulers is presented in section gif, and an historical review of scheduling is presented in section gif. The applicability of the existing research to the problem we consider is discussed in section gif. The review is summarised in section gif.

At the highest level, a distinction is drawn between local and global scheduling. The local scheduling discipline determines how the processes resident on a single CPU are allocated and executed; a global scheduling policy uses information about the system to allocate processes to multiple processors with the view of optimising a system-wide performance objective. Under the classification of global scheduling, the problem domain is further broken into the cases where the tasks to be scheduled are independent or related [7].

When executing a complex job in a distributed system, scheduling occurs in many places and on many levels. Proceeding in a top-down manner, the program is submitted into a queue of waiting jobs. At some point, the scheduler will select the job to run. The program (or current job) is broken into a number of tasks to be executed, each of which is scheduled, possibly on different machines. When each task arrives at the target machine, if it involves parallel computation, then each of those parallel components is then distributed to target processors. Once the code arrives onto the processor where it will be executed, it is then scheduled by the operating system. The scheduling of code onto a processor is done in conjunction with system processes and quite probably with other users' processes. Shown in figure gif, this is the manner in which most of the batch queueing systems discussed in chapter gif operate.

  figure466
Figure 9: Graphical representation of scheduling. The user's program is submitted to a queue. When a program is selected to run, the scheduler allocates it to a target machine, where the local scheduler controls the local execution. If the target machine is has a parallel architecture, the program may be distributed across the local processors.

Choosing the processors on which the tasks that comprise a program will run is a difficult decision. There are many factors, from the viewpoints of both the application and the system as a whole, that effect the decision, including: the number of tasks that comprise the program; the priority of the program; the current load of the system's nodes; whether all of the nodes can execute each task; and the resource owner's usage policies.

There are a number of reasons why scheduling programs, or the tasks that comprise the programs, is important to both the users and owners of machines. From the user's point of view, it is important that the programs they wish to run are executed as quickly as possible, using the resources that are best suited to the problem, which are also reasonably accessible to the user. In contrast, the machine owners wish to make the best utilizations of their resource. The term resource does not only refer to the machines that make compute cycles available, but also includes clusters of computers, communications link utilizations and storage services that may be offered, as well as whatever other peripherals each may have attached. The two objectives, fast turnaround time and optimal resource utilizations, are not always complementary. Owners are not usually willing to let a single user utilise all their resources, and users are not usually willing to wait an arbitrarily long time before they are allowed access to particular resources. Scheduling, from both points of view, is the process by which each party achieves a satisfactory quality of service, where the time spent waiting for a number of dedicated nodes on a supercomputer that is connected to a massive storage engine may be traded off against the actual opportunity to use the resources.

There is a complex trade-off between the user using resources that an owner may charge little (or nothing) to use, but may be of lower performance or have fewer desirable attributes (such as storage systems or high-bandwidth interconnections to other resources), and a larger resource which costs more to use but on which the problem will be solved more quickly and will be able to utilise any specialised peripherals that are attached to the system.

There is no single set of terminology that is used throughout the scheduling literature. For example, the terms program, job and task, and the terms host, processor and node are often used interchangeably in the literature.

When we use the term program, we refer to code that is directly executable by the user of a system, such as a command shell or a compiled application. We use the term job to refer to a collection of programs that are run concurrently or sequentially, for a single purpose. An example of a job is a shell script that calls several programs. The input- and output-dependencies between the programs that comprise a job are sometimes termed the job's processing pipeline. We use the term task to refer to the code that runs on a single processor. The task may consist of a number of threads of control.

As multi-processor workstations become commodity items, a new distinction is being made between the term processor and the terms host and node. When only uniprocessor machines were available, the terms were equivalent; now one must be careful to use the correct terminology. We use the term node, or processing entity (PE), to refer to a processor which is not normally accessed by the user. For example, as the processors that comprise the parallel processing array in a Thinking Machines CM-5 are not directly accessible by the user, they are termed nodes. We use the term hosts to describe the processors to which users usually have direct access.

Scheduling Models

 

There are different approaches to the selection of processors onto which programs will be placed for execution. The models range from static, where each of the programs is assigned once to a processor before execution of the program commences, to dynamic, where a program may be reassigned to different processors before being executed. Finally, we describe a hybrid approach, which combines those taken by the static and dynamic models.

One of the first taxonomies of scheduling in distributed computing was performed by Casavant and Kuhl [34]. They divided the problem domain into static and dynamic scheduling. Static scheduling involves assigning the programs that comprise the job to processors and then not allowing them to move. This minimises turnaround time for the user but is not responsive to changes in the execution environment. Dynamic scheduling, or load balancing, involves the averaging of load over the chosen processors, in an effort to maximise average resource utilizations and reduce turnaround time for the user.

Many theoretical studies consider off-line systems, and search for optimal solutions using the assumption that everything is known a priori. They often also assume nothing changes in the system's environment. Real systems, however, operate in a dynamic on-line environment, and need to contend with unpredictable arrivals of new work [62].

Static Scheduling

In the static model, every task comprising the job is assigned once to a (possibly different) node. Thus, the placement of a program is static, and a firm estimate of the cost of the computation can be made. Heuristic models for static task scheduling are discussed in [206].

One of the major benefits of the static model is that it is easier to program from a scheduling and placement view. The placement of tasks is fixed a priori; it is easy to monitor the progress of the computation and hence termination of processing is simplified. By the same reasoning, estimating the cost of jobs is simplified. Processors can give estimates of the time that will be spent processing programs. On completion of the program, the processor can be instructed to supply the precise time that was spent in processing. This allows the cost to the user to be updated, as well as any internal representations that are used for making performance estimates of new programs.

Using this model, it is also possible to reserve resources. Thus, instead of the scheduler that is creating the placement information simply looking up a (dynamic) table of host names and the programs available on those hosts, the system could send a reservation message to the host, enquiring as to its willingness to run a program.

When scheduling jobs in a static fashion, only the scheduler which creates the job placement and cost estimates needs to know the existence and relative costs of the other hosts in the system. Although other hosts may know about the programs that another host can execute, all they must know is how to send messages to the next (or previous) host in the job's processing pipeline. Using the static model, each host is told from where to get the input parameters that are needed, or where to send the outputs of the programs.

The static model allows for a `global view' of programs and costs. Heuristics are used to decide whether to incur slightly higher processing costs in order to keep all the programs involved in a job on the same or tightly-coupled nodes, or to search for lower computational costs and be penalised with slightly higher communication costs.

Unfortunately, the static model does not allow for the very real possibility that one of the nodes selected to perform a service may have failed, be isolated from the system due to network failure, or at least so heavily laden with jobs that it is not responding.

Dynamic Scheduling

General-purpose dynamic scheduling operates on two levels: the local scheduling discipline, and a load distribution strategy. The load distributing strategy determines how programs will be placed on remote machines. It uses an information policy to determine which information is to be collected from each machine in the processor pool, at what frequency, and also how often the local information should be exchanged with other machines. Scheduling may be approached from many different aspects, including the viewpoints of the user and the resource owner.

In a strictly dynamic scheduling model, the tasks that comprise a parallel or distributed job are assigned to processors based on whether a processor predicts that it can provide an adequate quality of service to a task. The meaning of quality of service is dependent on the application. The term includes: whether a maximum bound can be placed on the time a job will have to wait before starting execution; the minimum time quanta that a given job will be able to execute without interruption; and, the relative speed of a processor when compared to others in the processing pool. If the processor is assigned too many tasks, it may invoke a transfer policy to decide whether to transfer some tasks, and which tasks to transfer. A location policy determines which processor(s) will receive the tasks. There are two important models for location policies: sender-initiated and receiver-initiated. These depend on whether the sender polls potential targets for task transfer, or whether processors that are willing to receive extra tasks advertise the fact, respectively.

The advantage of dynamic load balancing over static scheduling is that the system need not be aware of the run-time behaviour of the application before execution. Dynamic load balancing is particularly useful in a system where the primary performance goal is the maximising of processor utilizations as opposed to the minimisation of runtime for individual jobs [207, ]. This is often found in systems consisting of networks of workstations.

The main consideration involved in mapping threads to PEs is the requirement to balance the loads. It has been shown that load balancing is crucial to achieving adequate performance [53, ]. Load balancing can be performed in one of two fundamental ways, where the decision of where to place the new thread is either made by a local queue or global queue. In a local queue, each PE makes its own decision; in a global queue there is a single point from where the threads are dispatched.

There are a number of methods that have been proposed for mapping a thread to a processor (from Feitelson [62] section 4.1.1) using local queues, for example:

Global queues are easy to implement in shared memory machines; they are not realistic options for distributed memory machines due to the need to maintain coherent lists of available tasks [62]. In most systems that use a global queue, tasks are allocated to a PE, execute for a time quantum, and are returned to the queue for re-allocation. The main advantage of using a global queue is that of load sharing, in which all processors get a roughly equal proportion of the workload. The disadvantages are the contention for the shared global queue, and the lack of memory locality.

Hybrid Static-Dynamic Scheduling

 

In the dynamic model, no placement information is assigned to the job. Consider the case in which a computation can be divided into a number of dependent parts. Initially, a node advertising the job's first service will be passed the job and once the task is complete, the node will consult its own records to determine the location (and cost information) of a server that can perform the next task, and will dispatch the job to that host.

This approach forces all hosts to be aware of, if not all, then at least a large part of the system. It does, however, provide for the truly dynamic nature of the system, in which nodes and services may or may not be available, and communication links may be down. This suggests that there may be problems with the dissemination of server (and service) information to the remainder of the system. Problems may arise if the network becomes segmented.

The server estimates the cost of each part of the job's processing pipeline, and assigns each part to the processors in a manner that minimises the total execution time. Thus, the assignment of computation parts to processors is a static assignment. Problems arise if, after assignment, one of the servers selected to participate in the computation is aware of an alternative server that can perform the part at less expense. If the original schedule is used for execution, the computation will take longer than it could have through using the alternate server.

The onus, therefore, is on the current server to find the next, most appropriate server to continue the computation. It may be advisable to configure the system with facilities to `backtrack' servers so if a server does not know about any appropriate servers to continue the computation, the job may be passed back to the last server to host it. The system would have to be carefully monitored, and trained not to follow local cost minima, which allow the computation to be passed to more remote, but more communications-costly nodes when it comes to return the results to the user.

This model runs into difficulties where the computation forks into a number of concurrent processing stages. One possible solution to this problem is to require that concurrent computations send periodic updates to the server which was used before the concurrent execution was started, to synchronise and continue the computation.

There needs to be some semblance of a priority system. To prevent users from abusing the priority system by requesting everything be run at the highest priority, it may be decided to take the solution that has been adopted in CCS [193], where the user is charged for the priority level at which they submit their job.

Independent Tasks

 

The performance of dynamic load balancing models in a heterogeneous system was studied by Chow and Kohler [42], where they considered deterministic and non-deterministic routing strategies. The non-deterministic strategy uses a probability, tex2html_wrap_inline2992 of a process being routed to processor i, which is either chosen arbitrarily or based on a function of the system parameters.

Deterministic routing involves a job dispatcher which routes jobs to processors according to some policy. Three deterministic policies are investigated: minimum response time, minimum system time, and maximum throughput. Chow and Kohler's study concludes that the maximum throughput policy, which uses information on the arrival rate as well as service rate and system state, is more optimal than those policies that base their decisions on the service rate and system state only.

We discuss the topic of independent tasks further in chapter gif. A tool is presented that allows the comparison of different static and dynamic scheduling algorithms for the placement of independent jobs. We use a number of simple placement algorithms and also a variant of Chow and Kohler's algorithm.

Dependent Tasks

 

There are a number of different ways in which schedules for the execution of related tasks can be generated in the situation where the resources that are to be used are dedicated. Feitelson and Rudolph [63] recognise a number of different static approaches to parallel job scheduling, including critical path methods, list scheduling methods, and the partitioning of a DAG into clusters of nodes. As creating an optimal schedule is NP-complete, heuristic algorithms based on list schedules and clustering techniques are common [4, , ].

List Schedules

List scheduling is relatively straightforward to implement. Most commonly, list scheduling is used to place the tasks that comprise a parallel program [4]. Tasks comprising a program are assigned priorities and placed in a list ordered in decreasing priority. Whenever tasks contend for processors, the highest priority task that is immediately executable is assigned first. List schedules may be preemptive or non-preemptive; if there are two tasks with the same priority that can execute, one is chosen randomly.

List scheduling algorithms rely on the knowledge of task execution times, and precedence relationships. They are most often used to schedule dependent tasks on clusters of homogeneous processors [4]. Given a set of tasks tex2html_wrap_inline2996, which has the precedence graph G, we let the execution time of task tex2html_wrap_inline3000 be tex2html_wrap_inline3002. The length of a directed path is defined as the sum of all the weights along the path including the initial and final vertex. The level of a task within a program is defined as the length from a vertex tex2html_wrap_inline3004 to an exit node (a node which has no successors). Similarly, the co-level of a task is defined as the maximum distance from the vertex tex2html_wrap_inline3004 to the entry vertex (which has no predecessors). A table can therefore be created in which each task is assigned a level and co-level. Figure gif shows a precedence graph and a table detailing each task's level and co-level. The level and co-level are used to order the tasks for execution. There is no allowance made for tasks not to be able to execute on any processor in the pool.

 figure517

The trade-off between cost and accuracy of list schedules is discussed by Adam, Chandy and Dickson [4], where they consider the problem of scheduling two or more homogeneous processors to minimise the execution time of a program which consists of partially-ordered tasks. Independent tasks are not considered. They found that when the tasks that comprise a program are organised into a precedence graph, the most optimal schedules are found when using the levels of tasks, incorporating the task execution times to increase the estimate of the priority which to assign the task. In this case, the generated list schedules were found to be within 4.4% of the optimal execution time. Communications time between tasks was not considered in this study.

Clustering Algorithms

Clustering, or processor assignment, involves the collection of tasks that exchange a large amount of data onto the same processors, while at the same time distributing the tasks in order to achieve good load balancing [183]. Heuristics have been suggested for clustering [86].

Yang and Gerasoulis' dominant sequence clustering (DSC) algorithm [240] is one part of a multi-part graph scheduling system [84]. The algorithm clusters the dominant sequence of a directed acyclic graph (DAG), which is the critical path of a DAG whose nodes have been allocated to processors. The critical path of a DAG is the path connecting the programs which dominate the execution time of the DAG. If the programs on the critical path are not optimally scheduled, the resulting execution time for the job will be non-minimal. Once the dominant sequence of the DAG has been clustered, the remainder of the DAG is placed so as to minimise total execution time. The work on DSC was later applied to iterative task graphs [77, , ].

 figure531

An example of clustering is shown in figure gif. In this figure, nodes 1, 2 and 4 have been clustered together and assigned to host 1; nodes 3 and 5 have been clustered and assigned to host 2; and nodes 6 and 7 have been clustered and assigned to host 3. The decision of which nodes to cluster together are based on heuristics, including the amount of data to be shared between the nodes, and the relative speeds of each of the hosts. Clustering is discussed in the context of our model in chapter gif. We use an algorithm based loosely on DSC.

System State Information

 

The situation in which everything is known about the nodes participating in a distributed system is called complete system state information. This concept is shown in figure gif, where each node is aware of all operations that each of the others can support. In the case in which all nodes are under dedicated control of a master node, complete system state is not an unreasonable assumption. In a real distributed system, characteristics including the load on each machine, and the number of tasks awaiting execution comprise the reported system information.

In practice, the use of complete system state information means that the algorithm performing task scheduling does not need to estimate any of the characteristics of remote nodes. This allows a scheduler to make accurate predictions on the behaviour of a remote node. Most resource management systems (see section gif) assume they have complete system information, which is updated either by the master explicitly polling the slave machines; the slave machines regularly reporting their state information to the master; or as found in [172], there may be a hierarchical arrangement through which complete information is propagated and can be searched.

If the nodes are regular in behaviour and use, maintaining complete system state information may not be difficult. In the case where nodes are dynamically joining and leaving the distributed system, or are not dedicated to the use of the distributed environment, such maintenance is a non-trivial problem. In the case where maintaining complete system state information is too difficult or impractical, partial system state information may be still be of value.

  figure545
Figure 12: Complete system state information. Each node has complete, accurate information about all other nodes. For example: node A can support operations F and G; node B can support operations H and I; node C can support operations J and K; and, node D can support operations L and M. This information is replicated across all nodes in the distributed system.

  figure552
Figure 13: Consequences of partial system state information, When the state information maintained by a distributed system is not complete, some nodes may know about resources that others do not. i) Initially, each node only knows about itself and the operations it supports; ii) nodes B and C exchange knowledge; iii) nodes C and D exchange knowledge; iv) nodes A and B exchange knowledge. These exchanges result in partial system state knowledge being maintained about the system. If node A needs to know where an operation, L, is supported, then it can ask those nodes that it knows about, in the hope that they know of such a node. In the case of static information, complete system state information is attained quickly; the dynamic case is more complicated as information needs to be updated periodically.

Partial system state information allows nodes to keep their own view of the complete system state. This idea is illustrated in figure gif. Figure gif i) shows the initial state of the distributed system where each node is unaware of the other nodes. Through some mechanism, nodes B and C may need to communicate. The results of this communication are shown in figure gif ii). This may occur as: a product of a multi-cast request by either node; a manager at one of the nodes may be aware that the other is now online; or, after restoring some saved state information, either node may see some information that references the other, and seek to update its information.

The problem of partial system state information is similar to that of assuring consistency amongst distributed databases. An example of this problem is that found in the Internet Domain Name System (DNS) [160, ]. Machines connected to the Internet always have a unique address, called the IP address [221], which is used for all communications. In addition, they usually have a human-readable names. For example, the host named opal.cs.adelaide.edu.au has the IP address 129.127.8.80. Within DNS, each logical group of computers is termed a domain. Domains can be split into further sub-domains, and many domains can be part of a larger domain in a tree-like fashion [9]. The information describing the mapping is stored locally to a domain. DNS exists as a mechanism for the exchange of local information between distributed domains. It relies on a central authority to delegate the authority to create domain names, and because of its hierarchical nature, when a hostname is unable to be resolved by the local domain, the request is passed to the larger, encapsulating, domain [199]. The design of DNS is not a directly useful approach to the problem we consider in the remainder of this thesis, as we do not impose a hierarchical ordering of logical domains in the DISCWorld.

Although not as accurate as complete system state information, partial information is easier to maintain. Schedulers can use partial system state information when constructing placement schedules. We impose the restriction of partial system state information in the model and algorithm developed in chapter gif.

Static scheduling is simpler to implement and it seems to be the more widely-used approach to scheduling. The literature features many examples of static scheduling, where the authors have made different assumptions about such details as:

Historical Review of Scheduling

 

This section presents an historical review of scheduling literature. Of course, this is not an exhaustive list of all the literature; it encompasses what we believe to be important studies. We are not interested in the complexity of the models - just the models themselves.

  table588
Table 3: Comparison of models found in the scheduling literature. The type of scheduling can be either Static, Dynamic or Hybrid (static then dynamic). Task allocation can either be Restricted, Semi-Restricted or Unrestricted. The model inputs can be an Arbitrary Graph, a Tree, a Precedence Graph, a DAG, or Unrelated Tasks. Task execution time is either Known or Unknown, and inter-task communication time in the model is either included as a constant constant (Y), included but variable (V), or not included (N). The processors used by the model are either homogeneous or heterogeneous.

Table gif shows a comparison of models found in the scheduling literature. The majority of the systems studied were formulated in the context of parallel program execution on tightly-clustered homogeneous processors. It can be seen that most of the systems do not incorporate the concept of restricted program placement between nodes - they assume that all programs can be placed on any processor in the pool. In addition, the input to most of the models is a tree, or graph, of programs (or tasks). Tasks organized into a tree have precedence relations; they are instances of task graphs with multiple source or sink nodes. In most cases, an arbitrary task graph, or a precedence graph, is implicitly directed. If the directed graph does not contain cycles, it is termed a directed acyclic graph (DAG). DAGs are used to represent processing requests in the model and algorithm developed in chapter gif. Some of the systems featured in table gif are described in the remainder of this section.

Stone [215] used a modified commodity flow algorithm and cutsets for scheduling an arbitrary graph over two and three heterogeneous processors. Each task had known, but not identical execution time. Each branch in the network had an associated amount of information to be transferred. The capacity of the network, was used to measure the actual amount of information flowing between the tasks at runtime. Cutsets divide the graph into information sources and sinks. The model incorporated variable communication time between processors. It also allowed concurrent execution, but neither parallel execution or loops and conditionals in the input graph. The model had the main characteristic of the input task graph being undirected, thus representing two-way communication between modules; the cost of moving a computation to another processor is traded off against communication costs of transferring the results between processors. This model was one of the first to consider semi-restricted tasks. Some tasks were assigned to processors; others were able to `float' between processors during program execution.

Kaufman [138] used a longest path method to statically schedule a tree of tasks onto n homogeneous processors, in an unrestricted fashion. The longest path method is equivalent to a list scheduling algorithm, ordering by decreasing task level. Trees of tasks have precedence relations between tasks. Tasks were assumed to have known but non-identical execution times, but communication time was not incorporated into the cost model. Concurrent execution of tasks was supported, but not parallel execution, or loops and conditionals in the input tree. This was the first study to consider tasks with non-homogeneous execution times.

List schedule algorithms were compared in [4], across tex2html_wrap_inline3026 homogeneous processors, using a precedence graph as input. Tasks were assumed to be unrestricted in assignment, and had random, but deterministic execution times. Communication time was not incorporated into the cost model. Concurrent execution time was allowed, but not parallel execution, and loops and conditionals were not supported. If tasks are treated as programs, and task intercommunication is ignored (i.e. tasks are independent), list scheduling is equivalent to the first-come first-serve scheduling algorithm used in the tool described in chapter gif.

Bokhari [28] featured a static then dynamic scheduling algorithm, using unrestricted placement of a tree of tasks, across tex2html_wrap_inline3028 heterogeneous processors. Tasks had known but non-identical execution time, and variable communication time was incorporated into the cost model. This study was one of the first to consider variable communication times. Parallelism between tasks, loops and conditionals in the input tree were not considered. All processors are assumed to be dissimilar, but are able to execute any task in the tree. The shortest tree algorithm is used to minimise the sum of execution time and inter-processor communication time. The algorithm suggested by this study exhaustively iterates through all possible assignments of modules to processors, using the shortest tree. This algorithm is similar to the model that we develop in chapter gif.

Chou and Abraham [41] suggested seven program descriptors for dynamic execution on distributed systems and policy iteration. The descriptors are: execution time of a task; the communication time for the results of a task; probability a task fails on a processor; the time to create a checkpoint for a task; time to restart a failed task; time to initiate a set of concurrent tasks on a processor; and, the communication time for the results of a set of concurrent tasks. The paper presents an algorithm for optimal task assignment on n heterogeneous processors, featuring probabilistic conditional branches in the arbitrary input task graph. The study was one of the first to consider parallel as well as concurrent execution of application tasks, and although communication was considered in the cost model, the communication cost between processors was fixed.

Towsley [224] extended Bokhari's shortest path method, and incorporated the concept of processor reliability from [41] to produce run-time estimations by loop-unravelling. An arbitrary task graph is scheduled for execution on n heterogeneous processors, where tasks are either allowed to execute on any processor, or they are all assigned to certain processors. Task execution time is known but non-identical, and communications time is incorporated into the cost model, albeit fixed. This research is significant because it supports conditionals and loops within the arbitrary task graph as input.

Chen and Eshaghian [40] present a fast recursive mapping algorithm for parallel applications on parallel systems. Their system involves the clustering of the algorithm's task graph, and the separate clustering of the parallel system's graph. The algorithm provides for the mapping between the two clustered graphs in O(MN) time, where M is the number of task modules and N is the number of underlying processors. In this study, the authors assume that each module (comprising the application) executes for a single time unit, and communicates a uniform amount of data to any child modules. They further assume that each processor is the same speed and that all have the same network characteristics (transmission rate and latency). Although this study is not applicable in the situation we consider for this thesis, it is applicable to the case where only a single variable (either the program graph or the architecture of the parallel machine) is changing. Thus, many program tasks, for example, could be mapped to a single parallel architecture graph.

The automatic heterogeneous supercomputing (AHS) system [49] uses a semi-dynamic strategy for scheduling user application programs. The system maintains a file for each application program containing an estimate of the execution time of that application on each of the available machines; when the user invokes the program, a machine is chosen such as to minimise the turnaround time, which is a function of the current load on the machine and the expected execution time on that machine. The system is semi-dynamic in that once the application is started on a given machine, there is no intervention from the scheduler. The amount of information that this scheduling algorithm uses is minimal, and while it benefits from low scheduling overhead, it does fail to take into account any issues arising from the input data locality.

The self-adjusting dynamic scheduling (SADS) algorithm [101] develops a detailed cost model which is used to trade-off effects of processor load imbalance, interprocessor communication delays and scheduling and synchronisation overhead. It creates partial schedules, minimising the cost function by using an algorithm similar to branch-and-bound [236]. The time that is taken to create the partial schedules is bound by the speed with which least-loaded processor in the pool can complete its tasks. Once the least-loaded processor completes its tasks, the computed partial schedules are send to all the machines in the pool, and the process begins again. This algorithm suffers from the need to have a dedicated processor to create the partial schedules, whether this processor is the same all the time, or whether it is chosen at execution time. This is extended in the self-adjusting scheduling for heterogeneous systems (SASH) algorithm [102]. The SADS algorithm is similar to the continual adaptive scheduling algorithm for independent programs that is considered in chapter gif.

The scheduling scheme suggested by Shirazi, Chen and Marquis [205] involves the clustering of a program's task graph into linear tasks, and then statically scheduling them after optimisation. Linear clustering of nodes is used to decrease the effects of inter-processor communications, while parallelising the nodes reduces the execution time. The implementation of this algorithm has the drawback that one of the assumptions that it makes is that the task graph that is to be used as input has a single root node, that an unbounded number of target processors are available, and they are homogeneous. No consideration is given to contention that may arise from running multiple programs at once, or the effects of introducing heterogeneity into the processor pool.

Weissman and Grimshaw [233] present a framework for partitioning data parallel computations in heterogeneous environments, which is implemented in Mentat. This framework is designed to effectively utilise clusters of workstations and supercomputers with different communications topologies, with a view to reducing the total elapsed time as seen by a program. Different communication topologies are characterised by different cost functions. This study includes the notion of router delays and per-byte data coercion (between data formats) costs. The values used for communications costs in the study are for an ideal system in stand-alone mode, and do not consider the effects of real-time machine load. The process placement algorithms used are communication topology-dependent, and rely on callbacks from the tasks whilst they are in computation and communication phases to provide the information that is needed to make intelligent placement decisions.

Shirazi, Wang and Pathak [206] consider the effects of three static scheduling algorithms in the context of a multiprocessor environment, where the execution time of each of the components of the input are known and invariant. Their study concentrates on achieving the fastest turn-around time for an input DAG. The algorithms they consider are the critical path method, and two novel algorithms, a heaviest-node first method, where the longest-executing program nodes are assigned to the least-loaded processors first, and a weighted length algorithm which extends the critical path algorithm by incorporating such information as the branching factor of each node, the number of children, and their weights. Although novel, and effective in terms of computational complexity, the models do not take into consideration the delay effects of interprocessor communication or the effects of any processor heterogeneity.

El-Rewini and Lewis consider the problem of scheduling parallel program tasks onto arbitrary machines [57] in the presence of contention between processing nodes. They use a scheduling heuristic, MH, which produces a static schedule based on list scheduling. A task graph and a undirected graph, representing the target machine, are used by the mapper algorithm, which produce a Gantt chart of the resulting schedule. Contention on interconnection links is modeled as a extra delay which is constant for a given link. Although the effects of contention on interconnection links is ignored, we consider the effects of assigning multiple tasks to a single processor in the model we develop in chapter gif.

Malloy, Lloyd and Soffa [156] extended existing research based on the critical path method to the fine-grained parallelism found in program instruction streams, and implemented their model on a two-processor homogeneous system. This research is too fine-grained to be applicable to the problem that is under consideration.

Discussion

 

While a variety of different scheduling and placement models have been studied in the literature, there is very little evidence of most of the models being used in practice. Although scheduling and process placement are vital parts of a distributed system, most of the cluster computing environments described in chapter gif do not seem to have been designed with any algorithms in mind.

There is a great deal of existing research that focuses on scheduling DAGs across a bounded or unbounded number of target nodes. Although useful, results which only consider a homogeneous collection of nodes, such as [203, , ], are not helpful in the case of heterogeneous nodes. Still more research  [156] focuses on exploiting fine-grained parallelism found in instruction streams. That too, is not helpful in the case where a processing pool is comprised of heterogeneous nodes separated by a wide-area broadband network.

Unfortunately the existing research (eg  [4, , , , , , , , ]) that considers a collection of heterogeneous nodes all assume that either:

The general case that we consider has the following characteristics, for which an adequate solution does not yet exist:

Conclusion

 

The scheduling literature provides solutions for a number of special cases. However, we believe that the previous results are not completely applicable for the system model that is found in some wide-area distributed systems, such as our DISCWorld prototype metacomputing environment. They do not address the fundamental constraints and features found in our system: lack of up-to-date global system information; restricted placement and mobility of data and services; global naming strategy for data and services, by which data may be re-used; and the single assignment nature of data.

There are a number of popular approaches to this problem, most of which are related to the critical path method. We consider services that can be executed on a number of different nodes within the distributed environment, each with a different set of transfer and execution costs. Due to the sheer computational complexity of computing a critical path for every possible node assignment, we feel the classical approaches of list scheduling [4] and clustering [201, 240] are not applicable. In addition, the above techniques fail to take into account the probable need to transfer input data to each service before it can be run - which is different from the data created as the output of a previous processing step. Finally, Sarkar's two-stage algorithm [201] and DSC only consider the case of homogeneous compute nodes, which is too restrictive for the more general case we consider.

A number of heuristics and partial solutions may be combined to form a solution to the problem we consider. We use the general idea of clustering, and in particular, a modified version of Yang and Gerasoulis' DSC algorithm to minimise the total execution time of an individual job.

We consider a processing request to be made up of a number of programs, which may have inter-dependencies. When all processors in a distributed system are controlled, and assigned programs by a single front-end processor, static scheduling, as described in section gif, is best. This is because the complete state of the system is knowable. There are two major drawbacks to using a centralised scheduler. Firstly, the use of a centralised scheduler may cause a processing bottleneck, and secondly, the centralised scheduler introduces a single point of failure. This means that if the node hosting the scheduler suffers a failure, either of the physical machine or the network interconnecting the scheduler with the remainder of the processors in the pool, no more jobs will be able to be scheduled.

If all processors have the ability to assign programs to run in the distributed system, then dynamic scheduling, or load balancing, is more appropriate because of the lack of control by an individual node over the whole system. Most dynamic scheduling algorithms seek to obtain the best performance for individual programs under consideration, in the assumption that by doing so, the complete processing request will be efficiently executed; they perform no forward-planning.

In chapter gif we present a tool we have developed that allows quantifiable analysis of static and dynamic scheduling algorithms when used to execute independent programs injected to a cluster of computers. When the distributed environment is relatively stable, we find that in the case where the programs to be run are easily characterised, a static scheduling algorithm provides good execution time; when the programs are not easily characterised, a dynamic algorithm is needed.

In the situation where the node from which a processing request is submitted has information on the system state (even if incomplete), there is benefit in using that information to produce a heuristically good execution schedule. If the execution schedule is created with static task-to-processor assignment, and then modified at execution time, we call this model hybrid static-dynamic. Thus, the node creates the best static schedule that is heuristically possible using its current system state information as an estimate. Since other nodes may have different system state information, at execution time they may modify the schedule to make it more efficient for the processing request's execution.

Some of the systems we review in chapter gif provide solutions which incorporate subsets of the conditions which characterise our system. In chapter gif we develop a model for scheduling in wide-area systems. The DISCWorld model is designed for jobs that consist of inter-dependent high-level programs, which we subsequently describe in chapter gif.


next up previous
Next: Scheduling Independent Tasks Up: Scheduling in Metacomputing Systems Previous: Cluster Computing

Heath A. James, heath@cs.adelaide.edu.au