next up previous
Next: Cluster Computing Up: Scheduling in Metacomputing Systems Previous: Scheduling in Metacomputing Systems

Introduction

 

Metacomputing [211, 74] is an abstraction by which clusters of computers in a distributed system can be treated as a single virtual computing resource. Scheduling is an important aspect of distributed computing, particularly in the case of metacomputing systems, where the scale of the system, and resource heterogeneity give rise to a large number of scheduling choices.

We have reviewed the current technologies in metacomputing systems and specifically the way in which scheduling, or program placement across distributed systems is approached. We have concluded that with few exceptions, scheduling is performed in an ad hoc manner. Therefore, there exists an opportunity to study the way in which schedules for distributed systems are created and executed, and to apply this research to metacomputing systems.

We consider processing requests that can be decomposed into relatively coarse-grained computations interconnected by data dependencies. We explicitly include the case in which it may be more economical to move the computation to where the data resides, rather than to move the data to the computation.

We initially consider the effects of a number of scheduling algorithms on the total execution time of independent programs with execution times chosen from a given distribution. We conclude that no single scheduling algorithm performs optimally with all distributions of execution time across both homogeneous and heterogeneous distributed clusters. We suggest the use of an adaptive scheduling algorithm which initially creates a static execution schedule that is adapted by knowledge of partial system state to minimise execution time.

The task of maintaining complete, accurate system state information in a distributed computational environment is very difficult. We hypothesise that an adaptive scheduling algorithm should be used to enable the performance of a metacomputing system to be optimised from two differing points of view: that of the user, in turnaround time; and, that of the system, in the impact that computations have on the system in the long-term. The scheduling algorithm makes use of an individual node's knowledge of the remainder of the global system state to generate optimal schedules, according to a given heuristic.

We present a set of abstractions with which we can describe the characteristics of data and programs, and the assumptions which make them valid in a general distributed system. These abstractions are used to create adaptive execution schedules with which a processing request may be satisfied and optimised. The experimental framework is implemented as an integral part of the broader Distributed Information Systems Control World (DISCWorld) metacomputing system.

This thesis does not attempt to provide solutions to all the research problems faced when implementing a metacomputing system. The focus of this work is on scheduling; we develop an algorithm and mechanisms for enabling smart scheduling of programs in the DISCWorld metacomputing system.

Project Motivation

In the nearly forty years in which computers have been available, the user base has broadened from isolated groups of mathematicians and engineers to the point where they have become widely accessible and affordable commodity items with applications for scientific research, business management and entertainment. Their ubiquity is driving users from many disciplines to use computers.

For those users with large high-performance computing requirements, the trend has been to move away from expensive massively parallel machines to medium- or small-scale, general-purpose parallel machines and clusters of workstations, as well as combinations of distributed heterogeneous machines [6]. Together with the widespread adoption of computers into daily life, computers are also being used to gather and store data on a wide variety of phenomena [133, 167]. New equipment is being deployed which can produce many terabytes of data per day, such as multi-spectral Earth observation satellites [17]. With a proliferation of data being collected every day on various physical phenomena (including visual spectra, invisible spectra, geothermal, seismic, hydrology), people are using this data to produce new, value added data.

The application domains of environmental science and decision support are areas that require access to a wide variety of data that is characterised by being very large and expensive to produce. Common software packages that are available for use by environmental scientists [59] are unable to handle the volume of data required for processing. Thus, more users are turning to high performance computers, and collections of computers, to provide the necessary processing capabilities. Unfortunately, these people are not often computer science specialists, and may not possess the skills to effectively program and use high-performance computing hardware that may be beneficial to them.

One of the characteristics of these application domains is the fact that, as the data is very large, it is also expensive to transfer across a network. Data is also being collected by different organisations  [133, , , , ] by different methods (including satellite, ground observations, radar data) and each owner may stipulate different conditions on the use of their data. The data may also be stored in repositories that reside in different parts of the same country, or even in different countries. Decision support systems often require near real-time results, so performance is a major consideration. The issues involved in implementing a distributed system that provides a computational infrastructure for the development of decision support and research applications are presented in appendix gif.

The problem of enabling non-computer scientist users access to large-scale distributed systems is not trivial. Technical issues such as ensuring each machine has a standardised interface which hides the heterogeneity of the platform and interconnection networks must be addressed. In summary, what these users require is an abstraction over the computers and data repositories that make up their computational infrastructure, which does not require them to be computer science experts.

A prototype system which provides this functionality is described in appendix gif. The Earth Observation Information Catalogue (ERIC) system allows authorised parties access to satellite imagery stored in an online data archive, and permits simple operations on the images and metadata using basic web-based client-server technologies. The inadequacies in the system have contributed to the design and philosophy behind this project.

Why Metacomputing?

 

With the advent of networking technologies such as Ethernet [50] and ATM [103] it has become possible to connect computers for the widespread, efficient sharing of data [221]. As high performance local- and wide-area networks have become less expensive, and as the price of commodity computers has dropped, it is now possible to connect a number of relatively cheap computers with a high-speed interconnect, to effect a local distributed computing cluster [115].

The granularity of code that is being run across collections of computers is becoming greater, and the need for frequent synchronisation is decreasing. For example, in massively parallel computers, such as the Thinking Machines CM-2 and the CM-5 running in single-instruction, multiple data (SIMD) mode, processors synchronously execute instructions, albeit on different data. This is feasible due to the high bandwidth and low latency of communications between processors. In local-area clusters of computers, where the interprocessor bandwidth may be comparable to that found in a supercomputer but latency is higher, stand-alone tasks that operate on a fraction of the total data to be processed by a program may be run on each processor (MIMD). Thus, while the cluster of computers and the SIMD supercomputer may have the same amount of data to process, the application executing on the cluster will probably perform less synchronisation between processors. Furthermore, the trend is to move towards service-based computing, where stand-alone programs, requiring very little, if any, synchronisation, are run on distributed computers, which may themselves be massively-parallel processors (task parallelism). Hence, variable interprocessor bandwidth and latency characteristics are tolerable.

Clusters of distributed computers, as well as high performance serial and parallel machines can be interconnected, speaking common communications protocols, to form a large virtual supercomputer [69, , , , ]. The general trend for computational capacity has thus been to move from monolithic single-processor computers in the early 1970's, to multi-processor parallel computers, in which the interprocessor bandwidth was high and the latency quite low, to clusters of commodity workstations with comparatively high bandwidth and latency, and ultimately to so-called metacomputing environments, which connect heterogeneous clusters of high-end and low-end computers to form a virtual supercomputer.

The term metacomputer [74, 211] was coined to describe a collection of possibly heterogeneous computational nodes which can be treated as a single virtual computer for both resource management and remote execution purposes [19]. General metacomputing environments allow users to submit serial or parallel programs and have tasks or jobs run on the virtual computer. An alternative definition of metacomputing is provided by Gehring and Reinefeld [78]: ``a monolithic computational resource provided by software that allows the transparent use of a network of heterogeneous, distributed computers''.

Focus and Contributions of this Thesis

 

Users are attempting to solve larger problems, using data which may not be stored at one physical location, or are not able to be contained within a single cluster of computers or single multi-processor. This provides incentives for users to join metacomputing environments, and owners to join larger clusters of computers, providing a higher aggregate compute capacity for users, which increases the chances of higher resource utilisation. Of course, by allowing users access to other clusters of computers, owners open their own resources to use by the wider community in a quid quo pro relationship.

We have examined a number of different systems [69, , , , , ] that provide approaches to metacomputing, but we believe that each of these systems suffers from deficiencies which make them inappropriate for the problem domain of high-level decision support services. For example, some systems focus more on the co-allocation of processors [69], while others focus on the effective use of idle machines [153]. Still others focus on the compositional nature of collaborative systems [39]. We address these deficiencies by introducing an environment model in which users can submit high-level processing requests to a daemon which runs on a local machine. The request is parsed by the daemon and decomposed into smaller, self-contained programs which together can access and process the data to fulfil the user's request. From the user's point of view, the system consists of a single virtual machine to which they submit requests, and can retrieve results.

One of the most important aspects of a metacomputing system is the method by which programs are allocated to processors in a fashion that will yield the best results from the viewpoint of both the user and of the owners of the resources. This problem is called the scheduling and resource management problem, and it is this man focus of this thesis.

El-Rewini [54] and Ahmad [6], amongst others, recognise the need for developing novel techniques for the management of computing resources through highly efficient dynamic task allocation, scheduling and load balancing algorithms. One of the major problems that the wider community studying scheduling systems has not addressed is that of not being able to place tasks onto every node in the distributed system. While this limitation has been alluded to by the introduction of restrictions based on machine (and hence binary code) heterogeneity, we wish to address the problem of controlling the system from the viewpoint of a platform independent language, such as the Java programming language, in which tasks, or services, are written and are distributed throughout the machines in the system. Some of the services in the system may be implemented as native programs, encapsulated in Java wrappers. Although services implemented in pure Java are physically able to execute on any of the nodes (due to the platform independence of the object code or byte code), they can be restricted in their choice of machine to be run on by virtue of machine-dependent policies. Policies are controlled by the machine's owner or administrator. This has an effect on the manner in which the data and programs in the system can be used and transferred, and also in the way that the scheduling system must adapt to the conditions.

Organisation of this Thesis

In order to properly place metacomputing into its historical perspective, a review of current and influential cluster computing projects is presented in chapter gif. The classical approaches of cluster control via batch queueing systems and wide-area managers are discussed. The concept of metacomputing, and the characteristics of metacomputing systems are discussed. The chapter concludes with an introduction and discussion of the Distributed Information Systems Control World (DISCWorld) metacomputing project, in which this scheduling work is developed. We critically compare DISCWorld to the other systems reviewed in chapter gif.

A literature survey on scheduling research, with focus on processor selection and allocation, is presented in chapter gif. The conclusion from this work is that there is very little scheduling research that considers the problem of restricted placement of code, and that considers both code and data to be movable between nodes in a distributed system.

Chapter gif presents work that was performed in the area of scheduling independent tasks across a cluster of distributed computational resources. Simulations were performed using a number of different static and dynamic scheduling algorithms under conditions that restrict the use of the systems, as mentioned in chapter gif. We reach two important conclusions: firstly, that there is no single scheduling algorithm that produces the best execution schedule when there are greatly varying numbers of jobs to be executed, with different job execution time distributions; and secondly, that the classifications of scheduling algorithms into either static or dynamic is too restrictive. We suggest that new descriptive terms, representing hybrid approaches in the presence of complete or partial system state information, are needed.

Scheduling, in the presence of restricted placement and movable code and data is the focus of chapter gif. Scheduling is modeled in accordance with the features and restrictions found in the prototype DISCWorld metacomputing system. The resulting model is general to distributed systems, even in the presence of partial system state information. Partial system state information is evidenced in typical distributed environments, where elements of the system are perhaps owned by different organisations and access by the metacomputing is subject to the owner's policies. Our model requires that there exist a global naming scheme for data and code, and single assignment of data. In this context, single assignment of data implies that once data is created and given a canonical name, any modification of the data will result in new data being created, having a different although related name. A description and discussion of the implementation of the scheduling model is presented. Also featured is a brief discussion of work on a distributed job placement language (DJPL). User requests are represented by process networks of high-level programs which share data in a producer-consumer relationship. The purpose of the DJPL is twofold. Firstly, it is used to express the user query when decomposed into programs and shared data; secondly, it also describes the static assignment of programs to machines within the DISCWorld system. In addition, the DJPL contains information that describes the user request and the parameters under which it may run, and also details the appropriate behaviour of the system when an exception condition is raised. We believe that the DJPL is sufficiently general that it may be effectively integrated with other high-level metacomputing systems.

Execution mechanisms associated with the scheduling model are presented in chapter gif. We describe a novel data access mechanism, the DISCWorld Remote Access Mechanism (DRAM), and discuss the manner in which it is used in our implementation of the DISCWorld model. The DRAM concept is extended to refer to data which has not yet been created, the Future DRAM (DRAMF). The DRAMF facilitates the execution of schedules in the DISCWorld metacomputing environment and allows executing schedules to be dynamically optimised to achieve the best performance using available state information. This optimisation is evidenced by possibly faster request execution time for the user and a lower overall system impact.

Performance analysis of the DISCWorld prototype is presented in chapter gif. The system is compared with ERIC, a simple client-server system of the same functionality, and an RMI implementation of ERIC. Limitations of the implementation are discussed.

The thesis is concluded, and future work is described in chapter gif. Other relevant work is summarised and presented in appendices gif and  gif. Appendix gif is a summary of a paper detailing the motivation for the concept behind the DISCWorld system philosophy. Appendix gif describes a user interface to a satellite data repository that was built in the beginning of this work. It describes the way in which a simple client-server technology, such as CGI [100] can be effectively used to provide server-side computation across the WWW. We show that the technology is inadequate for the types of systems detailed in appendix gif due to: its speed; the total reliance on server-side computation; the poor client support; and the fact that the technology is not scalable to complex distributed systems.


next up previous
Next: Cluster Computing Up: Scheduling in Metacomputing Systems Previous: Scheduling in Metacomputing Systems

Heath A. James, heath@cs.adelaide.edu.au