Metacomputing [211, 74] is an abstraction by which clusters of computers in a distributed system can be treated as a single virtual computing resource. Scheduling is an important aspect of distributed computing, particularly in the case of metacomputing systems, where the scale of the system, and resource heterogeneity give rise to a large number of scheduling choices.
We have reviewed the current technologies in metacomputing systems and specifically the way in which scheduling, or program placement across distributed systems is approached. We have concluded that with few exceptions, scheduling is performed in an ad hoc manner. Therefore, there exists an opportunity to study the way in which schedules for distributed systems are created and executed, and to apply this research to metacomputing systems.
We consider processing requests that can be decomposed into relatively coarse-grained computations interconnected by data dependencies. We explicitly include the case in which it may be more economical to move the computation to where the data resides, rather than to move the data to the computation.
We initially consider the effects of a number of scheduling algorithms on the total execution time of independent programs with execution times chosen from a given distribution. We conclude that no single scheduling algorithm performs optimally with all distributions of execution time across both homogeneous and heterogeneous distributed clusters. We suggest the use of an adaptive scheduling algorithm which initially creates a static execution schedule that is adapted by knowledge of partial system state to minimise execution time.
The task of maintaining complete, accurate system state information in a distributed computational environment is very difficult. We hypothesise that an adaptive scheduling algorithm should be used to enable the performance of a metacomputing system to be optimised from two differing points of view: that of the user, in turnaround time; and, that of the system, in the impact that computations have on the system in the long-term. The scheduling algorithm makes use of an individual node's knowledge of the remainder of the global system state to generate optimal schedules, according to a given heuristic.
We present a set of abstractions with which we can describe the characteristics of data and programs, and the assumptions which make them valid in a general distributed system. These abstractions are used to create adaptive execution schedules with which a processing request may be satisfied and optimised. The experimental framework is implemented as an integral part of the broader Distributed Information Systems Control World (DISCWorld) metacomputing system.
This thesis does not attempt to provide solutions to all the research problems faced when implementing a metacomputing system. The focus of this work is on scheduling; we develop an algorithm and mechanisms for enabling smart scheduling of programs in the DISCWorld metacomputing system.
In the nearly forty years in which computers have been available, the user base has broadened from isolated groups of mathematicians and engineers to the point where they have become widely accessible and affordable commodity items with applications for scientific research, business management and entertainment. Their ubiquity is driving users from many disciplines to use computers.
For those users with large high-performance computing requirements, the trend has been to move away from expensive massively parallel machines to medium- or small-scale, general-purpose parallel machines and clusters of workstations, as well as combinations of distributed heterogeneous machines [6]. Together with the widespread adoption of computers into daily life, computers are also being used to gather and store data on a wide variety of phenomena [133, 167]. New equipment is being deployed which can produce many terabytes of data per day, such as multi-spectral Earth observation satellites [17]. With a proliferation of data being collected every day on various physical phenomena (including visual spectra, invisible spectra, geothermal, seismic, hydrology), people are using this data to produce new, value added data.
The application domains of environmental science and decision support are areas that require access to a wide variety of data that is characterised by being very large and expensive to produce. Common software packages that are available for use by environmental scientists [59] are unable to handle the volume of data required for processing. Thus, more users are turning to high performance computers, and collections of computers, to provide the necessary processing capabilities. Unfortunately, these people are not often computer science specialists, and may not possess the skills to effectively program and use high-performance computing hardware that may be beneficial to them.
One of the characteristics of these application domains is the fact
that, as the data is very large, it is also expensive to transfer
across a network. Data is also being collected by different
organisations [133, , , , ]
by different methods (including satellite, ground observations, radar
data) and each owner may stipulate different conditions on the use of
their data. The data may also be stored in repositories that reside in
different parts of the same country, or even in different
countries. Decision support systems often require near real-time
results, so performance is a major consideration. The issues involved
in implementing a distributed system that provides a computational
infrastructure for the development of decision support and research
applications are presented in appendix
.
The problem of enabling non-computer scientist users access to large-scale distributed systems is not trivial. Technical issues such as ensuring each machine has a standardised interface which hides the heterogeneity of the platform and interconnection networks must be addressed. In summary, what these users require is an abstraction over the computers and data repositories that make up their computational infrastructure, which does not require them to be computer science experts.
A prototype system which provides this functionality is described in
appendix
. The Earth Observation Information Catalogue
(ERIC) system allows authorised parties access to satellite imagery
stored in an online data archive, and permits simple operations on the
images and metadata using basic web-based client-server
technologies. The inadequacies in the system have contributed to the
design and philosophy behind this project.
With the advent of networking technologies such as Ethernet [50] and ATM [103] it has become possible to connect computers for the widespread, efficient sharing of data [221]. As high performance local- and wide-area networks have become less expensive, and as the price of commodity computers has dropped, it is now possible to connect a number of relatively cheap computers with a high-speed interconnect, to effect a local distributed computing cluster [115].
The granularity of code that is being run across collections of computers is becoming greater, and the need for frequent synchronisation is decreasing. For example, in massively parallel computers, such as the Thinking Machines CM-2 and the CM-5 running in single-instruction, multiple data (SIMD) mode, processors synchronously execute instructions, albeit on different data. This is feasible due to the high bandwidth and low latency of communications between processors. In local-area clusters of computers, where the interprocessor bandwidth may be comparable to that found in a supercomputer but latency is higher, stand-alone tasks that operate on a fraction of the total data to be processed by a program may be run on each processor (MIMD). Thus, while the cluster of computers and the SIMD supercomputer may have the same amount of data to process, the application executing on the cluster will probably perform less synchronisation between processors. Furthermore, the trend is to move towards service-based computing, where stand-alone programs, requiring very little, if any, synchronisation, are run on distributed computers, which may themselves be massively-parallel processors (task parallelism). Hence, variable interprocessor bandwidth and latency characteristics are tolerable.
Clusters of distributed computers, as well as high performance serial and parallel machines can be interconnected, speaking common communications protocols, to form a large virtual supercomputer [69, , , , ]. The general trend for computational capacity has thus been to move from monolithic single-processor computers in the early 1970's, to multi-processor parallel computers, in which the interprocessor bandwidth was high and the latency quite low, to clusters of commodity workstations with comparatively high bandwidth and latency, and ultimately to so-called metacomputing environments, which connect heterogeneous clusters of high-end and low-end computers to form a virtual supercomputer.
The term metacomputer [74, 211] was coined to describe a collection of possibly heterogeneous computational nodes which can be treated as a single virtual computer for both resource management and remote execution purposes [19]. General metacomputing environments allow users to submit serial or parallel programs and have tasks or jobs run on the virtual computer. An alternative definition of metacomputing is provided by Gehring and Reinefeld [78]: ``a monolithic computational resource provided by software that allows the transparent use of a network of heterogeneous, distributed computers''.
Users are attempting to solve larger problems, using data which may not be stored at one physical location, or are not able to be contained within a single cluster of computers or single multi-processor. This provides incentives for users to join metacomputing environments, and owners to join larger clusters of computers, providing a higher aggregate compute capacity for users, which increases the chances of higher resource utilisation. Of course, by allowing users access to other clusters of computers, owners open their own resources to use by the wider community in a quid quo pro relationship.
We have examined a number of different systems [69, , , , , ] that provide approaches to metacomputing, but we believe that each of these systems suffers from deficiencies which make them inappropriate for the problem domain of high-level decision support services. For example, some systems focus more on the co-allocation of processors [69], while others focus on the effective use of idle machines [153]. Still others focus on the compositional nature of collaborative systems [39]. We address these deficiencies by introducing an environment model in which users can submit high-level processing requests to a daemon which runs on a local machine. The request is parsed by the daemon and decomposed into smaller, self-contained programs which together can access and process the data to fulfil the user's request. From the user's point of view, the system consists of a single virtual machine to which they submit requests, and can retrieve results.
One of the most important aspects of a metacomputing system is the method by which programs are allocated to processors in a fashion that will yield the best results from the viewpoint of both the user and of the owners of the resources. This problem is called the scheduling and resource management problem, and it is this man focus of this thesis.
El-Rewini [54] and Ahmad [6], amongst others, recognise the need for developing novel techniques for the management of computing resources through highly efficient dynamic task allocation, scheduling and load balancing algorithms. One of the major problems that the wider community studying scheduling systems has not addressed is that of not being able to place tasks onto every node in the distributed system. While this limitation has been alluded to by the introduction of restrictions based on machine (and hence binary code) heterogeneity, we wish to address the problem of controlling the system from the viewpoint of a platform independent language, such as the Java programming language, in which tasks, or services, are written and are distributed throughout the machines in the system. Some of the services in the system may be implemented as native programs, encapsulated in Java wrappers. Although services implemented in pure Java are physically able to execute on any of the nodes (due to the platform independence of the object code or byte code), they can be restricted in their choice of machine to be run on by virtue of machine-dependent policies. Policies are controlled by the machine's owner or administrator. This has an effect on the manner in which the data and programs in the system can be used and transferred, and also in the way that the scheduling system must adapt to the conditions.
In order to properly place metacomputing into its historical
perspective, a review of current and influential cluster computing
projects is presented in chapter
. The
classical approaches of cluster control via batch queueing systems and
wide-area managers are discussed. The concept of metacomputing, and
the characteristics of metacomputing systems are discussed. The
chapter concludes with an introduction and discussion of the
Distributed Information Systems Control World (DISCWorld)
metacomputing project, in which this scheduling work is developed. We
critically compare DISCWorld to the other systems reviewed in
chapter
.
A literature survey on scheduling research, with focus on processor
selection and allocation, is presented in
chapter
. The conclusion from this work is that
there is very little scheduling research that considers the problem of
restricted placement of code, and that considers both code and data to
be movable between nodes in a distributed system.
Chapter
presents work that was performed in the
area of scheduling independent tasks across a cluster of distributed
computational resources. Simulations were performed using a number of
different static and dynamic scheduling algorithms under conditions
that restrict the use of the systems, as mentioned in
chapter
. We reach two important conclusions:
firstly, that there is no single scheduling algorithm that produces
the best execution schedule when there are greatly varying numbers of
jobs to be executed, with different job execution time
distributions; and secondly, that the classifications of scheduling
algorithms into either static or dynamic is too restrictive. We
suggest that new descriptive terms, representing hybrid approaches in
the presence of complete or partial system state information, are
needed.
Scheduling, in the presence of restricted placement and movable code
and data is the focus of chapter
. Scheduling is
modeled in accordance with the features and restrictions found in the
prototype DISCWorld metacomputing system. The resulting model is
general to distributed systems, even in the presence of partial system
state information. Partial system state information is evidenced in
typical distributed environments, where elements of the system are
perhaps owned by different organisations and access by the
metacomputing is subject to the owner's policies. Our model requires
that there exist a global naming scheme for data and code, and single
assignment of data. In this context, single assignment of data implies
that once data is created and given a canonical name, any modification
of the data will result in new data being created, having a different
although related name. A description and discussion of the
implementation of the scheduling model is presented. Also featured is
a brief discussion of work on a distributed job placement language
(DJPL). User requests are represented by process networks of
high-level programs which share data in a producer-consumer
relationship. The purpose of the DJPL is twofold. Firstly, it is used
to express the user query when decomposed into programs and shared
data; secondly, it also describes the static assignment of programs to
machines within the DISCWorld system. In addition, the DJPL contains
information that describes the user request and the parameters under
which it may run, and also details the appropriate behaviour of the
system when an exception condition is raised. We believe that the DJPL
is sufficiently general that it may be effectively integrated with
other high-level metacomputing systems.
Execution mechanisms associated with the scheduling model are
presented in chapter
. We describe a novel data
access mechanism, the DISCWorld Remote Access Mechanism (DRAM), and
discuss the manner in which it is used in our implementation of the
DISCWorld model. The DRAM concept is extended to refer to data which
has not yet been created, the Future DRAM (DRAMF). The DRAMF
facilitates the execution of schedules in the DISCWorld metacomputing
environment and allows executing schedules to be dynamically optimised
to achieve the best performance using available state
information. This optimisation is evidenced by possibly faster request
execution time for the user and a lower overall system impact.
Performance analysis of the DISCWorld prototype is presented in
chapter
. The system is compared with ERIC, a
simple client-server system of the same functionality, and an RMI
implementation of ERIC. Limitations of the implementation are
discussed.
The thesis is concluded, and future work is described in
chapter
. Other relevant work is summarised and
presented in appendices
and
.
Appendix
is a summary of a paper detailing the
motivation for the concept behind the DISCWorld system
philosophy. Appendix
describes a user interface to a
satellite data repository that was built in the beginning of this
work. It describes the way in which a simple client-server technology,
such as CGI [100] can be effectively used to provide
server-side computation across the WWW. We show that the technology is
inadequate for the types of systems detailed in appendix
due to: its speed; the total reliance on server-side computation; the
poor client support; and the fact that the technology is not scalable
to complex distributed systems.