
Cluster Computing

 

This chapter provides a review of the software that has been developed in response to the trends mentioned in section gif. The remainder of this chapter is further divided into four sections: a summary of resource management technologies; a description of metacomputing projects; an introduction to our prototype metacomputing project, Distributed Information Systems Control World (DISCWorld); and a discussion in which other systems are compared and contrasted with DISCWorld.

The evolution of computer systems from single monolithic mainframes, to massively parallel machines, to groups of tightly (and loosely) interconnected distributed workstations, has forced a change in the manner in which the systems are used, and what can be achieved with the systems. As computers, and groups of computers, become more complex, greater abstractions are necessary to ensure that users are able to exploit the computational power effectively, and to ensure that the resources are fairly and efficiently used.

While there is much debate about exactly what a cluster is [120], in this thesis we use the definition proposed in  [186]: a cluster is a collection of interconnected parallel or distributed machines that may be viewed and used as a single, unified computing resource. Clusters can consist of homogeneous or heterogeneous collections of von Neumann (serial) and parallel architecture computers, or even sub-clusters. An example of a cluster is shown in figure gif, where servers tex2html_wrap_inline2968 are workstations and server n is an interface to a sub-cluster. The term cluster usually refers to a tightly connected group of computers, connected by a high-speed interconnection network.

Figure 2: A cluster of computers, where servers tex2html_wrap_inline2968 are workstations and server n is either an interface to a sub-cluster or a multi-processor machine. Any machine may provide computational and cluster-management functions.

A resource manager is the name given to the software that provides management functions and abstractions that allow clusters to be treated as a single resource. Furthermore, resource management software is employed by system managers to ensure resources are available according to pre-defined criteria. Resource management software can be used, for example, to ensure that all users' jobs are executed in turn and hence receive the same amount of system resource; or, by not allowing all users to immediately execute their jobs, to spread out the load on the machine so that the shared resource can be effectively utilised.

There are a number of specialised resource management software products available. These may be divided into batch queueing systems and extended batch systems. Batch queueing systems are designed for use on tightly interconnected clusters, which usually feature shared file systems. Extended batch systems, designed for use in loosely interconnected clusters, do not usually make assumptions about shared file systems, and often offer increased functionality over typical batch queueing systems. Examples of batch queueing systems are: DQS [91, ]; GNQS [139]; PBS [116]; EASY [151]; LSF [188]; and LoadLeveler [123], while examples of extended batch systems are: Condor [153]; PRM [172]; CCS [195]; and Codine [81].

Resource Managers

 

Baker, Fox and Yau [20, ] divided cluster computing software into two groups: cluster management software (CMS); and distributed cluster computing environments (DCCE). Their definition of CMS maps onto our classifications of batch queueing systems and extended batch systems. Cluster management software systems are typically characterised by the user submitting a job description file specifying the programs to be run and the types of resources that are necessary for them to run. The resource management system selects an appropriate host on which to run the job, and plays no further part in the execution until either the job completes successfully, or needs to be killed for consuming too much resource. Finally, the resource manager places the outputs of the program execution at a place nominated by the user for later collection. There are various forms of result notification, including the manager placing the results in the directory from which the batch job was submitted or the user being sent email with the location of the results. Thus, batch queueing systems do not individually schedule the programs that comprise a job; they merely place the job onto a sufficiently unloaded server. The only form of scheduling they perform is the matching of available computational capacity with the job's requirements. The user typically submits their job to a queue that best reflects the job's characteristics (e.g. short, medium, or long execution time; priority based; short, medium, or long job execution time requiring parallel processors). In addition, batch queueing systems have no real support for platform heterogeneity. Users must ensure that binaries are available for the architecture on which the job will be placed.

An idealised batch queueing system is shown in figure gif, where jobs are submitted to a centralised queue master, placed by the queue master onto a compute server, and the output is returned to the user after processing is complete. Users typically have no control over which server executes their job. In contrast, an extended batch system is shown in figure gif. Users can submit their job to any server participating in the cluster. If the server to which the job is submitted does not have the available computational capacity to handle the job, it is moved to another server.

Figure 3: Architecture of a typical batch queueing system. Jobs are submitted to the queue master. The queue master assigns jobs to a server. When the job has finished executing, the results are returned to the user.

Figure 4: A wide-area manager. Jobs may be submitted to any server. If the local server is unable to execute the submitted job, the server decides where to send the job. Some systems use limits on the maximum number of times a job can be transferred to ensure it is eventually executed. The scale of network is larger and servers can be more loosely interconnected than in a batch queueing system.
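The difference between the two architectures in figures 3 and 4 can be illustrated with a minimal Python sketch. It is not taken from any of the systems reviewed here: the `Server` and `Job` records, the "most free CPUs" rule used by the queue master, and the `max_hops` transfer limit are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    free_cpus: int
    neighbours: list = field(default_factory=list)  # names of peer servers


@dataclass
class Job:
    name: str
    cpus: int


def queue_master_place(job, servers):
    """Batch queueing model: a central queue master chooses the server."""
    candidates = [s for s in servers if s.free_cpus >= job.cpus]
    if not candidates:
        return None  # no server has capacity; the job waits in the queue
    # Hypothetical policy: prefer the server with the most free CPUs.
    return max(candidates, key=lambda s: s.free_cpus).name


def extended_batch_place(job, entry, servers_by_name, max_hops=3):
    """Extended batch model: the job enters at any server and may be moved."""
    current = servers_by_name[entry]
    visited = set()
    hops = 0
    while True:
        if current.free_cpus >= job.cpus:
            return current.name
        visited.add(current.name)
        peers = [n for n in current.neighbours if n not in visited]
        if not peers or hops == max_hops:
            return None  # transfer limit reached; the job waits where it is
        current = servers_by_name[peers[0]]
        hops += 1
```

In the first function the user has no say in where the job runs; in the second the user picks the entry server and the servers themselves decide whether to forward the job.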

Batch Systems

The two main reasons for using a batch system are: to maximise the use of shared resources; and, to make sure everyone can effectively and equitably utilise that resource. They are not predominantly used to utilise spare cycles on users' machines, although many packages provide this functionality [153].

As the name suggests, batch systems are only useful for the initiation and controlling of batch jobs. Although it is common to submit jobs to a batch system as a shell script, queue masters do not interpret the instructions contained within the script. Most systems merely parse the script for job-identification tokens, such as the name of the user that submitted the job and what resource usage is expected by the job. The script is then executed on the server to which the job was assigned, which runs the instructions contained therein. Thus, batch queueing systems are of little use when presented with a job that consists of a series of dependent programs which may be interactive, and may be possible to execute in parallel. These are the type of jobs often submitted to a metacomputing environment.

While the batch queueing systems described in this section are used successfully with clusters of workstations, one of the conclusions of Baker et al. [21] was that software being developed for compute cluster environments and distributed compute cluster environments provides the most viable means of utilising workstation clusters. The main reason for this is that compute cluster environments and distributed compute cluster environments provide more flexibility than accessing machines individually. Batch queueing systems often assume that they are the only schedulers executing on the resources.

Through experience, we have discovered that GNQS, DQS and PBS are fragile when the machine that runs the queue master has multiple network interfaces. If the queue-master machine is used as a front-end to a cluster of machines that can only communicate through the front-end, these batch systems fail. We believe that this is due to the batch queueing system only recognising the front-end machine's primary network interface. Secondary interfaces, such as those that connect to the cluster, are ignored. This problem can be solved by using a wide-area resource manager designed for such conditions, such as Condor.

In a distributed computing context, with the aims of providing infrastructure for decision support systems, batch queueing systems are not entirely useful. While they provide effective resource utilisation, they are not able to handle requests which are comprised of dependent programs, especially if the programs can be logically run in parallel. In addition, they do not support the wide-area model in which the data to be used by a program is not immediately accessible.

GNQS

The development of the Network Queueing System (NQS) was driven by the need for a good UNIX-based batch and device queueing facility capable of supporting such requests from a networked environment of UNIX machines [139, ]. Later superseded by Generic NQS (GNQS), it was implemented as a common interface between machines, using a collection of user-space programs to provide both the batch and device queueing capabilities for each machine in the network. GNQS relies on the cluster of workstations sharing a common file-system and common user database.

Within GNQS, a batch request is defined as a shell-script containing commands that can be executed, without any user intervention, by an appropriate command interpreter. The commands contained within a batch request cannot require any specific peripherals. A device request, on the other hand, is defined as a set of instructions requiring the direct services of a specific device for execution.

GNQS has support for all the resource quotas that can be enforced by the underlying UNIX kernel implementation, and can support remote queueing and routing of batch and queue requests throughout the network of machines running GNQS. GNQS also supports queue access restrictions, and allows user accounts to be mapped across machine boundaries. Command standard output and error can be returned to the user on the originating (possibly remote) machine. Remote status operations are supported to alleviate the need for remote users to log into a machine in order to obtain information. The designers of GNQS also provided for the inclusion of file staging, which was to be implemented in a future release; unfortunately this has not happened.

PBS

Portable Batch System (PBS) [116, ] is a package which was designed and written by the Numerical Aerodynamic Simulation Complex, NASA, from 1994. PBS is a successor to NQS, and addressed many of the deficiencies of NQS. It was designed to provide additional controls over the initiating, or scheduling, of execution of batch jobs.

PBS extends the UNIX operating system by the addition of user-level services. Unlike NQS, it had detailed design and implementation documentation for developers. It was designed to be easy to add functionality and improved over NQS in a number of ways, including the provision for parallel jobs. PBS also included features that allow the scheduling policy to be modified according to a site's needs. A batch scheduling language has been designed for use with PBS, but the documentation recommends that it not be used, as there are implementation problems.

DQS

Distributed Queueing System (DQS) [220, 91] is a batch queueing system from the Supercomputer Computations Research Institute, Florida State University, which uses shared files to manage user-submitted jobs. It uses limited system information (the instantaneous and past load on a machine) and has no real support for binary heterogeneity. Managers can set up queue complexes, but the implementation is fragile, and must be monitored. It is fault tolerant, able to restart jobs on machines that have failed. In order to use the system, users must be familiar with the creation of shell-scripts in which they place the program(s) to be run. DQS supports parallelism via the PVM [79] system.

LSF

Load Sharing Facility (LSF) [187, ], by Platform Computing, is a very popular commercial batch queueing system. Like most batch queueing systems, LSF relies on the existence of shared files to implement queues, locks and logs. In addition to standard features found in batch queueing systems, LSF has some measure of fault tolerance inbuilt. In the event a shared file system is not available, the degree of fault tolerance is reduced. If the master host (which makes scheduling decisions) fails, another host in the cluster is automatically voted to be the master. Hosts are elected to be the master in the order they appear in a static file which must be visible to all machines in the cluster. In the event that the cluster becomes partitioned by network failure, the partition that has access to the LSF log files continues working, while the remaining partitions sit idle.
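The fail-over behaviour described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not LSF's implementation or file format: the function names, the representation of the static file as an ordered list of host names, and the notion of a single `log_host` holding the log files are assumptions.

```python
def elect_master(ordered_hosts, alive):
    """Return the first host in the static ordering that is still alive."""
    for host in ordered_hosts:
        if host in alive:
            return host
    return None  # no candidate host is reachable


def working_partition(partitions, log_host):
    """Return the partition that can still reach the log files, if any."""
    for part in partitions:
        if log_host in part:
            return part
    return None  # every partition has lost access to the logs and sits idle
```

For example, with the ordering `["h1", "h2", "h3"]` and `h1` failed, `elect_master` selects `h2`; after a network split, only the partition containing the (hypothetical) log host would go on to elect a master and continue working.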

Unlike GNQS and PBS, LSF supports machines with multiple logical names. Jobs are scheduled according to the availability of hosts, their resource requirements, and whether the job should be able to finish before the queue ceases to run. LSF provides job check-pointing and migration facilities for some machines [188]. It is possible to express dependencies between complete jobs so that dependent jobs are only run once their dependencies have been completed. Because LSF cannot be expected to run on all nodes in a distributed cluster, it has the ability to submit jobs to an NQS batch queue.

LSF has the ability to run parallel jobs through the use of PVM [79]. The user must submit their job using a non-standard command (pvmjob). When the job is executed, LSF creates PVM daemons on each host in the cluster and then starts the user's job.

LoadLeveler

LoadLeveler [123], by IBM, is a modified version of the Condor [153] batch queueing system. LoadLeveler supports clusters of workstations and multi-processor machines. It has parallel support for batch and interactive jobs, and is compatible with NQS. LoadLeveler has an application programmer interface (API) which allows TCP/IP applications to directly connect with least-loaded cluster resources.

The Extensible Argonne Scheduling sYstem (EASY) [150] was incorporated into LoadLeveler to produce EASY-LL [149, ]. EASY provided a better scheduling algorithm through which jobs could be selected to run. Jobs are ordered in the submission queue by their submission time, and are considered for execution in this order. If the scheduling system has enough free nodes to run the first job, it is run. Otherwise, the first job is assigned a start time equal to the finish time of the currently-running job. Jobs requiring fewer resources are executed in the meantime, as long as they will complete before the first job is due to start. Thus, jobs are back-filled to make better use of machine resources. The EASY-LL scheduling system was one of the first to provide a deterministic, yet opportunistic, scheduling algorithm based on the bin-packing algorithm [76].
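The back-filling rule can be made concrete with a short Python sketch. It is a simplified reading of the description above, not the EASY or EASY-LL code: the `QueuedJob` record, the user-supplied `runtime` estimate, and the way the head job's start time is derived from the finish times of running jobs are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class QueuedJob:
    name: str
    nodes: int
    runtime: int  # user-supplied estimate, in arbitrary time units


def easy_backfill(now, total_nodes, running, queue):
    """Return the names of jobs to start at time `now`.

    running: list of (nodes, finish_time) pairs for executing jobs.
    queue:   list of QueuedJob in submission order.
    """
    running = list(running)
    queue = list(queue)
    free = total_nodes - sum(nodes for nodes, _ in running)
    started = []

    # Start jobs from the head of the queue while they fit.
    while queue and queue[0].nodes <= free:
        job = queue.pop(0)
        free -= job.nodes
        running.append((job.nodes, now + job.runtime))
        started.append(job.name)
    if not queue:
        return started

    # The head job does not fit: find when enough nodes will be released.
    head = queue[0]
    available = free
    start_time = None
    for nodes, finish in sorted(running, key=lambda r: r[1]):
        available += nodes
        if available >= head.nodes:
            start_time = finish
            break
    if start_time is None:
        return started  # the head job needs more nodes than the machine has

    # Back-fill later jobs that fit now and finish before the head job starts.
    for job in queue[1:]:
        if job.nodes <= free and now + job.runtime <= start_time:
            free -= job.nodes
            started.append(job.name)
    return started
```

Because a back-filled job must finish by the head job's start time, it returns its nodes before they are needed, so the head job is never delayed. The sketch depends on the runtime estimates being upper bounds.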

Extended Batch Systems

Extended batch systems are characterised by having additional functionality over batch queueing systems. For example, some systems have the ability to act as disk caches [12], to harness idle CPU cycles on participating machines while the owner is not using them [12, ] or to manage software licenses [81]. Still others make searching nodes in the cluster easier for fine-grained parallel tasks [172]. The remainder of this sub-section provides a description of four significant extended batch systems: Prospero Resource Manager, Codine, Condor and Computing Center Software.

Prospero Resource Manager

The Prospero Resource Manager [172, ] (PRM) is designed for use with distributed and parallel programs implemented in PVM [79]. It presents a uniform and scalable model for scheduling tasks by providing mechanisms through which nodes on multiprocessors can be allocated to jobs.

The main contribution of PRM is its hierarchical organisation of system resources into a framework that can be easily searched for use by fine-grained parallel applications. Architecturally, PRM is divided into two distinct parts: the system and jobs. The system part is concerned with the resources that comprise the system, which are global and local. When job execution is requested by the user, the tasks that comprise a job are allocated to different resources. System management resources are allocated to jobs as they are needed, and resources are managed separately for each job. Multiple resource managers are used to achieve resource management scalability, each controlling a subset of the resources. There are three types of managers: system managers, job managers and node managers, each of which uses a different level of information abstraction.

The system manager controls a collection of physical resources, allocating them to jobs when requested. It collects and manages all the information pertaining to the resources under its control. The job manager requests resources needed by a particular job, and assigns them to individual tasks in the job. The job manager's only concern is the job to which it is allocated. The node manager loads and executes tasks on a local machine, as requested by the job manager and authorised by the system manager. The managers organise information using the Prospero Directory Service [169, ], which is based on the Virtual System Model. System managers are organised in a hierarchical fashion.

On job execution, decisions on which resources to use are determined by the job manager querying the directory service for a suitable resource set. During execution, the job manager monitors the tasks, and passes any requests for additional resources to system managers. Resource heterogeneity is handled by the use of architecture identifiers in the directory service.
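The division of labour between the three kinds of manager can be sketched as follows. This is a toy model for illustration only and does not reflect PRM's actual interfaces: the class and method names are hypothetical, the directory service is reduced to a dictionary mapping node names to architecture identifiers, and authorisation is modelled simply by the system manager handing out node names.

```python
class SystemManager:
    """Controls a set of physical nodes and allocates them to jobs."""

    def __init__(self, nodes):
        self.free = dict(nodes)  # node name -> architecture identifier

    def allocate(self, arch, count):
        chosen = [n for n, a in self.free.items() if a == arch][:count]
        if len(chosen) < count:
            return []  # request cannot be satisfied
        for name in chosen:
            del self.free[name]
        return chosen


class NodeManager:
    """Loads and executes a task on one node."""

    def run(self, node, task):
        return f"{task}@{node}"  # stand-in for actually starting the task


class JobManager:
    """Acquires resources for a single job and assigns them to its tasks."""

    def __init__(self, system_manager, node_manager):
        self.system_manager = system_manager
        self.node_manager = node_manager

    def run_job(self, tasks, arch):
        nodes = self.system_manager.allocate(arch, len(tasks))
        if not nodes:
            return None
        return [self.node_manager.run(n, t) for n, t in zip(nodes, tasks)]
```

Each job gets its own `JobManager`, which knows only about that job, while the `SystemManager` knows only about resources; this mirrors the separation of system and job concerns described above.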

Unfortunately, the only way in which Prospero can execute parallel jobs is by the use of the PVM system. This introduces restrictions on the programs that can be used with the system. In the context of decision support infrastructure, we do not wish to constrain developers to writing parallel code in PVM.

Codine

Codine [81], by Genias Software, is a widely-used commercial resource management software system. It has the ability to manage not only physical computational resources such as memory and disk space, but also software licenses. It supports check-pointing and migration of user jobs, and is accessible through a user-level API. Although the level of functionality is greatly increased over other batch queueing and wide-area management systems, the architecture of Codine is queue-based, as found in many other batch queueing systems such as DQS, NQS and PBS. It does, however, handle wide-area clusters. There must exist a queue master, which performs the scheduling of user requests onto computational resources, and each machine comprising the cluster must execute a daemon. Jobs are scheduled according to a first-in first-out (FIFO) scheme, and the cluster administrator can determine the way in which jobs are placed onto resources. The system allows dependencies to exist between jobs submitted to the queues. Parallel execution packages, such as PVM and MPI, are directly supported. When the user submits a request in which the program uses one of the supported parallel packages, multiple resources are co-scheduled while the job executes. There is no direct support for multi-threaded jobs to be split across resources.

Codine was later merged with other tools to form the Global Resource Director (GRD) [82]. Directed at critical resource management issues, GRD was initially released for the Cray and SGI platforms. Codine provides job management capabilities, while PerfStat [83] provides performance monitoring capabilities. An advanced scheduler, GDS, is used to implement multiple scheduling policies such as functional priority (by user, department, job class or project), share-based (user share tree or project share tree) and urgency-based (initiate time or deadline) policies [66]. In addition, GDS allows the system to be temporarily overridden when the need arises. GDS allows dynamic scheduling of user jobs and utilisation management of resources.

Condor

Condor [153] is a distributed batch system developed at the University of Wisconsin-Madison from 1988 to execute long-running jobs on workstations that are otherwise idle. The emphasis of Condor is on high-throughput computing. The Condor system recognises that not all users utilise their systems all of the time, and that while they are not using their systems, they may be used for other processing tasks. While Condor provides the user with a uniform view of processor resources, thus making remote access to a compute resource easy, it also guarantees the workstation owner will receive the immediate response of their workstation, should they wish to use it.

Condor also guarantees that the user's program will be run on an unloaded machine, and allows users' jobs to be run on any machine in the processor pool, whether or not the user has a valid account on the target machine.

Scheduling and resource management in Condor are achieved through the use of matchmaking [192]. Matchmaking is the way in which Condor addresses the fact that in metacomputing environments, it is highly likely that resources will be owned by different organisations or institutions, each of which has its own usage and management policies. For example, a management policy may state that a particular resource is only available to core staff between certain hours, but is to be available to all bona fide Condor users outside of those hours. Matches are made using a semi-structured classified advertisements data model, which is used to make queries and publish services. A designated matchmaker is used to match queries with services and inform the entities of the match.
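The idea of two-sided matching can be illustrated with a small Python sketch. It does not use Condor's classified advertisement language or any Condor API: advertisements are modelled here as plain dictionaries, each side's requirements are an ordinary Python predicate, and attribute names such as `group`, `submit_hour` and `memory_mb` are hypothetical.

```python
def matches(request, offer):
    """A match requires that each side accepts the other's advertisement."""
    return request["requirements"](offer) and offer["requirements"](request)


def matchmake(requests, offers):
    """Pair each request with the first unclaimed offer that matches it."""
    pairs = []
    claimed = set()
    for request in requests:
        for index, offer in enumerate(offers):
            if index not in claimed and matches(request, offer):
                claimed.add(index)
                pairs.append((request["name"], offer["name"]))
                break
    return pairs


# The example policy from the text: core staff only during working hours
# (assumed here to be 09:00 to 17:00), any user outside those hours.
workstation = {
    "name": "ws1",
    "arch": "sparc",
    "memory_mb": 64,
    "requirements": lambda job: job["group"] == "core"
    or not 9 <= job["submit_hour"] < 17,
}
```

A guest job advertised at 10:00 would be refused by `ws1`, while the same job advertised at 20:00 would be accepted, provided the job's own requirements on architecture or memory are also met. The matchmaker only pairs the parties; in this sketch, as in the description above, what happens after the entities are informed is outside its concern.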

Computing Center Software

Based at the University of Paderborn, the Computing Center Software (CCS) [195, , ] project is concerned with the resource access and allocation problems in metacomputing environments. It offers the user the option of running jobs in either interactive or batch mode.

Scheduling within CCS is performed via an Implicit Voting Scheme (IVS) [193] coupled with a priority system to express urgency of tasks. If the system is not congested, a simple first-come first-serve (FCFS) algorithm is used. The FCFS algorithm is explained further in section gif. Once the system becomes congested, however, the job mix is inspected and if most of the jobs are batch programs, a first-fit decreasing-height (FFDH) algorithm, which is an instance of the bin packing algorithm, selects the first available job that takes the greatest number of processors. Tasks are assigned so as to maximise the average utilisation of the processors that comprise the metacomputer. If most of the jobs in the request list are interactive, another algorithm, first-fit increasing-height, is used, which gives preference to shorter jobs and reduces the average waiting times. One of the significant contributions of CCS was the development of a resource definition language (RDL) [22], which targets and specifies the types of interconnections between reconfigurable Transputer-based parallel computers.
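The switch between policies can be sketched in Python as below. This is an interpretation of the description above rather than the CCS scheduler: the `Request` record, the strict-majority test for "most of the jobs", and the reading of "shorter" as a smaller estimated runtime are all assumptions, and the priority system is omitted.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    procs: int
    est_runtime: int
    interactive: bool


def pick_next(requests, free_procs, congested):
    """Choose the next request to start, or None if nothing can start."""
    runnable = [r for r in requests if r.procs <= free_procs]
    if not runnable:
        return None
    if not congested:
        # FCFS: only the request at the head of the list is considered.
        head = requests[0]
        return head if head.procs <= free_procs else None
    batch_count = sum(1 for r in requests if not r.interactive)
    if 2 * batch_count > len(requests):
        # Mostly batch: first fitting request needing the most processors.
        return max(runnable, key=lambda r: r.procs)
    # Mostly interactive: prefer the shortest fitting request.
    return min(runnable, key=lambda r: r.est_runtime)
```

Python's `max` and `min` return the first of several equal candidates, so ties are broken in arrival order, which keeps the selection "first-fit" in both congested branches.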

Middleware and Metacomputing Systems

The primary purpose of a resource management system is simply to provide fair, efficient use of shared computational resources. Users are unwilling, and sometimes unable, to understand and implement the framework associated with fault tolerant distributed systems. This is further discussed in appendix gif. One of the fundamental purposes of metacomputing environments is to allow larger, and more numerous, problems to be solved without the need for detailed technical knowledge of the computers that provide the computational services. Although one of the mechanisms for providing this is through resource management systems, they do not provide enough functionality to be classified as metacomputing systems.

One of the fundamental differences between cluster computing and metacomputing is the nature of the job which is to be run on the system. Although some cluster computing environments support fine-grained parallelism through the use of a third-party system such as PVM [79], MPI [68], p4 [32] or Linda [80], most cluster computing environments do not support single jobs that consist of a number of largely independent tasks that can be run in parallel. In those systems that do, this is achieved through the use of a shell-script in which the programs to be run are contained. Upon execution, the separate programs are executed by the target machine as if they were a single program. Although batch queueing systems and wide-area resource management systems have provided good solutions to the problem of managing a cluster of workstations, they fail to scale up to thousands of machines, which may dynamically join and leave the extended cluster. They are also unable to manage and reason about the re-use of results. We characterise metacomputing environments as having the ability to process a request for execution that consists of independent tasks, and to schedule each task as a separate request, organising task intercommunication appropriately. In addition, metacomputing environments have the ability to re-use results, and sometimes partial results, in further computations.

Baker and Fox point out that cluster management systems and distributed cluster control environment software either come as a software layer positioned between the native operating system and user applications, or as a partial or complete replacement for the native operating system [20]. Metacomputing environments, too, can either be positioned as a software layer above the operating system, or as a replacement for the operating system, from the application's point of view.

Middleware is a layer of software that provides high-level services to applications, abstracting over low-level details that may differ between platforms. This software layer allows multiple processes running on one or more machines to interact transparently across a network. The relationship between middleware, metacomputing and traditional computer architecture is shown in figure gif. Middleware exists on top of the host computer's operating system, and in some cases takes over most of its functionality [95]. Enabling technologies include DCE [197] and CORBA [177]. User queries are made to the middleware to execute applications on behalf of the user. The most typical example of this is a remote execution request. Examples of middleware tools are described and discussed in the next sub-section.

Figure 5: The relationship between middleware software and modern, large-scale computing. Metacomputing systems can be constructed to use the services provided by middleware. In this diagram, user queries are placed slightly higher than applications to indicate that the applications are used to satisfy user queries. Enabling technologies include those which provide naming and directory services, such as DCE [197] and CORBA [177].

Metacomputing systems make use of the services provided by middleware, which focus on larger-scale distribution and larger problems. There are a number of metacomputing systems in existence [69, , , , 158, , , , , ]. Some well-known metacomputing systems are Globus [69] and Legion [95]. A review is given in  [19], which provides a description of a number of metacomputing systems, as well as what may be termed middleware tools. Middleware tools are tools or systems that are independent of a particular metacomputing system, but are designed to run in the framework provided by a complete system. The Globus and Legion systems recognise several general characteristics of metacomputing systems, some of which may be mutually exclusive in a real implementation:

Scale and the need for selection. There is need for metacomputing environments to be scalable to large numbers of distributed machines. The system should also have the ability to sensibly choose to use a particular machine, or group of machines, in preference to others.
Heterogeneity at multiple levels. With the proliferation of different computer architectures and interconnection network architectures, systems must be able to adapt in order to make the best use of such heterogeneity. It follows that the software that runs on the distributed machines must also be capable of being heterogeneous from the point of view of machine usage policies and the software that provides high and low-level services.
Unpredictable structure. Unlike traditional supercomputers, it is not possible to completely specify in advance what the processor topology or interconnection network characteristics will be when running a processing job. For this reason, metacomputing environments must be able to withstand changes in environmental characteristics.
Dynamic and unpredictable behaviour. In contrast to supercomputers that may run a single processing job at a time, it is almost a certainty that users will be sharing resources in a metacomputing environment, from processors and interconnection network bandwidth to I/O devices.
Multiple administrative domains. The problem of allowing a user to run their code using a different site introduces issues of security and authentication to the general problem of metacomputing, as well as issues of charging for system time, and of access policies for users of the metacomputing environment.

A selection of metacomputing projects, representative of the state-of-the-art, is presented in subsections gif through gif. Our metacomputing project, DISCWorld, is introduced in section gif. DISCWorld is compared against other metacomputing projects in section gif, and the chapter is concluded in section gif.

Middleware Tools

In this sub-section we discuss tools and systems that, while not recognised as metacomputing systems in their own right, are designed to run in a metacomputing framework, or to provide some of the functionality of a metacomputing system.

There exists a range of tools that facilitate the writing of parallel and distributed programs, such as the MPI-CH [97] implementation of the Message Passing Interface standard [98] and PVM [79]. Both systems present the user with a message-passing environment which may be composed of a heterogeneous collection of computers. While not true metacomputing systems, these tools have played a large and important part in increasing the popularity and accessibility of parallel and distributed applications. MPI-CH and PVM provide an application programmer interface (API) and run-time communications library against which a programmer can write and execute their applications. PVM additionally provides a run-time environment with which the user can interrogate the virtual machine, add or delete hosts, and control running parallel jobs. The PVM virtual machine is independent of any single parallel program execution. In contrast, the virtual machine created by MPI-CH exists for the duration of the currently-running parallel program only; additional nodes cannot be requested or released at run-time. In both systems, fault tolerance and reliability issues must be addressed by the user. Programs such as PLUS [31], which are built over an existing message passing library such as PVM, reproduce some functionality of a metacomputing system; and some projects emulate parallel computers with off-the-shelf systems, such as the Berkeley NOW project [12], which schedules parallel workloads across clusters of computers [52, ].

DCE

The Distributed Computing Environment (DCE) [181] from the Open Software Foundation (OSF) is one of the most influential distributed computing products on the market. DCE provides services and tools that support the creation, use and maintenance of distributed applications in a heterogeneous computing environment. It is a standard set of tools, such as DCE RPC and DCE Threads, and services, such as the DCE Directory Service, DCE File Service, Security Service and Distributed Time Service. The tools and services are integrated with the aim of providing interoperability and portability across a heterogeneous collection of platforms.

AppLeS

The Application Level Scheduler (AppLeS) [24] project is an application-level scheduling system that has been under development at the University of California, San Diego since 1996. AppLeS is not a resource management system; it is what is commonly called middleware. It interacts with metacomputing systems such as Globus [69] and Legion [95], or with cluster management applications such as PVM [79] and MPI [68]. AppLeS is an agent-based methodology for application-level scheduling. AppLeS agents are based on the application-level scheduling paradigm, where everything about the system is evaluated in terms of its impact on the application [25]. Each application has its own AppLeS, and each AppLeS combines both static and dynamic information to determine a customised application-specific schedule and implement that schedule on the distributed resources of the metacomputer.

AppLeS relies on application-specific and system-specific information to produce good schedules. In this context, good schedules are judged by the application's performance criteria. Schedules derived from predictions of application and system state are only as accurate as the predictions themselves, and because the whole system is dynamic, the predictions have a finite lifetime, beyond which they become out of date.

Each AppLeS agent has a single active component, called the coordinator, and four subsystems: the resource selector; the planner; the performance estimator; and the actuator. The resource selector chooses and filters different resource combinations for the application's execution. The planner generates resource-dependent schedules for given resource combinations. The performance estimator generates performance estimates for candidate schedules according to the user's (or application's) performance metric, and the actuator implements the best schedule on the target resource management system [25].

Information, contained within the AppLeS information pool, is supplied to the AppLeS agents in a number of ways. The Network Weather Service [237] provides CPU load predictions and network statistics for the period in which the application will be scheduled. Through the user interface, the user provides specific information such as the structure and characteristics of the application, performance criteria, execution constraints, and any login information that may be needed to use remote machines. Performance estimation and resource selection information is gathered by any default models that have been built up through use of the metacomputing environment.

Available information is used to filter infeasible resource sets from the resource pool. Remaining resources are then prioritised according to an application-specific notion of distance between resources, and promising sets of resources are identified by the resource selector. The planner is then invoked to propose a schedule, which is evaluated by the performance estimator. Once the best schedule, according to the application's performance criteria, has been identified, the actuator implements the schedule.
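The selection, planning and estimation steps described above can be sketched in a few lines of Python. This is an illustration only and is not taken from the AppLeS implementation: the `Resource` fields, the load threshold, the use of "number of sites spanned" as the distance measure, and predicted makespan as the performance metric are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Resource:
    name: str
    predicted_load: float  # forecast CPU load in [0, 1); hypothetical stand-in for forecast data
    site: str


def select_resource_sets(pool, max_load, set_size):
    """Resource selector: drop infeasible hosts, then rank candidate sets by 'distance'."""
    feasible = [r for r in pool if r.predicted_load <= max_load]
    candidates = list(combinations(feasible, set_size))
    # Assumed distance measure: number of distinct sites spanned by the set.
    return sorted(candidates, key=lambda rs: len({r.site for r in rs}))


def plan(resource_set, total_work):
    """Planner: split the work in proportion to each host's predicted spare capacity."""
    spare = [1.0 - r.predicted_load for r in resource_set]
    total = sum(spare)
    return {r.name: total_work * s / total for r, s in zip(resource_set, spare)}


def estimate(schedule, pool_by_name):
    """Performance estimator: predicted makespan (the assumed performance metric)."""
    return max(work / (1.0 - pool_by_name[name].predicted_load)
               for name, work in schedule.items())


def coordinate(pool, total_work, max_load=0.8, set_size=2):
    """Coordinator: run selector, planner and estimator; return (cost, schedule) or None."""
    by_name = {r.name: r for r in pool}
    best = None
    for resource_set in select_resource_sets(pool, max_load, set_size):
        schedule = plan(resource_set, total_work)
        cost = estimate(schedule, by_name)
        if best is None or cost < best[0]:
            best = (cost, schedule)
    return best  # an actuator would hand best[1] to the resource management system
```

The actuator is deliberately left out, since implementing a schedule depends entirely on the target resource management system.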

MARS

The Metacomputer Adaptive Runtime System (MARS) [78] from the University of Paderborn is a metacomputing framework that aims to solve a number of problems that arise when running parallel codes on a metacomputer: how to determine a good initial task-to-processor mapping; how to cope with changing network performance; how to cope with varying node performance (e.g. concurrent usage); how to migrate tasks on heterogeneous systems; and how to cope with processor/network failures or extensions.

The runtime system uses application-specific data, such as communication characteristics, processor utilisation and program phases, and system-specific information such as node performance, network throughput and latency to determine where to run, and whether to migrate tasks between heterogeneous processors. MARS uses application- and system-specific information to predict applications' future resource utilisations in an effort to improve task migration. Migration decisions are made by a Migration Manager. MARS is currently written in C and uses MPI; a preprocessor inserts MPI calls to facilitate the collection of statistics by the runtime system.

MARS has two fundamental instances: Monitors, which collect statistical data; and Managers, which use the statistical data to make initial task placement and task migration decisions. Because not all machines are binary-compatible, task migration is restricted to certain places in a computation; these places are identified by certain MPI calls, which are added manually. The user's code is linked to the MARS runtime library, which allows the Monitors to intercept send and receive calls and generate application and system statistics.

Task dependency graphs are generated by the Application Manager and consolidated by a Program Information Manager. Nodes in the dependency graphs represent serial portions of code between a pair of send and receive calls, and are called Independent Blocks (IBs); edges denote the precedence relationship between IBs. Consolidation is necessary as a task may generate different dependency graphs on subsequent runs. Dependency graphs of multiple task runs are consolidated to produce a graph of the likely communication patterns of a given task, which will aid in the placement of the tasks onto processors.
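One simple way to picture consolidation is to count how often each precedence edge between Independent Blocks occurs across runs and keep the edges that occur often enough. The sketch below does exactly that; the source does not describe the rule MARS actually applies, so the frequency threshold and the function name `consolidate` are assumptions chosen for illustration.

```python
from collections import Counter


def consolidate(runs, min_fraction=0.5):
    """Merge per-run dependency graphs into one graph of likely edges.

    Each run is a collection of (from_block, to_block) precedence edges.
    An edge is kept if it appears in at least `min_fraction` of the runs
    (an assumed rule). Returns {edge: observed frequency}.
    """
    counts = Counter(edge for run in runs for edge in set(run))
    n = len(runs)
    return {edge: c / n for edge, c in counts.items() if c / n >= min_fraction}
```

The observed frequencies also give a crude measure of how much the graph varies between runs, which is the kind of information a placement algorithm could use to decide how far to trust the consolidated graph.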

The authors remark that although most applications benefit from the use of the dependency graph in task placement, the graph may be difficult to construct for applications with highly irregular communication structures. The statistical variance of the dependency graph is used to modify the decision algorithm. The authors also justify the use of an additional system call when issuing a communications send or receive on the grounds that network latencies are an order of magnitude higher in WAN environments than in tightly coupled parallel systems [78].

Nimrod

The Nimrod [2] project addresses the problem of performing a large number of parameterised simulations on a set of distributed computers, each simulation having a different parameter set. It does not address the problem of parallelising an individual program, nor does it allow a series of programs with interdependencies to be executed in parallel, nor does it address the issue of fault tolerance. The original implementation of Nimrod required all participating resources to run DCE [198].

Resource management in Nimrod is handled by whatever queueing system is present on the computer that is chosen to run the jobs. Nimrod submits the jobs that it generates to the target computer's queueing system, where they are executed in turn and the results are returned to Nimrod. The decision to let Nimrod submit jobs to the target's queue management system, rather than actively support the distribution of jobs, was an effort to reduce the complexity of the application program. It is pointed out in [146] that the skill-set of programming in truly distributed environments is different to that found in ordinary application programming.
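The job-generation side of a parameterised experiment amounts to expanding a parameter space into one job per combination. The following sketch shows the idea; it does not reproduce Nimrod's own experiment description format, and the function name, command template and parameter names are hypothetical.

```python
from itertools import product


def generate_jobs(command_template, parameters):
    """Expand a parameter space into one command line per parameter combination."""
    names = sorted(parameters)
    jobs = []
    for values in product(*(parameters[n] for n in names)):
        binding = dict(zip(names, values))
        jobs.append(command_template.format(**binding))
    return jobs


# Hypothetical experiment: 2 pressures x 3 temperatures gives 6 independent jobs.
jobs = generate_jobs(
    "./sim --pressure {pressure} --temp {temp}",
    {"pressure": [1, 2], "temp": [300, 350, 400]},
)
```

Each generated command would then be handed to the target computer's own queueing system, which is why no scheduling logic appears in the sketch.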

The main contributions that Nimrod has made to the field of metacomputing are job transfer and a user-centric view of the metacomputing system. Nimrod takes care of job transfer on behalf of the user by sending the appropriate input files to the target processor via a remote file transfer server. It was also one of the first systems to provide a user-centric view of the computational environment.

A new version of Nimrod, Nimrod-G, is currently being ported to operate under the Globus metacomputing environment [3]. It is intended that the new version will utilise the functionality of Globus that allows the user to specify time and cost constraints on experiments.

Globus

 

Globus [69, 70] is a metacomputing environment under development at Argonne National Laboratory and the California Institute of Technology. It consists of a metacomputing infrastructure toolkit, which provides basic capabilities and interfaces for communications, resource location, scheduling and data access [69]. Each toolkit component has a well-defined interface; in combination, these interfaces define a metacomputing abstract machine, upon which higher-level services can be built.

The Globus project does not aim to re-invent existing technologies such as PVM [79], MPI [98], Condor [153] or Legion [95], but provides basic infrastructure through the development of low-level mechanisms that can be used by higher-level services. The project also aims to provide techniques that allow such services to observe and guide the mechanisms. Some of the lower-level mechanisms are: resource location and allocation, communications, unified resource information service, authentication interface, process creation, and data access.

The unified resource information service contains information about the status of the system, for example: static characteristics of a processing node, instantaneous performance information and application-specific information. This information is gathered from different sources, and can be accessed through a single mechanism within Globus.

Globus modules can be influenced in the decisions that they make by higher-level services. This is accomplished by the use of rule-based selection, resource property inquiry and notification mechanisms. Rule-based selection is used to identify strategies with which the low-level module can perform a given task. The resource property inquiry module requests information from the Globus unified information service, which contains the current state of the environment. The notification module allows a call-back mechanism between the higher-level service and the low-level mechanism such that in the event of an exception, the mechanism can notify the service of the event.

Resource management in Globus is achieved through the interaction of the Globus toolkit with any schedulers that may run on the local system. An extensible resource specification language (RSL) [48] is used to communicate requests for resources between components. The RSL is a simple language that allows the system to request resources containing characteristics embedded in the language. The RSL describes, principally, the physical machines on which to run programs. Local and global services that the programs use are considered to be fine-grained enough that they can be run on any of the machines in the computing environment.

The Globus resource management system revolves around the concept of resource brokers: software that acts as an interface and translator between higher-level specifications of requests and more concrete representations of those requests (in RSL). These brokers are application-specific, having the ability to understand high-level requests from user clients. They progressively refine the client's request until it becomes expressible as a request for specific resources. Once the specific requests have been translated into RSL, they are dispatched to the Globus Resource Allocation Manager (GRAM). The GRAM [48] provides the local component for resource management. Each GRAM is responsible for a set of resources operating under the same site-specific allocation policy, often implemented by a local resource management system such as LSF [188] or Condor [153]. GRAM provides a standard network-enabled interface to local resource management systems. While individual sites are not constrained in their choice of resource management tools, the computational grid tools [71] and applications can express resource allocation and process management requests in terms of a standard API.
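The refinement performed by a broker can be illustrated with a toy example that turns an application-level request into a concrete, RSL-style string. This is a sketch under stated assumptions, not Globus code: the `broker` function, its sizing rule and the attribute names are hypothetical, and the output only imitates the "conjunction of attribute=value relations" shape of an RSL request rather than following the RSL definition in [48].

```python
def to_rsl(request):
    """Render attribute/value pairs as a conjunction of relations (RSL-style, simplified)."""
    return "&" + "".join(f"({key}={value})" for key, value in request.items())


def broker(high_level):
    """Hypothetical application-specific broker: refine a vague request into specifics."""
    # Step 1: translate an application-level problem size into a processor count
    # (assumed rule: one processor per million grid points, at least one).
    nodes = max(1, high_level["grid_points"] // 1_000_000)
    # Step 2: fix the concrete program and per-node requirements.
    concrete = {
        "executable": high_level["program"],
        "count": nodes,
        "memory": high_level["mb_per_node"],
    }
    return to_rsl(concrete)


request = broker({
    "program": "/usr/local/bin/climate",
    "grid_points": 4_500_000,
    "mb_per_node": 256,
})
```

In the architecture described above, a string of this kind is what would be dispatched to a GRAM, which then maps it onto whatever local resource manager the site runs.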

Resource and computation management services are implemented in a hierarchical fashion. An individual GRAM supports the creation and management of a set of processes on a set of local resources. A computation created by a global service may then consist of one or more jobs, each created by a request to a GRAM and managed via management functions implemented by that GRAM. To implement a global directory, Globus uses a Metacomputing Directory Service (MDS) [67], which is based on the Lightweight Directory Access Protocol (LDAP) [230].

Harness

Harness [158] is a metacomputing framework designed for dynamic reconfigurability. A collaborative project between Emory University, UTK and ORNL, it is based on IceT [89] from Emory University.

Dynamic reconfigurability within Harness is achieved through the use of a ``plug-in'' mechanism by which services and computational resources can be added to and removed from the system [159]. The model describing the Harness distributed virtual machine consists of four layers: the abstract distributed virtual machine (DVM); heterogeneous computational resources; services; and, applications and users. Applications and users are able to utilise a consistent baseline, to which the DVM conforms.

Services consist of an interface specification; instances of the services, called plug-ins, are plugged into the DVM to provide processing capabilities. Services have attributes such as their shareability, exportability and threadability. Shareability refers to a service's ability to be used simultaneously by multiple applications; exportability determines whether different DVMs are able to access the service; and threadability refers to the ability to execute multiple instances of the service simultaneously in a single DVM. Services are divided into three classes: kernel level services; basic services; and specialised services. Kernel level services are crucial to the operation of the DVM; they must be portable across the DVM. Basic services are very common services, which may or may not be exportable. Specialised services include those that have non-standard platform requirements (e.g. very large memory requirements or massively-parallel vector processors) or performance requirements which prevent their execution on every machine in the DVM. As Harness is based on IceT, the remainder of this section describes IceT.
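The service attributes and classes described above map naturally onto a small data model. The sketch below is illustrative only: the class and method names are hypothetical and do not correspond to the Harness API, and the rule that kernel level services cannot be unplugged is an assumption drawn from the statement that they are crucial to the operation of the DVM.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    name: str
    level: str        # "kernel", "basic" or "specialised"
    shareable: bool   # usable by several applications at once
    exportable: bool  # accessible from other DVMs
    threadable: bool  # several instances may run at once in one DVM


@dataclass
class DVM:
    services: dict = field(default_factory=dict)

    def plug_in(self, service):
        """Add a service instance (plug-in) to the virtual machine."""
        self.services[service.name] = service

    def unplug(self, name):
        """Remove a plug-in; refusing kernel level services is an assumed policy."""
        if self.services[name].level == "kernel":
            raise ValueError("kernel level services are crucial to the DVM")
        del self.services[name]

    def exported(self):
        """Names of the services that another DVM would be allowed to access."""
        return sorted(s.name for s in self.services.values() if s.exportable)
```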

The IceT model of resources consists of multiple clusters of virtual environments belonging to multiple users, each with distinct levels of security and accessibility. Written in Java, native IceT processes and data are transportable. They can be uploaded to remote locations, and in the case of processes, can be executed without regard for the remote architecture or file system structure.

Fundamental to the IceT process model is the concept of process spoking, which refers to the ability of a process and its data to be uploaded to a remote computational resource for execution, as opposed to the common model of remote requests and responses. The main difference is that when a process is uploaded to a remote resource, it can exert independence from the requesting resource (unlike the CORBA and RMI models), and can manage the uploading of any other processes that are necessary for the computation or maintain persistent communications with other processes.

IceT is implemented in Java. Java servers allow byte-codes to be executed on remote nodes where the requesting user does not have normal access privileges. This is achieved through Java's portability and security measures, and the uniform, system-independent view of the underlying architecture. Java's just-in-time features are extensively relied upon to achieve better execution performance than pre-compiled Java byte-code running in a standard Java Virtual Machine.

Virtual environments are created by the user adding hosts into the configuration held by the local daemon. Remote virtual environments can be added, which causes any hosts contained within the remote virtual environment to become accessible through the local daemon.

The architecture and operation of Harness and IceT are similar to those of PVM. In order to execute pre-compiled IceT Java byte-code, as in PVM, the user must either run the program from the command-line or spawn an IceT task from a local daemon's console. Provided that the node's security manager allows the byte-code to be executed, it is sent to the target node, and is parsed for dependencies on other byte-codes (Java packages) or shared system-dependent libraries. In the case of dependence on other byte-codes, they are uploaded to the target host, and in the case of shared libraries, if there is a version of the library that is available for the architecture, it is uploaded [90]. No indication is given in the literature of the behaviour when a shared library of the appropriate architecture is not available. No mention is made of the method by which scheduling in either Harness or IceT is achieved.

Infospheres

Infospheres [39] is a project from Caltech that addresses dynamic scalable distributed systems. The research is concerned with developing infrastructure to support structuring distributed applications by composing components using sequential, choice and parallel composition. It is implemented in Java [88] and TCP/IP, for use over the World-Wide Web (WWW). The project is one of the first to approach the metacomputing problem from the point of view of composing existing, well-characterised components together to perform a task. Infospheres is not intended to support seamless parallelism, high-performance computation or fault-tolerant transactions.

Compositional units are abstractions of processes and sessions. Processes can be composed in parallel. Sessions are collections of processes composed in parallel. Sessions may be composed using sequential and choice composition. The infrastructure supports distributed applications, which can be structured by nesting processes and sessions. The object model is state-based, where every object has a persistent state for the lifetime of its corresponding entity. The execution of the system is regarded as a set of states, where a state is an assignment of values to a given set of variables.

Interprocess communication is achieved by the use of asynchronous RPC [27]. Incoming and outgoing message queues are local to each process, and are created and destroyed at will. Queues may have different priorities. Output queues may be bound to an arbitrary number of input queues, and messages are sent in FIFO order. When multiple output queues are bound to an input queue, the result is a fair merge of the sequences of messages. Input queues are typed according to the messages they are allowed to accept.
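The queue semantics above (typed input queues, FIFO delivery, many-to-many binding and a fair merge at the receiver) can be sketched as follows. This is not the Infospheres API: the class names are hypothetical, and round-robin is used as one possible realisation of a fair merge, since the source does not say which merge policy is applied.

```python
from collections import deque


class InputQueue:
    def __init__(self, accepted_type):
        self.accepted_type = accepted_type  # input queues are typed
        self.sources = []                   # one FIFO buffer per bound output queue

    def receive_all(self):
        """Fair merge: take one message from each non-empty source in turn."""
        merged = []
        while any(self.sources):
            for buffer in self.sources:
                if buffer:
                    merged.append(buffer.popleft())
        return merged


class OutputQueue:
    def __init__(self):
        self.bindings = []  # (input queue, that queue's buffer for this output queue)

    def bind(self, input_queue):
        """Bind this output queue to one more input queue."""
        buffer = deque()
        input_queue.sources.append(buffer)
        self.bindings.append((input_queue, buffer))

    def send(self, message):
        """Deliver to every bound input queue, preserving FIFO order per sender."""
        for input_queue, _ in self.bindings:
            if not isinstance(message, input_queue.accepted_type):
                raise TypeError("message type not accepted by input queue")
        for _, buffer in self.bindings:
            buffer.append(message)
```

Keeping a separate buffer per sender is what makes both guarantees easy to see: order within a buffer is the sender's FIFO order, and the merge alternates between buffers so that no sender is starved.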

Parallel computation is implemented by an appropriate binding between input and output queues. Processes may be mobile over the network, but only between sessions; they must be immobile within sessions. Sessions are instances of applications, implemented as networks of processes. Processes and sessions have specifications, which are precise definitions of their behaviours. Infospheres supports three kinds of specification: process specifications, interface specifications and session specifications.

Finding specific processes is done using standard web technology: processes are found by looking up the appropriate home page. Finding an appropriate process type is harder: the type specification must be clear enough to identify and compose processes; and an arbitration scheme must exist if the interfaces of two types do not match. Distributed objects are compared by checking to see if their interfaces match.

The target applications of the Infospheres project are collaborative applications, such as distributed design frameworks and distributed calendars. One of the features of the Infospheres system is its provision for long-lived collaborations through persistent components that may, when not actively undergoing computation, have their state serialised and stored on a persistent medium, to be de-serialised when needed again.

Resources in the Infospheres environment are strictly computational [191], and clients are expected to reserve resources in order to be able to use them. To facilitate this, clients' systems can send Java agents to remote machines in order to request a resource be reserved. Agents are directed to resource managers, which act as brokers for the resources that they control.

Legion

Legion [95] is a research project aimed at providing a highly usable, efficient and scalable system, based on solid object-oriented principles. The output of the project is a single, coherent virtual machine that addresses the issues of scalability, programming ease, fault tolerance, security and site autonomy, and has an extensible core [94]. Legion is similar to a number of other projects, namely Nexus [73], Castle [46], NOW [12] and Globe [118].

Legion achieves multiple language interfaces and interoperability through object wrappers for legacy codes, similar to those used by CORBA [177] and DCE [197]. As well as providing wrappers for legacy objects, the Legion designers sought widespread use by allowing the system to be the target of compilers. High performance is achieved via two methods: resource selection and parallel computation. Resource selection is performed using resource availability and affinity, an approach that extends those of Condor [153], DQS [91] and LoadLeveler [123].

Written in a parallel variant of C++ called Mentat [93, 92], Legion is an attempt to create a single nationwide metacomputer using loosely coupled workstations. Legion was designed to transparently schedule application components on processors, manage data transfer and coercion, and provide communication and synchronisation in such a manner as to minimise execution time via parallel execution of the application components. Legion is based on Mentat [92], an object-oriented parallel processing system.

It is intended that Legion be used almost completely for parallel applications. The concept of resource transparency is used, where the user (and application program) do not depend on a certain number or type of processor. Latency-tolerant, relatively large-grained parallelism is targeted. Object wrappers are provided for parallel components, and Legion supports parallel method invocation. The Legion runtime system is exposed as an 'open system', and has an associated message-passing API.

Memory within Legion is treated as a single, persistent object space [96]. Legion does not make resource allocation decisions, but provides the basic mechanisms needed to make informed mapping decisions between resource objects and carry out these mapping decisions. If an object needs to contact another object, and the target object does not already exist, it is created, as is described in [137].

Fault tolerance is addressed with the knowledge that in a truly distributed heterogeneous system, hardware and software failures will be routine, and that each physical resource and application will have its own concept of what is necessary. Knowing that writing programs to achieve fault tolerance is a difficult and error-prone task, Legion does not mandate policies, but instead applications can select the level of fault tolerance they require. Legion facilitates this by encapsulating fault tolerance protocols in base classes, which may be extended by users. Legion implements different levels of fault tolerance, depending on the penalty that the application-writer is willing to pay.

Scheduling of data parallel components in Legion is static, and is broken into three distinct phases: processor selection, load selection and placement. First, candidate processors are identified. Secondly, the number and type of processors to be used are determined and the data domain is decomposed. Lastly, tasks are mapped to the chosen processors so that communication time is reduced.
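The three phases can be made concrete with a deliberately simple sketch. It is not Legion's algorithm: the load threshold, the speed-proportional decomposition and the use of contiguous slices as a way of limiting communication are all assumptions, and the function name `schedule` is hypothetical.

```python
def schedule(processors, data_size, wanted, max_load=0.7):
    """Static three-phase schedule.

    processors: list of (name, load, speed) tuples.
    Returns {name: (start, end)} giving each chosen processor a slice of the data.
    """
    # Phase 1, processor selection: keep lightly loaded candidates.
    candidates = [p for p in processors if p[1] <= max_load]
    # Phase 2, load selection: choose how many and which processors to use,
    # and size each share of the data domain by relative speed.
    chosen = sorted(candidates, key=lambda p: p[2], reverse=True)[:wanted]
    total_speed = sum(p[2] for p in chosen)
    # Phase 3, placement: contiguous slices keep neighbouring data together,
    # a simple proxy for reducing communication between processors.
    placement, start = {}, 0
    for i, (name, _, speed) in enumerate(chosen):
        if i == len(chosen) - 1:
            end = data_size  # last processor absorbs any rounding remainder
        else:
            end = start + round(data_size * speed / total_speed)
        placement[name] = (start, end)
        start = end
    return placement
```

Because the schedule is static, all three phases run once, before execution starts, using whatever load figures are available at that moment.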

Objects create mappers and provide them with a description of the particular placement problem. The mapper then marshals the objects that will be involved and asks the system for a snapshot of the current system state (i.e. available resources). It then makes the appropriate decisions, based on a heuristic search of the solution space and passes the decision on to a module that tries to implement the decision, the implementor.

Mappers take a high-level specification of a placement problem and convert it into a list of possible placement decisions. Although Legion puts the onus of supplying a default mapper on to the application writer, if other mappers are available, the user can choose among them in order to personalise the selection process.

Legion also has the concept of a jurisdiction magistrate, which contains and enforces the local policies. It is this object which handles all placement failures, security breaches and querying of placement decisions.

Programs are represented in Mentat by directed acyclic graphs (DAGs), and parallel execution is based on a macro data flow model, where nodes in the graph represent operations (or actors) and edges denote dependence relations between them. In Mentat, future lists are sent out with actors, so that actors are aware of the nodes on which the output data depends. The data dependency is very fine-grained, operating at the instruction level.

Parallelism is encapsulated between objects by allowing subsequent actors to use the results of previous actors, which may not yet have been computed. Coherency of variables that are to be shared is maintained by enforcing single assignment of future variables. Thus, by the use of futures, parallelism opportunities are exploited without the danger of variables being modified between uses. The run-time system detects data dependencies and organises task scheduling.
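The role that futures play here can be illustrated by analogy with the futures in Python's standard `concurrent.futures` module. This is only an analogy and not Mentat code: Mentat is a C++ variant whose run-time system detects the dependencies itself, whereas in the sketch the dependencies are passed explicitly. The actor functions `square` and `add` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x):
    return x * x


def add(a_future, b_future):
    # The consuming actor blocks only at the point where it needs its inputs.
    return a_future.result() + b_future.result()


with ThreadPoolExecutor(max_workers=3) as pool:
    a = pool.submit(square, 3)      # result may not have been computed yet
    b = pool.submit(square, 4)      # independent of a, so may run in parallel
    total = pool.submit(add, a, b)  # consumes futures, not values
    answer = total.result()
```

Each future receives its value exactly once, which mirrors the single-assignment property described above: a consumer can never observe the value changing between uses.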

DOCT

Distributed Object Computation Testbed (DOCT) [200] is a collaborative project between the San Diego Supercomputer Center (SDSC), Caltech, NCSA, Old Dominion University, Open Text Corporation, Science Applications International Corporation, University of California at San Diego and the University of Virginia. DOCT is a large metacomputing framework being constructed using Legion and AppLeS.

Designed to manage a terabyte-sized persistent document handling system [162], DOCT includes HPSS [122], MDAS (from SDSC), Legion, the Network Weather Service used by AppLeS, and IBM's DB2 parallel object-relational database. The system features intelligent agents which will be able to perform many diverse actions. Intelligent agents use the Legion framework as a basis for distributed objects, and Nexus is being considered for distributed communications. Resource usage is coordinated by the AppLeS scheduling system.

WebFlow

 

WebFlow from Syracuse University is a metacomputing project which aims to produce a general-purpose Web-based visual interactive programming environment for coarse-grain distributed computing [26]. The project aims to provide coarser-grained units for Java distributed computing than the Java class. WebFlow is a programming paradigm implemented over WebVM. WebFlow is part of a larger project, WebSpace [174], which aims to create a Web-based collaborative environment.

WebFlow is implemented over a mesh of WebVM servers. WebVM servers are implemented using Jeeves [219], which is a Java Web server. WebFlow allows modules (or servlets, atomic encapsulations of WebVM computation) to be run on demand, supports communication between modules, and allows users to create and destroy applications (sets of interconnected modules).

WebFlow is very similar to Infospheres in its support for code modules, communication between modules, and sessions. In both systems modules communicate by the use of ports. Sessions define the application that a user creates using modules. There is very little information available on the development and internal workings of the WebFlow system.

DISCWorld

 

Distributed Information Systems Control World (DISCWorld) [113] is a prototype metacomputing model and system being developed at the University of Adelaide. The project was started in 1996. The basic unit of execution in DISCWorld is a service. Services are either pre-written pieces of Java software that adhere to an API, or legacy code that has been provided with a Java wrapper. Users can compose a number of services together to form a complex processing request.

The DISCWorld architecture consists of a number of peer-based computer hosts that participate in DISCWorld by running either a DISCWorld server daemon program or a DISCWorld-compliant client program. DISCWorld client programs can be constructed using Java wrappers to existing legacy programs, or can take the form of a special DISCWorld client environment which runs as a Java applet inside a WWW browser. This client environment can itself contain Java applet programs (Java Beans) [58] which can act as client programs communicating with the network of servers. The peer-based nature of DISCWorld clients and servers means that these servers can be clients of one another for carrying out particular jobs, and can broker or trade services amongst one another. Jobs can be scheduled [128] across the participating nodes. A DISCWorld server instance is potentially capable of providing any of the portable services any other DISCWorld server provides, but will typically be customised by the local administrator to specialise in a chosen subset, suited to local resources and needs. The nature of the DISCWorld architecture means that we use the terms client and server somewhat loosely, since these terms best refer to a temporary relationship between two running programs rather than a fixed relationship between two host platforms.

DISCWorld is targeted at wide-area systems and applications where it is worthwhile or necessary to run them over wide areas. These will typically be applications that require access to large specialist datasets stored by custodians at different geographically separated sites. An example is a land planning application [43], where a client application requires access to land titles information at one site, digital terrain map data at another, and aerial photography or satellite imagery stored at a third site. A human decision-maker may be seated at a low performance compute platform running a WWW browser environment, but can pose queries and data processing transactions to a network of DISCWorld connected servers to extract decisions from these potentially very large datasets without having to download them to their own site. This arrangement is illustrated in the conceptual view of the DISCWorld architecture shown below.

Figure 6: Conceptual view of DISCWorld architecture. A number of cooperating server hosts communicate and share work, brokered by the DISCWorld daemon. Each is capable of acting as a gateway for client programs running on server hosts or as specialised graphical interface clients.

Figure 7: Detailed component breakdown of DISCWorld daemon. Code modules are represented by boxes; modules executing in separate threads are shown.

Our vision for DISCWorld is an integrated query-based environment where users may connect to a ``cloud'' of high performance computing (HPC) resources (where the heterogeneity and exact composition of the cloud is hidden from the user) and request data retrieval and processing operations. Users themselves may only be connected into the cloud by a low bandwidth network link such as that provided by a modem line. This action-at-a-distance query-based approach appears to be an appropriate one for decision support applications, where the user is provided with a collection of application components that run in situ in a WWW browser and help them control remote HPC resources. Much of our research to date has considered the multi-threaded software daemon (DWd) that runs on each participating DISCWorld service-providing host.

DISCWorld is aimed at providing high-performance services with which non-specialist scientists can compose large, complex and challenging problems to be solved over a distributed collection of possibly heterogeneous computers. One of the design goals of the DISCWorld metacomputing infrastructure is that any parallelism that can be extracted from the problem the user has posed will be exploited transparently, without the user's involvement. This approach differs somewhat from Infospheres, the only other research project that has a service- or component-based approach, in that Infospheres is designed to facilitate collaboration between users, and to allow programmers to design and build distributed system components.

When user queries are submitted to the DISCWorld system, they are decomposed into services; services are the items which are scheduled in the DISCWorld. Data and services may be moved to the host at which the least cost is found. This thesis uses DISCWorld as an implementation framework.
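The idea of moving a service to the host of least cost can be illustrated with a minimal sketch. The cost model below is an assumption made purely for illustration (estimated compute time plus the time to transfer the data if it resides elsewhere); the names `Host`, `placement_cost` and `cheapest_host` are hypothetical and are not part of DISCWorld.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    name: str
    speed: float  # assumed relative compute speed, in units of work per second


def placement_cost(host, work, data_size, data_host, bandwidth):
    """Estimated seconds to run a service on `host` (illustrative model only).

    Cost = compute time on the host + time to move the data there,
    where the transfer term is zero if the data already lives on the host.
    """
    compute = work / host.speed
    transfer = 0.0 if host.name == data_host else data_size / bandwidth
    return compute + transfer


def cheapest_host(hosts, work, data_size, data_host, bandwidth):
    """Pick the host with the lowest estimated cost for this service."""
    return min(
        hosts,
        key=lambda h: placement_cost(h, work, data_size, data_host, bandwidth),
    )


hosts = [Host("custodian", speed=1.0), Host("hpc", speed=4.0)]

# Slow link: cheaper to run the service where the data already is.
slow = cheapest_host(hosts, work=100, data_size=1000, data_host="custodian", bandwidth=10)

# Fast link: cheaper to ship the data to the faster machine.
fast = cheapest_host(hosts, work=100, data_size=1000, data_host="custodian", bandwidth=100)
```

Under these assumed numbers the same service is placed on the data custodian's host when the link is slow and on the faster host when the link is fast, which is the trade-off between moving data and moving services.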

Discussion and Comparison of Metacomputing Systems

 

DISCWorld has a constrained objective, with less ambitious goals than the Globus and Legion systems. The notable difference between these systems and DISCWorld is that most users of our system will not be programmers, per se, but will be merely interested in the output of a service request.

DISCWorld is a high-level, object-oriented system that uses Java as an implementation vehicle. As DISCWorld does not directly address fine-grained parallelism, resources do not need to be co-scheduled as in Globus, but services are expected to run in a reasonable amount of time. In addition to scheduling frameworks, some metacomputing approaches favour reserving the resource to be used [72, ].

Figure 8: Metacomputing systems, middleware tools and resource manager systems integration with user processes.

The DISCWorld project addresses the issue of constrained metacomputing by allowing only restricted datasets and pre-defined operations on data. As such, all functions that can be applied to data will be written by developers; end users will not have to concern themselves with the implementation details. This means that the services a user can request are constrained to those that have been defined and written for the DISCWorld system. This is in contrast to unconstrained systems such as Globus, where users can submit arbitrary binaries for execution.
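A minimal sketch of this constraint is given below, assuming a simple in-memory registry. The class `ServiceRegistry` and the service name `count_pixels` are hypothetical and chosen only for illustration; they do not describe the DISCWorld implementation.

```python
class ServiceRegistry:
    """Holds the developer-written services that users are allowed to request."""

    def __init__(self):
        self._services = {}

    def register(self, name, implementation):
        # In this sketch only developers call register(); end users never supply code.
        self._services[name] = implementation

    def available(self):
        """Names a user may request, in sorted order."""
        return sorted(self._services)

    def request(self, name, *args):
        # A request for anything that has not been pre-defined is refused.
        if name not in self._services:
            raise LookupError(f"no such service: {name!r}")
        return self._services[name](*args)


registry = ServiceRegistry()
registry.register("count_pixels", len)  # stand-in for a real processing service

result = registry.request("count_pixels", [0, 1, 1, 0])

try:
    registry.request("run_my_binary", b"\x7fELF")
    refused = False
except LookupError:
    refused = True
```

The point of the sketch is that the user names a service rather than supplying code, so every operation the system runs is one whose behaviour the developers could characterise in advance.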

We believe that being constrained is an important characteristic of a metacomputing system for a number of fundamental reasons. Allowing users to submit arbitrary binaries is a security hazard, both because of the risk of malicious attacks and because of the lack of control over what information is collected and disseminated by user programs. We are not able to mandate that all users submit source- or byte-code that can be analysed for performance characteristics or malicious code. Therefore, if users can submit their own binaries, the task of characterising programs that may be run only once or twice, in order to make intelligent scheduling decisions, is very difficult. In addition to being easier to characterise effectively, pre-written services can be written with knowledge of the DISCWorld, so that they take advantage of DISCWorld features and exploit any available parallelism inherent in the high-level processing request.

Services share similarities with Infospheres components, especially in the way in which services are characterised. As mentioned in [37], the services are created using the methodology of component technology; services themselves are opaque, presenting the user with an interface specification, but giving no clues as to the service's implementation; they can have dynamic interfaces and undergo dynamic composition; and services can be selected from a world-wide pool. Services must either be written in a platform-independent way which supports reflection, or at the very least, a wrapper interface to the code must be supplied. Thus, it is up to the service writer to ensure that the services they make available to the system can be queried in a standard, platform independent way, even if the code that is actually executed is architecture-specific.
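The two options described above, reflection or a supplied wrapper interface, can be sketched as follows. This is an illustration only: Python's `inspect` module stands in for whatever reflection mechanism a platform-independent service would use, and `describe_service`, `NativeWrapper` and the example services are hypothetical names.

```python
import inspect


def describe_service(service):
    """Return an interface description without exposing the implementation."""
    declared = getattr(service, "interface", None)
    if declared is not None:
        # A wrapper supplied an explicit interface description.
        return dict(declared)
    # Otherwise fall back to reflection on the callable itself.
    signature = inspect.signature(service)
    return {"name": service.__name__, "inputs": list(signature.parameters)}


class NativeWrapper:
    """Hypothetical wrapper around architecture-specific code.

    The wrapped code cannot be inspected, so the service writer declares
    the interface by hand.
    """

    def __init__(self, name, inputs, run):
        self.interface = {"name": name, "inputs": list(inputs)}
        self._run = run

    def __call__(self, *args):
        return self._run(*args)


def crop(image, bounding_box):
    """A reflectable, platform-independent service (stand-in implementation)."""
    left, right = bounding_box
    return image[left:right]


# Stand-in for a call into architecture-specific code.
fft = NativeWrapper("fft", ["signal"], run=lambda signal: list(reversed(signal)))

crop_description = describe_service(crop)
fft_description = describe_service(fft)
```

Both services answer the same query in the same form, even though only one of them can be inspected directly.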

Most of the metacomputing systems discussed in this chapter, with the notable exception of Globus, are tightly integrated with the code that runs as a result of a user query. Tables gif and gif show the characteristics of the systems described in the previous section. The integration of each system with user processes is represented pictorially in figure gif. Systems that define a resource as either an object or an applet have user processes that are tightly integrated with the system. This typically allows more information to be collected by the system in order to make sensible object-placement or scheduling decisions. In none of the systems is the user able to directly control the placement of their processes; in virtually all cases, the system can make a better decision than the user.

Harness is similar in design philosophy to DISCWorld. Both projects present a constrained metacomputing environment. The main difference is that DISCWorld services are written by developers, and users simply use the services, while in Harness, the user is responsible for supplying their own programs. Harness and DISCWorld both recognise the need to be able to move service code and data. Harness does not perform any organised scheduling of service execution; the environment uses the nominated hosts in a manner very similar to that found in PVM.

Objects within the Legion system have the concept of sovereignty, which can restrict copying and movement. This is similar to the DISCWorld concept in which services and code may be restricted in movement.

Table 1: Summary of metacomputing system characteristics

Table 2: Summary of middleware tools characteristics

Conclusion

 

A cluster has been defined as a group of machines that may be viewed as a single entity for the purposes of control and job assignment. In this chapter the topic of cluster control has been addressed from two viewpoints: that of resource management, and of metacomputing. There are fundamental differences between the objectives of resource management software and metacomputing environments.

Resource management software is designed to provide all users with fair access to the machines that comprise the cluster, while at the same time ensuring that the machines are efficiently utilised. Metacomputing environments, on the other hand, are designed to allow access to remote resources, where the term resources can mean any of remote machines, data, processing services, or specialised hardware such as visualisation equipment.

We have presented summaries of a number of metacomputing projects, and have compared them with our prototype metacomputing environment, DISCWorld. The conclusion of this chapter is that although solutions to the resource management and metacomputing problems do exist, none of the solutions is directly applicable to the case in which a user does not have a great deal of knowledge about the system they are using.

The system that currently provides the most comprehensive metacomputing environment is Globus. However, this system requires users to supply binary programs to achieve their goals, and is aimed toward the execution of parallel programs. We feel that in order to make metacomputing more accessible to users who wish to pose high-level queries, the details of the system, especially how the programs they use work, must be hidden from them.

In systems targeted towards fine-grained parallelism, scheduling or process placement is a matter of co-scheduling available resources. The DISCWorld model features high-level services in a producer-consumer relationship, where the results of one service are available to be used as inputs to another. We do not address fine-grained parallelism directly; services can be implemented as parallel code, but the service interface presented to the DISCWorld is identical to that of the same service implemented as serial code. The task of scheduling DISCWorld services and data access is the focus of this thesis.
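The producer-consumer relationship, and the identical interface of serial and parallel implementations, can be sketched as below. This is an illustrative assumption rather than DISCWorld code: the services, `run_chain`, and the use of a thread pool as the "parallel" implementation are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def square_serial(values):
    """Serial implementation of a service: square every input value."""
    return [v * v for v in values]


def square_parallel(values):
    """Parallel implementation of the same service, with the same interface."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        # Executor.map returns results in input order, matching the serial version.
        return list(pool.map(lambda v: v * v, values))


def total(values):
    """A consumer service that takes the producer's output as its input."""
    return sum(values)


def run_chain(services, data):
    """Feed the result of each service into the next one in the chain."""
    for service in services:
        data = service(data)
    return data


serial_result = run_chain([square_serial, total], [1, 2, 3])
parallel_result = run_chain([square_parallel, total], [1, 2, 3])
```

The chain neither knows nor cares which implementation of the producer it was given, so a scheduler is free to substitute one for the other.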

In the next chapter, we review some of the research concerning the placement of processes (jobs) onto nodes in a distributed system. In chapter gif we present a simulation of placing and scheduling independent jobs across distributed heterogeneous processors. Chapter gif introduces a general model that is used for scheduling in the prototype DISCWorld metacomputing system, and chapter gif discusses implementation issues.



Heath A. James, heath@cs.adelaide.edu.au