K.A.Hawick, A.L.Brown, P.D.Coddington, J.F.Hercus, H.A.James,
K.E.Kerry, K.J.Maciunas, J.A.Mathew, C.J.Patten, A.J.Silis, F.A.Vaughan
Department of Computer Science, University of Adelaide,
SA 5005, Australia
January 1998
We present some software architecture ideas for a distributed processing control and storage management environment, specifically for providing high-level services across wide area networks of high-performance computing resources. We describe the driving applications for our system, such as geographic information systems and digital libraries, and the particular design criteria they require. We present an overview of our software architecture and discuss: scalable architectural design through peer-to-peer protocols; the execution paradigm, resource description, job scheduling and mapping; the specification and interface management of high level services; multiple networks for control and bulk data transfer; legacy application support mechanisms; distributed data repository design, data layout and compression support; user query support and cached data product support.
This document discusses the general problem of realising a practical, distributed, high-performance computing system for a broad spectrum of data-intensive applications. It expands on ideas in constrained metacomputing systems that have emerged from recent applications of distributed and high-performance computing systems. In particular, our applications of distributed geographic information systems [2, 3] indicate a strong need for a consistent WWW-based interface to distributed, high-performance computing resources [1].
We describe the overall model for our Distributed Information System Control World (DISCWorld) [4, 13] and for user and inter-system transactions. We focus on the software architecture of our system in section 3. We also give examples of typical driving applications in section 2. The DISCWorld is strongly oriented towards user access through the World Wide Web (WWW).
Consider the examples of users of environmental or geographic information systems (GIS) and, more generally, digital libraries. GIS users typically manipulate large spatial and temporal data sets to extract information and, ultimately, decision support knowledge. This process might involve selecting particular spatial and temporal data from large sets held by multiple sources, such as repositories for various remotely sensed data or other planning databases. Images need to be pre-processed, for example to enhance contrast; multispectral data needs to be combined; multi-source data needs to be fused; and ultimately some significant data reduction must be performed to provide what is often a scalar correlation quantity, or a list of locations or times at which a particular set of interesting conditions occurs.
A complex job like this is currently carried out largely through manual use of low-performance computing resources, such as desktop PCs, with data supplied on tape. This is a significant impediment to the effective use of advanced processing techniques and platforms. The whole process could be substantially enhanced by providing a framework that connects data owners with consumers, and gives both access to high-performance computational resources. Complex jobs can be decomposed into simple jobs that are well known a priori. Resources can be shared in a brokered fashion. A data exchange framework allows data to flow between bulk storage and bulk processing before it is reduced to the products desired by the end user. High-performance computing components can be connected by broadband networks, while final product delivery to the user can be over low-bandwidth resources.
In summary, driving applications for DISCWorld involve: large data sets, potentially at more than one remote site; timely access requirements, possibly in near-interactive time; and usually more processing than could easily be run on a WWW browser platform. We are initially targeting applications that can be formulated as query-based operations and decomposed in terms of well-defined services. New DISCWorld-compliant wrappers may need to be built for some applications, but the core processing is in terms of straightforward processing modules, which may be legacy codes.
Consider a detailed example of the DISCWorld in action. A user wishes to compute a measured property on a sequence of data, such as percentage cloud cover over a specified geographic subregion. The user connects to a point of service on the WWW, identifying him/herself by email address and some multikey/password arrangement. The user makes a request, which is broken down into an identifiable sequence of application services. Each of these comes with an approximate cost model, which may have parameters associated with the target system on which it is run, as well as time of day, other loads, and users' previous requests. A cost estimate and a set of choices are offered to the user, who can choose an option and commit the system to the request. The service providers carry out their allotted tasks in an appropriately managed sequence, and the result is either transferred back to the user or deferred for later delivery on demand. An appropriate degree of cost recording and billing, as well as interactive monitoring, are services in their own right and are provided, possibly only to specially privileged users of the system. This sequence of events can be summarised as follows:
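A minimal sketch of the cost-estimation and option-selection steps in this sequence is given here. It is illustrative only: the `Service` and `Host` classes, the cost formula, and all names and numbers are assumptions made for this example and are not part of the DISCWorld design.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    base_cost: float  # hypothetical cost units on an unloaded reference host


@dataclass
class Host:
    name: str
    speed: float  # relative speed; larger is faster
    load: float   # fraction of the host already busy, 0.0 <= load < 1.0


def estimate_cost(service, host):
    """Approximate cost of one service on one host, scaled by speed and load."""
    return service.base_cost / (host.speed * (1.0 - host.load))


def offer_options(services, hosts):
    """Return (host name, total estimated cost) pairs, cheapest first."""
    options = [(h.name, sum(estimate_cost(s, h) for s in services)) for h in hosts]
    return sorted(options, key=lambda option: option[1])


# A request already broken down into a sequence of named services.
request = [Service("select_region", 2.0),
           Service("cloud_mask", 10.0),
           Service("percent_cover", 1.0)]
hosts = [Host("workstation", speed=1.0, load=0.0),
         Host("parallel_machine", speed=8.0, load=0.5)]

options = offer_options(request, hosts)
chosen_host, chosen_cost = options[0]  # here the user commits to the cheapest option
```

A fuller cost model would also take time of day and previous requests into account, as described above; this sketch uses host speed and current load only.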
The architectural model adopted for the DISCWorld is that of a set of communicating peers, which provide mutual services to one another and each of which controls a local computational resource. A number of software technologies exist to implement such a model; we are investigating combinations of Java and CORBA ORBs at present. In general, however, each node in the system acts as a user access point, a supplier of services, and a negotiator for services from remote nodes. This symmetric model provides some degree of scalability and universality.

Figure 1: DISCWorld Architecture Overview.
Figure 1 shows the DISCWorld architectural overview and key features. DISCWorld daemons are the communications points which act as the interface for control information between DISCWorld nodes as well as access by WWW users. The architecture provides for effectively separate networks for control information and bulk data transfer. There is encapsulation of applications and their control by the daemons and also of legacy applications using an NFS-like environment. Software bus adapters provide a bulk data transfer mechanism to interoperate between different protocols and across firewalls. Some nodes may provide hierarchical file-storage facilities.
A strongly desirable aspect of DISCWorld is that it comprise a single software entity that is run on a hardware compute resource to enable it to participate in DISCWorld. This multi-threaded software daemon handles user service requests through a port; negotiation and control exchange with other DISCWorld daemons; and management of services (jobs) on its own host [11]. DISCWorld daemons operate at a peer-to-peer level in general, but become promoted to deal with a particular user job request which they manage and own until it completes or is revoked.
The execution paradigm for DISCWorld is that user queries will be decomposed into sub-queries that can be satisfied by execution of well-defined services. These services are named and known a priori, and consist of application modules with an appropriate DISCWorld-compliant wrapper around them. Alternatively, they may be custom written for DISCWorld and might be, for example, JavaBeans or possibly CORBA-compliant services. The problems that underpin this part of DISCWorld are those of naming and describing resources, including services [7].
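The idea of decomposing a query into named, a priori known services can be sketched as follows. This is a hedged illustration in Python rather than Java or CORBA; the `registry`, the `register` decorator, `run_plan`, and the two toy services are all hypothetical names introduced for this example.

```python
registry = {}  # maps a service name to its wrapped implementation


def register(name):
    """Wrap an application module so it can be looked up by name."""
    def decorator(func):
        registry[name] = func
        return func
    return decorator


@register("select_region")
def select_region(pixels, lo, hi):
    # Stand-in for spatial selection: keep a slice of the data.
    return pixels[lo:hi]


@register("percent_above")
def percent_above(pixels, threshold):
    # Data reduction: percentage of values above a threshold.
    return 100.0 * sum(1 for p in pixels if p > threshold) / len(pixels)


def run_plan(plan, data):
    """Execute a decomposed query: a sequence of (service name, parameters)."""
    for name, kwargs in plan:
        if name not in registry:
            raise KeyError(f"unknown service: {name}")
        data = registry[name](data, **kwargs)
    return data


plan = [("select_region", {"lo": 2, "hi": 6}),
        ("percent_above", {"threshold": 0.5})]
result = run_plan(plan, [0.1, 0.9, 0.8, 0.2, 0.7, 0.6, 0.3, 0.4])
```

Because every step is referred to by name only, a plan of this kind could in principle be shipped to whichever node offers the named services.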
We believe that the DISCWorld model is more tractable than other metacomputing system development projects such as Globus or Legion, primarily because it does not try to solve the general distributed computing problem, but instead focuses on linking together well-defined high-granularity services. High granularity implies a specific decision to cater for job components that are fairly computationally intensive and involve moving around fairly large data sets. This allows some margin in amortising overheads, including communications latencies. Key problems underlying the successful management of the services are to specify them in such a way that they can be matched with queries, and to provide mechanisms to add and withdraw services from particular platforms according to system management policies. It is also an interesting problem to link service components together using appropriate abstract data types or interface data layers. We partially solve the naming problem by specifying the conventions that a collection of services must conform to [8] and by developing the initial set of prototype services in a flat namespace.
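The amortisation argument for high-granularity services can be made concrete with a small calculation. The function below and the numbers used are assumptions chosen purely for illustration; they are not measurements from any DISCWorld prototype.

```python
def useful_fraction(compute_seconds, overhead_seconds):
    """Fraction of total elapsed time spent on useful computation."""
    return compute_seconds / (compute_seconds + overhead_seconds)


OVERHEAD = 2.0  # hypothetical fixed cost per service call: negotiation plus latency

fine_grained = useful_fraction(0.5, OVERHEAD)      # a small algorithmic component
coarse_grained = useful_fraction(198.0, OVERHEAD)  # an entire user program
```

With these assumed figures the fine-grained call spends about 20% of its time on useful work, while the coarse-grained job spends about 99%, which is the margin that high granularity is intended to provide.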
Development of the DISCWorld arose primarily through dissatisfaction with the performance of other systems and their failure to usefully incorporate high-performance computing and communications technologies. A key idea behind DISCWorld is therefore to provide a smart middleware layer that will make the best use of whatever resources for computation, storage and communications are available for a given transaction. One aspect of this is to provide notional control and bulk data communications protocols that can be implemented separately, to make use of any high-bandwidth communications channels where available [12]. Not all nodes need use the bulk transfer mechanism, however, if they are only exchanging lightweight data objects.
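One simple way to picture the separation of control and bulk data traffic is a per-transfer channel choice. The threshold value and the function name below are hypothetical; the actual protocol design is discussed in [12].

```python
BULK_THRESHOLD_BYTES = 1_000_000  # hypothetical cut-over size, not from the paper


def choose_channel(payload_bytes, bulk_available):
    """Pick the bulk channel only for large payloads on nodes that offer it."""
    if bulk_available and payload_bytes >= BULK_THRESHOLD_BYTES:
        return "bulk"
    return "control"


small_object = choose_channel(2_048, bulk_available=True)
large_image = choose_channel(500_000_000, bulk_available=True)
no_bulk_link = choose_channel(500_000_000, bulk_available=False)
```

Nodes that exchange only lightweight objects would always fall through to the control channel in this sketch.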
To be useful, the DISCWorld environment must be able to incorporate legacy application codes to provide its service base. A significant problem is that many codes are only available in object form - this is often the result of a commercially licensed product running on a single seat. If a level of indirection can be incorporated at the system level, it is possible for the DISCWorld to support legacy codes by managing pseudo file handles. This can provide a support environment for legacy codes that can provide transparent distributed access to data [10].
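The pseudo file handle idea can be sketched as a table that hands out handles and hides where the data actually comes from. This is a user-level illustration only, under the assumption that data can be fetched through a callable; the `HandleTable` class and `fetch_remote_tile` are hypothetical, and the system-level mechanism is the subject of [10].

```python
import io


class HandleTable:
    """Maps pseudo file handles to data streams, wherever the data lives."""

    def __init__(self):
        self._streams = {}
        self._next = 3  # start after the conventional stdin/stdout/stderr numbers

    def open(self, fetch):
        """Register a data source; `fetch` returns the bytes to be served."""
        handle = self._next
        self._next += 1
        self._streams[handle] = io.BytesIO(fetch())
        return handle

    def read(self, handle, size):
        return self._streams[handle].read(size)

    def close(self, handle):
        self._streams.pop(handle).close()


def fetch_remote_tile():
    # Stand-in for a transfer from a remote node.
    return b"remote satellite tile"


table = HandleTable()
h = table.open(fetch_remote_tile)
first = table.read(h, 6)
rest = table.read(h, 100)
table.close(h)
```

The calling code sees only `open`, `read` and `close` on an integer handle, so it does not need to know whether the bytes were local or remote.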
To support application services that transact large data sets, it is important that DISCWorld interface to bulk data storage devices, such as mass stores or hierarchical file-stores (HFS), over distributed networks. A separate storage manager component can act as a broker between DISCWorld nodes that have control over storage resources. These resources might range from bulk disk stores, in the form of RAID or other disk arrays, to deep mass storage devices such as large tape-silo libraries. A number of techniques are available to optimise bulk data storage and transmission management [5]. Another issue that affects performance as well as the software architecture is the management of auxiliary data or metadata. This may be metadata in the form of a database schema, allowing conventional database techniques to be used to manipulate the data sets prior to committing to a bulk data movement. It is also advantageous to maintain a database of auxiliary properties, which may be scalars or other reduced quantities derived from previous user queries or other applications run on the bulk data sets [9]. These properties may be lazily evaluated initially, and subsequently stored for later queries.
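Lazy evaluation of auxiliary properties, with storage for later queries, can be sketched as follows. The `PropertyStore` class and the example data set name are assumptions for illustration; a real system would hold these values in a database as described in [9].

```python
class PropertyStore:
    """Holds reduced quantities derived from bulk data sets."""

    def __init__(self):
        self._values = {}
        self.evaluations = 0  # counts how often the bulk data was actually touched

    def get(self, dataset, prop, compute):
        """Return a stored property, computing and storing it on first request."""
        key = (dataset, prop)
        if key not in self._values:
            self._values[key] = compute()
            self.evaluations += 1
        return self._values[key]


bulk_data = [4.0, 8.0, 6.0, 2.0]  # stand-in for a large data set
store = PropertyStore()


def mean_value():
    return sum(bulk_data) / len(bulk_data)


first_answer = store.get("image42", "mean", mean_value)   # evaluated now
second_answer = store.get("image42", "mean", mean_value)  # served from the store
```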
At our present level of design for the DISCWorld system we have not distinguished strongly between user queries and sub-queries or the queries that are used to transact between running programs. We believe a single textual representation language for expressing queries is likely to be useful. It is not yet clear if SQL and/or a general Boolean Query Syntax is adequate for specifying queries [9] as well as job placement and other resource mapping information [7].
A useful technique for improving the query response performance perceived by users is to find smart ways to cache both final delivered products and the intermediate results used to derive them. This is viable only because we envisage certain frequently occurring patterns in DISCWorld queries, and because the system is geared around a limited number of well-defined operations and data entities. We have investigated the benefits of caching simple data products in a WWW-based data browser [6]. The central problem is that of managing the namespace of final and intermediate data entities. We are investigating better ways to manage communicating databases of these names [9].
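One possible way to name final and intermediate products, so that repeated queries hit the same cache entry, is to build each name from the service, its input names, and its sorted parameters. The naming scheme, `product_name`, and `cached` below are hypothetical and are offered only as a sketch of the namespace problem.

```python
def product_name(service, inputs, **params):
    """Build a canonical name from a service, its input names and its parameters."""
    args = list(inputs) + [f"{key}={params[key]}" for key in sorted(params)]
    return f"{service}({';'.join(args)})"


cache = {}


def cached(name, compute):
    """Return the product stored under `name`, computing it at most once."""
    if name not in cache:
        cache[name] = compute()
    return cache[name]


# An intermediate product and a final product derived from it.
region = product_name("select_region", ["image42"], lo=2, hi=6)
cover = product_name("percent_above", [region], threshold=0.5)

calls = []
value_1 = cached(cover, lambda: calls.append("run") or 75.0)
value_2 = cached(cover, lambda: calls.append("run") or 75.0)
```

Sorting the parameters means that the same request written with its parameters in a different order maps to the same name, and hence to the same cached product.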
We have described some of the major ideas for our DISCWorld distributed computing environment. A number of issues remain to be properly addressed, but we believe we can proceed to construction of a series of partial prototypes making appropriate compromises.
A user interaction consists of a successful application by a user for a particular named service. There may be some degree of sophisticated interaction, in which the system identifies the named sequence of services from a fuzzy request and breaks it down into a number of negotiable tasks. These tasks are scheduled, or requested to be run, on a particular combination of local and remote resources, and are tagged with the user's identification. A suitable authorisation mechanism is used to ensure user access integrity at each stage of resource commitment.
The problem is made more tractable by choosing a relatively high, applications-oriented granularity of service. The services are typically entire user programs rather than very low-level algorithmic or systems-level components. This perhaps increases the richness of the namespace required to describe the services, but lowers the number of services required to carry out some demonstrably useful action.
We can simplify the architecture initially, by recognising that a great deal of scientific work is carried out in relatively tightly coupled communities where there are relatively few groups working in a particular niche, worldwide. An approach similar to the Internet domain naming one can be adopted until a more appropriate mechanism for carving up namespace can be identified.
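A domain-style namespace could be resolved by delegating each name to the node registered for its longest matching prefix. The dotted names, node labels, and `find_authority` function below are hypothetical; they illustrate the delegation idea only and are not a proposed DISCWorld naming scheme.

```python
def find_authority(service_name, authorities):
    """Return the node registered for the longest matching dotted prefix."""
    parts = service_name.split(".")
    for end in range(len(parts), 0, -1):
        prefix = ".".join(parts[:end])
        if prefix in authorities:
            return authorities[prefix]
    raise LookupError(f"no authority for {service_name}")


# Names are written most-general-first; each community manages its own subtree.
authorities = {"gis": "node-a", "gis.cloud": "node-b"}

cloud_owner = find_authority("gis.cloud.percent_cover", authorities)
fusion_owner = find_authority("gis.fusion.merge", authorities)
```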
DISCWorld is a development of the Distributed High Performance Computing Infrastructure (DHPC-I) project of the Research Data Networks Cooperative Research Center (RDN CRC) and is managed under the On-Line Data Archives Program of the Advanced Computational Systems CRC. RDN and ACSys are established under the Australian Government's CRC Program.
khawick@cs.adelaide.edu.au Thu Jan 22 17:13:06 CST 1998