Next: Conclusions and Future Work Up: Scheduling in Metacomputing Systems Previous: Implementing Scheduling in DISCWorld

DISCWorld Performance Analysis

 

The services that are presented for performance measurement perform the same functions as those found in the ERIC prototype [129] (see the appendix describing ERIC). As such, a direct comparison may be made between the performance of the services using the DISCWorld daemon and ERIC. Both ERIC and the implemented services provide remote access to a repository of GMS satellite [133] imagery. In addition to providing browsing access, they allow processing operations to be invoked on the stored images.

The main difference between the two approaches is that the ERIC program involves processing on the server-side only. The program is implemented as a Perl common gateway interface (cgi-bin) [100] script. The script interfaces to user-level command-line programs which communicate through shared file systems and pipes. No distribution of the command-line programs is possible unless it is explicitly incorporated by the author of the script. The user composes a query using an HTML forms interface, and the result is returned as another HTML page. Although ERIC implements basic caching, only resultant images, metadata and MPEGs [134] are stored. No partial results are stored, and final results cannot be used as inputs for further processing. Other limitations of the ERIC prototype are discussed in the appendix describing ERIC.

In contrast, the DISCWorld daemon allows processing to be performed at both the client-side and server-side. For example, the scheduling placement algorithm may decide that simple imagery operations can be more efficiently performed at the client-side than the server-side. Services can be written as pure Java or may be implemented as a wrapper around a native program (using either JNI [148] or system calls). Objects can be sent between nodes. Consequently, there is no need for common file-systems. The DISCWorld daemon stores all results of services, whether they are the final results of the user's processing request or the results of intermediate computations. Intermediate results can be used in subsequent processing requests since the DISCWorld model assigns canonical names to all objects.
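The role of canonical names in re-using intermediate results can be illustrated with a minimal sketch. The class ResultStore, its methods, and the naming format (service name followed by its ordered parameters) are assumptions made for illustration only; they are not the DISCWorld daemon's actual interfaces.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Hypothetical store that caches every service result under a canonical name. */
final class ResultStore {
    private final Map<String, Object> store = new HashMap<>();
    private int computations = 0; // number of times a service actually ran

    /** Assumed naming scheme: service name followed by its ordered parameters. */
    static String canonicalName(String service, String... params) {
        return service + "(" + String.join(",", params) + ")";
    }

    /** Returns the cached object if present; otherwise runs the service and caches the result. */
    synchronized Object getOrCompute(String name, Supplier<Object> service) {
        Object cached = store.get(name);
        if (cached == null) {
            cached = service.get();
            computations++;
            store.put(name, cached);
        }
        return cached;
    }

    synchronized int computations() {
        return computations;
    }
}
```

Because the name is derived only from the service and its inputs, a later request for the same product resolves to the stored object, whether that object was a final or an intermediate result.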

Example Services

For performance analysis, we focus on three example services: findGMSImage, processGMSImage, and createMPEG. FindGMSImage is very similar to one of the fundamental operations found in ERIC: it invokes a shell-script to retrieve an image from the spatial imagery archive. Once found, the image is loaded into the DISCWorld daemon using services built from the methods in the Java Advanced Imaging (JAI) [217] API. The image is then accessible via its DRAM. The input parameters to the service are the date and time at which the image was created, and its spectral channel. This service invokes a native program, and it is marked as ``not movable'' (as described in an earlier chapter); it cannot be transferred between nodes.

ProcessGMSImage uses services from the JAI API to crop and scale the GMS image. It performs the same imagery operations as found in ERIC. The inputs to the service are the GMS image, the area to be cropped, and the zoom scale of the resultant image. As this service is written completely in Java, it can be transferred between nodes.

CreateMPEG is a service that creates MPEG animations from sequences of images. This service invokes the mpeg_encode [87] program as a system call. Due to the way in which the program is implemented, it is necessary to write the images and an associated parameter file to a temporary disk area. The program also writes the newly-created MPEG to the disk, where it is ingested into the system using services from the JAI API. Like the findGMSImage service, the createMPEG service is deemed ``not movable''.

Three computers were used for testing the performance of the DISCWorld daemon: a dual-processor Sun Enterprise 250 (called lerwick) with 256MB of physical and 300MB of virtual memory, running Solaris 2.6 and Java 1.2.1; a dual-processor Celeron (called banff) with 256MB of physical and 1GB of virtual memory, running Solaris 2.6 and Java 1.2.1; and a single-processor Pentium II (called geronimo) with 64MB of physical and 64MB of virtual memory, running Windows NT 4.0 and Java 1.2. Lerwick and banff are connected via a 100Mbps network; geronimo is connected via a 10Mbps network.

Table 8: Comparison of overheads when individual services are executed in different environments. The execution times of the services implemented as command-line programs are measured so that the overheads of the Java system can be determined. The times taken for a pre-computed result to be retrieved from the local daemon, and from a remote daemon via a slow 10Mbps link, are also shown. Measurements of the ERIC system, from [127], are shown for comparison. All times are based on at least 10 measurements. Variances are based on a least-squares linear fit of the measured values. The timing error is that of Java's currentTimeMillis method, which is not guaranteed to be accurate to one millisecond [157].

Table 8 shows the time that each of the services takes to run on lerwick under a number of different execution environments. The Unix command line measurements are the execution times of the command-line programs. ProcessGMSImage cannot be measured in this way as it is written purely in Java; it cannot be directly executed without the creation of a JVM. All services within DISCWorld are encapsulated by a service wrapper, which provides JavaBean interfaces to the methods of the service. A service's run() method is invoked to execute the service. The run-time penalty due to Java can be calculated by invoking the service from a thin client on a local machine. The run-time overhead on findGMSImage is approximately 3000ms. We attribute this to: the necessity of creating an instance of the service in the JVM; invoking the service's set method for each parameter; and forking a new process to execute the shell-script. This overhead is increased with the createMPEG service because of the need to write each of the input images to disk before executing the mpeg_encode program.
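The way such an overhead figure can be derived from repeated measurements is sketched below. This is an illustrative helper, not the harness used to collect the measurements reported here: the class and method names are assumptions, and it uses a plain sample standard deviation rather than the least-squares fit mentioned in the table caption.

```java
/** Illustrative timing helper; names and structure are assumptions. */
final class Timing {
    /** Elapsed wall-clock time of one run, in milliseconds. */
    static long timeOnce(Runnable task) {
        long start = System.currentTimeMillis();
        task.run();
        return System.currentTimeMillis() - start;
    }

    /** Arithmetic mean of the samples. */
    static double mean(long[] samples) {
        double sum = 0;
        for (long s : samples) {
            sum += s;
        }
        return sum / samples.length;
    }

    /** Sample standard deviation (n - 1 in the denominator); needs at least two samples. */
    static double stdDev(long[] samples) {
        double m = mean(samples);
        double squares = 0;
        for (long s : samples) {
            squares += (s - m) * (s - m);
        }
        return Math.sqrt(squares / (samples.length - 1));
    }

    /** Overhead of one environment relative to a baseline, as a difference of means. */
    static double overhead(long[] environment, long[] baseline) {
        return mean(environment) - mean(baseline);
    }
}
```

For example, passing the thin-client timings as the environment and the command-line timings as the baseline yields the kind of difference quoted above for findGMSImage.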

It is useful to compare the performance of each of our example services when executed as stand-alone servers in an RMI environment. As table 8 shows, the additional overhead of each service when invoked via RMI is approximately 2000ms. The increase in execution time can be attributed to service brokering by the RMI registry and the fact that each of the services is executed in its own JVM. This overhead is not substantially increased when the RMI client is non-local.

The times quoted for the DISCWorld daemon are the total elapsed time from when the user client submits a processing request to when the client receives the final DRAMD. This includes the time taken to: transmit the processing request to the daemon; parse and place the service; return the DRAMF to the client; inspect the DRAMF at the client; execute the service; and return the DRAMD to the client. As the majority of the time spent executing a user processing request is at the server-side, the additional time taken to submit a request from a remote client is relatively small. Clearly, the DISCWorld daemon introduces some overheads which, when amortised over large services, become relatively inconsequential.

As previously mentioned, all results and partial results in DISCWorld are cached. This proves to be useful when a client requests an object that has already been cached. As illustrated in table 8, requesting a cached object from a DISCWorld daemon's store is extremely efficient. In contrast to the remote clients using RMI and DISCWorld, which were measured using a fast (100Mbps) network, the results of the remote retrieval from the DISCWorld cache are across a slow (10Mbps) network. The retrieval times are dominated by the network characteristics. The time is comparable to the time taken to execute the service using a thin Java client on the faster machine, or using RMI on the faster machine. Use of the remote client to request information from the local DISCWorld daemon illustrates the re-use of DRAMDs and DRAMFs inside the daemon: in this example, the original processing request that caused the data to be created was made from a client on the node local to the daemon. This further emphasises the usefulness of the DRAM mechanism, whereby the data may be further processed before being returned to the client, which may be connected via a very slow network.

In addition to the timing information presented in table 8, further measurements of the ERIC system are presented in [127]. While timing information for the findGMSImage service alone is not available for ERIC, the times for ERIC's processGMSImage service include that taken by findGMSImage. No standard deviation is provided in [127] for the time to create an MPEG at the resolution and zoom that we consider in this analysis. The figure given for the time taken to retrieve a cached MPEG is that taken to retrieve any object from ERIC's store.

The performance of the DISCWorld daemon is very dependent on the background load of the processors. The performance of Java is very susceptible to these effects because not only do the Java application's threads need to be scheduled by the JVM, but the JVM needs to be scheduled by the host operating system.

Table 9 shows the performance of the DISCWorld daemon in the case where a single server is chosen to be used by the placement algorithm. The first two columns of this table show the effects of adding a second service which is independent of the first. The mean time to return DRAMFs, corresponding to the outputs of the services, scales linearly with the number of independent services. However, there is a high standard deviation in the DRAMF creation time for the case in which there are two independent services. The increase in DRAMF creation time may be attributed to additional parsing necessary for the extra service, and also to increased contention for the quartermaster's synchronised store object, as described in an earlier chapter. Interestingly, while the mean time shown to execute two services is less than that for a single service, the standard deviation is larger. Within the bounds of experimental measurement, these values appear to be the same. This is feasible because the daemon is multi-threaded, and it is able to execute many services simultaneously.

Table 9: Performance of DISCWorld prototype using multiple processing requests. In the case of two services, two independent findGMSImage services are requested; in the case of four services, two independent instances of a findGMSImage and processGMSImage service are requested. The DISCWorld daemon processes all independent services concurrently.

The last column of table 9 presents the results of a client submitting a processing request in which there are two independent requests, each comprising a processGMSImage service that uses the output of a findGMSImage service. This may be compared with the second column of the table, in which two independent findGMSImage services are requested. The time taken to create and return DRAMFs is found to scale linearly with the total number of services requested. The addition of the two processGMSImage services, dependent on the output of their respective findGMSImage services, causes the total execution time of the processing request to double. Thus for these services, at least in this situation, the performance scales linearly.

The situation in which more than one client simultaneously makes a processing request to the same DISCWorld daemon is shown in table 10. The time taken for the daemon to create and return a DRAMF, and to return previously-created DRAMDs from the cache, increases by approximately 750ms per additional client. Each additional client increases the time taken to execute a single findGMSImage service by approximately 7200ms. This again emphasises that the daemon scales linearly, subject to the limitations described above.

Table 10: Performance of DISCWorld prototype using a single service when processing requests are made simultaneously. Within the bounds of control, each client submits a processing request, and then inspects the resulting DRAMFs at the same time.

Multiple servers will only be used by the scheduling placement algorithm when it will result in a lower processing cost, or if the required services are only available on different servers. Of course, when using multiple servers the effects of the interconnection network become significant, as illustrated above.

As described earlier, whenever a daemon receives a DJPL script, the script is parsed and tested to see if any extra placement information needs to be generated. If two nodes are used for processing, the first node (local to the client from which it was submitted) generates an annotated DJPL script for the request. The annotated DJPL script is then distributed to all participating nodes. When the script is parsed and the services created, no DRAMFs are produced until all the DRAMFs that will be used as inputs to that service are received. This adds extra overhead that has not been taken into consideration when creating the schedule. This is a second-order effect, and has not been incorporated into the model.

A Detailed Example

 

Further to the example presented and discussed earlier, we present a detailed example of event timing within the execution of a client processing request. We use this example to discuss the strengths and weaknesses of the DISCWorld architecture model.

Consider a processing request in which a number of GMS images are to be retrieved and processed on lerwick, and then transferred to banff, where they are used to make an MPEG. This sequence of events is shown pictorially in figure 47. An individual GMS image may be identified by the time and date at which it was made, and the spectral channel it represents [129]. A processed GMS image has the additional information of the bounding box to define the area of interest, and the zoom scale of the resulting image. The figure shows the services involved in the creation of such an MPEG; the images represent the data being manipulated at each step (although at greatly reduced resolution). When the processing request is parsed by the DISCWorld daemon, DRAMFs corresponding to the future data product are returned to the user. When the DRAMFs are inspected, the computation is started. Therefore, the arrows in figure 47 represent both DRAMFs before the computation is started, and DRAMDs after the partial products have been made. For all the image cases tested, the performance of the daemon scales approximately linearly. In the case in which four images are used and the processing request is submitted from lerwick, the following sequence of events is observed at the times given (we ignore the following steps discussed earlier: resource discovery, the user logging onto the client, and composing the processing request):

Figure 47: Pictorial representation of a complex DJPL request. A processed GMS image is defined by the tuple (time, date, spectral channel, bounding box, zoom scale). A sequence of processed GMS images is used as input by the createMPEG service. The query is constructed using DRAMDs and DRAMFs, denoted conceptually by arrows. The images show the data that is produced and consumed at each processing step.

On lerwick,

  1. The waiter is created and DJPL parsing is started,
  2. DJPL parsing takes 1455ms,
  3. WaiterExec and DistributeDJPL threads start,
  4. As all the findGMSImage services will be run on lerwick, and all the inputs to findGMSImage are supplied by the client, the DRAMFs are immediately created and distributed to the servers that will use them. As the processGMSImage service will use the outputs of findGMSImage, and is on the same node, the DRAMFs are only transferred to the user client. The processGMSImage services are created; their output DRAMFs are transferred to the nodes that will use them (banff) and the user client. This process takes 289ms. It takes 219ms for all the servers to be sent the annotated DJPL script.
  5. In total, the preparatory service creation, DJPL and DRAM distribution takes approximately 1750ms.

When the DJPL script and DRAMFs are received by banff,

  1. The waiter is created and DJPL parsing is started,
  2. DJPL parsing takes 6511ms,
  3. WaiterExec and DistributeDJPL threads start (the DJPL script is not annotated further by this DWd, so it is not distributed to any other nodes),
  4. It takes 215ms to start the createMPEG service, create the output DRAMF and send it to the client on lerwick,
  5. In total, the preparatory service creation, DJPL and DRAM distribution takes approximately 6750ms.

The user client, running on lerwick, receives the DRAMF for the output of the createMPEG service, and inspects it as soon as it is received.

Banff receives the request for the DRAMF to be created,

  1. The service to create the createMPEG output begins to marshal parameters. Requests are sent to the data's originating server (lerwick).

Lerwick receives the request for the data to which the DRAMFs correspond,

  1. The findGMSImage service takes approximately 28000ms to execute,
  2. The processGMSImage service takes approximately 13000ms to execute,
  3. Blocking requests for data in the quartermaster's store are periodically tested in case the thread is not awakened by a resume signal,
  4. In total, the execution of the findGMSImage and processGMSImage services takes approximately 127000ms. The data is returned to the requesting node (banff).

On banff, the DRAMDs corresponding to the output of the processGMSImage services are received,

  1. The blocking requests corresponding to the DRAMF inspection take approximately 145000ms,
  2. The createMPEG service takes approximately 30000ms to execute,
  3. The client's blocking request for the result of the createMPEG service is filled.
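The periodic re-testing of blocking requests noted in the lerwick steps can be sketched with Java's built-in monitor methods. The class below is a hypothetical stand-in for the quartermaster's synchronised store, not its actual implementation; the names BlockingStore, put, take and pollMillis are assumptions.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a synchronised store with blocking requests. */
final class BlockingStore {
    private final Map<String, Object> data = new HashMap<>();

    /** Stores an object and wakes any threads blocked on the store. */
    synchronized void put(String name, Object value) {
        data.put(name, value);
        notifyAll();
    }

    /**
     * Blocks until the named object arrives. Each wait is bounded by
     * pollMillis (which must be positive), so the request is re-tested
     * periodically even if a wake-up signal is missed.
     */
    synchronized Object take(String name, long pollMillis) {
        try {
            while (!data.containsKey(name)) {
                wait(pollMillis);
            }
            return data.get(name);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            throw new IllegalStateException("interrupted while waiting for " + name, e);
        }
    }
}
```

The bounded wait trades a small amount of polling overhead for robustness against a lost notification, which matches the motivation given in the event list.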

Figure 48: Event timing diagram of the complex processing request, shown in figure 47, using four images. By far, the greatest amount of time is spent in the findGMSImage and processGMSImage services. This is not unreasonable due to the penalty of co-locating services on the same node.

The event sequence of this processing request is shown in figure 48. It can be seen that the majority of the total execution time is spent when the DRAMFs that are produced by the processGMSImage services are inspected. The request, instigated on banff, causes the services on lerwick to be started, and the final results returned. This example shows how DRAMs can be used to set up and execute complex processing requests that span multiple nodes in a distributed system.
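The demand-driven behaviour described here, in which a handle to a future data product is returned first and the computation starts only on inspection, can be sketched as follows. LazyResult is a hypothetical, single-JVM simplification and does not reproduce the DRAMF or DRAMD classes of the prototype.

```java
import java.util.function.Supplier;

/** Hypothetical handle that names a result before it has been computed. */
final class LazyResult<T> {
    private final String name;
    private final Supplier<T> computation;
    private T value;
    private boolean computed = false;

    LazyResult(String name, Supplier<T> computation) {
        this.name = name;
        this.computation = computation;
    }

    String name() {
        return name;
    }

    synchronized boolean isComputed() {
        return computed;
    }

    /** The first inspection triggers the computation; later ones reuse the result. */
    synchronized T inspect() {
        if (!computed) {
            value = computation.get();
            computed = true;
        }
        return value;
    }
}
```

A handle whose computation inspects another handle reproduces the chaining in the example: inspecting the final product pulls the upstream services into execution, and nothing runs before that point.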

It can be seen in this example that the parsing of the annotated DJPL, as distributed by the DISCWorld daemon on lerwick, takes an abnormally large amount of time (6511ms on banff, compared with 1455ms for the original script on lerwick). While we are unable to pinpoint the exact cause of this anomaly, we suspect that it is due to the way in which the annotations are implemented by our parsing routines.

This example shows the case in which a server has been specialised to create MPEG animations. Another example of a server specialising in a specific service is the dploader tool (described in an earlier chapter), which offers a parameterised simulation service that uses a network of workstations that are unable to run their own JVMs.

This example shows the benefits of caching partial results of services. For example, if a number of the same images are to be re-used, the total service execution time will be approximately 145000ms shorter. We have not addressed the issue of flushing cache contents; this is planned in future work. If a DRAMD with the same name is available from a different server at a lower cost, it may be retrieved from the alternate server, thus optimising the execution of a processing request.
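Choosing among several servers that hold an object with the same canonical name reduces to a minimum-cost selection. The sketch below assumes that an estimated retrieval cost per server is already known; the class name and the map-based representation are illustrative assumptions.

```java
import java.util.Map;
import java.util.Optional;

/** Illustrative selection of the cheapest server holding a named object. */
final class SourceSelector {
    /** Returns the server with the lowest estimated cost, or empty if none holds the object. */
    static Optional<String> cheapest(Map<String, Double> costByServer) {
        return costByServer.entrySet().stream()
                .min(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey);
    }
}
```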

As has been previously discussed, one of the limitations of the current implementation of the DISCWorld daemon is its poor memory management. If the cache becomes too full, the daemon will fail, losing its current state information.

Performance Considerations

 

The current DISCWorld daemon implementation relies heavily on information about available nodes and interconnection networks. In the current implementation, a node is used only if both its node information and its interconnection network information are available. The reason for this is that while a node may be available all of the time and may have a very short waiting time, its interconnections to the remainder of the network may introduce a processing bottleneck. Future versions are intended to perform an on-demand analysis of available nodes and interconnection networks, similar to the Network Weather Service [237] of the AppLeS project.

This model relies on maintaining a mean size for objects of given types, in order to make future estimates. Since the same tests are repeated during performance analysis, we are unable to take advantage of the repeated creation of objects of the same type. Consequently, the effects of estimation cannot be seen in these performance results.

A serious limitation of the current implementation's performance is due to the Java virtual machine (JVM). The amount of memory that a JVM can use is limited by the amount of physical and virtual memory on a node. This places an upper limit on the amount of data that can be stored in the DISCWorld daemon's store in the current implementation. It was observed that when the JVM runs out of memory, not only will it eventually crash, but the daemon fails in curious and unpredictable ways: some threads will be created but not others; partial parsing of new processing requests will be performed; and some requests for information in the store will be successful while others will not. The presence of other users' processes running on the nodes further reduces the amount of memory that the daemon can use for caching.
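One way a store could avoid exhausting the JVM's memory is to check the remaining heap before caching an object. The following is a sketch under stated assumptions: the class MemoryGuard, the notion of a reserve, and the availability of an object's size in bytes are all hypothetical, and Runtime.maxMemory() belongs to later Java releases than the 1.2 versions used for the prototype.

```java
/** Illustrative guard deciding whether an object of a given size may be cached. */
final class MemoryGuard {
    /** Bytes the JVM could still allocate: the maximum heap minus the heap currently in use. */
    static long availableBytes() {
        Runtime rt = Runtime.getRuntime();
        long used = rt.totalMemory() - rt.freeMemory();
        return rt.maxMemory() - used;
    }

    /** True if caching objectBytes would still leave reserveBytes of headroom. */
    static boolean canCache(long objectBytes, long reserveBytes) {
        return canCache(objectBytes, reserveBytes, availableBytes());
    }

    /** Same decision against an explicitly supplied amount of available memory. */
    static boolean canCache(long objectBytes, long reserveBytes, long available) {
        return objectBytes >= 0 && objectBytes <= available - reserveBytes;
    }
}
```

Such a check is only advisory, since other threads and other users' processes can consume memory between the test and the allocation.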

When a service is moved to a new node, the execution time and variance are assumed to be the same as that of the node from where it has come. When the service is downloaded to a node, the past run-times are set to zero, so new averages and variances can be computed. Thus, the scheduling node makes the assumption of the same general performance characteristics; when the service is used the new values replace the assumed characteristics.
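The bookkeeping described above can be sketched as a small statistics object that reports inherited values until local measurements exist. The class name, the use of Welford's running-variance update, and the choice to switch the variance only after two local runs are assumptions made for this illustration.

```java
/** Illustrative per-service run-time statistics with inherited initial estimates. */
final class RunTimeStats {
    private final double assumedMean;     // inherited from the originating node
    private final double assumedVariance; // inherited from the originating node
    private long count = 0;               // runs observed on this node
    private double mean = 0;              // running mean of local runs
    private double m2 = 0;                // running sum of squared deviations

    RunTimeStats(double assumedMean, double assumedVariance) {
        this.assumedMean = assumedMean;
        this.assumedVariance = assumedVariance;
    }

    /** Records one local execution time, in milliseconds. */
    void record(double millis) {
        count++;
        double delta = millis - mean;
        mean += delta / count;
        m2 += delta * (millis - mean);
    }

    /** Local mean once a run has been observed, otherwise the assumed value. */
    double mean() {
        return count == 0 ? assumedMean : mean;
    }

    /** Local sample variance once two runs have been observed, otherwise the assumed value. */
    double variance() {
        return count < 2 ? assumedVariance : m2 / (count - 1);
    }

    /** Statistics to accompany a moved service: current estimates become assumptions, history is cleared. */
    RunTimeStats movedCopy() {
        return new RunTimeStats(mean(), variance());
    }
}
```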

Due to the way in which the mpeg_encode program is implemented, each of the images to be used for input must be written to disk. This is primarily so that the images can be converted to a file format suitable for the program. The current implementation of the createMPEG service writes the files in the base format required by the program. Unfortunately, the need to write files to a disk makes the service reliant on the amount of space available on the disk. If, during the course of service execution, the disk fills, the service fails. While this does not cause the daemon to crash, the user client is not notified, and no attempt is made to rectify the failure.

A simple timer thread is used to approximate the load on a node. The operation of the timer thread is very simple: it resets and increments a counter for a given time period. The count reached, expressed as a percentage of the current maximum, gives an estimate of the load on the node. In the current implementation, services are only executed once the estimated load is below a certain threshold. Of course, the execution of services is not the only contributing factor to the observed load: the parsing of new DJPL scripts, any blocking requests, and saving the state of the store all contribute significantly from within the JVM. This figure will be additionally affected by the external load on the system. In future versions of the daemon, a more sophisticated method of estimating load must be developed and incorporated.
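A counter-based load estimate of this kind might look like the sketch below. It assumes that a lower count in a window indicates a more heavily loaded node, and that the largest count seen so far represents an unloaded node; the class and method names are hypothetical and the prototype's timer thread may differ in detail.

```java
/** Illustrative counter-based load estimator. */
final class LoadEstimator {
    private long maxCount = 0; // highest count seen so far (least-loaded window)

    /** Spins for windowMillis and returns the number of iterations completed. */
    static long countFor(long windowMillis) {
        long end = System.currentTimeMillis() + windowMillis;
        long counter = 0;
        while (System.currentTimeMillis() < end) {
            counter++;
        }
        return counter;
    }

    /** Converts a window's count into an estimated load between 0 (idle) and 1 (saturated). */
    synchronized double update(long count) {
        if (count > maxCount) {
            maxCount = count;
        }
        return maxCount == 0 ? 0.0 : 1.0 - (double) count / maxCount;
    }

    /** Services run only while the estimated load is below the threshold. */
    static boolean mayExecute(double load, double threshold) {
        return load < threshold;
    }
}
```

Note that the spinning window itself consumes processor time, which is one reason a more sophisticated estimator is desirable.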

The daemon's thread count (shown in a figure) indicates that at least 8 threads are executing while the daemon is quiescent. This figure can more than double when processing requests are being parsed, distributed and executed. While implementations that use native threads can handle the large number of threads created during processing, we have found that implementations that used green threads had serious synchronisation problems. In fact, although the daemon was developed on Java 1.2 on Windows NT 4, the release of Java 1.2.1 for Solaris was the first in which our implementation did not freeze.

As the DISCWorld daemon (and client program) are written in pure Java, we must rely on the mechanisms provided within the Java language specification [88] and the Java Runtime Environment [218] for robustness and fault tolerance. Although the prototype emphasises the usefulness of adaptive scheduling and DRAMs as an enabling mechanism, for the remainder of this section we discuss how reliability, robustness and fault tolerance might be incorporated into a production version of DISCWorld written in Java.

One of the fundamental issues that must be addressed in a production system is that of availability. On whatever platform the daemon runs, there must be a guarantee that whenever a host receives a message destined for a daemon, a daemon is running to accept it. On a UNIX machine the DISCWorld daemon can be treated similarly to the operating system daemons. An entry can be placed into the file /etc/inetd.conf so that if a message is received on the port dedicated to the DISCWorld daemon, a new daemon will be created if none is currently running. Similarly on Windows NT a file in /WINNT/system32/drivers/etc can be modified to the same effect.

As implemented, the DISCWorld prototype has two major failings: the store fills very quickly, which causes the daemon to crash; and, if the daemon fails while parsing or executing a processing request, the request is forgotten when the daemon is restarted. Problems with the store and avenues for future research are discussed elsewhere in this thesis. An obvious solution to the problem of fault tolerance is to store all daemon state information in a Java-accessible database such as Oracle [182] or Informix [121], accessed through JDBC [218]. By using transaction technology to ensure that the database remains in a consistent state, the state can be restored when the daemon is restarted.

Writing all state information to a database could result in an unacceptable level of system overhead. Thus one of the management policies enforced by the daemon might be to categorise all state information and only store the highest levels. This scheme, similar to an incremental file backup system, would ensure that when the daemon receives important messages such as processing requests or DRAM dereference instructions, they will be stored. If the daemon is restarted after the message has been saved the action can be re-performed without too much cost to the daemon. Network failures are harder to detect because of the lack of acknowledgement messages in the daemon-daemon and client-daemon protocols. We are investigating the use of generous time-outs after which messages are resent.
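The resend-after-time-out approach under investigation can be sketched as a bounded retry loop. The transport is modelled here as a predicate that is assumed to return true only if delivery is confirmed within its time-out; this confirmation step, like the class and method names, is a hypothetical addition, since the current protocols carry no acknowledgement messages.

```java
import java.util.function.Predicate;

/** Illustrative bounded resend loop. */
final class Resender {
    /**
     * Attempts delivery up to maxAttempts times.
     * Returns the number of attempts used, or -1 if every attempt timed out.
     */
    static <M> int sendWithRetry(M message, Predicate<M> transport, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (transport.test(message)) {
                return attempt;
            }
        }
        return -1;
    }
}
```

Because a message may be delivered more than once under this scheme, the receiving daemon would need to treat repeated requests idempotently, for example by recognising their canonical names.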

Inside the daemon, Java's strong typing and byte-code verification system provide the necessary framework within which a type-safe environment can be built. Other researchers in the Distributed and High Performance Computing group are working on adding user- and message-verification functions into the daemon's portal module. Authentication systems such as that provided by Kerberos [173] are being considered for this task.

Conclusion

 

In this chapter, we have analysed the performance of the prototype DISCWorld daemon. The experimental results have shown that the scheduling algorithm is practically useful and DRAMs are a good enabling mechanism. However, the performance of the daemon is limited by the fact that several crucial parts of the daemon have not been implemented. The limitations of the current implementation are discussed, which suggest a large body of possible future work in this area.

We have compared the performance of the DISCWorld daemon with Java RMI and cgi-bin versions of the same programs. Results have shown that while the DISCWorld daemon is slower than the RMI version in the construction of the same data products, the time taken for the RMI version is constant, whereas the DISCWorld daemon benefits from caching all intermediate results. The DISCWorld model of computation also allows client-side computation, which is unavailable in cgi-bin scripts.

DRAMFs allow demand-driven execution of processing requests. Execution of processing requests can be optimised if the same data is available from alternate nodes. The prototype DISCWorld daemon has been shown to have performance which scales linearly within the limitations of the current implementation. A detailed example has shown how DRAMs can be used to construct and execute complex processing requests that span multiple servers. An event timing diagram has shown the relative time costs of the different processing stages of request execution.

It is important to remember that although the DISCWorld daemon is a vital component in the implementation of scheduling in DISCWorld, the creation of a perfect DISCWorld daemon is not the aim of this thesis. The main topics are: the creation and distribution of good service placement decisions in the presence of incomplete system state information; and, the implementation of a framework that allows high-level information about data and services to be traded between participating DISCWorld daemons and clients, to allow on-demand client- and server-side processing.

We have achieved these two goals through the use of: a platform-independent Distributed Job Placement Language, with which processing requests can be described; an execution-cost minimisation function, which incorporates the concept of a daemon's willingness to perform a given function; and the DISCWorld Remote Access Mechanism, which allows services and data, whether the data has been physically created or not, to be accessed and traded between clients and servers.



Heath A. James, heath@cs.adelaide.edu.au