Technical Report DHPC-030: Querying and Auxiliary Data in the DISCWorld

Technical Report DHPC-030

Querying and Auxiliary Data in the DISCWorld

J.A. Mathew and K.A. Hawick
Email: {jm,khawick}@cs.adelaide.edu.au
Department of Computer Science, University of Adelaide, SA 5005, Australia

January 23, 1998





Abstract

An integral part of the DISCWorld environment is access to the storage and manipulation of the auxiliary data that is associated with each of the large data objects that is stored in a repository. For efficient job execution, it is necessary to minimise the movement and processing of relatively large files: this can be accommodated by pre-identifying files of interest by querying the auxiliary data assocaited with each file. A natural storage means of such auxiliary data would be in a RDBMS. However the interfaces to most RDMSs is proprietary and requires vendor and platform specific software to be installed at both the client and server. The Java JDBC package provides a mechanism for universal access to the data stored in a database. We propose, and have built a prototype implementation for, a high-level Java-based database layer that allows for storage and querying of auxiliary data. It also provides mechanisms by which the schema of the database can be dynamically altered, and uses Boolean Syntax Query form in addition to the SQL that is native to the database. Our implementation uses Informix Universal Server as the underlying database and we suggest how the Object-Relational extensions that this database provides can be utilised in constructing a storage environment for auxiliary data. We also discuss how graphical Java client applications can be constructed so users can browse auxiliary data without prior knowledge of the schema.

1 Introduction

  In traditional database systems, the client applications need to be specifically developed for a particular underlying DBMS and client platform. This restriction means that if access to data from multiple client platforms is required, a separate client application will need to be developed: or at the very least the source code will need to be recompiled.

In addition to the above, it is usually necessary for client applications to have detailed knowledge of the schema of the underlying database. This makes it impossible to alter the schema without modifying the client applications and distributing the updated client to all users. In addition if the client is intended to access a number of databases, the schema of each of these needs to be hard coded into the application.

The problems highlighted above can be overcome by using the Java [3] programming language in general and the JDBC [2] package in particular. Java byte code can be executed on any platform for which a Java Virtual Machine exists.

Our system is particularily focused on the storage of auxilary data, which is the context information relating to a large object in the system, such as an image file. For example our respository stores GMS5 Satellite images, and associated with each of these is about 100 items of data, such as the time at which the image was taken, the satellite position and many other pieces of information. This auxiliary data is often referred to as ``metadata''. However, in the literature the term ``metadata'' is often used in two senses. The first is contextual information as described above. The term ``metadata'' is also used to refer to the schema of a database. In order to avoid confusion, in this paper, we will use the term ``auxiliary data'' to refer to the extra information associated with the image files stored in the repository, and the term ``metadata'' to refer to the database schema.

The prototype implementation of the database server that we have developed uses Java RMI as it's communication mechanism between client and server. An alternative would be to use CORBA: the changes that would be required for the system to use CORBA would be relatively minor. The use of CORBA and RMI in the DISCwiorld environment is discussed in [7].

Traditionally, database applications have been developed as two-tier client/server style applications, with the client being developed specifically to interact with a particular database, and to execute on a particular platform. The system we describe makes uses of new technologies to create an intermediate layer that is largely database and platform independent, and that provides extra functionality on top of what is normally provided by a relational database system. This extra functionality includes facilities for incorporating the processing of data, as well as caching of these derived results. We describe technologies for implementing dynamic loading of code in Section 5.1 and for result caching in Section 5.2. The issue of querying of auxiliary data is presented in Section 4.

2 DISCWorld Query Example

  Consider an example of an image processing query in DISCWorld.


 figure26

The new column is lazily evaluated: if a similar query is later executed results already calculated can be re-used.

3 Technologies

 

3.1 JDBC Features

  The JDBC package is an API that allows Java applications to execute SQL statements irrespective of the underlying database. JDBC drivers are available, or will soon be available, for most common relational databases in use.

There are four basic types of JDBC driver which are outlined below.

  1. JDBC-ODBC bridge plus ODBC driver
  2. Native API partly-Java driver
  3. JDBC-Net pure Java driver
  4. Native-protocol pure Java driver

The first two types of drivers require the loading of platform specific binary code at the client. The first is useful where no JDBC driver is available for the database system, but where a legacy ODBC driver does exist. The second type of driver has binary code to translate JBDC calls to database specific calls. Type 3 drivers require some binary code to be loaded onto the database server: generally this is easier than loading code onto all clients. Type 4 drivers do not require any special code to be loaded onto either the client or server.The preferred type of driver is a type 3 or type 4 driver since this allows the client applications developed to be on any client platform for which a Java Virtual Machine exists. Most of the current JDBC drivers are type 3 drivers, however it can be expected that new versions of database software released in the future will have native support for JDBC calls. Our system makes uses of a type 3 driver from Openlink Software.

A particularly interesting aspect of the JDBC package is the DatabaseMetaData package. This allows clients to dynamically determine the schema of a database. Clients can obtain information such as the tables in the database, columns in the these tables and information about indices. Hence the schema need not be hard-coded into client applications. Although some database products store the schema in some form in certain system tables, this is very much vendor (and even release) specific and often the information is not readily obtainable, but needs to be reconstructed from relatively complex SQL queries.

Our database server will allow the client applications to create new tables to hold the auxiliary data and allow the client programs to alter the schema of an existing table by adding extra columns and dropping existing ones. This will be done without intervention by a database administrator, although for optimum performance it may be necessary for a database administrator to perform some administration tasks manually. It is necessary for columns to be added dynamically in order to support the result caching that we intend to incorporate into our system.

3.2 Informix

  Our database server implementation uses Informix Universal Server [4] as the underlying database engine. Where possible, we have tried to avoid making use of the proprietary features of Informix. However, it may be beneficial to use some of the Object-Relational extensions that are available as part of this product. These features include a richer type set, with the ability to extend the pre-existing types, and facilities and incorporating some processing functionality into the database engine. The Informix implementation of large objects does not provide any convenient mechanism by which binary files can be inserted into the database using JDBC. It is necessary for the file to actually exist on the file system of the database server.

Since we use JDBC for our database accesses, it will be relatively straightforward to substitute a free database engine, such as msql or postgres, in place of Informix. However, we have found that with extremely large data sets, the freely available databases exhibit performance and reliability problems.

4 Querying

  One of the other aspects of our system is that the querying interface does not only use SQL. We also support Boolean Query Syntax (BQS) as defined in the Geospatial and Imagery Access Services (GIAS) Specification [5]. Boolean Sytax Queries resemble the ``where'' clause of an SQL statement, and the use of this syntax means that Our initial implementation will not include geographic functions but will support the normal relational operators and also date types.

The BNF Description of BSQ is provided below.

query ::= [``not''] term { ``or'' term }
term ::= factor { ``and'' factor }
factor ::= (simple_attribute_name comp_op constant_expression )
           (text_attribute_name [ ``not'' ] ``like'' quoted_string
           ( ``('' query ``)'' )
attribute_name ::= a member of the set of queryable attribute names (defined in
                   the appropriate GIAS profile)
simple_attribute_name ::= member of subset of attribute_name for which boolean 
                          operators (comp_op) are allowed
text_attribute_name ::= member of subset of attribute_name for which string 
                        operators are allowed (``free text search'')
comp_op ::= ``=''|``<''|``>''|``<>''|``<=''|``>=''
constant_expression ::= number | quoted_string | date
sign ::= ``+''|``-''|``'' 
number ::= sign n [ ``.'' [n]]
n ::= digit { digit }
digit ::= ``0''|``1''|``2''|``3''|``4''|``5''|``6''|``7''|``8''|``9''
quoted_string ::= ``''' { character } ``'''
character ::= ``a''|``b''| ...
date ::= number ``/'' number ``/'' number 

Patern matching can be performed using the ``%'' character.

Use of Boolean Syntax means that client applications do not have to generate SQL statements themselves. In particular, if the data is spread across several tables, the clients do not need to know about this and need not performany SQL joins.

However, a disadvantage of this query syntax are that there is no provision for specifying a subset of the attributes to be returned. In the case of our the auxiliary data for our GMS 5 satellite images, there are over 100 attributes of which only one might be relevant for a particular query. Another limitation of this syntax is that there is no equivalent to SQL subqueries, so certain queries can not be expressed. In order to overcome these problems, we intend to also support SQL in addition to Boolean Query Syntax.

Since Java clients can dynamically query the database and extract the schema, it is relatively easy to construct GUI-based applets or applications that allow queries to be made on a 'Query by Example' basis, similar to that used by Microsoft Access. For example, we have developed a simple applet, that connects to a database and retrieves the list of tables. The use can then select one of these tables and a list of columns is obtained. The user can then highlight the columns they wish to see, and specify search criteria, and the query is executed and the results presented to the user.

5 Discussion

  As previously mentioned, one of the major research issues is the problem of identifying the intermediate results that are stored. One approach would be to store information about the cached results, such as the program or function used to create them, the parameters and input provided to these programs, time of creation and other relevant information into a database table. This table can later be searched in order to identify results of interest at a later time. Intelligence could be incorporated into the database server to find an appropriate match, since it might not be necessary for all of the stored attributes to match. This is obviously infeasible in the general case, so the type of results that can be cached and identified needs to be constrained in some way.

5.1 Code Server

  As already mentioned, the DISCWorld environment should allow for more than simple access to the auxiliary data stored in the system. A query might request some dervied result, which does not exist in the database. In this case, the database server can obtain from some know ``Code Server'', java byte code to perform the processing required. Alternatively, the client can provide the server with the java byte code along with the request. These class files can then be instantiated by the database server. Using the Java Reflection package the server can interrogate the downloaded class file and determine the methods implemented by the class and the parameters accepted by these methods. In many cases this approach will be more efficient than the client downloading the data of interest and performing the processing itself. For example if a client, connected to the DISCWorld system by a dialup link, requires the percentage cloud cover of a subset of the images stored in the repository, it is obviously preferable to perform the processing remotely in order to avoid having to download large amounts of data over the slow dialup link. We have implemented a prototype ``Code Server'' which performs this functionality. The problems in defining an appropriate namespace also arise in this context. There are also trust issues in allowing arbitrary code to be uploaded and executed.

Most of the current RDBMS packages currently available provide an API so client applications can insert extra functionality directly into the database engine: this approach has the disadvantage that the APIs are proprietary to the database vendor and require the compilation of platform specific code.

5.2 Result Caching

  In addition to supporting querying of the auxiliary data, the DISCWorld system is intended to allow ``value adding'' through processing of the data stored. This processing may be costly in terms of time or computing resources, and there may be scope for re-use of these results. We intend to provide hooks into our database server so that these intermediate results, which can be stored and re-used. The central problem in this is organising a namespace for cached results and particularily intermediate results, so that the can be identified for reuse. Eric [6], a prototype tool for browsing image data and creating video sequences tackles this problem by the generating incorporating details about the cached results into the filename. We discuss this further in the Section 5.

5.3 Distribution

  The framework that we have described can be further developed to support distribution and automatic migration of data. Multiple servers could be running on difference machines and possible having different underlying databases. These servers could communicate with each other so that when a server receives a query for data that it does not have, the query can be forwarded onto other servers. These servers can then pass the query results back to the server that made the initial request, which can then collate the results and forward them back to the client. Thus the distribution is transparent to the client. The servers could also can also communicate between themselves and move the data to the most appropriate site based on access patterns for the data. This will involve the keeping of access statistics by the server.

5.4 Summary

  As we have discussed in this paper in order to query the auxiliary data, both SQL and Boolean Query Syntax have limitations. A combination of both SQL and BQS, in addition to the extra functionality that we describe such as dynamic loading of code, result caching and distribution provides a better basis on which to build a framework for the storage and manipulation of auxiliary data within the DISCWorld environment.

Acknowledgements

This work is being carried out as part of the Distributed High Performance Computing Infrastructure (DHPC-I) project of the Research Data Networks Cooperative Research Center (RDN CRC) and is managed under the On-Line Data Archives Program of the Advanced Computational Systems CRC. RDN and ACSys are established under the Australian Government's CRC Program.

References

 
1
``DISCWorld: An Integrated Data Environment for Distributed High-Performance Computing'' K.A.Hawick, A.L.Brown, P.D.Coddington, J.F.Hercus, H.A.James, K.E.Kerry, K.J.Maciunas, J.A.Mathew, C.J.Patten, A.J.Silis, F.A.Vaughan, DHPC Technical Report DHPC-027, January 1998.

2
``JDBC Guide:Getting Started'', Javasoft 1997, http://www.javasoft.com.au/jdbc.

3
``The Java Language Reference'', J.Gosling, B.Joy, G.Steele, Addison-Wesley 1996.

4
``Informix WWW Site'' http://www.informix.com.

5
``Geospatial and Imagery Access Services Specification'', National Imagery and Mapping Agency, United States Imagery and Geospatial System, 22 July, 1997.

6
``A Web-based Interface for On-Demand Processing of Satellite Imagery Archives'', H.A.James and K.A.Hawick, DHPC Technical Report DHPC-018, August 1997,

7
``Service Management in DISCWorld using CORBA'', K.E.Kerry and K.A.Hawick , DHPC Technical Report DHPC-033

Jesudas Mathew (jm@cs.adelaide.edu.au)