Querying and Auxiliary Data in the DISCWorld
J.A. Mathew and K.A. Hawick
Email: {jm,khawick}@cs.adelaide.edu.au
Department of Computer Science, University of Adelaide,
SA 5005, Australia
January 23, 1998
An integral part of the DISCWorld environment is access to the storage and manipulation of the auxiliary data that is associated with each of the large data objects that is stored in a repository. For efficient job execution, it is necessary to minimise the movement and processing of relatively large files: this can be accommodated by pre-identifying files of interest by querying the auxiliary data assocaited with each file. A natural storage means of such auxiliary data would be in a RDBMS. However the interfaces to most RDMSs is proprietary and requires vendor and platform specific software to be installed at both the client and server. The Java JDBC package provides a mechanism for universal access to the data stored in a database. We propose, and have built a prototype implementation for, a high-level Java-based database layer that allows for storage and querying of auxiliary data. It also provides mechanisms by which the schema of the database can be dynamically altered, and uses Boolean Syntax Query form in addition to the SQL that is native to the database. Our implementation uses Informix Universal Server as the underlying database and we suggest how the Object-Relational extensions that this database provides can be utilised in constructing a storage environment for auxiliary data. We also discuss how graphical Java client applications can be constructed so users can browse auxiliary data without prior knowledge of the schema.
In addition to the above, it is usually necessary for client applications to have detailed knowledge of the schema of the underlying database. This makes it impossible to alter the schema without modifying the client applications and distributing the updated client to all users. In addition if the client is intended to access a number of databases, the schema of each of these needs to be hard coded into the application.
The problems highlighted above can be overcome by using the Java [3] programming language in general and the JDBC [2] package in particular. Java byte code can be executed on any platform for which a Java Virtual Machine exists.
Our system is particularily focused on the storage of auxilary data, which is the context information relating to a large object in the system, such as an image file. For example our respository stores GMS5 Satellite images, and associated with each of these is about 100 items of data, such as the time at which the image was taken, the satellite position and many other pieces of information. This auxiliary data is often referred to as ``metadata''. However, in the literature the term ``metadata'' is often used in two senses. The first is contextual information as described above. The term ``metadata'' is also used to refer to the schema of a database. In order to avoid confusion, in this paper, we will use the term ``auxiliary data'' to refer to the extra information associated with the image files stored in the repository, and the term ``metadata'' to refer to the database schema.
The prototype implementation of the database server that we have developed uses Java RMI as it's communication mechanism between client and server. An alternative would be to use CORBA: the changes that would be required for the system to use CORBA would be relatively minor. The use of CORBA and RMI in the DISCwiorld environment is discussed in [7].
Traditionally, database applications have been developed as two-tier client/server style applications, with the client being developed specifically to interact with a particular database, and to execute on a particular platform. The system we describe makes uses of new technologies to create an intermediate layer that is largely database and platform independent, and that provides extra functionality on top of what is normally provided by a relational database system. This extra functionality includes facilities for incorporating the processing of data, as well as caching of these derived results. We describe technologies for implementing dynamic loading of code in Section 5.1 and for result caching in Section 5.2. The issue of querying of auxiliary data is presented in Section 4.
(date > 1/1/1998 and date < 1/1/97)
and cloud_cover(australia) > 30)
The new column is lazily evaluated: if a similar query is later executed results already calculated can be re-used.
There are four basic types of JDBC driver which are outlined below.
The first two types of drivers require the loading of platform specific binary code at the client. The first is useful where no JDBC driver is available for the database system, but where a legacy ODBC driver does exist. The second type of driver has binary code to translate JBDC calls to database specific calls. Type 3 drivers require some binary code to be loaded onto the database server: generally this is easier than loading code onto all clients. Type 4 drivers do not require any special code to be loaded onto either the client or server.The preferred type of driver is a type 3 or type 4 driver since this allows the client applications developed to be on any client platform for which a Java Virtual Machine exists. Most of the current JDBC drivers are type 3 drivers, however it can be expected that new versions of database software released in the future will have native support for JDBC calls. Our system makes uses of a type 3 driver from Openlink Software.
A particularly interesting aspect of the JDBC package is the DatabaseMetaData package. This allows clients to dynamically determine the schema of a database. Clients can obtain information such as the tables in the database, columns in the these tables and information about indices. Hence the schema need not be hard-coded into client applications. Although some database products store the schema in some form in certain system tables, this is very much vendor (and even release) specific and often the information is not readily obtainable, but needs to be reconstructed from relatively complex SQL queries.
Our database server will allow the client applications to create new tables to hold the auxiliary data and allow the client programs to alter the schema of an existing table by adding extra columns and dropping existing ones. This will be done without intervention by a database administrator, although for optimum performance it may be necessary for a database administrator to perform some administration tasks manually. It is necessary for columns to be added dynamically in order to support the result caching that we intend to incorporate into our system.
Since we use JDBC for our database accesses, it will be relatively straightforward to substitute a free database engine, such as msql or postgres, in place of Informix. However, we have found that with extremely large data sets, the freely available databases exhibit performance and reliability problems.
One of the other aspects of our system is that the querying interface does not only use SQL. We also support Boolean Query Syntax (BQS) as defined in the Geospatial and Imagery Access Services (GIAS) Specification [5]. Boolean Sytax Queries resemble the ``where'' clause of an SQL statement, and the use of this syntax means that Our initial implementation will not include geographic functions but will support the normal relational operators and also date types.
The BNF Description of BSQ is provided below.
query ::= [``not''] term { ``or'' term }
term ::= factor { ``and'' factor }
factor ::= (simple_attribute_name comp_op constant_expression )
(text_attribute_name [ ``not'' ] ``like'' quoted_string
( ``('' query ``)'' )
attribute_name ::= a member of the set of queryable attribute names (defined in
the appropriate GIAS profile)
simple_attribute_name ::= member of subset of attribute_name for which boolean
operators (comp_op) are allowed
text_attribute_name ::= member of subset of attribute_name for which string
operators are allowed (``free text search'')
comp_op ::= ``=''|``<''|``>''|``<>''|``<=''|``>=''
constant_expression ::= number | quoted_string | date
sign ::= ``+''|``-''|``''
number ::= sign n [ ``.'' [n]]
n ::= digit { digit }
digit ::= ``0''|``1''|``2''|``3''|``4''|``5''|``6''|``7''|``8''|``9''
quoted_string ::= ``''' { character } ``'''
character ::= ``a''|``b''| ...
date ::= number ``/'' number ``/'' number
Patern matching can be performed using the ``%'' character.
Use of Boolean Syntax means that client applications do not have to generate SQL statements themselves. In particular, if the data is spread across several tables, the clients do not need to know about this and need not performany SQL joins.
However, a disadvantage of this query syntax are that there is no provision for specifying a subset of the attributes to be returned. In the case of our the auxiliary data for our GMS 5 satellite images, there are over 100 attributes of which only one might be relevant for a particular query. Another limitation of this syntax is that there is no equivalent to SQL subqueries, so certain queries can not be expressed. In order to overcome these problems, we intend to also support SQL in addition to Boolean Query Syntax.
Since Java clients can dynamically query the database and extract the schema, it is relatively easy to construct GUI-based applets or applications that allow queries to be made on a 'Query by Example' basis, similar to that used by Microsoft Access. For example, we have developed a simple applet, that connects to a database and retrieves the list of tables. The use can then select one of these tables and a list of columns is obtained. The user can then highlight the columns they wish to see, and specify search criteria, and the query is executed and the results presented to the user.
Most of the current RDBMS packages currently available provide an API so client applications can insert extra functionality directly into the database engine: this approach has the disadvantage that the APIs are proprietary to the database vendor and require the compilation of platform specific code.
This work is being carried out as part of the Distributed High Performance Computing Infrastructure (DHPC-I) project of the Research Data Networks Cooperative Research Center (RDN CRC) and is managed under the On-Line Data Archives Program of the Advanced Computational Systems CRC. RDN and ACSys are established under the Australian Government's CRC Program.
Jesudas Mathew (jm@cs.adelaide.edu.au)