A Bulk Data Transfer Bus

Francis Vaughan
francis@cs.adelaide.edu.au

Distributed High Performance Computing Group

University of Adelaide, Department of Computer Science

Tech Report DHPC-031

February 1998

 

Abstract

Storage, processing and transmission of very large data sets found in remotely sensed geographical information systems place heavy demands upon all aspects of a system design. The regularity of the data and the nature of the most common computational tasks allows advanced data tiling techniques to be used to good effect. A data transfer API which uses the notion of tiles as its fundamental data type allows each part of the system to describe both the physical aspects of parallel data management, and the computational and user level layouts. Mapping functions which describe the retitling of data from one tiling to another form the basic data transfer specification. Combinations of retilings can be specified before data is moved and a single, optimal remap performed that can be further optimised to best utilise the physical parameters of storage and communication devices. Bulk data is stored in a regular format for best performance, this requires that unstructured meta-data and control data be stored separately. A paradigm in which each data object is represented as a high level specification and a multi-dimensional array of simple data evolves. The system described can integrate with legacy applications through the provision of interfaces that create the illusion of files. These files are automatically generated upon demand and can have their content optimised so that the legacy application only sees exactly the data it needs. The Bulk Data Transfer Software Bus underpins the design of the DISCWorld system.


Introduction

Data storage, transfer and manipulation in geographical information systems (GIS), remote sensed applications and many other applications requires the handling of massive amounts of data. For instance, a single multi-spectral image delivered from a geosynchronous orbit weather satellite is of the order of 40 Megabytes. These are delivered hourly. Newer remote sensing satellites can deliver up to 1 Tera byte per day. Storage usually requires large robotic tape silos, multi disk RAID arrays and processing all but trivial images requires large parallel computers. Luckily these data sets are very regular and typical uses require large regular subsets of these data sets. The DISCWorld system [Hawick 1998] is an integrated environment which controls the storage, access, distribution, computation and user interaction in applications which require high performance, especially high performance distributed computations. The mechanisms and architecture described in this paper form the underpinning to the control and movement of data in the DISCWorld system.

In this paper we discuss the basics of tiling and in particular Fletcher’s k-tiling system [Felcther 1992]. We show how this forms a particularly useful paradigm for describing large regular data-sets such as GIS and remote sensed and go on to describe an API which embodies the k-tile as its basic data type. This is the Bulk Data Transfer Bus.

Tiling

Conventionally, simple image data is represented as a raster, with some slight variations covering such details as row or column dominance. As soon as an image is composed of more than one data set, significant complication is already apparent in existing designs. Allowing three spectral channels (red, green, blue) begets the question of whether the representation should store each pixel as RGB, each line as three lines one of red, one of green one of blue (interlacing) or whether to store three successive images. Ingenious minds will find many more possible combinations, each with its own advantages and disadvantages.

In parallel computation it is similarly necessary to split a regular data-set, such as an image, across the computational nodes. Even in symmetric shared memory designs an understanding of the data layout and the manner in which individual processors split up the problem at hand will significantly affect performance, since cache performance is crucially dependant upon access patterns. In data parallel languages (such as: CMF, HPF, Fortran 95 and C*) it is usual to provide a mechanism to describe how data is shattered. The issues are similar conceptualy to those for image data above.

Storage of data, particularly on disk arrays is similarly beset. Striped disk arrays can provide gratifyingly good data rates when data transfers occur for contiguous blocks. The choice of block size and stripe size, and thus the manner in which data is shattered across the drives can be a crucial performance determinant.

All these descriptions can be subsumed under the one banner of data tiling. Early efforts on data tiling were begun at ICL [Flanders 1988] with later work by Fraser [Fraser 1985] and Fletcher [Bosman 1992].

A simple and common question of data layout in data parallel programming is the layout of a two dimensional array. If it is to be shared across a set of identical processors three options are possible. Individual processors may receive a group of rows, a group of columns or the array may be divided into rectangular sub arrays, and these distributed.

Data spaces represent the physical or synthetic coordinate system in which the data was sampled or generated, eg a three channel 512 x 1024 image of 32 bit integers: [4,512,10-24,3] Note that the basic type is the byte, so that 32 bit integers are vectors of length four, hence the first term in the description. Similarly data in the device spaces can be represented, eg a 64 byte array stored on a CM-2 configured as a 4-D 16x16x16x16 processor array: [64:16,16,16,16]

Mapping provide the mechanism to move data from one space to another. Mappings provide an expression of any valid permutation of k-tile dimensions from source to destination data spaces. Importantly the mapping definitions allow for some dimensions to be empty, and also to specify tile templates that describe padding of sub-tiles, providing a mechanism to map data from spaces that do not factor the target space. Further an ability to specify offsets allows translation of geometries. Offsets are extended to describe replication of data relative to a given dimension.

The Bulk Data Transfer Bus

Concept

The Bulk Data Transfer Bus is a software system which allows all regular data to be described as a K-Tile, and further, all data sources and sinks to provide tiling. The system is responsible for managing all retilings. Further to this basic API the system provides access to control and description data through a separate channel. By allowing every part of the system involved with the handling of bulk data to control its own tiling, each system can control the layout optimally.

By making the entire system use a single canonical expression for data transfer and layout optimisations can be performed for any part of the system using the same tools, and can be further optimised in a global manner hitherto unavailable.

Control and description data

Data sets (especially remotely sensed) data includes significant extra descriptive data (often incorrectly referred to as meta data, or more correctly as auxiliary data.) Such aspects as the precise location and view of the satellite, calibration data, time, and other information is crucial to any computation using the sensed data. This data is small, not regular, and does not fit well into the bulk data transfer architecture. Useful systems will also include significant holdings of derived data that allow both users and advanced applications to search the data holdings. Such derived data can include such useful additions as thumbnail images, quality indexes, and bitmaps of cloud cover for example. This control and derived data is held in more conventional databases along with descriptions of the data holdings of the bulk data store.

Data Type

This view of data holdings leads to a significantly different view of the basic data types within the bulk data transfer bus. Data is defined as a pair. First, the high level definition of the data, which might include such determinants as satellite type, high level search predicates, or reference to an internal partial result specification. Second, the definition of the data tiling, which includes the base type of the cell.

Virtual Objects

This view of data naturally leads to the use of representations of data objects that do not currently exist and the automatic creation of derived bulk data-sets that exactly fit the requirements of a computation or intermediate result in a large scale computation. In the DISCWorld system large computations may be spread across a number of machines, and indeed may be geographically spread. Descriptors of virtual bulk data objects are generated by the DISCWorld control system and are mobile. In a manner similar to that used in the Globus system [Foster 1996] virtual object descriptors are end-points for communications and can be passed on from one system to another. Thus in the Bulk Data Transfer Bus design a virtual object descriptor that represents the result of a computation or retrieval of data is a reception point that can deliver the data to a waiting computation or other data sink (such as a storage or display device).

Bus Adaptors

Thus far the only applications that can make use of the Bulk Data Transfer Bus are those which have been explicitly written to use its API. Clearly this is a significant limitation, especially in support of systems for which there is a considerable body existing software. To support legacy applications the system includes the concept of a bus-adaptor. In particular an adaptor which can accept a virtual description object and provide access to the data so described, as if it were embedded in a conventional file. This can be accomplished through modification to the operating system’s file system (requiring kernel modifications in a great many operating systems) or, for those systems which support NFS (the great majority) a program which can masquerade as an NFS server can deliver the appropriate functionality. The design of NFS bus adaptors is explored in [Patten 1998].

Where data must transit long distances and perhaps cross networks, which may involve changes of packet size and transit of firewalls and other security hurdles, a second type of bus adaptor may be valuable. These act as data source and sinks for data object descriptors and are responsible for negotiating the conduit of data. In situations where data may be required to be repacketed into packet sizes different to those used internally, the network bus adaptor can specify a new target data space architecture that can be used to further remap data for best performance over the network. Over congested or unreliable networks where packets may be dropped this may have significant performance benefits.

Example of System Use

Here we present a simple example of the entire system. Our starting point is a massive data repository holding satellite data. Conventionally such a store would hold each image as separate file, possibly accessible through a hierarchical file system (such as SAM-FS.)

The external DISCWorld control system is responsible for locating the data, and determining the nature and identify of the compute servers to perform the computation. The data access request is translated into a request for a bulk data object tile. In the example the object is simply the satellite identity, and is the identifier for a single bulk data tile within the storage system. The description of the needed data is a tile within this bulk tile, describing sub arrays corresponding to the subsets of the satellite images. These subsets include the date/pas range and the sub-tiles of the images needed for the geographical coverage.

For each computational step retilings may be needed for a number of reasons. One is to cope with changes to the data due to the processing steps (for instance a reduction of the number of effective spectral channels). Another is to describe the data layout across the memory used in any parallel of distributed computation. The combination of all these tilings is computed and used by the internal retiling mechanisms of the bulk data transfer bus to effect data movement.

Tape Layout

Access to data on tape (we assume the use of a robotic silo with multiple drives and few, or a single robotic mechanism) is a particularly brutal example of latency effects in data access. Typically a robotic mechanism takes of the order of 15 seconds to locate and load a tape into a free drive. Even with the most high performance drives, an average seek time to the needed data block will be 30 seconds. Attempts have been made to stripe data across multiple tape drives, [Golubchik 1995] however the utility is significantly limited. A tape silo is limited to a single robotic mechanism, and thus tape loads must proceed sequentially. Parallel loads can occur with multiple silos. However, tapes cannot easily cross between silos, complicating management. Striped tape systems are currently limited to regimes where the large aggregate transfer rates on very large contiguous data sets is enough to amortise the effect of very large initial access latency.

Placing data on multiple tapes as a sequence of tiles that are described within a large multi-dimensional tiling, as described above, allows for access that suits the needs of many GIS applications. In the example above, it would be expected that the individual tiles required for the computation would be scattered across the tapes. By having access to the exact tiling of the tapes, the required data and the tape silo (reflecting available tape drives) the tape sub-system can determine ahead of time exactly which tapes will be needed and upon them which blocks. It may then schedule a single pass over each tape picking off exactly the required tiles. Furthermore, if multiple tapes must be read, and multiple drives are free, it can schedule the tape loading to take best advantage. Data streaming from each drive is simply passed into the bulk data transfer bus where a suitable retiling takes the data streams and tiles them onto the next stage, ideally delivering them directly to the system memory of the individual participants of a parallel computation, in exactly the required layout for computation to proceed.

High performance drives are often capable of rapidly seeking to locations on tape. Helical scan drives such as the Sony GY2120 or ID-2 place control information on serial tracks on the edges of the tape and can read this data whilst forwarding the tape at high speed. An optimal tile size for such tapes is a balance of the time to transit from read to fast forward and back to read again versus the data transfer rate whist reading. Less capable tape drives may have to pay an additional cost of reading past unwanted data until the exact tile needed is encountered.

Disk Layouts

Disk layouts embody a similar set of tradeoffs to tapes, but in a different combination. Striping across disks is a trivial tiling. The ability of the k-tiling to express duplication of data in a retiling allows for the creation of mirrored data. In general tile sizes will be dictated by tradeoff between seek times and block sizes. In a manner similar to tape accesses, the retiling can order access requests so that they sequentially traverse the disk. This may not yield quite the performance benefits seen with tape (partly because modern disk drives often perform scheduling internally and the real geometry of a disk is hidden) but will be of benefit. Again a disk array is expressed as a tile space, and direct knowledge of this space is made available through the data transfer bus providing for seamless transfer of data directly to its destination.

Computational Layouts

Finally user level programs often express aspects of data layout. This is normally to provide control of layout so as to ensure a minimum of inter-processor communication during a computation. Such layouts directives are a subset of those available in the k-tiling system. To fully integrate the transfer bus into user level programs it will be necessary to gain access to the tiling descriptions used in the individual compilers and translate these to a k-tile

Limitations

The system as described does not cover a number of interesting issues. The K-Tiling format does not provide for sparse data (although Fraser’s system does) and there is no simple way to express some simple computations. This may seem a curious criticism, however it may be noted that an ability to express very simple arithmetic operations would allow the complete description of all current RAID paradigms entirely as a remapping. Currently the generation of parity must be expressed as a computation outwith the remap.

The Bulk Data Bus, as it stands, does not address any aspects of data resiliency, nor does it express data replication. This is especially important when intermediate results may be usefully cached. Currently it is expected that the higher level control system will control caching, resilience and replication. This may not be appropriate, and an extention to the tiling mechanism to allow cloning of data during a remap to two separate target maps may prove valuable.

Conclusions

A bulk data transfer bus, based upon canonical tiling algebra provides the capability to integrate the layout and transfer of data through all levels of a large computational environment. It can control the layout and access to tapes, tape silos, disk, disk arrays, networks of distributed computations, including computations on networks of workstations, distributed memory supercomputers and other parallel designs. An end to end optimisation is possible. The descriptors for such data act as mobile end-points to data transfer. The design subsumes the file system and tape management systems used in conventional systems, replacing them with a combination of data naming thorough high level search predicates and addressing through the tile algebra.

Acknowledgements

This work is being carried out as part of the Distributed High Performance Computing Infrastructure (DHPC-I) project of the Research Data Networks Cooperative Research Center (RDN CRC) and is managed under the On-Line Data Archives Program of the Advanced Computational Systems CRC. RDN and ACSys are established under the Australian Government's CRC Program.

Bibliography

"K-Tiling: A Structure to Support Regular Ordering and Mapping of Image Data" Oscar Bosman, Peter Fletcher and Kenneth Tsui, Proc. Australian Pattern Recognition Society Workshop on Two and Three Dimensional Spatial Data: Representation and Standards, December 1992, Perth, Western Australia.

"DAP Series - Parallel Data Transforms"', P.M.Flanders, Active Memory Technology (formerly ICL DAP Division), 1988.

"Recification of multichannel images in mass storage using image transposition" D Fraser. Computer Vision, Graphics and Image Processing 29:23-36, 1985

"File Systems Support for Legacy Applications Embedded in DISCWorld", C.J.Patten, F.A.Vaughan, K.A.Hawick and A.L.Brown, Proc. 5th IDEA Workshop, Freemantle, February 1998, Technical Note DHPC-032, January 1998.

"Globus: A Meta-computing Infra-structure Toolkit", Ian Foster and Carl Kesselman, http://www.globus.org, 1996.

"Analysis of Striping Techniques in Robotic Storage Libraries", L.Golubchik, R.R.Muntz and R.W.Watson, Proc. 14th IEEE Symposium on Mass Storage Systems, Monterey California, September 11-14, 1995, P225.

"DISCWorld: An Integrated Data Environment for Distributed High-Performance Computing", K.A.Hawick, A.L.Brown, P.D.Coddington, J.F.Hercus, H.A.James, K.E.Kerry, K.J.Maciunas, J.A.Mathew, C.J.Patten, A.J.Silis, F.A.Vaughan. Proc. 5th IDEA Workshop, Freemantle, February 1998, Technical Note DHPC-027, January 1998.