Compression of Image Data and Performance Tradeoffs for Client/Server Systems
J.F.Hercus and K.A.Hawick
Email: {james,khawick}@cs.adelaide.edu.au
Department of Computer Science, University of Adelaide,
SA 5005, Australia
23 January 1997
We discuss issues relating We to the compression system in the distributed image repository of the DISCWorld system. We The distributed image repository is intended for the distributed storage and retrieval of large scientific imagery datasets. Given the large and regular nature of this type of data there is considerable scope for the application of data compression techniques in the system to reduce storage requirements and data transmission times.
We develop a framework for modeling the performance of the compression system within the DISCWorld, with the intention of applying this to and in the design of the system. The model is applied to two example situations to illustrate its application in the design and optimisation of the repository.
This section discusses the requirements of compression algorithms in the DISCWorld system[2] and describes the criteria which can be used to evaluate the suitability of algorithms in a distributed system.
The reasons for using data compression in the DISCWorld are twofold. By storing an image, or any other data, in a smaller amount of space, we can reduce the total amount of storage required in the system and also the transmission times for individual data items. Transmission includes transfer to and from storage devices as well as across networks. The corollary of reduced storage requirements is the capacity to store more data, which applies to the various levels of cache in a distributed system as well as to the primary store. These benefits come at the cost of the increased processing required to convert the data between the compressed format(s) in which it is stored in the system and the uncompressed format in which it is used, and of impaired flexibility of access to the data. This last cost can effectively be expressed as the additional processing time required to overcome the rigidities imposed by compressed formats. These benefits and costs, in broad terms, define the trade-off space which must be considered in the design of a compression system and associated algorithms for the DISCWorld.
The two reasons stated above for using compression constitute the primary quantitative criteria for assessing the compression system. To these must be added the time required to add data to the system, which is essentially the compression time. Of themselves none of these criteria are particularly involved or difficult to quantify. However, they are made more complex because of the other important requirement of the system, flexibility of access, which is perhaps the most important requirement of all. The need to provide for a wide variety of access patterns, and the trade-offs between this requirement and the others mentioned above complicates the measurement of the criteria, particularly access times.
Another interesting issue raised by parallel processing is parallel compression algorithms. Decompression times could be drastically reduced if the decompression were shared among numerous processors, perhaps even in such a way that each processor decompressed that data which it subsequently required in a parallel computation. This would add computation overheads and also inter-process communication overhead, reducing the theoretical efficiency. However, the actual decompression time could be considerably shorter and if the additional processors used in the parallel decompression would have been idle otherwise, which is often the case, then the reduced efficiency is irrelevant.
One particular data set which the DISCWorld will be used to store is geostationary satellite imagery, in particular an archive of data from the Japanese GMS5 satellite[3]. Most such data can be thought of as a four dimensional data set where two dimensions are spatial, one is spectral and the last is temporal. In the case of the GMS5 data, the temporal frequency of the images is about one image set per hour, each image set consists of four images from different spectrums and the individual images are 2291x2291 pixels (about 5.4M uncompressed).
All of the general compression requirement above apply to this data set. Below we discuss some issues which are specific to this type of data set.
Since the satellite data set is four dimensional there are opportunities to improve the compression algorithm beyond what is possible for normal 2D images. Broadly, there are two ways in which this can be done, one per extra dimension. The first of these is to exploit the similarities between the images in different spectrums from the same time. This is not possible for all spectrums; some of them have virtually no correlation with other images in the set. When there is considerable correlation it should be possible to use this to obtain compression ratios higher than could be obtained for the images compressed in isolation. This approach also has disadvantages. If two or more images are compressed as a single unit then it may be more expensive to extract a single image than would otherwise be the case.
The second way in which the compression could be improved for this data set is to try and exploit similarities between images from different times; either images either side of the one in question or perhaps from 24 hours earlier. It might be possible to find useful correlations between these images. This approach is complicated by the nutation of the satellite in its orbit which leads to the shifting of the earth disc within the image. Corresponding pixels in two images from different times do not necessarily correspond to the same region of the earths surface. Using the satellite location information in the metadata associated with the images it is possible to partially correct for this effect by calculating offsets between any two images. Furthermore, at least for this particular data set, there are less correlations between temporally separated images than between spectrally separated ones.
Another strategy to improve the compression system which was mentioned above is to isolate common elements of the elements of a data set and to store these separately to the remainder of the data. In the case of the satellite data set this common data might be statistical information which is common to the whole data set (or large parts of it, one spectrum for instance) or some type of spatial template which had features common to the whole data set.
This strategy is particularly useful in the distributed context. The common data described above could be duplicated at many (or every) site in the system, thereby reducing the volume of data which would need to be transmitted across the network. As described below, it would also mean that during an access to the repository some processing could be performed at the client end prior to data arriving from the server, thereby improving total access times.
When designing a compression system it will be useful to be able to obtain quantitative measurements of the criteria discussed above. However, as can be seen from the previous section, they are sufficiently complicated by the other requirements of the system that it would be very difficult to obtain a single number measurement of how well the system satisfied the criteria.
An alternative to this approach is to develop a model of the system which can incorporate most of the complexities found in it. This model can then be used to obtain measurements (or estimates) of the performance of the system in any given situation. It can also be used in the design of the system, for performance optimisation and estimation and for selection and design of compression algorithms.
To begin with, consider the following sequence of tasks which are to be performed when a user initiates a request for some data object which might be, for instance, a low resolution image for preview purposes or a full resolution image.
In the discussion which follows: T = Time, D = Data Size, L = Latency and B = Bandwidth. For subscripts: numbers 1 to 6 refer to the steps above, T = total and D = decompression.
Given the above series of steps, the simplest cost model is:
![]()
This is very simplistic and is not satisfactory. We can do better by considering latency and bandwidth separately. This will reveal the relationship between the steps in the process and permit investigation of possible pipelining of operations. For the steps which involve processing, we use a linear cost model, and characterise them by fixed and proportional cost (C and M) using the equation
![]()
To do this it is necessary to add some detail to the above sequence of steps.
We now apply this model to a simple example in which a user makes a request for a complete image from the repository. The image is not cached anywhere and must be retrieved from a remote server in the manner described above. Assume that the required image resides in the data store of the server in the required format (but possibly compressed) such that no processing is required at either the client or the server, with the possible exception of decompression at the client if the image is compressed. Therefore, step 4 is empty and step 6 involves only decompression. Steps 3 and 5 involve simple data transfers and therefore their time performance would usually be characterised in terms of latency and bandwidth. For this example we will represent this as a simple linear equation as described above. For step 3 this equation is:
where
is the data volume,
is the latency of the transfer
and
![]()
where
is the bandwidth of the link. The model for step 5 is the same.
The time performance of most decompression algorithms can be characterised by the formula
![]()
where
is the volume of data,
is the startup cost of the
decompression program and
measures the speed of the decompresser
(it is the time required to decompress a single byte). Most
decompression algorithms have a very low fixed startup cost, provided
that they are implemented in a pipeline fashion without large scale
buffering of output. (The corresponding compression algorithms, by
contrast, frequently have high startup costs due to analysis of the
image prior to commencing encoding). We will assume the startup cost
for the decompression in step 6 to be 0, ie
.
Finally, for this example we will assume that the three stages discussed above are pipelined; the fixed cost occurs at the beginning of the stage, after which the subsequent stage may commence. Given this, we wish to find a formula for the time performance of the entire operation which looks like
![]()
To do this we must first make the above individual equations equivalent by
expressing all of the data transfer rates in terms of the same amount of data.
This choice is arbitrary but from the users perspective the most useful figure,
and that used below, would be the data transfer rate of uncompressed data. Ie
. Rewrite equation 3 above as
![]()
where
![]()
and
![]()
and
is the compression ratio.
Now we can write
![]()
and
![]()
So,
![]()
To apply this equation to the design of the compression algorithm we
must try to minimise the max term. Of the first two components of this
term, the second is likely to be the larger as
involves a
network and
, a disk. Therefore given a monotonic relationship
between
and
, the optimal point occurs when
or
This is a reasonably intuitive result which shows how the model can be used to analyse the tradeoff space for the time performance of the compression algorithms. It does not yet address the other main criteria, storage space.
At present the satellite image data set is stored in our archive in the format in which it is distributed, HDF (Hierarchical Data Format) files compressed with the UNIX Compress utility. To convey a brief impression of the tradeoff space involved in the design of the compression system, this algorithm is compared with two lossless image coders: lossless JPEG[1] and CALIC[4]. Lossless JPEG is the standard for lossless image compression specified by the JPEG committee. It is far from being the state of the art, but is a reasonably simple algorithm which provides reasonable compression at reasonable cost. CALIC is a more expensive algorithm which is close to the state of the art in lossless image compression (as measured by compression ratio). It is one of the submissions to JPEG for the next generation of lossless image compression in this standard.
Table 1 gives the relevant figures for these algorithms. These figures are for compressing an image from the IR1 spectrum of the GMS5 dataset. The lossless JPEG algorithm has 7 modes, the figures below are the one which gives the best compression ratio. These figures were derived from running the algorithms on a Ultra 2170 with 2*167MHz UltraSPARC processors and 256M ram.

Table 1: Comparison of lossless compression algorithms
Given this information, we can apply the model developed above to some example
operations. Consider making a request over a wide area network. Data transfer
rates between Adelaide and Canberra across AARNET at an inter-peak time are
around 350 KB/s. This gives a value of
. Given this value,
the optimal total transfer time would be provided by an algorithm with a
slightly higher decompression time and lower compression ratio than the lossless
JPEG algorithm. At the other en of the network bandwidth spectrum, consider a
modem connection which is about 100 times slower than the above example. In this
situation the best algorithm would be slower than CALIC and provide better
compression.
These examples illustrate another desirable feature of the compression algorithms which are employed in the repository. It would be useful to be able to parameterise the algorithm such that compression ratios and processing times could be optimised for particular applications. However, this issue is more complex than the selection or optimisation of an algorithm based on the model above. The parameters of the compressed image, and therefore decompression times, are set when the image is compressed. Instead of algorithm modifications, the model could also be used to design parallel decompression algorithms and to optimise them for a particular network.
Other issues that arise in a real system include the relative economic cost of the system components. In the above analysis the tradeoff space has been explored as if the only consideration were the time to complete. In some time critical operations, particularly those in military applications with effectively unlimited budgets, this is the situation. In other application domains, budgetary cost is important. This can be complicated by the need to utilise legacy resources that effectively cost the organisation nothing compared to newer components. The tradeoff between storage capacity cost and response cost, processing response time and network delivery time is then no longer entirely technical.
This work is being carried out as part of the Distributed High Performance Computing Infrastructure (DHPC-I) project of the Research Data Networks Cooperative Research Center (RDN CRC) and is managed under the On-Line Data Archives Program of the Advanced Computational Systems CRC. RDN and ACSys are established under the Australian Government's CRC Program. The authors thank Don Bone of CSIRO for helpful discussions on this work.
Technical Report DHPC-029
Compression of Image Data and Performance Tradeoffs for Client/Server Systems
James Hercus