Archived: 11 July 2007
University of Adelaide Masters by coursework thesis, June 2006.
Supervisor: Paul Coddington
Over two billion biological specimens are stored in hundreds of collections around the world. Biological specimen data are very valuable to biologists and environmental scientists. These data are becoming available online through web-based distributed specimen databases such as Australia's Virtual Herbarium (AVH) and the Global Biodiversity Information Facility (GBIF). To provide acceptable data quality for users, the specimen records should be validated and cleaned of errors.
This project involves the design and implementation of three kinds of services for validating and cleaning biological specimen data: the XSD Field Validation Service, the Cluster Analysis Service, and the Gazetteer Services. The XSD Field Validation Service checks for basic data mistakes such as invalid data types and incorrect data ranges. The Cluster Analysis Service uses a data mining approach to check for geocode errors and taxonomic errors. The Gazetteer Services check for inconsistencies between the geocode and the state name, generate the nearest place name for a given geocode, and validate the locality text against the geocode. Building on these three services, the system integration process publishes some services as Web Services for the AVH programs and provides others as Web Interfaces for the herbarium data custodians.
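As an illustration of the kinds of checks described above, the following is a minimal sketch, not the thesis implementation. It combines a basic range check on coordinates with a geocode-versus-state consistency check. The names `Specimen`, `STATE_BOXES`, and `check_record` are hypothetical, and the bounding boxes are rough approximations chosen for the example; a real gazetteer service would use proper state boundaries.

```python
from dataclasses import dataclass

# Hypothetical, very rough bounding boxes for illustration only:
# (min_lat, max_lat, min_lon, max_lon) in decimal degrees.
STATE_BOXES = {
    "South Australia": (-38.1, -26.0, 129.0, 141.0),
    "Tasmania": (-43.7, -39.5, 143.8, 148.5),
}


@dataclass
class Specimen:
    """A minimal specimen record (assumed fields, not the AVH schema)."""
    state: str
    latitude: float
    longitude: float


def check_record(rec: Specimen) -> list[str]:
    """Return a list of error messages; an empty list means no error found."""
    errors = []

    # Field-level range checks, similar in spirit to schema validation.
    if not -90.0 <= rec.latitude <= 90.0:
        errors.append("latitude out of range")
    if not -180.0 <= rec.longitude <= 180.0:
        errors.append("longitude out of range")

    # Consistency check between the geocode and the recorded state name.
    box = STATE_BOXES.get(rec.state)
    if box is None:
        errors.append("unknown state")
    elif not errors:
        min_lat, max_lat, min_lon, max_lon = box
        inside = (min_lat <= rec.latitude <= max_lat
                  and min_lon <= rec.longitude <= max_lon)
        if not inside:
            errors.append("geocode outside state")

    return errors
```

For example, a record geocoded near Hobart but labelled "South Australia" would be flagged as "geocode outside state", while a record near Adelaide with the same label would pass.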
This thesis describes in detail the design and implementation of these services and their integration into the AVH version 3 prototype. The future use of these services in more general applications, such as GBIF, is also discussed.