Improving the geospatial consistency of digital libraries metadata

Consistency is an essential aspect of metadata quality. Inconsistent metadata records are harmful: given a themed query, the set of retrieved metadata records would contain descriptions of unrelated or irrelevant resources, and would even omit some resources whose relevance is obvious. The situation is even worse when the description of the location is inconsistent. Inconsistent spatial descriptions may yield invisible or hidden geographical resources that cannot be retrieved by means of spatially themed queries. Therefore, ensuring spatial consistency should be a primary goal when reusing, sharing and developing georeferenced digital collections. We present a methodology able to detect geospatial inconsistencies in metadata collections based on the combination of spatial ranking, reverse geocoding, geographic knowledge organization systems, and information retrieval techniques. This methodology has been applied to a collection of metadata records describing maps and atlases belonging to the Library of Congress. The proposed approach was able to automatically identify inconsistent metadata records (870 out of 10,575) and to propose fixes for most of them (91.5%). These results support that the proposed methodology could assess the impact of spatial inconsistency on the retrievability and visibility of metadata records and improve their spatial consistency.


Introduction
Geospatial information, i.e., information that references a place, is a core component of Digital Libraries (DL). It helps to reveal unknown spatial patterns, increases the recall of information retrieval systems, and enhances the real-world experiences of users, since most events can be visualized, explained, and understood in geographic terms [1]. Libraries have traditionally included geospatial information, and Geographic Information Retrieval (GIR) systems have been developed to perform spatial queries on metadata [2,3]. The University of California library, the Library of Congress (LoC) and the National Archives of the United Kingdom are good examples of the relevance of geospatial information in digital libraries and national archives. For example, Petras [4] analysed around 5 million records from the University of California library catalogue and found that approximately 35% of the records contain data in MARC21.

The problem of metadata consistency has drawn research interest [17][18][19][20][21]. In order to detect spatial inconsistencies, [22] proposed the hypothesis that geospatial clusters could reveal an implicit consensus among documentation experts for identifying some geographic areas. This hypothesis uses geospatial clustering and Knowledge Organization Systems (KOS) to compare Indirect Spatial References (ISR) from unstructured content with Direct Spatial References (DSR) from structured content of metadata. The consensus depends on the traditions, values, interests and particular goals of the community involved in each digital library, and hence it could even be specific to each cluster. Therefore, homogeneous and distinct clusters that spatially group metadata records could provide clues for validating their members and detecting inconsistencies among them.
This work presents and formalizes a semi-automatic methodology for digital curation processes, particularly processes involved in preservation tasks of digital repositories of cartographic materials. To do so, this work extends the methodology proposed in [22] by adding a double validation process that improves inconsistency detection results. Additionally, the methodology has been updated to be able to process online spatial metadata collections.
This work presents the application of the extended methodology to a real use case that analyses more than 42,000 MARC21 metadata records describing spatial resources retrieved from the LoC. The results provide a quantitative view of problems related to resource discovery, invisibility and retrieval of such metadata records, and therefore of interoperability consequences that may affect distributed digital library systems. This paper is organized as follows. Section 2 discusses related work on exploiting location in the context of digital libraries. Section 3 introduces the inconsistency detection methodology. Section 4 describes an experimental and quantitative study and presents the results. Section 5 discusses the main results of the study. Finally, Section 6 concludes the paper and outlines our ideas for future work.

Related work
A growing number of digital library projects are working with georeferenced data and metadata to take advantage of the ubiquity and popularity of widely available geographic services; an example of such works citing previous geospatial digital library projects is shown in Figure 2. The most relevant digital library projects experimenting with georeferenced data and metadata are focused on three main areas: information visualization, geographic information retrieval and information validation. Some works focusing on information visualization are the Geo-Referenced Information Network, the Electronic Cultural Atlas Initiative (ECAI) [23], the Old Maps Online [24] and the Alexandria Digital Library (ADL) [25], probably one of the most widely cited research projects that made use of georeferencing in the context of digital libraries. They are focused on representing, exploring and browsing digital collections on a map, and some of them offer additional search functionalities. The area of geographic information retrieval deals with the disambiguation of place names based on internal and external evidence from the text content of metadata. Internal evidence includes the use of generic geographic labels or the linguistic environment. External evidence includes knowledge organization systems, gazetteers, biographical information, and general linguistic knowledge [25][26][27]. Some works in this area are the Spatially-Aware Information Retrieval on the Internet (SPIRIT) [27], the Geographic Awareness Tool (GAT) [15], MapRank [29] and the Old Maps Online. Works on information validation are focused on data and metadata quality; we centre our attention on this last kind of work. Metadata quality is a semantically slippery term. Park [30] suggested that the most commonly accepted criteria for metadata quality are completeness, accuracy, and consistency. Our work is focused on the last criterion.
Relevant works in the literature during the last decades confirm this perception [31][32][33][34][35][36]. Most of the cited works recognize in different ways the metadata quality problems, and they remark the need to bridge the gap between the explicit geospatial information included in the metadata and the georeferenced information that was not explicitly labelled as such. Their main difference with our work lies in the quantitative evaluation of the problem. We present a quantitative study of geospatial inconsistency problems in metadata focused on the libraries domain.
There are some works with a quantitative approach. Tolosana-Calasanz et al. [37] developed a quantitative method and performed a statistical analysis for assessing the quality of geographic metadata. The authors first formulated a list of geographic quality criteria by consulting domain experts. The identified criteria indicated quality preferences. The authors also noticed the need to ensure the completeness of the spatial fields in order to guarantee a minimum level of quality. Ma et al. [38] presented a study about the quality assessment of metadata on the Internet Public Library. Their work is based on a combination of human evaluation (qualitative) and automatic evaluation (quantitative). The qualitative method gave an indication of the quality of information by rating accuracy, completeness, consistency and functionality. These works differ from the approach presented here because their quantitative methods only measure the completeness of metadata in the collection; our approach, in contrast, is focused on evaluating spatial consistency quantitatively, that is, we use spatial best matches to find and measure inconsistencies. Regarding the clustering focus of our approach, one of the most relevant related works is Hays and Efros [39]. Their work also uses the idea of implicit consensus of spatially co-occurring resources to estimate the location of an image. One of the main differences with our work lies in the final use of the consensus: Hays and Efros use the consensus to estimate a geographic location, whereas the work presented here uses it to detect geographic inconsistencies. Works such as [29,30,40] estimate the geographic location of resources (text, images, etc.).
They use classifiers and knowledge from social datasets to disambiguate references to locations, generally using textual context; however, their approach does not take into account information from a wider spatial context, such as the spatial descriptive consensus provided by co-occurring metadata from neighbours (e.g. a cluster).
Here we show the utility of incorporating spatial knowledge from co-occurring metadata descriptions into inconsistency detection systems. Many of the cited works develop geolocation approaches based mainly on text. They also suggest that an interesting extension of their work would be to rely on the natural clustering of related documents. This is the focus of the research presented here: we take advantage of spatial co-occurrences found in metadata.

Methodology
This section presents our extension of the proposal of [22] to detect geospatial inconsistencies in DL metadata. A general outline of the process is shown in Figure 3. Our methodology uses the principles proposed in [42]. Its main insight is the use of KOS combined with geospatial ranking functions to find the most relevant toponyms associated with a footprint and then compare them with the place names described in the metadata. We integrate this idea with two-dimensional spatial clustering to refine the detection of spatial inconsistencies in other domains such as DL. The resulting extended methodology has six main steps: Harvesting, Geo-Extraction, Reverse Geocoding, Spatial Clustering, Metadata Validation, and Report Generation. These steps are described in detail in the following subsections.

Harvesting
The first step is the harvesting of metadata records that may contain geospatial information in the form of geographic coordinates. This step is optional if we have access to some kind of metadata dump. The process of metadata harvesting consists of collecting metadata descriptions stored in digital repositories using protocols such as OAI-PMH [42] or Search/Retrieve via URL (SRU) 2 . In our case we use the SRU protocol. The SRU protocol offers three types of operations: explain, scan and searchRetrieve. Our methodology uses the last one. The searchRetrieve operation allows submitting a query using the high-level Contextual Query Language (CQL) and retrieving the list of items that match the query [43]. SRU has no explicit geographical information retrieval support. Hence, we formulated a heuristic based on string patterns to create queries that can retrieve metadata with information about the geographic extent of the resource, specifically DSR. This heuristic is based on the recommended procedures of the Map Cataloging Manual 3 of the LoC and allows us to harvest those metadata records that have been created according to that manual. For example, the query "W12*" in CQL can match a coordinate referencing a point whose longitude is between W120 and W129, so to retrieve metadata that may contain such information we formulate a query with this pattern. The response always reports the total number of records that match the query. The maximumRecords parameter sets an upper limit on the number of records returned; some systems may ignore its value if it exceeds an internal constant. Thus, in order to retrieve all matching records, we repeat the query, modifying the value of startRecord, which provides a means to page through large result sets, until all matching records have been retrieved.
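The paging logic described above can be sketched as follows. The endpoint URL and the CQL index are hypothetical placeholders, not the actual LoC configuration; the parameter names (version, operation, query, startRecord, maximumRecords) follow the SRU searchRetrieve specification:

```python
from urllib.parse import urlencode

# Hypothetical SRU endpoint; the real endpoint and record schema may differ.
SRU_ENDPOINT = "http://example.org/sru"

def build_sru_url(cql_query, start_record=1, maximum_records=500):
    """Build a searchRetrieve URL for one page of results."""
    params = {
        "version": "1.1",
        "operation": "searchRetrieve",
        "query": cql_query,
        "startRecord": start_record,
        "maximumRecords": maximum_records,
    }
    return SRU_ENDPOINT + "?" + urlencode(params)

def page_starts(total_records, page_size):
    """startRecord values needed to page through all matching records."""
    return list(range(1, total_records + 1, page_size))

# Example: a server reports 1,234 matches for the longitude pattern "W12*".
urls = [build_sru_url('dc.coverage = "W12*"', start, 500)
        for start in page_starts(1234, 500)]
```

Each URL in the list retrieves one page; the harvester issues them in sequence until all matching records have been collected.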
Using this heuristic, the harvesting module retrieved all MARCXML 4 and MODS 5 metadata records matching this type of query (see examples in Figure 4) in a range from 180º East to 180º West and from 90º North to 90º South by generating the appropriate query patterns. The system then verified whether the retrieved metadata records effectively contain geographic coordinates. The output of this process is the set of metadata records, retrievable through the SRU endpoint, that contain an explicit DSR following well-known cataloguing rules.

Geo-Extraction
The geo-extraction step applies to the harvested metadata records or, if already available, to a metadata dump. This step is a geospatial Extraction, Transformation and Load (ETL) process [44]. This module extracts and homogenizes the Direct Spatial References (DSR) encoded in MARC21 metadata records. It also extracts the Indirect Spatial References (ISR) from textual place name fields. In MARC21 metadata, a DSR has the form of a bounding box and can be found in the field "034 - Coded Cartographic Mathematical Data". A bounding box is a pair of latitude/longitude pairs that defines the northern, southern, eastern and western extremes of a geographic region. The ISR is the place name; it can be found in the field "651 - Subject Added Entry - Geographic Name" or sometimes in the title field. The output of this process is a stream of metadata records annotated with the extracted DSR (explicit bounding box) and ISR (place name). Metadata records without DSRs, ISRs or both are counted as incomplete metadata by the statistics module. Incomplete metadata records are not taken into account for further processing because the purpose of this workflow is the identification of inconsistencies between DSR and ISR in metadata records. A particular observation in the context of this step is that MARC21 contains several fields that can encode different aspects of direct/indirect spatial references, including different ways to associate geographic codes or to express the geospatial reference method used for the coordinates in the direct spatial references. The geo-extraction step focuses on the bounding box field, since it was the most frequent DSR field in the dataset analysed. One potential drawback of this approach is that erroneous interpretations of the coordinates given in the bounding box associated with a particular resource may be due to problems in accounting for geospatial referencing systems. To deal with these issues, we manually examined the detected inconsistencies.
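As an illustration of this extraction step, the sketch below converts the hemisphere-degrees-minutes-seconds coordinate strings commonly used in MARC21 field 034 (e.g. "W0963000" for 96º30'00" West) into signed decimal degrees; the sample subfield values are invented for illustration:

```python
def parse_marc_coordinate(value):
    """Convert a MARC 034 coordinate such as 'W0963000' (hemisphere,
    degrees, minutes, seconds) into signed decimal degrees."""
    hemisphere, digits = value[0].upper(), value[1:]
    degrees = int(digits[0:3])
    minutes = int(digits[3:5]) if len(digits) >= 5 else 0
    seconds = int(digits[5:7]) if len(digits) >= 7 else 0
    decimal = degrees + minutes / 60.0 + seconds / 3600.0
    return -decimal if hemisphere in ("W", "S") else decimal

def extract_dsr(subfields):
    """Build a (west, east, north, south) bounding box from the $d-$g
    subfields of field 034 (westernmost, easternmost, northernmost,
    southernmost coordinates)."""
    return tuple(parse_marc_coordinate(subfields[code]) for code in "defg")

# Simplified 034 subfields for a map covering 96.5ºW-95ºW, 40ºN-41ºN.
box = extract_dsr({"d": "W0963000", "e": "W0950000",
                   "f": "N0410000", "g": "N0400000"})
```

A real extractor would also need to handle decimal-degree variants and malformed values, which is precisely where the syntactic inconsistencies discussed later originate.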

Reverse geocoding
This step is a conversion process from a reference system based on coordinates (i.e., a bounding box) into a reference system based on geographic identifiers. The goal is to find the best ISR (place name) for the geographic region covered by the DSR (explicit bounding box). For this task, we use the reverse geocoder described in [22]. This reverse geocoder uses the Hausdorff distance [45] to measure the geospatial similarity between the geometrical shape of a DSR and the geographic extent of entities belonging to a geospatial KOS. This metric can be adapted to different types of metric spaces by using different internal distance metrics. In the case of geospatial coordinates, there are better alternatives than the default Euclidean distance as an internal metric; in particular, we used the geodetic distance. The mathematical expression of the Hausdorff distance is shown in Eq. (1):

d_H(X, Y) = max{ sup_{x∈X} inf_{y∈Y} d(x, y), sup_{y∈Y} inf_{x∈X} d(x, y) }    (1)

where X and Y are two non-empty subsets representing the points that describe a polygon, sup represents the supremum, inf the infimum, and d(x, y) is the geodetic distance between a pair of latitude/longitude points. The value of the Hausdorff distance is used as a spatial ranking score for the most relevant entities, in a similar way to the work described in [46]. This module annotates each processed metadata record with the list of entities that best describe its Direct Spatial Reference. The geographical KOS used consists of several public models, databases and KOS. Its main sources are available online: GADM 6 (its current version delimits 556,049 administrative areas, or 218,238 if only the lowest level of each country is counted), the U.S. Census Bureau 7 , Natural Earth Data 8 , and the National Oceanic and Atmospheric Administration 9 .
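A minimal sketch of this ranking follows, assuming a haversine approximation of the geodetic distance and candidate KOS entities represented as small point sets (real footprints would be much denser polygons, and the entity names here are placeholders):

```python
from math import radians, sin, cos, asin, sqrt

def haversine_km(p, q):
    """Approximate geodetic distance between two (lat, lon) points in km."""
    lat1, lon1, lat2, lon2 = map(radians, (*p, *q))
    a = sin((lat2 - lat1) / 2) ** 2 + \
        cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))

def hausdorff(X, Y, d=haversine_km):
    """Discrete Hausdorff distance between two point sets (Eq. 1)."""
    directed = lambda A, B: max(min(d(a, b) for b in B) for a in A)
    return max(directed(X, Y), directed(Y, X))

def rank_entities(dsr_points, candidates):
    """Rank KOS entities by Hausdorff distance to the DSR (smaller = better)."""
    return sorted(candidates, key=lambda name: hausdorff(dsr_points, candidates[name]))

# Toy example: corner points of a bounding box and two candidate footprints.
dsr = [(40.0, -96.0), (41.0, -96.0), (41.0, -95.0), (40.0, -95.0)]
kos = {
    "Entity A": [(40.1, -96.1), (41.1, -95.1)],   # nearby footprint
    "Entity B": [(30.0, -80.0), (31.0, -79.0)],   # distant footprint
}
best = rank_entities(dsr, kos)[0]
```

The entity with the smallest Hausdorff distance becomes the best-suggested place name for the bounding box.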

Geospatial clustering
We define a geospatial metadata cluster as a group of metadata records whose spatial references co-occur in the same area and have similar geographical extents. This step assumes that a cluster with such characteristics may reveal an implicit consensus among library experts about the spatial references that are most likely to be used to describe textually a geographic location in the area covered by the cluster. This idea serves to validate the spatial descriptions in metadata records and detect potential inconsistencies. This step uses the density-based DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise) [47] to compute spatial clusters, using as input the DSR (explicit bounding boxes) found in metadata records. DBSCAN has several advantages: it can recognize clusters with arbitrary shapes; it is not necessary to pre-define the number of clusters in the data; and it is an efficient algorithm for big collections of data [48]. As Wang et al. [49] summarize it, the key idea is to define a new cluster, or extend an existing cluster, based on a neighbourhood: the neighbourhood around a point of a given radius (Eps) must contain at least a minimum number of points (MinPts). Given a dataset D, a distance function dist, and parameters Eps and MinPts, DBSCAN is defined as follows. For an arbitrary point p, the neighbourhood of p is N_Eps(p) = {q ∈ D | dist(p, q) ≤ Eps}. If |N_Eps(p)| ≥ MinPts, then p is a core point of a cluster. If p is a core point and q is p's neighbour, q belongs to this cluster and each of q's neighbours is examined to see whether it can be added to the cluster; otherwise, point q is labelled as noise. The expansion process is repeated for every point in the neighbourhood. If a cluster cannot be expanded further, DBSCAN chooses another arbitrary unlabelled point and repeats the process. This procedure is iterated until all points in the dataset have been placed in clusters or labelled as noise.
In general, DBSCAN defines a cluster as a maximal set of density-connected data points, in which every core data point must have at least a minimum number of data points within a neighbourhood of a given radius. The input to the original algorithm can be made of points in a multi-dimensional space. The original DBSCAN algorithm assumes that the data to be clustered are points in a given space, whereas in our particular application we are attempting to cluster objects that are represented as bounding rectangles instead of points. We use an adaptation of the DBSCAN algorithm that employs the Hausdorff distance as the distance measure instead of the Euclidean distance [50,51] in order to work with bounding boxes. Many library metadata records contain a geographical extent that is a two-dimensional footprint. The use of the Hausdorff distance instead of the Euclidean distance allows clusters to be computed from two-dimensional data directly (bounding boxes, multi-polygons or complex geometries). In our approach, we normalize Hausdorff distance values to the interval [0, 1], where values close to 1 mean strong similarity (high geospatial matching) and values close to 0 mean strong dissimilarity or disagreement between the compared DSR (explicit bounding boxes). The similarity threshold value is 0.5. The normalization function is similar to the function described in [52].
An important issue here is the parameter setting. The DBSCAN algorithm uses three main parameters: the minimum number of elements inside a cluster (MinPts), the search radius (Eps), and the distance function. As a basic consideration, a cluster is a group of at least two elements; for this reason, the MinPts parameter is set to 2. The more complex selection is the Eps parameter. The DBSCAN algorithm is very sensitive to its parameters, especially to Eps. A small Eps value means that the search radius of the algorithm is shorter, and hence more restrictive, so the results will contain a large number of clusters, more compact and dense, and more noise. On the other hand, using a higher Eps value with the same value for MinPts, we obtain a small number of clusters that each aggregate more elements. In our work we use the values Eps=0.2 and MinPts=2. These values provide the best separability for co-occurring spatial objects, that is to say, objects that co-occur from the one-dimensional perspective (coordinates based on points) but georeference spatial entities of different levels/sizes (for example, a city, a province and a state centred on the same point but with different extents of coverage). The recommended technique for the parameter selection is described in [47], the same work where DBSCAN is introduced. It consists of generating a histogram with the sorted k-neighbour distances (Hausdorff distance in our case), with k being the desired value of MinPts. This distance is sorted (descending) and plotted; the histogram will show a descending curve. The authors suggest that the optimum value of the Eps parameter is the distance where the curve makes its first inflexion (or "valley"). The elements located on the left of this "valley" will be noise in the resulting partition, and the rest will be present in some of the resulting clusters.
The authors also state that choosing 4 as the default value of MinPts produces the best results in two-dimensional clustering. In our experiments, lower values, usually 2, obtained better results for the spatial ranking.
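The adaptation described above can be sketched as follows. The normalization function (d/(d+1)) and the plain Euclidean treatment of corner coordinates are simplifying assumptions for illustration, not the exact functions used in the study:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def corners(box):
    """(west, south, east, north) bounding box -> its four corner points."""
    w, s, e, n = box
    return [(w, s), (w, n), (e, s), (e, n)]

def hausdorff(X, Y):
    """Discrete Hausdorff distance between two point sets."""
    directed = lambda A, B: max(min(dist(a, b) for b in B) for a in A)
    return max(directed(X, Y), directed(Y, X))

def box_distance(b1, b2):
    """Hausdorff distance between boxes, normalized to [0, 1)."""
    d = hausdorff(corners(b1), corners(b2))
    return d / (d + 1.0)

def dbscan(items, eps=0.2, min_pts=2, distance=box_distance):
    """Plain DBSCAN; returns one cluster label per item, -1 meaning noise."""
    labels, cluster = [None] * len(items), -1
    neighbours = lambda i: [j for j in range(len(items))
                            if distance(items[i], items[j]) <= eps]
    for i in range(len(items)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = -1           # noise (may be reclaimed as a border point)
            continue
        cluster += 1
        labels[i] = cluster
        while seeds:
            j = seeds.pop()
            if labels[j] == -1:
                labels[j] = cluster  # former noise becomes a border point
                continue
            if labels[j] is not None:
                continue
            labels[j] = cluster
            more = neighbours(j)
            if len(more) >= min_pts:
                seeds.extend(more)   # j is a core point: expand through it
    return labels

# Two overlapping state-sized boxes and one far-away box.
boxes = [(-96.0, 40.0, -95.0, 41.0),
         (-96.1, 40.1, -95.1, 41.1),
         (10.0, 50.0, 11.0, 51.0)]
labels = dbscan(boxes)
```

With Eps=0.2 and MinPts=2, the two overlapping boxes form one cluster and the distant box is labelled as noise.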

Metadata validation
This step first computes, for each cluster, two sets of ISR (place names). The first set is the union of the place names generated by the reverse geocoding module for each metadata record belonging to the cluster. The second set is the union of the explicit place names in the metadata descriptions belonging to the cluster. Next, this step performs a dual validation process on each metadata record belonging to a cluster. This validation process verifies whether geospatial inconsistencies exist between the original ISR and the ISR generated by an external process (e.g. reverse geocoding or clustering); the first validation validates the ISR with respect to the geographical KOS, and the second with respect to the geospatial consensus provided by each cluster. Both validation processes are based on the concepts of the Vector Space Model (VSM) [53]. Given a metadata record of a cluster, they measure the similarity between the spatial description of the record and two vectors of place names associated with the cluster. The first validation measures the similarity between the vector of generated place names of the cluster and the vector of explicit place names of the metadata record. The second measure is similar, but it compares the vector of explicit place names of the cluster with the vector of explicit place names of the metadata record. In both cases, a metadata record is considered consistent if the similarity measure is greater than 50%, and inconsistent otherwise. This step also produces the best-suggested place name, that is, the generated place name with the best scoring match for the DSR analysed. Although we use the VSM for calculating the similarity, other metrics could be used to measure it [54].
In our work, the similarity between two vectors is assessed by the following expression:

consistency(t_i, g_i) = (t_i · g_i) / (‖t_i‖ ‖g_i‖)    (2)

where t_i is the vector of original place names of a metadata record belonging to cluster i, and g_i is the vector of place names generated by the reverse geocoder (for the first kind of validation) or the set of explicit place names in the cluster (for the second kind of validation).
Based on the distinct place names from these two vectors, a dictionary is constructed as {"t_1": 1, "t_2": 2, …, "t_n": n, "g_1": n+1, "g_2": n+2, …, "g_m": n+m}, with n+m=k, where k represents the number of distinct place names. We use the indexes of the dictionary to represent each vector as a new k-entry vector. Then we measure the level of consistency between the two normalized vectors by calculating the cosine of the angle between them using the common Eq. (3). A high consistency value for a metadata record indicates that the record is consistent: it may be consistent with the geographic references contained in the geographical KOS used by the reverse geocoder (for the individual validation), or with the set of explicit place names in the cluster (for the collective validation). This facilitates the detection of metadata records that are inconsistent with co-occurring metadata records in the cluster. In this step we analysed different string comparison methods, and we selected the simplest one: simple string matching that compares the searched place name with two fields provided by the KOS, the official place name and an alternative name (when it exists).
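A minimal sketch of this validation, assuming binary (presence/absence) vectors over the dictionary of distinct place names; the sample names are illustrative only:

```python
from math import sqrt

def cosine_consistency(record_names, cluster_names):
    """Cosine similarity between two sets of place names represented as
    binary vectors over the union of distinct names."""
    vocab = sorted(set(record_names) | set(cluster_names))
    t = [1 if name in record_names else 0 for name in vocab]
    g = [1 if name in cluster_names else 0 for name in vocab]
    dot = sum(a * b for a, b in zip(t, g))
    norm = sqrt(sum(a * a for a in t)) * sqrt(sum(b * b for b in g))
    return dot / norm if norm else 0.0

# A record labelled "Ohio" against a cluster consensus {"Ohio", "Franklin County"}.
score = cosine_consistency({"Ohio"}, {"Ohio", "Franklin County"})
consistent = score > 0.5   # the 50% threshold used by the validation step
```

Here the shared name "Ohio" yields a similarity of about 0.71, above the 0.5 threshold, so the record would be marked consistent; disjoint name sets yield 0 and would be flagged.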

Report generation
This last step reports the consistency of each metadata record with respect to its own geospatial information and with respect to its neighbours. Metadata records identified as consistent can be annotated as having a high quality value and linked to the place name from the geographical KOS used by the reverse geocoder. Metadata records identified as inconsistent can be annotated with an alert value to signal the need to review them. This information can be useful for curation managers [55]. All reported information is included in a general report produced by the Statistics Module. This module counts the number of incomplete metadata records and reports the kind of inconsistency found in the individual and collective validations. This report is used for the analysis of the results. An example is shown in Table 2.

Analysis and results
We tested our methodology by analysing the quality of 12,000 metadata records that describe resources in the United States of America. This is a subset of a larger collection of more than 42,000 metadata records retrieved from the LoC in May 2013. The collection was harvested by the process described in Section 3.1. All examples, experiments and results presented here are based on records available on that date; some records may have changed since then. Although the analysis has been restricted to the United States of America, the methodology can be applied to other places. For the experiments, we analysed and selected only the most frequent groups of elements in the dataset; the result is a set of metadata records whose DSR (bounding box) locates a state, a county, a city, a forest or a watershed. The distribution is shown in Figure 5.

Accepted for Publication
By the Journal of Information Science: http://jis.sagepub.co.uk

The validation processes of the methodology have helped to detect three kinds of inconsistencies: (1) syntactic inconsistency, (2) geospatial semantic inconsistency (geosemantic inconsistency), and (3) contextual inconsistency.
(1) Syntactic inconsistency. This kind of inconsistency is caused by logical problems in the codification. In addition to traditional logical consistency, a library with geospatial resources needs to verify a more complex consistency of its metadata, for example, according to the international standard ISO 19113 Geographic Information - Quality Principles 10 [56][57][58]. For example, the range of latitude and longitude coordinates needs to be checked: latitude must be between 90º North and 90º South, and longitude must be between 180º East and 180º West. In some cases, a simple query such as "Are the values of the latitude coordinate always between -90º and 90º?" can reveal a logical geospatial inconsistency. Our methodology reveals distorted (extra-long) DSR (bounding boxes), shown in Figures 6 and 7, without the need to check the coordinate values. These kinds of errors can be easily solved at the source by a careful conversion of the MARC21 coordinates; in many cases the results show that coordinates were encoded out of range in the descriptive data.
(2) Geosemantic inconsistency. This kind of inconsistency originates in the conceptual incoherence between the DSR (e.g. bounding boxes) and the ISR (e.g. place names) according to a specific KOS. There are three cases: micro-macro, macro-micro, and inconsistent. The micro-macro case happens when the ISR of the metadata record has a micro scope (e.g. a forest) but its DSR has a macro scope (e.g. a state). The macro-micro case is its inverse: the ISR (e.g. a county) has a macro scope but the DSR (e.g. a city) covers a small area. Finally, the inconsistent cases are those of trivial or complete disagreement between DSR and ISR. Inconsistent cases can be found automatically by using the reverse geocoder with the help of the Hausdorff distance and a threshold of 0.5. Figures 8 (a) and (b) show examples of inconsistent cases.
(3) Contextual inconsistency.
This kind of inconsistency is caused by a disagreement between the DSR (e.g. bounding boxes) and ISR (e.g. place names) of similar metadata records that describe the same area. For example, Figures 9 and 10 show disagreements between a metadata record and the consensus of its neighbourhood. In the first case, the methodology identifies a disagreement between a metadata record describing Ohio State (http://lccn.loc.gov/92681234) and the consensus of its neighbourhood (North Dakota State). This case is an obvious example of spatial inconsistency that could be detected without clustering. In the second case, however, the methodology identifies two more complex subtypes of contextual inconsistency. Although the references overlap, they are contextual inconsistencies (spatial mismatches, specifically) because their place names are unusual in their spatial context: experts usually catalogue the same spatial area with another place name. This kind of inconsistency could be seen as a geospatial synecdoche (taking a part for the whole and vice versa). Figures 10 (a) and (b) show these cases. Our clustering-based approach also points out groups of metadata records with potential geospatial inconsistencies. These could be caused, among other things, by systematic errors, the reuse of non-validated metadata, or the lack of information about the area in the geographical KOS used to validate. When a KOS does not have information about an area, we need an alternative way to validate consistency. For example, there are cases where the best source of information for validation is provided by the descriptions found in the cluster itself. That is, the cluster can be seen as representative of the collective knowledge about an area; some of these cases occur with native and unofficial place names, offshore fishing ground names, etc. Two examples are shown in Figure 11.
Contextual inconsistency differs from geosemantic inconsistency in whether the evaluation is individual or collective, and in the presence or absence of external information to validate the consistency of an evaluated metadata record. Geosemantic inconsistency detection is applied to individual metadata records and makes use of a KOS, while contextual inconsistency detection is applied to clusters and may use a KOS optionally.
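The syntactic range checks described for inconsistency type (1) above can be sketched as a few simple predicates; the (west, east, north, south) bounding-box layout is an assumption for illustration:

```python
def is_syntactically_valid(west, east, north, south):
    """Check coordinate ranges and the internal ordering of a bounding box."""
    in_lon_range = -180.0 <= west <= 180.0 and -180.0 <= east <= 180.0
    in_lat_range = -90.0 <= north <= 90.0 and -90.0 <= south <= 90.0
    ordered = south <= north  # the northern extreme must not lie below the southern one
    return in_lon_range and in_lat_range and ordered

ok = is_syntactically_valid(-96.5, -95.0, 41.0, 40.0)    # a plausible box
bad = is_syntactically_valid(-96.5, -95.0, 141.0, 40.0)  # latitude out of range
```

Geosemantic and contextual inconsistencies, by contrast, cannot be caught by such range checks alone; they require the reverse geocoder and the cluster consensus described above.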

Accepted for Publication
By the Journal of Information Science: http://jis.sagepub.co.uk

In some cases, we have found that most of the metadata records in a cluster are inconsistent. In such cases, we have applied a dual validation procedure: a collective and an individual one. We use the reverse geocoder to validate every metadata record individually, as well as the contextual consistency of all the metadata records belonging to the cluster. An example is shown in Figure 12. In this case, 8 out of 14 elements in the cluster are inconsistent, and thus the cluster is inconsistent. Table 3 shows these cases.

The results are summarized in Table 4. We have found geospatial inconsistencies in 870 out of 10,575 metadata records. Our methodology identified 212 (2%) metadata records with logical inconsistencies and 802 (7.6%) metadata records with geosemantic inconsistencies. In addition, 93 (0.9%) metadata records presented a disagreement with their neighbourhood. The administrative types (states and counties) present fewer inconsistencies than types with imprecise boundaries (cities and forests). However, it is surprising that a man-made feature type (cities) has proportionally more inconsistency issues than the other types analysed (24.6%); that is, there are more geospatial disagreements among these records. Proportionally, the records georeferencing states are the most consistent (96.8%), and they also present a better geospatial consensus than the other categories.
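The cluster-level decision of the dual validation procedure (a cluster whose members mostly fail individual validation is itself inconsistent) could be sketched as follows. The majority rule and function name are illustrative assumptions consistent with the 8-out-of-14 example above, not necessarily the paper's exact rule:

```python
def cluster_consistent(member_flags, majority=0.5):
    """Decide cluster-level consistency from individual validation results.

    member_flags: one boolean per record, True if the record passed the
    individual (reverse-geocoder) validation. The cluster is consistent
    only when more than `majority` of its members are consistent."""
    consistent = sum(member_flags)
    return consistent / len(member_flags) > majority

# 8 out of 14 members inconsistent -> only 6 consistent -> cluster inconsistent
flags = [True] * 6 + [False] * 8
print(cluster_consistent(flags))  # False
```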

Discussion
There are four issues that deserve discussion with respect to the methodology and its results: heuristics for validating spatial descriptions, outlier detection and inconsistencies, the dimension of the spatial representation, and metadata reuse. The proposed methodology uses a heuristic for validating spatial descriptions based on comparing sets of place names. Alternatively, a heuristic based on comparing geospatial coordinates could be developed. However, the main difficulty of that approach is the high level of uncertainty generated by the ambiguity of the toponym transformation process (geocoding): without additional information, it is difficult to convert highly ambiguous terms/toponyms into their equivalent coordinates. Furthermore, obtaining a two-dimensional footprint by geocoding the place names mentioned in the metadata descriptions is more complex. In addition, the selection of an appropriate geographic KOS is crucial for good reverse geocoding, regardless of the heuristic applied. We need to take into account requirements such as having descriptions with different levels of detail (geographical extents of different sizes) and topic variety.
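For illustration, one simple way to compare two sets of place names is a Jaccard overlap between the names mentioned in the record (ISR) and those returned by reverse geocoding its footprint (DSR). This is a sketch under that assumption; the paper does not specify that this exact measure is the one used:

```python
def place_name_agreement(isr_names, dsr_names):
    """Jaccard overlap between the place names mentioned in the record (ISR)
    and those obtained by reverse geocoding its bounding box (DSR)."""
    a = {n.lower() for n in isr_names}
    b = {n.lower() for n in dsr_names}
    union = a | b
    return len(a & b) / len(union) if union else 0.0

score = place_name_agreement(["Ohio", "United States"],
                             ["Ohio", "Franklin County", "United States"])
print(round(score, 2))  # 0.67
```

A low score would signal a candidate geosemantic inconsistency; the cut-off would have to be tuned against the collection.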
Outlier and inconsistency detection is not an easy task. We took advantage of DBSCAN to detect outliers in our geospatial domain. Outliers are candidates to be inconsistent according to the clustering algorithm; in that case, however, we need an additional way to verify the record. When a spatially consistent metadata record is alone in an area (i.e., it does not belong to any cluster), the clustering approach needs to be complemented with an individual validation, for example, by using the two-dimensional reverse geocoder. Thus, metadata validation by means of clustering can only be applied when we have additional information about neighbours with a good spatial consensus.
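A minimal sketch of DBSCAN-based outlier detection on record centroids follows; the coordinates and parameters are illustrative, and the paper clusters two-dimensional footprints rather than bare centroids. In scikit-learn, records that DBSCAN cannot assign to any cluster receive the noise label -1 and become candidates for individual validation:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Centroids (lon, lat) of hypothetical bounding boxes: a dense group of
# records in one area, plus one isolated record far away from it.
centroids = np.array([
    [-100.5, 47.5], [-100.4, 47.6], [-100.6, 47.4], [-100.5, 47.6],
    [-82.9, 40.0],  # isolated record, candidate inconsistency
])

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(centroids)
print(labels)  # noise points (no cluster) are labelled -1
```

The isolated record ends up with label -1; whether it is actually inconsistent must then be checked individually, e.g. with the reverse geocoder.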
Regarding the dimension of the spatial representation, we have identified many cases where the one-dimensional representation generates problems. All these problems are due to bounding boxes that cannot reasonably be considered similar, for example, when one metadata record georeferences macro areas (countries, states) and another georeferences micro-local areas (cities, towns, parks), and both are represented by and centred on the same point. For these cases, a good solution is the use of representations, algorithms and methodologies focused on two-dimensional data. This is the main idea behind our approach: such techniques provide separability for spatial objects that co-occur in one dimension but georeference spatial entities of different levels (e.g., a city, a province and a state centred on the same point but with different extent coverage). Figure 13 illustrates this situation: (a) a clustering process with a one-dimensional representation generates only 3 clusters, while (b) a two-dimensional process generates six clusters and achieves a better separation between co-occurring MBBs with differentiated extent coverage.

Metadata reuse is an essential task in the digital library domain. To understand the importance of reviewing the consistency of metadata, we first need to understand the importance of metadata itself, as the FGDC argues: "If you think the cost of metadata production is too high - you have not compiled the costs of not creating metadata: loss of information with staff changes, data redundancy, data conflicts, liability, misapplications, and decisions based upon poorly documented data" [59]. Even if we accept the importance of metadata, we need to worry about its quality. For example, metadata sharing and reuse is a common practice in digital libraries. These practices should include a richer