Data Integration for Sustainable and Resilient Infrastructures

Data integration provides the ability to query, mine, or otherwise manipulate data transparently across semantically heterogeneous data sources [i]. That is, once the data sources are integrated, they appear to users as a single database. Integration is a complex process that is better handled when the data sources are described using a common data model, which serves as an abstraction of the data.

An ontology is a formal specification of a data model and plays an important role in data integration [ii]. In the geospatial domain, classifications such as land use, soil, or residential density can be effectively expressed using ontologies. Because such classifications may vary across adjoining counties and municipalities [iii], bridging across classifications corresponds to finding mappings among the concepts in the corresponding ontologies, a process called ontology matching or alignment. For example, some counties may use the term urban while others use the terms commercial and residential. A geographic information system covering all counties will therefore have to support a mapping establishing that the urban concept is equivalent to the union of the residential and commercial concepts. With this mapping in place, all land parcels that are urban, residential, or commercial can be retrieved using a single query.
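The urban / residential / commercial example can be sketched in a few lines of Python. This is a minimal illustration only, not AgreementMaker's implementation: the parcel records, the county names, and the `expand` and `query_parcels` helpers are hypothetical, and the mapping is written by hand rather than discovered by a matcher.

```python
# Hypothetical hand-written mapping: concept -> set of mapped sub-concepts.
# It states that "urban" is equivalent to the union of
# "residential" and "commercial".
MAPPING = {"urban": {"residential", "commercial"}}

# Illustrative parcels from two counties that use different classifications.
PARCELS = [
    {"id": "A1", "county": "County A", "land_use": "urban"},
    {"id": "A2", "county": "County A", "land_use": "agricultural"},
    {"id": "B1", "county": "County B", "land_use": "residential"},
    {"id": "B2", "county": "County B", "land_use": "commercial"},
    {"id": "B3", "county": "County B", "land_use": "forest"},
]


def expand(concept, mapping):
    """Return the concept together with every concept it is mapped to."""
    return {concept} | mapping.get(concept, set())


def query_parcels(concept, parcels, mapping):
    """Retrieve ids of parcels labelled with the concept or a mapped sub-concept."""
    wanted = expand(concept, mapping)
    return [p["id"] for p in parcels if p["land_use"] in wanted]


urban_ids = query_parcels("urban", PARCELS, MAPPING)
print(urban_ids)  # ['A1', 'B1', 'B2']
```

The single query for urban returns parcels from both counties. The mapping is deliberately one-directional in this sketch: a query for residential returns only B1, because a parcel labelled urban cannot be assumed to be residential rather than commercial.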

The AgreementMaker system, developed at the ADVances in Information Systems Research Laboratory at the University of Illinois at Chicago under the direction of BURST team member Isabel Cruz, is the world leader in ontology matching, as demonstrated by its top performance in the Ontology Alignment Evaluation Initiative, the premier international competition for ontology matching systems [iv]. Especially noteworthy is the first place that AgreementMaker earned in the Anatomy Track in each of the last two years, both because of the size of the ontologies involved (about 3,000 concepts) and because of the impact on biomedical research. An important feature of AgreementMaker is its modular framework, which allows new functionality to be added efficiently to a powerful core system. As a consequence, over eighty research groups at national labs and universities worldwide have extended AgreementMaker for their own applications. Building on these successes, we plan to address several new challenges in ontology matching for geospatial data:

  • Multi-layered data integration. Land use, soil, and residential density each correspond to a single geospatial layer. For urban sustainability studies, we plan to develop efficient methods for data integration across multiple layers [v], given a geographic area and a time interval. For example, in the eco-boulevard scenario, the temporal and geospatial study of water demand must be correlated with the residential and commercial density (living and office spaces) along a series of interconnected wetlands spanning different municipalities.
  • Matching of web ontologies. The semantic connections among a large variety of data sources (see Table 1) can be established using well-known web resources such as Wikipedia, Geonames, or Freebase. The problem of establishing links between open datasets, known as Linked Open Data (LOD), is therefore closely related to the problem of ontology matching [vi]. Preliminary results show the potential of AgreementMaker [vii] in comparison with other systems for LOD integration [viii].
  • Uncertainty. We propose to develop novel mapping methods for complex geospatial settings that incorporate the essential notion of uncertainty [ix], both spatial and temporal. Uncertainty can stem from various factors, including differences in temporal or spatial granularity (e.g., demographic information available for different years or aggregated over different subareas, such as county vs. municipality) and fuzzy spatial boundaries (e.g., no exact separation between dry land and wetland).
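
One common way to represent a fuzzy spatial boundary, such as the transition between dry land and wetland, is a membership function that returns a degree between 0 and 1 instead of a yes/no label. The sketch below assumes a hypothetical soil-moisture reading per parcel and made-up thresholds; it illustrates the general idea and is not the mapping method proposed in this project.

```python
def wetland_membership(moisture, dry_below=0.2, wet_above=0.6):
    """Degree (0..1) to which a location counts as wetland.

    moisture, dry_below and wet_above are hypothetical moisture fractions;
    the thresholds are illustrative, not calibrated values.
    """
    if moisture <= dry_below:
        return 0.0
    if moisture >= wet_above:
        return 1.0
    # Linear ramp across the transition zone between dry land and wetland.
    return (moisture - dry_below) / (wet_above - dry_below)


def parcels_matching(parcels, min_degree):
    """Keep ids of parcels whose wetland membership reaches min_degree."""
    return [pid for pid, m in parcels if wetland_membership(m) >= min_degree]


# Illustrative (parcel id, moisture) pairs.
SAMPLES = [("P1", 0.1), ("P2", 0.5), ("P3", 0.7), ("P4", 0.3)]
print(parcels_matching(SAMPLES, 0.5))  # ['P2', 'P3']
```

A query can then state how strict it wants to be: lowering `min_degree` widens the answer to include parcels in the transition zone, such as P4.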

Given these challenges, this research project will contribute to the BURST research scheme (Figure 1) as follows: We will implement a system that performs data integration across heterogeneous data sources using ontology matching. This system will support geospatial queries across multiple layers that take into account:

  • Uncertainty, thus providing answers within a spatio-temporal range with a statistical error distribution
  • Several layers, thus taking into account the interaction among several geospatial model components

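A query of this kind might be sketched as follows. Everything here is assumed for illustration: the layer names, the observation records, the bounding-box representation of a geographic area, and the use of the standard error of the mean as a simple stand-in for a statistical error distribution.

```python
import math
import statistics

# Hypothetical observations per layer: (x, y, year, value).
LAYERS = {
    "water_demand": [
        (1.0, 1.0, 2008, 10.0),
        (2.0, 1.5, 2009, 14.0),
        (9.0, 9.0, 2009, 50.0),  # outside the queried area
        (1.5, 2.0, 2003, 30.0),  # outside the queried time interval
    ],
    "residential_density": [
        (1.2, 1.1, 2008, 3.0),
        (1.8, 1.9, 2009, 5.0),
        (2.0, 2.0, 2010, 4.0),
    ],
}


def in_range(obs, bbox, years):
    """True if the observation lies inside the bounding box and year interval."""
    x, y, year, _ = obs
    xmin, ymin, xmax, ymax = bbox
    return (xmin <= x <= xmax and ymin <= y <= ymax
            and years[0] <= year <= years[1])


def query_layers(layers, bbox, years):
    """For each layer, return (mean, standard error, count) over the selection."""
    result = {}
    for name, observations in layers.items():
        values = [o[3] for o in observations if in_range(o, bbox, years)]
        if not values:
            result[name] = None  # no data for this layer in the requested range
            continue
        mean = statistics.mean(values)
        # Standard error of the mean; undefined for a single observation.
        sem = (statistics.stdev(values) / math.sqrt(len(values))
               if len(values) > 1 else None)
        result[name] = (mean, sem, len(values))
    return result


answer = query_layers(LAYERS, bbox=(0.0, 0.0, 3.0, 3.0), years=(2007, 2010))
for layer, summary in answer.items():
    print(layer, summary)
```

Each layer is filtered by the same area and time interval, so the per-layer summaries can be compared or correlated directly. A fuller treatment would replace the bounding box with real geometries and the standard error with an error model suited to each data source.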
This system will interact closely with the agent-based modeling component [x] by using queries that retrieve the data necessary for running the models (e.g., water demand and residential and commercial density in the eco-boulevard scenario). In turn, these models will include appropriate agents (e.g., residents) making choices about location, resulting in population distributions, stationary source emissions, willingness to invest in housing and schools, extent of sprawl, and so on. The results of such choices will change the data associated with one or more layers, thus requiring new data integration operations. The system will support visual analytics [xi] to guide user feedback to the matching process, as well as a querying interface that displays encoded maps as answers to the queries.

Table 1. Example BURST Datasets

Type | Source | BURST Function | Data Characteristics
--- | --- | --- | ---
Geographic | USGS [1], Google Earth | Geographic and surface characteristics | Digitized mapping
Ecological | Natural Connections | Mapping and modeling of land use, soils, ecosystem and natural area locations and functions | Digitized mapping
Water resources | USGS, FEMA [2], USEPA [3], NWS [4] | Information on water movement, runoff, accumulation, quality | Spatial distribution of water and watershed quality, drinking water safety, weather data, DFIRM digitized maps
Microbial measures of water quality | UIC CHEERS (see new refs), EPA STAR grant | Waterborne disease vectors | Measures of bacteria, viruses and parasites in Lake MI, Chicago River system, area lakes, rivers, 2007-2009; molecular measures of water quality at Lake MI beaches, 2011
Health and illness following use of area surface waters | UIC CHEERS (ref pending) | Public health impacts of urban waterways | Occurrence and severity of illness following recreational use of area surface waters, 2007-2009
Transportation systems | CTA [5], NHTS [6], BTS [7], FHWA [8], CMAP [9] | Mobility patterns and preferences | Public transit GIS maps; public transit ridership data; safety, congestion, energy consumption data; activity/travel diaries; road network, warehouse and freight terminal locations; vehicle traffic counts
Societal impacts | US Census, HUD [10] | Valuation of residential housing | Population and household attributes, employment data, housing values, discrimination, public housing
Societal characteristics and decision preferences | FEMA, US Army COE [11], ISSR [12] (UCLA) | Human responses to emergencies, risk perception; decision factors; type and location of refuge, psychological impacts | Survey information

[1] US Geological Survey, [2] Federal Emergency Management Agency, [3] US Environmental Protection Agency, [4] National Weather Service, [5] Chicago Transit Authority, [6] National Household Travel Survey, [7] Bureau of Transportation Statistics, [8] Federal Highway Administration, [9] Chicago Metropolitan Agency for Planning, [10] Housing and Urban Development, [11] Corps of Engineers, [12] Institute for Social Science Research

[i] Lutz, M., J. Sprado, E. Klien, C. Schubert, and I. Christ (2009). "Overcoming Semantic Heterogeneity in Spatial Data Infrastructures". Computers & Geosciences, 35(4):739-752.
[ii] Cruz, I. F., and H. Xiao (2005). "The Role of Ontologies in Data Integration". Journal of Engineering Intelligent Systems, 13(4):245-252.
[iii] Wiegand, N., D. Patterson, N. Zhou, S. Ventura, and I. F. Cruz (2002). "Querying Heterogeneous Land Use Data: Problems and Potential". National Conference on Digital Government Research (dg.o), pp. 115-121.
[iv] Euzenat, J., A. Ferrara, C. Meilicke, J. Pane, F. Scharffe, P. Shvaiko, H. Stuckenschmidt, O. Šváb-Zamazal, V. Svátek, and C. T. dos Santos (2010). "Results of the Ontology Alignment Evaluation Initiative 2010". ISWC International Workshop on Ontology Matching (OM), volume 689 of CEUR Workshop Proceedings.
[v] Gahegan, M., J. Luo, S. Weaver, W. Pike, and T. Banchuen (2009). "Connecting GEON: Making Sense of the Myriad Resources, Researchers and Concepts that Comprise a Geoscience Cyberinfrastructure". Computers & Geosciences, 35(4):836-854.
[vi] Volz, J., C. Bizer, M. Gaedke, and G. Kobilarov (2009). "Discovering and Maintaining Links on the Web of Data". International Semantic Web Conference (ISWC), volume 5823 of Lecture Notes in Computer Science, pp. 650-665. Springer; Bizer, C., T. Heath, and T. Berners-Lee (2009). "Linked Data - The Story So Far". International Journal on Semantic Web and Information Systems (IJSWIS), 5(3):1-22.
[vii] Cruz, I. F., M. Palmonari, F. Caimi, and C. Stroe (2011). "Towards 'On the Go' Matching of Linked Open Data Ontologies". IJCAI Workshop on Discovering Meaning On the Go in Large Heterogeneous Data (LHD).
[viii] Jain, P., P. Hitzler, A. Sheth, K. Verma, and P. Z. Yeh (2010). "Ontology Alignment for Linked Open Data". International Semantic Web Conference (ISWC), volume 6496 of Lecture Notes in Computer Science, pp. 402-417. Springer.
[ix] Pfoser, D., N. Tryfona, and C. S. Jensen (2005). "Indeterminacy and Spatiotemporal Data: Basic Definitions and Case Study". GeoInformatica International Journal, 9(3):212-236; Ban, H., and O. Ahlqvist (2009). "Representing and Negotiating Uncertain Geospatial Concepts - Where Are the Exurban Areas?". Computers, Environment and Urban Systems, 33(4):233-246.
[x] Zellner, M. L., T. L. Theis, A. T. Karunanithi, A. S. Garmestani, and H. Cabezas (2008). "A New Framework for Urban Sustainability Assessments: Linking Complexity, Information and Policy". Computers, Environment and Urban Systems, Special Issue on Geocomputation, 32:474-488.
[xi] Keim, D. A., F. Mansmann, and J. Thomas (2009). "Visual Analytics: How Much Visualization and How Much Analytics?". SIGKDD Explorations, 11(2):5-8.