The computing infrastructure of a company that has been running for a few years typically includes several heterogeneous, loosely coupled applications. Most companies have realised that integrating them or the data they manage is very valuable to support business processes. In the beginning, the integration was usually ad-hoc; however, as the number of applications to integrate increased, this soon proved unsustainable, which motivated many researchers to work on principled approaches to engineer integration.
The research work on integration can be broadly classified into Application Integration and Information Integration. The former approaches are operational since they model integration solutions as message workflows amongst several integration processes, i.e., the designer is responsible for devising and orchestrating the flow [HW03]; the latter, on the contrary, are declarative since they model solutions as data schemata and allow queries to be transformed into the appropriate message workflows automatically [HRO06]. Note, however, that both kinds of solutions are complementary, and that real-world integration problems usually benefit from techniques that come from both worlds. Wrappers are used in both approaches since they help endow applications with specific-purpose programming interfaces by means of which they can be integrated. Roughly speaking, a wrapper helps instruct an application to perform an action or to answer a query in cases in which it does not support this functionality natively or does not deliver it using the appropriate technology.
Our aim is to devise a framework to help software engineers develop application and information integration solutions, with an emphasis on web applications and web data. This framework shall provide a collection of tools to design efficient application integration solutions at the appropriate abstraction level, to deal with information integration problems in which data is represented using RDF, and to develop advanced wrappers.
The idea behind Application Integration is to devise a workflow of messages that makes it possible to co-ordinate several applications exogenously so that they co-operate to keep their data in synchrony or to support a new piece of functionality.
One of the most successful approaches to application integration is the Hub&Spoke integration pattern [HW03]. Simply put, hubs provide the brains of an integration solution, since they implement the logic required so that several applications can co-operate exogenously; spokes, on the contrary, are communication channels by means of which messages are transferred from a hub to an application’s wrapper and vice versa, or from a hub to another hub. The set of hubs a company runs is commonly referred to as its business bus, cf. Figure 1, and the technologies and techniques involved as Enterprise Application Integration.
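The division of labour between hubs and spokes can be illustrated with a minimal in-memory sketch. It is not taken from any particular product: the `Spoke` and `Hub` classes, the `sync_customer` logic and the CRM, billing and shipping applications are all hypothetical names chosen for illustration, and a real spoke would be a messaging channel rather than a Python queue.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, Tuple

Message = Dict[str, str]


class Spoke:
    """A one-way communication channel, modelled here as an in-memory FIFO queue."""

    def __init__(self) -> None:
        self._queue: Deque[Message] = deque()

    def send(self, message: Message) -> None:
        self._queue.append(message)

    def receive_all(self) -> List[Message]:
        messages = list(self._queue)
        self._queue.clear()
        return messages


class Hub:
    """Holds the integration logic: reads inbound spokes, writes to outbound ones."""

    def __init__(self, logic: Callable[[Message], List[Tuple[str, Message]]]) -> None:
        self._logic = logic
        self.inbound: List[Spoke] = []
        self.outbound: Dict[str, Spoke] = {}

    def run_once(self) -> None:
        for spoke in self.inbound:
            for message in spoke.receive_all():
                # The logic decides which applications must be told what
                for target, outgoing in self._logic(message):
                    self.outbound[target].send(outgoing)


def sync_customer(message: Message) -> List[Tuple[str, Message]]:
    # Hypothetical logic: propagate a customer update to billing and shipping
    return [
        ("billing", {"customer": message["customer"], "email": message["email"]}),
        ("shipping", {"customer": message["customer"]}),
    ]


crm_to_hub, hub_to_billing, hub_to_shipping = Spoke(), Spoke(), Spoke()
hub = Hub(sync_customer)
hub.inbound.append(crm_to_hub)
hub.outbound.update({"billing": hub_to_billing, "shipping": hub_to_shipping})

# The CRM wrapper publishes an update; the applications never talk to each other
crm_to_hub.send({"customer": "c42", "email": "ann@example.org"})
hub.run_once()
billing_messages = hub_to_billing.receive_all()
shipping_messages = hub_to_shipping.receive_all()
```

Note that the applications are co-ordinated exogenously: none of them knows about the others, and changing the integration logic only requires replacing the function handed to the hub.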
The Hub&Spoke approach is commonly implemented using so-called Process Support Systems [HA00], which is a term that includes both conventional workflow systems [AH04] and recent orchestrators [L08], e.g., BizTalk, Open ESB, Camel, Mule or Spring Integration. The Service Oriented Architecture initiative [PH07] has been a leap forward regarding application integration. It is not surprising, then, that most Process Support Systems are converging into Enterprise Service Buses that are based on recommendations like WSDL to define interfaces and bindings, SOAP to support message exchanges, or BPEL to support complex integration processes [PH07]. Note that these technologies ease integrating applications as long as they provide a programmatic interface or the data they manage can be accessed using files, databases or other data-oriented interfaces for which there is a kind of wrapper known as a binding component. Applications that provide a user interface only, which is commonly the case with end-user web applications, are much more difficult to integrate since they require wrappers that emulate the interactions of a person to extract information from them.
Current Enterprise Service Buses use a database to store messages that come from wrappers or hubs until all of the correlated messages needed to start an integration process are available. The array of tasks that an integration process may execute includes tasks to receive messages, and to copy, manipulate, route and deliver them. Without exception, the run-time system is based on a per-process allocation policy, which means that a thread is allocated to each integration process that is instantiated, and that it is not relinquished until the resulting outgoing messages are delivered.
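The following sketch illustrates message correlation and the per-process allocation policy under simplifying assumptions: a dictionary stands in for the message database, messages are correlated by a hypothetical `correlation_id`, and an integration process needs exactly one message from each source in `REQUIRED_SOURCES`. All of these names are illustrative and do not correspond to any specific Enterprise Service Bus.

```python
import threading
from collections import defaultdict
from typing import Dict, List

REQUIRED_SOURCES = {"orders", "payments"}  # hypothetical: one message from each is needed

store: Dict[str, Dict[str, dict]] = defaultdict(dict)  # stands in for the message database
delivered: List[dict] = []
delivered_lock = threading.Lock()
threads: List[threading.Thread] = []


def integration_process(correlation_id: str, messages: Dict[str, dict]) -> None:
    """Runs on its own thread from start until the outgoing message is delivered."""
    merged = {"id": correlation_id}
    for source in sorted(messages):
        merged.update(messages[source])
    with delivered_lock:
        delivered.append(merged)


def on_message(source: str, correlation_id: str, payload: dict) -> None:
    """Store the message; start a process once every correlated message is available."""
    store[correlation_id][source] = payload
    if set(store[correlation_id]) == REQUIRED_SOURCES:
        messages = store.pop(correlation_id)
        # Per-process allocation: one thread per instantiated integration process
        thread = threading.Thread(target=integration_process, args=(correlation_id, messages))
        threads.append(thread)
        thread.start()


on_message("orders", "o-1", {"item": "book"})
on_message("orders", "o-2", {"item": "pen"})   # stays in the store: no payment yet
on_message("payments", "o-1", {"paid": True})  # completes the correlation for o-1

for thread in threads:
    thread.join()
```

In this sketch the thread is held for the whole lifetime of the process, including any time it would spend waiting on slow applications, which is the cost inherent to a per-process policy.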
Figure 1. An application integration solution.
The idea behind Information Integration is to have a target schema that integrates the data managed independently by several applications, so that they can be seen as if they were a large database, cf. Figure 2 [HRO06]. Wrappers provide access to an application’s data, and there are components that map user queries over the target schema into appropriate sub-queries over the applications’ schemata, and compose the results they return independently.
In the literature, there is a distinction between on-line techniques, aka Virtual Integration [HRO06, H01], and off-line techniques, aka Data Exchange [FKM05]. Both have been extensively studied in the context of relational, hierarchical and nested relational data. Note, however, that the Semantic Web initiative has constituted a major breakthrough regarding web information [AH08, SBH06], since it provides languages that can be used to describe rich graph-based data on the Web and technologies by means of which software agents are enabled to reason on these data and their descriptions.
Data Exchange relies on using mappings, which are queries that translate the data managed by several independent applications into a target schema. Roughly speaking, these mappings are used to materialise the target schema so that it can be queried without interfering with the original applications, and they have been of utmost importance in the field of Data Warehousing [KC04].
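As a minimal sketch of this idea, the snippet below uses Python's built-in `sqlite3` module to materialise a target table from two source tables. The schemata (`crm_clients`, `shop_users`, `customer`) and the sample rows are hypothetical, and real mappings are usually far richer than a single `INSERT ... SELECT` per source.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source schemata of two independent applications
    CREATE TABLE crm_clients (client_id INTEGER, full_name TEXT, mail TEXT);
    CREATE TABLE shop_users  (login TEXT, email TEXT);
    -- Hypothetical target schema
    CREATE TABLE customer (name TEXT, email TEXT, origin TEXT);

    INSERT INTO crm_clients VALUES (1, 'Ann Lee', 'ann@example.org');
    INSERT INTO shop_users  VALUES ('bob', 'bob@example.org');
""")

# Each mapping is a query over one source whose result fits the target schema
mappings = [
    "INSERT INTO customer SELECT full_name, mail, 'crm' FROM crm_clients",
    "INSERT INTO customer SELECT login, email, 'shop' FROM shop_users",
]
for mapping in mappings:
    conn.execute(mapping)  # materialise the target off-line

# From now on, queries run against the materialised target only
rows = conn.execute("SELECT name, email, origin FROM customer ORDER BY name").fetchall()
```

Once the mappings have been executed, the source tables are no longer touched, which is why querying the target does not interfere with the original applications.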
Virtual Integration techniques rely on mappings, as well, but they materialise the target schema partially. That is, these techniques try to retrieve the minimum amount of data needed to answer a query over a target schema. This makes it possible to answer queries on-line, at the cost of interfering with the normal operation of the applications being integrated. Virtual Integration techniques rely on the following steps:
- Query rewriting, which takes a user query as input and reformulates it so that the result involves the application schemata only [H01].
- Query planning, which divides the rewritten query into a set of sub-queries, each of which involves a single application; then, it produces an execution plan that orchestrates the sub-queries so that they can be executed as efficiently as possible [IHW04]; the result is a set of data that come from the applications being integrated.
- Data composition, which helps aggregate the results returned by each application and transforms them into the target schema by executing the appropriate mappings. Note that data composition does not return the answer to the initial query, but a subset of data on which it must be run.
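The three steps above can be sketched as follows. This is a toy illustration under strong assumptions: queries are reduced to lists of target attributes, every target attribute maps to exactly one application attribute, and the applications, mappings and join keys are hypothetical in-memory structures. Actual rewriting and planning algorithms [H01, IHW04] are considerably more involved.

```python
from typing import Dict, List, Tuple

# Hypothetical applications, each exposing rows under its own schema
APPLICATIONS: Dict[str, List[dict]] = {
    "crm": [{"cid": 1, "full_name": "Ann"}, {"cid": 2, "full_name": "Bob"}],
    "billing": [{"client": 1, "due": 120.0}, {"client": 2, "due": 0.0}],
}
# Hypothetical mappings: target attribute -> (application, application attribute)
MAPPINGS: Dict[str, Tuple[str, str]] = {
    "name": ("crm", "full_name"),
    "debt": ("billing", "due"),
}
JOIN_KEYS = {"crm": "cid", "billing": "client"}  # attribute that identifies a customer

Rewritten = List[Tuple[str, str, str]]


def rewrite(target_attributes: List[str]) -> Rewritten:
    """Query rewriting: express the requested attributes over application schemata."""
    return [(attr, *MAPPINGS[attr]) for attr in target_attributes]


def plan(rewritten: Rewritten) -> Dict[str, List[str]]:
    """Query planning: one sub-query (a list of attributes) per application."""
    sub_queries: Dict[str, List[str]] = {}
    for _, app, app_attr in rewritten:
        attrs = sub_queries.setdefault(app, [JOIN_KEYS[app]])
        if app_attr not in attrs:
            attrs.append(app_attr)
    return sub_queries


def execute(sub_queries: Dict[str, List[str]]) -> Dict[str, List[dict]]:
    """Run each sub-query against its application (here, a simple projection)."""
    return {app: [{a: row[a] for a in attrs} for row in APPLICATIONS[app]]
            for app, attrs in sub_queries.items()}


def compose(rewritten: Rewritten, results: Dict[str, List[dict]]) -> List[dict]:
    """Data composition: translate the partial results into the target schema."""
    composed: Dict[object, dict] = {}
    for target_attr, app, app_attr in rewritten:
        for row in results[app]:
            composed.setdefault(row[JOIN_KEYS[app]], {})[target_attr] = row[app_attr]
    return list(composed.values())


# User query over the target schema: names of customers whose debt exceeds 100
rewritten = rewrite(["name", "debt"])
sub_queries = plan(rewritten)
target_data = compose(rewritten, execute(sub_queries))
# Composition yields target data only; the initial query still has to be run on it
answer = [row["name"] for row in target_data if row["debt"] > 100]
```

The last line reflects the remark in the third step: data composition returns a subset of target data, and the answer is obtained by evaluating the initial query over that subset.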
It is not difficult to realise that mappings are paramount to information integration. Beyond hand-crafted ones, there is a variety of techniques that can generate them from user-defined correspondences amongst subsets of attributes in the target and the application schemata [PVM02].
Figure 2. An information integration solution.
Wrappers play a pivotal role regarding integration since they are modules that allow several applications to interact with each other within an integration solution. We are only interested in so-called web query wrappers, cf. Figure 3, since others are base technologies nowadays.
Figure 3. Structure of a web query wrapper.
An Enquirer is a module that takes a query as input and maps it onto the appropriate search forms provided by a web application. What a query is may range from a set of field names and values to SQL-like queries. In this project, we are only interested in the latter, since current technologies support the former sufficiently.
Current research efforts include a few intelligent techniques to analyse search forms and to extract their search capabilities, i.e., the goal is to have a model that others can use to map high-level queries onto it [HMY07, ZHC04]. Unfortunately, the literature does not provide many other results regarding this topic.
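To make the role of such a model concrete, the sketch below maps the conditions of a high-level query onto a single search form. The form model, its field names, the example URL and the `fill_form` function are hypothetical; they only illustrate that some conditions can be pushed to the form while others must be checked after the results are fetched.

```python
from typing import Dict, List, Tuple
from urllib.parse import urlencode

Condition = Tuple[str, str, str]  # (attribute, operator, value)

# Hypothetical model of one search form:
# query attribute -> (form field, operators the form can express)
FORM_MODEL = {
    "action": "https://books.example.org/search",
    "fields": {
        "title": ("q_title", {"contains"}),
        "author": ("q_author", {"=", "contains"}),
    },
}


def fill_form(conditions: List[Condition]) -> Tuple[str, List[Condition]]:
    """Return the URL of the filled form plus the conditions it cannot express."""
    params: Dict[str, str] = {}
    residual: List[Condition] = []
    for attribute, operator, value in conditions:
        field, operators = FORM_MODEL["fields"].get(attribute, (None, set()))
        if field is not None and operator in operators:
            params[field] = value
        else:
            # Must be evaluated later, on the extracted data
            residual.append((attribute, operator, value))
    return f"{FORM_MODEL['action']}?{urlencode(params)}", residual


# Conditions of a query such as:
#   SELECT * FROM book WHERE author = 'Eco' AND year > '1990'
url, residual = fill_form([("author", "=", "Eco"), ("year", ">", "1990")])
```

Here the condition on `author` is pushed to the form, whereas the condition on `year` is returned as residual because the assumed form has no matching field.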
A Navigator takes care of executing the filled forms provided by an Enquirer and navigating through the results to fetch data pages. Note that this process may lead to a data page in one step, to a no-results page, or to a so-called page hub, which is a set of interlinked pages that provide short descriptions of the information in other data pages together with links to them. (Note that the term “hub” is polysemous in the literature on integration.)
Beyond navigators that rely on user-defined navigation sequences, the literature on crawling [RG01] provides several techniques that can be applied to solve this problem. Focused Crawling improves on traditional crawling in that it tries to avoid crawling pages that do not lead to data pages about a given topic of interest [PS06, CDK99, ALG09]. Other authors have worked on automated navigation pattern learners, which are algorithms that analyse a site to find the navigation sequences that lead from, e.g., a page hub to the data pages of interest [VSM08].
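The intuition behind focused crawling can be sketched with a best-first traversal of a small in-memory site. Everything here is an assumption made for illustration: the `SITE` graph, the `TOPIC_TERMS` and the keyword-counting `score` function stand in for the trained classifiers and link-context models used in the cited proposals.

```python
import heapq
from typing import Dict, List, Tuple

# Hypothetical site: page -> (is it a data page?, [(anchor text, linked page)])
SITE: Dict[str, Tuple[bool, List[Tuple[str, str]]]] = {
    "/results": (False, [("Book: Dune", "/book/1"), ("About us", "/about"),
                         ("Next page", "/results?p=2")]),
    "/results?p=2": (False, [("Book: Emma", "/book/2"), ("Contact", "/contact")]),
    "/book/1": (True, []),
    "/book/2": (True, []),
    "/about": (False, [("Jobs", "/jobs")]),
    "/contact": (False, []),
    "/jobs": (False, []),
}
TOPIC_TERMS = ("book", "next")  # hypothetical hints that a link leads towards data pages


def score(anchor_text: str) -> int:
    """Toy relevance estimate: number of topic terms found in the anchor text."""
    text = anchor_text.lower()
    return sum(term in text for term in TOPIC_TERMS)


def focused_crawl(start: str) -> Tuple[List[str], List[str]]:
    """Best-first crawl that never follows links whose score is zero."""
    frontier = [(0, start)]  # (negated score, page); heapq pops the smallest first
    seen = {start}
    visited: List[str] = []
    data_pages: List[str] = []
    while frontier:
        _, page = heapq.heappop(frontier)
        visited.append(page)
        is_data_page, links = SITE[page]
        if is_data_page:
            data_pages.append(page)
        for anchor_text, target in links:
            relevance = score(anchor_text)
            if relevance > 0 and target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-relevance, target))
    return visited, data_pages


visited, data_pages = focused_crawl("/results")
```

In this example the crawler reaches both data pages while never fetching `/about`, `/contact` or `/jobs`, which is the saving that focused crawling aims at.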
An Information Extractor is a general algorithm that can be configured by means of rules so that it extracts the information of interest from a web page and returns it according to a structured model. Rules range from regular expressions to context-free grammars or first-order clauses, but they all rely on mark-up tags or natural language properties to find which text corresponds to the data of interest.
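A minimal sketch of a rule-configured extractor follows, using regular expressions that rely on mark-up tags. The rules, the attribute names and the sample page are hypothetical, and the sketch assumes that every record contains every attribute, which real pages often violate.

```python
import re
from typing import Dict, List

# Hypothetical rules: one regular expression per attribute, anchored on mark-up tags
RULES: Dict[str, str] = {
    "title": r'<h2 class="title">(.*?)</h2>',
    "price": r'<span class="price">(\d+\.\d{2})</span>',
}


def extract(page: str, rules: Dict[str, str]) -> List[Dict[str, str]]:
    """Apply every rule to the page and zip the matches into structured records.

    Assumes the i-th match of each rule belongs to the i-th record.
    """
    columns = {attribute: re.findall(pattern, page) for attribute, pattern in rules.items()}
    return [dict(zip(columns, values)) for values in zip(*columns.values())]


PAGE = """
<div><h2 class="title">Dune</h2><span class="price">9.99</span></div>
<div><h2 class="title">Emma</h2><span class="price">7.50</span></div>
"""
records = extract(PAGE, RULES)
```

The algorithm itself is generic; only `RULES` is specific to the page layout, which is also why a change in the mark-up silently breaks extraction.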
Beyond hand-crafting information extraction rules, the literature provides on the order of a hundred proposals that can be used to learn them automatically, both in cases in which the data of interest is buried in text that is written in natural language [TAC06] and cases in which it is buried in tables, lists and other such layouts [CKG06]. Note that none of these techniques is universally applicable and that there are no comprehensive empirical comparisons, which makes the decision on which to use very difficult.
The previous techniques deal with web pages. Working with PDF documents is a different setting, but it is becoming more and more important due to the ubiquity of this format, chiefly in scientific environments in which it is the de facto standard to publish articles. Current research results include several techniques that are specific to on-line bibliography databases [M09, PDA09].
An Information Verifier is an algorithm that analyses the result sets returned by an Information Extractor and attempts to find data that deviate significantly from data that are known to be correct. Verifiers are necessary insofar as the previous modules rely on intelligent techniques that may fail if the structure of a site or a web page changes, i.e., if they are confronted with cases that were not seen previously.
Information Verifiers build on feature-based verification models. The literature provides two probabilistic techniques [K00, MAL05] and a goodness-of-fit technique [KLM00] to build them. Given a new unverified result set, it is necessary to calculate its features and determine if they can be considered “normal enough” according to the model. In the case of probabilistic techniques, “normality” is tested by determining the probability associated with the values of the features; in the case of goodness-of-fit techniques, “normality” is tested by checking if these values can be considered statistically equal to the values in the verification model.
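The probabilistic variant can be sketched as follows. This is not the method of [K00] or [MAL05]: the two features, the assumption that each feature is normally distributed, the two-sided tail test and the 0.01 threshold are all simplifications chosen for illustration.

```python
from statistics import NormalDist, mean
from typing import Dict, List

ResultSet = List[Dict[str, str]]


def features(result_set: ResultSet) -> Dict[str, float]:
    """Hypothetical numeric features of a result set."""
    lengths = [len(record["title"]) for record in result_set]
    return {
        "records": float(len(result_set)),
        "title_length": mean(lengths) if lengths else 0.0,
    }


def build_model(verified: List[ResultSet]) -> Dict[str, NormalDist]:
    """Fit one normal distribution per feature from result sets known to be correct."""
    samples = [features(result_set) for result_set in verified]
    return {name: NormalDist.from_samples([s[name] for s in samples]) for name in samples[0]}


def is_normal(result_set: ResultSet, model: Dict[str, NormalDist],
              threshold: float = 0.01) -> bool:
    """Accept the result set if no feature value is too improbable under the model."""
    for name, value in features(result_set).items():
        dist = model[name]
        # Two-sided tail probability of a value at least this far from the mean
        tail = 2 * (1 - dist.cdf(dist.mean + abs(value - dist.mean)))
        if tail < threshold:
            return False
    return True


# Result sets assumed to have been verified by hand
verified_sets = [
    [{"title": "Dune"}, {"title": "Emma"}, {"title": "Ulysses"}],
    [{"title": "Hamlet"}, {"title": "Ivanhoe"}, {"title": "Walden"}, {"title": "Utopia"}],
    [{"title": "Carrie"}, {"title": "Persuasion"}],
]
model = build_model(verified_sets)

good = [{"title": "Dracula"}, {"title": "Rebecca"}, {"title": "Beloved"}]
# What an extractor might return after a layout change: mark-up instead of a title
broken = [{"title": '<div class="item"><h2 class="title">Dracula</h2><span>9.99</span>'}]

good_ok = is_normal(good, model)
broken_ok = is_normal(broken, model)
```

With this toy model, the unusually long "title" in the broken result set makes it improbable enough to be rejected, whereas the well-formed result set passes.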
[AH04] Workflow Management. W. van der Aalst, K. van Hee. The MIT Press. 2004.
[AH08] A Semantic Web Primer (2nd edition). G. Antoniou, F. van Harmelen. The MIT Press. 2008.
[ALG09] A Genre-Aware Approach to Focused Crawling. G.T. de Assis, A.H.F. Laender, M.A. Gonçalves, A.S. da Silva. WWW Journal, 12(3):285-319. 2009.
[CDK99] Mining the Web's Link Structure. S. Chakrabarti, B. Dom, R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, J.M. Kleinberg. IEEE Computer, 32(8):60-67. 1999.
[CKG06] A Survey of Web Information Extraction Systems. C.-H. Chang, M. Kayed, M.R. Girgis, K.F. Shaalan. IEEE Trans. Knowl. Data Eng., 18(10):1411-1428. 2006.
[FKM05] Data Exchange. R. Fagin, P.G. Kolaitis, R.J. Miller, L. Popa. Theor. Comput. Sci., 336(1):89-124. 2005.
[H01] Answering Queries using Views. A.Y. Halevy. VLDB Journal, 10(4):270-294. 2001.
[HA00] Exception Handling in Workflow Management Systems. C. Hagen, G. Alonso. IEEE Trans. Software Eng., 26(10):943-958. 2000.
[HMY07] Towards Deeper Understanding of the Search Interfaces of the Deep Web. H. He, W. Meng, C.T. Yu, Y. Lu, Z. Wu. WWW Conf., 133-155. 2007.
[HRO06] Data Integration. A.Y. Halevy, A. Rajaraman, J.J. Ordille. VLDB Conf., 9-16. 2006.
[HW03] Enterprise Integration Patterns. G. Hohpe, B. Woolf. Addison-Wesley. 2003.
[IHW04] Adapting to Source Properties in Processing Data Integration Queries. Z.G. Ives, A.Y. Halevy, D.S. Weld. SIGMOD Conf., 395-406. 2004.
[K00] Wrapper Verification. N. Kushmerick. WWW Journal, 3(2):79-94. 2000.
[KC04] The Data Warehouse ETL Toolkit. R. Kimball, J. Caserta. Wiley. 2004.
[KLM00] Accurately and Reliably Extracting Data from the Web. C.A. Knoblock, K. Lerman, S. Minton, I. Muslea. IEEE Data Eng. Bull., 23(4):33-41. 2000.
[L08] Orchestrating Web Services with BPEL. P. Louridas. IEEE Software, 25(2):85-87. 2008.
[M09] Metadata Extraction from PDF Papers for Digital Library Ingest. S. Marinai. ICDAR, 251-255. 2009.
[MAL05] Mapping Maintenance for Data Integration Systems. R. McCann, B.K. AlShebli, Q. Le, H. Nguyen, L. Vu, A. Doan. VLDB Conf., 1018-1030. 2005.
[PDA09] Enriching a Document Collection by Integrating Information Extraction and PDF Annotation. B. Powley, R. Dale, I. Anisimoff. DRR, 1-10. 2009.
[PH07] Service Oriented Architectures. M.P. Papazoglou, W.-J. van den Heuvel. VLDB Journal, 16(3):389-415. 2007.
[PS06] Link Contexts in Classifier-Guided Topical Crawlers. G. Pant, P. Srinivasan. IEEE Trans. Knowl. Data Eng., 18(1):107-122. 2006.
[PVM02] Translating Web Data. L. Popa, Y. Velegrakis, R.J. Miller, R. Fagin, M.A. Hernández. VLDB Conf., 598-609. 2002.
[RG01] Crawling the Hidden Web. S. Raghavan, H. Garcia-Molina. VLDB Conf., 129-138. 2001.
[SBH06] The Semantic Web Revisited. N. Shadbolt, T. Berners-Lee, W. Hall. IEEE Intelligent Systems, 21(3):96-101. 2006.
[TAC06] Adaptive Information Extraction. J. Turmo, A. Ageno, N. Català. ACM Comput. Surv., 38(2). 2006.
[VSM08] Structure-Based Crawling in the Hidden Web. M.L.A. Vidal, A.S. da Silva, E.S. de Moura, J.M.B. Cavalcanti. J. UCS, 14(11):1857-1876. 2008.
[ZHC04] Understanding Web Query Interfaces. Z. Zhang, B. He, K.C.-C. Chang. SIGMOD Conf., 107-118. 2004.