
Objectives

Companies are increasingly relying on software applications to manage their data. Although these applications are very valuable on their own, most companies have realised that integrating them or the data they provide is even more valuable, since this usually results in better support for business processes. 

Our initial hypothesis is that more and more companies shall rely on an increasing number of such automated business processes, which shall require more and more applications to be integrated to support and optimise them. We think that this hypothesis is sensible on account of recent research reports by CIO Magazine [M06], Gartner [G08], and DataMonitor [D08], according to which most IT companies are worried about integration, which shall be the driving force behind most large IT projects in the forthcoming ten years.  Unfortunately, even though the technologies provided by the Service Oriented Architecture and the Semantic Web initiatives are helping to cut integration costs down, recent reports [W05, BH08] highlight that integration costs are still 5 to 20 times higher than the cost of developing new functionality, chiefly when web applications and RDF data are involved.

Our research group works in close co-operation with two spin-offs and two other industrial partners that are heavily involved in integration projects and act as EPOs.  All of this has motivated us to work on a number of research problems whose solution shall contribute to engineering integration.

Specific Goals

Our aim is to devise a framework to help software engineers develop application and information integration solutions, with an emphasis on web applications and web data.  Our specific goals include:

Application Integration: (G01) We shall devise a Domain Specific Language for application integration. (G02) We shall devise a series of transformations to compile it into a subset of technologies in current use. (G03) We shall devise a new run-time system for application integration.

Information Integration: (G04) We shall devise a tool to integrate web data that is represented using RDF and described using RDFS.

Web Query Wrapping: (G05) We shall explore how to unify existing search form models and how to query them using high-level structured queries. (G06) We shall devise a navigator that is able to navigate intelligently from filled forms to data pages so that the number of irrelevant links visited is kept to a minimum. (G07) We shall devise a software framework by means of which software engineers can build new information extraction algorithms and rule learners. (G08) We shall devise several optimisations and heuristics to improve the efficiency of the FOIL rule learner so that it can be used efficiently for information extraction tasks. (G09) We shall explore new techniques to extract information from PDF documents. (G10) We shall devise a new technique to build verification models that can deal with large result sets accurately.

Justification

In the introduction, we reported on the state of the art regarding application integration, information integration, and wrapping.  A thorough analysis reveals that the proposals in the literature have a number of weaknesses that, taken together, support our specific research goals, namely:

  1. Current tools to engineer integration solutions are rather low-level because they tend to be general purpose, i.e., they build on constructs like interfaces, bindings, messages, orchestrations, and others that are actually intended to increase the level of abstraction at which distributed systems are devised, but they do not provide any specific-purpose integration constructs, e.g., splitters, aggregators, content enrichers, or claim checks [HW03], except for binding components.
  2. Historically, data has been chiefly relational; this is no longer the case, since more and more data is available on the Web in hierarchical, nested relational, or graph-based formats, i.e., XML and RDF [SBH06]. Integrating such data is challenging insofar as the techniques in the literature focus on relational data.  In other words, many information integration solutions need to resort to ad-hoc techniques, which generally lead to higher development and maintenance costs.
  3. Current tools focus on applications that provide programmatic or data-oriented interfaces that can be accessed by a kind of wrapper that is usually referred to as a connector, adapter, or binding component. There are, however, a large number of applications that do not provide such an interface, but a user interface only, which is typically the case of end-user web applications. Integrating such applications is challenging insofar as building a wrapper amounts to writing a module that emulates a human interacting with them (see the sketch after this list).
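
To make the third weakness more concrete, the following sketch (in Java, with a hypothetical form URL and hypothetical field names) shows the kind of module such a wrapper boils down to: it emulates a person filling in a search form and submitting it; the resulting HTML would then have to be handed over to an information extractor.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Minimal sketch of a user-interface wrapper: it emulates a person filling in
// a search form and submitting it. The form URL and field names are hypothetical.
public class FormWrapper {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Emulate typing "laptop" into the search box and pressing the submit button.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("https://www.example.com/search"))
                .header("Content-Type", "application/x-www-form-urlencoded")
                .POST(HttpRequest.BodyPublishers.ofString("query=laptop&category=all"))
                .build();

        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());

        // The resulting HTML would then be handed over to an information extractor.
        System.out.println(response.statusCode());
    }
}
```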

In the following subsections, we delve into the justification of each specific goal.

Application Integration

(G01) Working on a Domain Specific Language amounts to increasing the level of abstraction at which application integration solutions are designed, which is appealing insofar as it may help reduce development costs [HPD09]. This language must rely on constructs to represent processes, integration tasks, hubs, spokes, and other common application integration patterns [HW03, FCG08, FC07, F08].
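
By way of illustration only, the following sketch shows what a process-level description might feel like in such a language; every class, method, and endpoint name is hypothetical, since the actual constructs of the language are precisely one of the outcomes of this goal.

```java
import java.util.ArrayList;
import java.util.List;

// Purely illustrative sketch of a process-level, pattern-oriented DSL in a
// fluent Java style; every class and method name here is hypothetical.
public class IntegrationDsl {

    public static class ProcessBuilder {
        private final String name;
        private final List<String> steps = new ArrayList<>();

        private ProcessBuilder(String name) { this.name = name; }

        public static ProcessBuilder process(String name) { return new ProcessBuilder(name); }

        public ProcessBuilder spoke(String application)       { steps.add("spoke:" + application); return this; }
        public ProcessBuilder splitter(String expression)     { steps.add("splitter:" + expression); return this; }
        public ProcessBuilder contentEnricher(String source)  { steps.add("enricher:" + source); return this; }
        public ProcessBuilder aggregator(String correlation)  { steps.add("aggregator:" + correlation); return this; }
        public ProcessBuilder hub(String destination)         { steps.add("hub:" + destination); return this; }

        public void describe() {
            System.out.println("process " + name + " = " + steps);
        }
    }

    public static void main(String[] args) {
        // An order-consolidation process described in terms of integration patterns,
        // not in terms of interfaces, bindings, or messages.
        ProcessBuilder.process("ConsolidateOrders")
                .spoke("LegacyOrderSystem")
                .splitter("order/lines")
                .contentEnricher("CustomerDirectory")
                .aggregator("order/id")
                .hub("Warehouse")
                .describe();
    }
}
```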

(G02) Note that such a language is not a contribution by itself, since tools like BizTalk or Camel provide similar languages.  What makes our planned contribution unique is our emphasis on decoupling the language from the transformations we shall devise to compile it into current technologies [SSF09]. These tools do not provide such a decoupling, which makes them extremely dependent on today’s technology; as has been the case with previous advances, that technology is expected to fade away in years, if not months. Roughly speaking, we plan on engineering application integration by sheltering it in the Model Driven Architecture, which heralds the idea of using models at three different levels of abstraction and automatic transformations from higher-level models into lower-level models [MM03]. This approach has been proven to cut development and maintenance costs [HPD09, H02].
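
The following self-contained sketch illustrates the spirit of such a transformation: a tiny platform-independent process model is compiled into the skeleton of an Apache Camel route. Camel is chosen here only as an example target, and all model and task names are hypothetical; the real transformations are a deliverable of the project.

```java
import java.util.List;

// Illustrative model-to-text transformation in the spirit of the MDA: a
// platform-independent process model is compiled into a platform-specific
// artefact (here, the skeleton of an Apache Camel route).
public class ModelToCamel {

    // A deliberately tiny platform-independent model: an ordered list of tasks.
    record Task(String kind, String argument) { }
    record Process(String name, List<Task> tasks) { }

    static String compile(Process process) {
        StringBuilder route = new StringBuilder("from(\"direct:" + process.name() + "\")");
        for (Task task : process.tasks()) {
            switch (task.kind()) {
                case "splitter" -> route.append(".split(xpath(\"" + task.argument() + "\"))");
                case "enricher" -> route.append(".enrich(\"direct:" + task.argument() + "\")");
                case "spoke"    -> route.append(".to(\"" + task.argument() + "\")");
                default         -> route.append("/* unsupported task: " + task.kind() + " */");
            }
        }
        return route.toString();
    }

    public static void main(String[] args) {
        Process p = new Process("ConsolidateOrders", List.of(
                new Task("splitter", "/order/line"),
                new Task("enricher", "customerLookup"),
                new Task("spoke", "jms:queue:orderLines")));
        System.out.println(compile(p));  // prints the generated route definition
    }
}
```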

(G03) We also plan on devising a run-time system for application integration that is based on a per-task thread allocation policy [FCM09]. A well-known limitation of current run-time systems is that they use threads inefficiently in cases in which an integration process requests an application to perform a task and has to wait for the results; note that this case is very frequent, and that an application may take from seconds to hours to react, e.g., think of an action that depends on a person or a setting in which integration requests are given less priority during work hours. The literature proposes a technique called dehydration/rehydration to deal with these cases [WR09, DM06]; the idea is to detect integration processes that have been waiting on a request for too long, save them to disk (dehydration), and resume them when the results arrive (rehydration).  Although this solution is very common, our experience with our partners proves that it is far from scalable.  In this project, we shall devise a new run-time system that allocates threads per task, instead of per process. This finer granularity allows threads to be used more efficiently, since they can be relinquished as soon as a task completes or a request is sent to an application [LHS08].
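
The following sketch illustrates the idea behind this policy, not the run-time system itself: the request to a slow application is modelled as a future that an external party completes, so no pool thread is blocked during the wait and a (possibly different) thread resumes the process when the reply arrives.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Sketch of the per-task idea behind G03: the integration process holds no
// thread while it waits for an application to react; a thread is only used to
// send the request and, later, to process the reply. Illustrative only.
public class PerTaskThreads {

    public static void main(String[] args) throws Exception {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);

        // Task "send request": the external application will reply on its own
        // after a while (simulated here with a scheduled completion).
        CompletableFuture<String> reply = new CompletableFuture<>();
        pool.schedule(() -> reply.complete("invoice data"), 2, TimeUnit.SECONDS);

        // Task "process reply": registered as a continuation; no pool thread is
        // blocked during the two-second wait, so the pool can run other processes.
        reply.thenAcceptAsync(result ->
                System.out.println("Process resumed with: " + result), pool);

        pool.schedule(pool::shutdown, 3, TimeUnit.SECONDS);
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```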

Information Integration

(G04) Recall that a mapping is a query that translates data from a number of application schemata into a target schema.  Current research results regarding mapping generation focus on the relational and the nested relational models [H01, PVM02]. To the best of our knowledge, there are no results that deal with graph-based data [ORC09], which is becoming more and more pervasive due to the increasing popularity of RDF and RDFS [SBH06].  The main problem with these languages is that they allow for classes, subclasses, properties, sub-properties, and data that have several unrelated types, which is not the case for previous data models; furthermore, the notion of existence dependency is much weaker than in relational data, and the standard query language is SPARQL, which deviates significantly from the SQL and XQuery subsets that have been studied so far.
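
As a simple illustration of the kind of mapping we have in mind, the following sketch executes a SPARQL CONSTRUCT query over a source RDF model using Apache Jena; the library, the file name, and the source and target vocabularies are assumed here purely for the sake of the example.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.riot.RDFDataMgr;

// Illustrative mapping over RDF data, written as a SPARQL CONSTRUCT query and
// executed with Apache Jena; the vocabularies below are hypothetical.
public class RdfMappingExample {

    public static void main(String[] args) {
        Model source = RDFDataMgr.loadModel("source.ttl");

        // The mapping translates instances of src:Person into tgt:Customer and
        // renames the property that records their names.
        String mapping = """
                PREFIX src: <http://example.org/source#>
                PREFIX tgt: <http://example.org/target#>
                CONSTRUCT { ?p a tgt:Customer ; tgt:fullName ?name . }
                WHERE     { ?p a src:Person   ; src:name     ?name . }
                """;

        try (QueryExecution qexec = QueryExecutionFactory.create(mapping, source)) {
            Model target = qexec.execConstruct();
            target.write(System.out, "TURTLE");
        }
    }
}
```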

Wrapping

(G05) The problem with Enquirers is that they are a recent research topic and there are not many results available [O08].  This clearly justifies exploring new techniques to implement them. Our idea is to model forms as if they were parameterised views, which shall require devising a new form model that accounts not only for the fields and actions available, but also for their semantics [HMY07, ZHC04]. This model shall allow us to transform our problem into the problem of answering queries using views [H01] and query-oriented programming interfaces [PDP06]. We also need to delve into the problem of query feasibility, i.e., the problem of checking whether a query can be mapped onto the available forms.
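
The following toy sketch illustrates the intuition of treating a form as a parameterised view and of checking query feasibility against it; the form model and the feasibility test are deliberately simplistic, every name is hypothetical, and the real model must also capture the semantics of fields and actions.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy version of the idea of treating a search form as a parameterised view:
// a query is feasible with respect to a form if every attribute that the query
// constrains corresponds to some field of the form.
public class FormAsView {

    record Field(String attribute, boolean mandatory) { }
    record Form(String action, List<Field> fields) { }
    record Query(Set<String> constrainedAttributes) { }

    static boolean isFeasible(Query query, Form form) {
        Set<String> offered = new HashSet<>();
        for (Field field : form.fields()) {
            offered.add(field.attribute());
        }
        // Every constrained attribute must be answerable by filling in some field;
        // mandatory fields without a value would require defaults (omitted here).
        return offered.containsAll(query.constrainedAttributes());
    }

    public static void main(String[] args) {
        Form bookSearch = new Form("https://www.example.com/books/search",
                List.of(new Field("title", true), new Field("author", false)));
        Query q = new Query(Set.of("title"));
        System.out.println(isFeasible(q, bookSearch));  // true
    }
}
```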

(G06) Note that visiting many irrelevant links is problematic insofar as it has an impact on a server’s response time and on the bandwidth used.  Our preliminary studies show that 60-90% of the links on typical hub pages are irrelevant [H09], which makes the need for intelligent navigators clear. Traditional crawlers [RG01] navigate a site by retrieving all of the pages they find, which means visiting every irrelevant link. Focused crawlers only crawl pages about a given topic, or pages that may lead to them, but they are not able to tell relevant links apart from irrelevant ones, and the ratio of irrelevant links visited is far from negligible [PS06, CDK99, ALG09]. Automated navigation pattern learners can be the solution to our problem, but the existing proposals have some shortcomings that must be addressed.  A recent proposal by Vidal et al. [VSM08] is able to learn navigation patterns that set irrelevant pages apart without fetching them, i.e., the decision is based solely on their links; unfortunately, it seems unable to navigate through hub pages and does not guarantee that all relevant data pages are retrieved.
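
The following deliberately naive sketch illustrates the kind of decision an intelligent navigator has to make at every hub page, namely judging a link without fetching its target; the keywords and patterns are hypothetical, and this is neither the technique by Vidal et al. nor the one we shall devise.

```java
import java.util.List;

// A deliberately naive link scorer that decides, from the link alone and
// without fetching the target page, whether it seems worth following.
public class LinkScorer {

    static boolean looksRelevant(String url, String anchorText) {
        String u = url.toLowerCase();
        String a = anchorText.toLowerCase();
        // Hub-to-detail links often carry identifiers and product-like anchors,
        // whereas login, help, or policy links rarely lead to data pages.
        boolean detailHint = u.matches(".*(id=|/item/|/detail/).*") || a.contains("more");
        boolean noiseHint  = u.contains("login") || u.contains("help") || a.contains("privacy");
        return detailHint && !noiseHint;
    }

    public static void main(String[] args) {
        List<String[]> links = List.of(
                new String[] {"https://www.example.com/item/1234", "More details"},
                new String[] {"https://www.example.com/help/contact", "Contact us"});
        for (String[] link : links) {
            System.out.println(link[0] + " -> " + looksRelevant(link[0], link[1]));
        }
    }
}
```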

(G07) Developing a software framework is appealing insofar as it shall help reduce development costs and shall allow side-by-side comparisons; note that the literature provides many results regarding information extraction, but they are currently not comparable to each other because they have been developed using different technologies and validated using different data [CKG06, TAC06].  From an industrial point of view, this is problematic because deciding which algorithm is the most appropriate becomes a matter of trial and error.
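
The following sketch hints at the kind of common interface such a framework might impose so that different algorithms and rule learners can be trained and evaluated on exactly the same data; every name is hypothetical.

```java
import java.util.List;
import java.util.Map;

// Sketch of a common interface that a framework for information extraction
// might define so that different algorithms can be compared side by side.
public interface InformationExtractor {

    // A training example: the text of a document plus the slots annotated in it.
    record AnnotatedDocument(String text, Map<String, String> slots) { }

    // Learn extraction rules from a set of annotated documents.
    void learn(List<AnnotatedDocument> trainingSet);

    // Apply the learned rules to an unseen document, returning slot/value pairs.
    Map<String, String> extract(String document);
}
```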

(G08) A distinguishing feature of the SRV information extraction system [CDF00] is that it relies on first-order logic extraction rules and on a set of user-defined predicates to implement features that range from the tag within which a piece of data is rendered to its natural language role.  This makes SRV the most flexible information extraction system; unfortunately, its learner is based on FOIL [Q96], which is quite a complex algorithm that hinders its applicability in production environments.  We are currently working on a number of preliminary optimisations and heuristics that seem promising [PV08], and we plan on exploring them further in this project.
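
For reference, the sketch below computes FOIL's information gain, the heuristic at the heart of the learner, which is evaluated for every candidate literal at every refinement step; this gives a flavour of where the computational cost comes from. The example figures are invented.

```java
// FOIL's gain for adding a literal to a rule, following Quinlan's definition:
// (p0, n0) and (p1, n1) are the positive/negative bindings covered before and
// after adding the literal, and t is the number of positive bindings that
// remain covered after adding it.
public class FoilGain {

    static double gain(int t, int p0, int n0, int p1, int n1) {
        double before = log2((double) p0 / (p0 + n0));
        double after  = log2((double) p1 / (p1 + n1));
        return t * (after - before);
    }

    private static double log2(double x) {
        return Math.log(x) / Math.log(2);
    }

    public static void main(String[] args) {
        // A literal that keeps 40 of 50 positives and discards most negatives.
        System.out.println(gain(40, 50, 200, 40, 10));
    }
}
```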

(G09) Extracting information from PDF documents is a recent research area in which there are very few results, most of which focus on digital library settings and have low accuracy rates [M09, PDA09]. We plan on exploring new techniques that build on using visual properties and data mining techniques since our preliminary results seem promising enough [V09].
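
By way of illustration, the following sketch shows one possible way of obtaining such visual properties using Apache PDFBox (version 2.x and a hypothetical file name are assumed, purely for the sake of the example): each piece of text is reported together with its position and font size, which could then feed the data mining techniques mentioned above.

```java
import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Reports each extracted piece of text together with its visual properties
// (position on the page and font size), which can be used as features.
public class PdfVisualFeatures {

    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File("report.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper() {
                @Override
                protected void writeString(String text, List<TextPosition> positions)
                        throws IOException {
                    if (positions.isEmpty()) {
                        return;
                    }
                    TextPosition first = positions.get(0);
                    System.out.printf("'%s' at (%.1f, %.1f), %.1f pt%n",
                            text, first.getXDirAdj(), first.getYDirAdj(),
                            first.getFontSizeInPt());
                }
            };
            stripper.getText(document);  // triggers the writeString callbacks
        }
    }
}
```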

(G10) An important drawback of existing techniques [K00, KLM00, MAL05] is that the resulting verification models tend to be less and less accurate as the number of features examined increases.  Our experience is that real-world result sets usually involve hundreds of features, which makes the existing techniques of little interest [VAA09]. Note that result sets can naturally be represented as multi-dimensional vectors whose components are the values of the features applied to them. The distance between two arbitrary result sets can then be defined as the distance between their corresponding vectors.  Thus, an unverified result set must be considered valid if it is “similar enough” to the known valid result sets; otherwise, an alarm must be signalled. Determining what “similar enough” means can be dealt with using intelligent techniques from the field of machine learning. The problem is that these techniques usually rely on the hypothesis that there is a sufficiently large set of both valid and invalid samples, which is not the case for information verification.  Our focus shall be on generating representative invalid data that allows us to build verification models that can reliably deal with large sets of features.
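
The following toy sketch illustrates the verification scheme described above: a result set is mapped onto a feature vector, its distance to the centroid of the known valid result sets is computed, and an alarm is signalled when that distance exceeds a threshold. The features, figures, and threshold are hypothetical; setting them reliably in the presence of hundreds of features is precisely what this goal is about.

```java
import java.util.List;

// Toy distance-based verifier: an unverified result set raises an alarm when
// its feature vector lies too far from the centroid of the known valid ones.
public class ResultSetVerifier {

    static double[] centroid(List<double[]> validVectors) {
        int dimensions = validVectors.get(0).length;
        double[] centre = new double[dimensions];
        for (double[] vector : validVectors) {
            for (int i = 0; i < dimensions; i++) {
                centre[i] += vector[i] / validVectors.size();
            }
        }
        return centre;
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += (a[i] - b[i]) * (a[i] - b[i]);
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Each vector holds features of a result set, e.g., number of records,
        // mean record length, and ratio of empty attributes.
        List<double[]> valid = List.of(
                new double[] {25, 310.0, 0.02},
                new double[] {30, 295.5, 0.01});
        double[] unverified = {2, 40.0, 0.60};

        double[] centre = centroid(valid);
        boolean alarm = distance(unverified, centre) > 50.0;  // hypothetical threshold
        System.out.println(alarm ? "alarm: result set looks invalid" : "result set accepted");
    }
}
```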