

Special Issue on
Wrapping Web Data Islands
Information Integration for the Masses
Jim Blythe, Dipsy Kapoor, Craig A. Knoblock, Kristina Lerman, and Steven Minton
Information integration applications combine data from heterogeneous sources to assist the user in solving repetitive data-intensive tasks. Currently, such applications require a high level of expertise in information integration, since users need to know how to extract data from an on-line source, describe its semantics, and build integration plans to answer specific queries. We have integrated three task-learning technologies within a single desktop application to assist users in creating information integration applications. The application includes a tool for programmatic access to data in on-line information sources, a tool that models those sources semantically by aligning their input and output parameters with a common ontology, and a tool that enables the user to create complex integration plans using simple text instructions. Our system was integrated within the Calo Desktop Assistant and evaluated independently on a range of problems. It enabled non-expert users to construct integration plans for a variety of problems in the office and travel domains.
A Workflow Language for Web Automation
Paula Montoto, Alberto Pan, Juan Raposo, José Losada, Fernando Bellas, and Víctor Carneiro
Most of today's web sources do not provide suitable interfaces for software programs to interact with them. Many researchers have proposed highly effective techniques to address this problem. Nevertheless, ad-hoc solutions are still frequent in real-world web automation applications. Arguably, one of the reasons for this situation is that most proposals have focused on query wrappers, which transform a web source into a special kind of database in which some queries can be executed using a query form and return result sets composed of structured data records. Although the query wrapper model is often useful, it is not appropriate for applications that make decisions according to the data retrieved, or for processes that use forms that can be modelled as insert/update/delete operations. This article proposes a new language for defining web automation processes that is based on a wide range of real-world web automation tasks that are being used by corporations from different business areas.
Structure-Based Crawling in the Hidden Web
Marcio Vidal, Altigran S. da Silva, Edleno S. de Moura, and João M.B. Cavalcanti
The number of applications that need to crawl the Web to gather data is growing at an ever increasing pace. In some cases, the criterion to determine which pages must be included in a collection is based on their contents; in others, it would be wiser to use a structure-based criterion. In this article, we present a proposal to build structure-based crawlers that require only a few examples of the pages to be crawled and an entry point to the target web site. Our crawlers can deal with form-based web sites. Contrary to other proposals, ours does not require a sample database to fill in the forms, and does not require the user to interact heavily. Our experiments show that our precision is 100% in seventeen real-world web sites, with both static and dynamic content, and that our recall is 95% in the eleven static web sites examined.
Structure and Semantics of Data-intensive Web Pages: An Experimental Study on their Relationships
Lorenzo Blanco, Valter Crescenzi, and Paolo Merialdo
In data-intensive web sites, pages are generated by scripts that embed data from a back-end database into HTML templates. There is usually a relationship between the semantics of the data in a page and its corresponding template. For example, in a web site about sports events, it is likely that pages with data about athletes are associated with a template that differs from the template used to generate pages about coaches or referees. This article presents a method to classify web pages according to the associated template. Given a web page, the goal of our method is to accurately find the pages that are about the same topic. Our method leverages a simple, yet effective model to abstract some structural features of a web page. We present the results of an extensive experimental analysis that show the performance of our method in terms of both recall and precision on a large number of real-world web pages.
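As an illustration of how structural features can separate pages produced by different templates, the following minimal sketch abstracts each page into its set of root-to-element tag paths and compares pages with Jaccard similarity. This is an assumed stand-in chosen for brevity, not the model used in the article; the names `TagPathCollector`, `tag_paths`, and `structural_similarity`, and the three sample pages, are hypothetical.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class TagPathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths found in a page."""

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating sloppy markup.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Abstract a page into its set of tag paths (an assumed structural model)."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def structural_similarity(html_a, html_b):
    """Jaccard similarity between the tag-path sets of two pages."""
    a, b = tag_paths(html_a), tag_paths(html_b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical pages: two "athlete" pages from one template, one "coach" page.
athlete_1 = "<html><body><h1>A</h1><table><tr><td>x</td></tr></table></body></html>"
athlete_2 = ("<html><body><h1>B</h1><table><tr><td>y</td></tr>"
             "<tr><td>z</td></tr></table></body></html>")
coach = "<html><body><h2>C</h2><ul><li>q</li></ul></body></html>"

same_template = structural_similarity(athlete_1, athlete_2)
different_template = structural_similarity(athlete_1, coach)
```

In this toy setting the two athlete pages share every tag path even though their data and row counts differ, while the coach page shares only the `html` and `html/body` paths, so a similarity threshold would place them in different classes.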
Recognising Informative Web Page Blocks Using Visual Segmentation for Efficient Information Extraction
Jinbeom Kang and Joongmin Choi
As web sites are getting more complicated, the construction of web information extraction systems becomes more troublesome and time-consuming. A common problem is the difficulty of locating the segments of a page in which the target information is contained, which we call the informative blocks. This article reports on the Recognising Informative Page Blocks algorithm (RIPB), which is able to identify the informative block in a web page so that information extraction algorithms can work on it more efficiently. RIPB relies on an existing algorithm for vision-based page block segmentation to analyse and partition a web page into a set of visual blocks, and then groups related blocks with similar content structures into block clusters by using a tree edit distance method. RIPB recognises the informative block cluster by using tree alignment and tree matching. In a series of experiments, RIPB was more than 95% accurate in recognising informative block clusters, and improved the efficiency of information extraction by 17%.
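The grouping step described above can be sketched as follows. This is only an illustration under stated assumptions: it uses a simplified top-down tree edit distance (insertions and deletions apply to whole subtrees) rather than whatever distance RIPB actually uses, it skips visual segmentation entirely, and the block trees, the `0.3` threshold, and the function names are hypothetical.

```python
from functools import lru_cache

# A block is a nested tuple: (tag_label, (child, child, ...)).


def size(tree):
    """Number of nodes in a tree."""
    return 1 + sum(size(child) for child in tree[1])


@lru_cache(maxsize=None)
def top_down_distance(a, b):
    """Simplified top-down tree edit distance.

    Relabelling a node costs 1; inserting or deleting a child costs the
    size of the whole subtree rooted at that child.
    """
    relabel = 0 if a[0] == b[0] else 1
    ca, cb = a[1], b[1]
    rows, cols = len(ca) + 1, len(cb) + 1
    dp = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dp[i][0] = dp[i - 1][0] + size(ca[i - 1])
    for j in range(1, cols):
        dp[0][j] = dp[0][j - 1] + size(cb[j - 1])
    for i in range(1, rows):
        for j in range(1, cols):
            dp[i][j] = min(
                dp[i - 1][j] + size(ca[i - 1]),                       # delete subtree
                dp[i][j - 1] + size(cb[j - 1]),                       # insert subtree
                dp[i - 1][j - 1] + top_down_distance(ca[i - 1], cb[j - 1]),
            )
    return relabel + dp[-1][-1]


def cluster_blocks(blocks, threshold=0.3):
    """Greedily group blocks whose normalised distance to a cluster's
    first member is at most `threshold` (an assumed value)."""
    clusters = []
    for block in blocks:
        for cluster in clusters:
            rep = cluster[0]
            normalised = top_down_distance(rep, block) / max(size(rep), size(block))
            if normalised <= threshold:
                cluster.append(block)
                break
        else:
            clusters.append([block])
    return clusters


# Hypothetical visual blocks: two result items and one navigation list.
item_1 = ("div", (("h3", ()), ("p", ()), ("span", ())))
item_2 = ("div", (("h3", ()), ("p", ()), ("p", ()), ("span", ())))
nav = ("ul", (("li", (("a", ()),)),) * 3)

clusters = cluster_blocks([item_1, nav, item_2])
```

Here the two result items differ by a single inserted `p` node and fall into one cluster, while the navigation list ends up alone. Choosing which cluster is the informative one (tree alignment and matching in the article) is outside the scope of this sketch.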
Exploring Information Extraction Resilience
Dawn G. Gregg
Developers face many challenges when attempting to reliably extract data from the Web. One of these challenges is the resilience of the extraction system to changes in the web pages from which information is being extracted. This article compares the resilience of information extraction systems that use position-based extraction with an ontology-based extraction system and a system that combines position-based extraction with ontology-based extraction. The findings demonstrate the advantages of using a system that combines multiple extraction techniques, especially in environments where web sites change frequently and where data collection is conducted over an extended period of time.
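To make the contrast concrete, the sketch below shows a toy version of the three strategies on a page represented as a flat list of text cells. It is an assumption-laden illustration, not the systems evaluated in the article: the "ontology" is reduced to a single label term plus a value pattern, and the function names, the `Price` label, and the two page layouts are hypothetical.

```python
import re

# Hypothetical "ontology" knowledge: what a price value looks like.
PRICE_PATTERN = re.compile(r"^\$\d+(\.\d{2})?$")


def extract_by_position(cells, index):
    """Position-based: return whatever sits at a fixed index."""
    return cells[index] if index < len(cells) else None


def extract_by_ontology(cells, label="Price"):
    """Ontology-style: find the label term, then accept the next cell
    only if it matches the expected value pattern."""
    for i, cell in enumerate(cells[:-1]):
        is_label = cell.strip().rstrip(":").lower() == label.lower()
        value = cells[i + 1].strip()
        if is_label and PRICE_PATTERN.match(value):
            return value
    return None


def extract_combined(cells, index, label="Price"):
    """Combined: try the fast positional rule, validate the result,
    and fall back to the ontology-style rule if validation fails."""
    candidate = extract_by_position(cells, index)
    if candidate is not None and PRICE_PATTERN.match(candidate.strip()):
        return candidate.strip()
    return extract_by_ontology(cells, label)


# The same record before and after a hypothetical site redesign
# that inserts a "Rating" field ahead of the price.
old_layout = ["Title", "Widget", "Price", "$9.99"]
new_layout = ["Title", "Widget", "Rating", "4 stars", "Price", "$9.99"]

position_old = extract_by_position(old_layout, 3)
position_new = extract_by_position(new_layout, 3)
combined_new = extract_combined(new_layout, 3)
```

After the layout change the positional rule silently returns the rating text, whereas the combined extractor rejects that value and recovers the price through the label.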
Copyright © 2008 Journal of Universal Computer Science
In keeping with the traditional purpose of furthering science, education and research, it is the policy of the publisher, whenever possible, to permit non-commercial use and redistribution of the information contained in the documents whose copyright it owns. You are not, however, allowed to take money for the distribution or use of these documents, except for a nominal charge for photocopying, sending copies, or whichever means you use to redistribute them. The results in this special issue have been tested carefully, but they are not guaranteed for any particular purpose. Neither the publisher nor the holder of the copyright offers any warranties or representations, nor do they accept any liabilities with respect to them. A link to the J.UCS web site will be provided as soon as the special issue is published.