Key words: Information extraction, Enterprise Information Integration.
A wrapper is a piece of software, typically used in information integration scenarios, that offers an API that abstracts developers from the details required to simulate a human interacting with a search form and that returns the results as structured data. One of the components of any web wrapper is the information extractor, which is used to extract information from the Web.
An information extractor is a general algorithm that can be configured by means of rules so that it extracts the information of interest from a web page and returns it according to a structured model. An information extractor can be built by handcrafting its rules. Since this is tedious and expensive, many proposals provide techniques to automate the process. The proposed algorithms range from automatic ones [NDKT05] [YB06], in which extraction rules are learned without any human intervention, to semi-automatic ones, in which the user must provide a set of examples from which the algorithm induces the extraction rules [CM98] [SS99].
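The idea of a general algorithm configured by rules can be illustrated with a minimal sketch. It assumes, purely for illustration, that a rule is a slot name paired with a regular expression and that the structured model is a mapping from slot names to extracted values; `ExtractionRule`, `extract`, `RULES` and `PAGE` are hypothetical names and do not come from any of the cited proposals.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ExtractionRule:
    """Hypothetical rule: a slot name plus a regex whose single group is the value."""
    slot: str
    pattern: str


def extract(page: str, rules: list[ExtractionRule]) -> dict[str, list[str]]:
    """Apply every rule to the page and return the matches grouped by slot."""
    record: dict[str, list[str]] = {}
    for rule in rules:
        # With exactly one capture group, findall returns the captured strings.
        record[rule.slot] = re.findall(rule.pattern, page, flags=re.DOTALL)
    return record


# Handcrafted rules for an assumed page layout; a rule learner would induce
# these from examples instead of having them written by hand.
RULES = [
    ExtractionRule("title", r'<span class="title">(.*?)</span>'),
    ExtractionRule("price", r'<span class="price">(.*?)</span>'),
]

PAGE = """
<ul>
  <li><span class="title">Book A</span> <span class="price">10.00</span></li>
  <li><span class="title">Book B</span> <span class="price">12.50</span></li>
</ul>
"""

if __name__ == "__main__":
    print(extract(PAGE, RULES))
```

The extraction algorithm itself never changes in this sketch; only the rule set does, which is what makes it possible either to handcraft the rules or to learn them.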
Beyond handcrafting information extraction rules, the literature provides around a hundred proposals that can be used to learn them automatically, both in cases in which the data of interest is buried in text written in natural language [TAC06] and in cases in which it is buried in tables, lists and other such layouts [CKG06].
Note that none of these techniques is universally applicable and that no comprehensive empirical comparison exists, which makes it very difficult to decide which one to use. Developing a new learning algorithm for information extractors is also difficult, since it has to start from scratch.
Developing a software framework is appealing insofar as it helps reduce development costs and allows side-by-side comparisons; note that the literature provides many results regarding information extraction, but they are currently not comparable to each other because they have been developed using different technologies and validated using different data [CKG06, TAC06]. From an industrial point of view, this is problematic, since deciding which algorithm is the most appropriate becomes a matter of trial and error.
We believe that a framework will help develop information extraction algorithms and rule learners within a coherent setting in which they can be developed and compared side by side, which reduces costs and makes these tasks easier. Furthermore, by using our framework, researchers can reuse components and focus on their own algorithms and on optimising them, rather than on re-implementing other algorithms for comparison and building the same components again and again.
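A side-by-side comparison within a coherent setting could look like the following sketch. It is not the framework itself: it assumes, for illustration only, that every extractor is a function from a page to a set of extracted strings and that all extractors are scored on the same annotated dataset with micro-averaged precision and recall. The names `evaluate`, `compare`, `digits_extractor`, `tagged_extractor` and `DATASET` are hypothetical.

```python
import re
from typing import Callable

# Assumed common interface: an extractor maps a page to a set of values.
Extractor = Callable[[str], set[str]]
Dataset = list[tuple[str, set[str]]]  # (page, expected values) pairs


def evaluate(extractor: Extractor, dataset: Dataset) -> tuple[float, float]:
    """Return micro-averaged (precision, recall) over the whole dataset."""
    hits = predicted_total = expected_total = 0
    for page, expected in dataset:
        predicted = extractor(page)
        hits += len(predicted & expected)
        predicted_total += len(predicted)
        expected_total += len(expected)
    precision = hits / predicted_total if predicted_total else 0.0
    recall = hits / expected_total if expected_total else 0.0
    return precision, recall


def compare(extractors: dict[str, Extractor], dataset: Dataset) -> dict[str, tuple[float, float]]:
    """Score every extractor on the same data so the results are comparable."""
    return {name: evaluate(fn, dataset) for name, fn in extractors.items()}


def digits_extractor(page: str) -> set[str]:
    """Toy extractor: anything that looks like a two-decimal number."""
    return set(re.findall(r"\d+\.\d{2}", page))


def tagged_extractor(page: str) -> set[str]:
    """Toy extractor: the content of <b> elements."""
    return set(re.findall(r"<b>(.*?)</b>", page))


DATASET: Dataset = [
    ("<p>Price: <b>10.00</b>, code 3.14</p>", {"10.00"}),
    ("<p>Price: <b>12.50</b></p>", {"12.50"}),
]

if __name__ == "__main__":
    scores = compare({"digits": digits_extractor, "tagged": tagged_extractor}, DATASET)
    for name, (precision, recall) in scores.items():
        print(f"{name}: precision={precision:.2f} recall={recall:.2f}")
```

Because both extractors share one interface and one dataset, adding a new algorithm to the comparison only requires implementing the extractor function; the evaluation code is reused unchanged.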
[TAC06] Adaptive information extraction. J. Turmo, A. Ageno, N. Català. ACM Comput. Surv., 38(2). 2006.
[CKG06] A Survey of Web Information Extraction Systems. C.-H. Chang, M. Kayed, M.R. Girgis, K.F. Shaalan. IEEE Trans. Knowl. Data Eng., 18(10):1411-1428. 2006.
[CM98] Generating Finite-State Transducers for Semi-Structured Data Extraction from the Web. Chun-Nan Hsu, Ming-Tzung Dung. Inf. Syst., 23(8):521-538. 1998.
[SS99] Learning Information Extraction Rules for Semi-Structured and Free Text. Stephen Soderland. Machine Learning, 34(1-3):233-272. 1999.
[NDKT05] STAVIES: A System for Information Extraction from Unknown Web Data Sources through Automatic Web Wrapper Generation Using Clustering Techniques. Nikolaos Papadakis, Dimitrios Skoutas, Konstantinos Raftopoulos, Theodora A. Varvarigou. IEEE Trans. Knowl. Data Eng., 17(12):1638-1652. 2005.
[YB06] Structured Data Extraction from the Web Based on Partial Tree Alignment. Yanhong Zhai, Bing Liu. IEEE Trans. Knowl. Data Eng., 18(12):1614-1628. 2006.