
Research

Web applications are designed to be rendered for human consumption and rarely provide an Application Programming Interface (API) that allows us to interact with them automatically, which makes them challenging as far as integration is concerned. Wrappers are modules that endow such applications with programmatic interfaces; they are implemented by emulating the behaviour of a person who is interacting with a user interface.

A typical wrapper consists of an enquirer, which maps user queries onto filled search forms; a navigator, which executes the filled search forms and reaches data pages; an information extractor, which extracts the information of interest from those pages; an ontologiser, which maps the extracted information onto a conceptual schema defined for a given domain; and a verifier, which attempts to detect erroneous result sets. In my research, I focus on information extractors.
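To make the architecture concrete, the following Python sketch chains the five components; every class and method name here is hypothetical, not part of an existing system:

    # A minimal sketch of the wrapper pipeline described above. All names
    # (enquirer, navigator, and so on) are illustrative, not a real API.
    from dataclasses import dataclass

    @dataclass
    class Wrapper:
        enquirer: object     # maps a user query onto filled search forms
        navigator: object    # executes the forms and collects data pages
        extractor: object    # applies extraction rules to each page
        ontologiser: object  # maps raw records onto a domain schema
        verifier: object     # flags result sets that look erroneous

        def answer(self, query):
            forms = self.enquirer.fill(query)
            pages = self.navigator.browse(forms)
            records = [r for page in pages for r in self.extractor.extract(page)]
            instances = self.ontologiser.map_to_schema(records)
            return self.verifier.check(instances)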

An information extractor is a generic algorithm that can be configured by means of extraction rules so that it extracts the information of interest from a web page and returns it according to a structured model. Rules range from regular expressions to context-free grammars and first-order clauses, but they all rely on mark-up tags or natural-language properties to determine which fragments of text correspond to the information of interest.
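For instance, a rule expressed as a regular expression might single out titles rendered in bold inside a hyperlink; the HTML fragment and the pattern below are invented for illustration:

    import re

    # Illustrative page fragment; real pages are far less regular.
    html = '<b><a href="/b/1">Learning FOIL</a></b> by <i>P. Jimenez</i>'

    # Extraction rule as a regular expression: a title is the text
    # rendered in bold inside a hyperlink.
    title_rule = re.compile(r'<b><a[^>]*>(.*?)</a></b>')

    print(title_rule.findall(html))  # ['Learning FOIL']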

Beyond hand-crafting information extraction rules, the literature provides roughly a hundred proposals that can be used to learn them automatically. Rule learners rely on a set of features that is buried inside the algorithm, and we would like to explore learners that do not depend on a predefined set of features. This led us to the first-order learner FOIL [Q96], which has previously been used to extract information in the context of the SRV system [CDFMMNS00].

The idea is to use FOIL to induce extraction rules that are as flexible as possible, so that they can be applied to any page on the web. These extraction rules consist of user-defined, first-order predicates that help to classify a piece of text or a DOM node as a piece of information of interest or not.

For FOIL to be able to learn, we need to instantiate these predicates on a number of training pages, specifying for each predicate and page the text fragments that satisfy the predicate, known as positive examples, and the text fragments that do not, known as negative examples. One of these predicates, known as the target predicate, characterises the piece of information to be extracted and is explained by means of a rule set. Each rule is grown by repeated specialisation: starting from the most general rule head, predicates are added to the body until the rule no longer covers any negative example. The goal is achieved when all positive examples of the target predicate are covered by the learned rule set.

A typical rule looks as follows: title(X) :- bold(X), link(X), next(X, A), author(A). Intuitively, it means that a title in the web page is a piece of text in bold inside a hyperlink, and that it is followed by another piece of text that has previously been identified as an author, whose defining rule must also be provided.
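The sketch below illustrates this covering strategy on toy data. It is a simplified, propositional version of FOIL: candidate literals are plain boolean features of a node, whereas real FOIL also introduces variables and relational literals such as next(X, A), and the training data is invented:

    import math

    # Toy training data: one entry per DOM node, as (features, is_title).
    nodes = [
        ({"bold", "link"}, True),
        ({"bold", "link"}, True),
        ({"bold"}, False),
        ({"link"}, False),
        ({"italic"}, False),
    ]
    literals = ["bold", "link", "italic"]

    def learn(examples):
        rules = []
        pos = [e for e in examples if e[1]]
        neg = [e for e in examples if not e[1]]
        while pos:                      # outer loop: cover all positives
            body, p, n = [], list(pos), list(neg)
            while n:                    # inner loop: specialise the rule
                def gain(lit):          # FOIL's information gain
                    p2 = sum(1 for f, _ in p if lit in f)
                    n2 = sum(1 for f, _ in n if lit in f)
                    if p2 == 0:
                        return -1.0
                    return p2 * (math.log2(p2 / (p2 + n2))
                                 - math.log2(len(p) / (len(p) + len(n))))
                best = max((l for l in literals if l not in body), key=gain)
                body.append(best)
                p = [e for e in p if best in e[0]]
                n = [e for e in n if best in e[0]]
            rules.append(body)
            pos = [e for e in pos if not all(l in e[0] for l in body)]
        return rules

    print(learn(nodes))  # [['bold', 'link']]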

Unfortunately, the search space of FOIL grows exponentially with the number of predicates available to characterise a piece of information of interest, and evaluating every single predicate is an expensive task. Consequently, applying FOIL to today's web pages, where the number of available predicates becomes very high, hinders its applicability in production environments.

This motivated us to work on a number of optimisations and heuristics that may help to reduce this search space and improve FOIL's efficiency whilst retaining its full expressiveness. The proposed improvements are explained below.

    Heuristics:
  • To provide a suitable catalogue of useful support predicates classified by domain. We use feature-selection techniques to identify the most useful predicates for each domain under study and add them to the catalogue, which helps when we tackle a new web site from an already studied domain (see the feature-selection sketch after this list).
  • To order the predicates to be explored, prioritising those that determine the left and right context of the target predicate (like next and previous), as proposed for the extractors of Kushmerick [K00].
  • To define the semantics of predicates in order to detect inconsistencies amongst the predicates in a rule, and to reduce the number of new predicates that are created and evaluated.
  • To translate the learned rules into regular expressions that are easily processed by a machine (see the translation sketch after this list).
    Optimisations:
  • To rule out predicates that seem random. Lerman et al. [LMK03] use Papoulis' hypothesis test [P90] to determine whether a text sequence is random by detecting regularities; when this test is incorporated into FOIL, a sequence of predicates can be ruled out before it is evaluated.
  • To use a paired Student's t-test to check whether two predicates are equal from a statistical point of view. The generation of predicates is then reduced by means of equivalence classes, each of which gathers the predicates that are statistically equal, so that we evaluate just one predicate per class (see the t-test sketch after this list).
  • To select the first predicate whose information gain exceeds a threshold. Otherwise, rules could grow longer, which has repercussions on the performance of the information extractors; tuning the threshold is therefore the key to taking advantage of this optimisation (see the threshold sketch after this list).
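As an illustration of the catalogue heuristic, a standard feature-selection technique can rank candidate predicates per domain. This is a minimal sketch using scikit-learn; the 0/1 matrix, which records which predicates hold for each labelled fragment, is invented:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2

    predicates = ["bold", "link", "italic", "numeric"]
    X = np.array([[1, 1, 0, 0],   # one row per labelled text fragment
                  [1, 1, 0, 0],
                  [1, 0, 0, 0],
                  [0, 0, 1, 1],
                  [0, 1, 0, 1]])
    y = np.array([1, 1, 0, 0, 0])  # 1 = the fragment is a title

    selector = SelectKBest(chi2, k=2).fit(X, y)
    catalogue = [p for p, kept in zip(predicates, selector.get_support()) if kept]
    print(catalogue)  # the predicates worth trying first in this domain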
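The translation heuristic could work along these lines; the mapping from predicates to HTML patterns below is invented, and relational predicates such as next(X, A) would need a more elaborate treatment:

    import re

    # Sketch: compiling the body of title(X) :- bold(X), link(X) into a
    # regular expression. The predicate-to-pattern mapping is invented.
    patterns = {"bold": ("<b>", "</b>"), "link": ("<a[^>]*>", "</a>")}

    def rule_to_regex(body):
        opening = "".join(patterns[p][0] for p in body)
        closing = "".join(patterns[p][1] for p in reversed(body))
        return re.compile(opening + "(.*?)" + closing)

    rx = rule_to_regex(["bold", "link"])
    print(rx.findall('<b><a href="/b/1">Learning FOIL</a></b>'))  # ['Learning FOIL']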
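For the t-test optimisation, a sketch could look as follows; the per-page coverage scores of the two predicates are invented, and the 0.05 significance level is just a conventional choice:

    from scipy.stats import ttest_rel

    # Fraction of positive examples that each predicate covers on ten
    # training pages (invented figures).
    coverage_bold   = [0.91, 0.88, 0.93, 0.90, 0.87, 0.92, 0.89, 0.91, 0.90, 0.88]
    coverage_strong = [0.90, 0.89, 0.92, 0.91, 0.86, 0.93, 0.88, 0.90, 0.91, 0.87]

    t_stat, p_value = ttest_rel(coverage_bold, coverage_strong)
    if p_value > 0.05:  # statistically indistinguishable: same class
        print("keep only one of the two predicates")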
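Finally, the threshold optimisation replaces the exhaustive scan for the best literal with an early exit. A minimal sketch, where gain is assumed to compute FOIL's information gain as in the earlier learning sketch and the threshold value is arbitrary:

    def pick_literal(candidates, gain, threshold=0.5):
        """Return the first literal whose gain clears the threshold,
        falling back to the best one seen if none does."""
        best, best_gain = None, float("-inf")
        for literal in candidates:
            g = gain(literal)
            if g > threshold:        # good enough: stop scanning early
                return literal
            if g > best_gain:        # remember the runner-up
                best, best_gain = literal, g
        return best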
Some of these proposals were implemented in an early version [PV08], reducing the cost by up to 40%, so we believe that speeding up the production of extraction rules is feasible. We bet on the expressiveness that FOIL offers and we try to improve its efficiency so that we can use it in integration solutions that emulate the behaviour of a user who is interacting with a web application.

Bibliography

[K00] Nicholas Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, 118(1-2):15-68, 2000.

[Q96] J. Ross Quinlan. Learning First-Order Definitions of Functions. Journal of Artificial Intelligence Research, 5:139-161, 1996.

[PV08] Pablo Palacios, Iñaki Fernández de Viana. Tuning up FOIL for extracting information from the web. IJCAT, 33(4):280-284, 2008.

[CDFMMNS00] Mark Craven, Dan DiPasquo, Dayne Freitag, Andrew McCallum, Tom M. Mitchell, Kamal Nigam, Seán Slattery. Learning to construct knowledge bases from the World Wide Web. Artificial Intelligence, 118(1-2):69-113, 2000.

[LMK03] Kristina Lerman, Steven Minton, Craig A. Knoblock. Wrapper Maintenance: A Machine Learning Approach. Journal of Artificial Intelligence Research, 18:149-181, 2003.

[P90] A. Papoulis. Probability and Statistics. Prentice Hall, Englewood Cliffs, NJ, 1990.