Iñaki Fernández de Viana y Glez, TDG

Research

A web wrapper is a system with a programming interface that enables access to the data provided by web islands by simulating human interaction. A typical web wrapper takes a query as input, locates the appropriate search forms, fills them in, navigates through the resulting pages, extracts the attributes of interest from these pages, and returns them as a result set.

When information extractors are composed of extraction rules that rely on HTML landmarks, they can only extract information from the same information source on which they were trained. Therefore, if the source changes, the data returned may in some cases be incorrect. Unless the information generated by wrapping agents is verified automatically, these incorrect data can go unnoticed by the applications that use them.
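To make this failure mode concrete, the following is a minimal sketch of a landmark-based extraction rule. The page snippets, the rule and the names `TITLE_RULE` and `extract_titles` are invented for illustration only; real extraction rules are usually learned from training pages.

```python
import re

# Hypothetical extraction rule: a title is whatever sits between the
# HTML landmarks '<td class="title">' and '</td>'.
TITLE_RULE = re.compile(r'<td class="title">(.*?)</td>')


def extract_titles(html: str) -> list[str]:
    """Apply the landmark-based rule and return the extracted attributes."""
    return TITLE_RULE.findall(html)


# Page layout the rule was trained on (invented example).
old_page = (
    '<tr><td class="title">Dune</td><td class="price">9.99</td></tr>'
    '<tr><td class="title">Emma</td><td class="price">5.50</td></tr>'
)

# Same data after a redesign: the landmark now marks a different column.
new_page = (
    '<tr><td class="item">Dune</td><td class="title">9.99</td></tr>'
    '<tr><td class="item">Emma</td><td class="title">5.50</td></tr>'
)

old_titles = extract_titles(old_page)  # the real titles
new_titles = extract_titles(new_page)  # prices, silently labelled as titles
```

The rule raises no error on the redesigned page; it simply returns prices where titles are expected, which is the kind of silent failure that verification aims to detect.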

A general verification framework begins by invoking the wrapper in order to obtain the set of results that will be used to produce the training set. This training set is characterised by a set of numerical and categorical features, which are profiled and combined to model the training set. When we receive an unverified result set, we extract the values of the different features and then check whether they follow the previously computed model.
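The flow described above can be sketched roughly as follows. This is an illustrative toy rather than any published verifier: the result-set representation, the three features, and the tolerance parameters `k` and `eps` are all assumptions chosen to keep the example short.

```python
from statistics import mean, pstdev

# Assumption: a result set is a list of (slot_class, value) pairs.
ResultSet = list[tuple[str, str]]


def features(result_set: ResultSet) -> dict[str, float]:
    """Map a result set onto a few illustrative numeric features."""
    titles = [v for c, v in result_set if c == "title"]
    prices = [v for c, v in result_set if c == "price"]
    digit_start = sum(p[:1].isdigit() for p in prices)
    return {
        "n_titles": float(len(titles)),
        "n_prices": float(len(prices)),
        # Fraction of price attributes that start with a digit.
        "prices_digit_start": digit_start / len(prices) if prices else 0.0,
    }


def build_model(training: list[ResultSet]) -> dict[str, tuple[float, float]]:
    """Profile every feature by its mean and standard deviation."""
    vectors = [features(rs) for rs in training]
    return {
        name: (mean([v[name] for v in vectors]),
               pstdev([v[name] for v in vectors]))
        for name in vectors[0]
    }


def follows_model(result_set: ResultSet,
                  model: dict[str, tuple[float, float]],
                  k: float = 3.0, eps: float = 0.5) -> bool:
    """Accept if every feature lies within k deviations (plus slack eps) of its mean."""
    vector = features(result_set)
    return all(abs(vector[name] - mu) <= k * sigma + eps
               for name, (mu, sigma) in model.items())


training = [
    [("title", "Dune"), ("price", "9.99"), ("title", "Emma"), ("price", "5.50")],
    [("title", "Ulysses"), ("price", "12.00"), ("title", "Ivanhoe"), ("price", "7.25")],
    [("title", "Hamlet"), ("price", "4.00"), ("title", "Walden"), ("price", "6.10")],
]
model = build_model(training)

good = [("title", "Faust"), ("price", "8.00"), ("title", "Candide"), ("price", "3.75")]
broken = [("title", "8.00"), ("title", "3.75")]  # prices mislabelled as titles

good_ok = follows_model(good, model)
broken_ok = follows_model(broken, model)
```

The slack `eps` is only there so that features with zero variance in the training set do not reject every deviation, however small; a real verifier would need a more principled treatment.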

Based on our analysis of the current literature, we have built a general verification framework composed of the following components:

  • Reaping data islands: Consists of gathering valid result sets using the wrapping agents to be verified. The reaper executes a few queries that the wrapping agents can execute at any time. The result of a reaper is a working set, which is a structure that stores the wrapping agents used and a map that associates each query with its corresponding result set. These result sets are multi-slot, where the term slot refers to either an attribute or a record, and every slot is labelled with a class.
  • Assembling training sets: The result sets returned by a reaper must be examined by a person to classify them as either valid or invalid. A training set can thus be modelled as a pair of working sets that contain valid and invalid result sets, respectively. Both working sets must originate from the same wrapping agents, and all the result sets returned by a reaper are expected to be valid. However, models that build on valid data only tend to overgeneralise; hence, it is necessary to create new synthetic result sets using so-called perturbations. There are several proposals, but all of them may lead to result sets that deviate largely from actual result sets.
  • Building verification models: A verification model is a characterisation of a training set that builds on the analysis of a number of features. Features are quantifiable characteristics, and their values can be used as a form of evidence to decide whether a result is valid or not. Features can be classified along two orthogonal axes: whether they are numeric or categorical, and whether they are applicable to slots or result sets. Numeric features transform slots or result sets into real numbers. The literature reports on many numeric features, which we have grouped into several categories that range from counting the number of slots of a given class to counting the attributes of a given class that match a given starting or ending pattern. Categorical features range from patterns that describe the structure of a record to constraints on the values of some attributes.
  • Verification model: When the models are constructed, a function f(x) has to be inferred from the training set. This function should be such that, for a given feature vector x, it yields an estimate of its quality, that is, of whether this vector is similar to the rest of the training set. In [2] the training set is characterised by a vector in which every feature is associated with its average value in the training set. In [1] features are modelled as if they were random variables whose Gaussian distributions can be inferred from the training set; thus, to profile the value of a feature on an unverified result set, one can compute the probability that the corresponding random variable takes this value. The technique presented in [3] also models every feature as if it were a random variable with a Gaussian distribution, but the profiles are calculated as the probability that a feature might have another value with a higher probability.
  • Data filtering: There is a chance that an alarm is a false positive. Filters can be viewed as sanity checks that explore a result set, search for attributes whose features deviate largely from their standard values, and then apply an expensive procedure in an attempt to find out whether they can be considered valid in spite of being outliers. Different authors assume that each attribute class can be mapped onto a number of external sources that provide semantically equivalent data. Then, they use the attribute values, the lexical patterns of all attributes of each class, and search engines to check whether an attribute is valid.
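The two Gaussian-based ways of profiling a feature mentioned in the list above can be sketched as follows. This is one possible reading of the textual descriptions, not the actual algorithms of [1] or [3]; the function names and the sample values are hypothetical.

```python
import math
from statistics import mean, stdev


def fit_gaussian(values: list[float]) -> tuple[float, float]:
    """Estimate the mean and sample standard deviation of a feature."""
    return mean(values), stdev(values)


def density_profile(x: float, mu: float, sigma: float) -> float:
    """Gaussian density at x: how likely the observed value itself is."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def more_likely_mass(x: float, mu: float, sigma: float) -> float:
    """Probability of drawing a value whose density is higher than that of x.

    For a Gaussian these are the values closer to the mean than x,
    so the mass is P(|Z| < |z|) = erf(|z| / sqrt(2)).
    """
    z = abs(x - mu) / sigma
    return math.erf(z / math.sqrt(2.0))


# Hypothetical feature: number of records per valid result set.
records_per_result_set = [10.0, 12.0, 9.0, 11.0, 10.0, 8.0]
mu, sigma = fit_gaussian(records_per_result_set)

typical = more_likely_mass(10.0, mu, sigma)  # value at the mean
unusual = more_likely_mass(25.0, mu, sigma)  # far from anything seen in training
```

Under the first reading a low density signals a suspicious value, whereas under the second a mass close to 1 does; how the per-feature profiles are then combined into a single decision is left open here.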

The wrapper verification process is long and complex, and it is composed of well-defined tasks spread over many stages. Each of these tasks has prompted different authors to propose different solutions. None of the proposals solves the problem completely; most of them tend to be quite simple and, we believe, lack a global vision of the problem. Our current research challenges focus on:

  • The verification modelling techniques described assume that the data sets returned by the reaping plan are homogeneous. In order to work with truly homogeneous data sets we propose to analyse the training set data and obtain a series of new data sets, which, this time, will be homogeneous.
  • For the verifier training to be adequate, it is advisable that the training set has both valid and invalid examples. This is a problem for us as the training set has only valid examples.
  • Before creating the verification model, it would be interesting to study the candidate feature set in order to reduce its size.
  • We cannot assume that all features follow a normal distribution; hence, we must find techniques that allow them to be modelled without any assumptions about their distribution.
  • If the problem of wrapper verification is re-phrased in terms of feature vectors and how close they are to each other, the problem is then closely related to the problem of novelty recognition.
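As a rough illustration of the last two points, the sketch below scores a feature vector by its distance to its nearest neighbours in the training set, which requires no assumption about how the features are distributed. It is a generic k-nearest-neighbour novelty score, not a technique taken from the cited works; the vectors and the value of `k` are hypothetical, and in practice the features would need to be scaled to comparable ranges first.

```python
import math

Vector = tuple[float, ...]


def knn_distance(x: Vector, reference: list[Vector], k: int = 2) -> float:
    """Mean Euclidean distance from x to its k nearest vectors in reference."""
    dists = sorted(math.dist(x, r) for r in reference)
    return sum(dists[:k]) / k


def novelty_score(x: Vector, training: list[Vector], k: int = 2) -> float:
    """Ratio of x's k-NN distance to the typical k-NN distance inside the training set."""
    typical = [
        knn_distance(t, training[:i] + training[i + 1:], k)
        for i, t in enumerate(training)
    ]
    baseline = sum(typical) / len(typical)
    return knn_distance(x, training, k) / baseline


# Hypothetical feature vectors: (records per result set, fraction of well-formed prices).
training = [(10.0, 1.0), (11.0, 0.9), (9.0, 1.0), (10.0, 0.95), (12.0, 1.0)]

familiar = novelty_score((10.5, 0.97), training)
novel = novelty_score((0.0, 0.0), training)
```

A score around or below 1 means the vector is about as close to the training set as the training vectors are to each other; a much larger score suggests a novel, and therefore possibly invalid, result set.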

References

[1] N. Kushmerick et al. Wrapper verification. World Wide Web, 3(2), 2000.

[2] K. Lerman et al. Wrapper maintenance: A machine learning approach. Journal of Artificial Intelligence Research, 18, 2003.

[3] R. McCann et al. Mapping maintenance for data integration systems. In VLDB, 2005.