TDG — Daniel Ayala



  • The source code of the latest version of our proposal can be found in this repository. It includes CSV files with the experimental results of each technique we tested (results.csv), as well as the training times of TAPON (times.csv).
  • The datasets used for learning and labelling can be found here. The zip file contains datasets from the 10 sources we used, separated into 10 folds. Each fold contains several JSON files, each corresponding to a labelled information structure.


Semantic labelling consists of endowing structured information with labels that denote its classes in a known schema. This allows us to integrate such information into local models. The structured information can be seen as a tree-like structure with records (structural instances of a class, the intermediate nodes of the tree, which have no textual value) and attributes (instances of a class with a text attached to them, the leaf nodes of the tree).
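The record/attribute distinction can be sketched with a toy structure; the class names ("book", "title", "price") and the dictionary layout are illustrative assumptions, not the actual dataset format:

```python
# Records are intermediate nodes with no textual value; attributes are
# leaves that carry a text. This mirrors the tree described above.
book = {
    "class": "book",  # record: a structural instance of a class
    "children": [
        {"class": "title", "text": "Pride and Prejudice"},  # attribute (leaf)
        {"class": "price", "text": "$5.95"},                # attribute (leaf)
    ],
}

def is_record(node):
    """A node is a record if it has children instead of a textual value."""
    return "children" in node

def is_attribute(node):
    """A node is an attribute if it carries a textual value."""
    return "text" in node

print(is_record(book))                    # True
print(is_attribute(book["children"][1]))  # True
```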

In order to perform semantic labelling, we must create a model that learns what makes a class unique according to some features. For example, a model could be a decision-tree classifier that has learnt that instances of class "price" (which are attributes) have 4 digits and no letters, or that instances of class "book" (which are records) have 4 children.
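A minimal sketch of this idea, using a hand-written rule as a stand-in for a learnt decision tree (the features, thresholds, and class names are invented for illustration):

```python
def simple_features(text):
    """Simple textual features of the kind a decision tree could split on."""
    digits = sum(c.isdigit() for c in text)
    letters = sum(c.isalpha() for c in text)
    return digits, letters

def classify(text):
    """A stand-in for a learnt tree: prices have digits and no letters."""
    digits, letters = simple_features(text)
    if digits > 0 and letters == 0:
        return "price"
    return "title"

print(classify("$9.99"))                # price
print(classify("Pride and Prejudice"))  # title
```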

Simple features such as textual pattern occurrences or some structural measures are enough to model these classes, but there are complex cases in which they may fall short. For example, let us suppose our model must include the classes "price" and "discounted price". They follow the same textual format, occupy the same position in the structure, and even the distribution of their numerical values is the same. Modelling these classes seems to be a challenge.

There is a key aspect to model that simple features don't cover: the relationship between two classes. If this is the dataset to be labelled:

|--A1 "Pride and Prejudice"
|--A2 "$5.95"
|--A3 "$4.00"

We, as humans, easily identify the price and the discounted price because the discounted price has a lower value than the other attribute, but including this information in a model is more difficult.

In order to solve this and similar cases, we have devised TAPON, a semantic labelling proposal that aims to model the relationships between classes by means of novel features we call hint-based features, which are computed in a two-phase process. TAPON works as follows: first, it learns a model with traditional features. This model is used to endow the dataset with a preliminary set of labels we call "hints" (since they are a temporary hint of what each instance seems to be). Then, we compute additional features that can only be applied to datasets with labelled instances (the hint-based features). With the full set of features, we learn a second model of increased quality. This second model is applied to datasets labelled with hints, and outputs a set of refined labels.
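The two phases can be outlined as pseudocode; the helper names (`learn`, `label`, `traditional_features`, `hint_based_features`) are placeholders for illustration, not TAPON's actual API:

```
model_1        ← learn(training_data, traditional_features)
hints          ← label(model_1, dataset)            # phase 1: preliminary labels
model_2        ← learn(training_data, traditional_features ∪ hint_based_features)
refined_labels ← label(model_2, dataset, hints)     # phase 2: refined labels
```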

So, how could TAPON help label the discounted price? We could include the following hint-based feature: the difference between the numerical value of the attribute and the numerical value of the nearest instance of each numerical class. This is actually a group of features: "the difference between the value of this attribute and that of the nearest price", "the difference between the value of this attribute and that of the nearest year", "the difference between the value of this attribute and that of the nearest height", and so on. This feature should be enough to distinguish between a price and a discounted price: the difference is positive in one case and negative in the other. Intuitively, a discounted price is near something that seems to be a price, but has a lower value; a price is near something that seems to be a price, but has a greater value.

In order to compute this feature, there must be instances labelled as "price", "year", "height", etc. TAPON would do the following: in a first iteration, the dataset would be given a first set of labels, and since the two attributes are indistinguishable, both would be labelled as "price" (it does not matter if both are labelled as "discounted price", as long as that is always the case). Since the dataset is endowed with hints, we can compute the feature "difference between this attribute's value and that of the nearest price". In the case of A2, the value would be 1.95, and in the case of A3, it would be -1.95. This feature always takes positive values for prices and negative values for discounted prices, thus enabling the second model to properly classify the attributes.
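This computation can be reproduced on the running example; the dictionary layout and field names are assumptions for illustration, and "nearest" is simplified to the only other hinted price (in a real tree it would use graph distance):

```python
def numeric(text):
    """Parse the numerical value of an attribute's text, e.g. '$5.95'."""
    return float(text.lstrip("$"))

# A2 and A3 after phase 1: both received the hint "price".
hinted = [
    {"id": "A2", "text": "$5.95", "hint": "price"},
    {"id": "A3", "text": "$4.00", "hint": "price"},
]

def diff_to_nearest_price(attr, dataset):
    """Hint-based feature: difference between this attribute's value and
    that of the nearest instance hinted as 'price'."""
    others = [a for a in dataset if a is not attr and a["hint"] == "price"]
    nearest = others[0]  # simplification: only one other hinted price here
    return numeric(attr["text"]) - numeric(nearest["text"])

for a in hinted:
    print(a["id"], round(diff_to_nearest_price(a, hinted), 2))
# A2 1.95   (positive  -> price)
# A3 -1.95  (negative  -> discounted price)
```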

This is just one example of hint-based features. Other examples could be "the distance in the tree graph to the nearest instance labelled as a name", "the number of children labelled as a price", or even more complex ones, such as "the ratio between the nearest height and the nearest width". The underlying idea is the same: performing a first iteration to obtain a first hint about the classes, and a second one that injects the additional features.
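As a final sketch, the structural feature "number of children labelled as a price" is straightforward to compute once hints are in place; the record layout and hint names below are invented for illustration:

```python
# A hinted record whose children carry phase-1 hints.
record = {
    "hint": "book",
    "children": [
        {"hint": "title"},
        {"hint": "price"},
        {"hint": "price"},
    ],
}

def children_with_hint(node, hint):
    """Hint-based feature: how many children of a record carry a given hint."""
    return sum(1 for child in node.get("children", []) if child["hint"] == hint)

print(children_with_hint(record, "price"))  # 2
```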