Making informed decisions in today's always-on culture requires information. This information is usually obtained through automated collection processes that use rules or heuristics. The information can be projected onto a high-dimensional feature space that reflects the differences between the semantics of instances that belong to different classes. This projection allows us to learn a feature-based model from a learning set with labelled instances and later apply it to instances whose semantics are unknown in order to obtain a label that denotes them. These models have numerous applications that range from endowing datasets with semantics to verifying already existing labels. Our study of the bibliography reveals that, although information modelling techniques are applied for a variety of purposes, their underlying goal is the same. Unfortunately, current proposals focus on the text associated with the information and have some limitations, namely: they cannot deal with multi-domain datasets, they cannot deal with arbitrary record-based information, they are unable to model structural instances with no text, and they do not model the structure of the information. This hinders their applicability to real-world datasets, in which the information typically comes from different domains, is structured using arbitrarily complex records with purely structural instances, and has a structure that is an important factor when modelling it.
This motivated the creation of Tapon, the first information modelling proposal that addresses the previous problems by means of an ensemble, context-aware, machine learning approach. Tapon endows the instances in a dataset with a hint: a temporary label that reflects the apparent semantics of the instance. Hints allow us to compute additional hint-based features and expand the feature space in order to tell apart classes that are particularly similar and would otherwise overlap in feature space, which would make them impossible to differentiate. Our experimental results show that Tapon can be used to model real-world datasets very effectively. Our contributions are thus expected to improve the reliability of any process that relies on information modelling.
In Linked Data, datasets have to be linked to enable the discovery of additional information. Such links are established by a two-step task called link discovery, which first devises suitable link specifications to relate the instances of different classes from the datasets and then applies those specifications to generate the links. Unfortunately, link specifications have a main drawback: they do not cope well with homonyms, i.e., instances that have similar attribute values even though they represent different real-world concepts. The same problem has been faced in Collective Matching and Ontology Matching; however, techniques from both approaches focus on applying specifications given by the user. Additionally, the collective matching approach is devised for deduplication only.
The goal of our research is to assist the user in devising suitable link specifications between datasets that generate more precise links. In order to address this problem, we generate restrictions called relational link specifications, which are able to disambiguate homonym instances. A relational link specification combines a link specification and a set metric. The set metric measures the ratio of related instances that are the same between datasets. These specifications generate restrictions such as: link two authors if they have the same name and if 50% of their articles have the same title.
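The author example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the method described in the thesis: the function names `set_metric` and `should_link`, the dictionary layout of an author, the case-insensitive title comparison, and the choice to normalise the overlap by the smaller set are all hypothetical, since the abstract does not say how the ratio is computed.

```python
def set_metric(titles_a, titles_b):
    """Ratio of related instances (article titles) shared by two authors.

    Assumption: titles are compared case-insensitively and the overlap is
    divided by the size of the smaller set; the abstract does not fix this.
    """
    a = {t.strip().lower() for t in titles_a}
    b = {t.strip().lower() for t in titles_b}
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def should_link(author_a, author_b, threshold=0.5):
    """Link two authors if their names match and enough articles coincide."""
    same_name = author_a["name"].strip().lower() == author_b["name"].strip().lower()
    if not same_name:
        return False
    return set_metric(author_a["articles"], author_b["articles"]) >= threshold


# Hypothetical homonyms: three records that share the same author name.
a1 = {"name": "J. Smith", "articles": ["Link Discovery", "Graph Mining"]}
a2 = {"name": "J. Smith", "articles": ["link discovery", "Ontology Matching", "Data Exchange"]}
a3 = {"name": "J. Smith", "articles": ["Protein Folding"]}

print(should_link(a1, a2))  # one of two titles is shared, so the 50% threshold is met
print(should_link(a1, a3))  # same name but no shared titles, so no link
```

A plain name-based link specification would link all three records; the set metric is what separates the third author from the first two.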
José María García. (DBLP, Google Scholar)
IMPROVING SEMANTIC WEB SERVICES DISCOVERY AND RANKING: A LIGHTWEIGHT, INTEGRATED APPROACH.
Semantic Web Services (SWSs) have become a preeminent research area, where various underlying frameworks, such as WSMO or OWL-S, define Semantic Web ontologies to describe Web services, so that they can be automatically discovered, ranked, composed, and invoked according to user requirements and preferences. Specifically, several service discovery and ranking techniques have been envisioned, and related tools have been made available for the community. However, existing approaches offer a limited expressiveness to define preferences that are highly dependent on underlying techniques. Furthermore, discovery and ranking mechanisms usually suffer from performance, interoperability and integration issues that prevent a wide exploitation of semantically-enhanced techniques.
In order to address those issues, current research focuses on developing lightweight SWS descriptions, which enable interoperability of existing approaches, and corresponding discovery and ranking solutions that offer better performance with a contained penalty on precision and recall. In this dissertation, we address those challenges by proposing SOUP, a fully-fledged preference ontological model that serves as the foundation for the development of lightweight tools, namely EMMA and PURI, which improve discovery performance and integrate current ranking proposals, respectively.
Our contributions have been thoroughly evaluated and validated in both synthetic and real-world scenarios. First, the expressiveness and independence of the SOUP preference model have been validated by completely describing complex scenarios from the SWS Challenge. Moreover, we have carried out an experimental study of EMMA that shows a significant performance improvement with a negligible penalty on precision and recall. Finally, PURI has been applied within the EU FP7 project SOA4All, successfully integrating its three existing ranking mechanisms (objective, NFP-based, and fuzzy-based) into an interoperable discovery and ranking solution.
ON GENERATING MAPPINGS AND BENCHMARKING DATA EXCHANGE SYSTEMS: EXCHANGING DATA AMONGST SEMANTIC-WEB ONTOLOGIES.
The goal of data exchange is to populate a target data model using data that come from one or more source data models. It is common to address data exchange building on correspondences that are transformed into executable mappings. The problem that we address in this dissertation is how to generate executable mappings in the context of semantic-web ontologies, which are the shared data models of the Semantic Web. In the literature, there are many proposals to generate executable mappings. Most of them focus on relational or nested-relational data models, which cannot be applied to our context; unfortunately, the few proposals that focus on ontologies have important drawbacks, namely: they solely work on a subset of taxonomies, they require the target data model to be prepopulated, or they interpret correspondences in isolation, not to mention the proposals that actually require the user to handcraft the executable mappings.
In this dissertation, we present MostoDE, a new automated proposal to generate executable mappings in the context of semantic-web ontologies. Its salient features are that it does not have any of the previous drawbacks, it is computationally tractable, and it has been validated using a repository that has been generated using MostoBM, a benchmark that is also described in this dissertation. Our validation suggests that MostoDE is very efficient in practice and that the exchange of data it performs amongst semantic-web ontologies is appropriate.
ENTERPRISE INFORMATION INTEGRATION: AN UNSUPERVISED PROPOSAL FOR WEB PAGE CLASSIFICATION.
Integrating a web application into an automated business process requires designing wrappers that take user queries as input and map them onto the search forms that the application provides. Such wrappers build, amongst other components, on automatic navigators, which are responsible for executing search forms that have been previously filled in and navigating to the pages that provide the information required to answer the original user queries; this information is later extracted from those pages by means of an information extractor. A navigator relies on a web page classifier that makes it possible to discern which pages are relevant and which are not.
In this dissertation, we address the problem of designing an unsupervised web page classifier that builds solely on the information provided by the URLs and does not require extensive crawling of the site being analysed. In the literature, there are many proposals to classify web pages. None of them fulfils all of the requirements for a web page classifier in a navigator context, namely: to avoid a previous extensive crawling, which is costly and infeasible in some cases; to be unsupervised, which relieves the user from providing training information; and to use features from outside the page to be classified, which avoids having to download it previously.
Our contribution is CALA, a new automated proposal to generate URL-based web page classifiers. CALA builds a number of URL patterns that represent the different classes of pages in a web site, and further pages can be classified by matching their URLs to the patterns. Its salient features are that it fulfils every one of the previous requirements, it is computationally tractable, and it has been validated by a number of experiments using real-world, top-visited web sites. Our validation proves that CALA is very effective and efficient in practice.
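The classification step, matching a URL against per-class patterns, can be illustrated with a short Python sketch. It does not reproduce CALA: the patterns here are written by hand as regular expressions, whereas CALA builds its patterns automatically, and the pattern format, the class names, the `classify` function, and the example.com URLs are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Hypothetical URL patterns, one per class of page. In CALA these would be
# built automatically; here they are hand-written regular expressions.
PATTERNS = {
    "product": re.compile(r"/products/\d+"),
    "search": re.compile(r"/search"),
}


def classify(url, patterns=PATTERNS):
    """Return the class whose pattern matches the URL path, or None."""
    path = urlparse(url).path
    for label, pattern in patterns.items():
        if pattern.fullmatch(path):
            return label
    return None


print(classify("https://example.com/products/42?ref=home"))
print(classify("https://example.com/search?q=laptop"))
print(classify("https://example.com/about"))
```

Because only the URL is inspected, a page can be labelled before it is downloaded, which is the property the requirements above ask for.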