TDG Juan Carlos Roldán Committed to research!

Publications

Here are my publications as lead author.

 

On Extracting Data from HTML Tables
Juan C. Roldán, Patricia Jiménez, Rafael Corchuelo
In review process.

 

Extracting web information using representation patterns
Juan C. Roldán, Patricia Jiménez, Rafael Corchuelo
HotWeb, 4.1-4.5. 2017

Feeding decision support systems with Web information typically requires sifting through an unwieldy amount of information that is available in human-friendly formats only. Our focus is on a proposal to extract information from semi-structured documents in a structured format, with an emphasis on it being scalable and open. By semi-structured, we mean that it must focus on information that is rendered using regular formats, not free text; by scalable, we mean that the system must require a minimum amount of human intervention and must not be targeted at extracting information from a particular domain or web site; by open, we mean that it must extract as much useful information as possible and not be subject to any pre-defined data model. In the literature, there is only one open proposal, but it is not scalable, since it requires human supervision on a per-domain basis. In this paper, we present a new proposal that relies on a number of heuristics to identify patterns that are typically used to represent the information in a web document. Our experimental results confirm that our proposal is very competitive in terms of effectiveness and efficiency.


Kizomba: An Unsupervised Heuristic-Based Web Information Extractor
Juan C. Roldán
PAAMS (Special Sessions), 383-385. 2016

The Web is an ever-growing repository of valuable information. That information lacks semantics since it is buried in web documents that are represented using HTML. Information extractors are software components that help software engineers in the task of extracting structured information from web documents.