Information extraction from semi-structured web pages is usually performed using rule-based information extractors. Maintaining extraction rules is a tedious task that must be repeated whenever the web source changes.
We devised a new information extractor called TEX that is not based on rules. TEX is an unsupervised information extractor for semi-structured web pages. It takes as input a collection of web pages generated by the same server-side template, and returns the text fragments in each web page that contain relevant information.
TEX is based on the hypothesis that the text fragments the server-side template inserts to render the information retrieved from the database are irrelevant. Since those template fragments are common to the input web pages, TEX searches for shared contents, uses them to partition the pages into fragments, and keeps searching inside the resulting fragments.
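The search-partition-recurse idea described above can be illustrated with a minimal sketch. This is not the TEX implementation: the function names (`find_shared_pattern`, `split_on`, `extract`), the whitespace tokenization, and the `max_size`/`min_size` parameters are assumptions made for illustration only. The sketch looks for the longest token sequence shared by all inputs, splits every input around it, and recurses on the prefixes and suffixes; whatever cannot be matched any further is reported as a candidate fragment.

```python
from typing import List, Optional, Sequence, Tuple

Tokens = Tuple[str, ...]


def _index_of(text: Tokens, pattern: Tokens) -> int:
    """Return the first position of `pattern` inside `text`, or -1."""
    n = len(pattern)
    for i in range(len(text) - n + 1):
        if text[i:i + n] == pattern:
            return i
    return -1


def find_shared_pattern(texts: Sequence[Tokens], size: int) -> Optional[Tokens]:
    """Return the first token sequence of length `size` present in every text."""
    if not texts:
        return None
    base = texts[0]
    for start in range(len(base) - size + 1):
        candidate = base[start:start + size]
        if all(_index_of(t, candidate) >= 0 for t in texts[1:]):
            return candidate
    return None


def split_on(text: Tokens, pattern: Tokens) -> Tuple[Tokens, Tokens]:
    """Split `text` around the first occurrence of `pattern`."""
    i = _index_of(text, pattern)
    return text[:i], text[i + len(pattern):]


def extract(texts: List[Tokens], max_size: int, min_size: int = 1) -> List[List[Tokens]]:
    """For each input text, return the fragments left after removing shared patterns."""
    if all(len(t) == 0 for t in texts):
        return [[] for _ in texts]
    longest = min(len(t) for t in texts)
    # Try the largest pattern sizes first, so long template runs are removed early.
    for size in range(min(max_size, longest), min_size - 1, -1):
        pattern = find_shared_pattern(texts, size)
        if pattern is not None:
            prefixes, suffixes = zip(*(split_on(t, pattern) for t in texts))
            left = extract(list(prefixes), max_size, min_size)
            right = extract(list(suffixes), max_size, min_size)
            return [l + r for l, r in zip(left, right)]
    # No shared pattern remains: each non-empty text is a candidate fragment.
    return [[t] if t else [] for t in texts]


# Hypothetical input: two pages rendered by the same made-up template.
pages = [
    tuple("<html> <b> Name : </b> Alice <i> Age : </i> 30 </html>".split()),
    tuple("<html> <b> Name : </b> Bob <i> Age : </i> 41 </html>".split()),
]
print(extract(pages, max_size=10))
# [[('Alice',), ('30',)], [('Bob',), ('41',)]]
```

Note that a sketch this simple treats any value that happens to appear in every input page as template text, and always splits on the first occurrence of a shared pattern.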
We have tested TEX on a large collection of datasets, and the results regarding recall and extraction time are very promising.
TEX is available to the research community and can be downloaded from this link. If you have any questions, suggestions, or feedback, please don't hesitate to contact me at hassansleiman at us.es.
1. Hassan A. Sleiman, Rafael Corchuelo: Towards a Method for Unsupervised Web Information Extraction. ICWE 2012: 427-430.
2. Hassan A. Sleiman, Rafael Corchuelo: TEX: An Efficient and Effective Unsupervised Web Information Extractor. Knowledge-Based Systems, available online 25 October 2012, ISSN 0950-7051.