Comparing information extraction techniques from an empirical point of view is problematic insofar as there is no up-to-date, publicly available collection of datasets on which they can be assessed; neither is there a consensus on how to evaluate the results from a comparative point of view.
CEDAR (Community-Effort test beD for informAtion extRaction) is a testbed that is intended to evaluate different information extraction proposals homogeneously. CEDAR is composed of the following components:
- The Annotator: This tool helps researchers create and manage datasets.
- The Collection of Datasets: CEDAR provides a collection of datasets that was gathered from nine domains, namely: books, cars, conferences, doctors, jobs, movies, real estate, sports, and video games.
- The API: When users devise a new information extraction technique, they can rely on our dataset structure, which allows them to reuse the shared datasets. If users wish to evaluate their proposals but their datasets use a different format, the solution is to write a simple translator using this API. We will make the Javadoc for this API public soon.
- The k-Cross Validator: k-fold cross validation is a good methodology to estimate a technique's performance on real web pages. It avoids over-fitting the learnt rules to the datasets used and ensures that every web page in a dataset is used both for training and for testing. The variance of the estimated results is reduced as k is increased.
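Since the Javadoc for the API is not yet public, the sketch below only illustrates the translator idea in general terms: the `Record` class and the tab-separated input format are placeholders of our own invention, not part of the actual CEDAR API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of a format translator. A real translator would map
// a third-party dataset format onto the classes exposed by the CEDAR API;
// here we target a minimal placeholder Record type instead.
public class TranslatorSketch {
    // Stand-in for a dataset entry: a page URL plus attribute/value annotations.
    static final class Record {
        final String url;
        final Map<String, String> annotations = new LinkedHashMap<>();
        Record(String url) { this.url = url; }
    }

    // Translate one line of a hypothetical tab-separated dataset
    // ("url<TAB>attr=value;attr=value") into a Record.
    static Record translate(String line) {
        String[] parts = line.split("\t", 2);
        Record r = new Record(parts[0]);
        if (parts.length > 1) {
            for (String pair : parts[1].split(";")) {
                String[] kv = pair.split("=", 2);
                if (kv.length == 2) r.annotations.put(kv[0], kv[1]);
            }
        }
        return r;
    }

    public static void main(String[] args) {
        Record r = translate("http://example.com/book1\ttitle=Dune;price=9.99");
        System.out.println(r.url + " -> " + r.annotations);
    }
}
```

Once the Javadoc is published, the placeholder `Record` type would simply be replaced by the corresponding CEDAR API classes.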
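The partitioning step behind k-fold cross validation can be sketched as follows. This is a generic illustration, not the CEDAR implementation: pages are represented by integer indices, and the fold assignment is a plain round-robin split after shuffling.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class KFold {
    /** Partition indices 0..n-1 into k shuffled, roughly equal folds. */
    static List<List<Integer>> makeFolds(int n, int k, long seed) {
        List<Integer> idx = new ArrayList<>();
        for (int i = 0; i < n; i++) idx.add(i);
        Collections.shuffle(idx, new Random(seed));
        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < k; f++) folds.add(new ArrayList<>());
        // Round-robin assignment keeps fold sizes balanced.
        for (int i = 0; i < n; i++) folds.get(i % k).add(idx.get(i));
        return folds;
    }

    public static void main(String[] args) {
        int n = 10, k = 5;
        List<List<Integer>> folds = makeFolds(n, k, 42L);
        // Each fold serves once as the test set; the other k-1 folds
        // form the training set, so every page is used for both roles.
        for (int f = 0; f < k; f++) {
            List<Integer> test = folds.get(f);
            List<Integer> train = new ArrayList<>();
            for (int g = 0; g < k; g++) if (g != f) train.addAll(folds.get(g));
            System.out.println("round " + f + ": train=" + train.size()
                    + " test=" + test.size());
        }
    }
}
```

Averaging the per-round scores yields the performance estimate; increasing k shrinks the test sets but reduces the variance of that estimate, as noted above.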
The CEDAR testbed can be downloaded here. We are creating a repository so that researchers may upload and share their datasets. We are also increasing the number of datasets by annotating new web sites from different domains. Should you have any questions or suggestions, please do not hesitate to contact me.