On this page, we make available to the reviewers the datasets we used in our experiments, grouped into domains:
Books: book (record), title, author, price, and isbn
Videogames: videogame (record), date, developer, name, and platform
Conferences: conference (record), date, place, title, and url
Doctors: doctor (record), name, fax, specialty, address (record), address-value, and phone
Jobs: offer (record), company, location, and category
Movies: movie (record), title, director, actor, year, and runtime
Sports: player (record), name, birth, birth-place, height, weight, money, and country
Business: business (record), phone, street, name, and city
Each zip file contains the datasets of one of the eight domains described in the paper. Inside each one, the datasets are placed in folders named after the Web site they were taken from. Inside each of these folders there are 30 JSON files, each containing one dataset.
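The folder layout described above can be traversed with a short script. The following is a minimal sketch, assuming a zip file has already been extracted and that every site folder sits directly under the domain folder; the function name load_domain is hypothetical, and the internal structure of each JSON file is not assumed.

```python
import json
from pathlib import Path


def load_domain(domain_dir):
    """Return {site_name: [parsed JSON dataset, ...]} for one extracted domain zip.

    Assumes the layout domain_dir/<site>/<dataset>.json (an assumption based on
    the description on this page).
    """
    datasets = {}
    for site_dir in sorted(Path(domain_dir).iterdir()):
        if not site_dir.is_dir():
            continue  # skip stray files next to the site folders
        files = sorted(site_dir.glob("*.json"))
        datasets[site_dir.name] = [
            json.loads(f.read_text(encoding="utf-8")) for f in files
        ]
    return datasets
```

For an extracted domain, the returned dictionary should have one key per site folder, and each value should be a list of 30 parsed datasets.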
To perform the tests, we parsed the learning datasets and extracted their instances, computed their hint-free features, and used the resulting feature vectors to learn classifiers and create a hint-free model. We applied this model to each instance to give it a hint. Then, we computed hint-based features and used the extended feature vectors to learn additional classifiers and create a hint-based model.
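The two learning stages described above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the paper: the feature functions, the learn procedure, and the representation of feature vectors as plain lists are all hypothetical placeholders.

```python
def train_two_stage(instances, labels, hint_free_features, hint_based_features, learn):
    """Learn a hint-free model, then a hint-based model on extended vectors.

    All callables are hypothetical placeholders:
      hint_free_features(instance) -> list of numbers
      hint_based_features(index, hints) -> list of numbers
      learn(vectors, labels) -> model, where model(vector) -> label
    """
    # Stage 1: learn from hint-free feature vectors only.
    free_vectors = [hint_free_features(x) for x in instances]
    hint_free_model = learn(free_vectors, labels)

    # Give every learning instance a hint using the hint-free model.
    hints = [hint_free_model(v) for v in free_vectors]

    # Stage 2: extend each vector with features computed from the hints.
    extended = [
        fv + hint_based_features(i, hints) for i, fv in enumerate(free_vectors)
    ]
    hint_based_model = learn(extended, labels)
    return hint_free_model, hint_based_model
```

Passing the learner as an argument keeps the sketch independent of any particular classifier library.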
The hint-free model was applied to the testing datasets in order to endow their instances with a hint. Then, the hint-based model was applied in several iterations, refining the hints. After a variable number of iterations, the final hints were compared to the true labels in order to measure their accuracy.
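The testing procedure above can be sketched as a simple refinement loop. The stopping rule is an assumption: the text only says the number of iterations is variable, so this sketch stops when the hints no longer change or when a hypothetical max_iterations cap is reached. The two models are placeholder callables that are assumed to compute their own features internally.

```python
def refine_hints(instances, hint_free_model, hint_based_model, max_iterations=10):
    """Assign an initial hint to every instance, then refine the hints iteratively.

    Hypothetical signatures:
      hint_free_model(instance) -> label
      hint_based_model(instance, hints, index) -> label
    """
    hints = [hint_free_model(x) for x in instances]
    for _ in range(max_iterations):
        # Every instance is re-labelled using the hints of the previous iteration.
        refined = [hint_based_model(x, hints, i) for i, x in enumerate(instances)]
        if refined == hints:
            break  # assumed stopping rule: the hints have stabilised
        hints = refined
    return hints


def accuracy(hints, true_labels):
    """Fraction of instances whose final hint equals the true label."""
    if not true_labels:
        raise ValueError("true_labels must not be empty")
    correct = sum(h == t for h, t in zip(hints, true_labels))
    return correct / len(true_labels)
```

Calling accuracy on the list returned by refine_hints and the true labels of a testing dataset gives the kind of measure described above.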
The classifiers that were learnt and used by the models can be found here.
The zip file contains eight folders, corresponding to the eight number-of-domains variations performed during testing. Inside each of them, there is a folder with the hint-based classifiers and a folder with the hint-free classifiers. Each of these contains folders with the binary attribute and record instance classifiers, as well as the multiclass classifiers that combine their output. All classifiers were stored using Spark.