A large number of web applications provide only a human-oriented interface, which usually presents a form that must be filled in with some information and submitted. Afterwards, the human user has to click through response pages until the relevant information is found. Integrating those applications involves emulating this human interaction while offering a programmatic interface to virtual integration systems; this is achieved by means of a wrapper.
Inside a wrapper, the navigator is the module responsible for executing the filled-in forms provided by an enquirer and navigating through the results to fetch data pages. Note that this process may lead to a data page in one step, to a no-results page, or to a so-called page hub [K99], which is a set of interlinked pages that provide short descriptions of the information in other data pages together with links to them.
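The three possible outcomes above can be illustrated with a minimal sketch of a naive navigator. It is not the design proposed here: the names `PageKind`, `navigate`, `fetch`, `classify_page`, and `extract_links` are hypothetical, and the three callables are assumed to be supplied by the rest of the wrapper.

```python
from collections import deque
from enum import Enum


class PageKind(Enum):
    """The three outcomes a form submission may lead to."""
    DATA = "data"
    NO_RESULTS = "no-results"
    HUB = "hub"


def navigate(start_url, fetch, classify_page, extract_links):
    """Return the URLs of the data pages reachable from start_url.

    fetch, classify_page and extract_links are hypothetical callables:
    fetch(url) returns a page, classify_page(page) returns a PageKind,
    and extract_links(page) returns the URLs linked from that page.
    """
    data_pages = []
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        page = fetch(url)
        kind = classify_page(page)
        if kind is PageKind.DATA:
            data_pages.append(url)
        elif kind is PageKind.HUB:
            # Naive behaviour: every link of a hub page is followed.
            for link in extract_links(page):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
        # A no-results page contributes nothing and ends this branch.
    return data_pages
```

Note that this sketch follows every link found on a hub page, so it fetches irrelevant pages as well as relevant ones.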
There is a lack of techniques to navigate intelligently through search results in virtual integration contexts while keeping the server load low and saving bandwidth, and therefore offering a reasonable client response time.
Visiting many irrelevant links is problematic insofar as it has an impact on a server’s response time and on the bandwidth used. Our preliminary studies indicate that typical hub pages include 60-90% irrelevant links [H09], which makes clear the need for intelligent navigators.
To date, there are many approaches to the problem of navigating through a site, but most of them are unable to find pages containing relevant information efficiently.
Traditional crawlers [RG01] navigate a site by retrieving all of the pages they find, which means visiting all of the links. Focused crawlers only crawl pages about a given topic or pages that may lead to them, but they are not able to set relevant links apart and the ratio of irrelevant links visited is far from negligible [PS06, CDK99, ALG09]. Automated navigation pattern learners can be the solution to our problem, but the existing proposals have some shortcomings that must be addressed. A recent proposal by Vidal et al. [VSM08] is able to learn navigation patterns that set irrelevant pages apart without fetching them, i.e., the decision is solely based on their links; unfortunately, it seems unable to navigate through hub pages and does not guarantee that all relevant data pages are retrieved.
We propose to research how intelligent techniques can be applied to build a navigator that navigates intelligently. In our context, this means being able to classify the links found on every page according to their functionality, that is, their destination, before following them, and hence being able to leave aside irrelevant links. We aim at relieving the user from the burden of classifying the links by hand; therefore, the main focus is on finding a set of features to automatically classify links with high precision. Moreover, this set of features has to be sufficiently generic to be applicable to most web sites, so that there is no need to build a model for each particular site.
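As an illustration of classifying links before following them, the sketch below computes a few site-independent features for each link on a page and keeps only the links that a given predicate accepts. The features shown (anchor text, path depth, number of query parameters, same host) are assumptions chosen for the example, not the feature set this research aims to find, and the names `LinkFeatures`, `extract_link_features`, and `select_links` are hypothetical.

```python
from dataclasses import dataclass
from html.parser import HTMLParser
from urllib.parse import parse_qs, urljoin, urlparse


@dataclass
class LinkFeatures:
    """Illustrative, site-independent features of one link (assumed set)."""
    url: str
    anchor_text: str
    path_depth: int
    num_query_params: int
    same_host: bool


class LinkCollector(HTMLParser):
    """Collects (href, anchor text) pairs from the <a> elements of a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_link_features(html, base_url):
    """Compute features from the links alone, without fetching their targets."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    base_host = urlparse(base_url).netloc
    features = []
    for href, text in collector.links:
        url = urljoin(base_url, href)
        parsed = urlparse(url)
        depth = len([part for part in parsed.path.split("/") if part])
        features.append(LinkFeatures(
            url=url,
            anchor_text=text,
            path_depth=depth,
            num_query_params=len(parse_qs(parsed.query)),
            same_host=parsed.netloc == base_host,
        ))
    return features


def select_links(features, is_relevant):
    """Keep only the URLs whose features the classifier accepts."""
    return [f.url for f in features if is_relevant(f)]
```

Here `is_relevant` stands in for whatever classifier is eventually learnt; in this sketch it can be any function from `LinkFeatures` to a boolean, for instance `lambda f: f.same_host and f.num_query_params > 0`.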
Existing proposals in the web navigation area do not address some of the problems stated above. Although a large number of proposals study the use of links for web page classification purposes, link classification in itself has not been thoroughly studied yet. Most proposals are supervised and depend on the intervention of the user to select the links that should be followed. Instead, our focus is on an unsupervised solution, which is more reusable, fault tolerant, and easier to adapt to ever-changing web sites. The existing unsupervised proposals try to tackle navigation problems by analysing each site and extracting several specific models, whereas we focus on link classification in order to obtain a generic model.
[ALG09] A Genre-Aware Approach to Focused Crawling. G.T. de Assis, A.H.F. Laender, M.A. Gonçalves, A.S. da Silva. WWW Journal, 12(3):285-319. 2009.
[CDK99] Mining the Web's Link Structure. S. Chakrabarti, B. Dom, R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, J.M. Kleinberg. IEEE Computer, 32(8):60-67. 1999.
[H09] Intelligent Web Navigation. I. Hernández. Taller de Trabajo Zoco'09/JISBD, 2009.
[K99] Authoritative Sources in a Hyperlinked Environment. J.M. Kleinberg. J. ACM, 46(5):604-632. 1999.
[PS06] Link Contexts in Classifier-Guided Topical Crawlers. G. Pant, P. Srinivasan. IEEE Trans. Knowl. Data Eng., 18(1):107-122. 2006.
[RG01] Crawling the Hidden Web. S. Raghavan, H. Garcia-Molina. VLDB Conf., 129-138. 2001.
[VSM08] Structure-Based Crawling in the Hidden Web. M.L.A. Vidal, A.S. da Silva, E.S. de Moura, J.M.B. Cavalcanti. J. UCS, 14(11):1857-1876. 2008.