A Focused Crawler by Segmentation of Context Information

Chang Hee Cho; Nam Yong Lee; Jin Bum Kang; Jae Young Yang; Joong Min Choi

The KIPS Transactions:PartB, Vol. 12, No. 6, pp. 697-702, Oct. 2005

DOI: 10.3745/KIPSTB.2005.12.6.697

Abstract

A focused crawler is a topic-driven, document-collecting crawler that has been suggested as a promising alternative for maintaining up-to-date web document indices in search engines. A major problem inherent in previous focused crawlers is their tendency to miss highly relevant documents that are linked from off-topic documents. This problem stems mainly from a lack of consideration of the structural information in a document. Traditional weighting methods such as TF-IDF, which are used in document classification, can lead to this problem. To improve the performance of focused crawlers, this paper proposes a locality-based document segmentation scheme for determining the relevance of a document to a specific topic. We segment a document into a set of sub-documents using the contextual features around its hyperlinks. This information is then used to decide whether the crawler should fetch the documents linked from an off-topic document.
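
The idea of judging each hyperlink by its local context rather than by the whole page can be illustrated with a minimal sketch. The abstract does not specify which contextual features or scoring function the authors use, so everything here is an assumption made for illustration: the names `LinkContextParser`, `segment_by_links`, `segment_score`, and `links_to_follow` are hypothetical, a sub-document is taken to be a fixed window of words around each anchor, and relevance is a simple fraction of topic terms rather than a trained classifier.

```python
import re
from html.parser import HTMLParser


class LinkContextParser(HTMLParser):
    """Collects visible words and records where each anchor's text sits."""

    def __init__(self):
        super().__init__()
        self.words = []    # all words of the page, in document order
        self.links = []    # (href, start_index, end_index) for each anchor
        self._open = None  # (href, start_index) of the anchor being read

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = (href, len(self.words))

    def handle_endtag(self, tag):
        if tag == "a" and self._open is not None:
            href, start = self._open
            self.links.append((href, start, len(self.words)))
            self._open = None

    def handle_data(self, data):
        self.words.extend(re.findall(r"[a-z0-9]+", data.lower()))


def segment_by_links(html, window=5):
    """Map each href to a sub-document: anchor text plus `window` words per side.

    Assumption: a fixed word window stands in for "contextual features".
    """
    parser = LinkContextParser()
    parser.feed(html)
    parser.close()
    segments = {}
    for href, start, end in parser.links:
        lo = max(0, start - window)
        hi = min(len(parser.words), end + window)
        segments[href] = parser.words[lo:hi]
    return segments


def segment_score(words, topic_terms):
    """Fraction of words in the segment that are topic terms (illustrative only)."""
    if not words:
        return 0.0
    return sum(1 for w in words if w in topic_terms) / len(words)


def links_to_follow(html, topic_terms, window=5, threshold=0.2):
    """Return hrefs whose surrounding segment looks on-topic."""
    segments = segment_by_links(html, window)
    return [href for href, words in segments.items()
            if segment_score(words, topic_terms) >= threshold]


# A mostly off-topic page that contains one on-topic link.
page = """
<p>Our club met on Sunday for lunch and a long walk by the river with friends and family.</p>
<p>One member wrote a guide to web crawler design and search engine indexing:
<a href="/crawler-guide">crawler guide</a>.</p>
<p>Photos from the summer picnic are posted in the
<a href="/picnic">picnic album</a> for everyone who attended the event.</p>
"""
topic = {"crawler", "search", "engine", "indexing", "web"}
print(links_to_follow(page, topic))
```

In this toy page, topic terms make up a small share of the words overall, so a whole-page score would fall below the threshold and a crawler relying on it would discard both links. Scoring the segment around each hyperlink keeps the on-topic link as a candidate while still skipping the unrelated one.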

Cite this article

[IEEE Style]

C. H. Cho, N. Y. Lee, J. B. Kang, J. Y. Yang, J. M. Choi, "A Focused Crawler by Segmentation of Context Information," The KIPS Transactions:PartB, vol. 12, no. 6, pp. 697-702, 2005. DOI: 10.3745/KIPSTB.2005.12.6.697.

[ACM Style]

Chang Hee Cho, Nam Yong Lee, Jin Bum Kang, Jae Young Yang, and Joong Min Choi. 2005. A Focused Crawler by Segmentation of Context Information. The KIPS Transactions:PartB, 12, 6, (2005), 697-702. DOI: 10.3745/KIPSTB.2005.12.6.697.