Discover hidden web properties by random walk on bipartite graph |
| |
Authors: | Yan Wang Jie Liang Jianguo Lu |
| |
Institution: | 1. School of Information, Central University of Finance and Economics, Beijing, China 2. BiblioCommons Inc, Toronto, Canada 3. School of Computer Science, University of Windsor, Windsor, Canada 4. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
|
| |
Abstract: | This paper proposes to use random walk (RW) to discover the properties of the deep web data sources that are hidden behind searchable interfaces. The properties, such as the average degree and population size of both documents and terms, are of interests to general public, and find their applications in business intelligence, data integration and deep web crawling. We show that simple RW can outperform the uniform random (UR) samples disregarding the high cost of UR sampling. We prove that in the idealized case when the degrees follow Zipf’s law, the sample size of UR sampling needs to grow in the order of O(N/ln 2 N) with the corpus size N, while the sample size of RW sampling grows logarithmically. Reuters corpus is used to demonstrate that the term degrees resemble power law distribution, thus RW is better than UR sampling. On the other hand, document degrees have lognormal distribution and exhibit a smaller variance, therefore UR sampling is slightly better. |
| |
Keywords: | |
本文献已被 SpringerLink 等数据库收录! |
|