首页 | 本学科首页   官方微博 | 高级检索  
     检索      


Discover hidden web properties by random walk on bipartite graph
Authors:Yan Wang  Jie Liang  Jianguo Lu
Institution:1. School of Information, Central University of Finance and Economics, Beijing, China
2. BiblioCommons Inc, Toronto, Canada
3. School of Computer Science, University of Windsor, Windsor, Canada
4. State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China
Abstract:This paper proposes to use random walk (RW) to discover the properties of the deep web data sources that are hidden behind searchable interfaces. The properties, such as the average degree and population size of both documents and terms, are of interests to general public, and find their applications in business intelligence, data integration and deep web crawling. We show that simple RW can outperform the uniform random (UR) samples disregarding the high cost of UR sampling. We prove that in the idealized case when the degrees follow Zipf’s law, the sample size of UR sampling needs to grow in the order of O(N/ln 2 N) with the corpus size N, while the sample size of RW sampling grows logarithmically. Reuters corpus is used to demonstrate that the term degrees resemble power law distribution, thus RW is better than UR sampling. On the other hand, document degrees have lognormal distribution and exhibit a smaller variance, therefore UR sampling is slightly better.
Keywords:
本文献已被 SpringerLink 等数据库收录!
设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号