Similar Documents
20 similar documents found (search time: 109 ms)
1.
The Web, and especially major Web search engines, are essential tools in the quest to locate online information for many people. This paper reports results from research that examines characteristics and changes in Web searching from nine studies of five Web search engines based in the US and Europe. We compare interactions occurring between users and Web search engines from the perspectives of session length, query length, query complexity, and content viewed among the Web search engines. The results of our research show that (1) users are viewing fewer result pages, (2) searchers on US-based Web search engines use more query operators than searchers on European-based search engines, (3) there are statistically significant differences in the use of Boolean operators and result pages viewed, and (4) one cannot necessarily apply results from studies of one particular Web search engine to another Web search engine. The widespread use of Web search engines, employment of simple queries, and decreased viewing of result pages may have resulted from algorithmic enhancements by Web search engine companies. We discuss the implications of the findings for the development of Web search engines and the design of online content.

2.
To address the low topical relevance of the results returned by topic-focused search engines, this paper proposes a search strategy that combines a genetic algorithm with a content-based vector space model. The vector space model is used to measure the relevance of a page to the topic, and the genetic algorithm is applied to the relevance judgment, improving the precision and recall of topical search. The corresponding functionality was implemented with Eclipse 3.3 on top of the Heritrix framework. Experimental results show that the improved system crawls a roughly 30% higher proportion of on-topic pages than the original system.
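To make the relevance computation concrete, here is a minimal sketch of a term-frequency vector space model with cosine similarity, the kind of content-based relevance check the abstract describes; the function name, tokenization, and example data are illustrative, not taken from the paper:

```python
# Minimal sketch: pages and the topic are term-frequency vectors, and
# relevance is their cosine similarity. All names/data are illustrative.
from collections import Counter
import math

def cosine_relevance(page_tokens, topic_tokens):
    p, t = Counter(page_tokens), Counter(topic_tokens)
    dot = sum(p[w] * t[w] for w in p.keys() & t.keys())
    norm = math.sqrt(sum(v * v for v in p.values())) * \
           math.sqrt(sum(v * v for v in t.values()))
    return dot / norm if norm else 0.0

topic = "search engine crawler relevance".split()
page = "an efficient crawler improves search engine relevance".split()
print(cosine_relevance(page, topic))  # crawl the page only above some threshold
```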

3.
A crawler is a key component of a search engine and largely determines its performance; Larbin is one such efficient web crawler. This paper first analyzes Larbin's design and architecture, then studies its core algorithm, the Bloom filter, and proposes an improvement to it. Finally, the implementation of the Larbin optimization is described.
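Larbin uses a Bloom filter to remember which URLs have already been crawled; a minimal Python sketch of that data structure follows (the bit-array size, hash count, and MD5-based hashing are illustrative choices, not Larbin's actual C++ implementation):

```python
# A minimal Bloom filter for crawler URL deduplication.
import hashlib

class BloomFilter:
    def __init__(self, size_bits=1 << 20, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.md5(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

seen = BloomFilter()
seen.add("http://example.com/")
print("http://example.com/" in seen, "http://example.org/" in seen)
```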

4.
Design and Implementation of a Lucene-Based Indexing System (cited 2 times: 0 self-citations, 2 by others)
The index system is the data stronghold of a search engine; in the early days of search engines, the number of pages a system could index represented the technical state of the art for the whole industry. Lucene is a full-text retrieval technology widely used in the information retrieval field and an excellent open-source full-text search framework. This paper analyzes the technologies involved in index systems and the structure of Lucene's indexing system in detail.
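Lucene itself is a Java framework; as a language-neutral illustration of the core structure such an index system maintains, here is a toy inverted index in Python (a conceptual sketch of the term-to-postings mapping, not Lucene's actual API or on-disk format):

```python
# Toy inverted index: term -> postings of (doc_id, position).
from collections import defaultdict

index = defaultdict(list)  # term -> [(doc_id, position), ...]

def add_document(doc_id, text):
    for pos, term in enumerate(text.lower().split()):
        index[term].append((doc_id, pos))

def search(term):
    return index.get(term.lower(), [])

add_document(1, "Lucene is an open source full text search framework")
add_document(2, "the index system is the data stronghold of a search engine")
print(search("search"))  # postings: [(1, 7), (2, 9)]
```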

5.
郭金兰 《现代情报》2009,29(7):53-56
Based on a survey, the author demonstrates the need for search engine optimization of government websites and proposes optimization methods: avoiding overly complex site structures, adjusting keyword placement and density, and adding a number of appropriate external links can earn an ordinary government department website a good search engine ranking. For government portal sites, however, the internal site structure must also be carefully laid out, and internal links used skillfully, to raise the rankings of their subpages in search engines.

6.
Stochastic simulation has been very effective in many domains but had never been applied to the WWW. This study is the first to use neural networks in stochastic simulation of the number of rejected Web pages per search query. The evaluation of the quality of search engines should involve not only the resulting set of Web pages but also an estimate of the rejected set of Web pages. The iterative radial basis function (RBF) neural network developed by Meghabghab and Nasr [Iterative RBF neural networks as meta-models for stochastic simulations, in: Second International Conference on Intelligent Processing and Manufacturing of Materials, IPMM'99, Honolulu, Hawaii, 1999, pp. 729–734] was adapted to the actual evaluation of the number of rejected Web pages on four search engines, i.e., Yahoo, Alta Vista, Google, and Northern Light. Nine input variables were selected for the simulation: (1) precision, (2) overlap, (3) response time, (4) coverage, (5) update frequency, (6) Boolean logic, (7) truncation, (8) word and multi-word searching, and (9) portion of the Web pages indexed. Typical stochastic simulation meta-modeling uses regression models in response surface methods. RBF networks are a natural target for such an attempt because they use a family of surfaces, each of which naturally divides an input space into two regions X+ and X−, and the n patterns for testing will be assigned either class X+ or X−. This technique divides the resulting set of responses to a query into accepted and rejected Web pages. To test the hypothesis that the evaluation of any search engine query should involve an estimate of the number of rejected Web pages, the RBF meta-model was trained on 937 examples from a set of 9000 different simulation runs over the nine input variables. Results show that two of the variables, response time and portion of the Web indexed, can be eliminated without affecting evaluation results. Results also show that the number of rejected Web pages for a specific set of search queries on these four engines is very high. Furthermore, a goodness measure of a search engine for a given set of queries can be designed as a function of the coverage of the search engine and the normalized age of a new document in the result set for the query. This study concludes that unless search engine designers address the issue of rejected Web pages, indexing, and crawling, the usage of the Web as a research tool for academic and educational purposes will stay hindered.
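A minimal sketch of RBF-network regression of the kind used as a simulation meta-model, assuming Gaussian basis functions with least-squares output weights; the centers, width, and synthetic data are illustrative, not the paper's actual setup:

```python
# RBF meta-model sketch: Gaussian basis functions over a 9-dimensional
# input space, linear output weights fit by least squares.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 9))                   # nine input variables
y = X[:, 0] * 2 - X[:, 3] + rng.normal(0, 0.05, 200)   # toy simulation response

centers = X[rng.choice(len(X), 20, replace=False)]     # 20 RBF centers
width = 0.5

def design_matrix(X):
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2 * width ** 2))

W, *_ = np.linalg.lstsq(design_matrix(X), y, rcond=None)  # output weights
y_hat = design_matrix(X) @ W
print("training RMSE:", np.sqrt(np.mean((y - y_hat) ** 2)))
```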

7.
Recent research in the human computer interaction and information retrieval areas has revealed that search response latency exhibits a clear impact on user behavior in web search. Such impact is reflected both in users' subjective perception of the usability of a search engine and in their interaction with the search engine in terms of the number of search results they engage with. However, a similar impact analysis has been missing so far in the context of sponsored search. Since the predominant business model for commercial search engines is advertising via sponsored search results (i.e., search advertisements), understanding how response latency influences the user interaction with the advertisements displayed on the search engine result pages is crucial to increase the revenue of a commercial search engine. To this end, we conduct a large-scale analysis using query logs obtained from a commercial web search engine. We analyze the short-term and long-term impact of search response latency on the querying and clicking behaviors of users using desktop and mobile devices to access the search engine, as well as the corresponding impact on the revenue of the search engine. This analysis demonstrates the importance of serving sponsored search results with low latency and provides insight into the ad serving policy of commercial search engines to ensure long-term user engagement and search revenue.

8.
Because the Internet and the Web are open, changing, unstructured, and dynamically disordered collections of massive information resources, the collection and quality control of web data have become central research topics in webometrics. This paper studies the quality control of web data collection fairly comprehensively, covering the uniform determination of retrieval time windows, sampling designs for web pages and websites, methods for avoiding duplicate crawling and for prioritizing the collection of important pages, and techniques for topic-oriented collection of specific information.

9.
Searching for relevant material that satisfies the information need of a user within a large document collection is a critical activity for web search engines. Query expansion (QE) techniques are widely used by search engines for the disambiguation of the user's information need and for improving information retrieval (IR) performance. Knowledge-based, corpus-based, and relevance-feedback methods are the main QE techniques; they employ different approaches for expanding the user query with synonyms of the search terms (word synonymy) in order to retrieve more relevant documents, and for filtering out documents that contain the search terms but with a different meaning than the user intended (the word polysemy problem). This work surveys existing query expansion techniques, highlights their strengths and limitations, and introduces a new method that combines the power of knowledge-based or corpus-based techniques with that of relevance feedback. Experimental evaluation on three information retrieval benchmark datasets shows that applying knowledge- or corpus-based query expansion techniques to the results of the relevance feedback step improves information retrieval performance, with knowledge-based techniques providing significantly better results than their simple relevance feedback alternatives in all sets.
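A hedged sketch of the combination the survey proposes, applying a knowledge-based (thesaurus) expansion on top of a relevance-feedback step; the toy thesaurus, term scoring, and cutoffs are all illustrative:

```python
# Query expansion sketch: feedback terms from pseudo-relevant documents,
# then a knowledge-based (thesaurus) expansion over the combined terms.
from collections import Counter

THESAURUS = {"car": ["automobile", "vehicle"], "engine": ["motor"]}

def feedback_terms(pseudo_relevant_docs, k=3):
    counts = Counter(w for doc in pseudo_relevant_docs for w in doc.split())
    return [w for w, _ in counts.most_common(k)]

def expand_query(query, pseudo_relevant_docs):
    terms = query.split() + feedback_terms(pseudo_relevant_docs)
    expanded = list(terms)
    for t in terms:                       # knowledge-based step on top of feedback
        expanded += THESAURUS.get(t, [])
    return list(dict.fromkeys(expanded))  # dedupe, keep order

docs = ["car engine repair manual", "engine repair tips for your car"]
print(expand_query("car engine", docs))
```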

10.
With the rapid development of the Internet, 5G, and other new-generation information technologies, firms need to reach across organizational boundaries and use R&D search to capture innovation opportunities, spark innovative ideas, and develop new skills, thereby strengthening their innovation capability. Whether R&D search is problem-driven or anticipatory, however, is an important strategic question in corporate R&D. Drawing on organizational behavior theory, this paper systematically examines how performance feedback and forward-looking assessment, under the moderation of alliance routines, influence firms' choice of R&D search behavior. The findings show that when performance feedback falls below the aspiration level, firms intensify R&D search; when it exceeds the aspiration level, firms reduce R&D search; and when decision makers expect future performance to exceed the future aspiration level, firms also reduce R&D search. When performance feedback is below aspiration, and when expected future performance is better than the future aspiration, alliance routines play a negative moderating role; but when performance feedback exceeds aspiration, alliance routines strengthen firms' search behavior. These conclusions have significant theoretical and practical value for guiding firms' choice of R&D search strategies, making effective use of external resources, promoting innovation, and enhancing core competitiveness.

11.
RSS-Based Search Engine Technology and Its Development Trends (cited 3 times: 0 self-citations, 3 by others)
With the rapid growth of RSS resources, RSS-based search engines have emerged. An RSS search engine is used the same way as Google or Baidu: the user enters keywords to search for the desired content. The difference is that a traditional portal search engine searches over crawled web page content, whereas an RSS search engine searches RSS feeds, or pages containing RSS feeds, directly. RSS search engines offer high accuracy, a dynamic aggregation mechanism, and fast, efficient search, making them an important new information retrieval tool for current web resources. This paper surveys domestic and international research on RSS search engines, their technical characteristics, and their working principles, and on that basis looks ahead at the development of RSS-based search engine technology.
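A minimal sketch of the idea, querying RSS feed entries directly instead of crawled pages; it assumes the third-party feedparser package, and the feed URL and matching rule are illustrative:

```python
# Keyword search directly over RSS feed entries rather than crawled pages.
import feedparser

def search_feed(feed_url, keyword):
    feed = feedparser.parse(feed_url)
    return [(e.get("title", ""), e.get("link", "")) for e in feed.entries
            if keyword.lower() in (e.get("title", "") + " " + e.get("summary", "")).lower()]

for title, link in search_feed("https://example.com/feed.xml", "search engine"):
    print(title, link)
```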

12.
In sponsored search, many advertisers have not achieved their expected performance, while the search engine also has substantial room to improve its revenue. Specifically, due to improper keyword bidding, many advertisers cannot survive the competitive ad auctions to get their desired ad impressions; meanwhile, a significant portion of search queries have no ads displayed on their search result pages, even though many of them have commercial value. We propose recommending a group of relevant yet less-competitive keywords to an advertiser. Hence, the advertiser gets the chance to win some (originally empty) ad slots and accumulate a number of impressions. At the same time, the revenue of the search engine can also be boosted, since many empty ad slots are filled. Mathematically, we model the problem as a mixed integer programming problem, which maximizes the advertiser revenue and the relevance of the recommended keywords while minimizing keyword competitiveness, subject to bid and budget constraints. By solving the problem, we can offer an optimal group of keywords and their optimal bid prices to an advertiser. Simulation results show the proposed method is highly effective in increasing ad impressions, expected clicks, advertiser revenue, and search engine revenue.
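A toy version of such a mixed integer program, assuming the third-party PuLP package; the keywords, coefficients, and budget are made up, and the objective simply trades relevance against competitiveness under a cost constraint:

```python
# Pick keywords maximizing relevance minus competitiveness, under a budget.
from pulp import LpProblem, LpMaximize, LpVariable, lpSum

keywords = ["cheap flights", "flight deals", "airfare sale"]
relevance = {"cheap flights": 0.9, "flight deals": 0.8, "airfare sale": 0.6}
competition = {"cheap flights": 0.7, "flight deals": 0.4, "airfare sale": 0.2}
cost = {"cheap flights": 1.2, "flight deals": 0.8, "airfare sale": 0.5}
budget = 1.5

prob = LpProblem("keyword_recommendation", LpMaximize)
pick = LpVariable.dicts("pick", keywords, cat="Binary")
prob += lpSum((relevance[k] - competition[k]) * pick[k] for k in keywords)
prob += lpSum(cost[k] * pick[k] for k in keywords) <= budget

prob.solve()
print([k for k in keywords if pick[k].value() == 1])
```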

13.
李村合  刘竞 《情报科学》2005,23(6):905-907,916
Cloaking refers to the technique of serving Google and other search engines a page different from the one shown to ordinary visitors. Its purpose is usually to manipulate search engine rankings, misleading ordinary users when they look up material in search engine results. This paper introduces the definition and origins of cloaking, then analyzes and evaluates cloaking techniques.
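One simple way to observe the behavior described above is to fetch the same URL with a browser User-Agent and a crawler User-Agent and compare the responses; this sketch uses the requests package, and the URL, token-overlap similarity, and 0.5 threshold are illustrative:

```python
# Crude cloaking check: compare what a browser and "Googlebot" receive.
import requests

def looks_cloaked(url):
    browser = requests.get(url, headers={"User-Agent": "Mozilla/5.0"},
                           timeout=10).text
    crawler = requests.get(
        url,
        headers={"User-Agent": "Googlebot/2.1 (+http://www.google.com/bot.html)"},
        timeout=10).text
    # Shared-token ratio; real detectors compare DOM structure and content.
    b, c = set(browser.split()), set(crawler.split())
    return len(b & c) / max(len(b | c), 1) < 0.5

print(looks_cloaked("https://example.com/"))
```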

14.
The purpose of this study is to provide automatic new-topic identification in search engine query logs and to estimate the effect of statistical characteristics of search engine queries on new-topic identification. By applying multiple linear regression and multi-factor ANOVA to a sample data log from the Excite search engine, we demonstrate that statistical characteristics of Web search queries, such as the time interval, search pattern, and position of a query in a user session, are effective predictors of a shift to a new topic. Multiple linear regression is also a successful tool for estimating topic shifts and continuations. The findings of this study provide statistical proof of the relationship between the non-semantic characteristics of Web search queries and the occurrence of topic shifts and continuations.
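A hedged sketch of the regression step, fitting a linear model of a topic-shift indicator on non-semantic query features; it assumes scikit-learn, and the data here is synthetic, whereas the study used Excite logs:

```python
# Regress a topic-shift indicator on time interval, pattern change,
# and the query's position in the session.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
time_interval = rng.exponential(60, n)     # seconds since previous query
pattern_changed = rng.integers(0, 2, n)    # 1 if the query pattern changed
position = rng.integers(1, 10, n)          # query's position in the session
X = np.column_stack([time_interval, pattern_changed, position])
shift = (0.004 * time_interval + 0.3 * pattern_changed
         + rng.normal(0, 0.1, n) > 0.35).astype(float)

model = LinearRegression().fit(X, shift)
print(dict(zip(["interval", "pattern", "position"], model.coef_.round(3))))
```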

15.
In this paper, we use time series analysis to evaluate predictive scenarios using search engine transaction logs. Our goal is to develop models for the analysis of searchers' behaviors over time and to investigate whether time series analysis is a valid method for predicting relationships between searcher actions. Time series analysis is a method often used to understand the underlying characteristics of temporal data in order to make forecasts. In this study, we used a Web search engine transaction log and time series analysis to investigate users' actions. We conducted our analysis in two phases. In the initial phase, we employed a basic analysis and found that 10% of searchers clicked on sponsored links; however, from 22:00 to 24:00, searchers almost exclusively clicked on the organic links, with almost no clicks on sponsored links. In the second and more extensive phase, we used a one-step prediction time series analysis method along with a transfer function method. The time period rarely affects navigational queries, while rates for transactional queries vary across periods. Our results show that the average length of a searcher session is approximately 2.9 interactions and that this average is consistent across time periods. Most importantly, our findings show that searchers who submit the shortest queries (in number of terms) click on the highest-ranked results. We discuss implications, including predictive value, and future research.
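A minimal one-step-ahead forecasting sketch in the spirit of the study, assuming statsmodels' ARIMA; the hourly click series is synthetic, not the paper's transaction-log data:

```python
# Fit an ARIMA model to an hourly series of searcher actions and
# forecast the next hour (one-step prediction).
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(2)
hours = np.arange(240)
clicks = 100 + 20 * np.sin(2 * np.pi * hours / 24) + rng.normal(0, 5, 240)

model = ARIMA(clicks, order=(1, 0, 1)).fit()
print("next-hour forecast:", float(model.forecast(1)[0]))
```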

16.
With the rapid development of the Web, the number of web pages has swelled dramatically, growing exponentially in recent years, and search engines face increasingly serious challenges in finding pages that match user needs quickly and accurately within this massive collection. Web page classification is one effective remedy; its two main strands, topic-based classification and genre-based classification, have both markedly improved search engine retrieval efficiency. Web genre classification categorizes pages by their form of presentation and intended use. This paper introduces the definition of web genre, the features commonly used in genre classification research, several common feature selection methods, classification models, and classifier evaluation methods, giving researchers an overview of web genre classification.

17.
赵发珍 《现代情报》2013,33(6):91-95
Using the Yahoo! and Bing search engines, the paper collects the total page counts, total link counts, internal and external link counts, and PageRank values of 30 online community websites, computes their web impact factors and related measures, and ranks the sites comprehensively over these link indicators with grey relational analysis. The results show that the sites with the greatest web influence include 51.com, Tencent Microblog, Tencent Blog, Tencent Forum, NetEase Microblog, NetEase Blog, Sina Blog, and Douban. Finally, comparing the link data returned by Yahoo! and Bing confirms that both search engines are feasible sources for website link analysis, though the data from Yahoo! yields somewhat more accurate analyses.
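A compact sketch of grey relational analysis as applied above to rank sites from link indicators; the data matrix is made up, and rho = 0.5 is the customary distinguishing coefficient:

```python
# Grey relational analysis: normalize indicators, compare each site to an
# ideal reference series, and rank by mean grey relational grade.
import numpy as np

sites = ["site_a", "site_b", "site_c"]
# Columns: pages, inlinks, PageRank, web impact factor (illustrative values).
data = np.array([[9000, 120000, 6, 13.3],
                 [4000,  80000, 5, 20.0],
                 [1000,  15000, 4, 15.0]], dtype=float)

norm = data / data.max(axis=0)            # normalize each indicator to [0, 1]
ref = norm.max(axis=0)                    # ideal reference series
delta = np.abs(norm - ref)
rho = 0.5
coef = (delta.min() + rho * delta.max()) / (delta + rho * delta.max())
grade = coef.mean(axis=1)                 # grey relational grade per site
for site, g in sorted(zip(sites, grade), key=lambda t: -t[1]):
    print(site, round(float(g), 3))
```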

18.
Search Engine for South-East Europe (SE4SEE) is a socio-cultural search engine running on a grid infrastructure. It offers a personalized, on-demand, country-specific, category-based Web search facility. The main goal of SE4SEE is to attack the page-freshness problem by performing the search on the original pages residing on the Web, rather than on previously fetched copies as traditional search engines do. SE4SEE also aims to obtain high download rates in Web crawling by making use of the geographically distributed nature of the grid. In this work, we present the architectural design issues and implementation details of this search engine. We conduct various experiments to illustrate performance results obtained on a grid infrastructure and justify the search strategy employed in SE4SEE.

19.
Pseudo-relevance feedback (PRF) is a classical technique for improving search engine retrieval effectiveness by closing the vocabulary gap between users' query formulations and the relevant documents. While PRF is typically applied on the same target corpus as the final retrieval, external expansion techniques have in the past sometimes been applied to obtain a high-quality pseudo-relevant feedback set from an external corpus. However, such external expansion approaches have only been studied for sparse (bag-of-words) retrieval methods, and their effectiveness for recent dense retrieval methods remains under-investigated. Indeed, dense retrieval approaches such as ANCE and ColBERT, which conduct similarity search based on encoded contextualized query and document embeddings, are of increasing importance. Moreover, pseudo-relevance feedback mechanisms have been proposed to further enhance dense retrieval effectiveness. In particular, in this work, we examine the application of dense external expansion to improve zero-shot retrieval effectiveness, i.e., evaluation on corpora without further training. Zero-shot retrieval experiments with six datasets, including two TREC datasets and four BEIR datasets, using the MSMARCO passage collection as the external corpus, indicate that obtaining external feedback documents using ColBERT can significantly improve NDCG@10 for sparse retrieval (by up to 28%) and dense retrieval (by up to 12%). In addition, using ANCE on the external corpus brings up to 30% NDCG@10 improvements for sparse retrieval and up to 29% for dense retrieval.
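A schematic of dense external expansion with a Rocchio-style embedding update, a conceptual sketch only, not ColBERT's or ANCE's actual PRF mechanism; the embeddings are random stand-ins and alpha is illustrative:

```python
# Dense PRF sketch: retrieve top-k neighbors of the query embedding from
# an external corpus, then mix their centroid into the query vector.
import numpy as np

rng = np.random.default_rng(3)
external = rng.normal(size=(1000, 128))            # external corpus embeddings
external /= np.linalg.norm(external, axis=1, keepdims=True)
q = rng.normal(size=128)
q /= np.linalg.norm(q)                             # encoded query

topk = np.argsort(external @ q)[-5:]               # pseudo-relevant externals
alpha = 0.7
q_prf = alpha * q + (1 - alpha) * external[topk].mean(axis=0)
q_prf /= np.linalg.norm(q_prf)                     # expanded query for final search
print("query drift (cosine):", float(q @ q_prf))
```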

20.