期刊界 All Journals 搜尽天下杂志传播学术成果专业期刊搜索期刊信息化学术搜索

1.

How are we searching the World Wide Web? A comparison of nine search engine transaction logs

Bernard J. Jansen Amanda Spink 《Information processing & management》2006

The Web and especially major Web search engines are essential tools in the quest to locate online information for many people. This paper reports results from research that examines characteristics and changes in Web searching from nine studies of five Web search engines based in the US and Europe. We compare interactions occurring between users and Web search engines from the perspectives of session length, query length, query complexity, and content viewed among the Web search engines. The results of our research shows (1) users are viewing fewer result pages, (2) searchers on US-based Web search engines use more query operators than searchers on European-based search engines, (3) there are statistically significant differences in the use of Boolean operators and result pages viewed, and (4) one cannot necessary apply results from studies of one particular Web search engine to another Web search engine. The wide spread use of Web search engines, employment of simple queries, and decreased viewing of result pages may have resulted from algorithmic enhancements by Web search engine companies. We discuss the implications of the findings for the development of Web search engines and design of online content. 相似文献

2.

Analysis of large data logs: an application of Poisson sampling on excite web queries

《Information processing & management》2002,38(4):473-490

Search engines are the gateway for users to retrieve information from the Web. There is a crucial need for tools that allow effective analysis of search engine queries to provide a greater understanding of Web users' information seeking behavior. The objective of the study is to develop an effective strategy for the selection of samples from large-scale data sets. Millions of queries are submitted to Web search engines daily and new sampling techniques are required to bring these databases to a manageable size, while preserving the statistically representative characteristics of the entire data set. This paper reports results from a study using data logs from the Excite Web search engine. We use Poisson sampling to develop a sampling strategy, and show how sample sets selected by Poisson sampling statistically effectively represent the characteristics of the entire dataset. In addition, this paper discusses the use of Poisson sampling in continuous monitoring of stochastic processes, such as Web site dynamics. 相似文献

3.

A study of results overlap and uniqueness among major Web search engines

Amanda Spink Bernard J. Jansen Chris Blakely Sherry Koshman 《Information processing & management》2006

The performance and capabilities of Web search engines is an important and significant area of research. Millions of people world wide use Web search engines very day. This paper reports the results of a major study examining the overlap among results retrieved by multiple Web search engines for a large set of more than 10,000 queries. Previous smaller studies have discussed a lack of overlap in results returned by Web search engines for the same queries. The goal of the current study was to conduct a large-scale study to measure the overlap of search results on the first result page (both non-sponsored and sponsored) across the four most popular Web search engines, at specific points in time using a large number of queries. The Web search engines included in the study were MSN Search, Google, Yahoo! and Ask Jeeves. Our study then compares these results with the first page results retrieved for the same queries by the metasearch engine Dogpile.com. Two sets of randomly selected user-entered queries, one set was 10,316 queries and the other 12,570 queries, from Infospace’s Dogpile.com search engine (the first set was from Dogpile, the second was from across the Infospace Network of search properties were submitted to the four single Web search engines). Findings show that the percent of total results unique to only one of the four Web search engines was 84.9%, shared by two of the three Web search engines was 11.4%, shared by three of the Web search engines was 2.6%, and shared by all four Web search engines was 1.1%. This small degree of overlap shows the significant difference in the way major Web search engines retrieve and rank results in response to given queries. Results point to the value of metasearch engines in Web retrieval to overcome the biases of individual search engines. 相似文献

4.

Multitasking during Web search sessions

Amanda Spink Minsoo Park Bernard J. Jansen Jan Pedersen 《Information processing & management》2006

A user’s single session with a Web search engine or information retrieval (IR) system may consist of seeking information on single or multiple topics, and switch between tasks or multitasking information behavior. Most Web search sessions consist of two queries of approximately two words. However, some Web search sessions consist of three or more queries. We present findings from two studies. First, a study of two-query search sessions on the AltaVista Web search engine, and second, a study of three or more query search sessions on the AltaVista Web search engine. We examine the degree of multitasking search and information task switching during these two sets of AltaVista Web search sessions. A sample of two-query and three or more query sessions were filtered from AltaVista transaction logs from 2002 and qualitatively analyzed. Sessions ranged in duration from less than a minute to a few hours. Findings include: (1) 81% of two-query sessions included multiple topics, (2) 91.3% of three or more query sessions included multiple topics, (3) there are a broad variety of topics in multitasking search sessions, and (4) three or more query sessions sometimes contained frequent topic changes. Multitasking is found to be a growing element in Web searching. This paper proposes an approach to interactive information retrieval (IR) contextually within a multitasking framework. The implications of our findings for Web design and further research are discussed. 相似文献

5.

Automatic new topic identification using multiple linear regression

Seda Ozmutlu 《Information processing & management》2006

The purpose of this study is to provide automatic new topic identification of search engine query logs, and estimate the effect of statistical characteristics of search engine queries on new topic identification. By applying multiple linear regression and multi-factor ANOVA on a sample data log from the Excite search engine, we demonstrated that the statistical characteristics of Web search queries, such as time interval, search pattern and position of a query in a user session, are effective on shifting to a new topic. Multiple linear regression is also a successful tool for estimating topic shifts and continuations. The findings of this study provide statistical proof for the relationship between the non-semantic characteristics of Web search queries and the occurrence of topic shifts and continuations. 相似文献

6.

Application of automatic topic identification on Excite Web search engine data logs

《Information processing & management》2005,41(5):1243-1262

The analysis of contextual information in search engine query logs enhances the understanding of Web users’ search patterns. Obtaining contextual information on Web search engine logs is a difficult task, since users submit few number of queries, and search multiple topics. Identification of topic changes within a search session is an important branch of search engine user behavior analysis. The purpose of this study is to investigate the properties of a specific topic identification methodology in detail, and to test its validity. The topic identification algorithm’s performance becomes doubtful in various cases. These cases are explored and the reasons underlying the inconsistent performance of automatic topic identification are investigated with statistical analysis and experimental design techniques. 相似文献

7.

Iterative radial basis functions neural networks as metamodels of stochastic simulations of the quality of search engines in the World Wide Web

《Information processing & management》2001,37(4):571-591

Stochastic simulation has been very effective in many domains but never applied to the WWW. This study is a premiere in using neural networks in stochastic simulation of the number of rejected Web pages per search query. The evaluation of the quality of search engines should involve not only the resulting set of Web pages but also an estimate of the rejected set of Web pages. The iterative radial basis functions (RBF) neural network developed by Meghabghab and Nasr [Iterative RBF neural networks as meta-models for stochastic simulations, in: Second International Conference on Intelligent Processing and Manufacturing of Materials, IPMM’99, Honolulu, Hawaii, 1999, pp. 729–734] was adapted to the actual evaluation of the number of rejected Web pages on four search engines, i.e., Yahoo, Alta Vista, Google, and Northern Light. Nine input variables were selected for the simulation: (1) precision, (2) overlap, (3) response time, (4) coverage, (5) update frequency, (6) boolean logic, (7) truncation, (8) word and multi-word searching, (9) portion of the Web pages indexed. Typical stochastic simulation meta-modeling uses regression models in response surface methods. RBF becomes a natural target for such an attempt because they use a family of surfaces each of which naturally divides an input space into two regions X+ and X− and the n patterns for testing will be assigned either class X+ or X−. This technique divides the resulting set of responses to a query into accepted and rejected Web pages. To test the hypothesis that the evaluation of any search engine query should involve an estimate of the number of rejected Web pages as part of the evaluation, RBF meta-model was trained on 937 examples from a set of 9000 different simulation runs on the nine different input variables. Results show that two of the variables can be eliminated which include: response time and portion of the Web indexed without affecting evaluation results. Results show that the number of rejected Web pages for a specific set of search queries on these four engines very high. Also a goodness measure of a search engine for a given set of queries can be designed which is a function of the coverage of the search engine and the normalized age of a new document in result set for the query. This study concludes that unless search engine designers address the issue of rejected Web pages, indexing, and crawling, the usage of the Web as a research tool for academic and educational purposes will stay hindered. 相似文献

8.

Discriminating meta-search: a framework for evaluation

《Information processing & management》1999,35(3):337-362

There was a proliferation of electronic information sources and search engines in the 1990s. Many of these information sources became available through the ubiquitous interface of the Web browser. Diverse information sources became accessible to information professionals and casual end users alike. Much of the information was also hyperlinked, so that information could be explored by browsing as well as searching. While vast amounts of information were now just a few keystrokes and mouseclicks away, as the choices multiplied, so did the complexity of choosing where and how to look for the electronic information. Much of the complexity in information exploration at the turn of the twenty-first century arose because there was no common cataloguing and control system across the various electronic information sources. In addition, the many search engines available differed widely in terms of their domain coverage, query methods and efficiency.Meta-search engines were developed to improve search performance by querying multiple search engines at once. In principle, meta-search engines could greatly simplify the search for electronic information by selecting a subset of first-level search engines and digital libraries to submit a query to based on the characteristics of the user, the query/topic, and the search strategy. This selection would be guided by diagnostic knowledge about which of the first-level search engines works best under what circumstances. Programmatic research is required to develop this diagnostic knowledge about first-level search engine performance.This paper introduces an evaluative framework for this type of research and illustrates its use in two experiments. The experimental results obtained are used to characterize some properties of leading search engines (as of 1998). Significant interactions were observed between search engine and two other factors (time of day and Web domain). These findings supplement those of earlier studies, providing preliminary information about the complex relationship between search engine functionality and performance in different contexts. While the specific results obtained represent a time-dependent snapshot of search engine performance in 1998, the evaluative framework proposed should be generally applicable in the future. 相似文献

9.

Syntactic complexity of Web search queries through the lenses of language models,networks and users

《Information processing & management》2016,52(5):923-948

Across the world, millions of users interact with search engines every day to satisfy their information needs. As the Web grows bigger over time, such information needs, manifested through user search queries, also become more complex. However, there has been no systematic study that quantifies the structural complexity of Web search queries. In this research, we make an attempt towards understanding and characterizing the syntactic complexity of search queries using a multi-pronged approach. We use traditional statistical language modeling techniques to quantify and compare the perplexity of queries with natural language (NL). We then use complex network analysis for a comparative analysis of the topological properties of queries issued by real Web users and those generated by statistical models. Finally, we conduct experiments to study whether search engine users are able to identify real queries, when presented along with model-generated ones. The three complementary studies show that the syntactic structure of Web queries is more complex than what n-grams can capture, but simpler than NL. Queries, thus, seem to represent an intermediate stage between syntactic and non-syntactic communication. 相似文献

10.

基于Heritrix和Lucene的专题搜索引擎研究

贾超卫文学《中国科技信息》2012,(10):95-96

专题搜索引擎也称垂直搜索引擎,主要用来满足特定领域的用户需求。Heritrix是开源的网络爬虫,Heritrix的WebUI启动方式并不易用于广大用户。本文改变了往常对Heritrix用法,摒弃了Heritrix的WebUI启动方式,对Heritrix源码进行修改,将Lucene整合到Heritrix中,构建成一个完整的搜索引擎,并通过监听器监听搜索引擎状态,使搜索引擎能够进行自动爬取和数据更新。同时,本文添加了网页过滤模块以及对查询结果排序算法进行了改进,提高了搜索引擎的易用性和查询的准确率。相似文献

11.

Using genetic algorithms to evolve a population of topical queries 总被引：1，自引：1，他引：0

Rocío L. Cecchini Carlos M. Lorenzetti Ana G. Maguitman Nlida Beatríz Brignole 《Information processing & management》2008,44(6):1863

Systems for searching the Web based on thematic contexts can be built on top of a conventional search engine and benefit from the huge amount of content as well as from the functionality available through the search engine interface. The quality of the material collected by such systems is highly dependant on the vocabulary used to generate the search queries. In this scenario, selecting good query terms can be seen as an optimization problem where the objective function to be optimized is based on the effectiveness of a query to retrieve relevant material. Some characteristics of this optimization problem are: (1) the high-dimensionality of the search space, where candidate solutions are queries and each term corresponds to a different dimension, (2) the existence of acceptable suboptimal solutions, (3) the possibility of finding multiple solutions, and in many cases (4) the quest for novelty. This article describes optimization techniques based on Genetic Algorithms to evolve “good query terms” in the context of a given topic. The proposed techniques place emphasis on searching for novel material that is related to the search context. We discuss the use of a mutation pool to allow the generation of queries with new terms, study the effect of different mutation rates on the exploration of query-space, and discuss the use of a especially developed fitness function that favors the construction of queries containing novel but related terms. 相似文献

12.

Review of “Search Engines: Information Retrieval in Practice” by Croft,Metzler and Strohman

B. Barla Cambazoglu 《Information processing & management》2010

Despite the common public use of Web search engines, their internal design details mostly remain as a black art. The speculation is that there is a significant knowledge gap between what is published by academia and what is guarded behind the doors of large-scale search companies. “Search Engines: Information Retrieval in Practice” is one of the few books that make an attempt to cover issues involved in search engine design and is probably the most comprehensive book published so far on this topic. Unfortunately, the book fails to be a complete search engine guide as its content is dominated by the topics from information retrieval, text processing, and statistics. More precisely, the focus of the book is biased towards the “search” rather than the “engines” as, in most places, discussions on effectiveness dominate those on efficiency by a great margin. However, the book stands as a very solid IR book. 相似文献

13.

利用BP神经网络算法优化个性化搜索引擎 总被引：1，自引：0，他引：1

徐恺英王硕《情报理论与实践》2011,34(2)

本文针对现代人对网络信息检索需求的提高,提出一种利用BP人工神经网络算法优化个性化搜索引擎的方法,使搜索引擎更具智能化、个性化,力图为用户提供最佳的检索结果。相似文献

14.

新型搜索引擎畅想

黄建年《情报理论与实践》2007,30(6):851-854

本文认为在网络世界将会出现9种新型的搜索引擎,它们分别是零次文献搜索引擎、潜在文献搜索引擎、知识发现搜索引擎、大型元搜索聚类引擎、专业学术型聚类引擎、学术趋势搜索引擎、概念类比联想搜索引擎、解疑答难型搜索引擎和教学研究平台搜索引擎。相似文献

15.

浅析近年来网络搜索引擎研究现状——以2001至2010年为例

刘天娇周瑛《情报科学》2012,(8):1192-1195

以研究2001-2010年网络搜索引擎的研究发展动态,为该领域后续研究指明方向为目的。以2001-2010年10年为时间限制,通过对CNKI来源期刊有关"网络搜索引擎"的文章搜索出的386篇文章进行分析,并运用内容分析法以及SPSS统计软件,对发文数量,发文期刊分布及发文内容进行分析。经过实例的分析,得出自2001-2010年10年间,对网络搜索引擎的细分化研究论文数量开始多于其综合性研究论文的数量,近10年间对网络搜索引擎方面的研究开始呈现向纵深方向发展的趋势的结论。相似文献

16.

垂直搜索引擎系统的设计与实现 总被引：1，自引：0，他引：1

张敏杜华《情报科学》2011,(3)

面对日益专业和个性化的信息检索需求,通用搜索引擎存在的问题暴露无遗。垂直搜索技术作为搜索引擎发展的一个主要方向,正在受到越来越多的关注。在给出一个垂直搜索引擎总体结构的基础上,详细分析了所涉及的关键技术:网页抓取、中文分词、文本分类等。并将分词和分类算法加入到Nutch中,实现了系统原型。实验证明,该系统主题相关度达到94%以上。相似文献

17.

Web search strategies: The influence of Web experience and task type

Andrew Thatcher 《Information processing & management》2008

Despite a number of studies looking at Web experience and Web searching tactics and behaviours, the specific relationships between experience and cognitive search strategies have not been widely researched. This study investigates how the cognitive search strategies of 80 participants might vary with Web experience as they engaged in two researcher-defined tasks and two participant-defined information seeking tasks. Each of the two researcher-defined tasks and participant-defined tasks included a directed search task and a general-purpose browsing task. While there were almost no significant performance differences between experience levels on any of the four tasks, there were significant differences in the use of cognitive search strategies. Participants with higher levels of Web experience were more likely to use “Parallel player”, “Parallel hub-and-spoke”, “Known address search domain” and “Known address” strategies, whereas participants with lower levels of Web experience were more likely to use “Virtual tourist”, “Link-dependent”, “To-the-point”, “Sequential player”, “Search engine narrowing”, and “Broad first” strategies. The patterns of use and differences between researcher-defined and participant-defined tasks and between directed search tasks and general-purpose browsing tasks are also discussed, although the distribution of search strategies by Web experience were not statistically significant for each individual task. 相似文献

18.

Characteristics of character usage in Chinese Web searching

Michael Chau Yan Lu Xiao Fang Christopher C. Yang 《Information processing & management》2009

The use of non-English Web search engines has been prevalent. Given the popularity of Chinese Web searching and the unique characteristics of Chinese language, it is imperative to conduct studies with focuses on the analysis of Chinese Web search queries. In this paper, we report our research on the character usage of Chinese search logs from a Web search engine in Hong Kong. By examining the distribution of search query terms, we found that users tended to use more diversified terms and that the usage of characters in search queries was quite different from the character usage of general online information in Chinese. After studying the Zipf distribution of n-grams with different values of n, we found that the curve of unigram is the most curved one of all while the bigram curve follows the Zipf distribution best, and that the curves of n-grams with larger n (n = 3–6) had similar structures with β-values in the range of 0.66–0.86. The distribution of combined n-grams was also studied. All the analyses are performed on the data both before and after the removal of function terms and incomplete terms and similar findings are revealed. We believe the findings from this study have provided some insights into further research in non-English Web searching and will assist in the design of more effective Chinese Web search engines. 相似文献

19.

基于广度优先搜索的网络蜘蛛设计

郑泳《人天科学研究》2010,(7):122-123

网络蜘蛛搜索策略的研究是近年来专业搜索引擎研究的焦点之一,如何使搜索引擎快速准确地从庞大的网页数据中获取所需资源的需求是目前所面临的重要问题。重点阐述了搜索引擎的Web Spider（网络蜘蛛）的搜索策略和搜索优化措施,提出了一种简单的基于广度优先算法的网络蜘蛛设计方案,并分析了设计过程中的优化措施。相似文献

20.

搜索引擎偏见:合理性与不合理性——从百度竞价排名风波说起 总被引：1，自引：0，他引：1

陈世华《科学与管理》2009,29(4)

百度竞价风波是搜索引擎偏见的典型体现。以竞价排名为特征的搜索引擎偏见与生俱来,具有一定的合理性,同时也具有严重的消极后果:伤害了新闻自由和个人自治,破坏了民主制度的基础和市场规则,违反了相关法律。如何减少和消除搜索引擎偏见,维护新闻自由,维持媒介市场秩序,促进个人和组织自治,推进民主进程,推动人类全面发展,值得深思。相似文献