Similar Documents
20 similar documents found (search time: 15 ms)
1.
Passage retrieval (already operational for lawyers) has advantages in output form over reference retrieval and is economically feasible. Previous experiments in passage retrieval for scientists have demonstrated recall and false retrieval rates as good as or better than those of present reference retrieval services. The present experiment involved a greater variety of forms of retrieval question. In addition, search words were selected independently by two different people for each retrieval question. The search words selected, in combination with the computer procedures used for passage retrieval, produced average recall ratios of 72 and 67%, respectively, for the two selectors. The false retrieval rates were (except for one predictably difficult question) 13 and 10 falsely retrieved sentences per answer-paper retrieved, respectively.
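The sentence-level matching such an experiment relies on can be sketched as follows. This is a toy illustration with an invented document and search words, not the authors' actual procedures:

```python
# Toy sketch of full-text passage (sentence) retrieval: a sentence is
# retrieved if it contains any of the search words chosen for the question.
# Document and search words are invented for illustration.
def retrieve_sentences(sentences, search_words):
    wanted = {w.lower() for w in search_words}
    hits = []
    for s in sentences:
        # crude tokenization: split on whitespace, strip trailing punctuation
        tokens = {w.strip(".,;").lower() for w in s.split()}
        if wanted & tokens:
            hits.append(s)
    return hits

doc = ["Zinc inhibits enzyme activity.",
       "The control group showed no change.",
       "Copper had a similar inhibitory effect."]
assert retrieve_sentences(doc, ["zinc", "copper"]) == [
    "Zinc inhibits enzyme activity.",
    "Copper had a similar inhibitory effect."]
```

Recall and the false retrieval rate reported in the abstract would then be computed by comparing such retrieved sentences against manually identified answer-passages.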

2.
Citing statements can be used to aid retrieval, to increase the efficiency of citation indexes, and for the study of information flow and use. These uses are only feasible on a large scale if computers can identify citing statements within the texts of documents with reasonable accuracy. Computer recognition of multi-sentence citing statements is not easy. Procedures developed for chemistry papers in an earlier experiment were tested on biomedical papers (dealing with various aspects of cancer) and were almost as successful. Specifically, (1) 78% of the words in computer-recognized citing statements were correctly attributable to the corresponding cited papers; and (2) the computer procedures missed 4% of the words in the actual citing statements. When the procedures were modified on the basis of those results and tested on a new sample of cancer papers, the results were comparable: 72% and 3%, respectively. In an earlier experiment using full-text searching to retrieve answer-passages from cancer papers, recall in the “test phase” averaged about 70% and the false retrieval rate was thirteen falsely retrieved sentences per answer-paper retrieved. Unretrieved answer-papers in that experiment's “development phase”, and citing statements referring to them, were studied to develop computer procedures for using citing statements to increase recall. The procedures developed produced only slight recall increases for development-phase answer-papers, and similarly for the test-phase papers on which they were then tested. Specifically, in the test phase, recall increased from 70 to 74%, with no increase in false retrieval. This contrasts with an earlier experiment in which 50% recall of chemistry papers by search of index terms and abstract words was increased to 70% by the addition of words from citing statements. The difference may be because the average number of citing papers per unretrieved cancer paper was only six, while that for chemistry papers was thirteen.

3.
Documents in computer-readable form can be used to provide information about other documents, i.e. those they cite. To do this efficiently requires procedures for computer recognition of citing statements, which is not easy, especially for multi-sentence citing statements. Computer recognition procedures have been developed that are accurate to the following extent: 73% of the words in statements selected by computer procedures as citing statements are correctly attributable to the corresponding documents. The retrieval effectiveness of computer-recognized citing statements was tested as follows. First, for eight retrieval requests in inorganic chemistry, average recall by search of Chemical Abstracts Service indexing and Chemical Abstracts abstract text words was found to be 50%. Words from citing statements referring to the papers to be retrieved were then added to the index terms and abstract words as additional access points, and searching was repeated; average recall increased to 70%. Only words from citing statements published within a year of the cited papers were used. With citing-statement words alone (published within a year), without index or abstract terms, average recall was 40%; when the words of the cited papers' titles were added to those citing-statement words, average recall increased to 50%.
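The core idea, adding citing-statement words as extra access points for the cited paper, can be sketched with a toy inverted index. The paper, its terms, and the citing words below are invented for illustration; this is not the original system:

```python
# Minimal sketch: a cited paper becomes retrievable by words that appear
# only in statements citing it, not in its own index terms or abstract.
def build_index(docs):
    index = {}
    for doc_id, words in docs.items():
        for w in words:
            index.setdefault(w, set()).add(doc_id)
    return index

papers = {
    "p1": {"synthesis", "complex"},        # the paper's own index/abstract words
}
citing_statements = {
    "p1": {"ligand", "coordination"},      # words used by papers citing p1
}

plain = build_index(papers)
augmented = build_index({
    pid: words | citing_statements.get(pid, set())
    for pid, words in papers.items()
})

# A query term that appears only in citing statements finds p1 only
# in the augmented index:
assert plain.get("ligand", set()) == set()
assert augmented["ligand"] == {"p1"}
```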

4.
5.
曲琳琳 (Qu Linlin). Information Science (情报科学), 2021, 39(8): 132-138
[Purpose/Significance] The purpose of cross-language information retrieval research is to eliminate the difficulty of querying information caused by language differences, to improve the efficiency of finding specific information in large and complex collections, and to provide a more convenient way for users to retrieve documents in another language using the language they know. [Method/Process] Based on an analysis of the state of cross-language information retrieval research at home and abroad, this paper introduces several current query-translation approaches, including direct query translation, document translation, pivot-language translation, and combined query-document translation, and compares their effectiveness. It then describes the key techniques of cross-language retrieval and proposes solutions to, and evaluations of, the ambiguity arising from bilingual dictionaries, corpora, and machine translation. [Result/Conclusion] Natural language processing, co-occurrence analysis, relevance feedback, query expansion, bidirectional translation, and ontology-based retrieval techniques are used to ensure the coverage of the knowledge dictionary and the handling of ambiguity; experimental analysis of cross-language retrieval shows that combining a knowledge dictionary, a corpus, and a search engine can improve query efficiency. [Innovation/Limitations] To address the shortage of terms in dictionaries and corpora for cross-language retrieval, the paper proposes using a search engine to obtain information resources from web pages to enrich the sentence pairs in the corpus. The paper focuses on Chinese-English retrieval; the proposed solutions require further study, as problems such as the difficulty of Chinese word segmentation and low dictionary coverage seriously affect retrieval efficiency.
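One of the techniques this survey discusses, dictionary-based query translation with co-occurrence disambiguation, can be sketched as follows. The bilingual dictionary and co-occurrence counts are made up for illustration:

```python
# Toy sketch: each query term may have several dictionary translations;
# pick the candidate that co-occurs most often (in a target-language
# corpus) with the translations of the other query terms.
BILINGUAL = {"银行": ["bank", "shore"], "贷款": ["loan"]}   # invented entries
COOCCUR = {("bank", "loan"): 50, ("loan", "shore"): 1}     # invented counts

def translate(query_terms):
    result = []
    for term in query_terms:
        candidates = BILINGUAL.get(term, [term])
        # translations of the other query terms, used as disambiguation context
        others = [c for t in query_terms if t != term
                  for c in BILINGUAL.get(t, [t])]
        best = max(candidates,
                   key=lambda c: sum(COOCCUR.get(tuple(sorted((c, o))), 0)
                                     for o in others))
        result.append(best)
    return result

# "银行" is ambiguous (bank/shore); co-occurrence with "loan" resolves it:
assert translate(["银行", "贷款"]) == ["bank", "loan"]
```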

6.
许跃军 (Xu Yuejun). Information Science (情报科学), 2008, 26(6): 866-871
This paper discusses information retrieval techniques for a government knowledge base built on an ontology. Unlike traditional full-text retrieval, the ontology-based approach analyzes the lexical, syntactic, and semantic information in a user's natural-language query, identifies the question type, extracts keywords, and expands them. Keywords can also be expanded according to the relations among domain terms in the ontology and assigned different weights. The question type and the weighted keyword sequence are then submitted to the system's retrieval engine for subsequent processing.

7.
A prefix trie index (originally called trie hashing) is applied to the problem of providing fast search times, fast load times, and fast update properties in a bibliographic or full-text retrieval system. For all but the largest dictionaries, a single key search in the dictionary under trie hashing takes exactly one disk read. Front compression of search keys is used to enhance performance. Partial combining of the postings into the dictionary is analyzed as a method to give both faster retrieval and improved update properties for the trie-hashing inverted file. Statistics are given for a test database consisting of an online catalog at the Graduate School of Library and Information Science Library of the University of Western Ontario. The effect of changing various parameters of prefix tries is tested in this application.
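A plain in-memory prefix trie (a simplification, not the paper's disk-based trie-hashing structure) illustrates the dictionary lookup involved, with postings combined into the dictionary nodes:

```python
# Sketch of a prefix trie dictionary: keys share prefixes, so a lookup
# walks one child per character; postings stored at the key's final node
# correspond to the "combining postings into the dictionary" idea above.
class TrieNode:
    def __init__(self):
        self.children = {}
        self.postings = None   # postings list, set only at complete keys

def insert(root, key, postings):
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.postings = postings

def search(root, key):
    node = root
    for ch in key:
        node = node.children.get(ch)
        if node is None:
            return None
    return node.postings

root = TrieNode()
insert(root, "catalog", [1, 5, 9])
insert(root, "catalysis", [2])
assert search(root, "catalog") == [1, 5, 9]
assert search(root, "cat") is None        # a prefix alone is not a key
```

In the paper's disk-resident variant, the trie is organized so that reaching a key's postings costs a single disk read for all but the largest dictionaries.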

8.
Abbreviations adversely affect information retrieval and text comprehensibility. We describe a software tool to decipher abbreviations by finding their whole-word equivalents or “disabbreviations”. It uses a large English dictionary and a rule-based system to guess the most likely candidates, with users having final approval. The rule-based system uses a variety of knowledge to limit its search, including phonetics, known methods of constructing multiword abbreviations, and analogies to previous abbreviations. The tool is especially helpful for retrieval from computer programs, a form of technical text in which abbreviations are notoriously common; disabbreviation of programs can make them more reusable, improving software engineering. It also helps decipher the often-specialized abbreviations in technical captions. Experimental results confirm that the prototype tool is easy to use, finds many correct disabbreviations, and improves text comprehensibility.

9.
A user’s single session with a Web search engine or information retrieval (IR) system may involve seeking information on a single topic or on multiple topics, switching between tasks – multitasking information behavior. Most Web search sessions consist of two queries of approximately two words each. However, some Web search sessions consist of three or more queries. We present findings from two studies: first, a study of two-query search sessions on the AltaVista Web search engine, and second, a study of sessions of three or more queries on the same engine. We examine the degree of multitasking search and information-task switching during these two sets of AltaVista Web search sessions. Samples of two-query and three-or-more-query sessions were filtered from 2002 AltaVista transaction logs and qualitatively analyzed. Sessions ranged in duration from less than a minute to a few hours. Findings include: (1) 81% of two-query sessions included multiple topics, (2) 91.3% of sessions with three or more queries included multiple topics, (3) multitasking search sessions cover a broad variety of topics, and (4) sessions of three or more queries sometimes contained frequent topic changes. Multitasking is found to be a growing element in Web searching. This paper proposes an approach to interactive information retrieval (IR) set contextually within a multitasking framework. The implications of our findings for Web design and further research are discussed.

10.
A procedure for automated indexing of pathology diagnostic reports at the National Institutes of Health is described. Diagnostic statements in medical English are encoded by computer into the Systematized Nomenclature of Pathology (SNOP), a structured indexing language constructed by pathologists for manual indexing. It is of interest that effective automatic encoding can be based upon an existing vocabulary and code designed for manual methods. The techniques utilized are morphosyntactic analysis, a simple syntax analysis, matching of multi-word dictionary entries, and synonym substitution.
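The multi-word dictionary matching and synonym substitution mentioned above can be sketched roughly as follows. The dictionary entry and code string are invented for illustration, not actual SNOP codes:

```python
# Toy sketch: normalize synonyms, then encode diagnostic statements by
# longest-match lookup of multi-word dictionary entries.
SYNONYMS = {"carcinoma": "cancer"}                    # invented mapping
DICTIONARY = {("lung", "cancer"): "T-28000 M-80103"}  # invented code

def encode(words):
    words = [SYNONYMS.get(w, w) for w in words]       # synonym substitution
    codes = []
    i = 0
    while i < len(words):
        for j in range(len(words), i, -1):            # try longest match first
            code = DICTIONARY.get(tuple(words[i:j]))
            if code:
                codes.append(code)
                i = j
                break
        else:
            i += 1                                    # no entry starts here
    return codes

# "lung carcinoma" normalizes to "lung cancer" and matches the entry:
assert encode(["lung", "carcinoma"]) == ["T-28000 M-80103"]
```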

11.
When consumers search for health information, a major obstacle is their unfamiliarity with medical terminology. Even though medical thesauri such as the Medical Subject Headings (MeSH) and related tools (e.g., the MeSH Browser) were created to help consumers find medical term definitions, the lack of direct and explicit integration of these help tools into a health retrieval system has prevented them from effectively achieving their objectives. To explore this issue, we conducted an empirical study with two systems: one a simple interface supporting query-based searching, the other an augmented system with two new components supporting MeSH term searching and MeSH tree browsing. A total of 45 subjects were recruited to participate in the study. The results indicated that the augmented system is more effective than the simple system in improving user-perceived topic familiarity and question–answer performance, even though we did not find that users spent more time on the augmented system. The two new MeSH help components played a critical role in participants’ health information retrieval and were found to allow them to develop new search strategies. The findings of the study enhance our understanding of consumers’ search behaviors and shed light on the design of future health information retrieval systems.

12.
This paper presents a laboratory-based evaluation study of cross-language information retrieval technologies, utilizing partially parallel test collections, NTCIR-2 (used together with NTCIR-1), in which Japanese–English parallel document collections, parallel topic sets, and their relevance judgments are available. These enable us to observe and compare monolingual retrieval processes in two languages as well as retrieval across languages. Our experiments focused on (1) the Rosetta stone question (does a partially parallel collection help in cross-language information access?) and (2) two aspects of retrieval difficulty, namely “collection discrepancy” and “query discrepancy”. Japanese and English monolingual retrieval systems are combined via dictionary-based query translation modules, so that a symmetrical bilingual evaluation environment is implemented.

13.
The authors open with a general discussion of the philosophy and design of information systems, with particular attention to the definition, needs, and psychology of the ultimate user of systems providing on-line access to biomedical information; the role of the documentalist; the differences between document retrieval and true information retrieval; and the operational characteristics of on-line systems that affect their cost and hence their design and acceptability. They then make some tentative predictions about the future demand for such information retrieval services and their probable organizational form. A brief report follows on the principal findings and conclusions of a user study of the Excerpta Medica system, whose key features and history are briefly described. Based on the conclusions of this study, particularly as regards the complexity of the average search question, the role of the search formulators in determining the results of computer searching, the importance of secondary concepts for retrieval, and the optimal level of specificity of a computer thesaurus, some of the changes in the Excerpta Medica system that are in the planning stage and will be incorporated into the system's Mark II version are outlined, as are the principal features of the two systems currently offering on-line access to the Excerpta Medica database in Western Germany and the U.S.A. Finally, attention is given to the planned partial hierarchic structuring of the Excerpta Medica thesaurus (Malimet), a project to be based largely on frequency counts of the existing database and the elimination of over-specific terms by posting under broader concepts. The results of some of the initial steps in this direction (i.e. frequency counts of portions of the database and the structuring of some of the terms used in the cancer field) are presented by way of illustration.

14.
张军华, 韩全会 (Zhang Junhua, Han Quanhui). Information Science (情报科学), 2008, 26(10): 1540-1542, 1557
Through an evaluation experiment, a set of test search topics was selected from Baidu's real-time trend rankings. Five major Chinese general-purpose search engines – Baidu (百度), iAsk (爱问), Sogou (搜狗), Soso (搜搜), and Zhongsou (中搜) – were evaluated on five dimensions: index database performance, search functionality, retrieval effectiveness, result presentation, and user effort, using measures such as average site index size and precision of the top X hits, with emphasis on testing and analysis of the quantitative indicators. The respective strengths, distinctive features, and weaknesses of the five engines are summarized, and directions for improving search engines, together with strategies for users in choosing search engines and conducting searches, are proposed.

15.
In legal case retrieval, existing work has shown that human-mediated conversational search can improve users’ search experience. In practice, a suitable workflow can provide guidelines for constructing a machine-mediated agent to replace human agents. We therefore conduct a comparative analysis and identify two challenges in directly applying the web-search conversational agent workflow to legal case retrieval: (1) it is complex for agents to express their understanding of users’ information needs, and (2) selecting a candidate case from the SERPs is more difficult for agents, especially at the early stage of the search process. To tackle these challenges, we propose a conversational agent workflow suited to legal case retrieval, which contains two additional key modules compared with the web-search workflow: Query Generation and a Buffer Mechanism. A controlled user experiment with three control groups, using the whole workflow or removing one of the two modules, was conducted. The results demonstrate that the proposed workflow supports conversational agents working more efficiently and helps users save search effort, leading to higher search success and satisfaction for legal case retrieval. We further construct a large-scale dataset and provide guidance on machine-mediated conversational search systems for legal case retrieval.

16.
Performance Evaluation of the Five Major Chinese General-Purpose Search Engines   (Cited by: 1; self-citations: 0; citations by others: 1)
Through an evaluation experiment, a set of test search topics was selected from Baidu's real-time trend rankings. Five major Chinese general-purpose search engines – Baidu (百度), iAsk (爱问), Sogou (搜狗), Soso (搜搜), and Zhongsou (中搜) – were evaluated on five dimensions: index database performance, search functionality, retrieval effectiveness, result presentation, and user effort, using measures such as average site index size and precision of the top X hits, with emphasis on testing and analysis of the quantitative indicators. The respective strengths, distinctive features, and weaknesses of the five engines are summarized, and directions for improving search engines, together with strategies for users in choosing search engines and conducting searches, are proposed.

17.
The absence of diacritics in text documents or search queries is a serious problem for Turkish information retrieval because it creates homographic ambiguity. Thus, the inappropriate handling of diacritics reduces the retrieval performance in search engines. A straightforward solution to this problem is to normalize tokens by replacing diacritic characters with their American Standard Code for Information Interchange (ASCII) counterparts. However, this so-called ASCIIfication produces either synthetic words that are not legitimate Turkish words or legitimate words with meanings that are completely different from those of the original words. These non-valid synthetic words cannot be processed by morphological analysis components (such as stemmers or lemmatizers), which expect the input to be valid Turkish words. By contrast, synthetic words are not a problem when no stemmer or a simple first-n-characters-stemmer is used in the text analysis pipeline. This difference emphasizes the notion of the diacritic sensitivity of stemmers. In this study, we propose and evaluate an alternative solution based on the application of deASCIIfication, which restores accented letters in query terms or text documents. Our risk-sensitive evaluation results showed that the diacritics restoration approach yielded more effective and robust results compared with normalizing tokens to remove diacritics.
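The homographic ambiguity that motivates diacritics restoration can be shown with a minimal ASCIIfication sketch (a hand-built mapping of the Turkish diacritic letters, for illustration only):

```python
# Minimal sketch of Turkish ASCIIfication: map each diacritic letter to
# its ASCII counterpart. Distinct words can collapse to the same form,
# which is the homograph problem the abstract describes.
ASCII_MAP = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")

def asciify(text):
    return text.translate(ASCII_MAP)

# "sık" (frequent) and "şık" (stylish) become indistinguishable:
assert asciify("sık") == asciify("şık") == "sik"
```

The deASCIIfication approach evaluated in the paper goes the other way, restoring the accented letters so that valid Turkish words reach the morphological analysis components.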

18.
Web queries in question format are becoming a common element of a user's interaction with Web search engines. Web search services such as Ask Jeeves – a publicly accessible question and answer (Q&A) search engine – ask users to enter queries in question format. This paper provides results from a study examining question-format queries submitted to two different Web search engines – Ask Jeeves, which explicitly encourages queries in question format, and the Excite search service, which does not. We identify the characteristics of question-format queries in two data sets – 30,000 Ask Jeeves queries and 15,575 Excite queries – including the nature, length, and structure of queries in question format. Findings include: (1) 50% of Ask Jeeves queries and less than 1% of Excite queries were in question format, (2) most users entered only one query in question format, with little query reformulation, (3) a limited range of formats was used – mainly “where”, “what”, or “how” questions, (4) the most common question format was “Where can I find………” for general information on a topic, and (5) non-question queries may be in request format. Overall, four types of user Web queries were identified: keyword, Boolean, question, and request. These findings provide an initial mapping of the structure and content of queries in question and request format. Implications for Web search services are discussed.
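The four query types identified above can be illustrated with a toy rule-based classifier. The rules are invented for illustration and are far cruder than the qualitative analysis the study performed:

```python
# Toy sketch: classify a web query as question, request, Boolean, or
# keyword format, mirroring the four types identified in the study.
QUESTION_WORDS = {"where", "what", "how", "why", "when", "who"}

def classify(query):
    q = query.lower()
    if q.split()[0] in QUESTION_WORDS or q.endswith("?"):
        return "question"
    if q.startswith(("find ", "show me ", "i need ")):
        return "request"
    # Boolean operators are conventionally uppercase in the raw query
    if any(op in query.split() for op in ("AND", "OR", "NOT")):
        return "boolean"
    return "keyword"

assert classify("where can i find cheap flights") == "question"
assert classify("show me pictures of cats") == "request"
assert classify("jaguar AND car") == "boolean"
assert classify("jaguar speed") == "keyword"
```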

19.
Word sense disambiguation (WSD) is meant to assign the most appropriate sense to a polysemous word according to its context. We present a method for automatic WSD using only two resources: a raw text corpus and a machine-readable dictionary (MRD). The system learns a similarity matrix between word pairs from the unlabeled corpus and uses vector representations of the sense definitions from the MRD, derived from that similarity matrix. To disambiguate all occurrences of polysemous words in a sentence, the system constructs an acyclic weighted digraph (AWD) for each occurrence of a polysemous word. The AWD is structured by considering the senses of the context words that occur with the target word in the sentence. After building the AWD for each polysemous word, the optimal path through the AWD is found with the Viterbi algorithm, and the target word is assigned the sense on that optimal path. In experiments, our system achieves 76.4% accuracy on semantically ambiguous Korean words.
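The Viterbi search over sense candidates can be sketched with a small dynamic program. The sense inventory and similarity scores below are made up; the paper's AWD construction from MRD definition vectors is more involved:

```python
# Hedged sketch: pick the best sense sequence for a sentence's words by
# maximizing the summed similarity between adjacent chosen senses,
# Viterbi-style, over invented candidate senses and scores.
def viterbi_senses(word_senses, sim):
    # word_senses: one list of candidate senses per word in the sentence
    # sim: dict mapping (prev_sense, sense) -> similarity score
    best = {s: 0.0 for s in word_senses[0]}
    back = [{}]
    for senses in word_senses[1:]:
        new_best, ptr = {}, {}
        for s in senses:
            prev, score = max(((p, best[p] + sim.get((p, s), 0.0))
                               for p in best), key=lambda x: x[1])
            new_best[s], ptr[s] = score, prev
        best, back = new_best, back + [ptr]
    # trace the optimal path backwards from the best final sense
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return list(reversed(path))

senses = [["bank/river", "bank/money"], ["deposit/geo", "deposit/cash"]]
sim = {("bank/money", "deposit/cash"): 0.9, ("bank/river", "deposit/geo"): 0.4}
assert viterbi_senses(senses, sim) == ["bank/money", "deposit/cash"]
```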

