Text document clustering provides an effective and intuitive navigation mechanism for organizing a large number of retrieval results by grouping documents into a small number of meaningful classes. Many well-known text clustering methods use a long list of words as the vector space, which is often unsatisfactory for two reasons: first, it keeps the dimensionality of the data very high, and second, it ignores important relationships between terms, such as synonymy and antonymy. Our unsupervised method addresses both problems by using ANNIE and WordNet lexical categories and the WordNet ontology to create a well-structured document vector space whose low dimensionality allows common clustering algorithms to perform well. For the clustering step we have chosen bisecting k-means and the Multipole tree, a modified version of the Antipole tree data structure, for their accuracy and speed, respectively.
Diego Reforgiato Recupero
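
The clustering step named in the abstract above, bisecting k-means, can be illustrated with a minimal sketch. It is not the authors' implementation: it leaves out the ANNIE/WordNet vector construction and the Multipole tree entirely, uses toy 2-D points in place of document vectors, and the function names (`bisecting_kmeans`, `_two_means`) and the deterministic farthest-point seeding are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def _sqdist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _centroid(points: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


def _sse(points: List[Vector]) -> float:
    """Sum of squared distances from each point to the cluster centroid."""
    c = _centroid(points)
    return sum(_sqdist(p, c) for p in points)


def _two_means(points: List[Vector], iters: int = 20) -> Tuple[list, list]:
    """Split one cluster in two with plain 2-means.

    Seeding is deterministic (an assumption of this sketch): the point
    farthest from the centroid, then the point farthest from that one.
    Returns two empty lists if the points cannot be separated.
    """
    c = _centroid(points)
    a = max(points, key=lambda p: _sqdist(p, c))
    b = max(points, key=lambda p: _sqdist(p, a))
    left, right = [], []
    for _ in range(iters):
        new_left = [p for p in points if _sqdist(p, a) <= _sqdist(p, b)]
        new_right = [p for p in points if _sqdist(p, a) > _sqdist(p, b)]
        if not new_left or not new_right:
            break  # keep the last valid split, if any
        converged = new_left == left and new_right == right
        left, right = new_left, new_right
        if converged:
            break
        a, b = _centroid(left), _centroid(right)
    return left, right


def bisecting_kmeans(points: List[Vector], k: int) -> List[list]:
    """Repeatedly bisect the cluster with the largest SSE until k clusters exist."""
    clusters = [list(points)]
    while len(clusters) < k:
        candidates = [c for c in clusters if len(c) > 1]
        if not candidates:
            break
        target = max(candidates, key=_sse)
        left, right = _two_means(target)
        if not left or not right:
            break  # the worst cluster cannot be split further
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters


if __name__ == "__main__":
    toy = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (21, 0)]
    for cluster in bisecting_kmeans(toy, 3):
        print(cluster)
```

Choosing the cluster with the largest sum of squared errors is one common selection rule; picking the largest cluster by size is another, and the abstract does not say which the authors used.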

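As the volume of information on the Internet keeps growing, traditional information retrieval techniques can hardly meet users' demanding requirements for query quality. To help users quickly and accurately locate the information they want in retrieval results, search engines with integrated document clustering have emerged. This paper discusses the application of document clustering techniques in search engines, introduces several algorithms, analyzes in detail Vivisimo, a fairly representative clustering search engine, and predicts development trends for search engine clustering technology.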

We consider a multi-stage retrieval architecture consisting of a fast, “cheap” candidate generation stage, a feature extraction stage, and a more “expensive” reranking stage using machine-learned models. In this context, feature extraction can be accomplished using a document vector index, a mapping from document ids to document representations. We consider alternative organizations of such a data structure for efficient feature extraction: design choices include how document terms are organized, how complex term proximity features are computed, and how these structures are compressed. In particular, we propose a novel document-adaptive hashing scheme for compactly encoding term ids. The impact of alternative designs on both feature extraction speed and memory footprint is experimentally evaluated. Overall, results show that our architecture is comparable in speed to using a traditional positional inverted index but requires less memory overall, and offers additional advantages in terms of flexibility.
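
To make the idea of a document vector index concrete, here is a minimal sketch of a mapping from document ids to per-document term positions, with two features a reranker might request. It is an illustration under stated assumptions, not the paper's design: there is no compression and no document-adaptive hashing, terms are stored as plain strings, and the class and method names (`DocVectorIndex`, `term_frequency`, `min_distance`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional


class DocVectorIndex:
    """Toy forward index: doc id -> {term: ascending list of positions}."""

    def __init__(self) -> None:
        self._docs: Dict[int, Dict[str, List[int]]] = {}

    def add(self, doc_id: int, tokens: Iterable[str]) -> None:
        """Store the positions of every term in one document."""
        positions: Dict[str, List[int]] = defaultdict(list)
        for pos, tok in enumerate(tokens):
            positions[tok].append(pos)
        self._docs[doc_id] = dict(positions)

    def term_frequency(self, doc_id: int, term: str) -> int:
        """Number of occurrences of `term` in the document."""
        return len(self._docs[doc_id].get(term, []))

    def min_distance(self, doc_id: int, t1: str, t2: str) -> Optional[int]:
        """Smallest positional gap between t1 and t2, or None if either is absent.

        Uses a linear merge over the two sorted position lists.
        """
        p1 = self._docs[doc_id].get(t1, [])
        p2 = self._docs[doc_id].get(t2, [])
        if not p1 or not p2:
            return None
        i = j = 0
        best = abs(p1[0] - p2[0])
        while i < len(p1) and j < len(p2):
            best = min(best, abs(p1[i] - p2[j]))
            if p1[i] < p2[j]:
                i += 1
            else:
                j += 1
        return best


if __name__ == "__main__":
    index = DocVectorIndex()
    index.add(7, "fast candidate generation then expensive reranking of candidate documents".split())
    print(index.term_frequency(7, "candidate"))
    print(index.min_distance(7, "candidate", "reranking"))
```

Because lookups go from a document id straight to that document's terms, features for a short candidate list can be computed without scanning postings lists, which is the access pattern the abstract contrasts with a positional inverted index.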

In information retrieval, users’ information needs are hard to identify, so many approaches address the problem by expanding initial queries and reweighting the terms in the expanded queries using users’ relevance judgments. Although relevance feedback is most effective when users provide relevance information about retrieved documents, such information is not always available. Another solution is to use correlated terms for query expansion. The main problem with this approach is how to construct term-term correlations that can be used effectively to improve retrieval performance. In this study, we construct query concepts that denote users’ information needs from a document space, rather than reformulating initial queries using term correlations and/or users’ relevance feedback. To form query concepts, we extract features from each document and then cluster the features into primitive concepts, which are in turn used to form query concepts. Experiments are performed on the Associated Press (AP) dataset taken from the TREC collection. The experimental evaluation shows that our proposed framework, called QCM (Query Concept Method), outperforms a baseline probabilistic retrieval model on TREC retrieval.

We apply the knowledge discovery process to the mapping of current topics in a particular field of science. We are interested in how articles form clusters and what the resulting clusters contain. A framework involving web scraping, keyword extraction, dimensionality reduction, and clustering using the diffusion map algorithm is presented. We use publicly available information about articles in high-impact journals. The method should be of use to practitioners or scientists who want an overview of recent research in a field of science. As a case study, we map the topics in data mining literature in the year 2011.

A study was conducted of the time needed to locate, acquire, and process articles for storage in a personal collection from conventional paper or electronic sources. Traditional paper methods of acquiring, indexing, and storing documents were found to take twice as long as electronic methods.

Document clustering of scientific texts using citation contexts   Total citations: 3, self-citations: 0, citations by others: 3
Document clustering has many important applications in the area of data mining and information retrieval. Many existing document clustering techniques use the “bag-of-words” model to represent the content of a document. However, this representation is only effective for grouping related documents when these documents share a large proportion of lexically equivalent terms. In other words, instances of synonymy between related documents are ignored, which can reduce the effectiveness of applications using a standard full-text document representation. To address this problem, we present a new approach for clustering scientific documents, based on the utilization of citation contexts. A citation context is essentially the text surrounding the reference markers used to refer to other scientific works. We hypothesize that citation contexts will provide relevant synonymous and related vocabulary which will help increase the effectiveness of the bag-of-words representation. In this paper, we investigate the power of these citation-specific word features, and compare them with the original document’s textual representation in a document clustering task on two collections of labeled scientific journal papers from two distinct domains: High Energy Physics and Genomics. We also compare these text-based clustering techniques with a link-based clustering algorithm which determines the similarity between documents based on the number of co-citations, that is in-links represented by citing documents and out-links represented by cited documents. Our experimental results indicate that the use of citation contexts, when combined with the vocabulary in the full-text of the document, is a promising alternative means of capturing critical topics covered by journal articles. 
More specifically, this document representation strategy, when used by the clustering algorithm investigated in this paper, outperforms both the full-text clustering approach and the link-based clustering technique on both scientific journal datasets.
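
The abstract above defines a citation context as the text surrounding a reference marker. A minimal sketch of extracting such contexts follows. The marker format (bracketed numbers such as `[4]` or `[1, 2]`), the fixed word window, and the function name `citation_contexts` are all assumptions made for illustration; the abstract does not say how the authors delimit a context.

```python
import re
from typing import List

# A token is either a numeric reference marker like [4] or [1, 2], or a word.
TOKEN = re.compile(r"\[\d+(?:\s*,\s*\d+)*\]|\w+")


def citation_contexts(text: str, window: int = 3) -> List[List[str]]:
    """Return, for each reference marker, the lowercased words around it.

    `window` is the number of tokens taken on each side of the marker.
    Other markers that fall inside the window are dropped from the context.
    """
    tokens = TOKEN.findall(text)
    contexts = []
    for i, tok in enumerate(tokens):
        if tok.startswith("["):
            nearby = tokens[max(0, i - window): i] + tokens[i + 1: i + 1 + window]
            contexts.append([t.lower() for t in nearby if not t.startswith("[")])
    return contexts


if __name__ == "__main__":
    sample = "Spectral methods [4] scale poorly, while k-means [1, 2] is fast."
    for context in citation_contexts(sample, window=2):
        print(context)
```

In a clustering pipeline, the words collected this way from citing papers would be added to the cited document's bag-of-words representation, which is how the extra synonymous vocabulary described in the abstract would enter the feature space.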

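By comparing CNKI E-learning, a reference management tool developed in-house by CNKI, with other reference management software, the article finds that its functions for efficient acquisition of academic literature, orderly management, in-depth reading, and assistance with writing and publishing papers are more powerful and more distinctive. It focuses on how to apply the distinctive features of CNKI E-learning to each stage of academic research, helping researchers work efficiently and in an orderly way.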


Key points

  • Concepts from lean manufacturing and Kanban production can usefully be applied to writing for academic publication.
  • Value and pull focus the author's attention on the needs of reviewers, editors, and readers.
  • Value stream and flow emphasize an end‐to‐end process of prioritization, writing, editing, revision, resubmission, and publication.
  • Perfection places emphasis on publication quality.
  • A Kanban board is advocated to plan and monitor the writing and publication lifecycle.
  • The author's experience shows a steady improvement in output rankings and researcher reputation metrics over a four‐year period.

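Reform and practice of the "Literature Retrieval and Utilization" course in the information age   Total citations: 5, self-citations: 0, citations by others: 5
With the arrival of the information age, the traditional "Literature Retrieval and Utilization" course faces new challenges. To adapt to the changing times, the course's teaching system, content, methods, and quality evaluation system will all undergo a series of changes and innovations; the overall competence of its instructors should improve accordingly. Drawing on several years of teaching reform and practical experience, this paper discusses these topics further.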

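Applying CONSORT to improve the quality of medical journals   Total citations: 8, self-citations: 0, citations by others: 8
Liu Xuemei, Liu Jianping. Acta Editologica (《编辑学报》), 2002, 14(3): 228-229
Introduces the updated Chinese version of CONSORT and related background, for reference by the journal community.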

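Using attention strategies to improve editing efficiency   Total citations: 1, self-citations: 0, citations by others: 1
Cao Zuohua. Acta Editologica (《编辑学报》), 2010, 22(3): 240-241
To improve the efficiency of editorial work, the author tries applying attention strategies from cognitive psychology according to the key tasks at each stage of editing. At the manuscript screening stage, selective attention is directed at grasping the content. At the revision-request stage, selective attention goes first to proposing revisions to the content, and then to suggestions on standardized and normalized expression. At the copyediting stage, selective attention focuses on the standard presentation of tables, data, units of measurement, terminology, and the like. Following the principle of practice and automaticity, editors should strengthen their study of editorial standards and skills, and their training in computer skills, to further improve their efficiency.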

Social tagging systems have gained increasing popularity as a method of annotating and categorizing a wide range of web resources. Web search that utilizes social tagging data suffers from an extreme form of the vocabulary mismatch problem encountered in traditional information retrieval (IR). This is due to the personalized, unrestricted vocabulary that users choose to describe and tag each resource. Previous research has proposed query expansion to deal with search in this rather complicated space. However, non-personalized approaches based on relevance feedback and personalized approaches based on co-occurrence statistics have shown only limited improvements. This paper proposes a novel query expansion framework based on individual user profiles mined from the annotations and resources the user has marked. The underlying theory is to regularize the smoothness of word associations over a connected graph using a regularizer function on terms extracted from top-ranked documents. The intuition behind the model is the prior assumption of term consistency: the most appropriate expansion terms for a query are likely to be associated with, and influenced by, terms extracted from the documents ranked highly for the initial query. The framework also simultaneously incorporates annotations and web documents through a Tag-Topic model in a latent graph. The experimental results suggest that the proposed personalized query expansion method can produce better results than both the classical non-personalized search approach and other personalized query expansion methods. Hence, the proposed approach significantly benefits personalized web search by leveraging users’ social media data.

The images found within biomedical articles are sources of essential information useful for a variety of tasks. Due to the rapid growth of biomedical knowledge, image retrieval systems are increasingly becoming necessary tools for quickly accessing the most relevant images from the literature for a given information need. Unfortunately, article text can be a poor substitute for image content, limiting the effectiveness of existing text-based retrieval methods. Additionally, the use of visual similarity by content-based retrieval methods as the sole indicator of image relevance is problematic, since the importance of an image can depend on its context rather than its appearance. For biomedical image retrieval, multimodal approaches are often desirable. We describe in this work a practical multimodal solution for indexing and retrieving the images contained in biomedical articles. Recognizing the importance of text in determining image relevance, our method combines a predominantly text-based image representation with a limited amount of visual information, in the form of quantized content-based visual features, through a process called global feature mapping. The resulting multimodal image surrogates are easily indexed and searched using existing text-based retrieval systems. Our experimental results demonstrate that our multimodal strategy significantly improves upon the retrieval accuracy of existing approaches. In addition, unlike many retrieval methods that utilize content-based visual features, the response time of our approach is negligible, making it suitable for use with large collections.

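Taking the 2019 Wuhan University reading report as its object of study, the article describes how university libraries can use reader data to analyze students' library visits and borrowing behavior, and offers corresponding suggestions, with the aim of improving the precision of university library services.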

A computer-based health literacy intervention for older adults was developed and assessed from September 2007 to June 2009. A total of 218 adults between the ages of 60 and 89 participated in the study at two public libraries. The four week-long curricula covered two National Institutes of Health (NIH) websites: NIHSeniorHealth.gov and MedlinePlus.gov. Computer and Web knowledge significantly improved from pre- to post-intervention (p < .01 in both cases). Computer attitudes significantly improved from pre- to post-intervention: anxiety significantly decreased while interest and efficacy both increased (p < .001 in all three cases). Most participants found both sites easy to use and were able to find needed information on both. Information found on NIHSeniorHealth was significantly more useful than that on MedlinePlus (p < .05). Most participants (78%) reported that what they learned had affected their participation in their own health care. Participants had positive feedback on the intervention. These findings support the effectiveness and popularity of the intervention. By tapping into the well-established public library and NIH infrastructure, this intervention has great potential for scaling up, and significant social and economic implications for a diverse range of communities and individuals.

Research on cross-language information retrieval (CLIR) has typically been restricted to settings using binary relevance assessments. In this paper, we present evaluation results for dictionary-based CLIR using graded relevance assessments in a best-match retrieval environment. A text database containing newspaper articles and a related set of 35 search topics were used in the tests. First, monolingual baseline queries were automatically formed from the topics. Secondly, source language topics (in English, German, and Swedish) were automatically translated into the target language (Finnish), using structured target queries. The effectiveness of the translated queries was compared to that of the monolingual queries. Thirdly, pseudo-relevance feedback was used to expand the original target queries. CLIR performance was evaluated using three relevance thresholds: stringent, regular, and liberal. When the regular or liberal threshold was used, reasonable performance was achieved. With the stringent threshold, equally high performance could not be achieved. At all relevance thresholds, the performance of the translated queries was successfully raised by query expansion based on pseudo-relevance feedback. However, the performance at the stringent threshold relative to the other thresholds could not be raised by this method.
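
Evaluating one ranking under several relevance thresholds, as the abstract above describes, can be sketched as follows. Everything specific here is an assumption for illustration: the abstract does not give the grading scale or the cutoffs, so the sketch supposes grades from 0 (not relevant) to 3 (highly relevant), with stringent, regular, and liberal meaning a minimum grade of 3, 2, and 1. It also assumes the ranked list contains every judged document, so average precision is normalized by the relevant documents found in the list.

```python
from typing import Dict, List

# Hypothetical cutoffs on an assumed 0-3 graded relevance scale.
THRESHOLDS: Dict[str, int] = {"stringent": 3, "regular": 2, "liberal": 1}


def average_precision(ranked_grades: List[int], min_grade: int) -> float:
    """Average precision after binarizing graded judgments at `min_grade`.

    `ranked_grades[i]` is the relevance grade of the document at rank i + 1.
    Returns 0.0 when no document reaches the threshold.
    """
    hits = 0
    total = 0.0
    for rank, grade in enumerate(ranked_grades, start=1):
        if grade >= min_grade:
            hits += 1
            total += hits / rank  # precision at this relevant document
    return total / hits if hits else 0.0


def evaluate(ranked_grades: List[int]) -> Dict[str, float]:
    """Average precision of one ranking under each named threshold."""
    return {
        name: average_precision(ranked_grades, cutoff)
        for name, cutoff in THRESHOLDS.items()
    }


if __name__ == "__main__":
    for name, score in evaluate([3, 0, 2, 1]).items():
        print(f"{name}: {score:.3f}")
```

The same ranking scores differently at each threshold because the set of documents counted as relevant changes, which is why the abstract reports results separately for the three settings.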

Cooperative inquiry, a form of qualitative research used in community building, has not often been applied in educational contexts. Through the lens of formative leadership theory, the researchers studied the abilities of three new school librarians trained in cooperative inquiry and leadership to engage in collaborative problem solving for technology-related school challenges. Due to internal and external factors, participants experienced various levels of success with their challenges, but cooperative inquiry proved to be a viable methodology for evaluating the outcomes of library education for school librarians' formative leadership.
