首页 | 本学科首页   官方微博 | 高级检索  
相似文献
 共查询到20条相似文献,搜索用时 15 毫秒
1.
2.
A critical challenge for Web search engines concerns how they present relevant results to searchers. The traditional approach is to produce a ranked list of results with title and summary (snippet) information, and these snippets are usually chosen based on the current query. Snippets play a vital sensemaking role, helping searchers to efficiently make sense of a collection of search results, as well as determine the likely relevance of individual results. Recently researchers have begun to explore how snippets might also be adapted based on searcher preferences as a way to better highlight relevant results to the searcher. In this paper we focus on the role of snippets in collaborative web search and describe a technique for summarizing search results that harnesses the collaborative search behaviour of communities of like-minded searchers to produce snippets that are more focused on the preferences of the searchers. We go on to show how this so-called social summarization technique can generate summaries that are significantly better adapted to searcher preferences and describe a novel personalized search interface that combines result recommendation with social summarization.  相似文献   

3.
User-model based personalized summarization   总被引:3,自引:0,他引:3  
The potential of summary personalization is high, because a summary that would be useless to decide the relevance of a document if summarized in a generic manner, may be useful if the right sentences are selected that match the user interest. In this paper we defend the use of a personalized summarization facility to maximize the density of relevance of selections sent by a personalized information system to a given user. The personalization is applied to the digital newspaper domain and it used a user-model that stores long and short term interests using four reference systems: sections, categories, keywords and feedback terms. On the other side, it is crucial to measure how much information is lost during the summarization process, and how this information loss may affect the ability of the user to judge the relevance of a given document. The results obtained in two personalization systems show that personalized summaries perform better than generic and generic-personalized summaries in terms of identifying documents that satisfy user preferences. We also considered a user-centred direct evaluation that showed a high level of user satisfaction with the summaries.  相似文献   

4.
In this paper, a document summarization framework for storytelling is proposed to extract essential sentences from a document by exploiting the mutual effects between terms, sentences and clusters. There are three phrases in the framework: document modeling, sentence clustering and sentence ranking. The story document is modeled by a weighted graph with vertexes that represent sentences of the document. The sentences are clustered into different groups to find the latent topics in the story. To alleviate the influence of unrelated sentences in clustering, an embedding process is employed to optimize the document model. The sentences are then ranked according to the mutual effect between terms, sentence as well as clusters, and high-ranked sentences are selected to comprise the summarization of the document. The experimental results on the Document Understanding Conference (DUC) data sets demonstrate the effectiveness of the proposed method in document summarization. The results also show that the embedding process for sentence clustering render the system more robust with respect to different cluster numbers.  相似文献   

5.
The purpose of extractive speech summarization is to automatically select a number of indicative sentences or paragraphs (or audio segments) from the original spoken document according to a target summarization ratio and then concatenate them to form a concise summary. Much work on extractive summarization has been initiated for developing machine-learning approaches that usually cast important sentence selection as a two-class classification problem and have been applied with some success to a number of speech summarization tasks. However, the imbalanced-data problem sometimes results in a trained speech summarizer with unsatisfactory performance. Furthermore, training the summarizer by improving the associated classification accuracy does not always lead to better summarization evaluation performance. In view of such phenomena, we present in this paper an empirical investigation of the merits of two schools of training criteria to alleviate the negative effects caused by the aforementioned problems, as well as to boost the summarization performance. One is to learn the classification capability of a summarizer on the basis of the pair-wise ordering information of sentences in a training document according to a degree of importance. The other is to train the summarizer by directly maximizing the associated evaluation score or optimizing an objective that is linked to the ultimate evaluation. Experimental results on the broadcast news summarization task suggest that these training criteria can give substantial improvements over a few existing summarization methods.  相似文献   

6.
In recent years, there has been increased interest in topic-focused multi-document summarization. In this task, automatic summaries are produced in response to a specific information request, or topic, stated by the user. The system we have designed to accomplish this task comprises four main components: a generic extractive summarization system, a topic-focusing component, sentence simplification, and lexical expansion of topic words. This paper details each of these components, together with experiments designed to quantify their individual contributions. We include an analysis of our results on two large datasets commonly used to evaluate task-focused summarization, the DUC2005 and DUC2006 datasets, using automatic metrics. Additionally, we include an analysis of our results on the DUC2006 task according to human evaluation metrics. In the human evaluation of system summaries compared to human summaries, i.e., the Pyramid method, our system ranked first out of 22 systems in terms of overall mean Pyramid score; and in the human evaluation of summary responsiveness to the topic, our system ranked third out of 35 systems.  相似文献   

7.
Task-based evaluation of text summarization using Relevance Prediction   总被引:2,自引:0,他引:2  
This article introduces a new task-based evaluation measure called Relevance Prediction that is a more intuitive measure of an individual’s performance on a real-world task than interannotator agreement. Relevance Prediction parallels what a user does in the real world task of browsing a set of documents using standard search tools, i.e., the user judges relevance based on a short summary and then that same user—not an independent user—decides whether to open (and judge) the corresponding document. This measure is shown to be a more reliable measure of task performance than LDC Agreement, a current gold-standard based measure used in the summarization evaluation community. Our goal is to provide a stable framework within which developers of new automatic measures may make stronger statistical statements about the effectiveness of their measures in predicting summary usefulness. We demonstrate—as a proof-of-concept methodology for automatic metric developers—that a current automatic evaluation measure has a better correlation with Relevance Prediction than with LDC Agreement and that the significance level for detected differences is higher for the former than for the latter.  相似文献   

8.
A large number of extractive summarization techniques have been developed in the past decade, but very few enquiries have been made as to how these differ from each other or what are the factors that actually affect these systems. Such meaningful comparison if available can be used to create a robust ensemble of these approaches, which has the possibility to consistently outperform each individual summarization system. In this work we examine the roles of three principle components of an extractive summarization technique: sentence ranking algorithm, sentence similarity metric and text representation scheme. We show that using a combination of several different sentence similarity measures, rather than only one, significantly improves performance of the resultant meta-system. Even simple ensemble techniques, when used in an informed manner, prove to be very effective in improving the overall performance and consistency of summarization systems. A statistically significant improvement of about 5% to 10% in ROUGE-1 recall was achieved by aggregating various sentence similarity measures. As opposed to this aggregation of several ranking algorithms did not show a significant improvement in ROUGE score, but even in this case the resultant meta-systems were more robust than candidate systems. The results suggest that new extractive summarization techniques should particularly focus on defining a better sentence similarity metric and use multiple sentence similarity scores and ranking algorithms in favour of a particular combination.  相似文献   

9.
Computing Semantic Similarity (SS) between concepts is one of the most critical issues in many domains such as Natural Language Processing and Artificial Intelligence. Over the years, several SS measurement methods have been proposed by exploiting different knowledge resources. Wikipedia provides a large domain-independent encyclopedic repository and a semantic network for computing SS between concepts. Traditional feature-based measures rely on linear combinations of different properties with two main limitations, the insufficient information and the loss of semantic information. In this paper, we propose several hybrid SS measurement approaches by using the Information Content (IC) and features of concepts, which avoid the limitations introduced above. Considering integrating discrete properties into one component, we present two models of semantic representation, called CORM and CARM. Then, we compute SS based on these models and take the IC of categories as a supplement of SS measurement. The evaluation, based on several widely used benchmarks and a benchmark developed by ourselves, sustains the intuitions with respect to human judgments. In summary, our approaches are more efficient in determining SS between concepts and have a better human correlation than previous methods such as Word2Vec and NASARI.  相似文献   

10.
维基百科与百度百科比较分析   总被引:2,自引:0,他引:2  
Wiki是一种集体创造知识的协作平台,其中最有名也最成功的是维基百科.而在中文网络中,百度百科成长非常迅速.本文通过分析运营机制、管理方式、激励机制、版权保护、审核机制等,对维基百科和百度百科之间的异同做出了探索性的研究.  相似文献   

11.
The recent prevalence of wiki applications has demonstrated that wikis have high potential in facilitating knowledge creation, sharing, integration, and utilization. As wikis are increasingly adopted in contexts like business, education, research, government, and the public, how to improve user contribution becomes an important concern for researchers and practitioners. In this research, we focus on the quality aspect of user contribution: contribution value. Building upon the critical mass theory and research on editing activities in wikis, this study investigates whether user interests and resources can increase contribution value for different types of users. We develop our research model and empirically test it using survey method and content analysis method in Wikipedia. The results demonstrate that (1) for users who emphasize substantive edits, depth of interests and depth of resources play more influential roles in affecting contribution value; and (2) for users who focus on non-substantive edits, breadth of interests and breadth of resources are more important in increasing contribution value. The findings suggest that contribution value develops in different patterns for two types of users. Implications for both theory and practice are discussed.  相似文献   

12.
We present two approaches to email thread summarization: collective message summarization (CMS) applies a multi-document summarization approach, while individual message summarization (IMS) treats the problem as a sequence of single-document summarization tasks. Both approaches are implemented in our general framework driven by sentence compression. Instead of a purely extractive approach, we employ linguistic and statistical methods to generate multiple compressions, and then select from those candidates to produce a final summary. We demonstrate these ideas on the Enron email collection – a very challenging corpus because of the highly technical language. Experimental results point to two findings: that CMS represents a better approach to email thread summarization, and that current sentence compression techniques do not improve summarization performance in this genre.  相似文献   

13.
Automatic text summarization attempts to provide an effective solution to today’s unprecedented growth of textual data. This paper proposes an innovative graph-based text summarization framework for generic single and multi document summarization. The summarizer benefits from two well-established text semantic representation techniques; Semantic Role Labelling (SRL) and Explicit Semantic Analysis (ESA) as well as the constantly evolving collective human knowledge in Wikipedia. The SRL is used to achieve sentence semantic parsing whose word tokens are represented as a vector of weighted Wikipedia concepts using ESA method. The essence of the developed framework is to construct a unique concept graph representation underpinned by semantic role-based multi-node (under sentence level) vertices for summarization. We have empirically evaluated the summarization system using the standard publicly available dataset from Document Understanding Conference 2002 (DUC 2002). Experimental results indicate that the proposed summarizer outperforms all state-of-the-art related comparators in the single document summarization based on the ROUGE-1 and ROUGE-2 measures, while also ranking second in the ROUGE-1 and ROUGE-SU4 scores for the multi-document summarization. On the other hand, the testing also demonstrates the scalability of the system, i.e., varying the evaluation data size is shown to have little impact on the summarizer performance, particularly for the single document summarization task. In a nutshell, the findings demonstrate the power of the role-based and vectorial semantic representation when combined with the crowd-sourced knowledge base in Wikipedia.  相似文献   

14.
针对维基百科协同编辑的协同认知过程和机制研究相对较少,以及协同认知理论发展也不成熟的问题。本文系统介绍了系统认知过程模型,从认知冲突的角度解释了知识吸收和知识创新的过程。以Wikipedia中香港教育史条目为例,分析了维基百科的协同认知过程,深入认识其协同编辑过程,为研究维基百科的演化过程提供一个新的视角,也对虚拟环境下的协同认知过程提供研究基础。  相似文献   

15.
越来越多的研究者认识到维基百科是知识获取的有效知识源,然而维基百科站点内部现有的搜索引擎没有充分利用其丰富的语义信息,因此,本文对面向中文维基百科的检索模式进行了对比研究。实验表明,本文提出的基于分类体系的语义检索模式在检准率、检全率以及检索速度方面能取得更好的效果,让用户更充分地利用到中文维基百科这个大规模知识库。  相似文献   

16.
Wikipedia is known as a free online encyclopedia. Wikipedia uses largely transparent writing and editing processes, which aim at providing the user with quality information through a democratic collaborative system. However, one aspect of these processes is not transparent—the identity of contributors, editors, and administrators. We argue that this particular lack of transparency jeopardizes the validity of the information being produced by Wikipedia. We analyze the social and ethical consequences of this lack of transparency in Wikipedia for all users, but especially students; we assess the corporate social performance issues involved, and we propose courses of action to compensate for the potential problems. We show that Wikipedia has the appearance, but not the reality, of responsible, transparent information production. This paper’s authors are the same as those who authored Wood, D. J. and Queiroz, A. 2008. Information versus. knowledge: Transparency and social responsibility issues for Wikipedia. In Antonino Vaccaro, Hugo Horta, and Peter Madsen (Eds.), Transparency, Information, and Communication Technology (pp. 261–283). Charlottesville, VA: Philosophy Documentation Center. Adele has changed her surname from Queiroz to Santana  相似文献   

17.
This paper proposes two approaches to address text summarization: modified corpus-based approach (MCBA) and LSA-based T.R.M. approach (LSA + T.R.M.). The first is a trainable summarizer, which takes into account several features, including position, positive keyword, negative keyword, centrality, and the resemblance to the title, to generate summaries. Two new ideas are exploited: (1) sentence positions are ranked to emphasize the significances of different sentence positions, and (2) the score function is trained by the genetic algorithm (GA) to obtain a suitable combination of feature weights. The second uses latent semantic analysis (LSA) to derive the semantic matrix of a document or a corpus and uses semantic sentence representation to construct a semantic text relationship map. We evaluate LSA + T.R.M. both with single documents and at the corpus level to investigate the competence of LSA in text summarization. The two novel approaches were measured at several compression rates on a data corpus composed of 100 political articles. When the compression rate was 30%, an average f-measure of 49% for MCBA, 52% for MCBA + GA, 44% and 40% for LSA + T.R.M. in single-document and corpus level were achieved respectively.  相似文献   

18.
Controversy is a complex concept that has been attracting attention of scholars from diverse fields. In the era of Internet and social media, detecting controversy and controversial concepts by the means of automatic methods is especially important. Web searchers could be alerted when the contents they consume are controversial or when they attempt to acquire information on disputed topics. Presenting users with the indications and explanations of the controversy should offer them chance to see the “wider picture” rather than letting them obtain one-sided views. In this work we first introduce a formal model of controversy as the basis of computational approaches to detecting controversial concepts. Then we propose a classification based method for automatic detection of controversial articles and categories in Wikipedia. Next, we demonstrate how to use the obtained results for the estimation of the controversy level of search queries. The proposed method can be incorporated into search engines as a component responsible for detection of queries related to controversial topics. The method is independent of the search engine’s retrieval and search results recommendation algorithms, and is therefore unaffected by a possible filter bubble.Our approach can be also applied in Wikipedia or other knowledge bases for supporting the detection of controversy and content maintenance. Finally, we believe that our results could be useful for social science researchers for understanding the complex nature of controversy and in fostering their studies.  相似文献   

19.
Most knowledge accumulated through scientific discoveries in genomics and related biomedical disciplines is buried in the vast amount of biomedical literature. Since understanding gene regulations is fundamental to biomedical research, summarizing all the existing knowledge about a gene based on literature is highly desirable to help biologists digest the literature. In this paper, we present a study of methods for automatically generating gene summaries from biomedical literature. Unlike most existing work on automatic text summarization, in which the generated summary is often a list of extracted sentences, we propose to generate a semi-structured summary which consists of sentences covering specific semantic aspects of a gene. Such a semi-structured summary is more appropriate for describing genes and poses special challenges for automatic text summarization. We propose a two-stage approach to generate such a summary for a given gene – first retrieving articles about a gene and then extracting sentences for each specified semantic aspect. We address the issue of gene name variation in the first stage and propose several different methods for sentence extraction in the second stage. We evaluate the proposed methods using a test set with 20 genes. Experiment results show that the proposed methods can generate useful semi-structured gene summaries automatically from biomedical literature, and our proposed methods outperform general purpose summarization methods. Among all the proposed methods for sentence extraction, a probabilistic language modeling approach that models gene context performs the best.  相似文献   

20.
Two uses of anaphora resolution in summarization   总被引:4,自引:0,他引:4  
We propose a new method for using anaphoric information in Latent Semantic Analysis (lsa), and discuss its application to develop an lsa-based summarizer which achieves a significantly better performance than a system not using anaphoric information, and a better performance by the rouge measure than all but one of the single-document summarizers participating in DUC-2002. Anaphoric information is automatically extracted using a new release of our own anaphora resolution system, guitar, which incorporates proper noun resolution. Our summarizer also includes a new approach for automatically identifying the dimensionality reduction of a document on the basis of the desired summarization percentage. Anaphoric information is also used to check the coherence of the summary produced by our summarizer, by a reference checker module which identifies anaphoric resolution errors caused by sentence extraction.  相似文献   

设为首页 | 免责声明 | 关于勤云 | 加入收藏

Copyright©北京勤云科技发展有限公司  京ICP备09084417号