
1.

On macro- and micro-level information in multiple documents and its influence on summarization

Jiaming Zhan Han Tong Loh Ying Liu 《International Journal of Information Management》2009

A well-known challenge for multi-document summarization (MDS) is that a single best or “gold standard” summary does not exist, i.e. it is often difficult to secure a consensus among reference summaries written by different authors. This motivates us to study what the “important information” is in multiple input documents that guides different authors in writing a summary. In this paper, we propose the notions of macro- and micro-level information. Macro-level information refers to the salient topics shared among different input documents, while micro-level information consists of different sentences that elaborate on or provide complementary details for those salient topics. Experimental studies were conducted to examine the influence of macro- and micro-level information on summarization and its evaluation. Results showed that human subjects relied heavily on macro-level information when writing a summary. The length allowed for summaries is the leading factor that affects summary agreement. Meanwhile, our summarization evaluation approach based on the proposed macro- and micro-structure information also suggested that micro-level information offered complementary details for macro-level information. We believe that both levels of information form the “important information” which affects the modeling and evaluation of automatic summarization systems.

2.

A bottom-up approach to sentence ordering for multi-document summarization (Total citations: 1, self-citations: 0, citations by others: 1)

Danushka Bollegala Naoaki Okazaki Mitsuru Ishizuka 《Information processing & management》2010,46(1):89-109

Ordering information is a difficult but important task for applications generating natural language texts such as multi-document summarization, question answering, and concept-to-text generation. In multi-document summarization, information is selected from a set of source documents. However, improper ordering of information in a summary can confuse the reader and deteriorate the readability of the summary. Therefore, it is vital to properly order the information in multi-document summarization. We present a bottom-up approach to arrange sentences extracted for multi-document summarization. To capture the association and order of two textual segments (e.g. sentences), we define four criteria: chronology, topical-closeness, precedence, and succession. These criteria are combined into a single criterion by a supervised learning approach. We repeatedly concatenate two textual segments into one segment based on this criterion, until we obtain the overall segment with all sentences arranged. We evaluate the sentence orderings produced by the proposed method and numerous baselines using subjective gradings as well as automatic evaluation measures. We introduce the average continuity, an automatic evaluation measure of sentence ordering in a summary, and investigate its appropriateness for this task.
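The repeated pairwise concatenation described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and `chronology_score` is a toy stand-in for the learned criterion that the paper builds from chronology, topical-closeness, precedence, and succession.

```python
from itertools import permutations


def order_bottom_up(sentences, score):
    """Merge segments pairwise until a single ordered segment remains.

    `score(a, b)` is a hypothetical function that says how strongly
    segment `a` should directly precede segment `b` (higher is better).
    """
    segments = [[s] for s in sentences]
    while len(segments) > 1:
        # Choose the ordered pair (a before b) with the strongest association.
        i, j = max(
            permutations(range(len(segments)), 2),
            key=lambda ij: score(segments[ij[0]], segments[ij[1]]),
        )
        merged = segments[i] + segments[j]
        segments = [seg for k, seg in enumerate(segments) if k not in (i, j)]
        segments.append(merged)
    return segments[0] if segments else []


def chronology_score(a, b):
    """Toy criterion (an assumption): prefer a small forward time gap
    between the last sentence of `a` and the first sentence of `b`."""
    gap = b[0][0] - a[-1][0]
    return -gap if gap >= 0 else float("-inf")


# Each sentence is a (timestamp, text) pair; the timestamps are made up.
sentences = [
    (3, "Rescue teams arrived."),
    (1, "The storm formed."),
    (2, "It made landfall."),
]
ordered = order_bottom_up(sentences, chronology_score)
print([text for _, text in ordered])
```

With a learned scoring function in place of the toy one, the same loop would keep merging whichever two segments the model judges most strongly associated.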

3.

Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion (Total citations: 1, self-citations: 0, citations by others: 1)

Lucy Vanderwende Hisami Suzuki Chris Brockett Ani Nenkova 《Information processing & management》2007,43(6):1606

In recent years, there has been increased interest in topic-focused multi-document summarization. In this task, automatic summaries are produced in response to a specific information request, or topic, stated by the user. The system we have designed to accomplish this task comprises four main components: a generic extractive summarization system, a topic-focusing component, sentence simplification, and lexical expansion of topic words. This paper details each of these components, together with experiments designed to quantify their individual contributions. We include an analysis of our results on two large datasets commonly used to evaluate task-focused summarization, the DUC2005 and DUC2006 datasets, using automatic metrics. Additionally, we include an analysis of our results on the DUC2006 task according to human evaluation metrics. In the human evaluation of system summaries compared to human summaries, i.e., the Pyramid method, our system ranked first out of 22 systems in terms of overall mean Pyramid score; and in the human evaluation of summary responsiveness to the topic, our system ranked third out of 35 systems.

4.

Cross-document event clustering using knowledge mining from co-reference chains

June-Jei Kuo Hsin-Hsi Chen 《Information processing & management》2007

Unifying terminology usage, which captures more term semantics, is useful for event clustering. This paper proposes a metric of normalized chain edit distance to mine, incrementally, controlled vocabulary from cross-document co-reference chains. Controlled vocabulary is employed to unify terms among different co-reference chains. A novel threshold model that incorporates both a time decay function and a spanning window uses the controlled vocabulary for event clustering on streaming news. Under correct co-reference chains, the proposed system has a 15.97% performance increase compared to the baseline system, and a 5.93% performance increase compared to the system without controlled vocabulary. Furthermore, a Chinese co-reference resolution system with a chain filtering mechanism is used to test the robustness of the proposed event clustering system. The clustering system using noisy co-reference chains still achieves a 10.55% performance increase compared to the baseline system. These results show that our approach is promising.
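The abstract does not define the normalized chain edit distance. As a rough illustration only, the Python sketch below assumes one plausible reading: a Levenshtein distance computed over the mentions of two co-reference chains, divided by the length of the longer chain. The function name and example chains are hypothetical.

```python
def normalized_chain_edit_distance(chain_a, chain_b):
    """Levenshtein distance over chain mentions, scaled to [0, 1].

    Assumption: normalization divides by the longer chain's length;
    the paper may define the metric differently.
    """
    m, n = len(chain_a), len(chain_b)
    if max(m, n) == 0:
        return 0.0
    prev = list(range(n + 1))  # distances from the empty prefix of chain_a
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if chain_a[i - 1] == chain_b[j - 1] else 1
            curr[j] = min(
                prev[j] + 1,         # delete a mention from chain_a
                curr[j - 1] + 1,     # insert a mention from chain_b
                prev[j - 1] + cost,  # match or substitute
            )
        prev = curr
    return prev[n] / max(m, n)


chain_a = ["the president", "he", "the leader"]
chain_b = ["the president", "the leader"]
print(normalized_chain_edit_distance(chain_a, chain_b))
```

Under this reading, chains with a small distance would be candidates for sharing one controlled-vocabulary term.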

5.

QMOS: Query-based multi-documents opinion-oriented summarization

Asad Abdi Siti Mariyam Shamsuddin Ramiz M. Aliguliyev 《Information processing & management》2018,54(2):318-338

Sentiment analysis concerns the study of opinions expressed in a text. This paper presents the QMOS method, which employs a combination of sentiment analysis and summarization approaches. It is a lexicon-based method for query-based multi-document summarization of opinions expressed in reviews. QMOS combines multiple sentiment dictionaries to overcome the limited word coverage of any individual lexicon. A major problem for a dictionary-based approach is the semantic gap between the prior polarity of a word given by a lexicon and the word's polarity in a specific context. This is because the polarity of a word depends on the context in which it is used. Furthermore, the type of a sentence can also affect the performance of a sentiment analysis approach. Therefore, to tackle these challenges, QMOS integrates multiple strategies to adjust a word's prior sentiment orientation while also considering the type of sentence. QMOS also employs the Semantic Sentiment Approach to determine the sentiment score of a word that is not included in a sentiment lexicon. Moreover, most existing methods fail to distinguish the meaning of a review sentence from that of a user's query when both share a similar bag-of-words; hence there is often a conflict between the extracted opinionated sentences and users’ needs. The summarization phase of QMOS, however, is able to avoid extracting a review sentence whose similarity with the user's query is high but whose meaning is different. The method also employs a greedy algorithm and a query expansion approach to reduce redundancy and to bridge the lexical gaps between similar contexts that are expressed using different wording, respectively. Our experiments show that the QMOS method significantly improves performance and is comparable to other existing methods.

6.

Noise reduction through summarization for Web-page classification (Total citations: 1, self-citations: 0, citations by others: 1)

Dou Shen Qiang Yang Zheng Chen 《Information processing & management》2007,43(6):1735

Due to the large variety of noisy information embedded in Web pages, Web-page classification is much more difficult than pure-text classification. In this paper, we propose to improve Web-page classification performance by removing the noise through summarization techniques. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then put forward a new Web-page summarization algorithm based on Web-page layout and evaluate it along with several other state-of-the-art text summarization algorithms on the LookSmart Web directory. Experimental results show that the classification algorithms (NB or SVM) augmented by any summarization approach can achieve an improvement of more than 5.0% compared to pure-text-based classification algorithms. We further introduce an ensemble method to combine the different summarization algorithms. The ensemble summarization method achieves more than 12.0% improvement over pure-text-based methods.

7.

The use of domain-specific concepts in biomedical text summarization (Total citations: 3, self-citations: 0, citations by others: 3)

Lawrence H. Reeve Hyoil Han Ari D. Brooks 《Information processing & management》2007,43(6):1765

Text summarization is a method for data reduction. The use of text summarization enables users to reduce the amount of text that must be read while still assimilating the core information. The data reduction offered by text summarization is particularly useful in the biomedical domain, where physicians must continuously find clinical trial study information to incorporate into their patient treatment efforts. Such efforts are often hampered by the high volume of publications. This paper presents two independent methods (BioChain and FreqDist) for identifying salient sentences in biomedical texts using concepts derived from domain-specific resources. Our semantic-based method (BioChain) is effective at identifying thematic sentences, while our frequency-distribution method (FreqDist) removes information redundancy. The two methods are then combined to form a hybrid method (ChainFreq). An evaluation of each method is performed using the ROUGE system to compare system-generated summaries against a set of manually-generated summaries. The BioChain and FreqDist methods outperform some common summarization systems, while the ChainFreq method improves upon the base approaches. Our work shows that the best performance is achieved when the two methods are combined. The paper also presents a brief physician’s evaluation of three randomly-selected papers from an evaluation corpus to show that the author’s abstract does not always reflect the entire contents of the full text.

8.

Document concept lattice for text understanding and summarization (Total citations: 4, self-citations: 0, citations by others: 4)

Shiren Ye Tat-Seng Chua Min-Yen Kan Long Qiu 《Information processing & management》2007,43(6):1643

We argue that the quality of a summary can be evaluated based on how many concepts in the original document(s) are preserved after summarization. Here, a concept refers to an abstract or concrete entity or its action, often expressed by diverse terms in text. Summary generation can thus be considered as an optimization problem of selecting a set of sentences with minimal answer loss. In this paper, we propose a document concept lattice that indexes the hierarchy of local topics tied to a set of frequent concepts and the corresponding sentences containing these topics. The local topics specify the promising sub-spaces related to the selected concepts and sentences. Based on this lattice, the summary is an optimized selection of a set of distinct and salient local topics that lead to maximal coverage of concepts with the given number of sentences. Our summarizer based on the concept lattice has demonstrated competitive performance in the Document Understanding Conference 2005 and 2006 evaluations as well as follow-on tests.

9.

Older versions of the ROUGEeval summarization evaluation system were easier to fool (Total citations: 3, self-citations: 0, citations by others: 3)

Jonas Sjöbergh 《Information processing & management》2007,43(6):1500

We show some limitations of the ROUGE evaluation method for automatic summarization. We present a method for automatic summarization based on a Markov model of the source text. By a simple greedy word selection strategy, summaries with high ROUGE-scores are generated. These summaries would however not be considered good by human readers. The method can be adapted to trick different settings of the ROUGEeval package.
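To see why an n-gram overlap score can reward text that a human would reject, consider a simplified ROUGE-N recall: clipped n-gram matches divided by the number of n-grams in the reference. The Python sketch below is not the ROUGEeval package and not the paper's Markov-model method; the whitespace tokenization and the example sentences are assumptions made for illustration.

```python
from collections import Counter


def rouge_n_recall(candidate, reference, n=1):
    """Simplified ROUGE-N recall against a single reference summary."""

    def ngrams(text):
        tokens = text.lower().split()  # naive whitespace tokenization
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    cand, ref = ngrams(candidate), ngrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is credited at most as often as the candidate has it.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


reference = "the cat sat on the mat"
fluent = "a cat sat on a mat"          # readable, but swaps two words
scrambled = "mat the on the sat cat"   # unreadable, but reuses every word

print(rouge_n_recall(fluent, reference))
print(rouge_n_recall(scrambled, reference))
print(rouge_n_recall(scrambled, reference, n=2))
```

In this toy case the scrambled candidate gets a higher unigram recall than the fluent one, while its bigram recall is much lower, which hints at why the choice of ROUGE setting matters.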

10.

On enhancing the robustness of timeline summarization test collections

Richard McCreadie Shahzad Rajput Ian Soboroff Craig Macdonald Iadh Ounis 《Information processing & management》2019,56(5):1815-1836

Timeline generation systems are a class of algorithms that produce a sequence of time-ordered sentences or text snippets extracted in real-time from high-volume streams of digital documents (e.g. news articles), focusing on retaining relevant and informative content for a particular information need (e.g. topic or event). These systems have a range of uses, such as producing concise overviews of events for end-users (human or artificial agents). To advance the field of automatic timeline generation, robust and reproducible evaluation methodologies are needed. To this end, several evaluation metrics and labeling methodologies have recently been developed, focusing on information nugget or cluster-based ground truth representations, respectively. These methodologies rely on human assessors manually mapping timeline items (e.g. sentences) to an explicit representation of what information a ‘good’ summary should contain. However, while these evaluation methodologies produce reusable ground truth labels, prior works have reported cases where such evaluations fail to accurately estimate the performance of new timeline generation systems due to label incompleteness. In this paper, we first quantify the extent to which the timeline summarization test collections fail to generalize to new summarization systems, then we propose, evaluate and analyze new automatic solutions to this issue. In particular, using a depooling methodology over 19 systems and across three high-volume datasets, we quantify the degree of system ranking error caused by excluding those systems when labeling. We show that when considering lower-effectiveness systems, the test collections are robust (the likelihood of systems being mis-ranked is low). However, we show that the risk of systems being mis-ranked increases as the effectiveness of systems held out from the pool increases. To reduce the risk of mis-ranking systems, we also propose a range of different automatic ground truth label expansion techniques. Our results show that the proposed expansion techniques can be effective at increasing the robustness of the TREC-TS test collections, as they are able to generate large numbers of missing matches with high accuracy, markedly reducing the number of mis-rankings by up to 50%.

11.

Automatic generic document summarization based on non-negative matrix factorization

Ju-Hong Lee Sun Park Chan-Min Ahn Daeho Kim 《Information processing & management》2009

In existing unsupervised methods, Latent Semantic Analysis (LSA) is used for sentence selection. However, the obtained results are less meaningful, because singular vectors are used as the bases for sentence selection from given documents, and singular vector components can have negative values. We propose a new unsupervised method using Non-negative Matrix Factorization (NMF) to select sentences for automatic generic document summarization. The proposed method uses non-negative constraints, which are more similar to the human cognition process. As a result, the method selects more meaningful sentences for generic document summarization than those selected using LSA.

12.

Query-oriented text summarization based on hypergraph transversals

H. Van Lierde Tommy W.S. Chow 《Information processing & management》2019,56(4):1317-1338

The rise in the amount of textual resources available on the Internet has created the need for tools for automatic document summarization. The main challenges of query-oriented extractive summarization are (1) to identify the topics of the documents and (2) to recover query-relevant sentences of the documents that together cover these topics. Existing graph- or hypergraph-based summarizers use graph-based ranking algorithms to produce individual scores of relevance for the sentences. Hence, these systems fail to measure the topics jointly covered by the sentences forming the summary, which tends to produce redundant summaries. To address the issue of selecting non-redundant sentences jointly covering the main query-relevant topics of a corpus, we propose a new method using the powerful theory of hypergraph transversals. First, we introduce a new topic model based on the semantic clustering of terms in order to discover the topics present in a corpus. Second, these topics are modeled as the hyperedges of a hypergraph in which the nodes are the sentences. A summary is then produced by generating a transversal of nodes in the hypergraph. Algorithms based on the theory of submodular functions are proposed to generate the transversals and to build the summaries. The proposed summarizer outperforms existing graph- or hypergraph-based summarizers by at least 6% of ROUGE-SU4 F-measure on the DUC 2007 dataset. It is moreover cheaper than existing hypergraph-based summarizers in terms of computational time complexity.
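A transversal of a hypergraph is a set of nodes that intersects every hyperedge; here that means a set of sentences touching every topic. The Python sketch below uses a plain greedy hitting-set heuristic to show the idea. It is an assumption-laden illustration, not the paper's algorithms: it ignores query relevance, sentence weights, and length budgets, and the topic and sentence names are invented.

```python
def greedy_transversal(hyperedges):
    """Greedily pick nodes until every non-empty hyperedge is hit.

    Each hyperedge is a set of node ids (sentences belonging to a topic).
    """
    remaining = [set(edge) for edge in hyperedges if edge]
    chosen = []
    while remaining:
        # Sort candidates so that ties are broken deterministically.
        nodes = sorted({node for edge in remaining for node in edge})
        # Take the node that hits the most not-yet-covered hyperedges.
        best = max(nodes, key=lambda node: sum(node in edge for edge in remaining))
        chosen.append(best)
        remaining = [edge for edge in remaining if best not in edge]
    return chosen


# Hypothetical topics (hyperedges) over sentence ids (nodes).
topics = {
    "flooding": {"s1", "s2"},
    "evacuation": {"s2", "s3"},
    "damage": {"s4"},
}
summary = greedy_transversal(topics.values())
print(summary)
```

Because each chosen sentence must hit at least one uncovered topic, the selection avoids sentences that only repeat topics already covered, which is the redundancy problem the abstract attributes to per-sentence ranking.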

13.

Resolving ambiguity in biomedical text to improve summarization

Laura Plaza Mark Stevenson Alberto Díaz 《Information processing & management》2012

Access to the vast body of research literature that is now available on biomedicine and related fields can be improved with automatic summarization. This paper describes a summarization system for the biomedical domain that represents documents as graphs formed from concepts and relations in the UMLS Metathesaurus. This system has to deal with the ambiguities that occur in biomedical documents. We describe a variety of strategies that make use of MetaMap and Word Sense Disambiguation (WSD) to accurately map biomedical documents onto UMLS Metathesaurus concepts. Evaluation is carried out using a collection of 150 biomedical scientific articles from the BioMed Central corpus. We find that using WSD improves the quality of the summaries generated.

14.

Preferences in Wikipedia abstracts: Empirical findings and implications for automatic entity summarization

Danyun Xu Gong Cheng Yuzhong Qu 《Information processing & management》2014


15.

High-performance FAQ retrieval using an automatic clustering method of query logs

Harksoo Kim Jungyun Seo 《Information processing & management》2006

To resolve some of the lexical disagreement problems between queries and FAQs, we propose a reliable FAQ retrieval system using query log clustering. At indexing time, the proposed system clusters the logs of users’ queries into predefined FAQ categories. To increase the precision and the recall rate of clustering, the proposed system adopts a new similarity measure using a machine-readable dictionary. At search time, the proposed system calculates the similarities between users’ queries and each cluster in order to smooth FAQs. By virtue of the cluster-based retrieval technique, the proposed system can partially bridge lexical chasms between queries and FAQs. In addition, the proposed system outperforms traditional information retrieval systems in FAQ retrieval.

16.

Task-based evaluation of text summarization using Relevance Prediction (Total citations: 2, self-citations: 0, citations by others: 2)

Stacy President Hobson Bonnie J. Dorr Christof Monz Richard Schwartz 《Information processing & management》2007,43(6):1482

This article introduces a new task-based evaluation measure called Relevance Prediction that is a more intuitive measure of an individual’s performance on a real-world task than interannotator agreement. Relevance Prediction parallels what a user does in the real-world task of browsing a set of documents using standard search tools, i.e., the user judges relevance based on a short summary and then that same user (not an independent user) decides whether to open (and judge) the corresponding document. This measure is shown to be a more reliable measure of task performance than LDC Agreement, a current gold-standard based measure used in the summarization evaluation community. Our goal is to provide a stable framework within which developers of new automatic measures may make stronger statistical statements about the effectiveness of their measures in predicting summary usefulness. We demonstrate, as a proof-of-concept methodology for automatic metric developers, that a current automatic evaluation measure has a better correlation with Relevance Prediction than with LDC Agreement and that the significance level for detected differences is higher for the former than for the latter.

17.

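Overcoming information asymmetry: designing effective venture capital contracts (Total citations: 1, self-citations: 0, citations by others: 1)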

李Hui 《软科学》2001,15(2):38-41

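Venture capital is a form of investment that takes equity stakes in enterprises, particularly small and medium-sized high-technology firms, provides them with a range of services, and ultimately recoups its funds by selling the equity. Venture capital is equity capital rather than debt capital, and it is a product of both financial innovation and technological innovation. This paper analyzes how, under the risk of information asymmetry, venture capital can adopt effective and secure organizational forms to avoid that risk.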

18.

An inquiry into the current development of rural spiritual civilization

王洋洋 《科教文汇》2015,(17)

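The development of rural spiritual civilization is an important part of building socialist spiritual civilization. Since the reform and opening-up, China's work on spiritual civilization has advanced rapidly and accumulated a great deal of experience. However, the development of rural spiritual civilization has difficulties of its own that remain to be examined, and as society and the economy develop it continues to face new problems that must be solved. By examining the current development of rural spiritual civilization, this paper summarizes the experience gained since the reform and opening-up, identifies the problems and challenges that rural spiritual civilization faces, and offers constructive references for resolving them.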

19.

Research on topic-based knowledge conversion in online forums

王力 耿爱静 《情报科学》2005,23(10):1505-1508

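This paper applies existing techniques for automatic Chinese indexing and automatic document summarization to compile the content of topic discussion boards automatically, by means of web technologies, into the knowledge form of a Frequently Asked Questions (FAQ) collection, helping moderators share the knowledge in those boards efficiently with all members. Drawing on a review of the automatic summarization literature, the paper proposes a conceptual model for FAQ knowledge conversion. A hybrid automatic indexing method serves as the tool for Chinese keyword extraction and is combined with similarity computation to organize posts into the form of FAQ summaries.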

20.

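The nature of information rights and its influence on information legislation (Total citations: 10, self-citations: 0, citations by others: 10)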

杨宏玲 黄瑞华 《科学学研究》2005,23(1):35-39

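Information rights are a basic concept in information legislation. This paper clarifies the intension and extension of information rights from the standpoint of legal theory. It argues that information rights refer broadly to all rights whose object is information, and that they form a composite right that includes some proprietary rights as well as some non-proprietary rights, so the nature of information rights cannot be determined in general terms. Because the nature of information rights is complex, a unified information law is not only unnecessary but also unrealistic.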