Similar documents
20 similar documents retrieved (search time: 31 ms)
1.
This paper describes a technique for automatic book indexing. The technique requires a dictionary of terms that are to appear in the index, along with all text strings that count as instances of the term. It also requires that the text be in a form suitable for processing by a text formatter. A program searches the text for each occurrence of a term or its associated strings and creates an entry to the index when either is found. The results of the experimental application to a portion of a book text are presented, including measures of precision and recall, with precision giving the ratio of terms correctly assigned in the automatic process to the total assigned, and recall giving the ratio of correct terms automatically assigned to the total number of term assignments according to a human standard. Results indicate that the technique can be applied successfully, especially for texts that employ a technical vocabulary and where there is a premium on indexing exhaustivity.  相似文献   
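A minimal sketch of the kind of dictionary-driven scan the abstract describes, assuming the index dictionary maps each index term to its variant strings and the text is available as plain paragraphs; the terms, variants, and precision/recall bookkeeping below are illustrative, not the paper's data:

```python
import re

# Hypothetical index dictionary: index term -> strings that count as instances of it.
INDEX_DICT = {
    "information retrieval": ["information retrieval", "IR system", "document retrieval"],
    "inverted file": ["inverted file", "inverted index"],
}

def build_index(paragraphs):
    """Scan numbered paragraphs and create an index entry whenever a term
    or one of its associated strings occurs."""
    index = {}
    for para_no, text in enumerate(paragraphs, start=1):
        lowered = text.lower()
        for term, variants in INDEX_DICT.items():
            if any(re.search(r"\b" + re.escape(v.lower()) + r"\b", lowered) for v in variants):
                index.setdefault(term, set()).add(para_no)
    return index

def precision_recall(automatic, human):
    """Precision: correct automatic assignments / all automatic assignments.
    Recall: correct automatic assignments / all assignments in the human standard."""
    auto = {(t, p) for t, ps in automatic.items() for p in ps}
    gold = {(t, p) for t, ps in human.items() for p in ps}
    correct = len(auto & gold)
    return (correct / len(auto) if auto else 0.0,
            correct / len(gold) if gold else 0.0)

if __name__ == "__main__":
    paras = ["An inverted index speeds up document retrieval.",
             "This section discusses text formatting only."]
    print(build_index(paras))  # {'information retrieval': {1}, 'inverted file': {1}}
```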

2.
A method for updating the dictionary in a dynamic information retrieval system is presented. It is shown that as a collection changes through addition and deletion of documents, the appropriate set of index terms may be determined without complete periodic regeneration of the dictionary. Results are presented for experiments involving a complete change in collection membership, with the dynamic dictionary updating methods shown to be effective.  相似文献   
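The abstract does not state the criterion by which index terms are admitted to or dropped from the dictionary, so the sketch below assumes a simple document-frequency threshold; the point is only that the dictionary can be maintained incrementally as documents are added and deleted, without periodic regeneration:

```python
from collections import Counter

class DynamicDictionary:
    """Incrementally maintained index-term dictionary.
    Assumption (not from the paper): a term belongs in the dictionary
    once its document frequency reaches MIN_DF."""
    MIN_DF = 2

    def __init__(self):
        self.df = Counter()

    def add_document(self, terms):
        for t in set(terms):
            self.df[t] += 1

    def delete_document(self, terms):
        for t in set(terms):
            self.df[t] -= 1
            if self.df[t] <= 0:
                del self.df[t]

    def index_terms(self):
        return {t for t, n in self.df.items() if n >= self.MIN_DF}

d = DynamicDictionary()
d.add_document(["retrieval", "dictionary", "update"])
d.add_document(["retrieval", "index"])
d.delete_document(["retrieval", "index"])
print(d.index_terms())  # set(): 'retrieval' fell back below the threshold
```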

3.
In a typical inverted-file full-text document retrieval system, the user submits queries consisting of strings of characters combined by various operators. The strings are looked up in a text dictionary which lists, for each string, all the places in the database at which it occurs. It is desirable to allow the user to include in his query truncated terms such as X*, *X, *X*, or X*Y, where X and Y are specified strings and * is a variable-length don't-care character, that is, * represents an arbitrary, possibly empty, string. Processing these terms involves finding the set of all words in the dictionary that match these patterns. How to do this efficiently is a long-standing open problem in this domain. In this paper we present a uniform and efficient approach for processing all such query terms. The approach, based on a “permuted dictionary” and a corresponding set of access routines, requires essentially one disk access to obtain from the dictionary all the strings represented by a truncated term, with negligible computing time. It is thus well suited for on-line applications. Implementation is simple, and storage overhead is low: it can be made almost negligible by using some specially adapted compression techniques described in the paper. The basic approach is easily adaptable to slight variants, such as fixed (or bounded) length don't-care characters, or more complex pattern-matching templates.
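A small in-memory sketch of the permuted-dictionary idea (often called a permuterm index), assuming '*' as the variable-length don't-care character; the disk-resident dictionary, access routines, and compression techniques of the paper are not modelled, and patterns with more than one wildcard would need an extra filtering pass:

```python
import bisect

def build_permuted(words):
    """Store every rotation of word + '$'; after sorting, a wildcard query
    becomes a single prefix range lookup (one disk access in the paper's setting)."""
    rotations = []
    for w in words:
        t = w + "$"
        rotations.extend((t[i:] + t[:i], w) for i in range(len(t)))
    rotations.sort()
    return rotations

def wildcard_lookup(permuted, pattern):
    """Handle X*, *X and X*Y by rotating the pattern so the '*' comes last."""
    if pattern.count("*") != 1:
        raise ValueError("this sketch handles exactly one '*'")
    x, y = pattern.split("*")
    prefix = y + "$" + x            # X*Y -> Y$X ; X* -> $X ; *X -> X$
    keys = [r for r, _ in permuted]
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_right(keys, prefix + "\uffff")
    return sorted({w for _, w in permuted[lo:hi]})

perm = build_permuted(["index", "indexing", "reindex", "retrieval"])
print(wildcard_lookup(perm, "index*"))  # ['index', 'indexing']
print(wildcard_lookup(perm, "*index"))  # ['index', 'reindex']
print(wildcard_lookup(perm, "in*ing"))  # ['indexing']
```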

4.
A prefix trie index (originally called trie hashing) is applied to the problem of providing fast search times, fast load times and fast update properties in a bibliographic or full text retrieval system. For all but the largest dictionaries a single key search in the dictionary under trie hashing takes exactly one disk read. Front compression of search keys is used to enhance performance. Partial combining of the postings into the dictionary is analyzed as a method to give both faster retrieval and improved update properties for the trie hashing inverted file. Statistics are given for a test database consisting of an online catalog at the Graduate School of Library and Information Science Library of the University of Western Ontario. The effect of changing various parameters of prefix tries are tested in this application.  相似文献   
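An in-memory sketch of a prefix trie dictionary of the kind described, assuming each key node stores its postings; the disk paging of trie hashing, front compression of keys, and partial combining of postings analyzed in the paper are not modelled:

```python
class TrieNode:
    __slots__ = ("children", "postings")
    def __init__(self):
        self.children = {}
        self.postings = None   # document ids, set once the key is inserted

class PrefixTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, key, postings):
        node = self.root
        for ch in key:
            node = node.children.setdefault(ch, TrieNode())
        node.postings = list(postings)

    def search(self, key):
        """Single-key search; in trie hashing this path corresponds to one disk read."""
        node = self.root
        for ch in key:
            node = node.children.get(ch)
            if node is None:
                return None
        return node.postings

trie = PrefixTrie()
trie.insert("catalog", [3, 8, 21])
trie.insert("catalogue", [5])
print(trie.search("catalog"))   # [3, 8, 21]
print(trie.search("catalogs"))  # None
```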

5.
Topological analysis of the signal flow graph associated with the hybrid system of equations for a linear active or passive electrical network for which the element admittance matrix exists and is diagonal is considered. First, the term cancellation which occurs in Mason's topological formulas is investigated. Necessary and sufficient conditions on the signal flow graph topology such that a term in the expansion of the graph determinant and cofactors either cancels out with another term in the expansion or does not cancel are established. Properties of the associated network which result in non-cancelling terms are given and the number of non-cancelling terms is determined. Second, new signal flow graph topological formulas for the graph determinant and cofactors are proven. These formulas are such that no term cancellation occurs and are readily adaptable to computer implementation. In addition, the number of terms in these formulas is independent of the network tree used to formulate the signal flow graph. Examples are given to illustrate the new formulas.  相似文献   

6.
Julia curves are quantized on square grids in several ways, and the quantized Julia curves are used for fractal image compression coding, overcoming the drawback that fractal image compression must encode against a changing coding dictionary. In addition, a small dictionary of commonly used entries is built to speed up fractal image compression. Experimental results show that the Julia curves tile the images to be encoded well and retain the decoding advantages of fractal images.

7.
黄立赫  石映昕 《情报杂志》2022,41(2):146-154
[Research purpose] Starting from the perspective of video danmaku (bullet comments), this study mines the topic-drift patterns of online public opinion events and improves the precision of video sentiment retrieval for such events. [Research method] Topic and sentiment analysis of video danmaku is used to improve the accuracy of online monitoring of public opinion events. On this basis, a danmaku migration index is proposed and a sentiment monitoring method built on it is established. The method first extracts topic information from video danmaku with the BTM topic model, computes topic sentiment categories and sentiment intensity in different time windows using a sentiment lexicon and an emoticon lexicon, and builds a danmaku-oriented monitoring model for online public opinion events. A topic migration index is then constructed from changes in topic content and in the video's interest heat, and a sentiment migration index is constructed from changes in topic sentiment intensity. Finally, the topic migration index and the sentiment migration index are weighted and combined into the danmaku migration index, enabling online monitoring of public opinion events. [Research conclusion] Real data from a video danmaku community verifies the soundness of the model. The results show that the method can identify the key time windows of public opinion event migration fairly accurately, offering a practical theoretical exploration toward sentiment visualization on video-sharing platforms.
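A minimal sketch of how the weighted migration index could be combined from window-level statistics. The BTM topic modelling, lexicon-based sentiment scoring, and the paper's actual distance measures and weights are not reproduced; the Jensen-Shannon divergence, the heat normalisation, and the alpha/beta weights below are assumptions for illustration only:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two topic distributions
    (an assumed measure of topic-content change)."""
    m = [(a + b) / 2 for a, b in zip(p, q)]
    def kl(a, b):
        return sum(x * math.log(x / y) for x, y in zip(a, b) if x > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def migration_index(win_prev, win_curr, alpha=0.5, beta=0.5):
    """Weighted danmaku migration index for two adjacent time windows.
    Each window is a dict with 'topics' (topic distribution, e.g. from BTM),
    'heat' (interest level, e.g. danmaku volume) and 'sentiment' (signed intensity).
    alpha/beta are illustrative weights, not the paper's fitted values."""
    topic_shift = js_divergence(win_prev["topics"], win_curr["topics"])
    heat_shift = abs(win_curr["heat"] - win_prev["heat"]) / max(win_prev["heat"], 1)
    topic_migration = 0.5 * topic_shift + 0.5 * heat_shift
    sentiment_migration = abs(win_curr["sentiment"] - win_prev["sentiment"])
    return alpha * topic_migration + beta * sentiment_migration

w1 = {"topics": [0.7, 0.2, 0.1], "heat": 120, "sentiment": 0.3}
w2 = {"topics": [0.2, 0.6, 0.2], "heat": 300, "sentiment": -0.4}
print(round(migration_index(w1, w2), 3))
```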

8.
In a preceding experiment in text-searching retrieval for cancer questions, search words were humanly selected with the aid of a medical dictionary and cancer textbooks. Recall results were (1) using only stems of question words (humanly stemmed): 20%; (2) adding dictionary search words: 29%; (3) adding also textbook search words: 70%. For the experiment reported here, computer procedures for using the medical dictionary to select search words were developed. Recall results were (1) for question stems (computer stemmed): 19%; (2) adding search words computer selected from the dictionary: 24 %. Thus the computer procedures compared to human use of the dictionary were 50% successful. Human and computer false retrieval rates were almost equal. Some hypotheses about computer selection of search words from textbooks are also described.  相似文献   

9.
Vital to the task of Sentiment Analysis (SA), or automatically mining sentiment expression from text, is a sentiment lexicon. This fundamental lexical resource comprises the smallest sentiment-carrying units of text, words, annotated for their sentiment properties, and aids in SA tasks on larger pieces of text. Unfortunately, digital dictionaries do not readily include information on the sentiment properties of their entries, and manually compiling sentiment lexicons is tedious in terms of annotator time and effort. This has resulted in the emergence of a large number of research works concentrated on automated sentiment lexicon generation. The dictionary-based approach involves leveraging digital dictionaries, while the corpus-based approach involves exploiting co-occurrence statistics embedded in text corpora. Although the former approach has been exhaustively investigated, the majority of works focus on terms. The few state-of-the-art models concentrated on the finer-grained term sense level remain to exhibit several prominent limitations, e.g., the proposed semantic relations algorithm retrieves only senses that are at a close proximity to the seed senses in the semantic network, thus prohibiting the retrieval of remote sentiment-carrying senses beyond the reach of the ‘radius’ defined by number of iterations of semantic relations expansion. The proposed model aims to overcome the issues inherent in dictionary-based sense-level sentiment lexicon generation models using: (1) null seed sets, and a morphological approach inspired by the Marking Theory in Linguistics to populate them automatically; (2) a dual-step context-aware gloss expansion algorithm that ‘mines’ human defined gloss information from a digital dictionary, ensuring senses overlooked by the semantic relations expansion algorithm are identified; and (3) a fully-unsupervised sentiment categorization algorithm on the basis of the Network Theory. The results demonstrate that context-aware in-gloss matching successfully retrieves senses beyond the reach of the semantic relations expansion algorithm used by prominent, well-known models. Evaluation of the proposed model to accurately assign senses with polarity demonstrates that it is on par with state-of-the-art models against the same gold standard benchmarks. The model has theoretical implications in future work to effectively exploit the readily-available human-defined gloss information in a digital dictionary, in the task of assigning polarity to term senses. Extrinsic evaluation in a real-world sentiment classification task on multiple publically-available varying-domain datasets demonstrates its practical implication and application in sentiment analysis, as well as in other related fields such as information science, opinion retrieval and computational linguistics.  相似文献   
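A toy sketch of the gloss-expansion step only: senses whose glosses mention a seed word are retrieved even when no semantic-relation path connects them to the seeds. The sense ids and glosses below are illustrative placeholders, and the morphological seeding and network-based polarity categorization of the model are not shown:

```python
# Toy dictionary: sense id -> gloss text. In the real model these are
# sense-level entries of a digital dictionary.
GLOSSES = {
    "happy.a.01": "enjoying or showing or marked by joy or pleasure",
    "joy.n.01": "the emotion of great happiness",
    "dismal.a.01": "causing dejection and gloom",
    "table.n.01": "a piece of furniture having a smooth flat top",
}

def gloss_expand(seed_senses, glosses, seed_words):
    """Retrieve senses whose glosses contain a seed word: the 'remote'
    sentiment-carrying senses that relation-based expansion can miss."""
    found = set(seed_senses)
    for sense, gloss in glosses.items():
        if set(gloss.lower().split()) & seed_words:
            found.add(sense)
    return found

seeds = {"happy.a.01"}
seed_words = {"joy", "pleasure", "happiness", "gloom", "dejection"}
print(sorted(gloss_expand(seeds, GLOSSES, seed_words)))
# ['dismal.a.01', 'happy.a.01', 'joy.n.01']
```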

10.
Traditional index weighting approaches for information retrieval from texts depend on the term frequency based analysis of the text contents. A shortcoming of these indexing schemes, which consider only the occurrences of the terms in a document, is that they have some limitations in extracting semantically exact indexes that represent the semantic content of a document. To address this issue, we developed a new indexing formalism that considers not only the terms in a document, but also the concepts. In this approach, concept clusters are defined and a concept vector space model is proposed to represent the semantic importance degrees of lexical items and concepts within a document. Through an experiment on the TREC collection of Wall Street Journal documents, we show that the proposed method outperforms an indexing method based on term frequency (TF), especially in regard to the few highest-ranked documents. Moreover, the index term dimension was 80% lower for the proposed method than for the TF-based method, which is expected to significantly reduce the document search time in a real environment.  相似文献   
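A minimal sketch of indexing by concepts rather than raw term frequency, assuming predefined concept clusters; the cluster definitions and the length-normalised weighting below are illustrative stand-ins for the paper's semantic importance degrees:

```python
from collections import Counter

# Hypothetical concept clusters: concept -> lexical items that evoke it.
CONCEPTS = {
    "finance": {"stock", "bond", "dividend", "share"},
    "regulation": {"sec", "ruling", "compliance"},
}

def concept_vector(tokens):
    """Index a document by concept weights instead of raw term frequencies:
    each concept's weight is the summed frequency of its lexical items,
    normalised by document length."""
    tf = Counter(tokens)
    total = sum(tf.values()) or 1
    return {c: sum(tf[w] for w in words) / total for c, words in CONCEPTS.items()}

doc = "stock prices rose after the sec ruling lifted the dividend cap".split()
print(concept_vector(doc))  # both concepts get weight 2/11 ≈ 0.18
```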

11.
The number and power of loads that pollute the electric network, from an electric point of view, are constantly increasing. For this reason this study aims to define a parameter to assess, in terms of distortion and unbalance, the quality of an electric network. Taking as a starting point the advantages and limitations of wavelet methods proposed in the literature, a new index useful to quantify the load characteristics is introduced. The efficiency of this index has been verified by computer simulations to assess the applicability of the method to three-phase systems using three-wire measurements based on the instantaneous power theory, in several typical situations.  相似文献   

12.
郑阳  莫建文 《大众科技》2012,14(4):20-23
In scientific and technical literature, out-of-vocabulary words and other domain-specific terms vary widely and are hard to recognize during Chinese word segmentation, which lowers segmentation accuracy for domain texts. To address this, a Chinese word segmentation method based on domain-term extraction is presented. Using a large domain-specific corpus, out-of-vocabulary words and other technical terms are extracted with mutual information and statistical methods to build a domain-term dictionary; this dictionary is combined with a general-word dictionary, and segmentation is performed with the maximum matching method. Experiments show that the method extracts relevant technical terms fairly accurately, thereby improving segmentation precision, and has practical application value.
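A compact sketch of the two pieces the abstract combines: scoring adjacent character pairs by pointwise mutual information to harvest candidate domain terms, then forward maximum matching over a merged general + domain dictionary. The toy corpus, dictionaries, and PMI threshold are illustrative:

```python
import math
from collections import Counter

def mine_domain_terms(sentences, threshold=1.5):
    """Keep adjacent character pairs whose pointwise mutual information
    exceeds a threshold (value is illustrative for this toy corpus)."""
    chars, pairs = Counter(), Counter()
    for s in sentences:
        chars.update(s)
        pairs.update(s[i:i + 2] for i in range(len(s) - 1))
    n, total_pairs = sum(chars.values()), sum(pairs.values())
    terms = set()
    for pair, c in pairs.items():
        p_xy = c / total_pairs
        p_x, p_y = chars[pair[0]] / n, chars[pair[1]] / n
        if math.log(p_xy / (p_x * p_y)) >= threshold:
            terms.add(pair)
    return terms

def forward_max_match(sentence, dictionary, max_len=4):
    """Forward maximum matching over the merged general + domain dictionary."""
    tokens, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if length == 1 or piece in dictionary:
                tokens.append(piece)
                i += length
                break
    return tokens

corpus = ["分词方法研究", "中文分词精度", "分词精度评测"]
merged_dict = {"中文", "方法", "研究", "精度", "评测"} | mine_domain_terms(corpus)
print(forward_max_match("中文分词精度评测", merged_dict))  # ['中文', '分词', '精度', '评测']
```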

13.
14.
赵铁军  高文 《情报科学》1993,14(6):52-57
Machine-readable electronic dictionaries are the foundation of all natural language processing systems, and of machine translation systems in particular. Machine translation research and practice show that without a high-quality dictionary there can be no high-quality translation. Every machine translation system spends substantial manpower and investment on its dictionary; this paper therefore proposes and discusses several ideas for building such a general-purpose electronic dictionary to support machine translation.

15.
Aspect-based sentiment analysis is a practical technology for securities trading, commodity sales, movie-rating websites, and similar applications. Most recent studies adopt recurrent or attention-based neural network methods to infer aspect sentiment from opinion context terms and sentence dependency trees. However, because a sentence often carries sentiment toward multiple aspects, these models struggle to achieve satisfactory classification results. In this paper, we address these problems by encoding the sentence syntax tree, word relations, and opinion dictionary information in a unified framework, a method we call heterogeneous graph neural networks (Hete_GNNs). First, we use interacting aspect words and contexts to encode the sentence sequence information with parameter sharing. Then, a novel heterogeneous graph neural network encodes each sentence's syntax dependency tree, a prior sentiment dictionary, and part-of-speech tagging information for sentiment prediction. We evaluate Hete_GNNs on five domain datasets; the results confirm that heterogeneous context information is captured better with heterogeneous graph neural networks, and comparison on the aspect sentiment classification task demonstrates the improvement of the proposed method.
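The abstract describes the inputs that Hete_GNNs fuses rather than the network itself; the sketch below only builds a toy heterogeneous graph with typed nodes (tokens, POS tags, lexicon polarities, the aspect) and typed edges (sequence, dependency, tagging, polarity), as one possible way such information could be packaged for a GNN. The node/edge types, toy dependency arcs, and tiny opinion lexicon are assumptions, and no message passing is shown:

```python
from collections import defaultdict

OPINION_LEXICON = {"great": "positive", "slow": "negative"}  # toy prior dictionary

def build_hetero_graph(tokens, pos_tags, dep_arcs, aspect_index):
    """Return {edge_type: [(src_node, dst_node), ...]} over typed node ids."""
    g = defaultdict(list)
    tok = [f"tok:{i}:{w}" for i, w in enumerate(tokens)]
    for i in range(len(tokens) - 1):            # sequential context
        g["next"].append((tok[i], tok[i + 1]))
    for head, dep in dep_arcs:                  # syntax dependency tree
        g["dep"].append((tok[head], tok[dep]))
    for i, tag in enumerate(pos_tags):          # part-of-speech information
        g["has_pos"].append((tok[i], f"pos:{tag}"))
    for i, w in enumerate(tokens):              # prior sentiment dictionary
        if w in OPINION_LEXICON:
            g["has_polarity"].append((tok[i], f"pol:{OPINION_LEXICON[w]}"))
    g["aspect_of"].append((f"aspect:{tokens[aspect_index]}", tok[aspect_index]))
    return dict(g)

tokens = ["the", "battery", "life", "is", "great"]
pos = ["DET", "NOUN", "NOUN", "AUX", "ADJ"]
deps = [(2, 1), (2, 0), (4, 2), (4, 3)]   # toy arcs: head index -> dependent index
for etype, edges in build_hetero_graph(tokens, pos, deps, aspect_index=1).items():
    print(etype, edges)
```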

16.
For historical and cultural reasons, English phrases, especially proper nouns and new words, frequently appear in Web pages written primarily in East Asian languages such as Chinese, Korean, and Japanese. Although such English terms and their equivalents in these East Asian languages refer to the same concept, they are often erroneously treated as independent index units in traditional Information Retrieval (IR). This paper describes the degree to which the problem arises in IR and proposes a novel technique to solve it. Our method first extracts English terms from native Web documents in an East Asian language, and then unifies the extracted terms and their equivalents in the native language as one index unit. For Cross-Language Information Retrieval (CLIR), one of the major hindrances to achieving retrieval performance at the level of Mono-Lingual Information Retrieval (MLIR) is the translation of terms in search queries which cannot be found in a bilingual dictionary. The Web mining approach proposed in this paper for concept unification of terms in different languages can also be applied to solve this well-known challenge in CLIR. Experimental results based on NTCIR and KT-Set test collections show that the high translation precision of our approach greatly improves performance of both Mono-Lingual and Cross-Language Information Retrieval.
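A minimal sketch of the indexing-side unification step: Latin-script terms are extracted from a mixed-script document and mapped, together with their native-language equivalents, onto a single index unit. The equivalence table below is illustrative; in the paper such pairs are mined from the Web, and the native text would be properly word-segmented rather than split into script runs:

```python
import re

# Illustrative equivalence table: surface form -> unified concept id.
EQUIVALENTS = {"바이러스": "virus", "病毒": "virus", "virus": "virus",
               "서울": "Seoul", "Seoul": "Seoul"}

def index_units(text):
    """Extract Latin terms and Hangul/CJK runs from a mixed-script document
    and map each token onto its unified concept id when one is known."""
    tokens = re.findall(r"[A-Za-z][A-Za-z0-9\-]*|[\uac00-\ud7a3\u4e00-\u9fff]+", text)
    return [EQUIVALENTS.get(t, t) for t in tokens]

print(index_units("컴퓨터 바이러스 virus 확산"))
# ['컴퓨터', 'virus', 'virus', '확산']: the Korean term and the English term
# become the same index unit.
```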

17.
With the rapid development of information technology, computer networks keep growing in scale and complexity, and large, complex networks require a sound network management system. Starting from an introduction to computer network management protocols, this paper analyzes and introduces two common network management models.

18.
We report on the design and construction of features of an automated query system which will assist pharmacologists who are not information specialists to access the Derwent Drug File (DDF) pharmacological database. Our approach was to first elucidate those search skills of the search intermediary which might prove tractable to automation. Modules were then produced which assist in the three important subtasks of search statement generation, namely vocabulary selection, the choice of context indicators and query reformulation. Vocabulary selection is facilitated by approximate string matching, morphological analysis, browsing and menu searching. The context of the study, such as treatment or metabolism, is determined using a system of advisory menus. The task of query reformulation is performed using user feedback on retrieved documents, thesaurus relations between document index terms and term postings data. Use is made of diverse information sources, including electronic forms of printed search aids, a thesaurus and a medical dictionary. The system will be of use both to semicasual users and experienced intermediaries. Many of the ideas developed should prove transportable to domains other than pharmacology: the techniques for thesaurus manipulation are designed for use with any hierarchical thesaurus.  相似文献   
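A small sketch of the vocabulary-selection subtask only, using standard-library approximate string matching against a toy controlled vocabulary; the Derwent Drug File thesaurus, the advisory context menus, and the query-reformulation module are not modelled:

```python
import difflib

# Toy slice of a controlled vocabulary; the real system works against the
# Derwent Drug File thesaurus and related search aids.
VOCABULARY = ["propranolol", "propofol", "paracetamol", "pharmacokinetics",
              "drug metabolism", "hepatotoxicity"]

def suggest_terms(user_input, n=3, cutoff=0.6):
    """Offer the closest controlled-vocabulary terms for a possibly
    misspelled user term, as a stand-in for the vocabulary-selection module."""
    return difflib.get_close_matches(user_input.lower(), VOCABULARY, n=n, cutoff=cutoff)

print(suggest_terms("proprenolol"))   # e.g. ['propranolol', 'propofol']
print(suggest_terms("parasetamol"))   # e.g. ['paracetamol']
```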

19.
Bibliometric mapping of computer and information ethics
This paper presents the first bibliometric mapping analysis of the field of computer and information ethics (C&IE). It provides a map of the relations between 400 key terms in the field. This term map can be used to get an overview of concepts and topics in the field and to identify relations between information and communication technology concepts on the one hand and ethical concepts on the other hand. To produce the term map, a data set of over thousand articles published in leading journals and conference proceedings in the C&IE field was constructed. With the help of various computer algorithms, key terms were identified in the titles and abstracts of the articles and co-occurrence frequencies of these key terms were calculated. Based on the co-occurrence frequencies, the term map was constructed. This was done using a computer program called VOSviewer. The term map provides a visual representation of the C&IE field and, more specifically, of the organization of the field around three main concepts, namely privacy, ethics, and the Internet.  相似文献   
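A minimal sketch of the co-occurrence counting behind such a term map, assuming key terms have already been identified; the term-extraction algorithms and the VOSviewer layout step are outside the sketch, and the key-term list is illustrative:

```python
from collections import Counter
from itertools import combinations

KEY_TERMS = {"privacy", "ethics", "internet", "surveillance"}  # illustrative subset

def cooccurrence(documents):
    """Count how often pairs of key terms appear together in the same
    title/abstract; these frequencies are what a program like VOSviewer
    turns into a two-dimensional term map."""
    pairs = Counter()
    for doc in documents:
        present = sorted({w.strip(".,").lower() for w in doc.split()} & KEY_TERMS)
        pairs.update(combinations(present, 2))
    return pairs

docs = ["Privacy and surveillance on the Internet",
        "The ethics of Internet privacy",
        "Professional ethics in computing"]
print(cooccurrence(docs))
# Counter({('internet', 'privacy'): 2, ('ethics', 'internet'): 1, ...})
```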

20.
A procedure for automated indexing of pathology diagnostic reports at the National Institutes of Health is described. Diagnostic statements in medical English are encoded by computer into the Systematized Nomenclature of Pathology (SNOP). SNOP is a structured indexing language constructed by pathologists for manual indexing. It is of interest that effective automatic encoding can be based upon an existing vocabulary and code designed for manual methods. Morphosyntactic analysis, a simple syntax analysis, matching of dictionary entries consisting of several words, and synonym substitutions are techniques utilized.  相似文献   
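A toy sketch of the multi-word dictionary matching with synonym substitution used in such encoding; the SNOP codes and synonym table below are illustrative placeholders, and the morphosyntactic analysis is reduced to lowercasing:

```python
# Illustrative SNOP-style dictionary: phrase -> code. The real nomenclature
# and its full code set are not reproduced here.
DICTIONARY = {
    "squamous cell carcinoma": "M-8070",
    "carcinoma": "M-8010",
    "lung": "T-2800",
}
SYNONYMS = {"pulmonary": "lung", "epidermoid": "squamous cell"}

def encode(statement):
    """Greedy longest-match encoding of a diagnostic statement:
    substitute synonyms, then match the longest dictionary phrase at each position."""
    words = [SYNONYMS.get(w, w) for w in statement.lower().replace(",", "").split()]
    words = " ".join(words).split()       # re-split multi-word substitutions
    codes, i = [], 0
    while i < len(words):
        for length in range(len(words) - i, 0, -1):
            phrase = " ".join(words[i:i + length])
            if phrase in DICTIONARY:
                codes.append((phrase, DICTIONARY[phrase]))
                i += length
                break
        else:
            i += 1                        # skip words with no code
    return codes

print(encode("Epidermoid carcinoma, pulmonary"))
# [('squamous cell carcinoma', 'M-8070'), ('lung', 'T-2800')]
```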
