Similar Literature (20 results)
1.
Succinct data structures were designed to store and/or index data with a relatively small alphabet, a rather skewed distribution, and/or a considerable amount of repetitiveness. Although many of them were developed to handle text, they have been used with other data types, such as biological collections or source code. However, there are no applications of succinct data structures to floating point data; the obvious reason is that this data type does not usually fulfill the aforementioned requirements.

2.
Automatic text classification is the task of organizing documents into pre-determined classes, generally using machine learning algorithms. It is one of the most important means of organizing and exploiting the gigantic amounts of information that exist in unstructured textual form, and it is a widely studied research area of language processing and text mining. In traditional text classification, a document is represented as a bag of words, where the words (in other words, terms) are cut off from their finer context, i.e., their location in a sentence or in a document. Only the broader context of the document is used, together with some form of term-frequency information, in the vector space. Consequently, the semantics of words that could be inferred from the finer context of their location in a sentence, and from their relations with neighboring words, are usually ignored. However, the meaning of words and the semantic connections between words, documents, and even classes are clearly important, since methods that capture semantics generally reach better classification performance. Several surveys have analyzed diverse approaches to traditional text classification, and most of them cover the application of semantic term-relatedness methods to text classification to a certain degree. However, they do not specifically target semantic text classification algorithms and their advantages over traditional text classification. To fill this gap, we undertake a comprehensive discussion of semantic versus traditional text classification. This survey explores past and recent advances in semantic text classification and organizes existing approaches under five fundamental categories: domain knowledge-based approaches, corpus-based approaches, deep learning-based approaches, word/character sequence enhanced approaches, and linguistically enriched approaches. Furthermore, it highlights the advantages of semantic text classification algorithms over traditional ones.
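A minimal sketch (not from the survey) may clarify the bag-of-words limitation described above; it assumes scikit-learn's CountVectorizer as a dependency, but any tokenizer would do. Two sentences with opposite meanings receive identical term-frequency vectors because word order, the "finer context", is discarded.

```python
# Sketch of the bag-of-words limitation: two sentences with reversed
# roles map to the same term-frequency vector in the vector space.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat chased the dog",
    "the dog chased the cat",  # reversed roles, same words
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)

print(vectorizer.get_feature_names_out())  # ['cat' 'chased' 'dog' 'the']
print(X.toarray())
# Both rows are identical ([1 1 1 2]): the vector space model
# cannot distinguish who chased whom.
```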

3.
袁琴芬 《科教文汇》2011,(2):125-126,147
Text linguistics is a new branch of linguistics that studies the production, analysis, and understanding of discourse from a linguistic perspective or with pragmatic methods. Understanding a text is first of all understanding its meaning, because translation is ultimately the translation of discourse. As far as the study of thinking in translation is concerned, at the comprehension stage the translator must engage the relationship between the language (the source language) and the situation of its use (the context), understand the text as "language in use" (discourse is language in use), and thoroughly grasp its textuality. The translator then constructs the target-language text, reproducing the explicit and implicit textual features of the source text so as to produce an ideal translation.

4.
The world-wide use of digital storage and communication devices is increasing the need to make texts available in multiple languages. In this article we explore the possibility of storing a compressed form of a translated version of a text, taking advantage of the availability of the original. The original text supplies some of the semantic content of the text to be compressed, and therefore allows compression to be more efficient than if that information were not available. We begin with an experiment to evaluate the information content of a text when a parallel translation is available. This is achieved by having human subjects guess texts letter by letter, with and without a parallel translation; the perceived information content of a text can be determined from the way subjects make their guesses. The design and results of this experiment are described. The main conclusion is that while the text is considerably more predictable with the aid of a parallel translation, a surprising amount of information is introduced by the translation. Insights from this experiment are then applied in the design of a mechanical system for compressing parallel texts. The system stores one translation of a text intact, and then compresses further translations with the aid of the original. The method is able to compress texts significantly better than is possible without the aid of a parallel text. Aspects of the design are also applicable to future compressors that might exploit the semantic content of a text to obtain better compression.
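As an illustration of the general idea, and not the authors' actual system, the sketch below compresses a second rendering of a sentence with the first rendering supplied as a zlib preset dictionary (the zdict parameter, available since Python 3.3): matching substrings can then be encoded as references into the stored text rather than stored again. The paper's method exploits semantic parallelism between languages, which goes beyond this raw substring reuse.

```python
# Sketch: compress one rendering of a sentence with and without a
# previously stored rendering as a preset dictionary.
import zlib

original = b"All happy families are alike; each unhappy family is unhappy in its own way."
translation = b"All happy families resemble one another; each unhappy family is unhappy in its own way."

# Plain compression of the second rendering alone.
plain = zlib.compress(translation, 9)

# Compression seeded with the stored text as a preset dictionary.
comp = zlib.compressobj(level=9, zdict=original)
aided = comp.compress(translation) + comp.flush()
print(len(plain), len(aided))  # the dictionary-aided stream is typically smaller

# Decompression needs the same dictionary.
decomp = zlib.decompressobj(zdict=original)
assert decomp.decompress(aided) + decomp.flush() == translation
```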

5.
张家军  马婧 《科教文汇》2014,(17):100-101
Translation is not word-for-word literalism but an exchange between different cultures. To ensure that a translation conveys the communicative purpose of the original faithfully and effectively, the translator should bridge the gap between cultures, exercise subjective initiative, and reproduce the communicative purpose of the source text in the target language. Through translation examples, this paper proposes three principles for cultural translation and shows how the translator's initiative and these three principles can be used to handle translation problems between cultures.

6.
魏羽 《情报科研学报》2013,(6):615-617,624
How can the informative function of a text be brought out effectively in translation, so that the information it carries is conveyed truthfully and accurately? Nida argues that "the accuracy of the content should not be judged by the translation's 'fidelity' to the original author, but by whether the message conveyed is not misunderstood by the target readers." In translating informative texts, therefore, the translator should aim to let readers understand the information of the original objectively and accurately, and should translate flexibly in wording and style using the various techniques of translation. The exhibit descriptions in the Qin Terracotta Warriors and Horses Museum are "informative" texts, and their English translations contain quite a few problems. Guided by translation theory, the author applies translation techniques and methods to an objective critique of the problems found in the English translations of the museum's exhibit descriptions.

7.
Trends in Internet usage and in accessing online content in different languages and formats are proliferating at considerable speed. There is a vast amount of digital online content in different formats that is sensitive in nature with respect to writing style and the arrangement of diacritics. However, research aimed at identifying techniques suitable for preserving the content integrity of sensitive digital online content is limited, so it is a challenge to determine which techniques suit which formats, such as image or binary. Preserving and verifying sensitive content therefore constitutes an emerging problem that calls for timely solutions. The digital Holy Qur'an in Arabic constitutes one case of such sensitive content. Due to the characteristics of Arabic letters, such as diacritics (vowel marks), kashidas (extended letters), and other symbols, it is very easy to alter the original meaning of the text simply by changing the arrangement of diacritics. This article surveys the approaches presently employed for preserving and verifying the content integrity of sensitive online content. We present the state of the art in content integrity verification and address the existing challenges in preserving the integrity of sensitive texts, using the digital Qur'an as a case study. The proposed taxonomy provides an effective classification and analysis of existing schemes and their limitations. The paper discusses recommendations on the expected efficiency of such approaches when applied to digital content integrity. Among the main findings, unified approaches combining watermarking and string matching can be used to preserve the content integrity of any sensitive digital content.
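As a baseline illustration, and not one of the surveyed watermarking or string-matching schemes, the sketch below shows how a cryptographic hash over Unicode-normalized text detects even a single changed diacritic; the short Arabic strings are toy examples.

```python
# Sketch: a byte-exact integrity check that catches diacritic changes.
# This is only the hashing baseline; the surveyed schemes add
# watermarking and approximate string matching on top of it.
import hashlib
import unicodedata

def digest(text: str) -> str:
    # NFC-normalize so canonically equivalent encodings compare equal,
    # then hash the UTF-8 bytes.
    normalized = unicodedata.normalize("NFC", text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

original = "بِسْمِ"   # letters with one arrangement of diacritics
tampered = "بَسْمِ"   # a single diacritic (kasra -> fatha) changed

print(digest(original) == digest(tampered))  # False: tampering detected
```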

8.
The article presents an analysis of the effect of granularity and order in an XML-encoded collection of full-text journal articles. Two hundred and eighteen sessions of searchers performing simulated work tasks in the collection were analysed. The results show that searchers prefer to use smaller sections of an article as their source of information. In interaction sessions during which full articles are assessed, however, the articles are to a large degree evaluated as more important than their sections and subsections.

9.
彭志瑛 《科教文汇》2011,(35):122-124
字幕翻译是一种特殊的语码转换类型,具有语言浓缩和对白性格化的特点。字幕翻译中,文化预设表现为电影对白与文化现实之间的关联,有效解读源语对白的文化预设是成功字幕翻译的前提。当源语和目的语共有某种文化预设,译者采用"形意对应"的编码方式,以保持源语对白的异域特色;当两种语言不共享某种文化预设时,译者采用打破重组,创意缩合的编码方式,如明示与阐释、替换与重构、增补与删减等翻译策略。  相似文献   

10.
In the present work we perform compressed pattern matching in binary Huffman encoded texts [Huffman, D. (1952). A method for the construction of minimum redundancy codes. Proc. of the IRE, 40, 1098–1101]. A modified Knuth–Morris–Pratt (KMP) algorithm is used to overcome the problem of false matches, i.e., occurrences of the encoded pattern in the encoded text that do not correspond to occurrences of the pattern itself in the original text. We propose a bitwise KMP algorithm that can move one extra bit in the case of a mismatch, since the alphabet is binary. To avoid processing any bit of the encoded text more than once, a preprocessed table determines how far to back up when a mismatch is detected; it is defined so that the start of the encoded pattern can always be aligned with the start of a codeword in the encoded text. We combine our KMP algorithm with two practical Huffman decoding schemes that handle more than a single bit per machine operation: skeleton trees defined by Klein [Klein, S. T. (2000). Skeleton trees for efficient decoding of Huffman encoded texts. Information Retrieval, 3, 7–23], and numerical comparisons between special canonical values and portions of a sliding window presented in Moffat and Turpin [Moffat, A., & Turpin, A. (1997). On the implementation of minimum redundancy prefix codes. IEEE Transactions on Communications, 45, 1200–1207]. Experiments show rapid search times for our algorithms compared to the "decompress then search" method; files can therefore be kept in their compressed form, saving memory space. When compression gain is important, these algorithms are better than cgrep [Ferragina, P., Tommasi, A., & Manzini, G. (2004). C library to search over compressed texts, http://roquefort.di.unipi.it/~ferrax/CompressedSearch], which is only slightly faster than ours.
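A sketch of plain KMP over bit strings (represented here as strings of '0'/'1' characters) may clarify the core mechanism. The paper's bitwise variant additionally keeps a table that aligns the pattern with Huffman codeword boundaries to rule out false matches; that alignment is omitted from this sketch.

```python
# Sketch: KMP search over a binary alphabet. The failure function lets
# the search back up without re-reading any bit of the encoded text.
def kmp_failure(pattern: str) -> list:
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail

def kmp_search(text: str, pattern: str):
    fail = kmp_failure(pattern)
    k = 0
    for i, bit in enumerate(text):
        while k > 0 and bit != pattern[k]:
            k = fail[k - 1]   # back up; earlier bits are never re-read
        if bit == pattern[k]:
            k += 1
        if k == len(pattern):
            yield i - len(pattern) + 1
            k = fail[k - 1]

encoded_text = "0110100111010011"
encoded_pattern = "10011"
print(list(kmp_search(encoded_text, encoded_pattern)))  # [4, 11]
```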

11.
This paper analyzes the security of the Loiss stream cipher against guess-and-determine attacks and presents such an attack. The attack avoids directly guessing the registers of the BOMM structure in Loiss, thereby reducing the attack complexity. The final attack has time complexity O(2^247) and data complexity O(2^52).

12.
On the Selection of the Three Elements of Analogy
石益祥 《科学学研究》2004,22(5):460-463
汤建民 proposed the three elements of analogy — the analogy source (类比源), the analogy spring (类比泉), and the analogy knowledge unit (类比知识单元) — and made a preliminary study of how to select the analogy source and the analogy knowledge unit. Building on that work, this paper goes a step further and systematically studies analogies from the perspective of selecting the analogy spring, especially analogies that take mathematical knowledge as the analogy source. In the author's view, the selection of the analogy source, the analogy spring, and the analogy knowledge unit bears on how the analogized object is understood, and determines the magnitude and significance of the analogical results. Studying how to select the three elements therefore concerns the very success of an analogy, and its importance is no less than that of proposing the three concepts themselves.

13.
康志琪 《科教文汇》2020,(8):117-118
One of the language elements in Unit 6 of the ministry-compiled fourth-grade Chinese textbook (first volume) requires students to learn to read using annotation. So-called annotation-based reading means marking the blank spaces at the relevant places of a text with words and symbols, to help students understand the content; reading while annotating is a good way to learn reading in Chinese class. During my teaching practicum at a school, I identified problems students have with annotation-based reading and derived some teaching insights and strategies for it.

14.
Transfer learning utilizes labeled data available from a related (source) domain to achieve effective knowledge transfer to the target domain. However, most state-of-the-art cross-domain classification methods treat documents as plain text and ignore the hyperlink (or citation) relationships among them. In this paper, we propose a novel cross-domain document classification approach called the Link-Bridged Topic model (LBT). LBT consists of two key steps. First, LBT uses an auxiliary link network to discover direct or indirect co-citation relationships among documents by embedding this background knowledge into a graph kernel; the mined co-citation relationships are leveraged to bridge the gap between domains. Second, LBT combines content information and link structure into a unified latent topic model, based on the assumption that documents of the source and target domains share some common topics from the point of view of both content and link structure. By mapping the data of both domains into the latent topic space, LBT encodes the knowledge about domain commonality and difference as shared topics with associated differential probabilities. The learned latent topics must be consistent with the source and target data, as well as with content and link statistics; the shared topics then act as the bridge that facilitates knowledge transfer from the source to the target domain. Experiments on different types of datasets show that our algorithm significantly improves the generalization performance of cross-domain document classification.
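The first step can be illustrated in miniature with numpy; this is an illustration, not LBT's actual kernel. Direct co-citation counts follow from the citation adjacency matrix, and a truncated walk expansion serves as a simple stand-in for the paper's graph kernel that also credits indirect relationships.

```python
# Sketch of the co-citation step. A[i, j] = 1 means document i cites
# document j, on a toy citation network of four documents.
import numpy as np

A = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
], dtype=float)

# Direct co-citation: C[i, j] = number of documents citing both i and j.
C = A.T @ A

# Indirect relatedness via a truncated walk expansion
# (I + beta*A + beta^2*A^2 + beta^3*A^3), a simple kernel stand-in.
beta = 0.5
K = np.eye(4) + beta * A + beta**2 * (A @ A) + beta**3 * (A @ A @ A)

print(C)
print(K)
```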

15.
Automatic text summarization concerns the language industries. This work proposes a system that automatically and directly transforms a source text into a reduced target text. The system deals exclusively with scientific and technical texts. It is based on the identification of specific expressions that allow the relevance of a sentence to be evaluated; relevant sentences can then be selected for the summary. The procedure attributes a score to each sentence of the text and eliminates those with the lowest scores. To produce the RAFI system (automatic summary based on indicative fragments), we drew on the linguistic tools of discourse analysis and the computing capacity of data-processing instruments. The system could be adapted to the Internet.
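A toy sketch of the scoring procedure follows; the cue list is invented for illustration and stands in for RAFI's linguistically derived indicative fragments.

```python
# Sketch: score each sentence by the indicative expressions it
# contains, then keep the highest-scoring sentences in original order.
import re

INDICATIVE = {
    "we propose": 3, "in conclusion": 3, "results show": 2,
    "this paper": 2, "we present": 2, "for example": -1,
}

def score(sentence: str) -> int:
    s = sentence.lower()
    return sum(w for cue, w in INDICATIVE.items() if cue in s)

def summarize(text: str, keep: int = 2) -> str:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    ranked = sorted(sentences, key=score, reverse=True)[:keep]
    # Restore original order for readability.
    return " ".join(s for s in sentences if s in ranked)

doc = ("This paper studies summarization. The weather was nice. "
       "We propose a scoring scheme. For example, scores may be negative. "
       "Results show the scheme works.")
print(summarize(doc))
# -> "This paper studies summarization. We propose a scoring scheme."
```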

16.
The documents retrieved by a web search are useful if the information they contain contributes to some task or information need. To measure search result utility, studies have typically focused on perceived usefulness rather than actual information use. We investigate the actual usefulness of search results—as indicated by their use as sources in an extensive writing task—and the factors that make a writer successful at retrieving useful sources. Our data comprise 150 essays written by 12 writers whose querying, clicking, and writing activities were recorded. By tracking authors' text reuse behavior, we quantify the search results' contribution to the task more accurately than before. We model the overall utility of the search results retrieved throughout the writing process using path analysis, and compare a binary utility model (Reuse Events) to one that quantifies the degree of utility (Reuse Amount). The Reuse Events model has greater explanatory power (63% vs. 48%); in both models the number of clicks is by far the strongest predictor of useful results—with β-coefficients up to 0.7—while dwell time has a negative effect (β between −0.14 and −0.21). In conclusion, we propose a new measure of search result usefulness based on a source's contribution to an evolving text. Our findings are valid for tasks where text reuse is allowed, but also have implications for designing indicators of search result usefulness for general writing tasks.
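As a rough illustration of the modeling step, and with synthetic numbers rather than the study's data, a standardized least-squares fit yields signed weights for clicks and dwell time comparable in spirit to the reported β-coefficients.

```python
# Toy least-squares sketch of the path-model idea: predict how much
# retrieved text gets reused from clicks and dwell time. The session
# table below is synthetic and constructed for illustration only.
import numpy as np

# Columns: clicks, mean dwell time (s); target: reused characters.
X = np.array([[3, 40], [7, 25], [2, 60], [9, 20], [5, 35], [1, 70]], dtype=float)
y = np.array([190.0, 380.0, 110.0, 470.0, 280.0, 50.0])

# Standardize so the coefficients are comparable, like reported betas.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
ys = (y - y.mean()) / y.std()

design = np.column_stack([np.ones(len(Xs)), Xs])
beta, *_ = np.linalg.lstsq(design, ys, rcond=None)
print(beta[1:])  # on this toy data: clicks ~ +0.77, dwell ~ -0.24
```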

17.
Research on the Mechanism and Application of Complex Text Layout Engines
The most widely used complex text layout engines at present are Microsoft's Uniscribe and IBM's ICU. By studying the mechanism of complex text layout engines and analyzing the code of an open-source engine (such as the ICU layout engine), we can add implementation modules for complex scripts the engine does not yet support, such as the Mongolian, Tibetan, and Uyghur scripts of China's ethnic minorities, and on this basis develop an OpenOffice-based office suite for China's minority-language regions. This paper first introduces what complex scripts and complex text are, then explains the mechanism of complex text layout engines, and finally describes the ICU layout engine and its application in developing an OpenOffice-based office suite for Chinese ethnic minorities.

18.
This paper proposes a framework and method for identifying and selecting R&D partners based on a firm's concentric technological diversification. Starting from the firm's existing technological capabilities, R&D resources, and development needs, it tailors feasible technology directions for the firm and identifies the best partners for collaborative R&D. First, association rule mining is used to discover the target firm's concentric-diversification technology fields; then LDA topic modeling is applied to text-mine the patents of candidate partners and divide them into technology topics; finally, a patent evaluation system covering two dimensions, professional capability and collaboration capability, is constructed to assess the candidates and determine the best partner under each topic. An empirical analysis taking 天士力控股集团有限公司 (Tasly Holding Group) as the target firm shows that the framework and method are applicable and effective.
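The topic-modeling step might look like the sketch below, which uses scikit-learn's LatentDirichletAllocation on an invented four-patent toy corpus; the association-rule mining that precedes it and the two-dimensional partner scoring that follows it are not shown.

```python
# Sketch: split candidate partners' patent texts into latent
# technology topics with LDA, then read off each patent's dominant
# topic. Corpus and topic count are toy choices for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

patents = [
    "compound extraction method for herbal formulation",
    "herbal compound granule and preparation process",
    "microneedle drug delivery patch design",
    "transdermal drug delivery device with microneedle array",
]

X = CountVectorizer(stop_words="english").fit_transform(patents)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Assign each patent to its dominant technology topic.
print(lda.transform(X).argmax(axis=1))  # e.g. [0 0 1 1]
```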

19.
Multimedia objects can be retrieved using their context, for instance the text surrounding them in documents. This text may be either near to or far from the searched objects. Our goal in this paper is to study the impact, in terms of effectiveness, of text position relative to the searched objects. The multimedia objects we consider are described in structured documents such as XML ones, so the document structure is exploited to determine text position. Although structural information has been shown to be an effective source of evidence in textual information retrieval, only a few works have investigated its interest for multimedia retrieval. More precisely, the task we address in this paper is to retrieve multimedia fragments (i.e., XML elements having at least one multimedia object). Our general approach is built on two steps: we first retrieve XML elements containing multimedia objects, and we then explore the surrounding information to retrieve relevant multimedia fragments. In both cases, we study the impact of the surrounding information using the documents' structure.
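A minimal sketch of the two-step approach on a toy XML fragment follows; the element and attribute names are invented for illustration and are not from the paper's collection.

```python
# Sketch: (1) find XML elements that directly contain a multimedia
# object, (2) gather the surrounding text of each as its context.
import xml.etree.ElementTree as ET

doc = ET.fromstring("""
<article>
  <sec><p>Coral reefs host rich ecosystems.</p>
       <image src="reef.jpg"/>
       <p>Figure: a fringing reef.</p></sec>
  <sec><p>Unrelated section without media.</p></sec>
</article>""")

for elem in doc.iter():
    if elem.find("image") is not None:                        # step 1
        context = " ".join(p.text for p in elem.findall("p"))  # step 2
        print(elem.find("image").get("src"), "->", context)
```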

20.
Sentiment lexicons are essential tools for polarity classification and opinion mining. In contrast to machine learning methods that only leverage text features or raw text for sentiment analysis, methods that use sentiment lexicons offer higher interpretability. Although a number of domain-specific sentiment lexicons are available, it is impractical to build an ex ante lexicon that fully reflects the characteristics of language usage in endless domains. In this article, we propose a novel approach that simultaneously trains a vanilla sentiment classifier and adapts word polarities to the target domain. Specifically, we sequentially track the wrongly predicted sentences and use them as the supervision, instead of addressing the gold standard as a whole, to emulate the life-long cognitive process of lexicon learning. An exploration–exploitation mechanism is designed to trade off between searching for new sentiment words and updating the polarity score of a word. Experimental results on several popular datasets show that our approach significantly improves sentiment classification performance for a variety of domains by improving the quality of sentiment lexicons. Case studies also illustrate how different polarity scores for the same words are discovered for different domains.
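A toy sketch of the adaptation loop is given below: wrongly predicted sentences trigger either exploration (admitting a new word to the lexicon) or exploitation (nudging known words' polarities). The seed scores, the ε value, and the update rule are invented placeholders for the paper's mechanism.

```python
# Sketch: lexicon-based classification with exploration-exploitation
# updates driven only by wrongly predicted sentences.
import random

lexicon = {"good": 1.0, "bad": -1.0}

def predict(tokens):
    # Lexicon score of a sentence: sum of known word polarities.
    return sum(lexicon.get(t, 0.0) for t in tokens)

def update(tokens, label, epsilon=0.3, step=0.5):
    if (predict(tokens) > 0) == (label > 0):
        return  # correct prediction: no supervision signal
    known = [t for t in tokens if t in lexicon]
    unknown = [t for t in tokens if t not in lexicon]
    if unknown and (not known or random.random() < epsilon):
        lexicon[random.choice(unknown)] = step * label   # explore
    else:
        for t in known:                                  # exploit
            lexicon[t] += step * label

random.seed(0)
stream = [("the plot was gripping".split(), +1),
          ("service was awful".split(), -1)]
for toks, y in stream:
    update(toks, y)
print(lexicon)  # previously unseen words now carry polarity scores
```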
