Similar Documents
20 similar documents found.
1.
2.
A suite of computer programs has been developed for representing the full text of lengthy documents in vector form and classifying them by a clustering method. The programs have been applied to the full text of the Conventions and Agreements of the Council of Europe, which consist of some 280,000 words in the English version and a similar number in the French. Results of the clustering experiments are presented in the form of dendrograms (tree diagrams), using both the treaty and the article as the clustering unit. The conclusion is that vector techniques based on the full text provide an effective method of classifying legal documents.
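A minimal sketch of this kind of full-text vector clustering, using TF-IDF vectors, average-linkage agglomerative clustering, and a dendrogram. The sample "treaty" texts and parameter choices are illustrative assumptions, not the original programs.

```python
# Hedged sketch: full-text vector clustering drawn as a dendrogram,
# in the spirit of the abstract above (not the original system).
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

# Illustrative stand-ins for full treaty (or article) texts.
documents = {
    "Treaty A": "extradition of offenders between contracting states ...",
    "Treaty B": "mutual assistance in criminal matters and extradition ...",
    "Treaty C": "protection of human rights and fundamental freedoms ...",
    "Treaty D": "social security schemes for migrant workers ...",
}

# Represent each full text as a weighted term vector.
vectors = TfidfVectorizer().fit_transform(documents.values()).toarray()

# Agglomerative clustering on cosine distance, drawn as a tree diagram.
tree = linkage(vectors, method="average", metric="cosine")
dendrogram(tree, labels=list(documents.keys()))
plt.tight_layout()
plt.show()
```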

3.
谭笑, 刘兵. 《科学学研究》 (Studies in Science of Science), 2009, 27(8): 1144-1148
In recent years, studies of scientific texts have developed into distinct research approaches according to their respective theoretical premises and methods. Rhetorical analysis of science has become an important branch among them, forming its own research characteristics through distinctive concepts and methods. This paper clarifies the differences among these approaches and identifies the position and distinctive features of rhetorical analysis within the study of scientific texts as a whole.

4.
For a given text which has been encoded by a static Huffman code, the possibility of locating a given pattern directly in the compressed text is investigated. The main problem is one of synchronization, as an occurrence of the encoded pattern in the encoded text does not necessarily correspond to an occurrence of the pattern in the text. A simple algorithm is suggested which reduces the number of erroneously declared matches. The probability of such false matches is analyzed and empirically tested.
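A sketch of the general idea: encode the pattern with the same static Huffman code, scan the bitstream, and reject hits that do not fall on a codeword boundary. The toy code table and the verification step (re-decoding the prefix) are illustrative assumptions, not the paper's algorithm.

```python
# Hedged sketch: searching for a pattern directly in a Huffman-encoded text.
# The toy code table is an assumption; the boundary check (decoding the prefix
# and re-encoding it) is one simple way to reject false, unsynchronized matches.
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}   # static Huffman code

def encode(text):
    return "".join(CODE[ch] for ch in text)

def decode(bits):
    inverse = {v: k for k, v in CODE.items()}
    out, node = [], ""
    for b in bits:
        node += b
        if node in inverse:
            out.append(inverse[node])
            node = ""
    return "".join(out)

def compressed_search(pattern, compressed_bits):
    """Yield bit offsets where the encoded pattern starts on a codeword boundary."""
    encoded_pattern = encode(pattern)
    start = compressed_bits.find(encoded_pattern)
    while start != -1:
        # A raw bit-level hit may be spurious: accept it only if the preceding
        # bits decode into a whole number of codewords (i.e. we are synchronized).
        prefix = compressed_bits[:start]
        if encode(decode(prefix)) == prefix:
            yield start
        start = compressed_bits.find(encoded_pattern, start + 1)

bits = encode("abacadcab")
print(list(compressed_search("ca", bits)))   # bit offsets of genuine occurrences
```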

5.
This paper explores the nature of, and justification for, copyright in academic texts in the light of recent developments in information technology, in particular the growth of electronic publication on the internet. Copyright, like other forms of property, is best thought of as a cluster of rights. A distinction is drawn within this cluster between first-order "control rights" and higher-order "commodity rights". It is argued that copyright in academic texts is founded on its role as a means to allow academics to fulfil their role responsibilities. While the possession and exercise by academics of commodity rights can be thus justified in the case of mechanical print-based publication, since this helps make possible the reproduction and dissemination of academic texts, they cannot be so justified in the case of electronic publication. There are nevertheless good reasons to retain various control rights.

6.
A method is introduced to recognize the part-of-speech of words in English texts using knowledge of linguistic regularities rather than voluminous dictionaries. The algorithm proceeds in two steps. In the first step, information about the part-of-speech is extracted from each word of the text in isolation, using morphological analysis as well as the fact that English has a reasonable number of word endings characteristic of the part-of-speech. The second step looks at a whole sentence and, using syntactic criteria, assigns the part-of-speech to each word according to the parts-of-speech and other features of the surrounding words. In particular, those parts-of-speech which are relevant for automatic indexing of documents, i.e. nouns, adjectives, and verbs, are recognized. An application of this method to a large corpus of scientific text showed that the part-of-speech was identified correctly for 84% of the words and definitely wrongly for only 2%; for the remaining words, ambiguous assignments were made. Using only word lists of limited extent, the technique may thus be a valuable tool for automatic indexing of documents and automatic thesaurus construction, as well as other kinds of natural language processing.
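A toy sketch of the two-step idea: suffix heuristics assign candidate tags per word in isolation, and a contextual pass over the sentence narrows ambiguous sets. The suffix table and the single contextual rule are illustrative assumptions only.

```python
# Hedged sketch of the two-step idea: (1) guess the part-of-speech of each word
# from characteristic endings, (2) resolve ambiguities from surrounding words.
SUFFIX_TAGS = [
    ("tion", {"NOUN"}), ("ness", {"NOUN"}), ("ment", {"NOUN"}),
    ("ous", {"ADJ"}), ("able", {"ADJ"}), ("ive", {"ADJ", "NOUN"}),
    ("ize", {"VERB"}), ("ed", {"VERB", "ADJ"}), ("ly", {"ADV"}),
]

def guess_word(word):
    """Step 1: candidate tags for a word in isolation, from its ending."""
    for suffix, tags in SUFFIX_TAGS:
        if word.lower().endswith(suffix):
            return set(tags)
    return {"NOUN", "VERB", "ADJ"}              # unknown: stay ambiguous

def tag_sentence(words):
    """Step 2: shrink ambiguous tag sets using neighbouring words."""
    candidates = [guess_word(w) for w in words]
    for i, tags in enumerate(candidates):
        # Example contextual rule: after a determiner, prefer NOUN/ADJ readings.
        if i > 0 and words[i - 1].lower() in {"the", "a", "an"} and tags & {"NOUN", "ADJ"}:
            candidates[i] = tags & {"NOUN", "ADJ"}
    return list(zip(words, candidates))

print(tag_sentence("the automated indexing method performed nicely".split()))
```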

7.
8.
The Mythical Paradigm in the Study of Scientific Texts and Its Transformation (cited 2 times: 0 self-citations, 2 by others)
王彦雨, 池田. 《科学学研究》 (Studies in Science of Science), 2009, 27(3): 328-333
The traditional paradigm for analyzing scientific texts is a content-indifferent, mythical mode of research that treats the relationship between scientific texts and scientific practice as one of "faithful reflection". Scholars of the sociology of scientific knowledge (SSK), by contrast, have opened the black box of the scientific text, deconstructing the traditional mythical view from a cognitive perspective and attempting to break the reflectionist logic that the traditional view posits between text and the real world. This paper argues that the turn is consistent with the linguistic, rhetorical, and hermeneutic turns in the philosophy of science, and that it also opens a path for SSK's own self-reflection, giving it significant academic value. In practical terms, it draws attention to scientific texts, provides an epistemological basis for preventing scientific misconduct, and suggests a new methodology in which traditional macro-level analysis and SSK's micro-level studies complement each other.

9.
This investigation deals with the problem of language identification of noisy texts, which can be the first step of many natural language processing or information retrieval tasks. Language identification is the task of automatically identifying the language of a given text. Although several methods exist in the literature, their performance is not convincing in practice. In this contribution, we propose two statistical approaches: the high-frequency approach and the nearest-prototype approach. In the first, five language identification algorithms are proposed and implemented, namely: character-based identification (CBA), word-based identification (WBA), special-characters-based identification (SCA), a sequential hybrid algorithm (HA1) and a parallel hybrid algorithm (HA2). In the second, we use 11 similarity measures combined with several types of character N-grams. For evaluation, the proposed methods are tested on forum datasets containing 32 different languages. Furthermore, an experimental comparison is made between the proposed approaches and several reference language identification tools such as LIGA, NTC, Google Translate and Microsoft Word. Results show that the proposed approaches outperform the baseline methods of language identification on forum texts.
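A compact sketch of the nearest-prototype idea with character N-gram frequency profiles and cosine similarity. The tiny per-language snippets stand in for real training corpora and are illustrative assumptions only.

```python
# Hedged sketch of nearest-prototype language identification using character
# N-gram profiles and cosine similarity.
from collections import Counter
from math import sqrt

def ngram_profile(text, n=3):
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def cosine(p, q):
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# One prototype profile per language, built from placeholder sample text.
prototypes = {
    "en": ngram_profile("the quick brown fox jumps over the lazy dog"),
    "fr": ngram_profile("le renard brun rapide saute par dessus le chien paresseux"),
    "de": ngram_profile("der schnelle braune fuchs springt über den faulen hund"),
}

def identify(text):
    profile = ngram_profile(text)
    return max(prototypes, key=lambda lang: cosine(profile, prototypes[lang]))

print(identify("les chiens dorment"))   # likely 'fr' with these toy prototypes
```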

10.
Ethnicity-targeted hate speech has been widely shown to influence on-the-ground inter-ethnic conflict and violence, especially in multi-ethnic societies such as Russia. Detecting ethnicity-targeted hate speech in user texts is therefore becoming an important task. However, it faces a number of unresolved problems: the difficulty of reliable mark-up, informal and indirect ways of expressing negativity in user texts (such as irony, false generalization and attribution of unfavored actions to targeted groups), users' inclination to express opposite attitudes to different ethnic groups in the same text and, finally, the lack of research on languages other than English. In this work we address several of these problems in the task of ethnicity-targeted hate speech detection in Russian-language social media texts. Our approach allows us to differentiate between attitudes towards different ethnic groups mentioned in the same text – a task that has not been addressed before. We use a dataset of over 2.6M user messages mentioning ethnic groups to construct a representative sample of 12K (ethnic group, text) instances that are then thoroughly annotated via a special procedure. In contrast to many previous collections, which usually comprise extreme cases of toxic speech, the representativeness of our sample secures a realistic and therefore much higher proportion of subtle negativity, which additionally complicates its automatic detection. We then experiment with four types of machine learning models, from traditional classifiers such as SVM to deep learning approaches, notably the recently introduced BERT architecture, and interpret their predictions in terms of various linguistic phenomena. In addition to hate speech detection with a text-level two-class approach (hate, no hate), we also justify and implement a unique instance-based three-class approach (positive, neutral, or negative attitude, the latter implying hate speech). Our best results are achieved with fine-tuned, pre-trained RuBERT combined with linguistic features: F1-hate = 0.760 and F1-macro = 0.833 on the text-level two-class problem, comparable to previous studies, and F1-hate = 0.813 and F1-macro = 0.824 on our instance-based three-class hate speech detection task. Finally, error analysis reveals that further improvement could be achieved by handling complex and creative language more accurately, i.e., by detecting irony and unconventional forms of obscene lexicon.
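A minimal sketch of the instance-based setup: the mention of the target ethnic group is marked in the text and scored with a three-class classification head on top of RuBERT. The [TGT] marking scheme, the label order, and the untrained head are assumptions for illustration, not the paper's pipeline; a real system would fine-tune on the annotated sample first.

```python
# Hedged sketch of instance-based (ethnic group, text) classification with
# RuBERT. The marking scheme, label order and untrained classification head
# are illustrative assumptions; fine-tuning is omitted.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL = "DeepPavlov/rubert-base-cased"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=3)
LABELS = ["positive", "neutral", "negative"]          # assumed label order

def classify_instance(text, target_group):
    # Mark the mention of the target group so that one text can be scored
    # separately for each ethnic group it mentions.
    marked = text.replace(target_group, f"[TGT] {target_group} [/TGT]")
    inputs = tokenizer(marked, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]

print(classify_instance("пример текста о группе X", "группе X"))
```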

11.
To address the enormous data volume and complexity involved in image storage, processing, and transmission, this paper applies lossless coding to regions of interest (ROI) and develops an improved region-dependent codec based on the embedded zerotree wavelet (EZW) transform algorithm. The codec supports adjustable compression ratios and manual selection of arbitrarily shaped regions of interest. The improved algorithm preserves the clarity of the ROI in the reconstructed image, and experiments on standard test images show that it achieves a high compression ratio and good reconstruction quality.

12.
The fundamental idea of the work reported here is to extract index phrases from texts with the help of a single-word concept dictionary and a thesaurus containing relations among concepts. The work is based on the fact that, within every phrase, the single words the phrase is composed of are related in a certain well-defined manner, the type of relation holding between concepts depending only on the concepts themselves. Relations can therefore be stored in a semantic network. The algorithm described extracts single-word concepts from texts and combines them into phrases using the semantic relations between these concepts, which are stored in the network. The results obtained show that phrase extraction from texts by this semantic method is possible and offers many advantages over other (purely syntactic or statistical) methods concerning the precision and completeness of the meaning representation of the text. The results also show, however, that some syntactic and morphological "filtering" should be included for reasons of effectiveness.
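A toy sketch of the idea: single-word concepts are looked up in a dictionary and combined into index phrases only when the semantic network records a relation between the two concepts. The concept dictionary and relation table here are illustrative assumptions.

```python
# Hedged sketch: combine adjacent single-word concepts into index phrases only
# when the semantic network licenses a relation between them.
CONCEPTS = {"renal": "KIDNEY", "kidney": "KIDNEY", "failure": "FAILURE",
            "acute": "ACUTE", "treatment": "TREATMENT"}

# Semantic network: (concept, concept) -> relation type (toy entries).
RELATIONS = {("ACUTE", "FAILURE"): "attribute-of",
             ("KIDNEY", "FAILURE"): "located-in",
             ("TREATMENT", "FAILURE"): "acts-on"}

def extract_phrases(text):
    words = text.lower().split()
    concepts = [(w, CONCEPTS[w]) for w in words if w in CONCEPTS]
    phrases = []
    for (w1, c1), (w2, c2) in zip(concepts, concepts[1:]):
        relation = RELATIONS.get((c1, c2)) or RELATIONS.get((c2, c1))
        if relation:                       # keep only semantically related pairs
            phrases.append((f"{w1} {w2}", relation))
    return phrases

print(extract_phrases("acute renal failure requires treatment"))
```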

13.
In this study, quantitative measures of the information content of textual material have been developed based upon analysis of the linguistic structure of the sentences in the text. It has been possible to measure such properties as: (1) the amount of information contributed by a sentence to the discourse; (2) the complexity of the information within the sentence, including the overall logical structure and the contributions of local modifiers; (3) the density of information, based on the ratio of the number of words in a sentence to the number of information-contributing operators. Two contrasting types of texts were used to develop the measures. The measures were then applied to contrasting sentences within one type of text. The textual material was drawn from narrative patient records and from the medical research literature. Sentences from the records were analyzed by computer and those from the literature were analyzed manually, using the same methods of analysis. The results show that quantitative measures of properties of textual information can be developed which accord with intuitively perceived differences in the informational complexity of the material.
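A worked toy version of measure (3), the ratio of the number of words to the number of information-contributing operators. Which tokens count as "operators" is an illustrative assumption, not the study's operator inventory.

```python
# Hedged toy version of measure (3): information density as words per
# information-contributing operator. The operator word list is assumed.
OPERATORS = {"shows", "increased", "caused", "denies", "reports", "if", "because"}

def information_density(sentence):
    words = sentence.lower().replace(".", "").split()
    operators = sum(1 for w in words if w in OPERATORS)
    return len(words) / operators if operators else float("inf")

print(information_density("Patient denies chest pain but reports mild dyspnea."))
```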

14.
Natural Language Processing (NLP) techniques have been successfully used to automatically extract information from unstructured text through a detailed analysis of its content, often to satisfy particular information needs. In this paper, an automatic concept map construction technique, Fuzzy Association Concept Mapping (FACM), is proposed for the conversion of abstracted short texts into concept maps. The approach consists of a linguistic module and a recommendation module. The linguistic module is a text mining method that does not require the user to have any prior knowledge of NLP techniques. It incorporates rule-based reasoning (RBR) and case-based reasoning (CBR) for anaphora resolution, and aims at extracting the propositions in a text so as to construct a concept map automatically. The recommendation module is built on fuzzy set theory. It is an interactive process which suggests propositions for further human refinement of the automatically generated concept maps; the suggested propositions are relationships among concepts which are not explicitly found in the paragraphs. This technique helps to stimulate individual reflection and generate new knowledge. Evaluation was carried out using the Science Citation Index (SCI) abstract database and CNET News as test data, which are well-known sources with assured text quality. Experimental results show that the automatically generated concept maps conform to the outputs generated manually by domain experts, since the degree of difference between them is proportionally small. The method gives users the ability to convert scientific and short texts into a structured format which can be easily processed by computer. Moreover, it provides knowledge workers with extra time to rethink their written text and to view their knowledge from another angle.

15.
We propose a social relation extraction system using dependency-kernel-based support vector machines (SVMs). The proposed system classifies input sentences containing two people's names according to whether or not they describe a social relation between the two people. The system then extracts relation names (i.e., social-relation keywords) from sentences describing social relations. We propose new tree kernels called dependency trigram kernels for effectively implementing these processes with SVMs. Experiments showed that the proposed kernels delivered better performance than the existing dependency kernel. On the basis of the experimental evidence, we suggest that the proposed system can be used as a useful tool for automatically constructing social networks from unstructured texts.
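A rough sketch of training an SVM with a precomputed kernel over the dependency path between the two names. The simple shared-trigram-count kernel and the toy, pre-parsed path strings stand in for the paper's dependency trigram kernels and a real dependency parser.

```python
# Hedged sketch: SVM with a precomputed kernel over dependency paths between
# two person names; the trigram-overlap kernel and toy paths are assumptions.
import numpy as np
from sklearn.svm import SVC

def trigrams(path):
    tokens = path.split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}

def kernel_matrix(paths_a, paths_b):
    # Kernel value = number of shared token trigrams along the two paths.
    return np.array([[len(trigrams(a) & trigrams(b)) for b in paths_b]
                     for a in paths_a], dtype=float)

# Toy training data: dependency paths between two names, 1 = social relation.
train_paths = ["PER1 nsubj married dobj PER2",
               "PER1 nsubj met dobj PER2",
               "PER1 nsubj visited dobj city",
               "PER1 conj report conj PER2"]
labels = [1, 1, 0, 0]

clf = SVC(kernel="precomputed").fit(kernel_matrix(train_paths, train_paths), labels)

test_paths = ["PER1 nsubj married dobj PER2"]
print(clf.predict(kernel_matrix(test_paths, train_paths)))   # expected: [1]
```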

16.
Substantial numbers of real cases accumulate on today's online medical platforms, constituting potentially rich commercial medical value. To obtain this value, it is necessary to mine the preference for user-perceived cancer risk on such platforms. However, user preference on these platforms varies with the medical inquiry text environment, and a user's disease-specific online medical inquiry text environment also affects his or her behavioral decisions in real time. In this sense, considering the relations between different contexts and user preferences under different disease-specific inquiry text environments, and integrating early-cancer texts, facilitates exploration of the regularities of preference for user-perceived cancer risk. Therefore, in this paper, matrix decomposition and Labeled-LDA models are extended to propose a context-based method for obtaining the preference for user-perceived cancer risk. Firstly, the relationship between user preferences and information in a multi-dimensional context is modeled, a general method of integrating multi-dimensional contextual information with user preferences is analyzed, and more accurate user preferences are obtained in the multi-dimensional text space and multi-dimensional disease space. Secondly, the similarity relationships between all disease-specific online medical inquiries and early-cancer texts are used to obtain user-perceived cancer risk, thus connecting the online medical inquiry texts of user-recognized diseases with perceived cancer risk. Lastly, by combining user preferences under different disease topics with user-perceived cancer risk in multi-dimensional contexts, the preference for user-perceived cancer risk is obtained more accurately. Based on a large real-world dataset, the relationship between each context and user preferences is assessed, and the proposed method is shown to be superior to the MF-LDA method in obtaining the preference for user-perceived cancer risk. This indicates that the proposed method not only expresses user-perceived risk but also clearly expresses the characteristics of user preference. Furthermore, the integration of context with early-cancer texts and the construction of the user preference model are verified to be feasible and effective.

17.
This paper analyzes the JPEG compression algorithm in its sequential DCT-based mode of operation and, using Visual C++, implements an image encoding/decoding application based on this algorithm. Six 24-bit 640×480 bitmap images were encoded and decoded with JPEG; the results show that in typical applications the JPEG baseline system achieves compression ratios of 10 to 20 with no visually perceptible distortion, thereby delivering high-quality image compression.
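A sketch of the heart of the baseline system: 8×8 block DCT followed by quantization with the standard luminance table, then reconstruction. Entropy (Huffman) coding, chroma handling, and the file format are omitted, so this is a round-trip illustration rather than the paper's full codec.

```python
# Hedged sketch of JPEG baseline's lossy core: 8x8 block DCT + quantization.
import numpy as np
from scipy.fft import dctn, idctn

Q_LUMA = np.array([                    # standard JPEG luminance table
    [16, 11, 10, 16, 24, 40, 51, 61], [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56], [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77], [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101], [72, 92, 95, 98, 112, 100, 103, 99]])

def jpeg_roundtrip(gray, quality_scale=1.0):
    """Quantize each 8x8 block in the DCT domain and reconstruct it."""
    h, w = gray.shape
    out = np.empty_like(gray, dtype=float)
    q = Q_LUMA * quality_scale
    for y in range(0, h, 8):
        for x in range(0, w, 8):
            block = gray[y:y+8, x:x+8].astype(float) - 128.0
            coeffs = np.round(dctn(block, norm="ortho") / q)   # lossy step
            out[y:y+8, x:x+8] = idctn(coeffs * q, norm="ortho") + 128.0
    return np.clip(out, 0, 255).astype(np.uint8)

image = (np.random.rand(480, 640) * 255).astype(np.uint8)   # stand-in image
print(jpeg_roundtrip(image).shape)
```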

18.
The main idea of the original parallel distributed compensation (PDC) method is to partition the dynamics of a nonlinear system into a number of linear subsystems, design a state feedback gain for each linear subsystem, and finally generate the overall state feedback gain by fuzzy blending of these gains. A new modification of the original PDC method is proposed here so that, besides stability, the closed-loop performance of the system can be considered at the design stage. For this purpose, the state feedback gains are not kept constant across the linearized subsystems; rather, based on prescribed performance criteria, several feedback gains are associated with every subsystem, and the final gain for every subsystem is obtained by fuzzy blending of these gains. The advantage is that, for example, a faster response can be obtained for a given bound on the control input. Asymptotic stability of the closed-loop system is also guaranteed by using the Lyapunov method. To illustrate the effectiveness of the new method, control of a flexible joint robot (FJR) is investigated and the superiority of the designed controller over existing methods is demonstrated.
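A sketch of the basic PDC blending rule: the control input is a fuzzy blend of the per-subsystem state feedback gains, weighted by the rule membership degrees. The two-rule model, gains, and membership functions are illustrative assumptions; the paper's performance-based selection of several gains per subsystem is not reproduced.

```python
# Hedged sketch of basic PDC: u(x) = -sum_i h_i(x) * K_i * x, where h_i are
# normalized fuzzy membership degrees and K_i are per-subsystem gains (assumed).
import numpy as np

K = [np.array([[2.0, 1.5]]),      # gain for rule 1 (e.g. small joint angle)
     np.array([[3.5, 2.2]])]      # gain for rule 2 (e.g. large joint angle)

def memberships(x1):
    """Normalized membership degrees h_i(x) of the two fuzzy rules."""
    h1 = np.exp(-(x1 / 0.5) ** 2)         # "angle is small"
    h2 = 1.0 - h1                         # "angle is large"
    return np.array([h1, h2])

def pdc_control(x):
    """Fuzzy blend of the local state-feedback laws."""
    h = memberships(x[0])
    return -sum(hi * (Ki @ x) for hi, Ki in zip(h, K))

state = np.array([0.8, -0.2])             # e.g. [joint angle, joint velocity]
print(pdc_control(state))
```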

19.
20.
Transductive classification is a useful way to classify texts when labeled training examples are insufficient. Several algorithms for transductive classification of text collections represented in a vector space model have been proposed. However, the use of these algorithms is unfeasible in practical applications due to the independence assumption among instances or terms and other drawbacks. Network-based algorithms have emerged to avoid the drawbacks of the algorithms based on the vector space model and to improve transductive classification. Networks are mostly used for label propagation, in which some labeled objects propagate their labels to other objects through the network connections. Bipartite networks are useful for representing text collections as networks and performing label propagation. Generating this type of network avoids requirements such as collections with hyperlinks or citations, computation of similarities among all texts in the collection, and the setup of a number of parameters. In a bipartite heterogeneous network, objects correspond to documents and terms, and the connections are given by the occurrences of terms in documents. Label propagation is performed from documents to terms and then from terms to documents iteratively. Nevertheless, instead of using terms merely as a means of label propagation, in this article we propose using the bipartite network structure to define relevance scores of terms for classes through an optimization process and then propagating these relevance scores to define labels for unlabeled documents. The new document labels are used to redefine the relevance scores of terms, which in turn redefine the labels of unlabeled documents, in an iterative process. We demonstrate that the proposed approach surpasses algorithms for transductive classification based on the vector space model or networks. Moreover, we demonstrate that the proposed algorithm effectively makes use of unlabeled documents to improve classification and is faster than other transductive algorithms.
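A toy sketch of iterative propagation over the document-term bipartite structure: term class scores are computed from the currently labeled documents, unlabeled documents then take the class favoured by their terms, and the two steps are iterated. The tiny collection and the simple averaging update stand in for the article's optimization-based term relevance scores.

```python
# Hedged sketch of transductive classification via a document-term bipartite
# structure: alternate document->term and term->document score propagation.
import numpy as np

docs = ["stock market shares rise", "market trading shares",     # class 0
        "match goal football team", "football team wins match",  # class 1
        "shares trading rise", "goal wins team"]                 # unlabeled
labels = np.array([0, 0, 1, 1, -1, -1])        # -1 = unlabeled
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(w) for w in vocab] for d in docs], float)

n_classes = 2
F_docs = np.zeros((len(docs), n_classes))
for i, y in enumerate(labels):
    if y >= 0:
        F_docs[i, y] = 1.0

for _ in range(10):
    # Documents -> terms: a term's class scores average those of its documents.
    F_terms = A.T @ F_docs / np.maximum(A.sum(axis=0)[:, None], 1e-12)
    # Terms -> documents: an unlabeled document's scores average those of its terms.
    F_new = A @ F_terms / np.maximum(A.sum(axis=1)[:, None], 1e-12)
    F_docs[labels < 0] = F_new[labels < 0]     # keep known labels fixed

print(F_docs[labels < 0].argmax(axis=1))       # expected: [0 1]
```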
