Similar Documents
20 similar documents found.
1.
In this paper, we propose a novel approach to multilingual story link detection. Our approach uses the distributional features of terms on timelines and in multilingual spaces, together with selected types of named entities, to derive distinctive weights for the terms that constitute the linguistic representation of events. On timelines, term significance is calculated by comparing the term distribution of the documents on a given day with that of the total document collection. Since two languages can provide more information than one, term significance is measured in each language space and then used as a bridge between the two languages in multilingual spaces. Evaluated on Korean and Japanese news articles, our method achieved a 14.3% improvement for monolingual story pairs and a 16.7% improvement for multilingual story pairs. By measuring space density, the proposed weighting components are verified: intra-event stories show high density and inter-event stories show low density. These results indicate that the proposed method is helpful for multilingual story link detection.
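The timeline weighting described above can be sketched as follows; the log-ratio form and the function names are illustrative assumptions, not the authors' exact formula:

```python
from collections import Counter
import math

def timeline_weight(term, day_docs, all_docs):
    """Weight a term by how much its frequency on one day deviates from
    its frequency in the whole collection (assumed log-ratio form)."""
    day_counts = Counter(w for doc in day_docs for w in doc)
    all_counts = Counter(w for doc in all_docs for w in doc)
    p_day = day_counts[term] / max(1, sum(day_counts.values()))
    p_all = all_counts[term] / max(1, sum(all_counts.values()))
    if p_day == 0 or p_all == 0:
        return 0.0
    # A term that bursts on this day relative to the collection scores high
    return p_day * math.log(p_day / p_all)
```

A term that spikes in one day's news (e.g. an event word) receives a high weight, while terms absent from that day score zero.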

2.
With the explosion of multilingual content on the Web, particularly on social media platforms, identifying the languages present in a text has become an important task for various applications. Automatic language identification (ALI) in social media text is already non-trivial due to slang, misspellings, creative spellings, and special elements such as hashtags and user mentions; in a multilingual environment, ALI becomes an even more challenging task. In a highly multilingual society, code-mixing that preserves the underlying language sense has become a natural phenomenon. In such a dynamic environment, the conversational text alone often fails to reveal the underlying languages. This paper proposes several methods of exploiting social conversational features to enhance ALI performance. Although social conversational features have previously been explored for ALI using methods such as probabilistic language modeling, those models often fail to address code-mixing, phonetic typing, and out-of-vocabulary words, which are prevalent in a highly multilingual environment. This paper differs in how the social conversational features are used, proposing text refinement strategies suited to ALI in a highly multilingual environment. The contributions of this paper are therefore as follows. First, it analyzes the characteristics of various social conversational features by exploiting language usage patterns. Second, it proposes various text refinement methods suitable for language identification. Third, it investigates the effects of the proposed refinement methods using various sentence-level language identification frameworks. Experiments on three conversational datasets collected from the Facebook, YouTube, and Twitter social media platforms show that the proposed method of ALI using social conversational features outperforms the baseline counterparts.
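One plausible refinement step of the kind described above, removing social elements and collapsing creative spellings before language identification, might look like this (the specific rules are assumptions for illustration, not the paper's own strategies):

```python
import re

def refine_for_ali(text):
    """Hypothetical pre-identification refinement: strip social elements
    (hashtags, mentions, URLs) and collapse creative letter repetitions
    so the language identifier sees cleaner text."""
    text = re.sub(r"(?:@|#)\w+", " ", text)     # user mentions and hashtags
    text = re.sub(r"https?://\S+", " ", text)   # URLs
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)  # "soooo" -> "soo"
    return re.sub(r"\s+", " ", text).strip()
```

The refined text would then be passed to a sentence-level language identifier; the refinement itself is language-agnostic.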

3.
Probabilistic topic models are unsupervised generative models which model document content as a two-step generation process: documents are observed as mixtures of latent concepts or topics, while topics are probability distributions over vocabulary words. Recently, significant research effort has been invested in transferring the probabilistic topic modeling concept from monolingual to multilingual settings, and novel topic models have been designed to work with parallel and comparable texts. We define multilingual probabilistic topic modeling (MuPTM) and present the first full overview of the current research, methodology, advantages, and limitations in MuPTM. As a representative example, we choose a natural extension of the omnipresent LDA model to multilingual settings called bilingual LDA (BiLDA). We provide a thorough overview of this representative multilingual model, from its high-level modeling assumptions down to its mathematical foundations. We demonstrate how to use the data representation given by the output sets of (i) per-topic word distributions and (ii) per-document topic distributions from a multilingual probabilistic topic model in various real-life cross-lingual tasks involving different languages, without any external language-pair-dependent translation resource: (1) cross-lingual event-centered news clustering, (2) cross-lingual document classification, (3) cross-lingual semantic similarity, and (4) cross-lingual information retrieval. We also briefly review several other applications in the relevant literature, and introduce and illustrate two related modeling concepts: topic smoothing and topic pruning. In summary, this article encompasses the current research in multilingual probabilistic topic modeling. By presenting a series of potential applications, we reveal the importance of the language-independent and language-pair-independent data representations provided by MuPTM. We also provide clear directions for future research in the field through a systematic overview of how to link and transfer aspect knowledge across corpora written in different languages via the shared space of latent cross-lingual topics, that is, how to effectively employ the learned per-topic word distributions and per-document topic distributions of any multilingual probabilistic topic model in various cross-lingual applications.
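To illustrate how per-document topic distributions support translation-free cross-lingual comparison: two documents in different languages can be compared directly in the shared topic space. The cosine measure and the toy distributions below are illustrative assumptions; in practice the theta vectors would come from a trained model such as BiLDA.

```python
import math

def topic_cosine(theta_a, theta_b):
    """Similarity of two documents (possibly in different languages)
    via their distributions over shared cross-lingual topics."""
    dot = sum(a * b for a, b in zip(theta_a, theta_b))
    na = math.sqrt(sum(a * a for a in theta_a))
    nb = math.sqrt(sum(b * b for b in theta_b))
    return dot / (na * nb) if na and nb else 0.0

# Toy per-document topic distributions over 3 shared topics
en_doc = [0.7, 0.2, 0.1]   # English article, mostly topic 0
nl_doc = [0.6, 0.3, 0.1]   # Dutch article on the same event
nl_off = [0.1, 0.1, 0.8]   # Dutch article on an unrelated topic
```

No bilingual dictionary is needed: the shared latent topics act as the interlingua.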

4.
Over the past decade, worldwide Internet usage has grown tremendously, with the most rapid growth in emerging regions such as Latin America and the Middle East, where people speaking different languages actively seek information on the web. Global search engines may not adequately address local users' needs, while regional web portals may lack rich web content. Unlike search engines, web directories organize sites and pages into intuitive hierarchical structures to facilitate browsing. However, high-quality web directories in users' native languages often do not exist, and their development requires domain knowledge that is not readily available. In this research, we proposed a novel semi-automatic approach to facilitate web repository management. We applied the approach to developing web directories in the business and health-care domains for the Spanish-speaking and Arabic-speaking communities, respectively. The two directories contain 4735 and 5107 unique sites and pages, respectively, with a maximum depth of 5 levels. Results of experiments involving 37 native speakers show that these directories outperformed existing benchmark directories in terms of browsing effectiveness and efficiency, with strong implications for information professionals and multinational enterprise managers.

5.
Hate speech is an increasingly important societal issue in the era of digital communication. Hateful expressions often make use of figurative language and, although they represent, in some sense, the dark side of language, they are also often prime examples of creative language use. While hate speech is a global phenomenon, current studies on automatic hate speech detection are typically framed in a monolingual setting. In this work, we explore hate speech detection in low-resource languages by transferring knowledge from a resource-rich language, English, in a zero-shot learning fashion. We experiment with traditional and recent neural architectures, and propose two joint-learning models that use different multilingual language representations to transfer knowledge between pairs of languages. We also evaluate the impact of additional knowledge by incorporating information from a multilingual lexicon of abusive words. The results show that our joint-learning models achieve the best performance on most languages, although a simple approach that uses machine translation and a pre-trained English language model achieves robust performance. In contrast, Multilingual BERT fails to obtain good performance in cross-lingual hate speech detection. We also found experimentally that the external knowledge from a multilingual abusive lexicon improves the models' performance, specifically in detecting the positive class. The results of our experimental evaluation highlight a number of challenges and issues in this task. One of the main challenges concerns current benchmarks for hate speech detection, in particular how bias related to the topical focus of the datasets influences classification performance. The limited ability of current multilingual language models to transfer knowledge between languages in the specific task of hate speech detection also remains an open problem. However, our experimental evaluation and qualitative analysis show how the explicit integration of linguistic knowledge from a structured abusive-language lexicon helps to alleviate this issue.

6.
Effective learning schemes such as fine-tuning, zero-shot, and few-shot learning have been widely used to obtain considerable performance with only a handful of annotated training data. In this paper, we present a unified benchmark for zero-shot text classification in Turkish. For this purpose, we evaluate three methods, namely Natural Language Inference (NLI), Next Sentence Prediction, and our proposed model based on Masked Language Modeling and pre-trained word embeddings, on nine Turkish datasets covering three main categories: topic, sentiment, and emotion. We use pre-trained Turkish monolingual and multilingual transformer models, namely BERT, ConvBERT, DistilBERT, and mBERT. The results show that ConvBERT with the NLI method yields the best results, with 79%, and outperforms the previously used multilingual XLM-RoBERTa model by 19.6%. The study contributes to the literature by applying previously unattempted transformer models to Turkish and by showing that monolingual models improve zero-shot text classification performance over multilingual models.
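The NLI method mentioned above casts each candidate label as a hypothesis sentence and picks the label most strongly entailed by the input. A minimal sketch, with a word-overlap stub standing in for a real NLI model (the template and function names are assumptions):

```python
import re

def zero_shot_classify(text, labels, entail_score,
                       template="This text is about {}."):
    """NLI-style zero-shot classification: score each label's hypothesis
    against the text and return the best-scoring label."""
    scores = {label: entail_score(text, template.format(label))
              for label in labels}
    return max(scores, key=scores.get)

def overlap_score(premise, hypothesis):
    """Placeholder entailment scorer based on word overlap; a real system
    would use an NLI model such as a fine-tuned Turkish ConvBERT."""
    p = set(re.findall(r"\w+", premise.lower()))
    h = set(re.findall(r"\w+", hypothesis.lower()))
    return len(p & h) / len(h) if h else 0.0
```

No labeled training data for the target categories is required; only the entailment scorer is pre-trained.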

7.
This article presents conceptual navigation and NavCon, an architecture that implements this navigation in World Wide Web pages. The NavCon architecture uses an ontology as metadata to contextualize the user's search for information. Based on ontologies, NavCon automatically inserts conceptual links into Web pages. By following these links, the user may navigate a graph representing ontology concepts and their relationships. By browsing this graph, it is possible to reach documents associated with the desired ontology concept. We call this Web navigation supported by ontology concepts conceptual navigation. Conceptual navigation is a technique for browsing Web sites within a context. The context filters the retrieved information for relevance and guides user navigation along paths that meet the user's needs. A company may implement conceptual navigation to improve users' search for information in a knowledge management environment. We suggest that using an ontology to guide navigation in an intranet may help users better understand the knowledge structure of the company.

8.
We present three fundamental, interrelated approaches to support multiple access paths to each terminal object in information hierarchies: faceted classification, faceted search, and web directories with embedded symbolic links. This survey aims to demonstrate how each approach supports users who seek information from multiple perspectives. We achieve this by exploring each approach, the relationships between the approaches, including tradeoffs, and how they can be used in concert, while focusing on a core set of hypermedia elements common to all. This provides a foundation from which to study, understand, and synthesize applications that employ these techniques. This survey does not aim to be comprehensive, but rather focuses on thematic issues.

9.
We present a study to develop an improved understanding of symbolic links in web directories. A symbolic link is a hyperlink which makes a directed connection from a webpage along one path through a directory to a page along another path. While symbolic links are ubiquitous in web directories such as Yahoo!, they are under-studied and, as a result, their uses are poorly understood. A cursory analysis of symbolic links reveals multiple uses: to provide navigational shortcuts deeper into a directory, backlinks to more general categories, and multiclassification. We investigated these uses in the Open Directory Project (ODP), the largest, most comprehensive, and most widely distributed human-compiled taxonomy of links to websites, which makes extensive use of symbolic links. The results reveal that while symbolic links in ODP are used primarily for multiclassification, only a few multiclassification links actually span top- and second-level categories. This indicates that most symbolic links in ODP create multiclassification between topics nested more than two levels deep, and suggests that there may be multiple uses of multiclassification links. We also situate symbolic links vis-à-vis other semantic and structural link types from hypermedia. We anticipate that the results and relationships identified and discussed in this paper will provide a foundation for (1) users, for understanding the uses of symbolic links in a directory; (2) designers, to employ symbolic links more effectively when building and maintaining directories and when crafting user interfaces to them; and (3) information retrieval researchers, for further study of symbolic links in web directories.

10.
Users frequently ask Web search engines for information about people by personal name. Current search engines only return sets of documents containing the queried name but, because several people usually share a personal name, the resulting sets often mix documents relevant to different people. Disambiguating the people in these result sets is necessary to help users find the person of interest more readily. In name disambiguation, effective measurement of document similarity is a crucial step towards the final disambiguation. We propose a new method that uses web directories as a knowledge base to find common contexts in documents, and uses a common-context measure to determine document similarity. Experiments conducted on documents mentioning real people on the web, together with several well-known web directory structures, suggest that using web directories to disambiguate people has significant advantages over conventional methods.
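The common-context idea can be sketched by treating the directory categories a page falls under as its context and comparing category sets; the Jaccard measure and category paths below are illustrative choices, not necessarily the authors':

```python
def context_similarity(cats_a, cats_b):
    """Similarity of two pages mentioning the same name, measured by the
    web-directory categories they fall under (Jaccard over category sets)."""
    a, b = set(cats_a), set(cats_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Categories as directory paths; pages about the same person tend to
# share directory contexts, pages about namesakes do not
doc1 = ["Science/Physics", "Science/Academia"]
doc2 = ["Science/Physics", "News/Interviews"]
doc3 = ["Sports/Tennis"]
```

Clustering the documents by this similarity would then group pages per person.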

11.
Research on clustering algorithms in synonymy graphs of a single language has yielded promising results; however, this idea has not yet been explored in a multilingual setting. Moving the problem to a multilingual translation graph enables the use of clues and techniques not available in a monolingual synonymy graph. This article explores the potential of sense induction methods in a massively multilingual translation graph. For this purpose, the performance of graph clustering methods in synset detection is investigated. In the context of translation graphs, the use of existing Wordnets in different languages is an important clue for synset detection which cannot be utilized in a monolingual setting. Casting the problem as an unsupervised synset expansion task rather than a clustering or community detection task improves the results substantially. Furthermore, instead of a greedy unsupervised expansion algorithm guided by heuristics, we devise a supervised learning algorithm that learns synset expansion patterns from the words in existing Wordnets, achieving superior results. Since the training data is formed from already existing Wordnets, unlike in previous work, no manual labeling is required. To evaluate our methods, Wordnets for Slovenian, Persian, German, and Russian are built from scratch and compared to their manually built Wordnets or labeled test sets. Results reveal a clear improvement over two state-of-the-art algorithms targeting massively multilingual Wordnets, and competitive results with Wordnet construction methods targeting a single language. The system is able to produce Wordnets from scratch with a Wordnet base-concept coverage ranging from 20% to 88% for 51 languages, and expands existing Wordnets by up to 30%.

12.
Several language constructs and mechanisms provide some support for conceptual modeling in object-oriented programming: the Liskov substitution principle, Meyer's programming by contract, Beta's inner construct, interfaces in C# and Java, and the separation of subtype hierarchies from subclass hierarchies as in Timor and Sather. All of these are powerful and useful tools for enforcing the conceptual modeling role of the inheritance mechanism in object-oriented programming. Their purpose is to ensure the semantic compatibility of classes related by an inheritance hierarchy. Applied independently, these mechanisms can lead to more correct inheritance hierarchies that are easy to understand, use, and extend. This article discusses the different mechanisms and studies the interaction between them. It investigates whether conceptual modeling can be satisfactorily achieved using such tools, and shows that the interaction between these tools can lead to contradictions and can prevent legitimate inheritance hierarchies. The article proposes a framework for a conceptual modeling mechanism that better supports conceptual modeling at the language level.

13.
The Yugur people, who live in the Hexi Corridor of Gansu, speak two quite different languages within one ethnic group: one belongs to the Mongolic family and is called Eastern Yugur, while the other belongs to the Turkic family and is called Western Yugur. Through ethnological fieldwork, the author describes the current state of Yugur language use in detail and, on the basis of this description and a theoretical discussion, classifies the regions where Yugur is used. Specifically, owing to factors such as geographical environment, Chinese-language education, and interethnic marriage, the regions can be divided into Yugur-Chinese bilingual areas; Eastern/Western Yugur and Chinese multilingual areas; and Chinese monolingual areas (where Yugur has been lost).

14.
For historical and cultural reasons, English phrases, especially proper nouns and new words, frequently appear in Web pages written primarily in East Asian languages such as Chinese, Korean, and Japanese. Although such English terms and their equivalents in these East Asian languages refer to the same concept, they are often erroneously treated as independent index units in traditional Information Retrieval (IR). This paper describes the degree to which this problem arises in IR and proposes a novel technique to solve it. Our method first extracts English terms from native Web documents in an East Asian language, and then unifies the extracted terms and their equivalents in the native language into one index unit. For Cross-Language Information Retrieval (CLIR), one of the major hindrances to achieving retrieval performance at the level of Mono-Lingual Information Retrieval (MLIR) is the translation of query terms that cannot be found in a bilingual dictionary. The Web mining approach proposed in this paper for concept unification of terms in different languages can also be applied to this well-known challenge in CLIR. Experimental results based on the NTCIR and KT-Set test collections show that the high translation precision of our approach greatly improves the performance of both Mono-Lingual and Cross-Language Information Retrieval.
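One way to realize the unified index unit is to map an extracted English term and its native-language equivalent to a shared concept identifier at both indexing and query time; the mapping table and names below are hypothetical:

```python
def unify_terms(tokens, concept_map):
    """Map an English term and its native-language equivalent to one
    shared index unit, so both index and query hit the same posting list."""
    return [concept_map.get(tok, tok) for tok in tokens]

# Hypothetical mined pair: an English term and its Korean equivalent
# share one concept identifier
CONCEPT_MAP = {
    "olympics": "CONCEPT_OLYMPICS",
    "올림픽": "CONCEPT_OLYMPICS",
}
```

Applying the same mapping to documents and queries means a Korean query can match an English mention of the same concept without any runtime translation.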

15.
Stochastic simulation has been very effective in many domains but had never been applied to the WWW. This study is the first to use neural networks in stochastic simulation of the number of rejected Web pages per search query. The evaluation of the quality of search engines should involve not only the resulting set of Web pages but also an estimate of the rejected set of Web pages. The iterative radial basis function (RBF) neural network developed by Meghabghab and Nasr [Iterative RBF neural networks as meta-models for stochastic simulations, in: Second International Conference on Intelligent Processing and Manufacturing of Materials, IPMM'99, Honolulu, Hawaii, 1999, pp. 729–734] was adapted to the actual evaluation of the number of rejected Web pages on four search engines: Yahoo, Alta Vista, Google, and Northern Light. Nine input variables were selected for the simulation: (1) precision, (2) overlap, (3) response time, (4) coverage, (5) update frequency, (6) boolean logic, (7) truncation, (8) word and multi-word searching, and (9) portion of the Web pages indexed. Typical stochastic simulation meta-modeling uses regression models in response surface methods. RBF networks are a natural choice for such an attempt because they use a family of surfaces, each of which naturally divides an input space into two regions X+ and X−, with the n test patterns assigned to either class X+ or X−. This technique divides the set of responses to a query into accepted and rejected Web pages. To test the hypothesis that the evaluation of any search engine query should include an estimate of the number of rejected Web pages, the RBF meta-model was trained on 937 examples from a set of 9000 different simulation runs over the nine input variables. Results show that two of the variables, response time and portion of the Web indexed, can be eliminated without affecting evaluation results. Results also show that the number of rejected Web pages for a specific set of search queries on these four engines is very high. Moreover, a goodness measure of a search engine for a given set of queries can be designed as a function of the engine's coverage and the normalized age of a new document in the result set for the query. This study concludes that unless search engine designers address the issues of rejected Web pages, indexing, and crawling, the usage of the Web as a research tool for academic and educational purposes will remain hindered.

16.
The language barrier has become the main obstacle to symmetric information flow on the Internet. To fully realize bilateral and multilateral cross-border and cross-region online business information services, researching and developing multilingual network platforms is a necessary condition. Through a brief analysis of the growing demand for multilingual network platforms in the Beibu Gulf Economic Zone of Guangxi, this article proposes several measures for building e-commerce network platforms oriented toward ASEAN countries.

17.
Noise reduction through summarization for Web-page classification
Due to the large variety of noisy information embedded in Web pages, Web-page classification is much more difficult than pure-text classification. In this paper, we propose to improve Web-page classification performance by removing the noise through summarization techniques. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then put forward a new Web-page summarization algorithm based on Web-page layout and evaluate it, along with several other state-of-the-art text summarization algorithms, on the LookSmart Web directory. Experimental results show that classification algorithms (NB or SVM) augmented by any of the summarization approaches achieve an improvement of more than 5.0% compared to pure-text-based classification algorithms. We further introduce an ensemble method to combine the different summarization algorithms, which achieves an improvement of more than 12.0% over pure-text-based methods.
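The summarize-then-classify pipeline can be sketched as below; a lead-sentence summarizer stands in for the paper's layout-based algorithm, and all names are assumptions:

```python
def lead_summary(page_sentences, k=2):
    """Stand-in summarizer: keep the first k sentences. The paper's own
    algorithm uses Web-page layout, which is not reproduced here."""
    return page_sentences[:k]

def classify_with_summary(page_sentences, classify, k=2):
    """Noise reduction through summarization: classify the summary
    rather than the full, noisy page text."""
    return classify(" ".join(lead_summary(page_sentences, k)))
```

The classifier (e.g. NB or SVM) never sees the trailing boilerplate and ads, which is the source of the reported gains.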

18.
A challenge for sentence categorization and novelty mining is to detect not only when text is relevant to the user's information need, but also when it contains something new which the user has not seen before. This involves two tasks: identifying relevant sentences (categorization) and identifying new information within those relevant sentences (novelty mining). Many previous studies of relevant-sentence retrieval and novelty mining have been conducted on English, but few papers have addressed multilingual sentence categorization and novelty mining. This is an important issue in global business environments, where mining knowledge from text in a single language is not sufficient. In this paper, we perform the first task by categorizing Malay and Chinese sentences and comparing their performance with that of English. Thereafter, we conduct novelty mining to identify the sentences containing new information. Experimental results on TREC 2004 Novelty Track data show similar categorization performance on Malay and English sentences, both of which greatly outperform Chinese. In the second task, we achieve similar novelty mining results for all three languages, which indicates that our algorithm is suitable for novelty mining of multilingual sentences. In addition, benchmarking our results against novelty mining without categorization shows that categorization is necessary for successful novelty mining.
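A common way to operationalize novelty mining, assumed here for illustration, is to keep a relevant sentence only if its maximum cosine similarity to previously delivered sentences stays below a threshold (the 0.6 value is illustrative):

```python
import math
from collections import Counter

def cosine(a, b):
    """Bag-of-words cosine similarity between two sentences."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def novel_sentences(relevant, threshold=0.6):
    """Keep a relevant sentence only if it is not too similar to any
    sentence already delivered."""
    delivered = []
    for s in relevant:
        if all(cosine(s, t) < threshold for t in delivered):
            delivered.append(s)
    return delivered
```

The same threshold mechanism applies to any language whose sentences can be tokenized, which is consistent with the abstract's claim that novelty mining transfers across languages.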

19.
Li Xue, Zhao Xinli, Yang Kaijing. 《情报科学》 (Information Science), 2007, 25(12): 1879-1882
Web pages today generally allow language selection only on the home page and cannot switch the current page among multiple languages from within other pages. This paper analyzes traditional multilingual solutions for Web applications, explains in detail how to implement dynamic multilingual switching in Web applications based on the MVC design pattern, and presents a multilingual application example: the Macau Memory Project.

20.
We develop a two-phase survey design, based on the uses-and-gratifications approach and the theory of planned behavior, to analyze competitive relations between search engines and traditional information sources. We apply the survey design in a large-scale empirical study with 14- to 66-year-old Internet users (mean age 32) to find out whether complementary or substitutional dependencies predominate between search engines and three traditional information sources: paper-based encyclopedias, paper-based yellow pages, and telephone-based directory assistance. We find that search engines, compared to the traditional alternatives, gratify a wider range of users' needs. Although yellow pages and directory assistance are potentially substitutable, encyclopedias serve needs that search engines cannot (yet) fulfill. The traditional media companies face increased competition, but do not necessarily have to be in an inferior competitive position.
