Similar literature
1.
Language modeling (LM), which provides a principled mechanism for assigning quantitative scores to sequences of words or tokens, has long been an interesting yet challenging problem in speech and language processing. The n-gram model is still the predominant method, while a number of disparate LM methods, exploring either lexical co-occurrence or topic cues, have been developed to complement the n-gram model with some success. In this paper, we explore a novel language modeling framework for speech recognition built on the notion of relevance, where the relationship between a search history and the word being predicted is discovered through different granularities of semantic context for relevance modeling. Experiments on a large vocabulary continuous speech recognition (LVCSR) task suggest that the language models derived from our framework are comparable to existing language models in terms of both perplexity and recognition error rate reductions.

2.
Opinion mining is one of the most important research tasks in the information retrieval research community. With the huge volume of opinionated data available on the Web, approaches must be developed to differentiate opinion from fact. In this paper, we present a lexicon-based approach for opinion retrieval. Generally, opinion retrieval consists of two stages: relevance to the query and opinion detection. In our work, we focus on the second stage, which detects opinionated documents. We compare the document to be analyzed with opinionated sources that contain subjective information. We hypothesize that a document with a strong similarity to opinionated sources is more likely to be opinionated itself. Typical lexicon-based approaches select and process their opinion sources according to their test collection, then calculate the opinion score based on the frequency of subjective terms in the document. In our work, we use different open opinion collections without any specific treatment and consider them as a reference collection. We then use language models to determine opinion scores. The analysis document and reference collection are represented by different language models (i.e., Dirichlet, Jelinek-Mercer and two-stage models). These language models are generally used in information retrieval to represent the relationship between documents and queries. However, in our study, we modify these language models to represent opinionated documents. We carry out several experiments using Text REtrieval Conference (TREC) Blogs 06 as our analysis collection and Internet Movie Database (IMDB), Multi-Perspective Question Answering (MPQA) and CHESLY as our reference collection. To improve opinion detection, we study the impact of using different language models to represent the document and reference collection alongside different combinations of opinion and retrieval scores. We then use this data to deduce the best opinion detection models. Using the best models, our approach improves on the best baseline of TREC Blog (baseline4) by 30%.
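For readers unfamiliar with the three smoothing schemes named in this abstract, the following is a minimal sketch of their standard textbook forms. It is not the paper's implementation: the function names, the toy parameter defaults, and the use of the collection model as the background model in the two-stage variant are all assumptions made for illustration.

```python
from collections import Counter


def collection_model(docs):
    """Maximum-likelihood unigram model over all tokens in `docs` (lists of tokens)."""
    counts = Counter(tok for doc in docs for tok in doc)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def jelinek_mercer(word, doc, coll, lam=0.5):
    """Linear interpolation of the document model with the collection model.

    Here `lam` is the weight on the collection model (conventions vary).
    """
    p_ml = Counter(doc)[word] / len(doc) if doc else 0.0
    return (1 - lam) * p_ml + lam * coll.get(word, 0.0)


def dirichlet(word, doc, coll, mu=2000.0):
    """Dirichlet-prior smoothing: the collection model acts as `mu` pseudo-counts."""
    tf = Counter(doc)[word]
    return (tf + mu * coll.get(word, 0.0)) / (len(doc) + mu)


def two_stage(word, doc, coll, mu=2000.0, lam=0.5):
    """Dirichlet smoothing followed by interpolation with a background model.

    Assumption: the background model is approximated by the collection model.
    """
    return (1 - lam) * dirichlet(word, doc, coll, mu) + lam * coll.get(word, 0.0)
```

With a reference collection in place of `docs`, the smoothed probability of each term in an analysis document could be combined into a score, although how the paper actually combines them is not stated in the abstract.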

3.
Over the years, various meta-languages have been used to manually enrich documents with conceptual knowledge of some kind. Examples include keyword assignment to citations or, more recently, tags to websites. In this paper we propose generative concept models as an extension to query modeling within the language modeling framework, which leverages these conceptual annotations to improve retrieval. By means of relevance feedback the original query is translated into a conceptual representation, which is subsequently used to update the query model.

4.
The term mismatch problem is a critical problem in information retrieval, and several techniques, such as query expansion, cluster-based retrieval and dimensionality reduction, have been developed to resolve it. Of these techniques, this paper performs an empirical study on query expansion and cluster-based retrieval. We examine the effect of using parsimony in query expansion and the effect of clustering algorithms in cluster-based retrieval. In addition, query expansion and cluster-based retrieval are compared, and their combinations are evaluated in terms of retrieval performance through experiments on seven test collections of NTCIR and TREC.

5.
In this paper, a new robust relevance model is proposed that can be applied to both pseudo and true relevance feedback in the language-modeling framework for document retrieval. There are at least three main differences between our new relevance model and other relevance models. First, the proposed model brings the original query back into the relevance model by treating it as a short, special document, in addition to a number of top-ranked documents returned from the first round retrieval for pseudo feedback, or a number of relevant documents for true relevance feedback. Second, instead of using a uniform prior as in the original relevance model proposed by Lavrenko and Croft, documents are assigned different priors according to their lengths (in terms) and ranks in the first round retrieval. Third, the probability of a term in the relevance model is further adjusted by its probability in a background language model. In both pseudo and true relevance cases, we have compared the performance of our model to that of two baselines: the original relevance model and a linear combination model. Our experimental results show that the proposed new model outperforms both baselines in terms of mean average precision.

6.
In the KL divergence framework, the extended language modeling approach faces the critical problem of estimating a query model, which is the probabilistic model that encodes the user’s information need. For query expansion in initial retrieval, the translation model has been proposed to incorporate term co-occurrence statistics. However, the translation model is difficult to apply, because the term co-occurrence statistics must be constructed offline. Especially in a large collection, constructing such a large matrix of term co-occurrence statistics prohibitively increases time and space complexity. In addition, reliable retrieval performance cannot be guaranteed because the translation model may include noisy non-topical terms in documents. To resolve these problems, this paper investigates an effective method to construct co-occurrence statistics and eliminate noisy terms by employing a parsimonious translation model. The parsimonious translation model is a compact version of a translation model that can reduce the number of terms with non-zero probabilities by eliminating non-topical terms in documents. Through experimentation on seven different test collections, we show that the query model estimated from the parsimonious translation model significantly outperforms not only the baseline language modeling approach, but also the non-parsimonious models.
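As background for the KL divergence framework mentioned in this abstract, here is a minimal sketch of how a query model can be used to rank documents. It is a generic illustration, not the paper's method: the function names and the dictionary representation of models are assumptions, and each document model is assumed to be smoothed so that every query term has non-zero probability.

```python
import math


def kl_divergence(query_model, doc_model):
    """KL(query || document) over the terms with non-zero query probability."""
    return sum(
        p_q * math.log(p_q / doc_model[w])
        for w, p_q in query_model.items()
        if p_q > 0
    )


def kl_rank_score(query_model, doc_model):
    """Rank-equivalent score: negative cross-entropy of the query under the document.

    This differs from -KL only by the query entropy, which is the same for
    every document, so both produce the same ranking.
    """
    return sum(
        p_q * math.log(doc_model[w])
        for w, p_q in query_model.items()
        if p_q > 0
    )


def rank(query_model, doc_models):
    """Return document ids sorted from best to worst match."""
    return sorted(
        doc_models,
        key=lambda d: kl_rank_score(query_model, doc_models[d]),
        reverse=True,
    )
```

In this setting, query expansion amounts to spreading probability mass in `query_model` over additional terms; a parsimonious model would keep only a small number of those terms non-zero.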

7.
The application of natural language processing (NLP) to financial fields is advancing with an increase in the number of available financial documents. Transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT) have been successful in NLP in recent years. These cutting-edge models have been adapted to the financial domain by applying financial corpora to existing pre-trained models and by pre-training with the financial corpora from scratch. In Japanese, by contrast, financial terminology cannot be applied from a general vocabulary without further processing. In this study, we construct language models suitable for the financial domain. Furthermore, we compare methods for adapting language models to the financial domain, such as pre-training methods and vocabulary adaptation. We confirm that the adaptation of a pre-training corpus and tokenizer vocabulary based on a corpus of financial text is effective in several downstream financial tasks. No significant difference is observed between pre-training with the financial corpus and continuous pre-training from the general language model with the financial corpus. We have released our source code and pre-trained models.

8.
Warning: This paper contains abusive samples that may cause discomfort to readers. Abusive language on social media reinforces prejudice against an individual or a specific group of people, which greatly hampers freedom of expression. With the rise of large-scale pre-trained language models, classification based on pre-trained language models has gradually become a paradigm for automatic abusive language detection. However, the effect of stereotypes inherent in language models on the detection of abusive language remains unknown, although this may further reinforce biases against minorities. To this end, in this paper, we use multiple metrics to measure the presence of bias in language models and analyze the impact of these inherent biases in automatic abusive language detection. On the basis of this quantitative analysis, we propose two different debiasing strategies, token debiasing and sentence debiasing, which are jointly applied to reduce the bias of language models in abusive language detection without degrading the classification performance. Specifically, for the token debiasing strategy, we reduce the discrimination of the language model against protected attribute terms of a certain group by random probability estimation. For the sentence debiasing strategy, we replace protected attribute terms and augment the original text by counterfactual augmentation to obtain debiased samples, and use the consistency regularization between the original data and the augmented samples to eliminate the bias at the sentence level of the language model. The experimental results confirm that our method can not only reduce the bias of the language model in the abusive language detection task, but also effectively improve the performance of abusive language detection.

9.
Theory-based model of factors affecting information overload   Cited by: 1 (self-citations: 0, citations by others: 1)
As the volume of available information increases, individuals and organisations become overwhelmed by the plethora of information. This can reduce productivity and performance, hinder learning and innovation, affect decision making and well-being and cost organisations large amounts of money. This paper develops a new theory-based model of factors affecting information overload and provides a formula for calculating the extent of overload, potentially of use as a diagnostic tool supporting individual or organisational development.

10.
11.
The paper combines a comprehensive account of the probabilistic model of retrieval with new systematic experiments on TREC Programme material. It presents the model from its foundations through its logical development to cover more aspects of retrieval data and a wider range of system functions. Each step in the argument is matched by comparative retrieval tests, to provide a single coherent account of a major line of research. The experiments demonstrate, for a large test collection, that the probabilistic model is effective and robust, and that it responds appropriately, with major improvements in performance, to key features of retrieval situations. Part 1 covers the foundations and the model development for document collection and relevance data, along with the test apparatus. Part 2 covers the further development and elaboration of the model, with extensive testing, and briefly considers other environment conditions and tasks, model training, concluding with comparisons with other approaches and an overall assessment. Data and results tables for both parts are given in Part 1. Key results are summarised in Part 2.

12.
Mobile agent technology has been used in various applications including e-commerce, information processing, distributed network management, and database access. Information search and retrieval can be conducted by mobile agents in a decentralized system. Compared with the client/server model, the mobile agent approach has the advantage of saving network bandwidth and offering flexibility in information search and retrieval. In this paper, we present a model for mobile agents to select the most reputable information host to search and retrieve information. We use an opinion-based belief structure to represent, aggregate and calculate the reputation of an information host. Since reputation is a multi-faceted concept, our approach first allows the users to rank each information host's quality of service based on a set of evaluation categories. Then, a comprehensive, final reputation of the host is obtained by aggregating those specific category reputations. To recognize the subjective nature of a reputation, the transferable belief model is used to represent and rank the category reputation. Experiments are conducted using the Aglets technology to illustrate mobile agent migration.

13.
In this paper, we lay out a relational approach for indexing and retrieving photographs from a collection. The increase of digital image acquisition devices, combined with the growth of the World Wide Web, requires the development of information retrieval (IR) models and systems that provide fast access to images searched by users in databases. The aim of our work is to develop an IR model suited to images, integrating rich semantics for representing this visual data and user queries, which can also be applied to large corpora.

14.
The article analyzes user–IR system interaction from the broad, socio-cognitive perspective of lessons we can learn about human brain evolution when we compare the Neanderthal brain to the human brain before and after a small human brain mutation is hypothesized to have occurred 35,000–75,000 years ago. The enhanced working memory mutation enabled modern humans (i) to decode unfamiliar environmental stimuli with greater focusing power on adaptive solutions to environmental changes and problems, and (ii) to encode environmental stimuli in more efficient, generative knowledge structures. A sociological theory of these evolving, more efficient encoding knowledge structures is given. These new knowledge structures instilled in humans not only the ability to adapt to and survive novelty and/or changing conditions in the environment, but they also instilled an imperative to do so. Present day IR systems ignore the encoding imperative in their design framework. To correct for this lacuna, we propose the evolutionary-based socio-cognitive framework model for designing interactive IR systems. A case study is given to illustrate the functioning of the model.

15.
A rapid increase in the use of web-based technologies – and corresponding changes in government and local council policies – in recent years means that many vital services are now provided solely online. While this has many potential benefits, it can place additional burdens on certain demographic groups, some of whom may become considerably disadvantaged or even disenfranchised. This is particularly problematic for English-as-a-Second-Language (ESL) speakers, who are often immigrants or refugees and thus have a greater need to access these e-government services, and who may struggle to understand and assess the relevance of complex documents. In this work we investigate the search behaviours and performance of native English speakers and two different groups of ESL speakers when completing e-government tasks, and the effect of document readability/complexity. In contrast with previous work, our results show significant differences between groups of varying language proficiency in terms of objective search performance, time on task, and self-perceived performance and confidence. We also demonstrate that document reading level moderates the effect of language proficiency on objective search performance. The findings contribute to our existing understanding of how English language proficiency affects search for e-government topics, and have important implications for the future development of e-government services to ensure more equitable access and use.

16.
In this paper, we present the state of the art in the field of information retrieval that is relevant for understanding how to design information retrieval systems for children. We describe basic theories of human development to explain the specifics of young users, i.e., their cognitive skills, fine motor skills, knowledge, memory and emotional states in so far as they differ from those of adults. We derive the implications these differences have for the design of information retrieval systems for children. Furthermore, we summarize the main findings about children’s search behavior from multiple user studies. These findings are important for understanding children’s information needs, their search strategies and their usage of information retrieval systems. We also identify several weaknesses of previous user studies about children’s information-seeking behavior. Guided by the findings of these user studies, we describe challenges for the design of information retrieval systems for young users. We give an overview of algorithms and user interface concepts. We also describe existing information retrieval systems for children, specifically web search engines and digital libraries. We conclude with a discussion of open issues and directions for further research. The survey provided in this paper is important both for designers of information retrieval systems for young users and for researchers who start working in this field.

17.
Benchmarks are vital tools in the performance measurement, evaluation, and comparison of computer hardware and software systems. Standard benchmarks such as the TREC, TPC, SPEC, SAP, Oracle, Microsoft, IBM, Wisconsin, AS3AP, OO1, OO7, XOO7 benchmarks have been used to assess system performance. These benchmarks are domain-specific and domain-dependent in that they model typical applications and are tied to a problem domain. Test results from these benchmarks are estimates of possible system performance for certain pre-determined problem types. When the user domain differs from the standard problem domain or when the application workload diverges from the standard workload, they do not provide an accurate way to measure the system performance of the user problem domain. System performance of the actual problem domain in terms of data and transactions may vary significantly from the standard benchmarks. In this research, we address the issue of generalization and precision of the benchmark workload model for web search technology. The current performance measurement and evaluation method suffers from a rough estimate of system performance which varies widely when the problem domain changes. The performance results provided by the vendors can be neither reproduced nor reused in the real users’ environment. Hence, in this research, we tackle the issue of domain boundness and workload boundness, which represents the root of the problem of imprecise, unrepresentative, and irreproducible performance results. We address the issue by presenting a domain-independent and workload-independent workload model benchmark method which is developed from the perspective of the user requirements and generic constructs. We present a user-driven workload model to develop a benchmark in a process of workload requirements representation, transformation, and generation via the common carrier of generic constructs. We aim to create a more generalized and precise evaluation method which derives test suites from the actual user domain and application setting. The workload model benchmark method comprises three main components: a high-level workload specification scheme, a translator of the scheme, and a set of generators to generate the test database and the test suite. They are based on the generic constructs. The specification scheme is used to formalize the workload requirements. The translator is used to transform the specification. The generator is used to produce the test database and the test workload. We determine the generic constructs via the analysis of search methods. The generic constructs form a page model, a query model, and a control model in the workload model development. The page model describes the web page structure. The query model defines the logic used to query the web. The control model defines the control variables to set up the experiments. In this study, we have conducted ten baseline research experiments to validate the feasibility and validity of the benchmark method. An experimental prototype is built to execute these experiments. Experimental results demonstrate that the method based on generic constructs and driven by the perspective of user requirements is capable of modeling the standard benchmarks as well as more general benchmark requirements.

18.
This paper describes our novel retrieval model that is based on contexts of query terms in documents (i.e., document contexts). Our model is novel because it explicitly takes the document contexts into account instead of implicitly using them to find query expansion terms. Our model is based on simulating a user making relevance decisions, and it is a hybrid of various existing effective models and techniques. It estimates the relevance decision preference of a document context as the log-odds and uses smoothing techniques as found in language models to solve the problem of zero probabilities. It combines these estimated preferences of document contexts using different types of aggregation operators that comply with different relevance decision principles (e.g., aggregate relevance principle). Our model is evaluated using retrospective experiments (i.e., with full relevance information), because such experiments can (a) reveal the potential of our model, (b) isolate the problems of the model from those of the parameter estimation, (c) provide information about the major factors affecting the retrieval effectiveness of the model, and (d) show whether the model obeys the probability ranking principle. Our model is promising as its mean average precision is 60–80% in our experiments using different TREC ad hoc English collections and the NTCIR-5 ad hoc Chinese collection. Our experiments showed that (a) the operators that are consistent with the aggregate relevance principle were effective in combining the estimated preferences, and (b) estimating probabilities using the contexts in the relevant documents can produce better retrieval effectiveness than using the entire relevant documents.

19.
Synchronous collaborative information retrieval (SCIR) is concerned with supporting two or more users who search together at the same time in order to satisfy a shared information need. SCIR systems represent a paradigmatic shift in the way we view information retrieval, moving from an individual to a group process, and as such the development of novel IR techniques is needed to support this. In this article we present what we believe are two key concepts for the development of effective SCIR, namely division of labour (DoL) and sharing of knowledge (SoK). Together these concepts enable coordinated SCIR such that redundancy across group members is reduced whilst enabling each group member to benefit from the discoveries of their collaborators. In this article we outline techniques from state-of-the-art SCIR systems which support these two concepts, primarily through the provision of awareness widgets. We then outline some of our own work into system-mediated techniques for division of labour and sharing of knowledge in SCIR. Finally, we conclude with a discussion on some possible future trends for these two coordination techniques.

20.
This paper presents an overview of automatic methods for building domain knowledge structures (domain models) from text collections. Applications of domain models have a long history within knowledge engineering and artificial intelligence. In the last couple of decades they have surfaced noticeably as a useful tool within natural language processing, information retrieval and semantic web technology. Inspired by the ubiquitous propagation of domain model structures that are emerging in several research disciplines, we give an overview of the current research landscape and some techniques and approaches. We will also discuss trade-offs between different approaches and point to some recent trends.
