Similar Documents
20 similar documents found.
1.
Recommendation is an effective marketing tool widely used in the e-commerce business, and recommendations can be made based on ratings predicted from the rating data of purchased items. To improve the accuracy of rating prediction, user reviews or product images have been used separately as side information to learn the latent features of users (items). In this study, we developed a hybrid approach that analyzes both user sentiments from review texts and user preferences from item images to make item recommendations more personalized. The hybrid model consists of two parallel modules that perform a procedure named multiscale semantic and visual analyses (MSVA). The first module conducts semantic analysis on review documents in various aspects with word-aware and scale-aware attention mechanisms, while the second module extracts visual features with block-aware and visual-aware attention mechanisms. The MSVA model was trained, validated and tested using Amazon Product Data containing sampled reviews varying from 492,970 to 1 million records across 22 different domains. Three state-of-the-art recommendation models were used as baselines for performance comparison. On average, MSVA reduced the mean squared error (MSE) of predicted ratings by 6.00%, 3.14% and 3.25% relative to the three baselines. The results demonstrate that combining semantic and visual analyses enhanced MSVA's performance across a wide variety of products, and that the multiscale scheme used in both the review and visual modules made significant contributions to the rating prediction.

2.
The way that users provide feedback on items regarding their satisfaction varies among systems: in some systems, only explicit ratings can be entered; in others, textual reviews are accepted; and in some systems, both feedback types are accommodated. Recommender systems can readily exploit explicit ratings in the rating prediction and recommendation formulation process; however, textual reviews (which in the context of many social networks are abundant and significantly outnumber numeric ratings) need to be converted to numeric ratings. While numerous approaches exist that calculate a user's rating based on the respective textual review, all such approaches may introduce errors: converting a textual review to a rating involves a level of uncertainty, due to the characteristics of human language, and therefore the calculated ratings may not accurately reflect the actual ratings that the corresponding user would enter. In this work, (1) we examine the features of textual reviews that affect the reliability of the review-to-rating conversion procedure, (2) we compute a confidence level for each rating, which reflects the uncertainty level of each conversion, (3) we exploit this metric both in the user similarity computation and in the prediction formulation phases of recommender systems, by presenting a novel rating prediction algorithm, and (4) we validate the presented algorithm in terms of (i) rating prediction accuracy, using widely used recommender systems datasets, and (ii) satisfaction and precision of recommendations generated for social network users, where textual reviews are abundant.
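The following is a minimal sketch, under our own assumptions rather than the paper's exact formulation, of how a per-rating confidence level could be folded into a neighbourhood-based rating prediction: the contribution of each review-derived rating is weighted by both user similarity and conversion confidence.

```python
# Minimal sketch (illustrative assumptions, not the paper's algorithm):
# confidence-weighted neighbourhood prediction.
import numpy as np

def predict_rating(neighbour_ratings, similarities, confidences):
    """neighbour_ratings: ratings converted from neighbours' reviews;
    similarities: user-user similarity scores;
    confidences: values in [0, 1] reflecting how reliable each
    review-to-rating conversion is."""
    w = np.asarray(similarities) * np.asarray(confidences)
    r = np.asarray(neighbour_ratings)
    return float(np.sum(w * r) / (np.sum(np.abs(w)) + 1e-9))

# Three neighbours: a confident conversion counts more than a shaky one.
print(predict_rating([4.0, 2.0, 5.0], [0.9, 0.8, 0.3], [0.95, 0.4, 0.7]))
```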

3.
Integrating useful input information is essential to provide efficient recommendations to users. In this work, we focus on improving item rating prediction by merging the multi-context and multi-criteria research directions, which have been addressed separately in most of the existing literature. Throughout this article, criteria refer to item attributes, while context denotes the circumstances in which a user uses an item. Our goal is to capture finer-grained preferences to improve item recommendation quality, using users' multi-criteria ratings under specific contextual situations. Therefore, we examine recommender data from a graph-theoretic perspective by representing three types of entities (users, contextual situations and criteria), as well as their relationships, as a tripartite graph. Based on the assumption that contextually similar users tend to have similar interests in similar item criteria, we perform high-order co-clustering on the tripartite graph to simultaneously partition the graph entities representing users in similar contextual situations and their evaluated item criteria. To predict cluster-based multi-criteria ratings, we introduce an improved rating prediction method that considers the dependency between users and their contextual situations, and also takes into account the correlation between criteria in the prediction process. The predicted multi-criteria ratings are finally aggregated into a single representative output corresponding to an overall item rating. To guide our investigation, we formulate a research hypothesis about the tripartite graph partitioning and design clear and justified preliminary experiments, including quantitative and qualitative analyses, to validate it. Further thorough experiments on the two available context-aware multi-criteria datasets, TripAdvisor and Educational, demonstrate that our proposal exhibits substantial improvements over alternative recommendation approaches.

4.
Existing approaches for identifying the quality of answers in online health question answering (HQA) communities either address it subjectively through human assessment or rely mainly on textual features. This process may be time-consuming and may lose the semantic information of answers. We present an automatic approach for predicting answer quality that combines sentence-level semantics with textual and non-textual features in the context of online healthcare. First, we extend the knowledge adoption model (KAM) theory to obtain six dimensions of quality measures for textual and non-textual features. Then we apply the Bidirectional Encoder Representations from Transformers (BERT) model to extract semantic features. Next, the multi-dimensional features are processed for dimensionality reduction using linear discriminant analysis (LDA). Finally, we feed the preprocessed features into the proposed BK-XGBoost method to automatically predict answer quality. The proposed method is validated on a real-world dataset with 48,121 question-answer pairs crawled from the most popular online HQA communities in China. The experimental results indicate that our method compares favorably with the baseline models on various evaluation metrics. We found up to 2.9% and 5.7% improvements in AUC value in comparison with the BERT and XGBoost models, respectively.
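A minimal sketch of this kind of pipeline, with illustrative shapes and placeholder features rather than the paper's actual BK-XGBoost implementation: BERT embeddings plus handcrafted quality features are concatenated, reduced with LDA, and passed to an XGBoost classifier.

```python
# Illustrative sketch (random stand-in data; not the paper's implementation).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
X_bert = rng.normal(size=(1000, 768))   # stand-in for BERT [CLS] vectors
X_extra = rng.normal(size=(1000, 6))    # six KAM-style quality features
y = rng.integers(0, 2, size=1000)       # answer quality label (0/1)

X = np.hstack([X_bert, X_extra])
# With binary labels, LDA keeps at most n_classes - 1 = 1 component.
X_red = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)
clf = XGBClassifier(n_estimators=200, eval_metric="logloss").fit(X_red, y)
```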

5.
We present a novel multimodal query expansion strategy, based on genetic programming (GP), for image search in visually oriented e-commerce applications. Our GP-based approach aims both at learning to expand queries with multimodal information and at learning to compute the "best" ranking for the expanded queries. However, unlike previous work, the query is expressed only in terms of visual content, which brings several challenges for this type of application. To evaluate the effectiveness of our method, we collected two datasets containing images of clothing products taken from different online shops. Experimental results indicate that our method is an effective alternative for improving the quality of image search results when compared to a genetic programming system based only on visual information. Our method achieves gains varying from 10.8% against the strongest learning-to-rank baseline to 54% against an ad hoc solution specialized for the particular domain at hand.

6.
Analyzing and extracting insights from user-generated data has become a topic of interest among businesses and research groups, because such data contains valuable information, e.g., consumers' opinions, ratings, and recommendations of products and services. However, the true value of social media data is rarely discovered due to information overload. Existing literature on analyzing online hotel reviews mainly focuses on a single data resource, lexicon, or analysis method, and rarely provides marketing insights and decision-making information to improve a business's service and product quality. We propose an integrated framework that includes a data crawler, data preprocessing, sentiment-sensitive tree construction, convolution tree kernel classification, aspect extraction and category detection, and visual analytics to gain insights into hotel ratings and reviews. The empirical findings show that our proposed approach outperforms baseline algorithms as well as well-known sentiment classification methods, achieving high precision (0.95) and recall (0.96). The visual analytics results reveal that Business travelers tend to give lower ratings, while Couples tend to give higher ratings. In general, users tend to rate lowest in July and highest in December. Business travelers more frequently use negative keywords, such as "rude," "terrible," "horrible," "broken," and "dirty," to express dissatisfaction with their hotel stays in July.
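As a quick check on the reported scores, the balanced F-measure follows directly from the stated precision and recall:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.95 \times 0.96}{0.95 + 0.96}
    \approx 0.955
```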

7.
This paper describes and evaluates various stemming and indexing strategies for the Czech language. Based on a Czech test collection, we designed and evaluated two stemming approaches, a light one and a more aggressive one. We compared them with a no-stemming scheme as well as a language-independent approach (n-grams). To evaluate the suggested solutions, we used various IR models, including Okapi, Divergence from Randomness (DFR), a statistical language model (LM), and the classical tf-idf vector-space approach. We found that the Divergence from Randomness paradigm tends to provide better retrieval effectiveness than the Okapi, LM, or tf-idf models; however, the performance differences were statistically significant only for the last two IR approaches. Ignoring stemming generally reduces MAP by more than 40%, and these differences are always significant. Finally, although our more aggressive stemmer tends to show the best performance, the performance differences from the light stemmer are not statistically significant.
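As an illustration of the language-independent alternative mentioned above, a minimal character n-gram tokenizer (the n-gram length here is an arbitrary choice, not the paper's setting):

```python
# Language-independent character n-gram indexing: no stemmer or other
# linguistic resources are needed.
def char_ngrams(text, n=4):
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# A Czech word with diacritics:
print(char_ngrams("předseda"))  # ['před', 'ředs', 'edse', 'dsed', 'seda']
```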

8.
Aspect mining, which aims to extract ad hoc aspects from online reviews and predict the rating or opinion on each aspect, can satisfy personalized needs for the evaluation of specific aspects of product quality. Recently, with the increase of related research, how to effectively integrate rating and review information has become the key issue in addressing this problem. Considering that matrix factorization is an effective tool for rating prediction and topic modeling is widely used for review processing, it is natural to combine matrix factorization and topic modeling for aspect mining (also called aspect rating prediction). However, this idea faces several challenges: how to choose suitable sharing factors, how to handle the scale mismatch between ratings and reviews, and how to model the dependency relation between them. In this paper, we propose a novel model that effectively integrates Matrix factorization and Topic modeling for Aspect rating prediction (MaToAsp). To overcome the above challenges and ensure performance, MaToAsp employs items as the sharing factors to combine matrix factorization and topic modeling, and introduces an interpretive preference probability to eliminate the scale mismatch. In the hybrid model, we establish a dependency relation from ratings to sentiment terms in phrases. Experiments on two real datasets, the Chinese Dianping and English TripAdvisor datasets, show that MaToAsp not only obtains reasonable aspect identification but also achieves the best aspect rating prediction performance compared to recent representative baselines.

9.
Opinion mining is one of the most important research tasks in the information retrieval community. With the huge volume of opinionated data available on the Web, approaches must be developed to differentiate opinion from fact. In this paper, we present a lexicon-based approach to opinion retrieval. Generally, opinion retrieval consists of two stages: relevance to the query and opinion detection. In our work, we focus on the second stage: detecting opinionated documents. We compare the document to be analyzed with opinionated sources that contain subjective information, hypothesizing that a document with a strong similarity to opinionated sources is more likely to be opinionated itself. Typical lexicon-based approaches select and process their opinion sources according to their test collection, then calculate the opinion score based on the frequency of subjective terms in the document. In our work, we use different open opinion collections without any specific treatment and consider them as a reference collection. We then use language models to determine opinion scores. The analyzed document and the reference collection are represented by different language models (i.e., Dirichlet, Jelinek-Mercer and two-stage models). These language models are generally used in information retrieval to represent the relationship between documents and queries; in our study, however, we adapt them to represent opinionated documents. We carry out several experiments using the Text REtrieval Conference (TREC) Blog06 collection as our analysis collection, and the Internet Movie Database (IMDb), Multi-Perspective Question Answering (MPQA) and CHESLY collections as our reference collection. To improve opinion detection, we study the impact of using different language models to represent the document and the reference collection, alongside different combinations of opinion and retrieval scores, and use these results to deduce the best opinion detection models. Using the best models, our approach improves on the best TREC Blog baseline (baseline4) by 30%.
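For reference, the standard forms of the Jelinek-Mercer and Dirichlet smoothed document models named above, where c(w; d) is the count of term w in document d, |d| is the document length, and P(w | C) is the collection language model (the two-stage model applies Dirichlet smoothing first and then interpolates, Jelinek-Mercer style, with a background model):

```latex
P_{\text{JM}}(w \mid d)  = (1 - \lambda)\,\frac{c(w; d)}{|d|} + \lambda\, P(w \mid C)
\qquad
P_{\text{Dir}}(w \mid d) = \frac{c(w; d) + \mu\, P(w \mid C)}{|d| + \mu}
```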

10.
Narratives are composed of stories that provide insight into social processes. To facilitate the analysis of narratives in a more efficient manner, natural language processing (NLP) methods have been employed to automatically extract information from textual sources, e.g., newspaper articles. Existing work on automatic narrative extraction, however, has ignored the nested character of narratives. In this work, we argue that a narrative may contain multiple accounts given by different actors. Each individual account provides insight into the beliefs and desires underpinning an actor's actions. We present a pipeline for automatically extracting accounts, consisting of NLP methods for: (1) named entity recognition, (2) event extraction, and (3) attribution extraction. Machine learning-based models for named entity recognition were trained on a state-of-the-art neural network architecture for sequence labelling. For event extraction, we developed a hybrid approach combining semantic role labelling tools, the FrameNet repository of semantic frames, and a lexicon of event nouns. Attribution extraction, meanwhile, was addressed with the aid of a dependency parser and Levin's verb classes. To facilitate the development and evaluation of these methods, we constructed a new corpus of news articles in which named entities, events and attributions have been manually marked up following a novel annotation scheme that covers over 20 event types relating to socio-economic phenomena. Evaluation results show that, relative to a baseline method underpinned solely by semantic role labelling tools, our event extraction approach improves recall by 12.22–14.20 percentage points (reaching as high as 92.60% on one dataset). Meanwhile, the use of Levin's verb classes in attribution extraction achieves the best performance in terms of F-score, outperforming a baseline method by 7.64–11.96 percentage points. Our proposed approach was applied to news articles focused on industrial regeneration cases, facilitating the generation of accounts of events attributed to specific actors.

11.
Recently, the Transformer model architecture and pre-trained Transformer-based language models have shown impressive performance when used to solve both natural language understanding and text generation tasks. Nevertheless, little research has been done on using these models for text generation in Arabic. This research aims at leveraging and comparing the performance of different model architectures, including RNN-based and Transformer-based ones, and different pre-trained language models, including mBERT, AraBERT, AraGPT2, and AraT5, for Arabic abstractive summarization. We first built an Arabic summarization dataset of 84,764 high-quality text-summary pairs. To use mBERT and AraBERT in the context of text summarization, we employed a BERT2BERT-based encoder-decoder model in which we initialized both the encoder and the decoder with the respective model weights. The proposed models were tested using ROUGE metrics and manual human evaluation, and we also compared their performance on out-of-domain data. Our pre-trained Transformer-based models give a large improvement in performance with ~79% less data. We found that AraT5 scores ~3 ROUGE points higher than a BERT2BERT-based model initialized with AraBERT, indicating that an encoder-decoder pre-trained Transformer is more suitable for summarizing Arabic text. Both of these models also outperform AraGPT2 by a clear margin, although AraGPT2 produces summaries with high readability but relatively lower quality. On the other hand, we found that both AraT5 and AraGPT2 are better at summarizing out-of-domain text. We have released our models and dataset publicly.
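A minimal sketch of the BERT2BERT initialization described above, using the Hugging Face Transformers API with the public mBERT checkpoint (an AraBERT checkpoint would be substituted analogously; training and generation code omitted):

```python
# Sketch of warm-starting an encoder-decoder model from BERT weights;
# checkpoint choice is illustrative, not the paper's exact configuration.
from transformers import EncoderDecoderModel, BertTokenizerFast

ckpt = "bert-base-multilingual-cased"  # swap in an AraBERT checkpoint here
tokenizer = BertTokenizerFast.from_pretrained(ckpt)
model = EncoderDecoderModel.from_encoder_decoder_pretrained(ckpt, ckpt)

# Decoder-side special tokens must be set before training/generation.
model.config.decoder_start_token_id = tokenizer.cls_token_id
model.config.pad_token_id = tokenizer.pad_token_id
```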

12.
In this paper, we introduce a novel knowledge-based word-sense disambiguation (WSD) system. In particular, the main goal of our research is to find an effective way to filter out unnecessary information by using word similarity. To this end, we adopt two methods in our WSD system. First, we propose a novel encoding method for word vector representation that considers the graphical semantic relationships in lexical knowledge bases; this word vector representation is used to determine word similarity in our WSD system. Second, we present an effective method for extracting the contextual words from a text for analyzing an ambiguous word based on word similarity. The results demonstrate that the suggested methods significantly enhance baseline WSD performance in all corpora. In particular, the performance on nouns is similar to that of state-of-the-art knowledge-based WSD models, and the performance on verbs surpasses that of existing knowledge-based WSD models.

13.
Aspect-level sentiment analysis is important for numerous opinion mining and market analysis applications. In this paper, we study the problem of identifying and rating review aspects, which is the fundamental task in aspect-level sentiment analysis. Previous review aspect analysis methods seldom consider the entity or the rating, relying only on 2-tuples, i.e., head-modifier pairs; e.g., in the phrase "nice room", "room" is the head and "nice" is the modifier. To solve this problem, we present a novel Quad-tuple Probabilistic Latent Semantic Analysis (QPLSA) model, which incorporates the entity and its rating, together with the 2-tuples, into the PLSA model. Specifically, QPLSA not only generates fine-granularity aspects but also captures the correlations between words and ratings. We also develop two novel prediction approaches: Quad-tuple Prediction (from the global perspective) and Expectation Prediction (from the local perspective). For evaluation, systematic experiments show that Quad-tuple PLSA significantly outperforms 2-tuple PLSA on both aspect identification and aspect rating prediction on public datasets. Moreover, for aspect rating prediction, QPLSA shows significant superiority over state-of-the-art baseline methods. The Quad-tuple Prediction and Expectation Prediction approaches also demonstrate strong aspect rating ability on different datasets.

14.
A main challenge in Cross-Language Information Retrieval (CLIR) is to estimate a proper translation model from available translation resources, since translation quality directly affects retrieval performance. Among the different translation resources, we focus on obtaining translation models from comparable corpora, because they provide appropriate translations for language pairs and domains with limited linguistic resources. In this paper, we employ a two-step approach to build an effective translation model from comparable corpora for the CLIR task, without requiring any additional linguistic resources. In the first step, translations are extracted by deriving correlations between source–target word pairs. These correlations are used to estimate word translation probabilities in the second step. We propose a language modeling approach for the first step, where modeling based on a probability distribution provides two key advantages. First, our approach can be tuned more easily than previous, heuristically adjusted approaches. Second, it provides a principled basis for integrating additional lexical and translational relations to improve the accuracy of translations extracted from comparable corpora. As an illustration, we integrate monolingual relations of word co-occurrence into the process of translation extraction, which helps to extract more reliable translations for low-frequency words in a comparable corpus. Experimental results on an English–Persian comparable corpus show that our method outperforms previous approaches in terms of both translation quality and CLIR performance. Indeed, the proposed method is naturally applicable to any comparable corpus, regardless of its languages. In addition, we demonstrate the significant impact of the word translation probabilities, estimated in the second step of our approach, on CLIR performance.
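As one illustrative possibility (the abstract does not give the exact estimator), the correlation scores from the first step can be normalized over the candidate target words of a source word to yield translation probabilities:

```latex
P(t \mid s) = \frac{\operatorname{corr}(s, t)}{\sum_{t'} \operatorname{corr}(s, t')}
```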

15.
Syntax parse trees are a method of representing sentence structure and are often used to provide models with syntax information and enhance downstream task performance. Because grammar and syntax are inherently linked, incorporating syntax parse trees into Grammatical Error Correction (GEC) is a natural solution. In this work, we present a method of incorporating syntax parse trees for GEC. Building on a strong sequence-to-sequence Transformer baseline, we present a unified parse integration method for GEC that allows the use of both dependency and constituency parse trees, as well as their combination, a syntactic graph. Specifically, on the sentence encoder, we propose a graph encoder that can encode dependency trees and constituency trees at the same time, yielding representations for both terminal nodes (i.e., the tokens of the sentence) and non-terminal nodes. We then use two cross-attention mechanisms (NT-Cross-Attention and T-Cross-Attention) to aggregate these source-side syntactic representations on the target side for final correction prediction. In addition to evaluating our models on the popular CoNLL-2014 Shared Task and JFLEG GEC benchmarks, we affirm the effectiveness of our proposed method by testing varying levels of parsing quality and exploring the use of both parsing formalisms. Through further empirical exploration and analysis to identify the source of improvement, we found that rich syntax information provides clear clues for GEC; a syntactic graph composed of multiple syntactic parse trees can effectively compensate for the limited quality and insufficient error-correction capability of a single syntactic parse tree.
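As a small illustration of the kind of syntactic input involved (not the paper's implementation), dependency edges over terminal nodes can be extracted with an off-the-shelf parser such as spaCy and then fed to a graph encoder:

```python
# Sketch: extract labelled dependency edges with spaCy.
# Assumes the "en_core_web_sm" model has been downloaded.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She go to school yesterday.")

# Each edge links a token to its syntactic head, labelled with the relation.
edges = [(tok.i, tok.head.i, tok.dep_) for tok in doc]
for child, head, rel in edges:
    print(f"{doc[child].text:10s} --{rel}--> {doc[head].text}")
```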

16.
The estimation of the query model is an important task in language modeling (LM) approaches to information retrieval (IR). The ideal estimation is expected to be not only effective, in terms of high mean retrieval performance over all queries, but also stable, in terms of low variance of retrieval performance across different queries. In practice, however, improving effectiveness can sacrifice stability, and vice versa. In this paper, we propose to study this tradeoff from a new perspective, i.e., the bias–variance tradeoff, which is a fundamental theory in statistics. We formulate the notion of bias–variance with respect to retrieval performance and the estimation quality of query models. We then investigate several estimated query models, analyzing when and why the bias–variance tradeoff occurs, and how the bias and variance can be reduced simultaneously. A series of experiments on four TREC collections has been conducted to systematically evaluate our bias–variance analysis. Our approach and results can potentially form an analysis framework and a novel evaluation strategy for query language modeling.
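For context, the standard decomposition behind the bias–variance tradeoff discussed above, stated for a generic estimator \(\hat{\theta}\) of a quantity \(\theta\) (the paper formulates analogous notions for retrieval performance and query-model estimation quality):

```latex
\mathbb{E}\big[(\hat{\theta} - \theta)^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{\theta}] - \theta\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{\theta} - \mathbb{E}[\hat{\theta}])^2\big]}_{\text{variance}}
```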

17.
Research Policy, 2019, 48(9): 103808
We explore the effect of political uncertainty on innovation. In particular, we examine the differential effects of two sources of uncertainty, leaders' education levels and political regimes (i.e., presidential vs. parliamentary), on patent applications. We posit that firms react to political uncertainty caused by the unexpected departure of a national leader by investing in patents as growth options. The empirical design analyzes a panel with information from over 62 million patent applications at the aggregated applicant level. Results show that leaders' unexpected departures cause, on average, increases of approximately 9% in the aggregate growth of patent applications. We also find that the leader's level of education and the country's political regime have significant effects on the relationship between political uncertainty and innovation. The difference between leaders with high and low levels of education accounts for 21% of the change in the growth of patent applications. Further, the effect of political uncertainty on innovation is amplified in presidential systems, which grant leaders more power and make electoral transitions less predictable. The differences between presidential and parliamentary systems account for approximately 16% of the change in the growth of patent applications. As a robustness check, we utilized a subsample of more than 170,000 firms with local and foreign patent applications, as well as a panel of over 5,700 government non-profits, universities, and hospitals with local patent applications. Consistent with our theory, the former react to political uncertainty by investing in patents, while the latter remain unaffected. We contribute by showing the theoretical mechanisms linking leader and regime characteristics with patent applications.

18.
Compared with continuous temporal distributions, discrete data representations may be desired for simpler and faster data analysis and forecasting. Data compression offers an efficient way to reduce continuous historical stock market data and present them in discrete form; when predicting stock trends, the primary concern is the up or down direction of the price movement, so data discretization can support such a focused approach. In this article, we propose a quantization-based data fusion approach whose primary motivation is to reduce data complexity and hence enhance the prediction ability of a model. Here, the continuous time-series values are transformed into discrete quantum values before being passed to a prediction model. We extend the proposed approach and factorize the quantization by integrating different quantization step sizes. Such fused data concentrate mainly on the direction of the stock price movement. To empirically evaluate the proposed approach for stock trend prediction, we adopt long short-term memory, deep neural network, and backpropagation neural network models, and compare our prediction results with five existing approaches on several datasets using ten performance metrics. We analyze the impact of specific quantization factors and determine the individual best as well as overall best factor sizes; the results indicate a consistent enhancement in stock trend prediction accuracy compared to the considered baseline methods, with an improvement of up to 7%. To evaluate the impact of quantization-based data fusion, we analyze the time required to execute the experiments along with the percentage reduction in the number of unique numeric terms. These results are further evaluated statistically using the Wilcoxon signed-rank test. We discuss the superiority and applicability of the factored quantization-based data fusion approach and conclude our work with potential future research directions.
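A minimal sketch of the quantization idea described above, assuming simple uniform step sizes (the paper's exact quantization factors are not reproduced here): continuous prices are mapped to discrete quantum values, and different step sizes yield representations of different coarseness that can then be fused.

```python
# Uniform quantization of a price series; step sizes are illustrative.
import numpy as np

def quantize(series, step):
    return np.round(np.asarray(series) / step) * step

prices = [101.3, 101.9, 103.2, 102.7]
print(quantize(prices, step=1.0))   # [101. 102. 103. 103.]
print(quantize(prices, step=0.5))   # [101.5 102.  103.  102.5]
```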

19.
Professional discourse is the language used by specialists, such as lawyers, doctors and academics, to communicate the knowledge and assumptions associated with their respective fields. Professional discourse can be especially difficult for non-specialists to understand due to the lexical ambiguity of commonplace words that have a different or more specific meaning within a specialist domain. This phenomenon also makes it harder for specialists to communicate with the general public, because they are similarly unaware of the potential for misunderstandings. In this article, we present an approach for detecting domain terms that are lexically ambiguous with respect to everyday English. We demonstrate the efficacy of our approach with three case studies in statistics, law and biomedicine. In all case studies, we identify domain terms with a precision@100 greater than 0.9, outperforming the best-performing baseline by 18.1–91.7%. Most importantly, we show that this ranking is broadly consistent with semantic differences. Our results highlight the difficulties that existing semantic difference methods have in the cross-domain setting, where they rank non-domain terms highly due to noise or biases in the data. We additionally show that our approach generalizes to short phrases, and we investigate its data efficiency by varying the number of labeled examples.
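A minimal sketch of one common scoring idea for such cross-domain comparisons (the paper's exact scorer is not specified in this abstract): measure how far a term's vector in a domain embedding space drifts from its vector in a general-English space, assuming the two spaces have already been aligned into common coordinates.

```python
# Illustrative cross-domain semantic-shift score; toy vectors only.
import numpy as np

def ambiguity_score(v_domain, v_general):
    cos = np.dot(v_domain, v_general) / (
        np.linalg.norm(v_domain) * np.linalg.norm(v_general) + 1e-9)
    return 1.0 - cos  # higher = larger shift in meaning across domains

# Toy vectors for "significant": statistics sense vs. everyday sense.
print(ambiguity_score(np.array([0.9, 0.1, 0.0]), np.array([0.2, 0.8, 0.3])))
```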

20.
This paper analyzes the JPEG compression algorithm based on the DCT sequential (baseline) mode and, using Visual C software, implements an image encoding and decoding application based on this algorithm. Six 24-bit, 640×480 bitmap images were selected for JPEG encoding and decoding. The results show that, in common applications, the JPEG baseline system can achieve compression ratios of 10-20 with no distortion perceptible to the human eye, thereby accomplishing high-quality image compression.
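A quick back-of-the-envelope check of the reported compression ratios for one such image:

```python
# Uncompressed size of a 24-bit (3 bytes/pixel) 640x480 bitmap,
# and the resulting JPEG sizes at the reported ratios.
raw_bytes = 640 * 480 * 3            # 921,600 bytes, i.e. 900 KB
for ratio in (10, 20):
    print(f"{ratio}:1 -> {raw_bytes / ratio / 1024:.0f} KB")
# 10:1 -> 90 KB, 20:1 -> 45 KB
```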
