Similar Documents
A total of 20 similar documents were found (search time: 15 ms).
1.
Research on automated social media rumour verification, the task of identifying the veracity of questionable information circulating on social media, has yielded neural models achieving high performance, with accuracy scores that often exceed 90%. However, none of these studies focus on the real-world generalisability of the proposed approaches, that is, whether the models perform well on datasets other than those on which they were initially trained and tested. In this work we aim to fill this gap by assessing the generalisability of top-performing neural rumour verification models covering a range of different architectures from the perspectives of both topic and temporal robustness. For a more complete evaluation of generalisability, we collect and release COVID-RV, a novel dataset of Twitter conversations revolving around COVID-19 rumours. Unlike other existing COVID-19 datasets, COVID-RV contains conversations around rumours that follow the format of prominent rumour verification benchmarks, while differing from them in topic and time scale, thus allowing better assessment of the temporal robustness of the models. We evaluate model performance on COVID-RV and three popular rumour verification datasets to understand the limitations and advantages of different model architectures, training datasets and evaluation scenarios. We find a dramatic drop in performance when testing models on a dataset different from the one used for training. Further, we evaluate the ability of models to generalise in a few-shot learning setup, as well as when word embeddings are updated with the vocabulary of a new, unseen rumour. Drawing upon our experiments, we discuss challenges and make recommendations for future research directions in addressing this important problem.

2.
Existing unsupervised keyphrase extraction methods typically emphasize the importance of the candidate keyphrase itself, ignoring other important factors such as the influence of uninformative sentences. We hypothesize that the salient sentences of a document are particularly important, as they are most likely to contain keyphrases, especially for long documents. To our knowledge, our work is the first attempt to exploit sentence salience for unsupervised keyphrase extraction by modeling hierarchical multi-granularity features. Specifically, we propose a novel position-aware graph-based unsupervised keyphrase extraction model, which includes two variants. The pipeline model first extracts salient sentences from the document and then extracts keyphrases from the extracted salient sentences. In contrast to the pipeline model, which models multi-granularity features in a two-stage paradigm, the joint model accounts for both sentence and phrase representations of the source document simultaneously via hierarchical graphs. Concretely, sentence nodes are introduced as an inductive bias, injecting sentence-level information for determining the importance of candidate keyphrases. We compare our model against strong baselines on three benchmark datasets: Inspec, DUC 2001, and SemEval 2010. Experimental results show that the simple pipeline-based approach achieves promising results, indicating that the keyphrase extraction task benefits from the salient sentence extraction task. The joint model, which mitigates the error accumulation of the pipeline model, gives the best performance and achieves new state-of-the-art results while generalizing better to data from different domains and of different lengths. In particular, on the SemEval 2010 dataset, which consists of long documents, our joint model outperforms the strongest baseline UKERank by 3.48%, 3.69% and 4.84% in terms of F1@5, F1@10 and F1@15, respectively. We also conduct qualitative experiments to validate the effectiveness of our model components.
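As a rough illustration of the general idea of position-aware graph-based ranking (not the paper's hierarchical sentence/phrase model), the sketch below builds a word co-occurrence graph and runs personalized PageRank biased toward early positions; the window size, damping factor and bias scheme are illustrative assumptions.

```python
# Sketch only: generic position-aware graph ranking for keyphrase candidates.
# This is NOT the paper's hierarchical model; window, damping and the
# position-bias scheme are illustrative assumptions.
from collections import defaultdict
import networkx as nx

def rank_words(words, window=4, damping=0.85):
    """words: tokenized document; returns {word: salience score}."""
    graph = nx.Graph()
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window, len(words))):
            if w != words[j]:
                graph.add_edge(w, words[j])  # co-occurrence edge within the window
    # Position bias: earlier occurrences contribute more weight (1/position).
    bias = defaultdict(float)
    for pos, w in enumerate(words, start=1):
        bias[w] += 1.0 / pos
    total = sum(bias[w] for w in graph.nodes)
    personalization = {w: bias[w] / total for w in graph.nodes}
    return nx.pagerank(graph, alpha=damping, personalization=personalization)
```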

3.
This paper focuses on personalized outfit generation, which aims to generate compatible fashion outfits catering to given users. Personalized recommendation by generating outfits of compatible items is an emerging task in the recommendation community with great commercial value, yet it remains underexplored. The task requires exploring both user–outfit personalization and outfit compatibility, each of which is challenging due to the huge learning space resulting from the large number of items, users, and possible outfit options. To specify user preferences on outfits and regulate the outfit compatibility modeling, we propose to incorporate fashion coordination knowledge. Inspired by the fact that users may have coordination preferences in terms of category combinations, we first define category combinations as templates and propose to model the user–template relationship to capture users' coordination preferences. Moreover, since a small number of templates can cover the majority of fashion outfits, leveraging templates is also promising for guiding the outfit generation process. In this paper, we propose the Template-guided Outfit Generation (TOG) framework, which unifies the learning of user–template interaction, user–item interaction and outfit compatibility modeling. Personal preference modeling and outfit generation are organically blended together in our problem formulation and can therefore be achieved simultaneously. Furthermore, we propose new evaluation protocols to evaluate different models from both the personalization and compatibility perspectives. Extensive experiments on two public datasets demonstrate that the proposed TOG achieves preferable performance from both evaluation perspectives, outperforming the most competitive baseline BGN by 7.8% and 10.3% in terms of personalization precision on the iFashion and Polyvore datasets, respectively, and improving the compatibility of the generated outfits by over 2%.

4.
This paper investigates the existence of a representative subset, obtained from a large original dataset, that can achieve the same performance level obtained using the entire dataset in the context of training neural language models. We employ a likelihood-based scoring method based on two distinct types of pre-trained language models to select a representative subset. We conduct our experiments on 17 widely used natural language processing datasets with 24 evaluation metrics. The experimental results show that the representative subset obtained using the likelihood difference score can achieve the 90% performance level even when the size of the dataset is reduced to approximately two to three orders of magnitude smaller than the original dataset. We also compare the performance with models trained on randomly selected subsets of the same size to show the effectiveness of the representative subset.
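A minimal sketch of likelihood-difference scoring for subset selection, under stated assumptions: the abstract does not name the scoring models, so two off-the-shelf causal LMs ("gpt2" and "distilgpt2") and the keep ratio below are placeholders rather than the paper's configuration.

```python
# Sketch: likelihood-difference scoring for representative-subset selection.
# Assumptions: "gpt2" and "distilgpt2" stand in for the unspecified scoring
# models; keep_ratio is illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def log_likelihood(model, tokenizer, text):
    """Average per-token log-likelihood of `text` under `model`."""
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    return -out.loss.item()  # loss is the mean negative log-likelihood per token

def select_subset(texts, keep_ratio=0.01):
    tok_a = AutoTokenizer.from_pretrained("gpt2")
    lm_a = AutoModelForCausalLM.from_pretrained("gpt2").eval()
    tok_b = AutoTokenizer.from_pretrained("distilgpt2")
    lm_b = AutoModelForCausalLM.from_pretrained("distilgpt2").eval()
    # Score each example by the difference of its likelihoods under the two LMs.
    scores = [log_likelihood(lm_a, tok_a, t) - log_likelihood(lm_b, tok_b, t)
              for t in texts]
    ranked = sorted(range(len(texts)), key=lambda i: scores[i], reverse=True)
    keep = ranked[: max(1, int(len(texts) * keep_ratio))]
    return [texts[i] for i in keep]
```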

5.
6.
Hate speech is an increasingly important societal issue in the era of digital communication. Hateful expressions often make use of figurative language and, although they represent, in some sense, the dark side of language, they are also often prime examples of creative use of language. While hate speech is a global phenomenon, current studies on automatic hate speech detection are typically framed in a monolingual setting. In this work, we explore hate speech detection in low-resource languages by transferring knowledge from a resource-rich language, English, in a zero-shot learning fashion. We experiment with traditional and recent neural architectures, and propose two joint-learning models that use different multilingual language representations to transfer knowledge between pairs of languages. We also evaluate the impact of additional knowledge in our experiments by incorporating information from a multilingual lexicon of abusive words. The results show that our joint-learning models achieve the best performance on most languages. However, a simple approach that uses machine translation and a pre-trained English language model achieves robust performance. In contrast, Multilingual BERT fails to obtain good performance in cross-lingual hate speech detection. We also find experimentally that the external knowledge from a multilingual abusive lexicon is able to improve the models' performance, specifically in detecting the positive class. The results of our experimental evaluation highlight a number of challenges and issues in this particular task. One of the main challenges relates to current benchmarks for hate speech detection, in particular how bias related to the topical focus in the datasets influences classification performance. The insufficient ability of current multilingual language models to transfer knowledge between languages in the specific task of hate speech detection also remains an open problem. However, our experimental evaluation and our qualitative analysis show how the explicit integration of linguistic knowledge from a structured abusive language lexicon helps to alleviate this issue.

7.
In recent years, there has been increased interest in topic-focused multi-document summarization. In this task, automatic summaries are produced in response to a specific information request, or topic, stated by the user. The system we have designed to accomplish this task comprises four main components: a generic extractive summarization system, a topic-focusing component, sentence simplification, and lexical expansion of topic words. This paper details each of these components, together with experiments designed to quantify their individual contributions. We include an analysis of our results on two large datasets commonly used to evaluate task-focused summarization, the DUC2005 and DUC2006 datasets, using automatic metrics. Additionally, we include an analysis of our results on the DUC2006 task according to human evaluation metrics. In the human evaluation of system summaries compared to human summaries, i.e., the Pyramid method, our system ranked first out of 22 systems in terms of overall mean Pyramid score; and in the human evaluation of summary responsiveness to the topic, our system ranked third out of 35 systems.

8.
Argument mining (AM) aims to automatically generate a graph that represents the argument structure of a document. Most previous AM models attend only to a single argument component (AC) to classify the type of the AC, or to a pair of ACs to identify and classify the argumentative relation (AR) between the two ACs. These models ignore the impact of the global argument structure of a document, which is important, especially in highly structured genres such as scientific papers, where the process of argumentation is relatively fixed. Inspired by this, we propose a novel two-stage model which leverages global structure information to support AM. The first stage uses a multi-turn question-answering model to incrementally generate an initial argumentative graph that identifies relations among ACs. At each turn, all ACs related to the query AC are generated simultaneously, such that the sibling global information between the answer ACs is considered. In addition, the partially constructed graph is used as global structure information to support the extension of the graph with additional ACs. After the whole initial graph structure has been determined, the second stage assigns semantic types to both the ACs and the ARs among them, leveraging information from this initial graph as global structure information. We test the proposed methods on two scientific datasets (the AbstRCT dataset, comprising 659 abstracts about cancer research, and the SciARG dataset, consisting of 225 computational linguistics abstracts and 285 biomedical abstracts) and the student essay dataset PE, with 402 essays. Our experiments show that our model improves the state-of-the-art performance on the two scientific datasets for different AM subtasks, with average improvements of 1%, 2.41% and 1.1% for the ACC, ARI and ARC tasks, respectively, on the AbstRCT dataset, and 2.36%, 1.84% and 8.87% for the ACC, ARI and ARC tasks on the SciARG dataset. Our model also achieves comparable results on the PE dataset: F1 scores of 87.7% for the ACC task, 81.4% for the ARI task and 78.8% for the ARC task.

9.
Overlapping entity relation extraction has received extensive research attention in recent years. However, existing methods struggle with long-distance dependencies between entities and fail to extract relations when the overlapping situation is relatively complex, which limits performance on the task. In this paper, we propose an end-to-end neural model for overlapping relation extraction that treats the task as a quintuple prediction problem. The proposed method first constructs entity graphs by enumerating possible candidate spans, then models the relational graphs between entities via a graph attention model. Experimental results on five benchmark datasets show that the proposed model achieves the current best performance, outperforming previous methods and baseline systems by a large margin. Further analysis shows that our model can effectively capture long-distance dependencies between entities in long sentences.

10.
Visual dialog, a visual-language task, enables an AI agent to engage in conversation with humans grounded in a given image. To generate appropriate answers to a series of questions in the dialog, the agent is required to understand the comprehensive visual content of an image and the fine-grained textual context of the dialog. However, previous studies typically utilized object-level visual features to represent a whole image, which focus only on the local perspective of an image and ignore the importance of its global information. In this paper, we propose a novel model, the Human-Like Visual Cognitive and Language-Memory Network for Visual Dialog (HVLM), to simulate global and local dual-perspective cognition in the human visual system and understand an image comprehensively. HVLM consists of two key modules, Local-to-Global Graph Convolutional Visual Cognition (LG-GCVC) and Question-guided Language Topic Memory (T-Mem). Specifically, in the LG-GCVC module, we design question-guided dual-perspective reasoning to jointly learn visual content from both local and global perspectives through a simple spectral graph convolution network. Furthermore, in the T-Mem module, we design an iterative learning strategy to gradually enhance fine-grained textual context details via an attention mechanism. Experimental results demonstrate the superiority of our proposed model, which obtains comparable performance on the benchmark datasets VisDial v1.0 and VisDial v0.9.

11.
The spread of fake news has become a significant social problem, drawing great concern for fake news detection (FND). Pretrained language models (PLMs), such as BERT and RoBERTa, can benefit this task greatly, leading to state-of-the-art performance. The common paradigm for utilizing these PLMs is fine-tuning, in which a linear classification layer is built on top of the well-initialized PLM network, resulting in an FND model, and then the full model is tuned on a training corpus. Although great successes have been achieved, this paradigm still involves a significant gap between the language model pretraining and target task fine-tuning processes. Fortunately, prompt learning, a new alternative for PLM exploitation, can handle this issue naturally, showing the potential for further performance improvements. To this end, we propose knowledgeable prompt learning (KPL) for this task. First, we apply prompt learning to FND by carefully designing a sophisticated prompt template and the corresponding verbalizer words for the task. Second, we incorporate external knowledge into the prompt representation, making the representation more expressive for predicting the verbalizer words. Experimental results on two benchmark datasets demonstrate that prompt learning is better than the baseline fine-tuning PLM utilization for FND and can outperform all previous representative methods. Our final knowledgeable model (i.e., KPL) provides further improvements. In particular, it achieves an average increase of 3.28% in F1 score under low-resource conditions compared with fine-tuning.
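A minimal sketch of the prompt-learning paradigm described here, without the external-knowledge component: the news text is wrapped in a cloze template and the masked-LM scores of verbalizer words decide the label. The template, the verbalizer words and the "bert-base-uncased" backbone are illustrative assumptions, not KPL's actual design.

```python
# Sketch only: prompt-based fake news classification with an MLM backbone.
# Template, verbalizer words and the backbone model are assumptions; KPL's
# knowledge-enhanced prompt representation is not reproduced here.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()
VERBALIZER = ("fake", "real")  # assumed label words

def classify(text, max_text_tokens=400):
    # Truncate the news text so the template and [MASK] are never cut off.
    text = tokenizer.convert_tokens_to_string(tokenizer.tokenize(text)[:max_text_tokens])
    prompt = f"{text} This news is {tokenizer.mask_token}."
    inputs = tokenizer(prompt, return_tensors="pt")
    mask_pos = (inputs.input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        logits = model(**inputs).logits[0, mask_pos]
    scores = {w: logits[tokenizer.convert_tokens_to_ids(w)].item() for w in VERBALIZER}
    return max(scores, key=scores.get)  # label word with the highest MLM score
```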

12.
In spite of the vast amount of work on subjectivity and sentiment analysis (SSA), it is not yet particularly clear how lexical information can best be modeled in a morphologically rich language. To bridge this gap, we report successful models targeting lexical input in Arabic, a language of very complex morphology. Namely, we measure the impact of both gold and automatic segmentation on the task and build effective models that perform significantly better than our baselines. Our models exploiting predicted segments improve subjectivity classification by 6.02% F1-measure and sentiment classification by 4.50% F1-measure against the majority-class baseline on surface word forms. We also perform in-depth (error) analyses of the behavior of the models and provide detailed explanations of subjectivity and sentiment expression in Arabic against the background of morphological richness in which the work is situated.

13.
High-quality evaluation of generated summaries is needed if we are to improve automatic summarization systems. Although human evaluation provides better results than automatic evaluation methods, its cost is huge and its results are difficult to reproduce. Therefore, we need an automatic method that simulates human evaluation if we are to improve our summarization systems efficiently. Although automatic evaluation methods have been proposed, they are unreliable when used for individual summaries. To solve this problem, we propose a supervised automatic evaluation method based on a new regression model called the voted regression model (VRM). VRM has two characteristics: (1) model selection based on 'corrected AIC' to avoid multicollinearity, and (2) voting by the selected models to alleviate the problem of overfitting. Evaluation results obtained for TSC3 and DUC2004 show that our method achieved error reductions of about 17–51% compared with conventional automatic evaluation methods. Moreover, our method obtained the highest correlation coefficients in several different experiments.
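As a rough, hedged sketch of the VRM idea: fit ordinary least-squares models on many feature subsets, score each with corrected AIC (AICc), and average ("vote") the predictions of the best-scoring subsets. The subset enumeration, the AICc variant and the number of voters are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch only: AICc-based model selection plus voting over low-AICc OLS models.
# Enumeration depth, AICc variant and n_voters are illustrative assumptions.
import itertools
import numpy as np

def aicc(y, y_hat, n_params):
    """Corrected AIC for a Gaussian OLS fit; assumes len(y) > n_params + 1."""
    n = len(y)
    rss = float(np.sum((y - y_hat) ** 2))
    aic = n * np.log(rss / n) + 2 * n_params
    return aic + (2 * n_params * (n_params + 1)) / (n - n_params - 1)

def fit_ols(X, y):
    coef, *_ = np.linalg.lstsq(np.c_[np.ones(len(X)), X], y, rcond=None)
    return coef

def predict(coef, X):
    return np.c_[np.ones(len(X)), X] @ coef

def voted_regression(X, y, X_new, max_features=3, n_voters=5):
    """X, X_new: 2-D feature matrices; y: target scores (e.g. human ratings)."""
    candidates = []
    for r in range(1, max_features + 1):
        for cols in itertools.combinations(range(X.shape[1]), r):
            cols = list(cols)
            coef = fit_ols(X[:, cols], y)
            score = aicc(y, predict(coef, X[:, cols]), n_params=len(cols) + 1)
            candidates.append((score, cols, coef))
    voters = sorted(candidates, key=lambda c: c[0])[:n_voters]
    # Vote: average the predictions of the selected low-AICc models.
    preds = [predict(coef, X_new[:, cols]) for _, cols, coef in voters]
    return np.mean(preds, axis=0)
```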

14.
Conceptual metaphor detection is a well-researched topic in Natural Language Processing. At the same time, analysis of conceptual metaphor use produces unique insight into individual psychological processes and characteristics, as demonstrated by research in cognitive psychology. Despite the fact that state-of-the-art language models allow for highly effective automatic detection of conceptual metaphor in benchmark datasets, these models have never been applied to psychological tasks. The benchmark datasets differ greatly from experimental texts recorded or produced in a psychological setting, in their domain, genre, and the scope of metaphoric expressions covered. We present the first experiment to apply NLP metaphor detection methods to a psychological task, specifically, analyzing individual differences. To this end, we annotate MetPersonality, a dataset of Russian texts written in a psychological experiment setting, with conceptual metaphor. With a widely used conceptual metaphor annotation procedure, we obtain low annotation quality, which arises from dataset characteristics uncommon in typical automatic metaphor detection tasks. We propose a novel conceptual metaphor annotation procedure to mitigate these issues, increasing the inter-annotator agreement to a moderately high level. We leverage the annotated dataset and existing metaphor datasets in Russian to select, train and evaluate state-of-the-art metaphor detection models, obtaining acceptable results on the metaphor detection task. In turn, the most effective model is used to detect conceptual metaphor automatically in RusPersonality, a larger dataset containing meta-information on the psychological traits of the participant authors. Finally, we analyze correlations of automatically detected metaphor use with psychological traits encoded in the Freiburg Personality Inventory (FPI). Our pioneering work on automatically detected metaphor use and individual differences demonstrates the possibility of unprecedented large-scale research on the relation between metaphor use and personality traits and dispositions, as well as cognitive and emotional processing.

15.
16.
Ethnicity-targeted hate speech has been widely shown to influence on-the-ground inter-ethnic conflict and violence, especially in multi-ethnic societies such as Russia. Ethnicity-targeted hate speech detection in user texts is therefore becoming an important task. However, it faces a number of unresolved problems: the difficulty of reliable mark-up, informal and indirect ways of expressing negativity in user texts (such as irony, false generalization and attribution of unfavored actions to targeted groups), users' inclination to express opposite attitudes to different ethnic groups in the same text and, finally, the lack of research on languages other than English. In this work we address several of these problems in the task of ethnicity-targeted hate speech detection in Russian-language social media texts. Our approach allows us to differentiate between attitudes towards different ethnic groups mentioned in the same text – a task that has never been addressed before. We use a dataset of over 2.6M user messages mentioning ethnic groups to construct a representative sample of 12K (ethnic group, text) instances that are further thoroughly annotated via a special procedure. In contrast to many previous collections, which usually comprise extreme cases of toxic speech, the representativity of our sample secures a realistic and, therefore, much higher proportion of subtle negativity, which additionally complicates its automatic detection. We then experiment with four types of machine learning models, from traditional classifiers such as SVM to deep learning approaches, notably the recently introduced BERT architecture, and interpret their predictions in terms of various linguistic phenomena. In addition to hate speech detection with a text-level two-class approach (hate, no hate), we also justify and implement a unique instance-based three-class approach (positive, neutral, negative attitude, the latter implying hate speech). Our best results are achieved by fine-tuned, pre-trained RuBERT combined with linguistic features, with F1-hate=0.760 and F1-macro=0.833 on the text-level two-class problem, comparable to previous studies, and F1-hate=0.813 and F1-macro=0.824 on our unique instance-based three-class hate speech detection task. Finally, we perform error analysis, which reveals that further improvement could be achieved by accounting for complex and creative language issues more accurately, i.e., by detecting irony and unconventional forms of obscene lexicon.

17.
The performance of text classification models tends to drop over time due to changes in data, which limits the lifetime of a pretrained model. Therefore, the ability to predict how well a model will persist over time can help in designing models that remain effective over a longer period. In this paper, we provide a thorough discussion of the problem and establish an evaluation setup for the task. We look at this problem from a practical perspective by assessing the ability of a wide range of language models and classification algorithms to persist over time, as well as how dataset characteristics can help predict the temporal stability of different models. We perform longitudinal classification experiments on three datasets spanning between 6 and 19 years and involving diverse tasks and types of data. By splitting the longitudinal datasets into years, we perform a comprehensive set of experiments, training and testing across data that are different numbers of years apart from each other, both in the past and in the future. This enables a gradual investigation into the impact of the temporal gap between training and test sets on classification performance, as well as measuring the extent of persistence over time. Through experimenting with a range of language models and algorithms, we observe a consistent trend of performance drop over time, which however differs significantly across datasets; indeed, datasets whose domain is more closed and whose language is more stable, such as book reviews, exhibit a less pronounced performance drop than open-domain social media datasets where language varies significantly more. We find that one can estimate how a model will retain its performance over time based on (i) how well the model performs over a restricted time period and its extrapolation to a longer time period, and (ii) the linguistic characteristics of the dataset, such as the familiarity score between subsets from different years. Findings from these experiments have important implications for the design of text classification models with the aim of preserving performance over time.
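A minimal sketch of the longitudinal evaluation protocol described above: train on each year's data, test on every other year, and collect scores by temporal gap. The TF-IDF plus logistic regression pipeline is an illustrative stand-in for the models compared in the paper.

```python
# Sketch: year-by-year longitudinal evaluation of temporal persistence.
# The TF-IDF + logistic regression pipeline is an illustrative stand-in.
from collections import defaultdict
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import make_pipeline

def temporal_gap_scores(data_by_year):
    """data_by_year: {year: (texts, labels)} -> {gap_in_years: [macro-F1, ...]}."""
    scores = defaultdict(list)
    years = sorted(data_by_year)
    for train_year in years:
        X_tr, y_tr = data_by_year[train_year]
        model = make_pipeline(TfidfVectorizer(min_df=2),
                              LogisticRegression(max_iter=1000))
        model.fit(X_tr, y_tr)
        for test_year in years:
            if test_year == train_year:
                continue
            X_te, y_te = data_by_year[test_year]
            f1 = f1_score(y_te, model.predict(X_te), average="macro")
            # Positive gaps test on future data, negative gaps on past data.
            scores[test_year - train_year].append(f1)
    return scores
```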

18.
Dynamic link prediction is a critical task in network research that seeks to predict future network links based on the behavior of prior network changes. However, most existing methods overlook mutual interactions between neighbors and long-distance interactions, and lack interpretability of the model's predictions. To tackle these issues, in this paper we propose a temporal group-aware graph diffusion network (TGGDN). First, we construct a group affinity matrix to describe mutual interactions between neighbors, i.e., group interactions. Then, we merge the group affinity matrix into the graph diffusion to form a group-aware graph diffusion, which simultaneously captures group interactions and long-distance interactions in dynamic networks. Additionally, we present a transformer block that models the temporal information of dynamic networks using self-attention, allowing TGGDN to pay greater attention to task-related snapshots while also providing interpretability for better understanding of network evolution patterns. We compare the proposed TGGDN with state-of-the-art methods on five real-world datasets of different sizes, ranging from 1k to 20k nodes. Experimental results show that TGGDN achieves average improvements of 8.3% and 3.8% in terms of ACC and AUC, respectively, on all datasets, demonstrating its superiority in the dynamic link prediction task.
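One possible reading of the "group-aware graph diffusion" step, as a hedged sketch: mix a normalized adjacency matrix with a normalized group affinity matrix and apply a truncated, personalized-PageRank-style diffusion series. The mixing rule, teleport probability and number of steps are assumptions; the paper's exact formulation may differ.

```python
# Sketch only: a group-aware diffusion matrix for one network snapshot.
# The mixing weight, PPR-style coefficients and step count are assumptions.
import numpy as np

def normalize(M):
    """Row-normalize a non-negative matrix into a transition matrix."""
    deg = M.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0
    return M / deg

def group_aware_diffusion(adj, affinity, mix=0.5, alpha=0.15, k_steps=10):
    """adj, affinity: (n, n) non-negative matrices; returns a dense diffusion matrix."""
    T = mix * normalize(adj) + (1 - mix) * normalize(affinity)
    S = np.zeros_like(T)
    T_power = np.eye(T.shape[0])
    for k in range(k_steps):
        S += alpha * (1 - alpha) ** k * T_power  # PPR-style diffusion weights
        T_power = T_power @ T
    return S  # captures both group and long-distance interactions
```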

19.
Named entity recognition aims to detect pre-determined entity types in unstructured text. There are only a limited number of studies on this task for low-resource languages such as Turkish. We provide a comprehensive study of Turkish named entity recognition by comparing the performance of existing state-of-the-art models on datasets with varying domains, in order to understand their generalization capability and to further analyze why such models fail or succeed in this task. Our experimental results, supported by statistical tests, show that the highest weighted F1 scores are obtained by Transformer-based language models, varying from 80.8% on tweets to 96.1% on news articles. We find that Transformer-based language models are more robust to entity types with a small sample size and to longer named entities compared to traditional models, yet all models perform poorly on longer named entities in social media. Moreover, when we shuffle 80% of the words in a sentence to imitate the flexible word order of Turkish, we observe a larger performance deterioration in well-written texts, 12%, compared to 7% in noisy text.

20.
This study explores the relationships between user interaction and digital library (DL) evaluation. User interaction is a multi-dimensional construct, recognized in this study as three dimensions: user interaction with the information resource, with the interface, and with tasks. DL evaluation is considered from the user's perspective and defined as users' perception of DL performance from different perspectives, including the support of the DL's interaction design for user interaction (labeled interaction-design-based (IDB) evaluation), the support for task completion (labeled task-based evaluation), and the DL's overall performance (labeled overall evaluation). An experiment with 48 participants was conducted using the China National Knowledge Infrastructure (CNKI, http://cnki.net/, the most widely used digital library in China). Participants searched for four simulated work tasks and one real work task during the experiment, subsequently evaluating their interaction with the information resource, interface, and tasks, and DL performance from different perspectives before or after the search. Correlation analysis and stepwise regression analysis were conducted to examine the relationships. The results indicate that a list of factors related to different dimensions of user interaction can significantly predict, or are correlated with, users' evaluation of DL performance from different perspectives, including appropriateness, rich and valid links, reasonable page layout, salience of topics, search task difficulty, well-organized web site, ease of learning, accessibility, usefulness, familiarity with task procedure, etc. These factors surface as the most critical criteria for DL evaluation. Based on the results, an integrated DL evaluation framework is developed. The study adds new knowledge about how tasks affect DL evaluation. It has implications for improving the efficiency of DL evaluation and helping DL developers design DLs that better support users' interaction, task completion, and overall experience with DLs.
