Similar literature
1.
    
The advent of connected devices and the omnipresence of the Internet have paved the way for intruders to attack networks, leading to cyber-attacks, financial loss, information theft in healthcare, and cyber war. Hence, network security analytics has become an important area of concern and has lately gained intensive attention among researchers, specifically in the domain of network anomaly detection, which is considered crucial for network security. However, preliminary investigations have revealed that the existing approaches to detecting anomalies in networks are not effective enough, particularly in real time. The inefficacy of current approaches is mainly due to the massive volumes of data amassed through connected devices. Therefore, it is crucial to propose a framework that effectively handles real-time big data processing and detects anomalies in networks. In this regard, this paper attempts to address the issue of detecting anomalies in real time. Accordingly, it surveys the state-of-the-art real-time big data processing technologies related to anomaly detection and the vital characteristics of the associated machine learning algorithms. The paper begins with an explanation of the essential context and taxonomy of real-time big data processing, anomaly detection, and machine learning algorithms, followed by a review of big data processing technologies. Finally, the identified research challenges of real-time big data processing in anomaly detection are discussed.

2.
    
This paper proposes a new method for semi-supervised clustering of data that contains only pairwise relational information. Specifically, our method simultaneously learns two similarity matrices, one in feature space and one in label space: the similarity matrix in feature space is learned with an adaptive neighbor strategy, while the one in label space is obtained through a label propagation approach. The two learned matrices explore the local structure (learned from feature space) and the global structure (learned from label space) of the data, respectively. Furthermore, because most existing clustering methods do not fully consider the graph structure, they cannot achieve optimal clustering performance. Therefore, our method forces the data into c clusters by adding a low-rank restriction on the graph Laplacian matrix. Finally, an alignment restriction between the two similarity matrices is imposed, all terms are combined into a unified framework, and an iterative optimization strategy is used to solve the proposed model. Experiments on real-world data show that our method achieves strong performance compared with other state-of-the-art methods.

3.
This paper is concerned with paraphrase detection, i.e., identifying sentences that are semantically identical. The ability to detect similar sentences written in natural language is crucial for several applications, such as text mining, text summarization, plagiarism detection, authorship authentication and question answering. Recognizing this importance, we study in particular how to address the challenges of detecting paraphrases in user-generated short texts, such as those on Twitter, which often contain language irregularity and noise, and do not necessarily contain as much semantic information as longer clean texts. We propose a novel deep neural network-based approach that relies on coarse-grained sentence modelling using a convolutional neural network (CNN) and a recurrent neural network (RNN) model, combined with a specific fine-grained word-level similarity matching model. More specifically, we develop a new architecture, called DeepParaphrase, which makes it possible to create an informative semantic representation of each sentence by (1) using a CNN to extract the local region information in the form of important n-grams from the sentence, and (2) applying an RNN to capture the long-term dependency information. In addition, we perform a comparative study of state-of-the-art approaches to paraphrase detection. An important insight from this study is that existing paraphrase approaches perform well when applied to clean texts, but they do not necessarily deliver good performance on noisy texts, and vice versa. In contrast, our evaluation has shown that the proposed DeepParaphrase-based approach achieves good results on both types of texts, thus making it more robust and generic than the existing approaches.

4.
One of the most time-critical challenges for the Natural Language Processing (NLP) community is to combat the spread of fake news and misinformation. Existing approaches for misinformation detection use neural network models, statistical methods, linguistic traits, fact-checking strategies, etc. However, the menace of fake news seems to grow stronger with the advent of very large and unusually creative language models. The relevant literature reveals that one major characteristic of the virality of fake news is the presence of an element of surprise in the story, which attracts immediate attention and evokes a strong emotional response in the reader. In this work, we leverage this idea and propose textual novelty detection and emotion prediction as two tasks related to automatic misinformation detection. We re-purpose textual entailment for novelty detection and use models trained on large-scale datasets of entailment and emotion to classify fake information. Our results support the idea, as we achieve state-of-the-art (SOTA) performance (7.92%, 1.54%, 17.31% and 8.13% improvement in terms of accuracy) on four large-scale misinformation datasets. We hope that our current probe will motivate the community to explore further research on misinformation detection along this line. The source code is available on GitHub.

5.
In this work, we present the first quality flaw prediction study for articles containing the two most frequent verifiability flaws in Spanish Wikipedia: articles which do not cite any references or sources at all (labeled Unreferenced) and articles that need additional citations for verification (labeled Refimprove). Based on the underlying characteristics of each flaw, different state-of-the-art approaches were evaluated. For articles not citing any references, a well-established rule-based approach was evaluated, and the findings show that some of these articles suffer from the Refimprove flaw instead. Likewise, for articles that need additional citations for verification, the well-known PU learning and one-class classification approaches were evaluated. In addition, new methods were compared and a new feature was proposed to model this latter flaw. The results showed that new methods such as under-bagged decision trees with sum or majority voting rules, biased-SVM, and centroid-based balanced SVM perform best in comparison with the ones previously published.

6.
    
Despite growing efforts to halt distasteful content on social media, multilingualism has added a new dimension to this problem. The scarcity of resources makes the challenge even greater when it comes to low-resource languages. This work focuses on providing a novel method for abusive content detection in multiple low-resource Indic languages. Our observations indicate that a post's tendency to attract abusive comments, as well as features such as user history and social context, significantly aid in the detection of abusive content. The proposed method first learns social and text context features in two separate modules. The integrated representation from these modules is learned and used for the final prediction. To evaluate the performance of our method against different classical and state-of-the-art methods, we have performed extensive experiments on the SCIDN and MACI datasets, consisting of 1.5M and 665K multilingual comments, respectively. Our proposed method outperforms state-of-the-art baseline methods with an average increase of 4.08% and 9.52% in F1-score on the SCIDN and MACI datasets, respectively.

7.
Ranking aggregation is the task of combining multiple ranking lists given by several experts or simple rankers to obtain a hopefully better ranking. It is applicable in several fields, such as meta search and collaborative filtering. Most of the existing work is under an unsupervised framework. The performance of these methods is usually limited, especially in unreliable cases, since no labeled information is involved. In this paper, we propose a semi-supervised ranking aggregation method in which preference constraints on several item pairs are given. In our method, the aggregation function is learned based on the ordering agreement of different rankers. The ranking scores assigned by this ranking function to the labeled data should be consistent with the given pairwise order constraints, while the ranking scores on the unlabeled data should obey the intrinsic manifold structure of the ranked items. Experimental results on toy data and the OHSUMED data are presented to illustrate the validity of our method.

8.
Graph Convolutional Networks (GCNs) have been established as a fundamental approach for representation learning on graphs, based on convolution operations on non-Euclidean domains defined by graph-structured data. GCNs and their variants have achieved state-of-the-art results on classification tasks, especially in semi-supervised learning scenarios. A central challenge in semi-supervised classification is how to exploit as much as possible of the useful information encoded in the unlabeled data. In this paper, we address this issue through a novel self-training approach for improving the accuracy of GCNs on semi-supervised classification tasks. A margin score is used through a rank-based model to identify the most confident sample predictions. Such predictions are exploited as an expanded labeled set in a second-stage training step. Our model is suitable for different GCN models. Moreover, we also propose a rank aggregation of labeled sets obtained by different GCN models. The experimental evaluation considers four GCN variations and traditional benchmarks extensively used in the literature. Significant accuracy gains were achieved for all evaluated models, reaching results comparable or superior to the state of the art. The best results were achieved for rank aggregation self-training on combinations of the four GCN models.
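The selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not define the margin score, so the sketch assumes the common definition (the gap between the two largest predicted class probabilities), and the names `margin_score` and `select_pseudo_labels` are hypothetical rather than taken from the paper.

```python
def margin_score(probabilities):
    """Assumed margin: gap between the two largest class probabilities."""
    top_two = sorted(probabilities, reverse=True)[:2]
    return top_two[0] - top_two[1]


def select_pseudo_labels(predictions, top_n):
    """Rank unlabeled nodes by margin and keep the top_n most confident.

    predictions maps a node id to its list of predicted class probabilities.
    Returns (node_id, predicted_class) pairs that could extend the labeled
    set for a second training stage.
    """
    ranked = sorted(
        predictions.items(),
        key=lambda item: margin_score(item[1]),
        reverse=True,
    )
    selected = []
    for node_id, probs in ranked[:top_n]:
        predicted_class = max(range(len(probs)), key=probs.__getitem__)
        selected.append((node_id, predicted_class))
    return selected
```

In a self-training loop, the returned pairs would be added to the labeled set before retraining; how the paper combines selections from several GCN models is not reproduced here.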

9.
    
Semi-supervised multi-view learning has recently achieved appealing performance by using the consensus relation between samples. However, in addition to the relation between samples, the relation between samples and their assemble centroid is also important to the learning. In this paper, we propose a novel model based on orthogonal non-negative matrix factorization, which allows exploring the consensus relations both between samples and between samples and their assemble centroid. Since this model utilizes more consensus information to guide the multi-view learning, it can lead to better performance. In addition, we theoretically derive a proposition about the equivalence between partial orthogonality and full orthogonality. Based on this proposition, the orthogonality constraint and the label constraint are simultaneously implemented in the proposed model. Experimental evaluations on five real-world datasets show that our approach outperforms the state-of-the-art methods, with an average improvement of 6% in terms of ARI.

10.
Semi-supervised document retrieval   Total citations: 2, self-citations: 0, citations by others: 2
This paper proposes a new machine learning method for constructing ranking models in document retrieval. The method, which is referred to as SSRank, aims to combine the advantages of both the traditional Information Retrieval (IR) methods and the recently proposed supervised learning methods for IR. The advantages include the use of a limited amount of labeled data and rich model representation. To do so, the method adopts a semi-supervised learning framework in ranking model construction. Specifically, given a small number of labeled documents with respect to some queries, the method effectively labels the unlabeled documents for the queries. It then uses all the labeled data to train a machine learning model (in our case, a Neural Network). In the data labeling, the method also makes use of a traditional IR model (in our case, BM25). A stopping criterion based on machine learning theory is given for the data labeling process. Experimental results on three benchmark datasets and one web search dataset indicate that SSRank consistently and almost always significantly outperforms the baseline methods (unsupervised and supervised learning methods), given the same amount of labeled data. This is because SSRank can effectively leverage unlabeled data in learning.
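For readers unfamiliar with the traditional IR model mentioned above, here is a minimal sketch of BM25 scoring. It is not SSRank itself: the abstract does not say which BM25 variant or parameters were used, so the `k1` and `b` defaults and the "+1" smoothed IDF below are assumptions, and the function name is hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query_terms, documents, k1=1.2, b=0.75):
    """Score each tokenized document against the query with BM25.

    documents is a non-empty list of token lists. k1 and b are the usual
    BM25 free parameters; the values here are common defaults, not values
    reported in the paper.
    """
    n_docs = len(documents)
    avg_len = sum(len(doc) for doc in documents) / n_docs
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    scores = []
    for doc in documents:
        term_freq = Counter(doc)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            # Smoothed IDF (assumed variant); stays positive for common terms.
            idf = math.log(
                (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0
            )
            tf = term_freq[term]
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores
```

A score of this kind could serve as one signal when deciding which unlabeled documents to label for a query, which is the role the abstract assigns to the traditional IR model.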

11.
    
In this work, we propose BERT-WMAL, a hybrid model that brings together information learned from data through the recent transformer deep learning model and information obtained from a polarized lexicon. The result is a model for sentence polarity that achieves performance comparable to the state of the art, with the advantage of being able to provide the end-user with an explanation of the most important terms involved in the provided prediction. The model has been evaluated on three Italian polarity detection datasets, i.e., SENTIPOLC, AGRITREND and ABSITA. The first contains 7,410 tweets released for training and 2,000 for testing; the second includes 1,000 tweets without a split; and the third includes 2,365 reviews for training and 1,171 for testing. The use of lexicon-based information proves to be effective in terms of the F1 measure, since it improves the F1 score on all the observed datasets: from 0.664 to 0.669 (i.e., 0.772%) on AGRITREND, from 0.728 to 0.734 (i.e., 0.854%) on SENTIPOLC and from 0.904 to 0.921 (i.e., 1.873%) on ABSITA. The usefulness of this model depends not only on its effectiveness in terms of the F1 measure, but also on its ability to generate predictions that are more explainable and especially convincing for end-users. We evaluated this aspect through a user study involving four native Italian speakers, each evaluating 64 sentences with associated explanations. The results demonstrate the validity of this approach, which is based on a combination of attention weights extracted from the deep learning model and the linguistic knowledge stored in the WMAL lexicon. These considerations allow us to regard the approach provided in this paper as a promising starting point for further work in this research area.

12.
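Clustering algorithms are usually used to cluster data. In addition, they can also be used to detect anomalous data. This paper first introduces the partition-based clustering algorithm K-means, then gives a description of the improved algorithm I-K-means, and finally performs anomaly analysis on an example.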
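As an illustration of the general idea only, the following minimal sketch uses plain K-means for anomaly detection. It is not the I-K-means algorithm, whose details are not given here. The sketch assumes Lloyd-style iterations, initializes the centroids from the first k points purely for determinism, and scores each point by its distance to the nearest centroid; all function names are hypothetical.

```python
import math


def kmeans(points, k, iterations=50):
    """Plain Lloyd-style K-means on a list of equal-length numeric tuples.

    Centroids start at the first k points (a simplistic, deterministic
    choice made for this sketch).
    """
    centroids = list(points[:k])
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for i, members in enumerate(clusters):
            if members:
                mean = tuple(sum(axis) / len(members) for axis in zip(*members))
                new_centroids.append(mean)
            else:
                new_centroids.append(centroids[i])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def anomaly_scores(points, centroids):
    """Distance from each point to its nearest centroid; larger is more anomalous."""
    return [min(math.dist(p, c) for c in centroids) for p in points]


def top_anomalies(points, centroids, n):
    """Indices of the n points with the largest anomaly scores."""
    scores = anomaly_scores(points, centroids)
    return sorted(range(len(points)), key=lambda i: scores[i], reverse=True)[:n]
```

Points far from every cluster center receive the largest scores and would be the candidates for anomaly analysis. Note that an outlier can also capture a centroid of its own, which is one reason improved variants of K-means are studied for this purpose.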

13.
The majority of currently available entity alignment (EA) solutions rely primarily on structural information to align entities, which is biased and disregards additional multi-source information. To compensate for inadequate structural details, this article proposes the SKEA framework, a simple but flexible framework for Entity Alignment with cross-modal supervision of Supporting Knowledge. We employ a relational aggregate network to specifically utilize the details about the entity and its neighbors. To overcome the limitations of relational features, two multi-modal encoding modules are used to extract visual and textual information. A new set of potential aligned entity pairs is generated by SKEA in each iteration using the knowledge of two reference modalities, which can enhance the model's supervision. It is important to note that the supporting information used in our framework does not participate in the network's backpropagation, which considerably improves efficiency and differs markedly from earlier work. Experiments demonstrate that, in comparison to existing baselines, our proposed framework can incorporate multi-aspect information efficiently and enable supervisory signals from other modalities to be transmitted to entities. The maximum performance improvement of 5.24% indicates the superiority of the proposed framework, especially for sparse KGs.

14.
Document clustering is an important tool for organizing and browsing document collections. In real applications, some limited knowledge about the cluster membership of a small number of documents is often available, such as some pairs of documents belonging to the same cluster. This kind of prior knowledge can serve as constraints for the clustering process. We integrate the constraints into the trace formulation of the sum of squared Euclidean distances function of K-means. Then, the combined criterion function is transformed into trace maximization, which is further optimized by eigen-decomposition. Our experimental evaluation shows that the proposed semi-supervised clustering method can achieve better performance than three existing methods.
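For context, the standard trace formulation of the unconstrained K-means objective can be written as below. The notation is an assumption introduced for this sketch and does not come from the abstract, and the way the pairwise constraints enter the combined criterion is not specified there, so it is not reconstructed here.

```latex
% Assumed notation: X \in \mathbb{R}^{d \times n} holds the n documents as columns;
% C_1, \dots, C_k are the clusters, with sizes n_j and centroids m_j;
% Y \in \mathbb{R}^{n \times k} has Y_{ij} = 1/\sqrt{n_j} if x_i \in C_j and 0 otherwise,
% so that Y^\top Y = I_k.
\begin{aligned}
J_{\text{K-means}}
  &= \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - m_j \rVert^2 \\
  &= \sum_{i=1}^{n} \lVert x_i \rVert^2 - \sum_{j=1}^{k} n_j \lVert m_j \rVert^2 \\
  &= \operatorname{tr}(X^\top X) - \operatorname{tr}(Y^\top X^\top X Y).
\end{aligned}
```

Since tr(X^T X) does not depend on the clustering, minimizing J is equivalent to maximizing tr(Y^T X^T X Y). Relaxing Y to any matrix with orthonormal columns yields a problem solved by the top k eigenvectors of X^T X, which is the usual route from a trace maximization to eigen-decomposition.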

15.
    
With the development of information technology and economic growth, the Internet of Things (IoT) industry has entered the fast lane of development. The IoT industry system has also gradually improved, forming a complete industrial foundation that includes chips, electronic components, equipment, software, integrated systems, IoT services, and telecom operators. In the event of selective forwarding attacks, virus damage, malicious virus intrusion, etc., the losses caused by such security problems are more serious than in traditional networks, because they affect not only network information but also physical objects. The limited resources of sensor nodes in the Internet of Things, the complexity of networking, and the open wireless broadcast communication characteristics make it vulnerable to attacks. An Intrusion Detection System (IDS) helps identify anomalies in the network and takes the necessary countermeasures to ensure the safe and reliable operation of IoT applications. This paper proposes an IoT feature extraction and intrusion detection algorithm for intelligent cities based on a deep migration learning model, which combines a deep learning model with intrusion detection technology. Drawing on the existing literature and algorithms, this paper introduces the modeling scheme of the migration learning model and data feature extraction. In the experimental part, KDD CUP 99 was selected as the experimental data set, and 10% of the data was used as training data. The proposed algorithm is also compared with existing algorithms. The experimental results show that the proposed algorithm has a shorter detection time and higher detection efficiency.

18.
    
Aesthetic assessment evaluates the quality of a given image using subjective annotations, commonly user ratings, as a knowledge base. Rating complexity is usually relaxed in state-of-the-art works by employing a binary high/low quality label computed from the mean value of rating votes. Nevertheless, this approach introduces uncertainty for average-quality images, which may affect the performance of machine learning models trained on annotated data. In this work, we present a novel approach to aesthetic assessment based on redefining the rating-based groundtruths present in most datasets. Our intent is twofold: to reduce the rating uncertainty and to automatically group the images into clusters reflecting high and low quality patterns, thus avoiding an arbitrary threshold such as 5 on a 1–10 rating scale. The experimentation uses the well-known AVA dataset, which consists of more than 255,000 images, and we train several CNN models to test our new groundtruths against the baseline ones. The results show that our approach achieves significant performance gains, between 3% and 9% more balanced accuracy than the baseline groundtruths.

19.
Most existing state-of-the-art neural network models for math word problems use the Goal-driven Tree-Structured decoder (GTS) to generate expression trees. However, we found that GTS does not provide good predictions for longer expressions, mainly because it does not capture the relationships among the goal vectors of each node in the expression tree and ignores the position order of the nodes before and after the operator. In this paper, we propose a novel Recursive tree-structured neural network with Goal Forgetting and information aggregation (RGFNet) to address these limitations. The goal forgetting and information aggregation module is based on ordinary differential equations (ODEs), and we use it to build a sub-goal information feedback neural network (SGIFNet). Unlike GTS, which uses two-layer gated-feedforward networks to generate goal vectors, we introduce a novel sub-goal generation module. The sub-goal generation module can capture the relationships among related nodes (e.g., parent nodes, sibling nodes) using an attention mechanism. Experimental results on two large public datasets, i.e., Math23K and Ape-clean, show that our tree-structured model outperforms the state-of-the-art models and obtains answer accuracy over 86%. Furthermore, the performance on long-expression problems is promising.

20.
    
Since the patient is not quarantined while awaiting the result of the Polymerase Chain Reaction (PCR) test used in the diagnosis of COVID-19, the disease continues to spread. This study aims to reduce the duration and amount of transmission of the disease by shortening the diagnosis time of COVID-19 patients through the use of Computed Tomography (CT). It also aims to provide radiologists with a decision support system for the diagnosis of COVID-19. Deep features were extracted with deep learning models such as ResNet-50, ResNet-101, AlexNet, Vgg-16, Vgg-19, GoogLeNet, SqueezeNet and Xception from 1345 CT images obtained from the radiography database of Siirt Education and Research Hospital. These deep features were given to classification methods such as Support Vector Machine (SVM), k Nearest Neighbor (kNN), Random Forest (RF), Decision Trees (DT) and Naive Bayes (NB), and their performance was evaluated with test images. Accuracy, F1-score and the ROC curve were used as success criteria. According to the results, the best performance was obtained with ResNet-50 and the SVM method: the accuracy was 96.296%, the F1-score was 95.868%, and the AUC value was 0.9821. The deep learning model and classification method examined in this study and found to perform well can be used as an auxiliary decision support system, preventing unnecessary tests for COVID-19.
