Information resources estimation for accurate distribution-based concept drift detection |
| |
Affiliation: | 1. School of Information Management, Wuhan University, Wuhan 430072, China; 2. Research Center for Chinese Science Evaluation (RCCSE), Wuhan University, Wuhan 430072, China; 1. Cryptography and Cognitive Informatics Laboratory, AGH University of Science and Technology, 30 Mickiewicza Ave, Krakow 30-059, Poland; 2. School of Computing, Engineering and Mathematical Sciences, La Trobe University, Melbourne, Australia; 3. Department of Computer Science, Ryerson University, Canada |
| |
Abstract: | Machine learning applications must continually use label information from the data stream to detect concept drift and adapt to its dynamic behavior. Because obtaining label information is computationally expensive, it is impractical to assume that the data stream is fully labeled. Many semi-supervised concept drift detection methods have therefore been proposed. Despite this large research effort, the literature lacks an analysis that relates the information resources required to the achievable concept drift detection accuracy. Hence, this paper aims to answer the unexplored research question "How many labeled samples are required to detect concept drift accurately?" by proposing an analytical framework to analyze and estimate the information resources required to detect concept drift accurately. Specifically, this paper decomposes the distribution-based concept drift detection task into a learning task and a dissimilarity measurement task, which are analyzed independently. The analysis results are then correlated to estimate the number of labels required within a set of data samples to detect concept drift accurately. The accuracy of the information resources estimation is evaluated empirically, and the results suggest that the estimation is accurate when a large amount of information resources is provided. Additionally, estimation results for a state-of-the-art method and a benchmark data set are reported to show the applicability of the estimation produced by the proposed analytical framework within benchmarked environments. In general, the estimation from the proposed analytical framework can serve as guidance in designing systems with limited information resources. This paper is also intended to help identify research gaps and inspire new research ideas regarding the analysis of the amount of information resources required for accurate concept drift detection. |
| |
Keywords: | Concept drift; Information resources estimation; Probably Approximately Correct; Power Analysis; No Free Lunch theorem; Dynamic data stream |
This article is indexed in ScienceDirect and other databases. |
|
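The abstract splits the problem into a learning task and a dissimilarity measurement task, and the keywords name Probably Approximately Correct (PAC) learning and power analysis. The sketch below illustrates what a sample-size estimate of each kind can look like. It is not the paper's framework: it assumes the textbook realizable PAC bound for a finite hypothesis class and the textbook per-group sample size for a two-sided two-sample z-test. The function names `pac_sample_bound` and `two_sample_size`, and all parameter values, are hypothetical choices made for illustration.

```python
import math
from statistics import NormalDist


def pac_sample_bound(hypothesis_count: int, epsilon: float, delta: float) -> int:
    """Labeled samples sufficient for a consistent learner over a finite
    hypothesis class to reach error <= epsilon with probability >= 1 - delta
    (textbook realizable PAC bound, assumed here; not taken from the paper)."""
    return math.ceil((math.log(hypothesis_count) + math.log(1.0 / delta)) / epsilon)


def two_sample_size(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Per-group sample size for a two-sided two-sample z-test to detect a
    standardized mean difference `effect_size` (textbook power analysis,
    assumed here; not taken from the paper)."""
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1.0 - alpha / 2.0)  # critical value of the test
    z_beta = std_normal.inv_cdf(power)               # quantile for the target power
    return math.ceil(2.0 * ((z_alpha + z_beta) / effect_size) ** 2)


if __name__ == "__main__":
    # Hypothetical settings: 1000 candidate hypotheses, 10% error, 95% confidence.
    print("learning task:", pac_sample_bound(1000, epsilon=0.1, delta=0.05))
    # Hypothetical settings: medium drift magnitude (d = 0.5), 5% level, 80% power.
    print("dissimilarity task:", two_sample_size(0.5))
```

Both estimates grow as the requirements tighten: a smaller `epsilon` or `delta` raises the learning-side bound, and a smaller drift magnitude raises the number of samples the test needs per window. How the two analyses are combined into a single label estimate is specific to the paper and is not reproduced here.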