Data Stream Classification Using Classifier Ensemble

Michał Woźniak

Andrzej Kasprzak


For contemporary business, a crucial factor is making smart decisions on the basis of the knowledge hidden in stored data. Unfortunately, traditional simple methods of data analysis are not sufficient for the efficient management of modern enterprises, because they are not appropriate for the huge and growing amount of data that enterprises store. Additionally, data usually arrives continuously in the form of a so-called data stream. A great disadvantage of traditional classification methods is that they assume the statistical properties of the discovered concept remain unchanged, while in real situations we may observe so-called concept drift, which can be caused by changes in the class probabilities and/or in the class-conditional probability distributions. The ability to take new training data into account is an important feature of machine learning methods used in security applications (e.g., spam filtering or intrusion detection) or in decision support systems for marketing departments, which need to follow changing client behavior. Unfortunately, the occurrence of concept drift can dramatically decrease classification accuracy. This work presents a comprehensive study of the classifier ensemble approach applied to the problem of drifting data streams. In particular, it reports research on modifications of the previously developed Weighted Aging Classifier Ensemble (WAE) algorithm, which is able to construct a valuable classifier ensemble for the classification of data streams with incremental drift. We generalize the WAE method and propose a general framework for this approach. Such a framework can prune a classifier ensemble before or after assigning weights to individual classifiers. Additionally, we propose new classifier pruning criteria, weight calculation methods, and aging operators.
We also propose a rejuvenating operator, which is able to soften the aging effect. This could be especially useful when quite "old" classifiers are high-quality models, i.e., their presence increases ensemble accuracy, which may occur, e.g., in the case of recurring concept drift. Selected characteristics of the proposed framework were evaluated on the basis of a wide range of computer experiments carried out on two benchmark data streams. The obtained results confirmed the usability of the proposed method for data stream classification in the presence of incremental concept drift.
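To make the interplay of weighting, aging, rejuvenation, and pruning more concrete, the following is a minimal Python sketch of a weighted ensemble that is updated once per data chunk. It is not the WAE algorithm itself: the abstract does not state the actual formulas, so the `Member` class, the aging rule (accuracy divided by the square root of age), the rejuvenation threshold `rejuvenate_above`, and the "keep the `max_size` heaviest members" pruning criterion are all assumptions chosen for illustration only.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import sqrt
from typing import Callable, Hashable, List, Sequence


@dataclass
class Member:
    """One ensemble member (hypothetical structure, for illustration)."""
    predict: Callable[[Sequence[float]], Hashable]  # already-trained classifier
    accuracy: float  # accuracy on the most recent chunk, in [0, 1]
    age: int = 1     # number of chunks the member has spent in the ensemble


def weight(m: Member) -> float:
    # Assumed aging rule: accuracy discounted by the square root of age.
    return m.accuracy / sqrt(m.age)


def update(ensemble: List[Member], max_size: int,
           rejuvenate_above: float = 0.9) -> List[Member]:
    """Age every member, rejuvenate accurate ones, then prune by weight."""
    for m in ensemble:
        m.age += 1
        if m.accuracy >= rejuvenate_above:
            # Assumed rejuvenation rule: cancel this round's aging.
            m.age = max(1, m.age - 1)
    # Pruning after weighting: keep the max_size heaviest members.
    return sorted(ensemble, key=weight, reverse=True)[:max_size]


def vote(ensemble: List[Member], x: Sequence[float]) -> Hashable:
    """Weighted majority vote over the members' predictions."""
    scores = defaultdict(float)
    for m in ensemble:
        scores[m.predict(x)] += weight(m)
    return max(scores, key=scores.get)


# Usage: an old but accurate member survives pruning thanks to rejuvenation.
old_good = Member(lambda x: "a", accuracy=0.95, age=4)
new_weak = Member(lambda x: "b", accuracy=0.60, age=1)
mid = Member(lambda x: "b", accuracy=0.70, age=2)
kept = update([old_good, new_weak, mid], max_size=2)
label = vote(kept, [0.0])
```

In this toy setting the old, accurate member keeps its age and therefore its weight, which illustrates why softening the aging effect may help when earlier concepts recur. Pruning before weighting would simply swap the order of the two steps in `update`.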

Keywords: data stream classification, classifier ensemble, pattern classification, forgetting
