Incremental Machine Learning for Streaming Data with river: Part 3; Classification Algorithms - Ensemble Learning
Intro
Welcome back, this is part 3 of the “Incremental Machine Learning for Streaming data with river” series. In part 1 we discussed the need for incremental learning, some challenges and terminologies, and also explored model drift. We also familiarized ourselves with the river Python package, which focuses on providing a user-friendly machine learning library for streaming data and incremental/online learning. In part 2 we discussed how to simulate streaming data (for experimentation and research) and then focused on a few tree-based algorithms for incremental learning.
In this part 3 of the series, we will focus on ensemble learning methods for streaming data. Ensemble learning combines multiple models to improve prediction accuracy on unseen data. The most popular ensemble learning methods are bagging (also known as bootstrap aggregation) and boosting.
In the remainder of this article, we will discuss Adaptive Random Forests, online bagging methods, and boosting methods for incremental ML. The bagging method, first developed by Leo Breiman in 1996 [2], selects random samples of data from a training set with replacement, allowing individual data points to be selected multiple times. After several data samples have been generated, a model is trained on each one individually, and depending on the task (classification or regression) the majority vote or average of those predictions yields a more accurate estimate.
Table of Contents
- Adaptive Random Forests for incremental ML and hands-on code example.
- Online Bagging methods for incremental ML and hands-on code example.
- Boosting methods for incremental ML and hands-on code example.
- Summary of the series
1. Adaptive Random Forests for incremental ML:
The random forest algorithm works by building multiple decision trees. In batch machine learning, random forest makes multiple passes over the dataset. Hence a vanilla random forest algorithm cannot be applied to incremental ML on streaming data: it cannot handle drift, and it requires multiple passes over the data.
A variation on the random forest, called the adaptive random forest (ARF), was introduced by Gomes and team [3] to work with streaming data. Online bagging is used in the ARF method to simulate resampling with replacement. ARF uses a Poisson(lambda = 6) distribution in online bagging, as opposed to the more common Poisson(lambda = 1) distribution.
The Hoeffding tree (discussed in part 2 of the series) serves as the foundation for ARF’s base model, random forest tree training (RFTreeTrain), albeit with a few modifications. RFTreeTrain differs in that it prohibits early tree pruning. Another distinction is that splits are restricted to a random subset of features chosen when a new node is created. ARF also employs a drift detector per base tree, which triggers a reset whenever a drift occurs, and it enables training background trees that, in the event of a drift, take the place of the active trees.
The full adaptive random forest (ARF) procedure consists of two parts: the base model that ARF uses, random forest tree training (the counterpart of the decision tree in batch random forests), and the overall ensemble algorithm. The pseudocode for both is given in [3].
Further experiments and results for ARF can be found in [3].
1.2 Hands-on ARF with the river package:
The river package provides the AdaptiveRandomForestClassifier class [7], which implements the ARF algorithm discussed above.
The three most important aspects of Adaptive Random Forest are:
inducing diversity through re-sampling
inducing diversity through randomly selecting subsets of features for node splits
drift detectors per base tree, which cause selective resets in response to drifts
It also allows training background trees, which start training if a warning is detected and replace the active tree if the warning escalates to a drift.
Some important parameters in AdaptiveRandomForestClassifier are as follows:
a. n_models: The number of trees (RFTreeTrain base models) in the ensemble.
b. max_features: The maximum number of attributes considered for each node split.
c. lambda_value: The lambda value for bagging; it defaults to 6 (for the reasons discussed above).
d. max_depth: The maximum depth a tree can reach. By default it is infinite.
e. split_criterion: The split criterion to use (Gini impurity, information gain, or Hellinger distance).
f. leaf_prediction: The prediction mechanism used at the leaves.
- ‘mc’ — Majority Class
- ‘nb’ — Naive Bayes
- ‘nba’ — Naive Bayes Adaptive
Following is sample code for training and evaluating the model.
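A minimal sketch, assuming river 0.11-style imports (where AdaptiveRandomForestClassifier lives in river.ensemble, per [7]) and illustrative parameter values:

```python
from river import datasets, ensemble, evaluate, metrics

dataset = datasets.Phishing()

# 10 base trees; lambda_value keeps its default of 6, as discussed above
model = ensemble.AdaptiveRandomForestClassifier(n_models=10, seed=42)

# Progressive validation: each sample is used for prediction first, then for learning
metric = metrics.Accuracy()
evaluate.progressive_val_score(dataset, model, metric)
print(metric)
```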
An example of how to use the ARF model to make a prediction on unseen data:
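A short sketch, reusing the imports and the trained `model` from the previous snippet; one sample from the stream stands in for unseen data here, purely for illustration:

```python
# Grab one observation from the stream to stand in for unseen data (illustrative only)
x, y = next(iter(datasets.Phishing()))

print(model.predict_one(x))        # predicted class label
print(model.predict_proba_one(x))  # per-class probabilities as a dict
```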
2. Online Bagging methods for incremental ML:
In batch machine learning, bootstrapping is the method of randomly creating samples of data from a population with replacement in order to estimate a population parameter. X base models are trained on bootstrap sample sets of size N. The bootstrapped sample sets are built from the original training dataset by sampling with replacement. Each base model’s training set contains each original training instance K times, where P(K = k) follows a binomial distribution. As N tends to infinity, K tends to a Poisson(1) distribution, i.e., P(K = k) = exp(-1)/k!.
In online bagging [4], the algorithm presents each training example to each base model K ~ Poisson(1) times and updates the base model accordingly. New instances are classified by majority voting over the X base models; this step is the same in batch bagging and online bagging. Online bagging is considered an effective substitute for batch bagging.
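To make the update step concrete, here is a minimal sketch of the core online-bagging logic described above (not river's internal implementation); `models` is assumed to be any list of incremental classifiers exposing learn_one and predict_one:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(42)

def online_bagging_update(models, x, y):
    """Show the incoming observation (x, y) to each base model k ~ Poisson(1) times."""
    for model in models:
        k = rng.poisson(1)          # plays the role of resampling with replacement
        for _ in range(int(k)):
            model.learn_one(x, y)

def predict_majority(models, x):
    """Classify a new instance by majority vote over the base models."""
    votes = Counter(model.predict_one(x) for model in models)
    return votes.most_common(1)[0][0]
```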
Reference [4] goes into detail about online bagging and boosting and is a good next source to continue learning.
2.2 Hands-On Online Bagging methods for incremental ML:
The river package provides the BaggingClassifier class [6] for online bagging classification and BaggingRegressor for regression. Since this series focuses on classification methods, we will explore BaggingClassifier.
Online bagging for classification. For each incoming observation, each model’s learn_one method is called k times, where k is sampled from a Poisson distribution of parameter 1. k thus has a 36% chance of being equal to 0, a 36% chance of being equal to 1, an 18% chance of being equal to 2, a 6% chance of being equal to 3, a 1% chance of being equal to 4, etc.
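These probabilities follow directly from the Poisson(1) mass function P(K = k) = exp(-1)/k!, which can be verified in a couple of lines:

```python
import math

for k in range(5):
    print(f"P(K = {k}) = {math.exp(-1) / math.factorial(k):.3f}")

# P(K = 0) = 0.368
# P(K = 1) = 0.368
# P(K = 2) = 0.184
# P(K = 3) = 0.061
# P(K = 4) = 0.015
```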
This class has only a few parameters.
- model: The classifier to bag.
- n_models: The number of models in the ensemble.
- seed: Random number generator seed for reproducibility.
In the example, we will use five logistic regressions bagged together. The performance is slightly better than when using a single logistic regression. We will use the Phishing dataset, which contains features from web pages that are classified as phishing or not. We will preprocess the data with a StandardScaler.
Let us take a look at the dataset using the following code:
```python
from river import datasets

dataset = datasets.Phishing()

# Run the following to get details of the dataset
dataset

# Take a look at a couple of the data points in the dataset
list(dataset.take(2))
```
The output shows the dataset’s summary (task, number of samples and features) followed by the first two (x, y) pairs.
Here is the full example of building two models:
- A simple logistic regression (incremental model)
- A bagging method with 5 base models. Each base model is logistic regression.
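A sketch of both models under those assumptions; the ROCAUC metric and the seed are illustrative choices:

```python
from river import datasets, ensemble, evaluate, linear_model, metrics, preprocessing

# Model 1: a single incremental logistic regression
single = preprocessing.StandardScaler() | linear_model.LogisticRegression()

metric = metrics.ROCAUC()
evaluate.progressive_val_score(datasets.Phishing(), single, metric)
print("Single logistic regression:", metric)

# Model 2: online bagging with 5 logistic regressions as base models
bagged = preprocessing.StandardScaler() | ensemble.BaggingClassifier(
    model=linear_model.LogisticRegression(),
    n_models=5,
    seed=42,
)

metric = metrics.ROCAUC()
evaluate.progressive_val_score(datasets.Phishing(), bagged, metric)
print("Bagged logistic regressions:", metric)
```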
Note: The pipe (|) character in the above code is used to build a pipeline. It works similarly to an sklearn Pipeline [5].
3. Boosting methods for incremental ML
Boosting is a well-known ensemble learning method. The fundamental idea behind it is to successively combine a group of “weak learners” to produce a “strong learner.” One of the earliest boosting methods for batch data is AdaBoost.
The online boosting classifier devised by Wang [8] is the incremental form of the AdaBoost algorithm. In an online learning environment, each base model is trained on each new example K times, where K follows a binomial distribution. Because a data stream contains a very large number of examples (tending to infinity), this binomial distribution tends to a Poisson(lambda) distribution. The value of lambda is determined by keeping track of the weights of incorrectly and correctly classified samples. The online boosting algorithm uses the ADWIN windowing approach to manage the concept drift inherent in the data stream.
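As an illustration, here is a simplified sketch of the weight update used in online AdaBoost as described by Oza and Russell [4], without the ADWIN drift handling mentioned above; the function and variable names are made up for clarity:

```python
import numpy as np

rng = np.random.default_rng(42)

def online_boosting_update(models, lambda_sc, lambda_sw, x, y):
    """Process one incoming example (x, y).

    lambda_sc[m] / lambda_sw[m] accumulate the weight of correctly / wrongly
    classified examples seen by base model m; both start at zero.
    """
    lam = 1.0
    for m, model in enumerate(models):
        k = rng.poisson(lam)                 # train the base model k times
        for _ in range(int(k)):
            model.learn_one(x, y)
        if model.predict_one(x) == y:        # correctly classified: shrink the weight
            lambda_sc[m] += lam
            eps = lambda_sw[m] / (lambda_sc[m] + lambda_sw[m])
            lam *= 1.0 / (2.0 * (1.0 - eps))
        else:                                # misclassified: boost the weight
            lambda_sw[m] += lam
            eps = lambda_sw[m] / (lambda_sc[m] + lambda_sw[m])
            lam *= 1.0 / (2.0 * eps)
```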
The full pseudocode for the online AdaBoost algorithm can be found in [8].
3.2 Hands-On Adaboost for Incremental Learning with River:
The river package provides the AdaBoostClassifier for Online boosting.
Boosting for classification. For each incoming observation, each model’s learn_one method is called k times, where k is sampled from a Poisson distribution of parameter lambda. The lambda parameter is updated as the weak learners successively fit the same observation.
The parameters here are the same as the ones used by BaggingClassifier.
Here we look at the code for classifying the phishing data using the AdaBoostClassifier (we simply replace BaggingClassifier with AdaBoostClassifier in the previous code).
Following is the code:
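A sketch along those lines, swapping in AdaBoostClassifier; the metric, seed, and number of base models are illustrative:

```python
from river import datasets, ensemble, evaluate, linear_model, metrics, preprocessing

model = preprocessing.StandardScaler() | ensemble.AdaBoostClassifier(
    model=linear_model.LogisticRegression(),
    n_models=5,
    seed=42,
)

metric = metrics.ROCAUC()
evaluate.progressive_val_score(datasets.Phishing(), model, metric)
print("Online AdaBoost:", metric)
```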
Summary of the series:
In this three-part series on incremental learning with the river package, we have learned quite a few things. We started with the question of why we need to analyze data in real time and the need for incremental learning. Then we learned about some challenges of incremental learning on streaming data. We looked at model drift and its various types, and then learned about window methods to analyze and detect drift in data. We explored the river package to detect drift. Finally, in the last two parts of the series, we looked at various classification algorithms and leveraged the river package to perform classification incrementally on streaming data.
The code for this tutorial can be found at:
- On google colab
- On github
Thank you for reading, that’s all for this article. More content to follow. Please clap if the article was helpful to you and comment if you have any questions. If you want to connect with me, learn and grow with me, or collaborate, you can reach me at any of the following:
Linkedin:- https://www.linkedin.com/in/virajdatt-kohir/
Twitter:- https://twitter.com/kvirajdatt
GitHub:- https://github.com/Virajdatt
GoodReads:- https://www.goodreads.com/user/show/114768501-virajdatt-kohir
References:
1. Practical Machine Learning for Streaming Data with Python. Apress. https://www.amazon.com/Practical-Machine-Learning-Streaming-Python/dp/1484268660
2. Breiman, L. (1996). Bagging Predictors. Machine Learning, 24(2), 123–140. https://link.springer.com/content/pdf/10.1007/BF00058655.pdf
3. Gomes, H.M., Bifet, A., Read, J., Barddal, J.P., Enembreck, F., Pfahringer, B., Holmes, G., & Abdessalem, T. (2017). Adaptive random forests for evolving data stream classification. Machine Learning, 106, 1469–1495.
4. Oza, N.C., & Russell, S. (2001). Online Bagging and Boosting. Artificial Intelligence and Statistics, pp. 105–112. https://www.researchgate.net/publication/2453583_Online_Bagging_and_Boosting
5. scikit-learn Pipeline documentation. https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
6. river BaggingClassifier API documentation. https://riverml.xyz/0.11.1/api/ensemble/BaggingClassifier/
7. river AdaptiveRandomForestClassifier API documentation. https://riverml.xyz/dev/api/ensemble/AdaptiveRandomForestClassifier/
8. Wang, B., & Pineau, J. (2016). Online Bagging and Boosting for Imbalanced Data Streams. IEEE Transactions on Knowledge and Data Engineering, December 2016, pp. 3353–3366.