Incremental Machine Learning for Streaming data with river: Part 2, Classification Algorithms
Welcome back! This is part 2 of the “Incremental Machine Learning for Streaming data with river” series. In this article, we will go over some supervised classification algorithms: we will first look at the key theoretical underpinnings of each algorithm and then at its implementation in the river package. In the previous part, we saw why we need incremental learning, the challenges that come with it (like concept drift), an introduction to the river package, and finally how to detect concept drift using river. If you haven’t read the previous part yet, I recommend starting there.
In this part, we will look at a few tree-based classification algorithms available for incremental and online learning. Before diving into the classifiers themselves, we will quickly cover two topics that are essential for incremental learning. First, we will see how to simulate streaming data, which will help us run experiments and measure algorithm performance. Next, we will discuss evaluation strategies for incremental algorithms: with incremental learning we cannot use traditional evaluation methods like holding out a fixed test set, so we need strategies suited to data that arrives as a stream. Finally, we will look at two classification algorithms for streaming data:
1. Hoeffding Tree Classifier
2. Extremely Fast Decision Tree Classifier
Table of Contents:
- Simulating Streaming data
- Evaluation strategy for incremental algorithms
- Hoeffding Tree Classifier
- HANDS-ON with river for Hoeffding Tree Classifier
- Extremely Fast Decision Tree Classifier
- HANDS-ON with river for Extremely Fast Decision Tree Classifier
1. Simulating Streaming data:
In this article, we will use streaming data produced by synthetic data generators. This will make it easier to understand and experiment with the classification algorithms. Once we are familiar with how the algorithms work, in future parts we will build models on real-world data. The following are some of the generators that produce streaming data:
- Hyperplane Generator: Hyperplanes are ideal for simulating time-changing concepts, since we can change the orientation and position of the hyperplane in a smooth manner by adjusting the relative sizes of its weights. The generator introduces drift by changing the weights of selected features: the sigma parameter is the probability that the direction of change is reversed, and mag_change is the magnitude of change applied at each example.
Here is how we can generate data using river:

from river import synth

dataset = synth.Hyperplane(seed=42, n_features=2)
for x, y in dataset.take(5):
    print(x, y)

# OUTPUT
# {0: 0.7319, 1: 0.5986} 1
# {0: 0.8661, 1: 0.6011} 1
# {0: 0.8324, 1: 0.2123} 0
# {0: 0.5247, 1: 0.4319} 0
# {0: 0.2921, 1: 0.3663} 0
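Since the main appeal of the hyperplane generator is simulating concept drift, here is a minimal sketch of a drifting stream. It assumes river’s documented Hyperplane parameters n_drift_features, mag_change, and sigma; verify them against your installed version.

from river import synth

# Hyperplane stream where both features drift over time:
# mag_change is the magnitude of weight change applied per example,
# sigma is the probability that the direction of drift reverses.
drifting = synth.Hyperplane(
    seed=42,
    n_features=2,
    n_drift_features=2,
    mag_change=0.5,
    sigma=0.1,
)

for x, y in drifting.take(3):
    print(x, y)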
2. Agrawal generator: The Agrawal stream generator was one of the first data sources used for training decision tree learners. Each instance contains nine features: salary, commission, age, education level, car make, zip code, house value, number of years the house has been owned, and loan amount. The binary target variable indicates whether or not a loan should be approved. There are ten functions defined for generating the class label from the features, each encoding a different rule for deciding whether the loan is approved.
from river import synth

# classification_function accepts values between 0-9
dataset = synth.Agrawal(classification_function=0, seed=42)
for x, y in dataset.take(5):
    print(list(x.values()), y)

# OUTPUT
# [68690.2154, 81303.5729, 62, 4, 6, 2, 419982.4410, 11, 433088.0728] 1
# [98144.9515, 0, 43, 2, 1, 7, 266488.5281, 6, 389.3829] 0
# [148987.502, 0, 52, 3, 11, 8, 79122.9140, 27, 199930.4858] 0
# [26066.5362, 83031.6639, 34, 2, 11, 6, 444969.2657, 25, 23225.2063] 1
# [98980.8307, 0, 40, 0, 6, 1, 1159108.4298, 28, 281644.1089] 0
For a clearer view of the generated data: the nine values in each instance correspond, in order, to the features salary, commission, age, elevel, car, zipcode, hvalue, hyears, and loan, followed by the binary class label.
There are many other such data generators, like the LED stream generator (with concept drift), the Mixed data stream generator, the STAGGER concepts stream generator, etc., and they all follow the same usage pattern, as sketched below.
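As a quick illustration of that shared pattern, here is a sketch using the STAGGER generator; the parameter names follow river’s documentation, so check them against your installed version.

from river import synth

# STAGGER concepts stream: three categorical features (size, color, shape)
# and a binary label defined by one of three classification functions.
dataset = synth.STAGGER(classification_function=2, seed=112)

for x, y in dataset.take(3):
    print(x, y)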
2. Evaluation strategy for incremental algorithms:
Evaluating incremental machine learning algorithms is slightly different from evaluating traditional machine learning models. Typically in batch machine learning we use a holdout test set that the algorithm has never seen before and measure the model’s performance on it. But with incremental learning on streaming data, where the properties of the data can evolve over time, we need different ways of evaluating. There are multiple ways to evaluate incremental learning models, and the most frequently used ones are as follows:
1. Holdout: Use earlier data for training and later data for testing. That is, as data arrives from the stream, use an earlier portion of it to train the model and hold out a later portion to test it.
2. Prequential evaluation (or interleaved test-then-train): Each incoming instance is first used to test the model (make a prediction and update the metric) and is then used to train the model.
3. Data chunk evaluation: Applies prequential evaluation to chunks of data of a certain size instead of to single instances. It is a good middle ground between holdout and prequential evaluation.
4. Progressive validation: Assume you have a training set of size m_train and a test set of size m_test. Progressive validation begins by learning a model on the training set and testing it on the first example of the test set. It then trains on the training set plus the first test example and tests on the second test example, and so on. The process is repeated for m_test iterations.
The river package contains an implementation of Progressive Validation.
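To make the test-then-train idea concrete, here is a minimal hand-rolled loop using river’s learn_one/predict_one API. This is essentially what evaluate.progressive_val_score (used later in this article) automates for us; the HoeffdingTreeClassifier model is previewed from the section that follows.

from river import metrics, synth, tree

stream = synth.Agrawal(classification_function=0, seed=42).take(1000)
model = tree.HoeffdingTreeClassifier()
metric = metrics.Accuracy()

for x, y in stream:
    y_pred = model.predict_one(x)   # 1. test on the incoming instance...
    if y_pred is not None:          # (the very first predictions may be None)
        metric.update(y, y_pred)
    model.learn_one(x, y)           # 2. ...then train on it

print(metric)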
NOTE: The evaluation strategy (how and when we test) is different from the metric (what we measure, e.g., accuracy).
3. Hoeffding Tree Classifier:
The biggest challenge in developing an incremental decision tree algorithm is that we do not have access to all the data at once. With streaming data, the cost of storing every instance and later loading it into memory to train an algorithm would be very high; as we discussed in the challenges of incremental learning in part 1 of the series, we get only one pass through the data. This single-pass constraint makes it hard to decide the best split attribute. In batch-learning decision trees, we can evaluate candidate splits on various attributes over the full dataset and choose the best one according to the selected criterion (Gini index, information gain, etc.). To address this challenge, the Hoeffding Tree, or Very Fast Decision Tree (VFDT) algorithm, was introduced in the paper Mining High-Speed Data Streams; instead of reusing previously seen instances, the algorithm waits for fresh ones to arrive. The basic assumption VFDT makes about the data is that it does not change over time, and this is what makes building the Hoeffding tree possible. The key idea is that a small sample is typically sufficient for determining the best splitting attribute, a notion backed by the statistical result known as the Hoeffding bound. Before converting a leaf into an internal node, VFDT accumulates sufficient statistics from the new incoming observations and uses the Hoeffding bound to decide when it has seen enough examples to choose the split with confidence.
The Hoeffding bound states that, with probability 1 − δ, after n independent observations of a real-valued random variable x with range R, the true mean of x is at least μ̂ − ε, where μ̂ is the observed mean of x and ε is given by:

ε = √( R² ln(1/δ) / (2n) )
The full pseudocode for the Hoeffding Tree algorithm can be found in the original paper; its core is the leaf-splitting decision driven by the Hoeffding bound, sketched below.
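To see how the bound drives the split decision, here is a minimal, self-contained sketch. The helper function and the numbers are hypothetical, purely for illustration; this is not river’s implementation.

import math

def hoeffding_bound(value_range, delta, n):
    # epsilon = sqrt(R^2 * ln(1/delta) / (2n))
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

# Hypothetical situation at a leaf: g_best and g_second are the observed
# information gains of the two best candidate split attributes after n examples.
g_best, g_second = 0.42, 0.30
n = 500        # examples observed at this leaf so far
R = 1.0        # range of information gain for a binary classification task
delta = 1e-5   # we want confidence 1 - delta in the split decision

eps = hoeffding_bound(R, delta, n)
if g_best - g_second > eps:
    print(f"Split now: gain gap {g_best - g_second:.3f} > epsilon {eps:.3f}")
else:
    print(f"Keep observing: gain gap {g_best - g_second:.3f} <= epsilon {eps:.3f}")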
4. HANDS-ON with river for Hoeffding Tree Classifier:
Now let us see the implementation of the Hoeffding Tree Classifier in river. We will use the Agrawal data generator to generate our dataset. The implementation is in the HoeffdingTreeClassifier class, which is part of the tree module. The following are some of the important parameters we pass to HoeffdingTreeClassifier when instantiating the class.
a. grace_period: Number of instances a leaf should observe between split attempts.
b. nominal_attributes: List of nominal attribute identifiers. If empty, all numeric attributes are assumed to be continuous.
c. split_confidence: Allowed error in a split decision, a value closer to 0 takes longer to decide.
from river import synth
from river import evaluate
from river import metrics
from river import tree

gen = synth.Agrawal(classification_function=0, seed=42)

# Take 1000 instances from the infinite data generator.
# The dataset object yields the data as a stream.
dataset = iter(gen.take(1000))

model = tree.HoeffdingTreeClassifier(
    grace_period=100,
    split_confidence=1e-5,
    nominal_attributes=['elevel', 'car', 'zipcode']
)

# We will use accuracy as the evaluation metric.
metric = metrics.Accuracy()

# We evaluate our model using the progressive validation method.
evaluate.progressive_val_score(dataset, model, metric)

# OUTPUT
# Accuracy: 83.78%
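As a side note, progressive_val_score can also report the running metric while the stream is consumed via its print_every parameter (present in the river version linked in the references; verify against your installed version). The printed values below are placeholders, not real output:

# Re-create the stream first, since the previous evaluation consumed it.
dataset = iter(gen.take(1000))
evaluate.progressive_val_score(dataset, model, metric, print_every=200)
# Prints the running accuracy every 200 instances, e.g.:
# [200] Accuracy: ...
# [400] Accuracy: ...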
5. Extremely Fast Decision Tree Classifier:
The Extremely Fast Decision Tree (EFDT) classifier is also referred to as the Hoeffding AnyTime Tree (HATT) classifier. EFDT works in a similar fashion to a Hoeffding tree; the distinction is in how they split at nodes. The Hoeffding tree defers splitting at a node until it has identified the best split, and once it splits it never revisits that decision. The Hoeffding AnyTime Tree instead splits at a node as soon as a split appears worthwhile, and returns to the decision later if a better split becomes available. As a result, HATT is not as fast as the Hoeffding tree computationally, but it is more efficient statistically, needing fewer examples to converge on good splits. A conceptual sketch of the two split policies follows; the full pseudocode can be found in the EFDT paper.
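The following sketch contrasts the two split policies. Every method on the node object (update_stats, best_gain, and so on) is hypothetical, introduced only to make the control flow readable; this is not river’s implementation.

def hoeffding_tree_update(leaf, x, y):
    # Hoeffding tree: split once, and only when the winner is clear.
    leaf.update_stats(x, y)
    if leaf.best_gain() - leaf.second_best_gain() > leaf.hoeffding_epsilon():
        leaf.split_on(leaf.best_attribute())  # final decision, never revisited

def efdt_update(node, x, y):
    # EFDT / HATT: split eagerly, then keep re-evaluating.
    node.update_stats(x, y)
    if node.is_leaf():
        # Split as soon as the best attribute beats not splitting at all.
        if node.best_gain() - node.null_split_gain() > node.hoeffding_epsilon():
            node.split_on(node.best_attribute())
    else:
        # Periodically reconsider: if another attribute now looks better than
        # the current split attribute, replace the subtree rooted here.
        if node.best_gain() - node.current_split_gain() > node.hoeffding_epsilon():
            node.split_on(node.best_attribute())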
6. HANDS-ON with river for Extremely Fast Decision Tree Classifier:
Following is the code for EFDT implemented with river. Everything is the same as in the previous Hoeffding Tree Classifier example (dataset, metric, and evaluation strategy), except that we use the Extremely Fast Decision Tree Classifier instead.
from river import synth
from river import evaluate
from river import metrics
from river import tree

gen = synth.Agrawal(classification_function=0, seed=42)

# Take 1000 instances from the infinite data generator.
dataset = iter(gen.take(1000))

model = tree.ExtremelyFastDecisionTreeClassifier(
    grace_period=100,
    split_confidence=1e-5,
    nominal_attributes=['elevel', 'car', 'zipcode'],
    min_samples_reevaluate=100
)

metric = metrics.Accuracy()
evaluate.progressive_val_score(dataset, model, metric)

# OUTPUT
# Accuracy: 87.89%
In conclusion, on this synthetic stream the Extremely Fast Decision Tree Classifier achieves relatively better accuracy than the Hoeffding tree (87.89% vs. 83.78%).
That’s all for this part. To quickly summarize: in this part of the incremental learning series we learned about ways to generate synthetic streaming data and looked at the Agrawal data generator in the river package. We saw why incremental learning needs different evaluation methods from batch machine learning and covered four types of evaluation. Finally, we looked at two tree-based classifiers for incremental learning on streaming data.
In the next part, we will look at a few more algorithms for streaming data. We will focus mainly on Ensemble learning algorithms like Adaptive Random Forests, Online Bagging, and Boosting methods. Please stay tuned.
Thank you for reading, that’s all for this article. More content to follow. Please clap if the article was helpful to you and comment if you have any questions. If you want to connect with me, learn and grow with me, or collaborate, you can reach me at any of the following:
Linkedin:- https://www.linkedin.com/in/virajdatt-kohir/
Twitter:- https://twitter.com/kvirajdatt
GitHub:- https://github.com/Virajdatt
GoodReads:- https://www.goodreads.com/user/show/114768501-virajdatt-kohir
References:
- https://www.cms.waikato.ac.nz/~abifet/book/chapter_6.html
- https://riverml.xyz/dev/api/tree/HoeffdingTreeClassifier/
- https://riverml.xyz/dev/api/synth/Hyperplane/
- https://riverml.xyz/dev/api/synth/Agrawal/#fnref2:1
- https://hunch.net/~jl/projects/prediction_bounds/thesis/mathml/thesisse44.xml
- https://riverml.xyz/0.11.0/api/evaluate/progressive-val-score/#examples
- https://riverml.xyz/dev/api/tree/ExtremelyFastDecisionTreeClassifier/