Incremental Machine Learning for Streaming data with River: Part 1

Virajdatt Kohir
10 min read · May 27, 2022



The amount of data generated, processed, and analyzed every day is increasing at a tremendous pace. Modern smart devices (e.g., cell phones, iPads), IoT devices, and social media (tweets shared, Snapchat photos sent) are some of the key contributors to this increased generation and analysis of data. Bernard Marr, in his article How Much Data Do We Create Every Day?, explains in detail how social media, IoT, photos, cloud services, etc. are responsible for the current explosion in data. Researchers and industry giants have leveraged Machine Learning and Deep Learning technologies to process this tremendous volume of data and identify key patterns that aid better decision-making and forecasting. Although the data is generated in real time, most Machine Learning algorithms and projects do something called batch processing as opposed to real-time processing. Batch processing can loosely be defined as processing data collected over a period of time; real-time processing is the processing of data almost as soon as it is generated.

In the rest of the article (and series), we will see how we can perform machine learning with real-time data. We will look at the various considerations, challenges, algorithms, tools, and libraries (mostly in python) that will aid us in our journey.

Table of Contents:

  1. Motivation (why do we even need to analyze data in real-time?)
  2. Some Challenges for incremental learning on streaming data.
  3. Some Terminologies.
  4. Details about Model Drift.
  5. Details on Windowing Methods.
  6. Intro to River Package.
  7. Hands-On Drift Detection with River.
  8. Quick peek at the next part of the series.

1. Motivation: (so why do we even need to analyze data in real-time?)

Real-time data analytics and insights enable decision-makers in a business to make timely, value-added decisions. If you look at various machine learning projects [], almost all of them collect historic data, clean it, process it, and build predictive models. By the time all these steps in the Machine Learning pipeline are complete, there is a good chance that new data has been generated.

Now, what happens when new data is generated? Usually, in the traditional Machine Learning approach, the new data is appended to the existing data and the whole ML pipeline is kicked off again.

Traditional ML Pipeline

Suppose there were 100 data points in the first dataset on which the first ML model was trained, and then an additional 10 data points were generated (we will call this the delta data). The second model will be built by training the algorithm on 100 + 10 = 110 data points. Now, instead of the original dataset being 100 data points, imagine it is 10 million data points and the newly generated delta is 2 million. You can see how this is hard to scale, and that it is neither time- nor computationally efficient.

This is where real-time analysis and machine learning come in handy. The machine learning algorithms for real-time stream data learn from the incoming delta data. Here the model continues to learn from new data points and optimizes its objective function.
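
To make this concrete, here is a minimal sketch of that streaming loop using the river package (introduced in section 6). The feature name and the toy linear target are invented for illustration; the point is that each new observation updates the model in place, with no retraining from scratch:

from river import linear_model, metrics

# A minimal sketch; the feature name 'x' and the target y = 3x + 1 are made up.
model = linear_model.LinearRegression()
metric = metrics.MAE()

for i in range(110):                  # 100 original points + 10 delta points
    x = {'x': float(i)}
    y = 3.0 * i + 1.0                 # toy linear target
    y_pred = model.predict_one(x)     # predict before learning (test-then-train)
    metric.update(y, y_pred)
    model.learn_one(x, y)             # single-pass update, no retraining from scratch

print(metric)                         # running mean absolute error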

Following is an illustration that captures important aspects of real-time stream data Machine Learning:

Real-time analysis of data improves applications like:

  1. Fraud detection in real-time.
  2. Real-time recommendations: Based on your current interaction on a website the recommendations suitable for you are “learned on the fly”.
  3. Algorithmic trading using stock tick data. Monitoring stock trends in real-time helps trading-related decision-making.

And many more are possible through the analysis of real-time data. Given its tremendous potential, numerous academic and industrial researchers are developing novel real-time analytics methods. Because real-time data presents a unique set of issues, traditional machine learning techniques do not work on it out of the box.

2. Some Challenges for incremental learning on streaming data:

  1. Model or Data Drift: Data drift usually destabilizes a model to the point where it is no longer useful. We will explore this in the next section.
  2. Usually, streaming data is not saved for long. Hence the learning algorithms cannot make multiple passes through the data; it must be processed in a single pass.
  3. Higher memory and processing requirements.
  4. For a classification task, all the classes of the target variable may not be available up front, as they usually are in traditional batch-processing Machine Learning. Ex: consider a target variable that is the color of the car a customer selects. With stream data, you don’t know in advance how many color classes you will end up with; at the extremes it could be just one, or more than 25 (see the sketch after this list).
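
As a quick illustration of that last challenge, river (the package we introduce in section 6) copes with unseen classes naturally; here is a minimal sketch with made-up car-color features:

from river import naive_bayes

# Hedged sketch: the feature names and values are invented for illustration.
model = naive_bayes.MultinomialNB()

model.learn_one({'visits': 3, 'clicked_ads': 1}, 'red')
model.learn_one({'visits': 1, 'clicked_ads': 0}, 'blue')
# A brand-new class arrives mid-stream; no refit or class re-declaration needed.
model.learn_one({'visits': 5, 'clicked_ads': 2}, 'green')

print(model.predict_one({'visits': 4, 'clicked_ads': 2}))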

3. Some Terminologies

  1. Stream Data: Streaming data, often known as data streams, is an endless, continuous flow of data arriving from a source at a rapid rate. As a result, streaming data is a subset of big data that focuses on the velocity of the data.
    Examples of stream data: online tweets, online credit card transactions.
  2. Model Drift: Model drift or model decay is the degradation of the ML model’s predictive ability due to alterations in the digital environment and subsequent changes in variables like data and concepts.
  3. Windowing: A window is a snapshot of data, bounded either by a number of observations or by a time interval. This is a particularly helpful strategy in the context of streaming data, because you never have access to the “whole data” at any given time. A window can also handle drift effectively, because newer data points are given more weight while older data points are periodically removed.
  4. Incremental and Online Learning:
    Incremental Learning: When a new batch/mini-batch of data arrives, incremental learning algorithms update the model parameters while working with limited resources.
    Online Learning: In contrast, online learning algorithms update the model parameters whenever a single new observation is received; you don’t have to wait for a mini-batch of data to arrive (a sketch contrasting the two follows below).
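
Here is a hedged sketch of the distinction, using sklearn’s partial_fit for the incremental (mini-batch) case and river’s learn_one for the online (per-observation) case; the data is synthetic:

import numpy as np
from sklearn.linear_model import SGDClassifier
from river import linear_model

# Incremental learning: the model is updated once per arriving mini-batch.
batch_model = SGDClassifier()
X_batch = np.random.randn(32, 3)                 # a mini-batch of 32 observations
y_batch = np.random.randint(0, 2, size=32)
batch_model.partial_fit(X_batch, y_batch, classes=[0, 1])

# Online learning: the model is updated on every single observation.
online_model = linear_model.LogisticRegression()
online_model.learn_one({'f1': 0.2, 'f2': -1.3, 'f3': 0.7}, True)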

4. Details about Model Drift:

Why is detecting Model Drift Important?

  1. Important to keep the model relevant in production.
  2. Detect the right kind of drift for corrective measures (retraining the model, training with weighted data, online learning, feature dropping or replacing the model with a new one, etc)

There are 2 types of Model Drift:

  1. Concept Drift: A type of model drift where the properties of target variables change over time.
    Example: A personalized recommender system profiles user shopping patterns based on past behavior. But significant changes such as customer relocation to a different geography or a global pandemic like COVID-19 drastically affect the shopping behavior making current recommender systems irrelevant.

Types of Concept Drifts:

  • Sudden Drift: Drift that occurs as a result of drastic external changes. It can be a sudden/abrupt change, such as travel patterns being upended by the COVID-19 outbreak.
  • Incremental Drift: Drift that accumulates through small changes and is noticed only after a long period. For example, changes in the loan-default pattern over some months require retraining a credit-scoring model.
  • Recurrent Drift: Drift that happens periodically, perhaps during a specific event or time of the year. For example, the Black Friday event significantly affects consumer buying patterns, so training a separate model on Black Friday data makes sense.

2. Data Drift: A shift observed due to changes in the statistical properties of the independent variables, such as feature distributions.

Why is Data Drift Monitoring Important?

Flagging data drift and automating model-retraining jobs with new data helps ensure that the model stays relevant in production and offers fair predictions over time. Timely insight into data drift helps avoid model decay, using industry best practices such as:

  1. Incremental learning with retraining model as new data arrives
  2. Training with weighted data
  3. Periodic retraining and updating models

5. Details on Windowing:

Windowing methods evaluate the data in window-sized chunks to detect drift. By now we have seen what drift is and what windowing is; knowing about windowing methods makes it easier to understand drift-detection algorithms. There are various windowing techniques; we will focus on the following three:

1] Landmark window model. Older data points are not discarded; all data points are accumulated in the window. The figure below illustrates this model: the rectangle represents the window, and the blue circles are the data points.

The figure above shows that a landmark window accumulates data points over time. The window size increases as more data points arrive. This windowing technique requires large memory resources, especially when dealing with large data streams.

2] Sliding window model. This is a popular way to discard older data points and consider only the recent data points for analysis.

In a sliding window model, as shown in the figure, “time” is on the x-axis and observations arrive sequentially from left to right. The leftmost observation is the oldest data, and the most recent observations are on the right.

3] Damped window model. This model associates weights with the data in the stream, giving higher weight to recent data points than to older ones. An exponential fading strategy is used to discard old data.
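
Before the formal comparison, here is a toy sketch in plain Python (not a river API) contrasting the three window models; the window size and fading factor are arbitrary choices:

from collections import deque

landmark_window = []                  # 1] landmark: nothing is ever discarded
sliding_window = deque(maxlen=100)    # 2] sliding: the oldest point falls off automatically

fading_factor = 0.9                   # 3] damped: older points fade exponentially
weighted_sum, weight_total = 0.0, 0.0

for x in [0.5, 1.2, 0.9, 5.1, 4.8]:   # toy stream
    landmark_window.append(x)
    sliding_window.append(x)
    weighted_sum = fading_factor * weighted_sum + x    # old weights shrink each step
    weight_total = fading_factor * weight_total + 1.0

print(weighted_sum / weight_total)    # exponentially weighted running mean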

A comparison of the three windowing techniques appears below; the paper [4] covers the comparison in much more detail:

Comparison from [4]

6. Intro to River Package:

River is a Python library for online machine learning. It aims to be the most user-friendly library for doing machine learning on streaming data. River is the result of a merger between creme and scikit-multiflow.

Some interesting facts about the river package: it has an API similar to sklearn’s, and it has the potential to democratize real-time stream-data machine learning in the same way that sklearn did for traditional machine learning.
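
To give a taste of that API, here is a minimal sketch in the style of river’s own examples, using one of its built-in datasets; note the learn_one/predict_one calls (one observation at a time) in place of sklearn’s fit/predict:

from river import datasets, linear_model, metrics, preprocessing

# Scale features, then fit a logistic regression, one observation at a time.
model = preprocessing.StandardScaler() | linear_model.LogisticRegression()
metric = metrics.Accuracy()

for x, y in datasets.Phishing():      # each x is a dict of features, y a label
    y_pred = model.predict_one(x)     # predict first ...
    metric.update(y, y_pred)
    model.learn_one(x, y)             # ... then learn (progressive validation)

print(metric)                         # running classification accuracy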

Features of River as listed by the team (all below for Online Learning):

  • Linear models with a wide array of optimizers
  • Nearest neighbors, decision trees, naïve Bayes
  • Anomaly detection
  • Drift detection
  • Recommender systems
  • Time series forecasting
  • Imbalanced learning
  • Clustering
  • Feature extraction and selection
  • Online statistics and metrics
  • Built-in datasets
  • Progressive model validation
  • Model pipelines as a first-class citizen
  • Check out the API for a comprehensive overview

In my experience, the documentation for river could be better organized, so that it is easier to navigate for absolute beginners interested in streaming data and real-time analytics. But you need not worry; we will work through it together.

Installation of River:

pip install river

7. Hands-On with River to Detect Drift:

With a basic idea of streaming data, and now equipped with knowledge of windowing, data drift, and the river tool, we will explore some algorithms for drift detection, along with hands-on experience of using them with river.

Adaptive Windowing Method for Concept Drift Detection (ADWIN): In this approach, a sliding window of dynamic size is employed; that is, the window’s size is not fixed but is estimated online based on the rate of change observed in the data within the window.

To illustrate drift detection using ADWIN with river, we follow a few simple steps:

  1. First, create 1,000 data points from a standard normal distribution (values will mostly lie between -1 and 1).
  2. Change the data concept from index 599 to 999 (these data points now take values between 5 and 9).
  3. Detect the drift using ADWIN (a sketch is shown below).
ADWIN Data Drift
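
The full code is in the notebook linked at the end of this article; below is a minimal sketch of the same experiment. Attribute names follow recent river versions, and the exact detection index depends on the random seed:

import numpy as np
from river import drift

np.random.seed(42)
stream = np.random.standard_normal(1000)       # step 1: ~N(0, 1) data points
stream[599:] = np.random.uniform(5, 9, 401)    # step 2: concept change from index 599

adwin = drift.ADWIN()
for i, x in enumerate(stream):
    adwin.update(x)                            # step 3: feed ADWIN one value at a time
    if adwin.drift_detected:                   # named `change_detected` in older river versions
        print(f"Drift detected at index {i}")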

Conclusion: We see that the algorithm detected drift at index 639, that is, after seeing roughly 40 additional points beyond the change.

Next method:

Early Drift Detection Method: The Early Drift Detection Method (EDDM) is an enhancement of the Drift Detection Method (DDM) and is particularly useful for identifying slow, progressive changes in a data stream. The idea behind DDM is to keep track of the number of errors made by the learning algorithm; EDDM instead tracks the mean distance between two consecutive errors.
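
Although we won’t go deep into EDDM here, the following is a hedged sketch of how it can be driven with river, on a made-up stream of prediction errors (True means the model misclassified that observation). Recent river versions keep the detector under drift.binary; older releases exposed it as drift.EDDM at the top level:

from river import drift

eddm = drift.binary.EDDM()

# Errors are rare at first, then the gap between consecutive errors shrinks,
# which is the gradual pattern EDDM is designed to catch.
error_stream = [i % 20 == 0 for i in range(600)] + [i % 3 == 0 for i in range(600)]

for i, err in enumerate(error_stream):
    eddm.update(err)
    if eddm.drift_detected:
        print(f"Drift detected at index {i}")
        break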

That’s all for this article. There are various other methodologies and algorithms for detecting drift, and more Python packages and tools for real-time machine learning; we will see them in future parts of the series. In the next part, we will learn about supervised learning algorithms for real-time streaming data.

The code can be accessed on:

  1. Google Colab
  2. GitHub

Thank you for reading, that’s all for this article. More content to follow. Please clap if the article was helpful to you and comment if you have any questions.

If you want to connect with me, learn and grow with me or collaborate you can reach me at any of the following:

Linkedin:- https://www.linkedin.com/in/virajdatt-kohir/
Twitter:- https://twitter.com/kvirajdatt
GitHub:- https://github.com/Virajdatt
GoodReads:- https://www.goodreads.com/user/show/114768501-virajdatt-kohir


References:

  1. https://bernardmarr.com/how-much-data-do-we-create-every-day-the-mind-blowing-stats-everyone-should-read/
  2. https://streamsets.com/why-dataops/what-is-data-drift/
  3. https://riverml.xyz/latest/
  4. https://link.springer.com/article/10.1007/s10462-020-09874-x
  5. https://github.com/online-ml/river
  6. https://censius.ai/wiki/data-drift
  7. https://www.amazon.in/Practical-Machine-Learning-Streaming-Python-ebook/dp/B0923T8ZY1
