Measuring the Stability of Machine Learning Algorithms

1 mainWhen you think of a machine learning algorithm, the first metric that comes to mind is its accuracy. A lot of research is centered on developing algorithms that are accurate and can predict the outcome with a high degree of confidence. During the training process, an important issue to think about is the stability of the learning algorithm. This allows us to understand how a particular model is going to turn out. We need to make sure that it generalizes well to various training sets. Estimating the stability becomes crucial in these situations. So what exactly is stability? How do we estimate it?  

Why do we need to analyze “stability”?

Stability analysis enables us to determine how the input variations are going to impact the output of our system. In our case, the system is a learning algorithm that ingests data to learn from it. Let’s take the example of supervised learning. A supervised learning algorithm takes a labeled dataset that contains data points and the corresponding labels. The process of training involved feeding data into this algorithm and building a model. So far, so good!

Now that we have a model, we need to estimate its performance. The accuracy metric tells us how many samples were classified correctly, but it doesn’t tell us anything about how the training dataset influenced this process. If we choose a different subset within that training dataset, will the model remain the same? If we repeat this experiment with different subsets of the same size, will the model perform its job with the same efficiency? Ideally, we want the model to remain the same and perform its job with the same accuracy. But how can we know? This is where stability analysis comes into picture.

How do we define stability?

Stability of a learning algorithm refers to the changes in the output of the system when we change the training dataset. A learning algorithm is said to be stable if the learned model doesn’t change much when the training dataset is modified. It’s important to notice the word “much” in this definition. That’s the part about putting an upper bound. A model changes when you change the training set. That’s just how it is! But it shouldn’t change more than a certain threshold regardless of what subset you choose for training. If it satisfies this condition, it’s said to be “stable”.

Now what are the sources of these changes? The two possible sources would be:

  • choosing a different subset for training
  • presence of noise in the dataset

The noise factor is a part of the data collection problem, so we will focus our discussion on the training dataset. If we create a set of learning models based on different subset and measure the error for each one, what will it look like? The goal of stability analysis is to come up with a upper bound for this error. We want this bound to be as tight as possible.

Let’s take an example. Your friend, Carl, asks you to buy some cardboard boxes to move all his stuff to his new apartment. You don’t know how many items he has, so you call him to get that information. During that call, Carl tells you that he definitely has less than 100 million items. Even though it’s factually correctly, it’s not very helpful. It’s obvious that he has less than 100 million items. So putting a tight upper bound is very important.

How do we estimate stability?

As we discussed earlier, the variation comes from how we choose the training dataset. Specifically, the way in which we pick a particular subset of that dataset for training. In order to estimate it, we will consider the stability factor with respect to the changes made to the training set. We need a criterion that’s easy to check so that we can estimate the stability with a certain degree of confidence.

Mathematically speaking, there are many ways of determining the stability of a learning algorithm. Some of the common methods include hypothesis stability, error stability, leave-one-out cross-validation stability, and a few more. The goal of all these different metrics is to put a bound on the generalization error. They use different approaches to compute it. We will not be discussing the mathematical formulations here, but you should definitely look into it. It’s actually quite interesting!

The notion of stability is centered on putting a bound on the generalization error of the learning algorithm. This allows us to see how sensitive it is and what needs to be changed to make it more robust.


3 thoughts on “Measuring the Stability of Machine Learning Algorithms

  1. Interesting post!

    I am interested in your thoughts on the pros and cons on the different measures of stability such as hypothesis stability vs. cross validation stability. I have thought a lot about this issue but express it a bit different. Check out my thoughts:
    View at
    View at

  2. Prateek, keep thinking of tracking the Stability of a model in terms of Precision and Recall over time. However given the dataset changes with time what other factors should I keep in mind:
    1. Do I use a known tagged source (different from the original training dataset) and measure and track its precision and recall at that time?
    2. What factors do we consider or keep track in terms of the new dataset used to measure this – size, statistical significance of the sample, feature diversity in the dataset?

    Am I wrong in looking at Stability in this way? I am thinking in terms of tracking only Precision and Recall and not Accuracy as many practical domains/business problems tend to have class imbalances

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s