Even though it has an ornate name, the underlying concept is actually quite simple and intuitive. The concept of Empirical Risk Minimization becomes relevant in the world of supervised learning. The actual goal of supervised learning is to find a model that solves a problem as opposed to finding a model that best fits the given dataset. Since we don’t have every single data point that represents each class completely, we just use the next best thing available, which is a dataset that’s representative of the classes. We can think of the process of supervised learning as choosing a function that achieves a given goal. We have to choose this function from a set of potential functions. Now how can we measure the effectiveness of this chosen function given that we don’t know what the actual distribution looks like? Bear in mind that all the potential functions can achieve the given goal. How do we find the function that’s the best representative of the true solution? Continue reading

# Tag Archives: Probability

# What Is Monte Carlo Simulation

There are many phenomena in everyday life where it’s very difficult to model the problem. There are so many variables and so many dependencies that any approximation or assumption would lead to a huge errors in outputs. This is usually a combination of uncertainty and variability. Even though we have access to all the historical information, we can’t accurately predict a future outcome because of inaccurate modeling. This becomes especially relevant when we are dealing with systems where the degrees of freedom are dependent on each other. An example would be movement of fluids or kinetic modeling of gases. How do we compute the possible outcomes? How can we assess the impact of all the free variables to make sure we predict the outcome under uncertainty? Continue reading

# How To Compute Confidence Measure For SVM Classifiers

Support Vector Machines are machine learning models that are used to classify data. Let’s say you want to build a system that can automatically identify if the input image contains a given object. For ease of understanding, let’s limit the discussion to three different types of objects i.e. chair, laptop, and refrigerator. To build this, we need to collect images of chairs, laptops, and refrigerators so that our system can “learn” what these objects look like. Once it learns that, it can tell us whether an unknown image contains a chair or a laptop or a refrigerator. SVMs are great at this task! Even though it can predict the output, wouldn’t it be nice if we knew how confident it is about the prediction? This would really help us in designing a robust system. So how do we compute these confidence measures? Continue reading

# What Is Relative Entropy?

In this blog post, we will be using a bit of background from my previous blog post. If you are familiar with the basics of entropy coding, you should be fine. If not, you may want to quickly read through my previous blog post. So coming to the topic at hand, let’s continue our discussion on entropy coding. Let’s say we have a stream of English alphabets coming in, and you want to store them in the best possible way by consuming the least amount of space. So you go ahead and build your nifty entropy coder to take care of all this. But what if you don’t have access to all the data? How do you know what alphabet appears most frequently if you can’t access the full data? The problem now is that you cannot know for sure if you have chosen the best possible representation. Since you cannot wait forever, you just wait for the first ‘n’ alphabets and build your entropy coder hoping that the rest of the data will adhere to this distribution. Do we end up suffering in terms of compression by doing this? How do we measure the loss in quality? Continue reading

# What Is Entropy Coding?

Entropy Coding appears everywhere in modern digital systems. It is a fundamental building block of data compression, and data compression is pretty much needed everywhere, especially for internet, video, audio, communication, etc. Let’s consider the following scenario. You have a stream of English alphabets coming in and you want to store them in the best possible way by consuming the least amount of space. For the sake of discussion, let’s assume that they are all uppercase letters. Bear in mind that you have an empty machine which doesn’t know anything, and it understands only binary symbols i.e. 0 and 1. It will do exactly what you tell it to do, and it will need data in binary format. So what do we do here? One way would be to use numbers to represent these alphabets, right? Since there are 26 alphabets in English, we can convert them to numbers ranging from 0 to 25, and then convert those numbers into binary form. The biggest number, 25, needs 5 bits to be represented in binary form. So considering the worst case scenario, we can say that we need 5 bits to represent every alphabet. If have to store 100 alphabets, we will need 500 bits. But is that the best we can do? Are we perhaps not exploring our data to the fullest possible extent? Continue reading

# What Is A Markov Chain?

If you have studied probability theory, then you must have heard Markov’s name. When we study probability and statistics, we tend to deal with independent trials. What this means is that if you conduct an experiment a lot of times, we assume that the outcome of one trial doesn’t influence the outcome of the next trial. For example, let’s say you are tossing a coin. If you toss the coin 5 times, you are bound to get either heads or tails with equal probability. If the outcome of the first toss is heads, it doesn’t tell us anything about the next trial. But what if we are dealing with a situation where this assumption is not true? If we are dealing with something like estimating the weather, we cannot assume that today’s weather is not affected by what happened yesterday. If we go ahead with the independence assumption here, we are bound to get wrong results. How do we formulate this kind of model? Continue reading

# What Is Maximum Likelihood Estimation?

Let’s say you are trying to estimate the height of a group of people somewhere. If the group is small enough, you can just measure all of them and be done with it. But in real life, the groups are pretty large and you cannot measure each and every person. So we end up having a model which will estimate the height of a person. For example, if you are surveying a group of professional basketball players, you may have a model which will be centered around 6’7″ with a variance of a couple of inches. But how do we get this model in the first place? How do we know if this model is accurate enough to fit the entire group? Continue reading

# What Are Confidence Intervals?

Confidence interval is a concept in statistics that is used extensively in many diverse areas like physics, chemistry, computer vision, machine learning, genetics, etc. This concept is so fundamental that any modern science would eventually end up using it. Let’s say you have collected some data and you want to understand the behavior of that data. For example, you can say that the data is centered around some value or that the data is distributed with a certain amount of variance. This is very common in many fields where you have estimate the underlying parameters that govern the data distribution. When you estimate a statistical parameter from some data, you can’t be certain about its true value. If you have a lot of high-quality data, then you’re more confident that your estimate is near its true value. But if you don’t have a lot of data, or if it’s of poor quality, then you don’t have much confidence in it. So how do we deal with these situations? Can we measure this uncertainty? Continue reading

# What Are P-Values?

Let’s say you are a part of the sub-atomic physics team and you are working on discovering an important effect. The thing about sub-atomic physics is that nothing is certain and you cannot say something has happened with 100% certainty. The best we can do is to say that we are x-percent sure that something interesting happened. One fine day, you see some pattern in your data which looks pretty much like what that effect would look like. Now the problem is, your experiment produced data with a lot of noise. People are therefore skeptical of you, and think that the supposed “effect” you claimed to see might just have been a funny pattern in some random noise. How would you convince them that it’s not? Before that, how do you convince yourself that it’s not just noise? A good strategy for arguing your point would be to say, “Alright listen, suppose you’re right, and the patterns in my data really are in fact just from random noise, then how would you explain the fact that random noise very rarely produces patterns like this?”. Pretty good strategy right? Now how do we formulate this mathematically? Continue reading

# What Is Random Walk?

Consider the following situation. We have a drunkard who is clinging to a lamppost, and now he decides to start walking. He is in the middle of the street and the road runs from east to west. In his inebriated state, he is as likely to take a step towards the east as he is towards the west. It just means that there is a 50% chance that he will go in either direction. From each new position, he is again as likely to go east or west. Each of his steps are of the same length but in random direction. After having taken ‘n’ number of steps, he is to be found standing at some position on the street. This is what a random walk is. We can plot the position against the number of steps taken for any particular random walk. Now the question is, can we model his movement so that we can predict where he will be after taking ‘n’ steps? Continue reading