When we deal with large amounts of data, we can’t have specific rules for each and every instance. We have to come up with a model that describes the data as a whole. This model can then be used to analyze unknown inputs. More often than not, the data has some underlying pattern. When we build a model, we extract specific characteristics from the data and come up with a formulation that best explains its behavior. One of the most frequently occurring patterns is the Gaussian distribution. It is used almost everywhere in science and technology. But what is it exactly? Why do we need it?
Simply put, statistics is the study of how to analyze and interpret data. The entire subject is based around the idea that you have a big set of data and you want to analyze it in terms of the relationships between the individual points in that data set. We use certain measures to analyze the given data, namely the mean and the standard deviation. Let’s see what those are.
Mean has different meanings in different contexts. In general, mean refers to the average value of a set of values. Why do we need to concern ourselves with the mean? The mean of a distribution gives us a general idea about the value around which the datapoints are centered. After an exam, we ask for the average score of the class. This mean value gives us an idea about how the students performed in that exam.
To understand standard deviation, we need a data set. Statisticians are usually concerned with taking a sample of a population. This means that we don’t have to concern ourselves with each and every datapoint. To use election polls as an example, the population is all the people in the country, whereas a sample is a subset of the population that the statisticians measure. The great thing about statistics is that by measuring only a sample of the population, you can estimate what the measurement would most likely be if you had used the entire population. We also need to know how many people agree with each other or have opinions close to each other. This is where standard deviation comes into the picture.
The Standard Deviation (SD) of a data set is a measure of how spread out the data is. It tells us how close the datapoints are to the mean of the distribution. If the standard deviation is small, most of the datapoints are close to the mean; if it is large, they are spread over a wider range. This in turn tells us how confident we can be when the model is used to predict and analyze new data. The term “variance” refers to the square of the standard deviation.
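To make the mean, the variance, and the standard deviation concrete, here is a small sketch using Python’s built-in statistics module. The exam scores are made up for illustration and are not from any real data set. Note the assumption about which formula to use: pvariance and pstdev treat the list as the whole population, while stdev treats it as a sample.

```python
import statistics

# Hypothetical exam scores, invented for this example.
scores = [60, 70, 75, 80, 90]

# The mean is the value around which the scores are centered.
mean = statistics.mean(scores)

# Population variance: the average squared distance from the mean.
variance = statistics.pvariance(scores)

# Population standard deviation: the square root of the variance.
std_dev = statistics.pstdev(scores)

# Sample standard deviation: divides by (n - 1) instead of n, which is
# the usual choice when the scores are only a sample of a larger population.
sample_std_dev = statistics.stdev(scores)

print("mean:", mean)
print("variance:", variance)
print("standard deviation:", std_dev)
print("sample standard deviation:", sample_std_dev)
```

The sample standard deviation comes out slightly larger than the population one, because it divides by one less than the number of datapoints.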
The Gaussian distribution is also called the normal distribution. It refers to a family of continuous probability distributions described by the Gaussian equation. The Gaussian equation is a curve that decays exponentially with the squared distance from the mean of the distribution, multiplied by a scaling factor. The scaling factor is inversely proportional to the standard deviation of the distribution. If that was confusing, I will try to clarify it soon.
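For reference, this is the standard textbook form of the Gaussian equation. The symbols are conventional notation rather than something introduced in this post: \mu stands for the mean, \sigma for the standard deviation, and f(x) for the probability density at x.

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```

The fraction in front is the scaling factor, which is inversely proportional to \sigma. The exponential term is largest at x = \mu and decays as x moves away from the mean in either direction.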
The graph of the Gaussian distribution depends on two factors: the mean and the standard deviation. The mean of the distribution determines the location of the center of the graph, and the standard deviation determines the height and width of the graph. The height is determined by the scaling factor and the width is determined by the factor in the power of the exponential. When the standard deviation is large, the curve is short and wide; when the standard deviation is small, the curve is tall and narrow. All Gaussian distributions look like symmetric, bell-shaped curves. If you consider the image here, the curve to the right is the one with a smaller standard deviation than the one to the left.
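We can check the tall-and-narrow versus short-and-wide behavior numerically. The sketch below assumes the standard form of the Gaussian density; the function name gaussian_pdf and the chosen standard deviations (0.5 and 2.0) are just illustrative picks.

```python
import math

def gaussian_pdf(x, mean, std_dev):
    """Density of a Gaussian with the given mean and standard deviation at x."""
    # Scaling factor: inversely proportional to the standard deviation.
    scale = 1.0 / (std_dev * math.sqrt(2.0 * math.pi))
    # Exponent: grows more negative as x moves away from the mean.
    exponent = -((x - mean) ** 2) / (2.0 * std_dev ** 2)
    return scale * math.exp(exponent)

# Height of each curve at its center (x equal to the mean).
narrow_peak = gaussian_pdf(0.0, mean=0.0, std_dev=0.5)
wide_peak = gaussian_pdf(0.0, mean=0.0, std_dev=2.0)

# Height of each curve two units away from the mean.
narrow_tail = gaussian_pdf(2.0, mean=0.0, std_dev=0.5)
wide_tail = gaussian_pdf(2.0, mean=0.0, std_dev=2.0)

print("peak heights:", narrow_peak, wide_peak)
print("tail heights:", narrow_tail, wide_tail)
```

The curve with the smaller standard deviation is higher at the center but drops off much faster, so away from the mean the wider curve ends up on top.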
A lot of real-world data exhibits Gaussian characteristics. This makes it easy for scientists and researchers to analyze unknown data by using this model. It is used in biology to study the characteristics of nerve tissues. It is used in finance to analyze and predict exchange rates and stock prices, and in general data analysis. It also shows up frequently in quantum physics, signal processing, chip manufacturing, and other fields. When we collect large amounts of data and study the underlying pattern, it often exhibits a Gaussian nature. The data will not always be Gaussian, though. There are many other types of distributions too. But we’ll save that discussion for another blog post.
Multivariate Gaussian Distribution
A Gaussian distribution which varies in more than one dimension is called a Multivariate Gaussian Distribution. For example, if you have a set of numbers and draw a simple curve by placing these numbers on the number line, you will get a univariate Gaussian distribution. But what if you had pairs of numbers, like points on a plane or the height and width of an object? Here, the first number in all the pairs will have a Gaussian distribution and the second number will have another Gaussian distribution. When you look at them together, you will get a Gaussian distribution which might look like the curve shown in the image here.
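Here is a minimal sketch of the two-dimensional case. It makes a simplifying assumption that the post does not state: the two numbers in each pair are independent, so the joint density is just the product of two univariate densities. A general multivariate Gaussian also allows the two numbers to be correlated, which this sketch does not cover. The function names and the height and width values are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std_dev):
    """Density of a univariate Gaussian at x."""
    scale = 1.0 / (std_dev * math.sqrt(2.0 * math.pi))
    return scale * math.exp(-((x - mean) ** 2) / (2.0 * std_dev ** 2))

def bivariate_pdf(x, y, mean_x, std_x, mean_y, std_y):
    """Joint density of a pair (x, y), assuming x and y are independent."""
    return gaussian_pdf(x, mean_x, std_x) * gaussian_pdf(y, mean_y, std_y)

# Hypothetical objects: heights centered at 10 (SD 2), widths centered at 4 (SD 1).
at_center = bivariate_pdf(10.0, 4.0, mean_x=10.0, std_x=2.0, mean_y=4.0, std_y=1.0)
off_center = bivariate_pdf(13.0, 5.0, mean_x=10.0, std_x=2.0, mean_y=4.0, std_y=1.0)

print("density at the center:", at_center)
print("density away from the center:", off_center)
```

The density is highest at the pair of means and falls off in every direction, which gives the bell-shaped surface you would see in a 3D plot.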
Multimodal Gaussian Distribution
A distribution which has more than one mode is called a Multimodal Distribution. A single Gaussian has only one peak, so a multimodal Gaussian distribution is built by combining several Gaussian components. Each mode corresponds to a particular value of mean and standard deviation. These curves have two or more peaks, possibly with different variances. These distributions are useful when the same quantities cluster around two or more means. If we try to generalize such data with a single mean, we might lose vital information. In the figure shown here, the distribution has two modes. As we can see, they differ in their means and standard deviations.
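One common way to model a curve with two peaks is a weighted sum of Gaussians, often called a mixture. The sketch below is an illustration under that assumption; the weights, means, and standard deviations are made-up values, and mixture_pdf is a hypothetical helper name.

```python
import math

def gaussian_pdf(x, mean, std_dev):
    """Density of a univariate Gaussian at x."""
    scale = 1.0 / (std_dev * math.sqrt(2.0 * math.pi))
    return scale * math.exp(-((x - mean) ** 2) / (2.0 * std_dev ** 2))

def mixture_pdf(x, components):
    """Weighted sum of Gaussian densities.

    components is a list of (weight, mean, std_dev) tuples whose
    weights are expected to add up to 1.
    """
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in components)

# Two modes with different means and standard deviations (made-up values).
components = [(0.5, 0.0, 1.0), (0.5, 6.0, 1.5)]

left_peak = mixture_pdf(0.0, components)   # near the first mean
valley = mixture_pdf(3.0, components)      # between the two means
right_peak = mixture_pdf(6.0, components)  # near the second mean

print("left peak:", left_peak)
print("valley:", valley)
print("right peak:", right_peak)
```

The density dips between the two means. A single Gaussian fitted to the same data would put its one peak somewhere in that dip, which is exactly the information loss described above.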
Analysis of data is integral to almost every branch of study. The Gaussian distribution occurs in many forms and in many different places. The name Gaussian comes from a mathematician named Carl Friedrich Gauss. He came up with this formulation around 200 years ago. He is referred to as the “Prince of Mathematics”, and rightly so! He was a child prodigy and he’s one of the most influential mathematicians of all time. His theories and formulations have become deeply ingrained in many different fields.