When we talk about deep neural networks, we tend to focus on feature learning. Traditionally, in the field of machine learning, people use hand-crafted features. What this means is that we look at the data and build a feature vector which we think would be good and discriminative. Once we have that, we train a model to learn from it. But one of the biggest problems with this approach is that we don’t really know if it’s the best possible representation of the data. Ideally, we would want the machine to learn the features by itself, and then use them to build the machine learning model. An autoencoder is one such neural network: it aims to learn a good feature vector for the given data. So how exactly does it work? How is it used in practice?
Why do we need autoencoders?
Before we continue our discussion, we need to understand why we need autoencoders in the first place. Autoencoders play a fundamental role in unsupervised learning, particularly in deep architectures. Now what is a deep architecture? Within machine learning, we have a branch called Deep Learning which has gained a lot of traction in recent years. Deep Learning focuses on learning meaningful representations of data, usually by stacking several layers so that each layer builds on the representation produced by the previous one. A machine learning architecture built this way is called a deep architecture. This is just a simplistic explanation of something that’s very complex! Deep Learning is too vast to be discussed here, so we will save it for another post. Coming back to autoencoders, the aim of an autoencoder is to learn a compressed, distributed representation of the given data, typically for the purpose of dimensionality reduction.
Don’t we already have Principal Component Analysis for dimensionality reduction?
Well, that’s true! Principal Component Analysis (PCA) is one way to handle dimensionality reduction. PCA finds the directions along which the data has maximum variance, as well as the relative importance of these directions. But the problem here is that PCA is linear, which means that it works well only when the data lies on or close to a linear subspace of the input space. PCA projects the data onto a low dimensional surface and it restricts that surface to be linear. In the real world, this can be restrictive. I have discussed the differences between linear and non-linear methods of dimensionality reduction here. An autoencoder doesn’t impose that restriction. We can have both linear and non-linear autoencoders. In fact, when we force the autoencoder to be linear and train it with a squared-error loss, the optimal solution is very close to what we get using PCA: the hidden units end up spanning the same subspace as the top principal components.
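To make the linear case concrete, here is a minimal NumPy sketch of dimensionality reduction with PCA. The helper name `pca_project` and the toy data (points scattered around a line in 3-D) are my own assumptions for illustration, not something from a particular library.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal directions and map them back."""
    mean = X.mean(axis=0)
    Xc = X - mean                      # PCA works on centered data
    # Rows of Vt are the principal directions, sorted by decreasing singular value.
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]
    codes = Xc @ components.T          # low dimensional representation
    X_hat = codes @ components + mean  # linear reconstruction in the original space
    explained = (s[:k] ** 2).sum() / (s ** 2).sum()  # fraction of variance kept
    return codes, X_hat, explained

rng = np.random.default_rng(0)
# Toy data: 200 points close to a single line in 3-D, plus a little noise.
t = rng.normal(size=(200, 1))
X = t @ np.array([[2.0, 1.0, -1.0]]) + 0.05 * rng.normal(size=(200, 3))

codes, X_hat, explained = pca_project(X, k=1)
print("code shape:", codes.shape)
print("fraction of variance explained:", explained)
print("reconstruction MSE:", np.mean((X - X_hat) ** 2))
```

Because this toy data really does lie close to a line, one component is enough. If the points were spread along a curve instead, the same linear projection would lose a lot more information.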
What exactly is an autoencoder?
Autoencoders are simple learning networks that aim to transform inputs into outputs with the minimum possible error. This means that we want the output to be as close to the input as possible. Now you might wonder where this error comes from. I mean, if we just pass the input directly to the output, we are bound to get a perfect reconstruction, right? Well, that’s where it gets interesting. We add a couple of layers in between the input and the output, and the sizes of these layers are smaller than the input layer. Let’s say the input vector has a dimensionality of ‘n’, which means that the output will also have a dimensionality of ‘n’. Now, we make the input go through a layer of size ‘p’, where p < n, and we ask the network to reconstruct the input from it.
Mathematically speaking, an autoencoder is a model that takes a vector input x and maps it into a hidden representation ‘h’ using an encoder. The hidden representation ‘h’, often called the “code”, is then mapped back into the space of x using a decoder. The goal of the autoencoder is to minimize the reconstruction error, which is measured as a distance between the input x and the output y. The most common choice of distance is the mean squared error. The code h typically has fewer dimensions than x, which forces the autoencoder to learn a good representation of the data. In its simplest form, i.e. the linear form, an autoencoder learns to project the data onto the subspace spanned by its principal components. If the code h has as many components as x, then no compression is required, and the model could end up learning the identity function. Now if the encoder has a non-linear form, or uses a multi-layered model, then the autoencoder can learn a potentially more powerful representation of the data. This is where it becomes very useful for feature learning!
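Here is a minimal sketch of that encoder/decoder setup in NumPy, trained with plain gradient descent on the mean squared error. The function name `train_autoencoder`, the tanh encoder, the linear decoder, the learning rate and the toy data are all assumptions I picked to keep the example short. It is one possible way to set this up, not the definitive one.

```python
import numpy as np

def train_autoencoder(X, p, lr=0.1, epochs=3000, seed=0):
    """One-hidden-layer autoencoder: x -> h = tanh(x W1 + b1) -> y = h W2 + b2."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    W1 = rng.normal(scale=0.1, size=(n, p)); b1 = np.zeros(p)
    W2 = rng.normal(scale=0.1, size=(p, n)); b2 = np.zeros(n)
    losses = []
    for _ in range(epochs):
        H = np.tanh(X @ W1 + b1)           # encoder output: the code h
        Y = H @ W2 + b2                    # decoder output: the reconstruction y
        err = Y - X
        losses.append(np.mean(err ** 2))   # mean squared reconstruction error
        # Backpropagate the mean squared error through decoder and encoder.
        dY = 2.0 * err / err.size
        dW2 = H.T @ dY
        db2 = dY.sum(axis=0)
        dZ = (dY @ W2.T) * (1.0 - H ** 2)  # derivative of tanh is 1 - tanh^2
        dW1 = X.T @ dZ
        db1 = dZ.sum(axis=0)
        W1 -= lr * dW1; b1 -= lr * db1
        W2 -= lr * dW2; b2 -= lr * db2
    return (W1, b1, W2, b2), losses

rng = np.random.default_rng(1)
# Toy data: n = 8 dimensional inputs generated from 2 hidden factors.
Z = rng.normal(size=(200, 2))
A = rng.normal(size=(2, 8))
X = 0.3 * Z @ A

params, losses = train_autoencoder(X, p=2)
W1, b1, W2, b2 = params
codes = np.tanh(X @ W1 + b1)               # learned 2-dimensional features
print("first loss:", losses[0], "last loss:", losses[-1])
```

Here n = 8 and p = 2, so every input has to squeeze through a two-dimensional code. The rows of `codes` are the learned features that could be handed to another model.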
Are there any problems?
One of the biggest concerns with autoencoders arises when there is no constraint other than minimizing the reconstruction error. If we have an autoencoder with n inputs and a hidden layer of dimension n (or more), it could potentially just end up learning the identity function. This is a degenerate solution, similar in spirit to overfitting, and we should avoid it. One way to avoid it is to use additional constraints. These constraints can come in the form of regularization. For example, we can add a penalty that encourages the hidden-layer representation to be sparse. We can also put restrictions on the classes of functions that are being used, or add some noise in the hidden layer so that the network doesn’t end up becoming the identity function.
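The sketch below shows one way these constraints could enter the objective: an L1 penalty on the code to encourage sparsity, and optional Gaussian noise added to the hidden layer. The function name `regularized_loss`, the choice of an L1 penalty and the parameter names are assumptions for illustration. Other penalties work along the same lines.

```python
import numpy as np

def regularized_loss(X, W1, b1, W2, b2, sparsity_weight=1e-3, noise_std=0.0, rng=None):
    """Reconstruction error plus an L1 sparsity penalty on the hidden code."""
    H = np.tanh(X @ W1 + b1)
    H_used = H
    if noise_std > 0.0:
        rng = np.random.default_rng() if rng is None else rng
        # Noise in the hidden layer makes a plain copy-through solution harder.
        H_used = H + noise_std * rng.normal(size=H.shape)
    Y = H_used @ W2 + b2
    reconstruction = np.mean((Y - X) ** 2)
    sparsity = sparsity_weight * np.mean(np.abs(H))  # pushes code entries towards zero
    return reconstruction + sparsity, reconstruction, sparsity

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
# Overcomplete setup: 6 hidden units for 4 inputs.
W1 = rng.normal(scale=0.5, size=(4, 6)); b1 = np.zeros(6)
W2 = rng.normal(scale=0.5, size=(6, 4)); b2 = np.zeros(4)

plain = regularized_loss(X, W1, b1, W2, b2, sparsity_weight=0.0)
sparse = regularized_loss(X, W1, b1, W2, b2, sparsity_weight=0.1)
noisy = regularized_loss(X, W1, b1, W2, b2, sparsity_weight=0.1, noise_std=0.2, rng=rng)
print("plain:", plain[0], "sparse:", sparse[0], "noisy:", noisy[0])
```

During training we would minimize the first returned value instead of the bare reconstruction error, so the network has to trade off reconstruction quality against the extra constraint.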
If we train with stochastic gradient descent, we can have non-linear autoencoders with more hidden units than inputs. This is called the overcomplete form and it tends to yield useful representations of the given data. When I say “useful”, I mean useful in the sense of the classification error measured on a network that takes this representation as input. One way of looking at it is that stochastic gradient descent with early stopping is similar to L2 regularization of the parameters. To achieve perfect reconstruction of continuous inputs, we just need a one-hidden-layer autoencoder with non-linear hidden units. With this setup, we need very small weights in the encoding layer, so that the hidden units stay in the nearly linear part of their non-linearity, and very large weights in the decoding layer to scale the result back up.
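The small-weights/large-weights argument is easy to check numerically. In the sketch below, the encoder multiplies the input by a small factor `eps`, applies tanh, and the decoder multiplies by `1 / eps`. The identity-shaped weight matrices and the name `reconstruct` are assumptions chosen to isolate the effect. A trained network would not look this tidy.

```python
import numpy as np

def reconstruct(X, eps):
    """Autoencoder with tiny encoder weights and huge decoder weights."""
    n = X.shape[1]
    W_enc = eps * np.eye(n)          # very small weights in the encoding layer
    W_dec = (1.0 / eps) * np.eye(n)  # very large weights in the decoding layer
    H = np.tanh(X @ W_enc)           # tanh(z) is close to z when z is small
    return H @ W_dec

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(100, 3))

errors = []
for eps in (1.0, 0.1, 0.01):
    err = np.max(np.abs(reconstruct(X, eps) - X))
    errors.append(err)
    print("eps =", eps, "max reconstruction error =", err)
```

Since tanh(z) is approximately z - z^3/3 for small z, the error shrinks roughly like eps squared. This is exactly the kind of solution that weight regularization or early stopping makes hard to reach.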
When we are dealing with binary inputs, we need very large weights to completely minimize the reconstruction error. But there is a small problem here. Remember when we talked about regularization to avoid overfitting? Well, regularization makes it difficult to reach large-weight solutions. The good thing is that the optimization algorithm then finds encodings that work well only for examples similar to those in the training set. This means that it is exploiting the statistical patterns present in the training set, as opposed to learning the identity function.
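To see why binary inputs call for large weights, here is a small sketch that assumes a sigmoid output unit and a cross-entropy reconstruction error, which is a common choice for binary data but is my assumption here. The hypothetical `scale` parameter stands in for the overall size of the weights feeding the output.

```python
import numpy as np

def binary_reconstruction_loss(x, scale):
    """Cross-entropy between binary inputs x and a sigmoid reconstruction."""
    z = scale * (2.0 * x - 1.0)      # pre-activation: +scale for 1s, -scale for 0s
    y = 1.0 / (1.0 + np.exp(-z))     # sigmoid output, always strictly between 0 and 1
    return -np.mean(x * np.log(y) + (1.0 - x) * np.log(1.0 - y))

x = np.array([0.0, 1.0, 1.0, 0.0, 1.0])
bce_losses = [binary_reconstruction_loss(x, s) for s in (1.0, 5.0, 20.0)]
for s, loss in zip((1.0, 5.0, 20.0), bce_losses):
    print("weight scale =", s, "loss =", loss)
```

The loss keeps falling as the scale grows but never reaches zero at any finite scale, because a sigmoid only approaches 0 and 1. A weight penalty or early stopping therefore stops the network short of a perfect copy.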