In this blog post, we will be using a bit of background from my previous blog post. If you are familiar with the basics of entropy coding, you should be fine. If not, you may want to quickly read through my previous blog post. So coming to the topic at hand, let’s continue our discussion on entropy coding. Let’s say we have a stream of English alphabets coming in, and you want to store them in the best possible way by consuming the least amount of space. So you go ahead and build your nifty entropy coder to take care of all this. But what if you don’t have access to all the data? How do you know what alphabet appears most frequently if you can’t access the full data? The problem now is that you cannot know for sure if you have chosen the best possible representation. Since you cannot wait forever, you just wait for the first ‘n’ alphabets and build your entropy coder hoping that the rest of the data will adhere to this distribution. Do we end up suffering in terms of compression by doing this? How do we measure the loss in quality?
I’m not sure what you are talking about! Show me an example.
Nothing clears up a discussion like a concrete example. Okay, let’s consider the first 30 alphabets in our data stream:
You can see that the letter ‘K’ appears a lot. So we go ahead and build an entropy coder using Huffman coding to get the following representation:
S = 000 (3 bits) A = 001 (3 bits) C = 01 (2 bits) K = 1 (1 bit)
If you substitute these values, you will see that it optimally compresses the above sequence to 51 bits. Now consider the first 60 alphabets in the same data stream:
The next 30 characters are all ‘S’. So if we go ahead with the above representation, it will take 51 + 3 x 30 = 141 bits to represent the data stream. If you hadn’t used any entropy coding, you would have assigned 2 bits to each character. This means that the above sequence of length 60 would have taken 120 bits to represent it. So, not using any entropy coding turned out to be better than using entropy coding. That doesn’t make sense, right? The reason this happened is because we didn’t know what the data contained before we built our entropy coder. If you consider the full sequence of length 60 to build the optimal entropy coder, you will see that you need only 100 bits to represent the whole sequence.
How do we measure this thing?
We need a way to quantify this whole thing, right? We should be able to measure this loss in quality because of sub-optimal representation. This is where the concept of relative entropy comes into picture. It’s also called Kullback-Leibler divergence (KL-divergence). It is a distance function from a true probability distribution to a target probability distribution. Basically, we can think of it as the extra average number of bits per symbol needed to represent the system due to sub-optimal encoding. Here, symbol refers to the alphabets appearing in the sequence.
To draw an appropriate analogy, let’s consider our example. We used sub-optimal code to represent the sequence of length 60. The average number of bits per symbol would be 141/60 = 2.35 bits/symbol. For the optimal case, this would be 100/60 = 1.67 bits/symbol. So the KL-divergence would 2.35 – 1.67 = 0.68 bits/symbol. Ideally, we want this to be as low as possible when we estimate our model to build the entropy coder. One thing to note is that this is not a symmetric measure. As in, if P is the true distribution and we use Q to represent the data, then the KL-divergence is not the same as the case when Q is the true distribution and we use P to represent the data.
What if I get a completely new symbol? How do we represent that?
Now let’s say you built your entropy coder based on the symbols you got. You know what alphabets are likely to appear more frequently, and you build based on that. It’s all good up until this point. What if you get a completely new alphabet in your sequence? An alphabet that your entropy coder has never seen before! You will not be able to assign any binary code to it. At this point, KL-divergence would break down. Since KL-divergence breaks down, you will either have to use another measure or other probability distributions. But barring all that, it is a good measure of what’s happening with your entropy coder.
KL divergence represents the amount of information lost per observation if the distribution Q is used as instead of P. According to our current scheme, our entropy coder says that the letter ‘F’ occurs with 0 probability (because it doesn’t appear in the first 60 alphabets). As you receive more data from the data stream, you will see that the letter ‘F’ does indeed appear. Since we don’t have a representation for ‘F’, we cannot store that alphabet, which means we lose that information. Entropy coders are lossless by definition. So if your entropy coder loses information like this, you most certainly have the wrong coder. The solution is to never allow probabilities that are 0 or 1 in estimated distributions.
Where is this used in real life?
This is actually used frequently in natural language processing. If your probability distributions come from natural language documents, then you can compare pairs of categories. Let’s say you have two categories of documents and you have built an entropy coder for each. Now when a new document comes in, you can pass it through both the entropy coders and see which one represents it more compactly. You could estimate a probability function from both documents and then see how many extra bits you would need on average for either document using KL-divergence. This gives us an idea of what category the unknown document belongs to.