Entropy Coding appears everywhere in modern digital systems. It is a fundamental building block of data compression, and data compression is pretty much needed everywhere, especially for internet, video, audio, communication, etc. Let’s consider the following scenario. You have a stream of English alphabets coming in and you want to store them in the best possible way by consuming the least amount of space. For the sake of discussion, let’s assume that they are all uppercase letters. Bear in mind that you have an empty machine which doesn’t know anything, and it understands only binary symbols i.e. 0 and 1. It will do exactly what you tell it to do, and it will need data in binary format. So what do we do here? One way would be to use numbers to represent these alphabets, right? Since there are 26 alphabets in English, we can convert them to numbers ranging from 0 to 25, and then convert those numbers into binary form. The biggest number, 25, needs 5 bits to be represented in binary form. So considering the worst case scenario, we can say that we need 5 bits to represent every alphabet. If have to store 100 alphabets, we will need 500 bits. But is that the best we can do? Are we perhaps not exploring our data to the fullest possible extent?

**Okay so what was the point of that whole alphabet story?**

The thing about data is that there is always something hidden inside. If we just care to look into it, we will find magical treasures! Coming back to our example, consider the following sequence:

SSCAAACCCKKKKAKKKKKKKKCCCAKKKK

This is a sequence of length 30. If we just naively assume that we need 5 bits to represent every alphabet, we will need 30 x 5 = 150 bits to represent this sequence. Can we do something better? Well, of course we can! The first thing we notice is that we have only 4 distinct alphabets. Let’s assign a number to each of those alphabets:

S - 0 C - 1 A - 2 K - 3

We need only 2 bits to represent the biggest number, 3, in binary form. So even though we need 5 bits to distinctly represent all English alphabets, we just need 2 bits per alphabet here because we don’t need to worry about other alphabets. There’s our first saving! We now only need 30 x 2 = 60 bits to represent that sequence.

**Can we do better?**

So what’s next here? Do you notice anything else? The alphabet ‘K’ appears way more frequently than other characters, where as ‘S’ appears only twice. So may be it’s not optimal to represent both of them using the same number of bits, right? I mean, we can save some bits if we use lesser number of bits to represent ‘K’ and higher numbers of bits to represent ‘S’. But how do we find out that representation? Well, we need to use something called “Huffman Coding” for that. I won’t be discussing the implementation of details here. But if we apply Huffman Coding here, you will find out that we can have the following representation:

S = 000 (3 bits) A = 001 (3 bits) C = 01 (2 bits) K = 1 (1 bit)

If you now compute the sum by replacing those alphabets in the sequence with these assignments, you will find out that we can in fact represent that sequence using just 51 bits (3 x 2 + 3 x 5 + 2 x 7 + 1 x 16 = 51). Isn’t that nice?

**What exactly is “entropy”?**

Entropy generally refers to disorder that’s present in a system. The term “entropy” appears in multiple fields. But in information theory, entropy is a measure of uncertainty associated with a variable. Higher the uncertainty, higher the entropy! To understand it, let’s consider the following three sequences:

Sequence 1: ABABABABABABABABABABABABABABAB Sequence 2: AAAAAAAAAAAAAAABBBBBBBBBBBBBBB Sequence 3: ABBBBBBBBBBBBBBBBBBBBBBBBBBBBB

In sequences 1 and 2, the alphabets ‘A’ and ‘B’ appear exactly the same number of times. So basically there’s no uncertainty associated with it. Both of them appear with equal probability and there are no surprises as such. Here, we say that the entropy of the system is maximum. In sequence 3, we can see that ‘A’ appears only once and ‘B’ appears a lot. So we can say that the uncertainty is reduced. Given a random position, ‘B’ is more likely to appear than ‘A’, right? We say that this sequence has lower entropy. Lower entropy systems are good for us because we can compress them by reassigning the symbols. Do you see something similar between sequence 3 and the alphabet example we discussed in the first paragraph? That system also had low entropy, and that’s why we were able to compress it.

**Okay then what’s entropy coding?**

Actually, you already know it! But for the sake of completion, let’s define it here. Entropy Coding refers to lossless data compression that’s applicable to any stream of data regardless of how it’s created. We talked about data compression earlier. Different domains need different algorithms for data compression. For example, image compression algorithms are very different from audio compression algorithms. We make use of certain characteristics of human visual system to optimally compress images. Similarly, we make use of certain characteristics of human auditory system to optimally compress audio files. So we need to know where the data is coming from, so that we can pick the right algorithms. But entropy coding doesn’t care about all that. You can give it any stream of data and it will compress it. That’s the best part! In fact, almost all data compression systems have an entropy coder as their final compression step. They compress whatever they can and then feed it to an entropy coder to produce the final compressed data.

———————————————————————-———————————–