Intro To Information Theory

From Bits To The Modern Entropy Function

March 18, 2020

Beyond Calculus

The caves of Indonesian Borneo offer insight into the most primitive form of recorded communication. Some 40,000 years ago, before the evolution of written language, illustrations on cave walls were the most precise method of recording information. As time progressed, record-keeping evolved from cave paintings to complex alphabets, offering complete expression through the structure that is language. In English, ideas as concrete as a "tree" & as complex as "love" are expressed through 26 letters, symbols that carry no intrinsic meaning beyond that which society has assigned them.

Information, in the context of information theory, is defined as the resolution of uncertainty (or surprise). Consider communicating the presence of a tree. With a preserved image, information on the shape, colors, & even the relative size of neighboring trees is discernible, but is it perfectly precise? Only relatively so; a discrepancy is revealed when comparing the supplementary information provided by written language. With an unlimited supply of words, language offers far more descriptive information than what is casually observed, such as where the tree was planted, by whom, & the type of soil it grows in. Different methods of communication resolve different amounts of uncertainty.

Known as the Father of Information Theory, MIT mathematician Claude Shannon famously introduced a logical, mathematical way of measuring this difference between recorded communication methods: he called it entropy. Entropy provides a mathematical tool-set for measuring disorder & uncertainty. A higher entropy value indicates greater uncertainty, or, more practically, a larger number of possible outcomes. To arrive at the modern formula for measuring information, we first have to go back to the 1920s, when Ralph Hartley built on the work of Harry Nyquist.

Early Entropy: A Measure Of Uncertainty

Information is defined as the resolution of uncertainty; if no questions are necessary to determine a value, there is no information being presented. Entropy is directly related to the information of a system: the higher the entropy, the more uncertainty is attributed to the definition of a symbol (a number, letter, etc.). Entropy, or uncertainty, is mathematically at a maximum when a symbol could equally have many meanings (a uniform distribution). The simplest case arises when a symbol could equally have just two meanings. A coin flip has this binary property: it's either heads or tails.

[Figure: entropy H(x) of a symbol with two possibilities, plotted against the probability of one outcome; it peaks at 1 when the two outcomes are equally likely and falls to 0 when the outcome is certain]
Source Material From Claude Shannon, The Mathematical Theory of Communication

In the graph above, the y-axis, H(x), represents entropy for a symbol with two possibilities (again, binary); notice that entropy, or uncertainty, is at its maximum (equal to 1) in the middle of the graph. This intuitively checks out: when we flip a coin, the uncertainty of the value it'll land on is maximized while it's in the air. On the flip side, entropy is at a minimum (equal to 0) at the opposite ends of the x-axis, where we already know the value of the binary symbol. With that in mind, it's worth stating that coin flips, yes/no questions, Booleans & binary digits (0 or 1) are all mathematically equivalent; they are all represented in information theory by the graph above.
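
To make the curve concrete, here is a minimal Python sketch of the function it plots (the function name and the sampled probabilities are my own choices for illustration). Entropy peaks at 1 bit when the two outcomes are equally likely and falls to 0 when the outcome is certain.

import math

def binary_entropy(p):
    # Entropy, in bits, of a symbol with two outcomes where P(first outcome) = p.
    if p in (0.0, 1.0):
        return 0.0  # a certain outcome carries no uncertainty
    return p * math.log2(1 / p) + (1 - p) * math.log2(1 / (1 - p))

for p in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"P(heads) = {p:.1f}  ->  H = {binary_entropy(p):.3f} bits")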

This Symbol With A Binary Property, However We Want To Refer To It (a “Bit”), Is The Base Unit Of Entropy Used In All Of Information Theory

The term "bit" wasn't actually adopted until decades later, but we're using it now for ease of understanding as we work our way up to modern information theory. A fun trivia fact: the word "bit" is short-hand for binary digit.

Reasonably, it follows that an intended message with a single-step solution (a single bit, coin flip, yes/no question, or Boolean of uncertainty) has a lower entropy value than one that requires two or more steps. For example, consider a fair coin flip. When trying to determine whether the result is heads or tails, a single bit suffices to decipher the value: "was it heads?" Either way, we are certain of the return value after a single response. This problem has an entropy value of one because the answer can be determined in a single step:

H(x) = log₂(2) = 1 bit

When more complex problems are considered, it will take additional Booleans to determine the answer with certainty. We now know that communicating the simplest possible symbol, a single bit, yields an entropy value of one.

What If I Sent A Message With Three Symbols (All Bits)?

Continuing the example above, let's say we now want to send a three-symbol message encoded with three coin flips. The number of total possible messages is now eight (2³), but the entropy of this message is three. Remember, entropy is not the number of possible messages; rather, it's the minimum number of yes/no questions required to determine the message.
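
A quick sketch of that count, assuming nothing beyond the three fair flips described above; it enumerates the message space and confirms that the entropy is the base-2 log of its size.

import itertools
import math

# All distinct three-flip messages: ('H','H','H'), ('H','H','T'), ..., ('T','T','T').
messages = list(itertools.product("HT", repeat=3))

print(len(messages))             # 8 possible messages (2**3)
print(math.log2(len(messages)))  # 3.0 -> three yes/no questions, i.e. 3 bits of entropy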

What If I Sent A Message With A Single Non-Binary Symbol (Character)?

Now say we want to send a single character, meaning our symbol could be any one of twenty-six letters (each with probability 1/26). What is the entropy value this time around? In other words, how many yes/no questions do we have to answer to determine the value of our letter, say J?

The most obvious starting point is to simply ask, in alphabetical order, one character at a time: "is X the intended letter?" For example, is A the letter we're trying to encode? How about B? Or C? With this method, a single symbol would take us, on average, about 13 "yes/no" questions to determine. However, there's a big caveat: measuring information requires the most efficient way of assigning questions to symbols. Instead of searching character-by-character, we borrow a concept from the binary search algorithm. Again searching for the letter J, we instead ask: is J (our intended character) in the first half of the alphabet? Yes. Next, we split the remaining range: is our character in the first six letters (A-F)? No. And so on & so forth until we identify our symbol, J. Notice that this way of assigning value to a symbol using Booleans is far more efficient; instead of an average of thirteen (13) yes/no questions, this second method never needs more than five (5), & occasionally only four (4).
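
The halving strategy is easy to simulate. Below is a rough Python sketch (the helper name and structure are mine, not from the article's source material) that counts how many yes/no questions the strategy needs for any letter; it never exceeds five, with the theoretical floor of log₂(26) ≈ 4.7 sitting just below.

import math
import string

def questions_to_find(target, symbols=string.ascii_uppercase):
    # Count the yes/no questions a halving (binary-search) strategy needs
    # to isolate `target` within the ordered symbol set.
    low, high = 0, len(symbols) - 1
    index = symbols.index(target)
    questions = 0
    while low < high:
        mid = (low + high) // 2
        questions += 1           # "Is the letter in the first half of what's left?"
        if index <= mid:
            high = mid
        else:
            low = mid + 1
    return questions

print(questions_to_find("J"))                                     # 4 questions for J
print(max(questions_to_find(c) for c in string.ascii_uppercase))  # never more than 5
print(math.log2(26))                                              # ~4.7, the entropy: a floor on the average number of questions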

This observation points to the general relationship: every time the number of possible symbols in the message space doubles, the entropy grows by one yes/no question, one bit. With this relationship understood, calculating the entropy of a single alphabet character under a uniform distribution is straightforward:

H(x) = log₂(26) ≈ 4.7 bits

The formula above arrives at the approximate amount of entropy (4.7 bits) associated with sending a single random alphabet character. As Ralph Hartley concluded in his eminent paper, Transmission of Information:

What We Have Done Then Is To Take As Our Practical Measure Of Information The Logarithm Of The Number Of Possible Symbol Sequences

His early definition of information differed slightly in notation. He defined information as H = n log(s), where H is the information in a message, n is the number of symbols in the message (letters, numbers, etc.), & s is the number of different symbols available at each selection (essentially the size of the alphabet).
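
A small sketch of Hartley's measure (the helper name is my own, and base-2 logarithms are used so the answer comes out in bits):

import math

def hartley_information(n, s):
    # Hartley's H = n * log(s): n symbols, each chosen from s equally likely options.
    return n * math.log2(s)

print(hartley_information(1, 26))  # one alphabet character: ~4.7 bits
print(hartley_information(3, 26))  # a three-letter sequence: ~14.1 bits
print(hartley_information(3, 2))   # three coin flips: exactly 3 bits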

We're still not quite at modern information theory, but we've already learned a solid amount:

  • Information theory provides a way to quantify the amount of surprise in a communication event
  • Entropy, or uncertainty, is the minimum number of yes/no questions required to determine a symbol's value
  • The binary digit, the bit, has an entropy value of 1 & is therefore the base unit of the field
  • Hartley's early information function assumes a set of uniformly distributed symbol values

Up to this point, we've assumed that each value in a symbol set is equally likely, but that's a convenient oversimplification. In real life, the theory isn't quite this neat & all symbol values aren't created equal. There are organic, measurable patterns in our language & other forms of communication. As a quick thought experiment, count the number of times the letter "e" shows up in the previous sentence. Does that distribution of characters look like a uniform 1/26?
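
You can run that thought experiment directly; a small sketch follows (the sentence string is paraphrased from the paragraph above):

from collections import Counter

sentence = ("there are organic measurable patterns to our language "
            "and other forms of communication")
counts = Counter(ch for ch in sentence if ch.isalpha())
total = sum(counts.values())

# A uniform alphabet would give each letter 1/26, about 3.8% of the text;
# 'e' alone claims a far larger share.
print(counts["e"], "of", total, "letters are 'e':", round(counts["e"] / total, 3))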

When Symbols Aren’t Random — Markov Chains

Fast forward to 1948, when the father of modern information theory, Claude Shannon, proposed in his groundbreaking paper, "A Mathematical Theory of Communication," that there are patterns in communication that can be exploited to convey the same message in fewer bits.

Written Language Offers Patterns That Make The Next Value In A Sequence More Predictable From Its Prior Values.

In other words, the prior value leaves the next value with less uncertainty (entropy). The best example is the predictability of a 'U' appearing after a 'Q' in English writing. If 'Q' is followed by 'U' 90% of the time, the potential outcome of the next letter is no longer in equilibrium across the whole alphabet; it skews toward 'U' at a rate of 90%.

This creates a system in which the next value depends on the previous one. Russian mathematician Andrey Markov formalized such systems in what are now known as Markov chains: processes in which the probability of the next value depends only on the current one, according to fixed transition probabilities. He proved that, over the continuous running of such a system, the observed outcomes converge to their statistical probabilities.

Recall the dependency between 'Q' & 'U': if the next letter is 'U' with probability 9/10, that dominant outcome contributes only about 0.9 · log₂(1/0.9) ≈ 0.13 bits of uncertainty, whereas a letter drawn from a completely randomized, independent alphabet carries 4.7 bits. Exploiting the dependence of 'U' on 'Q' therefore strips away roughly 4.57 bits of entropy. Instead of cutting the alphabet in half on the first step, the question that reveals the most information is simply, "Is the value 'U'?" Ninety percent of the time the answer is yes, & a single bit settles the letter. Removing unnecessary questions in this way lowers the total entropy of the system. Through computer analysis of millions of written works, the standard probabilities of each letter in the English language have been compiled; these are the frequencies to use when letters are treated as independent events. Taking dependent outputs into consideration (Markov chains), relative frequencies have also been established for letters conditioned on the letters that precede them.
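
For the arithmetic behind those figures, here is a minimal sketch. The 90% probability is the article's; the even split of the remaining 10% across the other 25 letters is purely my assumption, added only to show the full conditional entropy alongside the dominant-outcome term.

import math

p_u = 0.9                 # assumption from above: 'U' follows 'Q' 90% of the time
p_other = (1 - p_u) / 25  # assumed: remaining 10% split evenly over the other 25 letters

# Contribution of the dominant 'U' outcome alone: about 0.13 to 0.14 bits.
print(p_u * math.log2(1 / p_u))

# Full conditional entropy under the assumed split: still far below a context-free letter.
h_after_q = p_u * math.log2(1 / p_u) + 25 * p_other * math.log2(1 / p_other)
print(h_after_q)      # ~0.93 bits under this assumed split
print(math.log2(26))  # ~4.7 bits for a letter with no context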

The Information/Entropy Formula Revisited

With this realization, Shannon modernized information theory by evolving Hartley's function. With a set of uniformly distributed values X, we calculate the entropy of encoding a single symbol as the log (base 2) of the number of possible values. With a set of related, interdependent values X, we instead calculate the entropy of a single symbol by adding up the entropy contributed by each possible symbol value in the set:

H(X) = Σₓ log₂(n), one term for each possible symbol value x in a set of n symbols

However, the formula above assumes a uniform distribution, which, through Markov chains, we now know to be unrealistic. To account for this, we amend the formula to weight each term by the probability of its symbol value x, p(x):

H(X) = Σₓ p(x) · log₂(n)

Last, we need to replace the n inside the log function. We're calculating the number of yes/no questions required to arrive at each individual outcome, not the aggregate. Continuing with the alphabet example, we know from the Markov chain above that it doesn't take an equal number of bits to guess whether a character is an e versus a z. Therefore, for each term in the sum, we need the effective number of possibilities for that specific outcome; we know it's not 26, but rather 1/p(x):

H(X) = Σₓ p(x) · log₂(1/p(x))
For Further Detail Read Reza's An Introduction To Information Theory

The formula we've now derived is the very same one that Claude Shannon arrived at, the result that catapulted him to recognition as one of the fathers of information theory. With the variables slightly rearranged, this equation sits at the very core of modern information theory:

H(X) = Σₓ p(x) · log₂(1/p(x)), often written as H(X) = −Σₓ p(x) · log₂ p(x)

H(x) is entropy, the measure of uncertainty associated with the variable X. p(x) is the probability of outcome x. And the base-2 log of 1/p(x) is the number of bits required to encode outcome x. Once again, the unit H(x) is measured in is bits.
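
In code, the formula is nearly a one-liner. A minimal Python sketch (the function name and the example distributions are mine):

import math

def shannon_entropy(distribution):
    # H(X) = sum over outcomes of p(x) * log2(1 / p(x)), measured in bits.
    return sum(p * math.log2(1 / p) for p in distribution.values() if p > 0)

print(shannon_entropy({"heads": 0.5, "tails": 0.5}))                          # 1.0 bit: a fair coin
print(shannon_entropy({ch: 1 / 26 for ch in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"}))   # ~4.7 bits: a uniform alphabet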

Shannon's updated formula, which leverages the principle behind Markov chains, should yield a lower entropy for a set of symbol values once we move away from the uniform distribution. Returning to the alphabet, we can confirm this with an example. Recall that we earlier calculated H(x) = 4.7 bits for a single alphabet character. Let's compare that to H(x) calculated with the updated formula & the measured letter frequencies:

[Table: the probability of each English letter & its contribution p(x) · log₂(1/p(x)), with the contributions summing to the total entropy]

Summing the per-letter contributions, the intuition indeed checks out, with a final entropy value of H(x) ≈ 4.18 bits. With this general formula for entropy in hand, information theory & its applications have only grown since the 1950s.
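
The same check can be reproduced in a few lines. The frequencies below are approximate figures commonly cited for English text, not necessarily the exact table used above, and they are normalized so they sum to one; plugging them into Shannon's formula lands at roughly 4.18 bits.

import math

# Approximate relative frequencies of English letters (figures vary slightly by source).
letter_freq = {
    "e": 0.127, "t": 0.091, "a": 0.082, "o": 0.075, "i": 0.070, "n": 0.067,
    "s": 0.063, "h": 0.061, "r": 0.060, "d": 0.043, "l": 0.040, "c": 0.028,
    "u": 0.028, "m": 0.024, "w": 0.024, "f": 0.022, "g": 0.020, "y": 0.020,
    "p": 0.019, "b": 0.015, "v": 0.0098, "k": 0.0077, "j": 0.0015, "x": 0.0015,
    "q": 0.00095, "z": 0.00074,
}

total = sum(letter_freq.values())
probabilities = {ch: f / total for ch, f in letter_freq.items()}  # normalize so the values sum to 1

entropy = sum(p * math.log2(1 / p) for p in probabilities.values())
print(round(entropy, 2))        # ~4.18 bits per letter
print(round(math.log2(26), 2))  # 4.7 bits if every letter were equally likely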

Onward To Applications

The concepts learned here, built up from the mathematical bit, are recognizable & powerful in the digital era. Likely one of the most widespread & influential uses of information theory is the role it plays in lossless data compression, the form of compression used for database records, word-processing files, image files, & video files. In lossless compression, data can be perfectly reassembled into its original state from the compressed file; by exploiting the principles behind Shannon's entropy & Markov chains, information can be retrieved without fault. This ability to compress has allowed for the mass production of devices that hold previously unfathomable amounts of data. The contrast is most striking with music: the early days of recorded music relied on vinyl records, an analog, uncompressed medium that could hold roughly a single album, while modern compressed audio formats squeeze the bits describing pitch, volume, timbre, & so on into a tiny fraction of the space. The steady growth in hardware storage capability is another outcome powered by the mathematical principles outlined here.
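
As a small illustration of that "perfectly reassembled" property, the sketch below round-trips a repetitive string through Python's built-in zlib module, whose DEFLATE algorithm includes Huffman coding, an entropy code that gives shorter bit patterns to more probable symbols, in the spirit of the ideas traced above.

import zlib

original = b"information is the resolution of uncertainty " * 40
compressed = zlib.compress(original)

print(len(original), "bytes before,", len(compressed), "bytes after")  # repetitive text compresses heavily
assert zlib.decompress(compressed) == original  # lossless: the original is recovered exactly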

The digital age would be only a distant dream without Shannon's contribution to the world of communication. Like many under-recognized mathematical principles, information theory quietly plays a vital role in our day-to-day lives.

Learning Sources

A Mind At Play - Soni, Goodman

The Mathematical Theory of Communication - Shannon, Weaver

An Introduction To Information Theory - Reza