Chapter 1: Introduction to Information Theory
The first problem we tackle is how to communicate perfectly over imperfect communication channels.
The running example is this. We have some message which is a sequence of bits. We want to write this data to a noisy hard disk that transmits each bit correctly with probability $1-f$ and incorrectly with probability $f$. This is called the binary symmetric channel (symmetric because the probability of flipping from $0$ to $1$ is the same as from $1$ to $0$).
Suppose $f = 0.1$, i.e. one bit in ten is flipped. This is clearly unacceptable. We want a useful disk drive to flip no bits in, say, 10 years of writing a gigabyte a day, which means we want an error probability around $10^{-15}$ or smaller. We solve this problem by introducing communication systems to detect and correct the errors.
The flow is such:
- Start with message $\mathbf{s}$
- $\mathbf{s}$ gets encoded into $\mathbf{t}$, which is some intermediate representation we design
- Some noise $\mathbf{n}$ gets added to $\mathbf{t}$, which represents noise from the imperfect hard disk
- This results in $\mathbf{r} = \mathbf{t} + \mathbf{n}$, which is the received vector
- The addition is in modulo-2 arithmetic, i.e. $1 + 1 = 0$ and $1 + 0 = 1$
- Finally we decode using some algorithm to get $\hat{\mathbf{s}}$ from $\mathbf{r}$
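The flow above can be sketched in a few lines of Python. This is a toy sketch (the `bsc` function name is my own): the encoder is left out, and we just push a bit vector through the binary symmetric channel with mod-2 addition.

```python
import random

def bsc(t, f, rng):
    """Send bit vector t through a binary symmetric channel with flip probability f."""
    n = [1 if rng.random() < f else 0 for _ in t]   # noise vector n
    return [(ti + ni) % 2 for ti, ni in zip(t, n)]  # r = t + n (mod 2)

rng = random.Random(0)
t = [1, 0, 0, 0, 1, 0, 1]
r = bsc(t, 0.1, rng)   # received vector: each bit flipped with probability 0.1
```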
The Repetition Code
A simple communication system is the repetition code $R_3$: repeat each bit $3$ times. After noise is added, we decode each bit by taking the majority vote of its three copies in the received vector. The idea is that when $f$ is small, the probability of multiple copies failing independently becomes small. So the majority vote will correct some errors.
Theorem. The majority vote decoding is optimal (i.e. has the lowest probability of error).
Proof. Since each bit is encoded and corrupted independently, we just need to consider the optimal decision for one bit. Suppose we received bits $r_1 r_2 r_3$. The objective is to recover the original bit $s$, which could have been $0$ or $1$. Thus, we want:
$$\hat{s} = \operatorname*{arg\,max}_{s} P(s \mid r_1 r_2 r_3) = \operatorname*{arg\,max}_{s} \frac{P(r_1 r_2 r_3 \mid s)\, P(s)}{P(r_1 r_2 r_3)}$$
Note that:
- In the first line, $s$ is the random variable representing the unknown true bit
- In the $\operatorname{arg\,max}$, $s$ is a candidate value that we are choosing (some notation overloading, as $s$ is also used to represent the true bit above)
- $P(r_1 r_2 r_3)$ is treated as a constant, and (assuming the uniform prior $P(s) = 1/2$) neither it nor $P(s)$ affects the choice of $s$
Now, if we denote by $d$ the number of bit flips between $r_1 r_2 r_3$ and $sss$, we have:
$$P(r_1 r_2 r_3 \mid s) = f^{d} (1-f)^{3-d}$$
Notice that we are computing the probability for this specific received sequence $r_1 r_2 r_3$, hence we do not need to multiply by the number of combinations.
With this formulation, we see that $P(r_1 r_2 r_3 \mid s)$ is maximized when the choice of $s$ minimizes the number of bit flips $d$ (provided $f < 1/2$). This is because a scenario with few bit flips is more likely than a scenario with many bit flips. Hence, the majority vote is optimal as it minimizes the number of bit flips.
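A minimal sketch of the $R_3$ encoder and majority-vote decoder (function names are my own):

```python
def encode_r3(s):
    """Repeat each source bit 3 times."""
    return [b for bit in s for b in (bit, bit, bit)]

def decode_r3(r):
    """Majority vote over each block of 3 received bits."""
    return [1 if sum(r[i:i + 3]) >= 2 else 0 for i in range(0, len(r), 3)]
```

For example, `decode_r3([1, 0, 1, 0, 0, 0])` recovers `[1, 0]` even though one copy of the first bit was flipped.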
Exercise 1.2. Show that the error probability is reduced by the use of $R_3$ by computing the error probability of this code for a binary symmetric channel with noise level $f = 0.1$.
Solution. The naive error probability (no coding) is simply:
$$p_b = f = 0.1$$
With $R_3$, a bit is decoded wrongly only if two or three of its copies are flipped, so the error probability is:
$$p_b = 3f^2(1-f) + f^3$$
Now we want to show that when $f = 0.1$, $R_3$ has a better error probability:
$$3(0.1)^2(0.9) + (0.1)^3 = 0.027 + 0.001 = 0.028 < 0.1$$
Hence we showed that $R_3$ has a better error probability.
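A quick numeric check of the figures above:

```python
def p_r3(f):
    # A bit is decoded wrongly iff 2 or 3 of its copies are flipped.
    return 3 * f**2 * (1 - f) + f**3

print(p_r3(0.1))  # ~0.028, better than the uncoded 0.1
```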
Even though $R_3$ has a better error probability, the problem is that our rate of information transfer has fallen by a factor of $3$. To improve our error probability, we can continue to increase repetition at the cost of decreased rate.
Exercise 1.3. Find the probability of error of $R_N$, the repetition code with $N$ repetitions, for odd $N$:
$$p_b = \sum_{n=(N+1)/2}^{N} \binom{N}{n} f^n (1-f)^{N-n}$$
Assuming $f = 0.1$, which term in this sum is the biggest? How much bigger is it than the second-largest term?
The largest term is the one with the fewest flips, $n = (N+1)/2$:
$$\binom{N}{(N+1)/2} f^{(N+1)/2} (1-f)^{(N-1)/2}$$
The second-largest term is around $\frac{f}{1-f} = \frac{1}{9}$ times smaller (the ratio of consecutive terms is $\frac{N-n}{n+1}\cdot\frac{f}{1-f} \approx \frac{f}{1-f}$ near $n = (N+1)/2$), so the largest term dominates.
Use Stirling's approximation to approximate the largest term and find the probability of error.

Using Stirling's approximation (and approximating $\binom{N}{(N+1)/2} \approx \binom{N}{N/2}$):
$$\binom{N}{N/2} \approx 2^N \sqrt{\frac{2}{\pi N}}$$
Hence we have the probability of error as:
$$p_b \approx 2^N \sqrt{\frac{2}{\pi N}}\, f^{(N+1)/2} (1-f)^{(N-1)/2} = \sqrt{\frac{2}{\pi N}} \sqrt{\frac{f}{1-f}}\, \big[4f(1-f)\big]^{N/2}$$
To get error probability less than $10^{-15}$ at $f = 0.1$, we write out the inequality and manipulate to find the lower bound for $N$: setting the leading term equal to $10^{-15}$ gives $N \approx 61$, i.e. the code $R_{61}$.
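We can check this numerically. The sketch below uses the exact largest term (via `math.comb`, no Stirling approximation) and finds the smallest odd $N$ whose leading error term drops to $10^{-15}$; summing the whole tail of the binomial instead of just the leading term would push $N$ up slightly.

```python
from math import comb

def leading_term(N, f):
    """Largest term of the error sum: the (N+1)/2-flips term."""
    k = (N + 1) // 2
    return comb(N, k) * f**k * (1 - f)**(N - k)

def smallest_odd_N(f, target):
    # The leading term shrinks monotonically in N for f = 0.1,
    # so a simple scan over odd N terminates.
    N = 3
    while leading_term(N, f) > target:
        N += 2
    return N

print(smallest_odd_N(0.1, 1e-15))  # 61
```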
Hamming (7, 4) code
Thus far, we have been trying to encode each bit independently. What if we encode blocks of bits together? Can we get more efficient codes?
A block code converts a sequence of source bits $\mathbf{s}$, of length $K$, into a transmitted sequence $\mathbf{t}$ of length $N$ bits.
- We add some redundancy by making $N > K$. The extra bits (called parity-check bits) are used to store some redundant information about the original bits.
- This is usually implemented as some linear function of the original bits
- In the (7,4) Hamming code, we transmit $N = 7$ bits for every $K = 4$ source bits
The encoding can be shown with an example. Suppose $\mathbf{s} = s_1 s_2 s_3 s_4$.
- The first 4 bits of the transmitted $\mathbf{t}$ are $t_1 t_2 t_3 t_4 = s_1 s_2 s_3 s_4$, i.e. just copy the source sequence
- The next 3 bits $t_5 t_6 t_7$ are parity-check bits
- $t_5$ is set such that $s_1 + s_2 + s_3 + t_5$ is even
- $t_6$ is set such that $s_2 + s_3 + s_4 + t_6$ is even
- $t_7$ is set such that $s_1 + s_3 + s_4 + t_7$ is even
- For the example of $\mathbf{s} = 1000$: $\mathbf{t} = 1000101$
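The three parity rules translate directly into code (a sketch; `hamming74_encode` is my own name):

```python
def hamming74_encode(s):
    """Encode 4 source bits into 7 transmitted bits for the (7,4) Hamming code."""
    s1, s2, s3, s4 = s
    t5 = (s1 + s2 + s3) % 2   # makes s1 + s2 + s3 + t5 even
    t6 = (s2 + s3 + s4) % 2   # makes s2 + s3 + s4 + t6 even
    t7 = (s1 + s3 + s4) % 2   # makes s1 + s3 + s4 + t7 even
    return [s1, s2, s3, s4, t5, t6, t7]

print(hamming74_encode([1, 0, 0, 0]))  # [1, 0, 0, 0, 1, 0, 1]
```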
We can see that the Hamming code is a linear code, since the encoding can be written compactly as a matrix-vector multiplication (using modulo-2 arithmetic). Specifically, the transmitted code-vector $\mathbf{t}$ may be obtained from the source-vector $\mathbf{s}$ using a matrix multiplication:
$$\mathbf{t} = \mathbf{G}^{\mathsf{T}} \mathbf{s}$$
where $\mathbf{G}$ is the generator matrix of the code:
$$\mathbf{G}^{\mathsf{T}} = \begin{bmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \\ 1&1&1&0 \\ 0&1&1&1 \\ 1&0&1&1 \end{bmatrix}$$
The columns of the generator matrix may be viewed as defining four basis vectors in a seven dimensional binary space. The sixteen codewords are obtained by taking all possible linear combinations of these vectors (in modulo-2 arithmetic). This is a useful perspective:
- With 7 dimensions, we have $2^7 = 128$ possible vectors, out of which only $2^4 = 16$ are valid codewords. This means that if we receive an invalid vector, we know that it needs to be corrected.
- The 16 valid vectors are well spread out in the space (every pair of codewords differs in at least 3 positions), enabling correction.
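Both bullet points can be verified by brute force, enumerating all $2^4$ source vectors:

```python
from itertools import product

# Rows of G^T: the 4x4 identity stacked on the parity block P.
G_T = [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1],
       [1,1,1,0], [0,1,1,1], [1,0,1,1]]

def encode(s):
    return [sum(g * b for g, b in zip(row, s)) % 2 for row in G_T]

codewords = [encode(s) for s in product([0, 1], repeat=4)]
min_dist = min(sum(a != b for a, b in zip(c1, c2))
               for i, c1 in enumerate(codewords) for c2 in codewords[i + 1:])
print(len(codewords), min_dist)  # 16 codewords out of 128 vectors, min distance 3
```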
Decoding Hamming Code
Decoding the Hamming code follows the same logic as before. Assuming that the channel is a binary symmetric channel and all source vectors are equiprobable, we want to choose a source vector $\hat{\mathbf{s}}$ whose encoding $\hat{\mathbf{t}}$ differs from the received $\mathbf{r}$ in as few bits as possible. Recall that the theorem we proved earlier shows that we want to maximize $P(\mathbf{r} \mid \mathbf{s})$, which is achieved by minimizing the number of bit flips.
Naively, we can enumerate all 16 valid Hamming codewords (each of length 7), compare $\mathbf{r}$ against each, and find the closest. This is inefficient as our codes get longer.
The better way is syndrome decoding. A syndrome is defined as the pattern of violations of the parity bits. In other words, it is the vector of "unhappy" parity bits.
Syndrome example. Suppose we transmit $\mathbf{t} = 1000101$ and noise flips the second bit, giving us $\mathbf{r} = 1100101$. The syndrome would be $\mathbf{z} = 110$, because the first two parity checks are unhappy and the third is happy.
The syndrome decoding task is thus to find the unique bit that lies inside all the unhappy parity checks (the "circles" in the usual Venn-diagram picture of the Hamming code) and outside the happy ones. If we can find such a bit, flipping it would account for the observed syndrome.
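The example syndrome can be computed directly from the three parity checks (a sketch; the `syndrome` name is my own):

```python
def syndrome(r):
    """Each syndrome bit is 1 iff the corresponding parity check is violated."""
    z1 = (r[0] + r[1] + r[2] + r[4]) % 2   # check on t1, t2, t3, t5
    z2 = (r[1] + r[2] + r[3] + r[5]) % 2   # check on t2, t3, t4, t6
    z3 = (r[0] + r[2] + r[3] + r[6]) % 2   # check on t1, t3, t4, t7
    return [z1, z2, z3]

print(syndrome([1, 1, 0, 0, 1, 0, 1]))  # [1, 1, 0] for r = 1100101
```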
Matrix Version of Hamming Decoding
We can describe the decoding operation in terms of matrices. Let us define:
$$\mathbf{P} = \begin{bmatrix} 1&1&1&0 \\ 0&1&1&1 \\ 1&0&1&1 \end{bmatrix}$$
Let $\mathbf{H} = \begin{bmatrix} -\mathbf{P} & \mathbf{I}_3 \end{bmatrix}$. Then the syndrome vector is:
$$\mathbf{z} = \mathbf{H}\mathbf{r}$$
The $-\mathbf{P}$ part computes the expected parity bits from the received source bits and subtracts them from the actual received parity bits (picked out by the identity part). But because in modulo-2 arithmetic $-1 \equiv 1$, subtraction is the same as addition, so:
$$\mathbf{H} = \begin{bmatrix} \mathbf{P} & \mathbf{I}_3 \end{bmatrix}$$
It should be clear that for valid transmitted vectors of the form $\mathbf{t} = \mathbf{G}^{\mathsf{T}}\mathbf{s}$, $\mathbf{z} = \mathbf{H}\mathbf{t}$ will be the zero vector, i.e. the zero syndrome.
Exercise 1.4. Prove that this is so.
We want to show that $\mathbf{z} = \mathbf{H}\mathbf{t}$ is the zero vector for all valid $\mathbf{t}$. Consider $\mathbf{t} = \mathbf{G}^{\mathsf{T}}\mathbf{s}$. We have:
$$\mathbf{z} = \mathbf{H}\mathbf{G}^{\mathsf{T}}\mathbf{s}$$
For compatible matrix blocks (i.e. $\mathbf{P}\mathbf{I}_4$ is defined etc.), we have:
$$\mathbf{H}\mathbf{G}^{\mathsf{T}} = \begin{bmatrix} \mathbf{P} & \mathbf{I}_3 \end{bmatrix} \begin{bmatrix} \mathbf{I}_4 \\ \mathbf{P} \end{bmatrix} = \mathbf{P}\mathbf{I}_4 + \mathbf{I}_3\mathbf{P} = \mathbf{P} + \mathbf{P} = \mathbf{0} \pmod 2$$
Hence $\mathbf{z}$ will always be the zero vector.
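A one-off numeric check that $\mathbf{H}\mathbf{G}^{\mathsf{T}} = \mathbf{0} \pmod 2$:

```python
P = [[1,1,1,0], [0,1,1,1], [1,0,1,1]]
H = [p + e for p, e in zip(P, [[1,0,0], [0,1,0], [0,0,1]])]   # H = [P | I3], 3x7
G_T = [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]] + P        # G^T = [I4 on P], 7x4

HG = [[sum(H[i][k] * G_T[k][j] for k in range(7)) % 2 for j in range(4)]
      for i in range(3)]
print(HG)  # [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```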
We close this section on Hamming decoding by noting that the received vector $\mathbf{r}$ is obtained by:
$$\mathbf{r} = \mathbf{G}^{\mathsf{T}}\mathbf{s} + \mathbf{n}$$
And the syndrome vector is:
$$\mathbf{z} = \mathbf{H}\mathbf{r} = \mathbf{H}(\mathbf{G}^{\mathsf{T}}\mathbf{s} + \mathbf{n}) = \mathbf{H}\mathbf{G}^{\mathsf{T}}\mathbf{s} + \mathbf{H}\mathbf{n} = \mathbf{H}\mathbf{n}$$
Hence, the syndrome decoding problem is to find the most probable noise vector $\hat{\mathbf{n}}$ that satisfies $\mathbf{H}\hat{\mathbf{n}} = \mathbf{z}$. Such a decoding algorithm is called a maximum likelihood decoder.
Exercise 1.5. Refer to the (7,4) Hamming code. Decode these received strings:
- $\mathbf{r} = 1101011$: we have $\mathbf{z} = 011$, so flip $r_4$, giving $\hat{\mathbf{s}} = 1100$
- $\mathbf{r} = 0110110$: we have $\mathbf{z} = 111$, so flip $r_3$, giving $\hat{\mathbf{s}} = 0100$
- $\mathbf{r} = 0100111$: we have $\mathbf{z} = 001$, so flip $r_7$, giving $\hat{\mathbf{s}} = 0100$
- $\mathbf{r} = 1111111$: we have $\mathbf{z} = 000$, so flip nothing, giving $\hat{\mathbf{s}} = 1111$
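The full syndrome decoder fits in a few lines and reproduces the answers above (a sketch; the `decode` name is my own):

```python
P = [[1,1,1,0], [0,1,1,1], [1,0,1,1]]
H = [p + e for p, e in zip(P, [[1,0,0], [0,1,0], [0,0,1]])]   # H = [P | I3]

def decode(r):
    """Syndrome-decode a 7-bit received vector; return the 4 source bits."""
    r = list(r)
    z = [sum(H[i][k] * r[k] for k in range(7)) % 2 for i in range(3)]
    if any(z):
        # Flip the unique bit whose column of H matches the syndrome.
        m = next(k for k in range(7) if [H[i][k] for i in range(3)] == z)
        r[m] ^= 1
    return r[:4]

for bits in ("1101011", "0110110", "0100111", "1111111"):
    print(decode([int(b) for b in bits]))
# prints [1, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [1, 1, 1, 1]
```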
Exercise 1.6. Calculate the probability of block error of the (7,4) Hamming code as a function of the noise level $f$ and show that to leading order it goes as $21 f^2$.
The block error is the probability that one or more of the decoded bits in one block fail to match the corresponding source bits.
Assuming that whenever 2 or more bits are flipped in a block of 7 bits we get a block decoding error, we can derive this from a binomial distribution:
$$p_B = \sum_{m=2}^{7} \binom{7}{m} f^m (1-f)^{7-m}$$
Recall Taylor's expansion:
$$(1-f)^n = 1 - nf + \cdots$$
So for $p_B$, the $m = 2$ term contributes the lowest power of $f$:
$$p_B = \binom{7}{2} f^2 (1-f)^5 + O(f^3) = 21 f^2 + O(f^3)$$
The leading term is $21 f^2$.
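Numerically, the exact block-error sum indeed approaches $21 f^2$ as $f \to 0$:

```python
from math import comb

def p_block(f):
    """Exact block-error probability: 2 or more of the 7 bits flipped."""
    return sum(comb(7, m) * f**m * (1 - f)**(7 - m) for m in range(2, 8))

for f in (1e-2, 1e-3, 1e-4):
    print(f, p_block(f) / (21 * f**2))  # ratio tends to 1
```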
Show that to leading order the probability of bit error $p_b$ goes as $9 f^2$.
The naive way to solve this is to enumerate, for each source bit, the possible error scenarios and compute their probabilities. But that is very tedious.

The key observation is to consider the subset of block-error cases where exactly two bits are corrupted. This is the leading case because the probability of 3 or more bits being corrupted is of higher order in $f$, hence increasingly unlikely.
Now when two bits are corrupted, exactly 3 bits will be in error if we follow optimal decoding.
- Suppose bits $m$ and $m'$ are corrupted. Since the syndrome follows $\mathbf{z} = \mathbf{H}\mathbf{n}$, and $\mathbf{n}$ has ones exactly in positions $m$ and $m'$, this corresponds to adding the $m$-th and $m'$-th columns of $\mathbf{H}$ together: $\mathbf{z} = \mathbf{h}_m + \mathbf{h}_{m'}$
- The decoder flips the single bit $m''$ that explains the syndrome $\mathbf{z}$, since this has the highest likelihood. In other words, the bit flipped is the one whose column satisfies $\mathbf{h}_{m''} = \mathbf{h}_m + \mathbf{h}_{m'}$
- We can show that $m'' \neq m$ and $m'' \neq m'$ (e.g. $m'' = m$ would force $\mathbf{h}_{m'} = \mathbf{0}$, but no column of $\mathbf{H}$ is zero), which tells us that exactly 3 bits will be in error
It is furthermore possible to show that these 3 bits in error are uniformly distributed amongst the 7 possible positions in the code (not shown here). Since the probability of a block error goes as $21 f^2$, and each block error to leading order leaves 3 of the 7 bits in error, the probability of any given bit being in error is $21 f^2 \times 3/7 = 9 f^2$.
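The count behind the $9f^2$ figure can be checked by enumerating all $\binom{7}{2} = 21$ two-bit noise patterns and following the decoder:

```python
from itertools import combinations

P = [[1,1,1,0], [0,1,1,1], [1,0,1,1]]
H = [p + e for p, e in zip(P, [[1,0,0], [0,1,0], [0,0,1]])]   # H = [P | I3]
col = lambda k: [H[i][k] for i in range(3)]

source_errors = 0
for m1, m2 in combinations(range(7), 2):            # all 21 two-bit patterns
    z = [(a + b) % 2 for a, b in zip(col(m1), col(m2))]
    m3 = next(k for k in range(7) if col(k) == z)   # bit the decoder flips
    source_errors += sum(1 for m in (m1, m2, m3) if m < 4)  # corrupted source bits
print(source_errors)  # 36 source-bit errors over 21 patterns and 4 source bits
```

Averaging $36$ over the $4$ source bits gives the leading coefficient $36/4 = 9$, i.e. $p_b \approx 9 f^2$.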
Exercise 1.7. Find some noise vectors that give the all-zero syndrome (i.e. noise vectors that leave all the parity checks unviolated). How many such noise vectors are there?
We use the formula $\mathbf{z} = \mathbf{H}\mathbf{n}$. If $\mathbf{z} = \mathbf{0}$, then $\mathbf{H}\mathbf{n} = \mathbf{0}$.

In other words, we are finding the number of elements in the null space of $\mathbf{H}$. Using the fundamental theorem of linear maps (rank-nullity):
$$\dim \operatorname{null}(\mathbf{H}) = \dim V - \operatorname{rank}(\mathbf{H})$$
Since $\operatorname{rank}(\mathbf{H}) = 3$ (the 3 row vectors of $\mathbf{H}$ are linearly independent) and $\dim V = 7$ (as each vector has 7 elements), $\dim \operatorname{null}(\mathbf{H}) = 7 - 3 = 4$.

The number of elements in the null space is thus $2^4 = 16$, because each of the 4 basis coefficients can be either 0 or 1. We remove the all-zero solution as it is not a valid noise vector, leaving us with 15 non-zero noise vectors that give the all-zero syndrome.
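Brute-force check over all $2^7 = 128$ vectors:

```python
from itertools import product

P = [[1,1,1,0], [0,1,1,1], [1,0,1,1]]
H = [p + e for p, e in zip(P, [[1,0,0], [0,1,0], [0,0,1]])]   # H = [P | I3]

null_space = [n for n in product([0, 1], repeat=7)
              if all(sum(h * b for h, b in zip(row, n)) % 2 == 0 for row in H)]
print(len(null_space))  # 16, i.e. 15 non-zero noise vectors
```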
Exercise 1.8. I asserted above that a block decoding error will result whenever two or more bits are flipped in a single block. Show that this is indeed so. [In principle, there may be error patterns that, after decoding, lead only to corruption of the parity bits, with all source bits correct.]
We need to show that when two or more bits are flipped in a single block, there will always be at least one source bit that is still corrupted (after decoding).
We need to consider the formula:
$$\mathbf{z} = \mathbf{H}\mathbf{n}$$
We saw earlier that we can interpret this matrix-vector multiplication as adding up the columns of $\mathbf{H}$ (call the $m$-th column $\mathbf{h}_m$) corresponding to the positions in $\mathbf{n}$ which have value 1. Hence:
$$\mathbf{z} = \sum_{m:\, n_m = 1} \mathbf{h}_m$$
The optimal bit to flip is the one in position $m'$ such that $\mathbf{h}_{m'} = \mathbf{z}$ (the 7 columns of $\mathbf{H}$ run over all 7 non-zero binary 3-vectors, so such a position exists whenever $\mathbf{z} \neq \mathbf{0}$).
In the case where two bits $m_1, m_2$ are flipped, we can show that $\mathbf{z} \neq \mathbf{0}$. This is because if $\mathbf{h}_{m_1} + \mathbf{h}_{m_2} = \mathbf{0}$, then $\mathbf{h}_{m_1} = \mathbf{h}_{m_2}$, which is a contradiction since the columns of $\mathbf{H}$ are distinct. The decoder therefore flips a third bit $m'$ with $\mathbf{h}_{m'} = \mathbf{h}_{m_1} + \mathbf{h}_{m_2}$, and $m' \neq m_1, m_2$ as shown earlier. This means that three bits will be flipped under optimal decoding.
It suffices to show that the three bits flipped cannot all be parity bits. Recall that we set up the generator to put the source bits in the first 4 positions and the parity bits in the last 3 positions. The corresponding parity-check matrix is:
$$\mathbf{H} = \begin{bmatrix} \mathbf{P} & \mathbf{I}_3 \end{bmatrix} = \begin{bmatrix} 1&1&1&0&1&0&0 \\ 0&1&1&1&0&1&0 \\ 1&0&1&1&0&0&1 \end{bmatrix}$$
In order for all 3 flipped bits to be parity bits, we would need the identity columns to satisfy $\mathbf{e}_1 + \mathbf{e}_2 = \mathbf{e}_3$ (or some other such combination), which is not possible as they are linearly independent. Hence for the two-bit case, at least one source bit is corrupted.
In the case where three bits $m_1, m_2, m_3$ are flipped, there are two sub-cases. If $\mathbf{z} = \mathbf{0}$, then $\mathbf{n}$ is a non-zero codeword; since the source bits are copied verbatim into the first 4 positions, a codeword whose first 4 bits are all zero is the zero codeword, so $\mathbf{n}$ must corrupt at least one source bit (and the decoder, seeing a zero syndrome, flips nothing). If $\mathbf{z} \neq \mathbf{0}$, the decoder flips a bit $m'$ with $\mathbf{h}_{m'} = \mathbf{h}_{m_1} + \mathbf{h}_{m_2} + \mathbf{h}_{m_3}$, and $m' \notin \{m_1, m_2, m_3\}$ (e.g. $m' = m_1$ would force $\mathbf{h}_{m_2} = \mathbf{h}_{m_3}$), so a total of 4 bits will be flipped. Since there are only 3 parity bits, at least one source bit must be flipped.
If at least four bits are flipped, the decoder flips at most one further bit, and it may flip a corrupted position back into its correct state. This still leaves us with at least 3 corrupted bits. Hence we just need to show (in the four-bit case) that the 3 remaining corrupted bits cannot all be parity bits.
Suppose $\mathbf{h}_{m_1} + \mathbf{h}_{m_2} + \mathbf{h}_{m_3} + \mathbf{h}_{m_4} = \mathbf{h}_{m_1}$ (i.e. position $m_1$ was corrupted but flipped back by the decoder), so that $\mathbf{h}_{m_2} + \mathbf{h}_{m_3} + \mathbf{h}_{m_4} = \mathbf{0}$. Clearly $m_2, m_3, m_4$ cannot all correspond to parity bits, because the identity columns are linearly independent and cannot add up to $\mathbf{0}$. So we are done.
Exercise 1.10. A (7, 4) Hamming code can correct any one error; might there be a (14, 8) code that can correct any two errors? Does the answer to this question depend on whether the code is linear or non-linear?
In the (14, 8) code, we have 8 source bits and 6 parity check bits.
First try to understand why (7, 4) works.
- The syndrome $\mathbf{z}$ has 3 bits, because there are 3 parity checks
- So we have $2^3 - 1 = 7$ syndromes, after subtracting the all-zero syndrome
- We can correct any one error because the 7 possible 1-bit errors map exactly onto the 7 non-zero syndromes
With (14, 8):
- The syndrome $\mathbf{z}$ has 6 bits, because there are 6 parity checks
- So $2^6 = 64$ possible syndromes, of which $63$ are non-zero
- The 14 possible 1-bit errors map to $14$ distinct non-zero syndromes, so we have $63 - 14 = 49$ syndromes left to use
- 2-bit errors map to potentially $\binom{14}{2} = 91$ distinct syndromes
- Since $91 > 49$, there are insufficient syndromes to correct all possible 2-bit errors
So no, a (14, 8) linear code cannot possibly correct any two errors. The answer does not depend on linearity: by the same sphere-packing counting, any 2-error-correcting code of length 14 would need disjoint radius-2 balls of $1 + 14 + \binom{14}{2} = 106$ vectors around each of its $2^8$ codewords, but $2^8 \times 106 = 27136 > 2^{14} = 16384$, so no such code, linear or non-linear, exists.
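The counting argument, which applies to any code rather than just linear ones, in numbers:

```python
from math import comb

ball = 1 + comb(14, 1) + comb(14, 2)   # vectors within distance 2 of a codeword
needed = 2**8 * ball                   # disjoint balls, one per codeword
print(ball, needed, 2**14)             # 106, 27136, 16384: the balls don't fit
```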