Chapter 1: Introduction to Information Theory
The first problem we tackle is how to communicate perfectly over imperfect communication channels.
The running example is this. We have some message which is a sequence of bits. We want to write this data to a noisy hard disk that transmits each bit correctly with probability $1-f$ and incorrectly with probability $f$. This is called the binary symmetric channel (symmetric because the probability of flipping from $0$ to $1$ is the same as from $1$ to $0$).
Suppose $f = 0.1$, i.e. one bit in ten is flipped. This is clearly unacceptable. We want a useful disk drive to flip no bits in, say, 10 years of writing a gigabyte a day, which means we want an error probability around $10^{-15}$ or smaller. We solve this problem by introducing communication systems to detect and correct the errors.
The flow is such:
- Start with message $\mathbf{s}$
- $\mathbf{s}$ gets encoded into $\mathbf{t}$, which is some intermediate representation we design
- Some noise $\mathbf{n}$ gets added to $\mathbf{t}$, which represents noise from the imperfect hard disk
- This results in $\mathbf{r} = \mathbf{t} + \mathbf{n}$, which is the received vector
- The addition is in modulo-2 arithmetic, i.e. $1 + 1 = 0$ and $1 + 0 = 1$
- Finally we decode using some algorithm to get $\hat{\mathbf{s}}$ from $\mathbf{r}$
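The flow above can be sketched in a few lines of Python. This is a toy sketch (the `bsc` function name is my own): the encoder is left out, and we just push a bit vector through the binary symmetric channel with mod-2 addition.

```python
import random

def bsc(t, f, rng):
    """Send bit vector t through a binary symmetric channel with flip probability f."""
    n = [1 if rng.random() < f else 0 for _ in t]   # noise vector n
    return [(ti + ni) % 2 for ti, ni in zip(t, n)]  # r = t + n (mod 2)

rng = random.Random(0)
t = [1, 0, 0, 0, 1, 0, 1]
r = bsc(t, 0.1, rng)   # received vector: each bit flipped with probability 0.1
```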
The Repetition Code
A simple communication system is the repetition code $R_3$: repeat each bit $3$ times. After noise is added, we decode each bit by taking the majority vote of its three copies in the received vector. The idea is that when $f$ is small, the probability of multiple copies failing independently becomes small. So the majority vote will correct some errors.
Theorem. The majority vote decoding is optimal (i.e. has the lowest probability of error).
Proof. Since each bit is encoded and corrupted independently, we just need to consider the optimal decision for one bit. Suppose we received bits $r_1 r_2 r_3$. The objective is to recover the original bit $s$, which could have been $0$ or $1$. Thus, we want:
$$\hat{s} = \operatorname*{arg\,max}_{s} P(s \mid r_1 r_2 r_3) = \operatorname*{arg\,max}_{s} \frac{P(r_1 r_2 r_3 \mid s)\, P(s)}{P(r_1 r_2 r_3)}$$
Note that:
- In the first line, $s$ is the random variable representing the unknown true bit
- In the $\operatorname{arg\,max}$, $s$ is a candidate value that we are choosing (some notation overloading, as $s$ is also used to represent the true bit above)
- $P(r_1 r_2 r_3)$ is treated as a constant, and (assuming the uniform prior $P(s) = 1/2$) neither it nor $P(s)$ affects the choice of $s$
Now, if we denote by $d$ the number of bit flips between $r_1 r_2 r_3$ and $sss$, we have:
$$P(r_1 r_2 r_3 \mid s) = f^{d} (1-f)^{3-d}$$
Notice that we are computing the probability for this specific received sequence $r_1 r_2 r_3$, hence we do not need to multiply by the number of combinations.
With this formulation, we see that $P(r_1 r_2 r_3 \mid s)$ is maximized when the choice of $s$ minimizes the number of bit flips $d$ (provided $f < 1/2$). This is because a scenario with few bit flips is more likely than a scenario with many bit flips. Hence, the majority vote is optimal as it minimizes the number of bit flips.
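A minimal sketch of the $R_3$ encoder and majority-vote decoder (function names are my own):

```python
def encode_r3(s):
    """Repeat each source bit 3 times."""
    return [b for bit in s for b in (bit, bit, bit)]

def decode_r3(r):
    """Majority vote over each block of 3 received bits."""
    return [1 if sum(r[i:i + 3]) >= 2 else 0 for i in range(0, len(r), 3)]
```

For example, `decode_r3([1, 0, 1, 0, 0, 0])` recovers `[1, 0]` even though one copy of the first bit was flipped.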
Exercise 1.2. Show that the error probability is reduced by the use of $R_3$ by computing the error probability of this code for a binary symmetric channel with noise level $f = 0.1$.
Solution. The naive error probability (no coding) is simply:
$$p_b = f = 0.1$$
With $R_3$, a bit is decoded wrongly only if two or three of its copies are flipped, so the error probability is:
$$p_b = 3f^2(1-f) + f^3$$
Now we want to show that when $f = 0.1$, $R_3$ has a better error probability:
$$3(0.1)^2(0.9) + (0.1)^3 = 0.027 + 0.001 = 0.028 < 0.1$$
Hence we showed that $R_3$ has a better error probability.
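A quick numeric check of the figures above:

```python
def p_r3(f):
    # A bit is decoded wrongly iff 2 or 3 of its copies are flipped.
    return 3 * f**2 * (1 - f) + f**3

print(p_r3(0.1))  # ~0.028, better than the uncoded 0.1
```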
Even though $R_3$ has a better error probability, the problem is that our rate of information transfer has fallen by a factor of $3$. To improve our error probability, we can continue to increase repetition at the cost of decreased rate.
Exercise 1.3. Find the probability of error of $R_N$, the repetition code with $N$ repetitions, for odd $N$:
$$p_b = \sum_{n=(N+1)/2}^{N} \binom{N}{n} f^n (1-f)^{N-n}$$
Assuming $f = 0.1$, which term in this sum is the biggest? How much bigger is it than the second-largest term?
The largest term is the one with the fewest flips, $n = (N+1)/2$:
$$\binom{N}{(N+1)/2} f^{(N+1)/2} (1-f)^{(N-1)/2}$$
The second-largest term is around $\frac{f}{1-f} = \frac{1}{9}$ times smaller (the ratio of consecutive terms is $\frac{N-n}{n+1}\cdot\frac{f}{1-f} \approx \frac{f}{1-f}$ near $n = (N+1)/2$), so the largest term dominates.
Use Stirling's approximation to approximate the largest term and find the probability of error.

Using Stirling's approximation (and approximating $\binom{N}{(N+1)/2} \approx \binom{N}{N/2}$):
$$\binom{N}{N/2} \approx 2^N \sqrt{\frac{2}{\pi N}}$$
Hence we have the probability of error as:
$$p_b \approx 2^N \sqrt{\frac{2}{\pi N}}\, f^{(N+1)/2} (1-f)^{(N-1)/2} = \sqrt{\frac{2}{\pi N}} \sqrt{\frac{f}{1-f}}\, \big[4f(1-f)\big]^{N/2}$$
To get error probability less than $10^{-15}$ at $f = 0.1$, we write out the inequality and manipulate to find the lower bound for $N$: setting the leading term equal to $10^{-15}$ gives $N \approx 61$, i.e. the code $R_{61}$.
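We can check this numerically. The sketch below uses the exact largest term (via `math.comb`, no Stirling approximation) and finds the smallest odd $N$ whose leading error term drops to $10^{-15}$; summing the whole tail of the binomial instead of just the leading term would push $N$ up slightly.

```python
from math import comb

def leading_term(N, f):
    """Largest term of the error sum: the (N+1)/2-flips term."""
    k = (N + 1) // 2
    return comb(N, k) * f**k * (1 - f)**(N - k)

def smallest_odd_N(f, target):
    # The leading term shrinks monotonically in N for f = 0.1,
    # so a simple scan over odd N terminates.
    N = 3
    while leading_term(N, f) > target:
        N += 2
    return N

print(smallest_odd_N(0.1, 1e-15))  # 61
```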
Hamming (7, 4) code
Thus far, we have been trying to encode each bit independently. What if we encode blocks of bits together? Can we get more efficient codes?
A block code converts a sequence of source bits $\mathbf{s}$, of length $K$, into a transmitted sequence $\mathbf{t}$ of length $N$ bits.
- We add some redundancy by making $N > K$. The extra bits (called parity-check bits) are used to store some redundant information about the original bits.
- This is usually implemented as some linear function of the original bits
- In the (7,4) Hamming code, we transmit $N = 7$ bits for every $K = 4$ source bits
The encoding can be shown with an example. Suppose $\mathbf{s} = s_1 s_2 s_3 s_4$.
- The first 4 bits of the transmitted $\mathbf{t}$ are $t_1 t_2 t_3 t_4 = s_1 s_2 s_3 s_4$, i.e. just copy the source sequence
- The next 3 bits $t_5 t_6 t_7$ are parity-check bits
- $t_5$ is set such that $s_1 + s_2 + s_3 + t_5$ is even
- $t_6$ is set such that $s_2 + s_3 + s_4 + t_6$ is even
- $t_7$ is set such that $s_1 + s_3 + s_4 + t_7$ is even
- For the example of $\mathbf{s} = 1000$: $\mathbf{t} = 1000101$
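The three parity rules translate directly into code (a sketch; `hamming74_encode` is my own name):

```python
def hamming74_encode(s):
    """Encode 4 source bits into 7 transmitted bits for the (7,4) Hamming code."""
    s1, s2, s3, s4 = s
    t5 = (s1 + s2 + s3) % 2   # makes s1 + s2 + s3 + t5 even
    t6 = (s2 + s3 + s4) % 2   # makes s2 + s3 + s4 + t6 even
    t7 = (s1 + s3 + s4) % 2   # makes s1 + s3 + s4 + t7 even
    return [s1, s2, s3, s4, t5, t6, t7]

print(hamming74_encode([1, 0, 0, 0]))  # [1, 0, 0, 0, 1, 0, 1]
```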
We can see that the Hamming code is a linear code, since the encoding can be written compactly as a matrix-vector multiplication (using modulo-2 arithmetic). Specifically, the transmitted code-vector $\mathbf{t}$ may be obtained from the source-vector $\mathbf{s}$ using a matrix multiplication:
$$\mathbf{t} = \mathbf{G}^{\mathsf{T}} \mathbf{s}$$
where $\mathbf{G}$ is the generator matrix of the code:
$$\mathbf{G}^{\mathsf{T}} = \begin{bmatrix} 1&0&0&0 \\ 0&1&0&0 \\ 0&0&1&0 \\ 0&0&0&1 \\ 1&1&1&0 \\ 0&1&1&1 \\ 1&0&1&1 \end{bmatrix}$$
The columns of the generator matrix may be viewed as defining four basis vectors in a seven dimensional binary space. The sixteen codewords are obtained by taking all possible linear combinations of these vectors (in modulo-2 arithmetic). This is a useful perspective:
- With 7 dimensions, we have $2^7 = 128$ possible vectors, out of which only $2^4 = 16$ are valid codewords. This means that if we receive an invalid vector, we know that it needs to be corrected.
- The 16 valid vectors are well spread out in the space (every pair of codewords differs in at least 3 positions), enabling correction.
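Both bullet points can be verified by brute force, enumerating all $2^4$ source vectors:

```python
from itertools import product

# Rows of G^T: the 4x4 identity stacked on the parity block P.
G_T = [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1],
       [1,1,1,0], [0,1,1,1], [1,0,1,1]]

def encode(s):
    return [sum(g * b for g, b in zip(row, s)) % 2 for row in G_T]

codewords = [encode(s) for s in product([0, 1], repeat=4)]
min_dist = min(sum(a != b for a, b in zip(c1, c2))
               for i, c1 in enumerate(codewords) for c2 in codewords[i + 1:])
print(len(codewords), min_dist)  # 16 codewords out of 128 vectors, min distance 3
```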
Decoding Hamming Code
Decoding the Hamming code follows the same logic as before. Assuming that the channel is a binary symmetric channel and all source vectors are equiprobable, we want to choose a source vector $\hat{\mathbf{s}}$ whose encoding $\hat{\mathbf{t}}$ differs from the received $\mathbf{r}$ in as few bits as possible. Recall that the theorem we proved earlier shows that we want to maximize $P(\mathbf{r} \mid \mathbf{s})$, which is achieved by minimizing the number of bit flips.
Naively, we can enumerate all 16 valid Hamming codewords (each of length 7), compare $\mathbf{r}$ against each, and find the closest. This is inefficient as our codes get longer.
The better way is syndrome decoding. A syndrome is defined as the pattern of violations of the parity bits. In other words, it is the vector of "unhappy" parity bits.
Syndrome example. Suppose we transmit $\mathbf{t} = 1000101$ and noise flips the second bit, giving us $\mathbf{r} = 1100101$. The syndrome would be $\mathbf{z} = 110$, because the first two parity checks are unhappy and the third is happy.
The syndrome decoding task is thus to find the unique bit that lies inside all the unhappy parity checks (the "circles" in the usual Venn-diagram picture of the Hamming code) and outside the happy ones. If we can find such a bit, flipping it would account for the observed syndrome.
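The example syndrome can be computed directly from the three parity checks (a sketch; the `syndrome` name is my own):

```python
def syndrome(r):
    """Each syndrome bit is 1 iff the corresponding parity check is violated."""
    z1 = (r[0] + r[1] + r[2] + r[4]) % 2   # check on t1, t2, t3, t5
    z2 = (r[1] + r[2] + r[3] + r[5]) % 2   # check on t2, t3, t4, t6
    z3 = (r[0] + r[2] + r[3] + r[6]) % 2   # check on t1, t3, t4, t7
    return [z1, z2, z3]

print(syndrome([1, 1, 0, 0, 1, 0, 1]))  # [1, 1, 0] for r = 1100101
```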
Matrix Version of Hamming Decoding
We can describe the decoding operation in terms of matrices. Let us define:
$$\mathbf{P} = \begin{bmatrix} 1&1&1&0 \\ 0&1&1&1 \\ 1&0&1&1 \end{bmatrix}$$
Let $\mathbf{H} = \begin{bmatrix} -\mathbf{P} & \mathbf{I}_3 \end{bmatrix}$. Then the syndrome vector is:
$$\mathbf{z} = \mathbf{H}\mathbf{r}$$
The $-\mathbf{P}$ part computes the expected parity bits from the received source bits and subtracts them from the actual received parity bits (picked out by the identity part). But because in modulo-2 arithmetic $-1 \equiv 1$, subtraction is the same as addition, so:
$$\mathbf{H} = \begin{bmatrix} \mathbf{P} & \mathbf{I}_3 \end{bmatrix}$$
It should be clear that for valid transmitted vectors of the form $\mathbf{t} = \mathbf{G}^{\mathsf{T}}\mathbf{s}$, $\mathbf{z} = \mathbf{H}\mathbf{t}$ will be the zero vector, i.e. the zero syndrome.
Exercise 1.4. Prove that this is so.
We want to show that $\mathbf{z} = \mathbf{H}\mathbf{t}$ is the zero vector for all valid $\mathbf{t}$. Consider $\mathbf{t} = \mathbf{G}^{\mathsf{T}}\mathbf{s}$. We have:
$$\mathbf{z} = \mathbf{H}\mathbf{G}^{\mathsf{T}}\mathbf{s}$$
For compatible matrix blocks (i.e. $\mathbf{P}\mathbf{I}_4$ is defined etc.), we have:
$$\mathbf{H}\mathbf{G}^{\mathsf{T}} = \begin{bmatrix} \mathbf{P} & \mathbf{I}_3 \end{bmatrix} \begin{bmatrix} \mathbf{I}_4 \\ \mathbf{P} \end{bmatrix} = \mathbf{P}\mathbf{I}_4 + \mathbf{I}_3\mathbf{P} = \mathbf{P} + \mathbf{P} = \mathbf{0} \pmod 2$$
Hence $\mathbf{z}$ will always be the zero vector.
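A one-off numeric check that $\mathbf{H}\mathbf{G}^{\mathsf{T}} = \mathbf{0} \pmod 2$:

```python
P = [[1,1,1,0], [0,1,1,1], [1,0,1,1]]
H = [p + e for p, e in zip(P, [[1,0,0], [0,1,0], [0,0,1]])]   # H = [P | I3], 3x7
G_T = [[1,0,0,0], [0,1,0,0], [0,0,1,0], [0,0,0,1]] + P        # G^T = [I4 on P], 7x4

HG = [[sum(H[i][k] * G_T[k][j] for k in range(7)) % 2 for j in range(4)]
      for i in range(3)]
print(HG)  # [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
```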
We close this section on Hamming decoding by noting that the received vector $\mathbf{r}$ is obtained by:
$$\mathbf{r} = \mathbf{G}^{\mathsf{T}}\mathbf{s} + \mathbf{n}$$
And the syndrome vector is:
$$\mathbf{z} = \mathbf{H}\mathbf{r} = \mathbf{H}(\mathbf{G}^{\mathsf{T}}\mathbf{s} + \mathbf{n}) = \mathbf{H}\mathbf{G}^{\mathsf{T}}\mathbf{s} + \mathbf{H}\mathbf{n} = \mathbf{H}\mathbf{n}$$
Hence, the syndrome decoding problem is to find the most probable noise vector $\hat{\mathbf{n}}$ that satisfies $\mathbf{H}\hat{\mathbf{n}} = \mathbf{z}$. Such a decoding algorithm is called a maximum likelihood decoder.
Exercise 1.5. Refer to the (7,4) Hamming code. Decode these received strings:
- $\mathbf{r} = 1101011$: we have $\mathbf{z} = 011$, so flip $r_4$, giving $\hat{\mathbf{s}} = 1100$
- $\mathbf{r} = 0110110$: we have $\mathbf{z} = 111$, so flip $r_3$, giving $\hat{\mathbf{s}} = 0100$
- $\mathbf{r} = 0100111$: we have $\mathbf{z} = 001$, so flip $r_7$, giving $\hat{\mathbf{s}} = 0100$
- $\mathbf{r} = 1111111$: we have $\mathbf{z} = 000$, so flip nothing, giving $\hat{\mathbf{s}} = 1111$
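The full syndrome decoder fits in a few lines and reproduces the answers above (a sketch; the `decode` name is my own):

```python
P = [[1,1,1,0], [0,1,1,1], [1,0,1,1]]
H = [p + e for p, e in zip(P, [[1,0,0], [0,1,0], [0,0,1]])]   # H = [P | I3]

def decode(r):
    """Syndrome-decode a 7-bit received vector; return the 4 source bits."""
    r = list(r)
    z = [sum(H[i][k] * r[k] for k in range(7)) % 2 for i in range(3)]
    if any(z):
        # Flip the unique bit whose column of H matches the syndrome.
        m = next(k for k in range(7) if [H[i][k] for i in range(3)] == z)
        r[m] ^= 1
    return r[:4]

for bits in ("1101011", "0110110", "0100111", "1111111"):
    print(decode([int(b) for b in bits]))
# prints [1, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [1, 1, 1, 1]
```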
Exercise 1.6. Calculate the probability of block error of the (7,4) Hamming code as a function of the noise level $f$ and show that to leading order it goes as $21 f^2$.
The block error is the probability that one or more of the decoded bits in one block fail to match the corresponding source bits.
Assuming that whenever 2 or more bits are flipped in a block of 7 bits we get a block decoding error, we can derive this from a binomial distribution:
$$p_B = \sum_{m=2}^{7} \binom{7}{m} f^m (1-f)^{7-m}$$
Recall Taylor's expansion:
$$(1-f)^n = 1 - nf + \cdots$$
So for $p_B$, the $m = 2$ term contributes the lowest power of $f$:
$$p_B = \binom{7}{2} f^2 (1-f)^5 + O(f^3) = 21 f^2 + O(f^3)$$
The leading term is $21 f^2$.
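Numerically, the exact block-error sum indeed approaches $21 f^2$ as $f \to 0$:

```python
from math import comb

def p_block(f):
    """Exact block-error probability: 2 or more of the 7 bits flipped."""
    return sum(comb(7, m) * f**m * (1 - f)**(7 - m) for m in range(2, 8))

for f in (1e-2, 1e-3, 1e-4):
    print(f, p_block(f) / (21 * f**2))  # ratio tends to 1
```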
Show that to leading order the probability of bit error $p_b$ goes as $9 f^2$.
The naive way to solve this is to enumerate, for each source bit, the possible error scenarios and compute their probabilities. But that is very tedious.

The key observation is to consider the subset of block-error cases where exactly two bits are corrupted. This is the leading case because the probability of 3 or more bits being corrupted is of higher order in $f$, hence increasingly unlikely.
Now when two bits are corrupted, exactly 3 bits will be in error if we follow optimal decoding.
- Suppose bits $m$ and $m'$ are corrupted. Since the syndrome follows $\mathbf{z} = \mathbf{H}\mathbf{n}$, and $\mathbf{n}$ has ones exactly in positions $m$ and $m'$, this corresponds to adding the $m$-th and $m'$-th columns of $\mathbf{H}$ together: $\mathbf{z} = \mathbf{h}_m + \mathbf{h}_{m'}$
- The decoder flips the single bit $m''$ that explains the syndrome $\mathbf{z}$, since this has the highest likelihood. In other words, the bit flipped is the one whose column satisfies $\mathbf{h}_{m''} = \mathbf{h}_m + \mathbf{h}_{m'}$
- We can show that $m'' \neq m$ and $m'' \neq m'$ (e.g. $m'' = m$ would force $\mathbf{h}_{m'} = \mathbf{0}$, but no column of $\mathbf{H}$ is zero), which tells us that exactly 3 bits will be in error
It is furthermore possible to show that these 3 bits in error are uniformly distributed amongst the 7 possible positions in the code (not shown here). Since the probability of a block error goes as $21 f^2$, and each block error to leading order leaves 3 of the 7 bits in error, the probability of any given bit being in error is $21 f^2 \times 3/7 = 9 f^2$.
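The count behind the $9f^2$ figure can be checked by enumerating all $\binom{7}{2} = 21$ two-bit noise patterns and following the decoder:

```python
from itertools import combinations

P = [[1,1,1,0], [0,1,1,1], [1,0,1,1]]
H = [p + e for p, e in zip(P, [[1,0,0], [0,1,0], [0,0,1]])]   # H = [P | I3]
col = lambda k: [H[i][k] for i in range(3)]

source_errors = 0
for m1, m2 in combinations(range(7), 2):            # all 21 two-bit patterns
    z = [(a + b) % 2 for a, b in zip(col(m1), col(m2))]
    m3 = next(k for k in range(7) if col(k) == z)   # bit the decoder flips
    source_errors += sum(1 for m in (m1, m2, m3) if m < 4)  # corrupted source bits
print(source_errors)  # 36 source-bit errors over 21 patterns and 4 source bits
```

Averaging $36$ over the $4$ source bits gives the leading coefficient $36/4 = 9$, i.e. $p_b \approx 9 f^2$.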
Exercise 1.7. Find some noise vectors that give the all-zero syndrome (i.e. noise vectors that leave all the parity checks unviolated). How many such noise vectors are there?
We use the formula $\mathbf{z} = \mathbf{H}\mathbf{n}$. If $\mathbf{z} = \mathbf{0}$, then $\mathbf{H}\mathbf{n} = \mathbf{0}$.

In other words, we are finding the number of elements in the null space of $\mathbf{H}$. Using the fundamental theorem of linear maps (rank-nullity):
$$\dim \operatorname{null}(\mathbf{H}) = \dim V - \operatorname{rank}(\mathbf{H})$$
Since $\operatorname{rank}(\mathbf{H}) = 3$ (the 3 row vectors of $\mathbf{H}$ are linearly independent) and $\dim V = 7$ (as each vector has 7 elements), $\dim \operatorname{null}(\mathbf{H}) = 7 - 3 = 4$.

The number of elements in the null space is thus $2^4 = 16$, because each of the 4 basis coefficients can be either 0 or 1. We remove the all-zero solution as it is not a valid noise vector, leaving us with 15 non-zero noise vectors that give the all-zero syndrome.
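Brute-force check over all $2^7 = 128$ vectors:

```python
from itertools import product

P = [[1,1,1,0], [0,1,1,1], [1,0,1,1]]
H = [p + e for p, e in zip(P, [[1,0,0], [0,1,0], [0,0,1]])]   # H = [P | I3]

null_space = [n for n in product([0, 1], repeat=7)
              if all(sum(h * b for h, b in zip(row, n)) % 2 == 0 for row in H)]
print(len(null_space))  # 16, i.e. 15 non-zero noise vectors
```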
Exercise 1.8. I asserted above that a block decoding error will result whenever two or more bits are flipped in a single block. Show that this is indeed so. [In principle, there may be error patterns that, after decoding, lead only to corruption of the parity bits, with all source bits correct.]
We need to show that when two or more bits are flipped in a single block, there will always be at least one source bit that is still corrupted (after decoding).
We need to consider the formula:
$$\mathbf{z} = \mathbf{H}\mathbf{n}$$
We saw earlier that we can interpret this matrix-vector multiplication as adding up the columns of $\mathbf{H}$ (call the $m$-th column $\mathbf{h}_m$) corresponding to the positions in $\mathbf{n}$ which have value 1. Hence:
$$\mathbf{z} = \sum_{m:\, n_m = 1} \mathbf{h}_m$$
The optimal bit to flip is the one in position $m'$ such that $\mathbf{h}_{m'} = \mathbf{z}$ (the 7 columns of $\mathbf{H}$ run over all 7 non-zero binary 3-vectors, so such a position exists whenever $\mathbf{z} \neq \mathbf{0}$).
In the case where two bits $m_1, m_2$ are flipped, we can show that $\mathbf{z} \neq \mathbf{0}$. This is because if $\mathbf{h}_{m_1} + \mathbf{h}_{m_2} = \mathbf{0}$, then $\mathbf{h}_{m_1} = \mathbf{h}_{m_2}$, which is a contradiction since the columns of $\mathbf{H}$ are distinct. The decoder therefore flips a third bit $m'$ with $\mathbf{h}_{m'} = \mathbf{h}_{m_1} + \mathbf{h}_{m_2}$, and $m' \neq m_1, m_2$ as shown earlier. This means that three bits will be flipped under optimal decoding.
It suffices to show that the three bits flipped cannot all be parity bits. Recall that we set up the generator to put the source bits in the first 4 positions and the parity bits in the last 3 positions. The corresponding parity-check matrix is:
$$\mathbf{H} = \begin{bmatrix} \mathbf{P} & \mathbf{I}_3 \end{bmatrix} = \begin{bmatrix} 1&1&1&0&1&0&0 \\ 0&1&1&1&0&1&0 \\ 1&0&1&1&0&0&1 \end{bmatrix}$$
In order for all 3 flipped bits to be parity bits, we would need the identity columns to satisfy $\mathbf{e}_1 + \mathbf{e}_2 = \mathbf{e}_3$ (or some other such combination), which is not possible as they are linearly independent. Hence for the two-bit case, at least one source bit is corrupted.
In the case where three bits $m_1, m_2, m_3$ are flipped, there are two sub-cases. If $\mathbf{z} = \mathbf{0}$, then $\mathbf{n}$ is a non-zero codeword; since the source bits are copied verbatim into the first 4 positions, a codeword whose first 4 bits are all zero is the zero codeword, so $\mathbf{n}$ must corrupt at least one source bit (and the decoder, seeing a zero syndrome, flips nothing). If $\mathbf{z} \neq \mathbf{0}$, the decoder flips a bit $m'$ with $\mathbf{h}_{m'} = \mathbf{h}_{m_1} + \mathbf{h}_{m_2} + \mathbf{h}_{m_3}$, and $m' \notin \{m_1, m_2, m_3\}$ (e.g. $m' = m_1$ would force $\mathbf{h}_{m_2} = \mathbf{h}_{m_3}$), so a total of 4 bits will be flipped. Since there are only 3 parity bits, at least one source bit must be flipped.
If at least four bits are flipped, the decoder flips at most one further bit, and it may flip a corrupted position back into its correct state. This still leaves us with at least 3 corrupted bits. Hence we just need to show (in the four-bit case) that the 3 remaining corrupted bits cannot all be parity bits.
Suppose $\mathbf{h}_{m_1} + \mathbf{h}_{m_2} + \mathbf{h}_{m_3} + \mathbf{h}_{m_4} = \mathbf{h}_{m_1}$ (i.e. position $m_1$ was corrupted but flipped back by the decoder), so that $\mathbf{h}_{m_2} + \mathbf{h}_{m_3} + \mathbf{h}_{m_4} = \mathbf{0}$. Clearly $m_2, m_3, m_4$ cannot all correspond to parity bits, because the identity columns are linearly independent and cannot add up to $\mathbf{0}$. So we are done.
Exercise 1.10. A (7, 4) Hamming code can correct any one error; might there be a (14, 8) code that can correct any two errors? Does the answer to this question depend on whether the code is linear or non-linear?
In the (14, 8) code, we have 8 source bits and 6 parity check bits.
First try to understand why (7, 4) works.
- The syndrome $\mathbf{z}$ has 3 bits, because there are 3 parity checks
- So we have $2^3 - 1 = 7$ syndromes, after subtracting the all-zero syndrome
- We can correct any one error because the 7 possible 1-bit errors map exactly onto the 7 non-zero syndromes
With (14, 8):
- The syndrome $\mathbf{z}$ has 6 bits, because there are 6 parity checks
- So $2^6 = 64$ possible syndromes, of which $63$ are non-zero
- The 14 possible 1-bit errors map to $14$ distinct non-zero syndromes, so we have $63 - 14 = 49$ syndromes left to use
- 2-bit errors map to potentially $\binom{14}{2} = 91$ distinct syndromes
- Since $91 > 49$, there are insufficient syndromes to correct all possible 2-bit errors
So no, a (14, 8) linear code cannot possibly correct any two errors. The answer does not depend on linearity: by the same sphere-packing counting, any 2-error-correcting code of length 14 would need disjoint radius-2 balls of $1 + 14 + \binom{14}{2} = 106$ vectors around each of its $2^8$ codewords, but $2^8 \times 106 = 27136 > 2^{14} = 16384$, so no such code, linear or non-linear, exists.
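The counting argument, which applies to any code rather than just linear ones, in numbers:

```python
from math import comb

ball = 1 + comb(14, 1) + comb(14, 2)   # vectors within distance 2 of a codeword
needed = 2**8 * ball                   # disjoint balls, one per codeword
print(ball, needed, 2**14)             # 106, 27136, 16384: the balls don't fit
```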