Probability, Entropy, and Inference
This chapter will spend some time on notation.
Probabilities and ensembles
An ensemble $X$ is a triple $(x, \mathcal{A}_X, \mathcal{P}_X)$, where the outcome $x$ is the value of a random variable, which takes on one of a set of possible values $\mathcal{A}_X = \{a_1, a_2, \ldots, a_I\}$. $\mathcal{P}_X = \{p_1, p_2, \ldots, p_I\}$ represents the probability distribution, such that $P(x = a_i) = p_i$, $p_i \geq 0$, and $\sum_{a_i \in \mathcal{A}_X} P(x = a_i) = 1$.
The name $\mathcal{A}$ is a mnemonic for "alphabet". An example of an ensemble is selecting a random letter from an English document.
Note: MacKay's definition of an ensemble seems redundant, as we usually just use the term "random variable" to denote the same idea. But he insists on using "ensemble" to refer to the entire system or environment, with an emphasis on the total space of the alphabet. In strict mathematical form, by contrast, a random variable refers to a function that maps each outcome to a real number.
Probability of a subset. If $T$ is a subset of $\mathcal{A}_X$, then: $P(T) = P(x \in T) = \sum_{a_i \in T} P(x = a_i)$
Joint ensemble $XY$ is an ensemble in which each outcome is an ordered pair $x, y$ with $x \in \mathcal{A}_X = \{a_1, \ldots, a_I\}$ and $y \in \mathcal{A}_Y = \{b_1, \ldots, b_J\}$.
We call $P(x, y)$ the joint probability of $x$ and $y$.
Marginal probability. We can obtain the marginal probability $P(x)$ from the joint probability $P(x, y)$ by summation: $P(x = a_i) = \sum_{y \in \mathcal{A}_Y} P(x = a_i, y)$
Or more succinctly: $P(x) = \sum_y P(x, y)$
Conditional Probability. $P(x = a_i \mid y = b_j) = \frac{P(x = a_i, y = b_j)}{P(y = b_j)}$, if $P(y = b_j) \neq 0$.
We often do not write down the joint probability directly, but rather define an ensemble in terms of a collection of conditional probabilities. Hence the following rules to manipulate conditional probabilities are useful.
Product rule (or Chain rule). Based on the definition of conditional probability, we have: $P(x, y \mid \mathcal{H}) = P(x \mid y, \mathcal{H})\,P(y \mid \mathcal{H}) = P(y \mid x, \mathcal{H})\,P(x \mid \mathcal{H})$
Where $\mathcal{H}$ is a way of denoting the assumptions upon which the probabilities are based.
Sum Rule. We are just writing the marginal probability in terms of conditional probabilities: $P(x \mid \mathcal{H}) = \sum_y P(x, y \mid \mathcal{H}) = \sum_y P(x \mid y, \mathcal{H})\,P(y \mid \mathcal{H})$
Bayes' theorem, obtained from the product rule: $P(y \mid x, \mathcal{H}) = \frac{P(x \mid y, \mathcal{H})\,P(y \mid \mathcal{H})}{\sum_{y'} P(x \mid y', \mathcal{H})\,P(y' \mid \mathcal{H})}$
Independence. Two random variables $X$ and $Y$ are independent if and only if: $P(x, y) = P(x)\,P(y)$
We often define an ensemble in terms of a collection of conditional probabilities. This example illustrates the idea.
Example 2.3. Jo has a test for a disease. The variable $a$ denotes whether Jo has the disease and $b$ denotes the test result. In $95\%$ of cases of people who have the disease, a positive test results; in $95\%$ of cases of people who do not have the disease, a negative test results. $1\%$ of people of Jo's age and background have the disease. Now Jo takes the test and it is positive. What is the probability that Jo has the disease?
Based on the information: $P(b = 1 \mid a = 1) = 0.95$, $P(b = 0 \mid a = 0) = 0.95$, $P(a = 1) = 0.01$.
We want: $P(a = 1 \mid b = 1) = \frac{P(b = 1 \mid a = 1)\,P(a = 1)}{P(b = 1 \mid a = 1)\,P(a = 1) + P(b = 1 \mid a = 0)\,P(a = 0)} = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} \approx 0.16$
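As a sanity check, the arithmetic can be reproduced in a few lines (a sketch; the $95\%$/$95\%$/$1\%$ figures are the ones assumed above):

```python
# Posterior P(a=1 | b=1) for the disease-test example, via Bayes' theorem.
p_pos_given_disease = 0.95   # P(b=1 | a=1)
p_neg_given_healthy = 0.95   # P(b=0 | a=0)
p_disease = 0.01             # prior P(a=1)

evidence = (p_pos_given_disease * p_disease
            + (1 - p_neg_given_healthy) * (1 - p_disease))  # P(b=1)
posterior = p_pos_given_disease * p_disease / evidence
print(round(posterior, 3))  # 0.161: a positive test still leaves only ~16%
```

The small prior dominates: most positive tests come from the large healthy population.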
The meaning of probability
There are two philosophical definitions of probability.
The frequentist view is easier to grasp and clearly defined in the case of random variables. It is simply the frequency of outcomes in random experiments. This is well defined where the concept of repeated trials is valid. However, it does not translate neatly into real-world scenarios: what does it mean to speak of the probability that x murdered y, given the amount of evidence available?
The Bayesian view tries to account for such cases by using the concept of degrees of belief. Degrees of belief can be mapped to probabilities if they satisfy simple consistency rules known as the Cox axioms.
Forward probabilities and inverse probabilities
There are two categories of probability calculations: forward and inverse.
Example 2.4. This is an example of a forward probability problem. An urn contains $K$ balls, of which $B$ are black and the rest ($W = K - B$) are white. Fred draws a ball at random from the urn and replaces it, $N$ times.
a. What is the probability distribution of the number of times a black ball is drawn, $n_B$?
b. What are the expectation and variance of $n_B$?
We note that $n_B$ follows a binomial distribution: $P(n_B \mid f_B, N) = \binom{N}{n_B} f_B^{n_B} (1 - f_B)^{N - n_B}, \quad \text{where } f_B = B/K$
The expectation is $E[n_B] = N f_B$ and the variance is $\operatorname{var}[n_B] = N f_B (1 - f_B)$.
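A quick numerical check of these formulas (a sketch; $N = 5$ and $f_B = 0.2$ are arbitrary illustrative values):

```python
from math import comb

def binomial_pmf(n, N, f):
    """P(n_B = n | f_B = f, N) for the binomial distribution."""
    return comb(N, n) * f**n * (1 - f)**(N - n)

N, f = 5, 0.2
support = range(N + 1)
pmf = [binomial_pmf(n, N, f) for n in support]
mean = sum(n * p for n, p in zip(support, pmf))
var = sum((n - mean) ** 2 * p for n, p in zip(support, pmf))
print(round(mean, 6), round(var, 6))  # 1.0 0.8, matching N*f and N*f*(1-f)
```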
The following is an inverse probability problem. Instead of computing the probability distribution of some quantity produced by the process, we compute the conditional probability of one or more of the unobserved variables in the process. This invariably requires use of Bayes' theorem.
Example 2.6. There are eleven urns labelled by $u \in \{0, 1, \ldots, 10\}$. Each urn contains ten balls. Urn $u$ contains $u$ black balls and $10 - u$ white balls. Fred selects an urn at random and draws $N$ times with replacement, obtaining $n_B$ black balls and $N - n_B$ white balls. If after $N = 10$ draws $n_B = 3$ blacks have been drawn, what is the probability that the urn Fred is drawing from is urn $u$?
We want to find $P(u \mid n_B, N)$: $P(u \mid n_B, N) = \frac{P(n_B \mid u, N)\,P(u)}{P(n_B \mid N)}$
$P(u)$ is straightforward, as it is $\frac{1}{11}$ for all $u$ by definition.
For $P(n_B \mid u, N)$: $P(n_B \mid u, N) = \binom{N}{n_B} f_u^{n_B} (1 - f_u)^{N - n_B}$
Where $f_u$ is the probability of drawing a black ball from urn $u$, i.e. $f_u = u/10$.
The final term is the denominator, $P(n_B \mid N)$. This is simply the sum of the numerator above over all possible instances of $u$, i.e. $P(n_B \mid N) = \sum_{u'} P(n_B \mid u', N)\,P(u')$
With these formulas, the rest is just arithmetic. For the settings in the question ($N = 10$, $n_B = 3$), the posterior is sharply peaked around $u = 3$, with urns far from $u = 3$ receiving negligible probability.
In inverse probability problems it is useful to give names to the different terms in Bayes' theorem:
- The probability $P(u)$ is called the prior probability of $u$
- $P(n_B \mid u, N)$ is called the likelihood of $u$.
- It is important to note that $P(n_B \mid u, N)$ is sometimes called the probability of $n_B$ given $u$, if we fix $u$ and want to express the probability of the observed data
- But if we fix $n_B$ (the data) and want to express the likelihood of the parameters, then $P(n_B \mid u, N)$ is called the likelihood of $u$
- $P(u \mid n_B, N)$ is called the posterior probability of $u$ given $n_B$.
- $P(n_B \mid N)$ is known as the evidence or marginal likelihood
In summary, let $\theta$ denote the unknown parameters, $D$ the data, and $\mathcal{H}$ the hypothesis space. Then we have: $P(\theta \mid D, \mathcal{H}) = \frac{P(D \mid \theta, \mathcal{H})\,P(\theta \mid \mathcal{H})}{P(D \mid \mathcal{H})}$
This is also known as: $\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$
Example 2.6 continued. Assume again that we observed $n_B = 3$ blacks after $N = 10$ draws. Let us draw another ball from the same urn. What is the probability that the next draw results in a black ball?
We should use the posterior probabilities we calculated earlier. Note that we do not have a fixed guess of which particular urn it is; we only have a probability distribution over all urns (with the highest probability placed on urn $u = 3$).
Hence, we need to marginalize over all possible urns. The probability that the next ball is black (given that we fix urn $u$) is simply $f_u = u/10$. So we want: $P(\text{next black} \mid n_B, N) = \sum_u f_u\,P(u \mid n_B, N)$
Doing all the calculations gives $0.333$ as the answer.
Note that this differs from the answer if we fixed our guess at the maximum likelihood solution, which is $u = 3$. That would yield $0.3$ as the answer. Marginalizing over all possible values of $u$ yields a more robust answer.
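The posterior over urns and the marginalized prediction can be checked numerically (a sketch of the computation described above; the binomial coefficient cancels between numerator and denominator, so it is omitted):

```python
# Posterior over urns u = 0..10 after n_B = 3 blacks in N = 10 draws,
# then the predictive probability that the next draw is black.
N, n_B = 10, 3
weights = [(u / 10) ** n_B * (1 - u / 10) ** (N - n_B) for u in range(11)]
posterior = [w / sum(weights) for w in weights]
p_next_black = sum((u / 10) * posterior[u] for u in range(11))

print(max(range(11), key=lambda u: posterior[u]))  # 3, the most probable urn
print(round(p_next_black, 3))                      # 0.333
```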
Data compression and inverse probability
Suppose we have a binary file that is just a sequence of $0$s and $1$s, something like:
000000000000000011111111111111111000000000000000001111111110101011101111111111
Intuitively, compression works by taking advantage of the predictability of a file. A data compression program must, implicitly or explicitly, answer the question "What is the probability that the next character in this file is a 1?".
MacKay's take is that data compression and data modelling are one and the same, and that this is basically an inverse probability problem. More on this in chapter 6.
Likelihood principle
The likelihood principle tells us that the only thing that matters for inference problems is the likelihood, i.e. how the probability of the data that was observed varies with the hypothesis. In other words:
The likelihood principle. Given a generative model $P(d \mid \boldsymbol{\theta})$ for data $d$ given parameters $\boldsymbol{\theta}$, and having observed a particular outcome $d_1$, all inferences and predictions should depend only on the function $P(d_1 \mid \boldsymbol{\theta})$.
Definition of entropy
Shannon information content of an outcome $x$ is defined as: $h(x) = \log_2 \frac{1}{P(x)}$
The unit of measurement is bits. In the subsequent chapters, it will be established that this quantity is indeed a natural measure of the information content of the event $x$.
Entropy of an ensemble $X$ is defined as the average Shannon information content of its outcomes: $H(X) = \sum_x P(x) \log_2 \frac{1}{P(x)}$
The convention for $P(x) = 0$ is that $0 \times \log \frac{1}{0} \equiv 0$, since $\lim_{\theta \to 0^+} \theta \log \frac{1}{\theta} = 0$.
Where it is convenient, we may also write $H(X)$ as $H(\mathbf{p})$, where $\mathbf{p}$ is the vector of probabilities $(p_1, \ldots, p_I)$.
Example 2.12. The entropy of a randomly selected letter in an English document is about $4.11$ bits.
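The definition above translates directly into code (a sketch; the `if pi > 0` guard implements the $0 \times \log \frac{1}{0} \equiv 0$ convention):

```python
from math import log2

def entropy(p):
    """Entropy of a distribution in bits, with 0 * log(1/0) treated as 0."""
    return sum(pi * log2(1 / pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))   # 1.0: one fair coin flip carries 1 bit
print(entropy([1.0, 0.0]))   # 0.0: a certain outcome carries no information
print(entropy([1 / 8] * 8))  # 3.0: uniform over 8 outcomes gives log2(8) bits
```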
Properties of Entropy
Here we cover some properties of the entropy function.
- $H(X) \geq 0$, with equality iff $p_i = 1$ for one single event $i$
- Entropy is maximized if $\mathbf{p}$ is uniform: $H(X) \leq \log |\mathcal{A}_X|$
- With equality iff $p_i = 1/|\mathcal{A}_X|$ for all $i$
- Note that $|\mathcal{A}_X|$ is the number of elements in $\mathcal{A}_X$
Redundancy. The redundancy of $X$ is: $1 - \frac{H(X)}{\log |\mathcal{A}_X|}$
Intuitively, the redundancy measures the gap between $H(X)$ and its maximum possible value, which is $\log |\mathcal{A}_X|$.
Joint Entropy. The joint entropy of $X, Y$ is: $H(X, Y) = \sum_{x, y} P(x, y) \log_2 \frac{1}{P(x, y)}$
Entropy is additive for independent random variables: $H(X, Y) = H(X) + H(Y) \quad \text{iff } P(x, y) = P(x)P(y)$
Decomposability of Entropy
One useful property of the Shannon entropy is that it can be decomposed into two parts, if we group outcomes together:
- The entropy of grouped outcomes
- Plus the weighted entropy of the outcomes within each group
Let $X$ be a discrete random variable with distinct outcomes $\{x_1, \ldots, x_n\}$ and associated probabilities $\{p_1, \ldots, p_n\}$.
We partition these outcomes into $k$ disjoint groups $G_1, \ldots, G_k$. Let $q_j$ be the probability of group $G_j$: $q_j = \sum_{x_i \in G_j} p_i$
The decomposability theorem then states that: $H(X) = H(q_1, \ldots, q_k) + \sum_{j=1}^{k} q_j H_j$
Where $H_j$ is the entropy within group $G_j$.
Proof. Let us first define the conditional probability of a specific outcome $x_i$, given that we are already inside group $G_j$, as: $P(x_i \mid G_j) = \frac{p_i}{q_j}$
Now we decompose the entropy of the full system, using $\frac{1}{p_i} = \frac{1}{q_j} \cdot \frac{q_j}{p_i}$: $H(X) = \sum_i p_i \log \frac{1}{p_i} = \sum_j \sum_{x_i \in G_j} p_i \log \frac{1}{q_j} + \sum_j \sum_{x_i \in G_j} p_i \log \frac{q_j}{p_i}$
For the first term, it is just the entropy at the group level: $\sum_j \sum_{x_i \in G_j} p_i \log \frac{1}{q_j} = \sum_j q_j \log \frac{1}{q_j} = H(q_1, \ldots, q_k)$
Note that in the first equality, the summation of $p_i$ over $x_i \in G_j$ is equal to $q_j$, as we sum up the probabilities of all events in a group.
For the second term, it is the weighted entropies of each group: $\sum_j \sum_{x_i \in G_j} p_i \log \frac{q_j}{p_i} = \sum_j q_j \sum_{x_i \in G_j} \frac{p_i}{q_j} \log \frac{q_j}{p_i} = \sum_j q_j H_j$
Note that in the second expression, the inner sum is simply the entropy of outcomes within group $G_j$, computed with the conditional probabilities $p_i / q_j$.
Hence we have shown the decomposability theorem.
Example 2.13. A source produces a character $x$ from the alphabet $\mathcal{A} = \{0, 1, \ldots, 9, a, b, \ldots, z\}$. With equal probability, $x$ is first determined to be a numeral ($10$ outcomes), vowel ($5$ outcomes) or consonant ($21$ outcomes). Then, within the selected category, a random element is selected uniformly. What is the entropy of $X$?
Using the decomposability theorem: $H(X) = H\!\left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right) + \tfrac{1}{3}\left(\log_2 10 + \log_2 5 + \log_2 21\right) = \log_2 3 + \tfrac{1}{3} \log_2 1050 \approx 4.93 \text{ bits}$
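We can verify that this decomposition matches the entropy computed directly from the full 36-outcome distribution (a sketch, with the uniform-within-category probabilities assumed above):

```python
from math import log2

# Category chosen uniformly (1/3 each), then uniform within 10 numerals,
# 5 vowels, or 21 consonants.
p = [1 / 3 / 10] * 10 + [1 / 3 / 5] * 5 + [1 / 3 / 21] * 21

direct = sum(pi * log2(1 / pi) for pi in p)
decomposed = log2(3) + (log2(10) + log2(5) + log2(21)) / 3
print(round(direct, 2), round(decomposed, 2))  # 4.93 4.93
```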
Gibbs' Inequality
Relative Entropy. The relative entropy or Kullback-Leibler divergence between two probability distributions $P(x)$ and $Q(x)$ that are defined over the same alphabet $\mathcal{A}_X$ is: $D_{\mathrm{KL}}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
The relative entropy satisfies Gibbs' inequality: $D_{\mathrm{KL}}(P \| Q) \geq 0$
With equality iff $P = Q$. Note that the relative entropy is not symmetric, i.e. in general $D_{\mathrm{KL}}(P \| Q) \neq D_{\mathrm{KL}}(Q \| P)$.
Jensen's Inequality for Convex Functions
Convex Function. A function $f(x)$ is convex over $(a, b)$ if every chord of the function lies above the function. That is, for every $x_1, x_2 \in (a, b)$ and $0 \leq \lambda \leq 1$: $f(\lambda x_1 + (1 - \lambda) x_2) \leq \lambda f(x_1) + (1 - \lambda) f(x_2)$
Strictly Convex. A function $f$ is strictly convex if, for all $x_1 \neq x_2$, the equality holds only for $\lambda = 0$ and $\lambda = 1$.
Some strictly convex functions are $x^2$, $e^x$ and $x \log x$ (for $x > 0$).
Jensen's Inequality. If $f$ is a convex function and $x$ is a random variable then: $E[f(x)] \geq f(E[x])$
Furthermore, if $f$ is strictly convex and $E[f(x)] = f(E[x])$, then the random variable $x$ is a constant.
Exercise 2.14. Prove Jensen's inequality, assuming $f$ is convex.
Proof by induction. We want to show that for any $n$, if $\sum_{i=1}^{n} p_i = 1$ with $p_i \geq 0$, and $f$ is convex, then: $\sum_{i=1}^{n} p_i f(x_i) \geq f\!\left(\sum_{i=1}^{n} p_i x_i\right)$
Base case ($n = 1$). We have $p_1 = 1$, so both sides equal $f(x_1)$ and the inequality holds trivially.
Base case ($n = 2$). We have $p_1 + p_2 = 1$, so letting $\lambda = p_1$ gives $p_2 = 1 - \lambda$. By definition of convexity: $p_1 f(x_1) + p_2 f(x_2) \geq f(p_1 x_1 + p_2 x_2)$
This is exactly the convexity condition, so the $n = 2$ case holds.
Inductive step. Assume the result holds for $n$, i.e. for any points $x_1, \ldots, x_n$ with weights summing to $1$: $\sum_{i=1}^{n} p_i f(x_i) \geq f\!\left(\sum_{i=1}^{n} p_i x_i\right)$
We want to show it holds for $n + 1$. Let $p_1, \ldots, p_{n+1}$ with $\sum_{i=1}^{n+1} p_i = 1$.
Let us define $q = \sum_{i=1}^{n} p_i$ to be the sum of the probabilities up to the $n$-th point, and "group" the first $n$ points into a single weighted point $\bar{x} = \sum_{i=1}^{n} \frac{p_i}{q} x_i$.
Now observe that we can write the $n + 1$ case as a convex combination of the weighted point and the new point: $\sum_{i=1}^{n+1} p_i x_i = q \bar{x} + p_{n+1} x_{n+1}$
Since $q + p_{n+1} = 1$, this is a convex combination of two points. Applying the $n = 2$ case with $\lambda = q$: $f\!\left(\sum_{i=1}^{n+1} p_i x_i\right) = f(q \bar{x} + p_{n+1} x_{n+1}) \leq q f(\bar{x}) + p_{n+1} f(x_{n+1})$
The RHS is already in the desired form. It remains to show that the expression $q f(\bar{x}) + p_{n+1} f(x_{n+1})$ is a lower bound for our desired LHS, $\sum_{i=1}^{n+1} p_i f(x_i)$.
We do so by applying the induction hypothesis to the points $x_1, \ldots, x_n$ with weights $p_i / q$. Note that we can do this because dividing by $q$ normalizes these into probabilities that sum to $1$: $f(\bar{x}) = f\!\left(\sum_{i=1}^{n} \frac{p_i}{q} x_i\right) \leq \sum_{i=1}^{n} \frac{p_i}{q} f(x_i)$
Multiplying both sides by $q$ and adding $p_{n+1} f(x_{n+1})$: $q f(\bar{x}) + p_{n+1} f(x_{n+1}) \leq \sum_{i=1}^{n+1} p_i f(x_i)$
Thus we have lower bounded our desired LHS. Putting the two together: $f\!\left(\sum_{i=1}^{n+1} p_i x_i\right) \leq q f(\bar{x}) + p_{n+1} f(x_{n+1}) \leq \sum_{i=1}^{n+1} p_i f(x_i)$
This completes the inductive step, and hence the proof.
Example 2.15. Three squares have an average area $\bar{A} = 100\ \mathrm{m}^2$. The average of the lengths of their sides is $\bar{l} = 10\ \mathrm{m}$. What is the size of the largest of the three squares?
Let $l$ be the length of the side of a randomly chosen square (amongst the 3, with equal probability). Then the information we have is: $E[l^2] = 100, \quad E[l] = 10$
Consider $f(x) = x^2$, the square function, which is strictly convex. We observe that $E[f(l)] = 100 = f(E[l])$ holds. Thus by the equality case of Jensen's inequality, the random variable $l$ is a constant.
This implies that all three lengths are equal. So all three lengths must equal $10\ \mathrm{m}$, and the area of each square is $100\ \mathrm{m}^2$.
Convexity and Concavity Relate to Maximisation
If $f$ is concave and there exists a point at which $\frac{\partial f}{\partial x_k} = 0 \ \text{for all } k,$
Then $f$ has its maximum value at that point.
Note that the converse does not hold. If a concave $f$ is maximized at some point, it is not necessarily true that the gradient is zero there, because the maximum can sit on the boundary of the domain. For example, over probabilities $p \in (0, 1]$, the concave function $f(p) = \log p$ is maximized on the boundary of the range at $p = 1$, where the gradient $\frac{df}{dp} = 1 \neq 0$.
Exercises
Exercise 2.16. (a) Two ordinary dice with faces labelled $1, \ldots, 6$ are thrown. What is the probability distribution of the sum of the values? What is the probability distribution of the absolute difference between the values?
(b) One hundred ordinary dice are thrown. What, roughly, is the probability distribution of the sum of the values? Sketch the probability distribution and estimate its mean and standard deviation.
(c) How can two cubical dice be labelled using the numbers $\{0, 1, 2, 3, 4, 5, 6\}$ so that when the two dice are thrown the sum has a uniform probability distribution over the integers 1–12?
(d) Is there any way that one hundred dice could be labelled with integers such that the probability distribution of the sum is uniform?
a. The probability distribution for the sum $s$ is as follows:
- $s = 2$: $1/36$
- $s = 3$: $2/36$
- $s = 4$: $3/36$
- $s = 5$: $4/36$
- $s = 6$: $5/36$
- $s = 7$: $6/36$
And so on, decreasing symmetrically: $P(s) = P(14 - s)$ for $s = 8, \ldots, 12$. The probability distribution for the absolute difference $d$ is:
- $d = 0$: $6/36$
- $d = 1$: $10/36$
- $d = 2$: $8/36$
- $d = 3$: $6/36$
- $d = 4$: $4/36$
- $d = 5$: $2/36$
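Both distributions are easy to confirm by enumerating all 36 equally likely pairs (a sketch):

```python
from collections import Counter
from fractions import Fraction

pairs = [(a, b) for a in range(1, 7) for b in range(1, 7)]
sum_dist = Counter(a + b for a, b in pairs)
diff_dist = Counter(abs(a - b) for a, b in pairs)

print({s: Fraction(c, 36) for s, c in sorted(sum_dist.items())})
print({d: Fraction(c, 36) for d, c in sorted(diff_dist.items())})
```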
b. Since each toss is independent, we have: $E[s] = 100 \times 3.5 = 350$
The variance is additive for independent trials: $\operatorname{var}[s] = 100 \times \operatorname{var}[\text{single die}] = 100 \times \tfrac{35}{12} \approx 292$
So the standard deviation is about $17$. By the central limit theorem, the distribution is roughly Gaussian, centred on the mean $350$, which is the most likely value because it has the most combinations of values.
c. There are $36$ possible permutations and $12$ possible values (i.e. 1 to 12), so for a uniform distribution we need to assign $3$ permutations to each value.
- We start by observing that $1$ is only possible by adding $1 + 0$, so we put a $1$ on the A side and three $0$s on the B side
- Similarly, we can put $2$ on the A side to form 3 entries for $s = 2$
- Put $3$ on the A side to form 3 entries for $s = 3$
- And so on, until the A side contains $1$ to $6$, which gives us 3 entries for each of $s = 1$ to $6$
- To complete, we put three $6$s on the B side, which gives us 3 entries for each of $s = 7$ to $12$
So die A is labelled $\{1, 2, 3, 4, 5, 6\}$ and die B is labelled $\{0, 0, 0, 6, 6, 6\}$.
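The labelling $\{1, 2, 3, 4, 5, 6\}$ with $\{0, 0, 0, 6, 6, 6\}$ is easily verified by enumeration (a sketch):

```python
from collections import Counter

die_a = [1, 2, 3, 4, 5, 6]
die_b = [0, 0, 0, 6, 6, 6]
sums = Counter(a + b for a in die_a for b in die_b)
print(sorted(sums.items()))  # each sum 1..12 occurs exactly 3 times (P = 1/12)
```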
d. We can use the same logic as in part c to craft a uniform distribution. The idea is that we want to space things out so that every permutation results in a unique sum.
It is easier to think about in base $10$, which we are more used to. Suppose each die had $10$ faces. Then the solution is to:
- Let the first die be $\{0, 1, \ldots, 9\}$
- Let the second die be $\{0, 10, 20, \ldots, 90\}$ (i.e. the tens place)
- Let the third die be $\{0, 100, 200, \ldots, 900\}$ (i.e. the hundreds place)
- And so on
It is not hard to see that the sum of the hundred dice will be a distinct number for every permutation, which results in a uniform distribution.
We can use the same logic, but in base $6$ instead. So:
- The first die is $\{0, 1, 2, 3, 4, 5\}$
- The second die is $\{0, 6, 12, 18, 24, 30\}$
- The third die is $\{0, 36, 72, 108, 144, 180\}$, and so on, with die $k$ labelled with the multiples of $6^{k-1}$
Exercise 2.17. If $q = 1 - p$ and $a = \ln \frac{p}{q}$, show that: $p = \frac{1}{1 + \exp(-a)}$
Sketch this function and find its relationship to the hyperbolic tangent function $\tanh(u) = \frac{e^u - e^{-u}}{e^u + e^{-u}}$. It will be useful to be fluent in base-2 logarithms also. If $b = \log_2 \frac{p}{q}$, what is $p$ as a function of $b$?
From $a = \ln \frac{p}{q}$ we get $e^a = \frac{p}{1 - p}$, and solving for $p$ gives $p = \frac{e^a}{1 + e^a} = \frac{1}{1 + e^{-a}}$. This is the sigmoid (logistic) function $\sigma(a)$. A sketch:
p
1 | ___________
| __/
| /
0.5 |_ _ _ _ _ _ _ _ _ _/_ _ _ _ _ _ _ _
| /
| __/
0 |______________/
+--|---|---|---|---|---|---|---|--> a
-4 -3 -2 -1 0 1 2 3 4
Observe that $\tanh(a/2) = \frac{e^{a/2} - e^{-a/2}}{e^{a/2} + e^{-a/2}}$:
If we multiply by $e^{a/2}$ in the numerator and denominator of the sigmoid: $\sigma(a) = \frac{1}{1 + e^{-a}} = \frac{e^{a/2}}{e^{a/2} + e^{-a/2}}$
It remains to add $1$ to $\tanh(a/2)$ and divide by $2$, so we see that: $\frac{1 + \tanh(a/2)}{2} = \frac{e^{a/2}}{e^{a/2} + e^{-a/2}} = \sigma(a)$
The sigmoid is thus just a shifted and scaled version of the hyperbolic tangent.
If $b = \log_2 \frac{p}{q}$, instead of the exponential function we have a power of $2$, so: $p = \frac{1}{1 + 2^{-b}}$
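Both identities can be checked numerically (a sketch; the test points are arbitrary):

```python
from math import exp, tanh

def sigmoid(a):
    return 1 / (1 + exp(-a))

# sigma(a) = (1 + tanh(a/2)) / 2 at a few arbitrary points.
for a in [-4.0, -1.0, 0.0, 0.5, 3.0]:
    assert abs(sigmoid(a) - (1 + tanh(a / 2)) / 2) < 1e-12

# Base-2 log-odds b = log2(p/q) inverts to p = 1 / (1 + 2**(-b)).
b = 3.0
p = 1 / (1 + 2 ** (-b))
print(p)  # 8/9, since p/q = 2**3 = 8
```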
Exercise 2.18. Let $x$ and $y$ be dependent random variables with $x$ a binary variable taking values in $\{0, 1\}$. Use Bayes' theorem to show that the log posterior probability ratio for $x$ given $y$ is $\log \frac{P(x = 1 \mid y)}{P(x = 0 \mid y)} = \log \frac{P(y \mid x = 1)}{P(y \mid x = 0)} + \log \frac{P(x = 1)}{P(x = 0)}$
By Bayes' theorem, $P(x \mid y) = P(y \mid x)\,P(x) / P(y)$. Since $P(y)$ is the same for each outcome of $x$, it cancels out in the ratio, and the result follows from the properties of logarithms.
Exercise 2.19. Let $x$, $d_1$ and $d_2$ be random variables such that $d_1$ and $d_2$ are conditionally independent given a binary variable $x$. Use Bayes' theorem to show that the posterior probability ratio for $x$ given $\{d_i\}$ is: $\frac{P(x = 1 \mid \{d_i\})}{P(x = 0 \mid \{d_i\})} = \frac{P(d_1 \mid x = 1)}{P(d_1 \mid x = 0)} \frac{P(d_2 \mid x = 1)}{P(d_2 \mid x = 0)} \frac{P(x = 1)}{P(x = 0)}$
This is just applying Bayes' rule and using the conditional independence $P(d_1, d_2 \mid x) = P(d_1 \mid x)\,P(d_2 \mid x)$; the shared denominator $P(\{d_i\})$ cancels in the ratio.
Exercise 2.20. Consider a sphere of radius $r$ in an $N$-dimensional real space. Show that the fraction of the volume of the sphere that is in the surface shell lying at values of the radius between $r - \epsilon$ and $r$, where $0 < \epsilon < r$, is: $f = 1 - \left(1 - \frac{\epsilon}{r}\right)^N$
Evaluate $f$ for the cases $N = 2$, $N = 10$ and $N = 1000$, with (a) $\epsilon/r = 0.01$; (b) $\epsilon/r = 0.5$. Implication: points that are uniformly distributed in a sphere in $N$ dimensions, where $N$ is large, are very likely to be in a thin shell near the surface.
The formula for the volume of a hypersphere is: $V(r, N) = c_N r^N$
The factor $c_N$ depends only on the dimension, so we treat it as a constant. Then the ratio we desire is the volume of the larger sphere minus the smaller sphere, divided by the larger sphere: $f = \frac{c_N r^N - c_N (r - \epsilon)^N}{c_N r^N} = 1 - \left(1 - \frac{\epsilon}{r}\right)^N$
Evaluating $f$:
| $N$ | $f$ at $\epsilon/r = 0.01$ | $f$ at $\epsilon/r = 0.5$ |
|---|---|---|
| 2 | 0.02 | 0.75 |
| 10 | 0.096 | 0.999 |
| 1000 | 0.99996 | 1.0 |
This confirms the implication: in high dimensions, almost all the volume is concentrated in a thin shell near the surface.
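The shell fractions in the table come straight from the formula (a sketch):

```python
def shell_fraction(N, eps_over_r):
    """Fraction of an N-dimensional sphere's volume in the outer shell."""
    return 1 - (1 - eps_over_r) ** N

for N in (2, 10, 1000):
    print(N, round(shell_fraction(N, 0.01), 5), round(shell_fraction(N, 0.5), 5))
```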
Exercise 2.21. Let $p_a = 0.1$, $p_b = 0.2$, and $p_c = 0.7$. Let $f(a) = 10$, $f(b) = 5$, and $f(c) = 10/7$. What is $E[f(x)]$? What is $E[1/P(x)]$?
$E[f(x)] = 0.1 \times 10 + 0.2 \times 5 + 0.7 \times \frac{10}{7} = 3$.
$f(x) = 1/P(x)$ above, so $E[1/P(x)] = 3$ as well.
Exercise 2.22. For an arbitrary ensemble, what is $E[1/P(x)]$?
This would be the cardinality $|\mathcal{A}_X|$, since the value is $1/p_i$ for each unique event: $E[1/P(x)] = \sum_i p_i \cdot \frac{1}{p_i} = |\mathcal{A}_X|$
Exercise 2.23. Let $p_a = 0.1$, $p_b = 0.2$, and $p_c = 0.7$. Let $g(a) = 0$, $g(b) = 1$, and $g(c) = 0$. What is $E[g(x)]$?
$E[g(x)] = 0.1 \times 0 + 0.2 \times 1 + 0.7 \times 0 = 0.2$.
Exercise 2.24. Let $p_a = 0.1$, $p_b = 0.2$, and $p_c = 0.7$. What is the probability that $P(x) \in [0.15, 0.5]$? What is the probability that $P(x) \notin [0.15, 0.5]$?
For the first question, only event $b$ satisfies $P(x) \in [0.15, 0.5]$ (with $P(b) = 0.2$), so the answer is $0.2$.
For the second question, we check the events against each side of the interval:
- Above the interval ($P(x) > 0.5$), only event $c$ ($P(c) = 0.7$) fulfils the criteria; event $b$ fails the criteria.
- Below the interval ($P(x) < 0.15$), event $a$ has $P(a) = 0.1$, which meets the criteria.
So the total probability is $0.1 + 0.7 = 0.8$.
Exercise 2.25. Prove the assertion that $H(X) \leq \log |\mathcal{A}_X|$ with equality iff $p_i = 1/|\mathcal{A}_X|$ for all $i$. ($|\mathcal{A}_X|$ denotes the number of elements in the set $\mathcal{A}_X$.) [Hint: use Jensen's inequality; if your first attempt to use Jensen does not succeed, remember that Jensen involves both a random variable and a function, and you have quite a lot of freedom in choosing these; think about whether your chosen function should be convex or concave.]
Recall that entropy is: $H(X) = E\!\left[\log \frac{1}{P(x)}\right]$
And that if $f$ is concave: $E[f(u)] \leq f(E[u])$
Let $u$ denote the random variable that follows the probability distribution of $x$ but whose value is $1/P(x)$. It follows that (see Exercise 2.22): $E[u] = |\mathcal{A}_X|$
Let $f$ denote the $\log$ function, which is concave. By Jensen's inequality it follows that: $H(X) = E[\log u] \leq \log E[u] = \log |\mathcal{A}_X|$
Which completes the proof.
Exercise 2.26. Prove that the relative entropy satisfies $D_{\mathrm{KL}}(P \| Q) \geq 0$ (Gibbs' inequality) with equality only if $P = Q$.
Similar strategy to the previous question. Let us define $u$ as the random variable with the same probability distribution as $P$ but with value: $u = \frac{Q(x)}{P(x)}$
Let us also denote $f(u) = \log \frac{1}{u}$, which is strictly convex. It follows that: $E[f(u)] = \sum_x P(x) \log \frac{P(x)}{Q(x)} = D_{\mathrm{KL}}(P \| Q)$
On the other side: $f(E[u]) = f\!\left(\sum_x P(x) \frac{Q(x)}{P(x)}\right) = f(1) = 0$
Using Jensen's inequality, $E[f(u)] \geq f(E[u]) = 0$, gives us Gibbs' inequality.
For the equality case, note that $f$ is strictly convex. Now suppose (in the only if case) $D_{\mathrm{KL}}(P \| Q) = 0$; we want to show that $P = Q$.
When the KL divergence is $0$, it implies that: $E[f(u)] = 0 = f(E[u])$
With this condition satisfied, Jensen's tells us that the random variable $u$ is a constant. This means that: $Q(x) = c\,P(x) \ \text{for all } x$
Where $c$ is some constant. And since both $P$ and $Q$ are valid probability distributions (each summing to $1$), the only possible value for $c$ is $1$, meaning that $P = Q$.
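A numerical illustration of Gibbs' inequality and the asymmetry of relative entropy (a sketch; the distributions are arbitrary):

```python
from math import log2

def kl(p, q):
    """D_KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

print(kl(p, q) > 0)   # True: Gibbs' inequality, since P != Q
print(kl(p, p))       # 0.0: equality iff P = Q
print(round(kl(p, q), 4), round(kl(q, p), 4))  # the two directions differ
```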
Exercise 2.27. Prove that the entropy is indeed decomposable.
Already shown above.
Exercise 2.28. A random variable $x \in \{0, 1, 2, 3\}$ is selected by flipping a bent coin with bias $f$ to determine whether the outcome is in $\{0, 1\}$ or $\{2, 3\}$; then either flipping a second bent coin with bias $g$ or a third bent coin with bias $h$ respectively. Write down the probability distribution of $x$. Use the decomposability of the entropy (2.44) to find the entropy of $X$. [Notice how compact an expression is obtained if you make use of the binary entropy function $H_2(x)$, compared with writing out the four-term entropy explicitly.] Find the derivative of $H(X)$ with respect to $f$. [Hint: $\frac{d H_2(x)}{dx} = \log \frac{1 - x}{x}$.]
Using decomposability: the distribution is $\mathbf{p} = \{fg,\ f(1 - g),\ (1 - f)h,\ (1 - f)(1 - h)\}$, so $H(X) = H_2(f) + f H_2(g) + (1 - f) H_2(h)$, and $\frac{\partial H}{\partial f} = \log \frac{1 - f}{f} + H_2(g) - H_2(h)$.
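The compact expression can be checked against the four-term entropy directly (a sketch; the biases $f$, $g$, $h$ are arbitrary illustrative values):

```python
from math import log2

def H2(x):
    """Binary entropy function in bits."""
    return x * log2(1 / x) + (1 - x) * log2(1 / (1 - x))

f, g, h = 0.3, 0.6, 0.25
p = [f * g, f * (1 - g), (1 - f) * h, (1 - f) * (1 - h)]

four_term = sum(pi * log2(1 / pi) for pi in p)
compact = H2(f) + f * H2(g) + (1 - f) * H2(h)
print(round(four_term, 6) == round(compact, 6))  # True
```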
Exercise 2.29. An unbiased coin is flipped until one head is thrown. What is the entropy of the random variable $r$, the number of flips? Repeat the calculation for the case of a biased coin with probability $f$ of coming up heads. [Hint: solve the problem both directly and by using the decomposability of the entropy (2.43).]
The probability distribution is like so:
- $P(r = 1) = 1/2$: outcome is H
- $P(r = 2) = 1/4$: outcome is TH
- $P(r = 3) = 1/8$: outcome is TTH
- And so on: $P(r) = (1/2)^r$
Using the direct method, the entropy of $r$ is: $H(r) = \sum_{r=1}^{\infty} \frac{1}{2^r} \log_2 2^r = \sum_{r=1}^{\infty} \frac{r}{2^r} = 2 \text{ bits}$
Using the decomposability method, the first coin flip contributes entropy $H_2(1/2)$. Thereafter, there are two groups:
- The first group is when the first flip was H. In this branch, the value of $r$ becomes fixed/determined, and the entropy for this group becomes $0$
- The second group (first flip T) brings us to the same setting as what we started with
This gives us a recursive definition for $H$: $H = H_2(1/2) + \frac{1}{2} \cdot 0 + \frac{1}{2} H$, so $H = 2$ bits. For the biased coin, $H = H_2(f) + (1 - f) H$, so $H = H_2(f) / f$.
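Both routes can be compared numerically by truncating the geometric series once the tail is negligible (a sketch; $f = 0.2$ is an arbitrary bias):

```python
from math import log2

def H2(x):
    """Binary entropy function in bits."""
    return x * log2(1 / x) + (1 - x) * log2(1 / (1 - x))

f = 0.2  # probability of heads
# Direct sum over P(r) = (1-f)**(r-1) * f, truncated where the tail vanishes.
direct = sum((1 - f) ** (r - 1) * f * log2(1 / ((1 - f) ** (r - 1) * f))
             for r in range(1, 1000))
print(round(direct, 6), round(H2(f) / f, 6))  # the two agree
```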
Exercise 2.30. An urn contains white balls and black balls. Two balls are drawn, one after the other, without replacement. Prove that the probability that the first ball is white is equal to the probability that the second is white.
Exercise 2.31. A circular coin of diameter $a$ is thrown onto a square grid whose squares are $b \times b$ ($a < b$). What is the probability that the coin will lie entirely within one square? [Ans: $(1 - a/b)^2$]
Exercise 2.32. Buffon's needle. A needle of length $a$ is thrown onto a plane covered with equally spaced parallel lines with separation $b$. What is the probability that the needle will cross a line? [Ans, if $a < b$: $\frac{2a}{\pi b}$] [Generalization, Buffon's noodle: on average, a random curve of length $A$ is expected to intersect the lines $\frac{2A}{\pi b}$ times.]
Exercise 2.33. Two points are selected at random on a straight line segment of length 1. What is the probability that a triangle can be constructed out of the three resulting segments?
Exercise 2.34. An unbiased coin is flipped until one head is thrown. What is the expected number of tails and the expected number of heads? Fred, who doesn't know that the coin is unbiased, estimates the bias using $\hat{f} \equiv h/(h + t)$, where $h$ and $t$ are the numbers of heads and tails tossed. Compute and sketch the probability distribution of $\hat{f}$. N.B., this is a forward probability problem, a sampling theory problem, not an inference problem. Don't use Bayes' theorem.
Exercise 2.35. Fred rolls an unbiased six-sided die once per second, noting the occasions when the outcome is a six. (a) What is the mean number of rolls from one six to the next six?
(b) Between two rolls, the clock strikes one. What is the mean number of rolls until the next six? (c) Now think back before the clock struck. What is the mean number of rolls, going back in time, until the most recent six?
(d) What is the mean number of rolls from the six before the clock struck to the next six?
(e) Is your answer to (d) different from your answer to (a)? Explain.
Another version of this exercise refers to Fred waiting for a bus at a bus-stop in Poissonville where buses arrive independently at random (a Poisson process), with, on average, one bus every six minutes. What is the average wait for a bus, after Fred arrives at the stop? [6 minutes.] So what is the time between the two buses, the one that Fred just missed, and the one that he catches? [12 minutes.] Explain the apparent paradox. Note the contrast with the situation in Clockville, where the buses are spaced exactly 6 minutes apart. There, as you can confirm, the mean wait at a bus-stop is 3 minutes, and the time between the missed bus and the next one is 6 minutes.
Conditional probability
Exercise 2.36. You meet Fred. Fred tells you he has two brothers, Alf and Bob. What is the probability that Fred is older than Bob? Fred tells you that he is older than Alf. Now, what is the probability that Fred is older than Bob? (That is, what is the conditional probability that Fred is older than Bob, given that Fred is older than Alf?)
Exercise 2.37. The inhabitants of an island tell the truth one third of the time. They lie with probability $2/3$. On an occasion, after one of them made a statement, you ask another 'was that statement true?' and he says 'yes'. What is the probability that the statement was indeed true?
Exercise 2.38. Compare two ways of computing the probability of error of the repetition code $R_3$, assuming a binary symmetric channel with noise level $f = 0.1$ (you did this once for exercise 1.2 (p.7)) and confirm that they give the same answer. Binomial distribution method. Add the probability that all three bits are flipped to the probability that exactly two bits are flipped. Sum rule method. Using the sum rule, compute the marginal probability that $\mathbf{r}$ takes on each of the eight possible values, $P(\mathbf{r})$. Then compute the posterior probability of $s$ for each of the eight values of $\mathbf{r}$. [In fact, by symmetry, only two example cases $\mathbf{r} = 000$ and $\mathbf{r} = 001$ need be considered.] Notice that some of the inferred bits are better determined than others. From the posterior probability $P(s \mid \mathbf{r})$ you can read out the case-by-case error probability, the probability that the more probable hypothesis is not correct, $P(\text{error} \mid \mathbf{r})$. Find the average error probability using the sum rule, $P(\text{error}) = \sum_{\mathbf{r}} P(\mathbf{r})\,P(\text{error} \mid \mathbf{r})$
Exercise 2.39. The frequency $p_n$ of the $n$th most frequent word in English is roughly approximated by $p_n \approx \frac{0.1}{n}$ for $n \leq 12{,}367$, and $p_n = 0$ for larger $n$.
[This remarkable law is known as Zipf's law, and applies to the word frequencies of many languages (Zipf, 1949).] If we assume that English is generated by picking words at random according to this distribution, what is the entropy of English (per word)? [This calculation can be found in ‘Prediction and entropy of printed English’, C.E. Shannon, Bell Syst. Tech. J. 30, pp.50–64 (1950), but, inexplicably, the great man made numerical errors in it.]