Probability, Entropy, and Inference
This chapter will spend some time on notation.
Probabilities and ensembles
An ensemble $X$ is a triple $(x, \mathcal{A}_X, \mathcal{P}_X)$, where the outcome $x$ is the value of a random variable, which takes on one of a set of possible values $\mathcal{A}_X = \{a_1, a_2, \ldots, a_I\}$. $\mathcal{P}_X = \{p_1, p_2, \ldots, p_I\}$ represents the probability distribution, such that $P(x = a_i) = p_i$, $p_i \geq 0$, and $\sum_{a_i \in \mathcal{A}_X} P(x = a_i) = 1$.
The name $\mathcal{A}$ is a mnemonic for "alphabet". An example of an ensemble is selecting a random letter from an English document.
Note: MacKay's definition of an ensemble seems redundant, as we usually just use the term "random variable" to denote the same idea. But he insists on using "ensemble" to refer to the entire system or environment, with an emphasis on the total space of the alphabet. In strict mathematical form, by contrast, a random variable refers to a function that maps each outcome to a real number.
Probability of a subset. If $T$ is a subset of $\mathcal{A}_X$, then: $P(T) = P(x \in T) = \sum_{a_i \in T} P(x = a_i)$
Joint ensemble $XY$ is an ensemble in which each outcome is an ordered pair $x, y$ with $x \in \mathcal{A}_X = \{a_1, \ldots, a_I\}$ and $y \in \mathcal{A}_Y = \{b_1, \ldots, b_J\}$.
We call $P(x, y)$ the joint probability of $x$ and $y$.
Marginal probability. We can obtain the marginal probability $P(x)$ from the joint probability $P(x, y)$ by summation: $P(x = a_i) = \sum_{y \in \mathcal{A}_Y} P(x = a_i, y)$
Or more succinctly: $P(x) = \sum_y P(x, y)$
Conditional Probability. $P(x = a_i \mid y = b_j) = \frac{P(x = a_i, y = b_j)}{P(y = b_j)}$, if $P(y = b_j) \neq 0$.
We often do not write down the joint probability directly, but rather define an ensemble in terms of a collection of conditional probabilities. Hence the following rules to manipulate conditional probabilities are useful.
Product rule (or Chain rule). Based on the definition of conditional probability, we have: $P(x, y \mid \mathcal{H}) = P(x \mid y, \mathcal{H})\,P(y \mid \mathcal{H}) = P(y \mid x, \mathcal{H})\,P(x \mid \mathcal{H})$
Where $\mathcal{H}$ is a way of denoting the assumptions upon which the probabilities are based.
Sum Rule. We are just writing the marginal probability in terms of conditional probabilities: $P(x \mid \mathcal{H}) = \sum_y P(x, y \mid \mathcal{H}) = \sum_y P(x \mid y, \mathcal{H})\,P(y \mid \mathcal{H})$
Bayes' theorem, obtained from the product rule: $P(y \mid x, \mathcal{H}) = \frac{P(x \mid y, \mathcal{H})\,P(y \mid \mathcal{H})}{\sum_{y'} P(x \mid y', \mathcal{H})\,P(y' \mid \mathcal{H})}$
Independence. Two random variables $X$ and $Y$ are independent if and only if: $P(x, y) = P(x)\,P(y)$
We often define an ensemble in terms of a collection of conditional probabilities. This example illustrates the idea.
Example 2.3. Jo has a test for a disease. The variable $a$ denotes whether Jo has the disease and $b$ denotes the test result. In $95\%$ of cases of people who have the disease, a positive test results; in $95\%$ of cases of people who do not have the disease, a negative test results. $1\%$ of people of Jo's age and background have the disease. Now Jo takes the test and it is positive. What is the probability that Jo has the disease?
Based on the information: $P(b = 1 \mid a = 1) = 0.95$, $P(b = 0 \mid a = 0) = 0.95$, $P(a = 1) = 0.01$.
We want: $P(a = 1 \mid b = 1) = \frac{P(b = 1 \mid a = 1)\,P(a = 1)}{P(b = 1 \mid a = 1)\,P(a = 1) + P(b = 1 \mid a = 0)\,P(a = 0)} = \frac{0.95 \times 0.01}{0.95 \times 0.01 + 0.05 \times 0.99} \approx 0.16$
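As a sanity check, the arithmetic can be reproduced in a few lines (a sketch; the $95\%$/$95\%$/$1\%$ figures are the ones assumed above):

```python
# Posterior P(a=1 | b=1) for the disease-test example, via Bayes' theorem.
p_pos_given_disease = 0.95   # P(b=1 | a=1)
p_neg_given_healthy = 0.95   # P(b=0 | a=0)
p_disease = 0.01             # prior P(a=1)

evidence = (p_pos_given_disease * p_disease
            + (1 - p_neg_given_healthy) * (1 - p_disease))  # P(b=1)
posterior = p_pos_given_disease * p_disease / evidence
print(round(posterior, 3))  # 0.161: a positive test still leaves only ~16%
```

The small prior dominates: most positive tests come from the large healthy population.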
The meaning of probability
There are two philosophical definitions of probability.
The frequentist view is easier to grasp and clearly defined in the case of random variables. It is simply the frequency of outcomes in random experiments. This is well defined where the concept of repeated trials is valid. However, it does not translate neatly into real-world scenarios: what does it mean to speak of the probability that x murdered y, given the amount of evidence available?
The Bayesian view tries to account for such cases by using the concept of degrees of belief. Degrees of belief can be mapped to probabilities if they satisfy simple consistency rules known as the Cox axioms.
Forward probabilities and inverse probabilities
There are two categories of probability calculations: forward and inverse.
Example 2.4. This is an example of a forward probability problem. An urn contains $K$ balls, of which $B$ are black and the rest ($W = K - B$) are white. Fred draws a ball at random from the urn and replaces it, $N$ times.
a. What is the probability distribution of the number of times a black ball is drawn, $n_B$?
b. What are the expectation and variance of $n_B$?
We note that $n_B$ follows a binomial distribution: $P(n_B \mid f_B, N) = \binom{N}{n_B} f_B^{n_B} (1 - f_B)^{N - n_B}, \quad \text{where } f_B = B/K$
The expectation is $E[n_B] = N f_B$ and the variance is $\operatorname{var}[n_B] = N f_B (1 - f_B)$.
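A quick numerical check of these formulas (a sketch; $N = 5$ and $f_B = 0.2$ are arbitrary illustrative values):

```python
from math import comb

def binomial_pmf(n, N, f):
    """P(n_B = n | f_B = f, N) for the binomial distribution."""
    return comb(N, n) * f**n * (1 - f)**(N - n)

N, f = 5, 0.2
support = range(N + 1)
pmf = [binomial_pmf(n, N, f) for n in support]
mean = sum(n * p for n, p in zip(support, pmf))
var = sum((n - mean) ** 2 * p for n, p in zip(support, pmf))
print(round(mean, 6), round(var, 6))  # 1.0 0.8, matching N*f and N*f*(1-f)
```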
The following is an inverse probability problem. Instead of computing the probability distribution of some quantity produced by the process, we compute the conditional probability of one or more of the unobserved variables in the process. This invariably requires use of Bayes' theorem.
Example 2.6. There are eleven urns labelled by $u \in \{0, 1, \ldots, 10\}$. Each urn contains ten balls. Urn $u$ contains $u$ black balls and $10 - u$ white balls. Fred selects an urn at random and draws $N$ times with replacement, obtaining $n_B$ black balls and $N - n_B$ white balls. If after $N = 10$ draws $n_B = 3$ blacks have been drawn, what is the probability that the urn Fred is drawing from is urn $u$?
We want to find $P(u \mid n_B, N)$: $P(u \mid n_B, N) = \frac{P(n_B \mid u, N)\,P(u)}{P(n_B \mid N)}$
$P(u)$ is straightforward, as it is $\frac{1}{11}$ for all $u$ by definition.
For $P(n_B \mid u, N)$: $P(n_B \mid u, N) = \binom{N}{n_B} f_u^{n_B} (1 - f_u)^{N - n_B}$
Where $f_u$ is the probability of drawing a black ball from urn $u$, i.e. $f_u = u/10$.
The final term is the denominator, $P(n_B \mid N)$. This is simply the sum of the numerator above over all possible instances of $u$, i.e. $P(n_B \mid N) = \sum_{u'} P(n_B \mid u', N)\,P(u')$
With these formulas, the rest is just arithmetic. For the settings in the question ($N = 10$, $n_B = 3$), the posterior is sharply peaked around $u = 3$, with urns far from $u = 3$ receiving negligible probability.
In inverse probability problems it is useful to give names to the different terms in Bayes' theorem:
- The probability $P(u)$ is called the prior probability of $u$
- $P(n_B \mid u, N)$ is called the likelihood of $u$.
- It is important to note that $P(n_B \mid u, N)$ is sometimes called the probability of $n_B$ given $u$, if we fix $u$ and want to express the probability of the observed data
- But if we fix $n_B$ (the data) and want to express the likelihood of the parameters, then $P(n_B \mid u, N)$ is called the likelihood of $u$
- $P(u \mid n_B, N)$ is called the posterior probability of $u$ given $n_B$.
- $P(n_B \mid N)$ is known as the evidence or marginal likelihood
In summary, let $\theta$ denote the unknown parameters, $D$ the data, and $\mathcal{H}$ the hypothesis space. Then we have: $P(\theta \mid D, \mathcal{H}) = \frac{P(D \mid \theta, \mathcal{H})\,P(\theta \mid \mathcal{H})}{P(D \mid \mathcal{H})}$
This is also known as: $\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$
Example 2.6 continued. Assume again that we observed $n_B = 3$ blacks after $N = 10$ draws. Let us draw another ball from the same urn. What is the probability that the next draw results in a black ball?
We should use the posterior probabilities we calculated earlier. Note that we do not have a fixed guess of which particular urn it is; we only have a probability distribution over all urns (with the highest probability placed on urn $u = 3$).
Hence, we need to marginalize over all possible urns. The probability that the next ball is black (given that we fix urn $u$) is simply $f_u = u/10$. So we want: $P(\text{next black} \mid n_B, N) = \sum_u f_u\,P(u \mid n_B, N)$
Doing all the calculations gives $0.333$ as the answer.
Note that this differs from the answer if we fixed our guess at the maximum likelihood solution, which is $u = 3$. That would yield $0.3$ as the answer. Marginalizing over all possible values of $u$ yields a more robust answer.
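The posterior over urns and the marginalized prediction can be checked numerically (a sketch of the computation described above; the binomial coefficient cancels between numerator and denominator, so it is omitted):

```python
# Posterior over urns u = 0..10 after n_B = 3 blacks in N = 10 draws,
# then the predictive probability that the next draw is black.
N, n_B = 10, 3
weights = [(u / 10) ** n_B * (1 - u / 10) ** (N - n_B) for u in range(11)]
posterior = [w / sum(weights) for w in weights]
p_next_black = sum((u / 10) * posterior[u] for u in range(11))

print(max(range(11), key=lambda u: posterior[u]))  # 3, the most probable urn
print(round(p_next_black, 3))                      # 0.333
```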
Data compression and inverse probability
Suppose we have a binary file that is just a sequence of $0$s and $1$s, something like:
000000000000000011111111111111111000000000000000001111111110101011101111111111
Intuitively, compression works by taking advantage of the predictability of a file. A data compression program must, implicitly or explicitly, answer the question "What is the probability that the next character in this file is a 1?".
MacKay's take is that data compression and data modelling are one and the same, and that this is basically an inverse probability problem. More on this in chapter 6.
Likelihood principle
The likelihood principle tells us that the only thing that matters for inference problems is the likelihood, i.e. how the probability of the data that was observed varies with the hypothesis. In other words:
The likelihood principle. Given a generative model $P(d \mid \boldsymbol{\theta})$ for data $d$ given parameters $\boldsymbol{\theta}$, and having observed a particular outcome $d_1$, all inferences and predictions should depend only on the function $P(d_1 \mid \boldsymbol{\theta})$.
Definition of entropy
Shannon information content of an outcome $x$ is defined as: $h(x) = \log_2 \frac{1}{P(x)}$
The unit of measurement is bits. In the subsequent chapters, it will be established that this quantity is indeed a natural measure of the information content of the event $x$.
Entropy of an ensemble $X$ is defined as the average Shannon information content of its outcomes: $H(X) = \sum_x P(x) \log_2 \frac{1}{P(x)}$
The convention for $P(x) = 0$ is that $0 \times \log \frac{1}{0} \equiv 0$, since $\lim_{\theta \to 0^+} \theta \log \frac{1}{\theta} = 0$.
Where it is convenient, we may also write $H(X)$ as $H(\mathbf{p})$, where $\mathbf{p}$ is the vector of probabilities $(p_1, \ldots, p_I)$.
Example 2.12. The entropy of a randomly selected letter in an English document is about $4.11$ bits.
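The definition above translates directly into code (a sketch; the `if pi > 0` guard implements the $0 \times \log \frac{1}{0} \equiv 0$ convention):

```python
from math import log2

def entropy(p):
    """Entropy of a distribution in bits, with 0 * log(1/0) treated as 0."""
    return sum(pi * log2(1 / pi) for pi in p if pi > 0)

print(entropy([0.5, 0.5]))   # 1.0: one fair coin flip carries 1 bit
print(entropy([1.0, 0.0]))   # 0.0: a certain outcome carries no information
print(entropy([1 / 8] * 8))  # 3.0: uniform over 8 outcomes gives log2(8) bits
```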
Properties of Entropy
Here we cover some properties of the entropy function.
- $H(X) \geq 0$, with equality iff $p_i = 1$ for one single event $i$
- Entropy is maximized if $\mathbf{p}$ is uniform: $H(X) \leq \log |\mathcal{A}_X|$
- With equality iff $p_i = 1/|\mathcal{A}_X|$ for all $i$
- Note that $|\mathcal{A}_X|$ is the number of elements in $\mathcal{A}_X$
Redundancy. The redundancy of $X$ is: $1 - \frac{H(X)}{\log |\mathcal{A}_X|}$
Intuitively, the redundancy measures the gap between $H(X)$ and its maximum possible value, which is $\log |\mathcal{A}_X|$.
Joint Entropy. The joint entropy of $X, Y$ is: $H(X, Y) = \sum_{x, y} P(x, y) \log_2 \frac{1}{P(x, y)}$
Entropy is additive for independent random variables: $H(X, Y) = H(X) + H(Y) \quad \text{iff } P(x, y) = P(x)P(y)$
Decomposability of Entropy
One useful property of the Shannon entropy is that it can be decomposed into two parts, if we group outcomes together:
- The entropy of grouped outcomes
- Plus the weighted entropy of the outcomes within each group
Let $X$ be a discrete random variable with distinct outcomes $\{x_1, \ldots, x_n\}$ and associated probabilities $\{p_1, \ldots, p_n\}$.
We partition these outcomes into $k$ disjoint groups $G_1, \ldots, G_k$. Let $q_j$ be the probability of group $G_j$: $q_j = \sum_{x_i \in G_j} p_i$
The decomposability theorem then states that: $H(X) = H(q_1, \ldots, q_k) + \sum_{j=1}^{k} q_j H_j$
Where $H_j$ is the entropy within group $G_j$.
Proof. Let us first define the conditional probability of a specific outcome $x_i$, given that we are already inside group $G_j$, as: $P(x_i \mid G_j) = \frac{p_i}{q_j}$
Now we decompose the entropy of the full system, using $\frac{1}{p_i} = \frac{1}{q_j} \cdot \frac{q_j}{p_i}$: $H(X) = \sum_i p_i \log \frac{1}{p_i} = \sum_j \sum_{x_i \in G_j} p_i \log \frac{1}{q_j} + \sum_j \sum_{x_i \in G_j} p_i \log \frac{q_j}{p_i}$
For the first term, it is just the entropy at the group level: $\sum_j \sum_{x_i \in G_j} p_i \log \frac{1}{q_j} = \sum_j q_j \log \frac{1}{q_j} = H(q_1, \ldots, q_k)$
Note that in the first equality, the summation of $p_i$ over $x_i \in G_j$ is equal to $q_j$, as we sum up the probabilities of all events in a group.
For the second term, it is the weighted entropies of each group: $\sum_j \sum_{x_i \in G_j} p_i \log \frac{q_j}{p_i} = \sum_j q_j \sum_{x_i \in G_j} \frac{p_i}{q_j} \log \frac{q_j}{p_i} = \sum_j q_j H_j$
Note that in the second expression, the inner sum is simply the entropy of outcomes within group $G_j$, computed with the conditional probabilities $p_i / q_j$.
Hence we have shown the decomposability theorem.
Example 2.13. A source produces a character $x$ from the alphabet $\mathcal{A} = \{0, 1, \ldots, 9, a, b, \ldots, z\}$. With equal probability, $x$ is first determined to be a numeral ($10$ outcomes), vowel ($5$ outcomes) or consonant ($21$ outcomes). Then, within the selected category, a random element is selected uniformly. What is the entropy of $X$?
Using the decomposability theorem: $H(X) = H\!\left(\tfrac{1}{3}, \tfrac{1}{3}, \tfrac{1}{3}\right) + \tfrac{1}{3}\left(\log_2 10 + \log_2 5 + \log_2 21\right) = \log_2 3 + \tfrac{1}{3} \log_2 1050 \approx 4.93 \text{ bits}$
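We can verify that this decomposition matches the entropy computed directly from the full 36-outcome distribution (a sketch, with the uniform-within-category probabilities assumed above):

```python
from math import log2

# Category chosen uniformly (1/3 each), then uniform within 10 numerals,
# 5 vowels, or 21 consonants.
p = [1 / 3 / 10] * 10 + [1 / 3 / 5] * 5 + [1 / 3 / 21] * 21

direct = sum(pi * log2(1 / pi) for pi in p)
decomposed = log2(3) + (log2(10) + log2(5) + log2(21)) / 3
print(round(direct, 2), round(decomposed, 2))  # 4.93 4.93
```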
Gibbs' Inequality
Relative Entropy. The relative entropy or Kullback-Leibler divergence between two probability distributions $P(x)$ and $Q(x)$ that are defined over the same alphabet $\mathcal{A}_X$ is: $D_{\mathrm{KL}}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
The relative entropy satisfies Gibbs' inequality: $D_{\mathrm{KL}}(P \| Q) \geq 0$
With equality iff $P = Q$. Note that the relative entropy is not symmetric, i.e. in general $D_{\mathrm{KL}}(P \| Q) \neq D_{\mathrm{KL}}(Q \| P)$.
Jensen's Inequality for Convex Functions
Convex Function. A function $f(x)$ is convex over $(a, b)$ if every chord of the function lies above the function. That is, for every $x_1, x_2 \in (a, b)$ and $0 \leq \lambda \leq 1$: $f(\lambda x_1 + (1 - \lambda) x_2) \leq \lambda f(x_1) + (1 - \lambda) f(x_2)$
Strictly Convex. A function $f$ is strictly convex if, for all $x_1 \neq x_2$, the equality holds only for $\lambda = 0$ and $\lambda = 1$.
Some strictly convex functions are $x^2$, $e^x$ and $x \log x$ (for $x > 0$).
Jensen's Inequality. If $f$ is a convex function and $x$ is a random variable then: $E[f(x)] \geq f(E[x])$
Furthermore, if $f$ is strictly convex and $E[f(x)] = f(E[x])$, then the random variable $x$ is a constant.
Exercise 2.14. Prove Jensen's inequality, assuming $f$ is convex.
Proof by induction. We want to show that for any $n$, if $\sum_{i=1}^{n} p_i = 1$ with $p_i \geq 0$, and $f$ is convex, then: $\sum_{i=1}^{n} p_i f(x_i) \geq f\!\left(\sum_{i=1}^{n} p_i x_i\right)$
Base case ($n = 1$). We have $p_1 = 1$, so both sides equal $f(x_1)$ and the inequality holds trivially.
Base case ($n = 2$). We have $p_1 + p_2 = 1$, so letting $\lambda = p_1$ gives $p_2 = 1 - \lambda$. By definition of convexity: $p_1 f(x_1) + p_2 f(x_2) \geq f(p_1 x_1 + p_2 x_2)$
This is exactly the convexity condition, so the $n = 2$ case holds.
Inductive step. Assume the result holds for $n$, i.e. for any points $x_1, \ldots, x_n$ with weights summing to $1$: $\sum_{i=1}^{n} p_i f(x_i) \geq f\!\left(\sum_{i=1}^{n} p_i x_i\right)$
We want to show it holds for $n + 1$. Let $p_1, \ldots, p_{n+1}$ with $\sum_{i=1}^{n+1} p_i = 1$.
Let us define $q = \sum_{i=1}^{n} p_i$ to be the sum of the probabilities up to the $n$-th point, and "group" the first $n$ points into a single weighted point $\bar{x} = \sum_{i=1}^{n} \frac{p_i}{q} x_i$.
Now observe that we can write the $n + 1$ case as a convex combination of the weighted point and the new point: $\sum_{i=1}^{n+1} p_i x_i = q \bar{x} + p_{n+1} x_{n+1}$
Since $q + p_{n+1} = 1$, this is a convex combination of two points. Applying the $n = 2$ case with $\lambda = q$: $f\!\left(\sum_{i=1}^{n+1} p_i x_i\right) = f(q \bar{x} + p_{n+1} x_{n+1}) \leq q f(\bar{x}) + p_{n+1} f(x_{n+1})$
The RHS is already in the desired form. It remains to show that the expression $q f(\bar{x}) + p_{n+1} f(x_{n+1})$ is a lower bound for our desired LHS, $\sum_{i=1}^{n+1} p_i f(x_i)$.
We do so by applying the induction hypothesis to the points $x_1, \ldots, x_n$ with weights $p_i / q$. Note that we can do this because dividing by $q$ normalizes these into probabilities that sum to $1$: $f(\bar{x}) = f\!\left(\sum_{i=1}^{n} \frac{p_i}{q} x_i\right) \leq \sum_{i=1}^{n} \frac{p_i}{q} f(x_i)$
Multiplying both sides by $q$ and adding $p_{n+1} f(x_{n+1})$: $q f(\bar{x}) + p_{n+1} f(x_{n+1}) \leq \sum_{i=1}^{n+1} p_i f(x_i)$
Thus we have lower bounded our desired LHS. Putting the two together: $f\!\left(\sum_{i=1}^{n+1} p_i x_i\right) \leq q f(\bar{x}) + p_{n+1} f(x_{n+1}) \leq \sum_{i=1}^{n+1} p_i f(x_i)$
This completes the inductive step, and hence the proof.
Example 2.15. Three squares have an average area $\bar{A} = 100\ \mathrm{m}^2$. The average of the lengths of their sides is $\bar{l} = 10\ \mathrm{m}$. What is the size of the largest of the three squares?
Let $l$ be the length of the side of a randomly chosen square (amongst the 3, with equal probability). Then the information we have is: $E[l^2] = 100, \quad E[l] = 10$
Consider $f(x) = x^2$, the square function, which is strictly convex. We observe that $E[f(l)] = 100 = f(E[l])$ holds. Thus by the equality case of Jensen's inequality, the random variable $l$ is a constant.
This implies that all three lengths are equal. So all three lengths must equal $10\ \mathrm{m}$, and the area of each square is $100\ \mathrm{m}^2$.
Convexity and Concavity Relate to Maximisation
If $f$ is concave and there exists a point at which $\frac{\partial f}{\partial x_k} = 0 \ \text{for all } k,$
Then $f$ has its maximum value at that point.
Note that the converse does not hold. If a concave $f$ is maximized at some point, it is not necessarily true that the gradient is zero there, because the maximum can sit on the boundary of the domain. For example, over probabilities $p \in (0, 1]$, the concave function $f(p) = \log p$ is maximized on the boundary of the range at $p = 1$, where the gradient $\frac{df}{dp} = 1 \neq 0$.
Exercises
Exercise 2.16. (a) Two ordinary dice with faces labelled $1, \ldots, 6$ are thrown. What is the probability distribution of the sum of the values? What is the probability distribution of the absolute difference between the values?
(b) One hundred ordinary dice are thrown. What, roughly, is the probability distribution of the sum of the values? Sketch the probability distribution and estimate its mean and standard deviation.
(c) How can two cubical dice be labelled using the numbers $\{0, 1, 2, 3, 4, 5, 6\}$ so that when the two dice are thrown the sum has a uniform probability distribution over the integers 1–12?
(d) Is there any way that one hundred dice could be labelled with integers such that the probability distribution of the sum is uniform?
a. The probability distribution for the sum $s$ is as follows:
- $s = 2$: $1/36$
- $s = 3$: $2/36$
- $s = 4$: $3/36$
- $s = 5$: $4/36$
- $s = 6$: $5/36$
- $s = 7$: $6/36$
And so on, decreasing symmetrically: $P(s) = P(14 - s)$ for $s = 8, \ldots, 12$. The probability distribution for the absolute difference $d$ is:
- $d = 0$: $6/36$
- $d = 1$: $10/36$
- $d = 2$: $8/36$
- $d = 3$: $6/36$
- $d = 4$: $4/36$
- $d = 5$: $2/36$
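Both distributions are easy to confirm by enumerating all 36 equally likely pairs (a sketch):

```python
from collections import Counter
from fractions import Fraction

pairs = [(a, b) for a in range(1, 7) for b in range(1, 7)]
sum_dist = Counter(a + b for a, b in pairs)
diff_dist = Counter(abs(a - b) for a, b in pairs)

print({s: Fraction(c, 36) for s, c in sorted(sum_dist.items())})
print({d: Fraction(c, 36) for d, c in sorted(diff_dist.items())})
```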
b. Since each toss is independent, we have: $E[s] = 100 \times 3.5 = 350$
The variance is additive for independent trials: $\operatorname{var}[s] = 100 \times \operatorname{var}[\text{single die}] = 100 \times \tfrac{35}{12} \approx 292$
So the standard deviation is about $17$. By the central limit theorem, the distribution is roughly Gaussian, centred on the mean $350$, which is the most likely value because it has the most combinations of values.
c. There are $36$ possible permutations and $12$ possible values (i.e. 1 to 12), so for a uniform distribution we need to assign $3$ permutations to each value.
- We start by observing that $1$ is only possible by adding $1 + 0$, so we put a $1$ on the A side and three $0$s on the B side
- Similarly, we can put $2$ on the A side to form 3 entries for $s = 2$
- Put $3$ on the A side to form 3 entries for $s = 3$
- And so on, until the A side contains $1$ to $6$, which gives us 3 entries for each of $s = 1$ to $6$
- To complete, we put three $6$s on the B side, which gives us 3 entries for each of $s = 7$ to $12$
So die A is labelled $\{1, 2, 3, 4, 5, 6\}$ and die B is labelled $\{0, 0, 0, 6, 6, 6\}$.
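The labelling $\{1, 2, 3, 4, 5, 6\}$ with $\{0, 0, 0, 6, 6, 6\}$ is easily verified by enumeration (a sketch):

```python
from collections import Counter

die_a = [1, 2, 3, 4, 5, 6]
die_b = [0, 0, 0, 6, 6, 6]
sums = Counter(a + b for a in die_a for b in die_b)
print(sorted(sums.items()))  # each sum 1..12 occurs exactly 3 times (P = 1/12)
```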
d. We can use the same logic as in part c to craft a uniform distribution. The idea is that we want to space things out so that every permutation results in a unique sum.
It is easier to think about in base $10$, which we are more used to. Suppose each die had $10$ faces. Then the solution is to:
- Let the first die be $\{0, 1, \ldots, 9\}$
- Let the second die be $\{0, 10, 20, \ldots, 90\}$ (i.e. the tens place)
- Let the third die be $\{0, 100, 200, \ldots, 900\}$ (i.e. the hundreds place)
- And so on
It is not hard to see that the sum of the hundred dice will be a distinct number for every permutation, which results in a uniform distribution.
We can use the same logic, but in base $6$ instead. So:
- The first die is $\{0, 1, 2, 3, 4, 5\}$
- The second die is $\{0, 6, 12, 18, 24, 30\}$
- The third die is $\{0, 36, 72, 108, 144, 180\}$, and so on, with die $k$ labelled with the multiples of $6^{k-1}$
Exercise 2.17. If $q = 1 - p$ and $a = \ln \frac{p}{q}$, show that: $p = \frac{1}{1 + \exp(-a)}$
Sketch this function and find its relationship to the hyperbolic tangent function $\tanh(u) = \frac{e^u - e^{-u}}{e^u + e^{-u}}$. It will be useful to be fluent in base-2 logarithms also. If $b = \log_2 \frac{p}{q}$, what is $p$ as a function of $b$?
From $a = \ln \frac{p}{q}$ we get $e^a = \frac{p}{1 - p}$, and solving for $p$ gives $p = \frac{e^a}{1 + e^a} = \frac{1}{1 + e^{-a}}$. This is the sigmoid (logistic) function $\sigma(a)$. A sketch:
p
1 | ___________
| __/
| /
0.5 |_ _ _ _ _ _ _ _ _ _/_ _ _ _ _ _ _ _
| /
| __/
0 |______________/
+--|---|---|---|---|---|---|---|--> a
-4 -3 -2 -1 0 1 2 3 4
Observe that $\tanh(a/2) = \frac{e^{a/2} - e^{-a/2}}{e^{a/2} + e^{-a/2}}$:
If we multiply by $e^{a/2}$ in the numerator and denominator of the sigmoid: $\sigma(a) = \frac{1}{1 + e^{-a}} = \frac{e^{a/2}}{e^{a/2} + e^{-a/2}}$
It remains to add $1$ to $\tanh(a/2)$ and divide by $2$, so we see that: $\frac{1 + \tanh(a/2)}{2} = \frac{e^{a/2}}{e^{a/2} + e^{-a/2}} = \sigma(a)$
The sigmoid is thus just a shifted and scaled version of the hyperbolic tangent.
If $b = \log_2 \frac{p}{q}$, instead of the exponential function we have a power of $2$, so: $p = \frac{1}{1 + 2^{-b}}$
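Both identities can be checked numerically (a sketch; the test points are arbitrary):

```python
from math import exp, tanh

def sigmoid(a):
    return 1 / (1 + exp(-a))

# sigma(a) = (1 + tanh(a/2)) / 2 at a few arbitrary points.
for a in [-4.0, -1.0, 0.0, 0.5, 3.0]:
    assert abs(sigmoid(a) - (1 + tanh(a / 2)) / 2) < 1e-12

# Base-2 log-odds b = log2(p/q) inverts to p = 1 / (1 + 2**(-b)).
b = 3.0
p = 1 / (1 + 2 ** (-b))
print(p)  # 8/9, since p/q = 2**3 = 8
```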
Exercise 2.18. Let $x$ and $y$ be dependent random variables with $x$ a binary variable taking values in $\{0, 1\}$. Use Bayes' theorem to show that the log posterior probability ratio for $x$ given $y$ is $\log \frac{P(x = 1 \mid y)}{P(x = 0 \mid y)} = \log \frac{P(y \mid x = 1)}{P(y \mid x = 0)} + \log \frac{P(x = 1)}{P(x = 0)}$
By Bayes' theorem, $P(x \mid y) = P(y \mid x)\,P(x) / P(y)$. Since $P(y)$ is the same for each outcome of $x$, it cancels out in the ratio, and the result follows from the properties of logarithms.
Exercise 2.19. Let $x$, $d_1$ and $d_2$ be random variables such that $d_1$ and $d_2$ are conditionally independent given a binary variable $x$. Use Bayes' theorem to show that the posterior probability ratio for $x$ given $\{d_i\}$ is: $\frac{P(x = 1 \mid \{d_i\})}{P(x = 0 \mid \{d_i\})} = \frac{P(d_1 \mid x = 1)}{P(d_1 \mid x = 0)} \frac{P(d_2 \mid x = 1)}{P(d_2 \mid x = 0)} \frac{P(x = 1)}{P(x = 0)}$
This is just applying Bayes' rule and using the conditional independence $P(d_1, d_2 \mid x) = P(d_1 \mid x)\,P(d_2 \mid x)$; the shared denominator $P(\{d_i\})$ cancels in the ratio.
Exercise 2.20. Consider a sphere of radius $r$ in an $N$-dimensional real space. Show that the fraction of the volume of the sphere that is in the surface shell lying at values of the radius between $r - \epsilon$ and $r$, where $0 < \epsilon < r$, is: $f = 1 - \left(1 - \frac{\epsilon}{r}\right)^N$
Evaluate $f$ for the cases $N = 2$, $N = 10$ and $N = 1000$, with (a) $\epsilon/r = 0.01$; (b) $\epsilon/r = 0.5$. Implication: points that are uniformly distributed in a sphere in $N$ dimensions, where $N$ is large, are very likely to be in a thin shell near the surface.
The formula for the volume of a hypersphere is: $V(r, N) = c_N r^N$
The factor $c_N$ depends only on the dimension, so we treat it as a constant. Then the ratio we desire is the volume of the larger sphere minus the smaller sphere, divided by the larger sphere: $f = \frac{c_N r^N - c_N (r - \epsilon)^N}{c_N r^N} = 1 - \left(1 - \frac{\epsilon}{r}\right)^N$
Evaluating $f$:
| $N$ | $f$ at $\epsilon/r = 0.01$ | $f$ at $\epsilon/r = 0.5$ |
|---|---|---|
| 2 | 0.02 | 0.75 |
| 10 | 0.096 | 0.999 |
| 1000 | 0.99996 | 1.0 |
This confirms the implication: in high dimensions, almost all the volume is concentrated in a thin shell near the surface.
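The shell fractions in the table come straight from the formula (a sketch):

```python
def shell_fraction(N, eps_over_r):
    """Fraction of an N-dimensional sphere's volume in the outer shell."""
    return 1 - (1 - eps_over_r) ** N

for N in (2, 10, 1000):
    print(N, round(shell_fraction(N, 0.01), 5), round(shell_fraction(N, 0.5), 5))
```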
Exercise 2.21. Let $p_a = 0.1$, $p_b = 0.2$, and $p_c = 0.7$. Let $f(a) = 10$, $f(b) = 5$, and $f(c) = 10/7$. What is $E[f(x)]$? What is $E[1/P(x)]$?
$E[f(x)] = 0.1 \times 10 + 0.2 \times 5 + 0.7 \times \frac{10}{7} = 3$.
$f(x) = 1/P(x)$ above, so $E[1/P(x)] = 3$ as well.
Exercise 2.22. For an arbitrary ensemble, what is $E[1/P(x)]$?
This would be the cardinality $|\mathcal{A}_X|$, since the value is $1/p_i$ for each unique event: $E[1/P(x)] = \sum_i p_i \cdot \frac{1}{p_i} = |\mathcal{A}_X|$
Exercise 2.23. Let $p_a = 0.1$, $p_b = 0.2$, and $p_c = 0.7$. Let $g(a) = 0$, $g(b) = 1$, and $g(c) = 0$. What is $E[g(x)]$?
$E[g(x)] = 0.1 \times 0 + 0.2 \times 1 + 0.7 \times 0 = 0.2$.
Exercise 2.24. Let $p_a = 0.1$, $p_b = 0.2$, and $p_c = 0.7$. What is the probability that $P(x) \in [0.15, 0.5]$? What is the probability that $P(x) \notin [0.15, 0.5]$?
For the first question, only event $b$ satisfies $P(x) \in [0.15, 0.5]$ (with $P(b) = 0.2$), so the answer is $0.2$.
For the second question, we check the events against each side of the interval:
- Above the interval ($P(x) > 0.5$), only event $c$ ($P(c) = 0.7$) fulfils the criteria; event $b$ fails the criteria.
- Below the interval ($P(x) < 0.15$), event $a$ has $P(a) = 0.1$, which meets the criteria.
So the total probability is $0.1 + 0.7 = 0.8$.
Exercise 2.25. Prove the assertion that $H(X) \leq \log |\mathcal{A}_X|$ with equality iff $p_i = 1/|\mathcal{A}_X|$ for all $i$. ($|\mathcal{A}_X|$ denotes the number of elements in the set $\mathcal{A}_X$.) [Hint: use Jensen's inequality; if your first attempt to use Jensen does not succeed, remember that Jensen involves both a random variable and a function, and you have quite a lot of freedom in choosing these; think about whether your chosen function should be convex or concave.]
Recall that entropy is: $H(X) = E\!\left[\log \frac{1}{P(x)}\right]$
And that if $f$ is concave: $E[f(u)] \leq f(E[u])$
Let $u$ denote the random variable that follows the probability distribution of $x$ but whose value is $1/P(x)$. It follows that (see Exercise 2.22): $E[u] = |\mathcal{A}_X|$
Let $f$ denote the $\log$ function, which is concave. By Jensen's inequality it follows that: $H(X) = E[\log u] \leq \log E[u] = \log |\mathcal{A}_X|$
Which completes the proof.
Exercise 2.26. Prove that the relative entropy satisfies $D_{\mathrm{KL}}(P \| Q) \geq 0$ (Gibbs' inequality) with equality only if $P = Q$.
Similar strategy to the previous question. Let us define $u$ as the random variable with the same probability distribution as $P$ but with value: $u = \frac{Q(x)}{P(x)}$
Let us also denote $f(u) = \log \frac{1}{u}$, which is strictly convex. It follows that: $E[f(u)] = \sum_x P(x) \log \frac{P(x)}{Q(x)} = D_{\mathrm{KL}}(P \| Q)$
On the other side: $f(E[u]) = f\!\left(\sum_x P(x) \frac{Q(x)}{P(x)}\right) = f(1) = 0$
Using Jensen's inequality, $E[f(u)] \geq f(E[u]) = 0$, gives us Gibbs' inequality.
For the equality case, note that $f$ is strictly convex. Now suppose (in the only if case) $D_{\mathrm{KL}}(P \| Q) = 0$; we want to show that $P = Q$.
When the KL divergence is $0$, it implies that: $E[f(u)] = 0 = f(E[u])$
With this condition satisfied, Jensen's tells us that the random variable $u$ is a constant. This means that: $Q(x) = c\,P(x) \ \text{for all } x$
Where $c$ is some constant. And since both $P$ and $Q$ are valid probability distributions (each summing to $1$), the only possible value for $c$ is $1$, meaning that $P = Q$.
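A numerical illustration of Gibbs' inequality and the asymmetry of relative entropy (a sketch; the distributions are arbitrary):

```python
from math import log2

def kl(p, q):
    """D_KL(P || Q) in bits; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]
q = [1 / 3, 1 / 3, 1 / 3]

print(kl(p, q) > 0)   # True: Gibbs' inequality, since P != Q
print(kl(p, p))       # 0.0: equality iff P = Q
print(round(kl(p, q), 4), round(kl(q, p), 4))  # the two directions differ
```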
Exercise 2.27. Prove that the entropy is indeed decomposable.
Already shown above.
Exercise 2.28. A random variable $x \in \{0, 1, 2, 3\}$ is selected by flipping a bent coin with bias $f$ to determine whether the outcome is in $\{0, 1\}$ or $\{2, 3\}$; then either flipping a second bent coin with bias $g$ or a third bent coin with bias $h$ respectively. Write down the probability distribution of $x$. Use the decomposability of the entropy (2.44) to find the entropy of $X$. [Notice how compact an expression is obtained if you make use of the binary entropy function $H_2(x)$, compared with writing out the four-term entropy explicitly.] Find the derivative of $H(X)$ with respect to $f$. [Hint: $\frac{d H_2(x)}{dx} = \log \frac{1 - x}{x}$.]
Using decomposability: the distribution is $\mathbf{p} = \{fg,\ f(1 - g),\ (1 - f)h,\ (1 - f)(1 - h)\}$, so $H(X) = H_2(f) + f H_2(g) + (1 - f) H_2(h)$, and $\frac{\partial H}{\partial f} = \log \frac{1 - f}{f} + H_2(g) - H_2(h)$.
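The compact expression can be checked against the four-term entropy directly (a sketch; the biases $f$, $g$, $h$ are arbitrary illustrative values):

```python
from math import log2

def H2(x):
    """Binary entropy function in bits."""
    return x * log2(1 / x) + (1 - x) * log2(1 / (1 - x))

f, g, h = 0.3, 0.6, 0.25
p = [f * g, f * (1 - g), (1 - f) * h, (1 - f) * (1 - h)]

four_term = sum(pi * log2(1 / pi) for pi in p)
compact = H2(f) + f * H2(g) + (1 - f) * H2(h)
print(round(four_term, 6) == round(compact, 6))  # True
```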
Exercise 2.29. An unbiased coin is flipped until one head is thrown. What is the entropy of the random variable $r$, the number of flips? Repeat the calculation for the case of a biased coin with probability $f$ of coming up heads. [Hint: solve the problem both directly and by using the decomposability of the entropy (2.43).]
The probability distribution is like so:
- $P(r = 1) = 1/2$: outcome is H
- $P(r = 2) = 1/4$: outcome is TH
- $P(r = 3) = 1/8$: outcome is TTH
- And so on: $P(r) = (1/2)^r$
Using the direct method, the entropy of $r$ is: $H(r) = \sum_{r=1}^{\infty} \frac{1}{2^r} \log_2 2^r = \sum_{r=1}^{\infty} \frac{r}{2^r} = 2 \text{ bits}$
Using the decomposability method, the first coin flip contributes entropy $H_2(1/2)$. Thereafter, there are two groups:
- The first group is when the first flip was H. In this branch, the value of $r$ becomes fixed/determined, and the entropy for this group becomes $0$
- The second group (first flip T) brings us to the same setting as what we started with
This gives us a recursive definition for $H$: $H = H_2(1/2) + \frac{1}{2} \cdot 0 + \frac{1}{2} H$, so $H = 2$ bits. For the biased coin, $H = H_2(f) + (1 - f) H$, so $H = H_2(f) / f$.
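Both routes can be compared numerically by truncating the geometric series once the tail is negligible (a sketch; $f = 0.2$ is an arbitrary bias):

```python
from math import log2

def H2(x):
    """Binary entropy function in bits."""
    return x * log2(1 / x) + (1 - x) * log2(1 / (1 - x))

f = 0.2  # probability of heads
# Direct sum over P(r) = (1-f)**(r-1) * f, truncated where the tail vanishes.
direct = sum((1 - f) ** (r - 1) * f * log2(1 / ((1 - f) ** (r - 1) * f))
             for r in range(1, 1000))
print(round(direct, 6), round(H2(f) / f, 6))  # the two agree
```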
Exercise 2.30. An urn contains white balls and black balls. Two balls are drawn, one after the other, without replacement. Prove that the probability that the first ball is white is equal to the probability that the second is white.
Exercise 2.31. A circular coin of diameter $a$ is thrown onto a square grid whose squares are $b \times b$ ($a < b$). What is the probability that the coin will lie entirely within one square? [Ans: $(1 - a/b)^2$]
Exercise 2.32. Buffon's needle. A needle of length $a$ is thrown onto a plane covered with equally spaced parallel lines with separation $b$. What is the probability that the needle will cross a line? [Ans, if $a < b$: $\frac{2a}{\pi b}$] [Generalization, Buffon's noodle: on average, a random curve of length $A$ is expected to intersect the lines $\frac{2A}{\pi b}$ times.]
Exercise 2.33. Two points are selected at random on a straight line segment of length 1. What is the probability that a triangle can be constructed out of the three resulting segments?
Exercise 2.34. An unbiased coin is flipped until one head is thrown. What is the expected number of tails and the expected number of heads? Fred, who doesn't know that the coin is unbiased, estimates the bias using $\hat{f} \equiv h/(h + t)$, where $h$ and $t$ are the numbers of heads and tails tossed. Compute and sketch the probability distribution of $\hat{f}$. N.B., this is a forward probability problem, a sampling theory problem, not an inference problem. Don't use Bayes' theorem.
Exercise 2.35. Fred rolls an unbiased six-sided die once per second, noting the occasions when the outcome is a six. (a) What is the mean number of rolls from one six to the next six?
(b) Between two rolls, the clock strikes one. What is the mean number of rolls until the next six? (c) Now think back before the clock struck. What is the mean number of rolls, going back in time, until the most recent six?
(d) What is the mean number of rolls from the six before the clock struck to the next six?
(e) Is your answer to (d) different from your answer to (a)? Explain.
Another version of this exercise refers to Fred waiting for a bus at a bus-stop in Poissonville where buses arrive independently at random (a Poisson process), with, on average, one bus every six minutes. What is the average wait for a bus, after Fred arrives at the stop? [6 minutes.] So what is the time between the two buses, the one that Fred just missed, and the one that he catches? [12 minutes.] Explain the apparent paradox. Note the contrast with the situation in Clockville, where the buses are spaced exactly 6 minutes apart. There, as you can confirm, the mean wait at a bus-stop is 3 minutes, and the time between the missed bus and the next one is 6 minutes.
Conditional probability
Exercise 2.36. You meet Fred. Fred tells you he has two brothers, Alf and Bob. What is the probability that Fred is older than Bob? Fred tells you that he is older than Alf. Now, what is the probability that Fred is older than Bob? (That is, what is the conditional probability that Fred is older than Bob, given that Fred is older than Alf?)
Exercise 2.37. The inhabitants of an island tell the truth one third of the time. They lie with probability $2/3$. On an occasion, after one of them made a statement, you ask another 'was that statement true?' and he says 'yes'. What is the probability that the statement was indeed true?
Exercise 2.38. Compare two ways of computing the probability of error of the repetition code $R_3$, assuming a binary symmetric channel with noise level $f = 0.1$ (you did this once for exercise 1.2 (p.7)) and confirm that they give the same answer. Binomial distribution method. Add the probability that all three bits are flipped to the probability that exactly two bits are flipped. Sum rule method. Using the sum rule, compute the marginal probability that $\mathbf{r}$ takes on each of the eight possible values, $P(\mathbf{r})$. Then compute the posterior probability of $s$ for each of the eight values of $\mathbf{r}$. [In fact, by symmetry, only two example cases $\mathbf{r} = 000$ and $\mathbf{r} = 001$ need be considered.] Notice that some of the inferred bits are better determined than others. From the posterior probability $P(s \mid \mathbf{r})$ you can read out the case-by-case error probability, the probability that the more probable hypothesis is not correct, $P(\text{error} \mid \mathbf{r})$. Find the average error probability using the sum rule, $P(\text{error}) = \sum_{\mathbf{r}} P(\mathbf{r})\,P(\text{error} \mid \mathbf{r})$
Exercise 2.39. The frequency $p_n$ of the $n$th most frequent word in English is roughly approximated by $p_n \approx \frac{0.1}{n}$ for $n \leq 12{,}367$, and $p_n = 0$ for larger $n$.
[This remarkable law is known as Zipf's law, and applies to the word frequencies of many languages (Zipf, 1949).] If we assume that English is generated by picking words at random according to this distribution, what is the entropy of English (per word)? [This calculation can be found in ‘Prediction and entropy of printed English’, C.E. Shannon, Bell Syst. Tech. J. 30, pp.50–64 (1950), but, inexplicably, the great man made numerical errors in it.]