Things I don't understand: Bayes' theorem


In which I lay down foundations

Probability is interesting to me for two reasons:

  • It's useful.
  • It's a generalisation of logic [1].

The problem is, I have never had any formal training in it. Not even in high school. So that leaves me in a bit of a limbo because I see understanding probability as one of the cornerstones of Good Thinking.

This is my attempt at curing that.

What I shall do is start from the axioms, then prove theorems as I tackle classic problems. This means three things: a) this page will be a work in progress indefinitely [2], b) the vocabulary I build here may not correspond to the standard one, and c) I'll probably get things wrong a lot, so this may not be the best page to cite in support of your Internet argument.

Before we start, I'd like to introduce a rule and a notion. I won't start from the very foundations of mathematics because that would just waste everyone's time. But I really admire the rigor of the Bourbaki school (in great part because I'm still in my "rigorous phase" [3]). So I shall follow a simple rule: every assertion must have a proof or a reference to one. And to do that more efficiently I'll borrow a notion from programming and "import" complete mathematical objects in this manner:

Eventually, those external links should become internal ones.

Right. Let's get to work. Here are some definitions:

  • DEF: (data point)

    a value (e.g., "red", "42.4 seconds")

  • DEF: (data set)

    a set of data points

  • DEF: (universe, the Universal Data Set)

    $ \Omega$, or the data set containing all data points (so every data set is a subset of $ \Omega$)

  • DEF: (probability)

    the probability $ P(E)$ of a data set $ E$ is a real number associated with $ E$

These are the basic building blocks of probability theory.

As far as I know, probability theory is then completely axiomatised by the following three axioms (which come from A. Kolmogorov, according to Wikipedia):

  • AXM 1: ("All probabilities are non-negative.")

    $ (\forall S \subseteq \Omega)(P(S) \geq 0)$

  • AXM 2: ("The probability of the Universal Data Set is 1.")

    $ P(\Omega) = 1$

  • AXM 3: ("The probabilities of disjoint data sets are additive.")

    $ (\forall E_1, E_2 \subseteq \Omega)( (E_1 \cap E_2 = \emptyset) \implies (P(E_1 \cup E_2) = P(E_1) + P(E_2)) )$
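
To make the axioms concrete for myself, here is a minimal Python sketch on a small finite universe. The fair six-sided die, the uniform weighting, and the names `OMEGA`, `prob`, and `subsets` are my own illustrative assumptions, not part of the axioms. It uses exact fractions and checks all three axioms by brute force over every subset of the universe.

```python
from fractions import Fraction
from itertools import chain, combinations

# Assumption: a fair six-sided die as the universe (any finite set would do).
OMEGA = frozenset(range(1, 7))

def prob(event):
    """Uniform probability: the share of the universe's data points in `event`."""
    return Fraction(len(event & OMEGA), len(OMEGA))

def subsets(universe):
    """Every subset of `universe`, as frozensets."""
    items = list(universe)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))]

ALL_EVENTS = subsets(OMEGA)

# AXM 1: no probability is negative.
axm1 = all(prob(s) >= 0 for s in ALL_EVENTS)

# AXM 2: the universe has probability 1.
axm2 = prob(OMEGA) == 1

# AXM 3: disjoint data sets have additive probabilities.
axm3 = all(prob(a | b) == prob(a) + prob(b)
           for a in ALL_EVENTS for b in ALL_EVENTS if not (a & b))

print(axm1, axm2, axm3)  # True True True
```

This only shows that one particular assignment of probabilities satisfies the axioms. It proves nothing about probabilities in general.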

Using these we can already say a few basic facts about probabilities.

  • THM 1: "The probability of the empty set is 0." (i.e., $ P(\emptyset) = 0$)

$$\begin{align} &\rightsquigarrow \emptyset \cap \Omega = \emptyset \\ &\implies P(\emptyset \cup \Omega) = P(\Omega) = P(\emptyset) + P(\Omega) \\ &\therefore P(\emptyset) = 0 \end{align}$$

  • THM 2: "The probability of the complement of a data set is one (1) minus the probability of the original." (i.e., $P(A^c) = 1 - P(A)$)

$$\begin{align} &\rightsquigarrow A \cap A^c = \emptyset \\ &\implies P(A \cup A^c) = P(\Omega) = 1 = P(A) + P(A^c) \\ &\therefore P(A^c) = 1 - P(A) \end{align}$$
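
As a sanity check on THM 1 and THM 2, here is a small Python sketch. The loaded four-sided die, its weights, and the names `WEIGHTS`, `prob`, and `complement` are made up for illustration. The only requirement is that the weights are non-negative and sum to 1.

```python
from fractions import Fraction

# Assumption: a loaded four-sided die; the weights are made up and sum to 1.
WEIGHTS = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8), 4: Fraction(1, 8)}
OMEGA = frozenset(WEIGHTS)

def prob(event):
    """Probability of a data set: the sum of the weights of its data points."""
    return sum((WEIGHTS[x] for x in event), Fraction(0))

def complement(event):
    """Every data point of the universe that is not in `event`."""
    return OMEGA - event

A = frozenset({1, 3})

thm1 = prob(frozenset())        # THM 1: P(empty set)
thm2_lhs = prob(complement(A))  # THM 2: P(A^c)
thm2_rhs = 1 - prob(A)          # THM 2: 1 - P(A)

print(thm1, thm2_lhs, thm2_rhs)  # 0 3/8 3/8
```

A check on one example is not a proof, of course. The proofs above are what carry the weight.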

What we have though is still too bare. It lacks flavour. So let's define a few more things:

  • DEF: (reduced universe)

    the reduced universe $ \Omega_E$ of a data set $ E$ is the subset of the universe $ \Omega$ where $ E$ is true

  • DEF: (joint probability)

    the probability $ P(A \cap B)$, or the probability of both $ A$ and $ B$ being true

  • DEF: (conditional probability)

    the probability $ P(A|B) = \frac{P(A \cap B)}{P(B)}$ (defined only when $ P(B) > 0$), or the probability of $ A$ being true given that $ B$ is true

  • DEF: (independence)

    two data sets $ A$ and $ B$ are independent if $ P(A \cap B) = P(A)P(B)$
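
Here is a rough Python sketch of the joint, conditional, and independence definitions. The fair six-sided die and the data sets `EVEN`, `AT_MOST_FOUR`, and `AT_LEAST_FOUR` are my own illustrative choices, and the function names are hypothetical. Note that `conditional` refuses to divide by zero, since the ratio makes no sense when the conditioning data set has probability 0.

```python
from fractions import Fraction

# Assumption: a fair six-sided die; the data sets below are illustrative.
OMEGA = frozenset(range(1, 7))

def prob(event):
    """Uniform probability: the share of the universe's data points in `event`."""
    return Fraction(len(event & OMEGA), len(OMEGA))

def joint(a, b):
    """Joint probability P(A and B)."""
    return prob(a & b)

def conditional(a, b):
    """Conditional probability P(A|B) = P(A and B) / P(B)."""
    if prob(b) == 0:
        raise ValueError("P(A|B) is undefined when P(B) = 0")
    return joint(a, b) / prob(b)

def independent(a, b):
    """True when P(A and B) = P(A) * P(B)."""
    return joint(a, b) == prob(a) * prob(b)

EVEN = frozenset({2, 4, 6})
AT_MOST_FOUR = frozenset({1, 2, 3, 4})
AT_LEAST_FOUR = frozenset({4, 5, 6})

print(conditional(EVEN, AT_MOST_FOUR), independent(EVEN, AT_MOST_FOUR))    # 1/2 True
print(conditional(EVEN, AT_LEAST_FOUR), independent(EVEN, AT_LEAST_FOUR))  # 2/3 False
```

Knowing the roll is at most four leaves the chance of an even roll at 1/2, so those two data sets are independent. Knowing the roll is at least four raises it to 2/3, so those two are not.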

What are these definitions for?

  • We defined the reduced universe as such because we want to be able to say, "In the universe where data set A is true...".
  • What do we mean by a data set being true in the first place? Say $ A = \{\text{"Alice is a big mouse."}\}$. Then, in a particular universe, $ A$ is true if Alice is a mouse and she is big, and false otherwise. The truthiness of $ A$ is the truthiness of all its conditions.
  • Why the definition of conditional probability? We want to have a way of saying, "The truth of A depends on the truth of B by this much."
  • In this vein, saying that two data sets are independent is saying that whether or not $ B$ is true does not affect whether or not $ A$ is true.
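
The last point can be made precise with a short derivation, sketched here under the extra assumption that $ P(B) > 0$ so that the conditional probability exists. If $ A$ and $ B$ are independent, then conditioning on $ B$ leaves the probability of $ A$ unchanged:

$$\begin{align} &\rightsquigarrow P(A \cap B) = P(A)P(B) \\ &\implies P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)P(B)}{P(B)} \\ &\therefore P(A|B) = P(A) \end{align}$$

Running the steps backwards (multiply both sides of $ P(A|B) = P(A)$ by $ P(B)$) recovers the definition of independence, so when $ P(B) > 0$ the two statements say the same thing.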

(to be continued...)


  • Good Thinking

    following what works; see the Twelve Virtues of Rationality, particularly the twelfth virtue

  • Things I don't understand series

    my own mental models of things, organised as best as I can

  1. E. T. Jaynes. "Probability: The Logic of Science." 1995. Print.

  2. See the About section of

  3. See Terry Tao's post.