The basic axioms of probability and the idea of a random variable as a function from the sample space to \(\mathbb{R}\) are quite abstract and rather confusing at first pass. However, the axioms can be well motivated from our intuition, and defining random variables simply as functions turns out to be a brilliant and intuitive abstraction. My goal in this post is to try to explain the ideas behind the axiomatization of probability theory, and hopefully make the study of measure-theoretic probability seem a bit less intimidating.
Mathematics and Axiomatization
Exactly what mathematics is cannot easily be pinned down, and trying to give some sort of clear formal definition would be senseless. However, it is easy to say what it is not: it is not simply a prescribed set of rules for calculating numbers, or “solving for \(x\)” – a prescribed set of calculating rules has another name: an algorithm.
In contrast to the mechanical process described by an algorithm, mathematics is a highly creative endeavour. Indeed, many mathematicians are motivated largely by the aesthetics and the “beauty” of their results. To many mathematicians, it is the abstraction and the concepts themselves that are of interest, and they may not even be entirely interested in whether or not the results and calculations are \(100\%\) correct! A great book about the process of doing math (and one that can be found for a rather low price) is The Mathematical Experience 💁♂️
Still, many mathematicians are at the same time concerned with correctness and rigor. That is, they want to be able to say with supreme confidence that their results are true and correct, with no room for doubt. Essential to this is the idea of a mathematical proof – a sequence of logical deductions which begin from some specified premises \(P\) and lead inexorably to some conclusion \(Q\). If the logical deductions are correct, then the mathematician has rigorously obtained a proof of the implication \(P \implies Q\). That is, given that the premises \(P\) are true, it must be the case that the conclusions \(Q\) are true. Mathematicians call this \(P \implies Q\) implication a Theorem.
Once a mathematician (or a group of them, perhaps spanning continents and generations) has amassed a rather sizable number of theorems, and understands something about the various connections amongst them, the mathematician might attempt to lay down a fundamental set of premises from which all of their conclusions can be derived. These fundamental premises, which cannot be derived (as far as the mathematician can tell) from any other still more fundamental premises, are called axioms. The axioms are the basic starting point for a theory, that is, for all the further conclusions the mathematician has derived. An axiom is ultimately just a premise, but the term implies that it is a foundational one – i.e., something self-evident enough to serve as a starting point.
The most famous set of axioms is Euclid’s Axioms , laying the classical foundation for geometry. Another is ZF Set Theory , which lays the foundations for set theory, and therefore for more-or-less all of modern “mainstream” mathematics. The axioms I care about discussing in this post are the axioms of Probability Theory .
As opposed to the rigorous series of logical deductions by which proofs proceed, the way that the idea of a Theorem comes about (i.e., the thought that some implication might possibly be true) is much more creative and messy. Often, one begins with a conclusion \(Q\) that “seems true”, and then proceeds to search for the set of premises \(P\) that make it true, possibly going back and forth many times tweaking the premises and the conclusion – a process reminiscent of Rawls’s reflective equilibrium in philosophy. The mathematician is done with this inquiry when they have obtained a Theorem which they think is “good”. I have seen it claimed that a Theorem is good if it is surprising, deep, and can be used to derive other good theorems. My memory tells me that this should be attributed to Polya, but I cannot find a reference (please email me if you know where this is from!).
As well, we usually come up with axioms long after having figured out most of the conclusions (possibly using stronger or more numerous premises). Axiomatization is a process of cleaning up the theory and placing it upon a nice, beautiful, and simple foundation. Thus, I think it can be rather confusing, from a didactic standpoint, to write textbooks or teach classes which simply begin with the axioms and proceed to derive all of the conclusions – the student is left confused, wondering what the axioms mean, where they came from, etc. To be comfortable in following this process straight from the axioms requires some amount of experience and mathematical maturity.
Indeed, the process of derivation from a set of axioms is not reflective of how the theory was developed, and it is hardly even reflective of how mathematicians think about the theory themselves. My hope is that the rest of this post can clarify some of the intuition behind the axioms of probability.
The Axioms of Probability
The purpose of probability theory is to provide a set of mathematical tools for reasoning about events which are random. Exactly what random means is on one hand left to intuition (e.g., we think of a dice roll as random, because we have no practical means of determining what the outcome will be), and on the other hand, is rigorously axiomatized by the theory of probability. My purpose here is to help to make the axiomatization itself more intuitive.
The rigorous theory of probability was axiomatized in 1933 by Russian mathematician Andrey Kolmogorov (1903-1987). This axiomatization begins with a set \(\Omega\) called the sample space. What this set is exactly need not be fully specified – it is just some abstract set, and it may be a finite set like \(\{\mathsf{H}, \mathsf{T}\}\) (to model flipping a coin), it may be a countably infinite set like \(\{1, 2, \ldots\}\), it could be an uncountably infinite set like \(\mathbb{R}\), or some even more abstract set.
To set the stage, the following sections will introduce the probability triple \((\Omega, \mathcal{F}, \mathbb{P})\) consisting of a sample space \(\Omega\), a $σ$-algebra \(\mathcal{F}\) thereof, and the probability measure \(\mathbb{P}\). On this triple we axiomatize probability with the following three premises:
- The domain of \(\mathbb{P}\) is \(\mathcal{F}\) and for any \(S \in \mathcal{F}\) we have \(\mathbb{P}(S) \ge 0\) (Non-negativity)
- \(\mathbb{P}(\Omega) = 1\) (Normalization)
- If \(S_1, S_2, \ldots \in \mathcal{F}\) are disjoint, then \(\mathbb{P}\bigl(\bigcup_{i = 1}^\infty S_i \bigr) = \sum_{i = 1}^\infty \mathbb{P}(S_i)\) (Countable Additivity)
By the end of this section, I hope the reader will have some understanding of what all of this means, and why these axioms are so beautiful. It is remarkable that probability theory flows from these three simple statements!
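To make the axioms concrete before moving on, here is a minimal sketch (my own, not part of the original development) that checks them on the humble example of a fair six-sided die; for a finite sample space, countable additivity reduces to the finite case.

```python
# A minimal sketch (not from the original post): the three axioms checked
# on the finite sample space of a fair six-sided die.
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}                     # sample space
prob = {w: Fraction(1, 6) for w in omega}      # P({w}) for each sample

def P(event):
    """Probability of an event, i.e., a subset of omega."""
    return sum(prob[w] for w in event)

assert all(P({w}) >= 0 for w in omega)         # Axiom 1: non-negativity
assert P(omega) == 1                           # Axiom 2: normalization
evens, odds = {2, 4, 6}, {1, 3, 5}             # two disjoint events
assert P(evens | odds) == P(evens) + P(odds)   # Axiom 3 (finite case)
print("all three axioms hold for the die")
```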
Samples and Events
In probability theory, an element \(\omega \in \Omega\) of the sample space is called a sample. We don’t know what this sample is, but it encodes the state of the universe. Subsets \(S \subseteq \Omega\) are called events (actually, they must be elements \(S \in \mathcal{F}\), which is a set of subsets of \(\Omega\) – more on this later), and we imagine that an event has occurred if \(\omega \in S\).
Intuitively, there are many states of the universe that might lead to the same event; e.g., the outcome of a coin toss ultimately depends upon, at least, the arrangement of molecules in the air through which the coin travels, and there are many more than two such arrangements. Thus, there are many \(\omega\) (arrangements of molecules) which result in the “Heads” outcome of the coin toss.
When we do modelling in practice, we usually don’t work directly with \(\Omega\). Instead, we work with random variables, which are functions from the sample space into \(\mathbb{R}\). The sample space is kept as an abstract entity “in the background”. But, before we discuss this additional layer of abstraction, we need to understand \((\Omega, \mathcal{F}, \mathbb{P})\).
Nature’s Dartboard
On an intuitive level, I think of the set \(\Omega\) as being Nature’s Dartboard – it is just a disc, and “Mother Nature” is throwing a dart at it. The exact location at which the dart lands selects for us a state of the universe \(\omega \in \Omega\).
In addition to the set \(\Omega\), we are given a function \(\mathbb{P}\) called a probability measure. The function \(\mathbb{P}\) is a bit more abstract than the sort of function (a curve) that one might usually encounter in calculus. Instead, \(\mathbb{P}\) is a function which maps subsets of \(\Omega\) to numbers in the interval \([0, 1]\). That is, if we have some set \(S \subseteq \Omega\), we would like \(\mathbb{P}\) to assign some number to \(S\), as in \(\mathbb{P}(S)\). This number is called the probability of the event (subset), and is directly analogous to the volume of \(S\).
Indeed, if we imagine that Mother Nature throws her dart at the dartboard \(\Omega\) completely “at random” (following our intuition) then our intuitive idea of the probability of the dart landing within some region \(S \subseteq \Omega\) of the dartboard is nothing but \(\text{vol}(S) / \text{vol}(\Omega)\), where \(\text{vol}(S)\) measures the volume (in this case, just an area) of the set \(S\).
Now, if we imagine that \(\mathbb{P}(S) = \text{vol}(S) / \text{vol}(\Omega)\), then the axiom \(\mathbb{P}(S) \ge 0\) is natural since volumes cannot be negative, and \(\mathbb{P}(\Omega) = 1\) is elementary algebra. This is the basic intuition of the axioms of probability – we measure the probability of events \(S\), which are nothing but subsets of the sample space \(\Omega\), by appealing to the volume of the dartboard that they occupy.
Continuing this line of reasoning, imagine we have some set \(S \subseteq \Omega\) which has some probability \(\mathbb{P}(S)\). If we break this set into two pieces (i.e., two disjoint sets \(S_1, S_2\) with \(S_1 \cup S_2 = S\) and \(S_1 \cap S_2 = \emptyset\)) then it stands to reason that we should have \[\mathbb{P}(S) = \mathbb{P}(S_1) + \mathbb{P}(S_2),\] since the volume of one set must be the sum of the volumes of any partition of that set. This is the idea behind Axiom 3, additivity.
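If you like to see intuition backed by computation, here is a rough Monte Carlo sketch (my own illustration) of the dartboard picture: we throw uniform darts at the unit disc, estimate \(\mathbb{P}(S)\) as the fraction landing in \(S\) (i.e., the volume ratio), and check additivity over a partition of \(S\).

```python
# A rough Monte Carlo sketch (my own illustration): estimate
# P(S) = vol(S) / vol(Omega) by throwing uniform darts at the unit disc,
# and check additivity over a two-piece partition of S.
import numpy as np

rng = np.random.default_rng(0)

# Darts land uniformly on the square [-1, 1]^2; keep only those that hit
# the dartboard Omega (the unit disc).
darts = rng.uniform(-1.0, 1.0, size=(1_000_000, 2))
on_board = darts[np.hypot(darts[:, 0], darts[:, 1]) <= 1.0]
x, y = on_board[:, 0], on_board[:, 1]

def P(event_mask):
    """Fraction of darts landing in the event -- the volume ratio."""
    return event_mask.mean()

S = x > 0                    # right half of the board (true area ratio 1/2)
S1 = (x > 0) & (y > 0)       # upper-right quarter
S2 = (x > 0) & (y <= 0)      # lower-right quarter, disjoint from S1

print(P(S))                  # ~ 0.5
print(P(S1) + P(S2))         # ~ P(S): additivity over the partition
```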
That Axiom 3 is stated for countably infinite partitions of events (i.e., infinite sums) is a technical condition: countable additivity. Simple finite additivity turns out not to be strong enough to result in a rich enough theory. Making the axioms work rigorously for these infinite sums, and for infinite sample spaces, is the whole point of the formalism surrounding the $σ$-algebra \(\mathcal{F}\). We turn to this next.
The Banach-Tarski Paradox: Volume and Measure
The object \(\mathcal{F}\) is called a $σ$-algebra of subsets of \(\Omega\), and it is in general a subset of the set of all subsets of \(\Omega\). Understanding this $σ$-algebra is part of the difficult technical work that goes into learning rigorous probability theory, but the purpose of \(\mathcal{F}\) is to eliminate certain pathological sets in \(2^\Omega\) which are not measurable (see: Measure Theory ). That is, there are some sets (e.g., Vitali Sets ) for which it simply makes no sense to assign a notion of volume, and the issue arises directly as a result of dealing with infinite sets.
A famous example illustrating this issue is the Banach-Tarski paradox. This paradox shows that we can take a sphere, partition it into a finite number of pieces (as few as five), and then, after rotating and translating those pieces in a clever way, reassemble two copies of the original sphere. Here is a Video explanation of the procedure as well as a Quanta Magazine article . The fact that we can take a set, partition that set, and then reassemble two identical copies of that set is rather problematic when it comes to assigning volumes! It seems that we can derive \(\mathbb{P}(S) = 2\mathbb{P}(S)\) where \(S\) is a sphere! This is a pathology that needs to be eliminated.
Indeed, we would really like for it to be the case that \(\mathbb{P}(A \cup B) = \mathbb{P}(A) + \mathbb{P}(B)\) if \(A \cap B = \emptyset\) and where \(\mathbb{P}(S) = \text{vol}(S) / \text{vol}(\Omega)\) is consistent with our notion of volume. But, this can only be done if we simply exclude a-priori so-called non-measurable sets. This is more-or-less what the $σ$-algebra \(\mathcal{F}\) is doing – it is a special collection of subsets of \(\Omega\) which ensures that any \(S \in \mathcal{F}\) is a measurable set to which we can assign a notion of volume that corresponds with our intuition. In the language of probability, it means that only certain subsets of \(\Omega\) can be assigned a probability: the measurable subsets. There are some pathological subsets of Nature’s dartboard which we simply cannot measure.
Specifically, a collection of subsets \(\mathcal{F}\) of \(\Omega\) is a $σ$-algebra of subsets if the following conditions are satisfied:
- \(\Omega \in \mathcal{F}\) (The universe is one of the sets)
- \(A, B \in \mathcal{F} \implies A \setminus B \in \mathcal{F}\) (Closed under set differences)
- If \(A_1, A_2, \ldots \in \mathcal{F}\) and \(A_i \cap A_j = \emptyset\) if \(i \ne j\), then \(\bigcup_{i = 1}^\infty A_i \in \mathcal{F}\) (Closed under countable disjoint union)
Notice that these are basically nothing but the conditions that make the axioms of probability “work”, particularly closure under countable unions. That is, the notion of a $σ$-algebra is cooked up exactly for the purpose of making sure that measure corresponds to our intuition of volume.
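As a sanity check, here is a small brute-force sketch (mine, purely illustrative) that tests the three conditions above for a collection of subsets of a finite \(\Omega\); with a finite \(\Omega\), countable unions reduce to finite ones.

```python
# A small brute-force sketch (my own, purely illustrative): check the three
# sigma-algebra conditions for a collection F of subsets of a finite Omega
# (for a finite Omega, countable unions reduce to finite ones).
from itertools import combinations

def is_sigma_algebra(omega, F):
    F = {frozenset(s) for s in F}
    if frozenset(omega) not in F:                          # condition 1
        return False
    if any(A - B not in F for A in F for B in F):          # condition 2
        return False
    for r in range(2, len(F) + 1):                         # condition 3
        for sets in combinations(F, r):
            if all(a.isdisjoint(b) for a, b in combinations(sets, 2)):
                if frozenset().union(*sets) not in F:
                    return False
    return True

omega = {1, 2, 3, 4}
print(is_sigma_algebra(omega, [set(), {1, 2}, {3, 4}, omega]))   # True
print(is_sigma_algebra(omega, [set(), {1}, omega]))              # False: Omega \ {1} missing
```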
To actually construct a volume measure for the dartboard, one begins by constructing a measure on the real line \(\mathbb{R}\) by defining \(\mathbb{P}([a, b]) = b - a\) (restricting attention to, say, \(\Omega = [0, 1]\) so that the normalization axiom holds), and then generating a $σ$-algebra out of intervals (by taking complements and countable unions etc.). This guarantees that the resulting $σ$-algebra is generated by “reasonable” operations on the sets, excluding those which are not measurable. Then, by putting together the rules of the $σ$-algebra above, and using the basic intuition of \(\mathbb{P}([a, b]) = b - a\), we can eventually construct what is called Lebesgue Measure , a rigorous way of assigning volume to (measurable) subsets of \(\mathbb{R}\). This is enough to get us to a suitable notion of volume, and therefore to a construction of a probability measure \(\mathbb{P}\)!
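On \(\Omega = [0, 1]\), for example, the interval rule together with countable additivity already pins down the measure of any countable disjoint union of intervals; a small worked example: \[\mathbb{P}\Bigl(\bigcup_{i = 1}^\infty \bigl[\tfrac{1}{2^{2i}}, \tfrac{1}{2^{2i - 1}}\bigr]\Bigr) = \sum_{i = 1}^\infty \Bigl(\tfrac{1}{2^{2i - 1}} - \tfrac{1}{2^{2i}}\Bigr) = \sum_{i = 1}^\infty \tfrac{1}{4^i} = \tfrac{1}{3}.\]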
Remark (Axiom of Choice): The problem of measure is not entirely inescapable. If we did not accept the Axiom of Choice , which says, roughly, that given any (possibly infinite) collection of non-empty sets it is possible to choose one element from each, then non-measurable sets could not be constructed, and the problem would disappear. Some mathematicians thus view the existence of non-measurable sets, which are incredibly pathological, as a reason to refuse to accept the axiom of choice. However, we need to pick our poison; there are at the same time a vast number of perfectly intuitive and incredibly useful theorems which depend on the axiom of choice. My favourite example is the Hahn-Banach Theorem , the cornerstone of convex analysis (named after the same Banach as the Banach-Tarski Paradox). Instead of doing without more-or-less all of functional analysis, most instead choose to accept non-measurable sets as a fact of (mathematical) life, and carry around all the baggage associated with $\sigma$-algebras and measure theory.
Random Variables
All of the formalism laid out in understanding the probability triple \((\Omega, \mathcal{F}, \mathbb{P})\) is rather complicated. Fortunately, in most “practical” scenarios where we want to use probability for modelling, we just let this formalism sit back as an abstract foundation, more-or-less out of sight. We instead work more directly with random variables which are defined as nothing but functions from \(\Omega\) into \(\mathbb{R}\) as in: \[X: \Omega \rightarrow \mathbb{R},\] sometimes written as \(X(\omega)\).
The “randomness” thus passes from Nature’s Dartboard \(\Omega\), an abstract set that we don’t want to think about too much, to \(\mathbb{R}\), a natural set that we like to work with. The way this works is as follows – suppose we want to compute the probability that \(X(\omega)\) is less than some real number \(x\), that is \(F(x) = \mathbb{P}(X \le x)\). This is an important function called the cumulative distribution function of \(X\) and it turns out to be enough to fully determine the random behaviour of \(X\). The notation \(\mathbb{P}(X \le x)\) is really nothing but a shorthand for \[F(x) = \mathbb{P}(\{\omega: X(\omega) \le x\}).\]
That is, the common notation \(\mathbb{P}(X \le x)\), which is used to mean “the probability that the random variable \(X\) is less than or equal to the real number \(x\)”, is really a shorthand for \(\mathbb{P}(\{\omega:\ X(\omega) \le x\})\), which is “the volume of the set of all samples \(\omega\) such that the function \(X\), evaluated at \(\omega\), is less than or equal to the real number \(x\)”. We can simplify this further:
\begin{equation} \begin{aligned} F(x) &= \mathbb{P}(\{\omega: X(\omega) \le x\})\\ &= \mathbb{P}(\{\omega: X(\omega) \in (-\infty, x]\})\\ &= \mathbb{P}(X^{-1} ((-\infty, x]))\\ &= \mathbb{P} \circ X^{-1} ((-\infty, x]). \end{aligned}\notag \end{equation}
These elementary manipulations show us that the cumulative distribution function \(F\) is nothing but the composition of the probability measure \(\mathbb{P}\) with the inverse mapping of \(X\) (applied here to half-open intervals like \((-\infty, x]\))! What does this mean? It allows us to define events on \(\mathbb{R}\), like the event \(X \le x\), and to measure their probability by measuring the size of the set of all \(\omega\) which satisfy \(X(\omega) \le x\). So, what we do “in practice” is that we start by positing a distribution function \(F\) (like a Gaussian distribution) which we want to work with on \(\mathbb{R}\), and then we just trust that there is some appropriate probability triple \((\Omega, \mathcal{F}, \mathbb{P})\) which is “adequately expressive” and some function \(X\) such that \(F = \mathbb{P} \circ X^{-1}\). The function \(X\) is now a random variable having c.d.f. \(F\). This gives us a huge variety of possibilities for modelling, and the randomness on \(\mathbb{R}\) is inherited from the simple and intuitive notion of randomness as volume on Nature’s dartboard.
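To see the composition \(F = \mathbb{P} \circ X^{-1}\) in the most concrete terms possible, here is a tiny sketch (my own example) on the finite sample space of two fair coin flips, with \(X\) counting the number of heads.

```python
# A small sketch (my own illustration): F = P o X^{-1} computed explicitly
# on the finite sample space of two fair coin flips, with X(omega) equal to
# the number of heads.
from fractions import Fraction

omega = [("H", "H"), ("H", "T"), ("T", "H"), ("T", "T")]
prob = {w: Fraction(1, 4) for w in omega}

def X(w):
    return w.count("H")          # the random variable: a function Omega -> R

def F(x):
    """F(x) = P({omega : X(omega) <= x}), the pushforward of P under X."""
    preimage = [w for w in omega if X(w) <= x]   # X^{-1}((-inf, x])
    return sum(prob[w] for w in preimage)

print([float(F(x)) for x in (-1, 0, 1, 2)])   # [0.0, 0.25, 0.75, 1.0]
```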
Remark (Measurable Functions): Technically, the random variable $X$ must be a measurable function . This is another important technical condition, but irrelevant for developing an intuition.
Remark (Functions of Random Variables): Suppose we have some function $g: \mathbb{R} \rightarrow \mathbb{R}$. It is worthwhile to point out that a function of a random variable $g(X)$ defines a new random variable $Y(\omega) = g \circ X(\omega)$, since it is nothing but another function from $\Omega$ to $\mathbb{R}$. The cumulative distribution function of this new random variable is given by $F_Y(y) = \mathbb{P} \circ X^{-1} \circ g^{-1}((-\infty, y])$.
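As a standard worked example (not specific to this post), take \(g(x) = x^2\). For \(y \ge 0\) we have \(g^{-1}((-\infty, y]) = [-\sqrt{y}, \sqrt{y}]\), so if \(F\) is continuous, \[F_Y(y) = \mathbb{P} \circ X^{-1}\bigl([-\sqrt{y}, \sqrt{y}]\bigr) = F(\sqrt{y}) - F(-\sqrt{y}),\] and \(F_Y(y) = 0\) for \(y < 0\).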
Expectation
Another essential concept in probability theory is that of the expectation. Intuitively, this is like a generalization of an average. If the sample space \(\Omega = \{\omega_1, \ldots, \omega_N\}\) is finite, then we define \[\mathbb{E}X = \sum_{i = 1}^N X(\omega_i) \mathbb{P}(\{\omega_i\}),\] and if \(\mathbb{P}(\{\omega_i\}) = \frac{1}{N}\) for every \(i\) then it is exactly an average of the random variable over the sample space. Generally, finite sample spaces aren’t rich enough to be interesting, so we define the expectation by means of an integral \[\mathbb{E}X = \int_\Omega X(\omega) \mathsf{d}\mathbb{P}(\omega).\] This is called a Lebesgue Integral , and our desire to define and use this integral is another part of the motivation for studying probability in the context of measure theory. Regardless, it is usually possible to dispense with this integral “in practice” and to work directly with the distribution function \(F\). This works as follows:
\begin{equation} \begin{aligned} \mathbb{E}X &= \int_\Omega X(\omega) \mathsf{d}\mathbb{P}(\omega)\\ &\overset{(a)}{=} \int_{X(\Omega)} x\ \mathsf{d} \mathbb{P} \circ X^{-1} (x)\\ &\overset{(b)}{=} \int_{\mathbb{R}} x\ \mathsf{d} F(x)\\ &\overset{( c)}{=} \int_{-\infty}^\infty xf(x)\ \mathsf{d}x, \end{aligned}\notag \end{equation}
where \((a)\) comes from making the substitution \(x = X(\omega)\), \((b)\) just simplifies using \(F = \mathbb{P} \circ X^{-1}\), and \(( c)\) is a final simplification for continuous random variables many readers may be used to: if \(F\) is a differentiable function, then the random variable admits a density function \(f(x) = F^\prime(x)\) and \(\mathsf{d}F(x)\) turns into \(f(x)\mathsf{d}x\), resulting in an “ordinary” Riemann integral (as opposed to \(\int_\mathbb{R} x \mathsf{d} F(x)\) which is a Lebesgue-Stieltjes integral ).
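As a quick numerical illustration (my own sketch, with an arbitrarily chosen Gaussian), the “practical” formula \(\int_{-\infty}^\infty x f(x)\ \mathsf{d}x\) and the “average over the sample space” view of \(\mathbb{E}X\) give the same answer:

```python
# A rough numerical sketch (mine, not from the post): E[X] computed "in
# practice" as the integral of x f(x), checked against a plain average of
# samples, for X ~ Normal(mean=2, sd=1).
import numpy as np

mu, sigma = 2.0, 1.0
f = lambda t: np.exp(-((t - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Riemann-sum approximation of the integral of x f(x) dx over a wide grid.
grid = np.linspace(mu - 10, mu + 10, 200_001)
dx = grid[1] - grid[0]
expectation_integral = np.sum(grid * f(grid)) * dx

# The "average over the sample space" view: a large i.i.d. sample.
rng = np.random.default_rng(0)
expectation_average = rng.normal(mu, sigma, size=1_000_000).mean()

print(expectation_integral)   # ~ 2.0
print(expectation_average)    # ~ 2.0
```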
Remark (Expectations of Functions of Random Variables): Given a function $g: \mathbb{R} \rightarrow \mathbb{R}$, we may like to compute the expectation of the new random variable $g(X)$ as in $\mathbb{E}g(X) = \int_\Omega g(X(\omega)) \mathsf{d} \mathbb{P}(\omega).$ We could do so by determining the distribution function $G(x) = \mathbb{P}(g(X) \le x)$ of the new random variable $g(X)$ and then doing the calculation $\int_\mathbb{R} x \mathsf{d} G(x)$. But alternatively, we can see by repeating the same substitution $x = X(\omega)$ in the integral that we made earlier that $\mathbb{E}g(X) = \int_\mathbb{R} g(x) \mathsf{d} F(x)$, which is still just an integral over the distribution of $X$! This is an incredibly convenient fact called the Law of the Unconscious Statistician .
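Here is a quick numerical check of this fact (my own sketch, using \(g(x) = x^2\) and a standard normal \(X\), so that the true value is \(\mathbb{E}X^2 = 1\)):

```python
# A quick numerical check (my own sketch) of the law of the unconscious
# statistician: the integral of g(x) f(x) dx matches the average of g(X)
# over samples, here with g(x) = x^2 and X ~ N(0, 1), so E[g(X)] = 1.
import numpy as np

f = lambda t: np.exp(-t ** 2 / 2) / np.sqrt(2 * np.pi)   # standard normal density
g = lambda t: t ** 2

grid = np.linspace(-10, 10, 200_001)
dx = grid[1] - grid[0]
lotus_integral = np.sum(g(grid) * f(grid)) * dx          # integral of g(x) dF(x)

rng = np.random.default_rng(0)
monte_carlo = g(rng.standard_normal(1_000_000)).mean()   # average of g(X)

print(lotus_integral)   # ~ 1.0
print(monte_carlo)      # ~ 1.0
```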
Data and Sampling – the Passage to Statistics
Many people are now drawn to the study of probability for the purposes of doing statistics, i.e., analyzing data. So, I wanted to add some more commentary about random variables and their connection to the modelling of data. That is, how do statistics and probability connect?
Statistics is often concerned with experiments. Suppose we plan an experiment where we are going to measure someone’s height. The outcome of this experiment is not known a-priori, so it makes sense to model it using probability theory – the outcome is random. As well, a person’s height is a real number, and we now know that we can model random real numbers using random variables. Thus, we might like to construct a model of the measurement of someone’s height as a random variable \(H(\omega)\).
If we measure the heights of many people, we will wind up with a multitude of random variables \(H_1, H_2, \ldots, H_N\), where \(N\) is the number of measurements we make. These random variables all together define some joint distribution \(F(h_1, \ldots, h_N) = \mathbb{P}(H_1 \le h_1, \ldots, H_N \le h_N)\), and statistics is often concerned with what can be learned about the distribution of these random variables through the actually observed outcomes of experiments.
Independent and Identically Distributed Random Variables
The multivariate distribution described above can be difficult to work with – some assumptions need to be made. The most common assumption is that the random variables are independent and identically distributed, that is, i.i.d. What this means is that we assume:
- \(\mathbb{P}(H_i \le h) = \mathbb{P}(H_j \le h)\) for all \(i, j\) and all \(h\) (identically distributed)
- \(\mathbb{P}(H_1 \le h_1, \ldots, H_N \le h_N) = \mathbb{P}(H_1 \le h_1) \times \cdots \times \mathbb{P}(H_N \le h_N)\) (independent).
The idea of the random variables having an identical distribution is that we are observing separate outcomes of the same experiment. Independence, where the joint distribution is nothing but the product of the individual (marginal) distributions, encodes the idea that the outcome of one experiment does not influence (and is not influenced by) any other experiment.
Thus, one of the most common settings in statistics is that we are given a sequence \(X_1, X_2, \ldots, X_N\) of i.i.d. random variables, and our goal is to learn something about the statistics of their common distribution, call it \(F_X\). The word statistic here just means “a functional of the distribution”, i.e., a function which takes the distribution function \(F_X\) as an input and outputs some real number. The mean value \(\mathbb{E}X = \int_{\mathbb{R}} x \mathsf{d} F_X(x)\) is an important example; quantiles are another: the \(\alpha\)-quantile is the value \(q\) for which \(\mathbb{P}(X \le q) = \int_{\mathbb{R}} \mathbf{1}[x \le q]\mathsf{d} F_X(x) = \alpha\).
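To make this concrete, here is a small sketch (mine, with made-up numbers loosely inspired by the height example) of the usual plug-in approach: estimate a functional of \(F_X\) by applying it to the empirical distribution of the sample.

```python
# A small sketch (mine, with made-up numbers): estimating two functionals
# of the common distribution F_X -- the mean and a quantile -- by plugging
# the i.i.d. sample into the corresponding empirical quantities.
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=170.0, scale=10.0, size=10_000)   # hypothetical heights (cm)

mean_estimate = heights.mean()              # estimates E[X]
q90_estimate = np.quantile(heights, 0.9)    # estimates the 0.9-quantile of F_X

print(mean_estimate)   # ~ 170
print(q90_estimate)    # ~ 182.8 for this hypothetical distribution
```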
Given data, and statistical theorems about how collections of i.i.d. random variables behave, statisticians are able to answer questions like: “what is the probability that a randomly selected person will be over six feet tall?”, “in a room of \(N\) people, what is the probability that there will be someone taller than six feet five inches?”, etc. Going further, the combination of powerful computers with sophisticated statistical models can be used to produce striking results – the dart-throwing woman that serves as the cover photo of this blog post was itself generated from a statistical model. That is, the image itself is essentially nothing but a particular realization of a random variable!
Remark (The LLN): The most natural estimator of \(\mathbb{E}X\) given the sample is the average \(\overline{X}_N = \frac{1}{N}\sum_{i = 1}^N X_i\) and indeed, many of the most famous theorems in statistics (the Law of Large Numbers and the Central Limit Theorem ) are concerned exactly with how $\overline{X}_N$ behaves under various different assumptions about the common distribution. Indeed, it can be shown that in almost all sensible cases $\overline{X}_N \rightarrow \mathbb{E}X \text{ as } N \rightarrow \infty$ (given an appropriate sense of convergence for random variables): the average of the sample converges to the expectation as the size of the sample increases. This is a highly intuitive outcome that can be derived entirely from the axioms of probability, and it is an incredibly satisfying result! Some historical formulations of probability began with the postulate that this is true, so it is a testament to how effectively the axioms “clean up” the theory.
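For those who like to see it happen, here is a tiny simulation (my own sketch) of the running average settling down to \(\mathbb{E}X\) as \(N\) grows:

```python
# A tiny simulation (my own sketch): the running average of i.i.d. draws
# drifts toward the expectation as the sample size N grows.
import numpy as np

rng = np.random.default_rng(0)
draws = rng.exponential(scale=2.0, size=100_000)   # E[X] = 2 for this choice
running_average = np.cumsum(draws) / np.arange(1, draws.size + 1)

for N in (10, 100, 10_000, 100_000):
    print(N, running_average[N - 1])   # approaches 2.0 as N increases
```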
Conclusion
Axiomatic and measure theoretic probability is a topic more than worth studying. It is very difficult, if not impossible, to have a full understanding of probability (particularly stochastic processes) without eventually going down this path. I hope that I have here shown that the elementary ideas are not just esoteric abstractions, but that they are well supported by a very natural intuition. It is the technical need to deal with annoying pathologies (like the Banach-Tarski paradox) that necessitates some technical machinery. But, the underlying intuition comes down to nothing but the measure of volumes in Nature’s Dartboard. Indeed, most abstractions in mathematics have some completely natural and intelligible intuition behind them!
Remark (Some book recommendations): The first book about probability I read was An Intermediate Course in Probability by Allan Gut. However, this book is not measure theoretic. Instead, this book by the same author can serve as a reasonable introduction, or the well known book by Billingsley , which is more-or-less the canonical book to recommend. Another great book which is somewhat less concerned with the details of measure theory, focusing more on probability, is Probability: Theory and Examples , which contains all of the classic results and, as the name suggests, a lot of very nice examples. However, I only read this book long after I became well acquainted with measure-theoretic probability. Since its treatment of measure is rather more limited, it’s difficult for me to know whether or not this is a good book to start with.
Remark (Other topics): There are a number of other rather famous books broadly in the area of machine learning and statistics – examples being the following: Theoretical Statistics , Pattern Recognition and Machine Learning , Bayesian Data Analysis , Deep Learning , and Mathematics for Machine Learning (this last one I have not myself read). These books are perfectly good, but they are emphatically not about probability theory – they are applied books in machine learning and statistics. To learn probability theory itself, stick with the above.