Probability (III)

Probability Theory

Probability Space

To start our discussion of probability theory, we first need to define the concept of a probability space. Informally, the probability space of a probabilistic experiment describes the set of all possible outcomes of that experiment together with their probabilities. Here, "experiment" refers to any random process where different outcomes may occur with different probabilities.

For example, if our experiment is "What will the weather be tomorrow?", the outcomes


  • "sunny" with probability 0.5

  • "partly cloudy" with probability 0.3

  • "torrential flood of Biblical proportions" with probability 0.00001

could all be elements in the probability space. A probability space is a mathematical construct that allows us to create a formal model for these kinds of random processes. To define this concept rigorously, we will start with some definitions.


Definition 1. We define the sample space of an experiment to be the set of all possible outcomes of that experiment.


The sample space is usually denoted as S or \Omega. Here, we will use \Omega.


Example 2. We toss a coin twice. What is the sample space of this experiment?


Solution: Denoting heads as H and tails as T, this experiment has the four possible outcomes: \Omega = \{HH, HT, TH, TT\}
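We can sketch this sample space in a few lines of Python; the H/T string encoding here is just our own convention:

```python
from itertools import product

# Each outcome is an ordered pair of tosses; "HT" means heads then tails.
omega = ["".join(outcome) for outcome in product("HT", repeat=2)]
print(sorted(omega))   # ['HH', 'HT', 'TH', 'TT']
```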



A set \Omega needs to satisfy certain properties in order to define a sample space. Letting \Omega = \{s_1, s_2, \dots, s_n\}, it must be true that:


  • The elements of \Omega represent all possible outcomes of the experiment. There is no chance that we can run a trial and get an outcome which is outside of \Omega.

  • The outcomes are mutually exclusive. If s_j is our outcome, then no other s_i where i \neq j can take place.

  • \Omega must have the right granularity with regard to the experiment and what the experimenter is looking for. This means that we should only include relevant information and remove anything which is irrelevant. For instance, if our experiment is tossing a coin once, the sample space we would most likely want is \Omega = \{H,T\}. If we choose to clutter this space up with unnecessary information, for example by adding in the weather for some reason, we can create another sample space \Omega_2 = \{\{H, \text{raining}\},\{H, \text{not raining}\}, \{T, \text{raining}\}, \{T, \text{not raining}\}\}. Clearly, this is not the best representation of the experiment, and this extra irrelevant information is best left out.


Definition 3. We define a \mathbf{\sigma}-field on \mathbf{\Omega} (also called a \sigma-algebra), which we’ll denote here as \mathcal{F}, as a collection of subsets of \Omega.


\mathcal{F} must have the following properties:



(i) \Omega \in \mathcal{F}



(ii) A \in \mathcal{F} \Rightarrow \overline{A} \in \mathcal{F}, where \overline{A} is the complement of A in \Omega



(iii) \displaystyle A_1, A_2, ... \in \mathcal{F} \Rightarrow \bigcup_{k=1}^{\infty} A_k \in \mathcal{F}



If \Omega is a sample space, \mathcal{F} is called a \mathbf{\sigma}-field of events. \mathcal{F} represents the collection of all subsets of \Omega to which we can assign a probability, and the elements of \mathcal{F} are called events. When \Omega is a countable sample space, we will usually take \mathcal{F} to be the power set of \Omega, written \mathcal{P}(\Omega).


Example 4. Find the largest \sigma-field on \Omega for an experiment with \Omega = \{1,2,3\}.


Solution: The largest \sigma-field \mathcal{F} is the set of all subsets of \Omega, or \mathcal{F} = \{ \emptyset, \{1\}, \{2\}, \{3\}, \{1,2\}, \{2,3\}, \{1,3\}, \{1,2,3\}\}. Here, the empty set \emptyset represents the event where nothing happens; we will show later that the probability of \emptyset is always zero. Events with multiple elements represent the case where any one of the included outcomes occurs. For example, \{1,2\} represents the event that the outcome is either 1 or 2.
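For a finite sample space like this one, the power set is easy to enumerate programmatically. Here is a small Python sketch (the function name sigma_field is our own) that builds \mathcal{P}(\Omega) and spot-checks two of the \sigma-field properties on this finite example:

```python
from itertools import chain, combinations

def sigma_field(omega):
    """Return the power set of omega: the largest sigma-field on a finite set."""
    elems = sorted(omega)
    return [set(s) for s in chain.from_iterable(
        combinations(elems, r) for r in range(len(elems) + 1))]

F = sigma_field({1, 2, 3})
print(len(F))   # 8: the empty set, three singletons, three pairs, and omega itself

omega = {1, 2, 3}
assert omega in F                      # property (i): omega is an event
assert all(omega - A in F for A in F)  # property (ii): closed under complement
```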



The concept of a \sigma-field is used outside of probability as well. Here are some other examples of \sigma-fields.


Example 5. The Borel \sigma-field on \mathbb{R}^n, denoted B(\mathbb{R}^n), is defined as the smallest \sigma-field on \mathbb{R}^n that contains all open n-rectangles (a_1,b_1) \times (a_2,b_2) \times \dots \times (a_n,b_n).


To make this easier to visualize, let’s discuss the Borel \sigma-field on \mathbb{R}^2. This is the \sigma-field generated by all the open rectangles we can draw on the xy plane. Does this satisfy the properties of a \sigma-field?



(i) Is \mathbb{R}^2 \in B(\mathbb{R}^2)? Yes! We can think of \mathbb{R}^2 as the countable union of the rectangles (-k,k) \times (-k,k) for k \in \mathbb{N}.



(ii) If A \in B(\mathbb{R}^2), is \overline{A} \in B(\mathbb{R}^2)? Also yes! For instance, take the rectangle (-2,1) \times (-1,1). We can cover the rest of the plane, up to the boundary of this rectangle, with countably many rectangles, each of which is by definition in B(\mathbb{R}^2); the boundary itself also belongs to B(\mathbb{R}^2), since it can be built from complements and countable unions of rectangles.

(iii) Finally, if \displaystyle A_1, A_2, \dots \in B(\mathbb{R}^2), is \bigcup_{k=1}^{\infty} A_k \in B(\mathbb{R}^2)? You better believe the answer is yes! Each A_k is built from rectangles, so their countable union is again, by definition, in B(\mathbb{R}^2).


Example 6. Let \Omega be the set of all infinite sequences of 0s and 1s, and let \mathcal{F} be the smallest \sigma-field that contains all sets of the form \{\omega = (x_1,x_2,x_3, \dots) : x_k=1\}, \quad k=1,2,3,\dots and all sets of the form \{\omega = (x_1,x_2,x_3, \dots) : x_k=0\}, \quad k=1,2,3,\dots


Since a \sigma-field is closed under finite intersections, this \sigma-field also contains all sets of the form \{\omega = (x_1,x_2,x_3, \dots) : x_1= a_1, x_2=a_2, \dots, x_n=a_n\} \quad \forall n\geq 1 \text{ and } a_1,\dots,a_n \in \{0,1\}



You can check for yourself that this satisfies all the properties of a \sigma-field.



Finally, in order to define a probability space, we need one more definition.


Definition 7. A probability measure, P, on (\Omega, \mathcal{F}) is a function P: \mathcal{F} \rightarrow [0,1] such that


(i) 0 \leq P(A) \leq 1 \quad \forall A \in \mathcal{F}



(ii) P(\Omega) = 1



(iii) \displaystyle P\left(\sum_{k=1}^{\infty} A_k \right) =  \sum_{k=1}^{\infty}P( A_k) for all sequences \{ A_k\}_{k \geq 1} of pairwise disjoint events in \mathcal{F}.



Property (iii) is called \sigma-additivity.


Theorem 8. For any event A \in \mathcal{F}, P(\overline{A}) = 1 - P(A), and P(\emptyset) = 0


Proof: We know that P(\Omega) = 1 since \Omega is the set of all possible outcomes. So we have that 1 = P(\Omega) = P(A + \overline{A}) = P(A) + P(\overline{A}) by \sigma-additivity. Therefore: P(\overline{A}) = 1-P(A) We can now use this result to prove P(\emptyset) = 0: P(\emptyset) = P(\overline{\Omega}) = 1 - P(\Omega) = 1-1 = 0 \blacksquare



The time has come to rigorously define a probability space.


Definition 9. A probability space is defined as the triple (\Omega, \mathcal{F}, P) where



  • \Omega is a sample space

  • \mathcal{F} is a \sigma-field on \Omega

  • P is a probability measure


Example 10. Find the probability space for rolling a fair six-sided die once.


Solution: We know that \Omega = \{ 1,2,3,4,5,6\} and \mathcal{F} = \mathcal{P}(\Omega). Now, for the probability measure, we know that each outcome in \Omega has an equal probability of \frac{1}{6}. So if we take an element of \mathcal{F}, for instance \{1,2\}, its probability will be \frac{1}{6} + \frac{1}{6} = \frac{2}{6} = \frac{|\{1,2\}|}{6} So for any event in \mathcal{F}: P(A) = \dfrac{|A|}{6} \quad \forall A \in \mathcal{F} Together these make up the probability space for this experiment.
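As a sanity check, here is a short Python sketch of this probability measure using exact fractions (the helper name P is our own). It confirms property (ii) of Definition 7 and \sigma-additivity on a disjoint split of \Omega:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(A):
    """Fair-die probability measure: P(A) = |A| / 6."""
    return Fraction(len(A), 6)

print(P({1, 2}))   # 1/3, i.e. 2/6

assert P(omega) == 1          # property (ii): P(omega) = 1
assert P(set()) == 0          # the empty event has probability 0
# sigma-additivity on a disjoint split of omega:
assert P({1, 2}) + P({3, 4, 5, 6}) == P(omega)
```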


Example 11. Say we toss a fair coin until we get heads. Find the probability space for this experiment, where the outcome is the number n of tosses needed to get the first heads.


Solution: Here, our sample space is going to be any possible number of tosses: \Omega = \{1,2,3,\dots\} and, of course, \mathcal{F} = \mathcal{P} (\Omega). We define our probability measure to represent the probability that it will take n tosses to get our first heads, where n \in \mathbb{N}. That is, P(\{n\}) = \left( \frac{1}{2}\right)^n = \frac{1}{2^n} and, by \sigma-additivity, P(A) = \sum_{i=1}^m \frac{1}{2^{a_i}} where A = \{a_1, a_2, \dots, a_m\}, with a_i \in \mathbb{N} \; \forall i \in \{1,2,\dots,m\}, is an element of \mathcal{F}.
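We can sanity-check this measure numerically: the probabilities 1/2^n should sum to 1 over all of \Omega, and simulated coin tossing should agree. A small Python sketch, with helper names of our own choosing:

```python
import random

# P({n}) = 1 / 2**n: the first heads appears on toss n.
def P(n):
    return 0.5 ** n

# The measure nearly exhausts the sample space after many terms:
total = sum(P(n) for n in range(1, 60))
print(total)   # very close to 1

# Monte Carlo check: toss a fair coin until the first heads comes up.
random.seed(0)

def first_heads():
    n = 1
    while random.random() >= 0.5:   # tails, with probability 1/2
        n += 1
    return n

trials = 100_000
freq3 = sum(first_heads() == 3 for _ in range(trials)) / trials
print(freq3)   # should be close to P(3) = 0.125
```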



We’ll end this section with some important theorems.


Theorem 12. Probability is monotonic. That is, \forall A, B \in \mathcal{F}


A \subset B \Rightarrow P(A) \leq P(B)

Proof: A \subset B \Rightarrow B = A + (B-A). Using \sigma-additivity, since A and B-A are by definition disjoint, this implies that P(B) = P(A) + P(B-A) \Rightarrow P(B) \geq P(A) \blacksquare


Theorem 13. Probability is sub \sigma-additive. That is, for any sequence A_1, A_2,... of events which may not be disjoint P\left(\bigcup_{k=1}^{\infty} A_k \right) \leq \sum_{k=1}^{\infty} P(A_k)


Proof: Observe that \bigcup_{k=1}^{\infty} A_k = \sum_{k=1}^{\infty} A'_k where we define the A'_k as \begin{aligned} & A'_1 = A_1 \\ &A'_2 = A_2 \cap \overline{A_1}\\ &\vdots\\ & A'_k = A_k \cap \overline{\bigcup_{i=1}^{k-1} A_i} \end{aligned} By making it so each successive A'_k only includes elements present in none of the previous sets, this construction gives us pairwise disjoint sets, so \sigma-additivity applies.



Therefore we have that P\left(\bigcup_{k=1}^{\infty} A_k \right) = P \left( \sum_{k=1}^{\infty} A'_k \right) = \sum_{k=1}^{\infty} P(A'_k) \leq \sum_{k=1}^{\infty} P(A_k) since A'_k \subseteq A_k. \blacksquare


Theorem 14. Let \{ A_n\}_{n \geq 1} be a non-decreasing sequence of events. That is, A_n \subseteq A_{n+1} \forall n \in \mathbb{N}. Then P \left( \bigcup_{n=1}^{\infty} A_n\right) = \lim_{n \rightarrow \infty}P(A_n)


Proof: Observe that since A_n \subseteq A_{n+1} \forall n \in \mathbb{N} we have that A_n = A_1 + (A_2 - A_1) + \dots + (A_n - A_{n-1}) for all n, and \bigcup_{k=1}^{\infty} A_k = A_1 + (A_2 - A_1) + (A_3 - A_2) + \dots Notice that we once again gave ourselves a set of disjoint sets to work with. Therefore P \left( \bigcup_{n=1}^{\infty} A_n\right) = P(A_1) + \sum_{i=2}^{\infty} P(A_i - A_{i-1}) = \lim_{n \rightarrow \infty} \left[ P(A_1) + \sum_{i=2}^{n} P(A_i - A_{i-1}) \right] = \lim_{n \rightarrow \infty}P(A_n) \blacksquare


Corollary 15. Let \{ B_n\}, n \in \mathbb{N} be a non-increasing sequence of events. That is, B_{n+1} \subseteq B_n. Then P \left( \bigcap_{n=1}^{\infty} B_n\right) = \lim_{n \rightarrow \infty}P(B_n)


Proof: We know that P \left( \bigcap_{n=1}^{\infty} B_n\right) = 1 - P \left( \overline{\bigcap_{n=1}^{\infty} B_n}\right) Using De Morgan’s Law we can rewrite this as 1 - P \left( \bigcup_{n=1}^{\infty} \overline{B_n}\right) Since \{\overline{B_n}\} is a non-decreasing sequence of events, the previous theorem gives 1 - P \left( \bigcup_{n=1}^{\infty} \overline{B_n}\right) = 1 - \lim_{n \rightarrow \infty} P(\overline{B_n}) = \lim_{n \rightarrow \infty} \left[ 1 - P(\overline{B_n})\right] = \lim_{n \rightarrow \infty}P(B_n) \blacksquare

Negligible Sets


Definition 16. A set N \subset \Omega is called P-negligible if it is contained in an event A \in \mathcal{F} where the probability P(A) = 0.



Theorem 17. A countable union of negligible sets is a negligible set.


Proof: Let N_k, k \in \mathbb{N}, be P-negligible sets. By definition there exist events A_k \in \mathcal{F} such that P(A_k) = 0 and N_k \subseteq A_k for all k. Therefore \bigcup_{k=1}^{\infty} N_k \subseteq \bigcup_{k=1}^{\infty} A_k, and by sub-\sigma-additivity P\left(\bigcup_{k=1}^{\infty} A_k \right) \leq \sum_{k=1}^\infty P(A_k) = 0 so the union of the N_k is contained in an event of probability zero. \blacksquare

Independence and Conditioning

Another important topic in probability is that of independence and conditional probability. If we know what the weather is today, will that affect our prediction for tomorrow? Probably, since if it’s snowing today we most likely won’t have scorching heat tomorrow. So how do we take this into account?

Independent Events


Definition 18. Two events A and B are called independent if and only if P(A \cap B) = P(A)P(B)


Note that we may also write P(A\cap B) as P(A,B).


Definition 19. A family of events \{A_n\}_{n \in \mathbb{N}} is called jointly independent or just independent if, for any finite set of indices, i_1 < i_2 < \dots < i_r \quad i_j \in \mathbb{N}, 1 \leq j \leq r we have P(A_{i_1} \cap A_{i_2} \cap \dots \cap A_{i_r}) = P(A_{i_1})P(A_{i_2}) \dots P(A_{i_r})
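Joint independence is genuinely stronger than pairwise independence. A classic illustration, sketched below in Python: toss two fair coins, and consider the events "first toss is heads", "second toss is heads", and "both tosses agree". Each pair of these events is independent, yet the family of all three is not jointly independent:

```python
from itertools import product
from fractions import Fraction

# Sample space: two fair coin tosses, each of the four outcomes equally likely.
omega = list(product("HT", repeat=2))

def P(event):
    return Fraction(len(event), len(omega))

A = {w for w in omega if w[0] == "H"}    # first toss heads
B = {w for w in omega if w[1] == "H"}    # second toss heads
C = {w for w in omega if w[0] == w[1]}   # both tosses agree

# Each pair is independent...
assert P(A & B) == P(A) * P(B)
assert P(A & C) == P(A) * P(C)
assert P(B & C) == P(B) * P(C)

# ...but the family {A, B, C} is not jointly independent:
print(P(A & B & C), P(A) * P(B) * P(C))   # 1/4 versus 1/8
```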


Bayes’ Calculus

What if two or more events are not independent? Then, we will have to take into account some extra information in order to compute probability. We do this by using conditional probability.


Definition 20. The conditional probability of A given B is the number P(A|B) = \frac{P(A\cap B)}{P(B)} defined when P(B) > 0.


P(A|B) is read as "the probability of A given B" and represents the probability of A occurring given that we know B is true.



Note that this also gives us an alternative but equivalent definition of independence. We can say that A and B are independent if P(A|B) = P(A) and P(B|A) = P(B) meaning that whether or not B is true has no effect on A, and vice versa.


Theorem 21. Bayes’ Rule

Given P(A)>0, we have that P(B|A) = \frac{P(A|B) P(B)}{P(A)}


We can prove this theorem using the definition of conditional probability rewritten as P(A\cap B) = P(A|B)P(B) Note that P(A\cap B) is the same as P(B\cap A), so we also have that P(A\cap B) = P(B|A)P(A) Equating these two we get P(A|B)P(B) = P(B|A)P(A)\Rightarrow P(B|A) = \frac{P(A|B) P(B)}{P(A)} \blacksquare


Theorem 22. Bayes’ Rule of Total Probability

Let B_1, B_2, B_3, \dots be events which form a partition of \Omega. That is, B_1, B_2, B_3, \dots is a set of disjoint sets such that \bigcup_{i=1}^{\infty} B_i = \Omega Then, \forall A \in \mathcal{F}, we have that P(A) = \sum_{i=1}^{\infty} P(A|B_i) P(B_i)


Proof: Since A \subseteq \Omega, we have that A = A \cap \Omega = A \cap \left( \sum_{i=1}^{\infty} B_i \right) = \sum_{i=1}^{\infty} (A \cap B_i) Using \sigma-additivity and the definition of conditional probability, we therefore have that P(A) = P\left(\sum_{i=1}^{\infty} (A \cap B_i) \right) = \sum_{i=1}^{\infty} P(A \cap B_i) = \sum_{i=1}^{\infty} P(A|B_i) P(B_i) \blacksquare


Theorem 23. Bayes’ Sequential Rule

For any sequence of events A_1, A_2, \dots, A_n we have P\left(\bigcap_{i=1}^n A_i \right) = P(A_1) P(A_2|A_1) P(A_3|A_1 \cap A_2) \dots P\left(A_n \biggr\rvert \bigcap_{i=1}^{n-1} A_i \right)


Proof: We will use induction on n



Base step: Let n=2. By the definition of conditional probability, we have that P(A_1 \cap A_2) = P(A_2 | A_1) P(A_1) Inductive hypothesis: Assume the theorem is true for n-1, for some n > 2.



Inductive step: P\left( \bigcap_{i=1}^{n} A_i \right) = P\left( \left(\bigcap_{i=1}^{n-1} A_i\right) \cap A_{n} \right) = P\left( A_{n} \biggr\rvert \bigcap_{i=1}^{n-1} A_i\right) P\left(\bigcap_{i=1}^{n-1} A_i\right) = P(A_1) P(A_2|A_1) P(A_3|A_1 \cap A_2) \dots P\left(A_n \biggr\rvert \bigcap_{i=1}^{n-1} A_i \right) \blacksquare
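As a quick illustration of the sequential rule, here is a hypothetical worked computation (not from the text above): the probability of drawing three aces in a row from a standard 52-card deck without replacement.

```python
from fractions import Fraction

# A_i = "the i-th card drawn (without replacement) is an ace".
# Sequential rule: P(A1 ∩ A2 ∩ A3) = P(A1) P(A2|A1) P(A3|A1 ∩ A2).
p = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50)
print(p)   # 1/5525
```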


Example 24. The False Positive Problem



A certain disease affects one in every thousand people. A test for this disease will result in a positive for 99% of the patients who have the disease. However, this test also gives a positive result for 2% of the population without the disease.



If a patient gets a positive result using this test, what is the probability that they have the disease?


Solution: Let D be the event that the patient does have the disease and let pos be the event that the test is positive.



The given information tells us that P(D) = 0.001 P(pos|D) = 0.99 P(pos|\overline{D}) = 0.02

Using Bayes’ Rule we know that P(D|pos) = \frac{P(pos|D)P(D)}{P(pos)}

To find P(pos), we use Bayes’ Rule of Total Probability: P(pos) = P(pos|D)P(D) + P(pos| \overline{D}) P(\overline{D}) = (0.99)(0.001)+ (0.02)(0.999) = 0.02097

and so P(D|pos) = \frac{(0.99)(0.001)}{0.02097} \approx 0.05 Thus, if the patient tests positive there is only a 5% chance that they actually have the disease!
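The arithmetic in this example is easy to reproduce. A minimal Python sketch using the numbers given above (the variable names are our own):

```python
p_d = 0.001          # P(D): one in a thousand people has the disease
p_pos_d = 0.99       # P(pos | D): test detects the disease 99% of the time
p_pos_not_d = 0.02   # P(pos | not D): false-positive rate

# Total probability: P(pos) = P(pos|D) P(D) + P(pos|not D) P(not D)
p_pos = p_pos_d * p_d + p_pos_not_d * (1 - p_d)
print(round(p_pos, 5))    # 0.02097

# Bayes' rule: P(D | pos) = P(pos|D) P(D) / P(pos)
p_d_pos = p_pos_d * p_d / p_pos
print(round(p_d_pos, 4))  # 0.0472, i.e. about 5%
```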


Example 25. Bertrand’s Ballot Problem



In an election, candidates A and B have received a and b votes respectively. Candidate A won the election, meaning a>b.



Prove that the following statement is true: the probability that, in the course of counting all the votes, candidate A always has the lead is \frac{a-b}{a+b}


Solution: We are going to prove this by induction on the number of votes. Let P_{a,b} denote the probability that A is always in the lead.



Base step: The smallest number of votes is 1, where a=1 and b=0. Here, it is trivial that A will always be in the lead, and indeed \frac{a-b}{a+b} = \frac{1-0}{1+0} = 1.



Inductive hypothesis: We will assume that the expression is true when the total number of votes is a+b-1. This includes the cases where Candidate A receives a-1 votes and Candidate B receives b votes, as well as when they receive a and b-1 votes respectively.



Inductive step: Let the total number of votes be a+b where Candidate A and Candidate B receive a and b votes respectively, with a, b \in \mathbb{N} and a>b.



First, let’s look only at the last vote. The probability that the last vote will be a vote for A is P_{A \,glv}= \frac{a}{a+b} where glv stands for "gets last vote". The probability this vote will be for B is P_{B \,glv} = \frac{b}{a+b}. Now let’s compute P_{a,b}. We will break this into two cases:



In the first case, say the final vote is for A. In order for A to be ahead the entire time prior to this vote, we will require Candidate A’s a-1 prior votes be always ahead of Candidate B’s b votes. The probability that this will occur is P_{a-1,b}.



For the second case, say the final vote is for B. For Candidate A to be ahead the entire time, up to and including this vote, we will require that Candidate A’s a votes be always ahead of Candidate B’s b-1 votes. Note that since a>b, the last vote for B has no chance of making B pull ahead. The probability that this will occur is P_{a,b-1}.



And so we have that P_{a,b} = (P_{a-1,b}) \, (P_{A \,glv}) + (P_{a,b-1}) \, (P_{B \,glv}) Using the inductive hypothesis, we know that P_{a-1,b} and P_{a,b-1} follow the proposed formula, so we have P_{a,b} = \left(\frac{a-1-b}{a+b-1} \right)\left(\frac{a}{a+b} \right)+ \left(\frac{a-b+1}{a+b-1}\right)\left(\frac{b}{a+b}\right) = \frac{a^2-a-ab+ab-b^2+b}{(a+b-1)(a+b)} = \frac{(a-b)(a+b) - (a-b)}{(a+b-1)(a+b)} = \frac{(a-b)(a+b-1)}{(a+b-1)(a+b)} = \frac{a-b}{a+b} Therefore P_{a,b} = \frac{a-b}{a+b}. \blacksquare
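For small vote counts we can verify the formula by brute-force enumeration of all counting orders. A Python sketch (the helper name lead_probability is our own):

```python
from fractions import Fraction
from itertools import permutations

def lead_probability(a, b):
    """Exact probability that A stays strictly ahead throughout the count."""
    ballots = "A" * a + "B" * b
    orders = set(permutations(ballots))   # distinct counting orders

    def always_ahead(order):
        tally = 0
        for v in order:
            tally += 1 if v == "A" else -1
            if tally <= 0:                # A must be strictly ahead at every step
                return False
        return True

    wins = sum(always_ahead(o) for o in orders)
    return Fraction(wins, len(orders))

# Matches (a - b) / (a + b) for every small case we try:
for a, b in [(2, 1), (3, 2), (4, 1)]:
    assert lead_probability(a, b) == Fraction(a - b, a + b)
print(lead_probability(3, 2))   # 1/5
```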

Conditional Independence


Definition 26. Let A, B, and C be events where P(C)>0. A and B are called conditionally independent given C if P(A\cap B | C) = P(A|C)P(B|C)


In other words, A and B are independent with respect to the probability measure P_C defined as P_C(A) = P(A|C)


Example 27. Cheap Watches



Two factories, Factory A and Factory B, manufacture watches. Factory A produces, on average, 1 defective item out of 100, while Factory B produces, on average, 1 defective item out of 200.



You’ve received a container of watches from one of the two factories, but don’t know which one. You pull out the first watch and it works!



(a) What is the probability that the second watch you check will also work?



(b) Are the states of the first two watches independent?



(c) Are the states of the first two watches conditionally independent given we know that the container came from Factory A?


Solution:

(a) Let X_n represent the state of the nth watch in the container, where

X_n =  \begin{cases} 1 & \text{if the watch works} \\ 0 & \text{if the watch doesn't work} \end{cases} Let Y represent the factory the container came from. Since we don’t know which factory that is, we represent our ignorance by letting P(Y=A)= P(Y=B) = \frac{1}{2} We will assume that, given the factory, the states of the watches are conditionally independent, e.g. P(X_1=1,X_2=0| Y=A) = P(X_1=1|Y=A) P(X_2=0|Y=A) We know that P(X_n=0|Y=A) = 0.01 and P(X_n=0|Y=B) = 0.005 Given that the first watch worked, the probability that the second watch will work is given by P(X_2=1|X_1=1) = \frac{P(X_1=1,X_2=1)}{P(X_1=1)} We can use Bayes’ Rule of Total Probability to find the numerator and denominator of this expression: P(X_1=1,X_2=1) = P(X_1=1,X_2=1|Y=A)P(Y=A) + P(X_1=1,X_2=1|Y=B)P(Y=B) = (0.99)^2 (0.5) + (0.995)^2 (0.5) = 0.9850625 and P(X_1=1) = P(X_1=1|Y=A) P(Y=A) + P(X_1=1|Y=B)P(Y=B) =(0.99)(0.5)+(0.995)(0.5) =0.9925 Therefore P(X_2=1|X_1=1) = \frac{0.9850625}{0.9925} \approx 0.9925063 So the probability that the second watch will work, given that the first one does, is about 0.9925.



(b) If these two events were independent, we’d have that P(X_2=1|X_1=1) = P(X_2=1) From the previous part we know that P(X_2=1|X_1=1) \approx 0.9925063 and P(X_1=1)=0.9925 By the same logic we also have that P(X_2=1)=0.9925, which means that P(X_2=1|X_1=1) \neq P(X_2=1), so the states of the two watches are not independent!



Why is this the case? We start our watch selection with zero information about whether our container is from Factory A or Factory B. But each successive selection gives us a little more information about which factory it was, letting us slightly adjust our probabilities.



But what if we were to know which factory our watches came from from the start?



(c) If you look closely at part (a), we’ve been assuming conditional independence from the start, as it can be seen as a bit trivial in this instance. But let’s look closely at it anyway. X_1 and X_2 are conditionally independent given Y=A if P(X_1 =1, X_2=1|Y=A) = P(X_1 =1|Y=A)P( X_2=1|Y=A) We know that P(X_1 =1|Y=A)P( X_2=1|Y=A) = (0.99)^2 = 0.9801, and we also know that P(X_1 =1, X_2=1|Y=A) = (0.99)^2 = 0.9801. Therefore, the states of the two watches are conditionally independent given that the container came from Factory A.
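Putting the whole example together, here is a Python sketch that reproduces parts (a) and (b) with exact fractions (the dictionary layout is our own encoding of the problem data):

```python
from fractions import Fraction

# P(a watch works | factory): 99/100 for Factory A, 199/200 for Factory B,
# with a uniform prior over the two factories, as in Example 27.
works = {"A": Fraction(99, 100), "B": Fraction(199, 200)}
prior = {"A": Fraction(1, 2), "B": Fraction(1, 2)}

# Given the factory, watch states are assumed conditionally independent,
# so P(X1=1, X2=1 | Y=f) = works[f] ** 2.
p_x1 = sum(works[f] * prior[f] for f in works)         # P(X1 = 1)
p_both = sum(works[f] ** 2 * prior[f] for f in works)  # P(X1 = 1, X2 = 1)

p_x2_given_x1 = p_both / p_x1
print(float(p_x1))            # 0.9925
print(float(p_x2_given_x1))   # about 0.9925063

# The watch states are NOT unconditionally independent:
assert p_both != p_x1 * p_x1
```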