To start our discussion of probability theory, we first need to define the concept of a probability space. Informally, a probability space is a formal model of a probabilistic experiment, where "experiment" refers to any random process whose different outcomes may occur with different probabilities.
For example, if our experiment is "What will the weather be tomorrow?", outcomes such as "sunny", "rainy", and "snowy" could all be elements of the model. A probability space is a mathematical construct that allows us to create a formal model for these kinds of random processes. To define this concept rigorously, we will start with some definitions.
Definition 1. We define the sample space of an experiment to be the set of all possible outcomes of that experiment.
The sample space is usually denoted as $S$ or $\Omega$. Here, we will use $\Omega$.
Example 2. We toss a coin twice. What is the sample space of this experiment?
Solution: Denoting heads as $H$ and tails as $T$, this experiment has the four possible outcomes:
$$\Omega = \{HH,\; HT,\; TH,\; TT\}.$$
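For small experiments like this one, the sample space can be enumerated mechanically. Here is a minimal Python sketch of that idea (an illustration added here, not part of the original example):

```python
from itertools import product

# Each toss is H or T; two tosses give the Cartesian product {H, T} x {H, T}.
sample_space = {"".join(outcome) for outcome in product("HT", repeat=2)}
print(sorted(sample_space))  # ['HH', 'HT', 'TH', 'TT']
```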
A collection of subsets of $\Omega$ needs to satisfy certain properties in order to define the events of an experiment. Letting $\mathcal{F}$ be such a collection, it must be true that:
Definition 3. We define a $\sigma$-field on $\Omega$ (also called a $\sigma$-algebra), which we'll denote here as $\mathcal{F}$, as a collection of subsets of $\Omega$, that is, $\mathcal{F} \subseteq 2^{\Omega}$, where $2^{\Omega}$ is the power set of $\Omega$. $\mathcal{F}$ has the following properties:
(i) $\Omega \in \mathcal{F}$;
(ii) if $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$, where $A^c$ is the complement of $A$ in $\Omega$;
(iii) if $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.
If $\Omega$ is a sample space, $\mathcal{F}$ is called a $\sigma$-field of events. $\mathcal{F}$ represents the collection of all subsets of $\Omega$ to which we can assign a probability. The elements of $\mathcal{F}$ are called events.
Example 4. Find the $\sigma$-field on $\Omega$ for an experiment with $\Omega = \{1, 2, 3\}$.
Solution: The $\sigma$-field $\mathcal{F}$ will be the set of all subsets of $\Omega$, or
$$\mathcal{F} = 2^{\Omega} = \{\emptyset,\; \{1\},\; \{2\},\; \{3\},\; \{1,2\},\; \{1,3\},\; \{2,3\},\; \{1,2,3\}\}.$$
Here, the empty set $\emptyset$ represents the event where nothing happens. We will show later that the probability of $\emptyset$ is always zero. Events with multiple elements represent the case where any of the included outcomes occur. For example, $\{1,2\}$ represents the event that we roll either a 1 or a 2.
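To make the power-set construction concrete, here is a short Python sketch (ours, assuming the three-outcome $\Omega$ above) that builds $2^{\Omega}$ and spot-checks the three properties from Definition 3:

```python
from itertools import chain, combinations

omega = frozenset({1, 2, 3})

def power_set(s):
    """Return the set of all subsets of s, i.e., 2^s for a finite set s."""
    items = list(s)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))}

F = power_set(omega)

assert omega in F                              # (i) the sample space is an event
assert all(omega - A in F for A in F)          # (ii) closed under complements
assert all(A | B in F for A in F for B in F)   # (iii) closed under (finite) unions
print(len(F))  # 8 = 2^3 subsets
```

(On a finite $\Omega$, checking pairwise unions suffices, since any countable union of subsets of a finite set reduces to a finite one.)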
The concept of a $\sigma$-field is used outside of probability as well. Here are some other examples of $\sigma$-fields.
Example 5. The Borel $\sigma$-field on $\mathbb{R}^n$, denoted $\mathcal{B}(\mathbb{R}^n)$, is defined as the smallest $\sigma$-field on $\mathbb{R}^n$ that contains all open $n$-rectangles $(a_1, b_1) \times \cdots \times (a_n, b_n)$.
To make this easier to visualize, let's discuss the Borel $\sigma$-field on $\mathbb{R}^2$. This is the set of all combinations of open rectangles we can draw on the $xy$-plane. Does this satisfy the properties of a $\sigma$-field?
(i) Is $\mathbb{R}^2 \in \mathcal{B}(\mathbb{R}^2)$? Yes! We can think of $\mathbb{R}^2$ as the union of all rectangles $(-n, n) \times (-n, n)$ where $n \in \mathbb{N}$.
(ii) If $A \in \mathcal{B}(\mathbb{R}^2)$, is $A^c \in \mathcal{B}(\mathbb{R}^2)$? Also yes! For instance, let's choose the element $A = (0,1) \times (0,1)$, represented by the red rectangle below. We can draw rectangles to fill in the rest of the plane, where these rectangles are, by definition, also contained in $\mathcal{B}(\mathbb{R}^2)$. This is demonstrated by the blue rectangles below.
(iii) Finally, if $A_1, A_2, \ldots \in \mathcal{B}(\mathbb{R}^2)$, is $\bigcup_{i=1}^{\infty} A_i \in \mathcal{B}(\mathbb{R}^2)$? You better believe the answer is yes! The $A_i$ are all (countable combinations of) rectangles, so their union is again a countable combination of rectangles, which is, by definition, in $\mathcal{B}(\mathbb{R}^2)$.
Example 6. Let $\mathcal{F}$ be the smallest $\sigma$-field on $\mathbb{R}$ that contains all sets of the form $(-\infty, a]$ and all sets of the form $(b, \infty)$. This $\sigma$-field therefore also contains all sets of the form $(a, b]$, since $(a, b] = (-\infty, b] \cap (a, \infty)$. You can check for yourself that this satisfies all the properties of a $\sigma$-field.
Finally, in order to define a probability space, we need one more definition.
Definition 7. A probability measure, $P$, on $(\Omega, \mathcal{F})$ is a function $P: \mathcal{F} \to [0, 1]$ such that
(i) $P(\Omega) = 1$;
(ii) $0 \leq P(A) \leq 1$ for all $A \in \mathcal{F}$;
(iii) $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$ for all sequences $A_1, A_2, \ldots$ of pairwise disjoint events in $\mathcal{F}$.
Property (iii) is called $\sigma$-additivity.
Theorem 8. For any event $A \in \mathcal{F}$, $P(A^c) = 1 - P(A)$, and $P(\emptyset) = 0$.
Proof: We know that $A \cup A^c = \Omega$ since $\Omega$ is the set of all possible outcomes. So we have that:
$$1 = P(\Omega) = P(A \cup A^c) = P(A) + P(A^c)$$
by $\sigma$-additivity, since $A$ and $A^c$ are disjoint. Therefore:
$$P(A^c) = 1 - P(A).$$
We can now use this result to prove $P(\emptyset) = 0$:
$$P(\emptyset) = P(\Omega^c) = 1 - P(\Omega) = 1 - 1 = 0.$$
The time has come to rigorously define a probability space.
Definition 9. A probability space is defined as the triple $(\Omega, \mathcal{F}, P)$, where $\Omega$ is a sample space, $\mathcal{F}$ is a $\sigma$-field on $\Omega$, and $P$ is a probability measure on $(\Omega, \mathcal{F})$.
Example 10. Find the probability space for rolling a fair six-sided die once.
Solution: We know that $\Omega = \{1, 2, 3, 4, 5, 6\}$ and $\mathcal{F} = 2^{\Omega}$.
Now, for the probability measure, we know that each option in $\Omega$ has an equal probability of $\frac{1}{6}$. So if we take an element of $\mathcal{F}$, for instance $\{2\}$, the probability will be $P(\{2\}) = \frac{1}{6}$. So for any subset $A$ of $\Omega$:
$$P(A) = \frac{|A|}{6}.$$
Together these make up the probability space for this experiment.
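As a quick sanity check of this measure, here is a small Python sketch (ours) implementing $P(A) = |A|/6$:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(A):
    """Uniform probability measure for one roll of a fair die: P(A) = |A| / 6."""
    assert A <= omega  # events must be subsets of the sample space
    return Fraction(len(A), len(omega))

print(P({1, 2}))  # 1/3: the event of rolling a 1 or a 2
print(P(omega))   # 1: P(Omega) = 1, property (i)
print(P(set()))   # 0: P(empty set) = 0, as shown in Theorem 8
```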
Example 11. Say we toss a fair coin until we get a heads. Find the probability space for this experiment, where the outcomes are of the form "we get our first heads on the $n$th toss."
Solution: Here, our sample space is going to be any possible number of tosses, $\Omega = \{1, 2, 3, \ldots\} = \mathbb{N}$, and, of course, $\mathcal{F} = 2^{\Omega}$.
We define our probability measure to represent the probability that it will take $n$ tosses to get our first heads, where $n \in \mathbb{N}$. That is, the first $n - 1$ tosses must be tails and the $n$th must be heads, or
$$P(\{n\}) = \left(\frac{1}{2}\right)^{n-1} \cdot \frac{1}{2} = \left(\frac{1}{2}\right)^n,$$
where $\{n\}$, with $n \in \mathbb{N}$, is an element of $\mathcal{F}$.
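Since each $P(\{n\}) = (1/2)^n$ is easy to estimate empirically, here is a rough Monte Carlo sketch (ours, illustrative only) comparing simulated frequencies against the formula:

```python
import random

def tosses_until_heads():
    """Toss a fair coin until the first heads; return the number of tosses."""
    n = 1
    while random.random() < 0.5:  # tails with probability 1/2, so toss again
        n += 1
    return n

trials = 100_000
counts = {}
for _ in range(trials):
    n = tosses_until_heads()
    counts[n] = counts.get(n, 0) + 1

for n in range(1, 6):
    print(f"n={n}: simulated {counts.get(n, 0) / trials:.4f}, exact {0.5 ** n:.4f}")
```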
We’ll end this section with some important theorems.
Theorem 12. Probability is monotonic. That is, if $A \subseteq B$, then $P(A) \leq P(B)$.
Proof: Using $\sigma$-additivity, given $A \subseteq B$, we can write $B = A \cup (B \setminus A)$, where $A$ and $B \setminus A$ are by definition disjoint. This implies that
$$P(B) = P(A) + P(B \setminus A) \geq P(A).$$
Theorem 13. Probability is sub-$\sigma$-additive. That is, for any sequence $A_1, A_2, \ldots$ of events which may not be disjoint,
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) \leq \sum_{i=1}^{\infty} P(A_i).$$
Proof: Observe that $\bigcup_{i=1}^{\infty} A_i = \bigcup_{i=1}^{\infty} B_i$, where we define $B_i$ as
$$B_1 = A_1, \qquad B_i = A_i \setminus \left(\bigcup_{j=1}^{i-1} A_j\right) \text{ for } i \geq 2.$$
By making it so each successive $B_i$ only includes elements present in none of the previous sets, this process gives us all disjoint sets, so $\sigma$-additivity applies. Therefore we have that
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = P\left(\bigcup_{i=1}^{\infty} B_i\right) = \sum_{i=1}^{\infty} P(B_i) \leq \sum_{i=1}^{\infty} P(A_i),$$
since $B_i \subseteq A_i$.
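On a small finite space the union bound is easy to verify directly. A minimal sketch (ours, assuming a uniform measure on a six-element $\Omega$, as in Example 10):

```python
from fractions import Fraction

omega = frozenset(range(1, 7))

def P(A):
    """Uniform measure on a six-element sample space."""
    return Fraction(len(A), len(omega))

# Overlapping events, so the inequality should be strict here.
A1, A2, A3 = frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({4, 5})
lhs = P(A1 | A2 | A3)
rhs = P(A1) + P(A2) + P(A3)
print(lhs, "<=", rhs)  # 5/6 <= 4/3
assert lhs <= rhs
```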
Theorem 14. Let $A_1, A_2, \ldots$ be a non-decreasing sequence of events. That is, $A_1 \subseteq A_2 \subseteq \cdots$. Then
$$P\left(\bigcup_{n=1}^{\infty} A_n\right) = \lim_{n \to \infty} P(A_n).$$
Proof: Observe that since $A_{n-1} \subseteq A_n$, we have that (setting $A_0 = \emptyset$)
$$A_n = \bigcup_{k=1}^{n} (A_k \setminus A_{k-1})$$
for all $n$, and
$$\bigcup_{n=1}^{\infty} A_n = \bigcup_{k=1}^{\infty} (A_k \setminus A_{k-1}).$$
Notice that we once again gave ourselves a set of disjoint sets to work with. Therefore
$$P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{k=1}^{\infty} P(A_k \setminus A_{k-1}) = \lim_{n \to \infty} \sum_{k=1}^{n} P(A_k \setminus A_{k-1}) = \lim_{n \to \infty} P(A_n).$$
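As a concrete illustration (ours, built on Example 11): let $A_n$ be the event that the first heads appears within the first $n$ tosses, so that $A_1 \subseteq A_2 \subseteq \cdots$. Then
$$P(A_n) = \sum_{k=1}^{n} \left(\frac{1}{2}\right)^k = 1 - \left(\frac{1}{2}\right)^n \longrightarrow 1 \quad \text{as } n \to \infty,$$
which matches the theorem: $\bigcup_{n=1}^{\infty} A_n$ is the event that heads eventually appears, and it has probability 1.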
Corollary 15. Let $A_1, A_2, \ldots$ be a non-increasing sequence of events. That is, $A_1 \supseteq A_2 \supseteq \cdots$. Then
$$P\left(\bigcap_{n=1}^{\infty} A_n\right) = \lim_{n \to \infty} P(A_n).$$
Proof: We know that $A_1^c \subseteq A_2^c \subseteq \cdots$ is a non-decreasing sequence. Using De Morgan's Law of Intersection we can rewrite the intersection as
$$\bigcap_{n=1}^{\infty} A_n = \left(\bigcup_{n=1}^{\infty} A_n^c\right)^c.$$
So by the previous theorem we have that
$$P\left(\bigcap_{n=1}^{\infty} A_n\right) = 1 - P\left(\bigcup_{n=1}^{\infty} A_n^c\right) = 1 - \lim_{n \to \infty} P(A_n^c) = \lim_{n \to \infty} P(A_n).$$
Definition 16. A set is called P-negligible if it is contained in an event $A \in \mathcal{F}$ where the probability $P(A) = 0$.
Theorem 17. A countable union of negligible sets is a negligible set.
Proof: Let $N_i$, with $i \in \mathbb{N}$, be P-negligible sets. By definition there exists a sequence $A_1, A_2, \ldots$, $A_i \in \mathcal{F}$, such that $N_i \subseteq A_i$ and $P(A_i) = 0$ for all $i$. Therefore we have
$$\bigcup_{i=1}^{\infty} N_i \subseteq \bigcup_{i=1}^{\infty} A_i \quad \text{and} \quad P\left(\bigcup_{i=1}^{\infty} A_i\right) \leq \sum_{i=1}^{\infty} P(A_i) = 0$$
by sub-$\sigma$-additivity, so the union is P-negligible.
Another important topic in probability is that of independence and conditional probability. If we know what the weather is today, will that affect our prediction for tomorrow? Probably, since if it’s snowing today we most likely won’t have scorching heat tomorrow. So how do we take this into account?
Definition 18. Two events $A$ and $B$ are called independent if and only if
$$P(A \cap B) = P(A)\,P(B).$$
Note that we may also write $P(A \cap B)$ as $P(A, B)$.
Definition 19. A family of events $(A_i)_{i \in I}$ is called jointly independent, or just independent, if, for any finite set of indices $J \subseteq I$, we have
$$P\left(\bigcap_{j \in J} A_j\right) = \prod_{j \in J} P(A_j).$$
What if two or more events are not independent? Then, we will have to take into account some extra information in order to compute probability. We do this by using conditional probability.
Definition 20. The conditional probability of $A$ given $B$ is the number
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)},$$
defined when $P(B) > 0$. $P(A \mid B)$ is read as "the probability of $A$ given $B$" and represents the probability of $A$ occurring given that we know $B$ is true.
Note that this also gives us an alternative but equivalent definition of independence. We can say that $A$ and $B$ are independent if $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$, meaning that whether or not $B$ is true has no effect on $A$, and vice versa.
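A quick worked example (ours): roll a fair die, and let $A = \{2\}$ and $B = \{2, 4, 6\}$ be the event of an even roll. Then
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{1/6}{1/2} = \frac{1}{3} \neq \frac{1}{6} = P(A),$$
so learning that the roll is even changes the probability of a 2, and $A$ and $B$ are not independent.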
Theorem 21. Bayes' Rule
Given $P(A), P(B) > 0$, we have that
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.$$
We can prove this theorem using the definition of conditional probability rewritten as
$$P(A \cap B) = P(A \mid B)\, P(B).$$
Note that $A \cap B$ is the same as $B \cap A$, so we also have that
$$P(A \cap B) = P(B \mid A)\, P(A).$$
Equating these two we get
$$P(A \mid B)\, P(B) = P(B \mid A)\, P(A),$$
and dividing both sides by $P(B)$ gives the result.
Theorem 22. Bayes' Rule of Total Probability
Let $B_1, B_2, \ldots$ be events which form a partition of $\Omega$. That is, $\{B_i\}$ is a set of disjoint sets such that
$$\bigcup_{i=1}^{\infty} B_i = \Omega.$$
Then, for any event $A \in \mathcal{F}$, with each $P(B_i) > 0$, we have that
$$P(A) = \sum_{i=1}^{\infty} P(A \mid B_i)\, P(B_i).$$
Proof: Since $A$ is in the power set of $\Omega$, we have that
$$A = A \cap \Omega = A \cap \left(\bigcup_{i=1}^{\infty} B_i\right) = \bigcup_{i=1}^{\infty} (A \cap B_i),$$
where the sets $A \cap B_i$ are disjoint. Using $\sigma$-additivity and the definition of conditional probability, we therefore have that
$$P(A) = \sum_{i=1}^{\infty} P(A \cap B_i) = \sum_{i=1}^{\infty} P(A \mid B_i)\, P(B_i).$$
Theorem 23. The Bayes' Sequential Formula
For any sequence of events $A_1, A_2, \ldots, A_n$ with $P(A_1 \cap \cdots \cap A_{n-1}) > 0$, we have
$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1}).$$
Proof: We will use induction on $n$.
Base step: Let $n = 2$. We have that $P(A_1 \cap A_2) = P(A_1)\, P(A_2 \mid A_1)$, which is exactly the definition of conditional probability.
Inductive hypothesis: Assume the theorem is true for $n = k$, for some $k \geq 2$.
Inductive step: Applying the definition of conditional probability with $B = A_1 \cap \cdots \cap A_k$,
$$P(A_1 \cap \cdots \cap A_{k+1}) = P(A_{k+1} \mid A_1 \cap \cdots \cap A_k)\, P(A_1 \cap \cdots \cap A_k),$$
and expanding the second factor using the inductive hypothesis gives the formula for $n = k + 1$.
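As a quick illustration of the sequential rule (ours, not from the notes): the probability of drawing three aces in a row from a standard 52-card deck, without replacement, is
$$P(A_1 \cap A_2 \cap A_3) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1 \cap A_2) = \frac{4}{52} \cdot \frac{3}{51} \cdot \frac{2}{50} = \frac{1}{5525}.$$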
Example 24. The False Positive Problem
A certain disease affects one in every thousand people. A test for this disease will result in a positive for 99% of the patients who have the disease. However, this test also gives a positive result for 2% of the population without the disease.
If a patient gets a positive result using this test, what is the probability that they have the disease?
Solution: Let $D$ be the event that the patient does have the disease and let $T$ be the event that the test is positive.
The given information tells us that
$$P(D) = 0.001, \qquad P(T \mid D) = 0.99, \qquad P(T \mid D^c) = 0.02.$$
Using Bayes' Rule we know that
$$P(D \mid T) = \frac{P(T \mid D)\, P(D)}{P(T)}.$$
To find $P(T)$, we use Bayes' Rule of Total Probability:
$$P(T) = P(T \mid D)\, P(D) + P(T \mid D^c)\, P(D^c) = 0.99 \times 0.001 + 0.02 \times 0.999 = 0.02097,$$
and so
$$P(D \mid T) = \frac{0.99 \times 0.001}{0.02097} \approx 0.047.$$
Thus, if the patient tests positive there is only about a 5% chance that they actually have the disease!
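The arithmetic above is easy to reproduce. A minimal Python sketch (ours, with the numbers from the problem statement):

```python
# Given quantities from the problem statement.
p_disease = 0.001           # P(D): one person in a thousand has the disease
p_pos_given_disease = 0.99  # P(T | D): test sensitivity
p_pos_given_healthy = 0.02  # P(T | D^c): false positive rate

# Bayes' Rule of Total Probability: P(T) = P(T|D)P(D) + P(T|D^c)P(D^c).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' Rule: P(D | T) = P(T | D) P(D) / P(T).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(T) = {p_pos:.5f}")                    # 0.02097
print(f"P(D | T) = {p_disease_given_pos:.4f}")  # about 0.047
```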
Example 25. Bertrand’s Ballot Problem
In an election, candidates $A$ and $B$ have received $a$ and $b$ votes respectively. Candidate $A$ won the election, meaning $a > b$.
Prove that the following statement is true: The probability that, in the course of counting all the votes, candidate $A$ has always had the lead, is
$$\frac{a - b}{a + b}.$$
Solution: We are going to prove this by induction on the number of votes. Let $P_{a,b}$ denote the probability that $A$ is always in the lead when $A$ receives $a$ votes and $B$ receives $b$ votes.
Base step: The smallest number of votes is 1, where $a = 1$ and $b = 0$. Here, it is trivial that $A$ will always be in the lead, so $P_{1,0} = 1 = \frac{1 - 0}{1 + 0}$.
Inductive hypothesis: We will assume that the expression is true when the total number of votes is $a + b - 1$. This includes the cases where Candidate $A$ receives $a - 1$ votes and Candidate $B$ receives $b$ votes, as well as when they receive $a$ and $b - 1$ votes respectively.
Inductive step: Let the total number of votes be $n = a + b$, where Candidate $A$ and Candidate $B$ receive $a$ and $b$ votes respectively, with $a > b$.
First, let's look only at the last vote. The probability that the last vote will be a vote for $A$ is
$$P(A_{\text{GLV}}) = \frac{a}{a + b},$$
where GLV stands for "gets last vote". The probability this vote will be for $B$ is
$$P(B_{\text{GLV}}) = \frac{b}{a + b}.$$
Now let's compute $P_{a,b}$. We will break this into two cases:
In the first case, say the final vote is for $A$. In order for $A$ to be ahead the entire time prior to this vote, we will require Candidate $A$'s $a - 1$ prior votes be always ahead of Candidate $B$'s $b$ votes. The probability that this will occur is $P_{a-1,b}$.
For the second case, say the final vote is for $B$. For Candidate $A$ to be ahead the entire time, up to and including this vote, we will require that Candidate $A$'s $a$ votes be always ahead of Candidate $B$'s $b - 1$ votes. Note that since $a > b$, the last vote for $B$ has no chance of making $B$ pull ahead. The probability that this will occur is $P_{a,b-1}$.
And so we have that
$$P_{a,b} = \frac{a}{a+b}\, P_{a-1,b} + \frac{b}{a+b}\, P_{a,b-1}.$$
Using the inductive hypothesis, we know that $P_{a-1,b}$ and $P_{a,b-1}$ follow the proposed formula, so we have
$$P_{a,b} = \frac{a}{a+b} \cdot \frac{a - b - 1}{a + b - 1} + \frac{b}{a+b} \cdot \frac{a - b + 1}{a + b - 1} = \frac{a(a - b - 1) + b(a - b + 1)}{(a+b)(a+b-1)} = \frac{(a-b)(a+b-1)}{(a+b)(a+b-1)}.$$
Therefore
$$P_{a,b} = \frac{a - b}{a + b}.$$
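The formula is also easy to test by simulation. A rough Monte Carlo sketch (ours; the vote counts $a = 7$, $b = 4$ are arbitrary):

```python
import random

def a_always_leads(a, b):
    """Shuffle a ballot sequence with a votes for A and b votes for B;
    return True if A is strictly ahead after every ballot counted."""
    ballots = ["A"] * a + ["B"] * b
    random.shuffle(ballots)
    lead = 0
    for vote in ballots:
        lead += 1 if vote == "A" else -1
        if lead <= 0:
            return False
    return True

a, b, trials = 7, 4, 100_000
wins = sum(a_always_leads(a, b) for _ in range(trials))
print(f"simulated: {wins / trials:.4f}")               # close to 0.2727
print(f"(a - b) / (a + b) = {(a - b) / (a + b):.4f}")  # 3/11
```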
Definition 26. Let $A$, $B$, and $C$ be events where $P(C) > 0$. $A$ and $B$ are called conditionally independent given $C$ if
$$P(A \cap B \mid C) = P(A \mid C)\, P(B \mid C).$$
In other words, $A$ and $B$ are independent with respect to the probability $P_C$ defined as
$$P_C(A) = P(A \mid C).$$
Example 27. Cheap Watches
Two factories, Factory $A$ and Factory $B$, manufacture watches. Factory $A$ produces, on average, 1 defective item out of 100, while Factory $B$ produces, on average, 1 defective item out of 200.
You've received a container of watches from one of the two factories, but don't know which one. You pull out the first watch and it works!
(a) What is the probability that the second watch you check will also work?
(b) Are the states of the first two watches independent?
(c) Are the states of the first two watches conditionally independent given we know that the container came from Factory $A$?
Solution:
(a) Let $W_i$ represent the state of the $i$th watch in the container, where $W_i = 1$ if the watch works and $W_i = 0$ if it is defective. Let $F$ represent the factory the container came from. Since we don't know which factory that is, we represent our ignorance by letting
$$P(F = A) = P(F = B) = \frac{1}{2}.$$
We will assume that, given the factory, the states of the watches are conditionally independent. We know that
$$P(W_i = 1 \mid F = A) = 0.99 \quad \text{and} \quad P(W_i = 1 \mid F = B) = 0.995.$$
Given that the first watch worked, the probability that the second watch will work is given by
$$P(W_2 = 1 \mid W_1 = 1) = \frac{P(W_1 = 1, W_2 = 1)}{P(W_1 = 1)}.$$
We can use Bayes' formula of total probability to find the numerator and denominator of this expression:
$$P(W_1 = 1, W_2 = 1) = \frac{1}{2}(0.99)^2 + \frac{1}{2}(0.995)^2 = 0.9850625$$
and
$$P(W_1 = 1) = \frac{1}{2}(0.99) + \frac{1}{2}(0.995) = 0.9925.$$
Therefore
$$P(W_2 = 1 \mid W_1 = 1) = \frac{0.9850625}{0.9925} \approx 0.9925063.$$
So the probability that the second watch will work given that the first one does is about 0.9925063.
(b) If these two events were independent, we'd have that
$$P(W_1 = 1, W_2 = 1) = P(W_1 = 1)\, P(W_2 = 1).$$
From the previous part we know that $P(W_1 = 1) = 0.9925$ and $P(W_1 = 1, W_2 = 1) = 0.9850625$. By the same logic we also have that $P(W_2 = 1) = 0.9925$, which means that
$$P(W_1 = 1)\, P(W_2 = 1) = (0.9925)^2 = 0.98505625 \neq 0.9850625 = P(W_1 = 1, W_2 = 1),$$
meaning that the states of the two watches are not independent!
Why is this the case? We start our watch selection with zero information about whether our container is from Factory $A$ or Factory $B$. But with each successive selection, we gain information that lets us slightly adjust our probabilities.
But what if we had known from the start which factory our watches came from?
(c) If you look closely at part (a), we've been assuming conditional independence from the start, as it can be seen as a bit trivial in this instance. But let's look closely at it anyway. $W_1$ and $W_2$ are conditionally independent given $F = A$ if
$$P(W_1 = 1, W_2 = 1 \mid F = A) = P(W_1 = 1 \mid F = A)\, P(W_2 = 1 \mid F = A).$$
We know that
$$P(W_1 = 1, W_2 = 1 \mid F = A) = (0.99)^2 = 0.9801,$$
and we also know that
$$P(W_1 = 1 \mid F = A)\, P(W_2 = 1 \mid F = A) = (0.99)(0.99) = 0.9801.$$
Therefore, the states of the two watches are conditionally independent given we know that the container came from Factory $A$.
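For completeness, a short Python sketch (ours) reproducing the numbers in this example:

```python
p_work = {"A": 0.99, "B": 0.995}  # P(W_i = 1 | F), the working rate per factory
prior = {"A": 0.5, "B": 0.5}      # our prior over the unknown factory

# Total probability over the two factories.
p_w1 = sum(prior[f] * p_work[f] for f in p_work)

# Watches are conditionally independent given the factory, so square the rate.
p_w1_and_w2 = sum(prior[f] * p_work[f] ** 2 for f in p_work)

print(f"P(W2=1 | W1=1) = {p_w1_and_w2 / p_w1:.7f}")  # 0.9925063
print(f"P(W1=1) P(W2=1) = {p_w1 ** 2:.8f}")          # 0.98505625
print(f"P(W1=1, W2=1)   = {p_w1_and_w2:.8f}")        # 0.98506250, not equal
```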