To start our discussion of probability theory, we first need to define the concept of a probability space. As a first intuition, the probability space of some probabilistic experiment captures the set of all possible outcomes of that experiment and how likely each outcome is. In this case, "experiment" refers to any random process where different outcomes may occur with different probabilities.
For example, if our experiment is "What will the weather be tomorrow?", the outcomes ("sunny", "rainy", "snowy", and so on)
could all be elements in the sample space. A probability space is a mathematical construct that allows us to create a formal model for these kinds of random processes. To define this concept rigorously, we will start with some definitions.
Definition 1. We define the sample space of an experiment to be the set of all possible outcomes of that experiment.
The sample space is usually denoted as $\Omega$ or $S$. Here, we will use $\Omega$.
Example 2. We toss a coin twice. What is the sample space of this experiment?
Solution: Denoting heads as $H$ and tails as $T$, this experiment has the four possible outcomes: $\Omega = \{HH, HT, TH, TT\}$.
A collection of subsets of $\Omega$ needs to satisfy certain properties in order to define a $\sigma$-field on that sample space.
Definition 3. We define a $\sigma$-field on $\Omega$ (also called a $\sigma$-algebra), which we’ll denote here as $\mathcal{F}$, as a collection of subsets of $\Omega$, that is, a subset of the power set of $\Omega$: $\mathcal{F} \subseteq \mathcal{P}(\Omega)$.
$\mathcal{F}$ has the following properties:
(i) $\Omega \in \mathcal{F}$
(ii) If $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$, where $A^c$ is the complement of $A$ in $\Omega$
(iii) If $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$
If $\Omega$ is a sample space, $\mathcal{F}$ is called a $\sigma$-field of events. $\mathcal{F}$ represents the collection of all subsets of $\Omega$ to which we can assign a probability. The elements of $\mathcal{F}$ are called events.
Example 4. Find the $\sigma$-field of all subsets of $\Omega$ for an experiment with $\Omega = \{1, 2, 3\}$.
Solution: The $\sigma$-field will be the set of all subsets of $\Omega$, or $\mathcal{F} = \{\emptyset, \{1\}, \{2\}, \{3\}, \{1,2\}, \{1,3\}, \{2,3\}, \{1,2,3\}\}$. Here, the empty set $\emptyset$ represents the event where nothing happens. We will show later that the probability of $\emptyset$ is always zero. Events with multiple elements represent the case where any of the included outcomes occurs. For example, $\{1, 2\}$ represents the event that we roll either a 1 or a 2.
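To make the enumeration concrete, here is a minimal Python sketch (assuming the three-outcome sample space above; the names `omega` and `power_set` are just illustrative) that builds the power set and checks the three properties from Definition 3:

```python
from itertools import chain, combinations

omega = frozenset({1, 2, 3})

def power_set(s):
    """Return the set of all subsets of s, each as a frozenset."""
    return {frozenset(c) for c in
            chain.from_iterable(combinations(s, r) for r in range(len(s) + 1))}

F = power_set(omega)
print(len(F))  # 2^3 = 8 events, including the empty set

# The three sigma-field properties from Definition 3:
assert omega in F                             # (i)   contains the sample space
assert all(omega - A in F for A in F)         # (ii)  closed under complements
assert all(A | B in F for A in F for B in F)  # (iii) closed under (here finite) unions
```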
The concept of a -field is used outside of probability as well. Here are some other examples of -fields.
Example 5. The Borel $\sigma$-field on $\mathbb{R}^n$, denoted $\mathcal{B}(\mathbb{R}^n)$, is defined as the smallest $\sigma$-field on $\mathbb{R}^n$ that contains all open $n$-rectangles $(a_1, b_1) \times \cdots \times (a_n, b_n)$.
To make this easier to visualize, let’s discuss the Borel $\sigma$-field on $\mathbb{R}^2$. This is the smallest $\sigma$-field containing all the open rectangles we can draw on the plane. Does this satisfy the properties of a $\sigma$-field?
(i) Is $\mathbb{R}^2 \in \mathcal{B}(\mathbb{R}^2)$? Yes! We can think of $\mathbb{R}^2$ as the union of all rectangles $(-n, n) \times (-n, n)$ where $n \in \mathbb{N}$.
(ii) If $A \in \mathcal{B}(\mathbb{R}^2)$, is $A^c \in \mathcal{B}(\mathbb{R}^2)$? Also yes! For instance, let’s choose an open rectangle $A$, represented by the red rectangle below. We can draw countably many rectangles to fill in the rest of the plane, and these rectangles are, by definition, also contained in $\mathcal{B}(\mathbb{R}^2)$. This is demonstrated by the blue rectangles below.
(iii) Finally, if $A_1, A_2, \ldots \in \mathcal{B}(\mathbb{R}^2)$, is $\bigcup_{i=1}^{\infty} A_i \in \mathcal{B}(\mathbb{R}^2)$? You better believe the answer is yes! $A_1, A_2, \ldots$ are all built from rectangles, so their union is again built from rectangles, which is, by definition, in $\mathcal{B}(\mathbb{R}^2)$.
Example 6. Let $\mathcal{F}$ be the smallest $\sigma$-field on $\mathbb{R}$ that contains all sets of the form $(-\infty, a]$ and all sets of the form $(b, \infty)$.
Therefore this $\sigma$-field also contains all sets of the form $(b, a] = (-\infty, a] \cap (b, \infty)$.
You can check for yourself that this satisfies all the properties of a -field.
Finally, in order to define a probability space, we need one more definition.
Definition 7. A probability measure, $P$, on $(\Omega, \mathcal{F})$ is a function $P: \mathcal{F} \to [0, 1]$ such that
(i) $P(\Omega) = 1$
(ii) $0 \leq P(A) \leq 1$ for all $A \in \mathcal{F}$
(iii) $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$ for all sequences $(A_i)_{i \geq 1}$ of pairwise disjoint events in $\mathcal{F}$.
Property (iii) is called -additivity.
Theorem 8. For any event $A \in \mathcal{F}$, $P(A^c) = 1 - P(A)$ and $P(\emptyset) = 0$.
Proof: We know that $A \cup A^c = \Omega$ since $\Omega$ is the set of all possible outcomes. So we have that $1 = P(\Omega) = P(A \cup A^c) = P(A) + P(A^c)$ by $\sigma$-additivity. Therefore $P(A^c) = 1 - P(A)$. We can now use this result to prove $P(\emptyset) = 0$: $P(\emptyset) = P(\Omega^c) = 1 - P(\Omega) = 0$.
The time has come to rigorously define a probability space.
Definition 9. A probability space is defined as the triple $(\Omega, \mathcal{F}, P)$, where $\Omega$ is a sample space, $\mathcal{F}$ is a $\sigma$-field on $\Omega$, and $P$ is a probability measure on $(\Omega, \mathcal{F})$.
Example 10. Find the probability space for rolling a fair six-sided die once.
Solution: We know that $\Omega = \{1, 2, 3, 4, 5, 6\}$ and $\mathcal{F} = \mathcal{P}(\Omega)$. Now, for the probability measure, we know that each outcome in $\Omega$ has an equal probability of $\frac{1}{6}$. So if we take an element of $\mathcal{F}$, for instance $\{1, 2\}$, the probability will be $P(\{1, 2\}) = P(\{1\}) + P(\{2\}) = \frac{1}{6} + \frac{1}{6} = \frac{1}{3}$. So for any subset $A$ of $\Omega$: $P(A) = \frac{|A|}{6}$. Together, $(\Omega, \mathcal{F}, P)$ make up the probability space for this experiment.
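As a sanity check, the uniform measure of Example 10 takes only a few lines of Python (a sketch; the names `omega` and `P` are illustrative):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(A):
    """Uniform probability of an event A, a subset of omega: P(A) = |A|/6."""
    assert A <= omega, "an event must be a subset of the sample space"
    return Fraction(len(A), len(omega))

print(P({1, 2}))   # 1/3, matching the computation above
print(P(omega))    # 1, property (i) of Definition 7
print(P(set()))    # 0, as proved in Theorem 8
```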
Example 11. Say we toss a fair coin until we get a heads. Find the probability space for the event that we get our first heads on the $n$th toss.
Solution: Here, our sample space is going to be any possible number of tosses: $\Omega = \{1, 2, 3, \ldots\} = \mathbb{N}$, and, of course, $\mathcal{F} = \mathcal{P}(\Omega)$. We define our probability measure to represent the probability that it will take $n$ tosses to get our first heads, where the first $n - 1$ tosses are tails and the $n$th is heads. That is, $P(\{n\}) = \left(\frac{1}{2}\right)^n$, or $P(A) = \sum_{n \in A} \left(\frac{1}{2}\right)^n$, where $A$, with $A \subseteq \Omega$, is an element of $\mathcal{F}$.
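We can check this measure empirically by simulating the experiment. A Monte Carlo sketch (the trial count and names are arbitrary choices):

```python
import random

def tosses_until_heads():
    """Toss a fair coin until it lands heads; return the number of tosses."""
    n = 1
    while random.random() < 0.5:  # tails with probability 1/2, so toss again
        n += 1
    return n

trials = 100_000
counts = {}
for _ in range(trials):
    n = tosses_until_heads()
    counts[n] = counts.get(n, 0) + 1

# Empirical frequencies should approach P({n}) = (1/2)^n.
for n in range(1, 6):
    print(n, counts.get(n, 0) / trials, 0.5 ** n)
```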
We’ll end this section with some important theorems.
Theorem 12. Probability is monotonic. That is, if $A \subseteq B$, then $P(A) \leq P(B)$.
Proof: Using $\sigma$-additivity: given $A \subseteq B$, write $B = A \cup (B \setminus A)$, where $A$ and $B \setminus A$ are by definition disjoint. This implies that $P(B) = P(A) + P(B \setminus A) \geq P(A)$.
Theorem 13. Probability is sub-$\sigma$-additive. That is, for any sequence of events $(A_n)_{n \geq 1}$, which may not be disjoint, $P\left(\bigcup_{n=1}^{\infty} A_n\right) \leq \sum_{n=1}^{\infty} P(A_n)$.
Proof: Observe that $\bigcup_{n=1}^{\infty} A_n = \bigcup_{n=1}^{\infty} B_n$, where we define $B_n$ as $B_1 = A_1$ and $B_n = A_n \setminus (A_1 \cup \cdots \cup A_{n-1})$ for $n \geq 2$. By making it so each successive $B_n$ only includes elements present in none of the previous sets, this process gives us pairwise disjoint sets, so $\sigma$-additivity applies.
Therefore we have that $P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{n=1}^{\infty} P(B_n) \leq \sum_{n=1}^{\infty} P(A_n)$, since $B_n \subseteq A_n$ and probability is monotonic.
Theorem 14. Let $(A_n)_{n \geq 1}$ be a non-decreasing sequence of events. That is, $A_1 \subseteq A_2 \subseteq \cdots$. Then $P\left(\bigcup_{n=1}^{\infty} A_n\right) = \lim_{n \to \infty} P(A_n)$.
Proof: Define $B_1 = A_1$ and $B_k = A_k \setminus A_{k-1}$ for $k \geq 2$. Observe that since $A_1 \subseteq A_2 \subseteq \cdots$, we have that $A_n = \bigcup_{k=1}^{n} B_k$ for all $n$, and $\bigcup_{n=1}^{\infty} A_n = \bigcup_{n=1}^{\infty} B_n$. Notice that we once again gave ourselves a collection of disjoint sets to work with. Therefore $P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{k=1}^{\infty} P(B_k) = \lim_{n \to \infty} \sum_{k=1}^{n} P(B_k) = \lim_{n \to \infty} P(A_n)$.
Corollary 15. Let $(A_n)_{n \geq 1}$ be a non-increasing sequence of events. That is, $A_1 \supseteq A_2 \supseteq \cdots$. Then $P\left(\bigcap_{n=1}^{\infty} A_n\right) = \lim_{n \to \infty} P(A_n)$.
Proof: We know that $(A_n^c)_{n \geq 1}$ is a non-decreasing sequence of events. Using De Morgan’s Law of Intersection we can rewrite $\bigcap_{n=1}^{\infty} A_n$ as $\left(\bigcup_{n=1}^{\infty} A_n^c\right)^c$. So by the previous theorem we have that $P\left(\bigcap_{n=1}^{\infty} A_n\right) = 1 - P\left(\bigcup_{n=1}^{\infty} A_n^c\right) = 1 - \lim_{n \to \infty} P(A_n^c) = 1 - \lim_{n \to \infty} \left(1 - P(A_n)\right) = \lim_{n \to \infty} P(A_n)$.
Definition 16. A set $N \subseteq \Omega$ is called $P$-negligible if it is contained in an event $A \in \mathcal{F}$ where the probability $P(A) = 0$.
Theorem 17. A countable union of negligible sets is a negligible set.
Proof: Let $N_n$ with $n \in \mathbb{N}$ be $P$-negligible sets. By definition there exists a sequence $(A_n)_{n \geq 1}$ in $\mathcal{F}$ such that $N_n \subseteq A_n$ and $P(A_n) = 0$ for all $n$. Therefore we have $\bigcup_{n} N_n \subseteq \bigcup_{n} A_n$, and by sub-$\sigma$-additivity, $P\left(\bigcup_{n} A_n\right) \leq \sum_{n} P(A_n) = 0$.
Another important topic in probability is that of independence and conditional probability. If we know what the weather is today, will that affect our prediction for tomorrow? Probably, since if it’s snowing today we most likely won’t have scorching heat tomorrow. So how do we take this into account?
Definition 18. Two events $A$ and $B$ are called independent if and only if $P(A \cap B) = P(A)P(B)$.
Note that we may also write $P(A \cap B)$ as $P(A, B)$.
Definition 19. A family of events $(A_i)_{i \in I}$ is called jointly independent or just independent if, for any finite set of indices $J \subseteq I$, we have $P\left(\bigcap_{j \in J} A_j\right) = \prod_{j \in J} P(A_j)$.
What if two or more events are not independent? Then, we will have to take into account some extra information in order to compute probability. We do this by using conditional probability.
Definition 20. The conditional probability of $A$ given $B$ is the number $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$, defined when $P(B) > 0$.
$P(A \mid B)$ is read as "the probability of $A$ given $B$" and represents the probability of $A$ occurring given that we know $B$ is true.
Note that this also gives us an alternative but equivalent definition of independence. We can say that $A$ and $B$ are independent if $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$, meaning that whether or not $B$ is true has no effect on $A$, and vice versa.
Theorem 21. Bayes’ Rule
Given $P(A) > 0$ and $P(B) > 0$, we have that $P(A \mid B) = \frac{P(B \mid A)P(A)}{P(B)}$.
We can prove this theorem using the definition of conditional probability rewritten as $P(A \cap B) = P(A \mid B)P(B)$. Note that $A \cap B$ is the same as $B \cap A$, so we also have that $P(A \cap B) = P(B \mid A)P(A)$. Equating these two we get $P(A \mid B)P(B) = P(B \mid A)P(A)$, and dividing both sides by $P(B)$ gives the result.
Theorem 22. Bayes’ Rule of Total Probability
Let $B_1, B_2, \ldots$ be events which form a partition of $\Omega$. That is, $\{B_n\}$ is a collection of disjoint sets such that $\bigcup_{n} B_n = \Omega$, with $P(B_n) > 0$ for all $n$. Then, for all $A \in \mathcal{F}$, we have that $P(A) = \sum_{n} P(A \mid B_n)P(B_n)$.
Proof: Since the $B_n$ partition $\Omega$, we have that $A = A \cap \Omega = \bigcup_{n} (A \cap B_n)$, a union of disjoint sets. Using $\sigma$-additivity and the definition of conditional probability, we therefore have that $P(A) = \sum_{n} P(A \cap B_n) = \sum_{n} P(A \mid B_n)P(B_n)$.
Theorem 23. Bayes’ Sequential Formula
For any sequence of events $A_1, A_2, \ldots, A_n$ we have $P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)P(A_2 \mid A_1)P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1})$.
Proof: We will use induction on $n$.
Base step: Let $n = 2$. We have that $P(A_1 \cap A_2) = P(A_1)P(A_2 \mid A_1)$ by the definition of conditional probability. Inductive hypothesis: Assume the theorem is true for $n = k$, for some $k \geq 2$.
Inductive step: $P(A_1 \cap \cdots \cap A_{k+1}) = P(A_{k+1} \mid A_1 \cap \cdots \cap A_k)P(A_1 \cap \cdots \cap A_k) = P(A_1)P(A_2 \mid A_1) \cdots P(A_k \mid A_1 \cap \cdots \cap A_{k-1})P(A_{k+1} \mid A_1 \cap \cdots \cap A_k)$ by the inductive hypothesis, which is the formula for $n = k + 1$.
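For a concrete check of the formula, here is a small sketch with three arbitrarily chosen events from the die experiment of Example 10:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(A):
    """Uniform probability on the die: P(A) = |A|/6."""
    return Fraction(len(A), len(omega))

def cond(A, B):
    """Conditional probability P(A | B) = P(A and B) / P(B)."""
    return P(A & B) / P(B)

A1 = {1, 2, 3, 4}  # roll at most 4
A2 = {2, 4, 6}     # roll an even number
A3 = {4, 5, 6}     # roll at least 4

lhs = P(A1 & A2 & A3)
rhs = P(A1) * cond(A2, A1) * cond(A3, A1 & A2)
print(lhs, rhs, lhs == rhs)  # both equal 1/6
```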
Example 24. The False Positive Problem
A certain disease affects one in every thousand people. A test for this disease will result in a positive for 99% of the patients who have the disease. However, this test also gives a positive result for 2% of the population without the disease.
If a patient gets a positive result using this test, what is the probability that they have the disease?
Solution: Let $D$ be the event that the patient does have the disease and let $T$ be the event that the test is positive.
The given information tells us that $P(D) = 0.001$, $P(T \mid D) = 0.99$, and $P(T \mid D^c) = 0.02$.
Using Bayes’ Rule we know that $P(D \mid T) = \frac{P(T \mid D)P(D)}{P(T)}$.
To find $P(T)$, we use Bayes’ Rule of Total Probability: $P(T) = P(T \mid D)P(D) + P(T \mid D^c)P(D^c) = (0.99)(0.001) + (0.02)(0.999) = 0.02097$,
and so $P(D \mid T) = \frac{(0.99)(0.001)}{0.02097} \approx 0.047$. Thus, if the patient tests positive there is only about a 5% chance that they actually have the disease!
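The same computation in Python makes it easy to see how the answer moves as the prevalence or false positive rate changes (a sketch with illustrative variable names):

```python
p_disease = 0.001         # P(D): one in a thousand
p_pos_given_d = 0.99      # P(T | D): sensitivity of the test
p_pos_given_not_d = 0.02  # P(T | D^c): false positive rate

# Bayes' Rule of Total Probability gives the denominator P(T).
p_pos = p_pos_given_d * p_disease + p_pos_given_not_d * (1 - p_disease)

# Bayes' Rule gives the posterior P(D | T).
p_d_given_pos = p_pos_given_d * p_disease / p_pos
print(p_d_given_pos)  # ~0.047, i.e. roughly a 5% chance
```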
Example 25. Bertrand’s Ballot Problem
In an election, candidates $A$ and $B$ have received $a$ and $b$ votes respectively. Candidate $A$ won the election, meaning $a > b$.
Prove that the following statement is true: The probability that, in the course of counting all the votes, candidate $A$ always has the lead, is $\frac{a - b}{a + b}$.
Solution: We are going to prove this by induction on the number of votes. Let $P_{a,b}$ denote the probability that $A$ is always in the lead when $A$ receives $a$ votes and $B$ receives $b$ votes.
Base step: The smallest number of votes is 1, where $a = 1$ and $b = 0$. Here, it is trivial that $A$ will always be in the lead, and indeed $P_{1,0} = 1 = \frac{1 - 0}{1 + 0}$.
Inductive hypothesis: We will assume that the expression is true when the total number of votes is $n$. This includes the cases where Candidate $A$ receives $a - 1$ votes and Candidate $B$ receives $b$ votes, as well as when they receive $a$ and $b - 1$ votes respectively.
Inductive step: Let the total number of votes be $n + 1$, where Candidate $A$ and Candidate $B$ receive $a$ and $b$ votes respectively, with $a + b = n + 1$ and $a > b$.
First, let’s look only at the last vote. The probability that the last vote will be a vote for $A$ is $P(A_{\mathrm{glv}}) = \frac{a}{a+b}$, where "glv" stands for "gets last vote". The probability this vote will be for $B$ is $P(B_{\mathrm{glv}}) = \frac{b}{a+b}$. Now let’s compute $P_{a,b}$. We will break this into two cases:
In the first case, say the final vote is for $A$. In order for $A$ to be ahead the entire time prior to this vote, we will require Candidate $A$’s prior $a - 1$ votes to be always ahead of Candidate $B$’s $b$ votes. The probability that this will occur is $P_{a-1,b}$.
For the second case, say the final vote is for $B$. For Candidate $A$ to be ahead the entire time, up to and including this vote, we will require that Candidate $A$’s $a$ votes be always ahead of Candidate $B$’s first $b - 1$ votes. Note that since $a > b$, the last vote for $B$ has no chance of making $B$ pull ahead. The probability that this will occur is $P_{a,b-1}$.
And so we have that $P_{a,b} = \frac{a}{a+b}P_{a-1,b} + \frac{b}{a+b}P_{a,b-1}$. Using the inductive hypothesis, we know that $P_{a-1,b}$ and $P_{a,b-1}$ follow the proposed formula, so we have $P_{a,b} = \frac{a}{a+b} \cdot \frac{a-1-b}{a-1+b} + \frac{b}{a+b} \cdot \frac{a-b+1}{a+b-1} = \frac{a(a-b-1) + b(a-b+1)}{(a+b)(a+b-1)} = \frac{(a-b)(a+b-1)}{(a+b)(a+b-1)} = \frac{a-b}{a+b}$. Therefore $P_{a,b} = \frac{a-b}{a+b}$.
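The formula is easy to test by simulation: shuffle the $a + b$ ballots uniformly at random and check whether $A$ stays strictly ahead throughout the count. A Monte Carlo sketch (the choice of $a$, $b$, and the trial count is arbitrary):

```python
import random

def a_always_leads(a, b):
    """Count a shuffled ballot sequence; True if A is strictly ahead after every ballot."""
    ballots = ['A'] * a + ['B'] * b
    random.shuffle(ballots)
    lead = 0
    for vote in ballots:
        lead += 1 if vote == 'A' else -1
        if lead <= 0:  # A's lead must stay strictly positive
            return False
    return True

a, b, trials = 7, 4, 100_000
hits = sum(a_always_leads(a, b) for _ in range(trials))
print(hits / trials, (a - b) / (a + b))  # empirical vs (a-b)/(a+b) = 3/11
```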
Definition 26. Let $A$, $B$, and $C$ be events where $P(C) > 0$. $A$ and $B$ are called conditionally independent given $C$ if $P(A \cap B \mid C) = P(A \mid C)P(B \mid C)$.
In other words, $A$ and $B$ are independent with respect to the probability measure $P_C$ defined as $P_C(\cdot) = P(\cdot \mid C)$.
Example 27. Cheap Watches
Two factories, Factory $A$ and Factory $B$, manufacture watches. Factory $A$ produces, on average, 1 defective item out of 100, while Factory $B$ produces, on average, 1 defective item out of 200.
You’ve received a container of watches from one of the two factories, but don’t know which one. You pull out the first watch and it works!
(a) What is the probability that the second watch you check will also work?
(b) Are the states of the first two watches independent?
(c) Are the states of the first two watches conditionally independent given we know that the container came from Factory $A$?
Solution:
(a) Let $W_i$ represent the state of the $i$th watch in the container, where $W_i = 1$ if the watch works and $W_i = 0$ if it is defective.
Let $F$ represent the factory the container came from. Since we don’t know which factory that is, we represent our ignorance by letting $P(F = A) = P(F = B) = \frac{1}{2}$. We will assume that, given the factory, the states of the watches are conditionally independent. We know that $P(W_i = 1 \mid F = A) = 0.99$ and $P(W_i = 1 \mid F = B) = 0.995$. Given that the first watch worked, the probability that the second watch will work is given by $P(W_2 = 1 \mid W_1 = 1) = \frac{P(W_1 = 1, W_2 = 1)}{P(W_1 = 1)}$. We can use Bayes’ formula of total probability to find the numerator and denominator of this expression: $P(W_1 = 1, W_2 = 1) = (0.99)^2 \cdot \frac{1}{2} + (0.995)^2 \cdot \frac{1}{2} = 0.9850625$ and $P(W_1 = 1) = 0.99 \cdot \frac{1}{2} + 0.995 \cdot \frac{1}{2} = 0.9925$. Therefore $P(W_2 = 1 \mid W_1 = 1) = \frac{0.9850625}{0.9925} \approx 0.9925063$. So the probability that the second watch will work given that the first one does is about $0.9925$.
(b) If these two events were independent, we’d have that $P(W_1 = 1, W_2 = 1) = P(W_1 = 1)P(W_2 = 1)$. From the previous part we know that $P(W_1 = 1) = 0.9925$ and $P(W_1 = 1, W_2 = 1) = 0.9850625$. By the same logic we also have that $P(W_2 = 1) = 0.9925$, which means that $P(W_1 = 1)P(W_2 = 1) = 0.9925^2 = 0.98505625 \neq 0.9850625 = P(W_1 = 1, W_2 = 1)$, meaning that the states of the two watches are not independent!
Why is this the case? We start our watch selection with zero information about whether our container is from Factory $A$ or Factory $B$. But with each successive selection, we will gain the information to slightly adjust our probabilities.
But what if we were to know which factory our watches came from, right from the start?
(c) If you look closely at part (a), you’ll see we’ve been assuming conditional independence from the start, as it can seem a bit trivial in this instance. But let’s look closely at it anyway. $W_1$ and $W_2$ are conditionally independent given $F = A$ if $P(W_1 = 1, W_2 = 1 \mid F = A) = P(W_1 = 1 \mid F = A)P(W_2 = 1 \mid F = A)$. We know that $P(W_1 = 1, W_2 = 1 \mid F = A) = (0.99)^2 = 0.9801$, and we also know that $P(W_1 = 1 \mid F = A)P(W_2 = 1 \mid F = A) = 0.99 \times 0.99 = 0.9801$. Therefore, the states of the two watches are conditionally independent given we know that the container came from Factory $A$.
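To tie parts (a)–(c) together, here is a sketch reproducing the numbers above; as in part (a), it assumes the watch states are conditionally independent given the factory:

```python
p_factory = {'A': 0.5, 'B': 0.5}  # our prior over the two factories
p_work = {'A': 0.99, 'B': 0.995}  # P(a watch works | factory)

# (a) Total probability, using conditional independence given the factory.
p_w1 = sum(p_factory[f] * p_work[f] for f in p_factory)
p_w1_and_w2 = sum(p_factory[f] * p_work[f] ** 2 for f in p_factory)
print(p_w1_and_w2 / p_w1)  # P(W2 = 1 | W1 = 1) ~ 0.9925063

# (b) Independence would require P(W1, W2) == P(W1) * P(W2).
print(p_w1_and_w2, p_w1 * p_w1)  # 0.9850625 vs 0.98505625: not independent

# (c) Given Factory A, the joint probability factors exactly.
print(p_work['A'] ** 2, p_work['A'] * p_work['A'])  # equal: conditionally independent
```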