To start our discussion of probability theory, we first need to define the concept of a probability space. Informally, a probability space is a formal model of a probabilistic experiment, where "experiment" refers to any random process whose different outcomes may occur with different probabilities.
For example, if our experiment is "What will the weather be tomorrow?", outcomes such as "sunny", "rainy", and "snowy" could all be elements of the model. A probability space is a mathematical construct that allows us to create a formal model for these kinds of random processes. To define this concept rigorously, we will start with some definitions.
Definition 1. We define the sample space of an experiment to be the set of all possible outcomes of that experiment.
The sample space is usually denoted as $S$ or $\Omega$. Here, we will use $\Omega$.
Example 2. We toss a coin twice. What is the sample space of this experiment?
Solution: Denoting heads as $H$ and tails as $T$, this experiment has the four possible outcomes:
$$\Omega = \{HH,\; HT,\; TH,\; TT\}.$$
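For small experiments like this one, the sample space can be enumerated mechanically. Here is a minimal Python sketch of that idea (an illustration added here, not part of the original example):

```python
from itertools import product

# Each toss is H or T; two tosses give the Cartesian product {H, T} x {H, T}.
sample_space = {"".join(outcome) for outcome in product("HT", repeat=2)}
print(sorted(sample_space))  # ['HH', 'HT', 'TH', 'TT']
```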
A collection of subsets of $\Omega$ needs to satisfy certain properties in order to define the events of an experiment. Letting $\mathcal{F}$ be such a collection, it must be true that:
Definition 3. We define a $\sigma$-field on $\Omega$ (also called a $\sigma$-algebra), which we'll denote here as $\mathcal{F}$, as a collection of subsets of $\Omega$, that is, $\mathcal{F} \subseteq 2^{\Omega}$, where $2^{\Omega}$ is the power set of $\Omega$. $\mathcal{F}$ has the following properties:
(i) $\Omega \in \mathcal{F}$;
(ii) if $A \in \mathcal{F}$, then $A^c \in \mathcal{F}$, where $A^c$ is the complement of $A$ in $\Omega$;
(iii) if $A_1, A_2, \ldots \in \mathcal{F}$, then $\bigcup_{i=1}^{\infty} A_i \in \mathcal{F}$.
If $\Omega$ is a sample space, $\mathcal{F}$ is called a $\sigma$-field of events. $\mathcal{F}$ represents the collection of all subsets of $\Omega$ to which we can assign a probability. The elements of $\mathcal{F}$ are called events.
Example 4. Find the $\sigma$-field on $\Omega$ for an experiment with $\Omega = \{1, 2, 3\}$.
Solution: The $\sigma$-field $\mathcal{F}$ will be the set of all subsets of $\Omega$, or
$$\mathcal{F} = 2^{\Omega} = \{\emptyset,\; \{1\},\; \{2\},\; \{3\},\; \{1,2\},\; \{1,3\},\; \{2,3\},\; \{1,2,3\}\}.$$
Here, the empty set $\emptyset$ represents the event where nothing happens. We will show later that the probability of $\emptyset$ is always zero. Events with multiple elements represent the case where any of the included outcomes occur. For example, $\{1,2\}$ represents the event that we roll either a 1 or a 2.
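To make the power-set construction concrete, here is a short Python sketch (ours, assuming the three-outcome $\Omega$ above) that builds $2^{\Omega}$ and spot-checks the three properties from Definition 3:

```python
from itertools import chain, combinations

omega = frozenset({1, 2, 3})

def power_set(s):
    """Return the set of all subsets of s, i.e., 2^s for a finite set s."""
    items = list(s)
    return {frozenset(c) for c in chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1))}

F = power_set(omega)

assert omega in F                              # (i) the sample space is an event
assert all(omega - A in F for A in F)          # (ii) closed under complements
assert all(A | B in F for A in F for B in F)   # (iii) closed under (finite) unions
print(len(F))  # 8 = 2^3 subsets
```

(On a finite $\Omega$, checking pairwise unions suffices, since any countable union of subsets of a finite set reduces to a finite one.)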
The concept of a $\sigma$-field is used outside of probability as well. Here are some other examples of $\sigma$-fields.
Example 5. The Borel $\sigma$-field on $\mathbb{R}^n$, denoted $\mathcal{B}(\mathbb{R}^n)$, is defined as the smallest $\sigma$-field on $\mathbb{R}^n$ that contains all open $n$-rectangles $(a_1, b_1) \times \cdots \times (a_n, b_n)$.
To make this easier to visualize, let's discuss the Borel $\sigma$-field on $\mathbb{R}^2$. This is the set of all combinations of open rectangles we can draw on the $xy$-plane. Does this satisfy the properties of a $\sigma$-field?
(i) Is $\mathbb{R}^2 \in \mathcal{B}(\mathbb{R}^2)$? Yes! We can think of $\mathbb{R}^2$ as the union of all rectangles $(-n, n) \times (-n, n)$ where $n \in \mathbb{N}$.
(ii) If $A \in \mathcal{B}(\mathbb{R}^2)$, is $A^c \in \mathcal{B}(\mathbb{R}^2)$? Also yes! For instance, let's choose the element $A = (0,1) \times (0,1)$, represented by the red rectangle below. We can draw rectangles to fill in the rest of the plane, where these rectangles are, by definition, also contained in $\mathcal{B}(\mathbb{R}^2)$. This is demonstrated by the blue rectangles below.
(iii) Finally, if $A_1, A_2, \ldots \in \mathcal{B}(\mathbb{R}^2)$, is $\bigcup_{i=1}^{\infty} A_i \in \mathcal{B}(\mathbb{R}^2)$? You better believe the answer is yes! The $A_i$ are all (countable combinations of) rectangles, so their union is again a countable combination of rectangles, which is, by definition, in $\mathcal{B}(\mathbb{R}^2)$.
Example 6. Let $\mathcal{F}$ be the smallest $\sigma$-field on $\mathbb{R}$ that contains all sets of the form $(-\infty, a]$ and all sets of the form $(b, \infty)$. This $\sigma$-field therefore also contains all sets of the form $(a, b]$, since $(a, b] = (-\infty, b] \cap (a, \infty)$. You can check for yourself that this satisfies all the properties of a $\sigma$-field.
Finally, in order to define a probability space, we need one more definition.
Definition 7. A probability measure, $P$, on $(\Omega, \mathcal{F})$ is a function $P: \mathcal{F} \to [0, 1]$ such that
(i) $P(\Omega) = 1$;
(ii) $0 \leq P(A) \leq 1$ for all $A \in \mathcal{F}$;
(iii) $P\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} P(A_i)$ for all sequences $A_1, A_2, \ldots$ of pairwise disjoint events in $\mathcal{F}$.
Property (iii) is called $\sigma$-additivity.
Theorem 8. For any event $A \in \mathcal{F}$, $P(A^c) = 1 - P(A)$, and $P(\emptyset) = 0$.
Proof: We know that $A \cup A^c = \Omega$ since $\Omega$ is the set of all possible outcomes. So we have that:
$$1 = P(\Omega) = P(A \cup A^c) = P(A) + P(A^c)$$
by $\sigma$-additivity, since $A$ and $A^c$ are disjoint. Therefore:
$$P(A^c) = 1 - P(A).$$
We can now use this result to prove $P(\emptyset) = 0$:
$$P(\emptyset) = P(\Omega^c) = 1 - P(\Omega) = 1 - 1 = 0.$$
The time has come to rigorously define a probability space.
Definition 9. A probability space is defined as the triple $(\Omega, \mathcal{F}, P)$, where $\Omega$ is a sample space, $\mathcal{F}$ is a $\sigma$-field on $\Omega$, and $P$ is a probability measure on $(\Omega, \mathcal{F})$.
Example 10. Find the probability space for rolling a fair six-sided die once.
Solution: We know that $\Omega = \{1, 2, 3, 4, 5, 6\}$ and $\mathcal{F} = 2^{\Omega}$.
Now, for the probability measure, we know that each option in $\Omega$ has an equal probability of $\frac{1}{6}$. So if we take an element of $\mathcal{F}$, for instance $\{2\}$, the probability will be $P(\{2\}) = \frac{1}{6}$. So for any subset $A$ of $\Omega$:
$$P(A) = \frac{|A|}{6}.$$
Together these make up the probability space for this experiment.
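As a quick sanity check of this measure, here is a small Python sketch (ours) implementing $P(A) = |A|/6$:

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}

def P(A):
    """Uniform probability measure for one roll of a fair die: P(A) = |A| / 6."""
    assert A <= omega  # events must be subsets of the sample space
    return Fraction(len(A), len(omega))

print(P({1, 2}))  # 1/3: the event of rolling a 1 or a 2
print(P(omega))   # 1: P(Omega) = 1, property (i)
print(P(set()))   # 0: P(empty set) = 0, as shown in Theorem 8
```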
Example 11. Say we toss a fair coin until we get a heads. Find the probability space for this experiment, where the outcomes are of the form "we get our first heads on the $n$th toss."
Solution: Here, our sample space is going to be any possible number of tosses, $\Omega = \{1, 2, 3, \ldots\} = \mathbb{N}$, and, of course, $\mathcal{F} = 2^{\Omega}$.
We define our probability measure to represent the probability that it will take $n$ tosses to get our first heads, where $n \in \mathbb{N}$. That is, the first $n - 1$ tosses must be tails and the $n$th must be heads, or
$$P(\{n\}) = \left(\frac{1}{2}\right)^{n-1} \cdot \frac{1}{2} = \left(\frac{1}{2}\right)^n,$$
where $\{n\}$, with $n \in \mathbb{N}$, is an element of $\mathcal{F}$.
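Since each $P(\{n\}) = (1/2)^n$ is easy to estimate empirically, here is a rough Monte Carlo sketch (ours, illustrative only) comparing simulated frequencies against the formula:

```python
import random

def tosses_until_heads():
    """Toss a fair coin until the first heads; return the number of tosses."""
    n = 1
    while random.random() < 0.5:  # tails with probability 1/2, so toss again
        n += 1
    return n

trials = 100_000
counts = {}
for _ in range(trials):
    n = tosses_until_heads()
    counts[n] = counts.get(n, 0) + 1

for n in range(1, 6):
    print(f"n={n}: simulated {counts.get(n, 0) / trials:.4f}, exact {0.5 ** n:.4f}")
```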
We’ll end this section with some important theorems.
Theorem 12. Probability is monotonic. That is, if $A \subseteq B$, then $P(A) \leq P(B)$.
Proof: Using $\sigma$-additivity, given $A \subseteq B$, we can write $B = A \cup (B \setminus A)$, where $A$ and $B \setminus A$ are by definition disjoint. This implies that
$$P(B) = P(A) + P(B \setminus A) \geq P(A).$$
Theorem 13. Probability is sub-$\sigma$-additive. That is, for any sequence $A_1, A_2, \ldots$ of events which may not be disjoint,
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) \leq \sum_{i=1}^{\infty} P(A_i).$$
Proof: Observe that $\bigcup_{i=1}^{\infty} A_i = \bigcup_{i=1}^{\infty} B_i$, where we define $B_i$ as
$$B_1 = A_1, \qquad B_i = A_i \setminus \left(\bigcup_{j=1}^{i-1} A_j\right) \text{ for } i \geq 2.$$
By making it so each successive $B_i$ only includes elements present in none of the previous sets, this process gives us all disjoint sets, so $\sigma$-additivity applies. Therefore we have that
$$P\left(\bigcup_{i=1}^{\infty} A_i\right) = P\left(\bigcup_{i=1}^{\infty} B_i\right) = \sum_{i=1}^{\infty} P(B_i) \leq \sum_{i=1}^{\infty} P(A_i),$$
since $B_i \subseteq A_i$.
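On a small finite space the union bound is easy to verify directly. A minimal sketch (ours, assuming a uniform measure on a six-element $\Omega$, as in Example 10):

```python
from fractions import Fraction

omega = frozenset(range(1, 7))

def P(A):
    """Uniform measure on a six-element sample space."""
    return Fraction(len(A), len(omega))

# Overlapping events, so the inequality should be strict here.
A1, A2, A3 = frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({4, 5})
lhs = P(A1 | A2 | A3)
rhs = P(A1) + P(A2) + P(A3)
print(lhs, "<=", rhs)  # 5/6 <= 4/3
assert lhs <= rhs
```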
Theorem 14. Let $A_1, A_2, \ldots$ be a non-decreasing sequence of events. That is, $A_1 \subseteq A_2 \subseteq \cdots$. Then
$$P\left(\bigcup_{n=1}^{\infty} A_n\right) = \lim_{n \to \infty} P(A_n).$$
Proof: Observe that since $A_{n-1} \subseteq A_n$, we have that (setting $A_0 = \emptyset$)
$$A_n = \bigcup_{k=1}^{n} (A_k \setminus A_{k-1})$$
for all $n$, and
$$\bigcup_{n=1}^{\infty} A_n = \bigcup_{k=1}^{\infty} (A_k \setminus A_{k-1}).$$
Notice that we once again gave ourselves a set of disjoint sets to work with. Therefore
$$P\left(\bigcup_{n=1}^{\infty} A_n\right) = \sum_{k=1}^{\infty} P(A_k \setminus A_{k-1}) = \lim_{n \to \infty} \sum_{k=1}^{n} P(A_k \setminus A_{k-1}) = \lim_{n \to \infty} P(A_n).$$
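As a concrete illustration (ours, built on Example 11): let $A_n$ be the event that the first heads appears within the first $n$ tosses, so that $A_1 \subseteq A_2 \subseteq \cdots$. Then
$$P(A_n) = \sum_{k=1}^{n} \left(\frac{1}{2}\right)^k = 1 - \left(\frac{1}{2}\right)^n \longrightarrow 1 \quad \text{as } n \to \infty,$$
which matches the theorem: $\bigcup_{n=1}^{\infty} A_n$ is the event that heads eventually appears, and it has probability 1.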
Corollary 15. Let $A_1, A_2, \ldots$ be a non-increasing sequence of events. That is, $A_1 \supseteq A_2 \supseteq \cdots$. Then
$$P\left(\bigcap_{n=1}^{\infty} A_n\right) = \lim_{n \to \infty} P(A_n).$$
Proof: We know that $A_1^c \subseteq A_2^c \subseteq \cdots$ is a non-decreasing sequence. Using De Morgan's Law of Intersection we can rewrite the intersection as
$$\bigcap_{n=1}^{\infty} A_n = \left(\bigcup_{n=1}^{\infty} A_n^c\right)^c.$$
So by the previous theorem we have that
$$P\left(\bigcap_{n=1}^{\infty} A_n\right) = 1 - P\left(\bigcup_{n=1}^{\infty} A_n^c\right) = 1 - \lim_{n \to \infty} P(A_n^c) = \lim_{n \to \infty} P(A_n).$$
Definition 16. A set is called P-negligible if it is contained in an event $A \in \mathcal{F}$ where the probability $P(A) = 0$.
Theorem 17. A countable union of negligible sets is a negligible set.
Proof: Let $N_i$, with $i \in \mathbb{N}$, be P-negligible sets. By definition there exists a sequence $A_1, A_2, \ldots$, $A_i \in \mathcal{F}$, such that $N_i \subseteq A_i$ and $P(A_i) = 0$ for all $i$. Therefore we have
$$\bigcup_{i=1}^{\infty} N_i \subseteq \bigcup_{i=1}^{\infty} A_i \quad \text{and} \quad P\left(\bigcup_{i=1}^{\infty} A_i\right) \leq \sum_{i=1}^{\infty} P(A_i) = 0$$
by sub-$\sigma$-additivity, so the union is P-negligible.
Another important topic in probability is that of independence and conditional probability. If we know what the weather is today, will that affect our prediction for tomorrow? Probably, since if it’s snowing today we most likely won’t have scorching heat tomorrow. So how do we take this into account?
Definition 18. Two events $A$ and $B$ are called independent if and only if
$$P(A \cap B) = P(A)\,P(B).$$
Note that we may also write $P(A \cap B)$ as $P(A, B)$.
Definition 19. A family of events $(A_i)_{i \in I}$ is called jointly independent, or just independent, if, for any finite set of indices $J \subseteq I$, we have
$$P\left(\bigcap_{j \in J} A_j\right) = \prod_{j \in J} P(A_j).$$
What if two or more events are not independent? Then, we will have to take into account some extra information in order to compute probability. We do this by using conditional probability.
Definition 20. The conditional probability of $A$ given $B$ is the number
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)},$$
defined when $P(B) > 0$. $P(A \mid B)$ is read as "the probability of $A$ given $B$" and represents the probability of $A$ occurring given that we know $B$ is true.
Note that this also gives us an alternative but equivalent definition of independence. We can say that $A$ and $B$ are independent if $P(A \mid B) = P(A)$ and $P(B \mid A) = P(B)$, meaning that whether or not $B$ is true has no effect on $A$, and vice versa.
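A quick worked example (ours): roll a fair die, and let $A = \{2\}$ and $B = \{2, 4, 6\}$ be the event of an even roll. Then
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{1/6}{1/2} = \frac{1}{3} \neq \frac{1}{6} = P(A),$$
so learning that the roll is even changes the probability of a 2, and $A$ and $B$ are not independent.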
Theorem 21. Bayes' Rule
Given $P(A), P(B) > 0$, we have that
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.$$
We can prove this theorem using the definition of conditional probability rewritten as
$$P(A \cap B) = P(A \mid B)\, P(B).$$
Note that $A \cap B$ is the same as $B \cap A$, so we also have that
$$P(A \cap B) = P(B \mid A)\, P(A).$$
Equating these two we get
$$P(A \mid B)\, P(B) = P(B \mid A)\, P(A),$$
and dividing both sides by $P(B)$ gives the result.
Theorem 22. Bayes' Rule of Total Probability
Let $B_1, B_2, \ldots$ be events which form a partition of $\Omega$. That is, $\{B_i\}$ is a set of disjoint sets such that
$$\bigcup_{i=1}^{\infty} B_i = \Omega.$$
Then, for any event $A \in \mathcal{F}$, with each $P(B_i) > 0$, we have that
$$P(A) = \sum_{i=1}^{\infty} P(A \mid B_i)\, P(B_i).$$
Proof: Since $A$ is in the power set of $\Omega$, we have that
$$A = A \cap \Omega = A \cap \left(\bigcup_{i=1}^{\infty} B_i\right) = \bigcup_{i=1}^{\infty} (A \cap B_i),$$
where the sets $A \cap B_i$ are disjoint. Using $\sigma$-additivity and the definition of conditional probability, we therefore have that
$$P(A) = \sum_{i=1}^{\infty} P(A \cap B_i) = \sum_{i=1}^{\infty} P(A \mid B_i)\, P(B_i).$$
Theorem 23. The Bayes' Sequential Formula
For any sequence of events $A_1, A_2, \ldots, A_n$ with $P(A_1 \cap \cdots \cap A_{n-1}) > 0$, we have
$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1 \cap A_2) \cdots P(A_n \mid A_1 \cap \cdots \cap A_{n-1}).$$
Proof: We will use induction on $n$.
Base step: Let $n = 2$. We have that $P(A_1 \cap A_2) = P(A_1)\, P(A_2 \mid A_1)$, which is exactly the definition of conditional probability.
Inductive hypothesis: Assume the theorem is true for $n = k$, for some $k \geq 2$.
Inductive step: Applying the definition of conditional probability with $B = A_1 \cap \cdots \cap A_k$,
$$P(A_1 \cap \cdots \cap A_{k+1}) = P(A_{k+1} \mid A_1 \cap \cdots \cap A_k)\, P(A_1 \cap \cdots \cap A_k),$$
and expanding the second factor using the inductive hypothesis gives the formula for $n = k + 1$.
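As a quick illustration of the sequential rule (ours, not from the notes): the probability of drawing three aces in a row from a standard 52-card deck, without replacement, is
$$P(A_1 \cap A_2 \cap A_3) = P(A_1)\, P(A_2 \mid A_1)\, P(A_3 \mid A_1 \cap A_2) = \frac{4}{52} \cdot \frac{3}{51} \cdot \frac{2}{50} = \frac{1}{5525}.$$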
Example 24. The False Positive Problem
A certain disease affects one in every thousand people. A test for this disease will result in a positive for 99% of the patients who have the disease. However, this test also gives a positive result for 2% of the population without the disease.
If a patient gets a positive result using this test, what is the probability that they have the disease?
Solution: Let $D$ be the event that the patient does have the disease and let $T$ be the event that the test is positive.
The given information tells us that
$$P(D) = 0.001, \qquad P(T \mid D) = 0.99, \qquad P(T \mid D^c) = 0.02.$$
Using Bayes' Rule we know that
$$P(D \mid T) = \frac{P(T \mid D)\, P(D)}{P(T)}.$$
To find $P(T)$, we use Bayes' Rule of Total Probability:
$$P(T) = P(T \mid D)\, P(D) + P(T \mid D^c)\, P(D^c) = 0.99 \times 0.001 + 0.02 \times 0.999 = 0.02097,$$
and so
$$P(D \mid T) = \frac{0.99 \times 0.001}{0.02097} \approx 0.047.$$
Thus, if the patient tests positive there is only about a 5% chance that they actually have the disease!
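The arithmetic above is easy to reproduce. A minimal Python sketch (ours, with the numbers from the problem statement):

```python
# Given quantities from the problem statement.
p_disease = 0.001           # P(D): one person in a thousand has the disease
p_pos_given_disease = 0.99  # P(T | D): test sensitivity
p_pos_given_healthy = 0.02  # P(T | D^c): false positive rate

# Bayes' Rule of Total Probability: P(T) = P(T|D)P(D) + P(T|D^c)P(D^c).
p_pos = p_pos_given_disease * p_disease + p_pos_given_healthy * (1 - p_disease)

# Bayes' Rule: P(D | T) = P(T | D) P(D) / P(T).
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(f"P(T) = {p_pos:.5f}")                    # 0.02097
print(f"P(D | T) = {p_disease_given_pos:.4f}")  # about 0.047
```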
Example 25. Bertrand’s Ballot Problem
In an election, candidates $A$ and $B$ have received $a$ and $b$ votes respectively. Candidate $A$ won the election, meaning $a > b$.
Prove that the following statement is true: The probability that, in the course of counting all the votes, candidate $A$ has always had the lead, is
$$\frac{a - b}{a + b}.$$
Solution: We are going to prove this by induction on the number of votes. Let $P_{a,b}$ denote the probability that $A$ is always in the lead when $A$ receives $a$ votes and $B$ receives $b$ votes.
Base step: The smallest number of votes is 1, where $a = 1$ and $b = 0$. Here, it is trivial that $A$ will always be in the lead, so $P_{1,0} = 1 = \frac{1 - 0}{1 + 0}$.
Inductive hypothesis: We will assume that the expression is true when the total number of votes is $a + b - 1$. This includes the cases where Candidate $A$ receives $a - 1$ votes and Candidate $B$ receives $b$ votes, as well as when they receive $a$ and $b - 1$ votes respectively.
Inductive step: Let the total number of votes be $n = a + b$, where Candidate $A$ and Candidate $B$ receive $a$ and $b$ votes respectively, with $a > b$.
First, let's look only at the last vote. The probability that the last vote will be a vote for $A$ is
$$P(A_{\text{GLV}}) = \frac{a}{a + b},$$
where GLV stands for "gets last vote". The probability this vote will be for $B$ is
$$P(B_{\text{GLV}}) = \frac{b}{a + b}.$$
Now let's compute $P_{a,b}$. We will break this into two cases:
In the first case, say the final vote is for $A$. In order for $A$ to be ahead the entire time prior to this vote, we will require Candidate $A$'s $a - 1$ prior votes be always ahead of Candidate $B$'s $b$ votes. The probability that this will occur is $P_{a-1,b}$.
For the second case, say the final vote is for $B$. For Candidate $A$ to be ahead the entire time, up to and including this vote, we will require that Candidate $A$'s $a$ votes be always ahead of Candidate $B$'s $b - 1$ votes. Note that since $a > b$, the last vote for $B$ has no chance of making $B$ pull ahead. The probability that this will occur is $P_{a,b-1}$.
And so we have that
$$P_{a,b} = \frac{a}{a+b}\, P_{a-1,b} + \frac{b}{a+b}\, P_{a,b-1}.$$
Using the inductive hypothesis, we know that $P_{a-1,b}$ and $P_{a,b-1}$ follow the proposed formula, so we have
$$P_{a,b} = \frac{a}{a+b} \cdot \frac{a - b - 1}{a + b - 1} + \frac{b}{a+b} \cdot \frac{a - b + 1}{a + b - 1} = \frac{a(a - b - 1) + b(a - b + 1)}{(a+b)(a+b-1)} = \frac{(a-b)(a+b-1)}{(a+b)(a+b-1)}.$$
Therefore
$$P_{a,b} = \frac{a - b}{a + b}.$$
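The formula is also easy to test by simulation. A rough Monte Carlo sketch (ours; the vote counts $a = 7$, $b = 4$ are arbitrary):

```python
import random

def a_always_leads(a, b):
    """Shuffle a ballot sequence with a votes for A and b votes for B;
    return True if A is strictly ahead after every ballot counted."""
    ballots = ["A"] * a + ["B"] * b
    random.shuffle(ballots)
    lead = 0
    for vote in ballots:
        lead += 1 if vote == "A" else -1
        if lead <= 0:
            return False
    return True

a, b, trials = 7, 4, 100_000
wins = sum(a_always_leads(a, b) for _ in range(trials))
print(f"simulated: {wins / trials:.4f}")               # close to 0.2727
print(f"(a - b) / (a + b) = {(a - b) / (a + b):.4f}")  # 3/11
```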
Definition 26. Let $A$, $B$, and $C$ be events where $P(C) > 0$. $A$ and $B$ are called conditionally independent given $C$ if
$$P(A \cap B \mid C) = P(A \mid C)\, P(B \mid C).$$
In other words, $A$ and $B$ are independent with respect to the probability $P_C$ defined as
$$P_C(A) = P(A \mid C).$$
Example 27. Cheap Watches
Two factories, Factory $A$ and Factory $B$, manufacture watches. Factory $A$ produces, on average, 1 defective item out of 100, while Factory $B$ produces, on average, 1 defective item out of 200.
You've received a container of watches from one of the two factories, but don't know which one. You pull out the first watch and it works!
(a) What is the probability that the second watch you check will also work?
(b) Are the states of the first two watches independent?
(c) Are the states of the first two watches conditionally independent given we know that the container came from Factory $A$?
Solution:
(a) Let $W_i$ represent the state of the $i$th watch in the container, where $W_i = 1$ if the watch works and $W_i = 0$ if it is defective. Let $F$ represent the factory the container came from. Since we don't know which factory that is, we represent our ignorance by letting
$$P(F = A) = P(F = B) = \frac{1}{2}.$$
We will assume that, given the factory, the states of the watches are conditionally independent. We know that
$$P(W_i = 1 \mid F = A) = 0.99 \quad \text{and} \quad P(W_i = 1 \mid F = B) = 0.995.$$
Given that the first watch worked, the probability that the second watch will work is given by
$$P(W_2 = 1 \mid W_1 = 1) = \frac{P(W_1 = 1, W_2 = 1)}{P(W_1 = 1)}.$$
We can use Bayes' formula of total probability to find the numerator and denominator of this expression:
$$P(W_1 = 1, W_2 = 1) = \frac{1}{2}(0.99)^2 + \frac{1}{2}(0.995)^2 = 0.9850625$$
and
$$P(W_1 = 1) = \frac{1}{2}(0.99) + \frac{1}{2}(0.995) = 0.9925.$$
Therefore
$$P(W_2 = 1 \mid W_1 = 1) = \frac{0.9850625}{0.9925} \approx 0.9925063.$$
So the probability that the second watch will work given that the first one does is about 0.9925063.
(b) If these two events were independent, we'd have that
$$P(W_1 = 1, W_2 = 1) = P(W_1 = 1)\, P(W_2 = 1).$$
From the previous part we know that $P(W_1 = 1) = 0.9925$ and $P(W_1 = 1, W_2 = 1) = 0.9850625$. By the same logic we also have that $P(W_2 = 1) = 0.9925$, which means that
$$P(W_1 = 1)\, P(W_2 = 1) = (0.9925)^2 = 0.98505625 \neq 0.9850625 = P(W_1 = 1, W_2 = 1),$$
meaning that the states of the two watches are not independent!
Why is this the case? We start our watch selection with zero information about whether our container is from Factory $A$ or Factory $B$. But with each successive selection, we gain information that lets us slightly adjust our probabilities.
But what if we had known from the start which factory our watches came from?
(c) If you look closely at part (a), we've been assuming conditional independence from the start, as it can be seen as a bit trivial in this instance. But let's look closely at it anyway. $W_1$ and $W_2$ are conditionally independent given $F = A$ if
$$P(W_1 = 1, W_2 = 1 \mid F = A) = P(W_1 = 1 \mid F = A)\, P(W_2 = 1 \mid F = A).$$
We know that
$$P(W_1 = 1, W_2 = 1 \mid F = A) = (0.99)^2 = 0.9801,$$
and we also know that
$$P(W_1 = 1 \mid F = A)\, P(W_2 = 1 \mid F = A) = (0.99)(0.99) = 0.9801.$$
Therefore, the states of the two watches are conditionally independent given we know that the container came from Factory $A$.
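For completeness, a short Python sketch (ours) reproducing the numbers in this example:

```python
p_work = {"A": 0.99, "B": 0.995}  # P(W_i = 1 | F), the working rate per factory
prior = {"A": 0.5, "B": 0.5}      # our prior over the unknown factory

# Total probability over the two factories.
p_w1 = sum(prior[f] * p_work[f] for f in p_work)

# Watches are conditionally independent given the factory, so square the rate.
p_w1_and_w2 = sum(prior[f] * p_work[f] ** 2 for f in p_work)

print(f"P(W2=1 | W1=1) = {p_w1_and_w2 / p_w1:.7f}")  # 0.9925063
print(f"P(W1=1) P(W2=1) = {p_w1 ** 2:.8f}")          # 0.98505625
print(f"P(W1=1, W2=1)   = {p_w1_and_w2:.8f}")        # 0.98506250, not equal
```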