Probability Theory

# Probability Theory

**by David O. Siegmund**

Probability theory, a branch of mathematics concerned with the analysis of random phenomena. The outcome of a random event cannot be determined before it occurs, but it may be any one of several possible outcomes. The actual outcome is considered to be determined by chance.

The word probability has several meanings in ordinary conversation. Two of these are particularly important for the development and applications of the mathematical theory of probability. One is the interpretation of probabilities as relative frequencies, for which simple games involving coins, cards, dice, and roulette wheels provide examples. The distinctive feature of games of chance is that the outcome of a given trial cannot be predicted with certainty, although the collective results of a large number of trials display some regularity. For example, the statement that the probability of “heads” in tossing a coin equals one-half, according to the relative frequency interpretation, implies that in a large number of tosses the relative frequency with which “heads” actually occurs will be approximately one-half, although it contains no implication concerning the outcome of any given toss. There are many similar examples involving groups of people, molecules of a gas, genes, and so on. Actuarial statements about the life expectancy for persons of a certain age describe the collective experience of a large number of individuals but do not purport to say what will happen to any particular person. Similarly, predictions about the chance of a genetic disease occurring in a child of parents having a known genetic makeup are statements about relative frequencies of occurrence in a large number of cases but are not predictions about a given individual.

This article contains a description of the important mathematical concepts of probability theory, illustrated by some of the applications that have stimulated their development. For a fuller historical treatment, see probability and statistics. Since applications inevitably involve simplifying assumptions that focus on some features of a problem at the expense of others, it is advantageous to begin by thinking about simple experiments, such as tossing a coin or rolling dice, and later to see how these apparently frivolous investigations relate to important scientific questions.

**Experiments, sample space, events, and equally likely probabilities**

Applications of simple probability experiments

The fundamental ingredient of probability theory is an experiment that can be repeated, at least hypothetically, under essentially identical conditions and that may lead to different outcomes on different trials. The set of all possible outcomes of an experiment is called a “sample space.” The experiment of tossing a coin once results in a sample space with two possible outcomes, “heads” and “tails.” Tossing two dice has a sample space with 36 possible outcomes, each of which can be identified with an ordered pair (i, j), where i and j assume one of the values 1, 2, 3, 4, 5, 6 and denote the faces showing on the individual dice. It is important to think of the dice as identifiable (say by a difference in colour), so that the outcome (1, 2) is different from (2, 1). An “event” is a well-defined subset of the sample space. For example, the event “the sum of the faces showing on the two dice equals six” consists of the five outcomes (1, 5), (2, 4), (3, 3), (4, 2), and (5, 1).

A third example is to draw n balls from an urn containing balls of various colours. A generic outcome to this experiment is an n-tuple, where the ith entry specifies the colour of the ball obtained on the ith draw (i = 1, 2,…, n). In spite of the simplicity of this experiment, a thorough understanding gives the theoretical basis for opinion polls and sample surveys. For example, individuals in a population favouring a particular candidate in an election may be identified with balls of a particular colour, those favouring a different candidate may be identified with a different colour, and so on. Probability theory provides the basis for learning about the contents of the urn from the sample of balls drawn from the urn; an application is to learn about the electoral preferences of a population on the basis of a sample drawn from that population.

Another application of simple urn models is to use clinical trials designed to determine whether a new treatment for a disease, a new drug, or a new surgical procedure is better than a standard treatment. In the simple case in which treatment can be regarded as either success or failure, the goal of the clinical trial is to discover whether the new treatment more frequently leads to success than does the standard treatment. Patients with the disease can be identified with balls in an urn. The red balls are those patients who are cured by the new treatment, and the black balls are those not cured. Usually there is a control group, who receive the standard treatment. They are represented by a second urn with a possibly different fraction of red balls. The goal of the experiment of drawing some number of balls from each urn is to discover on the basis of the sample which urn has the larger fraction of red balls. A variation of this idea can be used to test the efficacy of a new vaccine. Perhaps the largest and most famous example was the test of the Salk vaccine for poliomyelitis conducted in 1954. It was organized by the U.S. Public Health Service and involved almost two million children. Its success has led to the almost complete elimination of polio as a health problem in the industrialized parts of the world. Strictly speaking, these applications are problems of statistics, for which the foundations are provided by probability theory.

In contrast to the experiments described above, many experiments have infinitely many possible outcomes. For example, one can toss a coin until “heads” appears for the first time. The number of possible tosses is n = 1, 2,…. Another example is to twirl a spinner. For an idealized spinner made from a straight line segment having no width and pivoted at its centre, the set of possible outcomes is the set of all angles that the final position of the spinner makes with some fixed direction, equivalently all real numbers in [0, 2π). Many measurements in the natural and social sciences, such as volume, voltage, temperature, reaction time, marginal income, and so on, are made on continuous scales and at least in theory involve infinitely many possible values. If the repeated measurements on different subjects or at different times on the same subject can lead to different outcomes, probability theory is a possible tool to study this variability.

Because of their comparative simplicity, experiments with finite sample spaces are discussed first. In the early development of probability theory, mathematicians considered only those experiments for which it seemed reasonable, based on considerations of symmetry, to suppose that all outcomes of the experiment were “equally likely.” Then in a large number of trials all outcomes should occur with approximately the same frequency. The probability of an event is defined to be the ratio of the number of cases favourable to the event—i.e., the number of outcomes in the subset of the sample space defining the event—to the total number of cases. Thus, the 36 possible outcomes in the throw of two dice are assumed equally likely, and the probability of obtaining “six” is the number of favourable cases, 5, divided by 36, or 5/36.

Now suppose that a coin is tossed n times, and consider the probability of the event “heads does not occur” in the n tosses. An outcome of the experiment is an n-tuple, the kth entry of which identifies the result of the kth toss. Since there are two possible outcomes for each toss, the number of elements in the sample space is 2n. Of these, only one outcome corresponds to having no heads, so the required probability is 1/2n.

It is only slightly more difficult to determine the probability of “at most one head.” In addition to the single case in which no head occurs, there are n cases in which exactly one head occurs, because it can occur on the first, second,…, or nth toss. Hence, there are n + 1 cases favourable to obtaining at most one head, and the desired probability is (n + 1)/2n.

**The principle of additivity**

This last example illustrates the fundamental principle that, if the event whose probability is sought can be represented as the union of several other events that have no outcomes in common (“at most one head” is the union of “no heads” and “exactly one head”), then the probability of the union is the sum of the probabilities of the individual events making up the union. To describe this situation symbolically, let S denote the sample space. For two events A and B, the intersection of A and B is the set of all experimental outcomes belonging to both A and B and is denoted A ∩ B; the union of A and B is the set of all experimental outcomes belonging to A or B (or both) and is denoted A ∪ B. The impossible event—i.e., the event containing no outcomes—is denoted by Ø. The probability of an event A is written P(A). The principle of addition of probabilities is that, if A1, A2,…, An are events with Ai ∩ Aj = Ø for all pairs i ≠ j, then

Equation (1) is consistent with the relative frequency interpretation of probabilities; for, if Ai ∩ Aj = Ø for all i ≠ j, the relative frequency with which at least one of the Ai occurs equals the sum of the relative frequencies with which the individual Ai occur.

Equation (1) is fundamental for everything that follows. Indeed, in the modern axiomatic theory of probability, which eschews a definition of probability in terms of “equally likely outcomes” as being hopelessly circular, an extended form of equation (1) plays a basic role (see the section Infinite sample spaces and axiomatic probability).

An elementary, useful consequence of equation (1) is the following. With each event A is associated the complementary event Ac consisting of those experimental outcomes that do not belong to A. Since A ∩ Ac = Ø, A ∪ Ac = S, and P(S) = 1 (where S denotes the sample space), it follows from equation (1) that P(Ac) = 1 − P(A). For example, the probability of “at least one head” in n tosses of a coin is one minus the probability of “no head,” or 1 − 1/2n.

**Multinomial probability**

A basic problem first solved by Jakob Bernoulli is to find the probability of obtaining exactly i red balls in the experiment of drawing n times at random with replacement from an urn containing b black and r red balls. To draw at random means that, on a single draw, each of the r + b balls is equally likely to be drawn and, since each ball is replaced before the next draw, there are (r + b) ×⋯× (r + b) = (r + b)n possible outcomes to the experiment. Of these possible outcomes, the number that is favourable to obtaining i red balls and n − i black balls in any one particular order is

The number of possible orders in which i red balls and n − i black balls can be drawn from the urn is the binomial coefficient

where k! = k × (k − 1) ×⋯× 2 × 1 for positive integers k, and 0! = 1. Hence, the probability in question, which equals the number of favourable outcomes divided by the number of possible outcomes, is given by the binomial distribution

where p = r/(r + b) and q = b/(r + b) = 1 − p.

For example, suppose r = 2b and n = 4. According to equation (3), the probability of “exactly two red balls” is

In this case the

possible outcomes are easily enumerated: (rrbb), (rbrb), (brrb), (rbbr), (brbr), (bbrr).

(For a derivation of equation (2), observe that in order to draw exactly i red balls in n draws one must either draw i red balls in the first n − 1 draws and a black ball on the nth draw or draw i − 1 red balls in the first n − 1 draws followed by the ith red ball on the nth draw. Hence,

from which equation (2) can be verified by induction on n.)

Two related examples are (i) drawing without replacement from an urn containing r red and b black balls and (ii) drawing with or without replacement from an urn containing balls of s different colours. If n balls are drawn without replacement from an urn containing r red and b black balls, the number of possible outcomes is

of which the number favourable to drawing i red and n − i black balls is

Hence, the probability of drawing exactly i red balls in n draws is the ratio

If an urn contains balls of s different colours in the ratios p1:p2:…:ps, where p1 +⋯+ ps = 1 and if n balls are drawn with replacement, the probability of obtaining i1 balls of the first colour, i2 balls of the second colour, and so on is the multinomial probability

The evaluation of equation (3) with pencil and paper grows increasingly difficult with increasing n. It is even more difficult to evaluate related cumulative probabilities—for example the probability of obtaining “at most j red balls” in the n draws, which can be expressed as the sum of equation (3) for i = 0, 1,…, j. The problem of approximate computation of probabilities that are known in principle is a recurrent theme throughout the history of probability theory and will be discussed in more detail below.

**The birthday problem**

An entertaining example is to determine the probability that in a randomly selected group of n people at least two have the same birthday. If one assumes for simplicity that a year contains 365 days and that each day is equally likely to be the birthday of a randomly selected person, then in a group of n people there are 365n possible combinations of birthdays. The simplest solution is to determine the probability of no matching birthdays and then subtract this probability from 1. Thus, for no matches, the first person may have any of the 365 days for his birthday, the second any of the remaining 364 days for his birthday, the third any of the remaining 363 days,…, and the nth any of the remaining 365 − n + 1. The number of ways that all n people can have different birthdays is then 365 × 364 ×⋯× (365 − n + 1), so that the probability that at least two have the same birthday is

Numerical evaluation shows, rather surprisingly, that for n = 23 the probability that at least two people have the same birthday is about 0.5 (half the time). For n = 42 the probability is about 0.9 (90 percent of the time).

This example illustrates that applications of probability theory to the physical world are facilitated by assumptions that are not strictly true, although they should be approximately true. Thus, the assumptions that a year has 365 days and that all days are equally likely to be the birthday of a random individual are false, because one year in four has 366 days and because birth dates are not distributed uniformly throughout the year. Moreover, if one attempts to apply this result to an actual group of individuals, it is necessary to ask what it means for these to be “randomly selected.” It would naturally be unreasonable to apply it to a group known to contain twins. In spite of the obvious failure of the assumptions to be literally true, as a classroom example, it rarely disappoints instructors of classes having more than 40 students.

**Conditional probability**

Suppose two balls are drawn sequentially without replacement from an urn containing r red and b black balls. The probability of getting a red ball on the first draw is r/(r + b). If, however, one is told that a red ball was obtained on the first draw, the conditional probability of getting a red ball on the second draw is (r − 1)/(r + b − 1), because for the second draw there are r + b − 1 balls in the urn, of which r − 1 are red. Similarly, if one is told that the first ball drawn is black, the conditional probability of getting red on the second draw is r/(r + b − 1).

In a number of trials the relative frequency with which B occurs among those trials in which A occurs is just the frequency of occurrence of A ∩ B divided by the frequency of occurrence of A. This suggests that the conditional probability of B given A (denoted P(B|A)) should be defined by

If A denotes a red ball on the first draw and B a red ball on the second draw in the experiment of the preceding paragraph, then P(A) = r/(r + b) and

which is consistent with the “obvious” answer derived above.

Rewriting equation (4) as P(A ∩ B) = P(A)P(B|A) and adding to this expression the same expression with A replaced by Ac (“not A”) leads via equation (1) to the equality

More generally, if A1, A2,…, An are mutually exclusive events and their union is the entire sample space, so that exactly one of the Ak must occur, essentially the same argument gives a fundamental relation, which is frequently called the law of total probability:

**Applications of conditional probability**

An application of the law of total probability to a problem originally posed by Christiaan Huygens is to find the probability of “gambler’s ruin.” Suppose two players, often called Peter and Paul, initially have x and m − x dollars, respectively. A ball, which is red with probability p and black with probability q = 1 − p, is drawn from an urn. If a red ball is drawn, Paul must pay Peter one dollar, while Peter must pay Paul one dollar if the ball drawn is black. The ball is replaced, and the game continues until one of the players is ruined. It is quite difficult to determine the probability of Peter’s ruin by a direct analysis of all possible cases. But let Q(x) denote that probability as a function of Peter’s initial fortune x and observe that after one draw the structure of the rest of the game is exactly as it was before the first draw, except that Peter’s fortune is now either x + 1 or x − 1 according to the results of the first draw. The law of total probability with A = {red ball on first draw} and Ac = {black ball on first draw} shows that

This equation holds for x = 2, 3,…, m − 2. It also holds for x = 1 and m − 1 if one adds the boundary conditions Q(0) = 1 and Q(m) = 0, which say that if Peter has 0 dollars initially, his probability of ruin is 1, while if he has all m dollars, he is certain to win.

It can be verified by direct substitution that equation (5) together with the indicated boundary conditions are satisfied by

With some additional analysis it is possible to show that these give the only solutions and hence must be the desired probabilities.

Suppose m = 10x, so that Paul initially has nine times as much money as Peter. If p = 1/2, the probability of Peter’s ruin is 0.9 regardless of the values of x and m. If p = 0.51, so that each trial slightly favours Peter, the situation is quite different. For x = 1 and m = 10, the probability of Peter’s ruin is 0.88, only slightly less than before. However, for x = 100 and m = 1,000, Peter’s slight advantage on each trial becomes so important that the probability of his ultimate ruin is now less than 0.02.

Generalizations of the problem of gambler’s ruin play an important role in statistical sequential analysis, developed by the Hungarian-born American statistician Abraham Wald in response to the demand for more efficient methods of industrial quality control during World War II. They also enter into insurance risk theory, which is discussed in the section Stochastic processes: Insurance risk theory.

The following example shows that, even when it is given that A occurs, it is important in evaluating P(B|A) to recognize that Ac might have occurred, and hence in principle it must be possible also to evaluate P(B|Ac). By lot, two out of three prisoners—Sam, Jean, and Chris—are chosen to be executed. There are

possible pairs of prisoners to be selected for execution, of which two contain Sam, so the probability that Sam is slated for execution is 2/3. Sam asks the guard which of the others is to be executed. Since at least one must be, it appears that the guard would give Sam no information by answering. After hearing that Jean is to be executed, Sam reasons that, since either he or Chris must be the other one, the conditional probability that he will be executed is 1/2. Thus, it appears that the guard has given Sam some information about his own fate. However, the experiment is incompletely defined, because it is not specified how the guard chooses whether to answer “Jean” or “Chris” in case both of them are to be executed. If the guard answers “Jean” with probability p, the conditional probability of the event “Sam will be executed” given “the guard says Jean will be executed” is

Only in the case p = 1 is Sam’s reasoning correct. If p = 1/2, the guard in fact gives no information about Sam’s fate.

**Independence**

One of the most important concepts in probability theory is that of “independence.” The events A and B are said to be (stochastically) independent if P(B|A) = P(B), or equivalently if

The intuitive meaning of the definition in terms of conditional probabilities is that the probability of B is not changed by knowing that A has occurred. Equation (7) shows that the definition is symmetric in A and B.

It is intuitively clear that, in drawing two balls with replacement from an urn containing r red and b black balls, the event “red ball on the first draw” and the event “red ball on the second draw” are independent. (This statement presupposes that the balls are thoroughly mixed before each draw.) An analysis of the (r + b)2 equally likely outcomes of the experiment shows that the formal definition is indeed satisfied.

In terms of the concept of independence, the experiment leading to the binomial distribution can be described as follows. On a single trial a particular event has probability p. An experiment consists of n independent repetitions of this trial. The probability that the particular event occurs exactly i times is given by equation (3).

Independence plays a central role in the law of large numbers, the central limit theorem, the Poisson distribution, and Brownian motion.

**Bayes’s theorem**

Bayes’s theorem used for evaluating the accuracy of a medical test [Credit: Encyclopædia Britannica, Inc.]Consider now the defining relation for the conditional probability P(An|B), where the Ai are mutually exclusive and their union is the entire sample space. Substitution of P(An)P(B|An) in the numerator of equation (4) and substitution of the right-hand side of the law of total probability in the denominator yields a result known as Bayes’s theorem (after the 18th-century English clergyman Thomas Bayes) or the law of inverse probability:

As an example, suppose that two balls are drawn without replacement from an urn containing r red and b black balls. Let A be the event “red on the first draw” and B the event “red on the second draw.” From the obvious relations P(A) = r/(r + b) = 1 − P(Ac), P(B|A) = (r − 1)/(r + b − 1), P(B|Ac) = r/(r + b − 1), and Bayes’s theorem, it follows that the probability of a red ball on the first draw given that the second one is known to be red equals (r − 1)/(r + b − 1). A more interesting and important use of Bayes’s theorem appears below in the discussion of subjective probabilities.

**Random variables, distributions, expectation, and variance**

Random variables

Usually it is more convenient to associate numerical values with the outcomes of an experiment than to work directly with a nonnumerical description such as “red ball on the first draw.” For example, an outcome of the experiment of drawing n balls with replacement from an urn containing black and red balls is an n-tuple that tells us whether a red or a black ball was drawn on each of the draws. This n-tuple is conveniently represented by an n-tuple of ones and zeros, where the appearance of a one in the kth position indicates that a red ball was drawn on the kth draw. A quantity of particular interest is the number of red balls drawn, which is just the sum of the entries in this numerical description of the experimental outcome. Mathematically a rule that associates with every element of a given set a unique real number is called a “(real-valued) function.” In the history of statistics and probability, real-valued functions defined on a sample space have traditionally been called “random variables.” Thus, if a sample space S has the generic element e, the outcome of an experiment, then a random variable is a real-valued function X = X(e). Customarily one omits the argument e in the notation for a random variable. For the experiment of drawing balls from an urn containing black and red balls, R, the number of red balls drawn, is a random variable. A particularly useful random variable is 1[A], the indicator variable of the event A, which equals 1 if A occurs and 0 otherwise. A “constant” is a trivial random variable that always takes the same value regardless of the outcome of the experiment.

**Probability distribution**

Suppose X is a random variable that can assume one of the values x1, x2,…, xm, according to the outcome of a random experiment, and consider the event {X = xi}, which is a shorthand notation for the set of all experimental outcomes e such that X(e) = xi. The probability of this event, P{X = xi}, is itself a function of xi, called the probability distribution function of X. Thus, the distribution of the random variable R defined in the preceding section is the function of i = 0, 1,…, n given in the binomial equation. Introducing the notation f(xi) = P{X = xi}, one sees from the basic properties of probabilities that

and

for any real numbers a and b. If Y is a second random variable defined on the same sample space as X and taking the values y1, y2,…, yn, the function of two variables h(xi, yj) = P{X = xi, Y = yj} is called the joint distribution of X and Y. Since {X = xi} = ∪j{X = xi, Y = yj}, and this union consists of disjoint events in the sample space,

Often f is called the marginal distribution of X to emphasize its relation to the joint distribution of X and Y. Similarly, g(yj) = ∑ih(xi, yj) is the (marginal) distribution of Y. The random variables X and Y are defined to be independent if the events {X = xi} and {Y = yj} are independent for all i and j—i.e., if h(xi, yj) = f(xi)g(yj) for all i and j. The joint distribution of an arbitrary number of random variables is defined similarly.

Suppose two dice are thrown. Let X denote the sum of the numbers appearing on the two dice, and let Y denote the number of even numbers appearing (see the table). The possible values of X are 2, 3,…, 12, while the possible values of Y are 0, 1, 2. Since there are 36 possible outcomes for the two dice, the accompanying table giving the joint distribution h(i, j) (i = 2, 3,…, 12; j = 0, 1, 2) and the marginal distributions f(i) and g(j) is easily computed by direct enumeration.

For more complex experiments, determination of a complete probability distribution usually requires a combination of theoretical analysis and empirical experimentation and is often very difficult. Consequently, it is desirable to describe a distribution insofar as possible by a small number of parameters that are comparatively easy to evaluate and interpret. The most important are the mean and the variance. These are both defined in terms of the “expected value” of a random variable.

**Expected value**

Given a random variable X with distribution f, the expected value of X, denoted E(X), is defined by E(X) = ∑ixif(xi). In words, the expected value of X is the sum of each of the possible values of X multiplied by the probability of obtaining that value. The expected value of X is also called the mean of the distribution f. The basic property of E is that of linearity: if X and Y are random variables and if a and b are constants, then E(aX + bY) = aE(X) + bE(Y). To see why this is true, note that aX + bY is itself a random variable, which assumes the values axi + byj with the probabilities h(xi, yj). Hence,

If the first sum on the right-hand side is summed over j while holding i fixed, by equation (8) the result is

which by definition is E(X). Similarly, the second sum equals E(Y).

If 1[A] denotes the “indicator variable” of A—i.e., a random variable equal to 1 if A occurs and equal to 0 otherwise—then E{1[A]} = 1 × P(A) + 0 × P(Ac) = P(A). This shows that the concept of expectation includes that of probability as a special case.

As an illustration, consider the number R of red balls in n draws with replacement from an urn containing a proportion p of red balls. From the definition and the binomial distribution of R,

which can be evaluated by algebraic manipulation and found to equal np. It is easier to use the representation R = 1[A1] +⋯+ 1[An], where Ak denotes the event “the kth draw results in a red ball.” Since E{1[Ak]} = p for all k, by linearity E(R) = E{1[A1]} +⋯+ E{1[An]} = np. This argument illustrates the principle that one can often compute the expected value of a random variable without first computing its distribution. For another example, suppose n balls are dropped at random into n boxes. The number of empty boxes, Y, has the representation Y = 1[B1] +⋯+ 1[Bn], where Bk is the event that “the kth box is empty.” Since the kth box is empty if and only if each of the n balls went into one of the other n − 1 boxes, P(Bk) = [(n − 1)/n]n for all k, and consequently E(Y) = n(1 − 1/n)n. The exact distribution of Y is very complicated, especially if n is large.

Many probability distributions have small values of f(xi) associated with extreme (large or small) values of xi and larger values of f(xi) for intermediate xi. For example, both marginal distributions in the table are symmetrical about a midpoint that has relatively high probability, and the probability of other values decreases as one moves away from the midpoint. Insofar as a distribution f(xi) follows this kind of pattern, one can interpret the mean of f as a rough measure of location of the bulk of the probability distribution, because in the defining sum the values xi associated with large values of f(xi) more or less define the centre of the distribution. In the extreme case, the expected value of a constant random variable is just that constant.

**Variance**

It is also of interest to know how closely packed about its mean value a distribution is. The most important measure of concentration is the variance, denoted by Var(X) and defined by Var(X) = E{[X − E(X)]2}. By linearity of expectations, one has equivalently Var(X) = E(X2) − {E(X)}2. The standard deviation of X is the square root of its variance. It has a more direct interpretation than the variance because it is in the same units as X. The variance of a constant random variable is 0. Also, if c is a constant, Var(cX) = c2Var(X).

There is no general formula for the expectation of a product of random variables. If the random variables X and Y are independent, E(XY) = E(X)E(Y). This can be used to show that, if X1,…, Xn are independent random variables, the variance of the sum X1 +⋯+ Xn is just the sum of the individual variances, Var(X1) +⋯+ Var(Xn). If the Xs have the same distribution and are independent, the variance of the average (X1 +⋯+ Xn)/n is Var(X1)/n. Equivalently, the standard deviation of (X1 +⋯+ Xn)/n is the standard deviation of X1 divided by √n. This quantifies the intuitive notion that the average of repeated observations is less variable than the individual observations. More precisely, it says that the variability of the average is inversely proportional to the square root of the number of observations. This result is tremendously important in problems of statistical inference. (See the section The law of large numbers, the central limit theorem, and the Poisson approximation.)

Consider again the binomial distribution given by equation (3). As in the calculation of the mean value, one can use the definition combined with some algebraic manipulation to show that, if R has the binomial distribution, then Var(R) = npq. From the representation R = 1[A1] +⋯+ 1[An] defined above, and the observation that the events Ak are independent and have the same probability, it follows that

Moreover,

so Var(R) = npq.

The conditional distribution of Y given X = xi is defined by:

(compare equation (4)), and the conditional expectation of Y given X = xi is

One can regard E(Y|X) as a function of X; since X is a random variable, this function of X must itself be a random variable. The conditional expectation E(Y|X) considered as a random variable has its own (unconditional) expectation E{E(Y|X)}, which is calculated by multiplying equation (9) by f(xi) and summing over i to obtain the important formula

Properly interpreted, equation (10) is a generalization of the law of total probability.

For a simple example of the use of equation (10), recall the problem of the gambler’s ruin and let e(x) denote the expected duration of the game if Peter’s fortune is initially equal to x. The reasoning leading to equation (5) in conjunction with equation (10) shows that e(x) satisfies the equations e(x) = 1 + pe(x + 1) + qe(x − 1) for x = 1, 2,…, m − 1 with the boundary conditions e(0) = e(m) = 0. The solution for p ≠ 1/2 is rather complicated; for p = 1/2, e(x) = x(m − x).

**An alternative interpretation of probability**

In ordinary conversation the word probability is applied not only to variable phenomena but also to propositions of uncertain veracity. The truth of any proposition concerning the outcome of an experiment is uncertain before the experiment is performed. Many other uncertain propositions cannot be defined in terms of repeatable experiments. An individual can be uncertain about the truth of a scientific theory, a religious doctrine, or even about the occurrence of a specific historical event when inadequate or conflicting eyewitness accounts are involved. Using probability as a measure of uncertainty enlarges its domain of application to phenomena that do not meet the requirement of repeatability. The concomitant disadvantage is that probability as a measure of uncertainty is subjective and varies from one person to another.

According to one interpretation, to say that someone has subjective probability p that a proposition is true means that for any integers r and b with r/(r + b) < p, if that individual is offered an opportunity to bet the same amount on the truth of the proposition or on “red in a single draw” from an urn containing r red and b black balls, he prefers the first bet, while, if r/(r + b) > p, he prefers the second bet.

An important stimulus to modern thought about subjective probability has been an attempt to understand decision making in the face of incomplete knowledge. It is assumed that an individual, when faced with the necessity of making a decision that may have different consequences depending on situations about which he has incomplete knowledge, can express his personal preferences and uncertainties in a way consistent with certain axioms of rational behaviour. It can then be deduced that the individual has a utility function, which measures the value to him of each course of action when each of the uncertain possibilities is the true one, and a “subjective probability distribution,” which expresses quantitatively his beliefs about the uncertain situations. The individual’s optimal decision is the one that maximizes his expected utility with respect to his subjective probability. The concept of utility goes back at least to Daniel Bernoulli (Jakob Bernoulli’s nephew) and was developed in the 20th century by John von Neumann and Oskar Morgenstern, Frank P. Ramsey, and Leonard J. Savage, among others. Ramsey and Savage stressed the importance of subjective probability as a concomitant ingredient of decision making in the face of uncertainty. An alternative approach to subjective probability without the use of utility theory was developed by Bruno de Finetti.

The mathematical theory of probability is the same regardless of one’s interpretation of the concept, although the importance attached to various results can depend very much on the interpretation. In particular, in the theory and applications of subjective probability, Bayes’s theorem plays an important role.

For example, suppose that an urn contains N balls, r of which are red and b = N − r of which are black, but r (hence b) is unknown. One is permitted to learn about the value of r by performing the experiment of drawing with replacement n balls from the urn. Suppose also that one has a subjective probability distribution giving the probability f(r) that the number of red balls is in fact r where f(0) +⋯+ f(N) = 1. This distribution is called an a priori distribution because it is specified prior to the experiment of drawing balls from the urn. The binomial distribution is now a conditional probability, given the value of r. Finally, one can use Bayes’s theorem to find the conditional probability that the unknown number of red balls in the urn is r, given that the number of red balls drawn from the urn is i. The result is

This distribution, derived by using Bayes’s theorem to combine the a priori distribution with the conditional distribution for the outcome of the experiment, is called the a posteriori distribution.

The virtue of this calculation is that it makes possible a probability statement about the composition of the urn, which is not directly observable, in terms of observable data, from the composition of the sample taken from the urn. The weakness, as indicated above, is that different people may choose different subjective probabilities for the composition of the urn a priori and hence reach different conclusions about its composition a posteriori.

To see how this idea might apply in practice, consider a simple urn model of opinion polling to predict which of two candidates will win an election. The red balls in the urn are identified with voters who will vote for candidate A and the black balls with those voting for candidate B. Choosing a sample from the electorate and asking their preferences is a well-defined random experiment, which in theory and in practice is repeatable. The composition of the urn is uncertain and is not the result of a well-defined random experiment. Nevertheless, to the extent that a vote for a candidate is a vote for a political party, other elections provide information about the content of the urn, which, if used judiciously, should be helpful in supplementing the results of the actual sample to make a prediction. Exactly how to use this information is a difficult problem in which individual judgment plays an important part. One possibility is to incorporate the prior information into an a priori distribution about the electorate, which is then combined via Bayes’s theorem with the outcome of the sample and summarized by an a posteriori distribution.

**The law of large numbers, the central limit theorem, and the Poisson approximation**

The law of large numbers

The relative frequency interpretation of probability is that if an experiment is repeated a large number of times under identical conditions and independently, then the relative frequency with which an event A actually occurs and the probability of A should be approximately the same. A mathematical expression of this interpretation is the law of large numbers. This theorem says that if X1, X2,…, Xn are independent random variables having a common distribution with mean μ, then for any number ε > 0, no matter how small, as n → ∞,

The law of large numbers was first proved by Jakob Bernoulli in the special case where Xk is 1 or 0 according as the kth draw (with replacement) from an urn containing r red and b black balls is red or black. Then E(Xk) = r/(r + b), and the last equation says that the probability that “the difference between the empirical proportion of red balls in n draws and the probability of red on a single draw is less than ε” converges to 1 as n becomes infinitely large.

Insofar as an event which has probability very close to 1 is practically certain to happen, this result justifies the relative frequency interpretation of probability. Strictly speaking, however, the justification is circular because the probability in the above equation, which is very close to but not equal to 1, requires its own relative frequency interpretation. Perhaps it is better to say that the weak law of large numbers is consistent with the relative frequency interpretation of probability.

The following simple proof of the law of large numbers is based on Chebyshev’s inequality, which illustrates the sense in which the variance of a distribution measures how the distribution is dispersed about its mean. If X is a random variable with distribution f and mean μ, then by definition Var(X) = ∑i(xi − μ)2f(xi). Since all terms in this sum are positive, the sum can only decrease if some of the terms are omitted. Suppose one omits all terms with |xi − μ| < b, where b is an arbitrary given number. Each term remaining in the sum has a factor of the form (xi − μ)2, which is greater than or equal to b2. Hence, Var(X) ≥ b2 ∑′ f(xi), where the prime on the summation sign indicates that only terms with |xi − μ| ≥ b are included in the sum. Chebyshev’s inequality is this expression rewritten as

This inequality can be applied to the complementary event of that appearing in equation (11), with b = ε. The Xs are independent and have the same distribution, E[n−1(X1 +⋯+ Xn)] = μ and Var[(X1 +⋯+ Xn)/n] = Var(X1)/n, so that

This not only proves equation (11), but it also says quantitatively how large n should be in order that the empirical average, n−1(X1 +⋯+ Xn), approximate its expectation to any required degree of precision.

Suppose, for example, that the proportion p of red balls in an urn is unknown and is to be estimated by the empirical proportion of red balls in a sample of size n drawn from the urn with replacement. Chebyshev’s inequality with Xk = 1{red ball on the kth draw} implies that, in order that the observed proportion be within ε of the true proportion p with probability at least 0.95, it suffices that n be at least 20 × Var(X1)/ε2. Since Var(X1) = p(1 − p) ≤ 1/4 for all p, for ε = 0.03 it suffices that n be at least 5,555. It is shown below that this value of n is much larger than necessary, because Chebyshev’s inequality is not sufficiently precise to be useful in numerical calculations.

Although Jakob Bernoulli did not know Chebyshev’s inequality, the inequality he derived was also imprecise, and, perhaps because of his disappointment in not having a quantitatively useful approximation, he did not publish the result during his lifetime. It appeared in 1713, eight years after his death.

**The central limit theorem**

The desired useful approximation is given by the central limit theorem, which in the special case of the binomial distribution was first discovered by Abraham de Moivre about 1730. Let X1,…, Xn be independent random variables having a common distribution with expectation μ and variance σ2. The law of large numbers implies that the distribution of the random variable X̄n = n−1(X1 +⋯+ Xn) is essentially just the degenerate distribution of the constant μ, because E(X̄n) = μ and Var(X̄n) = σ2/n → 0 as n → ∞. The standardized random variable (X̄n − μ)/(σ/√n) has mean 0 and variance 1. The central limit theorem gives the remarkable result that, for any real numbers a and b, as n → ∞,

where

Thus, if n is large, the standardized average has a distribution that is approximately the same, regardless of the original distribution of the Xs. The equation also illustrates clearly the square root law: the accuracy of X̄n as an estimator of μ is inversely proportional to the square root of the sample size n.

Use of equation (12) to evaluate approximately the probability on the left-hand side of equation (11), by setting b = −a = ε√n/σ, yields the approximation G(ε√n/σ) − G(−ε√n/σ). Since G(2) − G(−2) is approximately 0.95, n must be about 4σ2/ε2 in order that the difference |X̄n − μ| will be less than ε with probability 0.95. For the special case of the binomial distribution, one can again use the inequality σ2 = p(1 − p) ≤ 1/4 and now conclude that about 1,100 balls must be drawn from the urn in order that the empirical proportion of red balls drawn will be within 0.03 of the true proportion of red balls with probability about 0.95. The frequently appearing statement in newspapers in the United States that a given opinion poll involving a sample of about 1,100 persons has a sampling error of no more than 3 percent is based on this kind of calculation. The qualification that this 3 percent sampling error may be exceeded in about 5 percent of the cases is often omitted. (The actual situation in opinion polls or sample surveys generally is more complicated. The sample is drawn without replacement, so, strictly speaking, the binomial distribution is not applicable. However, the “urn”—i.e., the population from which the sample is drawn—is extremely large, in many cases infinitely large for practical purposes. Hence, the composition of the urn is effectively the same throughout the sampling process, and the binomial distribution applies as an approximation. Also, the population is usually stratified into relatively homogeneous groups, and the survey is designed to take advantage of this stratification. To pursue the analogy with urn models, one can imagine the balls to be in several urns in varying proportions, and one must decide how to allocate the n draws from the various urns so as to estimate efficiently the overall proportion of red balls.)

Considerable effort has been put into generalizing both the law of large numbers and the central limit theorem, so that it is unnecessary for the variables to be either independent or identically distributed.

The law of large numbers discussed above is often called the “weak law of large numbers,” to distinguish it from the “strong law,” a conceptually different result discussed below in the section on infinite probability spaces.

**The Poisson approximation**

The weak law of large numbers and the central limit theorem give information about the distribution of the proportion of successes in a large number of independent trials when the probability of success on each trial is p. In the mathematical formulation of these results, it is assumed that p is an arbitrary, but fixed, number in the interval (0, 1) and n → ∞, so that the expected number of successes in the n trials, np, also increases toward +∞ with n. A rather different kind of approximation is of interest when n is large and the probability p of success on a single trial is inversely proportional to n, so that np = μ is a fixed number even though n → ∞. An example is the following simple model of radioactive decay of a source consisting of a large number of atoms, which independently of one another decay by spontaneously emitting a particle. The time scale is divided into a large number of very small intervals of equal lengths, and in each interval, independently of what happens in the other intervals, the source emits one or no particle with probability p or q = 1 − p respectively. It is assumed that the intervals are so small that the probability of two or more particles being emitted in a single interval is negligible. One now imagines that the size of the intervals shrinks to 0, so that the number of trials up to any fixed time t becomes infinite. It is reasonable to assume that the probability of emission during a short time interval is proportional to the length of the interval. The result is a different kind of approximation to the binomial distribution, called the Poisson distribution (after the French mathematician Siméon-Denis Poisson) or the law of small numbers.

Assume, then, that a biased coin having probability p = μδ of heads is tossed once in each time interval of length δ, so that by time t the total number of tosses is an integer n approximately equal to t/δ. Introducing these values into the binomial equation and passing to the limit as δ → 0 gives as the distribution for N(t) the number of radioactive particles emitted in time t:

The right-hand side of this equation is the Poisson distribution. Its mean and variance are both equal to μt. Although the Poisson approximation is not comparable to the central limit theorem in importance, it nevertheless provides one of the basic building blocks in the theory of stochastic processes.

**Infinite sample spaces and axiomatic probability**

Infinite sample spaces

The experiments described in the preceding discussion involve finite sample spaces for the most part, although the central limit theorem and the Poisson approximation involve limiting operations and hence lead to integrals and infinite series. In a finite sample space, calculation of the probability of an event A is conceptually straightforward because the principle of additivity tells one to calculate the probability of a complicated event as the sum of the probabilities of the individual experimental outcomes whose union defines the event.

Experiments having a continuum of possible outcomes—for example, that of selecting a number at random from the interval [r, s]—involve subtle mathematical difficulties that were not satisfactorily resolved until the 20th century. If one chooses a number at random from [r, s], the probability that the number falls in any interval [x, y] must be proportional to the length of that interval; and, since the probability of the entire sample space [r, s] equals 1, the constant of proportionality equals 1/(s − r). Hence, the probability of obtaining a number in the interval [x, y] equals (y − x)/(s − r). From this and the principle of additivity one can determine the probability of any event that can be expressed as a finite union of intervals. There are, however, very complicated sets having no simple relation to the intervals—e.g., the rational numbers—and it is not immediately clear what the probabilities of these sets should be. Also, the probability of selecting exactly the number x must be 0, because the set consisting of x alone is contained in the interval [x, x + 1/n] for all n and hence must have no larger probability than 1/[n(s − r)], no matter how large n is. Consequently, it makes no sense to try to compute the probability of an event by “adding” the probabilities of the individual outcomes making up the event, because each individual outcome has probability 0.

A closely related experiment, although at first there appears to be no connection, arises as follows. Suppose that a coin is tossed n times, and let Xk = 1 or 0 according as the outcome of the kth toss is heads or tails. The weak law of large numbers given above says that a certain sequence of numbers—namely the sequence of probabilities given in equation (11) and defined in terms of these n Xs—converges to 1 as n → ∞. In order to formulate this result, it is only necessary to imagine that one can toss the coin n times and that this finite number of tosses can be arbitrarily large. In other words, there is a sequence of experiments, but each one involves a finite sample space. It is also natural to ask whether the sequence of random variables (X1 +⋯+ Xn)/n converges as n → ∞. However, this question cannot even be formulated mathematically unless infinitely many Xs can be defined on the same sample space, which in turn requires that the underlying experiment involve an actual infinity of coin tosses.

For the conceptual experiment of tossing a fair coin infinitely many times, the sequence of zeros and ones, (X1, X2,…), can be identified with that real number that has the Xs as the coefficients of its expansion in the base 2, namely X1/21 + X2/22 + X3/23 +⋯. For example, the outcome of getting heads on the first two tosses and tails thereafter corresponds to the real number 1/2 + 1/4 + 0/8 +⋯ = 3/4. (There are some technical mathematical difficulties that arise from the fact that some numbers have two representations. Obviously 1/2 = 1/2 + 0/4 +⋯, and the formula for the sum of an infinite geometric series shows that it also equals 0/2 + 1/4 + 1/8 +⋯. It can be shown that these difficulties do not pose a serious problem, and they are ignored in the subsequent discussion.) For any particular specification i1, i2,…, in of zeros and ones, the event {X1 = i1, X2 = i2,…, Xn = in} must have probability 1/2n in order to be consistent with the experiment of tossing the coin only n times. Moreover, this event corresponds to the interval of real numbers [i1/21 + i2/22 +⋯+ in/2n, i1/21 + i2/22 +⋯+ in/2n + 1/2n] of length 1/2n, since any continuation Xn + 1, Xn + 2,… corresponds to a number that is at least 0 and at most 1/2n + 1 + 1/2n + 2 +⋯ = 1/2n by the formula for an infinite geometric series. It follows that the mathematical model for choosing a number at random from [0, 1] and that of tossing a fair coin infinitely many times assign the same probabilities to all intervals of the form [k/2n, 1/2n].

**The strong law of large numbers**

The mathematical relation between these two experiments was recognized in 1909 by the French mathematician Émile Borel, who used the then new ideas of measure theory to give a precise mathematical model and to formulate what is now called the strong law of large numbers for fair coin tossing. His results can be described as follows. Let e denote a number chosen at random from [0, 1], and let Xk(e) be the kth coordinate in the expansion of e to the base 2. Then X1, X2,… are an infinite sequence of independent random variables taking the values 0 or 1 with probability 1/2 each. Moreover, the subset of [0, 1] consisting of those e for which the sequence n−1[X1(e) +⋯+ Xn(e)] tends to 1/2 as n → ∞ has probability 1. Symbolically:

The weak law of large numbers given in equation (11) says that for any ε > 0, for each sufficiently large value of n, there is only a small probability of observing a deviation of Xn = n−1(X1 +⋯+ Xn) from 1/2 which is larger than ε; nevertheless, it leaves open the possibility that sooner or later this rare event will occur if one continues to toss the coin and observe the sequence for a sufficiently long time. The strong law, however, asserts that the occurrence of even one value of Xk for k ≥ n that differs from 1/2 by more than ε is an event of arbitrarily small probability provided n is large enough. The proof of equation (14) and various subsequent generalizations is much more difficult than that of the weak law of large numbers. The adjectives “strong” and “weak” refer to the fact that the truth of a result such as equation (14) implies the truth of the corresponding version of equation (11), but not conversely.

**Measure theory**

During the two decades following 1909, measure theory was used in many concrete problems of probability theory, notably in the American mathematician Norbert Wiener’s treatment (1923) of the mathematical theory of Brownian motion, but the notion that all problems of probability theory could be formulated in terms of measure is customarily attributed to the Soviet mathematician Andrey Nikolayevich Kolmogorov in 1933.

The fundamental quantities of the measure theoretic foundation of probability theory are the sample space S, which as before is just the set of all possible outcomes of an experiment, and a distinguished class M of subsets of S, called events. Unlike the case of finite S, in general not every subset of S is an event. The class M must have certain properties described below. Each event is assigned a probability, which means mathematically that a probability is a function P mapping M into the real numbers that satisfies certain conditions derived from one’s physical ideas about probability.

The properties of M are as follows: (i) S ∊ M; (ii) if A ∊ M, then Ac ∊ M; (iii) if A1, A2,… ∊ M, then A1 ∪ A2 ∪ ⋯ ∊ M. Recalling that M is the domain of definition of the probability P, one can interpret (i) as saying that P(S) is defined, (ii) as saying that, if the probability of A is defined, then the probability of “not A” is also defined, and (iii) as saying that, if one can speak of the probability of each of a sequence of events An individually, then one can speak of the probability that at least one of the An occurs. A class of subsets of any set that has properties (i)–(iii) is called a σ-field. From these properties one can prove others. For example, it follows at once from (i) and (ii) that Ø (the empty set) belongs to the class M. Since the intersection of any class of sets can be expressed as the complement of the union of the complements of those sets (DeMorgan’s law), it follows from (ii) and (iii) that, if A1, A2,… ∊ M, then A1 ∩ A2 ∩ ⋯ ∊ M.

Given a set S and a σ-field M of subsets of S, a probability measure is a function P that assigns to each set A ∊ M a nonnegative real number and that has the following two properties: (a) P(S) = 1 and (b) if A1, A2,… ∊ M and Ai ∩ Aj = Ø for all i ≠ j, then P(A1 ∪ A2 ∪ ⋯) = P(A1) + P(A2) +⋯. Property (b) is called the axiom of countable additivity. It is clearly motivated by equation (1), which suffices for finite sample spaces because there are only finitely many events. In infinite sample spaces it implies, but is not implied by, equation (1). There is, however, nothing in one’s intuitive notion of probability that requires the acceptance of this property. Indeed, a few mathematicians have developed probability theory with only the weaker axiom of finite additivity, but the absence of interesting models that fail to satisfy the axiom of countable additivity has led to its virtually universal acceptance.

To get a better feeling for this distinction, consider the experiment of tossing a biased coin having probability p of heads and q = 1 − p of tails until heads first appears. To be consistent with the idea that the tosses are independent, the probability that exactly n tosses are required equals qn − 1p, since the first n − 1 tosses must be tails, and they must be followed by a head. One can imagine that this experiment never terminates—i.e., that the coin continues to turn up tails forever. By the axiom of countable additivity, however, the probability that heads occurs at some finite value of n equals p + qp + q2p + ⋯ = p/(1 − q) = 1, by the formula for the sum of an infinite geometric series. Hence, the probability that the experiment goes on forever equals 0. Similarly, one can compute the probability that the number of tosses is odd, as p + q2p + q4p + ⋯ = p/(1 − q2) = 1/(1 + q). On the other hand, if only finite additivity were required, it would be possible to define the following admittedly bizarre probability. The sample space S is the set of all natural numbers, and the σ-field M is the class of all subsets of S. If an event A contains finitely many elements, P(A) = 0, and, if the complement of A contains finitely many elements, P(A) = 1. As a consequence of the deceptively innocuous axiom of choice (which says that, given any collection C of nonempty sets, there exists a rule for selecting a unique point from each set in C), one can show that many finitely additive probabilities consistent with these requirements exist. However, one cannot be certain what the probability of getting an odd number is, because that set is neither finite nor its complement finite, nor can it be expressed as a finite disjoint union of sets whose probability is already defined.

It is a basic problem, and by no means a simple one, to show that the intuitive notion of choosing a number at random from [0, 1], as described above, is consistent with the preceding definitions. Since the probability of an interval is to be its length, the class of events M must contain all intervals; but in order to be a σ-field it must contain other sets, many of which are difficult to describe in an elementary way. One example is the event in equation (14), which must belong to M in order that one can talk about its probability. Also, although it seems clear that the length of a finite disjoint union of intervals is just the sum of their lengths, a rather subtle argument is required to show that length has the property of countable additivity. A basic theorem says that there is a suitable σ-field containing all the intervals and a unique probability defined on this σ-field for which the probability of an interval is its length. The σ-field is called the class of Lebesgue-measurable sets, and the probability is called the Lebesgue measure, after the French mathematician and principal architect of measure theory, Henri-Léon Lebesgue.

In general, a σ-field need not be all subsets of the sample space S. The question of whether all subsets of [0, 1] are Lebesgue-measurable turns out to be a difficult problem that is intimately connected with the foundations of mathematics and in particular with the axiom of choice.

**Probability density functions**

For random variables having a continuum of possible values, the function that plays the same role as the probability distribution of a discrete random variable is called a probability density function. If the random variable is denoted by X, its probability density function f has the property that

for every interval (a, b]; i.e., the probability that X falls in (a, b] is the area under the graph of f between a and b (see the Probability density function. [Credit: Encyclopædia Britannica, Inc.]figure). For example, if X denotes the outcome of selecting a number at random from the interval [r, s], the probability density function of X is given by f(x) = 1/(s − r) for r < x < s and f(x) = 0 for x < r or x > s. The function F(x) defined by F(x) = P{X ≤ x} is called the distribution function, or cumulative distribution function, of X. If X has a probability density function f(x), the relation between f and F is F′(x) = f(x) or equivalently

The distribution function F of a discrete random variable should not be confused with its probability distribution f. In this case the relation between F and f is

If a random variable X has a probability density function f(x), its “expectation” can be defined by

provided that this integral is convergent. It turns out to be simpler, however, not only to use Lebesgue’s theory of measure to define probabilities but also to use his theory of integration to define expectation. Accordingly, for any random variable X, E(X) is defined to be the Lebesgue integral of X with respect to the probability measure P, provided that the integral exists. In this way it is possible to provide a unified theory in which all random variables, both discrete and continuous, can be treated simultaneously. In order to follow this path, it is necessary to restrict the class of those functions X defined on S that are to be called random variables, just as it was necessary to restrict the class of subsets of S that are called events. The appropriate restriction is that a random variable must be a measurable function. The definition is taken over directly from the Lebesgue theory of integration and will not be discussed here. It can be shown that, whenever X has a probability density function, its expectation (provided it exists) is given by equation (15), which remains a useful formula for calculating E(X).

Some important probability density functions are the following:

The cumulative distribution function of the normal distribution with mean 0 and variance 1 has already appeared as the function G defined following equation (12). The law of large numbers and the central limit theorem continue to hold for random variables on infinite sample spaces. A useful interpretation of the central limit theorem stated formally in equation (equation (12) is as follows: The probability that the average (or sum) of a large number of independent, identically distributed random variables with finite variance falls in an interval (c1, c2] equals approximately the area between c1 and c2 underneath the graph of a normal density function chosen to have the same expectation and variance as the given average (or sum). The figure illustrates the normal approximation to the binomial distribution with n = 10 and p = 1/2.

The exponential distribution arises naturally in the study of the Poisson distribution introduced in equation (13). If Tk denotes the time interval between the emission of the k − 1st and kth particle, then T1, T2,… are independent random variables having an exponential distribution with parameter μ. This is obvious for T1 from the observation that {T1 > t} = {N(t) = 0}. Hence, P{T1 ≤ t} = 1 − P{N(t) = 0} = 1 − exp(−μt), and by differentiation one obtains the exponential density function.

The Cauchy distribution does not have a mean value or a variance, because the integral (15) does not converge. As a result, it has a number of unusual properties. For example, if X1, X2,…, Xn are independent random variables having a Cauchy distribution, the average (X1 +⋯+ Xn)/n also has a Cauchy distribution. The variability of the average is exactly the same as that of a single observation. Another random variable that does not have an expectation is the waiting time until the number of heads first equals the number of tails in tossing a fair coin.

**Conditional expectation and least squares prediction**

An important problem of probability theory is to predict the value of a future observation Y given knowledge of a related observation X (or, more generally, given several related observations X1, X2,…). Examples are to predict the future course of the national economy or the path of a rocket, given its present state.

Prediction is often just one aspect of a “control” problem. For example, in guiding a rocket, measurements of the rocket’s location, velocity, and so on are made almost continuously; at each reading, the rocket’s future course is predicted, and a control is then used to correct its future course. The same ideas are used to steer automatically large tankers transporting crude oil, for which even slight gains in efficiency result in large financial savings.

Given X, a predictor of Y is just a function H(X). The problem of “least squares prediction” of Y given the observation X is to find that function H(X) that is closest to Y in the sense that the mean square error of prediction, E{[Y − H(X)]2}, is minimized. The solution is the conditional expectation H(X) = E(Y|X).

In applications a probability model is rarely known exactly and must be constructed from a combination of theoretical analysis and experimental data. It may be quite difficult to determine the optimal predictor, E(Y|X), particularly if instead of a single X a large number of predictor variables X1, X2,… are involved. An alternative is to restrict the class of functions H over which one searches to minimize the mean square error of prediction, in the hope of finding an approximately optimal predictor that is much easier to evaluate. The simplest possibility is to restrict consideration to linear functions H(X) = a + bX. The coefficients a and b that minimize the restricted mean square prediction error E{(Y − a − bX)2} give the best linear least squares predictor. Treating this restricted mean square prediction error as a function of the two coefficients (a, b) and minimizing it by methods of the calculus yield the optimal coefficients: b̂ = E{[X − E(X)][Y − E(Y)]}/Var(X) and â = E(Y) − b̂E(X). The numerator of the expression for b̂ is called the covariance of X and Y and is denoted Cov(X, Y). Let Ŷ = â + b̂X denote the optimal linear predictor. The mean square error of prediction is E{(Y − Ŷ)2} = Var(Y) − [Cov(X, Y)]2/Var(X).

If X and Y are independent, then Cov(X, Y) = 0, the optimal predictor is just E(Y), and the mean square error of prediction is Var(Y). Hence, |Cov(X, Y)| is a measure of the value X has in predicting Y. In the extreme case that [Cov(X, Y)]2 = Var(X)Var(Y), Y is a linear function of X, and the optimal linear predictor gives error-free prediction.

There is one important case in which the optimal mean square predictor actually is the same as the optimal linear predictor. If X and Y are jointly normally distributed, the conditional expectation of Y given X is just a linear function of X, and hence the optimal predictor and the optimal linear predictor are the same. The form of the bivariate normal distribution as well as expressions for the coefficients â and b̂ and for the minimum mean square error of prediction were discovered by the English eugenicist Sir Francis Galton in his studies of the transmission of inheritable characteristics from one generation to the next. They form the foundation of the statistical technique of linear regression.

**The Poisson process and the Brownian motion process**

The theory of stochastic processes attempts to build probability models for phenomena that evolve over time. A primitive example appearing earlier in this article is the problem of gambler’s ruin.

**The Poisson process**

An important stochastic process described implicitly in the discussion of the Poisson approximation to the binomial distribution is the Poisson process. Modeling the emission of radioactive particles by an infinitely large number of tosses of a coin having infinitesimally small probability for heads on each toss led to the conclusion that the number of particles N(t) emitted in the time interval [0, t] has the Poisson distribution given in equation (13) with expectation μt. The primary concern of the theory of stochastic processes is not this marginal distribution of N(t) at a particular time but rather the evolution of N(t) over time. Two properties of the Poisson process that make it attractive to deal with theoretically are: (i) The times between emission of particles are independent and exponentially distributed with expected value 1/μ. (ii) Given that N(t) = n, the times at which the n particles are emitted have the same joint distribution as n points distributed independently and uniformly on the interval [0, t].

As a consequence of property (i), a picture of the function N(t) is very easily constructed. Originally N(0) = 0. At an exponentially distributed time T1, the function N(t) jumps from 0 to 1. It remains at 1 another exponentially distributed random time, T2, which is independent of T1, and at time T1 + T2 it jumps from 1 to 2, and so on.

Examples of other phenomena for which the Poisson process often serves as a mathematical model are the number of customers arriving at a counter and requesting service, the number of claims against an insurance company, or the number of malfunctions in a computer system. The importance of the Poisson process consists in (a) its simplicity as a test case for which the mathematical theory, and hence the implications, are more easily understood than for more realistic models and (b) its use as a building block in models of complex systems.

**Brownian motion process**

The most important stochastic process is the Brownian motion or Wiener process. It was first discussed by Louis Bachelier (1900), who was interested in modeling fluctuations in prices in financial markets, and by Albert Einstein (1905), who gave a mathematical model for the irregular motion of colloidal particles first observed by the Scottish botanist Robert Brown in 1827. The first mathematically rigorous treatment of this model was given by Wiener (1923). Einstein’s results led to an early, dramatic confirmation of the molecular theory of matter in the French physicist Jean Perrin’s experiments to determine Avogadro’s number, for which Perrin was awarded a Nobel Prize in 1926. Today somewhat different models for physical Brownian motion are deemed more appropriate than Einstein’s, but the original mathematical model continues to play a central role in the theory and application of stochastic processes.

Let B(t) denote the displacement (in one dimension for simplicity) of a colloidally suspended particle, which is buffeted by the numerous much smaller molecules of the medium in which it is suspended. This displacement will be obtained as a limit of a random walk occurring in discrete time as the number of steps becomes infinitely large and the size of each individual step infinitesimally small. Assume that at times kδ, k = 1, 2,…, the colloidal particle is displaced a distance hXk, where X1, X2,… are +1 or −1 according as the outcomes of tossing a fair coin are heads or tails. By time t the particle has taken m steps, where m is the largest integer ≤ t/δ, and its displacement from its original position is Bm(t) = h(X1 +⋯+ Xm). The expected value of Bm(t) is 0, and its variance is h2m, or approximately h2t/δ. Now suppose that δ → 0, and at the same time h → 0 in such a way that the variance of Bm(1) converges to some positive constant, σ2. This means that m becomes infinitely large, and h is approximately σ(t/m)1/2. It follows from the central limit theorem (equation equation (12) that lim P{Bm(t) ≤ x} = G(x/σt1/2), where G(x) is the standard normal cumulative distribution function defined just below equation (12). The Brownian motion process B(t) can be defined to be the limit in a certain technical sense of the Bm(t) as δ → 0 and h → 0 with h2/δ → σ2.

The process B(t) has many other properties, which in principle are all inherited from the approximating random walk Bm(t). For example, if (s1, t1) and (s2, t2) are disjoint intervals, the increments B(t1) − B(s1) and B(t2) − B(s2) are independent random variables that are normally distributed with expectation 0 and variances equal to σ2(t1 − s1) and σ2(t2 − s2), respectively.

Einstein took a different approach and derived various properties of the process B(t) by showing that its probability density function, g(x, t), satisfies the diffusion equation ∂g/∂t = D∂2g/∂x2, where D = σ2/2. The important implication of Einstein’s theory for subsequent experimental research was that he identified the diffusion constant D in terms of certain measurable properties of the particle (its radius) and of the medium (its viscosity and temperature), which allowed one to make predictions and hence to confirm or reject the hypothesized existence of the unseen molecules that were assumed to be the cause of the irregular Brownian motion. Because of the beautiful blend of mathematical and physical reasoning involved, a brief summary of the successor to Einstein’s model is given below.

Unlike the Poisson process, it is impossible to “draw” a picture of the path of a particle undergoing mathematical Brownian motion. Wiener (1923) showed that the functions B(t) are continuous, as one expects, but nowhere differentiable. Thus, a particle undergoing mathematical Brownian motion does not have a well-defined velocity, and the curve y = B(t) does not have a well-defined tangent at any value of t. To see why this might be so, recall that the derivative of B(t), if it exists, is the limit as h → 0 of the ratio [B(t + h) − B(t)]/h. Since B(t + h) − B(t) is normally distributed with mean 0 and standard deviation h1/2σ, in very rough terms B(t + h) − B(t) can be expected to equal some multiple (positive or negative) of h1/2. But the limit as h → 0 of h1/2/h = 1/h1/2 is infinite. A related fact that illustrates the extreme irregularity of B(t) is that in every interval of time, no matter how small, a particle undergoing mathematical Brownian motion travels an infinite distance. Although these properties contradict the commonsense idea of a function—and indeed it is quite difficult to write down explicitly a single example of a continuous, nowhere-differentiable function—they turn out to be typical of a large class of stochastic processes, called diffusion processes, of which Brownian motion is the most prominent member. Especially notable contributions to the mathematical theory of Brownian motion and diffusion processes were made by Paul Lévy and William Feller during the years 1930–60.

A more sophisticated description of physical Brownian motion can be built on a simple application of Newton’s second law: F = ma. Let V(t) denote the velocity of a colloidal particle of mass m. It is assumed that

The quantity f retarding the movement of the particle is due to friction caused by the surrounding medium. The term dA(t) is the contribution of the very frequent collisions of the particle with unseen molecules of the medium. It is assumed that f can be determined by classical fluid mechanics, in which the molecules making up the surrounding medium are so many and so small that the medium can be considered smooth and homogeneous. Then by Stokes’s law, for a spherical particle in a gas, f = 6πaη, where a is the radius of the particle and η the coefficient of viscosity of the medium. Hypotheses concerning A(t) are less specific, because the molecules making up the surrounding medium cannot be observed directly. For example, it is assumed that, for t ≠ s, the infinitesimal random increments dA(t) = A(t + dt) − A(t) and A(s + ds) − A(s) caused by collisions of the particle with molecules of the surrounding medium are independent random variables having distributions with mean 0 and unknown variances σ2 dt and σ2 ds and that dA(t) is independent of dV(s) for s < t.

The differential equation (18) has the solution

where β = f/m. From this equation and the assumed properties of A(t), it follows that E[V2(t)] → σ2/(2mf) as t → ∞. Now assume that, in accordance with the principle of equipartition of energy, the steady-state average kinetic energy of the particle, m limt → ∞E[V2(t)]/2, equals the average kinetic energy of the molecules of the medium. According to the kinetic theory of an ideal gas, this is RT/2N, where R is the ideal gas constant, T is the temperature of the gas in kelvins, and N is Avogadro’s number, the number of molecules in one gram molecular weight of the gas. It follows that the unknown value of σ2 can be determined: σ2 = 2RTf/N.

If one also assumes that the functions V(t) are continuous, which is certainly reasonable from physical considerations, it follows by mathematical analysis that A(t) is a Brownian motion process as defined above. This conclusion poses questions about the meaning of the initial equation (18), because for mathematical Brownian motion the term dA(t) does not exist in the usual sense of a derivative. Some additional mathematical analysis shows that the stochastic differential equation (18) and its solution equation (19) have a precise mathematical interpretation. The process V(t) is called the Ornstein-Uhlenbeck process, after the physicists Leonard Salomon Ornstein and George Eugene Uhlenbeck. The logical outgrowth of these attempts to differentiate and integrate with respect to a Brownian motion process is the Ito (named for the Japanese mathematician Itō Kiyosi) stochastic calculus, which plays an important role in the modern theory of stochastic processes.

The displacement at time t of the particle whose velocity is given by equation (19) is

For t large compared with β, the first and third terms in this expression are small compared with the second. Hence, X(t) − X(0) is approximately equal to A(t)/f, and the mean square displacement, E{[X(t) − X(0)]2}, is approximately σ2/f 2 = RT/(3πaηN). These final conclusions are consistent with Einstein’s model, although here they arise as an approximation to the model obtained from equation (19). Since it is primarily the conclusions that have observational consequences, there are essentially no new experimental implications. However, the analysis arising directly out of Newton’s second law, which yields a process having a well-defined velocity at each point, seems more satisfactory theoretically than Einstein’s original model.

**Stochastic processes**

A stochastic process is a family of random variables X(t) indexed by a parameter t, which usually takes values in the discrete set Τ = {0, 1, 2,…} or the continuous set Τ = [0, +∞). In many cases t represents time, and X(t) is a random variable observed at time t. Examples are the Poisson process, the Brownian motion process, and the Ornstein-Uhlenbeck process described in the preceding section. Considered as a totality, the family of random variables {X(t), t ∊ Τ} constitutes a “random function.”

**Stationary processes**

The mathematical theory of stochastic processes attempts to define classes of processes for which a unified theory can be developed. The most important classes are stationary processes and Markov processes. A stochastic process is called stationary if, for all n, t1 < t2 <⋯< tn, and h > 0, the joint distribution of X(t1 + h),…, X(tn + h) does not depend on h. This means that in effect there is no origin on the time axis; the stochastic behaviour of a stationary process is the same no matter when the process is observed. A sequence of independent identically distributed random variables is an example of a stationary process. A rather different example is defined as follows: U(0) is uniformly distributed on [0, 1]; for each t = 1, 2,…, U(t) = 2U(t − 1) if U(t − 1) ≤ 1/2, and U(t) = 2U(t − 1) − 1 if U(t − 1) > 1/2. The marginal distributions of U(t), t = 0, 1,… are uniformly distributed on [0, 1], but, in contrast to the case of independent identically distributed random variables, the entire sequence can be predicted from knowledge of U(0). A third example of a stationary process is

where the Ys and Zs are independent normally distributed random variables with mean 0 and unit variance, and the cs and θs are constants. Processes of this kind can be useful in modeling seasonal or approximately periodic phenomena.

A remarkable generalization of the strong law of large numbers is the ergodic theorem: if X(t), t = 0, 1,… for the discrete case or 0 ≤ t < ∞ for the continuous case, is a stationary process such that E[X(0)] is finite, then with probability 1 the average

if t is continuous, converges to a limit as s → ∞. In the special case that t is discrete and the Xs are independent and identically distributed, the strong law of large numbers is also applicable and shows that the limit must equal E{X(0)}. However, the example that X(0) is an arbitrary random variable and X(t) ≡ X(0) for all t > 0 shows that this cannot be true in general. The limit does equal E{X(0)} under an additional rather technical assumption to the effect that there is no subset of the state space, having probability strictly between 0 and 1, in which the process can get stuck and never escape. This assumption is not fulfilled by the example X(t) ≡ X(0) for all t, which gets stuck immediately at its initial value. It is satisfied by the sequence U(t) defined above, so by the ergodic theorem the average of these variables converges to 1/2 with probability 1. The ergodic theorem was first conjectured by the American chemist J. Willard Gibbs in the early 1900s in the context of statistical mechanics and was proved in a corrected, abstract formulation by the American mathematician George David Birkhoff in 1931.

**Markovian processes**

A stochastic process is called Markovian (after the Russian mathematician Andrey Andreyevich Markov) if at any time t the conditional probability of an arbitrary future event given the entire past of the process—i.e., given X(s) for all s ≤ t—equals the conditional probability of that future event given only X(t). Thus, in order to make a probabilistic statement about the future behaviour of a Markov process, it is no more helpful to know the entire history of the process than it is to know only its current state. The conditional distribution of X(t + h) given X(t) is called the transition probability of the process. If this conditional distribution does not depend on t, the process is said to have “stationary” transition probabilities. A Markov process with stationary transition probabilities may or may not be a stationary process in the sense of the preceding paragraph. If Y1, Y2,… are independent random variables and X(t) = Y1 +⋯+ Yt, the stochastic process X(t) is a Markov process. Given X(t) = x, the conditional probability that X(t + h) belongs to an interval (a, b) is just the probability that Yt + 1 +⋯+ Yt + h belongs to the translated interval (a − x, b − x); and because of independence this conditional probability would be the same if the values of X(1),…, X(t − 1) were also given. If the Ys are identically distributed as well as independent, this transition probability does not depend on t, and then X(t) is a Markov process with stationary transition probabilities. Sometimes X(t) is called a random walk, but this terminology is not completely standard. Since both the Poisson process and Brownian motion are created from random walks by simple limiting processes, they, too, are Markov processes with stationary transition probabilities. The Ornstein-Uhlenbeck process defined as the solution (19) to the stochastic differential equation (18) is also a Markov process with stationary transition probabilities.

The Ornstein-Uhlenbeck process and many other Markov processes with stationary transition probabilities behave like stationary processes as t → ∞. Roughly speaking, the conditional distribution of X(t) given X(0) = x converges as t → ∞ to a distribution, called the stationary distribution, that does not depend on the starting value X(0) = x. Moreover, with probability 1, the proportion of time the process spends in any subset of its state space converges to the stationary probability of that set; and, if X(0) is given the stationary distribution to begin with, the process becomes a stationary process. The Ornstein-Uhlenbeck process defined in equation (19) is stationary if V(0) has a normal distribution with mean 0 and variance σ2/(2mf).

At another extreme are absorbing processes. An example is the Markov process describing Peter’s fortune during the game of gambler’s ruin. The process is absorbed whenever either Peter or Paul is ruined. Questions of interest involve the probability of being absorbed in one state rather than another and the distribution of the time until absorption occurs. Some additional examples of stochastic processes follow.

**The Ehrenfest model of diffusion**

The Ehrenfest model of diffusion (named after the Austrian Dutch physicist Paul Ehrenfest) was proposed in the early 1900s in order to illuminate the statistical interpretation of the second law of thermodynamics, that the entropy of a closed system can only increase. Suppose N molecules of a gas are in a rectangular container divided into two equal parts by a permeable membrane. The state of the system at time t is X(t), the number of molecules on the left-hand side of the membrane. At each time t = 1, 2,… a molecule is chosen at random (i.e., each molecule has probability 1/N to be chosen) and is moved from its present location to the other side of the membrane. Hence, the system evolves according to the transition probability p(i, j) = P{X(t + 1) = j|X(t) = i}, where

The long run behaviour of the Ehrenfest process can be inferred from general theorems about Markov processes in discrete time with discrete state space and stationary transition probabilities. Let T(j) denote the first time t ≥ 1 such that X(t) = j and set T(j) = ∞ if X(t) ≠ j for all t. Assume that for all states i and j it is possible for the process to go from i to j in some number of steps—i.e., P{T(j) < ∞|X(0) = i} > 0. If the equations

have a solution Q(j) that is a probability distribution—i.e., Q(j) ≥ 0, and ∑Q(j) = 1—then that solution is unique and is the stationary distribution of the process. Moreover, Q(j) = 1/E{T(j)|X(0) = j}; and, for any initial state j, the proportion of time t that X(t) = i converges with probability 1 to Q(i).

For the special case of the Ehrenfest process, assume that N is large and X(0) = 0. According to the deterministic prediction of the second law of thermodynamics, the entropy of this system can only increase, which means that X(t) will steadily increase until half the molecules are on each side of the membrane. Indeed, according to the stochastic model described above, there is overwhelming probability that X(t) does increase initially. However, because of random fluctuations, the system occasionally moves from configurations having large entropy to those of smaller entropy and eventually even returns to its starting state, in defiance of the second law of thermodynamics.

The accepted resolution of this contradiction is that the length of time such a system must operate in order that an observable decrease of entropy may occur is so enormously long that a decrease could never be verified experimentally. To consider only the most extreme case, let T denote the first time t ≥ 1 at which X(t) = 0—i.e., the time of first return to the starting configuration having all molecules on the right-hand side of the membrane. It can be verified by substitution in equation (20) that the stationary distribution of the Ehrenfest model is the binomial distribution

and hence E(T) = 2N. For example, if N is only 100 and transitions occur at the rate of 106 per second, E(T) is of the order of 1015 years. Hence, on the macroscopic scale, on which experimental measurements can be made, the second law of thermodynamics holds.

**The symmetric random walk**

A Markov process that behaves in quite different and surprising ways is the symmetric random walk. A particle occupies a point with integer coordinates in d-dimensional Euclidean space. At each time t = 1, 2,… it moves from its present location to one of its 2d nearest neighbours with equal probabilities 1/(2d), independently of its past moves. For d = 1 this corresponds to moving a step to the right or left according to the outcome of tossing a fair coin. It may be shown that for d = 1 or 2 the particle returns with probability 1 to its initial position and hence to every possible position infinitely many times, if the random walk continues indefinitely. In three or more dimensions, at any time t the number of possible steps that increase the distance of the particle from the origin is much larger than the number decreasing the distance, with the result that the particle eventually moves away from the origin and never returns. Even in one or two dimensions, although the particle eventually returns to its initial position, the expected waiting time until it returns is infinite, there is no stationary distribution, and the proportion of time the particle spends in any state converges to 0!

**Queuing models**

The simplest service system is a single-server queue, where customers arrive, wait their turn, are served by a single server, and depart. Related stochastic processes are the waiting time of the nth customer and the number of customers in the queue at time t. For example, suppose that customers arrive at times 0 = T0 < T1 < T2 <⋯ and wait in a queue until their turn. Let Vn denote the service time required by the nth customer, n = 0, 1, 2,…, and set Un = Tn − Tn − 1. The waiting time, Wn, of the nth customer satisfies the relation W0 = 0 and, for n ≥ 1, Wn = max(0, Wn − 1 + Vn − 1 − Un). To see this, observe that the nth customer must wait for the same length of time as the (n − 1)th customer plus the service time of the (n − 1)th customer minus the time between the arrival of the (n − 1)th and nth customer, during which the (n − 1)th customer is already waiting but the nth customer is not. An exception occurs if this quantity is negative, and then the waiting time of the nth customer is 0. Various assumptions can be made about the input and service mechanisms. One possibility is that customers arrive according to a Poisson process and their service times are independent, identically distributed random variables that are also independent of the arrival process. Then, in terms of Yn = Vn − 1 − Un, which are independent, identically distributed random variables, the recursive relation defining Wn becomes Wn = max(0, Wn − 1 + Yn). This process is a Markov process. It is often called a random walk with reflecting barrier at 0, because it behaves like a random walk whenever it is positive and is pushed up to be equal to 0 whenever it tries to become negative. Quantities of interest are the mean and variance of the waiting time of the nth customer and, since these are very difficult to determine exactly, the mean and variance of the stationary distribution. More realistic queuing models try to accommodate systems with several servers and different classes of customers, who are served according to certain priorities. In most cases it is impossible to give a mathematical analysis of the system, which must be simulated on a computer in order to obtain numerical results. The insights gained from theoretical analysis of simple cases can be helpful in performing these simulations. Queuing theory had its origins in attempts to understand traffic in telephone systems. Present-day research is stimulated, among other things, by problems associated with multiple-user computer systems.

Reflecting barriers arise in other problems as well. For example, if B(t) denotes Brownian motion, then X(t) = B(t) + ct is called Brownian motion with drift c. This model is appropriate for Brownian motion of a particle under the influence of a constant force field such as gravity. One can add a reflecting barrier at 0 to account for reflections of the Brownian particle off the bottom of its container. The result is a model for sedimentation, which for c < 0 in the steady state as t → ∞ gives a statistical derivation of the law of pressure as a function of depth in an isothermal atmosphere. Just as ordinary Brownian motion can be obtained as the limit of a rescaled random walk as the number of steps becomes very large and the size of individual steps small, Brownian motion with a reflecting barrier at 0 can be obtained as the limit of a rescaled random walk with reflection at 0. In this way, Brownian motion with a reflecting barrier plays a role in the analysis of queuing systems. In fact, in modern probability theory one of the most important uses of Brownian motion and other diffusion processes is as approximations to more complicated stochastic processes. The exact mathematical description of these approximations gives remarkable generalizations of the central limit theorem from sequences of random variables to sequences of random functions.

**Insurance risk theory**

The ruin problem of insurance risk theory is closely related to the problem of gambler’s ruin described earlier and, rather surprisingly, to the single-server queue as well. Suppose the amount of capital at time t in one portfolio of an insurance company is denoted by X(t). Initially X(0) = x > 0. During each unit of time, the portfolio receives an amount c > 0 in premiums. At random times claims are made against the insurance company, which must pay the amount Vn > 0 to settle the nth claim. If N(t) denotes the number of claims made in time t, then

provided that this quantity has been positive at all earlier times s < t. At the first time X(t) becomes negative, however, the portfolio is ruined. A principal problem of insurance risk theory is to find the probability of ultimate ruin. If one imagines that the problem of gambler’s ruin is modified so that Peter’s opponent has an infinite amount of capital and can never be ruined, then the probability that Peter is ultimately ruined is similar to the ruin probability of insurance risk theory. In fact, with the artificial assumptions that (i) c = 1, (ii) time proceeds by discrete units, say t = 1, 2,…, (iii) Vn is identically equal to 2 for all n, and (iv) at each time t a claim occurs with probability p or does not occur with probability q independently of what occurs at other times, then the process X(t) is the same stochastic process as Peter’s fortune, which is absorbed if it ever reaches the state 0. The probability of Peter’s ultimate ruin against an infinitely rich adversary is easily obtained by taking the limit of equation (6) as m → ∞. The answer is (q/p)x if p > q—i.e., the game is favourable to Peter—and 1 if p ≤ q. More interesting assumptions for the insurance risk problem are that the number of claims N(t) is a Poisson process and the sizes of the claims V1, V2,… are independent, identically distributed positive random variables. Rather surprisingly, under these assumptions the probability of ultimate ruin as a function of the initial fortune x is exactly the same as the stationary probability that the waiting time in the single-server queue with Poisson input exceeds x. Unfortunately, neither problem is easy to solve exactly, although there is a very good approximate solution originally derived by the Swedish mathematician Harald Cramér.

**Martingale theory**

As a final example, it seems appropriate to mention one of the dominant ideas of modern probability theory, which at the same time springs directly from the relation of probability to games of chance. Suppose that X1, X2,… is any stochastic process and, for each n = 0, 1,…, fn = fn(X1,…, Xn) is a (Borel-measurable) function of the indicated observations. The new stochastic process fn is called a martingale if E(fn|X1,…, Xn − 1) = fn − 1 for every value of n > 0 and all values of X1,…, Xn − 1. If the sequence of Xs are outcomes in successive trials of a game of chance and fn is the fortune of a gambler after the nth trial, then the martingale condition says that the game is absolutely fair in the sense that, no matter what the past history of the game, the gambler’s conditional expected fortune after one more trial is exactly equal to his present fortune. For example, let X0 = x, and for n ≥ 1 let Xn equal 1 or −1 according as a coin having probability p of heads and q = 1 − p of tails turns up heads or tails on the nth toss. Let Sn = X0 +⋯+ Xn. Then fn = Sn − n(p − q) and fn = (q/p)Sn are martingales. One of the basic results of martingale theory is that, if the gambler is free to quit the game at any time using any strategy whatever, provided only that this strategy does not foresee the future, then the game remains fair. This means that, if N denotes the stopping time at which the gambler’s strategy tells him to quit the game, so that his final fortune is fN, then

Strictly speaking, this result is not true without some additional conditions that must be verified for any particular application. To see how efficiently it works, consider once again the problem of gambler’s ruin and let N be the first value of n such that Sn = 0 or m; i.e., N denotes the random time at which ruin first occurs and the game ends. In the case p = 1/2, application of equation (21) to the martingale fn = Sn, together with the observation that fN = either 0 or m, yields the equalities x = f0 = E(fN|f0 = x) = m[1 − Q(x)], which can be immediately solved to give the answer in equation (6). For p ≠ 1/2, one uses the martingale fn = (q/p)Sn and similar reasoning to obtain

from which the first equation in (6) easily follows. The expected duration of the game is obtained by a similar argument.

A particularly beautiful and important result is the martingale convergence theorem, which implies that a nonnegative martingale converges with probability 1 as n → ∞. This means that, if a gambler’s successive fortunes form a (nonnegative) martingale, they cannot continue to fluctuate indefinitely but must approach some limiting value.

Basic martingale theory and many of its applications were developed by the American mathematician Joseph Leo Doob during the 1940s and ’50s following some earlier results due to Paul Lévy. Subsequently it has become one of the most powerful tools available to study stochastic processes.