Mr. Smith walks with his dog every morning; those neighbors who can see the street where they walk are used to seeing them around 7:30 am. Ask any of the neighbors what the chances are that Mr. Smith and his dog will be walking down the street about 7:30 tomorrow morning, and they’ll say it’s practically 100% certain.
Suppose, however, that sometime before dawn, the neighborhood experiences a power outage of several hours, so Mr. Smith’s alarm clock doesn’t go off as usual in the morning. The probability of seeing Smith and dog around 7:30 is, in this situation, appreciably less.
This example shows how the probability of any event, like Mr. Smith’s morning walk occurring on schedule, is not absolute, but is relative to some particular set of information. Given only the information in our opening sentence plus other common knowledge, and not the fact that tomorrow morning won’t be like most other recent mornings, the probability that Mr. Smith will be walking his dog around 7:30 as usual is a high number near 100%. In a standard symbolism, the walk can be represented by W, the common information that tells Mr. Smith’s neighbors that he’s normally almost certain to walk his dog at 7:30 can be represented as I, and p(W|I) stands for the probability (“p”) that the walk occurs (“W”), given (“|”) the common information (“I”). But someone aware of the power outage would also have the further information O that the outage happened. The probability of Smith’s walking on schedule given that information, p(W|IO), is lower than the near-100% figure p(W|I).
Put another way, there really is no such thing as “the” probability of anything; all probabilities depend on whatever information is given. If we are given information J instead of information I, the probability of an event E is p(E|J) instead of p(E|I), and p(E|J) may or may not be equal to p(E|I). Instead of “the probability of event E”, what we actually have is “the probability of E given information I” or “information J”.
The mathematical relation between p(W|I) and p(W|IO)—or more generally, how the truth of a proposition (like W, “Smith and his dog will walk by at 7:30 tomorrow morning”) can be more probable or less probable with additional given information (e.g., information I vs. information I and O)—is a simple one, but it has far-reaching consequences for inferring how likely anything is, given some relevant information about it. Such inferences can tell us a great deal about important things that we can’t observe directly, as several reports available through OSTI’s SciTech Connect demonstrate. Other recent reports deal with this simple relation’s more subtle mathematical implications.
We can understand how the different probabilities relate using our dog-walking and power-outage example, and looking at the probability that the two different events, a power outage O and Smith’s 7:30 am dog walk W, occur in the same morning, given the general background information I represented by ordinary common information plus our first sentence above. In symbols, the joint probability of W and O is written p(WO|I).
The probability, given the background information I, that two different propositions are both true is equal to the probability that one is true, given the other (and I), times the probability that the other proposition is true (also given I). In our example, the probability p(WO|I) that Smith walks his dog on schedule and that a power outage also occurred is equal to the product of two probabilities—the probability p(W|IO) that the walk occurs, given our information about what normally happens (I) plus the fact that a power outage did happen (O), times the probability p(O|I) of the outage itself given I:
p(WO|I) = p(W|IO) p(O|I).
If the probability p(O|I) of an outage is low, and if Smith is unlikely to walk his dog on his usual schedule if the power went out (i.e., p(W|IO) is low), then the probability p(WO|I) that Smith is walking his dog at 7:30 despite the power outage is very low.
As far as the mathematics is concerned, it doesn’t matter which proposition is “the one” and which “the other”; we could equally well talk about the probability p(O|IW) that a power outage does occur overnight, given that Mr. Smith and his dog do take their walk around 7:30 am. We then have
p(WO|I) = p(O|IW) p(W|I).
As we noted at first, p(W|I) is nearly 100%. p(O|IW) is the probability that, given W (that Mr. Smith and his dog do take their 7:30 walk) and common additional information I, there was a power outage last night—very unlikely, as we’ve said; seeing Mr. Smith walking his dog as usual suggests that no power outage interfered with Mr. Smith waking up on time.
Since both p(W|IO) p(O|I) and p(O|IW) p(W|I) equal p(WO|I), they equal each other:
p(W|IO) p(O|I) = p(O|IW) p(W|I),
which shows us the relation between p(W|IO) and p(W|I)—the probabilities that Smith and his dog take a walk, assuming different sets of given information. This relation implies that if an outage would make the walk less likely—i.e., if p(W|IO) is less than p(W|I)—then p(O|IW) is also less than p(O|I): however unlikely it is in general that a long power outage occurred last night, seeing Mr. Smith and his dog walking on schedule means that a long outage is even less likely to have happened.
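The relation above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the three input probabilities are invented for the example and do not come from any data about Mr. Smith, and the variable names are likewise chosen here.

```python
# Illustrative numbers only; none of them come from the article.
p_W_given_I = 0.98    # walk on schedule, given background information I
p_O_given_I = 0.02    # long overnight outage, given I
p_W_given_IO = 0.30   # walk on schedule, given I and the outage O

# Product rule: p(WO|I) = p(W|IO) p(O|I)
p_WO_given_I = p_W_given_IO * p_O_given_I

# The same joint probability also equals p(O|IW) p(W|I), so solve for p(O|IW).
p_O_given_IW = p_WO_given_I / p_W_given_I

print(f"p(WO|I) = {p_WO_given_I:.4f}")
print(f"p(O|IW) = {p_O_given_IW:.4f}  vs  p(O|I) = {p_O_given_I:.4f}")
```

With these assumed inputs, p(W|IO) is less than p(W|I), and the computed p(O|IW) comes out less than p(O|I), as the relation requires.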
We can generalize our mathematical statement beyond the example of dog walks and power outages to any pairs of assertions. Thus the probability of any one assertion conditional on another, times the probability of the other, equals the probability of the other assertion conditional on the first, times the probability of the first (all probabilities also being conditional on the assumed common background information). The assertions could be about anything at all, not just power outages and dog walks; the probabilities described have the same relation, though the actual percentages that express the probabilities may be different.
More generally and more interestingly, one assertion could be that a certain hypothesis H is true, while the other “assertion” could be a set of data D that has some relevance for the truth or falsehood of the hypothesis. Replacing W by the hypothesis H, and O by the set of relevant data D, we get
p(H|ID) p(D|I) = p(D|IH) p(H|I),
and dividing both sides by the probability p(D|I), the probability that we would observe data D in any case, regardless of whether hypothesis H is true, we get
p(H|ID) = [p(D|IH) / p(D|I)] p(H|I).
One thing that makes this interesting and useful is the fact that in many situations, the ratio [p(D|IH) / p(D|I)] can be calculated exactly for any given data set. If the ratio is greater than 1, p(H|ID) is greater than p(H|I), meaning that data D provides evidence in favor of hypothesis H being true. If the ratio is less than 1, p(H|ID) is less than p(H|I), meaning that the hypothesis is less probable with the data than without it. So if the ratio [p(D|IH) / p(D|I)] can be calculated for any particular data D, one can quantify how much evidence D provides for or against the hypothesis.
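As a rough illustration of how the ratio p(D|IH) / p(D|I) works as a measure of evidence, the following sketch treats the simplest case of a hypothesis H and its negation. It assumes that p(D|I) can be expanded as p(D|IH) p(H|I) + p(D|I, not H) p(not H|I), which is the standard law of total probability rather than something stated in the text above. The function name and all numbers are hypothetical.

```python
def bayes_update(prior_h, like_h, like_not_h):
    """Return (ratio, posterior) for hypothesis H after seeing data D.

    prior_h     -- p(H|I)
    like_h      -- p(D|IH)
    like_not_h  -- p(D|I, not H)
    """
    # p(D|I) by the law of total probability over H and not-H.
    evidence = like_h * prior_h + like_not_h * (1.0 - prior_h)
    ratio = like_h / evidence          # p(D|IH) / p(D|I)
    return ratio, ratio * prior_h      # posterior p(H|ID)

# Data that H predicts better than not-H: ratio > 1, so H becomes more probable.
ratio_for, post_for = bayes_update(prior_h=0.2, like_h=0.9, like_not_h=0.3)

# Data that H predicts worse than not-H: ratio < 1, so H becomes less probable.
ratio_against, post_against = bayes_update(prior_h=0.2, like_h=0.1, like_not_h=0.5)

print(ratio_for, post_for)
print(ratio_against, post_against)
```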
This simple algebraic result is known as Bayes’ theorem[Wikipedia] after the 18th-century mathematician Thomas Bayes,[Wikipedia] who demonstrated a special case of it. The theorem’s importance isn’t due to any algebraic or conceptual complexity, but to its combination of initial unobviousness with usefulness.
The different reports described here each deal with calculating probabilities of some significant hypotheses H given specific kinds of data D. In particular, these reports deal with evaluating hypotheses about things we can’t observe directly, but can only infer from somewhat ambiguous data. Each report is distinguished by addressing different problems using different mathematical methods, some of which involve different ways to arrive at the ratio p(D|IH) / p(D|I) that shows how many times more likely (or less likely) hypothesis H is in view of data D.
“Introduction to Bayesian methods in macromolecular crystallography”,[SciTech Connect] from Los Alamos National Laboratory is actually a set of slides for a tutorial on the use of Bayes’ theorem in deducing the structures of crystals. Here, H represents a hypothesis about what kind of atoms the crystal is made of and how they’re arranged. A crystal with a given atomic structure tends to scatter x-rays into some directions more intensely than in others; atoms of different species, differently arranged, will generally produce different scattering patterns.[Wikipedia] The slides use the testing of a very simple hypothesis as an example to introduce Bayes’ theorem, and then provide illustrations to clarify the meaning of the theorem’s different probabilities, using a common notation for probabilities that leaves the background information I implicit by omitting it.[Note] Further lessons, some illustrated in the slides and some only listed, deal with testing hypotheses about one variable when both it and another variable affect the data D, handling D when it’s affected by systematic error, and applying Bayesian analysis to measurement problems generally and to crystallographic problems in particular. One listed lesson is a class demonstration and discussion of the Phenix software suite, which uses Bayesian inference for automated determination of macromolecular structures.
The crystallography tutorial slides, clearly addressed to students already familiar with crystallography, don’t explain much detail about methods for analyzing crystals, but simply concentrate on how Bayes’ theorem can be used with them. The Lawrence Livermore National Laboratory report “Sequential Threat Detection for Harbor Defense: An X-ray Physics-Based Bayesian Approach”[SciTech Connect] describes another use of x-rays in more detail, in which x-ray detectors are used to see the effect that items of cargo have on the x-rays used to scan them. Different materials will affect the x-rays differently: objects that constitute a threat, like explosives, will absorb, scatter, or transmit the x-rays in ways that have only partial resemblance to the way nonthreatening objects absorb, scatter, or transmit them. How each detector responds to any x-rays that reach it from the object will indicate a probability that the object does or does not constitute a threat. Each x-ray detector’s response is affected by random influences from its environment and from within the detector itself; this random “noise” has to be accounted for in calculating the probability that the detectors’ responses indicate a real threat or its absence.
In the ratio p(D|IH) / p(D|I) relevant to the threat-detection problem, data D represents the output read from all the x-ray detectors, information I includes the physical laws governing how materials interact with x-rays and how x-ray detectors behave, and H stands for either of the two hypotheses T (“this scanned item of cargo represents a security threat”) or N (“this item is not a security threat”). The calculation worked out in the report breaks the probability p(D|IT) down into different functions p(d_k|D_{k-1}TI), each being the probability that the kth x-ray detector will give a particular response d_k if (a) the first k-1 detectors yield data D_{k-1}, (b) the scanned cargo item is a threat, and (c) the information I is accurate. The probability p(D|IN) is similarly broken down into functions p(d_k|D_{k-1}NI). The individual functions p(d_k|D_{k-1}TI) and p(d_k|D_{k-1}NI) are worked out, and (in a further use of Bayes’ theorem) the probabilities p(D|IT) and p(D|IN) are deduced, from which the probability that any given piece of scanned cargo does or does not constitute a security threat based on data set D is calculated.
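The detector-by-detector breakdown can be sketched in a few lines. This is not the report's physics-based model: it assumes, purely for illustration, that each detector's reading is independent of the earlier readings once the hypothesis is fixed, and that the noise is Gaussian. The function names, readings, means, noise level, and prior are all hypothetical.

```python
import math

def gaussian_pdf(x, mean, sigma):
    """Density of a normal distribution, used here as a toy detector-noise model."""
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def sequential_threat_probability(readings, prior_threat, mean_threat, mean_benign, sigma):
    """Return the probability of a threat after each successive detector reading."""
    p_threat = prior_threat
    history = []
    for d_k in readings:
        like_t = gaussian_pdf(d_k, mean_threat, sigma)  # stand-in for p(d_k|D_{k-1}TI)
        like_n = gaussian_pdf(d_k, mean_benign, sigma)  # stand-in for p(d_k|D_{k-1}NI)
        # Bayes' theorem with T and N as the only two hypotheses.
        p_threat = like_t * p_threat / (like_t * p_threat + like_n * (1.0 - p_threat))
        history.append(p_threat)
    return history

readings = [1.8, 2.3, 1.9, 2.4]  # hypothetical detector outputs
history = sequential_threat_probability(
    readings, prior_threat=0.01, mean_threat=2.0, mean_benign=1.0, sigma=0.5
)
for k, p in enumerate(history, start=1):
    print(f"after detector {k}: p(T|data so far) = {p:.4f}")
```

Updating after each reading gives the same final probability as multiplying all the individual likelihoods first, which is why the breakdown into per-detector functions is convenient.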
While the main question addressed by the cargo scans is binary (threat, or not?), other questions involve parameters whose values might be anywhere in a continuous range. How moisture is distributed within a plot of soil above the water table[Wikipedia, Wikipedia] is a question of this type, whose answer is crucial for predicting such things as crop yields, irrigation effects, weather, flooding, and contaminant transport.
Since we can’t directly observe the soil moisture distribution, we can only hypothesize what it is using indirect data D, and try to collect enough of that data so the correct hypothesis H is the only hypothesis that has a high probability p(H|DI). One relevant type of indirect data comes from measuring how long it takes ground-penetrating radar signals to travel through the soil between different places. The travel times are largely but not solely affected by the presence of moisture in the soil, so they give partial information about the groundwater distribution. How to get that information from the radar travel times is the subject of the report “Entropy-Bayesian Inversion of Time-Lapse Tomographic GPR data for Monitoring Dielectric Permittivity and Soil Moisture Variations” [SciTech Connect, SciTech Connect] by researchers from Pacific Northwest National Laboratory, SUNY Buffalo, and Lawrence Berkeley National Laboratory. Since different hypothetical moisture distributions H depend on continuous variables just like the different radar readings D, the ratio p(D|IH) / p(D|I) is calculated based on a theorem in addition to Bayes’—one whose basic point is that the information we still lack about which hypothesis H is true is represented by a function of the probabilities p(H|DI) for the different possible H given our data D and information I. So the ratio p(D|IH) / p(D|I) that accurately represents what we learn about the possible moisture distributions H from our radar data D will make the size of the “missing information” function, known as the information entropy,[Wikipedia] as large as possible.
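For readers unfamiliar with information entropy, a minimal sketch of the quantity may help. It uses the discrete Shannon form, the negative sum of p ln p, as a stand-in; the report itself deals with continuous variables, and the three example distributions below are invented.

```python
import math

def information_entropy(probs):
    """Shannon entropy, -sum(p ln p) in nats, of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

uniform = [0.25, 0.25, 0.25, 0.25]  # nothing known about which hypothesis is true
peaked = [0.85, 0.05, 0.05, 0.05]   # one hypothesis strongly favored
certain = [1.0, 0.0, 0.0, 0.0]      # no information missing at all

for name, dist in (("uniform", uniform), ("peaked", peaked), ("certain", certain)):
    print(f"{name}: entropy = {information_entropy(dist):.4f}")
```

The entropy is largest when the probabilities are spread evenly and falls to zero when one hypothesis is certain, which is the sense in which it measures missing information.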
Sometimes the main problem in calculating p(H|DI) = [p(D|IH) / p(D|I)] p(H|I) is the complexity of the probability functions involved. This is the case when one is interested in predicting where soil contaminants are most likely to turn up, and what their concentrations will be, both of which depend on several parameters like the rates at which the contaminants are produced or eliminated by chemical reactions. Each possible set of values for these parameters can be thought of as a different hypothesis H, whose probabilities depend on given measurement data D and other relevant information I like the laws of physics and chemistry. Calculating the probabilities to see which sets of parameters (or which hypotheses H) are the most likely, and thereby determine how much of each contaminant is likely to appear where, and when, requires a great deal of computer processing time, so there’s much incentive to find less computer-intensive methods for calculating those probabilities. One such method is presented in the report “An adaptive sparse-grid high-order stochastic collocation method for Bayesian inference in groundwater reactive transport modeling” [SciTech Connect] by researchers from Oak Ridge National Laboratory and Florida State University. The basic idea is to approximate the actual, more complicated function p(D|IH) p(H|I) by piecing together simpler cubic polynomial functions, of the form (a - θ)(b - θ)(c - θ) with a, b, and c being sets of constants, that equal p(D|IH) p(H|I) at selected values of the parameter sets θ, with the selected θ being concentrated among the most likely possibilities. The cubic functions’ constants a, b, and c are chosen so that even between the selected θ, the approximate surrogate functions (a - θ)(b - θ)(c - θ) are close to p(D|IH) p(H|I).
Since calculations involving the surrogate functions require less computer power than calculations with p(D|IH) p(H|I) itself, values of θ for which the surrogate functions, rather than p(D|IH) p(H|I), are high are assumed to be likely values of θ. Those values, when used in the equations that describe processes involving the soil contaminants, represent ways that the contaminants are likely to react and be transported through the soil.
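The surrogate idea can be illustrated in one dimension with a single cubic. This is a heavily simplified sketch, not the report's adaptive sparse-grid method. The "expensive" function is a cheap Gaussian bump standing in for p(D|IH) p(H|I), the parameter is a single number called theta here, the four nodes are picked by hand, and the cubic is written in Lagrange form rather than the factored form mentioned above.

```python
import math

def unnormalized_posterior(theta):
    """Stand-in for an expensive p(D|IH) p(H|I); here just a Gaussian bump."""
    return math.exp(-0.5 * ((theta - 1.0) / 0.4) ** 2)

def cubic_through(nodes, values):
    """Return the cubic polynomial (Lagrange form) matching `values` at four `nodes`."""
    def surrogate(theta):
        total = 0.0
        for i, (x_i, y_i) in enumerate(zip(nodes, values)):
            term = y_i
            for j, x_j in enumerate(nodes):
                if j != i:
                    term *= (theta - x_j) / (x_i - x_j)
            total += term
        return total
    return surrogate

# Nodes concentrated where the function is large (hand-picked for this example).
nodes = [0.7, 0.9, 1.1, 1.3]
values = [unnormalized_posterior(x) for x in nodes]  # the only "expensive" calls
surrogate = cubic_through(nodes, values)

for theta in (0.8, 1.0, 1.2):
    print(theta, unnormalized_posterior(theta), surrogate(theta))
```

The surrogate agrees with the original function at the nodes and stays close to it in between, so it can be evaluated many times in place of the costly function.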
The problem addressed by “An overview of component qualification using Bayesian statistics and energy methods”[SciTech Connect] from Sandia National Laboratories is how to determine the reliability and safety of weapon-system components under conditions more severe than the ones imposed during tests. Quantities of interest, such as how adversely the components are affected, or the mean time to component failure, are assumed to depend on two unknown parameters b and h, which are to be estimated from how the component responds to such things as stresses, shocks, and vibrations. One of the report’s purposes is to compare the usefulness of two different kinds of data for reliability and safety estimates: how a component is accelerated during the tests versus how much energy it absorbs. But the report also makes certain points about the probability functions p(H|ID), [p(D|IH) / p(D|I)], and p(H|I) relevant to the estimates that show how significant data can be. First, the probability p(H|I), that any given hypothesis about the values of b and h is true prior to acquiring data, is almost never an exactly known function for the type of problem this report deals with. In other words, the exact way a probability depends on H and I alone in such problems is seldom clear. However, with enough data D, the factor [p(D|IH) / p(D|I)] is so much higher or lower than p(H|I) for most hypotheses H that calculating p(H|ID) = [p(D|IH) / p(D|I)] p(H|I) from almost any reasonable-looking p(H|I) would yield an accurate estimate of p(H|ID).
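The point that enough data makes the choice of prior nearly irrelevant can be seen in a toy calculation. The sketch below does not use the report's parameters b and h. It assumes a single unknown success probability called theta, binomial data, and two hand-picked priors (one flat, one skewed toward small theta), and it compares the posterior means on a grid.

```python
import math

def posterior_mean(log_prior, successes, trials, grid_size=2000):
    """Posterior mean of theta on a grid, for binomial data and a given log-prior."""
    thetas = [(i + 0.5) / grid_size for i in range(grid_size)]
    log_w = [
        log_prior(t) + successes * math.log(t) + (trials - successes) * math.log(1.0 - t)
        for t in thetas
    ]
    top = max(log_w)  # subtract the maximum to avoid underflow
    weights = [math.exp(lw - top) for lw in log_w]
    return sum(t * w for t, w in zip(thetas, weights)) / sum(weights)

def flat_prior(theta):
    return 0.0                         # log of a constant prior

def skewed_prior(theta):
    return 3.0 * math.log(1.0 - theta)  # log of a prior proportional to (1 - theta)^3

# Same observed success rate (70%), but little data versus a lot of data.
gap_small = abs(posterior_mean(flat_prior, 7, 10) - posterior_mean(skewed_prior, 7, 10))
gap_large = abs(posterior_mean(flat_prior, 700, 1000) - posterior_mean(skewed_prior, 700, 1000))

print(f"difference between priors with 10 trials:   {gap_small:.4f}")
print(f"difference between priors with 1000 trials: {gap_large:.4f}")
```

With ten trials the two priors give noticeably different answers; with a thousand trials the difference all but disappears.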
Some other recent reports, all from Los Alamos National Laboratory, are less about specific uses of Bayesian inference than about types of uses. Another pair of Los Alamos reports explains a flaw in a recent critique of Bayesian inference.
Analyses like the aforementioned computer modeling of groundwater transport are performed to predict how complex physical processes are likely to turn out. Just as that groundwater analysis was designed to improve on earlier models (in this case, by using significantly less computer time to get equally reliable results), new simulations are often proposed to overcome problems with existing ones. Does a new simulation really improve on the older ones? This question is addressed in the report “Using a Simple Binomial Model to Assess Improvement in Predictive Capability: Sequential Bayesian Inference, Hypothesis Testing, and Power Analysis”[SciTech Connect]. One may find that a newer computer code simulates some particular experiment more accurately than an older one, or that the older one simulates the experiment more accurately. Checking both codes against different experiments, one can collect a stream of data about which code was more accurate for each experiment. The probability that the new code will simulate more accurately than the old code for a given percentage of experiments, given that data plus other relevant information, is a simple matter to calculate from Bayes’ theorem. The hypothesis H in question is simply “The new code will simulate a given fraction of all experiments more accurately than the old code”. The report discusses how many simulations one needs to compare to determine that fraction, and how confident one can be about the result.
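A minimal sketch of such a binomial calculation follows. It is not the report's procedure: it assumes a uniform prior on the unknown fraction (called `frac` here, a name chosen for this example), treats the comparisons as independent, and integrates numerically on a grid. The win counts are invented.

```python
import math

def prob_new_code_better(wins, comparisons, grid_size=2000):
    """Probability that frac > 0.5, given the new code won `wins` of `comparisons`.

    Assumes a uniform prior on frac and independent comparisons (both assumptions).
    """
    fracs = [(i + 0.5) / grid_size for i in range(grid_size)]
    log_like = [
        wins * math.log(f) + (comparisons - wins) * math.log(1.0 - f) for f in fracs
    ]
    top = max(log_like)  # subtract the maximum to avoid underflow
    post = [math.exp(value - top) for value in log_like]
    total = sum(post)
    return sum(p for f, p in zip(fracs, post) if f > 0.5) / total

p_few = prob_new_code_better(wins=7, comparisons=10)
p_many = prob_new_code_better(wins=70, comparisons=100)

print(f"7 wins in 10 comparisons:   p(frac > 0.5) = {p_few:.4f}")
print(f"70 wins in 100 comparisons: p(frac > 0.5) = {p_many:.6f}")
```

The same 70% win rate supports a much more confident conclusion after a hundred comparisons than after ten, which is why the number of comparisons matters.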
Such considerations are of interest when one wants to find the single best computer code for modeling something, as one might when the execution of each code requires a lot of computer processing time. The slide presentation “Turning Bayesian model averaging into Bayesian model combination”[SciTech Connect] describes an improvement for a different situation, when multiple computer models can be used concurrently and one wants to take advantage of the strengths of each to overcome any weaknesses in the others. The models described are not process simulators, but automatic classifiers that are trained to categorize what kind of thing something is when given certain data about it.
The standard technique mentioned first in the title, “Bayesian model averaging”, seems at first sight like a proper application of Bayes’ theorem to combining the results of different categorization models, each of which has a probability p(H|DI) that its hypothesis H about how to categorize things is correct. But as the slides indicate, Bayesian model averaging implicitly assumes that the different models’ hypotheses are mutually exclusive, and that exactly one of the models is correct while the rest are wrong. If that assumption is true, Bayesian model averaging is the most accurate way to combine the results of categorizing models, as can be mathematically proven; in effect, Bayesian model averaging selects one model as the best, and does so optimally. Yet experiments show that other model-combination methods classify things more accurately. The presentation describes some simple combination methods that have been found to outperform Bayesian model averaging and the authors’ plans to investigate other methods.
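The tendency of Bayesian model averaging to end up selecting a single model can be seen in a small numerical sketch. It is an illustration under simplifying assumptions, not the presentation's experiments: each of three hypothetical classifiers is summarized by one number, the probability it assigns to the true label of each training example, and the models start with equal prior weights.

```python
import math

def bma_weights(log_likelihoods, priors):
    """Posterior model weights, proportional to prior times likelihood of the data."""
    top = max(log_likelihoods)  # subtract the maximum to avoid underflow
    unnorm = [p * math.exp(ll - top) for ll, p in zip(log_likelihoods, priors)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Hypothetical per-example probability each model gives to the correct label.
per_example_prob = [0.80, 0.78, 0.75]
priors = [1.0 / 3.0] * 3

weights_by_n = {}
for n_examples in (10, 100, 1000):
    log_likes = [n_examples * math.log(q) for q in per_example_prob]
    weights_by_n[n_examples] = bma_weights(log_likes, priors)
    print(n_examples, [round(w, 4) for w in weights_by_n[n_examples]])
```

Even though the three models perform almost equally well, the weight of the slightly better one approaches 1 as the training set grows, so the "average" is effectively a selection.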
Whereas some of the models described in the abovementioned reports take parameters estimated from experimental data as inputs, a slide presentation by researchers from Los Alamos National Laboratory and Argonne National Laboratory, “Bayesian approaches for combining computational model output and physical observations”[SciTech Connect], illustrates the use of simulations as an input for estimating parameters, exemplifying the technique with cosmological-parameter estimation. As with other complex phenomena like groundwater transport, simulating cosmological processes requires a lot of computer processing, so economizing on simulation is also important here. For the investigation presented, the processes are simulated relatively few times for a few sets of parameters, and the output is used to construct an “emulator” to estimate what the simulator would output for any other sets of parameters, a procedure expected to work well if the output doesn’t vary a lot as the parameters change. The slides contain little explanatory text, but reports available through OSTI’s E-print Network by some of the same investigators provide more detail.[E-print Network]
As the reports already described indicate, Bayesian inference represents a useful method of estimating how likely it is that a given hypothesis is true given certain relevant information; in fact, the method can be considered an application of logic where the given information is incomplete or ambiguous.[Reference] As we’ve seen, Bayesian inference involves some estimate of a probability function p(H|I), that is, how the probability of a hypothesis H given the information I varies with changes in the hypothesis H, which is often a hypothesis about the values of some set of parameters θ. There are some sets of information I and hypotheses H (or parameters θ) for which we can calculate p(H|I) exactly, but in other cases we only know how to make a rough estimate of p(H|I) (or p(θ|I)). Even so, calculations of the probabilities of different or related quantities, given information I and any data D acquired, can be made using Bayes’ theorem, and the informativeness of that probability distribution can be assessed. For example, the narrower distribution curve on the left graph of Figure 3 gives us more information than the broader one about the superiority of the new simulation code.
The validity of Bayesian inference was challenged by a demonstration, entitled “Bayesian Brittleness: Why no Bayesian model is ‘good enough’” [Reference], regarding reasonable-looking estimates of p(θ|I) and values of quantities that depend on θ. What one needs for useful estimates of such quantities is for the estimates not to vary by much for different reasonable functions p(θ|I). But the demonstration proved that seemingly similar functions p(θ|I) could lead to functions p(θ|DI) = [p(D|θI) / p(D|I)] p(θ|I) for which θ-dependent quantities that also depend on the data D could have any value within a wide range.
The challenge was met. The demonstration’s premises included a subtle flaw, which is described briefly in the report “Is Bayesian inference ‘brittle’?”[SciTech Connect] and in more detail in the report “Brittleness and Bayesian Inference”[SciTech Connect]. The basic problem was found to be that the set of plausible-looking estimates for the function p(θ|I) includes functions that depend on whatever data D is actually found. Such functions really aren’t estimates of the probability of θ given just the background information I; they’re the kind of function p(θ|DI) one is trying to calculate in the first place. And it turns out that to be sure of finding functions p(θ|DI) = [p(D|θI) / p(D|I)] p(θ|I) that are guaranteed to make quantities depending on θ have high or low values, one needs to know the data D to select a “p(θ|I)” that will work. If one had that data to begin with, it would mean that one’s given information I already contained D, which is contrary to what the function p(θ|I) is supposed to represent, namely, how probable θ is prior to knowledge of any data D. Where I already includes D, the solitary I in Bayes’ theorem can be replaced by ID so that the theorem reads
p(θ|DID) = [p(D|θID) / p(D|ID)] p(θ|ID).
Since the probability of any information D, given D to begin with, is 100%, we have
p(θ|DID) = p(θ|ID),
which means that, under the premises of “Bayesian Brittleness”, the data D provides no new information anyway. Thus nothing is found wrong with how Bayes’ theorem is used in practice, when the prior probability function p(θ|I) doesn’t depend on the data D that one gathers to learn more about θ.
Beyond the use of Bayes’ theorem to calculate initially unknown probabilities from known ones, Bayesian inference involves ideas about how one arrives at any known probabilities in the first place.
Some probabilities, like the probability that some unspecified individual in the general population carries a certain gene, can only be determined from data about how frequently the gene occurs in the population. But in other situations, frequency data is either missing or irrelevant. What is the probability that life has ever existed on Mars? Whatever information we have to make that estimate doesn’t include a large set of frequencies: there’s only one Mars, so we don’t have a large set of them to examine to see how many have ever had life on them. How likely is Halley’s Comet to return to the inner solar system in 2061? Here history does provide centuries of data showing that Halley’s Comet has turned up at regular intervals, but its return in 2061 is far more likely than the frequency of its previous visits alone would indicate: our information about Newton’s laws of motion and gravity is even more pertinent. Bayes’ theorem doesn’t indicate any difference between frequency data and any other kind of information. The relation of probability p(A|BC) of A given B and C to, e.g., the probability p(C|BA) of C given B and A is the same, regardless of what type of information A, B, and C stand for. What is generally meant by “Bayesian inference” includes this concept of information, not just Bayes’ theorem itself, which is an uncontroversial algebraic result.
Mathematically well-defined functions for hypotheses’ “prior probabilities” p(H|I) (i.e., probabilities based on the information I that one has prior to taking any additional data D) have been discovered for some cases; the entropy-based method of “Entropy-Bayesian Inversion of Time-Lapse Tomographic GPR data for Monitoring Dielectric Permittivity and Soil Moisture Variations” [SciTech Connect, SciTech Connect] provides one example. And for other cases, like the one described in “An overview of component qualification using Bayesian statistics and energy methods”[SciTech Connect] for which a mathematical derivation of the probability p(H|I) hasn’t been discovered, even reasonable-looking guesses for p(H|I) can be useful enough, especially if the ratio [p(D|IH) / p(D|I)] makes a lot more difference to p(H|ID) than p(H|I) does.
Differing estimates for p(H|I) can be more problematic when [p(D|IH) / p(D|I)] is not very different from 1, so that the data does little to override the prior estimate. In the absence of data, different people’s estimates of p(H|I) will be based on differing relevant information I, which includes different experiences and different, even partly unconscious and perhaps faulty methods of evaluating how likely things are, and their estimates will only be as similar as their differing information allows. Such estimates of probability are subjective. Some practitioners of Bayesian inference refer to probabilities themselves as subjective quantities, but others note that people may continue to discover methods for objectively quantifying more types of information in probability terms as they have in the past. One of the aforementioned references argues well for the latter point of view.
[Note] The most commonly used notation for probabilities omits explicit reference to background information I that all the probabilities in a calculation are conditional on; thus Bayes’ theorem, for example, would read p(H|D) = [p(D|H) / p(D)] p(H) in this notation. This saves writing and space, but unfortunately obscures the fact that all probabilities are conditional, and inadvertently promotes the misconception that the probabilities of things are absolute instead of depending on whatever background information I is relevant to their truth or falsehood. This article’s other probability equations always note the background information explicitly, but this is not generally the case in the reports being discussed, which is something to be aware of when comparing the reports’ equations with the ones in this article.
These tutorial slides include a set of illustrations (“Visualizing Bayes’ rule”) that may be helpful for understanding the conditional-probability relation p(H|ID) p(D|I) = p(D|IH) p(H|I) (represented slightly differently in the slides) that underlies Bayes’ theorem.
More detail about how the investigators economize on the computational modeling is available in two earlier reports, “Cosmic Calibration” and “Cosmic Calibration: Constraints from the Matter Power Spectrum and the Cosmic Microwave Background”, available through a search of OSTI’s E-print Network.
These lectures, given at the European particle physics laboratory CERN by Glen Cowan of Royal Holloway, University of London, introduce statistics as applied in particle physics, including the use of Bayes’ theorem, to provide all the necessary basics for data analysis at the Large Hadron Collider, the largest particle accelerator at CERN (and in the world). As in many of the reports cited above, these lectures deal with Bayesian inference of hypotheses’ probabilities p(H|DI) from experimental data D and the probabilities p(H|I) of the hypotheses without information from (or “prior to”) the data D. The first lecture describes these prior probabilities as subjective, but the fourth also discusses the calculation of prior probabilities from objective rules.
Prepared by Dr. William N. Watson, Physicist
DoE Office of Scientific and Technical Information