IV.  Conclusion

            Daniel Bernoulli resolved the St. Petersburg Paradox by replacing mathematical expectation with moral expectation.  But Nicholas Bernoulli, who formulated the paradox, never accepted his cousin’s solution, believing that there should be a single fair price for the game.  As Stigler (1950) writes, economists may find it surprising that Nicholas Bernoulli and other eighteenth-century mathematicians believed that the St. Petersburg Paradox could only be “solved” by finding a single price for the game.  Might future economists find it just as peculiar that twentieth-century economists held firmly to EU in the face of the Allais Paradox and other violations?  Stigler’s analysis of the development of utility theory through the beginning of this century leads him to three criteria for successful theories:  generality, congruence with reality (or fit, in our terms), and manageability.  We mention each of these criteria in summarizing the main points of this paper.

            There have been many experimental studies comparing EU with competing theories of decision making under risk. Many of these studies use a similar format: Subjects are given several pairwise choices between lotteries. The lotteries are constructed so that EU implies a certain consistency between choices (for example, picking the riskier gamble from one pair implies picking the riskier gamble in another pair). Various theories can then be cast as predictions about the patterns of choices that should be observed. We note two important features of our approach: First, our goal is to discriminate among theories which attempt to describe actual choices; we have nothing to say about the normative appeal of EU or its generalizations. Second, the generalizability of our results is limited to the extent that naturally occurring choices are different from lotteries with well-specified probabilities of monetary outcomes.
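            To make the pattern representation concrete, consider a minimal sketch (in Python) of two pairwise choices constructed so that EU implies the same choice, safe (S) or risky (R), in both pairs; a theory is then summarized by the set of patterns it permits. The pattern counts and the permitted set attributed to the hypothetical “generalization” below are illustrative placeholders, not data or theory specifications from the studies we analyze.

# Minimal sketch: theories as sets of permitted choice patterns over two
# pairwise choices.  "S" = safer lottery chosen, "R" = riskier lottery chosen.
# All counts and the "generalization" theory are hypothetical.

ALL_PATTERNS = ["SS", "SR", "RS", "RR"]

allowed = {
    "EU":             {"SS", "RR"},           # independence forbids switching
    "generalization": {"SS", "RR", "SR"},     # hypothetical theory permitting one switch
    "unrestricted":   set(ALL_PATTERNS),      # no restriction at all
}

observed = {"SS": 40, "SR": 25, "RS": 5, "RR": 30}   # hypothetical pattern counts

total = sum(observed.values())
for theory, patterns in allowed.items():
    inside = sum(n for p, n in observed.items() if p in patterns)
    print(f"{theory:14s}: permits {len(patterns)} of {len(ALL_PATTERNS)} patterns; "
          f"{inside / total:.0%} of observed patterns fall inside that set")

            Leaner theories permit fewer patterns; the statistical question addressed below is whether the patterns they exclude are rare enough to justify the restriction.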

            We conducted analyses on 23 data sets containing nearly 8,000 choices and 2,000 choice patterns, and aggregated the results. We draw several specific conclusions.

            1. All the theories are rejected by a chi-squared test. For every theory there is systematic variation in excluded patterns which could, in principle, be explained by a more refined theory.

            2. There is room for improvement in two directions. Some theories, like EU and WEU, are too lean: They could explain the data better by allowing a few more common patterns. Other theories, such as mixed fanning and rank-dependent EU, are too fat: They allow many patterns which are rarely observed. Our analyses provide theorists with a way to diagnose empirical shortcomings of current theories, and perhaps inspiration for new theorizing.

            3. Theory accuracy differs dramatically depending on whether the gambles in a pair have different support (they lie on the triangle boundary) or the same support (they lie in the triangle interior). EU predicts poorly when support is different, and predicts well when support is the same. The transition from the boundary to the interior implies adding support, typically a small probability of an outcome. Therefore, the accuracy of EU in the interior and its inaccuracy on the boundary suggest that nonlinear weighting of small probabilities is empirically important in explaining choice behavior (an illustrative weighting function is sketched after point 4 below). This conclusion has been suggested before, but it is confirmed dramatically by our analysis.  Indeed, Morgenstern (1979) himself accepted that EU had limited applicability when probabilities were low:

 

            Now the von Neumann-Morgenstern utility theory, as any theory, is only an approximation to an undoubtedly much richer and far more complicated reality than that which the theory describes in a simple manner.

            . . . one should now point out that the domain of our axioms on utility theory is also restricted.  Perhaps we should have pointed that out, instead of assuming that this would be understood ab ovo.  For example, the probabilities used must be within certain plausible ranges and not go to 0.01 or even less to 0.001, then to be compared with other equally tiny numbers such as 0.02, etc.  Rather, one imagines that a normal individual would have some intuition of what 50:50 or 25:75 means, etc. (p. 178)

 

            4.  The broadest conclusion of our analysis is that there are some losers among competing theories, and some winners. Losers include general theories which rely on betweenness rather than independence, and theories which assume fanning in throughout the triangle; those theories are dominated by others which use fewer free parameters and are more accurate. There is some irony here: Some of the theories we test were developed after theorists had seen some of the data sets -- these include mixed fanning (which we concocted), linear mixed fanning (or disappointment-aversion theory), lottery-dependent utility, and others.  The development of linear mixed fanning, say, was clearly influenced by data we use to test it.  Rather than presenting a problem, this testifies to the power of our approach:  We are able to reject some theories using the same data which were taken as inspiration, or support, for developing those theories in the first place.
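            Returning to point 3: as a concrete illustration of what nonlinear weighting of small probabilities means, the sketch below uses the parametric weighting function proposed by Tversky & Kahneman (1992), w(p) = p^gamma / (p^gamma + (1 - p)^gamma)^(1/gamma).  The value gamma = 0.6 is purely illustrative, not an estimate from our data; with gamma < 1 the function overweights small probabilities (w(.01) is roughly .06) and underweights moderate and large ones.

# Illustrative probability weighting function (parametric form from Tversky &
# Kahneman, 1992).  gamma = 0.6 is an illustrative value, not an estimate.

def w(p, gamma=0.6):
    """Decision weight attached to probability p; gamma < 1 overweights small p."""
    return p**gamma / (p**gamma + (1 - p)**gamma) ** (1 / gamma)

for p in (0.01, 0.05, 0.50, 0.99):
    print(f"p = {p:4.2f}  ->  w(p) = {w(p):.3f}")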

            We cannot declare a single winner among theories -- much as we cannot declare a best ice cream or university -- because the best theory depends on one's tradeoff between parsimony and fit. But suppose a researcher can specify a single parameter expressing the price of precision, or the reduction in goodness-of-fit (measured by a chi-squared statistic) necessary to justify allowing an extra free parameter. (Some statistical criteria suggest what this price should be.)  We construct a menu of theories which are best at each price of precision; researchers can then use the menu to decide which theory to adopt, depending on the price they are willing to pay.
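            As a purely numerical illustration of how such a menu is read, the sketch below tabulates, for each candidate price, the theory that minimizes chi-squared badness of fit plus the price times the number of free parameters. The chi-squared values and parameter counts are hypothetical placeholders, not the estimates underlying Table 13.

# Hypothetical sketch of a price-of-precision menu.
# Each theory: (chi-squared badness of fit, number of free parameters).
# All numbers are placeholders, not estimates from our studies.

theories = {
    "EV":              (120.0, 1),
    "EU":              (95.0, 2),
    "prospect theory": (60.0, 4),
    "mixed fanning":   (45.0, 6),
}

def best_theory(price):
    """Theory minimizing chi-squared plus price times free parameters."""
    return min(theories, key=lambda name: theories[name][0] + price * theories[name][1])

for price in (0, 5, 10, 20, 50):
    print(f"price of precision {price:3d}: {best_theory(price)}")

            At a price of zero the best-fitting theory always wins; as the price rises, progressively leaner theories take over, which is the ordering of the menu reported below.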

            When lotteries have different support, there is never a price of precision which justifies using EU; anyone who values parsimony enough to prefer EU over all the generalizations should use EV instead. Combining all the studies (see the bottom of Table 13), the menu of best theories is: mixed fanning, prospect theory, EU, and EV. Statistical criteria suggest various prices of precision which favor either mixed fanning or EU; the middle ground between high and low prices favors prospect theory.

            We cannot give a more definitive answer to the question of which theory is best because people use theories for different purposes. A researcher interested in a broad theory, to explain choices by as many people as possible, cares less for parsimony and more for accuracy; she might choose mixed fanning or prospect theory. A decision analyst who wants to help people make more coherent decisions, by adhering to axioms they respect but sometimes wander from, might stick with EU or EV.  A mathematical economist who uses the theory as a brick to build theories of aggregate behavior may value parsimony more highly; she might choose EU or EV (though she should never use EU when choices involve gambles with different support).

            However, an historical parallel described by Stigler (1950) may be instructive for those who cling to EU:

 

            Economists long delayed in accepting the generalized utility function because of the complications in its mathematical analysis, although no one (except Marshall) questioned its realism...Manageability should mean the ability to bring the theory to bear on specific economic problems, not ease of manipulation. The economist has no right to expect of the universe he explores that its laws are discoverable by the indolent and the unlearned.  The faithful adherence for so long to the additive utility function strikes one as showing at least a lack of enterprise (pp. 393-394).

             

            The pairwise-choice studies suggest that violations of EU are robust enough that modeling of aggregate economic behavior based on alternatives to EU is well worth exploring.  So far there have been relatively few such efforts. (Epstein (1990) reviews some efforts by economists.) Ultimately, most of the payoff for economics will come from replacing EU in models of individual behavior with more accurate descriptive principles or a single formal theory. Our results suggest which replacements are most promising, and which modifications of the currently available theories are most productive.

            We see our paper as summarizing a chapter in the history of empirical studies of risky choice. We think the weight of evidence from recent studies with multiple pairwise choices, when aggregated across those studies, is sufficiently great that new pairwise-choice studies are unlikely to budge many basic conclusions-- the statistical value-added of more such studies is low (compared to the value-added of new approaches).  However, this sweeping conclusion leans heavily on the assumption that different studies are completely independent (which they likely are not).  If studies are highly dependent then our results are overstated, and there may still be substantial value in using the pairwise-choice paradigm to explore new domains of gambles (e.g., gambles over losses, gambles with many possible outcomes, gambles with very low probabilities); more data could change the way at least some theories are ranked.

            If our analysis closes the chapter on pairwise-choice empirics (or summarizes much of what we know so far), then it opens new chapters as well-- particularly, a chapter devoted to combining structural explanations of choice problems with more sophisticated theories of errors.  Empirical studies fitting individual non-EU functions and parameters to subjects are useful and relatively rare (see Daniels & Keller, in press; Hey & Di Cagno, 1990; Tversky & Kahneman, 1992). Studies that test axioms directly-- e.g., Wakker, Weber & Erev (1993) test comonotonic independence, the crucial ingredient in rank-dependent approaches-- are useful too.  Function-fitting, and our approach, both allow heterogeneous preferences. (The fact that estimated pattern proportions are fairly even across patterns suggests there is substantial heterogeneity.)  For analytical tractability, it is often useful to assume homogeneity (in representative-agent models); then the sensible empirical question is which single theory, and which precise parameter values, fit everyone’s choices best (see Camerer & Ho, in press).

            Finally, our general method could be applied in other domains.  For example, various non-cooperative solution concepts permit different sets of choices in games.  Theories could be characterized as restrictions on allowable patterns of choices, and the distribution of patterns could be explicitly connected to those restrictions through an error rate. McKelvey & Palfrey (1992), for instance, apply a similar error theory to fit various equilibrium concepts to experimental data on the “centipede” game, and to test restrictions imposed by different concepts.  El-Gamal & Grether (1993) apply a similar analysis to experimental data on probability judgments.  Most importantly, our method would allow one to judge which concepts, like Nash equilibrium or its various refinements (and coarsenings), best trade off parsimony and accuracy.  A similar method could be applied to compare solution concepts in cooperative games. The discussion above shows how our method uses more information, and hence is more powerful, than methods which judge theories only by the percentage of consistent responses (e.g., Selten, 1987).
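            To illustrate how an error rate connects a theory’s permitted patterns to a full distribution over observable patterns, the sketch below assumes each decision maker holds a true pattern the theory permits and misreports each individual choice independently with probability epsilon. The permitted patterns, mixing weights, and error rate shown are hypothetical, and the sketch is a simplified illustration rather than the exact specification estimated in this paper.

# Sketch: connect a theory's permitted patterns to a predicted distribution
# over all observable patterns via a per-choice error rate (epsilon).
# Permitted patterns, mixing weights, and epsilon below are hypothetical.

from itertools import product

def predicted_distribution(permitted_weights, epsilon, n_choices=2):
    """P(observed pattern) = sum over permitted true patterns of
    weight * epsilon**(choices flipped) * (1 - epsilon)**(choices kept)."""
    observable = ["".join(p) for p in product("SR", repeat=n_choices)]
    dist = {}
    for obs in observable:
        prob = 0.0
        for true, weight in permitted_weights.items():
            flips = sum(o != t for o, t in zip(obs, true))
            prob += weight * epsilon**flips * (1 - epsilon)**(n_choices - flips)
        dist[obs] = prob
    return dist

# Example: a theory permitting only "SS" and "RR", with hypothetical weights.
print(predicted_distribution({"SS": 0.6, "RR": 0.4}, epsilon=0.1))

            The same construction applies when the “patterns” are strategy choices in games rather than lottery choices, which is what makes the comparison of solution concepts described above feasible.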