Neil W. Henry, September 1997; revised 2001
The emphasis on distributions has two roots. First, it is central to the modern notion of data analysis using computers. Even with large datasets it is possible to "look at" the data by sorting and organizing it into distributional patterns. Stories about the data can be told without traditional calculation, stories involving ranges, outliers, quartiles (or percentiles) and shape (e.g., symmetry, skew). Second, the bar chart / histogram organization of a finite collection of data values, with area in the graph representing frequency, is a stepping stone to the concept of the distribution of a continuous variable. The normal distribution is the first such distribution that is introduced. This jump, in texts that do not assume the reader is familiar with Isaac Newton's integral calculus, usually involves a lot of hand waving and appeals to visual intuition (histogram intervals getting narrower and narrower, for instance).
It seems to me that these two motivating principles are somewhat contradictory and can lead to confusion. On the one hand, data are always finite in number and thus discrete. Numeric values, no matter how carefully measured, never fill an interval of the real number line. Raising the distinction between discrete and continuous variables at the beginning of a statistics text leads to angels-on-the-head-of-a-pin discussions that may interfere with the development of data-analytic intuitions. (A&F, p. 30, Problem 8: "Which of the following variables are continuous when the measurements are as fine as possible?") On the other hand, practical issues of data analysis, such as how many intervals should be used in a histogram or frequency table, may get ignored in the rush to normality.
Frequency tables are usually presented in percentage form, at least in social science applications. The words "relative frequency" and "proportion" are less likely to appear in early chapters on data description, since "percentage" is the more common usage in the "real" world (compare A&F pp. 37-38, Table 3.3, and M&M pp. 12-13, Table 11). When it comes time to introduce the concept of probability, however, the proportions take on a much more important role. A&F explicitly use the long-term relative frequency definition of probability:
A&F do not explain how the probability distribution of Y (the variable named above) shown in Table 4.1 came to be known "approximately". Their statement, "according to results from recent General Social Surveys the probability distribution is approximately the one shown in Table 4.1," is not an explanation. What we know up to this point from their text is that
M&M use the following example to introduce the concept of a probability distribution:
Problem 2 (A&F p. 111) might, in my opinion, have been a more appropriate way to introduce the concept of a probability distribution, though they should have said "Suppose Table 3.9 is the population distribution" and then specified how randomness is going to enter the picture. Table 3.9 gives percentages for the various responses in a population of 1,598 people. If we select one person at random, as in a lottery where 1,598 tickets have been sold, we can speak of the probability that the selected person knows Y people with AIDS. Under other circumstances there is no such justification for calling the numbers "probabilities".
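A small Python sketch of that lottery interpretation follows. The category counts in it are invented placeholders, not the actual percentages from Table 3.9 (only the population size of 1,598 comes from the problem); the point is that under genuine random selection the population proportions are the probabilities, and the long-run relative frequencies of repeated draws settle down near them.

    import random

    # Hypothetical counts of how many people with AIDS each person knows.
    # These are placeholders, NOT the figures from A&F Table 3.9;
    # only the population size of 1,598 is taken from the problem.
    population_counts = {0: 1200, 1: 250, 2: 100, 3: 48}   # sums to 1,598
    N = sum(population_counts.values())

    # With one person selected at random (a lottery with 1,598 tickets),
    # the probability of each value of Y is just its population proportion.
    prob = {y: count / N for y, count in population_counts.items()}
    print(prob)

    # Repeated random draws: the relative frequencies approach those
    # proportions, which is what licenses the word "probability" here.
    tickets = [y for y, count in population_counts.items() for _ in range(count)]
    draws = [random.choice(tickets) for _ in range(100_000)]
    rel_freq = {y: draws.count(y) / len(draws) for y in sorted(population_counts)}
    print(rel_freq)

Without some such randomizing device, of course, the percentages remain merely a description of the 1,598 respondents.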
Useful Problems from A&F Chapter 4 include: 1 (who is "you"?), 2, 7, 8 (verify using Table A), 14, 18, 20, 22, 29, 31, 32 (again, who is "you"?), 53.
To facilitate early detection of breast cancer, women are encouraged from a particular age on to participate at regular intervals in routine screening, even if they have no obvious symptoms. Imagine you conduct in a certain region such a breast cancer screening using mammography. For symptom-free women aged 40 to 50 who participate in screening using mammography, the following information is available for this region:
In this classic example the screening test is fallible, so that there may be "false positive" and "false negative" results. The so-called probabilities have (presumably) been derived from a large amount of data. Let's see what happens if we reverse the mode of thinking that I used in the previous section and think of the probabilities as proportions in a large population of, say, 10,000 women. One percent of these (100) have breast cancer, and 9,900 do not. 80% of the 100 with cancer (i.e., 80 women) show up as positive when given the test. On the other hand, 10% of the 9,900 cancer-free women (990) are false positives.
A physician looking at one of the 1,070 women who have tested positive (990 + 80 = 1,070) should realize (in this hypothetical situation) that the vast majority of the 1,070 are cancer-free. Only 80 of the 1,070 have cancer, a ratio of about .075, or 7.5%. Until further tests are made, or other relevant information about her is obtained, it seems obvious that she should not be informed that there is a very good chance that she has breast cancer. Yet some physicians will still say that "Statistical information is one big lie" and "Statistics is alien to everyday concerns and of little use for judging individual persons".
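As a check on the arithmetic, here is a short Python sketch (my own, not taken from Gigerenzer or from the texts) that reproduces the natural-frequency count above and then obtains the same answer by applying Bayes' theorem directly to the stated probabilities.

    # A sketch (not from Gigerenzer or the texts) checking the arithmetic
    # two ways: natural frequencies, then Bayes' theorem.
    population = 10_000
    prevalence = 0.01            # 1% of the women have breast cancer
    sensitivity = 0.80           # 80% of those with cancer test positive
    false_positive_rate = 0.10   # 10% of cancer-free women test positive

    # Natural frequencies, as in the text above.
    with_cancer = population * prevalence                    # 100
    without_cancer = population - with_cancer                # 9,900
    true_positives = with_cancer * sensitivity               # 80
    false_positives = without_cancer * false_positive_rate   # 990
    all_positives = true_positives + false_positives         # 1,070
    print(true_positives / all_positives)                    # about 0.075

    # Bayes' theorem applied to the stated probabilities gives the same answer.
    p_positive = prevalence * sensitivity + (1 - prevalence) * false_positive_rate
    print(prevalence * sensitivity / p_positive)             # about 0.075

Either way the answer is about 7.5%: the frequency format simply makes the large count of false positives, and hence the small chance that a positive-testing woman actually has cancer, hard to miss.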
"The psychology of good judgment", by Gerd Gigerenzer. Medical Decision Making, 1996, vol. 16, no.3, pp. 273-280. "How to improve Bayesian reasoning without instruction: Frequency formats" by Gigerenzer & Hoffrage (1995), Psychological Review, 102, 684-704.