Neil W. Henry, September 1997; revised 2001
The emphasis on distributions has two roots. First, it is central to the modern notion of data analysis using computers. Even with large datasets it is possible to "look at" the data by sorting and organizing it into distributional patterns. Stories about the data can be told without traditional calculation, stories involving ranges, outliers, quartiles (or percentiles) and shape (e.g., symmetry, skew). Second, the bar chart / histogram organization of a finite collection of data values, with area in the graph representing frequency, is a stepping stone to the concept of the distribution of a continuous variable. The normal distribution is the first such distribution that is introduced. This jump, in texts that do not assume the reader is familiar with Isaac Newton's integral calculus, usually involves a lot of hand waving and appeals to visual intuition (histogram intervals getting narrower and narrower, for instance).
It seems to me that these two motivating principles are somewhat contradictory and can lead to confusion. On the one hand, data are always finite in number and thus discrete. Numeric values, no matter how carefully measured, never fill an interval of the real number line. Raising the distinction between discrete and continuous variables at the beginning of a statistics text leads to angels-on-the-head-of-a-pin discussions that may interfere with the development of data-analytic intuitions. (A&F, p. 30, Problem 8: "Which of the following variables are continuous when the measurements are as fine as possible?") On the other hand, practical issues of data analysis, such as how many intervals should be used in a histogram or frequency table, may get ignored in the rush to normality.
Frequency tables are usually presented in percentage form, at least in social science applications. The words "relative frequency" and "proportion" are less likely to appear in early chapters on data description, since "percentage" is the more common usage in the "real" world (compare A&F pp. 37-38, Table 3.3, and M&M pp. 12-13, Table 11). When it comes time to introduce the concept of probability, however, the proportions take on a much more important role. A&F explicitly use the long-term relative frequency definition of probability:
A&F do not explain how the probability distribution of Y (the variable named above) shown in Table 4.1 came to be known "approximately". Their statement, "according to results from recent General Social Surveys the probability distribution is approximately the one shown in Table 4.1," is not an explanation. What we know up to this point from their text is that
M&M use the following example to introduce the concept of a probability distribution:
Problem 2 (A&F p. 111) might, in my opinion, have been a more appropriate way to introduce the concept of a probability distribution, though they should have said "Suppose Table 3.9 is the population distribution" and then specified how randomness is going to enter the picture. Table 3.9 gives percentages for the various responses in a population of 1,598 people. If we select one person at random, as in a lottery where 1,598 tickets have been sold, we can speak of the probability that the selected person knows Y people with AIDS. Under other circumstances there is no such justification for calling the numbers "probabilities".
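A small Python sketch of that lottery interpretation follows. The category counts in it are invented placeholders, not the actual percentages from Table 3.9 (only the population size of 1,598 comes from the problem); the point is that under genuine random selection the population proportions are the probabilities, and the long-run relative frequencies of repeated draws settle down near them.

    import random

    # Hypothetical counts of how many people with AIDS each person knows.
    # These are placeholders, NOT the figures from A&F Table 3.9;
    # only the population size of 1,598 is taken from the problem.
    population_counts = {0: 1200, 1: 250, 2: 100, 3: 48}   # sums to 1,598
    N = sum(population_counts.values())

    # With one person selected at random (a lottery with 1,598 tickets),
    # the probability of each value of Y is just its population proportion.
    prob = {y: count / N for y, count in population_counts.items()}
    print(prob)

    # Repeated random draws: the relative frequencies approach those
    # proportions, which is what licenses the word "probability" here.
    tickets = [y for y, count in population_counts.items() for _ in range(count)]
    draws = [random.choice(tickets) for _ in range(100_000)]
    rel_freq = {y: draws.count(y) / len(draws) for y in sorted(population_counts)}
    print(rel_freq)

Without some such randomizing device, of course, the percentages remain merely a description of the 1,598 respondents.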
Useful Problems from A&F Chapter 4 include: 1 (who is "you"?), 2, 7, 8 (verify using Table A), 14, 18, 20, 22, 29, 31, 32 (again, who is "you"?), 53.
To facilitate early detection of breast cancer, women are encouraged from a particular age on to participate at regular intervals in routine screening, even if they have no obvious symptoms. Imagine you conduct in a certain region such a breast cancer screening using mammography. For symptom-free women aged 40 to 50 who participate in screening using mammography, the following information is available for this region:
In this classic example the screening test is fallible, so that there may be "false positive" and "false negative" results. The so-called probabilities have (presumably) been derived from a large amount of data. Let's see what happens if we reverse the mode of thinking that I used in the previous section and think of the probabilities as proportions in a large population of, say, 10,000 women. One percent of these (100) have breast cancer, and 9,900 do not. 80% of the 100 with cancer (i.e., 80 women) show up as positive when given the test. On the other hand, 10% of the 9,900 cancer-free women (990) are false positives.
A physician looking at one of the 1,070 women who have tested positive (990 + 80 = 1,070) should realize (in this hypothetical situation) that the vast majority of the 1,070 are cancer-free. Only 80 of the 1,070 have cancer, a ratio of about .075, or 7.5%. Until further tests are made, or other relevant information about her is obtained, it seems obvious that she should not be informed that there is a very good chance that she has breast cancer. Yet some physicians will still say that "Statistical information is one big lie" and "Statistics is alien to everyday concerns and of little use for judging individual persons".
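As a check on the arithmetic, here is a short Python sketch (my own, not taken from Gigerenzer or from the texts) that reproduces the natural-frequency count above and then obtains the same answer by applying Bayes' theorem directly to the stated probabilities.

    # A sketch (not from Gigerenzer or the texts) checking the arithmetic
    # two ways: natural frequencies, then Bayes' theorem.
    population = 10_000
    prevalence = 0.01            # 1% of the women have breast cancer
    sensitivity = 0.80           # 80% of those with cancer test positive
    false_positive_rate = 0.10   # 10% of cancer-free women test positive

    # Natural frequencies, as in the text above.
    with_cancer = population * prevalence                    # 100
    without_cancer = population - with_cancer                # 9,900
    true_positives = with_cancer * sensitivity               # 80
    false_positives = without_cancer * false_positive_rate   # 990
    all_positives = true_positives + false_positives         # 1,070
    print(true_positives / all_positives)                    # about 0.075

    # Bayes' theorem applied to the stated probabilities gives the same answer.
    p_positive = prevalence * sensitivity + (1 - prevalence) * false_positive_rate
    print(prevalence * sensitivity / p_positive)             # about 0.075

Either way the answer is about 7.5%: the frequency format simply makes the large count of false positives, and hence the small chance that a positive-testing woman actually has cancer, hard to miss.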
"The psychology of good judgment", by Gerd Gigerenzer. Medical Decision Making, 1996, vol. 16, no.3, pp. 273-280. "How to improve Bayesian reasoning without instruction: Frequency formats" by Gigerenzer & Hoffrage (1995), Psychological Review, 102, 684-704.