American Association for Public Opinion Research

2004 Presidential Address

2004 Presidential Address from the AAPOR 59th Annual Conference

Unfinished Business by Elizabeth Martin


AAPOR has always felt like a home to me. Since 1973, when I attended my first AAPOR conference, I think I’ve missed only two annual meetings. I was very honored to be elected as your president, and I want to thank you for giving me the opportunity to serve.

In recent years, AAPOR Council has been taking a more proactive role to protect and advance our shared professional interest in the credibility and reputation of surveys. One part of Council’s more proactive stance is to position AAPOR to better address potential threats to our ability to conduct surveys. This year, ably led by Mark Schulman and Nancy Belden, we joined with other organizations in a Research Industry Alliance to make sure that surveys were not inadvertently swept up in legislation and court decisions concerning the Do Not Call list. AAPOR also applied to join the Consortium of Social Science Associations as a governing member, not just an affiliate, and was accepted. Joining with the principal social science associations should help us more effectively protect the scientific integrity of surveys and better address issues involving, for example, Institutional Review Boards.

A second part of a more proactive agenda is focused on the public’s image of survey research and actions to improve its reputation. You heard about this in last night’s plenary session, and I believe you’ll be hearing more about it from Nancy Belden during her presidency.

A third part of this agenda is to ensure and enhance the quality of surveys and the public’s ability to understand and use their results sensibly. Our credibility rests on the quality and transparency of the methods we use to conduct surveys and produce results. One way for AAPOR to support the goal of enhancing survey quality is to promote ongoing review and discussion of methodological issues that affect the accuracy of surveys and, when appropriate, to foster research that can be the basis for improved methods. This is the function of a Committee to Review Pre-Election Poll Methodology that was established by AAPOR Council two years ago. (Establishing such a committee was recommended by the National Academy of Sciences two decades ago.) Its purpose is to “review, document, and analyze the methods on which election forecasts are based.” It’s chaired by Mike Traugott, and its focus is on methodological issues that may affect results. I’m sure you’ll be hearing more about its activities as we move into election season.

Today, I want to talk about a methodological issue that I fear we do not always face squarely, perhaps not consistently living up to our own standards. I fear as a result we may put the scientific integrity and reputation of surveys at risk. The problem I want to focus on, survey nonresponse rates and reporting about them, has troubled AAPOR since the beginning. In recent years, we’ve made considerable strides, for example, by standardizing response rate definitions. However, when several of us on council worked to develop guidelines for documenting survey results to put on AAPOR’s Web site, it became clear that disclosure of response rates is still a touchy issue. Some people expressed reluctance to say much about them for what may seem like contradictory reasons—first, discussing response rates will leave surveys and polls vulnerable to attacks from those who want to discredit them, and second, response rates don’t matter anyway. I felt sympathetic with the first reason but skeptical about the second. If we don’t publish response rates, how can survey consumers, or we professionals, ever begin to gauge their possible effects on the data? In 2002, Smith concluded that “response rates are not routinely reported when survey results are released,” (p. 35) not only by public polling organizations, but by academic ones, as well—and, I might add, in government surveys. If we hold back this information, doesn’t that leave us more, not less, vulnerable to the accusation that we’re hiding something? Lack of disclosure was, after all, one of the complaints that inspired Arianna Huffington to launch her “Campaign for a Poll Free America.” I wondered what the evidence on this issue is, and how we might answer basic questions from a concerned data user or careful journalist, such as: Should I believe the results of a survey, given that it has an 80%, or 50%, or 30% response rate? What difference does the response rate make, anyway?

With apologies in advance to my friends and colleagues who have worked so diligently in this area, I concluded that it would be pretty difficult to arrive at a ready answer to these questions. To my surprise, I could not find a comprehensive review of the literature on the effects of nonresponse on survey data. Groves and Couper’s rich, brilliant, earthshaking, pathbreaking research (is that enough adjectives, Bob?) has been focused on the important question of the causes of nonresponse, and less on the effects on the data. The statisticians have done very sophisticated work on nonresponse adjustments, but much of it is abstracted from real surveys, and the assumptions they sometimes rely on may not hold generally (such as the assumption that nonrespondents resemble reluctant respondents more than easy respondents; see, e.g., Lin and Schaeffer 1995; Teitler, Reichman, and Sprachman 2003). At the other extreme, some published articles dismiss nonresponse effects citing a cursory review of evidence that doesn’t acknowledge what we don’t know.

So, I decided to review some of the work in this area myself, having in mind the questions posed by our hypothetical concerned data user or careful journalist. Why did I do this? Well, reason one is, I asked Stan Presser to do it and he turned me down. A second reason is that I believe it would be helpful to arrive at an understanding of the problem that would enable us to talk about it more openly and less defensively. More open discussion might help us address questions about response rates more credibly and effectively. Providing contextual information might allow survey consumers to interpret response rate information more sensibly, and survey producers to feel more comfortable reporting it, with fewer fears of misuse. A response rate alone is not very informative, and perhaps invites misinterpretation. A response rate these days is a naked, ugly, scary thing—no wonder people cringe when it’s paraded out. Finally, I think we are at some risk of dismissing the problem based on a still-insufficient body of data—some of it read, perhaps, with rose-tinted glasses. We need to acknowledge and address the gaps in our knowledge, and explain the limitations of the data. I don’t think we are doing a good job of that. If we can’t or won’t do it, then we ought to abandon claims that our work is scientific.

My reading suggests that several myths need to be dispelled, and several basic points communicated, in order to facilitate a more informed discussion about response rates with survey users and among ourselves. I’ll welcome any discussion that follows—in fact I’ll feel like my effort was a success if it fosters debate or stimulates others to do a more comprehensive review.

Myth 1. Response rate is an indicator of the quality of a survey. In general, good survey practice includes following up with nonrespondents to get interviews with as many as possible and thereby improve response rates. However, many factors that may influence a survey’s response rate (such as how much time an interview takes and the saliency and interest of the topic) do not reflect on the quality of its methods, execution, or data. Other sources of error—such as biased or poorly worded questions—are greater threats to survey validity than nonresponse. Thus, a survey with a high response rate may or may not be “better” than one with a lower response rate.

In some cases, steps taken to improve response rates may actually make a survey less representative. For example, in one mail survey, an incentive was especially appealing to women and people over 50, making an existing bias worse (Moore and Tarnai 2002). In this case, the poll with the higher response rate was less representative of the population.

Myth 2. There is a minimum response rate below which survey results are not valid. Most survey practitioners probably have informal expectations about what constitutes a “good” response rate to shoot for. However, there does not seem to be a basis for fixing a level or range below which we can say results are compromised by nonresponse bias, and above which they aren’t. It depends very much upon the nature of the phenomenon under study and how it is related to the causes of the nonresponse. Nonresponse error arises when the causes of participation are related to the values of the statistics measured. This cuts both ways—it means that in one survey, or on a given topic, or for a given purpose, there can be substantial nonresponse bias even at high levels of response. In another, there may be none even though the response rate is low. It also means that different statistics in the same survey may have different nonresponse errors.

A survey of new (mostly unwed) parents, sampled from hospital records of births, provides a good illustration (Teitler, Reichman, and Sprachman 2003). The survey attempted to interview both fathers and mothers. Mothers’ answers to questions about the fathers of their babies allowed the investigators to evaluate bias due to failure to interview 20 percent of the fathers. Figure 1 shows the values of survey variables at successive levels of effort. The two right-most columns show what we almost never see—the “truth” for the whole sample of fathers eligible for the survey, and for the nonresponding fathers. (All information was provided by the mothers of the babies.)

Figure 1. Characteristics of new fathers at different levels of survey effort. Source: Teitler, Reichman, and Sprachman (2003).

For some characteristics—such as the percentage of new fathers with less than a high school education—the sample quickly became representative with a low level of effort and a response rate of 42 percent. For other characteristics—such as the fraction of new fathers who had informed the mother they would support the child, and the fraction with drug and alcohol problems—the interviewed sample of fathers never became representative of the full sample. For these characteristics, nonrespondents were quite different from respondents. For example, 92 percent of the fathers who were eventually interviewed had said they would support the child, compared to 64 percent of those who were never interviewed. Thus, for some characteristics a 42 percent response rate was sufficient, and for some others an 80 percent response rate was not good enough to eliminate nonresponse bias.

Similarly, the effects of nonresponse may depend on the purpose of an analysis. In a simulation of what difference it would have made if the Survey of Consumer Attitudes had not increased callbacks to maintain its 70 percent response rate, Curtin, Presser, and Singer (2000) found that restricting the number of calls would have affected levels of the Index of Consumer Sentiment, but not estimates of change over time.

Myth 3. Response rates don’t matter. Practically speaking, nonresponse is usually ignored in discussions of particular surveys, even though we know that it has several effects on survey data (see, e.g., Groves and Couper 1998).

It is a potential threat to the inferential basis for generalizing from a sample to a population.

It leads sampling errors to be underestimated, enticing us to be more confident of survey estimates than we should be.

It is a possible source of bias, which increases as a function of the size of the nonresponse rate, and the magnitude of difference between respondents and nonrespondents on variables of interest.

The survey of new parents I just described is one in which nonresponse cannot be ignored for some variables. We can see this by comparing the interviewed fathers with the full sample of fathers (figure 2). Estimates of the fraction of new fathers with drug or alcohol problems or who did not commit to support the child are substantially biased due to nonresponse, because the differences between respondents and nonrespondents are so large.

Figure 2. Comparison of characteristics of interviewed fathers and all fathers. Source: Teitler, Reichman, and Sprachman (2003).
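
To make the arithmetic behind this comparison concrete, here is a minimal sketch in Python. It assumes the response rate was exactly 80 percent and treats the two figures quoted earlier (92 percent of interviewed fathers and 64 percent of never-interviewed fathers had said they would support the child) as simple unweighted proportions. The function name nonresponse_bias is illustrative and does not come from the study.

```python
def nonresponse_bias(response_rate, resp_value, nonresp_value):
    """Return (full-sample value, bias of the respondent-only estimate).

    The full-sample value is the response-rate-weighted mix of the
    respondent and nonrespondent values. The bias equals the
    nonresponse rate times the respondent/nonrespondent difference.
    """
    nonresponse_rate = 1.0 - response_rate
    full_sample = response_rate * resp_value + nonresponse_rate * nonresp_value
    bias = resp_value - full_sample
    return full_sample, bias


# Assumed inputs: 80% response rate; 92% vs. 64% said they would support the child.
full_sample_pct, bias_pts = nonresponse_bias(0.80, 92.0, 64.0)
print(f"Full-sample value: {full_sample_pct:.1f} percent")
print(f"Bias of respondent-only estimate: {bias_pts:.1f} points")
```

Under these assumptions the full-sample value is about 86.4 percent, so the interviewed-only figure of 92 percent overstates it by about 5.6 points, which is the 20 percent nonresponse rate times the 28-point gap between respondents and nonrespondents. A smaller gap would shrink the bias at the same response rate, which is why the rate alone says so little.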

This survey involves a special population, and these variables—child support and drug or alcohol use—are direct causes of nonresponse. Most of us have more faith in the robustness of survey results in general population surveys, and recent research is reassuring on that point. Survey findings may be fairly robust to the effects of a large difference in response rates. This was the conclusion of the excellent and careful experiment conducted by Pew and published by Keeter et al. in 2000, and recently replicated (Pew 2004). Two tightly controlled RDD surveys were conducted in 1997 with different levels of effort—a “standard” survey was done over five days and obtained a 36 percent response rate, and a “rigorous” survey was done over two months, offered a $2 incentive, and obtained a 61 percent response rate. The surveys asked identical questions and obtained similar results. Fourteen of 91 questions, or 15 percent, showed significant differences at the .05 level. It is encouraging that the experiment did not identify large substantive differences between surveys with response rates this different. However, I think we must be careful not to draw general conclusions from this study. For one thing, assuming that chance alone might result in 5 percent of the item distributions being significantly different at the .05 level, then finding that 15 percent of the questions showed significant differences suggests that nonresponse (or other) differences between the surveys did affect some results. There were differences on some opinion questions, and marked differences in the characteristics of interviewed households, such as the fraction with multiple telephone numbers, the fraction with listed phone numbers, and the fraction of single person households. These characteristics were correlated with attitudes. 
In addition, as the authors point out, but those citing this study sometimes do not, the results cannot tell us anything about the effects of response rate differences in other ranges, for example, between 60 and 80 percent, or 80 and 100 percent. (Some results of the survey of new parents were insensitive to response rate differences between 42 percent and 80 percent, although substantial nonresponse bias was present.) As I’ve said, the effects of nonresponse depend on how the phenomenon under study is related to the causes of nonresponse, so we don’t know yet whether these results generalize to other surveys on other topics. We need to accumulate a larger body of careful research before drawing general conclusions. The authors’ conclusions seem apt. They see the value of their experiment as “stimulating other work ... to discover under what circumstances and for what measures nonresponse rate differences imply disparate nonresponse errors. We expect that such work requires theory development that links the decision to participate with the purposes of the survey” (2000:147). As well, practical tools are needed to inform us whether and when the correlation between causes of participation and key survey statistics is important.
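
Returning to the 14 of 91 questions that differed significantly in the Pew experiment, the chance-alone argument can be checked with a short calculation. The sketch below assumes, purely for illustration, that the 91 questions behave like independent tests, each with a 5 percent chance of a spurious significant difference. Survey items are usually correlated, so this is a rough check rather than a formal test. The helper binomial_tail is a hypothetical name.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_questions = 91     # questions compared across the two surveys
n_significant = 14   # questions that differed at the .05 level
alpha = 0.05         # chance rate of a spurious difference per question

# Number of significant differences expected from chance alone.
expected_by_chance = n_questions * alpha

# Probability of seeing 14 or more if chance alone were at work
# (illustrative independence assumption).
tail_probability = binomial_tail(n_questions, n_significant, alpha)

print(f"Expected by chance: {expected_by_chance:.2f} questions")
print(f"P(14 or more by chance): {tail_probability:.6f}")
```

Chance alone would produce between four and five significant differences, not fourteen, and under the independence assumption the probability of fourteen or more falls far below .05. That is consistent with the reading that some differences between the two surveys were real.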

Myth 4. A record of good performance means we don’t have to worry about nonresponse bias. Some practitioners conclude from the research that nonresponse can be ignored, and they buttress their conclusion by noting that surveys generally do a good job of prediction when there is an external criterion—such as election outcomes—to check them by. This is no doubt true, but the trouble with relying on this assumption is we don’t know when it fails. We cannot rest on our laurels.

Perhaps the most vivid illustration of the pitfalls is the Literary Digest poll, drawn from ancient survey history. The Digest’s straw poll had an excellent record of predicting the winner in four presidential elections from 1920 to 1932. In 1932 the Digest forecast Roosevelt’s victory within one point of the actual result. But in 1936 the poll failed spectacularly, predicting a landslide victory for Landon over Roosevelt. The conventional wisdom attributes the failure to an unrepresentative sample based on telephone directories and lists of car owners, as well as registered voters. These sources were biased toward more affluent, Republican voters. However, a flawed sample was only part of the story. Prediction failure also resulted from a low response rate of about 25 percent coupled with substantial nonresponse bias (Squire 1988; Cahalan 1989).

Research after the fact (Cahalan 1989) revealed that voters for the party out of power were more motivated to return their “straw ballots” than the party in power. In elections before 1936, the Democrats were out of power and more likely to return their ballots, offsetting the Republican sample bias. In 1936, Republicans returned their “straw ballots” at a higher rate than Democrats. Nonresponse bias and sample bias both favored Republicans, and their effects compounded the error rather than canceling out as they had before.

Several lessons that are relevant to modern surveys can be drawn from this cautionary tale from our past. First, it is risky to assume that the effects of bias due to nonresponse—or other sources—will remain constant. Second, low response rates do not mean that nonresponse bias is present, but they leave surveys more vulnerable to its effects if it is present. Third, if we do not understand the response mechanism—that is, the processes and influences on survey participation—we are at risk of being blindsided by its effects. The Literary Digest was the victim of a particularly lethal response mechanism—differential response by voters whose party was out of power—that generated changing nonresponse biases that favored Republicans in 1936 and Democrats in earlier years.

There is also some evidence from recent surveys suggesting that the response mechanism may interact with time or operate differently in different surveys, although nothing so dramatic as the Literary Digest. For example, in the most recent replication of the Pew experiment, there are more Independents in the rigorous, high response survey (Pew 2004), while the earlier experiment found more Independents in the standard survey, although the latter difference was only marginally significant (Keeter et al. 2000: 139). As shown in figure 3, the first Pew experiment found more single person households in the rigorous survey than in the standard survey, while the recent replication found the opposite pattern (Pew 2004).

Figure 3. Percentage of single person households in two nonresponse treatments in 1997 and 2003 surveys. Source: Pew (2004).

The role of nonresponse in an interaction effect such as the one shown in Figure 3 (assuming it is statistically reliable) is uncertain. This difference between experiments in the pattern of bias might result if the standard treatment was implemented differently in the two experiments. However, results such as this provide a caution that we must be alert to the possibility of a dynamic response mechanism that creates changing biases over time or in different surveys. Statisticians emphasize the need to understand the response mechanism in order to understand the effects of nonresponse on the data. Groves and his colleagues have made strides toward understanding causes of nonresponse, but their results are not often introduced to discuss the limitations and biases of particular surveys, to help our hypothetical journalist interpret survey findings.

So—back to our hypothetical journalist—what can or should we tell him or her to help interpret the results of a particular survey? First, we need to provide more general information to allow users to interpret response rate information more intelligently. (Better information should also be provided to clients—see, for example, Colasanto’s 2004 discussion.) We might explain that response rates should not be treated as indicators of survey quality, since many factors that have nothing to do with quality may influence them. We should further explain that the effects of nonresponse on the data depend on the relationship between the variables of interest and the causes of nonresponse. Therefore, it is unwise to use response rates as a measure of nonresponse error of all statistics in any survey, or to judge a low response rate as evidence that some statistics must be harmed by nonresponse error. We should educate survey consumers to expect presentations of survey results to be accompanied by information about the response rate and discussion of the possible effects of nonresponse (and other survey limitations) on the conclusions that can be drawn from the data.

Second, we need to put information about response rates for particular surveys in context by describing more of what we know about the nonresponse causes and effects for that survey.

Third, to report such contextual information, we have to collect it. There need to be more auxiliary studies of the sources and effects of nonresponse for particular surveys or survey series.

Here are a couple of examples of the sort of expanded discussions that might help put response rates in context.

For an exit poll:

“The overall response rate was 52%.”

versus

“The overall response rate was 52%. Response rates were lower in precincts where interviewers were not allowed to stand close to the voting area. There was no, or a very small, relationship between the response rate and the accuracy of an exit poll for a precinct.” (Based on Merkle and Edelman 2002.)

Note that in this example the response rate alone is neither informative nor reassuring. An expanded discussion that draws on auxiliary information helps a user understand one of the causes of nonresponse and indicates that the accuracy of the results was unaffected.

A second example, for a survey of new fathers:

“The final response rate was 80%.”

versus

“The final response rate was 80%. Nonresponding fathers were less likely than responding fathers to have said they would support the child, and more likely to have drug and alcohol problems. These differences limit the generalizability of the results.” (Based on Teitler, Reichman, and Sprachman 2003.)

In this example, the response rate alone is misleading. Because it is high, a user might infer that the data are unaffected by nonresponse bias, which is not the case. The expanded discussion makes the nature of the bias clear so that the user can interpret the results appropriately. Note that describing a nonresponse bias even as serious as this one does not lead one to dismiss or reject the survey; rather, the data are still interesting and informative, once the nature of the bias is understood.

Providing contextual information along with the response rate can lessen the focus on the numerical rate as an isolated factoid and correct the sometimes damaging misperception that information is being withheld. Acknowledging their limitations can make it easier to defend survey results.

There is no reason that more surveys can’t build in auxiliary studies to shed light on the sources and effects of nonresponse. Here, I’m taking a page from the National Center for Education Statistics standard that requires that potential nonresponse bias must be evaluated before survey data can be released whenever survey response falls below a certain level (85 percent in their case). The goal of such studies might be modest, not necessarily to provide a comprehensive or quantitative measurement of nonresponse effects, but rather to shed enough light to be able to say something sensible to help users judge their effects in a survey. Such auxiliary studies might also provide a useful early warning when nonresponse bias really is compromising survey findings. (It’s worth noting that the Literary Digest actually collected information about voters’ choices in the prior election that, had it been analyzed, would have told analysts about the huge Republican bias affecting their forecasts before it led them into survey infamy.) Over time, such adjunct studies might cumulate and contribute to the broader literature on effects of nonresponse.

I am not the first AAPOR president to talk about the need to more squarely address the nonresponse problem. In 1992, Norman Bradburn said, “Until we feel comfortable about our understanding of the characteristics of the nonresponders and incorporate this information into our analyses of the data, we cannot feel comfortable about delivering reports to clients based on surveys with low completion rates” (p. 397). And, he said, “Perhaps we cannot do much to increase the response rates, but we can bring more attention and sophistication to the treatment of data with low completion rates” (p. 397). Notwithstanding some of the research that has been done, it seems to me that we have been slow to take his advice. (Perhaps his point was dulled by his unfortunately sharp metaphor of the guillotine....)

Earlier, in 1989, Warren Mitofsky raised similar issues in his presidential speech. He argued that “It is the responsibility of the researcher to spell out the limitations of his or her own survey when there are limitations” (p. 451). He asserted, “If there are biases in the research design—for good reason or not—the researcher conducting the study [should] be the one to make these design flaws and their consequences known to the consumer of the research” (p. 451). This includes nonresponse and other sources of bias and means going beyond the minimal disclosure standards to discuss how various sources of bias affect the data.

Mitofsky might have added that our claims to be doing science are imperiled if we don’t follow his advice. It is the responsibility of the researcher not only to document the response rate (and other limitations) but also to explain how it may affect the generalizability of the results. The aim, after all, is inference about the population, not the sample.

I don’t know if you have been listening to the fascinating National Public Radio series on South Africa. Nelson Mandela was released from prison in 1990, and in the 14 years since, South Africa has to a remarkable extent eliminated apartheid and begun to address its effects. About the same amount of time has elapsed since Bradburn’s and Mitofsky’s speeches, and yet according to Smith (2002) we still don’t routinely report response rates, much less discuss their effects on the data, as Bradburn and Mitofsky advised. Could routine reporting of response rates really be more difficult and require more time to accomplish than ending apartheid?

I’m not the first AAPOR president to address this issue, but it sure would be great if I were the last. I’m not talking about this because it’s my favorite topic or because I think it’s the most important threat to survey quality. I think that measurement error is a more important source of bias and that we are investing far too little in question development and testing. My reason for talking about it is that we in AAPOR still seem to have a difficult time doing so, as if nonresponse were our dirty little family secret. People often become defensive and evasive when the question is asked. Some organizations require special permissions from higher-ups to release response rate information, which may never be forthcoming. Other organizations document the information in reports available on special request but do not provide it on websites where survey results are presented and discussed. Some organizations that have multiple stages of response report only the rate of response at the final stage, rather than the rate reflecting all the stages of response. Such partial disclosures can put our credibility at risk and create the perception we have something to hide. I think such circumventions also make us more, not less, vulnerable to attack when problems do emerge. They also inhibit the process of learning about and coping with the problem. Nonresponse, along with other limitations on survey results, needs to be discussed routinely, matter-of-factly, and in the context of specific surveys. Such discussions need to routinely accompany presentations of survey results. It’s time we took care of this piece of unfinished business and moved on.