American Association for Public Opinion Research

2017 Presidential Address from the 72nd Annual Conference

It’s you, Survey and Polling Industry, you that I love
 by Roger Tourangeau



Thank you very much. It is a great honor to be speaking with you today!

I’d like to begin with a quotation from a poem by Frank O’Hara. It is the opening passage from his poem, “To the Film Industry in Crisis”:
 
Not you, lean quarterlies and swarthy periodicals
… nor you, promenading Grand Opera,
obvious as an ear (though you are close to my heart), but you,
Motion Picture Industry, it’s you I love!

In times of crisis, we must all decide again and again whom we love.
 
It seems obvious as an ear that the survey and polling business is in crisis. Apart from the recent “failure”— more apparent than real — to predict the outcome of the Presidential election, our industry is faced with two long-term trends that have made doing surveys and polls increasingly difficult. First, response rates have been falling for more than 30 years. As recently as the late 1970s, the University of Michigan’s monthly telephone Survey of Consumer Attitudes was getting a response rate in the 70s. Even high quality face-to-face surveys rarely reach a 70 percent response rate these days. In addition, data collection costs have skyrocketed, reflecting the ever-larger number of contact attempts — whether by mail, telephone, or face-to-face — that are now needed to get someone to complete an interview. And many researchers have abandoned the traditional modes of data collection in favor of much cheaper methods that essentially give up on representativeness — doing robopolls with abysmal response rates or web surveys that are based on non-probability samples. And these problems are not just U.S. problems; the trend is the same everywhere across the developed world. It is such a relief to be nearing retirement age!

My talk is an attempt to examine what has happened and whether it matters. It addresses three questions or paradoxes: 
  • Why are response rates so weakly related to nonresponse bias?
  • Why aren’t alternative indicators of the risk of nonresponse bias much better?
  • Why can’t we solve our problems through responsive and adaptive design?
Let me start with the first of the paradoxes. The mathematics of nonresponse bias is straightforward. The bias in an unadjusted mean or proportion is determined by the average response propensity — that is, the average probability that a sample member will respond to the survey — and the covariation between the response propensities and values on the variable of interest.
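
A minimal Python sketch of that relationship, under assumed notation: write rho_i for sample member i's response propensity and y_i for the value of the survey variable. The expected respondent mean is then roughly sum(rho_i * y_i) / sum(rho_i), and its gap from the full-sample mean equals Cov(rho, y) divided by the mean propensity. The function names and the five-person data are hypothetical, chosen only to make the arithmetic easy to follow.

```python
from statistics import mean

def expected_respondent_mean(y, rho):
    """Propensity-weighted mean: what the respondents' mean tends toward."""
    return sum(r * v for r, v in zip(rho, y)) / sum(rho)

def nonresponse_bias(y, rho):
    """Bias of the unadjusted respondent mean: Cov(rho, y) / mean(rho)."""
    y_bar, rho_bar = mean(y), mean(rho)
    cov = mean((r - rho_bar) * (v - y_bar) for r, v in zip(rho, y))
    return cov / rho_bar

# Hypothetical data (not from any study): 1 = has the trait, 0 = does not.
y = [1, 1, 0, 0, 0]
rho_correlated = [0.6, 0.6, 0.2, 0.2, 0.2]  # propensity rises with y
rho_constant = [0.36] * 5                   # same mean propensity, no variation

bias_correlated = nonresponse_bias(y, rho_correlated)
bias_constant = nonresponse_bias(y, rho_constant)

print(f"bias with correlated propensities: {bias_correlated:.3f}")
print(f"bias with constant propensities:   {bias_constant:.3f}")
```

Both propensity vectors imply the same expected response rate, yet only the one that covaries with y produces bias. That is one way to see why the response rate alone says little about any single estimate.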

Although the bias will vary from one variable to the next, depending on the variable’s correlation with the response propensities, it should be inversely related to the overall response rate. As an empirical matter, though, a number of studies come to a different conclusion. Consider the study by Merkle and Edelman (2002), which looked at the correlation between the response rate to an exit poll at a precinct and the poll’s error in the vote share. Their study examined both signed and unsigned errors in the projected vote share. Across four elections, the largest correlation they report (in absolute terms) is -.13.

Here is the scatterplot from one of the elections they examined.

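[Figure: scatterplot of precinct-level exit poll response rate against error in the vote share, from Merkle and Edelman (2002)]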
In case you were wondering what a zero correlation looks like, here it is. Results don’t get any nuller than this!

Overall, the situation reminds me of the saying, sometimes attributed to the New York Yankee catcher and epistemologist Yogi Berra, that, in theory, theory and practice don’t differ, but in practice they do.
The most influential of these studies is the meta-analysis done by Groves and Peytcheva (2008), which examined 959 bias estimates from 59 studies. The studies used several methods for estimating nonresponse bias; for example, some used frame data and compared the frame data for the full initial sample with the data for just the respondents. They report, “If a naïve OLS regression line were fit [to the 959 estimates] …, the R² would be 0.04,” which implies a correlation between nonresponse bias and nonresponse rate of around 0.20, roughly the same magnitude that Merkle and Edelman found. Here is the key result from Groves and Peytcheva.
       
   
 
     

[Figure: nonresponse bias estimates plotted by study, from Groves and Peytcheva (2008)]
Each column of dots represents a single study and each dot a single bias estimate.  There are some very large biases in this figure. As a public service, I’ve circled two estimates in red that seem to be off by nearly 100 percent!  
My colleague Mike Brick and I recently reexamined these data to see whether there might be a study-level component to the nonresponse bias, as predicted by the equation I showed earlier.  (Our thanks to Bob and Emilia for letting us have at their data!!) Even though there is a high level of within-study variation in the bias estimates, that does not rule out a substantial between-study component of variation as well.  Like Groves and Peytcheva, Brick and I examined the absolute difference between the respondent estimate and the full sample estimate expressed as a percentage of the “truth” (that is, the full sample estimate).

Looking at all 959 bias estimates we find a weak overall correlation between the bias estimates and the response rate — the correlation is -.200, confirming one of the key findings from Groves and Peytcheva. Replication is a beautiful thing!
We looked at the data a couple of different ways, using methods that are standard in the meta-analytic literature. For example, we took an average of the bias estimates from each study and weighted each study by the average sample size on which its bias estimates were based. When we did this, we found correlations closer to .5 than to .2.  This figure is a bubble plot showing the relationship between the study’s response rate and the average bias in the study’s estimates; the size of each bubble reflects the study’s average sample size. The correlation is .49.
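[Figure: bubble plot of each study’s response rate against the average bias in its estimates; bubble size reflects the study’s average sample size]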
So there is a substantial study-level component to nonresponse bias. Other analyses we did showed that variation across studies accounts for about a quarter of the variation in the bias estimates (Brick and Tourangeau, in press). This strongly suggests that nonresponse bias is partly a function of study-level characteristics; and the results in the figure show that one of those study-level characteristics is the response rate.
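
The study-level analysis described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code used for the reanalysis: the rows are invented, the helper names are hypothetical, and a plain weighted Pearson correlation stands in for whatever meta-analytic machinery was actually applied. Each study's bias estimates are averaged, and each study is then weighted by its average sample size.

```python
from math import sqrt
from statistics import mean

# Hypothetical rows: (study id, response rate %, absolute relative bias %, sample size)
estimates = [
    ("A", 85, 1.0, 2000), ("A", 85, 3.0, 2000),
    ("B", 60, 2.0, 1000), ("B", 60, 6.0, 1000),
    ("C", 35, 4.0, 500),  ("C", 35, 10.0, 500),
]

def study_level(rows):
    """Collapse estimate-level rows to one (rate, mean bias, mean n) per study."""
    by_study = {}
    for study, rate, bias, n in rows:
        by_study.setdefault(study, []).append((rate, bias, n))
    return [
        (mean(r for r, _, _ in v), mean(b for _, b, _ in v), mean(n for _, _, n in v))
        for v in by_study.values()
    ]

def weighted_corr(x, y, w):
    """Pearson correlation with observation weights w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / sqrt(vx * vy)

rates, biases, sizes = zip(*study_level(estimates))
r_study = weighted_corr(rates, biases, sizes)
print(f"study-level weighted correlation: {r_study:.2f}")
```

With the response rate on one axis, a negative sign here means that higher response rates go with lower average bias. Averaging within studies removes the within-study noise, which is why a study-level correlation can be much larger in magnitude than the estimate-level one.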

The figure makes it clear that there is a relationship between response rates and nonresponse bias, but it isn’t as strong or as clear as we might have expected.  There must be something about the response propensities that helps explain why the relationship isn’t stronger. Brick and I proposed four response propensity models that provide possible explanations for these weak relationships.

According to the first model, the random propensities model, most of the variation in response propensities is due to a large number of largely transient influences (whether the person happens to be sick at the time of the initial contact, his or her mood on that day, how busy he or she is at work, who in the household answers the door or the telephone, his or her travel schedule, whether the person reads the advance letter or throws it away, whether he or she happens to hit it off with the interviewer, whether the television is on when the telephone rings, and so on). As a result, whether someone takes part in a study may, in effect, be the outcome of a kind of multidimensional coin flip. A theological interpretation of this model is that God loves survey researchers and has made nonresponse a completely random phenomenon!

If this model holds, nonresponse is unlikely to produce many estimates with large nonresponse biases, and the average bias should be small, close to zero. That is because the response propensities are unlikely to be related to any stable respondent characteristics.
The second model, which we call the design-driven propensities model, says that most of the variation in response propensities is driven by study-level design features that are largely unrelated to the characteristics of the sample members. A federal survey that features repeated callbacks and offers people multiple ways to respond is likely to produce a high response rate with relatively limited variation in propensities across different sample members. In such a survey, nearly everyone is likely to respond eventually. The high response rate does not prevent some estimates from having substantial biases, but it would be very unlikely that the biases would be large on average. Similarly, a telephone survey conducted over a very short time period, say two or three days, with no advance letter or incentive is unlikely to produce a high response rate. It is unlikely that anyone has a very high response propensity for such a study or that the response propensities vary much across people. Still, there may be subgroups, such as retired people (and aspiring retired people), who are more likely to be respondents. Thus, some estimates might have relatively large biases and the average of the biases will be greater than if the response rate were higher. But, if response propensities are mostly driven by features of the data collection protocol rather than by characteristics of the sample members, then on average the estimates are likely to have small nonresponse biases. This model may help explain why many low response rate telephone surveys still seem to have relatively modest biases.

Proponents of responsive and adaptive design implicitly subscribe to some version of the design-driven propensities model, using changes in mode, incentive, and contact protocols in an attempt to raise or lower response propensities. We’ll look at studies using these designs in a bit.

A third possibility is that propensities are determined, at least in part, by household or person-level characteristics, but the characteristics most strongly related to response propensities are only weakly related to the bulk of survey variables. In general, in the U.S. at least, older people are more likely to respond than younger people (thank you, senior citizens!), women more likely to respond than men (thank you, women!), residents of suburban and rural areas more likely to respond than city dwellers (thank you, exurbanites), well-educated persons more likely to respond than less-educated persons (thank you, brainiacs!), home owners more likely to respond than renters (we’ll get you, recalcitrant renters, eventually!), people with listed telephone numbers more likely to respond than people without listed numbers, and so on. We refer to this as the demographic-driven propensities model. Again, when this model holds, we expect most estimates to have relatively low levels of nonresponse bias because most survey variables are only weakly related to these demographic characteristics. The estimates based on the few variables that are highly correlated with these household or person-level demographic characteristics are likely to have large biases, but on average the bias would be low. And, because there are good population benchmarks available for many of these demographic variables (thank you, Census Bureau!), weighting schemes can compensate for many of the likely imbalances in the respondent pool.

The fourth model is that response propensities vary both with survey design features and with characteristics of the sample members. This model — the correlated propensities model — implies that some groups are consistently overrepresented among the respondents. Biases for survey variables highly correlated with these characteristics can be large. This may be true regardless of the overall response rate, if the survey protocol results in large overrepresentation of specific groups among the respondents. An example might be the overrepresentation of highly educated respondents in many state election polls; Courtney Kennedy and her election polling task force colleagues argued this was an important contributor to the errors in some of the state polls they examined.
In the Groves and Peytcheva data, the percentage estimates are off by about 2 percentage points on average (this is the mean absolute error in the 804 percentage estimates), suggesting that in general one of the first three models holds. As Bob Groves put it in a recent interview, “We’re sort of lucky.”

Because estimates with a large average nonresponse bias are most likely to occur when the correlated response propensities model holds, the question is when this arises in practice. A number of studies show that a cluster of variables related to a sense of civic obligation (variables like volunteering, altruism, and voting) are related both to survey participation and to estimates based on these variables. As a result, surveys estimating levels of voting or levels of participation in civic activities are at risk for large average biases. For example, I did a study with Bob Groves and Cleo Redline (Tourangeau, Groves, and Redline, 2010) and we found very high levels of nonresponse bias in a survey of adults registered to vote in Maryland. The voters were strongly overrepresented among the survey respondents and the nonvoters were strongly underrepresented; these imbalances produced large overestimates of the level of participation in past elections, biases that were apparent whether we did a mail or a telephone survey.
 
Bias Estimates from Tourangeau et al. (2010), Overall and By Mode of Data Collection

            Entire Sample   Respondents    Respondents        Bias
            (Frame Data)    (Frame Data)   (Survey Reports)   Nonresponse   Measurement
Overall     43.7 (2689)     57.0 (904)     76.0 (895)         13.3          19.0
Telephone   43.2 (1020)     57.4 (350)     79.4 (345)         14.2          22.0
Mail        43.9 (1669)     56.7 (554)     73.8 (550)         12.8          17.1
 
This shows that you can still get very large biases if well-known survey methodologists put their backs into it!! Fans of measurement error will note that the measurement errors are even larger than the errors due to nonresponse. So there’s plenty of work for all of us to go around!
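
The two bias columns in the table follow from simple differences between the three percentage columns, which the short Python check below reproduces. The percentages are the ones quoted in the table; the dictionary layout and the function name are just illustrative choices.

```python
# Percentages quoted in the table from Tourangeau et al. (2010).
# Each row: (entire sample, frame data; respondents, frame data;
#            respondents, survey reports)
rows = {
    "Overall":   (43.7, 57.0, 76.0),
    "Telephone": (43.2, 57.4, 79.4),
    "Mail":      (43.9, 56.7, 73.8),
}

def decompose(full_frame, resp_frame, resp_survey):
    """Split the total error into nonresponse and measurement parts."""
    nonresponse = round(resp_frame - full_frame, 1)   # who responded
    measurement = round(resp_survey - resp_frame, 1)  # what they reported
    return nonresponse, measurement

biases = {label: decompose(*values) for label, values in rows.items()}
for label, (nr, me) in biases.items():
    print(f"{label:<10} nonresponse {nr:5.1f}   measurement {me:5.1f}")
```

Comparing respondents with the full sample on the same frame variable isolates nonresponse; comparing respondents' survey reports with their own frame values isolates measurement error.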
So, let me summarize a bit. We’ve been lucky so far.  Substantial nonresponse biases seem relatively rare. As I noted, the percentage estimates in Groves and Peytcheva are off by about two percentage points on average. Things are generally even better when we weight the data to correct for noticeable imbalances. But, unfortunately, sometimes we are way off or, in the case of the election, off just enough in a few places to make us look really bad!
If the nonresponse rate is at best a weak indicator of the risk of nonresponse bias, are there alternatives that we should be using instead? Two recent proposals have been made. Both of them reflect the key insight that it is not just the mean response propensity but the variation in the propensities across individuals or subgroups that is critical to the risk of nonresponse bias. If the response propensities do not vary, the risk of bias is zero; the propensities cannot covary with the survey variables unless they vary themselves.
Schouten and his Dutch colleagues have proposed the R-indicator as an alternative to the response rate. The R-indicator is a simple function of the standard deviation of the estimated response propensities. In addition, Särndal has proposed imbalance indicators, which measure the distance of the responding sample from a set of population benchmarks; this distance presumably reflects differences in nonresponse (and, in some cases, differences in coverage) rates across different subgroups of the sample.
Both of these alternatives have appealing features. If the propensities are constant (and non-zero), then the threat of bias is eliminated (for all variables). Thus, in principle, the standard deviation of the propensities should be a good indicator of the overall risk of nonresponse bias. The imbalance indicator is, conceptually, a proxy measure for bias; it is based on the observed deviation of the responding sample from the population targets for a set of variables, typically demographic variables.
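
As a rough illustration, the R-indicator is commonly written in that literature as R = 1 - 2 * S(rho), where S(rho) is the standard deviation of the estimated response propensities. The Python sketch below assumes that form and uses the sample standard deviation; the propensity values are hypothetical, and in practice they would come from a fitted response model rather than being typed in.

```python
from statistics import stdev

def r_indicator(propensities):
    """R = 1 - 2 * S(rho); a value of 1 means the propensities do not vary."""
    return 1 - 2 * stdev(propensities)

# Hypothetical estimated propensities for two eight-person samples,
# both with a mean propensity of 0.50.
even = [0.50, 0.52, 0.48, 0.51, 0.49, 0.50, 0.53, 0.47]
uneven = [0.20, 0.80, 0.35, 0.65, 0.10, 0.90, 0.45, 0.55]

r_even = r_indicator(even)
r_uneven = r_indicator(uneven)

print(f"R-indicator, even propensities:   {r_even:.2f}")
print(f"R-indicator, uneven propensities: {r_uneven:.2f}")
```

The two samples have the same expected response rate but very different R-indicators, which is the extra information the indicator is meant to add.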
But both measures have their limitations as well. There are no whole things in life! The weakness of the R-indicator is that it is based on modeled propensities. If the model isn’t very good, the fitted propensities won’t vary very much, giving a falsely reassuring picture of the risk for nonresponse bias. Unfortunately, the auxiliary variables available for modelling response propensities are often not very predictive of survey nonresponse. Imbalance indicators have a similar weakness. Being close on the demographics doesn’t guarantee being close on the survey variables of interest. And both indicators have the problem that they are study-level indicators, and, as Groves and Peytcheva’s data make clear, most of the variation in nonresponse bias is within-study.
Still, both indicators add information to the nonresponse rate and both are worth monitoring. It is not surprising that both were proposed by researchers in countries with population registries, that is, with much richer auxiliary data than is typically available to survey researchers in the U.S. Clearly, we should all move to Sweden or the Netherlands, where the auxiliary data are decent!

Can we raise response rates, reduce the risk of bias, or lower data collection costs by using responsive or adaptive designs? Responsive designs consist of multiple phases, with different data collection protocols in each phase; one goal is to achieve a less biased, more representative sample. Adaptive designs either tailor their data collection protocols for different subgroups from the outset or adapt them continuously throughout the field period via case prioritization. A recent review by my colleagues and me finds that such designs can help, though they are hardly a panacea (Tourangeau, Brick, Lohr, and Li, 2017).

To illustrate these designs, I’ll briefly describe an experiment done in the Netherlands. Luiten and Schouten (2013) carried out a design that featured both two phases and tailoring of the protocol to different subgroups. They classified sample members into quartiles based on their propensity to cooperate (in the first phase) or their likelihood of being contacted by telephone (in the second). In Phase 1, sample members in the highest cooperation quartile were asked to complete a web survey; those in the lowest cooperation quartile were asked to complete a mail survey; and those in the middle were given a choice between the two. In the second phase, interviewers attempted to contact and interview the phase 1 nonrespondents by telephone. The timing of contact attempts depended on the sample members’ contact propensities; in addition, the harder-to-contact cases were assigned to the best interviewers and the easiest-to-contact to the worst. The goal in both phases was to equalize rates across the different propensity groups.
Here are the key results from the study. (The control group is the main production sample for the study, which was subject to the normal data collection protocol.) The sample members who underwent the experimental protocol showed less variation across propensity quartiles in both phases of data collection than the members of the control group.  Accordingly, the R-indicator for the experimental group was significantly higher than that for the control group.
 
 
Cooperation and Contact Rates, by Propensity Quartile and Experimental Group

                                         Cooperation Rates
Cooperation Propensity Quartile          Experimental   Control
Lowest Cooperation Propensity            65.1           62.7
Second Lowest Cooperation Propensity     71.4           68.4
Second Highest Cooperation Propensity    72.8           75.3
Highest Cooperation Propensity           74.7           79.2

                                         Contact Rates
Contact Propensity Quartile              Experimental   Control
Lowest Contact Propensity                87.1           84.2
Second Lowest Contact Propensity         96.6           94.5
Second Highest Contact Propensity        93.7           95.7
Highest Contact Propensity               95.3           96.9
 
     Note: Adapted from Tourangeau et al. (2017). Data from Luiten and Schouten (2013).
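
One quick way to see the equalizing effect is to summarize how far the rates spread across quartiles in each group. The Python sketch below does that with the range and the standard deviation of the four quartile rates. It is only a rough summary, not the R-indicator itself (which is computed from individual propensities), and it assumes that the first four values in each block of the table belong to the experimental group and the last four to the control group.

```python
from statistics import pstdev

# Rates (%) by propensity quartile, lowest to highest, from the table above.
# Assumption: first four values per block = experimental, last four = control.
cooperation = {"experimental": [65.1, 71.4, 72.8, 74.7],
               "control":      [62.7, 68.4, 75.3, 79.2]}
contact = {"experimental": [87.1, 96.6, 93.7, 95.3],
           "control":      [84.2, 94.5, 95.7, 96.9]}

def spread(rates):
    """Return (range, population SD) of the quartile rates, in points."""
    return round(max(rates) - min(rates), 1), round(pstdev(rates), 2)

summary = {
    (outcome, group): spread(rates)
    for outcome, table in (("cooperation", cooperation), ("contact", contact))
    for group, rates in table.items()
}

for (outcome, group), (rng, sd) in summary.items():
    print(f"{outcome:<12} {group:<13} range {rng:5.1f}   SD {sd:5.2f}")
```

Under that reading of the table, both the range and the standard deviation are smaller for the experimental group in each phase, in line with the reported result.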
 
 
Adaptive designs target cases at both ends of the response propensity continuum, but for different reasons. They may use low-cost data collection methods with the easy cases to reduce their propensities, save money, or both. They may put a cap on effort on the most difficult cases to save money or they may increase the effort for these cases in an effort to raise their propensities.

Several things limit the effectiveness of these designs. The main problem is that we often don’t know how to raise the response propensities of the hard cases. In the U.S., three moves are common — shift to a higher propensity mode (for example, from mail to face-to-face), offer larger incentives, or shorten the questionnaire. But there are drawbacks to all three. The higher propensity modes are also more expensive. Incentives show a strong pattern of diminishing returns; the bigger the incentive, the less bang for the buck. And shortening the questionnaire means less data and higher variances for estimates involving the variables dropped from the questionnaire. Often it is easier to lower the response propensities of the high propensity cases by limiting effort on these cases, who are otherwise likely to be overrepresented among the respondents. We do know how to lower propensities—stop trying to reach the case! This almost always works! Unfortunately, giving up on the relatively easy cases may be hard to sell to clients who want the largest possible sample. What is wrong with these clients and their endless demands for more cases!

So here are some conclusions. The response rate is a useful indicator; our analysis shows that there is a significant relationship between nonresponse rates and nonresponse bias. Still, the response rate only reflects the mean of the response propensities. The R-indicator and imbalance indicators are useful additions that also reflect the variation in the response propensities. Both the mean and the variation in the response propensities play important roles in nonresponse bias. The average level of nonresponse bias is related to study-level characteristics, including the response rate. But there is lots of within-study variability in the biases, so any study-level indicator of the risk will always be imperfect at best.
Brick and I outlined four possible models relating response propensities to bias and three of them suggest that bias will, in most cases, be low on average. In any given study, though, all four models are likely to be partially true, suggesting that some estimates may be at risk of larger biases. This conjecture is consistent with Groves and Peytcheva’s results.

In the current environment, we often don’t know how to raise response propensities substantially, especially for the most difficult cases. That means that the best strategy for reducing the risk of nonresponse bias may be to adjust the field work to reduce variation in overall propensities. This is what many adaptive and responsive designs attempt to do.

Let me end by saying that we are faced with a world-wide social and cultural phenomenon that we are only just beginning to understand. Given the long time period over which our problem has developed and its international scope, our usual methods may not be adequate to diagnosing the ailment. That is unfortunate since it is much harder to fix a problem when we don’t really know what the problem is. We need to redouble our efforts to understand why people don’t want to do surveys any more. Surveys and polls are just far too valuable to let them die. It is worth noting that Frank O’Hara died in 1966 and the motion picture industry is still with us. I suspect 50 years from now, there will still be polls and surveys.

In times of crisis, we must all decide again and again whom we love. In my case, it’s you, Survey and Polling Industry, you that I love.

Thank you.