Measurement Error in Online Panel Surveys
A primary interest in social science research is to understand how and why people think, feel, and act in the ways they do. Much of the information that we use to help us describe and explain people’s behavior comes from surveys. One way we gather this information is to ask people about the occurrence of events or experiences using nominal classification (Stevens, 1946; 1951), often in the form of “did the event occur?” or “which event(s) occurred?” We also ask respondents to evaluate their experiences along some underlying quality or dimension of judgment using ordinal, interval, or ratio scales. As an example, when we ask a person to indicate his/her attitude toward a governmental policy, we assume that the attitude can be represented along some dimension of judgment (e.g., ‘Very bad’ to ‘Very good’). In the process, there are a number of sources for potential errors that can influence the accuracy of these kinds of measurements.
Measurement error is commonly defined as the difference between an observed response and the underlying true response. There are two major types of measurement error: random and systematic.
Random error occurs when an individual selects responses other than his/her true response without any systematic direction in the choices made. With random error, a person is just as likely to select a response higher on the continuum than his/her true position as to select one lower.
Systematic measurement error (also known as “bias”) occurs when respondents select responses that are more often in one direction than another and these responses are not their true responses. Random error tends to increase the dispersion of observed values around the average (most often the mean) but does not generally affect the average value when there is a sufficiently large sample. Systematic error may or may not increase the dispersion of values around the average, but will generally shift the measure of central tendency in one direction or another. In general, systematic error is of greater concern to the researcher than is random error.
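To make the distinction concrete, the short simulation below (an illustrative sketch only; the scale values and error magnitudes are arbitrary assumptions, not estimates from any survey) shows how random error inflates dispersion while leaving the mean essentially unchanged, whereas systematic error shifts the mean.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 10_000
true_scores = rng.normal(loc=50, scale=10, size=n)   # hypothetical true attitude scores

# Random error: zero-mean noise with no systematic direction
observed_random = true_scores + rng.normal(loc=0, scale=5, size=n)

# Systematic error (bias): responses pushed, on average, in one direction
observed_biased = true_scores + rng.normal(loc=4, scale=5, size=n)

print("true mean / sd:             %.2f / %.2f" % (true_scores.mean(), true_scores.std()))
print("with random error:          %.2f / %.2f" % (observed_random.mean(), observed_random.std()))
print("with systematic error:      %.2f / %.2f" % (observed_biased.mean(), observed_biased.std()))
# With a large sample, random error leaves the mean essentially unchanged but
# inflates the dispersion; systematic error shifts the mean by roughly the bias (+4).
```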
There are a number of potential causes of measurement error. They include: how the concepts are measured (the questions and responses used – typically referred to as questionnaire design effects), the mode of interview, the respondents, and the interviewers.
Questionnaire Design Effects
The influence of questionnaire design on measurement error has received attention in a number of publications (e.g., Dillman, Smyth, and Christian, 2009; Galesic and Bosnjak, 2009; Lugtigheid and Rathod, 2005; Krosnick, 1999; Groves, 1989; Tourangeau, 1984), and the design of Web questionnaires has introduced a new set of challenges and potential problems. Much of the literature specific to the Web, along with its implications for survey design, has been conveniently summarized in Couper (2008). That literature has demonstrated a wide range of response effects due to questionnaire design and presentation in Web surveys. However, there is no empirical evidence tying those effects to sample source (RDD-recruited, nonprobability-recruited, river, etc.). Although researchers conducting Web surveys should familiarize themselves with the literature on questionnaire design effects, those findings are beyond the scope of this report. The primary concern in this section is the possibility of measurement error arising either from the mode of administration or from the respondents themselves.
Mode Effects
The methodologies employed by online panels involve two shifts away from the most popular methodologies preceding them: (1) the move from interview-administered questionnaires to self-completion questionnaires on computers and (2) in the case of online volunteer panels, the move from probability samples to non-probability samples. A substantial body of research has explored the impact of the first shift, assessing whether computer self-completion yields different results than face-to-face or telephone interviewing. Other studies have considered whether computer self-completion yields different results with nonprobability samples than with probability samples. Some studies combined the two shifts, examining whether computer self-completion by non-probability samples yields different results than face-to-face or telephone interviewing of probability samples.
This section reviews this research. In doing so it considers whether computer self-completion might increase or decrease the accuracy of reports that respondents provide when answering survey questions and how results from non-probability samples compare to those from probability samples in terms of their accuracy in measuring population values.
6 We note that a number of these studies have focused on pre-election polls and forecasting. We view these as a special case and discuss them last.
The Shift from Interviewer Administration to Self-Administration by Computer. In a study by Burn and Thomas (2008) the same respondents answered a set of attitudinal questions both online and by telephone, counter-balancing the order of the modes. The researchers observed notable differences in the distributions of responses to the questions, suggesting that mode alone can affect answers (and perhaps answer accuracy). However, in a similar study by Hasley (1995), equivalent answers were obtained in both modes. So differences between modes may occur sometimes and not others, depending on the nature of the questions and response formats.
Researchers have explored two specific hypotheses about the possible impact of shifting from one mode (interviewer administration) to another (computer self-administration): social desirability response bias and satisficing.
The social desirability hypothesis proposes that in the presence of an interviewer, some respondents may be reluctant to admit embarrassing attributes about themselves and/or may be motivated to exaggerate the extent to which they possess admirable attributes. The risk of having an interviewer frown or sigh when a respondent says he/she cheated on an income tax return, or inadvertently convey a sign of approval when hearing that the respondent gave money to charity, may be the source of such intentional misreporting. An even more subtle influence could be the characteristics of the interviewer. For example, consider a situation in which an interviewer asks a respondent whether he/she thinks that the federal government should do more to ensure that women are paid as much as men are for doing the same work. If a female interviewer asks the question, respondents might feel some pressure to answer affirmatively, because saying so would indicate support for government effort to help a social group to which the interviewer belongs. But if asked the same question by a male interviewer, the respondent might feel no pressure to answer affirmatively and perhaps even the reverse. Thus, the social desirability hypothesis states that respondents may be more honest and accurate when reporting confidentially on a computer than when providing reports orally to an interviewer.
A number of studies have explored the idea that computer self-completion yields more honest reporting of embarrassing attributes or behaviors and less exaggeration of admirable ones. For the most part, this research finds considerable evidence in support of the social desirability hypothesis. However, many of these studies simply demonstrate differences in rates of reporting socially desirable or undesirable attributes, without providing any direct tests of the notion that the differences were due to intentional misreporting inspired by social desirability pressures.
For example, Link and Mokdad (2004, 2005) conducted an experiment in which participants were randomly assigned to complete a questionnaire by telephone or via the Internet. After weighting to yield demographic equivalence of the two samples, the Internet respondents reported higher rates of diabetes, high blood pressure, obesity, and binge drinking, and lower rates of efforts to prevent contracting sexually transmitted diseases, than did those interviewed by telephone. This is consistent with the social desirability hypothesis, assuming that all of these conditions are subject to social desirability pressures. The telephone respondents also reported more smoking than did the Internet respondents, which might seem to be an indication of more honesty on the telephone. However, other studies suggest that adults’ reports of smoking are not necessarily subject to social desirability pressures (see Aguinis, Pierce, and Quigley, 1993; Fendrich, Mackesy-Amiti, Johnson, Hubbell, and Wislar, 2005; Patrick, Cheadle, Thompson, Diehr, Koepsell, and Kinne, 1994).
Mode comparison studies generally have used one of three different designs. A first set of studies (Chang and Krosnick, 2010; Rogers, Willis, Al-Tayyib, Villarroel, Turner, Ganapathi, et al., 2005) have been designed as true experiments. Their designs called for respondents to be recruited and then immediately assigned to a mode, either self-completion by computer or oral interview, making the two groups equivalent in every way, as in all true experiments. A second set of studies (Newman, Des Jarlais, Turner, Gribble, Cooley, and Paone, 2002; Des Jarlais, Paone, Milliken, Turner, Miller, Gribble, Shi, Hagan, and Friedman, 1999; Riley, Chaisson, Robnett, Vertefeuille, Strathdee, and Vlahov, 2001) randomly assigned mode at the sampling stage, that is, prior to recruitment. Because assignment to mode was done before respondent contact was initiated, the response rates in the two modes differed, introducing the potential for confounds in the mode comparisons. In a final series of studies, (Cooley, Rogers, Turner, Al-Tayyib, Willis, and Ganapathi, 2001; Metzger, Koblin, Turner, Navaline, Valenti, Holte, Gross, Sheon, Miller, Cooley, Seage, and HIVNET Vaccine Preparedness Study Protocol Team, 2000; Waruru, Nduati, and Tylleskar, 2005; Ghanem, Hutton, Zenilman, Zimba, and Erbelding, 2005) respondents answered questions both in face-to-face interviews and on computers. All of these studies, regardless of design, found higher reports of socially stigmatized attitudes and behaviors in self-administered computer-based interviews than in face-to-face interviews.
This body of research is consistent with the notion that self-administration by computer elicits more honesty, although there is no direct evidence of the accuracy of those reports (one notable exception being Kreuter, Presser, and Tourangeau, 2008). They are assumed to be accurate because the attitudes and behaviors are assumed to be stigmatized.
The satisficing hypothesis focuses on the cognitive effort that respondents devote to generating their answers to survey questions. The foundational notion here is that providing accurate answers to such questions usually requires that respondents carefully interpret the intended meaning of a question, thoroughly search their memories for all relevant information with which to generate an answer, integrate that information into a summary judgment in a balanced way, and report that judgment accurately. But some respondents may choose to shortcut this process, generating answers more superficially and less accurately than they might otherwise (Krosnick, 1991; 1999). Some specific respondent behaviors generally associated with satisficing include response non-differentiation (“straightlining”), random responding, responding more quickly than would be expected given the nature of the questions and responses (“speeding”), response order effects, and item non-response (elevated use of non-substantive response options such as “don’t know” or simply skipping items).
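Because these behaviors are defined operationally, they are often screened for with simple rules applied to the response data. The sketch below illustrates one way such flags might be computed; the thresholds, codes, and function name are illustrative assumptions rather than standards from the literature.

```python
import numpy as np

def flag_satisficing(grid_responses, completion_seconds, median_seconds,
                     dk_codes=(98, 99), speed_ratio=0.5):
    """Return simple per-respondent satisficing indicators for one grid of items.

    grid_responses: rows = respondents, columns = items in a grid;
                    non-substantive answers ('don't know', refused) coded as dk_codes.
    completion_seconds: total time each respondent took on the questionnaire.
    median_seconds: median completion time, used as a speeding reference point.
    """
    grid = np.asarray(grid_responses, dtype=float)

    # Straightlining (non-differentiation): no variation across the grid items.
    straightlining = np.nanstd(grid, axis=1) == 0

    # Item non-response: share of non-substantive codes across the grid.
    dk_share = np.isin(grid, dk_codes).mean(axis=1)

    # Speeding: finishing in less than some fraction of the median time.
    speeding = np.asarray(completion_seconds) < speed_ratio * median_seconds

    return {"straightlining": straightlining, "dk_share": dk_share, "speeding": speeding}

# Example: three respondents, five grid items each.
grid = [[3, 3, 3, 3, 3],      # straightliner
        [1, 4, 2, 5, 3],
        [99, 2, 99, 99, 4]]   # heavy 'don't know' use
print(flag_satisficing(grid, completion_seconds=[180, 600, 540], median_seconds=540))
```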
Some have argued that replacing an interviewer with a computer for self-administration has the potential to increase the likelihood of satisficing due to the ease of responding (simply clicking responses without supervision). If interviewers are professional and diligent and model their engagement in the process effectively for respondents, this may be contagious and may inspire respondents to be more effortful than they would be without such modeling. Likewise, the presence of an interviewer may create a sense of accountability in respondents, who may feel that they could be asked at any time to justify their answers to questions. Such accountability is believed to inspire more diligent cognitive effort and more accurate answering of questions. Eliminating that accountability may allow respondents to rush through a self-administered questionnaire without reading the questions carefully or thinking thoroughly when generating answers.
Although much of the literature on satisficing has focused on the characteristics of respondents (e.g., male, lower cognitive skills, younger), the demands of the survey task also can induce higher levels of satisficing. Computer-based questionnaires often feature extensive grid response formats (items in rows, responses in columns) and may ask for more responses than would typically be collected in other modes. In addition, some researchers leverage the interactive nature of the online mode to design response tasks and formats (such as slider bars and complex conjoint designs) that may be unfamiliar to respondents or increase respondent burden.
It is also possible that removing interviewers may improve the quality of the reports that respondents provide. As we noted at the outset of this section, interviewers themselves can sometimes be a source of measurement error. For example, if interviewers model only a superficial engagement in the interviewing process and suggest by their non-verbal (and even verbal) behavior that they want to get the interview over with as quickly as possible, this approach may also be contagious and may inspire more satisficing by respondents. When allowed to read and think about questions at their own pace during computer self-completion, respondents may generate more accurate answers. Further, while some have proposed that selection of neutral responses or the use of non-substantive response options reflects lower task involvement, it may be that such choices are more accurate reflections of people’s opinions. People may feel compelled to form an attitude in the presence of an interviewer but not when taking a self-administered questionnaire (Fazio, Lenn, and Effrein, 1984). Selection of these non-substantive responses may also be more detectable in an online survey, where it is made as an explicit response rather than the implicit response it often is in interviewer-administered surveys.
Chang and Krosnick (2010) conducted a true experiment, randomly assigning respondents to complete a questionnaire either on a computer or orally with an interviewer. They found that respondents assigned to the computer condition manifested less non-differentiation and were less susceptible to response order effects.
Other studies not using true random assignment yielded more mixed evidence. Consistent with the satisficing hypothesis, Chatt and Dennis (2003) observed more non-differentiation in telephone interviews than in questionnaires completed online. Fricker, Galesic, Tourangeau, and Yan (2005) found less item non-response among people who completed a survey via computer than among people interviewed by telephone.
On the other hand, Heerwegh and Loosveldt (2008) found more non-differentiation and more ”don’t know” responses in computer-mediated interviews than in face-to-face interviews. Fricker, Galesic, Tourangeau, and Yan (2005) found more non-differentiation in data collected by computers than in data collected by telephone and no difference in rates of acquiescence. Miller (2000; see also Burke, 2000) found equivalent non-differentiation in computer-mediated interviews and telephone interviews. And Lindhjem and Navrud (2008) found equal rates of “don’t know” responses in computer and face-to-face interviewing. Because the response rates in these studies differed considerably by mode (e.g., in Miller’s, 2000, study, the response rate for the Internet completion was one-quarter the response rate for the telephone surveys), it is difficult to know what to make of differences or lack of differences between the modes.
Speed of survey completion is another potential indicator of satisficing. If we assume that rapid completion reflects less cognitive effort then most research reinforces the argument that administration by computer is more prone to satisficing. In a true experiment done in a lab, Chang and Krosnick (2010) found that computer administration was completed more quickly than oral interviewing. In a field study that was not a true experiment, Miller (2000; see also Burke, 2000) described a similar finding. A telephone survey lasted 19 minutes on average, as compared to 13 minutes on average for a comparable computer-mediated survey. In a similar comparison, Heerwegh and Loosveldt (2008) reported that a computer-mediated survey lasted 32 minutes on average, compared to 48 minutes for a comparable face-to-face survey. Only one study, by Christian, Dillman, and Smyth (2008) found the opposite: Their telephone interviews lasted 12 minutes, whereas their computer self-completion questionnaire took 21 minutes on average.
Alternatively, one could argue that speed of completion, in and of itself, compared to completion in other modes is not necessarily an indication that quality suffers in self-administration modes. Perhaps respondents answer a set of questions in a visual self-administered mode more quickly than in an aural format primarily because people can read and process visual information more quickly than they can hear and process spoken language.
Primacy and recency effects are also linked to satisficing. Primacy is the tendency for respondents to select answers offered at the beginning of a list; recency is the tendency for respondents to select answers from among the last options offered. Nearly all published primacy effects have involved visual presentation, whereas nearly all published recency effects have involved oral presentation (see, e.g., Krosnick and Alwin, 1987). Therefore, we would expect computer administration and oral administration to yield opposite response order effects, producing different distributions of responses. Chang and Krosnick (2010) reported just such a finding, although the computer mode was less susceptible to this effect than was oral administration, consistent with the idea that the latter is more susceptible to satisficing.
As the foregoing discussion shows, the research record on the propensity for respondents to satisfice across survey modes is mixed. True experiments show less satisficing in computer self-completion than in telephone or face-to-face interviews. Other studies have not always found this pattern, but those studies were not true experiments and involved considerable confounds with mode. Therefore, it seems reasonable to conclude that the limited available body of evidence supports the notion that there tends to be less satisficing in self-administration by computer than in interviewer administration.
Another way to explore whether interviewer-administered and computer-administered questionnaires differ in their accuracy is to examine concurrent and predictive validity, that is, the ability of measures to predict other measures to which they should be related on theoretical grounds. In their experiment, Chang and Krosnick (2010) found higher concurrent or predictive validity for computer-administered questionnaires than for interviewer-administered questionnaires. However, among non-experimental studies, some found the same pattern (Miller, 2000; see also Burke, 2000), whereas others found equivalent predictive validity for the two modes (Lindhjem and Navrud, 2008).
Finally, some studies have assessed validity by comparing results with nonsurvey measurements of the same phenomena. In one such study, Bender, Bartlett, Rand, Turner, Wamboldt, and Zhang (2007) randomly assigned respondents to report on their use of medications either via computer or in a face-to-face interview. The accuracy of their answers was assessed by comparing them to electronic records of their medication consumption. The data from the face-to-face interviews proved more accurate than the data from computer self-administration.
Overall, the research reported here generally suggests higher data quality for computer administration than for oral administration. Computer administration yields more reports of socially undesirable attitudes and behaviors than does oral interviewing, but there is no evidence that directly demonstrates that the computer reports are more accurate. Indeed, in one study, computer administration compromised accuracy. Research focused on the prevalence of satisficing across modes also suggests that satisficing is less common on computers than in oral interviewing, but more true experiments are needed to confirm this finding. Thus, it seems too early to reach any firm conclusions about the inherent superiority or equivalence of one mode vis-a-vis the other in terms of data accuracy.
The Shift from Interviewer Administration with Probability Samples to Computer Self-Completion with Non-Probability Samples. A large number of studies have examined survey results when the same questionnaire was administered by interviewers to probability samples and online to nonprobability samples (Taylor, Krane, and Thomas, 2005; Crete and Stephenson, 2008; Braunsberger, Wybenga, and Gates, 2007; Klein, Thomas, and Sutter, 2007; Thomas, Krane, Taylor, and Terhanian, 2008; Baker, Zahs, and Popa, 2004; Schillewaert and Meulemeester, 2005; Roster, Rogers, Albaum, and Klein, 2004; Loosveldt and Sonck, 2008; Miller, 2000; Burke, 2000; Niemi, Portney, and King, 2008; Schonlau, Zapert, Simon, Sanstad, Marcus, Adams, Spranca, Kan, Turner, and Berry, 2004; Malhotra and Krosnick, 2007; Sanders, Clarke, Stewart, and Whiteley, 2007; Berrens, Bohara, Jenkins-Smith, Silva, and Weimer, 2003; Sparrow, 2006; Cooke, Watkins, and Moy, 2007; Elmore-Yalch, Busby, and Britton, 2008). Only one of these studies yielded consistently equivalent findings across methods, and many found differences in the distributions of answers to both demographic and substantive questions. Further, these differences generally were not substantially reduced by weighting.
Once again, social desirability is sometimes cited as a potential cause for some of the differences. A series of studies comparing side-by-side probability sample interviewer-administered surveys with nonprobability online panel surveys found that the latter yielded higher reports of:
- Opposition to government help for blacks among white respondents (Chang and Krosnick, 2009);
- Chronic medical problems (Baker, Zahs, and Popa, 2004);
- Motivation to lose weight to improve one’s appearance (Baker, Zahs, and Popa, 2004);
- Feeling sexually attracted to someone of the same sex (Taylor, Krane, and Thomas, 2005);
- Driving over the speed limit (Taylor, Krane, and Thomas, 2005);
- Gambling (Taylor, Krane, and Thomas, 2005);
- Cigarette smoking (Baker, Zahs, and Popa, 2004; Klein, Thomas, and Sutter, 2007);
- Being diagnosed with depression (Taylor, Krane, and Thomas, 2005);
- Consuming beer, wine, or spirits (Taylor, Krane, and Thomas, 2005).
Conversely, compared to interviewer-administered surveys using probability-based samples, online surveys using nonprobability panels have documented fewer reports of:
- Excellent health (Baker, Zahs, and Popa, 2004; Schonlau, Zapert, Simon, Sanstad, Marcus, Adams, Spranca, Kan, Turner, and Berry, 2004; Yeager, Krosnick, Chang, Javitz, Levendusky, Simpser, and Wang, 2009);
- Having medical insurance coverage (Baker, Zahs, and Popa, 2004);
- Being motivated to lose weight for health reasons (Baker, Zahs, and Popa, 2004);
- Expending effort to lose weight (Baker, Zahs, and Popa, 2004);
- Giving money to charity regularly (Taylor, Krane, and Thomas, 2005);
- Doing volunteer work (Taylor, Krane, and Thomas, 2005);
- Exercising regularly (Taylor, Krane, and Thomas, 2005);
- Going to a church, mosque, or synagogue most weeks (Taylor, Krane, and Thomas, 2005);
- Believing in God (Taylor, Krane, and Thomas, 2005);
- Cleaning one’s teeth more than twice a day (Taylor, Krane, and Thomas, 2005).
It is easy to imagine how all the above attributes might be tinged with social desirability implications and that self-administered computer reporting might have been more honest than reports made to interviewers. An alternative explanation is that the people who join online panels are more likely to truly have socially undesirable attributes and to report them accurately. It is also possible that computer self-completion of questionnaires leads to more accidental misreading and mistyping, yielding inaccurate reports of socially undesirable attributes. More direct testing is required to demonstrate whether higher rates of reporting socially undesirable attributes in Internet surveys are due to increased accuracy rather than to these alternative explanations.
Thus, the bulk of this evidence can again be viewed as consistent with the notion that online surveys with nonprobability panels elicit more honest reports, but no solid body of evidence documents whether this is so because the respondents genuinely possess these attributes at higher rates or because the data collection mode elicits more honesty than interviewer-based methods.
As with computer administration generally, some researchers have pointed to satisficing as a potential cause of the differences observed when results from Web surveys using nonprobability online panels are compared with those from probability sample surveys conducted by interviewers. To test this proposition, Chang and Krosnick (2009) administered the same questionnaire via RDD telephone and via a Web survey using a nonprobability online panel. They found that the online survey yielded less non-differentiation, which is consistent with the claim that Web surveys elicit less satisficing.
Market research practitioners often use the term “inattentives” to describe respondents suspected of satisficing (Baker and Downes-LeGuin, 2007). In his study of 20 nonprobability panels, Miller (2008) found an average incidence of nine percent inattentives (or, as he refers to them, “mental cheaters”) in a 20-minute customer experience survey. The maximum incidence for a panel was 16 percent and the minimum 4 percent. He also fielded the same survey online to a sample of actual customers provided by his client and the incidence of inattentives in that sample was essentially zero. These results suggest that volunteer panelists may be more likely to satisfice than online respondents in general.
Thus far in this section we have considered research that might help us understand more clearly why results from nonprobability online panels might differ from those obtained by interviewers from probability samples. Much of this research has compared results from the two methods and simply noted differences, without looking specifically at the issue of accuracy. Another common technique for evaluating the accuracy of results from these different methods has been to compare results with external benchmarks established through non-survey means such as Census data, election outcomes, or industry sales data. In comparisons of nonprobability online panel surveys with RDD telephone and face-to-face probability sample studies, a number of researchers have found the latter two modes to yield more accurate measurements when compared to external benchmarks in terms of voter registration (Niemi, Portney, and King, 2008; though see Berrens, Bohara, Jenkins-Smith, Silva, and Weimer, 2003), turnout (Malhotra and Krosnick, 2007; Sanders, Clarke, Stewart, and Whiteley, 2007), vote choice (Malhotra and Krosnick, 2007; though see Sanders, Clarke, Stewart, and Whiteley, 2007), and demographics (Crete and Stephenson, 2008; Malhotra and Krosnick, 2007; Yeager et al., 2009). Braunsberger, Wybenga, and Gates (2007) reported the opposite finding: greater accuracy online than in a telephone survey.7

Krosnick, Nie, and Rivers (2005) found that while a single telephone RDD sample was off by an average of 4.5 percent from benchmarks, six different nonprobability online panels were off by an average of 5 percent to 12 percent, depending on the nonprobability sample supplier. In an extension of this same research, Yeager et al. (2009) found that the probability sample surveys (whether by telephone or Web) were consistently more accurate than the nonprobability sample surveys even after post-stratification by demographics. Results from a much larger study by the Advertising Research Foundation (ARF) using 17 panels have shown even greater divergence, although release of those results is only in the preliminary phase (Walker and Pettit, 2009).
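The accuracy figures cited in these benchmark studies are typically summaries such as the average absolute difference between weighted survey estimates and the corresponding external benchmarks. A minimal illustration of that calculation follows; the benchmark items and values are invented for demonstration and are not taken from the studies cited.

```python
# Average absolute deviation of survey estimates from external benchmarks,
# the kind of summary statistic reported in benchmark comparison studies.
# All numbers below are invented for illustration (percentages of adults).
benchmarks = {"has_drivers_license": 88.0, "smokes_daily": 17.0, "owns_home": 67.0}
survey_estimates = {"has_drivers_license": 84.5, "smokes_daily": 21.0, "owns_home": 61.0}

errors = [abs(survey_estimates[k] - benchmarks[k]) for k in benchmarks]
mean_abs_error = sum(errors) / len(errors)
print(f"average absolute error: {mean_abs_error:.1f} percentage points")
```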
Findings such as those showing substantial differences among nonprobability online panel suppliers inevitably lead to more questions about the overall accuracy of the methodology. If different firms independently conduct the same survey with nonprobability online samples simultaneously and the various sets of results closely resemble one another, then researchers might take some comfort in the accuracy of the results. But disagreement would signal the likelihood of inaccuracy in some if not most such surveys. A number of studies in addition to those cited in the previous paragraph have arranged for the same survey to be conducted at the same time with multiple nonprobability panel firms (e.g., Elmore-Yalch, Busby, and Britton, 2008; Baim, Galin, Frankel, Becker, and Agresti, 2009; Ossenbruggen, Vonk, and Williams, 2006). All of these studies found considerable variation from firm to firm in the results obtained with the same questionnaire, raising questions about the accuracy of the method.8
Finally, a handful of studies have looked at concurrent validity across methods. Several administered the same questionnaire via RDD telephone interviews and via Web surveys using nonprobability online panels and found evidence of greater concurrent validity and less measurement error in the Internet data (Berrens, Bohara, Jenkins-Smith, Silva, and Weimer, 2003; Chang and Krosnick, 2009; Malhotra and Krosnick, 2007; Thomas, Krane, Taylor, and Terhanian, 2008). Others found no differences in predictive validity (Sanders, Clarke, Stewart, and Whiteley, 2007; Crete and Stephenson, 2008).
In sum, the existing body of evidence shows that online surveys with nonprobability panels elicit systematically different results than probability sample surveys on a wide variety of attitudes and behaviors. Mode effects are one frequently-cited cause for those differences, premised on research showing that self-administration by computer is often more accurate than interviewer administration. But while computer administration offers some clear advantages, the literature to date also seems to show that the widespread use of nonprobability sampling in Web surveys is the more significant factor in the overall accuracy of surveys using this method. The limited available evidence on validity suggests that while volunteer panelists may describe themselves more accurately than do probability sample respondents, the aggregated results from online surveys with nonprobability panels are generally less accurate than those using probability samples.
Although the majority of Web surveys being done worldwide are with nonprobability samples, a small number are being done with probability samples. Studies that have compared results from these latter surveys to RDD telephone surveys have sometimes found equivalent predictive validity (Berrens, Bohara, Jenkins-Smith, Silva, and Weimer, 2003) and rates of satisficing (Smith and Dennis, 2005), and sometimes found higher concurrent and predictive validity and less measurement error, satisficing, and social desirability bias in the Internet surveys, as well as greater demographic representativeness (Chang and Krosnick, 2009; Yeager et al., 2009) and greater accuracy in aggregate measurements of behaviors and attitudes (Yeager et al., 2009).
The Special Case of Pre-Election Polls. Pre-election polls are perhaps the most visible context in which probability sample and non-probability sample surveys compete and can be evaluated against an objective benchmark – specifically, an election outcome. However, as tempting as it is to compare the accuracy of final polls across modes of data collection, one special aspect of this context limits the usefulness of such comparisons. Analysts working in this area must make numerous decisions about how to identify likely voters, how to handle respondents who decline to answer vote choice questions, how to weight data, how to order candidate names on questionnaires, and more, so that differences between polls in their accuracy may reflect differences in these decisions rather than differences in the inherent accuracy of the data collection method. The leading pollsters rarely reveal the details of how they make these decisions for each poll, so it is impossible to take them fully into account.
A number of publications have compared the accuracy of final pre-election polls in forecasting election outcomes (Abate, 1998; Snell et al., 1999; Harris Interactive, 2004, 2008; Stirton and Robertson, 2005; Taylor, Bremer, Overmeyer, Siegel, and Terhanian, 2001; Twyman, 2008; Vavreck and Rivers, 2008). In general, these publications document excellent accuracy for online nonprobability sample polls (with some notable exceptions), in some instances better and in others worse than the accuracy of probability sample polls.
Respondent Effects
No matter how we recruit respondents for our surveys, respondents will vary from each other in terms of their cognitive capabilities, motivations to participate, panel-specific experiences, topic interest and experience, and survey satisficing behaviors. These respondent-level factors can influence the extent of measurement error on an item-by-item basis and over the entire survey as well. While demographic variables may influence respondent effects, other factors likely have greater influence.
Cognitive Capabilities. People who enjoy participating in surveys may have higher cognitive capabilities or a higher need for cognition (Cacioppo and Petty, 1982). If respondents join a panel or participate in a survey based on their cognitive capabilities or needs, the results can differ from those obtained with samples selected independent of cognitive capabilities or needs. For example, in a self-administered survey, people are required to read and understand the questions and responses. The attrition rate of those who have lower education or lower cognitive capabilities is often higher in a paper-and-pencil or Web survey. Further, if the content of the survey is related to the cognitive capabilities of the respondents (e.g., attitudes toward reading newspapers or books), then there may also be significant measurement error. A number of studies have indicated that people who belong to volunteer online panels are more likely to have higher levels of education than those in the general population (Malhotra and Krosnick, 2007). To the extent that this is related to their responses on surveys, such differences may either reduce or increase measurement error depending on the survey topic or target population.
Motivation to Participate. Respondents also can vary in both the types of motivation to participate and in the strength of that motivation. For example, offering five dollars to a respondent to participate in a survey might be more enticing to those who make less money. Others might be more altruistic or curious and like participating in research more than others. Still others may want to participate in order to find out how their opinions compare with those of other people. There may be some people who are more than willing to participate in surveys about political issues but not consumer issues, so topical motivation can vary between respondents. Participation in a survey or in a panel may not be motivated by a single motive but by multiple motives that vary in strength and across time and topics. To the extent that motivation affects who participates and who does not, results may be biased and less likely to reflect the population of interest. While this potentially biasing effect of motivation occurs with other survey methods,9 it may apply even more to online surveys with nonprobability panels, where people have self-selected into the panel to begin with and then can pick and choose the surveys to which they wish to respond. This is especially true when people are made aware of the nature and extent of incentives and even the survey topic prior to their participation by way of the survey invitation.
The use of incentives in particular, whether to induce respondents to join a panel, to maintain their membership in a panel, or to take a particular survey, may lead to measurement error (Jäckle and Lynn, 2008). Respondents may over-report behaviors or ownership of products in order to obtain more rewards for participation in more surveys. Conversely, if they have experienced exceptionally long and boring surveys resulting from their reports of behaviors or ownership of products, they may under-report these things in subsequent surveys.
One type of respondent behavior observed with nonprobability online panels is false qualifying. In the language of online research these respondents are often referred to as “fraudulents” or “gamers” (Baker and Downes-LeGuin, 2007). These are individuals who assume false identities or simply misrepresent their qualifications either at the time of panel registration or in the qualifying questions of individual surveys. Their primary motive is assumed to be incentive maximization. They tend to be seasoned survey takers who can recognize filter questions and answer them in ways that they believe will increase their likelihood of qualifying for the survey. One classic behavior is selection of all options in multiple response qualifying questions; another is overstating purchase authority or span of control in a business-to-business (B2B) survey. Downes-Le Guin, Mechling, and Baker (2006) describe a number of first hand experiences with fraudulent panelists. For example, they describe a study in which the target respondents were both home and business decision makers to represent potential purchasers of a new model of printer. The study was multinational with a mix of sample sources including a U.S. customer list provided by the client, a U.S. commercial panel, a European commercial panel, and an Asian phone sample that was recruited to do the survey online. One multiple response qualifying question asked about the ownership of ten home technology products. About 14 percent of the U.S. panelists reported owning all 10 products, including the Segway Human Transporter, an expensive device known to have a very low incidence (less than 0.1 percent) of ownership among consumers. This response pattern was virtually nonexistent in the other sample sources.
The above examples are within the range of fraudulent reporting reported by Miller (2008). He found an average of about five percent fraudulent respondents across the 20 panels he studied with a maximum of 11 percent on one panel and a minimum of just 2 percent on four others. Miller also points out that while about five percent of panelists are likely to be fraudulent on a high incidence study that number can grow significantly—to as much as 40 percent—on a low incidence study where a very large number of panelist respondents are screened.
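The arithmetic behind that amplification is straightforward: when the screened-for characteristic is rare, even a small share of false qualifiers can account for a large fraction of those who pass the screener. The figures in the sketch below are hypothetical and chosen only to show the mechanism.

```python
# Hypothetical illustration of how fraudulent qualifiers swamp a low-incidence screen.
panelists = 100_000
true_incidence = 0.02       # assume 2% of panelists genuinely qualify
fraud_rate = 0.05           # assume 5% of the remaining panelists falsely claim to qualify

genuine = panelists * true_incidence
fraudulent = panelists * (1 - true_incidence) * fraud_rate
share_fraudulent = fraudulent / (genuine + fraudulent)
print(f"fraudulent share of qualified respondents: {share_fraudulent:.0%}")  # about 71%
```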
Panel Conditioning. The experience of repeatedly taking surveys may lead to some respondents experiencing changes in attitudes or even behaviors as a consequence of survey participation. For example, completing a series of surveys about electoral politics might cause a respondent to pay closer attention to news stories on the topic, to become better informed, and even to express different attitudes on subsequent surveys. Respondents who frequently do surveys about various kinds of products may become aware of new brands and report that awareness in future surveys. This type of change in respondent behavior or attitudes due to repeated survey completion is known as panel conditioning.
Concerns about panel conditioning arise because of the widespread belief that members of online panels complete substantially more surveys than, say, RDD telephone respondents. For example, Miller (2008), in his comparison study of 20 U.S. online panels, found an average of 33 percent of respondents reported taking 10 or more online surveys in the previous 30 days. Over half of the respondents on three of the panels he studied fell into this hyperactive group. One way for panelists to maximize their survey opportunities is by joining multiple panels. A recent study of 17 panels involving almost 700,000 panelists by ARF analyzed multi-panel membership and found either a 40 percent or 16 percent overlap in respondents, depending on how one measures it (Walker and Pettit, 2009). Baker and Downes-LeGuin (2007) report that in general population surveys rates of multi-panel membership (based on self reports) of around 30 percent are not unusual. By way of comparison, they report that on surveys of physicians rates of multipanel membership may be 50 percent or higher, depending on specialty. General population surveys of respondents with few qualifying questions often show the lowest levels of hyperactivity, while surveys targeting lower incidence or frequently surveyed respondents can be substantially higher.
Whether this translates into measurable conditioning effects is still unclear. Coen, Lorch, and Piekarski (2005) found important differences in measures such as likelihood to purchase based on previous survey-taking history, with more experienced respondents generally being less positive than new panel members. Nancarrow and Cartwright (2007) found a similar pattern, although they also found that purchase intention and brand awareness were less affected when the time between surveys was sufficiently long. Other research has found that differences in responses due to panel conditioning can be controlled when survey topics are varied from study to study within a panel (Dennis, 2001; Nukulkij, Hadfield, Subias, and Lewis, 2007).
On the other hand, a number of studies of consumer spending (Bailar, 1989; Silberstein and Jacobs, 1989), medical care expenditures (Corder and Horvitz, 1989), and news media consumption (Clinton, 2001) have found few differences attributable to panel conditioning. Studies focused on attitudes (rather than behaviors) across a wide variety of subjects (Bartels, 2006; Sturgis, Allum, and Brunton-Smith, 2008; Veroff, Hatchett, and Douvan, 1992; Waterton and Lievesley, 1989; Wilson, Kraft, and Dunn, 1989) also have reported few differences.
Completing a large number of surveys might also cause respondents to approach survey tasks differently than those with no previous survey-taking experience. It might lead to “bad” respondent behavior, including both weak and strong satisficing (Krosnick, 1999). Or the experience of completing many surveys might lead to more efficient and more accurate survey completion (Chang and Krosnick, 2009; Waterton and Lievesley, 1989; Schlackman, 1984). However, Toepoel, Das, and van Soest (2008) compared the answering behavior of more experienced panel members with that of less experienced members and found few differences.
Topic Interest and Experience. Respondents’ experience with a topic can influence their reactions to questions about that topic. For example, if a company wants to measure people’s feelings and thoughts about it, researchers often invite people from all backgrounds to take the survey. If, however, the invitations or the survey itself screen respondents on the basis of their experience with the company (e.g., a purchase in the past 30 days), the resulting responses will generally tend to be more positive than if all respondents familiar with the company had been asked to respond. Among the people who have not purchased in the past 30 days, we are more likely to find people who have had negative experiences or who feel less positively toward the company. Therefore, the results from the screened sample will not give as accurate a picture of the company’s reputation as it exists in the general population.
People who join online panels that complete consumer-based surveys may have greater consumer orientation, either through self-selection at the outset or through attrition. If this tendency exists and remains uncorrected, surveys using that panel might yield substantially different results concerning consumer attitudes and product demand among the general population.
While self-selection occurs in both RDD-recruited and nonprobability panels (as it does even for single-occasion randomly-selected samples), self-selection is likely to be a stronger factor for respondents in nonprobability panels, since there is strong self-selection at the first stage of an invitation to join the panel and again at the single-study stage, where the survey topic is sometimes revealed. Stronger self-selection can also yield respondents with a higher likelihood of being significantly different from all possible respondents in the larger population of interest. People who join panels voluntarily can differ from a target population in a number of ways (e.g., they may have less concern about their privacy, be more interested in expressing their opinions, be more technologically interested or experienced, be more involved in the community or political issues). For a specific study sample, this may be especially true when the topic of the survey is related to how the sample differs from the target population. For example, results from a survey assessing people’s concerns about privacy may be significantly different in a volunteer panel than in the target population (Couper, Singer, Conrad, and Groves, 2008). For nonprobability online panels, attitudes toward technology may be more positive than in the general population, since respondents are typically recruited from those who already have a computer and spend a good deal of time online. As a consequence, a survey concerning government policies toward improving computing infrastructure in the country may yield more positive responses in a Web nonprobability panel than in a sample drawn at random from the general population (Duffy et al., 2005). Chang and Krosnick (2009) and Malhotra and Krosnick (2007) found that in surveys using nonprobability panels, respondents were more interested in the topic of the survey (politics) than were respondents in face-to-face and telephone probability sample surveys.
6 We do not review an additional large literature that has compared paper-and-pencil self-completion to other modes of data collection (e.g., interviewer administration, computer self-completion).
7 Braunsberger et al. (2007) did not state whether their telephone survey involved pure random digit dialing – they said it involved “a random sampling procedure” from a list “purchased from a major provider of such lists” (p. 761). And Braunsberger et al. (2007) did not describe the source of their validation data.
8 A series of studies at first appeared to be relevant to the issues addressed in this literature review, but closer inspection revealed that their data collections were designed in ways that prevented them from clearly addressing the issues of interest here (Boyle, Freeland, & Mulvany, 2005; Schillewaert & Meulemeester, 2005; Gibson & McAllister, 2008; Jackman, 2005; Stirton & Robertson, 2005; Kuran & McCaffery, 2004, 2008; Elo, 2010; Potoglou & Kanaroglou, 2008; Duffy, Terhanian, Bremer, & Smith, 2005; Sparrow & Curtice, 2004; Marta-Pedroso, Freitas, & Domingos, 2007)
9 For example, people who answer the phone and are willing to complete an interview may be substantially different (e.g., older, more likely to be at home, poorer, more altruistic, more likely to be female, etc.) than those who do not.
Sample Adjustments to Reduce Error and Bias
While there may be considerable controversy surrounding the merits and proper use of nonprobability online panels, one thing virtually everyone agrees on is that the panels themselves are not representative of the general population. This section describes three techniques sometimes used to attempt to correct for this deficiency, with the goal of making results projectable to the general population.
Purposive Sampling
Purposive sampling is a non-random selection technique that has as its goal a sample that is representative of a defined target population. Anders Kiaer generally is credited with first advancing this sampling technique at the end of the 19th century with what he called “the representative method.” Kiaer argued that if a sample is representative of a population for which some characteristics are known, then that sample also will be representative of other survey variables (Bethlehem and Stoop, 2007). Kish (1965) used the term judgment sampling to convey the notion that the technique relies on the judgments of experts about the specific characteristics needed in a sample for it to represent the population of interest. It presumes that an expert can make choices about the relationship between the topic of interest and the key characteristics that influence responses and their desired distributions based on knowledge gained in previous studies.
The most common form of purposive sampling is quota sampling. This technique has been widely used for many years in market and opinion research as a means to protect against nonresponse in key population subgroups and to reduce costs. Quotas typically are defined by a small set of demographic variables (age, gender, and region are common) and other variables thought to influence the measures of interest.
Purposive sampling is widely used by online panel companies to offer samples that correct for known biases in the panel itself. In the most common form, the panel company provides a “census-balanced sample,” that is, a sample drawn to conform to the overall population demographics (typically age, gender, and perhaps race) as measured by the U.S. Census. Individual researchers may request that the sample be stratified by other characteristics, or they may implement quotas at the data collection stage to ensure the achieved sample meets their requirements.
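As a rough sketch of what such a census-balanced draw involves, the following code selects panel members so that the achieved sample matches target proportions on a single characteristic; the category labels, target shares, and data structure are placeholders, not actual Census figures or any vendor’s procedure.

```python
import random

# Target proportions for a "census-balanced" draw; placeholder values, not real Census figures.
age_targets = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

def draw_census_balanced(panel_members, targets, n):
    """Draw a sample whose age-group mix matches the target proportions."""
    sample = []
    for group, share in targets.items():
        pool = [m for m in panel_members if m["age_group"] == group]
        quota = round(n * share)
        # If the panel has fewer members in a group than the quota, take all of them.
        sample.extend(random.sample(pool, min(quota, len(pool))))
    return sample

# panel_members would be a list of dicts such as {"id": 123, "age_group": "35-54", ...};
# drawing 1,000 cases: draw_census_balanced(panel_members, age_targets, 1_000)
```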
More aggressive forms of purposive sampling use a wider set of both attitudinal and behavioral measures in sample selection. One advantage of panels is that the panel company often knows a good deal about individual members via profiling and past survey completions, and this information can be used in purposive selection. For example, Kellner (2008) describes the construction of samples for political polls in the UK that are drawn to ensure not just a demographic balance but also “the right proportions of past Labour, Conservative, and Liberal Democrat voters and also the right number of readers of each national newspaper.”
The use of purposive sampling and quotas, especially when demographic controls are used to set the quotas, is the basis on which results from online panel surveys are sometimes characterized as being “representative.”
The merits of purposive or quota sampling versus random probability sampling have been debated for decades and will not be reprised here. However, worthy of note is the criticism that purposive sampling relies on the judgment of an expert and so to a large degree the quality of the sample in the end depends on the soundness of that judgment. Where nonprobability online panels are concerned, there appears to be no research that focuses specifically on the reliability and validity of the purposive sampling aspects of online panels when comparing results with those from other methods.
Model-Based Methods
Probability sampling has a rich tradition, a strong empirical basis, and a well-established theoretical foundation, but it is by no means the only statistical approach to making inferences. Many sciences, especially the physical sciences, have rarely used probability sampling methods and yet they have made countless important discoveries using statistical data. These studies typically have relied on statistical models and assumptions and might be called model-based.
In the survey realm, small area estimation methods (Rao, 2003) have been developed to produce estimates for areas for which there are few or no observations. Prediction-based methods and Bayesian methods that either do not require probability sampling or ignore the sampling weights at the analysis stage have also been proposed (Valliant et al., 2001).
Epidemiological studies (Woodward, 2004) may be closely related to the types of studies that employ probability sampling methods. These studies often use some form of matching and adjustment methods to support inferences rather than relying on probability samples of the full target population (Rubin, 2006). An example is a case-control study in which controls (people without a specific disease) are matched to cases (people with the disease) to make inferences about the factors that might cause the disease.
Some online panels use approaches that are related to these methods. The most common approach among online panels has been to use propensity or other models to make inferences to the target population. At least one online panel (Rivers, 2007) has adopted a sample matching method for sampling and propensity modeling to make inferences, in a manner closely related to the methods used in epidemiological studies.
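One commonly described version of propensity adjustment fits a model of the probability that a case comes from the volunteer panel rather than from a reference probability sample, and then weights panel cases by the inverse of the estimated propensity odds. The sketch below is illustrative only; the covariates, the scikit-learn implementation, and the normalization are assumptions, not the procedure used by any particular panel.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

def propensity_weights(panel_df, reference_df, covariates):
    """Estimate propensity-based weights for a nonprobability panel sample.

    panel_df:     respondents from the volunteer online panel
    reference_df: respondents from a reference probability sample
    covariates:   columns (e.g., demographic and "webographic" items) present in both
    """
    combined = pd.concat([panel_df[covariates], reference_df[covariates]], ignore_index=True)
    in_panel = np.r_[np.ones(len(panel_df)), np.zeros(len(reference_df))]

    # Model the probability that a case comes from the panel rather than the reference sample.
    model = LogisticRegression(max_iter=1000).fit(combined, in_panel)
    p_panel = model.predict_proba(panel_df[covariates])[:, 1]

    # Weight panel cases by the inverse odds of panel membership, then rescale to the sample size.
    weights = (1 - p_panel) / p_panel
    return weights * len(panel_df) / weights.sum()
```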
Online panels are relatively new, and these ideas are still developing. Clearly, more theory and empirical evidence are needed to determine whether these approaches can provide valid inferences that meet the goals of the users of the data. Major hurdles facing nonprobability online panels relate to the validity and reproducibility of the inferences from these sample sources. To continue the epidemiological analogy, (external) validity refers to the ability to generalize the results from the study beyond the study subjects to the population, while reproducibility (internal validity) refers to the ability to derive consistent findings within the observation mechanism. Since many users of nonprobability online panels expect the results to be reproducible and to generalize to the general population, the ability of these panels to meet these requirements is key to their utility.
In many respects the challenges for nonprobability panels are more difficult than those faced in epidemiological studies. All panels, even those that are based on probability samples, are limited in their ability to make inferences to dynamic populations. Changes in the population and attrition in the panel may affect the estimates. In addition, online panels are required to produce a wide variety of estimates, as opposed to modeling a very specific outcome such as the incidence of a particular disease in most epidemiological studies. The multi-purpose nature of the requirement significantly complicates the modeling and the matching for the panel.
Post-survey Adjustment
Without a traditional frame, the burden of post-survey adjustment for online nonprobability panels is much greater than in surveys with random samples from fully defined frames. In probability sample surveys, the gap between the respondents and the sample (which arises from nonresponse) is addressed through weighting procedures that give less weight to respondents from groups with high response rates and more weight to respondents from groups with low response rates. The gap between the sample and the sampling frame is handled through well-established probability theory. The gap between the sampling frame and the target population is handled by using full target population counts from censuses or other sources, in an attempt to repair omissions from the frame.
Although a researcher working with a sample from an online volunteer panel may have counts or estimates for the full target population, there is no well-defined frame from which the respondents emerge. For that reason, post-survey adjustments to results from online panels take on the burden of moving directly from the respondents to the full target population.
Weighting Techniques. For all surveys, regardless of mode or sample source, there is a chance that the achieved sample or set of respondents may differ from the target population of interest in nonignorable ways. This may be due to study design choices (e.g., the choice of the frame; analysis goals that require over-sampling) or to factors not easily controlled (e.g., the coverage of any given frame, nonresponse to the survey). Weights are often used to adjust survey data to help mitigate these compositional differences.
Compositional differences are an indication of possible bias. If people differ in their responses based upon some set of underlying characteristics and the people are not represented in their true proportions based upon these characteristics, estimates obtained will be biased or not representative of the population as a whole.
There are three main reasons why people might not be represented in their proper proportions in the survey, and weighting adjustments may be needed to compensate for this over- or underrepresentation. First, weights may be needed to compensate for differences in the selection probabilities of individual cases. Second, weights can help compensate for subgroup differences in response rates: even if the sample selected is representative of the target population, differences in response rates can compromise representation without adequate adjustments.
For either of the above situations, weighting adjustments can be made using information from the sample frame itself. However, even when these types of weights are used, the sample of respondents may still deviate from known population characteristics, which leads to a third type of weighting adjustment.
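To make the first two adjustments concrete, the following minimal sketch (with entirely hypothetical strata, selection probabilities, and response rates) computes a base weight as the inverse of the selection probability and then applies an inverse response-rate adjustment within weighting classes.

```python
# Minimal sketch of frame-based weighting adjustments (hypothetical values).
# Base weight = 1 / selection probability; nonresponse adjustment = 1 / class response rate.
selection_prob = {"A": 0.02, "B": 0.01}        # assumed per-stratum selection probabilities
class_response_rate = {"A": 0.50, "B": 0.25}   # assumed per-stratum response rates

respondents = [
    {"id": 1, "stratum": "A"},
    {"id": 2, "stratum": "B"},
    {"id": 3, "stratum": "B"},
]

for r in respondents:
    base = 1.0 / selection_prob[r["stratum"]]          # compensates for unequal selection
    nr_adj = 1.0 / class_response_rate[r["stratum"]]   # compensates for differential response
    r["weight"] = base * nr_adj

print(respondents)
```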
The third type of weight involves comparing the sample characteristics to an external source of data that is deemed to have a high degree of accuracy. For surveys of individuals or households this information often comes from sources such as the U.S. Census Bureau. This type of weighting adjustment is commonly referred to as a post-stratification adjustment, and it differs from the first two types of weighting procedures in that it utilizes information external to the sample frame.
Online panels can have an underlying frame that is either probability based or nonprobability based. If the frame is probability based, all of the weighting methods mentioned above might apply and weights could be constructed accordingly.
Things are different for a frame that is not probability based. Although cases may be selected at different rates from within the panel, knowing these probabilities tells us nothing about the true probabilities of selection from the target population. The same basic problem holds for subgroup response rates: although they can often be measured for online panels, it is difficult, as with selection probabilities, to tie them to the target population. For these reasons, weights for nonprobability panels typically rely solely upon post-stratification adjustments to external population targets.
The most common techniques for making an online panel more closely mirror the population at large occur either at the sample selection stage or after all data have been collected. At the selection stage, panel administrators may use purposive sampling techniques to draw samples that match the target population on key demographic measures. Panel administrators also may account for and adjust for variation in response rates (based upon previous studies) related to these characteristics. The researchers may place further controls on the make-up of the sample through the use of quotas. Thus, a sample selected from this panel and fielded will yield a set of respondents that more closely matches the target population than a purely "random" sample from the online panel.
After data collection, post-stratification can be done by a weighting adjustment. Post-stratification can take different forms, the two most common of which are: (1) cell-based weighting, where one variable or a set of variables is used to divide the sample into mutually exclusive categories or cells, with adjustments made so that the sample proportions in each cell match the population proportions; and (2) marginal-based weighting, whereby the sample is adjusted so that the weighted marginal distribution of each variable matches the corresponding population targets. For example, assume a survey uses three variables in the weighting adjustment: age (18-40 years old, 41-64 years old, and 65 years old or older), sex (male and female), and race/ethnicity (Hispanic, non-Hispanic white, non-Hispanic black, and non-Hispanic other race). Cell-based weighting will use 24 (3*2*4) cross-classified categories, where the weighted sample total of each category (e.g., the total number of Hispanic 41-64 year old males) will be projected to the known target population total. By contrast, marginal-based weighting, which is known by several names including iterative proportional fitting, raking, and rim weighting, will make adjustments to match the respective marginal proportions for each category of each variable (e.g., Hispanic).

Post-stratification relies on the assumption that people with similar characteristics on the variables used for weighting will have similar response characteristics for other items of interest. Thus, if samples can be put into their proper proportions, the estimates obtained from them will be more accurate (Berinsky, 2006). Work by Dever, Rafferty, and Valliant (2008), however, suggests that post-stratification based on standard demographic variables alone will likely fail to adjust adequately for all the differences between those with and without Internet access at home, although with the inclusion of sufficient variables they found that statistical adjustments alone could eliminate coverage bias. Their study did not address the additional differences associated with belonging to a nonprobability panel. A study by Schonlau and colleagues (2009) casts further doubt on using only a small set of variables in the adjustment.
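The sketch below illustrates the two forms of post-stratification just described. The respondent file and population targets are hypothetical, only two of the three weighting variables are used for brevity, and the cell-based version assumes (purely for illustration) that the cross-classified population proportions are the product of the margins; in practice the actual cross-classified counts would be used.

```python
import pandas as pd

# Hypothetical respondent file and population margins (illustration only).
resp = pd.DataFrame({
    "age": ["18-40", "18-40", "41-64", "65+", "41-64", "18-40"],
    "sex": ["M", "F", "F", "M", "M", "F"],
})
age_targets = {"18-40": 0.40, "41-64": 0.40, "65+": 0.20}   # assumed population margins
sex_targets = {"M": 0.49, "F": 0.51}

# (1) Cell-based weighting: weight = target cell share / sample cell share.
# The independence assumption below is a stand-in for real cross-classified counts.
cell_targets = {(a, s): age_targets[a] * sex_targets[s]
                for a in age_targets for s in sex_targets}
sample_share = resp.groupby(["age", "sex"]).size() / len(resp)
resp["cell_weight"] = [cell_targets[(a, s)] / sample_share[(a, s)]
                       for a, s in zip(resp["age"], resp["sex"])]

# (2) Marginal-based weighting (raking / iterative proportional fitting):
# repeatedly rescale weights so each weighted margin matches its target.
resp["rake_weight"] = 1.0
for _ in range(25):                                         # a few iterations usually suffice
    for var, targets in (("age", age_targets), ("sex", sex_targets)):
        margins = resp.groupby(var)["rake_weight"].sum() / resp["rake_weight"].sum()
        resp["rake_weight"] *= resp[var].map(lambda c: targets[c] / margins[c])

print(resp)
```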
Propensity Weighting. Weighting based on propensity score adjustment is another technique used in an attempt to make online panels selected as nonprobability samples more representative of the population. Propensity score adjustment was first introduced as a post-hoc approach to alleviate the confounding effects of the selection mechanism in observational studies by achieving a balance of covariates between comparison groups (Rosenbaum and Rubin, 1983). It is widely used in biostatistical applications involving quasi-experimental designs in an attempt to equate non-equivalent groups, and it has its origin as a statistical solution to the selection problem (Caliendo and Kopeinig, 2008). It has been adopted in survey statistics mainly for weighting adjustment of telephone, mail, and face-to-face surveys (Lepkowski et al., 1989; Czajka et al., 1992; Iannacchione et al., 1991; Smith et al., 2001; Göksel et al., 1991; Garren and Chang, 2002; Duncan and Stasny, 2001; Lee and Valliant, 2009), but not necessarily for sample selection bias issues.
Propensity score weighting was first introduced for use in online panels by Harris Interactive (Taylor, 2000, Terhanian and Bremer, 2000) and further examined by Lee and her colleagues (Lee, 2004, 2006; Lee and Valliant, 2009), Schonlau and his colleagues (Schonlau et al. 2004), Loosveldt and Sonck (2008), and others. Its purpose is to use propensity score models to reduce or eliminate the selection biases in samples from nonprobability panels by aligning the distributions of certain characteristics (covariates) within the panel to those of the target population.
Propensity score weighting differs from the traditional weighting techniques in two respects. First, it is based on explicitly specified models. Second, it requires the use of a supplemental or reference survey that is probability based. The reference survey is assumed to be conducted in parallel with the online survey, covering the same target population and survey period. A reference survey is also expected to have better coverage and sampling properties and higher response rates than the online survey. Furthermore, it is assumed that there are no measurement error differences between the reference survey and the online survey. For instance, the reference survey may be conducted using traditional survey modes, such as RDD telephone in Harris Interactive's case (Terhanian and Bremer, 2000). The reference survey must include a set of variables that are also collected in the online surveys. These variables are used as covariates in the propensity models. The hope is to use the strength of the reference survey to reduce selection biases in the online panel survey estimates. Schonlau, van Soest, and Kapteyn (2007) give an example of this.
By using data combining both the reference and online panel surveys, a model (typically logistic regression) is built to predict whether a sample case comes from the reference survey or the online survey. The covariates in the model can include items similar to the ones used in post-stratification, but other items are usually included that relate more closely to the likelihood of being on an online panel. Furthermore, propensity weighting can utilize not only demographic characteristics but attitudinal characteristics as well. For example, people's opinions about current events can be used, as these might relate to a person's likelihood of choosing to be on an online panel. Because the technique requires a reference survey, it can use items that are not available from traditional population summaries such as the decennial census.
Once the model is developed, each case can be assigned a predicted propensity score of being from the reference sample (a predicted propensity of being from the online sample could also be used). The combined sample cases are then divided into equal-sized groups based upon their propensity scores. (One might also consider using only the reference sample cases to define this division.) Ideally, all units in a given subclass will have about the same propensity score or, at least, the range of scores in each class will be fairly narrow. Based on the distribution of the reference sample cases across these groups, the online sample cases are then assigned adjustment factors that can be applied to weights reflecting their selection probabilities.
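A minimal sketch of this subclassification approach is given below. It uses simulated covariates rather than real reference or panel data; the logistic regression, the choice of five propensity classes, and the form of the adjustment factor (reference-sample share over panel share within a class) are illustrative assumptions, not a description of any particular vendor's implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Simulated covariates (e.g., demographics plus attitudinal items) for a
# probability-based reference survey and a nonprobability online panel survey.
X_ref = rng.normal(0.0, 1.0, size=(1000, 3))
X_web = rng.normal(0.3, 1.0, size=(2000, 3))                 # panel differs systematically

X = np.vstack([X_ref, X_web])
is_ref = np.r_[np.ones(len(X_ref)), np.zeros(len(X_web))]    # 1 = reference case

# Model the probability that a case comes from the reference survey.
score = LogisticRegression().fit(X, is_ref).predict_proba(X)[:, 1]

# Divide the combined sample into five (roughly) equal-sized propensity classes.
cuts = np.quantile(score, [0.2, 0.4, 0.6, 0.8])
classes = np.searchsorted(cuts, score)

# Within each class, give panel cases an adjustment factor so their weighted
# distribution across classes mirrors the reference survey's distribution.
web_mask = is_ref == 0
factors = {}
for c in range(5):
    in_c = classes == c
    ref_share = (is_ref[in_c] == 1).sum() / (is_ref == 1).sum()
    web_share = (web_mask & in_c).sum() / web_mask.sum()
    factors[c] = ref_share / web_share   # classes with few panel cases need care in practice

web_adjustments = np.array([factors[c] for c in classes[len(X_ref):]])
print({c: round(f, 2) for c, f in factors.items()})
```

These factors would then be applied on top of any weights reflecting within-panel selection probabilities, as described above.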
Propensity score methods can be used alone or along with other methods, such as post-stratification. Lee and Valliant (2009) showed that weights that combine propensity score and calibration adjustments with general regression estimators were more effective than weighting by propensity score adjustments alone for online panel survey estimates that have a sample selection bias.
Propensity weighting still suffers from some of the same problems as more traditional weighting approaches and adds a few of its own. The reference survey needs to be of high quality, yet to reduce cost a single reference study is often used to adjust a whole set of surveys. The selection of items to be used for the model is critical and can depend on the topic of the survey. The reference study is often done with a different mode of administration, such as a telephone survey; this can complicate the modeling process if there are mode effects on responses, though items can be selected or designed to function equivalently in different modes. Moreover, the bias reduction from propensity score adjustment comes at the cost of increased variance in the estimates, thereby decreasing effective sample sizes and estimate precision (Lee, 2006). When the propensity score model is not effective, it can increase variance without decreasing bias, increasing the overall error in survey estimates. Additionally, the current practice for propensity score adjustment for nonprobability online panels is to treat the reference survey as though it were not subject to sampling error, although reference surveys typically have small sample sizes. If the sampling error of the reference survey estimates is not taken into account, the precision of the online panel survey estimates using propensity score adjustment will be overstated (Bethlehem, 2009).
While propensity score adjustment can be applied to reduce biases, there is no simple approach for deriving variance estimates. As discussed previously, because online panel samples are not selected according to randomization theory, the variance estimates cannot be interpreted as repeated-sampling variances. Rather, they should be considered as reflecting the variance with respect to an underlying structural model that describes the volunteering mechanism and the dependence of a survey variable on the covariates used in adjustment. Lee and Valliant (2009) showed that naïvely using variance estimators derived from probability sampling may lead to serious underestimation of the variance, inflating Type I error rates. Also, when propensity score weighting is not effective in reducing bias, estimated variances are likely to have poor properties, regardless of the variance estimator used.
The Industry-wide Focus on Panel Data Quality
Over the last four or five years there has been a growing emphasis in the market research sector on online panel data quality (Baker, 2008). A handful of high-profile cases in which online survey results did not replicate despite use of the same questionnaire and the same panel caused deep concern among some major client companies in the market research industry. One of the most compelling examples came from Kim Dedeker, Vice President for Global Consumer and Market Knowledge at Procter and Gamble, who announced at the 2006 Research Industry Summit on Respondent Cooperation, "Two surveys a week apart by the same online supplier yielded different recommendations … I never thought I was trading data quality for cost savings." At the same time, researchers working with panels on an ongoing basis began uncovering some of the troubling behaviors among panel respondents described in Section 5 (Downes-LeGuin, Mechling, and Baker, 2006). As a consequence, industry trade associations, professional organizations, panel companies, and individual researchers have all focused on the data quality issue and developed differing responses to deal with it.
Initiatives by Professional and Industry Trade Associations
All associations in the survey research industry share a common goal of encouraging practices that promote quality research and the credibility of results in the eyes of consumers of that research, whether clients or the public at large. Virtually every association, nationally and worldwide, has incorporated some principles for conducting online research into its codes and guidelines. Space limitations make it impossible to describe them all here, and so we note just four that seem representative.
The Council of American Survey Research Organizations (CASRO) was the first U.S.-based association to modify its "Code of Standards and Ethics for Survey Research" to include provisions specific to online research. A section on Internet research generally was added in 2002 and was revised in 2007 to include specific clauses relative to online panels. The portion of the CASRO code related to Internet research and panels is reproduced in Appendix A.
One of the most comprehensive code revisions has come from ESOMAR. Originally the European Society for Opinion and Marketing Research, the organization has taken on a global mission and now views itself as "the world association for enabling better research into markets, consumers, and societies." In 2005 ESOMAR developed a comprehensive guideline titled "Conducting Market and Opinion Research Using the Internet" and incorporated it into its "International Code on Market and Social Research." As part of that effort ESOMAR developed its "25 Questions to Help Research Buyers." This document was subsequently revised and published in 2008 as "26 Questions to Help Research Buyers of Online Samples." The questions are grouped into seven categories:
- Company profile;
- Sources used to construct the panel;
- Recruitment methods;
- Panel and sample management practices;
- Legal compliance;
- Partnership and multiple panel partnership;
- Data quality and validation.
The document specifies the questions a researcher should ask of a potential online panel sample provider along with a brief description of why the question is important. It is reproduced in Appendix B.
The ISO Technical Committee that developed ISO 20252 – Market, Opinion and Social Research also developed and subsequently deployed in 2009 an international standard for online panels, ISO 26362 – Access Panels in Market, Opinion, and Social Research (International Organization for Standardization, 2009). Like the main 20252 standard, ISO 26362 requires that panel companies develop, document, and maintain standard procedures in all phases of their operations and that they willingly share those procedures with clients upon request. The standard also defines key terms and concepts in an attempt to create a common vocabulary for online panels. It further details the specific kinds of information that a researcher is expected to disclose or otherwise make available to a client at the conclusion of every research project.
Finally, in 2008 the Advertising Research Foundation (ARF) established the Online Research Quality Council which in turn designed and executed The Foundations of Quality research project. The goal of the project has been to provide a factual basis for a new set of normative behaviors governing the use of online panels in market research. With data collection complete and analysis ongoing the ARF has turned to implementation via a number of test initiatives under the auspices of their Quality Enhancement Process (QeP). It is still too early to tell what impact the ARF initiative will have.
AAPOR has yet to incorporate specific elements related to Internet or online research into its code. However, it has posted statements on representativeness and margin of error calculation on its Web site. These are reproduced in Appendix C and D.
Panel Data Cleaning
Both panel companies and the researchers who conduct online research with nonprobability panels have developed a variety of elaborate and technically sophisticated procedures to remove “bad respondents.” The goal of these procedures, in the words of one major panel company (MarketTools, 2009), is to deliver respondents who are “real, unique, and engaged.” To be more specific, this means taking whatever steps are necessary to ensure that all panelists are who they say they are, that the same panelist participates in the same survey only once, and that the panelist puts forth a reasonable effort in survey completion.
Eliminating Fraudulents. Assuming a false identity or multiple identities on the same panel is one form of fraud or misrepresentation. Validating the identities of all panelists is a responsibility that typically resides with the panel company. Most companies do this at the enrollment stage, and a prospective member is not available for surveys until his or her identity has been verified. The specific checks vary from one panel company to the next, but they generally involve verifying information provided at the enrollment stage (e.g., name, address, telephone number, email address) against third-party databases; when an identity cannot be verified, the prospective panelist is rejected. Reputable companies will supply details on their validation procedures if asked.
A second form of fraudulent behavior consists of lying in the survey’s qualifying questions as a way to ensure participation. Experienced panel respondents understand that the first questions in a survey typically are used to screen respondents and so they may engage in behaviors that maximize their chances of qualifying. Market research surveys, for example, may be targeted at people who own or use certain products and those surveys often ask about product usage in a multiple response format. As described earlier, fraudulent respondents sometimes will select all products in the list in an attempt to qualify for the survey. Examining the full set of responses for respondents who choose all options in a multiple response qualifier is one technique for identifying fraudulent respondents. Surveys may also qualify people based on their having engaged in certain types of activities, the frequency with which they engage, or the number of certain products owned. When qualifying criteria such as these are collected over a series of questions the researcher can perform consistency checks among items or simply perform reasonableness tests to identify potential fraudulent respondents.
Increasingly, researchers are designing questionnaires in ways that make it easier to find respondents who may be lying to qualify. For example, they may include false brands, nonexistent products, or low-incidence behaviors in multiple-choice questions, or construct qualifying questions in ways that make it easier to spot inconsistencies in answers.
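The short sketch below shows how two such checks might be operationalized in practice; the item set and the fictitious "zellcor" brand are invented purely for illustration.

```python
# Minimal sketch of two screener checks (hypothetical items; "zellcor" is a bogus brand).
SCREENER_OPTIONS = {"brand_a", "brand_b", "brand_c", "brand_d", "zellcor"}
BOGUS_OPTIONS = {"zellcor"}

def flag_screener(selections):
    """Return reasons a respondent's screener answers look suspicious."""
    reasons = []
    if selections >= SCREENER_OPTIONS:      # selected every option offered
        reasons.append("selected all options")
    if selections & BOGUS_OPTIONS:          # endorsed a nonexistent product
        reasons.append("endorsed bogus option")
    return reasons

print(flag_screener({"brand_a", "brand_b", "brand_c", "brand_d", "zellcor"}))
print(flag_screener({"brand_b"}))
```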
Identifying Duplicate Respondents. While validation checks performed at the join stage can prevent the same individual from joining the same panel multiple times, much of the responsibility for ensuring that the same individual does not participate in the same survey more than once rests with the researcher. It is reasonable to expect that a panel company has taken the necessary steps to eliminate multiple identities for the same individual, and a researcher should confirm that before engaging the panel company. However, no panel company can be expected to know with certainty whether a member of its panel is also a member of another panel. In those instances where the researcher feels it is necessary or wise to use sample from multiple panels on the same study, the researcher must also have a strategy for identifying and removing potential duplicate respondents.
The most common technique for identifying duplicate respondents is digital fingerprinting. Specific applications of this technique vary, but they all involve capturing technical information about a respondent's IP address, browser, software settings, and hardware configuration to construct a unique ID for that computer. (See Morgan (2008) for an example of a digital fingerprinting implementation.) Duplicate IDs in the same sample signal that the same computer was used to complete more than one survey, and so a possible duplicate exists. False positives are possible (e.g., two persons in the same household), and so it is wise to review the full responses of suspected duplicates prior to deleting any data.
To be effective, digital fingerprinting must be implemented on a survey-by-survey basis. Many survey organizations have their own strategies, and several companies specialize in these services.
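Purely as an illustration (real implementations collect richer signals and treat them more carefully), the sketch below hashes a handful of technical attributes into a device ID and flags completes that share one.

```python
import hashlib
from collections import Counter

def fingerprint(attrs):
    """Hash a fixed set of technical attributes into a device ID (illustrative only)."""
    keys = ("ip", "user_agent", "screen", "timezone", "plugins")
    raw = "|".join(str(attrs.get(k, "")) for k in keys)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:16]

completes = [
    {"resp_id": 101, "ip": "203.0.113.7", "user_agent": "UA-1", "screen": "1920x1080", "timezone": "-5", "plugins": "pdf"},
    {"resp_id": 102, "ip": "203.0.113.7", "user_agent": "UA-1", "screen": "1920x1080", "timezone": "-5", "plugins": "pdf"},
    {"resp_id": 103, "ip": "198.51.100.2", "user_agent": "UA-2", "screen": "1366x768", "timezone": "+1", "plugins": ""},
]

counts = Counter(fingerprint(c) for c in completes)
# Possible duplicates; review the full interviews before deleting (could be two people in one household).
suspects = [c["resp_id"] for c in completes if counts[fingerprint(c)] > 1]
print(suspects)
```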
Measuring Engagement. Perhaps the most controversial techniques are those used to identify satisficing respondents. Four are commonly used (a simple illustrative sketch follows the list):
- Researchers often look at the full distribution of survey completion times to identify respondents with especially short times;
- Grid or matrix style questions are a common feature of online questionnaires, and respondent behavior in them is another oft-used signal of potential satisficing. "Straightlining" answers in a grid by selecting the same response for all items, or otherwise showing low differentiation in the response pattern, can be an indicator of satisficing (though it could also indicate a poorly designed questionnaire). Similarly, random selection of response options can be a signal, although this is somewhat more difficult to detect (a high standard deviation around the average value selected by a respondent may or may not signal random responding). Trap questions in grids that reverse polarity are another technique (though such items may simply be more difficult to read and respond to);
- Excessive selection of non-substantive responses such as "don't know" or "decline to answer" is still another potential indicator of inattentiveness (though it could also reflect greater honesty);
- Finally, examination of responses to open-ended questions can sometimes identify problematic respondents. Key things to look for are gibberish or answers that appear to be copied and then repeatedly pasted in question after question.
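The sketch referenced above flags three of these behaviors; the thresholds (completion time below one-third of the median, no or very low variation across grid items, near-empty open ends) are arbitrary illustrations, not accepted standards.

```python
import statistics

def satisficing_flags(duration_sec, grid_answers, median_duration, open_end):
    """Flag possible satisficing behaviors; all thresholds are illustrative."""
    flags = []
    if duration_sec < 0.33 * median_duration:        # unusually fast completion ("speeder")
        flags.append("speeder")
    if len(set(grid_answers)) == 1:                  # same response to every grid item
        flags.append("straightlining")
    elif statistics.pstdev(grid_answers) < 0.5:      # very low differentiation
        flags.append("low differentiation")
    if len(open_end.split()) < 2:                    # near-empty or one-word open end
        flags.append("weak open end")
    return flags

print(satisficing_flags(180, [3, 3, 3, 3, 3], 600, "ok"))   # ['speeder', 'straightlining', 'weak open end']
```

In practice a case would rarely be removed on one flag alone; as noted below, many researchers require failure on several checks before deleting data.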
Putting It All Together. There is no widely accepted industry standard for editing and cleaning panel data. Which, if any, of these techniques is used for a given study is left to the judgment of individual researchers. Similarly, how the resulting data are interpreted and what action is taken against specific cases vary widely. For some researchers, failure on a single check, such as a duplication or fraud-detection sequence, may be enough to delete a completed interview from the dataset. Others use a scoring system in which a case must fail multiple tests before it is eliminated; this is especially common in attempts to identify inattentive respondents. Unfortunately, there is nothing in the research literature to help us understand how significantly any of these respondent behaviors affects estimates.
This editing process may strike researchers accustomed to working with probability samples as a strange way to ensure data quality; eliminating respondents because of their response patterns is not typically done with such samples. On the other hand, in interviewer-administered surveys interviewers are trained to recognize some of these behaviors and to take steps to correct them during the course of the interview.
We know of no research that shows the effect of these kinds of edits on either the representativeness of these online surveys or their estimates. Nonetheless, these negative respondent behaviors are widely believed to be detrimental to data quality.
Conclusions/Recommendations
We believe that the foregoing review, while not exhaustive of the literature, is at least comprehensive in terms of the major issues researchers face with online panels. But research is ongoing, and both the panel paradigm itself and the methods for developing online samples more generally continue to evolve. The conclusions that follow flow naturally from the state of the science as we understand it today, yet they are necessarily tentative because that science continues to evolve.
Researchers should avoid nonprobability online panels when one of the research objectives is to accurately estimate population values. There currently is no generally accepted theoretical basis from which to claim that survey results using samples from nonprobability online panels are projectable to the general population. Thus, claims of “representativeness” should be avoided when using these sample sources. Further, empirical research to date comparing the accuracy of surveys using nonprobability online panels with that of probability-based methods finds that the former are generally less accurate when compared to benchmark data from the Census or administrative records.
From a total survey error perspective, the principal source of error in estimates from these types of sample sources is a combination of the lack of Internet access in roughly one in three U.S. households and the self-selection bias inherent in the panel recruitment processes.
Although mode effects may account for some of the differences observed in comparative studies, the use of nonprobability sampling in surveys with online panels is likely the more significant factor in the overall accuracy of surveys using this method. The majority of studies comparing results from surveys using nonprobability online panels with those using probability-based methods (most often RDD telephone) report significantly different results on a wide array of behaviors and attitudes. Explanations for those differences sometimes point to classic measurement error phenomena such as social desirability response bias and satisficing. And indeed, the literature confirms that in many cases self-administration by computer results in higher reports of socially undesirable behavior and less satisficing than in interviewer-administered modes. Unfortunately, many of these studies confound mode with sample source, making it difficult to separate the impact of mode of administration from sample source. A few studies have attempted to disentangle these influences by comparing survey results from different modes to external benchmarks such as the Census or administrative data. These studies generally find that surveys using nonprobability online panels are less accurate than those using probability methods. Thus, we conclude that while measurement error may explain some of the divergence in results across methods, the greater source of error is likely to be the undercoverage and self-selection bias inherent in nonprobability online panels.
There are times when a nonprobability online panel is an appropriate choice. To quote Mitofsky (1989), "…different surveys have different purposes. Defining standard methodological practices when the purpose of the survey is unknown does not seem practical. Some surveys are conducted under circumstances that make probability methods infeasible if not impossible. These special circumstances require caution against unjustified or unwarranted conclusions, but frequently legitimate conclusions are possible and sometimes those conclusions are important." The quality expert J. M. Juran (1992) expressed this concept more generally when he coined the term "fitness for use" and argued that any definition of quality must include discussion of how a product will be used, who will use it, how much it will cost to produce it, and how much it will cost to use it. Not all survey research is intended to produce precise estimates of population values. For example, a good deal of research is focused on improving our understanding of how personal characteristics interact with other survey variables such as attitudes, behaviors, and intentions. Nonprobability online panels also have proven to be a valuable resource for methodological research of all kinds. Market researchers have found these sample sources to be very useful in testing the receptivity of different types of consumers to product concepts and features. Under these and similar circumstances, especially when the budget is limited and/or time is short, a nonprobability online panel can be an appropriate choice. However, researchers should carefully consider any biases that might result from the possible correlation of the survey topic with the likelihood of Internet access, the propensity to join an online panel, or the propensity to respond to and complete the survey, and they should qualify their conclusions appropriately.
Research aimed at evaluating and testing techniques used in other disciplines to make population inferences from nonprobability samples is interesting but inconclusive. Model-based sampling and sample management have been shown to work in other disciplines but have yet to be tested and applied more broadly. While some have advocated the use of propensity weighting in post-survey adjustment to represent the intended population, the effectiveness of these different approaches has yet to be demonstrated consistently and on a broad scale. Nonetheless, this research is important and should continue.
Users of online panels should understand that there are significant differences in the composition and practices of individual panels that can affect survey results. It is important to choose a panel sample supplier carefully. One obvious difference among panels that is likely to have a major impact on the accuracy of survey results is the method of recruitment. Panels recruited using probability-based methods such as RDD telephone or address-based mail sampling are likely to be more accurate than those recruited using nonprobability methods, assuming all other aspects of survey design are held constant. Other panel management practices such as recruitment source, incentive programs, and maintenance practices also can have major impacts on survey results. Arguably the best guidance available on this topic is the ESOMAR publication, 26 Questions to Help Research Buyers of Online Samples, included as Appendix B to this report. Many panel companies have already answered these questions on their Web sites, although words and practices sometimes do not agree. Seeking references from other researchers may also be helpful.
Panel companies can inform the public debate considerably by sharing more about their methods and data describing outcomes at the recruitment, join, and survey-specific stages. Despite the large volume of research that relies on these sample sources we know relatively little about the specifics of undercoverage or nonresponse bias. Such information is critical to fit-for-purpose design decisions and attempts to correct bias in survey results.
Disclosure is critical. O'Muircheartaigh (1997) proposed that error be defined as "work purporting to do what it does not do." Much of the controversy surrounding the use of online panels is rooted in claims that may or may not be justified given the methods used. Full disclosure of the research methods used is a bedrock scientific principle and a requirement for survey research long championed by AAPOR. Disclosure is the only means by which the quality of research can be judged and results replicated. Full and complete disclosure of how results were obtained is a requirement for all survey research, regardless of methodology. The disclosure standards included in the AAPOR Code of Professional Ethics and Practice are an excellent starting point. Researchers also may wish to review the disclosure standards required in ISO 20252 and, especially, ISO 26362. Of particular interest is the calculation of a within-panel "participation rate" in place of a response rate, the latter being discouraged by the ISO standards except when probability samples are used. The participation rate is defined as "the number of respondents who have provided a usable response divided by the total number of initial personal invitations requesting participation" (see footnote 10).
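As a simple illustration with made-up counts, the participation rate defined above would be computed as follows.

```python
# Illustrative participation-rate calculation (hypothetical counts).
# Per footnote 10, this is a measure of panel efficiency, not of survey quality.
invitations_sent = 5000      # initial personal invitations
usable_responses = 900       # usable completed interviews

participation_rate = usable_responses / invitations_sent
print(f"Participation rate: {participation_rate:.1%}")   # 18.0%
```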
AAPOR should consider producing its own “Guidelines for Internet Research” or incorporate more specific references to online research in its code. AAPOR has issued a number of statements on topics such as representativeness of Web surveys and appropriateness of margin of error calculation with nonprobability samples. These documents are included as Appendix C and D respectively. AAPOR should consider whether these statements represent its current views and revise as appropriate. Its members and the industry at large also would benefit from a single set of guidelines that describe what AAPOR believes to be appropriate practices when conducting research online across the variety of sample sources now available.
Better metrics are needed. There are no widely accepted definitions of outcomes and methods for calculating rates, similar to AAPOR's Standard Definitions (2009), that allow us to judge the quality of results from surveys using online panels. For example, while the term "response rate" is often used with nonprobability panels, the method of calculation varies and it is not at all clear how analogous those methods are to those described in Standard Definitions. Although various industry bodies are active in this area, we are still short of consensus. AAPOR may wish to take a leadership position here, much as it has with metrics for traditional survey methods. One obvious action would be to expand Standard Definitions to include both probability and nonprobability panels.
Research should continue. Events of the last few years have shown that despite the widespread use of online panels there still is a great deal about them that is not known with confidence, and considerable controversy continues to surround their use. The forces that have driven the industry to use online panels will only intensify going forward, especially as the role of the Internet in people's lives continues to expand. AAPOR, by virtue of its scientific orientation and the methodological focus of its members, is uniquely positioned to encourage research and disseminate its findings. It should do so deliberately.
10 We should note that while the response rate is a measure of survey quality, the participation rate is not; it is a measure of panel efficiency.
References and Additional Readings
AAPOR (2009).
Standard Definitions: Final Dispositions of Case Codes and Outcomes for Surveys.
AAPOR (2008).
Guidelines and Considerations for Survey Researchers When Planning and Conducting RDD and Other Telephone Surveys in the U.S. With Respondents Reached via Cell Phone Numbers.
Abate, T. (1998). “Accuracy of On-line Surveys May Make Phone Polls Obsolete.”
The San Francisco Chronicle, D1.
Aguinis, H., Pierce, C.A., & Quigley, B.M. (1993). “Conditions under which a bogus pipeline procedure enhances the validity of self-reported cigarette smoking: A meta-analytic review.”
Journal of Applied Social Psychology 23: 352-373.
Alvarez, R.M., Sherman, R., & VanBeselaere, C. (2003). “Subject acquisition for Web-based surveys,”
Political Analysis, 11, 1, 23-43.
Bailar, B.A. (1989). "Information Needs, Surveys, and Measurement Errors," in Daniel Kasprzyk et al. (eds),
Panel Surveys. New York: John Wiley.
Baim, J., Galin, M., Frankel, M. R., Becker, R., & Agresti, J. (2009). Sample surveys based on Internet panels: 8 years of learning. New York, NY: Mediamark.
Baker, R. (2008). “A Web of Worries,”
Research World, 8-11.
Baker, R. & Downes-LeGuin, T. (2007). “Separating the Wheat from the Chaff: Ensuring Data Quality in Internet Panel Samples.” in
The Challenges of a Changing World; Proceedings of the Fifth International Conference of the Association of Survey Computing. Berkeley, UK: ASC.
Baker, R., Zahs, D., & Popa, G. (2004). "Health surveys in the 21st century: Telephone vs. web" in Cohen SB, Lepkowski JM, eds., Eighth Conference on Health Survey Research Methods, 143-148. Hyattsville, MD: National Center for Health Statistics.
Bandilla, W., Bosnjak, M. & Altdorfer, P. (2003). “Survey Administration Effects?: A Comparison of Web and Traditional Written Self-Administered Surveys Using the ISSP Environment Module.”
Social Science Computer Review 21: 235-243
Bartels, L. M. (2006). “Three Virtues of Panel Data for the Analysis of Campaign Effects.” Capturing Campaign Effects, ed. Henry E. Brady and Richard Johnston. Ann Arbor, MI: University of Michigan Press.
Bender B, Bartlett SJ, Rand CS, Turner CF, Wamboldt FS, & Zhang L. (2007) “Impact of Reporting Mode on Accuracy of Child and Parent Report of Adherence with Asthma Controller Medication.”
Pediatrics, 120: 471-477.
Berinsky, A.J. (2006). “American Public Opinion in the 1930s and 1940s: The Analysis of Quota-controlled Sample Survey Data.”
Public Opinion Quarterly 70:499-529.
Berrens, R. P., Bohara, A. K., Jenkins-Smith, H., Silva, C., & Weimer, David L. (2003). “The Advent of Internet Surveys for Political Research: A Comparison of Telephone and Internet Samples.”
Political Analysis 11:1-22.
Bethlehem, J. (2009).
Applied Survey Methods: A Statistical Perspective. New York: Wiley.
Bethlehem, J. (2008a). "Can we make official statistics with self-selection web surveys?" in Statistics Canada's International Symposium Series. Catalogue Number 11-522-X.
Bethlehem, J. (2008b). "How accurate are self-selection web surveys?" Discussion paper 08014, Statistics Netherlands. The Hague: Statistics Netherlands.
Bethlehem, J. & Stoop, I. (2007), “Online Panels – A Theft of a Paradigm?” in
The Challenges of a Changing World; Proceedings of the Fifth International Conference of the Association of Survey Computing. Berkeley, UK: ASC.
Bethlehem, J. (2002). "Weighting nonresponse adjustments based on auxiliary information," in Groves, R.M., Dillman, D.A., Eltinge, J.L. and Little, R.J.A. (eds.),
Survey Nonresponse. New York:Wiley.
Biemer, P.P. (2001). Nonresponse bias and measurement bias in a comparison of face to face and telephone interviewing.
Journal of Official Statistics, 17, 295-320.
Biemer, P. P., Groves, R. M., Lyberg, L. E., Mathiowetz, N. A., and Sudman, S. (Eds.) (1991).
Measurement errors in surveys. New York: John Wiley and Sons.
Biemer, P., & Lyberg, L.E. (2003).
Introduction to Survey Quality. Wiley Series in Survey Methodology. Hoboken, NJ: John Wiley and Sons, Inc.
Black, G. S., & Terhanian, G. (1998). “Using the Internet for Election Forecasting.”
The Polling Report October 26.
Blankenship, A.B., Breen, G., & Dutka, A. (1998).
State of the Art Marketing Research. Second edition, Chicago, IL: American Marketing Association.
Blumberg, S.J. & Luke, J.V. ( 2008). “Wireless substitution: Early release of estimates from the National Health Interview Survey, July to December 2007.” National Center for Health Statistics. Available from: http://www.cdc.gov/nchs/nhis.htm. May 13, 2008.
Birnbaum, M. H. (2004). “Human Research and Data Collection via the Internet.”
Annual Review of Psychology 55:803-822.
Bowling, A. (2005). Mode of questionnaire administration can have serious effects on data quality. Journal of Public Health, 27, 281-291.
Boyle, J.M., Freeman, G. & Mulvany, L. “Internet Panel Samples: A Weighted Comparison of Two National Taxpayer Surveys.” Paper presented at the 2005 Federal Committee on Statistical Methodology Research Conference.
Braunsberger, K., Wybenga, H., & Gates, R. (2007). “A comparison of reliability between telephone and web-based surveys.”
Journal of Business Research 60, 758-764.
Burke. (2000). “Internet vs. telephone data collection: Does method matter?” Burke White Paper 2(4).
Burn, M., & Thomas, J. (2008). “Do we really need proper research any more? The importance and impact of quality standards for online access panels.” ICM White Paper. London, UK: ICM Research.
Cacioppo, J, T. & Petty, R. (1982). “The Need for Cognition.”
Journal of Personality and Social Psychology 42:116-131.
Caliendo, M. & Kopeinig, S. (2008), “Some Practical Guidance for the Implementation of Propensity Score Matching.”
Journal of Economic Surveys 22: 31-72
Callegaro, M., & Disogra, C., (2008). “Computing Response Metrics for Online Panels.”
Public Opinion Quarterly, 72(5): 1008-32.
CAN-SPAM. http://www.ftc.gov/bcp/edu/pubs/business/ecommerce/bus61.shtm
Cartwright, T., & Nancarrow, E. (2006). “The effect of conditioning when re-interviewing: A pan-European study,” Panel Research 2006: ESOMAR World Research Conference. Amsterdam: ESOMAR.
CASRO. (2009). http://www.casro.org/codeofstandards.cfm.
Chang, L.C. & Krosnick, J.A. (2001). "National Surveys via RDD Telephone Interviewing vs. the Internet: Comparing Sample Representativeness and Response Quality." Paper presented at the 56th Annual Conference of the American Association for Public Opinion Research, Montreal.
Chang, L., & Krosnick, J.A. (2010). “Comparing oral interviewing with self-administered computerized questionnaires: An experiment.”
Public Opinion Quarterly.
Chang, L., & Krosnick, J.A. (2009). “National surveys via RDD telephone interviewing versus the internet: Comparing sample representativeness and response quality.”
Public Opinion Quarterly.
Chatt, C., & Dennis, J. M. (2003). “Data collection mode effects controlling for sample origins in a panel survey: Telephone versus internet.” Paper presented at the 2003 Annual Meeting of the Midwest Chapter of the American Association for Public Opinion Research, Chicago, IL.
Christian, L. M., Dillman, D. & Smyth, J.D. (2006). “The Effects of Mode and Format on Answers to Scalar Questions in Telephone and Web Surveys.” Paper presented at the 2nd International Conference on Telephone Survey Methodology, Miami, Florida.
Christian, L. M., Dillman, D. A., and Smyth, J. D. (2008). “The effects of mode and format on answers to scalar questions in telephone and web surveys.” In
J. M. Lepkowski et al. (Eds.). Advances in telephone survey methodology. New York: John Wiley and Sons, 250-275.
Clinton, J.D. (2001). “Panel Bias from Attrition and Conditioning: A Case Study of the Knowledge Networks Panel.” Stanford.
Coen, T., Lorch, J. & Piekarski, L. (2005). “The Effects of Survey Frequency on Panelists’ Responses.” Worldwide Panel Research: Developments and Progress. Amsterdam: ESOMAR.
Comly, P. (2007). "Online Market Research." In
Market Research Handbook, ed. ESOMAR, pp. 401-20, Hoboken, NJ: Wiley.
Comly, P. (2005). “Understanding the Online Panelist.” Worldwide Panel Research: Developments and Progress. Amsterdam: ESOMAR.
Converse, P, E., & Traugott, M.W. (1986). “Assessing the Accuracy of Polls and Surveys.”
Science 234: 1094-1098.
Cooke, M., Watkins, N., & Moy, C. (2007). “A hybrid online and offline approach to market measurement studies.”
International Journal of Market Research, 52, 29-48.
Cooley, P.C., Rogers, S.M., Turner, C.F., Al-Tayyib, A.A., Willis, G., & Ganapathi, L. (2001). “Using touch screen audio-CASI to obtain data on sensitive topics.”
Computers in Human Behavior, 17: 285-293.
Corder, Larry S. and Daniel G. Horvitz. (1989). "Panel Effects in the National Medical Care Utilization and Expenditure Survey," in Daniel Kasprzyk and others (eds),
Panel Surveys. New York: Wiley.
Couper, M. P. (2008).
Designing Effective Web Surveys. New York: Cambridge University Press.
Couper, M. P. (2000). “Web Surveys: A Review of Issues and Approaches.”
Public Opinion Quarterly 64:464-494.
Couper, M. P., Traugott, M.W., & Lamias, M.J. (2001). “Web Survey Design and Administration.”
Public Opinion Quarterly 65:230-253.
Couper, M.P., Singer, E., Conrad, F., & Groves, R. (2008). “Risk of Disclosure, Perceptions of Risk, and Concerns about Privacy and Confidentiality as Factors in Survey Participation.”
Journal of Official Statistics 24: 255-275.
Crete, J., & Stephenson, L. B. (2008). “Internet and telephone survey methodology: An evaluation of mode effects.” Paper presented at the annual meeting of the MPSA, Chicago, IL.
Curtin, R., Presser, S., & Singer, E. (2005). “Changes in Telephone Survey Nonresponse over the Past Quarter Century.”
Public Opinion Quarterly 69:87-98.
Czajka, J.L., Hirabayashi, S.M., Little, R.J.A., and Rubin, D.B. (1992). “Projecting from Advance Data Using Propensity Modeling: An Application to Income and Tax Statistics,”
Journal of Business and Economic Statistics, 10(2), 117-132.
Dever, Jill A., Rafferty, Ann, & Valliant, Richard. (2008). “Internet Surveys: Can Statistical Adjustments Eliminate Coverage Bias?”
Survey Research Methods 2: 47-62.
De Leeuw, Edith D. (2005). “To Mix or Not to Mix Data Collection Modes in Surveys.”
Journal of Official Statistics 21:233-255.
Dennis, J.M. (2001). “Are Internet Panels Creating Professional Respondents? A Study of Panel Effects.”
Marketing Research 13 (2): 484-488.
Denscombe, Martyn. (2006). “Web Questionnaires and the Mode Effect.”
Social Science Computer Review 24: 246-254.
Des Jarlais DC, Paone D, Milliken J, Turner CF, Miller H, Gribble J, Shi Q, & Hagan H, Friedman. (1999). “Audio-computer interviewing to measure risk behaviour for HIV among injecting drug users: a quasi-randomised trial.”
Lancet, 353(9165): 1657-61.
Dillman, D. (1978).
Mail and Telephone Surveys: The Total Design Method. New York: Wiley.
Dillman, Don A. & Leah Melani Christian. (2005). “Survey Mode as a Source of Instability in Responses Across Surveys.”
Field Methods 17:30-52.
Dillman, Don A., Smyth, Jolene D., & Christian, Leah Melani. (2009).
Internet, Mail, and Mixed-Mode Surveys: The Tailored Design Method (3rd Ed.), Hoboken, NJ: Wiley.
DMS Research. (2009). “The Devil Is in the Data”.
Quirk's Marketing Research Review, April 2009. http://www.quirks.com/search/articles.aspx?search=DMS+Research&searched=39711685.
Downes-Leguin, T., Meckling, J., & Baker, R. (2006). “Great results from ambiguous sources: Cleaning Internet panel data.”
Panel Research 2006: ESOMAR World Research Conference. Amsterdam: ESOMAR.
Duffy, Bobby, Smith, Kate, Terhanian, George, & Bremer, John. (2005). “Comparing Data from Online and Face-to-Face Surveys.”
International Journal of Market Research 47:615-639.
Duncan, K.B., and Stasny, E.A. (2001). “Using Propensity Scores to Control Coverage Bias in Telephone Surveys,”
Survey Methodology, 27(2). 121-130.
Elmore-Yalch, R., Busby, J., & Britton, C. (2008). “Know thy customer? Know thy research!: A comparison of web-based & telephone responses to a public service customer satisfaction survey.” Paper presented at the TRB 2008 Annual Meeting.
Elo, Kimmo. (2010). “Asking Factual Knowledge Questions: Reliability in Web-Based, Passive Sampling Surveys.”
Social Science Computer Review 28.
Emerson, Ralph Waldo. 1841. Essays: First Series.
Ezzati-Rice, T. M., Frankel, M.R., Hoaglin, D. C., Loft, J. D., Coronado, V. G., and Wright, R. A. (2000). "An Alternative Measure of Response Rate in Random-Digit-Dialing Surveys that Screen for Eligible Subpopulations,"
Journal of Economic and Social Measurement, 26, 99-109.
Fazio, Russell H., Lenn, T. M., & Effrein, E. A. (1984). “Spontaneous Attitude Formation.”
Social Cognition 2:217-234.
Fendrich, M., Mackesy-Amiti, M. E., Johnson, T. P., Hubbell, A., & Wislar, J. S. (2005). Tobacco-reporting validity in an epidemiological drug-use survey. Addictive Behaviors, 30, 175−181.
Fine, Brian, Menictas, Con, & Casdas, Dimitrio. (2006). “Comparing people who belong to multiple versus single panels,” Panel Research 2006: ESOMAR World Research Conference. Amsterdam: ESOMAR.
Fricker, Scott, Galesic, Mirta, Tourangeau, Roger, & Yan, Ting. (2005). “An Experimental Comparison of Web and Telephone Surveys.”
Public Opinion Quarterly 69:370-392.
Frisina, Laurin T., Krane, David, Thomas, Randall K., & Taylor, Humphrey. (2007). "Scaling social desirability: Establishing its influence across modes." Paper presented at the 62nd Annual Conference of the American Association for Public Opinion Research in Anaheim, California.
Galesic, M. & Bosnjak, M. (2009). "Effects of Questionnaire Length on Participation and Indicators of Response Quality in Online Surveys."
Public Opinion Quarterly 73: 349-360.
Garren, S.T., and Chang, T.C. (2002). “Improved Ratio Estimation in Telephone Surveys Adjusting for Noncoverage,”
Survey Methodology, 28(1), 63-76.
Garland, P., Santus, D., & Uppal, R. (2009). "Survey Lockouts: Are we too cautious?" Survey Sampling International white paper. http://www.surveysampling.com/sites/all/files/SSI_SurveyLockouts_0.pdf
Ghanem, K.G., Hutton, H. E., Zenilman, J. M., Zimba, R., & Erbelding, E. J. (2005). “Audio computer assisted self interview and face to face interview modes in assessing response bias among STD clinic patients.”
Sex Transm Infect, 81: 421–425.
Gibson, R., & McAllister, I. (2008). “Designing online election surveys: Lessons from the 2004 Australian Election.”
Journal of Elections, Public Opinion, and Parties 18: 387-400.
Göksel, H., Judkins, D. R., & Mosher, W. D. (1991). “Nonresponse adjustments for a telephone follow-up to a national in-person survey.”
Proceedings of the Section on Survey Research Methods, American Statistical Association, 581–586.
Gosling, S. D., Vazire, S., Srivastava, S. & John, O. P. (2004). “Should we trust Web studies?: A comparative analysis of six preconceptions about Internet questionnaires.”
American Psychologist 59:93-104.
Groves, R. M. (1989).
Survey Errors and Survey Costs. New York: John Wiley and Sons.
Groves, R. M. (2006). “Nonresponse Rates and Nonresponse Bias in Household Surveys.”
Public Opinion Quarterly 70: 646–675.
Groves, R., Brick, M., Smith, R., & Wagner, J. (2008). “Alternative Practical Measures of Representativeness of Survey Respondent Pools,” presentation at the 2008 AAPOR meetings.
Gundersen, D. A. (2007). “Mode effects on cigarette smoking estimates: Comparing CAPIR and CATI responders in the 2001/02 Current Population Survey.” Paper presented at the APHA Annual Meeting and Expo, Washington, DC.
Harris Interactive (2008). “Election results further validate efficacy of Harris Interactive/s Online Methodology.” Press Release from Harris Interactive, November 6, 2008.
Harris Interactive. (2004). “Final pre-election Harris Polls: Still too close to call but Kerry makes modest gains.” The Harris Poll #87, November 2, 2004. http://www.harrisinteractive.com/harris_poll/index.asp?pid=515.
Hasley, S. (1995). “A comparison of computer-based and personal interviews for the gynecologic history update.”
Obstetrics and Gynecology, 85, 494-498.
Hennigan, K. M., Maxson, C. L., Sloane, D., & Ranney, M. (2002). “Community views on crime and policing: Survey mode effects on bias in community surveys.”
Justice Quarterly 19, 565-587.
Heerwegh, D. & Loosveldt, G. (2008). “Face-to-face versus web surveying in a high Internet coverage population: differences in response quality,”
Public Opinion Quarterly 72, 836-846.
Holbrook, Allyson L., Green, Melanie C. & Krosnick, Jon A. (2003). “Telephone versus Face-to-Face Interviewing of National Probability Samples with Long Questionnaires: Comparisons of Respondent Satisficing and Social Desirability Response Bias.”
Public Opinion Quarterly 67:79-125.
Hoogendoorn, Adriaan W. & Daalmans, Jacco (2009). “Nonresponse in the Recruitment of an Internet Panel Based on Probability Sampling.”
Survey Research Methods 3: 59-72.
Iannacchione, V.G., Milne, J.G., & Folsom, R.E. (1991). “Response probability weight adjustments using logistic regression.” Presented at the 151st Annual Meeting of the American Statistical Association, Section on Survey Research Methods, Atlanta, GA, August 18-22.
Jäckle, Annette & Lynn, Peter. (2008). “Respondent incentives in a multi-mode panel survey: cumulative effects on non-response and bias.”
Survey Methodology, 34.
Jackman, S. (2005). “Pooling the polls over an election campaign.”
Australian Journal of Political Science 40: 499-517.
Inside Research (2009). “U.S. Online MR Gains Drop.” 20(1), 11-134.
International Organization for Standardization, (2009).
ISO 26362:2009 Access panels in market, opinion, and social research- Vocabulary and service requirements. Geneva.
International Organization for Standardization. (2006).
ISO 20252: 2006 Market, opinion, and social research- Vocabulary and service requirements. Geneva.
Juran, J.M. (1992).
Juran on Quality by Design: New Steps for Planning Quality into Goods and Services. New York: The Free Press.
Keeter, S, Kennedy, C., Dimock, M., Best, J., & Craighill. P. (2006). “Gauging the Impact of Growing Nonresponse on Estimates from a National RDD Telephone Survey.”
Public Opinion Quarterly 70:759-779.
Keeter, S., Miller, C., Kohut, A., Groves, R.M. & Presser, S.. (2000). “Consequences of Reducing Nonresponse in a National Telephone Survey.”
Public Opinion Quarterly 64:125-148
Kellner, P. (2008). “Down with random samples.”
Research World, June, 31.
Kellner, P. (2004). “Can Online Polls Produce Accurate Findings?”
International Journal of Market Research 46: 3-21.
Kish, L. (1965).
Survey Sampling. New York: John Wiley and Sons.
Klein, J. D., Thomas, R. K., & Sutter, E. J. (2007). “Self-reported smoking in online surveys: Prevalence estimate validity and item format effects.”
Medical Care 45: 691-695.
Kohut, A., Keeter, S., Doherty, C. & Dimock, M.. (2008). “The Impact of ‘Cell-onlys’ on Public Opinion Polling: Ways of Coping with a Growing Population Segment. “ The Pew Research Center. Available from http://people-press.org, January 31, 2008.
Knapton, K. & Myers, S. (2005). “Demographics and Online Survey Response Rates.”
Quirk’s Marketing Research Review: 58-64.
Krane, D., Thomas, R.K., & Taylor, H. (2007). "Presidential approval measures: Tracking change, predicting behavior, and cross-mode comparisons." Paper presented at the 62nd Annual Conference of the American Association for Public Opinion Research in Anaheim, California.
Kreuter, F., Presser, S., & Tourangeau, R. (2008). “Social Desirability Bias in CATI, IVR, and Web Surveys: The Effects of Mode and Question Sensitivity.”
Public Opinion Quarterly 72:847-865.
Krosnick, J. A. (1999). “Survey Research.”
Annual Review of Psychology 50:537-567.
Krosnick, J. A. (1991). “Response Strategies for Coping with Cognitive Demands of Attitude Measures in Surveys.”
Applied Cognitive Psychology 5: 213-236.
Krosnick, J. A., Nie, N., & Rivers, D. (2005). “Web Survey Methodologies: A Comparison of Survey.” Paper presented at the 60th Annual Conference of the American Association for Public Opinion Research in Miami Beach, Florida.
Krosnick, J. A., & Alwin, D.F. (1987). “An Evaluation of a Cognitive Theory of Response Order Effects in Survey Measurement.”
Public Opinion Quarterly, 51, 201-219.
Kuran, T. & McCaffery, E.J. (2008). “Sex Differences in the Acceptability of Discrimination.”
Political Research Quarterly 61(2): 228-238.
Kuran, Timur & McCaffery, Edward J. (2004). “Expanding Discrimination Research: Beyond Ethnicity and to the Web.”
Social Science Quarterly 85.
Lee, S. (2004). “Statistical Estimation Methods in Volunteer Panel Web Surveys.” Unpublished Doctoral Dissertation, Joint Program in Survey Methodology, University of Maryland, USA.
Lee, S. (2006). “Propensity Score Adjustment as a Weighting Scheme for Volunteer Panel Web Surveys.”
Journal of Official Statistics, 22(2), 329-349.
Lee, S. & Valliant, R. (2009). “Estimation for Volunteer Panel Web Surveys Using Propensity Score Adjustment and Calibration Adjustment.”
Sociological Methods and Research 37: 319-343
Lepkowski, J.M. (1989). “The treatment of wave nonresponse in panel surveys.” In
Panel Surveys (D. Kasprzyk, G. Duncan, G. Kalton, and M.P. Singh, eds.). New York: John Wiley and Sons.
Lessler, J.T. and Kalsbeek, W.D. (1992).
Nonsampling Errors in Surveys. New York: Wiley.
Lindhjem, H. & Navrud, S. (2008). “Asking for Individual or Household Willingness to Pay for Environmental Goods? Implication for aggregate welfare measures.”
Environmental and Resource Economics 43(1): 11-29.
Lindhjem, H. & Navrud, S. (2008). “Internet CV surveys – a cheap, fast way to get large samples of biased values?”
Munich Personal RePEc Archive Paper 11471. http://mpra.ub.uni-muenchen.de/11471.
Link, M. W., & Mokdad, A. H. (2005). “Alternative modes for health surveillance surveys: An experiment with web, mail, and telephone.”
Epidemiology 16, 701-704.
Link, M.W. & Mokdad, A.H. (2004). “Are Web and Mail Feasible Options for the Behavioral Risk Factor Surveillance System?” In
S.B. Cohen & J.M. Lepkowski (eds.), Eighth Conference on Health Survey Research Methods, pp. 149-158. Hyattsville, MD: National Center for Health Statistics.
Link, M.W., Battaglia, M.P., Frankel, M.R., Osborn, L., & Mokdad, A.H. (2008). “A Comparison of Address-based Sampling (ABS) versus Random-Digit Dialing (RDD) for General Population Surveys.”
Public Opinion Quarterly 72:6-27.
Loosveldt, G. & Sonck, N. (2008). “An Evaluation of the Weighting Procedures for an Online Access Panel Survey.”
Survey Research Methods 2: 93-105.
Lozar Manfreda, K. & Vehovar, V. (2002). “Do Mail and Web Surveys Provide Same Results?”
Metodološki zvezki 18:149-169.
Lugtigheid, A., & Rathod, S. (2005).
Questionnaire Length and Response Quality: Myth or Reality. Survey Sampling International.
Malhotra, N. & Krosnick, J. A. (2007). “The Effect of Survey Mode and Sampling on Inferences about Political Attitudes and Behavior: Comparing the 2000 and 2004 ANES to Internet Surveys with Nonprobability Samples.”
Political Analysis 15: 286-323.
MarketTools. (2009). “MarketTools TrueSample.” http://www.markettools.com/pdfs/resources/DS_TrueSample.pdf.
Marta-Pedroso, Cristina, Freitas, Helena & Domingos, Tiago. (2007). “Testing for the survey mode effect on contingent valuation data quality: A case study of web based versus in-person interviews.”
Ecological Economics 62(3-4): 388-398.
Merkle, D. M. and Edelman, M. (2002). "Nonresponse in Exit Polls: A Comprehensive Analysis." In
Survey Nonresponse, ed. R. M. Groves, D. A. Dillman, J. L. Eltinge, and R. J. A. Little, pp. 243-58. New York: Wiley.
Merkle, D. & Langer, G. (2008). “How Too Little Can Give You a Little Too Much.”
Public Opinion Quarterly 72:114-124.
Metzger, D.S., Koblin, B., Turner, C., Navaline, H., Valenti, F., Holte, S., Gross, M., Sheon, A., Miller, H., Cooley, P., Seage, G.R., & HIVNET Vaccine Preparedness Study Protocol Team (2000). “Randomized controlled trial of audio computer-assisted self-interviewing: utility and acceptability in longitudinal studies.”
American Journal of Epidemiology, 152(2): 99-106.
Miller, Jeff. (2008). “Burke Panel Quality R and D.” Cincinnati: Burke, Inc.
Miller, Jeff. (2006). “Online Marketing Research.” In
The Handbook of Marketing Research: Uses, Misuses, and Future Advances, eds. Rajiv Grover and Marco Vriens, pp. 110-31. Thousand Oaks, CA: Sage.
Miller, J. (2000). “Net v phone: The great debate.”
Research, 26-27.
Mitofsky, Warren J. (1989). “Presidential Address: Methods and Standards: A Challenge for Change."
Public Opinion Quarterly 53:446-453.
Morgan, Alison. (2008). “Optimus ID: Digital Fingerprinting for Market Research.” San Francisco: PeanutLabs.
Nancarrow, C. & Cartwright, Trixie. (2007). “Online access panels and tracking research: the conditioning issue.”
International Journal of Market Research 49 (5).
Newman, J.C., Des Jarlais, D.C., Turner, C.F., Gribble, J., Cooley, P., & Paone, D. (2002). “The differential effects of face-to-face and computer interview modes.”
American Journal of Public Health 92(2): 294-297.
Niemi, R. G., Portney, K., & King, D. (2008). “Sampling young adults: The effects of survey mode and sampling method on inferences about the political behavior of college students.” Paper presented at the annual meeting of the American Political Science Association, Boston, MA.
Nukulkij, P., Hadfield, J., Subias, S., & Lewis, E. (2007). “An Investigation of Panel Conditioning with Attitudes Toward U.S. Foreign Policy.” Presented at the AAPOR 62nd Annual Conference.
Olson, Kristen. (2006). “Survey participation, nonresponse bias, measurement error bias, and total bias.”
Public Opinion Quarterly 70, 737-758.
O’Muircheartaigh, C. (1997). “Measurement Error in Surveys: A Historical Perspective,” in L. Lyberg, P. Biemer, M. Collins, E. de Leeuw, C. Dippo, N. Schwarz, and D. Trewin, eds.,
Survey Measurement and Process Quality. New York: Wiley.
Ossenbruggen, R. van, Vonk, T., & Willems, P. (2006). “Results Dutch Online Panel Comparison Study (NOPVO).” Paper presented at the open meeting “Online Panels, Goed Bekeken”, Utrecht, the Netherlands. www.nopvo.nl.
Patrick, D.L., Cheadle, A., Thompson, D.C., Diehr, P., Koepsell, T., & Kinne, S. (1994). “The Validity of Self-Reported Smoking: A Review and Meta-Analysis.”
American Journal of Public Health 84(7): 1086-1093.
Pew Research Center for the People and the Press. (2009). http://www.pewInternet.org/static-pages/trend-data/whos-online.aspx.
Pew Research Center for the People and the Press. (2008). http://www.pewInternet.org/Data-Tools/Download-Data/~/media/Infographics/Trend%20Data/January%202009%20updates/Demographics%20of%20Internet%20Users%201%206%2009.jpg
Peirce, Charles Sanders. (1877). "The Fixation of Belief."
Popular Science Monthly 12: 1–15.
Piekarski, L., Galin, M., Baim, J., Frankel, M., Augemberg, K., & Prince, S. (2008). “Internet Access Panels and Public Opinion and Attitude Estimates.” Poster session presented at the 63rd Annual AAPOR Conference, New Orleans, LA.
Pineau, Vicki & Slotwiner, Daniel. (2003). “Probability Samples vs. Volunteer Respondents in Internet Research: Defining Potential Effects on Data and Decision-making” in Marketing Applications. Knowledge Networks White Paper.
Potoglou, D. & Kanaroglou, P.S. (2008). “Comparison of phone and web-based surveys for collecting household background information.” Paper presented at the 8th International Conference on Survey Methods in Transport, France.
Poynter, R. & Comley, P. (2003). “Beyond Online Panels.”
Proceedings of the ESOMAR Technovate Conference. Amsterdam: ESOMAR.
Rainie, L. (2010). “Internet, Broadband, and Cell Phone Statistics,” Pew Internet and American Life Project. Pew Research Center.
Rao, J.N.K. (2003).
Small Area Estimation. Hoboken, New Jersey, John Wiley & Sons.
Riley, Elise D., Chaisson, Richard E., Robnett, Theodore J., Vertefeuille, John, Strathdee, Steffanie A. & Vlahov, David. (2001). “Use of Audio Computer-assisted Self-Interviews to Assess Tuberculosis-related Risk Behaviors.”
American Journal of Respiratory and Critical Care Medicine 164(1): 82-85.
Rivers, Douglas. (2007). “Sample matching for Web surveys: Theory and application.” Paper presented at the 2007 Joint Statistical Meetings.
Rogers, S.M., Willis, G., Al-Tayyib, A., Villarroel, M.A., Turner, C.F., Ganapathi, L. et al. (2005). “Audio computer assisted interviewing to measure HIV risk behaviours in a clinic population.”
Sexually Transmitted Infections 81(6): 501-507.
Rosenbaum, Paul R. & Rubin, Donald B. (1983). “The Central Role of the Propensity Score in Observational Studies for Causal Effects.”
Biometrika 70:41-55.
Rosenbaum, Paul R. & Rubin, Donald B. (1984). “Reducing Bias in Observational Studies Using Subclassification on the Propensity Score.”
Journal of the American Statistical Association 79:516-524.
Roster, Catherine A., Rogers, Robert D., Albaum, Gerald & Klein, Darin. (2004). “A Comparison of Response Characteristics from Web and Telephone Surveys.”
International Journal of Market Research 46:359-373.
Rubin, D.B. (2006).
Matched Sampling for Causal Effects. New York: Cambridge University Press.
Van Ryzin, Gregg G. (2008). “Validity of an On-Line Panel Approach to Citizen Surveys.”
Public Performance and Management Review 32:236-262.
Sanders, D., Clarke, H. D., Stewart, M. C., & Whiteley, P. (2007). “Does Mode Matter for Modeling Political Choice? Evidence from the 2005 British Election Study.”
Political Analysis 15 (3): 257-285.
Saris, Willem E. (1998). “Ten Years of Interviewing without Interviewers: the Telepanel.” In
Computer Assisted Survey Information Collection, eds. Mick Couper, Reginald P. Baker, Jelke Bethlehem, Cynthia Z. F. Clark, Jean Martin, William L. Nicholls, and James O’Reilly, pp. 409-29. New York: Wiley.
Sayles, H. & Arens, Z. (2007). “A Study of Panel Member Attrition in the Gallup Panel.” Paper presented at the 62nd AAPOR Annual Conference, Anaheim, CA.
Schillewaert, Niels & Meulemeester, Pascale. (2005). “Comparing Response Distributions of Offline and Online Data Collection Methods.”
International Journal of Market Research 47:163-178.
Schlackman, W. (1984). “A Discussion of the Use of Sensitivity Panels in Market Research.”
Journal of the Market Research Society 26: 191-208.
Schonlau, Matthias, van Soest, Arthur & Kapteyn, Arie. (2007). “Are ‘Webographic’ or Attitudinal Questions Useful for Adjusting Estimates from Web Surveys Using Propensity Scoring?”
Survey Research Methods 1: 155-163.
Schonlau, Matthias, van Soest, Arthur, Kapteyn, Arie & Couper, Mick. (2009). “Selection Bias in Web Surveys and the Use of Propensity Scores.”
Sociological Methods and Research 37: 291-318.
Schonlau, Matthias, Zapert, Kinga, Simon, Lisa P., Sanstad, Katherine H., Marcus, Sue M., Adams, John, Spranca, Mark, Kan, Hongjun, Turner, Rachel, & Berry, Sandra H. (2004). “A Comparison Between Responses From a Propensity-Weighted Web Survey and an Identical RDD Survey.”
Social Science Computer Review 22:128-138.
"Silberstein, Adriana R. and Jacobs, Curtis A. (1989). “Symptoms of Repeated Interview Effects in the Consumer Expenditure Interview Survey,” in
Panel Surveys (D.Kasprzyk, G. Duncan, G. Kalton, and M.P. Singh, eds.) New York: J.W. Wiley and Sons.Smith, P.J., Rao, J.N.K., Battaglia, M.P., Daniels, D., and Ezzati-Rice, T. (2001). “Compensating for Provider Nonresponse Using Propensities to Form Adjustment Cells: The National Immunization Survey,”
Vital and Health Statistics, Series 2, No. 133, DHHS Publication No. (PHS) 2001-1333.
Smith, Renee & Brown, Hofman. (2006). “Panel and data quality: Comparing metrics and assessing claims.” Panel Research 2006: ESOMAR World Research Conference. Amsterdam: ESOMAR.
Smith, Tom W. (2001). “Are Representative Internet Surveys Possible?” Proceedings of Statistics Canada Symposium 2001, Achieving Data Quality in a Statistical Agency: A Methodological Perspective.
Smith, T. W., & Dennis, J. M. (2005). “Online vs. In-person: Experiments with mode, format, and question wordings.”
Public Opinion Pros.
Smyth, Jolene D., Christian, Leah Melani, & Dillman, Don A. (2008). “Does "Yes or No" on the Telephone Mean the Same as "Check-All-That-Apply" on the Web?”
Public Opinion Quarterly 72: 103-113.
Snell, J. Laurie, Peterson, Bill & Grinstead, Charles. (1998). “Chance News 7.11.” Accessed August 31, 2009: http://www.dartmouth.edu/~chance/chance_news/recent_news/chance_news_7.11.html.
Squire, Peverill. (1988). “Why the 1936 Literary Digest Poll Failed.”
Public Opinion Quarterly 52, 125-133.
Sparrow, Nick. (2006). “Developing reliable online polls.”
International Journal of Market Research 48: 659-680.
Sparrow, Nick & Curtice, John. (2004). “Measuring the Attitudes of the General Public via Internet Polls: An Evaluation.”
International Journal of Market Research 46:23-44.
Speizer, H., Baker, R., & Schneider, K. (2005). “Survey Mode Effects: Comparison between Telephone and Web.” Paper presented at the annual meeting of the American Association for Public Opinion Research, Fontainebleau Resort, Miami Beach, FL.
Stevens, S.S. (1946). “On the theory of scales of measurement.”
Science 103: 677-680.
Stevens, S.S. (1951). “Mathematics, measurement and psychophysics.” In S.S. Stevens (Ed.), Handbook of experimental psychology, pp. 1-49. New York: Wiley.
Stirton, J. & Robertson, E. (2005). “Assessing the viability of online opinion polling during the 2004 federal election.”
Australian Market and Social Research Society. http://www.enrollingthepeople.com/mumblestuff/ACNielsen%20AMSRS%20paper%202005.pdf
Sturgis, P., Allum, N. & Brunton-Smith, I. (2008). “Attitudes Over Time: The Psychology of Panel Conditioning.” In P. Lynn (ed.), Methodology of Longitudinal Surveys. Wiley.
Suchman, E., & McCandless, B. (1940). “Who answers questionnaires?”
Journal of Applied Psychology, 24 (December), 758-769.
Sudman, S., Bradburn, N., & Schwarz, N. (1996).
Thinking About Answers. San Francisco: Jossey-Bass.
Taylor, Humphrey. (2000). “Does Internet Research Work?: Comparing Online Survey Results with Telephone Surveys.”
International Journal of Market Research 42:51-63.
Taylor, Humphrey. (2007). “The Case For Publishing (Some) Online Polls.” Polling Report. Accessed August 31, 2009 from http://www.pollingreport.com/ht_online.htm.
Taylor, Humphrey, Bremer, John, Overmeyer, Cary, Siegel, Jonathan W., & Terhanian, George. (2001). “The Record of Internet-based Opinion Polls in Predicting the Results of 72 Races in the November 2000 U.S. Elections.”
International Journal of Market Research 43:127-136.
Taylor, Humphrey, Krane, David, & Thomas, Randall K. (2005). “Best Foot Forward: Social Desirability in Telephone vs. Online Surveys.”
Public Opinion Pros, Feb. Available from http://www.publicopinionpros.com/from_field/2005/feb/taylor.asp.
Terhanian, G., and Bremer, J. (2000). “Confronting the Selection-Bias and Learning Effects Problems Associated with Internet Research.” Research Paper: Harris Interactive.
Terhanian, G., Smith, R., Bremer, J., and Thomas, R. K. (2001). “Exploiting analytical advances: Minimizing the biases associated with internet-based surveys of non-random samples.”
ARF/ESOMAR: Worldwide Online Measurement 248: 247-272.
Thomas, R. K., Krane, D., Taylor, H., and Terhanian, G. (2006). “Attitude measurement in phone and online surveys: Can different modes and samples yield similar results?” Paper presented at the Joint Conference of the Society for Multivariate Analysis in the Behavioural Sciences and European Association of Methodology, Budapest, Hungary.
Thomas, R. K., Krane, D., Taylor, H., & Terhanian, G. (2008). “Phone and Web interviews: Effects of sample and weighting on comparability and validity.” Paper presented at ISA-RC33 7th International Conference, Naples, Italy.
Toepoel, Vera, Das, Marcel, & van Soest, Arthur. (2008). “Effects of design in web surveys: Comparing trained and fresh respondents.”
Public Opinion Quarterly 72:985-1007.
Tourangeau, Roger. (2004). “Survey Research and Societal Change.”
Annual Review of Psychology 55:775-801.
Tourangeau, R. (1984). “Cognitive Science and Survey Methods.” In T. Jabine et al. (Eds.),
Cognitive Aspects of Survey Design: Building a Bridge Between Disciplines. Washington: National Academy Press, pp.73-100.
Tourangeau, R., Groves, R.M., Kennedy, C., & Yan, T. (2009). “The Presentation of a Web Survey, Nonresponse and Measurement Error among Members of Web Panel.”
Journal of Official Statistics 25: 299-321.
Twyman, J. (2008). “Getting it right: YouGov and online survey research in Britain.”
Journal of Elections, Public Opinion, and Parties 18: 343-354.
Valliant, R., Royall, R., & Dorfman, A. (2001).
Finite Population Sampling and Inference: A Prediction Approach. New York: Wiley.
Vavreck, L., & Rivers, D. (2008). “The 2006 Cooperative Congressional Election Study.”
Journal of Elections, Public Opinion, and Parties 18: 355-366.
Vonk, T.W.E., Ossenbruggen, R & Willems, P. (2006). “The effects of panel recruitment and management on research results.” ESOMAR Panel Research 2006.
Walker, R.& Pettit, R. (2009). “ARF Foundations of Quality: Results Preview.” New York: The Advertising Research Foundation.
Walker, R., Pettit, R., & Rubinson, J. (2009). “The Foundations of Quality Study Executive Summary 1: Overlap, Duplication, and Multi Panel Membership.” New York: The Advertising Research Foundation.
Waksberg, J. (1978). “Sampling Methods for Random Digit Dialing,”
Journal of the American Statistical Association 73:40-46.
"Wardle, J., Robb, K., & Johnson, F. (2002). “Assessing socioeconomic status in adolescents: the validity of a home affluence scale,”
Journal of Epidemiology and Community Health, 56,595-599.
Waruru, A. K., Nduati, R., & Tylleskar, T. (2005). “Audio computer-assisted self-interviewing (ACASI) may avert socially desirable responses about infant feeding in the context of HIV.”
Medical Informatics and Decision Making 5: 24-30.
Waterton, J. & Lievesley, D. (1989). “Evidence of conditioning effects in the British Social Attitudes Panel.” In
Panel Surveys (D. Kasprzyk, G. Duncan, G. Kalton, and M.P. Singh, eds.). New York: John Wiley and Sons.
Weijters, B., Schillewaert, N., & Geuens, M. (2008). “Assessing response styles across modes of data collection.”
Journal of the Academy of Marketing Science 36: 409-422.
Wilson, T.D., Kraft, D., & Dunn, D.S. (1989). “The Disruptive Effects of Explaining Attitudes: The Moderating Effect of Knowledge About the Attitude Object.”
Journal of Experimental Social Psychology 25: 379-400.
Woodward, M. (2004).
Study Design and Data Analysis, 2nd Edition. Boca Raton, FL: Chapman & Hall.
Yeager, D.S., Krosnick, J.A., Chang, L., Javitz, H.S., Levendusky, M.S., Simpser, A., & Wang, R. (2009). “Comparing the Accuracy of RDD Telephone Surveys and Internet Surveys Conducted with Probability and Nonprobability Samples.” Stanford University.
Appendix A: Portion of the CASRO Code of Standards and Ethics dealing with Internet Research