American Association for Public Opinion Research


An Evaluation of the Methodology of the 2008 Pre-Election Primary Polls

 
Prepared by the American Association for Public Opinion Research Ad Hoc Committee on the 2008 Presidential Primary Polling
 
Revised Version, April 2009

Members of the Ad Hoc Committee:
Glen Bolger, Public Opinion Strategies
Darren W. Davis, University of Notre Dame
Charles Franklin, University of Wisconsin-Madison
Robert M. Groves, Institute for Social Research, University of Michigan
Paul J. Lavrakas, methodological research consultant
Mark S. Mellman, The Mellman Group
Philip Meyer, University of North Carolina
Kristen Olson, University of Nebraska-Lincoln
J. Ann Selzer, Selzer & Company
Michael W. Traugott (Chair), Institute for Social Research, University of Michigan
Christopher Wlezien, Temple University
 

An Evaluation of the Methodology of the 2008 Pre-Election Primary Polls

 
Table of Contents
Foreword
Executive Summary
Introduction
Analyzing Accuracy
Organization of the Committee’s Work
How Well Did the Polls Do?
Explanations for Differences in Accuracy
Mode of Data Collection
Sample Frames and Respondent Selection
Non-coverage of Cell Phone Only Voters
Nonresponse Bias and the Composition of Samples
Trial Heat Question Wording
Likely Voter Definitions
Calling Protocols
Timing of Data Collection
Social Desirability
Weighting
Time of Decision
Participation of Independents
Allocations of ‘Undecideds’
Ballot Order Effects
Conclusions
References
Appendix A:  Committee Members
Appendix B:  Charge to the Committee
Letter from AAPOR President Mathiowetz
Appendix C:  AAPOR Request for Information
Roper Center Guidelines for Depositing Data
Appendix D:  Hypotheses for Errors
Appendix Table 1
Appendix Table 2
“A Review and Proposal for a New Measure of Poll Accuracy”

Foreword

This report was prepared by a special committee appointed by the American Association for Public Opinion Research (AAPOR). The problem that prompted the formation of the committee was poll performance leading up to the New Hampshire primary. The members of the committee volunteered their time in this effort because of their interest in understanding what happened with estimation of candidate preference in the 2008 polls. Their expertise in various aspects of polling and survey research methods guided the analysis and write-up of the results. From the beginning, committee members decided that they would pursue their investigation empirically. But there was no opportunity to design prospective research into polling methods. The fact that the committee’s work began only after questions were raised about the quality of the polling in the early primaries meant that some avenues of inquiry could not be pursued. Appropriate data to explore these matters were not available (or were not made available) to the committee by those who conducted the polls. This report represents the committee’s best effort to address these issues raised by the 2008 pre-primary polls within the constraints of limited available information.

Initially, we expected the committee to work rapidly and complete its report by the annual AAPOR conference in May 2008. However, the slow response of many pollsters and a lack of cooperation from some delayed the analysis and subsequent reporting. The fact that many pollsters did not provide us with detailed methodological information about their work on a timely basis is one reason we will never know for certain exactly what caused the problems in the primary polling that we studied. It is also true that some of the more interesting questions about the causes are not amenable to post hoc analysis. While the available data allowed us to disconfirm some hypotheses and provide some tantalizing clues about what went wrong, definitive declarations about the sources of estimation errors are not possible.

Polling during an election campaign is an important element of news coverage of related events. The symbiotic relationship between campaign coverage and polling is a given in contemporary campaigns; it is hard to imagine one without the other. But polling is also a scientific data collection technique, and it is impossible to evaluate the performance of the pollsters without information about their methodology. That is why the AAPOR “Code of Professional Ethics and Practices” includes a set of elements that those who conduct polls and surveys should disclose so that other professionals can evaluate the quality of the research that they conduct and the results that they disseminate. The committee’s experience suggests that some firms engaged in election polling pay only lip service to these disclosure standards, while others are operating at such a thin margin that they do not have the resources to devote to answering questions about their methods. The committee believes that professional organizations like AAPOR, the Council of American Survey Research Organizations (CASRO), and the National Council on Public Polls (NCPP) should review their published standards with an eye toward an educational program that explains them, reinforces the underlying justification for having them, and promotes more effective enforcement of them.

A number of acknowledgments are due.  The committee’s efforts were supported by a grant from James S. Jackson, director of the Institute for Social Research (ISR) at the University of Michigan. These funds were used to support research assistance and administrative costs, and several of the analyses would not have been possible without this support. In particular, these resources were used to support a content analysis of the press coverage of polling around the New Hampshire primary; these results are discussed in the introduction to the report. This content analysis was conducted at ISR by Colleen McClain and Brian Krenz, while the analysis of the information received from the pollsters was performed by Courtney Kennedy, a graduate student in the Program in Survey Methodology at ISR, all under my supervision. Ms. Kennedy also participated in the organization, design and layout of the report, and she drafted some sections. The pollsters who provided information to the committee were given a draft to review for accuracy with regard to statements about their procedures; their comments were helpful.  The final report was edited by N.E. Barr, who significantly improved the prose. As always, the contents of the report are the sole responsibility of the committee and no one else.

This version of the report incorporates some small editorial changes and supersedes the March 2009 version. 
 
Michael Traugott
Ann Arbor, Michigan
 

Executive Summary

This report is the product of an investigation by the Ad Hoc Committee on the 2008 Presidential Primary Polling appointed by the American Association for Public Opinion Research (AAPOR). The committee pursued its investigation by analyzing information that AAPOR solicited from the pollsters who conducted studies in four primaries in the 2008 presidential campaign: New Hampshire, South Carolina, California, and Wisconsin. A central concept underlying the role of public opinion research and polling as a scientific enterprise is the disclosure of the methods used. While most citizens cannot make informed judgments about the quality of polling data or their interpretation, other professionals in the field can, if they have access to a minimal set of information about how the data were collected and analyzed. The failure of some pollsters to provide information on a timely basis runs counter to this principle, and it hindered the progress of the committee’s work and delayed the release of this report. While the results of this investigation could not affect subsequent polling in the 2008 campaign, the committee hopes that its work will raise questions for consideration about the methodology of subsequent pre-election polling. We also hope that the report will spur timely disclosure of information to aid in future evaluations of the methodological quality of pre-election polling.

The committee developed a series of hypotheses that could be tested empirically, evaluating them with information at the level of the state, the poll, and, in limited cases, the respondent. Since the analysis was conducted after data collection, it was not possible to evaluate all of the hypotheses in a way that permitted strong causal inferences. And because the data for various measures were incomplete, it was not possible to pursue all hypotheses about what might have happened, nor was it possible to pursue multivariate analyses that looked simultaneously at multiple explanatory factors. In the end, the analysis suggests some possible explanations for the estimation errors and identifies other factors that are unlikely to have contributed to them. The research also highlights the need for additional disclosure requirements, such as information on likely voter models and the details of weighting algorithms, as well as the need for better education by professional associations like AAPOR, the Council of American Survey Research Organizations (CASRO), and the National Council on Public Polls (NCPP).

Polling in primary elections is inherently more difficult than polling in a general election. Usually there are more candidates in a contested primary than in a general election, and this is especially true at the beginning of the presidential selection process. For example, there were a total of 15 candidates entered in the Iowa caucuses and more than 20 names on the New Hampshire primary ballot. Since primaries are within-party events, the voters do not have the cue of party identification to rely upon in making their choice. Uncertainty in the voters’ minds can create additional problems for pollsters. Turnout is usually much lower in primaries than in general elections, although it varies widely across events. Turnout in the Iowa caucuses tends to be relatively low compared to the New Hampshire primary, for example. So estimating the likely electorate is often more difficult in primaries than in the general election. Furthermore, the rules of eligibility to vote in the primaries vary from state to state and even within party; New Hampshire has an open primary in which independents can decide at the last minute in which party’s primary to vote. All of these factors can contribute to variations in turnout, which in turn may have an effect on the candidate preference distribution among voters in a primary election.

The estimation errors in the polls before the New Hampshire Democratic primary were of about the same magnitude as in the Iowa caucus. But the mis-estimation problems in New Hampshire received much more – and more negative – coverage than they did in Iowa. Because every poll included a small proportion of undecided voters, the estimates for each individual candidate were generally lower than the share of votes that candidate received. These underestimates tended to be greater for the first place finisher than for the second place finisher. But the majority of the polls before New Hampshire suggested the wrong winner, while only about half in Iowa did.
 
Factors that may have influenced the estimation errors in the New Hampshire pre-primary polls include:

  1. Given the compressed caucus and primary calendar, polling before the New Hampshire primary may have ended too early to capture late shifts in the electorate there, measuring momentum as citizens responded to the Obama victory in the Iowa caucus but not to later events in New Hampshire.
  2. Patterns of nonresponse, derived from comparing the characteristics of the pre-election samples with the exit poll samples, suggest that some groups that supported Senator Hillary Clinton were underrepresented in the pre-election polls.
  3. Variations in likely voter models could explain some of the estimation problems in individual polls. Application of the Gallup likely voter model, for example, produced a larger error than Gallup’s unadjusted data did. While the “time of decision” data do not look very different in 2008 compared to recent presidential primaries, about one-fifth of the voters in the 2008 New Hampshire primary said they were voting for the first time. This influx of first-time voters may have had an adverse effect on likely voter models.
  4. Variations in weighting procedures could explain some of the estimation problems in individual polls. And for some polls, the weighting and likely voter modeling were comingled in a way that makes it impossible to distinguish their separate effects.
  5. Although no significant social desirability effects were found that systematically produced an overestimate of support for Senator Obama among white respondents or for Senator Clinton among male respondents, an interaction effect between the race of the interviewer and the race of the respondent did seem to produce higher support for Senator Obama in the case of a black interviewer. However, Obama was also preferred over Clinton by those who were interviewed by a white interviewer.
Factors unlikely to have contributed to the estimation errors in the New Hampshire pre-primary polls include:
  1. The exclusion of cell phone only (CPO) individuals from the samples did not seem to have an effect. However, the CPO share of the population will change over time, and pollsters should remain attentive to its possible future effects.
  2. Using a two-part trial heat question, intended to reduce the level of “undecided” responses, did not produce that desired effect and does not seem to have affected the eventual distributions of candidate preference.
  3. The use of either computer-assisted telephone interviewing (CATI) techniques or interactive voice response (IVR) techniques made no difference to the accuracy of estimates.
  4. The use of the trial heat questions was quite variable, especially with regard to question order, but no discernible patterns of effects on candidate preference distributions were noted. While the names of the (main) candidates were frequently randomized, the committee did not receive data that would have permitted an analysis of the impact of order.
  5. Little compelling information indicates that Independents made a late decision to vote in the New Hampshire Republican primary, thereby increasing estimate errors.
Factors that present intriguing potential explanations for the estimation errors in the New Hampshire polls, but that the committee lacked adequate empirical information to assess thoroughly, include:
 
  1. The wide variation in sample frames used to design and implement samples – ranging from random samples of listed telephone numbers, to lists of registered voters with telephone numbers attached, to lists of party members – may have had an effect. Greater information about sample frames and sample designs, including respondent selection techniques, would facilitate future evaluations of poll performance.
  2. Differences among polls in techniques employed to exclude data collected from some respondents could have affected estimates. Given the lack of detailed disclosure of how this was done, it is not possible to assess the impact of this procedure.
  3. Some polls combined weighting to adjust for nonresponse among demographic groups with weighting that reflects likely voter models into a single set of weights for a study. This complicates the analysis of whether or how much sampling issues or likelihood of voting models are contributing to estimation error.

Finally, factors that appeared to be potential explanations for estimation errors, but that the committee lacked any empirical information to assess, include:
 
  1. Because of attempts by some states to manipulate the calendar of primaries and caucuses, the Iowa and New Hampshire events were rescheduled to the first half of January, with only five days between the events, truncating the polling field period in New Hampshire following the Iowa caucus.
  2. The order of the names on the ballot – randomly assigned but fixed on every ballot - may have contributed to the increased support that Senator Hillary Clinton received in New Hampshire.
 
All of the information provided to the committee is being deposited in the Roper Center Data Archive, where it will be available to other analysts who wish to check on the work of the committee or to pursue their own independent analysis of the pre-primary polls in the 2008 campaign.

Introduction

Polling in primary elections is inherently more difficult than polling in a general election. Usually there are more candidates in a contested primary than in a general election, and this is especially true at the beginning of the presidential selection process. For example, there were a total of 15 candidates entered in the Iowa caucuses.1 Since primaries are within-party events, the voters do not have the cue of party identification to rely on in making their choice. This level of uncertainty in the voters’ minds can create additional problems for the pollsters. Turnout is usually much lower in primaries than in general elections, although it varies widely across events. So estimating the likely electorate is often more difficult in primaries than in the general election. Furthermore, the rules of eligibility to vote in the primaries vary from state to state and even within party; New Hampshire has an open primary in which independents (those with an undeclared party registration) can choose in which party’s primary to vote. All of these factors can contribute to variations in turnout, which in turn may have a great effect on the candidate preference distribution among voters in a primary compared to the general election.

In the 2008 primary campaign, the record of the polls in estimating outcomes differed in the Democratic and Republican events. This could be explained by a number of factors, not the least of which is that the Democratic contest was hard fought and went on for the entire calendar of events, while the Republicans had selected John McCain as their presumptive nominee by March. On the Democratic side, the percent of actual votes cast for the winner among the votes cast for the top two candidates tended to be greater than the same ratio for the winner in the final week’s pre-election polls. That is, the polls generally underestimated the winner’s performance on Election Day relative to the second place finisher, although analysis shows that, by this measure, the performance of the polls may have improved slightly over the course of the primary calendar.

This relationship was quite different in the Republican contests. In the early contests – and up through Super Tuesday – a similar pattern of underestimating the winner’s share of the vote for the top two candidates appears. The winner’s share of support for the top two candidates in the polls was generally less than his share of the actual vote for the same two candidates, though less consistently (and to a lesser extent) than on the Democratic side.  In the later primaries, after McCain essentially secured the nomination, this tendency disappeared. For a while the polls tended to exhibit less bias with respect to the winner, and then they tended consistently to overestimate the winner’s share of the two-candidate vote. This analysis suggests that there were factors associated with the contests themselves, including the level of competition or the uncertainty of the outcome, that seem to have had an effect on the quality of estimation.2

The impetus for this report was the performance of the polls prior to the New Hampshire primary. In the run-up to the January 8 event, the pre-election polls showed Senator Barack Obama with a comfortable lead over Senator Hillary Clinton, while Senator John McCain was holding a steady lead over former governors Mitt Romney and Mike Huckabee. McCain won about as expected on the Republican side, while Clinton bested Obama by three percentage points.

Although the Republican contest ended up about as the polls showed, the mis-estimation of the Democratic race caused much consternation in the press and within the polling profession. The errors in the Democratic estimates in Iowa, discussed briefly in the beginning of this report, were of about the same magnitude as in New Hampshire, but the level of criticism was much lower because several polls correctly projected an Obama victory. In New Hampshire, the pollsters got the Democratic winner wrong, and that made all the difference in the press coverage and commentary about the polls. It also determined that the pre-primary polls in New Hampshire would be a focal point for our inquiry, although other states were added to the list, as noted below.

A content analysis of the media coverage in six sources3 for the first twelve days of January 2008 showed that concerns about the portrayal of the polling industry were warranted. The coverage of polls increased after New Hampshire, turned more negative than in the period leading up to the primary, and became more focused on “the polls” as a group rather than on specific estimates produced by individual polling firms, as shown in Table 1 (Traugott, Krenz, and McClain 2008).

[Table 1]
Furthermore, the most negative elements of the post-New Hampshire coverage were the references to the (lack of) accuracy of the polls, as shown in Table 2. These trends raised concerns among the professional associations whose members are pollsters and survey researchers because they understood the unusual relationship between the accuracy of pre-election polls and the image of the entire industry. Most public polling estimates do not have external validation, but pre-election polling is the special case where the results of the election itself validate the quality of the estimation. The leadership of the American Association for Public Opinion Research (AAPOR) also believed that, given the scientific basis for polling, it should be possible to explore reasons for the estimation problems.

[Table 2]

The parallels between these concerns and those expressed after the 1948 general election were obvious, and AAPOR decided to empanel a group of experts to investigate the potential explanations for the mis-estimates in the New Hampshire pre-primary polls. The committee comprised academic experts, public pollsters, and partisan pollsters who work for candidates, although not for candidates in the 2008 presidential campaign. By agreement, our work was empirical, involving only the evaluation of possible explanations that could be investigated with information about how the polls were conducted or through the analysis of data collected from them. Since the investigation was not planned ahead of time, the committee often found itself without the appropriate data with which to test some hypotheses.

The work of the committee, and hence this report, has been delayed by a slow response from many of the pollsters who collected data in the four states on which the committee focused its efforts – New Hampshire, South Carolina, Wisconsin, and California.4 This is quite a different situation from the one after the 1948 general election, when there were fewer firms engaged in public polling, the threat to the future of the industry seemed to be greater, and the polling firms were fully cooperative. In 2008, many of the firms that polled in New Hampshire had studies in the field for the primaries that immediately followed. Today, there are well-publicized standards for disclosure of information about how polls are conducted. AAPOR, an organization of individuals engaged in public opinion research; the National Council on Public Polls (NCPP), an organization of organizations that conduct public opinion research; and the Council of American Survey Research Organizations (CASRO), also an organization of organizations, have all promulgated standards of disclosure. Despite these norms, at the time this report was finalized, one-fifth of the firms from which information was requested (four of the 21) had not provided it. For each of these four firms, we were able to retrieve some of the requested information through Internet searches, but this was incomplete at best. If additional information is received after the report’s release, the database at the Roper Center will be updated.

1 In the New Hampshire Democratic primary, there were 19 candidates officially named on the ballot, and 19 Democrats received write-in votes; only five of these candidates received more than 1,000 votes out of almost 300,000 cast. In the Republican primary, there were 22 candidates named on the ballot, and 21 additional candidates received votes. In total, only nine candidates received more than 1,000 votes out of almost 250,000 cast.
2 Additional information about this analysis can be obtained from Christopher Wlezien, Department of Political Science, Temple University: Wlezien@temple.edu.
3 The news outlets reviewed in the media content analysis were CNN, FOX News, CBS News, the New York Times, the Washington Post, and the Boston Globe.

4 The last pieces of information were supplied between October 15, 2008 and March 13, 2009 after a final request for information and a reminder that the responses to the request would be fully disclosed in the committee’s report.  Some additional information arrived on or after February 10, 2009, after the AAPOR Standards Chair contacted several organizations about various elements of disclosure that were lacking.

Analyzing Accuracy
 
In this evaluation of the pre-primary and caucus polls, we measured their accuracy using the statistic A,5 the natural log of the odds ratio of the relative standing of two candidates in a poll compared to their relative standing in the actual election returns. In the case of each primary that was analyzed, the denominator of A was the same for all polls; differences in the values of A can therefore be attributed to differences in the numerator, which was calculated from the poll estimates themselves. Primaries are multi-candidate events, often with very large fields of competing candidates at the beginning of the presidential nomination process. We used A in two different ways to measure the relative standing of the top candidates in the voting compared to the estimate of their standing in the polls. For the Democrats, we consistently measured Obama’s share compared to Clinton’s. For the Republicans, we measured the first place finisher’s share compared to the second place finisher’s – a pairing that changed from event to event as clearly noted in each analysis. The closer A is to zero, the closer the odds ratio is to 1.0 and the more accurate the poll is in relation to the election outcome. In Democratic contests, negative values denote an underestimation of Obama’s vote share compared to Clinton’s vote share (or an overestimation of Clinton’s share compared to Obama’s), relative to the actual outcome of the election. Positive values indicate an overestimation of Obama’s share compared to Clinton’s (or an underestimation of Clinton’s share compared to Obama’s). In Republican contests, negative values denote an underestimation of the winner’s vote share compared to the second place finisher’s in the poll, relative to the actual outcome of the election. A positive value indicates an overestimation of the first place finisher’s share compared to the second place finisher’s share.
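For reference, the verbal definition of A given above can be written as the formula below. The notation – p for a candidate’s share in the poll, v for the corresponding share of the actual vote, with subscripts 1 and 2 for the two candidates being compared – is introduced here only for clarity and is not taken from the original article:

A = ln[ (p1 / p2) / (v1 / v2) ]

The vote ratio v1/v2 is the common denominator referred to above. A equals 0 when a poll reproduces the two candidates’ relative standing exactly; it is negative when the poll understates candidate 1 relative to candidate 2 and positive when it overstates candidate 1 relative to candidate 2. As a purely hypothetical illustration, a New Hampshire Democratic poll showing Obama at 37% and Clinton at 30% (actual result: 36% for Obama, 39% for Clinton) would yield A = ln[(37/30)/(36/39)] ≈ 0.29, an overstatement of Obama’s standing relative to Clinton’s.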

Information presented in Table 3, based upon each firm’s last poll before the relevant event, suggests that the pre-voting polls on the Democratic side in Iowa were no more accurate than those in New Hampshire, by both the average absolute value of A (0.26 in Iowa and 0.27 in New Hampshire) and the range of values (0 to .55 in Iowa versus .10 to .45 in New Hampshire). Table 3 also shows the differences in the point estimates for the first and second place finishers; most of these differences are negative because the total of the votes cast adds to 100%, while the poll estimates include some “undecided” voters. The Democratic polls generally did well in projecting the vote share of the second place finisher, but they were off severely in estimates of the winner’s proportion. In both Iowa and New Hampshire, the polls in the Democratic race understated the winner’s vote share by 9 to 10 percentage points but came within one point of projecting the loser’s vote share. On the Republican side, the polls also underestimated the winner’s proportion in both Iowa and New Hampshire by almost four percentage points on average, but there was only a one-percentage-point difference in Iowa in the estimates of the runner-up’s proportion.
 
[Table 3]
Source for election results: http://nass.org/index.php?option=com_content&task=view&id=89&Itemid=223

Estimating the outcome of the Iowa caucuses6 is complicated by a two-step process of measuring preferences: supporters of a candidate who does not receive votes from at least 15% of the caucus delegates who voted in the first round can express a second choice in the next round. Barack Obama finished first with 37.6% of the eventual Democratic caucus delegates, John Edwards second with 29.8%, and Hillary Clinton third with 29.5%. Most of the Iowa pre-caucus polls underestimated the size of Barack Obama’s margin over Hillary Clinton while getting the winner correct in about half of the estimates. On the Republican side, Mike Huckabee finished first with 34.4% support from those caucus attendees, Mitt Romney second with 25.2%, Fred Thompson third with 13.4%, and John McCain fourth with 13.0%. Here the polls consistently underestimated support for Huckabee, and three suggested that Romney would finish first.

In the New Hampshire primary, which came only five days later amid a flurry of coverage of the Obama victory in Iowa, every one of the pre-primary polls showed him in the lead, although the margin was often statistically insignificant. When the votes were counted, however, Hillary Clinton had won 39% of the vote while Barack Obama received 36%, a narrow but unexpected Clinton victory. While these polls were not far off in their estimates of support for Obama, they all underestimated support for Clinton. The values of A, all of which were positive, ranged from .10 to .45. On the Republican side, 10 out of 12 polls suggested a McCain victory, with one showing the contest between McCain and Mitt Romney very close and one suggesting that Romney was in the lead. The absolute values of A ranged from .02 to .30. In this case, the estimates were generally predictive of the winner.

The inability of the vast majority of the pre-primary polls in New Hampshire to estimate the Clinton victory accurately, even while they estimated the Republican outcome reasonably well, produced consternation in the national media about the performance of “the polls.” This in turn prompted discussion among the leadership of the American Association for Public Opinion Research (AAPOR) because political polling represents the public face of the public opinion and marketing research professions.  Given the virtually unique circumstances of having a source of external validation for published survey estimates – in this case, the election results – the potential for public reaction to pre-primary polling estimates is great, and these reactions may influence the public’s more general view of the industry, and thus its reputation.

5 This measure is described in detail in Martin, Traugott, and Kennedy (2005). A is a statistic that captures both the magnitude and the direction of estimation errors by computing the log of the odds ratio comparing the two leading candidates’ shares in a pre-election poll to their shares of the actual election vote. Other measures of poll accuracy are available (Mitofsky 1998) but are not included here in order to simplify the analysis. A copy of the original article is appended to this report.
6 The election results reported for the Iowa Democratic caucuses are the relative proportions of the delegate totals, not popular vote totals.  The Iowa Republican Party releases the results of a straw poll conducted as part of their caucus, but the state Democratic Party does not.

Organization of the Committee’s Work

In her charge to the AAPOR committee formed to investigate the estimation problems in the pre-New Hampshire polls (see Appendix A for a list of all committee members), then-President Nancy Mathiowetz asserted: “What we learn from this review will help us to continue to improve our methodology and ensure continued accuracy.” (The formal charge to the committee is contained in Appendix B.) When organizing the work of the committee, chair Michael Traugott of the University of Michigan proposed an empirical investigation of a series of possible explanations for the problems. To learn more about what might have happened in the New Hampshire Democratic pre-primary polls, the committee agreed to expand its analysis in limited ways to include investigation of the estimates in both the Republican and Democratic pre-election polls in four states: New Hampshire, South Carolina, Wisconsin, and California. As will be explained later, the polls generally underestimated support for Barack Obama in South Carolina and overestimated support for Hillary Clinton in California. In addition, the unusually large numbers of undecided respondents in South Carolina and Wisconsin suggested that analyzing those polls would yield insightful results.

The committee decided upon a number of data elements that it would need to pursue its analysis of possible explanations for the differences between the final pre-election estimates and the actual primary outcomes in the selected states. The next step was to obtain information about each survey to support the committee’s work. President Mathiowetz took on the task of requesting information from the 21 polling organizations that produced publicly reported estimates in any of the four states within two weeks of the primary election.7 Her request included information that public pollsters subscribing to the Standards for Minimal Disclosure as part of the AAPOR Code of Ethics and Professional Practice could be expected to reveal under normal circumstances.8 Beyond these minimal items, Mathiowetz’s layered request asked for information not part of the minimal disclosure requirements, but that would help the committee with its work. This information included a copy of the micro dataset from the survey, as well as data concerning interviewer characteristics, where applicable, and other administrative data from the data collection process such as calling information. (A copy of the disclosure request is included in Appendix C.) This request, dated March 4, 2008, came during a very busy part of the primary schedule, and many firms were engaged in continuing data collection that prevented them from responding immediately. And AAPOR was not in a position to offer financial assistance in the preparation of these additional materials, so the firms had to bear the cost themselves.

Table 4 shows the level of polling organization response to these requests.9 The rate of providing information and the specific information elements provided varied widely by firm. Despite repeated requests for information, at the time of the analysis of data for this report three firms never responded: Clemson University, Ebony/Jet, and Strategic Vision.
 
[Table 4]
Y: Information provided directly to the committee in response to the AAPOR request through March 2009. W: Information obtained from materials posted on the Internet.
n/a: Information is not applicable.
*This information was requested but is not required under the AAPOR Minimal Disclosure Standards.
** SurveyUSA used IVR, but they still provided the race and gender of the interviewers who read the questionnaire.

Given the lack of information from several of the firms, the committee attempted to retrieve some of the minimum disclosure information from alternative sources, such as a firm’s web site or news sources reporting on the study. The requested minimal disclosure items included: sponsorship (16 of the 21 firms or survey agencies provided this information), the exact wording of each question asked (16 firms provided this information directly and for 4 firms, the information was obtained from other sources), a description of the sampling frame (provided by or found for 19 of the firms), sample sizes and eligibility criteria (available for 33 surveys from 20 firms with varying levels of specificity for eligibility criteria), response rates (available for 17 surveys representing 12 firms), a description of the precision of the findings as assessed by a margin of error statement (provided by or found for all surveys), the weighting procedures (available for 17 firms), and the dates of the field period (provided by or found for all surveys except one).

The additional requested information included: a micro data file for analysis (provided for 7 surveys, including two from SurveyUSA),10 information on the characteristics of the interviewers (provided for three surveys and not directly appropriate for the four IVR surveys), and the call records (not provided by any firm, although two datasets included information on the call on which the interview was completed).

The major design features of the pre-election polls are presented in Table 5. All the polls were conducted over the telephone. Most used computer-assisted telephone interviewing (CATI), but a noticeable minority was conducted using interactive voice response (IVR) equipment. With IVR, the respondents hear a pre-recorded reading of the survey questions, and they register their answers verbally or manually on their telephone touchpad. The implications for estimation error stemming from data collection mode and other design features are investigated in the analysis below.
 
Summary: The committee commends the firms that did respond on a timely basis, especially those that provided micro datasets for extended analysis. Several of the firms that conducted pre-election polling in the 2008 primaries in New Hampshire, South Carolina, Wisconsin, and California were slow to disclose the details of their work to the committee.  Beyond those that did not respond, others made incomplete information available in terms of AAPOR’s minimal disclosure “requirements.”
 
[Table 5]
NA indicates that the information was not supplied by the pollster and could not be located elsewhere.
RR indicates a response rate calculated according to an AAPOR formula. Any other response rate was supplied by a pollster but without an indication of how it was calculated.
Table entries are based upon information provided directly to the committee in response to the AAPOR request through mid-March 2009.

7 Some firms conducted surveys in more than one of the four states, and these multiple occurrences are counted in the 31 estimates.
8 For a complete description of these items, see https://www.aapor.org/Standards-Ethics/AAPOR-Code-of-Ethics.aspx.

9 Some surveys had multiple sponsors plus a data collection firm (e.g., LA Times/CNN/Politico.com/ORC).  To make the tables and text more readable, we employ a shorthand label for each survey, corresponding to the organization contacted by the committee.
10 When subsequent information is presented by firm, there is only one entry for SurveyUSA as they employed the same methodology in each poll.

How Well Did the Polls Do? Evaluation of the Pre-election Polling Estimates in Five States11
 
The committee began by developing a series of hypotheses or conjectures, based upon the members’ own expertise as well as opinions contained in the news media or put forward by other survey professionals after the New Hampshire primary. (The list of hypotheses is contained in Appendix D.) A useful way to think initially about the estimation issues is through a visual examination of the accuracy of the polls, as measured by A, for the period of approximately one month preceding each election date; in later analysis, we will focus on the polls conducted in the two-week period preceding each event.

One would not necessarily expect the estimates four weeks in advance of an election with many candidates to be accurate, but this time period gives a perspective on a number of issues such as whether the estimates became more accurate as Election Day approached; whether the A values of different polls were more or less randomly distributed around a value of 0, indicating more or less correct estimation without any aggregate bias; and whether there were differences in the accuracy of the estimates produced in the Democratic and Republican contests. These graphical summaries are presented in Figures 1 through 5. By proceeding in the chronological order of the events, it is also possible to see how or whether the accuracy of the estimates might have changed over time, especially as the number of candidates in each field declined.

Starting with Iowa as background for the committee’s central focus and to provide context for investigating accuracy (Figure 1), we employ a red “R” to indicate each estimate from a single poll for the Republican caucus and a blue “D” to indicate each estimate for the Democratic caucus. The values of A for the Democratic contests were more variable, and the polls generally tended to overestimate support for Hillary Clinton relative to Barack Obama in comparison to the actual outcome. In the final week before the caucus, the polls moved toward more accurate estimation, with the exception of an estimate by the American Research Group that showed Clinton ahead by 9 percentage points. In the Republican contest, the estimates from the polls showed the same movement toward correct estimation, but all of the polls underestimated Mike Huckabee’s margin over Mitt Romney. In both contests, the polls underestimated support for the winner relative to the second place candidate, although this error generally was reduced as caucus day approached.

[Figure 1]

Data are presented in Figure 2 for the equivalent time series of poll estimates in New Hampshire, occurring only five days later. In this case, the problem seems relatively clear: 18 out of 21 polls that went into the field after the Iowa caucus showed Obama in the lead (with two estimates within the margin of error); one showed Clinton in the lead, early on, although the difference was within the margin of error. The figure shows that relative support for Obama increased after the Iowa caucus, but there was no associated shift in support for McCain or Romney, as some of the values of A in the last few days – generally smaller than the corresponding values in the Democratic estimates – were positive and others negative. These patterns suggest a systematic shift in the estimation of the outcome of the Democratic primary and something more akin to random differences in the Republican estimates.

[Figure 2]

Analogous data are presented in Figure 3 for the estimates produced before the South Carolina primaries. The Republicans and Democrats in this state actually held their primaries one week apart, with the Republicans voting on January 19th and the Democrats on January 26th. Fewer pre-election polls were conducted in South Carolina than in Iowa or New Hampshire. McCain won the primary by a narrow three-percentage-point margin over Huckabee, and generally the pre-election polls indicated a tight Republican contest, although one suggested Huckabee would win. Obama won a decisive victory over Clinton by 28 percentage points, and, while all of the polls indicated he was in the lead, they consistently produced underestimates of his eventual margin. And these estimates did not improve as Election Day approached.

[Figure 3]

Data are presented in Figure 4 that show the distribution of the pre-primary estimates for the California contests. Focusing on the final estimates produced in the week leading up to the primaries, two different patterns emerge. For the Democrats, Clinton won by a 9-point margin, but several of the polls overestimated her advantage. On the Republican side, the pre-primary estimates favored Romney, both absolutely and relative to McCain, although McCain won the primary by 7 percentage points. The time series of estimates suggests that overall the polls in California became less accurate as Election Day approached.

[Figure 4]

In the case of Wisconsin, with its primaries held on February 19, two weeks after Super Tuesday, Obama was an easy winner by 17 percentage points. Fewer polls were conducted before the Wisconsin primary than before the other primaries, and they all underestimated support for Obama. One factor contributing to this error may have been the Wisconsin rule permitting Election Day registration, which can make it harder for pollsters to anticipate the turnout rate accurately, overall and among key subgroups. The data presented in Figure 5 show that the estimates got closer to the eventual outcome as Election Day approached. On the Republican side, McCain defeated Huckabee by eight percentage points, and the final polls projected this within their margin of error.

[Figure 5]

Summary: An examination of the time series of estimates of caucus and primary outcomes in five states does not show a consistent pattern of accuracy in estimation. In some cases, estimation improved as Election Day approached, but in other cases it did not. In some cases, the inaccuracy of estimates seemed random, but in others there was an indication of systematic bias favoring one candidate or another. In general, the average estimates of the outcomes in the Republican races were more accurate than those in the Democratic races.

11 The five states for which some data are presented in this report differ with respect to the rules governing who can vote in their primaries.  Generally speaking, the more inclusive the primary, the more difficult the outcome is to predict. Both major parties held caucuses in Iowa, and voters were allowed to register at the caucus site.  New Hampshire had semi-open primaries with Election Day registration.  South Carolina had open primaries for both parties.  California had semi-open primaries with no Election Day registration permitted.  Wisconsin had open primaries for both parties, and Election Day registration was permitted.

Explanations for Differences in the Accuracy of Pre-Primary Polls
 
In this section, we present the results of analyses related to possible explanations for estimation errors. The results are presented in line with the general sequence or chronology of decisions made in the design and implementation of any survey. That is, we begin with an examination of differences in the mode of data collection, then turn to sampling issues, continue with an analysis of the impact of weighting procedures and likely voter models on the final preference distributions, and end with a consideration of other external factors. The analysis will shift between comparisons of states, survey organizations, and individual survey estimates, and such changes will be made clear as they occur.

The use of this structure and the nature of the available data for analysis imply that we looked at one cause at a time for problems with accuracy. However, this should not be construed as an expectation that there was a single explanation for the problems. Any monocausal explanation is likely to be both inaccurate and misleading. Pre-election polling is a complex process, and explanations for difficulties could derive from multiple factors, such as statistical sampling theory, voter decision making, or the psychology of interviewer-respondent interactions. Moreover, a variety of small effects can accumulate to produce significant inaccuracies. Untangling these relationships in an ex post facto analysis such as this is virtually impossible given differences in the amount of information available for each poll estimate.
 
Mode of Data Collection
 
An initial decision in designing any survey is selecting a mode of data collection. In the case of pre-election polling, firms organize their work around a particular mode of data collection, and sponsors or clients usually end up with a particular mode as a result of selecting a firm to do the work, typically on the basis of cost. In the case of the pre-primary and caucus polls we analyzed, only two modes of data collection were used: 1) telephone interviews using a human interviewer in combination with a computer-assisted telephone interviewing (CATI) system, and 2) telephone interviews using an interactive voice response (IVR) system in which digitally recorded questions were answered using a touch-tone phone.

Data are presented in Table 6 that show the average absolute value of A, the accuracy measure, for trial heat estimates made in the final two weeks leading up to the primary elections in five states. The averages are computed separately for CATI polls and IVR polls because we wanted to assess whether the level of accuracy differed by mode. However, it is important to note that these polls differed on factors besides mode (e.g., field dates), which makes this comparison less than straightforward.

In every state, more surveys were conducted via CATI than via IVR systems, although the differences were relatively small in South Carolina and Wisconsin compared to Iowa, New Hampshire, and California, which may be attributable to the larger number of polls conducted in the latter three states. The number of IVR polls conducted per state ranged from 1 to 3, while the number of CATI polls ranged from 3 to 13. Where only one IVR poll was conducted, the comparison might be thought of as an evaluation of a “house effect” from a single firm as much as a mode comparison.
 
[Table 6]
A Only the final estimates from each poll are included in this analysis. No poll is included more than once, but the set of polls considered is larger than that listed in Table 5 because all polls during the final two weeks are included here.
B Small sample sizes, particularly with respect to the small number of IVR polls, severely limit attempts to isolate an effect from mode on accuracy.

Table 6 presents data for 10 comparisons encompassing state and party. In half of the comparisons, only one IVR poll was conducted, so its final estimate is compared to an average value for multiple CATI polls. In these five comparisons, the average value of A was lower for the CATI surveys in three cases (both Iowa caucuses and the New Hampshire Democratic primary) and higher in two (the New Hampshire and Wisconsin Republican primaries). In the other five comparisons, where there were two or three IVR firms conducting polls, the average value of A was lower for the IVR estimates in four cases (South Carolina Democratic and Republican primaries, and the California and Wisconsin Democratic primaries). In one case, the California Republican primary, the average IVR A value was slightly higher than the average CATI A value.

With only 4 IVR firms producing 18 estimates in these 10 contests, it is possible to look at the average value of A for the three firms that produced multiple estimates. The lowest average absolute value of A was produced by Public Policy Polling (.147), based upon four estimates in both parties’ primaries in South Carolina and Wisconsin. This was very closely followed by SurveyUSA (.147), based upon four estimates in both parties’ primaries in South Carolina and California.12 Rasmussen Reports produced nine estimates in total with an average A value of .233.
 
Summary: All of the final pre-primary polls were conducted by telephone, using either CATI or IVR systems. We found no evidence that one approach consistently outperformed the other – that is, the polls using CATI or IVR were about equally accurate. We caution that all of the comparisons are based on very small sample sizes and are potentially confounded with other factors that can contribute to accuracy.

12 The difference in the relative value of the A for these two firms is due to rounding.

Sample Frames and Respondent Selection
 
The issues of sample frames, respondent selection, estimating the likelihood of voting, and weighting are inextricably linked in pre-election surveys. Separate sections in this report discuss estimating likelihood and weighting, and each of these issues receives some treatment in this section as well. Pre-election pollsters are interested in obtaining information from people who will vote in the election (caucus, primary, or general), but the respondent may not even know at the time of interview whether he or she will participate. Obviously, under these circumstances there is no way to identify the population of interest in advance, that is, to obtain or construct a sample frame consisting of a list of voters prior to the event. In this way, the use of a particular frame and the definition of likelihood of voting are linked.

Table 7 indicates the kind of frame used to select a sample for the pre-primary polls in the four states on which we focus, as well as the method for determining likelihood of voting. Some firms applied a definition of likelihood of voting at the same time that they drew their sample; others drew a sample from a frame and then asked questions in the survey to determine likelihood of voting. In the case of one firm, the interviewer asked to speak to “a registered voter in the household.”

Another distinction is that firms drawing samples of telephone numbers differed in whether cell phone numbers were included and in how information about whether a number was listed was used. They then selected a respondent by a variety of methods (see Table 5) after contact was made with the household or telephone number.13 In these cases, a series of questions at the beginning of the interview was used to determine likelihood of voting, and the interview proceeded only with a likely voter according to that firm’s definition. In other cases, the interview started with some questions that everyone answered, then likelihood was determined, and then the trial heat question(s) about candidate preference were asked only of those identified or defined as likely voters. In still other cases, firms purchased a sample of registered voters that presumably included telephone numbers and sometimes included information about past voting behavior, and they attempted to contact individuals selected on the basis of their registration status and/or past voting behavior for an interview. These lists were purchased from commercial firms or, in one case, supplied by a political party. Another issue, discussed in greater detail in the next section, has to do with the inclusion or exclusion of cell phone numbers. Very few of the polling firms included a special sample of cell phone numbers in their primary polling.
 
Summary: The specific impacts of the use of particular sampling frames and methods of respondent selection on the accuracy of estimations remain difficult to assess. This is because of the joint relationship between the use of certain frames, respondent selection after contact, the determination of voting likelihood, and post-survey weighting procedures, as discussed in greater detail below.

[Table 7]

13 It is unclear how the firms using IVR techniques that purchased a list that included a subset of identified registered voters ensured that they were speaking to the correct respondent.
 
Non-coverage of Cell Phone Only (CPO) Voters
 
At the time of the New Hampshire primary, about 14.5% of American adults lived in a household with a cell phone but no landline, according to the National Health Interview Survey (Blumberg and Luke 2008). We adopt the conventional shorthand of “cell phone only” (CPO) to describe this population. Cell phone only voters were excluded from all but one of the 13 New Hampshire primary polls studied in this report. Only the Gallup Organization included a sample of CPO adults in its poll.

During the primary season, it was widely believed that the omission of CPO individuals did not have a sizable impact on estimation of candidate preference. An analysis of exit poll data from the 2004 presidential election suggested that post-survey demographic adjustments, in particular an iterative-proportional fitting technique (raking) to population control totals for age, effectively eliminated coverage error in pre-election polls (Keeter 2006). In January 2008 a report from the Pew Research Center concluded that including cell phone interviews did not substantially change any key survey findings (Keeter 2008). And in May 2008 two other studies indicated that including a sample of cell phones had minimal effects on primary election trial heat estimates (Jones 2008; ZuWallack, Piehl, and Holland 2008).
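To make the raking adjustment mentioned above concrete, the sketch below illustrates the basic iterative proportional fitting idea in Python: respondent weights are repeatedly rescaled so that the weighted sample margins match population control totals. The function name, variables, categories, and target proportions are illustrative assumptions for exposition only; they are not the adjustment scheme of any poll reviewed in this report.

from collections import defaultdict

def rake(respondents, targets, max_iter=100, tol=1e-8):
    """Iterative proportional fitting (raking) of survey weights.

    respondents: list of dicts, each holding the raking variables and a 'weight'.
    targets: {variable: {category: target population proportion}}.
    """
    for _ in range(max_iter):
        biggest_adjustment = 0.0
        for var, shares in targets.items():
            # Current weighted distribution of this variable.
            totals = defaultdict(float)
            for r in respondents:
                totals[r[var]] += r["weight"]
            grand_total = sum(totals.values())
            # Rescale each respondent so the weighted margin matches the target.
            for r in respondents:
                factor = shares[r[var]] / (totals[r[var]] / grand_total)
                r["weight"] *= factor
                biggest_adjustment = max(biggest_adjustment, abs(factor - 1.0))
        if biggest_adjustment < tol:  # stop once the margins have stabilized
            break
    return respondents

# Illustrative use with made-up data: a sample that over-represents older
# respondents is raked to hypothetical age and sex control totals.
sample = [
    {"age": "18-44", "sex": "F", "weight": 1.0},
    {"age": "18-44", "sex": "M", "weight": 1.0},
    {"age": "45+", "sex": "F", "weight": 1.0},
    {"age": "45+", "sex": "M", "weight": 1.0},
    {"age": "45+", "sex": "F", "weight": 1.0},
    {"age": "45+", "sex": "M", "weight": 1.0},
]
targets = {"age": {"18-44": 0.45, "45+": 0.55}, "sex": {"F": 0.52, "M": 0.48}}
rake(sample, targets)

Because rescaling to match one margin can disturb another, the procedure cycles through the variables until the adjustment factors stabilize. In the Keeter (2006) analysis cited above, it was an adjustment of this kind to age control totals that appeared to compensate for the missing cell-phone-only respondents.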

By the time of the general election, however, views about the exclusion of this group had changed. The Pew Research Center estimated in September 2008 that this omission could underrepresent support for Obama by 2 percentage points in general election trial heats (Keeter, Dimock and Christian 2008). Similarly, Gary Langer (2008c), director of polling for ABC News, reported that Obama had a 6-point margin (50% to 44%) over McCain among likely voters when CPO interviews were included and a 4-point margin (49% to 45%) when they were excluded. Differences for full-sample estimates were generally quite small, however, leading Langer to conclude that the effect from adding cell phone only interviews was negligible.

Looking back on the New Hampshire pre-primary polls, exclusion of CPO individuals does not appear to have been an important factor in the estimation errors. Several teams of researchers independently concluded that CPO exclusion influenced estimates of support for Clinton versus Obama by at most 1 or 2 percentage points, and the direction was not always consistent. Pew had a statistically non-significant finding that Obama benefited from the exclusion of CPO individuals, while ORC Macro and CNN had a statistically non-significant finding that Clinton benefited. Furthermore, Gallup, which included a cell phone only sample, performed no better in New Hampshire than the other polls in estimating the election outcome.14
 
Summary: Only one survey organization (Gallup) fielded a random-digit dialing (RDD) sample of cell phone cases. Other New Hampshire polls appear to have excluded CPO individuals. We found some evidence suggesting that this coverage gap influenced estimates in the general election, presumably as young people were motivated to turn out in support of Barack Obama, but we found no strong evidence suggesting that the gap influenced primary estimates in any meaningful way. In particular, non-coverage of CPO individuals in pre-primary polls does not appear to have been an important factor in the New Hampshire Democratic primary poll errors.

14 The Gallup poll was in fact the least accurate in estimating the New Hampshire Democratic Primary winner.  This raised the question as to whether including the CPO sample might actually have increased the error in Gallup’s estimate rather than decreased it.  The committee was unable to test this hypothesis directly as it requires the availability of two sets of weights – one for the full sample estimate and a separate one for a landline-only sample estimate.  Lacking the latter weight, we compared the unweighted full sample estimate with the unweighted landline sample estimate.  The results were identical (37% Obama, 36% Clinton), suggesting that the addition of the CPO sample data had minimal effects on the final weighted estimate.  It is also the case that the difference in vote preference between the CPO respondents (41% Obama versus 25% for Clinton) and the landline respondents (37% Obama, 36% Clinton) was probably not sizable enough to move the estimates by a meaningful amount given that the CPO cases represented 7.2% of the entire sample, and there was no post-survey adjustment for telephone service.
 

Nonresponse Bias and the Composition of Responding Samples

Another hypothesized cause of the error in the pre-election polls is nonresponse bias. That is, voters who could not be located or declined to participate in the surveys may have favored different candidates than those who did participate in the surveys. Demographic weighting that aligns the characteristics of the survey sample to characteristics of the entire voting population is generally thought to remove much of the bias from nonresponse. Weighting, however, is an imperfect technique and relies on assumptions that may not be valid in the context of these particular primary elections.
In an op-ed in the New York Times (2008) after the primary, Andrew Kohut, president of the Pew Research Center, posited that the pre-election polls for the New Hampshire Democratic primary missed the mark because of nonresponse bias. Kohut noted that poorer, less educated whites are less likely to participate in surveys than other voters and that these whites have more unfavorable views of blacks than respondents who participate in surveys. By this line of reasoning, the survey estimates would still be subject to nonresponse bias even after weighting adjustments were made for the educational and racial composition of the sample.

The difficulty in assessing this hypothesis and others related to nonresponse bias is that we generally know little if anything about those who do not participate in surveys. The 2008 primary polls were no different, and none of the organizations contacted by the committee provided call records for non-responding cases. Given this lack of information about how much effort was devoted to contacting original sample elements, we cannot rigorously test any hypotheses about how pre-election poll errors could be attributable to nonresponse bias. We were able, however, to conduct an indirect assessment of the levels of nonresponse bias in the polls by comparing characteristics of the respondents in the pre-election samples to those in the exit poll samples.
The best information available to us on this issue comes from the National Election Pool (NEP) exit polls funded by major news outlets and conducted by Edison Media Research and Mitofsky International. The exit polls provide demographic, attitudinal, and behavioral characteristics of those who voted in the primaries. The exit poll data are weighted to actual vote totals, and the resulting aggregate estimates are the best available information on the characteristics of the participating electorate.15 In the following analysis we make inferences about pre-election survey nonrespondents by comparing the personal demographic characteristics of the survey samples to the same characteristics of the voters as measured by the exit polls in the New Hampshire primary.

This analysis has two important limitations that should be kept in mind when considering the results. First, differences we observed between the survey samples and the exit polls are not due solely to nonresponse bias; they also reflect non-coverage and possibly measurement differences. Error from non-coverage and error from nonresponse may not affect survey results in the same way. Unfortunately, we have no good way to disentangle these two error sources in this analysis.16 The second important point is that the exit polls have some problems of their own. People refuse to cooperate with the exit polls just as they refuse to cooperate with pre-election polls. The researchers who release the exit poll estimates do attempt to correct for this by aligning the exit poll results to the actual vote division. It is still possible, however, that residual nonresponse bias affects the accuracy of the exit poll figures. We use the exit poll data here because they are the best available information on the composition of the electorate.

In Table 8, differences between the demographic distributions of four New Hampshire Democratic pre-election primary polls and the primary exit poll are reported. On several dimensions – gender, race, marital status, and party identification – the pre-election survey samples resembled the electorate as measured in the exit polls. On other dimensions the survey samples look a bit different, providing some possible explanations for the errors in the polls.
The reader will recall that pre-election polls in the New Hampshire Democratic primary understated support for Hillary Clinton. Table 8 shows that two of the surveys slightly under-represented households with at least one labor union member; a measure of union membership was not available for the other surveys. According to the exit poll, Clinton won union households by a 40% to 31% margin. This suggests that the pre-election polls may have missed some Clinton support by underestimating the size of the union vote. We see a similar result looking at education. In the exit poll, Clinton won voters with less than a college education by a 48% to 30% margin. Three of the pre-election polls underrepresented this group and thus appear to have missed some Clinton support. This could be consistent with the Kohut hypothesis. The CBS News poll, which was a re-interview survey,17 overrepresented respondents with a high school education or less, so its error would have come from other sources.

One difference between the pre-election polls and the exit poll leads to a counter-intuitive result.  Table 8 shows that all four pre-election polls in New Hampshire overrepresented registered Democrats. According to the exit poll, registered Democrats comprised 52% of the actual voters, but they comprised roughly 57 to 61% of the survey samples. Clinton won this group by a 43% to 32% margin. It is curious that the surveys over-represented this pro-Clinton group but still underestimated her support.

In sum, we find some evidence that the pre-election polls may have understated Clinton support by underestimating the share of union households and voters without a college degree, and overstated it by overestimating the share of registered Democrats. We reiterate, however, that the exit poll is itself an imperfect benchmark, so these conclusions are more suggestive than definitive.

In addition to the exit poll analysis, the committee sought to investigate nonresponse error by testing for a relationship between response rate and accuracy. Response rate is a poor indicator of nonresponse error (Curtin, Presser, and Singer 2000; Groves 2006; Keeter et al. 2000; Merkle and Edelman 2002), but survey organizations are required to disclose it under AAPOR’s minimum disclosure standards. Our attempt to carry out this analysis was hindered by two factors. Critically, only 8 of the 13 survey organizations releasing estimates for the New Hampshire Democratic primary disclosed their response rate (see Table 5). Furthermore, the rates disclosed were not all calculated in the same fashion. One was reported as a range; three were based on AAPOR response rate (RR) calculation 1; one each was based on RR (2), RR (3), and RR (5); and one was based on an unknown calculation.18 In light of these limitations, we did not pursue a rigorous analysis of the relationship between response rate and accuracy. The correlation between the absolute value of A and response rate across the primary polls that disclosed a rate was not significant (Pearson r = 0.317, p = .22).
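For readers who wish to see the form of this simple check, a minimal sketch of the calculation follows; the response rates and |A| values shown are hypothetical placeholders, not the committee’s data.

```python
# Illustrative sketch only: hypothetical response rates and |A| values,
# not the figures analyzed by the committee.
from scipy.stats import pearsonr

response_rate = [0.09, 0.16, 0.22, 0.31, 0.12, 0.18, 0.25, 0.14]   # hypothetical
abs_accuracy  = [0.21, 0.33, 0.15, 0.08, 0.27, 0.19, 0.11, 0.24]   # hypothetical |A| values

r, p = pearsonr(response_rate, abs_accuracy)
print(f"Pearson r = {r:.3f}, p = {p:.3f}")
```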
 
Summary:  We found some evidence that nonresponse error contributed to the estimation errors in the New Hampshire Democratic primary polls, but we lack data to confirm this rigorously. According to the exit polls, the pre-election polls underestimated the size of two pro-Clinton groups (union households and those with less than a college education). On the other hand, the pre-election poll samples were similar to weighted exit poll results in terms of gender, race, marital status, and party identification.
 
15 There are a variety of different weighting schemes used in the pre-primary polls themselves, and they will be discussed in greater detail in a separate section that follows.
16 One possible way to separate nonresponse error from noncoverage error would be to compare the error properties of surveys using a landline RDD design from those using a landline RDD + cell sample design. We did not perform this analysis because there are so few surveys representing each design and these surveys differ in a number of respects besides sample design. Any attempt to isolate the effect from sample design would be confounded by other differences between surveys.
17 The CBS News poll re-interviewed New Hampshire registered voters first interviewed in November of 2007 to measure the amount of individual change.
18 Both the AAPOR definitions of disposition codes and response rate calculations, as well as an Excel spreadsheet for making response rate calculations can be found at: http://aapor.org/resources.

Trial Heat Question Wording

In the pre-primary surveys, the respondents are asked a “trial heat” question about their preferred candidate in the election. This question can be asked in several forms, both in terms of the stem question wording and the response categories offered. We had information on 13 different question wordings used in the Democratic primary and 11 different questions used in the Republican primary in New Hampshire, assembled from information that the polling organizations provided or that could be located from other sources. We also had information on four different questions used in the South Carolina Democratic and Republican primaries, seven questions used in the California Democratic and Republican primaries, four questions used in the Wisconsin Democratic primary, and three questions used in the Wisconsin Republican primary.19    This information is presented in Table 9.

One of the ways that such questions might differ is in whether the names of some or all of the candidates are mentioned, providing a form of recognition of the candidates for those who may not have been paying much attention to the campaign. In New Hampshire, the survey conducted by RKM did not have the interviewer read any of the names of the candidates, while the survey conducted by Suffolk University told the respondent that there were 22 names on the ballot but offered the names of only the “eight major candidates” listed alphabetically.

Another difference is whether the order of the names in the trial heat question corresponds to the order of the names on the ballot. For example, the New Hampshire ballot arranges candidates’ names alphabetically starting with a randomly selected letter of the alphabet. (See the section on Ballot Order Effects below.) However, only in the Zogby and Suffolk University trial heat questions for New Hampshire’s Democratic primary were the candidates’ names arranged in the same order that they appeared on the ballot. In South Carolina, the candidates were listed alphabetically on the ballot, and only one of the polls presented the names in that order to all respondents. In California, ballot order was randomized and rotated across districts. In the Wisconsin Democratic primary, the order of the candidates was assigned at random, and none of the questions offered the names in that order.

Still another difference is whether the CATI and IVR pre-primary surveys randomly rotated the names of the candidates in the trial heat question – a method for counteracting recency effects (Krosnick and Alwin 1987; Holbrook, Krosnick, Moore, and Tourangeau, 2007). A recency effect is a cognitive bias that makes respondents more likely to select the option they heard last. As shown in Table 9, most of the New Hampshire Democratic primary polls randomized the order of candidate names across interviews, but four polls did not. The trial heat questions from the Zogby, Suffolk University and LA Times polls presented Obama’s name after Clinton’s, which may have made Obama somewhat more salient in respondents’ minds as they formulated their answers. These polls had Obama leading by 13, 5 and 2 percentage points, respectively. The fourth poll using a uniform name order was conducted by Research 2000, which presented Clinton’s name after Obama’s. If a recency effect occurred, we might expect this poll to be more accurate than the others because the effect would break in favor of Clinton and offset the general tendency of the New Hampshire polls to underestimate her support. This was the case, as Research 2000 had Obama leading by just 1 percentage point (a statistical dead heat).

Unfortunately, this analysis is suggestive but not definitive; we do not have an experimental manipulation of name order to assess its impact. The results regarding response order and estimation error are in the expected direction, but there are far too few data points to establish any causal relationship. We find these results intriguing, but much more data are required to address this issue properly.



Still another difference in the trial heat questions used by polling firms is whether candidate preference is asked as a single question or in a two-step sequence in which those who respond “Undecided” or “Don’t know” to the first question are asked in a follow-up question to name the candidate toward whom they are leaning. Surveys that used the two-step question should ultimately produce a lower level of “Undecided” voters because respondents were given two chances to name a candidate. The data in Table 10 show that six of the New Hampshire survey firms used a single question, and seven used a two-question sequence. Information is provided in Table 10 on the level of “Undecideds” in each survey. Surprisingly, the average level of “Undecideds” was no lower for the polls using the two-question format (7%) than for the polls using the single-question format (6%). However, this can be attributed to the fact that two of the firms using the two-part trial heat question, Opinion Dynamics and the Los Angeles Times, had significantly higher levels of undecided responses than any other survey. Excluding them, the average level for the two-question sequence would have been 5.2%.



While the primary polls differed in their trial heat question wording, they were nearly uniform on another dimension. None of the poll estimates studied for this report featured an allocation of undecided voters in the published trial heat estimates.21 This means that the poll-based point estimates of candidate vote share are expected to be systematically lower than the actual election vote share because they will not add to 100%. Relative shares of the vote and the margin between the top two candidates, however, should be largely unaffected by whether or not undecided voters were allocated, as the level of undecided was low for each poll. The decision not to allocate probably had little if any impact on the accuracy of the estimates. To be sure, if all the New Hampshire Democratic primary polls had allocated the share of undecided voters entirely to Senator Clinton, they would have come closer to projecting the actual result; but this outcome obviously could not have been known in advance. Even under this hypothetical allocation, several of the polls still would have projected an Obama victory in New Hampshire. Furthermore, when allocation of undecided voters is performed, it is most often assumed that each of the top two or three candidates will receive some proportion of the undecided vote, rather than one candidate receiving all of it.
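As an illustration of the proportional approach described above, the following minimal sketch allocates a hypothetical undecided share across candidates in proportion to their observed support; the figures are invented and do not come from any poll discussed in this report.

```python
# Illustrative sketch: allocate the undecided share to candidates in proportion
# to their observed support. Shares are hypothetical.
poll = {"Obama": 0.37, "Clinton": 0.34, "Edwards": 0.18}   # decided shares
undecided = 1.0 - sum(poll.values())                        # 0.11 here

total_decided = sum(poll.values())
allocated = {name: share + undecided * share / total_decided
             for name, share in poll.items()}

print(allocated)   # shares now sum to 1.0; relative margins are unchanged
```

Because the allocation is proportional, the margin between the top two candidates is preserved, which is why the committee concludes that non-allocation had little effect on relative accuracy.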
 
Summary:  We found no compelling evidence to suggest that the wording of the trial heat questions contributed to the New Hampshire polling errors in any meaningful way. Most polls randomized the order of the candidates’ names, but we were unable to evaluate independently whether this had any impact on respondents’ expression of support for candidates. Levels of undecided voters were generally low in most polls, with some exceptions; and average levels of “undecided” responses were similar for surveys using either the one- or the two-question format to ascertain candidate preference. Polls that used a follow-up question for undecided respondents performed no differently in terms of accuracy than polls that did not.
 
19 The Field Poll had slightly different question sequences for the Democratic and Republican pre-primary polls.  Early in their polling on these races, when the field of candidates in each contest was large, respondents were asked which of the candidates they could vote for and then which was their first choice. By the time the primaries approached, the Democratic field had narrowed, and only one question about a choice among three candidates was asked. The Republican field was still relatively large, so the earlier sequence was used to preserve an ability to comment on the trend in support based upon the first choices.
20 This wording differs slightly from the one that appears in the final University of New Hampshire press release.
21 The one exception is the University of New Hampshire poll. Their press release included estimates with no allocation (39% Obama, 30% Clinton) as well as estimates with undecided voters allocated. With allocation, the point estimates for the top two candidates each increased by two percentage points (41% Obama, 32% Clinton), and the difference between them remained the same.

Likely Voter Definitions

One problematical reality of pre-election polling is that not all persons interviewed for a survey will, in fact, vote. Survey researchers attempt to account for this in a number of ways. One is to ask questions such as “How often do you usually vote?” and “Do you know the location of your polling place?” – answers to which help pollsters predict which of their respondents are likely to vote. Another method is to draw samples from registered voter lists that have some form of voting history attached; a likely voter may be defined as one who has voted in a previous primary, either for president or for some other office. Models of likely voters are quite variable across survey organizations, and they are sometimes considered proprietary. Whatever the method of identifying a likely voter, those deemed unlikely to vote are typically excluded from estimates of candidate preference in the electorate or are assigned a relatively small weight when estimation is performed.

Developing an effective likely voter model is particularly difficult in primary contests where the electorate can change in important ways from one election year to another. Anticipating the profile of the voting electorate was especially challenging in 2008 given the significant increase in turnout relative to recent primary elections. In 2004 some 219,787 New Hampshire voters cast ballots in the Democratic primary, but in 2008 this figure increased 31%, to 287,527. In other states, the increase in turnout was even more dramatic. For example, nearly twice as many South Carolina voters participated in the Democratic primary in 2008 as in 2004. And almost two million more voters participated in the 2008 California primary, which was held on an earlier date, than in 2004. The increase in turnout in 2008 for four early Democratic contests is shown in Table 11 and suggests that historical likely voter models might not have worked as effectively in 2008 as they had in the past.


 
The New Hampshire exit poll is the only one in which any questionnaires included an item about whether the respondent was voting for the first time in a primary. Because analysis of the exit poll data shows that first-time voters were more likely to support Obama over Clinton (47% to 37%) than were those who had voted in previous primaries (33% to 38%), knowing the correct proportions of first-time and previous primary voters in the sample could affect estimation. And the estimated 19% of self-reported first-time voters in 2008 would not have been picked up in a likely voter model that was based on prior voting. This kind of complexity makes likely voter modeling a probable suspect when researchers seek to explain errors in pre-election polls.

Two distinct approaches are used to account for likelihood of voting in pre-election surveys (Traugott and Tucker, 1984), one involving the construction of a likelihood index and the other involving the calculation of likelihood weights for each respondent. In the first case, a series of variables is combined to form an index of likelihood of voting, and only cases that fall in an acceptable range, based on cutoff points on the index, are included in the analysis. In the second approach, a weight is calculated for each respondent based upon his or her likelihood of voting, ranging from close to 0 for the least likely to nearly 1.0 for the most likely, and all weighted cases are included in the candidate preference distribution. Several firms reported which questions they used to create their likely voter models, but few went so far as to report which of these two statistical approaches they used. When nonresponse weights are combined with weights for the likelihood of voting, it is difficult if not impossible to assess the relative contribution of each factor to estimation. For the datasets provided to the committee, we can attempt to infer which approach was used from the distribution of the weight factors used in the trial heat estimate. The topic of weighting is considered separately later in this report. At this point, we consider weights only to learn more about the likely voter models used.
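The following minimal sketch illustrates the two generic approaches in schematic form; the variables, cutoff, and data are hypothetical and are not intended to reproduce any organization’s proprietary model.

```python
# Illustrative sketch of the two generic likely voter approaches described above.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "voted_last_primary": [1, 0, 1, 1, 0],
    "knows_polling_place": [1, 1, 0, 1, 0],
    "interest_in_race":    [3, 2, 3, 1, 1],   # 1 = low, 3 = high (hypothetical item)
    "prefers_clinton":     [1, 0, 0, 1, 0],
})

# Approach 1: likelihood index with a cutoff -- only cases above the threshold are kept.
df["lv_index"] = (df["voted_last_primary"] + df["knows_polling_place"]
                  + df["interest_in_race"])
likely = df[df["lv_index"] >= 4]
est_cutoff = likely["prefers_clinton"].mean()

# Approach 2: likelihood weights between 0 and 1 -- all cases are kept, but each
# contributes in proportion to its estimated probability of voting.
df["p_vote"] = df["lv_index"] / df["lv_index"].max()
est_weighted = np.average(df["prefers_clinton"], weights=df["p_vote"])

print(est_cutoff, est_weighted)
```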

Data in Table 12 summarize the distribution of the values of the weights assigned to respondents deemed likely voters in the datasets provided to the committee. The average weights across the surveys are very similar, approximating 1.00 with a range from .76 to 1.12, while the ratio of the weighted number of likely voters to the actual number of likely voters ranges from .76 (CBS News) to 1.12 (Field Poll). CBS News employs regression-based weighting for likelihood estimation that incorporates information from all of the cases in analysis, as well as weights that account for differential response; these weights range from 0.06 to 6.30. The Gallup Organization uses an index-based system to account for likelihood to vote, with weights ranging from 0.34 to 2.33.



Some polling organizations used unusual terminology, or common terminology in unusual ways, to describe their approaches to sample weighting. Some IVR pollsters, for example, may have less control over the details of their interviewing in the sense that they cannot simply “turn off” their computerized data collection when they complete the exact number of interviews for which they have a contract. So they can sometimes end up with more cases than they use. Furthermore, these additional cases appear, as would be expected, among those who are most likely to be at home when the phone rings, such as women, older people, or whites. One IVR firm, SurveyUSA, reweights its samples to bring such groups (which it describes as “oversamples”) back into their appropriate proportions in the population;22 another IVR firm, Datamar, uses an algorithm to discard extra cases at random. Datamar considers the details of its approach to be proprietary and did not disclose them.23

The effect of the likely voter models for two firms in the New Hampshire and California primaries, for which we received micro-level datasets, is presented in Table 13. It reports the unweighted trial heat estimate based on the full sample, or all respondents asked the trial heat question, and the weighted trial heat estimate based only on the likely voters. Most datasets provided to the committee could not be used for this analysis because the polling organization did not disclose how likely voters were defined, or because likely voters were identified mid-interview and the trial heat questions were administered only to them. As a result, only three comparisons could be made.

Gallup’s likely voter estimate is clearly farther from the election outcome than the full sample estimate. In the full sample estimates, Obama led Clinton by 5 percentage points, while the weighting for likelihood and other factors produced an Obama advantage of 13 percentage points. The Gallup Organization uses the index construction method to measure likelihood to vote, which has been shown to introduce error variance into pre-election polls during general elections (Erikson, Panagopoulos and Wlezien, 2004). The application of their likely voter model increased Obama’s proportion slightly but decreased Clinton’s support by a greater amount. Internal analysis at Gallup24 led its editor-in-chief, Frank Newport, to conclude that a faulty likely voter model was the single biggest factor in their underestimation of support for Clinton in New Hampshire. The Gallup likely voter model included measures of enthusiasm and attention to the race, dimensions found at higher levels among Obama supporters following the Iowa caucuses. It is quite possible that for the earlier Gallup polls in New Hampshire, disproportionate numbers of Clinton supporters were classified as “unlikely voters” and were inappropriately dropped from the Gallup estimates (Erikson and Wlezien 2008). It is important to note, though, that even Gallup’s full sample estimate would have indicated the wrong winner.



Analogous comparisons of full sample and likely voter estimates are reported for the Public Policy Institute of California’s (PPIC) surveys in that state’s primaries. For the California Democratic primary, the difference in estimates was only two percentage points; this reduced the overestimation of the margin between Clinton and Obama, as reflected in the value of A being closer to zero. On the Republican side, moving to a likely voter subset slightly reduced the substantial underestimate of Romney’s support but did not change the underestimate of McCain’s support. Overall, the value of A was improved in the Republican race by the move to a likely voter subsample.

Summary:  The likely voter model appears to explain much of the error in the Gallup poll in New Hampshire, but we find no compelling evidence that it explains errors in the other pre-primary polls for which we had appropriate data to analyze. In fact, outside of the Gallup poll in New Hampshire, the use of a likely voter model did not change estimates of candidate support very much in relation to the candidate preference distribution for the entire sample in the other polls.
 
22 SurveyUSA provided a PowerPoint version of a presentation at the 2005 AAPOR annual conference about its weighting algorithms. That presentation is available at the Roper Center site that contains all of the information provided by the pollsters in response to AAPOR’s request for disclosure.
23 Personal communications with the committee staff.
24 This analysis was reported by Gary Langer of ABC News (Langer 2008b).
 

Calling Protocols

Errors in the pre-election estimates could also have stemmed from decisions about how the data were actually collected. All of the primary polls were conducted by telephone, but they varied in the number of call attempts made to each case. Studies have shown that increasing the number of call attempts in a survey can change the partisan composition of the sample (Traugott 1987; Keeter et al. 2000, 2006), and it might change the proportions of likely voters or the candidate preference distribution as well.

The committee was severely constrained in its ability to investigate any relationship between the number of calls attempted and the degree of error in the survey estimates. Only CBS News and Gallup provided datasets for their New Hampshire polls with the number of call attempts made on each case. The CBS News New Hampshire Democratic primary estimates are based on a re-contact survey rather than a fresh sample, which limits the generalizability of the results. CBS News conducted a maximum of six calls, and Gallup conducted a maximum of five calls. In both surveys, more than 94% of respondents provided data on the first or second attempt, undoubtedly a consequence of each firm trying to achieve a large enough sample size in a brief field period so that campaign events would not have an appreciable effect on candidate preferences. These distributions are shown in Table 14.


 
In our assessment of candidate preference by call effort, we found no statistically significant difference in the preference distribution between those who were interviewed on the first call and those interviewed on the second call. The small number of cases in the third-or-later call category limits the statistical power of this test, but the results are shown in Figure 6. The Gallup survey’s unweighted estimates (dashed lines) indicate that respondents requiring 3 or more calls favored Clinton over Obama (49% to 35%), while those requiring 1 or 2 calls favored Obama (37% to 35%). When weighted estimates (solid lines) are considered, the difference in candidate preference by level of call effort is similar – with the hard-to-reach favoring Clinton and the easy-to-reach favoring Obama – though less dramatic.

The findings from the CBS News survey are more mixed, reflecting the fact that only 17 of the 322 re-contacted respondents required 3 or more calls. The weighted CBS estimates show the same pattern as the Gallup data, with harder-to-reach respondents favoring Clinton by a 2-to-1 margin and the easy-to-reach favoring Obama. The unweighted CBS estimates, however, show the opposite: those requiring three or more calls were slightly more favorable toward Obama than those requiring one or two calls. The callback design of the CBS study and the small sample size in the “higher effort” group make these data less suitable for this type of analysis than the Gallup data, which come from fresh RDD landline and cell phone samples. In the weighted estimates from both surveys, however, the more difficult to reach respondents were more likely to favor Clinton.
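The kind of tabulation underlying Figure 6 can be sketched as follows; the column names, call counts, and weights are hypothetical stand-ins rather than the Gallup or CBS News data.

```python
# Illustrative sketch: candidate preference by level of call effort,
# computed both unweighted and weighted. All values are hypothetical.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "n_calls":    [1, 1, 2, 2, 3, 4, 5, 1, 2, 3],
    "preference": ["Obama", "Clinton", "Obama", "Obama", "Clinton",
                   "Clinton", "Clinton", "Obama", "Clinton", "Obama"],
    "weight":     [1.1, 0.8, 1.0, 0.9, 1.3, 1.2, 0.7, 1.0, 1.1, 0.9],
})
df["effort"] = np.where(df["n_calls"] >= 3, "3+ calls", "1-2 calls")

# Unweighted percentages within each effort group
unweighted = pd.crosstab(df["effort"], df["preference"], normalize="index")

# Weighted percentages within each effort group
sums = df.groupby(["effort", "preference"])["weight"].sum()
weighted = (sums / sums.groupby(level=0).transform("sum")).unstack()

print(unweighted, weighted, sep="\n\n")
```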



Analysis of the Gallup data offers some indication that the primary pollsters might have achieved more accurate estimates had they implemented a more rigorous (and expensive) data collection protocol. Had the surveys used, say, an 8- or 10-call maximum rule rather than a 5- or 6-call maximum, it appears they may have tapped into Clinton support in New Hampshire that was missing from their final estimates. That said, this analysis is limited in statistical power and potentially confounded with real changes in preferences. The CBS News and Gallup surveys were fielded January 5th - 6th and January 4th - 6th, respectively. On average, those requiring more than two calls were most likely interviewed on the last or penultimate day of interviewing, while those requiring fewer calls were interviewed earlier. Changes in candidate preference at the individual level during this January 4th - 6th period might be misclassified in the Figure 6 analysis as differences between groups (i.e., those requiring low- versus high-effort calling). While such individual change may have occurred, we suspect it does not swamp the differences between the low- and high-effort groups. We say this in part because Senator Clinton’s emotional moment at the New Hampshire diner – which is speculated to have influenced some undecideds and tepid Obama supporters – did not occur until January 7th, after both studies had ended data collection. It is also important to keep in mind how the primary calendar influenced data collection. The New Hampshire primary was held just five days after the Iowa caucuses. It is unlikely that any “bounce” Obama received after Iowa would have dissipated by the time interviewing was underway for the final New Hampshire polls.

Summary: We found a modest indication that the primary pollsters may have achieved slightly more accurate estimates in New Hampshire had they implemented a more rigorous (and expensive) data collection protocol. Respondents reached on the third or later call attempt were more likely to support Clinton. In polls with very short field periods, the sample tends to be composed of respondents contacted on the first few attempts, complicating assessments of the impact of interviews collected with more effort to contact respondents. This raises the prospect that sample management during the field period could have affected accuracy, with more prolonged effort producing better estimation. However, these results are based on only two surveys, one of which was unusual because it was a call-back study.

25 Dashed lines denote unweighted estimates; solid lines denote weighted estimates.
 

Timing of Data Collection

Another design feature often thought to affect the accuracy of pre-election surveys is the length of time between data collection and Election Day. Research on this topic has yielded mixed results (DeSart and Holbrook 2003). Some studies find that polls fielded closer to Election Day are more accurate (Crespi 1988); others find a null or even negative relationship (Lau 1994; Martin et al. 2005). We tested for a relationship between poll timing and accuracy in the Iowa, New Hampshire, South Carolina, and California primaries. The results, presented in Table 15, draw on a larger set of polls than most of our other analyses because field dates are routinely reported by polling firms, per AAPOR guidelines. We summarized the relationship between timing and accuracy with simple bivariate correlations.

Table 15

These are the correlations between the absolute values of the accuracy scores (A) and the number of days between the poll and the election. Simply put, if polls taken closer to the election were more accurate, then we would expect to observe a positive correlation: accuracy scores near zero are better than those farther from zero in either a positive or negative direction. For example, in the South Carolina Republican primary, the later polls were more reflective of McCain’s three-point victory than earlier polls, resulting in a positive correlation (.64).

The results in Table 15 demonstrate that the relationship between timing and accuracy varied greatly by election. In the problematical New Hampshire Democratic primary, there is a negative correlation (-.30 for the absolute value of A) between accuracy and temporal distance from the election. This quantifies the pattern in Figure 2, where the final polls fared slightly worse on average than those fielded earlier (before the Iowa caucuses). We found a similar though weaker relationship (-.17), computed on the same basis, between poll timing and accuracy in the Iowa Democratic caucuses, and we found essentially no relationship in the Iowa Republican caucus or the California Democratic and Republican primaries. The only races in which polls conducted later were noticeably more accurate are the New Hampshire Republican primary, the South Carolina Republican and Democratic primaries, and the Wisconsin Democratic and Republican primaries – and most of these events came later in the calendar, when the field of candidates was smaller. However, in the most problematic races, particularly the Iowa Democratic caucus and the New Hampshire Democratic primary that followed it, the final polls did not seem to improve as Election Day approached.

We attempted to test this further by using the micro-level datasets provided to the committee. Three of the New Hampshire datasets – from CBS News, Gallup, and the University of New Hampshire – contained a variable for the interview date.26 We merged cases containing relevant common variables from these datasets and used a logistic regression model to test whether the timing of the interview relative to the election had a significant relationship with vote preference for Hillary Clinton, while controlling for other factors. Specifically, we estimated a logistic regression with vote preference for Clinton as the dependent variable and number of days until the election, gender, age, education, survey firm, interviewer demographics, and Democratic Party affiliation as the independent variables.

This approach is limited in two important ways. First, the key independent variable has a very narrow range of values because the interviewing dates for these three surveys were between January 4th and 7th. Second, any observed effect from the number of days until the election will be confounded to some extent by other factors such as ease of contact, as reported earlier. The regression analysis suggests that the number of days until the election did not have a significant effect on the likelihood of favoring Clinton after controlling for the other factors. The estimated model parameters are provided in Appendix Table 2. This null finding does not rule out the possibility that vote preferences changed in the days leading up to the New Hampshire primary, but we find no support for a shift toward Clinton during the January 4–7 time period.
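A minimal sketch of this type of model specification is shown below; the variable names and synthetic data are hypothetical and do not correspond to the merged dataset or to the exact coding used by the committee.

```python
# Illustrative sketch: logistic regression of Clinton preference on interview
# timing and controls, fit on synthetic (hypothetical) data.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "days_to_election": rng.integers(1, 5, n),
    "female":           rng.integers(0, 2, n),
    "age":              rng.integers(18, 90, n),
    "firm":             rng.choice(["CBS", "Gallup", "UNH"], n),
    "democrat":         rng.integers(0, 2, n),
})
# Hypothetical outcome loosely related to the predictors
logit_p = -0.5 + 0.3 * df["female"] + 0.01 * df["age"] + 0.2 * df["democrat"]
df["prefers_clinton"] = rng.binomial(1, 1 / (1 + np.exp(-logit_p)))

model = smf.logit(
    "prefers_clinton ~ days_to_election + female + age + C(firm) + democrat",
    data=df,
).fit(disp=False)
print(model.summary())
```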
 
Summary: We found that the timing or field periods of the polls may have contributed to estimation errors in the early Democratic events in Iowa and New Hampshire, though we lack the full set of relevant data for a proper evaluation. The timing of the New Hampshire primary, so closely following the Iowa caucuses, appears to have contributed to instability or change in voter preferences and, in turn, in the poll estimates.
 
26 The University of New Hampshire dataset contained the interview date but did not contain all of the other predictors in the model. A reduced model was estimated so that cases from all three surveys were included, and the results did not change appreciably. There was still no significant effect from the number of days until the election.
 

Social Desirability

Some election observers speculated that the New Hampshire polls overestimated support for Obama because some respondents told interviewers that they would vote for him but actually voted for Clinton (Nichols 2008; Robinson 2008). Such intentional misreporting in voter preference polls is attributed to latent bigotry or intolerance in conjunction with an inclination to provide socially desirable responses.27 However, in the New Hampshire pre-primary polls, the estimation error did not derive from overestimating support for Obama – which could have been driven by latent racism among respondents – but from underestimating support for Clinton. Therefore, latent misogyny cannot explain the errors in the New Hampshire polls because it would have had the opposite observed effect: the polls would have overstated support for Clinton rather than understated it.

Several compelling pieces of evidence suggest that the New Hampshire estimation errors were probably not caused by the “Bradley effect” – the tendency for respondents to report a preference for a black candidate (Obama) but vote instead for a white opponent. A meta-analysis by Hopkins (2008) indicates that while the Bradley effect did undermine some state-level polls in previous decades, there is no evidence for such an effect in recent years. In the 2008 general election, the very accurate final poll estimates of Barack Obama’s fairly decisive victory over John McCain dispelled suspicion that the Bradley effect was at play during the final weeks of the fall contest. There is also a conspicuous lack of evidence for a Bradley effect in the primary contests outside of New Hampshire. Of the 81 polls conducted during the final 30 days of the Iowa, South Carolina, California, and Wisconsin contests, the vast majority (86%) over-estimated Clinton’s relative vote share, while just 14% over-estimated Obama’s relative vote share. This finding is based on the signed direction of A for each survey.28 Furthermore, as reported in Table 3, poll estimates of Obama’s vote share in New Hampshire were quite accurate – it was only Clinton’s share that was consistently underestimated.

However, it is still possible that intentional misreporting occurred during the lead-up to the New Hampshire Democratic primary because of the interaction between the race of the interviewer and the race of the respondent. If social desirability influenced respondents’ answers, we would expect to observe more support for the African-American candidate when the interviewer was African American than when the interviewer was not, based upon an assumption that the respondent could correctly infer the race of the interviewer over the telephone. If respondents were answering truthfully, we would expect to find no statistically significant difference between the vote preferences recorded by African-American interviewers and interviewers of other races.

The interviewer effects approach has substantial drawbacks. To the extent that respondents misreport regardless of interviewer race, this test will understate a social desirability effect. Also, interviewing staffs in the United States tend to be composed mostly of whites; consequently, the number of interviews conducted by African Americans is often low, yielding low statistical power for the test. Finally, interviewers were not assigned randomly to cases, so race-of-interviewer effects cannot be discerned without possible confounds. Results from the race-of-interviewer analysis should be interpreted with these factors in mind.

In October, Gary Langer (2008a) of ABC News reported results from his general election analysis of the relationship between the race of the interviewer and that of the respondent. He found no evidence of racially motivated misreporting. Other pollsters also found no such evidence. As mentioned above, however, this does not rule out the possibility that racially motivated misreporting occurred during the primaries.

CBS News polling director Kathleen Frankovic (2008) used panel data to test whether racial attitudes affected the New Hampshire polls. She noted that voters who are concerned that their candidate preference may be socially unpopular could contribute to polling error in two ways. They could misreport their true preference, or they could decline to be interviewed altogether. The CBS panel data were used to evaluate the latter hypothesis – were voters opposing Obama less likely to be interviewed in New Hampshire? Frankovic’s analysis suggests that the answer is “No.” The January response rate for those who supported Obama in November was similar to the January response rate for those who supported Clinton in November (74% and 68%, respectively). This difference is in the expected direction, but the magnitude is not large enough to explain the error in the polls. CBS News post-stratified its January sample to account for this difference in response rates.

While informative, this analysis has an important limitation. The test is based on people who already agreed to participate in a survey. This step may have filtered out many of those who would decline a survey request for fear of offending someone with their candidate preference. This limitation could explain the lack of a large difference in the January response rates. The CBS News analysis, therefore, does not rule out the possibility that Obama supporters were more likely to respond than those who did not support him.

The committee was also able to conduct its own analysis on this topic. Three survey organizations provided data to the committee that could be used to test for a race-of-interviewer effect in the New Hampshire Democratic primary. Gallup, CBS News, and the University of New Hampshire included the race of the interviewer in their survey datasets, but only CBS News included the race of the respondent.

Based upon the 2006 U.S. Census estimate of the proportion of the New Hampshire population that is white (95.8%), we assumed in our analysis that all of the survey respondents were white.29 We combined data from the three surveys to increase statistical power and compared reported vote preference among respondents who spoke with a white interviewer to that among respondents who spoke with an African-American interviewer. The results are displayed in Table 16. In the pooled analysis, Obama led Clinton 36% to 35% when the interviewer was white, and he led 43% to 29% when the interviewer was black. This finding is in the direction of a social desirability effect and is statistically significant. Using just the CBS News dataset, we performed the same analysis looking only at white respondents. We find that black interviewers recorded higher support for Obama than white interviewers did. Although the effect falls short of statistical significance at conventional levels due to small sample size (p=.13), it is quite noticeable and in the expected direction. We also tested for this effect in a multivariate setting. The race of the interviewer was a significant predictor of vote preference for Clinton when controlling for other factors in the logistic regression presented in Appendix Table 2.
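A minimal sketch of the kind of significance test involved appears below; the cell counts are hypothetical and are not the pooled figures reported in Table 16.

```python
# Illustrative sketch: chi-square test of whether reported candidate preference
# differs by interviewer race. Counts are hypothetical.
import numpy as np
from scipy.stats import chi2_contingency

#                    Obama  Clinton  Other/Undecided
white_interviewer = [360,   350,     290]
black_interviewer = [ 86,    58,      56]

table = np.array([white_interviewer, black_interviewer])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")
```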


Overall, these findings suggest that the misreporting of candidate preference due to racial sensitivity to black interviewers may have contributed to the overstatement of support for Obama relative to Clinton in the New Hampshire Democratic primary polls. It could also be the case that a social desirability effect was at play even when the interviewer was white. Absent individual-level vote data to append to these datasets, we are unable to test that hypothesis rigorously.

We used these same pooled datasets to test whether vote preferences reported in the polls were influenced by the gender of the interviewer. Just as social desirability pressure may have led some Clinton supporters to report that they would vote for Obama, so might some Obama supporters have reported that they would vote for Clinton if they were interviewed by a female interviewer. We did not find strong evidence that interviewer gender influenced responses. The bivariate findings are presented in Table 17. Among male respondents, the Obama lead was 14 percentage points with male interviewers and 17 percentage points with female interviewers, indicating no meaningful effect from interviewer gender. Among female respondents, Clinton had a 3 percentage point lead when the interviewer was male and a 6 percentage point lead when the interviewer was female. This finding for female respondents is in the direction of a social desirability effect, but it is not statistically significant; interviewer gender is only marginally significant (p=.079) in the multivariate model. In any case, misreporting favoring Clinton in New Hampshire would not help to explain why her support was underestimated in the polls.


 
Both the race-of-interviewer and gender-of-interviewer analyses are limited because they rely solely on survey responses. Ideally, we would be able to compare survey responses about candidate preference with actual voting behavior, and the interviewers would have been randomly assigned to cases rather than in conjunction with their schedules or other factors.

Summary:  We found mixed evidence for social desirability effects on polling errors. Social desirability pressures may explain a small proportion of the error but probably no more than that. In a pooled analysis of three New Hampshire surveys, we find that support for Obama is significantly greater when the interviewer is black than when he or she is white. In the same analysis, however, Obama is still favored over Clinton among respondents interviewed by a white interviewer, and the number of interviews taken by black interviewers was too small to affect the overall estimates.
 
27 There is a dispute about whether and to what extent a “Bradley effect” ever existed. In terms of the original 1982 gubernatorial election in California, Tom Bradley received more votes than George Deukmejian at the polls but lost in the absentee balloting by a much larger amount in the first election when parties could organize efforts to make absentee ballots available to voters. The Republicans outmaneuvered the Democrats in this regard. But in 1989, in relation to the election returns, there appeared to be an over-report of support for David Dinkins in the pre-election polls for the New York mayor’s race and in an exit poll estimating support for L. Douglas Wilder in the Virginia governor’s race. (Traugott and Price, 1992)  See also a discussion by Lance Tarrance during the 2008 general election campaign (http://www.realclearpolitics.com/articles/2008/10/the_bradley_effect_selective_m.html).
28 The reader should remember, as shown in Table 3, that the inclusion of undecided voters in the candidate preference distribution implies that individual candidate support levels will be underestimated. In New Hampshire, for example, 22 out of 22 polls in the month leading up to the primary underestimated support for Obama, and 18 out of 22 polls underestimated support for Clinton.
29 Available at http://quickfacts.census.gov/qfd/states/33000.html.
 

Weighting

In the common application of the technique, pre-election pollsters use weighting to align the demographic characteristics of their sample with known characteristics of the voting population after the interviewing is completed, usually based on Census data or information about the characteristics of registered voters.30 Constructing a survey weight for primary pre-election polls is complicated by several factors. As with nearly all telephone surveys of the U.S. public, responding samples contain disproportionate numbers of women, seniors, and whites (among other demographic characteristics). Furthermore, the demographic and party identification characteristics of a primary electorate can shift substantially from election to election, making it difficult to identify appropriate parameter estimates for weighting. For example, what proportion of the New Hampshire Republican primary voters will be registered Republicans versus registered Independents, or others? In effect, the selection of weighting variables and the construction of the weights themselves are akin to building a likely voter model.

In 2008, primary pollsters addressed the weighting issue in a number of ways. Some procedures were implemented at the sampling stage, while others were implemented after data collection was completed. We discuss sampling and post-survey adjustment procedures together because, essentially, they were used as two different tools to accomplish the same task: achieving the appropriate levels of representation in the poll for certain groups that most commonly included women/men, young/old, white/black, Hispanic/non-Hispanic, and registered or self-identified Independents, Democrats, and Republicans, depending upon the state and available data from it.

Table 18 presents a summary of the procedures implemented for each poll studied by the committee. At the sampling stage, polls using registration-based sampling (RBS) made use of the information on the frame. Most of the RBS pollsters drew their samples based on what they knew about each person’s voting history, party registration, and/or demographics at some time in the past. At the post-survey adjustment stage, pollsters used several different procedures that appear related to the mode of administration. Two of the three IVR pollsters who were willing to discuss their methodology described how they deleted cases to make their samples more representative.31 They randomly deleted a subset of cases from demographic groups, such as older women, who were overrepresented in the responding sample. None of the CATI pollsters used this deletion technique. Instead, the CATI pollsters generally used an iterative-proportional fitting technique (e.g., raking) to align the responding sample with population control totals for several relevant dimensions.
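For readers unfamiliar with raking, the following minimal sketch shows the basic iterative proportional fitting loop on two margins; the sample, control totals, and convergence rule are hypothetical simplifications rather than any particular pollster’s procedure.

```python
# Illustrative sketch of iterative proportional fitting (raking) on two margins,
# sex and age group. The sample and control totals are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "sex": ["F", "F", "M", "F", "M", "F", "M", "F"],
    "age": ["65+", "65+", "18-34", "35-64", "35-64", "65+", "18-34", "35-64"],
})
df["w"] = 1.0

# Hypothetical population control proportions
targets = {
    "sex": {"F": 0.53, "M": 0.47},
    "age": {"18-34": 0.25, "35-64": 0.45, "65+": 0.30},
}

for _ in range(25):                       # iterate until the margins stabilize
    for var, target in targets.items():
        current = df.groupby(var)["w"].sum() / df["w"].sum()
        ratios = {k: target[k] / current[k] for k in target if k in current.index}
        df["w"] *= df[var].map(ratios)    # scale weights toward the control totals

print(df.groupby("sex")["w"].sum() / df["w"].sum())   # approaches 0.53 / 0.47
print(df.groupby("age")["w"].sum() / df["w"].sum())   # approaches 0.25 / 0.45 / 0.30
```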

The committee is not in a position to evaluate fully the relative merits of case deletion versus post-stratification weighting. With full disclosure of the specific procedures employed, the application of weights has an advantage in that a secondary analyst can compare the weighted distribution to the unweighted distribution to assess the impact of the weights. An analyst who does not agree with the algorithm or its results could, in principle, apply his or her own population-based nonresponse adjustment and assess its relative efficacy compared to the original. Deleting a case could have roughly the same effect on point estimates as assigning it a weight close to zero under certain assumptions, such as a simple random procedure for deletion; but the absence of such a case from the publicly available dataset does not provide a mechanism for assessing alternative models for deletion. Table 6 shows that the IVR polls (some of which use deletion) performed at least as well as the CATI polls (which do not use deletion). In terms of poll accuracy, we found no discernible difference between these adjustment techniques. But given the potential confounding of mode and adjustment technique, we could not evaluate the relative merits of these two approaches without additional information.

An issue of serious concern, however, is whether or not these procedures are appropriately reflected in the margins of error. If cases are deleted, the margins of error should be increased accordingly. None of the pollsters who used the case deletion approach disclosed the number of cases that were deleted. This information is critical to understanding the magnitude of the adjustment and implications for the variance of the survey estimates. Similarly, for post-stratified surveys, the variance in the weights should be reflected in the margin of error.
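One standard way to reflect weight variation in the margin of error is Kish’s approximation for the design effect due to unequal weights; the sketch below uses hypothetical weights and assumes simple random sampling within the design, which is a simplification of any actual poll.

```python
# Illustrative sketch: design effect from unequal weights (Kish approximation),
# effective sample size, and the corresponding margin of error for a proportion.
import numpy as np

weights = np.array([0.4, 0.8, 1.0, 1.0, 1.2, 1.6, 2.3, 0.7])   # hypothetical weights

n = len(weights)
deff = n * np.sum(weights**2) / np.sum(weights)**2   # equals 1 + CV^2 of the weights
n_eff = n / deff

moe_nominal   = 1.96 * np.sqrt(0.25 / n)       # ignores the weighting
moe_effective = 1.96 * np.sqrt(0.25 / n_eff)   # reflects the loss of precision

print(deff, n_eff, moe_nominal, moe_effective)
```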



In addition to presenting this overview of procedures, we sought to evaluate empirically the effect of weighting in the primary polls by comparing weighted trial heat estimates to unweighted estimates where possible. The results are shown in Table 19. In general, we found the unweighted estimates were very similar to the weighted estimates, and the few differences do not make the weighted estimates consistently more accurate than the unweighted estimates. For example, in the New Hampshire Democratic primary, the unweighted University of New Hampshire estimates were 38% Obama and 33% Clinton, for an A value of 0.21. The weighted estimates were 39% Obama and 30% Clinton, for an A value of 0.33. The reader may recall that higher A values reflect a greater deviation between the survey estimates and the actual election result. The results in Table 19 suggest that the weights applied to several of the primary polls were not particularly effective in reducing errors in the estimates of candidate preference. But these weights were often combined with the likely voter model, which makes it difficult to separate the effects of weighting from the effects of likely voter estimation. This seems to be especially true for Gallup, whose unweighted estimates were the closest to the Election Day outcome – though they still showed a slight (37% to 36%) Obama lead – but whose weighted estimates were furthest from the final result.
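Assuming A is the natural log of the ratio of a poll’s Obama-to-Clinton odds to the corresponding odds in the official returns – the predictive accuracy measure discussed earlier in this report and in the appendix – the University of New Hampshire figures work out approximately as follows, using official vote shares of roughly 39% for Clinton and 36% for Obama (small differences from the table values reflect rounding):

$$
A = \ln\!\left(\frac{p_{\mathrm{Obama}}/p_{\mathrm{Clinton}}}{v_{\mathrm{Obama}}/v_{\mathrm{Clinton}}}\right),\qquad
A_{\mathrm{unweighted}} \approx \ln\!\left(\frac{38/33}{36/39}\right) \approx 0.22,\qquad
A_{\mathrm{weighted}} \approx \ln\!\left(\frac{39/30}{36/39}\right) \approx 0.34 .
$$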

Summary:  We found strong evidence that faulty weighting techniques explain some (but not all) of the polling errors. In three of the four surveys suitable for examination, the weighted New Hampshire estimates were less accurate than the unweighted estimates. Because all four unweighted estimates still had Obama in the lead, however, weighting cannot explain all of the error in the polls.
 
30 In some cases, pre-election pollsters are now combining weights to account for sample representation with adjustments made for likelihood of voting. For example, they assign unlikely voters a weight of zero to delete them from the analysis. There is additional discussion of such procedures in the section on Likely Voter Definitions, above.
31 This information was provided in follow up telephone conversations and not in their original submissions of materials for their polls.

Time of Decision

Two survey organizations polling in New Hampshire - Gallup and CBS News - conducted panel studies in which the same sample of voters was interviewed twice. These types of studies can be highly informative because they allow researchers to evaluate changes in preference at the level of the individual voter, not just aggregate change. The two New Hampshire panel studies were designed quite differently. The Gallup study is a “before and after” panel: the sample was interviewed immediately before the January 8th primary (January 4-6, 2008) and again after it (January 11-29, 2008). The CBS News study involved two surveys with the same sample that were both conducted before the primary, spaced several months apart (November 2-12, 2007 and January 5-6, 2008).

The re-interview rates in both studies were quite high. The original Gallup sample included 1,179 respondents planning to vote in the Democratic primary and 1,180 who said that they planned to vote in the Republican primary. The Gallup post-election survey included interviews with 818 of them (69%) who reported voting in the Democratic primary and 800 (68%) who reported voting in the Republican primary. In the CBS News study, just over three-quarters (77%) of the likely Democratic primary voters identified in November completed the January re-interview.

The CBS News study measured change in candidate preferences between November 2007 and early January 2008. The results, presented in Table 20, suggest significant shifts in support during that time.  Among those supporting Obama just days before the primary, only half had been supporting him in November.  Those self-reported late-comers to the Obama campaign included significant numbers of former Clinton supporters.

The composition of Clinton supporters looked quite different. The vast majority (87%) of those supporting Clinton in early January had been planning to vote for her since November. The CBS News study found no evidence that Clinton captured significant numbers of supporters from any of the other major candidates during that time. However, the study did not measure any changes in preferences that took place in the last two or three days of the campaign. Results from a trial heat follow-up question suggest that this gap in measurement may have been important. When asked if their mind was made up or it was too early to say for sure whom they would vote for, almost half (46%) of John Edwards supporters said that it was too early. This suggests that many Edwards supporters might have anticipated his third place finish and were strongly considering voting for one of the top two contenders.

 
The Gallup study began where the CBS study left off, albeit with a different group of New Hampshire voters.  Results from the Gallup post-election interview are presented in Table 21. There is some evidence that Clinton won the vote of many Edwards supporters. Some 6% of Clinton voters said that they supported Edwards in the pre-election poll. Similarly, about 4% of Obama voters had recently come from the Edwards camp.  For his part, Edwards looks to have attracted some former Obama and Clinton supporters.  Unfortunately, sample sizes in both of these studies are too small to say with confidence whether many of these shifts within the electorate were significant.

The impact of this movement toward Clinton on the self-reported vote distribution is shown in Table 22, in which the data are not weighted. The level of self-reported support for Obama (about 37%) did not change substantially from the pre-election to the post-election survey. The proportion endorsing Clinton, however, increased by 4 percentage points. Taken together, the CBS News and the Gallup panels provide some evidence for a real, sizable shift in preferences to Obama in the final weeks of the campaign and a separate shift to Clinton in the final days. This changing terrain complicated the task of the pollsters and increased their difficulty in producing accurate projections.




The Gallup re-interview also provides some empirical insights on why some Democratic voters decided during the final days of the campaign to vote for Clinton. The survey explored voter reactions to the January 5th Democratic debate, Clinton’s emotional campaign appearance on January 7th, and get-out-the-vote (GOTV) efforts conducted January 7th and 8th.  Each of these three events appears to account for a small increase in Clinton’s support at the close of the campaign.

Among Democratic primary voters, viewership of the final debate was quite broad. About two-thirds (65%) reported watching the debate, and an additional 23% reported hearing or reading news coverage about it. Only a small fraction (4%) of those who watched or saw news coverage said that they changed their candidate preference because of the debate.

Coverage of Clinton’s January 7th campaign appearance was followed closely by voters as well. More than eight in ten Democratic primary voters (82%) said that they had seen the video of the campaign appearance, and most viewers said that their reaction was positive or neutral. The Gallup re-interview also asked Clinton voters whether a number of different considerations factored into their decision to vote for her. One of the considerations was the video of that campaign appearance, and 15% reported that this was a factor in their vote.

Another speculated cause of late decisions was a highly publicized campaign event in which Senator Clinton responded in an uncharacteristically emotional way to a voter’s question. The day before the primary election, a woman in Portsmouth asked Clinton how she stays so “put together” on the campaign trail.32 Up until that point, Clinton had a reputation for being somewhat steely and hard-driving, but the question triggered a reflective, emotional response. This exchange revealed a more human side of Clinton that may have appealed to some Democratic voters, including some tepid Obama supporters. However, because most of the New Hampshire polls ended data collection prior to this event, these polls would have missed any related last-minute shift to Clinton.

The only information that we have about when people decided on their choice comes from self- reports of that decision in the exit polls. These are not easy data to evaluate because the exit poll respondents may have had difficulty in responding to or interpreting this question. For example, if a New Hampshire voter had been an early supporter of Hillary Clinton and then reconsidered his or her preference after Iowa but ended up voting for her on Election Day, would they have responded that they decided months ago or in the last day or two? On average, the pre-election polls suggested only about 3% of likely voters were undecided, but about 17% of voters told exit poll interviewers that they made their candidate choice on Election Day.

Table 23 shows the proportions of respondents who were reportedly late deciders in the primary elections. Some 17% of those who voted in the New Hampshire Democratic primary and 19% of those who voted in the Republican primary said they made up their mind on the day they voted. This was only slightly more than the proportions who reported deciding on Election Day in the corresponding 2000 contests (15% and 14%, respectively). Almost 40% said they made up their minds during the final three days of the 2008 campaign (38% of those voting in the Democratic primary and 39% in the Republican primary). This was about the same as in the Democratic primary in 2004 (35%), but higher than the proportion of late deciders in either primary in 2000 (both 26%).



A closer look at the late deciders in the 2008 New Hampshire Democratic primary exit poll does not show enough late movement to Clinton to explain the error in the polls. According to analysis by Gary Langer of ABC News (2008b), the 17% of voters who said they made up their mind on the last day went narrowly for Clinton (39% to 36%) – a margin too small to explain fully the overestimation of support for Obama. Another 21% reported that they decided in the last three days, and they split narrowly for Obama (37% to 34%), again insufficient to explain the estimation problems in the pre-election polls.
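A rough calculation using only the exit poll figures just cited makes the point explicit: weighting each group's Clinton-minus-Obama split by its share of the electorate yields a net contribution to the margin of well under a point, far smaller than the several-point overstatement of Obama's margin in the final polls.

# Rough check of how much the reported late deciders could move the final
# Clinton-Obama margin, using the exit poll figures cited above.
def margin_contribution(share_of_voters, pct_clinton, pct_obama):
    """Percentage-point contribution of a group to Clinton's overall margin."""
    return share_of_voters * (pct_clinton - pct_obama) / 100.0

last_day    = margin_contribution(17, 39, 36)   # decided on Election Day
last_3_days = margin_contribution(21, 34, 37)   # decided in the prior three days

print(round(last_day, 2))                 # about +0.5 points toward Clinton
print(round(last_3_days, 2))              # about -0.6 points (toward Obama)
print(round(last_day + last_3_days, 2))   # net effect is essentially zero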
 
Summary:  We found that decision timing – in particular, late decisions – may have contributed significantly to the error in the New Hampshire polls, though we lack the data for proper evaluation. The fact that Clinton's “emotional moment” at the diner occurred after nearly all the polls were complete adds fuel to this speculation, as does the high proportion of New Hampshire Democratic primary voters who said they made up their minds during the final three days. It is also true that the percentages of voters reporting they made up their minds on Election Day or in the three preceding days are not substantially different from historical levels.
 
32 The exact question that prompted the response was “How did you get out the door every day? I mean, as a woman, I know how hard it is to get out of the house and get ready. Who does your hair?"
 

Participation of Independents Assuming an Obama Victory

Another hypothesis for the error in the New Hampshire polling estimates concerns the relative proportion of self-described Independents in the survey sample and in the electorate.33 Some have speculated that Independents who liked both Obama and McCain could have been under the impression from the polls that Obama had locked up the Democratic race, and so they decided to participate in the Republican primary and support McCain. This hypothesis suggests that: (1) Independents should comprise a larger segment of the Democratic primary electorate in the pre-election polls than in the exit poll because at the time that they were interviewed they had not decided to abandon Obama to support McCain, and (2) Independents should comprise a smaller segment of the Republican primary electorate in the pre-election polls than in the exit poll for the same reason. We find mixed support at best for this hypothesis based on the four pre-election surveys that are reported in Table 24.



Each of these surveys ended data collection on January 6th, two days prior to the primary election. According to the New Hampshire exit poll, 44% of Democratic primary voters were Independents, compared to 37% of Republican primary voters, so they formed a slightly larger proportion of the Democratic than the Republican primary electorate. The pre-election surveys estimated that this figure would be between 39% and 45% on the Democratic side, a fairly narrow range that encompasses the exit poll reading. On the Republican side, the pre-election polls estimated that 30% (Gallup) or 34% (University of New Hampshire) of voters would be Independents, lower than the 37% indicated in the exit poll. These estimates for the Republican race suggest that some Independents classified as Obama voters before the election may have ended up McCain voters at the polls. This effect, however, is not especially large and is not conclusive evidence for an Independent shift.

Summary:  We found little compelling information to suggest that Independents, by deciding in disproportionate numbers and at the last minute to vote in the Republican rather than the Democratic primary, contributed to the New Hampshire polling errors. The proportion of Independent voters in the Democratic primary pre-election polls is comparable to the proportion in the exit poll. Also, the differences between the exit poll and the pre-election surveys are not large enough to explain the errors in the polls.

33 New Hampshire has a party registration system whereby citizens must register as a Democrat, Republican or Undeclared. The Undeclared registrants may opt into either party’s presidential primary. This formal designation is not what is typically measured in the pre-election polls, however; the polls measure self-reported party identification. Andrew Smith of the University of New Hampshire reports that the Secretary of State counted 121,515 Undeclared voters in the Democratic primary (42.1% of all voters) and 75,522 Undeclared voters in the Republican primary (31.3% of all voters).
 

Allocations of Those Who Remained Undecided

In some pre-election surveys, pollsters allocate respondents who remain “undecided” to yield proportions reflecting support for each candidate that add to 100%. Pollsters may use a number of different allocation approaches, any of which may have an effect on the accuracy of their estimates. However, only one of the pollsters who collected data in the contests under study, the University of New Hampshire Survey Center, used an allocation method. Without allocation, they showed a 9 percentage point lead for Obama over Clinton, and with allocation, they showed the same lead.34 Hence this cannot explain differences between the final pre-election estimates and the outcome of the elections.
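For context, the most common allocation approach distributes the undecided share across candidates in proportion to their decided support, which preserves the candidates' relative standing and leaves the reported lead essentially unchanged. The sketch below illustrates this with hypothetical percentages roughly in line with the figures discussed in this report; it is not a reconstruction of the UNH Survey Center's actual method.

# Sketch of proportional allocation of "undecided" respondents.  The
# percentages are hypothetical; the point is that proportional allocation
# preserves the candidates' relative standing.
def allocate_proportionally(decided, undecided):
    total_decided = sum(decided.values())
    return {name: pct + undecided * pct / total_decided
            for name, pct in decided.items()}

decided = {"Obama": 39.0, "Clinton": 30.0, "Edwards": 17.0, "Others": 11.0}
allocated = allocate_proportionally(decided, undecided=3.0)

print({name: round(pct, 1) for name, pct in allocated.items()})
# Obama's lead over Clinton moves only from 9.0 to about 9.3 points, so the
# allocation cannot account for the gap between the polls and the outcome.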
 
34 See the University of New Hampshire press release for this result: http://www.unh.edu/survey- center/news/pdf/primary2008_demprim10708.pdf.
 

Ballot Order Effects

Another possible explanation for differences between the pre-primary poll estimates and the election outcomes concerns measurement in the election itself, rather than in the polls. Political methodologists have documented a small but non-trivial bias in favor of candidates listed first on election ballots (Miller and Krosnick 1998), one that appears to be robust in primary elections (Ho and Imai 2008). This bias is a version of a primacy effect (the opposite of the recency effect discussed in the Trial Heat Question Wording section), which is a cognitive bias leading people to select options presented near the top of a list when the list is presented visually, as on a ballot.

Jon Krosnick of Stanford University wrote an op-ed for the ABC News website explaining how a ballot order effect may have influenced the results of the New Hampshire Democratic primary (2008) and contributed significantly to discrepancies between the pre-election surveys and the election outcome in New Hampshire. Krosnick noted that, unlike previous primaries in the state, the 2008 contest featured the same ordering of candidate names on all ballots: an alphabetical list starting with the randomly drawn letter Z. Consequently, Joe Biden was listed first on every ballot, closely followed by Hillary Clinton, and Barack Obama was listed near the bottom of the 21 candidate names. Krosnick estimated that Clinton received at least a three percentage point boost from her position near the top of the ballot order.

In other early primaries, ballot order does not explain differences between the election outcome and the pre-election polls. One reason for this may be that the list of candidates on the New Hampshire ballot was more than twice as long as the list in other primaries (21 names versus 8 names), and primacy effects are generally thought to increase in size with list length. Ballot order rules, candidate name orderings, and election outcomes are presented in Table 25.
 
In South Carolina and Wisconsin, Clinton was placed near the top of the list and Obama was listed near the bottom. Had there been a strong ballot order effect, we would expect Clinton to have done better on Election Day in these states than in the polls. In fact, the reverse occurred. Obama’s actual margin of victory in South Carolina and Wisconsin was substantially greater than the margin suggested by the polls. This result does not rule out the possibility that Clinton benefitted from her higher ballot position in these states, but it suggests that any such effect was swamped by other factors. In California, the ordering was randomized and then rotated across districts, so there is no reason to believe that ballot order affected Election Day support for Clinton or Obama in that primary.

The ballot order analysis presented here is purely observational and therefore may be limited in its generalizability. We do not find evidence for a ballot order effect for the top two candidates in the South Carolina, Wisconsin, or California primaries, but this does not rule out the possibility of such an effect in New Hampshire, especially given the much longer list of candidates on the New Hampshire ballot. Unfortunately, we have no way to test directly for a ballot order effect in New Hampshire. We can only infer the consequences, as Krosnick does, from similar elections in which ballot order was varied across precincts. Absent more compelling data, we find no reason to discount Krosnick’s hypothesis.

Summary:  Ballot order is one of the possible explanations for some of the estimation errors, and it may explain some of the error in the New Hampshire Democratic primary polls. Krosnick’s (2008) analysis of recent New Hampshire primaries suggests a 3 percentage point effect from ballot order. Clinton was listed near the top and Obama near the bottom on every ballot, which is consistent with greater support for Clinton in the returns than in the pre-election polls. This conclusion rests on observational rather than experimental evidence of ballot order effects.

Conclusions

The committee evaluated a series of hypotheses that could be tested empirically, employing information at the level of the state, the poll, and, in limited cases, the respondent. Since the analysis was conducted after data collection, it was not possible to evaluate all of the hypotheses in a way that permitted strong causal inferences. And given the incomplete nature of the data for various measures, it was not possible to pursue all hypotheses about what might have happened, nor was it possible to pursue multivariate analyses that looked simultaneously at multiple explanatory factors. In the end, however, the analysis suggests potential explanations for the estimation errors and identifies other factors that are unlikely to have had an impact. The research also highlights the need for additional disclosure requirements and for better education efforts by professional associations like AAPOR, the Council of American Survey Research Organizations (CASRO), and the National Council on Public Polls (NCPP).

Polling in primary elections is inherently more difficult than polling in a general election. Usually there are more candidates in a contested primary than in a general election, and this is especially true at the beginning of the presidential selection process. For example, there were a total of 15 candidates entered in the Iowa caucuses and more than 20 names on the New Hampshire primary ballot. Since primaries are within-party events, voters do not have the cue of party identification to rely on in making their choice, and this uncertainty can create additional problems for pollsters. Turnout is usually much lower in primaries than in general elections, although it varies widely across events; turnout in the Iowa caucuses, for example, tends to be relatively low compared to the New Hampshire primary. So estimating the likely electorate is often more difficult in primaries than in the general election. Furthermore, the rules of eligibility to vote in the primaries vary from state to state and even within party; New Hampshire has an open primary in which independents can decide at the last minute which party’s primary to vote in. All of these factors can contribute to variations in turnout, which in turn may affect the distribution of candidate preferences among voters in a primary compared to the general election.

The estimation errors in the polls before the New Hampshire Democratic primary were of about the same magnitude, as measured by the statistic A, as in the Iowa caucus. But the misestimation problems in New Hampshire received much more – and more negative – coverage than they did in Iowa. Because every poll included a small proportion of undecided voters, the estimate for each candidate was generally lower than the share of votes that candidate received, and these underestimates tended to be greater for the first place finisher than for the second place finisher. But the majority of the polls before New Hampshire suggested the wrong winner, while only half of those in Iowa did.

All of the committee’s conclusions are summarized briefly in Table 26. Factors that may have influenced the estimation errors in the New Hampshire pre-primary polls include:
  • Respondents who required more effort to contact seemed more likely to support Senator Clinton, but most interviews were conducted on the first or second call, which tilted the completed samples toward Senator Obama.
  • Patterns of nonresponse, derived from comparing the characteristics of the pre-election samples with the exit poll samples, suggest that some groups that supported Senator Hillary Clinton were underrepresented in the pre-election polls.
  • Variations in likely voter models could explain some of the estimation problems in individual polls. Application of the Gallup likely voter model, for example, produced a larger error than their unadjusted data. While the “time of decision” data do not look very different in 2008 compared to recent presidential primaries, about one-fifth of the voters in the 2008 New Hampshire primary said they were voting for the first time. This influx of first-time voters may have had an adverse effect on likely voter models.
  • Variations in weighting procedures could explain some of the estimation problems in individual polls. And for some polls, the weighting and likely voter modeling were comingled in a way that makes it impossible to distinguish their separate effects.
  • Although no significant social desirability effects were found that systematically produced an overestimate of support for Senator Obama among white respondents or for Senator Clinton among male respondents, an interaction effect between the race of the interviewer and the race of the respondent did seem to produce higher support for Senator Obama in the case of a black interviewer. However, Obama was also preferred over Clinton by those who were interviewed by a white interviewer.
Factors unlikely to have contributed to the estimation errors in New Hampshire include:
  • The exclusion of cell phone only (CPO) individuals from the samples did not seem to have an effect. However, this proportion of citizens is going to change over time, and pollsters should remain attentive to its possible future effects.
  • The use of a two-part trial heat question, intended to reduce the level of “undecided” responses, did not produce that desired effect and does not seem to have affected the eventual distributions of candidate preference.
  • The use of either computer-assisted telephone interviewing (CATI) techniques or interactive voice response (IVR) techniques made no difference to the accuracy of estimates.
  • The use of the trial heat questions was quite variable, especially with regard to question order, but no discernible patterns of effects on candidate preference distributions were noted. While the names of the (main) candidates were frequently randomized, the committee did not receive data that would have permitted an analysis of the impact of order.
  • Little compelling information indicates that Independents made a late decision to vote in the New Hampshire Republican primary, thereby increasing estimate errors.
Factors that present intriguing potential explanations for the estimation errors in the New Hampshire polls, but for which the committee lacked adequate empirical information to thoroughly assess include:
  • The wide variation in sample frames used to design and implement samples – ranging from random samples of listed telephone numbers, to lists of registered voters with telephone numbers attached, to lists of party members – may have had an effect. Greater disclosure about sample frames and sample designs, including respondent selection techniques, would facilitate future evaluations of poll performance.
  • Differences among polls in techniques employed to exclude data collected from some respondents could have affected estimates. Given the lack of detailed disclosure of how this was done, it is not possible to assess the impact of this procedure.
Finally, factors that appeared to be potential explanations for estimation errors, but for which the committee lacked any empirical information to assess include:
  • Because of attempts by some states to manipulate the calendar of primaries and caucuses, the Iowa and New Hampshire events were rescheduled to the first half of January, with only five days between the events, truncating the polling field period in New Hampshire following the Iowa caucus.
  • Given the calendar, polling before the New Hampshire primary may have ended too early to capture late shifts in the electorate there, measuring momentum as citizens responded to the Obama victory in the Iowa caucus but not to later events in New Hampshire such as the restaurant interview with Senator Hillary Clinton.
  • The order of the names on the ballot – randomly assigned but fixed on every ballot - may have contributed to the increased support that Senator Hillary Clinton received in New Hampshire.
All of the information provided to the committee is being deposited in the Roper Center Data Archive, where it will be available to other analysts who wish to check on the work of the committee or to pursue their own independent analysis of the pre-primary polls in the 2008 campaign.

References

Blumberg, Stephen J. and Julian V. Luke. 2008. “Wireless Substitution: Early Release of Estimates from the National Health Interview Survey, July-December 2007.” Web article posted for the National Center for Health Statistics at http://www.cdc.gov/nchs/data/nhis/earlyrelease/wireless200805.htm.
Crespi, Irving. 1988. Pre-election Polling: Sources of Accuracy and Error. New York: Russell Sage Foundation.
Curtin, Richard, Stanley Presser, and Eleanor Singer. 2000. "The Effects of Response Rate Changes on the Index of Consumer Sentiment." Public Opinion Quarterly 64:413-28.
DeSart, Jay and Thomas Holbrook. 2003. “Campaigns, Polls, and the States: Assessing the Accuracy of Statewide Presidential Trial-Heat Polls.” Political Science Quarterly, 56(4): 431-439.
Erikson, Robert S., Costas Panagopoulos, and Christopher Wlezien. 2004. “Likely (and Unlikely) Voters and the Assessment of Campaign Dynamics.” Public Opinion Quarterly 68(4):588-601.
Erikson, Robert S. and Christopher Wlezien. 2008. “Likely Voter Screens and the Clinton Surprise in New Hampshire.” Web article posted for Pollster.com at  http://www.pollster.com/blogs/likely_voter_screens_and_the_c.php.
Frankovic, Kathleen. 2008. “N.H. Polls: What Went Wrong?” Web article posted for CBS News at  http://www.cbsnews.com/stories/2008/01/14/opinion/pollpositions/main3709095.shtml?source=RSS&attr=_3709095.
Groves, Robert M. 2006. “Non-response Rates and Non-response Bias in Household Surveys.” Public Opinion Quarterly 70:646-675.
Ho, Daniel E. and Kosuke Imai. 2008. “Estimating Causal Effects of Ballot Order from a Randomized Natural Experiment: The California Alphabet Lottery, 1978 – 2002.” Public Opinion Quarterly, 72:216-240.
Holbrook, Allison, Jon A. Krosnick, David Moore, and Roger Tourangeau. 2007. “Response Order Effects in Dichotomous Categorical Questions Presented Orally: The Impact of Question and Respondent Attributes.” Public Opinion Quarterly 71:325-348.
Hopkins, Daniel J. 2008. “No More Wilder Effect, Never a Whitman Effect: When and Why Polls Mislead about Black and Female Candidates.” Poster presented at the Meeting of the Society for Political Methodology.
Jones, Jeff. 2008. “Cell Phones in Primary Pre-Election Surveys.” Paper presented at the Annual Conference of the American Association for Public Opinion Research.
Keeter, Scott. 2006. “The Impact of Cell Phone Noncoverage Bias in the 2004 Presidential Election.” Public Opinion Quarterly 70:88-98.
Keeter, Scott. 2008. “The Impact of ‘Cell-Onlys’ on Public Opinion Polling: Ways of Coping with a Growing Population Segment” Web article posted for the Pew Research Center at http://people-press.org/report/391/.
Keeter, Scott, Michael Dimock, and Leah Christian. 2008. “Cell Phones and the 2008 Vote: An Update” Web article posted at http://pewresearch.org/pubs/964/.
Keeter, Scott, Carolyn Miller, Andrew Kohut, Robert M. Groves, and Stanley Presser. 2000. "Consequences of Reducing Non-response in a Large National Telephone Survey." Public Opinion Quarterly 64:125-48.
Keeter, Scott, Courtney Kennedy, Michael Dimock, Jonathan Best, and Peyton Craighill. 2006. “Gauging the Impact of Growing Non-response on Estimates from a National RDD Telephone Survey." Public Opinion Quarterly 70:759-779.
Kohut, Andrew. 2008. “Getting It Wrong.” The New York Times, January 10. Available at  http://www.nytimes.com/2008/01/10/opinion/10kohut.html?ref=opinion.
Krosnick, Jon A. 2008. “Clinton’s Favorable Placement on Ballots May Account for Part of Poll Mistakes.” Web article posted for ABC News at http://abcnews.go.com/PollingUnit/Decision2008/story?id=4107883.
Krosnick, Jon A. and Duane Alwin. 1987. “An Evaluation of a Cognitive Theory of Response-Order Effects in Survey Measurement.” Public Opinion Quarterly 51:210-219.
Langer, Gary. 2008a. “Dissecting the ‘Bradley Effect’.” Web article posted for ABC News at  http://blogs.abcnews.com/thenumbers/2008/10/the-bradley-eff.html.
Langer, Gary. 2008b. “A New Hampshire Post-Mortem.” Web article posted for ABC News at  http://blogs.abcnews.com/thenumbers/2008/02/a-new-hampshire.html.
Langer, Gary. 2008c. “Cell-Onlies: Report on a Test.” Web article posted for ABC News at  http://blogs.abcnews.com/thenumbers/2008/09/cell-onlies-rep.html.
Martin, Elizabeth A., Michael W. Traugott, and Courtney Kennedy. 2005. “A Review and Proposal for a New Measure of Poll Accuracy.” Public Opinion Quarterly 69: 342-369.
Merkle, Daniel, and Murray Edelman. 2002. "Non-response in Exit Polls: A Comprehensive Analysis." In Survey Non-response, ed. Robert M. Groves, Don A. Dillman, John L. Eltinge, and Roderick J. A. Little, pp. 243-58. New York: Wiley.
Mosteller, Frederick. 1948. The Pre-Election Polls of 1948. New York: Social Science Research Council.
Nichols, John. 2008. “Did ‘The Bradley Effect’ Beat Obama in New Hampshire?” Web article posted for The Nation and available at http://www.thenation.com/blogs/state_of_change/268328.
Robinson, Eugene. 2008. “Echoes of Tom Bradley.” The Washington Post January 11, A17.
Traugott, Michael W., Robert M. Groves, and James M. Lepkowski. 1987. “Using Dual Frame Designs to Reduce Non-response in Telephone Surveys.” Public Opinion Quarterly 51: 522-539.
Traugott, Michael W., Brian Krenz, and Colleen McClain. 2008. “Press Coverage of the Polling Surprises in the New Hampshire Primary.” Paper presented at the Annual Conference of the Midwest Association for Public Opinion Research.
Traugott, Michael W. and Vincent Price. 1992. “Exit Polls in the 1989 Virginia Gubernatorial Race: Where Did They Go Wrong?” Public Opinion Quarterly 52: 245-253.
Traugott, Michael W. and Clyde Tucker. 1984. “Strategies for Predicting Whether a Citizen Will Vote and Estimation of Electoral Outcomes.” Public Opinion Quarterly 48: 330-343.
ZuWallack, Randal, Jeri Piehl, and Keating Holland. 2008. “Supplementing a National Poll with Cell Phone Only Respondents.” Paper presented at the Annual Conference of the American Association for Public Opinion Research.

Appendix A

Members of the AAPOR Special Committee on 2008 Presidential Primary Polling
 
Glen Bolger, a partner and co-founder of Public Opinion Strategies.
Darren W. Davis, Professor of Political Science at the University of Notre Dame.
Charles Franklin, Professor of Political Science at the University of Wisconsin and co-developer of Pollster.com.
Robert M. Groves, Director, the University of Michigan Survey Research Center, Professor of Sociology at the University of Michigan, Research Professor at its Institute for Social Research, and Research Professor at the Joint Program in Survey Methodology at the University of Maryland.
Paul J. Lavrakas, a methodological research consultant.
Mark S. Mellman, CEO of The Mellman Group.
Philip Meyer, Professor Emeritus in Journalism at the University of North Carolina.
Kristen Olson, Assistant Professor of Survey Research and Methodology and Assistant Professor of Sociology at the University of Nebraska-Lincoln.
J. Ann Selzer, President of Selzer & Company.
(Chair) Michael W. Traugott, Professor of Communication Studies and Senior Research Scientist in the Center for Political Studies at the Institute for Social Research at the University of Michigan.
Christopher Wlezien, Professor of Political Science and Faculty Affiliate in the Institute for Public Affairs at Temple University.

Appendix B

 
Charge to the Committee
  1. Examine the available information concerning the conduct and analysis of the pre-election, entrance and exit polls related to the 2008 primaries, including, but not limited to, press releases, post-election hypotheses, and evaluations conducted by the respective polling organizations as well as the news media.
  2. To the extent possible, interview those involved in the New Hampshire primary pre-election and exit polls to gather additional information for the committee.
  3. Synthesize and report on the findings from the various polling organizations. The report will include a summary of the factors that have been considered to date as well as recommendations and guidelines for future research. The report is tentatively scheduled for release in early April, 2008.
  4. Present the findings from the report in a public forum, hosted by the Kaiser Family Foundation at its Barbara Jordan Conference Center in Washington, D.C. The forum is tentatively scheduled for Spring, 2008.
  5. To facilitate research on all of the possible factors that may have contributed to the New Hampshire polling process this year, request all sample, survey and ancillary data associated with the polls leading up to and following the New Hampshire primary, including the New Hampshire exit poll data. The request will be broad in nature – for example, so as to inform hypotheses concerning nonresponse, sample and call-record data will be included.
  6. The Roper Center has generously offered to serve as the archivist for the data associated with the ad hoc committee. The Roper Center is keenly sensitive to the risks associated with these datafiles and the potential for exposure of confidential information. These data will be archived and maintained separately from the general access archives in a secure environment. Scholars interested in analyzing these restricted datasets will complete an Application for Restricted Data Use. Approval of the application will be made if the research purpose meets the criteria outlined by the ad-hoc committee. Further limitations to the researcher will include the destruction of the data files after a designated period of time, as outlined in the application.
  7. Establish seed-funding for support of additional research on the New Hampshire and other primary pre-election and exit poll data. This may include support of the work of the ad hoc committee to undertake analysis or the work of individual scholars interested in conducting research on the topic.
  8. Beyond the public forum in Spring, 2008, the findings of the ad hoc committee will be disseminated on the AAPOR web site and as part of a special panel discussion at the AAPOR 63rd Annual Conference.
 

Letter to Survey Firms from President Mathiowetz

I am writing to you as President of the American Association for Public Opinion Research (AAPOR) with regard to polling you have conducted in [STATE] as part of the 2008 presidential election campaign. As you may be aware, AAPOR has named an Ad Hoc Committee on 2008 Presidential Primary Polling. The task of that committee is to evaluate the methodology of the pre-election primary polls and the way they are being reported by the media and used to characterize the contests in pre-election press coverage. Although originally formed in response to the disparity between the pre-election polls and the outcome of the Democratic contest in New Hampshire, the mission statement of the committee has been expanded to include examination and archiving of primary data conducted in states other than New Hampshire.

The variation between the pre-election polls and the final outcome of the elections in, for example, New Hampshire (Democrats), South Carolina (Democrats), and California (Republicans), has raised questions concerning the profession as a whole as a reflection of the quality of estimates of candidate standing in those contests. The horserace aspect of polling, albeit only a small part of the work our profession does, offers an immediate and visible validation of survey estimates. In this way, the image of the entire industry is affected by the quality of the estimates that are made at the end of political campaigns. We are a profession that benefits from our collective understanding of the sources of errors that impact our estimates. After the 1948 Presidential election, the pollsters involved in the pre- election polling undertook an examination and analysis of the factors that contributed to the miscalling of that election. It is in that spirit, and because of the collective knowledge that will come from this work, that I ask for your cooperation with the request outlined below.

The AAPOR Code of Professional Ethics and Practices as well as the Principles of Disclosure of the National Council of Public Polls (NCPP) and the Code of Standards and Ethics of the Council of American Survey Research Organizations (CASRO) all call for the disclosure of essential information about how research was conducted when the results are widely disseminated to the public. At a minimum, we ask for disclosure of the essential information outlined by these codes. However, you will see that the request does go beyond the disclosure guidelines outlined in these respective codes. This information will be critical for the AAPOR Committee to pursue its evaluation.

The Ad Hoc Committee will focus on addressing empirically-based hypotheses that can be addressed post hoc–for example, whether differences in likely voter screening, turnout models, differential non-response, the allocation of undecideds, weighting procedures, and other sources of measurement error could have contributed to these estimation errors. To address these issues, the request outlined in the attached document is broad-based, ranging from full questionnaires to individual-level data to documentation of procedures. The committee is interested in obtaining information from every firm or research center that collected data prior to these elections and caucuses.

The Roper Center has offered to serve as the archivist for the data associated with the work of the committee because we expect that when the Ad Hoc Committee issues its report, others will be interested in examining how the committee came to the conclusions it did. The Roper Center is keenly sensitive to the risks associated with these data files and the potential for exposure of confidential information. In the short run, access will be limited to the committee (and research assistants working with the committee members). In the long run, the goal is to provide access to the data for other scholars. The Roper Center will work with individual pollsters to determine which files may eventually be made available to the broader community of scholars (for example, after an embargo period).

Scholars interested in analyzing these restricted datasets will complete an Application for Restricted Data Use; review of these applications will be completed by a joint committee of the Roper Center and a subgroup of the Ad Hoc Committee.

I look forward to working with you in the weeks to come. The issues we face as a profession are challenging; I hope the work of the committee sheds light on issues that may have been unique to the 2008 pre-election primary polls as well as those issues that will inform and improve the methodology of our industry in the years to come. If you have any questions or would like any additional information about AAPOR or the Ad Hoc Committee, please feel free to contact me.

Regards,
 
Nancy A. Mathiowetz
President, American Association for Public Opinion Research
 

Appendix C

 
AAPOR Special Committee on 2008 Presidential Primary Polls Information Request
 
The request for information and data has been separated into two parts, (1) information that is part of the AAPOR Standards for Minimal Disclosure and (2) information or data that goes beyond the minimal disclosure requirements. The information you provide could take one or more forms— documentation, tables, and individual-level data. Our mission is to be able to examine data that will empirically inform questions concerning the disjuncture between the pre-election primary polls and the election outcomes. Members of the committee are willing to work with you in order to obtain the information of interest, in whatever form is easiest for you to provide.
 

AAPOR Standards for Minimal Disclosure

 
As noted in the AAPOR code of professional ethics and practices, the following items should, at a minimum, be disclosed for all public polls:
 
1. Who sponsored the survey, and who conducted it.
 
2. The exact wording of questions asked, including the text of any preceding instruction or explanation to the interviewer or respondents that might reasonably be expected to affect the response.
 
Here the committee would request that you provide the complete questionnaire–hard copy, executable, screen shots–in whatever format is feasible. This should indicate which question or questions are used as the likely voter screening and which question or questions are used for the trial heat. The questionnaire should indicate any randomization of questions or response options.
 
3. A definition of the population under study, and a description of the sampling frame used to identify this population.
 
The description of the sampling frame should indicate whether or not the frame includes cell phones.
 
4. A description of the sample design, giving a clear indication of the method by which the respondents were selected by the researcher, or whether the respondents were entirely self- selected.
 
For those studies that include cell phones in the sample frame, the description should include the sample selection for these numbers.
 
5. Sample sizes and, where appropriate, eligibility criteria, screening procedures, and response rates computed according to AAPOR Standard Definitions. At a minimum, a summary of disposition of sample cases should be provided so that response rates could be computed.
 
In addition to sample sizes, demographic information on those screened out would be useful in distinguishing between nonresponse bias and turnout model bias.
 
6. A discussion of the precision of the findings, including estimates of sampling error, and a description of any weighting or estimating procedures used.
 
In addition to the standard weighting information, please provide documentation of the weighting procedure used to produce the final turnout estimate.
 
7. Which results are based on parts of the sample, rather than on the total sample, and the size of such parts.
 
8. Method, location, and dates of data collection.
 
If interviewers are used for the data collection, please indicate if live interviewers or IVR technology was used.

Beyond Minimum Disclosure 

To address the various hypotheses concerning the New Hampshire, South Carolina, California, and Wisconsin pre-election polls, the committee is requesting information and data beyond the minimum disclosure outlined above. For several of the items, the request is for the raw data; if the data are not available or can not be made available, the committee would benefit from the provision of the analysis tables as described. In cases where no written documentation is available, one or more of the committee members would be willing to discuss the procedures with you or one of your staff members.
 
I. Data
 
1. Individual-level data for all individuals contacted and interviewed, including those who failed the likely-voter screening and including all weights used in the production of the final estimates prior to the election, date of the interview, and interviewer identification number
 
A. In lieu of the individual-level data, demographic information on those who were screened out along with crosstabulations between vote preference and likely/ unlikely voters as well as registered/unregistered voters
 
B. Final estimate of the demographic composition of the turnout
 
C. Share of the voting age population represented by the turnout estimate, within demographic subgroup
 
D. Tabulations that reflect the combination of likelihood of voting and candidate preference, indicating how the trial heat question might differ by different estimates of turnout
 
E. If date of interview is not included on the individual level data file, distribution of candidate preferences (the trial heat question) by date of interview
 
F. In lieu of individual-level data, crosstabulations of voter preference by demographic characteristic, within subgroups. We are especially interested in the distributions of candidate preference by age, race, sex, party identification, political ideology, and religiosity
 
2. Reinterview data, if a post-election re-interview was conducted
 
3. To examine hypotheses related to social desirability, it would be beneficial to be able to examine characteristics of interviewers, matched to respondents. Listed in (I.1) above was a request for an interviewer ID as part of the individual file. Ideally, we would like to be able to link individual level records to characteristics of interviewers
 
4. To examine hypotheses related to nonresponse bias, the committee would need to have access to data for the full sample, including call record information, disposition of each sampled number, including attempts at recontacts
 
II. Documentation
 
1. Interviewer documentation, including instructions not included in the text of the questionnaire (e.g., instructions for probing “Don’t Know” responses).
2. Allocation rules for Don’t Knows and Undecideds
3. Documentation of the rules, if any, for sample allocation to interviewers
4. Documentation of approach to handling “early voting” in the composition of the final sample
5. The last press release or releases associated with your poll that are most proximal to the election.


 
Appendix D

Hypotheses for Sources of Error in 2008 Primary Pre-election Surveys

1.    Likely voter screening
1a.  Likely voter screening questions used in general elections do not work as well in (the 2008) primary elections.
1b. Likely voter screening questions used in primaries do not work as well in unusually high turnout primaries.
2.    Turnout models/Turnout surge
2a. Because of the calendar and the proximity of events, voter interest is stimulated from one event to another.
2b. The higher turnout of African Americans was underestimated on the Democratic side.
2c. The higher turnout of women was underestimated on the Democratic side.
3.    Inability to capture last-minute deciders or changes in preference
3a. The nature of the contests (in 2008) means that many voters are making up their minds late and deciding to go to the polls (combination of turnout and decisions)
3b. Voters are changing their minds late; turnout estimates are all right but preferences change.
4.    Misreporting issues
4a. Voters are misreporting preferences to interviewers (social desirability) because they are unwilling to say they won’t vote for an African American or a female candidate
4b.  Respondents are misreporting their intention to vote (staying home or going to the polls).
4c. Misreporting is greater for “non-traditional” (non-white, non-male) candidates than for traditional candidates.
5.    Nonresponse bias
5a. The short time between events and the need to take previous results into account in measurement means that response rates are low, and nonrespondents are different from respondents.
5b.  Differential nonresponse by key groups in the electorate affects the distribution of candidate preferences (African Americans, Whites, men, women).
6.    Question wording effects
6a. Differences in the way that the “trial heat” question is asked account for differences in polls.
6b. Differences in the way that the trial heat question is asked (e.g., explicit or implicit Don’t Know alternatives) produce different levels of Don’t Know or Undecided responses.
6c. The ordering of the candidates in the trial heat question affects the distribution of responses (i.e., primacy or recency effects)
7.    Question order effects
7a.  The preceding questions affect responses to the “trial heat” question but do not have an impact in the voting booth
8.    Allocation of undecideds
8a. Some results are being reported with undecideds included while others are being reported with them excluded
8b. Some results are being reported with the undecideds allocated (by different methods) while others are not
9.    Sampling Issues
9a.  Some groups are being oversampled, others under sampled
9b.  In states where there are open primaries, Independents or non-identified people are being oversampled or overweighted
9c.  Weighting algorithms are not working as well in the primaries as in a general election.
9d. A large proportion of votes are being cast absentee or by other “early” procedures and they are not being captured in the survey or they are mis-weighted proportionately when combined with those who intend to vote in person on Election Day.
9e. Differential loss of cell phone only voters by state in certain primaries?
10.    External Factors
10a. The rules of voting in the primaries are not adequately captured in the polling methodology (who is eligible to vote in a specific event)
10b. The order of the names on the ballot has an independent effect on the outcome (or in relation to the order of the names in the “trial heat” question)
10c. Senator Clinton’s emotional response swayed voters at the last minute
10d. President Clinton’s sharp criticisms of Obama swayed African American voters at the last minute