
Data Science

Machine learning & data science for beginners and experts alike.
DrDan
Alteryx Alumni (Retired)

[Image: Presidential Election App]

In the run-up to this year's Presidential election, one or more new national horse-race polls come out on a daily basis. However, as the late Tip O'Neill famously said: "All politics is local." This is true even for a presidential election. Local campaign offices need to determine exactly where to send their field canvassers or place signs for their candidate to get maximum impact, while the national campaign needs to determine which community center, factory, school, or neighborhood is the right one to send their candidate or a high-level surrogate to for an event, and where to concentrate their "get out the vote" efforts to maximize their probability of victory. While polling information at the national level is plentiful, and is available in reasonable quantities for some states, the same cannot be said of a 15-block area of Youngstown, Ohio. Conducting a well-done poll is expensive in general, and conducting one for an arbitrary 15-block area in a particular community is essentially impossible. Consequently, the ability to leverage plentiful national and state-level polling information for decision making at the local level provides a huge benefit to a campaign. The question is how to do this. The answer can be found in the tools and technology behind Alteryx's Presidential Election App, now available on the Alteryx Analytics Gallery.


The Presidential Election App provides local-area estimates (down to a 15-block area of Youngstown, Ohio) of the percentage of registered voters in that area who plan to vote for or lean toward either of the two major-party candidates (Barack Obama and Mitt Romney), a third-party candidate, or who remain undecided about the race. While a large number of people are likely to find these numbers of interest, the natural question to ask is what they are based on. In this series of blog posts, I describe how the estimates reported in the Presidential Election App are developed, and some of the challenges we faced in producing them. The development of this app has had three distinct phases: (1) modeling voter preferences using individual respondent data from the USA Today / Gallup polls (provided to us by the Roper Center) conducted in late July, August, and September of this year (we hope to receive the final October poll data prior to the election); (2) matching the demographic, socioeconomic, and party-affiliation measures collected in the polling data to the data that is actually available at the local level in order to make local-area predictions; and (3) developing the reports and user interface for querying and reporting the voter preference estimates. These posts will focus on the first two phases (starting with the voter preference model); the third component involves Alteryx's industry-leading spatial ETL (extract/transform/load) tools and its flexible reporting tools, but does not influence the predicted values themselves.


The Voter Preference Model

In addition to candidate preference, the USA Today / Gallup Poll collects information on each participant's demographic characteristics (e.g., gender, age, race, Hispanic ethnicity), socioeconomic characteristics (e.g., income, educational attainment, employment status), political party affiliation, their identification and participation with respect to organized religion, and their state of residence.

[Image: The effect of party identification on preference]

The demographic and socioeconomic measures generally have close (although not always perfect, as we will see in the next section) analogs at local geographic levels via Census and Census-related data, such as the Experian CAPE data that is bundled with Alteryx. In the case of political party affiliation, the USA Today / Gallup Poll asks respondents to self-identify the party they consider themselves to be a member of or lean toward, while the data available to us at the local level (via a third-party political consulting firm that wishes to remain anonymous) is based on a probability measure that a voter will self-identify with a particular party. Local-area measures related to religion are not available to us for this application, so they were not considered in the model development process.


As is common in developing predictive analytic models, exploratory analysis was the first activity carried out. In examining the relationships between the available predictors and candidate preferences, a number of our prior hypotheses were confirmed, while others were not. In particular, the expected strong effects of party identification, age, and racial background were confirmed. However, while the direction of their effects was consistent with expectations, the effects of gender, Hispanic ethnicity, income, and educational attainment were not as strong as one might expect. By and large, these same patterns emerged when we moved from the exploratory analysis to developing the full choice model, with the exception that income has a more important (albeit not overwhelming) effect than the exploratory analysis suggests.

[Image: The effect of age on preference]


The preference model was created using the random forest algorithm via Alteryx's Forest Model predictive analytics macro, which is based on the R randomForest package. Overall, the model fits the data extremely well, classifying the candidate preference of poll respondents not used to create the model with nearly 80% accuracy.
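The actual model was fit with Alteryx's Forest Model macro (R's randomForest under the hood), but the basic recipe described above — fit a random forest on categorical predictors, then check classification accuracy on held-out respondents — can be sketched in Python with scikit-learn. Everything below (feature names, data, labels) is a synthetic stand-in, not the poll data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins for the poll's categorical predictors,
# integer-coded: party id, race, income band, age band, region.
X = rng.integers(0, 4, size=(n, 5))

# Synthetic preference labels driven by the "party id" column,
# mimicking its dominant role in the real model.
y = np.where(X[:, 0] == 0, "candidate_a",
             np.where(X[:, 0] == 3, "candidate_b", "undecided"))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)

# Accuracy on respondents not used to fit the model.
print(round(forest.score(X_test, y_test), 3))
```

Because the synthetic labels are a deterministic function of one feature, the holdout accuracy here will be near perfect; with real survey data the out-of-sample number is what matters, which is where the roughly 80% figure above comes from.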

The important variables in the predictive model, in their order of importance, are:

  1. Party identification
  2. Race
  3. Income
  4. Age


Out of this set, party identification is by far the most important predictor. As is often the case in predictive models, demographic and socioeconomic factors are relatively less important than measures more closely aligned with the behavior of interest, in this case party identification and Presidential candidate preference. Socioeconomic and demographic factors are related to party identification, but the effects of these variables, beyond the effect they have on party identification, are generally modest. In addition to these four variables, region-of-the-country identifiers are also included in the model, but have a minimal impact. Because of their extremely weak effects (once party identification is taken into account), gender, educational attainment, and Hispanic ethnicity were removed from the final model. One benefit of doing this is that it simplifies the process of going from the predictive model to the local-area predictions.
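One common way to rank predictors in a fitted forest is permutation importance: shuffle one column at a time and measure how much accuracy drops. This is not necessarily the exact importance measure the Forest Model macro reports, but it illustrates the idea on synthetic data in which one "party id"-like column dominates:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
n = 2000
# Columns: hypothetical party id, race, income band, age band.
X = rng.integers(0, 4, size=(n, 4))
# Labels depend strongly on column 0 and weakly on column 1.
y = (2 * X[:, 0] + (X[:, 1] > 1) + rng.integers(0, 2, size=n) > 4).astype(int)

forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X, y)

# Permutation importance: accuracy drop when each column is shuffled;
# a larger drop means a more important predictor.
result = permutation_importance(forest, X, y, n_repeats=10, random_state=0)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking)  # column 0 (the "party id" stand-in) should rank first
```

In the real model, the analogous ranking is the one listed above: party identification first, then race, income, and age.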


Matching the Polling Data Measures to the Local Area Measures

[Image: ZIP Code map of relative candidate preferences]

The more technically challenging part of creating the application is matching the available local-area measures to those used in the USA Today / Gallup Poll. Some of these issues, as they relate to creating the voter preference model, are discussed above, but there are other issues that need to be addressed as well. One that may be somewhat surprising, but particularly challenging, is household income. Like the vast majority of other voter and consumer surveys, the USA Today / Gallup Poll asks each respondent to indicate the range in which their annual household income falls. The issue that arises is that this results in an "observation unit" problem. Specifically, local-area information from the US Census Bureau (and third-party data providers who base their own data on US Census Bureau data) that is related to individuals, such as age, is reported as counts of individuals that fall into a particular age range in a geographic area, while information related to households (such as household income) is reported as the number of households that fall into different income-range groups within the same geographic area. Where this leaves us is that we know the number of individuals who are over 18 years of age in a local area (the relevant population for this application) and the number of households that fall into different income groups in the same area. However, what we need in order to match the polling data with the local-area data is an estimate of the number of individuals age 18 and older who reside in households that fall into a particular income group — for example, the number of individuals age 18 and older in each area who reside in households with an annual income between $50,000 and $74,999.
We were able to develop the needed estimates of the number of individuals age 18 and older who fall into different household income groups, but doing so required us to make use of the US Census Bureau's American Community Survey Public Use Microdata Sample, solve roughly 250,000 quadratic programming problems (one for each local area), and estimate the same number of joint empirical distributions between the number of adults in a household and household income groups using iterative proportional fitting. While this may seem like "extreme analytics," the problem we address is one that will likely need to be addressed by anyone attempting to project predictive models based on individuals down to local geographic areas when household-level predictors are important, and household income is often an important predictor of individual behavior. Consequently, a technical addendum to this blog post is in the works that describes in much greater detail what we did to address this issue.


[Image: Voter preferences within a 7-minute drive-time from the White House]

A second issue, mentioned earlier, that needs to be addressed is matching the self-reported party identification information contained in the polling data with the probability-based measures of party identification available to us. The specific information at our disposal was individual voter-level probabilities of party self-identification; however, to maintain privacy, individuals were only identified to us at the census block group level (i.e., we knew the block group in which an individual voter resided, but not their name or actual address). From this information we could set cut points on the probability that allowed us to classify the voters in an area as Democrats, Republicans, or uncommitted. In addition, we could also determine the number of unregistered voters in an area. With only minor adjustments to the cut-off probabilities suggested by the data provider (adjustments carried out at the state level), the predicted voter preferences aggregated to the state level were very reasonable relative to state-level returns from the 2008 Presidential election and Nate Silver's poll-based composite state-level forecasts of the 2012 election for all but seven states. The seven remaining states required what we feel are extreme changes to the suggested cut-off points in order to come anywhere close to the 2008 election returns and the 2012 state-level forecasts. As a result, for the vast majority of states we are very comfortable with our predictions, but for these seven states our comfort level is much lower.
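A minimal sketch of the cut-point step, with made-up probabilities and cut-offs (the provider's actual values are not public):

```python
import numpy as np

# Hypothetical per-voter probabilities of self-identifying as a
# Democrat, as supplied at the block-group level.
p_dem = np.array([0.92, 0.75, 0.55, 0.48, 0.30, 0.10])

# Assumed state-level cut points, purely illustrative.
DEM_CUT, REP_CUT = 0.65, 0.35

labels = np.where(p_dem >= DEM_CUT, "Democrat",
         np.where(p_dem <= REP_CUT, "Republican", "uncommitted"))
print(labels.tolist())
```

Moving the two cut points per state is the "norming" lever discussed below: shifting them reallocates voters among the three groups until the state-level aggregates line up with known benchmarks.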

Some may argue that this "norming" of the model predictions at the state level is undesirable, since it does not allow us to develop independent estimates of state-level results (and thus a prediction of the Electoral College outcome). However, the purpose of the Presidential Election App is not to predict the winner of the election overall; rather, its purpose is to provide an understanding of the local areas (e.g., counties and ZIP Codes) within a state where the election is most competitive, and norming at the state level makes it more likely that we will successfully accomplish this task. As a side note, norming of this type is a very common practice in marketing analytics.

[Image: The party with the highest level of preference for each US county]

The third issue that needs to be addressed in order to make accurate predictions with the voter preference model is that all the predictor variables need to be considered jointly when making predictions. To do this, the model was estimated using only categorical variables, and probability predictions were then made for each candidate and the undecided category for all possible combinations of the variables (a total of 1,024 combinations), resulting in 4,096 predicted probabilities. Each of the 1,024 possible combinations of the predictor variables represents a possible voter profile. To obtain local-area estimates, the expected number of voters who fall into each profile is estimated for each local area (based on iterative proportional fitting) and multiplied by the estimated probability that members of that profile support a particular candidate (or are undecided). Finally, these values are summed across the different voter profiles for each candidate (and the undecided group), resulting in an estimate of the overall support for each candidate and the total number of undecided voters.
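The aggregation step above can be sketched directly: with 1,024 profiles and four outcome probabilities per profile (the 4,096 predictions), a local area's totals are just a count-weighted sum. The counts and probabilities below are random stand-ins, not model output:

```python
import numpy as np

rng = np.random.default_rng(2)
n_profiles, n_outcomes = 1024, 4  # 1,024 voter profiles x 4 outcomes

# Stand-in for the model's predicted probability that a voter in each
# profile supports candidate A, candidate B, a third party, or is
# undecided (each row sums to 1).
probs = rng.dirichlet(np.ones(n_outcomes), size=n_profiles)

# Stand-in for the estimated number of voters per profile in one local
# area (in the app, from iterative proportional fitting).
counts = rng.integers(0, 50, size=n_profiles).astype(float)

# Local-area support: weight each profile's outcome probabilities by
# its voter count, then sum across profiles.
support = counts @ probs
print(support.round(1))
```

The four resulting numbers necessarily sum to the area's total voter count, which is a useful sanity check on the bookkeeping.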


Summary

The goal of the Presidential Election App is to provide users with estimates of the relative preferences of voters at lower geographic levels than is possible using traditional polling data. Doing this requires the advanced strategic analytical methods and agile business intelligence tools provided by Alteryx and its R-based predictive analytics capabilities, coupled with high-quality data (provided to us by the Roper Center and others). While all predictions (including ours) face high levels of uncertainty, we believe that our efforts will allow users to determine the intensity of this year's presidential election across geographic regions at a level of granularity that has not previously been achieved. Ultimately, however, the goal of this app, like all predictive analytics applications, is to predict behavior rather than influence it. To ensure this, do your part, and make sure to vote on November 6.

Dan Putler
Chief Scientist

Dr. Dan Putler is the Chief Scientist at Alteryx, where he is responsible for developing and implementing the product road map for predictive analytics. He has over 30 years of experience in developing predictive analytics models for companies and organizations that cover a large number of industry verticals, ranging from the performing arts to B2B financial services. He is co-author of the book, “Customer and Business Analytics: Applied Data Mining for Business Decision Making Using R”, which is published by Chapman and Hall/CRC Press. Prior to joining Alteryx, Dan was a professor of marketing and marketing research at the University of British Columbia's Sauder School of Business and Purdue University’s Krannert School of Management.