
As you will recall from previous blogs, one of the unique things about our 2016 Presidential Electoral App is its use of Census Tract level data. The Census Tract level estimates of the percentage of the popular vote expected to go to each of the two major party candidates, as well as to the third-party candidates, involve two major components. The first component, based on a candidate choice model, is the expected probability that a registered voter with a specific demographic and socioeconomic profile (e.g., a 40 to 44 year old Asian woman with a Bachelor's degree), residing in a particular county (e.g., Cook County, Illinois), will vote for one of the following candidates: Hillary Clinton for the Democratic Party, Donald Trump for the Republican Party, Gary Johnson of the Libertarian Party, Jill Stein of the Green Party, or Evan McMullin running as an independent.

 

The second component involves estimating the number of U.S. citizens age 18 and over in each Census Tract whose reported population characteristics fall into each demographic and socioeconomic profile.1 The estimates of the number of U.S. citizens of voting age in each profile are then adjusted to match the number of registered voters in a Census Tract. The number of votes for each candidate in a Census Tract is calculated by multiplying the probability that an individual with a specific profile will select a particular candidate by the expected number of individuals who meet that profile in the Census Tract, and then summing these values across all the demographic and socioeconomic profiles.
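To make that arithmetic concrete, here is a minimal sketch of the tract-level calculation just described; the variable names are hypothetical and not taken from the actual app workflow.

```python
# Hypothetical sketch of the tract-level vote calculation described above.
# probs[candidate][profile]: modeled probability that a registered voter with a
#                            given profile (in this tract's county) votes for
#                            that candidate
# counts[profile]:           estimated number of registered voters in the tract
#                            who fit that profile
def tract_votes(probs, counts):
    """Return the expected number of votes for each candidate in one tract."""
    return {
        candidate: sum(p_by_profile[profile] * counts[profile]
                       for profile in counts)
        for candidate, p_by_profile in probs.items()
    }
```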

 

Given this overview of the process, what follows is a more detailed description of these two components, along with the methods and data used to implement them.

 

The Voter Choice Component

The base data for the voter choice model is SurveyMonkey's syndicated polling data. We received two different individual polling datasets from SurveyMonkey. The first covered the period from August 28 to September 6, while the second covered the period from October 7 to October 17 (the period between the second and third presidential debates). Over 32,000 individual responses were available in the first dataset, and over 28,000 responses were available in the second. The first dataset was used to create the voter choice model for the launch of this year's Presidential Election App, but we recently re-estimated the voter choice model with the more recent second dataset. The target population for SurveyMonkey's data was registered voters, as opposed to likely voters; thus, our estimates are based on the registered voter population.2

 

The SurveyMonkey data provided:

 

  • The respondent's preferred candidate between the two major party candidates using what is known as a "head-to-head" format
  • The respondent's preferred candidate across the two major party candidates and the more prominent third-party and independent candidates
  • The major political party the respondent identified with, or, if they did not identify with a party, the major party to which they leaned
  • The respondent's approval rating of President Obama
  • The social/policy/political issue of greatest concern to the respondent
  • The demographic and socioeconomic characteristics of the respondent
  • The respondent's county and ZIP Code of residence

The target variable used in the voter choice model was the respondent's preferred candidate among both the major party and prominent third-party candidates.

 

In selecting the predictor variables used in the voter choice models, preference was given to observable, objective measures over subjective, personal opinion measures, since the more subjective measures are difficult to link to available measures at the Census Tract level. Moreover, several of the subjective measures, specifically those related to party identification/leaning, suffered from potential reverse causality problems (it is not clear whether supporting Hillary Clinton causes one to identify with the Democratic Party, or whether identifying with the Democratic Party causes one to support Hillary Clinton).

 

One feature of the SurveyMonkey polling data that made it unusual compared to other polling data was the provision of information on the ZIP Code and county of a respondent's residence. Even though the number of SurveyMonkey poll respondents is much larger than in most polls (around 30,000 in each of the two datasets we were provided, compared to slightly over a thousand for traditional political polls), it was still too small to incorporate either county (there are over 3,000 counties and county equivalents in the 50 states) or ZIP Code (there are over 40,000 ZIP Codes in the US) of residence indicators as predictors directly. Instead, relevant county level characteristics were used as proxy measures. The county level was selected since this was the lowest level of geography consistently available for a number of measures of interest. The specific county level variables used were:

 

  • A modified Partisan Voting Index (PVI) for the county, which measured the partisan orientation of the area in which the respondent resided. As explained in one of the blog posts I wrote leading up to the Presidential Election App, the modification to the PVI was to give a 75% weight to the most recent presidential election and a 25% weight to the one prior to it, rather than weighting them equally (a small sketch of this calculation follows this list).
  • The percentage of a county's population that belonged to an Evangelical congregation and the percentage of the county's population that belonged to a Church of Jesus Christ of Latter-day Saints (LDS) congregation. These two religious groups have shown a strong Republican leaning in prior presidential elections. However, research conducted during this campaign season indicates that LDS members have concerns about Donald Trump, while Evangelicals exhibit less concern. As a result, these measures were included to capture possible county level differences in behavior in this election compared to previous elections.
  • A county level measure of Democratic Party orientation provided to us by TargetSmart Communications.
  • The population density (the number of individuals per square mile) of the respondent's county of residence, which is an indicator of the rural versus urban nature of the county.
  • A set of state indicators (e.g., Alabama, Alaska, etc.) to capture possible other regional effects.
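
For readers who want the mechanics, below is a minimal sketch of the modified PVI calculation, assuming the standard Cook PVI convention of comparing a county's Democratic share of the two-party presidential vote with the national share. The exact scaling and sign convention used in the app are not spelled out in this post, so the function and its argument names should be treated as illustrative.

```python
# Illustrative (assumed) formula for the modified Partisan Voting Index: the
# county's Democratic share of the two-party presidential vote is compared with
# the national share, weighting the most recent election at 75% and the one
# before it at 25%, rather than weighting them equally.
def modified_pvi(county_dem_recent, county_dem_prior,
                 national_dem_recent, national_dem_prior, w_recent=0.75):
    """Return percentage points above (positive, Democratic lean) or below
    (negative, Republican lean) the national Democratic two-party share."""
    county = w_recent * county_dem_recent + (1 - w_recent) * county_dem_prior
    national = w_recent * national_dem_recent + (1 - w_recent) * national_dem_prior
    return 100.0 * (county - national)
```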

 

The respondent level demographic and socioeconomic variables provided in the SurveyMonkey data were:

  • Race
    • Asian
    • Black or African American
    • Hispanic or Latino
    • Other
    • White
  • Age, which is given in years, but was transformed, for implementation purposes, into the following groups (a small binning sketch follows this list)
    • 18 to 24
    • 25 to 29
    • 30 to 34
    • 35 to 39
    • 40 to 44
    • 45 to 49
    • 50 to 54
    • 55 to 59
    • 60 to 64
    • 65 to 69
    • 70 to 74
    • 75 to 79
    • 80 and above
  • Gender
    • Female
    • Male
  • Educational attainment
    • Less than high school
    • High school graduate or G.E.D.
    • Some college
    • Associate's degree
    • Bachelor's degree
    • Post graduate or professional degree
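
As a purely illustrative aside (the actual transformation was done inside the Alteryx workflow, so this pandas version is only a sketch), the age grouping above could be implemented as a simple binning step:

```python
# Assumed sketch of mapping age in years to the 13 groups listed above.
import pandas as pd

AGE_BREAKS = [18, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75, 80, 200]
AGE_LABELS = ["18 to 24", "25 to 29", "30 to 34", "35 to 39", "40 to 44",
              "45 to 49", "50 to 54", "55 to 59", "60 to 64", "65 to 69",
              "70 to 74", "75 to 79", "80 and above"]

def age_group(age_in_years):
    """Return the age group label used in the voter choice model."""
    return pd.cut([age_in_years], bins=AGE_BREAKS, labels=AGE_LABELS,
                  right=False)[0]
```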

 

To create the voter choice model, a training/test design was used in which 67% of the records were placed in the training (estimation) set and 33% of the records were placed in the test (validation) set. Four different modeling algorithms were examined: decision trees (recursive partitioning using the Alteryx Decision Tree tool); random forests (using the Alteryx Forest Model tool); gradient boosting (using the Alteryx Boosted Model tool); and neural networks (using the Alteryx Neural Network tool). A systematic search over a number of hyper-parameters was conducted. In the end, a Boosted Model with two-way interactions had the highest predictive efficacy in the test (validation) set, as shown in Table 1, and was selected as the modeling approach; it was then re-estimated using all of the data from each of the two datasets provided by SurveyMonkey.
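
As a rough analogue of this design (the actual models were built with the Alteryx predictive tools, so the library, file name, and column names below are assumptions made purely for illustration), the train/test split and the gradient boosting fit might look like:

```python
# Illustrative scikit-learn analogue of the 67/33 train/test design and the
# gradient boosting model that was ultimately selected; not the actual workflow.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

survey = pd.read_csv("surveymonkey_poll.csv")             # hypothetical file
y = survey["preferred_candidate"]                          # target variable
X = pd.get_dummies(survey.drop(columns=["preferred_candidate"]))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=1)

# max_depth=2 roughly corresponds to allowing two-way interactions
model = GradientBoostingClassifier(max_depth=2).fit(X_train, y_train)
print("validation accuracy:", accuracy_score(y_test, model.predict(X_test)))
```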

 

Table 1. The Validation Set Accuracy of the Different Models by Candidate

 

Model             Overall   Clinton   Trump     Johnson   Stein
Boosted Model     0.5855    0.7334    0.6148    0.0087    0.0000
Forest Model      0.5746    0.7151    0.6095    0.0062    0.0000
Neural Network    0.5653    0.7076    0.5913    0.0187    0.0000
Decision Tree     0.5735    0.7444    0.5725    0.0000    0.0000
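
The per-candidate columns in Table 1 read as class-level hit rates. As one plausible (assumed) reading, the figure for each candidate is the share of that candidate's actual supporters in the validation set that the model predicted correctly, alongside the overall accuracy:

```python
# One plausible reading of Table 1: overall accuracy plus, for each candidate,
# the fraction of that candidate's actual supporters predicted correctly.
import pandas as pd

def accuracy_by_candidate(actual, predicted):
    df = pd.DataFrame({"actual": actual, "predicted": predicted})
    df["hit"] = df["actual"] == df["predicted"]
    return df["hit"].mean(), df.groupby("actual")["hit"].mean()
```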

 

Figure 1 provides the variable importance weight plot of the predictor variables used for the final Boosted Model leveraging the most recent dataset provided by SurveyMonkey. The plot revealed that the PVI of the county in which the respondent resides was the most important predictor variable, accounting for nearly 40% of the relative importance. The effect of this variable is largely consistent with expectations (Trump support increases as the PVI of the respondent's county of residence becomes more Republican leaning, while support for Clinton increases as it becomes more Democratic leaning), with the minor exception that as the PVI becomes extremely Democratic leaning, there is a small decline in support for Hillary Clinton that benefits Jill Stein of the Green Party.
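
Continuing the illustrative scikit-learn sketch above (the actual plot in Figure 1 came from the Alteryx Boosted Model tool), relative importance weights could be pulled out of the fitted model as follows:

```python
# Illustrative only: relative importance weights from the sketch model above,
# similar in spirit to the plot shown in Figure 1.
import pandas as pd

importance = (pd.Series(model.feature_importances_, index=X_train.columns)
                .sort_values(ascending=False))
print(importance.head(10))   # the top predictors by relative importance
```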

 

Figure 1. The Variable Importance Plot of the Boosted Model

 


 

The next set of predictors in terms of relative importance (each accounting for around 20% of the relative importance) were race and educational attainment. In terms of race, white respondents skewed towards Trump (followed closely by the "other" race group), African Americans skewed strongly towards Clinton, and Asians and Latinos skewed towards Clinton, but not as strongly as African Americans. With educational attainment, respondents with less than a Bachelor's degree skewed towards Trump (particularly high school graduates and those with an Associate's degree), while those with a Bachelor's degree and above skewed towards Clinton, with the skew being more pronounced for those with a post-graduate or professional degree.

 

The final two predictors with comparatively high levels of relative importance (each accounting for roughly 10% of the relative importance) were age and gender. The main effect of age was that older registered voters were more inclined to skew towards Trump, while younger voters were more likely to skew towards third-party candidates; the effect of age on the level of support given to Clinton was fairly modest. With respect to gender, men were more likely to support Trump, while women were more likely to support Clinton.

 

The variables with the least predictive power (but still helpful in improving predictive efficacy, based on excluding these variables in the training set and then comparing the resulting models in the test set) were the percentage of the population in the respondent's county of residence that were members of an LDS congregation, the population density of the respondent's county of residence, the percentage of the population in the respondent's county of residence that were members of an Evangelical congregation, the county level measure of Democratic Party orientation, the PVI trend measures, and a set of 18 state indicator variables. The states (a list which includes the District of Columbia) were converted into a set of 51 zero-one indicator variables, and 33 of these variables were never included in a decision tree in the final ensemble.

 

Estimating the Number of Registered Voters in Each Census Tract

Ideally, what we would have available is a contingency table based on five different demographic and socioeconomic variables (the voter choice model used four demographic and socioeconomic variables; the fifth is citizenship, since the relevant population consists of voting age citizens). However, this table would consist of 1,560 cells for each Census Tract (5 race groups × 13 age groups × 2 genders × 6 educational attainment levels × 2 citizenship categories), which was not available. Instead, we estimated the number of individuals in each cell based on the data that was available for each Census Tract (the marginal distributions of the demographic and socioeconomic variables) and a full (prior) contingency table constructed from a sample of individuals 18 years of age and older obtained from the Public Use Microdata Sample (PUMS) of the 2010 to 2014 American Community Survey (the most recent sample of this data available at this time that contains all the needed fields). The purpose of the full contingency table constructed using individual records was to provide an indication of the relationships between the different demographic and socioeconomic variables, or, technically, the joint distribution of these variables.

 

We used the available data on the marginal distribution of each socioeconomic and demographic variable for each Census Tract, taken from both the Experian CAPE data and the American Community Survey summary files, and conditioned on the PUMS based prior table in order to obtain an estimate of the joint distribution (contingency table) of the demographic and socioeconomic variables in each Census Tract, using a method known as iterative proportional fitting.3 The expected number of individuals in each profile was then adjusted to match the number of registered voters in the Census Tract using data provided to us by TargetSmart Communications.
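
For the curious, here is a minimal, self-contained sketch of iterative proportional fitting (the production work was done in Alteryx; the function signature and array names here are assumptions). The PUMS-based prior joint table is rescaled one dimension at a time until its margins match the tract-level marginal distributions:

```python
# Minimal sketch of iterative proportional fitting (raking) against a prior table.
import numpy as np

def ipf(prior, marginals, max_iter=1000, tol=1e-8):
    """Rescale an n-dimensional prior table until its one-dimensional margins
    match the target marginals (one target array per dimension)."""
    table = prior.astype(float).copy()
    for _ in range(max_iter):
        max_adjustment = 0.0
        for axis, target in enumerate(marginals):
            other_axes = tuple(i for i in range(table.ndim) if i != axis)
            current = table.sum(axis=other_axes)
            # scale factor for each category; leave empty categories untouched
            factor = np.divide(target, current, out=np.ones_like(current),
                               where=current > 0)
            shape = [1] * table.ndim
            shape[axis] = -1
            table *= factor.reshape(shape)
            max_adjustment = max(max_adjustment, float(np.max(np.abs(factor - 1.0))))
        if max_adjustment < tol:
            break
    return table
```

In this setting, the prior would be the PUMS-based joint table over race, age group, gender, educational attainment, and citizenship, and the marginals would be the corresponding tract-level totals.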

 

At this point, the probability model was combined with the estimates of the number of registered voters within each specific demographic and socioeconomic profile to create the final vote estimates for each candidate in a Census Tract, based on the registered voter target population, and these estimates were converted to percentage terms. County level percentages were obtained by summing the votes for each candidate across the Census Tracts in the county and then converting those totals to percentages.
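
A small illustrative sketch of that roll-up (the column names are hypothetical):

```python
# Illustrative tract-to-county roll-up: sum expected votes by candidate within
# each county, then express them as percentages of the county total.
import pandas as pd

def county_percentages(tract_votes):
    """tract_votes: DataFrame with columns ['county_fips', 'candidate', 'votes']."""
    totals = tract_votes.groupby(["county_fips", "candidate"])["votes"].sum()
    return 100.0 * totals / totals.groupby(level="county_fips").transform("sum")
```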

 

1Roughly 2,000 out of 74,000 Census Tracts in the 50 states and the District of Columbia do not have reported population characteristics, due to their small (typically zero) population.

2A recent blog post on the FiveThirtyEight site contains an analysis which suggests that the difference between registered and likely voter polling results appears to be minimal in this election.

3Where possible, the 2016 Q1 estimates from the Experian CAPE data were used for each Census Tract. However, the educational attainment data from this source is based on the population 25 years of age and older, not 18 and older. As a result, the marginal distribution for educational attainment in each Census Tract was taken from the 2010 to 2014 American Community Survey summary files and adjusted to match the relevant population total.

Dan Putler
Chief Scientist

Dr. Dan Putler is the Chief Scientist at Alteryx, where he is responsible for developing and implementing the product road map for predictive analytics. He has over 30 years of experience in developing predictive analytics models for companies and organizations that cover a large number of industry verticals, ranging from the performing arts to B2B financial services. He is co-author of the book, “Customer and Business Analytics: Applied Data Mining for Business Decision Making Using R”, which is published by Chapman and Hall/CRC Press. Prior to joining Alteryx, Dan was a professor of marketing and marketing research at the University of British Columbia's Sauder School of Business and Purdue University’s Krannert School of Management.
