
DrDan
Alteryx Alumni (Retired)

Early in December, three teams of Alteryx associates and a team of Alteryx ACEs participated in an opioid “code-a-thon” in Washington, DC, sponsored by the Department of Health and Human Services. Our team, which we named “Helping Hands,” was in the “treatment” track, one of three tracks (the other two were “usage” and “prevention”). Our team consisted of Ben Burkholder (a Customer Success Manager on our Customer Success team), CJ Campbell (a developer on our Product Development team), Michael Chadwick (also a developer on our Product Development team), Mala Gosakan (a former technical program manager on our Product Development team), and Dan Putler (the Chief Data Scientist at Alteryx, aka “Dr. Dan”).

During the code-a-thon, our team's objective was to provide a web-based solution (using the Alteryx Gallery API) that would allow national, state, and local decision makers to determine where to locate opioid treatment facilities and related resources (such as stationary outdoor advertising for public health campaigns) so as to maximize the proximity of those locations to individuals who abuse or are dependent upon opioids.

 

Developing the Needed Data

Given the data challenges, we knew our objective was very ambitious for a 24-hour code-a-thon. As a result, and consistent with the rules of the code-a-thon, one team member began gathering the relevant available data and created predictive models from it before arriving at the code-a-thon.

The real challenge with the available data is that, to implement our goal of optimally locating treatment facilities, we needed estimates of the number of individuals dependent upon or abusing opioids at a very fine-grained geographic level (ultimately we worked with census tract level data, but originally hoped to have block group level data). This data does not actually exist, but there is sufficient data to develop predictions of the number of individuals age 18 and older who abuse or are dependent upon opioids in each census tract, using methods similar to the ones used to implement the 2016 Presidential Election App. This approach involves three steps:

  1. Use individual respondent survey data (in this case from the 2016 National Survey on Drug Use and Health, which had over 40,000 respondents age 18 and over) to estimate models that link the probability of abusing or being dependent upon opioids to the socioeconomic and demographic characteristics of each individual age 18 and over.
  2. Develop estimates of the number of individuals age 18 and over who fall into each unique socioeconomic and demographic group in a census tract. This requires census tract summary data from the 2011-2015 American Community Survey Five Year Estimates, data from the 2011-2015 American Community Survey Five Year Public Use Microdata Sample, and a statistical method known as iterative proportional fitting.
  3. Multiply the estimated number of individuals in each socioeconomic and demographic group in a census tract by the probability that an individual with that socioeconomic and demographic profile abuses or is dependent upon opioids, and then sum these values across all of the profiles for a census tract.
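
Steps 2 and 3 can be illustrated with a minimal Python sketch. It is not the workflow we built in Alteryx: it assumes only two profile dimensions, and every number in it (the seed table, the tract totals, and the probabilities) is hypothetical. The `ipf` function is a plain two-dimensional version of iterative proportional fitting, which alternately rescales rows and columns of a seed table (here standing in for microdata counts) until they match the tract summary totals.

```python
def ipf(seed, row_targets, col_targets, max_iter=100, tol=1e-9):
    """Rescale a 2-D seed table until its row and column sums match the targets."""
    table = [row[:] for row in seed]  # work on a copy of the seed
    for _ in range(max_iter):
        # Row step: scale each row so it sums to its target.
        for i, target in enumerate(row_targets):
            total = sum(table[i])
            if total > 0:
                table[i] = [cell * target / total for cell in table[i]]
        # Column step: scale each column so it sums to its target.
        for j, target in enumerate(col_targets):
            total = sum(row[j] for row in table)
            if total > 0:
                for row in table:
                    row[j] *= target / total
        # Stop once the row sums still match after the column step.
        if all(abs(sum(row) - t) < tol for row, t in zip(table, row_targets)):
            break
    return table


# Hypothetical seed: joint counts (age group x employment status) from microdata.
seed = [[40.0, 10.0],
        [30.0, 20.0]]
# Hypothetical tract summary totals for each dimension (both sum to 1,000 adults).
age_totals = [600.0, 400.0]
employment_totals = [700.0, 300.0]

# Step 2: estimated number of adults in each profile for this tract.
fitted = ipf(seed, age_totals, employment_totals)

# Hypothetical step 1 output: probability of abuse or dependence per profile.
prob = [[0.005, 0.020],
        [0.010, 0.030]]

# Step 3: multiply profile counts by profile probabilities and sum over profiles.
expected = sum(fitted[i][j] * prob[i][j] for i in range(2) for j in range(2))
print(round(expected, 2))
```

In practice there are many more dimensions (age and gender, race and ethnicity, employment, marital status, income, education), so the table being fitted is multi-dimensional, but the alternating rescaling idea is the same.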

We found that socioeconomic and demographic factors are good indicators of the probability that an individual age 18 or over abuses or is dependent upon opioids. The important predictors in our model (which was created through the use of Alteryx's Boosted Model tool), in their order of importance, are:

  1. Age and gender group (women and men age 65 and over have the lowest probability, while men age 26 to 34 have the highest)
  2. Race and ethnicity (non-Hispanic whites have the highest probability, while Asians have the lowest probability)
  3. Employment status (those who are not working due to a disability have the highest probability, while those who are in school or a training program full-time have the lowest)
  4. Marital status (those who are married have the lowest probability, while those who are divorced or separated have the highest)
  5. Income (the probability decreases as income increases)
  6. Educational attainment (the probability decreases as educational attainment increases)

For our code-a-thon application, we focused on three states: Indiana, Ohio, and West Virginia. We calculated the expected number of individuals in each of the 4940 census tracts in the three states who abuse or are dependent upon opioids. We selected these three states because they have an opioid-related death rate that is above the national average, and because they are contiguous with one another.

 

Examining the Validity of the Data

A natural question at this point is how valid the estimates from the three-step approach are. The difficulty in assessing this is that we are trying to estimate data that simply isn't available. However, information related to the values of interest is available for a large number of counties: mortality rates due to drug overdoses. Opioid use is one of the leading causes of drug overdose mortality. Given this, we should see a strong, but less than perfect, relationship between a county's drug overdose mortality rate and the estimated percentage of individuals 18 years of age and older in the county who abuse or are dependent on opioids (which can be obtained by summing the expected number of individuals who abuse or are dependent upon opioids across all the census tracts in a county, and then dividing by the county's population 18 years of age and older). There are a number of reasons why this relationship would not be perfect, such as non-opioid-related drug overdoses, the difference between including the entire population versus only those age 18 and older, and differences in the ability to respond to and treat overdose victims with naloxone.
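
The roll-up from census tracts to counties can be sketched in a few lines of Python. This is an illustration only, with hypothetical tract records; it assumes each tract is identified by its 11-digit Census GEOID, whose first five digits are the state and county FIPS codes.

```python
from collections import defaultdict


def county_percentages(tracts):
    """Aggregate (geoid, expected_count, adult_population) tract records by county.

    Returns a dict mapping the 5-digit county FIPS code to the estimated
    percentage of adults who abuse or are dependent on opioids.
    """
    expected = defaultdict(float)
    adults = defaultdict(float)
    for geoid, expected_count, adult_pop in tracts:
        county = geoid[:5]  # state (2 digits) + county (3 digits)
        expected[county] += expected_count
        adults[county] += adult_pop
    return {c: 100.0 * expected[c] / adults[c] for c in adults if adults[c] > 0}


# Hypothetical tract-level estimates: (GEOID, expected count, adults 18 and over).
tracts = [
    ("39049000100", 30.0, 3000.0),
    ("39049000200", 50.0, 5000.0),
    ("54039000100", 60.0, 3000.0),
]

county_pct = county_percentages(tracts)
print(county_pct)
```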

Conveniently, the County Health Rankings and Roadmaps site (a joint project of the Robert Wood Johnson Foundation and the University of Wisconsin Population Health Institute) has compiled this information from the Centers for Disease Control's WONDER database. There are 235 counties in the three states our project focused on, and 176 of them report death rates due to drug overdoses. Figure 1 shows a scatter plot of the estimated percentage of individuals age 18 and over who abuse or are dependent on opioids against the death rates (measured in individuals per 100,000 population), which shows a strong relationship between these two variables. The Pearson correlation between the two variables is 0.65, a value that is highly statistically significant and very strong given the cross-sectional nature of the data being used. Given this, we have strong confidence in our estimates based on the three-step approach.
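
To make the significance claim concrete, here is a small Python sketch of the Pearson correlation and the usual t statistic for testing whether a correlation differs from zero. It assumes all 176 reporting counties enter the calculation; the function names are ours, and the three-point data set is only there to show the function being called.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_t_statistic(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1.0 - r * r))


# Toy data that is perfectly linear, so the correlation is 1.
r_toy = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])

# Values reported in the text: r = 0.65 across 176 counties.
t_stat = correlation_t_statistic(0.65, 176)
print(round(t_stat, 2))
```

A t statistic of roughly 11 on 174 degrees of freedom is far beyond conventional critical values, which is what "highly statistically significant" means here.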

Figure 1

 

Exploring Geographic Patterns in the Expected Percentage of Adults Who Abuse or are Dependent on Opioids

Figure 2 provides a histogram of the expected percentage of adults 18 years of age and older who abuse or are dependent on opioids. An examination of this figure reveals this percentage ranges from close to zero to nearly 3% across the 4940 census tracts. The mean is 0.93%, while the median is 0.88% (the median being smaller than the mean is consistent with the right skew of the distribution that can be seen in the histogram).

Figure 2

Figures 3 to 5 provide choropleth maps for each of the three states. The three maps indicate that more rural areas (larger polygons tend to have lower population densities, since census tracts are designed to have roughly comparable population sizes) and the southern areas of each state (which correspond to more mountainous areas) tend to have a higher percentage of individuals age 18 and over who abuse or are dependent on opioids. In addition, it is fairly easy to see that opioid dependence and abuse are much higher in West Virginia than in either Indiana or Ohio. These findings are consistent with what has been discussed in the media about the opioid epidemic, giving us further confidence that our census tract level estimates of the number of individuals age 18 and over who abuse or are dependent on opioids are appropriate for use in an application to select new treatment facility locations.

Figure 3: Census Tract Estimates of the Percentage of Individuals Age 18 and Over who Abuse or are Dependent Upon Opioids in Indiana

Figure 4: Census Tract Estimates of the Percentage of Individuals Age 18 and Over who Abuse or are Dependent Upon Opioids in Ohio

Figure 5: Census Tract Estimates of the Percentage of Individuals Age 18 and Over who Abuse or are Dependent Upon Opioids in West Virginia

 

The Treatment Facility Location App

To create the application, we relied on an Alteryx feature known as a “Location Optimizer Macro”. In the app, a user is asked how many additional treatment facilities they would like to add to the existing set of treatment facilities, which combination of the three states (all three, any two of them, or only a single state) should be considered for locating the new treatment centers, and how the underlying map data should be displayed (the choices are to show the expected number of individuals 18 years of age and older who abuse or are dependent upon opioids at either the census tract or county level). The user input interface is shown in Figure 6.

Figure 6: The App User Interface

Once the user has entered the needed information, the app searches for the user-provided number of locations that maximize the expected number of individuals age 18 and over who are within a 10-mile radius of the selected sites and who are not currently within a 10-mile radius of any existing site. The 10-mile radius is somewhat arbitrary (it seemed reasonable to us since it would represent a fairly short travel time), and the app can easily be altered to make this a parameter under the user's control. The app produces two outputs: a table (an example is shown in Figure 7) that lists the expected number of individuals served and the percentage of individuals age 18 and over in the served area who abuse or are dependent on opioids, and a map (an example is shown in Figure 8) that shows the locations of the proposed new facilities together with the locations of existing treatment facilities.
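
The objective the app optimizes can be illustrated with a short Python sketch. To be clear, this is not how the Location Optimizer Macro searches: it swaps in a simple greedy heuristic (pick the candidate that adds the most newly covered individuals, then repeat), measures straight-line distance with the haversine formula, and uses hypothetical coordinates and counts throughout.

```python
import math

EARTH_RADIUS_MILES = 3958.8


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def greedy_sites(tracts, existing, candidates, n_new, radius=10.0):
    """Greedily pick up to n_new candidate sites by newly covered expected count.

    tracts is a list of (lat, lon, expected_count) tract centroids.
    Returns a list of (site, newly_covered_count) in the order chosen.
    """
    # Tracts already within the radius of an existing facility do not count.
    uncovered = [t for t in tracts
                 if all(haversine_miles(t[:2], e) > radius for e in existing)]
    remaining = list(candidates)
    chosen = []
    for _ in range(n_new):
        best, best_gain = None, 0.0
        for c in remaining:
            gain = sum(t[2] for t in uncovered
                       if haversine_miles(t[:2], c) <= radius)
            if gain > best_gain:
                best, best_gain = c, gain
        if best is None:  # no candidate covers anyone new
            break
        chosen.append((best, best_gain))
        remaining.remove(best)
        uncovered = [t for t in uncovered
                     if haversine_miles(t[:2], best) > radius]
    return chosen


# Hypothetical tract centroids with expected counts, one existing facility,
# and three candidate sites.
tracts = [(39.00, -82.00, 40.0), (39.50, -82.00, 25.0),
          (39.55, -82.00, 30.0), (40.20, -82.00, 20.0)]
existing = [(39.02, -82.00)]
candidates = [(39.52, -82.00), (40.20, -82.00), (39.00, -82.00)]

chosen = greedy_sites(tracts, existing, candidates, n_new=2)
print(chosen)
```

A greedy search like this is fast but is not guaranteed to find the best combination of sites, which is one reason a more thorough optimizer can take several minutes to run.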

Figure 7: An Example of the Treatment Facility Location App Output Table

 

Figure 8: An Example of the Treatment Facility Location App Output Map

Originally, we had hoped to use the Gallery API and have the app be a true web app. We did not have sufficient time to do this, but we were able to turn our work from the code-a-thon into a Gallery App very shortly after arriving home. Follow this link to access the app on the public Alteryx Analytics Gallery. Given the nature of the optimization algorithm, the optimizer can take up to several minutes to run, depending on the states selected and the number of new facility locations specified.

 

Reflections on our Participation in the HHS Opioid Code-a-thon

Overall, we worked very well together as a team. We got a lot done (and even got a couple of hours of sleep) at the code-a-thon, but we still did not do all we had hoped to do (which turns out to be the norm; in the academic literature it is known as the "Planning Fallacy"). The Gallery app was mostly done by the end of the event, but a web app via the Gallery API turned out to be overly ambitious. In all honesty, if we had not done a fair amount of work getting the data together and creating predictive models before the code-a-thon, we would have found ourselves in a world of trouble. Fortunately, some of us had participated in code-a-thons before, and had learned from that experience that it is very hard to do something substantive in a 24-hour time period.

What was different about this code-a-thon, compared to the others some of us have participated in, was that it dealt with a socially important issue, one where the work could literally be life saving. Most other code-a-thons deal with a “cool” technology platform or language, not saving lives and keeping families together. As a result, most of us left the event with a favorable impression, despite the sleep deprivation, a staff not exactly used to dealing with around 300 code-a-thon participants, and a space not designed for them.

Dan Putler
Chief Scientist

Dr. Dan Putler is the Chief Scientist at Alteryx, where he is responsible for developing and implementing the product road map for predictive analytics. He has over 30 years of experience in developing predictive analytics models for companies and organizations that cover a large number of industry verticals, ranging from the performing arts to B2B financial services. He is co-author of the book, “Customer and Business Analytics: Applied Data Mining for Business Decision Making Using R”, which is published by Chapman and Hall/CRC Press. Prior to joining Alteryx, Dan was a professor of marketing and marketing research at the University of British Columbia's Sauder School of Business and Purdue University’s Krannert School of Management.

