
DrDan
Alteryx Alumni (Retired)

Recently, I was involved in a project that examines opioid dependency and abuse in the US. This was actually the second opioid-related predictive analytics project I worked on over the past several months. The first I undertook as part of a team to demonstrate a real-life application of our Spark Direct capability at Inspire Europe back in September 2017. The application involved examining differences in opioid analgesic prescription rates across the 209 geographically defined Clinical Commissioning Groups (or CCGs) in England. We chose this topic because we wanted to examine whether the drivers of opioid prescription rates in England are similar to the drivers of opioid abuse in the US, where prescription opioids have acted as a “gateway” to heroin and other opioid “street drugs”. One important difference between the analysis presented in this post and the earlier post on locating opioid treatment centers is that the analysis here is done at the population level (geographic area prescription rates are associated with population characteristics for that area), while the analysis in the previous post was done at the individual level. There is a relationship between the individual level and the population level, but that relationship is not as straightforward as it might first appear.

 

For this application, our UK-based colleague Nick Jewell pointed us to the practice-level prescription data from the National Health Service in England, which tracks prescriptions for every general practitioner medical practice (not individual doctor) in England at the compound, size, and form level (e.g., amoxicillin 500mg capsules) on a monthly reporting basis. We decided to use the full 2016 calendar year of this data, which is the most recent available on this basis (full calendar year data for 2017 is scheduled to be released in March of this year), to determine which socioeconomic and demographic factors are related to differences in prescription rates (the total number of prescriptions divided by total population in an area) for opioid analgesics across CCGs. To do this, we also needed CCG-level socioeconomic and demographic information. Fortunately, CCGs are defined by aggregating adjoining Output Areas together (an Output Area is the smallest geographic area for which census data is reported in England and Wales), so standard socioeconomic and demographic data products can be used. Another of our UK colleagues, Fadi Basel, guided us to our European data partner MB-Research, who graciously provided us with the bulk of the needed data, with some additional data (such as the CCG area map polygons and the crosswalk between CCG areas and Output Areas) coming from the Office for National Statistics.

 

Data Preparation and an Assessment of the Variability in Opioid Prescription Rates Across CCGs

 

The data prep for the target variable for this analysis involved identifying all prescriptions for opioid analgesic compounds based on the British National Formulary, and then filtering out all other compounds. The reduced data were then grouped by CCG, and the number of prescriptions summed and then divided by the total population in the CCG to get the prescription rate. Figure 1 provides a choropleth map of the prescription rate across the 209 CCGs.
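The rate calculation described above can be sketched in Python with pandas. This is only an illustration, not the Designer workflow itself: the column names (`ccg`, `compound`, `items`), the tiny data set, and the three-compound opioid list are all hypothetical stand-ins for the NHS practice-level data and the British National Formulary classification.

```python
import pandas as pd

# Hypothetical practice-level rows; names and values are illustrative only.
prescriptions = pd.DataFrame({
    "ccg": ["A", "A", "A", "B", "B"],
    "compound": ["Codeine", "Amoxicillin", "Tramadol", "Morphine", "Amoxicillin"],
    "items": [120, 300, 80, 50, 400],
})

# Hypothetical opioid list (the post classifies compounds using the BNF).
opioid_compounds = {"Codeine", "Tramadol", "Morphine"}

# Hypothetical total population per CCG.
population = pd.Series({"A": 1000, "B": 500}, name="population")


def prescription_rate(prescriptions, opioid_compounds, population):
    """Opioid prescriptions per person for each CCG."""
    # Keep only opioid analgesic rows, i.e. filter out all other compounds.
    opioids = prescriptions[prescriptions["compound"].isin(opioid_compounds)]
    # Sum prescriptions within each CCG, then divide by that CCG's population.
    totals = opioids.groupby("ccg")["items"].sum()
    return (totals / population).rename("rate")


rates = prescription_rate(prescriptions, opioid_compounds, population)
print(rates)
```

In this toy data, CCG A has 200 opioid prescriptions for 1,000 people and CCG B has 50 for 500 people. A CCG with no opioid rows at all would come out as missing rather than zero, which a real workflow would need to handle.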

 

Figure 1: Opioid Prescription Rates for CCGs in England

The figure indicates a high degree of variability in prescription rates. Across all CCGs, the opioid prescription rate varies from 0.14 prescriptions per person (or one opioid prescription for every 7.25 people) to 1.10 prescriptions per person (or one opioid prescription for every 0.91 people). In other words, the opioid prescription rate is nearly 8 times higher in the CCG with the highest rate than in the CCG with the lowest. In addition, the figure reveals strong spatial patterns, with the lowest prescription rates occurring in CCGs near London, extremely high prescription rates in the CCGs located in the Northeast of England, and high prescription rates in the East Midlands, Devon and Cornwall, and the Cumbria and the Lakes regions of the country.

 

Our goal in this analysis is to examine the extent to which socioeconomic and demographic characteristics of the population in each CCG can explain the variation in opioid prescription rates across CCGs. As indicated above, CCGs are not a standard census geography in England, but data reported at the Output Area level can easily be aggregated up to the CCG level. The raw data is in counts of individuals or households, but given the nature of our analysis, we need to transform them into percentages of the total relevant population (people or households). The Output Area level socioeconomic and demographic data available to us includes total population, total households, MB-Research Purchasing Power Index (modeled disposable income per capita which is scaled so that the country average is 100), the number of individuals in each of 28 different age/gender groups, the number of individuals in the work age population (16 to 64 years of age) that are not employed, and the number of households that fall into different household categories, which was narrowed down to the following four groups:

 

  • Married couples with children present
  • Other households with children present (typically single parent or unmarried couple)
  • Multi-person (often multiple generation) households with children present
  • Childless households (either single person or married)

All the socioeconomic and demographic variables (except the purchasing power index, which was already given on the appropriate basis) were divided by the relevant population or household total to put all values on a percentage basis.
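A minimal sketch of the aggregation and percentage conversion, again in Python with pandas and with hypothetical column names and values. One assumption to flag: the not-employed count is divided here by total population, while the text above only says "the relevant population," so the actual denominator may have been the working-age population instead.

```python
import pandas as pd

# Hypothetical Output Area counts with a CCG crosswalk column.
output_areas = pd.DataFrame({
    "ccg": ["A", "A", "B"],
    "population": [300, 200, 400],
    "households": [120, 80, 150],
    "working_age_not_employed": [30, 20, 60],
    "childless_households": [60, 40, 45],
})


def ccg_percentages(output_areas):
    """Aggregate Output Area counts to CCGs, then convert to percentages."""
    # Sum the raw counts first; averaging Output Area percentages instead
    # would weight small and large areas equally.
    ccg = output_areas.groupby("ccg").sum()
    out = pd.DataFrame(index=ccg.index)
    # Person-level counts use a person denominator (assumed: total population).
    out["pct_not_employed"] = (
        100 * ccg["working_age_not_employed"] / ccg["population"]
    )
    # Household-level counts use the household total as the denominator.
    out["pct_childless_households"] = (
        100 * ccg["childless_households"] / ccg["households"]
    )
    return out


ccg_pct = ccg_percentages(output_areas)
print(ccg_pct)
```

The point of the ordering is that counts are summed to the CCG level before dividing, so each percentage reflects the CCG as a whole.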

 

In addition to these potential predictors, we calculated one more variable of possible interest, the population density of each CCG, to examine possible differences in prescription rates between rural and urban areas. This is calculated by dividing the total population of a CCG by the land area of that CCG. Because it is already expressed per unit of land area, it is comparable across CCGs and needs no further normalization for this analysis.

 

Using Principal Components to Reduce the Large Number of Age/Gender Groups

 

One remaining issue is the large number of age/gender groups. Rather than include all 28 age/gender groups in the model (which are highly correlated with one another), we made use of a common method to address this type of problem. Specifically, we ran a principal components analysis on the 28 age/gender groups using Designer’s Principal Components tool. The principal components analysis revealed that a very small number of principal components captures the lion’s share of the variability in the 28 age/gender groups across CCGs. The first principal component captures nearly 62% of the variance in the age/gender group data, the second captures over 21% of the variance, and the third captures nearly 12%. Collectively, the first three principal components capture 95% of the variance in the age/gender groups, while the first two capture 83%.
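To make the "share of variance" figures concrete, here is a small sketch of how they can be computed, using NumPy's singular value decomposition on synthetic data. It is not the Principal Components tool itself; the data are random, and standardizing each column before the decomposition is an assumption rather than something stated above.

```python
import numpy as np


def pca_explained_variance(X, standardize=True):
    """Return each component's share of total variance, plus the loadings."""
    X = np.asarray(X, dtype=float)
    Z = X - X.mean(axis=0)                # center every column
    if standardize:
        Z = Z / Z.std(axis=0, ddof=1)     # assumed: unit-variance scaling
    # Rows of vt are the component loadings; s holds the singular values.
    _, s, vt = np.linalg.svd(Z, full_matrices=False)
    variances = s ** 2 / (Z.shape[0] - 1)
    return variances / variances.sum(), vt


# Synthetic stand-in: 209 areas, 28 correlated columns built from two
# latent factors plus a little noise (not the real age/gender data).
rng = np.random.default_rng(0)
latent = rng.normal(size=(209, 2))
weights = rng.normal(size=(2, 28))
X = latent @ weights + 0.1 * rng.normal(size=(209, 28))

ratio, loadings = pca_explained_variance(X)
cumulative = np.cumsum(ratio)
print(ratio[:3], cumulative[:3])
```

The cumulative figures are simply running sums of the individual shares, which is why roughly 62%, 21%, and 12% add up to the 83% and 95% totals quoted above.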

 

The loadings on the first two principal components make them readily interpretable. In the case of the first principal component, the loadings are negative for younger age/gender groups, and positive for older age groups. As a result, the population is older in a CCG that has a large positive value for the first principal component, and younger in a CCG that has a large (in absolute value terms) negative value for this component. Given this, we can rename this component as “younger vs older.”

 

The second principal component only has large loadings, in absolute value terms, for age/gender groups that are between 0 and 17 years of age and age/gender groups that are between 18 and 24 years of age. A CCG with a high percentage of its population between the ages of 0 and 17 years, and another CCG with a high percentage of its population between the ages of 18 and 24 will both be relatively young compared to other CCGs, but the nature of these two CCGs will likely be different. A large negative value on the second principal component indicates that there is a high percentage of individuals between the ages of 0 and 17 years in the CCG, while a large positive value indicates that there is a high percentage of individuals between the ages of 18 and 24 years. Based on this, we can rename the second component as “kids vs young adults”.

 

Modeling Opioid Prescription Rates

 

Three different modeling methods were applied to the data: linear regression (via the Linear Regression tool), random forest (via the Forest Model tool), and gradient boosting (via the Boosted Model tool). In the case of the gradient boosted models, we allowed for different levels of interaction between the predictors. Model selection was done using the Cross-Validation tool that is available through the Predictive District of the Alteryx Analytics Gallery. Ultimately, the “younger vs older” principal component of the age/gender groups was removed, since the cross-validation results showed that it did not improve out-of-sample prediction. Both the average root mean squared error and the average mean absolute percentage error across the cross-validation replicates pointed to a gradient boosted model using three-way interactions as the best of the models examined for this data. The in-sample correlation between fitted and actual values for this model is 0.94 (the out-of-sample cross-validation average is 0.86), suggesting that the model fits the data well.
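The comparison described above can be approximated outside Designer. The sketch below uses scikit-learn as a stand-in for the Linear Regression, Forest Model, and Boosted Model tools, with synthetic data in place of the CCG data. The fold count, the random forest size, and the use of `max_depth=3` as a rough analogue of three-way interactions are all assumptions made for illustration.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold


def compare_models(models, X, y, n_splits=5, seed=0):
    """Average out-of-sample RMSE and MAPE over k folds for each model."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for name, model in models.items():
        rmse, mape = [], []
        for train, test in folds.split(X):
            # Fit a fresh copy on the training fold, score on the held-out fold.
            fitted = clone(model).fit(X[train], y[train])
            err = y[test] - fitted.predict(X[test])
            rmse.append(np.sqrt(np.mean(err ** 2)))
            mape.append(100 * np.mean(np.abs(err / y[test])))
        results[name] = {
            "rmse": float(np.mean(rmse)),
            "mape": float(np.mean(mape)),
        }
    return results


# Synthetic stand-in data: 209 rows, 8 predictors, strictly positive target.
rng = np.random.default_rng(0)
X = rng.uniform(size=(209, 8))
y = (0.3 + 0.5 * X[:, 0] * X[:, 1] + 0.2 * X[:, 2]
     + rng.normal(scale=0.02, size=209))

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=200, random_state=0),
    # Depth-3 trees can combine up to three predictors in one tree.
    "boosted_depth_3": GradientBoostingRegressor(max_depth=3, random_state=0),
}
results = compare_models(models, X, y)
best = min(results, key=lambda name: results[name]["rmse"])
print(results, best)
```

Selecting on held-out error rather than in-sample fit is what guards against favoring the most flexible model simply because it can memorize the training data.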

 

Based on the model fit statistics, socioeconomic and demographic characteristics of the population within a CCG are capable of explaining much of the difference in opioid prescription rates across CCGs. The variables and their effects, in order of importance, are:

 

  1. Purchasing Power Index: As purchasing power (disposable income) increases in a CCG, the opioid prescription rate decreases.
  2. Childless Households: As the percentage of childless households in a CCG increases, the opioid prescription rate increases.
  3. Married Couple with Children Households: As the percentage of married couple households with children present in a CCG increases, the opioid prescription rate decreases.
  4. Population Density: As the population density of a CCG increases (thus becoming more urban), the opioid prescription rate decreases.
  5. Multi-Person Households with Children: As the percentage of multi-person households with children in a CCG increases, the opioid prescription rate decreases.
  6. Working Age Individuals not Working: As the percentage of working age individuals not working in a CCG increases, the opioid prescription rate increases.
  7. Other Households with Children: As the percentage of households headed by a single parent or an unmarried couple with children present increases, the opioid prescription rate decreases.
  8. Kids vs Young Adults: As a CCG becomes more oriented toward young adults as opposed to children, the opioid prescription rate increases.

The relative importance plot for the final model is shown in Figure 2. One important thing to notice in this plot is that the Purchasing Power Index is very important. The other thing to notice is that the percentage of households with or without children appears to be critical in explaining opioid prescription rates in a CCG.
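For readers who want to produce a similar ranking themselves, here is a hedged sketch using scikit-learn's `feature_importances_` on synthetic data. This is an impurity-based measure from a different library, so it should be read as an analogue of the Boosted Model tool's relative importance rather than the same calculation; the three predictor names and the simulated effects are hypothetical.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def rank_importance(model, names):
    """Return (name, percent of total importance) pairs, largest first."""
    raw = model.feature_importances_
    pct = 100 * raw / raw.sum()
    order = np.argsort(pct)[::-1]
    return [(names[i], float(pct[i])) for i in order]


# Synthetic data in which the first predictor has by far the largest effect
# and the third has none (illustrative only, not the CCG data).
rng = np.random.default_rng(1)
names = ["purchasing_power", "pct_childless", "density"]
X = rng.uniform(size=(209, 3))
y = 1.0 - 0.6 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(scale=0.02, size=209)

model = GradientBoostingRegressor(max_depth=3, random_state=0).fit(X, y)
ranking = rank_importance(model, names)
for name, pct in ranking:
    print(f"{name}: {pct:.1f}%")
```

An importance ranking says how much a variable contributes to the fit, not the direction of its effect. The directions listed above have to come from inspecting the fitted relationships themselves.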

 

Figure 2: The Variable Importance Plot for Prescription Rate Model

Assessing the Results

 

The models indicate that CCGs whose population is more affluent, has higher adult employment levels, has a high percentage of households with children present (particularly when those households are headed by married couples), and lives in a more urban area will have the lowest opioid prescription rates. In contrast, CCGs that are poorer, have lower levels of adult employment, have a high percentage of childless households, and are more rural will have the highest opioid prescription rates. The differences between high and low opioid prescription rates across CCGs can be boiled down to two more basic factors: economic opportunity and household structure, particularly as it relates to the presence of children.

 

Although they differ in the nature of the target variable (individual abuse or dependence on opioids for adults versus opioid prescription rates for different geographic areas), the available data (data on the racial and ethnic makeup of the CCGs was not available), and the country of analysis (the US versus England), both this analysis and the analysis in my previous post find that economic opportunity and household structure are two critical elements related to the potential abuse of opioids. Moreover, the present analysis suggests that the factors at play in the US opioid epidemic appear to be at play in England as well.

Dan Putler
Chief Scientist

Dr. Dan Putler is the Chief Scientist at Alteryx, where he is responsible for developing and implementing the product road map for predictive analytics. He has over 30 years of experience in developing predictive analytics models for companies and organizations that cover a large number of industry verticals, ranging from the performing arts to B2B financial services. He is co-author of the book, “Customer and Business Analytics: Applied Data Mining for Business Decision Making Using R”, which is published by Chapman and Hall/CRC Press. Prior to joining Alteryx, Dan was a professor of marketing and marketing research at the University of British Columbia's Sauder School of Business and Purdue University’s Krannert School of Management.