
DrDan
Alteryx Alumni (Retired)

Summertime, and the livin’ is easy. Fish are jumpin’ and the cotton is high.

The lines above are the opening lyrics of the aria “Summertime,” composed by George Gershwin with lyrics by DuBose Heyward, from the opera Porgy and Bess. While it is the waning days of summer, and hopefully the living is easy, a declining percentage of Americans are taking up a rod and reel to take advantage of the jumping fish. To illustrate this point, Figure 1 shows that the number of paid fishing licenses issued across the 50 states per 100 people has generally been in a slow, steady decline over the period from 2000 to 2013. The only exceptions are the presidential election years of 2004 and 2012 (what triggers this apparent increase in fishing interest in presidential election years is not known, but it is interesting to speculate about).

Figure 1. Fishing Licenses per 100 Population in the 50 US States

 

We first stumbled across this trend during a group brainstorming session at Alteryx about an issue a customer was trying to address. The idea was that the number of fishing licenses issued each year might be a relevant predictor variable in an application the customer was trying to implement. We were able to easily obtain annual fishing license data for each US state from the U.S. Fish and Wildlife Service. Ultimately, this data wasn’t used in the customer’s project, but we found the trends in the data interesting. We thought it would make a good example case, both to showcase some of the early work on interactive visualization in our predictive analytics tools being done by our new colleague Ramnath Vaidyanathan, and to highlight Alteryx’s R-based predictive analytics capabilities in general. Ramnath and I were very ably assisted in this project by our summer intern Joe Lombardi.

 

While Figure 1 illustrates the general decline in fishing as a leisure activity, it does not explain why this is occurring. One way to uncover the underlying reasons is to use time series regression methods, such as an ARIMA model with covariates, which can be estimated using Alteryx’s ARIMA tool starting with the upcoming 9.1 release. However, the fishing license time series exhibits a gradual decline, a pattern that many other time series variables share, making it essentially impossible to separate causality from mere correlation in this case. This situation is very common, so assessing causation from time series data alone is often very difficult.

 

What is the alternative? Cross sectional data (data collected in the same time period across members of the population of interest) is often a better choice for determining the important underlying drivers of a time series pattern. Fortunately, for this project we were able to take advantage of cross sectional data from the National Survey of Fishing, Hunting, and Wildlife-Associated Recreation, a sample survey done every five years, with each survey year having between 66,000 and 145,000 respondents. The survey is conducted by the U.S. Census Bureau for the U.S. Fish and Wildlife Service. The most recent survey was done in 2011, and the other survey years we make use of in this analysis are 2001 and 2006. In what follows, we present an analysis of the factors that influence the probability that an individual has gone fishing at least once in the prior year (the 2011 survey asked respondents if they had gone fishing at least once in 2010). We then relate two of the relevant drivers, age and urban status, both of which have seen important changes in the overall U.S. population over recent years, back to the observed trends in fishing licenses per 100 population using an ARIMA model with covariates.

 

The preview of our findings is that most of the slow decline in fishing as a recreational activity can be attributed to several underlying demographic trends in the US population, specifically the aging of the population (likely counter to many people’s expectations, the probability of having fished in the previous year does not increase in a person’s late middle age and early retirement years) and the increasingly urbanized nature of the US population. In addition, the relative importance and basic nature of the key drivers of the probability that an individual fished in the previous year are remarkably stable across survey years. Having said this, an analysis of the predicted probability of having fished in the prior year for four different individual profiles (a male aged either 15 or 50 residing in either a large urban area or a rural area) suggests that some subtle changes occurred between 2000 and 2010. Specifically, the estimated probability of fishing has generally decreased for 15 year olds (an age just past the peak of an individual’s probability of having fished in the previous year), particularly for those residing in large urban areas. In contrast, the estimated probability has generally increased for men 50 years of age, particularly those residing in rural areas. This interaction between age and urban or rural residence is an interesting one, and could be related to the spikes in fishing licenses observed in presidential election years. It is also consistent with the notion that there has been a “crowding out” of fishing and other traditional leisure-time activities by emerging (often digital) entertainment technologies, although as of 2010 this crowding out seems to have been extremely modest. However, there is reason to believe that this crowding out effect may have greater long-run implications for fishing as a leisure time activity.

 

The Drivers of Fishing Propensity

 

To determine the drivers of fishing propensity, we estimate a separate binary classification model for each year of the survey. We could have pooled the data, but there are three reasons not to. First, potential fields that report dollar measures on an interval basis cannot be adjusted for inflation. Second, before conducting the analysis, we could not be certain whether the magnitude of the effect of certain drivers was decreasing or increasing over the ten year span between the first and last surveys. Finally, for reasons we give below, we make use of sampling weights, and these weights are specific to each year of the survey.

 

The three binary classification models are estimated using Alteryx’s Boosted Model tool, which implements the gradient based boosting algorithm of Jerome Friedman via R’s gbm package. This method has a number of nice features:

 

  • Field selection among possible predictors is internally handled
  • Non-linear relationships between the target and the predictors are automatically modeled
  • Interaction terms are automatically addressed based on a user specified depth (in our models we allow for up to three-way interactions)
  • Marginal effects plots are easily produced which visually show the relationship between a predictor and the target in a way that controls for other (possibly confounding) factors
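
The idea behind a marginal effects plot can be illustrated with a brute-force partial dependence calculation: force one predictor to a given value for every record, average the model's predictions, and repeat for each value of interest. The following is a minimal Python sketch of that idea only. It is not the Boosted Model tool's or gbm's own routine (which may compute the curve differently), and `toy_predict`, the field names, and the sample rows are all hypothetical stand-ins.

```python
import math


def partial_dependence(predict, rows, field, grid):
    """Average prediction over all rows with `field` forced to each grid value."""
    curve = {}
    for value in grid:
        # Overwrite only the field of interest; every other field keeps
        # its observed value, which is what "controls for" the others.
        preds = [predict({**row, field: value}) for row in rows]
        curve[value] = sum(preds) / len(preds)
    return curve


def toy_predict(row):
    """Hypothetical stand-in for a fitted model: returns P(fished last year)."""
    score = -1.0 - 0.02 * row["age"] + (0.8 if row["area"] == "rural" else 0.0)
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical respondents (made up for illustration, not survey data).
rows = [
    {"age": 15, "area": "urban"},
    {"age": 35, "area": "rural"},
    {"age": 50, "area": "urban"},
    {"age": 70, "area": "rural"},
]

curve = partial_dependence(toy_predict, rows, "area", ["rural", "urban"])
for value, prob in curve.items():
    print(f"{value}: {prob:.3f}")
```

Because every row is scored under each value of the field, differences between points on the curve reflect the field itself rather than differences in who happens to live where.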

 

Since the data is from a stratified random sample, sampling weights (which adjust for over- or under-sampling of certain population groups) were used in estimating the models.
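
To see why sampling weights matter, consider estimating the share of people who fished from a sample in which one group is over-represented. The Python sketch below uses made-up responses and weights (they are assumptions for illustration, not values from the survey), and `weighted_share` is a hypothetical helper name.

```python
def weighted_share(responses, weights):
    """Weighted proportion of positive responses (1 = fished, 0 = did not)."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(w for r, w in zip(responses, weights) if r) / total


# Hypothetical sample: the two respondents who fished come from an
# over-sampled group, so each represents fewer people in the population.
fished = [1, 1, 0, 0, 0]
weights = [1.0, 1.0, 3.0, 3.0, 2.0]

unweighted = sum(fished) / len(fished)
weighted = weighted_share(fished, weights)
print(f"unweighted share: {unweighted:.2f}")
print(f"weighted share:   {weighted:.2f}")
```

The unweighted figure overstates participation here because it treats each respondent as representing the same number of people. The same logic applies when the weights are passed to a model-fitting routine rather than to a simple proportion.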

Figure 2. Variable Importance Weights by Survey Year

 

Figure 2 provides the relative importance plot for the set of predictor fields used in each of the three models. For all years, the four most important predictors (in order) are the state of residence, family income, age, and gender (men are over twice as likely as women to have fished in the prior year). The next most important predictors are metropolitan area status (large, with over 1 million population; mid-size, between 250,000 and 1 million; small, 50,000 to 249,999; and non-metro) and urban status (rural or urban). While the categories for these two fields do not perfectly overlap, there is a very high degree of overlap. As a result, the importance of each of these two measures is likely to be understated in the variable importance plots, since the two fields carry much of the same information. The individual’s race (Native Americans/Alaska Natives and Whites are more likely to participate in fishing than other racial groups) and the individual’s relationship (child, spouse, etc.) to the “reference person” (typically the family’s primary wage earner) are the remaining predictor fields that have important relative effects. Below we examine a number of the more important drivers of the probability of fishing in the previous year in more detail.

Figure 3. Marginal Effect Probabilities by State for each Survey Year

 

Figure 3 shows the patterns of the state of residence on the probability of having fished in the prior year (once other predictors have been statistically factored out) for each of the three surveys as a set of choropleth (thematic) maps. The figures do reveal variation across states for the different survey years (the 2001 and 2011 surveys are more similar to one another than either is to the 2006 survey), but what consistently comes through is that the Upper Midwest, the northern Rocky Mountain states (Idaho, Montana, and Wyoming), the South Central states along the Mississippi River (excluding Tennessee), and the states of Alaska and Oklahoma are relatively high probability fishing areas once other factors are taken into account. In contrast, the Southwest, Mid-Atlantic, and southern New England states have lower probabilities of fishing. One likely cause for the differences observed across states is the relative ease of accessing fishing areas in a short amount of time. However, other factors, such as local culture, are likely to be at play as well.

Figure 4. Marginal Effect Probabilities by Income Group for each Survey Year

 

Figure 4 illustrates the effect of family income level on the probability an individual fished in the previous year for each survey year. In general, the set of figures indicates that the probability of having fished in the previous year increases with income, but at higher income levels, the incremental effect of higher income levels is reduced.

Figure 5. Marginal Effect Probabilities by Age for each Survey Year

 

Figure 5 examines the effect of age on the probability of fishing the previous year. The youngest age allowed for participants in the survey is six years. For all three survey years, the plot reveals that the probability of fishing rises rapidly until reaching a peak around ages 9 or 10. What has changed over time is the age range over which the peak in the probability is maintained before an extended downward trend starts. In the 2001 survey the downward trend begins around age 12, in 2006 the start of the downward trend moves to age 11, and in 2011 it moves even earlier, to around age 10. Once the slide begins, it is fairly rapid until individuals reach their early 20s, at which point the rate of decline slows significantly until individuals reach their mid-40s (their early 50s for the 2011 survey data), after which the rate of decline picks up again. While not shown in the plot, past age 80, all three curves stabilize at a low fishing probability level.

Figure 6. Marginal Effect Probabilities by Rural/Urban Areas for each Survey Year

 

The difference in the probability of fishing across urban versus rural areas is illustrated in Figure 6. The figure reveals that individuals residing in rural areas are much more likely to have fished in the previous year than those residing in urban areas. These differences are likely to reflect differences in both the ability to easily access fishing areas and cultural orientation.

 

Looking for a General Downward Trend in Fishing Interest

 

One thing we are interested in assessing is whether there is a secular downward trend in the probability of fishing that is unrelated to the demographic, socioeconomic, and location drivers we have just examined. There are a number of reasons to think this may be the case. Between 2000 and 2010 there was a large shift in the entertainment options available due to advances in entertainment technologies, many of them made possible through the expansion of both broadband Internet connectivity and new mobile technologies. Given this, it seems very possible that these entertainment technologies could be “crowding out” traditional leisure time activities such as fishing.

 

While we do have three different survey years (spanning a decade that saw many of these changes in entertainment options), pooling the data into a single sample, and including the survey year as a predictor, is problematic since we do not have appropriate multi-year sampling weights available. As an alternative, we decided to examine the percentage change in the predicted probability of fishing in the prior year between the 2001 and 2011 surveys for four well defined individual profiles for each state. The profiles differ based on age (15 versus 50) and how urbanized the individual’s area of residence is (a large metro area with a population of 1 million or more versus a non-metro, rural area). The profiles are based on white males, and other factors are set in appropriate ways. Specifically, marital status is set to “Married” for 50 year olds and never married for 15 year olds; employment status is set to working for 50 year olds and student for 15 year olds; and educational attainment is set to college graduate for 50 year olds and less than high school for 15 year olds.
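
The profile comparison boils down to a percentage change between two predicted probabilities for the same profile. A minimal Python sketch follows; the probabilities are invented for illustration (they are not model output from this analysis), and `pct_change` is a hypothetical helper name.

```python
def pct_change(p_old, p_new):
    """Percentage change from p_old to p_new, relative to p_old."""
    if p_old <= 0:
        raise ValueError("baseline probability must be positive")
    return (p_new - p_old) / p_old * 100.0


# Hypothetical predicted probabilities for one state, keyed by profile:
# (age, area) -> (2001 survey prediction, 2011 survey prediction).
profiles = {
    (15, "large metro"): (0.20, 0.15),
    (15, "rural"): (0.40, 0.38),
    (50, "large metro"): (0.10, 0.11),
    (50, "rural"): (0.25, 0.30),
}

for (age, area), (p2001, p2011) in profiles.items():
    print(f"age {age}, {area}: {pct_change(p2001, p2011):+.1f}%")
```

Using a relative rather than an absolute change puts profiles with very different baseline probabilities on a comparable footing, at the cost of magnifying changes for profiles whose baseline probability is small.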

 

The drawback of our approach is that it does not allow us to look at statistical significance levels. This limits our ability to conduct a formal hypothesis test, making our analysis in this area more tentative in nature. However, it does allow us to gain a qualitative sense of the direction of the effect.

Figure 7. The Profile Probabilities by Survey Year for Two States

 

Figure 7 gives the predicted probabilities of the four profiles for individuals residing in both California and Minnesota for the 2011 survey. This figure highlights the strong effects of state of residence, age, and the urban versus rural nature of an individual’s area of residence on the probability of having fished in the prior year. The results of the analysis can be seen in Figure 8, which is a collection of four dotplots, each arranged in a histogram pattern. The figures suggest that the percentage change between 2001 and 2011 in the probability of having fished the prior year has primarily been negative across states for individuals who are 15 years of age, particularly those residing in urban areas. In contrast, the percentage change between 2001 and 2011 for the same measure has been mostly positive across states for individuals aged 50, particularly for those residing in rural areas. This suggests that while emerging entertainment technologies may have “crowded out” fishing as a leisure time activity for younger age groups, particularly in urban areas, other groups appear to have experienced an increase in the probability of fishing in the previous year. Overall, while fishing may have experienced some crowding out from newer, alternative leisure time activities, the extent of this crowding out is likely to be fairly modest. Returning to Figure 7, which shows how strong the effects of location (state and urban/rural) and age are on the probability of fishing in the prior year, underlying demographic and migration trends in the US population are likely largely responsible for the decline of fishing as a leisure time activity.

Figure 8. The Percentage Change in Fishing Probabilities Between the 2001 and 2011 Surveys

 

Developing a Forecasting Model for Fishing Licenses

 

The analysis of the individual level cross section data points to several possible variables that are worth including in a time series forecasting model of issued fishing licenses. A number of variables that strongly influence the probability of fishing have experienced significant trends over the past several years. Both the aging of the US population and the long-term increase in the urbanization of the US continue. Both of these trends should result in a decrease in the number of fishing licenses per 100 population being issued over the 2000 to 2013 time period. Trends in the other demographic and socioeconomic variables have also likely had an impact, but the effect of these other variables on fishing behavior is often more ambiguous. For instance, there is currently increased concern about rising income inequality in the US. However, rising inequality suggests that there has been a movement of families toward both tails of the income distribution, which has an ambiguous effect on the probability of fishing. As a result, we focus on changes in the age distribution and the percentage of the population residing in rural areas in developing a time series forecasting model.

Figure 9. Trends in the Fishing Licenses, Rural Population, and Median Age

 

To capture the effect of the aging of the US population on the number of fishing licenses issued per 100 population, we use the median age in each year based on population estimates for that year. Figure 9 illustrates the general upward trend in this measure. We use the percentage of the population residing in rural areas to capture the changing rural/urban composition of the US population, which is also visually presented in Figure 9. Given the definition of these measures, we expect to see a negative effect for median age and a positive effect for the percentage of the population residing in rural areas (since this figure is declining over time, the number of fishing licenses issued should decline as a result).

As indicated in the introduction, attempting to determine causality using time series models is often difficult due to the high levels of correlation between potential predictors. This example serves as a case in point. Specifically, median age and the percentage of the population in rural areas are highly correlated with one another (the Pearson correlation coefficient between the two measures is -0.9), and both measures are highly correlated with the number of fishing licenses issued per 100 population. Licenses per 100 population is particularly highly correlated with median age (the Pearson correlation coefficient between these two variables is -0.95, while the coefficient between licenses per 100 population and the percentage of individuals residing in rural areas is 0.81). This high level of correlation between variables complicates the analysis. Initially, a model that included both the median age of the population and the percentage residing in rural areas was estimated, but the percentage residing in rural areas had a counterintuitive negative sign (rather than the expected positive sign), and was far from being statistically significant. Both the incorrect sign and the statistical insignificance are almost certainly due to the high level of correlation between this variable and the median age of the population (when an ARIMA model is estimated with the percentage of the population residing in a rural area as the single covariate, the variable has the expected positive sign and is highly statistically significant). A second model was estimated that used only the median age of the population as a covariate, which results in a model with the expected negative sign on median age and produces very plausible forecasts.
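
A rough way to gauge how much a correlation of -0.9 between two covariates hurts is the variance inflation factor (VIF) from ordinary regression, which for exactly two predictors reduces to 1 / (1 - r²). The Python sketch below computes a Pearson correlation from scratch and then that factor. The two short series are made up for illustration (they are not the data behind Figure 9), and applying the ordinary-regression VIF to an ARIMA model with covariates is only a heuristic, not part of the analysis described here.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(r):
    """Variance inflation factor when a model has exactly two predictors."""
    return 1.0 / (1.0 - r * r)


# Hypothetical yearly series: one trending up, one trending down.
median_age = [35.3, 35.6, 35.9, 36.2, 36.6, 37.0]
rural_pct = [21.0, 20.6, 20.4, 19.9, 19.6, 19.3]

print(f"correlation in the toy data: {pearson(median_age, rural_pct):.3f}")
# Using the correlation reported in the text (-0.9):
print(f"VIF at r = -0.9: {vif_two_predictors(-0.9):.2f}")
```

At r = -0.9 the factor is a little above 5, meaning the sampling variance of each coefficient is roughly five times what it would be with uncorrelated covariates. With only 14 annual observations, that is enough to make a coefficient's sign and significance unstable.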

 

Both a univariate ARIMA model and a univariate ETS model (models that rely only on past levels of the target variable for model creation) were also estimated. The ARIMA model that uses the median age of the population as a covariate produces better in-sample predictions and more plausible forecasts than the univariate ARIMA model. However, the ARIMA model with median age as a covariate does not appear to produce better forecasts than the univariate ETS model (the forecasts from the two models are nearly identical). The situation where a univariate time series model (in this case the ETS model) does as well as, if not better than, a time series model that uses covariates is very common in practice, because the slowly evolving changes in key drivers are often readily captured by the trend components of univariate time series models. Additionally, given the high level of correlation between possible predictors, typically only a few can be included in a time series model before the statistical problems that occurred in this example come into play.
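
To make concrete how a univariate model's trend component can absorb a slowly moving driver, here is a minimal Python sketch of Holt's linear trend method, one simple member of the exponential smoothing (ETS) family. It is not the model fitted in this analysis: the smoothing parameters `alpha` and `beta` are fixed by hand here, whereas an ETS tool would normally estimate them and select the model form, and the input series is hypothetical.

```python
def holt_forecast(series, alpha, beta, horizon):
    """Point forecasts from Holt's linear trend exponential smoothing."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    # Simple initialization: first value as the level, first difference as the trend.
    level = series[0]
    trend = series[1] - series[0]
    for y in series[1:]:
        prev_level = level
        # Blend the new observation with the previous one-step-ahead prediction.
        level = alpha * y + (1 - alpha) * (prev_level + trend)
        # Blend the latest change in level with the previous trend estimate.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + h * trend for h in range(1, horizon + 1)]


# Hypothetical licenses-per-100 series with a gentle decline (made-up values).
licenses_per_100 = [10.4, 10.2, 10.1, 9.9, 9.8, 9.6, 9.5]

for step, value in enumerate(holt_forecast(licenses_per_100, 0.8, 0.2, 3), start=1):
    print(f"h={step}: {value:.2f}")
```

The trend term is updated from the series itself, so a steady decline driven by something like rising median age is picked up without the covariate ever entering the model.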

 

The forecast for the next three years (shown in Figure 10) indicates a continued decline in fishing licenses per 100 population (although, if the presidential election year pattern continues, our 2016 forecast may well be on the low side), but the rate of this decline is forecast to begin to level off.

Figure 10. Actual and Forecast Fishing Licenses per 100 Population to 2016

 

Conclusions

 

The bulk of the observed decline in the propensity of individuals to fish in the US is due to underlying trends in the US population (particularly the aging of the population). We do find some evidence that is consistent with the notion that there is “crowding out” of fishing as a leisure time activity (particularly among younger individuals) as a result of emerging (largely broadband or mobile enabled) entertainment options. In the time frame covered by the 2001, 2006, and 2011 survey years examined, this crowding out appears to be fairly minimal overall. However, given the effect this technology may have in reducing the percentage of individuals who fish in their childhood, it is very plausible that crowding out will play a much more important role in the decline of fishing in the future. The long-term trend toward an increasingly urbanized US population is also likely to foster an ongoing downward trend in fishing as a leisure time activity.

Dan Putler
Chief Scientist

Dr. Dan Putler is the Chief Scientist at Alteryx, where he is responsible for developing and implementing the product road map for predictive analytics. He has over 30 years of experience in developing predictive analytics models for companies and organizations that cover a large number of industry verticals, ranging from the performing arts to B2B financial services. He is co-author of the book, “Customer and Business Analytics: Applied Data Mining for Business Decision Making Using R”, which is published by Chapman and Hall/CRC Press. Prior to joining Alteryx, Dan was a professor of marketing and marketing research at the University of British Columbia's Sauder School of Business and Purdue University’s Krannert School of Management.
