2018 Excellence Awards Entry: A Lyft Driver Uses Linear Regression to Predict Their Next Fare Amount
Name: Tara Charter
Title: Sr. Technical Writer
Company: Alteryx
Overview of Use Case:
As a part-time Lyft driver, I want to predict how much I can expect to receive on my next fare. If I can better determine how much I can expect to receive, I can more easily set and reach my goals. Some predictors of how much my next fare will be include variables like ride distance, time of day, type of ride, and customer demographics.
Describe the business challenge or problem you needed to solve:
The math-infused landscape of linear regression is difficult to navigate. It is nearly impossible for a citizen data scientist who has never written code to build a linear regression model. I don't want to guess. I want to be able to predict my next fare based on a set of constants. Predicting my next fare requires a linear regression model. I need a tool that will help me build a model where I can input customer data and get reliable, science-based results.
Describe your working solution:
Using Alteryx Designer 2018.2, the Alteryx predictive analytics tools, the Alteryx Predictive Analytics starter kit, and Excel, I successfully built a linear regression model that predicts my next fare based on a set of constants that I input as customer data. I can repeat my workflow using new customer data. The steps for my solution are as follows (and are documented in my data science blog post, "A Lyft Driver Uses Linear Regression to Predict Their Next Fare Amount"):
1. Plan. Decide what my independent and dependent variables are. My dependent variable is fare amount. My independent variables include ride distance, time of day, type of ride, and customer demographics including customer age and household income.
2. Gather transaction data. I collect Lyft fare data over a one-month period. Data includes transaction ID, fare amount, customer ID, miles, and time of day.
3. Find sample demographic data. Using the Alteryx Predictive Analytics starter kit, I create customer demographic data that includes customer ID, customer's phone type, gender, age, and household income.
4. Download and install Alteryx Designer 2018.2, the Alteryx predictive analytics tools, and the Alteryx Predictive Analytics starter kit.
5. Open the Demand Forecasting guided workflow from the starter kit.
6. Follow the guided instructions in the starter kit to create a new workflow where I can prep and blend my data sets.
6a. Cleanse Transaction Data
Using the following Alteryx Designer tools, I cleansed my data set:
- Input Data, Auto Field (to set field types), Data Cleansing (to select fields to cleanse), Browse tool.
6b. Prep Transaction Data
Using the following Alteryx Designer tools, I prepped my data set:
- Input Data to bring in my cleansed data from Excel
- Select to select only the fields I need
- Summarize to group, order and summarize by what I want to predict (I want total fare by transaction)
6c. Prep Customer Demographic data
Using the following Alteryx Designer tools, I prepped my data set:
- Input Data to bring in my sample customer demographic data from Excel
- Select to select only the fields I need and speed up the workflow
- Filter to remove rows with null or empty values (I don't want to use any rows where Age is empty)
6d. Join my Transaction Data and Customer Demographic Data
- Using the Join tool, I join my data sets on Customer ID
- Using the Output Data tool, I create a new blended data set
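The prep and blend steps in 6b through 6d can also be sketched in plain Python. This is only an illustration of what the Select, Summarize, Filter, and Join tools do conceptually, not what Alteryx runs internally. The field names (transaction_id, customer_id, fare, miles, age, household_income) and all sample values are made up for the example.

```python
# Hypothetical sample rows; field names and values are illustrative only.
transactions = [
    {"transaction_id": "T1", "customer_id": "C1", "fare": 12.50, "miles": 4.2},
    {"transaction_id": "T1", "customer_id": "C1", "fare": 2.00, "miles": 4.2},
    {"transaction_id": "T2", "customer_id": "C2", "fare": 8.75, "miles": 2.9},
    {"transaction_id": "T3", "customer_id": "C3", "fare": 20.00, "miles": 9.1},
]
demographics = [
    {"customer_id": "C1", "age": 34, "household_income": 72000},
    {"customer_id": "C2", "age": None, "household_income": 55000},  # Age is empty
    {"customer_id": "C3", "age": 51, "household_income": 98000},
]


def summarize_by_transaction(rows):
    """Group rows by transaction ID and total the fare (Summarize step)."""
    totals = {}
    for row in rows:
        key = row["transaction_id"]
        if key not in totals:
            totals[key] = {
                "transaction_id": key,
                "customer_id": row["customer_id"],
                "miles": row["miles"],
                "total_fare": 0.0,
            }
        totals[key]["total_fare"] += row["fare"]
    return list(totals.values())


def drop_missing(rows, field):
    """Keep only rows where the given field is neither null nor empty (Filter step)."""
    return [row for row in rows if row.get(field) not in (None, "")]


def inner_join(left, right, key):
    """Combine rows from both sides that share the same key value (Join step)."""
    right_by_key = {row[key]: row for row in right}
    return [
        {**left_row, **right_by_key[left_row[key]]}
        for left_row in left
        if left_row[key] in right_by_key
    ]


blended = inner_join(
    summarize_by_transaction(transactions),
    drop_missing(demographics, "age"),
    "customer_id",
)
for row in blended:
    print(row)
```

In this sample, customer C2 has no Age value, so transaction T2 drops out of the blended data set and only T1 and T3 remain.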
7. Create my linear regression model to predict my next fare
- 7a. Using the Input Data tool, I bring in my new blended data set
- 7b. Using the Select tool, I reset my data types to the most precise data type available
- 7c. Using the Create Samples tool, I split my data into two samples (one for estimating and one for validating)
- 7d. Using the Linear Regression tool, I use the estimation output from Create Samples to create my model object and my outputs (one of which is interactive)
- 7e. Using the Score tool, I use the validation output from the Create Samples tool to score my data and see whether the model is fit for new customer data
- 7f. Using Browse tools as a way to view the output, I run the workflow
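For readers who want to see the idea behind steps 7c through 7e, here is a minimal sketch in plain Python. It is not the Alteryx implementation. It assumes an 80/20 estimation/validation split, only two predictors (miles and customer age), and synthetic fares generated from a made-up rule, so the numbers mean nothing beyond exercising the code. The function names are hypothetical.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ols(features, targets):
    """Return least-squares coefficients [intercept, b1, b2, ...] via the normal equations."""
    xs = [[1.0] + list(f) for f in features]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(coefs, feature_row):
    """Apply the fitted coefficients to one row of predictor values."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], feature_row))


def rmse(coefs, features, targets):
    """Root mean squared error of the predictions against the known fares."""
    errors = [(predict(coefs, f) - y) ** 2 for f, y in zip(features, targets)]
    return (sum(errors) / len(errors)) ** 0.5


# Synthetic rows: (miles, age) -> fare, using the made-up rule
# fare = 3 + 1.8 * miles + 0.05 * age. Real fare data would be noisier.
rng = random.Random(42)
rows = []
for _ in range(40):
    miles, age = rng.uniform(1, 15), rng.randint(18, 70)
    rows.append(((miles, age), 3.0 + 1.8 * miles + 0.05 * age))

# Create Samples step: shuffle, then split 80% estimation / 20% validation.
rng.shuffle(rows)
cut = int(len(rows) * 0.8)
estimation, validation = rows[:cut], rows[cut:]

# Linear Regression step on the estimation sample, Score step on the validation sample.
coefs = fit_ols([f for f, _ in estimation], [y for _, y in estimation])
validation_rmse = rmse(coefs, [f for f, _ in validation], [y for _, y in validation])
print("coefficients:", coefs)
print("validation RMSE:", validation_rmse)
```

Because the synthetic fares contain no noise, the fitted coefficients match the made-up rule and the validation error is close to zero. With real data, the validation error shows how far off the predictions are on rides the model has not seen.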
8. Tune my linear regression model to predict my next fare
- Checking my interactive report (which is one output from the Linear Regression tool), I scroll down to advanced stats. This is where I find out which variables (predictors) I should keep.
- Using the report stats, I decide I don't want to keep any variables that are not high confidence or "significant", so I deselect them and rerun the workflow. This is known as tuning the model.
- I also check the Score tool output to see whether my blended data set is going to make a good model. The Browse tool shows a smooth slope with few outliers, so I know I have a good data set.
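The tuning idea in step 8 can be sketched in plain Python as well. This sketch computes a t-statistic for each coefficient (the coefficient divided by its standard error) and keeps predictors whose absolute t-statistic exceeds 2. That cutoff is a rough rule of thumb assumed here for illustration. The Alteryx report presents significance in its own way, and this code does not reproduce it. The data, the has_iphone predictor, and the function names are all hypothetical.

```python
def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    m = [
        row[:] + [1.0 if i == j else 0.0 for j in range(n)]
        for i, row in enumerate(matrix)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        scale = m[col][col]
        m[col] = [v / scale for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [row[n:] for row in m]


def t_statistics(names, features, targets):
    """Fit ordinary least squares and return {name: t-statistic} for each coefficient."""
    xs = [[1.0] + list(f) for f in features]
    n, k = len(xs), len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, targets)) for i in range(k)]
    inv = invert(xtx)
    coefs = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    residuals = [y - sum(c * v for c, v in zip(coefs, x)) for x, y in zip(xs, targets)]
    # Residual variance with n - k degrees of freedom.
    sigma2 = sum(r * r for r in residuals) / (n - k)
    stats = {}
    for i, name in enumerate(["intercept"] + list(names)):
        standard_error = (sigma2 * inv[i][i]) ** 0.5
        stats[name] = coefs[i] / standard_error
    return stats


# Hypothetical data: fare depends on miles plus some noise, but not on phone type.
names = ["miles", "has_iphone"]
features = [(1, 0), (2, 0), (3, 0), (4, 0), (1, 1), (2, 1), (3, 1), (4, 1)]
fares = [4.0, 4.5, 6.0, 8.5, 3.0, 5.5, 7.0, 7.5]

stats = t_statistics(names, features, fares)
kept = [name for name in names if abs(stats[name]) > 2]  # assumed cutoff
print(stats)
print("predictors kept:", kept)
```

In this made-up data, miles comes out as clearly significant and has_iphone does not, so only miles would be kept before rerunning the model.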
9. Apply my linear regression model to predict my next fare
- Finally, I use a fresh new customer data set (via a new Input Data tool) to apply my new model and predict my next fare based on certain constants!
- Using another Score tool, a Browse tool, and an Output Data tool, I simply apply the model to a fresh new set of customer data. The new data has everything except fare amount. When I run my workflow, I get a predicted fare amount for each row in the data set.
- For example, I can predict the fare amount for a certain customer ID, a specific ride distance, time of day, customer age, and household income.
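Applying a fitted linear regression model to new data, as in step 9, comes down to multiplying each predictor by its coefficient and adding the intercept. The sketch below shows that arithmetic in plain Python. The coefficients are placeholder values, not results from my workflow, and the field names and customer rows are hypothetical.

```python
# Placeholder coefficients standing in for a fitted model (not real results).
model = {"intercept": 3.0, "miles": 1.8, "age": 0.05}

# New customer data: everything except the fare amount.
new_customers = [
    {"customer_id": "C10", "miles": 5.0, "age": 40},
    {"customer_id": "C11", "miles": 10.0, "age": 20},
]


def score(model, rows):
    """Append a predicted fare to each row: intercept + sum(coefficient * value)."""
    scored = []
    for row in rows:
        fare = model["intercept"] + sum(
            coef * row[name] for name, coef in model.items() if name != "intercept"
        )
        scored.append({**row, "predicted_fare": round(fare, 2)})
    return scored


scored = score(model, new_customers)
for row in scored:
    print(row["customer_id"], row["predicted_fare"])
```

Each row in the new data set gets one predicted fare, which mirrors what the second Score tool produces in the workflow.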
Describe the benefits you have achieved:
The benefits of having a predicted fare amount are reduced loss of income and reduced wait times. I can better understand what time of day, customer demographic, and ride type will yield the highest fare amount. Without this model, I would have incorrectly assumed the next fare amount was based only on ride distance.
Related Resources:
Media
Data Science Blog:
Here is a video that shows you how I configured my Linear Regression tool