
# Data Science

Machine learning & data science for beginners and experts alike.

## Using Linear Regression to Conduct a Pay Equity Study


Data can sometimes reveal patterns that provide insights into complex issues, especially around diversity, equity and inclusion. This post is your guide to conducting a pay equity analysis using Alteryx Designer — specifically, creating a linear regression model that helps you assess whether pay differences exist between people in different demographic groups.

If you’d like to follow along, download our workflow, attached at the bottom of this post. We hope this guide and workflow can be a starting point for your own analysis in your organization.

## The Question and Approach

The overall question is, “Is there a pay gap at your company? If so, who may need a correction to their compensation?”

To answer this question, you want to build a model that will help you determine what “acceptable” variables influence pay, and check whether “unacceptable” terms (i.e., those representing potential causes of bias, like gender) may have a statistically significant effect in lowering an individual’s pay. If a pay gap exists, you can flag the employees whose pay is below the expected range for review by a compensation expert. You can use linear regression to help make this determination.
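To build intuition for what the regression measures, consider the simplest possible case: a model with only an intercept and a gender indicator. There, the coefficient on the indicator is just the difference in mean pay between the two groups; the full model adds the "acceptable" drivers as controls. If you want to sanity-check the idea outside Designer, here is a quick Python sketch with made-up numbers:

```python
# Toy pay data (hypothetical figures, not from this post's dataset)
male_pay = [60_000, 64_000, 70_000]
female_pay = [58_000, 63_000, 67_000]

def mean(xs):
    return sum(xs) / len(xs)

# With only an intercept and an is_female dummy, OLS estimates the
# dummy coefficient as the female-minus-male difference in mean pay.
gap = mean(female_pay) - mean(male_pay)
print(gap)
```

Adding controls changes the estimate, of course; this is only the degenerate case that shows what the gender coefficient is trying to capture.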

For this process to work, it is important to have a large enough number of employees to analyze in order to be able to judge the statistical significance of your findings. Say you have a team of 10 individuals, including four women, who are located in different cities, have varying years of experience, and perform different jobs. In this case, it would be very difficult to make conclusions about the statistical significance of any pay differences you identify. However, if your numbers are too low to leverage the full approach, you can still follow this framework and find look-alike employees to do manual comparisons.

## Gathering the Right Data

Note: The input data shown here is from a fictional company with a 2% pay gap.

You’ll need to prepare an anonymized data set that includes all the factors that should drive an employee’s compensation, as well as the potential causes of bias that should not. In our sample data set, we identified factors such as education, level, and performance ratings that should drive employee compensation. A factor that should not drive compensation? Gender!

While age shouldn’t factor into compensation — you can see how being perceived as “too old” or “too young” might unfairly influence pay independent of performance — we did use it as a proxy for years of experience.

For your analysis, consider what should be driving your pay process and make sure you are capturing that data. It’s vital to consult with your HR compensation experts who are most knowledgeable about these issues.

## Exploratory Data Analysis

In your Exploratory Data Analysis (EDA), you’ll look for outliers, understand what correlations exist, and identify any errors in the dataset. Don’t skip this step, as lots of valuable learning happens here!

Here are some key EDA steps to complete.

Response variable distribution: This helps us spot outliers in the dataset. By checking these distributions, we spotted two kinds of outliers: employees being paid in a different currency (large values, circled in the left histogram below) and part-time employees who do not work the entire year (low values, circled in the center histogram). Note that you should analyze different countries separately, as they have different job markets. Fixing these outliers (more on that later) made our distribution look a bit more normal (right histogram).
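If you want to replicate this outlier screen outside Designer, Tukey's 1.5 × IQR rule is a common choice. Here is a small Python sketch; the pay values are illustrative only:

```python
def iqr_outliers(values):
    """Flag values outside 1.5 * IQR of the middle 50% (Tukey's rule)."""
    xs = sorted(values)
    n = len(xs)

    def quantile(q):
        # Simple linear-interpolation quantile
        pos = q * (n - 1)
        lo, hi = int(pos), min(int(pos) + 1, n - 1)
        return xs[lo] + (pos - lo) * (xs[hi] - xs[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    spread = 1.5 * (q3 - q1)
    return [v for v in values if v < q1 - spread or v > q3 + spread]

pay = [52_000, 55_000, 58_000, 60_000, 61_000, 63_000,
       12_000,        # part-time employee (low outlier)
       6_500_000]     # paid in a different currency (high outlier)
print(iqr_outliers(pay))
```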

Correlations between variables: This step helps us understand which factors are correlated with the response variable, as well as with each other. There are a few techniques to view these correlations, depending on the column’s data types.

Continuous vs. Continuous Plots (Association Analysis)

The heat map on the left is shaded based on how correlated two columns are with one another. We can see age (our proxy for years of experience) is the most correlated with Total Cash Compensation, as denoted by a darker red square. Performance is also correlated with compensation, but less so than age, based on its lighter red color.
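The shading in that heat map is driven by pairwise correlation coefficients, which you can also compute directly. A small Python sketch of the Pearson correlation, with made-up age and compensation values:

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: pay rises fairly steadily with age
age = [25, 30, 35, 40, 45, 50]
comp = [50_000, 58_000, 67_000, 72_000, 80_000, 91_000]
print(round(pearson(age, comp), 3))
```

Values near +1 or -1 correspond to the darkest squares on the heat map; values near 0 show little association.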

Continuous vs. Categorical Plots (Plot of Means)

We often have categorical data such as Education Level, which can’t be used for correlations in the above association analysis. Instead, we can use the Plot of Means Tool to see if there are different salary ranges for different categories. The above plot shows that people with doctorates in this dataset are paid higher than people at other education levels. Also, there is clear separation between the distributions, so we’d expect it to be a statistically significant term.
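Under the hood, a Plot of Means boils down to the average of the response within each category. If you want to reproduce those group means outside Designer, here is a minimal Python sketch with made-up figures:

```python
from collections import defaultdict

def mean_by_group(rows):
    """Average compensation per category (what a Plot of Means summarizes)."""
    totals = defaultdict(lambda: [0.0, 0])
    for category, pay in rows:
        totals[category][0] += pay
        totals[category][1] += 1
    return {cat: total / count for cat, (total, count) in totals.items()}

rows = [("Bachelors", 60_000), ("Bachelors", 64_000),
        ("Masters",   75_000), ("Masters",   81_000),
        ("Doctorate", 98_000), ("Doctorate", 104_000)]
print(mean_by_group(rows))
```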

## Data Preparation

Now that we’ve gathered our data and done some preliminary analysis to better understand it, we can take some necessary steps to get it ready for modeling.

Log transform the response variable: Linear regression models assume that the model error is normally distributed. When the response variable (here, Total Cash Compensation) is heavily skewed, the errors often are not normally distributed, and the model fit will violate the regression assumptions. The histogram on the left below shows how our response variable is distributed in its original form.

A trick is to take the natural log of the response variable to produce a log normal distribution. This sounds complicated, but it’s really easy; you can use log(total_cash_compensation) in a Formula Tool. The histogram on the right below shows how our distribution has been shifted to look more like a normal “bell curve” after the log transformation. You can try building your model first without the transformation, and see if the residual diagnostics meet the assumptions in our model evaluation steps below. If not, try a natural log transformation, and reevaluate your results.

This transformation has the added benefit of making the model output easier to interpret, because it reframes your results from "women make X dollars less" to "women make 98% of what men make in the same job." This makes the magnitude of the pay gap, if one is present at your company, easier to grasp.
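Because the model is fit on log(pay), converting a coefficient back to a pay ratio is just an exponential. A quick check in Python, using a coefficient of -0.02 like the one discussed later in this post:

```python
from math import exp

# Hypothetical coefficient on the female indicator from a model fit on
# log(compensation)
coef_female = -0.02

# On the log scale, a coefficient translates to a multiplicative pay ratio
ratio = exp(coef_female)
print(f"women earn {ratio:.1%} of comparable men's pay")
```

For small coefficients, exp(b) is close to 1 + b, which is why a -0.02 coefficient reads as roughly a 2% gap.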

Binning categories into higher levels: This step will merge people in smaller groups into larger groups based on common characteristics. For example, in the table below, you can see how people with various management and executive job titles have been binned together into a broader category that encompasses more people. This step increases the sample size of each bin, thus increasing degrees of freedom. Work with your HR and compensation experts on this step, as they will know which groups can be aggregated together.

| Management Level (Actual) | Management Level Bin |
| --- | --- |
| Senior Vice President | Executive |
| Vice President | Executive |
| Senior Manager | Manager |
| Manager | Manager |
| Group Manager | Manager |
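In code, this binning is just a lookup table. A minimal Python sketch mirroring the table above (the mapping itself should come from your HR and compensation experts, not from the data):

```python
# Hypothetical mapping from detailed titles to broader bins
LEVEL_BINS = {
    "Senior Vice President": "Executive",
    "Vice President": "Executive",
    "Senior Manager": "Manager",
    "Manager": "Manager",
    "Group Manager": "Manager",
}

def bin_level(title):
    # Fall back to the original title when no bin is defined
    return LEVEL_BINS.get(title, title)

print(bin_level("Group Manager"))
```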

Removing outliers: In our dataset, we saw a few outliers. They represent different groups and processes (e.g., the part-time job market and another country’s job market), so we feel we can exclude these for a more robust analysis.

Cleaning up missing values: In our exploratory analysis, we noticed that some employees were missing educational information. We opted to replace missing data with “unknown.”
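The same fill can be expressed in one line outside Designer; a small Python sketch with hypothetical values:

```python
# Replace missing education values (None or empty) with an explicit category
education = ["Bachelors", None, "Masters", ""]
cleaned = [level if level else "unknown" for level in education]
print(cleaned)
```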

## Fitting the Regression Model and Interpreting Results

Now that we have a dataset in good shape, it’s time to fit a model using the Linear Regression Tool!

Place a Browse Tool after the “R” output of the Linear Regression Tool, and click on it to see the model summary.

Let’s walk through the steps of evaluating this regression model.

1. Is any factor significant in the model? Check the F-statistic.
2. Which factors are/are not statistically significant? Check the p-values down the right-hand side.
3. How well does the model fit the data? Check the R-Squared, which here is 0.8651. This means about 86.5% of the variance in the compensation data can be explained by the model.
4. Does the model meet the regression assumptions? Check the residual diagnostics by clicking the arrow.
1. Linear Relationship (“Residuals vs. Fitted” plot, top left): The errors have mean zero (red line is close to zero) and have a constant variance around the red line for different fitted values.
2. Normality of the Error Term Distribution (“Normal Q-Q Plot,” top right): The model errors follow a normal distribution (verifying the normality assumption). The dots should follow the 45-degree line on the plot.
3. Constant Variance of the Error Term (“Scale-Location” plot, bottom left): The spread of the standardized residuals should stay roughly flat across fitted values, with no funnel shape.
4. Independence of the Error Terms (“Residuals vs. Fitted” plot, top left): The residuals should scatter randomly, with no systematic pattern or trend.
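If you want to verify the R-Squared figure by hand, it is one minus the ratio of the residual sum of squares to the total sum of squares. A small Python sketch with illustrative numbers (not the article's actual model output):

```python
def r_squared(actual, predicted):
    """Share of the response variance explained by the model's predictions."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot

actual = [1, 2, 3, 4]
predicted = [1.1, 1.9, 3.2, 3.8]
print(r_squared(actual, predicted))
```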

We can then report out our findings.

In the sample dataset, we found a pay gap of 2% between men and women at the fictional company. In the table above, this is the coefficient of -0.02 calculated for the “GenderzFemale” variable, and it reflects how much a change in this variable (from male to female) would affect the response variable of Total Cash Compensation, if all other variables were kept the same. This gender variable was statistically significant.

Note that a coefficient can come out negative in your regression model without implying a real gap; before drawing conclusions, check that the coefficient is statistically significantly different from 0.

In this fictional dataset, the model estimated a pay gap. The next step is to flag employees who are being paid below their estimated range. If we were to use the Score Tool with the estimated model as-is, it would predict compensation amounts that bake in the gender pay gap, since the model estimated that women are paid less than men. Therefore, we flip gender to “male” for every employee and then estimate each employee’s salary range. If an employee’s actual compensation falls below the range predicted from this adjusted data, that individual can be flagged for review by HR.
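The flip-and-score logic can be sketched outside Designer as well. The coefficients, tolerance band, and employee records below are all made up for illustration; in practice, the Score Tool's prediction interval replaces the simple prediction function here:

```python
from math import exp

# Illustrative coefficients for log(pay) = B0 + B_EXP*experience + B_FEMALE*is_female
B0, B_EXP, B_FEMALE = 10.8, 0.02, -0.02
TOLERANCE = 0.05  # flag anyone paid more than 5% below the prediction

def predict_pay(experience, is_female):
    return exp(B0 + B_EXP * experience + B_FEMALE * is_female)

def flag_for_review(employees):
    flagged = []
    for name, experience, is_female, actual_pay in employees:
        # Score everyone as if male so the prediction doesn't bake in the gap
        expected = predict_pay(experience, is_female=0)
        if actual_pay < expected * (1 - TOLERANCE):
            flagged.append(name)
    return flagged

employees = [
    ("Avery", 10, 1, 55_000),  # below the gender-neutral prediction
    ("Blake", 10, 0, 59_000),  # within the tolerance band
]
print(flag_for_review(employees))
```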

## Meaningful Patterns and Meaningful Action

A pay gap of 2% may sound minimal, but a 2% difference in pay over an entire career — 40+ years of work — certainly adds up, not to mention the loss of related retirement contributions and interest on investments.

Linear regression may be thought of as a simple data analysis tool, but here it reveals meaningful patterns in compensation that should motivate a response. Understanding the impact of gender on compensation and making necessary adjustments are important steps toward workplace equality.

Have you conducted your own pay equity study? What methods did you use? What questions do you have about this analysis? Let us know with a comment below, and subscribe to the blog to get future articles.

Alteryx

Great analysis!  I look forward to hearing what others think.

6 - Meteoroid

I had issues trying to import the file after opening the workflow.  What version of Alteryx was this workflow created in?

Alteryx

It was created in v21.1.

I have also submitted an edit on the article that is awaiting approval with the workflow in version 2019.2 that should be able to be opened in any more recent versions.

The R Score Tool changed its output column name in a later version. So if you get an error on the last Filter Tool in the workflow, update the column reference to whichever of [X_lwr] or [Score_lwr] isn't in the workflow.

Thanks for reading and reach out with other questions/feedback!

Alteryx Community Team

Thanks, @bhamel - that additional workflow is now attached to the post!

6 - Meteoroid

Thank you very much!

9 - Comet

I'm in the middle of Pay Equity analysis right now, so very timely.  Really validating to see that what I've built is remarkably similar.

In my working model, I struggle with the natural clustering that people do.  Much like Will.i.am spoke about, people go into fields, and pursue things that they know and understand.  People pursue careers where they see other people similar to them.  Those with the same educational attainment tend to group together.

And variables like Age, and Education don't have the same impact at different levels within the company.  There are times when Education has a really high value, and other times when it doesn't.  In my work, Age is the best proxy I have for skills, and real world experience.  And the "value" of age changes!

As a result, I'm using a 3 cluster model instead of one generalized model.  To get technical, the coefficients flip signs on me, or have vastly different scales.  Breaking my data into three clusters allowed me to overcome that challenge.

Alteryx

@Charity_K_Wilson  Thanks for reading and the great feedback! Segmenting before fitting models always seems to help boost performance and, in this case, get the correct coefficients for inference.