Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Maveryx Success Stories

Learn how Alteryx customers transform their organizations using data and analytics.
STORIES WANTED

Showcase your achievements in the Maveryx Community by submitting a Success Story now!

SUBMISSION INSTRUCTIONS

Weighting Survey Data and Linear Regression Analysis -Weighted Least Squares with Alteryx

kadirakyuz54
7 - Meteor

Overview of Use Case
In this article, we are going to illustrate the process of weighting (‘raking’) survey data and conducting a linear regression-weighted least squares analysis with Alteryx. We are going to adopt the actual data we collected in one of our research studies which examines the impact of Syrian refugees on fear of crime in Turkey.

 

Describe the business challenge or problem you needed to solve

According to the United Nations, the world is witnessing the highest levels of displacement on record making the refugee problem a global one. The current use case involves the survey data administered by the authors as a part of scientific research to evaluate the impact of Syrian refugees on fear of crime in Turkey. Since the beginning of the Syrian crisis, millions of refugees fled the country and they took refuge in other countries, especially the proximate ones. Turkey is one of them and currently hosting the largest number of (Syrian) refugees. The social consequences of such a large number of refugees are worth studying in many aspects. 

 

Most surveys are likely to have the problem of under/over representation with certain characteristics. Survey weighting aims at re-balancing the data to reflect the population more accurately and compensating the differences between the sample and target population. It is also known as sample balancing or raking the data. To this end, we need to calculate a survey weight for each respondent in terms of one or multiple demographic variables and then include this new weight variable into the regression analysis. The regression will take the weight into account while calculating the regression coefficients. 

 
Describe your working solution

We use survey data collected via Twitter from 786 respondents from 53 different provinces of Turkey. In addition to that we also integrated to our dataset the demographic statistics which were obtained from Turkish Statistical Institute (TUIK). 

 

This is the screenshot of the initial survey data which has 786 respondents from 53 different provinces of Turkey.  

 

Picture1.png

 

First, we created our combination groups with the demographic variables we would like to weight. They are respectively; 1) population of the province 2) gender and 3) five different age groups. We need to get the number of the respondents in each combination of the three demographic groups. The first summarize tool in the Alteryx workflow (see below) returns the number of respondents in each combination of the demographic groups.  

 

Picture2.png

Below is the configuration of the first summarize tool and the output. 

 

Picture3.pngPicture3_1.png

   

 

Then we need to get the total number of the respondents in our sample (sample size=786). We can do this by using the Summarize tool above in the Alteryx workflow. Below is the configuration of the second summarize tool and its output.

 

Picture4.pngPicture4_1.png

   

 

Using the Append tool in the Alteryx workflow above, we join the two tables. Below is the output. 

 

Picture5.png

Now, we can calculate the proportion of each combination in our sample by dividing the count of respondents in each row (combination) by the total number of the respondents (786) and multiply it with 100.  E. g. 7/786*100 = 0.8905 

 

Picture6.png

 

This calculation gives us the current_weight measure which corresponds to the proportion of each combination of the demographic groups in our sample. 

 

Picture7.png

Note that, in Adana province (PR_ID_NEW = 1), the number of the males (newsex = 0) in the age group 3 (Age =3) is 7 and the representation of this category in our sample is % 0.89. In Adana (PR_ID_NEW = 1), the number of the females (newsex = 1) in age group 3 (Age =3) is 3 and the representation for this group in the sample is % 0.38. 

 

 As you may notice in our sample, we do not have respondents for all combinations of the demographic groups. We have respondents from 53 of the 81 provinces in Turkey. That means if we had respondents from all the demographic groups, we would have 530 different combinations of demographic groups. (population of the provinces-53- * categories of gender -2- * age groups-5- =530). However, as you can see, we have respondents from 190 out of the 530 possible groups.

 

For example, in Adana (PR_ID_NEW = 1) there is no male respondent in the age group 1. That is what makes the calculation of the survey weight more complicated in this case because before calculating the survey_weight, we need to create the same combination of the demographic groups for our target population and obtain the relevant statistics for each, and then calculate the proportion of each combination in the overall population which we will call target_weight. 

 

Here is how we calculated the target_weight. At this stage, we need the data for the same combinations of groups in our target population. Below is the data we obtained from the Turkish Statistical Institute, TUIK. Based on this; we can obtain the male and female population in each age group in each province of Turkey. Now all we need to do is to create a new dataset from the overall population matching the same combinations of demographic groups in our sample. 

 

Picture8.png

 

It required some tedious work to obtain each statistics, but after applying the same procedures we easily calculated the target weight. We divided the number of the individuals in each demographic group by the overall population of the 53 provinces of Turkey. If we obtained respondents from all the provinces of Turkey this figure would be 81, but we could not. Below is the configuration of the formula tool in Alteryx and its output.

 

Picture9.png

Picture10.png

 

Now we can calculate the survey weight. To this end, we need to join the two datasets (our sample and the overall population), based on the same demographic weight variables. Below is the workflow configuration to do that. 

Picture11.pngPicture11_1.png

 

 

We now have both the target_weight and current_weight  and all we need to do is to calculate the survey_weight by dividing the target_weight by the current_weight.

 

Picture12.png

 

Survey_weight= target_weight/current_weight     

 

Picture13.png

 

Picture14.png

 

After calculating the survey weight, all we need to do is include it in the regression analysis together with the other predictor variables. The outcome will give the weighted regression results. Since we have two dependent variables and we run the regression with and without the survey_weight, you see four different regression models in the Alteryx workflow below. 

 

Picture15.png

 

 

Picture16.pngPicture16_1.png

         

use_case_workflow.JPG
Describe the benefits you have achieved

Massive refugee flow is a global concern that many countries and people across the world have to deal with today. It has both humanitarian and political aspects. The study explores whether the presence of Syrian refugees in Turkey fuels fear of crime among the country's citizens and how a general fear of crime under what might be considered normal circumstances differs from fear of refugee crime. It provides valuable insights into the social consequences of the massive refugee flows and suggests policy implications for the host countries.

 

Furthermore, convenience sample, rather than a random sample, was used for the current study and representativeness of the data for the convenience samples is always a cause for concern. A convenience sample allows anyone who wants to participate in the study to do so; therefore, some of the demographic characteristics might be either overrepresented or underrepresented in the sample. Survey weighting is the process of compensating the differences between the sample and target population in terms of one or multiple demographic variables. In case of multiple variables, weighting process may become more complicated. Using an actual dataset with Alteryx, the use case illustrates the survey weighting process with multiple weight variables step by step. We believe it is informative for those who might want to employ survey weighting in their research in any field.  

 
Related Resources
 
 
 
 
 
Attachments
Comments
sjain71
7 - Meteor

Can you please send me the excel sheets used in this workflow