Hi,
I want to test my data, but it is not normally distributed, so I don't know which test to use. My data contains two groups that I want to compare. The variable measures the time between two events in days (average = 7 days). The non-normality is caused by some outliers, which I need to filter out.
1) What is the best way to handle outliers (z-score?)
2) Which test can I use to compare both groups?
I hope you guys can help me.
This is a question that involves some feature engineering and data science knowledge. Perhaps you can treat anything with a z-score below -3 or above +3 as an outlier and drop it, thereby reducing the number of observations in your analysis.
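As a rough illustration of the ±3 z-score rule, here is a minimal Python sketch. The function name `drop_zscore_outliers` and the sample `days` values are made up for this example (they are not from the poster's data), and it uses the population standard deviation, which is an assumption.

```python
from statistics import mean, pstdev


def drop_zscore_outliers(values, threshold=3.0):
    """Keep only values whose z-score lies within +/- threshold."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        return list(values)  # all values identical, nothing to drop
    return [v for v in values if abs((v - mu) / sigma) <= threshold]


# Made-up example: days between two events, with one extreme value
days = [0, 1, 2, 3, 3, 4, 5, 5, 6, 7, 7, 8, 9, 10, 11, 12, 13, 14, 15, 400]
filtered = drop_zscore_outliers(days)
print(len(days), "->", len(filtered), "observations")
```

Keep in mind that the mean and standard deviation used here are themselves pulled by the outliers, so on small or heavily skewed samples this rule can miss values you would consider extreme.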
Your 2nd question is a bit vague. What are you testing? What are you testing between? Not very clear...
Also, it helps if you can provide your data or your workflow. Please provide relevant data to this use case, and kindly provide your criteria in as much detail as possible. If you have a workflow built halfway, kindly export that over as well.
To export a workflow go to: Options > Export Workflow. Kindly do NOT send a "Save As" copy.
Hi caltang,
Thank you for your response. I want to know which group is the quickest. However, the tests I know for comparing the means of two groups assume that the data is normally distributed.
Without the data, no one can give you good help. Can you kindly provide some data so that the community has something to work with...?
My apologies I just saw this. Based on your data, can you explain a bit more about “Days”? Is it TAT?
Having 0s will affect your standardisation.
Yes, the days are the time between two events. So 0 means the first and second event happened on the same day. This matters to me because it means the time between events is short. The longer the time between events, the worse we expect the result to be. That is what I want to test.
@Wouterrrrrr this resource is fantastic if you're looking for predictive value: https://community.alteryx.com/t5/Data-Science/Alteryx-Predictive-Tools-Flowchart/ba-p/602881
if you're looking at which statistical test to apply, I think flowcharts are invaluable (source: https://onishlab.colostate.edu/summer-statistics-workshop-2019/which_test_flowchart/).
Also run your ideas, assumptions etc. through ChatGPT/Google.
All the best,
BS
The right skew will exist even after you calculate your z-scores. The z-score doesn't inherently change the shape of the distribution.
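To see why, here is a small Python sketch that compares the skewness of some made-up right-skewed values before and after converting them to z-scores. The helper names and the sample data are assumptions for illustration only.

```python
from statistics import mean, pstdev


def zscores(values):
    """Standardise values to mean 0 and standard deviation 1."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / sigma for v in values]


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    return sum(z ** 3 for z in zscores(values)) / len(values)


# Made-up right-skewed data: mostly short gaps, one long one
days = [0, 0, 1, 1, 2, 3, 5, 8, 13, 40]

raw_skew = skewness(days)
z_skew = skewness(zscores(days))
print(raw_skew, z_skew)  # the two values agree up to rounding error
```

A z-score only shifts and rescales the values, so the skew stays the same. A non-linear transform or a rank-based test is what changes the picture.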
I'm pretty stumped myself - you may have to try other methodologies to cater to your non-standardized data.
Perhaps you can try standardizing the days data this way:
But is your research question: Is there a difference in days between Closed Won and other stages as you expected?
H0: No difference between Closed Won and other stages
Ha: There is a difference between Closed Won and other stages
If you do reach standardization, this article is useful: https://www.thedataschool.co.uk/liu-zhang/test-for-the-difference-in-the-mean-t-test-in-alteryx/
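If the data cannot be made approximately normal, one commonly used alternative for comparing two independent groups is a rank-based test such as the Mann-Whitney U test, which does not assume normality. Nobody in this thread has confirmed it is the right choice for this dataset, so treat the following as a sketch only. It is a plain-Python version using the normal approximation with a tie correction and no continuity correction; in practice you would more likely call `scipy.stats.mannwhitneyu`. The group values are made up.

```python
import math


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def mann_whitney_u(a, b):
    """Return (U for group a, two-sided p-value via normal approximation)."""
    n1, n2 = len(a), len(b)
    combined = list(a) + list(b)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2

    # Variance of U under H0, corrected for ties
    n = n1 + n2
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u1, 1.0  # every value identical: no evidence of a difference

    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return u1, p_value


# Made-up days-between-events for the two groups
closed_won = [0, 0, 1, 1, 2, 2, 3, 4]
other = [5, 6, 7, 8, 9, 12, 15, 30]

u_stat, p_value = mann_whitney_u(closed_won, other)
print(u_stat, p_value)
```

A small p-value would lead you to reject H0 (no difference between the groups). Note that this test compares the groups' rank distributions rather than their means, and the normal approximation is rough for very small samples.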