Hi folks,
I am trying to build a classification model on a dataset in which most of the fields are 1's and 0's (Yes and No, respectively).
Question 1: Is Boolean the most appropriate data type for these fields?
Question 2: How would you advise I handle the missing values in these Boolean fields? Should I replace them? If yes, with what? If no, kindly advise. Thanks.
Hi @Deebo,
Really interesting questions! My answers would be:
Question 1 -> Yes, Boolean is the type to use when a field records whether a binary condition is met (Yes/No, 1/0).
Data-wise it doesn't matter whether you use Byte or Bool as the data type, since both take up 1 byte in Alteryx.
Question 2 -> Filter out the rows where the field holds neither "Yes" nor "No". In my opinion it's better to have a smaller, higher-quality dataset, especially when going into machine learning/statistics (otherwise you might make the wrong decision depending on how you fill in those empty values).
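Outside of Alteryx, the row filtering described above could be sketched roughly like this in plain Python. The field names (`smoker`, `churned`) and the sample rows are made up for illustration, and `None` stands in for a null value; this is only a sketch of the idea, not how the Filter tool works internally.

```python
# Hypothetical sample data: None represents a missing Yes/No value.
rows = [
    {"id": 1, "smoker": 1, "churned": 0},
    {"id": 2, "smoker": None, "churned": 1},
    {"id": 3, "smoker": 0, "churned": None},
    {"id": 4, "smoker": 0, "churned": 1},
]

# Assumed names of the Yes/No fields that must be populated.
flag_fields = ["smoker", "churned"]

# Keep only the rows in which every flag field has a value.
complete = [r for r in rows if all(r[f] is not None for f in flag_fields)]

# Share of rows lost by filtering; worth checking before committing to this approach.
dropped_share = 1 - len(complete) / len(rows)
```

Checking `dropped_share` first gives a feel for how much data the filter costs you.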
I hope some more experienced Alteryx user pops up and answers this question as well, really curious!
Greetings,
Seb
Hi @Deebo ,
With a binary field a Boolean type is ideal, but as @Sebastiaandb pointed out, null or blank values need to be handled. It's not necessarily optimal to simply remove them; you might want to use imputation if they are in a minority. This way you can fill them in with the mode, median, etc.
Alternatively, if you have a field with multiple values and you wish to retain null as a valid value, then you should consider using One-Hot Encoding. This is usually applied to categorical variables that have more than two values.
For example, if you have a "Country" column with values such as Italy, France, UK, Germany etc., then encoding it effectively pivots the data so that these values become the column headers, and each new column holds a Boolean indicating whether the row has that value.
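As a rough illustration of mode imputation for a 1/0 field, here is a minimal plain-Python sketch. The helper `impute_mode` and the `smoker` column are hypothetical names, and `None` stands in for a null; it is not a description of any Alteryx tool.

```python
from collections import Counter

def impute_mode(values):
    """Replace None entries with the most frequent non-missing value."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

# Hypothetical Yes/No column encoded as 1/0, with two missing entries.
smoker = [1, 0, None, 1, 1, None, 0]
imputed = impute_mode(smoker)
```

For a binary field the mode is the natural choice, since a median or mean can land on a value that is neither 0 nor 1.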
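To make the Country example concrete, here is a small plain-Python sketch of one-hot encoding that keeps null as its own category. The function `one_hot`, the `Country_` prefix and the `"Missing"` label are illustrative assumptions, not Alteryx names.

```python
def one_hot(values, missing_label="Missing"):
    """Turn a list of category values into one dict of 0/1 flags per row."""
    # Treat None (null) as a category of its own rather than dropping it.
    labels = [missing_label if v is None else v for v in values]
    categories = sorted(set(labels))
    return [
        {f"Country_{c}": int(label == c) for c in categories}
        for label in labels
    ]

# Hypothetical Country column with one null entry.
countries = ["Italy", "France", None, "UK", "Germany", "France"]
rows = one_hot(countries)
```

Each row ends up with exactly one flag set to 1, and the null row is flagged under `Country_Missing` instead of being removed.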
I hope this helps,
M.
Thanks a lot Bolide for your response, but I still feel like you didn't get me. May I break it down a little further?
I am trying to build a predictive model where most of the variables, and even the target variable, are Yes's and No's. So I converted them from Yes's and No's to 1's and 0's in order to use classifiers. Importing the data into Alteryx requires me to ensure I have the correct data types on all the variables. My question now is: if I change the data types to Bool in Alteryx, would it affect my overall model?
Hi @Deebo,
mmmm this goes beyond my level of knowledge ;-).
@mceleavey is way more knowledgeable than I am, so he might know the answer to that. Besides, @mceleavey, do you have any recommendations on books/online courses to refresh my statistics knowledge (it has been about 8 years since I left university and I seem to have forgotten some of the essentials haha)?
Sorry I couldn't help you out @Deebo
Greetings,
Seb