Alteryx Designer


How do the prediction tools generate dummy variables?

6 - Meteoroid

I have noticed that, for example, the Linear Regression tool can automatically transform categorical variables into numerical ones or into vectors.

How exactly is this done?

Or: where can I find the explanation?

 

So far I have converted them "by hand". For example:

Assume we have one categorical feature with n different values A_1,...,A_n.

I assign to A_i the i-th standard basis vector (0,...,0,1,0,...,0), with the 1 in position i.

 

Toy example:

id  cat
1   A_1
2   A_2
3   A_2

is transformed to

id  cat_A1  cat_A2
1   1       0
2   0       1
3   0       1
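The "by hand" conversion in the toy example can be sketched in a few lines of plain Python. This is only an illustration of the idea, not what any Alteryx tool runs internally; the function name `one_hot` and the column naming scheme (`cat_A_1` rather than `cat_A1`) are my own assumptions.

```python
def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator column per distinct value."""
    # Sort the distinct values so the column order is reproducible.
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one.
        out = {key: value for key, value in row.items() if key != column}
        # A_i becomes the i-th standard basis vector.
        for level in levels:
            out[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(out)
    return encoded


rows = [
    {"id": 1, "cat": "A_1"},
    {"id": 2, "cat": "A_2"},
    {"id": 3, "cat": "A_2"},
]
encoded = one_hot(rows, "cat")
for row in encoded:
    print(row)
```

Each row ends up with exactly one 1 among the indicator columns, which matches the table above.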

 

As a next step I would ask:

Where in the tool can I change how this transformation is done? For example, instead of the transformation above, I would like to use some kind of binary encoding,

i.e. implement an injective map from my categories into a vector space over the field with two elements.
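To make the binary encoding concrete, here is a minimal sketch in plain Python. It numbers the distinct categories 0,...,n-1 and writes each number as a bit vector of length ceil(log2(n)), so the map is injective and becomes bijective exactly when n is a power of two. The function name `binary_encode` and the alphabetical ordering of the categories are assumptions made for the illustration.

```python
def binary_encode(values):
    """Map each distinct value to a fixed-width tuple of bits."""
    levels = sorted(set(values))
    # Number of bits needed to write the indices 0..n-1 (at least one bit).
    width = max(1, (len(levels) - 1).bit_length())
    codes = {
        level: tuple((index >> bit) & 1 for bit in reversed(range(width)))
        for index, level in enumerate(levels)
    }
    return [codes[value] for value in values], codes


encoded, codes = binary_encode(["A_1", "A_2", "A_3", "A_2"])
for level, bits in codes.items():
    print(level, bits)
```

One caveat: in a linear model the bit columns are shared between categories, so the fitted values will generally differ from those obtained with one indicator column per category.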

Alteryx Certified Partner

If you're looking for a process to transform categorical fields into dummy variables for modeling, I have attached a solution I built a while back. Let me know if this works for you. 

 

 

6 - Meteoroid

Hello CharlieS!

 

Thank you for your workflow, but I have already built such transformations on my own. Nevertheless, I will take a close look at your solution,

as one can always learn something new. :-)

 

However, the question remains:

How is this implemented in Alteryx, given that certain tools seem to do it automatically, at least to some extent?

 

More precisely (although the following is simplified):

"By accident" I plugged some categorical data into the Linear Regression tool, and the output suggests that

the distinct entries in the categorical columns were identified and transformed.

To keep it simple: one categorical column was handled quite well, but another categorical column was "ignored".
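One possible reading of such output, stated purely as an assumption: if the tool hands the data to an R-style regression routine, categorical columns would typically be expanded with treatment (reference-level) coding, which is R's default for unordered factors. One level, by default the alphabetically first, is dropped and absorbed into the intercept, and each remaining level gets its own 0/1 column. The sketch below imitates that scheme in plain Python; the function name `treatment_code` is hypothetical, and nothing here is confirmed to be what the Alteryx tool does.

```python
def treatment_code(values, name="cat", reference=None):
    """Build 0/1 columns for every level except the reference level."""
    levels = sorted(set(values))
    if reference is None:
        # Assumption: the alphabetically first level is the reference.
        reference = levels[0]
    kept = [level for level in levels if level != reference]
    # Coefficient-style names: variable name followed by the level.
    columns = [f"{name}{level}" for level in kept]
    matrix = [[1 if value == level else 0 for level in kept] for value in values]
    return columns, matrix


columns, matrix = treatment_code(["A_1", "A_2", "A_2"])
print(columns)
print(matrix)
```

Under this scheme a column with n levels contributes n-1 coefficients, so one level never appears in the coefficient table even though it was used.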

 

As I do not know how this was done, I can only guess (hence this thread).

One reason could be that the "good" column contains only a few distinct categories, whereas the "bad" column contains many. Hence any statistic we apply is more reasonable for the "good" column.
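A quick way to test this guess is to count, for each categorical column, how many distinct values it has and how few rows the rarest value covers. The sketch below does this in plain Python; the column names `good` and `bad` and the sample rows are made up for the illustration.

```python
from collections import Counter


def level_summary(rows, column):
    """Return (number of distinct values, size of the rarest value)."""
    counts = Counter(row[column] for row in rows)
    return len(counts), min(counts.values())


# Hypothetical data: "good" has few levels, "bad" has one level per row.
rows = [
    {"good": "x", "bad": "u1"},
    {"good": "x", "bad": "u2"},
    {"good": "y", "bad": "u3"},
    {"good": "y", "bad": "u4"},
]
good_summary = level_summary(rows, "good")
bad_summary = level_summary(rows, "bad")
print("good:", good_summary)
print("bad:", bad_summary)
```

A column in which almost every value occurs only once leaves very little data per category, which would fit the behaviour described above.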
