Alteryx Analytics Hub


Boosted Model - Format of Data




I've settled on a Boosted Model as my approach to predicting unpaid invoices.


The data I'm feeding is essentially as follows:


Target: Unpaid Amount (Numeric) 


1) Invoice Age (e.g. 32 days old, 69 days old) --> Numeric

2) Invoice Aging Bucket (e.g. < 30 days, < 90 days) --> Categorical

3) Invoice Region (e.g. Asia Pacific, Europe, Americas) --> Categorical

4) Invoice Sub-Region (e.g. Middle East, Northern Asia, India) --> Categorical


My questions concern the Categorical variables, which tend to have a lot of levels (e.g. 5 regions, and something like 15 sub-regions):


1) Do I need to use one-hot encoding?


The initial data comes in as one single column per variable (e.g. Region, Sub-Region, or Aging Bucket).

I have used One-Hot Encoding to expand each of these into a large number of columns that are either 0 or 1.

A sample probably has >50 columns, but the gist of it is below. If Americas = 1, then either USA or Canada will be 1, but not both.


Invoice amt | Age | Aged <30 | Aged >30 <90 | Americas | USA | Canada
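To make the layout above concrete, here is a minimal Python sketch of the kind of one-hot encoding I mean. The rows, the field names (`invoice_amt`, `region`, `sub_region`) and the `variable=level` column naming are made up for illustration; this is not my actual data or the Alteryx workflow itself.

```python
# Hypothetical rows; amounts, ages and region values are illustrative only.
rows = [
    {"invoice_amt": 1200.0, "age": 32, "region": "Americas", "sub_region": "USA"},
    {"invoice_amt": 560.0, "age": 69, "region": "Americas", "sub_region": "Canada"},
    {"invoice_amt": 980.0, "age": 12, "region": "Europe", "sub_region": "Germany/Austria"},
]


def one_hot(rows, column):
    """Replace `column` with one 0/1 indicator column per distinct level."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every field except the categorical one being expanded.
        new_row = {key: value for key, value in row.items() if key != column}
        for level in levels:
            new_row[f"{column}={level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded


# Expand region first, then sub-region.
encoded = one_hot(one_hot(rows, "region"), "sub_region")
for row in encoded:
    print(row)
```

Each row ends up with exactly one 1 among the region columns and exactly one 1 among the sub-region columns, which is the "USA or Canada, but not both" pattern described above.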



2) If One-Hot Encoding is the way to go - how do I determine how much influence region / sub-region have? 
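One approach I have seen suggested, and I am not sure it is the right one, is to take whatever per-column importance score the model reports and add it back up per original variable. Below is a rough Python sketch of that idea. The scores are placeholder numbers I made up (not output from my model), and the `variable=level` column naming is just an assumption for the example.

```python
from collections import defaultdict

# Placeholder importance scores per one-hot column (made-up numbers, not model output).
# Column names assume a hypothetical "<variable>=<level>" naming scheme.
column_importance = {
    "age": 40.0,
    "region=Americas": 10.0,
    "region=Europe": 5.0,
    "sub_region=USA": 20.0,
    "sub_region=Canada": 15.0,
    "sub_region=Germany/Austria": 10.0,
}


def importance_by_variable(column_importance):
    """Sum importance over all indicator columns derived from the same variable."""
    totals = defaultdict(float)
    for column, score in column_importance.items():
        # Everything before the first "=" is treated as the original variable name.
        variable = column.split("=", 1)[0]
        totals[variable] += score
    return dict(totals)


grouped = importance_by_variable(column_importance)
for variable, score in sorted(grouped.items(), key=lambda item: -item[1]):
    print(f"{variable}: {score}")
```

Is summing across the 0/1 columns like this a fair way to compare region against sub-region, or does the encoding itself distort the comparison?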


3) If One-Hot Encoding is not the way to go - what is the reason for this? I was led to believe that's how categorical variables must be inputted for the model to work. 



My model works great at Global scale so far. But at the regional level, it diverges pretty significantly. And I can forget about applying it to sub-regions. 
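To show what I mean by "diverges", here is a small Python sketch of how I compare totals. The records are made-up numbers chosen so that over- and under-predictions cancel out globally; they are not my real results.

```python
from collections import defaultdict

# Made-up (region, actual unpaid, predicted unpaid) rows for illustration only.
records = [
    ("Americas", 100.0, 130.0),
    ("Americas", 50.0, 60.0),
    ("Europe", 200.0, 170.0),
    ("Europe", 80.0, 70.0),
]


def totals_by_region(records):
    """Return {region: (total actual, total predicted)}."""
    totals = defaultdict(lambda: [0.0, 0.0])
    for region, actual, predicted in records:
        totals[region][0] += actual
        totals[region][1] += predicted
    return {region: tuple(pair) for region, pair in totals.items()}


def pct_error(actual, predicted):
    """Signed percentage error of the predicted total against the actual total."""
    return 100.0 * (predicted - actual) / actual


regional = totals_by_region(records)
global_actual = sum(actual for actual, _ in regional.values())
global_predicted = sum(predicted for _, predicted in regional.values())

print(f"Global: {pct_error(global_actual, global_predicted):+.1f}%")
for region, (actual, predicted) in regional.items():
    print(f"{region}: {pct_error(actual, predicted):+.1f}%")
```

In this toy case the global total matches exactly while each region is off in opposite directions, which is roughly the pattern I am seeing.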


My goal would be to have it at least work at a regional level. So I am wondering whether I am doing the one-hot encoding wrong (are there alternative methods?), or whether I should instead train a separate model for each region, since I have 4 regions, rather than one global model with region as a category.


I've attached a small snippet in which about 52 columns are shown. In reality, there are a lot more combinations of these categorical variables: probably more like 170 one-hot-encoded columns, most of which are 0 because some of these categorical variables have >20 options.


In the sample provided, you can see:

-->Invoice is $2M --> It is <30 days since printed. It is <0 days since due. The Aging Bucket is <30 days. The Region is Central Europe. The Sub-Region is Germany/Austria. The customer is a Private Company. The Quarter is Q4 (seasonality).