Hello,
I have a list of zip codes which I am trying to use in my predictive model. Converting zip codes into numbers through the Select tool turns 06001 into 6001. I need to keep the leading zero in the number. Is there a way I can have them as numbers without losing leading zeros?
Thanks
You should not be using a zip code in a model which requires the variables to be continuous measures!
I suggest you look at other models that let you use string fields as predictor variables.
Ben
@BenMoss it is a transportation problem where start and end destinations play an important role. I wanted to try a decision tree model which doesn't take string predictors.
The forest model does allow you to take in string fields though?
Irrespective of whether you want to include it in the model, you must make the right model selection for the data you have, not change the data to suit the model!
Ben
Here's a post that may be helpful:
'One of my favorite uses of zip code data is to look up demographic variables based on zip code that may not be available at the individual level otherwise...
For instance, with http://www.city-data.com/ you can look up income distribution, age ranges, etc., which might tell you something about your data. These continuous variables are often far more useful than just going based on binarized zip codes, at least for relatively finite amounts of data.
Also, zip codes are hierarchical... if you take the first two or three digits, and binarize based on those, you have some amount of regional information, which gets you more data than individual zips.
As Zach said, using latitude and longitude can also be useful, especially in a tree-based model. For a regularized linear model, you can use quadtrees: split the United States into four geographic groups, binarize those, then split each of those areas into four groups, and include those as additional binary variables... so for n total leaf regions you end up with [(4n - 1)/3 - 1] total variables (n for the smallest regions, n/4 for the next level up, etc). Of course this is multicollinear, which is why regularization is needed to do this.'
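To make the prefix and quadtree ideas from the quoted post a bit more concrete, here is a rough Python sketch. It is only an illustration, not the quoted author's code: the function names, the quadrant numbering, and the bounding box values are all assumptions, and it presumes you already have a latitude and longitude for each zip code from some lookup table that is not shown here.

```python
# Assumed bounding box roughly covering the contiguous United States
# (illustrative values only; adjust for your own data).
LAT_MIN, LAT_MAX = 24.0, 50.0
LON_MIN, LON_MAX = -125.0, -66.0


def zip_prefix(zip_code, digits):
    """Return the first `digits` characters of a zip code kept as a string."""
    return zip_code[:digits]


def quadtree_cells(lat, lon, depth):
    """Return one cell label per level, coarsest first.

    Each level splits the current box into four quadrants and appends the
    quadrant number (0=SW, 1=SE, 2=NW, 3=NE) to the label.
    """
    lat_lo, lat_hi = LAT_MIN, LAT_MAX
    lon_lo, lon_hi = LON_MIN, LON_MAX
    path = ""
    cells = []
    for _ in range(depth):
        lat_mid = (lat_lo + lat_hi) / 2
        lon_mid = (lon_lo + lon_hi) / 2
        north = lat >= lat_mid
        east = lon >= lon_mid
        path += str(2 * int(north) + int(east))
        cells.append(path)
        # Shrink the box to the quadrant that contains the point.
        lat_lo, lat_hi = (lat_mid, lat_hi) if north else (lat_lo, lat_mid)
        lon_lo, lon_hi = (lon_mid, lon_hi) if east else (lon_lo, lon_mid)
    return cells


def total_indicator_count(depth):
    """Binary variables over all levels: 4 + 16 + ... + 4**depth."""
    return sum(4 ** level for level in range(1, depth + 1))


cells = quadtree_cells(40.0, -75.0, depth=2)
count = total_indicator_count(2)
```

Each label in `cells` would become its own binary column, so a point contributes one indicator per level. With depth 2 there are n = 16 leaf regions, and the count of 20 matches the (4n - 1)/3 - 1 figure in the quote.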
Hi ayadav8,
When using zip codes in modeling, you want to treat them as a categorical variable. Even though they look numeric, the number doesn't actually mean anything other than identifying the area where an individual lives. It isn't like temperature, for example: when you increase or decrease temperature, the number of degrees (F or C) actually changes. What happens when you increase or decrease a zip code? It does not mean you are changing the amount of something. I like what someone mentioned earlier about using median income or population to describe the zip code with a numeric value.
-Tony
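For anyone who wants to see the categorical treatment outside of Alteryx, here is a minimal Python sketch. It assumes plain 5-digit US zip codes (no ZIP+4), and the helper names `normalize_zip` and `one_hot` are made up for illustration. The idea is to keep the zip code as a zero-padded string and turn each distinct value into its own 0/1 column instead of feeding the number into the model.

```python
def normalize_zip(value):
    """Return a 5-character zip string, restoring leading zeros
    that were dropped when the value was stored as a number."""
    return str(int(value)).zfill(5)


def one_hot(values):
    """Return (categories, rows): one 0/1 indicator per distinct value."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


# Zip codes as they might look after a numeric conversion.
raw = [6001, 90210, 6001]
zips = [normalize_zip(z) for z in raw]
categories, rows = one_hot(zips)
```

Here `zips` comes out as `['06001', '90210', '06001']`, so the leading zero survives because the value is a string, not a number. With many distinct zip codes this produces a lot of columns, which is one reason the prefix and demographic ideas mentioned above can be more practical.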
@asilva Yeah, I see what you're saying. Thanks!