Morning All,
I've recently had the need to geocode IP addresses - not a thing Alteryx has a built in provision for. Having looked at the community for ideas I've seen that it's a problem that has come up a few times over the years with not much in the way of solutions.
I thought it might be useful to share my experiences.
There are a few approaches I took before landing on my ultimate solution to the problem which I'll quickly outline -
- API based - Feeding a list to a API based system
- This ultimately wouldn't work for me as I have a large number of IP addresses to geocode and no budget to pay for a service.
- Free plans are typically limited to 10,000 queries per month.
- Using a free database to build my own - There are a few free (but less accurate) IP databases provided by the same companies that provide the API services. I ended up using db-ip (https://db-ip.com/db/). The problem with this method is how to efficiently match an IP address to the data.
- The data is across 3.2 million records, each representing a range of IP addresses. The total number of IP addresses covered being in the billions.
- The question became how to join this data given that Alteryx has no in built function to join if a value is >= x and <= y.
Methods I tried -
- Brute force 1 - You could just append the IP to the full list and then identify which range it falls in with a filter. Works, but very inefficient and very slow.
- Brute force 2 - Generate a massive database of all the unique IPs from the ranges and then join to that. Again super inefficient and felt unneccesary.
- Advanced Join Macro - Download the advanced join macro from the gallery - this does allow you to join a value that falls between ranges but after trying it, it was still extremely slow.
Ultimately I decided it was about narrowing down the intital group you match to.
My background is in Location Planning/Analysis so I'm well familiar with the spatial tools.
It occured to me that the spatial match tool is incredibly efficient at both loading large sets of data and performing match operations, with this in mind I decided to convert the IP ranges in the IP database to co-ordinates and create a point spatial object to represent them (I did this by making them fractions of the maximum possible IP and then multiplying them by 90, IP range start becomes the X co-ord, IP range end becomes the Y co-ord). I then embed this in the spatial match tool.
Then if you were to convert the IP you are interested in to co-ordinates as well (using the converted IP as both X and Y) you will get another spatial point. The next part of the process was to draw a small circle around the IP of interest and spatial match to full database. (the size of the circle determines the number of possible matches you will get), you can then just filter this small subset of possibles to get your answer.
Using this method I was able to bring the run time down to 1 minute per 10,000 geocodes.
If anyone is interested in this I'm happy to provide more detail.
Regards,
Ben