Hello
I have the following challenging scenario. In one datasource I have 18M rows, each row two columns: IP and Date; On the second datasource I have a controlled list of IPs that contain IP Range and Bot. This file has 1857 IP variations, some of the IPs are complete IPs (XXX.XXX.XXX.XXX) and others are IP ranges (XXX.XXX or XXX.XXX.XXX)
This is an example of what I need to achieve:
Datasource 1
IP | Date |
216.239.41.96 | 2018-01-01 00:00:00 |
64.233.173.197 | 2018-01-01 00:00:00 |
64.68.90.80 | 2018-01-01 00:00:00 |
209.85.238.11 | 2018-01-01 00:00:00 |
Datasource 2
IP range | Bot |
216.239 | |
64.68.90.80 | Bing |
64.233.173 | Yahoo |
209.85 | Other |
Final Table
IP | Date | Bot |
216.239.41.96 | 2018-01-01 00:00:00 | |
64.233.173.197 | 2018-01-01 00:00:00 | Yahoo |
64.68.90.80 | 2018-01-01 00:00:00 | Bing |
209.85.238.11 | 2018-01-01 00:00:00 | Other |
At this moment I am running a workflow that contains a gigantic IF CONTAINS formula with all 1857 possibilities and of course that's super slow.
I wonder if there is a more intelligent way of doing this?
Thanks in advance
Rodrigo
Solved! Go to Solution.
One possibility would be to turn the ranges into lists using a generate rows tool. Your controlled list of IPs would then serve as a lookup table which you would join to the first dataset on IP address.
Thanks Brandon
I've added a Generate Rows tool but I am not sure how to complete the IP ranges (2 and 3 parts) to transform then into IP with 4 parts.
Any examples I could follow?
Rodrigo
Would a Find Replace tool not work? I have it configured to find from the beginning of the string and appending the "Bot" field from DS2 when a match is found. Output matched your final table example, but maybe there's something i missed that wouldn't make this option feasible.
Hi @calloni
Here's another technique you could try
Split the IP and Bot IPs into subnets IP1-IP4. Use filters and corresponding joins to match. For BOT ranges that only have the first 2 subnets, only join on those 2, etc. Add a formula to indicate the number of subnets matched after each join, union everything and use the sample tool to pull out the record for each IP that matches the largest number of subnets. This last bit handles the case where you have subcategories of BOTs i.e
216.239 | |
216.239.1 | Google Mail |
216.239.2 | Google Calendar |
Note: I just made these up to demonstrate the concept. I have no idea if Google has sub categories of bots.
The end result looks like this. Remove any columns you don't need
Dan
User | Count |
---|---|
19 | |
14 | |
13 | |
9 | |
8 |