This started out as a simple idea for a blog post and grew out of control. Following up the article on Geocoding IP Addresses, I wanted to write a simple module to parse a web server's log and share it on this blog. It turned out to lead me down a rabbit hole.
First, let's look at a sample of a log file:
#Software: Microsoft Internet Information Services 6.0
#Version: 1.0
#Date: 2009-05-19 00:00:10
#Fields: date time cs-uri-stem c-ip cs(Referer)
2009-05-19 00:00:10 /pp/anp.php 88.80.7.248 -
2009-05-19 00:00:28 /pixel.gif 213.17.9.61 "http://home.hccnet.nl/t.amerongen/Uitwerkingen%20Hoofdstukken.htm
2009-05-19 00:00:29 /pixel.gif 82.75.139.91 "http://home.hccnet.nl/t.amerongen/Uitwerkingen%20Hoofdstukken.htm
2009-05-19 00:01:10 /pixel.gif 62.131.15.194 http://home.hccnet.nl/t.amerongen/Wereldklok.htm
2009-05-19 00:01:21 /pixel.gif 80.127.8.10 "http://home.hccnet.nl/t.amerongen/Uitwerkingen%20Systematische%
I don't have any log files I can give away, but plenty of people online leave theirs available for download. A quick Google search will turn up files of all different sizes and styles.
It might also be worth your time to read the specification for these files: http://www.w3.org/TR/WD-logfile.html. The good news is that this file format is an open standard used by most web servers. The bad news is that the standard is very flexible, which means we have to do a lot of work in Alteryx to parse all the permutations.
First up is the multi-line header. We have to split off all the header rows, find the one starting with #Fields, and parse it to get the field names out.
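The module does this step with Alteryx tools, but to make the idea concrete, here is a minimal Python sketch of the same logic (the read_field_names helper is mine, purely for illustration, not part of the module):

# A rough Python sketch of the header step, assuming the file layout shown above:
# skip the # directive rows and pull the column names out of the "#Fields:" line.
def read_field_names(path):
    with open(path) as f:
        for line in f:
            if not line.startswith('#'):
                break  # first data row reached; the header is done
            if line.startswith('#Fields:'):
                # everything after "#Fields:" is a space-separated list of names
                return line[len('#Fields:'):].split()
    return []

# For the sample above this returns:
# ['date', 'time', 'cs-uri-stem', 'c-ip', 'cs(Referer)']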
After that we have to parse all the non-header rows. Mostly it is a matter of splitting on whitespace. The catch is that a value can contain whitespace, in which case it will be wrapped in quotes, and a quoted value can itself contain quotes, in which case each internal quote is doubled up. After a bunch of trial and error, and some help from RegEx Coach, I settled on the following regular expression to do the parsing:
(?:^|\s)((?:[^"\s]+)|(?:"(?:[^"]*(?:"")*)+(?:"|$)))
Breaking that expression down, we get:
- (?:^|\s) - an unmarked expression: the token must start at the beginning of the line or after a whitespace character.
- ((?:[^"\s]+)|(?:"(?:[^"]*(?:"")*)+(?:"|$))) - a marked expression containing the token
- The token will be 1 of the following:
- (?:[^"\s]+) - 1 or more character NOT containing a quote or whitespace
- (?:"(?:[^"]*(?:"")*)+(?:"|$)) - a quote followed by the token followed by another quote or the end of the line
- (?:[^"]*(?:"")*)+ - and finally the body of the token. It is 0 or more of non-quote characters followed by 0 or more "" and the whole thing repeated 1 or more times.
It can be helpful to break a regular expression up like that; it really helps you see what you are matching. Anyway, using the Regular Expression tool in Split to Rows mode, we successfully got all the fields parsed out and (I think) matched the parsing rules from the spec.
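If you want to try the expression outside of Alteryx, here is a small Python sketch (the split_fields helper and the made-up sample line are mine, not part of the module) that applies it line by line and un-escapes the quoted values:

import re

# The field-splitting expression from above, compiled once.
TOKEN_RE = re.compile(r'(?:^|\s)((?:[^"\s]+)|(?:"(?:[^"]*(?:"")*)+(?:"|$)))')

def split_fields(line):
    """Split one log line into field values, honoring quoted tokens."""
    values = []
    for token in TOKEN_RE.findall(line):
        if token.startswith('"'):
            # strip the surrounding quotes, then collapse doubled quotes
            token = token[1:]
            if token.endswith('"'):
                token = token[:-1]
            token = token.replace('""', '"')
        values.append(token)
    return values

# A made-up line with a quoted value containing spaces and doubled quotes:
print(split_fields('2009-05-19 00:00:10 /pp/anp.php 88.80.7.248 "a ""quoted"" referer"'))
# -> ['2009-05-19', '00:00:10', '/pp/anp.php', '88.80.7.248', 'a "quoted" referer']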
Joining our field names together with our data was fairly simple, but it involved adding a RecordID, a few joins, an Append Fields and even a Cross Tab. When I first did it, it sure seemed like the simple way. The module can be found here. Again, you will have to use your own server logs, or find some via Google.
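For intuition, that Alteryx plumbing (RecordID, joins, Cross Tab) is doing what pairing each row's parsed values with the names from the #Fields line would do. A rough Python sketch of the same idea, reusing the two hypothetical helpers above:

def rows_as_dicts(path):
    """Yield each data row as a {field name: value} dict (hypothetical helper)."""
    field_names = []
    with open(path) as f:
        for line in f:
            line = line.rstrip('\n')
            if line.startswith('#'):
                if line.startswith('#Fields:'):
                    field_names = line[len('#Fields:'):].split()
                continue
            # pair the parsed tokens with the header's field names
            yield dict(zip(field_names, split_fields(line)))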
This module works great, but there are 2 problems that will be addressed in future blog posts.
1) Most servers write a new log every day, so we will want to read a whole batch of them. Here is a hint as to how we will do that.
2) When you parse a year's worth of data, it is slow. It's a lot of data, but there ought to be a way to speed it up.