Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.

Last time I wrote a module that could parse web server logs according to the W3C spec. Although it works well, server logs can get very big and there can be a lot of them, typically one per day, so it is important that the parsing run as fast as possible. On one machine, parsing 100,000 records with 22 fields takes about 7 minutes. When web server hits often number in the millions, that is far too long just for parsing.
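As a point of reference for what the parser has to do: per the W3C extended log spec, each field is either a bare token or a double-quoted string in which quotes are escaped by doubling them. Here is a minimal Python sketch of the per-line split (an illustration of the format, not the actual RegEx tool configuration):

```python
import re

# A field is either a double-quoted string (with "" as an escaped quote,
# per the W3C extended log spec) or a bare run of non-space characters.
FIELD_RE = re.compile(r'"(?:[^"]|"")*"|\S+')

def parse_w3c_line(line):
    """Split one W3C extended log line into raw field tokens."""
    return FIELD_RE.findall(line)

line = '2010-06-01 12:00:00 GET /index.html "Mozilla/4.0 (compatible)" 200'
print(parse_w3c_line(line))
# ['2010-06-01', '12:00:00', 'GET', '/index.html',
#  '"Mozilla/4.0 (compatible)"', '200']
```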


The left side of this module is obviously the crucial part, because it parses the bulk of the records. The right side only parses the field names, which come from a single row, so we won't worry about its speed. Looking at the left side (screen shot below), some obvious speed issues stand out. The following tools require all the data to be sorted: Tile, Summarize, two Joins, and Crosstab. There are potentially 5 different sorts going on here! There may be fewer than 5 because Alteryx detects if a stream has already been sorted, but it's still bad. This is exacerbated by the fact that the RegEx tool is tokenizing to rows, which multiplies the number of records by the number of fields. That means we are sorting 2.2 million records instead of 100,000!
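To make that blow-up concrete, here is the rows-versus-columns difference sketched in pandas (a stand-in for the Alteryx tools, not how they are implemented):

```python
import pandas as pd

df = pd.DataFrame({"RecordID": [1, 2],
                   "Line": ["GET /a 200", "GET /b 404"]})

# Tokenize to rows (what the original module does): one output row per
# field, which the Crosstab later has to sort and pivot back into columns.
rows = df.assign(Token=df["Line"].str.split()).explode("Token")
print(len(rows))            # 6 rows from 2 records (3 fields each)

# Parse to columns instead: the record count never grows.
cols = df["Line"].str.split(expand=True)
print(cols.shape)           # (2, 3) -- still 2 rows
```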


So, in order to make this run faster, we want to avoid sorts and also avoid parsing to rows. The RegEx tool allows us to parse to columns as well as rows, but you have to tell it how many columns to expect ahead of time. In this case I just picked a number that is comfortably too high: 40, since our server logs have 22 fields. If your server logs have more, you might have to change that one tool. I then used a Multi-Field Formula instead of a Formula tool to remove the quotes and escaping, since all the fields are now in a single row. From there it goes into a Dynamic Rename to rename the fields, and then I get to use the new Dynamic Select tool to filter out all the extra columns that were parsed but never used. It now runs in 27.2 seconds! That is more than a 15x improvement in our throughput.
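Sketched in Python, the column-parsing approach amounts to this (MAX_COLS and unquote are names I made up for illustration):

```python
import re

MAX_COLS = 40  # deliberately more than the 22 fields our logs use

# A field is either a quoted string (with "" escapes) or a bare token.
FIELD_RE = re.compile(r'"(?:[^"]|"")*"|\S+')

def unquote(token):
    """Strip surrounding quotes and collapse doubled-quote escapes --
    the job the Multi-Field Formula does across all columns at once."""
    if token.startswith('"') and token.endswith('"'):
        return token[1:-1].replace('""', '"')
    return token

def parse_to_columns(line):
    """Parse one log line straight into a fixed-width row of columns."""
    tokens = [unquote(t) for t in FIELD_RE.findall(line)]
    # Pad to a fixed width so every record has the same schema; the
    # unused trailing columns are dropped later (the Dynamic Select step).
    return (tokens + [""] * MAX_COLS)[:MAX_COLS]
```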


There is still one temp file left. It is in the Dynamic Rename, because that tool has to wait for the field names to come in from the right input before it can release any records from the left. My first thought was to put a Sample tool on the right stream so it only looks at the first 100 records or so for the field names. Unfortunately, the way the Sample tool was written, it didn't release the record stream until its input was complete, even though it only pushes the first 100 records. Good thing I have access to the source code, though: I changed that behavior so it releases the stream as soon as it reaches its record limit. This didn't make any difference in performance for the 100,000-record test, because Alteryx was smart enough to keep that temp file in memory, but for larger data sets it makes a positive speed difference.
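The difference in the Sample tool's behavior, sketched as Python generators (an analogy for the engine's record streaming, not the real code):

```python
from itertools import islice

def sample_blocking(records, n):
    """Old behavior: buffer the entire input before releasing anything,
    even though only the first n records are ever pushed downstream."""
    return list(records)[:n]   # downstream waits for upstream to finish

def sample_streaming(records, n):
    """New behavior: push records through as they arrive, and stop
    pulling from upstream once the record limit is reached."""
    yield from islice(records, n)
```

With the streaming version, the Dynamic Rename gets its field names as soon as the first records arrive, so it can start releasing records from the left input immediately.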


As you can see, spending time optimizing Alteryx modules can make a huge difference in performance. Next time we will talk about converting this to a batch macro so we can parse a directory full of logs. The module from this blog can be downloaded here.


A quick addendum: I finally got around to running this test on a fresh x64 build of Alteryx. The same test ran in 20.4 seconds. Since the optimized module no longer writes temp files, the extra memory of x64 gives no advantage here; that extra 25% speed is purely the advantage of 64-bit code over 32-bit.
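Checking the arithmetic on that claim:

```python
x86, x64 = 27.2, 20.4          # seconds, from the two runs above
print(f"{(x86 - x64) / x86:.0%} less time on x64")   # -> 25% less time
```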


Below is a screen shot of the before and after for the relevant part of the module. You can see it is both simpler and uses fewer tools.


Before (7:02 minutes)

After (27.1 seconds)