Engine Works Blog

Under the hood of Alteryx: tips, tricks and how-tos.
6 - Meteoroid

As a recovering DBA I still reach into my bag of tricks when having to do low level data movement. It's a habit hard won, but I'm breaking it. And Ned gave me a little more push in that direction.


We have a fairly large static table that is the source for some reporting that has been giving us headaches. We've indexed, dropped unnecessary columns, run every applicable DBCC and still the table would not clean up. Rowcount multiplied by maximum row width should give you an approximation of the table size; that number was half what was sitting on the database. It was bloated. I think the pages were so torn that the data couldn't reorganize into a tidy bundle. So, I was going to go old school and BCP the data out, truncate the table and BCP the data back in. Simple enough. The BCP in would arrange the data tightly. And, assuming the clustered index wasn't insane, the data would remain compact. I expected 1.5 hours BCP out, 1.5 hours back in and another hour to lay indices.


I mentioned my plans to Ned.


He reminded me that we have the data in a Calgary data file. I didn't need to BCP out; I could use the Calgary file as my source. Fantastic, one step out of the way. So, I wrote a quick module to load the data from Calgary into MSSQL Server.


Given the large numbers of records, I set the Transaction Size option (#8) on the output to 10000. Setting the Transaction Size will allow the log to commit. Without this option set, my log would have grown and I didn't want to have that headache too.


The data loaded in 15 minutes. Fifteen. In all fairness, the Calgary file only contained 25 columns, a quarter of the original source. Nevertheless, Alteryx loaded 15 million records from Calgary into MSSQL Server in 15 minutes. I think the time to build the Calgary data file was approximately 1.5 hours.


With the necessary indices (a unique clustered, a covering and 4 others) the total size on the database is less than 6GB. The original size was 58GB.


While this isn't necessarily a best of breed example for Calgary's use, I think it demonstrates in real-world terms how it can help solve some data management nightmares. We needed a large chunk of data pulled out of a RDBMS and put back in. Additionally, we can copy/move the Calgary data file to any other server that needs this source data and load it up quickly. We know that the data will be the same on each server because the Calgary data file is atomic. We can check it into our source control to have a point-in-time recovery for this reporting source. Going forward, we'll use the Calgary data file as the source in our web applications too.