This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
When it comes to science, reproducibility is key. Scientists must be able to take previous work, understand the conditions under which it was run and then be able to re-run the experiment and obtain similar – if not identical – results. This enables not only trust and confidence in the results, but provides a solid foundation to extend the experiment and push the boundaries of scientific knowledge.
Data scientists working in commercial situations must also hold themselves to these rigorous standards if their work is to be built upon. An experiment should be documented and its steps made repeatable, either through techniques such as notebooks (as with the popular open-source Jupyter system) or through visual workflows as we see with Alteryx.
Keeping a close eye on the data science literature through blogs, tweets and other media, it’s possible to see evidence of these experiments taking place. Kaggle, for example, opens up data science through competitions with standard data sets and invites competitors to produce open models that can be trained, tested and ultimately published in a repeatable way.
Often, due to commercial sensitivities, a data science experiment cannot be made fully available for general consumption. Of course, this makes perfect sense – after all, who wants to give away the recipe to their new ‘secret sauce’ algorithm or technique? However, when data scientists do reveal their methodologies and data sets it can be a huge enabler for others in the field to build on their efforts.
A recent blog post by data scientist Dorian Beganovic takes a dataset of approximately 13 million records of open data (from the Dutch vehicle registration records) and puts Python’s Pandas library head-to-head against Apache Spark for data preparation activities, prior to more advanced algorithms.
Pandas is a hugely popular open source data manipulation library, allowing users to interact with data through a data-frame (much like the concept in the R language). Apache Spark, as Dorian notes, is quickly becoming one of the most popular data analysis platforms with support for streaming, machine learning and a SQL-like front-end.
The findings from the blog are interesting. Given a reasonably powerful desktop setup (32Gb RAM for the main Python work, Spark nodes over 3 machines of comparable specification) Pandas appears to outperform Spark on datasets smaller than 10 million records. As a conclusion, he recommends Spark as a data analysis solution only for much larger datasets (e.g. 500Gb or above) with a scaled-up Pandas server supporting smaller datasets.
As the author kindly provided both a link to the underlying dataset and the methodologies he used for testing the two tools (five common data science preparation operations including querying, sorting, joining and grouping) it seemed like a perfect opportunity to put my Alteryx Designer to the same test. After all, as a citizen data scientist with a powerful laptop and the leading data analytics platform at my disposal, what kind of performance could I expect without needing to write any code?
Preparing for the experiment
I downloaded the full 13.6 million rows of the open dataset in CSV format and created a number of alternative copies to test some different scenarios. In Alteryx, reading from CSV isn’t the only game in town for the canny analyst. The Alteryx database format (YXDB) often provides a much more efficient storage and retrieval capability. There’s also the oft-overlooked Calgary database format (CYDB) that – with a little upfront preparation – can deliver astonishing performance, right from the analyst’s personal machine.
These data preparation steps felt valid as part of a good data architecture – as an architect, I’d want to make sure that my information consumers have access to the data in the appropriate format to do their job as effectively as possible. The conversion of CSV to YXDB is extremely rapid, taking mere seconds, even for the full dataset. The conversion to CYDB takes slightly longer, factoring in the need to build indexes against all the columns in the dataset.
Once the data was ready in these three formats, I installed the latest version of the Anaconda Python distribution (including Jupyter notebooks) so that I could follow Dorian’s previous experiment step-by-step, as well as create Alteryx workflows that replicated those steps for CSV, YXDB and CYDB formats.
As per the instructions, I used Alteryx to create a number of subsets (using the Sample tool) – splitting the data into 100k, 1m, 5m, 10m and 13.6m samples. The latter represents the full dataset.
As the datasets were produced, I analysed the relative file-sizes for the different formats, as shown in the table below.
SAMPLE SIZE (RECORDS)
File compression using different formats. All sizes in Mb
Smaller is better!
My experimental hardware was relatively modest compared to the previous effort – a Dell i7 (6600U CPU @2.60 GHz) with 16 Gb of RAM and a solid state drive. In terms of software, I used Windows 7 Enterprise with Alteryx 11.0.3 and compared against Python 3.4 (ensuring that I restarted the Python kernel between runs to effectively ‘reset’ the state).
I7 6600U CPU @2.60 GHz
All my experiments were run 3 times (restarting Alteryx each time) and the results below are a mean (average) timing for those results. This was to reduce experimental anomalies due to, for example, other processing running periodically on the computer.
In order to compare like-with-like, it’s important to understand that a Pandas routine includes a data ingestion step, prior to analysis, where the data frame is initially populated. This is akin to the Input Data tool step in Alteryx. To ensure that we’re comparing the different tools fairly, the Pandas analysis-only step is analysed, but the time needed to ingest the data from a ‘cold start’ was also measured separately over 3 runs and this ‘mean cold-start’ time is also included in the analysis as the actual elapsed time to ingest and process the query in Python.
The mean cold-start times are shown below:
PANDAS MEAN COLD START /S
In my next post, I'll break down each of my experiments and give my experimental analysis. I conducted 5 experiments in all: Complex Joins, Sorting, Joining, Self-Joins, and Grouping.