
Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
ChrisF
Alteryx Alumni (Retired)

These days, if you're following any of the "big data" trends, you've almost certainly encountered a dictionary's worth of new terms that seek to explain what big data is and why it matters to you. Terms like "distributed", "MapReduce", "in-memory", and "shuffle", among many others. All combined, these terms create a kind of standard language that allows us to describe the problem space that big data presents in a way that is universally meaningful. That's great, right? Except that if you're really new to the whole big data thing (or even just the regular data thing) to begin with, the suggestion that any of these terms are "universally meaningful" is likely going to annoy you, at the very least.

 

I hear you, believe me.

 

That said, it turns out that the "big data" problem space is actually pretty complicated, and using abstractions like "shuffle" instead of "that part of the process where data goes flying all over the place and ends up being totally reorganized because sometimes you really need a set of values to be co-located in memory in order to do things like aggregate your data while grouping by a key" can be a real lifesaver. See what I mean? The abstractions make the whole process easier to talk about assuming you already understand what they are. With that in mind, I present to you, the reader, a bunch of stuff that I wish someone had told me early on.

 

What we talk about when we talk about big data

 

First things first: why is big data a problem that requires a special solution? Isn't it just...more data? Well, yes and no. While "big data" is, technically, also just "more data", when we talk about it as a thing unto itself what we really mean is "data that is so big it can't reasonably fit on one computer/server/whatever". That's a significant distinction because "more data" is a problem that we can solve all kinds of ways, while "big data" is one that necessarily demands its own special solution.

 

There are really two main questions when it comes to data that can't all fit on one machine:

 

  • Storage: How do we store all of this data?
  • Analysis: Once we've stored it, how do we derive value from it without wishing we'd never figured out how to store it in the first place?

 

The first problem (storage) is one for which there have been a number of solutions in the database world: SQL and NoSQL databases alike usually support several partitioning strategies, whether it's sharding (split the table by rows and keep different sets of rows on different servers) or "row splitting", also known as vertical partitioning (split the table by columns and keep different sets of columns in different places). Both of these strategies provide a solution, of sorts, to the storage problem, but not without some serious drawbacks. In order to do this effectively, you have to introduce a significant amount of complexity across the board. The database itself has to be architected in a way that supports your partitioning strategy, which means someone has to figure out how to partition your specific data in a way that makes sense, making general solutions to the problem harder to come by. On top of that, there's an added level of operational complexity: you have to think about things like what happens to your table if one of your servers crashes, or how you write meaningful SQL queries when your data lives in 5 different places that are all isolated from each other. These aren't necessarily deal-breakers by any means, and plenty of organizations have used these strategies to great success. The point is that this approach has its own set of headaches, and it was only a matter of time before other solutions came about.

 

So...Hadoop?

 

Yes, Hadoop. The de facto answer to the "there has to be a better way to store all this stuff" question. Hadoop has a lot of moving parts and I don't want to give the impression that it is somehow magically less complex than the approaches I described earlier (it's not), but it is interesting at the very least, and has certainly proven valuable in a number of different industries/use cases. Plus, the concepts behind it are key to understanding a lot of what has come after Hadoop. There's a lot to be said about Hadoop, but I think we can boil the big picture down to two main concepts: HDFS and MapReduce.

 

HDFS

Hadoop Distributed File System. This is the thing that lets you store your data on multiple machines (aka, a cluster) but treat it as if it lives on one machine. This is different from the solutions we talked about before where you have a table that is broken up into pieces very deliberately. In the HDFS scenario, you just start out with a really big file, and HDFS allows you to act like you have a really big hard drive to match.

 

The basic breakdown of HDFS is as follows:

 

  • A Hadoop cluster is (at least) one namenode and a bunch of worker nodes (also called data nodes in Hadoop lingo) all linked together via a network.
  • The namenode has a bunch of special responsibilities, chief among them being telling the worker nodes what to do.
  • You have a really big file (let's call it xyz) that you'd like to store, so you tell the namenode that you need to load it into your cluster.
  • And then some magic happens.
  • By "magic", I mean HDFS.
  • So, Hadoop takes your massive file, breaks it into a bunch of 64 MB chunks (called blocks), spreads those blocks across all the worker nodes in your cluster, and then remembers that you have a "file" called xyz that now lives in HDFS, even though it's no longer actually a single file (the short sketch after this list walks through the block math).
  • On top of spreading the blocks across the cluster, Hadoop also duplicates all of those blocks at least once and then makes sure those blocks are all sent to different nodes.
    • In other words, your data gets copied and distributed across the cluster in multiple ways so that the full dataset can be reconstructed in case one of your nodes crashes.
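
To make the block math a little more concrete, here's a tiny R sketch of the idea. This is purely illustrative, not actual HDFS code or its API; the file size, node names, and replication factor are all made up for the example. It just works out how many 64 MB blocks a pretend file would be split into and assigns each block, plus one replica, to different nodes:

# Illustrative only: not the HDFS API, just the block-splitting arithmetic.
blockSizeMb <- 64                                    # classic default HDFS block size
fileSizeMb  <- 1000                                  # pretend xyz is a 1 GB file
dataNodes   <- c("node1", "node2", "node3", "node4")

numBlocks <- ceiling(fileSizeMb / blockSizeMb)       # 1000 / 64 -> 16 blocks

# Assign each block to a primary node round-robin, and put its replica on the
# next node over, so losing any single node never loses a block entirely.
placement <- data.frame(
  block   = seq_len(numBlocks),
  primary = dataNodes[(seq_len(numBlocks) - 1) %% length(dataNodes) + 1],
  replica = dataNodes[ seq_len(numBlocks)      %% length(dataNodes) + 1]
)
print(head(placement))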

MapReduce

 

Once you've got your file loaded into HDFS, chances are you want to actually do things with it, but the fact that it's now spread all over your cluster means you need to interact with it in a different manner if you want to take full advantage of the distributed nature of HDFS. The idea here is that now that you've got pieces of your data spread across multiple machines, you'd like to treat that dataset as one complete file (e.g. "I want to count all the records in xyz"), but you want to leave all the data where it is and have each worker perform calculations on the data it has access to. This way, you're able to perform computations that would be impossible (or at least really inadvisable) if you tried to do it all on one machine. This is what MapReduce allows you to do.

 

However, before we go into the actual MapReduce paradigm, it's probably a good idea to have some insight into where the term "MapReduce" actually comes from. To that end, we'll use some very basic examples from R. In many programming languages, there are a set of functions referred to as "higher order functions", which is a fancy way of saying "a function that takes another function as an argument and does something with it". Map and Reduce are two such functions that are used very frequently when working with collections of data (like a list, for example), and seeing how they work gives us a good foundation for understanding the MapReduce process in Hadoop. Let's look at an example:

 

We'll start out with a collection of numbers. Our goal is to round all of these numbers to the nearest whole number, and then find the maximum and the sum. Here's what our starting collection looks like:

 

numbers <- c(1.3, 2.5, 4.9, 6.4, 15.7, 3.3, 4.9)
First up, we need to round each number in the collection. Since Map allows us to apply a function to each element of a collection and return a new collection, it'll be perfect for this task. We'll just apply R's round function to each element in our numbers collection and save the result:
 
numbersToInt <- Map(function(number) { round(number) }, numbers)

As a result of the Map, our original collection (1.3 2.5 4.9 6.4 15.7 3.3 4.9) becomes (1 2 5 6 16 3 5). (If you were expecting 2.5 to round up to 3: R's round follows the "round half to even" rule, which is why it becomes 2.)

 

Next up, we want to calculate the maximum and the sum for the new collection. There are definitely built-in functions for both of those things, but the basic principle behind any kind of aggregation function ultimately revolves around a Reduce function, so let's do it manually. Reduce is a little harder to wrap your head around than Map. Instead of simply transforming each element of a collection, the goal of Reduce is to step through the elements from left to right and keep a running tab of the result at each step. Don't worry, this'll make more sense in the example. Let's go ahead and find the maximum:

 

maximum <- Reduce(function(currentElement, nextElement) {
                    if (currentElement > nextElement) currentElement else nextElement
                  }, numbersToInt)
print(maximum) # This returns 16

In this case, Reduce walks through the collection and, at each step, checks whether the next element is bigger than the running result so far. If it is, that element becomes the new result, and when Reduce gets to the end of the collection, it returns whatever number ended up being the biggest.
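
If you want to actually watch that running tab being kept, base R's Reduce has an accumulate argument that returns every intermediate result instead of just the final one. This is just a side demonstration of the same call from above, nothing the example depends on:

# Same comparison function as above, but accumulate = TRUE keeps each
# intermediate "winner" as Reduce walks the collection left to right.
runningMax <- Reduce(function(currentElement, nextElement) {
                       if (currentElement > nextElement) currentElement else nextElement
                     }, numbersToInt, accumulate = TRUE)
print(unlist(runningMax)) # This returns 1 2 5 6 16 16 16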

The process for calculating the sum is very similar, except instead of comparing two values, we're just adding each new element to our running total and then returning the final result:

 

sum <- Reduce(function(currentElement, nextElement) {
                currentElement + nextElement
              }, numbersToInt)
print(sum) # This returns 38

Alright, so how does this help us understand how the whole distributed computing thing works? Consider the basic thought process here:

  • I have a collection of elements and I want to know something about all of them.
  • First, I use Map to do something to each element individually.
  • Then, I use Reduce to learn something about all of those elements combined.

 

This is pretty similar to the pattern we see when we run a MapReduce job in Hadoop, except instead of applying a function to each element of a list, we're applying it to the segment of the data that lives in each node of the cluster, and then combining the results into one coherent answer.
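
To tie that back to the R examples, here's a small sketch of the same pattern with the data split into pretend "nodes" (the three-way split is invented purely for illustration): each chunk computes a partial sum over only the data it holds, and then a single Reduce combines the partial results into one answer.

# Same collection as before, now split into chunks that we'll pretend
# live on three different worker nodes.
numbers <- c(1.3, 2.5, 4.9, 6.4, 15.7, 3.3, 4.9)
chunks <- split(numbers, rep(1:3, length.out = length(numbers)))

# "Map" step: each pretend node rounds and sums only its own chunk.
partialSums <- Map(function(chunk) { sum(round(chunk)) }, chunks)

# "Reduce" step: combine the partial results into the final answer.
total <- Reduce(function(currentElement, nextElement) {
                  currentElement + nextElement
                }, partialSums)
print(total) # This returns 38, same as before, without any one "node" seeing all the data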

 

To illustrate how a MapReduce job might work, let's consider the following scenario: You and 5 friends walk into a library and you decide that you want to know how many books contain the word "elephant" in the title. There are an awful lot of books in the library, so much so that one person really couldn't handle the whole job, so let's turn it into a MapReduce job (there's a small R sketch of the whole thing right after the list):

 

  • You (the namenode) tell your 5 friends (the worker nodes) that they are responsible for 1/5 of the shelves in the library and send them off to their respective areas, where they then wait for further instructions.
    • You have just loaded the library into HDFS. Congratulations :)
  • You want to know how many books contain "elephant" in the title, so you tell all of your friends to start counting, but to only count the shelves they've been assigned (i.e. you "Map" the counting job to all 5 friends individually).
  • You hang out while all of your friends count a ton of books. Ha, suckers.
  • As each of your friends finishes counting, they come back and tell you how many books they found. Once they're all done, you have a list of 5 different counts.
  • You (the namenode) then add up all 5 numbers to get the total count (the "Reduce" step).
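
As promised, here's what that library run looks like as a little R sketch. The shelf assignments and titles are completely made up, and grepl is just standing in for "does this title contain 'elephant'":

# Made-up shelf assignments: each element is one friend's chunk of the library.
shelves <- list(
  friend1 = c("The Elephant Keeper", "Moby Dick", "War and Peace"),
  friend2 = c("Elephants Can Remember", "Dune"),
  friend3 = c("The Art of R Programming", "Water for Elephants"),
  friend4 = c("A Tale of Two Cities"),
  friend5 = c("The Elephant Vanishes", "Hadoop: The Definitive Guide")
)

# "Map" step: each friend counts matching titles on their own shelves only.
counts <- Map(function(titles) { sum(grepl("elephant", titles, ignore.case = TRUE)) }, shelves)

# "Reduce" step: you add up the five counts your friends bring back.
totalElephants <- Reduce(function(currentElement, nextElement) {
                           currentElement + nextElement
                         }, counts)
print(totalElephants) # This returns 4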

 

Admittedly, this is a pretty simplified version of the process, but it should at least give you a 10,000-foot view of the "big data" world as it relates to Hadoop and the things that came before it. However, you may remember that earlier I said there are two main problems that need solving when it comes to big data. Hadoop technically provides solutions for both storage and analysis, but the analysis part (MapReduce) comes with a lot of baggage. MapReduce jobs are famously difficult to write, require a ton of disk I/O (writing to and reading from hard drives over and over isn't exactly the most efficient process), and only run as batch processes, which makes interactive analysis pretty difficult. This is where Spark enters the picture! In part 2 of this series, we'll cover the basics of Spark as a computing platform as well as how it relates to the existing Hadoop infrastructure.

Comments
hilton
7 - Meteor

Excellent article, thanks! This has helped me a lot, especially with the emerging demand for Big Data technologies, and how well Alteryx supports the new technologies.

 

I really look forward to reading Part 2, about Spark and how it works with Hadoop. And if you could also do a Part 3 about Cloudera or other similar Hadoop distributions, that would be useful too! Or just point me in the right direction of existing articles written like yours :)

MGA
7 - Meteor

Thanks so much, love the (funny and true) analogy, it really helps the concept sink in!