Engine Works

ChrisF · ‎03-21-2016

In the first part of this article, I talked about the general problems that Big Data poses (e.g. storage vs. analysis), as well as some of the existing tools that help solve those problems. Database-specific solutions, as well as HDFS/MapReduce are widely used and provide answers to a number of questions that have, in the past, been very difficult to answer. Hadoop has become the de facto standard for this kind of work in a lot of ways, but, as many of you probably know, there's a new game in town: Apache Spark. In this article, I'll discuss how Spark fits into the overall Big Data ecosystem and also go into detail on some of Spark's main features, specifically in-memory processing, DAGs, and RDDs vs. DataFrames.

Spark: The Big Picture

Looking back at where part 1 left off, we've established that a core strategy for handling Big Data is to distribute the data across a cluster of machines and avoid moving it unnecessarily whenever we can. HDFS handles the storage portion, and MapReduce gives us a paradigm for working with data once it has been distributed. That said, MapReduce has some significant downsides: it relies entirely on disk I/O (reading and writing from the hard drive itself), is primarily suited for batch processing, and is not always the most intuitive paradigm to work with. Spark seeks to solve some of these problems by providing a platform for distributed computation that can be used in both interactive AND batch settings, and also makes use of the considerable amount of processing power available in more modern cluster hardware.

To put it more succinctly, Spark is:

An in-memory execution engine for distributed data processing
A solution to the "analysis" problem in Big Data (as opposed to the "storage" problem)
NOT a drop-in replacement for your entire Big Data stack

Alright, let's unpack that a bit.

In-Memory Execution

One of Spark's biggest innovations is the fact that it relies as much on the computer's memory (as opposed to its hard drive) as possible. This is crucial for its performance since reading and writing to RAM is much faster than it is to disk, especially in the case that you're iterating over a chunk of data multiple times (something that happens quite a bit in data analysis). This is not to say that Spark will only work with RAM. In fact, it absolutely uses the hard drive when it needs to (often referred to as "spilling over" to disk), but Spark's execution process will prefer memory over disk whenever possible. The fact that memory is so crucial to Spark brings up a key distinction when comparing Hadoop-based solutions to Spark: To get the most out of a Spark cluster, you will need to consider the resources available in your cluster. Whereas Hadoop can work on "commodity" (read: cheaper) hardware just fine, you'll likely need to worry more about the specific hardware configurations of your machines when working with Spark since you'll want to maximize the amount of memory available.

Analysis, Not Storage

A lot of the time, it's easy to hear about Spark and assume that it's an "all-in-one" solution for Big Data, when in reality it's much more useful to think of it as a component in a larger stack. Spark provides the data processing capabilities (the engine), but doesn't make any promises about storage or server resources. You still have to figure out how to handle storing and securing your data, and you also probably want to consider how you're going to integrate Spark into a cluster of machines that also has other services running (and, thus, other memory and CPU requirements). At first glance, this may seem like a glaring downside, but in reality it is, at worst, a reasonable fact of life when solving these kinds of problems and, at best, an opportunity to tailor your software stack to needs of your business. Do you already have your entire database architected in MongoDB? No problem, there's a connector for that. Redshift? Also fine.. In fact, due to the modular nature of the Spark ecosystem, you can actually use it with a lot of different data sources.

What about the rest of the stack?

Hopefully by this point it has become apparent that, although it has a lot going for it, Spark is not the only piece of the puzzle. As I mentioned above, you still definitely have to figure out how to store your data, but another big concern is how to manage your whole cluster. Spark comes with a built-in "standalone" mode, which basically means that it will handle the basics of how to marshall resources across all the different nodes in your cluster, but this is really intended more for testing and development than a full production use case. To illustrate the difference, let's consider the following scenario:

You have a cluster of 10 machines, each of which has 16GB of RAM. You have Spark and HDFS installed on your cluster, and all of your data is stored in HDFS. One morning, you decide you want to reconcile a bunch of different files in HDFS by aggregating everything on customer ID, joining all the data together, and then writing it back out to HDFS to serve as your master customer list. You fire up your Spark cluster to do this, write the job, run it, and everything goes fine. That afternoon, you decide that you'd like to do some predictive analytics on that new data set, so you again fire up your cluster, write the program to do it, and run it. And then your cluster crashes and you see a terminal window full of error messages complaining about "Heap Allocation" problems. What gives? Well, it turns out that some of your coworkers are now also running Spark jobs on the same cluster, and therefore you don't have quite the same resources as you did earlier in the day, but Spark is still pretending like you do. As a result, your Spark job crashed because it's trying to eat up a bunch of memory/disk/CPU resources that someone else happens to be using, and there isn't quite enough to go around.

To be fair, this scenario is definitely a simplification of what qualifies as "production use", and Spark's standalone mode has gotten much more robust recently, but the basic point remains: Distributed systems are (unsurprisingly) composed of a bunch of moving pieces, and making sure all the pieces work together, especially under a heavy load, is a problem worth considering. Spark helps out where it can, but you're typically better off using a tool that is purpose-built for this kind of problem. If you're interested in that kind of thing, check out YARN or Mesos.

Spark: The Details

Now that we've talked a bit about the overall landscape, let's dig into some of Spark's core concepts. Spark has a lot of different moving pieces, all of which exist at varying levels of abstraction, but there are a number of key pieces that should give you a pretty solid foundation for understanding what all of it means.

The RDD

A Resilient Distributed Dataset (RDD) is the core data abstraction in Spark and is a big part of Spark's high performance capabilities. Spark's RDD essentially provides the missing link between the cluster's storage layer and the "in-memory execution" part of Spark. RDDs are how Spark stores your data in memory, and the fundamental component of how it manipulates the data once your program starts running. When you get right down to it, and RDD is basically just a collection of data, similar to an Array in Python (or C, or Go, etc.) or a Vector in R. On the surface, it's a big list that serves as the building block for a lot of other data structures in Spark (more on this later).

The thing that makes RDDs special is that, in addition to "just" being a list of things, RDDs also know about their history. When you read a file into memory, it gets dumped into an RDD. From there, you can do all sorts of things to it (sort, filter, print, etc). All of these "methods" fall into one of two categories: transformations and actions. A transformation is something that manipulates the data and returns a new RDD (e.g. map, filter, sort), while an action is something that actually produces a new value (e.g. reduce, count). The key distinction here is that actions require Spark to actually know about what your data looks like at that very moment, whereas a transformation is something that can be done "lazily".

Here's an example:

I read a csv file containing a single column of words into Spark, which gives me an RDD in memory.
Next, I tell Spark I want to calculate the length of every element in the RDD. This returns a new RDD containing said lengths.
Then, I filter the RDD, saying that I only want the values whose length is < 10.
Finally, I tell Spark I want to know how many values are left.

The key thing to note is that in steps 2 and 3, I never explicitly told Spark I wanted to actually see the results. Both map andfilter are transformations, which basically just means that the RDD will come back and say "yep, we added it to the list. we'll get to it eventually". In other words, Spark hasn't actually done any work yet (hence the "lazy" part). On the other hand, when I get to the final step and tell Spark I want it to count how many records are left, Spark knows that there's no way it can give me that answer without also computing all the previous steps. count is an action, which means that it forces to Spark to look at the whole history of the RDD and do all the work necessary to produce that final result.

At this point, you might be wondering why this might be a useful feature. Valid question. The reasons are twofold: first, having the RDD be lazily evaluated means that Spark can do all kinds of things to optimize the full end-to-end process once it comes time to actually compute something. Second, the fact that the RDD keeps track of all the different transformations means that if one of your servers goes down, Spark can figure out how to redistribute the data across the rest of the cluster AND recreate all the work you've already done (this is the "resilient" part).

DAGs

If you've read much about Spark, you may have comes across the term "DAG Execution Engine". The example above is a relatively straightforward example of what people mean by this. DAG is an acronym for "Directed Acyclic Graph", which is a data structure that has three main components:

It is composed of a bunch of elements (nodes) and the connections between them (edges).
The edges are "directed", meaning that they are particular about the way each connection works. For example, node A might be able to connect to node B, but node B can't connect back to A.
The entire graph is structured such that if you started at one node it would be impossible to follow all the edges in such a way that you would end up back where you started (a.k.a "acyclic")

In the context of program execution, a DAG is useful because it serves as a mapping of how a Spark job flows from beginning to end, where each node in the graph is a task in the execution flow. With that mapping in place, Spark is able to perform a number of optimizations on your code at run-time (or compile-time, as the case may be), but it also serves as a really helpful mental model when trying to think about how data flows through any kind of system. If you find this part of Spark interesting, I definitely recommend checking out this blog post from Databricks about how Spark jobs are visualized in the web UI.

Shuffle

Finally, we come to "the shuffle", an oft-mentioned and more frequently maligned concept in world of distributed data. The shuffle operation is probably one of the more complicated aspects of the Spark world, but it comes up so often that I think it's useful to know a little bit about how it works, or, at the very least, why it exists.

Consider, for a moment, what it might look like if you took a bunch of index cards and wrote the names, addresses, and height of all your friends and family on them. Now mix up all those index cards (this part isn't the shuffle!) and put them in different rooms in your house. Now I'd like you to tell me the average height of your friends and family by city. Take 30 seconds and think about how you'd solve that problem (assuming you can't just go find all the index cards and pile them up on the kitchen table). Pay special attention to the "by city" part. The answer isn't one number, it's a bunch of numbers.

Ok, how do we solve this? Well, if we didn't care about each person's city, we'd sum up the height of all the index cards in an individual room and write down both the sum and the number of people, repeating the process until we'd covered all the rooms (or until our awesome friends who have agreed to be guinea pigs in this experiment are done counting up each room for us). Then, once we had the sums and counts for each room, we'd all them all up and calculate the average. However, in this example, we DO care about the city, which means that we essentially need to perform the same process as before, but we're adding a "group by" element. More importantly, we've made it so that we can't just reduce the answer down until we have a single number. Instead, we have to find the answer once for each city and stop. But wait! Our index cards aren't in any kind of useful order. Each room has people from all over the country, and trying to sum up/count all the people from just one single city will take forever if we have to go from room to room and sift through all the cards just looking for Boulder, CO. And if we have to do that over and over? No way.

So, what do we do? We make piles. We start in one room, go through all the cards and make a pile for each city we find. Then we do that for every other room. Then (this is the crucial step) we consolidate all the different piles so that all the Boulder piles end up in one room, all the Dallas piles end up in another room, and so on. Now that we've got all the piles built up and grouped together, we can find the average height just like before.

Congratulations, you have just performed a shuffle. 🙂 Joking aside, the "group by" example is actually a pretty classic case of why distributed processing can be hard. In theory, the whole "just break your data up into small, manageable pieces!" thing sounds awesome, and in many ways it is, but sometimes it introduces a new set of problems that you might not ever think about when all your data lives in one place. More specifically, a lot of the fundamental operations we take for granted (group by, join, and even sorting) are much, much easier to perform when we can assume all the data needed for that operation is located in the same place, so much so that Spark (and Hadoop, and really all parallelized MapReduce frameworks) go to great lengths to move all the data around until that assumption is true, despite the fact that it means spending a lot of time moving data between machines. That's the one thing we didn't want to do, right? Well, yes, except in the many cases where we basically have no choice. Luckily, Spark's execution engine does a lot of the work for you, namely by re-working your program so that it can minimize the number of shuffles that need to occur in a given job. This is an area where Spark has iterated over time, and they've been optimizing the shuffle algorithm more and more with every couple of releases, but it's still a good idea to at least think about the index card example as you're writing a Spark job in case you suspect that the your cluster isn't processing things as fast as it could. Looking for unnecessary shuffle is a good place to start looking for bottlenecks.

Wrapping Up

As you can probably tell by now, there's far more to be said about Spark than anyone can cover in a blog post (or two), but hopefully I've covered enough of the moving pieces so that you can go out and start to read more about how to start using all of this stuff yourself! My ultimate goal with this article (and the one before it) was to provide the resources that I wish had existed when I first started looking into Spark. A lot of the concepts were completely foreign to me, and trying to understand things that seemed straightforward suddenly became surprisingly difficult to parse when "graphs", "shuffles", and a lot of the other terms started getting thrown around, so if you're feeling overwhelmed by the Big Data world, rest assured that you're not alone. Finally, if you're still interested in learning more about the inner workings of Spark, the official website actually has some pretty good walkthroughs, and the Databricks blog has a number of good high-level posts discussing a lot of the things I touched on in this article. Finally, the Spark Youtube channel has all the talks from the past few Spark Summit conferences, so there's a wealth of resources to choose from if you're looking for more of a deep dive.

Good luck, and thanks for reading.