Data Science

Machine learning & data science for beginners and experts alike.
GaryS
Alteryx

2018-03-05 UPDATE: With the release of Alteryx Analytics v2018.1, the new Code Tool for Apache Spark is now Generally Available.  To gain access to this new feature, you can upgrade to our latest release or try it out with a free 14-day trial.  You can also watch this in-depth demo of the features and capabilities in action.

Introduction

At our annual Inspire conference in Las Vegas earlier this year, I was thrilled to announce the new Direct Connection for Apache Spark functionality that my team has been developing.  Here's a shameless plug for the presentation: 

[Embedded video: Inspire Las Vegas presentation]

Our team's technical lead, Steve Ahlgren @SteveA, gave another demonstration of Spark Direct at the Inspire Europe conference last month:

[Embedded video: Inspire Europe demonstration]

So, what is the Direct Connection for Apache Spark?  To quote Steve, "It's a seamless integration between Alteryx Designer and Apache Spark."  It provides the "best of both worlds" by combining the ease of use of Alteryx with Spark's cluster-scale computing.  By using our existing In-DB tools to talk to Spark in its native languages (Python, Scala, and R), and by extending our built-in capabilities with the new Code Tool for Apache Spark (referred to as the Custom Spark tool in the videos), we are able to expose the power of the Data Lake to the non-programmer for the first time.

 

As we move toward releasing Spark Direct (a Beta is coming soon!), the many customers and prospects we've talked to are excited about its potential.  And one common theme seems to keep coming up in these conversations:  companies are struggling to make full use of their investment in building out a Data Lake, and they see Spark Direct as the answer.

 

In this blog post, I want to explore this problem and show you how Alteryx Direct Connection for Apache Spark addresses it.

 

Data Lake or Data Swamp?

Figure 1 - I like to think we have more of a "Sparse Emergent Data Marsh".

Ever since the term Data Lake was introduced by James Dixon, CTO of Pentaho, it’s been through the typical tech cycle of buzz, hype, and blowback.  You can find articles that run the gamut, from Data lakes: An emerging approach to cloud-based big data, Why Do I Need A Data Lake, and The Business Case for Big Data Marketing on one side of the spectrum, to Gartner Says Beware of the Data Lake Fallacy, 3 Reasons NOT to Take the Data Lake Plunge, and 5 Reasons Your Data Lake is Failing – And What You Can Do About It on the other.  Even the venerable Martin Fowler has weighed in on the subject.

 

Here within Alteryx, I’ve been part of many conversations arguing that most companies don't actually need technologies like Hadoop and Spark, and the reasoning is compelling.  Overall, the majority of problems that data analysts are looking to solve don’t require access to petabytes of data or the horsepower needed to crunch it.  And honestly, who has the time to become proficient in all of that complexity?

 

And yet companies are still making huge investments in building Data Lakes.  BusinessWire reported last year that the Data Lake market was expected to grow from $2.53 billion in 2016 to $8.81 billion by 2021.  Why make such huge capital investments in technology if you can get by without it?

 

It seems to me there are three primary reasons:

 

Reason 1:  Big Data

Figure 2 - How many bananas in a petabyte?

Let’s get this one out of the way first.  Of all the hyped tech trends over the years, Big Data might just be the hype-iest.  There is no firm definition of what exactly constitutes Big Data, and the ensuing arguments over whether any given dataset is big enough to qualify as BIG are ultimately pointless.  What we do know is that the amount of data being collected is increasing exponentially, and companies are eager to make use of that data.  The bigger that data gets, the harder it becomes to move into and out of traditional analytics platforms.  As this trend continues, having a central repository where data can be both stored and analyzed in place becomes increasingly attractive, regardless of whether any individual job requires such a robust infrastructure.

 

Reason 2:  Unstructured Data

Figure 3 – Embrace the chaos!

As it turns out, most of the growth in data collection is being driven by unstructured data.  It’s often claimed that up to 80% of all new data is unstructured (although the origin of that number may be a little hazy).  Data sources such as email, blog posts, social media feeds, and web logs have become increasingly critical to driving business decisions.  Legacy data warehouses simply weren’t built to store and analyze these types of data.  In contrast, Data Lake technologies like Hadoop and Spark were designed specifically with these in mind and allow those data sources to be stored and processed in their native formats.

 

Reason 3:  Centralized Data Governance

Figure 4 - One key to rule them all.

Most organizations have a variety of data repositories, each with its own governance infrastructure.  One key benefit of moving data out of these systems into a Data Lake is the ability to centralize and standardize on a single, comprehensive data governance strategy.  This simplifies management and reduces the risks involved in trying to synchronize across disparate systems.

 

These benefits notwithstanding, moving from the theoretical to realizing the full potential of a Data Lake is typically not smooth sailing.  Two of the most common challenges we hear about are:

 

Challenge 1:  Skills Gap

Figure 5 - We can't all be a splayd (it's a thing, look it up).

One customer we spoke to early on told us, with a certain amount of pride, about their production cluster.  It had a combined capacity of about 1PB of storage with over 400 virtual cores for processing.  But when we asked how many users they had on the system, the answer surprised us.  There was just one!

 

Clearly this is an extreme example, but it's consistent with stories we've heard from other clients.  The fact of the matter is that Hadoop and Spark are challenging to deploy to a wide audience because they’ve been built with the programming-savvy Data Scientist in mind.  A 2016 survey by Cloudera and Taneja Group reports that “six out of 10 active Spark organizations reported a significant skills/training gap, while more than a third mention complexity in learning/integrating Spark” as key barriers to wider adoption.  By its very nature, this shuts most Data Analysts out.

 

Challenge 2:  Prep and Blend

Figure 6 - Bringing new meaning to the term "mess duty".

One of the touted benefits of the Data Lake is that data can be stored in its raw form.  But to do anything meaningful with it, this raw data must first be transformed into something more useful.  Research suggests that Data Scientists spend about 80% of their time just preparing the data they need for their analyses.  Meanwhile, the average salary for Data Scientists in the U.S. ($121,353) is more than 65% higher than for Data Analysts ($73,270), according to glassdoor.com.  One can’t help but wonder whether having such expensive resources spend so much of their time on prep and blend tasks makes sense.

 

Not surprisingly, Data Scientists agree.  According to Forbes, “76% of data scientists view data preparation as the least enjoyable part of their work”.

 

Opening the Flood Gates

Figure 7 - Dammed data.

Taken together, these two challenges create a significant barrier between the potential of the Data Lake and those that could benefit most from it.  Fortunately, this is where Alteryx Direct Connection for Apache Spark comes in.  With it, Data Scientists can unburden themselves from tedious preparation tasks and Data Analysts can finally get access to the rich resources inside the walls.

 

Let’s see how.

 

Spark In-DB

At its core, Alteryx Spark Direct sits on top of the framework already used by our In-DB toolset.  In fact, we’ve had an In-DB connector for Spark for quite a while now.  Unlike our normal tools, which operate directly on data in memory, the In-DB tools generate SQL statements that are passed to a database for execution.  The real value of these tools is that you don't have to be proficient in SQL to use them.  Just like all the other tools, it's drag-and-drop, and Alteryx takes care of the rest.

 

Spark is a bit different, though.  Since it's not a database, it has a lot of functionality that can't be accessed via SQL.  Instead, you need to use one of its native languages (Python, Scala, or R) to unlock its full potential.  So that's just what we did.  For our standard set of In-DB tools, we take the SQL they are already producing and wrap it in direct calls to Spark.  This allows you to take advantage of all the rich prep and blend functionality of our In-DB tools without writing a line of Spark code.
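
To make the idea concrete, here is a minimal PySpark sketch of what "wrapping SQL in a direct call to Spark" looks like conceptually.  The table name, path, and query are hypothetical; this illustrates the pattern, not Alteryx's actual generated code:

# Minimal sketch: execute In-DB-style SQL through Spark's native API.
# All names and paths below are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("indb_sketch").getOrCreate()

# Expose a source dataset as a table (in practice it might already
# be registered in the Hive metastore).
spark.read.csv("/data/customers.csv", header=True, inferSchema=True) \
    .createOrReplaceTempView("customers")

# The kind of statement a drag-and-drop Filter + Select might generate:
result = spark.sql("""
    SELECT customer_id, region, total_spend
    FROM customers
    WHERE total_spend > 1000
""")

# The result is an ordinary SparkSQL DataFrame, so later steps can
# keep operating on it natively instead of round-tripping the data.
result.show()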

 

Now, I know what you're thinking:  "As utterly amazing as that is, you're still limited to SQL, right?  What about all of the other Spark functionality?"

 

And to that I say, "But wait - there's more!"

 

By talking to Spark in one of its native languages, we can go beyond the standard In-DB tools to access more advanced functionality.  To make that available to you, we are introducing a new tool that we're calling…

 

The Code Tool for Apache Spark

With this tool, a user can supply custom Spark code in Python, Scala, or R and have it execute as part of the Spark job that Alteryx is running.  Incoming data from upstream tools is accessible as SparkSQL DataFrames, and any data outputs are also created as SparkSQL DataFrames.  This approach allows the tool to be inserted anywhere within the Spark Direct In-DB tool chain.
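
As a hedged sketch (the actual variable names the tool binds for input and output may differ; treat input_df and output_df below as placeholders), a snippet inside the Code Tool might look something like this in Python:

# Hypothetical snippet inside the Code Tool for Apache Spark.
# Assumes the tool exposes the upstream data as a SparkSQL DataFrame,
# here called input_df (a placeholder name).
from pyspark.sql import functions as F

# Use DataFrame operations that go beyond what generated SQL covers:
# derive a bucket column, then aggregate by region and bucket.
enriched = input_df.withColumn(
    "spend_bucket",
    F.when(F.col("total_spend") > 1000, "high").otherwise("standard"),
)

summary = (
    enriched.groupBy("region", "spend_bucket")
    .agg(
        F.count(F.lit(1)).alias("customers"),
        F.avg("total_spend").alias("avg_spend"),
    )
)

# Whatever DataFrame you hand back flows downstream as SparkSQL data
# (again, output_df is a placeholder for the tool's output binding).
output_df = summary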

 

Using this tool, a Data Scientist can take full advantage of Spark’s native functionality, third-party libraries, or their own internal code.  As an example, we have several demos that use H2O.ai’s Sparkling Water machine learning libraries to perform predictive analytics within Spark.  Additionally, the Code tool can be used as an Input Tool, providing access to data formats that Alteryx otherwise doesn’t support, such as Parquet.
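
For instance, reading Parquet is a one-liner in Spark's own Python API.  A quick sketch, assuming a SparkSession named spark is available in the tool's environment and using a made-up path:

# Sketch: using custom code as an input by reading a format that
# isn't natively supported.  The path is a placeholder.
events = spark.read.parquet("/datalake/raw/events/")
events.printSchema()  # inspect the schema stored in the Parquet files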

 

Just as was the case with our R Tool, this approach allows users to incorporate advanced functionality within an Alteryx Workflow.  And just like with our R Tool, this advanced functionality can be wrapped into an Alteryx Macro and exposed as a new tool within the Alteryx Designer.  And, just as the R Tool provided the foundation for the explosion of rich analytic tools that we now have within Alteryx, we believe the Code Tool for Apache Spark will provide a similar foundation for analytic tools for the Spark platform.

 

Make a Splash

Figure 8 - Unleash your inner potential!

So, how does all of this come together to solve the challenges outlined above?

 

For the Data Scientist, Alteryx Direct Connection for Apache Spark gives you the ease of use that Alteryx is renowned for.  Most of your data prep will go from tedious to trivial by taking advantage of the pre-built In-DB tools.  Add in the Code Tool for Apache Spark and now you have an end-to-end solution that encompasses the best of both worlds:  ease of use without sacrificing any of Spark’s advanced capabilities.

 

For the Data Analyst, the skills gap is washed away and you finally have access to all the data and functionality within your Data Lake without having to learn the complexities of Spark.  As an added benefit, Data Scientists can share their advanced work with you by wrapping their Spark Code tool into a Macro, making it available as just another tool on your palette.

 

And for the CDO or CIO, the benefits should be clear.  Your Data Scientists can focus on the hard problems that you hired them to solve.  Your Data Analysts now have access to all the data you need them to use.  And, most importantly, anyone who can run an Alteryx workflow (such as one shared on your private Alteryx Gallery) can now benefit from your investment in a Data Lake. 

 

As we continue to work toward releasing Spark Direct, look forward to more blog posts that go into detail about connecting Alteryx to Spark and making use of the Spark Code tool.  Until then, I would love to hear about your experiences with Spark.  Are you just learning about it or are you already up and running with it?  If your company already has it, have you experienced any of the challenges I've talked about?  Do you think Spark Direct might be able to help you?  Please leave some comments and let us know what you think.  Your input can help guide us as we put the finishing touches on this great new feature!

Comments
Julien_B

Hi Gary, 

 

Thank you for this great post.

I'm looking forward to testing the new Spark Direct functionality. When will it be available?

 

GaryS
Alteryx

Thanks for the feedback, Julien.  The new Spark Direct functionality is scheduled to be released around the end of February.  In the meantime, if you want to start working with it sooner, we recently released it into Beta.  Please reach out to Neil Ryan (@NeilR), our Product Manager, for instructions on how to sign up for the Beta program.

 

Thanks,

Gary.

BorisTyukin

This is really exciting. I am a bit worried about the Livy requirement, though. Livy has not been a very active project lately, and I am not sure if Cloudera will be actively supporting it. It also came with some security risks, the last time I read about it.

GaryS
Alteryx

Thanks for the feedback, Boris.  I believe it is true that Cloudera has stepped away from Livy, but Hortonworks and Microsoft continue to be users of and contributors to it.  The project also recently moved into the Apache Foundation, which is a significant step toward building out its community.  Only time will tell, but I feel confident that it will continue to be supported.

 

Please let us know if you are planning to try out the new functionality and if you need any help.

BorisTyukin

Thanks for the reply, Gary. Yes, we will be interested in this functionality, but we don't have time right now to enroll in Beta testing. BTW, Livy was recently used by Safari to implement their new interactive training (Oreo) with Spark and Jupyter, so I am hoping the project is not dead.

Julien_B

Hi Gary, 

 

Could you please confirm that the new Spark Direct functionality will still be available in the next release (2018.1 / late February)? Will it be in the Laboratory or in the "official" tool list?

 

GaryS
Alteryx

Hi @Julien_B.  The new Spark Direct functionality will be in the 2018.1 release.  It will be an option in the In-DB connection's Data Source field, and the new Spark Code tool will be found in the Developer tool palette.

 

[Screenshot: the Spark Direct option in the In-DB connection's Data Source field]

 

[Screenshot: the Spark Code tool in the Developer tool palette]

 

Julien_B

Hi @GaryS ! 

 

Thanks for the information :)

dataprep

Hello Gary,

 

We plan to use Spark Direct in a few weeks. However, we are struggling to find documentation. Is there any example available of a workflow that uses it?

 

Best regards

NeilR
Alteryx Alumni (Retired)

@GaryS, @DavidW, or @ColinR may have more, but here's what I'm aware of: