Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
Alteryx

In my last blog post (Part 1 - Why AMP?) I looked at the reasons why we built a new multi-threaded core engine. In this post we will take a look at some key concepts that make the AMP engine tick. You do not need to know any of this to use the engine, since it takes care of it all for you; this post is for the technically advanced user who wants to understand what’s going on under the hood. We will cover the following concepts:

  • Record Packets
  • The Memory Allocator
  • The Task Scheduler

 

Record Packets

 

In part 1 of this series we talked about the overhead of making an application multi-threaded. If we multi-threaded on a per-row basis, the cost of multi-threading would outweigh the benefits: we could use more CPU power, but we would ultimately be slower. The way we tackled this issue in AMP is that tools no longer process data on a record-by-record basis.

Instead, they process data in record packets.

A record packet is a fixed-size allocation of memory (today 4 MB) which contains a number of records. So if a record is 1 KB, a packet can hold about 4,000 records. Tools multi-thread work on a per-packet basis, which means the cost of multi-threading is spread over all the records in that packet and no longer has a detrimental effect on overall runtimes.
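The packet arithmetic can be sketched in a few lines of Python. This is only an illustration, not AMP's code: the function names are hypothetical, the 1,024-byte record size in the usage note is an assumed example, and the sketch ignores any per-packet header or variable-length fields.

```python
PACKET_SIZE_BYTES = 4 * 1024 * 1024  # 4 MB, the packet size quoted in the post


def records_per_packet(record_size_bytes):
    """Number of fixed-size records that fit in one packet (no header overhead assumed)."""
    if record_size_bytes <= 0:
        raise ValueError("record size must be positive")
    return PACKET_SIZE_BYTES // record_size_bytes


def split_into_packets(records, record_size_bytes):
    """Group an iterable of records into lists of at most one packet's worth each."""
    capacity = records_per_packet(record_size_bytes)
    if capacity == 0:
        # Assumption: oversized records are out of scope for this sketch.
        raise ValueError("record is larger than a packet")
    packet = []
    for record in records:
        packet.append(record)
        if len(packet) == capacity:
            yield packet  # packet is full, hand it on as one unit of work
            packet = []
    if packet:
        yield packet  # final, partially filled packet
```

With an assumed record size of 1,024 bytes, `records_per_packet(1024)` gives 4,096 records, so any fixed cost paid once per packet is shared across a few thousand records rather than paid once per record.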

A record packet is always fixed in size, and we try to keep packets relatively full to help with performance (packets that are only marginally filled make for very inefficient use of memory). Larger fields (think big spatial data or string fields) are held outside the packet. We will cover these in a future post; for now, just know that you don’t need to worry about large fields taking up all the space in a packet.

 

Memory Allocator

 

Next, to deal with all these record packet memory allocations we have a new component called the Memory Allocator. The allocator’s job is to “allocate” memory for record packets and other large data fields and to manage storage of that memory.

The allocator ensures that memory usage stays under the limit you have set for AMP: if you have more data than that limit allows, it will first compress record packets and then write them to disk.

This means that an individual tool does not need to worry about where that memory is stored. It receives a handle to a memory packet, and when it wants to access that packet, the memory allocator ensures the data is ready and available in RAM to read from and write to.
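To make the handle idea concrete, here is a toy allocator in Python. It is a sketch under stated assumptions, not how AMP is implemented: the class and method names are hypothetical, `zlib` stands in for whatever compression AMP uses, and the oldest-first eviction order is an assumption, since the post does not describe the real policy.

```python
import os
import tempfile
import zlib
from collections import OrderedDict


class PacketAllocator:
    """Toy allocator: keep packets in RAM up to a limit, compress first, then spill to disk."""

    def __init__(self, memory_limit_bytes, spill_dir=None):
        self.memory_limit_bytes = memory_limit_bytes
        self.spill_dir = spill_dir or tempfile.mkdtemp(prefix="packets_")
        self._raw = OrderedDict()         # handle -> uncompressed bytes, oldest first
        self._compressed = OrderedDict()  # handle -> zlib-compressed bytes
        self._on_disk = {}                # handle -> path of spilled file
        self._next_handle = 0

    def _in_memory_bytes(self):
        return (sum(len(b) for b in self._raw.values())
                + sum(len(b) for b in self._compressed.values()))

    def _enforce_limit(self, keep):
        # Step 1: compress the oldest raw packets (never the one in use).
        while self._in_memory_bytes() > self.memory_limit_bytes:
            victim = next((h for h in self._raw if h != keep), None)
            if victim is None:
                break
            self._compressed[victim] = zlib.compress(self._raw.pop(victim))
        # Step 2: if still over the limit, write compressed packets to disk.
        while self._in_memory_bytes() > self.memory_limit_bytes and self._compressed:
            handle, blob = self._compressed.popitem(last=False)
            path = os.path.join(self.spill_dir, f"packet_{handle}.z")
            with open(path, "wb") as f:
                f.write(blob)
            self._on_disk[handle] = path

    def allocate(self, data):
        """Store a packet and return an opaque handle to it."""
        handle = self._next_handle
        self._next_handle += 1
        self._raw[handle] = data
        self._enforce_limit(keep=handle)
        return handle

    def read(self, handle):
        """Return the packet's bytes, bringing it back into RAM if needed."""
        if handle in self._on_disk:
            path = self._on_disk.pop(handle)
            with open(path, "rb") as f:
                self._compressed[handle] = f.read()
            os.remove(path)
        if handle in self._compressed:
            self._raw[handle] = zlib.decompress(self._compressed.pop(handle))
        self._raw.move_to_end(handle)  # mark as most recently used
        self._enforce_limit(keep=handle)
        return self._raw[handle]
```

A caller only ever holds the integer handle returned by `allocate` and asks for the data with `read`; whether the packet was sitting in RAM, compressed, or on disk in the meantime is invisible to it, which is the property described above.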

 

Task Scheduler

 

The e1 engine pushes records between tools in the main thread of the application. This aspect of the e1 architecture is ultimately what prevents it from effectively using all of the cores on your machine. There are background threads in individual tools which do other work, but importantly only one record can be moving between tools at any given time.

 

In AMP the actual work of a workflow is co-ordinated by a new component called the Task Scheduler. Tools produce “tasks” of work, typically on a single packet of data, and the scheduler pulls tasks out of a queue and efficiently schedules them across however many cores the user has set up for use by AMP. This is the heart of why AMP can make use of so many cores: a given tool can have multiple tasks all running in parallel.
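The queue-and-workers pattern described here can be sketched with Python's standard library. Everything in the block is illustrative: `run_tasks` and `uppercase_tool` are hypothetical names, and the worker count simply stands in for the number of cores configured for AMP. Note also that CPython threads do not run CPU-bound Python code truly in parallel, so this shows the shape of the scheduling rather than the speed-up.

```python
import queue
import threading
from functools import partial


def uppercase_tool(packet):
    """Hypothetical stand-in for a tool: processes every record in one packet."""
    return [record.upper() for record in packet]


def run_tasks(tasks, worker_count):
    """Run a list of zero-argument tasks on worker threads; results keep task order."""
    task_queue = queue.Queue()
    for index, task in enumerate(tasks):
        task_queue.put((index, task))
    results = [None] * len(tasks)

    def worker():
        # Each worker keeps pulling tasks until the queue is empty.
        while True:
            try:
                index, task = task_queue.get_nowait()
            except queue.Empty:
                return
            results[index] = task()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


# One task per packet: the same tool can have several tasks in flight at once.
packets = [["a", "b"], ["c"], ["d", "e", "f"]]
tasks = [partial(uppercase_tool, packet) for packet in packets]
processed = run_tasks(tasks, worker_count=2)
```

Because each task covers a whole packet, the cost of queueing and dispatching is paid once per packet, and a single tool with many packets naturally spreads across all available workers.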

 

Having introduced some foundational concepts of AMP, in the next post in this series we will take a look at how we do summarize and join in a massively parallel way.

Adam Riley
Principal Software Engineer, Tech Lead Core Engines

Adam Riley is a Principal Software Engineer and the Tech Lead of the Core Engines Team at Alteryx. He started on his Alteryx journey as an Alteryx user before joining the company in 2011. He writes a personal blog about Alteryx at www.ChaosReignsWithin.com and is the former curator of the Crew Macro Pack.

Comments
Alteryx Certified Partner

Does the AMP Engine parse characters as UTF-8? The reason for asking is that I noticed that Dynamic Input throws a peculiar error "Invalid char in UTF-8 string" with AMP Engine turned on but not so when it's off, when reading in something like a non-breaking space (UTF8 U+00A0, 194 160).

Alteryx

@JZZChew It stores all of its text data internally as UTF-8 but it should parse text the same as the original engine. Are you able to send us an example workflow so we can take a look at what is going on?

Alteryx Certified Partner

This was with a dynamic input to an Azure SQL Database (service, not VM) via an alias (aka). I tried to see if I could replicate the error with offline files or even an Access database but it looks like those data sources do not throw this error.

For what it's worth, here's a link to the workflow but it will need another (online) SQL Server to connect to under the "db" alias.

Example workflow 

Alteryx Certified Partner

Hi @AdamR!  Hope all is well with you!

I've enjoyed these and the podcast as well.  Are there more articles in this series or are they still in the works?

Maureen