
Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
MacRo
Alteryx

Editor's note: Caching is now built into Designer - just right-click on a tool. Check out the announcement blog.

 

You can download the newest version of the Cache Dataset macro (V2) from the Alteryx Analytics Gallery and read about it here.

(You can also get the original version of the Cache Dataset here.)

 

One of the great things about Alteryx is how once you get the hang of it, you’re able to manipulate data almost as fast as you can think of the next step in your process. You end up spending less time fiddling with the mechanics of blending and analyzing your data, and more on the actual business problem at hand. You get into a rhythmic flow, that state of mind where you lose track of the outside world while splashing accents of regex and highlights of formula onto your canvas, and become one with the data to create a masterpiece of work.

 

But when building out a workflow that deals with large amounts of data and/or runs processor-heavy analytics, you might find it hard to get into that rhythmic flow that you’re used to when you have to wait several minutes between runs to see the results of adding a tool or a group of tools to the canvas.

 

One solution to this throttling of your flow is to sample your data. There are a couple of ways to do this. One is to place Sample tools strategically throughout your workflow. A more robust and cleaner way to accomplish the same thing is to create a user constant that limits the number of records on Input tools:

  1. Click on any whitespace on the canvas, then in the configuration window, go to the “Workflow” tab
  2. Add a new user-defined constant called “sample” and set it to a relatively low number such as 1000
  3. In the configuration panel of any input tools, set the “Record Limit” option to “%User.sample%” (without quotes)
  4. Develop your workflow with the number of input records limited, and when you’re ready to run it with the full set of data, simply set the user-defined constant “sample” to be blank so that the data is not capped
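As an analogy outside Alteryx, the pattern in steps 1 through 4 can be sketched in Python: a sample cap that, when unset, means "read everything", mirroring how a blank `%User.sample%` leaves the Record Limit uncapped. This sketch is purely illustrative and not part of the macro.

```python
# Sketch of the "user constant" sampling pattern (illustrative only).
# SAMPLE plays the role of the %User.sample% constant: a number while
# developing, None when you're ready for the full run.
import csv
from itertools import islice

SAMPLE = 1000  # set to None to remove the cap

def read_rows(path, sample=SAMPLE):
    """Yield rows from a CSV, optionally capped at `sample` records."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        yield from (reader if sample is None else islice(reader, sample))
```

Flipping one constant back to `None` switches every capped input to a full read, just as clearing the user constant does for all Input tools at once.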

The problem with this approach is that in many cases, you’re not going to see the behavior you need to build out your workflow appropriately.

 

But what if you could “save your spot” and run only the new tools you’ve added to the canvas, without re-running everything else over and over again? Well, I'm going to show you how you can do just that -- with the help of the Cache Dataset macro!

 

[Image: Cache Dataset macro icon]

 

Here’s a quick demo of how it works.

Suppose you have a workflow with three input data sources that each have over a hundred million rows. You cleanse them and do a bit of processing, and already this is starting to take longer than you'd like to wait before adding the next tool and playing with its configuration.

 

[Image: workflow]

 

At this point, you decide to stop messing around and cut to the chase with the Cache Dataset macro. To do this, you first place the tools that you’re done configuring into a Tool Container. You then add a Cache Dataset macro to the last tool in the stream:

 

[Image: workflow with container and macro]

If you're not familiar with the Tool Container tool, you can find it in the Documentation tool category. To use it, simply place it onto the canvas and then drag a selection of tools into it.

 

Next, you need to configure the Cache Dataset macro, which thankfully is very simple. You’ll want to give it a unique and descriptive name (if you have multiple Cache Dataset macros on your canvas, be sure to give them each different names!), and you'll want to leave the Run Mode set to "Write".

 

[Image: Cache Dataset macro configuration]

 

Now when you run the workflow, the datastream being fed into the Cache Dataset macro (in this case, the output of the second transpose tool) will be saved as a temporary file. (More specifically, it will be saved as an Alteryx data file in the same folder as the Cache Dataset macro itself.)

 

Setup complete! Let's speed up our workflow!

 

Now that we've got the workflow set up to use the Cache Dataset macro, let's put it to use. After you've run the workflow once with the Cache Dataset macro in "Write" mode, you can switch it into "Read" mode so that it references the temporary file instead of the input stream. When you do this, you'll see an error indicator over the Cache Dataset macro telling you that it's in "read mode". This is not an actual error; it's a convenient visual indicator so that when you start to build out a large workflow with many Cache Dataset macros, you can tell instantly which ones are set to "Read" mode. This matters because it means the input datastream is being ignored: if you make changes to an upstream tool, those changes won't flow through a Cache Dataset macro in "Read" mode, since it is reading from the saved temporary dataset instead.

 

[Image: read mode with container disabled]

 

Now when you run the workflow again, instead of taking a few minutes, it will finish almost instantaneously! You can now continue building out your workflow from the Cache Dataset macro. When you've got your workflow set up to your liking, you can right-click on the Cache Dataset macros and choose "Delete and Connect Around" to remove them without disrupting the flow of data, or you can simply place each macro into "Bypass" mode, which lets the data flow through untouched as if the macro weren't even there.
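The three modes described above can be sketched as a small Python function. This is a hypothetical analogy, not the macro's actual implementation: the real macro stores an Alteryx .yxdb file next to itself, while this sketch pickles records into a cache folder keyed by the user-supplied cache name.

```python
# Hypothetical sketch of the Cache Dataset macro's write/read/bypass modes.
import pickle
from pathlib import Path

CACHE_DIR = Path("cache")  # stands in for the folder containing the macro

def cache_dataset(records, name, mode):
    """Pass records through a named cache in 'write', 'read', or 'bypass' mode."""
    path = CACHE_DIR / f"{name}.pkl"
    if mode == "write":
        # Run upstream tools, save a snapshot, and pass the data on unchanged.
        data = list(records)
        CACHE_DIR.mkdir(exist_ok=True)
        path.write_bytes(pickle.dumps(data))
        return data
    if mode == "read":
        # Ignore the incoming stream entirely and replay the saved snapshot --
        # which is exactly why upstream edits don't flow through in this mode.
        return pickle.loads(path.read_bytes())
    # 'bypass': act as if the macro weren't there at all.
    return list(records)
```

Note how "read" never touches `records`: that is the behavior the error-icon indicator is warning you about.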

 

Back into the flow

 

Once you start using the Cache Dataset macro, you'll get the hang of it quickly and find yourself getting back into that sweet groove you're accustomed to, even when crafting heavy, resource-intensive processing in your workflow.

 

Hopefully you find the Cache Dataset macro to be a useful addition to your toolkit that allows you to get more out of Alteryx. If you have any questions, comments, or feedback, please share them in the comment section below. I'd really appreciate it, and I'm happy to help out with any issues you might run into.

 

One quick note before parting: as mentioned, this macro writes temporary datasets to the folder containing the Cache Dataset macro. You may want to occasionally delete the datasets in that folder if they are taking up too much space. At some point, I plan to update the macro to include logic for cleaning up these temporary datasets, but that will have to wait for another time. When I do, I'll be sure to link to the update here. In the meantime, cleanup is left to the user.
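Until the macro gains its own cleanup logic, a small housekeeping script is one way to keep the cache folder tidy. This is a hypothetical sketch: `macro_dir` and the `*.yxdb` glob pattern are assumptions you would adjust to wherever your copy of the macro actually lives.

```python
# Hypothetical cleanup sketch: delete cached datasets older than N days
# from the folder containing the Cache Dataset macro.
import time
from pathlib import Path

def clean_cache(macro_dir, max_age_days=7, pattern="*.yxdb"):
    """Remove cached files not modified within `max_age_days`; return names removed."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for f in Path(macro_dir).glob(pattern):
        if f.stat().st_mtime < cutoff:
            f.unlink()
            removed.append(f.name)
    return removed
```

Run it occasionally against the macro's folder; anything cached within the last week survives, so a workflow you are actively developing keeps its snapshots.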

 

Afterword

 

To wrap up, I wanted to show an example of a workflow developed one section at a time with the Cache Dataset macro. In the screenshot below, I had three large datasets that I needed to transform and normalize before developing a model with the Alteryx predictive toolset; it was this very problem that inspired the Cache Dataset macro. On the left is the final workflow with all of the Tool Containers enabled and the Cache Dataset macros in "Write" mode (circled in red to make them more apparent); on the right is the same workflow with the containers disabled and the caching macros set to "Read" mode.

 

[Image: the same workflow in "Write" mode (left) and "Read" mode (right)]

 

Shoutout to fellow Content Engineer Neil Ryan, who also developed a caching macro with the same functionality independently, before I did. I just don't feel right taking full credit for this solution to a common problem we've all run into, when Neil beat me to the punch. Despite creating basically the same thing, for whatever reason his macro didn't get as much visibility within our group, and as a result I'm the one writing this blog post. It's a veritable Leibniz-Newton situation. So if this is well received, I'm happy to share the glory with Neil. But if you don't like it, please direct your anger and disappointment toward Neil, since when you think about it, it was his idea first.

 

Updates:

- 10/12/2015: A minor update to the macro has been uploaded so that it can be present in workflows uploaded to the Gallery in 10.0. When a macro has an input tool and an output tool that both point to the same file, the Gallery interprets this as a problem and indicates that validation has failed. This happens in this macro, since we both read from and write to the temporary cached file, even though the two never happen at the same time. We get around this by pointing the input/output tools to different placeholder filenames; since the Action tools update the path using the name specified by the user in the configuration, we are good to go. Keep in mind, however, that if you upload a workflow that uses this macro to the Gallery, you will be OK if it is set to "Write" or "Bypass" mode, but you will run into problems if it is set to "Read" mode!

- 10/14/2015: Added link to download the macro from the Gallery at the top of the page.

- 04/11/2016: Added link to the Cache Dataset V2 macro at the top of the page.

Em Roach
Lead Research Scientist

Comments
cbridges
11 - Bolide

LIFE.SAV.ER

AndyMoncla
10 - Fireball

Awesome!

ronh_allaboutgis
6 - Meteoroid
Envisioning the amount of time this tool will save!
Hollingsworth
12 - Quasar

Very, very, very, very excited!

AndyC
Alteryx Alumni (Retired)

I was asked about this feature twice last week - amazing work!

MarqueeCrew
20 - Arcturus

No more temp.yxdb!  I like it. 

kane_glendenning
10 - Fireball

Awesome. Great Macro!! This is going to be very popular!!

Aguisande
15 - Aurora

It's in my Favorites already!

Would it be too "greedy" to ask for an Update method?

SusanDay
6 - Meteoroid

Thanks so much, I have needed this for a while & have used it already.

Instead of writing to the macro directory, could it write to %temp% as the default, as this will clean up after itself?

Thanks again!!!!! Luv it!!!!

NeilR
Alteryx Alumni (Retired)

the error icon overlay as a visual indicator is particularly inventive 

MacRo
Alteryx

I'm stoked people are finding this so useful!

 

@Aguisande: Sorry, I don't follow -- what do you mean by the update method?

 

@SusanDay: Good suggestion! I think the original reason I didn't write to the %temp% directory was so that I could return to my work the next day and continue on, without having to run the whole workflow again (and reset the Cache Dataset macros on my canvas back to "write" mode). That said, I think giving users the option to choose between writing to %temp% or a permanent location is a useful feature (and easy to implement), so I've updated the macro attached to this post to include that option. Thanks for the suggestion!

MacRo
Alteryx

@SusanDay - Unfortunately, the temp directory actually changes with every run, making this not as simple an update as I had originally thought. The original version has been posted back up in the meantime!

chris_love
12 - Quasar

@MacRo can you not write to ..\%Temp%, which will remain constant - i.e. it will move from the engine-generated folder into the user's normal Temp folder?

MacRo
Alteryx

Hey @chris_love! Yes, we could indeed write to %temp%../, which would remain constant. However, even though it is the parent "temp" directory, it does not get cleared out automatically like the session-dependent temp folders do. So while it's not a bad alternative to writing to the folder containing the macro, it would not solve the issue of these cached datasets potentially building up over time.

dataMack
12 - Quasar
Thanks so much for posting this. I've been doing this manually by having 'containerized' database queries that write local yxdbs, then manually switching the connections when I want to be 'in the flow' building and tweaking versus ready to run it fully.
rmak
5 - Atom

This is great!


I have two questions:

 

1. Is there a way to toggle all Cache macros at the same time? Perhaps allowing the setting to be read through a text input set at the beginning of the flow?

2. Is there a way to automatically trigger a pass-through if the data flowing in is not disabled? And vice versa: if the data flowing into the macro is disabled, it automatically sets itself to "write to temp file".

MacRo
Alteryx

Great suggestions @rmak! I really like these -- added to my backlog. No guarantees as far as timing to get them implemented, but will keep you updated!

swapnil_sawant
6 - Meteoroid

Hi,

 

I would love to use this macro however I am getting an error 

Cache Dataset (2) Tool #20: Error creating the file "C:\Program Files (x86)\Alteryx\DataProducts\Explorer\TT_CA_2015_Q1\Macros\cache_dataset.temp1.yxdb": Access is denied.

Any idea of what is going wrong? Thanks! 

 

UPDATE: Resolved it...just moved the cache macro to a different directory and it worked... :) 

MacRo
Alteryx

Nice -- glad you got it to work. Just as a reference for anyone else who might run into this: the macro writes out the temp files to the same directory as the macro itself, so you have to have write access to the folder it's in.

AtlasDougged
6 - Meteoroid
I may have missed it in the comments, but how long will the temp file remain in the macro folder?
MacRo
Alteryx

@AtlasDougged -- for the current version, the files will stay there until you delete them manually

MacRo
Alteryx

To those who were asking about having the cached files cleaned up by the tool, check out my new blog post, The Cache Dataset Macro V2!

 

@SusanDay @chris_love @AtlasDougged

 

(@rmak, I didn't have time to implement your suggestions for this blog post, but hope to implement at least #2 soon, keep an eye out for an update!)

ulrich_schumann
8 - Asteroid
Hi, I am quite new to Alteryx and this is the first external macro I've wanted to use. Installation went well - writing with the macro seems to work - but after changing into read mode, one unspecified error comes up - and no data is available. What can I do?
MacRo
Alteryx

Hey @ulrich_schumann, did you first run your workflow with the tool in write mode? The read mode error (it should print "read mode" to the logs) is intentional and not a true error -- the reason for it is to give a visual indication on the canvas of which instances of the tool are reading from cached data (rather than letting the data flow through naturally). Although this visual indicator may not be necessary for smaller workflows, it becomes very useful when dealing with larger workflows containing many instances of the Cache Dataset tool (for example, the workflow shown above in the "Afterword" section of this blog post).

drom
5 - Atom

Brilliant! Real time saver

Mitchell_G
7 - Meteor

This looks awesome! From my understanding, this would not save any time if you put it right after the initial load, because it would just be caching the records that you just loaded in, taking an identical amount of time. Is this assumption correct?

 

I can see how this would be very valuable for deeper in the workflow where some filtering/aggregation was already done.

paul_houghton
12 - Quasar

@Mitchell_G it depends on where you are reading the data from. If you are reading in some large, slow CSVs or a database, then the records are cached to a yxdb and subsequent reads are sped up from there. In general, the speed changes only if it's faster to read from a local yxdb.

Aguisande
15 - Aurora

@Mitchell_G I would say that in some cases, it saves a lot of time to use it right after the Input tool.

The scenarios I've already run into are:

- Big latency on network connections,

- Complex queries against DBs,

- .XLS files (they're always slow),

- some .XLSX files,

- remote files (as @paul_houghton states, it's faster to read from local yxdb's)

 

And the one I use most is to cache a minimal sample of the data, to develop and test the logic of the workflow. That speeds up the testing process a lot, since you focus on processing the data, not getting it.

 

Best,

_AG_ 

swapnil_sawant
6 - Meteoroid

I routinely use it right after the Input tool. I find myself accessing large XLS files from a network drive, and this macro just speeds up everything. It becomes so much easier to rapidly try out different things in the workflow.

Rohit_Bajaj
9 - Comet

Is there a way to know when the previously stored cache output needs a refresh? This is more to do with data changes on the source side, rather than the explicit code changes one is making, which might themselves call for a cache refresh.

swapnil_sawant
6 - Meteoroid
@Rohit_Bajaj - Not sure if I understood your question correctly. But if you want to know how old the cached data is, I just browse to the folder where the macro's files are stored and check the date the cache file was created, which gives me an idea of up to what time certain data was loaded. You can also sort the data by date and pull out a sample of the first 10 records to see what is stored.
Aguisande
15 - Aurora
You probably don't have access to write to %temp%



Sent from my Samsung device
aehrenwo
11 - Bolide

I am getting the following error now when I try to use the Cache Macro toolset 

 


Error: Cache Dataset V2 (39): Tool #120: The external program "C:\Windows\System32\cmd.exe" returned an error code: -1073741819
Error: Cache Dataset V2 (39): Tool #148: The external program "%temp%findalteryx.bat" returned an error code: -1073741819

 

Any suggestions? 

aehrenwo
11 - Bolide

I changed the folder to one on my Desktop for the cached file and I am still getting these errors!!! 

 

Error: Cache Dataset V2 (39): Tool #120: The external program "C:\Windows\System32\cmd.exe" returned an error code: -1073741819
Error: Cache Dataset V2 (39): Tool #148: The external program "%temp%findalteryx.bat" returned an error code: -1073741819

 

Recently Avectco Defender was installed on this PC (for work), and I am wondering if that is causing the issue, because the macro needs to launch the command prompt to run.

 

Any suggestions would be helpful - this is frustrating. 

 

Adam

Aguisande
15 - Aurora
Did you try disabling the antivirus for a test?
Just to confirm the antivirus is not the cause of the failure.



Sent from my Samsung device
patrick_mcauliffe
14 - Magnetar

@aehrenwo it sounds like the User Account Control settings need to be adjusted.  If you're not Admin on the machine, ask your IT department to adjust for you.  If you are, lower the UAC level (Control Panel --> User Accounts --> User Accounts --> Change User Account Control settings).

Vadim
5 - Atom

Love the macro as much as I love this part:

 

Shoutout to fellow Content Engineer Neil Ryan, who also developed a caching macro with the same functionality independently, before I did. I just don't feel right taking full credit for this solution to a common problem we've all run into, when Neil beat me to the punch. Despite creating basically the same thing, for whatever reason his macro didn't get as much visibility within our group, and as a result I'm the one writing this blog post. It's a veritable Leibniz-Newton situation. So if this is well received, I'm happy to share the glory with Neil. But if you don't like it, please direct your anger and disappointment toward Neil, since when you think about it, it was his idea first.

 

Thank you :)

Roisinmm
7 - Meteor

Hi guys, bit late to the convo but has anyone ever experienced the error?

 

Error: Cache Dataset V2 (126): Tool #93: File not found "C:\Users\******\AppData\Local\Temp\Engine_9976_608491857960456393b620beffed4793_\directoryList.txt"

 

I thought this might be to do with where the temp files are saved but I have specified a location and it's still not working?

 

Thanks!

crpledger
5 - Atom

@Roisinmm did you solve the issue with the error message? 

 

Edited: I got the same error message after moving to a new PC and 2018.1 but it looks like it resolved itself after restarting Alteryx.  

Roisinmm
7 - Meteor

Hi @crpledger - no sorry! Ended up changing jobs and didn't have access to Alteryx anymore so it's an unsolved case....

ABCD
5 - Atom

 I am receiving the following error when I try to use Cache Dataset: "directoryList.txt" File not found

 

 

 Any thoughts on how to fix this?

crpledger
5 - Atom

@ABCD Not sure on the solution, but I'd recommend downloading 2018.3 with the new cache capability built in. The Cache macro was great, but the new standard functionality is much better: it allows your metadata to carry through, which makes building easier. 

ABCD
5 - Atom

@crpledger I don't have the license key to download the update (and it has been difficult to obtain from someone) so I am trying to find another way to overcome the issue!

ctrl_kudorje
8 - Asteroid

Seriously! Thank you so much! I am very grateful for your open source work!

Gualigee
8 - Asteroid
NeilR
Alteryx Alumni (Retired)

@Gualigee 

Added to top of article:

 

Editor's note: Caching is now built into Designer - just right-click on a tool. Check out the announcement blog.

While you may still find the macro useful in some cases (such as saving cached data between Designer sessions) in general the built in functionality will be the best choice.

Gualigee
8 - Asteroid

Hi @NeilR , 

 

Thank you for your reply. 

 

I did some research and found that the built-in cache function cannot always be used as a substitute for the Cache Dataset tool, for two major reasons. The first is that the cache-and-run function only works when the workflow has a single linear path. For example, say we wanted to use it for the Linear Regression tool. This tool has 3 output nodes, so we cannot run and cache the outputs.
The second reason is that with cache-and-run, the data is only cached for the current instance of Alteryx. Once you close Alteryx and boot it up again, you have to cache and run the workflow again. With the Cache Dataset tool we can save the cached data so we never have to run the preceding tools again. This is very useful for models that take a long time to train but which you'd like to deploy multiple times. I think you mentioned this, too.
 
Any further feedback would be appreciated. 
NeilR
Alteryx Alumni (Retired)

@Gualigee I agree with your assessment and will add that if there is a tool with more than one output and you're only interested in caching one output, you can add a select tool and cache there. If you want all three cached, you can use the Cache Macro.

Gualigee
8 - Asteroid

I cannot find the cache dataset tool in the new community 😞

NeilR
Alteryx Alumni (Retired)