Hi all,
I run a workflow that contains two very large datasets. One of which I receive a few weeks before the other. However, when I receive the second dataset (which is about 1/8th the size of the first dataset) I need to complete the workflow pretty quickly. I takes quite a bit of time to run the workflow when I have to read all of the data into the workflow. I was wondering, the day before, can I cache the first source of data up to the point right before it interacts with the second source of data? Then, once I get the second source I can run the workflow. Since the first dataset is already cached I have essentially saved myself that amount of time. Does it work like this? Please let me know if you have any questions.
EDIT: It takes a few hours to read in the large dataset so my intuition tells me that if I cache this data and the manipulations I need to perform beforehand then I will be saved all that time when I actually run the full workflow. I just am not sure if I leave Alteryx open for a day with all that info cached will that cause any unexpected issues.
Hi,
I don't think what you're asking is possible. I would rather look at a few things to speed up your process. Example areas are:
Those are some of the immediate points where I would start to look.
Regards,
Tom
Can I not just do as I have done in the picture below? I have cached my workflow in three spots (circled in black). As you can see these are just additional data sources that do not interact with my initial data source immediately. Then once I get my last input file (The top input file that is not circled) I place it in there and run. The black portions were already ran and cached previously so it doesn't need to perform those steps again, it only needed to bring in the cached data.
To answer your questions:
Hi @AlteryxTrev ,
I think the good news is that we can probably make your life a lot better by working through this.
It probably won't be a quick single answer, but I'm happy to stick with you on this one ... this could save you an enormous amount of time.
Firstly, to answer your question, you could try to cache it and wait a day, but it's awfully dangerous - a sudden reboot and maybe the files have changed when you reload, the memory toll on your machine etc. Definitely don't recommend it.
Secondly, let me try to clarify some points:
1. A .yxdb is the extension of the Alteryx data format type - think of it like .xlsx is to Excel, except that this format is much more efficient for Alteryx to process. So, rather than caching (as you're doing in your image), you can output the data into a new .yxdb file (with, say, the date appended automatically) so you keep copies of what you ran the day before. I don't know your workflow, but the endpoint is likely to be the end of the circle you drew on the image. If you had access to a database, this data could also be output to a database (e.g. SQL Server).
2. In database tools apply when you are extracting data from a database. My hunch is that you're not doing that, but if you were, the In Database tools (there is a section for this in the menu in Alteryx Designer) offer in-database processing, i.e. the logic is run before it reaches Alteryx. Alteryx then doesn't have to do the crunching and only pulls the result into memory, so uses the power of the database.
3. With the "extract less" point, if you're doing a vlookup, can you extract only key fields, e.g. a name, an address or a key etc.?) The less fields you extract from the source the faster it will run
4. An extract is another name for the workflow you're building - i.e. the whole process. The key here is to break it into granular pieces as much as possible to save you time. For example:
Think the above is plenty to think about - let me know if you'd like to chat some more.
Wow, thanks for all the great info! Here is my response to each of your points.
1. Ah, I understand now. Yes currently I am reading these files from CSV. I ran a quick test and read a file with about 300m records. It took 3 minutes to read the file as a CSV and only 1 minute to read it as an Alteryx Database. So it looks like it can at least read the data much quicker. Does the benefit of the Alteryx Database extend beyond just read the file? For example, when performing other manipulations or submitting the data through a batch macro will all of those processes happen faster because the source was an Alteryx DB and not just a CSV?
2. I see, I am not using a database in this instance.
3. Got it. Yes I drop fields where I can so I don;t carry unnecessary data through my workflow
4.
1st point: Yes I can turn all of my input files into Alteryx Databases (I will do that now!). If the benefit of using an Alteryx Database extends beyond just reading the file faster I will turn all of them into this. If it does not extend beyond that benefit then I won't bother turning the last file I receive into the Alteryx Database.
2nd point: Great point! Although I think this will be changing a little bit as I am trying to have fewer input files (those three files were just a subset of three of the input files from above that I had split off in another workflow). Instead of splitting the files beforehand and running the workflow with all of the separate files I will just split the files with a filter tool at the top of my workflow. This would still require a blend tool though. At the end of the day I am trying to run my whole process in one workflow so I don't have to have bits and pieces everywhere.
3rd point: I don't think I can split my workflow up as each subsequent container uses "leftovers" from the above container. Plus I was just hoping to to hit run and walk away.
4th point: I agree that I am probably using too many fields for fuzzy matching and can probably remove some as they are somewhat redundant.
Once I make some of the above changes I will report back and let you know how they went. Then I can go further in depth with how many macros are working. But I don;t want to ask those questions as I have a lot to do for now. Hopefully the above provides more context as I am not able to send over the actual workflow. I really appreciate the time you have taken to help me though!
EDIT: I ran a test on a subset of the data. With CSV as input the workflow took 54 minutes. With Alteryx DB as input the workflow took 25 minutes (wow!). The results are slightly different which is a little scary, however, maybe it has to do with the fact that I am limiting how many records and being initiated in some of the files. Also, my thought it that converting CSV to .yxdb jumbled the records a bit and this is why results are different.
Hi @AlteryxTrev ,
Glad there are already some little improvements made. I think we can still do better!
First, to answer your question regarding the benefits of .yxdb vs .csv - personally I've always found .csv's less reliable, in that you have to be careful with the delimiters and identifying text. Also, sometimes truncation happens on fields that have a lot of verbatim text in it and it's easy to miss if you're not paying attention. If you don't have a lot of text in a single field this may not be a problem. Think the cut-off is 254 characters. This can be extended, but needs to be defined explicitly.
I highly doubt that the .yxdb would be unreliable/give incorrect results, unless not all of the data made it's way into it in the first place (from the .csv). So just do a check that every row & column were transferred into the file in your .csv -> .yxdb workflow - think you should be OK.
On the fuzzy matching, I may be able to give you a tip to help improve your performance. Not guaranteed, but let's give it a try. I attach a simple example for you to review:
* Rather than matching on multiple fields, create a single field that contains all the values you want to match on
* As your Record ID (unique identifier), also append all these values into a single field. This helps you to see if the match works or not
* On the fuzzy match tool, make sure you output the match scores (if you're not already doing that)
* Add a "Group" tool at the end of your fuzzy match - this is to de-duplicate all your matches and identify those that relate (e.g. A=B and B=C therefore A=C)
* At the end, you can add a Tile tool to give a number to each match
* Oh, and I like to use a little trick - to remove all the vowels in a word I want to match against - that is usually where most typos happen
By the way, I'm assuming you have to fuzzy match, and cannot simply join data together! 🙂
Here is an image of what this would look like:
The performance point is to simplify the number of fields your matching on (so make it 1 that is a combined field ... if you can). The rest is making sure you keep track of what is matching and what is not. Not sure how you're handling that (e.g. non-matches) currently?
Back to the performance enhancement though and wanting to press run & walk away: Look into Crew Macros - they're on the chaosreignswithin.com website and were originally written by one of the Alteryx developers. They're brilliant. The macros you're looking for are called "Runner" and "conditional runner" tools - they allow you to run multiple workflows in sequence. So you could for example:
1. Move all .csv -> .yxdb
2. Do Fuzzy match stage 1 and store it, using the input from step 1
3. Do Fuzzy match stage 2 and store it, using the input from step 2.
etc.
Also, there are "log parse" tools in the Crew Macros, so you can actually store the log results of each run, so you can see how many rows were processed and what errors (if any) happened.
Again, that's a lot of info, so I'll leave it there. Attaching the fuzzy match workflow for you.
Best,
Tom
BRILLIANT AND VERY HELPFUL!!!
Glad to hear!
 
					
				
				
			
		
