I am new to Alteryx and need to design a workflow that performs a simple "Find and Replace" on a single field within a CSV file. There are two main problems I am facing: first, almost 60 comparisons have to be made for this find-and-replace task, and second, the flow needs to run on CSV files containing millions of rows.
As you can see in the attached sample data file, there is a field named "Ticker" that packs multiple details into a single large string value. This includes a date in the format "YYMMM", where YY is the two-digit year (e.g. 19 for the year 2019) and MMM is the three-letter month abbreviation (e.g. FEB for February, and so on for the other months).
Now, my requirement is to REPLACE this 5-character date inside these large string values with a new 7-character date that also includes the two-digit day of the month. Instead of just showing YYMMM, the values will show the complete date in the format DDMMMYY, where DD is the day, MMM is the month and YY is the year.
In short, “YYMMM” needs to be replaced with “DDMMMYY”, leaving all other information inside that string value as it is.
For example, the original Ticker value "ACC19JAN88260PE" will be converted to "ACC31JAN1988260PE" (i.e. "19JAN" is replaced by "31JAN19"), and the original Ticker value "ACC19FEB88260PE" will be converted to "ACC28FEB1988260PE" (i.e. "19FEB" is replaced by "28FEB19"), and so on for all the different field values. I hope you understand the requirement.
Now, for adding the "DD" part, we have a table of values that tells exactly which day corresponds to each "YYMMM" value in the original data; please have a look at the attached sample data file. We can't just use an arbitrary day such as the first or last day of the month. Instead, we have to replace each specific "YYMMM" portion with its corresponding "DDMMMYY" as given in the attached table.
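To illustrate the kind of lookup-driven replacement this describes, here is a hedged sketch in Python (outside Alteryx). The two mapping entries are taken from the examples above; the real attached table would hold all of the roughly 60 pairs:

```python
import re

# Lookup table: old 5-character "YYMMM" token -> new 7-character "DDMMMYY" token.
# Only the two pairs shown in the examples above; the real table has ~60 entries.
date_map = {
    "19JAN": "31JAN19",
    "19FEB": "28FEB19",
}

# One compiled pattern matching any key. A single regex pass per row is far
# cheaper than ~60 separate find-and-replace calls per row.
pattern = re.compile("|".join(map(re.escape, date_map)))

def fix_ticker(ticker: str) -> str:
    """Replace the YYMMM portion with its DDMMMYY equivalent,
    leaving the rest of the string untouched."""
    return pattern.sub(lambda m: date_map[m.group(0)], ticker)

print(fix_ticker("ACC19JAN88260PE"))  # ACC31JAN1988260PE
print(fix_ticker("ACC19FEB88260PE"))  # ACC28FEB1988260PE
```

The key design point, whatever the tool, is to build one combined pattern from the table rather than chaining 60 individual replace operations over millions of rows.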
This requirement is quite simple and could be achieved with a "Find and Replace" formula, a regex, or a similar method. However, I have to run this flow on CSV data files that are quite large (around 4 million rows per file), and I have to run it on hundreds of such files.
So I am looking for an EFFICIENT solution, so that the PC does not hang for long stretches of time because of the heavy load on the CPU and RAM.
Can someone please suggest the most efficient way of achieving this objective for large data sets? Any help or guidance is most welcome.
To make it easy for you, I have attached the sample data file 20190130.csv on which this flow needs to run. The actual data file is quite big, with millions of rows, but I have made a very small sample out of it for sharing on the forum. All this data has a constant schema throughout.
I have attached the Excel file named "Date Table for FIND AND REPLACE REQUIREMENT .xlsx", which has to be used for building the find-and-replace formula.
I have also attached the Alteryx Workflow Package (.yxzp) file, so that you can easily open it on your machine. I used the latest version of Alteryx to create this workflow. Please have a look.
Thanks a lot
PS: Please note that although we are working with DATES throughout this workflow, the original "Ticker" field is a STRING and has to be kept as a string during all these steps, because it contains not only this date information but a lot of additional information as well, all combined together within the string. Therefore, we cannot convert this string to a date type in order to use the formulas that are specifically designed for working with dates. We have to treat the Ticker field as a string only. You will see this easily when you look at the actual data.
@Qiu Thank you so much for the quick reply and for attaching the sample workflow file.
And @atcodedog05 thank you for posting the links with good examples.
I checked the solution suggested by Qiu and it gives the correct output. However, I would like to know whether this is the most efficient method for this requirement if it has to run on millions of records per file.
Secondly, is there a way to know for how many rows the replacement was done, and for how many rows it was not? I would especially like to see the details of the rows for which the replacement WAS NOT done, and export them to a separate CSV file so that I can manually check and modify those particular rows.
Any ideas on how this particular step could be done?
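As a hedged sketch of this splitting step outside Alteryx (inside Alteryx this would typically be a Filter or Join after the replacement), one could count substitutions per row and route the untouched rows to a separate file. The field name "Ticker" matches the data described above; the two mapping entries are just the sample pairs from the question:

```python
import csv
import re

# Assumed sample mapping; the real table has ~60 entries.
date_map = {"19JAN": "31JAN19", "19FEB": "28FEB19"}
pattern = re.compile("|".join(map(re.escape, date_map)))

def split_rows(rows):
    """Return (replaced, untouched) row lists, based on whether the
    Ticker field contained a known YYMMM token."""
    replaced, untouched = [], []
    for row in rows:
        # subn() reports how many substitutions were made in this row.
        new_ticker, n = pattern.subn(lambda m: date_map[m.group(0)], row["Ticker"])
        if n:
            row["Ticker"] = new_ticker
            replaced.append(row)
        else:
            untouched.append(row)  # these rows need manual review
    return replaced, untouched

rows = [{"Ticker": "ACC19JAN88260PE"}, {"Ticker": "ACC19XYZ88260PE"}]
replaced, untouched = split_rows(rows)

# Export the rows where no replacement happened to their own CSV.
with open("unmatched.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Ticker"])
    writer.writeheader()
    writer.writerows(untouched)
```

The counts `len(replaced)` and `len(untouched)` answer the "how many rows" question directly.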
I'd like to add some commentary to @Qiu's and @atcodedog05's solutions. Since an "efficient" solution was requested, I'll explain how to make either of them more performant. With the sample data multiplied by 100,000 (about 3 million records), I timed the original solution at 1 minute 7 seconds. The majority of that time is spent in the DATA CLEANSE tool: more than 60% of it, as you can see if you enable performance profiling in the Workflow Configuration's runtime settings. If you replace that single tool with two Multi-Field Formula tools (one for text fields and one for numeric fields), the run time drops to 28.6 seconds.
/* Text Formula */
/* Numeric Formula */
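The formula screenshots above aren't reproduced here, but the standard Data Cleanse behaviors that the two Multi-Field Formula tools typically replicate (null-to-blank plus whitespace trimming for text, null-to-zero for numbers) can be sketched in Python. The exact expressions used in the attached workflow may differ:

```python
def clean_text(value):
    """Rough equivalent of the text Multi-Field Formula:
    replace nulls with an empty string and trim whitespace."""
    return "" if value is None else value.strip()

def clean_number(value):
    """Rough equivalent of the numeric Multi-Field Formula:
    replace nulls with zero."""
    return 0 if value is None else value

print(clean_text("  ACC19JAN88260PE  "))  # ACC19JAN88260PE
print(clean_number(None))                 # 0
```

The performance point is that these two targeted formulas do only the cleanup the data actually needs, whereas the Data Cleanse tool runs a much more general set of checks over every field.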
If you then turn on the AMP engine, the time is reduced to 14.7 seconds!
Simply turning on the AMP engine also improves the as-is workflow: it then executes in 34.5 seconds.
But the best timing I achieved was by SAVING the Select configuration to a field-type (.yxft) file, LOADING it within the SELECT tool, and removing the Auto Field tool. The timing (with the AMP engine on) was then 9.5 seconds.
In summary: the Auto Field tool is handy for finding the right data type and size, but if you're running the process repeatedly, fix the Select configuration manually. Avoid the DATA CLEANSE tool for performance reasons. And finally, use AMP when possible.
I'm attaching a workflow (without data) that includes a GENERATE ROWS tool (used to create the 3 million records) for your review. I hope the 84% reduction in run time is useful to you.