Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Most Efficient formula for “Find and Replace” type of requirement

StockMarket
8 - Asteroid

Hello

 

I am new to Alteryx and I have to design a workflow, which perform a simple “Find and Replace” task on a single field within the csv file. But there are 2 main problems that I am facing, the first one is that there are “ almost 60 comparisons” that have to be made for this find and replace task and secondly, this flow needs to be run on csv files, which have Millions of Rows inside them.

 

As you can see in the attached sample data file, there is a field named “Ticker” which has got multiple details inside it in the form of a “Large String Value”, which also including the DATE in this format –
“YYMMM” where YY stands for Year in short form like 19 for the year 2019, and MMM stands for 3 digits of the Month, like FEB stands for February and so on for all other months.

 

Now, my requirement is that I have to actually REPLACE this 5 character “Date” from these large string values with the new 7 character date, which will also include the actual 2 digits of date inside them. Instead of just telling YYMMM, they will actually show the complete date in this format – DDMMMYY, where DD is the date, MMM is the month and YY is the year.

 

In short, “YYMMM” needs to be replaced with “DDMMMYY”, leaving all other information inside that string value as it is.


For example, the original Ticker value of

 

“ACC19JAN88260PE” will get converted to “ACC31JAN1988260PE”

 

means “19JAN got replaced by 31JAN19” ,

 

and the original Ticker value of “ACC19FEB88260PE”

 

will get converted to “ACC28FEB1988260PE”

 

means “19FEB got replaced by 28FEB19”

 

and so on for all different field values. I hope you understand the requirement.

 

 

 

Now, for adding this “DD” part, we have a table of values, which tells exactly what value there should be for a particular “YYMM” as given in the original data. Please have a look at the attached sample data file. We can’t just use any random date like the first date of a month or the last date of month etc. instead we have to replace the specific “YYMMM” portions with their corresponding “DDMMMYY” as given in the attached table.

 

Although this requirement is quite simple and it could be achieved by either using a formula for “Find and Replace” or by using “Refex” or any other method like that. However, I have to run this flow on csv data files, which are quite large, having around 4 million rows per csv file and I have to run this flow onto hundreds of such files.

 

So I am looking for an EFFICIENT SOLUTION for this problem, so that the PC does not remain in the HANG STATE for large duration of time because of this Extensive Load on the CPU and RAM etc.


Can someone please suggest the most efficient way of achieving this objective for large data sets? Any help or guidance is most welcome.

 

To make it easy for you, I have attached the sample data 20190130.csv file on which this flow needs to be run. Although the actual data file is quite big, having millions of rows in it, but I have made a very small sample out of it for sharing on the forum. All this data has a constant schema throughout.

 

I have attached the Excel File named "Date Table for FIND AND REPLACE REQUIREMENT .xlsx" which has to be used for creating the “Find and Replace Formula”.

 

I have also attached the Alteryx Worflow Package .yxzp file, so that you can easily open it up on your machine. I have used the latest version of Alteryx for creating this workflow. You may please have a look.

 

Thanks a lot

 

PS: Please note that although we are working with DATES throughout this workflow, but actually the original field “Ticker” is a STRING and it has to be kept as a string only, while doing all these steps, because it not only contains this date information, but a lot of additional information as well, which is all combined together within this string. Therefore, we cannot change this string to date format in order to make use of those formulas, which are specifically designed for working with dates. We have to treat this Ticker Field as a string only. You will understand this situation easily, when you look at the actual data.

 

Sample Data.png

 

Workflow.png

 

11 REPLIES 11
Qiu
20 - Arcturus
20 - Arcturus

@StockMarket 
Suprisingly, Find and Replace is perfect for this job, especially you have provided a reference table.

1205-StockMarket-1.PNG

atcodedog05
22 - Nova
22 - Nova

Hi @StockMarket 

 

As @Qiu suggested Find and Replace tool is the way to go

 

Here are some resources to help you get a better understanding of it.

https://community.alteryx.com/t5/Interactive-Lessons/VLookUps-with-Designer/ta-p/80201#done

 

And here is an example documentation showing different capabilities of the tool (Workflow attached in the post)

atcodedog05_0-1607155131594.png

 

Tool documentation.

https://help.alteryx.com/current/designer/find-replace-tool

 

Hope this helps 🙂 Feel to ask if you have any questions

StockMarket
8 - Asteroid

@QiuThank you so much for the quick reply and for attaching the sample workflow file.

 

And @atcodedog05 thank you for posting the links with good examples.

 

I checked the solution suggested by Qiu and it is giving the correct output, although I would like to know, if this is the BEST/EFFICIENT METHOD for this requirement, if we have to do this on Millions of Records on each file?

 

Secondly, is there a way to know, for how many rows the replacement was done and for how many rows, the replacement was not done? I would specially like to know the details of those rows for which the replacement WAS NOT DONE and I would like to export them to a separate csv file, so that I can manually check and modify those particular rows.


Any ideas, regarding how could this particular step could be done.

 

Thank you so much.

atcodedog05
22 - Nova
22 - Nova

Hi @StockMarket 

 

In @Qiu 's workflow he has created a duplicate column and then performed replace.

 

So basically comparing [Ticker]=[Ticker1] in filter will give you the rows with no changes and false will give you changes.

atcodedog05_0-1607161683989.png

In your example all had changes.

 

Hope this helps 🙂

 

Qiu
20 - Arcturus
20 - Arcturus

@StockMarket  @atcodedog05 

I created the duplicated fields for easy comparision purpose.

Of course it can be removed.

Qiu
20 - Arcturus
20 - Arcturus

@StockMarket 
If you find my answer is useful, appreciate you would mark it as accept.

MarqueeCrew
20 - Arcturus
20 - Arcturus

@StockMarket ,

 

I'd like to add commentary to @Qiu and @atcodedog05 's solutions.  When asking for "efficient" solutions, I hope to explain how to make either of their solutions more performant.  I have timed the solution at 1:07 seconds when the sample data is multiplied by 100,000.  With 3 million records, the workflow executes in just over a minute.  But, the majority of the time is spent in the DATA CLEANSE tool.  In fact, more than 60% of the time is spent in that tool if you enable performance profiling in the Workflow-Configuration's runtime settings.  If you replace that single tool with 2 Multi-Field Formula tools (one for text and the other for numbers), the time is reduced to 28.6 seconds.

/* Text Formula */
Trim(IIF(IsNull([_CurrentField_]),"",[_CurrentField_]))

/* Numeric Formula */
IIF(Isnull([_CurrentField_]),0,[_CurrentField_])

If you then turn on the AMP engine, the time is reduced to 14.7 seconds!

capture.png

By simply turning on the AMP engine, you'll improve the AS-IS workflow, then the workflow will execute in 34.5 seconds.

 

But the best timing that I achieved was by SAVING the Select configuration and LOADING the Select configuration (yxft) file within the SELECT tool and removing the AutoFields Tool.  Then the timing (with AMP Engine on) was 9.5 seconds.

 

capture.png

 

In summary, the auto-field tool is nice to use to find the right data type and size, but if you're running the process repeatedly, fix the select manually.  Don't use the DATA CLEANSE tool for performance reasons.  And finally, use AMP when possible.

 

I'm attaching a workflow without data that includes the GENERATE ROWS tool (used to create 3m records) for your review.  I hope that the 84% reduction in time is found useful for you.

 

Cheers,

 

Mark

 

CC:  @TuvyL , @CassC , @LeahK 

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
TuvyL
Alteryx Alumni (Retired)

@RussJD: check out @MarqueeCrew's impressive AMP-boosted efficiency!

Tuvy Le
Manager, Community ACE Program
RussJD
Alteryx Alumni (Retired)

Do I sense an 'AMP' license plate in your future @MarqueeCrew ? 🙂

Honestly, this is great. I sincerely appreciate the detailed comment.

Labels