Alteryx Designer Desktop Discussions

SOLVED

Group Rows by Regex majority match

cbz
8 - Asteroid

Hello

 

I am looking for a way to group rows whose content is approximately similar, for example rows that match at 70% or more, and if possible to output a similarity score as well.

 

For example, two rows of data might be "testing file -uk" and "testing file2  - US". The words "testing file" are common to both and the rows are mostly the same, so I want to group them as a single record.

 

I have attached a picture with sample data.

 

Thanks

 

Chunbin 

5 REPLIES
echuong1
Alteryx Alumni (Retired)

You can use the Fuzzy Match tool to identify values that are similar to each other. From there, you can use the Make Group and Find Replace tools to find and normalize the values.

 

After that, you can use the Summarize tool to group the values. See attached for an example. You can experiment with the matching logic as well as the match threshold. In my example, the first and third records would be grouped together (as shown in the Group field).

 

echuong1_0-1595871480779.png
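
For readers who want to see the idea outside Alteryx, here is a minimal Python sketch of the same three steps: score pairs, link pairs above a threshold into groups, then report one group id per row. It is only an illustration. It uses `difflib.SequenceMatcher` from the standard library as a stand-in scorer, which is an assumption and not the algorithm the Fuzzy Match tool uses; the function names `similarity` and `group_similar` and the third sample row are made up for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity ratio, ignoring case and repeated whitespace."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def group_similar(rows: list[str], threshold: float = 0.7) -> list[int]:
    """Give each row a group id; rows linked by a score >= threshold share an id."""
    parent = list(range(len(rows)))  # union-find forest, one node per row

    def find(i: int) -> int:
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Compare every pair once and merge the groups of pairs that are similar enough.
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            if similarity(rows[i], rows[j]) >= threshold:
                parent[find(j)] = find(i)

    return [find(i) for i in range(len(rows))]


# The first two rows come from the question; the third is a made-up unrelated row.
rows = ["testing file -uk", "testing file2  - US", "quarterly budget report"]
groups = group_similar(rows)
for row, group in zip(rows, groups):
    print(group, row)
```

Because groups are merged transitively, a chain of pairwise matches can pull fairly different rows into one group, so the threshold still needs tuning against real data.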

 

 

 

Hope this helps!

cbz
8 - Asteroid

Hi @echuong1 

 

Thank you for the solution.

 

I have a few questions about this:

 

  • In the Group field, why does it pick "This is a test title for similarity - Germany" instead of "This is a test title for similarity - UK"?
  • Why doesn't row number 2 have a value in the Group column?
  • Since they all share the common words "This is a test title for similarity", shouldn't they all be classified as the same group?

Thanks

 

Chunbin 

echuong1
Alteryx Alumni (Retired)

The reason is that the row didn't fall within the similarity threshold I configured. You'll have to experiment with the logic used to find similar records to see what works best with your dataset. I was only giving you a general configuration as a starting point, since it will need to be tweaked for your data.

 

I've lowered the threshold to pick up the second row. It was not included because the extra words "number 2" pushed the row below the similarity threshold.

 

echuong1_0-1595872042624.png
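
To make the threshold effect concrete, here is a small Python sketch. The three titles are hypothetical (the real sample data is only visible in the screenshots), and `difflib.SequenceMatcher` is again just an assumed stand-in for whatever scoring the Fuzzy Match tool applies. The point is only that inserting extra words lowers the score, so a row can be excluded at one threshold and included at a slightly lower one.

```python
from difflib import SequenceMatcher

# Hypothetical titles modeled on the thread; not the actual attached data.
base = "This is a test title for similarity - UK"
close = "This is a test title for similarity - Germany"
extra = "This is a test title number 2 for similarity - France"


def score(a: str, b: str) -> float:
    """Case-insensitive 0..1 similarity ratio between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def passes(a: str, b: str, threshold: float) -> bool:
    """True when the pair scores at or above the threshold."""
    return score(a, b) >= threshold


# Try a stricter and a looser threshold against the same pairs.
for threshold in (0.85, 0.80):
    print(
        threshold,
        "close:", passes(base, close, threshold),
        "extra:", passes(base, extra, threshold),
    )
```

With these made-up strings, the title containing "number 2" is rejected at the stricter setting and accepted at the looser one, while the title that differs only in the country passes both.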

 

cbz
8 - Asteroid

Title Similarity.PNG

Hi

I have tried your suggestion and it improved the result, but there are still some parts I am not able to get around.

 

For example, in the screenshot I attached, I ideally need the Group to be "Monthly markets review - August 2018", or at least "Monthly markets review - August 2018 - Adviser - Company Name".

 

Can you suggest what I need to modify?

 

I have also attached the workflow with the sample data.

 

Thanks

 

Chunbin 

 

 

 

 

echuong1
Alteryx Alumni (Retired)

Again, you'll need to test the different options to see what works best with your dataset. 

 

I'd suggest reviewing the output of the Make Group tool to see whether everything is being categorized correctly. If not, you can use a Filter tool to remove the incorrect records and a Text Input tool with a Union tool to manually add the right ones to the list. You can also use a Formula tool to update the values.
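
As one possible way to "update the values" so that a group is labeled with the shared part of its titles, here is a hedged Python sketch. It is not taken from the attached workflow: the function `group_label` is hypothetical, and the second title is invented to pair with the one quoted above. It keeps the leading words that every title in a group has in common and trims any trailing separator.

```python
def group_label(titles: list[str]) -> str:
    """Return the leading words shared by every title, without a trailing ' - '."""
    split_titles = [title.split() for title in titles]
    common = []
    # Walk the titles word by word and stop at the first position where they differ.
    for words in zip(*split_titles):
        if len(set(words)) != 1:
            break
        common.append(words[0])
    # Drop a dangling separator such as the final "-" left at the cut point.
    return " ".join(common).rstrip(" -")


# First title is quoted in the thread; the second is a made-up sibling record.
titles = [
    "Monthly markets review - August 2018 - Adviser - Company Name",
    "Monthly markets review - August 2018 - Investor - Company Name",
]
print(group_label(titles))
```

Comparing whole words rather than characters avoids cutting a label in the middle of a word when two titles diverge partway through one.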

 

If this resolves your issue, please mark this thread as solved so others can find answers more easily. Thanks!
