Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Matching Multiple Keywords in a Free-Text Field (using RegEx)

ekarmanov
7 - Meteor

Hello,

 

I am working with data that contains free-text descriptions of numerous employee expenses (e.g., "Lunch with attorneys to discuss charitable giving and political contributions"). 

 

I'd like to run each of these descriptions against a list of predetermined keywords that help me identify transactions of interest. The list currently contains 50+ keywords such as ATTORNEY, CHARITABLE, POLITICAL, CONTRIBUTION.

 

I’m currently using a RegEx-based formula to match each free-text string against these keywords:

 

if (REGEX_Match([Description], '.*ATTORNEY.*')) then 'ATTORNEY'

elseif (REGEX_Match([Description], '.*CHARITABLE.*')) then 'CHARITABLE'

elseif (REGEX_Match([Description], '.*CONTRIBUTION.*')) then 'CONTRIBUTION'

elseif (REGEX_Match([Description], '.*POLITICAL.*')) then 'POLITICAL'

else '' endif

 

I’m finding this approach falls short of desired results as the current formula will only identify the first match (in this example “attorney”) neglecting the other keyword matches.

Ideally, I’m hoping to find an approach that will identify all matching keywords (there would be four in this example), list them out, and provide a count of matching keywords in each free-text string.

 

Thank you for your help,

Eugene

18 REPLIES 18
JohnJPS
15 - Aurora

Contains([field],[word]) should hit even if the word is not a full word... i.e. If [field] has "politician" it should match "politic".

 

But either way, I can see where RegEx will eventually be more useful (e.g. multiple contains), so... does this work:

 

iif(REGEX_Match([Description], ".*"+[Word]+".*"),[Word],"")

 

(EDIT: removed embedded quotes - I believe they're unnecessary.)

KaneG
Alteryx Alumni (Retired)

Hi @ekarmanov,

 

It's interesting to see how different people would go about this... I do love the dynamic formula tool, it's saved me a couple of times.

 

The solution I would go with is to split the Expense field to 1 row per word then use a Find/Replace to append the found word, before a summarise and then join back on the original term.

 

I'm assuming you won't have more than a couple of million expense records and not more than 10-12 words mx in that expense field and so only going to blow the records out to max 10 miilion or so... you can also drop your other fields off after the recordID and join them back on at the end for faster processing

 

Image 001 - 20160805 - 100800.png

 

Kane

ekarmanov
7 - Meteor

The main reason I'm trying to make RegEx work is because I have a several keywords (e.g., "Red Tape") where I need to match instances where it's spelled without a space and with the space between the two words. So I'm thinking using ".?" (as in ".*RED.?TAPE.*") should solve the problem with RegEx.

Using Contains doesn't seem to capture all the possibilities as well as RegEx does.

Though I'm now starting to think maybe I should just stick with Contains and have several versions of my keywords instead (e.g., both "Red Tape" and "RedTape")...

 

I tried using the RegEx expression you suggested, but it didn't work for me.

I then tried to configure the Contains expression so that Contains is replaced by REGEX_Match (I'm attaching the workflow), but that resulted in no WordsFound at all, which I can't understand why.

 

Appreciate any insight...

 

Trying to make RegEx work.PNG

AdamR_AYX
Alteryx Alumni (Retired)

Looks like RegEx_Match works a little different to RegEx_CountMatches.  In that RegEx_Match needs to macth the whole string.  Which means for POLITIC we need .*POLITIC.* in RegEx_Match where as we can get away with just POLITIC in RegEx_CountMatches.

 

I've updated the workflow to make this work.  I added an extra column in the keyword input for the regex as it seems you would want to report the actual word rather than the regex.

 

One thing that you had missed on your workflow was at the bottom of the summarise tool for concat you can specify the separator, which was how I got all of the + symbols between my formula parts.

Adam Riley
https://www.linkedin.com/in/adriley/
ekarmanov
7 - Meteor

Hi @KaneG,

 

Thank you for this great solution. I think it works quite well and is similar to the one proposed earlier by @jdunkerley79 (my thanks to you as well!).

My concern with appending fields is (as you pointed out) related to processing times. I'm running this workflow alongside several others and I do have several million records to get through.

 

But I can put aside my concerns with this if there's a way to configure this workflow to look at multi-word keywords (e.g., "red tape"). Current workflow will only identiy single words it seems.

Do you know if/how this could best be done? I'm attaching your workflow with this additional keyword to show what I mean.

 

Thank you,

Eugene

AdamR_AYX
Alteryx Alumni (Retired)

Last workflow had some issues on the counting side.   Try this one instead.

Adam Riley
https://www.linkedin.com/in/adriley/
ekarmanov
7 - Meteor

Thank you @AdamR 

I think your solution is the most complete and covers all the bases that I can currently think of.

 

Much appreciated!

Jeswantbabu
6 - Meteoroid

Hi,

 

I have doubt on the above flow.

 

Eg- I have a keyword say ABC & ABCD

 

Description 1-  The word ABC is contained in ABCD.

 

Description 2- The correct word is ABCD.

 

The problem that i am facing is In the second description it should take only the Keyword ABCD, but it also considers ABC. How to correct this?

Jeswantbabu
6 - Meteoroid

Hi,

 

I have doubt on the above flow.

 

Eg- I have a keyword say ABC & ABCD

 

Description 1-  The word ABC is contained in ABCD.

 

Description 2- The correct word is ABCD.

 

The problem that i am facing is In the second description it should take only the Keyword ABCD, but it also considers ABC as well which shouldnt occur. How to correct this?

Labels