Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Matching Multiple Keywords in a Free-Text Field (using RegEx)

ekarmanov
7 - Meteor

Hello,

 

I am working with data that contains free-text descriptions of numerous employee expenses (e.g., "Lunch with attorneys to discuss charitable giving and political contributions"). 

 

I'd like to run each of these descriptions against a list of predetermined keywords that help me identify transactions of interest. The list currently contains 50+ keywords such as ATTORNEY, CHARITABLE, POLITICAL, CONTRIBUTION.

 

I’m currently using a RegEx-based formula to match each free-text string against these keywords:

 

if (REGEX_Match([Description], '.*ATTORNEY.*')) then 'ATTORNEY'

elseif (REGEX_Match([Description], '.*CHARITABLE.*')) then 'CHARITABLE'

elseif (REGEX_Match([Description], '.*CONTRIBUTION.*')) then 'CONTRIBUTION'

elseif (REGEX_Match([Description], '.*POLITICAL.*')) then 'POLITICAL'

else '' endif

 

I’m finding this approach falls short of desired results as the current formula will only identify the first match (in this example “attorney”) neglecting the other keyword matches.

Ideally, I’m hoping to find an approach that will identify all matching keywords (there would be four in this example), list them out, and provide a count of matching keywords in each free-text string.

 

Thank you for your help,

Eugene

18 REPLIES 18
JohnJPS
15 - Aurora

Instead of the "if" you could create an integer field and assign it...

ToNumber(REGEX_Match([Description], '.*ATTORNEY.*'))
+ ToNumber(REGEX_Match([Description], '.*CHARITABLE.*'))
+ ToNumber(REGEX_Match([Description], '.*CONTRIBUTION.*')) 
+ ToNumber(REGEX_Match([Description], '.*POLITICAL.*')) 
...

... this would give the count, but not which keywords matched.

jdunkerley79
ACE Emeritus
ACE Emeritus

I would use an append fields tool to join all 50+ terms to the input set.

 

You can then use that in a REGEX_Match Contains as a custom filter and finally a summarise to get the count and concatenate the matches.

 

2016-08-03_14-04-45.jpg

Sample attached

pcatterson
11 - Bolide

Mine is less elegant, but it essentially is a multistepped formula tool to create a list of the keywords.

pcatterson
11 - Bolide

To get a count from mine:

 

If IsNull([Keywords])
THEN 0
ELSE REGEX_CountMatches([Keywords],', ') + 1
ENDIF
JohnJPS
15 - Aurora

It would be nice to have a way that doesn't do AppendFields, which for a very larger dataset could become cumbersome.  The formulaic approach is good from that sense, but that requires more maintenance when modifying your set of keywords. With a regular programming language, I would just loop through and apply a sort of custom "grep" on each row and get things done pretty quickly without the memory hit taken when using AppendFields.

 

With that in mind, I did it in R using a simple loop.  I actually doubt this performs any better at all, especially given the time taken to invoke R in the first place, but this is nonetheless yet another way to solve the problem. (workflow attached).

 

(@jdunkerley79 this does seem amenable to solution with the SDK: looping in a proper language (as opposed to interpreted R) should be very fast...  if I ever find time perhaps I'll give it a go.)

 

EDIT: updated workflow to shorten/simplify.

pcatterson
11 - Bolide

I could also see doing it with a iterative macro, but thats probably more complicated a solution for the problem.

jdunkerley79
ACE Emeritus
ACE Emeritus
Iterative macro was my first guess but figured the append fields would work for most reasonable size cases.

SDK solution would be fun. Will see what I can come up with.
AdamR_AYX
Alteryx Alumni (Retired)

Hear is my attempt which uses the CReW dynamic formula tool to build the formula dynamically from a list of input words.  Which has the benifit of taking the list of words without having to do the append.

 

WordFinder.png

Adam Riley
https://www.linkedin.com/in/adriley/
ekarmanov
7 - Meteor

Hello all,

 

Thank you kindly for your great responses. Though some of the solutions are a bit over my head, I am learning a lot and appreciate your guidance.

 

@AdamR, I think your solution may be the one that proves the best fit for my situation.

Thank you, btw, for educating me on availability of the great CReW tools you're building.

I am hoping you can help me place a REGEX_Match instead of Contains function as part of the IIF formula. I'd like to have a flexible match on the words and would like to use a '.*' to allow for variations on the words (e.g. Word = politic would produce hits on both "politician" and "politics").

I'm struggling to adapt the IIF formula (example below) to get it to work and show all the hits correctly. Thank you for your help with this.

 

‘iif(REGEX_Match([Description], "'.*+[Word]+.*'"),"'+[Word]+',", "")’

Labels