Alteryx Designer Desktop Discussions

ekarmanov · ‎08-03-2016

Hello,

I am working with data that contains free-text descriptions of numerous employee expenses (e.g., "Lunch with attorneys to discuss charitable giving and political contributions").

I'd like to run each of these descriptions against a list of predetermined keywords that help me identify transactions of interest. The list currently contains 50+ keywords such as ATTORNEY, CHARITABLE, POLITICAL, CONTRIBUTION.

I’m currently using a RegEx-based formula to match each free-text string against these keywords:

if (REGEX_Match([Description], '.*ATTORNEY.*')) then 'ATTORNEY'

elseif (REGEX_Match([Description], '.*CHARITABLE.*')) then 'CHARITABLE'

elseif (REGEX_Match([Description], '.*CONTRIBUTION.*')) then 'CONTRIBUTION'

elseif (REGEX_Match([Description], '.*POLITICAL.*')) then 'POLITICAL'

else '' endif

I’m finding this approach falls short of desired results as the current formula will only identify the first match (in this example “attorney”) neglecting the other keyword matches.

Ideally, I’m hoping to find an approach that will identify all matching keywords (there would be four in this example), list them out, and provide a count of matching keywords in each free-text string.

Thank you for your help,

Eugene

JohnJPS · ‎08-03-2016

Instead of the "if" you could create an integer field and assign it...

ToNumber(REGEX_Match([Description], '.*ATTORNEY.*'))
+ ToNumber(REGEX_Match([Description], '.*CHARITABLE.*'))
+ ToNumber(REGEX_Match([Description], '.*CONTRIBUTION.*')) 
+ ToNumber(REGEX_Match([Description], '.*POLITICAL.*')) 
...

... this would give the count, but not which keywords matched.

jdunkerley79 · ‎08-03-2016

I would use an append fields tool to join all 50+ terms to the input set.

You can then use that in a ~~REGEX_Match~~ Contains as a custom filter and finally a summarise to get the count and concatenate the matches.

Sample attached

pcatterson · ‎08-03-2016

Mine is less elegant, but it essentially is a multistepped formula tool to create a list of the keywords.

pcatterson · ‎08-03-2016

To get a count from mine:

If IsNull([Keywords])
THEN 0
ELSE REGEX_CountMatches([Keywords],', ') + 1
ENDIF

JohnJPS · ‎08-03-2016

It would be nice to have a way that doesn't do AppendFields, which for a very larger dataset could become cumbersome. The formulaic approach is good from that sense, but that requires more maintenance when modifying your set of keywords. With a regular programming language, I would just loop through and apply a sort of custom "grep" on each row and get things done pretty quickly without the memory hit taken when using AppendFields.

With that in mind, I did it in R using a simple loop. I actually doubt this performs any better at all, especially given the time taken to invoke R in the first place, but this is nonetheless yet another way to solve the problem. (workflow attached).

(@jdunkerley79 this does seem amenable to solution with the SDK: looping in a proper language (as opposed to interpreted R) should be very fast... if I ever find time perhaps I'll give it a go.)

EDIT: updated workflow to shorten/simplify.

pcatterson · ‎08-03-2016

I could also see doing it with a iterative macro, but thats probably more complicated a solution for the problem.

jdunkerley79 · ‎08-03-2016

Iterative macro was my first guess but figured the append fields would work for most reasonable size cases.

SDK solution would be fun. Will see what I can come up with.

AdamR_AYX · ‎08-04-2016

Hear is my attempt which uses the CReW dynamic formula tool to build the formula dynamically from a list of input words. Which has the benifit of taking the list of words without having to do the append.

Adam Riley
https://www.linkedin.com/in/adriley/

ekarmanov · ‎08-04-2016

Hello all,

Thank you kindly for your great responses. Though some of the solutions are a bit over my head, I am learning a lot and appreciate your guidance.

@AdamR, I think your solution may be the one that proves the best fit for my situation.

Thank you, btw, for educating me on availability of the great CReW tools you're building.

I am hoping you can help me place a REGEX_Match instead of Contains function as part of the IIF formula. I'd like to have a flexible match on the words and would like to use a '.*' to allow for variations on the words (e.g. Word = politic would produce hits on both "politician" and "politics").

I'm struggling to adapt the IIF formula (example below) to get it to work and show all the hits correctly. Thank you for your help with this.

‘iif(REGEX_Match([Description], "'.*+[Word]+.*'"),"'+[Word]+',", "")’

Alteryx Designer Desktop Discussions

Matching Multiple Keywords in a Free-Text Field (using RegEx)