Hello,
I am working with data that contains free-text descriptions of numerous employee expenses (e.g., "Lunch with attorneys to discuss charitable giving and political contributions").
I'd like to run each of these descriptions against a list of predetermined keywords that help me identify transactions of interest. The list currently contains 50+ keywords such as ATTORNEY, CHARITABLE, POLITICAL, CONTRIBUTION.
I’m currently using a RegEx-based formula to match each free-text string against these keywords:
if (REGEX_Match([Description], '.*ATTORNEY.*')) then 'ATTORNEY'
elseif (REGEX_Match([Description], '.*CHARITABLE.*')) then 'CHARITABLE'
elseif (REGEX_Match([Description], '.*CONTRIBUTION.*')) then 'CONTRIBUTION'
elseif (REGEX_Match([Description], '.*POLITICAL.*')) then 'POLITICAL'
else '' endif
I’m finding this approach falls short of desired results as the current formula will only identify the first match (in this example “attorney”) neglecting the other keyword matches.
Ideally, I’m hoping to find an approach that will identify all matching keywords (there would be four in this example), list them out, and provide a count of matching keywords in each free-text string.
Thank you for your help,
Eugene
Solved! Go to Solution.
Instead of the "if" you could create an integer field and assign it...
ToNumber(REGEX_Match([Description], '.*ATTORNEY.*')) + ToNumber(REGEX_Match([Description], '.*CHARITABLE.*')) + ToNumber(REGEX_Match([Description], '.*CONTRIBUTION.*')) + ToNumber(REGEX_Match([Description], '.*POLITICAL.*')) ...
... this would give the count, but not which keywords matched.
To get a count from mine:
If IsNull([Keywords]) THEN 0 ELSE REGEX_CountMatches([Keywords],', ') + 1 ENDIF
It would be nice to have a way that doesn't do AppendFields, which for a very larger dataset could become cumbersome. The formulaic approach is good from that sense, but that requires more maintenance when modifying your set of keywords. With a regular programming language, I would just loop through and apply a sort of custom "grep" on each row and get things done pretty quickly without the memory hit taken when using AppendFields.
With that in mind, I did it in R using a simple loop. I actually doubt this performs any better at all, especially given the time taken to invoke R in the first place, but this is nonetheless yet another way to solve the problem. (workflow attached).
(@jdunkerley79 this does seem amenable to solution with the SDK: looping in a proper language (as opposed to interpreted R) should be very fast... if I ever find time perhaps I'll give it a go.)
EDIT: updated workflow to shorten/simplify.
I could also see doing it with a iterative macro, but thats probably more complicated a solution for the problem.
Hello all,
Thank you kindly for your great responses. Though some of the solutions are a bit over my head, I am learning a lot and appreciate your guidance.
@AdamR, I think your solution may be the one that proves the best fit for my situation.
Thank you, btw, for educating me on availability of the great CReW tools you're building.
I am hoping you can help me place a REGEX_Match instead of Contains function as part of the IIF formula. I'd like to have a flexible match on the words and would like to use a '.*' to allow for variations on the words (e.g. Word = politic would produce hits on both "politician" and "politics").
I'm struggling to adapt the IIF formula (example below) to get it to work and show all the hits correctly. Thank you for your help with this.
‘iif(REGEX_Match([Description], "'.*+[Word]+.*'"),"'+[Word]+',", "")’