I have a dataset with ~300k rows that has a "Job Description" field which contains free form text entries describing construction projects. I need to parse the whole database to find observations that contain references to Ceiling, Drywall, Insulation, Hardware, Flooring, and Concrete. I need to use wildcards to allow for misspellings. I've figured out how to make it do what I need it to do for ONE keyword, but I don't know enough about RegEx to know how to make it use a logical OR to look for more than one keyword at a time per cell.
Here's what I've got. Each one of these lines is using the formula tool and creating a new output column:
Contains(('ceiling'), [Text])
iif(REGEX_Match([Text], '.*ceil.*'),'Ceiling'," ")
REGEX_CountMatches([Text],'Ceiling')
How can I modify this to search for more than one term at a time? Thanks in advance!
Solved! Go to Solution.
Hi,
For your Contains formula you can use Contains('ceiling', [Text]) OR Contains('drywall', [Text]) OR Contains('insulation', [Text]) etc..
For the RegEx formulas you can create a capturing group for example: REGEX_CountMatches([Product Name], "(Xerox|&|Black)")
I like this solution, because it's the most reusable. The only problem is that sometimes the text entries might have multiple keywords, and this solution only returns the first keyword in the text string. I would need to either have a count column that is updated each time a keyword is found, or I would need to have each additional keyword generate a new column to show secondary or tertiary segments. Any thoughts? Thanks so much for your help!
Thank you, but I think the text fields that I'm analyzing are too long to separate all of them into different rows after each space. It would create an unwieldy amount of rows. I will definitely use this in the future, though!
I think that should be OK, Alteryx can cope with a lot of records (I have seen processes in the billions, Ned has also Tweeted that he processed a Trillion before as well here)
A lot of it will depend on the spec of your machine though.
Thanks for your help, now I know!
Oh wow, that is super useful. My work machine is grossly underpowered - but I will give it a try and see how long it takes!
User | Count |
---|---|
17 | |
15 | |
15 | |
8 | |
5 |