Alteryx Designer Desktop Discussions

hellyars · ‎10-20-2022

Regex help (or Text Mining - Part of Speech?)

I want to parse each sentence in a paragraph into rows.

I run into a problem when a sentence begins with, or contains anywhere, an abbreviation with a period (e.g., U.S. Government).

How can I account for abbreviations with periods when trying to parse each sentence of a paragraph?

hellyars · ‎10-20-2022

I should add: without adding a Formula tool to replace U.S. with US. There may be other unknown abbreviations in the real data (e.g., Co., Inc., Plc., etc.).

DataNath · ‎10-20-2022

@hellyars how extensive are the possibilities of abbreviations? I'm just wondering if you may be better off with a lookup table or something to replace these. I know you mentioned not doing this with a formula, but I've had a think about this and, unless you can find a threshold to put on the length of a word before a period, I can't see how you'll differentiate between the more 'innocent' occurrences like 'Inc., Plc.' and genuine ends of sentences like so (chucked a few in to demo):

AndrewSu · ‎10-21-2022

@hellyars, I agree with @DataNath that a lookup table may help in this scenario, especially for the common ones you mentioned (Co., Inc., Plc., etc.). I also came up with a solution that does some further transformations after your tokenization of the data in the RegEx tool. Please see the attached workflow.

I'm basically using a "count words" function to identify the rows that are abbreviations, then doing a series of transformations to join those identified rows back to the correct row in the dataset.
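For readers without the attached workflow, here is a rough Python sketch of the same idea, not a copy of the workflow itself. It assumes a naive split after every period followed by whitespace, and it treats any piece shorter than a hypothetical `min_words` threshold as an abbreviation fragment that gets joined to the piece after it. The function name and the threshold are made up for illustration.

```python
import re

def split_and_merge(text, min_words=2):
    """Naively split on sentence punctuation, then re-join short fragments."""
    # Naive split: break after '.', '!' or '?' when followed by whitespace.
    pieces = re.split(r"(?<=[.!?])\s+", text.strip())

    sentences = []
    carry = ""  # short fragments waiting to be joined to the next piece
    for piece in pieces:
        candidate = f"{carry} {piece}" if carry else piece
        if len(piece.split()) < min_words:
            # Too short to be a sentence: assume it is an abbreviation.
            carry = candidate
        else:
            sentences.append(candidate)
            carry = ""

    if carry:
        # A short fragment at the very end is attached to the last sentence.
        if sentences:
            sentences[-1] = f"{sentences[-1]} {carry}"
        else:
            sentences.append(carry)
    return sentences

print(split_and_merge("U.S. Government agencies issued a report. It was long."))
```

The word count only catches abbreviations that end up alone in a piece, such as a leading "U.S.". Something like "Acme Inc. signed the deal." still splits after "Inc.", and a genuine one-word sentence gets merged into its neighbour, so a lookup table is probably still needed alongside it.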

I imagine a combination of the lookup table strategy and the example in my workflow will provide the solution.

If this solves your issue, please mark this post as the solution so that others in the community can benefit from our collaboration.

Thanks.

danilang · ‎10-23-2022

Hi @hellyars

Another excellent question from you! As you've seen, the answer is not simple and you don't want to spend time reinventing the wheel.

The best and simplest option for you would probably be to use a Python tool and leverage the NLTK library. The authors of this library have done the grunt work to define all the edge cases for you. There's an example here.

The second solution in that link contains an interesting Python-only approach that uses a series of lists to define the edge cases. The code uses a bunch of regex functions to replace the items in the lists with custom delimiters and then ends up with a series of sentences ending in <stop> (reminiscent of a telegram), so it should be convertible to standard Alteryx if you don't want to include a Python tool.
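To give a feel for that delimiter technique, here is a minimal sketch written from the description above rather than taken from the linked code. The abbreviation list, the `<prd>` placeholder and the function name are assumptions for illustration; only the `<stop>` marker comes from the description.

```python
import re

# Hypothetical lookup list; extend it with abbreviations found in the real data.
ABBREVIATIONS = ["U.S.", "Co.", "Inc.", "Plc.", "e.g.", "etc."]
PRD = "<prd>"    # placeholder for a period that does not end a sentence
STOP = "<stop>"  # marker for a sentence boundary

def split_sentences(text):
    # 1. Protect the periods inside known abbreviations.
    for abbr in ABBREVIATIONS:
        pattern = r"(?<!\w)" + re.escape(abbr)  # do not match inside a longer word
        text = re.sub(pattern, abbr.replace(".", PRD), text)

    # 2. Any remaining terminator followed by whitespace or end of text
    #    is treated as a sentence boundary.
    text = re.sub(r"([.!?])(\s+|$)", r"\1" + STOP, text)

    # 3. Put the protected periods back and split on the marker.
    text = text.replace(PRD, ".")
    return [s.strip() for s in text.split(STOP) if s.strip()]

print(split_sentences(
    "The U.S. Government hired Acme Inc. for the job. Work starts Monday."
))
```

Each step is a plain find-and-replace followed by a split, so the same sequence could likely be rebuilt with REGEX_Replace expressions in a Formula tool and a Text To Columns tool splitting to rows on `<stop>`. One known gap: a listed abbreviation that really does end a sentence is protected too, so that boundary is missed.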

Dan

hellyars · ‎10-24-2022

@AndrewSu

Thank you. This is only partially successful at parsing the real world data. I am looking at a few options to meld this with something I am trying. More later. Thanks again.

hellyars · ‎10-24-2022

@danilang

Thank you.

I don't know Python (yet).

I am going to try the second approach you suggest (regex). More later....

I am surprised that the Text Mining Tools can't do it.
