Alteryx Designer Desktop Discussions

Parse Paragraph into Sentences and Ignore Abbreviations with Periods

hellyars
13 - Pulsar

Regex help (or Text Mining - Part of Speech?)

 

I want to parse each sentence in a paragraph into rows.

 

I run into a problem when a sentence begins with, or contains anywhere, an abbreviation with a period (e.g., U.S. Government).

 

How can I account for abbreviations with periods when trying to parse each sentence of a paragraph?

 

 

ParseSentencesFromParagraph.png

6 REPLIES
hellyars
13 - Pulsar

I should add: I want to do this without adding a Formula tool to replace U.S. with US. There may be other unknown abbreviations in the real data (e.g., Co., Inc., Plc.).

DataNath
17 - Castor

@hellyars how extensive is the range of possible abbreviations? I'm wondering if you may be better off with a lookup table or something similar to replace these. I know you mentioned not doing this with a formula, but I've had a think about this and, unless you can put a threshold on the length of a word before a period, I can't see how you'll differentiate between the more 'innocent' occurrences like 'Inc.' and 'Plc.' and genuine ends of sentences like so (chucked a few in to demo):

 

DataNath_0-1666282828842.png
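For readers who want to try the lookup-table idea outside of Designer, here is a minimal Python sketch. It is not taken from the screenshot or from any attached workflow; the `KNOWN_ABBREVIATIONS` set and the `split_with_lookup` function are hypothetical names made up for illustration, and the list would need to be extended with whatever appears in the real data.

```python
# Hypothetical lookup table (an assumption, not from the thread's workflow).
# Entries are lowercase so the comparison below is case-insensitive.
KNOWN_ABBREVIATIONS = {"u.s.", "co.", "inc.", "plc.", "e.g.", "etc."}


def split_with_lookup(paragraph, abbreviations=KNOWN_ABBREVIATIONS):
    """Split a paragraph into sentences, one list item per sentence.

    A token ending in '.', '!' or '?' closes the current sentence unless
    the token is listed in the abbreviation lookup table.
    """
    sentences = []
    current = []
    for token in paragraph.split():
        current.append(token)
        ends_with_stop = token.endswith((".", "!", "?"))
        if ends_with_stop and token.lower() not in abbreviations:
            sentences.append(" ".join(current))
            current = []
    # Keep any trailing text that has no closing punctuation.
    if current:
        sentences.append(" ".join(current))
    return sentences


if __name__ == "__main__":
    demo = ("The U.S. Government signed a deal with Acme Inc. last year. "
            "The deal was large.")
    for row in split_with_lookup(demo):
        print(row)
```

Note the limitation raised above still applies: when a listed abbreviation genuinely ends a sentence ("...sold by Acme Inc. The price was high."), this sketch will not split there, because the lookup alone cannot tell the two cases apart.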

AndrewSu
Alteryx

@hellyars, I agree with @DataNath that a lookup table may help in this scenario, especially for common abbreviations like the ones you mentioned (Co., Inc., Plc., etc.). I also came up with a solution that does some further transformations after your tokenization of the data in the Regex tool. Please see the attached workflow.

 

I'm basically using a "count words" function to identify the rows that are abbreviations, then doing a series of transformations to join those identified rows back to the correct row in the dataset.

 

I imagine a combination of the lookup table strategy and the example in my workflow will provide the solution.

 

If this solves your issue, please mark this post as the solution so that others in the community can benefit from our collaboration. 

 

Thanks. 

danilang
19 - Altair

Hi @hellyars 

 

Another excellent question from you! As you've seen, the answer is not simple, and you don't want to spend time reinventing the wheel.

 

The best and simplest option for you would probably be to use a Python tool and leverage the NLTK library. The authors of this library have done the grunt work to define all the edge cases for you. There's an example here

 

The second solution in that link contains an interesting Python-only approach that uses a series of lists to define the edge cases. The code uses a series of regex functions to replace the items in the lists with custom delimiters, and ends up with a series of sentences ending in <stop> (reminiscent of a telegram), so it should be convertible to standard Alteryx if you don't want to include a Python tool.
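To make the delimiter idea concrete, here is a heavily simplified sketch of that style of approach. It is not the code from the linked answer: the `EDGE_CASES` list, the `<prd>` placeholder, and the `split_on_stop` function are assumptions chosen for illustration, and a real list of edge cases would be much longer. Each step is a plain find-and-replace, which is why it should map onto Alteryx regex and replace functions.

```python
import re

# Hypothetical edge-case list (an assumption); extend it for the real data.
EDGE_CASES = ["U.S.", "Co.", "Inc.", "Plc.", "e.g.", "etc."]
PROTECTED = "<prd>"  # stands in for a period that does NOT end a sentence
STOP = "<stop>"      # marks a real sentence boundary


def split_on_stop(text):
    """Return the sentences in `text`, one list item per sentence."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    # 1. Hide the periods inside each listed abbreviation.
    for abbr in EDGE_CASES:
        pattern = r"(?<!\w)" + re.escape(abbr)
        text = re.sub(pattern, abbr.replace(".", PROTECTED), text)
    # 2. Hide decimal points such as the one in 3.5.
    text = re.sub(r"(\d)\.(\d)", rf"\1{PROTECTED}\2", text)
    # 3. Whatever punctuation is left before whitespace (or the end of the
    #    text) is treated as a sentence end and tagged with the stop marker.
    text = re.sub(r"([.!?])(\s+|$)", rf"\1{STOP}", text)
    # 4. Restore the hidden periods and split on the stop marker.
    text = text.replace(PROTECTED, ".")
    return [part.strip() for part in text.split(STOP) if part.strip()]


if __name__ == "__main__":
    demo = ("The U.S. Government paid 3.5 million to Acme Inc. last year. "
            "Was it worth it? Yes.")
    for row in split_on_stop(demo):
        print(row)
```

As with any list-driven approach, an abbreviation that is missing from the list will still cause a false split, and a listed abbreviation that really does end a sentence will be missed.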

 

Dan

hellyars
13 - Pulsar

@AndrewSu 

Thank you. This is only partially successful at parsing the real-world data. I am looking at a few options to meld this with something I am trying. More later. Thanks again.

hellyars
13 - Pulsar

@danilang 

 

Thank you. 

I don't know Python (yet).

I am going to try the second approach you suggest (regex).  More later....

I am surprised that the Text Mining Tools can't do it. 
