
Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Use Python + NLTK or Python + Regex to Parse Sentences from a Paragraph

hellyars
13 - Pulsar

 

Hi,

 

I want to parse a paragraph into sentences -- ignoring abbreviations that include a period (e.g., Corp.) or periods (e.g., U.S.A.). 

 

Two approaches have been suggested to me by @danilang -- both utilize Alteryx's Python tool (POTENTIAL SOLUTIONS). Option 1 utilizes the NLTK library to parse sentences; Option 2 utilizes Regex.  There is only one small problem: I do NOT know Python (yet -- Alteryx is my side hobby).   My original question

 

I am open to either approach -- Python + NLTK or Python + Regex (not a lookup table).   Ultimately, it would be useful to see how both perform, as I suspect the results will not be the same. 

 

I was able to successfully install the NLTK package... but from there I do not know how to tell Python to parse a particular field (vs., for example, a file) using NLTK or Regex and then output the results to Alteryx...

 

This example from the Community uses NLTK in its solution: Alteryx + NLTK example. But I do not know how to adapt it, or the potential Regex solution, to my case. 

 

Thank you

 

hellyars_0-1666714405991.png

 

 

ToCommunity_PythonNLTK_ParseSentences.png

9 REPLIES
hellyars
13 - Pulsar

Forgot to add the workflow

danilang
19 - Altair

Hi @hellyars 

 

It's a good news / bad news situation.

 

Good news: Here's a working version of your workflow using nltk to parse your text.

Bad news: A few issues remain, e.g. sentence 3 ("Work will be performed...") is broken after "Andover, Mass." and "U.S." is still coming out as a sentence on its own.

 

Modified python text:

#################################
# List all non-standard packages to be imported by your 
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])


#################################
from ayx import Alteryx
import pandas as pd
import nltk
import nltk.data

#################################
# read in data from input anchor as a pandas dataframe
# (after running the workflow)

df = Alteryx.read("#1")

#Read metadata from connection #1

Alteryx.readMetadata("#1")  

# display the pandas dataframe
df

#################################
Alteryx.write(df,1)


#################################
# I know below is wrong for my use case...but it is an Alteryx, NLTK use case to parse sentences 
nltk.download('punkt')
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = tokenizer.tokenize(df.iloc[0,1])
print(sentences)

#Output the clean sentences to output #2
df_sentences = pd.DataFrame(sentences) 

#Write sentences to output #2
Alteryx.write(df_sentences,2) 



#################################

 

nltk.download('punkt')  is used to download a pre-trained English-language sentence-parsing model (Punkt) that handles most of the edge cases, e.g. trailing periods that don't always mark sentence boundaries (Ms.), etc.  The next line builds a tokenizer using the rules in the Punkt library. 

danilang_0-1666962339160.png

NltkInstaller(Run Elevated).yxmd needs to be run as admin to install the nltk library. 

 

Note: I also ran into a few issues within the python script when running in AMP, so I disabled it. 

 

Dan

 

hellyars
13 - Pulsar

@danilang 

 

Thanks.  I am curious about the following line:

sentences = tokenizer.tokenize(df.iloc[0,1])

I assume the [0,1] is a row, column index. I can add columns and change the column index from 0,1 to 0,2 etc. to target the specific column I want to parse.  Adding rows and changing the 0,1 to 1,1 to target a specific row did not work.  So, is the 0 in 0,1 a row reference or something else?

hellyars
13 - Pulsar

@danilang 

 

Wow -- a simple ask (parse sentences that may contain an abbreviation with a period) is proving to be not so simple.

 

What would you suggest as a next step? 

 

...an alternative library like SpaCy?

...Python + RegEx (i.e., the StackOverflow example)?

...a look-up table?

 

It still blows me away that the Text Mining tools can't parse sentences from a paragraph -- then again, they have trouble identifying company names.
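For what it's worth, here is a minimal sketch (not the StackOverflow example itself; the sample text is adapted from the sentences quoted earlier in this thread) of why a plain regex split struggles with abbreviations:

```python
import re

# Naive regex sentence splitter: break on ., !, or ? followed by
# whitespace and a capital letter.
# "U.S. and" survives (lowercase "and"), but "Mass. The" is wrongly
# split -- the same failure mode NLTK showed above.
text = ("Work will be performed in Andover, Mass. "
        "The company operates in the U.S. and Canada.")
sentences = re.split(r'(?<=[.!?])\s+(?=[A-Z])', text)
print(sentences)
```

The lookbehind/lookahead pattern has no notion of which period-terminated tokens are abbreviations, which is why a rule-trained tokenizer (or an abbreviation list) is needed on top of it.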

 

 

danilang
19 - Altair
19 - Altair

Hi @hellyars 

 

The .iloc[r,c] syntax is a way to reference any cell in a dataframe using r(ow) and c(olumn) indices, both 0-based.  So .iloc[0,0] works in your example to access the RecordID column in the first row, but there is no second row, so .iloc[1,x] fails. 
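A standalone sketch of that behavior (toy data, not the workflow's actual input):

```python
import pandas as pd

# A tiny frame mirroring the workflow input: RecordID plus a Text column
df = pd.DataFrame({"RecordID": [1, 2],
                   "Text": ["First entry.", "Second entry."]})

# .iloc[row, col] -- both indices positional and 0-based
first_text = df.iloc[0, 1]    # "First entry."
second_text = df.iloc[1, 1]   # works only because a second row exists

# Asking for a row that isn't there raises IndexError
out_of_range = False
try:
    df.iloc[2, 0]
except IndexError:
    out_of_range = True
```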

 

As for next steps, give SpaCy a try, specifically pySBD.  It's built based on the "Golden Rules" and should be able to handle most cases.   Case 16 in this list actually deals with U.S. followed by a capitalized letter, which was one of the breaking cases in your example.      

 

Good luck

 

Dan

hellyars
13 - Pulsar

@danilang  

 

I may not have explained what I did.  

I added a few additional columns and an additional row. See image below.

[0,x] works every time. 

I have trouble getting [1,x] to work, but I got it to work now. 

It fails after changing the address when I hit Run within the tool, but then works after I run the workflow, and is no longer a problem.

 

 

 

Screenshot 2022-10-30 122414.png

 

 

Screenshot 2022-10-30 123022.png

 

 

 

hellyars
13 - Pulsar

@danilang 

 

Thank you for the pySBD / Golden Rules reference.  The Golden Rules list seems to account for some of my cases.  I am going to give it a shot.

jbrazeal
5 - Atom

This article may help as well.  You might compare Corp. and other abbreviations against a bag of actual words; isalnum() might be used to detect punctuation characters.

 

https://www.freecodecamp.org/news/an-introduction-to-bag-of-words-and-how-to-code-it-in-python-for-n...
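A rough sketch of that idea (the abbreviation set and helper name below are mine, purely illustrative -- a real list could come from a lookup file or the bag-of-words article above):

```python
# Hypothetical abbreviation "bag"; extend as needed
ABBREVIATIONS = {"corp", "inc", "co", "mass", "u.s", "u.s.a"}

def ends_sentence(token: str) -> bool:
    """Guess whether a period-terminated token closes a sentence."""
    if not token.endswith("."):
        return False
    core = token[:-1].lower()
    # isalnum() is False for internal periods (U.S.) -> treat as abbreviation
    if not core.isalnum() or core in ABBREVIATIONS:
        return False
    return True

print(ends_sentence("contract."))  # True
print(ends_sentence("Corp."))      # False
print(ends_sentence("U.S."))       # False
```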

hellyars
13 - Pulsar

2023 update.  I accidentally backed into this problem again.   Armed with ChatGPT-4 I was able to hash out a solution on my own -- and (a year later) I now understand.

 

  • Reference @danilang 's sound advice above.
  • While you can do a lot of this natively with Alteryx tools, I was determined to get it to work using the Python Tool.
  • I split the task into two Python tools.
  • The first tool (Python #1) sorts entries by date, assigns an entry ID, and then parses each sentence, assigning a number (ID) to each sentence grouped by its entry.  This tool is similar to the solution above, but there must have been an update to "nltk" and/or "punkt", as it now ignores random abbreviations (e.g., U.S., Inc., Co., etc.) when parsing the sentences, whereas the prior solution (last fall) struggled with them.
  • The second tool (Python #2) utilizes spaCy to perform Named Entity Recognition (NER).  I wanted to see if I could get it to work and compare it against the Alteryx Intel Suite's NER Tool. 
  • The lines to install NLTK, PUNKT, and spaCy are included (but commented out).  You only need to install once (if they are not already installed); make sure to run Alteryx in admin mode.
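The sentence-numbering step in Python #1 can be sketched in isolation (toy data, illustrative only): explode turns each list of sentences into one row per sentence, and cumcount numbers them within their entry.

```python
import pandas as pd

# Toy data standing in for the tokenized workflow input
df = pd.DataFrame({
    "Entry_ID": [1, 2],
    "Sentences": [["First sentence.", "Second sentence."], ["Only sentence."]],
})

# One row per sentence, numbered 1..n within its entry
df = df.explode("Sentences").reset_index(drop=True)
df["Sentence_Number"] = df.groupby("Entry_ID").cumcount() + 1
```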

 

Python #1:  NLTK + PUNKT

 

#################################
# List all non-standard packages to be imported by your 
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])


#################################
from ayx import Alteryx
# Alteryx.installPackage("nltk --upgrade")

import pandas as pd
import nltk
# nltk.download('punkt')
from nltk.tokenize import sent_tokenize

def tokenize_sentences(text):
    return sent_tokenize(text)

data = Alteryx.read("#1") # assuming your Input tool is connected to input #1

# Ensure 'DATE' is in datetime format
data['DATE'] = pd.to_datetime(data['DATE'])

# Sort the data by 'DATE' from oldest to newest
data.sort_values('DATE', inplace=True)

# Assign an Entry ID to each unique 'Text' from oldest to newest
# (sort=False numbers groups in order of first appearance, i.e. date order)
data['Entry_ID'] = data.groupby('Text', sort=False).ngroup() + 1

# Tokenize the text into sentences
data['Sentences'] = data['Text'].apply(tokenize_sentences)

# Explode the list of sentences into separate rows and add a sentence number
data = data.explode('Sentences').reset_index(drop=True)
data['Sentence_Number'] = data.groupby(['Text', 'Entry_ID']).cumcount() + 1

# Sort by 'Entry_ID' and 'Sentence_Number'
data.sort_values(['Entry_ID', 'Sentence_Number'], inplace=True)

Alteryx.write(data,1)

 

Python #2 : SpaCY

 

#################################
# List all non-standard packages to be imported by your 
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])


#################################
from ayx import Alteryx, Package

# Ensure that SpaCy is installed
# Package.installPackages(['spacy'])

import spacy
import pandas as pd

# Ensure that the English model for SpaCy is installed
# Note: this requires running a shell command using Python's subprocess module, 
# which may not be supported in all Alteryx environments
import subprocess
subprocess.call(['python', '-m', 'spacy', 'download', 'en_core_web_sm'], shell=True)

# Load SpaCy's English model
nlp = spacy.load('en_core_web_sm')

data = Alteryx.read("#1") # assuming your Input tool is connected to input #1

# Apply NER to each sentence
def get_entities(sentence):
    doc = nlp(sentence)
    return [(ent.text, ent.label_) for ent in doc.ents]

data['Entities'] = data['Sentences'].apply(get_entities)

Alteryx.write(data, 1)

 
