Hi,
I want to parse a paragraph into sentences -- ignoring abbreviations that include a period (e.g., Corp.) or periods (e.g., U.S.A.).
Two approaches have been suggested to me by @danilang -- both use Alteryx's Python tool. Option 1 uses the NLTK library to parse sentences; Option 2 uses Regex. There is only one small problem: I do NOT know Python (yet -- Alteryx is my side hobby). See my original question.
I am open to either approach -- Python + NLTK or Python + Regex (not a lookup table). Ultimately it would be useful to see how both perform, as I suspect the results will not be the same.
I was able to successfully install the NLTK package... but from there I do not know how to tell Python to parse a particular field (vs., for example, a file) using NLTK or Regex, and then output the results back to Alteryx...
This example from the Community uses NLTK in its solution: Alteryx + NLTK example. But I do not know how to adapt it, or the potential Regex solution.
Thank you
Forgot to add the workflow.
Hi @hellyars
Good news / bad news situation.
Good news: here's a working version of your workflow using nltk to parse your text.
Bad news: a few issues remain, e.g. sentence 3 ("Work will be performed...") is broken after "Andover, Mass.", and "U.S." still comes out as a sentence on its own.
Modified Python script:
#################################
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])
#################################
from ayx import Alteryx
import pandas as pd
import nltk
import nltk.data
#################################
# read in data from input anchor as a pandas dataframe
# (after running the workflow)
df = Alteryx.read("#1")
#Read metadata from connection #1
Alteryx.readMetadata("#1")
# display the pandas dataframe
df
#################################
Alteryx.write(df,1)
#################################
# I know below is wrong for my use case...but it is an Alteryx, NLTK use case to parse sentences
nltk.download('punkt')
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
sentences = tokenizer.tokenize(df.iloc[0,1])
print(sentences)
#Output the clean sentences to output #2
df_sentences = pd.DataFrame(sentences)
#Write sentences to output #2
Alteryx.write(df_sentences,2)
#################################
nltk.download('punkt') downloads a pre-trained English-language sentence-parsing model that handles most of the edge cases, e.g. trailing periods that don't always mark sentence boundaries (Ms.), etc. The next line builds a tokenizer using the rules in the Punkt model.
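For anyone following along, the same flow outside of Alteryx is just a few lines (a minimal sketch; the sample text is made up, not from the workflow):
import nltk
import nltk.data
nltk.download('punkt')  # fetch the pre-trained English Punkt model
tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
# Illustrative sample text, not from the workflow
print(tokenizer.tokenize("Ms. Smith works at Acme Corp. in the U.S. She likes it."))
# Punkt handles "Ms." and usually "Corp.", but "U.S." followed by a
# capitalized word is ambiguous -- here it is a real boundary, while in
# "U.S. Army" it is not, which is the failure mode seen above.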
NltkInstaller(Run Elevated).yxmd needs to be run as admin to install the nltk library.
Note: I also ran into a few issues within the Python script when running with the AMP Engine, so I disabled it.
Dan
Thanks. I am curious about the following line:
sentences = tokenizer.tokenize(df.iloc[0,1])
I assume the [0,1] is a row, column index. I can add columns and change the column index from 0,1 to 0,2, etc. to target the specific column I want to parse. But adding rows and changing the 0,1 to 1,1 to target a specific row did not work. So, is the 0 in 0,1 a row reference or something else?
Wow -- a simple ask (parse sentences that may contain an abbreviation with a period) is proving to be not so simple.
What would you suggest as a next step?
...an alternative library like SpaCy?
...Python + RegEx (i.e., the StackOverflow example; see the sketch below)?
...a look-up table?
It still blows me away that the Text Mining tools can't parse sentences from a paragraph -- then again, they have trouble identifying company names.
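(For reference, the Regex approaches in those StackOverflow answers usually boil down to a re.split with fixed-width lookbehinds for known abbreviations -- a rough sketch, not the exact linked answer, and the abbreviation list is illustrative:)
import re

# Rough sketch: split on sentence-ending punctuation followed by whitespace
# and a capital letter, skipping a hand-maintained list of abbreviations.
# The abbreviation list is never complete -- which is why library-based
# approaches (NLTK, pySBD) exist.
SPLIT = re.compile(
    r'(?<!\bMr\.)(?<!\bMs\.)(?<!\bMrs\.)(?<!\bDr\.)(?<!\bCorp\.)'  # known abbreviations
    r'(?<!\b[A-Z]\.)'            # single-letter initials, e.g. the "S." in "U.S."
    r'(?<=[.!?])\s+(?=[A-Z])'
)
text = "The U.S. Army awarded Acme Corp. a contract. Work starts in Andover, Mass. next year."
print(SPLIT.split(text))
# ['The U.S. Army awarded Acme Corp. a contract.', 'Work starts in Andover, Mass. next year.']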
Hi @hellyars
The .iloc[r,c] syntax is a way to reference any cell in a dataframe using r(ow) and c(olumn) indices, both 0-based. So .iloc[0,0] works in your example to access the RecordID column in the first row, but there is no second row, so .iloc[1,x] fails.
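A quick illustration with a made-up two-row dataframe:
import pandas as pd

# Hypothetical data, just to show the indexing
df = pd.DataFrame({'RecordID': [1, 2], 'Text': ['First paragraph.', 'Second paragraph.']})
df.iloc[0, 0]   # 1 -- row 0, column 0 (RecordID of the first row)
df.iloc[0, 1]   # 'First paragraph.' -- row 0, column 1
df.iloc[1, 1]   # 'Second paragraph.' -- only works because a second row exists
# df.iloc[2, 0] would raise IndexError: there is no third row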
As for next steps, give SpaCy a try -- specifically pySBD. It's built around the "Golden Rules" and should be able to handle most cases. Case 16 in this list actually deals with U.S. followed by a capitalized word, which was one of the breaking cases in your example.
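The pySBD usage is only a couple of lines (a minimal sketch, assuming pysbd has been installed, e.g. via Package.installPackages(['pysbd']); the sample text is illustrative):
import pysbd

seg = pysbd.Segmenter(language="en", clean=False)
# Illustrative text, not from the workflow
text = "The award was announced in the U.S. Work will be performed in Andover, Mass. through 2024."
print(seg.segment(text))
# pySBD's Golden-Rules-based segmenter is designed to handle boundaries
# like "U.S. Work" that tripped up Punkt above.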
Good luck
Dan
I may not have explained what I did.
I added a few additional columns and an additional row. See image below.
[0,x] works every time.
I had trouble getting [1,x] to work, but I got it working now.
It fails when I change the address and hit Run from within the tool, but it works once I run the full workflow, so it's no longer a problem.
Thank you for the pySBD / Golden Rules index. The Golden Rules seem to account for some of my variables. I am going to give it a shot.
This article may help as well. You might compare Corp. and other abbreviations against a bag of actual words; isalnum() might be used to detect punctuation characters.
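Roughly, that idea looks like this (a sketch; the word list is a stand-in for an actual dictionary of known words):
# Sketch of the bag-of-words check suggested above; real_words is illustrative
real_words = {'work', 'performed', 'andover'}
token = "Corp."
token.isalnum()                            # False: the trailing period is not alphanumeric
token.rstrip('.').lower() in real_words    # False: likely an abbreviation, not a word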
2023 update: I accidentally backed into this problem again. Armed with ChatGPT-4, I was able to hash out a solution on my own -- and (a year later) I now understand it.
Python #1: NLTK + Punkt
#################################
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])
#################################
from ayx import Alteryx
# Alteryx.installPackage("nltk --upgrade")
import pandas as pd
import nltk
# nltk.download('punkt')
from nltk.tokenize import sent_tokenize
def tokenize_sentences(text):
    return sent_tokenize(text)
data = Alteryx.read("#1") # assuming your Input tool is connected to input #1
# Ensure 'DATE' is in datetime format
data['DATE'] = pd.to_datetime(data['DATE'])
# Sort the data by 'DATE' from oldest to newest
data.sort_values('DATE', inplace=True)
# Assign an Entry ID to each unique 'Text' from oldest to newest
data['Entry_ID'] = data.groupby('Text').ngroup() + 1
# Tokenize the text into sentences
data['Sentences'] = data['Text'].apply(tokenize_sentences)
# Explode the list of sentences into separate rows and add a sentence number
data = data.explode('Sentences').reset_index(drop=True)
data['Sentence_Number'] = data.groupby(['Text', 'Entry_ID']).cumcount() + 1
# Sort by 'Entry_ID' and 'Sentence_Number'
data.sort_values(['Entry_ID', 'Sentence_Number'], inplace=True)
Alteryx.write(data,1)
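The explode + cumcount combination is the part that took me a while, so here is a tiny illustration with made-up data:
import pandas as pd

demo = pd.DataFrame({'Text': ['A. B.'], 'Sentences': [['A.', 'B.']]})  # made-up data
demo = demo.explode('Sentences').reset_index(drop=True)   # one row per sentence
demo['Sentence_Number'] = demo.groupby('Text').cumcount() + 1
print(demo)
#     Text Sentences  Sentence_Number
# 0  A. B.        A.                1
# 1  A. B.        B.                2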
Python #2: spaCy
#################################
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])
#################################
from ayx import Alteryx, Package
# Ensure that SpaCy is installed
# Package.installPackages(['spacy'])
import spacy
import pandas as pd
# Ensure that the English model for SpaCy is installed
# Note: this requires running a shell command using Python's subprocess module,
# which may not be supported in all Alteryx environments
import subprocess
subprocess.call(['python', '-m', 'spacy', 'download', 'en_core_web_sm'])  # list args, so shell=True isn't needed
# Load SpaCy's English model
nlp = spacy.load('en_core_web_sm')
data = Alteryx.read("#1") # assuming your Input tool is connected to input #1
# Apply NER to each sentence (assumes a 'Sentences' column, e.g. the output of script #1)
def get_entities(sentence):
    doc = nlp(sentence)
    return [(ent.text, ent.label_) for ent in doc.ents]
data['Entities'] = data['Sentences'].apply(get_entities)
Alteryx.write(data, 1)
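One note on script #2: it runs NER on sentences that were already split (presumably the 'Sentences' column produced by script #1). spaCy can also do the sentence splitting itself via doc.sents -- a minimal sketch reusing the nlp object loaded above, with made-up text:
# Illustrative text, not from the workflow
doc = nlp("Work will be performed in Andover, Mass. and is funded by the U.S. Army.")
print([sent.text for sent in doc.sents])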