
Alteryx Text Analytics - Entities Extraction - Utilizing Python & Azure Cognitive


Hi everyone,


Wanted to share this bit I built out as a prototype with one of my customers. This time: text analytics.


Big picture use case:

Web-scrape a large number of websites & extract entities from these sources at scale.

Example of a source:


The source is about tax evasion and mentions multiple entities like football player names, club names, locations...

The original use case is to support anti-fraud teams in the financial/banking segment.





- Alteryx is used to orchestrate the pipeline, incl. custom code in Python

- The workflow is to run on Server as an analytic app asking for a URL, feeding entity data into a SQL DB

- Users do not need to set up an environment or anything locally (no struggles with setting up a Python env, etc.)

- The analytic app asks for a URL as input and feeds the entity outputs into the target storage environment

- We will utilize Python to do parts of this, while the coders/DS team supports us along the way



- Alteryx is used to web-scrape a full website with Download Tool

- Python code tool loads the HTML text and encodes it into bytes

- Python code tool is used to extract & clean up text from HTML (text analytics) //Beautiful Soup

- Python code tool is used to turn the text into clean sentences, one per line (text analytics) // NLTK and RE

- Python code tool is used to prep & send sentences to Azure Cognitive - ENTITIES endpoint

- Alteryx JSON parse to extract the ENTITIES from the response

- Alteryx to prep the data for consumption & write it to the DB
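As a compact illustration of the text-prep steps above: the attached workflow does this with BeautifulSoup and NLTK inside Python code tools, but the same idea can be sketched with the standard library alone (html.parser for tag stripping and a naive regex sentence split — both simplifications, not the workflow's actual code):

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the text content of an HTML page, skipping the tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

html = "<html><body><p>Ronaldo scored twice. Liverpool lost!</p></body></html>"

# Strip tags (the workflow uses BeautifulSoup's get_text() for this)
parser = TextExtractor()
parser.feed(html)
text = " ".join(parser.chunks)

# Keep only letters, digits, whitespace and periods, as in the workflow's regex
text = re.sub(r'[^A-Za-z0-9\s\.]+', '', text).replace('\r', '').replace('\n', '')

# Naive split on whitespace after a period (NLTK's sent_tokenize is much smarter)
sentences = [s.strip() for s in re.split(r'(?<=\.)\s+', text) if s.strip()]
print(sentences)  # ['Ronaldo scored twice.', 'Liverpool lost']
```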


A few more notes:

- Azure Cognitive seems to use Wikipedia to find matches and returns match scores for the entities found

- An Azure Cognitive tool is actually available in Alteryx as a macro; the ENTITIES endpoint is not yet integrated though, as I believe it is relatively new; once implemented, this will become even simpler


MSFT Cognitive / text analytics endpoints:

- Easily evaluate sentiment, topics, language, and entities to understand what users want

- Existing APIs, so no need to get your hands dirty training a massive text analytics model yourself

- Text analytics with interactive test samples

- 5,000 transactions per month are actually free - just get your API keys at Azure (so anyone can really use this)

- Using this in Python (as entity extraction is not yet available in our Alteryx macro; sentiment and others are there)
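For reference, a minimal sketch of how a Python code tool can talk to the ENTITIES endpoint. The payload shape and the Ocp-Apim-Subscription-Key header follow Azure's Text Analytics v2.1 conventions; the endpoint host and key below are placeholders, not real values:

```python
import json
import urllib.request

# Placeholders - substitute your own Azure resource endpoint and key
endpoint = "https://YOUR-RESOURCE.cognitiveservices.azure.com"
subscription_key = "YOUR-KEY"
entities_url = endpoint + "/text/analytics/v2.1/entities"

# The API expects numbered documents; here, one sentence per document
sentences = ["Cristiano Ronaldo plays football.", "Liverpool won the match."]
documents = {"documents": [{"id": str(i + 1), "text": s}
                           for i, s in enumerate(sentences)]}

# The subscription key travels in the Ocp-Apim-Subscription-Key header
request = urllib.request.Request(
    entities_url,
    data=json.dumps(documents).encode("utf-8"),
    headers={"Ocp-Apim-Subscription-Key": subscription_key,
             "Content-Type": "application/json"},
)

# With real credentials you would now send it:
# with urllib.request.urlopen(request) as response:
#     entities = json.loads(response.read())
```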



The workflow is attached to this post. All you need to do is change the Azure Cognitive keys in the Python code tool (those in the workflow are just samples and long expired) and supply a URL with the text input.


What do you get as a result of running this workflow?

- A list of extracted entities from your text;

- Compared to the original text in your input

- EntityTypeScore (0 to 1; 1 being a perfect match)

- Type of entity (Person, Organization, Value, Location, Date)

- Wikipedia link (actually used to find the entity)

- WikiScore (0 to 1; 1 being a perfect match)
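To make those fields concrete, here is an illustrative response in the v2.1 entities shape and how it flattens into the rows above. The field names are my reading of Azure's v2.1 schema and the values are made up for the example; in the attached workflow, the Alteryx JSON Parse tool does this step instead:

```python
# Illustrative v2.1-style response for one sentence (field names assumed
# from the v2.1 schema; a real response would come from the ENTITIES endpoint)
response = {
    "documents": [{
        "id": "1",
        "entities": [{
            "name": "Cristiano Ronaldo",
            "type": "Person",
            "wikipediaUrl": "https://en.wikipedia.org/wiki/Cristiano_Ronaldo",
            "matches": [{
                "text": "Cristiano Ronaldo",
                "entityTypeScore": 0.99,
                "wikipediaScore": 0.93,
            }],
        }],
    }],
}

# Flatten the nested structure into one row per entity match
rows = []
for doc in response["documents"]:
    for entity in doc["entities"]:
        for match in entity["matches"]:
            rows.append({
                "Entity": entity["name"],
                "Type": entity["type"],
                "WikipediaUrl": entity["wikipediaUrl"],
                "MatchedText": match["text"],
                "EntityTypeScore": match["entityTypeScore"],
                "WikiScore": match["wikipediaScore"],
            })

print(rows[0]["Entity"], rows[0]["EntityTypeScore"])  # Cristiano Ronaldo 0.99
```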





So, you are for instance able to say that we found these entities:

CRISTIANO RONALDO; with 99% confidence; type of entity: PERSON - matched correctly to the football player

LIVERPOOL; with 76% confidence; type of entity: ORGANIZATION - matched correctly to Liverpool F.C.


P.s. Another interesting use case combining code-free and code-friendly approaches to analytics.

Alteryx acts as the orchestrator, bringing advanced analytics to the masses in an organization.


P.s.2: A massive benefit is clearly not having to code a text analytics model yourself but utilizing something like Azure Cognitive.
I am sure I would not want to code those models myself. Probably a lot more than my typical 50 lines of code.

And tons of text used for training, I am sure. In fact, utilize the Cognitive macro available in Alteryx for everything but the newly available ENTITIES extraction.


And the least enjoyable bit comes last just for reference:


# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
from ayx import Alteryx

import re
import pandas as pd
from bs4 import BeautifulSoup
from nltk import sent_tokenize

# Read the HTML page from input #1
df = Alteryx.read("#1")

# Load the HTML from the input
html = ""
for index, row in df.iterrows():
    html = row[0].replace("\\", "/")

# Turn the HTML from str to bytes (encode)
html_byte = str.encode(html)

# Use BeautifulSoup to parse the HTML for clean text
soup = BeautifulSoup(html_byte, 'html.parser')

# Use a regex to remove extra chars, carriage returns, and new lines
text = re.sub(r'[^A-Za-z0-9\s\.]+', '', soup.get_text()).replace('\r', '').replace('\n', '')

# Split the text into sentences using NLTK
sentences = sent_tokenize(text)

# Output the clean sentences, one per row, to output #2
df_sentences = pd.DataFrame(sentences)
Alteryx.write(df_sentences, 2)

# Subscription key and endpoint of Azure Cognitive for entity extraction
# (this key is a sample and long expired - use your own)
subscription_key = "3b371d3021a14a689ec17eeddaf65796"
endpoint = ""
language_api_url = endpoint + "/text/analytics/v2.1/entities"

# Construct the data to be sent to the Azure Cognitive API
documents_list = []
row = 1

# For all sentences, construct a list of dictionary values
for sentence in sentences:
    dict_entry = {"id": row, "text": sentence}
    documents_list.append(dict_entry)
    row += 1

# Push the list of dictionary values into a documents dictionary
documents = {"documents": documents_list}

# Write the documents payload out to connection #1
Alteryx.write(pd.DataFrame([str(documents)]), 1)
David Matyas
Sales Engineer
Alteryx Partner

Thanks @DavidM, I'll take a look. Looks very interesting...