Image source: twinbookmarks
In The Little Prince by Antoine de Saint-Exupéry, we read about a boa constrictor digesting an elephant many times its size. This is a great metaphor for pursuing a goal despite obstacles or objections. To be bold.
I am a Partner Sales Engineer at Alteryx. A recent engagement required the extraction of information from Outlook emails (MSG files) stored in a shared directory. For security reasons, a connection to a shared mailbox was not possible in this set-up which prevented usage of available tools such as Outlook Input and Outlook 365.
Enter another constrictor that can boldly address the requirement along with Alteryx: Python.
Auditing approval and execution of system changes can be a repetitive task prone to human errors. Furthermore, email evidence and attachments are sent to various recipients, requiring consolidation of said documents. Due to security reasons, provisioning of a shared mailbox may not be allowed, thus requiring consolidation of evidence in a restricted shared folder or directory.
Alteryx's rich set of tools for extraction, parsing, evaluation, and summarization of data, complemented by Python's libraries, allowed for a seamless process of extracting and evaluating email content and attachments.
Using the extract_msg library, a Python script was created to extract information and attachments from MSG files. On a high level, the Python script process is described below. Do ignore the Alteryx equivalent for now, we will get back to that later.
# |
Step |
Python command / library |
Can be replaced by Alteryx equivalent |
1 |
Read emails from folder |
extract_msg |
N/A |
2 |
Extract email metadata and body |
extract_msg |
N/A |
3 |
Extract relevant content from emails |
re |
RegEx |
4 |
Assess whether a mail is an approval or rejection |
fuzzysearch |
Sentiment Analysis |
5 |
Extract attachments and identify mail with no attachments |
extract_msg |
N/A |
The Python output is then fed to the rest of the Alteryx workflow using standard tools. The rest of the workflow evaluates whether the emails related to a given change ID followed the correct approval sequence, contained the correct approval values, were approved by the correct personnel, and were supported by system screenshots. The focus of this blog is the Python component which will be explored in detail in the succeeding sections.
Developing the Python script and Alteryx workflow according to the format of the emails to be evaluated resulted in an automated process that can extract, evaluate, parse emails, and process screenshots for 1000 emails in ~5 minutes. This allows for reduced risk by having a larger sample size for audit while reducing human errors and execution time. Having a human in the loop after the execution of a workflow will ensure correct execution and allow for improvements in the future.
The Python tool in Alteryx designer can be accessed via Tool Palette > Developer > Python, or alternatively by searching "python" in the search bar > Python; then dragging the tool.
Dragging the tool will bring it to the Canvas.
Clicking on the tool will open the Configuration panel, where you have access to an interactive notebook environment to create your Python script.
You can create the Python script as below. Note that this follows an email format specific to a customer scenario so you may need to adjust accordingly to match your use case.
Initialization:
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])
from ayx import Alteryx
Alteryx.installPackage(package=["extract-msg","fuzzysearch"])
Reading of emails:
import os #for OS commands
import glob #to read directories
import extract_msg #to extract information from emails
import pandas as pd
import csv #to save into CSV
import re #to perform RegEx
import datetime #for Date and Time conversion
from fuzzysearch import find_near_matches #to search for text variations
#Write extracts from each email in the directory to file Email.csv
with open(r'C:\temp\shared_folder\Email.csv', mode='w') as file:
fieldnames = ['subject', 'sentDateTime', 'from address', 'body', 'save path', \
'screenshot', 'chid', 'with approval', 'no approval']
writer = csv.DictWriter(file, fieldnames=fieldnames, lineterminator='\n')
writer.writeheader()
msg_path=r'C:\temp\shared_folder\messages'
msg_files=os.listdir(msg_path)
attach_path=r'C:\temp\shared_folder\attachments'
#Extract information from each email
for f in msg_files:
xpath = os.path.join(msg_path,f)
if not xpath.endswith('.msg'):
continue
msg = extract_msg.Message(xpath)
msg_sender = msg.sender
msg_date = msg.date
msg_pdate = msg.parsedDate
msg_subj = msg.subject
msg_message = msg.body
#Extract Change ID information from email subject
msg_chid = re.search('CH[0-9]+', msg_message)
if msg_chid is not None:
msg_chid = re.search('CH[0-9]+', msg_message)
msg_chid = msg_chid.group(0)
else:
msg_chid = "None"
#Count occurences of positive and negative phrases
#This will be evaluated in the Alteryx workflow later
msg_approval = len(find_near_matches('is approve', msg_message, max_l_dist=2)) + \
len(find_near_matches('is accept', msg_message, max_l_dist=2)) + \
len(find_near_matches('we accept', msg_message, max_l_dist=1)) + \
len(find_near_matches('am comfortable', msg_message, max_l_dist=1)) + \
len(find_near_matches('are comfortable', msg_message, max_l_dist=1))
msg_no_approval = len(find_near_matches('not approve', msg_message, max_l_dist=1)) + \
len(find_near_matches('disapprove', msg_message, max_l_dist=1)) + \
len(find_near_matches('reject', msg_message, max_l_dist=1)) + \
len(find_near_matches('not accept', msg_message, max_l_dist=1)) + \
len(find_near_matches('not comfortable', msg_message, max_l_dist=1)) + \
len(find_near_matches('not', msg_message, max_l_dist=1)) + \
len(find_near_matches('denied', msg_message, max_l_dist=1)) + \
len(find_near_matches('deny', msg_message, max_l_dist=1))
msg_attachment = msg.attachments
msg.save(customPath=attach_path, attachmentsOnly=False, extractEmbedded=False, \
skipEmbedded=True, skipHidden=True, useFileName=True)
#Find the latest extracted file and associate it with the most recent email processed
#This will be evaluated in the Alteryx workflow later
#Only PNG files are supported for now
file_type = r'\*'
file_type_png = r'\*\*.png'
files = glob.glob(attach_path + file_type)
latest_file = max(files, key=os.path.getctime)
files_png = glob.glob(attach_path + file_type_png)
latest_file_png = max(files_png, key=os.path.getctime)
#Write the content to the CSV file
writer.writerow({'subject': msg_subj, 'sentDateTime': msg_date, 'from address': msg_sender, \
'body': msg_message, 'save path': latest_file, 'screenshot': latest_file_png, \
'chid': msg_chid, 'with approval': msg_approval, 'no approval': msg_no_approval,})
Reading of contents of CSV files into a Pandas dataframe:
emails = pd.read_csv(r'C:\temp\shared_folder\Email.csv')
Send output to an Alteryx Python outgoing anchor:
Alteryx.write(emails, 1)
Like the young prince and the pilot in the story we alluded to earlier, exploration results in better understanding. The following areas can be reviewed to further improve the Python component in this integration/workflow.
I hope you have found this exploration of Alteryx and Python enjoyable, if not useful. Working on business problems reminds me of the young prince's realization: knowledge becomes real when it is acquired through experiences. I thank the Alteryx community for actively sharing their experiences which have contributed a great deal to my learning journey. Eager to hear your thoughts and approaches to tackling scenarios like the one above.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.