Engine Works

leofrancia · ‎03-06-2023

In The Little Prince by Antoine de Saint-Exupéry, we read about a boa constrictor digesting an elephant many times its size. This is a great metaphor for pursuing a goal despite obstacles or objections. To be bold.

I am a Partner Sales Engineer at Alteryx. A recent engagement required the extraction of information from Outlook emails (MSG files) stored in a shared directory. For security reasons, a connection to a shared mailbox was not possible in this set-up which prevented usage of available tools such as Outlook Input and Outlook 365.

Enter another constrictor that can boldly address the requirement along with Alteryx: Python.

Business Problem

Auditing approval and execution of system changes can be a repetitive task prone to human errors. Furthermore, email evidence and attachments are sent to various recipients, requiring consolidation of said documents. Due to security reasons, provisioning of a shared mailbox may not be allowed, thus requiring consolidation of evidence in a restricted shared folder or directory.

Alteryx Solution

Alteryx's rich set of tools for extraction, parsing, evaluation, and summarization of data, complemented by Python's libraries, allowed for a seamless process of extracting and evaluating email content and attachments.

Using the extract_msg library, a Python script was created to extract information and attachments from MSG files. On a high level, the Python script process is described below. Do ignore the Alteryx equivalent for now, we will get back to that later.

#	Step	Python command / library	Can be replaced by Alteryx equivalent
1	Read emails from folder	extract_msg	N/A
2	Extract email metadata and body	extract_msg	N/A
3	Extract relevant content from emails	re	RegEx
4	Assess whether a mail is an approval or rejection	fuzzysearch	Sentiment Analysis
5	Extract attachments and identify mail with no attachments	extract_msg	N/A

The Python output is then fed to the rest of the Alteryx workflow using standard tools. The rest of the workflow evaluates whether the emails related to a given change ID followed the correct approval sequence, contained the correct approval values, were approved by the correct personnel, and were supported by system screenshots. The focus of this blog is the Python component which will be explored in detail in the succeeding sections.

Outcome

Developing the Python script and Alteryx workflow according to the format of the emails to be evaluated resulted in an automated process that can extract, evaluate, parse emails, and process screenshots for 1000 emails in ~5 minutes. This allows for reduced risk by having a larger sample size for audit while reducing human errors and execution time. Having a human in the loop after the execution of a workflow will ensure correct execution and allow for improvements in the future.

Accessing the Python tool

The Python tool in Alteryx designer can be accessed via Tool Palette > Developer > Python, or alternatively by searching "python" in the search bar > Python; then dragging the tool.

Dragging the tool will bring it to the Canvas.

Clicking on the tool will open the Configuration panel, where you have access to an interactive notebook environment to create your Python script.

Detailed code with comments

You can create the Python script as below. Note that this follows an email format specific to a customer scenario so you may need to adjust accordingly to match your use case.

Initialization:

# List all non-standard packages to be imported by your 
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])
from ayx import Alteryx
Alteryx.installPackage(package=["extract-msg","fuzzysearch"])

Reading of emails:

import os #for OS commands
import glob #to read directories
import extract_msg #to extract information from emails
import pandas as pd
import csv #to save into CSV
import re #to perform RegEx
import datetime #for Date and Time conversion
from fuzzysearch import find_near_matches #to search for text variations
 
#Write extracts from each email in the directory to file Email.csv
with open(r'C:\temp\shared_folder\Email.csv', mode='w') as file:
    fieldnames = ['subject', 'sentDateTime', 'from address', 'body', 'save path', \
                  'screenshot', 'chid', 'with approval', 'no approval']
    writer = csv.DictWriter(file, fieldnames=fieldnames, lineterminator='\n')
 
    writer.writeheader()
    
    msg_path=r'C:\temp\shared_folder\messages'
    msg_files=os.listdir(msg_path)
    attach_path=r'C:\temp\shared_folder\attachments'
    
    #Extract information from each email
    for f in msg_files:    
        xpath = os.path.join(msg_path,f)
        if not xpath.endswith('.msg'):
            continue
        msg = extract_msg.Message(xpath)  
        msg_sender = msg.sender
        msg_date = msg.date
        msg_pdate = msg.parsedDate
        msg_subj = msg.subject  
        msg_message = msg.body
        
        #Extract Change ID information from email subject
        msg_chid = re.search('CH[0-9]+', msg_message)
        if msg_chid is not None:
            msg_chid = re.search('CH[0-9]+', msg_message)
            msg_chid = msg_chid.group(0)
        else:
            msg_chid = "None"
        
        #Count occurences of positive and negative phrases
        #This will be evaluated in the Alteryx workflow later
        msg_approval = len(find_near_matches('is approve', msg_message, max_l_dist=2)) + \
        len(find_near_matches('is accept', msg_message, max_l_dist=2)) + \
        len(find_near_matches('we accept', msg_message, max_l_dist=1)) + \
        len(find_near_matches('am comfortable', msg_message, max_l_dist=1)) + \
        len(find_near_matches('are comfortable', msg_message, max_l_dist=1))
        msg_no_approval = len(find_near_matches('not approve', msg_message, max_l_dist=1)) + \
        len(find_near_matches('disapprove', msg_message, max_l_dist=1)) + \
        len(find_near_matches('reject', msg_message, max_l_dist=1)) + \
        len(find_near_matches('not accept', msg_message, max_l_dist=1)) + \
        len(find_near_matches('not comfortable', msg_message, max_l_dist=1)) + \
        len(find_near_matches('not', msg_message, max_l_dist=1)) + \
        len(find_near_matches('denied', msg_message, max_l_dist=1)) + \
        len(find_near_matches('deny', msg_message, max_l_dist=1))
        msg_attachment = msg.attachments        
        msg.save(customPath=attach_path, attachmentsOnly=False, extractEmbedded=False, \
                 skipEmbedded=True, skipHidden=True, useFileName=True) 
        
        #Find the latest extracted file and associate it with the most recent email processed
        #This will be evaluated in the Alteryx workflow later
        #Only PNG files are supported for now
        file_type = r'\*'
        file_type_png = r'\*\*.png'
        files = glob.glob(attach_path + file_type)
        latest_file = max(files, key=os.path.getctime)
        files_png = glob.glob(attach_path + file_type_png)
        latest_file_png = max(files_png, key=os.path.getctime)
        
        #Write the content to the CSV file
        writer.writerow({'subject': msg_subj, 'sentDateTime': msg_date, 'from address': msg_sender, \
                         'body': msg_message, 'save path': latest_file, 'screenshot': latest_file_png, \
                         'chid': msg_chid, 'with approval': msg_approval, 'no approval': msg_no_approval,})

Reading of contents of CSV files into a Pandas dataframe:

emails = pd.read_csv(r'C:\temp\shared_folder\Email.csv')

Send output to an Alteryx Python outgoing anchor:

Alteryx.write(emails, 1)

Limitations and potential improvements

Like the young prince and the pilot in the story we alluded to earlier, exploration results in better understanding. The following areas can be reviewed to further improve the Python component in this integration/workflow.

As described in item 3 in the Solution section, RegEx steps in Python can be moved to the Alteryx workflow. This will allow for a low-code approach in doing modifications later.
As described in item 4 in the Solution section, fuzzysearch steps in Python can be moved to the Alteryx workflow. Alteryx Sentiment Analysis will require fewer modification steps in case there are changes required later.
Python script can be updated to include other file types, not just PNG.

Conclusion

I hope you have found this exploration of Alteryx and Python enjoyable, if not useful. Working on business problems reminds me of the young prince's realization: knowledge becomes real when it is acquired through experiences. I thank the Alteryx community for actively sharing their experiences which have contributed a great deal to my learning journey. Eager to hear your thoughts and approaches to tackling scenarios like the one above.

gabslt · ‎03-14-2023

Hello!

Is there a way of doing it with .eml files?

i tried to change the file extension but its not working

leofrancia · ‎03-14-2023

Hi @gabslt,

I explored the same EML scenario but my use case ended up using MSG files exclusively. You can explore the two approaches below (there may be more).

1. You can convert the EML files to MSG using win32com (requires Outlook installed which will work with the win32com library) or msgconvert (requires msgconvert binaries installed on top of the python libraries installed via pip). This approach may be useful if you have access to install new programs/binaries in the Alteryx server in which you are running the server.

2. Alternatively, you can use email library like below. If you are working on both MSG and EML files, Alteryx tools Date Parsing, Select, Union, etc. are useful to combine the data from the two sets of filetypes.

import email

with open('Test.eml', 'r') as f:
    # Parse the email message from the file
    msg = email.message_from_file(f)

# Print the message subject, sender, and recipient
print('Subject:', msg['Subject'])
print('From:', msg['From'])
print('To:', msg['To'])

# Print the message body
if msg.is_multipart():
    # If the message is multipart (i.e., has attachments), iterate over the parts
    for part in msg.walk():
        content_type = part.get_content_type()
        if content_type == 'text/plain' or content_type == 'text/html':
            # If the part is a text/plain or text/html part, print its content
            body = part.get_payload(decode=True)
            print(body.decode('utf-8'))
else:
    # If the message is not multipart (i.e., has no attachments), print its body
    body = msg.get_payload(decode=True)
    print(body.decode('utf-8'))

cgaspar · ‎08-09-2023

This looks really useful! Is there a workflow that can be shared?

leofrancia · ‎08-11-2023

Hi, @cgaspar - sure thing. Have a look here: https://community.alteryx.com/t5/Alteryx-Designer-Desktop-Discussions/Sample-file-for-Email-Extracti...

It looks like I cannot upload a YXMD in a blog's comment section.

Please note that this is the first version of the workflow described in the blog. The latest subsequent versions started to have more customer/process-specific information so I am unable to share that. The workflow I have will likely not work unless you have an email with the correct format but I hope it will give you a general idea.

Justin_Weyer · ‎09-21-2023

I was denied access as well for my organization due to security restrictions. I'm starting here, thankful for this post good sir!