Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Help parsing messy email HTML into clean text for sentiment analysis


Hi there,


Does anyone have experience with, or ideas or tools for, parsing and cleaning up messy HTML emails that we feed into Alteryx for analysis? The main issue is that the emails are collected as raw HTML, which contains signatures, replies and conversation history, disclaimers, greetings and so on. We are hoping to extract just the body of the latest email so that we can run sentiment analysis on it.


We currently parse the HTML in R and then rely on a lot of RegEx to clean up the text and strip out the excess information.
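
To make the approach concrete, here is a minimal sketch of that kind of pipeline, written in Python for illustration rather than taken from our actual R code. It converts the HTML to plain text with the standard library, cuts everything from the first reply or signature marker onwards, and drops a leading greeting line. The function names (`html_to_text`, `latest_message`) and all of the marker patterns are hypothetical and would need tuning against real emails.

```python
import re
from html.parser import HTMLParser

# Tags whose contents are dropped entirely (assumption: quoted replies
# often arrive wrapped in <blockquote>).
SKIP_TAGS = {"script", "style", "head", "blockquote"}
# Tags treated as line breaks in the extracted text.
BLOCK_TAGS = {"p", "div", "br", "tr", "li", "h1", "h2", "h3"}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of SKIP_TAGS."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


# Hypothetical markers for where the conversation history starts.
REPLY_MARKERS = re.compile(
    r"^(On .+ wrote:|-{2,}\s*Original Message\s*-{2,}|From:\s.+|Sent:\s.+)$",
    re.IGNORECASE | re.MULTILINE,
)
# Hypothetical markers for where a signature starts.
SIGNATURE_MARKERS = re.compile(
    r"^(--\s*|(kind |best )?regards,?|thanks,?|many thanks,?|cheers,?)$",
    re.IGNORECASE | re.MULTILINE,
)
# A greeting on the first line, e.g. "Hi Sam,".
GREETING = re.compile(r"^(hi|hello|dear)\b[^\n]*\n", re.IGNORECASE)


def html_to_text(html):
    """Strip tags and return one non-empty, whitespace-normalised line per block."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)
    lines = [re.sub(r"\s+", " ", ln).strip() for ln in text.splitlines()]
    return "\n".join(ln for ln in lines if ln)


def latest_message(html):
    """Return a best guess at the body of the most recent email."""
    text = html_to_text(html)
    # Truncate at the first reply marker, then at the first signature marker.
    for pattern in (REPLY_MARKERS, SIGNATURE_MARKERS):
        match = pattern.search(text)
        if match:
            text = text[:match.start()]
    # Remove a greeting only if it is the very first line.
    text = GREETING.sub("", text.strip(), count=1)
    return text.strip()
```

Truncating at the first marker is deliberately crude: it keeps the sketch short, but it will also cut a genuine sentence that happens to sit on a line starting with "From:" or consisting only of "Thanks".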


Has anyone done anything similar, perhaps by looking at structure in the text such as capitalisation or paragraph breaks?
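
As a rough idea of what a structure-based approach could look like, the Python sketch below scores each paragraph on capitalisation and length, then drops the ones that look like greetings, signature blocks or disclaimers. The feature names and thresholds are guesses for illustration, not tested values.

```python
import re


def paragraph_features(paragraph):
    """Compute simple capitalisation and length features for one paragraph."""
    letters = [c for c in paragraph if c.isalpha()]
    words = paragraph.split()
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    title_ratio = sum(w[0].isupper() for w in words) / len(words) if words else 0.0
    return {
        "words": len(words),
        "upper_ratio": upper_ratio,   # share of letters that are upper case
        "title_ratio": title_ratio,   # share of words starting with a capital
    }


def looks_like_boilerplate(paragraph, min_words=4, max_upper=0.5, max_title=0.7):
    """Flag very short, shouty, or name-like paragraphs (assumed thresholds)."""
    f = paragraph_features(paragraph)
    return (
        f["words"] < min_words          # e.g. "Hi Luke," or "Thanks"
        or f["upper_ratio"] > max_upper  # e.g. an all-caps disclaimer
        or f["title_ratio"] > max_title  # e.g. name, job title, company
    )


def keep_content_paragraphs(text):
    """Split on blank lines and keep paragraphs that look like real content."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [p for p in paragraphs if not looks_like_boilerplate(p)]
```

Heuristics like these will misfire on short but meaningful replies such as "Not acceptable.", so they are probably better used as features for a classifier than as hard rules.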


Any help would be much appreciated.


Luke

Alteryx Certified Partner with Keyrus UK
