I was wondering if anyone has experience with, or ideas, solutions or tools for, parsing and cleaning up dirty HTML emails that we are feeding into Alteryx for analysis. The main issue is that the emails are collected as raw HTML, which contains signatures, replies and conversation history, disclaimers, greetings, etc. We are hoping to extract just the plain content of the latest email in order to perform sentiment analysis.
We are currently using R for parsing and then lots of RegEx in order to clean and scrape out the excess information.
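As a rough illustration of this parse-then-RegEx approach, here is a minimal sketch in Python (not the R code mentioned above), using only the standard library. The function name `latest_message` and every marker pattern in it (reply headers, sign-offs, greetings) are assumptions made up for the example, not rules we actually run, and they would need tuning against real mail.

```python
import re
from html.parser import HTMLParser

# Tags whose text is never part of the visible message body.
SKIP_TAGS = {"script", "style", "head"}
# Tags that should produce a line break in the plain-text output.
BLOCK_TAGS = {"p", "div", "br", "tr", "li", "blockquote"}


class _TextExtractor(HTMLParser):
    """Collects visible text, inserting newlines around block-level tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


# Hypothetical markers: lines that usually start quoted history or a signature.
REPLY_MARKERS = [
    re.compile(r"^-{2,}\s*Original Message\s*-{2,}", re.I),
    re.compile(r"^On .+ wrote:$", re.I),
    re.compile(r"^From:\s", re.I),
]
SIGNATURE_MARKERS = [
    re.compile(r"^--\s*$"),
    re.compile(r"^(kind regards|best regards|regards|many thanks|thanks|cheers)[,!.]?$", re.I),
]
# Hypothetical greeting pattern: a short opening line such as "Hi Sam,".
GREETING = re.compile(r"^(hi|hello|dear|good (morning|afternoon))\b[^\n]{0,40}$", re.I)


def latest_message(html):
    """Return a best-effort plain-text body of the most recent message."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    text = "".join(parser.parts)

    kept = []
    for raw in text.splitlines():
        line = re.sub(r"\s+", " ", raw).strip()
        if not line:
            continue
        # Stop at the first line that looks like quoted history or a sign-off.
        if any(p.search(line) for p in REPLY_MARKERS + SIGNATURE_MARKERS):
            break
        kept.append(line)

    # Drop a leading greeting line if there is one.
    if kept and GREETING.match(kept[0]):
        kept = kept[1:]
    return " ".join(kept)
```

Calling `latest_message(raw_html)` on each email would give a single string to pass on to sentiment scoring. Cutting at the first marker is deliberately crude: a message whose body opens with a line like "Thanks" would be truncated, which is the sort of edge case that pushes us towards ever more RegEx.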
Has anyone done anything similar? Maybe looking at text/data structures with capitalisation or paragraphing?