I have several HTML pages I want to wrangle. I don't care much about the formatting, since the data is already there, but I'm having trouble extracting the raw text: removing the tags through wrangling is a pain. Is there a recommended approach?
By the way, I'm on a Mac, so if there is a utility I could use for the conversion, I could write a script if needed. TIA
There are several ways to do the conversion, depending on your platform.
macOS
You can use textutil to convert all the HTML pages in the current folder to .txt files:
textutil -convert txt ./*.html
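If you want this as a script, here is a minimal sketch built around the same textutil call. The function name html_to_txt and the default output folder ./txt are my own choices for illustration, not anything textutil requires; the only textutil options used are -convert and -output. It writes the converted files into a separate folder so they don't end up mixed in with the HTML sources.

```shell
#!/bin/sh
# Hypothetical helper (names are illustrative, not part of textutil):
# convert every .html file in a folder to plain text, one file at a time.
html_to_txt() {
    src_dir=${1:-.}        # folder holding the .html files (default: current)
    out_dir=${2:-./txt}    # folder for the .txt results (assumed default name)
    mkdir -p "$out_dir" || return 1
    for f in "$src_dir"/*.html; do
        [ -e "$f" ] || continue            # glob matched nothing: skip
        name=$(basename "$f" .html)        # page.html -> page
        # -output sets the destination path for a single input file
        textutil -convert txt -output "$out_dir/$name.txt" "$f" \
            || echo "failed: $f" >&2       # report and keep going
    done
}
```

Calling html_to_txt ~/pages ~/pages-txt would then process a whole folder. Quoting "$f" keeps file names with spaces intact.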
Linux
You could use unoconv, which converts between all the formats LibreOffice supports, including HTML to txt. More details and examples are at https://linux.die.net/man/1/unoconv
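As a rough sketch of what the unoconv route could look like in a script: the wrapper name unoconv_html_to_txt is hypothetical, and the only unoconv option used is -f to pick the output format. I'm assuming unoconv and LibreOffice are already installed; check the man page above for the details on your system.

```shell
#!/bin/sh
# Hypothetical wrapper (name is illustrative): convert the given HTML files
# to plain text with unoconv, with basic checks before calling it.
unoconv_html_to_txt() {
    if ! command -v unoconv >/dev/null 2>&1; then
        echo "unoconv not found; install it first" >&2
        return 127
    fi
    if [ "$#" -eq 0 ]; then
        echo "usage: unoconv_html_to_txt FILE..." >&2
        return 2
    fi
    # -f selects the output format; all files are passed in a single call
    unoconv -f txt "$@"
}
```

For example, unoconv_html_to_txt ./*.html would hand every HTML file in the current folder to unoconv at once.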