I'm working on a parsing project involving multi-page PDFs with varying formats and table structures. RegEx expressions have been a big help; however, because the structure of the text and numerical tables varies from file to file, the expressions are not yet fully reliable.
Many thanks to Chad for his post, "Can Alteryx Parse a Word Doc or PDF?", found below. His workflow using doctotext.exe gave me a solid foundation to begin this project.
I've attached a sample of the PDFs I'm working with, as well as the modified workflow. Ideally, I'd like to be able to isolate the "Investments" table without needing an external parser such as Tabula (http://tabula.technology/).
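To make the goal concrete, here is a rough Python sketch (outside Alteryx) of the kind of isolation I mean. It assumes the text has already been extracted from the PDF, that the table starts at a line reading exactly "Investments", that it ends at a row beginning with "Total", and that columns are separated by two or more spaces. The function names and all of those assumptions are illustrative only; they are not taken from the attached workflow or PDFs.

```python
import re


def isolate_section(text, heading="Investments", stop_pattern=r"^\s*Total\b"):
    """Return the rows after a heading line, up to and including the first stop row.

    Hypothetical helper: the heading text and the "Total" stop rule are
    assumptions about the layout, not facts about any particular PDF.
    """
    heading_re = re.compile(rf"^\s*{re.escape(heading)}\s*$", re.IGNORECASE)
    stop_re = re.compile(stop_pattern, re.IGNORECASE)
    rows = []
    inside = False
    for line in text.splitlines():
        if not inside:
            # Keep scanning until the heading line is found.
            if heading_re.match(line):
                inside = True
            continue
        if not line.strip():
            continue  # ignore blank lines inside the table
        rows.append(line.strip())
        if stop_re.match(line):
            break  # the stop row closes the table
    return rows


def split_columns(row):
    """Split one row into cells on runs of two or more whitespace characters."""
    return re.split(r"\s{2,}", row.strip())
```

For example, calling `isolate_section(extracted_text)` and then `split_columns` on each returned row would give one list of cells per table row. If the real tables do not end with a "Total" row, a different `stop_pattern` would be needed.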
Thank you for your time, and I greatly appreciate any insight or suggestions!