Align messy PDF to Excel output data
Hi everyone,
I am working on a web scraping project in which I have already scraped the PDFs from the link https://www.extremenetworks.com/support/end-of-sale-and-end-of-support-products/ (table 1). I am now working on extracting the tabulated data from these PDFs. When I use the PDF reader tool, some tables are aligned and others have very little alignment, as seen below compared to the data in the PDF. (I have also attached the workflow, PDF, and XLSX file.) I would like to know if anyone knows how I can align the data. Does anyone have ideas, or has anyone solved a problem like this before? I wanted to extract the PDFs as images, but data is missing from the tables when I use those computer vision tools.
Thank you for helping!
Rouche
Does the link to the Excel file not already have what you need?
Hi @BrandonB, thank you for your message. No, the data in Excel is not structured. The data is there, but it is messy and I am looking for ways to align it. There are 160 PDFs and I would like to do this as efficiently as possible. I need the data structured for data analysis.
What is the link to the website in the screenshot?
I was able to download the Excel file from their website and it looks like it is structured nicely. Is this not what you need? The data is already tabular and you don't need to parse any PDFs.
It is the same page that you have linked in your post: https://www.extremenetworks.com/support/end-of-sale-and-end-of-support-products/
Hi Brandon, thanks a lot for this! I still have to consider whether the other data in the individual bulletins is needed; if not, then this file might be all I need from this site. Thank you
Rouche
If there were extra data in each of the PDFs that was needed, you would need a workflow dynamic enough to pull out and parse the tables, AND logic that standardizes the column headers so that the tables can be unioned. My biggest concern is that if the bulletins don't have standardized formats, then even though you could build a solution for today's layout, they could change it tomorrow. If there is extra information included, I would probably reach out to the company directly and see if it could be provided in the export.
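For anyone who wants to prototype the "standardize the headers, then union" step outside the workflow, here is a minimal sketch in plain Python. It assumes the tables have already been pulled out of each PDF as a header list plus rows. The header variants in `HEADER_ALIASES` and the sample values are made up for illustration and are not taken from the actual bulletins, and `normalize_header` and `union_tables` are hypothetical helper names.

```python
import re

# Hypothetical alias map: known header variants -> one canonical column name.
# The real bulletins may use different wording; extend this as variants appear.
HEADER_ALIASES = {
    "part number": "part_number",
    "part no": "part_number",
    "sku": "part_number",
    "end of sale": "end_of_sale_date",
    "end of sale date": "end_of_sale_date",
}


def normalize_header(raw):
    """Lowercase, collapse punctuation/whitespace, then apply the alias map."""
    key = re.sub(r"[^a-z0-9]+", " ", raw.lower()).strip()
    return HEADER_ALIASES.get(key, key.replace(" ", "_"))


def union_tables(tables):
    """Union tables given as (headers, rows) pairs.

    Returns (columns, records): the combined column order and one dict per
    row, with missing columns filled with an empty string.
    """
    columns = []
    records = []
    for headers, rows in tables:
        names = [normalize_header(h) for h in headers]
        for name in names:
            if name not in columns:
                columns.append(name)
        for row in rows:
            records.append(dict(zip(names, row)))
    return columns, [{c: rec.get(c, "") for c in columns} for rec in records]


# Two made-up tables whose headers differ between bulletins.
bulletin_a = (["Part No.", "End-of-Sale Date"], [["A-100", "2021-01-31"]])
bulletin_b = (["SKU", "Replacement Product"], [["B-200", "B-300"]])

columns, records = union_tables([bulletin_a, bulletin_b])
print(columns)
for rec in records:
    print(rec)
```

Headers that are not in the alias map pass through in cleaned-up form rather than being dropped, so a bulletin that changes its layout shows up as a new, mostly empty column instead of silently losing data. That does not remove the maintenance concern above, but it makes a format change easier to spot.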