Hi everyone,
I am working on a web scraping project in which I have already scraped the PDFs from the link https://www.extremenetworks.com/support/end-of-sale-and-end-of-support-products/ (table 1). I am now working on extracting the tabulated data from these PDFs. When I use the PDF reader tool, some tables come out aligned and others have very little alignment, as seen below compared to the data in the PDF: (I have also attached the workflow, PDF and xlsx file.) I would like to know if anyone knows how I can align the data. Does anyone have ideas, or has anyone solved a problem like this before? I wanted to extract the PDFs as images, but data goes missing from the tables when I use these computer vision tools.
Thank you for helping!
Rouche
Does the link to the Excel file not already have what you need?
Hi @BrandonB, thank you for your message. No, the data in Excel is not structured. The data is there, but it is messy and I am looking for ways to align it. There are 160 PDFs and I would like to do this as efficiently as possible. I need the data structured for data analysis.
What is the link to the website in the screenshot?
I was able to download the Excel file from their website and it looks like it is structured nicely. Is this not what you need? The data is already tabular and you don't need to parse any PDFs.
It is the same page that you have linked in your post: https://www.extremenetworks.com/support/end-of-sale-and-end-of-support-products/
Hi Brandon, thanks a lot for this! I still have to consider whether the other data in the individual bulletins is needed - if not, then this file might be all I need from this site. Thank you
Rouche
If there were extra data in each of the PDFs that was needed, you would need a workflow dynamic enough to pull out and parse the tables, AND logic that standardized the column headers in a way that they could be unioned. My biggest concern is that if the bulletins don't have standardized formats, then even though you could build a solution for today's layout, they could change tomorrow. If there is extra information included, I would probably reach out to the company directly and see if it could be provided in the export.
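To illustrate the header-standardizing and union step outside of a workflow tool, here is a minimal Python sketch. It assumes the tables have already been pulled out of each PDF as a header list plus rows; the alias map, the canonical column names, and the function names (`normalize_header`, `union_tables`) are all hypothetical and would need to be adapted to whatever the bulletins actually contain.

```python
import re

# Hypothetical alias map: cleaned header text -> canonical column name.
# The real bulletins may use different wording; extend this as needed.
HEADER_ALIASES = {
    "part number": "part_number",
    "part no": "part_number",
    "model number": "part_number",
    "end of sale": "end_of_sale",
    "end of sale date": "end_of_sale",
    "eos date": "end_of_sale",
}


def normalize_header(raw):
    """Lowercase, collapse punctuation/whitespace, then apply the alias map."""
    key = re.sub(r"[^a-z0-9]+", " ", raw.lower()).strip()
    return HEADER_ALIASES.get(key, key.replace(" ", "_"))


def union_tables(tables):
    """Union tables given as (headers, rows) pairs.

    Returns (columns, rows) where columns is the ordered set of canonical
    names seen so far and every row is padded with "" for missing columns.
    """
    columns = []
    records = []
    for headers, rows in tables:
        names = [normalize_header(h) for h in headers]
        for name in names:
            if name not in columns:
                columns.append(name)
        for row in rows:
            records.append(dict(zip(names, row)))
    return columns, [[rec.get(col, "") for col in columns] for rec in records]
```

Headers that are not in the alias map are kept (in snake_case) rather than dropped, so a bulletin that changes its layout shows up as a new column instead of silently losing data.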