Alteryx Designer Desktop Discussions

Roche · 07-04-2022

Hi everyone,

I am working on a web scraping project in which I have already scraped the PDFs from the link https://www.extremenetworks.com/support/end-of-sale-and-end-of-support-products/ (table 1). I am now working on extracting the tabulated data from these PDFs. When I use the PDF reader tool, some tables come out aligned and others have very little alignment, as seen below compared to the data in the PDF. (I have also attached the workflow, PDF and xlsx file.) Does anyone know how I can align the data, or has anyone solved a problem like this before? I wanted to extract the PDFs as images instead, but data is missing from the tables if I use the computer vision tools.

Thank you for helping!

Rouche

BrandonB · 07-04-2022

Does the link to the Excel file not already have what you need?

Roche · 07-04-2022

Hi @BrandonB, thank you for your message. No, the data in Excel is not structured. The data is there, but it is messy and I am looking for ways to align it. There are 160 PDFs and I would like to do this as efficiently as possible. I need the data structured for data analysis.

Roche · 07-04-2022

What is the link to the website in the screenshot?

BrandonB · 07-04-2022

I was able to download the Excel file from their website and it looks like it is structured nicely. Is this not what you need? The data is already tabular and you don't need to parse any PDFs.

BrandonB · 07-04-2022

It is the same page that you have linked in your post: https://www.extremenetworks.com/support/end-of-sale-and-end-of-support-products/

BrandonB · 07-04-2022

I also put together a workflow that dynamically pulls the Excel file from the page and reads in the data just in case they change that URL:

Workflow attached
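For readers without Alteryx, the idea of finding the Excel link on the page instead of hard-coding it can be sketched in Python. This is only an illustration, not the attached workflow: the sample HTML, the file name `eos-list.xlsx`, and the helper names `ExcelLinkFinder` and `find_excel_links` are all assumptions, and the real page markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ExcelLinkFinder(HTMLParser):
    """Collect the href of every <a> tag that points at an Excel file."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href") or ""
        # Ignore any query string or fragment when checking the extension.
        path = href.split("?", 1)[0].split("#", 1)[0].lower()
        if path.endswith((".xlsx", ".xls")):
            # Resolve relative links against the page URL.
            self.links.append(urljoin(self.base_url, href))


def find_excel_links(html, base_url):
    finder = ExcelLinkFinder(base_url)
    finder.feed(html)
    return finder.links


# Hypothetical page fragment; the real markup and file name are not known here.
SAMPLE_HTML = """
<p><a href="/support/bulletin-1.pdf">Bulletin 1</a></p>
<p><a href="/files/eos-list.xlsx?v=2">Download as Excel</a></p>
"""

PAGE_URL = "https://www.extremenetworks.com/support/end-of-sale-and-end-of-support-products/"
links = find_excel_links(SAMPLE_HTML, PAGE_URL)
print(links)
```

In a real run the HTML would come from downloading `PAGE_URL` first. Because the link is looked up on every run, the download keeps working if the file is moved, as long as the page still links to an `.xlsx` or `.xls` file.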

Roche · 07-05-2022

Hi Brandon, thanks a lot for this! I still have to consider whether the other data in the individual bulletins is needed. If not, then this file might be all I need from this site. Thank you

Rouche

BrandonB · 07-05-2022

If there were extra data in each of the PDFs that was needed, you would need a workflow dynamic enough to pull out and parse the tables, AND logic that standardizes the column headers so that the tables can be unioned. My biggest concern is that if the bulletins don't have standardized formats, then even though you could build a solution for today's layout, they could change it tomorrow. If there is extra information included, I would probably reach out to the company directly and see if it could be provided in the export.
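To make the "standardize the headers, then union" step concrete, here is a minimal Python sketch. It assumes the tables have already been parsed out of the PDFs into header and row lists. The alias map, column names and part numbers are invented for illustration and are not taken from the actual bulletins.

```python
# Hypothetical alias map: header variants from different bulletins -> one canonical name.
HEADER_ALIASES = {
    "part number": "part_number",
    "part no.": "part_number",
    "sku": "part_number",
    "end of sale": "end_of_sale_date",
    "end of sale date": "end_of_sale_date",
    "eos date": "end_of_sale_date",
}


def canonical_header(raw):
    # Normalize case and collapse repeated whitespace before looking up the alias.
    key = " ".join(raw.strip().lower().split())
    return HEADER_ALIASES.get(key, key.replace(" ", "_"))


def union_tables(tables):
    """tables: list of (headers, rows) pairs. Returns (columns, records)."""
    columns = []
    records = []
    for headers, rows in tables:
        names = [canonical_header(h) for h in headers]
        for name in names:
            if name not in columns:
                columns.append(name)
        for row in rows:
            records.append(dict(zip(names, row)))
    # Fill columns a bulletin does not have with None so every record has the same keys.
    return columns, [{c: r.get(c) for c in columns} for r in records]


# Two made-up bulletins with differently named headers.
table_a = (["Part Number", "End of Sale Date"], [["ABC-1", "2022-01-31"]])
table_b = (["SKU", "EOS  Date", "Replacement"], [["XYZ-9", "2023-06-30", "XYZ-10"]])

columns, records = union_tables([table_a, table_b])
print(columns)
for record in records:
    print(record)
```

The weak point is the one raised above: the alias map only covers header variants that have already been seen, so a bulletin with a new layout or a renamed column would need the map updated before it lines up with the rest.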
