
Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Align messy pdf to excel output data

Roche
8 - Asteroid

Hi everyone, 

 

I am working on a web scraping project in which I have already scraped the PDFs from the link https://www.extremenetworks.com/support/end-of-sale-and-end-of-support-products/ (table 1). I am now working on extracting the tabulated data from these PDFs. When I use the PDF reader tool, some tables come out aligned while others have very little alignment, as seen below compared to the data in the PDF. (I have also attached the workflow, PDF, and xlsx file.) Does anyone know how I can align the data, or has anyone solved a problem like this before? I wanted to extract the PDFs as images, but data is missing from the tables when I use the computer vision tools.

 

Roche_0-1656942559496.png

Roche_1-1656942604976.png

Thank you for helping!

 

Rouche

 

8 REPLIES
BrandonB
Alteryx

Does the link to the Excel file not already have what you need? 

 

BrandonB_0-1656946889874.png

 

Roche
8 - Asteroid

Hi @BrandonB, thank you for your message. No, the data in Excel is not structured. The data is there, but it is messy and I am looking for ways to align it. There are 160 PDFs and I would like to do this as efficiently as possible. I need the data structured for data analysis.

Roche
8 - Asteroid

What is the link to the website in the screenshot?

BrandonB
Alteryx

I was able to download the Excel file from their website and it looks like it is structured nicely. Is this not what you need? The data is already tabular and you don't need to parse any PDFs. 

 

BrandonB_0-1656947754687.png

 

BrandonB
Alteryx

It is the same page that you have linked in your post: https://www.extremenetworks.com/support/end-of-sale-and-end-of-support-products/ 

 

BrandonB_0-1656947985808.png

 

BrandonB
Alteryx

I also put together a workflow that dynamically pulls the Excel file from the page and reads in the data just in case they change that URL:

 

BrandonB_0-1656948444402.png

 

Workflow attached 
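For anyone without the attached workflow, the same idea can be sketched in Python: scan the page's HTML for links ending in .xlsx or .xls instead of hard-coding the file URL. This is only a rough illustration, not what the workflow itself does; the names `ExcelLinkFinder` and `find_excel_links` are made up for this example, and it assumes the file is exposed as a plain `<a href>` link in the page source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ExcelLinkFinder(HTMLParser):
    """Collects absolute URLs of <a> links that point at Excel files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith((".xlsx", ".xls")):
            self.links.append(urljoin(self.base_url, href))


def find_excel_links(html, base_url):
    """Return every Excel link found in the given HTML, resolved against base_url."""
    finder = ExcelLinkFinder(base_url)
    finder.feed(html)
    return finder.links
```

The HTML would come from downloading the page first (for example with `urllib.request.urlopen`), and the first returned link could then be passed to whatever tool reads the spreadsheet. If the page builds the link with JavaScript, this approach would not find it.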

 

Roche
8 - Asteroid

Hi Brandon, thanks a lot for this! I still have to consider whether the other data in the individual bulletins is needed. If not, then this file might be all I need from this site. Thank you.

 

Rouche

BrandonB
Alteryx

If there were extra data in each of the PDFs that was needed, you would need a workflow dynamic enough to pull out and parse the tables, AND logic that standardizes the column headers so that the tables can be unioned. My biggest concern is that if the bulletins don't have standardized formats, then even though you could build a solution for today's layouts, they could change tomorrow. If there is extra information included, I would probably reach out to the company directly and see if it could be provided in the export.
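To make the "standardize the headers, then union" step concrete, here is a minimal Python sketch. It assumes each parsed table arrives as a list of header strings plus a list of rows. The alias map and the column names in it are hypothetical and are not taken from the actual bulletins; a real map would have to be built by inspecting the 160 PDFs.

```python
import re

# Hypothetical alias map: header variants -> one canonical column name.
HEADER_ALIASES = {
    "part number": "part_number",
    "part no": "part_number",
    "end of sale date": "end_of_sale",
    "eos date": "end_of_sale",
}


def normalize_header(header):
    """Lowercase, collapse punctuation/whitespace, then apply the alias map."""
    key = re.sub(r"[^a-z0-9]+", " ", header.lower()).strip()
    return HEADER_ALIASES.get(key, key.replace(" ", "_"))


def union_tables(tables):
    """Stack tables with differing headers into one list of uniform records.

    `tables` is a list of (headers, rows) pairs. Columns missing from a
    given table are filled with None.
    """
    columns = []
    records = []
    for headers, rows in tables:
        names = [normalize_header(h) for h in headers]
        for name in names:
            if name not in columns:
                columns.append(name)
        for row in rows:
            records.append(dict(zip(names, row)))
    return columns, [{c: r.get(c) for c in columns} for r in records]
```

Any header that is not in the alias map passes through in cleaned-up form, so new variants show up as extra columns rather than being silently dropped. That makes a format change in a future bulletin easy to spot.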
