Image to text / pdf to text

Question

Hi everyone,

I need to parse 100+ PDFs to text, specifically only the tabulated data in the PDFs.  All the PDFs are in one folder, and the number of tables and the number of columns likely differ between files.  So far I have tried sending the PDF through as an image and got the following result for the attached PDF (I am also attaching the flow):

I expected the tabulated data to come out as in the PDF table, but the order / position of some lines is not correct, e.g. "AIRFLOW" is supposed to be with another line of text. Some of the data is truncated, some is split where there seems to be a space or new line, and some images seem to be read incorrectly.

I am hoping to transform the pdf data to text without needing to do lots of parsing since there are many files to convert.  Can someone help me with this?  Is there a specific kind of delimiter on which I need to parse to get all the data in the cells?  Or will I need to instead connect it as a pdf and then parse the outcome?

Thank you for helping!

Rouche

2_SLX 9140.pdf

image_to_text_pdf.yxmd

Roche · Answer

Good morning Samantha,

Thank you for your help on this!  Appreciate it.  Yes, this is a good example for Alteryx to look at.

Rouche

Samantha_Jayne · Answer

Afternoon @Roche, I have been looking at this in depth today. What I have noticed is that each cell contains two lines of text, and the table reader treats them as two separate rows, i.e. as if the line doesn't carry on but each row holds distinct data. That explains the results you are seeing. However, it doesn't help you when you are trying to pull all the detail out and push it into table form.
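To make the two-lines-per-cell problem concrete outside of Designer, here is a minimal Python sketch. It is not what the attached workflow does. It assumes the extracted table arrives as a list of rows (lists of strings) and that a row whose first column is empty is a continuation of the row above it. That heuristic, the function name `merge_continuation_rows`, and the sample values are all made up for illustration and would need checking against the real PDFs.

```python
def merge_continuation_rows(rows, key_index=0):
    """Fold rows with an empty key column into the previous row.

    Assumption (not from the original workflow): a blank key cell means
    the row is the second text line of the cell above it.
    """
    merged = []
    for row in rows:
        cells = [(cell or "").strip() for cell in row]
        key = cells[key_index] if len(cells) > key_index else ""
        if merged and not key and any(cells):
            previous = merged[-1]
            # Pad the previous row if the continuation row is wider.
            while len(previous) < len(cells):
                previous.append("")
            for i, cell in enumerate(cells):
                if cell:
                    previous[i] = (previous[i] + " " + cell).strip()
        elif any(cells):
            # A row with a key (or the very first non-empty row) starts a new record.
            merged.append(cells)
        # Rows that are entirely empty are dropped.
    return merged


# Hypothetical sample: one logical cell split across two extracted rows.
sample_rows = [
    ["AIRFLOW", "Front to"],
    ["", "back"],
    ["", ""],
    ["WEIGHT", "10 kg"],
]
cleaned = merge_continuation_rows(sample_rows)
```

If the label itself wraps onto a second line, this heuristic will not catch it, so it is only a starting point.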

It would need some data cleansing to really achieve this. What I have been able to do is the following:

What is important to note here is that it is the way the text spans two lines within a single row, and the inconsistency with which that is done (see the last row), that is causing your grief. This is only a light touch, and with more time and more examples more could be done, but hopefully it gives you a flavour of how to treat this data when it's not in a standard table format. Using a Transpose with a Record ID and filtering out the empty rows certainly helps in this case to clear out some of that data, but it is hard to say whether this will really help with 1000s of PDFs, for example. I will raise this as an example internally. Can I ask what version of Designer you are running?
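For anyone who wants to see the idea behind the Record ID, Transpose and Filter steps in code form, here is a rough Python equivalent. It is only a sketch of the concept, not an export of the workflow: the function name `transpose_with_record_id`, the `Field1`..`Field3` column names, and the sample values are hypothetical.

```python
def transpose_with_record_id(rows):
    """Reshape wide rows into (record_id, column_name, value) triples.

    Mirrors the idea of Record ID + Transpose + Filter: every row gets an id,
    every cell becomes its own record, and empty cells are dropped.
    """
    long_rows = []
    for record_id, row in enumerate(rows, start=1):
        for name, value in row.items():
            text = (value or "").strip()
            if text:  # filter out empty / whitespace-only cells
                long_rows.append((record_id, name, text))
    return long_rows


# Hypothetical extracted rows; real output will have different columns.
sample = [
    {"Field1": "AIRFLOW", "Field2": "", "Field3": "Front to back"},
    {"Field1": "", "Field2": None, "Field3": "   "},
    {"Field1": "WEIGHT", "Field2": "10 kg", "Field3": ""},
]
long_form = transpose_with_record_id(sample)
```

Keeping the record id on every value means the cleaned cells can be grouped back into rows afterwards, even when different PDFs have different numbers of columns.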

Please see an example attached.

image_to_text_pdf_SJ.yxmd