How to read adjacent columns in a pdf invoice

Pankhudri20
8 - Asteroid

Hello,

 

I have a PDF invoice from which I need to extract tables into an Excel or CSV file.

The main issue is that each page has two tables placed side by side.

Also, I am using tabula to convert the PDF tables, but it does not read the table on the 1st page.

It starts from the 2nd page.
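
For the tabula side of the question, one thing that may be worth trying is to name the pages explicitly and give tabula a separate extraction area for each half of the page, instead of relying on auto-detection. The sketch below assumes the tabula-py wrapper (which needs Java), a US Letter page of 612 x 792 points, and tables that meet at mid-page; `half_page_areas` and `read_two_column_pdf` are hypothetical helper names, and the page size and split position would have to be adjusted to the real invoice. It is not a diagnosis of why the 1st page was skipped, since the original code is only available as a screenshot.

```python
def half_page_areas(page_width, page_height, top=0.0, split=None):
    """Return tabula-style areas [top, left, bottom, right] in PDF points
    for the left and right halves of a page."""
    if split is None:
        split = page_width / 2  # assumption: the two tables meet at mid-page
    left = [top, 0.0, page_height, split]
    right = [top, split, page_height, page_width]
    return left, right


def read_two_column_pdf(path, n_pages, page_width=612, page_height=792):
    """Read the left table, then the right table, of every page in order."""
    # Imported lazily so half_page_areas works without tabula-py or Java.
    import pandas as pd
    import tabula

    left, right = half_page_areas(page_width, page_height)
    frames = []
    for page in range(1, n_pages + 1):  # tabula page numbers are 1-based
        for area in (left, right):
            tables = tabula.read_pdf(
                path,
                pages=page,
                area=area,
                guess=False,  # use the given area instead of auto-detection
                pandas_options={"header": None},
            )
            frames.extend(tables)  # read_pdf returns a list of DataFrames
    if not frames:
        return pd.DataFrame()
    return pd.concat(frames, ignore_index=True)
```

Because each page is read left half first and right half second, the rows of the right-hand table end up directly below the rows of the left-hand table for that page.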

 

This is how the columns are placed in the pdf.

[Image: Pankhudri20_0-1614352507919.png]

 

My Python code and warning:

[Image: Pankhudri20_1-1614352594842.png]

 

I am really stuck on this and any help would be greatly appreciated.

Thank you!

Pankhudri

echuong1
Alteryx Alumni (Retired)

Given this is an Alteryx forum, I'll give you some guidance on how something like this can be achieved with Alteryx tools 🙂

 

I suggest using a PDF reader to import the data. There are a couple available on the public Gallery that use Python/R under the hood. Two that I recommend are below:

https://gallery.alteryx.com/#!app/PDF-Input/5b685aff0462d710907f7a3b

https://gallery.alteryx.com/#!app/PDF-Input--Text-and-Image-/5be5ec8d0462d71ffce6deaa

 

These will read in everything from your file. Once the data is imported, and depending on how it comes in, you can use the Text To Columns and RegEx tools to parse out the fields.

 

Hope this helps!

Pankhudri20
8 - Asteroid

Hello @echuong1 

 

Thank you for your response!

 

I did use the PDF Input tool to convert the file, but I am still unable to read the adjacent columns in Alteryx.

[Image: Pankhudri20_0-1614357061068.png]

I need to read the 1st column and then the 2nd column on the same page.

The same goes for every page in the PDF.

How can I achieve that in Alteryx?

 

echuong1
Alteryx Alumni (Retired)

Now that the data is imported, you'll need to parse it. This can be done with the Text To Columns or RegEx tool. RegEx looks like it'll be the most helpful in your case. I suggest taking a look at the Tool Mastery article:

https://community.alteryx.com/t5/Alteryx-Designer-Knowledge-Base/Tool-Mastery-RegEx/ta-p/37689 

 

This parsing will apply to every row, so it'll include all pages. You can use filters to remove anything that isn't relevant. 
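
To make the regex idea concrete, here is a minimal sketch in Python (the language used earlier in the thread) rather than in the RegEx tool itself. The record layout is an assumption, since the invoice is only visible as a screenshot: each record is taken to be a date, a time, a dialled number, a duration, and a charge, and a raw line is taken to hold the left record followed by the right record. The pattern, the `parse_line` helper, and the sample line are all hypothetical.

```python
import re

# Hypothetical record layout: date, time, dialled number, duration, charge.
RECORD = re.compile(
    r"(?P<date>\d{2}/\d{2}/\d{4})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+"
    r"(?P<number>\d+)\s+"
    r"(?P<duration>\d{2}:\d{2})\s+"
    r"(?P<charge>\d+\.\d{2})"
)


def parse_line(line):
    """Return one dict per record found on the line, left to right."""
    return [m.groupdict() for m in RECORD.finditer(line)]


# Made-up line with a left record and a right record side by side.
line = "01/02/2021 10:15:00 5550001 02:30 1.50   01/02/2021 18:40:12 5550002 00:45 0.40"
records = parse_line(line)
```

Lines that do not contain a full record (page headers, totals, and so on) simply yield an empty list, which plays the same role as the filter step mentioned above.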

Pankhudri20
8 - Asteroid

Thank you!

 

Can you tell me what would work for reading the right-side columns (from date and time to call charges) and appending them below the left-side columns (from date to call charges)?

Please refer to the image above for placement of columns.

 

For example, in this screenshot I need datetime 11 to go below 10, and 28 below 27. So the matching columns need to be appended below each other on every page.

21 is on the 2nd page, 35 would be on the 3rd page, and so on.

 

[Image: Pankhudri20_0-1614362038264.png]

Pankhudri
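
The reordering asked for here (all left-side rows of a page, then all right-side rows of the same page, then the next page) can be sketched in plain Python. This is only an illustration of the logic, assuming each parsed line has already been reduced to a page number, a left record, and a right record; `stack_columns` is a hypothetical helper, and the integers below are stand-ins for whole records.

```python
from itertools import groupby
from operator import itemgetter


def stack_columns(rows):
    """rows: (page, left_record, right_record) tuples in reading order.

    Returns the records page by page, with every left record of a page
    placed before the right records of that same page.
    """
    stacked = []
    for _page, group in groupby(rows, key=itemgetter(0)):
        group = list(group)
        # Left-hand table of this page first...
        stacked.extend(left for _, left, _ in group if left is not None)
        # ...then the right-hand table of the same page directly below it.
        stacked.extend(right for _, _, right in group if right is not None)
    return stacked


# Stand-in data: page 1 has two lines, page 2 has one full line and one
# line where only the left side is filled.
rows = [
    (1, 1, 3),
    (1, 2, 4),
    (2, 5, 7),
    (2, 6, None),
]
ordered = stack_columns(rows)
```

In Alteryx terms, the same effect would come from splitting each line into a left and a right set of fields, tagging each set with its page and side, and then sorting by page, side, and original row order.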
