ALTERYX INSPIRE | Join us this May for for a multi-day virtual analytics + data science experience like no other! Register Now

Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Extracting Tabular Data from PDF Documents with Python Code Tool

DavidM
Alteryx
Alteryx

Hi @GilYee,

 

Regarding the ghostscript exception you are getting please refer back to the original post:

 

"For ghostscript, you can get the installer at the ghostscript downloads page."

 

"After installing ghostscript, you’ll need to reboot. Seriously - do it! if you are like me you don't. But this is needed here and would have saved me a long long time.

Also, your system must have ghostscript executable’s path is in the windows PATH environment variable."

 

Cheers,

david

David Matyas
Sales Engineer
Alteryx
fajar_wimar
6 - Meteoroid

Hi All, I have made PDF to Tabular macros based on this post which can read multiple pdfs into tabular data.

https://community.alteryx.com/t5/Alteryx-Designer-Discussions/PDF-to-Tabular/td-p/553429

Thank you.

fgranger
5 - Atom

Hi David,

 

Thanks for this script and your explanations. I have used this as a starting point, and I am working on a macro which feeds in parameters. I am struggling with passing in the table_areas and columns parameters, and I was wondering if you could provide any guidance, please.

 

Here is the "hardcoded" starting point, based on your script:

 

#HARDCODED PDF IMPORTER

from ayx import Alteryx

import camelot
import pandas as pd

tables = camelot.read_pdf(
    '[PATH TO PDF].pdf',
    password='',
    flavor='stream',
    row_tol=5,
    pages='all',
    table_areas=[
        '0,600,800,550',
        '0,550,800,470',
        '0,470,800,40',
        '0,40,800,0'
    ],
    columns=[
        '0,200,600,730',
        '0,60,330,440,500,600,700',
        '0,60,330,440,500,600,700',
        ''
    ]
)

output_df=pd.DataFrame()

for table in tables:
    output_df = output_df.append(table.df, ignore_index = True, sort = False)
print(output_df)

 

This functions as expected. As you can see, each PDF page has 4 table areas, and those table areas have different column schema.

 

I have tried to make this more flexible and user-friendly by driving this with interface tools as part of a macro:

 

#PDF IMPORTER WITH ALTERYX INPUTS

from ayx import Alteryx

import camelot
import pandas as pd

#Bring in Alteryx data connections
pdf_file = Alteryx.read("#1")['Value'].values[0];
password = Alteryx.read("#2")['Value'].values[0];
flavor = Alteryx.read("#3")['Value'].values[0];
pages = Alteryx.read("#4")['Value'].values[0];
row_tol = Alteryx.read("#5")['Value'].values[0];
table_areas = Alteryx.read("#6")['Value'].values[0];
columns = Alteryx.read("#7")['Value'].values[0]

tables = camelot.read_pdf(
    pdf_file,
    password=password,
    flavor=flavor,
    row_tol=row_tol,
    pages=pages,
    table_areas=table_areas,
    columns=columns
    ]
)

output_df=pd.DataFrame()

for table in tables:
    output_df = output_df.append(table.df, ignore_index = True, sort = False)
print(output_df)

 

Everything works apart from the table_areas and columns elements. I feel that this may be due to the treatment of Python lists. I have tried a simplified version with just 1 table area and its related columns (rather than 4 table areas), but that doesn't seem to work.

 

I have included the input tables below:

 

Input #1

RecordDescriptionParameterValue
1PDF file namepdf_file[PATH TO PDF].pdf

 

Input #2

RecordDescriptionParameterValue
1Passwordpassword[Null]

 

Input #3

RecordDescriptionParameterValue
1Flavorflavorstream

 

Input #4

RecordDescriptionParameterValue
1Page rangepagesall

 

Input #5

RecordDescriptionParameterValue
1Row tolerancerow_tol5

 

Input #6

I am struggling with this. I have tried both collapsing this into a string (using Summarize tools) and creating a list of lists, but neither seem to pass into Camelot correctly. Here is the raw data I am trying to feed in:

RecordParameterTable numberTable area elementValue
1table_areas1top-left x10
2table_areas1top-left y1600
3table_areas1bottom-right x2800
4table_areas1bottom-right y2550
5table_areas2top-left x10
6table_areas2top-left y1550
7table_areas2bottom-right x2800
8table_areas2bottom-right y2470
9table_areas3top-left x10
10table_areas3top-left y1470
11table_areas3bottom-right x2800
12table_areas3bottom-right y240
13table_areas4top-left x10
14table_areas4top-left y140
15table_areas4bottom-right x2800
16table_areas4bottom-right y20

 

Input #7

I am having the same issues as in Input #6 above. Here is the raw column data:

RecordParameterTable numberColumn numberValue
1

columns

110
2columns12200
3columns13600
4columns14730
5columns210
6columns2260
7columns23330
8columns24440
9columns25500
10columns26600
11columns27700
12columns310
13columns3260
14columns33330
15columns34440
16columns35500
17columns36600
18columns37700
19columns4[Null][Null]

 

Any ideas would be appreciated, as being able to drive the table areas more flexibly would make this much more powerful.

Labels