Alteryx Designer Desktop Discussions

SOLVED

Extract specific pages from a PDF using rules (specific text on page)

pankajk
10 - Fireball

Hi friends,

I have run into an issue and hope someone can help.

Use case: Identify the pages of a PDF on which certain text appears. For example, if I have a 15-page PDF and the text 'Client Account' appears on 'x' of its pages, I need to extract those pages and create a new PDF from them. The extracted pages could be different for each PDF.

 

What I could accomplish: I use 'pdftools' (an R library) to parse the PDF and find the pages where those words exist, without any issue.

 

Problem:  Next step of 'extracting' those pages and creating a new file

Kind of a resolution: PyPDF2 offers this, and there is code I can use:

https://learndataanalysis.org/how-to-extract-pdf-pages-and-save-as-a-separate-pdf-file-using-python/

This code works perfectly if I have 'static' values for the input file, output file and page numbers.

The problem is that when I try to make this a macro, I can't figure out how to replace the 'static' values with the filename and page numbers I pass into the macro as variables.

 

I have attached the macro I am trying to build. I am new to Python as well, and this is my first time ever using Python code.

 

Any help is appreciated. If I am not clear, please do ask questions.

 

8 REPLIES
joshuaburkhow
ACE Emeritus

Looks like you are just not selecting the right pieces in the Action Tool. You need the data values like this: 

 

joshuaburkhow_0-1598102527701.png

 

Joshua Burkhow - Alteryx Ace | Global Alteryx Architect @PwC | Blogger @ AlterTricks
pankajk
10 - Fireball

Thanks @joshuaburkhow - this will help resolve the passing of parameters. The next challenge is that the Python script is not recognizing the variables and returns an error. If you put values in the Text Input and run the workflow, it returns an error. How can I ensure the Python script reads the variables I defined?

 

pankajk_0-1598118368330.png

 

ImadZidan
12 - Quasar

Hello @pankajk ,

 

Sorry if I am late to this discussion. I have looked at the macro, and the field name is Field1 with an uppercase F rather than field1.

 

Hope this helps

pankajk
10 - Fireball

Thanks @ImadZidan for picking up these errors, which I have fixed. But somehow it's still not picking up the filename, and it is now giving me a type error.

I even tried putting "r'" in front of the path, since that worked with the absolute filename, but there seems to be something amiss here.

Appreciate all the support.

pankajk_0-1598139555492.png

 

ImadZidan
12 - Quasar

Hello @pankajk ,

 

Could you show me the values you have in the three fields?

 

It will help.

 

It looks to me that the code is executing. However, the PDF reader is choking when reading the PDF.

 

pankajk
10 - Fireball

Thanks - I have attached my updated workflow again, with lots of commented-out lines (I was trying different things).

The 3 input variables are:

Field1 =  Original PDF File name

Field2 =  New PDF filename to be created

Field3  =  Pages from original PDF file to be extracted

 

FYI: this code works when I use static values for these (as per my original post, which includes the link to the Python code), so I don't think the PDF reader is choking.

I tried printing the type, and it looks like the filename variable is not getting the 'full value' including the path: it is an 'object', whereas with a 'static' value it is a string. But again, this is based on my limited knowledge of Python.

 

ImadZidan
12 - Quasar

Hello @pankajk ,

 

Two things to change:

 

1- In the Text Input, change the file path to use double backslashes, for example X:\\Pankaj\\Project\\Sample_PDF\\Sample 1.PDF

2- Change the code:

 

From

This gives you type object

 

filename = data["Field1"]
newfilepath = data["Field2"]
pagesextract = data["Field3"]

 

To

This gives you type string. The object type is why you were having difficulty getting to the file.

filename = data["Field1"][0]
newfilepath = data["Field2"][0]
pagesextract = data["Field3"][0]
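
The difference between the two versions can be reproduced outside Alteryx. The sketch below assumes the Python tool hands its input over as a pandas DataFrame, which would explain the 'object' type reported earlier in the thread; the DataFrame here is a hand-built stand-in, not the real macro input.

```python
import pandas as pd

# Stand-in for the table read inside the Python tool (assumption: a
# one-row pandas DataFrame with the three fields named in the thread).
data = pd.DataFrame({
    "Field1": [r"X:\Pankaj\Project\Sample_PDF\Sample 1.PDF"],
    "Field2": [r"X:\Pankaj\Project\Sample_PDF\Extract.PDF"],  # hypothetical output name
    "Field3": ["[0,12,25]"],
})

column = data["Field1"]    # the whole column: a pandas Series
value = data["Field1"][0]  # the first cell of that column: a plain str

print(type(column).__name__)  # Series
print(type(value).__name__)   # str
```

Because `[0]` looks the row up by index label, `data["Field1"].iloc[0]` is a position-based alternative if the incoming rows are not labelled from 0.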

 

The rest of the logic seems OK. Let's see.

 

pankajk
10 - Fireball

Thanks @ImadZidan  - You are awesome and thanks for your patience and all the support. I was able to make it work based on your feedback 🙂

I had to make the following additional changes so that the page list was read as a list rather than as an object/text:

 

In my Text Input, change 0,12,25 to [0,12,25], i.e., add opening and closing brackets.

And add the following to my code to convert the text/string to a list:

 

import ast

# Converting pages to be extracted from string to list
pagelist = ast.literal_eval(pagesextract)
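
For anyone reusing this step, here is a slightly more defensive sketch of the same conversion. `ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for text that comes from a user-facing input. The helper name `parse_page_list` is hypothetical and is not part of the macro in this thread.

```python
import ast


def parse_page_list(text):
    """Turn text such as "[0,12,25]" into a list of integer page indexes."""
    parsed = ast.literal_eval(text.strip())
    # A lone value such as "7" evaluates to a single object, so wrap it.
    if not isinstance(parsed, (list, tuple)):
        parsed = [parsed]
    pages = list(parsed)
    # Reject anything that is not a plain integer (bool is excluded on purpose).
    if not all(isinstance(p, int) and not isinstance(p, bool) for p in pages):
        raise ValueError(f"expected only integer page numbers, got {text!r}")
    return pages


print(parse_page_list("[0,12,25]"))  # [0, 12, 25]
print(parse_page_list("0,12,25"))    # [0, 12, 25]
```

The second call shows that the unbracketed form evaluates to a tuple, which the helper converts to a list as well.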

 

I have converted this to an app and it works nicely.
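
For later readers, the extraction step itself can be sketched as one function that takes the three values as parameters instead of static text. This is a minimal sketch, not the code from the linked article or from the attached macro. It assumes either `pypdf` or PyPDF2 version 2.0 or later is installed (older PyPDF2 releases use different class and method names), and `extract_pages` is a hypothetical name.

```python
def extract_pages(src_path, dst_path, page_indexes):
    """Copy the given zero-based pages of src_path into a new PDF at dst_path."""
    # Imported here so the function can be defined even before the library
    # is installed. Assumption: pypdf, or PyPDF2 >= 2.0 as a fallback.
    try:
        from pypdf import PdfReader, PdfWriter
    except ImportError:
        from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(src_path)
    writer = PdfWriter()
    for index in page_indexes:
        writer.add_page(reader.pages[index])  # raises IndexError if out of range
    with open(dst_path, "wb") as out_file:
        writer.write(out_file)


# With the variable names used in this thread, the call would look like:
# extract_pages(filename, newfilepath, pagelist)
```

Page indexes are zero-based here, so page 1 of the PDF is index 0.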

 

I will accept your solution and give it my like!  Thanks again so very much. Greatly appreciated.
