Opening and Reading a PDF with PyPDF2 (Python)

Question

Does anyone have experience reading PDFs via Python in Alteryx using the PyPDF2 package, and can you see what is wrong here?

I've managed to import the package but every time I try to run the workflow it fails with the following error message:

F

This is the script to open and read the file, and the file definitely exists in this location. I've checked and double-checked :)

import PyPDF2
pdf1File = open("P:\Content Manager\PDF\HABDHGOKMLD.PDF", "rb")
reader = PyPDF2.PdfReader(pdf1File)
number_of_pages = len(reader.pages)
page = reader.pages[0]
text = page.extract_text()

I have noticed that '\' in the path is changed to '\\' in the error, but even specifying this in the open statement returns the same error. This is probably something really obvious that I just cannot see.

Thanks,

ChrisDoar · Answer

Hi @grossal,

Thank you for taking the time to get back to me. Unfortunately, neither of those solutions worked, see below. But I reckon the issue may be the network path for the file. My P drive is on a network server, whereas if I move the file I'm trying to read to the C drive on the actual computer, it finds it no problem. I suspect that if I use the full file path and not just the mapped P drive, it would work. Something to investigate.


Thanks, 
Chris
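The network-path suspicion above can be checked before PyPDF2 is involved at all. The following is only a minimal sketch: `first_reachable` is a hypothetical helper, and `\\fileserver\share` is a placeholder for whatever server and share the P drive actually maps to. One possible cause, not confirmed for this setup, is that mapped drive letters belong to a user's logon session, so a process running in a different session may not see P: at all.

```python
from pathlib import Path


def first_reachable(candidates):
    """Return the first candidate that exists as a file, or None."""
    for candidate in candidates:
        path = Path(candidate)
        if path.is_file():
            return path
    return None


candidates = [
    # Mapped drive path, exactly as in the question.
    r"P:\Content Manager\PDF\HABDHGOKMLD.PDF",
    # Hypothetical UNC path; replace fileserver and share with the real names.
    r"\\fileserver\share\Content Manager\PDF\HABDHGOKMLD.PDF",
]

pdf_path = first_reachable(candidates)
if pdf_path is None:
    print("None of the candidate paths is visible to this Python process")
else:
    print(f"Readable path: {pdf_path}")
```

If only the UNC form is found, passing that path to open() is the natural next thing to try.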

grossal · Answer

Hi @ChrisDoar,

There are usually two ways to handle this.

Option 1: Convert \ to /

It's completely up to you where you do this: you can do it in Python, or you can do it in Alteryx. I tend to use a simple Formula-Tool to do the trick.
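If you would rather do the conversion in Python, a minimal sketch could look like this. It uses the path from the question; `to_forward_slashes` is a hypothetical helper name, not something provided by PyPDF2 or Alteryx.

```python
from pathlib import PureWindowsPath


def to_forward_slashes(windows_path: str) -> str:
    """Return the same Windows path written with forward slashes."""
    # PureWindowsPath only manipulates the string; it never touches the disk,
    # so this works even when the drive or file is not reachable.
    return PureWindowsPath(windows_path).as_posix()


# Raw string so the backslashes in the literal are kept as typed.
converted = to_forward_slashes(r"P:\Content Manager\PDF\HABDHGOKMLD.PDF")
print(converted)
```

Windows accepts forward slashes in file paths, so the converted string can be passed straight to open().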

Option 2: Raw Strings

pdf1File = open(r"P:\Content Manager\PDF\HABDHGOKMLD.PDF")

The key here is open(r"Your Path"). The r prefix makes the path a "raw" string, so backslashes are taken literally and you don't need to escape anything.
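To see what the r prefix changes, here is a small sketch with a made-up path (`C:\temp\new.pdf` is only an illustration, chosen because it contains \t and \n). It also shows why an error message displays doubled backslashes: Python prints the repr of the string, where every single backslash is written as '\\'.

```python
plain = "C:\temp\new.pdf"      # \t is read as a tab, \n as a newline
raw = r"C:\temp\new.pdf"       # backslashes kept exactly as typed
escaped = "C:\\temp\\new.pdf"  # same path, each backslash escaped by hand

print(raw == escaped)  # True
print(raw == plain)    # False
print(repr(raw))       # 'C:\\temp\\new.pdf'
```

In the path from the question, \C, \P and \H are not recognised escape sequences, so Python leaves those backslashes in place (recent versions emit a warning). That may be why switching to a raw string makes no difference for that particular path.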

Let me know if it works or if we need to dive deeper into it.

Best

Alex