Hi all,
I had been looking for a way to open a password-protected PDF for a specific use case, so I built a Python solution wrapped in an Alteryx workflow.
It did what I needed, although the tool's output still has to be parsed and REGEX-ed to pull out specific fields.
I was hoping to crowdsource this further and make it even better and more dynamic. I am sharing the workflow here to see if anyone would be keen to work together on this and make it free for the community to use.
I am also aware that some teams may frown upon Python usage - not because it's Python, but because it needs CI/CD in place, asset management, package updates and testing. I may be missing some parameters here, but it's primarily a control issue rather than a tool issue. If anyone would be keen to comment on that as well, I'd love your feedback.
Anyway, I have also made some pretty assets for it, and I hope it will be useful to you all.
Best,
Calvin
Special thanks to Bulien for creating the unlock Excel macro that is available here: https://community.alteryx.com/t5/Community-Gallery/Password-Protected-Excel-Input-Tool/ta-p/937928 (cc @mceleavey @TheOC ) that inspired me to do this. I wish to make this better and keep it free so that the community benefits as a whole.
If anyone wants to see the code:
Packages:
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])
from ayx import Alteryx
!pip install PyPDF2
!pip install pycryptodome
Code:
import PyPDF2
import pandas as pd
# Read the full path to the PDF from Alteryx input #1
pdf_path_df = Alteryx.read('#1')
pdf_path = pdf_path_df.iloc[0, 0] # Assuming the path is in the first row and first column
# Read the password for the PDF from Alteryx input #2
pdf_password_df = Alteryx.read('#2')
pdf_password = pdf_password_df.iloc[0, 0] # Assuming the password is in the first row and first column
# Open the PDF
with open(pdf_path, "rb") as file:
    reader = PyPDF2.PdfReader(file)
    # Check if the PDF is encrypted
    if reader.is_encrypted:
        # Attempt to decrypt the PDF (decrypt() returns a falsy value on failure)
        success = reader.decrypt(pdf_password)
        if not success:
            raise ValueError("Failed to decrypt PDF with the provided password.")
    # Extract text from all pages while the file is still open;
    # fall back to "" in case a page yields no text
    extracted_texts = [page.extract_text() or "" for page in reader.pages]
# Combine all extracted texts into a single string
combined_text = "\n".join(extracted_texts)
# Create a pandas DataFrame
df = pd.DataFrame({"ExtractedText": [combined_text]})
# Write the DataFrame to Alteryx output
Alteryx.write(df, 1)
I wrote it with some help from AI, plus online searching and StackOverflowing. I decided to put the result in a dataframe at the end so I can parse it my own way, since my PDF was more straightforward than the attached example and than many other PDFs.
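For anyone wondering what the downstream parsing might look like, here is a minimal sketch in Python. It assumes the extracted text contains simple "Label: value" lines, which will not hold for every PDF; the function name `parse_labelled_fields` and the sample labels are hypothetical and only for illustration.

```python
import re

# Matches one "Label: value" pair per line. This layout is an assumption;
# real PDFs often need a pattern tailored to their structure.
FIELD_PATTERN = re.compile(
    r"^[ \t]*([^:\n]+?)[ \t]*:[ \t]*(.+?)[ \t]*$", re.MULTILINE
)


def parse_labelled_fields(text):
    """Return a dict of label -> value for every 'Label: value' line."""
    return {label: value for label, value in FIELD_PATTERN.findall(text)}


# Made-up sample text standing in for the ExtractedText column
sample = "Name: Calvin\nInvoice No:  123 \nthis line has no label"
fields = parse_labelled_fields(sample)
```

The same pattern could also be dropped into a RegEx tool in Alteryx instead of the Python tool, if keeping Python to a minimum is a concern.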
@apathetichell keen to get your feedback as well, your knowledge in making it more universal / dynamic is much appreciated here.
@caltang Nice article! The prevalence of PDF files is a major issue in our organization and keeps us from becoming data-analytics minded. I'm happy to discuss how to tackle PDF data in this community.
Just one small piece of feedback on batch operation with the Python tool...
If we use a Batch Macro, the Python kernel starts and stops for every batch, which makes the run much slower.
Instead, we can write the iteration in Python (for file in files: ...), so the Python kernel launches only once and the overall run time is much shorter. (I know writing the iteration in Python takes additional effort, and Alteryx is a low-code platform, though.)
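A minimal sketch of that single-kernel loop, in Python. The names `extract_batch` and `extract_one` are hypothetical: `extract_one` stands for whatever function opens and decrypts one PDF (for example, the PyPDF2 code from the original post wrapped in a function), and the two-column input layout is an assumption.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def extract_batch(
    jobs: Iterable[Tuple[str, str]],
    extract_one: Callable[[str, str], str],
) -> List[Dict[str, str]]:
    """Run extract_one over every (path, password) pair in one Python session."""
    results = []
    for path, password in jobs:
        try:
            text = extract_one(path, password)
            results.append({"FilePath": path, "ExtractedText": text, "Error": ""})
        except Exception as exc:
            # Record the failure and keep going so one bad file
            # does not stop the whole run
            results.append({"FilePath": path, "ExtractedText": "", "Error": str(exc)})
    return results


# Inside the Alteryx Python tool this might be wired up roughly as follows
# (assumes input #1 has exactly two columns: file path, then password):
#   jobs = Alteryx.read('#1').itertuples(index=False, name=None)
#   Alteryx.write(pd.DataFrame(extract_batch(jobs, extract_one)), 1)
```

Collecting errors in their own column, rather than raising, is a design choice: it lets the workflow filter failed files downstream instead of rerunning everything.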
As an additional reference, two months ago I wrote a post in the Japanese community about parsing PDF data with the pdfminer.six library.
Since pdfminer.six is MIT-licensed, sharing a product built on it raises fewer licensing worries than libraries under copyleft licenses such as GPL/AGPL.
*Apologies for sharing Japanese content, but it can be Google-translated for English readers😅 At the very least, the YXZP attached to that blog should be helpful.
This is awesome - is pycryptodome used just as a dependency of PyPDF2? I usually use cryptography, but I think it's probably overkill here. I agree with @gawa that changing this to take in a dataframe of files/passwords and then cycling through it would be faster.
This is really cool @gawa and I agree 100%. I've tried batching and iterating a Python tool before and was sad to see its performance dip so badly.
Thanks for sharing! I’ll iterate from your points.
Also, the Japanese forums seem to have more cool tools available. Cross sharing across different language forums would be highly beneficial.
Yes! I was also looking at pdfplumber, but I'm not sure how to build the interface in Alteryx itself so that users can define the field lengths.
Source:
https://stackoverflow.com/questions/26130032/open-a-protected-pdf-file-in-python
I'm also thinking about ingesting a TXT or CSV file with the fixed field lengths as a third input, so the parsing is more user-defined and controlled.
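A rough sketch of how that third input could drive the parsing, in Python. Everything here is an assumption for illustration: the spec is a CSV with hypothetical `field` and `length` columns, and each extracted line is treated as one fixed-width record.

```python
import csv
import io


def load_field_spec(spec_csv_text):
    """Read (field name, length) pairs from CSV text with 'field,length' headers."""
    reader = csv.DictReader(io.StringIO(spec_csv_text))
    return [(row["field"], int(row["length"])) for row in reader]


def parse_fixed_width(line, spec):
    """Slice one line into named fields using the widths in spec."""
    record = {}
    pos = 0
    for name, length in spec:
        # Take the next `length` characters and trim the padding
        record[name] = line[pos:pos + length].strip()
        pos += length
    return record


# Made-up spec and record, standing in for the user-supplied third input
spec_text = "field,length\nid,4\nname,10\namount,8\n"
spec = load_field_spec(spec_text)
line = "0001" + "Calvin".ljust(10) + "12.50".rjust(8)
record = parse_fixed_width(line, spec)
```

Text extracted from PDFs does not always preserve column alignment, so a fixed-width spec like this would only suit documents whose layout survives extraction.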