Hi all,
I have been looking for a way to open a password-protected PDF for a specific use case, and I built a Python solution wrapped in Alteryx to do it.
It did what I needed, although the output of the tool still needs to be parsed and REGEX-ed to get specific fields.
Sample of what it can do for my use case - again, it worked only for my use case.
I was hoping to crowdsource this further and make it even better and more dynamic. I am sharing the workflow here to see if anyone would be keen to work together on this and make it free for the community to use.
I am also aware that some teams may frown upon Python usage - not because it's Python, but because it needs CI/CD in place, asset management, package updates, and testing. I may be missing some factors here, but it's primarily a control issue rather than a tool issue. If anyone would be keen to comment on that as well, I'd love your feedback.
Anyway, I have also made some pretty assets for it and I am sure this may be useful to you all.
Best,
Calvin
Special thanks to Bulien for creating the unlock Excel macro that is available here: https://community.alteryx.com/t5/Community-Gallery/Password-Protected-Excel-Input-Tool/ta-p/937928 (cc @mceleavey @TheOC ) that inspired me to do this. I wish to make this better and keep it free so that the community benefits as a whole.
If anyone wants to see the code:
Packages:
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])
from ayx import Alteryx
!pip install PyPDF2
!pip install pycryptodome
Code:
import PyPDF2
import pandas as pd
# Read the full path to the PDF from Alteryx input #1
pdf_path_df = Alteryx.read('#1')
pdf_path = pdf_path_df.iloc[0, 0] # Assuming the path is in the first row and first column
# Read the password for the PDF from Alteryx input #2
pdf_password_df = Alteryx.read('#2')
pdf_password = pdf_password_df.iloc[0, 0] # Assuming the password is in the first row and first column
# Open the PDF
with open(pdf_path, "rb") as file:
    reader = PyPDF2.PdfReader(file)
    # Check if the PDF is encrypted
    if reader.is_encrypted:
        # Attempt to decrypt the PDF
        success = reader.decrypt(pdf_password)
        if not success:
            raise ValueError("Failed to decrypt PDF with the provided password.")
    # Extract text from all pages while the file is still open
    # (fall back to an empty string if a page yields no text)
    extracted_texts = [page.extract_text() or "" for page in reader.pages]
# Combine all extracted texts into a single string
combined_text = "\n".join(extracted_texts)
# Create a pandas DataFrame
df = pd.DataFrame({"ExtractedText": [combined_text]})
# Write the DataFrame to Alteryx output
Alteryx.write(df, 1)
Wrote it with some help from AI and also online searching + StackOverflowing. Just decided to dataframe it at the end so I can parse it my style since my PDF was more straightforward than the example attached or compared to other PDFs.
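For anyone wondering what the "parse it my style" step could look like, here is a minimal sketch of pulling fields out of the ExtractedText column with regular expressions. The field names and patterns below are made up for illustration and are not the ones from my actual PDF; you would need to adapt them to your own layout.

```python
import re

# Hypothetical field patterns: these labels are assumptions, not taken
# from the original PDF. Each pattern captures the value in group 1.
FIELD_PATTERNS = {
    "InvoiceNumber": re.compile(
        r"Invoice\s*(?:No\.?|Number)\s*[:#]?\s*(\S+)", re.IGNORECASE
    ),
    "Total": re.compile(r"Total\s*:?\s*\$?([\d,]+\.\d{2})", re.IGNORECASE),
}


def parse_fields(text, patterns=FIELD_PATTERNS):
    """Return one dict of field name -> first matched value (or None)."""
    result = {}
    for name, pattern in patterns.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


sample = "Invoice No: A-1042\nDate: 2024-01-05\nTotal: $1,250.00"
print(parse_fields(sample))
```

The same thing can of course be done with the RegEx tool downstream of the Python tool; doing it in Python only makes sense if you want everything in one place.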
@apathetichell keen to get your feedback as well, your knowledge in making it more universal / dynamic is much appreciated here.
@caltang Nice article! PDF files are a major issue in our organization and keep us from becoming data-analytics minded. I'm happy to discuss how to tackle PDF data in this community.
Just one small piece of feedback on batch operation when using the Python tool...
If we use a Batch Macro, the Python kernel starts and stops for every batch, which makes the run much slower.
Instead, we can write the iteration in Python (for file in files: ...) so that the Python kernel launches only once and the overall run time is much shorter. (I know writing iteration code in Python takes additional effort, even though Alteryx is a low-code platform.)
As an additional reference, two months ago I wrote a post in the Japanese community about parsing PDF data with the pdfminer.six library.
Since pdfminer.six is under the MIT license, there is less to worry about when sharing the product than with libraries under BSD/GPL/AGPL licenses.
*Apologies for sharing Japanese content, but it can be Google-translated for English readers😅 At least the YXZP attached to that blog post should be helpful.
This is awesome - is pycryptodome used just as a dependency for PyPDF2? I usually use Cryptography - but I think it's probably overkill here. I agree with @gawa that changing this to take in a dataframe of files/passwords and then cycling through it would be faster.
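A rough sketch of that idea, looping over path/password pairs inside a single Python tool run instead of a Batch Macro. The function names (`extract_pdf_text`, `extract_many`) and the output column names are hypothetical, and the two-column input layout is an assumption; the PyPDF2 calls are the same ones used in the code posted above.

```python
def extract_pdf_text(path, password):
    """Decrypt one PDF if needed and return its text as a single string."""
    import PyPDF2  # imported lazily so the loop below can run without it

    with open(path, "rb") as handle:
        reader = PyPDF2.PdfReader(handle)
        if reader.is_encrypted and not reader.decrypt(password):
            raise ValueError(f"Failed to decrypt {path} with the provided password.")
        return "\n".join(page.extract_text() or "" for page in reader.pages)


def extract_many(jobs, extractor=extract_pdf_text):
    """Process (path, password) pairs in one pass; one output row per file.

    A failure on one file is recorded in the Error column instead of
    stopping the whole run.
    """
    rows = []
    for path, password in jobs:
        try:
            text = extractor(path, password)
            rows.append({"FilePath": path, "ExtractedText": text, "Error": ""})
        except Exception as exc:
            rows.append({"FilePath": path, "ExtractedText": "", "Error": str(exc)})
    return rows


# Inside the Alteryx Python tool this might be wired up roughly as follows
# (assumes input #1 has exactly two columns: path, then password):
#   jobs = Alteryx.read('#1').itertuples(index=False, name=None)
#   Alteryx.write(pd.DataFrame(extract_many(jobs)), 1)
```

Keeping the per-file error in its own column is just one design choice; raising on the first failure, as the original code does, may be preferable if a wrong password should stop the workflow.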
This is really cool @gawa and I agree 100%. I’ve tried batching and iterating a Python tool before and was sad to see its performance dip so badly.
Thanks for sharing! I’ll iterate from your points.
Also, the Japanese forums seem to have more cool tools available. Cross sharing across different language forums would be highly beneficial.
Yes! I was also looking at pdfplumber but I’m not sure how to cook up the interface in Alteryx itself so that users can define the field lengths.
Source:
https://stackoverflow.com/questions/26130032/open-a-protected-pdf-file-in-python
I'm also thinking about ingesting a TXT or CSV file with the fixed field lengths as a third input so the parsing is more user-defined and controlled.
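To make the third-input idea concrete, here is a minimal sketch that reads a field spec from CSV text and slices each extracted line by those widths. The `name,width` column layout and the function names are assumptions for illustration only; in the workflow the spec would arrive through a third input rather than a string.

```python
import csv
import io


def load_field_spec(csv_text):
    """Parse a spec with header 'name,width' into [(name, width), ...]."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["name"], int(row["width"])) for row in reader]


def split_fixed_width(line, spec):
    """Cut one line of text into named fields using consecutive widths."""
    record = {}
    start = 0
    for name, width in spec:
        record[name] = line[start:start + width].strip()
        start += width
    return record


# Hypothetical spec and sample line, not taken from a real PDF.
spec = load_field_spec("name,width\ninvoice,8\ndate,10\namount,9\n")
print(split_fixed_width("INV001  2024-01-05   125.50", spec))
```

This only works when the extracted text keeps its column alignment, which depends on the PDF and on the extraction library, so it would need checking against real output first.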
If you're unable to copy, print, or edit content in a PDF due to security restrictions, you're likely dealing with a protected or locked file. These restrictions can be frustrating, especially when you urgently need to work with the document. While manual methods exist, they often don’t work on complex or encrypted files.
A faster and more reliable solution is using a specialized tool like SysTools PDF Unlocker. This software is designed to remove both user-level (open password) and owner-level (permission) restrictions from PDF files. With just a few simple steps, you can unlock your PDF and regain full access to its features without compromising formatting or content quality. It’s particularly helpful when dealing with multiple files or large documents that need to be edited or printed quickly.
I've been working on unlocking several protected PDF files lately, and it's turning out to be a bit more complicated than I expected. Some of the files require a password just to open them (open password), while others allow viewing but block actions like copying, printing, or editing (owner-level restrictions). For PDFs where I know the password, I tried using standard PDF editors to remove the security, but doing it manually for multiple files is time-consuming and inefficient. To simplify the process, I started using PDF Password Remover, and so far, it’s been quite helpful. The tool allows me to remove both types of protection and works offline, which is important when dealing with confidential documents. It also supports batch processing, so I can unlock multiple files at once without losing formatting. This is still a work in progress, and I'm experimenting with different file types and password scenarios, but I wanted to share my experience in case anyone else is facing similar challenges. If you have found other efficient ways to unlock PDFs, feel free to share!