I have multiple PDF files with the same structure in a folder, and I want to extract the text from them. I have the following PDF code. How do I achieve this in Alteryx Designer?
import os
import pandas as pd
from PyPDF2 import PdfReader
from ayx import Alteryx

# Define the directory where your PDF files are located
pdf_dir = "path_to_your_pdf_directory"

# Collect one row per page so the result can be sent to Alteryx as a DataFrame
rows = []

# Loop through each file in the directory
for filename in os.listdir(pdf_dir):
    # If the file is a PDF (case-insensitive match on the extension)
    if filename.lower().endswith(".pdf"):
        # Define the full file path
        file_path = os.path.join(pdf_dir, filename)
        # Open the PDF file
        reader = PdfReader(file_path)
        # Initialize a list to store the text from each page
        page_list = []
        # Loop through each page in the PDF
        for page_number, page in enumerate(reader.pages, start=1):
            # Extract the text from the page (empty string if none is found)
            text = page.extract_text() or ""
            page_list.append(text)
            rows.append({"FileName": filename, "Page": page_number, "Text": text})
        # Define the output file name, next to the source PDF
        output_file = os.path.splitext(file_path)[0] + ".txt"
        # Write the extracted text to a text file
        with open(output_file, mode="w", encoding="utf-8") as f:
            for page_text in page_list:
                f.write(page_text)

# Write the extracted text to Alteryx output anchor 1
# (Alteryx.write expects a pandas DataFrame, not a plain list)
Alteryx.write(pd.DataFrame(rows, columns=["FileName", "Page", "Text"]), 1)
This script creates a text file for each PDF file and also writes the extracted text to an Alteryx output anchor. Run it inside a Python tool in your workflow, since the ayx package ships with that tool.
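One thing to watch: as far as I know, Alteryx.write expects a pandas DataFrame rather than a plain Python list, so the page text has to be shaped into rows before it goes to the output anchor. Here is a minimal sketch of that shaping step. The helper name pages_to_rows and the column names FileName, Page and Text are my own assumptions, not anything Alteryx requires.

```python
from typing import Dict, List, Optional, Union

Row = Dict[str, Union[str, int]]


def pages_to_rows(filename: str, page_texts: List[Optional[str]]) -> List[Row]:
    """Turn the per-page text of one PDF into one row per page.

    Hypothetical helper: the column names are illustrative only.
    """
    rows: List[Row] = []
    for page_number, text in enumerate(page_texts, start=1):
        rows.append({
            "FileName": filename,
            "Page": page_number,
            # Pages with no extractable text become an empty string
            "Text": text or "",
        })
    return rows


# Example input: a two-page PDF where the second page yielded no text
rows = pages_to_rows("invoice.pdf", ["first page text", None])

# Inside the Python tool you would then do something like:
#   import pandas as pd
#   Alteryx.write(pd.DataFrame(rows), 1)
```

Keeping the file name and page number on each row makes it easier to join or filter the text downstream in the workflow.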
Thanks a ton, @nickmartella. You are a lifesaver. I appreciate your response.
Awesome
@nickmartella If the PDFs do not have the same structure but are in the same directory, will this code still work?
Yes. At a base level, this script grabs all the text present in the PDF files, regardless of their layout. You can build it out further in either Python or Alteryx to filter for different types of text.
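As a rough illustration of the Python route, here is a small sketch that sorts extracted lines into buckets with regular expressions. The patterns (ISO-style dates and dollar amounts) and the function name classify_lines are assumptions for the example only. Swap in whatever matches your own PDFs.

```python
import re
from typing import Dict, List

# Hypothetical patterns; adjust them to the text your PDFs actually contain
PATTERNS = {
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "amount": re.compile(r"\$\s?\d[\d,]*(?:\.\d{2})?"),
}


def classify_lines(text: str) -> Dict[str, List[str]]:
    """Group non-empty lines by the first kinds of content they match."""
    buckets: Dict[str, List[str]] = {name: [] for name in PATTERNS}
    buckets["other"] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        matched = False
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                buckets[name].append(line)
                matched = True
        if not matched:
            buckets["other"].append(line)
    return buckets


sample = (
    "Invoice date: 2023-05-01\n"
    "Total due: $1,250.00\n"
    "Thank you for your business"
)
result = classify_lines(sample)
```

The same kind of filtering can be done after the Python tool with the RegEx or Filter tools if you would rather keep the logic in Alteryx.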