
Alteryx Designer Desktop Discussions
SOLVED

How to read PDF data using the Python tool

ayush_mishra
8 - Asteroid

I have multiple PDF files with the same structure in a folder, and I want to extract the text from them. I have the following Python code, which reads the first page of a single PDF. How do I achieve this in Alteryx Designer?

from PyPDF2 import PdfReader

reader = PdfReader("inputfile.pdf")

# Extract the text of the first page only
page = reader.pages[0]
page_text = page.extract_text()

with open("outputfile.txt", mode="w") as f:
    f.write(page_text)
5 REPLIES
nickmartella
7 - Meteor
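import os
import pandas as pd
from PyPDF2 import PdfReader
from ayx import Alteryx

# Define the directory where your PDF files are located
pdf_dir = "path_to_your_pdf_directory"

# Collect one row per page so the results can be sent to Alteryx as a DataFrame
rows = []

# Loop through each file in the directory
for filename in sorted(os.listdir(pdf_dir)):
    # Skip anything that is not a PDF (case-insensitive check)
    if not filename.lower().endswith(".pdf"):
        continue

    # Define the full file path and open the PDF file
    file_path = os.path.join(pdf_dir, filename)
    reader = PdfReader(file_path)

    # Extract the text from each page; fall back to "" if nothing is returned
    page_list = [page.extract_text() or "" for page in reader.pages]

    # Write the extracted text to a .txt file next to the source PDF,
    # with a newline between pages
    output_file = os.path.splitext(file_path)[0] + ".txt"
    with open(output_file, mode="w", encoding="utf-8") as f:
        f.write("\n".join(page_list))

    # Record one row per page for the Alteryx output
    for page_number, page_text in enumerate(page_list, start=1):
        rows.append({"FileName": filename, "Page": page_number, "Text": page_text})

# Alteryx.write expects a pandas DataFrame, so build it once after the loop
# and send every file's text to output anchor 1 together
Alteryx.write(pd.DataFrame(rows, columns=["FileName", "Page", "Text"]), 1)

This script creates a text file next to each PDF in the directory and also writes the extracted text to Alteryx output anchor 1, one row per page. Alteryx.write takes a pandas DataFrame rather than a plain list, so the rows are collected first and written in a single call after the loop.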
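
If it helps, the directory scan and output naming can also be handled with pathlib. This is only a minimal sketch: find_pdfs and text_path_for are helper names made up for illustration, not part of PyPDF2 or the Alteryx API. It matches the extension case-insensitively (so "B.PDF" is picked up too) and only swaps the final suffix when building the .txt name.

```python
from pathlib import Path


def find_pdfs(pdf_dir):
    """Return the PDF files directly inside pdf_dir (hypothetical helper).

    The extension check is case-insensitive, and directories are skipped
    even if their name happens to end in ".pdf".
    """
    return sorted(
        p for p in Path(pdf_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


def text_path_for(pdf_path):
    """Return the .txt path that sits next to pdf_path (hypothetical helper).

    Only the last suffix is replaced, so "q1.final.pdf" becomes "q1.final.txt".
    """
    return Path(pdf_path).with_suffix(".txt")
```

Each path returned by find_pdfs can be passed straight to PdfReader, and text_path_for gives the matching output file in the same folder.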

ayush_mishra
8 - Asteroid

Thanks a ton, @nickmartella. You are a lifesaver; I appreciate your response.

Mark_Hung
5 - Atom

Awesome

Anasalter
7 - Meteor

@nickmartella If the PDFs do not share the same structure but are in the same directory, will this code still work?

nickmartella
7 - Meteor

Yes. At a base level, this script grabs all the text present in the PDF files regardless of their layout. You can build on it in either Python or Alteryx to filter for the specific kinds of text you need.
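
As a rough idea of what that filtering could look like on the Python side, here is a small sketch that keeps only the lines matching a regular expression. The filter_lines helper, the sample text, and the pattern are all invented for illustration; the real labels would depend on what your PDFs contain.

```python
import re


def filter_lines(text, pattern):
    """Return the lines of text that contain a match for pattern.

    Hypothetical helper: text would be the string returned by
    page.extract_text(), and pattern is any regular expression.
    """
    regex = re.compile(pattern)
    return [line for line in text.splitlines() if regex.search(line)]


# Made-up sample standing in for extracted page text
sample = "Invoice No: 1234\nCustomer: ACME\nTotal: 99.50\n"

# Keep only the lines that start with one of the labels of interest
print(filter_lines(sample, r"^(Invoice No|Total):"))
# prints ['Invoice No: 1234', 'Total: 99.50']
```

The same kind of filter could instead be applied downstream in Alteryx, for example with the RegEx or Filter tools, if you would rather keep the Python tool limited to extraction.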
