Alteryx Designer Desktop Discussions

SOLVED

How to read PDF data using the Python tool

ayush_mishra
8 - Asteroid

I have multiple PDF files with the same structure in a folder, and I want to extract the text from them. I have the following Python code, which handles a single file. How can I achieve this for the whole folder in Alteryx Designer?

from PyPDF2 import PdfReader

# Open a single PDF and extract the text of its first page
reader = PdfReader("inputfile.pdf")
page = reader.pages[0]
page_text = page.extract_text()

# Save the extracted text to a text file
with open("outputfile.txt", mode="w", encoding="utf-8") as f:
    f.write(page_text)

2 REPLIES

nickmartella
7 - Meteor
# PyPDF2 may need to be installed in Designer first,
# e.g. with Alteryx.installPackages("PyPDF2")
import os

import pandas as pd
from PyPDF2 import PdfReader
from ayx import Alteryx

# Define the directory where your PDF files are located
pdf_dir = "path_to_your_pdf_directory"

# Collect one row per PDF for the Alteryx output
rows = []

# Loop through each file in the directory
for filename in sorted(os.listdir(pdf_dir)):
    # Skip anything that is not a PDF (case-insensitive check)
    if not filename.lower().endswith(".pdf"):
        continue

    # Open the PDF file
    reader = PdfReader(os.path.join(pdf_dir, filename))

    # Extract the text from every page; a page with no text becomes ""
    page_list = [page.extract_text() or "" for page in reader.pages]
    text = "\n".join(page_list)

    # Write the extracted text to a .txt file next to the PDF
    output_file = os.path.join(pdf_dir, os.path.splitext(filename)[0] + ".txt")
    with open(output_file, mode="w", encoding="utf-8") as f:
        f.write(text)

    rows.append({"FileName": filename, "Text": text})

# Alteryx.write expects a pandas DataFrame, so write once after the loop
Alteryx.write(pd.DataFrame(rows, columns=["FileName", "Text"]), 1)

This script creates a text file next to each PDF in the folder and writes the extracted text to Alteryx output anchor 1, one row per file.
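
If you want to check the folder handling outside Designer before wiring it into a workflow, here is a minimal sketch that uses only the standard library. The function name `extract_folder` and its `extract` callback are hypothetical names made up for illustration; they are not part of PyPDF2 or the `ayx` package. The callback stands in for the actual PDF text extraction, so the file filtering and output naming can be tried without any PDFs.

```python
from pathlib import Path
from typing import Callable, Dict, List


def extract_folder(pdf_dir: str, extract: Callable[[Path], str]) -> List[Dict[str, str]]:
    """Run `extract` on every PDF in `pdf_dir` and save a .txt file next to each one.

    `extract` is a hypothetical callback that takes a PDF path and returns its text.
    Returns one {"FileName", "Text"} row per PDF.
    """
    rows = []
    # sorted() reads the whole listing up front, so the .txt files
    # written inside the loop are never picked up as input
    for pdf_path in sorted(Path(pdf_dir).iterdir()):
        # Skip sub-folders and anything without a .pdf extension (any case)
        if not pdf_path.is_file() or pdf_path.suffix.lower() != ".pdf":
            continue
        text = extract(pdf_path)
        # report.pdf -> report.txt in the same folder
        pdf_path.with_suffix(".txt").write_text(text, encoding="utf-8")
        rows.append({"FileName": pdf_path.name, "Text": text})
    return rows
```

Inside the Python tool, `extract` could be a small function that joins `page.extract_text()` over `PdfReader(path).pages`, and the returned rows can be turned into a pandas DataFrame before being passed to `Alteryx.write`.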

ayush_mishra
8 - Asteroid

Thanks a ton, @nickmartella. You are a lifesaver. I appreciate your response.
