Alteryx Designer Desktop Discussions

ayush_mishra · ‎05-23-2024

I have multiple PDF files with the same structure in a folder, and I want to extract the text from them. I have the following Python code, which handles a single PDF. How do I achieve this in Alteryx Designer?

from PyPDF2 import PdfReader

reader = PdfReader("inputfile.pdf")

# Extract the text of the first page only
page = reader.pages[0]
page_text = page.extract_text()

with open(file="outputfile.txt", mode="w") as f:
    f.write(page_text)

nickmartella · ‎05-23-2024

import os
import pandas as pd
from PyPDF2 import PdfReader
from ayx import Alteryx

# Define the directory where your PDF files are located
pdf_dir = "path_to_your_pdf_directory"

# Collect one row per page across all PDFs for the Alteryx output
rows = []

# Loop through each file in the directory
for filename in os.listdir(pdf_dir):
    # If the file is a PDF (case-insensitive check on the extension)
    if filename.lower().endswith(".pdf"):
        # Define the full file path
        file_path = os.path.join(pdf_dir, filename)

        # Open the PDF file
        reader = PdfReader(file_path)

        # Extract the text from each page (empty string if none is found)
        page_list = [page.extract_text() or "" for page in reader.pages]

        # Define the output file path, next to the source PDF
        output_file = os.path.splitext(file_path)[0] + ".txt"

        # Write the extracted text to a text file, one page after another
        with open(output_file, mode="w", encoding="utf-8") as f:
            f.write("\n".join(page_list))

        # Keep the file name and page number with each page's text
        for page_number, page_text in enumerate(page_list, start=1):
            rows.append({"FileName": filename, "Page": page_number, "Text": page_text})

# Alteryx.write expects a pandas DataFrame, so build it once after the loop
Alteryx.write(pd.DataFrame(rows, columns=["FileName", "Page", "Text"]), 1)

This script will create a text file for each PDF file in the same directory and also write the extracted text to an Alteryx output anchor.
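To check which files a loop like this would pick up, and where a text file placed next to each PDF would go, here is a minimal sketch using only the standard library. The helper name plan_outputs is made up for illustration and is not part of PyPDF2 or the ayx package; it only lists paths and does not read any PDF.

```python
from pathlib import Path


def plan_outputs(pdf_dir):
    """Map each PDF in pdf_dir to the .txt path that would sit next to it."""
    plan = {}
    for path in sorted(Path(pdf_dir).iterdir()):
        # Compare the extension case-insensitively so "REPORT.PDF" is included
        if path.is_file() and path.suffix.lower() == ".pdf":
            plan[path] = path.with_suffix(".txt")
    return plan


# Example: preview the plan for the current working directory
for pdf_path, txt_path in plan_outputs(".").items():
    print(pdf_path.name, "->", txt_path.name)
```

Running it against your PDF folder before the real extraction is a cheap way to confirm the path and the file filter are right.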

ayush_mishra · ‎05-24-2024

Thanks a ton, @nickmartella. You are a lifesaver. I appreciate your response.

Mark_Hung · ‎10-07-2024

Awesome

Anasalter · ‎10-08-2024

@nickmartella If the PDFs do not have the same structure but are in the same directory, will this code still work?

nickmartella · ‎10-09-2024

Yes. At a basic level, this script grabs all the text present in the PDF files, regardless of their layout. You can build it out further in either Python or Alteryx to filter for the specific kinds of text you need.
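As one way the filtering could look on the Python side, here is a minimal sketch that keeps only the lines matching a regular expression. The function name find_matching_lines, the sample text, and the pattern are all made up for illustration; in practice page_texts would be the list of strings returned by page.extract_text().

```python
import re


def find_matching_lines(page_texts, pattern):
    """Return (page_number, line) pairs for lines that match the regex pattern."""
    regex = re.compile(pattern, flags=re.IGNORECASE)
    matches = []
    for page_number, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            if regex.search(line):
                matches.append((page_number, line.strip()))
    return matches


# Made-up sample text standing in for the output of page.extract_text()
sample_pages = [
    "Invoice No: 1001\nCustomer: Acme",
    "Total: 250.00\nThank you",
]

# Keep only lines that start with "Invoice No:" or "Total:"
print(find_matching_lines(sample_pages, r"^(invoice no|total):"))
```

The same kind of filter can also be done downstream in Alteryx, for example with the RegEx or Filter tools on the text field, if you prefer to keep the Python tool minimal.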
