I am building a GST invoice extraction workflow using the LLM Prompt Tool with GPT on Alteryx AIMS. My setup processes 300+ scanned, multi-page PDFs (10-20 pages, ~1-3 MB each) from different vendors in varying formats and extracts structured fields into Excel.
Workflow Setup
Directory Tool → Blob Input → LLM Prompt Tool (Attach Non-Text Columns) → JSON Parse → Excel Output
Why Blob Input and not PDF to Text?
I initially tried the PDF to Text Tool but abandoned it because:
- The text output could not be passed cleanly to the LLM Prompt Tool due to a bytes/string type mismatch error
- OCR on scanned invoices is slow and loses layout context critical for accurate extraction
The Blob → Attach Non-Text Columns approach is more accurate for scanned PDFs because GPT reads the document visually, so it is our preferred route.
The Problem
When running all 300+ files in one go, the workflow consistently fails mid-run with:
"Error occurred in LLM response generation: 502 Server Error: Bad Gateway for url: https://eu1.alteryxcloud.com/aims/v1/generatedContent"
The same files process successfully in smaller batches of 25-30, which confirms the issue is volume/concurrency related rather than file-specific. Splitting into multiple manual runs is not acceptable, as it defeats the purpose of automation.
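If we end up orchestrating the API calls ourselves (e.g. from a Python Tool) rather than relying on the LLM Prompt Tool's internal handling, one mitigation for transient 502s is retry with exponential backoff. A minimal sketch, assuming a caller-supplied `send()` function that performs one request and returns `(status_code, body)` (the helper name, signature, and defaults are illustrative, not part of any Alteryx or AIMS API):

```python
import random
import time

# Gateway/overload statuses worth retrying; 502 is what we see mid-run.
TRANSIENT = {502, 503, 504}

def call_with_backoff(send, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call send() once per attempt, retrying transient HTTP statuses.

    send() must return (status_code, body). Delay doubles each attempt
    (1 s, 2 s, 4 s, ...), is capped at 60 s, and gets +/-50% jitter so
    parallel workers don't retry in lockstep. Returns the last response
    if all retries are exhausted.
    """
    for attempt in range(max_retries + 1):
        status, body = send()
        if status not in TRANSIENT or attempt == max_retries:
            return status, body
        delay = min(base_delay * 2 ** attempt, 60.0)
        sleep(delay * (0.5 + random.random()))
```

The `sleep` parameter is injectable only so the behaviour can be exercised without real waiting; in a workflow it would default to `time.sleep`.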
What we are looking for
- Is there a recommended architecture for processing 300+ scanned PDFs (at up to 20 pages each, i.e. roughly 300 × 20 = 6,000 pages) through the LLM Prompt Tool in a single unattended run?
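Absent a recommended pattern, our fallback idea is to throttle within a single run: walk the file list in fixed-size batches (25, the size that currently succeeds) with a pause between batches. The chunking itself is trivial; the batch size, the pause, and the `process_batch` callable are all assumptions to be tuned, not anything prescribed by Alteryx:

```python
import time

def batches(items, size=25):
    """Yield fixed-size slices of items; the last batch may be shorter."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run_throttled(files, process_batch, size=25, pause_s=30, sleep=time.sleep):
    """Process files batch by batch, pausing between batches.

    process_batch receives one list of files and does the actual LLM
    calls; pause_s gives the gateway time to recover between bursts.
    """
    chunks = list(batches(files, size))
    for n, chunk in enumerate(chunks):
        process_batch(chunk)
        if n < len(chunks) - 1:  # no pause after the final batch
            sleep(pause_s)
```

With 300+ files and size 25 this yields 12+ batches per run, which matches the manual split that works today but keeps it inside one unattended execution.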