Hi Community,
Data-->emp1-->expense_report1-->bill1
                                bill2
                                bill3
              expense_report2-->bill34
                                23
       emp2-->expensereport_344
              expensereport_454
       emp3-->expensereport_345
Above is the structure in which bills and invoices are stored in a folder for each employee.
I have to extract text from the images and PDFs, and then compare all bills for a particular employee with each other to find duplicates.
The problem I am facing is that when I use the Image Input and Image to Text tools, I get a memory error and the text is not extracted (there are around 2800 bills).
What approach should I use to build this workflow?
Hello!
To process 2800 bills efficiently, use a batch OCR workflow with a tool such as Tesseract or Google Vision, and avoid memory overload by streaming files and parallelizing tasks. Preprocess the images for better accuracy, store the extracted text with metadata, and compare bills per employee using fuzzy matching or hashing to detect duplicates.
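The comparison step could look roughly like the Python sketch below. It assumes the text has already been extracted, and that one employee's bills are held in a dictionary mapping bill name to text. The function names and the 0.9 similarity threshold are illustrative assumptions, not anything from Alteryx or an OCR library; only the standard library (`hashlib`, `difflib`) is used.

```python
import hashlib
import re
from difflib import SequenceMatcher
from itertools import combinations


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial OCR differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_duplicates(bills: dict[str, str], threshold: float = 0.9):
    """Return (bill_a, bill_b, score) for likely duplicates within one employee.

    `bills` maps a bill name to its extracted text. `threshold` is an assumed
    cut-off and would need tuning on real OCR output.
    """
    cleaned = {name: normalize(text) for name, text in bills.items()}
    digests = {
        name: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for name, text in cleaned.items()
    }
    matches = []
    for a, b in combinations(sorted(cleaned), 2):
        if digests[a] == digests[b]:
            # Identical after normalization: no fuzzy comparison needed.
            matches.append((a, b, 1.0))
            continue
        score = SequenceMatcher(None, cleaned[a], cleaned[b]).ratio()
        if score >= threshold:
            matches.append((a, b, score))
    return matches


# Made-up sample texts for one employee, purely for illustration.
emp1 = {
    "bill1": "Taxi fare  Total: 23.50",
    "bill2": "taxi fare total: 23.50",
    "bill3": "Hotel stay, 2 nights, Total: 180.00",
}
print(find_duplicates(emp1))
```

The hash check catches exact copies cheaply, and the fuzzy ratio is only computed for the remaining pairs. Because bills are compared only within one employee, the number of pairs stays small even with 2800 bills overall.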
Hi @Anasalter
Are you currently loading all 2800 files through the tool in one go? If so, can you try batching them so you're only working on one employee at a time?
If you take your current workflow, and use a control parameter to affect your directory input (which I'm assuming is there), that would let you make a batch macro which should limit the amount of memory being used by the tool in one go.
There's more info on batch macros here: https://knowledge.alteryx.com/index/s/article/Getting-Started-with-Batch-Macros-1583461640393
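For anyone who wants to see the same idea outside of a batch macro, here is a rough Python sketch of working through one employee folder at a time. It assumes the layout from the question (a root folder with one subfolder per employee). The names `iter_employee_batches`, `process_all`, `BILL_SUFFIXES` and the `extract_text` callback are hypothetical placeholders; this is not how the Alteryx tools work internally.

```python
from pathlib import Path
from typing import Callable, Iterator

# Assumed set of file types; adjust to whatever the folders actually contain.
BILL_SUFFIXES = {".pdf", ".png", ".jpg", ".jpeg"}


def iter_employee_batches(root: Path) -> Iterator[tuple[str, list[Path]]]:
    """Yield (employee_name, bill_files) for one employee folder at a time."""
    for emp_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(
            f for f in emp_dir.rglob("*")
            if f.is_file() and f.suffix.lower() in BILL_SUFFIXES
        )
        yield emp_dir.name, files


def process_all(
    root: Path, extract_text: Callable[[Path], str]
) -> dict[str, dict[str, str]]:
    """Run a caller-supplied OCR function over each employee batch in turn."""
    results: dict[str, dict[str, str]] = {}
    for employee, files in iter_employee_batches(root):
        # Only this employee's files are handled in this iteration; the
        # extracted text is kept, the images themselves are not.
        results[employee] = {
            str(f.relative_to(root)): extract_text(f) for f in files
        }
    return results
```

Because the generator hands over one employee's files per iteration, only that batch is in play at once, which is the same effect the control parameter on the directory input gives you in the macro.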
Hope that helps,
Ollie
Hi @OllieClarke
Yes, earlier I was trying to load all of the files in one go, but now I have used this approach and I am able to extract text from the images.
Happy to hear it :)