I have approximately 350 individual PDF statements to process each quarter. They come from 90+ different providers, so they are in many different formats.
We need to extract any fee information from those PDFs, so we have specific keywords to look for (fee, charge, chg, etc.).
We have used Adobe Cloud Services to extract the data and load it into a SQL table. However, because the PDF formats are inconsistent, the resulting data is unstructured and all over the place.
I appreciate my question is a little loose, but does anyone have any suggestions on how to tackle this project? I am planning to do some Alteryx parsing and filtering to make sense of the data, pulling just what I need and leaving the noise behind. As I'm fairly new to Alteryx, though, I am wondering whether other approaches would work better.
I was hoping to avoid building a data processing workflow for each PDF provider, as that's a lot of work.
Any suggestions are most welcome!
Thanks
We have the ability to read the PDFs as .txt files.
Parsing can then be done based on keywords such as Fee, Charges, etc.
This is only one possible approach, since the problem statement is quite generic.
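To illustrate the keyword idea outside Alteryx, here is a rough Python sketch that scans the text of one statement line by line and keeps only the lines that mention a fee keyword. The function name extract_fee_lines, the provider label and the amount pattern are all assumptions made for the example; real statements will likely need a wider keyword list and a more forgiving amount format.

```python
import re

# Keywords taken from the question (fee, charge, chg); extend as needed.
KEYWORD_RE = re.compile(r"\b(fees?|charges?|chgs?)\b", re.IGNORECASE)

# Assumed amount format: optional currency symbol, digits with optional
# thousands separators, and exactly two decimals (e.g. £1,250.00 or 12.50).
AMOUNT_RE = re.compile(r"[$£€]?\d[\d,]*\.\d{2}")


def extract_fee_lines(text, provider="unknown"):
    """Return one dict per line of `text` that contains a fee keyword."""
    rows = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        keyword = KEYWORD_RE.search(line)
        if keyword is None:
            continue  # noise: no fee keyword on this line
        rows.append({
            "provider": provider,
            "line_no": line_no,
            "keyword": keyword.group(1).lower(),
            "amounts": AMOUNT_RE.findall(line),  # may be empty
            "text": line.strip(),
        })
    return rows


if __name__ == "__main__":
    # Made-up statement text, only to show the shape of the output.
    sample = (
        "Opening balance 10,000.00\n"
        "Annual management fee   £1,250.00\n"
        "Platform chg 12.50\n"
        "Charges: see overleaf\n"
    )
    for row in extract_fee_lines(sample, provider="ExampleCo"):
        print(row)
```

Because the same filter runs over every provider's text, it avoids a separate workflow per provider. Lines that match a keyword but carry no amount (like the last sample line) still come through with an empty amounts list, so they can be reviewed or filtered out later.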
If you could share some sample PDFs and the output you expect to extract, that would help in deriving a solution that fits your needs.
Hope it helps!!!!
Many thanks
Shanker V