Hi All,
I'm trying to export the PDF data as it is to Excel, but it is not working as expected. Kindly help me resolve this issue. I have attached the expected Excel output file.
Thanks
Niranjan
@NiranjanK1, how are you importing the PDF into Alteryx? Can you share the flow / a screenshot of the flow?
@FinnCharlton I tried all the workflows available in the Alteryx community, but I'm getting errors; none of them work. I have only tried the forum workflows. I haven't built any myself and am not sure where to start.
@NiranjanK1 for PDF inputs in general you will need to have the Intelligence Suite installed for your Designer.
@NiranjanK1 , https://community.alteryx.com/t5/Community-Gallery/PDF-Input/ta-p/887038
this one has always worked for me, have you tried it? If so, what errors are you getting?
Hi @NiranjanK1
You could use a public gallery tool, Alteryx Intelligence Suite, Python, or R. The problem with this PDF seems to be that it does not contain the basic references a tool needs to parse it, like gridlines and proper alignment. I believe you would have problems with any tool because of this, since the tool wouldn't know how to properly separate the columns and rows:
So, if you can talk with someone to configure gridlines/proper alignment for these files, it will help a lot.
I was able to parse it using Python and the tabula library with the attached workflow. But as you can see, the tool does not do the job properly because of the issues mentioned above:
@Felipe_Ribeir0 Thanks very much for your input, I will definitely talk to them.
@FinnCharlton Sure, I will check, thank you.
@Felipe_Ribeir0 It is not working for me. Do I need to install any package?
Yes, run Alteryx as admin so that this piece of code runs properly:
Then change the Directory tool to point to the directory that contains the PDF files and run the workflow.
@Felipe_Ribeir0 Yes, I have changed the directory. How can I run Alteryx as admin? Please suggest.
Close Alteryx, then right-click its icon. You will have an option to Run as Admin. You will need sufficient privileges on your machine to do it.
@Felipe_Ribeir0 No, I do not have access to Run as Admin.
@NiranjanK1
An alternative to that is to create a folder and use this piece of code to install the libraries there and import them from there:
from ayx import Package
from ayx import Alteryx
import sys
Alteryx.installPackages(package="tabula",install_type="install --target=C:\\Users\\...\\PythonPackages")
Alteryx.installPackages(package="tabula-py",install_type="install --target=C:\\Users\\...\\PythonPackages")
sys.path.append('C:\\Users\\...\\PythonPackages')
import tabula
from tabula.io import read_pdf
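If it helps to check that step in isolation, here is a minimal sketch of the "import from your own folder" idea. It assumes the packages were already installed into a folder you can write to. The helper names `add_package_dir` and `module_available` and the demo folder are made up for illustration; they are not part of the ayx package.

```python
import importlib.util
import os
import sys
import tempfile


def add_package_dir(package_dir):
    """Append package_dir to sys.path once; return True if it was added."""
    package_dir = os.path.abspath(package_dir)
    if package_dir in sys.path:
        return False
    sys.path.append(package_dir)
    # Let the import system notice a folder that was only just created.
    importlib.invalidate_caches()
    return True


def module_available(name):
    """Return True if `import name` would find a module on the current sys.path."""
    return importlib.util.find_spec(name) is not None


# Hypothetical stand-in for the --target folder; use your own folder instead.
demo_dir = os.path.join(tempfile.gettempdir(), "PythonPackages")
os.makedirs(demo_dir, exist_ok=True)
add_package_dir(demo_dir)
print("tabula importable:", module_available("tabula"))
```

If the last line prints False, the install into that folder did not succeed (or the folder in sys.path is not the one used for --target), and the import tabula line would fail for the same reason.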
@Felipe_Ribeir0 I tried, but it is very hard for me to figure out. It is not working.
Try the attached workflow, just replace the 3 locations below with ones from your local machine. Remember to keep the double backslashes \\ and choose a location that doesn't contain spaces (Program Files, for example, contains one).
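On why the double backslashes matter: inside a normal Python string a single backslash starts an escape sequence, so \\ is how one literal backslash is written. A small sketch with a made-up folder (the path is only an illustration, and `has_spaces` is a hypothetical helper, not something from the workflow):

```python
# Hypothetical folder used only for illustration; replace it with your own.
escaped = "C:\\Temp\\PythonPackages"   # normal string, backslashes doubled
raw = r"C:\Temp\PythonPackages"        # raw string, backslashes written once

# Both spellings produce exactly the same text.
print(escaped == raw)
print(escaped)

# With a single backslash in a normal string, "\t" silently becomes a tab character.
broken = "C:\temp"
print("\t" in broken)


def has_spaces(path):
    """Return True if the path contains a space anywhere."""
    return " " in path


print(has_spaces("C:\\Program Files\\PythonPackages"))
print(has_spaces(escaped))
```

A quick check like `has_spaces` before running the workflow can catch a target folder that would break the install command, since the folder is passed as part of a single install string.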
If you do this, it will work. If there is any issue, please post the error message here.
@Felipe_Ribeir0 It created all the supporting files, but the field names and data are not coming through as expected.
I got the error: There is no valid metadata for outgoing connection 1. Run the workflow to generate valid metadata.
I have nearly 90 pages of data with 23 columns. The data and columns are not coming through.
Click on the Python tool and check whether it shows an error inside. A good idea is to run it first with just the file that you attached here, to see if you get the same result that I got.
@Felipe_Ribeir0 Yes, I got the data along with the message (There is no valid metadata for ...). But the file I shared is sample data; my real data has 30+ columns and 90+ pages of details.
About the (There is no valid metadata for ...) error, this is not really a problem. The Python tool shows it sometimes, and it will not cause any issues.
As for the rest of the PDF files, it will depend on whether they have the same structure as the shared one; some adjustment to the code may be needed to handle all of them, depending on how they differ from one another. But the best idea would be to first get the files with gridlines/proper alignment, then try to run it. Maybe the tabula function could then solve it by itself.
@Felipe_Ribeir0 If I get the same format of details with my real data, I can at least work with it, but I'm not getting the field names.
Field names where? As a row?
Try replacing this
df2 = tabula.read_pdf(FullPath, pages="all", area=[36, 18, 200, 800])
with this
df2 = tabula.read_pdf(FullPath, pages="all", area=[36, 18, 580, 800], pandas_options={'header': None})
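For context on what changes between the two calls: as far as I understand the tabula-py documentation, area is [top, left, bottom, right] measured in PDF points, so raising the third value from 200 to 580 extends the region further down the page, and pandas_options={'header': None} keeps the first extracted row as data instead of turning it into column names. A rough sketch of that reasoning, where `build_read_options` and `read_tables` are hypothetical helpers and not part of tabula:

```python
def build_read_options(top, left, bottom, right, first_row_is_header=False):
    """Build keyword arguments for tabula.read_pdf; area is assumed to be in points."""
    if bottom <= top or right <= left:
        raise ValueError("area must satisfy top < bottom and left < right")
    options = {"pages": "all", "area": [top, left, bottom, right]}
    if not first_row_is_header:
        # Forwarded to pandas: keep every extracted row as data, no header row.
        options["pandas_options"] = {"header": None}
    return options


def read_tables(pdf_path, options):
    """Read the PDF with tabula-py (imported lazily; needs Java and the package)."""
    import tabula  # assumption: tabula-py is installed as described earlier
    return tabula.read_pdf(pdf_path, **options)


old_options = build_read_options(36, 18, 200, 800, first_row_is_header=True)
new_options = build_read_options(36, 18, 580, 800)

# The new area is taller, so it reaches rows further down the page.
old_height = old_options["area"][2] - old_options["area"][0]
new_height = new_options["area"][2] - new_options["area"][0]
print(old_height, new_height)
```

Usage would be something like read_tables(FullPath, new_options) inside the Python tool. If the real files use a different page size, the area values would likely need adjusting.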
@Felipe_Ribeir0 Column names along with the data. If I get extra columns, that is also fine; I can use the Formula tool to merge them as required.
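If the column names arrive as the first data row (which is what header=None would give), one possible way to turn that row back into field names inside the Python tool is sketched below. This assumes pandas is available and that the pages come back as separate DataFrames; the sample values and the `promote_first_row` helper are made up for illustration.

```python
import pandas as pd


def promote_first_row(df):
    """Use the first row as column names and drop it from the data."""
    promoted = df.iloc[1:].reset_index(drop=True)
    promoted.columns = [str(name) for name in df.iloc[0]]
    return promoted


# Stand-ins for two pages read with header=None (made-up values).
page1 = pd.DataFrame([["Name", "Amount"], ["A", "10"], ["B", "20"]])
page2 = pd.DataFrame([["C", "30"]])

# Stack the pages first, then promote the header row once.
combined = pd.concat([page1, page2], ignore_index=True)
result = promote_first_row(combined)
print(result)
```

If the header line repeats on every page of the real PDF, the repeated rows would still need to be filtered out afterwards, either here or with a Filter tool downstream.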