Alteryx Designer Desktop Discussions

ankurrjit · ‎04-02-2021

Hi,

I am using the PDF Input tool from the Alteryx Gallery to bring a PDF file into Alteryx.

https://gallery.alteryx.com/#!app/PDF-Input/5b685aff0462d710907f7a3b

The problem with this method is that the resulting data is poorly structured, and it takes a lot of work to correct it.

Initially, the data looks like the below, so I use a regex to insert a pipe (|) delimiter between the values and then use the "Text to Columns" tool to create the columns.

The second regex replaces $ with the | delimiter.

Issue #1:

Somehow the data is breaking onto the next line. How can I fix it?

E.g., column 4, row 35.

Issue #2:

I am using a pipe delimiter, and when I split on it with "Text to Columns", the tool does not recognize the blank field and shifts the values.

E.g., the value 246 should land in the last column with the second-to-last column empty, but instead it lands in the second-to-last column and the last column is left empty.

See column 6, row 45.
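To illustrate the kind of shift described in Issue #2, here is a small Python sketch. It is not Alteryx code, and the sample row is made up (the real data is only in the attached file). It assumes one possible cause: consecutive pipes being treated as a single delimiter. Another possibility is that the blank field never received its own pipe in the regex step, which would produce the same symptom.

```python
import re

# Hypothetical row: the second-to-last field is blank, the last one is 246.
row = "ACME|12.50||246"

# Splitting on every single pipe keeps the blank field in its position.
keep_empty = row.split("|")

# Treating a run of pipes as one delimiter drops the blank field,
# so 246 moves one column to the left.
collapse_runs = re.split(r"\|+", row)

print(keep_empty)      # ['ACME', '12.50', '', '246']
print(collapse_runs)   # ['ACME', '12.50', '246']
```

If the blank field has no delimiter of its own in the text, no split setting can recover it; the delimiter has to be added before the split.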

joshbennett · ‎04-02-2021

Can you package and upload your draft workflow and/or provide a .yxdb of the converted text you are trying to parse?

ankurrjit · ‎04-02-2021

Hi, thanks for the reply. Sure, I am attaching the packaged workflow along with the YXDB file I am trying to parse.

joshbennett · ‎04-02-2021

Are you sure the .yxwz workflow you uploaded is the one you meant to upload? It does not seem related to the question you described, unless I'm missing something.

Generally speaking, what you are attempting is a sort of 'brute force' method, which is fine for individual use cases like this, but keep in mind that such an approach may not scale well depending on format consistency between converted PDF documents. I have never personally used the specific Gallery tool you referenced, but if scalability is your objective you may want to explore how your initial conversion results compare to other available PDF ingestion methods (e.g., leveraging or building Python- or R-based OCR packages) to see if any of the other options give you a better initial conversion that requires less context-dependent parsing. Ideally, if you have an Intelligence Suite license (https://www.alteryx.com/products/alteryx-platform/intelligence-suite), the new Text Mining tool group has a PDF Input tool that may be worth checking out (https://help.alteryx.com/current/designer/pdf-input).

That being said, I took a crack at parsing your .yxdb file based on your initial attempt and related questions, and the attached workflow appears to generate your desired result. You can now rename and re-type the fields as needed, though you may need to remove non-numeric characters with an additional formula before casting some of the fields to numeric types like Double.
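As a rough illustration of that last cleanup step (a Python sketch, not taken from the attached workflow), the idea is to strip everything that is not a digit, decimal point, or minus sign before converting. The helper name `to_double` and the sample values are hypothetical. In Designer the same idea would typically live in a Formula tool using a regex replace followed by a numeric conversion.

```python
import re

def to_double(text):
    """Strip non-numeric characters and convert to float.

    Returns None when nothing numeric is left or the remainder
    is not a valid number. Hypothetical helper for illustration.
    """
    # Keep only digits, the decimal point, and the minus sign.
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    try:
        return float(cleaned)
    except ValueError:
        return None

print(to_double("$1,234.50"))  # 1234.5
print(to_double(" 246 "))      # 246.0
print(to_double(""))           # None
```

Note that this sketch does not handle accounting-style negatives such as (246); those would need an extra rule before the conversion.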

Let me know if you have any questions on any of the methods / strategies employed in the attached solution - there are lots of little tricks you can use to deal with dirty data like this. 🙂

Hope that helps!

ankurrjit · ‎04-05-2021

I am sorry, you are right: I attached the wrong workflow, but you got it right with the .yxdb file.

I agree with you, this is a kind of 'brute force' method. When I convert the same PDF from different dates, the results look different, but I now have an idea of how to handle the parsing for any specific PDF file.

I checked with my Alteryx team about getting an Intelligence Suite license; right now I don't have access to these text mining tools.

Thanks a lot for helping me out. Wish you a great day.

PDF input parsing issue