Alteryx Designer Desktop Discussions

SOLVED

PDF extraction

umairah
8 - Asteroid

Hello, I am a student and still learning to use Alteryx Designer. I have a PDF file that I want to output as an Excel file, and I have already converted the PDF to a yxdb file. My problem now is that I can't figure out how to separate the country from the rest of the data. I have already filtered out all the unnecessary sentences, so I am stuck at this part.

[Attached image: data extracted.png]

 

Any help is really appreciated. I will attach my workflow, the yxdb file, and the PDF file.

8 REPLIES
Emil_Kos
17 - Castor

Hi,

 

I believe you need to use the RegEx functionality.

Maybe the article below will be helpful for you:

 

https://community.alteryx.com/t5/Alteryx-Designer-Discussions/How-to-parse-words-from-numbers/td-p/3...

 

 

PhilippK
Alteryx Alumni (Retired)

Hi @umairah ,

 

Ideally, you would use the Text Mining tools of the Intelligence Suite (an add-on for Alteryx Designer) to read in PDFs:

https://www.alteryx.com/products/alteryx-platform/intelligence-suite

 

However, this comes at a price. You could reach out to Alteryx for Good to check whether it is possible to get a free (trial) license for the Intelligence Suite as a student:

alteryxforgood@alteryx.com 

https://www.alteryx.com/why-alteryx/alteryx-for-good/students

 

Best regards

Phil

markcurry
12 - Quasar

Hi @umairah 

 

Two things that may help you. Firstly, as @Emil_Kos mentioned, you could use RegEx: add the RegEx tool to identify two or more consecutive spaces and replace them with a |, then use Text to Columns to separate on the | (see attached). Or you could use a more complicated RegEx statement to split each line properly.
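For anyone who wants to see the logic outside of Alteryx, here is a minimal Python sketch of the same idea: collapse every run of two or more spaces into a pipe, then split on the pipe. The function name and the sample line are made up for illustration and are not taken from the attached workflow or the PDF.

```python
import re

def split_on_wide_gaps(line):
    """Split a line into fields wherever two or more spaces occur."""
    # Replace each run of 2+ spaces with a single pipe delimiter.
    piped = re.sub(r" {2,}", "|", line.strip())
    # Split on the delimiter, as Text to Columns would.
    return piped.split("|")

# Hypothetical sample line; a single space inside a country name is kept.
sample = "United Kingdom    1,234    5,678"
print(split_on_wide_gaps(sample))
```

This only works when the gaps between columns are always at least two spaces wide and no value contains a double space itself.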

 

Secondly, if you look at the data that you've extracted from the PDF in Notepad with a font like Courier or Consolas, where each character is the same width, you'll see that the data is fixed width, so you could use the SubString function to extract each section.
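As a rough illustration of the fixed-width approach, the Python sketch below cuts each line at fixed character positions, which is what a substring call per column does. The column names and positions are assumptions for the example only; the real boundaries would have to be read off the extracted text in a monospaced font.

```python
# Hypothetical column boundaries as (start, end) character positions.
COLUMNS = {
    "country": (0, 20),
    "value_a": (20, 30),
    "value_b": (30, 40),
}

def slice_fixed_width(line, columns=COLUMNS):
    """Cut one fixed-width line into named fields and trim the padding."""
    return {name: line[start:end].strip() for name, (start, end) in columns.items()}

# Build a made-up 40-character line that matches the assumed layout.
sample = "Malaysia".ljust(20) + "1,234".rjust(10) + "56".rjust(10)
print(slice_fixed_width(sample))
```

Unlike splitting on spaces, this keeps the columns aligned even when a value is missing, as long as the positions stay the same on every line.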

 

I hope that helps.

marcusblackhill
12 - Quasar

Hi @umairah !

 

You can use any of the answers given above, but if you don't want to use RegEx and you just need to separate the country name from the numbers, you can use two parallel Data Cleansing tools: one removing numbers, punctuation and duplicate spaces to get just the country names, and the other removing letters to get just the numbers. Then you join the two streams by record position with a Join tool.
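The two-branch idea can be sketched in Python as follows. This is only an illustration of the logic, not a translation of the attached workflow: the sample lines are invented, and the character stripping uses Python's re module here simply because it is the shortest way to express "remove digits" and "remove letters" in a script.

```python
import re

def country_only(line):
    """Branch 1: drop digits and punctuation, then collapse extra spaces."""
    no_digits = re.sub(r"[0-9.,]", "", line)
    return re.sub(r"\s+", " ", no_digits).strip()

def numbers_only(line):
    """Branch 2: drop letters, then collapse extra spaces."""
    no_letters = re.sub(r"[A-Za-z]", "", line)
    return re.sub(r"\s+", " ", no_letters).strip()

# Made-up sample lines standing in for the extracted PDF text.
lines = ["Malaysia  1,234  56", "New Zealand  789  10"]

countries = [country_only(line) for line in lines]
numbers = [numbers_only(line) for line in lines]

# Join by record position: the first country pairs with the first numbers.
joined = list(zip(countries, numbers))
print(joined)
```

Joining by position is safe only because both branches keep exactly one output row per input row and in the same order.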

 

Take a look at the attached workflow.

Hope that helps.

umairah
8 - Asteroid

Using RegEx really helped me separate the country from the numbers, and your solution actually simplifies the first part of my workflow, so thank you for that. However, the second part, pages 138 to 141 of the PDF file, is where I am still stuck. I want to align the numbers with their respective countries, and using RegEx solves only one of the issues. I am sorry for asking so much, but I really don't know how to solve this, so any suggestion would be really helpful.

umairah
8 - Asteroid

Actually, I applied for Alteryx for Good, but unfortunately it didn't come with the Text Mining tools.

shreyanshrathod
11 - Bolide

Hi @umairah ,

 

Could you tell me how you extracted your PDF file and converted it to yxdb?

 

Thanks in advance.

Shreyansh

JakobJ
7 - Meteor

How did you convert the PDF to a yxdb file?

 

Thank you in advance.
