Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

General Discussions has some can't miss conversations going on right now! From conversations about automation to sharing your favorite Alteryx memes, there's something for everyone. Make it part of your community routine!

SOLVED

data parsing from pdf

umairah
8 - Asteroid

Hi, I have a problem parsing my data into column. Previously I got help on using regEx tool and it has help me a lot and I use the tool to separate country and the number and use TextToColumn after that but my problem is some of the data is different row which is in the country column.  Any idea on how to solve this?

 

data extracted.png

 

p/s : This problem from page 138 until 141 from the pdf and any help is really appreciated. I will put my workflow and sample data for your reference.

 

Thank you in advance

wdavis
Alteryx
Alteryx

Hi @umairah 

 

I have taken a look at this and this is being caused by the number references that are included within the tables. When this occurs in the data, it is creating a new line for that country and therefore your logic doesn't group together at the end of the workflow.

 

I have added on a new workflow beneath your existing one which at the moment just tackles the problem with Venezuela on page 141 (I have limited the data in the workflow here to just look at this country), however, this logic should work for all countries.

 

If you want to validate that this is what you would expect to see and if you need help getting it to work across the whole workflow, let me know. The existing workflow you have built is great though and can still be used for the additional parsing and transformations across the whole dataset.

 

Kind Regards,

Will

umairah
8 - Asteroid

Hi @wdavis and thank you for response to my question,

 

I am still new to Alteryx so there is a lot of things that I still do not understand and still learning. I do notice that the reference number is included in the data but I do not know how to remove it and thank you for explain it to me.

 

I look at the workflow and it work on some countries but I notice that some countries is removed and the data is incomplete. I make little adjustment to the workflow and some countries not listed. Some guidance on the workflow is really appreciated.

 

Kind regards,

DavidP
16 - Nebula
16 - Nebula

Hi @umairah 

 

As @wdavis pointed out, the data coming out of the PDF Input macro is a little messy. I loaded the original PDF file with the PDF Input macro and pasted the result into a text input tool for an easier starting point.

 

Looking at the data in the PDF file, I used "Access Indicators" and "Use Indicators" as the start and end markers, which makes it a bit easier to filter out the pages you don't want. I then used your filter formula to remove the unwanted rows.

 

So you can clearly see where the Table references splits the numeric columns to multiple rows, but there is also another problem - if you look at Antigua & Barbuda, you can see how the name is split over multiple rows.

 

DavidP_0-1598351647770.png

 

There are quite a few of these and they are not always consecutive. If that Table row also contains Table references, part of the name will be at the start and part at the end with a number of rows in between. Sometimes the name belongs to the Country set in the rows above, but other times it belongs to the Country set in the rows below, as in this example. 

 

DavidP_1-1598351947020.png

 

This makes it really tricky to find logic to put the country names together. I couldn't come up with any logic that can correctly decide whether a name at the end of a number set belongs to the set above or below, so I decided to remove it completely. This means that the names for some countries are a bit wonky. For the US and the UK, both are "United" and one or two like 118 New Zealand are missing altogether.

 

But, at least the numbers are parsed correctly. So if you can figure out how to deal with the edge case Country Names, hopefully you can get it to work.

 

Attached is my workflow, with some documentation

 

DavidP_3-1598352572804.png

 

 

 

 

 

DavidP
16 - Nebula
16 - Nebula

Hi @umairah 

 

I've had a further play with it and managed to fix the issues with most of the country names. Version 2 attached.

 

DavidP_0-1598370496584.png

 

 

umairah
8 - Asteroid

Hi @DavidP .

 

Your solution really help me a lot because I had another table that had similar problem to this and now I can solve that table as well and refer to this workflow. About the country I will make some adjustment for country that is missing and thank you so much for the help. 

Labels