Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Parsing Unstructured Data from rows and columns into columns

Nageen
8 - Asteroid

Hello,

 

I am trying to parse the attached unstructured data set into structured columns. I am having trouble since I need to parse both rows and columns. The example output data tab is how I'm hoping to structure the data.

 

This data was originally in a text file which I copy and pasted into an Excel file. If a solution could be created using only the text file, that would be appreciated too.

 

Thank you in advance for your time and any guidance.

11 REPLIES 11
jarrod
ACE Emeritus
ACE Emeritus

Both solutions are the same regardless of input file type as long as you set it up without using headers or delimiters. 

1) Check "First Row Contains Data" (or deselect "First Row Contains Field Names" if txt file)

2) if needed, set delimiter to "\0" (needed for txt, not applicable to xlsx)

 

Then you can determine the records (as defined by NULL separator rows) and it's time to parse.

From here, it's a matter of how the data is "structured" or what patterns you can find. I don't know the data you are working with or what to expect, but i gave it a go. Take a look at the attached workflow and let me know if something doesn't make sense. 

 

Edit: needed to package workflow with data.

estherb47
15 - Aurora
15 - Aurora
Let me create a few more dummy rows in the dataset and see if I can figure
it out. My guess is that the tile tool might not be functioning properly.

Are you able to provide a few more rows for testing?

--
Esther Bezborodko
*Senior Manager*
201.650.7314 | estherbezborodko@gmail.com
beautycounter.com/estherbezborodko

*Our mission is to get safe products in the hands of everyone.*
[image: Facebook]
Nageen
8 - Asteroid

Hi @jarrod!

 

Thank you so much for providing this workflow. It's a huge help! I'm having trouble extracting some of the bolded data in the attached file. Would you have any tips for how I can do so?

 

Thank you much!

jarrod
ACE Emeritus
ACE Emeritus
Yeah, I wasn't sure how that data related to everything. I think you'll have to call out those patterns explicitly to parse. So, find "Default privileges:" and grab the row below to concatenate. Then for the last section, find "Identifier Value Attributes" then take all the rows between that and the next null cell. There are a few ways to do this, but I don't have my computer in front of me right now. Let me know if that doesn't make sense and I'll try to create a workflow.
Nageen
8 - Asteroid

Thank you, @jarrod! I gave it a shot, but I'm still not able to isolate and parse the data. Any additional help would be much appreciated!

 

Also, after adding in the actual data file, I found that some columns are being split up and duplicated with an underscore before the column name (i.e. there are two "owner" columns, one being renamed to "_owner" - each column has different values - one with the designated data, one blank). Do you have any idea why the text to columns function would be doing this?

 

Thanks again for all of your help!

estherb47
15 - Aurora
15 - Aurora

Hi @Nageen ,

 

Took a different approach than @jarrod , and it works with the dataset you provided.

First step was to clean up some of the text. Some were missing colons, some weren't as in your result example, etc. That's the first formula tool.

Then jumped into a Tokenize into Rows method of a RegEx parse, to separate the fields that have multiple fields into rows. (.*?)(?:\s{2,}|$) delimits when there are more than 2 spaces.

Because that does separate some of the data into rows that shouldn't be separated, assigned a new Row Identifier, and the multi-row formula tool looks for rows that should be combined and gives them the same Record ID. This fixes the challenge of parsing the last few columns, which have multiple values

A summarize tool then concatenates so that we have the correct number of rows in the correct order, and then the Tile tool splits evenly (so if you'd add another record into the dataset, it would split into 3 tiles, etc.)

I like the RegEx parse tool to separate into 2 columns (Name and Value) but you could use Text to Columns into 2 columns, separating on a :

Cross Tab builds the table. 

A little trick I like is to use a numeric field, that lists the fields in the order that you want them, as the headers in the Cross Tab, and then Dynamic Rename to rename with the real field headings.

You can add in a Select tool to choose only the columns you want returned.

Cheers!


Estherimage.png

Nageen
8 - Asteroid

Hi @estherb47-

 

Thank you so much for your solution and detailed explanation! I plugged in my actual dataset and for some reason it looks like the cross tab function is creating columns instead of rows. The data output only has two rows but multiple renamed columns (the actual dataset has many more than the two records included in the example data). Do you know why this is happening or how to fix this?

 

Thanks again! I appreciate everyone's help with this!

Nageen
8 - Asteroid

Hi @estherb47

 

Thank you for your willingness to help. Please see attached for additional dummy data.

jasperlch
12 - Quasar

Hi @Nageen 

 

The workflow created by @estherb47 is actually fine in the first dummy data which only contains 2 rows of records as the Tile tool was set to accept 2 records. In order to make the workflow more dynamic so that it will accept any number of records, all you need to do is to replace the Tile tool with 2 Mulit-Row Formula:

 

Capture.PNG

Labels