Hello,
I am trying to parse the attached unstructured data set into structured columns. I am having trouble since I need to parse both rows and columns. The example output data tab is how I'm hoping to structure the data.
This data was originally in a text file which I copy and pasted into an Excel file. If a solution could be created using only the text file, that would be appreciated too.
Thank you in advance for your time and any guidance.
Solved! Go to Solution.
Both solutions are the same regardless of input file type as long as you set it up without using headers or delimiters.
1) Check "First Row Contains Data" (or deselect "First Row Contains Field Names" if txt file)
2) if needed, set delimiter to "\0" (needed for txt, not applicable to xlsx)
Then you can determine the records (as defined by NULL separator rows) and it's time to parse.
From here, it's a matter of how the data is "structured" or what patterns you can find. I don't know the data you are working with or what to expect, but i gave it a go. Take a look at the attached workflow and let me know if something doesn't make sense.
Edit: needed to package workflow with data.
Hi @jarrod!
Thank you so much for providing this workflow. It's a huge help! I'm having trouble extracting some of the bolded data in the attached file. Would you have any tips for how I can do so?
Thank you much!
Thank you, @jarrod! I gave it a shot, but I'm still not able to isolate and parse the data. Any additional help would be much appreciated!
Also, after adding in the actual data file, I found that some columns are being split up and duplicated with an underscore before the column name (i.e. there are two "owner" columns, one being renamed to "_owner" - each column has different values - one with the designated data, one blank). Do you have any idea why the text to columns function would be doing this?
Thanks again for all of your help!
Hi @Nageen ,
Took a different approach than @jarrod , and it works with the dataset you provided.
First step was to clean up some of the text. Some were missing colons, some weren't as in your result example, etc. That's the first formula tool.
Then jumped into a Tokenize into Rows method of a RegEx parse, to separate the fields that have multiple fields into rows. (.*?)(?:\s{2,}|$) delimits when there are more than 2 spaces.
Because that does separate some of the data into rows that shouldn't be separated, assigned a new Row Identifier, and the multi-row formula tool looks for rows that should be combined and gives them the same Record ID. This fixes the challenge of parsing the last few columns, which have multiple values
A summarize tool then concatenates so that we have the correct number of rows in the correct order, and then the Tile tool splits evenly (so if you'd add another record into the dataset, it would split into 3 tiles, etc.)
I like the RegEx parse tool to separate into 2 columns (Name and Value) but you could use Text to Columns into 2 columns, separating on a :
Cross Tab builds the table.
A little trick I like is to use a numeric field, that lists the fields in the order that you want them, as the headers in the Cross Tab, and then Dynamic Rename to rename with the real field headings.
You can add in a Select tool to choose only the columns you want returned.
Cheers!
Esther
Hi @estherb47-
Thank you so much for your solution and detailed explanation! I plugged in my actual dataset and for some reason it looks like the cross tab function is creating columns instead of rows. The data output only has two rows but multiple renamed columns (the actual dataset has many more than the two records included in the example data). Do you know why this is happening or how to fix this?
Thanks again! I appreciate everyone's help with this!
Hi @estherb47
Thank you for your willingness to help. Please see attached for additional dummy data.
Hi @Nageen
The workflow created by @estherb47 is actually fine in the first dummy data which only contains 2 rows of records as the Tile tool was set to accept 2 records. In order to make the workflow more dynamic so that it will accept any number of records, all you need to do is to replace the Tile tool with 2 Mulit-Row Formula: