Hi,
The data source is a daily email of announcements. The data is trapped in one initial output field. I need to reshape it. I am thinking transpose, parse, multi-role formula, etc. But, I don't know where to begin. Thanks
My first step is to import the email as a TXT file. I have Delimiter import options set to \n. I can filter some of the header information using the Start Data Import on Line setting. The data table below represents the initial output of the Input Tool.
I need to parse, transpose the data into columns and rows. The tables below depict 1) the data after import and 2) the desired output.
A few things...
There are 3, sometimes 4 categories. They are always named the same (CATEGORY A, CATEGORY B, CATEGORY C, CATEGORY D). The number of items under each CATEGORY (represented by the Lorem Ipsum) can vary from 0 to N on a day-to-day basis. And, I want to ignore everything after "END" statement, which in the real world is *small business".
For Release AT |
5 p.m. ET |
No. CR-246-18 |
Dec 29, 2018 |
[Null] |
Dec. 29, 2018 |
[Null] |
Announcements For Dec. 29, 2018 |
[Null] |
CATEGORY A |
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. |
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. |
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. |
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. |
CATEGORY B |
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. |
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. |
CATEGORY C |
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. |
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. |
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. |
End |
gibberish |
[Null] |
gibberish |
gibberish |
gibberish |
Desired Output
Date | Category | Description |
Dec. 29, 2018 | CATEGORY A | Lorem ipsum dolor sit amet, consectetur... |
Dec. 29, 2018 | CATEGORY A | Lorem ipsum dolor sit amet, consectetur... |
Dec. 29, 2018 | CATEGORY A | Lorem ipsum dolor sit amet, consectetur... |
Dec. 29, 2018 | CATEGORY A | Lorem ipsum dolor sit amet, consectetur... |
Dec. 29, 2018 | CATEGORY B | Lorem ipsum dolor sit amet, consectetur... |
Dec. 29, 2018 | CATEGORY B | Lorem ipsum dolor sit amet, consectetur.. |
Dec. 29, 2018 | CATEGORY C | Lorem ipsum dolor sit amet, consectetur.. |
Dec. 29, 2018 | CATEGORY C | Lorem ipsum dolor sit amet, consectetur.. |
Dec. 29, 2018 | CATEGORY C | Lorem ipsum dolor sit amet, consectetur.. |
Solved! Go to Solution.
Okay.
I found one way to breakout and fill-in for Category.
1. I created a new field using the Formula tool.
if [Field_1]='A' then 'A' elseif [Field_1]='B' then 'B' elseif [Field_1]='C' then 'C' elseif [Field_1]='D' then 'D' else Null() endif
2. I then used the Multi-Row Formula tool to 'fill-down'. Check out this Tool Mastery | Multi Row article.
iif(isnull([Category]),[Row-1:Category],[Category])
Based on the sample you provided I suggest:
1. Use a multi row formula to create the date column. I used REGEX_Match to look for a date and then DateTimeParse to parse:
IIF(Regex_Match([Field1],"[A-Z][a-z][a-z] \d{2},\s+\d{4}"), DateTimeParse([Field1],"%b %d,%Y"), [Row-1:Date])
2. Use a second multi row formula to make the category. Again I chose Regex to match the category text:
IIF(Regex_Match([Field1],"^CATEGORY .*"), [Field1], IIF([Field1]="End",NULL(),[Row-1:Category]))
3. Finally filter to rows with a Category and then a select to tidy up
Sample attached.
I love the simplicity of this approach. I just came up with my own solution, but I think this approach provides more flexibility should any anomalies creep up in the source data. It also requires 1/3 the steps.
Many Thanks
How might you modify the 2nd Multi-Row/Regex formula if Category was A, B, C, D and NOT Category A, Category B, Category C, and Category D?
Thanks
Assuming you know the list of categories I would suggest:
IIF([Field1] IN ("A","B","C","D"), [Field1], IIF([Field1]="End",NULL(),[Row-1:Category]))
Otherwise, you will need to come up with some rule to identify the category cells. Feel free to DM me if you want me to look at real list in private