Okay, I have a nightmare (for my novice eyes) data source.
The source document is a PDF. The PDF does not export nicely to XML or CSV and it will not import into Excel in a format that I can use. I was able to get it into Alteryx as a CSV input. But, everything is trapped in one field. There are a few exceptions, but first things first.
Fortunately, the source document is outlined.
The vast majority of what I need to extract is prefaced with a numbered outline reference. The format appears to be ###-#.#.#....
1. How can I filter out those rows that do not start ###? Let me clarify, it must start with a 3-digit number ###.
2. There is always an [Orig] between the outline ### and the text. How can I get the parse tool to use the [Orig] to output everything before it to column 1 and everything after it to column 2?
Thanks
Field 1 | ||||
SECTION 0: [Orig] General Spec Admin | ||||
042 [Orig] General Spec Requirements | ||||
042-1 [Orig] General | ||||
042-1.2 [Orig] Items blah blah blah | ||||
042-1.3 [Orig] The Contractor blah blah blah | ||||
Table 042-1 [Orig] The objective items blah |
Solved! Go to Solution.
This is one for regex.
Firstly I use a regex_match to filter to only those statements starting with 3 characters and then use the regex option in parse mode to pull out the before and after [Orig].
(\d.+) \[Orig\] (.+)
in theory it could be
(.+) \[Orig\] (.+)
Workbook attached.
Ben
Thank you. I got it to work using the less direct regex Filter formula regex_match([Field 1],"^\d.*"). But, this (especially the regex parse tool) helps get their quicker. Thanks.