Alteryx Designer Desktop Discussions

hellyars · ‎08-23-2018

Okay, I have a nightmare (for my novice eyes) data source.

The source document is a PDF. The PDF does not export nicely to XML or CSV and it will not import into Excel in a format that I can use. I was able to get it into Alteryx as a CSV input. But, everything is trapped in one field. There are a few exceptions, but first things first.

Fortunately, the source document is outlined.

The vast majority of what I need to extract is prefaced with a numbered outline reference. The format appears to be ###-#.#.#....

1. How can I filter out those rows that do not start ###? Let me clarify, it must start with a 3-digit number ###.

2. There is always an [Orig] between the outline ### and the text. How can I get the parse tool to use the [Orig] to output everything before it to column 1 and everything after it to column 2?

Thanks

Field 1
SECTION 0: [Orig] General Spec Admin
042 [Orig] General Spec Requirements
042-1 [Orig] General
042-1.2 [Orig] Items blah blah blah
042-1.3 [Orig] The Contractor blah blah blah
Table 042-1 [Orig] The objective items blah

BenMoss · ‎08-23-2018

This is one for regex.

Firstly I use a regex_match to filter to only those statements starting with 3 characters and then use the regex option in parse mode to pull out the before and after [Orig].

(\d.+) \[Orig\] (.+)

in theory it could be

(.+) \[Orig\] (.+)

Workbook attached.

Ben

hellyars · ‎08-23-2018

Thank you. I got it to work using the less direct regex Filter formula regex_match([Field 1],"^\d.*"). But, this (especially the regex parse tool) helps get their quicker. Thanks.

Alteryx Designer Desktop Discussions

How to filter rows that start with ### and then parse between ### and Text