Alteryx Designer Desktop Discussions

SOLVED

How to filter rows that start with ### and then parse between ### and Text

hellyars
13 - Pulsar

Okay, I have a nightmare (for my novice eyes) data source.  

 

The source document is a PDF. It does not export cleanly to XML or CSV, and it will not import into Excel in a format that I can use. I was able to get it into Alteryx as a CSV input, but everything is trapped in one field. There are a few exceptions, but first things first.

 

 

Fortunately, the source document is outlined.  

 

The vast majority of what I need to extract is prefaced with a numbered outline reference.  The format appears to be ###-#.#.#....

 

1.  How can I filter out the rows that do not start with ###? To clarify, a row must start with a 3-digit number (###).

2.  There is always an [Orig] between the outline ### and the text.  How can I get the parse tool to use the [Orig] to output everything before it to column 1 and everything after it to column 2?

 

Thanks

 

 

 

Field 1
SECTION 0: [Orig] General Spec Admin
042 [Orig] General Spec Requirements
042-1 [Orig] General
042-1.2 [Orig] Items blah blah blah
042-1.3 [Orig] The Contractor blah blah blah
Table 042-1 [Orig] The objective items blah

 

 

 

BenMoss
ACE Emeritus

This is one for regex.

 

First, I use a regex_match to keep only the rows that start with a 3-digit number, and then use the RegEx tool in Parse mode to pull out the text before and after [Orig].

 

(\d.+) \[Orig\] (.+)

In theory, once the rows are already filtered, it could simply be

 

(.+) \[Orig\] (.+)

Workbook attached.
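
For anyone who wants to check the two steps outside Designer, here is a rough Python sketch of the same idea. It assumes Python's `re` module treats these patterns the same way Alteryx's regex engine does, which should hold for syntax this simple but is not verified against the attached workbook. The rows are the sample data from the question; the `^\d{3}` filter is my reading of the "must start with a 3-digit number" requirement, and the variable names are made up for illustration.

```python
import re

# Sample rows copied from the question (the single "Field 1" column).
rows = [
    "SECTION 0: [Orig] General Spec Admin",
    "042 [Orig] General Spec Requirements",
    "042-1 [Orig] General",
    "042-1.2 [Orig] Items blah blah blah",
    "042-1.3 [Orig] The Contractor blah blah blah",
    "Table 042-1 [Orig] The objective items blah",
]

# Step 1 (filter): keep rows that begin with three digits.
# Assumption: this is the intended meaning of "starts with ###".
starts_with_3_digits = re.compile(r"^\d{3}")

# Step 2 (parse): the pattern from the reply, unchanged.
# Group 1 is everything before " [Orig] ", group 2 is everything after.
split_on_orig = re.compile(r"(\d.+) \[Orig\] (.+)")

parsed = []
for row in rows:
    if not starts_with_3_digits.match(row):
        continue  # drops "SECTION 0: ..." and "Table 042-1 ..."
    m = split_on_orig.match(row)
    if m:
        parsed.append((m.group(1), m.group(2)))

for outline_ref, text in parsed:
    print(f"{outline_ref!r:12} {text!r}")
```

Each tuple in `parsed` corresponds to the two output columns asked for in the question: the outline reference in column 1 and the text after [Orig] in column 2.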

 

Ben

hellyars
13 - Pulsar

Thank you. I got it to work using the less direct regex Filter formula regex_match([Field 1],"^\d.*"). But this (especially the regex parse tool) helps get there quicker. Thanks.
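
One thing worth noting about the filter pattern above: `^\d.*` accepts any row that starts with a single digit, not specifically three. The small Python sketch below compares it with a stricter variant, `^\d{3}.*`, which is my own suggestion rather than something from the thread. It assumes Python's `re` behaves like Alteryx's engine for these patterns, and the row "7 spare parts" is a made-up example that is not in the source data.

```python
import re

# Pattern from the post: any row whose first character is a digit.
loose = re.compile(r"^\d.*")

# Hypothetical stricter variant: the row must begin with three digits.
# The trailing .* keeps it usable where the whole string has to match.
strict = re.compile(r"^\d{3}.*")

samples = [
    "042-1 [Orig] General",                         # from the question
    "7 spare parts",                                # invented row, illustration only
    "Table 042-1 [Orig] The objective items blah",  # from the question
]

loose_hits = [s for s in samples if loose.match(s)]
strict_hits = [s for s in samples if strict.match(s)]

print("loose :", loose_hits)
print("strict:", strict_hits)
```

On the sample data in the question both patterns keep the same rows, so the difference only matters if the source contains lines that start with one or two digits.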
