Advent of Code is back! Unwrap daily challenges to sharpen your Alteryx skills and earn badges along the way! Learn more now.

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

How to filter rows that start with ### and then parse between ### and Text

hellyars
13 - Pulsar

Okay, I have a nightmare (for my novice eyes) data source.  

 

The source document is a PDF.  The PDF does not export nicely to XML or CSV and it will not import into Excel in a format that I can use.  I was able to get it into Alteryx as a CSV input.   But, everything is trapped in one field.  There are a few exceptions, but first things first.

 

 

Fortunately, the source document is outlined.  

 

The vast majority of what I need to extract is prefaced with a numbered outline reference.  The format appears to be ###-#.#.#....

 

1.  How can I filter out those rows that do not start ###?  Let me clarify, it must start with a 3-digit number ###.

2.  There is always an [Orig] between the outline ### and the text.  How can I get the parse tool to use the [Orig] to output everything before it to column 1 and everything after it to column 2?

 

Thanks

 

 

 

Field 1    
SECTION 0: [Orig] General Spec Admin    
042 [Orig] General Spec Requirements    
042-1 [Orig] General    

042-1.2 [Orig] Items blah blah blah

    

042-1.3 [Orig] The Contractor blah blah blah

    

Table 042-1 [Orig] The objective items blah

    

 

 

 

2 REPLIES 2
BenMoss
ACE Emeritus
ACE Emeritus

This is one for regex.

 

Firstly I use a regex_match to filter to only those statements starting with 3 characters and then use the regex option in parse mode to pull out the before and after [Orig].

 

(\d.+) \[Orig\] (.+)

in theory it could be

 

(.+) \[Orig\] (.+)

Workbook attached.

 

Ben

hellyars
13 - Pulsar

Thank you. I got it to work using the less direct regex Filter formula regex_match([Field 1],"^\d.*").   But, this (especially the regex parse tool) helps get their quicker.  Thanks. 

Labels