Alteryx Designer Desktop Discussions

baarthe · ‎09-28-2020

Hi,

After having spent multiple hours on the Regex functions, which I find to be one of the most complex Alteryx functions, I am hoping some of you will be able to unblock me.

My data has been built using various framework, including the NIST security framework which is made of a number of controls. Each NIST control is part of a control family, has a control name and a control ID

- control family name is in upper cases (for ex: "MEDIA PROTECTION")

- control name is also in upper cases (for ex: "POLICY AND PROCEDURES")

- control ID is always structured as follow: 2 upper cases letter, a dash and then 2 numbers (for ex: "MP-09")

I have a text column where each cell can include letters (lower and upper cases), numbers, special characters and break lines. At some point, a pattern will emerge. I have added the following example to better illustrate the pattern:

"NIST SP 800-53 (Revision 4):MEDIA PROTECTION:MP-01 MEDIA PROTECTION

POLICY AND PROCEDURES:

MP-03 MEDIA MARKING:

SYSTEM AND INFORMATION INTEGRITY:SI-12 INFORMATION HANDLING AND

RETENTION:"

In the above example, one would find a number of irrelevant characters before and after the example. In this pattern, I am trying to extract the control IDs which would be MP-01, MP-03, SI-12 (highlighted in red above). A few comments:

- the breaking line is not always consistent

- In some occasions there may be no control ID (in which case the text above would not exist at all)

- In other occasions, there may be one or more control families, each having one or more control IDs

- the control family is only mentioned once, followed by a number of controls within that family

My best attempt at extracting the IDs have been to use Regex with the below rule, but it only extracts the first control ID:

"(?:NIST\sSP\s800-53\s\(Revision\s4\):)(?:[A-Z\s]+):([A-Z]{2}-\d{2})\s(?:[A-Z\s]+):\n"

I feel like the Parse function of Regex should be able to find these IDs. I am hoping you guys can provide some help or an alternative solution.

Thanks a lot everyone for your insight.

Barthe

echuong1 · ‎09-28-2020

Are you only looking for those IDs? It looks like it's always two uppercase letters, a dash, and then two numbers. In that case you can use (\u{2}\-\d{2}). Using Tokenize with Split to Rows will get every instance.

atcodedog05 · ‎09-28-2020

Hi @baarthe ,

You can use tokenize option in Regex with below expression.

\u\u-\d\d

Input

Output

Workflow using data cleaning to remove line breaks

You can use a summarize to concat all IDs

Hope this helps : )

If this helps please mark the post as solution.

MarqueeCrew · ‎09-28-2020

@baarthe ,

Erica @echuong1 has provided you with a possible solve for your post and I'm going to add my two cents. Alteryx supports Regular Expressions for pattern matching. The Perl-5 syntax is the specific form of RegEx supported by Alteryx. The complexity of this is hopefully simplified within Alteryx.

RegEx isn't always the solution that is best. I could solve your pattern without RegEx, it would just take more instructions. Once you learn how to solve regular expression problems, RegEx becomes your friend. If you search on YouTube for MarqueeCReW and RegEx you'll find some videos that I've put out on the subject.

When I help with RegEx I ask what pattern you see. I only allow simple answers that involve no coding. Then I break the answer into parts.

I find a dash. Following the dash I see exactly 2 numbers. Before the dash I see exactly 2 uppercase letters.

Now you can decide on whether to invest in regex patterns or use strong functions. If the text includes multiple dashes, you'll have a harder time. StartWith a recordid then You can use a text to columns tool and parse to rows on a space. Follow that by a filter and get results of all dashed words.

using left, right and substring functions you can test the values for being in the right range.

join your results back and you're done. Else use regex and amaze your friends and coworkers.

cheers,

mark

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.

baarthe · ‎09-28-2020

Thanks everyone for your responses, much appreciated 🙂

I apology as I realized my initial message may have missed some information. There will be some controls IDs with the same format before and after which I would not want to extract.

Eventually, the relevant section where control IDs should be extracted from will always start with "NIST SP 800-53 (Revision 4):". Additionally, the end of this section can be determined by looking for letters in lower cases.

For example:

"Cyber: AA-01 data

NIST SP 800-53 (Revision 4):MEDIA PROTECTION:MP-01 MEDIA PROTECTION

POLICY AND PROCEDURES:

MP-03 MEDIA MARKING:

SYSTEM AND INFORMATION INTEGRITY:SI-12 INFORMATION HANDLING AND

RETENTION:

New control TD-05 and other type of irrelevant information (including special characters and numbers 123) in upper and lower cases"

In the above example, I would disregard any control IDs such as

- AA-01 which are provided before "NIST SP 800-53 (Revision 4)"

- TD-05 as lower case letters have appeared since

I have used the parse function but unfortunately, it identifies all control IDs.

I hope this help and apology for the lack of clarity.

atcodedog05 · ‎09-28-2020

Hi @baarthe ,

It took me a while but here is a workflow for the task.

Input (using example provided by you)

Output

Workflow

The formula tool wraps up and picks the text like below.

Later finds the IDS

Hope this helps : )

If this was able to help you please mark the post as solution.

baarthe · ‎09-30-2020

Thanks @atcodedog05, this is perfect and does the job. I didn't know the FindReplace function, very happy to learn about this !!

Thanks everyone for your insights!!!

Closing this thread.

atcodedog05 · ‎09-30-2020

Happy to help : ) @baarthe

Cheers and happy analysing : )

Alteryx Designer Desktop Discussions

Regex Function Formula