Hi,
After having spent multiple hours on the Regex functions, which I find to be one of the most complex Alteryx functions, I am hoping some of you will be able to unblock me.
My data has been built using various framework, including the NIST security framework which is made of a number of controls. Each NIST control is part of a control family, has a control name and a control ID
- control family name is in upper cases (for ex: "MEDIA PROTECTION")
- control name is also in upper cases (for ex: "POLICY AND PROCEDURES")
- control ID is always structured as follow: 2 upper cases letter, a dash and then 2 numbers (for ex: "MP-09")
I have a text column where each cell can include letters (lower and upper cases), numbers, special characters and break lines. At some point, a pattern will emerge. I have added the following example to better illustrate the pattern:
Solved! Go to Solution.
Hi @baarthe ,
You can use tokenize option in Regex with below expression.
\u\u-\d\d
Input
Output
Workflow using data cleaning to remove line breaks
You can use a summarize to concat all IDs
Hope this helps : )
If this helps please mark the post as solution.
@baarthe ,
Erica @echuong1 has provided you with a possible solve for your post and I'm going to add my two cents. Alteryx supports Regular Expressions for pattern matching. The Perl-5 syntax is the specific form of RegEx supported by Alteryx. The complexity of this is hopefully simplified within Alteryx.
RegEx isn't always the solution that is best. I could solve your pattern without RegEx, it would just take more instructions. Once you learn how to solve regular expression problems, RegEx becomes your friend. If you search on YouTube for MarqueeCReW and RegEx you'll find some videos that I've put out on the subject.
When I help with RegEx I ask what pattern you see. I only allow simple answers that involve no coding. Then I break the answer into parts.
I find a dash. Following the dash I see exactly 2 numbers. Before the dash I see exactly 2 uppercase letters.
Now you can decide on whether to invest in regex patterns or use strong functions. If the text includes multiple dashes, you'll have a harder time. StartWith a recordid then You can use a text to columns tool and parse to rows on a space. Follow that by a filter and get results of all dashed words.
using left, right and substring functions you can test the values for being in the right range.
join your results back and you're done. Else use regex and amaze your friends and coworkers.
cheers,
mark
Thanks everyone for your responses, much appreciated 🙂
I apology as I realized my initial message may have missed some information. There will be some controls IDs with the same format before and after which I would not want to extract.
Eventually, the relevant section where control IDs should be extracted from will always start with "NIST SP 800-53 (Revision 4):". Additionally, the end of this section can be determined by looking for letters in lower cases.
For example:
"Cyber: AA-01 data
NIST SP 800-53 (Revision 4):MEDIA PROTECTION:MP-01 MEDIA PROTECTION
Hi @baarthe ,
It took me a while but here is a workflow for the task.
Input (using example provided by you)
Output
Workflow
The formula tool wraps up and picks the text like below.
Later finds the IDS
Hope this helps : )
If this was able to help you please mark the post as solution.
Thanks @atcodedog05, this is perfect and does the job. I didn't know the FindReplace function, very happy to learn about this !!
Thanks everyone for your insights!!!
Closing this thread.