I want to parse FIELD1 into tow groups: GROUP_1 and GROUP_2.
My REGEX tool is set to parse. My REGEX is:
(^[A-Z]{3,4}).*?(\b\d{4}\b)
This works for the majority of the records, but it does not work for #4 and #7.
RecordID | FIELD1 | GROUP_1 | GROUP_2 |
1 | GHY/5010 | GHY | 5010 |
2 | TRY/4268 | TRY | 4268 |
3 | SDE CTC/5300 | SDE | 5300 |
4 | CMP/652000 | CMP | 6520 |
5 | GVW/0605/J0449 | GVW | 0605 |
6 | CMP/0204567M/6518 | CMP | 6518 |
7 | PNA/0449C | PNA | 0449 |
8 | CMNAP/5678 | CMNAP | 5678 |
GROUP_1 is always the first letter group - as long as it is either 3 or 5 characters (not 4).
GROUP_2 is trickier. The target is a 4 digit number group. It will always only be a numbers group.
In #4 two additional zeros have been added. So, 4 digits followed by two zeroes and only two zeroes tells me it is okay to use the first 4-digits in Group 2.
In #7 the 4-digit code is there, but it is followed by a C. The presence of the C tells me its okay to ignore it and use the 4-digits before it in Group_2
How can I modify my REGEX to account for #4 and #7?
Solved! Go to Solution.
Hi E,
It works. Using your logic, I was also able to modify my original expression to (^[A-Z]{3,4}).*(\d{4}) by removing the ? and \b. But, I went with yours.
QUESTION: In the real data, I realize I need to keep the C and parse it into a 3rd group if and when it is present. How could I modify your expression to account fo this? I tried (\u{3,5}).*/(\d{4}).*|\u{3,5}.*/\d{4}.([C]).* but Alteryx just laughed.
Oh, I still can't get the Intel Suite to work with my PDFs. I think the Parallels VM tax is the biggest culprit, but I have not had a chance to test on a pure PC.
I assumed you're referring to the C at the end of record 7?
Try the following: (\u{3,5}).*/(\d{4})(C)?.*
It is looking for the same as before, but I included an optional marked group containing a C. The ? at the end makes it optional.
If this resolves your issue, please mark this thread as solved, so others can find answers more easily. Thanks!
I like @echuong1 ‘s expression but if you want to stay true to your “the the first set of letters are either 3 or 5 but never 4,” then you should use a catch for that to keep it clear:
((?:[A-Z]{3}|[A-Z]{5})).*\/(\d{4})(C)?