Hey, could you please help me extract the voucher number for the examples below?
I need the number that comes right after EMR|LRN|ARB|NRT|TTT, using regex.
Description | Expected Result |
Account payable 126 EMR 00012345 dddd | 00012345 |
BCX payable 157 LRN 00012512 12357 | 00012512 |
ARB 1234566 ddd799 | 1234566 |
NRT 12345600 89766 dddd | 12345600 |
TTT 1234565 55566 ddddd | 1234565 |
Thanks,
Neeta
@Neeta_Latwal hello there!
I can strongly recommend using https://regex101.com/ to test, build and debug Regex in real-time. Very helpful tool!
the expression I got was this :
(EMR|LRN|ARB|NRT|TTT).(\d*)
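which defines TWO capture groups:
we anchor to the FIRST group, which is one of the 3-letter codes EMR|LRN|ARB|NRT|TTT
the single . after it matches the one character (a space in your samples) that separates the code from the number
the SECOND group is what we're interested in - the run of digits that follows that separator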
You only need to tokenise the SECOND group, and the RegEx tool will ignore the rest of the string which you don't need.
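If you want to sanity-check the expression outside regex101, here is a minimal sketch in Python (the choice of Python and its `re` module is an assumption for illustration, and `extract_voucher` is a made-up helper name, not part of any tool). It runs the expression over the five sample rows and keeps only the second group:

```python
import re

# Expression from the post: group 1 is the 3-letter code,
# "." consumes the single separator character, group 2 is the digits.
PATTERN = re.compile(r"(EMR|LRN|ARB|NRT|TTT).(\d*)")

samples = [
    "Account payable 126 EMR 00012345 dddd",
    "BCX payable 157 LRN 00012512 12357",
    "ARB 1234566 ddd799",
    "NRT 12345600 89766 dddd",
    "TTT 1234565 55566 ddddd",
]


def extract_voucher(description):
    """Return the digits captured after the first matching code, or None."""
    match = PATTERN.search(description)
    return match.group(2) if match else None


vouchers = [extract_voucher(s) for s in samples]
print(vouchers)
```

The leading zeros survive because the result is kept as text rather than converted to a number.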
Good luck!
I second regex101.com. You paste in the text to parse and then as you write the coding, the relevant elements are colour-coded within the text so you can troubleshoot where you are going wrong and learn what works.
If necessary, you could further refine @izamryan 's answer by adding a quantifier to the number of digits - so, if you know it's always going to be at least 7 digits long, you could write: (EMR|LRN|ARB|NRT|TTT).(\d{7,})
Where the 7 is the minimum and after the comma no maximum is given. If it is always 7-9 digits long, you could instead write: (EMR|LRN|ARB|NRT|TTT).(\d{7,9})
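To see what the quantifier buys you, here is a small Python sketch (Python is an assumption for illustration; the "bad" record and the `second_group` helper are hypothetical, not from the sample data). It compares the original expression with the 7-9 digit version on a good row and on a row with a too-short number:

```python
import re

LOOSE = re.compile(r"(EMR|LRN|ARB|NRT|TTT).(\d*)")
STRICT = re.compile(r"(EMR|LRN|ARB|NRT|TTT).(\d{7,9})")


def second_group(pattern, text):
    """Return capture group 2 of the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(2) if match else None


good = "ARB 1234566 ddd799"   # row from the sample data
bad = "ARB 12 ddd799"         # hypothetical malformed record

loose_good = second_group(LOOSE, good)    # "1234566"
strict_good = second_group(STRICT, good)  # "1234566"
loose_bad = second_group(LOOSE, bad)      # "12" slips through
strict_bad = second_group(STRICT, bad)    # None, so the record is rejected

print(loose_good, strict_good, loose_bad, strict_bad)
```

With the stricter version, the short number produces no match at all, which is easier to catch downstream than a silently wrong value.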
@TeePee very nice! More specific is good aye.
Using my more generic regex could "accidentally" match a bad record! And that unintentional case might throw the workflow off later, if you don't do data hygiene checks later in the workflow.
So some data validation within the regex itself is good!
👍 👍
I like to structure my workflows in broad blocks of "Ingest it", "Check it", "Clean it", "Blend it", "Analyse it", "Report it" (sung to the tune of that Daft Punk song ... .. .)
+1 for regex101.com, it's great!
@Neeta_Latwal my only two cents to add is that if the EMR|LRN|ARB|NRT|TTT list you've given isn't prescriptive* but rather a reflection of the sample data you've given (i.e. there are more combinations of 3 letter codes in your current data, or there is a chance that more may exist in your data in future) then you could use [A-Z]{3} instead of EMR|LRN|ARB|NRT|TTT in the expressions above to make them more flexible. This just means the expression will look for an uppercase letter [A-Z] repeated three times {3}
Using this in the second offered solution/expression (the one with the digit quantifier) would give you the expected results. But a word of warning for the first solution/expression: it takes whatever digits follow the first combination of 3 uppercase letters it can find. On the second row of sample data that is BCX, which has no digits after it, so you would get an empty result. And if that row was "BCX 1st payable 157 LRN 00012512 12357", it would return 1 instead of 00012512.
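A quick way to see the [A-Z]{3} pitfall is the Python sketch below (Python is an assumption for illustration, matching is case-sensitive here, and `second_group` is a made-up helper). It runs the generic code pattern with and without a minimum digit count over the "BCX 1st payable" row:

```python
import re

GENERIC_LOOSE = re.compile(r"([A-Z]{3}).(\d*)")
GENERIC_STRICT = re.compile(r"([A-Z]{3}).(\d{7,})")


def second_group(pattern, text):
    """Return capture group 2 of the first match, or None if nothing matches."""
    match = pattern.search(text)
    return match.group(2) if match else None


row = "BCX 1st payable 157 LRN 00012512 12357"

loose = second_group(GENERIC_LOOSE, row)    # "1": BCX is the first 3-letter hit
strict = second_group(GENERIC_STRICT, row)  # "00012512": BCX fails the 7-digit test

print(loose, strict)
```

Requiring at least 7 digits forces the search past BCX and on to LRN, which is why combining the generic code pattern with a quantifier is the safer of the two.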
TL;DR: there are many different regex expressions you could use based on just the sample data, but how specific or convoluted you need to go will depend on how variable your wider dataset is. Worth doing lots of testing (regex101.com will be your best friend!) to make sure you end up with the right answer 😊
Cheers
Lauren
*this whole answer can be ignored if the response to this is "nope, this list is prescriptive!" 😄