London, UK

Welcome to the London User Group

Click in the JOIN GROUP button in Home to follow our news and attend our events!

Regex

Neeta_Latwal
7 - Meteor

Hey, Could you please help me to extact the voucher number for below examples:- 

 

Anything after EMR|LRN|ARB|NRT|TTT in regex

 

DescriptionExpected Result 
Account payable 126 EMR 00012345 dddd 00012345 
BCX payable 157 LRN 00012512 1235700012512
ARB 1234566 ddd7991234566

NRT 12345600 89766 dddd

 

12345600
TTT 1234565  55566 ddddd

  

1234565

 

Thanks,

Neeta

4 REPLIES 4
izamryan
8 - Asteroid

@Neeta_Latwal hello there!

I can strongly recommend using https://regex101.com/ to test, build and debug Regex in real-time. Very helpful tool!

 

the expression I got was this :

 

(EMR|LRN|ARB|NRT|TTT).(\d*)

 

which starts TWO match groups :

we anchor to the FIRST group, which is a 3-letter combo of either EMR|LRN|ARB|NRT|TTT

the SECOND group is what we're interested in - a set of digits that are surrounded by a space

 

You only need to tokenise the SECOND group, and the RegEx tool will ignore the rest of the string which you don't need.

 

Good luck!

TeePee
8 - Asteroid

I second regex101.com.  You paste in the text to parse and then as you write the coding, the relevant elements are colour-coded within the text so you can troubleshoot where you are going wrong and learn what works.   

 

If necessary, you could further refine @izamryan 's answer by adding a quantifier to the number of digits - so, if you know it's always going to be at least 7 digits long, you could write:  (EMR|LRN|ARB|NRT|TTT).(\d{7,})

 

Where the 7 is the minimum and after the comma no maximum is given.  If it is always 7-9 digits long, you could instead write: (EMR|LRN|ARB|NRT|TTT).(\d{7,9})

 

  • The first part looks for the letters provided. 
  • The pipes | are OR. 
  • The brackets capture a group. 
  • The backslash "escapes" the next character so \d means "any digit". 
  • The curly brackets { } specifier a quantifier and the numbers inside are the min and max (both optional) separated by a comma.  
izamryan
8 - Asteroid

@TeePee very nice! More specific is good aye.

 

Using my more generic regex could "accidentally" match a bad record! And that unintentional case might throw the workflow off later, if you don't do data hygiene checks later in the workflow.

 

So some data validation within the regex itself is good!

 

👍 👍

 

I like to structure my workflows in broad blocks of "Ingest it", "Check it", "Clean it", "Blend it", "Analyse it", "Report it" (sung to the tune of that Daft Punk song ... .. .)

lscsparks95
5 - Atom
5 - Atom

+1 for regex101.com, it's great!

 

@Neeta_Latwal my only two cents to add is that if the EMR|LRN|ARB|NRT|TTT list you've given isn't prescriptive* but rather a reflection of the sample data you've given (i.e. there are more combinations of 3 letter codes in your current data, or there is a chance that more may exist in your data in future) then you could use [A-Z]{3} instead of EMR|LRN|ARB|NRT|TTT in the expressions above to make them more flexible. This just means the expression will look for an uppercase letter [A-Z] repeated three times {3} 

 

Using this in either of the two offered solutions/expressions would give you the expected results. But a word of warning- if the second row of sample data was "BCX 1st payable 157 LRN 00012512 12357", it would give you the wrong result with the first solution/expression (it would return 1 instead of 00012512 as it would take all digits after BCX, being the first combination of 3 uppercase letters it can find). 

 

TLDR; there's many different REGEX expressions you could use based on just the sample data, but how specific or convoluted you need to go will depend on how variable your wider dataset is. Worth doing lots of testing (regex101.com will be your best friend!) to make sure you end up with the right answer 😊

 

Cheers

Lauren

 

*this whole answer can be ignored if the response to this is "nope, this list is prescriptive!" 😄