I have extracted a dataset from a PDF which leaves everything in the first cell of each line. The data can easily be delimited the text to columns function using a space a space as the delimiter. However, in the middle of the dataset we have the company name, and then further numbers. When I delimit using a space, it splits the company name up also so the numbers become unaligned. Is it possible to use the regex tool to parse a word from a number to prevent this from happening? Example below. Thanks!
100 200 300 Alteryx 400 500 |
100 200 300 The Alteryx Community 400 500 |
100 200 300 Alteryx Resources 400 500 |
Solved! Go to Solution.
If you know the 1st 3 and last 2 columns are numbers, you can use Regex parse like this:
Hi,
Try the following RegEx
[\d\s]*([^\d]+)[\d\s]*
This will look for any number of numbers and spaces, until it finds a non-numeric value. Then it will include all characters until a digit is found again, and set this as a Marked Group.
Use the Parse output method in the RegEx tool to see if this works for your use case.
From there you have a few options:
You can make the regular expression more complicated (leverage the "Tokenize" method and build an expression that delimits on spaces only if they are next to a digit)
You can use the Replace function in a regular formula, to replace the company name with nothing, and then leverage text to columns like you were before on the remaining values.
Hope this helps!
Thanks for replying so quickly. Unfortunately this has created several new columns but they are all [Null]. Some of the numbers are in the thousands with , (e.g. 1,924) - could this affect the regex formula? Or does the field need to be in a certain data format to process perhaps?
Yes - having other characters in the numeric values could impact how this script works.
Here's an example that also looks at commas and decimals as being part of numbers (as long as they are not within a company name)
[\d\s,\.]*([^\d]+)[\d\s,\.]*
the only change between this and the prior script I had suggested is the addition of ",\." before and after the company name parse. the "\." says to look for a decimal character, as "." is a RegEx special character.
Thanks @Claje, with a couple of extra find/replaces and formulas to remove some punctuation it's pretty much working. Only thing left is to stop the word picking up the brackets of negative numbers that follow the company name, but hopefully I can fix this with something else!
Thanks very much :)