Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Challenge putting data in different columns from excel - Difficult format

sriniprad08
11 - Bolide

Hi Team,

 

Hope you are well. I am working on a project where i have to extract information from the invoice. I have used python to did that. The excel file exported from python with all the information from invoice. unfortunately it's not in the desired format. 

Please find attached the input file and the desired output file. Ideally i am looking for help where the workflow picks up the keyword from the excel file the corresponding value from the input file (for e.g Amount, total, invoice number, invoice date) from each row of the input file and create a new output file with all the values and columns.

 

Please let me know how to go about it, i need your help,

 

Thanks,

17 REPLIES 17
clmc9601
13 - Pulsar
13 - Pulsar

Hi @sriniprad08,

 

If you're asking how to come up with a regex solution-- that's a great question! Everyone thinks about it slightly differently, but I've outlined part of my thought process:

 

- I learned regex with the brief lessons on regexone.com and the Alteryx interactive lessons

- I practice using regex with my own data and within Alteryx challenges (61, 59, 47, 40, 20, 13 to name a few). Practice is the best way to get comfortable writing regex.

- When I get stuck writing a regex formula in Alteryx (which is often!), I instead type it into rubular.com because it shows me exactly what the capture groups will be. I prefer that editing environment for regex because it's easier to see the effects of any changes.

- When I'm writing regex, I look for patterns. Regex is awesome for working with patterns. In your data, the patterns were tricky! Here are some of the ones I saw

  • The totals were inconsistently placed around the data, but (!) the total was always the largest and last number
  • Most (but not all) of the dollar amounts had a $ in front of them. I used regex to capture all strings that had a $ and any attached numbers with no spaces.
  • Dates had a consistent format so were the easiest part with the regex. I looked for a string as follows: numberCharacter{1 or 2 repetitions} forwardSlashCharacter{single repetition} numberCharacter{1 or 2 repetitions} forwardSlashCharacter{single repetition} numberCharacter{2 to 4 repetitions}
  • The invoice numbers were the hardest part because the placement was not consistent and the type of string desired was not consistent. The main patterns I saw were that it was sort of close to the word "invoice", contained at least some numbers, and was not a dollar amount. So I removed all the dollar amounts in the strings before attempting to parse the invoice numbers.

So my main recommendation when using regex is to look for patterns! Location within string at large, type of characters, repeated characters, etc. All of these can be utilized by regex.

sriniprad08
11 - Bolide

Hi @clmc9601 ,

 

Thank you so much for sharing the detailed information. it's wonderful. I was missing this. Not sure where to start. I was going through the workflow you shared. It has two regex components for invoice number. I tried using that in python but i couldn't get it. 

 

But overall it was nice learning for me.

 

Cheers,

Sri

sriniprad08
11 - Bolide

hi @clmc9601 ,

 

Can you please help  me in explanation of the Regex for Invoice number. I am finding it difficult to comprehend the same in python? 

 

Thanks,

clmc9601
13 - Pulsar
13 - Pulsar

Hi @sriniprad08, sure. I'll explain both regex tools I used for parsing invoice number:

 

First: .{16}\w\wVOICE.{16}

This pulls the 16 characters before and after any instance of the string "..VOICE" with the dots here representing any letter character. In regex this is \w. The string is case insensitive.

 

Second: ((?:[\d\-]{2,}[\w\-]{0,})|(?:[\w\-]{0,}[\d\-]{2,}))

This pulls the actual invoice number. It's a general regex statement because there is such a variety of what could be considered an invoice number in your example. I'll break it apart below:

  • ((?:..........)|(?:.........)) this part says to capture the outer set of parentheses, which contain either (?: ) or (?: ) characters. The ?: just mean I'm using a group but do not want to store the group (called "unmarked").
  • Inside the first unmarked group: [\d\-]{2,}[\w\-]{0,} this says the first two or more characters should either be a number or a hyphen character and the second zero or more characters should be a letter or a hyphen character
  • Inside the second unmarked group: [\w\-]{0,}[\d\-]{2,} this is the opposite logic of the first unmarked group (recall that there is OR logic between them). The first zero or more characters should be a letter or a hyphen and the second two or more characters should be numbers.

 

I hope this helps!

 

sriniprad08
11 - Bolide

Thank you @clmc9601  . Amazing. Really helpful. I will come back if any questions.

 

Cheers,

Sri

sriniprad08
11 - Bolide

Hi @clmc9601 ,

 

Thank you for all the help. Currently i am trying with this new input data with 2 more rows. Can you please guide what changes i need to make to get the details for all the 5 rows of data.

 

Cheers

clmc9601
13 - Pulsar
13 - Pulsar

Hi @sriniprad08,

 

I'm sure there is a way to do this, but to be honest, I'm having a hard time coming up with any ideas. I have a few questions about the data:

Do those words in the new rows mean anything to you? Is it a different language? Regex works best when there is a consistent pattern, and the pattern I found before no longer applies to these rows. What patterns do you see? Patterns like number of numeric characters or letter characters, uppercase vs lowercase, specific punctuation, etc.

 

I'm struggling to even tell where the invoice number or totals are in those strings. Is this the actual data you're working with (not a sample)? Do you know why it looks so different from the images you posted? Sorry for so many questions. If you are working with a small number of rows and don't have much influence over scraping the data, it might be faster to just extract the information manually.

sriniprad08
11 - Bolide

Hi @clmc9601 ,

 

Thank you for the reply. Sorry if i am not clear. The invoices which i have is close to 100 in numbers and all are in different format. I was trying to develop a POC to see if that works with smaller number of cases. I think the regex won't be a solution for large numbers.

 

To answer your question i think the formats were part of the first question i posted two days back. 

 

Thank you for all the help.

 

 

Labels