Hi Team,
Hope you are well. I am working on a project where i have to extract information from the invoice. I have used python to did that. The excel file exported from python with all the information from invoice. unfortunately it's not in the desired format.
Please find attached the input file and the desired output file. Ideally i am looking for help where the workflow picks up the keyword from the excel file the corresponding value from the input file (for e.g Amount, total, invoice number, invoice date) from each row of the input file and create a new output file with all the values and columns.
Please let me know how to go about it, i need your help,
Thanks,
Solved! Go to Solution.
Hi @sriniprad08,
If you're asking how to come up with a regex solution-- that's a great question! Everyone thinks about it slightly differently, but I've outlined part of my thought process:
- I learned regex with the brief lessons on regexone.com and the Alteryx interactive lessons
- I practice using regex with my own data and within Alteryx challenges (61, 59, 47, 40, 20, 13 to name a few). Practice is the best way to get comfortable writing regex.
- When I get stuck writing a regex formula in Alteryx (which is often!), I instead type it into rubular.com because it shows me exactly what the capture groups will be. I prefer that editing environment for regex because it's easier to see the effects of any changes.
- When I'm writing regex, I look for patterns. Regex is awesome for working with patterns. In your data, the patterns were tricky! Here are some of the ones I saw
So my main recommendation when using regex is to look for patterns! Location within string at large, type of characters, repeated characters, etc. All of these can be utilized by regex.
Hi @clmc9601 ,
Thank you so much for sharing the detailed information. it's wonderful. I was missing this. Not sure where to start. I was going through the workflow you shared. It has two regex components for invoice number. I tried using that in python but i couldn't get it.
But overall it was nice learning for me.
Cheers,
Sri
hi @clmc9601 ,
Can you please help me in explanation of the Regex for Invoice number. I am finding it difficult to comprehend the same in python?
Thanks,
Hi @sriniprad08, sure. I'll explain both regex tools I used for parsing invoice number:
First: .{16}\w\wVOICE.{16}
This pulls the 16 characters before and after any instance of the string "..VOICE" with the dots here representing any letter character. In regex this is \w. The string is case insensitive.
Second: ((?:[\d\-]{2,}[\w\-]{0,})|(?:[\w\-]{0,}[\d\-]{2,}))
This pulls the actual invoice number. It's a general regex statement because there is such a variety of what could be considered an invoice number in your example. I'll break it apart below:
I hope this helps!
Hi @clmc9601 ,
Thank you for all the help. Currently i am trying with this new input data with 2 more rows. Can you please guide what changes i need to make to get the details for all the 5 rows of data.
Cheers
Hi @sriniprad08,
I'm sure there is a way to do this, but to be honest, I'm having a hard time coming up with any ideas. I have a few questions about the data:
Do those words in the new rows mean anything to you? Is it a different language? Regex works best when there is a consistent pattern, and the pattern I found before no longer applies to these rows. What patterns do you see? Patterns like number of numeric characters or letter characters, uppercase vs lowercase, specific punctuation, etc.
I'm struggling to even tell where the invoice number or totals are in those strings. Is this the actual data you're working with (not a sample)? Do you know why it looks so different from the images you posted? Sorry for so many questions. If you are working with a small number of rows and don't have much influence over scraping the data, it might be faster to just extract the information manually.
Hi @clmc9601 ,
Thank you for the reply. Sorry if i am not clear. The invoices which i have is close to 100 in numbers and all are in different format. I was trying to develop a POC to see if that works with smaller number of cases. I think the regex won't be a solution for large numbers.
To answer your question i think the formats were part of the first question i posted two days back.
Thank you for all the help.