Hi alteryx community!
I've been working on a problem and have not been able to come to a solution, I'm hoping someone can assist me. I believe the answer to this problem will be using a regex function to parse the data but I haven't been able to figure out how to create a formula that solves all my criteria. See details of my issue below:
- I used the 'pdf input' tool to convert a ~750 page pdf into an excel file which was used as the starting point in the workflow. After converting the pdf into excel all the columns were combined into one.
- I would like to separate the following: date, names (include first, middle and last), first set of two digits (these don't exist on all rows of data, see row 8 and 9 in the attached excel for an example), next five digits (also don't exist on all rows of data), the text after the previous five digits (should align with the text that has no digits preceding it - e.g. football, grass would be in the same column as left wing player - ice hockey etc.), the amount, and the three capitalized initials. As mentioned above, I believe the way to solve this would be through the use of a regex formula but I haven't been able to create one that can solve each of the above criteria.
- The telephone numbers at the end are irrelevant and can be removed.
Also worth noting: in some instances there is not telephone number at the end of the row of data (i.e. the three initials are the last inputs in the row). There are also a varying number of spaces between certain numbers/words in each of the rows.
Below is a screenshot of how I'm trying to get my data to end up looking:
Appreciate any help, thanks!
Solved! Go to Solution.
Hi @kge5293
It wasnt easy but was able to crack it.
Here is a workflow for the task.
Output:
Workflow:
Hope this helps 🙂 Please check and let me know
If this post helps you please mark it as solution. And give a like if you dont mind 😀👍
Thanks @atcodeog05 this is extremely helpful and solves 99% of my problem! When I was reviewing the data there were a handful of rows that didn't get parsed in the same format as the ones I initially shared I think because of minor nuances in their structure. Is there something that can be tweaked in the regex to parse these as well?
In the attached file the first row wasn't properly parsed during the first regex (no data was carried forward) and the remaining rows lost data after the second regex (only the amount and three initials were lost).
Hi @kge5293
I will take a look and get back to you. I could agree with you the data is badly structured.
Hi @kge5293
Here is the modified workflow which can handle new cases.
Output
Please check and let me know.
@atcodedog05 Thank you!
I am a newbie here, I am trying to figure out how to use the PDF Input to convert PDFs to Excel, how do you do that?