Extracting a column of data from a text file using RegEx (regular expressions)
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
I'm trying to extract some data from a .txt file. The data looks messy with stacked headings:
Form 1
First Name Last Name Age
Occupation Organisation Residency
------------------------------------------------------------
Jonna Karly 53
Instructor NYU Los Angeles
Hsiao-Ran Liu 22
Student NYU Kaohsiung
Form 2
First Name Last Name Age
Occupation Organisation Residency
------------------------------------------------------------
Jonna Karly 53
Instructor NYU Los Angeles
Hsiao-Ran Liu 22
Student NYU Kaohsiung
...
(Disclaimer: All data presented are fictional, no personal data is leaked.)
As can be seen, the headings are stacked on top of each other so each record has two rows of data. Note that the dashed line under the headings is also included in the text file. Also there're multiple forms sharing the same layout stacked on top of each other in the text file as shown here.
For the Jonna Karly record, First Name matches Jonna, Last Name matches Karly, Age matches 53, Occupation matches instructor, Organisation matches NYU, and Residency matches Los Angeles and so on.
Suppose that I'm trying to extract all the ages in the Age column (53, 22, etc.) for each form, how do I achieve this? I know regular expression can help, but I have absolutely no idea how to deal with this type of structure.
FYI, I've tried using delimiters when importing the text file into Alteryx Designer, but for whatever reason, the error "too many values in record #" continues to pop up, so I got to use no delimiter, and now everything's squeezed in one column. I believe RegEx is the only solution.
Solved! Go to Solution.
- Labels:
- Regex
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi @45179902 - Yes, you need to read it with no delimiters and then parse it. However, the structure of your file does matter a lot. Without it every solution will be just a guess. Can you please provide a sample file? It may have just a couple of rows and you can mask the values. However, the number of symbols in each row and symbols between columns should remain unchanged.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
@ArtApa Thank you for the response. Can you help parse out the data in the given example? If so, maybe I can try find out a way to modify it and apply to my files.
How do I provide a sample file? How exactly can I mask the data? The files are full of customer information, and I'm in the banking sector, and we have zero tolerance for leaking internal data and files.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator