Hello, I have string data in the following format:
"• General The general responsibilities of the role are XYZ • Physical Demands The Physical Demands of this role are XYZ • Education Requirements The education requirements for this role are • Bachelor's Degree • Appropriate Certification • Experience 5 years of experience is required"
Here, "General" is a heading, and the text that follows it is the description for that heading.
What I have done is used "•" as a delimiter, and replaced " " with "_". (these are supposed to be line breaks I believe, but in the text I have they are represented by 5 spaces). I need to basically put these in their own columns, where I could have:
General_The general responsibilities of this role are XYZ | Physical Demands_The Physical Demands of this role are XYZ | Education Requirements_The education requirements for this role are • Bachelor's Degree • Appropriate Certification | • Experience_5 years of experience is required |
I then want to isolate the Heading to pull it out so I can have a Heading Column and a Description column. I wrote code to make a column for the text before the "_" and after the "_" to make these columns. However, the issue lies in the "•" being within a description, in this example, the text "• Bachelor's Degree • Appropriate Certification".
Because of this, I receive the following result:
General_The general responsibilities of this role are XYZ | Physical Demands_The Physical Demands of this role are XYZ | Education Requirements_The education requirements for this role are | Bachelor's Degree | Appropriate Certification | Experience_5 years of experience is required |
Because of the bullet point, the Text-to-Columns tool creates a new column for Bachelor's Degree and Appropriate Certification, but I need it to be contained in the Education Requirements column.
I have been trying to find a way to delete or replace the "•" if the word/phrase it precedes does not have a "_" before the next "•". In other words, if there is a list of bullet points, but not an "_" (a line break) then I want to contain that in the same column, and not split that into its own column, as it should not be treated as a Heading.
Is there a way to do this with REGEX? I am a novice in REGEX and nothing I have been trying has seemed to work. Please note that the Headings can be anything - not just General, Physical Demands, Education Requirements, and Experience. Thank you so much.
Solved! Go to Solution.
One way using RegEx parsing:
Note, this is only dealing with the fact there are 4 groups that you are seeking.
Thank you, @BS_THE_ANALYST! That works perfectly for the above example - would you know what RegEx expression could be used to generalize this? Meaning, if there is an unknown number of headers and situations like the "• Bachelor's Degree • Appropriate Certification" situation can be anywhere in the cell?
I think it could work if I could find a way to only recognize a delimiter if it is in the format "• Header_" - is there a way to do that with RegEx?
Or, could I maybe do a RegEx replace where I can replace text in the format of "• Header_" with, say, "- Header_" ? Then I could use "-" as a delimiter and achieve that same result?
@ryanhartman32 we now have a dynamic way! I pushed your problem further along the line. There's an indepth explanation about how it works using Positive Lookaheads. Definitely more complex RegEx needed for this particular problem.
So your new dynamic worfklow is now attached. This will now account for more headers coming through: