Okay. I have a messy PDF table. It does not want to play nice. The following example (see below) comes across as 1 cell.
I tried using the delimiter \n. That did not work.
The key is the # at the start of a line. This is the LINE NUMBER. The LINE NUMBER is followed by the PROGRAM TITLE. The line or lines between LINE NUMBER / PROGRAM TITLE rows are COMMENTS. There may be one or more COMMENT rows associated with each LINE NUMBER / PROGRAM TITLE. How can I split this jumbled mess into these fields (see desired output below)?
I also want to get rid of all '.' (periods). This will leave some OCR errors, but I am okay with that (for now).
Okay. I had to change the field size to get the initial split to columns on the \n delimiter to work. The data cleansing tool takes care of the punctuation. From there, it was just a series of multi-row and REGEX parse steps.
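For anyone who wants to try the same logic outside the workflow, here is a rough Python sketch of the idea, not the workflow itself. It assumes a header row starts with an optional literal '#' and a number, followed by the program title, and that every other non-empty row is a comment belonging to the header above it. The names parse_cell and HEADER are made up for this example, and the pattern will likely need adjusting for the real data.

```python
import re

# Assumption: a header row looks like "#12 Some Title" or "12 Some Title".
# Any other non-empty row is treated as a comment for the previous header.
HEADER = re.compile(r"^#?\s*(\d+)\s+(.+)$")


def parse_cell(cell):
    """Split one jumbled cell into line number / program title / comments."""
    records = []
    for raw in cell.split("\n"):
        # Drop all periods first, then trim surrounding whitespace.
        line = raw.replace(".", "").strip()
        if not line:
            continue
        match = HEADER.match(line)
        if match:
            # Start a new record at each LINE NUMBER / PROGRAM TITLE row.
            records.append({
                "line_number": int(match.group(1)),
                "program_title": match.group(2).strip(),
                "comments": [],
            })
        elif records:
            # Attach comment rows to the most recent header row.
            records[-1]["comments"].append(line)
    return records
```

One caveat: a comment row that happens to start with a number would be picked up as a header under this pattern, so it is worth spot-checking the output.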