I am looking for help in designing a process for sorting out a badly parsed PDF. I have attached an example of a PDF where the text has been badly split.
My initial thought was to Transpose this text and then use text to columns with space as a delimiter. Then I got stuck. I was hoping there was a way of concatenating based on whether the first/last letter of each column is lower/upper case, and using this to decide whether columns need to be joined. This is not foolproof and could still create errors (e.g. "for" and "and" would need to be excluded), but most words are capitalized. Are there any clever macros or tools to make my life easier, or do you think my method is my best shot? Or is there no way to do this, and I just have to settle for a manual process?
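To make the idea concrete, here is a rough sketch of that heuristic outside Alteryx, in Python. The function name `merge_fragments`, the `EXCLUDED` list, and the sample strings are all hypothetical and not taken from the attached file. It assumes a fragment belongs to the previous word whenever it starts with a lower-case letter and is not one of the excluded words.

```python
# Assumed exclusion list: lower-case words that are real words, not fragments.
EXCLUDED = {"for", "and", "of", "the"}


def merge_fragments(text, excluded=EXCLUDED):
    """Join space-separated fragments back onto the preceding word.

    A token is treated as a fragment when it starts with a lower-case
    letter and is not in the exclusion list.
    """
    merged = []
    for token in text.split():
        if merged and token[0].islower() and token not in excluded:
            # Lower-case start: glue it onto the previous token.
            merged[-1] += token
        else:
            # Capitalized token or excluded word: start a new word.
            merged.append(token)
    return " ".join(merged)


print(merge_fragments("Depart ment for Trans port"))
```

As noted above, this is not foolproof: a fragment that happens to match an excluded word (the "for" inside a split "In for mation", say) would be left as a separate word.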
Hi @jamesgough
Do you have access to the original PDF? Maybe the conversion process can be fixed so that you don't get a mix of some words split and others concatenated.
Dan
Hi @jamesgough ,
Love your ideas, but I think I found another way. I created a text input with the stop words (words like "as", "of", etc.) that appear in your jumbled text file.
Append all of the possible stop words to each row, and then use a multi-row formula tool to replace each of those stop words as they appear with a title case version of the word.
A sample tool then picks the last one for each record (so you have the one with all the corrections)
Data Cleansing removes all of the spaces. And a RegEx tool puts them back in the appropriate places (i.e., between a lower case letter and the next upper case letter). Another Data Cleansing removes leading spaces.
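For anyone who wants to try the same logic outside Alteryx, the steps above can be sketched in Python. This is only an approximation of the workflow, not the workflow itself: the `repair` function, the `STOP_WORDS` list, and the sample strings are hypothetical, and the sketch assumes stop words appear as standalone words in the input.

```python
import re

# Assumed stop-word list; in practice, use the words found in your own text.
STOP_WORDS = ["as", "of", "for", "and", "the"]


def repair(text, stop_words=STOP_WORDS):
    """Rebuild word boundaries from capitalization.

    1. Title-case standalone stop words so every real word starts upper-case.
    2. Remove all whitespace.
    3. Re-insert a space between a lower-case letter and the next upper-case one.
    """
    for word in stop_words:
        # \b keeps the match to whole words, so "for" inside "format" is untouched.
        text = re.sub(rf"\b{re.escape(word)}\b", word.capitalize(), text)
    text = re.sub(r"\s+", "", text)
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    return text.strip()


print(repair("Depart ment for Trans port"))
```

Note that the stop words come out title-cased ("For" rather than "for"); if that matters, they can be lower-cased again in a final pass.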
Cheers!
Esther
This is amazing Esther. Such a great idea. Thank you so much!