Good day,
I am having trouble removing some bad data from my input file before writing the data to the database (Microsoft Sql Server).
The bad data has a lots of different values but one example for a field shows:
æÂ±äº¬éƒ½è±Šå³¶åŒºåÂâ”å¤ |
I am unsure how to remove these special characters or characters of some sort from my dataset.
I have tried importing the csv with different Code Pages and tried removing the values for the various fields using the following formula tools:
Formula tool 1: REPLACE(REGEX_REPLACE([LAST_NAME],"(.*)(?:[^x20-x7E])(.*)","$1"+"$2")," "," ")
Formula tool 2: Regex_Replace([LAST_NAME],"[[:unicode:]]",'FAIL') = [LAST_NAME]
I have also tried flagging the records by creating a new field:
IF REGEX_CountMatches([FIRST_NAME],"[^'a-zA-Z\s\d-]+") > 0
then 'INVALID' else 'Fine'
endif
Any help would be much appreciated.
Cheers,
Clayton.
@ClaytonA Could you please upload some dummy data? The example you have above looks like it's just special characters. Do you have examples where characters you'd like to keep are mixed in?
Good morning,
Sorry for the delayed response.
@Preferably, if a record contains special characters I would like to send it to a separate sql table for bad data review.
Is there a way to filter on if a cell contains special characters or remove special characters from a value with regular text?
The main issue with the old process I am using is that the special characters exceed the field limits I have set. Which fails on import into my sql database using Sql Server Management Studio Tasks.
Please let me know if you have any other questions.
Some examples below:
Cheers,
Clayton Ayers.
@ClaytonA can't you just use REGEX_MATCH, this will flag the data, then filter accordingly. You just need to define what are the "suitable" characters and match those that solely contain those, or do the opposite and define the "bad" set. Whatever way suits you, however, defining the set to retain is dynamic and therefore less work.
Just lifted a character set out of one of the formulas you provided earlier. Then you get the True/False flag.
All the best,
BS
Good morning,
I did try a few use cases of REGEX_MATCH, however the T/F flag was not populating correctly.
Will your expression also work for address related data, eg. 123 Test Lane?
I will try your example and see if it flags the records accordingly.
Also is there a glossary or best practices document for Regular Expressions and their meanings? In this case I am unsure what x and E are referring to.
Thank you for your reply.
Cheers,
Clayton Ayers.