We are celebrating the 10-year anniversary of the Alteryx Community! Learn more and join in on the fun here.
Start Free Trial

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Trying to remove bad data from csv file to import into database

ClaytonA
7 - Meteor

Good day, 

 

I am having trouble removing some bad data from my input file before writing the data to the database (Microsoft Sql Server). 

 

The bad data has a lots of different values but one example for a field shows: 

東京都豊島区南å¤

 

I am unsure how to remove these special characters or characters of some sort from my dataset. 

I have tried importing the csv with different Code Pages and tried removing the values for the various fields using the following formula tools:

 

Formula tool 1: REPLACE(REGEX_REPLACE([LAST_NAME],"(.*)(?:[^x20-x7E])(.*)","$1"+"$2")," "," ")

Formula tool 2: Regex_Replace([LAST_NAME],"[[:unicode:]]",'FAIL') = [LAST_NAME]

 

I have also tried flagging the records by creating a new field:

IF REGEX_CountMatches([FIRST_NAME],"[^'a-zA-Z\s\d-]+") > 0
then 'INVALID' else 'Fine'
endif

 

Any help would be much appreciated. 

 

Cheers,

Clayton.

 

 

 

4 REPLIES 4
Prometheus
12 - Quasar

@ClaytonA Could you please upload some dummy data? The example you have above looks like it's just special characters. Do you have examples where characters you'd like to keep are mixed in?

ClaytonA
7 - Meteor

Good morning, 

 

Sorry for the delayed response.

 

@Preferably, if a record contains special characters I would like to send it to a separate sql table for bad data review. 

 

Is there a way to filter on if a cell contains special characters or remove special characters from a value with regular text? 

 

The main issue with the old process I am using is that the special characters exceed the field limits I have set. Which fails on import into my sql database using Sql Server Management Studio Tasks. 

 

Please let me know if you have any other questions. 

 

Some examples below:

 

Example 1.) خیابان پیروزی_نرسیده به درمانگاه رازی_مجتمع تجاری مسکونی ارزان تعاون شهر_واحد اوÙTEST1
 
Example 2. containing multiple fields with bad data)testСанкт- Петербург,СанкÑtest‚- Петербург

 

Cheers,

Clayton Ayers. 

BS_THE_ANALYST
15 - Aurora
15 - Aurora

@ClaytonA can't you just use REGEX_MATCH, this will flag the data, then filter accordingly. You just need to define what are the "suitable" characters and match those that solely contain those, or do the opposite and define the "bad" set. Whatever way suits you, however, defining the set to retain is dynamic and therefore less work.

 

Just lifted a character set out of one of the formulas you provided earlier. Then you get the True/False flag.

 

Screenshot 2023-11-01 133217.png

 

All the best,

BS

All the best,
BS

LinkedIN

Bulien
ClaytonA
7 - Meteor

Good morning, 

 

I did try a few use cases of REGEX_MATCH, however the T/F flag was not populating correctly. 

Will your expression also work for address related data, eg. 123 Test Lane? 

 

I will try your example and see if it flags the records accordingly. 

 

Also is there a glossary or best practices document for Regular Expressions and their meanings? In this case I am unsure what x and E are referring to. 

Thank you for your reply. 

 

Cheers,

Clayton Ayers. 

Labels
Top Solution Authors