Get Inspire insights from former attendees in our AMA discussion thread on Inspire Buzz. ACEs and other community members are on call all week to answer!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

enDash and Kanji characters

DaveJ
7 - Meteor

Hello,

 

I regularly receive files from customers that I need to load into a database. I have built lookups that match a text field so that I can append additional data. This all works fine apart from a few challenging edge cases. For example, if I received a customer name like this "Customer - Name" where the hyphen is actually an ASCII (150 - enDash). In this situation even if my lookup also contains the exact same ascii string for the name (i.e. including the enDash) the JOIN tool is returning this as unmatched. I assume it's something to do with collation sequence choices that Alteryx is making?

 

Any ideas how to make this work? I have a similar problem with Kanji characters. Even though my lookup exactly matches the Kanji and I can write a "select" that works ok the Alteryx JOIN is not seeing a match. 

 

I'm using a MySQL database.

 

Cheers

Dave

4 REPLIES 4
MarqueeCrew
20 - Arcturus
20 - Arcturus

@DaveJ,

 

How about a hygiene process that turns enDash to Dash?  Or to be safe, turns it to Pipe?

 

Regex_Replace([Field],"\x150",'-')

If you do this to all incoming data, when you come to the join, it will be the same and the join will work for you.  If you need to return the data, you can reverse the process and put the \x150 as the replacement for the substituted character.  If you're reversing the process, that's why I suggested a PIPE character.

 

Thanks,

Mark

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
DaveJ
7 - Meteor

Thanks for getting back to me Mark. 

 

I'm trying to successfully process the text I encounter rather than working around it. I get a whole mixture of European characters plus Chinese and Japanese. I want to find a way to read in all possibilities and handle them as seen if possible. 

 

I'll keep your suggestion in mind for simpler one off stuff like the enDash character.

 

Thanks

Dave

MarqueeCrew
20 - Arcturus
20 - Arcturus

@DaveJ,

 

ConvertFromCodepage(s, codePage)

Translates text from a code page to Unicode. See Code Pages.

 

This function might be what the doctor ordered.

 

https://community.alteryx.com/t5/Data-Preparation-Blending/Unicode-Matching-Function/m-p/523

 

Cheers,

Mark

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
DaveJ
7 - Meteor

Thanks again Mark. I had stumbled on the ConvertFromCodePage and ConvertToCodePage functions. Very useful.

 

For the benefit of the community, I have solved my particular problem.  It perhaps highlights my own ignorance but let me share my findings.

 

It turns out that my problem was actually that my database table was not storing the characters correctly. When I was joining an excel spreadsheet to what should have been exact matches from a database table it turns out that the database had "trashed" the extended ASCII characters due to the column being "latin1". I changed the character set for the column to UTF8, cleared down the table and reloaded it. After that, the Kanji and the enDash were accurately matched between the spreadsheet and the database table. No further conversion was needed.

 

If like me, you're not that familiar with MySQL you may miss the subtlety of columns having their own character set. I was using SQL Yog and the "alter table" GUI functionality but this doesn't seem to show column-level character sets. It was only when I issued the "show create table <tablename>;" to get the DDL for the table that I realised what I was missing. 

 

I hope this might help someone else struggling with a similar problem.

 

Cheers

Dave

Labels