Engine Works

ned_blog · ‎02-11-2009

I have been spending a lot of time lately working on Unicode issues. Mostly I have been making our older products like Allocate & Solocast work fully with Unicode, but I have found some interesting gotchas in Alteryx along the way. It all started when I tried to add a function to the function tool for fuzzy matching: DecomposeUnicodeForMatch.

I downloaded the Unicode spec from http://unicode.org/ in order to figure out how to flatten accented characters. I was reminded that Unicode characters are actually wider than 16 bits (code points run up to 0x10FFFF, so they are usually stored as 32-bit values), even though on Windows they are represented as 16-bit values. In particular, the 16-bit Unicode on Windows is UTF-16. That means that any Unicode value above 65535 is represented as a pair of 16-bit values. I had to fix CharFromInt & CharToInt to deal with the UTF-16 pairs properly, so now they can handle complex Asian-language characters.
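To make the pair arithmetic concrete, here is a minimal Python sketch of how a value above 65535 maps to two 16-bit units and back. It is only an illustration of the UTF-16 rules, not the actual Alteryx code, and the function names are made up for this example.

```python
from typing import List

def encode_utf16_units(code_point: int) -> List[int]:
    """Split a Unicode code point into one or two 16-bit UTF-16 units."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point < 0x10000:
        return [code_point]              # fits in a single 16-bit unit
    offset = code_point - 0x10000        # 20 bits remain to be split
    high = 0xD800 + (offset >> 10)       # top 10 bits: high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits: low surrogate
    return [high, low]

def decode_utf16_units(units: List[int]) -> int:
    """Read the first code point from a list of 16-bit UTF-16 units."""
    first = units[0]
    if 0xD800 <= first <= 0xDBFF:        # high surrogate: a second unit must follow
        if len(units) < 2 or not 0xDC00 <= units[1] <= 0xDFFF:
            raise ValueError("unpaired high surrogate")
        return 0x10000 + ((first - 0xD800) << 10) + (units[1] - 0xDC00)
    if 0xDC00 <= first <= 0xDFFF:
        raise ValueError("unpaired low surrogate")
    return first

# U+20000 is a CJK ideograph outside the 16-bit range.
print([hex(u) for u in encode_utf16_units(0x20000)])
```

Code that treats each 16-bit unit as a whole character would see two meaningless values here, which is the kind of gotcha described above.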

So back to decomposing Unicode: the Unicode spec was a pain to work with because it doesn't have any field names and it doesn't even show the character that each line represents. I ended up writing an Alteryx module to parse the spec and then generate C++ code to properly decompose Unicode. I thought it was cool to use Alteryx to generate code that goes into Alteryx. Anyway, as a bonus, the Unicode spec includes the information needed to decompose numbers, so characters like ¼ can easily change into 1/4. This should help matching. As an added bonus, it can even translate digits from other scripts into Arabic numerals. For instance, the Tibetan Digit 7 gets translated into a 7. Without the right font pack, you might not see characters like these.
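For a rough feel of what this kind of decomposition does, here is a small Python sketch. It leans on Python's built-in unicodedata module (which bundles the same Unicode character data) rather than the generated C++ described above, so treat it as an approximation of the idea and not the Alteryx implementation. The name decompose_for_match is hypothetical.

```python
import unicodedata

FRACTION_SLASH = "\u2044"  # the slash used inside decomposed fractions like ¼

def decompose_for_match(text: str) -> str:
    """Flatten accents, fractions, and non-Latin digits to plain characters."""
    out = []
    # NFKD splits accented letters into base + combining mark,
    # and fractions like ¼ into digit, fraction slash, digit.
    for ch in unicodedata.normalize("NFKD", text):
        if unicodedata.combining(ch):
            continue                      # drop the accent itself
        digit = unicodedata.digit(ch, None)
        if digit is not None:
            out.append(str(digit))        # any script's digit becomes 0-9
        elif ch == FRACTION_SLASH:
            out.append("/")
        else:
            out.append(ch)
    return "".join(out)

print(decompose_for_match("caf\u00e9"))   # accented e flattened
print(decompose_for_match("\u00bc"))      # vulgar fraction one quarter
print(decompose_for_match("\u0f27"))      # Tibetan digit seven
```

The escapes are used so the example does not depend on having the right fonts installed.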

If you are interested, you can find the Unicode spec as a YXDB in the sample module here. It's kind of cool to browse through actually.
