I have been spending a lot of time lately working on Unicode issues. Mostly I have been making our older products like Allocate & Solocast fully work with Unicode, but I have found some interesting gotchas in Alteryx along the way. It all started when I tried to add a function to the function tool for Fuzzy matching:
DecomposeUnicodeForMatch. I downloaded the Unicode spec from http://unicode.org/ in order to figure out how to flatten accented characters. I was reminded that Unicode characters can need more than 16 bits (they are usually stored as 32-bit values), even though on Windows they are represented as 16-bit values. In particular, the 16-bit Unicode on Windows is UTF-16. That means that every Unicode value above 65535 is represented as a pair of 16-bit values. I had to fix CharFromInt & CharToInt to deal with the UTF-16 pairs properly, so now they can handle complex Asian-language characters.
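To make the pairing concrete, here is a minimal Python sketch of how a code point above 65535 splits into two 16-bit UTF-16 units and joins back again. The function names are made up for illustration; this is not the actual CharFromInt/CharToInt code, just the standard surrogate-pair arithmetic.

```python
from typing import List


def to_utf16_units(code_point: int) -> List[int]:
    """Split a Unicode code point into one or two 16-bit UTF-16 units."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a valid Unicode scalar value")
    if code_point < 0x10000:
        # Fits in a single 16-bit unit.
        return [code_point]
    # Above 65535: spread the remaining 20 bits over a high and a low surrogate.
    offset = code_point - 0x10000
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def from_utf16_units(units: List[int]) -> int:
    """Join one unit, or a high/low surrogate pair, back into a code point."""
    if len(units) == 1:
        return units[0]
    high, low = units
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, `to_utf16_units(0x20000)` (a CJK Extension B character) gives `[0xD840, 0xDC00]`, and `from_utf16_units([0xD840, 0xDC00])` gives back `0x20000`.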
So back to decomposing Unicode: the Unicode spec was a pain to work with because it doesn't have any field names and it doesn't even show the characters that each line represents. I ended up writing an Alteryx module to parse the spec and then generate C++ code to properly decompose Unicode. I thought it was cool to use Alteryx to generate code that goes into Alteryx. Anyway, as a bonus the Unicode spec includes the information needed to decompose numbers, so characters like ¼ can easily change into 1/4. This should help matching. As an added bonus, it can even translate other languages' digits into Arabic numerals. For instance, the Tibetan Digit 7 gets translated into a 7. Without the right font pack, you might not see these characters.
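If you want to play with the same idea outside Alteryx, here is a rough Python sketch. It is not the generated C++ described above: it leans on Python's built-in unicodedata module, which ships the same Unicode data, and the function name decompose_for_match is just a stand-in I made up. Treating the fraction slash as a plain "/" is also an assumption on my part.

```python
import unicodedata


def decompose_for_match(text: str) -> str:
    """Flatten accents, fractions, and non-Latin digits for fuzzy matching."""
    # NFKD splits accented letters into base letter + combining mark,
    # and turns characters like the 1/4 fraction into digit, slash, digit.
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if unicodedata.combining(ch):
            # Drop the accent marks left over from decomposition.
            continue
        digit = unicodedata.digit(ch, None)
        if digit is not None:
            # Map any script's digit to its ASCII equivalent.
            out.append(str(digit))
        elif ch == "\u2044":
            # Assumption: replace FRACTION SLASH with an ordinary slash.
            out.append("/")
        else:
            out.append(ch)
    return "".join(out)
```

With this, `decompose_for_match("café")` returns `"cafe"`, `decompose_for_match("¼")` returns `"1/4"`, and the Tibetan Digit 7 (`"\u0f27"`) returns `"7"`.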
If you are interested, you can find the Unicode spec as a YXDB in the sample module here. It's kind of cool to browse through actually.