Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.

Sometimes data comes into Alteryx in an unexpected code page. One recent example: Tele Atlas started shipping their data in SHP files that are UTF-8 encoded. This is a little odd, because ESRI publishes a spec for SHP that doesn't include support for UTF-8, and the DBF file that ultimately holds the data doesn't officially support it either. It seems ESRI is playing fast and loose with its own file spec. I'm sure we will eventually figure out how ESRI has extended SHP so we can read these files directly, but in this case it wouldn't help, because Tele Atlas didn't see fit to include the CPG files that describe the text encoding alongside the SHP files.


Anyway, you get data in Alteryx that looks like Saint-FÃ©lix-d'Otis when what you really want is Saint-Félix-d'Otis. For Alteryx 6.0 we added a pair of functions that make this translation really easy: ConvertToCodePage and ConvertFromCodePage. To convert all the text fields in a stream from UTF-8 to a valid string, you only need a single MultiFieldFormula tool with the formula ConvertFromCodePage([_CurrentField_], 65001), where 65001 is the Windows code page number for UTF-8. Easy, huh?
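
If you're curious what that conversion actually does, here's a rough sketch in plain Python (not how Alteryx implements it, just the same idea): the UTF-8 bytes end up stored one-per-character in a narrow string, and decoding those bytes as UTF-8 (code page 65001) restores the real text.

```python
# Illustration only: what "UTF-8 bytes read as a narrow string" looks like,
# and how re-decoding them as UTF-8 (code page 65001) recovers the text.
# This mimics the effect of ConvertFromCodePage; it is not Alteryx code.

correct = "Saint-Félix-d'Otis"

# A narrow-string reader treats each UTF-8 byte as a single Latin-1
# character, which produces the garbled version you see in the stream.
garbled = correct.encode("utf-8").decode("latin-1")
print(garbled)   # Saint-FÃ©lix-d'Otis

# "Converting from code page 65001" means taking those narrow characters
# back to their raw bytes and decoding them as UTF-8.
fixed = garbled.encode("latin-1").decode("utf-8")
print(fixed)     # Saint-Félix-d'Otis
```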


Now the problem with this is that our customers don't have Alteryx 6.0 yet, so it would be really nice if we could come up with a macro that does the same thing. CSV files support code pages, so how can we leverage them to translate text fields for us? Simple: we'll use a RunCommand tool to write a temporary file and then read it back in. A slightly undocumented feature of the RunCommand tool is that you don't actually need to run a command at all; you can use it just to write a file and then read it.
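
Here's a rough Python sketch of that round trip, just to show why it works (the file name and path are made up for the example): write the narrow-character strings out byte-for-byte, then reread the file telling the reader it is UTF-8.

```python
# Sketch of the write-then-reread trick in plain Python (not Alteryx).
# The garbled strings hold raw UTF-8 bytes, one byte per character, so
# writing them as Latin-1 puts valid UTF-8 on disk; rereading the file
# as UTF-8 brings the accents back.
import csv
import os
import tempfile

rows = [{"City": "Saint-Félix-d'Otis".encode("utf-8").decode("latin-1")}]
path = os.path.join(tempfile.gettempdir(), "codepage_roundtrip.csv")

# Write: Latin-1 maps each narrow character back to its original byte.
with open(path, "w", encoding="latin-1", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["City"])
    writer.writeheader()
    writer.writerows(rows)

# Read: this time we declare the file to be UTF-8.
with open(path, encoding="utf-8", newline="") as f:
    for row in csv.DictReader(f):
        print(row["City"])   # Saint-Félix-d'Otis
```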


There is a problem, though: CSV files don't support all field types. Specifically, they don't support spatial objects, which are guaranteed to be in a SHP file. We can use the DynamicSelect tool (new in 5.0) to select out just the string-type fields, do our processing, and then join all the fields back together. The other problem with CSV is that we lose the field sizes and such, so we can also include an AutoField tool to repackage the fields into the smallest size appropriate.
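
To make the split/process/rejoin idea concrete, here's an illustrative Python sketch (again, not Alteryx code; the field names are invented): only the string fields go through the conversion, and everything else, including the spatial object, passes through untouched.

```python
# Illustrative sketch of the DynamicSelect / Join pattern: pull out only
# the string fields, fix their encoding, and stitch each record back
# together with the untouched fields (e.g. the spatial object).

def fix_utf8(value: str) -> str:
    """Re-interpret a narrow string's bytes as UTF-8 (like ConvertFromCodePage with 65001)."""
    return value.encode("latin-1").decode("utf-8")

record = {
    "City": "Saint-Félix-d'Otis".encode("utf-8").decode("latin-1"),  # garbled text field
    "Population": 1000,                                              # non-string field, untouched
    "SpatialObj": b"\x01\x02\x03",                                   # stand-in for a geometry blob
}

# "DynamicSelect": split string fields from everything else.
strings = {k: v for k, v in record.items() if isinstance(v, str)}
others = {k: v for k, v in record.items() if not isinstance(v, str)}

# Process only the string fields, then "join" the record back together.
fixed = {**{k: fix_utf8(v) for k, v in strings.items()}, **others}
print(fixed["City"])   # Saint-Félix-d'Otis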


Once this is all done, it is an easy job to wrap it up as a macro. That gives you a single tool you can add to any stream to convert all the text fields from narrow-character UTF-8 to standard Unicode.

The files and macro can be found here.