on 03-28-2016 03:00 PM - edited on 05-21-2019 01:18 PM by SydneyF
As users of Alteryx become increasingly international, it is important to support the need to prepare, blend and analyze data in a large number of languages. However, many users with data in non-English languages encounter roadblocks at the first step of their data analysis: inputting the data into Alteryx.
This article will demonstrate two ways of bringing double-byte characters (DBCs) into Alteryx. These characters are associated with languages that have many unique characters or symbols, such as Chinese, Japanese and Korean (CJK). A single byte is sufficient for languages like English, French and Spanish (among many others) that can be represented by 256 characters or fewer, but the graphic characters of CJK languages are stored in two bytes of data each. A fixed-width sequence of two bytes per character allows for about 65,000 characters. Supporting such an extensive character set requires an Alteryx user whose data contains CJK characters to consider certain formatting and preparation steps both before and after inputting data into Alteryx.
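To make the single-byte versus double-byte distinction concrete, here is a minimal Python sketch (outside Alteryx, purely illustrative). The helper name `byte_widths` and the choice of encodings are assumptions made for this example: `latin-1` stands in for a single-byte character set, while `utf-16-le` and `gb2312` stand in for encodings that can spend two bytes on a character.

```python
from typing import Dict, Optional

# Encodings compared: one single-byte set and two that can use two bytes.
ENCODINGS = ("latin-1", "utf-16-le", "gb2312")


def byte_widths(text: str) -> Dict[str, Optional[int]]:
    """Return the number of bytes `text` needs in each encoding.

    A value of None means the encoding cannot represent the text at all.
    """
    widths: Dict[str, Optional[int]] = {}
    for encoding in ENCODINGS:
        try:
            widths[encoding] = len(text.encode(encoding))
        except UnicodeEncodeError:
            widths[encoding] = None
    return widths


print(byte_widths("A"))   # a Latin letter
print(byte_widths("中"))  # a Chinese character
```

The Chinese character cannot be encoded in the single-byte set at all, but fits in two bytes in either of the other encodings.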
Before inputting data into Alteryx, data with CJK characters requires certain encoding and file type considerations. Encoding is the process of converting data from one format into another specific equivalent code that uses letters, symbols and numbers for storage and processing. When using CJK characters, encoding your data in a format that supports double-byte characters is important to accommodate the range of characters in your dataset. Storing the data as Unicode ensures that you do not lose data as you begin your inputting process into Alteryx (Figure 1).
Figure 1: To encode your data from Excel, click “Tools > Web Options” in the bottom right-hand corner of the Save As window. Then, select “Encoding” from the file tabs. In the dropdown menu under “Save this document as”, select “Unicode”, then click “OK”. Make sure that your file name does not contain any non-English characters, as that can cause issues with the Input tool.
The second pre-input consideration for CJK data is the file extension, or type, with respect to how the data is saved. For example, saving CJK data directly to a Comma Separated Values (CSV) format, especially without encoding, is discouraged. This is because CSV files saved without a Unicode encoding distort characters other than those supported by American Standard Code for Information Interchange (ASCII) text. In the case of CJK characters, that is likely to be all of your data (Figure 2)! Rather than suffer through the heart-wrenching experience of seeing all your data turned into meaningless question marks, try saving your data as a Unicode text file (.txt) or Excel workbook (.xlsx), both of which are supported by the Alteryx Input Tool (Figure 3). I have found that storing CJK characters in these file formats eases the process of inputting data into Alteryx and even reduces some of the steps you may need to perform in your post-input process.
Figure 2: Rats! CJK characters that are saved directly to CSV format without encoding turn into meaningless question marks. WHHHYYYYYY?????
Figure 3: Rather than lose your precious data, save as Unicode Text or an Excel Workbook (encoded as Unicode as a safeguard).
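The question-mark effect in Figure 2 and the safe round trip in Figure 3 can be imitated in a few lines of Python. This is only a sketch under stated assumptions: the function names are made up for illustration, the sample rows are invented, and `utf-16` is used on the assumption that it matches what Excel writes for its "Unicode Text" option.

```python
import os
import tempfile


def to_ascii_lossy(text: str) -> str:
    """Force text into ASCII; every unsupported character becomes '?'."""
    return text.encode("ascii", errors="replace").decode("ascii")


def roundtrip_utf16(text: str) -> str:
    """Write text to a temporary UTF-16 .txt file and read it back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "sample.txt")
        # newline="" keeps line endings exactly as written.
        with open(path, "w", encoding="utf-16", newline="") as handle:
            handle.write(text)
        with open(path, "r", encoding="utf-16", newline="") as handle:
            return handle.read()


# Invented tab-delimited sample rows, for illustration only.
sample = "名前\t都市\n太郎\t東京\n"

print(to_ascii_lossy("東京,Tokyo"))       # CJK characters are lost
print(roundtrip_utf16(sample) == sample)  # Unicode text survives intact
```

The lossy path is irreversible: once the characters are question marks, no later conversion can bring them back, which is why the encoding choice has to happen at save time.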
Post-Input Considerations
Because CJK characters are double-byte in size, it is important that their field type is set appropriately to display the data completely. Generally, data stored as Unicode text or in a Unicode-encoded Excel workbook will be read in as a variable-length wide string (V_WString) to accommodate these wider characters (Figure 4). As you can see, forcing the field type to a narrow string format (V_String) leads to data loss. Throughout the data preparation and blending process in Alteryx, make sure that fields containing CJK characters keep a field type with enough room to store and transmit the data. The sudden conversion of CJK characters to question marks may indicate that your field type has changed; should this occur, the field type can easily be changed back using a Select Tool.
Figure 4: The larger size requirements of CJK characters create the need for using wide string fields (V_WString).
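As a rough way to reason about which fields need a wide type, the check below flags any column whose values fall outside a single-byte character set. It is a Python sketch, not Alteryx behavior: the function names are hypothetical, and it assumes that a narrow string field can hold only Latin-1 characters.

```python
from typing import Iterable


def fits_narrow_string(value: str) -> bool:
    """True if every character fits in one byte (assumed Latin-1)."""
    try:
        value.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False


def suggest_field_type(values: Iterable[str]) -> str:
    """Suggest a field type name for a column of string values."""
    if all(fits_narrow_string(value) for value in values):
        return "V_String"
    return "V_WString"


print(suggest_field_type(["Paris", "Málaga"]))  # Western Latin only
print(suggest_field_type(["Seoul", "서울"]))     # contains Korean
```

A single CJK value anywhere in the column is enough to require the wide type, which matches the advice to check field types whenever question marks suddenly appear.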
Datasets that contain a mix of Western Latin character sets and CJK characters may require the conversion of text between code pages. This may be especially necessary if data has been transferred among many users who have different computer language settings or encoding systems. Figure 5 shows data that has been transferred from a Chinese colleague to an American Alteryx user. Despite bringing the data in as Unicode Text, the data is still unreadable. To convert the data to a usable and meaningful format, Alteryx tools with an Edit Formula box (or Expression builder), such as the Formula or Multi-Field Formula Tools, provide the ConvertFromCodePage and ConvertToCodePage functions. These functions facilitate conversions between language code pages and Unicode. In the example below, the Multi-Field Formula tool is used to overcome the encoding issues by converting the data from the original Chinese code page to Unicode with the ConvertFromCodePage function. As a result, the data is displayed in a readable and usable way.
Figure 5: Converting the data from Code Page 20936 (Simplified Chinese GB2312) to Unicode using the ConvertFromCodePage function renders the data readable and useable…Hooray!
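For readers who want to see what a code page conversion does under the hood, here is a Python analogue of the idea, not the Alteryx implementation. The names `misread_bytes` and `convert_from_code_page` are hypothetical, Python's `gb2312` codec is assumed to be a close stand-in for Code Page 20936, and the garbling is assumed to come from Chinese bytes being read with a Western single-byte code page (`latin-1` here).

```python
def misread_bytes(text: str, true_codec: str = "gb2312") -> str:
    """Simulate the problem: bytes written in one code page, read as Latin-1."""
    return text.encode(true_codec).decode("latin-1")


def convert_from_code_page(garbled: str, codec: str = "gb2312") -> str:
    """Recover the original bytes, then decode them with the right code page."""
    raw = garbled.encode("latin-1")  # Latin-1 maps code points 0-255 to bytes 1:1
    return raw.decode(codec)


original = "数据"
garbled = misread_bytes(original)

print(garbled != original)                         # unreadable on arrival
print(convert_from_code_page(garbled) == original)  # readable after conversion
```

Inside Alteryx, the equivalent step in a Multi-Field Formula tool would presumably be an expression along the lines of `ConvertFromCodePage([_CurrentField_], 20936)`; treat that exact form as an assumption and confirm it against the function list in the expression editor.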
And voila! Errr… 这里是! Now your CJK characters are brought into Alteryx beautifully!
Have you had experience with inputting non-English language data with Alteryx? Post any helpful tips or tricks that you’ve used to the Comments section to share with the Alteryx Community.
Thanks not only to the Alteryx users whose data prep needs have inspired this article, but also to Alteryx’s RodL for his Community post that has provided me with the steps to even know where to begin when troubleshooting data in a variety of languages including Chinese, Japanese, Korean, Thai and Vietnamese.