Alteryx Designer Knowledge Base

Definitive answers from Designer experts.

Inputting Data in Chinese, Japanese and Korean Characters

Sr. Instructional Designer

As users of Alteryx become increasingly international, it is important to support the need to prepare, blend and analyze data in a large number of languages. However, many users with data in non-English languages encounter roadblocks at the first step of their data analysis: inputting the data into Alteryx. 

 

This article will demonstrate two ways of bringing double-byte characters (DBCs) into Alteryx.  These characters are associated with languages that have many unique characters or symbols, such as Chinese, Japanese and Korean (CJK).  A single byte is sufficient for languages like English, French and Spanish (among many others) that can be represented by 256 characters or fewer, but the graphic characters of CJK languages are stored in two bytes of data rather than just one.  In a double-byte encoding, each CJK character occupies a sequence of two bytes, which allows for about 65,000 characters.  Supporting such an extensive character set requires an Alteryx user whose data contains CJK characters to consider certain formatting and preparation steps both before and after inputting data into Alteryx.
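To make the byte-width difference concrete, here is a minimal Python sketch (Python is used purely for illustration and is not part of Alteryx). It assumes GB2312 as a representative double-byte code page; the helper name byte_width is hypothetical.

```python
# Compare how many bytes one character occupies in different encodings.
# "gb2312" stands in for a typical double-byte CJK code page (an assumption
# made for illustration; Japanese and Korean data would use other code pages).

def byte_width(ch: str, encoding: str) -> int:
    """Return the number of bytes used to store a single character."""
    return len(ch.encode(encoding))

latin = "A"
cjk = "中"

for ch, enc in [(latin, "ascii"), (cjk, "gb2312"), (cjk, "utf-16-le"), (cjk, "utf-8")]:
    print(f"{ch!r} in {enc}: {byte_width(ch, enc)} byte(s)")
```

Note that the width depends on the encoding chosen: the same character takes two bytes in GB2312 or UTF-16, but three bytes in UTF-8.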

 

Pre-Input Considerations

Before inputting data into Alteryx, data with CJK characters requires certain encoding and file type considerations.  Encoding is the process of converting data from one format into another specific equivalent code that uses letters, symbols and numbers for storage and processing.  When using CJK characters, encoding your data in a format that supports double-byte characters is important to accommodate the range of characters in your dataset.  Storing the data as Unicode ensures that you do not lose data as you begin your inputting process into Alteryx (Figure 1).

 

Figure 1: To encode your data from Excel, click “Tools > Web Options” in the bottom right-hand corner of the Save As window.  Then, select “Encoding” from the file tabs.  In the drop-down menu for saving the document, select “Unicode”, then click “OK”.  Make sure that your file name does not contain any non-English characters, as that can cause issues with the Input tool.

Figure1.jpg 

The second pre-input consideration for CJK data is the file extension, or type, in which the data is saved.  For example, saving CJK data directly to a Comma Separated Values (CSV) format, especially without encoding, is discouraged.  This is because saving to CSV in this way distorts characters other than those supported by American Standard Code for Information Interchange (ASCII) text.  In the case of CJK characters, that is likely to be all of your data (Figure 2)!  Rather than suffer through the heart-wrenching experience of seeing all your data turned into meaningless question marks, try saving your data as a Unicode text file (.txt) or Excel workbook (.xlsx), both of which are supported by the Alteryx Input Tool (Figure 3).  I have found that using CJK characters in these file formats eases the process of inputting data into Alteryx and even reduces some of the steps you may need to perform in your post-input process.

 

Figure 2: Rats! CJK characters that are saved directly to CSV format without encoding turn into meaningless question marks.  WHHHYYYYYY?????

Figure2.jpg

Figure 3: Rather than lose your precious data, save as Unicode Text or an Excel Workbook (encoded as Unicode as a safeguard).

Figure3.jpg
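The question-mark effect is easy to reproduce outside of Excel. The Python sketch below is only an illustration under stated assumptions: the sample rows are invented, forcing text into ASCII stands in for saving to CSV without encoding, and UTF-16 stands in for Excel's Unicode Text format.

```python
import os
import tempfile

# Invented tab-delimited sample data (header row plus one record).
text = "名称\t数量\n苹果\t3"

# Forcing the text into ASCII replaces every CJK character with "?".
lossy = text.encode("ascii", errors="replace").decode("ascii")
print(lossy)

# Writing the same text as UTF-16 (assumed here to match "Unicode Text")
# and reading it back preserves every character.
path = os.path.join(tempfile.mkdtemp(), "sample_unicode.txt")
with open(path, "w", encoding="utf-16", newline="") as f:
    f.write(text)
with open(path, "r", encoding="utf-16", newline="") as f:
    restored = f.read()

print(restored == text)
```

Once the characters have been replaced by question marks, the original text cannot be recovered from the file, which is why the encoding choice has to happen at save time.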

 

Post-Input Considerations

Because CJK characters are double-byte in size, it is important that their field type is set appropriately to display the data completely.  Generally, data stored as Unicode text or a Unicode-encoded Excel workbook will be read in as a variable-length wide string (V_WString) to accommodate these wider characters (Figure 4).  As Figure 4 shows, forcing the field type to a narrow string format (V_String) leads to data loss.  Throughout the data preparation and blending process in Alteryx, make sure that fields containing CJK characters have the space they need to store and transmit the data.  The sudden conversion of CJK characters to question marks may indicate that your field type has changed; should this occur, the field type can easily be changed back using a Select Tool.

 

Figure 4: The larger size requirements of CJK characters create the need for wide string fields (V_WString).

Figure4.jpg
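If you want to check ahead of time whether a column is likely to need a wide string field, a rough test can be sketched in Python. This is an approximation, not Alteryx's own logic: it assumes a narrow string behaves like a single-byte Latin-1 field, and the helper names fits_narrow_string and suggest_field_type are hypothetical.

```python
def fits_narrow_string(value: str) -> bool:
    """True if every character fits in one byte (Latin-1 used as a stand-in)."""
    try:
        value.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

def suggest_field_type(values) -> str:
    """Suggest V_String only when no value needs more than one byte per character."""
    if all(fits_narrow_string(v) for v in values):
        return "V_String"
    return "V_WString"

print(suggest_field_type(["Paris", "Zürich"]))
print(suggest_field_type(["Tokyo", "東京"]))
```

A single CJK value in the column is enough to call for V_WString, since narrowing the field would affect every row.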

 

Datasets that contain a mix of Western Latin character sets and CJK characters may require the conversion of text between code pages.  This may be especially necessary if data has been transferred among many users who have different computer language settings or encoding systems.  Figure 5 shows data that has been transferred from a Chinese colleague to an American Alteryx user.  Despite bringing the data in as Unicode Text, the data is still unreadable.  To convert the data to a useable and meaningful format, Alteryx tools with an Edit Formula Box (or Expression builder), such as the Formula or Multi-Field Formula Tools, contain ConvertFromCodePage and ConvertToCodePage functions.  These functions facilitate the conversions between language code pages and Unicode.  In the example below, the Multi-Field Formula tool is used to overcome the encoding issues by converting the data from the original Chinese code page to Unicode using the ConvertFromCodePage function.  As a result, the data is displayed in a readable and useable way.

 

Figure 5: Converting the data from Code Page 20936 (Simplified Chinese GB2312) to Unicode using the ConvertFromCodePage function renders the data readable and useable…Hooray!

Figure5.jpg
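The same kind of repair can be illustrated in Python. This is a sketch under assumptions rather than a description of what Alteryx does internally: the sample text is invented, Python's "gb2312" codec stands in for Code Page 20936, and Latin-1 stands in for the Western code page that misread the bytes. In Alteryx, the corresponding Multi-Field Formula expression would be along the lines of ConvertFromCodePage([_CurrentField_], 20936).

```python
original = "数据分析"  # invented sample text

# Bytes as they might arrive from a system using a Simplified Chinese code page.
raw = original.encode("gb2312")

# Reading those bytes with a Western single-byte code page yields unreadable text.
misread = raw.decode("latin-1")
print(repr(misread))

# Repair: recover the original bytes, then decode them with the correct code page.
repaired = misread.encode("latin-1").decode("gb2312")
print(repaired)
```

The repair only works because the misreading kept every byte intact; if characters had already been replaced by question marks, there would be nothing left to convert.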

 

And voila! Errr… 这里是! Now your CJK characters are brought into Alteryx beautifully!

 

Have you had experience with inputting non-English language data with Alteryx?  Post any helpful tips or tricks that you’ve used to the Comments section to share with the Alteryx Community.

 

**Thanks not only to the Alteryx users whose data prep needs have inspired this article, but also to Alteryx’s RodL for his Community post that has provided me with the steps to even know where to begin when troubleshooting data in a variety of languages including Chinese, Japanese, Korean, Thai and Vietnamese.

 

Comments
Asteroid

Hi, do you have the sample workbook that we can check out?

Community Content Engineer

Great article!  Pro-Tip: when reading from a database using a standard Input Tool, make sure that the Force SQL WChar Support option is checked.

 

Force SQL WChar.jpg

 

Alteryx Partner

When reading CSV, you can simply make use of the 11. Code Page (set to Unicode) and 7. Field Length settings:

 


chinese.JPG
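For comparison, the effect of the code page setting can be sketched in Python, where the encoding is likewise declared when the CSV is read. This is only an illustration: the sample rows are invented, UTF-8 is assumed as the Unicode code page, and the helper name read_csv_bytes is hypothetical.

```python
import csv
import io

def read_csv_bytes(data: bytes, encoding: str):
    """Decode raw CSV bytes with the given encoding and parse them into rows."""
    text = data.decode(encoding)
    return list(csv.reader(io.StringIO(text, newline="")))

# Invented sample: a CSV file saved as UTF-8.
sample = "姓名,城市\r\n김민준,서울\r\n".encode("utf-8")

rows = read_csv_bytes(sample, "utf-8")        # correct code page
garbled = read_csv_bytes(sample, "latin-1")   # wrong code page, for contrast

print(rows)
print(garbled)
```

The file's structure survives either way; only the declared code page decides whether the cell values come out readable.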

 

Alteryx

Great article and tips here. Keep it coming!

Atom

@ChrsitineB

 

My problem is specifically the issue which is avoided above: Make sure that your file name does not contain any non-English characters, as that can cause issues with the Input tool.

 

Input files are Excel and contain a mix of Korean plus a date, e.g. <Korean>_yyyymmdd.xlsx.  The Excel coding type is Korean.

 

The characters come into Alteryx no problem.  I then read some data from them to add to the body of an email and the email picks up the original file (original file as I want to preserve the formatting).  But when run from Designer, the email picks up the correct file, but drops the Korean, so the attachment is _yyyymmdd.xlsx.  Otherwise the file is fine.  If I run from internal company gallery the Korean gets jumbled, like this ê³ì.½ë³ ê²°ì oë,´ì-­_yyyymmdd.xlsx.  Again, the file is fine otherwise.

 

How can I ensure that the file is attached to the email with Korean characters in the filename?  I've tried saving the Excel file as Unicode / UTF-8, but it doesn't make any difference.