Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Difference between MD5_ASCII() and MD5_UNICODE()

OllieClarke
15 - Aurora
15 - Aurora

I was connecting to an API that required hashing, so I used the MD5_UNICODE() function. However my authentication failed. Switching to the MD5_ASCII() function successfully authenticated me. As all of my strings were ascii values, I was confused as to why the two functions produce different hashes

 

Here's the relevant documentation of the functions:

OllieClarke_1-1594984254098.png

 

and here's proof of the difference between the two:

 

OllieClarke_0-1594984003444.png

 

3 REPLIES 3
TrevorS
Alteryx Alumni (Retired)

Hey @OllieClarke 
Thanks for posting this! I have let our support team know about the issue and they are currently looking into the problem.
Thanks,
TrevorS

Community Moderator
smaruda
Alteryx
Alteryx

Let me try to explain the difference.

 

MD5 hash function works on binary data. Strings can be encoded differently depending on what character encoding was used (ASCII, UTF8, UTF16, etc) - thus they can have different binary representation and as a result visually same strings can have different md5 hashes.

 

For example, let's consider string "Hello World".

When we encode this string using ASCII encoding - raw binary data will look like (binary representation):

01001000 01100101 01101100 01101100 01101111 00100000 01110111 01101111 01110010 01101100 01100100

ASCII encoding can represent only latin alphabet (lower + upper case), numbers 0-9, punctuation characters + some special non-printable characters. Everything else can't be represented with ASCII. In this case our string perfectly matches this requirement and thus can be encoded as ASCII. Encoding "Привет" string ("Hello" from russian) simple not possible with ASCII.

For this reason (to support not only English but many other languages) people invented another encodings like UTF8 and UTF16. 

 

Using UTF8 we will have same result for "Hello World" string since UTF8 is fully compatible with ASCII.

 

Interesting thing happens when we encode this string using UTF16 encoding:

0100100000000000 0110010100000000 0110110000000000 0110110000000000 0110111100000000 0010000000000000 0111011100000000 0110111100000000 0111001000000000 0110110000000000 0110010000000000 

This is because UTF16 use 2 bytes per character to hold its code.

 

In Designer MD5_ASCII uses ASCII encoding, MD5_UNICODE uses UTF16 encoding - thus you see different results for visually same string:

MD5_ASCII("Hello world") = b10a8db164e0754105b7a99be72e3fe5

MD5_UNICODE("Hello world") = e9f73a54305808ec350c3bbe3db371b9

 

The problem of MD5_ASCII is that it can represent only latin alphabet, it cannot represent characters from other languages like russian, japanes, chinese, etc and will replace all unknown character with "?"

 

For example:

MD5_ASCII("Привет") = df64dc2eb4a0b85091dd31eb4923eaac

MD5_ASCII("Добрый") = df64dc2eb4a0b85091dd31eb4923eaac 

Absolutely different strings have same md5 hash which is dangerous and can lead to errors in workflow or possible data loss.

 

To fix that users should use MD5_UNICODE

 

In 20.4 release this will better documented and additionally we will add MD5_UTF8 function which has 100% compatibility with MD5_ASCII but additionally supports all known languages:

MD5_UTF8("Hello World") = b10a8db164e0754105b7a99be72e3fe5

And it will work also non-English symbols:

MD5_UTF8("Привет") = 8a669e9418750c81ab90ae159a8ec410

MD5_UTF8("Добрый") = a89002cddecf6c1b3da06928021e18bd

 

The reason why we have added the support of UTF8 encoding is that AMP Engine uses this encoding to represent strings internally, while E1 Engine used UTF16 (for this reason MD5_UNICODE is based on UTF16) and also because UTF8 is de facto the standard for strings encoding (most popular encoding among hosted web sites, all online md5 hash generators are based on UTF8, most efficient in terms of used memory)

OllieClarke
15 - Aurora
15 - Aurora

Hi @smaruda sorry I've only just seen this! Thank you for your explanation that's cleared everything up 🙂

Labels