I was connecting to an API that required hashing, so I used the MD5_UNICODE() function. However, my authentication failed. Switching to the MD5_ASCII() function authenticated me successfully. As all of my strings were ASCII values, I was confused as to why the two functions produce different hashes.
Here's the relevant documentation of the functions:
and here's proof of the difference between the two:
Hey @OllieClarke
Thanks for posting this! I have let our support team know about the issue and they are currently looking into the problem.
Thanks,
TrevorS
Let me try to explain the difference.
The MD5 hash function works on binary data. A string can be encoded in different ways depending on which character encoding is used (ASCII, UTF-8, UTF-16, etc.), so it can have different binary representations, and as a result two visually identical strings can produce different MD5 hashes.
For example, let's consider the string "Hello World".
When we encode this string using ASCII, the raw binary data looks like this:
01001000 01100101 01101100 01101100 01101111 00100000 01010111 01101111 01110010 01101100 01100100
ASCII can represent only the Latin alphabet (lower and upper case), the digits 0-9, punctuation, and a few special non-printable characters. Everything else cannot be represented in ASCII. Our string meets this requirement, so it can be encoded as ASCII, but encoding the string "Привет" ("Hello" in Russian) is simply not possible with ASCII.
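To see this outside of Designer, here's a small Python sketch (using `str.encode` and not the Alteryx functions themselves) that reproduces the binary dump above and shows that ASCII simply cannot encode Cyrillic text:

```python
# Dump the ASCII bytes of "Hello World" as binary, one octet per character.
text = "Hello World"
print(" ".join(f"{b:08b}" for b in text.encode("ascii")))

# Cyrillic characters have no ASCII code points, so strict encoding fails.
try:
    "Привет".encode("ascii")
except UnicodeEncodeError as err:
    print("cannot encode:", err)
```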
For this reason (to support not just English but many other languages), other encodings such as UTF-8 and UTF-16 were invented.
Using UTF-8 we get the same result for the "Hello World" string, since UTF-8 is fully backward compatible with ASCII.
Something interesting happens when we encode this string using UTF-16:
0100100000000000 0110010100000000 0110110000000000 0110110000000000 0110111100000000 0010000000000000 0101011100000000 0110111100000000 0111001000000000 0110110000000000 0110010000000000
This is because UTF-16 uses two bytes per character to hold its code (shown here in little-endian byte order).
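A quick Python check (plain `str.encode`, not Alteryx) confirms the size difference; `utf-16-le` is used here on the assumption that it matches the little-endian byte order shown above:

```python
text = "Hello World"
ascii_bytes = text.encode("ascii")      # 1 byte per character
utf16_bytes = text.encode("utf-16-le")  # 2 bytes per character, no BOM
print(len(ascii_bytes))  # 11
print(len(utf16_bytes))  # 22
print(utf16_bytes[:2])   # b'H\x00' - 'H' followed by a zero byte
```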
In Designer, MD5_ASCII uses ASCII encoding and MD5_UNICODE uses UTF-16 encoding, which is why you see different results for visually identical strings:
MD5_ASCII("Hello World") = b10a8db164e0754105b7a99be72e3fe5
MD5_UNICODE("Hello World") = e9f73a54305808ec350c3bbe3db371b9
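Both results can be reproduced with Python's `hashlib` (a sketch, assuming MD5_UNICODE corresponds to UTF-16 little-endian without a BOM, as the byte dump above suggests):

```python
import hashlib

text = "Hello World"
md5_ascii = hashlib.md5(text.encode("ascii")).hexdigest()
md5_unicode = hashlib.md5(text.encode("utf-16-le")).hexdigest()
print(md5_ascii)    # b10a8db164e0754105b7a99be72e3fe5
print(md5_unicode)  # differs, because the input bytes differ
```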
The problem with MD5_ASCII is that it can represent only the Latin alphabet; it cannot represent characters from other languages such as Russian, Japanese, or Chinese, and it replaces every unknown character with "?".
For example:
MD5_ASCII("Привет") = df64dc2eb4a0b85091dd31eb4923eaac
MD5_ASCII("Добрый") = df64dc2eb4a0b85091dd31eb4923eaac
Completely different strings end up with the same MD5 hash, which is dangerous and can lead to errors in a workflow or even data loss.
To avoid that, users should use MD5_UNICODE.
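The collision is easy to reproduce in Python: encoding with `errors="replace"` mimics the "?" substitution, so both strings collapse to the same six bytes before hashing (a sketch of the mechanism, not Alteryx's actual implementation):

```python
import hashlib

# Both 6-character Cyrillic strings degrade to b'??????' under ASCII.
a = "Привет".encode("ascii", errors="replace")
b = "Добрый".encode("ascii", errors="replace")
print(a)  # b'??????'
print(hashlib.md5(a).hexdigest() == hashlib.md5(b).hexdigest())  # True
```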
In the 20.4 release this will be better documented, and we will additionally add an MD5_UTF8 function, which is 100% compatible with MD5_ASCII for pure-ASCII input but also supports all known languages:
MD5_UTF8("Hello World") = b10a8db164e0754105b7a99be72e3fe5
And it also works with non-English symbols:
MD5_UTF8("Привет") = 8a669e9418750c81ab90ae159a8ec410
MD5_UTF8("Добрый") = a89002cddecf6c1b3da06928021e18bd
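In `hashlib` terms (again a sketch, not the Designer implementation), the UTF-8 behaviour looks like this: ASCII input yields the same digest as the ASCII encoding, while distinct Cyrillic strings keep distinct digests:

```python
import hashlib

# ASCII-compatible: same digest as hashing the ASCII bytes.
print(hashlib.md5("Hello World".encode("utf-8")).hexdigest())
# b10a8db164e0754105b7a99be72e3fe5

# Distinct non-Latin strings keep distinct digests.
h1 = hashlib.md5("Привет".encode("utf-8")).hexdigest()
h2 = hashlib.md5("Добрый".encode("utf-8")).hexdigest()
print(h1 != h2)  # True
```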
The reason we added support for UTF-8 encoding is that the AMP Engine uses it to represent strings internally, while the E1 Engine used UTF-16 (which is why MD5_UNICODE is based on UTF-16). UTF-8 is also the de facto standard for string encoding: it is the most popular encoding among hosted web sites, online MD5 hash generators are based on it, and it is the most memory-efficient option.
Hi @smaruda, sorry I've only just seen this! Thank you for your explanation, that's cleared everything up 🙂