Hello!
I'm designing an application to transform excel reports for output to our Azure and Bigquery data warehouses. The app takes an excel file and creates a metadata table and data table linked on two ID constraints: a uniqueID field which will be populated based on the last load (find max, add 1 and keep going up) and a "ProfileID" field which is an MD5 hash of the incoming column name. This ID is currently up to 32 alphanumeric characters long:
My IT team are looking for a way we can turn this ProfileID in to a shorter key to improve indexing and querying.The shorter key cannot duplicate unless the profileID is the same. This would happen in 12 months when a profile is updated with 2020 data (new row, new ProfileUniqueID, same ProfileID). Upwards of 5,000 rows will be added to this table every year.
Any ideas out there?
Solved! Go to Solution.
You may be able to LEFT() to 10 chars the hash key. I tested on a table with an md5 key was created and based on just over 300k records, the first 10 chars was still unique.
Jimmy
Teknion Data Solutions
Hey Jimmy!
Thanks for testing this out - it seems the obvious solution I never thought of!
My data architecture guy is still adamant he wants an integer instead of a string - got any tricks for converting those 10 character strings to unique integer values without duplicates?
Dan
This is completely untested, but thinking this approach could do the trick. basically spitting each character of the hash into rows and then converting the character to it's decimal code then concatenating it back up. The "- 48" is to start it from 0 since the character "0" (zero) decimal code is 48.
Either way, let me know if this works and is still able to maintain a unique value!
Regards,
Jimmy
Teknion Data Solutions
Hi @Dan_Z
On the back end, you could create a table of ProfileIDs with a unique integer key, ProfileIDKey. When you receive new files, join to this table, which will be smaller than the main table so lookups will be faster, on ProfileID. Use the returned ProfiIeIDKey as the integer key, to satisfy the requirements of your IT Team, without the risk of duplicating keys that may happen when you start shortening the Hash value. You'll have to add new ProfileIDs to this table, but the trade-off may be worth it depending on the frequency of querying
As an aside: when you're adding 5000 records a year at the end of 20 years, you'll still only have 100000 records. That's peanuts for a modern DB, especially if you add an index on the ProfileID column. Have your IT team run some performance metrics on querying 100000 rows of data with an Int ID vs. your 32 char Hash value(indexed, of course), focusing on actual as opposed to comparative metrics and taking into account the frequency at which you forecast actually querying the data. The integer join will be faster, but if both of them will still return values fast enough, you may be undertaking this effort with no effective gain.
Dan
Thanks Dan and Jimmy! I have some testing to do today and will get back to you.
Hey Jimmy!
Using a list of 5,485 unique records I came up with 5,485 unique integers! The format had to be "double" because when converted to INT64, 1 record out of all of them was too big to fit. Just for giggles I brought the LEFT statement down to 8 characters and that issue was resolved.
Thanks a million!
Dan
Glad to hear and thanks for the update!
i do want to mention that on the 300k row table I tested, I did get two instances of duplicate keys when used only the first 9 so something that you may need to consider (I didn’t test 8).
Jimmy
Teknion Data Solutions
The worst case scenario using Jimmy's technique is getting a hash value that's all characters. Using the first 8 characters will give you an integer that's 16 digits long, which is too long to fit in the mantissa of a double(max=15). The last digits will be truncated(silently) leading to even more duplication. Use a fixed decimal(20.0) instead.
Dan
Good point @danilang!
Also, @Dan_Z, and you may have already figured this out, but instead of - 48, you probably want to set it to -47 instead so that the char to int translation starts at 1 and not 0. Otherwise, when it's outputted to a DB table where the column is numeric, it'll trim out that leading 0 if it started with it.