I'm designing an application to transform Excel reports for output to our Azure and BigQuery data warehouses. The app takes an Excel file and creates a metadata table and a data table linked on two ID constraints: a uniqueID field, which is populated based on the last load (find the max, add 1, and keep counting up), and a "ProfileID" field, which is an MD5 hash of the incoming column name. This ID is currently up to 32 alphanumeric characters long.
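For illustration, a ProfileID like that could be generated along these lines (a minimal sketch using Python's hashlib; the column name is just an example, and the app's actual hashing code may differ):

```python
# Sketch only: derive a 32-character ProfileID from an incoming column name.
# "Total Revenue 2019" is a hypothetical column, not from the real reports.
import hashlib

column_name = "Total Revenue 2019"
profile_id = hashlib.md5(column_name.encode("utf-8")).hexdigest()

print(profile_id)       # always 32 hex characters (0-9, a-f)
print(len(profile_id))  # 32
```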
My IT team are looking for a way to turn this ProfileID into a shorter key to improve indexing and querying. The shorter key cannot duplicate unless the ProfileID is the same; that situation will come up in 12 months, when a profile is updated with 2020 data (new row, new ProfileUniqueID, same ProfileID). Upwards of 5,000 rows will be added to this table every year.
This is completely untested, but I'm thinking this approach could do the trick: basically, split each character of the hash into rows, convert each character to its decimal character code, then concatenate it all back up. The "- 48" is to start from 0, since the character "0" (zero) has decimal code 48.
Either way, let me know if this works and is still able to maintain a unique value!
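To make that concrete, here's the same idea as a quick Python sketch (the function name is just illustrative). Note that because digits encode to one character and letters to two, two different inputs can concatenate to the same key, which ties into the duplication concerns in the replies below:

```python
# Untested idea as code: convert each character of the hash to its decimal
# character code minus 48 (so '0' -> 0) and concatenate the results.
def hash_to_key(profile_id: str) -> str:
    return "".join(str(ord(ch) - 48) for ch in profile_id)

print(hash_to_key("0123456789abcdef"))
# '0123456789495051525354' -- digits map to 1 character, letters to 2

# Caveat: the variable-width mapping means distinct hashes can collide.
print(hash_to_key("49"), hash_to_key("a"))  # both produce '49'
```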
On the back end, you could create a table of ProfileIDs with a unique integer key, ProfileIDKey. When you receive new files, join to this table on ProfileID; it will be smaller than the main table, so lookups will be faster. Use the returned ProfileIDKey as the integer key to satisfy the requirements of your IT team, without the risk of duplicate keys that can arise when you start shortening the hash value. You'll have to add new ProfileIDs to this table, but the trade-off may be worth it depending on the frequency of querying.
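If it helps, here's that surrogate-key pattern as a minimal Python sketch, with an in-memory dict standing in for the small lookup table (names like get_or_create_key are illustrative; in the warehouse this would be a real table joined on ProfileID):

```python
# Stand-in for the ProfileID -> ProfileIDKey lookup table.
profile_key_table: dict[str, int] = {}

def get_or_create_key(profile_id: str) -> int:
    """Return the existing integer key for this ProfileID, or assign the next one."""
    if profile_id not in profile_key_table:
        profile_key_table[profile_id] = len(profile_key_table) + 1
    return profile_key_table[profile_id]

# Same ProfileID always yields the same key; new ProfileIDs get new keys.
print(get_or_create_key("9e107d9d372bb6826bd81d3542a419d6"))  # 1
print(get_or_create_key("9e107d9d372bb6826bd81d3542a419d6"))  # 1 (repeat load, same profile)
print(get_or_create_key("e4d909c290d0fb1ca068ffaddf22cbd0"))  # 2
```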
As an aside: adding 5,000 records a year, at the end of 20 years you'll still only have 100,000 records. That's peanuts for a modern DB, especially if you add an index on the ProfileID column. Have your IT team run some performance metrics on querying 100,000 rows of data with an int ID vs. your 32-character hash value (indexed, of course), focusing on actual rather than comparative metrics and taking into account how often you forecast actually querying the data. The integer join will be faster, but if both return values fast enough, you may be undertaking this effort with no effective gain.
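A rough, self-contained way to run that comparison, sketched with Python's built-in sqlite3 (table names and row counts are illustrative, and numbers from your actual warehouse will differ):

```python
# Compare lookup time on an indexed int key vs. an indexed 32-char hash key,
# over 100,000 rows (the 20-year volume estimated above).
import hashlib
import sqlite3
import time

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE by_int  (id INTEGER PRIMARY KEY, payload TEXT)")
cur.execute("CREATE TABLE by_hash (id TEXT PRIMARY KEY, payload TEXT)")

rows = [(i, hashlib.md5(str(i).encode()).hexdigest()) for i in range(100_000)]
cur.executemany("INSERT INTO by_int  VALUES (?, ?)", [(i, h) for i, h in rows])
cur.executemany("INSERT INTO by_hash VALUES (?, ?)", [(h, str(i)) for i, h in rows])
conn.commit()

def time_lookups(sql, keys):
    start = time.perf_counter()
    for k in keys:
        cur.execute(sql, (k,)).fetchone()
    return time.perf_counter() - start

int_keys = [i for i, _ in rows[::100]]   # 1,000 sample lookups each
hash_keys = [h for _, h in rows[::100]]
print("int  key lookups:", time_lookups("SELECT payload FROM by_int  WHERE id = ?", int_keys))
print("hash key lookups:", time_lookups("SELECT payload FROM by_hash WHERE id = ?", hash_keys))
```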
Using a list of 5,485 unique records I came up with 5,485 unique integers! The format had to be "double" because when converted to INT64, 1 record out of all of them was too big to fit. Just for giggles I brought the LEFT statement down to 8 characters and that issue was resolved.
The worst-case scenario using Jimmy's technique is a hash value that's all letters. Using the first 8 characters will then give you an integer 16 digits long, which is too long to fit in the mantissa of a double (max = 15). The last digits will be silently truncated, leading to even more duplication. Use a fixed decimal(20,0) instead.
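You can see that silent truncation with a quick Python check (float here behaves like a SQL double; integers above 2**53 aren't all exactly representable):

```python
# A 16-digit key round-tripped through a double changes value without any error.
big_key = 9999999999999999             # 16 digits
print(float(big_key))                  # 1e+16 -- rounded, silently
print(int(float(big_key)) == big_key)  # False: the round-trip changed the key
```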
Also, @Dan_Agenti, and you may have already figured this out, but instead of - 48, you probably want - 47 so that the char-to-int translation starts at 1 rather than 0. Otherwise, when the key is output to a DB table where the column is numeric, any leading 0 will be trimmed off.
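A tiny illustration of that offset difference:

```python
# With - 48, a hash starting with '0' yields a key with a leading zero,
# which a numeric column drops; shifting the offset to - 47 avoids that.
print(ord("0") - 48)  # 0 -> key would start with a leading zero
print(ord("0") - 47)  # 1 -> first digit is always nonzero
```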