I have two tables, one with 150 columns and one with 110 columns. The one with 150 columns contains essentially the same information as the one with 110 columns, plus additional columns. The column names don't match - sometimes there are similarities, sometimes not.
Is there any way to automatically map each column in the source table to the best-suited column in the target table, e.g. by finding the target column whose value distribution most closely resembles that of the source column?
Or any other thoughts on how to automatically generate a mapping between the tables?
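To illustrate the distribution-comparison idea, here is a minimal sketch using only the Python standard library: it computes a two-sample Kolmogorov-Smirnov statistic (the maximum gap between two empirical CDFs) and picks the target column with the smallest gap. The column names and values are invented for the example.

```python
from bisect import bisect_right

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of samples a and b."""
    a, b = sorted(a), sorted(b)
    max_gap = 0.0
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

def best_match(source_values, target_columns):
    """Return the name of the target column whose values are
    distributed most like the source column (smallest KS gap)."""
    return min(target_columns,
               key=lambda name: ks_statistic(source_values, target_columns[name]))

source = [1, 2, 2, 3, 3, 3, 4]
targets = {
    "colA": [10, 20, 30, 40],       # very different scale
    "colB": [1, 2, 2, 3, 3, 4, 4],  # similar distribution
}
print(best_match(source, targets))  # -> colB
```

This only helps for numeric columns with comparable scales; for a more robust version you could standardize each column first, or use `scipy.stats.ks_2samp` if SciPy is available.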
You mentioned distribution function - does that mean that the columns with "basically the same information" do not have exactly the same data? How then do you determine if a column is the "same" or not?
One approach you could explore, for columns with similar names, is to extract the column names with the field info, run a fuzzy match between the source and target names, rename the fields in one table according to the fuzzy match scores, and then use a join, since the names would now be the same. That doesn't help when the names share no similarity, though. Perhaps the fuzzy match could be applied to the column values as well.
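The name-based fuzzy match could be sketched in Python with `difflib` from the standard library: normalize the names, score every source/target pair, and keep the best-scoring target for each source column. All column names below are made up.

```python
from difflib import SequenceMatcher

source_cols = ["cust_id", "order_date", "total_amt"]
target_cols = ["CustomerID", "OrderDate", "Amount", "Region"]

def similarity(a, b):
    """Similarity ratio in [0, 1], after normalizing case and separators."""
    norm = lambda s: s.lower().replace("_", "")
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

# For each source column, pick the target column with the highest score.
mapping = {src: max(target_cols, key=lambda tgt: similarity(src, tgt))
           for src in source_cols}
print(mapping)
```

In practice you would also want a minimum-score threshold, so that a source column with no good counterpart is left unmapped rather than forced onto its least-bad match.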
If there are potentially no similarities in the column names, you could try comparing the contents of each column as a concatenated string. This might not work if you have many rows of data, though, as you'd hit the maximum string length. If you do run into a length problem, you could bring the data into R/Python, hash each column, and compare the hashes instead.
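The hash-and-compare idea could look like this in Python with the standard library's `hashlib`: hash each column's concatenated values and match columns whose hashes are equal. Note this only finds columns whose values are identical and in the same row order; the table data below is invented.

```python
import hashlib

def column_hash(values):
    """Hash a column's values; the separator avoids accidental
    collisions like ["ab", "c"] vs ["a", "bc"]."""
    joined = "\x1f".join(str(v) for v in values)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()

source = {"id": [1, 2, 3], "city": ["Oslo", "Bergen", "Oslo"]}
target = {"CustomerID": [1, 2, 3],
          "City": ["Oslo", "Bergen", "Oslo"],
          "Extra": [9, 9, 9]}

# Index target columns by hash, then look up each source column.
target_hashes = {column_hash(vals): name for name, vals in target.items()}
mapping = {name: target_hashes.get(column_hash(vals))
           for name, vals in source.items()}
print(mapping)  # -> {'id': 'CustomerID', 'city': 'City'}
```

Source columns with no exact counterpart map to `None`, and those are the ones you would fall back to fuzzier techniques for.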