I have two datasets of sales data from different companies. I need to union them together but the field names are different and there are multiple instances where fields are named the same name but different types (i.e. one company has month stored as a numeric "1" and another has it as character "January"). I suppose the usual recommendation is to manually go through these but this is not practical because there are about 30-50 fields in each dataset and a different number of fields between datasets. I also have a lot of datasets. I need to include these differences in our documentation as well so when this analysis is repeated or this data is used for other analyses, it can be easily referenced so I will need an output.
I am looking to get an way to print a list of the field names and types for both datasets so I can easily see if are named differently, differ in type, or possibly missing entirely. I tried pulling the field names and joining them name but it was not easy to understand nor provided all of the info I needed. I have attached two dummy datasets to work with and the desired output as an example.
Here is a brief table that explains each field for the two datasets
FIELD NAME | DUMMY DATA 1 | DUMMY DATA 2 |
STORE_ID | Character field of IDs for stores | Numeric field of IDs for stores |
REVENUE | Numeric field of Gross Revenue | Numeric field of Gross Revenue |
EARNINGS | Numeric field of Revenue after costs | Numeric field of Revenue after costs |
MONTH | Numeric field of the current month i.e. January is 1 | Not present. Match is titled "Period" |
PERIOD | Not present. Match is titled "Month" | Character field of the current month |
Customers | A numeric field of total customers | A numeric field of total customers |
BREAK_EVEN | Logical field of whether the store broke even for that sales period | Missing |
AVG_SALE_REV | Missing | Numeric field of the Average Revenue per customer |
Please note here that the data itself won't make much sense here. I am only concerned with the metadata. Also, while these conflicts are specific examples, they are not all encompassing so I am not worried about actually fixing them. With the list, the specific conflict will be identified and resolved.
Hi @jmohr
If I am not wrong you need to find the data types which are coming from different files into one file and compare them with each other.
If my understanding is correct, you can use the below workflow to get the desired output.
Please let me know if you have any further requirements on it.