I regularly get data sets that I need to join (on 1 or more fields) and the first thing I want to know is how much of the data is common and will join, and how much will fall out of the process.
The fall out may be bad data that needs cleaning or is legitimately dissimilar. For Example, take two data sets to compare:
Data Set One | Data Set 2 | |||||||
ROLE | LEVEL | TEAM | Classification | Type | Category | Experience | Scrum | |
Analyst | Senior | Full Stack | Big Data | Java SE | Cloud Native | Junior | Digital | |
Analyst | Junior | Waterfall | Blockchain | Java SE | Java, BA | Junior | Niche | |
DEV | Mid | Standard | Design Systems Designer | PM | Big Data | Intermediate | Standard | |
DEV | Senior | Full Stack | Design Systems Engineer | Tester | Cloud | Senior | Standard | |
Java | Junior | Waterfall | Cloud Native | Data Scientist | Azure | +5 Years | Digital | |
Java | Mid | Standard | Cloud Native | Data Modeller | Big Data | Senior | Waterfall | |
Project Manager | Senior | Standard | .Net | Developer | Design Systems Designer | +5 Years | Niche |
I want to make a macro where I can take the two data sets, select the Fields that I want to compare, so for the data above, I want a selection similar to a Join tool, where I select fields:
Each field value comparison will be independent. I've seen examples of how to create a single list of the field names, but not one that allows me to select and match up both sets of field names. Has any one got a solution here?
Not very sure I follow you, but correct me if I am wrong:
Basically you will ingress two files into Alteryx, and you want to see what data can join and what can't join based on selected fields?
In this case, I have a few points to make:
The above will help.
Hey @SharkeyNZ I can think of a way to build this macro, but it would involve typing in fields in the order of joining into 2 text boxes, and then updating the join tool in a nested batch macro. You could also maybe have as many drop downs as there could be join clauses (with some left blank) to set join clause 1, join clause 2 etc.
Neither one of these 'solutions' would be user friendly, both would involve nested batch macros.
Another 'solution' (attached) is to transpose all the data and append it. You can then see where there are matching values. This approach will be slow with large datasets though.
Thanks @caltang.
In terms of output, I'd like for example:
Evaluating Fields names (Selection 1 - Selection 2) | Matching values in both sets | Value from selection 1 | Value from selection 2 |
Level - Experience | Junior | Junior | Junior |
Level - Experience | Senior | Senior | Senior |
Level - Experience | Not Matched | Mid | |
Level - Experience | Not Matched | Intermediate | |
Level - Experience | Not Matched | +5 Years | |
Team - Scrum | Waterfall | Waterfall | Waterfall |
Team - Scrum | Standard | Standard | Standard |
Team - Scrum | Not Matched | Full Stack | |
Team - Scrum | Not Matched | Digital | |
Team - Scrum | Not Matched | Niche |
Part of the goal here is I want to see what's different between multiple groups of two fields, that are "supposed" to have the same data set of unique values, and if possible to create a mapping file so that I can join other elements of data sets.
I want a macro, as this problem comes up often with the data I get, and the field names change constantly. So while I can do this in a follow, I'd need to update it each time.
Thanks @OllieClarke. As you stated, the approach is slow with large datasets if not cut down to unique values. I will try out your other suggestion.