This site uses different types of cookies, including analytics and functional cookies (its own and from other sites). To change your cookie settings or find out more, click here. If you continue browsing our website, you accept these cookies.
I have a problem with nested data on which I want to do comparisons.
Data set 1 includes a collection of 'correct' combinations of two variables, call them X and Y (or rather, such a data set can be derived from a larger data set - anyway, this is not the problem). There are in general multiple values of X (...'children') for each Y (...'parent').
Data set 2 includes a list of individuals who are each of them assigned various combinations of these X's and Y's - I want to compare the realized values of the X's and Y's of these individuals with the known 'correct' combinations to which they ought to be assigned. Individuals in the data set are each assigned a subset of the Y's (and the corresponding X's) included in the 'correct' data set, and for each Y there's some positive probability that an individual will be assigned fewer X-observations than are included in the matching 'correct' (X,Y)-combination; no individual will have more (unique) X's matching a specific Y than are present in the 'correct' dataset (so if the true number of X-variables associated with variable Y-1 is 5, the unique count for individual i may be 4, but it will never be 6 (if counting non-uniques; however the count for individual i may in some cases be 50 instead - there are a lot of duplicates, e.g. because of underlying time variation ('the same X may be repeated for each individual for each year for a decade') in which I'm not interested right now). In terms of the size of the data sets, the values of all the relevant (unique) variables mentioned are in the hundreds and the number of unique combinations of variables are in the thousands, but the size of the data sets will probably double in the months to come and a flow should be able to handle this. All data variables on which we are operating are strings. Structurally the combined data that is of interest looks a bit like this:
Y | X |
| X' | Y' | Ident |
ABC | QR |
| ABC | QR | QQQ |
ABC | ST |
| ABC | ST | QQQ |
ABC | UV |
| ABC | QR | WWW |
DEF | OPQ |
| DEF | OPQ | RRR |
DEF | PRS |
| DEF | PRS | RRR |
DEFG | TUV |
| DEFG | TUV | AAA |
DEFG | QR |
| DEFG | QR | QQQ |
I tried to make a simple example that retained most of the critical aspects of the data structure. Y & X are here the 'correct' combinations, X’, Y’ & ‘Ident’ being the comparison data which I want to analyse. In the example individual QQQ is missing 'UV' from the ABC Y-variable, individual WWW is missing QR and ST from the ABC Y-variable, individual RRR does not miss any X-variables from the DEF Y-variable and individual AAA is missing the QR variable and individual QQQ the TUV variable from the DEFG Y-variable. Note that the X-variable QR is associated with both the ABC- and the DEFG Y-variable - a substantial proportion of the X's belong to more than one Y at the same time - and that individual QQQ is assigned more than one Y-variable. The naming similarity between the X-variables DEF and DEFG was deliberate, as was the dissimilar string length. In the 'correct' data set the number of X-variables assigned to each Y varies from 1 to 30.
Comparing grouped counts is one approach I've considered as a starting point (...asking the question: how many unique X's belong to each unique Y in the 'correct' data set? Compare this number with a count for each individual assigned to variable Y..), but I need more than that because I also need to know exactly which X's are missing for each individual in contexts where an individual is assigned the Y to which the specific X belongs. Using the summarize tools and a few other tools I am able to obtain the lists I want to compare (lists like the ones illustrated above), but how to compare them at the level I want to compare them, and turning out the output in which I'm interested, is turning out to be a problem. I have a few months of experience working with Alteryx, but almost no experience working with macros (in Alteryx).
hi @BYJE
Here's a solution that provides what you're looking for...I think. For each Ident and Y, it returns the number of actual X as well as the total possible X.
This solution only uses the data from your sample and doesn't include any filtering for uniqueness. You'll need to add this based on your actual data, years, etc.
Let me know if I missed something while reading through your requirements
Edit: I re-read your requirements and I currently looking into how to list the missing Y'
Dan
Hi @BYJE
Here's an updated version of the workflow. It reports the results for all the Idents and lists the Y and X missing from the actual data. These are the rows where there are [Null] values in the X' and Y' columns. It only returns results for Y where there is at least one record in the actuals with a corresponding Y'. So there are no records for Ident=AAA and Y=ABC since there were none in the actuals. If you want to include these, remove the Filter tool("Remove the combinations...")
Dan
Hi @danilang
I decided to get some (offline) help with the problem today, and a solution mechanism was found that I was eventually convinced actually worked - it does look structurally not too dissimilar to what you suggest (combinations of summarize tools, formula tools, filters, joins and unions).