I have built a process that uses a custom fuzzy join to compare question sets in a pairwise manner. I was asked to extend this from 2 surveys to N surveys (currently built a template for 5 total surveys (2012 - 2016). The process does the following:
- Compare 2016 to 2015 and identify pairs
- Compare 2016 to 2014 and identify pairs
- Compare 2016 to 2013 and identify pairs
- Compare 2016 to 2012 and identify pairs
This leads to 4 output files all of of which have "2016_Question" as one column and "YYYY_Question" as the other (YYYY is 2015, 2014, etc.). When I union these I get the following:
2016_Question || 2015_Question || 2014_Question || 2013_Question || 2012_Question
Q_1 || Q_5 || Null || Null || Null
Q_1 || Null || Q_8 || Null || Null
Q_1 || Null || Null || Q_10 || Null
Q_1 || Null || Null || Null || Q_19
...
Ideally, I would have a single row for each unique 2016_Question, for example:
Q_1 || Q_5 || Q_8 || Q_10 || Q_19
...
Which effectively collapses the nulls. I tried googling and am getting nothing. Any tips?