Macro advice needed for new dynamic process

Question

I have something I am trying to build but not sure what the optimal way of doing it is and looking for advice.

I have several yxdb. files containing UK postcodes, where each file is from a different month. The files are in a format YYMM.yxdb so are named like:

* 1911.yxdb
* 1912.yxdb
* 2001.yxdb
* 2002.yxdb

The number of files going into the process can change each time so the process needs to be dynamic. I can easily merge all the files into one file with a MONTH field if this is better.

I want the process to find when a postcode was introduced, and when it was terminated. If a postcode is in all the files, it will not have an introduced or terminated value. The current (very outdated) process joins the first two files on postcode, assigns an introduced or terminated date if the postcode doesn't exist on either and then moves on. Postcodes that existed on both, and newly introduced postcodes are unioned and then joined with the next file in the process where it repeats the process. Then all records are finally unioned back together (including all the terminated postcodes) into one big file.

* Join first two files.
* Middle and right output are unioned and joined with third file.
* Middle and right output of that are unioned and joined with fourth file.
* Repeat until all files done.
* Create one file file of all the left outputs and the final middle and right output.

Currently as you can see, this takes a lot of manual updating before running and the joins need setting up for the number of files there are. I want this to run dynamically. This feels like it needs an iterative macro because it needs to repeat this process for the number of files.

Anyone have an optimal suggested method to tackle this? I am reasonably confident with macros so any pseudocode for any optimal method to get me going would be useful! There are millions of records in each file so processing time is important.

BrandonB · Answer

Rather than summarize with a min and max value, you may instead choose to first sort where the files are earliest to latest and sorted by Post Code and then use a Multi-Row tool. You could check the box to group by the Post Code and then add a flag if the row above is not one number less than the current. This would be an easy way to highlight these breaks where a Post Code doesn't appear and then comes back.

al_sweets · Answer

Thanks @BrandonB this is a good idea but it won't identify cases where sometimes a postcode is terminated, missing for a few months, and then reintroduced, which sometimes happens.

In the current process, in the final file you'd get separate rows for when it was terminated and reintroduced.

I'm now thinking maybe combine into one file, and then do a crosstab count so you can see which versions each postcode appears on and which it doesn't. And then I can devise a formula to find any that appeared and disappeared within the months.

BrandonB · Answer

What if you read in all yxdb files in one Input Data tool by using an asterisk as a wildcard in the file name but also changed the #4 option to Output File Name as Field. Then you could use a summarize tool where you group by the post code then do a min and a max of the file names. This would give you the minimum and maximum file for every post code.