Can we try to match column headers generate a summary table of the matching results?

Hi team,

I have a practical scenario where I have different CSV files (more than 100) that I need to process on a daily basis. The data in those CSV files changes daily and we use Alteryx to process the data in some of the columns. However, we are worried that one day, there are fewer or more data columns that we receive from our data provider, so we need to find a way to do a 'completeness check' by checking both the number and order arrangement of data files' column headers against a template for column headers (which represents our golden source).

Is it possible to do a check for multiple files, then collate the matching results of all files and then generate a summary table as output? Either in the form of xlsx or email is fine. My issue is how to use Alteryx to generate such summary table. We are not concerned about what specific information to have in the table as long as it helps in this control function (whether it shows True/False, or the actual number of column headers). I wish to have an example on how to generate sample table from our scenario described.

Please help. Thanks for your kind assistance.

Workflow

Datasets

Input

Accepted answers

Emil_Kos

Hi @wuaw,

Hope this helps:

Please mark it as solved as this is helpful for you!

Test.yxmd

danilang

Hi @wuaw

Since you have 100s of file, it would be impractical to have to duplicate @Emil_Kos' workflow 100s of times. You should use a solution that can read all the csv files in a directory and process them all at once. Normally, you'd be able to use a wild card "*.csv" on your input tool, but since there is the possibility that the columns may change, the wild card approach will miss some files giving you the dreaded "file skipped because of different schema error".

In this case you need a macro based solution to open the files one by one and compare the schema to the template schema.

The macro looks like this

Since you're only checking the schema of the files at this point, the Input is configured to read only the first record of file. The Field Info outputs a list of the fields in the table. After adding a column ID, this list is joined on column name and column ID to the template which comes in through the macro input at the bottom. The left and right output of the Join correspond to the column errors between the input and template files. The output are joined on column name only in the second join to determine which kind of error it it. The L output has extra fields. the J has fields that differ in order and the R output gives you the missing fields

When you call this through a main workflow(attached) all the files in the target directory are compared to the template file giving you results that look like this.

The "Order" result is when the columns are not in the same position as the corresponding columns in the template. "Extra" means there's an extra column and "Missing" denotes there is a column missing from the input. Note that a missing column cause the "missing" result and also an "order" result for each column that followed the missing one since these columns have shifted over by one column

Dan

MatchColumnHeaders.yxzp

All comments

Emil_Kos

Hi @wuaw,

Hope this helps:

Please mark it as solved as this is helpful for you!

Test.yxmd

wuaw

Hi @Emil_Kos

Thanks for the quick reply! I modified your workflow further by copy-paste and creating a second flow, and then added 'Summarize' at the end of each flow. But I'm stuck here - could we generate a simple summary table for this? I wish to see file name in first column, and matching results in second column. Any information about matching results are fine as long as it is useful. Thanks.

Test_summarize.yxmd

Quick Links

This months top contributors

atcodedog05 19598

Qiu 15875

binu_acs 15708

MarqueeCrew 13708

apathetichell 13703