Our Data Catalogue in Connect has about 2 million items (tables, views, columns).
I see the following issues:
- We collect metadata from more than 10 DBMSs. After each Metadata loader run, Alteryx Connect starts the load_alteryx_db script and processes the whole staging area (the DB_* tables), not only the metadata set just extracted from a single DBMS. This leads to huge redundancy.
- Following from the first issue: the one-by-one comparison of loaded metadata takes a lot of time in a real environment with 1-2 million items (an ordinary situation in a large bank). This comparison is also executed several times, once per loader run, so the redundant work grows with the number of DBMS servers.
All queries in this script that take a column or table name as a parameter (e.g. src.TABLE_NAME='${query_table_name}' AND src.COLUMN_NAME='${query_column_name}') will be executed as many times as there are columns in the Data Catalogue (millions of times). This will be very slow, because each column costs at least one separate query.
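To illustrate the difference I mean, here is a minimal Python/SQLite sketch. It is not the actual load_alteryx_db script: the tables `staging` and `catalogue`, the `SOURCE_NAME` and `DATA_TYPE` columns, and the function names are all hypothetical and only stand in for the DB_* staging tables and the catalogue. The first function issues one parameterized lookup per column, as the current queries seem to do; the second compares a single DBMS's metadata in one set-based join.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory DB with hypothetical staging/catalogue tables."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE staging   (SOURCE_NAME TEXT, TABLE_NAME TEXT,
                                COLUMN_NAME TEXT, DATA_TYPE TEXT);
        CREATE TABLE catalogue (SOURCE_NAME TEXT, TABLE_NAME TEXT,
                                COLUMN_NAME TEXT, DATA_TYPE TEXT);
    """)
    conn.executemany("INSERT INTO staging VALUES (?, ?, ?, ?)", [
        ("dbms_a", "ACCOUNTS", "ID", "INTEGER"),
        ("dbms_a", "ACCOUNTS", "BALANCE", "DECIMAL"),
        ("dbms_b", "CLIENTS", "ID", "INTEGER"),
        ("dbms_b", "CLIENTS", "NAME", "VARCHAR"),
    ])
    conn.executemany("INSERT INTO catalogue VALUES (?, ?, ?, ?)", [
        ("dbms_a", "ACCOUNTS", "ID", "INTEGER"),
        ("dbms_a", "ACCOUNTS", "BALANCE", "FLOAT"),  # data type differs
        ("dbms_b", "CLIENTS", "ID", "INTEGER"),
        # dbms_b CLIENTS.NAME is not in the catalogue yet
    ])
    return conn


def per_row_diff(conn):
    """One lookup per staged column, over the WHOLE staging area."""
    rows = conn.execute(
        "SELECT SOURCE_NAME, TABLE_NAME, COLUMN_NAME, DATA_TYPE FROM staging "
        "ORDER BY SOURCE_NAME, TABLE_NAME, COLUMN_NAME"
    ).fetchall()
    changed, lookups = [], 0
    for source, table, column, data_type in rows:
        lookups += 1  # one extra query for every single column
        hit = conn.execute(
            "SELECT DATA_TYPE FROM catalogue "
            "WHERE SOURCE_NAME = ? AND TABLE_NAME = ? AND COLUMN_NAME = ?",
            (source, table, column),
        ).fetchone()
        if hit is None or hit[0] != data_type:
            changed.append((source, table, column))
    return changed, lookups


def set_based_diff(conn, source_name):
    """One join, restricted to the DBMS that was just extracted."""
    return conn.execute(
        """
        SELECT src.SOURCE_NAME, src.TABLE_NAME, src.COLUMN_NAME
        FROM staging src
        LEFT JOIN catalogue cat
               ON cat.SOURCE_NAME = src.SOURCE_NAME
              AND cat.TABLE_NAME  = src.TABLE_NAME
              AND cat.COLUMN_NAME = src.COLUMN_NAME
        WHERE src.SOURCE_NAME = ?
          AND (cat.COLUMN_NAME IS NULL OR cat.DATA_TYPE <> src.DATA_TYPE)
        ORDER BY src.TABLE_NAME, src.COLUMN_NAME
        """,
        (source_name,),
    ).fetchall()


if __name__ == "__main__":
    demo = build_demo_db()
    print(per_row_diff(demo))
    print(set_based_diff(demo, "dbms_b"))
```

In the per-row variant the number of lookups equals the number of staged columns across all DBMSs, while the set-based variant runs a single query and only touches the rows of the source that was just loaded. Whether the real script can be restructured this way depends on how Connect stores the catalogue, which I cannot see from the outside.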
Can this process be optimized somehow?