

How To: Create a Calgary Database

HenrietteH
Alteryx


 

A Calgary Database is a proprietary Alteryx format that allows users to quickly query a file with millions of records without having to read in all the data. A Calgary Database is created with the Calgary Loader tool, which allows users to create a database from any type of input while selecting which fields to index.
Calgary Databases are useful for running ad-hoc queries against a large dataset, e.g. ConsumerView data.

 

 

Prerequisites

 

  • Alteryx Designer (any version)

 

 

Procedure

 

  1. Bring the data to be written to the Calgary Database into Alteryx and transform it until it is in the desired format, keeping in mind opportunities to standardize values so the indexes work better. For example, are all your ZIP codes properly and consistently formatted?
    Starting with Alteryx 5.0, Calgary indexes are no longer case sensitive, so "CALIFORNIA" and "California" are treated the same. However, if some of your data uses the full state name and some uses the state abbreviation, and you plan to use state as an index, pick one convention and use it consistently.
    You might also want to add flags that other users might find useful for querying the data. For example, create a flag to indicate the current month's (or quarter's or year's) data, or a custom region such as "Northeast", "South", or "Midwest". (A small example of this kind of clean-up follows the procedure.)

     
  2. In the Calgary Loader tool, specify the location of the Calgary Database in the "Root File Name" box. The tool will create a .cydb file (the data file), multiple .cyidx files (the index files), and an __Indexes.xml file that contains the index values. Since it writes out multiple files, a best practice is to dedicate a folder to your Calgary Database. Note that you cannot append to a Calgary Database; to add records, rebuild the database.
     
  3. Use the "Data" and "Index" columns to select which fields to include as data fields and which to index. Typically, all fields are included as data, but only certain ones are indexed as each index takes time to create. For index fields, the index type can be selected. "High Selectivity" is used for data with many different possible values, such as ZIP codes. Select "Low Selectivity" for data with fewer unique values, such as State or Region. "Low Selectivity" also creates a drop-down option for the Calgary Input Tool.
    By default, the index type setting is "Auto". In Auto mode, Alteryx looks at the first 1 million rows of data and decides if the index should be high or low selectivity. All fields with more than 550 unique values will be set to high selectivity. If the data changes after the first 1 million rows, Alteryx might select the incorrect index type. This option might also take longer to process since Alteryx has to look at 1 million rows of data for each index in Auto mode.
    [Screenshot: Calgary Loader tool configuration]
     
  4. Use the Calgary Input tool to read in data from a Calgary database as described here: Querying a Calgary DB / File to Select and Limit Input Records.

    Did you know that you can read a .cydb file in the regular Input Data tool the same way as a .yxdb file? However, you won't be able to query any of the indexed fields.
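
As a rough illustration of the standardization ideas in step 1, here is a minimal Python (pandas) sketch that pads ZIP codes, settles on one state convention, and adds a current-month flag. The field names (zip, state, sales_date) and the state lookup are hypothetical examples; inside Designer you would more typically do this with Formula and Select tools, or the Python tool, before the Calgary Loader.

    import pandas as pd

    # Hypothetical sample data standing in for the records to be loaded.
    df = pd.DataFrame({
        "zip": ["2139", "02139", "90210"],
        "state": ["MA", "Massachusetts", "CA"],
        "sales_date": pd.to_datetime(["2025-01-05", "2024-12-30", "2025-01-17"]),
    })

    # Pad ZIP codes to a consistent 5-character format.
    df["zip"] = df["zip"].str.zfill(5)

    # Standardize state values to a single convention (abbreviations here).
    state_abbrev = {"Massachusetts": "MA", "California": "CA"}  # partial, illustrative mapping
    df["state"] = df["state"].replace(state_abbrev)

    # Add a flag other users might find handy to query on, e.g. current-month records.
    now = pd.Timestamp.now()
    df["current_month"] = (df["sales_date"].dt.year == now.year) & (df["sales_date"].dt.month == now.month)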
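
If you prefer to choose index types yourself rather than relying on Auto mode, the sketch below mirrors the heuristic described in step 3 (more than 550 unique values suggests "High Selectivity", otherwise "Low Selectivity"). The function name and the sample columns are hypothetical, and note that the real Auto mode only inspects the first 1 million rows.

    import pandas as pd

    def suggest_index_types(df: pd.DataFrame, cutoff: int = 550) -> dict:
        # Suggest a Calgary index type per column based on unique-value counts.
        suggestions = {}
        for col in df.columns:
            n_unique = df[col].nunique(dropna=True)
            suggestions[col] = "High Selectivity" if n_unique > cutoff else "Low Selectivity"
        return suggestions

    # Example with a small hypothetical dataset; both columns come back as "Low Selectivity".
    sample = pd.DataFrame({"zip": ["02139", "90210", "10001"], "state": ["MA", "CA", "NY"]})
    print(suggest_index_types(sample))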

 

 

Additional Resources

 

Comments
bertal34
8 - Asteroid

I took an existing yxdb file and created a Calgary database with it. After testing the performance, I found the Calgary database is actually slower to query than the original yxdb. Can you expand on why this could be happening and possible changes I could make to gain the performance benefits described?

HenrietteH
Alteryx

Hi @bertal34 

 

It depends on exactly how you are querying the data. Alteryx has to access the index and then pull the record(s) associated with it, so if you have a lot of query criteria that don't greatly reduce the number of records, I wouldn't expect great performance. The greatest performance increase should happen when you can use a simple query (e.g. by state) to read in a small subset of your data.

Does that help answer your question?