Hi, I'd like to learn under better under what circumstances a Calgary database is expected to outperform a .yxdb.
I am working with a large dataset of 64 million records with several dozen columns. I'm getting inconsistent performance.
In this first example my Calgary database loads slower than a .yxdb despite applying a pre-filter (on Brand Name, which has about 1,250 distinct values) that eliminates about 97% of the records. This is more surprising given that my Calgary database was created using a subset of columns from the .yxdb. Run times:
Calgary: 2:59 minutes
.YXDB: 1:13 minutes
However, in my second example, my Calgary database outperforms when I apply a pre-filter to a different field (Canadian province, about 10 distinct values).
Calgary: 0:17 minutes
.YXDB: 1:24 minutes
Any tips and advice appreciated, thanks!
Hi @schaferc
I think the biggest component here is the indices the Calgary file was built with. When the Calgary file is created with the Calgary Loader tool, there are settings to control which fields and what types of indices are created. These field indices are what make the selective loading/joining of the Calgary records possible. Fun fact, yxdb files support field indices indices, but only on the spatial fields (and are created by default). That's why performing a spatial match with a yxdb file as the universe is so fast.
If you don't have access to the workflow that created the indices, you can browse to the Calgary file folder and see what .cyidx files are present. I assume there isn't an index on the Brand Name field. Recreate the Calgary with that index and let us know how that goes.
Hi Charlie,
I do have an index for Brand Name field.
In a bizarre twist, I am getting better performance loading my entire Calgary file compared to applying the pre-filter. Loading the entire Calgary file (i.e., no pre-filters) had a load time of 0:53 min compared to 1:59 min when I set Brand Name is 'Diadora' (~3% of records). I will recreate the Calgary database, as well as set the index to high sensitivity (currently Auto). Will share the results shortly.
Thanks.
Update. No improvements from recreating the Calgary database and selecting high sensitivity for Brand Name. I do see very good performance from applying pre-filters to other fields like Province. Perhaps the performance is better because Province is being treated as a list to choose from, whereas Brand Name is string-based without a selectable list.
But the fact that I get worse performance applying a Calgary pre-filter compared to loading the whole file bewilders me.