Hello,
I purchase tax parcel information organized and delivered by U.S. state in ESRI GeoDatabase format (gdb). Each state file is about 6GB.
I convert the files into .yxdb format as my first step. I understand these files contain every parcel in each state (hundreds of thousands), but each one takes 24-36 hours to load, which makes the process unwieldy. I update these files every quarter, and updating 15 or so states takes weeks. The waiting is ruining my timeline for getting work done. Since I am using Batch Macros, I can never tell if the machine is hung up, moving abnormally slowly, or simply crunching away as expected.
Questions:
Is there a better way to import/convert GDB to be used by Alteryx? Reading the old discussions, it seems like the integration of the GDB format may have been a workaround to satisfy those asking for the functionality.
Should I be running the loading procedure in parallel and not in series with the Batch Macro? Is the bottleneck not related to machine resources? Is there a better way to see work progress than looking at the macro completion % (stays at 50%)?
My machine has a Xeon Processor with 10 real/20 virtual cores, 128 GB of RAM, and uses SSD drives.
Thanks for the help!
Can't tell without seeing your workflow. Do you incorporate the downloading (or downstream uploading) in your workflow? My major places to look would be:
1) Network issues (i.e., downloading/uploading).
2) Data Cleanse. Don't use it.
That seems longer than I'd expect. And just an FYI - I tend to use Batch Macros here too. I believe they may slow down processing on large data, but they do provide partial success and better tracking/testing.
Thanks for the reply.
The files are all local on my SSD. There is a Data Cleanse step in there, I'll take it out.
I just started the macro and will see if taking Data Cleanse out helps. I did notice that it always showed 96% until completion many hours later. What a strange bug in what should be a straightforward function. I may try smaller batches running in parallel to see if the system can handle more than one macro running at the same time.
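For what it's worth, here is a minimal sketch of what I mean by running smaller batches in parallel with per-state progress reporting. It is only an illustration: `run_batches` and `fake_convert` are names I made up, and `fake_convert` is a placeholder for whatever actually converts one state's .gdb to .yxdb (it is not an Alteryx API). The idea is just that each state reports as soon as it finishes, and one failed state does not stop the others.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_batches(items, convert, max_workers=3):
    """Run convert(item) for each item in parallel; report as each one finishes."""
    results = {}
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each submitted job back to the state it belongs to.
        futures = {pool.submit(convert, item): item for item in items}
        for done, future in enumerate(as_completed(futures), start=1):
            item = futures[future]
            try:
                results[item] = ("ok", future.result())
            except Exception as exc:  # one failed state should not stop the rest
                results[item] = ("failed", str(exc))
            elapsed = time.monotonic() - start
            print(f"[{done}/{len(futures)}] {item}: {results[item][0]} after {elapsed:.1f}s")
    return results


def fake_convert(state):
    # Hypothetical placeholder for the real per-state .gdb -> .yxdb conversion.
    if state == "XX":
        raise ValueError("unreadable geodatabase")
    time.sleep(0.01)
    return f"{state}.yxdb"


if __name__ == "__main__":
    run_batches(["OH", "PA", "XX"], fake_convert)
```

I would start with a small `max_workers` (2 or 3) and watch memory and disk activity before going higher, since I don't yet know whether the bottleneck is the machine or the reader itself.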
Data Cleanse is a memory hog. Also, get rid of Browses displaying huge amounts of map points. I'm running some queries with/without Browses and with/without Data Cleanse (with a single spatial field) to document the difference.
1,000,000 rows (one spatial point/3 text fields).
- no data cleanse - no browse (0.5 seconds)
- with data cleanse - no browse (2.8 seconds)
- with data cleanse - with browse (hit stop at the 4 minute mark, 345,000 records in)
- with browse - no data cleanse - no spatial object (0.4 seconds)
- with browse - with data cleanse - no spatial object (0.6 seconds)
This is a base workflow. Extra Browses create extra in-memory map objects. Drop them.
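To put those timings in perspective, here is a rough back-of-the-envelope calculation. The numbers are the ones from my test above. The projection for the stopped run assumes throughput stays constant for the remaining rows, which is my assumption and not something I measured. The function names are just for illustration.

```python
ROWS = 1_000_000          # rows in the test

baseline_s = 0.5          # no data cleanse, no browse
cleanse_s = 2.8           # with data cleanse, no browse
partial_rows = 345_000    # rows processed when I hit stop
partial_s = 4 * 60        # stopped at the 4 minute mark


def slowdown(seconds, base_seconds):
    """How many times slower a run is than the baseline."""
    return seconds / base_seconds


def projected_seconds(rows_done, seconds, total_rows):
    """Linear extrapolation of a partial run to the full row count (assumption)."""
    return seconds * total_rows / rows_done


if __name__ == "__main__":
    full_browse_s = projected_seconds(partial_rows, partial_s, ROWS)
    print(f"data cleanse alone: {slowdown(cleanse_s, baseline_s):.1f}x the baseline")
    print(f"cleanse + browse on spatial data: about {full_browse_s / 60:.1f} minutes projected")
    print(f"that is roughly {slowdown(full_browse_s, baseline_s):.0f}x the baseline")
```

Even allowing for a lot of error in the extrapolation, the Browse on spatial data dominates everything else in this test.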