I have a large database with 180 million records that I put into a table called [TBASE_DEMO_ALL.yxdb].
The query I built below selects the first 1 million records and OUTPUTS the data into a .txt file. That's pretty much it.
My question is that the SAMPLE tool and OUTPUT tool hit 99% in 2 minutes, while the INPUT tool is slowly creeping up - if my math is correct, it should take a couple of hours before the INPUT tool hits 100%.
Here's the kicker - if I cancel the workflow once the SAMPLE tool and OUTPUT tool hit 99% (so after 2 minutes), the 1-million-record .txt file has already been created.
So in my head I'm thinking that the INPUT tool is still trying to read the rest of the file. Is there any way I can stop it from reading the rest of the file? Am I not building the query correctly? Any thoughts would be appreciated. Thanks. Javier
The Input tool's configuration has a Record Limit setting. I don't know if it will read in the same 1,000,000 records each time, but it is worth testing (maybe test with something less than a million records, though 🙂).
If you are using the Sample tool as a way to limit downstream records, you may be better off setting the record limit in the Input tool directly. Most data sources allow that option, and it should stop the Input tool from reading once it hits the limit. With the Sample tool approach, the input file is read in its entirety before the records ever reach the Sample tool.
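The difference between the two approaches can be sketched in plain Python (a generic illustration of lazy reading, not Alteryx internals; the sizes and names are made up for the demo):

```python
from itertools import islice

def records(source, counter):
    """Lazily yield records from a source, counting how many are actually read."""
    for rec in source:
        counter.append(rec)
        yield rec

source_size = 1000   # stand-in for 180M rows
limit = 10           # stand-in for 1M rows

# Record limit at the input: reading stops as soon as the limit is hit.
reads_limited = []
first = list(islice(records(range(source_size), reads_limited), limit))
# len(reads_limited) == limit: only the first `limit` records were ever read

# Sample after a full read: the entire source is consumed first,
# then the first `limit` rows are kept.
reads_full = []
all_rows = list(records(range(source_size), reads_full))
sample = all_rows[:limit]
# len(reads_full) == source_size: every record was read before sampling
```

Both paths produce the same first 10 rows; the difference is how much of the source is read to get them.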
Another trick when dealing with those volumes is to right-click a tool and choose Cache and Run Workflow so the input is only read once while you are building. You only feel the pain once that way.
I totally missed DanielG's post in my last response. Reading those volumes, I have found, requires more care during development so you don't pull your hair out. For example, I just wrote a test workflow that pulled 2.6M records from Snowflake and did a count distinct on my PK. This took 42 seconds.
The trick I have been using is to put a sacrificial Select tool (or any tool with a single output anchor) after the Input tool and use Right-Click --> Cache and Run Workflow on it. In many cases you can cache the Input tool's results directly, but not when using a YXDB input; you can, however, set a cache point using the Select tool right after it. The second run, reading from that cache, took 7 seconds, and I still have that output to use later on with all my columns.
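The cache-once-then-reuse pattern can be sketched in Python (a rough analogy only; the cache file name and read function are hypothetical, and Alteryx's caching actually writes a YXDB, not a pickle):

```python
import os
import pickle
import tempfile
import time

# Hypothetical local cache file (stand-in for the YXDB Alteryx writes to its temp folder).
CACHE = os.path.join(tempfile.gettempdir(), "input_cache.pkl")
if os.path.exists(CACHE):
    os.remove(CACHE)  # start clean for the demo

def expensive_read():
    """Stand-in for a slow source read (e.g. millions of rows)."""
    time.sleep(0.1)
    return list(range(1000))

def cached_read():
    # The first run pays the full read cost and writes the cache;
    # later runs just load the local file.
    if os.path.exists(CACHE):
        with open(CACHE, "rb") as f:
            return pickle.load(f)
    data = expensive_read()
    with open(CACHE, "wb") as f:
        pickle.dump(data, f)
    return data

rows_first = cached_read()    # slow path: reads the source, writes the cache
rows_second = cached_read()   # fast path: loads the cache only
```

The point is the same as the sacrificial Select trick: pay the full read once, then iterate against the cached copy.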
A YXDB is basically a text file with no indexes, so reading it is a brute-force procedure. You can also enable AMP, if you created the YXDB to be AMP-friendly, to help the read times.
@john_watkins -- I agree completely that for testing purposes, Cache and Run is a fantastic time saver.
Caching literally creates a YXDB file within the temp folders on your computer, so I wonder if just dropping his actual YXDB into that same folder would allow for the improvements in run time. Does the Cache and Run process do anything under the covers to optimize the file further, beyond the standard YXDB creation an Output tool does in another workflow?
Not that dropping a Select after the input isn't easier than finding the cache folder and moving the YXDB there. I am just curious now that we are talking about it. haha.
Haven't done anything with AMP yet either, but I will try to get into it a bit more when my time frees up.
@jdelaguila - hopefully something in here is helpful to you 🙂
DanielG - You are correct that the Input tool's "Cache Input" option and the Cache and Run Workflow option both actually create a local .yxdb. I thought the latter went straight into memory.
Another option I've used when data gets into the hundreds of millions is to create a Calgary output file. This makes the "temp" database behave more like a normal indexed database, and from what the help file says, its sweet spot is between 100M and 300M records. I just built one. Even if you just create a RecordID and filter the Calgary output for RecordID <= 1,000,000, it will return nearly instantly.
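Calgary's format is proprietary, but the reason an indexed range filter comes back nearly instantly can be illustrated with SQLite's query planner (the table and column names here are made up for the demo; this is an analogy, not Calgary itself):

```python
import sqlite3

# Small in-memory table as a stand-in for an indexed "temp" database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE demo (RecordID INTEGER, val TEXT)")
con.executemany(
    "INSERT INTO demo VALUES (?, ?)",
    ((i, f"row{i}") for i in range(1, 10_001)),
)

# Without an index, a RecordID filter is a full-table scan.
plan_scan = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM demo WHERE RecordID <= 100"
).fetchall()
# The plan's detail column mentions a SCAN of the table.

# With an index, the engine seeks the range and stops when it ends.
con.execute("CREATE INDEX idx_rid ON demo (RecordID)")
plan_index = con.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM demo WHERE RecordID <= 100"
).fetchall()
# The plan's detail column now mentions a SEARCH using the index.
```

Same filter, same result set; the index just lets the engine avoid touching the other 9,900 rows, which is the behavior the Calgary output buys you at much larger scale.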