We have a recurring monthly process which is being changed. One of the options available to us is determining the file format we will receive large data sets in. By large I mean over a hundred gigs of data. Unfortunately, we do not have the option of receiving these files in .YXDB format. So we are looking for the next best option which balances both size of the file and efficiency of reading the data into Alteryx.
I'm facing similar challenges: we read files that vary in size from 20 to 40 gigs on a daily basis. The difficulty is that we pick up files from shared network drives that are physically located outside our country, since, as you know, most organizations are spread across regions.
This Calgary concept looks cool. I have a use case that is a little different, and I would like to know some benchmarks or best practices that could help.
1. What if my inputs are flat files spanning 20 - 40 gigs in size, located on different shared network drives (not hosted FTP locations)?
2. Reading such huge files and converting them to an Alteryx database format such as YXDB is slow.
3. This usually happens when I pick up files from a distant server (overseas - the servers are physically in a different country, though within the organization's network).
4. What would be an alternate solution for handling such files? We read these universe files frequently as they get updated, so copying them to a local drive every time is also slow.
1. Is it possible to create a Calgary database out of a flat file that huge?
2. What are the benchmarks?
Eagerly waiting for some answers, as this would be a game changer for the huge data pipelines we have built over the last few years.
From my experience with large files (100+ gigs), if getting the file into an Alteryx format (YXDB or CYDB) is not an option, then I go with a delimited option, usually pipe, as a ".txt" file. These tend to be smaller in my experience than CSVs with the same exact data.
Anyway, a ".txt" file with a pipe delimiter is how I have our partners output files for my consumption into Alteryx (these partners do not have Alteryx).
Now, if the files are of the order of 200 million records or more and you're going to be filtering that data in either an app or another workflow then I would transform it to a CYDB format (Calgary).
For example, within a workflow and/or app, if you're filtering a CYDB with 500 million records, then the filtering will be very fast (seconds, if not less). You set up the filter right within the CYDB input tool. Whereas if you took the same data in YXDB format and used a filter tool after it, it might take a minute or two.
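The speed difference comes down to indexed lookup versus full scan: the Calgary input filters against an index, while a YXDB plus a Filter tool reads every record and then discards the non-matches. A rough SQLite analogy in Python (the table, column names, and row counts here are made up for illustration; Calgary's actual index format is proprietary):

```python
import sqlite3

# Build an in-memory table of 100,000 hypothetical records and index
# the column we will filter on, standing in for a Calgary index.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (id INTEGER, state TEXT)")
con.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(i, "CO" if i % 50 == 0 else "TX") for i in range(100_000)],
)
con.execute("CREATE INDEX idx_state ON t (state)")

# Indexed filter: the engine jumps straight to the matching rows,
# like setting the filter inside the Calgary input tool.
indexed = con.execute("SELECT COUNT(*) FROM t WHERE state = 'CO'").fetchone()[0]

# Full scan: pull every row out, then filter afterwards, like a
# YXDB input followed by a separate Filter tool.
scanned = sum(1 for _, s in con.execute("SELECT id, state FROM t") if s == "CO")

print(indexed, scanned)  # same answer either way; the work done differs
```

Both paths return identical results; the indexed path simply touches far fewer rows, which is why the gap widens as the record count grows into the hundreds of millions.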
Thanks MB! Right now the files are mostly in CSV format, and a few are text files that are directly consumed. The challenge I see here is that we have so much incoming data from different systems and external parties, and these files are spread across shared network machines / backup drives / FTP, from where I need to pull files for investigation. As you know, Alteryx is the one tool that helps us do quick slice and dice without demanding a proper data warehouse.
So I shall continue keeping my files in CSV and TXT format for consumption, unless and until I start hitting them frequently.
Thanks for your inputs! Appreciate it!!
Can you direct me to articles (blogs/tech notes/knowledge base) on building such apps and on best practices for handling huge data mining with Alteryx? It would be an eye opener.
I have the same issues - lots of large data files on lots of Shared Network Drives all across the country. The input is the most time consuming piece, no way around that unfortunately. We try to have the external folks send their data to our Corporate Data Center here. And our Alteryx is on a Server in the same building as the Corporate Data Center so the data doesn't have far to travel when being consumed into Alteryx. That helps a bit, but still, they are large files and they take a while to read in.
For best practices, I'd say check out the Interactive Lessons and Online Training, as well as searching for "optimize workflows" and the like on the community.