
Most Efficient File Format to Read Into Alteryx, Behind YXDB

We have a recurring monthly process that is being changed. One of the options available to us is choosing the file format we will receive large data sets in; by large, I mean over a hundred gigs of data. Unfortunately, we do not have the option of receiving these files in .YXDB format, so we are looking for the next best option, one that balances file size against the efficiency of reading the data into Alteryx.

 

Thoughts, suggestions?

Magnetar

YXDB is definitely Alteryx's best-performing format.

 

I'd expect one of the flat file formats to be the next best, to the point where you could end up limited by how fast you can read from disk.

 

I'd try a CSV, and if you can produce it with no quoted strings, I'd expect Alteryx to handle it very fast.
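A minimal sketch of what that could look like on the producing side, assuming the upstream system can run Python (the file name and fields here are made up): with quoting disabled, the reader never has to track quote state, so any stray delimiter characters must be cleaned out of the fields before writing.

```python
import csv

# Hypothetical rows; any commas inside fields must be removed or replaced
# upstream, because QUOTE_NONE writes fields exactly as given.
rows = [
    ["id", "name", "amount"],
    ["1", "Acme Ltd", "1200.50"],
    ["2", "Smith and Sons", "87.00"],
]

with open("extract.csv", "w", newline="") as f:
    # QUOTE_NONE raises an error if a field contains the delimiter,
    # which is a useful guard that the "no quoted strings" rule holds.
    writer = csv.writer(f, quoting=csv.QUOTE_NONE)
    writer.writerows(rows)
```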

 

 

Community Content Manager

The Calgary Database file format might be the best choice for you, as Alteryx has a suite of tools to quickly index, query, and retrieve records:

http://help.alteryx.com/current/index.htm#AlteryxFiles.htm#CYDB__Calgary_Database_

 

Tara McCoy
Sr. Community Content Manager, Alteryx

Correct me if I'm wrong, but isn't Calgary also just an Alteryx file format?

 

If so, then that won't work.

Raghu_s
Meteor

Hi @TaraM,

 

I'm facing similar challenges: we read files that vary in size from 20 to 40 gigs on a daily basis. The difficulty is that we pick the files up from shared network drives that are physically located outside our country; as you know, most organizations are spread across the globe.

 

This Calgary concept looks cool. I have a use case that is a little different, and I'd like to know about any benchmarks or best practices that could help.

 

1. What if my inputs are flat files of 20 to 40 gigs located on different shared network drives (not hosted FTP locations)?

2. Reading such huge files and converting them to an Alteryx database format such as YXDB is slow.

3. This usually happens when I pull files from a distant server (overseas; the servers are physically in a different country, though still within the organization).

4. What would be an alternative way to handle such files, given that we read these universe files frequently and they keep getting updated? Copying them to a local drive every time is also slow.

 

So,

1. Is it possible to create a Calgary database out of a flat file that huge?

2. What are the benchmarks?

 

Eagerly waiting for some answers, as this would be a game changer for the huge data pipelines we have built over the last few years.

 

Thanks in advance. 

Bolide

From my experience with large files (100+ gigs), if getting the file into an Alteryx format (YXDB or CYDB) is not an option, then I go with a delimited option, usually pipe, as a ".txt" file. In my experience these tend to be smaller than CSVs with the exact same data.

 

Anyway, a ".txt" file with a pipe delimiter is how I have our partners output files for my consumption into Alteryx (these partners do not have Alteryx).
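To make the size point concrete, here is a rough sketch (plain Python with illustrative file names and data, not anything our partners actually run): fields containing commas force quoting in a CSV, while a pipe delimiter usually needs none, so the pipe file comes out smaller for the same rows.

```python
import csv
import os

# Illustrative data: the customer field contains a comma on every row.
rows = [["order_id", "customer", "total"]]
rows += [[str(i), "Jones, Amy", "54.20"] for i in range(100_000)]

# Comma-delimited: the embedded commas force quotes around the field.
with open("extract.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Pipe-delimited: no field contains a pipe, so nothing needs quoting.
with open("extract.txt", "w", newline="") as f:
    csv.writer(f, delimiter="|", quoting=csv.QUOTE_NONE).writerows(rows)

print("csv:", os.path.getsize("extract.csv"), "bytes")
print("txt:", os.path.getsize("extract.txt"), "bytes")
```

On data like this, the .txt file saves two quote characters per quoted field, which adds up over hundreds of millions of rows; with data that never needs quoting, the two formats come out the same size.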

Now, if the files are on the order of 200 million records or more, and you're going to be filtering that data in either an app or another workflow, then I would transform them to the CYDB (Calgary) format.

 

For example, within a workflow and/or app, if you're filtering a CYDB with 500 million records, the filtering will be very fast (seconds, if not less); you set the filter up right within the Calgary input tool. Whereas if you took the same data in YXDB format and used a Filter tool after it, it might take a minute or two.
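Calgary's internals aren't public, but that speed difference matches what any indexed store gives you. As a rough analogy only (SQLite here, not Alteryx; the table and column names are made up), filtering through an index avoids scanning every row, which is what a Filter tool downstream of a YXDB read effectively has to do:

```python
import sqlite3
import time

# In-memory database so the sketch is self-contained and repeatable.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE records (id INTEGER, region TEXT)")
cur.executemany(
    "INSERT INTO records VALUES (?, ?)",
    ((i, "WEST" if i % 4 == 0 else "EAST") for i in range(1_000_000)),
)
conn.commit()

# Unindexed: the WHERE clause scans all one million rows.
t0 = time.perf_counter()
cur.execute("SELECT COUNT(*) FROM records WHERE region = 'WEST'").fetchone()
print("full scan:", round(time.perf_counter() - t0, 4), "s")

# Indexed: the lookup walks the index instead of the whole table.
cur.execute("CREATE INDEX idx_region ON records(region)")
t0 = time.perf_counter()
cur.execute("SELECT COUNT(*) FROM records WHERE region = 'WEST'").fetchone()
print("indexed:", round(time.perf_counter() - t0, 4), "s")
conn.close()
```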

Hope that helps!

Raghu_s
Meteor

Thanks, MB! Right now the files are mostly in CSV format, and a few are text files that are consumed directly. The challenge I see is that we have so much incoming data from different systems and from external parties, and these files are spread across shared network machines, backup drives, and FTP locations from which I need to pull files for investigation. As you know, Alteryx is the one tool that lets us do a quick slice and dice without demanding a proper data warehouse.

 

So I shall continue keeping my files in CSV and TXT format for consumption, unless and until I start hitting them frequently.

 

Thanks for your inputs! Appreciate it!!

 

Can you direct me to articles (blogs, tech notes, knowledge base posts) on building such apps and on best practices for handling such huge data sets in Alteryx? It would be an eye opener.

Bolide

You're welcome.

 

I have the same issues: lots of large data files on lots of shared network drives all across the country. The input is the most time-consuming piece; there's no way around that, unfortunately. We try to have the external folks send their data to our corporate data center here, and our Alteryx instance is on a server in the same building as the data center, so the data doesn't have far to travel when being read into Alteryx. That helps a bit, but still, they are large files and they take a while to read in.

 

For best practices, I'd say check out the Interactive Lessons and Online Training, as well as searching for "optimize workflows" and the like on the community.  

 

Cheers!