
Alteryx Designer Desktop Discussions

SOLVED

Frequency tables on large Teradata table

hanna
7 - Meteor

Dear Alteryx community,

 

I am trying to get frequencies for 200+ fields in a Teradata table that has 100 million records. Is it possible? I picked one field and it seems to be running very slowly.

 

Are there any other solutions? Any suggestions are welcome!

 

Hanna

david_fetters
11 - Bolide

Hanna,

 

That's likely to take a minute or two to run depending on your specs!

 

With very large datasets, I find that the Summarize tool's Group By and Count functions seem to have much less overhead than the Frequency tool.  You'll need to iterate column by column over your table to get it to work, but I think it will be comparatively much faster (on the processing side, not necessarily the IO side).

 

You might also consider writing a batch macro that executes the appropriate SQL query inside the database and then captures the results.  Essentially, just run a query like: 

 

 

 

SELECT [Field1] as FieldValue, COUNT(*) as RecordCount
FROM theDatabase
GROUP BY [Field1]

You can then have the batch macro replace [Field1] with the appropriate field names, and just add a column inside your macro with the name of the field you're counting.  Your results then get spit out as a tall table of the form (FieldName,FieldValue,RecordCount)
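Just to make the substitution concrete (using a made-up field called Gender - your actual column names will differ), one pass of the macro would end up running something like:

-- one iteration of the batch macro, with Gender substituted in place of Field1
SELECT [Gender] as FieldValue, COUNT(*) as RecordCount
FROM theDatabase
GROUP BY [Gender]

and the column the macro adds tags every row from that pass with "Gender", so stacking all the passes gives you one row per field/value pair.  One caveat I can't verify for your environment: the square brackets are SQL Server-style quoting, so on Teradata you may need the bare column name or double quotes instead.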

 

While Alteryx is pretty freaking amazing, there's a cost to moving data from a database to the client, and that cost can sometimes exceed the advantages of handling your manipulations inside Alteryx.  100 million records is a decent amount (depending on specs), and unless it was all byte data, I don't think I'd want to move it out of the database just to do a quick group and count, IMHO.

 

hanna
7 - Meteor

Thank you very much for your reply, David.

 

It took 40 minutes to run 9 columns. And I need to work on over 200 columns. I will try both of your suggestions and see which is better. I will let you know my findings.

 

Thanks again.

Hanna

david_fetters
11 - Bolide

My pleasure Hanna.

 

I'm always interested in benchmarking results, so I think it would be great for the community if you could let us know what you find.  If you wouldn't mind, would you also post your RAM and processor speed so we can get an idea of how fast your machine is?

 

While that's not lightning fast speed, if the processing time increases linearly (another great thing to test experimentally) that would mean yours should take... (carry the two)... a shade under 15 hours to process.  If this is something you only need to do infrequently, then you can run it overnight.  Our analysts tend to load up our machines with long running workflows on a Friday and let them run for anywhere from 12 to 48 hours (large scoring projects with many models).
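(Rough math, assuming the scaling really is linear: 40 minutes for 9 columns is about 4.4 minutes per column, so 200 columns works out to roughly 200 x 4.4 ≈ 890 minutes, i.e. just under 15 hours.)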

 

One thing I've learned from running workflows that can take a day or two to finish is that you're probably better off splitting the task into batches that you can complete and "checkpoint" just in case your computer freezes or your connection to the DB server gets cut due to some DBA's poorly timed maintenance window.  For your case, I'd probably have it run frequencies on a handful of fields at a time, then output the results to a timestamped file before starting over with the next set of fields.  That way you can pick up right where you left off if something fails.  Ever since we missed a few deadlines due to connection errors in long running streams, I've started building out batch processes for big processing jobs.  It tends to be somewhat less efficient, because Alteryx is much better at managing memory if you give it everything and let it decide the order in which it needs records, but I don't regularly lose more than a few hours of processing at a time.

hanna
7 - Meteor

Hi David,

It looks like the processing time is not linear - one field took 20 minutes, 9 fields took 40 minutes, and 15 fields took 55 minutes. I am limiting my number of fields to 15. My PC info:

 

Processor: Intel(R) Core(TM) i5-4300 CPU @ 1.90GHz, 2.50 GHz

RAM: 8.00 GB

System type: 64-bit

 

I am not tech savvy. I wonder how to do what you suggested:

" have it run frequencies on a handful of fields at a time, then output the results to a timestamped file before starting over with the next set of fields."

Do you mean scheduling incremental runs?

 

Thank you David!

david_fetters
11 - Bolide

Hanna,

 

I'm going to assume that those times given are from running in Alteryx and using the Frequency tool, rather than a query to a database.

 

What you've found is an example of the memory efficiency Alteryx uses.  Basically, your results demonstrate that when Alteryx calculates the frequency for a single field, it has to go through every single row once.  When you select multiple fields, instead of making one pass per field, it tries to examine all of them in a single pass through the data.  That means it takes longer than running a single field, but not linearly longer.

 

In your case, it looks like your best bet within Alteryx is to just go ahead and run the frequencies for all 200 fields at once (just set it up for the night).  Alteryx is pretty stable, and I can't imagine your query taking more than a night with the numbers you've provided.

 

What I was referring to would involve either building a batch macro to go through and run a group of frequencies at a time and then output a file, or building a workflow with multiple frequency nodes, each assigned to run a group of fields and then output the results to a flat file.  This can be a good approach if your big streams error frequently and force you to start over, but there is an Input/Output (IO) cost anytime you pull things from a database that you should keep in mind (e.g. the database will spend time looking for, grabbing, and returning the data, and the data will take time to transfer to your machine).  

 

Much like Alteryx, database query costs usually aren't linear with the number of fields, so pulling two fields should only take a fraction longer than pulling just one.

 

Everything is relative, but your process doesn't seem like it is taking so long that I would start looking at ways to complicate it.  Thanks for following up with both experimental results and your specs; that helps a bunch in seeing how the process has been working for you.  If you'd like to go into splitting this process into batches, I can try to put something together to demonstrate the process for you.

hanna
7 - Meteor

Thanks again, David!

 

Actually, I can't select all fields - it told me I have no more spool space.  So I reverted to doing an in-DB count by group.  Could you please give me more detail about the batch macro process?

 ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------

"You might also consider writing a batch macro that executes the appropriate SQL query inside the database and then captures the results."

 

SELECT [Field1] as FieldValue, COUNT(*) as RecordCount FROM theDatabase GROUP BY [Field1]

 

"You can then have the batch macro replace [Field1] with the appropriate field names, and just add a column inside your macro with the name of the field you're counting.  Your results then get spit out as a tall table of the form (FieldName,FieldValue,RecordCount)"

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Thank you very much!

 

david_fetters
11 - Bolide

So I got a chance to do a little cursory reading on Teradata, and it looks like that spool error could be from your query exceeding the temporary workspace (spool space) Teradata sets aside for your results.  It does look like the easiest way will be to run this as queries in the DB.  So, how do we build a batch macro to do this?

simple batch workflow.PNG

Notice first the "Control Parameter".  This is part of what makes this a batch macro.  When you insert this macro into yourworkflow rather than opening it up (i.e. like we're doing here), you will be able to attach a datastream to an upside down question mark, which is the Control input.

 

This macro will run once for EVERY row of data hooked up to the Control input.  What does this macro do?  First, we have a SQL query that returns the unique values and the counts of those unique values (i.e. the frequency) for a single field.  We pass the results of this into a Formula tool where we add a field called 'FieldName'.

 

Obviously we don't want to run 200 queries for the unique values of the field [Field1], so we need to use the black-and-white Action tool to substitute whatever value has been passed into the control parameter on this iteration.  In this case, each control row will be a single field name from the database that we want frequencies for.  With the SQL query:

 

SELECT [Field1] as Value,
 COUNT([Field1]) as RecordCount
FROM dbo."Master"
GROUP BY [Field1]
ORDER BY [Field1] ASC

We want to replace the string "Field1" on every iteration, but not the brackets around it.  Setting up the Action tool looks like this:

simple batch action tool 1 settings.PNG

With the Action tool, we click through the collapsed menu until we find the specific value we want to change - in this case it's our DB connection stream.  Make sure you click that field, then click the checkbox "Replace a specific string:" and type the value you want to replace.  This will make the Action tool replace every instance of the specified string with whatever is passed through the Control input.

 

Looking at the second action tool:

simple batch action tool 2 settings.PNG

Here, we are trying to ensure that for each round of the query that executes, all of the output rows have the field name attached to them.  Notice that the Action tool only replaces the phrase FieldName but does not replace any of the quotation marks.
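In other words (in case the screenshot is hard to read), the Formula tool's expression here is essentially just the quoted string literal "FieldName", and the Action tool swaps the text between the quotes on each iteration - on the Gender pass (to reuse my made-up field), the expression becomes "Gender", so every row returned by that pass gets tagged with the field it came from.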

 

Your result at the end of this is a table that will look like:

FieldName   Value   Count
Gender      F       56546
Gender      M       64867
State       OH      76465
State       VA      64781

 

You'll need to find a way to get the field names into rows so that you can pass them into the Control input.  You might try querying a single row from your database (to get all the fields), deselecting any ID fields or other things you don't want a frequency for, and then shoving the result through the Field Info tool (Developer tab).  The Name column that comes out will be a list of all the field names that were fed in, which you can connect straight into the Control input.
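If building that sample-row stream feels clunky, another option (assuming you have SELECT access to Teradata's dictionary views, which I can't promise from here) is to pull the column list straight from the data dictionary with something like:

-- list the columns of the table you're profiling (database/table names are placeholders)
SELECT ColumnName
FROM DBC.ColumnsV
WHERE DatabaseName = 'YourDatabase'
  AND TableName = 'YourTable'
ORDER BY ColumnId

Filter out the ID fields you don't want and feed the ColumnName column into the Control input the same way.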

hanna
7 - Meteor

Thank you very much for the details,  David! That helps a lot!

 

I substituted the Input and Formula tools with In-DB tools since I process in the database. The In-DB macro output gave me the message shown in the image below. Does that mean I can't use it in the batch macro?

 

 

in-DB macro.PNG

 

BTW, I do have my field list available in a file.

david_fetters
11 - Bolide

Hey Hanna, no problem, happy to help.

 

In-DB tools are great when you're faced with the choice of either pulling down a really large dataset or sending a smaller dataset to the server.  In this case I don't think you need them, because our query is small and the results that the DB will send back are small.  You should probably set this up as a standard batch macro with the standard (non-In-DB) tools.

 

I usually only use the In-DB tools when I'm faced with comparing a very large database against my own large dataset.  Let's say I have a DB and a flat file of registered voters.  My flat file is about 100,000 people who voted last year, and my DB is about 10 million people.  If I want to know the average age of the 9.9 million people that are in the DB and not in my flat file, I have two choices: 1) pull down the entire DB, compare with my file, and process locally, or 2) send my file to the DB, and compare and process remotely.  The time cost of sending 100,000 records to the DB vs. downloading 10 million records means it's faster to do this type of thing in the database.
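As a rough sketch of what that second option looks like once my flat file has been loaded to a temporary table (all of these table and column names are invented for the example):

-- average age of everyone in the big table who is NOT in the uploaded flat file
SELECT AVG(v.Age) as AvgAge
FROM VoterDB v
LEFT JOIN VotedLastYear f
  ON v.VoterId = f.VoterId
WHERE f.VoterId IS NULL

The only data crossing the wire is the 100,000-row file going up and a one-row answer coming back.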

 

In your case, you just need to send the query for one field at a time, then wait for it to send back the results (1 row for every 1 unique value in a specific field).
