
Alteryx Designer Desktop Discussions

SOLVED

Frequency tables on large teradata table

hanna
7 - Meteor

Dear Alteryx community,

 

I am trying to get frequencies for 200+ fields in a Teradata table that has 100 million records. Is it possible? I picked one field and it seemed to run very slowly.

 

Are there any other solutions? Any suggestions are welcome!

 

Hanna

20 REPLIES
hanna
7 - Meteor

I am almost there, thanks a lot for your help, David!

 

I set up the batch macro using regular tools instead of In-DB tools. I tested a few fields and it worked great. The only problem is that it gives me an error on some fields, as shown below:

" onefreq_macro (35) Record #7: Tool #7: The field schema for the output "Output33" changed between iterations."

and omits those variables. I am not sure what "the field schema for the output changed" means. Could you help me with this, please?

 

My workflow and test data are in the attachment. Thank you!

 

 

david_fetters
11 - Bolide

You're doing great, and really, I enjoy helping.

 

That error indicates that one of your batches returned output fields in a different number or order than the others.  Because the batch macro unions each round together, all rounds must output the same fields in the same order; when a specific round outputs different fields, you get that error.  I'm thinking that in this case, one of your batches might not be getting any results back.

 

Try running the batch macro feeding only Record #7 into it and see what fields come out.  Since you are running that single round, whatever field schema it has should be output.  It could be that this field is entirely null and Teradata doesn't return anything in that case.  The key is that each round of the macro should produce the same columns in the same order; when that doesn't happen, you'll get that field schema change error.

 

If you can show me the fields you get with Records 1-6 and the fields you get with #7, we might get this solved.  Are you sure that the field that corresponds with #7 is categorical?  You might not get a result if it's a float (e.g. 0.38052083523), because it won't do groupings on that data type.  It's hard to say without being able to see the structure of the Teradata database.

hanna
7 - Meteor

Hi David,

 

I attached the first 100 records from our Teradata table for the first 15 fields I ran. It looks like the values of fields #1-6 were numbers and field #7 had text values.

 

Thanks again,

Hanna

david_fetters
11 - Bolide

Ohhhhh, OK, I think we've found it.

When your SQL query returns results, they come back typed accordingly.  So the [Value] field that contains all of your unique values is numeric for the first six fields, then string for your seventh.  This is the schema change it's picking up.

 

Here's what I would do:

1) Add a Select tool between your data input tool and your formula tool.  Change the type and size of the [Value] field to "V_String" and "200" (large enough to fit anything that could come pouring out of the DB, hopefully).

 

Alternatively, you could create a new field using your Formula tool, call that field "CleanValue", set type to "V_String", and size to "200", then (using whatever name you have for the Value field) set the expression to: ToString( [Value] ).  Then you would just need to deselect the original value field so that it doesn't get output.

 

The second approach uses an explicit conversion, while the first uses an implicit conversion.  Not sure if there's a difference in your use case, but worth noting.

 

Give this a shot and let me know!  Come on, no whammies...

 

EDIT: I'm not sure of the syntax for a Teradata DB, but in most SQL versions you can do this type of conversion in your SQL query.  E.g., in MS SQL I could use the CAST function and write something like:

SELECT CAST([Value] AS VARCHAR(100)) AS [UniqueValues],
   COUNT(*) AS [RecordCount]
FROM MyTable
GROUP BY [Value]
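
If Teradata accepts the ANSI CAST syntax (a guess on my part, untested), the same query without the MS SQL bracket quoting would look like:

SELECT CAST(Value AS VARCHAR(100)) AS UniqueValues,
   COUNT(*) AS RecordCount
FROM MyTable
GROUP BY Value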
hanna
7 - Meteor

Hi David,

 

The CAST function works well. I don't get any error messages now. :)

 

However, I ran into two issues:

1. The batch macro approach doesn't count null values.

2. The counts are different between the Frequency tool and the batch macro approach.

 

I attached my result and hope you can help me with this. It is so close I can taste it. :)

 

 

david_fetters
11 - Bolide

We may be approaching the bounds of my ability to troubleshoot your second problem.  @jdunkerley79 may have some ideas on why you'd get a different result using one rather than the other (or on who else could better approach this; I'm still one of the new guys), but I cannot for the life of me think of why that would happen.

 

To handle the first problem, you can either take a total row count and assume that the difference between your row count and the sum of your known values is the number of null rows, or you can use a UNION in your SQL and specifically get the count of rows where [Value] IS NULL.  Basically, you'd be counting all the values, then counting all the rows without values, and unioning the two to get your unknown row counts.  Given the discrepancy between SQL and Alteryx on counts, I probably wouldn't make any assumptions about the data until we can understand why that's happening.
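
For example, here's a rough sketch of that union in MS SQL syntax again (same placeholder MyTable as above; Teradata's quoting may differ):

SELECT CAST([Value] AS VARCHAR(100)) AS [UniqueValues],
   COUNT(*) AS [RecordCount]
FROM MyTable
WHERE [Value] IS NOT NULL
GROUP BY [Value]

UNION ALL

SELECT 'NULL' AS [UniqueValues],
   COUNT(*) AS [RecordCount]
FROM MyTable
WHERE [Value] IS NULL

The UNION ALL keeps the null-count row even if its count happens to match one of the value counts.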

 

But I can't get my head around your second problem.  The counts are so close I just don't understand how they could be different.  To get any further I'd really need to see what all x million rows looked like coming out of the database.

 

If you remove the CAST function from your SQL and run it only on CHILD_00_02_HME_PRES_CD, does that change the output counts you see vs. running it with the CAST?

hanna
7 - Meteor

Hi David,

 

To answer your last question: I removed the CAST function and ran the query, and the result is the same.

It seems the SQL gets more counts than the Frequency tool.

 

I will send a message to the email address you provided. Hopefully I can get this figured out.

 

I will try to figure out the first issue later.

 

Thank you very much for your help, David. We made huge progress and I learned a lot from you.

 

Hanna

jdunkerley79
ACE Emeritus

Hi Hanna,

 

 

I am not overly familiar with Teradata, but my approach would be something like this:

 

1. Get a list of the fields using an Input tool (http://community.teradata.com/t5/Database/How-to-extract-column-names-from-a-table-using-Teradata-SQ...)

2. Having got a list of fields, I would then use a Formula tool to create a SQL statement (much like @david_fetters' SQL).

3. I would then use a Dynamic Input tool to run all of the fields in parallel back on the server (set up a single example and then replace the query with the generated one); see the sketch below.
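
Roughly, the two queries involved might look like this (an untested sketch: MyDatabase, MyTable, and MyColumn are placeholder names, and I am assuming the standard DBC.ColumnsV dictionary view):

-- Step 1: list the columns of the table from the data dictionary
SELECT ColumnName
FROM DBC.ColumnsV
WHERE DatabaseName = 'MyDatabase'
  AND TableName = 'MyTable';

-- Step 2: the Formula tool would generate one statement like this per column
SELECT 'MyColumn' AS FieldName,
       CAST(MyColumn AS VARCHAR(200)) AS UniqueValues,
       COUNT(*) AS RecordCount
FROM MyDatabase.MyTable
GROUP BY MyColumn;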

 

This isn't hugely different to the batch macro approach, but you can then easily test run the SQL queries produced.

 

As to why the Frequency tool is giving a different answer, I can't think of any obvious reasons without taking a look at the data itself to see if anything stands out.

 

Sorry, I can't give any immediate guess as to why the batch macro gives a larger answer. I would verify a single field using SQL on the server to see what the correct values are.

 

James

hanna
7 - Meteor

 

Hi David and James,

 

Thank you very much for your help. I think there might be lost data when I run the frequency table. I ran individual columns through the Frequency tool and the results are the same as the batch macro results. So this is my conclusion.

 

Thanks again,

Hanna

david_fetters
11 - Bolide

Apparently I did not completely understand the Dynamic Input tool. Thanks for that option, James!

 

 
