How do you work with large data tables in a SQL database?
My journey continues as I explore my first SQL database at my job. Today I was trying to look at a transaction table from our CRM system that goes back 8 years. I didn't know how many records it contained, only that it was still loading after half an hour. So I looked at it as a live table using Tableau and found that this table had over 7.5 billion records. I tried narrowing the date range to a single month and that still had over 100 million records. I'm looking for a way to analyze this data for long- and short-term sales trends and forecasting.
Any suggestions would be greatly appreciated.
Thanks in advance.
You could try to manually push as much as you can into the SQL of the Input tool. Or, better, switch to using the In-Database tools. The main trick is to do whatever aggregations you need in the database; these will hopefully shrink the data to a manageable size for whatever hands-on investigation you wish to do. It's important to shift to the mindset that looking at something with a billion rows directly is impossible: even with all the computing resources in the world, we as humans cannot digest a billion rows anyway. Think in terms of how you want to shrink it into useful information... and then do that in the database.
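For example, something along these lines pasted into the SQL of the Input tool (or built with the In-Database tools) would collapse 8 years of transactions down to one row per month. The table and column names here are just placeholders, since I don't know your schema:
-- Illustrative only: Transactions, TransactionDate and SalesAmount are assumed names.
-- This aggregates billions of rows down to roughly 96 monthly rows on the server,
-- so only the summary ever crosses the network.
SELECT YEAR(TransactionDate)  AS SaleYear,
       MONTH(TransactionDate) AS SaleMonth,
       SUM(SalesAmount)       AS TotalSales,
       COUNT(*)               AS TransactionCount
FROM   Transactions
GROUP  BY YEAR(TransactionDate), MONTH(TransactionDate);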
Hope that helps!
John
Agree with @JohnJPS
Some other things you can do on any table - all of which you can do in Alteryx:
Find out the row count:
select count(*) from table
Some SQL engines respond better to
select count(1) from table
or, if you're on Microsoft SQL Server, use the NOLOCK table hint so the count doesn't take locks on the table while doing so:
select count(*) from table with (nolock)
View sample data to play with the records:
Set the Record Limit (second parameter on the Input tool) to 100 so that you can explore the shape of the data first.
Only bring back the fields you need
If what you need is the number of records per customer, then only bring back those two fields and let the SQL server do the work. On large data sets, the amount of data you transfer across the wire (across the network) becomes a big constraining factor. I've dealt with 180M-row tables with 100+ columns (half a terabyte), and bringing an entire table like that across the network would take hours (i.e. doing a select *) - however, if you just want the count by customer, then a query such as the one below will move MUCH less data:
select CustomerID, count(*)
from Sales
group by CustomerID
Aggregate, filter, or join on the server if you can:
To John's point - SQL servers are built to optimize read, query, filter, and join operations - so it doesn't make sense to bring all that data back to the client and then do the work (the SQL server may well be tuned to be super-efficient at the job, and bringing extra data across the wire that you don't need will create needless delays in your process).
So - to John's point - do the query on the database.
There are 2 ways to do this:
- One is to use in-database queries if you're on MS SQL or the other engines that have recently been added
- The other is to learn a little SQL and write the query yourself. There are great SQL training courses out there, and I honestly believe that being handy at SQL is a critical skill in our job (a rough sketch of such a query is below).
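For illustration, a hand-written query that filters, joins, and aggregates on the server before anything crosses the wire might look like this - the table and column names (Sales, Customers, SaleDate, Region, SalesAmount) are assumptions, not your actual schema:
-- Illustrative sketch only - adjust the names and the date filter to your own tables.
SELECT c.Region,
       CAST(s.SaleDate AS date) AS SaleDay,
       SUM(s.SalesAmount)       AS TotalSales
FROM   Sales s WITH (NOLOCK)
JOIN   Customers c ON c.CustomerID = s.CustomerID
WHERE  s.SaleDate >= '2023-01-01'
GROUP  BY c.Region, CAST(s.SaleDate AS date);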
Tune the server:
Once you know the query you want - one that aggregates the data on the server nicely and only brings back the columns you need - you may still face delays. On a billion rows of data, assuming you don't have a hardware problem, this should still be very achievable on a modern database. So if you're still facing delays, work with your DBAs: they can look at indexing, where they take the query you're trying to run, ask the SQL engine for the execution plan (basically saying to the SQL execution engine "tell me how you're thinking about doing this, and what we can do better"), and then create indices (which are like fast lookups) on the database to speed up your specific query.
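On Microsoft SQL Server, for example, that conversation usually looks something like the following - the Sales table and its columns are placeholders, so treat this as a sketch rather than a recipe:
-- Ask the engine to describe its plan instead of running the query.
SET SHOWPLAN_TEXT ON;
GO
SELECT CustomerID, COUNT(*) FROM Sales GROUP BY CustomerID;
GO
SET SHOWPLAN_TEXT OFF;
GO
-- If the plan shows a full table scan, an index on the columns you filter and group by can help:
CREATE NONCLUSTERED INDEX IX_Sales_SaleDate_CustomerID
    ON Sales (SaleDate, CustomerID);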
I hope this helped @John_S_Thompson - feel free to reply or PM if you need more.
Sean
Thanks for the advice. The level I want to look at is daily. I haven't had a chance to see how many records a single day has; if I had to guess, the count could be around 10 to 20 thousand records. Tomorrow when I get to work I will check out how to use the In-Database tool set.
:-) The other thing to check is how the data is structured. Take only the top 100 rows and see if the field that gives you the day (e.g. sale date or trade date) is a date-time field. These are stored as numbers under the hood in SQL engines, so date filtering is very speedy. If the date is not stored in a date field, then you may need to spend some time with your DBAs either tuning this or seeing if there's a record ID that's in sequence by date that you can use reliably. The other option available to your DBAs, if you only really query the last few months, is to partition the data physically by date, which reduces query time too.
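Two quick checks along those lines (Microsoft SQL Server syntax, and the Sales table name is just a stand-in for whatever your transaction table is called):
-- 1) Pull a tiny sample to eyeball the data shape.
SELECT TOP 100 * FROM Sales WITH (NOLOCK);
-- 2) Confirm how the date column is actually stored.
SELECT COLUMN_NAME, DATA_TYPE
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  TABLE_NAME = 'Sales';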
To give you a sense - we chew through well over 1M rows in seconds in Alteryx - but on large data sets like this query time is often the first factor to start working through.
These are all great tips. My challenge is I don't have a DBA at work so it looks like I have to figure it out the hard way. I'm always up for a challenge.
Hi John,
I wanted to say thank you for the suggestion of using the In-Database tools in Alteryx. Before using these tools I let my computer run all night to see how long it would take to download 5.8 million records from the company's SQL server, and when I came in the morning the elapsed time showed 5 hours to complete. I watched the live training on how to use the In-Database tools and simply aggregated the data by date and customer ID and summed the sales. What took 5 hours now takes only 10 seconds. WOW!
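For anyone following along, that In-Database aggregation is roughly equivalent to pushing a query like this to the server (the column names here are placeholders for whatever the actual date, customer, and sales fields are called):
-- Illustrative only: SaleDate, CustomerID, and SalesAmount are assumed names.
SELECT CAST(SaleDate AS date) AS SaleDay,
       CustomerID,
       SUM(SalesAmount)       AS TotalSales
FROM   Sales
GROUP  BY CAST(SaleDate AS date), CustomerID;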
@John_S_Thompson wrote: These are all great tips. My challenge is I don't have a DBA at work so it looks like I have to figure it out the hard way. I'm always up for a challenge.
Hey John,
Depending on the database server that you're using, I may be able to offer help (in my past life I did the exams for Microsoft SQL Server administration, so MS SQL Server we can work with, and we can probably figure out some options for you on other databases).
Feel free to PM me if you get stuck.
