Why is my data getting so big in the middle of the workflow?
- Subscribe to RSS Feed
- Mark Topic as New
- Mark Topic as Read
- Float this Topic for Current User
- Bookmark
- Subscribe
- Mute
- Printer Friendly Page
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
This is my workflow:
My input tool at the top: 2.4 GB
My input tool at the bottom: 90kb
In the first Join tool: 4.5 GB
in the second Join tool: 350 GB >> alteryx crashed due to lack of temp drive space.
Why does this happen? How does the data become so big in the middle like that? How do I fix this? I can't upgrade my RAM and stuff because this is on a work laptop.
- Labels:
- Join
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
Hi @novalcia
Things to look for
- After your first join, you've created a copy of each of your top input top rows leading usage to double. Your second join creates ~70 copies of the top input.
- Examine the configuration of your join tools. specifically the fields that you're joining on. If you're not including every Key Field from the first table in the join, your row count will increase.
- If you have duplicates in the bottom input, these will cause multiple rows.
- If any of your joined fields have null values this can cause the row count to balloon since every null on the right will match with every null on the left
Dan
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
@novalcia Please check the data Quality as rightly said by @danilang null or empty values unwantedly increase the size of data to huge no suppose you have 100 null in 1st input and 100 null in 2nd input this will generate 10000 records which will definitely be a issue.
also i suggest to add block until done after first join tool and try running it.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
drop un-needed extra fields. when joining on large files use autofields to lower string sizes. drop duplicate fields.
short is - we can't 100% tell you - for all we know one of these tools has a blob field which is huge - but those are the standard places I'd look.... also - don't be afraid of using In-DB tools if this data/some of this data is in a Database.
- Mark as New
- Bookmark
- Subscribe
- Mute
- Subscribe to RSS Feed
- Permalink
- Notify Moderator
This is my Join tools configuration:
This is my first Join tool, and I deselected the only right input.
This is the second join tool, and I deselected all right inputs, because they're duplicates and I only need the left ones.
I just learned that my data may be ballooning because of Cartesian Join? But how is that possible? I only joined the Acc Num (in join 1) and Identifier (in join 2), I didn't join multiple columns.Shouldn't it output only those that have the same identifier (for join 2)? And I know there are no nulls in my identifier column.Even if there are nulls in other columns, it shouldn't matter because I'm deselecting one side of the Join and I'm only Joining the identifier column anyway, right?
