I am once again asking for your tech support.
I'm working through a huge dataset, trying to match codes from individual records against another file that holds the other portion of the data, using a series of tools. The hangup is a join tool that comes immediately after another join.
The first join doesn't struggle too much, but when its output has to join against the next set of information, the record count balloons to 3.2 billion, and it grows even further in the join after that. The junk gets trimmed out later in the workflow, but with this much data I keep burning through over 500 GB of disk space in temp storage, and the workflow can no longer run.
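For context, here's a minimal sketch of what I think is happening, written in pandas rather than the actual tool (the data and column names are made up): when the join key isn't unique on either side, every matching pair multiplies, so record counts explode long before the junk gets filtered out.

```python
import pandas as pd

# Hypothetical example: the key "A" appears twice on the left
# and three times on the right.
left = pd.DataFrame({"code": ["A", "A"], "left_val": [1, 2]})
right = pd.DataFrame({"code": ["A", "A", "A"], "right_val": [10, 20, 30]})

# A many-to-many join produces one row per matching pair:
# 2 left rows x 3 right rows = 6 output rows for a single key.
joined = left.merge(right, on="code")
print(len(joined))  # 6
```

Chain a second join on top of that intermediate output and the multiplication compounds, which is roughly how I end up at billions of records from much smaller inputs.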
I inherited this workflow when I joined the team and have made alterations over time, but I don't know how to fix this issue. It would also be illegal for me to share the data, which I know makes this a hard ask.