I am trying to use the Join tool to combine multiple Excel files that result in over 1 million rows of payroll data. This seems to overwhelm the server and/or take forever to run. Is there a way to condense the data or otherwise make this possible?
Can you provide some sample data, both input and output? How many records are in your inputs? If you're not expecting 1M+ records after the join, you may need to join on more fields to eliminate any cross joins that are happening.
Attached is a small sample of the input data. My goal is to combine 5 files, each containing about 300,000 unique lines of this data and then output files containing the data from each unique Ledger Account (Column G).
Thanks. What is the reason for the join you mentioned? It sounds like you would want to union the files together and then maybe use a Summarize tool to group by Ledger Account and sum the debit/credit amounts.
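If it helps, here's a rough pandas sketch of that union-then-summarize idea outside of Alteryx. The file pattern and the column names (Ledger Account, Debit, Credit) are just guesses from this thread, so swap in your real paths and headers:

```python
import glob

import pandas as pd

# Union: stack the five extracts instead of joining them.
# File pattern and column names are assumptions, not your actual schema.
files = glob.glob("payroll_extract_*.xlsx")
combined = pd.concat((pd.read_excel(f) for f in files), ignore_index=True)

# Summarize: group by Ledger Account and sum the debit/credit amounts.
summary = combined.groupby("Ledger Account", as_index=False)[["Debit", "Credit"]].sum()

# One output file per unique Ledger Account (Column G in the sample).
for account, rows in combined.groupby("Ledger Account"):
    rows.to_excel(f"ledger_{account}.xlsx", index=False)
```

The point is that a union grows the row count linearly (5 x 300K = 1.5M rows), whereas a bad join can multiply it, which is usually what overwhelms the machine.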
Honestly, I haven't seen a good answer for joining a large table to a small dataset when it isn't possible to limit the records from the large table prior to the join in Alteryx. I'd suggest using SAS for joins across different environments or for big data.
@Time - this really depends upon your skill set with SAS vs. Alteryx and on institutional support. I do this all the time in Alteryx, so I don't have a problem with it. You or your company may have a great SAS team, in which case this could be much easier in SAS. I'm sure both platforms can provide this capability.
Hey, sounds good. So it is possible to join a small dataset to a million-row dataset. Could you explain at a high level which tools or process you use in Alteryx when it isn't possible to limit the large dataset prior to the join?
First off, I don't see a million-row dataset as a large dataset. I see hundreds of millions and greater as large. So if it's 1 million, I just use standard tools and don't really have an issue, unless I'm dealing with expansive joins; then obviously I have to do some grouping and prep beforehand. If it's 9 figures or more, then hopefully the data is in a DB and I use the In-DB tools: I create dynamic where clauses from the smaller dataset and use Dynamic Input In-DB. 8 figures is kind of a grey area; it depends upon the data.
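To make the dynamic where clause idea concrete, here's a rough Python sketch of the same pattern done by hand. The connection, table, and column names are made up for illustration (sqlite stands in for your actual database):

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("warehouse.db")  # stand-in for your DB connection

# The keys come from the smaller dataset; names here are hypothetical.
small = pd.read_excel("small_lookup.xlsx")
keys = small["employee_id"].unique().tolist()

# Pull only matching rows from the big table, batching the IN list so it
# stays a manageable size -- the same effect as a dynamic where clause.
batch_size = 1000
chunks = []
for i in range(0, len(keys), batch_size):
    batch = keys[i:i + batch_size]
    placeholders = ", ".join("?" * len(batch))
    sql = f"SELECT * FROM payroll_detail WHERE employee_id IN ({placeholders})"
    chunks.append(pd.read_sql(sql, conn, params=batch))

matched = pd.concat(chunks, ignore_index=True)
result = matched.merge(small, on="employee_id")  # the join, now on a pre-filtered set
```

The win is that only the rows that can possibly match ever leave the database.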
So basically I do some form of Summarize tool / Dynamic Input In-DB (roughly the pattern sketched above), and then I can extract what I need using Data Stream Out. Note: if you want, you could also create a temporary table on your DB using Data Stream In and then use a Join In-DB. I tend to use the dynamic where clause instead.
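And here's a sketch of that temporary-table alternative (the Data Stream In followed by Join In-DB route), again with stand-in names and sqlite standing in for your DB:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("warehouse.db")  # stand-in connection

# "Data Stream In": push the small dataset into a temporary table.
small = pd.read_excel("small_lookup.xlsx")
small.to_sql("tmp_lookup", conn, if_exists="replace", index=False)

# "Join In-DB": let the database do the join so only matched rows come back.
sql = """
SELECT p.*
FROM payroll_detail AS p
JOIN tmp_lookup AS t
  ON t.employee_id = p.employee_id
"""
result = pd.read_sql(sql, conn)
```

Both routes end the same way; the temp table just lets the database optimizer handle the matching instead of a generated IN list.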
Great info, thanks. I just found that the Dynamic Input In-DB tool will allow an IN variable, after asking our resident expert here. So you just need to connect the "smaller" dataset to the Dynamic Input In-DB tool and update the where clause to include an IN statement on the key field.
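For anyone landing here later, the substitution that setup performs looks roughly like this; everything in the snippet is hypothetical and only meant to show the shape of it:

```python
# Illustrative only: a placeholder in the template query gets replaced
# with a list built from the "smaller" input dataset.
template = "SELECT * FROM payroll_detail WHERE ledger_account IN (ACCOUNT_LIST)"
accounts = ["'4000'", "'4100'", "'4200'"]  # key values from the smaller dataset
query = template.replace("ACCOUNT_LIST", ", ".join(accounts))
# query -> SELECT * FROM payroll_detail WHERE ledger_account IN ('4000', '4100', '4200')
```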