Reading entire folder of Excel files at once, columns not always the same

Question

I have a folder of Excel sheets. Every month, everything in the folder is deleted, and a new set of Excel sheets are dumped in.  Whenever that happens, I need to condense all these sheets together into one single data file.

The Excel files do not follow a naming convention. The number of sheets that go into this folder each month changes. The columns are typically mostly the same throughout the sheets, but not always.  Each sheet has only one tab (mercifully).

I want to read all the files in together at once, append on same columns where ever able, and fill in Nulls everywhere else as needed.

EXAMPLE

---HAVE---

~DATA 1~

ColA ColB ColC
10 9 3
4 4 10
7 9 4
1 4 3
5 8 5
9 4 6
2 7 5
10 8 2
6 6 7
6 6 2

~DATA2~

ColA ColB
20 19
17 18
18 16
18 16
15 11
13 18
16 16
20 17
15 11
11 15

---WANT---

ColA ColB ColC
10 9 3
4 4 10
7 9 4
1 4 3
5 8 5
9 4 6
2 7 5
10 8 2
6 6 7
6 6 2
20 19 [Null]
17 18 [Null]
18 16 [Null]
18 16 [Null]
15 11 [Null]
13 18 [Null]
16 16 [Null]
20 17 [Null]
15 11 [Null]
11 15 [Null]

I've tried a few different ways so far:

* If I import each sheet individually (using one Input Data tool per sheet), and then use the Union tool, it does what I want perfectly. Problem is, I have to manually set up an Input Data tool for every single Excel file, one at a time (then feed those inputs all into the same Union).
* If I use one Input Data tool for everything, but specify a path ending with a wildcard, it grabs all Excel sheets in the folder ("S:\PATH\TO\FOLDER\*.xlsx").  Which is great, it fixes my dynamic number of sheets/names of sheets issue. However, problem is, that "one input tool for all files" approach doesn't handle the differing number of columns correctly (looks like it will only pull in the sheets with headers matching whichever was the first file it read in).

I've written a Python program that handles everything exactly as I want (below). I think I can use this script in a pinch, but I'd rather use native Alteryx if possible.

PYTHON SCRIPT

import pandas as pd
from os import walk

folder = "C:/Users/Desktop/test/"

input_files = []
for (dirpath, dirnames, filenames) in walk(folder):
   for file in filenames:
      if file.endswith('.xlsx'):
         input_files.append(folder+file)

df = pd.DataFrame()
   for i in input_files:
      d = pd.read_excel(i,dtype=str)
      df = pd.concat([df,d],axis=0, ignore_index=True, sort=False)

print(df)

Data 1.xlsx

Data 2.xlsx

Dynamic Read (attempt).yxmd

Have-Want.PNG

AmeliaG · Accepted Answer

Hi @DanWhalen,

Thanks for your question! You will need to use a batch macro to achieve this. Please take a look at the two links listed below which describe in detail how to achieve this:

Reading in Multiple files with different fields

Read in Multiple Excel files with multiple sheets

Hope this helps!

Amelia

CarlDi · Accepted Answer

Hi @DanWhalen

A batch macro should do the trick. Here's a great article with an example batch macro to input CSVs with different schemas.

https://community.alteryx.com/t5/Alteryx-Knowledge-Base/CS-Macro-Dev-Reading-in-Multiple-Files-with-Different-Field/ta-p/13370

In your case, you have excel files, so I went ahead and changed the File Format of the Input Data tool within the Batch macro.

Back to the canvas, I then have a Directory tool pointed to the folder with the excel files. They originally had the same schema but I added unique field names and content in three workbooks. I then inserted my batch macro and selected "Fullpath" in the control parameter. As you can see, it brought in all the fields despite having different schemas.

Hope that helps!

chpoland · Accepted Answer

@DanWhalen

My approach would be to use a directory tool reading all .xls files, pass that into an Input Data batch macro with the key being that you use the interface property setting of "Auto Configure by Position (Wait Until All Iterations Run)". If you don't update the output mode, it will complain of the files not having the same schema.

The attached workflow will stack/union all of the data regardless of count of columns in your desired output format with [null] for data rows with fewer columns.

Batch macro with the only tool being the Input Data tool

Access the Macro Interface settings by opening the macro, View -> Interface Designer then clicking the Gear Icon on the left

Hope this is what you were looking for!!