In part 1, I outlined the high level steps needed for an Alteryx macro that will read any number of Excel worksheets from any number of workbooks.
File Finder: The first step is to find all of the files that need to be opened and will be the "parent" macro and face of the final product.
Sheet Finder: The second step is to find all of the sheets in each file.
Sheet Reader: The third and final step is to extract the data found on each sheet in each workbook and bring them all together.
Here, I'll go in depth into each of those steps, including how to wrap them all together into a single, parent macro to streamline the interface.
Let's start from the top and work down.
- First, we'll need to gather the list of files we'll be working with. The directory tool is great for this. You can see in the output the four example files I've created so that I have an example of each file type to work with. The "*.xls" file specification will read all four file types because they all begin with "xls".
- We'll need to come back to this later to add the Sheet Finder and Sheet Opener macros as well as to build an interface, but the directory tool does everything we need from a data standpoint to get us started.
This will be a batch macro where each batch will be tied to a record produced by the Directory tool in the File Finder parent macro. We'll use a control parameter interface tool for this, labeled as "FullPath", and will map the FullPath field output by the directory tool to this parameter. This will allow us to reference the file path for the workbook relevant to each batch.
The guts of the Sheet Finder macro will need to produce a list of sheets for the file a given batch is working with. This will end up being the most complicated piece of this entire process because of the multiple types of Excel files we are supporting.
- XLSX: These are easy, the new xlsx native reader has an option to output the list of sheet names.
- XLSB/XLSM: These are a little more difficult as the new driver does not support either file type. However, borrowing some tricks (plus a few modifications) from our CTO, Ned Harding, outlined here, we can unzip each file type and find the list of sheets in the resulting output.
- XLS: Excel 2003 files proved to be the most difficult for extracting the list of sheet names. Primarily because they are old, binary files that cannot be easily be unpacked to find a list of sheet names sitting in an XML file. However, it is possible to call some VB script that will write the list of sheet names in a given .xls file out to a txt file that can then be read directly by Alteryx. I borrowed the script from a colleague (@RichardS) who already solved this particular step (a quick google search should also turn up some example scripts - but you can find the particular script I used in the supporting files folder of the final package).
Now that we have the core logic of the SheetFinder batch macro figured out, we need to construct the macro interface and actions. Remember that, because this is a batch macro, we will be working with a single Excel workbook at a time. We want each batch of the macro to run the sheet extraction logic appropriate to that batch's Excel workbook file type. To do this, we've placed each section in a tool container so that they can be disabled and enabled as needed.
Action tools will allow us to update the enable/disable status for a container tool based on the extension of the Excel workbook passed through the control parameter for each batch:
Similarly, conditions and actions can be used to update the file path referenced by the appropriate Input Data Tool or Run Command Tool:
This set up is repeated for all four Excel workbook extensions (XLSX, XLSB, XLSM, and XLS)
The results can then be brought together using a series of Detour End tools followed by appending the filepath of the Excel workbook associated with the current batch before being passed back to the parent workflow via a Macro Output Tool.
After saving the Sheet Finder macro we can now insert it into the File Finder parent macro where we can see that it is producing a record for each sheet found in each workbook.
Now that we have our list of all sheet names for all excel files in the directory, we need to read them. If we knew that all sheets would have identical schemas to one another, we could use the dynamic input tool for this. However, that is often not the case and we would prefer the macro work with any combination of schemas. Another couple of blog posts by Ned outline how to do this for supported file types so I won't go over it in detail. However, a key change was needed to ensure that the process will work with the unsupported file types .xlsm and .xlsb.
Ned's macro uses the FileFormat value of "-1" for the input data tool in his helper macro - this works great for his process because it instructs the input data tool to look at the file extension to determine appropriate driver to use for reading each file. This would work for the xls and xlsx files, but would not for xlsm or xlsb because there is no preset way for the input data tool to read them. Instead, we need to instruct the Input Data Tool to read xlsm and xlsb files using the Legacy Excel Driver - which is designated by the FileFormat value of "50". A switch function in an action tool lets us easily specify a default FileFormat value of "-1" that is over ridden when certain file extensions are found.
Now we can put everything together in the File Finder parent macro to see how it works.
Notice the file outputs listing out each sheet from each workbook as they are read.
Now that we are satisfied that the workflow is performing as desired, we can wrap it up by constructing the parent level macro interface that our users will interact with.
There are a few key things we'll want to expose to the end user.
- Most importantly, we need them to specify the folder path for the macro to look for Excel files. Remember that this function is performed by the Directory Tool. There is a similar Folder Browse Interface Tool that allows us to ask the user for a folder path input.
- The Directory Tool has an option that allows us to either include or exclude SubDirectories of the folder browse. We should probably give the end user this option as well. This can be done with a Check Box Interface Tool.
- We can also give the user the ability to choose which Excel file types are consumed with a List Box Interface Tool and Filter.
- Finally, we can copy the Output File Path Drop Down selection from the Sheet Reader macro to enable the end user to specify this value as well. We will also need a Macro Output tool to provide all of the records read from all of the Excel files outside of the macro.
Now we can use the macro as a dynamic way to import any number of Excel worksheets contained in any number of workbooks within a given directory.
The output is the union of all the sheets from all of the workbooks in the corresponding folder. Testing on the Alteryx sample directory produces something like this where you can see the filename for every record - where the schemas are mismatched but still read.
So there you have it - a macro to import the data from any number of Excel Worksheets in any number of Excel Workbooks regardless of type. Hopefully - this outline of the my thought process in constructing it will help you with your own macro building endeavors. Maybe you even have an idea for additional features for this macro that you want to incorporate. If you do - please share the results in the comments!
Note: This macro has been heavily tested and is expected to work in the cases mentioned above. We'd love to hear about your experience in working with it, but please keep in mind it is not an official Alteryx product macro.
I've added a few additional options to the version of the macro I'm currently using - updating the attached version to include these changes.
1. Option to specify how the directory tool searches for excel workbooks - primarily to allow for only including files with specific prefixes or suffixes.
2. Option to specify a specific list of sheet names to include.
I've published the latest version of the macro to the gallery. Including the following changes:
1. Option to use the xls vb reading macro for all Excel file types (fix for some users reporting issues with reading xlsm files)
2. Option to return the list of fields for each sheet without reading all of the data (I find this useful to gain a quick understanding of how well the data will match up)
3. Changed the directory tool to read all file types as some users reported issues with certain types of file systems. A filter is applied to limit to only excel files.