community
cancel
Showing results for 
Search instead for 
Did you mean: 

Alteryx designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.

Reading (Multiple) Files Of Your Choice from ZIP files with ZIP File Metadata Reader Macro

Highlighted
Alteryx
Alteryx

Hey Alteryx community,

 

Have you ever needed to work with files from ZIP archive? Alteryx has your back!

 

But... What if this gets out of hand a little and you need to work with massive pile of ZIP files? And at the same time you don't know beforehand what are the file names within ZIPs because their names contain dynamically generated strings like date stamps, time stamps, or latest transaction ids? Alteryx and this post have your back!

 

This is actually a real life use case that has come up from one of a champion I work with in one of my accounts and got me hooked to try to solve it. And surely we did!

 

Thus, I wanted to share this solution that could be useful if you are ever working with ZIP files frequently and have a use case heavy on ZIP files input consolidation.

 

image.png

 

But at first, let's start from the most basic situation and work our way to the top (of more complex ones):

 

SINGLE ZIP FILE

 

Alteryx has your back when you need to work with a single ZIP file. The standard input tool allows you to select the files within that archive you need via a nice looking UI menu. This approach will surely prove effective as long as you have just a handful of ZIP files and don't need to reconfigure every day multiple times. So far so good.

 

image.png

 

TON OF ZIP FILES, SIMPLE NAMES OF OBJECTS WITHIN ZIP TO LOAD

 

Having multiple ZIP files is still no biggie. As long as you can somehow generate the file names you want to extract from those ZIPs, you are good with our "standard" file input batch macro. Just use a formula tool on top of DIRECTORY tool to construct the names of files.

 

This would be something like "\\My_Mac\Home\Archive Folder\Archive 02.zip|||Customer Data_1.csv;Customer Data_2.csv".

You should have no problem with Excel files either.

 

Sample workflow of this is attached as "Multiple Zips With Constant Names.yxzp". Sample Archive "Archive 02.zip".

 

image.png

TON OF ZIP FILES, COMPLEX/CHANGING NAMES OF OBJECTS WITHIN ZIP TO LOAD

 

What if you have multiple ZIP files that contain multiple files but there is no way to know the full name of those files you want to read?

I.e. the names of files are not constants, say Customer.csv, but rather something dynamic like "Customer Data_20190815_13234403.csv"?

The problem here is there is no way to easily construct the FullPath to read such files as in the workflow above.

 

We have several options.

1/ Use Command Tool to UNZIP the files through a CMD call and process them later

2/ Use Python and try this without having to run UNZIP (this may be a requirement to conserve space)

3/ Use R and do the same

 

image.png 

side note: Guess who the winner was for me? Ever since learning print("Hello World") in Python, I proclaimed myself a coding expert within my team. Did not learn much since (Shhhh) then but did not prevent me from posting a gazillion of community articles about using Python in Alteryx. Huge fan of what we as a company did here combining the coding and non coding approach to analytics TBH. A coder writes the few lines of magic and Alteryx allows everyone in the company to benefit from this by using the magic within the workflows!

 

Python it is then. I found a basic Python package zipfile that allows you to extract the metadata of files (like file name etc) from ZIPs without having to extract them first.

 

The full sample workflow is attached as "ZIP Files Metadata Extractor and Reader.yxzp". Sample Zip file is the "Archive 01.zip".

With this workflow you are able to read any files of your choice from any ZIP files. Of course you may want to change the filters or something.

 

The bad boy here is ZIP METADATA READER that does the job of extracting the metadata of files (file names) from your ZIP files.

note: Well, I made it work on CSVs now so XLSX where Sheet names are needed is something to still fix.

 

image.png

ZIP METADATA READER

What is happening in the ZIP METADATA READER macro?

image.png

 

CODE IN ZIP METADATA READER

 

 

# List all non-standard packages to be imported by your 
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])

from ayx import Alteryx
import pandas
import zipfile

#Read the path to of ZIP
df = Alteryx.read("#1")

# Load the params from the input
path = "" #Placeholder for folder with a script

for index, row in df.iterrows():
    path = row[0].replace("\\","/")

# Use zipfile package to extact the metadata within your archive    
with zipfile.ZipFile(path, 'r') as myzip:
    names = myzip.namelist()

# Construct a PD DF to write back to Designer
df = pandas.DataFrame({"FilesList":names})

# Output the list of files as dataframe
Alteryx.write(df,1)

 

 

So, bottom line - Alteryx is awesome once again and we are able to handle massive ZIP files consolidation situation even if you don't know what are the file names within your ZIP files beforehand.

 

Hope this helps! Sorry for a super long article that could have been a two line post. I just like to go on and on, include weird jokes and funny pictures (NOT).

 

Cheers,

david

 

David Matyas
Sales Engineer
Alteryx
Labels