Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.
SOLVED

Help with Flat file in a Zip File

nmosley
7 - Meteor

I am running into a challenge on a project and am hoping that the community can help.  I am receiving some .Zip files that contain flat files that need to be read as fixed width. 

 

Thinking of two approaches but not sure which will work.  Anyone else faced this challenge?

 

1) Extract the files from the Zip.  Extract the files via RunCmd.  I see this thread here that may be of some use (https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Can-Alteryx-unzip-a-file-as-part-of-th...) but am not quite sure how to navigate that code.  Once extracted I can import as .Flat using a layout workbook created in Excel.  I' heading down this path now since the import part is straightforward once I learn how to extract from the ZIp.

 

2) Leave the files in the Zip, but limited file types to choose from.  I can Import as Csv without a delimiter but I don't see how to utilize the layout workbook in this method.  Seems like I would be stuck parsing via another method like regex or substring for each field.

8 REPLIES 8
JoaoLeiteV
10 - Fireball

Hello,

 

Alteryx can read files inside a zip, so your second option would work.

 

If possible, could you share an example of the files you're going to be using? This way we can figure out if you can just go with substrings (As they have a fixed-width) or Regex.

patrick_digan
17 - Castor
17 - Castor

@nmosley I had this same situation 5 years ago (to the day!) and submitted an idea for the zip files to be able to read more data types that are already supported in the input tool, including flat files. Alteryx never looked into the idea.

 

For me, I found it easier to write some python to work with the zip files. I've tweaked my code below a little bit, but the goal is just to download a zip from the web and then extract all the flat files from the zip. I've found it easier to use python than the run command tool (and I've done both). Hopefully this points you in the right direction in case you wanted to try this route. After the code, you can feed the file name (and or path) coming out of the python tool into a dynamic input. 

import urllib
zipin = 'https://website/file.zip'
zipout = '//server/folder/file.zip'
urllib.request.urlretrieve(zipin,zipout)

import zipfile
import fnmatch
import pandas
from ayx import Alteryx
import os.path

zip=zipfile.ZipFile(zipout)

out = zip.namelist()

list = fnmatch.filter(out,"*flat")
Alteryx.write(pandas.DataFrame(list), 1)

zip.extract(''.join(list),os.path.dirname(zipout))

 

Hope that helps!

TheOC
15 - Aurora
15 - Aurora

Nice solution @patrick_digan, theres always a way with Alteryx!


Bulien
nmosley
7 - Meteor

Thanks @patrick_digan!  I wasn't sure if I was explaining the situation effectively but you seem to have understood perfectly.  This may be the final straw that forces me to learn python.   I tried the following:

 

from ayx import Alteryx
import pandas
import zipfile
from os.path import isdir
from os.path import dirname
 
df = Alteryx.read("#1")
source = df["InputPath"]
destination = df["OutputPath"]
 
zip = zipfile.ZipFile(source)
zip.extractall(destination)
 

One caveat to this project that differs from your sample code.  I am starting with the zip files already located in a network folder.  Another process that I have no control over is dropping them off.  I have performed a directory sweep and can figure out which zip files have already been extracted and am connecting that stream to the input anchor on the python tool.  However, it errors out saying "series has no attribute seek."   

 
I tried testing source with zipfile.is_zipfile(source) and am receiving a message that it's expecting a str not a series.  I assume that means it can only accept one row of data (which i filtered on the canvas but probably need to do in the data frame?)    or that the data type of df.InputPath is incorrect.  Thoughts?
patrick_digan
17 - Castor
17 - Castor

@nmosley Great to hear! You just need to modify the source and destination by adding a [0] at the end to specify the row number. That way python will treat it as a string. 

from ayx import Alteryx
import pandas
import zipfile
from os.path import isdir
from os.path import dirname
 
df = Alteryx.read("#1")
source = df["InputPath"][0]
destination = df["OutputPath"][0]
 
zip = zipfile.ZipFile(source)
zip.extractall(destination)

 

 If you wanted your python to handle more than one file, you could do a simple for loop (full disclosure that I'm no Python whiz and this may not be the most efficient way to do this):

from ayx import Alteryx
import pandas
import zipfile
from os.path import isdir
from os.path import dirname
 
df = Alteryx.read("#1")
for i in range(len(df)):
    source = df["InputPath"][i]
    destination = df["OutputPath"][i]

    zip = zipfile.ZipFile(source)
    zip.extractall(destination)
nmosley
7 - Meteor

Posting the final python code in the event that someone else lands here looking for an answer.

 

from ayx import Alteryx
import pandas
import zipfile
from os.path import isdir
from os.path import dirname
 
df = Alteryx.read("#1")
df.InputPath.apply(str)
for row in df.itertuples():
    
    source = df["InputPath"][row.Index]
    destination = df["OutputPath"][row.Index]
    zip = zipfile.ZipFile(source)
    
    if len(zip.namelist())==0:
        df.drop(row.Index)
    else:
        out = zip.namelist()
        zip.extract(source.join(out),destination)

 

Marta12
7 - Meteor

I'm trying to implement that code to my workflow but I'm getting some errors. 

Could you share sample workflow which is using this solution? 

Or explain more what format the fields taken from  Alteryx.read("#1") should have? (InputhPath, OutputPath)

 

nmosley
7 - Meteor

I am performing a directory sweep to get the full path of the new .ZIP files. 

From there I use ReplaceChar([InputPath], '\', '/') to flip the slash direction on InputPath only. 

OutputPath is left in the normal '\' format though I am not sure why.

Both fields are defined as V_WString 32767.

 

What error are you receiving?  If this isn't enough I can attempt to convert the workflow to run on sample data.

Labels