Read *.csv file in a double archive [it's a *.tgz file in a *.zip file]
Hello,
I am unable to open doubly archived files directly in Alteryx:
1. I receive a file named like: Export_ZJENKINS_SRF015_TACTICAL_SOL_20210721_063819.tgz.vfe.zip
2. I need to unzip that file, which gives me: Export_ZJENKINS_SRF015_TACTICAL_SOL_20210721_063819.tgz
3. I need to open the file from step 2 to obtain the following *.csv file: PRFGPRF1.ZJENKINS_SRF015_TACTICAL_SOL_20210721_063824.dmp.csv
I want a workflow that unzips the file and then gets the *.csv inside the *.tgz file without any manual steps.
Be careful: the name of the *.csv file inside the *.tgz is not known in advance and needs to be identified automatically by the workflow.
The beer will be served in Bucharest, Romania 🙂
- Labels:
- Dynamic Processing
- Input
Hi @EduardZ
Is there any way you can post a sample file for us to work with? I have a method of extracting data from a zip file using Python that I think could be applied here, but I would need to configure it to your needs.
Thanks!
Phil
Please see attached for how you can do it 🙂
You have to update the source and destination columns in the input file.
You also need to have 7-Zip installed.
Hope this helps!
Regards,
Hey @EduardZ
So I spent way more time on this than I should have, but I liked the challenge! 🙂
Here is a method that should work for you. It extracts the contents of the zip file to your temp folder and then reads in the .tgz data as a CSV file.
Step 1: Directory Tool pointed to the location of the zip file
Step 2: Create some Python-friendly path names
You can see here that I'm hardcoding my local temp folder as the location to extract the TGZ file, and also creating a field with that path and file name. Since the name of the TGZ file is the same as the ZIP, I can just use a replace command to change it for later, when Python reads it.
Step 3: Keep only the fields needed for the Python tool (this step isn't really that important, but it helps me keep my sanity when looking at the workflow)
Step 4: Let Python do all the work!
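For readers who cannot open the attachment, here is a minimal sketch of what the extraction part of such a Python step might look like. It is not necessarily the code in the attached workflow: the function name `extract_inner_tgz` and its parameters are hypothetical, and it uses only the standard library. It unpacks the zip into a temp folder and discovers the inner archive by extension rather than by a hardcoded name.

```python
import tempfile
import zipfile
from pathlib import Path


def extract_inner_tgz(zip_path, dest_dir=None):
    """Unpack zip_path and return the paths of any .tgz/.tar.gz files found inside.

    Hypothetical helper for illustration only.
    """
    # Assumption: if no destination is given, a fresh temp folder is used.
    # mkdtemp() does not delete the folder automatically.
    dest = Path(dest_dir) if dest_dir else Path(tempfile.mkdtemp())
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
    # Discover the inner archive by its extension instead of hardcoding its name.
    return sorted(
        p for p in dest.rglob("*")
        if p.is_file() and p.name.lower().endswith((".tgz", ".tar.gz"))
    )
```

In a workflow, `zip_path` would come from the full path field produced by the Directory tool, and the returned path would be handed to whatever reads the .tgz.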
Final output and workflow:
I like this method, since it uses your local temp folder as the extraction point. This means that you're not creating files/folders on your local drive that will have to be deleted later. This method also doesn't require you to know the name of the CSV file inside the TGZ file.
One interesting thing I want to call out: you'll notice that the first column header is the name of the CSV file. This happens because the file was tarballed before being gzipped. Pandas just treats the file as a gzipped CSV, so the tar header (which starts with the file name) ends up in the first column header. This is to be expected. The remaining column names come through just fine.
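If the stray file name in the first header is a problem, one possible alternative is to open the tarball with Python's `tarfile` module, which strips the tar header and also reveals the CSV name without knowing it in advance. This is a sketch under assumptions, not what the attached workflow does: the function name `read_csvs_from_tgz` is hypothetical, and UTF-8 encoding and a comma delimiter are assumed.

```python
import csv
import io
import tarfile


def read_csvs_from_tgz(tgz_path, encoding="utf-8"):
    """Return {member_name: list_of_rows} for every .csv file inside a .tgz archive.

    Hypothetical helper for illustration only.
    """
    results = {}
    with tarfile.open(tgz_path, mode="r:gz") as tar:
        for member in tar.getmembers():
            # Skip directories and anything that is not a CSV file.
            if not (member.isfile() and member.name.lower().endswith(".csv")):
                continue
            raw = tar.extractfile(member)  # binary file-like object
            # Assumption: the CSV is comma-delimited text in the given encoding.
            text = io.TextIOWrapper(raw, encoding=encoding, newline="")
            results[member.name] = list(csv.reader(text))
    return results
```

The object returned by `tar.extractfile(member)` is file-like, so it could also be passed to `pandas.read_csv` instead of `csv.reader` if a DataFrame is needed for the Python tool's output.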
Attached is a copy of the workflow I built. You should be able to update the Directory tool and the Formula tool to paths on your computer and run it right away.
If this solves your problem, please mark the answer as correct; if not, let me know!
Cheers!
Phil
