Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.
It's the most wonderful time of the year - Santalytics 2020 is here! This year, Santa's workshop needs the help of the Alteryx Community to help get back on track, so head over to the Group Hub for all the info to get started!

Splitting the output file into n smaller subfiles based on size of file

Highlighted
7 - Meteor

Hi All ,

 

Is there a way to split a file beyond a given size  into  multiple sub files. Eg.  if the size of output file from the alteryx workflow  is 1 GB then split the file into two files each with size less than 1 GB and send the final  output as zipped file like  filename_1.zip  and filename_2.zip  each containing the respective splitted files

 

The workflow should break the output  automatically without user specifying the number of splits required . The only condition to ensure is that any file can't be 1 GB or more

 

Please advice if any one has ever faced with this scenario or have idea to tackle it .

 

Thanks ahead for your time and response

 

Varun

Highlighted
Alteryx Partner

So, as for the file size issue, the only thing i can think of initially is to make use of the max records per file option in the output tools configuration window (image below). Whilst this is based upon number of records rather than file size, you could manually identify how many records this should be. Not exactly what you are looking for but may be a start.

 

Brian

 

Screen Shot 2017-06-24 at 11.41.19.png

Highlighted
7 - Meteor
Hi Brian ,
Thanks for the response . This option would be the last option as trial and learn approach . But as you know the number of records isn't effective from automation point of view because it doesn't always define the size of file . The output file in my case which is a tab delimeted can accomodate n new columns in future bases on what is send from input and due to which for instance 5000000 rows previously with 10columns defining ~ 1GB will become 10+n columns now but this time 5million rows would be definetely more than ~1GB.
I hope you get my point . Btw iam also simultaneously exploring if this can be handled outside of alteryx using other tools in extreme case but wud prefer something inside alteryx wihout manual intervention each time.

Thanks
Varun
Highlighted
16 - Nebula
16 - Nebula

Hey @varundcs

 

It's not perfect, but I've added a way to get close to this.

  • Because you're pushing out delimited text, I'm assuming the data is uncompressed.   By contrast - Excel does quite a lot of compression on files so this method won't work
  • you can roughly calculate the size of each row by taking the string length - if I'm not mistaken, one character is roughly one byte, but if I'm wrong you can adjust the math.
  • You can then split this

If this is not precise enough, or if you need to use a compressed file, the only way to do this is:

  • Create an iterative macro, where you write out 1 row, and check the file size (use the directory tool).   If it is > 1GB, then increment the file size
  • Send the remaining rows back through the macro
  • This is gonna be fairly slow though, but it will work even if the file is compressed

 

2017-06-25_21-13-44.png

Highlighted
Alteryx Partner

Attached is a macro that builds off of @brianprestidge comment. The attached Macro "Split and Zip" has three supporting macros, and uses the command line tool from 7-Zip.

 

I built this to generate .csv files, but you could change the config of one tool (noted below) and generate any file type you need.

 

You can specify how much data you would like to sample to generate the average record size estimate from (the higher percent, the greater the accuracy, but it will take longer). There is also a buffer option because sometimes the sample may under represent (randomly picking the smallest records), and the buffer helps ensure the files will be smaller then the set size, but there is no 100% guarantee.

 

As for the File Name, for an example, when you set the filename to "out.csv" and three files get generated, they will be named:
out.csv
out_1.csv
out_2.csv

 

I also made zipping the output files optional since zipping can take a long time.

 

You will need to download the command line tool for 7-Zip and place it in the same directory as these macros.

 

To download, go to: http://www.7-zip.org/download.html
- Find the non-beta "7-Zip Extra: standalone console version, 7z DLL, Plugin for Far Manager"
- Download and extract the .7z file, extract and pull out the file "7za.exe" from the x64 folder (if you are on a 64 bit OS)
- Place "7za.exe" in the same directory as these macros

 

Here is a quick breakdown of what each macro is doing:

 

Split and Zip

Split and Zip.png
- Sample the records, eg 1% is 1 in 100
- Save File to a file in the temp directory (uses macro with "Records per File" parameter set to zero so there is no split)
- Get File Size macro to get the file size
- Interface elements and a Formula tool to calculate the desired "Records per File"
- Save File with Split using calculated "Records per File"
- Detour option for zipping files
- Formula to swap out the extension for zip
- Zip File macro to zip the files


Save File with Split

Save File with Split.png
- Output Data tool, configured to split or not based on control parameters (if you need a different file type, you would change the config of this tool)
- Summarize to get record count
- Generate Row for each needed File Number (counts from zero to "Record Count"/"Records per File" rounded up)
- Append with starting FullPath
- Formula to update the FullPath and Record count for each file


Get File Size

Get File Size.png
- Directory configured with control parameters, so to pulls info on a specific file
- Select to get just the Size field
- Join by record position to combine with Input data

 

Zip File

Zip File.png
- Run Command using cmd.exe to execute the 7-Zip command (7za a "out.zip" "out.csv" > "zipmessage.csv") so we can pipe ">" the results to a text file to read back in (note the "/c" command prefix)
- Filter to pull out the line with the zip file size
- Formula to keep just the file size number
- Select to keep just the Zip File Size field
- Join by record position to combine with Input data


I have also attached an example "use split and zip.yxmd" to test this out with some generated data.

 

Please test this out, and let me know if this works or breaks in any way, I am happy to adjust it. If you need additional help, you are also welcome to send me a private message, and we can setup a screen share session.

 

Thank you!

Highlighted
16 - Nebula
16 - Nebula

Wow @Joe_Mako - that's tremendously comprehensive!

 

Not sure if this is something you've solved before in your own work, but this is a very big investment in a solution!

Highlighted
7 - Meteor
Thanks @Joe_Maki for the responce . Let me work on your lines .
Labels