
Alteryx Designer Desktop Knowledge Base

Definitive answers from Designer Desktop experts.

Amazon S3 Download – Use Wildcards to Select a Single or Multiple Objects

DanC
Moderator
Created

As currently designed, the Amazon S3 Download tool only allows one file, or object, to be read in at a time. This article explains how to create a workflow and batch macro that will read in the list of objects in a bucket and allow you to filter for the file(s) you want using wildcards, similar to how you can use them with the regular Input Data tool. Workflow attached.

Pre-Requisites

  1. An Amazon S3 account – Don’t have one? Create one
  2. Download the Amazon Web Services (AWS) Command Line Interface (CLI)
  3. Configure the CLI with your credentials
    1. Navigate to the C:\Program Files\Amazon\AWSCLI folder and type in the configure command:
      aws configure
    2. Follow the prompts and enter the information requested. Choose text for output format:
      AWS Access Key ID [None]: accesskey
      AWS Secret Access Key [None]: secretkey
      Default region name [None]: us-west-2
      Default output format [None]: text
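
To verify the CLI is configured correctly before building the workflow, you can list the buckets your credentials can see and then the objects in one of them. This is only a quick sanity check, not part of the workflow itself; "alteryxtest" is the example bucket name used in this article, so substitute your own:

      aws s3 ls
      aws s3 ls s3://alteryxtest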

Create the Workflow and Macro

Now we are ready to create the workflow and macro:

  1. Using a Text Input tool, enter the lines below. The output path at the end of the last line can be any directory of your choosing:
    cd\
    cd program files\amazon\awscli
    aws s3 ls s3://alteryxtest > c:\users\<user>\awslist.txt
  2. Next, use the Run Command tool to create the .bat file, run the .bat file and read in the list of objects returned by the .bat file:
    1. In the Write Source section, output the text from the Text Input tool above to a .bat file, written as a .csv file. Remember to change the delimiter to none (\0) and uncheck the box that says "First Row Contains Field Names" so commas and headers are not written out to the .bat file
    2. In the 'Run External Programs' section under 'Command', choose the .bat file created in the previous step. Please Note: This file will not exist yet, so you must run the workflow once to get it created
    3. Check the 'Run Minimized' and 'Run Silent' boxes
    4. In the 'Read Results' section at the bottom, choose the file that was created from the .bat file. Again, run the workflow once after the .bat file is run in order for there to be a file to choose. Be sure to uncheck the 'First Row Contains Field Names' box
  3. Use the Text to Columns tool to parse the returned list into usable file names (a sample listing and example filter expression are shown after these steps)
  4. Use a Filter tool to filter for the desired files
  5. Create and insert a batch macro to pull each file filtered
    1. Use a Control Parameter tool to update the Object Name of a properly configured Amazon S3 Download tool
    2. Add a macro output
    3. Save the macro and insert into the original workflow
  6. Add a Browse tool after the inserted macro to view the data
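
For reference, the listing the .bat file writes to awslist.txt has one object per row, with the date, time, size, and object name separated by spaces. The rows below are only an illustrative sketch of typical aws s3 ls output (the file names are made up), but they show what the Text to Columns tool has to split apart:

    2017-05-02 14:31:08     102400 sales_jan.csv
    2017-05-02 14:31:09      98304 sales_feb.csv
    2017-05-02 14:31:10      55296 inventory.csv

Once the object name is in its own field, a wildcard-style selection can be written as a custom filter in the Filter tool using the formula language, for example (assuming the parsed column ends up named [Name], which is just how it is named in this sketch):

    REGEX_Match([Name], "sales_.*\.csv")

Only the rows that match the pattern are then passed on to the batch macro built in step 5.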

This workflow can easily be enhanced using the Interface tools to allow the user to select the bucket to be searched, as well as to give the user options on which files to filter for.

Please feel free to leave any questions in the Comment Section. Thank you for reading!

Attachments
Comments
Michal
9 - Comet
Hi @DanC, I am unable to list/copy files from s3. I followed your steps and managed to connect to aws and list the available buckets using the command, but then when I want to go deeper in the folders, it gives me Access Denied error, any idea why is that or what can be done? Thanks a lot! Michal
DanC
Moderator

Hi @Michal! Thanks for reaching out. That error is almost certainly coming from AWS. Can you paste the exact full error that you are getting?

 

Thanks!

 

Dan

Michal
9 - Comet
I managed to figure out in the meantime, thanks a lot Dan! It was a matter of user permissions in AWS.
DanC
Moderator

Excellent, @Michal! You're most welcome, but great job on your part. Happy Alteryxing!

Treyson
13 - Pulsar

 Hello there! 

 

I am having an interesting problem with this one. I have been able to create the batch file via Alteryx but when I run the workflow I get an error message 1 and the .txt file returns blank. However when I run the batch program manually, it works fine.

 

Any thoughts?

DanC
Moderator

Hi @Treyson,

 

When you run the batch file manually, are you running the Command Prompt as an Administrator?

 

Thanks!

Treyson
13 - Pulsar

oh yeah that's my problem! New server forgot to set up to run as admin! Thank you sir!

DanC
Moderator

Sure thing, @Treyson!

lgregor
7 - Meteor

I'm getting this error when trying to run the GetObjectList workflow: Error: Run Command (1): The external program "C:\Users\<myuserid>\listfiles.bat" returned an error code: 255

What does this mean?

 

I have several large text files that I need to read from S3 and append together. Looking for an easier/faster way to do this.

MichaelSu
Alteryx Alumni (Retired)

@lgregor I was getting the same issue at first ("C:\Users\<myuserid>\listfiles.bat" returned an error code: 255). I was able to resolve this by creating a C:\temp folder rather than trying to point to my own directory.

 

Further, I just wanted to add some clarity on a couple items that I struggled with so that others do not do the same. 

 

Pre-Requisites

First off, do not skip over these!! The directions above are straightforward. Open a Command Prompt in Windows (right-click and Run as Admin) and type in "aws configure."

 

Text Input

cd\
cd program files\amazon\awscli
aws s3 ls s3://alteryxtest > c:\users\<user>\awslist.txt

"alteryxtest" is going to be your own bucket name in AWS, so go ahead and replace that with your bucketname

 

Run Command Tool

The Run Command tool is essentially pulling your list of file names from AWS S3 and creating a text file with those file names. These file names are then going to be passed into the batch macro by updating the "Object Name" for each file. With that in mind, leave the Write Source and Command options in the configuration pane as is. Run the workflow; that will create a .txt list of file names in your C:\temp folder, and the rest of the workflow will error. Next, under Read Results, input the .txt file that was just created (C:\awslist.txt).

 

Batch Macro

Right click on the Batch Macro and select Open Macro. There you can configure the Amazon S3 Download tool to sync up to your instance. Ensure, under Options in the Configuration pane, that if you're pulling in, say, CSVs, Comma-Delimited Text Files is selected under File Format. If you're getting the error "Too many fields in record #1," this can/should be resolved by checking "Treat Read Errors as Warnings" in the Configuration pane.

Save the Batch Macro and make sure that when you navigate back to your workflow you update it with your newly saved Batch Macro (right click on any white area in the canvas, select Insert, Macros). When prompted with "Choose Field: Control Parameter," you can select "Name" in the dropdown.

 

 

ecastruita
7 - Meteor

Hi Michael, your suggestions are awesome and they almost did the trick for me. There's just one issue I haven't been able to get past... How do you deal with s3 sub directories?

 

So... when I run CLI line 3 as "aws s3 ls s3://alteryxtest --recursive > c:\temp\awslist.txt" I get something like this: "s3://alteryxtest/building1/employee1/file.json". So when I update the object to "building1/employee/file.json" using the macro, I get nothing back from S3. It works fine when the file is directly in the main bucket, but not in sub-buckets (sub directories).

 

Thanks in advance for any suggestions you could provide.

 

Ernie.

Michal
9 - Comet

Ernie, 

 

so if you upload a file into the subdirectory "building1/employee/file.json"

 

what do you get when you list the: aws s3 ls s3://alteryxtest/building1/employee/

 

?

ecastruita
7 - Meteor

The CLI LS call would definitely retrieve the "file.json" file name. Using the macro that replaces the path in the S3 Download tool with this list works great. The challenge is not the LS call, but the S3 Download configuration. This is what hasn't worked for me so far (picture)... It doesn't read anything back.

s3config.JPG

Michal
9 - Comet

Ernie, if you change the file format down below to json, will it help?

ecastruita
7 - Meteor

Thanks, what I discovered is that even though it's suffixed as a json (including json formatting), it needs to be read as a csv with \0 as delimiters and a few other caveats... So, I got this far and now I'm getting an "LSeek is not supported" error that I can't find references for :-(.

andyuttley
11 - Bolide

Very useful - thanks for sharing! 

DanC
Moderator

You're very welcome @andyuttley!

jmhutbay
6 - Meteoroid
Were you able to figure out a workaround for when the files are in an S3 sub-directory?
ecastruita
7 - Meteor

Yes, but not utilizing the solutions suggested in this thread. Given that I have over 500k json files and a unique sub-directory for each, using the list pull as suggested in this thread would be prohibitive from a performance perspective.

 

As a workaround:

1) I utilized lambda (& javascript) to copy the original json files, flatten them, and then copy them to a more streamlined s3 directory structure where they are grouped by day rather than by the 500k distinct identifiers.

2) To upload to Redshift, I used an iterative macro running a python file that connects to redshift and runs a copy command.

 

The result is that 8 billion records are uploaded to redshift in under 12 minutes each day due to redshift's massive parallel processing. The biggest cost is in the lambda event-based jobs and the duplicate s3 files. However, the new structure makes recovery/backfills way easier also.

shaynie
8 - Asteroid

Thanks, this whole post was super helpful!  Regarding the subdirectories (and files), you can get them to be listed by adding the following parameter after the bucket name

--recursive

 

For example:

aws s3 ls s3://bucketname/top_directory_you_want/   --recursive   > C:\awslist.txt

 

Was I happy when this worked!

DotDigitals
7 - Meteor

Interesting to know what you achieved @ecastruita 

I have permission issues too. I am unable to access S3 using Alteryx. How did you achieve that, @Michal?

SiddhantSurve
6 - Meteoroid

@DanC this post has been invaluable for me and my team in building a solution that I've been working on. Can't thank you enough for sharing your insights and building such tools for the community. I hope you continue building solutions/resources like these. You're a true champion!

 

Also, special thanks to @shaynie for sharing the --recursive keyword; it was exactly what I was looking for, to be able to read all the files from all the sub folders/directories at once.

 

The Alteryx community has some incredible contributors. Thank you both once again!

DanC
Moderator

@SiddhantSurve - Thank you for your feedback and kind words! They are greatly appreciated! And sorry for the very late reply. Not sure why I didn't get notified of your post. Very happy that this post was of use to you. Best of luck to you!

FláviaB
Alteryx Community Team

Hi @AA007. I am moving your post to our Designer Discussions board for more visibility.

aahaseebkhan
6 - Meteoroid

Is this compatible with the AMP engine to read multiple files at once?

dchu
6 - Meteoroid

Would this work if running from the server Gallery as well?  Or do we have to go through the same configuration on the server?

Julien_B
8 - Asteroid

Hello,

 

4 years (or more) after the initial post, is there now a way (or something planned) to natively use wildcards to select multiple objects?

 

Julien