community
cancel
Showing results for 
Search instead for 
Did you mean: 

Alteryx Designer Knowledge Base

Definitive answers from Designer experts.

Amazon S3 Download – Use Wildcards to Select a Single or Multiple Objects

Moderator
Moderator
Created on

As currently designed, the Amazon S3 Download tool only allows one file, or object, to be read in at a time. This article explains how to create a workflow and batch macro that will read in the list of objects in a bucket and allow you to filter for the file(s) you want using wildcards; similar to how you can use them with the regular Input tool. Workflow attached.

 

Pre-Requisites

  1. An Amazon S3 account – Don’t have one? Create one
  2. Download the Amazon Web Services (AWS) Command Line Interface (CLI)
  3. Configure the CLI with your credentials
    1. Navigating to the C:\Program Files\Amazon\AWSCLI folder and type in the configure command
      aws configure
    2. Follow the prompts and enter the information requested. Choose text for output format:
      AWS Access Key ID [None]: accesskey
      AWS Secret Access Key [None]: secretkey
      Default region name [None]: us-west-2
      Default output format [None]: text

 

Create the Workflow and Macro

Now we are ready to create the workflow and macro:

  1. Using a Text Input tool, enter the lines below. The last line can be any directory of your choosing
    cd\
    cd programs files\amazon\awscli
    aws s3 ls s3://alteryxtest > c:\users\<user>\awslist.txt
  2. Next, use the Run Command tool to create the .bat file, run the .bat file and read in the list of objects returned by the .bat file:
    1. Output the text from the above Input Tool to a .bat file in the Write Source section as a .csv file. Remember to change the delimiter to none (\0) and uncheck the box that says "First Row Contains Field Names" so commas and headers are not written out to the .bat file
    2. In the 'Run External Programs' section under 'Command', choose the .bat file that gets created in the previous step. Please Note: This file will not exist yet, so you must run the workflow once to get it created
    3. Check the 'Run Minimized' and 'Run Silent' boxes
    4. In the 'Read Results' sections at the bottom, choose the file that was created from the .bat file. Again, run the workflow once after the .bat file is run in order for there to be a file to choose. Be sure to uncheck the 'First Row Contains Field Names' box
  3. Use the Text to Columns tool to parse the list that is returned into usable files names
  4. Use a Filter tool to filter the files desired
  5. Create and insert a batch macro to pull each file filtered
    1. Use a Control Parameter tool to update the Object Name of a properly configured Amazon S3 Download tool
    2. Add a macro output
    3. Save the macro and insert into the original workflow
  6. Add a browse after the inserted macro to list the data

 

This workflow can easily be enhanced using the Interface tools to allow the user to select the bucket to be searched, as well as give the user options on what files to filter for.

 

 

Please feel free to leave any questions in the Comment Section. Thank you for reading!

Attachments
Comments
Alteryx Certified Partner
Hi @DanC, I am unable to list/copy files from s3. I followed your steps and managed to connect to aws and list the available buckets using the command, but then when I want to go deeper in the folders, it gives me Access Denied error, any idea why is that or what can be done? Thanks a lot! Michal
Moderator
Moderator

Hi @Michal! Thanks for reaching out. That error is almost certainly coming from AWS. Can you paste the exact full error that you are getting?

 

Thanks!

 

Dan

Alteryx Certified Partner
I managed to figure out in the meantime, thanks a lot Dan! It was a matter of user permissions in AWS.
Moderator
Moderator

Excellent, @Michal! You're most welcome, but great job on your part. Happy Alteryxing!

Quasar
Quasar

 Hello there! 

 

I am having an interesting problem with this one. I have been able to create the batch file via Alteryx but when I run the workflow I get an error message 1 and the .txt file returns blank. However when I run the batch program manually, it works fine.

 

Any thoughts?

Moderator
Moderator

Hi @Treyson,

 

When you run the batch file manually, are you running the Command Prompt as an Administrator?

 

Thanks!

Quasar
Quasar

oh yeah that's my problem! New server forgot to set up to run as admin! Thank you sir!

Moderator
Moderator

Sure thing, @Treyson!

Meteor

I'm getting this error when trying to run the GetObjectList workflow: Error: Run Command (1): The external program "C:\Users\<myuserid>\listfiles.bat" returned an error code: 255

What does this mean?

 

I have several large text files that I need to read from S3 and append together. Looking for an easier/faster way to do this.

Alteryx
Alteryx

@lgregor I was getting the same issue at first ("C:\Users\<myuserid>\listfiles.bat" returned an error code: 255"). I was able to resolve this by creating a C:\temp folder rather than trying to point to my own directory.

 

Further, I just wanted to add some clarity on a couple items that I struggled with so that others do not do the same. 

 

Pre-Requisities

First off, do not skip over these!! The directions are straight forward above. Open a Command Prompt (right click and Run as Admin) in Windows and type in "AWS configure."

 

Text Input

cd\
cd programs files\amazon\awscli
aws s3 ls s3://alteryxtest > c:\users\<user>\awslist.txt

"alteryxtest" is going to be your own bucket name in AWS, so go ahead and replace that with your bucketname

 

Run Command Tool

The Run Command tool is essentially pulling your list of file names from AWS S3 and creating a text file with those file names. These file names are then going to be passed into the batch macro by updating the "Object Name" for each file. With that in mind, leave the Write Source and Command options in the configuration pane as is. Run the workflow and that will create a .txt list of filenames in your C:\temp folder and the rest of the workflow will error. Next, go ahead and under Read Results input the .txt file that was just created (C:\awslist.txt). 

 

Batch Macro

Right click on the Batch Macro and select Open Macro. There you can configure the Amazon S3 Download tool to sync up to your instance. Ensure under options in the Configuration Pane that if you're pulling in, say CSV's, the Comma-Delimited Text Files is selected under File Format. If you're getting the error "Too many fields in record #1," this can/should be resolved by checking the "Treat Read Errors as Warnings" in the Configuration Pane

Save the Batch Macro and make sure that when you navigate back to your workflow you update it with your newly saved Batch Macro (Right click on any white area in the canvas, select Insert, Macros).  When prompted with "Choose Field:Control Parameter," you can select "Name" in the dropdown.

 

 

Meteor

Hi Michael, your suggestions are awesome and they almost did the trick for me. There's just one issue I haven't been able to get past... How do you deal with s3 sub directories?

 

So... when I run the cli line 3 as "aws s3 ls s3://alteryxtest -- recursive c:\temp.awslist.txt" I get something like this:  "s3://alteryxtest/building1/employee1/file.json". So when I update the object to "building1/employee/file.json" using the macro, I get nothing back from s3. It works fine when the file is directly in the main bucket, but not in sub-buckets (sub directories).

 

Thanks in advanced for any suggestions you could provide.

 

Ernie.

Alteryx Certified Partner

Ernie, 

 

so if you upload a file into the subdirectory "building1/employee/file.json"

 

what do you get when you list the: aws s3 ls s3://alteryxtest/building1/employee/

 

?

Meteor

The CLI LS call would definitely retrieve the  "file.json" file name. Using the macro that replaces the the path in the s3 download tool with this list works great. The challenge is not the LS call, but the s3 download configuration. This is what hasn't worked for me so far (picture)... It doesn't read anything back.

s3config.JPG

Alteryx Certified Partner

Ernie, if you change the file format down below to json, will it help?

Meteor

Thanks, what I discovered is that even though it's suffixed as a json (including json formatting), it needs to be read as a csv with \0 as delimiters and a few other caveats... So, I got this far and now I'm getting an "LSeek is not supported" error that I can't find references for :-(.

Bolide
Bolide

Very useful - thanks for sharing! 

Moderator
Moderator

You're very welcome @andyuttley!

Meteoroid
Were you able to figure out a workaround for when the files are in a S3 sub directory? 
Meteor

Yes, but not utilizing the solutions suggested in this thread. Given that I have over 500k json files and a unique sub-directory for each, using the list pull as suggested in this thread would be prohibitive from a performance perspective.

 

As a workaround:

1) I utilized lambda (& javascript) to copy the original json files, flatten them, and then copy them to a more streamlined s3 directory structure where they are grouped by day rather than by the 500k distinct identifiers.

2) To upload to Redshift, I used an iterative macro running a python file that connects to redshift and runs a copy command.

 

The result is that 8 billion records are uploaded to redshift in under 12 minutes each day due to redshift's massive parallel processing. The biggest cost is in the lambda event-based jobs and the duplicate s3 files. However, the new structure makes recovery/backfills way easier also.