Published 07-21-2016; last edited 07-27-2021 by APIUserOpsDM.
As currently designed, the Amazon S3 Download tool only allows one file, or object, to be read in at a time. This article explains how to create a workflow and batch macro that will read in the list of objects in a bucket and let you filter for the file(s) you want using wildcards, much as you can with the regular Input tool. The workflow is attached.
Prerequisites
Install the AWS Command Line Interface, then open a Command Prompt in Windows (right-click, Run as Administrator) and run "aws configure" to store your credentials:
aws configure
AWS Access Key ID [None]: accesskey
AWS Secret Access Key [None]: secretkey
Default region name [None]: us-west-2
Default output format [None]: text
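Not from the original article, but as a quick sanity check you can confirm that the credentials saved by "aws configure" are being picked up. The short Python sketch below assumes the boto3 package is installed and that the default profile is used:

import boto3

# Minimal sketch: verify that the credentials written by "aws configure" work.
# Assumes boto3 is installed; it reads the same default credentials file.
sts = boto3.client("sts")
identity = sts.get_caller_identity()
print("Authenticated as:", identity["Arn"])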
Create the Workflow and Macro
Now we are ready to create the workflow and macro. First, create a batch file (for example, listfiles.bat in your user folder) containing the commands below; a Run Command tool in the workflow will execute it to write the bucket's object list to a text file:
cd\
cd program files\amazon\awscli
aws s3 ls s3://alteryxtest > c:\users\<user>\awslist.txt
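If you would rather drive this step from a script than a plain batch file, the rough Python sketch below shells out to the same AWS CLI command and writes its output to a text file for the workflow to read back. It is only an illustration: the bucket name "alteryxtest" and the output path are placeholders to replace with your own.

import subprocess

# Minimal sketch: run the same "aws s3 ls" command the batch file uses and
# capture its output in a text file. Bucket and path are placeholders.
bucket = "alteryxtest"
output_path = r"C:\temp\awslist.txt"

with open(output_path, "w") as out_file:
    subprocess.run(
        ["aws", "s3", "ls", f"s3://{bucket}"],
        stdout=out_file,
        check=True,  # raise an error if the CLI call fails
    )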
This workflow can easily be enhanced with the Interface tools to let the user select the bucket to be searched and choose which files to filter for.
Please feel free to leave any questions in the Comment Section. Thank you for reading!
Hi @Michal! Thanks for reaching out. That error is almost certainly coming from AWS. Can you paste the exact full error that you are getting?
Thanks!
Dan
Excellent, @Michal! You're most welcome, but great job on your part. Happy Alteryxing!
Hello there!
I am having an interesting problem with this one. I was able to create the batch file via Alteryx, but when I run the workflow I get an error (code 1) and the .txt file comes back blank. However, when I run the batch program manually, it works fine.
Any thoughts?
Oh, that's my problem! On the new server I forgot to set it up to run as admin. Thank you, sir!
Sure thing, @Treyson!
I'm getting this error when trying to run the GetObjectList workflow: Error: Run Command (1): The external program "C:\Users\<myuserid>\listfiles.bat" returned an error code: 255
What does this mean?
I have several large text files that I need to read from S3 and append together. Looking for an easier/faster way to do this.
@lgregor I was getting the same issue at first ("C:\Users\<myuserid>\listfiles.bat" returned an error code: 255). I was able to resolve this by creating a C:\temp folder rather than trying to point to my own user directory.
Further, I just wanted to add some clarity on a couple of items that I struggled with so that others do not run into the same problems.
Prerequisites
First off, do not skip over these! The directions above are straightforward: open a Command Prompt in Windows (right-click and Run as Administrator) and type "aws configure".
Text Input
cd\
cd program files\amazon\awscli
aws s3 ls s3://alteryxtest > c:\users\<user>\awslist.txt
"alteryxtest" is going to be your own bucket name in AWS, so go ahead and replace that with your bucketname
Run Command Tool
The Run Command tool is essentially pulling your list of file names from AWS S3 and creating a text file with those file names. These file names are then passed into the batch macro, which updates the "Object Name" for each file. With that in mind, leave the Write Source and Command options in the configuration pane as is. Run the workflow; it will create a .txt list of file names in your C:\temp folder, and the rest of the workflow will error. Then, under Read Results, input the .txt file that was just created (C:\temp\awslist.txt).
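For anyone who wants to see what the workflow does with that text file, here is a hedged Python sketch of the parsing step: each line of "aws s3 ls" output contains a date, time, size, and object name, and a wildcard (as described in the article) can then be applied to the names. The file path and pattern below are examples, not values from this thread.

import fnmatch

# Minimal sketch: extract object names from the "aws s3 ls" output file and
# keep only those matching a wildcard pattern. Path and pattern are examples.
list_path = r"C:\temp\awslist.txt"
pattern = "*.csv"

object_names = []
with open(list_path) as f:
    for line in f:
        parts = line.split(None, 3)  # date, time, size, object name
        if len(parts) == 4:  # "PRE folder/" prefix lines have fewer columns and are skipped
            object_names.append(parts[3].strip())

matches = [name for name in object_names if fnmatch.fnmatch(name, pattern)]
print(matches)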
Batch Macro
Right-click on the Batch Macro and select Open Macro. There you can configure the Amazon S3 Download tool to sync up to your instance. Under Options in the Configuration Pane, make sure that if you're pulling in, say, CSVs, Comma-Delimited Text Files is selected under File Format. If you're getting the error "Too many fields in record #1," this can usually be resolved by checking "Treat Read Errors as Warnings" in the Configuration Pane.
Save the Batch Macro and make sure that when you navigate back to your workflow you update it with your newly saved Batch Macro (right-click on any white area of the canvas, select Insert, then Macros). When prompted with "Choose Field: Control Parameter," you can select "Name" in the dropdown.
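If it helps to picture what the batch macro is effectively doing on each iteration, here is a loose Python analogue (not the tool's actual implementation): for every object name fed in through the control parameter, one object is retrieved. It assumes boto3 is installed; the bucket name, object list, and target folder are placeholders.

import os
import boto3

# Minimal sketch of the per-iteration behavior: each object name from the list
# becomes one retrieval, just as the macro updates "Object Name" per record.
# Bucket, object names, and target folder are placeholders.
s3 = boto3.client("s3")
bucket = "alteryxtest"
object_names = ["file1.csv", "file2.csv"]
target_folder = r"C:\temp\downloads"

os.makedirs(target_folder, exist_ok=True)
for name in object_names:
    local_path = os.path.join(target_folder, os.path.basename(name))
    s3.download_file(bucket, name, local_path)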
Hi Michael, your suggestions are awesome and they almost did the trick for me. There's just one issue I haven't been able to get past: how do you deal with S3 subdirectories?
So... when I run line 3 of the CLI batch file as "aws s3 ls s3://alteryxtest --recursive > c:\temp\awslist.txt", I get something like this: "s3://alteryxtest/building1/employee1/file.json". But when I update the object to "building1/employee/file.json" using the macro, I get nothing back from S3. It works fine when the file is directly in the main bucket, but not in subdirectories.
Thanks in advance for any suggestions you could provide.
Ernie.
Ernie,
if you upload a file into the subdirectory as "building1/employee/file.json", what do you get when you run: aws s3 ls s3://alteryxtest/building1/employee/ ?
The CLI ls call would definitely retrieve the "file.json" file name, and using the macro to replace the path in the S3 Download tool with this list works great. The challenge is not the ls call but the S3 Download configuration. This is what hasn't worked for me so far (screenshot attached): it doesn't read anything back.
Ernie, if you change the file format down below to json, will it help?
Thanks. What I discovered is that even though the file has a .json suffix (and JSON formatting), it needs to be read as a CSV with \0 as the delimiter, plus a few other caveats... So I got this far, and now I'm getting an "LSeek is not supported" error that I can't find any references for :-(.
Very useful - thanks for sharing!
You're very welcome @andyuttley!
Yes, but not by using the solutions suggested in this thread. Given that I have over 500k JSON files, each in a unique subdirectory, pulling the full object list as suggested here would be prohibitive from a performance perspective.
As a workaround:
1) I utilized Lambda (and JavaScript) to copy the original JSON files, flatten them, and then copy them to a more streamlined S3 directory structure where they are grouped by day rather than by the 500k distinct identifiers.
2) To upload to Redshift, I used an iterative macro running a Python file that connects to Redshift and runs a COPY command.
The result is that 8 billion records are uploaded to Redshift in under 12 minutes each day thanks to Redshift's massively parallel processing. The biggest cost is in the Lambda event-based jobs and the duplicate S3 files. However, the new structure also makes recovery and backfills much easier.
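Not part of the post above, but for readers curious what step 2 of that workaround might look like in code, here is a hedged Python sketch using psycopg2 and Redshift's COPY command. Every connection detail, table name, S3 prefix, and IAM role below is a placeholder, not a value from this thread.

import psycopg2

# Minimal sketch: issue a Redshift COPY so the cluster loads all objects under
# an S3 prefix in parallel. All identifiers below are placeholders.
conn = psycopg2.connect(
    host="my-cluster.example.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="loader",
    password="********",
)
copy_sql = """
    COPY daily_events
    FROM 's3://my-streamlined-bucket/2021/07/27/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
    FORMAT AS JSON 'auto';
"""
with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # the transaction commits when the block exits cleanly
conn.close()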
Thanks, this whole post was super helpful! Regarding the subdirectories (and files), you can get them listed by adding the following parameter after the bucket name:
--recursive
For example:
aws s3 ls s3://bucketname/top_directory_you_want/ --recursive > C:\awslist.txt
Was I happy when this worked!
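For a Python-flavored version of the same idea (again, not from the original post), boto3's paginator can list every object under a prefix, which is what --recursive gives you on the CLI. The bucket and prefix below are placeholders.

import boto3

# Minimal sketch: list every object under a prefix, the boto3 equivalent of
# "aws s3 ls s3://bucketname/top_directory_you_want/ --recursive".
s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

keys = []
for page in paginator.paginate(Bucket="bucketname", Prefix="top_directory_you_want/"):
    for obj in page.get("Contents", []):
        keys.append(obj["Key"])

print(len(keys), "objects found")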
Interesting to hear what you achieved, @ecastruita.
I have permission issues too. I am unable to access S3 using Alteryx. How did you resolve that, @Michal?
@DanC this post has been invaluable for me and my team in building a solution that I've been working on. I can't thank you enough for sharing your insights and building such tools for the community. I hope you continue building solutions and resources like these. You're a true champion!
Also, special thanks to @shaynie for sharing the --recursive keyword; it was exactly what I was looking for to read all the files from all the subfolders/directories at once.
The Alteryx community has some incredible contributors. Thank you both once again!
@SiddhantSurve - Thank you for your feedback and kind words! They are greatly appreciated! And sorry for the very late reply. Not sure why I didn't get notified of your post. Very happy that this post was of use to you. Best of luck to you!
Hi @AA007. I am moving your post to our Designer Discussions board for more visibility.
Is this compatible with the AMP Engine for reading multiple files at once?
Would this work if running from the server Gallery as well? Or do we have to go through the same configuration on the server?
Hello,
Four years (or more) after the initial post, is there now a way (or something planned) to natively use wildcards to select multiple objects?
Julien