I'm trying to use the download tool to download data from AWS S3 using a presigned URL, but I keep getting the following error: Error transferring data: Failure when receiving data from the peer. What am I doing wrong?
This sounds like a VPN issue. Any chance your bucket is behind a VPN and you are not on it?
Could be. All I know is that Alteryx is the only place where this presigned URL is failing. I tried running the same code in my IDE, and then I tried opening the URL in a web browser, and it worked both times. I know the bucket I'm trying to access is protected by KMS, but that's about all I know.
EDIT: I could probably get the download to work if I had the python tool do it, but I'm using this workflow to download some large files (~5 GB), and I don't trust the python tool to do that download efficiently.
Ok... with that info, there are a few things it could be.
1) can you try with a smaller file -> let me know if that works.
2) can you toggle the output mode from "download to temp file" to "download to a specific file" (and give it a filename)?
3) can you confirm that that exact link (or any exact link which fails in Alteryx) runs in Postman?
4) if you have the AWS CLI -> can you try to download it using aws s3 cp .... commands?
also ---> try adding this header to your request (see the sketch below):
User-Agent
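something like this in Python if you want to test the exact link outside of Alteryx/Postman -> (the URL is a placeholder -> paste in the exact presigned link that fails):

```python
import requests

# placeholder -> substitute the exact presigned URL that fails in Alteryx
url = "https://my-bucket.s3.amazonaws.com/my-key?X-Amz-Signature=..."

# some proxies/servers reject requests without a User-Agent, so set one explicitly
headers = {"User-Agent": "Mozilla/5.0"}

resp = requests.get(url, headers=headers, timeout=120)
resp.raise_for_status()  # raises if S3 returned an error instead of the object
print(f"downloaded {len(resp.content)} bytes")
```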
Okay, I've generated a new presigned URL for a smaller data file (~83 KB), and the URL is still failing in Alteryx with the same error. Interestingly, I tried running the URL in Postman too, and the request ended up timing out and failing there too. Not sure what this means though.
ok -> that's a good sign -> that means it's not Alteryx-specific, which would be a bit of a black box. if you are on VPN -> go off VPN. if you have a KMS key -> make sure you have access to both the KMS key and the S3 bucket when you are creating the presigned URL. if you can -> extend the timeout. Are you creating the presigned URL in boto3 or in the console?
I'm using boto3 to create the presigned URL; that's the work the python tool is doing in my workflow. I could have the python tool do everything if I wanted to, but it would result in very poor performance.
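For reference, the presigned-URL generation in my Python tool looks roughly like this (the bucket and key below are placeholders):

```python
import boto3

s3 = boto3.client("s3")

# placeholders for my actual bucket/key; the URL inherits the permissions of
# these credentials, including access to the KMS key encrypting the object
url = s3.generate_presigned_url(
    "get_object",
    Params={"Bucket": "my-bucket", "Key": "data/large_file.csv"},
    ExpiresIn=3600,  # URL lifetime in seconds
)
```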
Okay, I think I've narrowed it down further. I was finally able to get my presigned URL to work in Postman, and the fix was to change my proxy settings to use the company's proxy server. That explains why my downloads were fine using the Python tool: I could use os.environ to add those proxy settings to the environment variables. I'm not sure how to do that with the download tool, though. Any suggestions?
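For context, this is roughly what I'm doing in the Python tool (the proxy address is a placeholder for our actual server):

```python
import os

# placeholder for the company's actual proxy server
os.environ["HTTP_PROXY"] = "http://proxy.example.com:8080"
os.environ["HTTPS_PROXY"] = "http://proxy.example.com:8080"
# both requests and boto3 read these variables when opening connections
```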
ok -> so 1) for the proxy you can go to Options / User Settings / Edit User Settings.
2) this may prevent your boto3 from connecting to AWS -> you may need the proxy settings there too.
3) the bad performance we talk about with python is when you are processing massive dataframes -> you are (presumably) executing a few lines of code -> I do not believe you would see significantly worse performance in boto3 than in the download tool. And since you can use a config to set concurrency -> I am (high-90s percent) sure you would see enhanced performance for huge files in boto3 vs the download tool with a presigned URL. I can take a look at timing for a 9 GB file in a concurrent download in boto3 outside of Alteryx -> I don't think Alteryx is adding much here.
Okay that was actually pretty revealing. It makes sense that massive dataframes would be the bottleneck in the tool. If that's the case, I don't even need a presigned URL at this point. I'll just use boto3 to download to a temp file, then return the path of that temp file in the dataframe. Then I can use a dynamic input tool to read the data into Alteryx. Would that work?
EDIT: Would it also be possible to use a python library like tqdm to track the progress of the python tool as it's running?
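Here's a rough sketch of what I have in mind for the Python tool (bucket/key are placeholders, and I'm assuming boto3's per-chunk Callback hook can drive a tqdm bar):

```python
import os
import tempfile

import boto3
import pandas as pd
from ayx import Alteryx  # Alteryx Python tool API
from tqdm import tqdm

s3 = boto3.client("s3")
bucket, key = "my-bucket", "data/large_file.csv"  # placeholders

# download to a temp file instead of pulling ~5 GB into a dataframe
local_path = os.path.join(tempfile.gettempdir(), os.path.basename(key))

# boto3 calls Callback with the number of bytes transferred per chunk,
# which is what tqdm's update() expects
total = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
with tqdm(total=total, unit="B", unit_scale=True) as bar:
    s3.download_file(bucket, key, local_path, Callback=bar.update)

# output only the path; a downstream tool reads the file itself
Alteryx.write(pd.DataFrame({"path": [local_path]}), 1)
```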
hmm -> not sure about tqdm. I'd recommend setting up a transfer config as shown here:
https://boto3.amazonaws.com/v1/documentation/api/1.9.42/guide/s3.html
1) you can set up multipart downloads for larger files.
2) you can increase concurrency.
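e.g. something like this -> (the numbers are just starting points to tune, and the bucket/key/destination are placeholders):

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# files larger than multipart_threshold are split into parts,
# and up to max_concurrency parts transfer in parallel threads
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # 64 MB
    multipart_chunksize=64 * 1024 * 1024,
    max_concurrency=10,
)

s3.download_file("my-bucket", "data/large_file.csv", r"C:\temp\large_file.csv", Config=config)
```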
This -> "I'll just use boto3 to download to a temp file, then return the path of that temp file in the dataframe. Then I can use a dynamic input tool to read the data into Alteryx. Would that work?"
Would be my recommendation, with two caveats ->
1) I might use the Run Command tool + AWS CLI vs python/boto3. I just like the CLI.
2) I'd use a batch macro vs a dynamic input. I am the most anti-dynamic-input person on the community, so this might just be me.