Hello Community,
I had built a couple of workflows to scrape a few websites.
All of these worked, and I was able to extract the data I intended to.
A few months later, when I tried to run these workflows (without any modification), I got the error below:
Error: Download (3): Error transferring data: Failure when receiving data from the peer
All the workflows are now throwing the same error at the Download tool.
Could you please point me to what I should look for to fix this issue?
I have attached one of the workflows and below is the error screenshot.
Thanks and Regards,
Chaithanya
Hi @raochaithanya,
Would you have an example URL and page, please? Unfortunately, your workflow doesn't include the CSV file, so it is not possible to test it.
Does it work better if you tick "Encode URL Text" in the Download tool?
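For reference, URL encoding percent-escapes characters that are not safe in a URL. A minimal Python illustration of the same idea (this is just the general concept, not the Download tool's internals):

```python
from urllib.parse import quote

# Percent-escape characters that are not URL-safe:
# a space becomes %20, for example.
print(quote("tampa bay"))  # prints "tampa%20bay"

# With the URL's structural characters marked as safe,
# an already-valid URL passes through unchanged.
print(quote("https://www.zomato.com/tampa-bay/restaurants?page=1", safe=":/?=&"))
```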
Thanks,
Paul Noirel
Customer Support Engineer
Hello Community,
Any suggestions or pointers to fix this?
Thanks in advance.
Best Regards,
Chaithanya
Hi @raochaithanya,
I have played a bit with your workflow. Something seems to prevent the Download tool from loading the web page.
I then tried the following, in case the Run Command tool could be a better option in this particular case:
C:\>curl https://www.zomato.com/tampa-bay/restaurants?page=1
curl: (56) Send failure: Connection was reset
=> It didn't work. This is consistent, as the Download tool uses libcurl in the background.
PS C:\temp> invoke-webrequest -Uri "https://www.zomato.com/tampa-bay/restaurants?page=1"
invoke-webrequest : The underlying connection was closed: An unexpected error occurred on a receive.
At line:1 char:1
+ invoke-webrequest -Uri "https://www.zomato.com/tampa-bay/restaurants? ...
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+ CategoryInfo : InvalidOperation: (System.Net.HttpWebRequest:HttpWebRequest) [Invoke-WebRequest], WebException
+ FullyQualifiedErrorId : WebCmdletWebResponseException,Microsoft.PowerShell.Commands.InvokeWebRequestCommand
C:\>python
Python 3.6.2 |Anaconda, Inc.| (default, Sep 19 2017, 08:03:39) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.request import urlopen
>>> html = urlopen("https://www.zomato.com/tampa-bay/restaurants?page=1")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
[...]
v = self._sslobj.read(len, buffer)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
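One more thing that may be worth trying: some servers reset connections from clients whose request headers look like a script rather than a browser. A hypothetical sketch of sending a browser-like User-Agent with urllib (the header value is an assumption, and whether this site accepts it is untested):

```python
from urllib.request import Request, urlopen

url = "https://www.zomato.com/tampa-bay/restaurants?page=1"

# Attach a browser-like User-Agent header before opening the URL;
# without one, urllib identifies itself as "Python-urllib", which
# some servers reject.
req = Request(url, headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})

# html = urlopen(req).read()  # left commented out: untested against this site
print(req.get_header("User-agent"))
```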
I am sorry, but something on the website seems to block traditional methods of web scraping, or even access altogether.
I have not tried the Windows version of wget, but I am afraid I would obtain similar results.
You mentioned that you had managed to download the page in the past. Maybe they have updated the website or the servers in a way that now causes issues.
If I come up with a solution, I will post it here.
Kind regards,
Paul Noirel
Customer Support Engineer
Hi Paul,
Thanks for the detailed tests.
OK, I had a hunch that was the case as well, since other similar workflows built in the same fashion still extract the data.
Thanks again!
Best Regards,
Chaithanya
Is this solved?
If not, are you behind a firewall, or does your organization have a proxy set up on your LAN connection?
Patrick - That is the case for me! Do you have a solution?
It could be this:
https://community.alteryx.com/t5/Data-Sources/Download-Fail-Proxy-Authentication-issue/m-p/764#M2
Or you may just have to disable the proxy.
If the proxy address is in your machine's connection settings, you should be able to disable it there.
To do that:
Internet Explorer --> Tools (or the gear icon) --> Internet Options --> Connections --> LAN Settings --> clear all settings and uncheck the boxes --> click OK --> click Apply, then OK. Now try again.
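As a quick diagnostic, Python can report the proxy settings it detects on the system, which are read from the same environment (and, on Windows, the registry) that many tools consult. A short sketch; an empty result means no system proxy was found:

```python
import urllib.request

# getproxies() collects proxy settings from environment variables
# (and the Windows registry), returning a dict keyed by scheme,
# e.g. {"http": "http://proxy.example:8080"}.
proxies = urllib.request.getproxies()
print(proxies)  # an empty dict means no system proxy was detected
```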
Hi Patrick,
No, I am not using any proxy or firewall service.
As I mentioned, I had developed 4-5 workflows, all of which used to work.
Now only one of them fails with the above error, and as @PaulN mentioned in his post, I think it might be because the website is now somehow blocking the read requests.
Thanks,
Chaithanya