

Download tool being stumped by website challenge

Ian_Nicholls
5 - Atom

I am trying to download the HTML from a page, find the links to zip files in it, and download those zips. At the moment a person has to do this every couple of weeks by browsing to the page and saving the files to our network.
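For illustration, the "find the links to zips" step could be sketched outside Alteryx roughly as follows. This is a minimal Python sketch using only the standard library, not the workflow described in this post; the names `ZipLinkParser` and `find_zip_links` and the example URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class ZipLinkParser(HTMLParser):
    """Collect absolute URLs of <a href="..."> links that point at .zip files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.zip_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Check the path only, so a query string does not hide the extension.
        if urlparse(absolute).path.lower().endswith(".zip"):
            self.zip_links.append(absolute)


def find_zip_links(html, base_url):
    """Return the zip links found in `html`, in document order."""
    parser = ZipLinkParser(base_url)
    parser.feed(html)
    return parser.zip_links
```

Calling `find_zip_links(page_html, page_url)` on a normally rendered page returns the list of zip URLs to download; on a challenge page it would simply return an empty list.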

 

I already do this successfully for half a dozen other websites, but now I am stuck on a page where, instead of the HTML that a browser renders, I end up with the code for a challenge page. It contains things like 'challenge-error-text' and 'Enable JavaScript and cookies to continue' and does not contain the information I need.
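A simple way to tell the two cases apart automatically is to look for those marker strings in the downloaded body. The following is a minimal Python sketch, assuming the two strings quoted above are reliable markers for this site; the function name and the marker list are illustrative, and other challenge pages may use different text.

```python
# Marker strings quoted from the challenge page described in this post.
CHALLENGE_MARKERS = (
    "challenge-error-text",
    "Enable JavaScript and cookies to continue",
)


def looks_like_challenge_page(body):
    """Return True if the downloaded body contains any known challenge marker."""
    return any(marker in body for marker in CHALLENGE_MARKERS)
```

A check like this lets a scheduled job fail loudly with a clear message instead of silently finding no zip links.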

 

The only header I am using in the download tool is User-Agent

User-Agent: Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36

 

Is there anything more or different I should be using here to get past the challenge page and make this site believe I am using a browser? These are the returned download headers:

[Screenshot: Untitled.png]

 

I am new to this, but that reads as if Cloudflare can tell I am scraping and doesn't want to allow it.

 

(EDIT: I should add that the data is public and the body in question knows I want to be able to scrape their website. They have made an allowance for my IP in their firewalls.)

 

Thanks,

Ian

3 REPLIES
DavidSkaife
14 - Magnetar

Hi @Ian_Nicholls 

 

I'm no expert on this, but from what I've read Cloudflare uses bot detection to protect the website from scraping, and the connection is failing because the page is serving one of those 'prove you're a human' challenges, if I'm not mistaken. Given this, I don't think you're going to solve this using the Download tool. Others with far more knowledge may correct me though.

 

An alternative option is trying Python. There seem to be a few ideas available if you search the web for 'Cloudflare web scraping', but I suspect this would be a LOT of trial and error with no guarantee it would work.
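As a starting point for that kind of experiment, a plain Python request with a few browser-like headers might look like the sketch below. The `Accept` and `Accept-Language` values, the function names, and the timeout are assumptions chosen for illustration, not values known to satisfy this site. A challenge that asks the client to enable JavaScript and cookies generally cannot be passed by headers alone, so this is mainly useful for seeing what the server returns.

```python
import urllib.error
import urllib.request

# Illustrative header set; only the User-Agent comes from the original post.
BROWSER_LIKE_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
}


def build_request(url):
    """Build (but do not send) a GET request carrying the headers above."""
    return urllib.request.Request(url, headers=BROWSER_LIKE_HEADERS)


def fetch_html(url, timeout=30):
    """Return (status_code, body_text); error responses are returned, not raised."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            charset = response.headers.get_content_charset() or "utf-8"
            return response.status, response.read().decode(charset, errors="replace")
    except urllib.error.HTTPError as err:
        # Error statuses still carry a body, which may be the challenge page.
        return err.code, err.read().decode("utf-8", errors="replace")
```

Inspecting the returned status code and body shows quickly whether the server sent the real page or the challenge.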

Ian_Nicholls
5 - Atom

Thanks @DavidSkaife - I suspected that might be the case. Fortunately this isn't anything sketchy, and the people who own the website are trying to amend their bot rules so that my usual approach will work.

 

I really just wanted to know if there was anything more I could do with the header section when I see these sorts of messages. But I guess stopping what I am doing is exactly what Cloudflare is meant to do...

apathetichell
19 - Altair

If this is a serious business need for you, check out: https://www.zenrows.com/blog/selenium-cloudflare-bypass#undetected-chromedriver
