Requirement
Download a csv file from a website that requires a login (and is blocking cross site scripting - crsf).
Solution Attempt 1 - use curl
A simple curl command from DOS seemed to fail due to csrf token requirement. After many hours of searching and trial and error, I wasn't able to build the command myself. I was able to use Chrome+F12+copy as cURL (cmd) ... to get the actual code that works. The code seems to hardcode the csrf token + cookie? + username/password into the cURL command, but I don't really understand it, and the next day it stops working.
From another post (https://community.alteryx.com/t5/Alteryx-Designer-Discussions/Using-Download-Tool-to-get-file-from-s...) @Troy
curl -c cookies.txt -X POST -F username=<Your_UserName> -F password=<Your_Password> https://<URL of login page> - H "Accept: application/json" --next -X GET https://<URL>/internal/export_Report_EstimatedBilling?format=excel --output text.xls
.. doesn't work in my case.
Solution Attempt 2 - use the download tool
Use the Download tool, add username/password, but it gives errors that I don't understand. I suspect it is a similar issue as above. Maybe the csrf token is not being sent.
Any thoughts/approaches on how to do this better?
HTTP/1.1 302 Found
Connection: keep-alive
Server: skilljar
Date: Wed, 15 Jan 2020 15:25:59 GMT
Location: /auth/login?next=/analytics/enrollments.csv%3Fdraw%3D1%26start%3D0%26length%3D25%26skip_total_count%3Dtrue%26order%255B0%255D%255Bcolumn%255D%3D5%26order%255B0%255D%255Bdir%255D%3Ddesc%26registered_at%3Dall%26completed_at%3Dall%26latest_activity_at%3Dall
Vary: Cookie
X-Ua-Compatible: IE=edge
P3p: CP="This is not a P3P policy!"
Pragma: no-cache
Cache-Control: no-cache, no-store, must-revalidate, private
Content-Type: text/html; charset=utf-8
Expires: Wed, 15 Jan 2020 15:25:59 GMT
Set-Cookie: sj_ubid=bb8c7d95-0590-4690-9b92-9aaee94aad49; expires=Tue, 10-Jan-2040 15:25:59 GMT; httponly; Max-Age=630720000; Path=/; secure
Content-Length: 0
Via: 1.1 vegur
HTTP/1.1 302 Found
Connection: keep-alive
Server: skilljar
Date: Wed, 15 Jan 2020 15:25:59 GMT
Content-Type: text/html; charset=utf-8
X-Ua-Compatible: IE=edge
Location: https://dashboard.skilljar.com/login?next=%2Fanalytics%2Fenrollments.csv%3Fdraw%3D1%26start%3D0%26le...
P3p: CP="This is not a P3P policy!"
Expires: Wed, 15 Jan 2020 15:25:59 GMT
Pragma: no-cache
Vary: Cookie
Cache-Control: no-cache, no-store, must-revalidate, private
Set-Cookie: sj_ubid=52247fc9-ff79-46d6-bfbb-830cb27d69fd; expires=Tue, 10-Jan-2040 15:25:59 GMT; httponly; Max-Age=630720000; Path=/; secure
Content-Length: 0
Via: 1.1 vegur
HTTP/1.1 200 OK
Connection: keep-alive
Server: skilljar
Date: Wed, 15 Jan 2020 15:25:59 GMT
Vary: Cookie
X-Frame-Options: SAMEORIGIN
X-Content-Type-Options: nosniff
X-Xss-Protection: 1
X-Ua-Compatible: IE=edge
P3p: CP="This is not a P3P policy!"
Pragma: no-cache
Cache-Control: no-cache, no-store, must-revalidate, private
Content-Type: text/html; charset=utf-8
Content-Security-Policy: frame-ancestors 'self' academy.skilljar.com
Expires: Wed, 15 Jan 2020 15:25:59 GMT
Set-Cookie: csrftoken=X4J5NsP4AehPs6u59tLwTOVXXbvYPlmPeezhN7GF69Y5MSeMjyEZ2z3CEKr2BaTw; expires=Wed, 13-Jan-2021 15:25:59 GMT; Max-Age=31449600; Path=/; secure
Set-Cookie: sj_ubid=6db248e3-241a-49f4-b538-0cb35dc24adf; expires=Tue, 10-Jan-2040 15:25:59 GMT; httponly; Max-Age=630720000; Path=/; secure
Content-Length: 4004
Via: 1.1 vegur
Solved! Go to Solution.
A colleague of mine @klonergan solved it using python.
I will sanitize and post the solution.
Run this via a python tool:
# List all non-standard packages to be imported by your
# script here (only missing packages will be installed)
from ayx import Package
#Package.installPackages(['pandas','numpy'])
from ayx import Alteryx
import requests
import pandas as pd
from io import StringIO
user = 'your.email@address.ca'
password = 'your_password_for_Skill_jar_login'
site_url = 'https://dashboard.skilljar.com/login'
# Read in fileURL
file_url = Alteryx.read("#1")
print(file_url)
#parameters for csv download
params = {'draw':'1',
'start':'0',
'length':'25',
'skip_total_count':'true',
'order[0][column]':'5',
'order[0][dir]':'desc',
'registered_at':'all',
'completed_at':'all',
'latest_activity_at':'all'}
#didn't work without required referer header ...
#https://stackoverflow.com/questions/13567507/passing-csrftoken-with-python-requests
#looked at Chrome inspection to get required field names (email/password1/csrfmiddlewaretoken)
with requests.Session() as session:
session.get(site_url)
csrftoken = session.cookies['csrftoken']
login_data = dict(email=user, password1=password, csrfmiddlewaretoken=csrftoken)
post = session.post(site_url, data=login_data, headers=dict(Referer='https://dashboard.skilljar.com/'))
data = session.get(file_url.FileURL[0],data=login_data, params=params)
#make a pandas dataframe for exporting to Alteryx
df = pd.read_csv(StringIO(data.text))
#columns after 13th are all null/garbage
df = df.iloc[:, : 13]
#Output to Alteryx workflow/anchor 1
Alteryx.write(df,1)