
Engine Works

Under the hood of Alteryx: tips, tricks and how-tos.
AkimasaKajitani

Generally, when you do web scraping with Alteryx Designer, you access a site that does not require login credentials using the Download tool and process the HTML you obtain to achieve your goal. However, when a site requires login credentials, you may end up scraping by RPA, using the Python tool, or giving up on web scraping altogether. This time, I will show you how to scrape a site that does require login credentials and obtain information in the so-called “Base A” style, which uses only the Download tool, not the Python tool.

 

Background Story

 

I use WordPress to write blogs about Alteryx, and as my posts increased, I wanted to create a dashboard about my blog. Half the time I might just look at it and grin in self-satisfaction, but in some cases it can be useful (or at least it makes it easier to discover what Alteryx tips are available). There are several ways to get information about the articles from WordPress, but I decided to start by looking at the data to see what kind of dashboard I could create. The easiest way to get the data is the export feature that WordPress provides by default, which produces an XML file. Parsing XML is a bit tedious in Alteryx Designer, but not too difficult once you get the hang of it. At first, I created the dashboard this way.

 

However, the WordPress export function requires logging into the site, so it is not suitable for automation. The common approach for automation is a REST API, and fortunately WordPress provides one, so I got the information that way directly... By the way, I also write a blog on WordPress at my company. I thought it would be nice to reuse the same dashboard for the company blog, so I immediately tried to get its information via the REST API. However, my company's WordPress has the API function disabled (actually, I disabled it myself for security reasons... I didn't expect to be caught in my own trap). Still, I didn't want to manually download the XML file every time, so I reconsidered this method and tried various things to see whether Alteryx Designer could handle the login process itself.

 

Basic Concept

 

Basically, you use the Download tool to log in and retrieve information. This is the same method as regular web scraping with Alteryx Designer.

 

In fact, Alteryx's Download tool is not that different from libraries and tools that speak HTTP, such as curl and Python requests. I therefore referred to methods for automating logins with curl; unfortunately, the Download tool is not as user-friendly as curl, so there is some logic we must build ourselves in Alteryx Designer.

 

First things to know

 

What is curl?

 

What is curl? Wikipedia describes it as follows:

 

Curl is a command-line tool for getting or sending data including files using URL syntax. Since curl uses libcurl, it supports every protocol libcurl supports.

 

The Download tool in Alteryx also uses URL syntax to send and receive files.

 

 

What happens in the browser and in the Download tool/curl?

 

Usually, when browsing a website, the viewer enters a URL in the browser and clicks buttons to move between web pages or download files. To explain this technically, the browser GETs a specific URL along with header information, or POSTs data, and receives the data you want in return. Alteryx's Download tool and curl can likewise add header information to a specific URL and POST body data. In other words, Alteryx or curl handles the exchange with the website's server on behalf of the web browser.
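To make this concrete, here is a minimal sketch in Python requests (one of the HTTP libraries mentioned above) of those two operations. The URL, header, and form values are placeholders, not ones from this walkthrough:

    import requests

    headers = {"User-Agent": "Mozilla/5.0"}  # header information sent with the request

    # GET: fetch a page. Only the URL and headers are sent; there is no body.
    page = requests.get("https://example.com/login", headers=headers)

    # POST: send form data in the request body, as a login form does.
    form = {"login": "user", "password": "secret"}
    result = requests.post("https://example.com/session", headers=headers, data=form)

    print(page.status_code, result.status_code)

The Download tool performs exactly these two kinds of exchange; it just configures them through the Headers and Payload tabs instead of code.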

 

About the Browser's Developer Tools

 

In this article, I will use the Developer tools built into the web browser to inspect the interaction between the browser and the server (the name varies slightly depending on the browser). First, a brief explanation of these tools.

 

Most widely used browsers (Chrome, Firefox, Edge, etc.) have Developer tools, and they offer almost the same functions. I usually use Chrome, but in this article I will use Firefox. To launch the Developer tools in Firefox, select "Web Developer Tools" under "More tools" from the settings menu.

 

image001.png

 

For our purposes, you will use the "Network" tab.

 

image002.png

 

In the Network tab, you can see the interaction between the web browser and the server. This function is also used for debugging SaaS services. You can also export the log and save it as a HAR file. Note that by default all logs are cleared when you navigate to another page, so be sure to check "Persist Logs" ("Preserve log" in Chrome).

 

The screen layout is as follows:

 

image003.png

 

With the Developer tools displayed, start by accessing the login page of the site you want to get information from. Also, make sure to use a private/incognito tab instead of a regular browser tab (if there is past access information, the browser will reuse old cookies and other stored state, and you will not be able to analyze the website properly).

 

Things to check using Developer tools

 

Before launching Alteryx Designer, you need to use the Developer tools to observe the sequence of interactions between the browser and the server, from logging in to retrieving the information you want. The typical sequence of interactions between the browser and the website up to login is as follows:

 

image004.png

 

Using the information obtained from the Developer tools, you will decide what kind of processing to do in Alteryx Designer. Although the log can be saved as a HAR file in Chrome and other browsers, the received contents cannot be re-checked later in the Developer tools, so you may need to repeat the procedure when implementing in Alteryx Designer. Note that on each subsequent attempt, you will need to close the private/incognito window of your browser and start over from the beginning (start over with no information, such as cookies, remaining from previous interactions).

 

The most important point in this series of steps is the login page information obtained in (2). The key question is whether Alteryx Designer can successfully build the data that the login web form sends, based on the received HTML. Other than that, I think the process is similar for any site.

 

Try it out

 

This time, I will cover logging in to GitHub and retrieving data. However, GitHub has various security options. To simplify the explanation, I will assume that the PC has logged in once before (GitHub requires additional authentication the first time you log in from a device) and that security is set to the default (no two-factor authentication, etc.).

 

Login via Browser

 

First of all, one thing to note: if you redo this process for any reason, you MUST restart your browser. Even if you log in using a private/incognito tab, cookies from your previous visit will remain and mix in unnecessary information, causing the investigation to fail.

 

Let’s get back to the main topic.

 

In this section, the key is to understand the steps the browser goes through to log in.

 

The first step is to open the login URL in a private/incognito tab in your browser. The GitHub login URL is "https://github.com/login". Enter the login URL in a new private/incognito tab with the Developer tools running. The browser will then automatically download all the data needed to display the web page at the login URL. Since we want to see what information is retrieved for the first URL accessed, click the first record in the Developer tools request list (in this case, "login") to inspect it.

 

image005.png

 

What you need to check here is the Headers tab. "Request Headers" shows the headers the browser sent; "Response Headers" shows what was received from the server. Also note that the initial request method is GET: with GET, basically only header information is set, and nothing is set in the body (shown as Payload in Chrome).

 

Here, look at the "Response Headers." There are items called Set-Cookie. The browser uses the values of these Set-Cookie items to build a cookie, sets it in the header, and makes the next request. Make a note of these. There may be multiple Set-Cookie items, as you can see here.

 

image006.png
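As a hedged aside, here is one way to see each Set-Cookie separately with Python requests; these are the same raw headers that the Download tool will later hand you in its DownloadHeaders field:

    import requests

    r = requests.get("https://github.com/login",
                     headers={"User-Agent": "Mozilla/5.0"})

    # urllib3 (underneath requests) keeps repeated headers, so each
    # Set-Cookie can be read individually rather than merged into one.
    for set_cookie in r.raw.headers.getlist("Set-Cookie"):
        # Only the part before the first ';' is the name=value pair the
        # browser sends back; the rest (Path, HttpOnly, ...) is metadata.
        print(set_cookie.split(";")[0])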

 

Next, let's look at the "Response" tab, which shows the actual login page contents you received.

 

image007.png

 

Please note that if you are using Chrome, this content will no longer be shown once you navigate to another page in the browser window on the left. If possible, save the response as an HTML file (you will need to retrieve hidden form items from it later). In Chrome, you can save the response by right-clicking on it.

 

Next, let's log in. Enter your username and password on the web page on the left and press the Sign in button. If your ID/password is correct, you should be able to log in successfully. The following screenshot shows a successful login.

 

image008.png

 

If you look at the Developer tools, the URL of the next document after login is "https://github.com/session". This is easy to find if you filter to HTML/documents only. After that, you end up on a page at github.com even though you did nothing. This is because the session response's status code is 302, meaning a redirect occurred, so you were automatically sent on to github.com.

 

For this request (the URL is session), the request method is POST. In other words, when you submit the login screen in your browser, it POSTs to "https://github.com/session". Now let's look at the Cookie in the request headers. It is in the red box in the screenshot below.

 

image009.png

 

Compare this with the Set-Cookie items in the response headers of the login page. From each Set-Cookie, you need to take only the part that is used as a Cookie at POST time and set it as the Cookie in the request header. The contents of other headers may also be required in some cases, so make a note of them just in case.

 

Next, check the Request (Payload in Chrome). This is the form information that the browser sends to the server, and it must be set in the request body along with the Cookie above when you POST.

 

image010.png

 

This "Request" information must be set as Payload in the body of the request. Here, login and password are of yours, but items such as commit are fixed items in this login form. Other than that, authenticity_token, timestamp, and timestamp_secret are dynamically generated items for each access. So, where are these dynamically generated items? The answer is in the response of the URL when you log in (response in login page). If you search in this, you should be able to find it.

 

If you saved the contents of the response in Notepad or similar, you can find them immediately by searching. In Firefox, you can search within the Developer tools, but in Chrome it seems you cannot view past pages, so if you have not saved the page's HTML, you will need to repeat the procedure. These items (authenticity_token, timestamp, timestamp_secret) are embedded in the form as hidden inputs. Make a note of these as well.

 

image011.png

 

Please note that these items will be completely different depending on the website you want to log in to.

 

That completes the investigation in the Developer tools. Now, let's move to Alteryx Designer.

 

Log in using Alteryx Designer

 

If you are using Alteryx Designer, follow the steps below to log in (a Python sketch of the same flow appears after the list):

  1. Access https://github.com/login
  2. Get the login page contents and cookies
    • Create login payload information from the login page and login information
  3. Log in to https://github.com/session using cookies and payload
  4. Get the contents and cookies when logged in
    • Create the final Cookie
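For readers who want to see the HTTP mechanics in one place, here is a rough Python requests sketch of those same four steps. It is only a sketch: the hidden-field names are the ones observed on GitHub in this article, the value-extraction pattern is an assumption about the HTML layout, and the fixed values (such as commit) should be copied from what you saw in the Developer tools.

    import re
    import requests

    session = requests.Session()                   # a Session stores cookies for us,
    session.headers["User-Agent"] = "Mozilla/5.0"  # which steps 2 and 4 do by hand

    # 1-2. Access the login page; its cookies are captured automatically.
    login_page = session.get("https://github.com/login")

    def hidden_value(name, html):
        # Assumed pattern for the hidden inputs: value="..." right after the name.
        m = re.search(r'name="%s" value="([^"]*)"' % name, html)
        return m.group(1) if m else ""

    payload = {
        "login": "your-username",
        "password": "your-password",
        "commit": "Sign in",                       # fixed item in this login form
        "authenticity_token": hidden_value("authenticity_token", login_page.text),
        "timestamp": hidden_value("timestamp", login_page.text),
        "timestamp_secret": hidden_value("timestamp_secret", login_page.text),
    }

    # 3-4. POST the form; the 302 redirect after login is followed automatically.
    result = session.post("https://github.com/session", data=payload)
    print(result.status_code)                      # 200 after the redirect if login worked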

 

If we write this flow out the way the browser performs it, replacing the terminology with that used in the Download tool, it looks something like this:

 

image012.png

 

The overall workflow to achieve this is as follows:

 

image013.png

 

The workflows for parts 1 and 2 are as follows:

 

image014.png

 

The workflows for parts 3 and 4 are as follows:

 

image015.png

 

1. Access the login page

 

First, access https://github.com/login. In the Download tool, set the "user-agent" on the Headers tab and set the "HTTP action" to "GET" on the Payload tab, as follows.

 

image016.png

 

Enter the user-agent in a Text Input tool (Alteryx Designer does not send a user-agent by default, so you must set it yourself; keep in mind that the same applies when calling APIs, where you may get an error if no user-agent is set).

 

2. Get the login page and cookie information

 

If the access in step 1 succeeds, the Download tool receives the data. If the DownloadHeaders field contains "HTTP/1.1 200 OK" as shown below, the request was successful (some sites may return 302 instead). In any case, if there is a Set-Cookie item in the headers, the step has succeeded.

 

image017.png

 

Create Cookie

 

Regarding cookies, not all the received cookies (Set-Cookie) are necessary. Let's compare the Set-Cookie items in the response headers confirmed in the Developer tools with the Cookie the browser actually set.

 

image018.png

 

In the Set-Cookie items on the left side, the part in the red frame is the part that the browser sends back. To obtain it, you only need to take everything up to the first semicolon, since the attributes within each Set-Cookie are separated by semicolons. As for the Cookie on the right, which the browser sends, some of the data (in the blue frame) turned out to be unnecessary (this is the conclusion I came to after trying it).

 

I won't go into the details of the workflow for creating a Cookie from the Set-Cookie headers, since it can be built with relatively basic tools, but the steps are as follows (a rough Python equivalent is shown after the list):

 

image019.png

 

  1. Use the Text To Columns tool to split the DownloadHeaders field into rows on the line break (\n).
  2. Use the Filter tool to keep only rows that contain Set-Cookie.
  3. Use the Text To Columns tool to split on the colon, then split again on the semicolon.
  4. Use the Data Cleansing tool to trim leading and trailing spaces.
  5. Use the Summarize tool to combine the multiple cookies, using a semicolon (;) as the delimiter when concatenating.
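Here is a rough Python equivalent of those five steps, with download_headers standing in for the DownloadHeaders field:

    def build_cookie(download_headers: str) -> str:
        cookies = []
        for line in download_headers.split("\n"):        # 1. split into rows
            if line.lower().startswith("set-cookie"):    # 2. keep Set-Cookie rows
                value = line.split(":", 1)[1]            # 3. drop the header name,
                value = value.split(";")[0]              #    keep up to the first ';'
                cookies.append(value.strip())            # 4. trim spaces
        return "; ".join(cookies)                        # 5. concatenate with ';'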

 

I've done this same process many times on other websites, so I think it is a good idea to turn it into a macro so it can be reused.

 

Create Payload

 

The content of the payload you should create will vary depending on the site you are logging in to, so be careful when trying it out on other websites.

 

The payload you inspected in the Developer tools has been formatted by the Developer tools to make it easier for humans to read, so you need to put it back into its original form and check that. To see the payload as the browser actually sends it, click the "Raw" toggle in the Developer tools. In Chrome, it's "View source."

 

image020.png

 

Clicking "Raw" will display the following screen. The screenshot below shows the payload that is being sent by the browser.

 

image021.png

 

This is very difficult to read, so try pasting it into a text editor.

 

image022.png

 

If you look closely, you'll see that item names and values are joined with "=" instead of colons, and items are joined with "&". Furthermore, values are URL encoded.

 

The three items "authenticity_token," "timestamp," and "timestamp_secret" must be extracted from the DownloadData field. It is easy to extract each of them with a RegEx tool. For example, to get the "authenticity_token", the setting is as follows.

 

image023.png

 

You can set other items in the same way. All other fixed items are set in the Text Input tool as follows.

 

image024.png

 

Fixed items and dynamic items are combined with the Union tool, the values are URL encoded in a Formula tool, and each name and URL-encoded value are concatenated with "=" (equals).

 

image025.png

 

This will result in values like the following (only some records are shown due to space constraints):

 

image026.png

 

After that, combine these rows using the Summarize tool, setting the separator to "&".

 

image027.png

 

The workflow will be as follows:

 

image028.png

 

This creates the payload and Cookie to send.
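Incidentally, the URL-encode, join-with-"=", and join-with-"&" steps are exactly what Python's urlencode helper does in one call, which may help as a mental model of what the Formula and Summarize tools are building (the values below are placeholders):

    from urllib.parse import urlencode

    items = {
        "commit": "Sign in",
        "login": "your-username",
        "password": "your-password",
        "authenticity_token": "value extracted with the RegEx tool",
        "timestamp": "1234567890",
        "timestamp_secret": "value extracted with the RegEx tool",
    }
    payload = urlencode(items)   # e.g. commit=Sign+in&login=your-username&...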

 

3. Login using cookies and payload

 

It's finally time to log in! Access https://github.com/session and log in using Cookie and Payload.

 

image029.png

 

The login URL can be created with a Formula tool. Then set the URL, Cookie, and Payload in the Download tool and log in.

 

image030.png

 

If the login is successful, you will see "HTTP/1.1 302 Found" or similar in the DownloadHeaders field (if you look through the headers, you should also see HTTP/1.1 200 OK at the bottom). 302 means a redirect, but Alteryx Designer has kindly downloaded the data from after the redirect as well.

 

image031.png

 

4. Get the page contents and cookies at login

 

At the end of the login process, you need to get the Cookie to use after logging in. If you look at the response headers, you will notice that new cookies are sent. The method is as already explained. The workflow is the part in the red box below; if you make it into a macro, you can reuse it.

 

image032.png

 

You should be fine using this cookie for subsequent requests (but be aware that some sites may issue a new Cookie on every page).
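If you want to see those new Set-Cookie items yourself outside of Alteryx, one hedged way is to perform the login POST with automatic redirects disabled, so the 302 response that carries them is not skipped over (session and payload as in the earlier login sketch; note that each attempt needs a freshly extracted authenticity_token):

    r = session.post("https://github.com/session", data=payload,
                     allow_redirects=False)
    print(r.status_code)                                  # expect 302 Found
    for set_cookie in r.raw.headers.getlist("Set-Cookie"):
        print(set_cookie.split(";")[0])                   # the post-login cookies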

 

Congratulations, you have successfully logged in with Alteryx Designer.

 

Get the uploaded files

 

Now that we've successfully logged in, let's get back to the main topic. What we originally wanted was to log in and retrieve information, and it's no exaggeration to say that with the Cookie obtained at login, we can do whatever we want. From this point on, things vary greatly depending on the design of the website, but in the case of GitHub, we can retrieve a file simply by accessing its URL directly (simple websites allow files to be downloaded in a similar way). However, I had to search in the Developer tools to find the download URL. Let's try it.

 

Find the URL of the file you want to download in the Developer tools of your browser.

 

This time, I will use a private repository I created for testing on GitHub. It is very simple and contains only the text "test". The repository name is "testrep". After logging in, click "testrep" on the web page.

 

image033.png

 

Next, click on a file (testfile) in the repository, as shown below.

 

image034.png

 

Now, let's say you want to download the file called "testfile" from the "testrep" repository. When you click "testfile", its contents are shown as follows.

 

image035.png

 

You can confirm the string "test" in the content. Of course, it is possible to obtain the data by scraping the contents of the response, but GitHub also allows you to download files, so let's try that. You can download the file by selecting "Download" from the three-dot menu on the website.

 

image036.png

 

After clicking the Download button, some Chrome users may find that nothing appears in the Network panel. Earlier in this article's procedure, the "Documents" filter in the Network panel showed the exchanges we wanted to see, but here you have to set the filter to "All." Be careful, as what you want to see may not appear under "Documents" (in this case, the request was visible under "Fetch"). Firefox users should also add "XHR" to the filter.

 

When you press the Download button, the browser's file save dialog appears; just save the file as is. In the Developer tools, the last "testfile" entry is the URL for the actual download, as shown in the screenshot below. Just access this URL from Alteryx Designer, and you're good to go (this is the conclusion I reached after trying it out: this URL worked on GitHub, but note that on some sites you may not be able to access such a URL directly).

 

image037.png

 

Try to download a file with Alteryx Designer

 

The download workflow is very simple: it is only two tools.

 

image038.png

 

Once you've logged in, you can use the cookie created at login to access the file (we have already created it). In this case, the URL to access is the file download URL you saw in the Developer tools, "https://github.com/ayx-test/testrep/raw/refs/heads/main/testfile", which you can set using a Formula tool.

 

Then set it in the Download tool: just set the Cookie and user-agent in the Headers tab, and "GET" as the HTTP action in the Payload tab (GET sends only headers, so you only need to get the URL and headers right).

 

image039.png

 

Looking at the results, you can see that the data was actually retrieved; note that the header again shows 302, so a redirect occurred.

 

image040.png

 

In a browser, you choose where to save the file manually, but in Alteryx Designer the data is automatically stored in the DownloadData field. Once you have the data, you can save it as a file or use it however you like.
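Expressed as a standalone Python sketch, the download step mirrors the Download tool configuration above: a GET with the login cookie and user-agent set explicitly in the headers (the cookie string is a placeholder for the one built at login):

    import requests

    login_cookie = "logged_in=yes; user_session=placeholder"  # built as at login

    r = requests.get(
        "https://github.com/ayx-test/testrep/raw/refs/heads/main/testfile",
        headers={"User-Agent": "Mozilla/5.0", "Cookie": login_cookie},
    )
    print(r.status_code)  # requests follows the 302, just like the Download tool
    print(r.text)         # the file contents, which Alteryx stores in DownloadData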

 

Points to note

 

One thing to keep in mind is that this is how data was obtained from GitHub; on other sites you may have to go through different steps, so implement this according to each site. In some cases, you may need to set the values of hidden attributes and POST them in the payload, just as when logging in. For example, if you select a month's data from a pull-down menu and then click a download button, that is probably a POST, so you will need to build the payload.

 

Important Notice

 

There are some points to note about this method.

 

Cannot handle CAPTCHA

 

If CAPTCHA, an authentication mechanism for keeping robots out, is used, it is very difficult to deal with.

 

Techniques that work on one site do not necessarily work on another.

 

Authentication methods and web page design vary greatly depending on the website. The research method I used here is a general-purpose one, but the actual implementation needs to be custom-made for each individual site. In particular, analysis is very difficult for sites that dynamically rewrite their pages, such as Gmail.

 

Don't overload the sites you visit

 

Please refrain from downloading and accessing large amounts of data to avoid putting a strain on the servers you access.

 

Make sure you don't violate the site's terms and conditions

 

If a site explicitly prohibits scraping or automation, you should give up on automating it.

 

Conclusion

 

That's all for this blog. It was a very long article, but what did you think? Logging in to a website using Designer might greatly improve work efficiency.

 

This series of steps may be familiar to those who have accessed an API: in the case of an API, you add header information to the login endpoint (URL), access it, and then use the obtained credentials to access each endpoint (URL). How is this different from what we did this time? The sequence of accessing a URL, obtaining credentials, and connecting to another URL is exactly the same. The big difference is that APIs have manuals; here, there are none. So if you understand the basics of API access, I think you will be able to manage a case like this.

 
