
Alteryx Designer Discussions

Find answers, ask questions, and share expertise about Alteryx Designer.
SOLVED

Web scraping including embedded pdf documents from a website

Meteoroid

Hi all,

 

I am unsure whether this can be done using Alteryx, but perhaps one of you has an ingenious solution.

 

What I want to achieve is to scrape information from the website below into a readable Excel table (i.e. list all announcements in a tabular format).

It does not stop there, though. At the same time, I want Alteryx to download the corresponding PDF files and store them in a certain folder on my laptop. If the file names of these PDF files could be the concatenation of the columns "Date" and "Headline", that would be perfect. However, I would already be very happy if the workflow could automatically extract all the PDF files.

https://www.asx.com.au/asx/statistics/announcements.do?by=asxCode&asxCode=TLS&timeframe=Y&year=2019

 

I have a good idea of how to create the table; however, downloading the corresponding PDF files might be too challenging for Alteryx.

Apologies if this request is too far-fetched... but I've seen some geniuses around here, so perhaps you can astonish me again :).

 

Alteryx
Using the Download tool on the webpage, you can pull all of the href links out of the HTML in the DownloadData field. Then you can feed these links into a second Download tool, which saves the PDF files to a location.
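For anyone who wants to see the same idea outside Alteryx, here is a minimal Python sketch of the first half: collecting every href from downloaded HTML and turning it into an absolute URL. The `LinkCollector` class, the `collect_links` helper and the sample HTML are illustrative assumptions, not the actual ASX markup.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Relative links are resolved against the page URL
                self.links.append(urljoin(self.base_url, value))


def collect_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical snippet standing in for the downloaded page
sample = '<td><a href="/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02169254">Notice</a></td>'
page_url = "https://www.asx.com.au/asx/statistics/announcements.do"
links = collect_links(sample, page_url)
```

Each entry in `links` would then be passed to a second download step.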
Meteoroid

Hi Brandon,

 

Thanks so much for your replies!

I have given it a try, but the result of my download tool does not give me a link to the PDF file. I guess that this webpage is not suited for what I want to achieve.

 

Unless I'm getting it totally wrong?

Nebula

Hi @Liline008 

 

You were close.  Your filter needed to be changed from [DownloadData]="" to [DownloadData]!="".  From there, I added a couple of tools that have worked for me in the past. 

 

[Screenshot of the workflow: w.png]

First split to rows and then find all the lines that contain HREF.  The rest is left as an exercise for the student. 
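As a rough illustration of "split to rows, then keep the HREF lines", here is a small Python sketch. The `rows_with_href` helper and the sample DownloadData content are assumptions made up for the example; the real page is much longer and may be laid out differently.

```python
def rows_with_href(download_data):
    """Split the downloaded HTML into rows and keep those mentioning href."""
    rows = download_data.splitlines()
    # Case-insensitive test, so HREF and href are both caught
    return [row.strip() for row in rows if "href" in row.lower()]


# Hypothetical DownloadData content
download_data = """<tr>
<td>05/11/2019</td>
<td><a href="/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02169254">
Example headline</a></td>
</tr>"""

href_rows = rows_with_href(download_data)
```

An empty DownloadData simply yields an empty list, which is why the filter has to keep the non-empty records.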

 

One thing I did notice is that the HREFs for the PDFs don't actually point to a PDF file, e.g. "https://www.asx.com.au/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02169254".  Each one is actually a link to a license page that you have to navigate before you get to the actual file, "https://www.asx.com.au/asxpdf/20191105/pdf/44b8x1lk4hbb07.pdf".  I haven't come across this kind of situation before and I'm not sure how you'll be able to work around it.  Reaching out to @Claje @jdunkerley79 @MarqueeCrew to see if they've come across similar situations.

 

Dan

 

 

 


Had a quick nosey.

 

Need to do a multistage download:

[Screenshot of the workflow: jdunkerley79_0-1572963246827.png]

 

First, download and pick the links out (I chose to just use a RegEx tokenise to Rows)

Then download each of those pages (which are all accept pages in my case)

Extract the pdfURL from the hidden input

Then download that to a blob

 

You then have all the PDFs - how you process those is a different issue!
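The stages above can be sketched in Python roughly as follows. This is only an illustration under assumptions: the hidden input is taken to be named `pdfURL` (as the post suggests), but the exact markup of the accept page, the `HiddenInputFinder` class and the helper names are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class HiddenInputFinder(HTMLParser):
    """Finds the value of the first <input> whose name matches `field`."""

    def __init__(self, field):
        super().__init__()
        self.field = field
        self.value = None

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "input" and attr.get("name") == self.field and self.value is None:
            self.value = attr.get("value")


def extract_pdf_url(accept_html, page_url, field="pdfURL"):
    """Return the absolute PDF URL from the accept page, or None if absent."""
    finder = HiddenInputFinder(field)
    finder.feed(accept_html)
    finder.close()
    if finder.value is None:
        return None
    return urljoin(page_url, finder.value)


def download_blob(url):
    """Final stage: fetch the PDF bytes (network call, not run in this sketch)."""
    with urlopen(url) as response:
        return response.read()


# Assumed shape of the accept page; the real markup may differ
accept_html = '<form><input type="hidden" name="pdfURL" value="/asxpdf/20191105/pdf/44b8x1lk4hbb07.pdf"></form>'
accept_url = "https://www.asx.com.au/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02169254"
pdf_url = extract_pdf_url(accept_html, accept_url)
```

Calling `download_blob(pdf_url)` would then return the bytes to write to disk; it is left uncalled here because it needs network access.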

 

 

Meteoroid

Hi all,

 

You guys are amazing, the solution works like a charm!

Just for educational purposes, can you please explain why you used href="([^"]+)" in the RegEx tool? (especially the ([^"]+) part)

 

As the cherry on the cake, I was hoping to incorporate the headline text (the part after the link in the example below) in the file names of the PDFs that get extracted, but first I'd need to fully understand your great solution. If I add the description in the split, then I can parse it into a new column and use it in the file names. 

href="/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02165227"> 2020 First Quarter Sales Results

The regex looks for hrefs in the text and then identifies the link contained between the quotation marks that follow.

 

The "([^"]+)" picks out every character up to the next ": [^"] means any character except a double quote, and + means one or more of them. The brackets make it a marked group. The RegEx tool in tokenise mode will pick this marked group out and make it the value.
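To see the pattern in action outside Alteryx, here is a small Python sketch that applies the same regex to the example link quoted in the question. Python's `re.findall` returns only the marked group when a pattern has exactly one, which is comparable to (though not necessarily identical with) the tokenise behaviour described above.

```python
import re

# href="   literal text that starts the attribute
# (        start of the marked (capturing) group
# [^"]+    one or more characters that are NOT a double quote
# )        end of the marked group
# "        the closing quote, matched but not captured
HREF_PATTERN = re.compile(r'href="([^"]+)"')

html = (
    '<a href="/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02165227">'
    " 2020 First Quarter Sales Results</a>"
)

# findall returns only the contents of the marked group
matches = HREF_PATTERN.findall(html)
print(matches)
```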

 

To do the latter request we need to do a little extra work. 

1. Change the Regex to match the whole <a> tag: <a[^>]*href="[^"]+"[^>]*>.*?</a

2. Add a second Regex in Parse mode to pick URL and text description out
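The two steps can be sketched in Python as below. The first pattern is the one given in step 1; the second (parse) pattern is an assumption, since the post does not spell it out. The `pdf_file_name` helper and the date value are also made up for illustration; in the workflow the date would come from the "Date" column.

```python
import re

# Step 1: the tokenise pattern from the post, matching a whole <a> tag
A_TAG = re.compile(r'<a[^>]*href="[^"]+"[^>]*>.*?</a', re.DOTALL)
# Step 2: an assumed parse pattern with two marked groups (URL, description)
A_PARTS = re.compile(r'href="([^"]+)"[^>]*>(.*?)</a', re.DOTALL)


def parse_links(html):
    """Return (url, description) pairs for every <a> tag with an href."""
    pairs = []
    for tag in A_TAG.findall(html):
        match = A_PARTS.search(tag)
        if match:
            url, text = match.groups()
            # Collapse line breaks and repeated spaces in the description
            pairs.append((url, " ".join(text.split())))
    return pairs


def pdf_file_name(date, headline):
    """Build 'Date Headline.pdf', replacing characters unsafe in file names."""
    safe = re.sub(r'[\\/:*?"<>|]+', "-", f"{date} {headline}")
    return safe.strip() + ".pdf"


html = (
    '<a href="/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02165227">'
    " 2020 First Quarter Sales Results</a>"
)
pairs = parse_links(html)
# The date is a made-up placeholder for the value in the Date column
name = pdf_file_name("01/11/2019", pairs[0][1])
```

Replacing characters such as `/` matters here because dates like 01/11/2019 would otherwise be read as folder separators in the file path.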

 

Have attached new version for you to look at

 
