Alteryx Designer Desktop Discussions

Liline008 · ‎11-04-2019

Hi all,

I am unsure whether this can be done using Alteryx, but perhaps one of you has an ingenious solution.

What I want to achieve is to scrape information from the website below into a readable Excel table (i.e. list all announcements in a tabular format).

It does not stop there, though: at the same time I want Alteryx to download the corresponding PDF files and store them in a certain folder on my laptop. If the file names of these PDF files could be the concatenation of the columns "Date" and "Headline", that would be perfection. However, I would already be very happy if the workflow could automatically extract all PDF files.

https://www.asx.com.au/asx/statistics/announcements.do?by=asxCode&asxCode=TLS&timeframe=Y&year=2019

I have a good idea on how to create the table, however, downloading the corresponding PDF files might be too challenging for Alteryx.

Apologies if this request is too far-fetched... but I've seen some geniuses around here, so perhaps you can astonish me again :).

BrandonB · ‎11-04-2019

Using the download tool and the webpage, you can pull all of the href links from the HTML in the download data. Then you can feed these links into another download tool where it downloads the PDF files to a location.

BrandonB · ‎11-04-2019

This link goes over the basics:

https://www.thedataschool.co.uk/nick-jastrzebski/the-dark-arts-of-alteryx-reporting-tools/

Liline008 · ‎11-04-2019

Hi Brandon,

Thanks so much for your replies!

I have given it a try, but the result of my download tool does not give me a link to the PDF file. I guess that this webpage is not suited for what I want to achieve.

Unless I'm getting it totally wrong?

danilang · ‎11-05-2019

Hi @Liline008

You were close. Your filter needed to be changed from [DownloadData]="" to [DownloadData]!="". From there, I added a couple of tools that have worked for me in the past.

First split to rows and then find all the lines that contain HREF. The rest is left as an exercise for the student.
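For readers who think more easily in code, the filter and split steps above can be sketched in Python. This is only an illustration of the logic, not the attached workflow: the function name `href_rows` and the idea of passing the download results in as a list of strings are assumptions made for the example.

```python
def href_rows(download_results):
    """Keep non-empty download results, split them to rows, keep rows with an href."""
    rows = []
    for data in download_results:
        # Equivalent of the filter [DownloadData] != "": skip empty results.
        if not data:
            continue
        # Equivalent of splitting to rows on line breaks.
        for line in data.splitlines():
            # Case-insensitive check, since HTML attributes may be written HREF or href.
            if "href" in line.lower():
                rows.append(line.strip())
    return rows
```

The original filter, [DownloadData]="", corresponds to keeping only the empty results, which is why nothing useful came out of it.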

One thing I did notice is that the HREFs for the PDFs don't actually point to a PDF file, e.g. "https://www.asx.com.au/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02169254". Each one is actually a link to a license page that you'll have to navigate before you get to the actual file "https://www.asx.com.au/asxpdf/20191105/pdf/44b8x1lk4hbb07.pdf". I haven't come across this kind of situation before and I'm not sure how you'll be able to work around it. Reaching out to @Claje, @jdunkerley79, @MarqueeCrew to see if they've come across similar situations.

Dan

jdunkerley79 · ‎11-05-2019

Had a quick nosey.

Need to do a multistage download:

First, download and pick the links out (I chose to just use a RegEx tokenise to Rows)

Then download each of those pages (which are all accept pages in my case)

Extract the pdfURL from the hidden input

Then download that to a blob

You then have all the PDFs - how you process those is a different issue!
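The stages above can be sketched outside Alteryx as well. The Python below is a minimal illustration under several assumptions: the accept page is assumed to carry the PDF address in a tag shaped like `<input name="pdfURL" type="hidden" value="...">`, relative links are assumed to resolve against https://www.asx.com.au, and all function names are hypothetical. The site's markup may differ or may have changed, so treat it as a sketch of the flow rather than a working scraper.

```python
import html
import re
from urllib.parse import urljoin
from urllib.request import urlopen

BASE = "https://www.asx.com.au"  # assumed base for resolving relative links

LINK_RE = re.compile(r'href="([^"]+)"')
INPUT_RE = re.compile(r"<input\b[^>]*>", re.IGNORECASE)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')


def fetch(url):
    """Download a URL and return the raw bytes (the 'blob')."""
    with urlopen(url) as response:
        return response.read()


def announcement_links(listing_html):
    """Stage 1: pick the announcement links out of the listing page."""
    return [
        urljoin(BASE, html.unescape(href))
        for href in LINK_RE.findall(listing_html)
        if "displayAnnouncement.do" in href
    ]


def pdf_url_from_accept_page(page_html):
    """Stage 3: read the pdfURL value from the hidden input, if present."""
    for tag in INPUT_RE.findall(page_html):
        attrs = dict(ATTR_RE.findall(tag))
        if attrs.get("name") == "pdfURL":
            return urljoin(BASE, html.unescape(attrs.get("value", "")))
    return None


def download_pdfs(listing_url, fetcher=fetch):
    """Run all stages and return a dict mapping each PDF URL to its bytes."""
    listing = fetcher(listing_url).decode("utf-8", errors="replace")
    pdfs = {}
    for link in announcement_links(listing):
        # Stage 2: download the accept page behind each link.
        accept_page = fetcher(link).decode("utf-8", errors="replace")
        pdf_url = pdf_url_from_accept_page(accept_page)
        if pdf_url:
            # Stage 4: download the PDF itself as bytes.
            pdfs[pdf_url] = fetcher(pdf_url)
    return pdfs
```

The `fetcher` parameter exists only so the flow can be tried against saved pages without touching the network; writing each blob to disk with `open(path, "wb")` would be the remaining step.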

Liline008 · ‎11-05-2019

Hi all,

You guys are amazing, the solution works like a charm!

Just for educational purposes, can you please explain why you used href="([^"]+)" in the RegEx tool? (especially the ([^"]+) part)

As the cherry on the cake, I was hoping to incorporate the description part of the link (the text after the ">" in the example below) in the file names of the PDFs that get extracted, but first I'd need to fully understand your great solution. If I add the description in the split, then I can parse it into a new column, and then use it in the file names.

href="/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02165227"> 2020 First Quarter Sales Results
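Once the description sits in its own column, turning it into a file name mostly means removing characters that Windows does not allow. A minimal Python sketch of that step follows; the function name `pdf_filename`, the date format, and the 150-character cap are all assumptions made for illustration.

```python
import re

# Characters that are not allowed in Windows file names, plus control characters.
INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def pdf_filename(date, headline, max_len=150):
    """Build '<date> <headline>.pdf' with unsafe characters replaced by '-'."""
    stem = INVALID_CHARS.sub("-", f"{date} {headline}")
    # Collapse runs of whitespace and drop trailing dots/spaces, which Windows rejects.
    stem = " ".join(stem.split()).rstrip(". ")
    return stem[:max_len].rstrip(". ") + ".pdf"
```

For example, `pdf_filename("05/11/2019", "2020 First Quarter Sales Results")` gives `05-11-2019 2020 First Quarter Sales Results.pdf`.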

jdunkerley79 · ‎11-06-2019

The regex is looking for hrefs in the text and then identifying the link contained between the " characters that follow.

The [^"]+ part matches one or more characters that are not a ", so "([^"]+)" picks out everything up to the next ". The brackets make it a marked group. The RegEx tool in tokenise mode will pick this marked group out and make it the value.

To do the latter request we need to do a little extra work.
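To see the marked group in action outside Alteryx, here is a small Python sketch using the same pattern on the sample line from the question. Python's `re.findall` is used as a rough stand-in for tokenise mode, since it also returns only the marked group when the pattern has exactly one; the function name `tokenize_links` is hypothetical.

```python
import re

# Same pattern as in the RegEx tool: capture everything between href=" and the next ".
HREF_RE = re.compile(r'href="([^"]+)"')


def tokenize_links(text):
    """Return the captured link for every href="..." found in the text."""
    return HREF_RE.findall(text)


sample = (
    'href="/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02165227">'
    " 2020 First Quarter Sales Results"
)
print(tokenize_links(sample))
# ['/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02165227']
```

Note that the description after the ">" is not part of the result, because the match stops at the closing ".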

1. Change the Regex to match the whole <a> tag: <a[^>]*href="[^"]+"[^>]*>.*?</a

2. Add a second Regex in Parse mode to pick URL and text description out

Have attached new version for you to look at
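The two-step approach above can be sketched in Python as follows. This is an illustration, not the attached workflow: the second pattern is one guess at what the Parse-mode expression could look like, the `re.DOTALL` flag is added so a description spread over several lines is still matched, and the function name `parse_links` is hypothetical.

```python
import re

# Step 1: match the whole <a> tag (same expression as in the post).
TAG_RE = re.compile(r'<a[^>]*href="[^"]+"[^>]*>.*?</a', re.DOTALL)

# Step 2: two marked groups, one for the URL and one for the description (assumed pattern).
PARSE_RE = re.compile(r'href="([^"]+)"[^>]*>(.*?)</a', re.DOTALL)


def parse_links(html_text):
    """Return (url, description) pairs for every <a> tag found."""
    rows = []
    for tag in TAG_RE.findall(html_text):
        match = PARSE_RE.search(tag)
        if match:
            # Collapse whitespace so the description is usable in a file name.
            description = " ".join(match.group(2).split())
            rows.append((match.group(1), description))
    return rows
```

Each pair then corresponds to one row with a URL column and a description column, which is what the file-naming step needs.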

macca75 · ‎05-25-2020

Is it possible to search the ASX website for "Annual Reports" released by a company and then within the annual reports the term "amended assessment" or "amended assessments"?

@Liline008 @jdunkerley79

(Apologies for reaching out unintroduced, you just seem like the experts! ;))
