Hi all,
I am unsure whether this can be done using Alteryx, but perhaps one of you has an ingenious solution.
What I want to achieve is to scrape information from the website below into a readable Excel table (i.e. list all announcements in a tabular format).
It does not stop there, though. At the same time I want Alteryx to download the corresponding PDF files and store them in a certain folder on my laptop. If the file names of these PDF files could be the concatenation of columns "Date" and "Headline", that would be perfection. However, I would already be very happy if the workflow could automatically extract all PDF files.
https://www.asx.com.au/asx/statistics/announcements.do?by=asxCode&asxCode=TLS&timeframe=Y&year=2019
I have a good idea of how to create the table. However, downloading the corresponding PDF files might be too challenging for Alteryx.
Apologies if this request is too far-fetched... but I've seen some geniuses around here, so perhaps you can astonish me again :).
Hi @Liline008
You were close. Your filter needed to be changed from [DownloadData]="" to [DownloadData]!="". From there, I added a couple of tools that have worked for me in the past.
First split to rows and then find all the lines that contain HREF. The rest is left as an exercise for the student.
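For anyone following along outside Alteryx, here is a rough Python sketch of the same idea: drop empty downloads, split the HTML into rows, and keep only the rows that mention href. The function name `href_lines` and the sample HTML are made up for illustration; the real page will look different.

```python
def href_lines(download_data: str) -> list[str]:
    """Split downloaded HTML into rows and keep the rows that mention href."""
    if not download_data:  # same idea as the [DownloadData] != "" filter
        return []
    return [
        line.strip()
        for line in download_data.splitlines()
        if "href" in line.lower()  # case-insensitive, so HREF matches too
    ]


# Hypothetical sample, not the real ASX markup
sample = (
    "<tr>\n"
    "<td>05/11/2019</td>\n"
    '<a href="/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02169254">\n'
    "</tr>"
)
rows = href_lines(sample)
```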
One thing I did notice is that the HREFs for the PDFs don't actually point to a PDF file, e.g. "https://www.asx.com.au/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02169254". Each one is actually a link to a License page that you'll have to navigate before you get to the actual file "https://www.asx.com.au/asxpdf/20191105/pdf/44b8x1lk4hbb07.pdf". I haven't come across this kind of situation before and I'm not sure how you'll be able to work around it. Reaching out to @Claje, @jdunkerley79, @MarqueeCrew to see if they've come across similar situations.
Dan
Had a quick nosey.
Need to do a multistage download:
First, download and pick the links out (I chose to just use a RegEx tokenise to Rows)
Then download each of those pages (which are all accept pages in my case)
Extract the pdfURL from the hidden input
Then download that to a blob
You then have all the PDFs - how you process those is a different issue!
Hi all,
You guys are amazing, the solution works like a charm!
Just for educational purposes, can you please explain why you used href="([^"]+)" in the RegEx tool? (especially the ([^"]+) part)
As the cherry on the cake I was hoping to incorporate the part in red in the file name of the PDFs that get extracted, but first I'd need to fully understand your great solution. If I add the description in the split, then I can parse it into a new column, and then use it in the file names.
href="/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02165227"> 2020 First Quarter Sales Results |
The regex looks for hrefs in the text and then captures the link contained between the double quotes that follow.
The "([^"]+)" picks out characters until it encounters another ": [^"] means any character except a double quote, and + means one or more of them. The brackets make it a marked group. The RegEx tool in tokenise mode will pick this marked group out and make it the value.
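If it helps to see the pattern in action, here is a small Python illustration using the line quoted in the question. Python's `re.findall` returns only the marked group when the pattern has exactly one, which is close to what tokenise mode does; the variable names are just for the example.

```python
import re

line = (
    'href="/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02165227">'
    " 2020 First Quarter Sales Results |"
)

# href=" is matched literally, ([^"]+) captures everything up to the next quote
pattern = re.compile(r'href="([^"]+)"')
links = pattern.findall(line)
```

Here `links` holds the URL only, without the surrounding `href="` and closing quote, because those sit outside the brackets.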
To do the latter request we need to do a little extra work.
1. Change the Regex to match the whole <a> tag: <a[^>]*href="[^"]+"[^>]*>.*?</a
2. Add a second Regex in Parse mode to pick URL and text description out
I have attached a new version for you to look at.
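As a rough Python equivalent of those two steps (not the attached workflow): one pattern matches the whole tag, a second one parses the URL and the description out of it, and a helper builds a file name from the date and headline. I closed the tag pattern with `</a>`, which I am assuming is what was intended. The helper names, the sample HTML and the date are made up, and the character replacement is just one way to keep the name valid on Windows.

```python
import re

# Step 1: match the whole <a> tag
TAG_RE = re.compile(r'<a[^>]*href="[^"]+"[^>]*>.*?</a>', re.DOTALL)
# Step 2: parse the URL and the link text out of a matched tag
PARSE_RE = re.compile(r'href="([^"]+)"[^>]*>(.*?)</a>', re.DOTALL)


def parse_anchors(html: str) -> list[tuple[str, str]]:
    """Return (url, description) pairs for every <a> tag with an href."""
    pairs = []
    for tag in TAG_RE.findall(html):
        match = PARSE_RE.search(tag)
        if match:
            url = match.group(1)
            # Drop any nested tags and collapse whitespace in the description
            text = " ".join(re.sub(r"<[^>]+>", " ", match.group(2)).split())
            pairs.append((url, text))
    return pairs


def pdf_file_name(date: str, headline: str) -> str:
    """Concatenate date and headline, replacing characters invalid in file names."""
    safe = re.sub(r'[\\/:*?"<>|]+', "-", f"{date} {headline}").strip()
    return safe + ".pdf"


# Hypothetical sample row, not the real ASX markup
sample = (
    '<td><a href="/asx/statistics/displayAnnouncement.do?display=pdf&idsId=02165227">'
    " 2020 First Quarter Sales Results </a></td>"
)
pairs = parse_anchors(sample)
name = pdf_file_name("05/11/2019", pairs[0][1])  # the date is a made-up example
```

The description then sits in its own column next to the URL, so it can be carried through the later downloads and used when writing each PDF.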
Is it possible to search the ASX website for "Annual Reports" released by a company and then within the annual reports the term "amended assessment" or "amended assessments"?
(Apologies for reaching out unintroduced, you just seem like the experts! ;))