Alteryx Designer Desktop Discussions

hellyars · ‎05-13-2021

I am running into some challenges trying to scrape HTML data.

Basically, I want to extract all the field/response pairs shown in the attached page/table image below. (Each hull has its own page/table.)

I am in the process of building an iterative macro to process each URL (page), download the page HTML, and extract the table fields and responses.

There will not be a response to each field. Some fields will be blank.

The attached workflow depicts two ways I was trying to get to the data. The problem is I need to account for the blank responses. (The workflow includes 10 different page downloads.)

(Note: This is all open-source data.)

dougperez · ‎05-13-2021

Does this help? I used a Multi-Row Formula tool (tested with one hull; for more, just group by hull).
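
For readers outside Alteryx, the row-by-row idea can be roughly sketched in Python. This is not the attached workflow, only an illustration under stated assumptions: each hull's table has already been split into one text row per cell, and a field label can be recognized by a trailing ":". The function name `pair_fields` and the sample rows are hypothetical.

```python
def pair_fields(rows):
    """Pair each label row (ending in ':') with the row after it.

    Assumption: a label immediately followed by another label (or by
    nothing) has a blank response, so it is paired with ''.
    """
    pairs = []
    for i, row in enumerate(rows):
        if not row.endswith(":"):
            continue  # response row; already consumed by the label before it
        nxt = rows[i + 1] if i + 1 < len(rows) else ""
        response = "" if nxt.endswith(":") else nxt
        pairs.append((row.rstrip(":"), response))
    return pairs


# Invented sample rows for one hull; "Builder:" has no response.
sample = ["Name:", "Example", "Builder:", "Launched:", "1990"]
result = pair_fields(sample)
```

Looking one row ahead is what keeps a blank response from shifting every later value up by one field.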

dougperez · ‎05-13-2021

I was looking at my example and found a problem: the headers hahahah

Now I think it's more accurate.

Try filtering those headers out some other way (I used a Filter tool and typed them in by hand, assuming they are standardized).

hellyars · ‎05-13-2021

@dougperez Yep. That nails it. Nice approach. I will have to remember this one. I made a slight edit. I added a formula tool with 3 regex_replace expressions to add a ":" after Years since Launch, Years since Delivery, and Years from Commission and (with your solution) everything snapped into place. It just worked with the first 10 entries. I am going to try it against the first few hundred. THANKS!
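
A rough Python equivalent of that colon fix, for anyone following along outside Alteryx. The three labels come from the post above; the function name `add_missing_colon` and the sample string are made up for illustration, and the exact expressions used in the Formula tool may differ.

```python
import re

# Labels named in the post above.
LABELS = ["Years since Launch", "Years since Delivery", "Years from Commission"]


def add_missing_colon(text):
    """Append ':' to each listed label that is not already followed by one."""
    for label in LABELS:
        # The negative lookahead leaves labels that already have a ':' untouched.
        text = re.sub(rf"{re.escape(label)}(?!\s*:)", f"{label}:", text)
    return text


# Invented sample value.
fixed = add_missing_colon("Years since Launch 12")
```

Because the lookahead skips labels that already carry a colon, running it twice on the same text changes nothing.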

Download, Parse, & Account for Empty Fields