
SOLVED

URL Regex Parse

Watermark
12 - Quasar

OK, I give up. I've spent hours on this and I'm clearly missing something.

 

I'll upload the file. I'm using regex, and I WANT to achieve a one-tool regex solution, if for no other reason than to solve this with it. I have three different streams because I was trying, saving, and copying different things at the same time, not because I wanted multiple breaks or solutions.

 

V1 is what I want the output to actually look like. It gets every URL right, but I get nulls in the first 2 rows.

V2 is an attempt to get more granular, part by part. Again everything is correct, but I still get nulls in the first 2 rows.

V3 solves the nulls, but further down in the data it no longer behaves correctly: for any URL with a "." later in the address, the whole URL is jammed into the base web address.

 

Notes: V1 parses the columns I actually want. The data set for working this out is only 90 records, some of which I've intentionally manipulated to cover different 'exception' possibilities (not just .com, but .whatever). Parsing out the 'base' URL is my ultimate goal.

 


4 REPLIES
randreag
11 - Bolide

Hi @Watermark,

I just put an asterisk after the last (/).

I hope it helps.

Watermark
12 - Quasar

Thanks, but that doesn't solve it.

 

If you look at rows 20/23/24/79/81 ... it stuffs the entire URL into the RegExOut3 column. This is exactly the problem: that column should hold only the base URL, not all the extraneous stuff after .com. As I mentioned, this happens when there is a period "." somewhere further along in the URL, beyond the one in .com. It takes all the /data.etc and puts it in the base URL column (RegExOut3). If you look at V1, it handles the URLs correctly but leaves the nulls up top.

randreag
11 - Bolide

Aah, OK, I get it now.

Try this one: (https*:\/+)*(www\.*)*([-.a-zA-Z]+\.[a-zA-Z]+)(.*)
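
For anyone who wants to check the expression outside Designer, here is a minimal sketch in Python. It assumes Python's `re` module treats this pattern the same way the Alteryx RegEx tool does for these inputs. The sample URLs and the `parse_url` helper are made up for illustration and are not from the attached workflow. The four capture groups are the protocol, the optional www prefix, the base URL, and everything after it.

```python
import re

# Pattern copied verbatim from the reply above.
# Assumption: Python's `re` behaves like the Alteryx RegEx tool here.
PATTERN = re.compile(r"(https*:\/+)*(www\.*)*([-.a-zA-Z]+\.[a-zA-Z]+)(.*)")


def parse_url(url):
    """Return (protocol, www, base, rest), or None if nothing matches.

    Optional groups that did not take part in the match come back as ''
    (Python reports them as None; Alteryx would show them as null/empty).
    """
    m = PATTERN.match(url)
    if m is None:
        return None
    return tuple(g or "" for g in m.groups())


# Hypothetical sample URLs, including one with a "." after the domain.
samples = [
    "https://www.example.com/data.csv",
    "http://shop.example.net/a/b.html",
    "example.org",
]

for url in samples:
    print(url, "->", parse_url(url))
```

The third group stops at the first character outside `[-.a-zA-Z]`, normally the "/" after the domain, so a later "." in the path ends up in the fourth group instead of the base URL. Note that the character class has no digits, so a host name containing numbers may not split the way you expect; adding `0-9` to the class is one possible adjustment.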

 

 

 

Watermark
12 - Quasar

Thx!!  That was it. 
