OK, I give up. I've spent hours on this and I'm clearly missing something.
I'll upload the file. I'm using regex (and WANT a one-tool regex solution, if for no other reason than to solve it that way). There are three different streams because I was trying, saving, and copying different things at the same time, not because I wanted multiple breaks or solutions.
V1 is what I want it to actually look like, and it parses them all correctly, but I get nulls in the first 2 rows.
V2 is an attempt to get more granular, part by part. Again all correct, but I still get nulls in the first 2 rows.
V3 solves the nulls, but further down in the data it no longer behaves correctly: any URL with an extra "." in it is now jammed whole into the base web address.
Notes: V1 parses into the columns I actually want. The data set for working this out is only 90 records, some of which I've intentionally manipulated to provide different 'exception' possibilities (not just .com, but .whatever). Parsing out the 'base' URL is my ultimate goal.
Thanks, but that doesn't solve it.
If you look at rows 20/23/24/79/81 ..., it stuffs the entire URL into the RegExOut3 column. This is exactly the problem: that column should hold only the base URL, not all the extraneous stuff after .com. As I mentioned, this happens when there is a period "." somewhere further along in the URL, beyond the "." in .com. It takes all the /data.etc and puts it in the base URL column (RegExOut3). If you look at V1, it handles the URLs correctly but leaves the nulls up top.
aah ok
I get it now
Try this one: (https*:\/+)*(www\.*)*([-.a-zA-Z]+\.[a-zA-Z]+)(.*)
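For anyone who wants to check the pattern outside the workflow, here is a minimal Python sketch of how its four capture groups split a URL. The pattern is copied as posted. The sample URLs and the `split_url` helper are made up for illustration, and Python's `re` module may not behave identically to the regex engine in the original tool, so treat this as a rough check rather than a guarantee.

```python
import re

# Pattern copied verbatim from the reply above.
# Group 1: optional scheme, group 2: optional "www.",
# group 3: base address, group 4: everything after it.
PATTERN = re.compile(r"(https*:\/+)*(www\.*)*([-.a-zA-Z]+\.[a-zA-Z]+)(.*)")


def split_url(url):
    """Hypothetical helper: return (scheme, www, base, rest), or None if no match."""
    m = PATTERN.match(url)
    return m.groups() if m else None


# Made-up sample URLs, including one with a "." after the domain.
samples = [
    "https://www.example.com/data.etc/page",
    "http://sub.example.co.uk/path",
    "example.org",
]

for url in samples:
    print(url, "->", split_url(url))
```

The third group stops at the first character outside `[-.a-zA-Z]`, usually the first "/", so a later "." in the path no longer leaks into the base address. Note that the character class has no digits, so a host name containing numbers would probably need `0-9` added to it.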
Thx!! That was it.