Alteryx Designer Desktop Discussions

SOLVED

RegEx Parse

bryan_ram2613
8 - Asteroid

Hello all,

 

I am fairly new to Regex. I downloaded some data via web scraping, but parsing it has given me some issues.

 

I have the following lines:

 

<th scope="col width="10">&nbsp;</th><th scope="col">DX</th><th scope="col">MDC</th><th scope="col">MS-DRG</th></tr>

 

<tr><td>A000</td><td align=center>06</td><td>371-373</td>

 

I would love to see them parsed like so. Any help would be much appreciated as well as a quick explanation for learning purposes!

 

DX      MDC     MS-DRG
A000    06      371-373

 

Thanks community!

afv2688
16 - Nebula

Hello @bryan_ram2613 ,

 

This was a tricky one. This is the regex rule I used:

 

(.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*-.*\w)(</\w\w>)

 

And the explanation is:

 

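(.*\w\w\W?>) matches everything up to two consecutive word characters, an optional symbol (that may or may not be there, that's why there is the "?"), and a closing ">". In practice it consumes the opening tags.

(.*) then takes what is between that and the next part.

(</\w\w.*\w\w\W?>) is basically the same as the first part, except that this time it has to start with "</" and two word characters. In practice it consumes a closing tag plus the opening tag that follows it.

 

(.*-.*\w) is the last value. Here I added that it has to have a dash somewhere in between and end with a word character. The final (</\w\w>) matches the closing tag after it.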
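
To see what each of the seven groups ends up holding, here is a minimal sketch that runs the same pattern through Python's `re` module. This is an assumption for illustration only: the Alteryx RegEx tool uses its own engine, but the pattern relies only on common constructs (`.*`, `\w`, `\W?`), so the grouping should behave the same way. The names `PATTERN` and `data_row` are made up for the example.

```python
import re

# The pattern from the reply above, unchanged.
PATTERN = re.compile(
    r'(.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*-.*\w)(</\w\w>)'
)

# The data row from the original question.
data_row = '<tr><td>A000</td><td align=center>06</td><td>371-373</td>'

match = PATTERN.search(data_row)
groups = match.groups()

# Groups 1, 3, 5 and 7 swallow the tags; groups 2, 4 and 6 hold the values.
for number, text in enumerate(groups, start=1):
    print(number, repr(text))

values = (match.group(2), match.group(4), match.group(6))
print(values)  # ('A000', '06', '371-373')
```

In the RegEx tool's Parse mode, each group becomes its own output column, so the tag-only columns (1, 3, 5, 7) can simply be deselected afterwards.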

 

Cheers

 

 

danilang
19 - Altair

hi @bryan_ram2613 

 

I didn't think you could do this all within Regex, but I stand corrected. Good job @afv2688, you got everything and are just missing a dynamic rename. As an alternative, a combination of Regex and other tools also gets you what you're looking for.

 

[Image: WF.png]

 

Start out by splitting out the header and data rows.  Both of the following Regex tools have the same format

 

col">(.*?)<

 

The text outside the parentheses (col"> and <) delimits what you're looking for. In the header line, this regex finds any text (.*?) that comes between col"> and <. The .* matches any run of characters, and the question mark means don't be greedy. Without the question mark, the match would run on to the last < in the line, since < is itself "any character", and we don't want that. The data row uses a similar format: <td.*?>(.*?)<. The Regex tools are configured to split to rows.
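
The lazy-versus-greedy difference is easy to check outside Alteryx. The following is a rough sketch using Python's `re.findall` as a stand-in for the RegEx tool in split-to-rows mode (that equivalence is an assumption, and the variable names are invented for the example):

```python
import re

header_line = ('<th scope="col width="10">&nbsp;</th><th scope="col">DX</th>'
               '<th scope="col">MDC</th><th scope="col">MS-DRG</th></tr>')
data_line = '<tr><td>A000</td><td align=center>06</td><td>371-373</td>'

# Lazy (.*?): stop at the first "<" after each col">, one result per column.
lazy_headers = re.findall(r'col">(.*?)<', header_line)
print(lazy_headers)   # ['DX', 'MDC', 'MS-DRG']

# Greedy (.*): run on to the last "<" in the line, so everything is one result.
greedy_headers = re.findall(r'col">(.*)<', header_line)
print(greedy_headers)

# Same idea for the data row; "<td.*?>" also skips attributes such as align=center.
data_cells = re.findall(r'<td.*?>(.*?)<', data_line)
print(data_cells)     # ['A000', '06', '371-373']
```

The first header cell (`&nbsp;`) is skipped because its tag does not contain the exact text `col">`.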

 

Once we have the header and data split, we remove the null values that may show up and generate a ColID so we can match the columns up with their respective rows.  We then cross tab to give you

 

[Image: Results.png]

 

Note that I added in an extra data row.
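
For anyone who wants to follow the null-removal, ColID and cross-tab steps without the workflow open, this is a minimal Python sketch of the same idea. It is not the Alteryx workflow itself: the helper `non_empty`, the variable names, and the choice of the second data row (taken from later in this thread) are assumptions made for illustration.

```python
import re

header_line = ('<th scope="col width="10">&nbsp;</th><th scope="col">DX</th>'
               '<th scope="col">MDC</th><th scope="col">MS-DRG</th></tr>')
data_lines = [
    '<tr><td>A000</td><td align=center>06</td><td>371-373</td>',
    '<td></td><td>A480</td><td align="center">15</td><td>793</td>',
]

def non_empty(values):
    """Drop the empty results that an empty <td></td> cell produces."""
    return [value for value in values if value]

# Split the header into one entry per column.
headers = non_empty(re.findall(r'col">(.*?)<', header_line))

table = []
for line in data_lines:
    cells = non_empty(re.findall(r'<td.*?>(.*?)<', line))
    # col_id plays the role of the ColID: the position of a cell in its row,
    # used to line the cell up with the header at the same position.
    row = {headers[col_id]: cell for col_id, cell in enumerate(cells)}
    table.append(row)

for row in table:
    print(row)
```

Each dictionary corresponds to one row of the cross-tabbed output, keyed by the header names.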

 

 

Dan

bryan_ram2613
8 - Asteroid

@afv2688 Thank you so much for sharing! This is working for me, however I found that some of the lines are different, as shown below.

 

<td></td><td>A480</td><td align="center">15</td><td>793</td>

<td></td><td>B000</td><td align="center">15</td><td>793</td></tr>

 

If you check the highlighted section, I believe that is what is throwing the error. I took out (.*-.*) for those specific rows, but now it is not parsing the entire line.

afv2688
16 - Nebula

Hello @bryan_ram2613 ,

 

Try it out now like this:

 

(.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*-?.*\w)(</\w\w>)

 

This should solve the error you are talking about.
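
The only change from the first pattern is `-?`, which makes the dash optional, so values such as 793 are accepted alongside 371-373. A quick way to compare the two patterns is the sketch below, which assumes Python's `re` module behaves like the Alteryx RegEx tool for this pattern; `ORIGINAL`, `REVISED`, and `parse_row` are names invented for the example.

```python
import re

ORIGINAL = re.compile(
    r'(.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*-.*\w)(</\w\w>)'
)
REVISED = re.compile(
    r'(.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*-?.*\w)(</\w\w>)'
)

rows = [
    '<tr><td>A000</td><td align=center>06</td><td>371-373</td>',
    '<td></td><td>A480</td><td align="center">15</td><td>793</td>',
    '<td></td><td>B000</td><td align="center">15</td><td>793</td></tr>',
]

def parse_row(pattern, line):
    """Return the three value groups (2, 4, 6), or None if the line does not match."""
    match = pattern.search(line)
    if match is None:
        return None
    return (match.group(2), match.group(4), match.group(6))

original_results = [parse_row(ORIGINAL, line) for line in rows]
revised_results = [parse_row(REVISED, line) for line in rows]

print(original_results)  # rows without a dash come back as None
print(revised_results)   # every row yields DX, MDC and MS-DRG
```

With the original pattern the two new rows fail to match because they contain no dash at all; with the revised pattern all three rows parse.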

 

Cheers

 

PS: Please consider marking the discussion as solved if you think I solved your problem. It helps the community.
