Alteryx Designer Desktop Discussions

bryan_ram2613 · ‎06-06-2019

Hello all,

I am fairly new to Regex. I downloaded some data via web scraping, but parsing it has given me some issues.

I have the following lines

I would love to see them parsed like so. Any help would be much appreciated as well as a quick explanation for learning purposes!

DX	MDC	MS-DRG
A000	06	371-373

Thanks community!

afv2688 · ‎06-06-2019

Hello @bryan_ram2613 ,

This was a tricky one. This is the regex rule I used:

(.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*-.*\w)(</\w\w>)

And the explanation is:

(.*\w\w\W?>) matches everything up to two consecutive word characters (letters, digits or underscore), followed by a symbol that may or may not be there (that's why there is the "?") and then ">".

(.*) then takes whatever is between that and this:

(</\w\w.*\w\w\W?>) which is basically the same as the first part, except that this time it has to start with "</" and two letters.

(.*-.*) inside the last value group adds the requirement that there is a dash in there somewhere.
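For anyone who wants to try the rule outside Alteryx, here is a minimal Python sketch. The sample row is an assumption (the scraped HTML is not shown in the thread), and `parse_row` is a hypothetical helper name. Python's `re` module treats `\w`, `\W` and greedy `.*` the same way for this pattern, so the values end up in groups 2, 4 and 6.

```python
import re

# Pattern copied from the post above.
PATTERN = re.compile(
    r'(.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*-.*\w)(</\w\w>)'
)

# Hypothetical scraped row (an assumption; the real HTML is not shown in the thread).
SAMPLE_ROW = '<tr><td>A000</td><td>06</td><td>371-373</td></tr>'


def parse_row(line):
    """Return the (DX, MDC, MS-DRG) values, or None if the line does not match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    # Groups 2, 4 and 6 hold the cell values; the other groups hold the tags around them.
    return match.group(2), match.group(4), match.group(6)


print(parse_row(SAMPLE_ROW))
```

If the real rows carry attributes or extra nesting, the groups may land differently, so it is worth checking a few lines by hand.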

Cheers

danilang · ‎06-06-2019

hi @bryan_ram2613

~~I don't think you can do this all within Regex, but~~ (I stand corrected, good job @afv2688. You got everything, just missing a dynamic rename.) With a combination of Regex and other tools you can get what you're looking for.

Start out by splitting out the header and data rows. Both of the following Regex tools have the same format

col">(.*?)<

The literal text col"> and < delimits what you're looking for. In the header line this regex finds any text (.*?) that comes between col"> and <. The .* matches any run of characters and the question mark means don't be greedy. Without the question mark, .* would run on to the last < in the line, so the tags in between would be included in the matched text, and we don't want that. The data row uses a similar format: <td.*?>(.*?)<. The Regex tools are configured to split to rows.
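To see the greedy versus non-greedy difference outside Alteryx, here is a small Python sketch. Both sample lines are assumptions about what the scraped HTML might look like, and `re.findall` only roughly stands in for the Regex tool splitting to rows.

```python
import re

# Hypothetical header and data lines (assumptions; the real HTML is not shown in the thread).
HEADER_ROW = '<tr><th scope="col">DX</th><th scope="col">MDC</th><th scope="col">MS-DRG</th></tr>'
DATA_ROW = '<tr><td>A000</td><td>06</td><td>371-373</td></tr>'

# Non-greedy: each match stops at the first "<" after the delimiter.
headers = re.findall(r'col">(.*?)<', HEADER_ROW)
values = re.findall(r'<td.*?>(.*?)<', DATA_ROW)

# Greedy, for comparison: a single match that runs on to the last "<" in the line.
greedy = re.findall(r'col">(.*)<', HEADER_ROW)

print(headers)
print(values)
print(greedy)
```

With these sample lines the non-greedy patterns give one token per cell, while the greedy one returns a single token with the tags still inside it.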

Once we have the header and data split, we remove any null values that show up and generate a ColID so we can match the columns up with their respective rows. We then cross tab to give you

Note that I added in an extra data row.
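The remove-nulls, ColID and cross tab steps can be sketched in plain Python as well. This is only an illustration of the idea, not the workflow itself: `cross_tab` is a hypothetical helper, and the token lists (including the second data row) are made up for the example.

```python
def cross_tab(header_tokens, data_rows):
    """Pair each data token with its header by column position (the ColID)."""
    # Remove null or empty tokens, as the workflow does after splitting to rows.
    headers = [h for h in header_tokens if h]
    # ColID -> header name, numbered from 1.
    header_by_id = dict(enumerate(headers, start=1))

    table = []
    for tokens in data_rows:
        cells = [t for t in tokens if t]
        record = {
            header_by_id[col_id]: cell
            for col_id, cell in enumerate(cells, start=1)
            if col_id in header_by_id
        }
        table.append(record)
    return table


# Hypothetical tokens (assumptions), including stray null and empty values.
header_tokens = ['DX', 'MDC', 'MS-DRG', None]
data_rows = [
    ['A000', '06', '371-373'],
    ['', 'A001', '06', '371-373'],
]

for record in cross_tab(header_tokens, data_rows):
    print(record)
```

One caveat with this sketch: dropping empty tokens shifts the remaining cells left, so it only lines up when the empties are not real, blank cells.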

Dan

bryan_ram2613 · ‎06-06-2019

@afv2688 Thank you so much for sharing! This is working for me. However, I found that some of the lines are different, as shown below.

If you check the highlighted section, I believe that is what is throwing the error. I took out (.*-.*) for those specific rows, but now it is not parsing the entire line.

afv2688 · ‎06-07-2019

Hello @bryan_ram2613 ,

Try it out now like this:

(.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*-?.*\w)(</\w\w>)

Should solve the error you are talking about.
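A quick way to compare the two rules is a short Python check. It assumes the problem rows are ones whose last value has no dash; the thread does not show them, so both sample rows here are hypothetical, as is the `cell_values` helper.

```python
import re

# First rule (dash required) and revised rule (dash optional), copied from the posts above.
ORIGINAL = re.compile(
    r'(.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*-.*\w)(</\w\w>)'
)
REVISED = re.compile(
    r'(.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*-?.*\w)(</\w\w>)'
)

# Hypothetical rows (assumptions): one with a dash in the last value, one without.
WITH_DASH = '<tr><td>A000</td><td>06</td><td>371-373</td></tr>'
WITHOUT_DASH = '<tr><td>A001</td><td>06</td><td>371</td></tr>'


def cell_values(pattern, line):
    """Return groups 2, 4 and 6 (the cell values), or None if there is no match."""
    match = pattern.search(line)
    return None if match is None else (match.group(2), match.group(4), match.group(6))


print(cell_values(ORIGINAL, WITHOUT_DASH))  # the first rule needs a dash
print(cell_values(REVISED, WITHOUT_DASH))
print(cell_values(REVISED, WITH_DASH))
```

Making the dash optional with "-?" lets one rule cover both kinds of row, instead of using a separate rule for rows without a dash.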

Cheers

P.S. Please consider marking the discussion as solved if you think I solved your problem. It helps the community.
