Hello all,
I am fairly new to Regex. I have data that I downloaded via web scraping; however, parsing it has given me some issues.
I have the following lines:
<th scope="col width="10"> </th><th scope="col">DX</th><th scope="col">MDC</th><th scope="col">MS-DRG</th></tr>
<tr><td>A000</td><td align=center>06</td><td>371-373</td>
I would love to see them parsed like so. Any help would be much appreciated as well as a quick explanation for learning purposes!
DX | MDC | MS-DRG |
A000 | 06 | 371-373 |
Thanks community!
Hello @bryan_ram2613 ,
This was a tricky one. This is the regex rule I used:
(.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*-.*\w)(</\w\w>)
And the explanation is:
(.*\w\w\W?>) matches everything up to two consecutive word characters (letters, digits or underscore), an optional symbol (it may or may not be there, hence the "?"), and a closing ">".
(.*) then captures whatever sits between that and the next part.
(</\w\w.*\w\w\W?>) is basically the same as the first part, except that this time it has to start with "</" and two word characters.
(.*-.*\w) requires the last value to have a dash in it somewhere.
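For anyone who wants to try the rule outside of the Regex tool, here is a minimal Python sketch of the same pattern run against the sample data row. Python's `re` module is only a stand-in here (an assumption, not what the workflow uses), and picking groups 2, 4 and 6 as the cell values is my reading of the explanation above.

```python
import re

# Pattern copied from the post; groups 1, 3, 5 and 7 swallow the tags,
# groups 2, 4 and 6 are the values between them.
PATTERN = re.compile(
    r'(.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*-.*\w)(</\w\w>)'
)

row = '<tr><td>A000</td><td align=center>06</td><td>371-373</td>'

match = PATTERN.search(row)
# Keep only the value groups; an empty list means the row did not match.
cells = [match.group(2), match.group(4), match.group(6)] if match else []

print(" | ".join(cells))  # A000 | 06 | 371-373
```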
Cheers
I didn't think you could do this all within Regex, but I stand corrected. Good job @afv2688, you got everything, just missing a dynamic rename. With a combination of Regex and other tools you can also get what you're looking for.
Start out by splitting out the header and data rows. Both of the following Regex tools use the same format:
col">(.*?)<
The literal text col"> and < is what delimits the value you're looking for. In the header line this regex captures any text (.*?) that comes between col"> and <. .* matches any run of characters, and the question mark means don't be greedy. Without the question mark, the match would run on to the last < in the line and swallow the closing tags and later cells along with it, which we don't want. The data row uses a similar format: <td.*?>(.*?)<. Both Regex tools are configured to split to rows.
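A rough Python equivalent of the two split-to-rows steps, in case it helps to see the patterns run. `re.findall` is standing in for the Regex tool here (my assumption, not part of the workflow), and the `greedy` variable is only there to show what happens without the question mark.

```python
import re

header = ('<th scope="col width="10"> </th><th scope="col">DX</th>'
          '<th scope="col">MDC</th><th scope="col">MS-DRG</th></tr>')
data = '<tr><td>A000</td><td align=center>06</td><td>371-373</td>'

# Non-greedy capture: stop at the first "<" after each delimiter.
header_cells = re.findall(r'col">(.*?)<', header)
data_cells = re.findall(r'<td.*?>(.*?)<', data)

# Same header pattern without the "?": one long match that runs to the last "<".
greedy = re.findall(r'col">(.*)<', header)

print(header_cells)  # ['DX', 'MDC', 'MS-DRG']
print(data_cells)    # ['A000', '06', '371-373']
print(greedy)
```

Note that the first header cell never matches because its attribute reads col width="10"> rather than col">, so only the three named columns come through.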
Once we have the header and data split, we remove the null values that may show up and generate a ColID so we can match the columns up with their respective rows. We then cross tab to give you the layout you asked for.
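The clean-up, ColID and cross tab steps could be sketched in Python roughly like this. The input lists and the `with_col_ids` helper are hypothetical stand-ins for what the workflow passes between tools, with empty strings playing the role of the nulls, and the second data row is borrowed from later in the thread purely for illustration.

```python
# Hypothetical output of the two split-to-rows steps; "" stands in for a null.
header_cells = ["", "DX", "MDC", "MS-DRG"]
data_rows = [
    ["A000", "06", "371-373", ""],
    ["", "A480", "15", "793"],
]

def with_col_ids(cells):
    """Drop empty cells, then number the survivors from 1 (the ColID)."""
    kept = [cell for cell in cells if cell]
    return {col_id: value for col_id, value in enumerate(kept, start=1)}

header_by_id = with_col_ids(header_cells)

# Cross tab: one record per data row, keyed by the header sharing its ColID.
table = [
    {header_by_id[col_id]: value for col_id, value in with_col_ids(row).items()}
    for row in data_rows
]

print(" | ".join(header_by_id.values()))
for record in table:
    print(" | ".join(record[name] for name in header_by_id.values()))
```

Matching on position like this only works when every row has the same number of non-empty cells as the header, which is the same assumption the ColID join makes.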
Note that I added in an extra data row.
Dan
@afv2688 Thank you so much for sharing! This is working for me; however, I found that some of the lines are different, as shown below.
<td></td><td>A480</td><td align="center">15</td><td>793</td>
<td></td><td>B000</td><td align="center">15</td><td>793</td></tr>
If you check the highlighted section, I believe that is what is throwing the error. I took out (.*-.*) for those specific rows, but now it is not parsing the entire line.
Hello @bryan_ram2613 ,
Try it out now like this:
(.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*-?.*\w)(</\w\w>)
Should solve the error you are talking about.
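A quick way to check the change is to run both versions against the rows from this thread. This is only a sketch using Python's `re` module as a stand-in for the Regex tool; `parse` is a hypothetical helper that keeps groups 2, 4 and 6.

```python
import re

TAGS = r'(.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)(.*)(</\w\w.*\w\w\W?>)'
OLD = re.compile(TAGS + r'(.*-.*\w)(</\w\w>)')   # dash required
NEW = re.compile(TAGS + r'(.*-?.*\w)(</\w\w>)')  # dash optional

rows = [
    '<tr><td>A000</td><td align=center>06</td><td>371-373</td>',
    '<td></td><td>A480</td><td align="center">15</td><td>793</td>',
    '<td></td><td>B000</td><td align="center">15</td><td>793</td></tr>',
]

def parse(pattern, line):
    """Return the three cell values, or None if the line does not match."""
    m = pattern.search(line)
    return [m.group(2), m.group(4), m.group(6)] if m else None

parsed = [parse(NEW, line) for line in rows]
for cells in parsed:
    print(" | ".join(cells))

# The original rule finds nothing on a row without a dash.
print(parse(OLD, rows[1]))  # None
```

The leading empty <td></td> in the new rows is absorbed by the first group, so the values still land in groups 2, 4 and 6.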
Cheers
PS: Please consider marking the discussion as solved if you think I solved your problem. It helps the community.