My target data is text that falls between opening and closing paragraph tags. For 95% of the records, the data is contained in one row. The data for the remaining records is split across 3 rows, with the first row being the opening <p, the second containing the target text, and the third row being the closing </p>.
I tried the following expression in a Multi-Row tool, but it fails. A second Multi-Row tool would then carry the RecordID of the starting <p forward to the start of the next <p. The third step would be to use a Summarize tool to concatenate everything back into a single line, where it can then be processed using an existing macro.
if StartsWith([DownloadData],"^<p.*?>") &&
StartsWith([Row+2:DownloadData], "^<\/p>") then [RecordID] else "" endif
A few important notes. The target data is found in a larger HTML file. There are other rows that start with <p, but only the target rows follow the pattern of row 1 = <p, row 2 = target text, row 3 = </p>.
HTML | RecordID | Desired Group
<p style="text-align: center;"> | 821 | 1
<strong>AIR FORCE</strong><br /> | 822 | 1
</p> | 823 | 1
<p> | 824 | 2
Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. | 825 | 2
</p> | 826 | 2
Hi @hellyars
Workflow is attached.
Your logic was on the money, but you're using regular expressions in the StartsWith function, which doesn't seem to be supported. StartsWith appears to do a case-insensitive literal character match rather than a pattern match. Changing these StartsWith calls to regex_match() and changing the 'else ""' branch to 'else null()' to preserve the column's data type should return the desired output:
if regex_match([HTML], '<p.*>')
AND regex_match([Row+2:HTML], '<\/p>') then [RecordID]
elseif not regex_match([HTML], '<p.*>')
AND not isnull([Row-1:Grouping field #2]) then [Row-1:Grouping field #2]
else null()
endif
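For anyone who wants to trace the logic outside of Designer, here is a rough Python sketch of what the formula above does row by row, run against the sample rows from the question. It is only an approximation: `assign_groups` is a hypothetical helper name, `re.fullmatch` is assumed to stand in for regex_match() matching the whole value, and a missing Row+2 is assumed to count as "no match".

```python
import re

# Sample rows from the question: (HTML, RecordID)
rows = [
    ('<p style="text-align: center;">', 821),
    ('<strong>AIR FORCE</strong><br />', 822),
    ('</p>', 823),
    ('<p>', 824),
    ('Ut enim ad minim veniam, quis nostrud exercitation ullamco '
     'laboris nisi ut aliquip ex ea commodo consequat.', 825),
    ('</p>', 826),
]

OPEN_TAG = re.compile(r'<p.*>')   # same pattern as the first regex_match
CLOSE_TAG = re.compile(r'</p>')   # same pattern as the second regex_match


def assign_groups(rows):
    """Hypothetical helper mirroring the Multi-Row formula's three branches."""
    groups = []
    for i, (html, record_id) in enumerate(rows):
        # [Row+2:HTML]; assumed to be "no match" when it runs past the data
        two_ahead = rows[i + 2][0] if i + 2 < len(rows) else None
        # [Row-1:Grouping field]; None stands in for null on the first row
        previous = groups[-1] if groups else None
        is_open = OPEN_TAG.fullmatch(html) is not None

        if is_open and two_ahead is not None and CLOSE_TAG.fullmatch(two_ahead):
            groups.append(record_id)   # start of a three-row block
        elif not is_open and previous is not None:
            groups.append(previous)    # carry the block's RecordID forward
        else:
            groups.append(None)        # equivalent of null()
    return groups


groups = assign_groups(rows)
print(groups)  # [821, 821, 821, 824, 824, 824]
```

One thing the sketch makes visible: rows that come after a closing </p> and are not themselves an opening <p keep inheriting the previous group, so in a larger file some extra filtering may still be needed.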
If you were super keen to achieve the same result with the StartsWith function, the formula below provides the same output:
if startswith([HTML], '<p')
AND startswith([Row+2:HTML], '</p') then [RecordID]
elseif not startswith([HTML], '<p')
AND not isnull([Row-1:Grouping field]) then [Row-1:Grouping field]
else null()
endif
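Once every row carries its group's RecordID, the remaining step from the question is the Summarize-style concatenation back into one line per group. A minimal Python sketch of that step is below, assuming the grouped rows come out as (group, HTML) pairs with no separator between pieces; `concat_by_group` is a hypothetical name, not a Designer function.

```python
# (Grouping field, HTML) pairs, as the formula would produce for the sample rows
grouped_rows = [
    (821, '<p style="text-align: center;">'),
    (821, '<strong>AIR FORCE</strong><br />'),
    (821, '</p>'),
    (824, '<p>'),
    (824, 'Ut enim ad minim veniam, quis nostrud exercitation ullamco '
          'laboris nisi ut aliquip ex ea commodo consequat.'),
    (824, '</p>'),
]


def concat_by_group(grouped_rows):
    """Hypothetical helper: join each group's rows into one string, in row order."""
    combined = {}
    for group, html in grouped_rows:
        if group is None:
            continue  # rows with a null group are not part of any block
        combined[group] = combined.get(group, '') + html
    return combined


combined = concat_by_group(grouped_rows)
for group, html in combined.items():
    print(group, html)
```

Each group then holds a single `<p ...>...</p>` string, which is the one-row shape the existing macro already expects.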
Hope this helps!
@lmorrell Thank you for the assistance and explanation.