Hi
I have a table which contains WebID and HTMLDocument as columns.
I want to extract every full img tag from each HTML document.
Sample Data
WebID | HTMLDoc |
1 | <!DOCTYPE html> <html> <body> <h2>HTML Image</h2> <img src="img_girl.jpg" alt="Girl in a jacket" width="500" height="600"> <img src="image" alt="image" width="500" height="600"> </body> </html> |
2 | <!DOCTYPE html> <html> <body> <h2>HTML Image</h2> <img src="img_girl.jpg" alt="Girl in a jacket" width="500" height="600"> <img src="image" alt="image" width="500" height="600"> </body> </html> |
Expected Output
WebID | Img |
1 | <img src="img_girl.jpg" alt="Girl in a jacket" width="500" height="600"> |
1 | <img src="image" alt="image" width="500" height="600"> |
2 | <img src="img_girl.jpg" alt="Girl in a jacket" width="500" height="600"> |
2 | <img src="image" alt="image" width="500" height="600"> |
Note: Each HTML document in a row can have one or more images.
@tjamal1
Hope this is what you need.
Hello @tjamal1
You can also do this using a single RegEx tool configured to Split to Rows.
The regular expression matches all the tags that start with "<img" and end with ">".
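For anyone who wants to see the same idea outside of a workflow, here is a minimal Python sketch. The pattern `<img\b[^>]*>` is my assumption based on the description above ("starts with <img, ends with >"), not necessarily the exact expression used in the workflow, and the function name `split_img_tags_to_rows` is made up for illustration.

```python
import re

# Assumed pattern: "<img", then any run of characters other than ">", then ">".
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def split_img_tags_to_rows(rows):
    """Return one (web_id, img_tag) pair per <img> tag found in each document."""
    result = []
    for web_id, html_doc in rows:
        for tag in IMG_TAG.findall(html_doc):
            result.append((web_id, tag))
    return result


# Sample data copied from the question.
doc = (
    '<!DOCTYPE html> <html> <body> <h2>HTML Image</h2> '
    '<img src="img_girl.jpg" alt="Girl in a jacket" width="500" height="600"> '
    '<img src="image" alt="image" width="500" height="600"> </body> </html>'
)
sample = [(1, doc), (2, doc)]

for web_id, tag in split_img_tags_to_rows(sample):
    print(f"{web_id} | {tag} |")
```

Because the match stops at the first ">", other tags on the same line are left out. It would cut a tag short if an attribute value itself contained ">", so treat it as a sketch rather than a full HTML parser.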
Dan
@Qiu Thanks for the solution.
This also works when the full image tag and the other tags are on separate lines in the HTML document.
Since some of my documents have para tags and other tags together with the img tag, it extracts those other tags too.
@danilang Thank you for the workflow.
This works perfectly in my case. Can I create another column for the images instead of writing the tokenized image tags to the same column?