Alteryx Designer Desktop Discussions

hellyars · ‎12-02-2020

I am trying to parse Indeed company reviews. The attached TXT file captures all the data found in one <script> tag. It appears to be JSON and it appears to include all the data for each review. Two questions:

1. How can you parse into individual reviews?

2. How can you parse each individual review?

Background: I have successfully downloaded page HTML. I can isolate the <script> tag and "JSON" where the information is trapped. The "JSON" duplicates HTML for each review, but it also contains additional information that is highly useful (such as the SubRatings). I can parse the HTML chunk, but it would be more valuable to extract all the info from the JSON.

PhilippK · ‎12-03-2020

Hi @hellyars ,

please find a workflow attached that reads in the whole file properly.

The parsing of the content looks like a pain, but tools like TEXT TO COLUMNS or REGEX are going to help you.

In addition, you can use the DOWNLOAD tool to automate the HTML download step at the beginning. The downloaded data might be more convenient for parsing.

Best regards

Phil

hellyars · ‎12-03-2020

@PhilippK I know. That is how I got to this point. I use the Download tool to get the HTML. I can use regex and a few other techniques to pull the basic review out of the HTML. The same process isolated this section of script that contains the full review. Splitting with \0 does not break the lines up naturally. I am trying to find a way to split along the natural breaks of the JSON, which I assume align with individual reviews.
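
For what it is worth, here is a rough sketch of splitting along the JSON's own structure instead of a delimiter, written in Python (for example for the Python tool). It assumes the script text really does contain one valid JSON object, and that the reviews sit in a list somewhere inside it. The key names "items" and "subRatings" and the sample text are made up for illustration; the real names would have to be read from the attached file.

```python
import json

def extract_json(script_text):
    """Decode the first JSON object found in the text, ignoring any JavaScript around it."""
    decoder = json.JSONDecoder()
    start = script_text.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(script_text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = script_text.find("{", start + 1)
    raise ValueError("no JSON object found")

def find_records(node, key):
    """Recursively yield each dict found in any list stored under `key`."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key and isinstance(v, list):
                yield from (item for item in v if isinstance(item, dict))
            else:
                yield from find_records(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_records(item, key)

def flatten(record, prefix=""):
    """Flatten nested dicts into one row with dotted column names."""
    row = {}
    for k, v in record.items():
        name = prefix + k
        if isinstance(v, dict):
            row.update(flatten(v, name + "."))
        else:
            row[name] = v
    return row

# Hypothetical sample standing in for the real <script> content.
sample = ('window._initialData = {"reviewsList": {"items": ['
          '{"title": "Good", "subRatings": {"culture": 4}}, '
          '{"title": "OK", "subRatings": {"culture": 3}}]}};')

data = extract_json(sample)
rows = [flatten(r) for r in find_records(data, "items")]  # one row per review
```

Here `find_records` covers question 1 (one record per review) and `flatten` covers question 2 (one column per field, with nested values such as the sub-ratings becoming dotted column names). If the embedded text is not strictly valid JSON, the decode step will fail and the text would need cleaning first.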

Extract Indeed Job Reviews from Script/JSON