Alteryx Designer Desktop Discussions

Extract Indeed Job Reviews from Script/JSON

hellyars
13 - Pulsar

I am trying to parse Indeed company reviews. The attached TXT file contains everything found in one <script> tag. It appears to be JSON, and it appears to include all the data for each review. Two questions:

1. How can I split the data into individual reviews?

2. How can I parse each individual review?

 

Background: I have successfully downloaded the page HTML. I can isolate the <script> tag and the "JSON" where the information is trapped. The "JSON" duplicates the HTML for each review, but it also contains additional, highly useful information (such as the SubRatings). I can parse the HTML chunk, but it would be more valuable to extract everything from the JSON.
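For illustration, here is a minimal Python sketch of the isolation step described above: pulling the first JSON object out of the raw text of a <script> tag. It assumes the script has the common form of a JavaScript assignment followed by a JSON literal. The variable name `window._initialData` and the keys in the sample are hypothetical, not taken from the attached file.

```python
import json


def extract_first_json_object(script_text):
    """Return the first parseable JSON object embedded in script_text."""
    decoder = json.JSONDecoder()
    start = script_text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores whatever trails it (for example a closing ";").
            obj, _end = decoder.raw_decode(script_text, start)
            return obj
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = script_text.find("{", start + 1)
    raise ValueError("no JSON object found in script text")


# Hypothetical sample; the real script content will differ.
sample = 'window._initialData = {"reviews": [{"id": 1, "text": "Good"}, {"id": 2, "text": "Bad"}]};'
data = extract_first_json_object(sample)
print(len(data["reviews"]))
```

If the text really is valid JSON, a proper JSON parser handles quoting and nesting that delimiter-based splitting tends to break on.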

PhilippK
Alteryx Alumni (Retired)

Hi @hellyars ,

 

Please find attached a workflow that reads in the whole file properly.

Parsing the content looks like a pain, but tools like Text To Columns or RegEx should help.

 

In addition, you can use the Download tool to automate the HTML download step at the beginning. The downloaded data might be more convenient to parse.

 

Best regards

Phil

hellyars
13 - Pulsar

@PhilippK I know. That is how I got to this point. I use the Download tool to get the HTML, and I can use regex and a few other techniques to pull the basic review out of the HTML. The same process isolated this section of script, which contains the full review. Splitting with \0 does not break the lines up naturally. I am trying to find a way to split along the natural breaks in the JSON, which I assume align with individual reviews.
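To make the idea of "natural breaks" concrete, here is a rough Python sketch, assuming the script content has already been parsed into nested dicts and lists. It walks the structure, reports every list whose items are all objects (a likely candidate for the list of reviews), and flattens each item into one row. The key names in the sample (`reviewsList`, `items`, `subRatings`) are made up for illustration and may not match the real data.

```python
def find_record_lists(node, path=""):
    """Yield (path, list) for every non-empty list whose items are all dicts."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from find_record_lists(value, f"{path}/{key}")
    elif isinstance(node, list):
        if node and all(isinstance(item, dict) for item in node):
            yield path, node
        for index, item in enumerate(node):
            yield from find_record_lists(item, f"{path}/{index}")


def flatten(record, prefix=""):
    """Flatten nested dicts into a single level with dotted column names."""
    row = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            row.update(flatten(value, f"{name}."))
        else:
            row[name] = value
    return row


# Hypothetical structure; the real key names will differ.
sample = {"reviewsList": {"items": [
    {"title": "Great", "rating": 5, "subRatings": {"culture": 4, "pay": 5}},
    {"title": "OK", "rating": 3, "subRatings": {"culture": 3, "pay": 2}},
]}}

candidates = list(find_record_lists(sample))
path, reviews = candidates[0]
rows = [flatten(review) for review in reviews]
print(path, len(rows))
```

Each element of the candidate list corresponds to one review (question 1), and each flattened row holds that review's fields, including nested ones such as the sub-ratings (question 2). Inside Designer, the JSON Parse tool may serve a similar purpose if the text is valid JSON.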
