Parsing a .txt Transcript to Separate Speakers Within the Text

johnnyt
6 - Meteoroid

Hi All,

 

Fairly new to Alteryx and need some help parsing a .txt file to separate what each speaker says into a new .txt file.

 

For example, I am looking to separate everything the CEO says into its own .txt file. It can be one continuous string if necessary. All of the files share the following format:

  • The list of speakers precedes the text of the actual call conversation.
  • Each speaker's position is formatted as "Name - Position".
  • The first speaker is always the operator.
  • When a new speaker speaks, the format is their full name (without position), followed by a new line with what they say.
    • Ex. Steven Humphreys 
      • "Ipsum lipsum"

The workflow needs to be flexible enough to work off of the position of the speaker and not their name.

 

I am not sure if there is a way to have Alteryx store the names of the speakers listed between the "Executives" cell and the "Operator" cell, and then check each line against that list to find lines that contain only a speaker name. For example, the logic would look something like: Check for Steven Humphreys -> If found, store all lines of text following Steven Humphreys' name until another speaker is found -> If another speaker is found, stop at [row-1] -> Continue until Steven Humphreys is found again.

 

Executives
Steven Humphreys - CEO
Sandra Wallach - CFO
Analysts
Mike Latimore - Northland Capital Markets
Operator

 

I've attached a workflow that was created using the sample.txt file but it isn't flexible enough to work with the input_sample files. There are a couple of thousand text files I need to parse and using Alteryx would make my life so much easier. I appreciate all the help! 🙂

DavidP
17 - Castor

This workflow mostly does what you want, but there are some issues. I used the data from input sample 2.

 

It extracts the list of speakers/panelists from the first few rows and strips the titles. It then left joins those names back into the original data set and uses a Multi-Row Formula to fill in the gaps.
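
For anyone who wants to prototype the same idea outside Alteryx, here is a rough Python sketch of the logic described above: read the roster, strip the titles, then tag each line with the most recent speaker (the "fill in the gaps" step). It is not the attached workflow. The function name `parse_transcript` and the sample text are made up for illustration, and it assumes the roster ends at a line containing only "Operator" and that speaker lines match the roster names exactly.

```python
def parse_transcript(lines):
    """Return (roster, segments).

    roster maps speaker name -> position ("Name - Position" lines).
    segments is an ordered list of (speaker, text) tuples.
    Assumes the roster ends at the first line that is exactly "Operator".
    """
    roster = {}
    body_start = 0
    for i, raw in enumerate(lines):
        line = raw.strip()
        if line == "Operator":
            body_start = i  # the call itself starts with the operator
            break
        if " - " in line:
            # Split on the first " - " only, so positions may contain dashes.
            name, position = line.split(" - ", 1)
            roster[name.strip()] = position.strip()

    speakers = set(roster) | {"Operator"}
    segments = []
    for raw in lines[body_start:]:
        line = raw.strip()
        if not line:
            continue
        if line in speakers:
            # A line holding only a known name starts a new segment.
            segments.append((line, []))
        elif segments:
            # Any other line belongs to the most recent speaker.
            segments[-1][1].append(line)
    return roster, [(speaker, " ".join(text)) for speaker, text in segments]


# Hypothetical sample in the format described in the question.
sample = """Executives
Steven Humphreys - CEO
Sandra Wallach - CFO
Analysts
Mike Latimore - Northland Capital Markets
Operator
Welcome to the call.
Steven Humphreys
Thanks, everyone.
Ipsum lipsum.
Mike Latimore
A question on margins.
Steven Humphreys
Happy to answer.
""".splitlines()

roster, segments = parse_transcript(sample)

# Select by position rather than by name, as the question requires.
ceo_names = {name for name, position in roster.items() if position == "CEO"}
ceo_text = " ".join(text for speaker, text in segments if speaker in ceo_names)
print(ceo_text)
```

Looking speakers up through the roster is what keeps this independent of any one person's name: the same code picks out whoever holds the "CEO" position in each file.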

 

Here are the issues - you can investigate them by looking at the Browse tool.

 

1. It does not have the names of the people asking questions, so it can't match them.

2. If there is even a slight spelling difference from the names at the top, they are not picked up, as you can see.

 

The workflow does get most of it right and writes the output to individual text files.

 

I'm not a fuzzy matcher (more of a black and white kind of guy), but perhaps you can play around with fuzzy matching to see if you can overcome issue 2.
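
As a rough illustration of what fuzzy matching could do for issue 2, here is a small Python sketch using the standard library's `difflib.get_close_matches`. This is not the Alteryx Fuzzy Match tool, just the same idea in miniature. The helper name `match_speaker`, the 0.85 cutoff, and the five-word limit are assumptions picked for the example and would need tuning against real transcripts.

```python
from difflib import get_close_matches


def match_speaker(line, known_names, cutoff=0.85):
    """Return the roster name that closely matches `line`, or None.

    The cutoff and the word-count guard are illustrative guesses,
    not tuned values. Matching is case-sensitive.
    """
    candidate = line.strip()
    # Speaker lines are short; skip long lines so ordinary sentences
    # are never mistaken for a name.
    if not candidate or len(candidate.split()) > 5:
        return None
    hits = get_close_matches(candidate, known_names, n=1, cutoff=cutoff)
    return hits[0] if hits else None


names = ["Steven Humphreys", "Sandra Wallach", "Mike Latimore"]
print(match_speaker("Steven Humphrey", names))    # slight misspelling
print(match_speaker("Thanks, everyone.", names))  # ordinary text
```

A high cutoff keeps false positives down, at the cost of missing heavier misspellings, so it is worth checking the unmatched lines by eye on a few files first.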

 

[Attached screenshot: DavidP_0-1584400388012.png]

 

johnnyt
6 - Meteoroid

Hi David,

 

Thanks for the solution! I am actually having issues running the workflow with another similar file. 

 

I edited the workflow a little to try to get it to match, since I received some errors. The output file is just Operator: it aggregates all the text but does not divide the text by speaker in the output file.

 

[Attached screenshot: johnnyt_0-1584838900843.png]

Can I get your thoughts on what is going wrong?

DavidP
17 - Castor

Here's an updated version with your new file. I modified it to load the file with an Input Data tool and also changed the output format to CSV with the delimiter set to \0, which works better than the flat ASCII choice. I also added a Formula tool that builds the file path and adds the .txt extension, and modified the Output Data tool to change the entire file path based on that field.
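
For comparison, the "one output file per speaker" step could look roughly like this in Python. It is only a sketch of the same idea, not what the workflow does internally. The function name `write_speaker_files` and the `(speaker, text)` input shape are assumptions, and it uses the speaker name directly as the file name, which only works if names contain no characters that are invalid in file names.

```python
from collections import defaultdict
from pathlib import Path


def write_speaker_files(segments, out_dir):
    """Write one <speaker>.txt per speaker and return the paths written.

    `segments` is an ordered list of (speaker, text) tuples; this input
    shape is an assumption for the example.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Gather everything each speaker said, keeping the original order.
    combined = defaultdict(list)
    for speaker, text in segments:
        combined[speaker].append(text)

    written = []
    for speaker, parts in combined.items():
        # File path = output folder + speaker name + .txt extension.
        path = out_dir / f"{speaker}.txt"
        path.write_text("\n".join(parts), encoding="utf-8")
        written.append(path)
    return written
```

Passing `"."` as `out_dir` would write next to wherever the script is run from, similar in spirit to using the workflow's own folder.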

 

The path I chose is just the current path that the workflow file is saved in.

 

Let me know if you have any further issues.

johnnyt
6 - Meteoroid

Sorry for the late acceptance but this was a great framework for me to work off of! Much appreciated.
