Alteryx Designer Desktop Discussions

Workflow Optimisation - Searching Transcripts for Keywords

CHarrison
8 - Asteroid

Hi there!

 

I've built a workflow which searches transcripts for certain keywords (I've attached a watered-down example to show my process).

The workflow works perfectly; however, we've reached almost a million transcripts now, and there are probably more keywords we'd like to search for going forward.

 

Does anyone have any suggestions on how to optimise/improve this workflow so it won't take a massive amount of processing power/time to run?

 

Thanks in advance :)

6 REPLIES
IraWatt
17 - Castor

Hey @CHarrison,

Very interesting problem; it would make a great weekly challenge!

IraWatt_0-1653566501755.png

I think Find and Replace is more efficient, as you don't append every row to every other row. Also, I think your initial approach was miscounting for transcript ID 1: by my count it has 5 occurrences of "ut", not 4. Likewise, "Diams" should be 3, not 2.

IraWatt_2-1653566860136.png

 

IraWatt_1-1653566569579.png

I think the punctuation throws it off. You could use the Data Cleansing tool to remove punctuation before processing to solve this, though.
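To illustrate the punctuation point outside Alteryx, here is a minimal Python sketch. The function name `count_keyword` and the sample sentence are made up for illustration and are not taken from the attached workflow; the sketch only shows how a word glued to a comma or full stop fails an exact match until punctuation is stripped.

```python
import string


def count_keyword(text: str, keyword: str, strip_punctuation: bool) -> int:
    """Count whitespace-delimited words equal to `keyword`, ignoring case."""
    if strip_punctuation:
        # Delete every ASCII punctuation character before splitting into words.
        text = text.translate(str.maketrans("", "", string.punctuation))
    target = keyword.lower()
    return sum(1 for word in text.lower().split() if word == target)


# Hypothetical sample text, not from the real transcripts.
sample = "Ut enim, ut. Aliquam ut!"

raw_count = count_keyword(sample, "ut", strip_punctuation=False)
clean_count = count_keyword(sample, "ut", strip_punctuation=True)

print(raw_count, clean_count)  # 1 3
```

One caveat with this kind of cleaning: stripping punctuation also turns "don't" into "dont", so keywords containing apostrophes or hyphens would need the same treatment.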

 

Any questions, issues or adjustments please ask :)
HTH!
Ira

MarqueeCrew
20 - Arcturus

@CHarrison ,

 

Here's yet another approach. While I agree that the Find & Replace tool is an option, you might consider the JOIN tool. I've got AMP turned on for this workflow (@TonyaS and @jarrod as well as @NicoleJ will be proud of me), and millions of records won't be an issue for you. Essentially, I use a Select to remove all unnecessary data and then use the RegEx tool to TOKENIZE (turn each word into its own record). We are then able to count all occurrences of each word that matches one of the keywords (the match is exact, hence the LOWERCASE function is used). If you don't set the case, your results will vary.

 

Now you can create metrics for matched keywords. I counted matches (ignoring unmatched words) and gave the unique count of transcripts plus the total occurrences for each keyword. For each transcript I also have counts for each unique keyword.
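For readers who want to see the tokenize-and-join logic in code form, here is a rough Python sketch of the same idea. It is not the Alteryx workflow itself: the function name `keyword_metrics`, the `\w+` tokenising pattern, and the sample transcripts are all assumptions made for illustration.

```python
import re
from collections import Counter


def keyword_metrics(transcripts: dict[int, str], keywords: list[str]):
    """Tokenise each transcript, keep tokens that match a keyword, and count."""
    keyword_set = {k.lower() for k in keywords}  # lowercase so matching is exact
    per_transcript: dict[int, Counter] = {}
    for transcript_id, text in transcripts.items():
        tokens = re.findall(r"\w+", text.lower())  # one token per word
        matched = Counter(t for t in tokens if t in keyword_set)  # the "join"
        if matched:
            per_transcript[transcript_id] = matched

    total_occurrences = Counter()
    transcripts_with_keyword = Counter()
    for counts in per_transcript.values():
        total_occurrences.update(counts)  # adds the per-transcript counts
        transcripts_with_keyword.update(counts.keys())  # +1 per transcript
    return per_transcript, total_occurrences, transcripts_with_keyword


# Hypothetical sample data, not from the real transcripts.
transcripts = {
    1: "Ut enim ad minim, ut labore.",
    2: "Lorem ipsum dolor.",
    3: "Enim UT dolor",
}
per_transcript, totals, unique_transcripts = keyword_metrics(transcripts, ["ut", "dolor"])

print(totals["ut"], unique_transcripts["ut"])  # 3 2
```

The membership test against a set plays the role of the join on the keyword list, so adding more keywords costs very little; most of the work is in tokenising the text once.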

 

 

capture.png

This is a case where I think that optimisation is in the eye of the beholder. Clarity of the workflow and ease of updates are important. I think that you might also want to SUMMARIZE each of the tokenized words and count their usage. Word counts from the transcripts might also have value for you.

 

Cheers,

 

Mark

Alteryx ACE & Top Community Contributor

Chaos reigns within. Repent, reflect and restart. Order shall return.
Please Subscribe to my youTube channel.
IraWatt
17 - Castor

Ah, tokenize each word in the transcript and join! Awesome solution @MarqueeCrew. I don't want to admit it, but it's possibly a tad better than mine XD Though I do wonder how much tokenizing every word in the transcript would add to the computational cost? I don't know if there is a good way to do a performance analysis in Alteryx. It could be a good idea suggestion if there isn't one.

MarqueeCrew
20 - Arcturus

@IraWatt ,

 

In the runtime settings you can look at performance tuning. But if you AMP the workflow, it runs so fast that I'd be surprised if the tuning amounts to anything. I think in this case you can afford the "cost" of tokenizing the transcripts in exchange for the simplification of the workflow. I also think that this approach gives better metrics.

 

Cheers,

 

Mark

IraWatt
17 - Castor

Hey @CHarrison,

I think this is the most efficient and possibly the simplest way. The regex generated from the keywords lets you tokenise just the keywords. You could just put this in a macro and you're sorted!

IraWatt_0-1653605624228.png

The generated regex makes sure that each match is the full keyword only (adjacent punctuation is allowed)!
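As a rough illustration of building one regex from the keyword list, here is a minimal Python sketch. The helper name `build_keyword_pattern`, the use of `\b` word boundaries, and the sample text are assumptions; the expression generated inside the workflow may well differ.

```python
import re
from collections import Counter


def build_keyword_pattern(keywords: list[str]) -> re.Pattern:
    """Build a single case-insensitive regex that matches whole keywords only."""
    # Longest first, so a multi-word keyword is tried before a shorter
    # keyword that it starts with.
    escaped = sorted((re.escape(k) for k in keywords), key=len, reverse=True)
    # \b on both sides rejects partial words but allows adjacent punctuation.
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


# Hypothetical keywords and text, not from the real transcripts.
pattern = build_keyword_pattern(["ut", "diam"])
text = "Ut enim, ut. Aliquam diam! Utah diameter."

counts = Counter(match.lower() for match in pattern.findall(text))

print(counts["ut"], counts["diam"])  # 2 1
```

Only the keywords become tokens here, so "Utah" and "diameter" are skipped and no record is produced for the rest of the transcript.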

Any questions or issues please ask :)
HTH!
Ira

 

IraWatt
17 - Castor

@CHarrison,

I've attached the example macro workflow:

IraWatt_0-1653607376410.png

 

IraWatt_1-1653606695672.png

Hopefully it should be super efficient. Like @MarqueeCrew said, you should be able to compare the workflows' efficiency here:

IraWatt_0-1653608556737.png

You'll have to tell us which is the fastest on your real data, @CHarrison!
