Hi there!
I've built a workflow which searches transcripts for certain keywords (attached a watered-down example to show my process).
The workflow works perfectly; however, we've reached almost a million transcripts now, and there are probably more keywords we'd like to search for going forward.
Does anyone have any suggestions on how to optimise/improve this workflow so it won't take a massive amount of processing power/time to run?
Thanks in advance :)
Hey @CHarrison,
Very interesting problem; it would make a great weekly challenge!
I think Find and Replace is more efficient, as you don't append every row to one another. Also, I think your initial approach was miscounting: for transcript ID 1, by my count it has 5 "ut"s, not 4. Likewise with "Diams", it should be 3, not 2.
I think the punctuation is throwing it off. You could use a data cleaning tool to remove punctuation before processing to solve this, though.
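To show what I mean about punctuation, here's a quick Python sketch (the transcript snippet and keywords are made up for illustration, not from the attached workflow): exact word matching undercounts when punctuation is glued to the words, and stripping it first fixes the counts.

```python
import re
from collections import Counter

# Hypothetical transcript snippet and keyword list for illustration.
transcript = "Ut enim, ut. Diams: ut diams, ut... diams ut"
keywords = ["ut", "diams"]

# Naive whitespace split leaves punctuation attached ("ut." != "ut"),
# so exact matches get undercounted.
naive = Counter(transcript.lower().split())

# Replace punctuation with spaces first, then count: true totals.
cleaned = re.sub(r"[^\w\s]", " ", transcript.lower())
counts = Counter(cleaned.split())
```

With this snippet, the naive count only finds 3 of the 5 "ut"s; after cleaning, all 5 "ut"s and 3 "diams" are counted.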
Any questions, issues or adjustments please ask :)
HTH!
Ira
Here's yet another approach. While I agree about the Find & Replace tool as an option, you might consider the JOIN tool. I've got AMP turned on for this workflow ( @TonyaS and @jarrod as well as @NicoleJ will be proud of me), and millions of records won't be an issue for you. Essentially, I use a Select to remove all unnecessary data and then opt for the RegEx tool to TOKENIZE (turn each word into its own record). We are now able to count all occurrences of each word that matches the Keywords (exactly, hence the LOWERCASE function is used). If you don't normalise the case, your results will vary.
Now you can create metrics for matched keywords. I counted matches (ignored unmatched) and gave unique counts of transcripts plus the total occurrences. For each transcript I have counts for each unique keyword.
This is a case where I think that optimisation is in the eye of the beholder. Clarity of the workflow and ease of updates are important. I think that you might also want to SUMMARIZE each of the tokenized words and count their usage; word counts from the transcripts might also have value for you.
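For anyone following along outside Alteryx, the tokenize → lowercase → join → summarize idea can be sketched in plain Python (the transcripts and keyword set here are made-up stand-ins, not the attached data):

```python
import re
from collections import Counter

# Hypothetical sample data standing in for the transcript and keyword inputs.
transcripts = {
    1: "Ut enim, ut! Diams ut; diams ut diams ut.",
    2: "Lorem ipsum diams.",
}
keywords = {"ut", "diams"}

# TOKENIZE: turn each word into its own record, lowercased so the
# join against the keyword list matches exactly.
tokens = [
    (tid, word)
    for tid, text in transcripts.items()
    for word in re.findall(r"\w+", text.lower())
]

# "Join" the tokens to the keywords, then summarize:
# occurrences per transcript per keyword, and total occurrences per keyword.
per_transcript = Counter((tid, w) for tid, w in tokens if w in keywords)
totals = Counter(w for _, w in tokens if w in keywords)
```

The set membership test plays the role of the Join tool's match, and the two Counters stand in for the Summarize tool's group-by counts.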
Cheers,
Mark
Ah, tokenize each word in the transcript and join! Awesome solution @MarqueeCrew; I don't want to admit it, but it's possibly a tad better than mine XD. Though I do wonder how much tokenizing every word in the transcript adds to the computational cost? I don't know if there's a good way to do a performance analysis in Alteryx; could be a good idea suggestion if there's not one.
@IraWatt ,
In the runtime settings you can look at performance tuning. But if you AMP the workflow, it runs so fast that I'd be surprised if the tuning will amount to anything. I think in this case you can afford the "cost" of tokenizing the transcripts with the simplification of the workflow. I also think that this approach gives better metrics.
Cheers,
Mark
Hey @CHarrison,
I think this is the most efficient and possibly the simplest way. The regex generated from the keywords lets you tokenise just the keywords. Could just put this in a macro and you're sorted!
The generated regex makes sure that each match is just the full keyword (punctuation around it is allowed)!
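The same trick can be sketched in Python to show how the generated pattern works (keyword list and transcript are made up; in the macro the pattern would be built from the keyword input):

```python
import re

# Hypothetical keyword list; the macro would build this from the keyword input.
keywords = ["ut", "diams"]

# One alternation that tokenizes only the keywords. The \b anchors mean
# "ut." and "ut," still match as full words, while "brut" does not.
pattern = re.compile(
    r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b",
    re.IGNORECASE,
)

transcript = "Ut enim, ut. Diams ut; diams ut diams ut."
matches = [m.lower() for m in pattern.findall(transcript)]
```

`re.escape` keeps any punctuation inside a keyword from being treated as regex syntax, and lowercasing the matches afterwards keeps the counts case-insensitive.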
Any questions or issues please ask :)
HTH!
Ira
I've attached the example macro workflow:
Should be super efficient, hopefully! Like @MarqueeCrew said, you should be able to compare the workflows' efficiency here:
You'll have to tell us which is fastest on your real data, @CHarrison!