Thought I'd challenge myself this Thanksgiving, but may have bitten off too much to swallow. Any help appreciated.
In the attached workflow, I generate a number of possible 5-grams (sequences of five consecutive words) and count them to find out which are most common. The macro then goes through, picks the high-frequency nGrams, and replaces the underlying words with them. Unfortunately, this process requires multiple passes through the data and a creative approach to downgrading nGram priorities. Issues I encounter:
I need a good way to loop through the process until all possible nGrams have been selected and placed into the list of words. For example, if the first text in the dataset were "I think therefore I am amazing at thinking deep thoughts and love.", it would generate eight potential nGrams (a rough code sketch of this step follows the list):
1. I_think_therefore_I_am
2. think_therefore_I_am_amazing
3. therefore_I_am_amazing_at
4. I_am_amazing_at_thinking
5. am_amazing_at_thinking_deep
6. amazing_at_thinking_deep_thoughts
7. at_thinking_deep_thoughts_and
8. thinking_deep_thoughts_and_love
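For reference, here is a minimal Python sketch of the window-sliding and counting steps, assuming simple whole-word tokenization on spaces (the function and variable names are illustrative, not part of the actual workflow):

```python
from collections import Counter

def generate_ngrams(text, n=5):
    """Slide an n-word window across the text, joining each window with underscores."""
    words = text.rstrip(".").split()
    return ["_".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "I think therefore I am amazing at thinking deep thoughts and love."
for i, gram in enumerate(generate_ngrams(sentence), start=1):
    print(f"{i}. {gram}")  # prints the eight candidates listed above

# Counting across the whole dataset surfaces the high-frequency 5-grams:
corpus = [sentence]  # stand-in for the real dataset
counts = Counter(g for text in corpus for g in generate_ngrams(text))
print(counts.most_common(3))
```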
Any thoughts on how to set up the macro with a loop that knows when it has assigned every nGram that can be assigned? For example, in the set above, nGram 4 might have a higher priority than the first three candidates, so the first round would proceed down to 4, but there discover that nGram 6 has an even higher priority, and accept 6 as the only nGram coded in this round. Before finishing the round, the macro should remove candidates 7 and 8, since they can't co-exist with 6, and also candidates 2-5, since they all use a word that has now been taken over by nGram 6 ("amazing"). In the second round, nGram 1 should be selected. A rough sketch of that loop is below.
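To make the stopping condition concrete, here is a minimal Python sketch of that round-by-round selection. The priority numbers are made up for illustration (in the real workflow they would come from the frequency counts); the loop terminates naturally once the candidate list is empty, which is the "know when you're done" condition:

```python
def select_ngrams(candidates, n=5):
    """Greedily pick non-overlapping nGrams, highest priority first.

    candidates: list of (start_position, gram, priority) tuples for one text.
    Each round selects the single best remaining candidate, then drops every
    candidate that shares a word position with it; the loop ends when nothing
    remains, so no fixed iteration count is needed.
    """
    selected, remaining = [], list(candidates)
    while remaining:
        best = max(remaining, key=lambda c: c[2])
        selected.append(best)
        taken = set(range(best[0], best[0] + n))
        remaining = [c for c in remaining
                     if taken.isdisjoint(range(c[0], c[0] + n))]
    return selected

grams = ["I_think_therefore_I_am", "think_therefore_I_am_amazing",
         "therefore_I_am_amazing_at", "I_am_amazing_at_thinking",
         "am_amazing_at_thinking_deep", "amazing_at_thinking_deep_thoughts",
         "at_thinking_deep_thoughts_and", "thinking_deep_thoughts_and_love"]
priorities = [40, 10, 15, 30, 20, 90, 25, 5]  # invented numbers: 6 highest, then 1
candidates = [(i, g, p) for i, (g, p) in enumerate(zip(grams, priorities))]
for start, gram, prio in select_ngrams(candidates):
    print(gram)  # round 1: amazing_at_thinking_deep_thoughts; round 2: I_think_therefore_I_am
```

Note that position-based elimination covers both of the cases described above in one rule: candidates 7 and 8 collide with nGram 6 directly, and candidates 2-5 are removed because they all cover the position of "amazing".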
Any thoughts welcome. I've reviewed the macro videos, but the examples are a bit contrived and use a fixed number of loops rather than the dynamic stopping condition I need here.
Kai :-)
@KaiLarsen - this seems like some challenging thinking for a weekend filled with turkey and shopping :)
Played around with it a bit, might have some ideas... take a look at the attached. I tweaked both the workflow and the macro, and if I'm understanding your objective correctly, this should give you the results you're looking for.
First, please confirm my understanding of the objective:
1. Assign each nGram a priority based on the number of times the words in that nGram appear in other records.
2. Choose the highest-priority nGram, then eliminate any other nGram records that contain the same words. << Is this correct?? And should the elimination be grouped, so you only eliminate within each OriginalItemText, for example, or should all records containing a matching word be eliminated?
3. Once the highest-priority nGram is chosen and duplicate words are eliminated, run whatever is left back through the workflow for the next round, repeating until there is nothing left.
Assuming my assumptions are true... :) Then I believe you can accomplish what you are looking to do with the modified macro included in the attached workflow. Rather than using Multi-Row formulas, it selects the highest Priority item, filters out any other records in the dataset that contain the same words, and then feeds any remaining records back through the iterative macro again. Once all iterations are complete, it will output the selected nGrams in your original workflow.
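For anyone reading along without the attachment, the macro's loop is doing roughly this, sketched in Python. The per-OriginalItemText grouping shown here is just one of the two interpretations raised above; dropping the group check gives the eliminate-across-all-records variant:

```python
def run_iterations(records):
    """Rough Python equivalent of the iterative macro's loop.

    records: list of dicts with 'OriginalItemText', 'nGram', and 'Priority' keys.
    Each iteration selects the highest-priority record, eliminates records in
    the same OriginalItemText that share any word with it, and feeds whatever
    is left back through -- repeating until there is nothing left.
    """
    selected = []
    while records:
        best = max(records, key=lambda r: r["Priority"])
        selected.append(best)
        best_words = set(best["nGram"].split("_"))
        records = [r for r in records
                   if r["OriginalItemText"] != best["OriginalItemText"]
                   or not best_words & set(r["nGram"].split("_"))]
    return selected

# Hypothetical sample data with invented priorities:
data = [
    {"OriginalItemText": "text-1", "nGram": "amazing_at_thinking_deep_thoughts", "Priority": 90},
    {"OriginalItemText": "text-1", "nGram": "I_am_amazing_at_thinking", "Priority": 30},
    {"OriginalItemText": "text-1", "nGram": "I_think_therefore_I_am", "Priority": 40},
]
for r in run_iterations(data):
    print(r["nGram"])  # amazing_at_thinking_deep_thoughts, then I_think_therefore_I_am
```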
I'm sure there is some tweaking needed for your particular use case, but I'm hoping this at least gives you a few ideas! Happy to help tweak more if this doesn't quite get you there... let us know!!
Cheers,
NJ
Thanks so much, Nicole!
I had to move forward, so I think I figured it out. Your help is much appreciated, though.
Kai :-)