I need to search text for matches against a list of key words.
Each row of text contains 3 text fields. Text_1 is searched for hits. If a hit is registered, it receives a score of 1 and the entire row is passed to an output Union tool (with no need to search Text_2 or Text_3). Entries that fail to generate a hit in the Text_1 filter are passed to the Text_2 filter, where Text_2 is searched for keyword hits... At the end, I union the results of the 3 filter passes, and Score tells me which filter generated the hit.
As currently configured, Score tells me which Text field generated the hit, but I do not know which keyword or keywords generated the hit.
How could I generate an output that tells me which keywords generated the hit and, as an extra bonus, a field that includes the 5 words before and 5 words after the keyword?
The sample text below contains the keywords PLANES and unmanned air systems. Other keywords might include tacos, Cervelo, or EW. A text field may generate zero hits; it may generate multiple separate hits; or it may generate multiple hits of the same keyword.
A generic field...
TEXT FIELD |
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod PLANES incididunt ut labore et dolore magna aliqua. Id porta nibh venenatis cras sed felis eget. Amet est placerat in egestas. Sed augue lacus viverra vitae congue. At lectus urna duis convallis convallis. Neque unmanned air systems uisque egestas diam in. Ornare lectus sit amet est placerat in. Elementum nibh tellus molestie nunc non blandit massa. Dictum non consectetur a erat nam at. Cras tincidunt lobortis feugiat vivamus at augue eget arcu. Egestas integer eget aliquet nibh praesent tristique magna sit amet. Elit ullamcorper dignissim cras tincidunt lobortis feugiat vivamus at augue. Molestie a planes at erat pellentesque adipiscing. Sagittis nisl starship kloi rhoncus urna neque viverra justo. Mi proin sed libero enim sed faucibus turpis. |
A real world example where the keyword might be hypersonic
Similarly, advancements in material design, processing and manufacturing are enabling novel material architectures that can further enhance performance and resilience in structures such as leading edges, windows and apertures, propulsion systems, and space structures. Exemplar areas of research within the Materials for Extreme Environments thrust include the following: 1) high temperature materials for hypersonic platforms; 2) high temperature window and aperture materials; 3) radiation and/or electromagnetic pulse (EMP) hardened electronics for space platforms; and 4) coatings for platform survivability in corrosive environments. |
I can use a Text Input tool for the list of keywords and use Find Replace to append the matching keyword. I can then use the Summarize tool with Concatenate to group the keywords.
Hi @hellyars
Here's one way you can do it.
Split your text into words and join them to your list of keywords. Use a Generate Rows tool to get the word IDs within +/-5 of each keyword. Join the IDs back to the word list and summarize on Keyword.
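For readers who want to see the logic outside of Alteryx, the steps above can be sketched in Python. This is only an illustration of the same idea, not the attached workflow: the function name `keyword_contexts`, the `window` parameter, and the shortened sample text are assumptions made for the example.

```python
import re
from collections import defaultdict

def keyword_contexts(text, keywords, window=5):
    """Return {keyword: [context, ...]} for single-word keywords."""
    # Split the text into words; the list index plays the role of the word ID.
    words = re.findall(r"\w+", text)
    # Lower-case the keyword list so the "join" is case-insensitive.
    wanted = {k.lower() for k in keywords}
    hits = defaultdict(list)
    for i, word in enumerate(words):
        if word.lower() in wanted:
            lo = max(0, i - window)   # first word ID of the context
            hi = i + window + 1       # one past the last word ID
            hits[word.lower()].append(" ".join(words[lo:hi]))
    return dict(hits)

sample = ("sed do eiusmod PLANES incididunt ut labore et dolore magna aliqua. "
          "Molestie a planes at erat pellentesque adipiscing.")
result = keyword_contexts(sample, ["planes", "tacos"])
for kw, contexts in result.items():
    print(kw, len(contexts), contexts)
```

Grouping by keyword at the end corresponds to the Summarize step: each keyword ends up with one list entry per hit, so repeated hits of the same keyword stay visible.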
Dan
@danilang Cool. There is only one challenge: I forgot to mention that a keyword could be a key phrase (e.g. "electronic warfare" instead of EW).
Ya know @hellyars, in the Biz, this is known as scope creep😋
Key words are now key phrases. "Extreme thrust" is there to show the negative case of the words in the text being present, but not contiguous.
There's still work to be done on this. For instance, it doesn't handle partial and full matches of the same key phrase. But, as my advanced differential equations teacher was fond of saying after putting the trivial substitutions into the equations, "The rest is left as an exercise for the student."
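The contiguity requirement for key phrases can be illustrated with a small Python sketch. It is a stand-in for the idea rather than a copy of the workflow: `find_phrases` and the shortened sample text are hypothetical, and the approach (comparing each run of consecutive words against the phrase) is just one simple way to do it.

```python
import re

def find_phrases(text, phrases, window=5):
    """Return (phrase, words_before, words_after) for each contiguous match."""
    words = re.findall(r"\w+", text)
    lowered = [w.lower() for w in words]
    hits = []
    for phrase in phrases:
        target = re.findall(r"\w+", phrase.lower())
        n = len(target)
        # Slide over the text and compare each run of n consecutive words.
        for i in range(len(lowered) - n + 1):
            if lowered[i:i + n] == target:
                before = " ".join(words[max(0, i - window):i])
                after = " ".join(words[i + n:i + n + window])
                hits.append((phrase, before, after))
    return hits

text = ("research within the Materials for Extreme Environments thrust include "
        "high temperature materials for hypersonic platforms and "
        "electromagnetic pulse hardened electronics")
hits = find_phrases(text, ["extreme thrust", "electromagnetic pulse", "hypersonic"])
for hit in hits:
    print(hit)
```

"Extreme" and "thrust" both occur in the sample but are separated by "Environments", so that phrase produces no hit, which is the negative case described above.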
Dan
@danilang LOL! Scope creep is every consultant's nemesis. Alteryx is a side hobby I am trying to figure out how to apply to my core.
@danilang I can't seem to make this work with real world data.
hi @hellyars
As I mentioned previously, there's still work to be done. I did notice that the case where both a full and a partial match are found isn't working properly. The phrase is "material science", and the word "material" appears on its own and also in the phrase.
To fix this, you'll have to start tracking each match of each of the key phrase words separately, i.e. give each one its own subID and use the subID further in the workflow when determining whether it's a complete match or not.
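A rough Python sketch of that subID idea, under my own assumptions about how the tracking might look (the function `track_matches`, the dictionary keys, and the sample sentence are all made up for illustration): every occurrence of a phrase word gets its own `sub_id`, and an occurrence is flagged complete only if it belongs to a run of consecutive word IDs that spells out the whole phrase.

```python
import re

def track_matches(text, phrase):
    """Give every occurrence of a phrase word a sub_id and flag complete matches."""
    words = [w.lower() for w in re.findall(r"\w+", text)]
    target = re.findall(r"\w+", phrase.lower())
    occurrences = []
    for word_id, word in enumerate(words):
        if word in target:
            occurrences.append({"sub_id": len(occurrences), "word_id": word_id,
                                "word": word, "complete": False})
    by_word_id = {o["word_id"]: o for o in occurrences}
    for o in occurrences:
        # Treat this occurrence as a possible start of the phrase.
        run = [by_word_id.get(o["word_id"] + k) for k in range(len(target))]
        if all(r is not None and r["word"] == target[k] for k, r in enumerate(run)):
            for r in run:
                r["complete"] = True
    return occurrences

text = ("Advances in material design and material science "
        "enable new material architectures")
occurrences = track_matches(text, "material science")
for o in occurrences:
    print(o)
```

Here the first and last "material" remain partial matches, while the "material" immediately followed by "science" is marked complete together with it.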
Dan
@hellyars, the 5-words-before-and-after requirement immediately made me think of regex (although for a non-regex solution, I'd use the same method @danilang shared).
Full regex string = '.*?((?:\w+\W+){0,5})(keyword1|keyword2|keyword3|etc)((?:\W+\w+){0,5}).*'
This might need some tweaking as well, depending on your specific requirements. For example, this method counts the numbered bullets in text #1 as "words" (since they are numbers), and it would count a hyphenated word as multiple words ("not-so-distant" in the third text counts as 3 words).
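For anyone who wants to try the pattern outside of Alteryx, here is a minimal Python sketch using the same three capture groups (words before, keyword, words after). The keyword list, the helper name `find_contexts`, and the shortened sample text are assumptions for the example; the leading `.*?` and trailing `.*` are dropped because `re.finditer` scans the text itself. One caveat of this approach: matches do not overlap, so a keyword that falls inside the trailing context of an earlier hit is skipped, and the leading context of the next hit can come out shorter than 5 words.

```python
import re

keywords = ["planes", "unmanned air systems", "hypersonic"]
# Escape each keyword and try longer alternatives first.
alternation = "|".join(re.escape(k) for k in sorted(keywords, key=len, reverse=True))
pattern = re.compile(
    r"((?:\w+\W+){0,5})(" + alternation + r")((?:\W+\w+){0,5})",
    re.IGNORECASE,
)

def find_contexts(text):
    """Return (before, keyword, after) for each non-overlapping match."""
    return [(m.group(1).strip(), m.group(2), m.group(3).strip())
            for m in pattern.finditer(text)]

text = ("sed do eiusmod PLANES incididunt ut labore et dolore magna aliqua. "
        "Neque unmanned air systems uisque egestas diam in. Ornare lectus sit amet.")
contexts = find_contexts(text)
for before, keyword, after in contexts:
    print(f"{before} [{keyword}] {after}")
```

In this sample the second hit shows only three words of leading context, because the first match had already consumed the words up to "dolore".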
Results (I had a little fun highlighting the found keyword):
I thought of a situation where my original solution would fail, so I updated the attachment in my post above. The Filter tool would have allowed a partial match (e.g. "non" for "None"), but the 5-words-before-and-after would not. I modified the Filter tool to only find full-word matches for keywords.
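The attachment isn't reproduced here, so the exact Filter expression isn't shown. As a hedged illustration of the difference between a partial match and a full-word match, here is a Python sketch that uses word boundaries (`\b`); the helper name `has_whole_word` is made up for the example.

```python
import re

def has_whole_word(text, keyword):
    """True only if keyword appears as a whole word (case-insensitive)."""
    return re.search(r"\b" + re.escape(keyword) + r"\b", text, re.IGNORECASE) is not None

# A plain substring test registers "non" inside "None".
print("non" in "None of the above".lower())          # True
# The whole-word test does not.
print(has_whole_word("None of the above", "non"))    # False
print(has_whole_word("a non sequitur", "non"))       # True
```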