Keyword list search: output the keyword that generates a hit, plus the 5 words before and after it (for context)
I need to search text for matches against a list of key words.
Each row of text contains 3 text fields. Text_1 is searched for hits. If a hit is registered, the row receives a score of 1 and the entire row is passed to an output Union tool (with no need to search Text_2 or Text_3). Entries that fail to generate a hit in the Text_1 filter are passed to the Text_2 filter, where Text_2 is searched for keyword hits, and so on. At the end, I union the results of the 3 filter passes, and Score tells me which filter generated the hit.
As currently configured, Score tells me which text field generated the hit, but I do not know which keyword or keywords generated it.
How could you generate an output that tells you which keywords generated the hit -- and as an extra bonus a field that included the 5 words before and 5 words after the keyword?
The sample text below contains the keywords PLANES and unmanned air systems. Other keywords might include tacos, Cervelo, or EW. A text field may generate zero hits; it may generate multiple separate hits; or it may generate multiple hits of the same keyword.
A generic field...
TEXT FIELD |
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod PLANES incididunt ut labore et dolore magna aliqua. Id porta nibh venenatis cras sed felis eget. Amet est placerat in egestas. Sed augue lacus viverra vitae congue. At lectus urna duis convallis convallis. Neque unmanned air systems uisque egestas diam in. Ornare lectus sit amet est placerat in. Elementum nibh tellus molestie nunc non blandit massa. Dictum non consectetur a erat nam at. Cras tincidunt lobortis feugiat vivamus at augue eget arcu. Egestas integer eget aliquet nibh praesent tristique magna sit amet. Elit ullamcorper dignissim cras tincidunt lobortis feugiat vivamus at augue. Molestie a planes at erat pellentesque adipiscing. Sagittis nisl starship kloi rhoncus urna neque viverra justo. Mi proin sed libero enim sed faucibus turpis. |
A real world example where the keyword might be hypersonic
Similarly, advancements in material design, processing and manufacturing are enabling novel material architectures that can further enhance performance and resilience in structures such as leading edges, windows and apertures, propulsion systems, and space structures. Exemplar areas of research within the Materials for Extreme Environments thrust include the following: 1) high temperature materials for hypersonic platforms; 2) high temperature window and aperture materials; 3) radiation and/or electromagnetic pulse (EMP) hardened electronics for space platforms; and 4) coatings for platform survivability in corrosive environments. |
I can use a Text Input tool for the list of keywords and use Find Replace to append the matching keyword. I can then use the Summarize tool with concatenation to group the keywords.
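For anyone who wants to prototype the same idea outside Alteryx, here is a minimal Python sketch of "find the matching keywords, then concatenate them per row". The function name `keyword_summary`, the hit count in parentheses, and the whole-word, case-insensitive matching are my own assumptions, not something the tools above do by default.

```python
import re

def keyword_summary(text, keywords):
    """Return a comma-separated string of the keywords found in text, with hit counts."""
    parts = []
    for kw in keywords:
        # Assumption: case-insensitive, whole-word matching.
        # re.escape keeps any special characters in a keyword literal.
        pattern = r"\b" + re.escape(kw) + r"\b"
        count = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if count:
            parts.append("%s (%d)" % (kw, count))
    return ", ".join(parts)

text = ("sed do eiusmod PLANES incididunt ut labore. Neque unmanned air systems "
        "uisque egestas. Molestie a planes at erat.")
summary = keyword_summary(text, ["planes", "unmanned air systems", "tacos", "EW"])
print(summary)  # planes (2), unmanned air systems (1)
```

A row with no hits produces an empty string, which could stand in for "no keyword matched".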
Hi @hellyars
Here's one way you can do it.
Split your text into words and join the words to your list of keywords. Use a Generate Rows tool to get the word IDs within +/-5 of the keyword. Join the IDs back to the word list and summarize on Keyword.
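The same steps can be sketched in Python for readers who want to see the logic outside a workflow. This is only an illustration under my own assumptions: the function name `keyword_context` is made up, words are whatever `\w+` matches, and only single-word keywords are handled.

```python
import re

def keyword_context(text, keywords, window=5):
    """Return a (keyword, context) pair for every single-word keyword hit in text."""
    # Step 1: split the text into words; the list index acts as the word ID.
    words = re.findall(r"\w+", text)
    wanted = {kw.lower() for kw in keywords}
    results = []
    for word_id, word in enumerate(words):
        # Step 2: "join" each word to the keyword list (case-insensitive).
        if word.lower() in wanted:
            # Step 3: generate the word IDs from -window to +window around the hit.
            start = max(0, word_id - window)
            end = min(len(words), word_id + window + 1)
            # Step 4: join the IDs back to the word list and concatenate the words.
            results.append((word.lower(), " ".join(words[start:end])))
    return results

text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do "
        "eiusmod PLANES incididunt ut labore et dolore magna aliqua.")
hits = keyword_context(text, ["planes", "tacos"])
for keyword, context in hits:
    print(keyword, "->", context)
# planes -> adipiscing elit sed do eiusmod PLANES incididunt ut labore et dolore
```

Because every hit is reported separately, a keyword that appears twice in the same field yields two rows, each with its own context.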
Dan
@danilang Cool. There is only one challenge, I forgot to mention that a keyword could be a key phrase (e.g. "electronic warfare" instead of EW).
Ya know @hellyars, in the Biz, this is known as scope creep😋
Key words are now key phrases. "Extreme thrust" is there to show the negative case of the words in the text being present, but not contiguous.
There's still work to be done on this. For instance, it doesn't handle partial and full matches of the same key phrase. But, as my advanced differential equations teacher was fond of saying after putting the trivial substitutions into the equations, "The rest is left as an exercise for the student."
Dan
@danilang LOL! Scope creep is every consultant's nemesis. Alteryx is a side hobby I am trying to figure out how to apply to my core.
@danilang I can't seem to make this work with real world data.
hi @hellyars
As I mentioned previously, there's still work to be done. I did notice that the case where both a full and a partial match are found is not working properly. The phrase is "material science", and the word "material" appears on its own and also in the phrase.
To fix this, you'll have to start tracking each match of each of the key phrase words separately, i.e. give each one its own subID and use that subID further along in the workflow when determining whether it's a complete match or not.
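As a rough illustration of the subID idea (not the workflow itself), here is a small Python sketch. It simplifies things by giving a sub ID only to each occurrence of the phrase's first word and then checking whether the remaining words follow contiguously. The name `phrase_matches` and the dictionary layout are assumptions made for the example.

```python
import re

def phrase_matches(text, phrase):
    """Classify each occurrence of the phrase's first word as a full or partial match."""
    words = [w.lower() for w in re.findall(r"\w+", text)]
    parts = phrase.lower().split()
    matches = []
    sub_id = 0
    for pos, word in enumerate(words):
        if word != parts[0]:
            continue
        sub_id += 1  # every occurrence is tracked under its own sub ID
        # Full match only if the following words line up contiguously with the phrase.
        is_full = words[pos:pos + len(parts)] == parts
        matches.append({"sub_id": sub_id, "position": pos, "full": is_full})
    return matches

matches = phrase_matches(
    "New material science programs study each material on its own.",
    "material science",
)
# The first "material" starts a full match; the second one is only partial.
negative = phrase_matches("Materials for Extreme Environments thrust", "extreme thrust")
# "extreme" and "thrust" are both present but not contiguous, so no full match.
print(matches)
print(negative)
```

Keeping only the entries where `full` is true drops the stand-alone "material" as well as the non-contiguous "extreme thrust" case.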
Dan
@hellyars, the 5-words-before-and-after requirement immediately made me think of regex (although for a non-regex solution, I'd use the same method @danilang shared).
Full regex string = .*?((?:\w+\W+){0,5})(keyword1|keyword2|keyword3|etc)((?:\W+\w+){0,5}).*
- Here, I'm using \w and \W, which mean "word" and "non-word" characters. In regex, a "word" character is a letter, digit, or underscore (and a "non-word" character is everything else, such as spaces and punctuation).
- .*? = match any character, as few times as possible (the ? makes it lazy, so it doesn't swallow the words needed for group #1)
- capturing group #1: ((?:\w+\W+){0,5}) = match one or more word characters (\w) followed by one or more non-word characters (\W); find that pattern 0-5 times. The ?: means I'm just using the inner parentheses to group the \w+\W+ pattern together, but don't actually want to capture each individual match; I only want to capture the whole run of up to 5 words.
- capturing group #2: (keyword1|keyword2|keyword3|etc) = match one of the phrases in the keyword group (the vertical bar = "or")
- capturing group #3: ((?:\W+\w+){0,5}) = match one or more non-word characters, then one or more word characters; do that 0-5 times
- .* = match any character, any number of times (greedy, so it consumes the rest of the text)
This might need some tweaking as well, depending on your specific requirements. For example, this method counts the numbered bullets in text #1 as "words" (since they are numbers), and it would count a hyphenated word as multiple words ("not-so-distant" in the third text counts as 3 words).
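To try the three capturing groups outside Alteryx, here is a hedged Python sketch. It is an adaptation rather than the exact expression above: the leading `.*?` and trailing `.*` are dropped because `re.finditer` scans the text itself, and the name `keyword_hits`, the case-insensitive flag, and the use of `re.escape` are my assumptions.

```python
import re

def keyword_hits(text, keywords, window=5):
    """Return (before, keyword, after) tuples for each regex hit in text."""
    # Build group #2 from the keyword list; re.escape keeps phrases literal.
    alternation = "|".join(re.escape(kw) for kw in keywords)
    quant = "{0,%d}" % window
    pattern = re.compile(
        r"((?:\w+\W+)" + quant + ")(" + alternation + r")((?:\W+\w+)" + quant + ")",
        flags=re.IGNORECASE,
    )
    # Strip the leading/trailing separators from each captured group.
    return [tuple(part.strip() for part in m.groups()) for m in pattern.finditer(text)]

text = ("1) high temperature materials for hypersonic platforms; "
        "2) high temperature window and aperture materials")
hits = keyword_hits(text, ["hypersonic", "electronic warfare"])
print(hits)
# [('1) high temperature materials for', 'hypersonic',
#   'platforms; 2) high temperature window')]
```

The output shows the caveat mentioned above: the bullet numbers "1" and "2" each use up one of the five word slots. Also note that matches found this way do not overlap, so a second keyword sitting inside the first hit's trailing context would be consumed as context rather than reported as its own hit.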
Results (I had a little fun highlighting the found keyword):
I thought of a situation where my original solution would fail, so I updated the attachment in my post above. The Filter tool would have allowed a partial match (e.g. "non" for "None"), but the 5-words-before-and-after would not. I modified the Filter tool to only find full-word matches for keywords.
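The difference between a partial and a full-word match can be shown with a short Python sketch. The helper names `has_substring` and `has_whole_word` are hypothetical, and using `\b` word boundaries is just one way to require full-word matches; it is not necessarily what the updated Filter tool expression does.

```python
import re

def has_substring(text, keyword):
    """Partial match: True if keyword appears anywhere, even inside another word."""
    return keyword.lower() in text.lower()

def has_whole_word(text, keyword):
    """Full-word match: keyword must be delimited by word boundaries (\\b)."""
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

text = "None of the proposals mention aircraft."
print(has_substring(text, "non"))    # True  (matches inside "None")
print(has_whole_word(text, "non"))   # False (no stand-alone "non")
print(has_whole_word(text, "none"))  # True
```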
