Alteryx Designer Desktop Discussions

Keyword list search, output keyword that generates hit and +5 and -5 words (for context)

hellyars
13 - Pulsar

I need to search text for matches against a list of key words. 

 

Each row of text contains 3 text fields. Text_1 is searched for hits. If a hit is registered, the row receives a score of 1 and the entire row is passed to an output Union tool (with no need to search Text_2 or Text_3). Rows that fail to generate a hit in the Text_1 filter are passed to the Text_2 filter, where Text_2 is searched for keyword hits, and so on for Text_3. At the end, I union the results of the 3 filter passes, and Score tells me which filter generated the hit.

 

hellyars_0-1622913233591.png

 

As currently configured, Score tells me which Text field generated the hit, but I do not know which keyword or keywords generated it.

How could you generate an output that tells you which keywords generated the hit and, as a bonus, a field that includes the 5 words before and 5 words after the keyword?

 

The sample text below contains the keywords PLANES and unmanned air systems.  Other keywords might include tacos, Cervelo, or EW.   A text field may generate zero hits; it may generate multiple separate hits; or it may generate multiple hits of the same keyword.

 

A generic field...

TEXT FIELD
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod PLANES incididunt ut labore et dolore magna aliqua. Id porta nibh venenatis cras sed felis eget. Amet est placerat in egestas. Sed augue lacus viverra vitae congue. At lectus urna duis convallis convallis. Neque unmanned air systems uisque egestas diam in. Ornare lectus sit amet est placerat in. Elementum nibh tellus molestie nunc non blandit massa. Dictum non consectetur a erat nam at. Cras tincidunt lobortis feugiat vivamus at augue eget arcu. Egestas integer eget aliquet nibh praesent tristique magna sit amet. Elit ullamcorper dignissim cras tincidunt lobortis feugiat vivamus at augue. Molestie a planes at erat pellentesque adipiscing. Sagittis nisl starship kloi rhoncus urna neque viverra justo. Mi proin sed libero enim sed faucibus turpis.

 

A real-world example where the keyword might be hypersonic:

Similarly, advancements in material design, processing
and manufacturing are enabling novel material architectures that can further enhance performance and resilience in structures
such as leading edges, windows and apertures, propulsion systems, and space structures. Exemplar areas of research within
the Materials for Extreme Environments thrust include the following: 1) high temperature materials for hypersonic platforms; 2)
high temperature window and aperture materials; 3) radiation and/or electromagnetic pulse (EMP) hardened electronics for space
platforms; and 4) coatings for platform survivability in corrosive environments.
11 REPLIES
hellyars
13 - Pulsar

I can use a Text Input tool for the list of keywords and use the Find Replace tool to append the matching keyword. I can then use the Summarize tool with Concatenate to group the keywords.
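
For readers who want to see the same idea outside Alteryx, here is a minimal Python sketch: look up every keyword in a text, then concatenate the hits into one field per row. The function name `matched_keywords` and the sample keyword list are illustrative assumptions, not part of the workflow.

```python
def matched_keywords(text, keywords):
    """Return the keywords found in text, joined into one string."""
    lowered = text.lower()
    # Plain case-insensitive substring test (an assumption): a short
    # keyword such as "EW" would also hit inside a word like "few".
    hits = [kw for kw in keywords if kw.lower() in lowered]
    return ", ".join(hits)


keywords = ["planes", "unmanned air systems", "tacos"]
text = "eiusmod PLANES incididunt. Neque unmanned air systems uisque egestas."
print(matched_keywords(text, keywords))
```

Because this is a substring test, it reports which keywords hit but not where, and it cannot tell a whole word from a fragment of a longer word.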

danilang
19 - Altair

Hi @hellyars 

 

Here's one way you can do it.

 

danilang_0-1622979872946.png

Split your text into words and join them to your list of keywords. Use a Generate Rows tool to get the wordIDs within +/-5 of the keyword. Join those IDs back to the word list and summarize on Keyword.
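
The steps above can be sketched in Python for anyone who wants to check the logic outside Alteryx. This is only a rough equivalent under some assumptions: words are split with `\w+`, matching is case-insensitive, and keywords are single words. The function name `keyword_context` is made up for illustration.

```python
import re


def keyword_context(text, keywords, window=5):
    """For each single-word keyword hit, return (keyword, word_id, context)."""
    # Split the text into words; the list index plays the role of wordID.
    words = re.findall(r"\w+", text)
    wanted = {kw.lower() for kw in keywords}
    results = []
    for word_id, word in enumerate(words):
        if word.lower() in wanted:
            # Same idea as generating rows for wordID-5 .. wordID+5,
            # clipped at the start and end of the text.
            start = max(0, word_id - window)
            end = min(len(words), word_id + window + 1)
            results.append((word.lower(), word_id, " ".join(words[start:end])))
    return results


text = ("Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do "
        "eiusmod PLANES incididunt ut labore et dolore magna aliqua.")
for hit in keyword_context(text, ["planes", "tacos"]):
    print(hit)
```

Each hit is reported separately, so a keyword that appears several times in one text produces several rows, each with its own context.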

 

danilang_2-1622980246580.png

 

Dan

hellyars
13 - Pulsar

@danilang  Cool.  There is only one challenge: I forgot to mention that a keyword could be a key phrase (e.g. "electronic warfare" instead of EW).

danilang
19 - Altair

Ya know @hellyars, in the Biz, this is known as scope creep😋 

 

danilang_0-1622998881756.png

Key words are now key phrases.  "Extreme thrust" is there to show the negative case of the words in the text being present, but not contiguous.  
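
As a rough illustration of the contiguity rule (not a translation of the workflow in the screenshots), a key phrase can be matched in Python by sliding a window of the phrase's length across the word list. The function name `find_phrases` and the `\w+` tokenization are assumptions made for this sketch.

```python
import re


def find_phrases(text, phrases):
    """Return (phrase, start_word_id) for every contiguous phrase match."""
    words = [w.lower() for w in re.findall(r"\w+", text)]
    hits = []
    for phrase in phrases:
        target = re.findall(r"\w+", phrase.lower())
        if not target:
            continue  # skip phrases that contain no word characters
        n = len(target)
        # Slide a window of n words across the text and compare.
        for i in range(len(words) - n + 1):
            if words[i:i + n] == target:
                hits.append((phrase, i))
    return hits


text = ("Exemplar areas of research within the Materials for Extreme "
        "Environments thrust include radiation and/or electromagnetic pulse "
        "(EMP) hardened electronics.")
print(find_phrases(text, ["electromagnetic pulse", "extreme thrust"]))
```

Here "extreme thrust" produces no hit: both words occur in the text, but "Environments" sits between them.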

danilang_1-1622998952129.png

 

danilang_2-1622999111323.png

 

There's still work to be done on this. For instance, it doesn't handle partial and full matches of the same key phrase. But, as my advanced differential equations teacher was fond of saying after putting the trivial substitutions into the equations, "The rest is left as an exercise for the student."

 

Dan

 

hellyars
13 - Pulsar

@danilang  LOL!  Scope creep is every consultant's nemesis.  Alteryx is a side hobby I am trying to figure out how to apply to my core.

hellyars
13 - Pulsar

@danilang  I can't seem to make this work with real world data.

danilang
19 - Altair

Hi @hellyars

 

As I mentioned previously, there's still work to be done. I did notice that the case where both a full and a partial match are found does not work properly. The phrase is "material science", and the word "material" appears on its own and also in the phrase.

To fix this, you'll have to start tracking each match of each of the key phrase words separately, i.e. give each one its own subID, and use that subID further in the workflow when determining whether it's a complete match or not.
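
A simplified Python sketch of the subID idea, for illustration only. As a simplifying assumption, it tracks just the occurrences of the phrase's first word (rather than every word of the phrase), gives each occurrence its own `sub_id`, and then decides per occurrence whether the rest of the phrase follows. The function name `classify_matches` is hypothetical.

```python
import re


def classify_matches(text, phrase):
    """Label each occurrence of the phrase's first word as full or partial."""
    words = [w.lower() for w in re.findall(r"\w+", text)]
    target = re.findall(r"\w+", phrase.lower())
    if not target:
        return []
    results = []
    sub_id = 0
    for i, word in enumerate(words):
        if word != target[0]:
            continue
        sub_id += 1  # every occurrence is tracked under its own subID
        is_full = words[i:i + len(target)] == target
        results.append((sub_id, i, "full" if is_full else "partial"))
    return results


text = "Each material is tested. Advances in material science enable new designs."
print(classify_matches(text, "material science"))
```

Because each occurrence carries its own ID, the standalone "material" and the "material" inside "material science" no longer get mixed up when deciding whether the phrase is complete.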

 

Dan  

kelly_gilbert
13 - Pulsar

@hellyars, the 5-words-before-and-after requirement immediately made me think of regex (although for a non-regex solution, I'd use the same method @danilang shared).

 

Full regex string  =  .*?((?:\w+\W+){0,5})(keyword1|keyword2|keyword3|etc)((?:\W+\w+){0,5}).*

 

  • Here, I'm using \w and \W which mean "word" and "non-word" characters. In regex, a "word" character is a letter, digit, or underscore (and a "non-word" character is everything else, such as spaces and punctuation).
  • .*? = match any character, any number of times, but as few as possible (lazy)
  • capturing group #1: ((?:\w+\W+){0,5})  =  match one or more word characters (\w) followed by one or more non-word characters (\W); find that pattern 0-5 times. The ?: means I'm just using parentheses to group the \w+\W+ pattern together, but don't actually want to capture each individual match; I only want to capture the collection of up to 5.
  • capturing group #2: (keyword1|keyword2|keyword3|etc)  =  match one of the phrases in the keyword group (the vertical bar = "or")
  • capturing group #3: ((?:\W+\w+){0,5}) = match one or more non-word characters, then one or more word characters; do that 0-5 times
  • .* = match any character, any number of times
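
To try the pattern outside Alteryx, here is a small Python sketch that uses the same regex. The function name `first_hit_with_context`, the case-insensitive and dot-matches-newline flags, and the use of `re.escape` on each keyword are assumptions added for this example.

```python
import re


def first_hit_with_context(text, keywords, n=5):
    """Return (before, keyword, after) for the first keyword hit, or None."""
    # re.escape guards against regex metacharacters inside a keyword.
    alternation = "|".join(re.escape(k) for k in keywords)
    pattern = (
        r".*?((?:\w+\W+){0,%d})(%s)((?:\W+\w+){0,%d}).*" % (n, alternation, n)
    )
    m = re.match(pattern, text, re.IGNORECASE | re.DOTALL)
    return m.groups() if m else None


keywords = ["hypersonic", "electromagnetic pulse"]
text = ("Exemplar areas of research include the following: 1) high temperature "
        "materials for hypersonic platforms; 2) high temperature window and "
        "aperture materials.")
print(first_hit_with_context(text, keywords))
```

Used this way, the pattern returns only the first hit in a text. Note also that the numbered bullets "1" and "2" count as words in the captured context.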


This might need some tweaking as well, depending on your specific requirements. For example, this method counts the numbered bullets in text #1 as "words" (since they are numbers), and it would count a hyphenated word as multiple words ("not-so-distant" in the third text counts as 3 words).

kelly_gilbert_0-1623153898993.png


Results (I had a little fun highlighting the found keyword):

kelly_gilbert_1-1623154022198.png

kelly_gilbert
13 - Pulsar

I thought of a situation where my original solution would fail, so I updated the attachment in my post above. The Filter tool would have allowed a partial match (e.g. "non" for "None"), but the 5-words-before-and-after would not. I modified the Filter tool to only find full-word matches for keywords.
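
The attachment itself is not reproduced here, so the following is only a sketch of one common way to require full-word matches: wrap the keyword in `\b` word boundaries. The function name `has_full_word_match` and the case-insensitive flag are assumptions, not the actual Filter tool expression.

```python
import re


def has_full_word_match(text, keyword):
    """True only when keyword appears as a whole word (or whole phrase)."""
    # \b matches at the boundary between a word and a non-word character,
    # so "non" no longer matches inside "None".
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


print(has_full_word_match("None of the above", "non"))
print(has_full_word_match("a non sequitur", "non"))
```

One caveat: `\b` depends on word characters, so a keyword that begins or ends with punctuation may not behave as expected.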
