Community Spring Cleaning week is here! Join your fellow Maveryx in digging through your old posts and marking comments on them as solved. Learn more here!

Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Identify & Score Quality of Key Word and Phrase Matches Across Multiple Text Fields

hellyars
13 - Pulsar

I am searching for key word and phrases across MULTIPLE text fields. 

The text fields are H1, H2, and H3.

The quality of the match depends on where occurs.

H1 is the highest and H3 is the lowest quality is match.

 

I have a prototype up and working. It has some problems. 

See attached workflow and image below.

I am open to any and all solutions including completely different approaches.

 

So far...I first look for a match in H1 and score it 1. 

Ideally, all records that do not generate an H1 hit should be passed to the filter that looks at H2 and so on.

But, currently all records are passed from H1 to H2 and all records are similarly passed to H3.  References the red boxes in the H1 and H2 containers.

Only the records that do not generate a hit at a given level should.

The output should include all the original text fields with the KW, MRKT, and Score fields appended.  It does not (hence red output containers).

Minor detail...I would like to make the field renaming dynamic.

 

 

 

 

 

 

hellyars_0-1656189212627.png

 

Below is a brute force approach that does not utilize a macro.  But, it fails to account for multiple key word and phrase hits in a given record (hence the macro approach I am trying now).

 

hellyars_1-1656189865168.png

 

 

9 REPLIES 9
DataNath
17 - Castor

@hellyars do you need the sum of matches if there are multiple in the field? Or are you just looking for whether or not there is at least one instance of the KW in the field, and if so, give it a rating corresponding to whether or not it's H1/H2/H3?

hellyars
13 - Pulsar

@DataNath     Thinking thru this....   The quality of a hit is most important -- is it H1, H2, or H3.  Multiple hits by different keywords from the same "market" is of interest.   The number of hits of a single word within an H3 is potentially of importance as it might help validate the quality of a hit -- a single hit might be indicative of a passing reference whereas multiple hits of the same word might be indicative of relevancy  

DataNath
17 - Castor

How does this look @hellyars? I'm not sure if I'm understanding the request right but there's various ways you can aggregate from this output, as I've included column name/market/kw etc in the end output. I've basically set up a nested batch macro where it pivots and processes each column name (H1/H2/H3), before running through the original macro I built for your first request. As the quality goes down across the fields, I have set the match score up so that H1 matches are multiplied by 3, H2 by 2 and H1 as just the number of matches - because I've left the column name in the output you can see how this breaks down across each record.

 

DataNath_0-1656231533667.png

 

hellyars
13 - Pulsar

@DataNathThank you . Working with it.

 

hellyars
13 - Pulsar

@DataNath      I was thinking something slightly different, but that definitely leverages some of these components. 

 

I would like to take all records and check for an H1 hit.   If "true"  output.  If false pass the false records only to the H2 filter.  If true at H2 out put.  If false move to H3. The number of hits at H3 is of interest.

My rationales is that if the hit occurs at H1 then everything (i.e., H2 and H3) will pertain to the key word / phrase (subject).  The same for H2 relative to H3.   H3 is different.  A hit at H3, with no hits at H1 or H2, may be indicative of a passing use of the key word (as it might be a buzz phrase) or an actually relevant hit meaning the item is on topic relative to the key world. 

DataNath
17 - Castor

How does the following look @hellyars? I tweaked one of the macros slightly and added some joins that basically checks if the column (H1/H2/H3) is the earliest and excludes everything that isn't, as well as non-matches. The output has slightly changed due to the nature of the macro so the raw text won't show for non-matches - it'll only show for the columns in which the KW was found:

 

DataNath_0-1656264820504.png

 

hellyars
13 - Pulsar

@DataNath  First, I totally appreciate the help.  Working thru a real world application challenge.   Real world.   H1 think Book Title.  H2 think Chapter Titles.  H3 think Section Titles.   (Real World is not a Book, but it works).   So if a book (H1)  has 10 chapters (H2) and 3 sections each (H3) then there will be 30 records for the given 1 (H1) book.  The data set is hundreds of books (H1), with 1-N chapters (H2) per book  and 1-N Sections (H3) per chapter per book chapter.

 

So

..if the Key Word match occurs in H1 it takes ALL 1-N records for that given Book (H1) because it means that all chapters (H2) and sections (H3) relate to this topic.  (i.e., there is no need to evaluate the H2 and H3 fields)

..if the Key Word match occurs in H2 it takes ONLY the 1-N records for that given Chapter (H2) because it means all the sections (H3) relate to the topic of this chapter. (i.e, there is no reason to evaluate the H3 fields)

..if the Key Word macth occurs in H3 it takes ONLY the single H3 record as only the H3 section and not the book or chapter relate to the key word topic

 

So

...pass all records through the workflow to look for an H1 hit take all corresponding records associated with that H1.

...pass all remaining records through the workflow to look for an H2 hit and take all corresponding records associated with that H2.

...pass all remaining records through the workflow to look for an H3 hit and take the record  (H3 is a 1 to 1)

 

I have a brute force solution that does this, second image in my original question but it fails to account for multiple key word hits.   So it only returns the first Key Word hit.  A book called Dogs and Cats will only evaluate (output) as Cats if Cats appears first in my key word list.  I want it to evaluate (output) Cats, Dogs.

 

 

 

LiuZhang
9 - Comet

My attempt of the problem.

Pivot long the input data, then sort by H1, H2, H3 group, run iterative macro to filter row by row. If there is a hit, remove the set of H1, H2, H3, else if there is no hit, remove that single H_i row.

The Score is not perfect yet, I think you can compare RecordID with EngineIterative Number to decide which H_i is the correct deletion.

Hopefully I didn't misunderstand the question, there is quite a bit discussion going on here. 😀

WorkflowWorkflow

InputInputOutputOutput

 

hellyars
13 - Pulsar

@DataNath  Sorry.  Got busy this week.  I'm back to trying to sort this out.  Thanks again  for your help last week.

Labels