Alteryx Designer Desktop Discussions

NLP Difficult Parsing Problem

theinsideguy
7 - Meteor

I’m trying to figure out how to perform what I believe to be a difficult parsing problem. I’m new to regex, but I’m not even sure if regex is the way to go here. I have a full paragraph of legal text that looks something like this:

 

BLAH BLAH 26 U.S.C.S. § 115, blah blah blah blah. BLAH BLAH 26 U.S.C.S. § 116, blah blah blah blah. Blah blah blah blah. Blah blah blah. BLAH BLAH 26 U.S.C.S. § 115 blah blah blah blah. Blah. Blah. Blah.

 

My goal is to determine whether a sentence contains the 116 statute and, if so, to get that sentence and all following sentences until I hit a sentence with the 115 statute. The bold would be what I’m looking to extract from the paragraph.

 

Similarly, sometimes the cell contains a paragraph that looks like this.

 

BLAH BLAH 26 U.S.C.S. § 116, blah blah blah blah. Blah blah blah blah. Blah blah blah. BLAH BLAH 26 U.S.C.S. § 115, blah blah blah blah. According to 26 U.S.C.S. § 116 blah blah blah blah. Blah. Blah. Blah.

 

In this instance, I still need all sentences from all 116 statutes onward until the 115 stopword (if you will) triggers stoppage. See bold above.

 

Any ideas on how I would approach this? I’m a little overwhelmed. Thanks for any and all help!

7 REPLIES
JoaoLeiteV
10 - Fireball

Good morning @theinsideguy,

 

I've made an example using a formula, a text-to-columns and a filter. I'm changing "According to" to a delimiter ("|"), then breaking down everything to rows and filtering only rows that contain 116.
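For anyone who prefers to read the idea as code, here is a minimal Python sketch of the same three steps (delimiter, split to rows, filter). It is not the workflow in the screenshot: the function name `rows_containing` and the plain substring test on "116" are assumptions made for illustration.

```python
def rows_containing(text, marker="According to", needle="116", delimiter="|"):
    """Replace the marker with a delimiter, split into rows, keep rows with the needle."""
    # Formula step: turn the marker phrase into a delimiter.
    delimited = text.replace(marker, delimiter)
    # Split-to-rows step: one row per delimited chunk.
    rows = [row.strip() for row in delimited.split(delimiter)]
    # Filter step: keep non-empty rows that mention the needle.
    return [row for row in rows if row and needle in row]


paragraph = (
    "BLAH BLAH 26 U.S.C.S. § 116, blah blah blah blah. Blah blah blah blah. "
    "Blah blah blah. BLAH BLAH 26 U.S.C.S. § 115, blah blah blah blah. "
    "According to 26 U.S.C.S. § 116 blah blah blah blah. Blah. Blah. Blah."
)
for row in rows_containing(paragraph):
    print(row)
```

On the second sample paragraph this prints two rows. The first row still carries the § 115 sentence, because the text is only cut where the marker phrase appears.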

 

JoaoLeiteV_0-1626357888073.png

 

Please let me know if this worked or if you have any questions!

 

theinsideguy
7 - Meteor

Thank you so much for that quick response. I shouldn't have put "according to" in the example. The truth is, I have no idea what words will precede the statute. I edited my question.

JoaoLeiteV
10 - Fireball

Okay, let me check one thing: can we split the data on "26 U.S.C.S" instead of "According to"?

 

If so, the workflow would be the same, just changing the first formula and the filter. Check the second example down here to see if it works.
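As a rough Python sketch of that variation (again, not the attached workflow): a delimiter is placed in front of every "26 U.S.C.S" citation, the text is split into rows, and only rows that begin with the § 116 citation are kept. The name `rows_by_citation`, the `WANTED` pattern, and the assumed citation shape "26 U.S.C.S. § 116" are illustrative assumptions.

```python
import re

# Assumed citation shape: "26 U.S.C.S. § 116"; real text may differ.
WANTED = re.compile(r"26 U\.S\.C\.S\.?\s*§\s*116\b")


def rows_by_citation(text, marker="26 U.S.C.S", delimiter="|"):
    # Put a delimiter in front of every citation so the citation stays in its row.
    delimited = text.replace(marker, delimiter + marker)
    rows = [row.strip() for row in delimited.split(delimiter)]
    # Keep only the rows that begin with the § 116 citation.
    return [row for row in rows if WANTED.match(row)]


paragraph = (
    "BLAH BLAH 26 U.S.C.S. § 115, blah blah blah blah. "
    "BLAH BLAH 26 U.S.C.S. § 116, blah blah blah blah. Blah blah blah blah. "
    "Blah blah blah. BLAH BLAH 26 U.S.C.S. § 115 blah blah blah blah. "
    "Blah. Blah. Blah."
)
print(rows_by_citation(paragraph))
```

One thing to watch with this approach: any words that come before the citation in the same sentence end up at the tail of the previous row, so the kept row starts at "26 U.S.C.S. § 116" and ends with the leading "BLAH BLAH" of the next citation's sentence.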

 

OllieClarke
15 - Aurora

Hi @theinsideguy   

This isn't perfect, as it's hard to code where sentences end, but is this close enough for you?

 

OllieClarke_0-1626359533082.png

 

theinsideguy
7 - Meteor

Joao and Ollie, your solutions are really close, but neither includes the text BEFORE the trigger word. With Ollie's solution, I would have to include the previous "FALSE" record to get all of the text BEFORE the 116 statute. I'm going to fool around with Ollie's example to see if I can get it to work, and I'll report back with a solution. Of course, any additional input is very much appreciated, Joao.

OllieClarke
15 - Aurora

Hey @theinsideguy, do you want the text before, the text after, or both?

If you want the text before, then this should do the trick (hopefully).

OllieClarke_0-1626362718114.png

 

theinsideguy
7 - Meteor

Was hoping for the entire sentence.

 

So, "BLAH BLAH 26 U.S.C.S. § 115, blah blah blah blah. BLAH BLAH 26 U.S.C.S. § 116, blah blah blah blah. Blah blah blah blah. Blah blah blah. BLAH BLAH 26 U.S.C.S. § 115 blah blah blah blah. Blah. Blah. Blah." returns "BLAH BLAH 26 U.S.C.S. § 116, blah blah blah blah. Blah blah blah blah. Blah blah blah." 
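Stated as logic, the rule in that example is a small start/stop scan over sentences: start collecting at a § 116 sentence, keep collecting, and stop at a § 115 sentence. Below is a minimal Python sketch, assuming the paragraph has already been split into sentences. The names `extract_runs`, `START`, and `STOP` are hypothetical, and letting § 115 win when one sentence cites both statutes is an assumption the thread does not settle.

```python
import re

START = re.compile(r"§\s*116\b")  # sentence that opens a run
STOP = re.compile(r"§\s*115\b")   # sentence that closes a run (not kept)


def extract_runs(sentences):
    """Return each run of sentences from a § 116 sentence up to the next § 115 sentence."""
    runs, current = [], None
    for sentence in sentences:
        if STOP.search(sentence):
            # Assumption: a sentence citing § 115 always ends the run.
            if current:
                runs.append(" ".join(current))
            current = None
        elif START.search(sentence) or current is not None:
            if current is None:
                current = []
            current.append(sentence)
    if current:
        runs.append(" ".join(current))  # run that reaches the end of the paragraph
    return runs


sentences = [
    "BLAH BLAH 26 U.S.C.S. § 115, blah blah blah blah.",
    "BLAH BLAH 26 U.S.C.S. § 116, blah blah blah blah.",
    "Blah blah blah blah.",
    "Blah blah blah.",
    "BLAH BLAH 26 U.S.C.S. § 115 blah blah blah blah.",
    "Blah.",
    "Blah.",
    "Blah.",
]
print(extract_runs(sentences))
```

For the hand-split sentences above, the single run returned is the text quoted as the desired result. A paragraph with several § 116 sentences separated by § 115 sentences would yield one run per § 116 block.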

 

One of the issues I'm running into is what you've already mentioned: it's hard to determine where a sentence begins and ends in legal writing with all of the U.S.C.S stuff (sometimes U.S.C.S is in the unstructured text, and sometimes not).
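One possible way to handle the sentence-boundary problem is a regex that splits after sentence punctuation only when a capital letter follows, and never right after the abbreviation "U.S.C.S.". This is a heuristic sketch in Python, not a full legal-text tokenizer; `SENTENCE_BREAK` and `split_sentences` are hypothetical names, and other abbreviations (for example "No." or "Inc.") would need their own exclusions.

```python
import re

# Split on whitespace that follows ., ! or ? and precedes a capital letter,
# unless the period is the last one in "U.S.C.S.".
SENTENCE_BREAK = re.compile(r"(?<=[.!?])(?<!U\.S\.C\.S\.)\s+(?=[A-Z])")


def split_sentences(text):
    """Heuristically split a paragraph into sentences."""
    return [part.strip() for part in SENTENCE_BREAK.split(text) if part.strip()]


sample = (
    "Income under 26 U.S.C.S. Section 116 is excluded. Blah blah. "
    "See 26 U.S.C.S. § 115 for details."
)
for sentence in split_sentences(sample):
    print(sentence)
```

The trade-off is that a sentence which genuinely ends with "U.S.C.S." will be glued to the next one, so the exclusion list is worth tuning against real paragraphs.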
