Text mining tool and Regex

sriniprad08
11 - Bolide

Hi Team,

 

I am trying to use the newly released text mining tools to extract information from a PDF. The extracted information ends up in separate rows, and I would like the headers as columns and the values as rows beneath them. Can you please help?

Please find the screenshot below:

sriniprad08_0-1603551592460.png

For example, Date of Expense should be a column header, with the date as the value below it.

 

Thanks

11 REPLIES
AkimasaKajitani
17 - Castor

I may not have fully understood your intention, but I made a workflow.

There are some problems, e.g. duplicate field names (Date of Expense).
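
For readers following along outside Alteryx, the reshaping described above can be sketched in Python. This is only an illustration under stated assumptions, not necessarily what the workflow does: the function name `pivot_label_value_rows`, the sample labels, and the sample values are all made up, and the rule "start a new record when a label repeats" is just one possible way to handle duplicate field names.

```python
def pivot_label_value_rows(rows):
    """Turn (label, value) pairs into a list of records (one dict per record).

    Assumption: a repeated label (e.g. a second "Date of Expense")
    marks the start of a new record.
    """
    records = []
    current = {}
    for label, value in rows:
        if label in current:
            # The label was already seen, so close the current record.
            records.append(current)
            current = {}
        current[label] = value
    if current:
        records.append(current)
    return records


# Hypothetical sample data, not taken from the attached PDF.
rows = [
    ("Date of Expense", "2020-10-01"),
    ("Amount", "120.00"),
    ("Date of Expense", "2020-10-05"),
    ("Amount", "80.50"),
]
table = pivot_label_value_rows(rows)
for record in table:
    print(record)
```

Each dictionary in `table` corresponds to one output row, with the labels acting as column headers.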

 

AkimasaKajitani_0-1603592070780.png

 

ArtApa
Alteryx

Hi @sriniprad08 - a sample PDF or a data sample from the Image to Text tool would help.

sriniprad08
11 - Bolide

Thank you for the workflow. The intention is to extract the fields from the PDF and convert them into a tabular format. Please find the sample PDF attached.

sriniprad08
11 - Bolide

Hi @ArtApa 

 

Please find the PDF in my comment above. Thanks.

AkimasaKajitani
17 - Castor

You can add annotations to the template PDF file in the Image Template tool.

 

AkimasaKajitani_0-1603617620396.png

 

Alteryx can then read the data under each annotation as a separate field.

 

AkimasaKajitani_1-1603617733714.png

 

sriniprad08
11 - Bolide

Hi @AkimasaKajitani 

 

Thank you so much. Can you please share the workflow?

 

Cheers,

Srinivas

sriniprad08
11 - Bolide

Hi @AkimasaKajitani ,

 

Thank you for the reply. Is it possible to remove the label text from the row? For example, in the row below I would like to keep only the invoice number (like BLR_WFL0..) and not the label text (Invoice No).

sriniprad08_0-1603701048090.png

 

AkimasaKajitani
17 - Castor

Please use the attached workflow.

Save the two files in the same folder and run the workflow.

 

AkimasaKajitani
17 - Castor

You can use the RegEx tool (or the REGEX_Replace function in the Formula tool).

 

AkimasaKajitani_0-1603703485294.png
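
As a rough illustration of the idea, here is the same kind of replacement written with Python's `re` module. The pattern, the helper name `strip_label`, and the sample value `ABC_0001` are assumptions for this sketch; the actual expression used in the workflow is only visible in the screenshot, and the real invoice numbers will differ.

```python
import re

# Assumed pattern: the label "Invoice No" at the start of the field,
# optionally followed by a period, colon, or hyphen and any whitespace.
LABEL_PATTERN = re.compile(r"^\s*Invoice\s+No\.?\s*[:\-]?\s*", re.IGNORECASE)


def strip_label(text):
    """Remove the leading 'Invoice No' label and keep only the value."""
    return LABEL_PATTERN.sub("", text, count=1).strip()


# Hypothetical sample values, not taken from the attached PDF.
cleaned = strip_label("Invoice No: ABC_0001")
cleaned_multiline = strip_label("Invoice No.\nABC_0001")
print(cleaned)
print(cleaned_multiline)
```

Replacing the matched label with an empty string leaves only the invoice number. A similar pattern could presumably be adapted for REGEX_Replace, although the exact expression depends on how the label appears in your data.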

If you want to apply it to several fields at once, you can use the Multi-Field Formula tool.

 

I checked the output from the PDF files; the fields contain unnecessary line breaks, which you can remove with the Data Cleansing tool.
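
For comparison, the same cleanup applied to every field of a record can be sketched in Python. This is not a reproduction of the Data Cleansing tool's exact behavior: the helper name `clean_record` and the sample record are hypothetical, and the sketch simply collapses any run of whitespace (including line breaks) into a single space.

```python
import re


def clean_record(record):
    """Collapse line breaks and repeated whitespace in every string field."""
    return {
        key: re.sub(r"\s+", " ", value).strip() if isinstance(value, str) else value
        for key, value in record.items()
    }


# Hypothetical record with stray line breaks, as might come out of OCR.
cleaned_record = clean_record({"Vendor": "Example\nTraders ", "Amount": 120.0})
print(cleaned_record)
```

Non-string values are passed through unchanged, so numeric fields are not affected.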

 

 
