Text mining tool and Regex

sriniprad08
11 - Bolide

Hi Team,

 

I am trying to use the new text mining tools to extract information from a PDF. The extracted information comes out in separate rows, but I would like the headers as columns and the values as rows beneath them. Can you please help?

Please find the snapshot below:

sriniprad08_0-1603551592460.png

For example, "Date of expense" should be a column header, with the date in the row below it.

 

Thanks

11 REPLIES
AkimasaKajitani
17 - Castor

I may not have understood your intention, but I made a workflow.

There are some problems, e.g. the same field name (Date of Expense) appears more than once.

 

AkimasaKajitani_0-1603592070780.png
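For readers who want to see the reshaping idea outside Alteryx, here is a minimal Python sketch. It assumes the extracted text arrives as alternating label and value rows, which the thread does not confirm; the function name `pivot_label_value_rows` and the sample values are hypothetical. Repeated labels such as "Date of Expense" are collected into one column rather than overwriting each other.

```python
from collections import defaultdict


def pivot_label_value_rows(rows):
    """Turn alternating label/value rows into one column per label.

    Assumes rows[0], rows[2], ... are labels and rows[1], rows[3], ...
    are the matching values (an assumption about the extracted layout).
    """
    columns = defaultdict(list)
    # Pair each label row with the value row that follows it.
    for label, value in zip(rows[0::2], rows[1::2]):
        columns[label.strip()].append(value.strip())
    return dict(columns)


# Hypothetical sample rows, not taken from the attached PDF.
rows = [
    "Date of Expense", "2020-10-01",
    "Invoice No", "A-001",
    "Date of Expense", "2020-10-02",
]
table = pivot_label_value_rows(rows)
print(table)
```

Because a label can repeat, each column holds a list of values; a real workflow would still need a rule for which values belong to the same record.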

 

ArtApa
Alteryx

Hi @sriniprad08 - PDF or a data sample from the Image to Text tool would help.

sriniprad08
11 - Bolide

Thank you for the workflow. The intention is to extract the fields from the PDF and convert them into a tabular format. Please find the sample PDF attached.

sriniprad08
11 - Bolide

Hi @ArtApa 

 

Please find the PDF in my comments. Thanks.

AkimasaKajitani
17 - Castor

You can add annotations to the template PDF file in the Image Template tool.

 

AkimasaKajitani_0-1603617620396.png

 

Alteryx can then read the data in each annotation as a separate field.

 

AkimasaKajitani_1-1603617733714.png

 

sriniprad08
11 - Bolide

Hi @AkimasaKajitani 

 

Thank you so much. Can you please share the workflow?

 

Cheers,

Srinivas

sriniprad08
11 - Bolide

Hi @AkimasaKajitani ,

 

Thank you for the reply. Is it possible to remove the label text from the row? For example, in the row below, keep only the invoice number and drop the text "Invoice No", leaving just BLR_WFL0..?

sriniprad08_0-1603701048090.png

 

AkimasaKajitani
17 - Castor

Please use attached workflow.

Save the two files in the same folder and run the workflow.

 

AkimasaKajitani
17 - Castor

You can use the RegEx tool (or the REGEX_Replace function in the Formula tool).

 

AkimasaKajitani_0-1603703485294.png

If you want to do it for all fields at once, you can use the Multi-Field Formula tool.
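To illustrate the regular-expression idea outside Alteryx, here is a small Python sketch. It assumes each field value starts with its own label, optionally followed by a colon; the helper `strip_label` and the sample values are hypothetical, and the actual expression in the screenshot may differ.

```python
import re


def strip_label(value, label):
    """Remove a leading label (and an optional ':') from value."""
    # re.escape keeps characters in the label from being read as regex syntax.
    pattern = r"^\s*" + re.escape(label) + r"\s*:?\s*"
    return re.sub(pattern, "", value, count=1, flags=re.IGNORECASE)


# Hypothetical record: field name -> extracted text that still includes the label.
record = {
    "Invoice No": "Invoice No: ABC_123",
    "Date of Expense": "Date of Expense 2020-10-01",
}

# Apply the same rule to every field, similar in spirit to a multi-field formula.
cleaned = {field: strip_label(value, field) for field, value in record.items()}
print(cleaned)
```

Anchoring the pattern at the start of the string means a label that appears later inside a value is left alone.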

 

I checked the output from the PDF files: the fields contain unnecessary line breaks, which you can remove with the Data Cleansing tool.
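The same cleanup can be sketched in Python. This is only an illustration under the assumption that the unwanted characters are carriage returns and line feeds; the function name `remove_line_breaks` is hypothetical and it does not reproduce the Data Cleansing tool's exact behaviour.

```python
import re


def remove_line_breaks(text):
    """Replace each run of line breaks with one space, then trim the ends."""
    # Also swallow spaces or tabs around the break so no double spaces remain.
    return re.sub(r"\s*[\r\n]+\s*", " ", text).strip()


# Hypothetical extracted values containing stray line breaks.
print(remove_line_breaks("ABC_123\n"))
print(remove_line_breaks("Date of\r\nExpense"))
```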

 

 
