Alteryx Designer Desktop Discussions

missgina · 02-03-2021

I am trying the new text mining tools to read in PDFs. I have used the image template to configure the fields I want to extract. Even though the fields are always labeled the same, and in the same order, the size of the cells may vary as they are free text fields that are converted to PDF.

I noticed that as I processed multiple PDFs, some of them had their contents truncated. I'm assuming this is because I drew the box on a template that had only 1 line of information, while another file had 3 lines.

Any suggestions as to how to handle this?

Thanks

Gina

sprakasam · 02-03-2021

@missgina Currently it needs to be done manually. But we have entity pair extraction coming up in the future which will solve this problem.

ArtApa · 02-03-2021

Hi @missgina - You can either draw a bigger annotation (sized for 3 lines instead of 1, as in your example), or you can skip annotations completely, read the entire PDF in bulk, and parse the text into the shape you need.
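
To illustrate the second option, here is a minimal Python sketch of parsing bulk text by label rather than by box position. It assumes the full page text is already available as one string and that the labels always appear in the same order, as described in the question. The label names (`Name`, `Description`, `Status`), the sample text, and the `extract_fields` helper are all hypothetical and not part of any Alteryx tool.

```python
import re

def extract_fields(text, labels):
    """Return {label: value}, where each value is the text found between
    a label and the next label (or the end of the text for the last one)."""
    fields = {}
    for i, label in enumerate(labels):
        # The value ends where the next label starts, or at end of text.
        if i + 1 < len(labels):
            end = re.escape(labels[i + 1])
        else:
            end = r"\Z"
        # Optional colon after the label, then a lazy capture up to the boundary.
        pattern = re.escape(label) + r"\s*:?\s*(.*?)\s*(?=" + end + ")"
        match = re.search(pattern, text, flags=re.DOTALL)
        # Collapse line breaks so a 3-line value becomes a single string.
        fields[label] = " ".join(match.group(1).split()) if match else None
    return fields

# Hypothetical sample: the Description value spans three lines.
sample = """Name: Jane Doe
Description: First line of free text
second line of free text
third line
Status: Open"""

result = extract_fields(sample, ["Name", "Description", "Status"])
for label, value in result.items():
    print(f"{label}: {value}")
```

Because each value is bounded by the next label instead of a fixed box, it does not matter whether a field holds 1 line or 3. The sketch would break if a label word also appears inside a value, so real labels may need stricter anchoring (for example, requiring the label at the start of a line).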

Read in PDF using Text Mining tools