I am trying the new text mining tools to read in PDFs. I used the image template to configure the fields I want to extract. Although the fields are always labeled the same and appear in the same order, the size of the cells may vary because they are free-text fields that are converted to PDF.
I noticed that when I process multiple PDFs, some of them have their contents truncated. I assume this is because I drew the box on a template that had only 1 line of information, while another file had 3 lines.
Any suggestions as to how to handle this?
Thanks
Gina
@missgina Currently this needs to be done manually, but we have entity pair extraction coming up in the future, which will solve this problem.
Hi @missgina - You can either draw a bigger annotation (sized for 3 lines instead of 1 line, as in your example) or skip annotations completely: read the entire PDF in bulk and parse the text into the shape you need.
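To illustrate the second option, here is a minimal Python sketch. It assumes the full PDF text has already been read into a single string, and that the field labels are known and always appear in the same order, as described in the question. The function name `split_by_labels`, the labels, and the sample text are all hypothetical and only for illustration. This is not how the text mining tools work internally. The idea is to capture whatever sits between one label and the next, so a value spanning 1 line or 3 lines is handled the same way.

```python
import re


def split_by_labels(text, labels):
    """Return {label: value}, where each value is the text found between
    one label and the next (or the end of the text for the last label)."""
    # One capture group per label; each group lazily grabs everything up to
    # the next label, so multi-line values are not cut off.
    parts = [re.escape(label) + r"\s*:?\s*(.*?)\s*" for label in labels]
    pattern = "".join(parts) + r"\Z"
    match = re.search(pattern, text, flags=re.DOTALL)
    if match is None:
        raise ValueError("labels not found in the expected order")
    # Collapse line breaks inside a value into single spaces.
    return {
        label: " ".join(value.split())
        for label, value in zip(labels, match.groups())
    }


# Hypothetical sample: the "Comments" field spans three lines.
sample_text = (
    "Name: Jane Doe\n"
    "Comments: line one\n"
    "line two\n"
    "line three\n"
    "Status: Open\n"
)
sample_labels = ["Name", "Comments", "Status"]
fields = split_by_labels(sample_text, sample_labels)
print(fields)
```

This only works if a label never appears inside a field's own free text; if that can happen, the labels would need a more specific pattern (for example, requiring them at the start of a line).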