I have just installed a trial of the Intelligence Suite to test a use case: importing PDF files (e.g. incident tickets) using the Image Template, Image to Text and PDF Input tools. However, I am hitting a few problems.
The first few fields of each PDF are picked up correctly, but later fields are not.
I think this might be because a field is not in exactly the same place on each page, e.g. if a field above it runs over multiple lines. Is this because the tool only picks up text from a fixed position in the document, rather than text relative to other text (e.g. a header near it)?
Is there another way to do this?
That is correct. If you look at the Markup string that is output by the Image Template tool, you will see that it specifies the coordinates to pull from the image. In some scenarios you can work with a slightly larger region to accommodate shifting, but if the layout varies too much you may need to read in a larger portion of the document and apply some parsing techniques.
Has anyone else found any useful ways of parsing PDFs with varying numbers of pages or field sizes?
I've found it quite useful to use the find and replace tool to find each field header and replace it with the field header plus a £ sign, and then use text to columns to parse out the text I want.
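For anyone who wants to prototype the same idea outside the workflow, here is a minimal Python sketch of the delimiter trick. It assumes the PDF text has already been extracted into a single string, that the £ goes in front of each header, and that £ never appears in the text itself. The header names and the `split_on_headers` function are made up for illustration.

```python
import re

# Hypothetical field headers; replace with the headers in your own PDFs.
HEADERS = ["Ticket Number", "Reported By", "Description", "Resolution"]
DELIM = "£"  # assumed never to appear in the extracted text


def split_on_headers(text, headers=HEADERS, delim=DELIM):
    """Put a delimiter before each header, split on it, return {header: value}."""
    # Longest headers first so a short header never matches inside a longer one.
    ordered = sorted(headers, key=len, reverse=True)
    pattern = "|".join(re.escape(h) for h in ordered)
    marked = re.sub(f"({pattern})", lambda m: delim + m.group(1), text)

    fields = {}
    # Anything before the first header is ignored.
    for chunk in marked.split(delim)[1:]:
        for header in ordered:
            if chunk.startswith(header):
                value = chunk[len(header):].strip(" :\n\t")
                # Collapse line breaks so multi-line values become one string.
                fields[header] = " ".join(value.split())
                break
    return fields


sample = (
    "Ticket Number: INC001\n"
    "Reported By: A. Smith\n"
    "Description: Printer is\noffline again\n"
    "Resolution: Restarted spooler\n"
)
print(split_on_headers(sample))
```

Because the split is driven by the headers rather than by position, it does not matter how many lines a field takes up. The weak spot is a header word that also appears inside a value, which would cause an extra split.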
One problem I have run into is that sometimes some text does not come through the Intelligence Suite tools at all.
Has anyone found a solution for this? I am having the same issue.
Hi @madisonhoff, @PeterAP,
Another way could be to import the full PDF (without using a template) and then parse the text and look for keywords, similar to XML or HTML parsing.
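As a rough illustration of the keyword approach, the Python sketch below looks up a value by its position relative to a keyword rather than by page coordinates. It assumes the full text is already available as a string, and that a value sits either on the same line as its keyword or on the next non-empty line. The sample text and the `value_near` function are hypothetical.

```python
def value_near(lines, keyword):
    """Return the text right after `keyword`, or the next non-empty line."""
    for i, line in enumerate(lines):
        pos = line.lower().find(keyword.lower())
        if pos == -1:
            continue
        # Case 1: the value follows the keyword on the same line.
        rest = line[pos + len(keyword):].strip(" :\t")
        if rest:
            return rest
        # Case 2: the value is on the next non-empty line.
        for following in lines[i + 1:]:
            if following.strip():
                return following.strip()
    return None  # keyword not found, or nothing after it


full_text = (
    "INCIDENT REPORT\n"
    "Incident ID: INC0012345\n"
    "Priority\n"
    "\n"
    "High\n"
    "Summary: Email outage\n"
)
lines = full_text.splitlines()
for key in ("Incident ID", "Priority", "Summary"):
    print(key, "->", value_near(lines, key))
```

Since the lookup is anchored on the keyword, a field that moves down the page because of longer text above it is still found. Values that span several lines would need an extra rule, such as reading until the next known keyword.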