Hello everyone,
I'm struggling a little with searching for a keyword in a PDF. The issue is not in opening the PDF.
The issue is in the process of searching for the keyword.
I'm using a macro to search for the keyword in the PDF, but I'm not getting a usable result.
I would like the result to be something like this:
Contains "1234" |
Yes |
The column name should contain my keyword (1234 here), and it should change to whatever keyword I put in.
Thank you.
@MonaAlmutairi I think I can see what the issue is. You are seeing a table in the PDF, whereas the computer isn't. When I use the PDF to Text tool, it reads the information line by line, left to right, top to bottom, so out of the box there isn't any way to identify the table (I'm assuming there is a table because you mentioned a column). You will need to parse the information after the tool has read the PDF into Alteryx, and then you can apply your logic. This is how I would approach your problem using the PDF to Text tool.
However, if you found a way to identify the table using a different tool, I'd be happy to help develop the logic. Some sample data would be useful 😀.
All the best,
BS
@MonaAlmutairi right, I'll try to simulate what's happening.
First, test doc:
Using this tool: PDF to Text:
Look at the field 'Text': it contains all the information. Now we just need to use the Filter or Formula tool:
Now, on the True anchor, we can see that it found a line of information that matched "id: 1234".
Hopefully this helps with developing your logic. You can read in the PDF, then use the filter to test whether or not it contains the line of information you're after. If it doesn't, your PDF doesn't contain that line of information, and nothing will appear on the True anchor of the filter tool.
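If it helps to think about it in code, the Filter step behaves roughly like the Python sketch below. This is only an analogy and not how Alteryx implements the tool: the sample lines and the function name `split_by_keyword` are made up for illustration, and Python's `in` is case-sensitive.

```python
def split_by_keyword(lines, keyword):
    """Mimic a filter: return (true_rows, false_rows) for a keyword test."""
    true_rows = [line for line in lines if keyword in line]
    false_rows = [line for line in lines if keyword not in line]
    return true_rows, false_rows


# Hypothetical lines standing in for the 'Text' field of the extracted PDF
sample_lines = ["Invoice summary", "id: 1234", "total: 99.50"]

matched, unmatched = split_by_keyword(sample_lines, "id: 1234")
print(matched)    # rows that would land on the True anchor
print(unmatched)  # rows that would land on the False anchor
```

An empty `matched` list corresponds to nothing appearing on the True anchor.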
Note: Be careful with spaces. For example, compare "id: 1234" and "id:  1234". The second has an extra space, so a filter looking for the first would not find it. To account for this, rather than using the contains() formula, you can use REGEX_MATCH...
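I.e. REGEX_MATCH([insert column], ".*id:\s+1234.*"). This basically says there can be one or more whitespace characters between id: and 1234. The .* on either side is there because REGEX_MATCH checks the pattern against the whole field, from the first character to the last, so it lets the line contain other text around the id.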
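The same whitespace-tolerant idea can be tried outside Alteryx. Here is a minimal Python sketch, assuming the lines have already been extracted as plain strings; the helper name `line_has_id` is hypothetical. Python's `re.search` looks for the pattern anywhere in the string.

```python
import re

# One or more whitespace characters are allowed between "id:" and "1234"
ID_PATTERN = re.compile(r"id:\s+1234")


def line_has_id(line):
    """Return True if the line contains 'id:' followed by whitespace and 1234."""
    return ID_PATTERN.search(line) is not None


print(line_has_id("id: 1234"))     # single space
print(line_has_id("id:    1234"))  # several spaces
print(line_has_id("id:1234"))      # no space at all, so \s+ cannot match
```

If lines with no space at all should also count, `\s*` (zero or more) would be the pattern to use instead of `\s+`.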
All the best,
BS
@BS_THE_ANALYST That is wonderful!!! Thank you so much. How do you think we can apply this to a large number of files? I have 500+ PDFs to search in. What is the best way to apply the same logic in this case?
Also, I could not find the PDF to Text tool; I have the Intelligence Suite, but the tool isn't there.
@MonaAlmutairi To answer your question: yes, you could apply this to loads of files. It's just a matter of processing time. Should be okay though! Just test that it works for 20, then once you're confident, open the floodgates. We will need to edit the logic slightly to account for lots of files coming in. Currently, the workflow only accounts for one.
Firstly, this is where the PDF to text tool is:
Now for the workflow:
1) Drag the PDFs into a folder. I've got two in here for test purposes.
2) Drag a Directory tool onto the workflow. This will give us access to the file paths (we'll use these in the PDF tool).
3) Configure the Directory tool to point at the correct folder location.
4) You can now see the file paths for your PDFs in Alteryx (make sure you click Run on the canvas to load them in).
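For comparison, steps 1 to 4 amount to listing the PDF files in a folder and keeping their full paths. A rough Python sketch of that idea, using only the standard library; the function name `list_pdf_paths` and the folder argument are hypothetical, and this is not what the Directory tool does internally.

```python
from pathlib import Path


def list_pdf_paths(folder):
    """Return the sorted full paths of the PDF files directly inside `folder`."""
    return sorted(
        str(path.resolve())
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


# Example: list the PDFs in the current working directory
for full_path in list_pdf_paths("."):
    print(full_path)
```

Each returned string plays the role of the FullPath field that gets passed on to the PDF step.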
Now the rest is kinda like we did before:
First step, remove the additional columns from the directory tool output (we only need FullPath):
Next step: drag the PDF to Text tool onto the canvas. Make sure you set the option "Column with File Path" to the FullPath field. This will allow us to bring multiple PDFs through!
As you can see above in the field called 'File', I have both of those documents in here now.
Now let's employ the same logic as before, with a few tweaks. I realise you want a table output that tells us whether or not each file contains the keyword. Here is the logic I'd use for this (it's not the only way, just what came to mind):
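As a rough idea of what a per-file Yes/No table involves, here is a Python sketch. It is not the attached workflow: it assumes the extracted text arrives as (file, text) pairs, one per line, and the function name `keyword_table` and the sample rows are invented for illustration.

```python
import re


def keyword_table(rows, keyword):
    """Map each file to "Yes" if any of its lines has 'id:' + whitespace + keyword.

    rows: iterable of (file, text) pairs, one pair per extracted line.
    """
    pattern = re.compile(r"id:\s+" + re.escape(keyword))
    result = {}
    for file, text in rows:
        result.setdefault(file, "No")  # every file gets a row, even with no match
        if pattern.search(text):
            result[file] = "Yes"
    return result


# Hypothetical extracted lines from two test documents
sample_rows = [
    ("a.pdf", "Invoice summary"),
    ("a.pdf", "id: 1234"),
    ("b.pdf", "id: 9999"),
]

table = keyword_table(sample_rows, "1234")
print('File | Contains "1234"')
for file, answer in table.items():
    print(f"{file} | {answer}")
```

Files with no matching line still get a "No" row, which is what makes the output usable as a summary across hundreds of PDFs.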
I've attached the workflow. Let me know if you have any questions.
All the best,
BS
@BS_THE_ANALYST - lovely comprehensive answers. Even I could follow along! 😜