
Alteryx Designer Desktop Discussions

Find answers, ask questions, and share expertise about Alteryx Designer Desktop and Intelligence Suite.

Find a keyword in a PDF

MonaAlmutairi
8 - Asteroid

Hello everyone,

 

 

I'm struggling a little with searching for a keyword in a PDF. The issue is not in opening the PDF; it is in the process of searching for the keyword.

 

I'm using a macro to search for the keyword in the PDF, but I'm not getting good results.

 

I would like the result to be something like this:

 

Contains "1234"
Yes

 

The column name contains my keyword (1234), and it should change to whatever keyword I search for.

 

Thank you.

BS_THE_ANALYST
15 - Aurora

@MonaAlmutairi I think I can see what the issue is. You are seeing a table in the PDF, but the computer isn't. The PDF to Text tool reads information line by line, left to right, top to bottom, so on its own it has no way to identify the table (I'm assuming there is a table because you mentioned a column). You will need to parse the information after the tool has read the PDF into Alteryx, and then apply your logic. This is how I would approach your problem using the PDF to Text tool.

However, if you found a way to identify the table using a different tool, I'd be happy to help develop the logic. Some sample data would be useful 😀.

All the best,
BS

 

MonaAlmutairi
8 - Asteroid

 

@BS_THE_ANALYST Basically, I'm reading many contracts and trying to find some IDs in them.
The table I attached above is the result I want to end up with. Whenever I search for an ID in the contracts, it should output a table with a flag of "Yes" or "No" (depending on whether the ID is there or not),
and the ultimate result I want is to identify which PDF contains which keyword :)
BS_THE_ANALYST
15 - Aurora

@MonaAlmutairi right, I'll try to simulate what's happening.

First, test doc:

BS_THE_ANALYST_1-1681295091886.png

 

Using this tool: PDF to Text: 

BS_THE_ANALYST_3-1681295165142.png

 


Look at the field 'Text': it contains all the information. Now we just need to use the Filter or Formula tool:

 

BS_THE_ANALYST_4-1681295278461.png

 

Now, on the True anchor, we can see that it found a line of information that matched "id: 1234".

 

Hopefully this helps with developing your logic. You can read in the PDF, then use the filter to test whether or not it contains the line of information you're after. If it doesn't, your PDF doesn't contain that line of information, and nothing will appear on the True anchor of the filter tool.

Note: Be careful with spaces. For example, "id: 1234" and "id:  1234" differ by one extra space, and a Contains() test for the first would not find the second. To account for this, you can use REGEX_MATCH instead of Contains().

For example: REGEX_MATCH([insert column], "id:\s+1234"). The \s+ allows one or more whitespace characters between id: and 1234. REGEX_MATCH tests the whole field value, so if the line holds other text around the ID, wrap the pattern as ".*id:\s+1234.*".
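For anyone who wants to check the pattern outside Alteryx, here is a minimal Python sketch of the same idea. It uses Python's `re` module rather than Alteryx's REGEX_MATCH, and the sample lines are made up for illustration, so treat it as a rough comparison rather than what the workflow does internally.

```python
import re

# Hypothetical lines, standing in for what a PDF-to-text step might return.
lines = ["id: 1234", "id:  1234", "id:1234", "name: Mona"]

# Plain substring test: only the exact single-space spelling is found.
plain_hits = [line for line in lines if "id: 1234" in line]

# Regex test: \s+ accepts one or more whitespace characters after "id:".
pattern = re.compile(r"id:\s+1234")
regex_hits = [line for line in lines if pattern.search(line)]

print(plain_hits)
print(regex_hits)
```

Changing `\s+` to `\s*` would also accept "id:1234" with no space at all, if your contracts are inconsistent about that.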

All the best,
BS

 

MonaAlmutairi
8 - Asteroid

@BS_THE_ANALYST That is wonderful!!! Thank you so much. How do you think we can apply this to a large number of files? I have 500+ PDFs to search. What is the best way to apply the same logic in this case?

 

Also, I could not find the PDF to Text tool. I have Intelligence Suite, but the tool isn't there.

BS_THE_ANALYST
15 - Aurora

@MonaAlmutairi To answer your question: yes, you can apply this to lots of files. It's just a matter of processing time, and it should be okay. Test that it works for 20 files first, then once you're confident, open the floodgates. We will need to edit the logic slightly to handle many files coming in, since the workflow currently accounts for only one.

Firstly, this is where the PDF to text tool is:

BS_THE_ANALYST_0-1681300105572.png


Now for the workflow:

BS_THE_ANALYST_1-1681300326133.png


1) Put the PDFs into a folder. I've got two in here for test purposes.
2) Drag a Directory tool onto the workflow. This gives us access to the file paths (we'll use them in the PDF to Text tool).
3) Configure the Directory tool to point to the correct folder location.
4) You can now see the file paths for your PDFs in Alteryx (make sure you click Run on the canvas to load them in).
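If it helps to see what the Directory tool step amounts to, here is a rough Python equivalent that lists the full paths of the PDFs in a folder. The function name `list_pdf_paths` and the folder name `contracts` are placeholders of mine, not anything from the workflow.

```python
from pathlib import Path


def list_pdf_paths(folder):
    """Return the full paths of the PDF files directly inside `folder`, sorted."""
    return sorted(
        str(path.resolve())
        for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    # "contracts" is a placeholder folder name, not one from this thread.
    folder = Path("contracts")
    if folder.is_dir():
        for full_path in list_pdf_paths(folder):
            print(full_path)
```

The extension check is case-insensitive, so files saved as `.PDF` are picked up as well; subfolders are not searched in this sketch.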

The rest is much like what we did before.
First, remove the additional columns from the Directory tool output (we only need FullPath):

BS_THE_ANALYST_2-1681300497955.png


Next, drag the PDF to Text tool onto the canvas. Make sure you set the option Column with File Path to the FullPath field. This will allow us to bring multiple PDFs through!

BS_THE_ANALYST_4-1681300557245.png

As you can see above in the field called 'File', I have both of those documents in here now. 

Now let's employ the same logic as before, with a few tweaks. I realise you want a table output that tells us whether or not the keyword is there. Here is the logic I'd use for this (it's not the only way, just what came to mind):

BS_THE_ANALYST_5-1681300956683.png
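The screenshot doesn't carry over as text, so as a rough stand-in, here is a Python sketch of the kind of Yes/No flag per file that was asked for. It is not the attached workflow's logic. The function `flag_files`, the file names, and the sample text are all made up, and the text is assumed to have been extracted from the PDFs already.

```python
def flag_files(texts_by_file, keyword):
    """Map each file name to "Yes" or "No" depending on whether its text has the keyword."""
    return {
        name: "Yes" if keyword in text else "No"
        for name, text in texts_by_file.items()
    }


# Made-up extracted text, standing in for the PDF to Text output.
texts_by_file = {
    "contract_a.pdf": "Contract A\nid: 1234\nSigned",
    "contract_b.pdf": "Contract B\nid: 9876\nSigned",
}

keyword = "1234"
flags = flag_files(texts_by_file, keyword)

# Print a small table whose column header carries the keyword.
print(f'File | Contains "{keyword}"')
for name, flag in flags.items():
    print(f"{name} | {flag}")
```

Because the keyword is a parameter, the same function can be run once per ID to see which PDF contains which keyword.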


I've attached the workflow. Let me know if you have any questions.

All the best,
BS





 

Shifty
12 - Quasar

@BS_THE_ANALYST - lovely comprehensive answers. Even I could follow along! 😜
