Index / Concordance from PDF File

jkliewer075
5 - Atom

I need to create an index/concordance from a PDF file. Can Alteryx do this?

 

I have a PDF file that has been OCR'd. I want to create an index/concordance (I can't figure out the best word to use here) whose output is a list of keywords and their locations within the PDF file. Think of the index at the end of a textbook that shows every location of the word "cardinal" or "bluejay" or "oriole." A page number for the location would be acceptable; ideally, though, instead of a page number it would show the chapter, section, and paragraph in which the word appears.

 

Example:

 

cardinal--Bird Chapter, Red Bird Section, Paragraph 1

bluejay--Bird Chapter, Blue Bird Section, Paragraph 4

oriole--Bird Chapter, Orange Bird Section, Paragraph 2

         --Bird Chapter, Black Bird Section, Paragraph 3

 

Any suggestions?

 

 

Yoshiro_Fujimori
15 - Aurora

Hi @jkliewer075 ,

 

If you have an Intelligence Suite license, it includes a group of tools called "Computer Vision".

The article below may be helpful.

https://community.alteryx.com/t5/Data-Science/Unlocking-Insights-from-Images-using-Computer-Vision/b...

 

Good luck.

gawa
16 - Nebula

If you don't have Intelligence Suite, one option is to use the pdfminer library in the Python tool. With this library you can extract each piece of text together with its (x, y) coordinates, so if you can identify where each paragraph starts and ends in (x, y), you can work out which paragraph each piece of text falls in.

https://pypi.org/project/pdfminer/
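
A rough sketch of how this could look in the Python tool, under some assumptions: it uses the maintained pdfminer.six fork (`extract_pages` and `LTTextContainer`), and the helper names `extract_text_boxes`, `build_concordance`, and `format_concordance` are made up for illustration. Mapping each text box to a "chapter, section, paragraph" label depends on your document's layout and is not shown; here the labels are simply typed in by hand.

```python
import re
from collections import defaultdict


def extract_text_boxes(pdf_path):
    """Yield (page_number, bbox, text) for every text box in the PDF.

    Assumes pdfminer.six is installed. It is imported lazily so the
    concordance helpers below still work without it.
    bbox is (x0, y0, x1, y1) in PDF points, origin at bottom-left.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, page_layout in enumerate(extract_pages(pdf_path), start=1):
        for element in page_layout:
            if isinstance(element, LTTextContainer):
                yield page_number, element.bbox, element.get_text().strip()


def build_concordance(paragraphs, keywords):
    """Map each keyword to the locations (in reading order) where it occurs.

    paragraphs: iterable of (location_label, text) pairs.
    Matching is case-insensitive and whole-word only, so "cardinals"
    does not count as a hit for "cardinal".
    """
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    index = defaultdict(list)
    for location, text in paragraphs:
        for kw, pattern in patterns.items():
            if pattern.search(text):
                if location not in index[kw]:
                    index[kw].append(location)
    return dict(index)


def format_concordance(index):
    """Render the index like a textbook index, one location per line."""
    lines = []
    for kw in sorted(index):
        for i, location in enumerate(index[kw]):
            # Repeat locations are indented under the keyword.
            prefix = kw if i == 0 else " " * len(kw)
            lines.append(f"{prefix}--{location}")
    return "\n".join(lines)


# Hypothetical, hand-labelled paragraphs standing in for real PDF output.
sample = [
    ("Bird Chapter, Red Bird Section, Paragraph 1", "The cardinal is red."),
    ("Bird Chapter, Orange Bird Section, Paragraph 2", "An oriole sings."),
    ("Bird Chapter, Black Bird Section, Paragraph 3", "Some Oriole species are black."),
]
print(format_concordance(build_concordance(sample, ["cardinal", "bluejay", "oriole"])))
```

If page numbers are good enough, the page number yielded by `extract_text_boxes` can be used directly as the location label; keywords with no hits (like "bluejay" in the sample) are simply left out of the index.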
