Index / Concordance from PDF File
I need to create an index/concordance from a PDF file. Can Alteryx do this?
I have a PDF file that has been OCR'd. I want to create an index/concordance (I can't figure out the best word to use here) where the output is a list of keywords and their locations within the PDF file. Think of the index at the end of a textbook that shows every location of the word "cardinal" or "bluejay" or "oriole." A page number for each location would be acceptable; ideally, though, instead of a page number it would show the chapter, section, and paragraph in which the word appears.
Example:
cardinal--Bird Chapter, Red Bird Section, Paragraph 1
bluejay--Bird Chapter, Blue Bird Section, Paragraph 4
oriole--Bird Chapter, Orange Bird Section, Paragraph 2
--Bird Chapter, Black Bird Section, Paragraph 3
Any suggestions?
Hi @jkliewer075,
If you have an Intelligence Suite license, it includes a group of tools called "Computer Vision".
The article below may be helpful.
Good luck.
If you don't have Intelligence Suite, one option is to use the pdfminer library in the Python tool. It returns each piece of text together with its (x, y) coordinates, so if you can identify where each paragraph starts and ends in (x, y), you can work out which paragraph a given piece of text falls in.
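Here is a rough sketch of that idea, not a finished solution. It assumes the pdfminer.six package is installed in the Python tool's environment, and it makes the simplifying assumption that each text box pdfminer reports is one paragraph, numbered top to bottom on each page. The function names `extract_blocks` and `build_index` are made up for this example. Mapping a page and paragraph to a chapter and section would still need extra logic (for example, detecting headings), which is not shown.

```python
import re
from collections import defaultdict


def extract_blocks(pdf_path):
    """Yield (page_number, x0, y1, text) for each text box in the PDF.

    Requires pdfminer.six; imported lazily so build_index can be used
    on its own with text obtained some other way.
    """
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextContainer

    for page_number, layout in enumerate(extract_pages(pdf_path), start=1):
        for element in layout:
            if isinstance(element, LTTextContainer):
                x0, y0, x1, y1 = element.bbox  # PDF origin is bottom-left
                yield page_number, x0, y1, element.get_text().strip()


def build_index(blocks, keywords):
    """Map each keyword to a list of (page, paragraph) locations.

    Assumption: one text box == one paragraph, and paragraphs are read
    top to bottom (larger y first). Multi-column pages would need x too.
    """
    patterns = {
        kw: re.compile(r"\b" + re.escape(kw) + r"\b", re.IGNORECASE)
        for kw in keywords
    }
    # Sort by page, then by top edge descending (top of page first).
    ordered = sorted(blocks, key=lambda b: (b[0], -b[2]))

    index = defaultdict(list)
    paragraph_on_page = defaultdict(int)
    for page, _x0, _y1, text in ordered:
        if not text:
            continue  # skip empty boxes so they don't consume a number
        paragraph_on_page[page] += 1
        location = (page, paragraph_on_page[page])
        for kw, pattern in patterns.items():
            if pattern.search(text):
                index[kw].append(location)
    return dict(index)
```

To run it against a real file you would call something like `build_index(extract_blocks("book.pdf"), ["cardinal", "bluejay", "oriole"])`, where `book.pdf` is a placeholder path. The match is whole-word and case-insensitive, so "Orioles" would not count as a hit for "oriole"; loosen the pattern if you want plurals included.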
