SOLVED

How to use R and Python to Parse Word Documents

ShaanM
Alteryx Alumni (Retired)

Several people I have spoken to recently have asked how to do this, and the question seems to crop up more and more.

 

I thought it would be useful to build two macros to help solve this challenge.

 

The two macros, one using R and one using Python, each take a feed of files and process them.

 

If using R, the package used is called 'officer', which you will need to install separately.

 

For Python, the package used is called 'docx2txt', which also needs to be installed separately.

 

It is a very basic example and there are a whole host of other packages that do something similar.

 

Here is the R code used:

 


 

 

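library(officer)

# "XXXX" is the placeholder the macro replaces with the file path
doc <- read_docx("XXXX")

# One row per document element (paragraphs, table cells, ...)
content <- docx_summary(doc)
head(content)

# Send the result to output anchor 3 of the R tool
write.Alteryx(content, 3)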

 

 

 

 

Here is the Python code used:


 

 

from ayx import Alteryx
Alteryx.installPackages("docx2txt")

 

 

 

 

from ayx import Alteryx
import pandas
import docx2txt

# Extract the document's text; 'XXXX' is the placeholder the macro replaces with the file path
text = docx2txt.process('XXXX')
print(text)

# Wrap the extracted text in a one-row pandas DataFrame
df = pandas.DataFrame({"text": [text]})

# Write the data frame to output anchor 1 for downstream processing
Alteryx.write(df, 1)
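
For anyone who cannot install extra packages: a .docx file is a zip archive whose main text lives in word/document.xml, so a rough approximation is possible with the standard library alone. The following is only a sketch under that assumption, not how docx2txt works internally. `extract_docx_paragraphs` is a hypothetical helper name, and it reads body text only (headers, footers, and footnotes live in other parts of the archive and are ignored).

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx file
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_paragraphs(path):
    """Return the text of each paragraph in the document body (hypothetical helper)."""
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> elements
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

The resulting list can be joined with newlines and wrapped in a DataFrame in the same way as the `text` variable in the code above.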

 

 

 

I packaged each method as a macro; in the code, 'XXXX' is a placeholder for the file name.
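
Outside a macro, the same one-file-at-a-time pattern can be sketched as a plain loop over a list of paths. This is an illustration rather than what the attached macros do internally: `parse_files` and its `extract` parameter are hypothetical names, and `extract` stands in for whatever function reads a single file (for example `docx2txt.process`).

```python
def parse_files(paths, extract):
    """Apply `extract` to each path and collect one result row per file.

    A failure on one file is recorded in the 'error' field instead of
    stopping the whole run.
    """
    rows = []
    for path in paths:
        try:
            rows.append({"file": path, "text": extract(path), "error": ""})
        except Exception as exc:
            rows.append({"file": path, "text": "", "error": str(exc)})
    return rows
```

In the Python tool, something like `pandas.DataFrame(parse_files(paths, docx2txt.process))` would then give one row per document, ready to pass to `Alteryx.write`.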

 

Attached are the workflow, the macros, and a test file.

 

 

Enjoy!!

 

Shaan Mistry
Co - Founder : datacurious.ai
36 REPLIES
ShaanM
Alteryx Alumni (Retired)

@mceleavey Unfortunately, that example probably won't lend itself well to that scenario.

 

Python or other R packages may get you closer. It might be worth starting a new thread and seeing if anyone has anything that could solve it.

mceleavey
17 - Castor

Cheers, Shaan. Thanks for all your help.

 

M.



Bulien

NickJ
Alteryx Alumni (Retired)

Hi everybody!

 

Looks like Python-DocX (https://python-docx.readthedocs.io/en/latest/) might be what you need, Chris, if you're going down the Python route.

 

Cheers,

Nick

Nick Jewell | datacurious.ai
mceleavey
17 - Castor


 

Dr. Nick, you magnificent human being!

I will give that a try now.




mceleavey
17 - Castor

Unfortunately, that doesn't appear to do what I need. I need to read in .doc files, including the values selected in a drop-down. I've found a Python package called Textract which claims to do this, but I can't get it working. I'm not at all competent with Python, so it's definitely user error. I've installed the package and defined the file paths so I can use it, but it returns a File Not Found error.

 

I'll keep plugging away...




G1
8 - Asteroid

Hi ShaanM,

 

OK thank you. I have sent Support an e-mail. I will respond to this thread if a solution is found.

G1
8 - Asteroid

Hi Everyone,

 

I spoke with Support and now have a solution that has worked for me for reading Word docs in the Python tool.

 

1) To install external packages, you have to run Alteryx as an administrator. Right-click the Alteryx icon and select 'Run as administrator'.


 

2) Drag the Python tool onto the canvas and click on it. Type the following into a cell:

 

from ayx import Alteryx
Alteryx.installPackages("docx2txt")

 

Keep the cell selected and hit the 'Run' button within the Python tool itself (not the Alteryx Run Workflow button).


 

This should install the package onto your computer. You should now be able to use it in the non-admin version of Alteryx. You should not have to repeat this admin step for this package (but you will if you want to install others).

 

3) Open up the non-admin version of Alteryx and drag a Python tool onto the workflow.

 

4) Now that the package is installed, you need to import it into your Python kernel. Type the following into a cell:

 

from ayx import Alteryx
import pandas
import docx2txt

 

You need to import pandas too, since the extracted text (a plain string) has to be wrapped in a pandas DataFrame before it can be passed on to the workflow.

 

5) Make sure that the Python tool is working in the directory where your Word document is saved; use the 'cd' (change directory) command to do this:

 

cd C:\Your\File\Path\Here\

 

6) Use the following statement to read in your Word Document:

 

docx2txt.process("File_Name.docx")
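
As an alternative to changing directory in step 5, the full path can be built and checked before it is handed to docx2txt. This is a minimal sketch: `resolve_document` is a hypothetical helper, and the folder and file names in the usage note are the same placeholders used in the steps above.

```python
from pathlib import Path


def resolve_document(folder, file_name):
    """Return the full path to a document, failing early with a clear message if it is missing."""
    path = Path(folder) / file_name
    if not path.is_file():
        raise FileNotFoundError(f"No such document: {path}")
    return str(path)
```

It could then be called as `docx2txt.process(resolve_document(r"C:\Your\File\Path\Here", "File_Name.docx"))`. The `r` prefix makes the string raw, so the backslashes in a Windows path are not treated as escape characters.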

 


 

Hope this works for everyone! I've attached my test Word docx file if anyone wants to use it.

 

I have not got the PDF read-in to work yet, but I will post an update if I do.
