
SOLVED

How to use R and Python to Parse Word Documents

ShaanM
Alteryx Alumni (Retired)

Several people I have spoken to recently have asked how to do this, and the question seems to crop up more and more.

I thought it would be useful to build two macros to help solve this challenge.

Using either R or Python, the two macros take a feed of files and process them.

If using R, the package used is called 'officer', which you will need to install separately.

For Python, the package used is called 'docx2txt', which also needs to be installed separately.

It is a very basic example, and there are a whole host of other packages that do something similar.

 

Here is the R code used:

 


 

 

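library(officer)

# "XXXX" is a placeholder; the macro replaces it with the path of each input file
doc <- read_docx("XXXX")

# Summarise the document content as a data frame
content <- docx_summary(doc)
head(content)

# Write the data frame to output 3 of the R tool
write.Alteryx(content, 3)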

 

 

 

 

Here is the Python code used:


 

 

from ayx import Alteryx
Alteryx.installPackages("docx2txt")

 

 

 

 

from ayx import Alteryx
import pandas

import docx2txt

# 'XXXX' is a placeholder; the macro replaces it with the path of each input file
text = docx2txt.process('XXXX')

print(text)

# Put the extracted document text into a one-row pandas DataFrame
df = pandas.DataFrame({"text":[text]})

# Write the data frame to the Alteryx workflow for downstream processing
Alteryx.write(df,1)

 

 

 

I packaged each method as a macro, with 'XXXX' in the code acting as a placeholder for the file name.
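
As a rough illustration of the placeholder idea, here is a minimal sketch of swapping 'XXXX' for each incoming file path before the script runs. The function name `fill_placeholder` and the example paths are hypothetical, and the post does not say how the attached macros perform the substitution; the backslash-to-forward-slash step is an assumption, added because backslashes inside a quoted string literal can be read as escape sequences.

```python
def fill_placeholder(script_template, file_path, placeholder="XXXX"):
    """Return the script text with the placeholder swapped for one file path."""
    # Assumption: forward slashes avoid backslash escapes in a quoted literal
    safe_path = file_path.replace("\\", "/")
    return script_template.replace(placeholder, safe_path)


# Hypothetical one-line template mirroring the Python code in this post
template = "text = docx2txt.process('XXXX')"

# Hypothetical feed of file paths
for path in [r"C:\docs\report.docx", r"C:\docs\notes.docx"]:
    print(fill_placeholder(template, path))
```

Each pass through the loop produces one copy of the script line pointing at a different file, which is the effect the macro achieves per input record.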

 

Attached are the workflow, the macros, and a test file.

 

 

Enjoy!!

 

Shaan Mistry
Co - Founder : datacurious.ai
mceleavey
17 - Castor

Legend.

 

I've done all that, but I'm now getting an rlang version error. It seems to be unpacking and using an older version, so I need to override that somehow. Any ideas?

Sorry for bothering you Shaan, but if I can get this working I can knock off early and go jet-skiing with movie stars.

Maybe not, but still...



Bulien

ShaanM
Alteryx Alumni (Retired)

@mceleavey 

 

Double-check that Alteryx is using the correct version of R.

It sounds like there is a mismatch somewhere.

For Alteryx 2019.4, the R version should be R-3.5.3.

Check the version of Designer and make sure it is correct. Also check that, if Designer is the non-admin install, the R install is non-admin too.

downloads.alteryx.com is where you can download older versions, both admin and non-admin.

Failing that, try this:

Browse to this location: C:\Program Files\Alteryx\R-3.5.3\library

See if you have an officer folder in that location.

I have uploaded a zip of mine. Once it is unzipped, replace your officer folder with mine and try again.

mceleavey
17 - Castor

Thanks @ShaanM ,

 

I've actually done all of that and it moves on to another error each time. I'm now having a problem with Zip. It gave me the same error, so I downloaded the latest version, it gave an error saying it can't uninstall the previous version. I unpacked the zip and copied the zip folder into the library but now it's just returning a zip error:

"Cannot open zip file for reading"

 

I'm stuck.

 

I've checked the versions and all is well. I'm on 2019.4 and the correct version of R is being used...

 

M.




ShaanM
Alteryx Alumni (Retired)

@mceleavey 

 

Try taking the following components out of the zip folder I sent and copying them into the officer folder in your location:

 

R and Libs

 

 

ShaanM
Alteryx Alumni (Retired)

@mceleavey 

 

Also try running RGUI.exe as admin (right-click, then "Run as administrator").

mceleavey
17 - Castor

I'm getting a zip error. Copying the folders into my library folder for officer did not change anything.

I'm trying to convert .doc, not .docx so is there anything I need to change in the R macro? I tried changing the references to docx to doc and that caused an error.

zip error: 'Cannot open zip file 'C:\Users\\xxxxxx\AppData\Local\Temp\xxxxxx.doc' for reading in file zip.c:238'




ShaanM
Alteryx Alumni (Retired)

@mceleavey 

 

It sounds like the input is what is causing that error.

Can you create a new folder on the machine and place one Word doc in it?

Then use that location as the input.

If that test runs OK, you know the macro itself works, so the problem is more likely related to the original input.

mceleavey
17 - Castor

I tried that, however, it seems to work if I use a .docx input. The error only seems to occur when I try a .doc input.

Is there anything that needs changing on the macro itself?
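
The "Cannot open zip file" message is consistent with the input not being a zip archive: a .docx file is a zip container, whereas a legacy binary .doc file is not. The following is a minimal sketch, not part of the attached macros, for checking which kind of file you actually have regardless of its extension. The function name `word_file_kind` is hypothetical.

```python
import zipfile

# Magic bytes at the start of a legacy OLE2 compound file (binary .doc)
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def word_file_kind(path):
    """Classify a file as 'docx', 'legacy-doc', or 'unknown' from its contents."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            # A .docx archive stores its main body in word/document.xml
            if "word/document.xml" in zf.namelist():
                return "docx"
        return "unknown"
    with open(path, "rb") as fh:
        if fh.read(8) == OLE2_MAGIC:
            return "legacy-doc"
    return "unknown"
```

A file reported as 'legacy-doc' would probably need to be saved as .docx before a zip-based reader can open it; a file with a .doc extension that is reported as 'docx' may explain why some .doc inputs work and others do not.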




ShaanM
Alteryx Alumni (Retired)

@mceleavey I just tested on my end with a .doc file, and that works.

It might be how that file is formatted or was created.

I am adding a Word doc for you to test.

 

mceleavey
17 - Castor

Noooooooooo!

 

I've just realised the problem. I need to extract the text held within the Word docs that was selected from a drop-down. The .docx version runs without errors but does not return the value selected in a drop-down. The .doc version simply returns an error if there is a drop-down within the document.

Is there an option to access the actual XML to return the data?

 

I've attached an example of what I'm trying to do. The problem section is specifically this part:

 

mceleavey_0-1579610796519.png
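
On reading the XML directly: a .docx file is a zip archive whose main body is stored in word/document.xml, so the drop-down values can in principle be pulled from there. Below is a minimal sketch using only the Python standard library. It assumes the drop-downs are either legacy form fields (`w:ddList`, where the choice is stored as an index rather than as text) or content controls (`w:dropDownList` or `w:comboBox`); which kind the attached document uses is not known from this thread. The function name `dropdown_selections` is hypothetical, and the sketch does not handle binary .doc files.

```python
import zipfile
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"


def dropdown_selections(docx_path):
    """Return the selected entry of each drop-down found in word/document.xml."""
    with zipfile.ZipFile(docx_path) as zf:
        root = ET.fromstring(zf.read("word/document.xml"))

    selections = []

    # Legacy form-field drop-downs: <w:ddList> lists the entries, and
    # <w:result w:val="n"/> holds the zero-based index of the chosen one.
    for dd in root.iter(f"{{{W}}}ddList"):
        entries = [e.get(VAL) for e in dd.findall("w:listEntry", NS)]
        chosen = dd.find("w:result", NS)
        if chosen is None:
            chosen = dd.find("w:default", NS)
        raw = chosen.get(VAL) if chosen is not None else None
        index = int(raw) if raw is not None else 0
        if 0 <= index < len(entries):
            selections.append(entries[index])

    # Content-control drop-downs: the displayed choice is ordinary run text
    # inside <w:sdtContent>.
    for sdt in root.iter(f"{{{W}}}sdt"):
        props = sdt.find("w:sdtPr", NS)
        if props is None:
            continue
        is_list = props.find("w:dropDownList", NS) is not None
        is_combo = props.find("w:comboBox", NS) is not None
        if not (is_list or is_combo):
            continue
        content = sdt.find("w:sdtContent", NS)
        if content is not None:
            text = "".join(t.text or "" for t in content.iter(f"{{{W}}}t"))
            selections.append(text)

    return selections
```

Note that the sketch returns legacy form-field selections first and content-control selections second, so the list is not in document order. The resulting list could be placed in a pandas DataFrame and written to the workflow in the same way as the Python code earlier in this thread.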



