Alteryx Designer Desktop Discussions

SOLVED

How to use R and Python to Parse Word Documents

ShaanM
Alteryx Alumni (Retired)

A lot of people I have spoken to recently have asked about this, and it seems to crop up more and more.

 

I thought it would be useful to build two macros to help solve this challenge.

 

The two macros, one using R and one using Python, each take a feed of file paths and process the files one by one.

 

If using R, the package used is called 'officer', which you will need to install separately.

 

For Python, the package used is called 'docx2txt', which also needs to be installed separately.

 

It is a very basic example and there are a whole host of other packages that do something similar.

 

Here is the R code used:

 


 

 

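# Load the officer package (install it separately first)
library(officer)

# Read the Word document; "XXXX" is the placeholder for the file path
doc <- read_docx("XXXX")

# Summarise the document content as a data frame and preview the first rows
content <- docx_summary(doc)
head(content)

# Write the data frame to output 3 of the R tool
write.Alteryx(content, 3)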

 

 

 

 

Here is the Python code used:


 

 

from ayx import Alteryx
Alteryx.installPackages("docx2txt")

 

 

 

 

from ayx import Alteryx
import pandas

import docx2txt

text = docx2txt.process('XXXX')

print(text)

# Put the extracted document text into a pandas DataFrame
df = pandas.DataFrame({"text":[text]})

# Write the data frame to the Alteryx workflow for downstream processing
Alteryx.write(df,1)
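
As a rough illustration of what a package like docx2txt is doing: a .docx file is a zip archive whose main body lives in word/document.xml. The sketch below pulls the paragraph text out using only the Python standard library. The function name extract_docx_text is made up for this example, and it skips headers, footers and images, so treat it as a sketch under those assumptions rather than a replacement for docx2txt.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def extract_docx_text(path):
    """Return the body text of a .docx file, one line per paragraph.

    Hypothetical helper for illustration; not part of docx2txt or Alteryx.
    """
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    # Each <w:p> is a paragraph; its visible text sits in <w:t> elements
    for para in root.iter(W_NS + "p"):
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The returned string can go into the same one-column DataFrame as above.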

 

 

 

I packaged each method as a macro, using 'XXXX' in the code as a placeholder for the file name.
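
To make the macro idea concrete outside of Alteryx: the placeholder is swapped for each incoming file path and the results are collected. A minimal sketch of that loop in plain Python is below. parse_files and its extractor argument are names invented for this example; in practice you might pass docx2txt.process as the extractor.

```python
def parse_files(paths, extractor):
    """Run `extractor` on every path and return one record per file.

    `extractor` is any callable that takes a file path and returns text,
    for example docx2txt.process. Errors are captured per file so one
    bad document does not stop the whole feed.
    """
    records = []
    for path in paths:
        try:
            records.append({"file": path, "text": extractor(path), "error": None})
        except Exception as exc:  # keep going and record what failed
            records.append({"file": path, "text": None, "error": str(exc)})
    return records
```

The list of records can be turned into a table with pandas.DataFrame(records) and written out with Alteryx.write in the same way as the single-file code.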

 

Attached are the workflow, the macros and a test file.

 

 

Enjoy!!

 

Shaan Mistry
Co - Founder : datacurious.ai
36 REPLIES
JTCairns
8 - Asteroid

Hi Shaan,

 

When I try to run docx2txt I get this:

 

Could not find a version that satisfies the requirement docx2txt (from versions: )
No matching distribution found for docx2txt

Can you give any guidance?


ShaanM
Alteryx Alumni (Retired)

Hi @JTCairns

What happens if you try running just this in Python:

 

from ayx import Alteryx
Alteryx.installPackages("docx2txt")

 

 

Does it attempt to install?

 

 

JTCairns
8 - Asteroid

Hi Shaan,

 

I get the error below. I think it may be a network security issue, but I have no way of changing that. If that is the cause, can a package be installed from a local file? Or is it something else?

 

Capture.PNG

ShaanM
Alteryx Alumni (Retired)

Hi @JTCairns

 

The packages can be downloaded outside of Alteryx and placed into this folder (default location): C:\Program Files\Alteryx\bin\Miniconda3\PythonTool_venv\Lib\site-packages

 

I will attempt to attach the file here. Unzip it, then place the 2 folders into the folder above, and it should then work.

 

Hope this helps.
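
After copying the folders, one quick way to check that the Python tool can see the package is to ask Python whether it can locate it. A minimal sketch using only the standard library (is_installed is a name made up for this example):

```python
import importlib.util


def is_installed(package_name):
    """Return True if Python can locate a top-level package on its search path."""
    return importlib.util.find_spec(package_name) is not None


# docx2txt is the package from this thread; any top-level package name works
print("docx2txt available:", is_installed("docx2txt"))
```

If this prints False after the copy, the folders have probably landed somewhere other than the site-packages folder the Python tool is using.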

JTCairns
8 - Asteroid

Thanks for this Shaan. I hope this works, but I don't have admin permission, so I will have to wait and see.

gururajb
6 - Meteoroid

Hi, I am getting this error while trying to parse a document using the R-based macro.

Has anyone come across this issue?

Please help.

Capture.PNG

ShaanM
Alteryx Alumni (Retired)

@gururajb 

 

What does the data look like going into the macro?

 

Check that it is represented as a full path, e.g. c:\datafolder\worddoc.docx
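
If it helps, here is a small sketch for checking the incoming paths before they reach the macro. It only inspects the text of each path and does not open the file. check_word_path is a name invented for this example, and the list of accepted extensions is an assumption.

```python
from pathlib import PureWindowsPath


def check_word_path(raw_path):
    """Return a list of problems found with a Windows file path (empty if none).

    Only the text of the path is inspected; the file itself is not opened.
    """
    problems = []
    path = PureWindowsPath(raw_path.strip())
    if not path.is_absolute():
        problems.append("not a full path")
    # Assumed set of extensions for this example
    if path.suffix.lower() not in (".docx", ".doc"):
        problems.append("unexpected extension: " + (path.suffix or "(none)"))
    return problems
```

For example, check_word_path("worddoc.docx") reports that the value is not a full path, while c:\datafolder\worddoc.docx passes both checks.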

gururajb
6 - Meteoroid
Hi Shaan, I realized that the file extension is .doc, which is not supported by the officer library.
I guess we will have to use a different package.
ShaanM
Alteryx Alumni (Retired)

Hi @gururajb

 

I tested on my end with a .doc and the R macro still pulls the data in OK.

 

Could you maybe send me a direct message with the file or upload it here?
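
One thing worth checking in the meantime, offered as a guess rather than a diagnosis: the extension does not always match the real format. A true .docx is a zip archive, while a legacy .doc is an OLE compound file, and the two start with different bytes. A small sketch for checking what a file really is (sniff_word_format is a made-up name for this example):

```python
ZIP_MAGIC = b"PK\x03\x04"                        # start of a zip archive (.docx)
OLE_MAGIC = b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"  # start of an OLE compound file (.doc)


def sniff_word_format(path):
    """Guess the container format from the first bytes of the file."""
    with open(path, "rb") as handle:
        header = handle.read(8)
    if header.startswith(ZIP_MAGIC):
        return "zip-based (likely .docx)"
    if header.startswith(OLE_MAGIC):
        return "OLE-based (likely legacy .doc)"
    return "unknown"
```

A file named .doc that comes back as zip-based is really a .docx underneath, which could explain why it parses for one person and not another.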
