
How to use R and Python to Parse Word Documents

ShaanM
Alteryx Alumni (Retired)

A lot of people I have been speaking to recently have asked about this, and it seems to crop up more and more.

 

I thought it would be useful to build two macros to help solve this challenge.

 

Using either R or Python, the two macros take a feed of file paths and process each file in turn.

 

If using R, the package used is called 'officer', which you will need to install separately.

 

For Python, the package used is called 'docx2txt', which also needs to be installed separately.

 

It is a very basic example and there are a whole host of other packages that do something similar.

 

Here is the R code used:

 


 

 

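library(officer)

# "XXXX" is the placeholder that the macro replaces with the file path
doc <- read_docx("XXXX")

# Summarise the document body as a data frame
content <- docx_summary(doc)
head(content)

# Write the data frame to output 3 of the R tool
write.Alteryx(content, 3)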

 

 

 

 

Here is the Python code used:


 

 

from ayx import Alteryx
Alteryx.installPackages("docx2txt")

 

 

 

 

from ayx import Alteryx
import pandas

import docx2txt

text = docx2txt.process('XXXX')

print(text)

#Turn the variable holding the document text into a pandas DataFrame
df = pandas.DataFrame({"text":[text]})

#Write the data frame to Alteryx workflow for downstream processing
Alteryx.write(df,1)
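
If docx2txt cannot be installed, a rough fallback is possible with the Python standard library alone, because a .docx file is a ZIP archive whose main body is stored in word/document.xml. The sketch below is not what the macro uses; the helper name extract_docx_text is hypothetical, and it only reads body paragraphs, so docx2txt will generally return more complete text.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"

def extract_docx_text(path):
    """Return the body text of a .docx, one paragraph per line (hypothetical helper)."""
    # A .docx is a ZIP archive; the body lives in word/document.xml
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join all text runs (<w:t>) that belong to this paragraph
        runs = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(runs))
    return "\n".join(paragraphs)
```

The returned string can be wrapped in a data frame in the same way as the docx2txt output above.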

 

 

 

I packaged each method as a macro; in the code, 'XXXX' is a placeholder that the macro replaces with the file name.
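
Outside of a macro, the same "one file in, one row out" idea can be sketched as a plain loop. This is only an illustration: parse_files is a hypothetical helper, and the extract argument stands for whatever function reads a single file (docx2txt.process, for example).

```python
def parse_files(paths, extract):
    """Apply `extract` to each path and collect one row per file (hypothetical helper)."""
    rows = []
    for path in paths:
        try:
            rows.append({"file": path, "text": extract(path), "error": ""})
        except Exception as exc:
            # Record the failure and carry on with the remaining files
            rows.append({"file": path, "text": "", "error": str(exc)})
    return rows
```

In the Python tool, something like pandas.DataFrame(parse_files(paths, docx2txt.process)) could then be passed to Alteryx.write, assuming paths holds the incoming file names.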

 

Attached are the workflow, the macros, and a test file.

 

 

Enjoy!!

 

Shaan Mistry
Co - Founder : datacurious.ai
JTCairns
8 - Asteroid

Hi Shaan,

 

When I try to run docx2txt I get this:

 

Could not find a version that satisfies the requirement docx2txt (from versions: )
No matching distribution found for docx2txt

Can you give any guidance?


ShaanM
Alteryx Alumni (Retired)

hi @JTCairns 

 

What if you try running just this in Python:

 

from ayx import Alteryx
Alteryx.installPackages("docx2txt")

 

 

Does it attempt to install?

 

 

JTCairns
8 - Asteroid

Hi Shaan,

 

I get the below. I think it may be a network security issue, but I have no way of changing that. If it is, can a package be installed from a local file? Or is it something else?

 

[Screenshot attachment: Capture.PNG]

ShaanM
Alteryx Alumni (Retired)

hi @JTCairns 

 

The packages can be downloaded outside of Alteryx and placed into this folder (default location): C:\Program Files\Alteryx\bin\Miniconda3\PythonTool_venv\Lib\site-packages

 

I will attempt to attach the file here. Unzip it, then place the two folders into the folder above, and it should then work.
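
If that site-packages folder is not writable, one possible workaround is to put the unzipped folders somewhere you can write to and add that location to Python's module search path before importing. This is a general Python technique and is untested inside the Alteryx Python tool; the helper name import_from_folder and the folder in the usage line are hypothetical.

```python
import importlib
import sys

def import_from_folder(module_name, folder):
    """Import `module_name` after adding `folder` to the search path (hypothetical helper)."""
    if folder not in sys.path:
        # Put the folder first so it is searched before the default locations
        sys.path.insert(0, folder)
    # Make sure the import system notices files added after startup
    importlib.invalidate_caches()
    return importlib.import_module(module_name)
```

Usage might look like docx2txt = import_from_folder("docx2txt", r"C:\Users\me\python_packages"), where the folder is whichever location holds the unzipped package.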

 

Hope this helps.

JTCairns
8 - Asteroid

Thanks for this Shaan, I hope this works, but I don't have admin permission so I will have to wait and see.

gururajb
6 - Meteoroid

Hi, I am getting this error while trying to parse a document using the R-based macro.

Has anyone come across this issue?

Please help.

[Screenshot attachment: Capture.PNG]

ShaanM
Alteryx Alumni (Retired)

@gururajb 

 

What does the data look like going into the macro?

 

Check that it is represented as a full path, e.g. c:\datafolder\worddoc.docx
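
A quick way to screen the incoming values before they reach the macro is sketched below. It assumes Windows-style paths and that only .docx files are wanted; check_word_path is a hypothetical helper, and the extension check can be relaxed if other formats turn out to work.

```python
from pathlib import PureWindowsPath

def check_word_path(raw_path):
    """Return a list of problems found with a Windows file path (hypothetical helper)."""
    path = PureWindowsPath(raw_path.strip())
    problems = []
    # A full path needs both a drive (or UNC share) and a root
    if not path.is_absolute():
        problems.append("not a full path")
    # Assumption: only .docx input is expected
    if path.suffix.lower() != ".docx":
        problems.append("extension is %r, expected '.docx'" % path.suffix)
    return problems
```

An empty list means the value looks like a full path to a .docx file; it does not confirm that the file exists.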

gururajb
6 - Meteoroid
Hi Shaan, I realized that the file extension is .doc, which is not supported by the officer library.
I guess we will have to use a different package.
ShaanM
Alteryx Alumni (Retired)

hi @gururajb 

 

I tested on my end with a .doc and the R macro still pulls the data in OK.

 

Could you maybe send me a direct message with the file, or upload it here?
