How to use R and Python to Parse Word Documents

ShaanM
Alteryx Alumni (Retired)

A lot of people I have spoken to recently have asked about this, and it seems to crop up more and more.

I thought it would be useful to build two macros to help solve this challenge.

Using either R or Python, the two macros take a feed of files and process them.

If using R, the package used is called 'officer', which you will need to install separately.

For Python, the package used is called 'docx2txt', which also needs to be installed separately.

 

It is a very basic example and there are a whole host of other packages that do something similar.

 

Here is the R code used:

 


 

 

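# Load the officer package (installed separately)
library(officer)

# Read the Word document; "XXXX" is the placeholder for the file name
doc <- read_docx("XXXX")

# Summarise the document content as a data frame
content <- docx_summary(doc)
head(content)

# Write the data frame to output 3 of the R tool
write.Alteryx(content, 3)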

 

 

 

 

Here is the Python code used:


 

 

from ayx import Alteryx
Alteryx.installPackages("docx2txt")

 

 

 

 

from ayx import Alteryx
import pandas

import docx2txt

text = docx2txt.process('XXXX')

print(text)

#Turn the extracted document text into a pandas DataFrame
df = pandas.DataFrame({"text":[text]})

#Write the data frame to Alteryx workflow for downstream processing
Alteryx.write(df,1)
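
For anyone curious about what a package like this is reading, here is a minimal sketch using only the Python standard library. It relies on the fact that a .docx file is a zip archive whose main body lives in word/document.xml. The function name docx_paragraphs is made up for illustration, and this is not a description of how docx2txt works internally. It only reads the main document part (no headers, footers or footnotes), and text boxes may be repeated.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(path):
    """Return the text of each paragraph in the main document part.

    Hypothetical helper for illustration only.
    """
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's visible text is split across one or more w:t elements
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return paragraphs
```

Calling docx_paragraphs on a file path gives a list of strings, one per paragraph, which could be joined with newlines to get something roughly comparable to the single text value used above.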

 

 

 

I packaged each method as a macro, using 'XXXX' in the code as a placeholder for the file name.

Attached are the workflow, the macros and a test file.

 

 

Enjoy!!

 

Shaan Mistry

Shaan Mistry
Co - Founder : datacurious.ai
gururajb
6 - Meteoroid

Hi @ShaanM 

 

Please find the sample file.

Thanks in advance.

gururajb
6 - Meteoroid

Hi Shaan

 

Please find the file.

Thanks in advance.

ShaanM
Alteryx Alumni (Retired)

@gururajb I tested with your file. It looks like some file properties have not been filled in.

I opened the doc, copied the contents and pasted them into a new Word doc, and then the file reads in OK.

It might be down to how the original file was created.
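
If you want to compare a problem file with one that reads in fine, a rough check is to list which parts are inside the .docx archive. This is only a sketch. It assumes that "file properties" refers to the document-properties parts of the package (docProps/core.xml and docProps/app.xml), and the list of expected parts is my own choice for illustration, not an official requirement. The function name missing_docx_parts is hypothetical.

```python
import zipfile

# Parts a Word-saved .docx typically contains. This list is an assumption
# for illustration, not an exhaustive or official requirement.
EXPECTED_PARTS = ("word/document.xml", "docProps/core.xml", "docProps/app.xml")


def missing_docx_parts(path, expected=EXPECTED_PARTS):
    """Return the expected parts that are absent from the .docx archive."""
    with zipfile.ZipFile(path) as archive:
        present = set(archive.namelist())
    return [part for part in expected if part not in present]
```

An empty list means all the expected parts are there. If something is listed, re-saving the document from Word (or pasting into a new document as above) may recreate it.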

Shaan Mistry
Co - Founder : datacurious.ai
gururajb
6 - Meteoroid

Thanks for the insights @ShaanM.

I will find out from the client how the files were created.

coderockride
8 - Asteroid

If I wanted to add the input filepath to the python macro so I can link phrases back to source documents, what might that look like? Something like this?

 

from ayx import Alteryx
import pandas

import docx2txt

text = docx2txt.process('XXXX')
filepath = 'XXXX'

print(text)

#Turn the extracted text and the file path into a pandas DataFrame
df = pandas.DataFrame({"text":[text],"filepath":[filepath]})

#Write the data frame to Alteryx workflow for downstream processing
Alteryx.write(df,1)

ShaanM
Alteryx Alumni (Retired)

@coderockride 

 

Yes, I think you are on the right path.

The main thing is to define the file path in the data frame; that way it becomes part of the data as it passes through the stream.
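
As a rough sketch of the shape the data needs: a data frame can be built from a dictionary that maps each column name to its own list of values, so the file path is simply a second key next to the text. The helper name build_columns and the example path are made up for illustration.

```python
def build_columns(text, filepath):
    """Map each column name to a one-element list, giving a single-row table.

    Hypothetical helper for illustration only.
    """
    return {"text": [text], "filepath": [filepath]}


# Example values are placeholders, not real files
columns = build_columns("example document text", r"C:\docs\example.docx")
```

In the macro you would pass the dictionary to pandas.DataFrame(columns) and then write the result out with Alteryx.write(df, 1) as before.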

Shaan Mistry
Co - Founder : datacurious.ai
G1
8 - Asteroid

Hi ShaanM thanks for your info.

 

I got an error when installing docx2txt, so I tried saving the files where you suggest, in C:\Program Files\Alteryx\bin\Miniconda3\PythonTool_venv\Lib\site-packages.

However, I have no PythonTool_venv folder (I asked IT to look too and they could not find it). I do have a jupytertool_venv folder, and the install seems to be looking in there, so I tried saving the files in the following location:

c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\ayx\

But still no luck; it reports an environment error. Do you have any more suggestions? I am not familiar with all this back-end stuff. Thanks in advance.

 

Collecting docx2txt
Installing collected packages: docx2txt
ERROR: Could not install packages due to an EnvironmentError: [WinError 5] Access is denied: 'c:\\program files\\alteryx\\bin\\miniconda3\\envs\\jupytertool_venv\\Lib\\site-packages\\docx2txt'
Consider using the `--user` option or check the permissions.
 
---------------------------------------------------------------------------
CalledProcessError                        Traceback (most recent call last)
<ipython-input-2-72d8c39b3961> in <module>
      1 from ayx import Alteryx
----> 2 Alteryx.installPackages("docx2txt")
      3 

c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\ayx\export.py in installPackage(package, install_type, debug, **kwargs)
    138     This function will install a package or list of packages into the virtual environment used by the Python tool. If using an admin installation of Alteryx, you must run Alteryx as administrator in order to use this function and install packages.
    139     """
--> 140     __installPackages__(package, install_type=install_type, debug=debug, **kwargs)
    141 
    142 

c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\ayx\Package.py in installPackages(package, install_type, debug)
    112     print(pip_install_result['msg'])
    113     if not pip_install_result['success']:
--> 114         raise pip_install_result['err']

c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\site-packages\ayx\Utils.py in runSubprocess(args_list, debug)
     56 
     57     try:
---> 58         result = subprocess.check_output(args_list, stderr=subprocess.STDOUT)
     59         if debug:
     60             print("[Subprocess success!]")

c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\subprocess.py in check_output(timeout, *popenargs, **kwargs)
    354 
    355     return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
--> 356                **kwargs).stdout
    357 
    358 

c:\program files\alteryx\bin\miniconda3\envs\jupytertool_venv\lib\subprocess.py in run(input, timeout, check, *popenargs, **kwargs)
    436         if check and retcode:
    437             raise CalledProcessError(retcode, process.args,
--> 438                                      output=stdout, stderr=stderr)
    439     return CompletedProcess(process.args, retcode, stdout, stderr)
    440 

CalledProcessError: Command '['c:\\program files\\alteryx\\bin\\miniconda3\\envs\\jupytertool_venv\\python.exe', '-m', 'pip', 'install', 'docx2txt']' returned non-zero exit status 1.
mceleavey
17 - Castor

Hi @ShaanM ,

 

I desperately need this to work as the solution I was using has developed problems. 

I've followed the steps (I'm not overly familiar with R or Python, so I'm leaning toward the problem being between keyboard and chair) but I get the following error when using R:

mceleavey_0-1579597866571.png

Any ideas?

I get different errors when using Python, but we'll address those later if need be. I downloaded the officer package, then used the Alteryx R Package Installer to install. It confirmed it was installed correctly. I then needed to update the RLang package, which I did.

Now I get this error. Any ideas?

I'm literally on-site with a client now so any help will be greatly appreciated!!

 

M.



Bulien

ShaanM
Alteryx Alumni (Retired)

@mceleavey 

 

Try this:

 

On the local machine, browse to this location (using Alteryx defaults):

C:\Program Files\Alteryx\R-3.5.3\bin\x64

This is the R location.

Once in that location, find and run RGui.exe.

RGui allows you to install R packages.

From the top menu, go to Packages > Install Packages.

Then select the CRAN mirror. I just select London. It will then give you a full list of all the packages available.

Then select officer.

Once it has downloaded and unpacked (it should do it all by itself), reopen Alteryx and try again.

Hope this helps. Failing that, I would reach out to our support team: support@alteryx.com

Shaan Mistry
Co - Founder : datacurious.ai
ShaanM
Alteryx Alumni (Retired)

@G1 

 

It looks like you may have some environment discrepancies.

 

To fully diagnose please log a ticket with our client service team: support@alteryx.com 
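
In the meantime, the log itself gives two hints: the "Access is denied" message suggests the `--user` option or checking permissions, and the docstring in the traceback says an admin installation needs Alteryx to be run as administrator to install packages. The last line of the log shows the command being run, and the sketch below just rebuilds that argument list with an optional `--user` flag. The function name pip_install_command is made up for illustration. The traceback also shows that installPackages takes an install_type argument, but whether a value such as "install --user" works in your version is an assumption to check against the Alteryx documentation.

```python
import sys


def pip_install_command(package, user=False, python=sys.executable):
    """Build the argument list for running pip through a given interpreter.

    Hypothetical helper for illustration only.
    """
    command = [python, "-m", "pip", "install"]
    if user:
        # --user targets the per-user site-packages rather than the
        # protected environment folder
        command.append("--user")
    command.append(package)
    return command
```

The resulting list is the kind that subprocess.check_output receives in the traceback, so it can help when comparing what was run against what you or IT try from a command prompt.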

Shaan Mistry
Co - Founder : datacurious.ai