The Python SDK is a framework for developing new Alteryx tools with Python. Third-party library installation instructions are available here, and examples of how to create tools with the Python SDK are here and here.
The given solution works only when a library needs to be imported in the "Python SDK" tool from the "SDK Example" palette. How do you install packages when running Python code from the "Apache Spark Code" tool in the Developer palette?
To use packages in the Apache Spark Code tool, they need to be installed on the Apache Spark cluster you are connecting to with your workflow. How you install them depends on two things: whether you want them installed permanently or only for the job/workflow you are running, and the type of Apache Spark cluster you are using (on-premises, Databricks, or Microsoft Azure HDInsight).
If you want to install the packages on the cluster permanently, the instructions depend heavily on the type of Apache Spark cluster you are using.
On-premises (i.e., Livy cluster): The packages need to be installed using pip (or whatever Python package manager is used on your servers), preferably on each server in the cluster. You can install them on just one or a few servers rather than all of them, but each time a job that uses the package runs, the package will be copied to each worker that doesn't already have it. There are scripts/tools available to make this easier if you have a large cluster.
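As a minimal sketch of the pip step above: the helper below builds and runs a pip install for whatever Python interpreter executes it, which you would run on each node. The package name "example-package" is a placeholder, not a real requirement from this thread.

```python
# Sketch: install a package with pip from Python on a cluster node.
# "example-package" is a placeholder; substitute the library your Spark
# job imports. The equivalent shell command is: python -m pip install <pkg>
import subprocess
import sys


def pip_install_cmd(package):
    """Build the pip install command for the Python running this script."""
    return [sys.executable, "-m", "pip", "install", package]


def install_package(package):
    """Install `package` into this node's Python environment."""
    subprocess.check_call(pip_install_cmd(package))


# On each server in the cluster you would then run:
# install_package("example-package")
```

Running it through `sys.executable` (rather than a bare `pip`) ensures the package lands in the same Python environment your Spark workers actually use.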
Microsoft Azure HDInsight: Essentially the same as above, except you can do this through the Azure web interface, and the process of getting the package onto each worker in the cluster is easier.
Databricks: Simplest of all. Databricks refers to this in its documentation as "installing a library", and the process is the same for Python, Java, Scala, and R libraries. Their documentation for the process can be found at https://docs.databricks.com/user-guide/libraries.html
If you only want to use the packages for a single job/workflow, the instructions are simpler and nearly identical for each type of connection. You do this in the connection configuration dialog where you set up the Apache Spark connection: whether the connection is on-premises, Databricks, or Microsoft Azure HDInsight, you have the option to add libraries to your connection string, and you simply add the library in that part of the dialog. The exact instructions can be found in the Alteryx help, and since they may change after my reply is written, I'll simply provide a link to that documentation here:
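To illustrate what the per-job library option amounts to for an on-premises (Livy) connection: the Livy REST API's session-creation request accepts a "pyFiles" list of .py/.zip/.egg files to ship with the session, which is conceptually what the library field in the connection configuration feeds. This is a hedged sketch, not the Alteryx implementation; the file path below is a placeholder.

```python
# Sketch: a Livy POST /sessions payload that attaches extra Python
# libraries to a single Spark session (i.e., per-job, not permanent).
# The path "/shared/libs/example_pkg.zip" is a placeholder and must be
# readable by the cluster.
import json


def build_livy_session_payload(py_files):
    """Build a Livy session-creation body that ships extra Python libs."""
    return {
        "kind": "pyspark",            # Python Spark session
        "pyFiles": list(py_files),    # .py, .zip, or .egg files for the job
    }


payload = build_livy_session_payload(["/shared/libs/example_pkg.zip"])
print(json.dumps(payload))
```

Because the libraries live on the session rather than the cluster, they disappear when the session ends, which matches the "just for this job/workflow" case above.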