
Data Science

Machine learning & data science for beginners and experts alike.
bryanchen
Alteryx

Hello {minimum dependency} world


Imagine you’re working with project A, which relies on package B versions >=1.0.0 and package C versions <=0.3.0. Additionally, each of these packages has its own dependencies on other packages, each with the versions it requires and supports, and those packages depend on others, and so on. All of these dependencies must align in your project for it to work successfully. Welcome to dependency versioning hell, the land where most open-source packages exist.

 

Figure: Example package dependency graph

 

Of course, there are resources that users can use to mitigate these issues. For example, conda, pip, and poetry are all tools that handle versioning conflicts during the installation process of a project. Oftentimes, open-source (hereafter referred to as OS) projects will have a requirements.txt, pyproject.toml, setup.cfg, or another file that specifies version requirements of their dependencies. This makes project installation as smooth as possible for users by stating which package dependency versions will work with the project. However, the same story isn’t true for developers.

 

Pip’s package installation process

 

To dig further into this, we first need to look at how package managers handle package versioning and conflict resolution. We will specifically dive into pip, but note that conda and poetry work in a similar fashion. During package resolution, pip looks at the package dependencies and makes assumptions about which versions are compatible. As it moves through the dependencies, if it finds that an assumption it made was incorrect, it backtracks and attempts to find a version that is in agreement before moving onwards. More information on this resolution process can be found in pip’s documentation.

 

Let’s look into an example OS package, Woodwork, a Python package that automates semantic and logical typing of structured data, specifically for machine learning. In an older version, Woodwork depended on four main packages, but we’ll only look at two of them:

 

  1. pandas>=1.3.0
  2. python-dateutil>=2.8.1

 

During pip installation, pip will install the newest compatible version of pandas, which is 1.4.3 as of writing. It will then install the packages that pandas relies on, including numpy>=1.18 and python-dateutil>=2.8.1, install the newest compatible version of these, and so on. Once it finishes with the pandas package, it moves on to the next requirement, python-dateutil, and repeats this process. If our second requirement specified python-dateutil < 2.8.1 instead, we would have a package dependency clash, and pip would revert to an older version of pandas, for example, pandas==1.3.5, in order to resolve this conflict. It continues to do this for all packages, installing and backtracking until it resolves all package dependencies.
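The newest-first, backtrack-on-conflict behavior described above can be illustrated with a small toy resolver. This is only a sketch of the general idea, not pip's actual resolver: the index, the package names `lib_b` and `lib_c`, and the `resolve` function are all hypothetical, and versions are written as tuples to keep comparisons simple.

```python
import operator

# Comparison operators supported by this toy example.
OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
       "<": operator.lt, "==": operator.eq}

# Hypothetical package index: {name: {version: {dependency: (op, version)}}}.
# None of this is real PyPI metadata.
INDEX = {
    "lib_b": {
        (1, 4, 3): {"lib_c": (">=", (2, 8, 1))},
        (1, 3, 5): {"lib_c": (">=", (2, 7, 3))},
    },
    "lib_c": {(2, 8, 2): {}, (2, 8, 1): {}, (2, 7, 3): {}},
}


def satisfies(version, constraint):
    op, bound = constraint
    return OPS[op](version, bound)


def resolve(requirements, index, pinned=None):
    """Return {name: version} meeting every requirement, or None on failure."""
    pinned = dict(pinned or {})
    if not requirements:
        return pinned
    (name, constraint), rest = requirements[0], requirements[1:]
    if name in pinned:
        # An earlier choice exists; it must also satisfy this constraint.
        if satisfies(pinned[name], constraint):
            return resolve(rest, index, pinned)
        return None  # conflict: the caller backtracks to an older candidate
    # Try candidates from newest to oldest.
    for version in sorted(index[name], reverse=True):
        if not satisfies(version, constraint):
            continue
        deps = list(index[name][version].items())
        result = resolve(deps + rest, index, {**pinned, name: version})
        if result is not None:
            return result
    return None


newest = resolve([("lib_b", (">=", (1, 3, 0))), ("lib_c", (">=", (2, 8, 1)))], INDEX)
backtracked = resolve([("lib_b", (">=", (1, 3, 0))), ("lib_c", ("<", (2, 8, 1)))], INDEX)
print(newest)       # {'lib_b': (1, 4, 3), 'lib_c': (2, 8, 2)}
print(backtracked)  # {'lib_b': (1, 3, 5), 'lib_c': (2, 7, 3)}
```

In the second call, the newest `lib_b` cannot coexist with the `lib_c` constraint, so the toy resolver falls back to the older `lib_b`, mirroring the conflict scenario in the paragraph above.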

 

This installation process works great for users and devs since it usually installs the most recent versions of each package that satisfy the requirements. The backtracking process that pip uses also ensures that it starts with the newest packages, and if those fail or are incompatible, incrementally installs older packages. However, as OS developers, we must also support the lowest version requirements for our package so that the minimum versions specified will run successfully.

 

OS minimum dependency requirements


As OS developers, we need to ensure that we know which minimum version of each package dependency will still allow our product to run successfully. We must be able to make guarantees that all users with these package versions will be able to use our product, and we update these versions when we choose to no longer support them. However, how can we guarantee that the minimal dependencies of our minimal dependencies are also supported? The easiest way to showcase this is through an example, one that we experienced firsthand recently.

 

In our project, we had a dependency on pandas>=1.3.0, which had a dependency on python-dateutil>=2.7.3. However, we introduced new functionality that relied on python-dateutil 2.8.1, and this turned out to be a huge issue that we weren’t able to catch until it was too late. Let’s look at three different scenarios of package versioning for this project:

 

  • Using our default dependencies, pip would install pandas==1.4.2, which depended on python-dateutil>=2.8.1. pip would then install the newest version of python-dateutil, which was 2.8.1. This causes no issues with our package.
  • Using our minimum dependencies, pip would install pandas==1.3.0, which depended on python-dateutil>=2.7.3. Based on pip installation behavior, it would choose to install the newest version of python-dateutil, which would again be 2.8.1. Once again, this would cause no issues with our package.
  • Using the minimum dependency of our minimum dependencies, we would install pandas==1.3.0 and python-dateutil==2.7.3. This combination of packages would fail with our package. Due to this failure, we needed to add python-dateutil to our list of requirements.


This showcases the limitations that installers like pip have for OS projects. For OS developers, keeping track of which requirements allow our package to run successfully is essential, and we need a more advanced method to determine whether we truly uphold our minimum dependency claims.

 

Minimal dependencies of minimum dependencies

 

To handle this problem, we looked at a variety of packages to check whether any support for this already existed. Package managers like pip, conda, and poetry don’t have any methods of specifying which versions of the package sub-dependencies to install, and no existing packages tackled this problem. In the end, we created our own script to handle this task for us.

 

First, we looked at how to get the dependencies (and their respective supported versions) from projects. We looked at a few packages that would handle this:

 

  1. pipdeptree: Shows the dependencies of the packages currently installed in the environment; used from the command line.
  2. pipgrip: Shows the dependencies of a package that doesn’t necessarily have to be installed. However, it pulls only the most recent release of the package, rather than packages in development.
  3. pip._vendor.pkg_resources: Pythonic method to get the dependencies of a package based on the current packages installed through pip.


We decided to go with pip._vendor.pkg_resources to get the package requirements easily through our Python script. Unlike pipgrip, it lets us find the package dependencies of local installations, including packages that have not been released yet, and it runs faster than pipdeptree, especially for packages with many dependencies.

 

Our approach to solving this problem then goes as follows:

 

  • We create a fresh environment and install requirements-parser, which will give us the version breakdown of a package requirement. This package separates the inequality from the version number, turning <=5.0.0 into the tuple (<=, 5.0.0).
  • We install all of the minimum dependency versions that our package requires, including the minimum core and minimum test requirements. We have files that list these expected minimum versions so that we can properly test and track our expected minimum package requirements.
  • We can do this through Python by using subprocess to run the pip command:

 

import subprocess

# min_reqs is a string of minimum package requirements separated by `delim`;
# the trailing delimiter leaves an empty last entry, which [:-1] drops
process = ["pip", "install"]
process.extend(min_reqs.split(delim)[:-1])

# drop any pin on pip itself so that we do not downgrade pip during the installation
process = [x for x in process if not x.startswith("pip==")]

# `subprocess.run` executes the pip command in the current environment
subprocess.run(process, capture_output=False)
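The operator and version split described in the first step can also be approximated with the standard library. The following is a minimal sketch of that idea and is not the requirements-parser API; `split_requirement` is a hypothetical helper, and it only handles simple requirement strings (extras and environment markers are ignored).

```python
import re

# Longer operators come first so that ">=" is not read as ">".
_SPEC = re.compile(r"\s*(===|==|~=|!=|<=|>=|<|>)\s*([^\s,;]+)")
_NAME = re.compile(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def split_requirement(requirement):
    """Split 'name>=1.0,<2.0' into ('name', [('>=', '1.0'), ('<', '2.0')])."""
    # Ignore environment markers such as '; python_version < "3.8"'.
    requirement = requirement.split(";", 1)[0]
    match = _NAME.match(requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    name = match.group(1)
    specs = _SPEC.findall(requirement[match.end():])
    return name, specs


print(split_requirement("another_package<=5.0.0"))
# ('another_package', [('<=', '5.0.0')])
```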

 

We can use pkg_resources.working_set to get the dependencies of a certain installed package, and use this to create a list of dependencies for all installed packages.

 

from pip._vendor import pkg_resources

package_name = 'some_package_name'
_package = pkg_resources.working_set.by_key[package_name]
requirements = [str(r) for r in _package.requires()]
# requirements will be a list like
# ["scipy>=0.17.0", "pandas>=X.x", "moto", "another_package<1.8.0"]
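Extending this to every installed package could look roughly like the sketch below. Note that it swaps in the standard library's `importlib.metadata` rather than `pip._vendor.pkg_resources`, so it is an alternative illustration and not the code from our script; `installed_requirements` is a hypothetical helper name.

```python
from importlib import metadata


def installed_requirements():
    """Map each installed distribution name to its declared requirement strings."""
    requirements = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if not name:
            continue  # skip distributions with broken metadata
        # `dist.requires` is None when no dependencies are declared.
        # Entries may carry environment markers, e.g. 'x>=1; extra == "test"'.
        requirements[name.lower()] = list(dist.requires or [])
    return requirements


all_requirements = installed_requirements()
for name in sorted(all_requirements)[:5]:
    print(name, all_requirements[name])
```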

 

  • We can find all versions of a package through pip index versions {package_name}, which we can run through subprocess as well. We also filter out the base versions from the version strings (for example, converting 22.2.post1 to 22.2.1)
  • We can then use these versions and compare with our minimum version requirements to choose the lowest version that satisfies the requirement. After finding the minimum versions of all packages, we can install these to create our true minimum dependency environment.
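The last step, choosing the lowest version that satisfies a requirement, can be sketched as follows. This is a simplified illustration under stated assumptions: `lowest_satisfying` and `version_key` are hypothetical helpers, only plain dotted numeric versions are considered (pre- and post-releases are skipped), and the list of available versions is made up for the example rather than queried from an index.

```python
import operator
import re

OPS = {"==": operator.eq, "!=": operator.ne, ">=": operator.ge,
       "<=": operator.le, ">": operator.gt, "<": operator.lt}
_PLAIN = re.compile(r"\d+(\.\d+)*")


def version_key(version):
    """Turn '1.3.0' into a comparable tuple; trailing zeros are dropped so 1.3 == 1.3.0."""
    parts = [int(part) for part in version.split(".")]
    while len(parts) > 1 and parts[-1] == 0:
        parts.pop()
    return tuple(parts)


def lowest_satisfying(available, specs):
    """Return the lowest version in `available` meeting every (op, version) pair, or None."""
    plain = [v for v in available if _PLAIN.fullmatch(v)]
    candidates = [
        v for v in plain
        if all(OPS[op](version_key(v), version_key(bound)) for op, bound in specs)
    ]
    return min(candidates, key=version_key, default=None)


# Illustrative version list only.
available = ["2.8.2", "2.8.1", "2.8.0", "2.7.5", "2.7.4", "2.7.3", "2.7.2", "3.0.0rc1"]
print(lowest_satisfying(available, [(">=", "2.7.3")]))  # 2.7.3
print(lowest_satisfying(available, [(">=", "2.8.1")]))  # 2.8.1
```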


Note: we only look up to the second level of packages.

 

Figure: Example package dependency tree
For example, for the package Woodwork, we look two levels down at the packages that it relies on, including pandas, numpy, and scipy. We don’t look at the packages beyond those, Another package and Another package 2. This is a design choice that lets our code run faster. We also decided that packages further out aren’t as important when guaranteeing the minimum dependencies.
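The two-level cutoff amounts to a depth-limited walk of the dependency graph. The sketch below shows one way to express that; the graph, the package names, and the `dependencies_within` helper are hypothetical and do not come from our script or from the figure.

```python
from collections import deque


def dependencies_within(graph, root, max_depth=2):
    """Return {package: depth} for packages reachable from `root` in at most `max_depth` steps."""
    depths = {}
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        package, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand packages at the cutoff level
        for dependency in graph.get(package, ()):
            if dependency not in seen:
                seen.add(dependency)
                depths[dependency] = depth + 1
                queue.append((dependency, depth + 1))
    return depths


# Hypothetical dependency graph: each key depends on the packages in its list.
graph = {
    "project": ["dep_a", "dep_b"],
    "dep_a": ["sub_a"],
    "dep_b": ["sub_a", "sub_b"],
    "sub_a": ["deep_a"],
    "deep_a": ["deep_b"],
}
print(dependencies_within(graph, "project"))
# {'dep_a': 1, 'dep_b': 1, 'sub_a': 2, 'sub_b': 2}
```

Here `deep_a` and `deep_b` sit beyond the second level, so they are never visited.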

 

Our version of a minimum dependency finder, resolver, and installer is here.

 

Final thoughts

 

Minimum dependency resolution is a challenging problem for all software packages, but especially for those in the OS space. Getting adequate support and testing in place for this issue is crucial in ensuring that users won’t run into unexpected problems, and the approach that we walked through here provides that capability for our own package.