Pip and Pubgrub

Modify Pip's resolvelib to Pubgrub algorithm

Home

Article

What is package resolving

Package resolving is used for package installers to resolve which version of package can install. It's a sat-problem which is NP-Complete. For example, if I give the package installer what kind of package I want to install (we called it constraints), [numpy >= 1.0.0; panda >= 2.16.0], package installer need to compute which package will fit the constraints, maybe the result will be [numpy 2.6.0; panda 2.16.5].

But usually it won't be that simple, assume [numpy 2.6.0] need another package [numpy-tools >= 2.0.0] (we called dependencies for numpy), package installer need to count it in and install it, so the install result might be [numpy 2.6.0; numpy-tools 2.6.0; panda 2.16.5].

So there will be more problems here. If the dependency need something we can't fit, for example [panda 2.16.5]'s dependency is [panda-tools <= 2.16], and [panda-tools 2.16.0]'s dependency is [tool-pack >= 2.10.0], [numpy-tools]'s dependency is [tool-pack == 2.6.0]. Obviously, there will be no answer to fit constraints, so the conflict happened. Package installer needs to find another version of the package to try whether it can be solved or not. If there is no combination of packages that can satisfy all dependencies without any conflict then we will call this situation “resolve impossible” which means the version solver can’t find the correct pattern of packages to satisfy the user's requirement. Otherwise, if the version solver can find a combination of packages that can satisfy all dependencies without any conflict then we will say it resolves successfully and the user's requirement is achievable.

Pubgrub and Mixology

Mixology

mixology is an open source code in github which implements the version solver algorithm - pubgrub.

Pubgrub

pubgrub is a version solving algorithm and it has been known as the most efficient one. How do pubgrub compute the answer?

At the beginning, we need to introduce some basic units that pubgrub work with. “Term” is a range of versions of a single package that is either required or forbidden. For example, [ menu ≥ 1.1.0 ] is a term, and so is [ not dropdown ≥ 2.0.0 ]. When we call this term has been satisfied which means we pick a specific version of the package and it matches the range of this term. For example, [ aa >= 1.0.0 ] is satisfied when [ aa: 2.0.0 ] has been selected.

After knowing what the term is, we can define incompatibility now. Incompatibility is a set of terms which must not all be satisfied at once. For example, the incompatibility { aa == 1.0.0, not bb < 2.0.0 } means that [ aa: 1.0.0 ] and [ bb: 3.0.0 ] can't exist simultaneously, for all terms in incompatibility have been satisfied in this situation. How can we understand the meaning of incompatibility? We can interpret incompatibility in this way. Because [ aa: 1.0.0 ] depends on [ bb < 2.0.0 ], [ bb < 2.0.0 ] must be existed when [ aa: 1.0.0 ] exist, and we can use a incompatibility - { aa == 1.0.0, not bb < 2.0.0 } to represent this situation.

And that's the main reason why it can resolve faster than before, it can reduce the solver step for doing the same thing like this: if we already know [ aa >= 2.0.0 ] depends on [ cc >= 2.0.0 ], but [ bb >= 1.0.0 ] depends on [ cc ^1.0.0 ], and root need both aa and bb. After calculating we will know the incompatibility { bb >= 1.0.0, not aa < 2.0.0 }, it can skip the step from [ aa: 2.5.0 ] to [ aa: 2.0.0 ] which is really redundant.

So how do Pubgrub done it? It contains mainly two phases in the core loop - "unit propagation" and "decision making".

Unit propagation:

the phase which combines the list of selected package versions with the set of incompatibilities that represent facts we know are always true to find terms that must be true given the package versions we have selected.

For example, if we choose [ aa: 1.6.0 ], we can look at { aa >= 1.0.0, not bb >= 1.0.0 }, which means [ bb >= 1.0.0 ] needs to exist (called derived term). This will send to Unit propagation again, to see whether it can derive more terms. If there is no more derivation to make, Unit propagation done. If there are still dependencies left unsatisfied, go to the Decision making phase.

Decision making:

Choose a version which matches incompatibilities, and add it to the package list (choose by heuristic, pub choose the fewest total version, and pip choose the newest one). Also, add the new incompatibilities from the package we chose.

If everything goes well, we can find a package list matches our incompatibilities, or we can face another problem - Conflict.

Conflict:

During Unit propagation, we might derive something that satisfies some incompatibility, which is a conflict. We need to jump back to the most recently selected package version that might cause conflict, and try another version.

And Pubgrub does more than that, imagine we have incompatibilities { root, aa >= 1.0.0 } and { bb, not aa >= 1.0.0 }, which means [ root ] can't work with [ aa >= 1.0.0 ], but [ bb ] needs [ aa >= 1.0.0 ].So [ root ] can't work with [ bb ], we can combine incompatibilities to { root, bb }. It can let Pubgrub avoid running the same conflict again and again.

Pubgrub's flow chart we make

Pip

pip is a package installer for Python and we can use pip to install packages from the Python Package Index and other indexes. The most common function we use is to use pip to install specific packages and specific versions just like this command - "pip install numpy > 1.0.0". The outcome of this command is to install numpy which version will meet the constraint of “numpy > 1.0.0”, numpy’s dependencies, its dependencies’ dependencies , and so on and so on until everyone’s dependencies are satisfied in our computer.

How do pip know what packages are numpy needed and which version of dependency is numpy needed ?

First, pip can check numpy internally to know what packages are numpy needed and which range of dependencies numpy accept.

Second, pip will use the information of what packages the computer has already installed and the information of numpy to compute what package version we should install actually to avoid the happening of conflict. The technique that pip uses is backtracking, backtracking is an algorithmic-technique for solving problems recursively by trying to build a solution incrementally, one piece at a time, removing those solutions that fail to satisfy the constraints of the problem at any point of time. How do pip build the solution incrementally ? The answer is that pip will try to choose a specific version of the package, usually the latest one from an available range which means it will satisfy the constraint. After trying to choose a version of the package, we will get the new constraint because we choose a new package and it may have its own dependencies. Then we will check whether the intersection of new constraint and old constraint is an empty set or not, if it is an empty set which means conflict has occurred and we must abort the latest chosen to go back to the state of no conflict and choose another version of the package. Otherwise, if the intersection is not an empty set which indicates this choice is temporarily acceptable and we can do the next chosen. There are two probable outcomes in this algorithm, the first one is resolved successfully which means pip find a combination of packages that satisfy all constraint without any conflict, in this case, user’s requirement will be accomplished, the second one is resolved impossible which means pip can find any correct pattern of package to satisfy all constraints, in this case, user’s requirement will not be achieved.

In this section, I just mention a part of functionalities of pip which is also the part highly related to our graduate project. If anyone is curious about the strength of pip, there is the link of pip's document.

Resolvelib in pip

Resolvelib want to choose latest version and just linearly find package's version

Resolvelib's flow chart we make