Reproducible environments#

Best Practices in Modern Software Development: 23.11.23

Min Ragan-Kelley


  • A module is a file consisting of Python code

  • A package is a hierarchical file directory structure that consists of modules and sub-packages
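For illustration, a hypothetical package could be laid out like this (all names made up):

    mypackage/                # package: a directory with __init__.py
        __init__.py
        io.py                 # module: a single .py file
        analysis/             # sub-package
            __init__.py
            stats.py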



Using modules#

import itertools
# Access function from the module
itertools.product

# Alias
import itertools as itools
itools.product

# The following is considered a bad practice
from itertools import *
# Easy to shadow existing variables (also hard for IDEs)
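A minimal sketch of how a star import can shadow your variables (hypothetical script):

count = 0                  # your own loop counter
from itertools import *    # silently rebinds count to itertools.count
count += 1                 # TypeError: count is now a class, not an int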

Using packages#

from scipy.optimize import minimize
#      ^      ^               ^
#      |      |               |
#   Package   |               |
#           Module            |
#                          Function

Where does Python find modules?#

In [1]: import asyncio, numpy

In [2]: asyncio.__file__
Out[2]: '/usr/local/lib/python3.12/asyncio/__init__.py'

In [3]: numpy.__file__
Out[3]: '/home/myname/.local/lib/python3.12/site-packages/numpy/__init__.py'

How does Python know which modules and packages are available?#

import sys

sys.path
['',
 '/usr/local/lib/python312.zip',
 '/usr/local/lib/python3.12',
 '/usr/local/lib/python3.12/lib-dynload',
 '/home/myname/.local/lib/python3.12/site-packages',
 '/usr/local/lib/python3.12/site-packages']

The order is important! Python searches these directories in order and imports the first match it finds.
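For example, you can prepend a directory so that modules there take precedence (the path is hypothetical):

import sys

# modules in this directory now shadow same-named modules later in the path
sys.path.insert(0, "/path/to/my/modules")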


What is an environment?#

  • An environment is where you install your software, isolated from your system and other projects. Why?

    • Conflicting dependency versions

    • Easier to upgrade

    • Easier to dispose of and start from scratch

    • Portable

    • It’s always a good idea to use environments!

  • Three main options

    • Python virtual environments

    • Conda environments

    • Containers (Docker)


How and why to specify environments#

  • An environment specification is a portable description of what packages should go in an environment.

  • When you specify your environment, it’s easier for others to reproduce it, or at least to compare it with their own.

  • Tools turn specifications into environments (and vice versa!)

    • pip - requirements.txt

    • conda - environment.yml

    • Docker - Dockerfile


Python virtual environments#

  • Python comes with a built-in tool for creating virtual environments

    python3 -m venv ./my-env
    

    This will create a folder called my-env containing your virtual environment

  • Activate virtual environment

    source ./my-env/bin/activate
    

  • Now you can install the dependencies you need

    python3 -m pip install pandas
    
  • To deactivate your virtual environment, type

    deactivate
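To check that an environment is active, you can inspect which interpreter is being used (the paths are examples and will differ on your machine):

    which python3              # should point into ./my-env/bin
    python3 -c "import sys; print(sys.prefix)"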
    

Demo#

creating a virtual environment


Exercise#

Create two different virtual environments called latest and old-pandas: one where you install the latest version of pandas, and one where you install a pandas version lower than 2.0.

Verify the version using

python3 -c "import pandas; print(pandas.__version__)"

Example#

  • python3 -m venv latest
    . latest/bin/activate
    python3 -m pip install pandas
    deactivate
    
  • python3 -m venv old-pandas
    . old-pandas/bin/activate
    python3 -m pip install "pandas<2.0"
    deactivate
    

Creating a pyproject.toml#

  • pyproject.toml is the recommended way to specify project metadata for Python projects

  • Minimum metadata

    • name

    • version

    • authors

    • license

    • dependencies


Example pyproject.toml#

[build-system]
requires = ["setuptools>=64.4.0"]
build-backend = "setuptools.build_meta"


[project]
name = "my-paper"
version = "0.1.0"
dependencies = [
  "numpy",
]

[tool.setuptools]
# empty packages list when your project is not a 'real' package
# (i.e. only dependencies, nothing to actually install)
packages = []

Exercise#

  • Add numpy, scipy, and numba as dependencies in your pyproject.toml

  • Try to install these dependencies in your virtual environment by typing

    python3 -m pip install -e .
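One possible solution, extending the dependencies list from the earlier example (version constraints deliberately omitted):

[project]
name = "my-paper"
version = "0.1.0"
dependencies = [
  "numpy",
  "scipy",
  "numba",
]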
    

Extra dependencies for development#

  • You might want to use some other libraries when developing the software, or for other specific tasks (such as pip-tools or pytest)

  • These libraries should not be required when installing the software, but it is nice for other developers to have an easy way to discover and install them

  • You can list these in pyproject.toml under project.optional-dependencies


Specifying optional dependencies in pyproject.toml#

[project.optional-dependencies]
test = [
    "pytest",
    "pytest-cov",
]
dev = [
    "pdbpp",
    "ipython",
    "tbump",
    "pre-commit",
    "pip-tools",
]
all = [
   # self-reference: "my-project" must match the name in [project]
   "my-project[test,dev]"
]

Installing optional dependencies#

  • Use

    python3 -m pip install ".[dev]"
    

    to install the package in the current directory and its optional ‘dev’ dependencies.

  • To install several groups of optional dependencies, separate the names with commas

    python3 -m pip install ".[dev,test]"
    

Pinning exact versions of the libraries you use#

  • To ensure reproducible results, it is important that you specify the exact versions of the libraries you used and all their dependencies

  • You can export your current environment at any time in requirements.txt format with

    pip freeze
    
  • But you shouldn’t specify these as your direct dependencies! (never put pandas==2.1.2 in your dependencies by hand)

  • We can use a tool called pip-compile (install with pip install pip-tools) to pin all the versions based on your pyproject.toml


Pinning with pip-tools#

pip-compile is like pip install followed by pip freeze, but without actually installing anything

  • Use

    pip-compile pyproject.toml
    

    to create a file requirements.txt that pins every package you use, directly or indirectly (an example excerpt is shown after this list)

  • You can now install the exact dependencies using the command

    python3 -m pip install -r requirements.txt
    
  • pip-tools and dependabot can be used to update requirements.txt when you want to.
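For illustration, an excerpt of the generated requirements.txt might look like this (version numbers are hypothetical):

#
# This file is autogenerated by pip-compile
#
numpy==1.26.2
    # via my-paper (pyproject.toml)
scipy==1.11.4
    # via my-paper (pyproject.toml)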


Pinning optional dependencies#

It might be beneficial to pin some of your optional dependencies:

    pip-compile --extra=dev --output-file=requirements-dev.txt pyproject.toml
    
  • Here we save these dependencies to a different file called requirements-dev.txt, which can be installed using

    python3 -m pip install -r requirements-dev.txt
    

When to pin#

It can be hard to know when to pin dependencies and when not to. Pinned packages help ensure reproducible results, but they also make it harder to combine your environment with other projects.

It’s a good idea to use pinned dependencies when you are:

  • building reproducible results

  • building a container image

  • rendering a website

  • operating a service


When not to pin#

  • In package dependencies

  • Running tests (maybe!)

  • When you want to share an environment with another tool

  • Short answer: always good to have both!

    • always track loose, direct dependencies

    • track pinned dependencies separately, using tools, not by hand

    • which to install depends on what you are doing


Virtual environment tools#

While we have made some recommendations, there are a variety of tools for managing Python dependencies and environments.

You don’t have to use the tools we recommend. There are other solutions to the same problems that are fine to use if they fit better into your workflow.


Conda#

Conda is a generic package manager. You can think of it like pip, but where anything can be a package (e.g. Python itself, scientific packages like mpich, petsc, fenics-dolfinx).

Key points, coming from pip/venv:

  • creates environments, like venv

  • Python itself is just another package

  • Can express proper dependencies across languages

  • All packages are binary, there’s no “install from source, if needed”

  • conda-forge is a community-maintained collection of over 20,000 conda packages

  • miniforge is the best way to get started with conda


Basic conda commands#

| conda | pip/venv |
| --- | --- |
| conda install fenics-dolfinx mpich | pip install |
| conda create --name myproject python=3.10 fenics-dolfinx mpich | python3 -m venv |
| conda activate myproject | source myproject/bin/activate |
| conda deactivate | deactivate |
| conda list | pip list |
| conda env export --name myproject [-f exported.yml] | pip freeze |


Sample environment.yml#

channels:
  - conda-forge
dependencies:
  - python=3.10
  - fenics-dolfinx
  - mpich

Create an environment from an environment file:

conda env create -n my-paper -f environment.yml
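Activate it with:

conda activate my-paper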

conda-lock#

conda-lock is a tool for creating “lock files” for conda environments, like pip-compile, but for conda:

conda install conda-lock
conda-lock lock --platform linux-64 --platform osx-arm64 -f environment.yml
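The resulting lock file can then be turned into a concrete environment; a sketch assuming the default output file name (check your conda-lock version for exact usage):

conda-lock install --name my-locked-env conda-lock.yml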

Demo#

conda demo


Containers (Docker)#

Docker is a tool for packaging an application and all its dependencies, including the operating system, together in the form of images and containers. Typical use looks like:

  • Pull an image from a remote registry (or build the image from source)

  • Create a container (a running instance of an image)

  • Run some code inside the container

  • Stop and remove the container


Basic docker commands#

  • Pull image

    docker pull <image name>
    

    e.g.

    docker pull ghcr.io/scientificcomputing/fenics:2023-08-14
    
  • Start new container (set working directory to /home/shared and share this directory with your current working directory)

    docker run --name=my-research-code -w /home/shared -v $PWD:/home/shared -it ghcr.io/scientificcomputing/fenics:2023-08-14
    

  • Exit container with Ctrl+D or exit

  • Start existing container

    docker start my-research-code
    
  • Stop running container

    docker stop my-research-code
    
  • Remove existing container

    docker rm my-research-code
    

  • List downloaded images

    docker images
    
  • List containers (omit -a to only list running containers)

    docker ps -a
    

Demo - Running jupyter inside docker#

If you are used to GUI applications (i.e. applications with windows), being restricted to a terminal inside a container may feel limiting. Fortunately, you can connect to containers over the network, which means that web-based UIs like Jupyter work well in containers.


To run a web UI like Jupyter:

docker run \
  --rm \
  -w $PWD \
  -v $PWD:$PWD \
  -u $(id -u) \
  -p 127.0.0.1:8888:8888 \
  my-image jupyter lab --ip=0.0.0.0

The key points here:

  • -p forwards the local port 127.0.0.1:8888 to the port in the container (also 8888 here, but it could be different)

  • Because of network namespaces, jupyter must listen on the non-default ip 0.0.0.0 to be connectable from outside the container
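Jupyter then prints a URL of the form http://127.0.0.1:8888/lab?token=… which you can open in a browser on the host machine.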


Docker development workflow#

To make a docker image:

  • Write a Dockerfile with instructions on how to build and install the dependencies

  • Build an image from the Dockerfile

  • Push this to a registry (optional; see the example commands below)
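For example, building and pushing with placeholder image names (adjust the registry, name, and tag to your project):

    # build an image from the Dockerfile in the current directory
    docker build -t ghcr.io/myname/my-image:2024-01-01 .
    # push it to the registry (optional; requires e.g. docker login ghcr.io)
    docker push ghcr.io/myname/my-image:2024-01-01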


Figure: Docker image and container workflow. Taken from https://linuxiac.com/what-is-docker-container/


Dockerfile#

  • Dockerfiles are a series of directives, each of which modifies the filesystem, creating a layer.

  • The result of a series of layers is an image

FROM ghcr.io/scientificcomputing/fenics:2023-08-14

WORKDIR /repo

# Copy requirements.txt first so that dependencies are only reinstalled
# when requirements.txt itself changes, not when any other file changes
COPY requirements.txt /tmp/requirements.txt
RUN python3 -m pip install --no-cache-dir --upgrade pip \
 && python3 -m pip install --no-cache-dir -r /tmp/requirements.txt
# Copy the rest of the project and install it
COPY . /example-paper-fenics
RUN cd /example-paper-fenics \
 && python3 -m pip install .
USER 1000
CMD ["jupyter", "lab", "--ip=0.0.0.0"]

We maintain some docker images for scientific computing#

https://github.com/scientificcomputing


What to choose?#

  • Use python virtual environments if you

    • have only Python dependencies

  • Use conda if

    • you rely on non-Python packages (e.g. C libraries, TensorFlow, FEniCS)

    • all packages exist on conda (conda-forge / bioconda)

  • Use docker if you

    • need full control over the environment

    • require additional packages that are hard to install

    • need the development version of a non-Python dependency (e.g. FEniCS)

    • someone else already maintains an image with what you need!