Best Practices in Modern Software Development: Reproducible environments

Reproducible environments

Best Practices in Modern Software Development: 23.11.23

Min Ragan-Kelley

  • A module is a file consisting of Python code
  • A package is a hierarchical file directory structure that consists of modules and sub-packages
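
As a rough illustration of the two definitions above, the sketch below checks whether an imported name is a plain module or a package. It relies on packages carrying a `__path__` attribute; the helper name `is_package` is made up for this example.

```python
import itertools
import json


def is_package(module):
    """Return True if the imported module is a package (it has a __path__)."""
    return hasattr(module, "__path__")


# json is a directory of modules (a package); itertools is a single module.
print("json is a package:", is_package(json))
print("itertools is a package:", is_package(itertools))
```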

Using modules

import itertools
# Access function from the module
itertools.product

# Alias
import itertools as itools
itools.product

# The following is considered a bad practice
from itertools import *
# Easy to shadow existing variables (also hard for IDEs)
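
To make the shadowing risk concrete, here is a small sketch that lists which builtins a star import would silently replace. It uses `math` rather than `itertools` purely as an assumption for illustration, because `math` happens to share a name with a builtin.

```python
import builtins
import math

# math defines no __all__, so `from math import *` would bring in
# every public name (those not starting with an underscore).
star_names = [name for name in dir(math) if not name.startswith("_")]

# Of those, the names that would replace a builtin of the same name.
shadowed_builtins = sorted(n for n in star_names if hasattr(builtins, n))

print(len(star_names), "names imported")
print("builtins that would be shadowed:", shadowed_builtins)
```

After such a star import, a call like `pow(2, 3, 5)` would fail, because `math.pow` only accepts two arguments.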

Using packages

from scipy.optimize import minimize
#      ^      ^               ^
#      |      |               |
#   Package   |               |
#           Module            |
#                          Function

Where does Python find modules?

In [1]: import asyncio, numpy

In [2]: asyncio.__file__
Out[2]: '/usr/local/lib/python3.12/asyncio/__init__.py'

In [3]: numpy.__file__
Out[3]: '/home/myname/.local/lib/python3.12/site-packages/numpy/__init__.py'
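
The same question can be asked without importing the module first. The sketch below uses `importlib.util.find_spec`; the second module name is deliberately made up so that the lookup fails.

```python
import importlib.util

# Look up where a module would be loaded from, without importing it.
spec = importlib.util.find_spec("asyncio")
print(spec.origin)

# A name that cannot be found on the search path yields None.
missing = importlib.util.find_spec("no_such_module_for_this_example")
print(missing)
```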

How does Python know which modules and packages are available?

import sys

sys.path
['',
 '/usr/local/lib/python312.zip',
 '/usr/local/lib/python3.12',
 '/usr/local/lib/python3.12/lib-dynload',
 '/home/myname/.local/lib/python3.12/site-packages',
 '/usr/local/lib/python3.12/site-packages']

The order is important!
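
Why the order matters can be seen in a minimal sketch: two directories each contain a file with the same module name, and the one that comes first on `sys.path` wins. The module name `pathdemo_mod` and the temporary directories are hypothetical, created only for this example.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as first, tempfile.TemporaryDirectory() as second:
    # Two different files with the same (hypothetical) module name.
    Path(first, "pathdemo_mod.py").write_text("WHERE = 'first'\n")
    Path(second, "pathdemo_mod.py").write_text("WHERE = 'second'\n")

    # Put both directories at the front of the search path, `first` before `second`.
    sys.path[:0] = [first, second]
    try:
        importlib.invalidate_caches()
        found = importlib.import_module("pathdemo_mod").WHERE
    finally:
        # Restore the search path.
        sys.path.remove(first)
        sys.path.remove(second)

print(found)  # prints "first": the earlier entry on sys.path wins
```

This is also why a user-level install (`~/.local/...`) can hide a system-level copy of the same package.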


What is an environment?

  • An environment is where you install your software, isolated from your system and other projects. Why?
    • Conflicting dependency versions
    • Easier to upgrade
    • Easier to dispose of and start from scratch
    • Portable
    • It's always a good idea to use environments!
  • Three main options
    • Python virtual environments
    • Conda environments
    • Containers (Docker)
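
For the first of these options, a running interpreter can tell whether it is inside a virtual environment. A minimal sketch follows; the function name is an assumption for this example, and the check only covers venv-style environments, not conda or containers.

```python
import sys


def in_virtual_environment():
    """True when this interpreter runs inside a venv-style virtual environment."""
    # Inside a venv, sys.prefix points at the environment, while
    # sys.base_prefix still points at the Python it was created from.
    return sys.prefix != sys.base_prefix


print("prefix:     ", sys.prefix)
print("base prefix:", sys.base_prefix)
print("in a virtual environment:", in_virtual_environment())
```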

How and why to specify environments

  • An environment specification is a portable description of what packages should go in an environment.
  • When you specify your environment, it's easier for others to reproduce your environment, or at least compare it with theirs.

  • Tools turn specifications into environments (and vice versa!)

    • pip - requirements.txt
    • conda - environment.yml
    • Docker - Dockerfile

Python virtual environments

  • Python comes with a built-in tool for creating virtual environments

    python3 -m venv ./my-env
    

    This will create a folder called my-env containing your virtual environment

  • Activate virtual environment

    source ./my-env/bin/activate
    
  • Now you can install the dependencies you need

    python3 -m pip install pandas
    
  • To deactivate your virtual environment, type

    deactivate
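
The same `venv` module can also be driven from Python, which helps show what the commands above actually create. This is only a sketch: it builds the environment in a temporary directory and skips pip (`with_pip=False`) to stay fast, whereas `python3 -m venv` installs pip by default.

```python
import tempfile
import venv
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    env_dir = Path(tmp) / "my-env"
    # Create the environment folder (without pip, to keep the sketch quick).
    venv.create(env_dir, with_pip=False)

    # pyvenv.cfg marks the folder as a virtual environment.
    has_config = (env_dir / "pyvenv.cfg").exists()
    created = sorted(p.name for p in env_dir.iterdir())

print("pyvenv.cfg present:", has_config)
print("contents:", created)
```

Because the environment is just a folder, removing that folder is all it takes to dispose of it.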
    

Demo

creating a virtual environment


Exercise

Create two different virtual environments called latest and old-pandas: one where you install the latest version of pandas, and one where you install a pandas version lower than 2.0.

Verify the version using

python3 -c "import pandas; print(pandas.__version__)"
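
An alternative check, sketched below, reads the installed version from the package metadata without importing the package, and also handles the case where it is not installed at all. The helper name `installed_version` is an assumption for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(distribution):
    """Return the installed version string, or None if it is not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return None


# Prints the version in the active environment, or None if pandas is absent.
print("pandas:", installed_version("pandas"))
```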

Example

  • python3 -m venv latest
    . latest/bin/activate
    python3 -m pip install pandas
    
  • python3 -m venv old-pandas
    . old-pandas/bin/activate
    python3 -m pip install "pandas<2.0"
    deactivate
    

Creating a pyproject.toml

  • pyproject.toml is the recommended way to specify project metadata for Python projects
  • Minimum metadata
    • name
    • version
    • authors
    • license
    • dependencies

Example pyproject.toml

[build-system]
requires = ["setuptools>=64.4.0"]
build-backend = "setuptools.build_meta"


[project]
name = "my-paper"
version = "0.1.0"
dependencies = [
  "numpy",
]

[tool.setuptools]
# empty packages when your project is not a 'real' package
# (i.e. only dependencies, nothing to actually install)
packages = []

Exercise

  • Add numpy, scipy, and numba as dependencies in your pyproject.toml

  • Try to install these dependencies in your virtual environment by typing

    python3 -m pip install -e .
    

Extra dependencies for development

  • You might want to use some other libraries when developing the software, or for other specific tasks (such as pip-tools or pytest)
  • These libraries should not be required when installing the software,
    but it is nice for other developers to have an easy way to discover and install them
  • You can list these in pyproject.toml under project.optional-dependencies

Specifying optional dependencies in pyproject.toml

[project.optional-dependencies]
test = [
    "pytest",
    "pytest-cov",
]
dev = [
    "pdbpp",
    "ipython",
    "tbump",
    "pre-commit",
    "pip-tools",
]
all = [
   # refers to this project by name, so it must match the name under [project]
   "my-paper[test,dev]"
]

Installing optional dependencies

  • Use
    python3 -m pip install ".[dev]"
    
    to install the package in the current directory and its optional 'dev' dependencies.
  • To install several sets of optional dependencies, separate the names with commas
    python3 -m pip install ".[dev,test]"
    

Pinning exact versions of the libraries you use

  • To ensure reproducible results, it is important that you specify the exact versions of the libraries you used and all their dependencies
  • You can export your current environment at any time in requirements.txt format with
    pip freeze
    
  • But you shouldn't specify these as your direct dependencies! (never put pandas==2.1.2 in your dependencies by hand)
  • We can use a tool called pip-compile (install with pip install pip-tools) to pin all the versions based on your pyproject.toml

Pinning with pip-tools

pip-compile is like pip install followed by pip freeze, but without actually installing anything

  • Use
    pip-compile pyproject.toml
    
    to create a file requirements.txt containing all packages you use, directly or indirectly
  • You can now install the exact dependencies using the command
    python3 -m pip install -r requirements.txt
    
  • pip-tools and dependabot can be used to update requirements.txt when you want to.
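
One way to sanity-check that a requirements file really is fully pinned is sketched below. The helper `unpinned_requirements` and the sample text are made up for illustration (the sample is not real pip-compile output), and the parsing is deliberately simplistic: it only looks for `==` on each requirement line.

```python
def unpinned_requirements(text):
    """Return the requirement lines that are not pinned with '=='."""
    unpinned = []
    for line in text.splitlines():
        # Drop comments such as "# via my-paper" and surrounding whitespace.
        requirement = line.split("#", 1)[0].strip()
        # Skip blank lines and pip options such as "--index-url ...".
        if not requirement or requirement.startswith("-"):
            continue
        if "==" not in requirement:
            unpinned.append(requirement)
    return unpinned


# Illustrative content only, not real pip-compile output.
SAMPLE = """\
# hand-written sample
pandas==2.1.2
    # via my-paper
numpy
"""

print(unpinned_requirements(SAMPLE))  # prints ['numpy']
```

A file produced by pip-compile should yield an empty list, while a hand-written list of loose dependencies will not.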

Pinning optional dependencies

It might be beneficial to pin some of your optional dependencies:

pip-compile --extra=dev --output-file=requirements-dev.txt pyproject.toml
  • Here we save these dependencies to a different file called requirements-dev.txt, which can be installed using
    python3 -m pip install -r requirements-dev.txt
    

When to pin

It can be hard to know when to pin dependencies and when not to. Pinned packages help ensure reproducible results. But they also prevent compatibility with other projects.

It's a good idea to use pinned dependencies when you are:

  • building reproducible results
  • building a container image
  • rendering a website
  • operating a service

When not to pin

  • In package dependencies
  • Running tests (maybe!)
  • When you want to share an environment with another tool
  • Short answer: always good to have both!
    • always track loose, direct dependencies
    • track pinned dependencies separately, using tools, not by hand
    • which to install depends on what you are doing

Virtual environment tools

While we have made some recommendations, there are a variety of tools for managing Python dependencies and environments.

You don't have to use the tools we recommend.
There are other solutions to the same problems that are fine to use if they fit better into your workflow.


Conda

Conda is a generic package manager. You can think of it like pip, but where anything can be a package (e.g. Python itself, scientific packages like mpich, petsc, fenics-dolfinx).

Key points, coming from pip/venv:

  • creates environments, like venv
  • Python itself is just another package
  • Can express proper dependencies across languages
  • All packages are binary; there's no "install from source, if needed"
  • conda-forge is a community-maintained collection of over 20,000 conda packages
  • miniforge is the best way to get started with conda
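
Because Python is just another package inside a conda environment, the interpreter can check whether it lives in one. The sketch below uses a common heuristic, assumed here rather than taken from the slides: conda environments contain a `conda-meta` directory at their prefix.

```python
import os
import sys


def in_conda_environment():
    """Heuristic: conda environments contain a 'conda-meta' directory."""
    return os.path.isdir(os.path.join(sys.prefix, "conda-meta"))


print("environment prefix:", sys.prefix)
print("looks like a conda environment:", in_conda_environment())
```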

Basic conda commands

conda                                                             pip/venv
conda install fenics-dolfinx mpich                                pip install
conda create --name myproject python=3.10 fenics-dolfinx mpich    python3 -m venv
conda activate myproject                                          source myproject/bin/activate
conda deactivate                                                  deactivate
conda list                                                        pip list
conda env export --name myproject [-f exported.yml]               pip freeze

Sample environment.yml

channels:
  - conda-forge
dependencies:
  - python=3.10
  - fenics-dolfinx
  - mpich

Create an environment from an environment file:

conda env create -n my-paper -f environment.yml

conda-lock

conda-lock is a tool for creating "lock files" for conda environments, like pip-compile, but for conda:

conda install conda-lock
conda-lock lock --platform linux-64 --platform osx-arm64 -f environment.yml

DEMO

conda demo


Containers (Docker)

Docker is a tool for packaging an application and all its dependencies, including the operating system, together in the form of images and containers. Typical use looks like:

  • Pull an image from a remote registry (or build the image from source)
  • Create a container (a running instance of an image)
  • Run some code inside the container
  • Stop and remove the container

Basic docker commands

  • Pull image

    docker pull <image name>
    

    e.g.

    docker pull ghcr.io/scientificcomputing/fenics:2023-08-14
    
  • Start new container (set working directory to /home/shared and share this directory with your current working directory)

    docker run --name=my-research-code -w /home/shared -v $PWD:/home/shared -it ghcr.io/scientificcomputing/fenics:2023-08-14
    
  • Exit container with Ctrl+D or exit

  • Start existing container

    docker start my-research-code
    
  • Stop running container

    docker stop my-research-code
    
  • Remove existing container

    docker rm my-research-code
    
  • List downloaded images
    docker images
    
  • List containers (omit -a to only list running containers)
    docker ps -a
    

Demo - Running jupyter inside docker

If you are used to GUI applications (e.g. with windows), being restricted to a terminal inside a container may be limiting.
Fortunately, you can connect to containers over the network,
meaning that web-based UIs like Jupyter work in containers.


To run a web UI like Jupyter:

docker run \
  --rm \
  -w $PWD \
  -v $PWD:$PWD \
  -u $(id -u) \
  -p 127.0.0.1:8888:8888 \
  my-image jupyter lab --ip=0.0.0.0

The key points here:

  • -p (short for --publish) forwards the local port 127.0.0.1:8888 to the port in the container (also 8888, but could be different)
  • Because of network namespaces, jupyter must listen on the non-default ip 0.0.0.0 to be connectable from outside the container
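
If you launch the container from a script, the same command can be assembled in Python. The sketch below only builds and prints the command, so Docker is not needed to try it; `my-image` is the placeholder image name from the command above, and `os.getuid()` assumes a POSIX system.

```python
import os
import shlex

cwd = os.getcwd()

command = [
    "docker", "run",
    "--rm",                        # remove the container when it exits
    "-w", cwd,                     # working directory inside the container
    "-v", f"{cwd}:{cwd}",          # mount the current directory at the same path
    "-u", str(os.getuid()),        # run as the current user (POSIX only)
    "-p", "127.0.0.1:8888:8888",   # host address:port -> container port
    "my-image", "jupyter", "lab", "--ip=0.0.0.0",
]

# Print a copy-pasteable command; shlex.join quotes paths that contain spaces.
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would start the container. Keeping the arguments as a list avoids the quoting problems that an unquoted `$PWD` can cause in the shell.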

Docker development workflow

To make a docker image:

  • Write a Dockerfile with instructions on how to build and install the dependencies
  • Build an image from the Dockerfile
  • Push this to a registry (optional)

Taken from https://linuxiac.com/what-is-docker-container/


Dockerfile

  • Dockerfiles are a series of directives,
    each of which modifies the filesystem, creating a layer.
  • The result of a series of layers is an image
FROM ghcr.io/scientificcomputing/fenics:2023-08-14

WORKDIR /repo

# Copy requirements.txt first so that we don't need to reinstall the dependencies when another file changes
COPY requirements.txt /tmp/requirements.txt
RUN python3 -m pip install --no-cache-dir --upgrade pip \
 && python3 -m pip install --no-cache-dir -r /tmp/requirements.txt
# Copy the rest of the repository and install the project itself
COPY . /example-paper-fenics
RUN cd /example-paper-fenics \
 && python3 -m pip install .
USER 1000
CMD ["jupyter", "lab", "--ip=0.0.0.0"]

We maintain some docker images for scientific computing

https://github.com/orgs/scientificcomputing/packages


What to choose?

  • Use python virtual environments if you

    • have only python dependencies
  • Use conda if

    • you rely on non-Python packages (e.g. C libraries, TensorFlow, FEniCS)
    • all packages exist on conda (conda-forge / bioconda)
  • Use docker if you

    • need full control over the environment
    • require additional packages that are hard to install
    • need the development version of a non-Python dependency (e.g. FEniCS)
    • can use an image that someone else already maintains with what you need!