Open Science and reproducibility

Recording dependencies and environment

Recording dependencies and environment - 19.03.24 - Henrik Finsberg

What is an environment?

  • An environment consists of the operating system, installed packages (with specific versions) and configurations

  • Different environments running the same code can produce different results.

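    • Package updates can change or break the behaviour of your code
  • Example (Python 2 vs Python 3): the snippet below prints 0 under Python 2 (integer division) and 0.2 under Python 3 (true division)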

    a = 1
    b = 5
    print(a/b)
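
    A minimal sketch of how the platform part of an environment could be recorded from within Python, using only the standard library. The function name describe_platform is hypothetical and not part of the original slides; installed packages and configurations are not covered here.

```python
import platform
import sys


def describe_platform():
    """Collect basic facts about the operating system and the interpreter."""
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "executable": sys.executable,
    }


info = describe_platform()
for key, value in info.items():
    # Right-align the keys so the report is easy to read
    print(f"{key:>15}: {value}")
```

    Saving this output next to your results makes it easier to explain later why two runs of the same code differ.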
    

How and why to specify environments

  • An environment specification is a description of what packages should go in an environment.
  • When you specify your environment, it's easier for you (or someone else!) to reproduce your environment.

  • Tools turn specifications into environments (and vice versa!)

    • pip - requirements.txt (python)
    • conda - environment.yml
    • Docker - Dockerfile

Python virtual environments

  • Python comes with a built-in tool for creating virtual environments

    python3 -m venv ./my-env
    

    This will create a folder called my-env containing your virtual environment

  • Activate virtual environment

    source ./my-env/bin/activate  # or '. ./my-env/bin/activate'
    

    Windows users should use

    .\my-env\Scripts\activate
    
  • Now you can install the dependencies you need

    python3 -m pip install pandas
    
  • To deactivate your virtual environment, type

    deactivate
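
    The same steps can also be scripted. The sketch below uses the standard library venv module to create an environment in a temporary folder, and compares sys.prefix with sys.base_prefix to detect whether the running interpreter belongs to a virtual environment. The helper names are hypothetical, and with_pip=False is an assumption made to keep the sketch fast (the command line tool installs pip by default).

```python
import sys
import tempfile
import venv
from pathlib import Path


def in_virtual_environment():
    """Return True if the running interpreter belongs to a virtual environment."""
    return sys.prefix != sys.base_prefix


def create_environment(path):
    """Create a virtual environment at path, similar to 'python3 -m venv path'."""
    venv.create(path, with_pip=False)
    # Every virtual environment contains a pyvenv.cfg file at its root
    return Path(path) / "pyvenv.cfg"


with tempfile.TemporaryDirectory() as tmp:
    config = create_environment(Path(tmp) / "my-env")
    created = config.exists()
    print("Created environment:", created)

print("Running inside a virtual environment:", in_virtual_environment())
```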
    

Exercise

Create two different virtual environments called latest and old-pandas: one where you install the latest version of pandas, and one where you install a pandas version lower than 2.0.

Verify the version using

python3 -c "import pandas; print(pandas.__version__)"

Example solution

python3 -m venv latest
. latest/bin/activate
python3 -m pip install pandas
deactivate
python3 -m venv old-pandas
. old-pandas/bin/activate
python3 -m pip install "pandas<2.0"
deactivate
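
If you want a script to check the version instead of reading it off the terminal, something like the following sketch could be run inside each environment. The helper names are hypothetical, and the version parsing is deliberately simple (it only looks at the leading number).

```python
from importlib import metadata


def major_version(version):
    """Return the leading integer of a version string such as '1.5.3'."""
    return int(version.split(".")[0])


def is_old_pandas(version):
    """Return True if the version is lower than 2.0."""
    return major_version(version) < 2


try:
    installed = metadata.version("pandas")
    print(installed, "is lower than 2.0:", is_old_pandas(installed))
except metadata.PackageNotFoundError:
    # Happens when the active environment has no pandas installed
    print("pandas is not installed in this environment")
```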

Specifying all your dependencies in a file

Specify your Python dependencies in a requirements.in or pyproject.toml file.
See https://scientificcomputing.github.io/seminar-23-11-2023/environments-slides.html#14 for more information about pyproject.toml.

numpy
scipy==1.3.1
sympy>=1.1
git+https://github.com/someuser/someproject.git
git+https://github.com/anotheruser/anotherproject.git@sometag

I usually don't specify any versions here unless I know I need an exact version
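
To see at a glance which entries are exactly pinned and which are left open, a small helper like the sketch below can be used. The function split_pinned is hypothetical, and the parsing is simplified: it only looks for == and does not handle URLs, extras or environment markers.

```python
REQUIREMENTS_IN = """\
numpy
scipy==1.3.1
sympy>=1.1
"""


def split_pinned(text):
    """Split requirement lines into exactly pinned and flexible ones."""
    pinned, flexible = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "==" in line:
            pinned.append(line)
        else:
            flexible.append(line)
    return pinned, flexible


pinned, flexible = split_pinned(REQUIREMENTS_IN)
print("Exactly pinned:", pinned)
print("Left open:", flexible)
```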


Pinning exact versions of the libraries you use

  • To ensure reproducible results, it is important that you specify the exact versions of the libraries you used and all their dependencies
  • You can export your current environment at any time in requirements.txt format with
    pip freeze
    
  • But you shouldn't specify these as your direct dependencies: the frozen list also contains every indirect dependency
  • We can use a tool called pip-compile (install with pip install pip-tools) to pin all the versions based on your pyproject.toml or requirements.in
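
    To make the idea of freezing concrete, here is a rough sketch of a name==version listing built with the standard library. It is not how pip freeze is implemented; the real tool also handles editable and URL installs, which this sketch ignores. The function name freeze is hypothetical.

```python
from importlib import metadata


def freeze():
    """Return sorted 'name==version' lines for the installed distributions."""
    lines = set()
    for dist in metadata.distributions():
        name = dist.metadata["Name"]
        if name:  # skip broken installs without a name
            lines.add(f"{name}=={dist.version}")
    return sorted(lines, key=str.lower)


for line in freeze():
    print(line)
```

    Note that the listing mixes the packages you asked for with the packages they pulled in, which is why it works as a record of an environment but not as a list of direct dependencies.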

Pinning with pip-tools

pip-compile is like pip install followed by pip freeze, but without actually installing anything

  • Use
    pip-compile requirements.in
    
    or
    pip-compile pyproject.toml
    
    to create a file requirements.txt containing all packages you use, directly or indirectly
  • You can now install the exact dependencies using the command
    python3 -m pip install -r requirements.txt
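
    Once the pins are installed, you may want to check that an environment really matches them. The sketch below compares name==version pins with a mapping of installed versions. All function names are hypothetical, the version numbers in the example are made up for illustration, and the parser is simplified (it ignores extras and URL requirements).

```python
import re
from importlib import metadata


def normalize(name):
    """Normalize a package name so that 'Typing_Extensions' equals 'typing-extensions'."""
    return re.sub(r"[-_.]+", "-", name).lower()


def parse_pins(text):
    """Map package name to pinned version for each 'name==version' line."""
    pins = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0]  # drop comments such as '# via pandas'
        line = line.split(";", 1)[0]  # drop environment markers
        line = line.rstrip().rstrip("\\").strip()  # drop line continuations
        if "==" in line and not line.startswith("-"):
            name, version = line.split("==", 1)
            pins[normalize(name.strip())] = version.strip()
    return pins


def installed_versions():
    """Map normalized package name to installed version."""
    return {
        normalize(dist.metadata["Name"]): dist.version
        for dist in metadata.distributions()
        if dist.metadata["Name"]
    }


def find_mismatches(pins, installed):
    """Return {name: (pinned, installed or None)} for every deviation."""
    return {
        name: (version, installed.get(name))
        for name, version in pins.items()
        if installed.get(name) != version
    }


example = """\
numpy==1.26.4
    # via pandas
pandas==2.2.1
"""
pins = parse_pins(example)
mismatches = find_mismatches(pins, {"numpy": "1.26.4", "pandas": "1.5.3"})
print(mismatches)
```

    In practice you would pass installed_versions() as the second argument and read the pins from your requirements.txt.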
    

Conda

Conda is a generic package manager. You can think of it like pip, but where anything can be a package (e.g. Python itself, scientific packages like mpich, petsc, fenics-dolfinx).

Key points:

  • creates environments, like venv
  • Python itself is a package
  • All packages are binary; there is no "install from source, if needed"
  • conda-forge is a community-maintained collection of over 20,000 conda packages
  • miniforge is the best way to get started with conda

Basic conda commands

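  • conda install fenics-dolfinx mpich (the counterpart of pip install)
  • conda create --name myproject python=3.10 fenics-dolfinx mpich (the counterpart of python3 -m venv, but it also installs the listed packages)
  • conda list (the counterpart of pip list)
  • conda env export --name myproject [-f exported.yml] (the counterpart of pip freeze)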

Sample environment.yml

channels:
  - conda-forge
dependencies:
  - python=3.10
  - fenics-dolfinx
  - mpich

Create an environment from an environment file:

conda env create -n my-paper -f environment.yml

conda-lock

conda-lock is a tool for creating "lock files" for conda environments, like pip-compile, but for conda:

conda install conda-lock
conda-lock lock --platform linux-64 --platform osx-arm64 -f environment.yml

Spack

If you run code on an HPC cluster, packages may need to be built from source (rather than installed as pre-built binaries) for the code to work.

In these situations, Spack is a great tool. It also uses the notion of environments.


Docker

Docker is a tool for packaging an application and all its dependencies, including the operating system, together in the form of images and containers.

  • The user pulls an image from a remote registry (or builds the image from source)
  • The user creates a container (a running instance of an image)
  • The user runs the code inside the container

Basic docker commands

  • Pull image

    docker pull <image name>
    

    e.g.

    docker pull ghcr.io/scientificcomputing/fenics:2023-08-14
    
  • Start a new container (set the working directory inside the container to /home/shared and mount your current working directory there)

    docker run --name=my-research-code -w /home/shared -v "$PWD":/home/shared -it ghcr.io/scientificcomputing/fenics:2023-08-14
    
  • Exit container with Ctrl+D or exit
  • Start existing container

    docker start my-research-code
    
  • Execute a command in a running container (here: open a bash shell inside it)

    docker exec -it my-research-code bash
    
    
  • Stop running container

    docker stop my-research-code
    
  • Remove existing container
    docker rm my-research-code
    
  • List downloaded images
    docker images
    
  • List containers (omit -a to only list running containers)
    docker ps -a
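
    If you prefer to drive Docker from a script, the docker run command above can be assembled as an argument list. The sketch below only builds and prints the command; it does not start a container. The function docker_run_command and the host directory /tmp/project are hypothetical.

```python
import os
import shlex


def docker_run_command(image, name, workdir="/home/shared", host_dir=None):
    """Build the argument list for an interactive 'docker run' with a shared folder."""
    host_dir = str(host_dir) if host_dir else os.getcwd()
    return [
        "docker", "run",
        f"--name={name}",
        "-w", workdir,                   # working directory inside the container
        "-v", f"{host_dir}:{workdir}",   # mount the host folder at workdir
        "-it", image,
    ]


command = docker_run_command(
    "ghcr.io/scientificcomputing/fenics:2023-08-14",
    name="my-research-code",
    host_dir="/tmp/project",
)
print(shlex.join(command))
```

    Passing the list to subprocess.run avoids shell quoting problems when the host path contains spaces.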
    

Docker development workflow

  • The developer needs to write a Dockerfile with instructions on how to build and install the dependencies
  • The developer needs to build an image and push this to a registry
  • Build image from Dockerfile with
    docker build -t my-image .
    

Figure taken from https://linuxiac.com/what-is-docker-container/


Dockerfile

  • A Dockerfile is a series of directives,
    each of which modifies the filesystem, creating a layer.
  • The result of a series of layers is an image

# Choose a pinned Ubuntu release as base image
# (ubuntu:latest changes over time; newer releases also block system-wide pip installs)
FROM ubuntu:22.04

# Install python3 with pip as well as git (this might be needed to install
# some of the requirements) and clean up afterward to reduce image size
RUN apt-get update && apt-get install -y python3-pip git && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Set the working directory to /repo
# This means that this will be the directory where the commands will be executed
# and the default directory that will be used when the container starts
WORKDIR /repo

# Copy the requirements file into the container at /repo
COPY requirements.txt /repo

# Install any needed packages specified in requirements.txt
# First we also upgrade pip to the latest version
RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install --no-cache-dir -r requirements.txt

# Now we copy the rest of the files into the container at /repo
COPY . /repo

We maintain some docker images for scientific computing


What to choose?

  • Use python virtual environments if you

    • have only python dependencies
  • Use conda if

    • you rely on packages with a strong dependency on C++/Rust/C/Fortran (e.g. TensorFlow, FEniCS)
    • all packages exist on conda (conda-forge / bioconda)
  • Use docker if you

    • need full control over the environment
    • require additional packages that are hard to install
    • need the development version of a non-Python dependency (e.g. FEniCS)

When you submit or publish a paper, always create a Docker image that can reproduce the environment.

You can automate this process with GitHub Actions. We will have an exercise later :)

Image taken from reddit


Next up: Recording computations

https://scientificcomputing.github.io/phd-retreat-190324/recording-computations-slides
