Open Science and reproducibility

Organizing your projects

  • git version control system
  • Creating a folder structure
  • GitHub for backup and collaboration
  • Licenses
  • Data
Organizing your projects - 19.03.24 - Henrik Finsberg

Why version control systems?

  • To keep a history of what has been changed and why
  • To make it easy to go back to a previous version
  • To make changes while maintaining a working version

Image is used under a CC-BY 4.0 license. DOI: 10.5281/zenodo.3332807.


Image is used under a CC-BY 4.0 license. https://coderefinery.github.io/git-intro/basics/.


Resources for learning about git


Setting up your first project


Demo cookiecutter

python3 -m pip install cookiecutter
python3 -m cookiecutter gh:scientificcomputing/generate-paper

(Here you can also use pipx)

research_paper_1
├── .gitignore              # List files to be excluded from git
├── .github                 # Automated workflows with GitHub actions
├── .pre-commit-config.yaml # Pre-commit hooks
├── CITATION.cff            # Info about how to cite your project
├── LICENSE                 # The license
├── README.md               # What the user should read first
├── _config.yml             # Configurations for docs
├── _toc.yml                # Table of contents for docs
├── code                    # Where to put your code
│   └── README.md           # Description of the code
├── cspell.config.yaml      # Dictionary for spell checker
├── data                    # Where to put your data
│   └── README.md           # Description of the data
├── docker
│   └── Dockerfile          # The docker file
├── docs                    # Where to put your docs
│   ├── logo.png            # Simula Logo to put in documentation
│   └── references.bib      # Where to put your references
├── environment.yml         # Conda dependencies
└── pyproject.toml          # Python metadata and dependencies

Personal preference

  • I usually just copy files from an existing project that I have locally and edit the files

Writing a README file

  • The first documentation a user reads is the README file
  • README.md (markdown) - https://www.markdownguide.org/basic-syntax/
  • Should include
    • Title of the project
    • Description of the project
    • Installation instructions
    • How to get started

The README file (continued)

  • Optional

    • Badges
    • Information about how to contribute
    • License information (should also be in a separate file)
    • Credits
    • Example
    • How to cite
    • Screenshots / figures

What is GitHub?


Versioning

  • When you think that your code is ready for external users, it is time to create your first release
  • Your code should get a version number.
  • Create a release when you submit your paper.
  • MAJOR.MINOR.MICRO
  • Specify the version number in pyproject.toml
  • Semantic or Calendar based versioning

Calendar based versioning

https://calver.org

  • YEAR.MONTH.DAY
  • YEAR.MONTH.NUMBER
  • YEAR.NUMBER
  • ...
  • e.g. 2023.11.4
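
As a rough illustration of the first two schemes listed above, a calendar-based version string can be derived from a release date. This is a minimal sketch; the function name `calver` and its `number` parameter are made up for this example and do not come from calver.org.

```python
from datetime import date
from typing import Optional


def calver(release_date: date, number: Optional[int] = None) -> str:
    """Return a calendar-based version string.

    With number=None the scheme is YEAR.MONTH.DAY. Otherwise it is
    YEAR.MONTH.NUMBER, where NUMBER counts releases within that month.
    """
    last = release_date.day if number is None else number
    return f"{release_date.year}.{release_date.month}.{last}"


print(calver(date(2023, 11, 4)))     # 2023.11.4
print(calver(date(2024, 3, 19), 0))  # 2024.3.0
```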

Semantic versioning

https://semver.org

  • major.minor.micro, e.g. 0.1.2
  • Bump micro / patch: Bug fixes not affecting the API
  • Bump minor: Backward compatible API additions/changes
  • Bump major: Backward incompatible API changes
  • Typically start with major version 0 and bump to 1 when ready for users.
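
The bump rules above can be sketched as a small function. This is only an illustration: the name `bump` is hypothetical, and it assumes a plain `major.minor.micro` string without pre-release or build suffixes.

```python
def bump(version: str, part: str) -> str:
    """Return the next version after bumping one part of major.minor.micro.

    Bumping a part resets every part to its right to zero.
    """
    major, minor, micro = (int(piece) for piece in version.split("."))
    if part == "major":  # backward incompatible API changes
        return f"{major + 1}.0.0"
    if part == "minor":  # backward compatible API additions/changes
        return f"{major}.{minor + 1}.0"
    if part == "micro":  # bug fixes not affecting the API
        return f"{major}.{minor}.{micro + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump("0.1.2", "micro"))  # 0.1.3
print(bump("0.1.2", "minor"))  # 0.2.0
print(bump("0.1.2", "major"))  # 1.0.0
```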

Publish a new release

  • Bump version in pyproject.toml
  • Create a git tag once you have bumped the version
    git tag v0.1.2
    git push --tags
    
  • Create a release on GitHub and write a changelog
    • It is also possible to create a tag during this step

Write a changelog

  • List the notable changes since the previous release

    • For the first release you don't need a changelog
  • Information about changes is important for the users

  • https://keepachangelog.com/en/1.0.0/
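
To give an idea of what an entry can look like, the sketch below renders one release in the layout suggested by Keep a Changelog: a heading with version and date, followed by categories such as "Added" or "Fixed". The function name `changelog_entry` and the listed changes are invented for illustration.

```python
from datetime import date


def changelog_entry(version: str, released: date, changes: dict) -> str:
    """Render one release as a Markdown changelog entry.

    changes maps a category (e.g. "Added", "Fixed") to a list of items.
    """
    lines = [f"## [{version}] - {released.isoformat()}"]
    for category, items in changes.items():
        lines.append("")
        lines.append(f"### {category}")
        lines.extend(f"- {item}" for item in items)
    return "\n".join(lines)


# Made-up changes, only to show the layout
entry = changelog_entry(
    "0.1.2",
    date(2023, 11, 4),
    {"Fixed": ["Wrong units in the plotting script"]},
)
print(entry)
```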


Tools for managing versions and tags


Licenses

  • What can other users do with the material in your repository?
  • No license means that nobody can use, copy, distribute, or modify the work without consent from the author
  • Add a file called LICENSE to your repository. On GitHub, click "Add file" and type the name LICENSE, and GitHub will offer a set of license templates to choose from

What license to choose?

  • MIT: Permissive - others can use your code in almost any way as long as they keep the copyright and license notice, and the license disclaims warranty and liability if the software doesn't work (recommended in most cases)
  • GPL: Copyleft - derivative work must use the same license - good way to embrace open source but often problematic for commercial companies
  • LGPL: Similar to GPL, but software that only links to the library can be released under a different license
  • CC-BY-4.0: Typically used for creative work (many open-access journals use this)

https://choosealicense.com


Data repositories and data sharing

  • Large datasets (more than 50MB) should not be stored in your git repository
    • Git does not work well with binary files
  • Instead you should store large files in a data repository
    • Use Google Drive / Dropbox / Other while developing
    • Publish data on Zenodo when ready (Zenodo and GitHub integrate well)
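
Before committing, it can help to check whether any file exceeds the size mentioned above. The following is a minimal sketch, assuming the 50 MB threshold from this slide. The function name `large_files` is hypothetical, and the `.git` folder is skipped because it holds git's own internal files.

```python
from pathlib import Path

LIMIT_MB = 50  # threshold taken from the text above


def large_files(root, limit_mb=LIMIT_MB):
    """Return all files under root that are larger than limit_mb megabytes."""
    root = Path(root)
    limit_bytes = limit_mb * 1024 * 1024
    found = []
    for path in root.rglob("*"):
        # Skip git's internal files
        if ".git" in path.relative_to(root).parts:
            continue
        if path.is_file() and path.stat().st_size > limit_bytes:
            found.append(path)
    return sorted(found)


# List candidates for a data repository in the current folder
for path in large_files("."):
    print(path)
```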

Other tools for data repositories


Next up: Recording environments

https://scientificcomputing.github.io/phd-retreat-190324/environments-slides


Ask the students whether they are familiar with git. See whether some of them already have an idea of the benefits of using version control.

Here are some resources, also with links to additional resources. For now I will assume a very basic understanding of git.

When setting up your first project it is nice to have a template for which files should be part of the project. We have created two template repos for this, which will basically copy all the files from that repo into your own repo. Cookiecutter is another alternative, where you run a command and it prompts you with some questions and fills in the answers. Note! These examples are very Python-centric. How many are not primarily using Python?