Special thanks to Meeke for the extensive constructive feedback and improvement ideas¶

What is git, anyway?¶

  • git vs github & gitlab & Azure repos
  • Municipality Guidelines: "open tenzij.." ("open, unless..")
    • Github for public code
    • Azure repos for sensitive code

Why bother keeping things there?¶

  • Back-up (if you drop your laptop in a canal without railings)
  • History (if you introduce a bug and want to go back to the last working version)
  • Collaboration (with your team, supervisor, etc)

Setup¶

  • terminal
  • VS Code
  • github desktop

What do we add there?¶

DO: code, configs (but no secrets!), server scripts, documentation

DO: gitignore, workflows, other github files

FINE: log books, relevant resources

FINE: (not sensitive!!) sample data for demo purposes

CAREFUL: notebook output

DON’T: data, large files

DON'T: secrets, names (employees, customers, ...), anything sensitive

TODO: Alternatives for data hosting???

  • hugging face
  • zenodo
  • Git Large File Storage
  • For city datasets data.amsterdam.nl might be an option (contact datapunt@amsterdam.nl)

Sidenote thanks to Meeke: How to deal with secrets?¶

Option 1:¶

Set them manually as an environment variable, then load those in your script. Example:

import os
secret = os.environ.get("MY_ENV_VARIABLE")

Option 2:¶

Use the python-dotenv package combined with a .env file.

  1. Put your secrets in a file named .env:
API_KEY=test-key
API_SECRET=test-secret

Note: Do not add the .env file to your repository!

  2. Load the secrets using the dotenv package:
from dotenv import load_dotenv
import os

load_dotenv()

api_key = os.getenv("API_KEY")
api_secret = os.getenv("API_SECRET")

Option 3:¶

Put them in a manually created config file (for example a JSON or YAML file) and load this file in your code. Of course, do not include this config file in the repository.
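A minimal sketch of this option, assuming a JSON file. The file name secrets.json, the key API_KEY and the helper load_config are hypothetical names used for illustration only; the demo writes a throwaway file so it can run anywhere.

```python
import json
import tempfile
from pathlib import Path


def load_config(path):
    """Read secrets from a JSON file that is kept out of the repository."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


# Demo with a throwaway file; in a real project this would be a file such as
# secrets.json (hypothetical name) that is listed in .gitignore.
with tempfile.TemporaryDirectory() as tmp:
    config_path = Path(tmp) / "secrets.json"
    config_path.write_text(json.dumps({"API_KEY": "test-key"}), encoding="utf-8")
    config = load_config(config_path)

print(config["API_KEY"])  # test-key
```

A YAML file would work the same way, but needs a third-party parser, which is why the sketch sticks to JSON.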

In the project documentation, list which secrets need to be set, because nothing is more annoying than running a script ten times only to have it fail each time on the next missing secret. For example, create a sample .env, .json or .yaml file with the right structure but bogus values for the secrets.
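One possible way to avoid the "fail once per missing secret" loop is to check everything up front. The sketch below assumes the secrets come from environment variables; REQUIRED_SECRETS, missing_secrets and check_secrets are hypothetical names, not part of any library.

```python
import os

# Hypothetical names; list whatever your project actually needs.
REQUIRED_SECRETS = ["API_KEY", "API_SECRET"]


def missing_secrets(required, environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


def check_secrets(required, environ=None):
    """Raise one error listing every missing secret, instead of one per run."""
    missing = missing_secrets(required, environ)
    if missing:
        raise RuntimeError("Missing secrets: " + ", ".join(missing))


# Example with an explicit mapping so the demo does not depend on your shell.
print(missing_secrets(REQUIRED_SECRETS, {"API_KEY": "test-key"}))  # ['API_SECRET']
```

Calling check_secrets(REQUIRED_SECRETS) at the top of a script would then report all missing secrets in a single run.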

Getting started...¶

  • go to github
  • create a new repo
    • keep public or private
    • select template if you want to
    • good to add at least a readme
    • .gitignore

Then...¶

either clone locally:

git clone git@github.com:Amsterdam-Internships/InternshipAmsterdamGeneral.git

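or, if you have an existing project:

git init
git remote add origin git@github.com:Amsterdam-Internships/GithubDemo.git
# a push needs at least one commit, so stage and commit something first
git add <files-to-track>
git commit -m "initial commit"
git push --set-upstream origin master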

Repo structure and standard files¶

Project structure:¶

  • anything is fine-ish as long as it is:
    • used consistently
    • explained in the readme
  • look up common structure and naming conventions
  • cookiecutter

TODO: Add convention advanced analytics
AI Team Guidelines???

Always have a...¶

README.md¶

  • Should change together with the code (from the start!)
  • Markdown files can be added to different subfolders to explain separate parts of the code
  • Example for repo-level readme here
  • At least:
    • short summary/description
    • installation
    • usage
    • acknowledgement
  • markdown syntax

AI Team Guidelines???

Always have a...¶

requirements.txt / environment.yml¶

  • allows users to easily pip install -r requirements.txt
  • mind the difference between listing only package names and pinning versions (e.g. requests==2.28.2)
  • you can always pip freeze > requirements.txt
  • or use pipdeptree (top-level packages, no dependencies): pipdeptree --warn silence | grep -E '^\w+' > requirements.txt
  • or pipreqs for project requirements only
  • or poetry in combination with a pyproject.toml file, which takes care of dependencies, virtual environment management, and building your code into a package.
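To make the "package names vs pinned versions" point concrete, here is a small sketch that flags requirement lines without an exact == pin. The helper unpinned_requirements is a hypothetical name, and the parsing is deliberately simplified (it ignores URLs, extras and environment markers).

```python
def unpinned_requirements(lines):
    """Return requirement lines that do not pin an exact version with '=='."""
    unpinned = []
    for raw in lines:
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if "==" not in line:
            unpinned.append(line)
    return unpinned


example = ["requests==2.28.2", "jupyter", "# dev tools", "cookiecutter>=2.1"]
print(unpinned_requirements(example))  # ['jupyter', 'cookiecutter>=2.1']
```

Whether unpinned entries are a problem depends on the project: pins help reproducibility, while loose entries make upgrades easier.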

TODO: Elaborate on conda usage
AI Team Guidelines???

Nice to have a...¶

.gitignore¶

  • intentionally untracked files that git should ignore
  • avoid pushing data, output files, checkpoints
  • plenty of examples here + python-specific
  • feel free to add your own patterns

Nice to have a...¶

workflows (.github/workflows)¶

  • Nice way of ensuring some basic code quality and functionality, especially when there are no peers to review your code regularly
  • More info about github workflows here
  • More about the actions syntax here
  • Examples here
    • running a linter for code quality
    • running example tests

Disclaimer: free accounts have a monthly limit of 2000 workflow minutes + 500MB of storage for private repos

Nice to have a...¶

TODO: Dealing with notebooks

  • clearing output via pre-commit hooks: https://medium.com/somosfit/version-control-on-jupyter-notebooks-6b67a0cf12a3
  • .gitattributes
  • example of source and notebooks separated: https://github.com/Amsterdam-AI-Team/Urban_PointCloud_Processing
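What "clearing output" means in practice can be sketched with the standard library alone, assuming the usual nbformat 4 layout where a notebook is a JSON file with a list of cells. The function names are hypothetical; dedicated tools such as nbstripout are likely the better choice in a real pre-commit hook.

```python
import json


def clear_outputs(notebook):
    """Remove outputs and execution counts from a notebook dict (nbformat 4)."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return notebook


def clear_outputs_in_file(path):
    """Rewrite the .ipynb file at `path` without any cell output."""
    with open(path, encoding="utf-8") as f:
        notebook = json.load(f)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(clear_outputs(notebook), f, indent=1)
        f.write("\n")


# In-memory example, so nothing on disk is touched.
example = {
    "cells": [
        {"cell_type": "code", "execution_count": 3, "source": ["1 + 1"],
         "outputs": [{"output_type": "execute_result"}]},
        {"cell_type": "markdown", "source": ["# Title"]},
    ]
}
cleaned = clear_outputs(example)
print(cleaned["cells"][0]["outputs"])  # []
```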

Nice to have a...¶

TODO: General pre-commit hooks

Recommendations Meeke:

  • https://github.com/timothycrosley/isort
  • https://github.com/psf/black
  • https://github.com/PYCQA/flake8
  • https://github.com/pre-commit/mirrors-mypy

When using flake8 and black together, you may need to add a .flake8 file that disables the flake8 checks that clash with black's formatting (commonly E203 and the line-length limit).

Nice to have / Think about a...¶

packaging files / setup.py¶

TODO: Expand

Think about a...¶

(custom) .pylintrc and using a linter¶

  • why bother?
    • unused imports
    • rules for variable and module naming
    • rules for docstrings
    • max for length/branching/args of functions
  • example automated workflow
  • example (custom) file
  • but can also be run locally: pylint ...

Sidenote on code style..¶

We need more on code style and quality, but here are some pointers¶

  • PEP8
  • flake8 - alternative to pylint
  • black for code formatting

AI Team Guidelines???

Think about a...¶

license¶

  • more about licenses here

City Guidelines???

General workflow¶

  • Work
  • Last check
    • code quality, run tests, run favorite linter (pylint, flake8, whatever), etc
    • take a last look at changes (git diff)
  • Revert undesired changes if needed, final adjustments
  • Add to repo
    • Stage things (git add only-corresponding-files)
    • Commit (meaningful message!!!)
    • Check one last time (git status; git reset -- <file> if needed)
    • Puuuush
  • Confirm all well
    • No errors locally or in the repo

Frequency & Messages¶

  • aim for single files or very small functionality?
    • added data loading
    • added preprocessing
    • fixed some parameter error in whatever config
  • messages as a guide:
    • commit message of one short phrase
    • My rule of thumb: avoid OR & AND in this phrase (else something went wrong)
    • Helps with: staying focused instead of drifting off and, e.g., starting to refactor while implementing
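The rule of thumb above could be turned into a small check, for example as the core of a commit-msg hook. This is only a sketch: commit_message_problems is a hypothetical helper and the 72-character limit is an assumption, not something the guidelines prescribe.

```python
import re

MAX_LENGTH = 72  # assumed limit for "one short phrase"; pick your own


def commit_message_problems(message):
    """Return a list of reasons why a commit message may cover too much."""
    problems = []
    subject = message.strip().splitlines()[0] if message.strip() else ""
    if not subject:
        problems.append("empty message")
    if len(subject) > MAX_LENGTH:
        problems.append("subject longer than %d characters" % MAX_LENGTH)
    # Whole words only, so "error" or "added" do not trigger the check.
    if re.search(r"\b(and|or)\b|&", subject, flags=re.IGNORECASE):
        problems.append("contains AND/OR: consider splitting the commit")
    return problems


print(commit_message_problems("added data loading"))  # []
print(commit_message_problems("added preprocessing and fixed config"))
```

The second call reports the AND, which is usually a hint that the commit bundles two changes.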

Awesome talk on telling stories through your commits
Whenever in doubt: google "what's a good commit message"

Example commands from here on...¶

In [1]:
!git config --local --list
core.repositoryformatversion=0
core.filemode=true
core.bare=false
core.logallrefupdates=true
remote.origin.url=git@github.com:Amsterdam-Internships/GithubDemo.git
remote.origin.fetch=+refs/heads/*:refs/remotes/origin/*
branch.master.remote=origin
branch.master.merge=refs/heads/master
In [2]:
!git config --global --list
user.email=iva.gornishka@gmail.com
user.name=Iva Gornishka
credential.helper=store
In [3]:
!echo "jupyter" > requirements.txt
In [4]:
!git status
On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   GithubBasics.ipynb

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	GithubBasics-Meeke.ipynb
	convert-slides.sh
	requirements.txt

no changes added to commit (use "git add" and/or "git commit -a")

Add (stage) only specific files¶

In [5]:
!git add requirements.txt
In [6]:
!git status
On branch master
Your branch is up to date with 'origin/master'.

Changes to be committed:
  (use "git restore --staged <file>..." to unstage)
	new file:   requirements.txt

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   GithubBasics.ipynb

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	GithubBasics-Meeke.ipynb
	convert-slides.sh

In [7]:
!git commit -m 'updated requirements (jupyter)'
[master ec97591] updated requirements (jupyter)
 1 file changed, 1 insertion(+)
 create mode 100644 requirements.txt
In [8]:
!git push
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 12 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 308 bytes | 308.00 KiB/s, done.
Total 3 (delta 1), reused 1 (delta 0)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To github.com:Amsterdam-Internships/GithubDemo.git
   af23740..ec97591  master -> master

Fix small forgotten things without a whole new "fix typo" commit¶

In [9]:
!echo "cookiecutter" >> requirements.txt
In [10]:
!git add requirements.txt
In [11]:
!git commit --amend --no-edit
[master de8c65a] updated requirements (jupyter)
 Date: Wed Mar 15 12:58:12 2023 +0100
 1 file changed, 2 insertions(+)
 create mode 100644 requirements.txt
In [12]:
!git push -f origin
Enumerating objects: 4, done.
Counting objects: 100% (4/4), done.
Delta compression using up to 12 threads
Compressing objects: 100% (2/2), done.
Writing objects: 100% (3/3), 323 bytes | 323.00 KiB/s, done.
Total 3 (delta 1), reused 0 (delta 0)
remote: Resolving deltas: 100% (1/1), completed with 1 local object.
To github.com:Amsterdam-Internships/GithubDemo.git
 + ec97591...de8c65a master -> master (forced update)

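Revert only the last pushed commit¶

Danger zone: a forced push rewrites the history on the remote, so only do this on branches nobody else is working on.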

In [13]:
!git reset --mixed HEAD~1
!git push -f
Unstaged changes after reset:
M	GithubBasics.ipynb
Total 0 (delta 0), reused 0 (delta 0)
To github.com:Amsterdam-Internships/GithubDemo.git
 + de8c65a...af23740 master -> master (forced update)
In [14]:
!git status
On branch master
Your branch is up to date with 'origin/master'.

Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
	modified:   GithubBasics.ipynb

Untracked files:
  (use "git add <file>..." to include in what will be committed)
	GithubBasics-Meeke.ipynb
	convert-slides.sh
	requirements.txt

no changes added to commit (use "git add" and/or "git commit -a")

Branches¶

TODO:

(Code) Reviews¶

Why bother doing them?¶

  • 4 eyes principle.

  • Make sure the code...

    • does what it is expected to
    • does everything it is expected to
    • can be used again in the future (reproducibility)
    • can be picked up by anyone else
  • It can bring to light workflow/process issues

  • Helps us learn from each other

Sidenote for Data Scientists @ Gemeente: It will become "a thing" soon, so we should help with setting up the standards and process

(Code) Reviews & AI Team¶

TODO:

How often?¶

  • Depends on team, project, speed…
  • Per ticket?
  • Keep it short, but not annoying (avoid “I need an approval”)

What do we even review?¶

  • (I think) it can be anything
    • not only code
    • also analysis, investigation, paper…
  • Dev vs data science - data scientists produce so much more than code
  • Go back to the question: what is the purpose of this code and of the review?
    • Are you going to use this again?
    • Is it going in production?
    • What's the impact of errors?
    • ...

Full review¶

  • Structure, Documentation, Overall setup

Partial updates¶

  • What changed, how was it implemented, is it going to break anything…

Sometimes¶

  • pull, look at it locally, test it, run the thing

Always¶

  • look at security, whatever guidelines we have?!
  • Don’t forget positive comments!

Code Reviews and Data Science¶

  • hardcoded paths

  • output in notebooks

  • data in the repo

  • magic numbers & unnamed/positional arguments (SomeRandomModel('l2', False, 0.0001, 1.0, True, 100))

  • random seeds

  • no grid search or explanation of how params came to be

  • overall workflow/pipeline issue
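A sketch of how the magic-number and random-seed items might look after a review. SomeRandomModel here is a hypothetical stand-in class, and the parameter names and constant values are made up for illustration; they are not the meaning of the positional arguments in the example above.

```python
import random

# Named constants instead of magic numbers (values are placeholders).
RANDOM_SEED = 42
TOLERANCE = 0.0001
MAX_ITERATIONS = 100


class SomeRandomModel:
    """Hypothetical stand-in for a model; '*' makes all arguments keyword-only."""

    def __init__(self, *, penalty, fit_intercept, tolerance, max_iterations, seed):
        self.penalty = penalty
        self.fit_intercept = fit_intercept
        self.tolerance = tolerance
        self.max_iterations = max_iterations
        self.rng = random.Random(seed)  # fixed seed makes runs repeatable


model = SomeRandomModel(
    penalty="l2",
    fit_intercept=True,
    tolerance=TOLERANCE,
    max_iterations=MAX_ITERATIONS,
    seed=RANDOM_SEED,
)
```

With keyword-only arguments a reviewer can read the call without opening the class definition, and a purely positional call fails immediately with a TypeError.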

Before the review¶

  • Can’t review without github
  • Set up rules like no merging without pull request
  • Set up custom workflows to catch basic issues automatically (and merge rules that require passing tests)
  • Set up tests (automatically executed)
  • Be nice to each other and make running things from scratch part of your process

During the review (tips)¶

  • Look at Best Practices - you might also learn something
  • Focus on one thing at a time (can I run this -> does it do what it should -> now does it do it efficiently -> does it do it while still being pretty)

A nice way of looking at reviews:¶

  • Use the moment to review the process, not the code
  • Do reviews so that you can stop doing reviews

Useful resources¶

  • Joel Chippindale's talk on blameless reviews

TODO:

  • Project boards (issue tracking?)
  • resolving merge conflicts via IDE
  • packaging code (into modules for future work)
  • branching
    • merging back
    • squashing

Random to incorporate¶

  • "the scout rule" of this dude or this other dude
    • Refactoring before implementation to reduce the risk of not doing it at all