Best Practices for Reproducible Research

Session 2

Ingmar Steiner

2017-05-03

Source Code Management

Motivation

  • Track development over time
  • Find and fix errors
  • Share changes with other developers
  • Synchronize across different machines
  • Experiment with new features without breaking everything

Comparing files

uk.txt

red
green
white
grey
black

us.txt

green
white
gray
black
blue

diff --side-by-side uk.txt us.txt

red                               <
green                               green
white                               white
grey                              | gray
black                               black
                                  > blue

We refer to the difference between two files as a diff.

Diff tool

diff uk.txt us.txt

1d0
< red
4c3
< grey
---
> gray
5a5
> blue

Diff “hunks” without context

Unified format

diff -u uk.txt us.txt

--- uk.txt  2017-04-30 10:27:16.000000000 +0200
+++ us.txt  2017-04-30 10:27:45.000000000 +0200
@@ -1,5 +1,5 @@
-red
 green
 white
-grey
+gray
 black
+blue

Diff hunks with context, filenames
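
The same unified output can be produced programmatically. A minimal sketch using Python's standard difflib module, assuming the two word lists are typed in by hand rather than read from uk.txt and us.txt (so the header lines carry no timestamps):

```python
import difflib

uk = ["red", "green", "white", "grey", "black"]
us = ["green", "white", "gray", "black", "blue"]

# lineterm="" because the list items carry no trailing newline.
unified = list(
    difflib.unified_diff(uk, us, fromfile="uk.txt", tofile="us.txt", lineterm="")
)

for line in unified:
    print(line)
```

Lines starting with "-" exist only in uk.txt, lines starting with "+" only in us.txt, and lines starting with a space are unchanged context.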

Diffs are patches

Patching files

Tracking changes

foo.txt

foo
baz

foo_v2.txt

foo
bar
baz

foo_v3.txt

foo
bar
baz
qux

foo_v4.txt

foo
bar
baz
quux

Tracking changes (cont’d)

diff foo.txt foo_v2.txt > foo_v2.diff

1a2
> bar

diff foo_v2.txt foo_v3.txt > foo_v3.diff

3a4
> qux

diff foo_v3.txt foo_v4.txt > foo_v4.diff

4c4
< qux
---
> quux

Tracking changes with patches

SCM as diff management

  • Different versions of files are just different files
  • Files can be compared (“diffed”)
  • Diffs are maximally informative in tracking changes
  • Diffs can be stored
  • Storing (or transferring) diffs is more efficient than storing (or transferring) full copies of changed files
  • Files can be reconstructed (or reset) to different versions by applying a sequence of diffs
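
The last point can be sketched with the foo.txt versions from the earlier slides. This is only an illustration: the delta format used here (start index, end index, replacement lines) and the helper names make_delta and apply_delta are made up for this example and are not what diff, patch, or any SCM tool actually stores.

```python
from difflib import SequenceMatcher

def make_delta(old, new):
    """Record only the regions that differ between two lists of lines."""
    ops = SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    return [(i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]

def apply_delta(old, delta):
    """Rebuild the newer version from the older one plus a delta."""
    result = list(old)
    # Apply from the bottom up so that earlier indices stay valid.
    for i1, i2, replacement in reversed(delta):
        result[i1:i2] = replacement
    return result

versions = [
    ["foo", "baz"],                  # foo.txt
    ["foo", "bar", "baz"],           # foo_v2.txt
    ["foo", "bar", "baz", "qux"],    # foo_v3.txt
    ["foo", "bar", "baz", "quux"],   # foo_v4.txt
]

# Store the first version in full, then only the deltas.
deltas = [make_delta(a, b) for a, b in zip(versions, versions[1:])]

# Reconstruct every later version by applying the deltas in sequence.
state = versions[0]
history = [state]
for delta in deltas:
    state = apply_delta(state, delta)
    history.append(state)

print(history == versions)
```

Stopping the loop early yields any intermediate version, which is the sense in which files can be reset to different versions.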

Real-world example

Minions 

Fixing data

  • Bob, Kevin, and Stuart are working on part-of-speech (POS) analysis with data from Project Gutenberg.
  • Kevin discovers a typo:

    Some of us will be scouting with the flyers. Well be in radio contact with you.

  • and corrects it to:

    Some of us will be scouting with the flyers. We’ll be in radio contact with you.

    He emails the corrected file back to Bob.

Fixing data (cont’d)

  • Meanwhile, Stuart discovers a different typo:

    “Well crush them now.”

  • He corrects it to:

    “We’ll crush them now.”

    He emails his corrected file back to Bob as well.

  • Bob now has three different versions of the text file.
    • The base version (two typos)
    • Kevin’s version (one typo fixed)
    • Stuart’s version (a different typo fixed)
    How can he make a version with both typos fixed?

Hint: try this out yourselves!

Conflicts

  • Meanwhile, Kevin is working to strip out the formatting (_emphasized_)
  • He has no idea he could just run perl -i -pe 's/_(.+?)_/$1/g' pg17958.txt and spends hours editing the text manually.
  • He emails his new version back to Bob – but it still contains the typo fixed by Stuart, so Bob sends him the file with both typos fixed, and asks Kevin to strip the formatting out of that one.
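
For comparison, a rough Python counterpart of the Perl one-liner above. The function name strip_emphasis and the sample sentence are made up for illustration; unlike perl -i, this sketch does not edit a file in place.

```python
import re

def strip_emphasis(text):
    """Remove _underscore_ emphasis markers, keeping the emphasized words."""
    # Non-greedy match, mirroring the Perl substitution s/_(.+?)_/$1/g
    return re.sub(r"_(.+?)_", r"\1", text)

print(strip_emphasis("He was _quite_ sure that _nothing_ had changed."))
```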

How can the three solve these issues?

SCM FTW!

  • States and changes can be visualized as a graph
  • Nodes are states, edges are changes (diffs)
  • Diffs are “committed” to a state to create a new state
  • Different changes applied to the same state create alternate states (“branches”)
  • Two branches can be “merged” together by committing another diff
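
The graph idea can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the storage format of any real SCM tool: each state keeps its full content instead of a diff, IDs are shortened SHA-1 hashes over the parent IDs, message, and content, and the names commit, ancestors, and store are invented for this example.

```python
import hashlib

def commit(store, parents, message, content):
    """Add a state to the graph and return its ID."""
    payload = "\n".join([*parents, message, content]).encode("utf-8")
    commit_id = hashlib.sha1(payload).hexdigest()[:8]
    store[commit_id] = {"parents": list(parents), "message": message, "content": content}
    return commit_id

def ancestors(store, commit_id):
    """Return the ID of a state plus the IDs of every state it builds on."""
    seen, todo = set(), [commit_id]
    while todo:
        current = todo.pop()
        if current not in seen:
            seen.add(current)
            todo.extend(store[current]["parents"])
    return seen

store = {}
base = commit(store, [], "import text", "typo A, typo B")
kevin = commit(store, [base], "fix typo A", "fixed A, typo B")    # one branch
stuart = commit(store, [base], "fix typo B", "typo A, fixed B")   # another branch
merged = commit(store, [kevin, stuart], "merge both fixes", "fixed A, fixed B")

print(store[merged]["parents"] == [kevin, stuart])
print(len(ancestors(store, merged)))
```

Kevin's and Stuart's states share the same parent, which makes them branches. The merge is simply a state with two parents.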

How Did We Get Here?

Source Code Control System (SCCS)

  • First generation of SCM
  • First released in 1972
  • Stores changes (deltas)

Revision Control System (RCS)

  • First released in 1982
  • Tracks single files, one user at a time
  • Files are “checked out”, modified, and “committed”

Concurrent Versions System (CVS)

  • First released in 1989
  • Originally a front-end for RCS
  • Second SCM generation: client/server architecture
  • Central “repository” as main storage
  • Repositories can be local or remote
  • A user checks out a copy into a “working directory”
  • Subtree checkouts (multiple codebases/projects per repository)

Subversion

  • First released in 2000
  • Reimplementation of concepts from CVS
  • Trunk, branches, tags (actually just “shallow” copies!)
  • Unique, sequential revision numbers
  • Repositories with database backends
  • Stores/tracks files, directories, symbolic links
  • Native support for transfer protocols
    • local file
    • WebDAV over http and https
    • svn and svn+ssh
  • Properties to store file and commit metadata
    • svn:date, svn:author, svn:log (for commit messages), svn:executable, svn:eol-style, svn:mime-type
    • svn:ignore to avoid accidental versioning and clutter
  • Nesting via svn:externals

BitKeeper

  • Third SCM generation: distributed architecture
  • First released in 2000

Darcs

  • First released in 2003
  • Collaboration via emailed patches

Bazaar

  • First released in 2005
  • Many “bridges” to interact with other SCM platforms
  • Supports Unicode filenames
  • Manipulate with bzr uncommit, bzr revert, etc.

Mercurial

  • First released in 2005
  • Local revision numbers, globally unique commit hashes
  • Extensible design
  • Manipulate via hg rollback, hg backout, etc.

Git

  • Developed by Linus Torvalds as an alternative to BitKeeper
  • First released in 2005
  • Globally unique commit hashes (no more sequentially numbered revisions)
  • Manipulate via git reset, git commit --amend, git revert, etc.
  • By now, widely adopted and supported nearly everywhere

Fossil

  • First released in 2007
  • Combined SCM, wiki, issue tracker, and announcements (“technotes”)
  • Built-in webserver, browser GUI

SCM Models

Common terms

Repository
Storage for SCM data and history (central or distributed by cloning/forking – the original is sometimes called “upstream”)
Working copy
Local checkout or clone of data from a repository
Commit (v)
Apply (a set of) changes and store them in SCM
Revision/Changeset/Commit (n)
State of a working copy
Parent
Previous commit
Update/Fetch/Pull/Push
Transfer commits between the local working copy and a repository
Tag
Symbolic reference to a specific commit (to track released versions, etc.)
Trunk/Master
Default branch – assumed to be free of errors
Branch
Alternate universe for independent development – could be experimental or non-working!
Merge
Integrate one branch into another (often ending activity on the former); also the resulting commit

Client-server workflow

  • Central server with repository holding all data and history
  • Clients get snapshots of data at one state as “working copy”
  • Clients update local data by retrieving from server
  • Clients submit local updates to server
  • Interaction requires server to be available (no offline use)
  • Changes must be synchronized with all clients

Demo time!

Distributed workflow

  • Network of repositories (“remotes” – even when local)
  • Data and history replicated (“cloned”) across each network node
  • All commits are made locally, then pushed to other repositories
  • Remote commits are fetched and merged
  • One or more repositories can function as central “hubs”

Another demo!

Pitfalls and Pro Tips

Line wrapping

  • Hard-wrapping editors can incite conflict and clutter the diff:
--- lipsum.txt  2017-05-01 21:36:19.000000000 +0200
+++ lipsum2.txt 2017-05-01 21:36:14.000000000 +0200
@@ -1,6 +1,6 @@
-Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed
-do eiusmod tempor incididunt ut labore et dolore magna aliqua.
-Ut enim ad minim veniam, quis nostrud exercitation ullamco
-laboris nisi ut aliquip ex ea commodo consequat. Duis aute
-irure dolor in reprehenderit in voluptate velit esse cillum
-dolore eu fugiat nulla pariatur.
+Lorem ipsum dolor sit amet, divide et impera, consectetur
+adipiscing elit, sed do eiusmod tempor incididunt ut labore et
+dolore magna aliqua. Ut enim ad minim veniam, quis nostrud
+exercitation ullamco laboris nisi ut aliquip ex ea commodo
+consequat. Duis aute irure dolor in reprehenderit in voluptate
+velit esse cillum dolore eu fugiat nulla pariatur.

→ No hard-wrapping; one sentence, one line
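
The effect can be measured with a small experiment. A sketch using Python's difflib and textwrap modules, assuming a wrap width of 62 characters to stand in for a hard-wrapping editor; the helper names changed_lines and hard_wrap are made up for this example.

```python
import difflib
import textwrap

def changed_lines(old, new):
    """Count the lines a diff would mark as removed or added."""
    ops = difflib.SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    return sum((i2 - i1) + (j2 - j1) for tag, i1, i2, j1, j2 in ops if tag != "equal")

def hard_wrap(lines):
    """Join the sentences into one paragraph and re-wrap it at a fixed width."""
    return textwrap.fill(" ".join(lines), width=62).splitlines()

sentences = [
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod "
    "tempor incididunt ut labore et dolore magna aliqua.",
    "Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi "
    "ut aliquip ex ea commodo consequat.",
    "Duis aute irure dolor in reprehenderit in voluptate velit esse cillum "
    "dolore eu fugiat nulla pariatur.",
]
# The same small edit as in the diff above: three words inserted.
edited = [sentences[0].replace("amet, ", "amet, divide et impera, ")] + sentences[1:]

wrapped = changed_lines(hard_wrap(sentences), hard_wrap(edited))
per_sentence = changed_lines(sentences, edited)
print("hard-wrapped:", wrapped, "changed lines")
print("one sentence per line:", per_sentence, "changed lines")
```

With one sentence per line, only the edited sentence shows up in the diff (one line removed, one added), while re-wrapping pushes the change into the following lines as well.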

End-of-line encoding

  • Inconsistent EOL encoding, or collaboration between Windows (CRLF) and Un*x (LF) platforms, can cause massive conflicts even when the text has not changed.

→ Manage EOL encoding systematically

Git
.gitattributes files
Mercurial
.hgeol files
Subversion
svn:eol-style properties
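
Why EOL differences are so disruptive can be shown in a few lines of Python. This is only an illustration of the problem; in practice the SCM settings listed above should handle the conversion, and the function name normalize_eol is made up for this example.

```python
def normalize_eol(text):
    """Convert CRLF (and stray CR) line endings to LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

unix_text = "red\ngreen\nwhite\n"
windows_text = unix_text.replace("\n", "\r\n")

unix_lines = unix_text.splitlines(keepends=True)
windows_lines = windows_text.splitlines(keepends=True)

# Every line differs, although no visible character has changed.
differing = sum(u != w for u, w in zip(unix_lines, windows_lines))
print(differing, "of", len(unix_lines), "lines differ")

# After normalization the two versions are identical again.
print(normalize_eol(windows_text) == unix_text)
```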

Binary files

  • Many SCM tools cannot handle binary files efficiently
    • diffing binary files is (mostly) futile
    • distributed repository bloat
  • These issues become even worse with large binary files

→ Use alternate strategies to manage large files

Git
Git-LFS, git-annex, etc.
Mercurial
Largefiles extension

Even better, manage large binary files at a higher level

GUIs

  • Visual commit history (with authors, dates, logs, etc.)
  • Diff viewer
  • Graph visualization (invaluable with much branching/merging action)
  • Pre-commit feedback
  • Nearly all IDEs support various SCM tools natively or via plugins

→ Use a GUI (in addition to the command line)

Meaningful commits

  • Provide useful commit messages
  • Use meaningful commits, one per concept/change if possible
  • Commit only hunks that belong to that commit
  • “Stash” other changes as needed

→ Improve clarity for past commits and collaboration; easier cherry-picking

CI testing

  • Automatic continuous integration testing can be used to verify that
    • no files are missing from SCM
    • no local modifications were forgotten (depends on SCM tool)
    • project is valid in a clean environment (no missing tools/dependencies)
    • project runs on a different platform (within reason)

→ Tests on hosted or local CI platforms can be triggered automatically via “hooks”

Fixing mistakes

  • After sharing with others, undo a commit with another commit
  • Locally, discard changes, reset to a known good state, merge (without committing), etc.
  • Amend, i.e., change the last commit

→ Inspect local commits, and practice how to modify them to fix errors

Merge or rebase?

  • Local commits may conflict, or negatively interact, with newer upstream changes
  • Either merge:
    1. Fetch and compare with upstream
    2. Merge upstream into local working copy
    3. Verify that everything is OK
    4. Merge local working copy into upstream
  • Or rebase (or “cherry-pick” individual commits as appropriate):
    1. “Transplant” local commits onto upstream
    2. Verify that everything is OK
    3. Merge local working copy into upstream
  • Consider merging vs. linearizing (“fast-forward”) commits

→ Play around locally until everything works, but if you’ve shared commits with others, make sure you don’t cause confusion by changing that shared history

Merge tools

  • Visualize conflicts
  • Three-way merge tools help resolve conflicts
  • Examples: KDiff3, Meld, etc.

→ Configure SCM to use a merge tool for conflict resolution
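
The core rule behind a three-way merge can be sketched using the two typo fixes from the real-world example. This is a deliberately simplified model, not how KDiff3, Meld, or any SCM tool is implemented: it assumes edits change lines in place and never insert or delete lines, whereas real tools first align the texts with a diff. The function name merge3_aligned is made up for this example, and the apostrophes are plain ASCII.

```python
def merge3_aligned(base, ours, theirs):
    """Three-way merge for versions whose lines still align one-to-one."""
    if not (len(base) == len(ours) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged, conflicts = [], []
    for number, (b, o, t) in enumerate(zip(base, ours, theirs), start=1):
        if o == t:        # both sides agree (changed or not)
            merged.append(o)
        elif o == b:      # only "theirs" changed this line
            merged.append(t)
        elif t == b:      # only "ours" changed this line
            merged.append(o)
        else:             # both changed it differently: a human must decide
            merged.append(b)
            conflicts.append((number, o, t))
    return merged, conflicts

base = ["Well be in radio contact with you.", "Well crush them now."]
kevin = ["We'll be in radio contact with you.", "Well crush them now."]
stuart = ["Well be in radio contact with you.", "We'll crush them now."]

merged, conflicts = merge3_aligned(base, kevin, stuart)
print(merged)
print(conflicts)
```

The common base version is what makes the merge possible: it tells the tool which side actually changed a line. Only lines changed differently on both sides are reported as conflicts.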

Sharing development

  • All SCM tools support various network transport protocols for data exchange (assuming everyone has access to the same network)
  • Old school: email patches or bundles (serialize commits to packed files, apply commits from them)
  • Code hosting websites
    • Repository access
    • User accounts, groups/teams/organizations
    • Forking, pull/merge requests
    • SCM browsing, plus integration of workflow tools
      • Wikis
      • Issue trackers
      • Hooks (CI testing, reporting, etc.)
      • Extra file storage (for releases, large files, etc.)
      • Webpage hosting
    • Examples: GitHub, Bitbucket, SourceForge, GitLab, Trac, etc.

Modular SCM

→ Elegant solution for certain scenarios, but adds complexity – use with caution!

SCM bridges

  • Access SCM repositories using a different SCM tool
  • Workaround for incompatible workflows (e.g., use Git to work with Subversion)
  • Convert/migrate repositories from deprecated SCM tools

Further Reading

Subversion

Mercurial

Git

All of the above

Assignment

Recreate the “real-world example”

…using any distributed SCM (except Git)

  • Then submit a short written report (PDF format) and present your experience in the next session.
  • Bonus points if you manage the process of writing the report using SCM!

Questions?