Best Practices for Reproducible Research

Session 1

Ingmar Steiner

2017-04-26

Introduction

Background

“Best Practices for Reproducible Research”

(a.k.a. “Agile Research”, “DevOps for Research”, etc.)

Evolved out of personal experience, “what I wish my grad students had known already”.

This course could give you superpowers, and turn you into a rockstar researcher!

Meta

This project seminar requires active participation.

There will be regular assignments, as well as a mandatory final project and written report.

The course content is technical, but also flexible!

Source code management

  • patching
  • CVS
  • Subversion
  • Mercurial
  • Git
    • git-subtree
    • git-submodule
    • git-annex
    • Git LFS
  • SCM repository hosting
    • GitHub
    • Bitbucket
    • GitLab

Build tools

  • GNU Make (Makefiles)
  • JVM Build tools
    • Ant
    • Maven
    • Gradle
  • SCons
  • Rake
  • Buildr

Testing and continuous integration

  • Jenkins
  • Travis CI
  • Containers

Dependency management

  • Resolution
  • Deployment
  • Serialization
  • Modularization

Documentation

  • README recipes
  • LaTeX
    • hyperlinks
    • bibliographies
    • source code
  • Wiki
  • Markdown

Literate programming

  • Sweave, Knitr
  • Rmarkdown

Software engineering

Definition

The systematic application of scientific and technological knowledge, methods, and experience to the design, implementation, testing, and documentation of software.

ISO/IEC/IEEE 24765:2010(E)

“Research engineering”

The systematic application of scientific and technological knowledge, methods, and experience to the design, implementation, testing, and documentation of research.

Computational research = software

Any code run on a computer can be seen as software.

Research is no different.

Conducting and documenting research amounts to developing its source code.

Software Project Example

Example program

A small command-line interface (CLI) program that outputs a fortune

Java code

JFortune.java

public class JFortune {

    public String getFortune() {
        return "42";
    }

    public static void main(String[] args) {
        JFortune jfortune = new JFortune();
        String fortune = jfortune.getFortune();
        System.out.println(fortune);
    }
}

Compile to bytecode and run

Separate source and generated code
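
The two steps above (compiling to bytecode, and keeping generated class files apart from the sources) can be sketched in Java itself, using the JDK's `javax.tools` compiler API rather than the `javac` command line. This is only an illustrative sketch, not the commands used in the session: the class name `CompileAndRun`, the `src` and `build/classes` directory names, and the temporary project directory are assumptions.

```java
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

class CompileAndRun {

    // Reduced copy of the JFortune source, so that the sketch is self-contained.
    static final String SOURCE =
        "public class JFortune {\n"
      + "    public String getFortune() { return \"42\"; }\n"
      + "}\n";

    // Compiles JFortune into a separate output directory, loads it, and calls getFortune().
    static String compileAndGetFortune() {
        try {
            // Assumed layout: sources in src/, generated class files in build/classes/
            Path project = Files.createTempDirectory("jfortune");
            Path srcDir = Files.createDirectories(project.resolve("src"));
            Path classesDir = Files.createDirectories(project.resolve("build/classes"));
            Path sourceFile = srcDir.resolve("JFortune.java");
            Files.write(sourceFile, SOURCE.getBytes(StandardCharsets.UTF_8));

            // Similar in spirit to: javac -d build/classes src/JFortune.java
            JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
            if (compiler == null) {
                throw new IllegalStateException("No system compiler found; a JDK is required");
            }
            int exitCode = compiler.run(null, null, null,
                    "-d", classesDir.toString(), sourceFile.toString());
            if (exitCode != 0) {
                throw new IllegalStateException("Compilation failed with exit code " + exitCode);
            }

            // Load the generated bytecode from the output directory and run it
            try (URLClassLoader loader =
                    new URLClassLoader(new URL[] { classesDir.toUri().toURL() })) {
                Class<?> cls = loader.loadClass("JFortune");
                Object instance = cls.getDeclaredConstructor().newInstance();
                Method getFortune = cls.getMethod("getFortune");
                return (String) getFortune.invoke(instance);
            }
        } catch (IllegalStateException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(compileAndGetFortune());
    }
}
```

Because the class files land in their own directory, the whole build output can be deleted and regenerated without touching the sources.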

Package application

Add Main-Class header to manifest and package to JAR file
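
To make the role of the `Main-Class` header concrete, here is a small sketch that builds a manifest with the standard `java.util.jar` classes, writes it into an in-memory JAR, and reads the header back. The class name `ManifestSketch` and its helper methods are hypothetical, and a real package step would also add the compiled class files as entries.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.jar.Attributes;
import java.util.jar.JarInputStream;
import java.util.jar.JarOutputStream;
import java.util.jar.Manifest;

class ManifestSketch {

    // Builds a manifest that names the class whose main() should be launched.
    static Manifest manifestFor(String mainClass) {
        Manifest manifest = new Manifest();
        Attributes attributes = manifest.getMainAttributes();
        // Manifest-Version must be set, otherwise the main attributes are not written
        attributes.put(Attributes.Name.MANIFEST_VERSION, "1.0");
        attributes.put(Attributes.Name.MAIN_CLASS, mainClass);
        return manifest;
    }

    // Writes a JAR that contains only the manifest and returns its bytes.
    static byte[] jarWith(Manifest manifest) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (JarOutputStream jar = new JarOutputStream(bytes, manifest)) {
            // compiled classes would be added here as further JAR entries
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Reads the Main-Class header back from the JAR bytes.
    static String readMainClass(byte[] jarBytes) {
        try (JarInputStream jar = new JarInputStream(new ByteArrayInputStream(jarBytes))) {
            return jar.getManifest().getMainAttributes().getValue(Attributes.Name.MAIN_CLASS);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(readMainClass(jarWith(manifestFor("JFortune"))));
    }
}
```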

Unit testing

JFortuneTest.java

import org.junit.*;

public class JFortuneTest {

    @Test
    public void testGetFortune() {
        JFortune jfortune = new JFortune();
        String expected = "41"; // let's fail this test!
        String actual = jfortune.getFortune();
        Assert.assertEquals(expected, actual);
    }
}

We also need the current JAR from JUnit.

This is a test-scoped dependency!

Running the unit test

Patching the test code

patch.diff

--- src/JFortuneTest.java   2017-04-25 17:25:26.000000000 +0200
+++ src/JFortuneTest.java   2017-04-25 17:26:38.000000000 +0200
@@ -5,7 +5,7 @@
     @Test
     public void testGetFortune() {
         JFortune jfortune = new JFortune();
-        String expected = "41"; // let's fail this test!
+        String expected = "42";
         String actual = jfortune.getFortune();
         Assert.assertEquals(expected, actual);
     }

Re-running the unit test

Build lifecycle

  1. Compile Java code
  2. Compile test code
  3. Run tests
  4. Package
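
The lifecycle above is an ordered sequence in which a failing step stops everything after it: a failing unit test means nothing gets packaged. A minimal sketch of that idea, assuming a hypothetical `Lifecycle` class and step names chosen to mirror the list (no real build tool is invoked):

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BooleanSupplier;

class Lifecycle {

    // Steps in insertion order; each action reports success (true) or failure (false)
    private final Map<String, BooleanSupplier> steps = new LinkedHashMap<>();

    Lifecycle step(String name, BooleanSupplier action) {
        steps.put(name, action);
        return this;
    }

    // Runs the steps in order, stops after the first failure,
    // and returns the names of the steps that were actually run.
    List<String> run() {
        List<String> executed = new ArrayList<>();
        for (Map.Entry<String, BooleanSupplier> step : steps.entrySet()) {
            executed.add(step.getKey());
            if (!step.getValue().getAsBoolean()) {
                break;
            }
        }
        return executed;
    }

    // The four steps from the list, with placeholder actions.
    static List<String> demo(boolean testsPass) {
        return new Lifecycle()
                .step("compile", () -> true)
                .step("compileTests", () -> true)
                .step("test", () -> testsPass)
                .step("package", () -> true)
                .run();
    }

    public static void main(String[] args) {
        System.out.println(demo(false)); // the failing test prevents packaging
        System.out.println(demo(true));
    }
}
```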

Complexity creep

  • More source files
  • More unit tests
  • Separate source sets (main, test)
  • Add resources

We just need to run more and more complex javac commands, solving these issues once and for all!

Pretty cool, eh?

Heck, no!

Build Automation

Build tools

High-level tools to manage the build lifecycle as efficiently as possible

For now, we skip over Ant, Maven, etc. and go straight to …

Gradle feature highlights

  • Groovy domain-specific language
  • Extensible build logic
  • Dependency management
  • Task dependency management
  • Caching
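
Task dependency management means that each task declares what it depends on, and the tool works out an execution order. The following sketch illustrates the idea with a depth-first traversal; it is not Gradle's implementation. The `TaskGraph` class is hypothetical, and the task names and edges only loosely follow those of Gradle's Java plugin.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TaskGraph {

    private final Map<String, List<String>> dependsOn = new LinkedHashMap<>();

    TaskGraph task(String name, String... dependencies) {
        dependsOn.put(name, Arrays.asList(dependencies));
        return this;
    }

    // Returns the tasks needed for the target, dependencies first, each task once.
    List<String> executionOrder(String target) {
        List<String> order = new ArrayList<>();
        visit(target, new HashSet<>(), new HashSet<>(), order);
        return order;
    }

    private void visit(String task, Set<String> done, Set<String> inProgress, List<String> order) {
        if (done.contains(task)) {
            return;
        }
        if (!inProgress.add(task)) {
            throw new IllegalStateException("Circular dependency involving " + task);
        }
        for (String dependency : dependsOn.getOrDefault(task, Collections.emptyList())) {
            visit(dependency, done, inProgress, order);
        }
        inProgress.remove(task);
        done.add(task);
        order.add(task);
    }

    // Simplified, assumed task graph for a small Java project.
    static TaskGraph javaProject() {
        return new TaskGraph()
                .task("compileJava")
                .task("processResources")
                .task("classes", "compileJava", "processResources")
                .task("compileTestJava", "classes")
                .task("test", "classes", "compileTestJava")
                .task("jar", "classes")
                .task("build", "jar", "test");
    }

    public static void main(String[] args) {
        System.out.println(javaProject().executionOrder("jar"));
        System.out.println(javaProject().executionOrder("build"));
    }
}
```

Requesting `jar` only runs what `jar` needs; the test tasks are skipped unless something that depends on them is requested.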

Gradle build script

build.gradle

plugins {
  id 'java'
}

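repositories {
  // JCenter has been sunset; Maven Central also hosts JUnit
  mavenCentral()
}

dependencies {
  // testImplementation replaces testCompile, which was removed in Gradle 7
  testImplementation 'junit:junit:4.10'
}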

jar {
  manifest {
    attributes 'Main-Class': 'JFortune'
  }
}

Gradle task tree

Gradle build

Gradle application

build.gradle

plugins {
  id 'application'
}

mainClassName = 'JFortune'

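repositories {
  // JCenter has been sunset; Maven Central also hosts JUnit
  mavenCentral()
}

dependencies {
  // testImplementation replaces testCompile, which was removed in Gradle 7
  testImplementation 'junit:junit:4.10'
}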

Gradle run

Notice the UP-TO-DATE tasks (cached).
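
One way to picture the UP-TO-DATE behaviour: a task is skipped when its inputs have not changed since the last run. The sketch below hashes a task's input and compares it with the hash recorded earlier. It is a simplified illustration under that assumption, not how Gradle actually tracks inputs and outputs, and `CachingTaskRunner` is a hypothetical name.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class CachingTaskRunner {

    private final Map<String, String> lastInputHash = new HashMap<>();
    private final Map<String, String> lastOutput = new HashMap<>();

    // Runs the action only if the input differs from the previous run of this task.
    String run(String taskName, String input, Function<String, String> action) {
        String hash = sha256(input);
        if (hash.equals(lastInputHash.get(taskName))) {
            return "UP-TO-DATE";
        }
        lastOutput.put(taskName, action.apply(input));
        lastInputHash.put(taskName, hash);
        return "EXECUTED";
    }

    // Output of the most recent execution, or null if the task never ran.
    String outputOf(String taskName) {
        return lastOutput.get(taskName);
    }

    static String sha256(String text) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        CachingTaskRunner runner = new CachingTaskRunner();
        System.out.println(runner.run("shout", "hello", String::toUpperCase));
        System.out.println(runner.run("shout", "hello", String::toUpperCase));
        System.out.println(runner.run("shout", "hello!", String::toUpperCase));
    }
}
```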

Custom build logic

Download and process an additional resource

build.gradle

plugins {
  id 'application'
  id 'de.undercouch.download' version '3.2.0'
}

mainClassName = 'JFortune'

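repositories {
  // JCenter has been sunset; Maven Central also hosts JUnit
  mavenCentral()
}

dependencies {
  // testImplementation replaces testCompile, which was removed in Gradle 7
  testImplementation 'junit:junit:4.10'
}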

import groovy.json.JsonBuilder

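task processFortunes {
  def destFile = file("$buildDir/fortunes.json")
  // declare the output so Gradle can skip the task when it is up to date
  outputs.file destFile
  doLast {
    // fetch the raw fortunes file, unless it has already been downloaded
    download {
      src 'https://raw.githubusercontent.com/ruanyf/fortunes/master/data/fortunes'
      dest temporaryDir
      overwrite false
    }
    // fortunes are separated by '%', optionally surrounded by newlines
    def fortunes = file("$temporaryDir/fortunes").text.split(/\n?%\n?/)
    destFile.text = new JsonBuilder(fortunes).toPrettyString()
  }
}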

processResources {
  from processFortunes
}
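
The processing step of the custom task (split the raw text at `%` separators, then serialize to JSON) can also be written in plain Java, for example to unit-test the logic outside the build. This is a sketch under assumptions: `FortuneParser` is a hypothetical class, empty entries are dropped (the Groovy one-liner does not do this), and the JSON writer handles only the most common escapes.

```java
import java.util.ArrayList;
import java.util.List;

class FortuneParser {

    // Splits raw fortune text at '%' separators, using the same regular
    // expression as the build script, and drops empty entries.
    static List<String> split(String raw) {
        List<String> fortunes = new ArrayList<>();
        for (String fortune : raw.split("\\n?%\\n?")) {
            if (!fortune.isEmpty()) {
                fortunes.add(fortune);
            }
        }
        return fortunes;
    }

    // Minimal JSON array serialization; escapes backslash, quote, newline, and tab only.
    static String toJson(List<String> fortunes) {
        StringBuilder json = new StringBuilder("[");
        for (int i = 0; i < fortunes.size(); i++) {
            if (i > 0) {
                json.append(',');
            }
            String escaped = fortunes.get(i)
                    .replace("\\", "\\\\")
                    .replace("\"", "\\\"")
                    .replace("\n", "\\n")
                    .replace("\t", "\\t");
            json.append('"').append(escaped).append('"');
        }
        return json.append(']').toString();
    }

    public static void main(String[] args) {
        String raw = "A\n%\nB line 1\nB line 2\n%\n";
        System.out.println(toJson(split(raw)));
    }
}
```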

Run customized build

Gradle recap

  • Build automation (compile, test)
  • Dependency resolution
  • Caching
  • Extensible build logic
  • Groovy DSL
  • Parallel processing
  • Cross-platform
  • and more…

Build Automation for Research

Common tasks

Data management
  • Download
  • Processing (unpacking, conversion, etc.)
Processing workflow
  • Task dependencies
  • Toolchain automation
Testing and validation
  • Verify custom scripts
  • Validate data
Analysis and Reporting
  • Aggregate tool/test results
  • Generate documentation
Collaboration
  • Share processed results, reports

Task automation

All of these steps can be streamlined with build automation:

Data is just another dependency

Processing is just a sequence of tasks

Scripts can be tested for bugs; validation can be run as tests

Reports can include generated content, compiled automatically

Build outputs can be uploaded to shared or public repositories
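
As an illustration of running validation as a test, the sketch below checks that every row of a comma-separated data file has the expected number of columns and reports the offending lines. The `DataValidator` class and the sample rows are hypothetical; in a real project such a check would typically sit in a unit test, so that a broken data file fails the build.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class DataValidator {

    // Returns one message per row whose column count differs from the expected one.
    static List<String> validate(List<String> rows, int expectedColumns) {
        List<String> problems = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            // limit -1 keeps trailing empty fields, so "a," counts as two columns
            int columns = rows.get(i).split(",", -1).length;
            if (columns != expectedColumns) {
                problems.add("row " + (i + 1) + ": expected " + expectedColumns
                        + " columns but found " + columns);
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // Made-up sample data: the second row is missing a column
        List<String> rows = Arrays.asList("speaker1,0.42", "speaker2", "speaker3,0.37");
        for (String problem : validate(rows, 2)) {
            System.out.println(problem);
        }
    }
}
```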

Next

Upcoming topics

  • Source code management
  • Modular builds

Questions?