Best Practices for Reproducible Research

Session 4

Ingmar Steiner

2017-05-24

Build Lifecycle

Build, rinse, repeat

Developing and building software is an iterative process.

  1. Write code
  2. Compile code
  3. Test code
  4. Go to 1.

Eventually, the software is ready to use (according to some specification), and can be “shipped”, i.e., released.

Round two

Inevitably, more development work needs to be done, due to:

  • Unexpected failures
    • Bugs in the code
    • Bugs in the data
    • Bugs in the system
  • Changed specifications
  • etc.

A new version is developed and released.

Consumer perspective

  1. Identify need for software
  2. Obtain software, version 1
  3. Use software
  4. Discover bugs
  5. Report bugs
  6. Wait for version 2

Developer perspective

  1. Release version 1
  2. Receive bug report, etc.
  3. Build lifecycle
    1. Write code
    2. Compile code
    3. Test code
    4. Go to 1.
  4. Release version 2

SCM perspective

  1. Commit new code
  2. Tag release version 1
  3. Create bugfix branch
    1. Write test code to reproduce bug
    2. Write code until test passes
  4. Merge bugfix branch
  5. Tag release version 2
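
Steps 3.1 and 3.2 above can be sketched in Python. The function count_words and the bug it once had are hypothetical, invented here for illustration only; the point is that the test is written first, fails, and then guards against the bug returning.

```python
import re
from collections import Counter


def count_words(text):
    """Count lower-case words in text (hypothetical function under repair).

    Splitting on non-letters yields an empty string whenever the text
    starts or ends with a non-letter. Filtering out empty tokens is the
    fix that makes the regression test below pass.
    """
    tokens = re.split(r"[^a-z]+", text.lower())
    return Counter(token for token in tokens if token)


def test_leading_punctuation_yields_no_empty_word():
    """Regression test, written first to reproduce the reported bug."""
    counts = count_words('"Hello, hello!"')
    assert "" not in counts
    assert counts["hello"] == 2
```

A test runner such as pytest would pick up the test function automatically; calling it directly works as well.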

Open-source software (OSS) models can blur the boundaries between these perspectives.

Build management

  • Specialized build tools can help developers automate this build/test/release lifecycle

  • They should produce reproducible builds given the same source code

  • SCM hooks can further automate the lifecycle (for continuous integration testing)

  • But it’s a good idea to manage build tools and source code separately
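
One possible way to check that a build is reproducible is to build twice and compare digests of the outputs. The sketch below assumes the build writes its results into a directory; the helper names digest_tree and builds_match are made up for this example.

```python
import hashlib
from pathlib import Path


def digest_tree(root):
    """Return one SHA-256 digest covering every file below root.

    Files are visited in sorted order so that the digest depends only on
    relative paths and file contents, not on directory listing order.
    """
    root = Path(root)
    sha = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Hash the relative path, then a fixed-length digest of the content.
        sha.update(path.relative_to(root).as_posix().encode("utf-8"))
        sha.update(b"\0")
        sha.update(hashlib.sha256(path.read_bytes()).digest())
    return sha.hexdigest()


def builds_match(first_output_dir, second_output_dir):
    """True if two build output directories have identical contents."""
    return digest_tree(first_output_dir) == digest_tree(second_output_dir)
```

File metadata such as timestamps is deliberately ignored here; only paths and contents are compared.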

Software build tools

  • Build tools specialized for various programming languages automate or simplify the build/test/release lifecycle for those languages… or in general!

  Java:        Ant, Maven, Gradle
  Python:      SCons, Waf, PyBuilder
  Ruby:        Rake
  JavaScript:  Grunt, Gulp
  C/C++:       GNU Make (The Original Build Tool.™)

Build tool wishlist

  • Fast
  • Efficient
  • Easy to use
  • Cross-platform
  • Minimal requirements
  • Flexible
  • Automate build/test/release lifecycle (duh!)

Build Tool Examples

Build lifecycle

Remember:

  1. Build
  2. Test
  3. Release

These are just tasks.

We might as well just

  1. Do this
  2. Do that
  3. Do something else

Build script…?

A build script describes the tasks which must be performed to build a project.

A “build script” could just be a README file that lists the steps.

The user would then need to follow the instructions manually.

Build shell scripts

A “build script” could also be an actual shell script:

build.sh

#!/bin/sh

do_this() { echo "doing this"; }

do_that() { echo "doing that"; }

do_something_else() { echo "doing something else"; }

# main
do_this
do_that
do_something_else

Build script automation

  • “Real” build scripts can be parsed by build tools for automatic build execution.

  • This requires a specific format/syntax (depending on the tool)
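
To illustrate what “parsing a build script” means, here is a minimal sketch of a toy build tool. The one-line-per-task format (task_name: shell command) is invented for this example and is not the syntax of any real tool.

```python
import subprocess


def parse_build_script(text):
    """Parse lines of the form 'task_name: shell command' into a dict.

    This format is invented for illustration; blank lines and lines
    starting with '#' are ignored.
    """
    tasks = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, command = line.partition(":")
        tasks[name.strip()] = command.strip()
    return tasks


def run_task(tasks, name):
    """Execute one task's command in a shell and return what it printed."""
    result = subprocess.run(
        tasks[name], shell=True, check=True,
        capture_output=True, text=True,
    )
    return result.stdout


# The same three tasks as in build.sh, in the toy format.
SCRIPT = """
# tasks mirroring build.sh
do_this: echo doing this
do_that: echo doing that
do_something_else: echo doing something else
"""

if __name__ == "__main__":
    tasks = parse_build_script(SCRIPT)
    for name in tasks:  # dicts preserve insertion order
        print(run_task(tasks, name), end="")
```

Real build tools add what this sketch lacks: dependencies between tasks and a way to skip work that is already done.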

GNU Make

  • The GNU implementation of Make
  • Original Make first released in 1976
  • Still widely used
  • Custom build script language
  • Builds anything (traditionally C/C++) via shell commands

Makefile

do_this:
    @echo 'doing this'

do_that:
    @echo 'doing that'

do_something_else:
    @echo 'doing something else'

Makefile basics

GNU Make build scripts (Makefiles) define a number of rules (i.e., build tasks).

target: prerequisite
    recipe

  • Running make target will run recipe
  • If prerequisite is defined, make prerequisite will be run first

Pitfall: The leading whitespace before recipe must be an actual tab character, not spaces; otherwise Make aborts with a “missing separator” error.

Task dependencies

  • Prerequisites establish relations between rules; task dependencies can be represented as a directed acyclic graph (DAG)
  • If make is run with no explicit target argument, the first rule in the Makefile is invoked.

Makefile

do_something_else: do_that
    @echo 'doing something else'

do_that: do_this
    @echo 'doing that'

do_this:
    @echo 'doing this'
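
The dependency chain in the Makefile above can be sketched as a DAG in Python. This is only an illustration of the idea, not how Make is implemented; it assumes Python 3.9 or later for the standard-library graphlib module, and the function name execution_order is made up.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks it depends on (its prerequisites),
# mirroring the rules in the Makefile above.
PREREQUISITES = {
    "do_something_else": {"do_that"},
    "do_that": {"do_this"},
    "do_this": set(),
}


def execution_order(prerequisites, goal):
    """Return the tasks needed for goal, prerequisites first."""
    # Collect only the tasks reachable from the goal.
    needed = {}
    stack = [goal]
    while stack:
        task = stack.pop()
        if task not in needed:
            needed[task] = prerequisites.get(task, set())
            stack.extend(needed[task])
    # static_order() raises graphlib.CycleError if the graph has a cycle.
    return list(TopologicalSorter(needed).static_order())


if __name__ == "__main__":
    print(execution_order(PREREQUISITES, "do_something_else"))
```

Asking for do_that yields only do_this and do_that; tasks the goal does not depend on are never run.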

Task outputs

Rules are normally used to create file targets.

somethingelse: that
    touch somethingelse

that: this
    touch that

this:
    touch this

# clean creates no file, so declare it phony
.PHONY: clean
clean:
    @rm -f this that somethingelse

File modification timestamps are used to determine which targets are up to date: a target is rebuilt only if it is missing or older than one of its prerequisites.
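
The timestamp check can be approximated in a few lines of Python. This is a simplified sketch of the idea rather than Make's actual algorithm, and the function name is_up_to_date is an assumption made for this example.

```python
import os


def is_up_to_date(target, prerequisites):
    """Approximate Make's check: is target at least as new as its inputs?

    A missing target, a missing prerequisite, or a prerequisite modified
    after the target all mean the target must be rebuilt.
    """
    if not os.path.exists(target):
        return False
    target_mtime = os.path.getmtime(target)
    return all(
        os.path.exists(p) and os.path.getmtime(p) <= target_mtime
        for p in prerequisites
    )
```

For the rules above, is_up_to_date("that", ["this"]) turns false as soon as this is touched again, which is why running make a second time without changes does nothing.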

Apache Ant

  • First released in 2000
  • Runs on Java
  • XML-based build scripts
  • Builds Java, extensible to build anything

build.xml

<project default="something_else">
    <target name="this">
        <echo message="doing this"/>
    </target>
    <target name="that" depends="this">
        <echo message="doing that"/>
    </target>
    <!-- target names without spaces are easier to pass on the command line -->
    <target name="something_else" depends="that">
        <echo message="doing something else"/>
    </target>
</project>

Rake

  • First released in 2003
  • “Ruby port” of Make
  • Runs on Ruby
  • Builds anything via Ruby-based build scripts

Rakefile

task :this do
    puts "doing this"
end

task :that => :this do
    puts "doing that"
end

task :something_else => :that do
    puts "doing something else"
end

task :default => :something_else

Gradle

  • First released in 2007
  • Runs on Java
  • Groovy-based build scripts with custom DSL
  • Builds Java, C/C++, Android, anything else
  • Extensible via plugins

build.gradle

defaultTasks 'something_else'

// doLast replaces the << operator, which was removed in Gradle 5.0
task('this') {
    doLast {
        println 'doing this'
    }
}

task that(dependsOn: 'this') {
    doLast {
        println 'doing that'
    }
}

task something_else(dependsOn: 'that') {
    doLast {
        println 'doing something else'
    }
}

PyBuilder

  • First released in 2011
  • Inspired by Ant, Maven, Gradle
  • Python-based build scripts
  • Builds Python, anything else
  • Extensible via plugins

build.py

from pybuilder.core import task, depends

# print() calls work under both Python 2 and Python 3

@task
def this():
    print("doing this")

@task
@depends("this")
def that():
    print("doing that")

@task
@depends("that")
def something_else():
    print("doing something else")

default_task = "something_else"

Conclusion

  • Build tools can emulate Make via shell execution features
  • But leveraging their “native”, object-oriented language makes them
    • more efficient
    • more powerful
    • cross-platform
  • Modular build logic (e.g., plugins) can be externalized and re-used

Build Tools for Research

Research differs from software engineering…

Typical workflow:

  1. Get data
  2. Convert data
  3. Run experiments
  4. Collect results

But these are also just tasks!

Mixing external tools and custom scripts is common

Adding tests is a good idea!

Real-world example

  • Bob, Kevin, and Stuart want to analyze word distribution for books from Project Gutenberg
  • This time, they version the build script in SCM – not the data

Tasks include:

  1. Download text resources
  2. Strip formatting
  3. Convert to lower case
  4. Count word frequencies
  5. Generate a barplot for the 20 most frequent words

Makefile

plot.svg: data_words.txt
    @gnuplot -e '\
    set terminal svg;\
    set output "plot.svg";\
    set size ratio 0.5;\
    set boxwidth 0.5;\
    set style fill solid;\
    plot "data_words.txt" using 1:xtic(2) with boxes'

data_words.txt: data_lower.txt
    @perl -ne '\
    @w = split /[^a-z]+/;\
    foreach (@w) {\
      $$w{$$_}++;\
    }\
    END {\
      foreach (keys(%w)) {\
        printf "%d\t%s\n", $$w{$$_}, $$_\
      }\
    }' < data_lower.txt | sort -nr | head -n 20 > data_words.txt

data_lower.txt: data_stripped.txt
    @tr '[:upper:]' '[:lower:]' < data_stripped.txt > data_lower.txt

# $$ is Make's escape for a literal $, so Perl receives $1
data_stripped.txt: data.txt
    @perl -pe 's/_(.+?)_/$$1/g' < data.txt > data_stripped.txt

data.txt:
    @wget 'http://aleph.gutenberg.org/1/7/9/5/17958/17958-8.zip' -O - | funzip | recode latin1..utf8 > data.txt
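
The text-processing steps of the Makefile above (strip formatting, convert to lower case, count words) could be sketched in Python as follows. The function names are made up for this example, and downloading and plotting are left out.

```python
import re
from collections import Counter


def strip_formatting(text):
    """Remove the underscores that mark _emphasized_ passages."""
    return re.sub(r"_(.+?)_", r"\1", text)


def top_words(text, n=20):
    """Return the n most frequent lower-case words with their counts."""
    words = re.findall(r"[a-z]+", strip_formatting(text).lower())
    return Counter(words).most_common(n)
```

Using re.findall instead of splitting on non-letters avoids counting empty strings as words.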

Assignment

  • Do this with another build tool (Ant, Rake, Gradle, PyBuilder, etc.)
  • Leverage the build script language if possible (Ruby for Rake, Groovy for Gradle, Python for PyBuilder)
  • Plot using any OSS framework (e.g., R, GNU Octave, matplotlib, etc.)
  • Write a brief README
  • Version all code with SCM

Up next

  • Emacs (guest star: Sébastien Le Maguer)

Note: Alternate location!

Questions?