Engineering for reproducibility

Published: 2019-06-03 | Updated: 2019-06-03

This post originally appeared in the Cloudera Fast Forward Labs Newsletter.

Applied research at Cloudera Fast Forward Labs involves building software. Each of our research reports is accompanied by a prototype, which serves several functions. Building something helps us to understand the technology we’re researching, and making a product helps to connect the machine learning capability to something tangible. Our prototyping process always involves some iteration, and sometimes we pivot to a different idea as we better understand the use cases for a new capability. Between prototypes, ML app dev engagements, and following our own interests, we do our share of software engineering, so it’s natural that we take an interest in good practice.

Joel Grus - of Data Science From Scratch and hating notebooks fame - recently gave an ICLR talk about reproducibility in machine learning research. The framing of the talk is to use reproducibility - which is broadly acknowledged to be an important and good thing for all kinds of research - as a vehicle for encouraging software engineering best practice. Of course (spoiler alert), it turns out that good software engineering not only helps make research reproducible, but brings a slew of other benefits. I recommend reading over the slide deck and Joel’s notes on Reproducibility at ICLR.

The “developer data scientist” persona seems to be an emerging trend. There is a trope about data scientists typically falling into one of two “types”, reflecting their background in either science (type A) or engineering (type B). I believe this distinction to be outdated. As data science has gained popularity, so the spectrum of skills found in the role has filled out from an already broad base.

It is indicative of this trend that public sector bodies working in the space are also taking good engineering practice seriously: see, for instance, the UK civil service’s work on Reproducible Analytical Pipelines (RAP) for the production of official statistics. The linked post is a couple of years old, but work on RAP is very much alive.

As a personal opinion, data scientists seeking to maximise their impact are often well advised to invest in their engineering skills.

Illustrative addendum

As I write this on a Friday, I just today saved myself a potential world of pain with a unit test. I was writing a function to split a dataset into three chunks, building on scikit-learn’s train_test_split. For the enthusiasts’ perusal, here it is with the bug.

from sklearn.model_selection import train_test_split


def train_dev_test_split(df,
                         dev_fraction=0.2,
                         test_fraction=0.2,
                         stratify=None,
                         random_state=None):
    """
    A long and boring docstring
    """

    rest_fraction = dev_fraction + test_fraction

    if rest_fraction > 1:
        raise(ValueError(
            """
            The sum of `dev_fraction` and `test_fraction`
            must be less than one.
            """
        ))

    train, rest = train_test_split(
        df,
        stratify=stratify,
        test_size=rest_fraction,
        random_state=random_state
    )

    dev, test = train_test_split(
        rest,
        stratify=stratify,
        test_size=test_fraction,
        random_state=random_state
    )

    return train, dev, test

Is it obvious what the bug is? It wasn’t to me, even though I had thought about the issue while writing the code. Here’s the simple unit test that caught it:

import pandas as pd


def test_splits_have_expected_lengths():
    df = pd.DataFrame({'X': range(100), 'y': range(100)})
    train, dev, test = train_dev_test_split(df)
    assert len(train) == 60
    assert len(dev) == 20
    assert len(test) == 20

In the function definition, when I split the dataset for a second time, I had set test_size=test_fraction. Since the dataset had already been split once, I was accidentally returning only a fraction of the already reduced set: with the default fractions of 0.2 each, the second split received 40 of the 100 rows and returned 32 for dev and 8 for test, rather than the expected 20 and 20. The test_size parameter should in fact be test_fraction/rest_fraction. Had I not written a simple check that the function was behaving as expected, I might have worked this out later, but it would certainly have taken more time, and possibly caused invalid results in the meantime. Even when composing well-tested APIs like scikit-learn, I find it worthwhile to invest in automated testing to protect myself from myself.
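
For completeness, here is a minimal sketch of the corrected function, with only the rescaled test_size in the second split changed (the docstring, stratification, and input validation from the version above are omitted for brevity):

from sklearn.model_selection import train_test_split


def train_dev_test_split(df,
                         dev_fraction=0.2,
                         test_fraction=0.2,
                         random_state=None):
    # First split: carve off the combined dev + test portion.
    rest_fraction = dev_fraction + test_fraction
    train, rest = train_test_split(
        df,
        test_size=rest_fraction,
        random_state=random_state
    )

    # Second split: `rest` holds only `rest_fraction` of the rows, so the
    # desired test fraction must be rescaled relative to that subset.
    dev, test = train_test_split(
        rest,
        test_size=test_fraction / rest_fraction,
        random_state=random_state
    )

    return train, dev, test

With this change, the unit test above passes: 60, 20, and 20 rows respectively.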