AutoML, open and closed

Published: 2019-04-29 | Updated: 2019-04-29

This post originally appeared in the Cloudera Fast Forward Labs Newsletter.

The trouble with writing about AutoML is that one first has to decide what AutoML is. Is it a catchall for automating any part of the machine learning process? Is it the set of techniques discussed in the ICML workshop of the same name? Is it a specific product? The answer to all of those questions is yes.

Beginning with the last definition, Google recently enhanced its AutoML offering with Tables, available in beta. Whereas Google’s previous AutoML functionality was limited to specific applications such as images or speech, AutoML Tables lets one build automated machine learning pipelines on generic tabular data. The promise of the tool is to enable domain experts and citizen data scientists with labelled data to create predictive pipelines without a deep understanding of machine learning.

Here at Cloudera Fast Forward Labs, we’re duly cautious of such black box approaches. Quoting the conclusion of our Interpretability report:

Interpretability is a powerful and increasingly essential capability. A model you can interpret and understand is one you can more easily improve. It is also one you, regulators, and society can more easily trust to be safe and nondiscriminatory. And an accurate model that is also interpretable can offer insights that can be used to change real-world outcomes for the better.

By implication, a non-interpretable model suffers the reverse effects: it is harder to improve and offers little insight beyond its exact output. Most dangerously, lack of interpretability can allow harmful biases to go undetected. Google is clearly aware of the problem, and it’s encouraging to see the company releasing advice to help users of its AutoML product create systems that are fair and ethical.

There are model-agnostic methods for gaining local, decision-level explanations of black box model outputs; but even without interpretability (and with due warnings issued), black box AutoML tools are not without their uses. If nothing else, automated tools let a practitioner get a bearing on how amenable a problem is to machine learning. Rapid evaluation of predictability could be a powerful enabler for teams of data scientists faced with many business problems.

At the other end of the openness spectrum, AutoML, used in its broadest sense of automating parts of the machine learning workflow, is empowering “developer data scientists” to be more productive. There is a maturing ecosystem of open source tools in the space. For instance, several Python libraries already support highly automated end-to-end workflows:

  • TPOT: TPOT calls itself your “Data Science Assistant.” It uses genetic programming to automatically explore and optimize feature selection and processing, model selection (from scikit-learn models), and parameter tuning. As well as providing the winning model as a Python object, it can export the code for that pipeline.
  • auto-sklearn: Similar in nature to TPOT, but leveraging ensemble and meta-learning approaches in place of genetic programming, as outlined in the paper Efficient and Robust Automated Machine Learning.
  • automl-gs: automl-gs goes a step further than the above libraries in automating the workflow: it can be run as a one-liner at the command line. It takes a CSV file and the name of the target column and generates Python code for the whole pipeline. It currently supports TensorFlow and XGBoost models.
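The loop these libraries automate, trying candidate pipelines, scoring them, and keeping the best, can be made concrete with a dependency-free toy. The models and search space below are hypothetical stand-ins for illustration only, not the internals of TPOT or auto-sklearn:

```python
import random

# Toy labelled data: points above the line y = x are class 1, below are
# class 0, with 10% label noise so that no candidate model is perfect.
random.seed(0)
data = []
for _ in range(200):
    x, y = random.random(), random.random()
    label = 1 if y > x else 0
    if random.random() < 0.1:
        label = 1 - label
    data.append(((x, y), label))

train, test = data[:150], data[150:]

def knn_predict(k):
    """A k-nearest-neighbours 'model' parameterised by k."""
    def predict(point):
        dists = sorted(
            ((px - point[0]) ** 2 + (py - point[1]) ** 2, lbl)
            for (px, py), lbl in train
        )
        votes = [lbl for _, lbl in dists[:k]]
        return 1 if sum(votes) * 2 >= len(votes) else 0
    return predict

def threshold_predict(axis):
    """A trivial 'model': classify by whether one coordinate exceeds 0.5."""
    def predict(point):
        return 1 if point[axis] > 0.5 else 0
    return predict

# The 'search space': candidate model families and their hyperparameters.
candidates = [("knn", knn_predict(k)) for k in (1, 3, 5, 9)]
candidates += [("threshold", threshold_predict(axis)) for axis in (0, 1)]

def accuracy(predict):
    return sum(predict(p) == lbl for p, lbl in test) / len(test)

# The automated part: evaluate every candidate and keep the best one.
best_name, best_model = max(candidates, key=lambda c: accuracy(c[1]))
print(best_name, round(accuracy(best_model), 2))
```

Real AutoML libraries differ in how they traverse a vastly larger space (genetic programming for TPOT, meta-learning and Bayesian optimization for auto-sklearn), but the select-by-held-out-score skeleton is the same.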

The above libraries are great for establishing a baseline model and generating accompanying code to build upon. However, one need not go all the way to automation in one step. For instance, Hyperopt implements sophisticated hyperparameter search methods, as described in Algorithms for Hyper-Parameter Optimization. The follow-up paper Making a Science of Model Search mentions a particularly novel (and increasingly important) application of this kind of optimisation: situations where there are constraints on the model. For example, we may need a classifier with a small enough memory footprint for a mobile device, or have a strict requirement on the time it takes to predict a single instance, and want to maximise accuracy within those constraints.
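The constrained setting can be sketched in a few lines: randomly sample hyperparameter configurations, discard any whose cost exceeds a deployment budget, and maximise the score among the rest. The cost and score functions here are illustrative stand-ins (not Hyperopt’s API, and not a real memory or accuracy model):

```python
import random

random.seed(42)

# Hypothetical search space for a tree ensemble: number of trees and depth.
def sample_config():
    return {
        "n_trees": random.randint(10, 500),
        "depth": random.randint(2, 16),
    }

def memory_kb(cfg):
    """Illustrative memory model: footprint grows with trees * nodes per tree."""
    return cfg["n_trees"] * (2 ** cfg["depth"]) * 0.01

def score(cfg):
    """Illustrative accuracy proxy: more capacity scores better, saturating."""
    capacity = cfg["n_trees"] * cfg["depth"]
    return capacity / (capacity + 500)

BUDGET_KB = 2000  # e.g. a memory budget imposed by a mobile device

best = None
for _ in range(1000):
    cfg = sample_config()
    if memory_kb(cfg) > BUDGET_KB:
        continue  # infeasible: violates the deployment constraint
    if best is None or score(cfg) > score(best):
        best = cfg

print(best, round(score(best), 3), round(memory_kb(best), 1))
```

Swapping the random sampler for a smarter strategy (such as the tree-structured Parzen estimator the Hyperopt papers describe) changes how efficiently the feasible region is explored, but the constraint check stays the same.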

There is clearly value in automating parts of the machine learning process. While the world at large may not have settled on a single definition of the term AutoML, we expect the use of AutoML in all its flavours to grow.