NBSVM: a strong document (topic) classification baseline

Published: 2019-08-14 | Updated: 2019-08-14

This post originally appeared in the Cloudera Fast Forward Labs Newsletter.

Transfer Learning for Natural Language Processing is on your screens, and we hope you’re enjoying it! To further your enjoyment and assist your document classification projects, we’re releasing the code for a benchmark algorithm we used in the report.

In the report itself, we detail the benefits and tradeoffs of transfer learning with large, pretrained deep neural networks. To establish where transfer learning shines, we needed a reasonable baseline to compare against for our chosen task, sentiment classification. Establishing a baseline is vital when beginning experimentation for any machine learning application. For instance, in a simple binary classification problem, it is always wise to know the accuracy score if one simply predicts the majority class: 95% accuracy doesn’t seem quite so impressive if 95% of the examples have the same label.
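
As a concrete illustration (a toy sketch, not code from the report), scikit-learn's DummyClassifier gives that majority-class number directly; the data here is synthetic, constructed so that roughly 95% of labels share one class:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Toy data: about 95% of the labels are 0, so the majority-class
# baseline scores roughly 95% accuracy without learning anything.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = (rng.random(1000) > 0.95).astype(int)

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print("majority-class accuracy:", baseline.score(X, y))
```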

For our sentiment classification task, a great baseline model is provided in the paper Baselines and Bigrams: Simple, Good Sentiment and Topic Classification, which has aged well since its 2012 release. The paper examines two primary learning algorithms as applied to sentiment and topic classification: multinomial naive Bayes (MNB) and support vector machines (SVMs). It’s a short and pragmatic paper, providing practical wisdom for applying the algorithms: multinomial naive Bayes outperforms more complex methods on short snippets of text, whereas support vector machines win for longer documents. The authors combine these two methods into the so-called NBSVM: a support vector machine using naive Bayes log-count ratios as features. Further, a weighted interpolation between this NBSVM and plain MNB is used, and the interpolated model is shown to be robust across a variety of text lengths.
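
To make the core construction concrete, here is a minimal sketch on a toy corpus, assuming binary labels and binarized counts as in the paper; the MNB interpolation step and the report's actual hyperparameters are omitted:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

docs = ["good great film", "great acting good plot",
        "bad awful film", "awful plot bad acting"]      # toy corpus
y = np.array([1, 1, 0, 0])                              # 1 = positive, 0 = negative

X = CountVectorizer(binary=True).fit_transform(docs)    # binarized counts, as in the paper
alpha = 1.0                                              # Laplace smoothing

p = alpha + np.asarray(X[y == 1].sum(axis=0)).ravel()   # smoothed positive-class counts
q = alpha + np.asarray(X[y == 0].sum(axis=0)).ravel()   # smoothed negative-class counts
r = np.log((p / p.sum()) / (q / q.sum()))                # naive Bayes log-count ratio

X_nb = X.multiply(r).tocsr()                             # scale each feature by r
clf = LinearSVC(C=1.0).fit(X_nb, y)                      # the "SVM" part of NBSVM
print(clf.predict(X_nb))
```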

It surprised us how effective this algorithm is, given enough data, especially since it uses only bag-of-words features (with some preprocessing: uni- and bigrams, stopword removal, and stemming). With only a few hundred training examples, NBSVM performed poorly, but with thousands of labelled examples it reached a respectable 85% accuracy on our problem. It does not compete on accuracy with deep models, especially in the regime with few labelled samples, where transfer learning allows neural models to maintain good accuracy. However, it does compete on ease of use, and avoids much of the complexity of serving large neural networks.
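
For reference, the feature setup described above can be approximated in scikit-learn along these lines (a reasonable guess at settings, not necessarily the exact configuration we used); stemming is not built into CountVectorizer, so it would need an external stemmer such as NLTK's SnowballStemmer wired in through the tokenizer argument:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Binarized uni- and bigram counts with English stopwords removed.
vectorizer = CountVectorizer(ngram_range=(1, 2), stop_words="english", binary=True)
X = vectorizer.fit_transform(["an example review, quite positive",
                              "another review, rather negative"])
print(X.shape)
```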

To make the technique easy to use and repeatable, we implemented it as a scikit-learn classifier. This saved us writing a lot of custom code - we didn’t even have to provide a predict function - and let us plug it into any tool that conforms to the scikit-learn classifier API. Open source is awesome.
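
The released code is the reference implementation; purely to illustrate why the scikit-learn API makes this so cheap, here is one hypothetical way to get predict and score for free by subclassing LinearSVC and folding the naive Bayes scaling back into the learned coefficients (not necessarily how our implementation does it):

```python
import numpy as np
from sklearn.svm import LinearSVC


class NBSVMClassifier(LinearSVC):
    """Hypothetical sketch of a scikit-learn style NBSVM.

    Assumes binary 0/1 labels and sparse count features,
    e.g. the output of CountVectorizer(binary=True).
    """

    def __init__(self, alpha=1.0, C=1.0):
        super().__init__(C=C)
        self.alpha = alpha

    def fit(self, X, y):
        # Naive Bayes log-count ratio, as in Baselines and Bigrams.
        p = self.alpha + np.asarray(X[y == 1].sum(axis=0)).ravel()
        q = self.alpha + np.asarray(X[y == 0].sum(axis=0)).ravel()
        self.r_ = np.log((p / p.sum()) / (q / q.sum()))
        # Fit the linear SVM on r-scaled features.
        super().fit(X.multiply(self.r_).tocsr(), y)
        # The model is linear, so scaling the inputs by r is equivalent to
        # scaling the learned weights by r; folding r into coef_ lets the
        # inherited predict() and score() operate on raw counts unchanged.
        self.coef_ = self.coef_ * self.r_
        return self
```

Because such a class follows the scikit-learn estimator conventions, it can be dropped straight into a Pipeline, GridSearchCV, or cross_val_score without any glue code.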

We’re happy to share our implementation, and we hope it’s useful to you.