Spend more time thinking about names

Ever revisited some code you wrote 6+ months ago and struggled to understand it? Me too.

There are a bunch of reasons why this might be: architecture, lost familiarity with the domain, whatever. One of them is that we often write code that is easy to write but harder to read, and a big part of what makes code hard to read is non-obvious function and variable naming. Names are important. Way more important than we often give them credit for.

There’s a sort of pidgin language people use to name things when coding. Instead of calling an image directory images we call it img. That’s not a particularly egregious example, and I can’t really criticize it because, as I write this, the sidebar in my editor is showing me the public/img folder of this site and it is giving me hard stares ಠ_ಠ

I will pick on myself with a real example. I looked through some old code of mine and found a dashboard presenting an analysis of prescription drug costs in the UK National Health Service. It included some aggregation at the health practice level. Naturally, I called the corresponding data frame prac_data. Why didn’t I make it just a little easier to understand by calling it practice_data, or better yet practice_level_data? I don’t know (well, I do: I was less thoughtful about maintainability at the time), but it’s emblematic of this weird abbreviated language we all use in our code.
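As a rough sketch of the difference, here is the same aggregation assigned to both names. Everything apart from the names prac_data and practice_level_data is an assumption made for illustration: the records are invented, a plain dict stands in for the data frame, and total_cost_by_practice is a hypothetical helper.

```python
from collections import defaultdict

# Invented records for illustration only; not real NHS data.
prescriptions = [
    {"practice_id": "A", "cost_gbp": 12.5},
    {"practice_id": "A", "cost_gbp": 7.5},
    {"practice_id": "B", "cost_gbp": 30.0},
]


def total_cost_by_practice(records):
    """Sum prescription cost per practice (hypothetical helper)."""
    totals = defaultdict(float)
    for record in records:
        totals[record["practice_id"]] += record["cost_gbp"]
    return dict(totals)


# Before: the reader has to guess what "prac" abbreviates.
prac_data = total_cost_by_practice(prescriptions)

# After: the name states the level the data is aggregated at.
practice_level_data = total_cost_by_practice(prescriptions)
```

The two variables hold identical values. The only thing that changes is how much the reader has to reconstruct from context.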

The best advice I could give the average data scientist about naming things is to think about it more. Naming things for easier understanding is a worthy use of your time. I love to see commits that do nothing but change names for the better. (Internally. I am not advocating for wantonly shipping breaking API changes.) If you want something more concrete: you should probably be using slightly longer names.

A short variable name like x is fine in a small scope, or when it corresponds to a firm convention, like the input features being X and the target being y in a supervised learning pipeline. If it’s a global variable, better call it something descriptive in SCREAMING_SNAKE_CASE, like N_EPOCHS. I’ll accept N or NUM for “number of”.
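A minimal sketch of those three cases side by side, assuming a toy one-feature regression. The function fit_slope, the constant LEARNING_RATE, and the numbers are hypothetical and exist only to give the names somewhere to live.

```python
# Module-level values: descriptive names in SCREAMING_SNAKE_CASE.
N_EPOCHS = 3
LEARNING_RATE = 0.1


def fit_slope(X, y):
    """Fit y ~ slope * x by gradient descent on mean squared error.

    X and y follow the supervised learning convention: X holds the input
    features (one number per sample here) and y holds the target.
    """
    slope = 0.0
    for _ in range(N_EPOCHS):
        # x and target only live inside this one expression, so short is fine.
        gradient = sum(
            2 * (slope * x - target) * x for x, target in zip(X, y)
        ) / len(X)
        slope -= LEARNING_RATE * gradient
    return slope


slope = fit_slope([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

The short names sit where convention or a tiny scope explains them, and the names that can be seen from anywhere in the module carry their own explanation.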

For functions, follow basically the opposite rule. If you only use it in a small scope, call it something descriptive, like drop_protected_attribute_columns, because after all, you don’t have to repeat it much, and it’s better to be explicit. If it’s going to be used everywhere, it should probably have a shorter name, because it is probably doing something integral to the problem you’re working on, and that deserves a name that is easy to talk about. I mean that literally: unless you are working in isolation, you will talk about the code, so your names should be pronounceable. That said, giving everything a very long name can make code harder to read and understand, which is the opposite of the goal. None of these heuristics are hard and fast.
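To make the function heuristic concrete, here is a small sketch. Only drop_protected_attribute_columns comes from the text above; the PROTECTED_ATTRIBUTES set, the score function, its weights, and the sample row are all hypothetical.

```python
# Hypothetical set of columns that must not reach the model.
PROTECTED_ATTRIBUTES = {"age", "gender"}


def drop_protected_attribute_columns(rows):
    """Called in one place, so a long, explicit name costs very little."""
    return [
        {key: value for key, value in row.items() if key not in PROTECTED_ATTRIBUTES}
        for row in rows
    ]


def score(row):
    """Called all over the codebase, so it gets a short, sayable name."""
    return 0.5 * row["income"] + 0.5 * row["tenure"]


rows = [{"age": 41, "gender": "F", "income": 10.0, "tenure": 2.0}]
cleaned_rows = drop_protected_attribute_columns(rows)
scores = [score(row) for row in cleaned_rows]
```

Saying "the score for this row looks off" out loud is easy. Saying the long name out loud is a mouthful, which is fine for something that is only mentioned once.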

In the words of Ward Cunningham,

“You know you are working on clean code when each routine you read turns out to be pretty much what you expected.”

I think I prefer “tidy code” over “clean code” as a term, but the idea is true. We should be able to read code and know pretty much what it does without reading the implementation details. The idea there is the same as progressive disclosure: a programmer working with our codebase should not need to first absorb the entire codebase to do useful things.

Possibly the best thing I’ve read about naming is the first chapter of Elements of Clojure by Zachary Tellman. That chapter is free, and you do not need to know Clojure to understand it (I think). It is very good, and you should read it.

Finally, a somewhat-relevant anecdote: my class once asked our high school physics teacher why “mole” was abbreviated to “mol” in scientific literature. It only saves one character, after all. His answer: “It keeps the riff-raff out.”