On commits and history and also on people

On GitHub I recently found myself in a sort of cultural clash over git commit granularity and the structure of a project's history. It raised the question of which values I actually prefer and pushed me to formulate them; until now I had done everything by pure intuition. I will try to answer it here. I think the results are general enough to apply to other VCSs as well (as long as they can rewrite history in feature branches), and to non-open-source projects too.

What I discovered is that when it comes to commits and history, I value two things: minimal cognitive load and maximal blame utility. The former I favour in other areas as well; the latter is VCS-specific.

I will start with minimal cognitive load. By this I mean structuring commits so that whenever a person encounters a commit (or a series of them), the extra energy they have to invest to understand it is minimized. That person is not necessarily someone other than the author: after two months, even my own code is something I need to reread and understand. I see this as plain respect for your fellow readers, so I put it above everything else. As has been said before, code is written for people, not for the machine; the machine is OK with zeros and ones. The same goes for history and its structure, I dare to add.

This boils down to a few principles. One of them is minimal, tidied diffs. Personally, I always try to remember to look over the diff before committing. That said, I use an IDE (WebStorm), where diffs are nicely visualised, so I see all the little irregularities such as bad formatting or misspelled names. In the plain terminal of hardcore coders this seems a bit harder to me, but maybe they are used to it and it is just as readable for them. I fix all such irregularities before I commit. And, which may be more controversial, if I find such a thing a few commits back (ideally before pushing), I fix it retroactively, without hesitation or a second thought. Minimal cognitive load for future readers is worth more than leaving history untouched; done before pushing, it is not even disruptive. When the code is already pushed to a feature branch, I do the retroactive fix anyway and force-push. Either way, a minimal diff means that reading it shows exactly what the commit is about, and it increases blame utility, which I will cover later.

Minimal diffs must be balanced against another principle. I call it single responsibility, as that is an already established term. Here, though, it is not about the single responsibility of a class or method, but about the single responsibility of a commit. This lowers cognitive load greatly, as the reader can go through all the pieces of the diff and understand their role as a whole, without the additional load of separating different concerns. A minimal diff per se is of course achieved simply by squashing all changes into a single commit, but a minimal diff lowers cognitive load only up to a point. That point is the single responsibility of a commit. If a commit mixes several different concerns, the cognitive load rises; the reader needs to spend extra effort to separate them. This should not happen. It is the author's responsibility to structure commits so that they impose minimal cognitive load on the reader, whether through minimal diffs on one side or the single responsibility of a commit on the other.

In other words, I do not value velocity ("let's move on; this works fine") over clarity. Maybe that means I am not the best fit for this fast, brave new world of the lean startup, but that's how I am. OTOH, clarity is velocity in the long run.

The single responsibility of a commit has two sides: not only "a commit should not have more than one concern", but also "a commit should not have less than one concern". This resonates with the minimal diff principle and raises blame utility greatly. If a single concern was implemented in several steps, some of them maybe sidesteps, some of them fixes of those sidesteps (and each step and fix has a commit of its own), it increases the cognitive load of the likely reader, though in the opposite way to the mixed-concerns case. The reader needs no extra effort to isolate concerns within a commit, but must put in extra effort to integrate all the tiny steps, sidesteps and tiny fixes to see what was actually implemented, and how. Again, I strongly believe it is the author's responsibility to integrate this for the reader. In other words, such half-responsibility commits should be squashed before they are published. I do make many tiny commits locally, to be able to revert and so on, but then I do a local rebase to put them together for publishing.
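As a rough illustration of the local clean-up before publishing described above, here is a toy Python model (not git itself) that folds small follow-up commits into the commit they belong to. The `Commit` class, the `squash_fixups` function and the example messages are all hypothetical. The "fixup! " message prefix mirrors the convention git uses for `git commit --fixup` and `git rebase --autosquash`, but a real rebase works on diffs and can hit conflicts, which this sketch ignores.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    message: str
    # Toy stand-in for a diff: line number -> new content of that line.
    changes: dict[int, str] = field(default_factory=dict)


FIXUP_PREFIX = "fixup! "


def squash_fixups(history: list[Commit]) -> list[Commit]:
    """Fold each 'fixup! <subject>' commit into the earlier commit with that subject."""
    result: list[Commit] = []
    for commit in history:
        if commit.message.startswith(FIXUP_PREFIX):
            subject = commit.message[len(FIXUP_PREFIX):]
            target = next((c for c in result if c.message == subject), None)
            if target is not None:
                # The later change wins, as it would once the diffs are combined.
                target.changes.update(commit.changes)
                continue
        # Copy the commit so the local history itself is left untouched.
        result.append(Commit(commit.message, dict(commit.changes)))
    return result


# Hypothetical local history: two concerns plus one tiny fix of the first.
local_history = [
    Commit("Add login form", {1: "<form>", 2: "<input name='usr'>"}),
    Commit("Add logout link", {3: "<a href='/logout'>"}),
    Commit("fixup! Add login form", {2: "<input name='user'>"}),
]

published = squash_fixups(local_history)
print([c.message for c in published])
# ['Add login form', 'Add logout link']
```

The published history has one commit per concern, and the misspelled field name never appears in it; the reader no longer has to integrate the fix mentally.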

So there is no silver bullet in "do a squash" or "separate commits". It is about finding the sweet spot that imposes minimal cognitive load on the reader. Or, IOW, make it as nice as possible for the person who will read it later.

I am really pretty religious about that. Maybe I am violating the first point of the Agile Manifesto with it. The first one is the hardest for me; others may find different ones the hardest.

OK, I will now conclude with "maximal blame utility". This refers to the "blame view" of git and other VCSs. When I am puzzled by some piece of code (and the test scenarios do not shed light on the why and what), I can ask the VCS to show me the commit where that code was last changed. It is very frustrating to land on a commit where some insignificant fix was made (hence minimal diffs and squashing half-commits) and to have to dig deeper, possibly through more insignificant changes, to finally find the root cause of the change. Things are even worse when the change is not an insignificant one (like a late formatting change) but a fix of some sidestep. At that point I must first understand what the fix was trying to fix, and start my search for the root cause from there.
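The effect on the blame view can be sketched with a small toy model in Python. It is only an illustration under simplifying assumptions: a commit is reduced to a message plus the lines it rewrites, and the `Commit` class, the `blame` function and the example histories are hypothetical, not how git computes blame internally.

```python
from dataclasses import dataclass, field


@dataclass
class Commit:
    message: str
    # Toy stand-in for a diff: line number -> new content of that line.
    changes: dict[int, str] = field(default_factory=dict)


def blame(history: list[Commit]) -> dict[int, str]:
    """Map each line number to the message of the last commit that changed it."""
    last_touched: dict[int, str] = {}
    for commit in history:  # oldest first
        for line_no in commit.changes:
            last_touched[line_no] = commit.message
    return last_touched


# The same final code, reached through two differently shaped histories.
noisy = [
    Commit("Retry failed uploads", {10: "for attempt in range(3):", 11: "  upload(file)"}),
    Commit("Fix indentation", {11: "    upload(file)"}),
    Commit("Oops, retry count was wrong", {10: "for attempt in range(5):"}),
]
tidy = [
    Commit("Retry failed uploads", {10: "for attempt in range(5):", 11: "    upload(file)"}),
]

print(blame(noisy))
# {10: 'Oops, retry count was wrong', 11: 'Fix indentation'}
print(blame(tidy))
# {10: 'Retry failed uploads', 11: 'Retry failed uploads'}
```

In the noisy history, blame leads to the fixes and the reader has to dig further to learn why the retry loop exists at all; in the tidy one, every line points straight at the commit that introduced the concern.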

So by "maximal blame utility" I mean "maximal utility of the blame view". When you structure your commits, think of a potential user of the blame view and shape your commits so that no unnecessary barriers are put in their way. Now that I think of it, it is probably just a different name for "a commit should not have less than one concern".

The ultimate goal of all this is to make the codebase a joy to use for all future contributors. This is my guiding principle, the one I follow by the intuition I mentioned at the start. I do not actually measure any cognitive load or stop to think of a blame view user. I just wanted to shed light on the values behind structuring commits the way I do (and behind my belief that it is a good way for others as well), and I found that it is about minimizing the struggle of rereading the history, should there be a need to understand something in it; or about that joy mentioned above. A bit of a metagoal, probably, compared with time to market and the like. I wonder how rare or common this approach is. I would guess it is rare, and that being fast is valued more these days.