mjg59 | Your project's RCS history affects ease of contribution (or: don't squash PRs)

Github recently introduced the option to squash commits on merge, and even before then several projects requested that contributors squash their commits after review but before merge. This is a terrible idea that makes it more difficult for people to contribute to projects.

I'm spending today working on reworking some code to integrate with a new feature that was just integrated into Kubernetes. The PR in question was absolutely fine, but just before it was merged the entire commit history was squashed down to a single commit at the request of the reviewer. This single commit contains type declarations, the functionality itself, the integration of that functionality into the scheduler, the client code and a large pile of autogenerated code.

I've got some familiarity with Kubernetes, but even then this commit is difficult for me to read. It doesn't tell a story. I can't see its growth. Looking at a single hunk of this diff doesn't tell me whether it's infrastructural or part of the integration. Given time I can (and have) figured it out, but it's an unnecessary waste of effort that could have gone towards something else. For someone who's less used to working on large projects, it'd be even worse. I'm paid to deal with this. For someone who isn't, the probability that they'll give up and do something else entirely is even greater.

I don't want to pick on Kubernetes here - the fact that this Github feature exists makes it clear that a lot of people feel that this kind of merge is a good idea. And there are certainly cases where squashing commits makes sense. Commits that add broken code and which are immediately followed by a series of "Make this work" commits also impair readability and distract from the narrative that your RCS history should present, and Github present this feature as a way to get rid of them. But that ends up being a false dichotomy. A history that looks like "Commit", "Revert Commit", "Revert Revert Commit", "Fix broken revert", "Revert fix broken revert" is a bad history, as is a history that looks like "Add 20,000 line feature A", "Add 20,000 line feature B".

When you're crafting commits for merge, think about your commit history as a textbook. Start with the building blocks of your feature and make them one commit. Build your functionality on top of them in another. Tie that functionality into the core project and make another commit. Add client support. Add docs. Include your tests. Allow someone to follow the growth of your feature over time, with each commit being a chapter of that story. And never, ever, put autogenerated code in the same commit as an actual functional change.

People can't contribute to your project unless they can understand your code. Writing clear, well commented code is a big part of that. But so is showing the evolution of your features in an understandable way. Make sure your RCS history shows that, otherwise people will go and find another project that doesn't make them feel frustrated.

(Edit to add: Sarah Sharp wrote on the same topic a couple of years ago)

Flat | Top-Level Comments Only

From: (Anonymous)

I find myself doing this a lot, where I hack in WIP and SQUASH and all sorts of junk commits into the tree, but when I reach the end of a development state, I back the branch up, git reset back to the start, and work everything back in step by step, using git stash a lot.

To some people this is probably a lot more work than they are willing to do, but I think it's the only way I've found that gets between type A (revert, squash, hack, WIP, revert WIP) and type B, squash it all.

airlied

You can achieve a lot with git rebase -i...

I also do the same, I consider it my duty to whomever will inherit that code for future maintenance. The (final) commit messages are also as complete as possible.

The process of cleaning up a feature/topic branch into something worth of being enshrined in history is _also_ a self-review of the code. I've often found issues at that stage, and fixed them before wasting time of my peers.

I also compare the end result of the cleanup with the end result of the old mess, to ensure I did not introduce any errors in the process. Then, it is "git commit --amend -S" time (to seal the branch with a crypto signature), and ship it to review/integration (merge request).

https://joeyh.name/blog/entry/our_beautiful_fake_histories/

I almost never squash, but i do use git rebase -i and fixup action a lot. Also, git commit --amend.

I think it was Donald Knuth who once said we write software not to convince the computer to do the right thing but to convince our peers that the program is doing the right thing.

The beauty of your post is that it convincingly brings across that the version control "structure" is part of this software, and must be "written" towards our peers too.

Under pressure, the temptation is high to keep whacking at the code until we think it "works". Same with the commit history. And if we are offered an hydraulic press... squash away! (not)

I'm a bit disappointed that xevilteddy, from only a couple of posts back,
doesn't even have a git history narrative that tells me which parts are evil and which parts are xteddy!

I agree that merging all commits into a single ones does more harm than good. Personally, what I like to do before publishing any commit history is to interactively rebase it to squash together things such as trivial corrections to commits, essentially to avoid noise. A good example of that is a commit that tries to implement something, then a commit that does something else, and finally a commit the adds something that I forgot in the first one: that third one I would squash into the first one.

From:

jack

That's about what I thought, although I've not worked on any projects large enough I've first-hand experience of which is better in practice.

In fact, I've commented elsewhere the history model may not be quite complete, a more helpful model might be that when viewing the history, by default, you view features like that as if they were squashed. But that history is preserved and you can choose to see it.

That also gives options like, if the history is a mess, you can still *allow* people to see it, if that's more useful than just a single "new feature" commit, but not pollute the upstream branch with "oops, doesn't compile, undo" commits. Or any other reason people may want to see the "actual history" rather than the "cleaned up logical-commit history".

The real problem seems to be that this PR has >4000 LoC. If it were split up in logical units (separate PRs) to provide the infrastructure for this feature, it would not have been a problem.

This is what we do in Drupal, which is a much larger project, with more contributors, and more experience with this problem.

lersek

https://www.reddit.com/r/programming/comments/4cx9a8/squash_your_commits_on_github/

brainwane

Thanks for this post -- agreed.

"And never, ever, put autogenerated code in the same commit as an actual functional change."

Why would you ever commit any autogenerated code (or autogenerated anything else) to your repository? Repos are for *sources*. Aren't they?

Matthew Garrett

Your project's RCS history affects ease of contribution (or: don't squash PRs)

Your project's RCS history affects ease of contribution (or: don't squash PRs)

finish the code, and then start again with the history

Re: finish the code, and then start again with the history

Re: finish the code, and then start again with the history

no subject

no subject

The keyword here is "textbook"

xevilteddy

Rebase

no subject

>4000 LoC

also discussed on reddit, about a month ago

no subject

Autogenerated assets in the repo

Profile

About Matthew

Page Summary

Expand Cut Tags