
Jonathan Pryor's web log

In Defense Of git

On Friday at the OpenOffice.org Conference, we had two sessions discussing the future of Source Code Managers in OpenOffice.org: Child workspaces and the OOo SCM system by Jens-Heiner Rechtien and git: the Source Code Manager for OOo? by Jan Holesovsky (kendy).

In the Q&A section after the git presentation, there was a lot of heated debate in which it seemed that Jan and Jens were talking "past" each other. As a git backer, I thought I'd try to bring some clarity to things.

It seemed that Jens had one fundamental problem with git, one which is itself fundamental to git's operation: commits are not transferred to the remote repository; instead, you need an explicit git-push command to send all local changes upstream. Jens claimed three implications of this (that I remember):

  1. git did not provide line-by-line authorship information, as cvs annotate or svn blame do.
  2. Developers would not see changes made by other developers as soon as they happen.
  3. QA and Release Engineering wouldn't be alerted as soon as developers made any change on any child workspace.

Line-by-line authorship information is available in git via the git blame or git annotate commands (they are synonyms for each other). I suspect I misinterpreted this part of the debate, as all parties should have known that git supported this.
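For completeness, here's roughly what that looks like; the file path below is just a placeholder:

    # Show who last touched each line of a file, and in which commit:
    git blame some/module/source/file.cxx

    # The same, restricted to lines 100 through 140 of the file:
    git blame -L 100,140 some/module/source/file.cxx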

Which leaves the other two issues, which (again) stem from what is fundamental to git: a commit does not send any data to the remote repository. Thus we get to the title of this blog entry: this is a Good Thing™.

Local commits are world-changing in a very small way: they're insanely fast, much faster than Subversion. (For example, committing a one-line change to a text file in a remote Subversion checkout took me 4.775s; a similar change under git took 0.246s -- 19x faster -- and this was a small Subversion module, ~1.5MB, hosted on the ximian.com Subversion repo, which never seems as loaded as the openoffice.org servers.)
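If you want to reproduce the comparison, something along these lines is all it takes; the absolute numbers will obviously depend on your network, the server, and the repository, and the file name is made up:

    # One-line change committed to a remote Subversion repository:
    time svn commit -m "fix typo" README.txt

    # The same change under git -- the commit never touches the network:
    time git commit -m "fix typo" README.txt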

What can you do when your commits are at least 19x faster? You commit more often. You commit when you save your file (or soon thereafter). You commit when your code is 99.995% guaranteed to be WRONG.

Why do this? Because human memory is limited. Most studies show that the average person can remember 7±2 items at a time before they start forgetting things. This matters because a single bug may require changes to multiple different files, and even within a single file your memory will be filled with such issues as what's the scope of this variable?, what's the type of this variable?, what's this method do?, what bug am I trying to fix again?, etc. Human short-term memory is very limited.

So what's the poor developer to do? Most bugs can be partitioned in some way, e.g. into multiple methods or blocks of code, and each such block/sub-problem is solved sequentially -- you pick one sub-problem, solve it, test it (individually if possible), and continue to the next sub-problem. During this process, and when you're finished, you'll review the patch (is it formatted nicely? could this code be cleaned up to be more maintainable?), then finally commit your single patch to the repository. It has to be done this way because if you commit at any earlier point, someone else will get your intermediate (untested) changes, and you'll break THEIR code flow. This is obviously bad.

During this solve+test cycle, I frequently find that I'll make a set of changes to a file, save it, make other changes, undo them, etc. I never close my file, because (and here's the key point) cvs diff shows me too many changes. It'll show me the changes I made yesterday as well as the changes I made 5 minutes ago, and I need to keep those changes separate -- the ones from yesterday (probably) work, the ones from 5 minutes ago (probably) don't, and the only way I can possibly remember which is the set from 5 minutes ago is to hit Undo in my editor and find out. :-)

So git's local commits are truly world-changing for me: I can commit something as soon as I have it working for a (small) test case, at which point I can move on to related code and fix that sub-problem, even (especially) if it's a change in the same file. I need an easy way to keep track of which are the solved problems (the stuff I fixed yesterday) and which is the current problem. I need this primarily because the current problem has filled my 7±2 memory slots, and I'm unable to easily remember what I did yesterday. (I'm only human! And "easily remember" means "takes less than 0.1s to recall." If you need to think, you've already lost.)
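In git terms the day-to-day loop looks something like this; the branch name, file, and commit messages are invented for illustration:

    git checkout -b issue-12345    # keep the bug fix on its own local branch
    $EDITOR sw/inc/doc.hxx         # solve sub-problem #1
    git commit -a -m "fix sub-problem #1 (works for its small test case)"
    $EDITOR sw/inc/doc.hxx         # start on sub-problem #2, in the same file
    git diff                       # shows ONLY the in-progress sub-problem #2 changes
    git log --oneline              # the already-solved sub-problems, newest first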

This is why I think the other two issues -- developers don't see each other's changes instantly, and neither does QA -- are a non-issue. It's a feature.

So let's bring in a well-used analogy to programming: writing a book. You write a paragraph, spell-check it, save your document, go on to another paragraph/chapter, repeat for a bit, then review what was written. At any point in this process you may Undo your changes because you changed your mind. Changes may need to occur across the entire manuscript.

Remote commits are equivalent to sending each saved manuscript to the author's editor. If someone is going to review/use/depend upon your change, you're going to Damn Well make sure that it Works/is correct before you send that change.

Which brings us to the workflow dichotomy between centralized source code managers (cvs, svn) and distributed managers (git et al.). Centralized source managers by design require more developer effort, because the developer needs to manually track all of the individual changes of a larger work/patch before sending it upstream (as described above).

Decentralized source managers instead help the developer with the tedious effort of tracking individual changes, because the developer can commit without those changes being seen/used by anyone else. The accumulated commits are instead sent upstream, via git-push, when the developer is done with the feature.
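In git terms, the remote name and branch below are just the usual defaults:

    # Commit locally as often as you like while the feature is in progress...
    git commit -a -m "solve another sub-problem"

    # ...and only publish the accumulated commits once the feature is done:
    git push origin master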

This is why I prefer git to Subversion. git allows me to work easily within my 7±2 short-term memory limitations, by allowing me to commit "probably working but not fully tested" code so that I don't need to review those changes again at the next diff of the current problem I'm working on.

Posted on 23 Sep 2007 | Path: /development/openoffice.org/ | Permalink

Problems with Traditional Object Oriented Ideas

I've been training with Michael Meeks, and he gave Hubert and me an overview of the history of OpenOffice.org.

One of the more notable comments concerned the binfilter module, which is a stripped-down copy of StarOffice 5.2 (so if you build it you wind up with an ancient version of StarOffice embedded within your current OpenOffice.org build).

Why is an embedded StarOffice required? Because of misinformed "traditional" Object Oriented practice. :-)

Frequently in program design, you'll need to save state to disk and read it back again. Sometimes this needs to be done manually, and sometimes you have a framework to help you (such as .NET Serialization). Normally, you design the individual classes to read/write themselves to external storage. This has lots of nice benefits, such as better encapsulation (the class doesn't need to expose its internals), the serialization logic living in the class itself "where it belongs," etc. It's all good.

Except it isn't. By tying the serialization logic to your internal data structures, you severely reduce your ability to change your internal data structures for optimization, maintenance, etc.

Which is why OpenOffice.org needs to embed StarOffice 5.2: the StarOffice 5.2 file format was a serialization of its internal data structures, but as time went on the developers wanted to change those internal structures for a variety of reasons. The result: they couldn't easily read or write the older storage format without keeping a copy of the version of StarOffice that generated it.

The take-away from this is that if you expect your software to change in any significant way (and why shouldn't you?), then you should aim to keep your internal data structures as far away from your serialization format as possible. This may complicate things, or it may require "duplicating" code (e.g. your real data structure, plus a [Serializable] version of the "same" class -- one with the data members but none of the other logic -- to be used when actually saving your state), but failure to do so may complicate future maintenance.
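A minimal sketch of that "duplication" in .NET terms -- the class names and members here are invented, and the point is only the separation between the working type and the persisted type:

    using System;
    using System.Collections.Generic;

    // The working type: its internals can be reorganized freely between
    // releases, because this class is never written to disk directly.
    class Document {
        private readonly List<string> paragraphs = new List<string>();

        public void Append(string paragraph) { paragraphs.Add(paragraph); }

        // The only coupling to the storage format is this pair of conversions.
        public DocumentData ToData() {
            return new DocumentData { Paragraphs = paragraphs.ToArray() };
        }

        public static Document FromData(DocumentData data) {
            var doc = new Document();
            foreach (var p in data.Paragraphs)
                doc.Append(p);
            return doc;
        }
    }

    // The persisted type: data members only, shaped to match the on-disk
    // format, and kept stable even when Document's internals change.
    [Serializable]
    class DocumentData {
        public string[] Paragraphs = new string[0];
    }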

(Which is why Advanced .NET Remoting suggests thinking about serialization formats before you publish your first version...)

Posted on 28 Aug 2007 | Path: /development/openoffice.org/ | Permalink