Once Iíve finished the inner development cycle for a new feature or bug fix Ė write tests, code, build and run Ė itís time to think about committing those changes and therefore publishing them to a wider audience, i.e. my team. With the fun part out of the way itís quite easy to rush this last stage and therefore publish a change that falls short in some way, either from a short or longer term perspective. At its most disruptive it could be the code which is broken through lack of proper integration or a partial (or overzealous) change-set that excludes (or unexpectedly includes) files or code that was never intended to be published. On the flip-side the actual code might be good but the cohesiveness or surrounding documentation may be lacking therefore making future software archaeology  harder than it need be.
Although the mental checklist I follow when committing any change is pretty much the same there are some significant differences that have emerged due to the version control product I might be using. For the purposes of this article Iíve tried to talk about ďpublishingĒ a change when I mean the act of making it visible to the outside world. In a classic centralised VCS like Subversion the act of committing and publishing are one and the same (private branches notwithstanding), whereas for a distributed VCS like Git they are distinct steps. The latter (DVCS) brings a number of affordances over the former (CVCS) around committing that are particularly useful at commit time.
Hence the general order that these steps are done are somewhat dependent on the VCS tool in use. For example with a CVCS like Subversion where you have (had) no ability to commit to an integration branch locally first you need to consider how youíll address the implicit merge that will occur as part of the integration step. In contrast with Git you can focus on the steps one at a time Ė get your commit sorted, then integrate otherís work, and finally tidy up before publishing.
In this article Iíve ordered the steps more like the CVCS world (review, integrate, then commit & publish) rather than the DVCS style (review, commit, integrate, then publish). Either way the checklist Iíll describe is essentially portable, itís just a little easier when you have more time between getting your house in order and showing it to the world at large.
The first thing to do after hacking around the codebase trying to get the bug fixed or the feature added is to ensure the set of changes that will comprise the eventual commit is ďcleanĒ. This means that what we eventually commit is what we intended to commit Ė no unexpected side-effects caused by transient workarounds, accidents or interference from the IDE and tools.
As a starter for ten the code needs to compile and the tests need to pass. Itís all too easy to make localised changes and run just the handful of tests that you believe is required to verify the change is working. Before committing we need to take a step back as we switch mind-set from creator to publisher so, after a deep breath, I need to run the build script which should clean the workspace, compile the code and run all the necessary tests to give me the confidence I need to promote the change. Itís important that the build script I use is the same one that my team-mates and the build server will use so that we all agree on what process is used to create the final deliverable.
I make an extra special effort to deep cleanse my workspace before doing the final build because too often in the past Iíve been caught out by stray artefacts leftover from development that have somehow managed to taint the change and make it appear to work. The CI server should be working from a fresh workspace and therefore I try and do the same, albeit by cleaning up rather than creating a fresh one. With Git itís as easy as ďgit clean -fdxĒ, but with other VCS tools I manually write a script  to do it.
With a (hopefully) minimal set of edited files and folders in my working copy the next step is to verify that only the code I intended to change exists there. During development or manual testing I may have commented out some code, tweaked a default value or adjusted the configuration to get things into a known state for such testing. Or perhaps a tool did something I didnít realise and I need to back that out because it happened by accident.
Alternatively I may have added a new setting that I defaulted for the tests but which needs applying to other environments too. With the rise of automated refactoring tools and smarter IDEs many of the changes we make might now be done for us. However a tool like this will only go so far and try to keep the code intact after each small edit, but itís common for a single small refactoring, such as the deletion of code, to unearth further similar refactorings. Static code analysis is usually limited to advising on the current snapshot instead of chasing the turtles all the way down, so we still have to do that bit manually.
In essence the review step may need to be done a number of times as we identify a problem with the change, make a correction, potentially re-build and re-run the tests, and then review the change once more in the context of the revised change-set.
My reviewing process probably follows that of most other people Ė initially I look at the set of files that have been added, modified or deleted. Depending on the type of change it might be mostly source code or project files / package configuration. Whilst scanning the list of files Iím looking for anything thatís obviously out of place, such as a project file change when I havenít added a new class, or a source file in a namespace that I wouldnít expect to have changed that might indicate some unintended coupling. Design smells that come out of this structural review usually get logged in my notebook  for later consideration (the discipline of staying out of the bigger rabbit holes is a tricky one to master).
The converse is also true Ė there may be files Iíd expect to change if I changed at least one of them. For example adding a new setting to the DEV configuration probably means I need to do the same for other environments. Another is the updating of project dependencies as the version number usually causes a ripple across a couple of files such as the package configuration and project file. The same occurs at a higher level too Ė if a production code project file changes the associated test ones should too.
One mistake thatís less of an issue with modern VCS tools is accidentally checking in temporary files because now they tend to support an exclusion list (e.g. .gitignore). Before this feature you needed to be careful you didnít accidentally check-in some binary artefacts or per-user tool configuration files as it could cause very odd behaviour for your team mates. My clean script would ensure they were deleted before the review thereby side-stepping the issue, but tool upgrades and refactoring meant that the script still needed revisiting now-and-then and so Iíd still be on the lookout at this stage for files which have escaped the ignore list.
Naturally the part we are probably most interested in is reviewing the actual code. Firstly we need to ensure that what we think we changed was what we actually changed and that are no hangovers from spikes, testing, etc. I may have also commented out some code with the intention of deleting it before publication and so I need to make sure I actually did this as checking in commented out code is almost always a mistake.
In the pre-commit review Iím not looking at the code changes from the perspective of ďdoes it work, does the design make sense and does it meet the guidelinesĒ, Iím looking at it to see if the diff makes sense to the future reader. When the software archaeologist is looking back over the revision history they are often looking at the evolution of change rather than each snapshot in isolation. This is why you often hear the advice about not checking in functional and formatting style changes in the same commit Ė it makes it hard to work out what the former is if itís buried in the latter. That doesnít mean I donít care about the more traditional style of code review, itís just that I would have done all that before I even get to the commit stage.
I tend to look at diffs in two different ways depending on the change and how much I want to see of the surrounding context. The classic patch format where you only get to see the actual change and the odd line of context either side is great for simple changes where I just expect some in-place editing to have occurred. In large files this avoids all the scrolling around or navigating you might need to do as it usually fits on a single screen. Sadly many diff tools show the patch format highlighted on a per-line basis and so simple character changes can be harder to spot, but they are great for quickly eyeballing change sets you do frequently, such as updating 3rd party package versions.
The other diff view I use is the more fully featured 2-way (and 3-way for a merge) diff tool that shows the entire file with highlighting around the changes, e.g. KDiff3. I find these tend to make small intra-line changes and block movements a little more visible. When the change is a refactoring Iím often more interested in seeing all the surrounding code because if a tool is involved it will limit its scope and seeing the change in-situ often points out new inconsistencies that have resulted from the surgical precision of the tool.
Whilst Iím happy with the command line for many VCS tasks diffíing is one place where the GUI still does it for me. This includes both the structural and code reviews. Iíve found tools like TortoiseSvn (Git) or Git Extensions allow me to quickly execute different actions to clean up a change set. For example I might need to revert a few mistakes, scan the commit log of others, the odd blame here-and-there, and finally diff the lot.
Yes, you read that right, I always diff every change I commit. If that commit contains a lot of edits then itís a lot of files to diff, but I decided Iím not going to compromise on this principle because it takes time. In fact the more files in the change-set the more inclined I feel to review it thoroughly exactly because there is a greater chance of something untoward creeping in. Instead what itís taught me is that itís preferable to work on smaller units of change (which fits the modern development process nicely), but also to find an easier way to quickly review changes. The patch-style diff format presented as a single unified diff pays off handsomely here for boiler-plate changes as you can very quickly eyeball the entire set and commit it if turns out to be clean.
Even when you are doing continuous integration where you are committing incremental changes every few minutes or hours, itís still possible that a part of your overall change ends up fixing something tangential to the task at hand. I often try and use different tools or techniques when I can to exercise the codebase and tools in different ways to find those less obvious non-functional bugs. For example running a build or test script via a relative path often shows up poor assumptions about the relative path to any dependent scripts or tools. Whilst not essential to the product they are jarring and interrupt your flow.
When these kinds of minor bugs show up I like to fix them right away as long as they really are trivial, this means that the entire change set will probably include two different logical changes Ė one for the task Iím doing and the other for the bug I fixed whilst developing it. Logically they are distinct changes and therefore Iíd look to commit them separately. However this extends further than per-file changes, it may be two independent changes within the same source file. Historically youíd either have to manually split these commits out or commit them together and make it clear via the commit message that there were two different motives for the change. Luckily modern tools like Git make it possible to easily to split a single bunch of edits to one workspace into separate coherent commits.
Although cherry picking changes across branches is generally frowned upon (mind you, so is branching itself these days), this has often been down to commits being overly large and failing to adhere to the Separation of Concerns. By ensuring unrelated changes are kept committed separately you make it much easier to surgically insert a fix from one branch into another should the need ever arise.
Once Iím happy that the changes Iíve made locally complete the task at hand I then need to see whatís changed in the world around me. Working directly on the main integration branch  in small increments ensures that whatever has changed is likely to be quite small and therefore the chance for conflict is minimal Ė any unexpected merge should virtually always be trivial and handled automatically by the VCS tool.
Itís imperative that I pull the latest upstream changes locally first so that I can resolve any syntactic and semantic conflicts before pushing back. The former are fairly easy to spot when working with compiled languages whereas the latter are somewhat trickier and rely on failing tests for the early warning siren. Whilst I could just push my changes and let a gated build inform me of any problems I prefer to avoid context switching and so proactively integrate so that when I eventually manage to publish Iím pretty certain it wonít come back to me for remediation.
Integration is one area where the centralised and distributed VCS products differ greatly. When using a centralised VCS there was always the risk that pulling the latest changes to your workspace would invoke a complex merge which, if messed up, could cause you to corrupt your own changes. This reason alone convinced me to go back to single branch development and keep my commits small and focused. On the few occasions where the prospect of losing my work could have been expensive I would use the ďprivate branch on demandĒ feature (e.g. branch from working copy in Subversion) to create a backup in the VCS. Then Iíd update to the head and merge my private branch back in, safe in the knowledge that I could repeat it as many times as needed to resolve any conflicts. In the distributed world this kind of behaviour is inherent by the way the tool works (you commit to your private fork first) and so itís safer by default.
This in turn means you have more flexibility about how you handle any merge conflicts directly on an integration branch. In the DVCS world you can pull the latest changes and do the bare minimum to commit your changes on top of it. Then you can resolve any syntactic and semantic problems in your own time and commit them separately as fixes to the merge. At this point you then have the choice of whether to leave history intact and publish separate commits (merge + fixes) or squash them into a single change and publish it as if nothing went wrong. For me the decision of which approach to take depends heavily on whether the break was significant or not and therefore may have significance in the future too.
For example if someone just renamed a method in a refactoring then I squash as itís simply a matter of poor timing, whereas if a design change has caused the rework it means my change was based on a different view of the world and so may hide further semantic problems. Ideally every commit to an integration branch should stand on its own two feet (i.e. the build and tests pass) but when you can publish the fix at the same time as the break the observable outcome is essentially the same as so it feels acceptable to break the rules in the rare case that it happens.
In the CVCS world you do not have this luxury as what you publish to an integration branch is immediately visible, hence I might do a ďwhat if?Ē merge first to test the water and if it looks dicey spin up a private branch for safety. This way I can guarantee that Iíll only be publishing a consistent change and the private branch contains the gory details for posterity.
As I mentioned earlier I like to keep my workspace as clean as possible and therefore before integrating Iíll definitely do a ďdeep cleanĒ to ensure that the subsequent build that follows the integration is as close to the build machineís as possible. This means that Iíll ensure all transient files in the workspace are removed, e.g. cached NuGet and Node packages, as I want the best chance possible that any problem is strictly down to the integration itself and not some stray environmental issue caused by detritus.
In essence in a Git based project this boils down to the following one-liner:
> call Clean --all && git pull --rebase && call Build
With the set of changes reviewed and integrated the last big hurdle is to document them. When committing I need to provide a message that tells my future readers something about the change that will enlighten them to its purpose. The same is true of any comment, the writer needs to avoid saying what the code (and diff) already says and to instead focus on the rationale Ė the why.
Itís not quite as black & white as that though because trawling through the revisions trying to find related changes is much better with some tools than others. For example the ďannotateĒ feature (or in its more witch-hunt friendly guise Ė blame) makes it easier to walk the history of a single file using the code itself for navigation. However if youíre using a lesser tool or doing some archaeology around a wider change then the commit message can become more significant in tracking down ďan interesting set of changesĒ.
Hence it helps if the message is broken into (at least) two sections Ė a short summary and then a more detailed explanation. The summary should be short, ideally fitting a single line of text, so that when you are scanning the list of recent commits that one line tells you in an instant what itís about. Curiously there appears to be some disagreement about which tense you should write your message in but Iíve always found the past tense perfectly adequate. Fortunately this two-part format has been brought to prominence by the likes of Git and so itís becoming the norm.
The start of any commit message I write nearly always contains some kind of reference to the feature Iím working on. For example if the user stories are being tracked in an enterprise tool like JIRA then Iíll put the ticket number in, e.g. ď#PRJ-1234Ē. Once again somewhat fortuitously this practice has been adopted by other tools like GitHub as a way to link an issue with a commit. Linking the commit in this way means that I probably have less to document because much of commentary and rationale will already be in another tool.
Even if the project is not using a formal feature tracking tool, e.g. just a Trello board with simple task descriptions, Iíll still use the card number as a message prefix . This is because committing a feature in small chunks makes the set of related commits harder to find. Sometimes the team will review a change after itís already been committed and so the reference number makes it much easier to find all the related changes for the feature. If a change doesnít have a reference then it should be something so trivial as to be easily discernible from the one-line summary alone.
Naturally the body of the commit message will contain further details about why the change was made. If itís an informal bug fix it might contain the basic problem, if a new feature then a bit about what support was added, and if itís tidying up then why the code is now redundant. Sometimes Iíll include other little notes, such as if the commit isnít standalone for some reason. Whilst itís great if you can pick any arbitrary revision and expect it to be self-consistent there are occasionally times when we have to check-in partial changes or changes that may not strictly work due to some environmental concerns. In these cases the extra note helps the reader know that there are dragons lurking and therefore they should look further afield to find where they were eventually slayed.
Once again I personally find a UI advantageous for this part of the task, not least because it normally contains a spell checker to ensure that the basic text has no obvious typos. Whilst itís not an everyday occurrence searching the commit messages for a particular word, or phrase, it is useful and so it helps enormously if you do not have to compensate for spelling mistakes. A modern commit dialog also tends to be aware of the folders, files and even the source code that has changed and so can provide auto-completion of terms that might be relevant, such as the name of a method that was edited.
One thing I do miss about working with a CVCS is that you can often still edit a commit message after the fact. With a DVCS the message is part of the immutable chain of revisions and therefore attempting to fix it leads you into the kind of murky waters that you really want to avoid. If the changes only exist as local commits you still have time to make the correction, but once itís published you pretty much just have to live with it.
Phew! After all those considerations itís finally time to unleash my handiwork on the world at large. If Iím on Git then Iíll do the final push, whereas with Subversion or TFS it will have happened when I hit ďOKĒ on the commit dialog. Either way the change is done, so barring any unforeseen weirdness shown up by the build pipeline, itís time pick another task and go round the development loop once more.
 In The Toolbox - Software Archaeology, C Vu 26-1,
 Cleaning the Workspace, Chris Oldwood,
 In The Toolbox - Pen & Paper, C Vu 25-4,
 Branching Strategies, Chris Oldwood, Overload 121,
 In The Toolbox - Feature Tracking, C Vu 26-3,
27 October 2016
Chris is a freelance programmer who started out as a bedroom coder in the 80ís writing assembler on 8-bit micros. These days it's enterprise grade technology in plush corporate offices. He also commentates on the Godmanchester duck race and can be easily distracted via email@example.com or @chrisoldwood.