In The Toolbox – Software Archaeology

 

“Don’t ever take a fence down until you know why it was put up” – Robert Frost

 

Changing software is hard. It’s especially hard when you’re new to the team or codebase and don’t know the history of the people or the application. Making any change is not just as simple as adding, replacing or removing lines of code. There are also the higher-order aspects to consider, such as whether it fits in stylistically, idiomatically and within the existing design principles. In a really old codebase these can be particularly hard to pin down as they will have undoubtedly changed over time. For example, new code written today under C++ 11 is going to be quite different to code written years ago under C++ 98 and that will need to be factored in when trying to understand what’s going on.

Small-Scale Archaeology

Putting to one side the stylistic and idiomatic aspects which may well be at the whim of the author, the design can be a lot harder to understand. Many years ago I worked on an old C++ codebase for a big finance institution. My role was purely technical and I was initially tasked with fixing various memory leaks and deadlocks within its highly threaded services. Whilst tracking down one leak I came across some code that looked something like this:-

 

ContextPtr getContext(long contextId)
{
    static ContextPtr cachedContext;

 

    if (cachedContext.get() && cachedContext->Id() == contextId)
        return cachedContext;

 

    cachedContext = Context::Load(contextId);

    new ContextPtr(cachedContext);

 

    return cachedContext;

}

 

The spurious extra copy of the smart-pointer was downright weird. I could assume that it was just a cut-and-paste error or dead code, but it seemed too random for that, especially in such a short function. So I decided to consult the version control system to see what it could tell me about its past life.

 

This function appeared to have a somewhat chequered history. Although it had now been stable for a couple of years its birth was a little more tortuous. The author had originally started without the spurious smart-pointer copy. They then started adding random try/catch blocks in different places, presumably because the code was crashing under mysterious circumstances. Finally the try/catch blocks were gone and there appeared to be a couple of attempts to mess with the smart-pointer value and reference count before settling on the final implementation. But what was going on?

 

The missing piece of the puzzle is what the object being pointed to contained – a COM object, and a remote one at that. Putting this knowledge together with the static local variable, which we know will be destroyed after main() exits, and it all starts to make sense. The author was having lifetime issues because COM was uninitialised before the C++ based context object was destroyed. This caused the DCOM remote proxy to throw an exception during invocation of the destructor during process shutdown. This explains why the try/catch blocks were presumably having no effect – they were notionally in the wrong place.

 

Ultimately what the author really needed was a weak-pointer. The full-blown smart-pointer based single item cache was keeping the object alive far longer than necessary. The cached reference was a useful optimisation but the lifetime was deterministically controlled elsewhere. Once I’d worked all this out I switched from the home-brew smart-pointer type to one in Boost and the original problem (the memory leak – a huge one) was fixed.

 

All this was “derived” from the version control history and the author hadn’t written any check-in comments either which might have saved me time. But there was also some other interesting metadata about the check-ins that I found interesting too – each commit appeared to be followed by a formal build (the file with the build number changed) and was hours apart. This suggested to me that the changes were never tested locally first. After all, why bother to check-in the intermediate steps if they didn’t work? This was years before Git was fashionable and there were no automated tests in sight. It appears as if the changes were tested by deploying into the shared development environment and waiting to see what happened.

 

When I came to test my own changes I discovered just how hard it was to test the component in isolation. In particular the large, monolithic service that formed the heart of the system was not amenable to localised testing at all without some serious refactoring. I concluded that whilst the developer was clearly keen to try and resolve the issue the environment and development process did not seem to be supporting him in making that happen easily.

Tribal Knowledge

Sometimes we’re lucky and the person who wrote the code is still around so we can try and shortcut some of the digging around. If they’re still working on the project you have the opportunity to transfer some of the more implicit design knowledge that seeps away over time. More recently I came across a change to a very simple C# extension method I had written for the String class. This was my original version:-

 

public static bool IsEmpty(this string value)

{

    return (value.Length() == 0);

}

 

Which had been changed to this:-

 

public static bool IsEmpty(this string value)

{

    return String.IsNullOrEmpty(value);

}

 

For those non-C# readers there is a subtle difference in behaviour that happens when the value is a null reference. In my original implementation a NullReferenceException will be thrown, whereas in the revised version it will be silently ignored. Looking back in the source code repo I studied the associated changes and realised that this change was made to fix a problem with null string references coming in through the web API framework in certain scenarios. I proposed that the fix was in the wrong place as it changed the semantics of the extension method [1]. After a brief discussion we refactored the code, added a couple of unit tests to document the extension method behaviour and got on with our lives. Sometimes it’s easy to forget how even a simple, 1-line method can hide such detail.

Unearthing Patterns

Having the author around doesn’t always help though. There are programmers you’ll meet which are not quite so amenable to such discussions and will instead get easily offended if they suspect you’re questioning their judgment.

 

As you might deduce from the Robert Frost quote at the very beginning I always approach any codebase with the assumption that the author is a smarter programmer than me and knows more about the problem domain too. At least, until I know for definite that isn’t the case. This means when I see code that isn’t obviously simple [2] I err on the side of caution and assume that I might be about to learn something new. This is almost always true because even if I don’t learn something new about the language, libraries or problem domain I will probably learn something about the author or environment instead.

 

One time I came across the following declaration in a C++ codebase:-

 

std::vector<std::string*> hostnames;

 

My instinct clearly told me this was wrong as RAII almost always rules, but being new to the project and having the author just feet away I politely quizzed them on it. The response was a very confrontational “Are you reviewing my code?” After trying my best to explain my good intentions I escaped with the knowledge I sought – the code was wrong. More importantly I came away knowing a lot more about the author’s character.

 

Once you see some code where a programmer has made what could be considered “a fundamental mistake”, you immediately begin to question where else they may have made it. The version control system can tell you what other commits someone’s made and with a sprinkling of command-line magic it’s often possible to see where they’ve been and what they’ve touched. In this instance I restricted myself to just the most critical areas of the code but it paid dividends as I chalked up a couple more memory leaks.

                                                                                                       

Although somewhat jaded by this experience I decided it was still my duty to point out to an ex-project member, who only worked down the hall, that 19 of the memory leaks I’d just unearthed were due to him forgetting to mark the base class destructor virtual even though they were managed by a smart-pointer. Fortunately he was far more grateful for the feedback.

Blame

One of the most useful tools in modern source control systems is Blame. Whilst its name might suggest it be used for nefarious purposes, such as starting a witch-hunt, it would be far better thought of as “Excavate”. Starting with a revision it will show you who last touched each line of code and what commit it came from. From there you can trace back through the past changes looking at the evolution of the file. Sadly the one thing it doesn’t show is where code has been deleted which can be a useful lead some times.

Revision Graph

The other key tool in the VCS arsenal is the revision graph. This tool shows you the birds-eye view of a file’s history – what commits have taken place, their check-in comments, what branches changes have been made on, how and when they were merged and what labels or tags have been applied.

 

If you have a bug and you want to know what versions of your product the bug appears in this is almost certainly your starting point.

TICOSA

I’ve mostly only been concerned with small-scale software archaeology as it’s usually directly related to a change I’ve needed to make. Occasionally I’ve done some larger scale spelunking, such as rating contributors by their percentage of empty check-in comments, but nothing much more than that.

 

In January of this year I attended The (First) International Conference on Software Archaeology [3] at the Museum of London. There were a number of other ACCU members present and we spent the day finding out a bit more about how software archaeology might be used in other ways. Given its infancy the message was fairly clear that whilst analysing the history of a codebase can be interesting, we’re still not sure exactly what the value is. Nonetheless it was a useful day and hopefully in subsequent years we’ll get to find out how it can be used more effectively.

References

[1] http://chrisoldwood.blogspot.co.uk/2013/09/extension-methods-should-behave-like.html

[2] http://en.wikipedia.org/wiki/Tony_Hoare#Quotations

[3] http://ticosa.org/

 

Chris Oldwood

13 February 2014

 

Bio

Chris is a freelance developer who started out as a bedroom coder in the 80’s writing assembler on 8-bit micros; these days it’s C++ and C#. He also commentates on the Godmanchester duck race and can be contacted via gort@cix.co.uk or @chrisoldwood.