A Case of Mistaken Identity

I recently unearthed a bug in some C# code where, superficially, the cause appeared to be a single character, but on closer inspection it was not entirely clear what the author’s intentions really were. This was down to a number of factors, not least the lack of tests, but it got me thinking about what those intentions might have been and what other practices could have prevented it in the first place.

The Bug

The line of failing code was a simple conditional statement that would induce some error handling logic when true. My suspicion there was a bug were immediately aroused when the tool reported an error code of “none”. If you’ve ever seen a Windows application report an error with the text “The operation completed successfully” then you’ll know what I’m talking about. This was the statement:

if (result != Error.None)
{
  // report error
}

Naturally I switched to the Error class definition to see what it looked like. It was a pretty sparse class with None defined as a simple static property alongside a couple of other static methods:

public enum ErrorCode
{
  None,
  NotFound,
  AccessDenied,
}

public class Error
{
  public ErrorCode Code { get; }
  public string Message { get; }

  public Error(ErrorCode code, string message = "")
  {
    Code = code;
    Message = message;
  }

  public static Error None => new Error(ErrorCode.None);
  public static Error NotFound(string message) => new Error(ErrorCode.NotFound, message);
  public static Error AccessDenied(string message) => new Error(ErrorCode.AccessDenied, message);
}

Initially I didn’t spot the mistake but I realised from the way the conditional statement was behaving that it had to be something to do with the object’s identity. The lack of an Equals() implementation meant it would be using reference based equality, not value semantics, which suggested that the two objects were not exactly the same object in memory.

My gut instinct was that the return site was creating the object directly instead of using the static property and so I was surprised when it was correct, like so:

return Error.None;

Scratching my head I went back to the Error class, studied it more closely, and then I noticed the subtle mistake – the extra “>” in the property definition. This static property, instead of creating a single instance of the value when the class type is initialised, creates one every time it’s invoked! This explains why the reference based comparison fails – it will never be the same object in memory.

New Syntax, New Bugs

Prior to C# 6 the syntax for defining static properties was somewhat more verbose. For example if you wanted to declare a static property backed by a value, you would write:

public static Error None = new Error(ErrorCode.None);

This creates a writable property. If you wanted a read-only static property you could use a custom getter like this:

public static Error None { get { return new Error(ErrorCode.None); } }

With C# 6 however came a simplified syntax especially for read-only properties which either have automatic backing storage or are implemented by an expression body. Hence the former now becomes:

public static Error None { get; } = new Error(ErrorCode.None);

…and the latter can be turned into just this:

public static Error None => new Error(ErrorCode.None);

These two forms together don’t look so similar but if you compare the very first and last side-by-side you’ll see they are incredibly similar:

public static Error None = new Error(ErrorCode.None);

public static Error None => new Error(ErrorCode.None);

The only difference between these two statements in fact is a single character, the “>”, which turns the assignment operator into the lambda arrow operator. Semantically, this changes a writable static property which has a value created at class initialisation, into a read-only static property backed by an expression body that returns a new value every time it’s invoked. As you can see these two are functionally very similar but as we’ve just seen from the bug subtly different depending on the value semantics of the enclosing type.

The Fix, or Is It?

The “obvious” minimal fix from the calling code’s perspective is to remove the extraneous “>” character from the property definition and ensure that there is only one instance of None ever retrieved from it. With only a single instance in play a reference based equality comparison will always succeed. In fact we can go one better and avoid the backdoor that allows None to be replaced by using the readonly keyword to ensure that it cannot ever be changed by a client or the class itself (outside the type constructor):

public readonly static Error None = new Error(ErrorCode.None);

That fixes the actual bug, but it still leaves a number of questions. For example, was the reference based comparison really the intention, and if so, how hard did the author try to avoid letting a caller create a duplicate object like None? As you can see the Error type’s constructor was marked public and so the consuming code could have just as easily written this instead:

return new Error(ErrorCode.None);

Once again the reference based comparison would have failed because the returned object was never the same as the single instance held by the static property.

Another observation is that the underlying error value (from the ErrorCode enumeration) is accessible via a public read-only property:

public ErrorCode Code { get; }

Therefore it’s entirely possible the author may have intended for the conditional statement to be written like this instead:

if (result.Code != ErrorCode.None)

Now we’re doing a comparison of two values from an enumerated type and therefore we get a value based comparison instead which will always succeed irrespective of whether result holds the None singleton or not. The big similarity in the name of the Error class and the name of the underlying ErrorCode enumeration could easily explain how one could have been written instead of the other.

Factory Methods

Perhaps we are on the right lines with assuming that reference based equality was desirable if we bring in an observation around the two factory methods, which also use the new expression body syntax to make them light on ceremony:

  public static Error NotFound(. . .) => new Error(. . .);
  public static Error AccessDenied(. . .) => new Error(. . .);

Maybe there is one other mistake and the Error constructor was never intended to be made public in the first place. Perhaps it should have been marked private so that clients are forced to create an instance of the object thorough one of the factory methods, or use the None value. Hence in the scenario of an error the returning code would therefore write something this:

return Error.NotFound($"{path} not found");

Given the potential for a free format message it’s highly unlikely the calling code would ever attempt to perform an equality comparison on an instance of the Error type, except for a comparison against None. Consequently these two little fixes (one to the None property and the other to the constructor) might be enough for us to declare the problem “properly fixed”.

Value Semantics

Up until this point I have worked on the assumption that the author really did want the comparison using the “==” operator to work correctly and so we’ve found ways to make that happen for the scenario that initially brought it to our attention. However these little tweaks feel somewhat akin to moving the deckchairs around on the Titanic as we’re not really fixing the type itself to perform all comparisons correctly.

The Error type is composed of two values which should both take part in a proper value based comparison. As anyone who attended Steve Love and Roger Orr’s ACCU talk a few years back about implementing equality comparisons will remember, these are non-trivial things to implement in languages like C# and Java. There is a lot of boilerplate code to add (and a GetHashCode() implementation for completeness) which many IDEs will happy produce for you to save typing and ensure all the options are correctly covered, but the meat is essentially just a comparison of the two embedded properties:

public static bool operator==(Error lhs, Error rhs)
{
  return (lhs.Code == rhs.Code) && (lhs.Message == rhs.Message);
}

Now when we invoke our original failing comparison against None in the client’s error handling code we will get a fuller comparison of the object’s embedded members instead of just their references. Additionally because we’re performing a deeper comparison we don’t have to change the None definition because any instance of None will now be equivalent to any other (although creating one every time could be considered wasteful).

This feels like a more complete fix for our equality bug but I’m still left wondering if it’s the right thing to do. Making a type into a value type just because I want an equality comparison to do the right thing in one scenario does not feel overly compelling. When pondering whether to make something a value type or not I have an acid test that asks if I’d ever likely use it as a key in a dictionary or put it into some kind of set.

Once again the free-text error message property suggests to me that this isn’t really a classic value type. Yes, the comparison with None feels more succinct but in any other situation you are likely to ignore the message part and only consider the ErrorCode. This implies to me that the message isn’t really part of the object’s identity per-se but is being bundled along with the code for convenience (as C# does not support multiple return values). This leads me back to believing the real mistake is in the conditional statement in the caller and should have been this one which we covered earlier that compares the Code property directly:

if (result.Code != ErrorCode.None)

Software Archaeology

At this point I’ve pretty much exhausted my analysis based on the current state of the callers and implementation and so I decided to do a spot of software archaeology [2] and trawl the version control logs to see what else it could tell me about the history of this code.

It turns out it has very little to say. The Error class only has one revision which means that whatever the initial implementation was, that sufficed. We can’t tell if the design changed at all in response to its use in the client code but given the nature of the bug I’d posit that any tweaking that did occur during testing would have happened in the caller instead.

Switching to the call site changes we also see only a single check-in. Looking at the diff for the file where the bug occurred we see that there are five modified call sites – three of them directly compare the Code property and the other two use the buggy comparison. What’s also noticeable is that the format of the error handling code is slightly different in the former and latter cases – the code is logically the same but spaced out differently.

If the original author copy-and-pasted the error block I’d expect them all to look the same, which makes me wonder if they were, to start with, but during testing when the silly mistake was fixed the formatting was cleaned up too. This hypothesis gains a little more ground when we consider that the three call sites that work are in code paths which are exercised by a smoke test which is usually run by developers before check-in. In contrast the other two sites are in a code path which is very rarely exercised and has to be done manually, if anyone remembers.

Epilogue

Luckily this piece of code was authored by someone that was still present in the team (although away at the time this showed up hence the somewhat over-engineered analysis), consequently I have been able to find out which of my various hypotheses, if any, were correct.

It turns out that the type was never intended to be used as a value type but just a simple bucket for holding the two bits of error information, i.e. a way to return multiple values in C#. The three call sites that directly compared the Code property with the enumeration value None was the intended pattern of use and was fixed when the smoke test failed. The other two were missed as they didn’t break any existing tests.

Although the smoke test (which is also run automatically on deployment) picked up the bug it also goes to show that it does not help in conveying the author’s intentions. One of the roles of lower level tests (e.g. unit tests) along with verifying our expectations is to document how our code is intended to be used and illustrates the scenarios we’ve considered. In this instance we had to piece that together purely from the way the code is used in production, which was inconsistent. The cost of rewriting would have been minimal this time, but in other cases we’re not so lucky. It also shows that code which is not tested at all automatically will be forgotten about. Once again the broken feature didn’t matter this time but next time it might.

The only other niggle I had was why the two missed code paths didn’t get picked up at commit time. The pre-commit review [3] is an excellent point to jog our memory about such matters, particularly if the change didn’t go smoothly, i.e. we wrote some code and we didn’t get it right first time. To me the difference in code formatting between the working and failing cases was another little sign that something was out of place. However we need to be careful what our tools are showing us because the patch-style diff made this apparent, whereas the full context diff showed that actually the code was in keeping with its surroundings. So the author didn’t really miss anything after all.

Somewhat ironically all the above analysis actually turned out to be a waste of time. A static analysis tool highlighted some dead code which when investigated further revealed that the only property used was the None one, for the happy path. All other code paths resulted in an exception and therefore the Error class was in fact redundant and the intervening methods could all be changed to return nothing (i.e. void). Once done all the error handling blocks where the bug had shown up could be deleted too.

Did I say waste of time? As Ken Thompson once said “One of my most productive days was throwing away 1,000 lines of code”. This was only 50 lines of code but it still feels like time well spent.

References

[1] Some objects are more equal than others, Roger Orr & Steve Love, ACCU 2011 Conference,
https://accu.org/content/conf2011/Steve-Love-Roger-Orr-equals.pdf
[2] In The Toolbox - Software Archaeology, C Vu 26-1,
http://www.chrisoldwood.com/articles/in-the-toolbox-software-archaeology.html
[3] In The Toolbox – Commit Checklist, C Vu 28-5,
https://accu.org/index.php/journals/2306

Chris Oldwood
15 December 2016

Biography

Chris is a freelance programmer who started out as a bedroom coder in the 80’s writing assembler on 8-bit micros. These days it's enterprise grade technology in plush corporate offices. He also commentates on the Godmanchester duck race and can be easily distracted via gort@cix.co.uk or @chrisoldwood.