Monitoring: Turning Noise into Signal

Introduction

The humble text-based log file has been with us for a very long time and it often still suffices for simple tasks that require little genuine support. But over the years systems have grown much bigger as the need to scale out has compensated for our inability to continually scale up. Consequently the volume of diagnostic data has grown too, as we struggle to make sense of what’s going on inside our “black boxes”.

Moving a system from the relative comfort of an internal data centre out onto the cloud also brings with it a host of new experiences which add to the general ambient noise as network routes go up and down and servers come and go on a regular basis, and I’m not talking about continuous deployment here either. Okay, it’s not quite that unstable at the smaller end of the spectrum, but there are many more reasons why your service does not appear to be behaving itself over and above the mistakes that we might make ourselves.

With that growing volume of diagnostic data we need to remain smart about how we handle it if we are to understand it. What might initially seem like just a bunch of random failures may in fact be something much easier to explain if only we chose to drill into it and categorise it further. We refactor our functional code to help us make more sense of the design and the same is true for our diagnostic code too – we need to continually refine it as we learn more about the problems we encounter so that they are easier to monitor and triage in future. Ultimately you do not want to be woken in the middle of the night because you cannot distinguish between a failure on your side of the fence and a known limitation of an upstream system which you cannot control.

Hence this article is about the journey you might take after starting with a big ball of monitoring data so that you can start to make sense of it, both inside and outside your team, by improving the code so that what was once potentially just noise can be turned into more useful signals.

Structured Logging

The most basic form of log file is just a simple collection of lines of freeform text. These generally form a narrative which tries to tell a story about what’s going on inside, e.g.

2018-02-16 08:14:01.100 9876 INF Fetching orders
2018-02-16 08:14:01.123 9876 ERR Failed to fetch orders for '1234'
2018-02-16 08:14:01.145 4225 PRF Orders request took 10 ms

With each message being a blank canvas it often falls to the individual author to tell their part of the story in their own way and that naturally leads to differences in style, just as we might find stylistic differences in the code as each author stamps their own identity on it through brace placement and vocabulary. Consequently we find that different delimiters are used to surround text parameters (or maybe none at all), dates might be formatted locally or in UTC, primitive values output with different units, and spelling & grammar of variable quality too.

This melting pot of text might have some “character” but it’s a royal pain to parse automatically and make sense of at any non-trivial scale. Once you start needing to aggregate and slice-and-dice this data so it can be used to provide overviews of the system health and generate alerts when the proverbial muck hits the fan, then freeform text no longer cuts it – a clear, consistent format needs to be adopted which is machine readable and aggregatable. The alternative is a life spent trying to concoct ever more complex (and unmaintainable) regular expressions to match the dizzying array of similar messages; or not, which is the usual outcome. Machine learning may provide some hope in the future for an otherwise lost cause but until that day comes we need a simpler solution.

Enter stage left – structured logging. Instead of trying to use our natural language, such as English prose, to describe our situation we start to think about it in less emotional terms. Log messages stop being a flat representation of what has passed and instead become richer “diagnostic” events which can be reacted to programmatically, both within the process as well as outside it [1]. In its simplest guise instead of writing the traditional message “File ‘Xxx’ loaded in 10 ms” you instead log, say, a FileLoaded diagnostic event with a property DurationInMs set to 10 and a Path property set to the file name, e.g.

log.Info($"File '{filename}' loaded in {duration} ms");

now becomes,

log.Write(new FileLoaded{ DurationInMs=duration, Path=filename });

As you can imagine all the usual programming tricks are still available to hide some of the complexity and keep the logging noise down in the code to ensure the functional aspects still shine through, e.g. free functions or extension methods for the most common scenarios:

log.FileLoaded(filename, duration);
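
As a rough sketch, and assuming a simple journal façade with a single Write method (the IJournal interface and FileLoaded type below are illustrative rather than taken from any particular library), such an extension method might look like this:

// A hypothetical event type matching the earlier FileLoaded example.
public class FileLoaded
{
  public string Path { get; set; }
  public int DurationInMs { get; set; }
}

// A hypothetical journal façade with a single Write method for diagnostic events.
public interface IJournal
{
  void Write(object diagnosticEvent);
}

// Extension methods keep the call sites terse for the most common events.
public static class JournalExtensions
{
  public static void FileLoaded(this IJournal journal, string filename, int durationInMs)
  {
    journal.Write(new FileLoaded { Path = filename, DurationInMs = durationInMs });
  }
}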

In C# you can leverage the “using” mechanism to wrap up the timing and event writing with even less noise thereby giving you some really simple instrumentation [2]:

using(journal.FileLoad(filename))
{
  . . .
}

(To try and make it clearer that it’s not a traditional log file I like to rename the eventing façade something like “journal” and use terms like “record” to get away from past traditions.)
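One way to sketch the mechanics behind that using block, again assuming the illustrative IJournal façade and FileLoaded extension method from above, is to return a small disposable scope which starts a stopwatch and records the event when it is disposed:

// A sketch of the using-based instrumentation: FileLoad() returns a scope
// which times the operation and records a FileLoaded event on Dispose.
// (Stopwatch lives in System.Diagnostics.)
public static class JournalTimingExtensions
{
  public static IDisposable FileLoad(this IJournal journal, string filename)
  {
    return new TimedScope(journal, filename);
  }

  private sealed class TimedScope : IDisposable
  {
    private readonly IJournal _journal;
    private readonly string _filename;
    private readonly Stopwatch _stopwatch = Stopwatch.StartNew();

    public TimedScope(IJournal journal, string filename)
    {
      _journal = journal;
      _filename = filename;
    }

    public void Dispose()
    {
      _stopwatch.Stop();
      _journal.FileLoaded(_filename, (int)_stopwatch.ElapsedMilliseconds);
    }
  }
}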

Richer Diagnostic Events

This change in mind-set opens up a new world of possibilities about how you write your monitoring and instrumentation code, and how and where you aggregate it. For example although modern logging search services are fairly powerful, even if you have used correlation IDs [3] to relate your events within a process, finding and summing them can still be arduous, especially if the entire IT department is sending all their logs to, say, the same Splunk instance. Tools like Splunk allow you to “join” log event streams (much like a SQL join) but they have limits to keep processing costs down which often means your joins are truncated unless your queries are highly selective. Consequently you may choose to perform some local aggregation in-process and send out “summary” events too to ease the burden on the monitoring side. For example a REST API might generate a summary event that includes the overall timing for handling a request along with a breakdown of the time spent in database calls, outbound to other external services, heavy computations, etc.
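
As a sketch, such a summary event might look something like the following; the component name and breakdown properties are illustrative choices rather than a prescribed format:

// A hypothetical per-request summary event, aggregated in-process and written
// once per request to ease the burden on the monitoring side.
public class RequestSummary
{
  public string Name { get; set; }                  // e.g. "orders.get-order.summary"
  public int TotalDurationInMs { get; set; }        // end-to-end request handling time
  public int DatabaseDurationInMs { get; set; }     // time spent in database calls
  public int ExternalDurationInMs { get; set; }     // time spent calling other services
  public int ComputationDurationInMs { get; set; }  // time spent in heavy computations
}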

With all diagnostic events being richly typed (if you choose to go that far) and flowing through a single stream it’s easy to build in-process feedback loops to monitor and react to local conditions as well. For example the Circuit Breaker pattern which Michael Nygard introduced in his seminal book [4] is a technique for dealing with temporary failures by shutting off the valve when things go awry. One way of implementing this is to continuously analyse the diagnostic stream and, if you see a continuous batch of failure events, flick the switch and turn yourself off for a bit. This can be combined with another classic design pattern – the Leaky Bucket Counter [5] – to slowly turn the tap back on. This elevates the diagnostic event stream from a purely non-functional concern to a more functional one, which brings with it greater responsibility. (Logging is often treated as a second class citizen due to its non-functional nature but reliable systems elevate non-functional concerns to ensure they get the attention they truly deserve.)
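
A minimal sketch of that feedback loop might look like the following, where a leaky bucket counter fed by failure events trips the breaker when failures arrive faster than the bucket drains; the event naming convention and thresholds here are purely illustrative:

// A sketch of an in-process feedback loop driven by the diagnostic stream.
public class CircuitBreaker
{
  private readonly int _capacity;            // failures tolerated before tripping
  private readonly TimeSpan _leakInterval;   // how quickly the bucket drains
  private int _level;                        // current level of the leaky bucket
  private DateTime _lastLeak = DateTime.UtcNow;

  public bool IsOpen { get; private set; }   // when open, stop calling downstream

  public CircuitBreaker(int capacity, TimeSpan leakInterval)
  {
    _capacity = capacity;
    _leakInterval = leakInterval;
  }

  // Invoked by the diagnostic pipeline for every event it sees.
  public void OnEvent(string eventName)
  {
    Leak();

    if (eventName.EndsWith(".failure.count"))
    {
      ++_level;
      if (_level >= _capacity)
        IsOpen = true;                       // flick the switch
    }
  }

  private void Leak()
  {
    var now = DateTime.UtcNow;

    while (_level > 0 && now - _lastLeak >= _leakInterval)
    {
      --_level;                              // the bucket slowly drains...
      _lastLeak += _leakInterval;
    }

    if (_level == 0)
    {
      IsOpen = false;                        // ...slowly turning the tap back on
      _lastLeak = now;
    }
  }
}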

Event Structure

Although you can use much richer types to represent each event you need to bear in mind that the goal is to try and aggregate similar kinds of events across multiple processes to get a more holistic feel for what is happening. Consequently there is still a trade-off between being too loose or too rich and therefore being no better off than we were before. You need just enough structure to ensure a good degree of consistency and to make it easy to add similar classes of events while also ensuring your events are easy to consume in the monitoring system. At its most basic an event might simply be a name and an optional value, e.g.

class Event
{
  public string Name { get; }
  public double? Value { get; }
}

Naming in software development is always a difficult problem and once again it rears its ugly head here too. You can slice and dice the problem of naming events and their properties however you like, but if you want them to be easy to consume in the monitoring tool then you’ll want to adopt a scheme that works well with the query facilities on offer. For example, using a namespace style approach which reads narrower from left-to-right, just like code, is a common option, such as “component.operation.duration” for the time taken to perform a specific operation. This could then be translated into a monitoring query like “database.*.duration” to get an overview of the performance of the database. Once you can aggregate key components over a time window you have the opportunity to automatically measure trends in performance, a critical but commonly overlooked source of insight into system behaviour. (For a short tale about how performance can slowly decrease over time and then suddenly drop off a cliff see [6].)

What makes aggregation simpler is to adopt a convention for things which should be aggregated, so for example append “.duration” for timings and, say, “.count” for occurrences of interesting things. This helps reduce (though not eliminate) the chances of the classic mistake of treating “1” as “1 ms” instead of “1 failure” which might make your performance look better or worse than expected. If you’re going to be using different units for the same dimension, such as timings, you’ll either need to include that as a property or make it clear in the name so that you can correctly sum them, e.g. “.duration-ms” vs “.duration-ns”. With the events now being rendered automatically by code instead of manually formatted by hand you stand a better chance of rescuing the situation if you decide that your precision is too low for one particular kind of event. The Prometheus documentation is a good source of inspiration for categorising and naming events [7].
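
One low-tech way to stop the suffixes and units drifting apart across a codebase is to centralise the convention in a tiny helper; this is just a sketch of the idea rather than a prescribed scheme:

// A sketch of centralising the naming convention so the suffixes and units
// cannot drift between events written by different authors.
public static class MetricName
{
  public static string DurationInMs(string component, string operation)
  {
    return $"{component}.{operation}.duration-ms";
  }

  public static string Count(string component, string operation)
  {
    return $"{component}.{operation}.count";
  }
}

// e.g. MetricName.DurationInMs("database", "find-customer")
//      yields "database.find-customer.duration-ms"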

If you’re wondering about the basic logging attributes you’ve grown to know and love, such as the timestamp and process ID, well you can use a base class to capture those, e.g.

class EventBase
{
  public DateTime Timestamp { get; } = DateTime.UtcNow;
  public string Name { get; }
  public int ProcessId { get; } = Process.GetCurrentProcess().Id;  // System.Diagnostics

  protected EventBase(string name)
  {
    Name = name;
  }
}

class DatabaseQueried : EventBase
{
  public int Duration { get; }

  public DatabaseQueried(string operation, int duration)
    : base($"database.{operation}.duration")
  {
    Duration = duration;
  }
}

// Client code.
journal.DatabaseQueried("find-customer", duration);

However, before you go sticking all your favourite extra fields in every event, just ask yourself whether you really need them every time, or whether that’s only a choice you’ve made because of the inherent limitations of the traditional method of logging. A fixed width layout was always preferable for a classic log file; maybe it’s not so important now?

Human Readability

One immediate concern I had when first adopting this approach was that I was so used to reading classic log file narratives that I felt structured events would be much harder to consume at the development and debugging level, as opposed to for production monitoring. For example I’ve become used to doing some basic system level testing after making certain kinds of changes to ensure that the non-functional aspects are well represented, i.e. is any new instrumentation as useful as it could be? Given the usual modes of failure, have we recorded what we would generally expect if there is a failure in the new area of functionality? If it involves any I/O, have we timed it so that we can detect problems and visualise trends? This has usually involved reading the event stream manually and playing with classic command line and log viewing tools.

Hence, one question you may have had right from the get-go is what the event stream looks like outside the process. It depends on the logging infrastructure and destination sinks, but the simplest choice might be to just render the events as JSON: it has a fairly simple format (less of an “angle-bracket” tax than XML) and JSON writers are often performant enough for most logging uses. So instead of lines of freeform text you might have lines such as these:

{ "Time":"2018-02-16T08:14:01.123", "Name":"orders.http-error.count", ... }
{ "Time":"2018-02-16T08:14:01.123", "Name":"orders.request.duration-ms", ... }

An event stream dumped as raw JSON is clearly not quite as easy to read as plain text. At least, it’s perhaps not as easy to read at first, but you soon realise that what really matters is whether you can easily relate the event that you see to where it occurs in the code and what it’s telling you about the operation that happened. Surprisingly I got used to this style very quickly; I almost began to feel like Cypher in The Matrix – I stopped seeing the curly braces and quote marks and just saw “HTTP client error”, “token refreshed”, “database queried”, etc. Your mileage will of course vary considerably, but just give it a little time. Apparently there’s more wisdom in the old adage “omit needless words” than I had quite appreciated.

The other good thing about a decent JSON serializer is that you can usually control the order of serialization so you can ensure that the most important attributes get written first (e.g. Timestamp and Name) and lesser ones later which will reduce the burden somewhat when eyeballing the log file.
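
With Json.NET, for instance, you might annotate the common base class so the headline attributes always appear first. This is only a sketch, and the exact ordering semantics (particularly when mixing ordered and unordered properties) are serializer specific, so check your own library’s behaviour:

// A sketch using Json.NET's JsonProperty(Order) to push the headline
// attributes to the front of every rendered event. (Requires Newtonsoft.Json.)
class EventBase
{
  [JsonProperty(Order = -3)]
  public DateTime Timestamp { get; } = DateTime.UtcNow;

  [JsonProperty(Order = -2)]
  public string Name { get; }

  [JsonProperty(Order = -1)]
  public int ProcessId { get; } = Process.GetCurrentProcess().Id;

  protected EventBase(string name)
  {
    Name = name;
  }
}

// Each event is then rendered as a single line in the sink, e.g.
//   var line = JsonConvert.SerializeObject(theEvent, Formatting.None);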

The Audience

Another benefit which a more event-focused approach gives you is a little more discipline around sticking to true operational events rather than mixing in “debug” or “trace” level messages, which are often of little concern to first level support but which might be useful to a developer doing a full-on post mortem later. This duality of narratives (monitoring vs debugging) is oft debated; what I’ve personally come to favour is slightly richer events where the “headline” is aimed at operational use and designed to be aggregated as before, but attached to it might be a “Debug” property with additional context. For example an “http.request-failure.count” event might have the response status code as a core property, but the response headers and first N bytes of the payload attached to separate “Debug.Headers” and “Debug.Content” properties should the need arise.
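
A sketch of what such an event might look like, with the property names and the payload cap being illustrative choices:

// The headline (Name, StatusCode) is designed to be aggregated; the Debug
// payload carries extra context for a developer doing a post mortem later.
public class HttpRequestFailed : EventBase
{
  public int StatusCode { get; }
  public DebugInfo Debug { get; }

  public HttpRequestFailed(int statusCode, DebugInfo debug)
    : base("http.request-failure.count")
  {
    StatusCode = statusCode;
    Debug = debug;
  }
}

public class DebugInfo
{
  public IDictionary<string, string> Headers { get; set; }
  public string Content { get; set; }   // e.g. the first 512 bytes of the payload
}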

This brings me back to my earlier comment about attaching too much data to every event. Whilst process IDs might be useful to the ops team, thread IDs are probably less so, and therefore consider where and when you really need them. As for deciding in the code what severity an event should be, that probably disappears too as the interpretation shifts from being considered in isolation to being considered at scale, probably in the context of high availability where more transient failures are tolerated. That isn’t to say that failing to read a fundamental configuration setting like a database connection string is not pretty serious, but just that failure without recovery is what we’re looking for in the short term, not necessarily failure in its own right which is more about trends - let the event state the facts and allow the monitoring to decide if it’s noteworthy or not.

Code Complexity

There is one other reason I have seen for overly verbose trace-level logging and that is due to overly complex code. When the code is difficult to reason about, and a problem shows up which is difficult to reproduce, it is natural to resort to “printf” style debugging by adding extra logging to show which code paths are taken and why. Simple code that is well written and easy to reason about is also easier to diagnose without resorting to such heavy handed tactics.

Consequently when writing code we should ask ourselves what we would need to know to be able to determine which significant paths might be taken. If the design does not lend itself to being comprehensible from certain key inputs then we should ask ourselves whether it is too complex and needs to be refactored. Other techniques like Design by Contract may add a little more code here and there, but they allow us to fail earlier, when the problem is easier to diagnose, and can therefore save us from logging excessively to track the same problem down later when it finally manifests itself, e.g. as a latent NullReferenceException.
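
For instance, a precondition check at the boundary fails fast and close to the cause, instead of letting a bad argument surface later as a NullReferenceException deep inside the processing; the Order type here is a hypothetical domain class:

// Fail early, where the cause is obvious, rather than logging every step in
// the hope of catching the eventual NullReferenceException in the act.
public void ProcessOrder(Order order)
{
  if (order == null)
    throw new ArgumentNullException(nameof(order));

  if (order.Lines.Count == 0)
    throw new ArgumentException("An order must contain at least one line.", nameof(order));

  // . . . the interesting work, now free of defensive trace logging . . .
}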

Improving Events

One half of the equation is about having the means in play to represent diagnostic events in a way that can easily be queried and aggregated, the other is about improving the quality of events as you learn more about the system’s failure modes. This was rammed home early one Saturday morning when I got a support call about a weekend job failure and the diagnostic message was simply:

ERROR: Failed to calibrate currency pair!

Tracking down which of the many currency combinations was at fault would have been made so much easier if the author had included in the message the pair of currencies in question that had failed. Whilst debugging the issue I discovered the information was already available at the code site where the message was logged and so I naturally added it right away to help those on support in the future. (Reflecting on and improving exception messages is a related topic I’ve tackled before [8].)

In an ideal world we would know all the ways that our system can fail and we could code accordingly, but life is just not that fair. Networks like the Internet are awash with services, proxies, caches, load balancers, etc. from different vendors that will behave in different ways when they fail or return an error. Even fairly simple protocols like HTTP are open to different interpretations by different teams, and that’s before you bring in the need to work around broken behaviours on supposedly transparent services because they are trying to be “clever”. The on-premise world does not necessarily fare any better; you still get plenty of bizarre server, network and storage problems to deal with [9].

Consequently although we can predict certain types of failure, such as a network route being unavailable, we may not know all the ways that it will manifest or what diagnostic information we might need to diagnose it. Waking someone up in the middle of the night every time a network request fails is not going to win you any favours, so we need to analyse our errors and drill in to see if we can categorise them further so that the support team only springs into action when there is actually something to do. During development it’s also possible that we unearth “impedance mismatches” between various components in our system which on the face of it present as occasional failures but could lead to significant problems once we reach production scale.

What follows are a couple of examples of how non-catastrophic failures can be diagnosed and either resolved or vastly improved upon to ensure the signal-to-noise ratio remains as high as possible. Each step was a small code change, pushed quickly into production, to gather more information about the situation each time it cropped up. Naturally an approach like continuous delivery really helps to keep the feedback loop short between learning and reacting to the new diagnostic information.

Service Misconfiguration

Soon after deploying the initial version of an adapter bridging the on-premise and cloud-hosted worlds of a service, the team started seeing occasional errors in the HTTP 5xx range. The general rule is that these are server-side issues and therefore our logic used the classic exponential back-off to ride it out, eventually. What was odd was that the subsequent request always went through fine, so whatever the transient condition was, it always cleared immediately.

Our first-order approach to diagnostic logging in that component only logged the HTTP status code on failure at this point, so we decided to immediately start capturing more information, such as the HTTP response headers, because this error didn’t appear to emanate from the service at the other end. One always needs to be wary of dumping untrusted input even into a logging system, so we were careful to limit what new information we captured.

What this showed was that the HTTP response was coming from Akamai, the company providing some basic security for the underlying service, such as API throttling. Despite only asking for a JSON response we actually got back an HTML content type which raised some interesting questions about our current response handling code. We also decided to include the first half KB of the response content into a “Debug” section of the logging event so that we could inspect it further when it showed up.

The payload included what appeared to be a standard HTML error response from Akamai that also included an “error reference” which the support team could look up to see what the underlying issue was. We now felt that we had enough heuristics available to reliably recognise this Akamai error response and parse it into an event of its own type, exposing the error reference directly as a property so that we could quickly look it up with Akamai’s own diagnostic tools.
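
The upshot was a dedicated event type along these lines; the event name and properties below are illustrative rather than the real code:

// A recognised upstream (Akamai) failure promoted to its own event type so
// the error reference can be queried directly in the monitoring system.
public class EdgeGatewayError : EventBase
{
  public int StatusCode { get; }
  public string ErrorReference { get; }

  public EdgeGatewayError(int statusCode, string errorReference)
    : base("orders.edge-gateway-error.count")
  {
    StatusCode = statusCode;
    ErrorReference = errorReference;
  }
}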

As it turned out, in this instance, the error was due to different idle socket timeouts between Akamai and the AWS load balancer which in itself took some effort to discover. At least we now knew for sure that certain 5xx errors were definitely not from the service and therefore there was no need to go looking elsewhere. Also, unless someone had changed the Akamai configuration, which was very rare, we could say with a high degree of certainty that the error really was transient and out of our control. Consequently the monitoring dashboard would now show a reduction in generic HTTP error responses in favour of the more specific Akamai ones.

The second example looks at a wider reaching code change in reaction to learning more about the way a system dependency reported certain types of errors.

Error Translation

Initially our message handler only had very basic error handling – enough to know to retry on a 5xx HTTP response but to triage a request if it got a 4xx one. In HTTP the 4xx response codes generally mean the client needs to take some corrective action, such as fixing its request format (400), authenticating itself (401), or checking if the right resource was being requested (404). We never saw a 404 but we did see occasional 400s which was worrying as we didn’t know what we could have done wrong to create a malformed request for such a simple query.

Luckily, by adding a little extra diagnostic data, we discovered the upstream service provided a richer error payload, in JSON, that we could attach to a “Debug” property of the, then simple, “http.client-error.count” event, along with any response headers, just like earlier. From doing this we discovered that the “service specific” error code was tantamount to a “not found” error, which you’d usually report with a 404 HTTP status code. The service was only providing a simple lookup and therefore we could attach the identifier we were looking up in our request onto a new, richer diagnostic event designed to show which IDs were unfortunately not correctly mapped.
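
The richer event then carried the identifier directly; again, the names below are illustrative:

// The upstream 400 plus its service-specific error code translated into an
// explicit "unmapped ID" event carrying the identifier that failed to map.
public class IdNotMapped : EventBase
{
  public string Id { get; }

  public IdNotMapped(string id)
    : base("orders.unmapped-id.count")
  {
    Id = id;
  }
}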

This not only enhanced the monitoring side of the equation but it also meant that we could use a much richer exception type to convey this unusual condition to the outer layers of our code. It turned out that these missing mappings were never going to be addressed any time soon and therefore pushing the request onto a “slow retry” queue was a complete waste of time and we might as well drop them on the floor until further notice.

Aside from a reduced number of “poisoned” messages to deal with on our side, by including the information directly in the monitoring data this also meant the other team which owned the upstream service could query our logs and find out how bad the problem was and whether it was getting better or worse over time. We could also easily see if any particular IDs cropped up regularly and ask if they could be fixed manually. A better choice of HTTP status code in the first place would have made life slightly simpler but we would have still needed to understand the problem to ensure that we removed the unnecessary noise and reacted accordingly; either way we couldn’t have left it as it was.

Summary

The freeform text log file has been with us for a long time and still continues to serve us well in the right situations. However as the size and complexity of our systems grows we need something more “grown-up” that can sensibly cater for the needs of the operations team as well as those developing it (even if they are one and the same). These two forces are often at odds with each other, not out of malice but out of ignorance, and so we should redress the balance by putting their diagnostic needs on a par with our own. Their hands are effectively tied and so those who work with the code need to be the ones to cater for them as well.

Moving towards a more structured approach to logging brings with it an improvement in monitoring as we start to consider what each diagnostic event means, not in isolation but in the context of the system as a whole. By making it easy to aggregate related events across time, by component, or by subsystem we can get a more holistic feel for what is going on under the hood. Then, taking these same data points over a much longer period of time, we can get a feel for what the trends are in the system as it grows so that we are not caught off guard when those cyclic peaks come around a second or third time.

Finally we also need to accept that “there are more things in heaven and earth, Horatio, than are dreamt of in your philosophy”. We don’t know all the ways our system will fail but by continuing to invest in refining the way we capture and report errors we stand a better chance of ensuring the signal stands out from the noise.

References

[1] Test Smell: Logging is also a feature, Steve Freeman,
http://www.mockobjects.com/2007/04/test-smell-logging-is-also-feature.html

[2] Simple Instrumentation, Chris Oldwood, Overload #116,
https://accu.org/index.php/journals/1843

[3] Causality – Relating Distributed Diagnostic Contexts, Chris Oldwood, Overload #114,
https://accu.org/index.php/journals/1870

[4] Release It! Design and Deploy Production-Ready Software, Michael T. Nygard,
https://pragprog.com/book/mnee/release-it

[5] Leaky bucket, Wikipedia,
https://en.wikipedia.org/wiki/Leaky_bucket

[6] Single Points of Failure - The SAN, Chris Oldwood,
http://chrisoldwood.blogspot.co.uk/2014/09/single-points-of-failure-san.html

[7] Metric and Label Naming, Prometheus,
https://prometheus.io/docs/practices/naming

[8] Terse Exception Messages, Chris Oldwood, Overload #127,
https://accu.org/index.php/journals/2110

[9] The Cost of Not Designing the Database Schema, Chris Oldwood,
http://chrisoldwood.blogspot.co.uk/2015/12/the-cost-of-not-designing-database.html

Chris Oldwood
25 March 2018

Biography

Chris is a freelance programmer who started out as a bedroom coder in the 80’s writing assembler on 8-bit micros. These days it's enterprise grade technology in plush corporate offices. He also commentates on the Godmanchester duck race and can be easily distracted via gort@cix.co.uk or @chrisoldwood.