In The Toolbox – Finding Text

It started with a fairly simple question: "Do you know of anything that can open a 400 MB text file?" Whilst being new to the team I've been programming professionally long enough to know that this isn't the real question. I have my suspicions about what my fellow programmer really wants to do but I need to ask them what this huge text file is and, more importantly, why are they trying to open this file in the first place?

My hunch is correct – it is a log file. And the reason they are trying to open it is that they are on support and want to understand why something is failing or misbehaving. Hence the real problem is how to efficiently view and manipulate large text-based log files. On Windows the lowest common denominator for this is MORE, if using the command line, and Notepad for the GUI-oriented. The latter is essentially just a wrapper around the Windows edit-box control and was never designed for handling large chunks of text.

After quickly reeling off close to a dozen tools that I could use to view and process log files it got me thinking about the wider question: what tools might I use to solve the more general problem of "finding text", and what conditions or constraints would cause me to choose one over another? After all, this task is one that we programmers probably perform many, many times a day for different reasons.

The criteria for this list are pedagogically loose and cover the need to match prose and structured text, both programmatically and also manually. When diagnosing a problem we often don't know what the pattern is at that point and so we have to rely on our built-in pattern recognition system to seek out some semblance of order from the chaos. At which point we may switch or combine approaches to delve in further. In some cases we may not even be consciously looking for it, but the tool makes it apparent and leads us to go looking for more anomalies.

This list of tools is not, and cannot be, comprehensive because the very problem itself is being solved again and again. It is also not presented in any particular order because a tool might be used in a variety of contexts depending on the conditions and constraints in play. Many of these tools do have very similar variations though, and are pretty much interchangeable, so they are discussed together.

FIND / FINDSTR

Windows comes with not one, but two command line tools for finding text within files (or the standard input). If you want to know why there are two similar tools you can read Raymond Chen's blog post [RC], but in short they come from the two different Windows lineages – 9x and NT.

I rarely use FIND, except by accident, as it has the same name as one of the classic UNIX command line tools which does something different. Luckily the more recent Gnu on Windows (GoW) distributions [GoW] have taken the pragmatic approach of renaming their FIND tool to GFIND to avoid surprises when running scripts where the PATH order differs.

Even without this wrinkle there is little reason to use FIND over FINDSTR as the latter has some support for regular expressions. Sadly it also has this weird behaviour of treating a string with spaces as a list of words to match instead of treating the entire string literally.
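To make that quirk concrete, here is a minimal Python sketch of the two behaviours. It models only the word-splitting, not FINDSTR's regular expression handling, and the function names and log lines are invented purely for illustration. The literal form corresponds to what FINDSTR does when the string is supplied via its /C: switch.

```python
def matches_any_word(line: str, search: str) -> bool:
    """Model FINDSTR's default: each space-separated word is an alternative."""
    return any(word in line for word in search.split())


def matches_literal(line: str, search: str) -> bool:
    """Model FINDSTR /C:"...": the whole string, spaces included, must appear."""
    return search in line


# Hypothetical log lines, invented for this example.
log = [
    "10:00:01 connection opened",
    "10:00:02 login failed for user bob",
    "10:00:03 connection failed: timeout",
]

# The loose form matches every line containing "connection" OR "failed".
loose = [line for line in log if matches_any_word(line, "connection failed")]

# The literal form matches only lines containing the exact phrase.
exact = [line for line in log if matches_literal(line, "connection failed")]
```

With this sample the loose search selects all three lines, whereas the literal search selects only the last, which is usually what was intended.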

Given its general non-standard behaviour, with respect to GREP (and its cousins), it might seem somewhat useless. But it has two things going for it – it's installed by default and is fast in comparison to some other similar command line tools I've used.

There was a time when production servers were tightly locked down and so this was your only option. Operations teams seem to be a little more open these days to a wider variety of tools, no doubt in part due to the use of VMs to isolate applications, and therefore teams, from each other.

GREP / EGREP / FGREP

The natural alternative to the Windows FIND/FINDSTR combination is the UNIX equivalent GREP. It has two counterparts, EGREP and FGREP, which are really just short-hands for GREP -E and GREP -F respectively. In fact the --help switch for the short-hands warns you they are deprecated forms. Sadly my muscle memory keeps kicking in as I've been using ports of them since my days working on DOS.

For the record the difference is that EGREP (GREP -E) enables an extended regular expression syntax whilst FGREP (GREP -F) treats the strings literally and so (I guess) is faster. I haven't timed them recently and suspect that there is little in it these days.
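The distinction between pattern and literal matching can be sketched in a few lines of Python. This is only an analogy: Python's regular expression dialect is not identical to the extended syntax GREP -E accepts, and the function names and sample lines are mine.

```python
import re


def grep_regex(lines, pattern):
    """Keep lines where the pattern matches as a regular expression."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]


def grep_fixed(lines, text):
    """Keep lines containing the text exactly as typed, with no metacharacters."""
    return [line for line in lines if text in line]


# Hypothetical input, invented for this example.
lines = ["timeout=30", "timeout.30", "retry (x3)"]

# As a regex the dot matches any character, so both timeout lines match.
as_regex = grep_regex(lines, r"timeout.30")

# As a fixed string the dot is just a dot, so only one line matches.
as_fixed = grep_fixed(lines, "timeout.30")
```

The fixed-string form is the safer choice when the text being hunted for contains characters such as dots, brackets or asterisks that were never meant as a pattern.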

For a long time I used the Win32 ports distributed as UnxUtils, but that went stale years ago and GoW (Gnu on Windows) has replaced it as my port of choice as it has no extra dependencies, such as Cygwin. As such it makes it easy to include these in a diagnostics package or just XCOPY them about. What makes Operations feel uneasy is software that needs "installing" whereas being able to run something directly from a remote file share usually won't raise their ire.

With the introduction of the Chocolatey package manager [CPM] on Windows this toolset is pretty much one of the first things I install, and the renaming of FIND to GFIND now makes it safer to install on a server too without fear of silently breaking something else.

Like many others this is the tool I probably reach for first, command-line wise.

AWK & SED

If the task is to simply find some text then GREP pretty much does the trick, but often I'm looking to do a little bit more. I might need to do a little parsing, such as summing numbers contained within it or transforming it slightly to reduce the noise. In these cases I'm once again looking to the UNIX toolset classics, this time AWK and SED.
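As an illustration of the "little bit more", here is a rough Python equivalent of the kind of one-liner I would normally write in AWK to total up a number buried in each matching line. The log format, with its "took N ms" suffix, is an assumption made up for the example.

```python
import re

# Assumed format: lines of interest end with "took <number> ms".
DURATION = re.compile(r"took (\d+) ms")


def total_duration(lines):
    """Sum the captured durations, silently skipping lines that do not match."""
    total = 0
    for line in lines:
        match = DURATION.search(line)
        if match:
            total += int(match.group(1))
    return total


# Hypothetical log lines, invented for this example.
log = [
    "10:00:01 request /orders took 120 ms",
    "10:00:02 heartbeat",
    "10:00:03 request /prices took 80 ms",
]

total = total_duration(log)
```

The non-matching heartbeat line is the "noise" that gets filtered out along the way.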

I had forgotten about them for many years whilst I was heavily into doing front-end work, but as I switched to the back-end again I found myself doing a lot more text processing. In fact I wrote about my rediscovery in these very pages a few years ago [AS].

Whilst I could use SED more often as a replacement for GREP I keep forgetting the differences in the regular expression syntax (there are many things it doesn't support) and so I find myself wasting time trying to debug the regex only to discover I'm using an EGREP supported construct. Hence I usually carry around a copy of the O'Reilly pocket reference books on regular expressions and SED & AWK, mostly for the tables comparing support across SED, GREP and AWK.

I could use AWK far more than I do for basic text matching because it's pretty quick, but I forget and only remember when I suddenly realize I need the power of its formatting options, not its matching ones.

One particularly useful feature of SED and AWK is their support for address ranges. If you have some pretty-printed XML or JSON (i.e. one tag or key/value pair per line) then you can match parts of the document using address ranges. For example, before I discovered JQ for querying JSON [JQ], I generated some simple release notes by extracting the card number and title from a Trello board exported as JSON via a script I wrote that invoked SED and AWK.
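For anyone who hasn't met address ranges, the idea is that a start pattern switches selection on and an end pattern switches it off again. The Python sketch below follows the SED convention, where the end pattern is only tested on lines after the one that opened the range. The JSON fragment is a made-up stand-in, not the real Trello export format.

```python
import re


def select_ranges(lines, start, end):
    """Return lines from each start match through the next end match, inclusive."""
    start_rx, end_rx = re.compile(start), re.compile(end)
    inside = False
    selected = []
    for line in lines:
        if not inside:
            if start_rx.search(line):
                inside = True
                selected.append(line)
        else:
            selected.append(line)
            if end_rx.search(line):
                inside = False
    return selected


# Hypothetical pretty-printed JSON, one key/value pair per line.
document = [
    '{',
    '  "card": {',
    '    "id": 42,',
    '    "name": "Fix login"',
    '  },',
    '  "labels": {',
    '    "colour": "red"',
    '  }',
    '}',
]

# Pick out just the "card" object: from its key to the next closing brace.
card = select_ranges(document, r'"card"', r'^\s*\}')
```

This only works because the document is pretty-printed. Once the layout varies, a proper parser or a tool like JQ is the better bet.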

MORE / HEAD

Earlier I mentioned that I often rely heavily on my own human pattern matching skills to home in on something hopefully buried within the chaos. If I'm already in the process of building a pipeline to do some matching I might just want to pass my eye over the content by paging through it and seeing if any patterns emerge in the flicker. Hence MORE (or HEAD if I remember it's there) are useful ways to page text for manual scanning.

I generally shy away from using HEAD though as the versions in the UnxUtils and GoW distributions often go bonkers complaining about a broken pipe and it quickly floods the screen with an error. I'm sure I'm just doing something wrong but it hasn't bothered me enough to discover what it is. That's the great thing about having so many choices, you just work around the limitations in one tool by picking another similar one.

PowerShell

In more recent years as I've started to use PowerShell more and more for scripting I naturally find myself becoming more comfortable using the built-in features of the language to create even simple text processing pipelines as well as its more powerful object-based ones.

In particular the language comes with some useful cmdlets out-of-the-box for handling XML and CSV files in a more structured way. For example you can import a CSV file and name the columns (if it's not already done via a header row) which then allows you to query the data using the more natural column names instead of, say, using CUT and having to refer to them numerically. This really aids readability in wrapper scripts too [WS].
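The benefit of naming the columns is easier to see with an example. In PowerShell this is what Import-Csv with its -Header parameter gives you. The sketch below shows the same idea in Python using the standard csv module, with column names and data that are purely hypothetical.

```python
import csv
import io


def read_named(csv_text, column_names):
    """Parse header-less CSV text into dictionaries keyed by the supplied names."""
    reader = csv.DictReader(io.StringIO(csv_text), fieldnames=column_names)
    return list(reader)


# Hypothetical header-less CSV log, invented for this example.
data = (
    "2015-12-21 10:00:01,ERROR,Disk full\n"
    "2015-12-21 10:00:02,INFO,Heartbeat\n"
)

rows = read_named(data, ["timestamp", "level", "message"])

# The query reads in terms of "level" and "message", not column 2 and column 3.
errors = [row["message"] for row in rows if row["level"] == "ERROR"]
```

Compare the final line with a CUT-based pipeline that has to refer to fields 2 and 3, and the readability argument rather makes itself.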

Whilst PowerShell comes with great flexibility by using .Net for its underpinnings, this also comes at a cost. The performance of parsing textual log files is considerably worse than with native code, such as AWK. I once needed to do some analysis that involved parsing many multi-megabyte log files on a remote share. I started out using PowerShell but eventually discovered it was taking a couple of minutes just to read a file. So I switched to AWK instead which managed to read and parse the same file in only 8 seconds. This was on PowerShell v2 and so more recent versions of .Net and PowerShell may well have closed the gap.

LogParser

Another very old, free tool that I've found useful for parsing log files because of its performance is Microsoft's LogParser [LP]. Originally written to parse IIS log files it grew the ability to read various other text and binary format files which, like the Import-Csv cmdlet in PowerShell, can give the data's columns names. These can then be used within LogParser's SQL-like language to create some pretty powerful queries.
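To give a flavour of what querying a log with SQL feels like, here is a small Python sketch that loads already-parsed rows into an in-memory SQLite table. This is emphatically not LogParser's own syntax or API, just an analogy. The table layout and data are assumptions for the example.

```python
import sqlite3


def count_by_level(rows):
    """Load (timestamp, level, message) rows into SQLite and count them per level."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE log (timestamp TEXT, level TEXT, message TEXT)")
    conn.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT level, COUNT(*) FROM log "
        "GROUP BY level ORDER BY COUNT(*) DESC, level"
    ).fetchall()
    conn.close()
    return result


# Hypothetical parsed log rows, invented for this example.
parsed = [
    ("10:00:01", "ERROR", "Disk full"),
    ("10:00:02", "INFO", "Heartbeat"),
    ("10:00:03", "ERROR", "Disk full"),
]

summary = count_by_level(parsed)
```

Once the columns have names, grouping, counting and ordering become one declarative statement rather than a chain of SORT and UNIQ invocations.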

BareTail

One of the ways I've found to help the mind unearth patterns in text, again especially when analysing log files, is to apply a dash of colour to certain lines and words. Just as I use syntax highlighting to make source code a little easier to read, I apply the same principle to other kinds of files. For example I'll highlight error messages in red and warnings in yellow. If there are regular lines that are usually of little interest, such as a server heartbeat message, I'll colour them in light grey so that they blend into the white background to make the more significant behaviour (in black) stand out.
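The same trick is easy to approximate on the command line. The following Python sketch wraps lines in ANSI colour escape sequences according to a few keyword rules. The keywords are assumptions about a hypothetical log format, and how the colours actually render will depend on the terminal in use.

```python
# Standard ANSI escape sequences for foreground colours.
RED = "\033[31m"
YELLOW = "\033[33m"
GREY = "\033[90m"
RESET = "\033[0m"

# Rules are checked in order; the keywords are assumed, not taken from a real log.
RULES = [
    ("ERROR", RED),
    ("WARN", YELLOW),
    ("heartbeat", GREY),
]


def colourise(line):
    """Wrap the line in the colour of the first matching rule, if any."""
    for keyword, colour in RULES:
        if keyword in line:
            return colour + line + RESET
    return line


# Hypothetical log lines, invented for this example.
for entry in ["10:00:01 ERROR disk full", "10:00:02 heartbeat", "10:00:03 started"]:
    print(colourise(entry))
```

Lines that match no rule are passed through untouched, so the ordinary behaviour stays in the default colour.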

The first decent Windows tool I found that did this was a commercial tool (with a free cut-down version) called BareTail. Naturally others have sprung up in the meantime, like LogExpert, which are free.

What really attracted me to shell out for the full version was its ability to tail a file and at the bottom have a real-time GREP running over the same file to filter out interesting events. This was a feature my previous team had built into its own custom log viewer years before and so was a most welcome discovery.
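The combination of tailing and filtering can be sketched in Python as below. This is a simplified illustration, not how BareTail works internally: it polls rather than using file change notifications, starts from the beginning of the file, and does not cope with a line that has only been partially written. The function names are mine.

```python
import re
import time


def read_new_matches(handle, pattern):
    """Read whatever has been appended since the last call and keep the matches."""
    rx = re.compile(pattern)
    matches = []
    # readline() returns "" once we reach the current end of the file.
    for line in iter(handle.readline, ""):
        if rx.search(line):
            matches.append(line.rstrip("\n"))
    return matches


def follow(path, pattern, poll_seconds=1.0):
    """Print matching lines as the file grows; runs until interrupted."""
    with open(path, "r") as handle:
        while True:
            for line in read_new_matches(handle, pattern):
                print(line)
            time.sleep(poll_seconds)
```

Calling something like follow("app.log", "ERROR") (a hypothetical file name) would then behave roughly like a TAIL piped through GREP.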

BareGrep

The sister tool to BareTail is called BareGrep, which is a GUI based version of the old classic described earlier. It too has the same highlighting support and also has a TAIL like view at the bottom which provides additional context around the lines you match with the pattern in the main view.

The two other features that made it a worthy addition to my toolbox were its ability to use regular expressions on the filename matching (rather than the usual simpler file globbing provided by the Windows FindFirstFile API), and its support for naming and saving patterns. On support I often find myself using the same regex patterns again and again to pick out certain interesting events at the start of an investigation.
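Both ideas, regex-based filename matching and a stash of named patterns, are simple to mimic in a script. The Python sketch below is only an approximation of the concept. The pattern names, the filename convention and the function are all hypothetical.

```python
import os
import re

# A hypothetical set of saved, named filename patterns.
SAVED_PATTERNS = {
    "daily service logs": r"^service-\d{4}-\d{2}-\d{2}\.log$",
}


def find_files(root, name_pattern):
    """Walk the tree under root and return paths whose file name matches the regex."""
    rx = re.compile(name_pattern)
    found = []
    for folder, _subfolders, files in os.walk(root):
        for name in files:
            if rx.search(name):
                found.append(os.path.join(folder, name))
    return sorted(found)
```

A glob such as service-*.log would also pick up service-old.log, whereas the regex above insists on a date-shaped suffix. Usage would be along the lines of find_files(".", SAVED_PATTERNS["daily service logs"]).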

Notepad

This article began by explaining what the alternatives are on Windows to the simplistic Notepad, but that doesn’t mean it's not still useful. Aside from WordPad it's the only GUI based viewer installed by default which might be significant in a locked down environment. Even so it can still be handy for smaller stuff, like looking at .config files.

In a way its naivety is also one of its few useful traits. Because it only handles Windows line endings and has an insane tab width setting of 8, any screwy formatting shows up pretty quickly and acts as a gentle reminder to check everyone's on the same page editor-configuration wise.

Notepad++

The original Notepad is very much a tool of last resort and so I'll try and install something a little more powerful as a replacement for day-to-day, non-IDE based text editing jobs, such as writing mark-up. It's quick to open and provides all the usual features you'd expect from a plug-in enhanced text editor. It could easily be Atom, Sublime Text or any one of a number of decent editors out there.

For me the decision to use the command line or a GUI based tool when searching for text depends on how much context I need when I find what I'm looking for and also whether I'm going to edit it afterwards. When searching log files it's ultimately a read-only affair with perhaps some statistical output. In contrast a document probably means I'm going to select some text and paste it elsewhere or even edit the prose in-situ which demands at least a spell checker.

The other factor is often whether I've navigated to the data via a command prompt or the Windows Explorer in which case a right-click is easier than opening a prompt at the folder and typing a command. That said if the file type association is already registered it's just as easy to go the other route and open it in a GUI tool from the command line. And sometimes the choice of tool is totally arbitrary and depends on whatever I've not used in a while and feel I need to remind myself about.

Visual Studio

Anyone who has ever double-clicked the Visual Studio icon by accident or forgotten to register a more lightweight choice of editor for the file association will curse their mistake as it takes an eternity to start. But if it's already running, and being an IDE means it's quite likely for that to be the case, it's just as easy to reuse it as spawn another text editor.

As of Visual Studio 2013 they have also replaced the byzantine regular expression syntax with the more standard .Net one which makes finding text with regexes way more palatable. And with the new, cross-platform Visual Studio Code editor it looks like this mistake will be far less costly in future.

Vendor Specific Tools

Although most content is becoming available in simpler text formats, so that the choice of tooling is much wider and freer, there is still plenty of it stored in custom binary formats like old MS Word documents. Wikis and the various flavours of mark-up have certainly gained in popularity, but the enterprise is often locked into vendors through these bespoke formats and so, for completeness, they need to be accounted for even though they are on the decline.

Google

The discussion thus far has largely been about finding text in files on my machine or the intranet. But every day like so many other people I spend plenty of time looking things up on the Internet and for that, naturally, I'll use one of the major search engines.

Despite them having sold appliances for well over a decade that promise to bring the power of an Internet search to the enterprise, this still does not appear to have happened and finding anything on an intranet still appears to be a fruitless exercise.

Summary

Searching text, whether it be source code, prose, log or data files, is a bread-and-butter activity for programmers. What we do with it when we've found it adds another dimension to the type of tools we use because it may not just be plain text we're lifting but we might want the formatting too. Throw performance and differing query languages into the mix and it's no wonder that we have to keep our hands on such a varied array of tools.

References

[RC] https://blogs.msdn.microsoft.com/oldnewthing/20121128-00/?p=5963
[GoW] https://github.com/bmatzelle/gow/wiki
[CPM] https://chocolatey.org/
[AS] http://www.chrisoldwood.com/articles/reacquainting-myself-with-sed-and-awk.html
[JQ] https://stedolan.github.io/jq/
[WS] http://www.chrisoldwood.com/articles/in-the-toolbox-wrapper-scripts.html
[LP] https://technet.microsoft.com/en-gb/scriptcenter/dd919274.aspx

Chris Oldwood
21 December 2015

Biography

Chris is a freelance developer who started out as a bedroom coder in the 80’s writing assembler on 8-bit micros; these days it’s C++ and C#. He also commentates on the Godmanchester duck race and can be contacted via gort@cix.co.uk or @chrisoldwood.