Friday, April 22, 2016

Tracking Positions in Go: Why Composition Rocks

In this post I discuss the genesis of the streampos package for Go that I recently posted on github. I don't normally write about code I post, but in this case I learned something cute that I wanted to share. Maybe you'll find it enjoyable too?

The story starts with me writing some Go code to process XML files. Of course I used the encoding/xml package from the Go standard library to do this. A few hours into the job I needed to generate error messages when something is not quite right in the XML file. (If you've dealt with large-ish XML files that get edited manually, you probably know that it's quite easy to make the occassional mistake that syntax highlighting alone will not protect you from.) In order for those error messages to be useful, they should tell the user where in the XML file the problem was detected. And that's when I ran into trouble: There's no straightforward way to get, say, line numbers out of encoding/xml!

Sure, their code tracks line numbers in support of their error messages (check
SyntaxError for example) but if you have your own errors that go beyond what the standard library checks, you're out of luck. Well, not completely out of luck. If you're using Decoder to do stream-based XML processing, you can get something moderately useful: the InputOffset method will give you the current offset in bytes since the beginning of the stream.

What do you have to do to turn that into error messages of the kind users expect, so error messages in terms of line (and maybe column) numbers? First you somehow have to get your hands on the raw input stream. Then you look for newlines and build a data structure that allows you to map a range of offsets in the stream into a line number. With a little more code on top, you even get out column numbers if you want them. Sounds like fun, but just how should we do it?

To use Decoder you have to give it a Reader to grab input from. This is promising because Go very much prides itself in the power that simple interfaces like Reader and Writer provide. Indeed, the pattern of wrapping one Reader inside another (or one Writer inside another) is fundamental in much of the standard I/O library. But we can also find those interfaces in "unrelated" parts of the library: The Hash interface, for example, is a Writer that computes hashes over the data written to it.

So the Decoder wants a Reader, but thanks to interfaces any Reader will do. It's not a big leap to think "Hey, I can hack my own Reader that tracks line numbers, and I'll pass that one to Decoder instead of the original Reader!" That's indeed what I considered doing for a few minutes. Luckily I then realized that by hacking a Reader I am actually making myself more problems than I had to begin with.

I want to solve the problem of building a mapping from offsets to line numbers. However, as a Reader, I also have to solve the problem of providing data to whoever is calling me. So whatever the original problem was, as soon as I decide to solve it inside a Reader, I immediately get a second problem. And it's not an entirely trivial problem either! For example, I have to consider what the code should do when asked to provide 37 bytes of data but the underlying Reader I am wrapping only gave me 25. (The answer, at least in Go land, is to return the short read. But notice that it's something I had to think about for a little while, so it cost me time.) On a more philosophical level, inserting a Reader makes my code an integral part of the entire XML thing. I never set out to do that! I just wanted a way to collect position information "on the side" without getting in anybody's way.

In the specific case of encoding/xml things actually get even funnier. It turns out that Decoder checks whether the Reader we hand it is actually a ByteReader. If it's not, Decoder chooses to wrap the Reader we hand it again, this time in a bufio.Reader. So either I have to implement a ByteReader myself, or I have to live with the fact that plugging in my own Reader causes another level on indirection to be added, unnecessarily to some extent. (That's yet another problem I don't want to have to deal with!) There really should be a better way.

And there is: Instead of hacking a Reader, just hack a Writer! I can almost see you shaking your head at this point. "If the Decoder wants a Reader, what good is it going to do you to hack a Writer?" I'll get to that in a second, first let's focus on what being a Writer instead of a Reader buys us.

The most important thing is that as a Writer, we can decide to be "the sink" where data disappears. That is, after all, exactly what the Hash interface does: It's a Writer that turns a stream into a hash value and nothing else, the stream itself disappears in the process. What's good enough for Hash is good enough for us: We can be a Writer that turns a stream into a data structure that maps offsets to line numbers. Note that a Reader doesn't have this luxury. Not ever. True, there could be Readers that are "the source" where data appears out of thin air, but there are no (sensible) Readers that can be "the sink" as described above.

A secondary effect of "being the sink" is that we don't have to worry about dealing with an "underlying Writer" that we wrap. (As a Reader, we'd have to deal with "both ends" as it were, at least in our scenario.) Also, just like in the case of Hash, any write we deal with cannot actually fail. (Except of course for things like running out of memory, but to a large degree those memory issues are something Go doesn't let us worry about in detail anyway.) This "no failures" property will actually come in handy.

Okay, so those are all nice things that will make our code simpler as long as we hack a Writer and not a Reader. But how the heck are we going to make our Writer "play nice" with the Decoder that after all requires a Reader? Enter a glorious little thing called TeeReader. (No, not TeaReader!) A TeeReader takes two arguments, a Reader r and a Writer w, and returns another Reader t. When we read from t, that request is forwarded to r. But before the data from r gets returned through t, it's also written to w. Problem solved:

lines := &streampos.Writer{}
tee := io.TeeReader(os.Stdin, lines)
dec := xml.NewDecoder(tee)

There's just one small problem with TeeReader: If the write we're doing "on the side" fails, that write error turns into a read error for the client of TeeReader. Of course that client doesn't really know that there's a writer involved anywhere, so things could get confusing. Luckily, as I pointed out above, our Writer for position information never fails, so we cannot possibly generate additional errors for the client.

I could end the post here. But I don't want to sweep under the rug that there's a little cheating going on. Where? Well, you see, TeeReader is not a ByteReader either. So regardless of how nice the code in our Writer is, and regardless of how cute the setup above is, we incur the cost of an extra indirection when NewDecoder decides to wrap the TeeReader. What we're doing is shoving the problem back to the standard library. It's possible that TeeReader will eventually grow a ReadByte method at which point the needless wrapping would cease. However, that's not very likely given what TeeReader is designed to do. But note that this concern arises specifically in connection with encoding/xml. There are probably many applications that do not require methods beyond the Reader interface.

Speaking of other applications. In the Go ecosystem, interfaces such as Reader and Writer are extremely prominent. Lots of people write their code to take advantage of them. The nice thing is that streampos.Writer coupled with TeeReader provides a generic way to handle position information for all applications that use a Reader to grab textual data. Of course not all applications do, and not all applications will be able to take full advantage of it. But if you're writing one that does, and if you want to have position information for error messages, well, it's three lines of code as long as you already track offsets. And you have to track something yourself because after all only your application knows what parts of a stream are interesting.

I very much like that Go encourages this kind of reusability by composing small-ish, independently developed pieces. (Actually, that even confirms a few of the claims I made in my 2003 dissertation. Yay me!) The only "trouble" is that there are already a few places in the standard library where similar position tracking code exists: At the very least in text/scanner and in the Go compiler itself. Whether that code could use my little library I don't know for sure, maybe not. But I guess it should be a goal of the standard library to refactor itself on occasion. We'll see if it does...

One last note: I've been teaching a course on compilers since 2001, and since about 2003 I've told students to use byte offsets as their model of positions. I've always sold this by explaining that offsets can be turned into lines and columns later but we don't have to worry about those details in the basic compiler. Strangely enough I never actually wrote the code to perform that transformation, until now that is. So once I teach the course mostly in Go, I can use my own little library. Neat. :-)

No comments:

Post a Comment