The Reasoning Behind the Choices

Sharing code in public is interesting in many ways. Sometimes the choices we make about design are somewhat arbitrary as there are many options before us. Sometimes those choices are deliberate and methodical with a well reasoned approach on how you got there. Then there are those times where you just do something dumb…

If you’re going to be willing to share your code for the world to see, you really need to be OK with being wrong about something and learning from it. But you also need to know how to stick to your guns when you think you are doing things right. This is post is going to be a bit about both using my latest JSON parsing articles as illustrations: Generators Need a current Value and Improving Code with Generics.

The primary goal of the code that I wrote was to enable the ability to parse through a JSON string and create a JSON object representation from that string. However, in that article, I presented a much lower level view of the problem and framed it in such a way as to remove all of the context on why and how I reached that decision.

Wes Campaigne posted some great feedback over on GitHub about the approach I took to the problem.

I thought the whole

buffer.next()
while buffer.current != nil {
    if let unicode = buffer.current { // ... somewhere, buffer.next() is called

dance was kind of ugly: you’re dealing with the overhead of using a generator, but receiving none of the benefits it provides (e.g. for in loops). Also, using a struct for your BufferedGenerator seems odd — you end up using a class as a backing store anyway, and having it as a struct means using inout parameters all over the place. There’s a discussion on the dev forums that argues the case why GeneratorTypes should, in general, just be reference types.

Wes makes some great points, and his RewindableGenerator<S> is a very good class that solves the specific problem I was looking at better (both in terms of the applicability of the use cases and in how the code that consumes it should work).

The only real problem, which I forgot when I first looked at his solution, was that the performance difference between using the GeneratorType and the Index types for Strings is fairly significant, nearly a 2.5x slowdown.

When I was first solving this problem, I looked at the following approaches:

  1. String.Index based approach grabbing individual characters. This lead me to find out how String works with unicode combining characters.
  2. Then I tried using String.UTF8View.Index, after all, they are both indexes it should be a fairly easy change. Well… turns out that String.Index is a BidirectionalIndexType but String.UTF8View.Index is only a ForwardIndexType. At this point, I realized that I basically needed to re-write a significant portion of my algorithm. I did so making sure that all of my previous() calls were updated; this also required some fairly ugly hacks to get everything to work. Then I found out two new things after more investigation in the topic:
    1. Performance of the GeneratorType construct was significantly faster than the Index based construct.
    2. There is a better view into the string String.UnicodeScalarView. With the String.UTF8View, I had to create strings by passing a pointer to an UInt8 array that I had to keep track of while parsing the string. It was fairly ugly, but it worked. =)

Both of these lead me to the realization that another parser re-write was coming… however, this time, I knew I needed to use GeneratorType and I knew that I wanted to get rid of a lot of the hacks I did. This was the start of the Generators Need a current Value and Improving Code with Generics posts.

Well, I was able to get rid of some of my hacks, but then Wes’ comments came. I already wasn’t very pleased with the implementation of the JSON parser as it still had some hacks in it and some somewhat cryptic logic, but hey, it worked! But as I thought about Wes’ comments some more, I knew there was a better way.

So I started integrating Wes’ solution into my parsing code. But, I had already forgotten a lesson I had learned earlier: Index based approaches suck at perf, big time!

At this point, I had already re-written the parsing to provide some significantly better error messages (thanks in-part to using for (idx, scalar) in enumerate(generator) {} that was now possible due to Wes’ updates) and a much cleaner logic flow. However, I wanted to get my performance back down.

That’s when I came up with this class: ReplayableGenerator

final public class ReplayableGenerator<S: SequenceType> : GeneratorType, SequenceType {
    typealias Sequence = S

    private var firstRun = true
    private var usePrevious = false
    private var previousElement: Sequence.Generator.Element? = nil
    private var generator: Sequence.Generator

    public init(_ sequence: Sequence) {
        self.generator = sequence.generate()
    }

    public func next() -> Sequence.Generator.Element? {
        switch usePrevious {
        case true:
            usePrevious = false
            return previousElement

        default:
            previousElement = generator.next()
            return previousElement
        }
    }

    public func replay() {
        usePrevious = true
        return
    }

    public func generate() -> ReplayableGenerator {
        switch firstRun {
        case true:
            firstRun = false
            return self

        default:
            self.replay()
            return self
        }
    }

    public func atEnd() -> Bool {
        let element = next()
        replay()

        return element == nil
    }
}

I’ve been experimenting with using switch-statements over if-statements; I’m greatly likely their readability in many cases. However, there does seem to be a bug where case true and case false do not create an exhaustive list, so I use default.

: .info

These were the constraints:

  1. Index based iterators and lookups are significantly slower than GeneratorType and for-loop; they cannot be used.
  2. The GeneratorType is only a forward-moving iterator.
  3. There is no ability to inspect the previous character in the construct. This is vital because when we parse values, often times we need to inspect the next value to determine if we stop parsing the current value. However, once we do this, we are in a bit of a situation as the parser really needs to start parsing from that previous character because it’s going to call next() and skip over the just visited character. Bad mojo.

This class provided everything I needed, while the semantics of it also allowed me to create a much better parse(). The integration was also easy as I simply needed to replace the previous() calls with a replay() call.

With this implementation, I was able to get my performance back down to 0.25s vs. 0.17s (JSON.parse vs. NSJSONSerialization).

Remember, often times people are able to look at a problem have been working on and shed new light on the situation. While Wes’ solution was not applicable to my situation, his thought process on why his implementation better was superbly helpful in rethinking the semantics of what I was doing. Ultimately, I’m fairly happy with the results of the parser now… except for that perf! =)

So thanks Wes for helping me think about the problem better. Oh, and you can judge my parsing code here: JSValue.Parsing.

The Reasoning Behind the Choices