Generators Need a current value

When you build a parser, you need to be able to scan through your tokens to build up your output. Swift offers a few different constructs for iteration, and I needed the fastest one. After all, I'm building a parser!

I wrote a small test suite to test the various iteration types, which essentially boils down to two options for Strings:

  1. The traditional for-loop that uses an index value
  2. The GeneratorType-based approach

Index-based for-loop

Here's the code for this one:

var string = ""

let scalars = self.largeJSON.unicodeScalars
self.measureBlock() {
    for var idx = scalars.startIndex; idx < scalars.endIndex; idx = idx.successor() {
        let scalar = scalars[idx]
        scalar.writeTo(&string)
    }
}

It's pretty straightforward: start at startIndex and traverse the string until you hit endIndex. There are a couple of gotchas, though. The most significant is that Swift doesn't allow Int-based indexing; each of these collection types has its own index type. The other thing to watch out for is that not all of those index types are bi-directional.
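For readers on a current toolchain, the same index-based walk can be sketched as follows. This is an illustration, not the benchmark code above: it assumes Swift 5, where the C-style for loop and successor() are gone and the collection itself advances the index through index(after:). The sample string is made up.

```swift
// Sketch in Swift 5 syntax (assumption: the post itself targets an earlier Swift).
let scalars = "{\"é\": 1}".unicodeScalars
var copy = ""

// The index is a String.Index, not an Int, so the collection has to advance it.
var idx = scalars.startIndex
while idx != scalars.endIndex {
    copy.unicodeScalars.append(scalars[idx])
    idx = scalars.index(after: idx)
}
```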

GeneratorType Approach

var string = ""

self.measureBlock() {
    for scalar in self.largeJSON.unicodeScalars {
        scalar.writeTo(&string)
    }
}

This one is fairly simple as well: loop through all of the Unicode scalars. We can also write this loop in a slightly different way:

var string = ""

self.measureBlock() {
    var generator = self.largeJSON.unicodeScalars.generate()
    for var scalar = generator.next(); scalar != nil; scalar = generator.next() {
        scalar?.writeTo(&string)
    }
}
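In current Swift, GeneratorType has been renamed IteratorProtocol and generate() is spelled makeIterator(), and the explicit form is usually written with while let. A minimal sketch of that explicit loop, assuming Swift 5 and a made-up input string:

```swift
// Sketch in Swift 5 syntax (assumption: not the Swift version used in the post).
let input = "[true, null]"
var copy = ""

// makeIterator() plays the role of generate(); next() returns nil at the end.
var iterator = input.unicodeScalars.makeIterator()
while let scalar = iterator.next() {
    copy.unicodeScalars.append(scalar)
}
```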

In my testing, I found that the GeneratorType-based approach was about 18% faster. This is significant enough for me to use it. =)

Implementing the Parsing

Next up is actually parsing the JSON string. The basic idea is to look for specific tokens and call into one of these methods:

  1. parseObject – used to parse out a JSON object (e.g. dictionary)
  2. parseArray – used to parse an array
  3. parseNumber – used to parse a number value
  4. parseString – used to parse out a string, also used when parsing keys from a dictionary
  5. parseTrue – used to parse the boolean value true
  6. parseFalse – used to parse the boolean value false
  7. parseNull – used to parse the literal value null
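One way to picture the "look for specific tokens" step is a switch on the leading scalar that selects which routine to call. The sketch below is hypothetical: the ParseRoutine enum and the routine(for:) function are illustrative names, not taken from json-swift, and it is written in Swift 5 syntax.

```swift
// Hypothetical names, for illustration only.
enum ParseRoutine {
    case object, array, string, number, trueLiteral, falseLiteral, nullLiteral
}

// Picks a parse routine from the first scalar of a JSON value.
func routine(for scalar: Unicode.Scalar) -> ParseRoutine? {
    switch scalar {
    case "{": return .object
    case "[": return .array
    case "\"": return .string
    case "t": return .trueLiteral
    case "f": return .falseLiteral
    case "n": return .nullLiteral
    case "-", "0"..."9": return .number
    default: return nil   // not a valid start of a JSON value
    }
}
```

Note that the scalar inspected here has already been consumed from the generator by the time the chosen routine runs, which is where the state-sharing question comes from.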

I think that about covers the basics of what I need. And here comes the problem… each of these needs one or more of the following pieces of information:

  1. The current generator value so that increments can be done
  2. The current character the generator is pointing to
  3. The character used at the start of the parse call

When we look at the API for GeneratorType, we find that it only supports next(). Hmm… that's not going to be sufficient. So now we are left with two choices:

  1. Pass the current unicode token around with our generator instance, or
  2. Package up the generator and the current unicode token into a single class

To me, this is a no-brainer. As soon as we introduce this coupling, it is best to package up the dependencies and maintain that state with a single value.

Ideally, we would simply extend the generator type for String.UnicodeScalarView. However, extensions cannot add stored properties to a type, so we are left with creating an entirely new type to box this functionality.


To work around this limitation, I created the following type:

struct UnicodeScalarParsingBuffer {
    var generator: String.UnicodeScalarView.Generator
    var current: UnicodeScalar? = nil

    init(_ generator: String.UnicodeScalarView.Generator) {
        self.generator = generator
    }

    mutating func next() -> UnicodeScalar? {
        // Remember the most recent value so callers can read it without advancing.
        self.current = generator.next()
        return self.current
    }
}
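To show why carrying the current value matters, here is a sketch of a buffer being handed from a caller to a parse function. It is ported to Swift 5 names (Iterator instead of Generator), which is an assumption since the post targets an earlier Swift, and the parseTrue body is hypothetical rather than taken from json-swift.

```swift
// Swift 5 port of the buffer idea (assumption: Iterator replaces Generator).
struct UnicodeScalarParsingBuffer {
    var iterator: String.UnicodeScalarView.Iterator
    var current: Unicode.Scalar? = nil

    init(_ iterator: String.UnicodeScalarView.Iterator) {
        self.iterator = iterator
    }

    // Advances the iterator and remembers the value it returned.
    mutating func next() -> Unicode.Scalar? {
        current = iterator.next()
        return current
    }
}

// Hypothetical parse function: expects buffer.current to already be the leading "t".
func parseTrue(_ buffer: inout UnicodeScalarParsingBuffer) -> Bool {
    let leading: Unicode.Scalar = "t"
    guard buffer.current == leading else { return false }
    for expected in "rue".unicodeScalars {
        guard buffer.next() == expected else { return false }
    }
    return true
}

var buffer = UnicodeScalarParsingBuffer("true,".unicodeScalars.makeIterator())
_ = buffer.next()                 // the caller reads the first scalar and dispatches on it
let matched = parseTrue(&buffer)  // the callee picks up from the same position
```

Because the buffer is passed inout, the caller and the callee share one position and one current value, so nothing beyond the buffer itself has to be threaded through each call.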

I find this to be a deficiency in the current implementation of GeneratorType. While it may be the case that you are always working in the same scope, it is also necessary at times to pass this context around. Once you start doing that, you're going to need that current value, otherwise you need to pass both generator and current – no one really wants to do that.

The full source for the json-swift library can be found over on GitHub; the parsing code is here.
