The Problem with Enums

In a bit of back and forth on Twitter about tuples, structs, and enums, Chris wrote up a good overview of why you might use one or the other. Check it out here: Tuples, Structs and Enums. It's a good read.

Here's my dirty secret: I really dislike enums for exactly the reason they exist. The entire purpose of an enum is to solve problems like the one Chris mentions: "In the previous example, we've used String to use the currency, but what if we want to restrict our program to only currencies we know about?"

I emphasized the problem, well, my problem with enums: a complete lack of extensibility in the case options.

If you build a library that handles currencies and model them with enums, you have forever locked your users into using the currencies you have explicitly allowed in your enum definition.

Take Chris' example:

enum Currency {
   case EUR
   case USD
   case YEN
}

If you, as the consumer of the library, want to add CAD to the set of supported currencies, you are out of luck.

You can get the source code and make changes, sure. However, then you need to manage two different versions unless you can get that change pushed back. This is not always possible though.

Whenever you make an enum, ask yourself the question: do I need to constrain this to a specific set of possibilities (Either is a good example), or do I need to constrain it to a certain class of possibilities? It might actually be more appropriate to use an interface instead.

An example of that would be the following:

private let usd = USD()
private let eur = EUR()
private let yen = YEN()

protocol CurrencyType {
    class var currency: Self { get }
    class var symbol: String { get }
}

final class USD : CurrencyType {
    class var currency: USD { return usd }
    class var symbol: String { return "$" }
}

final class EUR : CurrencyType {
    class var currency: EUR { return eur }
    class var symbol: String { return "€" }
}

final class YEN : CurrencyType {
    class var currency: YEN { return yen }
    class var symbol: String { return "¥" }
}

typealias Currency = (value: Int, currency: CurrencyType)     // or as a struct

let fiveDollars = Currency(5, USD.currency)    // named so it does not clash with the private `usd` above

There are many other ways to model what I did above; this is just an illustrative example.
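One of those other ways, sketched here in current Swift syntax rather than the beta syntax used above. Everything in it is an assumption made for illustration: the `code` property, the `Money` wrapper, and the consumer-defined `CAD` type are hypothetical names, not part of Chris' example or of any library. The point is only that a protocol constrains a class of possibilities while leaving the set open.

```swift
// Hypothetical sketch: a protocol describes what a currency is,
// without fixing the list of currencies up front.
protocol CurrencyType {
    static var code: String { get }
    static var symbol: String { get }
}

// Currencies the (imaginary) library ships with.
struct USD: CurrencyType {
    static let code = "USD"
    static let symbol = "$"
}

struct EUR: CurrencyType {
    static let code = "EUR"
    static let symbol = "€"
}

// A consumer adds a currency without touching the library source.
struct CAD: CurrencyType {
    static let code = "CAD"
    static let symbol = "CA$"
}

// The currency is carried in the type, so amounts cannot be mixed by accident.
struct Money<C: CurrencyType> {
    let value: Int
    var description: String { return "\(C.symbol)\(value) \(C.code)" }
}

let fiveDollars = Money<USD>(value: 5)
let tenCanadian = Money<CAD>(value: 10)
print(fiveDollars.description)
print(tenCanadian.description)
```

With an enum, the `CAD` line would require editing the library; here it is just another conforming type.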

I really wish Swift had solved the problem of case extensions for enums. It would help in these cases where we really do want a nice set of possible values, but we would also like others to be able to extend those cases in their own code.

Tuples Are The New Struct – Revisited

Yesterday I wrote about thinking about using tuples in place of your dumb data types. However, my example used a class of problem that was typically modeled using an Either<T,U> type. That, understandably, added some confusion that was not intended.

Today, let's instead take a look at an example that is hopefully a little less contentious: the Point type (just two-dimensional).

At a quick glance, I see the following as ways in which we might model the Point type as a tuple, struct, class, or even as an enum.

Here are some sample uses for each of them:

Tuple

typealias Point = (x: Int, y: Int)

let p = Point(2, 6)                   // Point(x: 2, y: 6) is also valid
println("(x, y) = (\(p.x), \(p.y))")

Struct

struct Point {
    var x: Int
    var y: Int
}

let p = Point(x: 2, y: 3)
println("(x, y) = (\(p.x), \(p.y))")

Class

class Point {
    var x: Int
    var y: Int

    init(x: Int, y: Int) {
        self.x = x
        self.y = y
    }
}

let p = Point(x: 2, y: 3)
println("(x, y) = (\(p.x), \(p.y))")

Enum

enum Point {
    case Components(Int, Int)
}

let p = Point.Components(2, 3)
switch p {
case let .Components(comp):
    println("(x, y) = (\(comp.0), \(comp.1))")
}

Now, each of the above approaches has its positives and negatives. However, to me, the tuple has all of the right behavior out of the box. The enum-based approach is the most verbose, and I'm unclear on any distinct advantage it has over the tuple and struct/class options.

If you're anything like me, you might tend to write your code in stages:

  1. There's the initial prototyping and scaffolding to make sure your thoughts apply to code.
  2. Then the roughing in with types and better names.
  3. Finally we get to the fleshed out public API surface.

Along the way there is a lot of back and forth between the stages. I tend to start with the least amount of code so that it's easier to throw away. So, in fleshing out my API, the Point might remain a tuple throughout all of the stages.

Also, by starting with a tuple, I have to specifically ask myself the question: do I really need to add this function or private data here? Is there a better way to model this?

Don't forget about the tuple. Of course, your mileage may vary.

Tuples Are The New Struct

I published a "revised" look at this that uses a less controversial example as people were getting stuck on this example being an Either<T,U> (enum based) instead of looking at structs vs. tuples.

I've been playing around with using named tuples instead of structs for pure data types. One such use case was in returning errors from a function.

Let's say we have a function foo and we want to return an Int or an Error?. There are lots of ways to model this: structs, enums, tuples, inout parameters, global error state, etc…

Of course, each has its positives and negatives. However, I want to look at how the struct and the tuple implementations differ.

The function definition will look like this:

func foo() -> IntOrError {}

The struct would look like this (Error is my own custom error type, you could use NSError too):

struct IntOrError {
    let value: Int
    let error: Error?
}

The tuple would look like this:

typealias IntOrError = (value: Int, error: Error?)

Usage of the two looks exactly the same:

let result = foo()
if let error = result.error {
    println("Uh oh! An error occurred")
}
else {
    println("The value is: \(result.value)")
}

So should you use tuples? Well, I don't know. =) There are two big disadvantages of the tuple approach:

  1. No generics support; I cannot create: typealias ErrorOf<T> = (value: T, error: Error?). I consider this a deficiency in the generics system of Swift though.
  2. No ability to add functionality to the type itself; essentially no OOP-style programming. This also extends to specific initializers.

However, I see some benefits for the tuple approach too:

  1. Much faster prototyping while maintaining good readability of code
  2. Currently (as of Beta 6), tuples as return types perform much better than structs. This should get better though. Update: I think the root cause of this was due to rdar://18111139: Swift: Optimizer Perf Bug with inline/external class definitions.
  3. "Upgrading" to a full-blown struct requires only updating the definition of your code (beware, this would be a breaking change for code linking your code though).
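For comparison, here is roughly what that "upgrade" to a struct could look like once generics and initializers are needed, which are the two disadvantages listed for the tuple. This is a sketch in current Swift syntax, not code from the post: `ParseError` is a hypothetical stand-in for the custom Error type, `half` is an invented example function, and `value` is made optional here so the failure case needs no dummy value.

```swift
// Hypothetical error type standing in for the post's custom Error.
struct ParseError {
    let code: Int
    let message: String
}

// The generic struct counterpart of the tuple alias that could not be expressed.
struct ErrorOf<T> {
    let value: T?
    let error: ParseError?

    // Specific initializers, something a bare tuple cannot carry.
    init(_ value: T) {
        self.value = value
        self.error = nil
    }

    init(error: ParseError) {
        self.value = nil
        self.error = error
    }
}

// Invented example: halves even numbers, reports an error for odd ones.
func half(_ n: Int) -> ErrorOf<Int> {
    guard n % 2 == 0 else {
        return ErrorOf(error: ParseError(code: 1, message: "\(n) is odd"))
    }
    return ErrorOf(n / 2)
}

let ok = half(10)
let bad = half(7)
```

Call sites that only read `.value` and `.error` look the same as they do with the named tuple.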

If all you need is a dumb data type, try out the named tuple!

Error Handling – Take Two

Make sure to see the update below for a bit more information on the causes of memory usage.

In my seemingly never-ending and not quite achievable goal of beating NSJSONSerialization in both performance and memory utilization for parsing a JSON string, I've come across another pearl of wisdom with regard to Swift: ignore my Error Handling in Swift piece and others that recommend using the Either<T,U> as in other languages (at least for the current version of Swift, as of Beta 6).

I have been able to get my parsing speed to within 0.01s of NSJSONSerialization; while my goal is domination, I also am pragmatic (at times). Next up was memory utilization. Unfortunately, I was (and still am), far behind the total memory usage of the ObjC version. So like a good little software engineer, I fired up Instruments and started investigating what I saw.

When you investigate memory usage, there are three primary concerns that we need to watch out for:

  1. Total amount of memory used over the life of the scenario
  2. Total amount of memory ever actually in use at any given time
  3. Highest spike in memory used over the life of the scenario

Instruments visualizes this data pretty nicely for us:

(Screenshot of Instruments with multiple memory profiles visualized in the editor.)

The picture above is showing the results of the NSJSONSerialization code path. My implementation actually has a better "total persistent bytes" overall of 1.92MB vs. the 2.51MB shown above. However, the total memory used in mine was about 6.5MB while we see that NSJSONSerialization only used about 4.7MB.

Taking a Dive

There are a couple of approaches we can take to tracking down and solving memory issues:

  1. Examine the code
  2. Examine the profiles

Unfortunately, the profiles were not really helping me track down the root cause of the issues, but they were illustrative in helping me understand that I was creating many, many copies of objects all over the place.

Examining the Error type

I first took a quick look over my code to see if I could see anything obvious. There was one thing I noticed right off the bat: FailableOf<T> stores an Error object in its Failure case. Well, the Error type is a struct with three values in it, and since I return a FailableOf<T> in all of my parsing calls, I'm going to need to return a copy of that Error, even if it's empty, all of the time.

Knowing that the Error object is going to be copied so many times throughout the call chain, we can instead mark the Error type as public final class.

When we do this, the total memory usage drops to 6.06MB.

The other option is to keep the struct and create a backing class to store all of the data. That looks like this:

public struct Error {
    public typealias ErrorInfoDictionary = [String:String]

    class ErrorInfo {
        let code: Int
        let domain: String
        let userInfo: ErrorInfoDictionary?

        init(code: Int, domain: String, userInfo: ErrorInfoDictionary?) {
            self.code = code
            self.domain = domain
            self.userInfo = userInfo
        }
    }

    var errorInfo: ErrorInfo

    public var code: Int { return errorInfo.code }
    public var domain: String { return errorInfo.domain }
    public var userInfo: ErrorInfoDictionary? { return errorInfo.userInfo }

    public init(code: Int, domain: String, userInfo: ErrorInfoDictionary?) {
        self.errorInfo = ErrorInfo(code: code, domain: domain, userInfo: userInfo)
    }
}

However, that seems to be a lot more complicated than simply doing this:

public final class Error {
    public typealias ErrorInfoDictionary = [String:String]

    public let code: Int
    public let domain: String
    public let userInfo: ErrorInfoDictionary?

    public init(code: Int, domain: String, userInfo: ErrorInfoDictionary?) {
        self.code = code
        self.domain = domain
        self.userInfo = userInfo
    }
}

And since all my values are immutable to begin with, I'm not sure why I would choose the struct approach for this problem.

Investigating the FailableOf<T>

Since I'm having copying issues with the Error (gist) type, it is only logical to look at the FailableOf<T> type next. Instead of using my JSON parser as the test ground, I decided to create a little sample app that would loop many times calling a function that returned the following types:

  • FailableOf<T> – my implementation of the Either<T, U> concept (gist)
  • Either<T, U> – a more generic solution to my FailableOf<T> problem (gist)
  • (T, Error) – a tuple that contains the two pieces of information

The sample program is straightforward:

func either<T>(value: T) -> Either<T, Error> {
    return Either(left: value)
}

// test: either
for var i = 0; i < 100_001; i++ {
    let r = either(i)
    if (r.right != nil) {
        println("error at \(i)")
    }
}

Each of the different constructs has the same form (gist).

This is where I found something interesting: both the FailableOf<T> and Either<T, U> tests take up about 3MB of memory, while the (T, Error) tests only take 17KB. Clearly, there have to be some missed compiler optimizations in Swift. Regardless, the tuple approach is clearly the one we should be taking, at least for now, if we really care about every ounce of memory.

In order to work with it better in my code, I create a typealias and use named tuples:

/// The type that represents the result of the parse.
public typealias JSParsingResult = (value: JSValue?, error: Error?)

After updating all of the JSON.parse code to return this new type, memory usage is down to 5.33MB!! Simply by switching from a struct-based approach to this named tuple approach (which I think is just as good, frankly), I was able to shave off another 700KB of unnecessary memory creation.

I'm not done investigating other opportunities right now, but things are starting to look really promising here.

UPDATE: After some more investigating, I realized why the enum case was causing such memory bloat: we need to box all of the types that get stored in them until Swift implements the proper generic support for an enum.
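As a rough illustration of what that boxing means, here is a sketch in current Swift syntax. It is a guess at the shape of the pattern, not the actual FailableOf<T> from the gist: `Box`, `ParseError`, and the case names are all assumptions. Current Swift no longer needs the manual box for generic enum payloads, so it is kept here only to mimic the workaround being described.

```swift
// Hypothetical reconstruction of the boxing workaround; names are illustrative.
final class Box<T> {
    let value: T
    init(_ value: T) { self.value = value }
}

// Stand-in for the post's custom Error type.
struct ParseError {
    let code: Int
}

enum FailableOf<T> {
    case success(Box<T>)      // every success allocates a Box on the heap
    case failure(ParseError)

    var value: T? {
        if case .success(let box) = self { return box.value }
        return nil
    }
}

// The same result expressed as a named tuple: no extra allocation per call.
typealias TupleResult<T> = (value: T?, error: ParseError?)

func boxed(_ i: Int) -> FailableOf<Int> { return .success(Box(i)) }
func tupled(_ i: Int) -> TupleResult<Int> { return (i, nil) }

var boxedSum = 0
var tupledSum = 0
for i in 0..<1_000 {
    if let v = boxed(i).value { boxedSum += v }    // 1,000 Box allocations
    if let v = tupled(i).value { tupledSum += v }  // none
}
```

Both loops compute the same sum; the difference is only in how many short-lived objects are created along the way.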

Swift Proposal: protected

There has been much said about protected and how Swift needs, I mean, NEEDS, the "protected" keyword. In fact, there has been so much ruckus about it that the Swift team wrote a blog entry on it: Access Control and protected.

While I wholeheartedly agree that the protected keyword is a terrible idea from an inheritance perspective, the intent of the notion has great value. I'm going to define the intent as this:

The ability to separate concerns of implementors and consumers.

If we focus the definition, it's really not that hard to imagine how we can extend the existing public, internal, and private access modifiers that Swift already offers with a fourth option: protected.

I propose that we could enable the following:

  1. Introduce the protected keyword
  2. Modify the import rules to include a protected modifier

The rule for the protected keyword would be quite simple:

Protected access enables entities to be used within any source file from their defining module, and also in a source file from another module that imports the defining module with the protected modifier. You typically use protected access to specify the public interface for those wishing to extend the functionality of your types, while hiding that functionality from the consumers of your API.

An example would be this:

Defined in module FooMod

public struct Foo {
    public func foo() {}
    protected func bar() {}

    public var fizzy: Int
    protected var fuzzy: Int
}

protected func MakeSuperFoo() -> Foo {}

Then, in another module, you would have to use the following in order to gain access to the protected members.

import FooMod                  // Brings in all of the public members
import protected FooMod        // Brings in all of the protected members

let f: Foo = MakeSuperFoo()
f.foo()
f.bar()

I think this fits into the existing access control mechanism perfectly and provides a way to provide the high-level intent of what people are asking for with protected.

The Reasoning Behind the Choices

Sharing code in public is interesting in many ways. Sometimes the choices we make about design are somewhat arbitrary as there are many options before us. Sometimes those choices are deliberate and methodical with a well reasoned approach on how you got there. Then there are those times where you just do something dumb…

If you’re going to be willing to share your code for the world to see, you really need to be OK with being wrong about something and learning from it. But you also need to know how to stick to your guns when you think you are doing things right. This post is going to be a bit about both, using my latest JSON parsing articles as illustrations: Generators Need a current Value and Improving Code with Generics.

The primary goal of the code that I wrote was to enable the ability to parse through a JSON string and create a JSON object representation from that string. However, in that article, I presented a much lower level view of the problem and framed it in such a way as to remove all of the context on why and how I reached that decision.

Wes Campaigne posted some great feedback over on GitHub about the approach I took to the problem.

I thought the whole

buffer.next()
while buffer.current != nil {
    if let unicode = buffer.current { // ... somewhere, buffer.next() is called

dance was kind of ugly: you’re dealing with the overhead of using a generator, but receiving none of the benefits it provides (e.g. for in loops). Also, using a struct for your BufferedGenerator seems odd — you end up using a class as a backing store anyway, and having it as a struct means using inout parameters all over the place. There’s a discussion on the dev forums that argues the case why GeneratorTypes should, in general, just be reference types.

Wes makes some great points, and his RewindableGenerator<S> is a very good class that solves the specific problem I was looking at better (both in terms of the applicability of the use cases and in how the code that consumes it should work).

The only real problem, which I forgot when I first looked at his solution, was that the performance difference between using the GeneratorType and the Index types for Strings is fairly significant, nearly a 2.5x slowdown.

When I was first solving this problem, I looked at the following approaches:

  1. String.Index based approach grabbing individual characters. This led me to find out how String works with unicode combining characters.
  2. Then I tried using String.UTF8View.Index; after all, they are both indexes, so it should be a fairly easy change. Well… turns out that String.Index is a BidirectionalIndexType but String.UTF8View.Index is only a ForwardIndexType. At this point, I realized that I basically needed to re-write a significant portion of my algorithm. I did so making sure that all of my previous() calls were updated; this also required some fairly ugly hacks to get everything to work. Then I found out two new things after more investigation in the topic:
    1. Performance of the GeneratorType construct was significantly faster than the Index based construct.
    2. There is a better view into the string: String.UnicodeScalarView. With the String.UTF8View, I had to create strings by passing a pointer to a UInt8 array that I had to keep track of while parsing the string. It was fairly ugly, but it worked. =)

Both of these led me to the realization that another parser re-write was coming… however, this time, I knew I needed to use GeneratorType and I knew that I wanted to get rid of a lot of the hacks I did. This was the start of the Generators Need a current Value and Improving Code with Generics posts.

Well, I was able to get rid of some of my hacks, but then Wes’ comments came. I already wasn’t very pleased with the implementation of the JSON parser as it still had some hacks in it and some somewhat cryptic logic, but hey, it worked! But as I thought about Wes’ comments some more, I knew there was a better way.

So I started integrating Wes’ solution into my parsing code. But, I had already forgotten a lesson I had learned earlier: Index based approaches suck at perf, big time!

At this point, I had already re-written the parsing to provide some significantly better error messages (thanks in part to using for (idx, scalar) in enumerate(generator) {}, which was now possible due to Wes’ updates) and a much cleaner logic flow. However, I wanted to get my parse times back down.

That’s when I came up with this class: ReplayableGenerator

final public class ReplayableGenerator<S: SequenceType> : GeneratorType, SequenceType {
    typealias Sequence = S

    private var firstRun = true
    private var usePrevious = false
    private var previousElement: Sequence.Generator.Element? = nil
    private var generator: Sequence.Generator

    public init(_ sequence: Sequence) {
        self.generator = sequence.generate()
    }

    public func next() -> Sequence.Generator.Element? {
        switch usePrevious {
        case true:
            usePrevious = false
            return previousElement

        default:
            previousElement = generator.next()
            return previousElement
        }
    }

    public func replay() {
        usePrevious = true
        return
    }

    public func generate() -> ReplayableGenerator {
        switch firstRun {
        case true:
            firstRun = false
            return self

        default:
            self.replay()
            return self
        }
    }

    public func atEnd() -> Bool {
        let element = next()
        replay()

        return element == nil
    }
}

I’ve been experimenting with using switch-statements over if-statements; I’m greatly liking their readability in many cases. However, there does seem to be a bug where case true and case false do not create an exhaustive list, so I use default.

These were the constraints:

  1. Index based iterators and lookups are significantly slower than GeneratorType and for-loop; they cannot be used.
  2. The GeneratorType is only a forward-moving iterator.
  3. There is no ability to inspect the previous character in the construct. This is vital because when we parse values, often times we need to inspect the next value to determine if we stop parsing the current value. However, once we do this, we are in a bit of a situation as the parser really needs to start parsing from that previous character because it’s going to call next() and skip over the just visited character. Bad mojo.

This class provided everything I needed, while the semantics of it also allowed me to create a much better parse(). The integration was also easy as I simply needed to replace the previous() calls with a replay() call.
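To make the replay() idea concrete, here is a minimal sketch written against current Swift, using IteratorProtocol and Sequence rather than the beta-era GeneratorType and SequenceType. `ReplayableIterator` is a trimmed-down stand-in for the class above, and `parseDigits` is a hypothetical helper invented for this example; neither is taken from json-swift.

```swift
// Trimmed-down, current-Swift stand-in for ReplayableGenerator.
final class ReplayableIterator<S: Sequence>: IteratorProtocol {
    private var iterator: S.Iterator
    private var previous: S.Element?
    private var usePrevious = false

    init(_ sequence: S) {
        iterator = sequence.makeIterator()
    }

    func next() -> S.Element? {
        if usePrevious {
            // Hand back the element that was "un-read" by replay().
            usePrevious = false
            return previous
        }
        previous = iterator.next()
        return previous
    }

    // Makes the following next() call return the last element again.
    func replay() {
        usePrevious = true
    }
}

// Hypothetical helper: consumes a run of ASCII digits and leaves the
// terminating character unread for whoever parses next.
func parseDigits(_ scalars: ReplayableIterator<String.UnicodeScalarView>) -> String {
    let zero: Unicode.Scalar = "0"
    let nine: Unicode.Scalar = "9"
    var digits = ""
    while let scalar = scalars.next() {
        if scalar >= zero && scalar <= nine {
            digits.unicodeScalars.append(scalar)
        } else {
            scalars.replay()   // we looked one character too far; put it back
            break
        }
    }
    return digits
}

let scalars = ReplayableIterator("123,45".unicodeScalars)
let first = parseDigits(scalars)
let separator = scalars.next()
let second = parseDigits(scalars)
```

The comma is seen by `parseDigits` but is still the next element the caller reads, which is exactly the "look at the next value, then start from it" situation described in constraint 3.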

With this implementation, I was able to get my performance back down to 0.25s vs. 0.17s (JSON.parse vs. NSJSONSerialization).

Remember, often times people are able to look at a problem you have been working on and shed new light on the situation. While Wes’ solution was not applicable to my situation, his thought process on why his implementation was better was superbly helpful in rethinking the semantics of what I was doing. Ultimately, I’m fairly happy with the results of the parser now… except for that perf! =)

So thanks Wes for helping me think about the problem better. Oh, and you can judge my parsing code here: JSValue.Parsing.

Improving Code with Generics

Update: I updated the post to make use of S: SequenceType instead of T: GeneratorType; it's a cleaner API.

Yesterday, I wrote about how we needed to build the following type:

struct UnicodeScalarParsingBuffer {
    var generator: String.UnicodeScalarView.Generator
    var current: UnicodeScalar? = nil

    init(_ generator: String.UnicodeScalarView.Generator) {
        self.generator = generator
    }

    mutating func next() -> UnicodeScalar? {
        self.current = generator.next()     // store into the declared `current` property
        return self.current
    }
}

When we look at the code above, we can observe a few things:

  1. The code is tightly coupled to String.UnicodeScalarView.Generator
  2. The code is tightly coupled to UnicodeScalar
  3. The code loosely conforms to GeneratorType

We can make this code better and more suitable for other instances of GeneratorType; or to put it another way, generic.

Let's start from bullet #3; we should be conforming to the GeneratorType protocol because this really is simply another type of generator.

The definition starts to take shape like this:

struct BufferedGenerator : GeneratorType {
    var generator: GeneratorType
    mutating func next() -> UnicodeScalar?
}

Bullets #1 and #2 are two sides of the same coin, as Generator and Generator.Element are really defined from the same construct.

The interface now looks more like this:

struct BufferedGenerator<S: SequenceType> : GeneratorType {
    typealias Sequence = S

    var generator: Sequence.Generator
    var current: Sequence.Generator.Element? = nil

    init(_ sequence: Sequence) {
        self.generator = sequence.generate()
    }

    mutating func next() -> Sequence.Generator.Element? {
        self.current = generator.next()
        return self.current
    }
}

This implementation now lets us use any type of SequenceType as a BufferedGenerator.

We use SequenceType as the generic constraint instead of GeneratorType because it creates a better ownership model for the underlying generator. The call to next() should only be done from a single generator; this code puts that burden on BufferedGenerator<S> instead of the caller.
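A quick usage sketch of the same idea, translated to current Swift (Sequence and IteratorProtocol in place of SequenceType and GeneratorType). `BufferedIterator` is a renamed stand-in for the BufferedGenerator<S> above, and the sample sequences are made up for illustration.

```swift
// Current-Swift translation of the generic BufferedGenerator idea.
struct BufferedIterator<S: Sequence>: IteratorProtocol {
    private var iterator: S.Iterator
    private(set) var current: S.Element? = nil

    // Taking the sequence (not an iterator) means this type owns the
    // only iterator, so nobody else can advance it behind our back.
    init(_ sequence: S) {
        iterator = sequence.makeIterator()
    }

    mutating func next() -> S.Element? {
        current = iterator.next()
        return current
    }
}

// Works with an array of Ints...
var numbers = BufferedIterator([10, 20, 30])
_ = numbers.next()
let firstNumber = numbers.current

// ...and equally with the unicode scalars of a String.
var letters = BufferedIterator("ab".unicodeScalars)
_ = letters.next()
_ = letters.next()
let secondLetter = letters.current
```

Nothing in the type mentions UnicodeScalar any more, yet the parser's use case is still covered by the second example.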

Generics can be a great way to reduce type information that simply doesn't need to be there. In this case, there was no reason that the original UnicodeScalarParsingBuffer needed to be tied to a specific type. Generics can also help greatly in code reuse, which is almost always a good thing.

The full source for the json-swift library can be found over on GitHub.

Generators Need a current value

When you build a parser, you need the ability to scan through your tokens to build up your output. Swift offers us a few different constructs for iteration; I needed the one that would be the fastest. After all, I'm building a parser!

I wrote a small test suite to test the various iteration types, which essentially boils down to two options for Strings:

  1. The traditional for-loop that uses an index value
  2. The GeneratorType based approach

Index-based for-loop

Here's the code for this one:

var string = ""

let scalars = self.largeJSON.unicodeScalars
self.measureBlock() {
    for var idx = scalars.startIndex; idx < scalars.endIndex; idx = idx.successor() {
        let scalar = scalars[idx]
        scalar.writeTo(&string)
    }
}

It's pretty straightforward; simply start at startIndex and traverse your way through the string until you hit endIndex. There are a couple of gotchas though, the most significant being that Swift doesn't allow Int-based indexing – all of the types have their own special indexing type. The thing to watch out for: not all of them are bidirectional.

GeneratorType Approach

var string = ""

self.measureBlock() {
    for scalar in self.largeJSON.unicodeScalars {
        scalar.writeTo(&string)
    }
}

This one is fairly simple as well: simply loop through all of the unicode values. We can also write this loop in a slightly different way:

var string = ""

self.measureBlock() {
    var generator = self.largeJSON.unicodeScalars.generate()
    for var scalar = generator.next(); scalar != nil; scalar = generator.next() {
        scalar?.writeTo(&string)
    }
}

In my testing, I found that the GeneratorType-based approach was about 18% faster. This is significant enough for me to use it. =)

Implementing the Parsing

Next up is actually parsing the JSON string. The basic idea is to look for specific tokens and call into one of these methods:

  1. parseObject – used to parse out a JSON object (e.g. dictionary)
  2. parseArray – used to parse an array
  3. parseNumber – used to parse a number value
  4. parseString – used to parse out a string, also used when parsing keys from a dictionary
  5. parseTrue – used to parse the boolean value true
  6. parseFalse – used to parse the boolean value false
  7. parseNull – used to parse the literal value null

I think that about covers the basics of what I need. And here comes the problem… each of these needs one or more of the following pieces of information:

  1. The current generator value so that increments can be done
  2. The current character the generator is pointing to
  3. The character used at the start of the parse call

When we look at the API for GeneratorType, we find that it only supports next(). Hmm… that's not going to be sufficient. So now we are left with two choices:

  1. Pass the current unicode token around with our generator instance, or
  2. Package up the generator and the current unicode token into a single type

To me, this is a no-brainer. As soon as we introduce this coupling, it is best to package up the dependencies and maintain that state with a single value.

Ideally, we would simply be able to extend the GeneratorType instance for String.UnicodeScalarView, however, we cannot extend types with stored properties, so we are left with creating an entirely new type to box this functionality.

To work around this limitation, I created the following type:

struct UnicodeScalarParsingBuffer {
    var generator: String.UnicodeScalarView.Generator
    var current: UnicodeScalar? = nil

    init(_ generator: String.UnicodeScalarView.Generator) {
        self.generator = generator
    }

    mutating func next() -> UnicodeScalar? {
        self.current = generator.next()     // store into the declared `current` property
        return self.current
    }
}

I find this to be a deficiency in the current implementation of GeneratorType. While it may be the case that you are always working in the same scope, it is also necessary at times to pass this context around. Once you start doing that, you're going to need that current value, otherwise you need to pass both generator and current – no one really wants to do that.

The full source for the json-swift library can be found over on GitHub; the parsing code is here.

Combining Characters

In my JSON Parsing post, I talked about an issue I was having with a particular character set:

let string = "\"\u{aaef}abcd"
countElements(string)           // 5
countElements(string.utf8)      // 8

Well, it turns out that \u{aaef} is a unicode combining character that modifies the character before it. There are some combining characters that create a single character, but there are also combining characters that still result in multiple visible characters, as seen above.

However, it seems there is a view into the string that gave me what I wanted:

let string = "\"\u{aaef}abcd"
countElements(string.unicodeScalars)     // 6

If we take a look at a few other examples, we can see that the unicodeScalars seems to give us the full make-up of the unicode values that are making up the string.

let single = "è"    // \u{e8}
for scalar in single.unicodeScalars {
    println("\(scalar) (\(scalar.value))")      // prints: è
}

let combined = "e\u{300}"
for scalar in combined.unicodeScalars {
    println("\(scalar) (\(scalar.value))")      // prints: e, `
}

Notice the difference in the two: the first is a single unicode value, the second is the letter "e" combined with the grave accent (`).
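The same difference shows up in the counts of each view. The sketch below uses the `count` property of current Swift rather than the beta's countElements; the variable names are only for this example.

```swift
let single = "\u{e8}"        // è as one precomposed scalar
let combined = "e\u{300}"    // e followed by a combining grave accent

// (characters, unicode scalars, UTF-8 bytes) for each string
let singleCounts = (single.count, single.unicodeScalars.count, single.utf8.count)
let combinedCounts = (combined.count, combined.unicodeScalars.count, combined.utf8.count)

print(singleCounts)         // (1, 1, 2)
print(combinedCounts)       // (1, 2, 3)

// String comparison treats the two spellings as the same text.
print(single == combined)   // true
```

Both strings are one character long, but only the unicodeScalars and utf8 views reveal that they are built differently.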

The only downside that I've run into with this approach is that it seems to be significantly slower than the UTF8-based approach I was using earlier.

JSON Parsing

As part of my diving into Swift I've been using JSON as one of my learning projects: json-swift. Continuing on that track, I took a look at what it would take to create a JSON parser using only Swift and no ObjC bridging. The results: ok, but lots of room for improvement.

One of the nice things about Swift is that it tries to abstract away all of the unicode information from you and create a nice, simple API for you to work with. Well, that's nice when it works, but there are cases that I ran into where it seemed to simply not be doing what I expected.

One such example:

let string = "\"ꫯabcd"
countElements(string)           // 5
countElements(string.utf8)      // 8

// The raw bytes:
34      // "
234     // these three bytes make up ꫯ
171
175
97      // a
98      // b
99      // c
100     // d

In case that character isn't showing up, this is the unicode character U+AAEF; I'm not sure of the name of it.

I do not know enough about unicode so I don't know all of the ins and outs of why the " is attached to the unicode character (probably something to do with it being only three bytes), so some of you might be: duh!. That's ok. =)

That's not the worst part though: evidently the value "ꫯ is a single character that is treated somewhat like a quote, so you have to escape it, hence: \"ꫯabcd.

I really wanted to try the String class out, so I had to use String.UTF8View as the mapping and compare each of the bytes and build up my strings manually using UnsafePointer and String.fromCString:

static func parseString(string: String.UTF8View, inout startAt index: String.UTF8View.Index, quote: UInt8) -> FailableOf<JSValue> {
    var bytes = [UInt8]()

    index = index.successor()
    for ; index != string.endIndex; index = index.successor() {
        let cu = string[index]
        if cu == quote {
            // Determine if the quote is being escaped or not...
            var count = 0
            for byte in reverse(bytes) {
                if byte == Token.Backslash.toRaw() { count++ }
                else { break }
            }

            if count % 2 == 0 {     // an even number means matched slashes, not an escape
                index = index.successor()

                bytes.append(0)
                let ptr = UnsafePointer<CChar>(bytes)
                return FailableOf(JSValue(JSBackingValue.JSString(String.fromCString(ptr)!)))
            }
            else {
                bytes.append(cu)
            }
        }
        else {
            bytes.append(cu)
        }
    }

    let info = [
        ErrorKeys.LocalizedDescription: ErrorCode.ParsingError.message,
        ErrorKeys.LocalizedFailureReason: "Error parsing JSON string."]
    return FailableOf(Error(code: ErrorCode.ParsingError, domain: JSValueErrorDomain, userInfo: info))
}

Since I do not know enough about unicode, I'm not sure if these are Swift bugs, limitations, or simply a lack in my own understanding.

Some of the problems I ran into:

  1. Unable to take a character from String and figure out the bytes that made it, so I have to use String.UTF8View.
  2. The String.UTF8View.Index is not Comparable, though it is Equatable; I was doing index < endIndex initially.
  3. The String.UTF8View.Index is forward indexing only; originally, part of my algorithm would also step backwards through part of the string.
  4. Performance of my algorithm is about 3x slower than the NSJSONSerialization algorithm: 0.4s vs. 0.12s to parse a roughly 688KB file. I still need to investigate the memory usage.

I'm going to keep playing with it, especially as the later betas come out. I may try a purely functional based algorithm as well and see how that plays out in Swift.
