Here's a design question for you Swifters out there.
I'm playing around with building a tokenizer that works based on a set of rules that you provide to it. From what I see, I have three basic design choices.
//
// Option 1: A tokenizer that manages its own cursor into the `ContentType`.
//
public protocol Tokenizer {
    typealias ContentType : CollectionType
    var rules: [(content: ContentType, offset: ContentType.Index) -> ContentType.Index?] { get }
    var content: ContentType { get }
    init(content: ContentType)
    mutating func next(index: ContentType.Index?) throws -> Token<ContentType>?
}
//
// Option 2: A tokenizer that passes the next index back to the user for the next call.
//
// NOTE: A tuple breaks the compiler so this type is needed: rdar://21559587.
public struct TokenizerResult<ContentType where ContentType : CollectionType> {
    public let token: Token<ContentType>
    public let nextIndex: ContentType.Index
    public init(token: Token<ContentType>, nextIndex: ContentType.Index) {
        self.token = token
        self.nextIndex = nextIndex
    }
}
public protocol Tokenizer {
    typealias ContentType : CollectionType
    var rules: [(content: ContentType, offset: ContentType.Index) -> ContentType.Index?] { get }
    var content: ContentType { get }
    init(content: ContentType)
    // HACK(owensd): This version is necessary because default parameters crash the compiler in Swift 2, beta 2.
    func next() throws -> TokenizerResult<ContentType>?
    func next(index: ContentType.Index?) throws -> TokenizerResult<ContentType>?
}
//
// Option 3: A mixture of option #1 and #2 where the tokenizer manages its own cursor location but
// does so by returning a new instance of the tokenizer value.
//
public protocol Tokenizer {
    typealias ContentType : CollectionType
    var rules: [(content: ContentType, offset: ContentType.Index) -> ContentType.Index?] { get }
    var content: ContentType { get }
    init(content: ContentType, currentIndex: ContentType.Index?)
    func next() throws -> Self?
}
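One note before digging into the trade-offs: none of the snippets above show Token. For the purposes of this post, think of it as something as simple as a range into the content (a throwaway sketch, not the real definition):

// A throwaway sketch of Token: just a range into the content it was matched from.
public struct Token<ContentType where ContentType : CollectionType> {
    public let range: Range<ContentType.Index>

    public init(range: Range<ContentType.Index>) {
        self.range = range
    }
}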
Option 1
The main problem I have with option #1 is that I'm in the business of managing some bookkeeping details. This has the terrible side-effect of requiring me to expose all of the details of that bookkeeping work in the protocol so that I can provide a default implementation of how this works when ContentType is a String. That's bad.
The other option is to create a struct that conforms to the protocol and provides an implementation for various ContentTypes. However, I want to reserve that approach for conforming types built around particular structures of data, like CSVTokenizer or JSONTokenizer.
However, this option has the benefit of being extremely easy to use: the caller doesn't need to maintain the nextIndex as in option 2, or new instances of the tokenizer as in option 3. Simply call next() and you get the expected behavior.
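For example, the option #1 call site really is about as small as it gets. This is only a sketch; CSVTokenizer is a hypothetical conforming type whose ContentType is String.CharacterView, and passing nil just means "continue from the tokenizer's own cursor":

// CSVTokenizer is hypothetical; the tokenizer does all of the index bookkeeping.
var tokenizer = CSVTokenizer(content: "name,age\nbob,42".characters)
while let token = try tokenizer.next(index: nil) {
    // do stuff with token...
}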
Option 2
This gets rid of all of the negatives of option #1, but it does add the burden of calling the next() function with the correct index. Of course, it also allows some additional flexibility. My big concern here is the additional code that each caller will need to write every time next() is called: they have to unpack the optional result, pull out the nextIndex value, and call next() with it.
Maybe this is OK. The trade-offs seem better, at least. And I can provide a default implementation for any Tokenizer that makes use of a String and String.Index.
The thing I like most about this approach is that each type of Tokenizer simply provides its own rules as a read-only property.
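To sketch what that default might look like (the bookkeeping itself doesn't actually depend on String, so the sketch is written against any ContentType): it assumes each rule returns the index just past whatever it matched, reuses the throwaway Token shape from earlier, and TokenizerError is made up for the sketch.

// A made-up error type, just for the sketch.
public enum TokenizerError : ErrorType {
    case NoMatchingRule
}

extension Tokenizer {
    // Walk the rules in order and hand back the first match along with the index
    // the next call should start from. A sketch, not the final code.
    public func next(index: ContentType.Index?) throws -> TokenizerResult<ContentType>? {
        let start = index ?? content.startIndex
        if start == content.endIndex { return nil }

        for rule in rules {
            if let end = rule(content: content, offset: start) {
                return TokenizerResult(token: Token<ContentType>(range: start..<end), nextIndex: end)
            }
        }

        // None of the rules matched the remaining content.
        throw TokenizerError.NoMatchingRule
    }
}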
Option 3
This kinda merges option #1 and option #2 together; it's also my least favorite. I don't like all of the potential copying that needs to be done. It's not clear to me that this will be optimized away, especially under all use cases. However, I thought I should at least mention it…
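Roughly what the option #3 call site would look like (a sketch; it assumes the conforming type exposes the most recently matched token as a property, which the protocol above doesn't spell out):

// CSVTokenizer is hypothetical; every call hands back a brand new tokenizer value.
var tokenizer = CSVTokenizer(content: "name,age\nbob,42".characters, currentIndex: nil)
while let advanced = try tokenizer.next() {
    tokenizer = advanced
    // read the current token off of tokenizer, e.g. a currentToken property...
}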
Option 4
OK, there really is another option: create a struct Tokenizer that provides an init() allowing you to pass in the set of rules to use when matching tokens. I really don't like this approach much either. It turns handling rules for common tokenizer constructs like CSV and JSON into free-floating arrays of rules.
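Sketched out (the names here are made up), option #4 looks something like this:

// Option #4 sketch: one concrete struct, and the rules are just data the caller
// has to haul around.
public struct RuleBasedTokenizer<ContentType where ContentType : CollectionType> {
    public let content: ContentType
    public let rules: [(content: ContentType, offset: ContentType.Index) -> ContentType.Index?]

    public init(content: ContentType, rules: [(content: ContentType, offset: ContentType.Index) -> ContentType.Index?]) {
        self.content = content
        self.rules = rules
    }
}

// The "CSV tokenizer" is now just a free-floating array of closures.
let csvRules: [(content: String.CharacterView, offset: String.Index) -> String.Index?] = [ /* field, comma, newline rules... */ ]
let tokenizer = RuleBasedTokenizer(content: "name,age\nbob,42".characters, rules: csvRules)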
That feels wrong to me. A concrete implementation of a CSV Tokenizer seems like the better approach.
Wrapping Up
I'm leaning towards option #2 (in fact, that's what I have implemented currently). It seems to be working alright, though the call site is a little verbose.
guard let result = try tokenizer.next() else { /* bail */ }
// do stuff...

// Do it again! (a new name, since redeclaring `result` in the same scope won't compile)
guard let nextResult = try tokenizer.next(index: result.nextIndex) else { /* bail */ }
This is probably ok and it's likely to be done in a loop. It just feels like a lot of syntax and bookkeeping the caller needs to deal with.
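In a loop it at least collapses into something mechanical (same tokenizer as in the snippet above, assuming its content is String-backed so its indices are String.Index; still just a sketch):

var index: String.Index? = nil
while let result = try tokenizer.next(index: index) {
    // do stuff with result.token...
    index = result.nextIndex
}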
Anyhow, thoughts? Better patterns to consider?