Combining Characters

In my JSON Parsing post, I talked about an issue I was having with a particular sequence of characters:

let string = "\"\u{aaef}abcd"
countElements(string)           // 5
countElements(string.utf8)      // 8

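Well, it turns out that \u{aaef} is a Unicode combining character that modifies the character before it. Some combining characters merge with the previous character into a single visible character, but others still result in multiple visible characters, as seen above. Either way, Swift counts the base character and its combining character as one element, which is why countElements(string) reports 5 rather than 6.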

However, it seems there is a view into the string that gave me what I wanted:

let string = "\"\u{aaef}abcd"
countElements(string.unicodeScalars)     // 6
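
To see where that sixth element comes from, here's a rough sketch that lists each scalar's code point. It assumes a newer Swift toolchain than the rest of this post (so print instead of println), and scalarLabels is just a helper name I made up for illustration.

```swift
// Hypothetical helper: lists each Unicode scalar in a string as "U+XXXX".
func scalarLabels(_ text: String) -> [String] {
    return text.unicodeScalars.map { scalar in
        let hex = String(scalar.value, radix: 16, uppercase: true)
        // Left-pad to at least four hex digits, the usual code point notation.
        let padding = String(repeating: "0", count: max(0, 4 - hex.count))
        return "U+" + padding + hex
    }
}

let string = "\"\u{aaef}abcd"
print(scalarLabels(string))
// ["U+0022", "U+AAEF", "U+0061", "U+0062", "U+0063", "U+0064"]
```

The quote and \u{aaef} show up as two separate entries here, even though together they count as a single element of the string itself.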

If we take a look at a few other examples, we can see that the unicodeScalars view gives us the full sequence of Unicode values that make up the string.

let single = "è"    // \u{e8}
for scalar in single.unicodeScalars {
    println("\(scalar) (\(scalar.value))")      // prints: è (232)
}

let combined = "e\u{300}"
for scalar in combined.unicodeScalars {
    println("\(scalar) (\(scalar.value))")      // prints: e (101), then the grave accent (768)
}

Notice the difference between the two: the first is a single Unicode scalar, while the second is the letter "e" followed by the combining grave accent (\u{300}).
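
Side by side, the counts for the two spellings look like this. Again, this is only a sketch assuming a current Swift toolchain, where count replaces countElements and print replaces println. The == check at the end relies on Swift comparing strings by canonical equivalence.

```swift
let single = "\u{e8}"        // precomposed è
let combined = "e\u{300}"    // "e" followed by a combining grave accent

// Both are one visible character...
print(single.count, combined.count)                                // 1 1
// ...but they are built from different numbers of scalars and bytes.
print(single.unicodeScalars.count, combined.unicodeScalars.count)  // 1 2
print(single.utf8.count, combined.utf8.count)                      // 2 3
// String comparison uses canonical equivalence, so these match.
print(single == combined)                                          // true
```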

The only downside that I've run into with this approach is that it seems to be significantly slower than the UTF-8-based approach I was using earlier.
