Swift Resistance Explored

Evidently I struck a nerve with some people on this one, while others simply missed the entire point of the blog post. Let's revisit Swift Resistance and dig deeper into the problem.

I'm going to put my claim right up here so that it will not be missed:

Claim: DEBUG (that is, -Onone) builds of Swift can have vastly different performance characteristics depending on the Swift intrinsics and foundational types being used. Some of these choices can lead down a path where your DEBUG builds are all but useless during development.

I'll talk about the implications of this towards the end.

Goals

Problem Statement: Design an algorithm that fills in a buffer of pixel data with a gradient starting with green in the bottom right corner and turning into blue in the top left corner. RGB colors should be used with values in the range [0, 255].

This algorithm must be written in Swift. In addition, it is meant to be used in a game loop with a desired framerate of 30 FPS for the software rendered algorithm at a resolution of 960×540 (this is 1/8th the pixel throughput of the hardware accelerated target of 1920×1080@60Hz).
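To make that budget concrete, here's a quick back-of-the-envelope sketch (my own arithmetic, written in the same Swift style used throughout this post; it is not part of the original requirements):

// At 30 FPS we get roughly 33.3ms to fill 960 * 540 = 518,400 pixels.
let pixelsPerFrame = 960 * 540
let frameBudgetMs = 1000.0 / 30.0

// Pixel throughput relative to the hardware accelerated target:
// (1920 * 1080 * 60) / (960 * 540 * 30) = 8, hence the "1/8th" above.
let throughputRatio = (1920 * 1080 * 60) / (960 * 540 * 30)

println("budget: \(frameBudgetMs)ms for \(pixelsPerFrame) pixels, 1/\(throughputRatio) of the HW pixel rate")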

Additional Information: An implementation already exists in an ObjC program; we will take the algorithm from there and use it as a baseline for both performance and functionality. When rendered to screen, your image should look something like the image below:

A picture of a window with green-to-blue gradient squares.

In addition, the data must be processed sequentially; parallelization of this algorithm is not allowed.

Strategy

Given that an algorithm already exists, the first attempt should be a straight port that ignores Swift-specific language features. Once that port is working, it would be good to explore approaches that may be more natural to the language.

So the two approaches that will be used are:

  1. Use an UnsafeMutablePointer to create a buffer of memory that will be manipulated.
  2. Use an Array to act as the buffer so that we are not using "unsafe" Swift code.

ObjC Baseline

Here is the algorithm for the ObjC version of the code:

#import <Foundation/Foundation.h>
#import <mach/mach_time.h>

typedef struct {
    uint8_t red;
    uint8_t blue;
    uint8_t green;
    uint8_t alpha;
} Pixel;

typedef struct {
    Pixel *pixels;
    int width;
    int height;
} RenderBuffer, *RenderBufferRef;

RenderBufferRef RenderBufferCreate(int width, int height)
{
    assert(width > 0);
    assert(height > 0);

    RenderBufferRef buffer = malloc(sizeof(RenderBuffer));
    assert(buffer);

    buffer->pixels = malloc(width * height * sizeof(Pixel));
    assert(buffer->pixels);

    buffer->width = width;
    buffer->height = height;

    return buffer;
}

void RenderBufferRelease(RenderBufferRef buffer)
{
    if (buffer->pixels) {
        free(buffer->pixels);
        buffer->pixels = NULL;
    }

    buffer->width = 0;
    buffer->height = 0;

    free(buffer);
}

void RenderGradient(RenderBufferRef buffer, int offsetX, int offsetY)
{
    int offset = 0;
    for (int y = 0, height = buffer->height; y < height; ++y) {
        for (int x = 0, width = buffer->width; x < width; ++x) {
            Pixel pixel = { 0, y + offsetY, x + offsetX, 0xFF };
            buffer->pixels[offset] = pixel;
            ++offset;
        }
    }
}

int main(int argc, const char * argv[]) {
    uint64_t start = mach_absolute_time();

    RenderBufferRef buffer = RenderBufferCreate(960, 540);

    const int NUMBER_OF_ITERATIONS = 30;
    for (int i = 0; i < NUMBER_OF_ITERATIONS; ++i) {
        RenderGradient(buffer, i, i * 2);
    }

    RenderBufferRelease(buffer);

    uint64_t elapsed = mach_absolute_time() - start;
    printf("elapsed time: %fs\n", (float)elapsed / NSEC_PER_SEC);

    return 0;
}

This code was compiled as a command-line tool under two different optimization flags: -O0 and -Os. These are the default "debug" and "release" configs.

  • The timing output for -O0 (debug) was: 0.099769s
  • The timing output for -Os (release) was: 0.020427s

Both of these timings fall well within the target of 30Hz1 (even the debug build works out to roughly 0.099769s / 30 ≈ 3.3ms per frame against a ~33.3ms budget).

Swift Implementations

We already know that we are going to need multiple algorithms for the Swift version, so it's important to set up our test harness in a reusable way. One thing to note about Swift is that while types and functions can be private to a file, their names still collide within the module. This means that each test implementation we write will need to be wrapped in a function so it can declare its own nested types.
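As a minimal illustration of that wrapping approach (the test names and struct layouts here are hypothetical, just to show the shape):

func testA(iterations: Int) -> Float {
    // This Pixel is scoped to testA and cannot collide with any other test's Pixel.
    struct Pixel { var red: Byte; var green: Byte; var blue: Byte; var alpha: Byte }
    // ... build a buffer, render, and return the elapsed seconds ...
    return 0.0
}

func testB(iterations: Int) -> Float {
    // A different layout under the same name, safely scoped inside testB.
    struct Pixel { var blue: Byte; var green: Byte; var red: Byte; var alpha: Byte }
    // ... build a buffer, render, and return the elapsed seconds ...
    return 0.0
}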

The first thing we'll do is create a command-line tool project in Swift and create two schemes (one for debug, one for release) to make timing easier.

Ok, let's start with what our test rig will look like:

import Foundation

let NUMBER_OF_ITERATIONS = 30

#if DEBUG
let BASELINE: Float = 0.099769
#else
let BASELINE: Float = 0.020427
#endif

func timing(samples: Int, iterations: Int, fn: (Int) -> Float) -> (avg: Float, stddev: Float, diff: Int) {
    var timings = [Float](count: samples, repeatedValue: 0.0)
    for s in 0..<samples {
        timings[s] = fn(iterations)
    }

    let avg = reduce(timings, 0.0, +) / Float(samples)

    let sums = reduce(timings, 0.0) { sum, x in ((x - avg) * (x - avg)) + sum }
    let stddev = sqrt(sums / Float(timings.count - 1))
    let diff = Int(((BASELINE - avg) / BASELINE * 100.0) + 0.5)
    return (avg, stddev, diff)
}

println("Swift Rendering Tests: \(NUMBER_OF_ITERATIONS) iterations per test")
println("---------------------")

This is a simple timing function that captures the average time and standard deviation across a number of samples, each of which runs the rendering function for the given number of iterations, and then reports the percentage difference from the ObjC baseline.

NOTE: You'll need to add a custom build flag (-D DEBUG) to your Swift compiler options so the #if DEBUG will match.

The Unsafe Swift Approach

The naïve approach is to simply copy and paste the ObjC code into main.swift and then make the updates necessary to get it to compile.

The result is this (remember, we need to wrap everything in a function so that the names do not collide as we add more tests that may want to use the same name for a struct but lay it out a little differently):

import Foundation

func unsafeMutablePointerTest(iterations: Int) -> Float {
    struct Pixel {
        var red: Byte
        var green: Byte
        var blue: Byte
        var alpha: Byte
    }

    struct RenderBuffer {
        var pixels: UnsafeMutablePointer<Pixel>
        var width: Int
        var height: Int

        init(width: Int, height: Int) {
            assert(width > 0)
            assert(height > 0)

            // alloc takes a count of Pixel elements, not a byte count like malloc.
            pixels = UnsafeMutablePointer<Pixel>.alloc(width * height)

            self.width = width
            self.height = height
        }

        mutating func release() {
            // Must match the element count passed to alloc above.
            pixels.dealloc(width * height)
            width = 0
            height = 0
        }
    }

    func RenderGradient(var buffer: RenderBuffer, offsetX: Int, offsetY: Int)
    {
        var offset = 0
        for (var y = 0, height = buffer.height; y < height; ++y) {
            for (var x = 0, width = buffer.width; x < width; ++x) {
                let pixel = Pixel(
                    red: 0,
                    green: Byte((y + offsetY) & 0xFF),
                    blue: Byte((x + offsetX) & 0xFF),
                    alpha: 0xFF)
                buffer.pixels[offset] = pixel;
                ++offset;
            }
        }
    }

    let start = mach_absolute_time()

    var buffer = RenderBuffer(width: 960, height: 540)

    for (var i = 0; i < iterations; ++i) {
        RenderGradient(buffer, i, i * 2);
    }

    buffer.release()

    return Float(mach_absolute_time() - start) / Float(NSEC_PER_SEC)
}

Here are the timings:

  • DEBUG: avg time: 0.186799s, stddev: 0.0146862s, diff: -86%
  • RELEASE: avg time: 0.0223397s, stddev: 0.00101094s, diff: -8%

The timing code is here (add this to main.swift):

let timing1 = timing(10, NUMBER_OF_ITERATIONS) { n in unsafeMutablePointerTest(n) }
println("UnsafeMutablePointer<Pixel> avg time: \(timing1.avg)s, stddev: \(timing1.stddev)s, diff: \(timing1.diff)%")

This is not looking too bad; both configurations are well within our target rate (the DEBUG average of 0.186799s works out to roughly 6.2ms per frame over 30 iterations). However, both builds are slower than their ObjC counterparts.

Takeaway: While this implementation is slower than the ObjC version, there is nothing blocking us at this time from being able to maintain a solid 30Hz in both debug and release builds. This is great news.

The "Safe" Swift Approach

UPDATE: I made a pretty obvious (well, easy to overlook, but still should have been obvious) and significant error in this section… of course that would happen in a post where I try to better show the issues. I used var instead of inout on the buffer… which, of course, creates a copy of the buffer array each time… yeah, it was that bad. Ironically, it didn't affect the debug performance, but it did help the release build.

The func RenderGradient(var buffer: RenderBuffer, offsetX: Int, offsetY: Int) should have been defined as: func RenderGradient(inout buffer: RenderBuffer, offsetX: Int, offsetY: Int).

The original implementation created a copy of the buffer each time, which left us with an empty buffer outside of the function call. Not what we wanted. In the end, this mistake only had two repercussions: incorrect functionality and a 2x regression on the release build. The debug build is still just as painfully slow.

Let me know if you spot any other mistakes.
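To make the var-versus-inout distinction concrete, here is a tiny sketch (hypothetical names, same Swift-era syntax as the rest of the post):

struct Buffer {
    var data: [Int]
}

// 'var' only gives the function a mutable local copy; the caller's value is untouched.
func fillByValue(var buffer: Buffer) {
    buffer.data[0] = 42
}

// 'inout' writes the mutated value back to the caller when the function returns.
func fillInPlace(inout buffer: Buffer) {
    buffer.data[0] = 42
}

var b = Buffer(data: [0, 0, 0])
fillByValue(b)      // b.data is still [0, 0, 0]
fillInPlace(&b)     // b.data is now [42, 0, 0]
println(b.data)     // prints [42, 0, 0]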

Now, there is another way that Swift lets us access a region of contiguous memory: arrays. After all, that is essentially what their semantics are. So let's give it a try:

import Foundation

func pixelArrayTest(iterations: Int) -> Float {
    struct Pixel {
        var red: Byte
        var green: Byte
        var blue: Byte
        var alpha: Byte
    }

    struct RenderBuffer {
        var pixels: [Pixel]
        var width: Int
        var height: Int

        init(width: Int, height: Int) {
            assert(width > 0)
            assert(height > 0)

            let pixel = Pixel(red: 0, green: 0, blue: 0, alpha: 0xFF)
            pixels = [Pixel](count: width * height, repeatedValue: pixel)

            self.width = width
            self.height = height
        }
    }

    func RenderGradient(inout buffer: RenderBuffer, offsetX: Int, offsetY: Int)
    {
        var offset = 0
        for (var y = 0, height = buffer.height; y < height; ++y) {
            for (var x = 0, width = buffer.width; x < width; ++x) {
                let pixel = Pixel(
                    red: 0,
                    green: Byte((y + offsetY) & 0xFF),
                    blue: Byte((x + offsetX) & 0xFF),
                    alpha: 0xFF)
                buffer.pixels[offset] = pixel;
                ++offset;
            }
        }
    }

    let start = mach_absolute_time()

    var buffer = RenderBuffer(width: 960, height: 540)

    for (var i = 0; i < iterations; ++i) {
        RenderGradient(&buffer, i, i * 2);
    }

    return Float(mach_absolute_time() - start) / Float(NSEC_PER_SEC)
}

The nice thing about this change is that it was super easy to do; it was really only a handful of changes. This method also has the benefit that I can never leak the buffer by forgetting to call dealloc on the UnsafeMutablePointer value.
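The harness call for this test follows the same pattern as before (the timing2 variable name and output label are my own):

let timing2 = timing(10, NUMBER_OF_ITERATIONS) { n in pixelArrayTest(n) }
println("[Pixel] array avg time: \(timing2.avg)s, stddev: \(timing2.stddev)s, diff: \(timing2.diff)%")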

Let's check out the timings:

  • DEBUG: avg time: 27.9754s, stddev: 0.0333994s, diff: -27939%
  • RELEASE: avg time: 0.0287606s, stddev: 0.00180078s, diff: -40%

What on earth just happened… 27.9 seconds to compute that loop above 30 times… this loop, right here:

int offset = 0;
for (int y = 0, height = buffer->height; y < height; ++y) {
    for (int x = 0, width = buffer->width; x < width; ++x) {
        Pixel pixel = { 0, y + offsetY, x + offsetX, 0xFF };
        buffer->pixels[offset] = pixel;
        ++offset;
    }
}

That's 960 * 540 (518,400) iterations per frame. This is unacceptable. This is what my previous blog post was entirely about. There is not a SINGLE argument that you can make where you can justify this performance characteristic. Not one.

Now, had this performance been like 300% slower, I might have been able to take that and the safety justifications… maybe. At least if it was only 300% slower I'd still be in a spot where I could run my game at 30Hz with reasonable head room left over for other logic to run.

But no… we are talking about this loop taking nearly 1 entire SECOND per frame to compute (27.9754s / 30 iterations ≈ 0.93s). It was nearly 28,000% slower…

Here's a screenshot of the profile with the NUMBER_OF_ITERATIONS dropped down to 2 (there's no way I was going to sit through another full 10 samples of 30 iterations).

A screenshot of the Instruments 'profile of death' for this monstrosity.

Ok… now, I'm left with a few choices:

  1. Say screw it and leave Swift on the table, convulsing from the seizure this basic loop setting a value in an array just caused.
  2. Go back to the UnsafeMutablePointer method, which was back in the land of all things sane, but then I get to risk all of my consumers forgetting to call release(). But you know… we're all adults here (in spirit at least), we should be able to handle our own memory. And really, if I'm going to need to resort to naked pointers, I might as well stick with C, yeah?
  3. Create another wrapper around the array so that I can use a backing array to keep track of the memory for me, but expose an unsafe pointer to that array.
  4. Say screw it with the non-optimized builds and live in the land of crap(pier) debugging, which completely breaks the logic flow of your algorithms. Yes, there is a time for this land, but that time is not in the beginning of your project when you are still prototyping, scaffolding, and shaping your program into what it will one day be.
Of these options, #3 is the worst choice, in my opinion. It's a choice where you explicitly said you want a "safe" array, but, in this part of the code, you're just going to go hog wild anyway. Still, withUnsafeMutableBufferPointer exists for a reason; I'm guessing that one of those reasons is because performance sucks.

Here's the code update for that option:

buffer.pixels.withUnsafeMutableBufferPointer { (inout p: UnsafeMutableBufferPointer<Pixel>) -> () in
    var offset = 0
    for (var y = 0, height = buffer.height; y < height; ++y) {
        for (var x = 0, width = buffer.width; x < width; ++x) {
            let pixel = Pixel(
                red: 0,
                green: Byte((y + offsetY) & 0xFF),
                blue: Byte((x + offsetX) & 0xFF),
                alpha: 0xFF)
            p[offset] = pixel
            ++offset;
        }
    }
}
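For context, here is a minimal sketch (my own reconstruction, in the same style as the earlier listings) of how that fragment slots into the array-based RenderGradient; I hoist width and height into locals so the captured buffer isn't touched inside the closure:

func RenderGradient(inout buffer: RenderBuffer, offsetX: Int, offsetY: Int)
{
    let width = buffer.width
    let height = buffer.height

    buffer.pixels.withUnsafeMutableBufferPointer { (inout p: UnsafeMutableBufferPointer<Pixel>) -> () in
        var offset = 0
        for (var y = 0; y < height; ++y) {
            for (var x = 0; x < width; ++x) {
                p[offset] = Pixel(
                    red: 0,
                    green: Byte((y + offsetY) & 0xFF),
                    blue: Byte((x + offsetX) & 0xFF),
                    alpha: 0xFF)
                ++offset
            }
        }
    }
}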

Let's check out the timings:

  • DEBUG: avg time: 1.18535s, stddev: 0.00684964s, diff: -1087%
  • RELEASE: avg time: 0.0402743s, stddev: 0.0018447s, diff: -96%

Ok… at least now I can't literally go to the bathroom, get a drink from the kitchen, come back to my computer, and still be waiting minutes for all the iterations to finish. However, it's still too slow in DEBUG mode (1.18535s / 30 ≈ 39.5ms per frame, over the ~33.3ms budget), and it's still twice as slow as the ObjC version in RELEASE mode.

Conclusion

OK, so let's be explicitly clear here: this post is not about how Swift is horrendously slow in the builds you'll be giving your customers. No, it's about how terribly slow and painful your life as a developer will be when trying to write any amount of Swift code that works on any reasonable amount of data stored in arrays. That said, you can also see that none of the Swift options are faster than, or even as fast as, the C version. And frankly, none of them are really that much clearer… but that's a different topic.

And don't tell me to open another Swift bug; I have. I have opened many Swift bugs since last WWDC. This post is a way for me to better reflect the current state of Swift to others. It's a way to let the people at Apple see the real impact developers like myself are feeling when trying to do even the most basic of things in Swift, and a way to make sure that when people do run into these issues, they won't have to bang their heads against the wall trying to figure out how to solve the problem.

Here is the source for both the Swift and the ObjC versions: SwiftResistance.zip. I release it all in the public domain; do whatever you want with it.

  1. Hz, or hertz, is simply a measurement of cycles per second.