Performance Xcode7 Beta 2

Update June 27th @ 3:21 PM

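I was able to squeeze out some more performance by hoisting a few constants out of the loops: the (0, 1, 2, 3) increment vector and its sum with the x offset are now computed once, before the loops, instead of on every pass through the inner loop.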

Here's the updated code:

func RenderGradient(inout buffer: RenderBuffer, offsetX: Int, offsetY: Int) {
    buffer.pixels.withUnsafeMutableBufferPointer { (inout p: UnsafeMutableBufferPointer<Pixel>) -> () in
        var offset = 0

        let yoffset = int4(Int32(offsetY))
        let xoffset = int4(Int32(offsetX))

        let inc = int4(0, 1, 2, 3)
        let blueaddr = inc + xoffset

        for var y: Int32 = 0, height = buffer.height; y < Int32(height); ++y {
            let green = int4(y) + yoffset

            for var x: Int32 = 0, width = buffer.width; x < Int32(width); x += 4 {
                let blue = int4(x) + blueaddr

                // If we had 8-bit operations above, we should be able to write this as a single blob.
                p[offset++] = 0xFF << 24 | UInt32(blue.x & 0xFF) << 16 | UInt32(green.x & 0xFF) << 8
                p[offset++] = 0xFF << 24 | UInt32(blue.y & 0xFF) << 16 | UInt32(green.y & 0xFF) << 8
                p[offset++] = 0xFF << 24 | UInt32(blue.z & 0xFF) << 16 | UInt32(green.z & 0xFF) << 8
                p[offset++] = 0xFF << 24 | UInt32(blue.w & 0xFF) << 16 | UInt32(green.w & 0xFF) << 8
            }
        }
    }
}

And the new timings with this update:

Language: Swift, Optimization: -O, Samples = 10, Iterations = 30          ┃ Avg (ms) ┃ Min (ms) ┃ Max (ms) ┃ StdDev ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
RenderGradient ([UInt32].withUnsafeMutablePointer (SIMD))                 │ 15.75163 │ 15.00523 │ 17.31266 │ 0.8139 │
──────────────────────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴────────┘

Language: Swift, Optimization: -Ounchecked, Samples = 10, Iterations = 30 ┃ Avg (ms) ┃ Min (ms) ┃ Max (ms) ┃ StdDev ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
RenderGradient ([UInt32].withUnsafeMutablePointer (SIMD))                 │ 3.789642 │ 3.272549 │ 5.110642 │ 0.6232 │
──────────────────────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴────────┘

The -O case was unaffected. The -Ounchecked build, however, is now about twice as fast as before and practically the same as the C

Update June 27th @ 1:56 AM

I noticed a bug in how I was adding the x-values: they should have been incremented by (0, 1, 2, 3). I updated the code samples and timings, though the analysis comes out roughly the same. I did see that the SIMD code has little benefit under the most aggressive compiler settings. That's not too unexpected, as this code is fairly trivial.

Original Entry

Well, it's that time again, to look at the performance of Swift. I've been using my swift-perf repo which contains various implementations of a RenderGradient function.

So, how does Swift 2.0 stack up in Xcode 7 Beta 2? Good! We've seen some improvements in debug builds, which is great. There is still a long way to go, but it's getting there. As for release builds, there's not much difference.

However, there is a new thing that got added in Swift 2.0 – basic SIMD support.

I decided to update my RenderGradient with two different implementations: one that uses an array of pixel data through the array interface, and another that interacts with the array through a mutable pointer. The latter is what is required for the best speed.

Here's the implementation:

NOTE: I'm pretty new to writing SIMD code, so if there are any things I should fix, please let me know!

func RenderGradient(inout buffer: RenderBuffer, offsetX: Int, offsetY: Int) {
    buffer.pixels.withUnsafeMutableBufferPointer { (inout p: UnsafeMutableBufferPointer<Pixel>) -> () in
        var offset = 0

        let yoffset = int4(Int32(offsetY))
        let xoffset = int4(Int32(offsetX))

        // TODO(owensd): Move to the 8-bit SIMD instructions when they are available.

        // NOTE(owensd): There is a performance loss using the friendly versions.

        //for y in 0..<buffer.height {
        for var y = 0, height = buffer.height; y < height; ++y {
            let green = int4(Int32(y)) + yoffset

            //for x in stride(from: 0, to: buffer.width, by: 4) {
            for var x: Int32 = 0, width = buffer.width; x < Int32(width); x += 4 {
                let inc = int4(0, 1, 2, 3)
                let blue = int4(x) + inc + xoffset

                p[offset++] = 0xFF << 24 | UInt32(blue.x & 0xFF) << 16 | UInt32(green.x & 0xFF) << 8
                p[offset++] = 0xFF << 24 | UInt32(blue.y & 0xFF) << 16 | UInt32(green.y & 0xFF) << 8
                p[offset++] = 0xFF << 24 | UInt32(blue.z & 0xFF) << 16 | UInt32(green.z & 0xFF) << 8
                p[offset++] = 0xFF << 24 | UInt32(blue.w & 0xFF) << 16 | UInt32(green.w & 0xFF) << 8
            }
        }
    }
}

The basic idea is to fill the registers on the CPU with data and perform the operation on that set instead of doing it one value at a time. For comparison, the non-SIMD version is below.
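To make the "one value at a time" versus "whole set at once" idea concrete, here is a minimal sketch. It assumes a modern Swift toolchain (5.x) and the standard library's `SIMD4<Int32>` type rather than the `int4` type used in this post; the function names `blueScalar` and `blueSIMD` are made up for illustration.

```swift
// Assumption: modern Swift (5.x). Integer SIMD types use the wrapping
// operator &+ because checked + is not offered for them.

// Scalar version: four separate additions, one value at a time.
func blueScalar(x: Int32, offsetX: Int32) -> [Int32] {
    var result: [Int32] = []
    for lane in Int32(0)..<4 {
        result.append((x &+ lane &+ offsetX) & 0xFF)
    }
    return result
}

// SIMD version: the same four additions written as single vector operations.
func blueSIMD(x: Int32, offsetX: Int32) -> SIMD4<Int32> {
    let inc = SIMD4<Int32>(0, 1, 2, 3)                // per-lane x increments
    let base = SIMD4<Int32>(repeating: x)             // x copied into every lane
    let offset = SIMD4<Int32>(repeating: offsetX)     // offset copied into every lane
    let mask = SIMD4<Int32>(repeating: 0xFF)          // keep the low 8 bits per lane
    return (base &+ inc &+ offset) & mask
}
```

With x = 4 and offsetX = 250, both versions produce the values 254, 255, 0, 1. Whether the vector form actually ends up faster depends on the compiler and the optimization level.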

func RenderGradient(inout buffer: RenderBuffer, offsetX: Int, offsetY: Int)
{
    buffer.pixels.withUnsafeMutableBufferPointer { (inout p: UnsafeMutableBufferPointer<Pixel>) -> () in
        var offset = 0
        for (var y = 0, height = buffer.height; y < height; ++y) {
            for (var x = 0, width = buffer.width; x < width; ++x) {
                let pixel = RenderBuffer.rgba(
                    0,
                    UInt8((y + offsetY) & 0xFF),
                    UInt8((x + offsetX) & 0xFF),
                    0xFF)
                p[offset] = pixel
                ++offset;
            }
        }
    }
}

The awesome thing is that the SIMD version is a bit faster! (Update June 27th @ 9:20 AM: it was previously 2x faster, before I fixed a bug. Dang!) When 8-bit operations are available, it should get even faster, as we can reduce the amount of work that needs to be done even further and assign the result directly into memory.
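As a rough illustration of doing less per-pixel work, the sketch below keeps the masking, shifting, and OR-ing in vector form so that four packed pixels come out of one expression. It is not the 8-bit approach mentioned above, just a related variation. It assumes a modern Swift toolchain (5.x) with `SIMD4<UInt32>`, a width that is a multiple of 4, and hypothetical helper names (`packFour`, `renderRow`) that are not part of swift-perf.

```swift
typealias Pixel = UInt32   // assumption: matches the [UInt32] pixel buffer in the tables

// Packs four pixels at once, using the same shifts as the scalar lines above:
// 0xFF in the top byte, the blue value shifted by 16, the green value shifted by 8.
func packFour(blue: SIMD4<UInt32>, green: SIMD4<UInt32>) -> SIMD4<UInt32> {
    let mask = SIMD4<UInt32>(repeating: 0xFF)
    let alpha = SIMD4<UInt32>(repeating: 0xFF00_0000)
    let blueBits = (blue & mask) &<< SIMD4<UInt32>(repeating: 16)
    let greenBits = (green & mask) &<< SIMD4<UInt32>(repeating: 8)
    return alpha | blueBits | greenBits
}

// Fills one row of the gradient, four pixels per iteration.
func renderRow(into row: inout [Pixel], y: UInt32, offsetX: UInt32, offsetY: UInt32) {
    precondition(row.count % 4 == 0, "this sketch assumes the width is a multiple of 4")
    let green = SIMD4<UInt32>(repeating: y &+ offsetY)
    let blueBase = SIMD4<UInt32>(0, 1, 2, 3) &+ SIMD4<UInt32>(repeating: offsetX)

    row.withUnsafeMutableBufferPointer { p in
        for x in stride(from: 0, to: p.count, by: 4) {
            let blue = SIMD4<UInt32>(repeating: UInt32(x)) &+ blueBase
            let pixels = packFour(blue: blue, green: green)
            // Lane-by-lane copy; the optimizer may or may not merge this into one store.
            for lane in 0..<4 {
                p[x + lane] = pixels[lane]
            }
        }
    }
}
```

Calling `renderRow` on an eight-element row with y = 1, offsetX = 254, offsetY = 2 writes 0xFFFE0300 into the first pixel. I have not measured this variant, so treat it as a shape to experiment with rather than a known win.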

Here is the performance break-down for these two methods in -O and -Ounchecked builds:

Swift Performance

Language: Swift, Optimization: -O, Samples = 10, Iterations = 30          ┃ Avg (ms) ┃ Min (ms) ┃ Max (ms) ┃ StdDev ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
RenderGradient ([UInt32].withUnsafeMutablePointer)                        │ 18.07803 │ 17.19691 │ 21.00281 │ 1.4847 │
RenderGradient ([UInt32].withUnsafeMutablePointer (SIMD))                 │ 15.88613 │ 15.11753 │ 20.16230 │ 1.5437 │
──────────────────────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴────────┘

Language: Swift, Optimization: -Ounchecked, Samples = 10, Iterations = 30 ┃ Avg (ms) ┃ Min (ms) ┃ Max (ms) ┃ StdDev ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
RenderGradient ([UInt32].withUnsafeMutablePointer)                        │ 6.623639 │  6.22851 │ 8.339521 │ 0.6325 │
RenderGradient ([UInt32].withUnsafeMutablePointer (SIMD))                 │ 6.629701 │ 5.930751 │ 8.751819 │ 1.0005 │
──────────────────────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴────────┘
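For a quick read of what the safety checks cost, the averages in the two tables above can be divided directly. This is just arithmetic on the reported numbers, written as a small modern-Swift snippet; the names are made up for illustration.

```swift
// Average timings in milliseconds, copied from the two tables above.
let checked = (pointer: 18.07803, simd: 15.88613)     // -O
let unchecked = (pointer: 6.623639, simd: 6.629701)   // -Ounchecked

// How many times slower the first timing is than the second.
func ratio(_ slow: Double, _ fast: Double) -> Double {
    return slow / fast
}

let safetyCostPointer = ratio(checked.pointer, unchecked.pointer)   // roughly 2.7
let safetyCostSIMD = ratio(checked.simd, unchecked.simd)            // roughly 2.4
let simdGainChecked = ratio(checked.pointer, checked.simd)          // roughly 1.14

print(safetyCostPointer, safetyCostSIMD, simdGainChecked)
```

So the -O builds run roughly 2.4 to 2.7 times slower than -Ounchecked, while SIMD buys about 14% in the -O build and essentially nothing in the -Ounchecked build.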

Now, here's where things start to get really interesting. I have a C

C

Language: C, Optimization: -Os, Samples = 10, Iterations = 30             ┃ Avg (ms) ┃ Min (ms) ┃ Max (ms) ┃ StdDev ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
RenderGradient (Pointer Math)                                             │    9.364 │    8.723 │   11.338 │  0.994 │
RenderGradient (SIMD)                                                     │    7.751 │    7.101 │    9.642 │  0.960 │
──────────────────────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴────────┘

Language: C, Optimization: -Ofast, Samples = 10, Iterations = 30          ┃ Avg (ms) ┃ Min (ms) ┃ Max (ms) ┃ StdDev ┃
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━┩
RenderGradient (Pointer Math)                                             │    3.302 │    2.865 │    5.061 │  0.693 │
RenderGradient (SIMD)                                                     │    7.607 │    6.991 │    9.923 │  0.887 │
──────────────────────────────────────────────────────────────────────────┴──────────┴──────────┴──────────┴────────┘

When Swift is compiled without the safety checks, it sits right between the "Pointer Math" and the "SIMD" versions. The safety checks cause about a 2-3x slowdown compared with the -Ounchecked version, though. There might still be some room for improvement in how I'm structuring things. Also, the C

I find this really exciting! We're really close to being able to write high-level, low syntactical noise code (compared to C

Again, the code for this can be found here: swift-perf. If you know of any optimizations I should make in the C
