Optimizing Wulverblade: Threading Animation Updates

Optimization is one of the dark arts. It is often essential in getting your game from production to launch. There are many articles out there with general guidelines on how to go about optimizing your games, but in the end, optimization is a specialized skill that I think benefits from case studies. This is a look at our specific efforts to optimize the 2D animation system in Wulverblade.

Most optimization guides repeat one critical step: profiling. Before you can start to fix the areas that are slowing down your game, you must know exactly where it is slowing down. I can’t tell you how many times I thought I knew what the slowest parts of my program were, only to be completely surprised once I profiled and discovered the true horrors lurking in my code. Since we used Unity to develop Wulverblade, we’ll be looking at its built-in profiling tools, and I’ll show you what I saw as we started trying to squeeze as much performance out of our game as possible.

Skeleton updates without threading

You can see the game running in the Unity editor above, with the profiler graph in the pane at the bottom. See what is highlighted? “SkeletonRenderer.LateUpdate” is taking 0.74 ms to run. That is not a lot of time, but remember this is running on a powerful development machine. Systems with weaker CPUs (like, say, Nintendo’s Switch) will struggle much more with these slow sections of code. Also, take note of how much of the total frame time is spent in the sections of code you are looking at. Is 0.74 ms a large amount of time compared to other sections of code? In this case, the answer was yes. Below, I’ll give you a sneak peek at where we want to end up.

Skeleton updates with threading

So that is going from 0.74 ms to 0.35 ms, a reduction of a little over half, which I think is an excellent result.

Once you identify these areas you need to dig into them. What is “SkeletonRenderer.LateUpdate”? Well, it is part of our animation subsystem, which I didn’t write. We use a tool called Spine to do our 2D animations, and Spine provides a Unity Runtime, which is a set of code files that reads the exported skeletons and draws and animates them within Unity for us. So, having identified this as an area to speed up, we have to open up this third-party library and take a peek inside. As I suspected, the code provided is already very well optimized. There is not much that can be squeezed out of this system with conventional optimizations. I didn’t find any obvious mistakes that were eating my CPU cycles (credit to Esoteric Software, the makers of Spine).

Multithreading to the Rescue

I’ll give away a little secret: this idea was not concocted from a rigorous process. I did not follow some flow chart that led me to decide on the solution that ended up working. The general idea for how I would speed up the animation system in Wulverblade was determined mostly by intuition. I looked at the system, saw that each character was using a certain amount of CPU cycles to animate every frame, dug down further into the profiler to see exactly which pieces of the underlying Spine system were the slowest, and proposed the notion that the system could benefit from parallelization. What does that mean? Within a single frame, multiple characters need to be updated, but in the stock system provided by Spine those updates happen in sequence, one character at a time. Since almost all modern computers have multiple CPU cores able to calculate operations in parallel, I decided to explore the idea of running some or all of the animation updates on separate cores. For those who are not programmers, the terminology here is that we will run each animation update on a separate “thread,” thus taking advantage of “multithreading.”

If we take a quick step back and look at our game running on a computer we can quickly see that Unity heavily uses 1 core, slightly uses a second, and doesn’t really stress other cores. Most modern systems (consoles included) have at least 4 cores available for use. This means that we likely have processing power going underutilized just waiting to be accessed. But, don’t get too excited. Adding multiple threads to an application is one of the harder things a programmer can do, especially in a complex application like a game. Unity itself knows this and the smart engineers over there are working on new systems within the Unity engine to help make threading easier, but we won’t have access to that stuff for our game. Maybe next time…

We will have to remember to play by some rules if we want to avoid introducing a lot of instability to our game. I won’t go into explaining in detail why the first two of these rules exist other than to say they are rules to help ensure that the addition of threading to the application doesn’t destroy the game’s stability. The last rule should be obvious. We are trying to optimize our game right now, so adding additional processing and memory wastage is counterproductive.

  1. Do not call ANY Unity API functions from outside the game’s main thread
  2. Access to shared data must be strictly controlled to avoid threads conflicting with each other while doing their work
  3. This is supposed to be an optimization so we can’t introduce per-frame memory allocations or any per-frame thread setup CPU costs
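Rule 1 is the easiest of the three to enforce mechanically. The following is a minimal sketch of a main-thread guard, written in Python for brevity rather than the C# the game uses. `set_mesh_vertices` is a hypothetical stand-in for an engine call, not a real Unity API, and the guard assumes the module is loaded on the main thread.

```python
import threading

# Captured once, on whichever thread loads this module (assumed to be the main thread).
MAIN_THREAD_ID = threading.get_ident()


def assert_main_thread():
    """Raise if the caller is not the thread that loaded this module."""
    if threading.get_ident() != MAIN_THREAD_ID:
        raise RuntimeError("engine API called off the main thread")


def set_mesh_vertices(mesh, vertices):
    """Hypothetical stand-in for an engine call that is only safe on the main thread."""
    assert_main_thread()
    mesh["vertices"] = list(vertices)


def call_from_worker():
    """Try the guarded call on a worker thread and return any error messages."""
    errors = []

    def job():
        try:
            set_mesh_vertices({}, [(0.0, 0.0)])
        except RuntimeError as exc:
            errors.append(str(exc))

    worker = threading.Thread(target=job)
    worker.start()
    worker.join()
    return errors
```

A guard like this turns a rule violation into an immediate, readable error instead of an intermittent crash later on.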

We need to identify what specifically about Spine’s animation system can be threaded. To do this we can look at the basic update process for calculating an animation frame.

Spine Animation Timelines

Timelines: step one is to loop over the timelines and advance them one time step. Spine takes the data exported from its authoring tool (for example, a piece of the character needs to be at 0, 0 on frame 1 and at 20, 56 on frame 10) and calculates the current state of the “skeleton.” The skeleton is the representation of the character that is animated by the animation system. A “bone” is the logical unit within the skeleton that gets translated, scaled, rotated, etc. by each timeline.

These timeline updates are very order-dependent, so within a character they absolutely must happen in sequence. So… we can’t spread one character’s timelines across separate threads, because we would lose the guarantee of when each timeline gets updated. We could possibly give each character its own thread and update all of that character’s timelines there, but the design of the Spine runtime around timeline updates makes this a challenge. The API would likely require some extensive revisions to make it ready for threading.

Furthermore, there is the issue of when in the frame these timeline updates happen. They happen early on, and other systems rely on their results. This creates a serious synchronization issue. If we wanted to thread these updates we would begin the threaded update at the beginning of the frame, but then other systems would be asking the animation system where certain bones are positioned, and there would be no guarantee that those bone positions have been calculated yet. This issue isn’t impossible to solve. There are ways of making it work. It is, however, very difficult, so we left it to focus on other areas.
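To make the keyframe example concrete (0, 0 on frame 1 and 20, 56 on frame 10), here is a minimal Python sketch of sampling a translation timeline. It assumes plain linear interpolation between keys. Spine also supports other curve types, and `sample_translation` is an illustrative name, not part of the Spine runtime.

```python
def sample_translation(keys, frame):
    """Return (x, y) at `frame` for keys given as (frame, x, y) tuples sorted by frame."""
    if frame <= keys[0][0]:
        return keys[0][1], keys[0][2]
    if frame >= keys[-1][0]:
        return keys[-1][1], keys[-1][2]
    # Find the pair of keys that brackets the requested frame.
    for (f0, x0, y0), (f1, x1, y1) in zip(keys, keys[1:]):
        if f0 <= frame <= f1:
            t = (frame - f0) / (f1 - f0)  # 0.0 at the first key, 1.0 at the second
            return x0 + (x1 - x0) * t, y0 + (y1 - y0) * t


keys = [(1, 0.0, 0.0), (10, 20.0, 56.0)]
print(sample_translation(keys, 5.5))  # halfway between the keys: (10.0, 28.0)
```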

Spine Animation Bones

Transform Update: this next step takes the final positions calculated for the skeleton’s bones in the current frame and updates the individual Unity elements that make up the visual representation of the character. The immediate problem is that the Unity API is used here, which makes this step impossible to thread within the current system. It is also the fastest step in the process, so threading it would give the least benefit to the application.

Spine Animation Meshes

Meshing: one of the final steps in the process is updating any meshes the character might use. It is possible to make Spine characters with no meshing at all, in which case each piece of the character is one rigid image. By using meshing we can deform the individual pieces of the character. Caradoc’s torso can bend and skew as he moves, allowing our animators to add extra life to the character’s movements. The downside is that this process can be expensive. After the timelines have been updated, the effect of those timeline changes must be calculated on each point in the mesh for every character. It can really add up.
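The per-point cost is easier to see in code. Below is a Python sketch of the common weighted-vertex approach, where each mesh point is a weighted blend of positions carried by one or more bones. It is an illustration under that assumption, not Spine’s actual implementation, and the names `bone_matrix` and `deform` are made up for the example.

```python
import math


def bone_matrix(x, y, rotation_deg):
    """World transform of a bone as (a, b, c, d, tx, ty): rotation plus translation."""
    r = math.radians(rotation_deg)
    cos, sin = math.cos(r), math.sin(r)
    return (cos, -sin, sin, cos, x, y)


def deform(vertices, bones):
    """Compute world positions for every mesh point.

    vertices: one list of influences per point; each influence is
              (bone_index, local_x, local_y, weight), with weights summing to 1.
    bones:    list of matrices produced by bone_matrix().
    """
    out = []
    for influences in vertices:
        wx = wy = 0.0
        for bone_index, lx, ly, weight in influences:
            a, b, c, d, tx, ty = bones[bone_index]
            # Transform the point by this bone, then blend by the bone's weight.
            wx += (a * lx + b * ly + tx) * weight
            wy += (c * lx + d * ly + ty) * weight
        out.append((wx, wy))
    return out
```

The work is one multiply-add chain per influence, per point, per character, every frame, and it touches nothing but plain numbers, which is what makes it a good candidate to move off the main thread.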

Most of the animation update process happens within Spine’s “Update” function. The mesh updates actually occur later, during a function called “LateUpdate.” Remember seeing that earlier in the profiler? This mesh updating accounts for the majority of the time spent in Spine’s LateUpdate call. It is ripe for threading. The mesh data itself is nothing more than a list of points, making it easy to copy and synchronize across threads. Plus, because this process already occurs later in the frame, we can trigger the threaded update earlier, allow the rest of the game to continue processing everything else it needs to, and then come back and check on our threaded mesh updates in the LateUpdate function. This is going to be our target for optimization.

What we are proposing is to change from this:

Character Updates in Sequence

to this:

Character Updates in Parallel

Seems more complicated, doesn’t it? Oftentimes optimized code is more complicated or “messier” than the generic, clean version. Also, these diagrams hide the fact that Unity will be doing other processing in between calling Update and LateUpdate. It is in this gap that the new system will actually process the skeleton meshes. The key is that once it comes time to calculate the new meshes in the LateUpdate call, we will already have those calculations done and waiting to be applied, which will be much faster than the older system.

Everyone into the Thread Pool

We need to decide on a specific implementation of this plan. I did not even attempt a system that starts a thread when it is time to calculate a mesh update and stops it when that update is finished. Starting a new thread is a relatively costly and slow process. Whatever threads we were going to use needed to be started early in the game’s life and persist throughout. This sounds like a job for thread pools, a well-known pattern for adding multithreading to a program. The issue was that I could not use the standard thread pools provided in C#, for two reasons: they have expensive locking that would absolutely ruin the benefits gained from using them, and they allocate memory each time you try to spawn a new threaded job, and I needed to avoid these costly per-frame allocations. Locks are constructs that protect a shared resource from being accessed by multiple threads at once, and they are often critical tools for building robust multithreaded applications. Locks, however, take a lot of CPU cycles to interact with. So, a custom lock-less thread pooling system was needed.
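The shape of such a pool can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the Wulverblade code: each worker owns one job slot and one state flag, and each state transition has a single writer (the main thread sets `PENDING` and `IDLE`, the worker sets `DONE`). Python’s interpreter makes those flag writes safe on its own; a C# version would presumably need a volatile field or `Interlocked` operations for the same handshake. All class and method names here are invented for the example.

```python
import threading
import time

IDLE, PENDING, DONE = 0, 1, 2


class Worker:
    """One long-lived thread plus a job slot guarded by a single state flag."""

    def __init__(self):
        self.state = IDLE      # main thread writes IDLE/PENDING, worker writes DONE
        self.job = None
        self.arg = None
        self.alive = True
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()    # started once, reused for every later job

    def _run(self):
        while self.alive:
            if self.state == PENDING:
                self.job(self.arg)
                self.state = DONE   # publish the result last
            else:
                time.sleep(0.0002)  # idle politely; a native pool might spin or park


class MeshThreadPool:
    def __init__(self, size):
        self.workers = [Worker() for _ in range(size)]

    def try_submit(self, job, arg):
        """Hand the job to an idle worker, or return None so the caller can fall back."""
        for worker in self.workers:
            if worker.state == IDLE:
                worker.job, worker.arg = job, arg
                worker.state = PENDING  # flip the flag only after the slot is filled
                return worker
        return None

    def release(self, worker):
        """Called by the main thread once it has consumed a DONE result."""
        worker.state = IDLE

    def shutdown(self):
        for worker in self.workers:
            worker.alive = False
        for worker in self.workers:
            worker.thread.join()
```

Submitting a job creates no thread and takes no lock: it scans a fixed list, fills a slot, and flips a flag. When every worker is busy, `try_submit` returns `None`, which is the caller’s cue to run the old non-threaded path.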

To understand why a custom solution was best here, it’s important to understand what standard thread pools are good at doing and what I was attempting to do. Thread pools are great for taking a task that you expect to run for a long time and putting it into a thread to be calculated while the rest of your application continues running. “A long time” is relative, of course, but we’re often talking about operations that in game development terms would take many frames to complete. Pathfinding algorithms are a great example of this. Our animations must start their update and finish that update within the same frame. We aren’t trying to calculate something in the background and come back to it several frames later. We simply want to take advantage of the parallel processing we have available. Since we need this update, and all other processing the game needs to do, finished within a single frame (about 16.7 ms if we want to hit 60 FPS), even small costs associated with starting a task in a thread pool can ruin the benefit we were hoping to get. Most uses of thread pools do not have this finish-before-end-of-frame requirement, so the ease of use and safety are well worth the extra cost.
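A quick back-of-the-envelope calculation shows how little headroom there is. The Python sketch below uses the 0.74 ms and 0.35 ms figures from the profiler, while the per-job dispatch overhead (0.05 ms) and the job count (8) are invented purely for illustration.

```python
def frame_budget_ms(target_fps):
    """Total time available per frame at the target frame rate."""
    return 1000.0 / target_fps


def share_of_budget(cost_ms, target_fps):
    """Fraction of the frame budget consumed by one piece of work."""
    return cost_ms / frame_budget_ms(target_fps)


def net_saving_ms(before_ms, after_ms, dispatch_overhead_ms, jobs_per_frame):
    """Saving left over once the cost of handing out the jobs is paid."""
    return (before_ms - after_ms) - dispatch_overhead_ms * jobs_per_frame


print(round(frame_budget_ms(60), 2))              # 16.67
print(round(share_of_budget(0.74, 60) * 100, 1))  # 4.4 (percent of the frame)
print(net_saving_ms(0.74, 0.35, 0.05, 8) < 0)     # True: the overhead ate the saving
```

With those made-up numbers, a dispatch cost of just 0.05 ms per job across eight characters would cancel the entire 0.39 ms gain, which is why the cost of handing work to the pool matters so much.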

Memory Allocation Issues

Per frame memory allocations

I have mentioned the per-frame allocation issue several times, but it’s important to understand why this constraint matters so much. We use C# for Wulverblade, which is a managed language. As you use and discard memory, the garbage collector eventually has to sweep through and clean up. The garbage collector is slow. Even worse, the garbage collector built into the version of Unity we used for Wulverblade is an old implementation particularly poorly suited for games. When the garbage collector kicks in, your players are going to feel it. Avoiding allocating and dumping memory every frame avoids having to invoke the garbage collector.
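The usual way to honor this constraint is to allocate buffers once and overwrite them in place. Here is a small Python sketch of that pattern, with invented names. Python still creates short-lived temporaries of its own, so treat it as an illustration of the structure only; the C# equivalent would be reusing a preallocated array instead of creating a new one each frame.

```python
from array import array


class MeshScratch:
    """Vertex buffer allocated once and overwritten in place every frame."""

    def __init__(self, max_vertices):
        self.xy = array("d", [0.0]) * (max_vertices * 2)  # one-time allocation
        self.count = 0

    def fill(self, points):
        """Overwrite the buffer with this frame's (x, y) points; no new buffer is created."""
        i = 0
        for x, y in points:
            self.xy[i] = x       # raises IndexError if points exceed the capacity
            self.xy[i + 1] = y
            i += 2
        self.count = i // 2


def wasteful_fill(points):
    """The pattern to avoid: a fresh buffer (future garbage) on every call."""
    return array("d", [c for p in points for c in p])
```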

Above, you can see the profiler telling me we are allocating 1.2 KB per frame, which is a lot. Oh no! Don’t worry, looking into this function reveals that this allocation only occurs when running inside the Unity editor. An exported build won’t have it. But it was important to investigate. As many of these per-frame allocations as possible should be hunted down and squashed. You can see why this constraint drove the implementation of the animation system changes.

The Custom Thread Pool

I’ll give a high-level description of the animation threading system. There are three places where the new system interacts with the animation updates. The old non-threaded system is left in place. This is important because if for any reason we can’t do threaded updates, or there are too many characters on screen to run on separate cores, then the regular system will still be used.

Creation: The first place is when a character is created. Here we check to see if the shared thread pooling system exists, and if it doesn’t, we create it. This spins up the threads, which will idle, waiting to be called into action once character mesh updates are required. These threads are kept alive throughout the game’s lifetime.

Update: This is where the old system updates the timelines and the transform hierarchy. These are still done exactly as before. Once they have been completed it is time to invoke the threaded updates. If a thread is available for use, we invoke it to calculate our mesh updates. There is a chance that so many characters are on screen that all the threads are already being used, in which case we fall back to the old system. We make sure any data these separate threads need has been copied over before we invoke the thread pool at all. This avoids any need to get locks involved at this step.

LateUpdate: Here we check if we were able to invoke a threaded update this frame. If we did not, for whatever reason, we can still calculate the mesh updates using the old non-threaded code. If we did invoke a threaded update, then we check on its status. If it isn’t done with its calculations, it is given a slight amount of time to finish, after which point the update will be skipped for this frame (there is a bit more sophisticated checking that goes on here relating to frame skipping and how many frames are allowed to be skipped). When the mesh update is ready, its data is copied over to the visual character on screen. If everything went as planned, this function only has to do some simple boolean and integer checks followed by a simple memory copy, which is much faster than the full mesh deformation code that ran in the background on separate cores while Unity was doing other work.
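The LateUpdate decision can be sketched as a small state check. This Python sketch is an assumption-laden illustration, not the shipped code: the grace period, the skip limit, and the choice to block once the limit is reached are all invented, since the post only says the real checks are “a bit more sophisticated.”

```python
import time

IDLE, PENDING, DONE = 0, 1, 2
MAX_SKIPPED_FRAMES = 2   # hypothetical limit; the post does not give the real one
GRACE_SECONDS = 0.0005   # hypothetical "slight amount of time" to wait


class CharacterMesh:
    def __init__(self, vertex_count):
        self.job_state = IDLE                # PENDING after Update, DONE once the worker finishes
        self.result = [0.0] * vertex_count   # written by the worker thread
        self.visible = [0.0] * vertex_count  # what is actually drawn
        self.skipped = 0


def wait_until_done(ch, timeout_s):
    """Poll the state flag until the job is DONE or the timeout expires."""
    deadline = time.perf_counter() + timeout_s
    while ch.job_state != DONE:
        if time.perf_counter() >= deadline:
            return False
        time.sleep(0)  # yield to other threads
    return True


def late_update(ch, compute_sync):
    if ch.job_state == IDLE:
        # No threaded job was started this frame: use the old code path.
        ch.visible[:] = compute_sync()
        return "fallback"
    if not wait_until_done(ch, GRACE_SECONDS):
        if ch.skipped < MAX_SKIPPED_FRAMES:
            ch.skipped += 1    # keep last frame's mesh on screen
            return "skipped"
        wait_until_done(ch, float("inf"))  # assumption: stop skipping and block
    ch.visible[:] = ch.result  # cheap copy of the finished mesh
    ch.job_state = IDLE
    ch.skipped = 0
    return "applied"
```

In the common case the function does one integer comparison and one copy; the expensive deformation has already happened elsewhere.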

Conclusion

So, after putting in these thread pools and tweaking how many characters are allowed to run in parallel per frame (system dependent), we get the results that I showed. The new threaded mesh updates have yielded speedups on all systems, but the speedups are more pronounced on systems with weaker CPUs. A user will feel the difference between 250 and 300 FPS less than the difference between 15 and 30.
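The FPS comparison is easier to appreciate when converted to time per frame. A short Python calculation, using only the frame rates quoted above:

```python
def frame_time_ms(fps):
    """Time spent on each frame at a given frame rate."""
    return 1000.0 / fps


def time_saved_ms(fps_before, fps_after):
    """How much shorter each frame gets when the frame rate improves."""
    return frame_time_ms(fps_before) - frame_time_ms(fps_after)


for before, after in [(250, 300), (15, 30)]:
    print(before, after, round(time_saved_ms(before, after), 2))
# 250 300 0.67
# 15 30 33.33
```

Going from 250 to 300 FPS shaves about two thirds of a millisecond off each frame, while going from 15 to 30 FPS shaves off more than 33 ms, roughly fifty times as much.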

Skeleton updates with threading

Using the mesh deformation features of Spine allows us to add a lot of life and character to the animations, but they really are expensive. On any modern PC this cost is negligible, but heavy use of meshes did cause issues hitting our frame rate targets on other systems. This threading optimization was absolutely necessary to hit those targets and I’m extremely happy with how effective it was. I actually believe more parts of Spine’s animation updates can be threaded, but they would have required a ton more effort. The mesh deformations were the lowest-hanging fruit in this case and provided the needed speedups. Thanks for reading this post! Releases of Wulverblade for other platforms are on their way. If you haven’t already, be sure to add it to your Steam wishlist!

  • Felix

    Super interesting read! Any chance of a GitHub link? :D
    Or have you been in touch with the Spine guys for putting that stuff in their code?

    • Brian

      Not yet. There may be more areas for optimizations we’ll look into but we will likely talk to Esoteric about it in the future.