Wicked Engine DevBlog

Skinning in a Compute Shader


I recently moved my mesh skinning implementation from a stream-out geometry shader to a compute shader. One reason was the ugly stream-out API, which I wanted to leave behind, but the more important reason was that the move comes with several benefits.

First, compared to traditional skinning in a vertex shader, the render pipeline can be simplified, because we perform skinning only once for each mesh instead of once in each render pass. So when we render our animated models multiple times (for shadow maps, Z-prepass, lighting pass, etc.), we use regular vertex shaders for those passes, with the vertex buffer swapped out for the pre-skinned vertex buffer. We also avoid a lot of render state setup, like binding bone matrix buffers for each render pass. All of this can be done in a geometry shader with stream-out capabilities as well, though.

The compute shader approach has some other nice features on top of that. The render pipeline of Wicked Engine requires the creation of a screen space velocity buffer. For that, we need our previous frame animated vertex positions. If we don’t skin in a compute shader, we probably need to skin each vertex a second time in the current frame with the previous frame bone transforms to get the velocity of the vertex, which is currentPos - prevPos (with deinterleaved vertex buffers, we could avoid this by swapping vertex position buffers). In a compute shader, this becomes quite straightforward. Perform skinning only with the current frame bone matrices, but before writing the skinned vertex to the buffer, load the value that is already stored there: that is your previous frame vertex position. Write it out to the previous-position buffer, then write the new position.
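The "load the old value before overwriting it" idea can be sketched on the CPU side in C++. This is only an illustration: `SkinnedBuffers`, `storeSkinned` and `velocity` are hypothetical names, not engine code, and the velocity here is a plain position delta that a later pass would still have to project to screen space.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Float3 { float x, y, z; };

// Hypothetical CPU stand-ins for the skinned position buffer and the
// previous-frame position buffer.
struct SkinnedBuffers {
    std::vector<Float3> pos;   // current frame skinned positions
    std::vector<Float3> prev;  // previous frame skinned positions
};

// Store a freshly skinned position. The value it overwrites is last frame's
// result, so it is copied into the previous-position buffer first.
void storeSkinned(SkinnedBuffers& b, std::size_t i, Float3 skinned) {
    assert(b.pos.size() == b.prev.size() && i < b.pos.size());
    b.prev[i] = b.pos[i];
    b.pos[i] = skinned;
}

// Position delta of vertex i between the two frames: currentPos - prevPos.
Float3 velocity(const SkinnedBuffers& b, std::size_t i) {
    return { b.pos[i].x - b.prev[i].x,
             b.pos[i].y - b.prev[i].y,
             b.pos[i].z - b.prev[i].z };
}
```

The order of the two assignments in `storeSkinned` is the whole trick: swapping them would make the previous position equal to the current one.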

In a compute shader, the developer decides how to distribute the workload across threads instead of relying on the default vertex shader thread invocations. Also, the vertex shader stage has strict ordering requirements, because vertices must be written out in the exact same order they arrived. A compute shader can just write into the skinned vertex buffer in any order as each thread finishes. That said, it is also the developer’s responsibility to avoid write conflicts. Thankfully, that is quite trivial here, because we are writing a linear array of data, one element per thread.
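Why the linear layout is conflict free can be shown with a tiny C++ sketch, assuming one thread per vertex and a fixed byte stride per vertex. `ByteRange`, `writeRange` and `overlaps` are hypothetical helper names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Byte range [begin, end) that one thread writes in an output buffer.
struct ByteRange { std::uint32_t begin, end; };

// One thread per vertex: thread `threadId` owns exactly one `stride`-byte slot.
ByteRange writeRange(std::uint32_t threadId, std::uint32_t stride) {
    assert(stride > 0);
    return { threadId * stride, threadId * stride + stride };
}

// True if the two half-open ranges share at least one byte.
bool overlaps(ByteRange a, ByteRange b) {
    return a.begin < b.end && b.begin < a.end;
}
```

Since every thread ID maps to its own slot, two different threads never touch the same bytes, so no atomics or barriers are needed for the output writes.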

Another nice feature is the possibility to leverage async compute in newer graphics APIs like DirectX 12, Vulkan or the PlayStation 4 graphics API. I don’t have experience with it, but I imagine it would be more taxing on memory, because we would probably need to double buffer the skinned vertex buffers.

Another optimization becomes possible with this. If performance is bottlenecked by skinning in our scene, we can skip skinning for distant meshes every other frame or so, as a kind of level of detail technique for skinning.
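A minimal C++ sketch of such a policy could look like the following. Everything here is an assumption made for illustration: the function name `shouldSkinThisFrame`, the distance threshold, and the idea of offsetting by a mesh ID so that distant meshes do not all update on the same frame.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical policy: meshes farther than `lodDistance` are skinned only
// every `farInterval`-th frame; nearer meshes are skinned every frame.
bool shouldSkinThisFrame(float distance, float lodDistance,
                         std::uint64_t frameIndex, std::uint32_t meshId,
                         std::uint32_t farInterval = 2) {
    assert(farInterval > 0);
    if (distance <= lodDistance) {
        return true;
    }
    // Offset by meshId so the skinning cost of distant meshes is spread
    // across frames instead of spiking on every farInterval-th frame.
    return (frameIndex + meshId) % farInterval == 0;
}
```

One thing to watch out for: on a skipped frame the previous-position buffer is not refreshed either, so a velocity pass would probably need to treat those meshes separately.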

The downside is that this technique comes with increased memory requirements, because we must write into global memory to provide the data up front for the following render passes. We also give up the fast on-chip memory of the GPU (the memory for vertex shader to pixel shader parameters) for storing the skinned values.

Here is my shader implementation for skinning a mesh in a compute shader:

struct Bone
{
    float4x4 pose;
};
StructuredBuffer<Bone> boneBuffer;

ByteAddressBuffer vertexBuffer_POS; // T-Pose pos
ByteAddressBuffer vertexBuffer_NOR; // T-Pose normal
ByteAddressBuffer vertexBuffer_WEI; // bone weights
ByteAddressBuffer vertexBuffer_BON; // bone indices

RWByteAddressBuffer streamoutBuffer_POS; // skinned pos
RWByteAddressBuffer streamoutBuffer_NOR; // skinned normal
RWByteAddressBuffer streamoutBuffer_PRE; // previous frame skinned pos

inline void Skinning(inout float4 pos, inout float4 nor, in float4 inBon, in float4 inWei)
{
    float4 p = 0, pp = 0;
    float3 n = 0;
    float4x4 m;
    float3x3 m3;
    float weisum = 0;

    // force loop to reduce register pressure
    // though this way we can not interleave TEX - ALU operations
    for (uint i = 0; ((i < 4) && (weisum < 1.0f)); ++i)
    {
        m = boneBuffer[(uint)inBon[i]].pose;
        m3 = (float3x3)m;

        p += mul(float4(pos.xyz, 1), m) * inWei[i];
        n += mul(nor.xyz, m3) * inWei[i];

        weisum += inWei[i];
    }

    // keep the T-pose values if the vertex has no bone weights at all
    bool w = any(inWei);
    pos.xyz = w ? p.xyz : pos.xyz;
    nor.xyz = w ? n : nor.xyz;
}

[numthreads(1024, 1, 1)]
void main( uint3 DTid : SV_DispatchThreadID )
{
    const uint fetchAddress = DTid.x * 16; // stride is 16 bytes for each vertex buffer now...

    uint4 pos_u = vertexBuffer_POS.Load4(fetchAddress);
    uint4 nor_u = vertexBuffer_NOR.Load4(fetchAddress);
    uint4 wei_u = vertexBuffer_WEI.Load4(fetchAddress);
    uint4 bon_u = vertexBuffer_BON.Load4(fetchAddress);

    float4 pos = asfloat(pos_u);
    float4 nor = asfloat(nor_u);
    float4 wei = asfloat(wei_u);
    float4 bon = asfloat(bon_u);

    Skinning(pos, nor, bon, wei);

    pos_u = asuint(pos);
    nor_u = asuint(nor);

    // copy prev frame current pos to current frame prev pos
    streamoutBuffer_PRE.Store4(fetchAddress, streamoutBuffer_POS.Load4(fetchAddress));
    // write out skinned props:
    streamoutBuffer_POS.Store4(fetchAddress, pos_u);
    streamoutBuffer_NOR.Store4(fetchAddress, nor_u);
}

Oh god I hate this wordpress code editor… (maybe I just can’t use it properly)

As you can see, the code is quite simple. I call this compute shader with something like this:
Dispatch( ceil(mesh.vertices.getCount() / 1024.0f), 1, 1);
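The group count can also be computed with integer arithmetic only, which avoids the float round trip (a 32-bit float can no longer represent every integer above about 16.7 million). The sketch below is an assumption about how one might write it, with hypothetical names `kThreadsPerGroup`, `skinningGroupCount` and `threadInRange`. The second helper mirrors a guard that a shader could use for the surplus threads of the last group; the shader above does not contain such a guard.

```cpp
#include <cassert>
#include <cstdint>

// Must match the [numthreads(1024, 1, 1)] declaration of the shader.
constexpr std::uint32_t kThreadsPerGroup = 1024;

// Number of thread groups needed so that every vertex gets a thread.
// This is an integer ceil(vertexCount / kThreadsPerGroup); it assumes
// vertexCount is small enough that the addition does not overflow.
std::uint32_t skinningGroupCount(std::uint32_t vertexCount) {
    return (vertexCount + kThreadsPerGroup - 1) / kThreadsPerGroup;
}

// The last group usually has threads whose index is past the vertex count.
// A check like this tells whether a given thread has a vertex to process.
bool threadInRange(std::uint32_t dispatchThreadId, std::uint32_t vertexCount) {
    return dispatchThreadId < vertexCount;
}
```

For example, a mesh with 1025 vertices needs two groups, and 1023 of the 2048 launched threads have nothing to do.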

These vertex buffers are not packed yet, which is quite inefficient. Positions could probably be stored as 16-bit floats (but then you must animate in local space), normals can be packed nicely into 32-bit uints, and bone weights and indices should be combined into a single buffer and packed into uints as well. If you are using raw buffers (ByteAddressBuffer in HLSL), you have to do the type conversion yourself. You can also use typed buffers, but performance may be diminished. You can see an example of the optimizations with manual type conversion of compressed vertex streams in my Wicked Engine repo.
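To give an idea of what packing a normal into a 32-bit uint can look like, here is a C++ sketch that stores each component in 8 bits. The layout (x in the lowest byte, then y, then z, top byte unused) and the names `toUnorm8`, `packNormal` and `unpackNormal` are assumptions for this example, not necessarily what the engine uses. The shader would have to do the matching unpack after loading the uint from the raw buffer.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

struct Float3 { float x, y, z; };

// Map a value from [-1, 1] to an 8-bit unsigned integer in [0, 255].
std::uint32_t toUnorm8(float v) {
    const float clamped = std::fmin(1.0f, std::fmax(-1.0f, v));
    return static_cast<std::uint32_t>(std::lround((clamped * 0.5f + 0.5f) * 255.0f));
}

// Pack a unit normal into one 32-bit value: x | y << 8 | z << 16.
std::uint32_t packNormal(Float3 n) {
    return toUnorm8(n.x) | (toUnorm8(n.y) << 8) | (toUnorm8(n.z) << 16);
}

// Inverse of packNormal; the result is only accurate to about 1/255
// per component, so renormalizing afterwards may be desirable.
Float3 unpackNormal(std::uint32_t packed) {
    auto fromUnorm8 = [](std::uint32_t bits) {
        return static_cast<float>(bits & 0xFFu) / 255.0f * 2.0f - 1.0f;
    };
    return { fromUnorm8(packed), fromUnorm8(packed >> 8), fromUnorm8(packed >> 16) };
}
```

This cuts a normal from 16 bytes to 4 at the cost of some precision. Layouts with more bits per component follow the same pattern.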

I have been using precomputed skinning in Wicked Engine for a long time now, so I can’t compare it with the vertex shader approach, but it is definitely not worse than the stream-out technique. I can imagine that for some titles, it might not be worth it to store additional vertex buffers in VRAM and give up on-chip memory for skinning results. However, this technique could be a candidate in optimization scenarios because it is easy to implement, and I think also easier to maintain, because we can avoid the shader permutations for skinned and non-skinned models.

Thanks for reading!