NAO32 is a format that gained some fame in the dev community when ex-Ninja Theory programmer Marco Salvi shared some details of the technique on the Beyond3D forums. Used in the game Heavenly Sword, it allowed multi-sampling to be used in conjunction with HDR on a platform (PS3) whose GPU didn't support multi-sampling of floating-point surfaces (the RSX is heavily based on Nvidia's G70). In this technique, color is stored in the LogLuv format using a standard R8G8B8A8 surface. Two components store the chromaticity (X and Y) at 8-bit precision each, and the other two together store the log of the luminance at 16-bit precision. Having 16 bits for luminance allows a wide dynamic range to be stored in this format, and storing the log of the luminance allows for linear filtering in multi-sampling or texture sampling. Since he first explained it, other games have also used it, such as Naughty Dog's Uncharted. It has likely been used in many other PS3 games as well.
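To put rough numbers on that dynamic range, here is a back-of-the-envelope estimate. It assumes the luminance encoding used in the shader code in this post, where the encoded value L_e has its integer part in one 8-bit channel and its fraction in the other; the symbols L_e and Y_k are just my own shorthand.

$$
L_e = 2\log_2 Y + 127, \qquad 0 \le L_e \le 256
\;\Longrightarrow\; 2^{-63.5} \le Y \le 2^{64.5}
$$

$$
\Delta L_e = \frac{1}{255}
\;\Longrightarrow\; \frac{Y_{k+1}}{Y_k} = 2^{1/510} \approx 1.0014
$$

So adjacent luminance codes differ by roughly 0.14%, across far more range than a renderer is ever likely to need. In practice the low end is never reached, since the encode function clamps its input to 1e-6 before taking the log.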
My actual shader implementation was helped along quite a bit by Christer Ericson's blog post, which described how to derive optimized shader code for encoding RGB into the LogLuv format. Using his code as a starting point, I came up with the following HLSL code for encoding and decoding:
// M matrix, for encoding
const static float3x3 M = float3x3(
    0.2209, 0.3390, 0.4184,
    0.1138, 0.6780, 0.7319,
    0.0102, 0.1130, 0.2969);

// Inverse M matrix, for decoding
const static float3x3 InverseM = float3x3(
    6.0013, -2.700, -1.7995,
    -1.332, 3.1029, -5.7720,
    .3007, -1.088, 5.6268);

float4 LogLuvEncode(in float3 vRGB)
{
    float4 vResult;
    float3 Xp_Y_XYZp = mul(vRGB, M);
    // Clamp to avoid log2(0) and a divide by zero
    Xp_Y_XYZp = max(Xp_Y_XYZp, float3(1e-6, 1e-6, 1e-6));
    // Chromaticity goes into the first two 8-bit channels
    vResult.xy = Xp_Y_XYZp.xy / Xp_Y_XYZp.z;
    // Log luminance is split across the other two channels:
    // fractional part in w, the remainder (scaled to [0, 1]) in z
    float Le = 2 * log2(Xp_Y_XYZp.y) + 127;
    vResult.w = frac(Le);
    vResult.z = (Le - (floor(vResult.w*255.0f))/255.0f)/255.0f;
    return vResult;
}

float3 LogLuvDecode(in float4 vLogLuv)
{
    // Recombine the two luminance channels
    float Le = vLogLuv.z * 255 + vLogLuv.w;
    float3 Xp_Y_XYZp;
    Xp_Y_XYZp.y = exp2((Le - 127) / 2);
    Xp_Y_XYZp.z = Xp_Y_XYZp.y / vLogLuv.y;
    Xp_Y_XYZp.x = vLogLuv.x * Xp_Y_XYZp.z;
    float3 vRGB = mul(Xp_Y_XYZp, InverseM);
    return max(vRGB, 0);
}
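If you want to sanity-check the round trip without firing up a GPU, a CPU-side port is handy. What follows is only a sketch in Python that mirrors the HLSL above; it is not part of the shader. The helper names (`mul_row_vector`, `quantize_unorm8`, `roundtrip`) are made up for this illustration, and the quantization step assumes the render target clamps to [0, 1] and rounds to the nearest multiple of 1/255, which is how I'd expect an R8G8B8A8 write to behave.

```python
import math

# Copies of the matrices from the HLSL listing, one row per line.
M = [
    [0.2209, 0.3390, 0.4184],
    [0.1138, 0.6780, 0.7319],
    [0.0102, 0.1130, 0.2969],
]
INVERSE_M = [
    [6.0013, -2.700, -1.7995],
    [-1.332, 3.1029, -5.7720],
    [0.3007, -1.088, 5.6268],
]


def mul_row_vector(v, m):
    """Equivalent of HLSL mul(v, m) with v treated as a row vector."""
    return [sum(v[i] * m[i][j] for i in range(3)) for j in range(3)]


def quantize_unorm8(x):
    """Assumed behavior of an 8-bit UNORM write: clamp, then round to k/255."""
    x = min(max(x, 0.0), 1.0)
    return math.floor(x * 255.0 + 0.5) / 255.0


def logluv_encode(rgb):
    xp, y, xyzp = (max(c, 1e-6) for c in mul_row_vector(rgb, M))
    le = 2.0 * math.log2(y) + 127.0
    w = le - math.floor(le)  # frac(Le)
    z = (le - math.floor(w * 255.0) / 255.0) / 255.0
    return [xp / xyzp, y / xyzp, z, w]


def logluv_decode(v):
    le = v[2] * 255.0 + v[3]
    y = 2.0 ** ((le - 127.0) / 2.0)
    xyzp = y / v[1]
    xp = v[0] * xyzp
    rgb = mul_row_vector([xp, y, xyzp], INVERSE_M)
    return [max(c, 0.0) for c in rgb]


def roundtrip(rgb):
    """Encode, simulate storage in an R8G8B8A8 surface, then decode."""
    stored = [quantize_unorm8(c) for c in logluv_encode(rgb)]
    return logluv_decode(stored)


if __name__ == "__main__":
    for color in ([0.01, 0.01, 0.01], [1.0, 0.5, 0.25], [1000.0, 1000.0, 1000.0]):
        print(color, "->", roundtrip(color))
```

Running it on a handful of colors at very different exposures is a quick way to see that the decoded values stay close to the inputs, and that whatever error remains comes mostly from the two 8-bit chromaticity channels rather than from the luminance.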
Once I had this implemented and had worked through a few small glitches, results were much improved in the 360 version. Performance was much, much better, I could multi-sample again, and the results looked great. So once again things didn't work out in exactly the ideal way, but I'm pleased with the results.