Monday, October 3, 2011

Performance differences in data types based on packing

In C++, structure size varies a lot based on the size of the variables present in a structure/class and this also has an impact on performance.

There are some key performance issues and they are real for differently sized integers. Granted, modern CPUs change that a bit, but there are VERY REAL performance differences, particularly in structures/classes. The thing I am talking about is 'packing' and many people realize that structs with a mixture of differently sized variables hava a lot of 'wasted space' (i.e. structs that have a much larger memory footprint than they should).
Look at your target hardware. X360, PS3, and WII all use the same memory bus and CPU instruction set meaning that they have nearly identical characteristics. Using a int32 (32bit int) is the same and is generally 1 clock cycle... but watch this:
struct a
{        int x;        char y;        int z;
}; 
and compare it to this:
struct b
{        int x;        int y;    
   char z;
}; 
So these are the same, right? Not even close, at least not with packing switched on. The second structure is far faster. It turns out that this is true on most architectures, but especially most consoles. So, why you ask..? the answer is packing.
This is the worst text editor known to man so please bear with me...(alright, maybe VIM)
The memory bus is 32 bits on these systems (and 64 on x86). This means that accessing both struct a & b takes three bus cycles to access: one bus cycle hands us back 32 bits. But structs are packed, by default, along a single byte boundary which you can change this with compiler settings (but most people forget). So when the structure is accessed, the struct is pulled in memory across the bus... 7 bytes at a time. If it's in an array, a little bit of the next struct in the array is also pulled in 3 bytes because the bus can only fetch on 32 bit boundaries. I hope that you are beginning to see the problem.
So let's deal with a single instance of a for the time being. When you access a.x, fine... no performance hit. When you access a.y, fine. This can be slower on intel platforms because the instruction set for small values has to do a little extra work.. it's built to run fast with 64...not 8, but it's minor. When you access a.z, all hell breaks loose. The structure is packed tightly meaning that part of a.z is stuck in the same 32 bits of a.y... a.z and a.y share the same 32 bits and a.y only takes up 8 of those and a.z takes up 24. But where does that extra 8 bits of z go, it is packed into another 'trailing 32 bits'. 32 bit CPUs work in 32 bits at a time only. So when you access a.z, the CPU must do the following..
var addy = address of y
var high = addy & 0xFFFFFF ; // grab the lower 24 bits
addy ++;// move to next memory address
var low = addy & 0xFF000000;// grab high order... this become the low-order word of z
var finalZ = high<<8 + low>>24 
So, try playing with it and you can notice a severe performance degradation with this type of structure. People creating structures/classes often pay no attention whatsoever to this and performance becomes a problem. The best way to fix this?
RULE: at primitive structure levels (not high level), put 64 bit variables first, then 32 bit. after that it becomes less important, but grouping by size is always a benefit. also, pointers are the same size as your memory bus so a 32bit OS has values the same size as integers... 32 bit. On the first structure, it's actually 3 times faster to make y an int32 because of it's impact on z. Better yet, just reorder.
This is even worse for an array of structures that don't align. If a or b is in an array, then every single member of a[1] or b[1] will be slow. Luckily, most compilers don't pack structures together like that, but you can with compiler settings or you can make a single structure pack tightly with the pragma 'pack' command. Be careful. Add padding to make the total size of the struct some multiple of the bus width.

No comments: