const int lut[256] =
{
0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
1,2,2,3,2,3,3,4,2,3,3,4,3,4,4,5,2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
2,3,3,4,3,4,4,5,3,4,4,5,4,5,5,6,3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,
3,4,4,5,4,5,5,6,4,5,5,6,5,6,6,7,4,5,5,6,5,6,6,7,5,6,6,7,6,7,7,8
};
static inline int popcount_lut(uint64 x) {
int i, ret = 0;
for (i = 0; i < 64; i += 8)
ret +=lut[(x >> i) & 0xff];
return ret;
}
之前几种方法中的所有操作都可以在寄存器中完成,而查找表法由于无法将查找表放在寄存器中,不可避免地要出现访问cache或内存的操作。但由于我们采用一个循环频繁访问查找表,在所有的实现中查找表都应该被锁定在cache中,访问开销很小[4]。因此从后面的评测我们可以看到,这种方法的效率最高。
对于很酷的HAKMEM169,由于用到了耗时的mod操作,在大多数平台上都不如shift_and_add的乘法版本速度快,当然就更比不上我们最笨最好的查找表大法了。KISS准则再次被证明是有效的,哦也。具体的评测数据,可以参考Bart Massey的结果[5]以及感言,或者看这个更加全面详尽啰嗦的评测[6]。