A few weeks ago one of the devs on the trading system team hit a performance issue: running four threads on a multicore machine was slower than running the same app with processor affinity turned on.
Turning on processor affinity was effectively limiting his application to a single core, so he had expected it to be slower, not faster.
After three weeks of searching he found the answer: SmartHeap.
He was making extensive use of the STL, and the combination of the default Visual Studio memory allocator and a multiprocessor machine was what caused the slowdown.
The fix was to replace the default memory allocator with SmartHeap (there are other alternatives), and that solved the problem.
I am blogging about this in case there are other people out there banging their heads on their desks trying to work out why their app is slower on a multicore server than on a single-core one.
If you nose around this blog you will see I am not a C++ developer, so don't ask me for any more detail.