Brad Robinson

Hi,
I've been working on performance tuning some audio sample rate conversion routines, and firstly I must say how impressed I've been with VS05's x86 optimizing compiler - it does some really amazing things, often matching and occasionally outperforming some hand-written assembler I have.

Today, however, I recompiled these routines with the x64 compiler and was shocked at the drop in performance. As an example, on a Core 2 Duo, a single-channel sinc resample routine compiled as x86 converts about 1.5 million samples per second. Exactly the same code recompiled as x64, with exactly the same input data, on exactly the same machine, converts 500 thousand samples per second. That's three times slower!
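A minimal sketch of how a throughput comparison like this could be measured. The names `measure_samples_per_second` and `slowdown_factor` are hypothetical and not from the original code; the `process` callable stands in for whatever resampling routine is under test.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: runs process(sample_count) once and returns the
// observed throughput in samples per second.
template <typename Fn>
double measure_samples_per_second(Fn&& process, std::size_t sample_count) {
    assert(sample_count > 0);
    const auto start = std::chrono::steady_clock::now();
    process(sample_count);
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double> elapsed = stop - start;
    return static_cast<double>(sample_count) / elapsed.count();
}

// Ratio between two throughput figures, e.g. the x86 and x64 builds.
double slowdown_factor(double fast_rate, double slow_rate) {
    assert(slow_rate > 0.0);
    return fast_rate / slow_rate;
}
```

Building the same source once as x86 and once as x64, then passing the two measured rates to `slowdown_factor`, gives the kind of ratio quoted above (1.5 million against 500 thousand samples per second is a factor of 3).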

Although I haven't investigated deeply, I did take a quick look at the generated assembly and noticed the x64 version uses SSE floating point instructions, whereas the x86 version uses the floating point stack. For comparison I wrote the same routine in x64 assembler using the floating point stack and got similar results to the x86 version. (Note these routines are not using SSE intrinsics, nor is the compiler generating anything that uses SSE's parallel nature - it's just using the low double of each SSE register to hold/manipulate a single double value.)
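To illustrate the kind of code being discussed, here is a hedged sketch of a scalar multiply-accumulate inner loop. It is not the actual resampling routine; `convolve_at` and its parameters are hypothetical. For a loop like this, an x64 build would typically use scalar SSE2 instructions on the low double of each XMM register, while an x86 build with VS2005's default settings would typically use x87 stack instructions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical FIR-style inner loop: multiplies 32-bit float samples by
// double-precision filter coefficients and accumulates in a double.
// No intrinsics are used; the instruction selection is left to the compiler.
double convolve_at(const float* samples, const double* coeffs, std::size_t taps) {
    assert(samples != nullptr && coeffs != nullptr);
    double acc = 0.0;
    for (std::size_t i = 0; i < taps; ++i) {
        // Each step widens one float to double, multiplies, and adds.
        acc += static_cast<double>(samples[i]) * coeffs[i];
    }
    return acc;
}
```

Compiling a function like this with an assembly listing enabled (for example `/FA` with the Microsoft compiler) is one way to compare what the two targets emit.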

I've also compared the routines enough to tell they are basically similar, i.e. both the x86 and x64 versions do loop unrolling and are structurally similar - it's just how the actual floating point math is done that is different. In the tight inner loop, the x86 version has 20 floating point related instructions while the x64 version has 24.

Here's the worst part though - there are no compiler options to tell the x64 compiler to not use SSE - apparently it just can't generate code that uses the floating point stack. The other downside of this is you lose the 80-bit precision of the floating point stack registers, as the SSE double registers are only 64-bit.
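The precision point can be sketched as follows. The helper names are hypothetical. A 64-bit double has a 53-bit significand, whereas the 80-bit x87 extended format has a 64-bit one, so a small enough increment disappears once a value is held as a double. Note that `long double` is only wider than `double` on some compilers; with the Microsoft compiler the two are the same size.

```cpp
#include <cassert>
#include <limits>

// Significand widths, in bits, as reported by the compiler in use.
constexpr int double_mantissa_bits() {
    return std::numeric_limits<double>::digits;
}
constexpr int long_double_mantissa_bits() {
    return std::numeric_limits<long double>::digits;
}

// Returns true if adding `tiny` to 1.0 is lost once the sum is stored
// as a 64-bit double. The volatile forces the store to double width.
bool lost_in_double(double tiny) {
    assert(tiny >= 0.0);  // sketch only considers non-negative increments
    volatile double sum = 1.0;
    sum = sum + tiny;
    return sum == 1.0;
}
```

For example, `lost_in_double(1e-18)` is true because 1e-18 is below half the spacing between 1.0 and the next representable double.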

So while the x86 compiler is awesome, it seems the x64 version is more than a little lacking.

Has anyone else noticed similar results, or does anyone know why SSE instructions would be so much slower, or have ideas on how to fix it? I'm guessing something like it's slow to load 32-bit float values from non-16-byte-aligned memory. I'm 99% confident this is not a denormal problem.
Please don't tell me I need to go back to assembler again!
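One way a denormal problem could be ruled out is to scan the buffers for subnormal values. This is a sketch under the assumption that the data is available as an array of doubles; `count_subnormals` is a hypothetical helper, not part of the original code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Counts how many values in the buffer are subnormal (denormal).
// A result of zero for the inputs, coefficients, and outputs suggests
// denormals are not what is slowing the routine down.
std::size_t count_subnormals(const double* values, std::size_t count) {
    assert(values != nullptr || count == 0);
    std::size_t found = 0;
    for (std::size_t i = 0; i < count; ++i) {
        if (std::fpclassify(values[i]) == FP_SUBNORMAL) {
            ++found;
        }
    }
    return found;
}
```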

Brad


Re: Visual C++ Language Microsoft C++ x64 compiler very slow at floating point math

einaros

I've got a few control questions:
  1. Is this the x64 compiler from VS2005?
  2. If yes, have you applied SP1?
  3. Have you tried compiling the same project in VS2008 beta2? What's the result there?
So there's no quick fix or groundbreaking news in my post, but the answers would be relevant in either case.





Re: Visual C++ Language Microsoft C++ x64 compiler very slow at floating point math

Vadim Paretsky

Can you provide more information about this problem? Compilation flags, the architecture you are running on, and a code sample would be helpful.





Re: Visual C++ Language Microsoft C++ x64 compiler very slow at floating point math

Brad Robinson

Yes, this is the x64 compiler from VS2005

Yes, I've applied SP1

No, I haven't tried in the VS2008 beta. (downloading now)





Re: Visual C++ Language Microsoft C++ x64 compiler very slow at floating point math

Brad Robinson

Basically I'm just compiling with all optimizations set to maximum speed. I'm running on a Core 2 Duo notebook with 64-bit Vista.

The particular routine that is seeing the biggest slowdown is a quite complex sinc-based audio resampling routine. I'm currently trying to put together a simpler example that demonstrates the problem. So far I've found I can quite easily make an example that shows a 15%-30% drop in performance, but nothing of the order seen in my real code. I might just extract that one routine and pass some dummy data to it as the example.

(thanks all for help on this)

Brad





Re: Visual C++ Language Microsoft C++ x64 compiler very slow at floating point math

Vadim Paretsky

Please send me the example that you have, and the exact compilation flags, to v-vadimp@microsoft.com. I'll look at it.





Re: Visual C++ Language Microsoft C++ x64 compiler very slow at floating point math

Brad Robinson

Hi Vadim,

As requested, example program sent... thanks for looking into this.

Brad





Re: Visual C++ Language Microsoft C++ x64 compiler very slow at floating point math

Chuck the Code Monkey

The reason only SSE code gets generated is that the x87 stack has been deprecated on Windows x64 and is no longer used. There is extensive documentation stating that this is because the register state is not saved by Windows during context switches. Some have observed this not to be the case; nonetheless, it's best to stick with only SSE on Windows x64 because of this. I believe that's also why there's no option to specifically generate SSE code in the x64 compiler, as it simply happens automatically.

"Overview of x64 Calling Conventions"
http://msdn2.microsoft.com/en-us/library/ms235286(vs.80).aspx

FTA:
"...The x87 register stack is unused. It may be used, but must be considered volatile across function calls. All floating point operations are done using the 16 XMM registers..."




Re: Visual C++ Language Microsoft C++ x64 compiler very slow at floating point math

Brad Robinson

That's useful information! I missed that point about the fp stack in the calling conventions. Thanks.

Not sure what to do about this. The x64 version of this routine runs almost three times slower than the x86 version on a Core 2 Duo. On AMD it runs faster than the x86 version. That's significant enough for me to abandon the x64 version of my app for the time being.

Thanks for the insight.

Brad





Re: Visual C++ Language Microsoft C++ x64 compiler very slow at floating point math

Jerry Goodwin

Update for all interested readers:

Brad has been communicating with Vadim by email; we have reproduced the issue in-house and have a proposed fix. We are in the process of testing that and getting approvals to include it in VS 2008. We have not yet put the fix through the performance runs that prove it doesn't cause any serious deterioration in some other scenario, so it's not a lock yet that the fix will get into VS 2008, but I'm optimistic since the fix is relatively simple.

I'll do my best to remember to post again when all the due diligence is done and a definite promise can be made.






Re: Visual C++ Language Microsoft C++ x64 compiler very slow at floating point math

Brad Robinson

Yes, I can confirm the proposed change does in fact fix the problem, with the x64 version now outperforming the x86 version.

Unfortunately I could only test this by rewriting 1 (of 36) routines in assembler and making the change manually. I'd love to see this released as a hotfix as VS08 is probably outside my project time frame.





Re: Visual C++ Language Microsoft C++ x64 compiler very slow at floating point math

Jerry Goodwin

The fix is now in the code base for VS 2008. One additional thing we learned in our testing is that the problem does not affect P4 and K8 processors, which are the main processor families for which VS 2005 was tuned. In fact, if we were to add the fix to VS 2005 it would impose a performance penalty on those processors. And there are no command line switches in VS 2005 to cause it to compile for a specific generation of processor, so right now we are only planning to include the fix in VS 2008.






Re: Visual C++ Language Microsoft C++ x64 compiler very slow at floating point math

Brad Robinson

Thanks Jerry, that's excellent news.

Oh, and btw, congrats to Microsoft on this unbelievably excellent support and compiler. This was a serious issue for my app - enough to try the Intel C++ compiler, only to find it even slower and then suffering the same slowdown on Intel chips. Feedback from Microsoft has been quick and provided enough info to resolve the problem for critical cases now by using masm, and I'm happy to wait for the fix in the next compiler for other cases. Well done.

Brad