AndyL

A question for the framework/VM guys...

I have ported one of my simpler benchmarks to the 360, and find that the numeric performance is running 5x slower than on my 2.8GHz P4 desktop. Now I was expecting a drop in performance on the 360, allowing for the difference in CPU implementation (in-order execution, branch prediction, etc.), but nothing like this magnitude!

The benchmark simply transforms a source array of Vector4s to a destination array of Vector4s, using a 4x4 matrix.

Have you guys done any benchmarking of pure numeric perf, and if so is this in line with what you get?

Andy.



Re: XNA Game Studio Express Numeric performance on 360

Walther Gropius

Hey dude

I've noticed similar things today - my number crunching stuff (procedurally creating a world from some small data) takes 8s on my year-old PC, and about 4 minutes on the 360.

Will be digging into it tomorrow to see if there are any easy wins, and will report back. My first port of call will be to try passing as much stuff as possible by ref/out to avoid copies.

Perhaps some console coders out there may have some useful hints about what not to do

Cheers






Re: XNA Game Studio Express Numeric performance on 360

Jack H. Palevich

The Xbox 360 CLR runtime is based upon the .NET Compact Framework 2.0 runtime, which is not currently optimized for either floating point performance or for processing structs. In particular, structs are heap allocated on the .NET Compact Framework, rather than being held on the stack. This makes structs much slower to create and pass around to subroutines than on the Windows PC.

If your heavily-used code creates and passes around lots of structs, you may find a performance improvement on the 360 by rewriting it to use individual floats & ints. (Yes, this is a super pain. I'm just mentioning it as something that might help.)

A second thing to do is try and take advantage of the 360's 3 CPUs and 6 hardware threads. Try rewriting your code to take advantage of multiple threads. I bet you could get a 2x speedup by doing this.
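As a rough illustration of the multi-threading suggestion, here is a minimal sketch that splits an index range into contiguous chunks and runs one worker thread per chunk. The names `ParallelSketch`, `RangeWork` and `Run` are made up for this example and are not part of XNA. It assumes the work on each element is independent. On the 360 each worker would also need to be assigned to a hardware thread, which is left out here so the sketch stays portable.

```csharp
using System;
using System.Threading;

// Hypothetical helper, not an XNA API: runs 'work' over [0, count)
// split into one contiguous chunk per thread.
public static class ParallelSketch
{
    public delegate void RangeWork(int start, int end);

    public static void Run(int count, int threadCount, RangeWork work)
    {
        Thread[] threads = new Thread[threadCount];
        int chunk = (count + threadCount - 1) / threadCount; // round up

        for (int t = 0; t < threadCount; t++)
        {
            // Declared inside the loop so each closure captures its own copy.
            int start = Math.Min(count, t * chunk);
            int end = Math.Min(count, start + chunk);

            threads[t] = new Thread(delegate() { work(start, end); });
            threads[t].Start();
        }

        // Wait for every chunk to finish before returning.
        for (int t = 0; t < threadCount; t++)
        {
            threads[t].Join();
        }
    }
}
```

A caller would pass a delegate that processes elements `start` to `end - 1` of its own arrays. Because each thread writes to a disjoint range, no locking is needed for the data itself.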

Another approach would be to try to shift computation to the GPU by using pixel shaders and render-to-texture. For some kinds of algorithms the pixel shader can be much faster than the .NET Compact Framework 2.0 runtime.





Re: XNA Game Studio Express Numeric performance on 360

Shawn Hargreaves - MSFT

I'm pretty sure the Compact Framework does proper stack allocation for structs: where did you get this info about them being heap allocated? It isn't perfectly optimised for structure passing (no register calling conventions, etc), but not so utterly terrible as heap allocating everything would make it.

Judging from my own performance work, raw math performance on 360 tends to be slower than on a Windows machine with a comparable clock speed, but not by a huge margin. The 360 framework is much more sensitive to coding style, though, so things that would maybe only cause a 10% performance drop on Windows can easily cause a huge slowdown on Xbox.

The two biggest things to be aware of when optimising Xbox math code are:
  • Inlining. The Compact Framework only has limited support for automatically inlining methods, so if you are calling many tiny routines (a common thing to do in math code) all those function calls will be very expensive. Manually inlining the contents of your leaf functions can give a huge speedup.
  • Passing structures by value is slow. For performance critical code that operates on vectors or matrices, you should pass arguments by ref, and use out parameters instead of return values.
Both of those techniques make your code more complex and harder to maintain, so I wouldn't recommend using them everywhere. Judiciously applied in the right parts of your inner loops, manual inlining and reference calling convention can give huge speedups.
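To make the two techniques concrete, here is a small sketch of the same vector-by-matrix transform written three ways: by value, by ref/out, and manually inlined into the array loop. `Vec4`, `Mat4` and `TransformSketch` are stand-in types invented for this example so that it compiles without XNA. The row-vector convention (v * M) is an assumption. How much each style helps on the 360 is something to measure rather than take from this sketch.

```csharp
using System;

public struct Vec4
{
    public float X, Y, Z, W;
    public Vec4(float x, float y, float z, float w) { X = x; Y = y; Z = z; W = w; }
}

// Row-major 4x4 matrix; a row vector v is transformed as v * M (assumed convention).
public struct Mat4
{
    public float M11, M12, M13, M14;
    public float M21, M22, M23, M24;
    public float M31, M32, M33, M34;
    public float M41, M42, M43, M44;

    public static Mat4 Identity
    {
        get
        {
            Mat4 m = new Mat4();
            m.M11 = 1f; m.M22 = 1f; m.M33 = 1f; m.M44 = 1f;
            return m;
        }
    }
}

public static class TransformSketch
{
    // Style 1: everything by value. Copies 16 + 64 bytes in and 16 bytes out per call.
    public static Vec4 Transform(Vec4 v, Mat4 m)
    {
        return new Vec4(
            v.X * m.M11 + v.Y * m.M21 + v.Z * m.M31 + v.W * m.M41,
            v.X * m.M12 + v.Y * m.M22 + v.Z * m.M32 + v.W * m.M42,
            v.X * m.M13 + v.Y * m.M23 + v.Z * m.M33 + v.W * m.M43,
            v.X * m.M14 + v.Y * m.M24 + v.Z * m.M34 + v.W * m.M44);
    }

    // Style 2: ref inputs and an out result, so no struct copies are made.
    public static void Transform(ref Vec4 v, ref Mat4 m, out Vec4 result)
    {
        // Read the inputs first so the call is safe if result aliases v.
        float x = v.X, y = v.Y, z = v.Z, w = v.W;
        result.X = x * m.M11 + y * m.M21 + z * m.M31 + w * m.M41;
        result.Y = x * m.M12 + y * m.M22 + z * m.M32 + w * m.M42;
        result.Z = x * m.M13 + y * m.M23 + z * m.M33 + w * m.M43;
        result.W = x * m.M14 + y * m.M24 + z * m.M34 + w * m.M44;
    }

    // Style 3: the leaf function is inlined by hand into the loop,
    // so there is no per-element method call at all.
    public static void TransformArray(Vec4[] src, Vec4[] dst, ref Mat4 m)
    {
        for (int i = 0; i < src.Length; i++)
        {
            float x = src[i].X, y = src[i].Y, z = src[i].Z, w = src[i].W;
            dst[i].X = x * m.M11 + y * m.M21 + z * m.M31 + w * m.M41;
            dst[i].Y = x * m.M12 + y * m.M22 + z * m.M32 + w * m.M42;
            dst[i].Z = x * m.M13 + y * m.M23 + z * m.M33 + w * m.M43;
            dst[i].W = x * m.M14 + y * m.M24 + z * m.M34 + w * m.M44;
        }
    }
}
```

All three produce the same result. Only the calling convention and the number of method calls differ, which is where the cost goes when the runtime does little automatic inlining.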

I'll also second what Jack says about taking advantage of the 3 processor cores and especially the GPU. The trick to getting really amazing performance on Xbox is finding ways to offload as much computation as possible onto the GPU.





Re: XNA Game Studio Express Numeric performance on 360

Jack H. Palevich

My apparently-incorrect "structs on the heap" info was from work I did with Compact Framework 1.0, back in the day. Glad to hear it's not a problem for CF 2.0.





Re: XNA Game Studio Express Numeric performance on 360

AndyL

How come the CF VM was chosen for the 360 - surely the combination of the desktop VM and the CF library would have been far better? Inlining is crucial to getting good vector, matrix and quaternion performance.



Re: XNA Game Studio Express Numeric performance on 360

Shawn Hargreaves - MSFT

The desktop CLR is not easily portable to non-x86 architectures, whereas the Compact Framework was implemented with portability as a major goal.





Re: XNA Game Studio Express Numeric performance on 360

andreww

You might want to read this in order to get some ideas about why numeric performance is lower than desktop, it's all way above me but sounds very interesting :)

http://dpad.gotfrag.com/portal/story/35372/ spage=1


Quote:
Both the 360 and PS3's CPUs are heavily stripped down compared to what most of us are probably using on our desktop computers to view this article. Both consoles are labeled as 3.2GHZ, but they don't offer performance comparable to that of a typical Athlon 64 3200+ or better than even an Athlon XP 2800+ CPU. The CPUs inside the Xbox 360 and PS3 are "In-Order Execution" CPUs with narrow execution cores, whereas what we use on our computers are classified as "Out-of-Order Execution" CPUs with wider execution cores.




Re: XNA Game Studio Express Numeric performance on 360

AndyL

Shawn Hargreaves - MSFT wrote:
The desktop CLR is not easily portable to non-x86 architectures, whereas the Compact Framework was implemented with portability as a major goal.

Fair enough.

Shame MSFT stopped development of the PowerPC version of Windows NT, perhaps then the desktop VM would have been different!





Re: XNA Game Studio Express Numeric performance on 360

AndyL

andreww wrote:
You might want to read this in order to get some ideas about why numeric performance is lower than desktop, it's all way above me but sounds very interesting :)

http://dpad.gotfrag.com/portal/story/35372/ spage=1


Quote:
Both the 360 and PS3's CPUs are heavily stripped down compared to what most of us are probably using on our desktop computers to view this article. Both consoles are labeled as 3.2GHZ, but they don't offer performance comparable to that of a typical Athlon 64 3200+ or better than even an Athlon XP 2800+ CPU. The CPUs inside the Xbox 360 and PS3 are "In-Order Execution" CPUs with narrow execution cores, whereas what we use on our computers are classified as "Out-of-Order Execution" CPUs with wider execution cores.

In-order execution can be mitigated by a good instruction scheduler within the compiler, even more so by a developer that is willing to write assembly and work around instruction latency. Sadly I don't think either of the .NET VMs do any instruction scheduling...





Re: XNA Game Studio Express Numeric performance on 360

Walther Gropius

Okay, some hard numbers from my game. Basically, I'm decoding some data, creating meshes from some previously-loaded building blocks (then welding verts etc), and creating shadow volume meshes from these. Also, this is all using arrays of my own vert structs and ints for indices - vertex and index buffers get built right at the end.

PC, creating 386 objects -
decoding : 1437ms (15.65%)
creating mesh : 6312ms (68.71%)
creating shadow mesh : 1437ms (15.65%)

Xbox, creating 386 objects -
decoding : 5275ms (2.5%)
creating mesh : 189978ms (89.71%)
creating shadow mesh : 16508ms (7.8%)

Yeesh...

I'm going to see what works and what doesn't in terms of speeding this up, and report back.

Cheers






Re: XNA Game Studio Express Numeric performance on 360

Adam Miles

Is that multi-threaded in any way?



Re: XNA Game Studio Express Numeric performance on 360

Jon Watte

What's sad is that, traditionally, the PPC architecture has been better than Intel at FLOPS and lower latencies. However, the Xbox and PS/3 CPUs are on par with the original Pentium when it comes to CPU architecture -- dual-issue, in-order execution. At least they have some nice AltiVec SIMD instructions in the architecture, although I don't know whether the CF VM can use those types of instructions at all.

For those of you already running 1.0 on Creator's Club: what kind of information do you get from the Xbox performance analysis tool?






Re: XNA Game Studio Express Numeric performance on 360

Walther Gropius

Hi again

@Adam M - this code is running on a separate thread behind a splash screen, but isn't itself split into multiple threads. I definitely want to do that, but not until I get to the bottom of why it's so slow at the moment. If I can get a single thread down to ~20secs then it would make sense to try and parallelise that.

I spent a pretty long time today messing with inner loops, and didn't really get anywhere. My one bit of advice to others with similar problems is PROFILE NOW! The big slowdowns weren't at all where I expected them.

@Jon W - I haven't got onto the perf tool you mention yet, I've been using DateTime.Now and TimeSpan to increment totalTime variables. I'll have a look at the remote perf monitor tomorrow.
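For anyone doing the same kind of hand-rolled timing, a minimal sketch of an accumulating section timer follows. It uses `System.Diagnostics.Stopwatch` instead of `DateTime.Now`, on the assumption that `Stopwatch` is available on your target runtime. `DateTime.Now` typically has fairly coarse resolution, which matters when the timed section is short and called many times. `SectionTimer` is a made-up name for this example.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: accumulates the total time spent in one section of code
// across many Begin/End pairs.
public sealed class SectionTimer
{
    private readonly Stopwatch watch = new Stopwatch();

    // Number of completed Begin/End pairs.
    public int Calls;

    public void Begin()
    {
        watch.Start(); // resumes; elapsed time accumulates across calls
    }

    public void End()
    {
        watch.Stop();
        Calls++;
    }

    public double TotalMilliseconds
    {
        get { return watch.Elapsed.TotalMilliseconds; }
    }
}
```

One timer per suspect routine, with `Begin()` and `End()` around the call site, gives per-section totals that can be printed once loading has finished.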

The biggest chunk of time (1/3 of the total time) is spent in a routine which adds vertex/index chunks from a library of pieces (parsed from a 3dsmax file during Init(), using my own exporter/importer) to a single mesh vb/ib - adding correct offsets to the indices as necessary. I'm doing various horrible things like using my own array.join() method to combine, e.g., the current array of indices with a new chunk of indices - so there may be big wins there.
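A sketch of why the join pattern might hurt, and one alternative. It assumes the join method allocates a fresh array and copies both inputs on every call, which is a guess about code not shown in this thread. Under that assumption, building a mesh from many chunks copies the growing index array over and over and leaves each old array for the garbage collector. Appending into a list with reserved capacity avoids both. `MeshAppendSketch`, `Join` and `AppendIndices` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class MeshAppendSketch
{
    // Assumed shape of the slow pattern: a new array and a full copy per call.
    public static int[] Join(int[] a, int[] b)
    {
        int[] result = new int[a.Length + b.Length];
        Array.Copy(a, 0, result, 0, a.Length);
        Array.Copy(b, 0, result, a.Length, b.Length);
        return result;
    }

    // Alternative: append the chunk's indices to a growable list, rebasing each
    // one by the number of vertices already in the combined mesh.
    public static void AppendIndices(List<int> indices, int[] chunk, int vertexOffset)
    {
        for (int i = 0; i < chunk.Length; i++)
        {
            indices.Add(chunk[i] + vertexOffset);
        }
    }
}
```

Creating the list as `new List<int>(expectedIndexCount)` means it rarely or never has to grow, and a single `ToArray()` at the end gives the data for the index buffer.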

Other methods are still between factors of 10 and 40 times slower though, so I'm a bit worried. A final idea is to decode a small bit of the world initially, then stick all this stuff in a separate thread which runs while you're playing, and decode the further reaches of the world before you actually get there. I initially did this for the PC, but after a healthy dose of refactoring and getting it down from 1 minute to 9 seconds, I hoped this wouldn't be necessary. Live and learn : )

Anyone, please ask if you'd like further info on what I'm trying to achieve. I'm not in a position to open source this stuff, but would love to share as much info as possible.

Cheers!






Re: XNA Game Studio Express Numeric performance on 360

Shawn Hargreaves - MSFT

The Xbox perf tool is primarily a garbage collector profiler: it gives you data on how often collections are happening, how many objects are being allocated, the total size of the managed heap, and so on.