Andreas Asterlund

I followed the general outline by MVP Brian Kramer when developing a wrapper for Rick Wagner's C++ version of the Mersenne Twister. To my utter disappointment, the performance of the code was quite weak. The native code running in a Win32 console application was able to produce 300 000 000 random integers in just over one second, but my wrapped version running in a CLI console application took about 6-7 seconds to produce the same number of integers. This was absolutely not what I had hoped for.

Performance is of the utmost importance in the part of my application where I will run the Mersenne Twister. I know that it is not advisable to call my wrapper code in a loop, but I really have to do that in my application because of its design (I will not explain the design here, but I can if you ask me to). If you want to see my wrapper code I can post that too, just ask.

Is there anything I can do to prevent this big loss of performance? Any help is much appreciated.

Is it possible to gain any performance by modifying the unmanaged class in some way, say by removing functions that I don't use in my application, or by any other modification?

Andreas




Re: Visual C++ Language Big loss in performance when using wrapper for unmanaged code

Brian Kramer

When you jump from managed to unmanaged code you take a hit. Crossings of this boundary need to be minimized, and in your case a purely managed MT implementation might be the best bet. Having said that, my confidence in a port is somewhat lower. I haven't seen Rick Wagner's implementation, but the implementation I adapted does some pointer arithmetic that reads outside an array's bounds (I contacted MT's inventor, and he--actually his graduate student--said it was "by design.")
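One common way to cut down on boundary crossings, where the design allows it, is to ask for many numbers per call instead of one. The sketch below is plain C++ and only models the idea: NativeGen is a hypothetical toy generator (xorshift32, not the Mersenne Twister), BatchingWrapper is a hypothetical stand-in for a wrapper class, and the crossings counter simply counts how often a caller enters the wrapper. None of these names come from Wagner's code.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the native generator (xorshift32, NOT MT19937).
struct NativeGen {
    std::uint32_t s;
    explicit NativeGen(std::uint32_t seed) : s(seed ? seed : 1u) {}
    std::uint32_t next() {
        s ^= s << 13;
        s ^= s >> 17;
        s ^= s << 5;
        return s;
    }
};

// Hypothetical wrapper. `crossings` models the number of times a caller
// has to cross from the calling side into the wrapped generator.
class BatchingWrapper {
public:
    explicit BatchingWrapper(std::uint32_t seed) : crossings(0), gen_(seed) {}

    // One crossing per number.
    std::uint32_t randInt() {
        ++crossings;
        return gen_.next();
    }

    // One crossing for n numbers: the loop runs entirely on the far side.
    void fill(std::uint32_t* out, std::size_t n) {
        ++crossings;
        for (std::size_t i = 0; i < n; ++i) {
            out[i] = gen_.next();
        }
    }

    std::size_t crossings;

private:
    NativeGen gen_;
};
```

Filling a buffer of 1000 values with fill() yields the same sequence as 1000 calls to randInt(), but enters the wrapper once instead of 1000 times. Whether this fits depends on whether the caller can consume numbers from a buffer.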

If you want, I can try porting my version of MT and we can compare notes.

Brian





Re: Visual C++ Language Big loss in performance when using wrapper for unmanaged code

Brian Kramer

Instead of going with my version of MT (which is also a C++ wrapping of the original mt19937ar.c), I compiled Wagner's version and created a simple wrapper (with randInt()), and am able to see the same slowdown as you do. 

Here are some preliminary results.  I'm measuring how many integers I am able to generate in one second (times 10,000).

// unmanaged version without /clr: 16409, 16523, 16096

// unmanaged version with /clr: 11822, 11587, 11599

// managed wrapper with /clr: 1627, 1642, 1633

// managed port with /clr: 5196, 5209, 5234

The managed port is simply changing class MTRand to ref class MTRand and making the usual changes (removing unnamed enums, removing the const qualifier on functions, and converting P.O.D.--in particular, the state array--to indirect forms).

I'm pretty certain that this is due to the compiler not inlining any managed code.  The unmanaged version has very few instructions in its loop.   I'm continuing to look into it; up to now, I've never had to solve the problem of inlining in C++/CLI.

 





Re: Visual C++ Language Big loss in performance when using wrapper for unmanaged code

Brian Kramer

After going through some simplifications of randInt, it looks like the compiler makes it difficult for me to induce inlining.  It wasn't until I made randInt independent of the this pointer that I got it to inline.  I'm going to chalk this up to very inefficient code generation, and hopefully someone at Microsoft can give us an inside perspective.

By the way, when you do benchmarks, make sure you're using the results of the random number, or you assign it to a volatile int.  This makes a difference in the "baseline" unmanaged perf test, where inlining occurs, and the subsequent shift and XOR arithmetic are seen as dead code when their results are not used.
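A minimal sketch of the two ways to keep a benchmark loop "live" follows. It is plain C++, and toyNext is a hypothetical toy generator (xorshift32) standing in for randInt(); it is not the Mersenne Twister.

```cpp
#include <cstdint>

// Hypothetical toy generator (xorshift32) standing in for randInt().
inline std::uint32_t toyNext(std::uint32_t& s) {
    s ^= s << 13;
    s ^= s >> 17;
    s ^= s << 5;
    return s;
}

// Option 1: store every result in a volatile. Each store is an observable
// side effect, so the optimizer cannot discard the arithmetic feeding it.
std::uint32_t runWithSink(std::uint32_t seed, int n) {
    volatile std::uint32_t sink = 0;
    std::uint32_t s = seed;
    for (int i = 0; i < n; ++i) {
        sink = toyNext(s);
    }
    return sink;
}

// Option 2: actually use the results, here by folding them into a checksum
// that the caller goes on to print or compare.
std::uint32_t runWithChecksum(std::uint32_t seed, int n) {
    std::uint32_t s = seed;
    std::uint32_t sum = 0;
    for (int i = 0; i < n; ++i) {
        sum ^= toyNext(s);
    }
    return sum;
}
```

With a plain non-volatile variable that is never read afterwards, an optimizing compiler is free to remove most of the loop body once the call is inlined, which makes the baseline look faster than it really is.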





Re: Visual C++ Language Big loss in performance when using wrapper for unmanaged code

Andreas Asterlund

Brian Kramer wrote:

The managed port is simply changing class MTRand to ref class MTRand and making the usual changes (removing unnamed enums, removing the const qualifier on functions, and converting P.O.D.--in particular, the state array--to indirect forms).

I'm pretty certain that this is due to the compiler not inlining any managed code. The unmanaged version has very few instructions in its loop. I'm continuing to look into it; up to now, I've never had to solve the problem of inlining in C++/CLI.

Hello Brian, and thanks for all your replies!

Actually, because of my stubbornness about keeping the code managed, I started out by porting Rick's version. That was a disappointment, because the 300 million integers took about 20 seconds to produce. I thought the interior pointers I had used were the bottleneck, so I quickly went on to wrapping the native class instead.

What do you mean by "convert P.O.D."?

Brian Kramer wrote:

After going through some simplifications of randInt, it looks like the compiler makes it difficult for me to induce inlining. It wasn't until I made randInt independent of the this pointer that I got it to inline. I'm going to chalk this up to very inefficient code generation, and hopefully someone at Microsoft can give us an inside perspective.

By the way, when you do benchmarks, make sure you're using the results of the random number, or you assign it to a volatile int. This makes a difference in the "baseline" unmanaged perf test, where inlining occurs, and the subsequent shift and XOR arithmetic are seen as dead code when their results are not used.

It sounds to me as if, even when you declare a function as inline, it is up to the compiler to decide whether it is inlined or not. Is that right?

In your opinion, should one avoid using the this pointer as far as possible to gain better performance?

This is what my managed test code looks like:

//====== Test start ====================================
unsigned long oneSeed = 4357UL;
Console::WriteLine( "\nTest of time to generate 300 million random integers:\n" );
MMTRand mtrand3( oneSeed );
unsigned long junk;
int start = System::Environment::TickCount;
for( int i = 0; i < 300000000; ++i )
{
    junk = mtrand3.randInt();
}
int stop = System::Environment::TickCount;
Console::Write( "Time elapsed = " );
Console::WriteLine( "{0:f3}", double( stop - start ) / 1000.0 );
//====== Test end =======================================

How should I change this code?

And this is the code for my wrapper:

public ref class MMTRand
{
private:
    MTRand *pMTRand;

public:
    MMTRand(const MTRand::uint32 seed)
    {
        pMTRand = new MTRand( seed );
    }

    ~MMTRand(void)
    {
        this->!MMTRand();
    }

    !MMTRand(void)
    {
        delete pMTRand;
    }

    MTRand::uint32 randInt(void)
    {return pMTRand->randInt();}
};

Can you give any tips on how to improve the code, or is it fine as it is?

Again, your help is much appreciated!

Best regards Andreas






Re: Visual C++ Language Big loss in performance when using wrapper for unmanaged code

Brian Kramer

 Andreas Asterlund wrote:

What do you mean by "convert P.O.D."?

"Plain Old Data."  Data members in a class can be native types (e.g. int, float, char, etc) , or pointers to native types.  Nested structs/classes and arrays fall into the classification POD and are restricted under C++/CLI:  they cannot be members of a managed class.  The workaround is to turn these into pointers of such and allocate in the constructor.  e.g. uint state[ N ] becomes uint* state;
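The array-to-pointer change described above can be sketched as follows. This is plain C++ so that it stands on its own; StateHolder is a hypothetical name, and in an actual C++/CLI port the class would be a ref class with a destructor and finalizer pair releasing the array. N = 624 is the state size of MT19937.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical class showing only the storage change:
//   before:  uint32 state[N];   (embedded array member)
//   after:   uint32* state;     (pointer, allocated in the constructor)
class StateHolder {
public:
    static constexpr std::size_t N = 624;  // MT19937 state length

    // Allocate the state on the native heap; the trailing () zero-initializes.
    StateHolder() : state(new std::uint32_t[N]()) {}

    // Release the native allocation when the holder goes away.
    ~StateHolder() { delete[] state; }

    // The class owns a raw pointer, so copying is disabled in this sketch.
    StateHolder(const StateHolder&) = delete;
    StateHolder& operator=(const StateHolder&) = delete;

    std::uint32_t* state;
};
```

Code that indexes state[i] keeps working unchanged after the switch; only the declaration, construction and cleanup differ.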

 Andreas Asterlund wrote:

It sounds to me as if, even when you declare a function as inline, it is up to the compiler to decide whether it is inlined or not. Is that right?

The inline keyword is merely a hint. (The Microsoft extension __forceinline is a bit more "forceful," but even that doesn't work in this case).  Someone who's around the compiler sources is going to give a more exact answer than I can here, but my take is that the compiler makes its own decisions on whether something can be inlined, regardless of whether you define it inside the class definition or outside of it using the inline keyword.  (Out-of-class definitions in headers do require inline, however, as a way to tell the compiler to generate code that avoids multiple-definition errors at link time).
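The two ways of defining a member function in a header look like this. Counter is a hypothetical class used only for illustration; imagine the whole block living in a header that several .cpp files include.

```cpp
// Hypothetical header content.
struct Counter {
    int value = 0;

    // Defined inside the class definition: implicitly inline.
    int next() { return ++value; }

    // Declared here, defined below, outside the class.
    int peek() const;
};

// Out-of-class definition in a header. Without `inline`, every .cpp file
// that includes the header would emit its own definition of Counter::peek
// and the linker would report a multiple-definition error.
inline int Counter::peek() const { return value; }
```

In both cases the keyword (explicit or implied) only permits the duplicate definitions; whether the call is actually expanded in place remains the compiler's decision.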

The inlinability of randInt is a no-brainer for the Microsoft compiler for unmanaged code.  But for managed code, I think there's either a hidden requirement that I don't know about, or the compiler isn't designed to be as aggressive as it is for unmanaged code, or it's a bug.

 Andreas Asterlund wrote:

In your opinion, should one avoid using the this pointer as far as possible to gain better performance?

Nope.   The this pointer itself should not inhibit inlining, and the fact that it appears to in this case is a mystery.   If this is important enough to you, you can go with the original C implementation of Mersenne Twister, but this is a big step backward, and I wouldn't take it until you (or we) get to the bottom of why inlining is not happening.  (And if you don't get any responses in this thread, I would encourage you to create a bug on the Connect website with a simple test case.)

You can tell whether something is inlined by looking at the disassembly during debugging.  Or you can set a breakpoint on the call to randInt.  If it is inlined, F11 should do nothing (as long as left > 0).  If it is not inlined, F11 takes you to the header file.

 Andreas Asterlund wrote:

How should I change this code?

Add volatile:  volatile unsigned long junk;

You'll see a dramatic difference when you look at the disassembly.  Without volatile, the compiler wipes away most of the inlined randInt code because it's all dead code. 

Also, when you compare disassembly between managed and unmanaged, you'll notice little difference in the randInt code.  (Which is why I think your original perf difference is due to 1. inlining, 2. dead code discovered after inlining).

 Andreas Asterlund wrote:

Can you give any tips on how to improve the code, or is it fine as it is?

Your wrapper looks just as it should.  But Rick Wagner's code is also easily convertible to a managed class, as I outlined earlier.  Doing this reduces the complexity of your code, and has potential performance benefits.

Brian

PS Here's my test harness.  I measure how many tens of thousands of times I can generate random numbers in one second (an easier metric to deal with, and makes running time constant).  The reason for the 10,000x magnification is to make calls to clock() negligible.  I also average three runs, throwing away the first (which tends to be more varied due to the CPU bouncing around when the executable first starts).

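#include <cstdio>   // printf
#include <cstdlib>  // system
#include <ctime>    // clock

// MTRand_wrapped is the managed wrapper class under test (its header is not shown here).
int main()
{
  printf( "Running managed wrapper version of MTRand.\n" );
  printf( "/clr is in effect.\n");
  int total = 0;
  // Four passes; the first is a warm-up and is left out of the average.
  for( int pass = 0; pass < 4; pass++ )
  {
    MTRand_wrapped r;
    int cnt = 0;
    // Count how many batches of 10,000 calls complete in one second.
    // 1000 ticks is one second because CLOCKS_PER_SEC is 1000 with the Microsoft CRT.
    for( int time = clock(); clock() - time < 1000; cnt++ )
    {
      for( int i = 0; i < 10000; i++ )
      {
          // volatile keeps the optimizer from discarding the result.
          volatile int n = r.randInt();
      }
    }
    printf( "cnt=%d\n", cnt );
    if( pass > 0 )
    {
       total += cnt;
    }
  }
  printf( "avg=%d\n",total/3);
  system( "pause" );
  return 0;
}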

 





Re: Visual C++ Language Big loss in performance when using wrapper for unmanaged code

Andreas Asterlund

Hello Brian and once again thanks for all your good and informative replies!

 Brian Kramer wrote:

The workaround is to turn these into pointers of such and allocate in the constructor.  e.g. uint state[ N ] becomes uint* state;

I did not realize it before, although I should have when I wrote the wrapper, that I can use a native pointer to a native type on the native heap and, as you say, allocate the memory with new in the constructor. Is that okay? The reason I am asking is that I currently use a state vector of managed type, and I am also using interior pointers to point at that vector. I have got the impression that interior pointers are much slower than native ones.

 Brian Kramer wrote:

int main()
{
  printf( "Running managed wrapper version of MTRand.\n" );
  printf( "/clr is in effect.\n");
  int total = 0;
  for( int pass = 0; pass < 4; pass++ )
  {
    MTRand_wrapped r;
    int cnt = 0;
    for( int time = clock(); clock() - time < 1000; cnt++ )
    {
      for( int i = 0; i < 10000; i++ )
      {
          volatile int n = r.randInt();
      }
    }
    printf( "cnt=%d\n", cnt ); 
    if( pass > 0 )
    {
       total += cnt;
    }
  }
  printf( "avg=%d\n",total/3);
  system( "pause" );
  return 0;
}

The test code seems to be good and I intend to use it for my tests. But I do have some questions about it.

Is it not a loss in performance to declare a variable each time a loop iterates? I mean, is the volatile int n in this code not declared more than 40000 times? This must eat a lot of memory. Can one modify the code to this:

MTRand_wrapped r;
int total = 0;
int cnt = 0;
int time = 0;
int i = 0;
volatile int n = 0;
for( int pass = 0; pass < 4; pass++ )
{
   cnt = 0;
   for(time = clock(); clock() - time < 1000; cnt++ )
   {
     for( i = 0; i < 10000; i++ )
     {
         n = r.randInt();
     }
   }
   printf( "cnt=%d\n", cnt ); 
   if( pass > 0 )
   {
      total += cnt;
   }
}

Thanks again for all the help!

Best regards Andreas.






Re: Visual C++ Language Big loss in performance when using wrapper for unmanaged code

Brian Kramer

 Andreas Asterlund wrote:

Hello Brian and once again thanks for all your good and informative replies!

I did not realize it before, although I should have when I wrote the wrapper, that I can use a native pointer to a native type on the native heap and, as you say, allocate the memory with new in the constructor. Is that okay? The reason I am asking is that I currently use a state vector of managed type, and I am also using interior pointers to point at that vector. I have got the impression that interior pointers are much slower than native ones.

Is it not a loss in performance to declare a variable each time a loop iterates? I mean, is the volatile int n in this code not declared more than 40000 times? This must eat a lot of memory. Can one modify the code to this:

The state array goes on the regular, unmanaged heap, so yes, use the new operator to allocate this.  The containing class is managed, and as such will get moved around in memory (which is part of why it's called "managed"), but the native member pointer points to the unmanaged heap (which doesn't move.)

During runtime, there is no "declaring" of variables.  All local variables have a slot on the stack, and that is figured out during compilation.  The only difference between declaring a variable at function scope and at block scope is, well, its scope.  Allocation happens only once, when you enter the function (the stack grows by a certain amount), and deallocation happens once, when you leave it (the stack pointer register gets restored).
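The point can be checked with a small plain C++ experiment. The function names are hypothetical. The two sum functions differ only in where the variable is declared and return the same result; sameSlotEveryIteration reports whether the block-scoped variable sat at the same address on every iteration. The C++ standard does not promise that it does, but it is what one would expect from the stack-slot behaviour described above.

```cpp
#include <cstdint>

// Variable declared inside the loop body (block scope).
unsigned sumBlockScope(int n) {
    unsigned total = 0;
    for (int i = 0; i < n; ++i) {
        volatile int v = i;
        total += static_cast<unsigned>(v);
    }
    return total;
}

// Same loop with the variable declared once at function scope.
unsigned sumFunctionScope(int n) {
    unsigned total = 0;
    volatile int v = 0;
    for (int i = 0; i < n; ++i) {
        v = i;
        total += static_cast<unsigned>(v);
    }
    return total;
}

// Returns true if the block-scoped variable occupied the same address
// on every iteration, i.e. no new storage was set up per iteration.
bool sameSlotEveryIteration(int n) {
    std::uintptr_t first = 0;
    bool same = true;
    for (int i = 0; i < n; ++i) {
        volatile int v = i;
        std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(&v);
        if (i == 0) {
            first = addr;
        } else if (addr != first) {
            same = false;
        }
    }
    return same;
}
```

Moving the declaration out of the loop therefore changes readability and scope, not memory use.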

Brian

 





Re: Visual C++ Language Big loss in performance when using wrapper for unmanaged code

Andreas Asterlund

Hello Brian!

Thanks again for your tips. I will try to modify my ported class and hopefully speed it up a little bit. I have one last question before I go on with my work on the class:

Suppose that, instead of using a header file, I use a class library (DLL), and before I compile it I turn on the options "Favor Fast Code (/Ot)" and "Maximize Speed (/O2)" in the property pages for the class library project. Does this have any impact on the inlining we talked about earlier?

Best regards, Andreas






Re: Visual C++ Language Big loss in performance when using wrapper for unmanaged code

Brian Kramer

/Os and /Ot favor size or speed, respectively, but I think these might be out of fashion. /O2 is what you normally use when compiling for speed. (There are some cases where compiling for size actually is faster due to instruction cache issues, but I digress.)

/O2 is shorthand for a bundle of switches that includes /Og and /Ob2 (and /Ot, among others). /Og means global optimization, but really it means do all kinds of optimizations (global optimization as defined in compiler theory being one of them). /Ob2 means "inline at will," according to the compiler's best efforts at cost analysis.

If the compiler sees the body of a function prior to a function's call site during compilation, then that function is always considered for inlining. (And /LTCG allows inlining to occur between modules). In C++, and apparently not C++/CLI, __forceinline usually works to override the compiler's costing decision (though it still won't inline functions where doing so might change exception-handling semantics, functions that are recursive, and so on).
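For code that has to build with more than one compiler, __forceinline is usually hidden behind a macro. The sketch below is plain C++; FORCE_INLINE and temper are hypothetical names, and temper reproduces the MT19937 output tempering step (the "shift and XOR arithmetic" of randInt) only as something small worth inlining. On the GCC/Clang branch it uses the always_inline attribute as the closest equivalent.

```cpp
#include <cstdint>

// Hypothetical portability macro around the compiler-specific spellings.
#if defined(_MSC_VER)
#  define FORCE_INLINE __forceinline
#elif defined(__GNUC__) || defined(__clang__)
#  define FORCE_INLINE inline __attribute__((always_inline))
#else
#  define FORCE_INLINE inline  // plain hint as a fallback
#endif

// MT19937 tempering: a handful of shifts, masks and XORs applied to a
// raw state word before it is returned to the caller.
FORCE_INLINE std::uint32_t temper(std::uint32_t y) {
    y ^= (y >> 11);
    y ^= (y << 7) & 0x9d2c5680u;
    y ^= (y << 15) & 0xefc60000u;
    y ^= (y >> 18);
    return y;
}
```

Even with the macro, the disassembly is the only reliable way to confirm that a given call site was actually expanded in place.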

You really should open a bug on that inlining issue... it's pretty severe, IMO.

Brian