Rivorus

Before I go into my huge rant: is there a way to prevent the compiler from reordering statements that will work for VC8 as well as older versions of VC? I want to be very clear that I am looking strictly for compiler barriers and not memory barriers for the time being, since generating the appropriate memory barriers is easy enough to do in __asm blocks.
Anyway, as the title suggests, in VC8 standard I am pretty sure that I am seeing output that is very incorrect when compared to the functionality described on MSDN. First, the documentation on MSDN about _ReadBarrier and similar intrinsics actually contradicts itself! Here http://msdn2.microsoft.com/en-us/library/z055s48f(VS.80).aspx it is claimed that _ReadBarrier is a memory barrier, whereas here http://msdn2.microsoft.com/en-us/library/ms684208.aspx (in the remarks section of that page) it is described as being a compiler barrier and nothing more.
As well, the code snippet on the second page linked, which supposedly shows a full memory barrier for x86, apparently does nothing of the sort. An xchg does not guarantee any fence what-so-ever, at least as far as I can see from x86 instruction listings. It is strictly an unordered atomic operation. In order to portably and reliably create any type of fence, I'm fairly certain that you would need to use sfence, lfence, or mfence, which are SSE and SSE2 instructions.
After looking at the output generated by the supposed interlocked operation intrinsics such as InterlockedCompareExchange, which claim full fence semantics, it seems as though the appropriate fences are not in place, even when SSE and SSE2 generation is enabled! After looking through the interlocked intrinsics I have only found contradictions in documentation and, unless I am going insane, improper implementation. Similarly, the load-acquire and store-release semantics specified for volatile are not reflected in the instructions generated.
For reference, I am using VC8 standard in debug on Windows XP, testing with SSE and SSE2 generation on and off.
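
For readers looking for the compiler-only barrier asked about above, here is a minimal sketch. It assumes the _ReadWriteBarrier intrinsic from <intrin.h> on Visual C++ and, on GCC or Clang, an empty asm statement with a "memory" clobber. The macro COMPILER_BARRIER and the names payload, ready, publish and demo are hypothetical, chosen only for illustration. Neither form emits a machine instruction, so this constrains the compiler and nothing else.

```cpp
#include <cassert>

#if defined(_MSC_VER)
#include <intrin.h>
#pragma intrinsic(_ReadWriteBarrier)
// Assumption: the _ReadWriteBarrier intrinsic is available on this compiler.
#define COMPILER_BARRIER() _ReadWriteBarrier()
#else
// GCC/Clang: an empty asm statement that tells the compiler memory may change.
#define COMPILER_BARRIER() __asm__ __volatile__("" ::: "memory")
#endif

int payload = 0;  // hypothetical shared data
int ready = 0;    // hypothetical "data is valid" flag

// Writes the data first and the flag second. The barrier keeps the compiler
// from moving either store across it; it places no constraint on the CPU.
void publish(int value) {
    payload = value;
    COMPILER_BARRIER();
    ready = 1;
}

// Single-threaded usage check: both stores happened.
void demo() {
    publish(7);
    assert(payload == 7 && ready == 1);
}
```

Whether the hardware may still reorder the two stores is a separate question, which is what the rest of this thread is about.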

Re: Visual C++ Language Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8?

einaros

Have you actually witnessed any dangerous reorderings in your compiled code? And what platform are you targeting?




Re: Visual C++ Language Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8?

Rivorus

Unfortunately, since they would just be CPU reorderings and not reorderings of the instructions themselves, it's impossible to force such reorderings to occur or to reproduce the same behavior on a second run of the same test. I would likely only be able to witness improper results if I were to personally run rigorous tests on various multicore processors on Windows, and even then I'd have to get lucky enough (or unlucky enough) to have the result of one operation issued earlier appear after the result of another operation issued later. I do not currently have that luxury, and I'd rather not have to wait until users get odd results on Windows, considering that from the instructions generated it looks very clear to me that there are absolutely no guarantees for ordering in place, despite what the documentation claims. While they may rarely be apparent when running code, these reorderings are perfectly fine for the CPU to do and, unless I am horribly mistaken, are exactly why instructions such as sfence, lfence, and mfence exist in the first place!
Microsoft seems to somewhat understand the concepts of ordering, since they acknowledge that barriers can be expensive, and so they supposedly provide versions with only read barriers (acquire semantics) and write barriers (release semantics) in addition to their supposedly fully ordered versions. I don't have access to a Vista machine, so I can't check the instructions generated from the Acquire and Release versions (Vista only according to the documentation, which doesn't make any sense to me, since these low-level intrinsics are just a few lines of inlined x86 instructions), but I am really curious as to how they would be implemented if Microsoft believes that a lack of any fences already implies fully ordered semantics with atomic instructions. If that actually were the case, then it wouldn't be possible to implement strictly acquire and release versions.
Does the implementation for the Acquire and Release forms of the intrinsics correctly insert lfence and sfence instructions respectively? The acquire and release semantics which are supposedly present in VC8 for volatile variable access do not output any of these fences, so I wouldn't be surprised if the same went for the interlocked intrinsics. If on Vista the Acquire and Release forms of the intrinsics actually do put in fences, doesn't it seem rather odd that the acquire and release forms have more ordering instructions than the supposedly fully ordered versions? In actuality, it looks to me that, at least on VC8 on XP in debug, with no SSE generation, SSE generation, and SSE2 generation applied, the fully ordered forms of these intrinsics are actually totally unordered. For the RMW instructions, proper acquire semantics would imply an lfence after the atomic operation, proper release semantics would imply an sfence before the operation, and strict ordering would imply both an sfence prior to the instruction and an lfence after the instruction.
To make sure I am not going crazy, I've started looking online, and I did come across this page http://ein.designreactor.com:8080/amd_devCentral/articlex.jsp?id=89#top which is from before VS 2005 was released. The important line to take note of is "... in Visual Studio 2005, volatile will also introduce code to prevent out-of-order execution by the processor (a memory fence or memory barrier). In other environments, you will need to use memory fence instructions (sfence, lfence, and mfence) to prevent hardware re-ordering." As was stated, CPU barriers are required for proper ordering guarantees. I find it especially frustrating since this page is so old, refers to VS 2005 providing such functionality once it was released, and even explains exactly how to properly implement such functionality, including the exact instructions required, and yet I see no fences in the code generated by VC8.
What's worse, MSDN still claims proper output, but I see nothing of the sort. What does Microsoft think acquire and release semantics mean? Are they just interpreting these as being compiler-only reordering requirements? If so, they have a horrible misunderstanding of atomic operations and ordering semantics. It seems to me that the proper output would be a compiler error on attempt to use any ordering semantics without at least SSE generation enabled, a compiler error for store and fully ordered semantics with only SSE generation enabled (using sfence prior to "write" operations when release semantics are specified), and finally, release and fully ordered semantics allowable with SSE2 generation enabled (using lfence after "read" operations for acquire semantics and using sfence and lfence around RMW operations for fully ordered semantics). Similarly, this would have to be reflected in the functionality of volatile qualification as well.



Re: Visual C++ Language Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8?

einaros

Rivorus wrote:
An xchg does not guarantee any fence what-so-ever, at least as far as I can see from x86 instruction listings

According to the Intel 64 / IA-32 software development manual, XCHG instructions imply automatic locking.

As for the fence operations, I can confirm that no explicit ones are provided for volatile objects. That being said, the documentation doesn't appear to make that guarantee (and neither does volatile on any other C++ platforms I know of). The only guarantee made is for the volatile objects not to take part in any compiler re-orderings.

One part I find ambiguous in the MSDN documentation is the following statement: "Although the processor will not reorder un-cacheable memory accesses, un-cacheable variables must be volatile to guarantee that the compiler will not change memory order." This seems to suggest that volatile objects are allocated in ranges marked as un-cacheable in the processor's MTRRs (memory type range registers, applicable for P4, Xeon and P6). According to the Intel docs, marking a range UC will cause the processor to enforce strong ordering on accesses to that segment. What's unclear to me about this strong ordering is whether it affects only the un-cacheable memory segments themselves (that is, reads and writes to UC segments cannot be reordered relative to each other). I doubt, however, that it also enforces serialization of reads and writes to segments *not* marked as UC which occur (instruction-wise) in between those to the UC segments.

The bottom line, as I see it, is that you must manually use one of the following: the fence instructions; instructions that are implicitly or explicitly lock-prefixed; CPUID; or API-level lock primitives. Failure to do so may, as you note, allow CPU reorderings to break your code on SMP architectures.
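
As a rough sketch of the "locked instruction" and "explicit fence" options listed above, the following uses standard C++11 atomics. That is an assumption on my part: <atomic> postdates VC8, so on that compiler the Interlocked* functions or hand-written __asm would take its place. The names compare_exchange_full, full_fence and demo are hypothetical. On x86, compilers typically lower a sequentially consistent compare-exchange to a lock-prefixed cmpxchg, and the Intel manual cited earlier states that loads and stores are not reordered with locked instructions.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for InterlockedCompareExchange: stores `desired`
// only if `target` currently equals `comparand`, and returns the value
// observed in `target` before the operation.
int compare_exchange_full(std::atomic<int>& target, int desired, int comparand) {
    // On failure, compare_exchange_strong writes the observed value into
    // `comparand`; on success `comparand` already equals the old value.
    target.compare_exchange_strong(comparand, desired, std::memory_order_seq_cst);
    return comparand;
}

// Hypothetical helper: a stand-alone full fence with no data access.
void full_fence() {
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Single-threaded usage check of the return-value convention.
void demo() {
    std::atomic<int> lock_word(0);
    assert(compare_exchange_full(lock_word, 1, 0) == 0);  // succeeds, old value 0
    assert(compare_exchange_full(lock_word, 2, 0) == 1);  // fails, observes 1
    assert(lock_word.load() == 1);
    full_fence();
}
```

The fence helper is shown only for completeness; when the synchronizing operation is itself a locked read-modify-write, a separate fence is usually redundant on x86.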

A side note: I noticed that the Orcas compiler generates volatile operations in release builds in a manner similar to debug builds in VC8. A single pre-increment of a volatile int will load the value into a register, increment the register, then write it back to the memory location. Release builds in VC8 skip the register step and increment the value at the memory location in a single instruction.






Re: Visual C++ Language Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8?

Kang Su Gatlin

We should probably spend some more time to document this better.

Let's talk specifically about x86/x64 (as Itanium has different behavior). volatile and _*Barrier prevent compiler reordering, but are not hardware fences. volatile does so with acquire/release semantics, and the Barriers with different semantics depending on which one is used. So why do the *fence instructions exist? They exist for weakly-ordered memory accesses (remember these fences weren't introduced until SSE and SSE2). So in cases where you are using weakly-ordered instructions, you should make sure that you use these fences. In the absence of these weakly-ordered instructions, the strong-ordering guarantees of the processor make the constraints on compiler reordering sufficient (through the use of volatile and _*Barrier).
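
The acquire/release pattern described here can be sketched with C++11 atomics. This is an assumption for illustration, since VC8 has no <atomic> and would use a volatile flag instead; the names payload_value, payload_ready and message_passing_demo are hypothetical. The writer stores the data and then sets the flag with release semantics. The reader waits on the flag with acquire semantics and then reads the data. On x86, compilers typically emit plain mov instructions for both the release store and the acquire load, which is consistent with the point that no hardware fence is needed for ordinary write-back memory.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

int payload_value = 0;                  // hypothetical plain (non-atomic) data
std::atomic<bool> payload_ready(false); // hypothetical flag guarding the data

// Runs one writer (this thread) and one reader; returns what the reader saw.
int message_passing_demo() {
    payload_value = 0;
    payload_ready.store(false);
    int seen = -1;

    std::thread reader([&seen] {
        // Acquire load: later reads cannot be moved before it.
        while (!payload_ready.load(std::memory_order_acquire)) {
            std::this_thread::yield();
        }
        seen = payload_value;
    });

    payload_value = 42;
    // Release store: earlier writes cannot be moved after it.
    payload_ready.store(true, std::memory_order_release);

    reader.join();
    assert(seen == 42);  // guaranteed by the release/acquire pairing
    return seen;
}
```

The guarantee in the final assert comes from the language-level pairing, so it holds regardless of which instructions a given compiler chooses.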

Thanks,






Re: Visual C++ Language Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8?

Holger Grund

Now that's really confusing. AFAIK there are no guarantees for the ordering of normal loads (i.e. not only inherently weakly ordered instructions) from WB memory on x86. So, exactly how can a volatile memory access have acquire and release semantics without a memory barrier?

-hg





Re: Visual C++ Language Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8?

einaros

Holger Grund wrote:

Now that's really confusing. AFAIK there are no guarantees for the ordering of normal loads (i.e. not only inherently weakly ordered instructions) from WB memory on x86. So, exactly how can a volatile memory access have acquire and release semantics without a memory barrier?

My understanding (or assumption) is that it's controlled through the memory type range registers and/or page attribute tables, by marking the volatile ranges as un-cacheable and thus forcing the CPU to use strict ordering. I haven't seen this confirmed by anyone, though.






Re: Visual C++ Language Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8?

Holger Grund

I don't think Windows ever allocates UC pages for standard heap memory. There are several reasons why this is impractical.

My (admittedly somewhat outdated) information about IA-32 suggests that all CPUs will always agree about a given CPU's order of stores (unless instructions are inherently weakly ordered, e.g. non-temporal stores). The same is not true for loads, however. Therefore I claim that a simple load from memory (which is what a volatile read boils down to) does not have acquire semantics.

-hg





Re: Visual C++ Language Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8?

einaros

That's pretty much what the Intel and AMD manuals say -- reads can go ahead of writes; out-of-order reads are allowed, as are speculative reads.
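
The "reads can go ahead of writes" case is usually illustrated with a two-thread store/load test. Below is a minimal sketch, again assuming C++11 atomics that are not available in VC8; the names x, y and count_both_zero are hypothetical. Each thread stores 1 to its own variable and then loads the other. With plain stores and loads, x86 permits both threads to read 0, because each load may complete before the other CPU's store becomes visible. With sequentially consistent operations, as used here, that outcome is forbidden, so the function should always return 0.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x(0);  // hypothetical shared variables
std::atomic<int> y(0);

// Runs the store/load test `iterations` times and returns how many runs
// ended with both threads having read 0.
int count_both_zero(int iterations) {
    assert(iterations > 0);
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        x.store(0);
        y.store(0);
        int r1 = -1;
        int r2 = -1;
        std::thread t1([&r1] {
            x.store(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread t2([&r2] {
            y.store(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        t1.join();
        t2.join();
        if (r1 == 0 && r2 == 0) {
            ++both_zero;  // would mean a load passed the earlier store
        }
    }
    return both_zero;
}
```

Relaxing the four operations to std::memory_order_relaxed removes the guarantee, although actually observing the both-zero outcome depends on timing and hardware and may take many iterations.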






Re: Visual C++ Language Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8?

einaros

Holger Grund wrote:

I don't think Windows ever allocates UC pages for standard heap memory. There are several reasons why this is impractical.

I suppose we need Kang Su to make another statement.






Re: Visual C++ Language Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8?

Mag pie

This goes for x86 only, as I have only tested on x86.

As for volatile variables, they are not placed in a UC area. As far as placement goes, they are treated just like normal variables.

For compiler reordering, volatiles behave similarly to a read or write barrier.

For example:

Code Snippet
int testloc1;
volatile int testloc2;
//
testloc1 = x;
++ testloc2;
testloc1 = y;

will generate something like this

The first write instruction could be omitted, but the compiler kept it because it was placed before the volatile write.

From http://msdn2.microsoft.com/en-us/library/bb310595.aspx

Visual C++ 2005 goes beyond standard C++ to define multi-threading-friendly semantics for volatile variable access. Starting with Visual C++ 2005, reads from volatile variables are defined to have read-acquire semantics, and writes to volatile variables are defined to have write-release semantics. This means that the compiler will not rearrange any reads and writes past them, and on Windows it will ensure that the CPU does not do so either.

As my understanding of read-acquire and write-release goes, any write after a write-release could be moved before the write-release. The code above could also be compiled as:

Code Snippet
mov DWORD PTR testloc1__3HA, eax ; testloc1 = y
mov edx, 1
add DWORD PTR testloc2__3HC, edx ; ++ testloc2

Another strange part is the mention of "on Windows". There is no way that Windows can affect anything about reordering.

All this reordering, memory barrier, and volatile business is making me very uncomfortable.

It is getting complicated and strange.

This is my first time using the MSDN forum and I have made some mistakes. Some mailto links are present in this post; just ignore them. Sorry.





Re: Visual C++ Language Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8?

einaros

Mag pie wrote:

Another strange part is the mention of "on Windows". There is no way that Windows can affect anything about reordering.

All this reordering, memory barrier, and volatile business is making me very uncomfortable.

It is getting complicated and strange.

I was informed by Kang Su Gatlin that he had asked a chip vendor to address this thread. There will probably be some new input here this coming week.






Re: Visual C++ Language Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8?

einaros

Mag pie wrote:

Code Snippet
int testloc1;
volatile int testloc2;
//
testloc1 = x;
++ testloc2;
testloc1 = y;

If I'm interpreting the Intel / AMD docs correctly, and the instruction ordering isn't otherwise strengthened, any x86 or x64 CPU would be allowed to reorder the execution of the above code to (roughly) read:

Code Snippet

testloc1 = x;

testloc1 = y;

++testloc2;






Re: Visual C++ Language Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8?

Mag pie

The compiler is allowed to optimize away the "testloc1 = x;" statement, as the statement is useless.

My thought is that a volatile write acts as a "_WriteBarrier" intrinsic.

Code Snippet
testloc1 = x;
testloc2 = x;
testloc1 = x;
testloc2 = y;
++ ++ testloc1;
++ ++ testloc2;

Compiled in VC8 ( /Ox /Oi /GL /D "WIN32" /D "NDEBUG" /FD /EHsc /MD /Fo"Asm\\" /c /Wp64 /Zi /TP /errorReport:prompt ), the output asm file reads like this:

Code Snippet

mov DWORD PTR testloc1__3HA, eax ; testloc1 = x : this should be optimized away.
mov DWORD PTR testloc1__3HA, eax ; testloc1 = x : compiler reordered as write-release allows
mov DWORD PTR testloc2__3HC, eax ; testloc2 = x
add eax, 2
mov DWORD PTR testloc1__3HA, eax ; ++ ++ testloc1 : compiler reordered, moved up
mov eax, 1
mov DWORD PTR testloc2__3HC, ecx ; testloc2 = y
add DWORD PTR testloc2__3HC, eax ; ++ testloc2
add DWORD PTR testloc2__3HC, eax ; ++ testloc2

I have very little idea how the CPU will reorder this code at run time, but it seems to me that the compiler is doing some reordering that treats a volatile write as a _WriteBarrier. MSDN and the Intel manuals state that write reordering is very unlikely on x86 unless there is some alignment issue (please correct me if I'm wrong), so the CPU will write in the order of the assembly code.

Reads may be the real problem: the compiler will ensure that a volatile read or _ReadBarrier acts with read-acquire semantics, but the CPU will not. <- I'm not so sure about this.





Re: Visual C++ Language Are Interlocked and Barrier Intrinsics Implemented Incorrectly on VC8?

einaros

Mag pie wrote:

The compiler is allowed to optimize away the "testloc1 = x;" statement, as the statement is useless.

Well, my point still stands if you make both testlocs volatile.

Mag pie wrote:

My thought is that a volatile write acts as a "_WriteBarrier" intrinsic.

As did the original poster.

Mag pie wrote:

I have very little idea how the CPU will reorder this code at run time, but it seems to me that the compiler is doing some reordering that treats a volatile write as a _WriteBarrier. MSDN and the Intel manuals state that write reordering is very unlikely on x86 unless there is some alignment issue (please correct me if I'm wrong), so the CPU will write in the order of the assembly code.

Write reordering is illegal. Reads are allowed to move ahead of writes (provided that they aren't targeting the same address), and speculative reads are also allowed.

I think we're repeating ourselves, though, so I suggest we let this thread stay dead until the chipset people have spoken.