Rivorus wrote:
An xchg does not guarantee any fence whatsoever, at least as far as I can see from the x86 instruction listings
According to the Intel 64 / IA-32 Software Developer's Manual, the XCHG instruction asserts the LOCK signal automatically whenever one of its operands is a memory location, whether or not a LOCK prefix is present.
As for fence operations, I can confirm that no explicit ones are emitted for volatile objects. Then again, the documentation doesn't appear to promise any (and neither does volatile on any other C++ platform I know of). The only guarantee made is that volatile objects do not take part in any compiler re-orderings.
One part I find ambiguous in the MSDN documentation is the following statement: "Although the processor will not reorder un-cacheable memory accesses, un-cacheable variables must be volatile to guarantee that the compiler will not change memory order." This seems to suggest that volatile objects are allocated in ranges marked as uncacheable in the processor's MTRRs (memory type range registers, applicable to P4, Xeon and P6). According to the Intel docs, marking a range UC will cause the processor to enforce strong ordering on accesses to that range. What's unclear to me about this strong ordering is whether it affects only the uncacheable memory ranges themselves (that is, reads and writes to UC ranges cannot be re-ordered). I doubt, however, that it also enforces serialization of reads / writes to ranges *not* marked as UC, which occur (instruction-wise) in between those to the UC ranges.
The bottom line, as I see it, is that you must manually use one of: the fence instructions; implicitly or explicitly LOCK-prefixed instructions; CPUID; or API-level locking primitives. Failure to do so may, as you note, allow CPU re-orderings to break your code on SMP architectures.
A side note: I noticed that the Orcas compiler generates release-build code for volatile operations in a manner similar to VC8's debug builds. A single pre-increment of a volatile int will load the value into a register, increment the register, then write it back. Release builds in VC8 skip the register step and increment the value at its memory location in a single instruction.
We should probably spend some more time to document this better.
Let's talk specifically about x86/x64 (as Itanium has different behavior). volatile and the _*Barrier intrinsics prevent compiler reordering, but are not hardware fences. volatile does so with acquire/release semantics, and the barriers with different semantics depending on which one is used. So why do the *fence instructions exist? They exist for weakly-ordered memory accesses (remember, these fences weren't introduced until SSE and SSE2). So in cases where you are using weakly-ordered instructions, you should make sure that you use these fences. In the absence of weakly-ordered instructions, the strong-ordering guarantees of the processor make the constraints on compiler reordering sufficient (through the use of volatile and the _*Barrier intrinsics).
Thanks,
Now that's really confusing. AFAIK there are no guarantees for the ordering of normal loads (i.e. not only inherently weakly-ordered instructions) from WB memory on x86. So exactly how can a volatile memory access have acquire and release semantics without a memory barrier?
-hg
Holger Grund wrote:
Now that's really confusing. AFAIK there are no guarantees for the ordering of normal loads (i.e. not only inherently weakly-ordered instructions) from WB memory on x86. So exactly how can a volatile memory access have acquire and release semantics without a memory barrier?
My understanding (or assumption) is that it's controlled through the memory type range registers and / or page attribute tables, by marking the volatile ranges as uncachable, and thus forcing the CPU to use strict ordering. I haven't seen this confirmed by anyone, though.
I don't think Windows ever allocates UC pages for standard heap memory. There are several reasons why this would be impractical.
My (admittedly somewhat outdated) information about IA-32 suggests that all CPUs will always agree on a given CPU's order of stores (unless the instructions are inherently weakly ordered, e.g. non-temporal stores). The same is not true for loads, however. Therefore I claim that a simple load from memory (which is what a volatile read boils down to) does not have acquire semantics.
-hg
That's pretty much what the Intel and AMD manuals say -- reads can go ahead of writes; out-of-order reads are allowed, as are speculative reads.
Holger Grund wrote:
I don't think Windows ever allocates UC pages for standard heap memory. There are several reasons why this is impratical.
I suppose we need Kang Su to make another statement.
This goes for x86 only, as I have only tested on x86.
As for volatile variables, they are not placed in a UC area; as far as placement goes, they are treated just like normal variables.
For compiler reordering, volatiles behave similarly to a read or write barrier.
For example, this code:
will generate something like this:
The first write instruction could be omitted, but the compiler kept it because it was placed before the volatile write.
From http://msdn2.microsoft.com/en-us/library/bb310595.aspx
Visual C++ 2005 goes beyond standard C++ to define multi-threading-friendly semantics for volatile variable access. Starting with Visual C++ 2005, reads from volatile variables are defined to have read-acquire semantics, and writes to volatile variables are defined to have write-release semantics. This means that the compiler will not rearrange any reads and writes past them, and on Windows it will ensure that the CPU does not do so either.
As far as my understanding of read-acquire and write-release goes, any write after a write-release could be moved before the write-release. The code above could also be compiled as:
And another strange part is the mention of "on Windows". There is no way that Windows can affect anything about reordering.
All this reordering, memory barrier, and volatile business is making me very uncomfortable.
It's getting complicated and strange.
This is my first use of the MSDN forums and I have made some mistakes. Some mailto links are present in this post; just ignore them. Sorry.
Mag pie wrote:
And another strange part is the mention of "on Windows". There is no way that Windows can affect anything about reordering.
All this reordering, memory barrier, and volatile business is making me very uncomfortable.
It's getting complicated and strange.
I was informed by Kang Su Gatlin that he had asked a chip vendor to address this thread. There will probably be some new input here this coming week.
Mag pie wrote:
int testloc1;
volatile int testloc2;
// ...
testloc1 = x;
++testloc2;
testloc1 = y;
If I'm interpreting the Intel / AMD docs correctly, and the instruction ordering isn't otherwise strengthened, any x86 or x64 CPU would be allowed to reorder the execution of the above code to (roughly) read:
testloc1 = x;
testloc1 = y;
++testloc2;
The compiler is allowed to optimize away the "testloc1 = x;" statement, as it is a dead store.
My thought is that a volatile write acts as the "_WriteBarrier" intrinsic.
Compiled in VC8 ( /Ox /Oi /GL /D "WIN32" /D "NDEBUG" /FD /EHsc /MD /Fo"Asm\\" /c /Wp64 /Zi /TP /errorReport:prompt ), the output asm file reads like this:
mov DWORD PTR testloc1__3HA, eax ; testloc1 = x : this should be optimized away
mov DWORD PTR testloc1__3HA, eax ; testloc1 = x : compiler reordered as write-release allows
mov DWORD PTR testloc2__3HC, eax ; testloc2 = x
add eax, 2
mov DWORD PTR testloc1__3HA, eax ; ++ ++ testloc1 : compiler reordered, moved up
mov eax, 1
mov DWORD PTR testloc2__3HC, ecx ; testloc2 = y
add DWORD PTR testloc2__3HC, eax ; ++ testloc2
add DWORD PTR testloc2__3HC, eax ; ++ testloc2
I have very little idea how the CPU will reorder this code at run time, but it seems to me that the compiler is doing some reordering, treating the volatile write as a _WriteBarrier. MSDN and the Intel manuals state that write reordering is very unlikely on x86, unless there is some alignment issue; please correct me if I'm wrong. The CPU will perform the writes in assembly-code order.
Reads can be the real problem: the compiler will ensure that a volatile read or _ReadBarrier has read-acquire semantics, but the CPU will not. <- I'm not so sure about this.
Mag pie wrote:
The compiler is allowed to optimize away the "testloc1 = x;" statement, as it is a dead store.
Well, my point still stands if you make both testlocs volatile.
Mag pie wrote:
My thought is that a volatile write acts as the "_WriteBarrier" intrinsic.
As did the original poster.
Mag pie wrote:
I have very little idea how the CPU will reorder this code at run time, but it seems to me that the compiler is doing some reordering, treating the volatile write as a _WriteBarrier. MSDN and the Intel manuals state that write reordering is very unlikely on x86, unless there is some alignment issue; please correct me if I'm wrong. The CPU will perform the writes in assembly-code order.
Write reordering is illegal. Reads are allowed to move ahead of writes (provided they aren't targeting the same address), and speculative reads are also allowed.
I think we're repeating ourselves, though, so I suggest we let this thread stay dead until the chipset people have spoken.