"…but because of how difficult it can be to penetrate the subject in the first place"; and Bruce Dawson's excellent white paper, Lockless Programming Considerations (http://msdn.microsoft.com/en-us/library/windows/desktop/ee418650(v=vs.85).aspx). Like many others, I've had the opportunity to follow Bruce Dawson's advice while developing and debugging lock-free code on the Xbox 360.
At times, the information in one source may appear orthogonal to other sources. The C++11 atomic library standard (http://en.cppreference.com/w/cpp/atomic) is a good tool for writing lock-free algorithms.
…semaphores (http://preshing.com/20111118/locks-arent-slow-lock-contention-is). That plain-spoken description isn't wrong, but it isn't the whole story of lock-free programming. The generally accepted definition, based on academic literature, is a bit broader. At its essence, lock-free is a property used to describe some code, without saying too much about how that code was actually written.
…your worst enemy: the thread scheduler. That last point sounds funny, but it's the crucial one. Shared mutexes are ruled out trivially, because as soon as one thread obtains the mutex, the scheduler could simply refuse to ever schedule that thread again. Of course, real operating systems don't work that way; we're merely defining terms.
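The kind of example this paragraph alludes to can be sketched as follows. This is an assumption about the dropped illustration, not a quote from the post: two threads running this loop hold no mutex, yet a hostile scheduler can interleave them so that each keeps flipping X back just before the other checks it, and neither ever exits. Lock-free means ruling out that possibility.

```cpp
#include <atomic>

// Shared flag flipped by the loop below. With two threads running
// flipUntilNonZero() concurrently, a worst-case scheduler can keep the
// program spinning forever: no lock is held, but the code is NOT lock-free.
std::atomic<int> X(0);

void flipUntilNonZero()
{
    while (X.load() == 0)
    {
        X.store(1 - X.load());  // Flip 0 -> 1 (or 1 -> 0 under interleaving)
    }
}
```

Run single-threaded, the loop exits after one flip; the lock-up only arises from adversarial scheduling of two threads.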
Herlihy and Shavit, authors of "The Art of Multiprocessor Programming", describe lock-free operations and class methods, and offer a succinct definition of lock-free (see slide 150 of their lecture slides): "In an infinite execution, infinitely often some method call finishes." In other words, as long as the program keeps calling those lock-free operations, the number of completed calls keeps increasing, no matter what. It is algorithmically impossible for the system to lock up during those operations.
One important consequence of lock-free programming is that if you suspend a single thread, it will never prevent other threads from making progress, as a group, through their own lock-free operations. This hints at the value of lock-free programming when writing interrupt handlers and real-time systems, where certain work must complete within a time limit, no matter what state the rest of the program is in.
…atomic operations, memory barriers, and avoiding the ABA problem, to name a few. This is where things get hairy.
…a lightweight mutex (http://preshing.com/20120226/roll-your-own-lightweight-mutex), a recursive mutex (http://preshing.com/20120305/implementing-a-recursive-mutex) and a lightweight logging system (http://preshing.com/20120522/lightweight-in-memory-logging).
Examples of RMW operations include _InterlockedIncrement on Win32, OSAtomicAdd32 on iOS, and std::atomic<int>::fetch_add in C++11. Be aware that the C++11 atomic standard does not guarantee that the implementation will be lock-free on every platform, so it's best to check with std::atomic<>::is_lock_free.
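A minimal sketch of both calls in portable C++11 (the counter name is an assumption for illustration):

```cpp
#include <atomic>

// A global atomic counter, as one might use for lock-free statistics.
std::atomic<int> g_counter(0);

bool counterIsLockFree()
{
    // Per-object query; true on mainstream desktop platforms for int,
    // but the standard does not guarantee it everywhere.
    return g_counter.is_lock_free();
}

int incrementCounter()
{
    // fetch_add is an atomic read-modify-write: it returns the previous value.
    return g_counter.fetch_add(1);
}
```

Because fetch_add returns the old value, each caller learns exactly which slot it claimed, which is what makes RMW operations useful as building blocks.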
Different CPU families support RMW operations in different ways. Processors such as PowerPC and ARM expose load-link/store-conditional instructions, which effectively let you implement your own RMW primitive at a low level, though this is not often done. The common RMW operations are usually sufficient.
_InterlockedCompareExchange. Often, programmers perform compare-and-swap (CAS) in a loop to repeatedly attempt a transaction. This pattern typically involves copying a shared variable to a local variable, performing some speculative work, and attempting to publish the changes back to the shared variable using CAS:
```cpp
void LockFreeQueue::push(Node* newHead)
{
    for (;;)
    {
        // Copy a shared variable (m_Head) to a local.
        Node* oldHead = m_Head;

        // Do some speculative work, not yet visible to other threads.
        newHead->next = oldHead;

        // Next, attempt to publish our changes to the shared variable.
        // If the shared variable hasn't changed, the CAS succeeds and we return.
        // Otherwise, repeat.
        if (_InterlockedCompareExchange(&m_Head, newHead, oldHead) == oldHead)
            return;
    }
}
```
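For comparison, the same loop can be sketched in portable C++11 using compare_exchange_weak. This is an illustrative variant, not the post's code; the container name LockFreeStack is an assumption (the structure is a stack, whatever the class is called):

```cpp
#include <atomic>

struct Node
{
    Node* next;
};

struct LockFreeStack  // hypothetical name for this sketch
{
    std::atomic<Node*> m_Head{nullptr};

    void push(Node* newHead)
    {
        // Copy the shared variable to a local.
        Node* oldHead = m_Head.load();
        do
        {
            // Speculative work, not yet visible to other threads.
            newHead->next = oldHead;
            // On failure, compare_exchange_weak reloads oldHead for us,
            // so the loop simply retries with the fresh value.
        } while (!m_Head.compare_exchange_weak(oldHead, newHead));
    }
};
```

Note that compare_exchange_weak may fail spuriously, which is exactly why it belongs inside a retry loop.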
…a weaker variant of CAS, which is permitted to fail spuriously. Whenever you implement a CAS loop, special care must be taken to avoid the ABA problem.
http://preshing.com/20120515/memory-reordering-caught-in-the-act)
…by declaring shared variables as C++11 atomic types with their default ordering constraints, or by marking them volatile in Java. The following example is taken from my previous post (http://preshing.com/20120515/memory-reordering-caught-in-the-act), rewritten in C++11 style.
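A minimal C++11 sketch of that two-thread experiment (the names X, Y, r1, r2 follow the referenced post; the runExperiment wrapper is an assumption added for illustration):

```cpp
#include <atomic>
#include <thread>

// Shared variables; std::atomic defaults to sequentially consistent ordering.
std::atomic<int> X(0), Y(0);
int r1, r2;

void thread1Body()
{
    X.store(1);      // seq_cst store
    r1 = Y.load();   // seq_cst load
}

void thread2Body()
{
    Y.store(1);
    r2 = X.load();
}

// Runs one iteration of the experiment. Under sequential consistency,
// the outcome r1 == 0 && r2 == 0 is impossible: one of the stores must
// come first in the single global order, and the other thread's load
// must then observe it.
bool sawForbiddenOutcome()
{
    X.store(0);
    Y.store(0);
    std::thread t1(thread1Body), t2(thread2Body);
    t1.join();
    t2.join();
    return (r1 == 0 && r2 == 0);
}
```

With plain (non-atomic) ints and no barriers, the forbidden outcome does occur on real hardware, which is the memory reordering the referenced post catches in the act.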
…on a multicore device or any symmetric multiprocessor, and your environment does not guarantee sequential consistency, you must consider how to prevent memory reordering.
…memory operations which provide acquire or release semantics. I'll dig into acquire and release semantics in a future post.
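As a brief sketch of the third category, here is the classic acquire/release publication pattern in C++11 (the names ready and payload are assumptions for illustration):

```cpp
#include <atomic>

std::atomic<bool> ready(false);
int payload = 0;   // Plain, non-atomic data

void producer()
{
    payload = 42;                                  // 1. Write the data
    ready.store(true, std::memory_order_release);  // 2. Release: publish it
}

// Returns true and fills *out only once the payload is visible.
// The acquire load pairs with the release store: if it observes true,
// the write to payload is guaranteed to be visible as well.
bool consumer(int* out)
{
    if (ready.load(std::memory_order_acquire))
    {
        *out = payload;
        return true;
    }
    return false;
}
```

Release semantics prevent the payload write from being reordered after the flag store; acquire semantics prevent the payload read from being reordered before the flag load.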
Different CPU families have different habits. The specifics are documented by each CPU vendor and the hardware itself. For example, PowerPC and ARM processors can change the order of memory stores relative to the instructions themselves, whereas Intel's x86/64 family and AMD processors normally do not. We say the former processors have a more relaxed memory model.
…every load from memory comes with acquire semantics, and every store to memory provides release semantics, at least for non-SSE instructions and non-write-combined memory. As a result, it has been common in the past to write lock-free code which works on x86/64 but fails on other processors.
Is Parallel Programming Hard. In any case, keep in mind that memory reordering can also happen because of the compiler.
- Anthony Williams' blog and his book, C++ Concurrency in Action
- Dmitriy V'jukov's website and various forum discussions
- Bartosz Milewski's blog
- Charles Bloom's Low-Level Threading series on his blog
- Doug Lea's JSR-133 Cookbook
- Howells and McKenney's memory-barriers.txt document
- Hans Boehm's collection of links about the C++11 memory model
- Herb Sutter's Effective Concurrency series
Lightweight In-Memory Logging
When debugging multithreaded code, it's not always easy to determine which codepath was taken. You can't always reproduce the bug while stepping through the debugger, nor can you always sprinkle printfs throughout the code, as you might in a single-threaded program. There might be millions of events before the bug occurs, and printf can easily slow the application to a crawl, mask the bug, or create a spam fest in the output log.
One way of attacking such problems is to instrument the code so that events are logged to a circular buffer in memory. This is similar to adding printfs, except that only the most recent events are kept in the log, and the performance overhead can be made very low using lock-free techniques.
Here’s one possible implementation. I’ve written it specifically for Windows in 32-bit C++, but you could easily adapt the idea to other platforms. The header file contains the following:
```cpp
#include <windows.h>
#include <intrin.h>

namespace Logger
{
    struct Event
    {
        DWORD tid;        // Thread ID
        const char* msg;  // Message string
        DWORD param;      // A parameter which can mean anything you want
    };

    static const int BUFFER_SIZE = 65536;   // Must be a power of 2
    extern Event g_events[BUFFER_SIZE];
    extern LONG g_pos;

    inline void Log(const char* msg, DWORD param)
    {
        // Get next event index
        LONG index = _InterlockedIncrement(&g_pos);
        // Write an event at this index
        Event* e = g_events + (index & (BUFFER_SIZE - 1));  // Wrap to buffer size
        e->tid = ((DWORD*) __readfsdword(24))[9];           // Get thread ID
        e->msg = msg;
        e->param = param;
    }
}

#define LOG(m, p) Logger::Log(m, p)
```
And you must place the following in a .cpp file.
```cpp
namespace Logger
{
    Event g_events[BUFFER_SIZE];
    LONG g_pos = -1;
}
```
This is perhaps one of the simplest examples of lock-free programming which actually does something useful. There's a single macro LOG, which writes to the log. It uses _InterlockedIncrement, an atomic operation which I've talked about in previous posts, for thread safety. There are no readers. You are meant to be the reader when you inspect the process in the debugger, such as when the program crashes, or when the bug is otherwise caught.
Using It to Debug My Previous Post
My previous post, Memory Reordering Caught in the Act, contains a sample program which demonstrates a specific type of memory reordering. There are two semaphores, beginSema1 and beginSema2, which are used to repeatedly kick off two worker threads.

While I was preparing the post, there was only a single beginSema shared by both threads. To verify that the experiment was valid, I added a makeshift assert to the worker threads. Here's the Win32 version:
```cpp
DWORD WINAPI thread1Func(LPVOID param)
{
    MersenneTwister random(1);
    for (;;)
    {
        WaitForSingleObject(beginSema, INFINITE);  // Wait for signal
        while (random.integer() % 8 != 0) {}       // Random delay

        // ----- THE TRANSACTION! -----
        if (X != 0) DebugBreak();   // Makeshift assert
        X = 1;
        _ReadWriteBarrier();        // Prevent compiler reordering only
        r1 = Y;

        ReleaseSemaphore(endSema, 1, NULL);  // Notify transaction complete
    }
    return 0;  // Never returns
};
```
Surprisingly, this "assert" got hit, which means that X was not 0 at the start of the experiment, as it should have been. This puzzled me, since as I explained in that post, the semaphores are supposed to guarantee that the initial values X = 0 and Y = 0 are completely propagated at this point.
I needed more visibility on what was going on, so I added the LOG macro in a few strategic places. Note that the integer parameter can be used to log any value you want. In the second LOG statement below, I use it to log the initial value of X. Similar changes were made in the other worker thread.
```cpp
for (;;)
{
    LOG("wait", 0);
    WaitForSingleObject(beginSema, INFINITE);  // Wait for signal
    while (random.integer() % 8 != 0) {}       // Random delay

    // ----- THE TRANSACTION! -----
    LOG("X ==", X);
    if (X != 0) DebugBreak();   // Makeshift assert
    X = 1;
    _ReadWriteBarrier();        // Prevent compiler reordering only
    r1 = Y;

    ReleaseSemaphore(endSema, 1, NULL);  // Notify transaction complete
}
```
And in the main thread:
```cpp
for (int iterations = 1; ; iterations++)
{
    // Reset X and Y
    LOG("reset vars", 0);
    X = 0;
    Y = 0;

    // Signal both threads
    ReleaseSemaphore(beginSema, 1, NULL);
    ReleaseSemaphore(beginSema, 1, NULL);

    // Wait for both threads
    WaitForSingleObject(endSema, INFINITE);
    WaitForSingleObject(endSema, INFINITE);

    // Check if there was a simultaneous reorder
    LOG("check vars", 0);
    if (r1 == 0 && r2 == 0)
    {
        detected++;
        printf("%d reorders detected after %d iterations\n", detected, iterations);
    }
}
```
The next time the "assert" was hit, I checked the contents of the log simply by watching the expressions Logger::g_pos and Logger::g_events in the Watch window.
In this case, the assert was hit fairly quickly. Only 17 events were logged in total (0 – 16). The final three events made the problem obvious: a single worker thread had managed to iterate twice before the other thread got a chance to run. In other words, thread1 had stolen the extra semaphore count which was intended to kick off thread2! Splitting this semaphore into two separate semaphores fixed the bug.
This example was relatively simple, involving a small number of events. In some games I’ve worked on, we’ve used this kind of technique to track down more complex problems. It’s still possible for this technique to mask a bug; for example, when memory reordering is the issue. But even if so, that may tell you something about the problem.
Tips on Viewing the Log
The g_events array is only big enough to hold the latest 65536 events. You can adjust this number to your liking, but at some point, the index counter g_pos will have to wrap around. For example, if g_pos has reached a value of 3630838, you can find the last log entry by taking this value modulo 65536. Using interactive Python:
>>> 3630838 % 65536
26358
When breaking, you may also find that "CXX0017: Error: symbol not found" is sometimes shown in the Watch window.
This usually means that the debugger's current thread and stack frame context is inside an external DLL instead of your executable. You can often fix it by double-clicking a different stack frame in the Call Stack window and/or a different thread in the Threads window. If all else fails, you can always add the context operator to your Watch expression, explicitly telling the debugger which module to use to resolve these symbols.
One convenient detail about this implementation is that the event log is stored in a global array. This allows the log to show up in crash dumps, via an automated crash reporting system for example, even when limited minidump flags are used.
What Makes This Lightweight?
In this implementation, I strived to make the LOG macro as non-intrusive as reasonably possible. Besides being lock-free, this is mainly achieved through copious use of compiler intrinsics, which avoid the overhead of DLL function calls for certain functions. For example, instead of calling InterlockedIncrement, which involves a call into kernel32.dll, I used the intrinsic function _InterlockedIncrement (with an underscore).
Similarly, instead of getting the current thread ID from GetCurrentThreadId, I used the compiler intrinsic __readfsdword to read the thread ID directly from the Thread Information Block (TIB), an undocumented but well-known data structure in Win32.
You may question whether such micro-optimizations are justified. However, after building several makeshift logging systems, usually to handle millions of events in high-performance, multi-threaded code, I've come to believe that the less intrusive you can make it, the better. As a result of these micro-optimizations, the LOG macro compiles down to a few machine instructions, all inlined, with no function calls, no branching and no blocking.
This technique is attractive because it is very easy to integrate. There are many ways you could adapt it, depending on your needs, especially if performance is less of a concern. You could add timestamps and stack traces. You could introduce a dedicated thread to spool the event log to disk, though this would require much more sophisticated synchronization than the single atomic operation used here.
After adding such features, the technique would begin to resemble Microsoft’s Event Tracing for Windows (ETW) framework, so if you’re willing to go that far, it might be interesting to look at ETW’s support for user-mode provider events instead.