Commit 5d7962c6 authored by Robert Haas

Change locking regimen around buffer replacement.

Previously, we used an lwlock that was held from the time we began
seeking a candidate buffer until the time when we found and pinned
one, which is disastrous for concurrency.  Instead, use a spinlock
which is held just long enough to pop the freelist or advance the
clock sweep hand, and then released.  If we need to advance the clock
sweep further, we reacquire the spinlock once per buffer.

This represents a significant increase in atomic operations around
buffer eviction, but it still wins on many workloads.  On others, it
may result in no gain, or even cause a regression, unless the number
of buffer mapping locks is also increased.  However, that seems like
material for a separate commit.  We may also need to consider other
methods of mitigating contention on this spinlock, such as splitting
it into multiple locks or jumping the clock sweep hand more than one
buffer at a time, but those, too, seem like separate improvements.

Patch by me, inspired by a much larger patch from Amit Kapila.
Reviewed by Andres Freund.
Parent 1dcfb8da
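
In outline, the commit replaces the first pattern below with the second. This is an orientation sketch only, not the commit's literal code: SpinLockAcquire/SpinLockRelease and BufFreelistLock are the real primitives, but the helper functions are hypothetical stand-ins for the work described in the diff.

    /* Old regimen: one LWLock held across the entire search. */
    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
    buf = search_pin_and_maybe_wait();      /* hypothetical helper; can be slow */
    LWLockRelease(BufFreelistLock);

    /* New regimen: the spinlock covers only each O(1) selection step. */
    for (;;)
    {
        SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
        buf = pop_freelist_or_advance_clock();   /* hypothetical helper */
        SpinLockRelease(&StrategyControl->buffer_strategy_lock);

        if (try_to_pin(buf))                     /* hypothetical helper */
            break;                               /* done; spinlock not held */
        /* unusable candidate: loop and reacquire the spinlock above */
    }
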
src/backend/storage/buffer/README

@@ -125,14 +125,12 @@ bits of the tag's hash value.  The rules stated above apply to each partition
 independently.  If it is necessary to lock more than one partition at a time,
 they must be locked in partition-number order to avoid risk of deadlock.
 
-* A separate system-wide LWLock, the BufFreelistLock, provides mutual
+* A separate system-wide spinlock, buffer_strategy_lock, provides mutual
 exclusion for operations that access the buffer free list or select
-buffers for replacement.  This is always taken in exclusive mode since
-there are no read-only operations on those data structures.  The buffer
-management policy is designed so that BufFreelistLock need not be taken
-except in paths that will require I/O, and thus will be slow anyway.
-(Details appear below.)  It is never necessary to hold the BufMappingLock
-and the BufFreelistLock at the same time.
+buffers for replacement.  A spinlock is used here rather than a lightweight
+lock for efficiency; no other locks of any sort should be acquired while
+buffer_strategy_lock is held.  This is essential to allow buffer replacement
+to happen in multiple backends with reasonable concurrency.
 
 * Each buffer header contains a spinlock that must be taken when examining
 or changing fields of that buffer header.  This allows operations such as
@@ -165,7 +163,7 @@ consider their pages unlikely to be needed soon; however, the current
 algorithm never does that.  The list is singly-linked using fields in the
 buffer headers; we maintain head and tail pointers in global variables.
 (Note: although the list links are in the buffer headers, they are
-considered to be protected by the BufFreelistLock, not the buffer-header
+considered to be protected by the buffer_strategy_lock, not the buffer-header
 spinlocks.)  To choose a victim buffer to recycle when there are no free
 buffers available, we use a simple clock-sweep algorithm, which avoids the
 need to take system-wide locks during common operations.  It works like
@@ -178,25 +176,26 @@ buffer reference count, so it's nearly free.)
 
 The "clock hand" is a buffer index, nextVictimBuffer, that moves circularly
 through all the available buffers.  nextVictimBuffer is protected by the
-BufFreelistLock.
+buffer_strategy_lock.
 
 The algorithm for a process that needs to obtain a victim buffer is:
 
-1. Obtain BufFreelistLock.
+1. Obtain buffer_strategy_lock.
 
-2. If buffer free list is nonempty, remove its head buffer.  If the buffer
-is pinned or has a nonzero usage count, it cannot be used; ignore it and
-return to the start of step 2.  Otherwise, pin the buffer, release
-BufFreelistLock, and return the buffer.
+2. If buffer free list is nonempty, remove its head buffer.  Release
+buffer_strategy_lock.  If the buffer is pinned or has a nonzero usage count,
+it cannot be used; ignore it and go back to step 1.  Otherwise, pin the
+buffer, and return it.
 
-3. Otherwise, select the buffer pointed to by nextVictimBuffer, and
-circularly advance nextVictimBuffer for next time.
+3. Otherwise, the buffer free list is empty.  Select the buffer pointed to by
+nextVictimBuffer, and circularly advance nextVictimBuffer for next time.
+Release buffer_strategy_lock.
 
 4. If the selected buffer is pinned or has a nonzero usage count, it cannot
-be used.  Decrement its usage count (if nonzero) and return to step 3 to
-examine the next buffer.
+be used.  Decrement its usage count (if nonzero), reacquire
+buffer_strategy_lock, and return to step 3 to examine the next buffer.
 
-5. Pin the selected buffer, release BufFreelistLock, and return the buffer.
+5. Pin the selected buffer, and return.
 
 (Note that if the selected buffer is dirty, we will have to write it out
 before we can recycle it; if someone else pins the buffer meanwhile we will
@@ -259,7 +258,7 @@ dirty and not pinned nor marked with a positive usage count.  It pins,
 writes, and releases any such buffer.
 
 If we can assume that reading nextVictimBuffer is an atomic action, then
-the writer doesn't even need to take the BufFreelistLock in order to look
+the writer doesn't even need to take buffer_strategy_lock in order to look
 for buffers to write; it needs only to spinlock each buffer header for long
 enough to check the dirtybit.  Even without that assumption, the writer
 only needs to take the lock long enough to read the variable value, not
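
To make the README's step 3 concrete, here is an illustrative sketch (not part of the commit, though it uses the real identifiers from freelist.c below): the clock hand advances while buffer_strategy_lock is held, and the lock is dropped before the candidate buffer is examined.

    SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
    buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
    if (++StrategyControl->nextVictimBuffer >= NBuffers)
    {
        /* hand wrapped around; count a completed pass for the bgwriter */
        StrategyControl->nextVictimBuffer = 0;
        StrategyControl->completePasses++;
    }
    SpinLockRelease(&StrategyControl->buffer_strategy_lock);
    /* buf can now be checked, and its usage count decayed, lock-free */
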
src/backend/storage/buffer/bufmgr.c

@@ -889,15 +889,11 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
     /* Loop here in case we have to try another victim buffer */
     for (;;)
     {
-        bool        lock_held;
-
         /*
          * Select a victim buffer.  The buffer is returned with its header
-         * spinlock still held!  Also (in most cases) the BufFreelistLock is
-         * still held, since it would be bad to hold the spinlock while
-         * possibly waking up other processes.
+         * spinlock still held!
          */
-        buf = StrategyGetBuffer(strategy, &lock_held);
+        buf = StrategyGetBuffer(strategy);
 
         Assert(buf->refcount == 0);
 
@@ -907,10 +903,6 @@ BufferAlloc(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
         /* Pin the buffer and then release the buffer spinlock */
         PinBuffer_Locked(buf);
 
-        /* Now it's safe to release the freelist lock */
-        if (lock_held)
-            LWLockRelease(BufFreelistLock);
-
         /*
          * If the buffer was dirty, try to write it out.  There is a race
          * condition here, in that someone might dirty it after we released it
src/backend/storage/buffer/freelist.c

@@ -24,6 +24,9 @@
  */
 typedef struct
 {
+    /* Spinlock: protects the values below */
+    slock_t     buffer_strategy_lock;
+
     /* Clock sweep hand: index of next buffer to consider grabbing */
     int         nextVictimBuffer;
 
@@ -101,15 +104,10 @@ static void AddBufferToRing(BufferAccessStrategy strategy,
  * strategy is a BufferAccessStrategy object, or NULL for default strategy.
  *
  * To ensure that no one else can pin the buffer before we do, we must
- * return the buffer with the buffer header spinlock still held.  If
- * *lock_held is set on exit, we have returned with the BufFreelistLock
- * still held, as well; the caller must release that lock once the spinlock
- * is dropped.  We do it that way because releasing the BufFreelistLock
- * might awaken other processes, and it would be bad to do the associated
- * kernel calls while holding the buffer header spinlock.
+ * return the buffer with the buffer header spinlock still held.
  */
 volatile BufferDesc *
-StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
+StrategyGetBuffer(BufferAccessStrategy strategy)
 {
     volatile BufferDesc *buf;
     Latch      *bgwriterLatch;
@@ -117,21 +115,17 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 
     /*
      * If given a strategy object, see whether it can select a buffer.  We
-     * assume strategy objects don't need the BufFreelistLock.
+     * assume strategy objects don't need buffer_strategy_lock.
      */
     if (strategy != NULL)
     {
         buf = GetBufferFromRing(strategy);
         if (buf != NULL)
-        {
-            *lock_held = false;
             return buf;
-        }
     }
 
     /* Nope, so lock the freelist */
-    *lock_held = true;
-    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+    SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 
     /*
      * We count buffer allocation requests so that the bgwriter can estimate
@@ -142,22 +136,22 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 
     /*
      * If bgwriterLatch is set, we need to waken the bgwriter, but we should
-     * not do so while holding BufFreelistLock; so release and re-grab.  This
-     * is annoyingly tedious, but it happens at most once per bgwriter cycle,
-     * so the performance hit is minimal.
+     * not do so while holding buffer_strategy_lock; so release and re-grab.
+     * This is annoyingly tedious, but it happens at most once per bgwriter
+     * cycle, so the performance hit is minimal.
      */
     bgwriterLatch = StrategyControl->bgwriterLatch;
     if (bgwriterLatch)
     {
         StrategyControl->bgwriterLatch = NULL;
-        LWLockRelease(BufFreelistLock);
+        SpinLockRelease(&StrategyControl->buffer_strategy_lock);
         SetLatch(bgwriterLatch);
-        LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+        SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
     }
 
     /*
      * Try to get a buffer from the freelist.  Note that the freeNext fields
-     * are considered to be protected by the BufFreelistLock not the
+     * are considered to be protected by the buffer_strategy_lock not the
      * individual buffer spinlocks, so it's OK to manipulate them without
      * holding the spinlock.
      */
@@ -170,6 +164,12 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
         StrategyControl->firstFreeBuffer = buf->freeNext;
         buf->freeNext = FREENEXT_NOT_IN_LIST;
 
+        /*
+         * Release the lock so someone else can access the freelist (or run
+         * the clocksweep) while we check out this buffer.
+         */
+        SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
         /*
          * If the buffer is pinned or has a nonzero usage_count, we cannot use
          * it; discard it and retry.  (This can only happen if VACUUM put a
@@ -185,6 +185,9 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
             return buf;
         }
         UnlockBufHdr(buf);
+
+        /* Reacquire the lock and go around for another pass. */
+        SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
     }
 
     /* Nothing on the freelist, so run the "clock sweep" algorithm */
@@ -199,6 +202,9 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
             StrategyControl->completePasses++;
         }
 
+        /* Release the lock before manipulating the candidate buffer. */
+        SpinLockRelease(&StrategyControl->buffer_strategy_lock);
+
         /*
          * If the buffer is pinned or has a nonzero usage_count, we cannot use
          * it; decrement the usage_count (unless pinned) and keep scanning.
@@ -232,6 +238,9 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
                 elog(ERROR, "no unpinned buffers available");
         }
         UnlockBufHdr(buf);
+
+        /* Reacquire the lock and get a new candidate buffer. */
+        SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
     }
 }
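
Taken together, the hunks above give StrategyGetBuffer roughly the shape below. This is a simplified sketch for orientation only, not the literal function body: buffer rings, the bgwriter latch handling, and the usage-count decay on the clock-sweep path are omitted.

    SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
    for (;;)
    {
        volatile BufferDesc *buf;

        if (StrategyControl->firstFreeBuffer >= 0)
        {
            /* pop the head of the freelist */
            buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];
            StrategyControl->firstFreeBuffer = buf->freeNext;
            buf->freeNext = FREENEXT_NOT_IN_LIST;
        }
        else
        {
            /* advance the clock hand, as in the earlier sketch */
            buf = &BufferDescriptors[StrategyControl->nextVictimBuffer];
            if (++StrategyControl->nextVictimBuffer >= NBuffers)
            {
                StrategyControl->nextVictimBuffer = 0;
                StrategyControl->completePasses++;
            }
        }

        /* never hold buffer_strategy_lock while touching the buffer */
        SpinLockRelease(&StrategyControl->buffer_strategy_lock);

        LockBufHdr(buf);
        if (buf->refcount == 0 && buf->usage_count == 0)
            return buf;          /* header spinlock intentionally kept */
        UnlockBufHdr(buf);

        SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
    }
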
@@ -241,7 +250,7 @@ StrategyGetBuffer(BufferAccessStrategy strategy, bool *lock_held)
 void
 StrategyFreeBuffer(volatile BufferDesc *buf)
 {
-    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+    SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
 
     /*
      * It is possible that we are told to put something in the freelist that
@@ -255,7 +264,7 @@ StrategyFreeBuffer(volatile BufferDesc *buf)
         StrategyControl->firstFreeBuffer = buf->buf_id;
     }
 
-    LWLockRelease(BufFreelistLock);
+    SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 }
 
 /*
@@ -274,7 +283,7 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
 {
     int         result;
 
-    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+    SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
     result = StrategyControl->nextVictimBuffer;
     if (complete_passes)
         *complete_passes = StrategyControl->completePasses;
@@ -283,7 +292,7 @@ StrategySyncStart(uint32 *complete_passes, uint32 *num_buf_alloc)
         *num_buf_alloc = StrategyControl->numBufferAllocs;
         StrategyControl->numBufferAllocs = 0;
     }
-    LWLockRelease(BufFreelistLock);
+    SpinLockRelease(&StrategyControl->buffer_strategy_lock);
     return result;
 }
@@ -299,13 +308,13 @@ void
 StrategyNotifyBgWriter(Latch *bgwriterLatch)
 {
     /*
-     * We acquire the BufFreelistLock just to ensure that the store appears
+     * We acquire buffer_strategy_lock just to ensure that the store appears
      * atomic to StrategyGetBuffer.  The bgwriter should call this rather
      * infrequently, so there's no performance penalty from being safe.
      */
-    LWLockAcquire(BufFreelistLock, LW_EXCLUSIVE);
+    SpinLockAcquire(&StrategyControl->buffer_strategy_lock);
     StrategyControl->bgwriterLatch = bgwriterLatch;
-    LWLockRelease(BufFreelistLock);
+    SpinLockRelease(&StrategyControl->buffer_strategy_lock);
 }
@@ -370,6 +379,8 @@ StrategyInitialize(bool init)
          */
         Assert(init);
 
+        SpinLockInit(&StrategyControl->buffer_strategy_lock);
+
         /*
          * Grab the whole linked list of free buffers for our strategy.  We
          * assume it was previously set up by InitBufferPool().
src/include/storage/buf_internals.h

@@ -115,7 +115,7 @@ typedef struct buftag
  * Note: buf_hdr_lock must be held to examine or change the tag, flags,
  * usage_count, refcount, or wait_backend_pid fields.  buf_id field never
  * changes after initialization, so does not need locking.  freeNext is
- * protected by the BufFreelistLock not buf_hdr_lock.  The LWLocks can take
+ * protected by the buffer_strategy_lock not buf_hdr_lock.  The LWLocks can take
  * care of themselves.  The buf_hdr_lock is *not* used to control access to
  * the data in the buffer!
 *
@@ -185,8 +185,7 @@ extern BufferDesc *LocalBufferDescriptors;
  */
 
 /* freelist.c */
-extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy,
-                              bool *lock_held);
+extern volatile BufferDesc *StrategyGetBuffer(BufferAccessStrategy strategy);
 extern void StrategyFreeBuffer(volatile BufferDesc *buf);
 extern bool StrategyRejectBuffer(BufferAccessStrategy strategy,
                  volatile BufferDesc *buf);
src/include/storage/lwlock.h

@@ -89,7 +89,7 @@ extern PGDLLIMPORT LWLockPadded *MainLWLockArray;
  * if you remove a lock, consider leaving a gap in the numbering sequence for
  * the benefit of DTrace and other external debugging scripts.
  */
-#define BufFreelistLock             (&MainLWLockArray[0].lock)
+/* 0 is available; was formerly BufFreelistLock */
 #define ShmemIndexLock              (&MainLWLockArray[1].lock)
 #define OidGenLock                  (&MainLWLockArray[2].lock)
 #define XidGenLock                  (&MainLWLockArray[3].lock)