Memory ordering issue in LWLockRelease, WakeupWaiters, WALInsertSlotRelease
Hi,
During the lwlock scalability work I noticed a longstanding issue with
the lwlock code. LWLockRelease() and the other mentioned locations do
the following to wake up any waiters, without holding the lock's
spinlock:
    /*
     * Awaken any waiters I removed from the queue.
     */
    while (head != NULL)
    {
        LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
        proc = head;
        head = proc->lwWaitLink;
        proc->lwWaitLink = NULL;
        proc->lwWaiting = false;
        PGSemaphoreUnlock(&proc->sem);
    }
which means they manipulate the lwWaitLink queue without protection. That's done intentionally. The code tries to protect against corruption of the list due to a woken-up backend acquiring a lock (this or an independent one) by only continuing when the lwWaiting flag is set to false. Unfortunately there's absolutely no guarantee that a) the assignments to lwWaitLink and lwWaiting are done in that order, and b) that the stores become visible in that order from the POV of other backends.
So what we need to do is to acquire a write barrier between the
assignments to lwWaitLink and lwWaiting, i.e.
proc->lwWaitLink = NULL;
pg_write_barrier();
proc->lwWaiting = false;
the reader side already uses an implicit barrier by using spinlocks.
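Applied to the loop quoted above, a minimal sketch of the fix (not the exact committed patch) looks like this:

    while (head != NULL)
    {
        LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
        proc = head;
        head = proc->lwWaitLink;
        proc->lwWaitLink = NULL;
        /* make the lwWaitLink store visible before lwWaiting is cleared */
        pg_write_barrier();
        proc->lwWaiting = false;
        PGSemaphoreUnlock(&proc->sem);
    }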
I've fixed this as part 1 of the lwlock scalability work in [1], but Heikki rightfully suggested that a) this should be backpatched and b) it should be done in a separate commit.
There is the question of what to do about the branches without barrier support. I guess a SpinLockAcquire()/Release() would do? Or do we decide it's not important enough to matter, since it's not an issue on x86?

[1] http://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=commitdiff;h=2de11eb11bb3e3777f6d384de0af9c2f77960637
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
So what we need to do is to acquire a write barrier between the
assignments to lwWaitLink and lwWaiting, i.e.
proc->lwWaitLink = NULL;
pg_write_barrier();
proc->lwWaiting = false;
You didn't really explain why you think that ordering is necessary?
Each proc being awoken will surely see both fields updated, and other
procs won't be examining these fields at all, since we already delinked
all these procs from the LWLock's queue.
There is the question what to do about the branches without barriers? I
guess a SpinLockAcquire()/Release() would do? Or do we decide it's not
important enough to matter, since it's not an issue on x86?
Given the lack of trouble reports that could be traced to this,
I don't feel a need to worry about it in branches that don't
have any barrier support. But in any case, I'm not convinced
there's a bug here at all.
regards, tom lane
I wrote:
You didn't really explain why you think that ordering is necessary?
Actually, after grepping to check my memory of what those fields are
being used for, I have a bigger question: WTF is xlog.c doing being
so friendly with the innards of LWLocks? Surely this needs to get
refactored so that most of WakeupWaiters() and friends is in lwlock.c.
Or has all notion of modularity gone out the window while I wasn't
looking?
regards, tom lane
On 2014-02-10 11:11:28 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
So what we need to do is to acquire a write barrier between the
assignments to lwWaitLink and lwWaiting, i.e.
proc->lwWaitLink = NULL;
pg_write_barrier();
proc->lwWaiting = false;

You didn't really explain why you think that ordering is necessary?
Each proc being awoken will surely see both fields updated, and other
procs won't be examining these fields at all, since we already delinked
all these procs from the LWLock's queue.
The problem is that one of the released backends could wake up concurrently because of an unrelated, or a previous, PGSemaphoreUnlock(). It could see lwWaiting = false, and thus wake up and acquire the lock, even if the store to lwWaitLink hasn't arrived (or been performed; there's no guaranteed ordering here) yet.
Now, it may well be that there's no practical consequence of that, but I
am not prepared to bet on it.
Greetings,
Andres Freund
On 2014-02-10 11:20:30 -0500, Tom Lane wrote:
I wrote:
You didn't really explain why you think that ordering is necessary?
Actually, after grepping to check my memory of what those fields are
being used for, I have a bigger question: WTF is xlog.c doing being
so friendly with the innards of LWLocks? Surely this needs to get
refactored so that most of WakeupWaiters() and friends is in lwlock.c.
Or has all notion of modularity gone out the window while I wasn't
looking?
Well, it's not actually using any lwlock.c code; it's special-case locking logic, just reusing the data structures. That said, I am not particularly happy about the amount of code it's duplicating from lwlock.c. Pretty much all of WALInsertSlotReleaseOne() and most of WALInsertSlotAcquireOne() is copied.
Greetings,
Andres Freund
On 02/10/2014 06:41 PM, Andres Freund wrote:
On 2014-02-10 11:20:30 -0500, Tom Lane wrote:
I wrote:
You didn't really explain why you think that ordering is necessary?
Actually, after grepping to check my memory of what those fields are
being used for, I have a bigger question: WTF is xlog.c doing being
so friendly with the innards of LWLocks? Surely this needs to get
refactored so that most of WakeupWaiters() and friends is in lwlock.c.
Or has all notion of modularity gone out the window while I wasn't
looking?

Well, it's not actually using any lwlock.c code; it's special-case locking logic, just reusing the data structures. That said, I am not particularly happy about the amount of code it's duplicating from lwlock.c. Pretty much all of WALInsertSlotReleaseOne() and most of WALInsertSlotAcquireOne() is copied.
I'm not too happy with the amount of copy-paste myself, but there was enough difference from regular lwlocks that I didn't want to burden all lwlocks with the xlog-specific stuff either. The WAL insert slots do share the LWLock-related PGPROC fields, though, and the semaphore. I'm all ears if you have ideas on that.
- Heikki
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 02/10/2014 06:41 PM, Andres Freund wrote:
Well, it's not actually using any lwlock.c code, it's a special case
locking logic, just reusing the datastructures. That said, I am not
particularly happy about the amount of code it's duplicating from
lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of
WALInsertSlotAcquireOne() is copied.
I'm not too happy with the amount of copy-paste myself, but there was
enough difference to regular lwlocks that I didn't want to bother all
lwlocks with the xlog-specific stuff either. The WAL insert slots do
share the LWLock-related PGPROC fields though, and semaphore. I'm all
ears if you have ideas on that..
I agree that if the behavior is considerably different, we don't really
want to try to make LWLockAcquire/Release cater to both this and their
standard behavior. But why not add some additional functions in lwlock.c
that do what xlog wants? If we're going to have mostly-copy-pasted logic,
it'd at least be better if it was in the same file, and not somewhere
that's not even in the same major subtree.
Also, the reason that LWLock isn't an abstract struct is because we wanted
to be able to embed it in other structs; *not* as license for other
modules to fool with its contents. If we were working in C++ we'd
certainly have made all its fields private.
regards, tom lane
On 02/10/2014 03:46 PM, Andres Freund wrote:
Hi,
During the lwlock scalability work I noticed a longstanding issue with
the lwlock code. LWLockRelease() and the other mentioned locations do
the following to wake up any waiters, without holding the lock's
spinlock:
/*
* Awaken any waiters I removed from the queue.
*/
while (head != NULL)
{
LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
proc = head;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
proc->lwWaiting = false;
PGSemaphoreUnlock(&proc->sem);
}

which means they manipulate the lwWaitLink queue without protection. That's done intentionally. The code tries to protect against corruption of the list due to a woken-up backend acquiring a lock (this or an independent one) by only continuing when the lwWaiting flag is set to false. Unfortunately there's absolutely no guarantee that a) the assignments to lwWaitLink and lwWaiting are done in that order, and b) that the stores become visible in that order from the POV of other backends.
So what we need to do is to acquire a write barrier between the
assignments to lwWaitLink and lwWaiting, i.e.
proc->lwWaitLink = NULL;
pg_write_barrier();
proc->lwWaiting = false;
the reader side already uses an implicit barrier by using spinlocks.

I've fixed this as part 1 of the lwlock scalability work in [1], but Heikki rightfully suggested that a) this should be backpatched and b) it should be done in a separate commit.

There is the question of what to do about the branches without barrier support. I guess a SpinLockAcquire()/Release() would do?
The other alternative we discussed on IM is to unlink the waiters from the linked list while still holding the spinlock. We can't do the PGSemaphoreUnlock() call to actually wake up the waiters while holding the spinlock, but we could do all the shared memory manipulations while holding it. It would move all the modifications of the shared structures under the spinlock, which feels comforting.
It would require using some sort of backend-private data structure to hold the list of processes to wake up. We don't want to palloc() in LWLockRelease(), but we could malloc() a large-enough array once at process initialization, and use that on all LWLockRelease() calls.
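A rough sketch of that approach (hypothetical names; the logic that decides which waiters to dequeue is unchanged and omitted here):

    /*
     * wakeupList is a backend-private array with room for MaxBackends
     * entries, malloc()'d once at process initialization (hypothetical).
     */
    int     nwakeup = 0;
    int     i;

    SpinLockAcquire(&lock->mutex);
    /* ... decide which waiters to remove from the queue, as today ... */
    while (head != NULL)
    {
        PGPROC *proc = head;

        head = proc->lwWaitLink;
        proc->lwWaitLink = NULL;
        proc->lwWaiting = false;    /* still protected by the spinlock */
        wakeupList[nwakeup++] = proc;
    }
    SpinLockRelease(&lock->mutex);

    /* the kernel call to wake each waiter happens outside the spinlock */
    for (i = 0; i < nwakeup; i++)
        PGSemaphoreUnlock(&wakeupList[i]->sem);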
- Heikki
On 02/10/2014 08:03 PM, Tom Lane wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 02/10/2014 06:41 PM, Andres Freund wrote:
Well, it's not actually using any lwlock.c code, it's a special case
locking logic, just reusing the datastructures. That said, I am not
particularly happy about the amount of code it's duplicating from
lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of
WALInsertSlotAcquireOne() is copied.

I'm not too happy with the amount of copy-paste myself, but there was
enough difference to regular lwlocks that I didn't want to bother all
lwlocks with the xlog-specific stuff either. The WAL insert slots do
share the LWLock-related PGPROC fields though, and semaphore. I'm all
ears if you have ideas on that.

I agree that if the behavior is considerably different, we don't really
want to try to make LWLockAcquire/Release cater to both this and their
standard behavior. But why not add some additional functions in lwlock.c
that do what xlog wants? If we're going to have mostly-copy-pasted logic,
it'd at least be better if it was in the same file, and not somewhere
that's not even in the same major subtree.
Ok, I'll try to refactor it that way, so that we can see if it looks better.
Also, the reason that LWLock isn't an abstract struct is because we wanted
to be able to embed it in other structs; *not* as license for other
modules to fool with its contents. If we were working in C++ we'd
certainly have made all its fields private.
Um, xlog.c is doing no such thing. The insertion slots use a struct of
their own, which is also copy-pasted from LWLock (with one additional
field).
- Heikki
On 2014-02-10 19:48:47 +0200, Heikki Linnakangas wrote:
On 02/10/2014 06:41 PM, Andres Freund wrote:
On 2014-02-10 11:20:30 -0500, Tom Lane wrote:
I wrote:
You didn't really explain why you think that ordering is necessary?
Actually, after grepping to check my memory of what those fields are
being used for, I have a bigger question: WTF is xlog.c doing being
so friendly with the innards of LWLocks? Surely this needs to get
refactored so that most of WakeupWaiters() and friends is in lwlock.c.
Or has all notion of modularity gone out the window while I wasn't
looking?

Well, it's not actually using any lwlock.c code; it's special-case locking logic, just reusing the data structures. That said, I am not particularly happy about the amount of code it's duplicating from lwlock.c. Pretty much all of WALInsertSlotReleaseOne() and most of WALInsertSlotAcquireOne() is copied.

I'm not too happy with the amount of copy-paste myself, but there was enough
difference to regular lwlocks that I didn't want to bother all lwlocks with
the xlog-specific stuff either. The WAL insert slots do share the
LWLock-related PGPROC fields though, and semaphore. I'm all ears if you have
ideas on that..
The lwlock scalability stuff has separated out the enqueue/wakeup code; that probably should work here as well? And that's a fair portion of the code. I think it should be doable to make that generic enough that the actual difference in the struct doesn't matter. It'd also reduce the duplication between LWLockAcquire, ConditionalAcquire and OrWait.
Greetings,
Andres Freund
From: "Andres Freund" <andres@2ndquadrant.com>
which means they manipulate the lwWaitLink queue without
protection. That's done intentionally. The code tries to protect against
corruption of the list due to a woken-up backend acquiring a lock (this
or an independent one) by only continuing when the lwWaiting flag is set
to false. Unfortunately there's absolutely no guarantee that a) the
assignment to lwWaitLink and lwWaiting are done in that order b) that
the stores are done in-order from the POV of other backends.
So what we need to do is to acquire a write barrier between the
assignments to lwWaitLink and lwWaiting, i.e.
proc->lwWaitLink = NULL;
pg_write_barrier();
proc->lwWaiting = false;
the reader side already uses an implicit barrier by using spinlocks.
I've got a report from one customer that they encountered a hang during performance benchmarking. They were using PostgreSQL 9.2.4. I remember that the stack trace showed many backends blocked forever at LWLockAcquire() during a btree insert operation. I'm not sure whether this has something to do with what you are raising, but the release notes for 9.2.5/6 don't suggest any fixes for this. So I felt there is something wrong with lwlocks.
Do you think that the issue you are raising could cause my customer's problem -- backends blocking at an lwlock forever?
Regards
MauMau
On 2014-02-11 21:46:04 +0900, MauMau wrote:
From: "Andres Freund" <andres@2ndquadrant.com>
which means they manipulate the lwWaitLink queue without
protection. That's done intentionally. The code tries to protect against
corruption of the list due to a woken-up backend acquiring a lock (this
or an independent one) by only continuing when the lwWaiting flag is set
to false. Unfortunately there's absolutely no guarantee that a) the
assignment to lwWaitLink and lwWaiting are done in that order b) that
the stores are done in-order from the POV of other backends.
So what we need to do is to acquire a write barrier between the
assignments to lwWaitLink and lwWaiting, i.e.
proc->lwWaitLink = NULL;
pg_write_barrier();
proc->lwWaiting = false;
the reader side already uses an implicit barrier by using spinlocks.

I've got a report from one customer that they encountered a hang during performance benchmarking. They were using PostgreSQL 9.2.4. I remember that the stack trace showed many backends blocked forever at LWLockAcquire() during a btree insert operation. I'm not sure whether this has something to do with what you are raising, but the release notes for 9.2.5/6 don't suggest any fixes for this. So I felt there is something wrong with lwlocks.

Do you think that the issue you are raising could cause my customer's problem -- backends blocking at an lwlock forever?
It's x86, right? Then it's unlikely to be actual unordered memory
accesses, but if the compiler reordered:
LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
proc = head;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
proc->lwWaiting = false;
PGSemaphoreUnlock(&proc->sem);
to
LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
proc = head;
proc->lwWaiting = false;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
PGSemaphoreUnlock(&proc->sem);
which it is permitted to do, yes, that could cause symptoms like you
describe.
Any chance you have the binaries the customer ran back then around?
Disassembling that piece of code might give you a hint whether that's a
possible cause.
Greetings,
Andres Freund
From: "Andres Freund" <andres@2ndquadrant.com>
It's x86, right? Then it's unlikely to be actual unordered memory
accesses, but if the compiler reordered:
LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
proc = head;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
proc->lwWaiting = false;
PGSemaphoreUnlock(&proc->sem);
to
LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
proc = head;
proc->lwWaiting = false;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
PGSemaphoreUnlock(&proc->sem);
which it is permitted to do, yes, that could cause symptoms like you
describe.
Yes, the hang occurred with 64-bit PostgreSQL 9.2.4 running on RHEL6 for
x86_64. The PostgreSQL was built with GCC.
Any chance you have the binaries the customer ran back then around?
Disassembling that piece of code might give you a hint whether that's a
possible cause.
I'm sorry I can't provide the module, but I attached the disassembled code for LWLockRelease and LWLockAcquire in the executable. I'm not sure this proves anything.
FYI, the following stack traces are the ones obtained during two instances of the hang.
#0 0x00000036102eaf77 in semop () from /lib64/libc.so.6
#1 0x0000000000614707 in PGSemaphoreLock ()
#2 0x0000000000659d5b in LWLockAcquire ()
#3 0x000000000047983d in RelationGetBufferForTuple ()
#4 0x0000000000477f86 in heap_insert ()
#5 0x00000000005a4a12 in ExecModifyTable ()
#6 0x000000000058d928 in ExecProcNode ()
#7 0x000000000058c762 in standard_ExecutorRun ()
#8 0x00007f0cb37f99cb in pgss_ExecutorRun () from
/opt/symfoserver64/lib/pg_stat_statements.so
#9 0x00007f0cb357f545 in explain_ExecutorRun () from
/opt/symfoserver64/lib/auto_explain.so
#10 0x000000000066a59e in ProcessQuery ()
#11 0x000000000066a7ef in PortalRunMulti ()
#12 0x000000000066afd2 in PortalRun ()
#13 0x0000000000666fcb in exec_simple_query ()
#14 0x0000000000668058 in PostgresMain ()
#15 0x0000000000622ef1 in PostmasterMain ()
#16 0x00000000005c0723 in main ()
#0 0x00000036102eaf77 in semop () from /lib64/libc.so.6
#1 0x0000000000614707 in PGSemaphoreLock ()
#2 0x0000000000659d5b in LWLockAcquire ()
#3 0x000000000047983d in RelationGetBufferForTuple ()
#4 0x0000000000477f86 in heap_insert ()
#5 0x00000000005a4a12 in ExecModifyTable ()
#6 0x000000000058d928 in ExecProcNode ()
#7 0x000000000058c762 in standard_ExecutorRun ()
#8 0x00007f0cb37f99cb in pgss_ExecutorRun () from
/opt/symfoserver64/lib/pg_stat_statements.so
#9 0x00007f0cb357f545 in explain_ExecutorRun () from
/opt/symfoserver64/lib/auto_explain.so
#10 0x000000000066a59e in ProcessQuery ()
#11 0x000000000066a7ef in PortalRunMulti ()
#12 0x000000000066afd2 in PortalRun ()
#13 0x0000000000666fcb in exec_simple_query ()
#14 0x0000000000668058 in PostgresMain ()
#15 0x0000000000622ef1 in PostmasterMain ()
#16 0x00000000005c0723 in main ()
#0 0x00000036102eaf77 in semop () from /lib64/libc.so.6
#1 0x0000000000614707 in PGSemaphoreLock ()
#2 0x0000000000659d5b in LWLockAcquire ()
#3 0x000000000064bb8c in ProcArrayEndTransaction ()
#4 0x0000000000491216 in CommitTransaction ()
#5 0x00000000004925a5 in CommitTransactionCommand ()
#6 0x0000000000664cf7 in finish_xact_command ()
#7 0x0000000000667145 in exec_simple_query ()
#8 0x0000000000668058 in PostgresMain ()
#9 0x0000000000622ef1 in PostmasterMain ()
#10 0x00000000005c0723 in main ()
Regards
MauMau
On 2014-02-12 20:55:32 +0900, MauMau wrote:
Dump of assembler code for function LWLockRelease:
could you try if you get more readable dumps by using disassemble/m?
That might at least print line numbers if you have debug info installed.
Greetings,
Andres Freund
From: "Andres Freund" <andres@2ndquadrant.com>
could you try if you get more readable dumps by using disassemble/m?
That might at least print line numbers if you have debug info installed.
OK, I'll try that tomorrow. However, the debug info is not available, because they use a PostgreSQL they built themselves, not the community RPM nor EnterpriseDB's installer.
Regards
MauMau
On Feb12, 2014, at 12:55 , MauMau <maumau307@gmail.com> wrote:
From: "Andres Freund" <andres@2ndquadrant.com>
It's x86, right? Then it's unlikely to be actual unordered memory
accesses, but if the compiler reordered:
LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
proc = head;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
proc->lwWaiting = false;
PGSemaphoreUnlock(&proc->sem);
to
LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
proc = head;
proc->lwWaiting = false;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
PGSemaphoreUnlock(&proc->sem);
which it is permitted to do, yes, that could cause symptoms like you
describe.

Yes, the hang occurred with 64-bit PostgreSQL 9.2.4 running on RHEL6 for x86_64. The PostgreSQL was built with GCC.
The relevant part of the disassembled binary you attached seems to be
Dump of assembler code for function LWLockRelease:
...
0x0000000000647f47 <LWLockRelease+519>: lea 0x10(%rcx),%rdi
0x0000000000647f4b <LWLockRelease+523>: movq $0x0,0x48(%rcx)
0x0000000000647f53 <LWLockRelease+531>: movb $0x0,0x41(%rcx)
0x0000000000647f57 <LWLockRelease+535>: callq 0x606210 <PGSemaphoreUnlock>
I haven't checked the offsets, but since lwWaitLink is an 8-byte quantity
and lwWaiting a single-byte quantity, it's pretty much certain that the
first store updates lwWaitLink and the second lwWaiting. Thus, no reordering
seems to have taken place here...
best regards,
Florian Pflug
On 02/12/2014 05:42 PM, Florian Pflug wrote:
On Feb12, 2014, at 12:55 , MauMau <maumau307@gmail.com> wrote:
From: "Andres Freund" <andres@2ndquadrant.com>
It's x86, right? Then it's unlikely to be actual unordered memory
accesses, but if the compiler reordered:
LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
proc = head;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
proc->lwWaiting = false;
PGSemaphoreUnlock(&proc->sem);
to
LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
proc = head;
proc->lwWaiting = false;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
PGSemaphoreUnlock(&proc->sem);
which it is permitted to do, yes, that could cause symptoms like you
describe.

Yes, the hang occurred with 64-bit PostgreSQL 9.2.4 running on RHEL6 for x86_64. The PostgreSQL was built with GCC.

The relevant part of the disassembled binary you attached seems to be
Dump of assembler code for function LWLockRelease:
...
0x0000000000647f47 <LWLockRelease+519>: lea 0x10(%rcx),%rdi
0x0000000000647f4b <LWLockRelease+523>: movq $0x0,0x48(%rcx)
0x0000000000647f53 <LWLockRelease+531>: movb $0x0,0x41(%rcx)
0x0000000000647f57 <LWLockRelease+535>: callq 0x606210 <PGSemaphoreUnlock>

I haven't checked the offsets, but since lwWaitLink is an 8-byte quantity
and lwWaiting a single-byte quantity, it's pretty much certain that the
first store updates lwWaitLink and the second lwWaiting. Thus, no reordering
seems to have taken place here...

best regards,
Florian Pflug
Even if reordering was not done by the compiler, it can still happen during execution. There is no guarantee that two subsequent assignments will be observed by all CPU cores in the same order.
So if one thread is performing

    p->x = 1;
    p->y = 2;
    p->x = 3;
    p->y = 4;

then another thread can see the following combinations of (x,y):

    (1,2)
    (1,4)
    (3,2)
    (3,4)

It is necessary to explicitly insert a write barrier to prevent such non-deterministic behaviour.
On Wed, Feb 12, 2014 at 4:04 PM, knizhnik <knizhnik@garret.ru> wrote:
Even if reordering was not done by compiler, it still can happen while
execution.
There is no warranty that two subsequent assignments will be observed by all
CPU cores in the same order.
The x86 memory model (total store order) provides that guarantee in
this specific case.
Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de
From: "Andres Freund" <andres@2ndquadrant.com>
could you try if you get more readable dumps by using disassemble/m?
That might at least print line numbers if you have debug info installed.
Please find the attached file. I hope this will reveal something.
Regards
MauMau
On 02/12/2014 06:10 PM, Ants Aasma wrote:
On Wed, Feb 12, 2014 at 4:04 PM, knizhnik <knizhnik@garret.ru> wrote:
Even if reordering was not done by compiler, it still can happen while
execution.
There is no warranty that two subsequent assignments will be observed by all
CPU cores in the same order.

The x86 memory model (total store order) provides that guarantee in this specific case.

Regards,
Ants Aasma
Sorry, I thought that we were talking about the general case, not just the x86 architecture.
Maybe I do not understand something in the LWLock code, but it seems to me that assigning NULL to proc->lwWaitLink is not needed at all:
while (head != NULL)
{
LOG_LWDEBUG("LWLockRelease", lockid, "release waiter");
proc = head;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
proc->lwWaiting = false;
PGSemaphoreUnlock(&proc->sem);
}
This part of the wait list is not traversed by any other processor, so nobody will inspect this field. When the awakened process needs to wait for another lock, it will just assign NULL to this field itself:
proc->lwWaiting = 1;
proc->lwWaitMode = mode;
proc->lwWaitLink = NULL;
if (lock->head == NULL)
lock->head = proc;
else
lock->tail->lwWaitLink = proc;
lock->tail = proc;
Without TSO (total store order), such an assignment to lwWaitLink in LWLockRelease outside the critical section may corrupt the wait list in which the awakened process is already linked.
But I am not sure that eliminating this assignment will be enough to ensure the correctness of this code without TSO.
On Feb10, 2014, at 17:38 , Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-10 11:11:28 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
So what we need to do is to acquire a write barrier between the
assignments to lwWaitLink and lwWaiting, i.e.
proc->lwWaitLink = NULL;
pg_write_barrier();
proc->lwWaiting = false;

You didn't really explain why you think that ordering is necessary?
Each proc being awoken will surely see both fields updated, and other
procs won't be examining these fields at all, since we already delinked
all these procs from the LWLock's queue.

The problem is that one of the released backends could wake up concurrently because of an unrelated, or a previous, PGSemaphoreUnlock(). It could see lwWaiting = false, and thus wake up and acquire the lock, even if the store to lwWaitLink hasn't arrived (or been performed; there's no guaranteed ordering here) yet.
Now, it may well be that there's no practical consequence of that, but I
am not prepared to bet on it.
AFAICS there is a potential problem if three backends are involved, since
by the time the waiting backend's lwWaitLink is set to NULL after the
original lock holder released the lock, the waiting backend might already
have acquired the lock (due to a spurious wakeup) *and* a third backend
might have already enqueued behind it.
The exact sequence for backends A,B,C that corrupts the wait queue is:
A: Release lock, set B's lwWaiting to false
B: Wakes up spuriously, takes the lock
C: Enqueues behind B
A: Sets B's lwWaitLink back to NULL, thereby truncating the queue and
causing C and anyone behind it to block indefinitely.
I wonder whether LWLockRelease really needs to update lwWaitLink at all.
We take the backends we awake off the queue by updating the queue's head and
tail, so the contents of lwWaitLink should only matter once the backend is
re-inserted into some wait queue. But when doing that, we reset lwWaitLink
back to NULL anyway.
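In other words (just a sketch of the idea, not a tested patch), the wakeup loop would become:

    while (head != NULL)
    {
        proc = head;
        head = proc->lwWaitLink;
        /*
         * Deliberately leave proc->lwWaitLink alone here; it is set to NULL
         * again when the backend re-enqueues itself on a wait queue anyway.
         */
        proc->lwWaiting = false;
        PGSemaphoreUnlock(&proc->sem);
    }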
best regards,
Florian Pflug
On 2014-02-13 15:34:09 +0100, Florian Pflug wrote:
On Feb10, 2014, at 17:38 , Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-10 11:11:28 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
So what we need to do is to acquire a write barrier between the
assignments to lwWaitLink and lwWaiting, i.e.
proc->lwWaitLink = NULL;
pg_write_barrier();
proc->lwWaiting = false;

You didn't really explain why you think that ordering is necessary?
Each proc being awoken will surely see both fields updated, and other
procs won't be examining these fields at all, since we already delinked
all these procs from the LWLock's queue.

The problem is that one of the released backends could wake up concurrently because of an unrelated, or a previous, PGSemaphoreUnlock(). It could see lwWaiting = false, and thus wake up and acquire the lock, even if the store to lwWaitLink hasn't arrived (or been performed; there's no guaranteed ordering here) yet.
Now, it may well be that there's no practical consequence of that, but I am not prepared to bet on it.

AFAICS there is a potential problem if three backends are involved, since by the time the waiting backend's lwWaitLink is set to NULL after the original lock holder released the lock, the waiting backend might already have acquired the lock (due to a spurious wakeup) *and* a third backend might have already enqueued behind it.

The exact sequence for backends A,B,C that corrupts the wait queue is:
A: Release lock, set B's lwWaiting to false
B: Wakes up spuriously, takes the lock
C: Enqueues behind B
A: Sets B's lwWaitLink back to NULL, thereby truncating the queue and causing C and anyone behind it to block indefinitely.
I don't think that can actually happen because the head of the wait list
isn't the lock holder's lwWaitLink, but LWLock->head. I thought the same
for a while...
So, right now I don't see problems without either the compiler reordering
stores or heavily out of order machines with speculative execution.
I wonder whether LWLockRelease really needs to update lwWaitLink at all.
We take the backends we awake off the queue by updating the queue's head and
tail, so the contents of lwWaitLink should only matter once the backend is
re-inserted into some wait queue. But when doing that, we reset lwWaitLink
back to NULL anyway.

I don't like that; this stuff is hard to debug already, and having stale pointers around doesn't help. I think we should just add the barrier in the branches that have barrier.h and locally use a volatile var in the branches before that.
Greetings,
Andres Freund
On Feb14, 2014, at 11:45 , Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-13 15:34:09 +0100, Florian Pflug wrote:
On Feb10, 2014, at 17:38 , Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-10 11:11:28 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
So what we need to do is to acquire a write barrier between the
assignments to lwWaitLink and lwWaiting, i.e.
proc->lwWaitLink = NULL;
pg_write_barrier();
proc->lwWaiting = false;

You didn't really explain why you think that ordering is necessary?
Each proc being awoken will surely see both fields updated, and other
procs won't be examining these fields at all, since we already delinked
all these procs from the LWLock's queue.

The problem is that one of the released backends could wake up concurrently because of an unrelated, or a previous, PGSemaphoreUnlock(). It could see lwWaiting = false, and thus wake up and acquire the lock, even if the store to lwWaitLink hasn't arrived (or been performed; there's no guaranteed ordering here) yet.
Now, it may well be that there's no practical consequence of that, but I am not prepared to bet on it.

AFAICS there is a potential problem if three backends are involved, since by the time the waiting backend's lwWaitLink is set to NULL after the original lock holder released the lock, the waiting backend might already have acquired the lock (due to a spurious wakeup) *and* a third backend might have already enqueued behind it.

The exact sequence for backends A,B,C that corrupts the wait queue is:
A: Release lock, set B's lwWaiting to false
B: Wakes up spuriously, takes the lock
C: Enqueues behind B
A: Sets B's lwWaitLink back to NULL, thereby truncating the queue and causing C and anyone behind it to block indefinitely.

I don't think that can actually happen because the head of the wait list
isn't the lock holder's lwWaitLink, but LWLock->head. I thought the same
for a while...
Hm, true, but does that protect us under all circumstances? If another
backend grabs the lock before B gets a chance to do so, B will become the
wait queue's head, and anyone who enqueues behind B will do so by updating
B's lwWaitLink. That is then undone by the delayed reset of B's lwWaitLink
by the original lock holder.
The specific sequence I have in mind is:
A: Take lock
B: Enqueue
A: Release lock, set B's lwWaiting to false
D: Acquire lock
B: Enqueue after spurious wakeup
(lock->head := B)
C: Enqueue behind B
(B->lwWaitLink := C, lock->tail := C)
A: Set B's lwWaitLink back to NULL, thereby corrupting the queue
(B->lwWaitLink := NULL)
D: Release lock, dequeue and wakeup B
(lock->head := B->lwWaitLink, i.e. lock->head := NULL)
B: Take and release lock, queue appears empty!
C blocks indefinitely.
I wonder whether LWLockRelease really needs to update lwWaitLink at all.
We take the backends we awake off the queue by updating the queue's head and
tail, so the contents of lwWaitLink should only matter once the backend is
re-inserted into some wait queue. But when doing that, we reset lwWaitLink
back to NULL anyway.

I don't like that; this stuff is hard to debug already, and having stale pointers around doesn't help. I think we should just add the barrier in the branches that have barrier.h and locally use a volatile var in the branches before that.
I like the idea of updating lwWaiting and lwWaitLink while still holding the
spinlock better. The costs are probably negligible, and it'd make the code much
easier to reason about.
best regards,
Florian Pflug
On 2014-02-14 13:28:47 +0100, Florian Pflug wrote:
I don't think that can actually happen because the head of the wait list
isn't the lock holder's lwWaitLink, but LWLock->head. I thought the same
for a while...

Hm, true, but does that protect us under all circumstances? If another
backend grabs the lock before B gets a chance to do so, B will become the
wait queue's head, and anyone who enqueues behind B will do so by updating
B's lwWaitLink. That is then undone by the delayed reset of B's lwWaitLink
by the original lock holder.

The specific sequence I have in mind is:
A: Take lock
B: Enqueue
A: Release lock, set B's lwWaiting to false
D: Acquire lock
B: Enqueue after spurious wakeup
(lock->head := B)
C: Enqueue behind B
(B->lwWaitLink := C, lock->tail := C)
A: Set B's lwWaitLink back to NULL, thereby corrupting the queue
(B->lwWaitLink := NULL)
D: Release lock, dequeue and wakeup B
(lock->head := B->lwWaitLink, i.e. lock->head := NULL)
B: Take and release lock, queue appears empty!
C blocks indefinitely.
That's assuming either reordering by the compiler or an out-of-order
store architecture, right?
I wonder whether LWLockRelease really needs to update lwWaitLink at all.
We take the backends we awake off the queue by updating the queue's head and
tail, so the contents of lwWaitLink should only matter once the backend is
re-inserted into some wait queue. But when doing that, we reset lwWaitLink
back to NULL anyway.

I don't like that; this stuff is hard to debug already, and having stale pointers around doesn't help. I think we should just add the barrier in the branches that have barrier.h and locally use a volatile var in the branches before that.

I like the idea of updating lwWaiting and lwWaitLink while still holding the spinlock better. The costs are probably negligible, and it'd make the code much easier to reason about.
I agree we should do that, but imo not in the backbranches. Anything more than the minimal fix in that code should be avoided in the stable branches; this stuff is friggin performance sensitive, and the spinlock already is a *major* contention point in many workloads.
Greetings,
Andres Freund
On Feb14, 2014, at 13:36 , Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-14 13:28:47 +0100, Florian Pflug wrote:
I don't think that can actually happen because the head of the wait list
isn't the lock holder's lwWaitLink, but LWLock->head. I thought the same
for a while...

Hm, true, but does that protect us under all circumstances? If another
backend grabs the lock before B gets a chance to do so, B will become the
wait queue's head, and anyone who enqueues behind B will do so by updating
B's lwWaitLink. That is then undone by the delayed reset of B's lwWaitLink
by the original lock holder.

The specific sequence I have in mind is:
A: Take lock
B: Enqueue
A: Release lock, set B's lwWaiting to false
D: Acquire lock
B: Enqueue after spurious wakeup
(lock->head := B)
C: Enqueue behind B
(B->lwWaitLink := C, lock->tail := C)
A: Set B's lwWaitLink back to NULL, thereby corrupting the queue
(B->lwWaitLink := NULL)
D: Release lock, dequeue and wakeup B
(lock->head := B->lwWaitLink, i.e. lock->head := NULL)
B: Take and release lock, queue appears empty!
C blocks indefinitely.

That's assuming either reordering by the compiler or an out-of-order
store architecture, right?
Yes, it requires that a backend exits out of the PGSemaphoreLock loop in
LWLockAcquire before lwWaitLink has been reset to NULL by the previous lock
holder's LWLockRelease. Only if that happens is there a risk of the PGPROC being on a wait queue by the time lwWaitLink is finally reset to NULL.
I wonder whether LWLockRelease really needs to update lwWaitLink at all.
We take the backends we awake off the queue by updating the queue's head and
tail, so the contents of lwWaitLink should only matter once the backend is
re-inserted into some wait queue. But when doing that, we reset lwWaitLink
back to NULL anyway.

I don't like that; this stuff is hard to debug already, and having stale pointers around doesn't help. I think we should just add the barrier in the branches that have barrier.h and locally use a volatile var in the branches before that.

I like the idea of updating lwWaiting and lwWaitLink while still holding the spinlock better. The costs are probably negligible, and it'd make the code much easier to reason about.

I agree we should do that, but imo not in the backbranches. Anything more than the minimal fix in that code should be avoided in the stable branches; this stuff is friggin performance sensitive, and the spinlock already is a *major* contention point in many workloads.
No argument there. But unless I missed something, there actually is a bug
there, and using volatile won't fix it. A barrier would, but what about the
back branches that don't have barrier.h? AFAICS the only other choices we have on
these branches are to either ignore the bug - it's probably *extremely* unlikely
- or to use a spinlock acquire/release cycle to create a barrier. The former
leaves me with a bit of an uneasy feeling, and the latter quite certainly has
a larger performance impact than moving the PGPROC updates within the section
protected by the spinlock and using an array to remember the backends to wakeup.
best regards,
Florian Pflug
On 2014-02-14 13:52:45 +0100, Florian Pflug wrote:
I agree we should do that, but imo not in the backbranches. Anything
more than the minimal fix in that code should be avoided in the stable branches; this stuff is friggin performance sensitive, and the spinlock already is a *major* contention point in many workloads.

No argument there. But unless I missed something, there actually is a bug
there, and using volatile won't fix it. A barrier would, but what about the
back branches that don't have barrier.h?
Yea, but I don't see a better alternative. Realistically the likelihood
of a problem without the compiler reordering stuff is minuscule on any relevant platform that's supported by the !barrier.h branches. Newer ARMs are the only realistic suspect, and the support for them in older releases wasn't so good...
The former
leaves me with a bit of an uneasy feeling, and the latter quite certainly has
a larger performance impact than moving the PGPROC updates within the section
protected by the spinlock and using an array to remember the backends to wakeup.
I am not so sure; it adds a host of new cacheline references in a piece of code that's already heavily affected by pipeline stalls, which can influence performance. I am not saying it's super likely, just more than
I want to do for a theoretical problem in the back branches.
Greetings,
Andres Freund
On Feb14, 2014, at 14:07 , Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-14 13:52:45 +0100, Florian Pflug wrote:
I agree we should do that, but imo not in the backbranches. Anything
more than the minimal fix in that code should be avoided in the stable branches; this stuff is friggin performance sensitive, and the spinlock already is a *major* contention point in many workloads.

No argument there. But unless I missed something, there actually is a bug there, and using volatile won't fix it. A barrier would, but what about the back branches that don't have barrier.h?

Yea, but I don't see a better alternative. Realistically the likelihood of a problem without the compiler reordering stuff is minuscule on any relevant platform that's supported by the !barrier.h branches. Newer ARMs are the only realistic suspect, and the support for them in older releases wasn't so good...
Isn't POWER/PowerPC potentially affected by this?
Also, there is still the alternative of not resetting lwWaitLink in LWLockRelease. I can understand why you oppose that - it's certainly nicer not to have stray pointers lying around. But since (at least as far as we know)
a) resetting lwWaitLink is not strictly necessary, and
b) resetting lwWaitLink introduces a race condition (however unlikely),
we'll trade correctness for cleanliness if we continue to reset lwWaitLink without protecting against the race. That's a bit of a weird trade-off to make.
Another idea for a fix would be to conflate lwWaiting and lwWaitLink into one
field. We could replace "lwWaiting" by "lwWaitLink != NULL" everywhere it's
tested, and set lwWaitLink to some special non-NULL value (say 0x1) when we
enqueue a PGPROC, instead of setting it to NULL and setting lwWaiting to true.
We'd then depend on pointer-sized stores being atomic, which I think we depend
on in other places already.
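For illustration only (the sentinel value and the names are made up), the scheme would be roughly:

    /* hypothetical sentinel: enqueued, but no successor in the queue yet */
    #define LW_WAIT_TAIL    ((PGPROC *) 1)

    /* enqueueing, instead of "proc->lwWaiting = true; proc->lwWaitLink = NULL;" */
    proc->lwWaitLink = LW_WAIT_TAIL;

    /* waking up, instead of "proc->lwWaitLink = NULL; proc->lwWaiting = false;" */
    proc->lwWaitLink = NULL;        /* a single pointer-sized store */

    /* and every "if (!proc->lwWaiting)" test becomes */
    if (proc->lwWaitLink == NULL)
        break;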
best regards,
Florian Pflug
Florian Pflug <fgp@phlo.org> writes:
Another idea for a fix would be to conflate lwWaiting and lwWaitLink into one
field. We could replace "lwWaiting" by "lwWaitLink != NULL" everywhere it's
tested, and set lwWaitLink to some special non-NULL value (say 0x1) when we
enqueue a PGPROC, instead of setting it to NULL and setting lwWaiting to true.
We'd then depend on pointer-sized stores being atomic, which I think we depend
on in other places already.
I don't believe that's true; neither that we depend on it now, nor that
it would be safe to do so.
regards, tom lane
On 2014-02-14 10:26:07 -0500, Tom Lane wrote:
Florian Pflug <fgp@phlo.org> writes:
Another idea for a fix would be to conflate lwWaiting and lwWaitLink into one
field. We could replace "lwWaiting" by "lwWaitLink != NULL" everywhere it's
tested, and set lwWaitLink to some special non-NULL value (say 0x1) when we
enqueue a PGPROC, instead of setting it to NULL and setting lwWaiting to true.

We'd then depend on pointer-sized stores being atomic, which I think we depend
on in other places already.

I don't believe that's true; neither that we depend on it now, nor that
it would be safe to do so.
Yea, we currently rely on 4 byte stores being atomic (most notably for
xids), but we don't rely on anything bigger. I am not sure if there are
architectures with 64bit pointers that aren't accessed atomically when aligned? Even if there are, that's certainly nothing that should be introduced when backpatching.
Greetings,
Andres Freund
On Feb14, 2014, at 16:32 , Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-14 10:26:07 -0500, Tom Lane wrote:
Florian Pflug <fgp@phlo.org> writes:
Another idea for a fix would be to conflate lwWaiting and lwWaitLink into one
field. We could replace "lwWaiting" by "lwWaitLink != NULL" everywhere it's
tested, and set lwWaitLink to some special non-NULL value (say 0x1) when we
enqueue a PGPROC, instead of setting it to NULL and setting lwWaiting to true.

We'd then depend on pointer-sized stores being atomic, which I think we depend
on in other places already.

I don't believe that's true; neither that we depend on it now, nor that it would be safe to do so.

Yea, we currently rely on 4 byte stores being atomic (most notably for xids), but we don't rely on anything bigger. I am not sure if there are architectures with 64bit pointers that aren't accessed atomically when aligned? Even if there are, that's certainly nothing that should be introduced when backpatching.
Hm, we could use 4-byte stores instead of 8-byte stores if we turned lwWaitLink
into an index into the proc array instead of a pointer to the PGPROC struct.
We could then use 0xffffffff in place of NULL to indicate "not waiting",
and PROCARRAY_MAXPROCS to indicate "waiting, but no successor in the queue".
best regards,
Florian Pflug
On 2014-02-14 15:03:16 +0100, Florian Pflug wrote:
On Feb14, 2014, at 14:07 , Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-14 13:52:45 +0100, Florian Pflug wrote:
I agree we should do that, but imo not in the backbranches. Anything
more than the minimal fix in that code should be avoided in the stable branches; this stuff is friggin performance sensitive, and the spinlock already is a *major* contention point in many workloads.

No argument there. But unless I missed something, there actually is a bug there, and using volatile won't fix it. A barrier would, but what about the back branches that don't have barrier.h?

Yea, but I don't see a better alternative. Realistically the likelihood of a problem without the compiler reordering stuff is minuscule on any relevant platform that's supported by the !barrier.h branches. Newer ARMs are the only realistic suspect, and the support for them in older releases wasn't so good...

Isn't POWER/PowerPC potentially affected by this?
I wouldn't consider it a major architecture... And I am not sure how much out of order a CPU has to be to be affected by this: the read side uses spinlocks, which in most of the spinlock implementations imply a full memory barrier, which in many cache coherency designs will cause other CPUs to flush writes. And I think the control dependency in the PGSemaphoreUnlock() loop will actually cause a flush on ppc...
Also, there is still the alternative of not resetting lwWaitLink in LWLockRelease. I can understand why you oppose that - it's certainly nicer not to have stray pointers lying around. But since (at least as far as we know)

a) resetting lwWaitLink is not strictly necessary
I don't want to rely on that in the backbranches; that's an entirely new assumption. I think anything more than minimal changes is hard to justify for a theoretical issue that hasn't been observed in the field.

I think the biggest danger here is that writes are reordered by the compiler, and that we definitely need to protect against. Making a variable
compiler, that we definitely need to protect against. Making a variable
volatile or introducing a memory barrier is pretty simple to analyze.
b) resetting lwWaitLink introduces a race condition (however unlikely)
we'll trade correctness for cleanliness if we continue to reset lwWaitLink
without protecting against the race. That's a bit of a weird trade-off to make.
It's not just cleanliness, it's being able to actually debug crashes.
Greetings,
Andres Freund
On 02/14/2014 07:51 PM, Andres Freund wrote:
On 2014-02-14 15:03:16 +0100, Florian Pflug wrote:
On Feb14, 2014, at 14:07 , Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-14 13:52:45 +0100, Florian Pflug wrote:
I agree we should do that, but imo not in the backbranches. Anything
more than the minimal fix in that code should be avoided in the stable branches; this stuff is friggin performance sensitive, and the spinlock already is a *major* contention point in many workloads.

No argument there. But unless I missed something, there actually is a bug there, and using volatile won't fix it. A barrier would, but what about the back branches that don't have barrier.h?

Yea, but I don't see a better alternative. Realistically the likelihood of a problem without the compiler reordering stuff is minuscule on any relevant platform that's supported by the !barrier.h branches. Newer ARMs are the only realistic suspect, and the support for them in older releases wasn't so good...

Isn't POWER/PowerPC potentially affected by this?
I wouldn't consider it a major architecture... And I am not sure how
much out of order a CPU has to be to be affected by this, the read side
uses spinlocks, which in most of the spinlock implementations implies a
full memory barrier which in many cache coherency designs will cause
other CPUs to flush writes. And I think the control dependency in the
PGSemaphoreUnlock() loop will actually cause a flush on ppc...
PGSemaphoreUnlock() should really introduce a memory barrier, but the problem can happen before PGSemaphoreUnlock() is called.
So the presence of PGSemaphoreUnlock() just reduces the probability of the race condition on non-TSO architectures (PPC, ARM, IA64, ...) but doesn't completely eliminate it.
Also, there is still the alternative of not resetting lwWaitLink in LWLockRelease. I can understand why you oppose that - it's certainly nicer not to have stray pointers lying around. But since (at least as far as we know)

a) resetting lwWaitLink is not strictly necessary

I don't want to rely on that in the backbranches; that's an entirely new assumption. I think anything more than minimal changes is hard to justify for a theoretical issue that hasn't been observed in the field.
I think the biggest danger here is that writes are reordered by the compiler, and that we definitely need to protect against. Making a variable volatile or introducing a memory barrier is pretty simple to analyze.

b) resetting lwWaitLink introduces a race condition (however unlikely)

we'll trade correctness for cleanliness if we continue to reset lwWaitLink without protecting against the race. That's a bit of a weird trade-off to make.

It's not just cleanliness, it's being able to actually debug crashes.
Frankly speaking, I do not understand why eliminating the resetting of lwWaitLink was considered to be a bad idea.
Resetting this pointer to NULL will not help much in analyzing crash dumps, because right now it is possible that lwWaitLink==NULL while the process is waiting for a lock and linked into the list (if it is the last element of the list). So the fact that lwWaitLink==NULL actually gives little useful information.
On 2014-02-14 20:23:32 +0400, knizhnik wrote:
we'll trade correctness for cleanliness if we continue to reset lwWaitLink
without protecting against the race. That's a bit of a weird trade-off to make.

It's not just cleanliness, it's being able to actually debug crashes.

Frankly speaking, I do not understand why eliminating the resetting of lwWaitLink was considered to be a bad idea. Resetting this pointer to NULL will not help much in analyzing crash dumps, because right now it is possible that lwWaitLink==NULL while the process is waiting for a lock and linked into the list (if it is the last element of the list). So the fact that lwWaitLink==NULL actually gives little useful information.
At the moment if you connect to a live pg or a crash dump, investigating
the wait links allows you to somewhat sensibly determine which backends
are waiting for a lock and whether they are part of a queue. If they
aren't reset anymore that will tell you nothing, so you'll need to
connect to all pg instances to debug issues.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 02/14/2014 08:28 PM, Andres Freund wrote:
On 2014-02-14 20:23:32 +0400, knizhnik wrote:
we'll trade correctness for cleanliness if we continue to reset lwWaitLink
without protecting against the race. That's a bit of a weird trade-off to make.
It's not just cleanliness, it's being able to actually debug crashes.
Frankly speaking, I do not understand why eliminating the resetting of lwWaitLink was considered to be a bad idea.
Resetting this pointer to NULL will not help much in analyzing crash dumps, because right now it is possible that lwWaitLink==NULL while the process is waiting for a lock and linked in the list
(if it is the last element of the list). So the fact that lwWaitLink==NULL actually gives little useful information.
At the moment if you connect to a live pg or a crash dump, investigating
the wait links allows you to somewhat sensibly determine which backends
are waiting for a lock and whether they are part of a queue. If they
aren't reset anymore that will tell you nothing, so you'll need to
connect to all pg instances to debug issues.
Why is it not enough to inspect the lwWaiting flag?
In any case, resetting lwWaitLink can safely be done in the awakened process:
	if (!proc->lwWaiting)
	{
		proc->lwWaitLink = NULL;
		break;
	}
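For illustration, a minimal sketch of how that could look inside the wait loop
of LWLockAcquire() (loop structure as quoted elsewhere in this thread; this is
an illustration, not a tested patch, and as discussed further down it has its
own ordering hazard on weakly ordered CPUs):

	for (;;)
	{
		/* "false" means cannot accept cancel/die interrupt here. */
		PGSemaphoreLock(&proc->sem, false);
		if (!proc->lwWaiting)
		{
			/*
			 * We have been removed from the wait queue, so it is safe to
			 * clear our own link pointer before we enqueue ourselves again.
			 */
			proc->lwWaitLink = NULL;
			break;
		}
		extraWaits++;
	}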
Greetings,
Andres Freund
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Feb14, 2014, at 16:51 , Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-14 15:03:16 +0100, Florian Pflug wrote:
On Feb14, 2014, at 14:07 , Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-14 13:52:45 +0100, Florian Pflug wrote:
I agree we should do that, but imo not in the backbranches. Anything
more than the minimal fix in that code should be avoided in the
stable branches, this stuff is friggin performance sensitive, and the
spinlock already is a *major* contention point in many workloads.
No argument there. But unless I missed something, there actually is a bug
there, and using volatile won't fix it. A barrier would, but what about the
back branches that don't have barrier.h?
Yea, but I don't see a better alternative. Realistically the likelihood
of a problem without the compiler reordering stuff is minuscule on any
relevant platform that's supported by the !barrier.h branches. Newer
ARMs are the only realistic suspect, and the support for them in older
releases wasn't so good...
Isn't POWER/PowerPC potentially affected by this?
I wouldn't consider it a major architecture... And I am not sure how
much out of order a CPU has to be to be affected by this, the read side
uses spinlocks, which in most of the spinlock implementations implies a
full memory barrier which in many cache coherency designs will cause
other CPUs to flush writes. And I think the control dependency in the
PGSemaphoreUnlock() loop will actually cause a flush on ppc...
I guess it's sufficient for the store to lwWaitLink to be delayed.
As long as the CPU on which that store is executing hasn't managed to gain
exclusive access to the relevant cache line, memory barriers on the read
side won't help because the store won't be visible to other CPUs.
Also, there is still the alternative of not resetting lwWaitLink in LWLockRelease.
I can understand why you oppose that - it's certainly nicer to not have a stray
pointer lying around. But since (at least as far as we know)
a) resetting lwWaitLink is not strictly necessary
I don't want to rely on that in the backbranches, that's an entirely new
assumption. I think anything more than minimal changes is hard to
justify for a theoretical issue that hasn't been observed in the field.
I think the biggest danger here is that writes are reordered by the
compiler, which we definitely need to protect against. Making a variable
volatile or introducing a memory barrier is pretty simple to analyze.
Well, the assumption isn't all that new. We already have the situation that
a PGPROC may not be on any wait queue, yet its lwWaitLink may be non-NULL.
Currently, the process that took it off the queue is responsible for rectifying
that eventually; with the change proposed below it'll be the backend that
owns the PGPROC. From the point of view of other backends, nothing really
changes.
b) resetting lwWaitLink introduces a race condition (however unlikely)
we'll trade correctness for cleanliness if we continue to reset lwWaitLink
without protecting against the race. That's a bit of a weird trade-off to make.
It's not just cleanliness, it's being able to actually debug crashes.
We could still arrange for the stray lwWaitLink to be visible only
momentarily. If a backend is woken up and detects that lwWaiting is false,
it knows that it cannot be on any wait queue - it was just removed, and
hasn't added itself again yet. At that point, it's safe to reset lwWaitLink
back to NULL. The stray value would thus only be visible while a backend executes
the PGSemaphoreLock() loop, and whether or not this is the case can be deduced
from a stack trace. So I'm not convinced that this makes debugging any harder.
(knizhnik has just suggested the same)
It's interesting, BTW, that updating lwWaitLink after lwWaiting is OK here -
the crucial property is not that it's updated before lwWaiting, but rather that
it is reset before enqueuing the backend again - which is something that the backend
itself can guarantee far more easily than whoever happens to de-queue it due to
spurious wakeups.
best regards,
Florian Pflug
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-02-14 18:49:33 +0100, Florian Pflug wrote:
I wouldn't consider it a major architecture... And I am not sure how
much out of order a CPU has to be to be affected by this, the read side
uses spinlocks, which in most of the spinlock implementations implies a
full memory barrier which in many cache coherency designs will cause
other CPUs to flush writes. And I think the control dependency in the
PGSemaphoreUnlock() loop will actually cause a flush on ppc...
I guess it's sufficient for the store to lwWaitLink to be delayed.
As long as the CPU on which that store is executing hasn't managed to gain
exclusive access to the relevant cache line, memory barriers on the read
side won't help because the store won't be visible to other CPUs.
The whole lwlock actually should be on the same cacheline on anything
with cachelines >= 32. As the woken up side is doing an atomic op on it
*before* modifying lwWaitLink I think we are actually guaranteed that
any incomplete store on the writer will have completed unless the compiler
reordered. So we are safe against out of order stores, but not out of
order execution. That might have been what prevented the issue from
being noticeable on some platforms.
Well, the assumption isn't all that new. We already have the situation that
a PGPROC may not be on any wait queue, yet its lwWaitLink may be non-NULL.
Currently, the process that took it off the queue is responsible for rectifying
that eventually; with the change proposed below it'll be the backend that
owns the PGPROC. From the point of view of other backends, nothing really
changes.
That window is absolutely tiny tho.
b) resetting lwWaitLink introduces a race condition (however unlikely)
we'll trade correctness for cleanliness if we continue to reset lwWaitLink
without protecting against the race. That's a bit of a weird trade-off to make.
It's not just cleanliness, it's being able to actually debug crashes.
We could still arrange for the stray lwWaitLink to be visible only
momentarily. If a backend is woken up and detects that lwWaiting is false,
it knows that it cannot be on any wait queue - it was just removed, and
hasn't added itself again yet. At that point, it's safe to reset lwWaitLink
back to NULL. The stray value would thus only be visible while a backend executes
the PGSemaphoreLock() loop, and whether or not this is the case can be deduced
from a stack trace. So I'm not convinced that this makes debugging any harder.
That's not actually safe on an out of order architecture afaics. Without
barriers the store to lwWaitLink in the woken up backend can preempt the
read for the next element in the PGSemaphoreUnlock() loop.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Feb14, 2014, at 19:21 , Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-14 18:49:33 +0100, Florian Pflug wrote:
Well, the assumption isn't all that new. We already have the situation that
a PGPROC may not be on any wait queue, yet its lwWaitLink may be non-NULL.
Currently, the process that took it off the queue is responsible for rectifying
that eventually; with the change proposed below it'll be the backend that
owns the PGPROC. From the point of view of other backends, nothing really
changes.
That window is absolutely tiny tho.
True, but it's there, so if anything counts on that never being the case, it's
still already broken.
b) resetting lwWaitLink introduces a race condition (however unlikely)
we'll trade correctness for cleanliness if we continue to reset lwWaitLink
without protecting against the race. That's a bit of a weird trade-off to make.
It's not just cleanliness, it's being able to actually debug crashes.
We could still arrange for the stray lwWaitLink to be visible only
momentarily. If a backend is woken up and detects that lwWaiting is false,
it knows that it cannot be on any wait queue - it was just removed, and
hasn't added itself again yet. At that point, it's safe to reset lwWaitLink
back to NULL. The stray value would thus only be visible while a backend executes
the PGSemaphoreLock() loop, and whether or not this is the case can be deduced
from a stack trace. So I'm not convinced that this makes debugging any harder.
That's not actually safe on an out of order architecture afaics. Without
barriers the store to lwWaitLink in the woken up backend can preempt the
read for the next element in the PGSemaphoreUnlock() loop.
Hm, true. I guess we could prevent that by introducing an artificial dependency
between the read and the write - something like
	proc = head;
	head = proc->lwWaitLink;

	/*
	 * We don't ever expect to actually PANIC here, but the test forces the
	 * load to execute before the store to lwWaiting. This prevents a race
	 * between reading lwWaitLink here and resetting it back to zero in the
	 * awoken backend. (Note that backends can wake up spuriously, so just
	 * reading it before doing PGSemaphoreUnlock is insufficient.)
	 */
	if (head != MyProc)
		proc->lwWaiting = false;
	else
		elog(PANIC, ...);

	PGSemaphoreUnlock(&proc->sem);
(This assumes that proc is a volatile pointer)
Another idea would be to do as you suggest and only mark the PGPROC pointers
volatile, but to additionally add a check for queue corruption somewhere. We should
be able to detect that - if we ever hit this issue, LWLockRelease should find a
PGPROC while traversing the queue whose lwWaitLink is NULL but which isn't equal to
lock->tail. If that ever happens, we'd simply PANIC.
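For illustration, such a check might look roughly like this (a hypothetical
sketch only, assuming the 9.3-era field names lock->head and lock->tail, run
in LWLockRelease() while the lock's spinlock is still held):

	for (proc = lock->head; proc != NULL; proc = proc->lwWaitLink)
	{
		/* a NULL link is only legal for the queue's tail */
		if (proc->lwWaitLink == NULL && proc != lock->tail)
			elog(PANIC, "LWLock wait queue corrupted");
	}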
best regards,
Florian Pflug
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-02-15 04:20:17 +0100, Florian Pflug wrote:
Another idea would be to do as you suggest and only mark the PGPROC pointers
volatile, but to additionally add a check for queue corruption somewhere. We should
be able to detect that - if we ever hit this issue, LWLockRelease should find a
PGPROC while traversing the queue whose lwWaitLink is NULL but which isn't equal to
lock->tail. If that ever happens, we'd simply PANIC.
My current conclusion is that backporting barriers.h is by far the most
reasonable way to go. The compiler problems have been ironed out by
now...
Arguments against?
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Andres Freund <andres@2ndquadrant.com> writes:
My current conclusion is that backporting barriers.h is by far the most
reasonable way to go. The compiler problems have been ironed out by
now...
-1. IMO that code is still quite unproven, and what's more, the
problem we're discussing here is completely hypothetical. If it
were real, we'd have field evidence of it. We've not had that
much trouble seeing instances of even very narrow race-condition
windows in the past.
regards, tom lane
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-02-15 10:06:41 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
My current conclusion is that backporting barriers.h is by far the most
reasonable way to go. The compiler problems have been ironed out by
now...
-1. IMO that code is still quite unproven, and what's more, the
problem we're discussing here is completely hypothetical. If it
were real, we'd have field evidence of it. We've not had that
much trouble seeing instances of even very narrow race-condition
windows in the past.
Well, the problem is that few of us have access to interesting !x86
machines to run tests, and that's where we'd see problems (since x86
gives enough guarantees to avoid this unless the compiler reorders
stuff). I am personally fine with just using volatiles to avoid
reordering in the older branches, but Florian argued against it.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-02-15 16:18:00 +0100, Andres Freund wrote:
On 2014-02-15 10:06:41 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
My current conclusion is that backporting barriers.h is by far the most
reasonable way to go. The compiler problems have been ironed out by
now...
-1. IMO that code is still quite unproven, and what's more, the
problem we're discussing here is completely hypothetical. If it
were real, we'd have field evidence of it. We've not had that
much trouble seeing instances of even very narrow race-condition
windows in the past.
Well, the problem is that few of us have access to interesting !x86
machines to run tests, and that's where we'd see problems (since x86
gives enough guarantees to avoid this unless the compiler reorders
stuff). I am personally fine with just using volatiles to avoid
reordering in the older branches, but Florian argued against it.
Here's patches doing that. The 9.3 version also applies to 9.2; the 9.1
version applies back to 8.4.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
barrier-in-lwlock-release-9.1.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 0fe7ce4..a8d5b7f 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -647,12 +647,27 @@ LWLockRelease(LWLockId lockid)
*/
while (head != NULL)
{
+ /*
+ * Use volatile to prevent the compiler from reordering the store to
+ * lwWaitLink with the store to lwWaiting which could cause problems
+ * when the to-be-woken-up backend wakes up spuriously and writes to
+ * lwWaitLink when acquiring a new lock. That could corrupt the list
+ * this backend is traversing leading to backends stuck waiting for a
+ * lock.
+ *
+ * That's not necessarily sufficient for out-of-order architectures,
+ * but there've been no field reports of problems. The proper solution
+ * would be to use a write barrier, but those are not available in the
+ * back branches.
+ */
+ volatile PGPROC *vp = proc;
+
LOG_LWDEBUG("LWLockRelease", lockid, "release waiter");
- proc = head;
- head = proc->lwWaitLink;
- proc->lwWaitLink = NULL;
- proc->lwWaiting = false;
- PGSemaphoreUnlock(&proc->sem);
+ vp = head;
+ head = vp->lwWaitLink;
+ vp->lwWaitLink = NULL;
+ vp->lwWaiting = false;
+ PGSemaphoreUnlock(&vp->sem);
}
/*
barrier-in-lwlock-release-9.3.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 4f88d3f..cff631d 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -27,6 +27,7 @@
#include "commands/async.h"
#include "miscadmin.h"
#include "pg_trace.h"
+#include "storage/barrier.h"
#include "storage/ipc.h"
#include "storage/predicate.h"
#include "storage/proc.h"
@@ -831,10 +832,21 @@ LWLockRelease(LWLockId lockid)
*/
while (head != NULL)
{
+ /*
+ * Use a write barrier to prevent the compiler from reordering the
+ * store to lwWaitLink with the store to lwWaiting which could cause
+ * problems when the to-be-woken-up backend wakes up spuriously and
+ * writes to lwWaitLink when acquiring a new lock. That could corrupt
+ * the list this backend is traversing leading to backends stuck
+ * waiting for a lock. A write barrier is sufficient as the read side
+ * only accesses the data while holding a spinlock which acts as a
+ * full barrier.
+ */
LOG_LWDEBUG("LWLockRelease", lockid, "release waiter");
proc = head;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
+ pg_write_barrier();
proc->lwWaiting = false;
PGSemaphoreUnlock(&proc->sem);
}
barrier-in-lwlock-release-master.patch (text/x-patch; charset=us-ascii)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 85a0ce9..22f8540 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1872,9 +1872,11 @@ WakeupWaiters(XLogRecPtr EndPos)
*/
while (head != NULL)
{
+ /* check comment in LWLockRelease() about barrier usage */
proc = head;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
+ pg_write_barrier();
proc->lwWaiting = false;
PGSemaphoreUnlock(&proc->sem);
}
@@ -1966,9 +1968,11 @@ WALInsertSlotReleaseOne(int slotno)
*/
while (head != NULL)
{
+ /* check comment in LWLockRelease() about barrier usage */
proc = head;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
+ pg_write_barrier();
proc->lwWaiting = false;
PGSemaphoreUnlock(&proc->sem);
}
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 82ef440..98c4845 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -28,6 +28,7 @@
#include "miscadmin.h"
#include "pg_trace.h"
#include "replication/slot.h"
+#include "storage/barrier.h"
#include "storage/ipc.h"
#include "storage/predicate.h"
#include "storage/proc.h"
@@ -944,10 +945,21 @@ LWLockRelease(LWLock *l)
*/
while (head != NULL)
{
+ /*
+ * Use a write barrier to prevent the compiler from reordering the
+ * store to lwWaitLink with the store to lwWaiting which could cause
+ * problems when the to-be-woken-up backend wakes up spuriously and
+ * writes to lwWaitLink when acquiring a new lock. That could corrupt
+ * the list this backend is traversing leading to backends stuck
+ * waiting for a lock. A write barrier is sufficient as the read side
+ * only accesses the data while holding a spinlock which acts as a
+ * full barrier.
+ */
LOG_LWDEBUG("LWLockRelease", T_NAME(l), T_ID(l), "release waiter");
proc = head;
head = proc->lwWaitLink;
proc->lwWaitLink = NULL;
+ pg_write_barrier();
proc->lwWaiting = false;
PGSemaphoreUnlock(&proc->sem);
}
On Sat, Feb 15, 2014 at 11:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-15 16:18:00 +0100, Andres Freund wrote:
On 2014-02-15 10:06:41 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
My current conclusion is that backporting barriers.h is by far the most
reasonable way to go. The compiler problems have been ironed out by
now...
-1. IMO that code is still quite unproven, and what's more, the
problem we're discussing here is completely hypothetical. If it
were real, we'd have field evidence of it. We've not had that
much trouble seeing instances of even very narrow race-condition
windows in the past.
Well, the problem is that few of us have access to interesting !x86
machines to run tests, and that's where we'd see problems (since x86
gives enough guarantees to avoid this unless the compiler reorders
stuff). I am personally fine with just using volatiles to avoid
reordering in the older branches, but Florian argued against it.
Here's patches doing that. The 9.3 version also applies to 9.2; the 9.1
version applies back to 8.4.
I have no confidence that this isn't going to be real bad for performance.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-02-17 13:49:01 -0500, Robert Haas wrote:
On Sat, Feb 15, 2014 at 11:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-15 16:18:00 +0100, Andres Freund wrote:
On 2014-02-15 10:06:41 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
My current conclusion is that backporting barriers.h is by far the most
reasonable way to go. The compiler problems have been ironed out by
now...
-1. IMO that code is still quite unproven, and what's more, the
problem we're discussing here is completely hypothetical. If it
were real, we'd have field evidence of it. We've not had that
much trouble seeing instances of even very narrow race-condition
windows in the past.
Well, the problem is that few of us have access to interesting !x86
machines to run tests, and that's where we'd see problems (since x86
gives enough guarantees to avoid this unless the compiler reorders
stuff). I am personally fine with just using volatiles to avoid
reordering in the older branches, but Florian argued against it.
Here's patches doing that. The 9.3 version also applies to 9.2; the 9.1
version applies back to 8.4.
I have no confidence that this isn't going to be real bad for performance.
It's just a write barrier which evaluates to a pure compiler barrier on
x86 anyway?
And it's in a loop that's only entered when the kernel is entered anyway
to wake up the other backend.
What should that affect significantly?
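For context, the reason this is cheap on x86 (illustration only, not the
actual barrier.h definitions): on a total-store-order architecture a write
barrier merely has to stop the compiler from reordering the stores, which a
GCC-style compiler barrier does without emitting any instruction:

	/* hypothetical example of a pure compiler barrier */
	#define example_write_barrier()	__asm__ __volatile__("" : : : "memory")

On weakly ordered hardware the same macro would instead have to expand to a
real fence instruction, which is exactly the case where the barrier is needed
for correctness anyway.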
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Mon, Feb 17, 2014 at 1:55 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-17 13:49:01 -0500, Robert Haas wrote:
On Sat, Feb 15, 2014 at 11:17 AM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-15 16:18:00 +0100, Andres Freund wrote:
On 2014-02-15 10:06:41 -0500, Tom Lane wrote:
Andres Freund <andres@2ndquadrant.com> writes:
My current conclusion is that backporting barriers.h is by far the most
reasonable way to go. The compiler problems have been ironed out by
now...
-1. IMO that code is still quite unproven, and what's more, the
problem we're discussing here is completely hypothetical. If it
were real, we'd have field evidence of it. We've not had that
much trouble seeing instances of even very narrow race-condition
windows in the past.
Well, the problem is that few of us have access to interesting !x86
machines to run tests, and that's where we'd see problems (since x86
gives enough guarantees to avoid this unless the compiler reorders
stuff). I am personally fine with just using volatiles to avoid
reordering in the older branches, but Florian argued against it.
Here's patches doing that. The 9.3 version also applies to 9.2; the 9.1
version applies back to 8.4.
I have no confidence that this isn't going to be real bad for performance.
It's just a write barrier which evaluates to a pure compiler barrier on
x86 anyway?
And it's in a loop that's only entered when the kernel is entered anyway
to wake up the other backend.
What should that affect significantly?
On x86, presumably nothing. On other architectures, I don't know what
the impact is, but I don't accept a hand-wavy assertion that there
shouldn't be any as evidence that there won't be.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-02-17 14:06:43 -0500, Robert Haas wrote:
On Mon, Feb 17, 2014 at 1:55 PM, Andres Freund <andres@2ndquadrant.com> wrote:
On 2014-02-17 13:49:01 -0500, Robert Haas wrote:
It's just a write barrier which evaluates to a pure compiler barrier on
x86 anyway?
And it's in a loop that's only entered when the kernel is entered anyway
to wake up the other backend.
What should that affect significantly?
On x86, presumably nothing. On other architectures, I don't know what
the impact is, but I don't accept a hand-wavy assertion that there
shouldn't be any as evidence that there won't be.
Directly afterwards there's a syscall that needs to do internal locking
(because it's essentially doing IPC), which combined is certainly much
more expensive than a write barrier.
And any !x86 architecture that has more heavyweight write barriers
really *needs* a barrier there, since you only need more heavyweight
write barriers if the architecture doesn't guarantee total store
order. This isn't a performance optimization, it's correctness.
What's the way to resolve this then? I don't have access to any big !x86
machines.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 02/10/2014 08:33 PM, Heikki Linnakangas wrote:
On 02/10/2014 08:03 PM, Tom Lane wrote:
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
On 02/10/2014 06:41 PM, Andres Freund wrote:
Well, it's not actually using any lwlock.c code, it's a special case
locking logic, just reusing the datastructures. That said, I am not
particularly happy about the amount of code it's duplicating from
lwlock.c. Pretty much all of WALInsertSlotReleaseOne and most of
WALInsertSlotAcquireOne() is copied.
I'm not too happy with the amount of copy-paste myself, but there was
enough difference to regular lwlocks that I didn't want to bother all
lwlocks with the xlog-specific stuff either. The WAL insert slots do
share the LWLock-related PGPROC fields though, and semaphore. I'm all
ears if you have ideas on that..
I agree that if the behavior is considerably different, we don't really
want to try to make LWLockAcquire/Release cater to both this and their
standard behavior. But why not add some additional functions in lwlock.c
that do what xlog wants? If we're going to have mostly-copy-pasted logic,
it'd at least be better if it was in the same file, and not somewhere
that's not even in the same major subtree.
Ok, I'll try to refactor it that way, so that we can see if it looks better.
This is what I came up with. I like it, I didn't have to contort lwlocks
as much as I feared. I added one field to the LWLock structure, which is
used to store how far a WAL inserter has progressed. The
LWLock code calls it just "value", without caring what's stored in it,
and it's used by new functions LWLockWait and LWLockWakeup to implement
the behavior the WAL insertion slots have, to wake up other processes
waiting for the slot without releasing it.
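In rough outline (a sketch based on the patch below, using the names it
introduces), xlog.c then drives the new functions like this:

	/* inserter: grab one of the WAL insertion locks */
	LWLockAcquire(XLogCtl->Insert.xlogInsertLocks[MyLockNo], LW_EXCLUSIVE);
	/* ...when crossing a page boundary, advertise progress without
	 * releasing the lock, waking any LWLockWait()ers... */
	LWLockWakeup(XLogCtl->Insert.xlogInsertLocks[MyLockNo], CurrPos);
	/* ...finish copying the record, then release */
	LWLockRelease(XLogCtl->Insert.xlogInsertLocks[MyLockNo]);

	/* flusher: wait until the lock is free or its value has advanced */
	if (LWLockWait(XLogCtl->Insert.xlogInsertLocks[i], insertingat, &insertingat))
		insertingat = InvalidXLogRecPtr;	/* lock was free: no insert in progress */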
This passes regression tests, but I'll have to re-run the performance
tests with this. One worry is that if the padded size of the LWLock
struct is smaller than a cache line, neighboring WAL insertion locks will
compete for the cache line. Another worry is that since I added a field
to the LWLock struct, it might now take 64 bytes on platforms where it used
to be 32 bytes before. That wastes some memory.
- Heikki
Attachments:
xlogslot-to-lwlock.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 508970a..b148f70 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -86,7 +86,7 @@ int sync_method = DEFAULT_SYNC_METHOD;
int wal_level = WAL_LEVEL_MINIMAL;
int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
-int num_xloginsert_slots = 8;
+int num_xloginsert_locks = 8;
#ifdef WAL_DEBUG
bool XLOG_DEBUG = false;
@@ -289,7 +289,7 @@ XLogRecPtr XactLastRecEnd = InvalidXLogRecPtr;
* (which is almost but not quite the same as a pointer to the most recent
* CHECKPOINT record). We update this from the shared-memory copy,
* XLogCtl->Insert.RedoRecPtr, whenever we can safely do so (ie, when we
- * hold an insertion slot). See XLogInsert for details. We are also allowed
+ * hold an insertion lock). See XLogInsert for details. We are also allowed
* to update from XLogCtl->RedoRecPtr if we hold the info_lck;
* see GetRedoRecPtr. A freshly spawned backend obtains the value during
* InitXLOGAccess.
@@ -363,63 +363,6 @@ typedef struct XLogwrtResult
/*
- * A slot for inserting to the WAL. This is similar to an LWLock, the main
- * difference is that there is an extra xlogInsertingAt field that is protected
- * by the same mutex. Unlike an LWLock, a slot can only be acquired in
- * exclusive mode.
- *
- * The xlogInsertingAt field is used to advertise to other processes how far
- * the slot owner has progressed in inserting the record. When a backend
- * acquires a slot, it initializes xlogInsertingAt to 1, because it doesn't
- * yet know where it's going to insert the record. That's conservative
- * but correct; the new insertion is certainly going to go to a byte position
- * greater than 1. If another backend needs to flush the WAL, it will have to
- * wait for the new insertion. xlogInsertingAt is updated after finishing the
- * insert or when crossing a page boundary, which will wake up anyone waiting
- * for it, whether the wait was necessary in the first place or not.
- *
- * A process can wait on a slot in two modes: LW_EXCLUSIVE or
- * LW_WAIT_UNTIL_FREE. LW_EXCLUSIVE works like in an lwlock; when the slot is
- * released, the first LW_EXCLUSIVE waiter in the queue is woken up. Processes
- * waiting in LW_WAIT_UNTIL_FREE mode are woken up whenever the slot is
- * released, or xlogInsertingAt is updated. In other words, a process in
- * LW_WAIT_UNTIL_FREE mode is woken up whenever the inserter makes any progress
- * copying the record in place. LW_WAIT_UNTIL_FREE waiters are always added to
- * the front of the queue, while LW_EXCLUSIVE waiters are appended to the end.
- *
- * To join the wait queue, a process must set MyProc->lwWaitMode to the mode
- * it wants to wait in, MyProc->lwWaiting to true, and link MyProc to the head
- * or tail of the wait queue. The same mechanism is used to wait on an LWLock,
- * see lwlock.c for details.
- */
-typedef struct
-{
- slock_t mutex; /* protects the below fields */
- XLogRecPtr xlogInsertingAt; /* insert has completed up to this point */
-
- PGPROC *owner; /* for debugging purposes */
-
- bool releaseOK; /* T if ok to release waiters */
- char exclusive; /* # of exclusive holders (0 or 1) */
- PGPROC *head; /* head of list of waiting PGPROCs */
- PGPROC *tail; /* tail of list of waiting PGPROCs */
- /* tail is undefined when head is NULL */
-} XLogInsertSlot;
-
-/*
- * All the slots are allocated as an array in shared memory. We force the
- * array stride to be a power of 2, which saves a few cycles in indexing, but
- * more importantly also ensures that individual slots don't cross cache line
- * boundaries. (Of course, we have to also ensure that the array start
- * address is suitably aligned.)
- */
-typedef union XLogInsertSlotPadded
-{
- XLogInsertSlot slot;
- char pad[CACHE_LINE_SIZE];
-} XLogInsertSlotPadded;
-
-/*
* Shared state data for XLogInsert.
*/
typedef struct XLogCtlInsert
@@ -452,8 +395,8 @@ typedef struct XLogCtlInsert
* we must WAL-log it before it actually affects WAL-logging by backends.
* Checkpointer sets at startup or after SIGHUP.
*
- * To read these fields, you must hold an insertion slot. To modify them,
- * you must hold ALL the slots.
+ * To read these fields, you must hold an insertion lock. To modify them,
+ * you must hold ALL the locks.
*/
XLogRecPtr RedoRecPtr; /* current redo point for insertions */
bool forcePageWrites; /* forcing full-page writes for PITR? */
@@ -470,8 +413,27 @@ typedef struct XLogCtlInsert
int nonExclusiveBackups;
XLogRecPtr lastBackupStart;
- /* insertion slots, see XLogInsertSlot struct above for details */
- XLogInsertSlotPadded *insertSlots;
+ /*
+ * To insert a new WAL record, you must hold a WAL insertion lock. Before
+ * 9.4, there was a single WALInsertLock, but that became a bottleneck
+ * on multi-core systems that insert a lot of WAL. Now, there are several
+ * WAL insertion locks, and to insert WAL, you must hold one of them (in
+ * exclusive mode). It doesn't matter which one.
+ *
+ * In addition to a simple insertion-in-progress or not state, each WAL
+ * insertion lock can be used to advertise the current progress of the
+ * insertion, and to wait for partial progress to be made without waiting
+ * for the lock to be released. Each lock contains an XLogRecPtr that
+ * indicates how far the inserter has completed the insertion. This value
+ * is set with LWLockWakeup(), and can be queried with LWLockWait(). It
+ * is initialized to 0 (InvalidXLogRecPtr) when the lock is acquired,
+ * which means that the process doesn't yet know where it's going to
+ * insert. Small WAL records that don't cross a page boundary never update
+ * the value, the WAL record is just copied to the page and the lock is
+ * released. But when crossing a page boundary, it's updated to let others
+ * know that the backend has finished modifying the previous page.
+ */
+ LWLock **xlogInsertLocks;
} XLogCtlInsert;
/*
@@ -732,9 +694,9 @@ static bool InRedo = false;
/* Have we launched bgwriter during recovery? */
static bool bgwriterLaunched = false;
-/* For WALInsertSlotAcquire/Release functions */
-static int MySlotNo = 0;
-static bool holdingAllSlots = false;
+/* For WALInsertLockAcquire/Release functions */
+static int MyLockNo = 0;
+static bool holdingAllLocks = false;
static void readRecoveryCommandFile(void);
static void exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo);
@@ -808,16 +770,13 @@ static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
XLogRecPtr *PrevPtr);
static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
-static void WakeupWaiters(XLogRecPtr EndPos);
static char *GetXLogBuffer(XLogRecPtr ptr);
static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
-static void WALInsertSlotAcquire(bool exclusive);
-static void WALInsertSlotAcquireOne(int slotno);
-static void WALInsertSlotRelease(void);
-static void WALInsertSlotReleaseOne(int slotno);
+static void WALInsertLockAcquire(bool exclusive);
+static void WALInsertLockRelease(void);
/*
* Insert an XLOG record having the specified RMID and info bytes,
@@ -894,7 +853,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
*
* We may have to loop back to here if a race condition is detected below.
* We could prevent the race by doing all this work while holding an
- * insertion slot, but it seems better to avoid doing CRC calculations
+ * insertion lock, but it seems better to avoid doing CRC calculations
* while holding one.
*
* We add entries for backup blocks to the chain, so that they don't need
@@ -912,8 +871,8 @@ begin:;
/*
* Decide if we need to do full-page writes in this XLOG record: true if
* full_page_writes is on or we have a PITR request for it. Since we
- * don't yet have an insertion slot, fullPageWrites and forcePageWrites
- * could change under us, but we'll recheck them once we have a slot.
+ * don't yet have an insertion lock, fullPageWrites and forcePageWrites
+ * could change under us, but we'll recheck them once we have a lock.
*/
doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
@@ -1087,16 +1046,15 @@ begin:;
* record in place. This can be done concurrently in multiple processes.
*
* To keep track of which insertions are still in-progress, each concurrent
- * inserter allocates an "insertion slot", which tells others how far the
+ * inserter acquires an insertion lock. In addition to just indicating that
+ * an insertion is in progress, the lock tells others how far the
* inserter has progressed. There is a small fixed number of insertion
- * slots, determined by the num_xloginsert_slots GUC. When an inserter
- * finishes, it updates the xlogInsertingAt of its slot to the end of the
- * record it inserted, to let others know that it's done. xlogInsertingAt
- * is also updated when crossing over to a new WAL buffer, to allow the
- * the previous buffer to be flushed.
+ * locks, determined by the num_xloginsert_locks GUC. When an inserter
+ * crosses a page boundary, it updates the value stored in the lock to the
+ * how far it has inserted, to allow the the previous buffer to be flushed.
*
- * Holding onto a slot also protects RedoRecPtr and fullPageWrites from
- * changing until the insertion is finished.
+ * Holding onto an insertion lock also protects RedoRecPtr and
+ * fullPageWrites from changing until the insertion is finished.
*
* Step 2 can usually be done completely in parallel. If the required WAL
* page is not initialized yet, you have to grab WALBufMappingLock to
@@ -1106,7 +1064,7 @@ begin:;
*----------
*/
START_CRIT_SECTION();
- WALInsertSlotAcquire(isLogSwitch);
+ WALInsertLockAcquire(isLogSwitch);
/*
* Check to see if my RedoRecPtr is out of date. If so, may have to go
@@ -1135,7 +1093,7 @@ begin:;
* Oops, this buffer now needs to be backed up, but we
* didn't think so above. Start over.
*/
- WALInsertSlotRelease();
+ WALInsertLockRelease();
END_CRIT_SECTION();
rdt_lastnormal->next = NULL;
info = info_orig;
@@ -1154,7 +1112,7 @@ begin:;
if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
{
/* Oops, must redo it with full-page data. */
- WALInsertSlotRelease();
+ WALInsertLockRelease();
END_CRIT_SECTION();
rdt_lastnormal->next = NULL;
info = info_orig;
@@ -1202,7 +1160,7 @@ begin:;
/*
* Done! Let others know that we're finished.
*/
- WALInsertSlotRelease();
+ WALInsertLockRelease();
MarkCurrentTransactionIdLoggedIfAny();
@@ -1363,7 +1321,7 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
/*
* These calculations are a bit heavy-weight to be done while holding a
- * spinlock, but since we're holding all the WAL insertion slots, there
+ * spinlock, but since we're holding all the WAL insertion locks, there
* are no other inserters competing for it. GetXLogInsertRecPtr() does
* compete for it, but that's not called very frequently.
*/
@@ -1523,7 +1481,7 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
while (CurrPos < EndPos)
{
/* initialize the next page (if not initialized already) */
- WakeupWaiters(CurrPos);
+ LWLockWakeup(XLogCtl->Insert.xlogInsertLocks[MyLockNo], CurrPos);
AdvanceXLInsertBuffer(CurrPos, false);
CurrPos += XLOG_BLCKSZ;
}
@@ -1534,452 +1492,77 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
}
/*
- * Allocate a slot for insertion.
+ * Acquire a WAL insertion lock.
*
- * In exclusive mode, all slots are reserved for the current process. That
- * blocks all concurrent insertions.
+ * In exclusive mode, all locks are acquired. That blocks all concurrent
+ * insertions.
*/
static void
-WALInsertSlotAcquire(bool exclusive)
+WALInsertLockAcquire(bool exclusive)
{
int i;
if (exclusive)
{
- for (i = 0; i < num_xloginsert_slots; i++)
- WALInsertSlotAcquireOne(i);
- holdingAllSlots = true;
+ for (i = 0; i < num_xloginsert_locks; i++)
+ LWLockAcquire(XLogCtl->Insert.xlogInsertLocks[i], LW_EXCLUSIVE);
+ holdingAllLocks = true;
}
else
- WALInsertSlotAcquireOne(-1);
-}
-
-/*
- * Workhorse of WALInsertSlotAcquire. Acquires the given slot, or an arbitrary
- * one if slotno == -1. The index of the slot that was acquired is stored in
- * MySlotNo.
- *
- * This is more or less equivalent to LWLockAcquire().
- */
-static void
-WALInsertSlotAcquireOne(int slotno)
-{
- volatile XLogInsertSlot *slot;
- PGPROC *proc = MyProc;
- bool retry = false;
- int extraWaits = 0;
- static int slotToTry = -1;
-
- /*
- * Try to use the slot we used last time. If the system isn't particularly
- * busy, it's a good bet that it's available, and it's good to have some
- * affinity to a particular slot so that you don't unnecessarily bounce
- * cache lines between processes when there is no contention.
- *
- * If this is the first time through in this backend, pick a slot
- * (semi-)randomly. This allows the slots to be used evenly if you have a
- * lot of very short connections.
- */
- if (slotno != -1)
- MySlotNo = slotno;
- else
{
- if (slotToTry == -1)
- slotToTry = MyProc->pgprocno % num_xloginsert_slots;
- MySlotNo = slotToTry;
- }
-
- /*
- * We can't wait if we haven't got a PGPROC. This should only occur
- * during bootstrap or shared memory initialization. Put an Assert here
- * to catch unsafe coding practices.
- */
- Assert(MyProc != NULL);
-
- /*
- * Lock out cancel/die interrupts until we exit the code section protected
- * by the slot. This ensures that interrupts will not interfere with
- * manipulations of data structures in shared memory. There is no cleanup
- * mechanism to release the slot if the backend dies while holding one,
- * so make this a critical section.
- */
- START_CRIT_SECTION();
-
- /*
- * Loop here to try to acquire slot after each time we are signaled by
- * WALInsertSlotRelease.
- */
- for (;;)
- {
- bool mustwait;
-
- slot = &XLogCtl->Insert.insertSlots[MySlotNo].slot;
-
- /* Acquire mutex. Time spent holding mutex should be short! */
- SpinLockAcquire(&slot->mutex);
-
- /* If retrying, allow WALInsertSlotRelease to release waiters again */
- if (retry)
- slot->releaseOK = true;
-
- /* If I can get the slot, do so quickly. */
- if (slot->exclusive == 0)
- {
- slot->exclusive++;
- mustwait = false;
- }
- else
- mustwait = true;
-
- if (!mustwait)
- break; /* got the lock */
-
- Assert(slot->owner != MyProc);
+ bool immed;
/*
- * Add myself to wait queue.
- */
- proc->lwWaiting = true;
- proc->lwWaitMode = LW_EXCLUSIVE;
- proc->lwWaitLink = NULL;
- if (slot->head == NULL)
- slot->head = proc;
- else
- slot->tail->lwWaitLink = proc;
- slot->tail = proc;
-
- /* Can release the mutex now */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Wait until awakened.
+ * Acquire one of the locks. It doesn't matter which one, but try to
+ * use the lock we used last time. If the system isn't particularly
+ * busy, it's a good bet that it's available, and it's good to have
+ * some affinity to a particular lock so that you don't unnecessarily
+ * bounce cache lines between processes when there is no contention.
*
- * Since we share the process wait semaphore with the regular lock
- * manager and ProcWaitForSignal, and we may need to acquire a slot
- * while one of those is pending, it is possible that we get awakened
- * for a reason other than being signaled by WALInsertSlotRelease. If
- * so, loop back and wait again. Once we've gotten the slot,
- * re-increment the sema by the number of additional signals received,
- * so that the lock manager or signal manager will see the received
- * signal when it next waits.
+ * If this is the first time through in this backend, pick a lock
+ * (semi-)randomly. This allows the locks to be used evenly if you have
+ * a lot of very short connections.
*/
- for (;;)
- {
- /* "false" means cannot accept cancel/die interrupt here. */
- PGSemaphoreLock(&proc->sem, false);
- if (!proc->lwWaiting)
- break;
- extraWaits++;
- }
-
- /* Now loop back and try to acquire lock again. */
- retry = true;
- }
-
- slot->owner = proc;
-
- /*
- * Normally, we initialize the xlogInsertingAt value of the slot to 1,
- * because we don't yet know where in the WAL we're going to insert. It's
- * not critical what it points to right now - leaving it to a too small
- * value just means that WaitXlogInsertionsToFinish() might wait on us
- * unnecessarily, until we update the value (when we finish the insert or
- * move to next page).
- *
- * If we're grabbing all the slots, however, stamp all but the last one
- * with InvalidXLogRecPtr, meaning there is no insert in progress. The last
- * slot is the one that we will update as we proceed with the insert, the
- * rest are held just to keep off other inserters.
- */
- if (slotno != -1 && slotno != num_xloginsert_slots - 1)
- slot->xlogInsertingAt = InvalidXLogRecPtr;
- else
- slot->xlogInsertingAt = 1;
-
- /* We are done updating shared state of the slot itself. */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Fix the process wait semaphore's count for any absorbed wakeups.
- */
- while (extraWaits-- > 0)
- PGSemaphoreUnlock(&proc->sem);
-
- /*
- * If we couldn't get the slot immediately, try another slot next time.
- * On a system with more insertion slots than concurrent inserters, this
- * causes all the inserters to eventually migrate to a slot that no-one
- * else is using. On a system with more inserters than slots, it still
- * causes the inserters to be distributed quite evenly across the slots.
- */
- if (slotno != -1 && retry)
- slotToTry = (slotToTry + 1) % num_xloginsert_slots;
-}
-
-/*
- * Wait for the given slot to become free, or for its xlogInsertingAt location
- * to change to something else than 'waitptr'. In other words, wait for the
- * inserter using the given slot to finish its insertion, or to at least make
- * some progress.
- */
-static void
-WaitOnSlot(volatile XLogInsertSlot *slot, XLogRecPtr waitptr)
-{
- PGPROC *proc = MyProc;
- int extraWaits = 0;
-
- /*
- * Lock out cancel/die interrupts while we sleep on the slot. There is
- * no cleanup mechanism to remove us from the wait queue if we got
- * interrupted.
- */
- HOLD_INTERRUPTS();
-
- /*
- * Loop here to try to acquire lock after each time we are signaled.
- */
- for (;;)
- {
- bool mustwait;
-
- /* Acquire mutex. Time spent holding mutex should be short! */
- SpinLockAcquire(&slot->mutex);
-
- /* If I can get the lock, do so quickly. */
- if (slot->exclusive == 0 || slot->xlogInsertingAt != waitptr)
- mustwait = false;
- else
- mustwait = true;
-
- if (!mustwait)
- break; /* the lock was free */
-
- Assert(slot->owner != MyProc);
-
- /*
- * Add myself to wait queue.
- */
- proc->lwWaiting = true;
- proc->lwWaitMode = LW_WAIT_UNTIL_FREE;
- proc->lwWaitLink = NULL;
-
- /* waiters are added to the front of the queue */
- proc->lwWaitLink = slot->head;
- if (slot->head == NULL)
- slot->tail = proc;
- slot->head = proc;
+ static int lockToTry = -1;
+ if (lockToTry == -1)
+ lockToTry = MyProc->pgprocno % num_xloginsert_locks;
+ MyLockNo = lockToTry;
- /* Can release the mutex now */
- SpinLockRelease(&slot->mutex);
+ immed = LWLockAcquire(XLogCtl->Insert.xlogInsertLocks[MyLockNo], LW_EXCLUSIVE);
- /*
- * Wait until awakened.
- *
- * Since we share the process wait semaphore with other things, like
- * the regular lock manager and ProcWaitForSignal, and we may need to
- * acquire an LWLock while one of those is pending, it is possible that
- * we get awakened for a reason other than being signaled by
- * LWLockRelease. If so, loop back and wait again. Once we've gotten
- * the LWLock, re-increment the sema by the number of additional
- * signals received, so that the lock manager or signal manager will
- * see the received signal when it next waits.
- */
- for (;;)
+ if (!immed)
{
- /* "false" means cannot accept cancel/die interrupt here. */
- PGSemaphoreLock(&proc->sem, false);
- if (!proc->lwWaiting)
- break;
- extraWaits++;
- }
-
- /* Now loop back and try to acquire lock again. */
- }
-
- /* We are done updating shared state of the lock itself. */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Fix the process wait semaphore's count for any absorbed wakeups.
- */
- while (extraWaits-- > 0)
- PGSemaphoreUnlock(&proc->sem);
-
- /*
- * Now okay to allow cancel/die interrupts.
- */
- RESUME_INTERRUPTS();
-}
-
-/*
- * Wake up all processes waiting for us with WaitOnSlot(). Sets our
- * xlogInsertingAt value to EndPos, without releasing the slot.
- */
-static void
-WakeupWaiters(XLogRecPtr EndPos)
-{
- volatile XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[MySlotNo].slot;
- PGPROC *head;
- PGPROC *proc;
- PGPROC *next;
-
- /*
- * If we have already reported progress up to the same point, do nothing.
- * No other process can modify xlogInsertingAt, so we can check this before
- * grabbing the spinlock.
- */
- if (slot->xlogInsertingAt == EndPos)
- return;
- /* xlogInsertingAt should not go backwards */
- Assert(slot->xlogInsertingAt < EndPos);
-
- /* Acquire mutex. Time spent holding mutex should be short! */
- SpinLockAcquire(&slot->mutex);
-
- /* we should own the slot */
- Assert(slot->exclusive == 1 && slot->owner == MyProc);
-
- slot->xlogInsertingAt = EndPos;
-
- /*
- * See if there are any LW_WAIT_UNTIL_FREE waiters that need to be woken
- * up. They are always in the front of the queue.
- */
- head = slot->head;
-
- if (head != NULL && head->lwWaitMode == LW_WAIT_UNTIL_FREE)
- {
- proc = head;
- next = proc->lwWaitLink;
- while (next && next->lwWaitMode == LW_WAIT_UNTIL_FREE)
- {
- proc = next;
- next = next->lwWaitLink;
+ /*
+ * If we couldn't get the lock immediately, try another lock next
+ * time. On a system with more insertion locks than concurrent
+ * inserters, this causes all the inserters to eventually migrate
+ * to a lock that no-one else is using. On a system with more
+ * inserters than locks, it still helps to distribute the inserters
+ * quite evenly across the locks.
+ */
+ lockToTry = (lockToTry + 1) % num_xloginsert_locks;
}
-
- /* proc is now the last PGPROC to be released */
- slot->head = next;
- proc->lwWaitLink = NULL;
- }
- else
- head = NULL;
-
- /* We are done updating shared state of the lock itself. */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Awaken any waiters I removed from the queue.
- */
- while (head != NULL)
- {
- proc = head;
- head = proc->lwWaitLink;
- proc->lwWaitLink = NULL;
- proc->lwWaiting = false;
- PGSemaphoreUnlock(&proc->sem);
}
}
/*
- * Release our insertion slot (or slots, if we're holding them all).
+ * Release our insertion lock (or locks, if we're holding them all).
*/
static void
-WALInsertSlotRelease(void)
+WALInsertLockRelease(void)
{
int i;
- if (holdingAllSlots)
+ if (holdingAllLocks)
{
- for (i = 0; i < num_xloginsert_slots; i++)
- WALInsertSlotReleaseOne(i);
- holdingAllSlots = false;
+ for (i = 0; i < num_xloginsert_locks; i++)
+ LWLockRelease(XLogCtl->Insert.xlogInsertLocks[i]);
+ holdingAllLocks = false;
}
else
- WALInsertSlotReleaseOne(MySlotNo);
-}
-
-static void
-WALInsertSlotReleaseOne(int slotno)
-{
- volatile XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[slotno].slot;
- PGPROC *head;
- PGPROC *proc;
-
- /* Acquire mutex. Time spent holding mutex should be short! */
- SpinLockAcquire(&slot->mutex);
-
- /* we must be holding it */
- Assert(slot->exclusive == 1 && slot->owner == MyProc);
-
- slot->xlogInsertingAt = InvalidXLogRecPtr;
-
- /* Release my hold on the slot */
- slot->exclusive = 0;
- slot->owner = NULL;
-
- /*
- * See if I need to awaken any waiters..
- */
- head = slot->head;
- if (head != NULL)
- {
- if (slot->releaseOK)
- {
- /*
- * Remove the to-be-awakened PGPROCs from the queue.
- */
- bool releaseOK = true;
-
- proc = head;
-
- /*
- * First wake up any backends that want to be woken up without
- * acquiring the lock. These are always in the front of the queue.
- */
- while (proc->lwWaitMode == LW_WAIT_UNTIL_FREE && proc->lwWaitLink)
- proc = proc->lwWaitLink;
-
- /*
- * Awaken the first exclusive-waiter, if any.
- */
- if (proc->lwWaitLink)
- {
- Assert(proc->lwWaitLink->lwWaitMode == LW_EXCLUSIVE);
- proc = proc->lwWaitLink;
- releaseOK = false;
- }
- /* proc is now the last PGPROC to be released */
- slot->head = proc->lwWaitLink;
- proc->lwWaitLink = NULL;
-
- slot->releaseOK = releaseOK;
- }
- else
- head = NULL;
- }
-
- /* We are done updating shared state of the slot itself. */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Awaken any waiters I removed from the queue.
- */
- while (head != NULL)
- {
- proc = head;
- head = proc->lwWaitLink;
- proc->lwWaitLink = NULL;
- proc->lwWaiting = false;
- PGSemaphoreUnlock(&proc->sem);
- }
-
- /*
- * Now okay to allow cancel/die interrupts.
- */
- END_CRIT_SECTION();
+ LWLockRelease(XLogCtl->Insert.xlogInsertLocks[MyLockNo]);
}
-
/*
* Wait for any WAL insertions < upto to finish.
*
@@ -2029,79 +1612,53 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
}
/*
+ * Loop through all the locks, sleeping on any in-progress insert older
+ * than 'upto'.
+ *
* finishedUpto is our return value, indicating the point upto which
* all the WAL insertions have been finished. Initialize it to the head
- * of reserved WAL, and as we iterate through the insertion slots, back it
+ * of reserved WAL, and as we iterate through the insertion locks, back it
* out for any insertion that's still in progress.
*/
finishedUpto = reservedUpto;
-
- /*
- * Loop through all the slots, sleeping on any in-progress insert older
- * than 'upto'.
- */
- for (i = 0; i < num_xloginsert_slots; i++)
+ for (i = 0; i < num_xloginsert_locks; i++)
{
- volatile XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[i].slot;
- XLogRecPtr insertingat;
-
- retry:
- /*
- * We can check if the slot is in use without grabbing the spinlock.
- * The spinlock acquisition of insertpos_lck before this loop acts
- * as a memory barrier. If someone acquires the slot after that, it
- * can't possibly be inserting to anything < reservedUpto. If it was
- * acquired before that, an unlocked test will return true.
- */
- if (!slot->exclusive)
- continue;
-
- SpinLockAcquire(&slot->mutex);
- /* re-check now that we have the lock */
- if (!slot->exclusive)
- {
- SpinLockRelease(&slot->mutex);
- continue;
- }
- insertingat = slot->xlogInsertingAt;
- SpinLockRelease(&slot->mutex);
-
- if (insertingat == InvalidXLogRecPtr)
+ XLogRecPtr insertingat = InvalidXLogRecPtr;
+ do
{
/*
- * slot is reserved just to hold off other inserters, there is no
- * actual insert in progress.
+ * See if this insertion is in progress. LWLockWait will wait for
+ * the lock to be released, or for the 'value' to be set by a
+ * LWLockWakeup call. When a lock is initially acquired, its
+ * value is reset to 0 (InvalidXLogRecPtr), which means that we
+ * don't know where it's inserting yet. We will have to wait for
+ * it. If it's a small insertion, the record will most likely fit
+ * on the same page and the inserter will release the lock without
+ * ever calling LWLockWakeup. But if it has to cross a page, it
+ * advertise the insertion point with LWLockWakeup.
+ *
+ * It's possible that the inserter finishes, and another inserter
+ * grabs the lock before we get a chance to observe that the lock
+ * was momentarily released. We wouldn't need to wait for the new
+ * inserter, but we cannot distinguish that case, so we will have
+ * to wait.
*/
- continue;
- }
+ if (LWLockWait(XLogCtl->Insert.xlogInsertLocks[i],
+ insertingat, &insertingat))
+ {
+ /* the lock was free, so no insertion in progress */
+ insertingat = InvalidXLogRecPtr;
+ break;
+ }
- /*
- * This insertion is still in progress. Do we need to wait for it?
- *
- * When an inserter acquires a slot, it doesn't reset 'insertingat', so
- * it will initially point to the old value of some already-finished
- * insertion. The inserter will update the value as soon as it finishes
- * the insertion, moves to the next page, or has to do I/O to flush an
- * old dirty buffer. That means that when we see a slot with
- * insertingat value < upto, we don't know if that insertion is still
- * truly in progress, or if the slot is reused by a new inserter that
- * hasn't updated the insertingat value yet. We have to assume it's the
- * latter, and wait.
- */
- if (insertingat < upto)
- {
- WaitOnSlot(slot, insertingat);
- goto retry;
- }
- else
- {
/*
- * We don't need to wait for this insertion, but update the
- * return value.
+ * This insertion is still in progress. Have to wait, unless the
+ * inserter has proceeded past 'upto'.
*/
- if (insertingat < finishedUpto)
- finishedUpto = insertingat;
- }
+ } while (insertingat < upto);
+
+ if (insertingat != InvalidXLogRecPtr && insertingat < finishedUpto)
+ finishedUpto = insertingat;
}
return finishedUpto;
}
@@ -2115,7 +1672,7 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
*
* The caller must ensure that the page containing the requested location
* isn't evicted yet, and won't be evicted. The way to ensure that is to
- * hold onto an XLogInsertSlot with the xlogInsertingAt position set to
+ * hold onto a WAL insertion lock with the xlogInsertingAt position set to
* something <= ptr. GetXLogBuffer() will update xlogInsertingAt if it needs
* to evict an old page from the buffer. (This means that once you call
* GetXLogBuffer() with a given 'ptr', you must not access anything before
@@ -2176,7 +1733,8 @@ GetXLogBuffer(XLogRecPtr ptr)
* Let others know that we're finished inserting the record up
* to the page boundary.
*/
- WakeupWaiters(expectedEndPtr - XLOG_BLCKSZ);
+ LWLockWakeup(XLogCtl->Insert.xlogInsertLocks[MyLockNo],
+ expectedEndPtr - XLOG_BLCKSZ);
AdvanceXLInsertBuffer(ptr, false);
endptr = XLogCtl->xlblocks[idx];
@@ -5071,8 +4629,6 @@ XLOGShmemSize(void)
/* XLogCtl */
size = sizeof(XLogCtlData);
- /* xlog insertion slots, plus alignment */
- size = add_size(size, mul_size(sizeof(XLogInsertSlotPadded), num_xloginsert_slots + 1));
/* xlblocks array */
size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
/* extra alignment padding for XLOG I/O buffers */
@@ -5120,11 +4676,11 @@ XLOGShmemInit(void)
memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
- /* Xlog insertion slots. Ensure they're aligned to the full padded size */
- allocptr += sizeof(XLogInsertSlotPadded) -
- ((uintptr_t) allocptr) % sizeof(XLogInsertSlotPadded);
- XLogCtl->Insert.insertSlots = (XLogInsertSlotPadded *) allocptr;
- allocptr += sizeof(XLogInsertSlotPadded) * num_xloginsert_slots;
+ /* WAL insert locks */
+ XLogCtl->Insert.xlogInsertLocks = (LWLock **) allocptr;
+ allocptr += sizeof(LWLock *) * num_xloginsert_locks;
+ for (i = 0; i < num_xloginsert_locks; i++)
+ XLogCtl->Insert.xlogInsertLocks[i] = LWLockAssign();
/*
* Align the start of the page buffers to a full xlog block size boundary.
@@ -5144,19 +4700,6 @@ XLOGShmemInit(void)
XLogCtl->SharedHotStandbyActive = false;
XLogCtl->WalWriterSleeping = false;
- for (i = 0; i < num_xloginsert_slots; i++)
- {
- XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[i].slot;
- SpinLockInit(&slot->mutex);
- slot->xlogInsertingAt = InvalidXLogRecPtr;
- slot->owner = NULL;
-
- slot->releaseOK = true;
- slot->exclusive = 0;
- slot->head = NULL;
- slot->tail = NULL;
- }
-
SpinLockInit(&XLogCtl->Insert.insertpos_lck);
SpinLockInit(&XLogCtl->info_lck);
SpinLockInit(&XLogCtl->ulsn_lck);
@@ -7901,7 +7444,7 @@ GetRedoRecPtr(void)
/*
* The possibly not up-to-date copy in XlogCtl is enough. Even if we
- * grabbed a WAL insertion slot to read the master copy, someone might
+ * grabbed a WAL insertion lock to read the master copy, someone might
* update it just after we've released the lock.
*/
SpinLockAcquire(&xlogctl->info_lck);
@@ -7919,7 +7462,7 @@ GetRedoRecPtr(void)
*
* NOTE: The value *actually* returned is the position of the last full
* xlog page. It lags behind the real insert position by at most 1 page.
- * For that, we don't need to scan through WAL insertion slots, and an
+ * For that, we don't need to scan through WAL insertion locks, and an
* approximation is enough for the current usage of this function.
*/
XLogRecPtr
@@ -8280,7 +7823,7 @@ CreateCheckPoint(int flags)
* We must block concurrent insertions while examining insert state to
* determine the checkpoint REDO pointer.
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
/*
@@ -8305,7 +7848,7 @@ CreateCheckPoint(int flags)
MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
ControlFile->checkPoint == ControlFile->checkPointCopy.redo)
{
- WALInsertSlotRelease();
+ WALInsertLockRelease();
LWLockRelease(CheckpointLock);
END_CRIT_SECTION();
return;
@@ -8349,7 +7892,7 @@ CreateCheckPoint(int flags)
/*
* Here we update the shared RedoRecPtr for future XLogInsert calls; this
- * must be done while holding the insertion slots.
+ * must be done while holding all the insertion locks.
*
* Note: if we fail to complete the checkpoint, RedoRecPtr will be left
* pointing past where it really needs to point. This is okay; the only
@@ -8361,10 +7904,10 @@ CreateCheckPoint(int flags)
RedoRecPtr = xlogctl->Insert.RedoRecPtr = checkPoint.redo;
/*
- * Now we can release the WAL insertion slots, allowing other xacts to
+ * Now we can release the WAL insertion locks, allowing other xacts to
* proceed while we are flushing disk buffers.
*/
- WALInsertSlotRelease();
+ WALInsertLockRelease();
/* Update the info_lck-protected copy of RedoRecPtr as well */
SpinLockAcquire(&xlogctl->info_lck);
@@ -8394,7 +7937,7 @@ CreateCheckPoint(int flags)
* we wait till he's out of his commit critical section before proceeding.
* See notes in RecordTransactionCommit().
*
- * Because we've already released the insertion slots, this test is a bit
+ * Because we've already released the insertion locks, this test is a bit
* fuzzy: it is possible that we will wait for xacts we didn't really need
* to wait for. But the delay should be short and it seems better to make
* checkpoint take a bit longer than to hold off insertions longer than
@@ -8625,10 +8168,10 @@ CreateEndOfRecoveryRecord(void)
xlrec.end_time = time(NULL);
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
xlrec.ThisTimeLineID = ThisTimeLineID;
xlrec.PrevTimeLineID = XLogCtl->PrevTimeLineID;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
LocalSetXLogInsertAllowed();
@@ -8834,9 +8377,9 @@ CreateRestartPoint(int flags)
* during recovery this is just pro forma, because no WAL insertions are
* happening.
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
/* Also update the info_lck-protected copy */
SpinLockAcquire(&xlogctl->info_lck);
@@ -9296,9 +8839,9 @@ UpdateFullPageWrites(void)
*/
if (fullPageWrites)
{
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
Insert->fullPageWrites = true;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
}
/*
@@ -9319,9 +8862,9 @@ UpdateFullPageWrites(void)
if (!fullPageWrites)
{
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
Insert->fullPageWrites = false;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
}
END_CRIT_SECTION();
}
@@ -9952,15 +9495,15 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
* Note that forcePageWrites has no effect during an online backup from
* the standby.
*
- * We must hold all the insertion slots to change the value of
+ * We must hold all the insertion locks to change the value of
* forcePageWrites, to ensure adequate interlocking against XLogInsert().
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
if (exclusive)
{
if (XLogCtl->Insert.exclusiveBackup)
{
- WALInsertSlotRelease();
+ WALInsertLockRelease();
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("a backup is already in progress"),
@@ -9971,7 +9514,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
else
XLogCtl->Insert.nonExclusiveBackups++;
XLogCtl->Insert.forcePageWrites = true;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
/* Ensure we release forcePageWrites if fail below */
PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
@@ -10086,13 +9629,13 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
* taking a checkpoint right after another is not that expensive
* either because only few buffers have been dirtied yet.
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
if (XLogCtl->Insert.lastBackupStart < startpoint)
{
XLogCtl->Insert.lastBackupStart = startpoint;
gotUniqueStartpoint = true;
}
- WALInsertSlotRelease();
+ WALInsertLockRelease();
} while (!gotUniqueStartpoint);
XLByteToSeg(startpoint, _logSegNo);
@@ -10182,7 +9725,7 @@ pg_start_backup_callback(int code, Datum arg)
bool exclusive = DatumGetBool(arg);
/* Update backup counters and forcePageWrites on failure */
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
if (exclusive)
{
Assert(XLogCtl->Insert.exclusiveBackup);
@@ -10199,7 +9742,7 @@ pg_start_backup_callback(int code, Datum arg)
{
XLogCtl->Insert.forcePageWrites = false;
}
- WALInsertSlotRelease();
+ WALInsertLockRelease();
}
/*
@@ -10268,7 +9811,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
/*
* OK to update backup counters and forcePageWrites
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
if (exclusive)
XLogCtl->Insert.exclusiveBackup = false;
else
@@ -10288,7 +9831,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
{
XLogCtl->Insert.forcePageWrites = false;
}
- WALInsertSlotRelease();
+ WALInsertLockRelease();
if (exclusive)
{
@@ -10573,7 +10116,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
void
do_pg_abort_backup(void)
{
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
XLogCtl->Insert.nonExclusiveBackups--;
@@ -10582,7 +10125,7 @@ do_pg_abort_backup(void)
{
XLogCtl->Insert.forcePageWrites = false;
}
- WALInsertSlotRelease();
+ WALInsertLockRelease();
}
/*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 82ef440..3e46783 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -10,6 +10,13 @@
* locking should be done with the full lock manager --- which depends on
* LWLocks to protect its shared state.
*
+ * In addition to exclusive and shared modes, each lightweight lock can hold
+ * an extra value, which can be set with LWLockWakeup, and waited for with
+ * LWLockWait. LWLockWait waits until the lock is free, or the lock's extra
+ * value changes. LWLockWakeup sets the extra value, waking up any LWLockWait()
+ * callers, without releasing the lock. The extra value's meaning is up to
+ * the caller, but it is initialized to 0 when a lock is acquired.
+ *
*
* Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -239,6 +246,9 @@ NumLWLocks(void)
/* predicate.c needs one per old serializable xid buffer */
numLocks += NUM_OLDSERXID_BUFFERS;
+ /* xlog.c needs one per xlog insertion slot */
+ numLocks += num_xloginsert_locks;
+
/* slot.c needs one for each slot */
numLocks += max_replication_slots;
@@ -443,16 +453,18 @@ LWLockInitialize(LWLock *lock, int tranche_id)
/*
* LWLockAcquire - acquire a lightweight lock in the specified mode
*
- * If the lock is not available, sleep until it is.
+ * If the lock is not available, sleep until it is. Returns true if the lock
+ * was available immediately, false if we had to sleep.
*
* Side effect: cancel/die interrupts are held off until lock release.
*/
-void
+bool
LWLockAcquire(LWLock *l, LWLockMode mode)
{
volatile LWLock *lock = l;
PGPROC *proc = MyProc;
bool retry = false;
+ bool result = true;
int extraWaits = 0;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -601,8 +613,15 @@ LWLockAcquire(LWLock *l, LWLockMode mode)
/* Now loop back and try to acquire lock again. */
retry = true;
+ result = false;
}
+ /*
+ * Initialize the lock's extra value (it can only be used in an exclusively
+ * held lock, but there's no harm resetting it in a shared lock either).
+ */
+ lock->value = 0;
+
/* We are done updating shared state of the lock itself. */
SpinLockRelease(&lock->mutex);
@@ -616,6 +635,8 @@ LWLockAcquire(LWLock *l, LWLockMode mode)
*/
while (extraWaits-- > 0)
PGSemaphoreUnlock(&proc->sem);
+
+ return result;
}
/*
@@ -669,6 +690,9 @@ LWLockConditionalAcquire(LWLock *l, LWLockMode mode)
mustwait = true;
}
+ if (!mustwait)
+ lock->value = 0;
+
/* We are done updating shared state of the lock itself. */
SpinLockRelease(&lock->mutex);
@@ -807,6 +831,7 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
}
else
{
+ lock->value = 0;
/* We are done updating shared state of the lock itself. */
SpinLockRelease(&lock->mutex);
}
@@ -835,6 +860,217 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
}
/*
+ * LWLockWait - Wait until lock is free, or its value is updated.
+ *
+ * If the lock is held, and the lock's value equals 'oldval', waits until the
+ * lock is either freed, or the lock holder changes the value by calling
+ * LWLockWakeup. If the lock is free on exit (immediately or after waiting),
+ * returns true. Otherwise returns false, and sets *newval to the lock's
+ * current value.
+ *
+ * Note that this function ignores shared lock holders; if the lock is held in shared
+ * mode, returns 'true'.
+ */
+bool
+LWLockWait(LWLock *l, uint64 oldval, uint64 *newval)
+{
+ volatile LWLock *lock = l;
+ PGPROC *proc = MyProc;
+ int extraWaits = 0;
+ bool result = false;
+
+ /*
+ * Quick test first to see if the lock is free right now.
+ *
+ * XXX: the caller uses a spinlock before this, so we don't need a memory
+ * barrier here as far as the current usage is concerned. But that might
+ * not be safe in general.
+ */
+ if (lock->exclusive == 0)
+ return true;
+
+ /*
+ * Lock out cancel/die interrupts while we sleep on the lock. There is
+ * no cleanup mechanism to remove us from the wait queue if we got
+ * interrupted.
+ */
+ HOLD_INTERRUPTS();
+
+ /*
+ * Loop here to check the lock's status after each time we are signaled.
+ */
+ for (;;)
+ {
+ bool mustwait;
+ uint64 value;
+
+ /* Acquire mutex. Time spent holding mutex should be short! */
+#ifdef LWLOCK_STATS
+ lwstats->spin_delay_count += SpinLockAcquire(&lock->mutex);
+#else
+ SpinLockAcquire(&lock->mutex);
+#endif
+
+ /* Is the lock now free, and if not, does the value match? */
+ if (lock->exclusive == 0)
+ {
+ result = true;
+ mustwait = false;
+ }
+ else
+ {
+ value = lock->value;
+ if (value != oldval)
+ {
+ result = false;
+ mustwait = false;
+ *newval = value;
+ }
+ else
+ mustwait = true;
+ }
+
+ if (!mustwait)
+ break; /* the lock was free or value didn't match */
+
+ /*
+ * Add myself to wait queue.
+ */
+ proc->lwWaiting = true;
+ proc->lwWaitMode = LW_WAIT_UNTIL_FREE;
+ proc->lwWaitLink = NULL;
+
+ /* waiters are added to the front of the queue */
+ proc->lwWaitLink = lock->head;
+ if (lock->head == NULL)
+ lock->tail = proc;
+ lock->head = proc;
+
+ /* Can release the mutex now */
+ SpinLockRelease(&lock->mutex);
+
+ /*
+ * Wait until awakened.
+ *
+ * Since we share the process wait semaphore with other things, like
+ * the regular lock manager and ProcWaitForSignal, and we may need to
+ * acquire an LWLock while one of those is pending, it is possible that
+ * we get awakened for a reason other than being signaled by
+ * LWLockRelease. If so, loop back and wait again. Once we've gotten
+ * the LWLock, re-increment the sema by the number of additional
+ * signals received, so that the lock manager or signal manager will
+ * see the received signal when it next waits.
+ */
+ LOG_LWDEBUG("LWLockWait", T_NAME(l), T_ID(l), "waiting");
+
+#ifdef LWLOCK_STATS
+ lwstats->block_count++;
+#endif
+
+ TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
+
+ for (;;)
+ {
+ /* "false" means cannot accept cancel/die interrupt here. */
+ PGSemaphoreLock(&proc->sem, false);
+ if (!proc->lwWaiting)
+ break;
+ extraWaits++;
+ }
+
+ TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
+
+ LOG_LWDEBUG("LWLockWait", T_NAME(l), T_ID(l), "awakened");
+
+ /* Now loop back and check the status of the lock again. */
+ }
+
+ /* We are done updating shared state of the lock itself. */
+ SpinLockRelease(&lock->mutex);
+
+ TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), mode);
+
+ /*
+ * Fix the process wait semaphore's count for any absorbed wakeups.
+ */
+ while (extraWaits-- > 0)
+ PGSemaphoreUnlock(&proc->sem);
+
+ /*
+ * Now okay to allow cancel/die interrupts.
+ */
+ RESUME_INTERRUPTS();
+
+ return result;
+}
+
+
+/*
+ * LWLockWakeup - Update a lock's value, and wake up waiters
+ *
+ * Wake up all processes waiting for us with LWLockWait(). Sets the lock's
+ * value to 'val', without releasing it.
+ *
+ * The caller must be holding the lock in exclusive mode.
+ */
+void
+LWLockWakeup(LWLock *l, uint64 val)
+{
+ volatile LWLock *lock = l;
+ PGPROC *head;
+ PGPROC *proc;
+ PGPROC *next;
+
+ /* Acquire mutex. Time spent holding mutex should be short! */
+ SpinLockAcquire(&lock->mutex);
+
+ /* we should hold the lock */
+ Assert(lock->exclusive == 1);
+
+ /* Update the lock's value */
+ lock->value = val;
+
+ /*
+ * See if there are any LW_WAIT_UNTIL_FREE waiters that need to be woken
+ * up. They are always in the front of the queue.
+ */
+ head = lock->head;
+
+ if (head != NULL && head->lwWaitMode == LW_WAIT_UNTIL_FREE)
+ {
+ proc = head;
+ next = proc->lwWaitLink;
+ while (next && next->lwWaitMode == LW_WAIT_UNTIL_FREE)
+ {
+ proc = next;
+ next = next->lwWaitLink;
+ }
+
+ /* proc is now the last PGPROC to be released */
+ lock->head = next;
+ proc->lwWaitLink = NULL;
+ }
+ else
+ head = NULL;
+
+ /* We are done updating shared state of the lock itself. */
+ SpinLockRelease(&lock->mutex);
+
+ /*
+ * Awaken any waiters I removed from the queue.
+ */
+ while (head != NULL)
+ {
+ proc = head;
+ head = proc->lwWaitLink;
+ proc->lwWaitLink = NULL;
+ proc->lwWaiting = false;
+ PGSemaphoreUnlock(&proc->sem);
+ }
+}
+
+
+/*
* LWLockRelease - release a previously acquired lock
*/
void
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2812a73..3f9b366 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2103,12 +2103,12 @@ static struct config_int ConfigureNamesInt[] =
},
{
- {"xloginsert_slots", PGC_POSTMASTER, WAL_SETTINGS,
- gettext_noop("Sets the number of slots for concurrent xlog insertions."),
+ {"xloginsert_locks", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Sets the number of locks used for concurrent xlog insertions."),
NULL,
GUC_NOT_IN_SAMPLE
},
- &num_xloginsert_slots,
+ &num_xloginsert_locks,
8, 1, 1000,
NULL, NULL, NULL
},
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 11ab277..f5152f4 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -192,7 +192,7 @@ extern bool EnableHotStandby;
extern bool fullPageWrites;
extern bool wal_log_hints;
extern bool log_checkpoints;
-extern int num_xloginsert_slots;
+extern int num_xloginsert_locks;
/* WAL levels */
typedef enum WalLevel
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index c8ff4eb..ad2294c 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -53,6 +53,7 @@ typedef struct LWLock
struct PGPROC *head; /* head of list of waiting PGPROCs */
struct PGPROC *tail; /* tail of list of waiting PGPROCs */
/* tail is undefined when head is NULL */
+ uint64 value; /* value set by LWLockWakeup */
} LWLock;
/*
@@ -169,9 +170,11 @@ typedef enum LWLockMode
extern bool Trace_lwlocks;
#endif
-extern void LWLockAcquire(LWLock *lock, LWLockMode mode);
+extern bool LWLockAcquire(LWLock *lock, LWLockMode mode);
extern bool LWLockConditionalAcquire(LWLock *lock, LWLockMode mode);
extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode);
+extern bool LWLockWait(LWLock *lock, uint64 oldval, uint64 *newval);
+extern void LWLockWakeup(LWLock *lock, uint64 value);
extern void LWLockRelease(LWLock *lock);
extern void LWLockReleaseAll(void);
extern bool LWLockHeldByMe(LWLock *lock);
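To make the intended usage of the new API concrete, here is a minimal sketch of how an inserter and a waiter would use LWLockWait/LWLockWakeup as declared in the patch above. This is illustrative only, not part of the patch; the lock variable and the byte position passed to LWLockWakeup are made up.

#include "storage/lwlock.h"

/* Inserter side: hold the lock, advertise progress without releasing it. */
static void
inserter(LWLock *lock)
{
	LWLockAcquire(lock, LW_EXCLUSIVE);	/* also resets the lock's value to 0 */

	/* ... copy the first part of the record ... */

	/* Crossing a page boundary: tell waiters how far we have gotten. */
	LWLockWakeup(lock, (uint64) 8192);

	/* ... copy the rest of the record ... */

	LWLockRelease(lock);				/* wakes any remaining waiters */
}

/* Waiter side: sleep until the lock is free, or its value moves past 'upto'. */
static void
waiter(LWLock *lock, uint64 upto)
{
	uint64		val = 0;	/* 0 is the value right after acquisition */

	while (val < upto)
	{
		if (LWLockWait(lock, val, &val))
			break;			/* lock was free: no insertion in progress */
		/* otherwise 'val' now holds the newly advertised position; re-check */
	}
}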
On 2014-02-17 22:30:54 +0200, Heikki Linnakangas wrote:
This is what I came up with. I like it, I didn't have to contort lwlocks as
much as I feared. I added one field to LWLock structure, which is used to
store the position of how far a WAL inserter has progressed. The LWLock code
calls it just "value", without caring what's stored in it, and it's used by
new functions LWLockWait and LWLockWakeup to implement the behavior the WAL
insertion slots have, to wake up other processes waiting for the slot
without releasing it.
This passes regression tests, but I'll have to re-run the performance tests
with this. One worry is that if the padded size of the LWLock struct is
smaller than cache line, neighboring WAL insertion locks will compete for
the cache line. Another worry is that since I added a field to LWLock
struct, it might now take 64 bytes on platforms where it used to be 32 bytes
before. That wastes some memory.
Why don't you allocate them in a separate tranche, from xlog.c? Then you
can store them inside whatever bigger object you want, guaranteeing
exactly the alignment you need. Possibly you can even have the extra
value in the enclosing object?
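Roughly, that suggestion amounts to the sketch below: embed the lock plus the xlog-specific value in a padded struct owned by xlog.c, and register the array as its own tranche. This is only an illustration; the init function name is made up, and the revised patch later in the thread does essentially this in XLOGShmemInit().

/* xlog.c: a WAL insertion lock plus its progress value, padded to a cache line */
typedef struct
{
	LWLock		lock;
	XLogRecPtr	insertingAt;
} WALInsertLock;

typedef union WALInsertLockPadded
{
	WALInsertLock l;
	char		pad[CACHE_LINE_SIZE];
} WALInsertLockPadded;

static WALInsertLockPadded *WALInsertLocks;	/* allocated in shared memory */
static LWLockTranche WALInsertLockTranche;

/* hypothetical helper, called once during shared-memory initialization */
static void
init_wal_insert_locks(int nlocks)
{
	int			tranche_id = LWLockNewTrancheId();
	int			i;

	WALInsertLockTranche.name = "WALInsertLocks";
	WALInsertLockTranche.array_base = WALInsertLocks;
	WALInsertLockTranche.array_stride = sizeof(WALInsertLockPadded);
	LWLockRegisterTranche(tranche_id, &WALInsertLockTranche);

	for (i = 0; i < nlocks; i++)
	{
		LWLockInitialize(&WALInsertLocks[i].l.lock, tranche_id);
		WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
	}
}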
I'd very much like to keep the core lwlock's size from increasing much; I
plan to work on inlining them in the BufferDescriptors, and keeping it
smaller does increase the cache hit ratio.
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On Wed, Feb 12, 2014 at 3:55 AM, MauMau <maumau307@gmail.com> wrote:
FYI, the following stack traces are the ones obtained during two instances
of hang.
You mentioned a hang during a B-Tree insert operation - do you happen
to have a backtrace that relates to that?
--
Peter Geoghegan
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
From: "Peter Geoghegan" <pg@heroku.com>
You mentioned a hang during a B-Tree insert operation - do you happen
to have a backtrace that relates to that?
Sorry, I may have misunderstood. The three stack traces I attached are not
related to btree. I recall that I saw one stack trace containing
bt_insert(), but I'm not sure.
When the hang occurred, INSERT/UPDATE/COMMIT statements stopped for at least
one hour, while SELECT statements ran smoothly.
Regards
MauMau
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 02/17/2014 10:36 PM, Andres Freund wrote:
On 2014-02-17 22:30:54 +0200, Heikki Linnakangas wrote:
This is what I came up with. I like it, I didn't have to contort lwlocks as
much as I feared. I added one field to LWLock structure, which is used to
store the position of how far a WAL inserter has progressed. The LWLock code
calls it just "value", without caring what's stored in it, and it's used by
new functions LWLockWait and LWLockWakeup to implement the behavior the WAL
insertion slots have, to wake up other processes waiting for the slot
without releasing it.
This passes regression tests, but I'll have to re-run the performance tests
with this. One worry is that if the padded size of the LWLock struct is
smaller than cache line, neighboring WAL insertion locks will compete for
the cache line. Another worry is that since I added a field to LWLock
struct, it might now take 64 bytes on platforms where it used to be 32 bytes
before. That wastes some memory.
Why don't you allocate them in a separate tranche, from xlog.c? Then you
can store them inside whatever bigger object you want, guaranteeing
exactly the alignment you need. Possibly you can even have the extra
value in the enclosing object?
Good idea. New patch attached, doing that.
I'll try to find time on some multi-CPU hardware to performance test
this against current master, to make sure there's no regression.
- Heikki
Attachments:
xlogslot-to-lwlock-2.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 508970a..3eef968 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -86,7 +86,7 @@ int sync_method = DEFAULT_SYNC_METHOD;
int wal_level = WAL_LEVEL_MINIMAL;
int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
-int num_xloginsert_slots = 8;
+int num_xloginsert_locks = 8;
#ifdef WAL_DEBUG
bool XLOG_DEBUG = false;
@@ -289,7 +289,7 @@ XLogRecPtr XactLastRecEnd = InvalidXLogRecPtr;
* (which is almost but not quite the same as a pointer to the most recent
* CHECKPOINT record). We update this from the shared-memory copy,
* XLogCtl->Insert.RedoRecPtr, whenever we can safely do so (ie, when we
- * hold an insertion slot). See XLogInsert for details. We are also allowed
+ * hold an insertion lock). See XLogInsert for details. We are also allowed
* to update from XLogCtl->RedoRecPtr if we hold the info_lck;
* see GetRedoRecPtr. A freshly spawned backend obtains the value during
* InitXLOGAccess.
@@ -361,63 +361,45 @@ typedef struct XLogwrtResult
XLogRecPtr Flush; /* last byte + 1 flushed */
} XLogwrtResult;
-
/*
- * A slot for inserting to the WAL. This is similar to an LWLock, the main
- * difference is that there is an extra xlogInsertingAt field that is protected
- * by the same mutex. Unlike an LWLock, a slot can only be acquired in
- * exclusive mode.
- *
- * The xlogInsertingAt field is used to advertise to other processes how far
- * the slot owner has progressed in inserting the record. When a backend
- * acquires a slot, it initializes xlogInsertingAt to 1, because it doesn't
- * yet know where it's going to insert the record. That's conservative
- * but correct; the new insertion is certainly going to go to a byte position
- * greater than 1. If another backend needs to flush the WAL, it will have to
- * wait for the new insertion. xlogInsertingAt is updated after finishing the
- * insert or when crossing a page boundary, which will wake up anyone waiting
- * for it, whether the wait was necessary in the first place or not.
- *
- * A process can wait on a slot in two modes: LW_EXCLUSIVE or
- * LW_WAIT_UNTIL_FREE. LW_EXCLUSIVE works like in an lwlock; when the slot is
- * released, the first LW_EXCLUSIVE waiter in the queue is woken up. Processes
- * waiting in LW_WAIT_UNTIL_FREE mode are woken up whenever the slot is
- * released, or xlogInsertingAt is updated. In other words, a process in
- * LW_WAIT_UNTIL_FREE mode is woken up whenever the inserter makes any progress
- * copying the record in place. LW_WAIT_UNTIL_FREE waiters are always added to
- * the front of the queue, while LW_EXCLUSIVE waiters are appended to the end.
- *
- * To join the wait queue, a process must set MyProc->lwWaitMode to the mode
- * it wants to wait in, MyProc->lwWaiting to true, and link MyProc to the head
- * or tail of the wait queue. The same mechanism is used to wait on an LWLock,
- * see lwlock.c for details.
+ * Inserting to WAL is protected by a bunch of WALInsertLocks. Each WAL
+ * insertion lock consists of a lightweight lock, plus an indicator of how
+ * far the insertion has progressed (insertingAt).
+ *
+ * The insertingAt value is used when writing the WAL to disk, to avoid
+ * waiting unnecessarily for an insertion that's still in-progress, but has
+ * already finished inserting all WAL beyond the point you're going to write
+ * the WAL up to. This isn't just an optimization; it's necessary to avoid
+ * deadlocks when an inserter has to switch to a new WAL buffer. An inserter
+ * that's holding a WAL insert lock might need to flush the WAL, to evict an
+ * old WAL buffer, to make room for the new record. If it's possible for an
+ * inserter to wait for another inserter unnecessarily, that can lead to
+ * a deadlock if two inserters holding a WAL insert lock wait for each other
+ * to finish their insertion.
+ *
+ * Small WAL records that don't cross a page boundary never update the value,
+ * the WAL record is just copied to the page and the lock is released. But
+ * when crossing a page boundary, it's updated to let others know that the
+ * backend has finished modifying the previous page.
*/
typedef struct
{
- slock_t mutex; /* protects the below fields */
- XLogRecPtr xlogInsertingAt; /* insert has completed up to this point */
-
- PGPROC *owner; /* for debugging purposes */
-
- bool releaseOK; /* T if ok to release waiters */
- char exclusive; /* # of exclusive holders (0 or 1) */
- PGPROC *head; /* head of list of waiting PGPROCs */
- PGPROC *tail; /* tail of list of waiting PGPROCs */
- /* tail is undefined when head is NULL */
-} XLogInsertSlot;
+ LWLock lock;
+ XLogRecPtr insertingAt;
+} WALInsertLock;
/*
- * All the slots are allocated as an array in shared memory. We force the
- * array stride to be a power of 2, which saves a few cycles in indexing, but
- * more importantly also ensures that individual slots don't cross cache line
- * boundaries. (Of course, we have to also ensure that the array start
- * address is suitably aligned.)
+ * All the WAL insertion locks are allocated as an array in shared memory. We
+ * force the array stride to be a power of 2, which saves a few cycles in
+ * indexing, but more importantly also ensures that individual slots don't
+ * cross cache line boundaries. (Of course, we have to also ensure that the
+ * array start address is suitably aligned.)
*/
-typedef union XLogInsertSlotPadded
+typedef union WALInsertLockPadded
{
- XLogInsertSlot slot;
+ WALInsertLock l;
char pad[CACHE_LINE_SIZE];
-} XLogInsertSlotPadded;
+} WALInsertLockPadded;
/*
* Shared state data for XLogInsert.
@@ -452,8 +434,8 @@ typedef struct XLogCtlInsert
* we must WAL-log it before it actually affects WAL-logging by backends.
* Checkpointer sets at startup or after SIGHUP.
*
- * To read these fields, you must hold an insertion slot. To modify them,
- * you must hold ALL the slots.
+ * To read these fields, you must hold an insertion lock. To modify them,
+ * you must hold ALL the locks.
*/
XLogRecPtr RedoRecPtr; /* current redo point for insertions */
bool forcePageWrites; /* forcing full-page writes for PITR? */
@@ -470,8 +452,16 @@ typedef struct XLogCtlInsert
int nonExclusiveBackups;
XLogRecPtr lastBackupStart;
- /* insertion slots, see XLogInsertSlot struct above for details */
- XLogInsertSlotPadded *insertSlots;
+ /*
+ * To insert a new WAL record, you must hold a WAL insertion lock. Before
+ * 9.4, there was a single WALInsertLock, but that became a bottleneck
+ * on multi-core systems that insert a lot of WAL. Now, there are several
+ * WAL insertion locks, and to insert WAL, you must hold one of them (in
+ * exclusive mode). It doesn't matter which one.
+ */
+ WALInsertLockPadded *WALInsertLocks;
+ LWLockTranche WALInsertLockTranche;
+ int WALInsertLockTrancheId;
} XLogCtlInsert;
/*
@@ -609,6 +599,9 @@ typedef struct XLogCtlData
static XLogCtlData *XLogCtl = NULL;
+/* a private copy of XLogCtl->Insert.WALInsertLocks, for convenience */
+static WALInsertLockPadded *WALInsertLocks = NULL;
+
/*
* We maintain an image of pg_control in shared memory.
*/
@@ -732,9 +725,9 @@ static bool InRedo = false;
/* Have we launched bgwriter during recovery? */
static bool bgwriterLaunched = false;
-/* For WALInsertSlotAcquire/Release functions */
-static int MySlotNo = 0;
-static bool holdingAllSlots = false;
+/* For WALInsertLockAcquire/Release functions */
+static int MyLockNo = 0;
+static bool holdingAllLocks = false;
static void readRecoveryCommandFile(void);
static void exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo);
@@ -808,16 +801,14 @@ static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
XLogRecPtr *PrevPtr);
static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
-static void WakeupWaiters(XLogRecPtr EndPos);
static char *GetXLogBuffer(XLogRecPtr ptr);
static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
-static void WALInsertSlotAcquire(bool exclusive);
-static void WALInsertSlotAcquireOne(int slotno);
-static void WALInsertSlotRelease(void);
-static void WALInsertSlotReleaseOne(int slotno);
+static void WALInsertLockAcquire(bool exclusive);
+static void WALInsertLockRelease(void);
+static void WALInsertLockWakeup(XLogRecPtr insertingAt);
/*
* Insert an XLOG record having the specified RMID and info bytes,
@@ -894,7 +885,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
*
* We may have to loop back to here if a race condition is detected below.
* We could prevent the race by doing all this work while holding an
- * insertion slot, but it seems better to avoid doing CRC calculations
+ * insertion lock, but it seems better to avoid doing CRC calculations
* while holding one.
*
* We add entries for backup blocks to the chain, so that they don't need
@@ -912,8 +903,8 @@ begin:;
/*
* Decide if we need to do full-page writes in this XLOG record: true if
* full_page_writes is on or we have a PITR request for it. Since we
- * don't yet have an insertion slot, fullPageWrites and forcePageWrites
- * could change under us, but we'll recheck them once we have a slot.
+ * don't yet have an insertion lock, fullPageWrites and forcePageWrites
+ * could change under us, but we'll recheck them once we have a lock.
*/
doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
@@ -1087,16 +1078,15 @@ begin:;
* record in place. This can be done concurrently in multiple processes.
*
* To keep track of which insertions are still in-progress, each concurrent
- * inserter allocates an "insertion slot", which tells others how far the
+ * inserter acquires an insertion lock. In addition to just indicating that
+ * an insertion is in progress, the lock tells others how far the
* inserter has progressed. There is a small fixed number of insertion
- * slots, determined by the num_xloginsert_slots GUC. When an inserter
- * finishes, it updates the xlogInsertingAt of its slot to the end of the
- * record it inserted, to let others know that it's done. xlogInsertingAt
- * is also updated when crossing over to a new WAL buffer, to allow the
- * the previous buffer to be flushed.
+ * locks, determined by the num_xloginsert_locks GUC. When an inserter
+ * crosses a page boundary, it updates the value stored in the lock to
+ * how far it has inserted, to allow the previous buffer to be flushed.
*
- * Holding onto a slot also protects RedoRecPtr and fullPageWrites from
- * changing until the insertion is finished.
+ * Holding onto an insertion lock also protects RedoRecPtr and
+ * fullPageWrites from changing until the insertion is finished.
*
* Step 2 can usually be done completely in parallel. If the required WAL
* page is not initialized yet, you have to grab WALBufMappingLock to
@@ -1106,7 +1096,7 @@ begin:;
*----------
*/
START_CRIT_SECTION();
- WALInsertSlotAcquire(isLogSwitch);
+ WALInsertLockAcquire(isLogSwitch);
/*
* Check to see if my RedoRecPtr is out of date. If so, may have to go
@@ -1135,7 +1125,7 @@ begin:;
* Oops, this buffer now needs to be backed up, but we
* didn't think so above. Start over.
*/
- WALInsertSlotRelease();
+ WALInsertLockRelease();
END_CRIT_SECTION();
rdt_lastnormal->next = NULL;
info = info_orig;
@@ -1154,7 +1144,7 @@ begin:;
if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
{
/* Oops, must redo it with full-page data. */
- WALInsertSlotRelease();
+ WALInsertLockRelease();
END_CRIT_SECTION();
rdt_lastnormal->next = NULL;
info = info_orig;
@@ -1202,7 +1192,7 @@ begin:;
/*
* Done! Let others know that we're finished.
*/
- WALInsertSlotRelease();
+ WALInsertLockRelease();
MarkCurrentTransactionIdLoggedIfAny();
@@ -1363,7 +1353,7 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
/*
* These calculations are a bit heavy-weight to be done while holding a
- * spinlock, but since we're holding all the WAL insertion slots, there
+ * spinlock, but since we're holding all the WAL insertion locks, there
* are no other inserters competing for it. GetXLogInsertRecPtr() does
* compete for it, but that's not called very frequently.
*/
@@ -1523,7 +1513,7 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
while (CurrPos < EndPos)
{
/* initialize the next page (if not initialized already) */
- WakeupWaiters(CurrPos);
+ WALInsertLockWakeup(CurrPos);
AdvanceXLInsertBuffer(CurrPos, false);
CurrPos += XLOG_BLCKSZ;
}
@@ -1534,452 +1524,116 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
}
/*
- * Allocate a slot for insertion.
+ * Acquire a WAL insertion lock.
*
- * In exclusive mode, all slots are reserved for the current process. That
- * blocks all concurrent insertions.
+ * In exclusive mode, all locks are acquired. That blocks all concurrent
+ * insertions.
*/
static void
-WALInsertSlotAcquire(bool exclusive)
+WALInsertLockAcquire(bool exclusive)
{
int i;
if (exclusive)
{
- for (i = 0; i < num_xloginsert_slots; i++)
- WALInsertSlotAcquireOne(i);
- holdingAllSlots = true;
+ for (i = 0; i < num_xloginsert_locks; i++)
+ LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
+ holdingAllLocks = true;
}
else
- WALInsertSlotAcquireOne(-1);
-}
-
-/*
- * Workhorse of WALInsertSlotAcquire. Acquires the given slot, or an arbitrary
- * one if slotno == -1. The index of the slot that was acquired is stored in
- * MySlotNo.
- *
- * This is more or less equivalent to LWLockAcquire().
- */
-static void
-WALInsertSlotAcquireOne(int slotno)
-{
- volatile XLogInsertSlot *slot;
- PGPROC *proc = MyProc;
- bool retry = false;
- int extraWaits = 0;
- static int slotToTry = -1;
-
- /*
- * Try to use the slot we used last time. If the system isn't particularly
- * busy, it's a good bet that it's available, and it's good to have some
- * affinity to a particular slot so that you don't unnecessarily bounce
- * cache lines between processes when there is no contention.
- *
- * If this is the first time through in this backend, pick a slot
- * (semi-)randomly. This allows the slots to be used evenly if you have a
- * lot of very short connections.
- */
- if (slotno != -1)
- MySlotNo = slotno;
- else
{
- if (slotToTry == -1)
- slotToTry = MyProc->pgprocno % num_xloginsert_slots;
- MySlotNo = slotToTry;
- }
-
- /*
- * We can't wait if we haven't got a PGPROC. This should only occur
- * during bootstrap or shared memory initialization. Put an Assert here
- * to catch unsafe coding practices.
- */
- Assert(MyProc != NULL);
-
- /*
- * Lock out cancel/die interrupts until we exit the code section protected
- * by the slot. This ensures that interrupts will not interfere with
- * manipulations of data structures in shared memory. There is no cleanup
- * mechanism to release the slot if the backend dies while holding one,
- * so make this a critical section.
- */
- START_CRIT_SECTION();
-
- /*
- * Loop here to try to acquire slot after each time we are signaled by
- * WALInsertSlotRelease.
- */
- for (;;)
- {
- bool mustwait;
-
- slot = &XLogCtl->Insert.insertSlots[MySlotNo].slot;
-
- /* Acquire mutex. Time spent holding mutex should be short! */
- SpinLockAcquire(&slot->mutex);
-
- /* If retrying, allow WALInsertSlotRelease to release waiters again */
- if (retry)
- slot->releaseOK = true;
-
- /* If I can get the slot, do so quickly. */
- if (slot->exclusive == 0)
- {
- slot->exclusive++;
- mustwait = false;
- }
- else
- mustwait = true;
-
- if (!mustwait)
- break; /* got the lock */
-
- Assert(slot->owner != MyProc);
-
- /*
- * Add myself to wait queue.
- */
- proc->lwWaiting = true;
- proc->lwWaitMode = LW_EXCLUSIVE;
- proc->lwWaitLink = NULL;
- if (slot->head == NULL)
- slot->head = proc;
- else
- slot->tail->lwWaitLink = proc;
- slot->tail = proc;
-
- /* Can release the mutex now */
- SpinLockRelease(&slot->mutex);
+ bool immed;
/*
- * Wait until awakened.
+ * Acquire one of the locks. It doesn't matter which one, but try to
+ * use the lock we used last time. If the system isn't particularly
+ * busy, it's a good bet that it's available, and it's good to have
+ * some affinity to a particular lock so that you don't unnecessarily
+ * bounce cache lines between processes when there is no contention.
*
- * Since we share the process wait semaphore with the regular lock
- * manager and ProcWaitForSignal, and we may need to acquire a slot
- * while one of those is pending, it is possible that we get awakened
- * for a reason other than being signaled by WALInsertSlotRelease. If
- * so, loop back and wait again. Once we've gotten the slot,
- * re-increment the sema by the number of additional signals received,
- * so that the lock manager or signal manager will see the received
- * signal when it next waits.
+ * If this is the first time through in this backend, pick a lock
+ * (semi-)randomly. This allows the locks to be used evenly if you have
+ * a lot of very short connections.
*/
- for (;;)
- {
- /* "false" means cannot accept cancel/die interrupt here. */
- PGSemaphoreLock(&proc->sem, false);
- if (!proc->lwWaiting)
- break;
- extraWaits++;
- }
+ static int lockToTry = -1;
+ if (lockToTry == -1)
+ lockToTry = MyProc->pgprocno % num_xloginsert_locks;
+ MyLockNo = lockToTry;
- /* Now loop back and try to acquire lock again. */
- retry = true;
- }
+ immed = LWLockAcquire(&WALInsertLocks[MyLockNo].l.lock, LW_EXCLUSIVE);
- slot->owner = proc;
-
- /*
- * Normally, we initialize the xlogInsertingAt value of the slot to 1,
- * because we don't yet know where in the WAL we're going to insert. It's
- * not critical what it points to right now - leaving it to a too small
- * value just means that WaitXlogInsertionsToFinish() might wait on us
- * unnecessarily, until we update the value (when we finish the insert or
- * move to next page).
- *
- * If we're grabbing all the slots, however, stamp all but the last one
- * with InvalidXLogRecPtr, meaning there is no insert in progress. The last
- * slot is the one that we will update as we proceed with the insert, the
- * rest are held just to keep off other inserters.
- */
- if (slotno != -1 && slotno != num_xloginsert_slots - 1)
- slot->xlogInsertingAt = InvalidXLogRecPtr;
- else
- slot->xlogInsertingAt = 1;
-
- /* We are done updating shared state of the slot itself. */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Fix the process wait semaphore's count for any absorbed wakeups.
- */
- while (extraWaits-- > 0)
- PGSemaphoreUnlock(&proc->sem);
-
- /*
- * If we couldn't get the slot immediately, try another slot next time.
- * On a system with more insertion slots than concurrent inserters, this
- * causes all the inserters to eventually migrate to a slot that no-one
- * else is using. On a system with more inserters than slots, it still
- * causes the inserters to be distributed quite evenly across the slots.
- */
- if (slotno != -1 && retry)
- slotToTry = (slotToTry + 1) % num_xloginsert_slots;
-}
-
-/*
- * Wait for the given slot to become free, or for its xlogInsertingAt location
- * to change to something else than 'waitptr'. In other words, wait for the
- * inserter using the given slot to finish its insertion, or to at least make
- * some progress.
- */
-static void
-WaitOnSlot(volatile XLogInsertSlot *slot, XLogRecPtr waitptr)
-{
- PGPROC *proc = MyProc;
- int extraWaits = 0;
-
- /*
- * Lock out cancel/die interrupts while we sleep on the slot. There is
- * no cleanup mechanism to remove us from the wait queue if we got
- * interrupted.
- */
- HOLD_INTERRUPTS();
-
- /*
- * Loop here to try to acquire lock after each time we are signaled.
- */
- for (;;)
- {
- bool mustwait;
-
- /* Acquire mutex. Time spent holding mutex should be short! */
- SpinLockAcquire(&slot->mutex);
-
- /* If I can get the lock, do so quickly. */
- if (slot->exclusive == 0 || slot->xlogInsertingAt != waitptr)
- mustwait = false;
- else
- mustwait = true;
-
- if (!mustwait)
- break; /* the lock was free */
-
- Assert(slot->owner != MyProc);
-
- /*
- * Add myself to wait queue.
- */
- proc->lwWaiting = true;
- proc->lwWaitMode = LW_WAIT_UNTIL_FREE;
- proc->lwWaitLink = NULL;
-
- /* waiters are added to the front of the queue */
- proc->lwWaitLink = slot->head;
- if (slot->head == NULL)
- slot->tail = proc;
- slot->head = proc;
-
- /* Can release the mutex now */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Wait until awakened.
- *
- * Since we share the process wait semaphore with other things, like
- * the regular lock manager and ProcWaitForSignal, and we may need to
- * acquire an LWLock while one of those is pending, it is possible that
- * we get awakened for a reason other than being signaled by
- * LWLockRelease. If so, loop back and wait again. Once we've gotten
- * the LWLock, re-increment the sema by the number of additional
- * signals received, so that the lock manager or signal manager will
- * see the received signal when it next waits.
- */
- for (;;)
+ if (!immed)
{
- /* "false" means cannot accept cancel/die interrupt here. */
- PGSemaphoreLock(&proc->sem, false);
- if (!proc->lwWaiting)
- break;
- extraWaits++;
+ /*
+ * If we couldn't get the lock immediately, try another lock next
+ * time. On a system with more insertion locks than concurrent
+ * inserters, this causes all the inserters to eventually migrate
+ * to a lock that no-one else is using. On a system with more
+ * inserters than locks, it still helps to distribute the inserters
+ * quite evenly across the locks.
+ */
+ lockToTry = (lockToTry + 1) % num_xloginsert_locks;
}
-
- /* Now loop back and try to acquire lock again. */
}
-
- /* We are done updating shared state of the lock itself. */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Fix the process wait semaphore's count for any absorbed wakeups.
- */
- while (extraWaits-- > 0)
- PGSemaphoreUnlock(&proc->sem);
-
- /*
- * Now okay to allow cancel/die interrupts.
- */
- RESUME_INTERRUPTS();
}
/*
- * Wake up all processes waiting for us with WaitOnSlot(). Sets our
- * xlogInsertingAt value to EndPos, without releasing the slot.
+ * Release our insertion lock (or locks, if we're holding them all).
*/
static void
-WakeupWaiters(XLogRecPtr EndPos)
+WALInsertLockRelease(void)
{
- volatile XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[MySlotNo].slot;
- PGPROC *head;
- PGPROC *proc;
- PGPROC *next;
-
- /*
- * If we have already reported progress up to the same point, do nothing.
- * No other process can modify xlogInsertingAt, so we can check this before
- * grabbing the spinlock.
- */
- if (slot->xlogInsertingAt == EndPos)
- return;
- /* xlogInsertingAt should not go backwards */
- Assert(slot->xlogInsertingAt < EndPos);
-
- /* Acquire mutex. Time spent holding mutex should be short! */
- SpinLockAcquire(&slot->mutex);
-
- /* we should own the slot */
- Assert(slot->exclusive == 1 && slot->owner == MyProc);
-
- slot->xlogInsertingAt = EndPos;
-
- /*
- * See if there are any LW_WAIT_UNTIL_FREE waiters that need to be woken
- * up. They are always in the front of the queue.
- */
- head = slot->head;
+ int i;
- if (head != NULL && head->lwWaitMode == LW_WAIT_UNTIL_FREE)
+ if (holdingAllLocks)
{
- proc = head;
- next = proc->lwWaitLink;
- while (next && next->lwWaitMode == LW_WAIT_UNTIL_FREE)
+ for (i = 0; i < num_xloginsert_locks; i++)
{
- proc = next;
- next = next->lwWaitLink;
+ /* see below */
+ WALInsertLocks[i].l.insertingAt = 0;
+ LWLockRelease(&WALInsertLocks[i].l.lock);
}
-
- /* proc is now the last PGPROC to be released */
- slot->head = next;
- proc->lwWaitLink = NULL;
+ holdingAllLocks = false;
}
else
- head = NULL;
-
- /* We are done updating shared state of the lock itself. */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Awaken any waiters I removed from the queue.
- */
- while (head != NULL)
{
- proc = head;
- head = proc->lwWaitLink;
- proc->lwWaitLink = NULL;
- proc->lwWaiting = false;
- PGSemaphoreUnlock(&proc->sem);
+ /*
+ * Reset insertingAt value to ease debugging. It would be OK to let
+ * it point to any old value as long as it's smaller than the current
+ * end of reserved WAL, but seems tidier to reset it to zero.
+ *
+ * No need for holding a spinlock while we reset it, because we don't
+ * care if someone transiently sees a bogus value. We're just about to
+ * release the lock anyway, waking up anyone who might wait for us.
+ */
+ WALInsertLocks[MyLockNo].l.insertingAt = 0;
+ LWLockRelease(&WALInsertLocks[MyLockNo].l.lock);
}
}
/*
- * Release our insertion slot (or slots, if we're holding them all).
+ * Update our insertingAt value, to let others know that we've finished
+ * inserting up to that point.
*/
static void
-WALInsertSlotRelease(void)
+WALInsertLockWakeup(XLogRecPtr insertingAt)
{
int i;
- if (holdingAllSlots)
+ if (holdingAllLocks)
{
- for (i = 0; i < num_xloginsert_slots; i++)
- WALInsertSlotReleaseOne(i);
- holdingAllSlots = false;
+ for (i = 0; i < num_xloginsert_locks; i++)
+ LWLockWakeup(&WALInsertLocks[i].l.lock,
+ &WALInsertLocks[i].l.insertingAt,
+ insertingAt);
+ holdingAllLocks = false;
}
else
- WALInsertSlotReleaseOne(MySlotNo);
-}
-
-static void
-WALInsertSlotReleaseOne(int slotno)
-{
- volatile XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[slotno].slot;
- PGPROC *head;
- PGPROC *proc;
-
- /* Acquire mutex. Time spent holding mutex should be short! */
- SpinLockAcquire(&slot->mutex);
-
- /* we must be holding it */
- Assert(slot->exclusive == 1 && slot->owner == MyProc);
-
- slot->xlogInsertingAt = InvalidXLogRecPtr;
-
- /* Release my hold on the slot */
- slot->exclusive = 0;
- slot->owner = NULL;
-
- /*
- * See if I need to awaken any waiters..
- */
- head = slot->head;
- if (head != NULL)
- {
- if (slot->releaseOK)
- {
- /*
- * Remove the to-be-awakened PGPROCs from the queue.
- */
- bool releaseOK = true;
-
- proc = head;
-
- /*
- * First wake up any backends that want to be woken up without
- * acquiring the lock. These are always in the front of the queue.
- */
- while (proc->lwWaitMode == LW_WAIT_UNTIL_FREE && proc->lwWaitLink)
- proc = proc->lwWaitLink;
-
- /*
- * Awaken the first exclusive-waiter, if any.
- */
- if (proc->lwWaitLink)
- {
- Assert(proc->lwWaitLink->lwWaitMode == LW_EXCLUSIVE);
- proc = proc->lwWaitLink;
- releaseOK = false;
- }
- /* proc is now the last PGPROC to be released */
- slot->head = proc->lwWaitLink;
- proc->lwWaitLink = NULL;
-
- slot->releaseOK = releaseOK;
- }
- else
- head = NULL;
- }
-
- /* We are done updating shared state of the slot itself. */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Awaken any waiters I removed from the queue.
- */
- while (head != NULL)
- {
- proc = head;
- head = proc->lwWaitLink;
- proc->lwWaitLink = NULL;
- proc->lwWaiting = false;
- PGSemaphoreUnlock(&proc->sem);
- }
-
- /*
- * Now okay to allow cancel/die interrupts.
- */
- END_CRIT_SECTION();
+ LWLockWakeup(&WALInsertLocks[MyLockNo].l.lock,
+ &WALInsertLocks[MyLockNo].l.insertingAt,
+ insertingAt);
}
-
/*
* Wait for any WAL insertions < upto to finish.
*
@@ -2029,79 +1683,48 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
}
/*
+ * Loop through all the locks, sleeping on any in-progress insert older
+ * than 'upto'.
+ *
* finishedUpto is our return value, indicating the point upto which
* all the WAL insertions have been finished. Initialize it to the head
- * of reserved WAL, and as we iterate through the insertion slots, back it
+ * of reserved WAL, and as we iterate through the insertion locks, back it
* out for any insertion that's still in progress.
*/
finishedUpto = reservedUpto;
-
- /*
- * Loop through all the slots, sleeping on any in-progress insert older
- * than 'upto'.
- */
- for (i = 0; i < num_xloginsert_slots; i++)
+ for (i = 0; i < num_xloginsert_locks; i++)
{
- volatile XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[i].slot;
- XLogRecPtr insertingat;
-
- retry:
- /*
- * We can check if the slot is in use without grabbing the spinlock.
- * The spinlock acquisition of insertpos_lck before this loop acts
- * as a memory barrier. If someone acquires the slot after that, it
- * can't possibly be inserting to anything < reservedUpto. If it was
- * acquired before that, an unlocked test will return true.
- */
- if (!slot->exclusive)
- continue;
-
- SpinLockAcquire(&slot->mutex);
- /* re-check now that we have the lock */
- if (!slot->exclusive)
- {
- SpinLockRelease(&slot->mutex);
- continue;
- }
- insertingat = slot->xlogInsertingAt;
- SpinLockRelease(&slot->mutex);
-
- if (insertingat == InvalidXLogRecPtr)
+ XLogRecPtr insertingat = InvalidXLogRecPtr;
+ do
{
/*
- * slot is reserved just to hold off other inserters, there is no
- * actual insert in progress.
+ * See if this insertion is in progress. LWLockWait will wait for
+ * the lock to be released, or for the 'value' to be set by a
+ * LWLockWakeup call. When a lock is initially acquired, its
+ * value is 0 (InvalidXLogRecPtr), which means that we don't know
+ * where it's inserting yet. We will have to wait for it. If it's
+ * a small insertion, the record will most likely fit on the same
+ * page and the inserter will release the lock without ever
+ * calling LWLockWakeup. But if it has to cross a page, it will
+ * advertise the insertion point with LWLockWakeup.
*/
- continue;
- }
+ if (LWLockWait(&WALInsertLocks[i].l.lock,
+ &WALInsertLocks[i].l.insertingAt,
+ insertingat, &insertingat))
+ {
+ /* the lock was free, so no insertion in progress */
+ insertingat = InvalidXLogRecPtr;
+ break;
+ }
- /*
- * This insertion is still in progress. Do we need to wait for it?
- *
- * When an inserter acquires a slot, it doesn't reset 'insertingat', so
- * it will initially point to the old value of some already-finished
- * insertion. The inserter will update the value as soon as it finishes
- * the insertion, moves to the next page, or has to do I/O to flush an
- * old dirty buffer. That means that when we see a slot with
- * insertingat value < upto, we don't know if that insertion is still
- * truly in progress, or if the slot is reused by a new inserter that
- * hasn't updated the insertingat value yet. We have to assume it's the
- * latter, and wait.
- */
- if (insertingat < upto)
- {
- WaitOnSlot(slot, insertingat);
- goto retry;
- }
- else
- {
/*
- * We don't need to wait for this insertion, but update the
- * return value.
+ * This insertion is still in progress. Have to wait, unless the
+ * inserter has proceeded past 'upto'.
*/
- if (insertingat < finishedUpto)
- finishedUpto = insertingat;
- }
+ } while (insertingat < upto);
+
+ if (insertingat != InvalidXLogRecPtr && insertingat < finishedUpto)
+ finishedUpto = insertingat;
}
return finishedUpto;
}
@@ -2115,8 +1738,8 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
*
* The caller must ensure that the page containing the requested location
* isn't evicted yet, and won't be evicted. The way to ensure that is to
- * hold onto an XLogInsertSlot with the xlogInsertingAt position set to
- * something <= ptr. GetXLogBuffer() will update xlogInsertingAt if it needs
+ * hold onto a WAL insertion lock with the insertingAt position set to
+ * something <= ptr. GetXLogBuffer() will update insertingAt if it needs
* to evict an old page from the buffer. (This means that once you call
* GetXLogBuffer() with a given 'ptr', you must not access anything before
* that point anymore, and must not call GetXLogBuffer() with an older 'ptr'
@@ -2176,7 +1799,7 @@ GetXLogBuffer(XLogRecPtr ptr)
* Let others know that we're finished inserting the record up
* to the page boundary.
*/
- WakeupWaiters(expectedEndPtr - XLOG_BLCKSZ);
+ WALInsertLockWakeup(expectedEndPtr - XLOG_BLCKSZ);
AdvanceXLInsertBuffer(ptr, false);
endptr = XLogCtl->xlblocks[idx];
@@ -5071,8 +4694,8 @@ XLOGShmemSize(void)
/* XLogCtl */
size = sizeof(XLogCtlData);
- /* xlog insertion slots, plus alignment */
- size = add_size(size, mul_size(sizeof(XLogInsertSlotPadded), num_xloginsert_slots + 1));
+ /* WAL insertion locks, plus alignment */
+ size = add_size(size, mul_size(sizeof(WALInsertLockPadded), num_xloginsert_locks + 1));
/* xlblocks array */
size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
/* extra alignment padding for XLOG I/O buffers */
@@ -5120,11 +4743,27 @@ XLOGShmemInit(void)
memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
- /* Xlog insertion slots. Ensure they're aligned to the full padded size */
- allocptr += sizeof(XLogInsertSlotPadded) -
- ((uintptr_t) allocptr) % sizeof(XLogInsertSlotPadded);
- XLogCtl->Insert.insertSlots = (XLogInsertSlotPadded *) allocptr;
- allocptr += sizeof(XLogInsertSlotPadded) * num_xloginsert_slots;
+
+ /* WAL insertion locks. Ensure they're aligned to the full padded size */
+ allocptr += sizeof(WALInsertLockPadded) -
+ ((uintptr_t) allocptr) % sizeof(WALInsertLockPadded);
+ WALInsertLocks = XLogCtl->Insert.WALInsertLocks =
+ (WALInsertLockPadded *) allocptr;
+ allocptr += sizeof(WALInsertLockPadded) * num_xloginsert_locks;
+
+ XLogCtl->Insert.WALInsertLockTrancheId = LWLockNewTrancheId();
+
+ XLogCtl->Insert.WALInsertLockTranche.name = "WALInsertLocks";
+ XLogCtl->Insert.WALInsertLockTranche.array_base = WALInsertLocks;
+ XLogCtl->Insert.WALInsertLockTranche.array_stride = sizeof(WALInsertLockPadded);
+
+ LWLockRegisterTranche(XLogCtl->Insert.WALInsertLockTrancheId, &XLogCtl->Insert.WALInsertLockTranche);
+ for (i = 0; i < num_xloginsert_locks; i++)
+ {
+ LWLockInitialize(&WALInsertLocks[i].l.lock,
+ XLogCtl->Insert.WALInsertLockTrancheId);
+ WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+ }
/*
* Align the start of the page buffers to a full xlog block size boundary.
@@ -5144,19 +4783,6 @@ XLOGShmemInit(void)
XLogCtl->SharedHotStandbyActive = false;
XLogCtl->WalWriterSleeping = false;
- for (i = 0; i < num_xloginsert_slots; i++)
- {
- XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[i].slot;
- SpinLockInit(&slot->mutex);
- slot->xlogInsertingAt = InvalidXLogRecPtr;
- slot->owner = NULL;
-
- slot->releaseOK = true;
- slot->exclusive = 0;
- slot->head = NULL;
- slot->tail = NULL;
- }
-
SpinLockInit(&XLogCtl->Insert.insertpos_lck);
SpinLockInit(&XLogCtl->info_lck);
SpinLockInit(&XLogCtl->ulsn_lck);
@@ -7883,6 +7509,11 @@ InitXLOGAccess(void)
ThisTimeLineID = XLogCtl->ThisTimeLineID;
Assert(ThisTimeLineID != 0 || IsBootstrapProcessingMode());
+ /* Initialize our copy of WALInsertLocks and register the tranche */
+ WALInsertLocks = XLogCtl->Insert.WALInsertLocks;
+ LWLockRegisterTranche(XLogCtl->Insert.WALInsertLockTrancheId,
+ &XLogCtl->Insert.WALInsertLockTranche);
+
/* Use GetRedoRecPtr to copy the RedoRecPtr safely */
(void) GetRedoRecPtr();
}
@@ -7901,7 +7532,7 @@ GetRedoRecPtr(void)
/*
* The possibly not up-to-date copy in XlogCtl is enough. Even if we
- * grabbed a WAL insertion slot to read the master copy, someone might
+ * grabbed a WAL insertion lock to read the master copy, someone might
* update it just after we've released the lock.
*/
SpinLockAcquire(&xlogctl->info_lck);
@@ -7919,7 +7550,7 @@ GetRedoRecPtr(void)
*
* NOTE: The value *actually* returned is the position of the last full
* xlog page. It lags behind the real insert position by at most 1 page.
- * For that, we don't need to scan through WAL insertion slots, and an
+ * For that, we don't need to scan through WAL insertion locks, and an
* approximation is enough for the current usage of this function.
*/
XLogRecPtr
@@ -8280,7 +7911,7 @@ CreateCheckPoint(int flags)
* We must block concurrent insertions while examining insert state to
* determine the checkpoint REDO pointer.
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
/*
@@ -8305,7 +7936,7 @@ CreateCheckPoint(int flags)
MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
ControlFile->checkPoint == ControlFile->checkPointCopy.redo)
{
- WALInsertSlotRelease();
+ WALInsertLockRelease();
LWLockRelease(CheckpointLock);
END_CRIT_SECTION();
return;
@@ -8349,7 +7980,7 @@ CreateCheckPoint(int flags)
/*
* Here we update the shared RedoRecPtr for future XLogInsert calls; this
- * must be done while holding the insertion slots.
+ * must be done while holding all the insertion locks.
*
* Note: if we fail to complete the checkpoint, RedoRecPtr will be left
* pointing past where it really needs to point. This is okay; the only
@@ -8361,10 +7992,10 @@ CreateCheckPoint(int flags)
RedoRecPtr = xlogctl->Insert.RedoRecPtr = checkPoint.redo;
/*
- * Now we can release the WAL insertion slots, allowing other xacts to
+ * Now we can release the WAL insertion locks, allowing other xacts to
* proceed while we are flushing disk buffers.
*/
- WALInsertSlotRelease();
+ WALInsertLockRelease();
/* Update the info_lck-protected copy of RedoRecPtr as well */
SpinLockAcquire(&xlogctl->info_lck);
@@ -8394,7 +8025,7 @@ CreateCheckPoint(int flags)
* we wait till he's out of his commit critical section before proceeding.
* See notes in RecordTransactionCommit().
*
- * Because we've already released the insertion slots, this test is a bit
+ * Because we've already released the insertion locks, this test is a bit
* fuzzy: it is possible that we will wait for xacts we didn't really need
* to wait for. But the delay should be short and it seems better to make
* checkpoint take a bit longer than to hold off insertions longer than
@@ -8625,10 +8256,10 @@ CreateEndOfRecoveryRecord(void)
xlrec.end_time = time(NULL);
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
xlrec.ThisTimeLineID = ThisTimeLineID;
xlrec.PrevTimeLineID = XLogCtl->PrevTimeLineID;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
LocalSetXLogInsertAllowed();
@@ -8834,9 +8465,9 @@ CreateRestartPoint(int flags)
* during recovery this is just pro forma, because no WAL insertions are
* happening.
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
/* Also update the info_lck-protected copy */
SpinLockAcquire(&xlogctl->info_lck);
@@ -9296,9 +8927,9 @@ UpdateFullPageWrites(void)
*/
if (fullPageWrites)
{
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
Insert->fullPageWrites = true;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
}
/*
@@ -9319,9 +8950,9 @@ UpdateFullPageWrites(void)
if (!fullPageWrites)
{
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
Insert->fullPageWrites = false;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
}
END_CRIT_SECTION();
}
@@ -9952,15 +9583,15 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
* Note that forcePageWrites has no effect during an online backup from
* the standby.
*
- * We must hold all the insertion slots to change the value of
+ * We must hold all the insertion locks to change the value of
* forcePageWrites, to ensure adequate interlocking against XLogInsert().
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
if (exclusive)
{
if (XLogCtl->Insert.exclusiveBackup)
{
- WALInsertSlotRelease();
+ WALInsertLockRelease();
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("a backup is already in progress"),
@@ -9971,7 +9602,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
else
XLogCtl->Insert.nonExclusiveBackups++;
XLogCtl->Insert.forcePageWrites = true;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
/* Ensure we release forcePageWrites if fail below */
PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
@@ -10086,13 +9717,13 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
* taking a checkpoint right after another is not that expensive
* either because only few buffers have been dirtied yet.
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
if (XLogCtl->Insert.lastBackupStart < startpoint)
{
XLogCtl->Insert.lastBackupStart = startpoint;
gotUniqueStartpoint = true;
}
- WALInsertSlotRelease();
+ WALInsertLockRelease();
} while (!gotUniqueStartpoint);
XLByteToSeg(startpoint, _logSegNo);
@@ -10182,7 +9813,7 @@ pg_start_backup_callback(int code, Datum arg)
bool exclusive = DatumGetBool(arg);
/* Update backup counters and forcePageWrites on failure */
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
if (exclusive)
{
Assert(XLogCtl->Insert.exclusiveBackup);
@@ -10199,7 +9830,7 @@ pg_start_backup_callback(int code, Datum arg)
{
XLogCtl->Insert.forcePageWrites = false;
}
- WALInsertSlotRelease();
+ WALInsertLockRelease();
}
/*
@@ -10268,7 +9899,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
/*
* OK to update backup counters and forcePageWrites
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
if (exclusive)
XLogCtl->Insert.exclusiveBackup = false;
else
@@ -10288,7 +9919,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
{
XLogCtl->Insert.forcePageWrites = false;
}
- WALInsertSlotRelease();
+ WALInsertLockRelease();
if (exclusive)
{
@@ -10573,7 +10204,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
void
do_pg_abort_backup(void)
{
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
XLogCtl->Insert.nonExclusiveBackups--;
@@ -10582,7 +10213,7 @@ do_pg_abort_backup(void)
{
XLogCtl->Insert.forcePageWrites = false;
}
- WALInsertSlotRelease();
+ WALInsertLockRelease();
}
/*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 82ef440..f88bf76 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -10,6 +10,14 @@
* locking should be done with the full lock manager --- which depends on
* LWLocks to protect its shared state.
*
+ * In addition to exclusive and shared modes, lightweight locks can be used
+ * to wait until a variable changes value. The variable is set with
+ * LWLockWakeup, and waited for with LWLockWait. LWLockWait waits until the
+ * lock is free, or the variable changes. LWLockWakeup assigns to the
+ * variable, waking up any LWLockWait() callers, without releasing the lock.
+ * The meaning of the value assigned is up to the caller; the lightweight
+ * lock code just assigns and compares it.
+ *
*
* Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -443,16 +451,18 @@ LWLockInitialize(LWLock *lock, int tranche_id)
/*
* LWLockAcquire - acquire a lightweight lock in the specified mode
*
- * If the lock is not available, sleep until it is.
+ * If the lock is not available, sleep until it is. Returns true if the lock
+ * was available immediately, false if we had to sleep.
*
* Side effect: cancel/die interrupts are held off until lock release.
*/
-void
+bool
LWLockAcquire(LWLock *l, LWLockMode mode)
{
volatile LWLock *lock = l;
PGPROC *proc = MyProc;
bool retry = false;
+ bool result = true;
int extraWaits = 0;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -601,6 +611,7 @@ LWLockAcquire(LWLock *l, LWLockMode mode)
/* Now loop back and try to acquire lock again. */
retry = true;
+ result = false;
}
/* We are done updating shared state of the lock itself. */
@@ -616,6 +627,8 @@ LWLockAcquire(LWLock *l, LWLockMode mode)
*/
while (extraWaits-- > 0)
PGSemaphoreUnlock(&proc->sem);
+
+ return result;
}
/*
@@ -835,6 +848,226 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
}
/*
+ * LWLockWait - Wait until lock is free, or a variable is updated.
+ *
+ * If the lock is held, and the *valptr equals oldval, waits until the lock
+ * is either freed, or the lock holder updates *valptr by calling
+ * LWLockWakeup. If the lock is free on exit (immediately or after waiting),
+ * returns true. If the lock is still held, but *valptr no longer matches
+ * oldval, returns false and sets *newval to the current value in *valptr.
+ *
+ * It's possible that the lock holder releases the lock, but another backend
+ * acquires it again before we get a chance to observe that the lock was
+ * momentarily released. We wouldn't need to wait for the new lock holder, but
+ * we cannot distinguish that case, so we will have to wait.
+ *
+ * Note: this function ignores shared lock holders; if the lock is held
+ * in shared mode, returns 'true'.
+ */
+bool
+LWLockWait(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
+{
+ volatile LWLock *lock = l;
+ volatile uint64 *valp = valptr;
+ PGPROC *proc = MyProc;
+ int extraWaits = 0;
+ bool result = false;
+
+ /*
+ * Quick test first to see if the lock is free right now.
+ *
+ * XXX: the caller uses a spinlock before this, so we don't need a memory
+ * barrier here as far as the current usage is concerned. But that might
+ * not be safe in general.
+ */
+ if (lock->exclusive == 0)
+ return true;
+
+ /*
+ * Lock out cancel/die interrupts while we sleep on the lock. There is
+ * no cleanup mechanism to remove us from the wait queue if we got
+ * interrupted.
+ */
+ HOLD_INTERRUPTS();
+
+ /*
+ * Loop here to check the lock's status after each time we are signaled.
+ */
+ for (;;)
+ {
+ bool mustwait;
+ uint64 value;
+
+ /* Acquire mutex. Time spent holding mutex should be short! */
+#ifdef LWLOCK_STATS
+ lwstats->spin_delay_count += SpinLockAcquire(&lock->mutex);
+#else
+ SpinLockAcquire(&lock->mutex);
+#endif
+
+ /* Is the lock now free, and if not, does the value match? */
+ if (lock->exclusive == 0)
+ {
+ result = true;
+ mustwait = false;
+ }
+ else
+ {
+ value = *valp;
+ if (value != oldval)
+ {
+ result = false;
+ mustwait = false;
+ *newval = value;
+ }
+ else
+ mustwait = true;
+ }
+
+ if (!mustwait)
+ break; /* the lock was free or value didn't match */
+
+ /*
+ * Add myself to wait queue.
+ */
+ proc->lwWaiting = true;
+ proc->lwWaitMode = LW_WAIT_UNTIL_FREE;
+ proc->lwWaitLink = NULL;
+
+ /* waiters are added to the front of the queue */
+ proc->lwWaitLink = lock->head;
+ if (lock->head == NULL)
+ lock->tail = proc;
+ lock->head = proc;
+
+ /* Can release the mutex now */
+ SpinLockRelease(&lock->mutex);
+
+ /*
+ * Wait until awakened.
+ *
+ * Since we share the process wait semaphore with other things, like
+ * the regular lock manager and ProcWaitForSignal, and we may need to
+ * acquire an LWLock while one of those is pending, it is possible that
+ * we get awakened for a reason other than being signaled by
+ * LWLockRelease. If so, loop back and wait again. Once we've gotten
+ * the LWLock, re-increment the sema by the number of additional
+ * signals received, so that the lock manager or signal manager will
+ * see the received signal when it next waits.
+ */
+ LOG_LWDEBUG("LWLockWait", T_NAME(l), T_ID(l), "waiting");
+
+#ifdef LWLOCK_STATS
+ lwstats->block_count++;
+#endif
+
+ TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
+
+ for (;;)
+ {
+ /* "false" means cannot accept cancel/die interrupt here. */
+ PGSemaphoreLock(&proc->sem, false);
+ if (!proc->lwWaiting)
+ break;
+ extraWaits++;
+ }
+
+ TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
+
+ LOG_LWDEBUG("LWLockWait", T_NAME(l), T_ID(l), "awakened");
+
+ /* Now loop back and check the status of the lock again. */
+ }
+
+ /* We are done updating shared state of the lock itself. */
+ SpinLockRelease(&lock->mutex);
+
+ TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
+
+ /*
+ * Fix the process wait semaphore's count for any absorbed wakeups.
+ */
+ while (extraWaits-- > 0)
+ PGSemaphoreUnlock(&proc->sem);
+
+ /*
+ * Now okay to allow cancel/die interrupts.
+ */
+ RESUME_INTERRUPTS();
+
+ return result;
+}
+
+
+/*
+ * LWLockWakeup - Update a variable and wake up waiters atomically
+ *
+ * Sets *valptr to 'val', and wakes up all processes waiting for us with
+ * LWLockWait(). Setting the value and waking up the process happen
+ * atomically so that any process calling LWLockWait() on the same lock is
+ * guaranteed to see the new value, and act accordingly.
+ *
+ * The caller must be holding the lock in exclusive mode.
+ */
+void
+LWLockWakeup(LWLock *l, uint64 *valptr, uint64 val)
+{
+ volatile LWLock *lock = l;
+ volatile uint64 *valp = valptr;
+ PGPROC *head;
+ PGPROC *proc;
+ PGPROC *next;
+
+ /* Acquire mutex. Time spent holding mutex should be short! */
+ SpinLockAcquire(&lock->mutex);
+
+ /* we should hold the lock */
+ Assert(lock->exclusive == 1);
+
+ /* Update the lock's value */
+ *valp = val;
+
+ /*
+ * See if there are any LW_WAIT_UNTIL_FREE waiters that need to be woken
+ * up. They are always in the front of the queue.
+ */
+ head = lock->head;
+
+ if (head != NULL && head->lwWaitMode == LW_WAIT_UNTIL_FREE)
+ {
+ proc = head;
+ next = proc->lwWaitLink;
+ while (next && next->lwWaitMode == LW_WAIT_UNTIL_FREE)
+ {
+ proc = next;
+ next = next->lwWaitLink;
+ }
+
+ /* proc is now the last PGPROC to be released */
+ lock->head = next;
+ proc->lwWaitLink = NULL;
+ }
+ else
+ head = NULL;
+
+ /* We are done updating shared state of the lock itself. */
+ SpinLockRelease(&lock->mutex);
+
+ /*
+ * Awaken any waiters I removed from the queue.
+ */
+ while (head != NULL)
+ {
+ proc = head;
+ head = proc->lwWaitLink;
+ proc->lwWaitLink = NULL;
+ proc->lwWaiting = false;
+ PGSemaphoreUnlock(&proc->sem);
+ }
+}
+
+
+/*
* LWLockRelease - release a previously acquired lock
*/
void
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index 2812a73..3f9b366 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2103,12 +2103,12 @@ static struct config_int ConfigureNamesInt[] =
},
{
- {"xloginsert_slots", PGC_POSTMASTER, WAL_SETTINGS,
- gettext_noop("Sets the number of slots for concurrent xlog insertions."),
+ {"xloginsert_locks", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Sets the number of locks used for concurrent xlog insertions."),
NULL,
GUC_NOT_IN_SAMPLE
},
- &num_xloginsert_slots,
+ &num_xloginsert_locks,
8, 1, 1000,
NULL, NULL, NULL
},
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 11ab277..f5152f4 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -192,7 +192,7 @@ extern bool EnableHotStandby;
extern bool fullPageWrites;
extern bool wal_log_hints;
extern bool log_checkpoints;
-extern int num_xloginsert_slots;
+extern int num_xloginsert_locks;
/* WAL levels */
typedef enum WalLevel
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index c8ff4eb..ecebd4d 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -169,9 +169,11 @@ typedef enum LWLockMode
extern bool Trace_lwlocks;
#endif
-extern void LWLockAcquire(LWLock *lock, LWLockMode mode);
+extern bool LWLockAcquire(LWLock *lock, LWLockMode mode);
extern bool LWLockConditionalAcquire(LWLock *lock, LWLockMode mode);
extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode);
+extern bool LWLockWait(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval);
+extern void LWLockWakeup(LWLock *lock, uint64 *valptr, uint64 value);
extern void LWLockRelease(LWLock *lock);
extern void LWLockReleaseAll(void);
extern bool LWLockHeldByMe(LWLock *lock);
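To make the intended use of the new API concrete, here is a small sketch (not part of the patch; the sketch_* helper names are made up, and WALInsertLock is the struct from xlog.c that pairs an LWLock with an insertingAt position). It condenses how WaitXLogInsertionsToFinish and WALInsertLockWakeup above call LWLockWait and LWLockWakeup:

/* Inserter side: already holds lck->lock in LW_EXCLUSIVE mode. */
static void
sketch_advertise_progress(WALInsertLock *lck, XLogRecPtr pagePos)
{
	/* tell any LWLockWait() callers that everything before pagePos is done */
	LWLockWakeup(&lck->lock, &lck->insertingAt, pagePos);
}

/* Flusher side: wait until the insertion protected by 'lck' is past 'upto'. */
static XLogRecPtr
sketch_wait_past(WALInsertLock *lck, XLogRecPtr upto)
{
	XLogRecPtr	insertingat = InvalidXLogRecPtr;

	do
	{
		/* returns true if the lock was free, i.e. no insertion in progress */
		if (LWLockWait(&lck->lock, &lck->insertingAt,
					   insertingat, &insertingat))
			return InvalidXLogRecPtr;
	} while (insertingat < upto);

	return insertingat;		/* the inserter has already passed 'upto' */
}

The loop on the flusher side matters because a wakeup only says the value changed; we keep waiting until the advertised position has moved past 'upto' or the lock is released.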
On 02/18/2014 09:23 PM, Heikki Linnakangas wrote:
On 02/17/2014 10:36 PM, Andres Freund wrote:
On 2014-02-17 22:30:54 +0200, Heikki Linnakangas wrote:
This is what I came up with. I like it; I didn't have to contort lwlocks as
much as I feared. I added one field to the LWLock structure, which is used to
store how far a WAL inserter has progressed. The LWLock code
calls it just "value", without caring what's stored in it, and it's used by
the new functions LWLockWait and LWLockWakeup to implement the behavior the WAL
insertion slots have, to wake up other processes waiting for the slot
without releasing it.

This passes regression tests, but I'll have to re-run the performance tests
with this. One worry is that if the padded size of the LWLock struct is
smaller than a cache line, neighboring WAL insertion locks will compete for
the cache line. Another worry is that since I added a field to the LWLock
struct, it might now take 64 bytes on platforms where it used to be 32 bytes
before. That wastes some memory.

Why don't you allocate them in a separate tranche, from xlog.c? Then you
can store them inside whatever bigger object you want, guaranteeing
exactly the alignment you need. Possibly you can even have the extra
value in the enclosing object?

Good idea. New patch attached, doing that.
I'll try to find time on some multi-CPU hardware to performance test
this against current master, to make sure there's no regression.
Ok, I ran the same tests I used earlier for the xloginsert scaling
patch, with REL9_3_STABLE, current master, and the patch to refactor the
xloginsert slots into lwlocks.
The main question I was trying to answer was: Is the new patch similar
in performance to current master? The short answer is "yes". In some
tests there were significant differences, but overall I couldn't say
which one was better.
The test case I used is pgbench with a custom script using a backend
extension called "xlogtest", which just does a bunch XLOGInserts of
dummy WAL records of given size. Small and large WAL records and have
quite different characteristics in how they stress the xlog machinery. I
used three different WAL record sizes: 20, 100 and 8000 bytes, excluding
the WAL record header. 20 bytes is pretty much the minimum size of a
realistic WAL record, for something like a heap deletion. 100 bytes
would be typical of an insert or update record, while 8000 bytes would
be a full-page write or b-tree page split. The number of such inserts
done per transaction was scaled so that each transaction inserts about
100000 bytes in total. That's quite a lot, but with shorter transactions
you easily get bottlenecked by other things like the ProcArrayLock, and
the point of this test was to exercise WAL insertions.
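The xlogtest extension itself is not attached here, but for illustration its core is roughly a loop like this sketch (the function name, its arguments and the dummy payload are made up; only XLogInsert, XLogRecData and XLOG_NOOP are the real 9.3-era interfaces):

#include "postgres.h"
#include "fmgr.h"
#include "access/xlog.h"
#include "catalog/pg_control.h"		/* XLOG_NOOP */

PG_MODULE_MAGIC;
PG_FUNCTION_INFO_V1(xlogtest_insert);

Datum
xlogtest_insert(PG_FUNCTION_ARGS)
{
	int32		payload_size = PG_GETARG_INT32(0);	/* e.g. 20, 100 or 8000 */
	int32		count = PG_GETARG_INT32(1);			/* ~100000 / payload_size */
	char	   *payload = palloc0(payload_size);
	XLogRecData rdata;
	int			i;

	rdata.data = payload;
	rdata.len = payload_size;
	rdata.buffer = InvalidBuffer;	/* no data page attached, no backup block */
	rdata.next = NULL;

	for (i = 0; i < count; i++)
		(void) XLogInsert(RM_XLOG_ID, XLOG_NOOP, &rdata);

	pfree(payload);
	PG_RETURN_VOID();
}

A pgbench custom script then just calls such a function once per transaction with the desired payload size and count.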
I ran the tests on three different machines: my laptop with 4 cores (8
logical cores with hyperthreading) and an SSD disk, a virtual machine
running on a host with 32 cores (no other VMs running) with some kind of
external storage (I don't know the details), and Nathan Boley's 64-core
AMD box (thanks Nate for lending it again!). On the AMD box, I ran the
tests twice, once with data directory on the disk, and once in /dev/shm.
The results vary. Overall, both git master and the patched
version perform similarly, and at least as well as REL9_3_STABLE. There
are a few clear exceptions to that: on Nathan's box with the data directory
on disk, the patched version performs much better than either git or
REL9_3_STABLE with a 20-byte payload. And on my laptop, with a 20-byte
payload, git master performs somewhat better than the patched version,
but still better than REL9_3_STABLE, except when running with a single client.
I collected the summary graphs of all the tests here (you can click the
graphs for the detailed pgbench-tools result pages):
http://hlinnaka.iki.fi/xlogslot-to-lwlock-results/
Some caveats on the test methodology:
1. I didn't run the tests for the same duration on all the different
machines. The test duration on the AMD box was 20 seconds for the
disk-based tests and 30 seconds for the RAM-disk based tests. On the
32-core VM and my laptop, the test duration was 60 seconds. So you
cannot compare the tests on different hardware directly.
2. All those test durations were pretty short. That means that the TPS
number in any individual test result is quite noisy, and you should look
at the general shapes of the graphs instead of individual points.
3. The
number of checkpoints during the tests varied, which again creates a lot
of noise in the individual points.
4. In the last test in the series, on the 64-core AMD box with data dir
in a RAM drive, the patched test with 64 clients deadlocked. I tracked it
down to a bug in the patch, in how the insertion's progress is reported
when holding the WALInsertLocks in exclusive mode, i.e. when holding them
all while starting a checkpoint (see the condensed sketch after this list).
The exclusive lock is held so seldom
that I have no reason to believe that it affects the performance, but
nevertheless the patch I tested was not 100% identical to the fixed
version attached. That explains the apparent dip in performance with 64
clients with the patched version.
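For reference, the fixed behaviour in the attached patch boils down to the following: when one backend holds all the insertion locks (the checkpoint case), only the last lock carries the real progress value, and the other locks' values are parked at the maximum LSN so that nobody ever sleeps on them. Condensed from WALInsertLockAcquire and WALInsertLockWakeup in the patch below:

/* WALInsertLockAcquire(true): park all but the last lock at "infinity" */
for (i = 0; i < num_xloginsert_locks - 1; i++)
{
	LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
	LWLockWakeup(&WALInsertLocks[i].l.lock,
				 &WALInsertLocks[i].l.insertingAt,
				 INT64CONST(0xFFFFFFFFFFFFFFFF));
}
LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);

/* WALInsertLockWakeup(): only the last lock tracks the real position */
LWLockWakeup(&WALInsertLocks[num_xloginsert_locks - 1].l.lock,
			 &WALInsertLocks[num_xloginsert_locks - 1].l.insertingAt,
			 insertingAt);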
So there are some unexplained differences there, but based on these
results, I'm still OK with committing the patch.
- Heikki
Attachments:
xlogslot-to-lwlock-3.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 483d5c3..06bbc90 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -86,7 +86,7 @@ int sync_method = DEFAULT_SYNC_METHOD;
int wal_level = WAL_LEVEL_MINIMAL;
int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
-int num_xloginsert_slots = 8;
+int num_xloginsert_locks = 8;
#ifdef WAL_DEBUG
bool XLOG_DEBUG = false;
@@ -289,7 +289,7 @@ XLogRecPtr XactLastRecEnd = InvalidXLogRecPtr;
* (which is almost but not quite the same as a pointer to the most recent
* CHECKPOINT record). We update this from the shared-memory copy,
* XLogCtl->Insert.RedoRecPtr, whenever we can safely do so (ie, when we
- * hold an insertion slot). See XLogInsert for details. We are also allowed
+ * hold an insertion lock). See XLogInsert for details. We are also allowed
* to update from XLogCtl->RedoRecPtr if we hold the info_lck;
* see GetRedoRecPtr. A freshly spawned backend obtains the value during
* InitXLOGAccess.
@@ -361,63 +361,45 @@ typedef struct XLogwrtResult
XLogRecPtr Flush; /* last byte + 1 flushed */
} XLogwrtResult;
-
/*
- * A slot for inserting to the WAL. This is similar to an LWLock, the main
- * difference is that there is an extra xlogInsertingAt field that is protected
- * by the same mutex. Unlike an LWLock, a slot can only be acquired in
- * exclusive mode.
- *
- * The xlogInsertingAt field is used to advertise to other processes how far
- * the slot owner has progressed in inserting the record. When a backend
- * acquires a slot, it initializes xlogInsertingAt to 1, because it doesn't
- * yet know where it's going to insert the record. That's conservative
- * but correct; the new insertion is certainly going to go to a byte position
- * greater than 1. If another backend needs to flush the WAL, it will have to
- * wait for the new insertion. xlogInsertingAt is updated after finishing the
- * insert or when crossing a page boundary, which will wake up anyone waiting
- * for it, whether the wait was necessary in the first place or not.
- *
- * A process can wait on a slot in two modes: LW_EXCLUSIVE or
- * LW_WAIT_UNTIL_FREE. LW_EXCLUSIVE works like in an lwlock; when the slot is
- * released, the first LW_EXCLUSIVE waiter in the queue is woken up. Processes
- * waiting in LW_WAIT_UNTIL_FREE mode are woken up whenever the slot is
- * released, or xlogInsertingAt is updated. In other words, a process in
- * LW_WAIT_UNTIL_FREE mode is woken up whenever the inserter makes any progress
- * copying the record in place. LW_WAIT_UNTIL_FREE waiters are always added to
- * the front of the queue, while LW_EXCLUSIVE waiters are appended to the end.
- *
- * To join the wait queue, a process must set MyProc->lwWaitMode to the mode
- * it wants to wait in, MyProc->lwWaiting to true, and link MyProc to the head
- * or tail of the wait queue. The same mechanism is used to wait on an LWLock,
- * see lwlock.c for details.
+ * Inserting to WAL is protected by a bunch of WALInsertLocks. Each WAL
+ * insertion lock consists of a lightweight lock, plus an indicator of how
+ * far the insertion has progressed (insertingAt).
+ *
+ * The insertingAt value is used when writing the WAL to disk, to avoid
+ * waiting unnecessarily for an insertion that's still in-progress, but has
+ * already finished inserting all WAL beyond the point you're going to write
+ * the WAL up to. This isn't just an optimization; it's necessary to avoid
+ * deadlocks when an inserter has to switch to a new WAL buffer. An inserter
+ * that's holding a WAL insert lock might need to flush the WAL, to evict an
+ * old WAL buffer, to make room for the new record. If it's possible for an
+ * inserter to wait for another inserter unnecessarily, that can lead to
+ * a deadlock if two inserters holding a WAL insert lock wait for each other
+ * to finish their insertion.
+ *
+ * Small WAL records that don't cross a page boundary never update the value,
+ * the WAL record is just copied to the page and the lock is released. But
+ * when crossing a page boundary, it's updated to let others know that the
+ * backend has finished modifying the previous page.
*/
typedef struct
{
- slock_t mutex; /* protects the below fields */
- XLogRecPtr xlogInsertingAt; /* insert has completed up to this point */
-
- PGPROC *owner; /* for debugging purposes */
-
- bool releaseOK; /* T if ok to release waiters */
- char exclusive; /* # of exclusive holders (0 or 1) */
- PGPROC *head; /* head of list of waiting PGPROCs */
- PGPROC *tail; /* tail of list of waiting PGPROCs */
- /* tail is undefined when head is NULL */
-} XLogInsertSlot;
+ LWLock lock;
+ XLogRecPtr insertingAt;
+} WALInsertLock;
/*
- * All the slots are allocated as an array in shared memory. We force the
- * array stride to be a power of 2, which saves a few cycles in indexing, but
- * more importantly also ensures that individual slots don't cross cache line
- * boundaries. (Of course, we have to also ensure that the array start
- * address is suitably aligned.)
+ * All the WAL insertion locks are allocated as an array in shared memory. We
+ * force the array stride to be a power of 2, which saves a few cycles in
+ * indexing, but more importantly also ensures that individual slots don't
+ * cross cache line boundaries. (Of course, we have to also ensure that the
+ * array start address is suitably aligned.)
*/
-typedef union XLogInsertSlotPadded
+typedef union WALInsertLockPadded
{
- XLogInsertSlot slot;
+ WALInsertLock l;
char pad[CACHE_LINE_SIZE];
-} XLogInsertSlotPadded;
+} WALInsertLockPadded;
/*
* Shared state data for XLogInsert.
@@ -452,8 +434,8 @@ typedef struct XLogCtlInsert
* we must WAL-log it before it actually affects WAL-logging by backends.
* Checkpointer sets at startup or after SIGHUP.
*
- * To read these fields, you must hold an insertion slot. To modify them,
- * you must hold ALL the slots.
+ * To read these fields, you must hold an insertion lock. To modify them,
+ * you must hold ALL the locks.
*/
XLogRecPtr RedoRecPtr; /* current redo point for insertions */
bool forcePageWrites; /* forcing full-page writes for PITR? */
@@ -470,8 +452,16 @@ typedef struct XLogCtlInsert
int nonExclusiveBackups;
XLogRecPtr lastBackupStart;
- /* insertion slots, see XLogInsertSlot struct above for details */
- XLogInsertSlotPadded *insertSlots;
+ /*
+ * To insert a new WAL record, you must hold a WAL insertion lock. Before
+ * 9.4, there was a single WALInsertLock, but that became a bottleneck
+ * on multi-core systems that insert a lot of WAL. Now, there are several
+ * WAL insertion locks, and to insert WAL, you must hold one of them (in
+ * exclusive mode). It doesn't matter which one.
+ */
+ WALInsertLockPadded *WALInsertLocks;
+ LWLockTranche WALInsertLockTranche;
+ int WALInsertLockTrancheId;
} XLogCtlInsert;
/*
@@ -609,6 +599,9 @@ typedef struct XLogCtlData
static XLogCtlData *XLogCtl = NULL;
+/* a private copy of XLogCtl->Insert.WALInsertLocks, for convenience */
+static WALInsertLockPadded *WALInsertLocks = NULL;
+
/*
* We maintain an image of pg_control in shared memory.
*/
@@ -732,9 +725,9 @@ static bool InRedo = false;
/* Have we launched bgwriter during recovery? */
static bool bgwriterLaunched = false;
-/* For WALInsertSlotAcquire/Release functions */
-static int MySlotNo = 0;
-static bool holdingAllSlots = false;
+/* For WALInsertLockAcquire/Release functions */
+static int MyLockNo = 0;
+static bool holdingAllLocks = false;
static void readRecoveryCommandFile(void);
static void exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo);
@@ -808,16 +801,14 @@ static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
XLogRecPtr *PrevPtr);
static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
-static void WakeupWaiters(XLogRecPtr EndPos);
static char *GetXLogBuffer(XLogRecPtr ptr);
static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
-static void WALInsertSlotAcquire(bool exclusive);
-static void WALInsertSlotAcquireOne(int slotno);
-static void WALInsertSlotRelease(void);
-static void WALInsertSlotReleaseOne(int slotno);
+static void WALInsertLockAcquire(bool exclusive);
+static void WALInsertLockRelease(void);
+static void WALInsertLockWakeup(XLogRecPtr insertingAt);
/*
* Insert an XLOG record having the specified RMID and info bytes,
@@ -894,7 +885,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
*
* We may have to loop back to here if a race condition is detected below.
* We could prevent the race by doing all this work while holding an
- * insertion slot, but it seems better to avoid doing CRC calculations
+ * insertion lock, but it seems better to avoid doing CRC calculations
* while holding one.
*
* We add entries for backup blocks to the chain, so that they don't need
@@ -912,8 +903,8 @@ begin:;
/*
* Decide if we need to do full-page writes in this XLOG record: true if
* full_page_writes is on or we have a PITR request for it. Since we
- * don't yet have an insertion slot, fullPageWrites and forcePageWrites
- * could change under us, but we'll recheck them once we have a slot.
+ * don't yet have an insertion lock, fullPageWrites and forcePageWrites
+ * could change under us, but we'll recheck them once we have a lock.
*/
doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
@@ -1087,16 +1078,15 @@ begin:;
* record in place. This can be done concurrently in multiple processes.
*
* To keep track of which insertions are still in-progress, each concurrent
- * inserter allocates an "insertion slot", which tells others how far the
+ * inserter acquires an insertion lock. In addition to just indicating that
+ * an insertion is in progress, the lock tells others how far the
* inserter has progressed. There is a small fixed number of insertion
- * slots, determined by the num_xloginsert_slots GUC. When an inserter
- * finishes, it updates the xlogInsertingAt of its slot to the end of the
- * record it inserted, to let others know that it's done. xlogInsertingAt
- * is also updated when crossing over to a new WAL buffer, to allow the
- * the previous buffer to be flushed.
+ * locks, determined by the num_xloginsert_locks GUC. When an inserter
+ * crosses a page boundary, it updates the value stored in the lock to
+ * indicate how far it has inserted, to allow the previous buffer to be flushed.
*
- * Holding onto a slot also protects RedoRecPtr and fullPageWrites from
- * changing until the insertion is finished.
+ * Holding onto an insertion lock also protects RedoRecPtr and
+ * fullPageWrites from changing until the insertion is finished.
*
* Step 2 can usually be done completely in parallel. If the required WAL
* page is not initialized yet, you have to grab WALBufMappingLock to
@@ -1106,7 +1096,7 @@ begin:;
*----------
*/
START_CRIT_SECTION();
- WALInsertSlotAcquire(isLogSwitch);
+ WALInsertLockAcquire(isLogSwitch);
/*
* Check to see if my RedoRecPtr is out of date. If so, may have to go
@@ -1135,7 +1125,7 @@ begin:;
* Oops, this buffer now needs to be backed up, but we
* didn't think so above. Start over.
*/
- WALInsertSlotRelease();
+ WALInsertLockRelease();
END_CRIT_SECTION();
rdt_lastnormal->next = NULL;
info = info_orig;
@@ -1154,7 +1144,7 @@ begin:;
if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
{
/* Oops, must redo it with full-page data. */
- WALInsertSlotRelease();
+ WALInsertLockRelease();
END_CRIT_SECTION();
rdt_lastnormal->next = NULL;
info = info_orig;
@@ -1202,7 +1192,7 @@ begin:;
/*
* Done! Let others know that we're finished.
*/
- WALInsertSlotRelease();
+ WALInsertLockRelease();
MarkCurrentTransactionIdLoggedIfAny();
@@ -1363,7 +1353,7 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
/*
* These calculations are a bit heavy-weight to be done while holding a
- * spinlock, but since we're holding all the WAL insertion slots, there
+ * spinlock, but since we're holding all the WAL insertion locks, there
* are no other inserters competing for it. GetXLogInsertRecPtr() does
* compete for it, but that's not called very frequently.
*/
@@ -1523,7 +1513,7 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
while (CurrPos < EndPos)
{
/* initialize the next page (if not initialized already) */
- WakeupWaiters(CurrPos);
+ WALInsertLockWakeup(CurrPos);
AdvanceXLInsertBuffer(CurrPos, false);
CurrPos += XLOG_BLCKSZ;
}
@@ -1534,452 +1524,125 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
}
/*
- * Allocate a slot for insertion.
+ * Acquire a WAL insertion lock.
*
- * In exclusive mode, all slots are reserved for the current process. That
- * blocks all concurrent insertions.
+ * In exclusive mode, all locks are acquired. That blocks all concurrent
+ * insertions.
*/
static void
-WALInsertSlotAcquire(bool exclusive)
+WALInsertLockAcquire(bool exclusive)
{
int i;
if (exclusive)
{
- for (i = 0; i < num_xloginsert_slots; i++)
- WALInsertSlotAcquireOne(i);
- holdingAllSlots = true;
- }
- else
- WALInsertSlotAcquireOne(-1);
-}
-
-/*
- * Workhorse of WALInsertSlotAcquire. Acquires the given slot, or an arbitrary
- * one if slotno == -1. The index of the slot that was acquired is stored in
- * MySlotNo.
- *
- * This is more or less equivalent to LWLockAcquire().
- */
-static void
-WALInsertSlotAcquireOne(int slotno)
-{
- volatile XLogInsertSlot *slot;
- PGPROC *proc = MyProc;
- bool retry = false;
- int extraWaits = 0;
- static int slotToTry = -1;
-
- /*
- * Try to use the slot we used last time. If the system isn't particularly
- * busy, it's a good bet that it's available, and it's good to have some
- * affinity to a particular slot so that you don't unnecessarily bounce
- * cache lines between processes when there is no contention.
- *
- * If this is the first time through in this backend, pick a slot
- * (semi-)randomly. This allows the slots to be used evenly if you have a
- * lot of very short connections.
- */
- if (slotno != -1)
- MySlotNo = slotno;
- else
- {
- if (slotToTry == -1)
- slotToTry = MyProc->pgprocno % num_xloginsert_slots;
- MySlotNo = slotToTry;
- }
-
- /*
- * We can't wait if we haven't got a PGPROC. This should only occur
- * during bootstrap or shared memory initialization. Put an Assert here
- * to catch unsafe coding practices.
- */
- Assert(MyProc != NULL);
-
- /*
- * Lock out cancel/die interrupts until we exit the code section protected
- * by the slot. This ensures that interrupts will not interfere with
- * manipulations of data structures in shared memory. There is no cleanup
- * mechanism to release the slot if the backend dies while holding one,
- * so make this a critical section.
- */
- START_CRIT_SECTION();
-
- /*
- * Loop here to try to acquire slot after each time we are signaled by
- * WALInsertSlotRelease.
- */
- for (;;)
- {
- bool mustwait;
-
- slot = &XLogCtl->Insert.insertSlots[MySlotNo].slot;
-
- /* Acquire mutex. Time spent holding mutex should be short! */
- SpinLockAcquire(&slot->mutex);
-
- /* If retrying, allow WALInsertSlotRelease to release waiters again */
- if (retry)
- slot->releaseOK = true;
-
- /* If I can get the slot, do so quickly. */
- if (slot->exclusive == 0)
+ for (i = 0; i < num_xloginsert_locks - 1; i++)
{
- slot->exclusive++;
- mustwait = false;
+ LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
+ LWLockWakeup(&WALInsertLocks[i].l.lock,
+ &WALInsertLocks[i].l.insertingAt,
+ INT64CONST(0xFFFFFFFFFFFFFFFF));
}
- else
- mustwait = true;
-
- if (!mustwait)
- break; /* got the lock */
-
- Assert(slot->owner != MyProc);
-
- /*
- * Add myself to wait queue.
- */
- proc->lwWaiting = true;
- proc->lwWaitMode = LW_EXCLUSIVE;
- proc->lwWaitLink = NULL;
- if (slot->head == NULL)
- slot->head = proc;
- else
- slot->tail->lwWaitLink = proc;
- slot->tail = proc;
-
- /* Can release the mutex now */
- SpinLockRelease(&slot->mutex);
+ LWLockAcquire(&WALInsertLocks[i].l.lock, LW_EXCLUSIVE);
- /*
- * Wait until awakened.
- *
- * Since we share the process wait semaphore with the regular lock
- * manager and ProcWaitForSignal, and we may need to acquire a slot
- * while one of those is pending, it is possible that we get awakened
- * for a reason other than being signaled by WALInsertSlotRelease. If
- * so, loop back and wait again. Once we've gotten the slot,
- * re-increment the sema by the number of additional signals received,
- * so that the lock manager or signal manager will see the received
- * signal when it next waits.
- */
- for (;;)
- {
- /* "false" means cannot accept cancel/die interrupt here. */
- PGSemaphoreLock(&proc->sem, false);
- if (!proc->lwWaiting)
- break;
- extraWaits++;
- }
-
- /* Now loop back and try to acquire lock again. */
- retry = true;
+ holdingAllLocks = true;
}
-
- slot->owner = proc;
-
- /*
- * Normally, we initialize the xlogInsertingAt value of the slot to 1,
- * because we don't yet know where in the WAL we're going to insert. It's
- * not critical what it points to right now - leaving it to a too small
- * value just means that WaitXlogInsertionsToFinish() might wait on us
- * unnecessarily, until we update the value (when we finish the insert or
- * move to next page).
- *
- * If we're grabbing all the slots, however, stamp all but the last one
- * with InvalidXLogRecPtr, meaning there is no insert in progress. The last
- * slot is the one that we will update as we proceed with the insert, the
- * rest are held just to keep off other inserters.
- */
- if (slotno != -1 && slotno != num_xloginsert_slots - 1)
- slot->xlogInsertingAt = InvalidXLogRecPtr;
else
- slot->xlogInsertingAt = 1;
-
- /* We are done updating shared state of the slot itself. */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Fix the process wait semaphore's count for any absorbed wakeups.
- */
- while (extraWaits-- > 0)
- PGSemaphoreUnlock(&proc->sem);
-
- /*
- * If we couldn't get the slot immediately, try another slot next time.
- * On a system with more insertion slots than concurrent inserters, this
- * causes all the inserters to eventually migrate to a slot that no-one
- * else is using. On a system with more inserters than slots, it still
- * causes the inserters to be distributed quite evenly across the slots.
- */
- if (slotno != -1 && retry)
- slotToTry = (slotToTry + 1) % num_xloginsert_slots;
-}
-
-/*
- * Wait for the given slot to become free, or for its xlogInsertingAt location
- * to change to something else than 'waitptr'. In other words, wait for the
- * inserter using the given slot to finish its insertion, or to at least make
- * some progress.
- */
-static void
-WaitOnSlot(volatile XLogInsertSlot *slot, XLogRecPtr waitptr)
-{
- PGPROC *proc = MyProc;
- int extraWaits = 0;
-
- /*
- * Lock out cancel/die interrupts while we sleep on the slot. There is
- * no cleanup mechanism to remove us from the wait queue if we got
- * interrupted.
- */
- HOLD_INTERRUPTS();
-
- /*
- * Loop here to try to acquire lock after each time we are signaled.
- */
- for (;;)
{
- bool mustwait;
-
- /* Acquire mutex. Time spent holding mutex should be short! */
- SpinLockAcquire(&slot->mutex);
-
- /* If I can get the lock, do so quickly. */
- if (slot->exclusive == 0 || slot->xlogInsertingAt != waitptr)
- mustwait = false;
- else
- mustwait = true;
-
- if (!mustwait)
- break; /* the lock was free */
-
- Assert(slot->owner != MyProc);
+ bool immed;
/*
- * Add myself to wait queue.
+ * Acquire one of the locks. It doesn't matter which one, but try to
+ * use the lock we used last time. If the system isn't particularly
+ * busy, it's a good bet that it's available, and it's good to have
+ * some affinity to a particular lock so that you don't unnecessarily
+ * bounce cache lines between processes when there is no contention.
+ *
+ * If this is the first time through in this backend, pick a lock
+ * (semi-)randomly. This allows the locks to be used evenly if you have
+ * a lot of very short connections.
*/
- proc->lwWaiting = true;
- proc->lwWaitMode = LW_WAIT_UNTIL_FREE;
- proc->lwWaitLink = NULL;
+ static int lockToTry = -1;
+ if (lockToTry == -1)
+ lockToTry = MyProc->pgprocno % num_xloginsert_locks;
+ MyLockNo = lockToTry;
- /* waiters are added to the front of the queue */
- proc->lwWaitLink = slot->head;
- if (slot->head == NULL)
- slot->tail = proc;
- slot->head = proc;
+ immed = LWLockAcquire(&WALInsertLocks[MyLockNo].l.lock, LW_EXCLUSIVE);
- /* Can release the mutex now */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Wait until awakened.
- *
- * Since we share the process wait semaphore with other things, like
- * the regular lock manager and ProcWaitForSignal, and we may need to
- * acquire an LWLock while one of those is pending, it is possible that
- * we get awakened for a reason other than being signaled by
- * LWLockRelease. If so, loop back and wait again. Once we've gotten
- * the LWLock, re-increment the sema by the number of additional
- * signals received, so that the lock manager or signal manager will
- * see the received signal when it next waits.
- */
- for (;;)
+ if (!immed)
{
- /* "false" means cannot accept cancel/die interrupt here. */
- PGSemaphoreLock(&proc->sem, false);
- if (!proc->lwWaiting)
- break;
- extraWaits++;
+ /*
+ * If we couldn't get the lock immediately, try another lock next
+ * time. On a system with more insertion locks than concurrent
+ * inserters, this causes all the inserters to eventually migrate
+ * to a lock that no-one else is using. On a system with more
+ * inserters than locks, it still helps to distribute the inserters
+ * quite evenly across the locks.
+ */
+ lockToTry = (lockToTry + 1) % num_xloginsert_locks;
}
-
- /* Now loop back and try to acquire lock again. */
}
-
- /* We are done updating shared state of the lock itself. */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Fix the process wait semaphore's count for any absorbed wakeups.
- */
- while (extraWaits-- > 0)
- PGSemaphoreUnlock(&proc->sem);
-
- /*
- * Now okay to allow cancel/die interrupts.
- */
- RESUME_INTERRUPTS();
}
/*
- * Wake up all processes waiting for us with WaitOnSlot(). Sets our
- * xlogInsertingAt value to EndPos, without releasing the slot.
+ * Release our insertion lock (or locks, if we're holding them all).
*/
static void
-WakeupWaiters(XLogRecPtr EndPos)
+WALInsertLockRelease(void)
{
- volatile XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[MySlotNo].slot;
- PGPROC *head;
- PGPROC *proc;
- PGPROC *next;
-
- /*
- * If we have already reported progress up to the same point, do nothing.
- * No other process can modify xlogInsertingAt, so we can check this before
- * grabbing the spinlock.
- */
- if (slot->xlogInsertingAt == EndPos)
- return;
- /* xlogInsertingAt should not go backwards */
- Assert(slot->xlogInsertingAt < EndPos);
-
- /* Acquire mutex. Time spent holding mutex should be short! */
- SpinLockAcquire(&slot->mutex);
-
- /* we should own the slot */
- Assert(slot->exclusive == 1 && slot->owner == MyProc);
-
- slot->xlogInsertingAt = EndPos;
-
- /*
- * See if there are any LW_WAIT_UNTIL_FREE waiters that need to be woken
- * up. They are always in the front of the queue.
- */
- head = slot->head;
+ int i;
- if (head != NULL && head->lwWaitMode == LW_WAIT_UNTIL_FREE)
+ if (holdingAllLocks)
{
- proc = head;
- next = proc->lwWaitLink;
- while (next && next->lwWaitMode == LW_WAIT_UNTIL_FREE)
+ for (i = 0; i < num_xloginsert_locks; i++)
{
- proc = next;
- next = next->lwWaitLink;
+ /* see below */
+ WALInsertLocks[i].l.insertingAt = 0;
+ LWLockRelease(&WALInsertLocks[i].l.lock);
}
-
- /* proc is now the last PGPROC to be released */
- slot->head = next;
- proc->lwWaitLink = NULL;
+ holdingAllLocks = false;
}
else
- head = NULL;
-
- /* We are done updating shared state of the lock itself. */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Awaken any waiters I removed from the queue.
- */
- while (head != NULL)
{
- proc = head;
- head = proc->lwWaitLink;
- proc->lwWaitLink = NULL;
- proc->lwWaiting = false;
- PGSemaphoreUnlock(&proc->sem);
+ /*
+ * Reset insertingAt value to ease debugging. It would be OK to let
+ * it point to any old value as long as it's smaller than the current
+ * end of reserved WAL, but seems tidier to reset it to zero.
+ *
+ * No need to hold a spinlock while we reset it, because we don't care
+ * if someone transiently sees a bogus value. We're just about to
+ * release the lock anyway, waking up anyone who might wait for us.
+ */
+ WALInsertLocks[MyLockNo].l.insertingAt = 0;
+ LWLockRelease(&WALInsertLocks[MyLockNo].l.lock);
}
}
/*
- * Release our insertion slot (or slots, if we're holding them all).
+ * Update our insertingAt value, to let others know that we've finished
+ * inserting up to that point.
*/
static void
-WALInsertSlotRelease(void)
+WALInsertLockWakeup(XLogRecPtr insertingAt)
{
int i;
- if (holdingAllSlots)
+ if (holdingAllLocks)
{
- for (i = 0; i < num_xloginsert_slots; i++)
- WALInsertSlotReleaseOne(i);
- holdingAllSlots = false;
+ /*
+ * We use the last lock to mark our actual position. The others
+ * are kept at max value, so that no-one sleeps on them.
+ */
+ LWLockWakeup(&WALInsertLocks[num_xloginsert_locks - 1].l.lock,
+ &WALInsertLocks[num_xloginsert_locks - 1].l.insertingAt,
+ insertingAt);
}
else
- WALInsertSlotReleaseOne(MySlotNo);
+ LWLockWakeup(&WALInsertLocks[MyLockNo].l.lock,
+ &WALInsertLocks[MyLockNo].l.insertingAt,
+ insertingAt);
}
-static void
-WALInsertSlotReleaseOne(int slotno)
-{
- volatile XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[slotno].slot;
- PGPROC *head;
- PGPROC *proc;
-
- /* Acquire mutex. Time spent holding mutex should be short! */
- SpinLockAcquire(&slot->mutex);
-
- /* we must be holding it */
- Assert(slot->exclusive == 1 && slot->owner == MyProc);
-
- slot->xlogInsertingAt = InvalidXLogRecPtr;
-
- /* Release my hold on the slot */
- slot->exclusive = 0;
- slot->owner = NULL;
-
- /*
- * See if I need to awaken any waiters..
- */
- head = slot->head;
- if (head != NULL)
- {
- if (slot->releaseOK)
- {
- /*
- * Remove the to-be-awakened PGPROCs from the queue.
- */
- bool releaseOK = true;
-
- proc = head;
-
- /*
- * First wake up any backends that want to be woken up without
- * acquiring the lock. These are always in the front of the queue.
- */
- while (proc->lwWaitMode == LW_WAIT_UNTIL_FREE && proc->lwWaitLink)
- proc = proc->lwWaitLink;
-
- /*
- * Awaken the first exclusive-waiter, if any.
- */
- if (proc->lwWaitLink)
- {
- Assert(proc->lwWaitLink->lwWaitMode == LW_EXCLUSIVE);
- proc = proc->lwWaitLink;
- releaseOK = false;
- }
- /* proc is now the last PGPROC to be released */
- slot->head = proc->lwWaitLink;
- proc->lwWaitLink = NULL;
-
- slot->releaseOK = releaseOK;
- }
- else
- head = NULL;
- }
-
- /* We are done updating shared state of the slot itself. */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Awaken any waiters I removed from the queue.
- */
- while (head != NULL)
- {
- proc = head;
- head = proc->lwWaitLink;
- proc->lwWaitLink = NULL;
- proc->lwWaiting = false;
- PGSemaphoreUnlock(&proc->sem);
- }
-
- /*
- * Now okay to allow cancel/die interrupts.
- */
- END_CRIT_SECTION();
-}
-
-
/*
* Wait for any WAL insertions < upto to finish.
*
@@ -2029,79 +1692,48 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
}
/*
+ * Loop through all the locks, sleeping on any in-progress insert older
+ * than 'upto'.
+ *
* finishedUpto is our return value, indicating the point upto which
* all the WAL insertions have been finished. Initialize it to the head
- * of reserved WAL, and as we iterate through the insertion slots, back it
+ * of reserved WAL, and as we iterate through the insertion locks, back it
* out for any insertion that's still in progress.
*/
finishedUpto = reservedUpto;
-
- /*
- * Loop through all the slots, sleeping on any in-progress insert older
- * than 'upto'.
- */
- for (i = 0; i < num_xloginsert_slots; i++)
+ for (i = 0; i < num_xloginsert_locks; i++)
{
- volatile XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[i].slot;
- XLogRecPtr insertingat;
-
- retry:
- /*
- * We can check if the slot is in use without grabbing the spinlock.
- * The spinlock acquisition of insertpos_lck before this loop acts
- * as a memory barrier. If someone acquires the slot after that, it
- * can't possibly be inserting to anything < reservedUpto. If it was
- * acquired before that, an unlocked test will return true.
- */
- if (!slot->exclusive)
- continue;
-
- SpinLockAcquire(&slot->mutex);
- /* re-check now that we have the lock */
- if (!slot->exclusive)
- {
- SpinLockRelease(&slot->mutex);
- continue;
- }
- insertingat = slot->xlogInsertingAt;
- SpinLockRelease(&slot->mutex);
-
- if (insertingat == InvalidXLogRecPtr)
+ XLogRecPtr insertingat = InvalidXLogRecPtr;
+ do
{
/*
- * slot is reserved just to hold off other inserters, there is no
- * actual insert in progress.
+ * See if this insertion is in progress. LWLockWait will wait for
+ * the lock to be released, or for the 'value' to be set by a
+ * LWLockWakeup call. When a lock is initially acquired, its
+ * value is 0 (InvalidXLogRecPtr), which means that we don't know
+ * where it's inserting yet. We will have to wait for it. If it's
+ * a small insertion, the record will most likely fit on the same
+ * page and the inserter will release the lock without ever
+ * calling LWLockWakeup. But if it has to cross a page, it will
+ * advertise the insertion point with LWLockWakeup.
*/
- continue;
- }
+ if (LWLockWait(&WALInsertLocks[i].l.lock,
+ &WALInsertLocks[i].l.insertingAt,
+ insertingat, &insertingat))
+ {
+ /* the lock was free, so no insertion in progress */
+ insertingat = InvalidXLogRecPtr;
+ break;
+ }
- /*
- * This insertion is still in progress. Do we need to wait for it?
- *
- * When an inserter acquires a slot, it doesn't reset 'insertingat', so
- * it will initially point to the old value of some already-finished
- * insertion. The inserter will update the value as soon as it finishes
- * the insertion, moves to the next page, or has to do I/O to flush an
- * old dirty buffer. That means that when we see a slot with
- * insertingat value < upto, we don't know if that insertion is still
- * truly in progress, or if the slot is reused by a new inserter that
- * hasn't updated the insertingat value yet. We have to assume it's the
- * latter, and wait.
- */
- if (insertingat < upto)
- {
- WaitOnSlot(slot, insertingat);
- goto retry;
- }
- else
- {
/*
- * We don't need to wait for this insertion, but update the
- * return value.
+ * This insertion is still in progress. Have to wait, unless the
+ * inserter has proceeded past 'upto'.
*/
- if (insertingat < finishedUpto)
- finishedUpto = insertingat;
- }
+ } while (insertingat < upto);
+
+ if (insertingat != InvalidXLogRecPtr && insertingat < finishedUpto)
+ finishedUpto = insertingat;
}
return finishedUpto;
}
@@ -2115,8 +1747,8 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
*
* The caller must ensure that the page containing the requested location
* isn't evicted yet, and won't be evicted. The way to ensure that is to
- * hold onto an XLogInsertSlot with the xlogInsertingAt position set to
- * something <= ptr. GetXLogBuffer() will update xlogInsertingAt if it needs
+ * hold onto a WAL insertion lock with the insertingAt position set to
+ * something <= ptr. GetXLogBuffer() will update insertingAt if it needs
* to evict an old page from the buffer. (This means that once you call
* GetXLogBuffer() with a given 'ptr', you must not access anything before
* that point anymore, and must not call GetXLogBuffer() with an older 'ptr'
@@ -2176,7 +1808,7 @@ GetXLogBuffer(XLogRecPtr ptr)
* Let others know that we're finished inserting the record up
* to the page boundary.
*/
- WakeupWaiters(expectedEndPtr - XLOG_BLCKSZ);
+ WALInsertLockWakeup(expectedEndPtr - XLOG_BLCKSZ);
AdvanceXLInsertBuffer(ptr, false);
endptr = XLogCtl->xlblocks[idx];
@@ -5071,8 +4703,8 @@ XLOGShmemSize(void)
/* XLogCtl */
size = sizeof(XLogCtlData);
- /* xlog insertion slots, plus alignment */
- size = add_size(size, mul_size(sizeof(XLogInsertSlotPadded), num_xloginsert_slots + 1));
+ /* WAL insertion locks, plus alignment */
+ size = add_size(size, mul_size(sizeof(WALInsertLockPadded), num_xloginsert_locks + 1));
/* xlblocks array */
size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
/* extra alignment padding for XLOG I/O buffers */
@@ -5120,11 +4752,27 @@ XLOGShmemInit(void)
memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
- /* Xlog insertion slots. Ensure they're aligned to the full padded size */
- allocptr += sizeof(XLogInsertSlotPadded) -
- ((uintptr_t) allocptr) % sizeof(XLogInsertSlotPadded);
- XLogCtl->Insert.insertSlots = (XLogInsertSlotPadded *) allocptr;
- allocptr += sizeof(XLogInsertSlotPadded) * num_xloginsert_slots;
+
+ /* WAL insertion locks. Ensure they're aligned to the full padded size */
+ allocptr += sizeof(WALInsertLockPadded) -
+ ((uintptr_t) allocptr) % sizeof(WALInsertLockPadded);
+ WALInsertLocks = XLogCtl->Insert.WALInsertLocks =
+ (WALInsertLockPadded *) allocptr;
+ allocptr += sizeof(WALInsertLockPadded) * num_xloginsert_locks;
+
+ XLogCtl->Insert.WALInsertLockTrancheId = LWLockNewTrancheId();
+
+ XLogCtl->Insert.WALInsertLockTranche.name = "WALInsertLocks";
+ XLogCtl->Insert.WALInsertLockTranche.array_base = WALInsertLocks;
+ XLogCtl->Insert.WALInsertLockTranche.array_stride = sizeof(WALInsertLockPadded);
+
+ LWLockRegisterTranche(XLogCtl->Insert.WALInsertLockTrancheId, &XLogCtl->Insert.WALInsertLockTranche);
+ for (i = 0; i < num_xloginsert_locks; i++)
+ {
+ LWLockInitialize(&WALInsertLocks[i].l.lock,
+ XLogCtl->Insert.WALInsertLockTrancheId);
+ WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+ }
/*
* Align the start of the page buffers to a full xlog block size boundary.
@@ -5144,19 +4792,6 @@ XLOGShmemInit(void)
XLogCtl->SharedHotStandbyActive = false;
XLogCtl->WalWriterSleeping = false;
- for (i = 0; i < num_xloginsert_slots; i++)
- {
- XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[i].slot;
- SpinLockInit(&slot->mutex);
- slot->xlogInsertingAt = InvalidXLogRecPtr;
- slot->owner = NULL;
-
- slot->releaseOK = true;
- slot->exclusive = 0;
- slot->head = NULL;
- slot->tail = NULL;
- }
-
SpinLockInit(&XLogCtl->Insert.insertpos_lck);
SpinLockInit(&XLogCtl->info_lck);
SpinLockInit(&XLogCtl->ulsn_lck);
@@ -7883,6 +7518,11 @@ InitXLOGAccess(void)
ThisTimeLineID = XLogCtl->ThisTimeLineID;
Assert(ThisTimeLineID != 0 || IsBootstrapProcessingMode());
+ /* Initialize our copy of WALInsertLocks and register the tranche */
+ WALInsertLocks = XLogCtl->Insert.WALInsertLocks;
+ LWLockRegisterTranche(XLogCtl->Insert.WALInsertLockTrancheId,
+ &XLogCtl->Insert.WALInsertLockTranche);
+
/* Use GetRedoRecPtr to copy the RedoRecPtr safely */
(void) GetRedoRecPtr();
}
@@ -7901,7 +7541,7 @@ GetRedoRecPtr(void)
/*
* The possibly not up-to-date copy in XlogCtl is enough. Even if we
- * grabbed a WAL insertion slot to read the master copy, someone might
+ * grabbed a WAL insertion lock to read the master copy, someone might
* update it just after we've released the lock.
*/
SpinLockAcquire(&xlogctl->info_lck);
@@ -7919,7 +7559,7 @@ GetRedoRecPtr(void)
*
* NOTE: The value *actually* returned is the position of the last full
* xlog page. It lags behind the real insert position by at most 1 page.
- * For that, we don't need to scan through WAL insertion slots, and an
+ * For that, we don't need to scan through WAL insertion locks, and an
* approximation is enough for the current usage of this function.
*/
XLogRecPtr
@@ -8280,7 +7920,7 @@ CreateCheckPoint(int flags)
* We must block concurrent insertions while examining insert state to
* determine the checkpoint REDO pointer.
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
/*
@@ -8305,7 +7945,7 @@ CreateCheckPoint(int flags)
MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
ControlFile->checkPoint == ControlFile->checkPointCopy.redo)
{
- WALInsertSlotRelease();
+ WALInsertLockRelease();
LWLockRelease(CheckpointLock);
END_CRIT_SECTION();
return;
@@ -8349,7 +7989,7 @@ CreateCheckPoint(int flags)
/*
* Here we update the shared RedoRecPtr for future XLogInsert calls; this
- * must be done while holding the insertion slots.
+ * must be done while holding all the insertion locks.
*
* Note: if we fail to complete the checkpoint, RedoRecPtr will be left
* pointing past where it really needs to point. This is okay; the only
@@ -8361,10 +8001,10 @@ CreateCheckPoint(int flags)
RedoRecPtr = xlogctl->Insert.RedoRecPtr = checkPoint.redo;
/*
- * Now we can release the WAL insertion slots, allowing other xacts to
+ * Now we can release the WAL insertion locks, allowing other xacts to
* proceed while we are flushing disk buffers.
*/
- WALInsertSlotRelease();
+ WALInsertLockRelease();
/* Update the info_lck-protected copy of RedoRecPtr as well */
SpinLockAcquire(&xlogctl->info_lck);
@@ -8394,7 +8034,7 @@ CreateCheckPoint(int flags)
* we wait till he's out of his commit critical section before proceeding.
* See notes in RecordTransactionCommit().
*
- * Because we've already released the insertion slots, this test is a bit
+ * Because we've already released the insertion locks, this test is a bit
* fuzzy: it is possible that we will wait for xacts we didn't really need
* to wait for. But the delay should be short and it seems better to make
* checkpoint take a bit longer than to hold off insertions longer than
@@ -8625,10 +8265,10 @@ CreateEndOfRecoveryRecord(void)
xlrec.end_time = time(NULL);
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
xlrec.ThisTimeLineID = ThisTimeLineID;
xlrec.PrevTimeLineID = XLogCtl->PrevTimeLineID;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
LocalSetXLogInsertAllowed();
@@ -8834,9 +8474,9 @@ CreateRestartPoint(int flags)
* during recovery this is just pro forma, because no WAL insertions are
* happening.
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
/* Also update the info_lck-protected copy */
SpinLockAcquire(&xlogctl->info_lck);
@@ -9296,9 +8936,9 @@ UpdateFullPageWrites(void)
*/
if (fullPageWrites)
{
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
Insert->fullPageWrites = true;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
}
/*
@@ -9319,9 +8959,9 @@ UpdateFullPageWrites(void)
if (!fullPageWrites)
{
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
Insert->fullPageWrites = false;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
}
END_CRIT_SECTION();
}
@@ -9952,15 +9592,15 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
* Note that forcePageWrites has no effect during an online backup from
* the standby.
*
- * We must hold all the insertion slots to change the value of
+ * We must hold all the insertion locks to change the value of
* forcePageWrites, to ensure adequate interlocking against XLogInsert().
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
if (exclusive)
{
if (XLogCtl->Insert.exclusiveBackup)
{
- WALInsertSlotRelease();
+ WALInsertLockRelease();
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("a backup is already in progress"),
@@ -9971,7 +9611,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
else
XLogCtl->Insert.nonExclusiveBackups++;
XLogCtl->Insert.forcePageWrites = true;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
/* Ensure we release forcePageWrites if fail below */
PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
@@ -10086,13 +9726,13 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
* taking a checkpoint right after another is not that expensive
* either because only few buffers have been dirtied yet.
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
if (XLogCtl->Insert.lastBackupStart < startpoint)
{
XLogCtl->Insert.lastBackupStart = startpoint;
gotUniqueStartpoint = true;
}
- WALInsertSlotRelease();
+ WALInsertLockRelease();
} while (!gotUniqueStartpoint);
XLByteToSeg(startpoint, _logSegNo);
@@ -10182,7 +9822,7 @@ pg_start_backup_callback(int code, Datum arg)
bool exclusive = DatumGetBool(arg);
/* Update backup counters and forcePageWrites on failure */
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
if (exclusive)
{
Assert(XLogCtl->Insert.exclusiveBackup);
@@ -10199,7 +9839,7 @@ pg_start_backup_callback(int code, Datum arg)
{
XLogCtl->Insert.forcePageWrites = false;
}
- WALInsertSlotRelease();
+ WALInsertLockRelease();
}
/*
@@ -10268,7 +9908,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
/*
* OK to update backup counters and forcePageWrites
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
if (exclusive)
XLogCtl->Insert.exclusiveBackup = false;
else
@@ -10288,7 +9928,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
{
XLogCtl->Insert.forcePageWrites = false;
}
- WALInsertSlotRelease();
+ WALInsertLockRelease();
if (exclusive)
{
@@ -10573,7 +10213,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
void
do_pg_abort_backup(void)
{
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquire(true);
Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
XLogCtl->Insert.nonExclusiveBackups--;
@@ -10582,7 +10222,7 @@ do_pg_abort_backup(void)
{
XLogCtl->Insert.forcePageWrites = false;
}
- WALInsertSlotRelease();
+ WALInsertLockRelease();
}
/*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 82ef440..f88bf76 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -10,6 +10,14 @@
* locking should be done with the full lock manager --- which depends on
* LWLocks to protect its shared state.
*
+ * In addition to exclusive and shared modes, lightweight locks can be used
+ * to wait until a variable changes value. The variable is set with
+ * LWLockWakeup, and waited for with LWLockWait. LWLockWait waits until the
+ * lock is free, or the variable changes. LWLockWakeup assigns to the
+ * variable, waking up any LWLockWait() callers, without releasing the lock.
+ * The meaning of the value assigned is up to the caller, the lightweight
+ * lock code just assigns and compares it.
+ *
*
* Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -443,16 +451,18 @@ LWLockInitialize(LWLock *lock, int tranche_id)
/*
* LWLockAcquire - acquire a lightweight lock in the specified mode
*
- * If the lock is not available, sleep until it is.
+ * If the lock is not available, sleep until it is. Returns true if the lock
+ * was available immediately, false if we had to sleep.
*
* Side effect: cancel/die interrupts are held off until lock release.
*/
-void
+bool
LWLockAcquire(LWLock *l, LWLockMode mode)
{
volatile LWLock *lock = l;
PGPROC *proc = MyProc;
bool retry = false;
+ bool result = true;
int extraWaits = 0;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -601,6 +611,7 @@ LWLockAcquire(LWLock *l, LWLockMode mode)
/* Now loop back and try to acquire lock again. */
retry = true;
+ result = false;
}
/* We are done updating shared state of the lock itself. */
@@ -616,6 +627,8 @@ LWLockAcquire(LWLock *l, LWLockMode mode)
*/
while (extraWaits-- > 0)
PGSemaphoreUnlock(&proc->sem);
+
+ return result;
}
/*
@@ -835,6 +848,226 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
}
/*
+ * LWLockWait - Wait until lock is free, or a variable is updated.
+ *
+ * If the lock is held, and the *valptr equals oldval, waits until the lock
+ * is either freed, or the lock holder updates *valptr by calling
+ * LWLockWakeup. If the lock is free on exit (immediately or after waiting),
+ * returns true. If the lock is still held, but *valptr no longer matches
+ * oldval, returns false and sets *newval to the current value in *valptr.
+ *
+ * It's possible that the lock holder releases the lock, but another backend
+ * acquires it again before we get a chance to observe that the lock was
+ * momentarily released. We wouldn't need to wait for the new lock holder, but
+ * we cannot distinguish that case, so we will have to wait.
+ *
+ * Note: this function ignores shared lock holders; if the lock is held
+ * in shared mode, returns 'true'.
+ */
+bool
+LWLockWait(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
+{
+ volatile LWLock *lock = l;
+ volatile uint64 *valp = valptr;
+ PGPROC *proc = MyProc;
+ int extraWaits = 0;
+ bool result = false;
+
+ /*
+ * Quick test first to see if the lock is free right now.
+ *
+ * XXX: the caller uses a spinlock before this, so we don't need a memory
+ * barrier here as far as the current usage is concerned. But that might
+ * not be safe in general.
+ */
+ if (lock->exclusive == 0)
+ return true;
+
+ /*
+ * Lock out cancel/die interrupts while we sleep on the lock. There is
+ * no cleanup mechanism to remove us from the wait queue if we got
+ * interrupted.
+ */
+ HOLD_INTERRUPTS();
+
+ /*
+ * Loop here to check the lock's status after each time we are signaled.
+ */
+ for (;;)
+ {
+ bool mustwait;
+ uint64 value;
+
+ /* Acquire mutex. Time spent holding mutex should be short! */
+#ifdef LWLOCK_STATS
+ lwstats->spin_delay_count += SpinLockAcquire(&lock->mutex);
+#else
+ SpinLockAcquire(&lock->mutex);
+#endif
+
+ /* Is the lock now free, and if not, does the value match? */
+ if (lock->exclusive == 0)
+ {
+ result = true;
+ mustwait = false;
+ }
+ else
+ {
+ value = *valp;
+ if (value != oldval)
+ {
+ result = false;
+ mustwait = false;
+ *newval = value;
+ }
+ else
+ mustwait = true;
+ }
+
+ if (!mustwait)
+ break; /* the lock was free or value didn't match */
+
+ /*
+ * Add myself to wait queue.
+ */
+ proc->lwWaiting = true;
+ proc->lwWaitMode = LW_WAIT_UNTIL_FREE;
+ proc->lwWaitLink = NULL;
+
+ /* waiters are added to the front of the queue */
+ proc->lwWaitLink = lock->head;
+ if (lock->head == NULL)
+ lock->tail = proc;
+ lock->head = proc;
+
+ /* Can release the mutex now */
+ SpinLockRelease(&lock->mutex);
+
+ /*
+ * Wait until awakened.
+ *
+ * Since we share the process wait semaphore with other things, like
+ * the regular lock manager and ProcWaitForSignal, and we may need to
+ * acquire an LWLock while one of those is pending, it is possible that
+ * we get awakened for a reason other than being signaled by
+ * LWLockRelease. If so, loop back and wait again. Once we've gotten
+ * the LWLock, re-increment the sema by the number of additional
+ * signals received, so that the lock manager or signal manager will
+ * see the received signal when it next waits.
+ */
+ LOG_LWDEBUG("LWLockWait", T_NAME(l), T_ID(l), "waiting");
+
+#ifdef LWLOCK_STATS
+ lwstats->block_count++;
+#endif
+
+ TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
+
+ for (;;)
+ {
+ /* "false" means cannot accept cancel/die interrupt here. */
+ PGSemaphoreLock(&proc->sem, false);
+ if (!proc->lwWaiting)
+ break;
+ extraWaits++;
+ }
+
+ TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
+
+ LOG_LWDEBUG("LWLockWait", T_NAME(l), T_ID(l), "awakened");
+
+ /* Now loop back and check the status of the lock again. */
+ }
+
+ /* We are done updating shared state of the lock itself. */
+ SpinLockRelease(&lock->mutex);
+
+ TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
+
+ /*
+ * Fix the process wait semaphore's count for any absorbed wakeups.
+ */
+ while (extraWaits-- > 0)
+ PGSemaphoreUnlock(&proc->sem);
+
+ /*
+ * Now okay to allow cancel/die interrupts.
+ */
+ RESUME_INTERRUPTS();
+
+ return result;
+}
+
+
+/*
+ * LWLockWakeup - Update a variable and wake up waiters atomically
+ *
+ * Sets *valptr to 'val', and wakes up all processes waiting for us with
+ * LWLockWait(). Setting the value and waking up the process happen
+ * atomically so that any process calling LWLockWait() on the same lock is
+ * guaranteed to see the new value, and act accordingly.
+ *
+ * The caller must be holding the lock in exclusive mode.
+ */
+void
+LWLockWakeup(LWLock *l, uint64 *valptr, uint64 val)
+{
+ volatile LWLock *lock = l;
+ volatile uint64 *valp = valptr;
+ PGPROC *head;
+ PGPROC *proc;
+ PGPROC *next;
+
+ /* Acquire mutex. Time spent holding mutex should be short! */
+ SpinLockAcquire(&lock->mutex);
+
+ /* we should hold the lock */
+ Assert(lock->exclusive == 1);
+
+ /* Update the lock's value */
+ *valp = val;
+
+ /*
+ * See if there are any LW_WAIT_UNTIL_FREE waiters that need to be woken
+ * up. They are always in the front of the queue.
+ */
+ head = lock->head;
+
+ if (head != NULL && head->lwWaitMode == LW_WAIT_UNTIL_FREE)
+ {
+ proc = head;
+ next = proc->lwWaitLink;
+ while (next && next->lwWaitMode == LW_WAIT_UNTIL_FREE)
+ {
+ proc = next;
+ next = next->lwWaitLink;
+ }
+
+ /* proc is now the last PGPROC to be released */
+ lock->head = next;
+ proc->lwWaitLink = NULL;
+ }
+ else
+ head = NULL;
+
+ /* We are done updating shared state of the lock itself. */
+ SpinLockRelease(&lock->mutex);
+
+ /*
+ * Awaken any waiters I removed from the queue.
+ */
+ while (head != NULL)
+ {
+ proc = head;
+ head = proc->lwWaitLink;
+ proc->lwWaitLink = NULL;
+ proc->lwWaiting = false;
+ PGSemaphoreUnlock(&proc->sem);
+ }
+}
+
+
+/*
* LWLockRelease - release a previously acquired lock
*/
void
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index b27cb89..9e875e0 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2123,12 +2123,12 @@ static struct config_int ConfigureNamesInt[] =
},
{
- {"xloginsert_slots", PGC_POSTMASTER, WAL_SETTINGS,
- gettext_noop("Sets the number of slots for concurrent xlog insertions."),
+ {"xloginsert_locks", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Sets the number of locks used for concurrent xlog insertions."),
NULL,
GUC_NOT_IN_SAMPLE
},
- &num_xloginsert_slots,
+ &num_xloginsert_locks,
8, 1, 1000,
NULL, NULL, NULL
},
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 11ab277..f5152f4 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -192,7 +192,7 @@ extern bool EnableHotStandby;
extern bool fullPageWrites;
extern bool wal_log_hints;
extern bool log_checkpoints;
-extern int num_xloginsert_slots;
+extern int num_xloginsert_locks;
/* WAL levels */
typedef enum WalLevel
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8840c79..2c056d8 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -169,9 +169,11 @@ typedef enum LWLockMode
extern bool Trace_lwlocks;
#endif
-extern void LWLockAcquire(LWLock *lock, LWLockMode mode);
+extern bool LWLockAcquire(LWLock *lock, LWLockMode mode);
extern bool LWLockConditionalAcquire(LWLock *lock, LWLockMode mode);
extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode);
+extern bool LWLockWait(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval);
+extern void LWLockWakeup(LWLock *lock, uint64 *valptr, uint64 value);
extern void LWLockRelease(LWLock *lock);
extern void LWLockReleaseAll(void);
extern bool LWLockHeldByMe(LWLock *lock);
On 2014-03-07 17:54:32 +0200, Heikki Linnakangas wrote:
So there are some unexplained differences there, but based on these results,
I'm still OK with committing the patch.
So, I am looking at this right now.
I think there are some minor things I'd like to see addressed:
1) I think there needs to be a good sized comment explaining why
WaitXLogInsertionsToFinish() isn't racy due to the unlocked read at
the beginning of LWLockWait(). I think it's safe because we're
reading Insert->CurrBytePos inside a spinlock, and it will only ever
increment. As SpinLockAcquire() has to be a read barrier, we can
assume that every skewed read in LWLockWait() will be for a lock
protecting a newer insertingAt?
2) I am not particularly happy about the LWLockWait() LWLockWakeup()
function names. They sound too much like a part of the normal lwlock
implementation to me. But admittedly I don't have a great idea for
a better naming scheme. Maybe LWLockWaitForVar(),
LWLockWakeupVarWaiter()?
3) I am the wrong one to complain, I know, but the comments above struct
WALInsertLock are pretty hard to read from the sentence structure.
4) WALInsertLockAcquire() needs to comment on acquiring/waking all but
the last lock. Generally the trick of exclusive xlog insertion lock
acquisition only really using the last lock could use a bit more
docs.
5) WALInsertLockRelease() comments on the reset of insertingAt being
optional, but I am not convinced that that's true anymore. If an
exclusive acquisition isn't seen as 0 or
INT64CONST(0xFFFFFFFFFFFFFFFF) by another backend we're in trouble,
right? Absolutely not sure without thinking on it for longer than I
can concentrate right now.
6) Pretty minor, but from a style POV it seems nicer to separate
exclusive/nonexclusive out of WALInsertLockAcquire(). The cases don't
share any code now.
A patch containing some trivial changes is attached...
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachments:
minor-things.patch (text/x-patch)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 484b9c5..8a55c6b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1628,8 +1628,6 @@ WALInsertLockRelease(void)
static void
WALInsertLockWakeup(XLogRecPtr insertingAt)
{
- int i;
-
if (holdingAllLocks)
{
/*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index f88bf76..2695128 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -873,6 +873,9 @@ LWLockWait(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
int extraWaits = 0;
bool result = false;
+ /* can't be used with shared locks for now */
+ Assert(lock->shared == 0);
+
/*
* Quick test first to see if the lock is free right now.
*
@@ -905,6 +908,8 @@ LWLockWait(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
SpinLockAcquire(&lock->mutex);
#endif
+ Assert(lock->shared == 0);
+
/* Is the lock now free, and if not, does the value match? */
if (lock->exclusive == 0)
{
@@ -1022,6 +1027,7 @@ LWLockWakeup(LWLock *l, uint64 *valptr, uint64 val)
SpinLockAcquire(&lock->mutex);
/* we should hold the lock */
+ Assert(LWLockHeldByMe(l));
Assert(lock->exclusive == 1);
/* Update the lock's value */
On 03/12/2014 09:29 PM, Andres Freund wrote:
On 2014-03-07 17:54:32 +0200, Heikki Linnakangas wrote:
So there are some unexplained differences there, but based on these results,
I'm still OK with committing the patch.
So, I am looking at this right now.
I think there are some minor things I'd like to see addressed:
1) I think there needs to be a good sized comment explaining why
WaitXLogInsertionsToFinish() isn't racy due to the unlocked read at
the beginning of LWLockWait().
There's a comment inside LWLockWait(). I think that's the right place
for it; it's LWLockWait() that's cheating by not acquiring the spinlock
before reading lock->exclusive.
I think it's safe because we're
reading Insert->CurrBytePos inside a spinlock, and it will only ever
increment. As SpinLockAcquire() has to be a read barrier, we can
assume that every skewed read in LWLockWait() will be for a lock
protecting a newer insertingAt?
Right. If the quick test in the beginning of LWLockWait() returns
'true', even though the lock was *just* acquired by a different backend,
it nevertheless must've been free some time after we read CurrBytePos in
WaitXLogInsertionsToFinish(), and that's good enough.
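To make that concrete, here's the shape of the relevant code, reduced to the
ordering-sensitive steps (just a sketch of what the patch above already does,
using the same names):

    /* WaitXLogInsertionsToFinish(): read the reservation point under a spinlock */
    SpinLockAcquire(&Insert->insertpos_lck);
    bytepos = Insert->CurrBytePos;              /* only ever advances */
    SpinLockRelease(&Insert->insertpos_lck);    /* acts as a barrier */
    reservedUpto = XLogBytePosToEndRecPtr(bytepos);

    /* later, inside LWLockWait(), for each insertion lock: */
    if (lock->exclusive == 0)
        return true;            /* was free at some point after the barrier */

    SpinLockAcquire(&lock->mutex);              /* re-check under the mutex */
    if (lock->exclusive == 0)
        result = true;          /* really free now */
    else if (*valp != oldval)
        *newval = *valp;        /* holder has advertised progress */
    else
        mustwait = true;        /* genuinely have to wait */
    SpinLockRelease(&lock->mutex);

Any lock that's acquired after the insertpos_lck acquisition can only protect
an insertion at a position >= reservedUpto, and we never need to wait for
those.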
2) I am not particularly happy about the LWLockWait() LWLockWakeup()
function names. They sound too much like a part of the normal lwlock
implementation to me. But admittedly I don't have a great idea for
a better naming scheme. Maybe LWLockWaitForVar(),
LWLockWakeupVarWaiter()?
Works for me.
3) I am the wrong one to complain, I know, but the comments above struct
WALInsertLock are pretty hard to read from the sentence structure.
Hmm, ok. I reworded that, I hope it's more clear now.
4) WALInsertLockAcquire() needs to comment on acquiring/waking all but
the last lock. Generally the trick of exclusive xlog insertion lock
acquisition only really using the last lock could use a bit more
docs.
Added.
5) WALInsertLockRelease() comments on the reset of insertingAt being
optional, but I am not convinced that that's true anymore. If an
exclusive acquisition isn't seen as 0 or
INT64CONST(0xFFFFFFFFFFFFFFFF) by another backend we're in trouble,
right? Absolutely not sure without thinking on it for longer than I
can concentrate right now.
Hmm, right, it isn't optional when resetting the 0xFFFFFFFFFFFFFFFF
value back to zero.
Now that I look at it, even resetting it to zero in the normal,
non-exclusive case is more fiddly than it seems at first glance
(although still correct). We do this:
0. Acquire the WAL insertion lock, get insert position
1. Copy the WAL data to the shared buffer
2. Set insertingAt = 0
3. Release the lock.
However, nothing stops the compiler (or CPU on weak-memory-order
architectures) from rearranging the operations like this:
0. Acquire the WAL insertion lock, get insert position
1. Set insertingAt = 0
2. Copy the WAL data to the shared buffer
3. Release the lock.
Furthermore, setting the insertingAt value might be "torn" if a 64-bit
store is not atomic. That would be a problem if a backend saw the torn
value and incorrectly determined that it doesn't need to wait for it.
(If the compiler didn't reorder steps 1 and 2, that would be OK because
by the time we reset insertingAt, we have already copied the WAL data.)
We're saved by the fact that resetting insertingAt to 0 never moves the
value forwards, only backwards, even if someone sees a torn value. If we
used some magic value other than 0 to mean "uninitialized", we would
have trouble. But that's way more fiddly than I'd like, so let's make
that more robust.
What we really ought to do is to initialize insertingAt inside
LWLockAcquire (or LWLockRelease), while we're holding the lwlock's
spinlock. The only reason I didn't do that was to avoid having another
copy of LWLockAcquire, with the 'var', but maybe that was penny-wise and
pound-foolish.
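Something like this, just to illustrate the idea (a sketch only, not the
final code; the crucial bit is that the variable is stamped while we still
hold the lwlock's spinlock, so no-one can observe us as the lock holder
together with a stale value):

    /* in the acquire path, while holding the lock's spinlock: */
    SpinLockAcquire(&lock->mutex);
    if (lock->exclusive == 0 && lock->shared == 0)
    {
        lock->exclusive++;
        if (valptr != NULL)
            *valptr = val;   /* initialized before anyone can see us as holder */
        mustwait = false;
    }
    else
        mustwait = true;
    SpinLockRelease(&lock->mutex);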
New patch version attached. I added a new variant of LWLockAcquire,
called LWLockAcquireWithVar, to reset the variable atomically when the
lock is acquired. LWLockAcquire and the new function now just call an
internal function that implements both, but I'm now slightly worried that
it might hurt the performance of LWLockAcquire in general. The extra
indirection through the function call shouldn't add much overhead, but
LWLockAcquire is called so frequently that every cycle counts.
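FWIW, if that indirection turns out to matter, one way to keep the common
path as cheap as before would be to make the shared workhorse a static
inline and have both entry points call it (sketch only, names hypothetical):

    /* common workhorse; when valptr != NULL, the variable is stamped with
     * 'val' while the lock's spinlock is held */
    static inline bool LWLockAcquireCommon(LWLock *lock, LWLockMode mode,
                                           uint64 *valptr, uint64 val);

    bool
    LWLockAcquire(LWLock *lock, LWLockMode mode)
    {
        return LWLockAcquireCommon(lock, mode, NULL, 0);
    }

    bool
    LWLockAcquireWithVar(LWLock *lock, uint64 *valptr, uint64 val)
    {
        return LWLockAcquireCommon(lock, LW_EXCLUSIVE, valptr, val);
    }

With the workhorse inlined into both callers, the plain LWLockAcquire case
shouldn't pay anything for the unused variable handling.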
- Heikki
Attachments:
xlogslot-to-lwlock-4.patch (text/x-diff)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5f3c3b9..fd16f52 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -89,7 +89,7 @@ int sync_method = DEFAULT_SYNC_METHOD;
int wal_level = WAL_LEVEL_MINIMAL;
int CommitDelay = 0; /* precommit delay in microseconds */
int CommitSiblings = 5; /* # concurrent xacts needed to sleep */
-int num_xloginsert_slots = 8;
+int num_xloginsert_locks = 8;
#ifdef WAL_DEBUG
bool XLOG_DEBUG = false;
@@ -292,7 +292,7 @@ XLogRecPtr XactLastRecEnd = InvalidXLogRecPtr;
* (which is almost but not quite the same as a pointer to the most recent
* CHECKPOINT record). We update this from the shared-memory copy,
* XLogCtl->Insert.RedoRecPtr, whenever we can safely do so (ie, when we
- * hold an insertion slot). See XLogInsert for details. We are also allowed
+ * hold an insertion lock). See XLogInsert for details. We are also allowed
* to update from XLogCtl->RedoRecPtr if we hold the info_lck;
* see GetRedoRecPtr. A freshly spawned backend obtains the value during
* InitXLOGAccess.
@@ -364,63 +364,49 @@ typedef struct XLogwrtResult
XLogRecPtr Flush; /* last byte + 1 flushed */
} XLogwrtResult;
-
/*
- * A slot for inserting to the WAL. This is similar to an LWLock, the main
- * difference is that there is an extra xlogInsertingAt field that is protected
- * by the same mutex. Unlike an LWLock, a slot can only be acquired in
- * exclusive mode.
- *
- * The xlogInsertingAt field is used to advertise to other processes how far
- * the slot owner has progressed in inserting the record. When a backend
- * acquires a slot, it initializes xlogInsertingAt to 1, because it doesn't
- * yet know where it's going to insert the record. That's conservative
- * but correct; the new insertion is certainly going to go to a byte position
- * greater than 1. If another backend needs to flush the WAL, it will have to
- * wait for the new insertion. xlogInsertingAt is updated after finishing the
- * insert or when crossing a page boundary, which will wake up anyone waiting
- * for it, whether the wait was necessary in the first place or not.
- *
- * A process can wait on a slot in two modes: LW_EXCLUSIVE or
- * LW_WAIT_UNTIL_FREE. LW_EXCLUSIVE works like in an lwlock; when the slot is
- * released, the first LW_EXCLUSIVE waiter in the queue is woken up. Processes
- * waiting in LW_WAIT_UNTIL_FREE mode are woken up whenever the slot is
- * released, or xlogInsertingAt is updated. In other words, a process in
- * LW_WAIT_UNTIL_FREE mode is woken up whenever the inserter makes any progress
- * copying the record in place. LW_WAIT_UNTIL_FREE waiters are always added to
- * the front of the queue, while LW_EXCLUSIVE waiters are appended to the end.
- *
- * To join the wait queue, a process must set MyProc->lwWaitMode to the mode
- * it wants to wait in, MyProc->lwWaiting to true, and link MyProc to the head
- * or tail of the wait queue. The same mechanism is used to wait on an LWLock,
- * see lwlock.c for details.
+ * Inserting to WAL is protected by a bunch of WAL insertion locks. Each WAL
+ * insertion lock consists of a lightweight lock, plus an indicator of how
+ * far the insertion has progressed (insertingAt).
+ *
+ * The insertingAt values are read when a process wants to flush WAL from
+ * the in-memory buffers to disk, to check that all the insertions to the
+ * region the process is about to write out have finished. You could simply
+ * wait for all currently in-progress insertions to finish, but the
+ * insertingAt indicator allows you to ignore insertions to later in the WAL,
+ * so that you only wait for the insertions that are modifying the buffers
+ * you're about to write out.
+ *
+ * This isn't just an optimization. If all the WAL buffers are dirty, an
+ * inserter that's holding a WAL insert lock might need to evict an old WAL
+ * buffer, which requires flushing the WAL. If it's possible for an inserter
+ * to block on another inserter unnecessarily, deadlock can arise when two
+ * inserters holding a WAL insert lock wait for each other to finish their
+ * insertion.
+ *
+ * Small WAL records that don't cross a page boundary never update the value,
+ * the WAL record is just copied to the page and the lock is released. But
+ * to avoid the deadlock-scenario explained above, the indicator is always
+ * updated before sleeping while holding an insertion lock.
*/
typedef struct
{
- slock_t mutex; /* protects the below fields */
- XLogRecPtr xlogInsertingAt; /* insert has completed up to this point */
-
- PGPROC *owner; /* for debugging purposes */
-
- bool releaseOK; /* T if ok to release waiters */
- char exclusive; /* # of exclusive holders (0 or 1) */
- PGPROC *head; /* head of list of waiting PGPROCs */
- PGPROC *tail; /* tail of list of waiting PGPROCs */
- /* tail is undefined when head is NULL */
-} XLogInsertSlot;
+ LWLock lock;
+ XLogRecPtr insertingAt;
+} WALInsertLock;
/*
- * All the slots are allocated as an array in shared memory. We force the
- * array stride to be a power of 2, which saves a few cycles in indexing, but
- * more importantly also ensures that individual slots don't cross cache line
- * boundaries. (Of course, we have to also ensure that the array start
- * address is suitably aligned.)
+ * All the WAL insertion locks are allocated as an array in shared memory. We
+ * force the array stride to be a power of 2, which saves a few cycles in
+ * indexing, but more importantly also ensures that individual locks don't
+ * cross cache line boundaries. (Of course, we have to also ensure that the
+ * array start address is suitably aligned.)
*/
-typedef union XLogInsertSlotPadded
+typedef union WALInsertLockPadded
{
- XLogInsertSlot slot;
+ WALInsertLock l;
char pad[CACHE_LINE_SIZE];
-} XLogInsertSlotPadded;
+} WALInsertLockPadded;
/*
* Shared state data for XLogInsert.
@@ -455,8 +441,8 @@ typedef struct XLogCtlInsert
* we must WAL-log it before it actually affects WAL-logging by backends.
* Checkpointer sets at startup or after SIGHUP.
*
- * To read these fields, you must hold an insertion slot. To modify them,
- * you must hold ALL the slots.
+ * To read these fields, you must hold an insertion lock. To modify them,
+ * you must hold ALL the locks.
*/
XLogRecPtr RedoRecPtr; /* current redo point for insertions */
bool forcePageWrites; /* forcing full-page writes for PITR? */
@@ -473,8 +459,16 @@ typedef struct XLogCtlInsert
int nonExclusiveBackups;
XLogRecPtr lastBackupStart;
- /* insertion slots, see XLogInsertSlot struct above for details */
- XLogInsertSlotPadded *insertSlots;
+ /*
+ * To insert a new WAL record, you must hold a WAL insertion lock. Before
+ * 9.4, there was a single WALInsertLock, but that became a bottleneck
+ * on multi-core systems that insert a lot of WAL. Now, there are several
+ * WAL insertion locks, and to insert WAL, you must hold one of them (in
+ * exclusive mode). It doesn't matter which one.
+ */
+ WALInsertLockPadded *WALInsertLocks;
+ LWLockTranche WALInsertLockTranche;
+ int WALInsertLockTrancheId;
} XLogCtlInsert;
/*
@@ -612,6 +606,9 @@ typedef struct XLogCtlData
static XLogCtlData *XLogCtl = NULL;
+/* a private copy of XLogCtl->Insert.WALInsertLocks, for convenience */
+static WALInsertLockPadded *WALInsertLocks = NULL;
+
/*
* We maintain an image of pg_control in shared memory.
*/
@@ -735,9 +732,9 @@ static bool InRedo = false;
/* Have we launched bgwriter during recovery? */
static bool bgwriterLaunched = false;
-/* For WALInsertSlotAcquire/Release functions */
-static int MySlotNo = 0;
-static bool holdingAllSlots = false;
+/* For WALInsertLockAcquire/Release functions */
+static int MyLockNo = 0;
+static bool holdingAllLocks = false;
static void readRecoveryCommandFile(void);
static void exitArchiveRecovery(TimeLineID endTLI, XLogSegNo endLogSegNo);
@@ -811,16 +808,15 @@ static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
XLogRecPtr *PrevPtr);
static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
-static void WakeupWaiters(XLogRecPtr EndPos);
static char *GetXLogBuffer(XLogRecPtr ptr);
static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
-static void WALInsertSlotAcquire(bool exclusive);
-static void WALInsertSlotAcquireOne(int slotno);
-static void WALInsertSlotRelease(void);
-static void WALInsertSlotReleaseOne(int slotno);
+static void WALInsertLockAcquire(void);
+static void WALInsertLockAcquireExclusive(void);
+static void WALInsertLockRelease(void);
+static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
/*
* Insert an XLOG record having the specified RMID and info bytes,
@@ -897,7 +893,7 @@ XLogInsert(RmgrId rmid, uint8 info, XLogRecData *rdata)
*
* We may have to loop back to here if a race condition is detected below.
* We could prevent the race by doing all this work while holding an
- * insertion slot, but it seems better to avoid doing CRC calculations
+ * insertion lock, but it seems better to avoid doing CRC calculations
* while holding one.
*
* We add entries for backup blocks to the chain, so that they don't need
@@ -915,8 +911,8 @@ begin:;
/*
* Decide if we need to do full-page writes in this XLOG record: true if
* full_page_writes is on or we have a PITR request for it. Since we
- * don't yet have an insertion slot, fullPageWrites and forcePageWrites
- * could change under us, but we'll recheck them once we have a slot.
+ * don't yet have an insertion lock, fullPageWrites and forcePageWrites
+ * could change under us, but we'll recheck them once we have a lock.
*/
doPageWrites = Insert->fullPageWrites || Insert->forcePageWrites;
@@ -1090,16 +1086,15 @@ begin:;
* record in place. This can be done concurrently in multiple processes.
*
* To keep track of which insertions are still in-progress, each concurrent
- * inserter allocates an "insertion slot", which tells others how far the
+ * inserter acquires an insertion lock. In addition to just indicating that
+ * an insertion is in progress, the lock tells others how far the
* inserter has progressed. There is a small fixed number of insertion
- * slots, determined by the num_xloginsert_slots GUC. When an inserter
- * finishes, it updates the xlogInsertingAt of its slot to the end of the
- * record it inserted, to let others know that it's done. xlogInsertingAt
- * is also updated when crossing over to a new WAL buffer, to allow the
- * the previous buffer to be flushed.
+ * locks, determined by the num_xloginsert_locks GUC. When an inserter
+ * crosses a page boundary, it updates the value stored in the lock to
+ * how far it has inserted, to allow the previous buffer to be flushed.
*
- * Holding onto a slot also protects RedoRecPtr and fullPageWrites from
- * changing until the insertion is finished.
+ * Holding onto an insertion lock also protects RedoRecPtr and
+ * fullPageWrites from changing until the insertion is finished.
*
* Step 2 can usually be done completely in parallel. If the required WAL
* page is not initialized yet, you have to grab WALBufMappingLock to
@@ -1109,7 +1104,10 @@ begin:;
*----------
*/
START_CRIT_SECTION();
- WALInsertSlotAcquire(isLogSwitch);
+ if (isLogSwitch)
+ WALInsertLockAcquireExclusive();
+ else
+ WALInsertLockAcquire();
/*
* Check to see if my RedoRecPtr is out of date. If so, may have to go
@@ -1138,7 +1136,7 @@ begin:;
* Oops, this buffer now needs to be backed up, but we
* didn't think so above. Start over.
*/
- WALInsertSlotRelease();
+ WALInsertLockRelease();
END_CRIT_SECTION();
rdt_lastnormal->next = NULL;
info = info_orig;
@@ -1157,7 +1155,7 @@ begin:;
if ((Insert->fullPageWrites || Insert->forcePageWrites) && !doPageWrites)
{
/* Oops, must redo it with full-page data. */
- WALInsertSlotRelease();
+ WALInsertLockRelease();
END_CRIT_SECTION();
rdt_lastnormal->next = NULL;
info = info_orig;
@@ -1205,7 +1203,7 @@ begin:;
/*
* Done! Let others know that we're finished.
*/
- WALInsertSlotRelease();
+ WALInsertLockRelease();
MarkCurrentTransactionIdLoggedIfAny();
@@ -1366,7 +1364,7 @@ ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos, XLogRecPtr *PrevPtr)
/*
* These calculations are a bit heavy-weight to be done while holding a
- * spinlock, but since we're holding all the WAL insertion slots, there
+ * spinlock, but since we're holding all the WAL insertion locks, there
* are no other inserters competing for it. GetXLogInsertRecPtr() does
* compete for it, but that's not called very frequently.
*/
@@ -1526,7 +1524,7 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
while (CurrPos < EndPos)
{
/* initialize the next page (if not initialized already) */
- WakeupWaiters(CurrPos);
+ WALInsertLockUpdateInsertingAt(CurrPos);
AdvanceXLInsertBuffer(CurrPos, false);
CurrPos += XLOG_BLCKSZ;
}
@@ -1537,452 +1535,123 @@ CopyXLogRecordToWAL(int write_len, bool isLogSwitch, XLogRecData *rdata,
}
/*
- * Allocate a slot for insertion.
- *
- * In exclusive mode, all slots are reserved for the current process. That
- * blocks all concurrent insertions.
- */
-static void
-WALInsertSlotAcquire(bool exclusive)
-{
- int i;
-
- if (exclusive)
- {
- for (i = 0; i < num_xloginsert_slots; i++)
- WALInsertSlotAcquireOne(i);
- holdingAllSlots = true;
- }
- else
- WALInsertSlotAcquireOne(-1);
-}
-
-/*
- * Workhorse of WALInsertSlotAcquire. Acquires the given slot, or an arbitrary
- * one if slotno == -1. The index of the slot that was acquired is stored in
- * MySlotNo.
- *
- * This is more or less equivalent to LWLockAcquire().
+ * Acquire a WAL insertion lock, for inserting to WAL.
*/
static void
-WALInsertSlotAcquireOne(int slotno)
+WALInsertLockAcquire(void)
{
- volatile XLogInsertSlot *slot;
- PGPROC *proc = MyProc;
- bool retry = false;
- int extraWaits = 0;
- static int slotToTry = -1;
+ bool immed;
/*
- * Try to use the slot we used last time. If the system isn't particularly
- * busy, it's a good bet that it's available, and it's good to have some
- * affinity to a particular slot so that you don't unnecessarily bounce
- * cache lines between processes when there is no contention.
+ * It doesn't matter which of the WAL insertion locks we acquire, so try
+ * the one we used last time. If the system isn't particularly busy,
+ * it's a good bet that it's still available, and it's good to have some
+ * affinity to a particular lock so that you don't unnecessarily bounce
+ * cache lines between processes when there's no contention.
*
- * If this is the first time through in this backend, pick a slot
- * (semi-)randomly. This allows the slots to be used evenly if you have a
- * lot of very short connections.
+ * If this is the first time through in this backend, pick a lock
+ * (semi-)randomly. This allows the locks to be used evenly if you have
+ * a lot of very short connections.
*/
- if (slotno != -1)
- MySlotNo = slotno;
- else
- {
- if (slotToTry == -1)
- slotToTry = MyProc->pgprocno % num_xloginsert_slots;
- MySlotNo = slotToTry;
- }
+ static int lockToTry = -1;
- /*
- * We can't wait if we haven't got a PGPROC. This should only occur
- * during bootstrap or shared memory initialization. Put an Assert here
- * to catch unsafe coding practices.
- */
- Assert(MyProc != NULL);
+ if (lockToTry == -1)
+ lockToTry = MyProc->pgprocno % num_xloginsert_locks;
+ MyLockNo = lockToTry;
/*
- * Lock out cancel/die interrupts until we exit the code section protected
- * by the slot. This ensures that interrupts will not interfere with
- * manipulations of data structures in shared memory. There is no cleanup
- * mechanism to release the slot if the backend dies while holding one,
- * so make this a critical section.
+ * The insertingAt value is initially set to 0, as we don't know our
+ * insert location yet.
*/
- START_CRIT_SECTION();
-
- /*
- * Loop here to try to acquire slot after each time we are signaled by
- * WALInsertSlotRelease.
- */
- for (;;)
+ immed = LWLockAcquireWithVar(&WALInsertLocks[MyLockNo].l.lock,
+ &WALInsertLocks[MyLockNo].l.insertingAt,
+ 0);
+ if (!immed)
{
- bool mustwait;
-
- slot = &XLogCtl->Insert.insertSlots[MySlotNo].slot;
-
- /* Acquire mutex. Time spent holding mutex should be short! */
- SpinLockAcquire(&slot->mutex);
-
- /* If retrying, allow WALInsertSlotRelease to release waiters again */
- if (retry)
- slot->releaseOK = true;
-
- /* If I can get the slot, do so quickly. */
- if (slot->exclusive == 0)
- {
- slot->exclusive++;
- mustwait = false;
- }
- else
- mustwait = true;
-
- if (!mustwait)
- break; /* got the lock */
-
- Assert(slot->owner != MyProc);
-
- /*
- * Add myself to wait queue.
- */
- proc->lwWaiting = true;
- proc->lwWaitMode = LW_EXCLUSIVE;
- proc->lwWaitLink = NULL;
- if (slot->head == NULL)
- slot->head = proc;
- else
- slot->tail->lwWaitLink = proc;
- slot->tail = proc;
-
- /* Can release the mutex now */
- SpinLockRelease(&slot->mutex);
-
/*
- * Wait until awakened.
- *
- * Since we share the process wait semaphore with the regular lock
- * manager and ProcWaitForSignal, and we may need to acquire a slot
- * while one of those is pending, it is possible that we get awakened
- * for a reason other than being signaled by WALInsertSlotRelease. If
- * so, loop back and wait again. Once we've gotten the slot,
- * re-increment the sema by the number of additional signals received,
- * so that the lock manager or signal manager will see the received
- * signal when it next waits.
+ * If we couldn't get the lock immediately, try another lock next
+ * time. On a system with more insertion locks than concurrent
+ * inserters, this causes all the inserters to eventually migrate
+ * to a lock that no-one else is using. On a system with more
+ * inserters than locks, it still helps to distribute the inserters
+ * quite evenly across the locks.
*/
- for (;;)
- {
- /* "false" means cannot accept cancel/die interrupt here. */
- PGSemaphoreLock(&proc->sem, false);
- if (!proc->lwWaiting)
- break;
- extraWaits++;
- }
-
- /* Now loop back and try to acquire lock again. */
- retry = true;
+ lockToTry = (lockToTry + 1) % num_xloginsert_locks;
}
-
- slot->owner = proc;
-
- /*
- * Normally, we initialize the xlogInsertingAt value of the slot to 1,
- * because we don't yet know where in the WAL we're going to insert. It's
- * not critical what it points to right now - leaving it to a too small
- * value just means that WaitXlogInsertionsToFinish() might wait on us
- * unnecessarily, until we update the value (when we finish the insert or
- * move to next page).
- *
- * If we're grabbing all the slots, however, stamp all but the last one
- * with InvalidXLogRecPtr, meaning there is no insert in progress. The last
- * slot is the one that we will update as we proceed with the insert, the
- * rest are held just to keep off other inserters.
- */
- if (slotno != -1 && slotno != num_xloginsert_slots - 1)
- slot->xlogInsertingAt = InvalidXLogRecPtr;
- else
- slot->xlogInsertingAt = 1;
-
- /* We are done updating shared state of the slot itself. */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Fix the process wait semaphore's count for any absorbed wakeups.
- */
- while (extraWaits-- > 0)
- PGSemaphoreUnlock(&proc->sem);
-
- /*
- * If we couldn't get the slot immediately, try another slot next time.
- * On a system with more insertion slots than concurrent inserters, this
- * causes all the inserters to eventually migrate to a slot that no-one
- * else is using. On a system with more inserters than slots, it still
- * causes the inserters to be distributed quite evenly across the slots.
- */
- if (slotno != -1 && retry)
- slotToTry = (slotToTry + 1) % num_xloginsert_slots;
}
/*
- * Wait for the given slot to become free, or for its xlogInsertingAt location
- * to change to something else than 'waitptr'. In other words, wait for the
- * inserter using the given slot to finish its insertion, or to at least make
- * some progress.
+ * Acquire all WAL insertion locks, to prevent other backends from inserting
+ * to WAL.
*/
static void
-WaitOnSlot(volatile XLogInsertSlot *slot, XLogRecPtr waitptr)
+WALInsertLockAcquireExclusive(void)
{
- PGPROC *proc = MyProc;
- int extraWaits = 0;
-
- /*
- * Lock out cancel/die interrupts while we sleep on the slot. There is
- * no cleanup mechanism to remove us from the wait queue if we got
- * interrupted.
- */
- HOLD_INTERRUPTS();
+ int i;
/*
- * Loop here to try to acquire lock after each time we are signaled.
+ * When holding all the locks, we only update the last lock's insertingAt
+ * indicator. The others are set to 0xFFFFFFFFFFFFFFFF, which is higher
+ * than any real XLogRecPtr value, to make sure that no-one blocks
+ * waiting on those.
*/
- for (;;)
+ for (i = 0; i < num_xloginsert_locks - 1; i++)
{
- bool mustwait;
-
- /* Acquire mutex. Time spent holding mutex should be short! */
- SpinLockAcquire(&slot->mutex);
-
- /* If I can get the lock, do so quickly. */
- if (slot->exclusive == 0 || slot->xlogInsertingAt != waitptr)
- mustwait = false;
- else
- mustwait = true;
-
- if (!mustwait)
- break; /* the lock was free */
-
- Assert(slot->owner != MyProc);
-
- /*
- * Add myself to wait queue.
- */
- proc->lwWaiting = true;
- proc->lwWaitMode = LW_WAIT_UNTIL_FREE;
- proc->lwWaitLink = NULL;
-
- /* waiters are added to the front of the queue */
- proc->lwWaitLink = slot->head;
- if (slot->head == NULL)
- slot->tail = proc;
- slot->head = proc;
-
- /* Can release the mutex now */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Wait until awakened.
- *
- * Since we share the process wait semaphore with other things, like
- * the regular lock manager and ProcWaitForSignal, and we may need to
- * acquire an LWLock while one of those is pending, it is possible that
- * we get awakened for a reason other than being signaled by
- * LWLockRelease. If so, loop back and wait again. Once we've gotten
- * the LWLock, re-increment the sema by the number of additional
- * signals received, so that the lock manager or signal manager will
- * see the received signal when it next waits.
- */
- for (;;)
- {
- /* "false" means cannot accept cancel/die interrupt here. */
- PGSemaphoreLock(&proc->sem, false);
- if (!proc->lwWaiting)
- break;
- extraWaits++;
- }
-
- /* Now loop back and try to acquire lock again. */
+ LWLockAcquireWithVar(&WALInsertLocks[i].l.lock,
+ &WALInsertLocks[i].l.insertingAt,
+ UINT64CONST(0xFFFFFFFFFFFFFFFF));
}
+ LWLockAcquireWithVar(&WALInsertLocks[i].l.lock,
+ &WALInsertLocks[i].l.insertingAt,
+ 0);
- /* We are done updating shared state of the lock itself. */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Fix the process wait semaphore's count for any absorbed wakeups.
- */
- while (extraWaits-- > 0)
- PGSemaphoreUnlock(&proc->sem);
-
- /*
- * Now okay to allow cancel/die interrupts.
- */
- RESUME_INTERRUPTS();
+ holdingAllLocks = true;
}
/*
- * Wake up all processes waiting for us with WaitOnSlot(). Sets our
- * xlogInsertingAt value to EndPos, without releasing the slot.
+ * Release our insertion lock (or locks, if we're holding them all).
*/
static void
-WakeupWaiters(XLogRecPtr EndPos)
+WALInsertLockRelease(void)
{
- volatile XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[MySlotNo].slot;
- PGPROC *head;
- PGPROC *proc;
- PGPROC *next;
-
- /*
- * If we have already reported progress up to the same point, do nothing.
- * No other process can modify xlogInsertingAt, so we can check this before
- * grabbing the spinlock.
- */
- if (slot->xlogInsertingAt == EndPos)
- return;
- /* xlogInsertingAt should not go backwards */
- Assert(slot->xlogInsertingAt < EndPos);
-
- /* Acquire mutex. Time spent holding mutex should be short! */
- SpinLockAcquire(&slot->mutex);
-
- /* we should own the slot */
- Assert(slot->exclusive == 1 && slot->owner == MyProc);
-
- slot->xlogInsertingAt = EndPos;
-
- /*
- * See if there are any LW_WAIT_UNTIL_FREE waiters that need to be woken
- * up. They are always in the front of the queue.
- */
- head = slot->head;
-
- if (head != NULL && head->lwWaitMode == LW_WAIT_UNTIL_FREE)
+ if (holdingAllLocks)
{
- proc = head;
- next = proc->lwWaitLink;
- while (next && next->lwWaitMode == LW_WAIT_UNTIL_FREE)
- {
- proc = next;
- next = next->lwWaitLink;
- }
+ int i;
- /* proc is now the last PGPROC to be released */
- slot->head = next;
- proc->lwWaitLink = NULL;
+ for (i = 0; i < num_xloginsert_locks; i++)
+ LWLockRelease(&WALInsertLocks[i].l.lock);
+
+ holdingAllLocks = false;
}
else
- head = NULL;
-
- /* We are done updating shared state of the lock itself. */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Awaken any waiters I removed from the queue.
- */
- while (head != NULL)
{
- proc = head;
- head = proc->lwWaitLink;
- proc->lwWaitLink = NULL;
- proc->lwWaiting = false;
- PGSemaphoreUnlock(&proc->sem);
+ LWLockRelease(&WALInsertLocks[MyLockNo].l.lock);
}
}
/*
- * Release our insertion slot (or slots, if we're holding them all).
+ * Update our insertingAt value, to let others know that we've finished
+ * inserting up to that point.
*/
static void
-WALInsertSlotRelease(void)
+WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
{
- int i;
-
- if (holdingAllSlots)
+ if (holdingAllLocks)
{
- for (i = 0; i < num_xloginsert_slots; i++)
- WALInsertSlotReleaseOne(i);
- holdingAllSlots = false;
+ /*
+ * We use the last lock to mark our actual position, see comments in
+ * WALInsertLockAcquireExclusive.
+ */
+ LWLockUpdateVar(&WALInsertLocks[num_xloginsert_locks - 1].l.lock,
+ &WALInsertLocks[num_xloginsert_locks - 1].l.insertingAt,
+ insertingAt);
}
else
- WALInsertSlotReleaseOne(MySlotNo);
+ LWLockUpdateVar(&WALInsertLocks[MyLockNo].l.lock,
+ &WALInsertLocks[MyLockNo].l.insertingAt,
+ insertingAt);
}
-static void
-WALInsertSlotReleaseOne(int slotno)
-{
- volatile XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[slotno].slot;
- PGPROC *head;
- PGPROC *proc;
-
- /* Acquire mutex. Time spent holding mutex should be short! */
- SpinLockAcquire(&slot->mutex);
-
- /* we must be holding it */
- Assert(slot->exclusive == 1 && slot->owner == MyProc);
-
- slot->xlogInsertingAt = InvalidXLogRecPtr;
-
- /* Release my hold on the slot */
- slot->exclusive = 0;
- slot->owner = NULL;
-
- /*
- * See if I need to awaken any waiters..
- */
- head = slot->head;
- if (head != NULL)
- {
- if (slot->releaseOK)
- {
- /*
- * Remove the to-be-awakened PGPROCs from the queue.
- */
- bool releaseOK = true;
-
- proc = head;
-
- /*
- * First wake up any backends that want to be woken up without
- * acquiring the lock. These are always in the front of the queue.
- */
- while (proc->lwWaitMode == LW_WAIT_UNTIL_FREE && proc->lwWaitLink)
- proc = proc->lwWaitLink;
-
- /*
- * Awaken the first exclusive-waiter, if any.
- */
- if (proc->lwWaitLink)
- {
- Assert(proc->lwWaitLink->lwWaitMode == LW_EXCLUSIVE);
- proc = proc->lwWaitLink;
- releaseOK = false;
- }
- /* proc is now the last PGPROC to be released */
- slot->head = proc->lwWaitLink;
- proc->lwWaitLink = NULL;
-
- slot->releaseOK = releaseOK;
- }
- else
- head = NULL;
- }
-
- /* We are done updating shared state of the slot itself. */
- SpinLockRelease(&slot->mutex);
-
- /*
- * Awaken any waiters I removed from the queue.
- */
- while (head != NULL)
- {
- proc = head;
- head = proc->lwWaitLink;
- proc->lwWaitLink = NULL;
- proc->lwWaiting = false;
- PGSemaphoreUnlock(&proc->sem);
- }
-
- /*
- * Now okay to allow cancel/die interrupts.
- */
- END_CRIT_SECTION();
-}
-
-
/*
* Wait for any WAL insertions < upto to finish.
*
@@ -2032,79 +1701,49 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
}
/*
+ * Loop through all the locks, sleeping on any in-progress insert older
+ * than 'upto'.
+ *
* finishedUpto is our return value, indicating the point upto which
* all the WAL insertions have been finished. Initialize it to the head
- * of reserved WAL, and as we iterate through the insertion slots, back it
+ * of reserved WAL, and as we iterate through the insertion locks, back it
* out for any insertion that's still in progress.
*/
finishedUpto = reservedUpto;
-
- /*
- * Loop through all the slots, sleeping on any in-progress insert older
- * than 'upto'.
- */
- for (i = 0; i < num_xloginsert_slots; i++)
+ for (i = 0; i < num_xloginsert_locks; i++)
{
- volatile XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[i].slot;
- XLogRecPtr insertingat;
-
- retry:
- /*
- * We can check if the slot is in use without grabbing the spinlock.
- * The spinlock acquisition of insertpos_lck before this loop acts
- * as a memory barrier. If someone acquires the slot after that, it
- * can't possibly be inserting to anything < reservedUpto. If it was
- * acquired before that, an unlocked test will return true.
- */
- if (!slot->exclusive)
- continue;
-
- SpinLockAcquire(&slot->mutex);
- /* re-check now that we have the lock */
- if (!slot->exclusive)
- {
- SpinLockRelease(&slot->mutex);
- continue;
- }
- insertingat = slot->xlogInsertingAt;
- SpinLockRelease(&slot->mutex);
-
- if (insertingat == InvalidXLogRecPtr)
+ XLogRecPtr insertingat = InvalidXLogRecPtr;
+ do
{
/*
- * slot is reserved just to hold off other inserters, there is no
- * actual insert in progress.
+ * See if this insertion is in progress. LWLockWaitForVar will wait for
+ * the lock to be released, or for the 'value' to be set by a
+ * LWLockUpdateVar call. When a lock is initially acquired, its
+ * value is 0 (InvalidXLogRecPtr), which means that we don't know
+ * where it's inserting yet. We will have to wait for it. If
+ * it's a small insertion, the record will most likely fit on the
+ * same page and the inserter will release the lock without ever
+ * calling LWLockUpdateVar. But if it has to sleep, it will
+ * advertise the insertion point with LWLockUpdateVar before
+ * sleeping.
*/
- continue;
- }
+ if (LWLockWaitForVar(&WALInsertLocks[i].l.lock,
+ &WALInsertLocks[i].l.insertingAt,
+ insertingat, &insertingat))
+ {
+ /* the lock was free, so no insertion in progress */
+ insertingat = InvalidXLogRecPtr;
+ break;
+ }
- /*
- * This insertion is still in progress. Do we need to wait for it?
- *
- * When an inserter acquires a slot, it doesn't reset 'insertingat', so
- * it will initially point to the old value of some already-finished
- * insertion. The inserter will update the value as soon as it finishes
- * the insertion, moves to the next page, or has to do I/O to flush an
- * old dirty buffer. That means that when we see a slot with
- * insertingat value < upto, we don't know if that insertion is still
- * truly in progress, or if the slot is reused by a new inserter that
- * hasn't updated the insertingat value yet. We have to assume it's the
- * latter, and wait.
- */
- if (insertingat < upto)
- {
- WaitOnSlot(slot, insertingat);
- goto retry;
- }
- else
- {
/*
- * We don't need to wait for this insertion, but update the
- * return value.
+ * This insertion is still in progress. Have to wait, unless the
+ * inserter has proceeded past 'upto'.
*/
- if (insertingat < finishedUpto)
- finishedUpto = insertingat;
- }
+ } while (insertingat < upto);
+
+ if (insertingat != InvalidXLogRecPtr && insertingat < finishedUpto)
+ finishedUpto = insertingat;
}
return finishedUpto;
}
@@ -2118,8 +1757,8 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
*
* The caller must ensure that the page containing the requested location
* isn't evicted yet, and won't be evicted. The way to ensure that is to
- * hold onto an XLogInsertSlot with the xlogInsertingAt position set to
- * something <= ptr. GetXLogBuffer() will update xlogInsertingAt if it needs
+ * hold onto a WAL insertion lock with the insertingAt position set to
+ * something <= ptr. GetXLogBuffer() will update insertingAt if it needs
* to evict an old page from the buffer. (This means that once you call
* GetXLogBuffer() with a given 'ptr', you must not access anything before
* that point anymore, and must not call GetXLogBuffer() with an older 'ptr'
@@ -2179,7 +1818,7 @@ GetXLogBuffer(XLogRecPtr ptr)
* Let others know that we're finished inserting the record up
* to the page boundary.
*/
- WakeupWaiters(expectedEndPtr - XLOG_BLCKSZ);
+ WALInsertLockUpdateInsertingAt(expectedEndPtr - XLOG_BLCKSZ);
AdvanceXLInsertBuffer(ptr, false);
endptr = XLogCtl->xlblocks[idx];
@@ -5117,8 +4756,8 @@ XLOGShmemSize(void)
/* XLogCtl */
size = sizeof(XLogCtlData);
- /* xlog insertion slots, plus alignment */
- size = add_size(size, mul_size(sizeof(XLogInsertSlotPadded), num_xloginsert_slots + 1));
+ /* WAL insertion locks, plus alignment */
+ size = add_size(size, mul_size(sizeof(WALInsertLockPadded), num_xloginsert_locks + 1));
/* xlblocks array */
size = add_size(size, mul_size(sizeof(XLogRecPtr), XLOGbuffers));
/* extra alignment padding for XLOG I/O buffers */
@@ -5166,11 +4805,27 @@ XLOGShmemInit(void)
memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
- /* Xlog insertion slots. Ensure they're aligned to the full padded size */
- allocptr += sizeof(XLogInsertSlotPadded) -
- ((uintptr_t) allocptr) % sizeof(XLogInsertSlotPadded);
- XLogCtl->Insert.insertSlots = (XLogInsertSlotPadded *) allocptr;
- allocptr += sizeof(XLogInsertSlotPadded) * num_xloginsert_slots;
+
+ /* WAL insertion locks. Ensure they're aligned to the full padded size */
+ allocptr += sizeof(WALInsertLockPadded) -
+ ((uintptr_t) allocptr) % sizeof(WALInsertLockPadded);
+ WALInsertLocks = XLogCtl->Insert.WALInsertLocks =
+ (WALInsertLockPadded *) allocptr;
+ allocptr += sizeof(WALInsertLockPadded) * num_xloginsert_locks;
+
+ XLogCtl->Insert.WALInsertLockTrancheId = LWLockNewTrancheId();
+
+ XLogCtl->Insert.WALInsertLockTranche.name = "WALInsertLocks";
+ XLogCtl->Insert.WALInsertLockTranche.array_base = WALInsertLocks;
+ XLogCtl->Insert.WALInsertLockTranche.array_stride = sizeof(WALInsertLockPadded);
+
+ LWLockRegisterTranche(XLogCtl->Insert.WALInsertLockTrancheId, &XLogCtl->Insert.WALInsertLockTranche);
+ for (i = 0; i < num_xloginsert_locks; i++)
+ {
+ LWLockInitialize(&WALInsertLocks[i].l.lock,
+ XLogCtl->Insert.WALInsertLockTrancheId);
+ WALInsertLocks[i].l.insertingAt = InvalidXLogRecPtr;
+ }
/*
* Align the start of the page buffers to a full xlog block size boundary.
@@ -5190,19 +4845,6 @@ XLOGShmemInit(void)
XLogCtl->SharedHotStandbyActive = false;
XLogCtl->WalWriterSleeping = false;
- for (i = 0; i < num_xloginsert_slots; i++)
- {
- XLogInsertSlot *slot = &XLogCtl->Insert.insertSlots[i].slot;
- SpinLockInit(&slot->mutex);
- slot->xlogInsertingAt = InvalidXLogRecPtr;
- slot->owner = NULL;
-
- slot->releaseOK = true;
- slot->exclusive = 0;
- slot->head = NULL;
- slot->tail = NULL;
- }
-
SpinLockInit(&XLogCtl->Insert.insertpos_lck);
SpinLockInit(&XLogCtl->info_lck);
SpinLockInit(&XLogCtl->ulsn_lck);
@@ -7939,6 +7581,11 @@ InitXLOGAccess(void)
ThisTimeLineID = XLogCtl->ThisTimeLineID;
Assert(ThisTimeLineID != 0 || IsBootstrapProcessingMode());
+ /* Initialize our copy of WALInsertLocks and register the tranche */
+ WALInsertLocks = XLogCtl->Insert.WALInsertLocks;
+ LWLockRegisterTranche(XLogCtl->Insert.WALInsertLockTrancheId,
+ &XLogCtl->Insert.WALInsertLockTranche);
+
/* Use GetRedoRecPtr to copy the RedoRecPtr safely */
(void) GetRedoRecPtr();
}
@@ -7957,7 +7604,7 @@ GetRedoRecPtr(void)
/*
* The possibly not up-to-date copy in XlogCtl is enough. Even if we
- * grabbed a WAL insertion slot to read the master copy, someone might
+ * grabbed a WAL insertion lock to read the master copy, someone might
* update it just after we've released the lock.
*/
SpinLockAcquire(&xlogctl->info_lck);
@@ -7975,7 +7622,7 @@ GetRedoRecPtr(void)
*
* NOTE: The value *actually* returned is the position of the last full
* xlog page. It lags behind the real insert position by at most 1 page.
- * For that, we don't need to scan through WAL insertion slots, and an
+ * For that, we don't need to scan through WAL insertion locks, and an
* approximation is enough for the current usage of this function.
*/
XLogRecPtr
@@ -8336,7 +7983,7 @@ CreateCheckPoint(int flags)
* We must block concurrent insertions while examining insert state to
* determine the checkpoint REDO pointer.
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquireExclusive();
curInsert = XLogBytePosToRecPtr(Insert->CurrBytePos);
/*
@@ -8361,7 +8008,7 @@ CreateCheckPoint(int flags)
MAXALIGN(SizeOfXLogRecord + sizeof(CheckPoint)) &&
ControlFile->checkPoint == ControlFile->checkPointCopy.redo)
{
- WALInsertSlotRelease();
+ WALInsertLockRelease();
LWLockRelease(CheckpointLock);
END_CRIT_SECTION();
return;
@@ -8405,7 +8052,7 @@ CreateCheckPoint(int flags)
/*
* Here we update the shared RedoRecPtr for future XLogInsert calls; this
- * must be done while holding the insertion slots.
+ * must be done while holding all the insertion locks.
*
* Note: if we fail to complete the checkpoint, RedoRecPtr will be left
* pointing past where it really needs to point. This is okay; the only
@@ -8417,10 +8064,10 @@ CreateCheckPoint(int flags)
RedoRecPtr = xlogctl->Insert.RedoRecPtr = checkPoint.redo;
/*
- * Now we can release the WAL insertion slots, allowing other xacts to
+ * Now we can release the WAL insertion locks, allowing other xacts to
* proceed while we are flushing disk buffers.
*/
- WALInsertSlotRelease();
+ WALInsertLockRelease();
/* Update the info_lck-protected copy of RedoRecPtr as well */
SpinLockAcquire(&xlogctl->info_lck);
@@ -8450,7 +8097,7 @@ CreateCheckPoint(int flags)
* we wait till he's out of his commit critical section before proceeding.
* See notes in RecordTransactionCommit().
*
- * Because we've already released the insertion slots, this test is a bit
+ * Because we've already released the insertion locks, this test is a bit
* fuzzy: it is possible that we will wait for xacts we didn't really need
* to wait for. But the delay should be short and it seems better to make
* checkpoint take a bit longer than to hold off insertions longer than
@@ -8681,10 +8328,10 @@ CreateEndOfRecoveryRecord(void)
xlrec.end_time = time(NULL);
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquireExclusive();
xlrec.ThisTimeLineID = ThisTimeLineID;
xlrec.PrevTimeLineID = XLogCtl->PrevTimeLineID;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
LocalSetXLogInsertAllowed();
@@ -8892,9 +8539,9 @@ CreateRestartPoint(int flags)
* during recovery this is just pro forma, because no WAL insertions are
* happening.
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquireExclusive();
xlogctl->Insert.RedoRecPtr = lastCheckPoint.redo;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
/* Also update the info_lck-protected copy */
SpinLockAcquire(&xlogctl->info_lck);
@@ -9354,9 +9001,9 @@ UpdateFullPageWrites(void)
*/
if (fullPageWrites)
{
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquireExclusive();
Insert->fullPageWrites = true;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
}
/*
@@ -9377,9 +9024,9 @@ UpdateFullPageWrites(void)
if (!fullPageWrites)
{
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquireExclusive();
Insert->fullPageWrites = false;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
}
END_CRIT_SECTION();
}
@@ -10010,15 +9657,15 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
* Note that forcePageWrites has no effect during an online backup from
* the standby.
*
- * We must hold all the insertion slots to change the value of
+ * We must hold all the insertion locks to change the value of
* forcePageWrites, to ensure adequate interlocking against XLogInsert().
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquireExclusive();
if (exclusive)
{
if (XLogCtl->Insert.exclusiveBackup)
{
- WALInsertSlotRelease();
+ WALInsertLockRelease();
ereport(ERROR,
(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
errmsg("a backup is already in progress"),
@@ -10029,7 +9676,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
else
XLogCtl->Insert.nonExclusiveBackups++;
XLogCtl->Insert.forcePageWrites = true;
- WALInsertSlotRelease();
+ WALInsertLockRelease();
/* Ensure we release forcePageWrites if fail below */
PG_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
@@ -10144,13 +9791,13 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
* taking a checkpoint right after another is not that expensive
* either because only few buffers have been dirtied yet.
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquireExclusive();
if (XLogCtl->Insert.lastBackupStart < startpoint)
{
XLogCtl->Insert.lastBackupStart = startpoint;
gotUniqueStartpoint = true;
}
- WALInsertSlotRelease();
+ WALInsertLockRelease();
} while (!gotUniqueStartpoint);
XLByteToSeg(startpoint, _logSegNo);
@@ -10240,7 +9887,7 @@ pg_start_backup_callback(int code, Datum arg)
bool exclusive = DatumGetBool(arg);
/* Update backup counters and forcePageWrites on failure */
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquireExclusive();
if (exclusive)
{
Assert(XLogCtl->Insert.exclusiveBackup);
@@ -10257,7 +9904,7 @@ pg_start_backup_callback(int code, Datum arg)
{
XLogCtl->Insert.forcePageWrites = false;
}
- WALInsertSlotRelease();
+ WALInsertLockRelease();
}
/*
@@ -10326,7 +9973,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
/*
* OK to update backup counters and forcePageWrites
*/
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquireExclusive();
if (exclusive)
XLogCtl->Insert.exclusiveBackup = false;
else
@@ -10346,7 +9993,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
{
XLogCtl->Insert.forcePageWrites = false;
}
- WALInsertSlotRelease();
+ WALInsertLockRelease();
if (exclusive)
{
@@ -10631,7 +10278,7 @@ do_pg_stop_backup(char *labelfile, bool waitforarchive, TimeLineID *stoptli_p)
void
do_pg_abort_backup(void)
{
- WALInsertSlotAcquire(true);
+ WALInsertLockAcquireExclusive();
Assert(XLogCtl->Insert.nonExclusiveBackups > 0);
XLogCtl->Insert.nonExclusiveBackups--;
@@ -10640,7 +10287,7 @@ do_pg_abort_backup(void)
{
XLogCtl->Insert.forcePageWrites = false;
}
- WALInsertSlotRelease();
+ WALInsertLockRelease();
}
/*
diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 82ef440..160519c 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -10,6 +10,14 @@
* locking should be done with the full lock manager --- which depends on
* LWLocks to protect its shared state.
*
+ * In addition to exclusive and shared modes, lightweight locks can be used
+ * to wait until a variable changes value. The variable is initially set
+ * when the lock is acquired with LWLockAcquireWithVar, and can be updated
+ * without releasing the lock by calling LWLockUpdateVar. LWLockWaitForVar
+ * waits for the variable to be updated, or until the lock is free. The
+ * meaning of the variable is up to the caller, the lightweight lock code
+ * just assigns and compares it.
+ *
*
* Portions Copyright (c) 1996-2014, PostgreSQL Global Development Group
* Portions Copyright (c) 1994, Regents of the University of California
@@ -78,6 +86,9 @@ static LWLock *held_lwlocks[MAX_SIMUL_LWLOCKS];
static int lock_addin_request = 0;
static bool lock_addin_request_allowed = true;
+static bool LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr,
+ uint64 val);
+
#ifdef LWLOCK_STATS
typedef struct lwlock_stats_key
{
@@ -443,16 +454,36 @@ LWLockInitialize(LWLock *lock, int tranche_id)
/*
* LWLockAcquire - acquire a lightweight lock in the specified mode
*
- * If the lock is not available, sleep until it is.
+ * If the lock is not available, sleep until it is. Returns true if the lock
+ * was available immediately, false if we had to sleep.
*
* Side effect: cancel/die interrupts are held off until lock release.
*/
-void
+bool
LWLockAcquire(LWLock *l, LWLockMode mode)
{
+ return LWLockAcquireCommon(l, mode, NULL, 0);
+}
+
+/*
+ * LWLockAcquireWithVar - like LWLockAcquire, but also sets *valptr = val
+ *
+ * The lock is always acquired in exclusive mode with this function.
+ */
+bool
+LWLockAcquireWithVar(LWLock *l, uint64 *valptr, uint64 val)
+{
+ return LWLockAcquireCommon(l, LW_EXCLUSIVE, valptr, val);
+}
+
+/* internal function to implement LWLockAcquire and LWLockAcquireWithVar */
+static bool
+LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
+{
volatile LWLock *lock = l;
PGPROC *proc = MyProc;
bool retry = false;
+ bool result = true;
int extraWaits = 0;
#ifdef LWLOCK_STATS
lwlock_stats *lwstats;
@@ -601,8 +632,13 @@ LWLockAcquire(LWLock *l, LWLockMode mode)
/* Now loop back and try to acquire lock again. */
retry = true;
+ result = false;
}
+ /* If there's a variable associated with this lock, initialize it */
+ if (valptr)
+ *valptr = val;
+
/* We are done updating shared state of the lock itself. */
SpinLockRelease(&lock->mutex);
@@ -616,6 +652,8 @@ LWLockAcquire(LWLock *l, LWLockMode mode)
*/
while (extraWaits-- > 0)
PGSemaphoreUnlock(&proc->sem);
+
+ return result;
}
/*
@@ -835,6 +873,227 @@ LWLockAcquireOrWait(LWLock *l, LWLockMode mode)
}
/*
+ * LWLockWaitForVar - Wait until lock is free, or a variable is updated.
+ *
+ * If the lock is held and *valptr equals oldval, waits until the lock is
+ * either freed, or the lock holder updates *valptr by calling
+ * LWLockUpdateVar. If the lock is free on exit (immediately or after
+ * waiting), returns true. If the lock is still held, but *valptr no longer
+ * matches oldval, returns false and sets *newval to the current value in
+ * *valptr.
+ *
+ * It's possible that the lock holder releases the lock, but another backend
+ * acquires it again before we get a chance to observe that the lock was
+ * momentarily released. We wouldn't need to wait for the new lock holder,
+ * but we cannot distinguish that case, so we will have to wait.
+ *
+ * Note: this function ignores shared lock holders; if the lock is held
+ * in shared mode, returns 'true'.
+ */
+bool
+LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
+{
+ volatile LWLock *lock = l;
+ volatile uint64 *valp = valptr;
+ PGPROC *proc = MyProc;
+ int extraWaits = 0;
+ bool result = false;
+
+ /*
+	 * Quick test first to see if the lock is free right now.
+ *
+ * XXX: the caller uses a spinlock before this, so we don't need a memory
+ * barrier here as far as the current usage is concerned. But that might
+ * not be safe in general.
+ */
+ if (lock->exclusive == 0)
+ return true;
+
+ /*
+ * Lock out cancel/die interrupts while we sleep on the lock. There is
+ * no cleanup mechanism to remove us from the wait queue if we got
+ * interrupted.
+ */
+ HOLD_INTERRUPTS();
+
+ /*
+ * Loop here to check the lock's status after each time we are signaled.
+ */
+ for (;;)
+ {
+ bool mustwait;
+ uint64 value;
+
+ /* Acquire mutex. Time spent holding mutex should be short! */
+#ifdef LWLOCK_STATS
+ lwstats->spin_delay_count += SpinLockAcquire(&lock->mutex);
+#else
+ SpinLockAcquire(&lock->mutex);
+#endif
+
+ /* Is the lock now free, and if not, does the value match? */
+ if (lock->exclusive == 0)
+ {
+ result = true;
+ mustwait = false;
+ }
+ else
+ {
+ value = *valp;
+ if (value != oldval)
+ {
+ result = false;
+ mustwait = false;
+ *newval = value;
+ }
+ else
+ mustwait = true;
+ }
+
+ if (!mustwait)
+ break; /* the lock was free or value didn't match */
+
+ /*
+ * Add myself to wait queue.
+ */
+ proc->lwWaiting = true;
+ proc->lwWaitMode = LW_WAIT_UNTIL_FREE;
+ proc->lwWaitLink = NULL;
+
+ /* waiters are added to the front of the queue */
+ proc->lwWaitLink = lock->head;
+ if (lock->head == NULL)
+ lock->tail = proc;
+ lock->head = proc;
+
+ /* Can release the mutex now */
+ SpinLockRelease(&lock->mutex);
+
+ /*
+ * Wait until awakened.
+ *
+ * Since we share the process wait semaphore with the regular lock
+ * manager and ProcWaitForSignal, and we may need to acquire an LWLock
+ * while one of those is pending, it is possible that we get awakened
+ * for a reason other than being signaled by LWLockRelease. If so,
+ * loop back and wait again. Once we've gotten the LWLock,
+ * re-increment the sema by the number of additional signals received,
+ * so that the lock manager or signal manager will see the received
+ * signal when it next waits.
+ */
+ LOG_LWDEBUG("LWLockWaitForVar", T_NAME(l), T_ID(l), "waiting");
+
+#ifdef LWLOCK_STATS
+ lwstats->block_count++;
+#endif
+
+ TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), mode);
+
+ for (;;)
+ {
+ /* "false" means cannot accept cancel/die interrupt here. */
+ PGSemaphoreLock(&proc->sem, false);
+ if (!proc->lwWaiting)
+ break;
+ extraWaits++;
+ }
+
+ TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), mode);
+
+ LOG_LWDEBUG("LWLockWaitForVar", T_NAME(l), T_ID(l), "awakened");
+
+ /* Now loop back and check the status of the lock again. */
+ }
+
+ /* We are done updating shared state of the lock itself. */
+ SpinLockRelease(&lock->mutex);
+
+ TRACE_POSTGRESQL_LWLOCK_ACQUIRE(T_NAME(l), T_ID(l), mode);
+
+ /*
+ * Fix the process wait semaphore's count for any absorbed wakeups.
+ */
+ while (extraWaits-- > 0)
+ PGSemaphoreUnlock(&proc->sem);
+
+ /*
+ * Now okay to allow cancel/die interrupts.
+ */
+ RESUME_INTERRUPTS();
+
+ return result;
+}
+
+
+/*
+ * LWLockUpdateVar - Update a variable and wake up waiters atomically
+ *
+ * Sets *valptr to 'val', and wakes up all processes waiting for us with
+ * LWLockWaitForVar(). Setting the value and waking up the processes happen
+ * atomically so that any process calling LWLockWaitForVar() on the same lock
+ * is guaranteed to see the new value, and act accordingly.
+ *
+ * The caller must be holding the lock in exclusive mode.
+ */
+void
+LWLockUpdateVar(LWLock *l, uint64 *valptr, uint64 val)
+{
+ volatile LWLock *lock = l;
+ volatile uint64 *valp = valptr;
+ PGPROC *head;
+ PGPROC *proc;
+ PGPROC *next;
+
+ /* Acquire mutex. Time spent holding mutex should be short! */
+ SpinLockAcquire(&lock->mutex);
+
+ /* we should hold the lock */
+ Assert(lock->exclusive == 1);
+
+ /* Update the lock's value */
+ *valp = val;
+
+ /*
+ * See if there are any LW_WAIT_UNTIL_FREE waiters that need to be woken
+ * up. They are always in the front of the queue.
+ */
+ head = lock->head;
+
+ if (head != NULL && head->lwWaitMode == LW_WAIT_UNTIL_FREE)
+ {
+ proc = head;
+ next = proc->lwWaitLink;
+ while (next && next->lwWaitMode == LW_WAIT_UNTIL_FREE)
+ {
+ proc = next;
+ next = next->lwWaitLink;
+ }
+
+ /* proc is now the last PGPROC to be released */
+ lock->head = next;
+ proc->lwWaitLink = NULL;
+ }
+ else
+ head = NULL;
+
+ /* We are done updating shared state of the lock itself. */
+ SpinLockRelease(&lock->mutex);
+
+ /*
+ * Awaken any waiters I removed from the queue.
+ */
+ while (head != NULL)
+ {
+ proc = head;
+ head = proc->lwWaitLink;
+ proc->lwWaitLink = NULL;
+ proc->lwWaiting = false;
+ PGSemaphoreUnlock(&proc->sem);
+ }
+}
+
+
+/*
* LWLockRelease - release a previously acquired lock
*/
void
diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
index c76edb4..216fd13 100644
--- a/src/backend/utils/misc/guc.c
+++ b/src/backend/utils/misc/guc.c
@@ -2123,12 +2123,12 @@ static struct config_int ConfigureNamesInt[] =
},
{
- {"xloginsert_slots", PGC_POSTMASTER, WAL_SETTINGS,
- gettext_noop("Sets the number of slots for concurrent xlog insertions."),
+ {"xloginsert_locks", PGC_POSTMASTER, WAL_SETTINGS,
+ gettext_noop("Sets the number of locks used for concurrent xlog insertions."),
NULL,
GUC_NOT_IN_SAMPLE
},
- &num_xloginsert_slots,
+ &num_xloginsert_locks,
8, 1, 1000,
NULL, NULL, NULL
},
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 3509228..56cfe63 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -192,7 +192,7 @@ extern bool EnableHotStandby;
extern bool fullPageWrites;
extern bool wal_log_hints;
extern bool log_checkpoints;
-extern int num_xloginsert_slots;
+extern int num_xloginsert_locks;
/* WAL levels */
typedef enum WalLevel
diff --git a/src/include/storage/lwlock.h b/src/include/storage/lwlock.h
index 8840c79..3a19533 100644
--- a/src/include/storage/lwlock.h
+++ b/src/include/storage/lwlock.h
@@ -169,13 +169,17 @@ typedef enum LWLockMode
extern bool Trace_lwlocks;
#endif
-extern void LWLockAcquire(LWLock *lock, LWLockMode mode);
+extern bool LWLockAcquire(LWLock *lock, LWLockMode mode);
extern bool LWLockConditionalAcquire(LWLock *lock, LWLockMode mode);
extern bool LWLockAcquireOrWait(LWLock *lock, LWLockMode mode);
extern void LWLockRelease(LWLock *lock);
extern void LWLockReleaseAll(void);
extern bool LWLockHeldByMe(LWLock *lock);
+extern bool LWLockAcquireWithVar(LWLock *lock, uint64 *valptr, uint64 val);
+extern bool LWLockWaitForVar(LWLock *lock, uint64 *valptr, uint64 oldval, uint64 *newval);
+extern void LWLockUpdateVar(LWLock *lock, uint64 *valptr, uint64 value);
+
extern Size LWLockShmemSize(void);
extern void CreateLWLocks(void);
Hi,
I see you've committed this, cool. Sorry for not getting back to the
topic earlier..
On 2014-03-13 22:44:03 +0200, Heikki Linnakangas wrote:
On 03/12/2014 09:29 PM, Andres Freund wrote:
On 2014-03-07 17:54:32 +0200, Heikki Linnakangas wrote:
So there are some unexplained differences there, but based on these results,
I'm still OK with committing the patch.
So, I am looking at this right now.
I think there are some minor things I'd like to see addressed:
1) I think there needs to be a good sized comment explaining why
WaitXLogInsertionsToFinish() isn't racy due to the unlocked read at
the beginning of LWLockWait().
There's a comment inside LWLockWait(). I think that's the right place for
it; it's LWLockWait() that's cheating by not acquiring the spinlock before
reading lock->exclusive.
I don't find that argument convincing. After all it's only correct
because the API user does things in a particular way. So there should be
a comment at the callsite to make sure that's not changed.
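Something along these lines at the top of the loop in
WaitXLogInsertionsToFinish() would do, I think (wording is just a sketch,
based on the reasoning the old per-slot comment used to spell out):
		/*
		 * NB: LWLockWaitForVar() peeks at the lock state without taking the
		 * spinlock.  That's only safe because we acquired insertpos_lck
		 * above while reading reservedUpto, which acts as a memory barrier:
		 * anyone grabbing an insertion lock after that point can't be
		 * inserting anything < reservedUpto anyway.
		 */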
3) I am the wrong one to complain, I know, but the comments above struct
WALInsertLock are pretty hard to read from the sentence structure.
Hmm, ok. I reworded that, I hope it's more clear now.
Yes, it is.
The committed version doesn't compile with LWLOCK_STATS...
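I guess LWLockWaitForVar() just needs the same lwstats setup that
LWLockAcquireCommon() has at the top, something like this (untested, and
assuming get_lwlock_stats_entry() is still the right helper for that):
#ifdef LWLOCK_STATS
	lwlock_stats *lwstats;
	lwstats = get_lwlock_stats_entry(l);
#endif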
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
On 2014-03-21 22:52:33 +0100, Andres Freund wrote:
The committed version doesn't compile with LWLOCK_STATS...
Just noticed that it seems to also break the dtrace stuff:
http://pgbuildfarm.org/cgi-bin/show_log.pl?nm=rover_firefly&dt=2014-03-21%2018%3A04%3A00
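At a guess that's the TRACE_POSTGRESQL_LWLOCK_WAIT_START/_DONE probes in
LWLockWaitForVar() referencing 'mode', which doesn't exist in that function;
without --enable-dtrace the probe arguments are never evaluated, so it
compiles fine here. Presumably they just want LW_EXCLUSIVE, i.e. something
like (untested):
	TRACE_POSTGRESQL_LWLOCK_WAIT_START(T_NAME(l), T_ID(l), LW_EXCLUSIVE);
	TRACE_POSTGRESQL_LWLOCK_WAIT_DONE(T_NAME(l), T_ID(l), LW_EXCLUSIVE);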
Greetings,
Andres Freund
--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers