Re: [PERFORM] Cpu usage 100% on slave. s_lock problem.

Started by Merlin Moncure over 12 years ago · 24 messages
#1Merlin Moncure
mmoncure@gmail.com
1 attachment(s)

On Fri, Sep 27, 2013 at 7:56 AM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Thu, Sep 26, 2013 at 10:14 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Thu, Sep 26, 2013 at 6:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-08-27 12:17:55 -0500, Merlin Moncure wrote:

On Tue, Aug 27, 2013 at 10:55 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-08-27 09:57:38 -0500, Merlin Moncure wrote:

+ bool
+ RecoveryMightBeInProgress(void)
+ {
+     /*
+      * We check shared state each time only until we leave recovery mode. We
+      * can't re-enter recovery, so there's no need to keep checking after the
+      * shared variable has once been seen false.
+      */
+     if (!LocalRecoveryInProgress)
+             return false;
+     else
+     {
+             /* use volatile pointer to prevent code rearrangement */
+             volatile XLogCtlData *xlogctl = XLogCtl;
+
+             /* Intentionally query xlogctl without spinlocking! */
+             LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+
+             return LocalRecoveryInProgress;
+     }
+ }

I don't think it's acceptable to *set* LocalRecoveryInProgress
here. That should only be done in the normal routine.

quite right -- that was a major error -- you could bypass the
initialization call to the xlog with some bad luck.

I've seen this in profiles since, so I'd appreciate pushing this
forward.

roger that -- will push ahead when I get into the office...

attached is a new version fixing some comment typos.

Attached is a simplified patch that replaces the spinlock with a read
barrier, based on a suggestion made by Andres offlist. The approach
has different performance characteristics -- a barrier call is being
issued instead of a non-blocking read. I don't have a performance
test case in hand to prove that's better, so I'm going with Andres's
approach because it's simpler. Aside: can this truly be the only
caller of pg_read_barrier()?

Also, moving to -hackers from -performance.

merlin

Attachments:

recovery5.patch (application/octet-stream)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
new file mode 100755
index 959f423..cd3b673
*** a/src/backend/access/transam/xlog.c
--- b/src/backend/access/transam/xlog.c
***************
*** 41,46 ****
--- 41,47 ----
  #include "postmaster/startup.h"
  #include "replication/walreceiver.h"
  #include "replication/walsender.h"
+ #include "storage/barrier.h"
  #include "storage/bufmgr.h"
  #include "storage/fd.h"
  #include "storage/ipc.h"
*************** RecoveryInProgress(void)
*** 6222,6231 ****
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
! 		/* spinlock is essential on machines with weak memory ordering! */
! 		SpinLockAcquire(&xlogctl->info_lck);
  		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
- 		SpinLockRelease(&xlogctl->info_lck);
  
  		/*
  		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
--- 6223,6232 ----
  		/* use volatile pointer to prevent code rearrangement */
  		volatile XLogCtlData *xlogctl = XLogCtl;
  
!     /* ensure a proper read on machines with weak memory ordering! */
!     pg_read_barrier();
! 
  		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
  
  		/*
  		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
#2Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Merlin Moncure (#1)

On 27.09.2013 22:00, Merlin Moncure wrote:

On Fri, Sep 27, 2013 at 7:56 AM, Merlin Moncure<mmoncure@gmail.com> wrote:

On Thu, Sep 26, 2013 at 10:14 PM, Merlin Moncure<mmoncure@gmail.com> wrote:

On Thu, Sep 26, 2013 at 6:08 PM, Andres Freund<andres@2ndquadrant.com> wrote:

On 2013-08-27 12:17:55 -0500, Merlin Moncure wrote:

On Tue, Aug 27, 2013 at 10:55 AM, Andres Freund<andres@2ndquadrant.com> wrote:

On 2013-08-27 09:57:38 -0500, Merlin Moncure wrote:

+ bool
+ RecoveryMightBeInProgress(void)
+ {
+     /*
+      * We check shared state each time only until we leave recovery mode. We
+      * can't re-enter recovery, so there's no need to keep checking after the
+      * shared variable has once been seen false.
+      */
+     if (!LocalRecoveryInProgress)
+             return false;
+     else
+     {
+             /* use volatile pointer to prevent code rearrangement */
+             volatile XLogCtlData *xlogctl = XLogCtl;
+
+             /* Intentionally query xlogctl without spinlocking! */
+             LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
+
+             return LocalRecoveryInProgress;
+     }
+ }

I don't think it's acceptable to *set* LocalRecoveryInProgress
here. That should only be done in the normal routine.

quite right -- that was a major error -- you could bypass the
initialization call to the xlog with some bad luck.

I've seen this in profiles since, so I'd appreciate pushing this
forward.

roger that -- will push ahead when I get into the office...

attached is a new version fixing some comment typos.

Attached is a simplified patch that replaces the spinlock with a read
barrier, based on a suggestion made by Andres offlist. The approach
has different performance characteristics -- a barrier call is being
issued instead of a non-blocking read. I don't have a performance
test case in hand to prove that's better, so I'm going with Andres's
approach because it's simpler.

Can you think of a concrete example of what might go wrong if we simply
removed the spinlock, and did an unprotected read through the volatile
pointer? Is there some particular call sequence that might get optimized
in an unfortunate way? I'm not suggesting that we do that - better safe
than sorry even if we can't think of any particular scenario - but it
would help me understand this.

I don't think a read-barrier is enough here. See README.barrier:

When a memory barrier is being used to separate two
loads, use pg_read_barrier(); when it is separating two stores, use
pg_write_barrier(); when it is separating a load and a store (in either
order), use pg_memory_barrier().

The RecoveryInProgress() function does just one load, to read
the variable, and wouldn't even need a barrier by itself. The other load
or store that needs to be protected by the barrier happens in the
caller, before or after the function, and we can't say for sure if it's
a load or a store. So, let's use pg_memory_barrier().

Aside: can this truly be the only caller
of pg_read_barrier()?

Yep. I only added the first caller of the barriers altogether in the
xlog-insertion scaling patch. Robert wrote the infrastructure in 9.3,
but it wasn't used until now, in 9.4.

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Peter Geoghegan
pg@heroku.com
In reply to: Heikki Linnakangas (#2)

On Fri, Sep 27, 2013 at 1:28 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

Yep. I only added the first caller of the barriers altogether in the
xlog-insertion scaling patch. Robert wrote the infrastructure in 9.3, but it
wasn't used until now, in 9.4.

FWIW, it was actually during 9.2 development that Robert first added
the barriers.

--
Peter Geoghegan


#4Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#2)

On 2013-09-27 23:28:37 +0300, Heikki Linnakangas wrote:

On 27.09.2013 22:00, Merlin Moncure wrote:

Attached is a simplified patch that replaces the spinlock with a read
barrier, based on a suggestion made by Andres offlist. The approach
has different performance characteristics -- a barrier call is being
issued instead of a non-blocking read. I don't have a performance
test case in hand to prove that's better, so I'm going with Andres's
approach because it's simpler.

Can you think of a concrete example of what might go wrong if we simply
removed the spinlock, and did an unprotected read through the volatile
pointer? Is there some particular call sequence that might get optimized in
an unfortunate way? I'm not suggesting that we do that - better safe than
sorry even if we can't think of any particular scenario - but it would help
me understand this.

I think on some architectures the write from the other side might just
not be immediately visible, so RecoveryInProgress() might return a
spurious true even though we have already finished.
I haven't checked whether there are any callers that would have a problem
with that.

I don't think a read-barrier is enough here. See README.barrier:

When a memory barrier is being used to separate two
loads, use pg_read_barrier(); when it is separating two stores, use
pg_write_barrier(); when it is separating a load and a store (in either
order), use pg_memory_barrier().

I think the documentation is just worded in an easy-to-misunderstand
way. If you look at the following example, we're fine.
The "two stores", "two loads" it is talking about are what's happening
in one thread of execution. The important part of a read barrier is that
it forces a fresh LOAD, separated from earlier LOADs to the same
address.

The RecoveryInProgress() function does just one load, to read the
variable, and wouldn't even need a barrier by itself. The other load or
store that needs to be protected by the barrier happens in the caller,
before or after the function, and we can't say for sure if it's a load or a
store. So, let's use pg_memory_barrier().

The caller uses a spinlock, so it's guaranteed to write out before the
spinlock is released. A write barrier (the spinlock in the startup
process) should always be paired with a read barrier.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#5Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#4)

Hi,

What confuses me is that pg_read_barrier() is just a compiler barrier on
x86[-64] in barrier.h. As far as I know, it needs to be an lfence or a
full barrier?
The linked papers from Paul McKenney - which are a great read - seem to
agree?

On 2013-09-27 23:12:17 +0200, Andres Freund wrote:

On 2013-09-27 23:28:37 +0300, Heikki Linnakangas wrote:

The RecoveryInProgress() function does just one load, to read the
variable, and wouldn't even need a barrier by itself. The other load or
store that needs to be protected by the barrier happens in the caller,
before or after the function, and we can't say for sure if it's a load or a
store. So, let's use pg_memory_barrier().

The caller uses a spinlock, so it's guaranteed to write out before the
spinlock is released. A write barrier (the spinlock in the startup
process) should always be paired with a read barrier.

s/caller/startup process/

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#6Peter Eisentraut
peter_e@gmx.net
In reply to: Merlin Moncure (#1)

On 9/27/13 3:00 PM, Merlin Moncure wrote:

Attached is a simplified patch that replaces the spinlock with a read
barrier, based on a suggestion made by Andres offlist.

This patch doesn't apply.


#7Ants Aasma
ants@cybertec.at
In reply to: Andres Freund (#5)

On Sat, Sep 28, 2013 at 12:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:

What confuses me is that pg_read_barrier() is just a compiler barrier on
x86[-64] in barrier.h. As far as I know, it needs to be an lfence or a
full barrier?
The linked papers from Paul McKenney - which are a great read - seem to
agree?

x86 provides a pretty strong memory model: the only (relevant) allowed
reordering is that stores may be delayed past subsequent loads. A
simple and functionally accurate model of this is to ignore all caches
and think of the system as a bunch of CPUs accessing the main memory
through a shared bus, but where each CPU has a small buffer that stores
writes done by that CPU. Intel and AMD have performed such feats of
black magic in the memory pipelines that this isn't a good performance
model, but for correctness it is valid. The exceptions are a few
unordered memory instructions that you have to specifically ask for
and non-writeback memory that concerns the kernel. (See section 8.2 in
[1] for details.) The note by McKenney stating that lfence is required
for a read barrier is for those unordered instructions. I don't think
it's necessary in PostgreSQL.

As for the specific patch being discussed here: the read barrier is in
the wrong place and with the wrong comment, and the write side is
assuming that SpinLockAcquire() is a write barrier, which it isn't
documented to be at the moment. The correct way to think of this is
that StartupXLOG() does a bunch of state modifications and then
advertises the fact that it's done by setting
xlogctl->SharedRecoveryInProgress = false; The state modifications
had better be visible to anyone seeing that last write, so you need
one write barrier between the state modifications and setting the
flag. On the other side, after reading the flag as false in
RecoveryInProgress() the state modified by StartupXLOG() may be
investigated, those loads better not happen before we read the flag.
So we need a read barrier somewhere *after* reading the flag in
RecoveryInProgress() and *before* reading the shared memory structures,
and in theory a full barrier if we are going to be writing data. In practice
x86 is covered thanks to its memory model, Power is covered thanks to
the control dependency and ARM would need a read barrier, but I don't
think any real ARM CPU does speculative stores as that would be
insane.

[1]: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


#8Andres Freund
andres@2ndquadrant.com
In reply to: Ants Aasma (#7)

On 2013-10-01 03:51:50 +0300, Ants Aasma wrote:

On Sat, Sep 28, 2013 at 12:53 AM, Andres Freund <andres@2ndquadrant.com> wrote:

What confuses me is that pg_read_barrier() is just a compiler barrier on
x86[-64] in barrier.h. As far as I know, it needs to be an lfence or a
full barrier?
The linked papers from Paul McKenney - which are a great read - seem to
agree?

x86 provides a pretty strong memory model: the only (relevant) allowed
reordering is that stores may be delayed past subsequent loads. A
simple and functionally accurate model of this is to ignore all caches
and think of the system as a bunch of CPUs accessing the main memory
through a shared bus, but where each CPU has a small buffer that stores
writes done by that CPU. Intel and AMD have performed such feats of
black magic in the memory pipelines that this isn't a good performance
model, but for correctness it is valid. The exceptions are a few
unordered memory instructions that you have to specifically ask for
and non-writeback memory that concerns the kernel. (See section 8.2 in
[1] for details.) The note by McKenney stating that lfence is required
for a read barrier is for those unordered instructions. I don't think
it's necessary in PostgreSQL.

I am actually referring to his x86 description, where he states that
smp_rmb() is defined as "lock xadd". But I've since done some research,
and that was removed for most x86 platforms after Intel and AMD relaxed
the requirements to match the actually implemented reality.

For reference:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b6c7347fffa655a3000d9d41640d222c19fc3065

As for the specific patch being discussed here: the read barrier is in
the wrong place and with the wrong comment, and the write side is
assuming that SpinLockAcquire() is a write barrier, which it isn't
documented to be at the moment.

Lots of things we do pretty much already assume it is one - and I have a
hard time imagining any spinlock implementation that isn't a write
barrier. In fact it's the only reason we use spinlocks in quite some
places.
What we are missing is that it's not guaranteed to be a compiler barrier
on all platforms :(. But that's imo a bug.

The correct way to think of this is
that StartupXLOG() does a bunch of state modifications and then
advertises the fact that it's done by setting
xlogctl->SharedRecoveryInProgress = false; The state modifications
had better be visible to anyone seeing that last write, so you need
one write barrier between the state modifications and setting the
flag.

SpinLockAcquire() should provide that.

On the other side, after reading the flag as false in
RecoveryInProgress() the state modified by StartupXLOG() may be
investigated, those loads better not happen before we read the flag.

Agreed.

So we need a read barrier somewhere *after* reading the flag in
RecoveryInProgress() and *before* reading the shared memory structures,
and in theory a full barrier if we are going to be writing data. In practice
x86 is covered thanks to its memory model, Power is covered thanks to
the control dependency and ARM would need a read barrier, but I don't
think any real ARM CPU does speculative stores as that would be
insane.

Does there necessarily have to be a "visible" control dependency?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#9Ants Aasma
ants@cybertec.at
In reply to: Andres Freund (#8)

On Tue, Oct 1, 2013 at 2:08 PM, Andres Freund <andres@2ndquadrant.com> wrote:

As for the specific patch being discussed here: the read barrier is in
the wrong place and with the wrong comment, and the write side is
assuming that SpinLockAcquire() is a write barrier, which it isn't
documented to be at the moment.

Lots of things we do pretty much already assume it is one - and I have a
hard time imagining any spinlock implementation that isn't a write
barrier. In fact it's the only reason we use spinlocks in quite some
places.
What we are missing is that it's not guaranteed to be a compiler barrier
on all platforms :(. But that's imo a bug.

Agree on both counts.

The correct way to think of this is
that StartupXLOG() does a bunch of state modifications and then
advertises the fact that it's done by setting
xlogctl->SharedRecoveryInProgress = false; The state modifications
had better be visible to anyone seeing that last write, so you need
one write barrier between the state modifications and setting the
flag.

SpinLockAcquire() should provide that.

Yes. It's notable that in this case it's a matter of correctness that
the global state modifications do *not* share the critical section
with the flag update. Otherwise the flag update may become visible
before the state updates.

So we need a read barrier somewhere *after* reading the flag in
RecoveryInProgress() and *before* reading the shared memory structures,
and in theory a full barrier if we are going to be writing data. In practice
x86 is covered thanks to its memory model, Power is covered thanks to
the control dependency and ARM would need a read barrier, but I don't
think any real ARM CPU does speculative stores as that would be
insane.

Does there necessarily have to be a "visible" control dependency?

Unfortunately no, the compiler is quite free to hoist loads and even
stores out of the conditional block, losing the control dependency.
:( It's quite unlikely to do so, as it would be a very large code
motion and it probably has no reason to do it, but I don't see
anything that would disallow it. I wonder if we should just emit a
full fence there for platforms with weak memory models. Trying to
enforce the control dependency seems quite fragile.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


#10Andres Freund
andres@2ndquadrant.com
In reply to: Ants Aasma (#9)

On 2013-10-01 14:31:11 +0300, Ants Aasma wrote:

The correct way to think of this is
that StartupXLOG() does a bunch of state modifications and then
advertises the fact that it's done by setting
xlogctl->SharedRecoveryInProgress = false; The state modifications
had better be visible to anyone seeing that last write, so you need
one write barrier between the state modifications and setting the
flag.

SpinLockAcquire() should provide that.

Yes. It's notable that in this case it's a matter of correctness that
the global state modifications do *not* share the critical section
with the flag update. Otherwise the flag update may become visible
before the state updates.

I think we're currently essentially assuming that not only
SpinLockAcquire() is a barrier but also that SpinLockRelease() is
one... - which is actually far less likely to be true.

So we need a read barrier somewhere *after* reading the flag in
RecoveryInProgress() and *before* reading the shared memory structures,
and in theory a full barrier if we are going to be writing data. In practice
x86 is covered thanks to its memory model, Power is covered thanks to
the control dependency and ARM would need a read barrier, but I don't
think any real ARM CPU does speculative stores as that would be
insane.

Does there necessarily have to be a "visible" control dependency?

Unfortunately no

That's what I thought :(

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#11Merlin Moncure
mmoncure@gmail.com
In reply to: Peter Eisentraut (#6)

On Mon, Sep 30, 2013 at 8:15 AM, Peter Eisentraut <peter_e@gmx.net> wrote:

On 9/27/13 3:00 PM, Merlin Moncure wrote:

Attached is a simplified patch that replaces the spinlock with a read
barrier, based on a suggestion made by Andres offlist.

This patch doesn't apply.

works for me:

merlin@mmoncure-ubuntu:~/pgdev/pgsql$ git reset --hard HEAD
HEAD is now at 200ba16 Add regression test for bug fixed by recent refactoring.
merlin@mmoncure-ubuntu:~/pgdev/pgsql$ patch -p1 < buffer5.patch
patching file src/backend/access/transam/xlog.c

On Mon, Sep 30, 2013 at 7:51 PM, Ants Aasma <ants@cybertec.at> wrote:

So we need a read barrier somewhere *after* reading the flag in
RecoveryInProgress() and *before* reading the shared memory structures, and in
theory a full barrier if we are going to be writing data.

wow -- thanks for your review and the detail provided. Considering there
are no examples of barrier instructions to date, I think some of your
commentary should be included in the in-source documentation.

In this particular case, a read barrier should be sufficient? By
'writing data', do you mean to the xlog control structure? This
routine only sets a backend-local flag, so that should be safe?

merlin


#12Andres Freund
andres@2ndquadrant.com
In reply to: Merlin Moncure (#11)

On 2013-10-02 08:39:59 -0500, Merlin Moncure wrote:

wow -- thanks for your review and the detail provided. Considering there
are no examples of barrier instructions to date, I think some of your
commentary should be included in the in-source documentation.

I think most of it is in README.barrier

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#13Ants Aasma
ants@cybertec.at
In reply to: Merlin Moncure (#11)

On Wed, Oct 2, 2013 at 4:39 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Mon, Sep 30, 2013 at 7:51 PM, Ants Aasma <ants@cybertec.at> wrote:

So we need a read barrier somewhere *after* reading the flag in
RecoveryInProgress() and *before* reading the shared memory structures, and in
theory a full barrier if we are going to be writing data.

wow -- thanks for your review and the detail provided. Considering there
are no examples of barrier instructions to date, I think some of your
commentary should be included in the in-source documentation.

In this particular case, a read barrier should be sufficient? By
'writing data', do you mean to the xlog control structure? This
routine only sets a backend-local flag, so that should be safe?

I haven't reviewed the code in enough detail to say whether there is an
actual race here; I tend to think there's probably not, but the
specific pattern that I had in mind is the following:

Process A:
globalVar = 1;
write_barrier();
recoveryInProgress = false;

Process B:
if (!recoveryInProgress) {
    globalVar = 2;
    doWork();
}

If process B speculatively executes line 2 before reading the flag for
line 1, then it's possible that the store in process B is executed
before the store in process A, making globalVar move backwards. The
barriers as they are defined don't make this scenario impossible. That
said, I don't know of any hardware that would make speculatively
executed stores visible to non-speculative state, as I said, that
would be completely insane. However currently compilers consider it
completely legal to rewrite the code into the following form:

tmp = globalVar;
globalVar = 2;
if (!recoveryInProgress) {
    doWork();
} else {
    globalVar = tmp;
}

That still exhibits the same problem. An abstract read barrier would
not be enough here, as this requires a LoadStore barrier. However, the
control dependency is enough for the hardware and PostgreSQL
pg_read_barrier() is a full compiler barrier, so in practice a simple
pg_read_barrier() is enough.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


#14Merlin Moncure
mmoncure@gmail.com
In reply to: Ants Aasma (#13)

On Wed, Oct 2, 2013 at 9:45 AM, Ants Aasma <ants@cybertec.at> wrote:

I haven't reviewed the code in enough detail to say whether there is an
actual race here; I tend to think there's probably not, but the
specific pattern that I had in mind is the following:

hm. I think there *is* a race. 2+ threads could race to the line:

LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;

and simultaneously set the value of LocalRecoveryInProgress to false,
and both engage InitXLOGAccess, which is destructive. The move
operation is atomic, but I don't think there's any guarantee the reads
to xlogctl->SharedRecoveryInProgress are ordered between threads
without a lock.

I don't think the memory barrier will fix this. Do you agree? If so,
my earlier patch (recovery4) is another take on this problem, and
arguably safer; the unlocked read is in a separate path that does not
engage InitXLOGAccess()

merlin


#15Ants Aasma
ants@cybertec.at
In reply to: Merlin Moncure (#14)

On Wed, Oct 2, 2013 at 10:37 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Wed, Oct 2, 2013 at 9:45 AM, Ants Aasma <ants@cybertec.at> wrote:

I haven't reviewed the code in enough detail to say whether there is an
actual race here; I tend to think there's probably not, but the
specific pattern that I had in mind is the following:

hm. I think there *is* a race. 2+ threads could race to the line:

LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;

and simultaneously set the value of LocalRecoveryInProgress to false,
and both engage InitXLOGAccess, which is destructive. The move
operation is atomic, but I don't think there's any guarantee the reads
to xlogctl->SharedRecoveryInProgress are ordered between threads
without a lock.

I don't think the memory barrier will fix this. Do you agree? If so,
my earlier patch (recovery4) is another take on this problem, and
arguably safer; the unlocked read is in a separate path that does not
engage InitXLOGAccess()

InitXLOGAccess only writes backend-local variables, so it can't be
destructive. In fact, it calls GetRedoRecPtr(), which does a spinlock
acquisition cycle, which should be a full memory barrier. A read
barrier after accessing SharedRecoveryInProgress is good enough, and
it seems to be necessary to avoid a race condition for
XLogCtl->ThisTimeLineID.

Admittedly the race is so improbable that in practice it probably
doesn't matter. I just wanted to spell out the correct way to do
unlocked accesses as it can get quite confusing. I found Herb Sutter's
"atomic<> weapons" talk very useful, thinking about the problem in
terms of acquire and release makes it much clearer to me.

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


#16Merlin Moncure
mmoncure@gmail.com
In reply to: Ants Aasma (#15)

On Wed, Oct 2, 2013 at 3:43 PM, Ants Aasma <ants@cybertec.at> wrote:

On Wed, Oct 2, 2013 at 10:37 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Wed, Oct 2, 2013 at 9:45 AM, Ants Aasma <ants@cybertec.at> wrote:

I haven't reviewed the code in enough detail to say whether there is an
actual race here; I tend to think there's probably not, but the
specific pattern that I had in mind is the following:

hm. I think there *is* a race. 2+ threads could race to the line:

LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;

and simultaneously set the value of LocalRecoveryInProgress to false,
and both engage InitXLOGAccess, which is destructive. The move
operation is atomic, but I don't think there's any guarantee the reads
to xlogctl->SharedRecoveryInProgress are ordered between threads
without a lock.

I don't think the memory barrier will fix this. Do you agree? If so,
my earlier patch (recovery4) is another take on this problem, and
arguably safer; the unlocked read is in a separate path that does not
engage InitXLOGAccess()

InitXLOGAccess only writes backend-local variables, so it can't be
destructive. In fact, it calls GetRedoRecPtr(), which does a spinlock
acquisition cycle, which should be a full memory barrier. A read
barrier after accessing SharedRecoveryInProgress is good enough, and
it seems to be necessary to avoid a race condition for
XLogCtl->ThisTimeLineID.

Admittedly the race is so improbable that in practice it probably
doesn't matter. I just wanted to spell out the correct way to do
unlocked accesses as it can get quite confusing. I found Herb Sutter's
"atomic<> weapons" talk very useful, thinking about the problem in
terms of acquire and release makes it much clearer to me.

If we don't care about multiple calls to InitXLOGAccess() (for a
backend) then we can get away with just a barrier. That's a pretty
weird assumption though (even if 'improbable') and I think it should
be well documented in the code, particularly since previous versions
went through such trouble to do a proper check->set->init.

I'm still leaning toward my earlier idea (recovery4.patch) since it keeps
the old semantics by exploiting the fact that the call site of
interest (only) does not require a correct answer (that is, for the
backend to think it's in recovery when it's not). That just defers
the heap prune for a time and you don't have to mess around with
barriers at all or be concerned about present or future issues caused
by spurious extra calls to InitXLOGAccess(). The other code paths to
RecoveryInProgress() appear not to be interesting from a spinlock
perspective.

merlin


#17Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Merlin Moncure (#16)
1 attachment(s)

On 03.10.2013 00:14, Merlin Moncure wrote:

On Wed, Oct 2, 2013 at 3:43 PM, Ants Aasma <ants@cybertec.at> wrote:

On Wed, Oct 2, 2013 at 10:37 PM, Merlin Moncure <mmoncure@gmail.com> wrote:

On Wed, Oct 2, 2013 at 9:45 AM, Ants Aasma <ants@cybertec.at> wrote:

I haven't reviewed the code in as much detail to say if there is an
actual race here; I tend to think there's probably not, but the
specific pattern that I had in mind is that with the following actual
code:

hm. I think there *is* a race. 2+ threads could race to the line:

LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;

and simultaneously set the value of LocalRecoveryInProgress to false,
and both engage InitXLOGAccess, which is destructive. The move
operation is atomic, but I don't think there's any guarantee the reads
to xlogctl->SharedRecoveryInProgress are ordered between threads
without a lock.

The backend is single-threaded, so this is moot.

InitXLOGAccess only writes backend local variables, so it can't be
destructive. In fact, it calls GetRedoRecPtr(), which does a spinlock
acquisition cycle, which should be a full memory barrier. A read
barrier after accessing SharedRecoveryInProgress is good enough, and
it seems to be necessary to avoid a race condition for
XLogCtl->ThisTimeLineID.

Admittedly the race is so improbable that in practice it probably
doesn't matter. I just wanted to spell out the correct way to do
unlocked accesses as it can get quite confusing. I found Herb Sutter's
"atomic<> weapons" talk very useful, thinking about the problem in
terms of acquire and release makes it much clearer to me.

If we don't care about multiple calls to InitXLOGAccess() (for a
backend) then we can get away with just a barrier. That's a pretty
weird assumption though (even if 'improbable') and I think it should
be well documented in the code, particularly since previous versions
went through such trouble to do a proper check->set->init.

I'm still leaning toward my earlier idea (recovery4.patch) since it keeps
the old semantics by exploiting the fact that the call site of
interest (only) does not require a correct answer (that is, for the
backend to think it's in recovery when it's not). That just defers
the heap prune for a time and you don't have to mess around with
barriers at all or be concerned about present or future issues caused
by spurious extra calls to InitXLOGAccess(). The other code paths to
RecoveryInProgress() appear not to be interesting from a spinlock
perspective.

Hmm. All callers of RecoveryInProgress() must be prepared to handle the
case that RecoveryInProgress() returns true, but the system is no longer
in recovery. No matter what locking we do in RecoveryInProgress(), the
startup process might finish recovery just after RecoveryInProgress()
has returned.

What about the attached? It reads the shared variable without a lock or
barrier. If it returns 'true', but the system in fact just exited
recovery, that's OK. As explained above, all the callers must tolerate
that anyway. But if it returns 'false', then it performs a full memory
barrier, which should ensure that it sees any other shared variables as
they are after the startup process cleared SharedRecoveryInProgress
(notably, XLogCtl->ThisTimeLineID).

- Heikki

Attachments:

recovery6.patchtext/x-diff; name=recovery6.patchDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
old mode 100644
new mode 100755
index a95149b..a9bfdfc
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7370,10 +7370,7 @@ RecoveryInProgress(void)
 		/* use volatile pointer to prevent code rearrangement */
 		volatile XLogCtlData *xlogctl = XLogCtl;
 
-		/* spinlock is essential on machines with weak memory ordering! */
-		SpinLockAcquire(&xlogctl->info_lck);
 		LocalRecoveryInProgress = xlogctl->SharedRecoveryInProgress;
-		SpinLockRelease(&xlogctl->info_lck);
 
 		/*
 		 * Initialize TimeLineID and RedoRecPtr when we discover that recovery
@@ -7382,7 +7379,15 @@ RecoveryInProgress(void)
 		 * this, see also LocalSetXLogInsertAllowed.)
 		 */
 		if (!LocalRecoveryInProgress)
+		{
+			/*
+			 * If we just exited recovery, make sure we read any other global
+			 * state after SharedRecoveryInProgress (for machines with weak
+			 * memory ordering).
+			 */
+			pg_memory_barrier();
 			InitXLOGAccess();
+		}
 
 		return LocalRecoveryInProgress;
 	}
#18Andres Freund
andres@2ndquadrant.com
In reply to: Heikki Linnakangas (#17)

On 2013-11-21 16:25:02 +0200, Heikki Linnakangas wrote:

Hmm. All callers of RecoveryInProgress() must be prepared to handle the case
that RecoveryInProgress() returns true, but the system is no longer in
recovery. No matter what locking we do in RecoveryInProgress(), the startup
process might finish recovery just after RecoveryInProgress() has returned.

True.

What about the attached? It reads the shared variable without a lock or
barrier. If it returns 'true', but the system in fact just exited recovery,
that's OK. As explained above, all the callers must tolerate that anyway.
But if it returns 'false', then it performs a full memory barrier, which
should ensure that it sees any other shared variables as they are after the
startup process cleared SharedRecoveryInProgress (notably,
XLogCtl->ThisTimeLineID).

I'd argue that we should also remove the spinlock in StartupXLOG and
replace it with a write barrier. Obviously not for performance reasons,
but because somebody might add more code to run under that spinlock.

Looks good otherwise, although a read memory barrier ought to suffice.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#19Merlin Moncure
mmoncure@gmail.com
In reply to: Andres Freund (#18)

On Thu, Nov 21, 2013 at 9:02 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-11-21 16:25:02 +0200, Heikki Linnakangas wrote:

Hmm. All callers of RecoveryInProgress() must be prepared to handle the case
that RecoveryInProgress() returns true, but the system is no longer in
recovery. No matter what locking we do in RecoveryInProgress(), the startup
process might finish recovery just after RecoveryInProgress() has returned.

True.

What about the attached? It reads the shared variable without a lock or
barrier. If it returns 'true', but the system in fact just exited recovery,
that's OK. As explained above, all the callers must tolerate that anyway.
But if it returns 'false', then it performs a full memory barrier, which
should ensure that it sees any other shared variables as they are after the
startup process cleared SharedRecoveryInProgress (notably,
XLogCtl->ThisTimeLineID).

I'd argue that we should also remove the spinlock in StartupXLOG and
replace it with a write barrier. Obviously not for performance reasons,
but because somebody might add more code to run under that spinlock.

Looks good otherwise, although a read memory barrier ought to suffice.

This code is in a very hot code path. Are we *sure* that the read
barrier is fast enough that we don't want to provide an alternate
function that only returns the local flag? I don't know enough about
them to say either way.

merlin


#20Andres Freund
andres@2ndquadrant.com
In reply to: Merlin Moncure (#19)

On 2013-11-21 09:08:05 -0600, Merlin Moncure wrote:

This code is in a very hot code path. Are we *sure* that the read
barrier is fast enough that we don't want to provide an alternate
function that only returns the local flag? I don't know enough about
them to say either way.

A read barrier is just a compiler barrier on x86.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


#21Merlin Moncure
mmoncure@gmail.com
In reply to: Andres Freund (#20)

On Thu, Nov 21, 2013 at 9:09 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-11-21 09:08:05 -0600, Merlin Moncure wrote:

This code is in a very hot code path. Are we *sure* that the read
barrier is fast enough that we don't want to provide an alternate
function that only returns the local flag? I don't know enough about
them to say either way.

A read barrier is just a compiler barrier on x86.

That's good enough for me then.

merlin


#22Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Merlin Moncure (#19)

On 21.11.2013 17:08, Merlin Moncure wrote:

On Thu, Nov 21, 2013 at 9:02 AM, Andres Freund <andres@2ndquadrant.com> wrote:

On 2013-11-21 16:25:02 +0200, Heikki Linnakangas wrote:

Hmm. All callers of RecoveryInProgress() must be prepared to handle the case
that RecoveryInProgress() returns true, but the system is no longer in
recovery. No matter what locking we do in RecoveryInProgress(), the startup
process might finish recovery just after RecoveryInProgress() has returned.

True.

What about the attached? It reads the shared variable without a lock or
barrier. If it returns 'true', but the system in fact just exited recovery,
that's OK. As explained above, all the callers must tolerate that anyway.
But if it returns 'false', then it performs a full memory barrier, which
should ensure that it sees any other shared variables as they are after the
startup process cleared SharedRecoveryInProgress (notably,
XLogCtl->ThisTimeLineID).

I'd argue that we should also remove the spinlock in StartupXLOG and
replace it with a write barrier. Obviously not for performance reasons,
but because somebody might add more code to run under that spinlock.

Looks good otherwise, although a read memory barrier ought to suffice.

This code is in a very hot code path. Are we *sure* that the read
barrier is fast enough that we don't want to provide an alternate
function that only returns the local flag? I don't know enough about
them to say either way.

In my patch, I put the barrier inside the if (!LocalRecoveryInProgress)
block. That codepath can only execute once in a backend, so performance
is not an issue there. Does that look sane to you?

- Heikki


#23Merlin Moncure
mmoncure@gmail.com
In reply to: Heikki Linnakangas (#22)

On Thu, Nov 21, 2013 at 10:37 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:

On 21.11.2013 17:08, Merlin Moncure wrote:

On Thu, Nov 21, 2013 at 9:02 AM, Andres Freund <andres@2ndquadrant.com>
wrote:

On 2013-11-21 16:25:02 +0200, Heikki Linnakangas wrote:

Hmm. All callers of RecoveryInProgress() must be prepared to handle the
case
that RecoveryInProgress() returns true, but the system is no longer in
recovery. No matter what locking we do in RecoveryInProgress(), the
startup
process might finish recovery just after RecoveryInProgress() has
returned.

True.

What about the attached? It reads the shared variable without a lock or
barrier. If it returns 'true', but the system in fact just exited
recovery,
that's OK. As explained above, all the callers must tolerate that
anyway.
But if it returns 'false', then it performs a full memory barrier, which
should ensure that it sees any other shared variables as they are after the
startup process cleared SharedRecoveryInProgress (notably,
XLogCtl->ThisTimeLineID).

I'd argue that we should also remove the spinlock in StartupXLOG and
replace it with a write barrier. Obviously not for performance reasons,
but because somebody might add more code to run under that spinlock.

Looks good otherwise, although a read memory barrier ought to suffice.

This code is in a very hot code path. Are we *sure* that the read
barrier is fast enough that we don't want to provide an alternate
function that only returns the local flag? I don't know enough about
them to say either way.

In my patch, I put the barrier inside the if (!LocalRecoveryInProgress)
block. That codepath can only execute once in a backend, so performance is
not an issue there. Does that look sane to you?

oh right -- certainly!

merlin


#24Heikki Linnakangas
hlinnakangas@vmware.com
In reply to: Merlin Moncure (#23)

On 21.11.2013 21:37, Merlin Moncure wrote:

On Thu, Nov 21, 2013 at 10:37 AM, Heikki Linnakangas

In my patch, I put the barrier inside the if (!LocalRecoveryInProgress)
block. That codepath can only execute once in a backend, so performance is
not an issue there. Does that look sane to you?

oh right -- certainly!

Ok, committed.

- Heikki
