basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

Started by Andres Freundalmost 11 years ago23 messages
#1Andres Freund
andres@2ndquadrant.com

Hi,

I'm analyzing a problem in which a customer had a pg_basebackup (from
standby) created 9.2 cluster that failed with "WAL contains references to
invalid pages". The failed record was a "xlog redo visible"
i.e. XLOG_HEAP2_VISIBLE.

First I thought there might be another bug along the line of
17fa4c321cc. Looking at the code and the WAL that didn't seem to be the
case (man, I miss pg_xlogdump). Other, slightly older, standbys, didn't
seem to have any problems.

Logs show that a ALTER DATABASE ... SET TABLESPACE ... was running when
the basebackup was started and finished *before* pg_basebackup finished.

movedb() basically works in these steps:
1) lock out users of the database
2) RequestCheckpoint(IMMEDIATE|WAIT)
3) DropDatabaseBuffers()
4) copydir()
5) XLogInsert(XLOG_DBASE_CREATE)
6) RequestCheckpoint(CHECKPOINT_IMMEDIATE)
7) rmtree(src_dbpath)
8) XLogInsert(XLOG_DBASE_DROP)
9) unlock database

If a basebackup starts while 4) is in progress and continues until 7)
happens I think a pretty wide race opens: The basebackup can end up with
a partial copy of the database in the old tablespace because the
rmtree(old_path) concurrently was in progress. Normally such races are
fixed during replay. But in this case, the replay of the
XLOG_DBASE_CREATE will just try to do a rmtree(new); copydiar(old, new);.
fixing nothing.

Besides making AD .. ST use sane WAL logging, which doesn't seem
backpatchable, I don't see what could be done against this except
somehow making basebackups fail if a AD .. ST is in progress. Which
doesn't look entirely trivial either.

This is a pretty nasty bug imo, because you're in no way guaranteed to
be noticed. If the problem happens only in some large, seldomly read,
table, the database might appear to be in a correct state.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#1)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

Hi,

On 2015-01-20 16:28:19 +0100, Andres Freund wrote:

I'm analyzing a problem in which a customer had a pg_basebackup (from
standby) created 9.2 cluster that failed with "WAL contains references to
invalid pages". The failed record was a "xlog redo visible"
i.e. XLOG_HEAP2_VISIBLE.

First I thought there might be another bug along the line of
17fa4c321cc. Looking at the code and the WAL that didn't seem to be the
case (man, I miss pg_xlogdump). Other, slightly older, standbys, didn't
seem to have any problems.

Logs show that a ALTER DATABASE ... SET TABLESPACE ... was running when
the basebackup was started and finished *before* pg_basebackup finished.

movedb() basically works in these steps:
1) lock out users of the database
2) RequestCheckpoint(IMMEDIATE|WAIT)
3) DropDatabaseBuffers()
4) copydir()
5) XLogInsert(XLOG_DBASE_CREATE)
6) RequestCheckpoint(CHECKPOINT_IMMEDIATE)
7) rmtree(src_dbpath)
8) XLogInsert(XLOG_DBASE_DROP)
9) unlock database

If a basebackup starts while 4) is in progress and continues until 7)
happens I think a pretty wide race opens: The basebackup can end up with
a partial copy of the database in the old tablespace because the
rmtree(old_path) concurrently was in progress. Normally such races are
fixed during replay. But in this case, the replay of the
XLOG_DBASE_CREATE will just try to do a rmtree(new); copydiar(old, new);.
fixing nothing.

Besides making AD .. ST use sane WAL logging, which doesn't seem
backpatchable, I don't see what could be done against this except
somehow making basebackups fail if a AD .. ST is in progress. Which
doesn't look entirely trivial either.

I basically have two ideas to fix this.

1) Make do_pg_start_backup() acquire a SHARE lock on
pg_database. That'll prevent it from starting while a movedb() is
still in progress. Then additionally add pg_backup_in_progress()
function to xlog.c that checks (XLogCtl->Insert.exclusiveBackup ||
XLogCtl->Insert.nonExclusiveBackups != 0). Use that in createdb() and
movedb() to error out if a backup is in progress.

Not pretty, but sounds doable.

We've discussed trying to sleep instead of erroring out in movedb(),
while a base backup is in progress, but that's nontrivial. I also
don't think ALTER DATABASE is ever intentionally run at the time of a
base backup.

2) Make movedb() (and possibly created(), not sure yet) use proper WAL
logging and log the whole copied data. I think this is the right long
term fix and would end up being much more reliable. But it either
requires some uglyness during redo (creating nonexistant database
directories on the fly during redo) or new wal records.

Doable, but probably too large/invasive to backpatch.

Thanks for Alvaro and Petr for discussing the problem.

I lean towards doing 1) in all branches and then doing 2) in master.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Andres Freund (#2)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

Andres Freund wrote:

2) Make movedb() (and possibly created(), not sure yet) use proper WAL
logging and log the whole copied data. I think this is the right long
term fix and would end up being much more reliable. But it either
requires some uglyness during redo (creating nonexistant database
directories on the fly during redo) or new wal records.

Doable, but probably too large/invasive to backpatch.

Not to mention the extra WAL traffic ...

--
�lvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Alvaro Herrera (#3)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On 1/22/15 1:43 PM, Alvaro Herrera wrote:

Andres Freund wrote:

2) Make movedb() (and possibly created(), not sure yet) use proper WAL
logging and log the whole copied data. I think this is the right long
term fix and would end up being much more reliable. But it either
requires some uglyness during redo (creating nonexistant database
directories on the fly during redo) or new wal records.

Doable, but probably too large/invasive to backpatch.

Not to mention the extra WAL traffic ...

Yeah, I don't know that we actually want #2. It's bad enough to copy an entire database locally, but to then put it's entire contents into WAL? Blech.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Andres Freund
andres@2ndquadrant.com
In reply to: Jim Nasby (#4)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On 2015-01-22 14:42:18 -0600, Jim Nasby wrote:

On 1/22/15 1:43 PM, Alvaro Herrera wrote:

Andres Freund wrote:

2) Make movedb() (and possibly created(), not sure yet) use proper WAL
logging and log the whole copied data. I think this is the right long
term fix and would end up being much more reliable. But it either
requires some uglyness during redo (creating nonexistant database
directories on the fly during redo) or new wal records.

Doable, but probably too large/invasive to backpatch.

Not to mention the extra WAL traffic ...

Yeah, I don't know that we actually want #2. It's bad enough to copy
an entire database locally

The local copy is pretty much fundamental. Given that tablespaces
usually will be on different filesystems there's not much else we can
do.

If we want a optimization for moving databases across tablespaces if
both are on the same filesystems and wal_level < archive, we can do
that. But that's pretty independent. And I doubt it's worthwile the
developer time.

, but to then put it's entire contents into WAL? Blech.

Besides actually having a chance of being correct, doing so will save
having to do two checkpoints inside movedb(). I think it's pretty likely
that that actually saves overall IO, even including the WAL
writes. Especially if there's other databases in the cluster at the same
time.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Stephen Frost
sfrost@snowman.net
In reply to: Andres Freund (#2)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

Andres,

* Andres Freund (andres@2ndquadrant.com) wrote:

1) Make do_pg_start_backup() acquire a SHARE lock on
pg_database. That'll prevent it from starting while a movedb() is
still in progress. Then additionally add pg_backup_in_progress()
function to xlog.c that checks (XLogCtl->Insert.exclusiveBackup ||
XLogCtl->Insert.nonExclusiveBackups != 0). Use that in createdb() and
movedb() to error out if a backup is in progress.

Not pretty, but sounds doable.

We've discussed trying to sleep instead of erroring out in movedb(),
while a base backup is in progress, but that's nontrivial. I also
don't think ALTER DATABASE is ever intentionally run at the time of a
base backup.

2) Make movedb() (and possibly created(), not sure yet) use proper WAL
logging and log the whole copied data. I think this is the right long
term fix and would end up being much more reliable. But it either
requires some uglyness during redo (creating nonexistant database
directories on the fly during redo) or new wal records.

Doable, but probably too large/invasive to backpatch.

Thanks for Alvaro and Petr for discussing the problem.

I lean towards doing 1) in all branches and then doing 2) in master.

I'm trying to figure out why you'd do '2' in master when in discussion
of '1' you say "I also don't think ALTER DATABASE is even intentionally
run at the time of a base backup." I agree with that sentiment and am
inclined to say that '1' is good enough throughout.

Another thought would be to offer both- perhaps an AD .. ST ..
CONCURRENTLY option which would WAL. Or a way to say "my backup is more
important than some AD .. ST ..", which I could see some admins wanting.

The other question is- what about AT .. ST? That is, ALTER TABLE .. SET
TABLESPACE. Do we do things differently there or is there a similar
issue?

Thanks,

Stephen

#7Andres Freund
andres@2ndquadrant.com
In reply to: Stephen Frost (#6)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On 2015-01-22 16:38:49 -0500, Stephen Frost wrote:

Andres,

* Andres Freund (andres@2ndquadrant.com) wrote:

1) Make do_pg_start_backup() acquire a SHARE lock on
pg_database. That'll prevent it from starting while a movedb() is
still in progress. Then additionally add pg_backup_in_progress()
function to xlog.c that checks (XLogCtl->Insert.exclusiveBackup ||
XLogCtl->Insert.nonExclusiveBackups != 0). Use that in createdb() and
movedb() to error out if a backup is in progress.

Not pretty, but sounds doable.

We've discussed trying to sleep instead of erroring out in movedb(),
while a base backup is in progress, but that's nontrivial. I also
don't think ALTER DATABASE is ever intentionally run at the time of a
base backup.

2) Make movedb() (and possibly created(), not sure yet) use proper WAL
logging and log the whole copied data. I think this is the right long
term fix and would end up being much more reliable. But it either
requires some uglyness during redo (creating nonexistant database
directories on the fly during redo) or new wal records.

Doable, but probably too large/invasive to backpatch.

Thanks for Alvaro and Petr for discussing the problem.

I lean towards doing 1) in all branches and then doing 2) in master.

I'm trying to figure out why you'd do '2' in master when in discussion
of '1' you say "I also don't think ALTER DATABASE is even intentionally
run at the time of a base backup." I agree with that sentiment and am
inclined to say that '1' is good enough throughout.

Because the way it currently works is a major crock. It's more luck than
anything that it actually somewhat works. We normally rely on WAL to
bring us into a consistent state. But around CREATE/MOVE/DROP DATABASE
we've ignored that.

My point about not intentionally running it at the same isn't that it's
not happening, it's that it's not intended to happen at the same
time. But most sane sites these days actually do use automated
basebackups - making it much harder to be safe against this.

And. Hm. The difficulty of the current method is evidenced by the fact
that so far nodoby recognized that 1) as described above isn't actually
safe. It fails to protect against basebackups on a standby as its
XLogCtl state will obviously not be visible on the master.

For exclusive basebackups (i.e. SELECT pg_start/stop_backup()) we can't
rely on locking because these commands can happen in independent
sessions. But I think we can in the builtin nonexclusive basebackups, as
it's run in one session. So I guess we could rely on recovery conflicts
not allowing the XLOG_DBASE_CREATE/DROP to replicate. It's nasty that a
basebackup can suddenly stop replication from progressing though :(.
Additionally it'd not protect stuff like pgespresso (an extension for
nonexclusive standby basebackups).

The other question is- what about AT .. ST? That is, ALTER TABLE .. SET
TABLESPACE. Do we do things differently there or is there a similar
issue?

No issue, because it actually is implemented halfway sanely sanely and
uses WAL logging. I personally don't like at all that it uses
FlushRelationBuffers() and reads the tables on the smgr level instead of
using the buffer manager and a bulk state, but ...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#8Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#7)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On 2015-01-22 22:58:17 +0100, Andres Freund wrote:

On 2015-01-22 16:38:49 -0500, Stephen Frost wrote:

I'm trying to figure out why you'd do '2' in master when in discussion
of '1' you say "I also don't think ALTER DATABASE is even intentionally
run at the time of a base backup." I agree with that sentiment and am
inclined to say that '1' is good enough throughout.

Because the way it currently works is a major crock. It's more luck than
anything that it actually somewhat works. We normally rely on WAL to
bring us into a consistent state. But around CREATE/MOVE/DROP DATABASE
we've ignored that.

And. Hm. The difficulty of the current method is evidenced by the fact
that so far nodoby recognized that 1) as described above isn't actually
safe. It fails to protect against basebackups on a standby as its
XLogCtl state will obviously not be visible on the master.

Further evidenced by the fact that the current method isn't
crash/shutdown safe at all. If a standby was shut down/crashed/was
started on a base backup when a CREATE DATABASE from the primary is
replayed the template database used can be in an nearly arbitrarily bad
state. It'll later get fixed up by recovery - but those changes won't
make it to the copied database.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Andres Freund (#5)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On 1/22/15 3:18 PM, Andres Freund wrote:

, but to then put it's entire contents into WAL? Blech.

Besides actually having a chance of being correct, doing so will save
having to do two checkpoints inside movedb(). I think it's pretty likely
that that actually saves overall IO, even including the WAL
writes. Especially if there's other databases in the cluster at the same
time.

If you're moving a small amount of data, maybe. If you're moving several hundred GB or more? You're going to flood WAL and probably cause all replication/archiving to lag.

Obviously, we need to do be reliable, but I think a lot of users would much rather that they can't muck with tablespaces while a backup is running than an ADAT suddenly consumes way more resources than before.

Is there some way we can add an option to control this? I'm thinking that by default ADAT will error if a backup is running, but allow the user to over-ride that.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Jim Nasby
Jim.Nasby@BlueTreble.com
In reply to: Andres Freund (#8)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On 1/23/15 9:10 AM, Andres Freund wrote:

On 2015-01-22 22:58:17 +0100, Andres Freund wrote:

On 2015-01-22 16:38:49 -0500, Stephen Frost wrote:

I'm trying to figure out why you'd do '2' in master when in discussion
of '1' you say "I also don't think ALTER DATABASE is even intentionally
run at the time of a base backup." I agree with that sentiment and am
inclined to say that '1' is good enough throughout.

Because the way it currently works is a major crock. It's more luck than
anything that it actually somewhat works. We normally rely on WAL to
bring us into a consistent state. But around CREATE/MOVE/DROP DATABASE
we've ignored that.

And. Hm. The difficulty of the current method is evidenced by the fact
that so far nodoby recognized that 1) as described above isn't actually
safe. It fails to protect against basebackups on a standby as its
XLogCtl state will obviously not be visible on the master.

Further evidenced by the fact that the current method isn't
crash/shutdown safe at all. If a standby was shut down/crashed/was
started on a base backup when a CREATE DATABASE from the primary is
replayed the template database used can be in an nearly arbitrarily bad
state. It'll later get fixed up by recovery - but those changes won't
make it to the copied database.

I think we all agree that ADAT can't run while a base backup is happening, which I believe is what you're describing above. We'd have to somehow cover that same scenario on replicas too.

Perhaps there isn't really an issue here; I suspect ADAT is very rarely used. What I'm worried about though is that someone is using it regularly for some reason, with non-trivial databases, and this is going to completely hose them.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#11Andres Freund
andres@2ndquadrant.com
In reply to: Jim Nasby (#9)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On 2015-01-23 12:52:03 -0600, Jim Nasby wrote:

On 1/22/15 3:18 PM, Andres Freund wrote:

, but to then put it's entire contents into WAL? Blech.

Besides actually having a chance of being correct, doing so will save
having to do two checkpoints inside movedb(). I think it's pretty likely
that that actually saves overall IO, even including the WAL
writes. Especially if there's other databases in the cluster at the same
time.

If you're moving a small amount of data, maybe. If you're moving
several hundred GB or more? You're going to flood WAL and probably
cause all replication/archiving to lag.

I have hard time believing many people do AD ST on a database a couple
hundred GB large. That'd mean the database is out of business for a
significant amount of time (remember, there can't be any connected users
during that time). I think realistically it's only used on smaller
databases.

The reason createdb()/movedb() work the way they do is imo simply
because they're old.

So far I can't see how the current solution can actually be made safe to
be executed on a not yet consistent database. And we obviously can't
wait for it to be consistent (as we're replaying a linear stream of WAL).

Is there some way we can add an option to control this? I'm thinking
that by default ADAT will error if a backup is running, but allow the
user to over-ride that.

Why would we want to allow that? There's simply no way it's safe, so
...?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Noname
andres@anarazel.de
In reply to: Jim Nasby (#10)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On 2015-01-23 13:01:34 -0600, Jim Nasby wrote:

On 1/23/15 9:10 AM, Andres Freund wrote:

On 2015-01-22 22:58:17 +0100, Andres Freund wrote:

On 2015-01-22 16:38:49 -0500, Stephen Frost wrote:

I'm trying to figure out why you'd do '2' in master when in discussion
of '1' you say "I also don't think ALTER DATABASE is even intentionally
run at the time of a base backup." I agree with that sentiment and am
inclined to say that '1' is good enough throughout.

Because the way it currently works is a major crock. It's more luck than
anything that it actually somewhat works. We normally rely on WAL to
bring us into a consistent state. But around CREATE/MOVE/DROP DATABASE
we've ignored that.

And. Hm. The difficulty of the current method is evidenced by the fact
that so far nodoby recognized that 1) as described above isn't actually
safe. It fails to protect against basebackups on a standby as its
XLogCtl state will obviously not be visible on the master.

Further evidenced by the fact that the current method isn't
crash/shutdown safe at all. If a standby was shut down/crashed/was
started on a base backup when a CREATE DATABASE from the primary is
replayed the template database used can be in an nearly arbitrarily bad
state. It'll later get fixed up by recovery - but those changes won't
make it to the copied database.

I think we all agree that ADAT can't run while a base backup is
happening, which I believe is what you're describing above.

No, that's not what I'm descirbing in the last paragraph. The problem is
present without any AD ST. When a cluster crashes it's likely not in a
consistent state - we need to replay from the last checkpoint's REDO
pointer to the minimum recovery location to make it so. The problem
though is that if the copied database (both for CREATE DB/SET
TABLESPACE) is copied that record can be replayed well before the
database is in a consistent state. In that case the new copy will be in
a corrupted state as it'll not be fixed up by recovery, in contrast to
the original, which will.

Perhaps there isn't really an issue here; I suspect ADAT is very
rarely used. What I'm worried about though is that someone is using it
regularly for some reason, with non-trivial databases, and this is
going to completely hose them.

I think if somebody does this regularly on nontrivialy sized databases,
over replication, they'd have hit this bug a long time ago. It's really
not that hard to reproduce.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#7)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On 2015-01-22 22:58:17 +0100, Andres Freund wrote:

Because the way it currently works is a major crock. It's more luck than
anything that it actually somewhat works. We normally rely on WAL to
bring us into a consistent state. But around CREATE/MOVE/DROP DATABASE
we've ignored that.

Hah: There's even a comment about some of the existing dangers:
*
* In PITR replay, the first of these isn't an issue, and the second
* is only a risk if the CREATE DATABASE and subsequent template
* database change both occur while a base backup is being taken.
* There doesn't seem to be much we can do about that except document
* it as a limitation.
*
* Perhaps if we ever implement CREATE DATABASE in a less cheesy way,
* we can avoid this.
only that it has never been documented as a limitation to my knowledge...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#2)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

Hi,

On 2015-01-22 19:56:07 +0100, Andres Freund wrote:

Hi,

On 2015-01-20 16:28:19 +0100, Andres Freund wrote:

I'm analyzing a problem in which a customer had a pg_basebackup (from
standby) created 9.2 cluster that failed with "WAL contains references to
invalid pages". The failed record was a "xlog redo visible"
i.e. XLOG_HEAP2_VISIBLE.

First I thought there might be another bug along the line of
17fa4c321cc. Looking at the code and the WAL that didn't seem to be the
case (man, I miss pg_xlogdump). Other, slightly older, standbys, didn't
seem to have any problems.

Logs show that a ALTER DATABASE ... SET TABLESPACE ... was running when
the basebackup was started and finished *before* pg_basebackup finished.

movedb() basically works in these steps:
1) lock out users of the database
2) RequestCheckpoint(IMMEDIATE|WAIT)
3) DropDatabaseBuffers()
4) copydir()
5) XLogInsert(XLOG_DBASE_CREATE)
6) RequestCheckpoint(CHECKPOINT_IMMEDIATE)
7) rmtree(src_dbpath)
8) XLogInsert(XLOG_DBASE_DROP)
9) unlock database

If a basebackup starts while 4) is in progress and continues until 7)
happens I think a pretty wide race opens: The basebackup can end up with
a partial copy of the database in the old tablespace because the
rmtree(old_path) concurrently was in progress. Normally such races are
fixed during replay. But in this case, the replay of the
XLOG_DBASE_CREATE will just try to do a rmtree(new); copydiar(old, new);.
fixing nothing.

Besides making AD .. ST use sane WAL logging, which doesn't seem
backpatchable, I don't see what could be done against this except
somehow making basebackups fail if a AD .. ST is in progress. Which
doesn't look entirely trivial either.

I basically have two ideas to fix this.

1) Make do_pg_start_backup() acquire a SHARE lock on
pg_database. That'll prevent it from starting while a movedb() is
still in progress. Then additionally add pg_backup_in_progress()
function to xlog.c that checks (XLogCtl->Insert.exclusiveBackup ||
XLogCtl->Insert.nonExclusiveBackups != 0). Use that in createdb() and
movedb() to error out if a backup is in progress.

Attached is a patch trying to this. Doesn't look too bad and lead me to
discover missing recovery conflicts during a AD ST.

But: It doesn't actually work on standbys, because lock.c prevents any
stronger lock than RowExclusive from being acquired. And we need need a
lock that can conflict with WAL replay of DBASE_CREATE, to handle base
backups that are executed on the primary. Those obviously can't detect
whether any standby is currently doing a base backup...

I currently don't have a good idea how to mangle lock.c to allow
this. I've played with doing it like in the second patch, but that
doesn't actually work because of some asserts around ProcSleep - leading
to locks on database objects not working in the startup process (despite
already being used).

The easiest thing would be to just use a lwlock instead of a heavyweight
lock - but those aren't canceleable...

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#15Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#14)
2 attachment(s)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On 2015-01-26 22:03:03 +0100, Andres Freund wrote:

Attached is a patch trying to this. Doesn't look too bad and lead me to
discover missing recovery conflicts during a AD ST.

But: It doesn't actually work on standbys, because lock.c prevents any
stronger lock than RowExclusive from being acquired. And we need need a
lock that can conflict with WAL replay of DBASE_CREATE, to handle base
backups that are executed on the primary. Those obviously can't detect
whether any standby is currently doing a base backup...

I currently don't have a good idea how to mangle lock.c to allow
this. I've played with doing it like in the second patch, but that
doesn't actually work because of some asserts around ProcSleep - leading
to locks on database objects not working in the startup process (despite
already being used).

The easiest thing would be to just use a lwlock instead of a heavyweight
lock - but those aren't canceleable...

As Stephen noticed on irc I forgot to attach those. Caused by the
generic HS issue mentioned in 20150126212458.GA29457@awork2.anarazel.de.

Attached now.

As mentioned above, this isn't really isn't ready (e.g. it'll Assert() on a
standby due to the HS issue).

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Prevent-ALTER-DATABASE-SET-TABLESPACE-from-running-c.patchtext/x-patch; charset=us-asciiDownload
>From e5d46c73955e7647912c2625e31027a6fef7c880 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 26 Jan 2015 21:03:35 +0100
Subject: [PATCH] Prevent ALTER DATABASE SET TABLESPACE from running
 concurrently with a base backup.

The combination can cause a corrupt base backup.
---
 src/backend/access/transam/xlog.c    | 33 +++++++++++++++++++++++++++++
 src/backend/commands/dbcommands.c    | 41 ++++++++++++++++++++++++++++++++++++
 src/backend/replication/basebackup.c | 11 ++++++++++
 src/include/access/xlog.h            |  1 +
 4 files changed, 86 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 629a457..1daa10d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -53,6 +53,7 @@
 #include "storage/ipc.h"
 #include "storage/large_object.h"
 #include "storage/latch.h"
+#include "storage/lmgr.h"
 #include "storage/pmsignal.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
@@ -9329,6 +9330,14 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	else
 		XLogCtl->Insert.nonExclusiveBackups++;
 	XLogCtl->Insert.forcePageWrites = true;
+
+	/*
+	 * Acquire lock on pg datababase just before releasing the WAL insert lock
+	 * and entering the ENSURE_ERROR_CLEANUP block. That way it's sufficient
+	 * to release it in the error cleanup callback.
+	 */
+	LockRelationOid(DatabaseRelationId, ShareLock);
+
 	WALInsertLockRelease();
 
 	/* Ensure we release forcePageWrites if fail below */
@@ -9523,6 +9532,8 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
 
+	UnlockRelationOid(DatabaseRelationId, ShareLock);
+
 	/*
 	 * We're done.  As a convenience, return the starting WAL location.
 	 */
@@ -9537,6 +9548,8 @@ pg_start_backup_callback(int code, Datum arg)
 {
 	bool		exclusive = DatumGetBool(arg);
 
+	UnlockRelationOid(DatabaseRelationId, ShareLock);
+
 	/* Update backup counters and forcePageWrites on failure */
 	WALInsertLockAcquireExclusive();
 	if (exclusive)
@@ -9937,6 +9950,26 @@ do_pg_abort_backup(void)
 }
 
 /*
+ * Is a (exclusive or nonexclusive) base backup running?
+ *
+ * Note that this does not check whether any standby of this node has a
+ * basebackup running, or whether any upstream master (if this is a standby)
+ * has one in progress
+ */
+bool
+LocalBaseBackupInProgress(void)
+{
+	bool ret;
+
+	WALInsertLockAcquire();
+	ret = XLogCtl->Insert.exclusiveBackup ||
+		XLogCtl->Insert.nonExclusiveBackups > 0;
+	WALInsertLockRelease();
+
+	return ret;
+}
+
+/*
  * Get latest redo apply position.
  *
  * Exported to allow WALReceiver to read the pointer directly.
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 5e66961..6184c83 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1087,6 +1087,30 @@ movedb(const char *dbname, const char *tblspcname)
 				 errmsg("cannot change the tablespace of the currently open database")));
 
 	/*
+	 * Prevent SET TABLESPACE from running concurrently with a base
+	 * backup. Without that check a base backup would potentially copy a
+	 * partially removed source database; which WAL replay then would copy
+	 * over the new database...
+	 *
+	 * Starting a base backup takes a SHARE lock on pg_database. In addition a
+	 * streaming basebackup takes the same lock for the entirety of the copy
+	 * of the data directory.  That, combined with this check, prevents base
+	 * backups from being taken at the same time a SET TABLESPACE is in
+	 * progress.
+	 *
+	 * Note that this check here will not trigger if a standby currently has a
+	 * base backup ongoing; instead WAL replay will have a recovery conflict
+	 * when replaying the DBASE_CREATE record. That is only safe because
+	 * standbys can only do nonexclusive base backups which hold the lock over
+	 * the entire runtime - otherwise no lock would prevent replay after
+	 * pg_start_backup().
+	 */
+	if (LocalBaseBackupInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("cannot change a database's tablespace while a base backup is in progress")));
+
+	/*
 	 * Get tablespace's oid
 	 */
 	dst_tblspcoid = get_tablespace_oid(tblspcname, false);
@@ -2052,6 +2076,17 @@ dbase_redo(XLogReaderState *record)
 		src_path = GetDatabasePath(xlrec->src_db_id, xlrec->src_tablespace_id);
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
+		if (InHotStandby)
+		{
+			/* Lock source database, do avoid concurrent hint bit writes et al. */
+			LockSharedObjectForSession(DatabaseRelationId, xlrec->src_db_id, 0, AccessExclusiveLock);
+			ResolveRecoveryConflictWithDatabase(xlrec->src_db_id);
+
+			/* Lock target database, it'll be overwritten in a second */
+			LockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
+			ResolveRecoveryConflictWithDatabase(xlrec->db_id);
+		}
+
 		/*
 		 * Our theory for replaying a CREATE is to forcibly drop the target
 		 * subdirectory if present, then re-copy the source data. This may be
@@ -2078,6 +2113,12 @@ dbase_redo(XLogReaderState *record)
 		 * We don't need to copy subdirectories
 		 */
 		copydir(src_path, dst_path, false);
+
+		if (InHotStandby)
+		{
+			UnlockSharedObjectForSession(DatabaseRelationId, xlrec->src_db_id, 0, AccessExclusiveLock);
+			UnlockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
+		}
 	}
 	else if (info == XLOG_DBASE_DROP)
 	{
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 3058ce9..0e76497 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -19,6 +19,7 @@
 
 #include "access/xlog_internal.h"		/* for pg_start/stop_backup */
 #include "catalog/catalog.h"
+#include "catalog/pg_database.h"
 #include "catalog/pg_type.h"
 #include "lib/stringinfo.h"
 #include "libpq/libpq.h"
@@ -32,6 +33,8 @@
 #include "replication/walsender_private.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/lock.h"
 #include "utils/builtins.h"
 #include "utils/elog.h"
 #include "utils/ps_status.h"
@@ -110,6 +113,8 @@ static void
 base_backup_cleanup(int code, Datum arg)
 {
 	do_pg_abort_backup();
+
+	UnlockRelationOid(DatabaseRelationId, ShareLock);
 }
 
 /*
@@ -134,6 +139,9 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 
 	startptr = do_pg_start_backup(opt->label, opt->fastcheckpoint, &starttli,
 								  &labelfile);
+
+	LockRelationOid(DatabaseRelationId, ShareLock);
+
 	/*
 	 * Once do_pg_start_backup has been called, ensure that any failure causes
 	 * us to abort the backup so we don't "leak" a backup counter. For this reason,
@@ -304,6 +312,9 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
 
+	/* release lock again, before stop_backup, as that can error out */
+	UnlockRelationOid(DatabaseRelationId, ShareLock);
+
 	endptr = do_pg_stop_backup(labelfile, !opt->nowait, &endtli);
 
 	if (opt->includewal)
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 138deaf..f3893f3 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -253,6 +253,7 @@ extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast,
 extern XLogRecPtr do_pg_stop_backup(char *labelfile, bool waitforarchive,
 				  TimeLineID *stoptli_p);
 extern void do_pg_abort_backup(void);
+extern bool LocalBaseBackupInProgress(void);
 
 /* File path names (all relative to $PGDATA) */
 #define BACKUP_LABEL_FILE		"backup_label"
-- 
2.0.0.rc2.4.g1dc51c6.dirty

2.patchtext/x-patch; charset=us-asciiDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1daa10d..5c70567 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -9336,7 +9336,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	 * and entering the ENSURE_ERROR_CLEANUP block. That way it's sufficient
 	 * to release it in the error cleanup callback.
 	 */
-	LockRelationOid(DatabaseRelationId, ShareLock);
+	LockRelationOidForSession(DatabaseRelationId, ShareLock);
 
 	WALInsertLockRelease();
 
@@ -9532,7 +9532,7 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
 
-	UnlockRelationOid(DatabaseRelationId, ShareLock);
+	UnlockRelationOidForSession(DatabaseRelationId, ShareLock);
 
 	/*
 	 * We're done.  As a convenience, return the starting WAL location.
@@ -9548,7 +9548,7 @@ pg_start_backup_callback(int code, Datum arg)
 {
 	bool		exclusive = DatumGetBool(arg);
 
-	UnlockRelationOid(DatabaseRelationId, ShareLock);
+	UnlockRelationOidForSession(DatabaseRelationId, ShareLock);
 
 	/* Update backup counters and forcePageWrites on failure */
 	WALInsertLockAcquireExclusive();
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 6184c83..d312e23 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -2085,6 +2085,9 @@ dbase_redo(XLogReaderState *record)
 			/* Lock target database, it'll be overwritten in a second */
 			LockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
 			ResolveRecoveryConflictWithDatabase(xlrec->db_id);
+
+			/* Lock pg_database, to conflict with base backups */
+			LockRelationOidForSession(DatabaseRelationId, RowExclusiveLock);
 		}
 
 		/*
@@ -2116,8 +2119,9 @@ dbase_redo(XLogReaderState *record)
 
 		if (InHotStandby)
 		{
-			UnlockSharedObjectForSession(DatabaseRelationId, xlrec->src_db_id, 0, AccessExclusiveLock);
+			UnlockRelationOidForSession(DatabaseRelationId, RowExclusiveLock);
 			UnlockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
+			UnlockSharedObjectForSession(DatabaseRelationId, xlrec->src_db_id, 0, AccessExclusiveLock);
 		}
 	}
 	else if (info == XLOG_DBASE_DROP)
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 0e76497..481ba18 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -114,7 +114,7 @@ base_backup_cleanup(int code, Datum arg)
 {
 	do_pg_abort_backup();
 
-	UnlockRelationOid(DatabaseRelationId, ShareLock);
+	UnlockRelationOidForSession(DatabaseRelationId, ShareLock);
 }
 
 /*
@@ -140,7 +140,7 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 	startptr = do_pg_start_backup(opt->label, opt->fastcheckpoint, &starttli,
 								  &labelfile);
 
-	LockRelationOid(DatabaseRelationId, ShareLock);
+	LockRelationOidForSession(DatabaseRelationId, ShareLock);
 
 	/*
 	 * Once do_pg_start_backup has been called, ensure that any failure causes
@@ -313,7 +313,7 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 	PG_END_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
 
 	/* release lock again, before stop_backup, as that can error out */
-	UnlockRelationOid(DatabaseRelationId, ShareLock);
+	UnlockRelationOidForSession(DatabaseRelationId, ShareLock);
 
 	endptr = do_pg_stop_backup(labelfile, !opt->nowait, &endtli);
 
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index d13a167..c428b38 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -248,6 +248,37 @@ UnlockRelation(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationOidForSession
+ *
+ * Lock a relation in session mode, without requiring having it opened it in
+ * transaction local mode. That's likely only useful during WAL replay.
+ */
+void
+LockRelationOidForSession(Oid relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SetLocktagRelationOid(&tag, relid);
+
+	(void) LockAcquire(&tag, lockmode, true, false);
+}
+
+/*
+ *		UnlockRelationOidForSession
+ *
+ * Unlock lock acquired by LockRelationOidForSession.
+ */
+void
+UnlockRelationOidForSession(Oid relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SetLocktagRelationOid(&tag, relid);
+
+	LockRelease(&tag, lockmode, true);
+}
+
+/*
  *		LockHasWaitersRelation
  *
  * This is a functiion to check if someone else is waiting on a
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 61c8d21..3ebbe9c 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -710,13 +710,16 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (RecoveryInProgress() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
-		lockmode > RowExclusiveLock)
+		lockmode > RowExclusiveLock &&
+		!sessionLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
 						lockMethodTable->lockModeNames[lockmode]),
 				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
 
+	Assert(!InRecovery || (sessionLock && dontWait));
+
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
 		elog(LOG, "LockAcquire: lock [%u,%u] %s",
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index f5d70e5..2397f11 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -50,6 +50,9 @@ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
 extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 
+extern void LockRelationOidForSession(Oid relid, LOCKMODE lockmode);
+extern void UnlockRelationOidForSession(Oid relid, LOCKMODE lockmode);
+
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
#16Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#14)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On Mon, Jan 26, 2015 at 4:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:

I basically have two ideas to fix this.

1) Make do_pg_start_backup() acquire a SHARE lock on
pg_database. That'll prevent it from starting while a movedb() is
still in progress. Then additionally add pg_backup_in_progress()
function to xlog.c that checks (XLogCtl->Insert.exclusiveBackup ||
XLogCtl->Insert.nonExclusiveBackups != 0). Use that in createdb() and
movedb() to error out if a backup is in progress.

Attached is a patch trying to this. Doesn't look too bad and lead me to
discover missing recovery conflicts during a AD ST.

But: It doesn't actually work on standbys, because lock.c prevents any
stronger lock than RowExclusive from being acquired. And we need need a
lock that can conflict with WAL replay of DBASE_CREATE, to handle base
backups that are executed on the primary. Those obviously can't detect
whether any standby is currently doing a base backup...

I currently don't have a good idea how to mangle lock.c to allow
this. I've played with doing it like in the second patch, but that
doesn't actually work because of some asserts around ProcSleep - leading
to locks on database objects not working in the startup process (despite
already being used).

The easiest thing would be to just use a lwlock instead of a heavyweight
lock - but those aren't canceleable...

How about just wrapping an lwlock around a flag variable? movedb()
increments the variable when starting and decrements it when done
(must use PG_ENSURE_ERROR_CLEANUP). Starting a backup errors out (or
waits in 1-second increments) if it's non-zero.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#17Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#16)
1 attachment(s)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On 2015-01-27 07:16:27 -0500, Robert Haas wrote:

On Mon, Jan 26, 2015 at 4:03 PM, Andres Freund <andres@2ndquadrant.com> wrote:

I basically have two ideas to fix this.

1) Make do_pg_start_backup() acquire a SHARE lock on
pg_database. That'll prevent it from starting while a movedb() is
still in progress. Then additionally add pg_backup_in_progress()
function to xlog.c that checks (XLogCtl->Insert.exclusiveBackup ||
XLogCtl->Insert.nonExclusiveBackups != 0). Use that in createdb() and
movedb() to error out if a backup is in progress.

Attached is a patch trying to this. Doesn't look too bad and lead me to
discover missing recovery conflicts during a AD ST.

But: It doesn't actually work on standbys, because lock.c prevents any
stronger lock than RowExclusive from being acquired. And we need need a
lock that can conflict with WAL replay of DBASE_CREATE, to handle base
backups that are executed on the primary. Those obviously can't detect
whether any standby is currently doing a base backup...

I currently don't have a good idea how to mangle lock.c to allow
this. I've played with doing it like in the second patch, but that
doesn't actually work because of some asserts around ProcSleep - leading
to locks on database objects not working in the startup process (despite
already being used).

The easiest thing would be to just use a lwlock instead of a heavyweight
lock - but those aren't canceleable...

How about just wrapping an lwlock around a flag variable? movedb()
increments the variable when starting and decrements it when done
(must use PG_ENSURE_ERROR_CLEANUP). Starting a backup errors out (or
waits in 1-second increments) if it's non-zero.

That'd end up essentially being a re-emulation of locks. Don't find that
all that pretty. But maybe we have to go there.

Here's an alternative approach. I think it generally is superior and
going in the right direction, but I'm not sure it's backpatchable.

It basically consists out of:
1) Make GetLockConflicts() actually work.
2) Allow the startup process to actually acquire locks other than
AccessExclusiveLocks. There already is code acquiring other locks,
but it's currently broken because they're acquired in blocking mode
which isn't actually supported in the startup mode. Using this
infrastructure we can actually fix a couple small races around
database creation/drop.
3) Allow session locks during recovery to be heavier than
RowExclusiveLock - since relation/object session locks aren't ever
taken out by the user (without a plain lock first) that's safe.
4) Perform streaming base backups from within a transaction command, to
provide error handling.
5) Make walsender actually react to recovery conflict interrupts
6) Prevent access to the template database of a CREATE DATABASE during
WAL replay.
7) Add an interlock to prevent base backups and movedb() to run
concurrently. What we actually came here for.

Comments?

At the very least it's missing some documentation updates about the
locking changes in ALTER DATABASE, CREATE DATABASE and the base backup
sections.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Fix-various-issues-around-WAL-replay-and-ALTER-DATAB.patchtext/x-patch; charset=us-asciiDownload
>From 6e196d17e3dc3ae923321c1b49eb46ccd5ac75b0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Tue, 27 Jan 2015 19:52:11 +0100
Subject: [PATCH] Fix various issues around WAL replay and ALTER DATABASE SET
 TABLESPACE.

<Danger of basebackups vs. AD ST>
Discussion: 20150120152819.GC24381@alap3.anarazel.de

Fix GetLockConflicts() to properly terminate the list of conflicting
backends. It's unclear why this hasn't caused more problems.

Discussion: 20150127142713.GD29457@awork2.anarazel.de

Don't acquire blocking locks on the database in dbase_redo(), not
enough state setup.
Discussion: 20150126212458.GA29457@awork2.anarazel.de

Don't allow access to the template database during the replay of a
CREATE DATABASE.
---
 src/backend/access/transam/xlog.c    |  29 +++++++++
 src/backend/commands/dbcommands.c    | 104 ++++++++++++++++++++++++++-----
 src/backend/replication/basebackup.c |  15 +++++
 src/backend/replication/walsender.c  |  14 +----
 src/backend/storage/ipc/standby.c    | 117 ++++++++++++++++++-----------------
 src/backend/storage/lmgr/lmgr.c      |  31 ++++++++++
 src/backend/storage/lmgr/lock.c      |  13 ++--
 src/backend/utils/init/postinit.c    |   2 +-
 src/include/access/xlog.h            |   1 +
 src/include/storage/lmgr.h           |   3 +
 src/include/storage/standby.h        |   2 +-
 11 files changed, 240 insertions(+), 91 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 629a457..38e7dff 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -53,6 +53,7 @@
 #include "storage/ipc.h"
 #include "storage/large_object.h"
 #include "storage/latch.h"
+#include "storage/lmgr.h"
 #include "storage/pmsignal.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
@@ -9291,6 +9292,12 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("backup label too long (max %d bytes)",
 						MAXPGPATH)));
+	/*
+	 * Acquire lock on pg datababase to prevent concurrent CREATE DATABASE and
+	 * ALTER DATABASE ... SET TABLESPACE from running. In case of an error
+	 * it'll be released by the transaction abort code.
+	 */
+	LockRelationOidForSession(DatabaseRelationId, ShareLock);
 
 	/*
 	 * Mark backup active in shared memory.  We must do full-page WAL writes
@@ -9523,6 +9530,8 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
 
+	UnlockRelationOidForSession(DatabaseRelationId, ShareLock);
+
 	/*
 	 * We're done.  As a convenience, return the starting WAL location.
 	 */
@@ -9937,6 +9946,26 @@ do_pg_abort_backup(void)
 }
 
 /*
+ * Is a (exclusive or nonexclusive) base backup running?
+ *
+ * Note that this does not check whether any standby of this node has a
+ * basebackup running, or whether any upstream master (if this is a standby)
+ * has one in progress
+ */
+bool
+LocalBaseBackupInProgress(void)
+{
+	bool ret;
+
+	WALInsertLockAcquire();
+	ret = XLogCtl->Insert.exclusiveBackup ||
+		XLogCtl->Insert.nonExclusiveBackups > 0;
+	WALInsertLockRelease();
+
+	return ret;
+}
+
+/*
  * Get latest redo apply position.
  *
  * Exported to allow WALReceiver to read the pointer directly.
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 5e66961..e6a3352 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1071,6 +1071,9 @@ movedb(const char *dbname, const char *tblspcname)
 	LockSharedObjectForSession(DatabaseRelationId, db_id, 0,
 							   AccessExclusiveLock);
 
+	/* Acquire lock on pg_database against concurrent base backups starting */
+	LockRelationOidForSession(DatabaseRelationId, RowExclusiveLock);
+
 	/*
 	 * Permission checks
 	 */
@@ -1087,6 +1090,30 @@ movedb(const char *dbname, const char *tblspcname)
 				 errmsg("cannot change the tablespace of the currently open database")));
 
 	/*
+	 * Prevent SET TABLESPACE from running concurrently with a base
+	 * backup. Without that check a base backup would potentially copy a
+	 * partially removed source database; which WAL replay then would copy
+	 * over the new database...
+	 *
+	 * Starting a base backup takes a SHARE lock on pg_database. In addition a
+	 * streaming basebackup takes the same lock for the entirety of the copy
+	 * of the data directory.  That, combined with this check, prevents base
+	 * backups from being taken at the same time a SET TABLESPACE is in
+	 * progress.
+	 *
+	 * Note that this check here will not trigger if a standby currently has a
+	 * base backup ongoing; instead WAL replay will have a recovery conflict
+	 * when replaying the DBASE_CREATE record. That is only safe because
+	 * standbys can only do nonexclusive base backups which hold the lock over
+	 * the entire runtime - otherwise no lock would prevent replay after
+	 * pg_start_backup().
+	 */
+	if (LocalBaseBackupInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("cannot change a database's tablespace while a base backup is in progress")));
+
+	/*
 	 * Get tablespace's oid
 	 */
 	dst_tblspcoid = get_tablespace_oid(tblspcname, false);
@@ -1116,6 +1143,7 @@ movedb(const char *dbname, const char *tblspcname)
 		heap_close(pgdbrel, NoLock);
 		UnlockSharedObjectForSession(DatabaseRelationId, db_id, 0,
 									 AccessExclusiveLock);
+		UnlockRelationOidForSession(DatabaseRelationId, RowExclusiveLock);
 		return;
 	}
 
@@ -1204,6 +1232,12 @@ movedb(const char *dbname, const char *tblspcname)
 	}
 
 	/*
+	 * Force xid assignment, for the benefit of dbase_redo()'s internal
+	 * locking.
+	 */
+	GetTopTransactionId();
+
+	/*
 	 * Use an ENSURE block to make sure we remove the debris if the copy fails
 	 * (eg, due to out-of-disk-space).  This is not a 100% solution, because
 	 * of the possibility of failure during transaction commit, but it should
@@ -1314,6 +1348,12 @@ movedb(const char *dbname, const char *tblspcname)
 	StartTransactionCommand();
 
 	/*
+	 * Force xid assignment, for the benefit of dbase_redo()'s internal
+	 * locking.
+	 */
+	GetTopTransactionId();
+
+	/*
 	 * Remove files from the old tablespace
 	 */
 	if (!rmtree(src_dbpath, true))
@@ -1340,6 +1380,7 @@ movedb(const char *dbname, const char *tblspcname)
 	/* Now it's safe to release the database lock */
 	UnlockSharedObjectForSession(DatabaseRelationId, db_id, 0,
 								 AccessExclusiveLock);
+	UnlockRelationOidForSession(DatabaseRelationId, RowExclusiveLock);
 }
 
 /* Error cleanup callback for movedb */
@@ -2053,6 +2094,38 @@ dbase_redo(XLogReaderState *record)
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
 		/*
+		 * Can only do the locking if the server generating the record was new
+		 * enough to know it has to assign a xid.
+		 */
+		if (InHotStandby && TransactionIdIsValid(XLogRecGetXid(record)))
+		{
+			LOCKTAG locktag;
+
+			SET_LOCKTAG_OBJECT(locktag,
+							   InvalidOid,
+							   DatabaseRelationId,
+							   xlrec->src_db_id,
+							   0);
+
+			/* Lock source database, do avoid concurrent hint bit writes et al. */
+			StandbyAcquireLock(XLogRecGetXid(record), &locktag, AccessExclusiveLock);
+			ResolveRecoveryConflictWithDatabase(xlrec->src_db_id);
+
+			/* Lock target database, it'll be overwritten in a second */
+			SET_LOCKTAG_OBJECT(locktag,
+							   InvalidOid,
+							   DatabaseRelationId,
+							   xlrec->db_id,
+							   0);
+			StandbyAcquireLock(XLogRecGetXid(record), &locktag, AccessExclusiveLock);
+			ResolveRecoveryConflictWithDatabase(xlrec->src_db_id);
+
+			/* Lock pg_database, to conflict with base backups */
+			SET_LOCKTAG_RELATION(locktag, 0, DatabaseRelationId);
+			StandbyAcquireLock(XLogRecGetXid(record), &locktag, RowExclusiveLock);
+		}
+
+		/*
 		 * Our theory for replaying a CREATE is to forcibly drop the target
 		 * subdirectory if present, then re-copy the source data. This may be
 		 * more work than needed, but it is simple to implement.
@@ -2086,16 +2159,31 @@ dbase_redo(XLogReaderState *record)
 
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
-		if (InHotStandby)
+		/*
+		 * Can only do the locking if the server generating the record was new
+		 * enough to know it has to assign a xid.
+		 */
+		if (InHotStandby && TransactionIdIsValid(XLogRecGetXid(record)))
 		{
+			LOCKTAG locktag;
+
 			/*
 			 * Lock database while we resolve conflicts to ensure that
 			 * InitPostgres() cannot fully re-execute concurrently. This
 			 * avoids backends re-connecting automatically to same database,
 			 * which can happen in some cases.
 			 */
-			LockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
+			SET_LOCKTAG_OBJECT(locktag,
+							   InvalidOid,
+							   DatabaseRelationId,
+							   xlrec->db_id,
+							   0);
+			StandbyAcquireLock(XLogRecGetXid(record), &locktag, AccessExclusiveLock);
 			ResolveRecoveryConflictWithDatabase(xlrec->db_id);
+
+			/* Lock pg_database, to conflict with base backups */
+			SET_LOCKTAG_RELATION(locktag, 0, DatabaseRelationId);
+			StandbyAcquireLock(XLogRecGetXid(record), &locktag, RowExclusiveLock);
 		}
 
 		/* Drop pages for this database that are in the shared buffer cache */
@@ -2112,18 +2200,6 @@ dbase_redo(XLogReaderState *record)
 			ereport(WARNING,
 					(errmsg("some useless files may be left behind in old database directory \"%s\"",
 							dst_path)));
-
-		if (InHotStandby)
-		{
-			/*
-			 * Release locks prior to commit. XXX There is a race condition
-			 * here that may allow backends to reconnect, but the window for
-			 * this is small because the gap between here and commit is mostly
-			 * fairly small and it is unlikely that people will be dropping
-			 * databases that we are trying to connect to anyway.
-			 */
-			UnlockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
-		}
 	}
 	else
 		elog(PANIC, "dbase_redo: unknown op code %u", info);
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 3058ce9..a0b590f 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -17,8 +17,10 @@
 #include <unistd.h>
 #include <time.h>
 
+#include "access/xact.h"
 #include "access/xlog_internal.h"		/* for pg_start/stop_backup */
 #include "catalog/catalog.h"
+#include "catalog/pg_database.h"
 #include "catalog/pg_type.h"
 #include "lib/stringinfo.h"
 #include "libpq/libpq.h"
@@ -32,6 +34,8 @@
 #include "replication/walsender_private.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/lock.h"
 #include "utils/builtins.h"
 #include "utils/elog.h"
 #include "utils/ps_status.h"
@@ -134,6 +138,9 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 
 	startptr = do_pg_start_backup(opt->label, opt->fastcheckpoint, &starttli,
 								  &labelfile);
+
+	LockRelationOidForSession(DatabaseRelationId, ShareLock);
+
 	/*
 	 * Once do_pg_start_backup has been called, ensure that any failure causes
 	 * us to abort the backup so we don't "leak" a backup counter. For this reason,
@@ -304,6 +311,9 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
 
+	/* release lock preventing base backups - no risk while backing up WAL */
+	UnlockRelationOidForSession(DatabaseRelationId, ShareLock);
+
 	endptr = do_pg_stop_backup(labelfile, !opt->nowait, &endtli);
 
 	if (opt->includewal)
@@ -675,6 +685,9 @@ SendBaseBackup(BaseBackupCmd *cmd)
 		set_ps_display(activitymsg, false);
 	}
 
+	/* to provide error handling around locks */
+	StartTransactionCommand();
+
 	/* Make sure we can open the directory with tablespaces in it */
 	dir = AllocateDir("pg_tblspc");
 	if (!dir)
@@ -684,6 +697,8 @@ SendBaseBackup(BaseBackupCmd *cmd)
 	perform_base_backup(&opt, dir);
 
 	FreeDir(dir);
+
+	CommitTransactionCommand();
 }
 
 static void
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 05d2339..3f25ccf 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -185,7 +185,6 @@ static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
-static void WalSndXLogSendHandler(SIGNAL_ARGS);
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
 /* Prototypes for private functions */
@@ -2566,17 +2565,6 @@ WalSndSigHupHandler(SIGNAL_ARGS)
 	errno = save_errno;
 }
 
-/* SIGUSR1: set flag to send WAL records */
-static void
-WalSndXLogSendHandler(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	latch_sigusr1_handler();
-
-	errno = save_errno;
-}
-
 /* SIGUSR2: set flag to do a last cycle and shut down afterwards */
 static void
 WalSndLastCycleHandler(SIGNAL_ARGS)
@@ -2610,7 +2598,7 @@ WalSndSignals(void)
 	pqsignal(SIGQUIT, quickdie);	/* hard crash time */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	pqsignal(SIGPIPE, SIG_IGN);
-	pqsignal(SIGUSR1, WalSndXLogSendHandler);	/* request WAL sending */
+	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
 	pqsignal(SIGUSR2, WalSndLastCycleHandler);	/* request a last cycle and
 												 * shutdown */
 
diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 292bed5..98e89e1 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -38,10 +38,16 @@ int			max_standby_archive_delay = 30 * 1000;
 int			max_standby_streaming_delay = 30 * 1000;
 
 static List *RecoveryLockList;
+typedef struct RecoveryLockListEntry
+{
+	TransactionId xid;
+	LOCKMODE lockmode;
+	LOCKTAG locktag;
+} RecoveryLockListEntry;
 
 static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
 									   ProcSignalReason reason);
-static void ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid);
+static void ResolveRecoveryConflictWithLock(LOCKTAG *tag, LOCKMODE mode);
 static void SendRecoveryConflictWithBufferPin(ProcSignalReason reason);
 static XLogRecPtr LogCurrentRunningXacts(RunningTransactions CurrRunningXacts);
 static void LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks);
@@ -320,10 +326,10 @@ ResolveRecoveryConflictWithDatabase(Oid dbid)
 	 * us. This is rare enough that we do this as simply as possible: no wait,
 	 * just force them off immediately.
 	 *
-	 * No locking is required here because we already acquired
-	 * AccessExclusiveLock. Anybody trying to connect while we do this will
-	 * block during InitPostgres() and then disconnect when they see the
-	 * database has been removed.
+	 * No locking is required here because we already acquired a
+	 * AccessExclusiveLock on the database in dbase_redo(). Anybody trying to
+	 * connect while we do this will block during InitPostgres() and then
+	 * disconnect when they see the database has been removed.
 	 */
 	while (CountDBBackends(dbid) > 0)
 	{
@@ -338,14 +344,11 @@ ResolveRecoveryConflictWithDatabase(Oid dbid)
 }
 
 static void
-ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid)
+ResolveRecoveryConflictWithLock(LOCKTAG *locktag, LOCKMODE mode)
 {
 	VirtualTransactionId *backends;
 	bool		lock_acquired = false;
 	int			num_attempts = 0;
-	LOCKTAG		locktag;
-
-	SET_LOCKTAG_RELATION(locktag, dbOid, relOid);
 
 	/*
 	 * If blowing away everybody with conflicting locks doesn't work, after
@@ -358,7 +361,7 @@ ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid)
 	while (!lock_acquired)
 	{
 		if (++num_attempts < 3)
-			backends = GetLockConflicts(&locktag, AccessExclusiveLock);
+			backends = GetLockConflicts(locktag, mode);
 		else
 			backends = GetConflictingVirtualXIDs(InvalidTransactionId,
 												 InvalidOid);
@@ -366,7 +369,7 @@ ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid)
 		ResolveRecoveryConflictWithVirtualXIDs(backends,
 											 PROCSIG_RECOVERY_CONFLICT_LOCK);
 
-		if (LockAcquireExtended(&locktag, AccessExclusiveLock, true, true, false)
+		if (LockAcquireExtended(locktag, mode, true, true, false)
 			!= LOCKACQUIRE_NOT_AVAIL)
 			lock_acquired = true;
 	}
@@ -544,19 +547,18 @@ StandbyTimeoutHandler(void)
  * this lock, so query access is not allowed at this time". So the Startup
  * process is the proxy by which the original locks are implemented.
  *
- * We only keep track of AccessExclusiveLocks, which are only ever held by
- * one transaction on one relation, and don't worry about lock queuing.
+ * We only keep track of the primary's AccessExclusiveLocks, which are only
+ * ever held by one transaction on one relation, and don't worry about lock
+ * queuing.  The startup process however does acquire other locks occasionally
+ * (c.f. dbase_redo()) - but even there no queuing is possible.
  *
  * We keep a single dynamically expandible list of locks in local memory,
  * RelationLockList, so we can keep track of the various entries made by
  * the Startup process's virtual xid in the shared lock table.
  *
  * We record the lock against the top-level xid, rather than individual
- * subtransaction xids. This means AccessExclusiveLocks held by aborted
- * subtransactions are not released as early as possible on standbys.
- *
- * List elements use type xl_rel_lock, since the WAL record type exactly
- * matches the information that we need to keep track of.
+ * subtransaction xids. This means locks held by aborted subtransactions are
+ * not released as early as possible on standbys.
  *
  * We use session locks rather than normal locks so we don't need
  * ResourceOwners.
@@ -564,10 +566,11 @@ StandbyTimeoutHandler(void)
 
 
 void
-StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
+StandbyAcquireLock(TransactionId xid, LOCKTAG *locktag, LOCKMODE mode)
 {
-	xl_standby_lock *newlock;
-	LOCKTAG		locktag;
+	RecoveryLockListEntry *newlock;
+	Oid dbOid = locktag->locktag_field1;
+	Oid relOid = locktag->locktag_field2;
 
 	/* Already processed? */
 	if (!TransactionIdIsValid(xid) ||
@@ -576,25 +579,24 @@ StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
 		return;
 
 	elog(trace_recovery(DEBUG4),
-		 "adding recovery lock: db %u rel %u", dbOid, relOid);
+		 "adding recovery lock: db %u rel %u",
+		 dbOid, relOid);
 
 	/* dbOid is InvalidOid when we are locking a shared relation. */
 	Assert(OidIsValid(relOid));
 
-	newlock = palloc(sizeof(xl_standby_lock));
+	newlock = palloc(sizeof(RecoveryLockListEntry));
 	newlock->xid = xid;
-	newlock->dbOid = dbOid;
-	newlock->relOid = relOid;
+	newlock->lockmode = mode;
+	memcpy(&newlock->locktag, locktag, sizeof(LOCKTAG));
 	RecoveryLockList = lappend(RecoveryLockList, newlock);
 
 	/*
 	 * Attempt to acquire the lock as requested, if not resolve conflict
 	 */
-	SET_LOCKTAG_RELATION(locktag, newlock->dbOid, newlock->relOid);
-
-	if (LockAcquireExtended(&locktag, AccessExclusiveLock, true, true, false)
+	if (LockAcquireExtended(locktag, mode, true, true, false)
 		== LOCKACQUIRE_NOT_AVAIL)
-		ResolveRecoveryConflictWithLock(newlock->dbOid, newlock->relOid);
+		ResolveRecoveryConflictWithLock(locktag, mode);
 }
 
 static void
@@ -610,22 +612,16 @@ StandbyReleaseLocks(TransactionId xid)
 	prev = NULL;
 	for (cell = list_head(RecoveryLockList); cell; cell = next)
 	{
-		xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell);
+		RecoveryLockListEntry *lock = (RecoveryLockListEntry *) lfirst(cell);
 
 		next = lnext(cell);
 
 		if (!TransactionIdIsValid(xid) || lock->xid == xid)
 		{
-			LOCKTAG		locktag;
-
-			elog(trace_recovery(DEBUG4),
-				 "releasing recovery lock: xid %u db %u rel %u",
-				 lock->xid, lock->dbOid, lock->relOid);
-			SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid);
-			if (!LockRelease(&locktag, AccessExclusiveLock, true))
+			if (!LockRelease(&lock->locktag, lock->lockmode, true))
 				elog(LOG,
-					 "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u database %u relation %u",
-					 lock->xid, lock->dbOid, lock->relOid);
+					 "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u",
+					 lock->xid);
 
 			RecoveryLockList = list_delete_cell(RecoveryLockList, cell, prev);
 			pfree(lock);
@@ -662,25 +658,24 @@ StandbyReleaseAllLocks(void)
 	ListCell   *cell,
 			   *prev,
 			   *next;
-	LOCKTAG		locktag;
 
 	elog(trace_recovery(DEBUG2), "release all standby locks");
 
 	prev = NULL;
 	for (cell = list_head(RecoveryLockList); cell; cell = next)
 	{
-		xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell);
+		RecoveryLockListEntry *lock = (RecoveryLockListEntry *) lfirst(cell);
 
 		next = lnext(cell);
 
 		elog(trace_recovery(DEBUG4),
-			 "releasing recovery lock: xid %u db %u rel %u",
-			 lock->xid, lock->dbOid, lock->relOid);
-		SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid);
-		if (!LockRelease(&locktag, AccessExclusiveLock, true))
+			 "releasing recovery lock for xid %u",
+			 lock->xid);
+
+		if (!LockRelease(&lock->locktag, lock->lockmode, true))
 			elog(LOG,
-				 "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u database %u relation %u",
-				 lock->xid, lock->dbOid, lock->relOid);
+				 "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u",
+				 lock->xid);
 		RecoveryLockList = list_delete_cell(RecoveryLockList, cell, prev);
 		pfree(lock);
 	}
@@ -697,12 +692,11 @@ StandbyReleaseOldLocks(int nxids, TransactionId *xids)
 	ListCell   *cell,
 			   *prev,
 			   *next;
-	LOCKTAG		locktag;
 
 	prev = NULL;
 	for (cell = list_head(RecoveryLockList); cell; cell = next)
 	{
-		xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell);
+		RecoveryLockListEntry *lock = (RecoveryLockListEntry *) lfirst(cell);
 		bool		remove = false;
 
 		next = lnext(cell);
@@ -735,13 +729,13 @@ StandbyReleaseOldLocks(int nxids, TransactionId *xids)
 		if (remove)
 		{
 			elog(trace_recovery(DEBUG4),
-				 "releasing recovery lock: xid %u db %u rel %u",
-				 lock->xid, lock->dbOid, lock->relOid);
-			SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid);
-			if (!LockRelease(&locktag, AccessExclusiveLock, true))
+				 "releasing recovery lock: xid %u",
+				 lock->xid);
+
+			if (!LockRelease(&lock->locktag, lock->lockmode, true))
 				elog(LOG,
-					 "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u database %u relation %u",
-					 lock->xid, lock->dbOid, lock->relOid);
+					 "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u",
+					 lock->xid);
 			RecoveryLockList = list_delete_cell(RecoveryLockList, cell, prev);
 			pfree(lock);
 		}
@@ -776,9 +770,16 @@ standby_redo(XLogReaderState *record)
 		int			i;
 
 		for (i = 0; i < xlrec->nlocks; i++)
-			StandbyAcquireAccessExclusiveLock(xlrec->locks[i].xid,
-											  xlrec->locks[i].dbOid,
-											  xlrec->locks[i].relOid);
+		{
+			LOCKTAG locktag;
+
+			SET_LOCKTAG_RELATION(locktag,
+								 xlrec->locks[i].dbOid,
+								 xlrec->locks[i].relOid);
+			StandbyAcquireLock(xlrec->locks[i].xid,
+							   &locktag,
+							   AccessExclusiveLock);
+		}
 	}
 	else if (info == XLOG_RUNNING_XACTS)
 	{
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index d13a167..c428b38 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -248,6 +248,37 @@ UnlockRelation(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationOidForSession
+ *
+ * Lock a relation in session mode, without requiring having it opened it in
+ * transaction local mode. That's likely only useful during WAL replay.
+ */
+void
+LockRelationOidForSession(Oid relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SetLocktagRelationOid(&tag, relid);
+
+	(void) LockAcquire(&tag, lockmode, true, false);
+}
+
+/*
+ *		UnlockRelationOidForSession
+ *
+ * Unlock lock acquired by LockRelationOidForSession.
+ */
+void
+UnlockRelationOidForSession(Oid relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SetLocktagRelationOid(&tag, relid);
+
+	LockRelease(&tag, lockmode, true);
+}
+
+/*
  *		LockHasWaitersRelation
  *
  * This is a functiion to check if someone else is waiting on a
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 61c8d21..f051ad6 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -710,13 +710,16 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (RecoveryInProgress() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
-		lockmode > RowExclusiveLock)
+		lockmode > RowExclusiveLock &&
+		!sessionLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
 						lockMethodTable->lockModeNames[lockmode]),
 				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
 
+	Assert(!InRecovery || (sessionLock && dontWait));
+
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
 		elog(LOG, "LockAcquire: lock [%u,%u] %s",
@@ -2804,6 +2807,8 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode)
 		 * on this lockable object.
 		 */
 		LWLockRelease(partitionLock);
+		vxids[count].backendId = InvalidBackendId;
+		vxids[count].localTransactionId = InvalidLocalTransactionId;
 		return vxids;
 	}
 
@@ -2857,6 +2862,8 @@ GetLockConflicts(const LOCKTAG *locktag, LOCKMODE lockmode)
 	if (count > MaxBackends)	/* should never happen */
 		elog(PANIC, "too many conflicting locks found");
 
+	vxids[count].backendId = InvalidBackendId;
+	vxids[count].localTransactionId = InvalidLocalTransactionId;
 	return vxids;
 }
 
@@ -3857,9 +3864,7 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info,
 	if (lockmode == AccessExclusiveLock &&
 		locktag->locktag_type == LOCKTAG_RELATION)
 	{
-		StandbyAcquireAccessExclusiveLock(xid,
-										locktag->locktag_field1 /* dboid */ ,
-									  locktag->locktag_field2 /* reloid */ );
+		StandbyAcquireLock(xid, locktag, AccessExclusiveLock);
 	}
 }
 
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 1f5cf06..dfe2322 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -886,7 +886,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 			ereport(FATAL,
 					(errcode(ERRCODE_UNDEFINED_DATABASE),
 					 errmsg("database \"%s\" does not exist", dbname),
-			   errdetail("It seems to have just been dropped or renamed.")));
+			   errdetail("It seems to have just been dropped, moved or renamed.")));
 	}
 
 	/*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 138deaf..f3893f3 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -253,6 +253,7 @@ extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast,
 extern XLogRecPtr do_pg_stop_backup(char *labelfile, bool waitforarchive,
 				  TimeLineID *stoptli_p);
 extern void do_pg_abort_backup(void);
+extern bool LocalBaseBackupInProgress(void);
 
 /* File path names (all relative to $PGDATA) */
 #define BACKUP_LABEL_FILE		"backup_label"
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index f5d70e5..2397f11 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -50,6 +50,9 @@ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
 extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 
+extern void LockRelationOidForSession(Oid relid, LOCKMODE lockmode);
+extern void UnlockRelationOidForSession(Oid relid, LOCKMODE lockmode);
+
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index c32c963..f711281 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -45,7 +45,7 @@ extern void StandbyTimeoutHandler(void);
  * to make hot standby work. That includes logging AccessExclusiveLocks taken
  * by transactions and running-xacts snapshots.
  */
-extern void StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid);
+extern void StandbyAcquireLock(TransactionId xid, LOCKTAG *tag, LOCKMODE mode);
 extern void StandbyReleaseLockTree(TransactionId xid,
 					   int nsubxids, TransactionId *subxids);
 extern void StandbyReleaseAllLocks(void);
-- 
2.0.0.rc2.4.g1dc51c6.dirty

#18Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#17)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On Tue, Jan 27, 2015 at 2:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:

That'd end up essentially being a re-emulation of locks. Don't find that
all that pretty. But maybe we have to go there.

The advantage of it is that it is simple. The only thing we're really
giving up is the deadlock detector, which I think isn't needed in this
case.

Here's an alternative approach. I think it generally is superior and
going in the right direction, but I'm not sure it's backpatchable.

I tend think this is changing too many things to back-patch. It might
all be fine, but it's pretty wide-reaching, so the chances of
collateral damage seem non-trivial. Even in master, I'm not sure I
see the point in rocking the apple cart to this degree.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#19Andres Freund
andres@2ndquadrant.com
In reply to: Robert Haas (#18)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On 2015-01-27 19:24:24 -0500, Robert Haas wrote:

On Tue, Jan 27, 2015 at 2:16 PM, Andres Freund <andres@2ndquadrant.com> wrote:

That'd end up essentially being a re-emulation of locks. Don't find that
all that pretty. But maybe we have to go there.

The advantage of it is that it is simple. The only thing we're really
giving up is the deadlock detector, which I think isn't needed in this
case.

I think it's more than just the deadlock detector. Consider
pg_locks/pg_stat_activity.waiting, cancelling a acquisition, error
cleanup and recursive acquisitions. Acquiring session locks + a
surrounding transaction command deals with with cleanups without
introducing PG_ENSURE blocks in a couple places. And we need recursive
acquisition so a streaming base backup can acquire the lock over the
whole runtime, while a manual pg_start_backup() does only for its own
time.

Especially the visibility in the system views is something I'd not like
to give up in additional blocks we introduce in the backbranches.

Here's an alternative approach. I think it generally is superior and
going in the right direction, but I'm not sure it's backpatchable.

I tend think this is changing too many things to back-patch. It might
all be fine, but it's pretty wide-reaching, so the chances of
collateral damage seem non-trivial. Even in master, I'm not sure I
see the point in rocking the apple cart to this degree.

It's big, true. But a fair amount of it stuff I think we have to do
anyway. The current code acquiring session locks in dbase_redo() is
borked - we either assert or segfault if there's a connection in the
wrong moment beause there's no deadlock handler. Plus it has races that
aren't that hard to hit :(. To fix the crashes (but not the race) we can
have a separate ResolveRecoveryConflictWithObjectLock that doesn't
record the existance of the lock, but doesn't ever do a ProcSleep(). Not
pretty either :(

Seems like a situation with no nice solutions so far :(

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#20Andres Freund
andres@2ndquadrant.com
In reply to: Andres Freund (#17)
4 attachment(s)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On 2015-01-27 20:16:43 +0100, Andres Freund wrote:

Here's an alternative approach. I think it generally is superior and
going in the right direction, but I'm not sure it's backpatchable.

It basically consists out of:
1) Make GetLockConflicts() actually work.

already commited as being a independent problem.

2) Allow the startup process to actually acquire locks other than
AccessExclusiveLocks. There already is code acquiring other locks,
but it's currently broken because they're acquired in blocking mode
which isn't actually supported in the startup mode. Using this
infrastructure we can actually fix a couple small races around
database creation/drop.
3) Allow session locks during recovery to be heavier than
RowExclusiveLock - since relation/object session locks aren't ever
taken out by the user (without a plain lock first) that's safe.

merged and submitted independently.

5) Make walsender actually react to recovery conflict interrupts

submitted here. (0003)

4) Perform streaming base backups from within a transaction command, to
provide error handling.
6) Prevent access to the template database of a CREATE DATABASE during
WAL replay.
7) Add an interlock to prevent base backups and movedb() to run
concurrently. What we actually came here for.

combined, submitted here. (0004)

I think this actually doesn't look that bad.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services

Attachments:

0001-Allow-recovery-lock-infrastructure-to-not-only-hold-.patchtext/x-patch; charset=us-asciiDownload
>From 8531ca4d200d24ee45265774f7ead613563adca4 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 30 Jan 2015 09:28:17 +0100
Subject: [PATCH 1/4] Allow recovery lock infrastructure to not only hold
 relation locks.

This is a preparatory commit to fix several bugs, separated out for
easier review.

The startup process could so far only properly acquire access exlusive
locks on transactions. As it does not setup enough state to do lock
queuing - primarily to be able to issue recovery conflict interrupts,
and to release them when the primary releases locks - it has it's own
infrastructure to manage locks. That infrastructure so far assumed
that all locks are access exlusive locks on relations.

Unfortunately some code in the startup process has to acquire other
locks than what's supported by the aforementioned infrastructure in
standby.c. Namely dbase_redo() has to acquire locks on the database
objects.  Also further such locks will be added soon, to fix a another
bug.

So this patch shanges the infrastructure to be able to acquire locks
of different modes and locktags.

Additionally allow acquiring more heavyweight relation logs on the
standby than RowExclusive when acquired in session mode.

Discussion: 20150120152819.GC24381@alap3.anarazel.de

Backpatch all the way.
---
 src/backend/storage/ipc/standby.c | 120 ++++++++++++++++++--------------------
 src/backend/storage/lmgr/lock.c   |   7 +--
 src/include/storage/standby.h     |   2 +-
 3 files changed, 61 insertions(+), 68 deletions(-)

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index 292bed5..0502aab 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -38,10 +38,16 @@ int			max_standby_archive_delay = 30 * 1000;
 int			max_standby_streaming_delay = 30 * 1000;
 
 static List *RecoveryLockList;
+typedef struct RecoveryLockListEntry
+{
+	TransactionId	xid;
+	LOCKMODE		lockmode;
+	LOCKTAG			locktag;
+} RecoveryLockListEntry;
 
 static void ResolveRecoveryConflictWithVirtualXIDs(VirtualTransactionId *waitlist,
 									   ProcSignalReason reason);
-static void ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid);
+static void ResolveRecoveryConflictWithLock(LOCKTAG *tag, LOCKMODE mode);
 static void SendRecoveryConflictWithBufferPin(ProcSignalReason reason);
 static XLogRecPtr LogCurrentRunningXacts(RunningTransactions CurrRunningXacts);
 static void LogAccessExclusiveLocks(int nlocks, xl_standby_lock *locks);
@@ -320,10 +326,10 @@ ResolveRecoveryConflictWithDatabase(Oid dbid)
 	 * us. This is rare enough that we do this as simply as possible: no wait,
 	 * just force them off immediately.
 	 *
-	 * No locking is required here because we already acquired
-	 * AccessExclusiveLock. Anybody trying to connect while we do this will
-	 * block during InitPostgres() and then disconnect when they see the
-	 * database has been removed.
+	 * No locking is required here because we already acquired a
+	 * AccessExclusiveLock on the database in dbase_redo(). Anybody trying to
+	 * connect while we do this will block during InitPostgres() and then
+	 * disconnect when they see the database has been removed.
 	 */
 	while (CountDBBackends(dbid) > 0)
 	{
@@ -338,14 +344,11 @@ ResolveRecoveryConflictWithDatabase(Oid dbid)
 }
 
 static void
-ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid)
+ResolveRecoveryConflictWithLock(LOCKTAG *locktag, LOCKMODE mode)
 {
 	VirtualTransactionId *backends;
 	bool		lock_acquired = false;
 	int			num_attempts = 0;
-	LOCKTAG		locktag;
-
-	SET_LOCKTAG_RELATION(locktag, dbOid, relOid);
 
 	/*
 	 * If blowing away everybody with conflicting locks doesn't work, after
@@ -358,7 +361,7 @@ ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid)
 	while (!lock_acquired)
 	{
 		if (++num_attempts < 3)
-			backends = GetLockConflicts(&locktag, AccessExclusiveLock);
+			backends = GetLockConflicts(locktag, mode);
 		else
 			backends = GetConflictingVirtualXIDs(InvalidTransactionId,
 												 InvalidOid);
@@ -366,7 +369,7 @@ ResolveRecoveryConflictWithLock(Oid dbOid, Oid relOid)
 		ResolveRecoveryConflictWithVirtualXIDs(backends,
 											 PROCSIG_RECOVERY_CONFLICT_LOCK);
 
-		if (LockAcquireExtended(&locktag, AccessExclusiveLock, true, true, false)
+		if (LockAcquireExtended(locktag, mode, true, true, false)
 			!= LOCKACQUIRE_NOT_AVAIL)
 			lock_acquired = true;
 	}
@@ -544,19 +547,18 @@ StandbyTimeoutHandler(void)
  * this lock, so query access is not allowed at this time". So the Startup
  * process is the proxy by which the original locks are implemented.
  *
- * We only keep track of AccessExclusiveLocks, which are only ever held by
- * one transaction on one relation, and don't worry about lock queuing.
+ * We only keep track of the primary's AccessExclusiveLocks, which are only
+ * ever held by one transaction on one relation, and don't worry about lock
+ * queuing.  The startup process however does acquire other locks occasionally
+ * (c.f. dbase_redo()) - but even there no queuing is possible.
  *
  * We keep a single dynamically expandible list of locks in local memory,
  * RelationLockList, so we can keep track of the various entries made by
  * the Startup process's virtual xid in the shared lock table.
  *
  * We record the lock against the top-level xid, rather than individual
- * subtransaction xids. This means AccessExclusiveLocks held by aborted
- * subtransactions are not released as early as possible on standbys.
- *
- * List elements use type xl_rel_lock, since the WAL record type exactly
- * matches the information that we need to keep track of.
+ * subtransaction xids. This means locks held by aborted subtransactions are
+ * not released as early as possible on standbys.
  *
  * We use session locks rather than normal locks so we don't need
  * ResourceOwners.
@@ -564,10 +566,11 @@ StandbyTimeoutHandler(void)
 
 
 void
-StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
+StandbyAcquireLock(TransactionId xid, LOCKTAG *locktag, LOCKMODE mode)
 {
-	xl_standby_lock *newlock;
-	LOCKTAG		locktag;
+	RecoveryLockListEntry *newlock;
+
+	Assert(locktag->locktag_lockmethodid == DEFAULT_LOCKMETHOD);
 
 	/* Already processed? */
 	if (!TransactionIdIsValid(xid) ||
@@ -575,26 +578,18 @@ StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid)
 		TransactionIdDidAbort(xid))
 		return;
 
-	elog(trace_recovery(DEBUG4),
-		 "adding recovery lock: db %u rel %u", dbOid, relOid);
-
-	/* dbOid is InvalidOid when we are locking a shared relation. */
-	Assert(OidIsValid(relOid));
-
-	newlock = palloc(sizeof(xl_standby_lock));
+	newlock = palloc(sizeof(RecoveryLockListEntry));
 	newlock->xid = xid;
-	newlock->dbOid = dbOid;
-	newlock->relOid = relOid;
+	newlock->lockmode = mode;
+	memcpy(&newlock->locktag, locktag, sizeof(LOCKTAG));
 	RecoveryLockList = lappend(RecoveryLockList, newlock);
 
 	/*
 	 * Attempt to acquire the lock as requested, if not resolve conflict
 	 */
-	SET_LOCKTAG_RELATION(locktag, newlock->dbOid, newlock->relOid);
-
-	if (LockAcquireExtended(&locktag, AccessExclusiveLock, true, true, false)
+	if (LockAcquireExtended(locktag, mode, true, true, false)
 		== LOCKACQUIRE_NOT_AVAIL)
-		ResolveRecoveryConflictWithLock(newlock->dbOid, newlock->relOid);
+		ResolveRecoveryConflictWithLock(locktag, mode);
 }
 
 static void
@@ -610,22 +605,16 @@ StandbyReleaseLocks(TransactionId xid)
 	prev = NULL;
 	for (cell = list_head(RecoveryLockList); cell; cell = next)
 	{
-		xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell);
+		RecoveryLockListEntry *lock = (RecoveryLockListEntry *) lfirst(cell);
 
 		next = lnext(cell);
 
 		if (!TransactionIdIsValid(xid) || lock->xid == xid)
 		{
-			LOCKTAG		locktag;
-
-			elog(trace_recovery(DEBUG4),
-				 "releasing recovery lock: xid %u db %u rel %u",
-				 lock->xid, lock->dbOid, lock->relOid);
-			SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid);
-			if (!LockRelease(&locktag, AccessExclusiveLock, true))
+			if (!LockRelease(&lock->locktag, lock->lockmode, true))
 				elog(LOG,
-					 "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u database %u relation %u",
-					 lock->xid, lock->dbOid, lock->relOid);
+					 "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u",
+					 lock->xid);
 
 			RecoveryLockList = list_delete_cell(RecoveryLockList, cell, prev);
 			pfree(lock);
@@ -662,25 +651,24 @@ StandbyReleaseAllLocks(void)
 	ListCell   *cell,
 			   *prev,
 			   *next;
-	LOCKTAG		locktag;
 
 	elog(trace_recovery(DEBUG2), "release all standby locks");
 
 	prev = NULL;
 	for (cell = list_head(RecoveryLockList); cell; cell = next)
 	{
-		xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell);
+		RecoveryLockListEntry *lock = (RecoveryLockListEntry *) lfirst(cell);
 
 		next = lnext(cell);
 
 		elog(trace_recovery(DEBUG4),
-			 "releasing recovery lock: xid %u db %u rel %u",
-			 lock->xid, lock->dbOid, lock->relOid);
-		SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid);
-		if (!LockRelease(&locktag, AccessExclusiveLock, true))
+			 "releasing recovery lock for xid %u",
+			 lock->xid);
+
+		if (!LockRelease(&lock->locktag, lock->lockmode, true))
 			elog(LOG,
-				 "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u database %u relation %u",
-				 lock->xid, lock->dbOid, lock->relOid);
+				 "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u",
+				 lock->xid);
 		RecoveryLockList = list_delete_cell(RecoveryLockList, cell, prev);
 		pfree(lock);
 	}
@@ -697,12 +685,11 @@ StandbyReleaseOldLocks(int nxids, TransactionId *xids)
 	ListCell   *cell,
 			   *prev,
 			   *next;
-	LOCKTAG		locktag;
 
 	prev = NULL;
 	for (cell = list_head(RecoveryLockList); cell; cell = next)
 	{
-		xl_standby_lock *lock = (xl_standby_lock *) lfirst(cell);
+		RecoveryLockListEntry *lock = (RecoveryLockListEntry *) lfirst(cell);
 		bool		remove = false;
 
 		next = lnext(cell);
@@ -735,13 +722,13 @@ StandbyReleaseOldLocks(int nxids, TransactionId *xids)
 		if (remove)
 		{
 			elog(trace_recovery(DEBUG4),
-				 "releasing recovery lock: xid %u db %u rel %u",
-				 lock->xid, lock->dbOid, lock->relOid);
-			SET_LOCKTAG_RELATION(locktag, lock->dbOid, lock->relOid);
-			if (!LockRelease(&locktag, AccessExclusiveLock, true))
+				 "releasing recovery lock: xid %u",
+				 lock->xid);
+
+			if (!LockRelease(&lock->locktag, lock->lockmode, true))
 				elog(LOG,
-					 "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u database %u relation %u",
-					 lock->xid, lock->dbOid, lock->relOid);
+					 "RecoveryLockList contains entry for lock no longer recorded by lock manager: xid %u",
+					 lock->xid);
 			RecoveryLockList = list_delete_cell(RecoveryLockList, cell, prev);
 			pfree(lock);
 		}
@@ -776,9 +763,16 @@ standby_redo(XLogReaderState *record)
 		int			i;
 
 		for (i = 0; i < xlrec->nlocks; i++)
-			StandbyAcquireAccessExclusiveLock(xlrec->locks[i].xid,
-											  xlrec->locks[i].dbOid,
-											  xlrec->locks[i].relOid);
+		{
+			LOCKTAG locktag;
+
+			SET_LOCKTAG_RELATION(locktag,
+								 xlrec->locks[i].dbOid,
+								 xlrec->locks[i].relOid);
+			StandbyAcquireLock(xlrec->locks[i].xid,
+							   &locktag,
+							   AccessExclusiveLock);
+		}
 	}
 	else if (info == XLOG_RUNNING_XACTS)
 	{
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 1eb2d4b..02ecf3d 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -710,7 +710,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
 	if (RecoveryInProgress() && !InRecovery &&
 		(locktag->locktag_type == LOCKTAG_OBJECT ||
 		 locktag->locktag_type == LOCKTAG_RELATION) &&
-		lockmode > RowExclusiveLock)
+		lockmode > RowExclusiveLock &&
+		!sessionLock)
 		ereport(ERROR,
 				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
 				 errmsg("cannot acquire lock mode %s on database objects while recovery is in progress",
@@ -3861,9 +3862,7 @@ lock_twophase_standby_recover(TransactionId xid, uint16 info,
 	if (lockmode == AccessExclusiveLock &&
 		locktag->locktag_type == LOCKTAG_RELATION)
 	{
-		StandbyAcquireAccessExclusiveLock(xid,
-										locktag->locktag_field1 /* dboid */ ,
-									  locktag->locktag_field2 /* reloid */ );
+		StandbyAcquireLock(xid, locktag, AccessExclusiveLock);
 	}
 }
 
diff --git a/src/include/storage/standby.h b/src/include/storage/standby.h
index c32c963..f711281 100644
--- a/src/include/storage/standby.h
+++ b/src/include/storage/standby.h
@@ -45,7 +45,7 @@ extern void StandbyTimeoutHandler(void);
  * to make hot standby work. That includes logging AccessExclusiveLocks taken
  * by transactions and running-xacts snapshots.
  */
-extern void StandbyAcquireAccessExclusiveLock(TransactionId xid, Oid dbOid, Oid relOid);
+extern void StandbyAcquireLock(TransactionId xid, LOCKTAG *tag, LOCKMODE mode);
 extern void StandbyReleaseLockTree(TransactionId xid,
 					   int nsubxids, TransactionId *subxids);
 extern void StandbyReleaseAllLocks(void);
-- 
2.2.1.212.gc5b9256

0002-Fix-startup-process-crash-and-race-during-CREATE-DRO.patchtext/x-patch; charset=us-asciiDownload
>From 4b36f4c0e1efecbd50918a2c854bfe6f26105201 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 30 Jan 2015 09:46:15 +0100
Subject: [PATCH 2/4] Fix startup process crash and race during
 CREATE/DROP/ALTER DATABASE.

When replaying database related WAL records the startup process
acquires lock on the database object to prevent new connections to the
affected database. That's necessary because we currently don't
replicate object locks from the primary.

Unfortunately that locking couldn't use the standby infrastructure for
managing locks and instead just used a normal
LockSharedObjectForSession(). While that works in the (common!)
uncontended situation, it doesn't if the lock is currently already
acquired. The standby process intentionally doesn't set up the
infrastructure to wait for locks in the normal, blocking, way. The
goal is to cancel other lock holder after max_standby_*_delay, which
is only possible if the sleeping is controlled. If we have to wait
anyway, we either crash with a assertion or a FATAL (promoted to
PANIC) error.

There's also another race, which was documented in comments. Namely
that not using the standby lock infrastructure requires explicitly
releasing the lock. That means there's a short phase where the catalog
and physical state of the database are inconsistent.

After the previous commit we can now use the standby infrastructure to
manage the lock. That requires assigning a transactionid to the
create/drop database transactions. To handle WAL generated by a older
point release we thus need to make the locking conditional on the
relevant records having a transactionid. On the primary WAL format is
bumped instead.

Additionally add previously missing locking for CREATE DATABASE that
could lead to checksum failures due to hint bit writes.

Now that we don't do acquire any locks in potentially blocking mode in
the startup process anymore, add a assertion ensuring that that stays
so in the primary.

Backpatch all the way.

Discussion: 20150120152819.GC24381@alap3.anarazel.de
---
 src/backend/commands/dbcommands.c | 71 +++++++++++++++++++++++++++++++--------
 src/backend/storage/lmgr/lock.c   |  2 ++
 2 files changed, 59 insertions(+), 14 deletions(-)

diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 5e66961..3845eb7 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1204,6 +1204,12 @@ movedb(const char *dbname, const char *tblspcname)
 	}
 
 	/*
+	 * Force xid assignment, for the benefit of dbase_redo()'s internal
+	 * locking.
+	 */
+	GetTopTransactionId();
+
+	/*
 	 * Use an ENSURE block to make sure we remove the debris if the copy fails
 	 * (eg, due to out-of-disk-space).  This is not a 100% solution, because
 	 * of the possibility of failure during transaction commit, but it should
@@ -1314,6 +1320,12 @@ movedb(const char *dbname, const char *tblspcname)
 	StartTransactionCommand();
 
 	/*
+	 * Force xid assignment, for the benefit of dbase_redo()'s internal
+	 * locking.
+	 */
+	GetTopTransactionId();
+
+	/*
 	 * Remove files from the old tablespace
 	 */
 	if (!rmtree(src_dbpath, true))
@@ -2053,6 +2065,34 @@ dbase_redo(XLogReaderState *record)
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
 		/*
+		 * Can only do the locking if the server generating the record was new
+		 * enough to know it has to assign a xid.
+		 */
+		if (InHotStandby && TransactionIdIsValid(XLogRecGetXid(record)))
+		{
+			LOCKTAG locktag;
+
+			SET_LOCKTAG_OBJECT(locktag,
+							   InvalidOid,
+							   DatabaseRelationId,
+							   xlrec->src_db_id,
+							   0);
+
+			/* Lock source database, do avoid concurrent hint bit writes et al. */
+			StandbyAcquireLock(XLogRecGetXid(record), &locktag, AccessExclusiveLock);
+			ResolveRecoveryConflictWithDatabase(xlrec->src_db_id);
+
+			/* Lock target database, it'll be overwritten in a second */
+			SET_LOCKTAG_OBJECT(locktag,
+							   InvalidOid,
+							   DatabaseRelationId,
+							   xlrec->db_id,
+							   0);
+			StandbyAcquireLock(XLogRecGetXid(record), &locktag, AccessExclusiveLock);
+			ResolveRecoveryConflictWithDatabase(xlrec->src_db_id);
+		}
+
+		/*
 		 * Our theory for replaying a CREATE is to forcibly drop the target
 		 * subdirectory if present, then re-copy the source data. This may be
 		 * more work than needed, but it is simple to implement.
@@ -2086,16 +2126,31 @@ dbase_redo(XLogReaderState *record)
 
 		dst_path = GetDatabasePath(xlrec->db_id, xlrec->tablespace_id);
 
-		if (InHotStandby)
+		/*
+		 * Can only do the locking if the server generating the record was new
+		 * enough to know it has to assign a xid.
+		 */
+		if (InHotStandby && TransactionIdIsValid(XLogRecGetXid(record)))
 		{
+			LOCKTAG locktag;
+
 			/*
 			 * Lock database while we resolve conflicts to ensure that
 			 * InitPostgres() cannot fully re-execute concurrently. This
 			 * avoids backends re-connecting automatically to same database,
 			 * which can happen in some cases.
 			 */
-			LockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
+			SET_LOCKTAG_OBJECT(locktag,
+							   InvalidOid,
+							   DatabaseRelationId,
+							   xlrec->db_id,
+							   0);
+			StandbyAcquireLock(XLogRecGetXid(record), &locktag, AccessExclusiveLock);
 			ResolveRecoveryConflictWithDatabase(xlrec->db_id);
+
+			/* Lock pg_database, to conflict with base backups */
+			SET_LOCKTAG_RELATION(locktag, 0, DatabaseRelationId);
+			StandbyAcquireLock(XLogRecGetXid(record), &locktag, RowExclusiveLock);
 		}
 
 		/* Drop pages for this database that are in the shared buffer cache */
@@ -2112,18 +2167,6 @@ dbase_redo(XLogReaderState *record)
 			ereport(WARNING,
 					(errmsg("some useless files may be left behind in old database directory \"%s\"",
 							dst_path)));
-
-		if (InHotStandby)
-		{
-			/*
-			 * Release locks prior to commit. XXX There is a race condition
-			 * here that may allow backends to reconnect, but the window for
-			 * this is small because the gap between here and commit is mostly
-			 * fairly small and it is unlikely that people will be dropping
-			 * databases that we are trying to connect to anyway.
-			 */
-			UnlockSharedObjectForSession(DatabaseRelationId, xlrec->db_id, 0, AccessExclusiveLock);
-		}
 	}
 	else
 		elog(PANIC, "dbase_redo: unknown op code %u", info);
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 02ecf3d..f051ad6 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -718,6 +718,8 @@ LockAcquireExtended(const LOCKTAG *locktag,
 						lockMethodTable->lockModeNames[lockmode]),
 				 errhint("Only RowExclusiveLock or less can be acquired on database objects during recovery.")));
 
+	Assert(!InRecovery || (sessionLock && dontWait));
+
 #ifdef LOCK_DEBUG
 	if (LOCK_DEBUG_ENABLED(locktag))
 		elog(LOG, "LockAcquire: lock [%u,%u] %s",
-- 
2.2.1.212.gc5b9256

0003-Allow-walsender-to-be-cancelled-by-recovery-conflict.patchtext/x-patch; charset=us-asciiDownload
>From bf41f8e77fd9f1d3f4951a8a27528a6ecd4d432f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 30 Jan 2015 09:32:29 +0100
Subject: [PATCH 3/4] Allow walsender to be cancelled by recovery conflict
 interrupts.

This is preliminary work to allow walsender to receive lock conflicts
interrupts which is required by a later patch. It's separated out for
easier review.
---
 src/backend/replication/walsender.c | 14 +-------------
 1 file changed, 1 insertion(+), 13 deletions(-)

diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 05d2339..3f25ccf 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -185,7 +185,6 @@ static XLogRecPtr logical_startptr = InvalidXLogRecPtr;
 
 /* Signal handlers */
 static void WalSndSigHupHandler(SIGNAL_ARGS);
-static void WalSndXLogSendHandler(SIGNAL_ARGS);
 static void WalSndLastCycleHandler(SIGNAL_ARGS);
 
 /* Prototypes for private functions */
@@ -2566,17 +2565,6 @@ WalSndSigHupHandler(SIGNAL_ARGS)
 	errno = save_errno;
 }
 
-/* SIGUSR1: set flag to send WAL records */
-static void
-WalSndXLogSendHandler(SIGNAL_ARGS)
-{
-	int			save_errno = errno;
-
-	latch_sigusr1_handler();
-
-	errno = save_errno;
-}
-
 /* SIGUSR2: set flag to do a last cycle and shut down afterwards */
 static void
 WalSndLastCycleHandler(SIGNAL_ARGS)
@@ -2610,7 +2598,7 @@ WalSndSignals(void)
 	pqsignal(SIGQUIT, quickdie);	/* hard crash time */
 	InitializeTimeouts();		/* establishes SIGALRM handler */
 	pqsignal(SIGPIPE, SIG_IGN);
-	pqsignal(SIGUSR1, WalSndXLogSendHandler);	/* request WAL sending */
+	pqsignal(SIGUSR1, procsignal_sigusr1_handler);
 	pqsignal(SIGUSR2, WalSndLastCycleHandler);	/* request a last cycle and
 												 * shutdown */
 
-- 
2.2.1.212.gc5b9256

0004-Don-t-allow-changing-a-database-s-tablespace-during-.patchtext/x-patch; charset=us-asciiDownload
>From 891e527a6e48a69334eb5960a7a7add0625353d1 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 30 Jan 2015 09:35:41 +0100
Subject: [PATCH 4/4] Don't allow changing a database's tablespace during a
 basebackup.

A basebackup that's being run while a ALTER DATABASE ... SET
TABLESPACE is occuring can end up being corrupted because moving a
database doesn't use normal WAL mechanisms. It can happen that the
basebackup tries to backup a tablespace/database directory while
movedb() is removing the directory. That can lead to inconsistent old
and new database directories.  Normally such inconsistencies are fixed
by WAL replay, but the WAL doesn't contain the data of the database
that's moved and instead just recopies the database directories.

It would be nicer to fix this problem by making such database use
proper WAL logging, but that's not going to be backpatchable. So we
prevent basebackups from occurring while ALTER DATABASE ... SET
TABLESPACE is happening and the other way round.

We do this in one direction by acquiring a lock on pg_database when
starting a base backup, for the entirety of a streaming base backup
and when replaying database events.  For the case when ALTER DATABASE
... SET TABLESPACE is called when a exclusive (i.e. not a replication
protocol one) is in progress a error is raised. That's required
because we can't hold a lock between the pg_start_backup() and
pg_stop_backup() calls since they can be made in independent sessions.

That approach means that external code calling
do_pg_start_backup(not_exclusive) aren't protected, but there doesn't
seem to be a realistic way to protect those. They'll have to do the
locking themselves.

Discussion: 20150120152819.GC24381@alap3.anarazel.de

Backpatch all the way.
---
 src/backend/access/transam/xlog.c    | 29 +++++++++++++++++++++++++++++
 src/backend/commands/dbcommands.c    | 33 +++++++++++++++++++++++++++++++++
 src/backend/replication/basebackup.c | 15 +++++++++++++++
 src/backend/storage/lmgr/lmgr.c      | 31 +++++++++++++++++++++++++++++++
 src/backend/utils/init/postinit.c    |  2 +-
 src/include/access/xlog.h            |  1 +
 src/include/storage/lmgr.h           |  3 +++
 7 files changed, 113 insertions(+), 1 deletion(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 629a457..38e7dff 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -53,6 +53,7 @@
 #include "storage/ipc.h"
 #include "storage/large_object.h"
 #include "storage/latch.h"
+#include "storage/lmgr.h"
 #include "storage/pmsignal.h"
 #include "storage/predicate.h"
 #include "storage/proc.h"
@@ -9291,6 +9292,12 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
 				 errmsg("backup label too long (max %d bytes)",
 						MAXPGPATH)));
+	/*
+	 * Acquire lock on pg datababase to prevent concurrent CREATE DATABASE and
+	 * ALTER DATABASE ... SET TABLESPACE from running. In case of an error
+	 * it'll be released by the transaction abort code.
+	 */
+	LockRelationOidForSession(DatabaseRelationId, ShareLock);
 
 	/*
 	 * Mark backup active in shared memory.  We must do full-page WAL writes
@@ -9523,6 +9530,8 @@ do_pg_start_backup(const char *backupidstr, bool fast, TimeLineID *starttli_p,
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(pg_start_backup_callback, (Datum) BoolGetDatum(exclusive));
 
+	UnlockRelationOidForSession(DatabaseRelationId, ShareLock);
+
 	/*
 	 * We're done.  As a convenience, return the starting WAL location.
 	 */
@@ -9937,6 +9946,26 @@ do_pg_abort_backup(void)
 }
 
 /*
+ * Is a (exclusive or nonexclusive) base backup running?
+ *
+ * Note that this does not check whether any standby of this node has a
+ * basebackup running, or whether any upstream master (if this is a standby)
+ * has one in progress
+ */
+bool
+LocalBaseBackupInProgress(void)
+{
+	bool ret;
+
+	WALInsertLockAcquire();
+	ret = XLogCtl->Insert.exclusiveBackup ||
+		XLogCtl->Insert.nonExclusiveBackups > 0;
+	WALInsertLockRelease();
+
+	return ret;
+}
+
+/*
  * Get latest redo apply position.
  *
  * Exported to allow WALReceiver to read the pointer directly.
diff --git a/src/backend/commands/dbcommands.c b/src/backend/commands/dbcommands.c
index 3845eb7..e6a3352 100644
--- a/src/backend/commands/dbcommands.c
+++ b/src/backend/commands/dbcommands.c
@@ -1071,6 +1071,9 @@ movedb(const char *dbname, const char *tblspcname)
 	LockSharedObjectForSession(DatabaseRelationId, db_id, 0,
 							   AccessExclusiveLock);
 
+	/* Acquire lock on pg_database against concurrent base backups starting */
+	LockRelationOidForSession(DatabaseRelationId, RowExclusiveLock);
+
 	/*
 	 * Permission checks
 	 */
@@ -1087,6 +1090,30 @@ movedb(const char *dbname, const char *tblspcname)
 				 errmsg("cannot change the tablespace of the currently open database")));
 
 	/*
+	 * Prevent SET TABLESPACE from running concurrently with a base
+	 * backup. Without that check a base backup would potentially copy a
+	 * partially removed source database; which WAL replay then would copy
+	 * over the new database...
+	 *
+	 * Starting a base backup takes a SHARE lock on pg_database. In addition a
+	 * streaming basebackup takes the same lock for the entirety of the copy
+	 * of the data directory.  That, combined with this check, prevents base
+	 * backups from being taken at the same time a SET TABLESPACE is in
+	 * progress.
+	 *
+	 * Note that this check here will not trigger if a standby currently has a
+	 * base backup ongoing; instead WAL replay will have a recovery conflict
+	 * when replaying the DBASE_CREATE record. That is only safe because
+	 * standbys can only do nonexclusive base backups which hold the lock over
+	 * the entire runtime - otherwise no lock would prevent replay after
+	 * pg_start_backup().
+	 */
+	if (LocalBaseBackupInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("cannot change a database's tablespace while a base backup is in progress")));
+
+	/*
 	 * Get tablespace's oid
 	 */
 	dst_tblspcoid = get_tablespace_oid(tblspcname, false);
@@ -1116,6 +1143,7 @@ movedb(const char *dbname, const char *tblspcname)
 		heap_close(pgdbrel, NoLock);
 		UnlockSharedObjectForSession(DatabaseRelationId, db_id, 0,
 									 AccessExclusiveLock);
+		UnlockRelationOidForSession(DatabaseRelationId, RowExclusiveLock);
 		return;
 	}
 
@@ -1352,6 +1380,7 @@ movedb(const char *dbname, const char *tblspcname)
 	/* Now it's safe to release the database lock */
 	UnlockSharedObjectForSession(DatabaseRelationId, db_id, 0,
 								 AccessExclusiveLock);
+	UnlockRelationOidForSession(DatabaseRelationId, RowExclusiveLock);
 }
 
 /* Error cleanup callback for movedb */
@@ -2090,6 +2119,10 @@ dbase_redo(XLogReaderState *record)
 							   0);
 			StandbyAcquireLock(XLogRecGetXid(record), &locktag, AccessExclusiveLock);
 			ResolveRecoveryConflictWithDatabase(xlrec->src_db_id);
+
+			/* Lock pg_database, to conflict with base backups */
+			SET_LOCKTAG_RELATION(locktag, 0, DatabaseRelationId);
+			StandbyAcquireLock(XLogRecGetXid(record), &locktag, RowExclusiveLock);
 		}
 
 		/*
diff --git a/src/backend/replication/basebackup.c b/src/backend/replication/basebackup.c
index 3058ce9..a0b590f 100644
--- a/src/backend/replication/basebackup.c
+++ b/src/backend/replication/basebackup.c
@@ -17,8 +17,10 @@
 #include <unistd.h>
 #include <time.h>
 
+#include "access/xact.h"
 #include "access/xlog_internal.h"		/* for pg_start/stop_backup */
 #include "catalog/catalog.h"
+#include "catalog/pg_database.h"
 #include "catalog/pg_type.h"
 #include "lib/stringinfo.h"
 #include "libpq/libpq.h"
@@ -32,6 +34,8 @@
 #include "replication/walsender_private.h"
 #include "storage/fd.h"
 #include "storage/ipc.h"
+#include "storage/lmgr.h"
+#include "storage/lock.h"
 #include "utils/builtins.h"
 #include "utils/elog.h"
 #include "utils/ps_status.h"
@@ -134,6 +138,9 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 
 	startptr = do_pg_start_backup(opt->label, opt->fastcheckpoint, &starttli,
 								  &labelfile);
+
+	LockRelationOidForSession(DatabaseRelationId, ShareLock);
+
 	/*
 	 * Once do_pg_start_backup has been called, ensure that any failure causes
 	 * us to abort the backup so we don't "leak" a backup counter. For this reason,
@@ -304,6 +311,9 @@ perform_base_backup(basebackup_options *opt, DIR *tblspcdir)
 	}
 	PG_END_ENSURE_ERROR_CLEANUP(base_backup_cleanup, (Datum) 0);
 
+	/* release lock preventing base backups - no risk while backing up WAL */
+	UnlockRelationOidForSession(DatabaseRelationId, ShareLock);
+
 	endptr = do_pg_stop_backup(labelfile, !opt->nowait, &endtli);
 
 	if (opt->includewal)
@@ -675,6 +685,9 @@ SendBaseBackup(BaseBackupCmd *cmd)
 		set_ps_display(activitymsg, false);
 	}
 
+	/* to provide error handling around locks */
+	StartTransactionCommand();
+
 	/* Make sure we can open the directory with tablespaces in it */
 	dir = AllocateDir("pg_tblspc");
 	if (!dir)
@@ -684,6 +697,8 @@ SendBaseBackup(BaseBackupCmd *cmd)
 	perform_base_backup(&opt, dir);
 
 	FreeDir(dir);
+
+	CommitTransactionCommand();
 }
 
 static void
diff --git a/src/backend/storage/lmgr/lmgr.c b/src/backend/storage/lmgr/lmgr.c
index d13a167..c428b38 100644
--- a/src/backend/storage/lmgr/lmgr.c
+++ b/src/backend/storage/lmgr/lmgr.c
@@ -248,6 +248,37 @@ UnlockRelation(Relation relation, LOCKMODE lockmode)
 }
 
 /*
+ *		LockRelationOidForSession
+ *
+ * Lock a relation in session mode, without requiring having it opened it in
+ * transaction local mode. That's likely only useful during WAL replay.
+ */
+void
+LockRelationOidForSession(Oid relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SetLocktagRelationOid(&tag, relid);
+
+	(void) LockAcquire(&tag, lockmode, true, false);
+}
+
+/*
+ *		UnlockRelationOidForSession
+ *
+ * Unlock lock acquired by LockRelationOidForSession.
+ */
+void
+UnlockRelationOidForSession(Oid relid, LOCKMODE lockmode)
+{
+	LOCKTAG		tag;
+
+	SetLocktagRelationOid(&tag, relid);
+
+	LockRelease(&tag, lockmode, true);
+}
+
+/*
  *		LockHasWaitersRelation
  *
  * This is a functiion to check if someone else is waiting on a
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 1f5cf06..dfe2322 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -886,7 +886,7 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
 			ereport(FATAL,
 					(errcode(ERRCODE_UNDEFINED_DATABASE),
 					 errmsg("database \"%s\" does not exist", dbname),
-			   errdetail("It seems to have just been dropped or renamed.")));
+			   errdetail("It seems to have just been dropped, moved or renamed.")));
 	}
 
 	/*
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 138deaf..f3893f3 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -253,6 +253,7 @@ extern XLogRecPtr do_pg_start_backup(const char *backupidstr, bool fast,
 extern XLogRecPtr do_pg_stop_backup(char *labelfile, bool waitforarchive,
 				  TimeLineID *stoptli_p);
 extern void do_pg_abort_backup(void);
+extern bool LocalBaseBackupInProgress(void);
 
 /* File path names (all relative to $PGDATA) */
 #define BACKUP_LABEL_FILE		"backup_label"
diff --git a/src/include/storage/lmgr.h b/src/include/storage/lmgr.h
index f5d70e5..2397f11 100644
--- a/src/include/storage/lmgr.h
+++ b/src/include/storage/lmgr.h
@@ -50,6 +50,9 @@ extern bool LockHasWaitersRelation(Relation relation, LOCKMODE lockmode);
 extern void LockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 extern void UnlockRelationIdForSession(LockRelId *relid, LOCKMODE lockmode);
 
+extern void LockRelationOidForSession(Oid relid, LOCKMODE lockmode);
+extern void UnlockRelationOidForSession(Oid relid, LOCKMODE lockmode);
+
 /* Lock a relation for extension */
 extern void LockRelationForExtension(Relation relation, LOCKMODE lockmode);
 extern void UnlockRelationForExtension(Relation relation, LOCKMODE lockmode);
-- 
2.2.1.212.gc5b9256

#21Bruce Momjian
bruce@momjian.us
In reply to: Andres Freund (#20)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On Fri, Jan 30, 2015 at 09:36:42PM +0100, Andres Freund wrote:

On 2015-01-27 20:16:43 +0100, Andres Freund wrote:

Here's an alternative approach. I think it generally is superior and
going in the right direction, but I'm not sure it's backpatchable.

It basically consists out of:
1) Make GetLockConflicts() actually work.

already commited as being a independent problem.

2) Allow the startup process to actually acquire locks other than
AccessExclusiveLocks. There already is code acquiring other locks,
but it's currently broken because they're acquired in blocking mode
which isn't actually supported in the startup mode. Using this
infrastructure we can actually fix a couple small races around
database creation/drop.
3) Allow session locks during recovery to be heavier than
RowExclusiveLock - since relation/object session locks aren't ever
taken out by the user (without a plain lock first) that's safe.

merged and submitted independently.

5) Make walsender actually react to recovery conflict interrupts

submitted here. (0003)

4) Perform streaming base backups from within a transaction command, to
provide error handling.
6) Prevent access to the template database of a CREATE DATABASE during
WAL replay.
7) Add an interlock to prevent base backups and movedb() to run
concurrently. What we actually came here for.

combined, submitted here. (0004)

I think this actually doesn't look that bad.

Where are we on this?

--
Bruce Momjian <bruce@momjian.us> http://momjian.us
EnterpriseDB http://enterprisedb.com

+ Everyone has their own god. +

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#22Abhijit Menon-Sen
ams@2ndQuadrant.com
In reply to: Andres Freund (#20)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

At 2015-01-30 21:36:42 +0100, andres@2ndquadrant.com wrote:

Here's an alternative approach. I think it generally is superior and
going in the right direction, but I'm not sure it's backpatchable.

It basically consists out of:
1) Make GetLockConflicts() actually work.

already commited as being a independent problem.

2) Allow the startup process to actually acquire locks other than
AccessExclusiveLocks. There already is code acquiring other locks,
but it's currently broken because they're acquired in blocking mode
which isn't actually supported in the startup mode. Using this
infrastructure we can actually fix a couple small races around
database creation/drop.
3) Allow session locks during recovery to be heavier than
RowExclusiveLock - since relation/object session locks aren't ever
taken out by the user (without a plain lock first) that's safe.

merged and submitted independently.

5) Make walsender actually react to recovery conflict interrupts

submitted here. (0003)

4) Perform streaming base backups from within a transaction command, to
provide error handling.
6) Prevent access to the template database of a CREATE DATABASE during
WAL replay.
7) Add an interlock to prevent base backups and movedb() to run
concurrently. What we actually came here for.

combined, submitted here. (0004)

I think this actually doesn't look that bad.

Hi Andres. Do you intend to commit these patches? (I just verified that
they apply without trouble to HEAD.)

-- Abhijit

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#23Andres Freund
andres@anarazel.de
In reply to: Abhijit Menon-Sen (#22)
Re: basebackups during ALTER DATABASE ... SET TABLESPACE ... not safe?

On 2015-04-28 10:26:54 +0530, Abhijit Menon-Sen wrote:

Hi Andres. Do you intend to commit these patches? (I just verified that
they apply without trouble to HEAD.)

I intend to come back to them, yes. I'm not fully sure whether we should
really apply them to the back branches.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers