Hot Standby (v9d)
Latest version of Hot Standby includes new deferred cancellation for
conflict resolution and further refactoring.
http://wiki.postgresql.org/wiki/Hot_Standby#Usage
[Please note that the max_standby_delay parameter has been set to default
to zero (0) to allow investigation of usability, i.e. queries will be
cancelled fairly quickly if they conflict. Please set this in
recovery.conf to any value you consider reasonable and report findings.]
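For example, a line like this in recovery.conf (the value 30 is purely
illustrative; see the wiki page above for the parameter's exact units and
semantics):

```
# Hypothetical setting: 0 cancels conflicting queries almost
# immediately; larger values let standby queries run longer before
# conflicting WAL replay cancels them.
max_standby_delay = 30
```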
GENERAL PLANS
* Bug fix v9 over next few days
* Put corrected version into GIT
* Produce outstanding items as patch-on-patch via GIT
* Work with reviewers to nibble away until fully committed
OUTSTANDING ITEMS (as of Version 9d)
Reported problems (Identified by) (Target release)
* None (as yet!) - please post all bugs to list
All v9 bugs have been fixed. This version has passed initial bash and
function tests and we will continue testing this version tomorrow.
Main work/required additional items (Proposed by) (Target release)
* Complete Prepared Transaction support (Simon)
Performance tuning opportunities (Proposed by) (Target release)
* tuning of RecordKnownAssignedTransactionIds() using hash table
(Heikki)
Usability enhancements
* don't cancel all currently connected users every time we drop a
tablespace that has temp files being used within that tablespace -
should be able to parse the filenames to get pids to cancel
* don't kill all currently connected users every time we drop a user -
should be able to match list of current users against connected roles
* more work possible on btree deletes to reduce impact - should be able
to work out a reasonable latestRemovedXid
Testing of any kind welcome, the heavier the better. Save the database
and logs if it crashes, please. Performance data especially valuable.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Attachments:
hs.v9d.20090122_4.tar.bz2 (application/x-bzip-compressed-tar)
Simon Riggs wrote:
* Put corrected version into GIT
* Produce outstanding items as patch-on-patch via GIT
I've applied the hot standby patch and recovery infra v9 patch to
branches in my git repository, and pushed those to:
git://git.postgresql.org/git/~hlinnaka/pgsql.git
You can clone that to get started.
I've set those branches up so that the hot standby branch is branched
off from the recovery infra branch. I'd like to keep the two separate,
as the recovery infra patch is useful on its own, and can be reviewed
separately.
As a teaser, I made a couple of minor changes after importing your
patches. For the sake of the archives, I've also included those changes
as patches here.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Attachments:
0001-Remove-padding-in-XLogCtl-might-be-a-good-idea-but.patch (text/x-diff)
On Fri, 2009-01-23 at 16:14 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
* Put corrected version into GIT
* Produce outstanding items as patch-on-patch via GIT
I've applied the hot standby patch and recovery infra v9 patch to
branches in my git repository, and pushed those to:
git://git.postgresql.org/git/~hlinnaka/pgsql.git
You can clone that to get started.
I've set those branches up so that the hot standby branch is branched
off from the recovery infra branch. I'd like to keep the two separate,
as the recovery infra patch is useful on its own, and can be reviewed
separately.
As a teaser, I made a couple of minor changes after importing your
patches. For the sake of the archives, I've also included those changes
as patches here.
OK, I'll look at those. I've fixed 6 bugs today (!), so I'd rather go
for the new version coming out in an hour. That's too many to unpick
unfortunately.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2009-01-23 at 16:14 +0200, Heikki Linnakangas wrote:
I made a couple of minor changes after importing your
patches.
I've applied them both to v9g, attached here.
If you wouldn't mind redoing the initial step, I will promise not to do
anything else to the code, except via patch on GIT.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Attachments:
hs.v9g.20090123_3.tar.bz2 (application/x-bzip-compressed-tar)
Simon Riggs wrote:
On Fri, 2009-01-23 at 16:14 +0200, Heikki Linnakangas wrote:
I made a couple of minor changes after importing your
patches.
I've applied them both to v9g, attached here.
If you wouldn't mind redoing the initial step, I will promise not to do
anything else to the code, except via patch on GIT.
No problem. I don't need to do it from scratch, I'll just apply the
changes from that patch as an incremental commit. Done, you can see it
in my git repo now too.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
@@ -1601,6 +1602,24 @@ BufferProcessRecoveryConflictsIfAny(volatile BufferDesc *bufHdr)
{
 	XLogRecPtr	bufLSN = BufferGetLSN(bufHdr);
+	/*
+	 * If the buffer is recent we may need to cancel ourselves
+	 * rather than risk returning a wrong answer. This test is
+	 * too conservative, but it is correct.
+	 *
+	 * We only need to cancel the current subtransaction.
+	 * Once we've handled the error then other subtransactions can
+	 * continue processing. Note that we do *not* reset the
+	 * BufferRecoveryConflictLSN at subcommit/abort, but we do
+	 * reset it if we release our last remaining snapshot.
+	 * See SnapshotResetXmin().
+	 */
Is it really enough to cancel just the current subtransaction? What if
it's a serializable transaction?
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Fri, 2009-01-23 at 18:22 +0200, Heikki Linnakangas wrote:
@@ -1601,6 +1602,24 @@ BufferProcessRecoveryConflictsIfAny(volatile BufferDesc *bufHdr)
{
 	XLogRecPtr	bufLSN = BufferGetLSN(bufHdr);
+	/*
+	 * If the buffer is recent we may need to cancel ourselves
+	 * rather than risk returning a wrong answer. This test is
+	 * too conservative, but it is correct.
+	 *
+	 * We only need to cancel the current subtransaction.
+	 * Once we've handled the error then other subtransactions can
+	 * continue processing. Note that we do *not* reset the
+	 * BufferRecoveryConflictLSN at subcommit/abort, but we do
+	 * reset it if we release our last remaining snapshot.
+	 * See SnapshotResetXmin().
+	 */

Is it really enough to cancel just the current subtransaction? What if
it's a serializable transaction?
I did originally think that when I first looked at the problem. I'm
sorry if I say that a lot.
If you have a serializable transaction with subtransactions that suffers
a serializability error it only cancels the current subtransaction. That
means its snapshot is still valid and can be used again. By analogy, as
long as a transaction does not see any data that is inconsistent with
its snapshot it seems OK for it to continue. So I think it is correct.
(Bizarrely, this might mean that if we did this programmatically in a
loop we might keep the system busy for some time while it continually
re-reads data and fails. But that's another story).
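The analogy can be sketched with an ordinary savepoint (table and queries
are hypothetical; on a standby the failing step would be a recovery
conflict rather than a serialization error):

```sql
BEGIN TRANSACTION ISOLATION LEVEL SERIALIZABLE;
SELECT count(*) FROM accounts;     -- snapshot taken here
SAVEPOINT s1;
SELECT sum(balance) FROM accounts; -- suppose this one gets cancelled
ROLLBACK TO SAVEPOINT s1;          -- only the subtransaction is undone
SELECT count(*) FROM accounts;     -- same snapshot, still valid
COMMIT;
```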
You remind me that we can now do what Kevin has requested and throw a
errcode(ERRCODE_T_R_SERIALIZATION_FAILURE) at this point, which I agree
is the most easily understood way of describing this error.
(I was sorely tempted to make it "snapshot too old", as a joke).
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
If you have a serializable transaction with subtransactions that suffers
a serializability error it only cancels the current subtransaction. That
means its snapshot is still valid and can be used again. By analogy, as
long as a transaction does not see any data that is inconsistent with
its snapshot it seems OK for it to continue. So I think it is correct.
Yeah, you're right. How bizarre.
(I was sorely tempted to make it "snapshot too old", as a joke).
Yeah, that is a very descriptive message, actually. If there weren't any
precedent for it, I believe we might have come up with exactly that
message ourselves.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas wrote:
Simon Riggs wrote:
If you have a serializable transaction with subtransactions that suffers
a serializability error it only cancels the current subtransaction. That
means its snapshot is still valid and can be used again. By analogy, as
long as a transaction does not see any data that is inconsistent with
its snapshot it seems OK for it to continue. So I think it is correct.
Yeah, you're right. How bizarre.
It was argued this way to me way back when subtransactions were written.
Originally I had written it in such a way as to abort the whole
transaction, on the rationale that if you're blindly restarting the
subtransaction after a serialization error, it would result in a (maybe
infinite) loop. I think the reason it only aborts the subxact is that
that's what all other errors do, so why would this one act differently.
Hmm, now that I think about it, I think it was deadlock errors not
serialization errors ...
--
Alvaro Herrera http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
If you have a serializable transaction with subtransactions that
suffers
a serializability error it only cancels the current subtransaction.
This is a complete tangent to your work, but I wonder if this is
really right. I mean it's not as if we could have serialized the
transaction as a whole with respect to whatever other transaction we
failed.
On Thu, 2009-01-22 at 22:35 +0000, Simon Riggs wrote:
* Bug fix v9 over next few days
version 9g - please use this for testing now
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Attachments:
hs.v9g.20090123_3.tar.bz2 (application/x-bzip-compressed-tar)
On Fri, 2009-01-23 at 20:13 +0000, Greg Stark wrote:
If you have a serializable transaction with subtransactions that
suffers
a serializability error it only cancels the current subtransaction.
This is a complete tangent to your work, but I wonder if this is
really right. I mean it's not as if we could have serialized the
transaction as a whole with respect to whatever other transaction we
failed.
Isn't this back to the discussion about PostgreSQL serializability
versus true serializability?
I agree that's not true serializability.
Regards,
Jeff Davis
Simon Riggs wrote:
On Thu, 2009-01-22 at 22:35 +0000, Simon Riggs wrote:
* Bug fix v9 over next few days
version 9g - please use this for testing now
I'm doing some test runs with this now. I notice an old flatfiles
related bug has reappeared:
master:
=# create database test;
slave:
=# select datname from pg_database where datname like 'test';
datname
---------
test
(1 row)
postgres=# \c test
FATAL: database "test" does not exist
Previous connection kept
On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote:
version 9g - please use this for testing now
I'm doing some test runs with this now. I notice an old flatfiles
related bug has reappeared:
I'm seeing an off-by-one error on xmax, in some cases. That then causes
the flat file update to not pick up correct info, even though it
executed in other ways as intended. If you run two create databases and
then test only the first, it appears to have worked as intended.
These bugs are a result of recent refactoring and it will take a few days
to shake some of them out. We've had more than 20 already so we're
beating them back, but we're not done yet.
Thanks for the report, will publish this and other fixes on Monday.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Sat, 2009-01-24 at 11:20 +0000, Simon Riggs wrote:
On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote:
version 9g - please use this for testing now
I'm doing some test runs with this now. I notice an old flatfiles
related bug has reappeared:
I'm seeing an off-by-one error on xmax, in some cases. That then causes
the flat file update to not pick up correct info, even though it
executed in other ways as intended. If you run two create databases and
then test only the first, it appears to have worked as intended.
These bugs are a result of recent refactoring and it will take a few days
to shake some of them out. We've had more than 20 already so we're
beating them back, but we're not done yet.
I was at a loss to explain how this could have slipped through our
tests. It appears that the error was corrected following each checkpoint
as a result of ProcArrayUpdateRunningXacts(). Our tests were performed
after a short delay, which typically would be greater than the
deliberately short setting of checkpoint_timeout/archive_timeout and so
by the time we looked the error was gone and masked the problem. We're
setting checkpoint_timeout to 30 mins now to avoid the delay...
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
On Sat, 2009-01-24 at 11:20 +0000, Simon Riggs wrote:
On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote:
version 9g - please use this for testing now
I'm doing some test runs with this now. I notice an old flatfiles
related bug has reappeared:
I'm seeing an off-by-one error on xmax, in some cases. That then causes
the flat file update to not pick up correct info, even though it
executed in other ways as intended. If you run two create databases and
then test only the first, it appears to have worked as intended.
These bugs are a result of recent refactoring and it will take a few days
to shake some of them out. We've had more than 20 already so we're
beating them back, but we're not done yet.
I was at a loss to explain how this could have slipped through our
tests. It appears that the error was corrected following each checkpoint
as a result of ProcArrayUpdateRunningXacts(). Our tests were performed
after a short delay, which typically would be greater than the
deliberately short setting of checkpoint_timeout/archive_timeout and so
by the time we looked the error was gone and masked the problem. We're
setting checkpoint_timeout to 30 mins now to avoid the delay...
That makes sense - I had archive_timeout set at 5 minutes, and I would
have checked before 5 minutes were up.
Cheers
Mark
On Sat, 2009-01-24 at 17:24 +1300, Mark Kirkwood wrote:
I'm doing some test runs with this now. I notice an old flatfiles
related bug has reappeared:
Bug fix patch against git repo.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Attachments:
hs.v9h_dif_v9g.patch (text/x-patch; charset=UTF-8)
I skimmed through the Hot Standby patch for a preliminary review. I noted the
following things, some minor tweaks, some just questions. None of the things I
noted are big issues unless some of the questions uncover issues.
1) This code is obviously a cut-pasto:
+ else if (strcmp(tok1, "max_standby_delay") == 0)
+ {
+ errno = 0;
+ maxStandbyDelay = (TransactionId) strtoul(tok2, NULL, 0);
+ if (errno == EINVAL || errno == ERANGE)
+ ereport(FATAL,
+ (errmsg("max_standby_delay is not a valid number: \"%s\"",
+ tok2)));
Have you ever tested the behaviour with max_standby_delay = -1 ?
Also, the default max_standby_delay is currently 0 -- ie, kill queries
immediately upon detecting any conflicts at all -- which I don't really think
anyone would be happy with. I still *strongly* feel the default has to be the
non-destructive conservative -1.
2) This hard coded constant of 500ms seems arbitrary to me. If your use case
is a heavily-used reporting engine you might get this message all the time. I
think this either has to be configurable (warn_standby_delay?) or based on
max_standby_delay or some other parameter somehow.
+ /*
+ * log that we have been waiting for a while now...
+ */
+ if (!logged && standbyWait_ms > 500)
3) These two blocks of code seem unsatisfactory:
! /*
! * Keep track of the block number of the lastBlockVacuumed, so
! * we can scan those blocks as well during WAL replay. This then
! * provides concurrency protection and allows btrees to be used
! * while in recovery.
! */
! if (lastBlockVacuumed > vstate->lastBlockVacuumed)
! vstate->lastBlockVacuumed = lastBlockVacuumed;
!
+ /*
+ * XXXHS we don't actually need to read the block, we
+ * just need to confirm it is unpinned. If we had a special call
+ * into the buffer manager we could optimise this so that
+ * if the block is not in shared_buffers we confirm it as unpinned.
+ */
+ buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL);
+ if (BufferIsValid(buffer))
+ {
+ LockBufferForCleanup(buffer);
+ UnlockReleaseBuffer(buffer);
+ }
Aside from the XXX comment (I thought we actually had such a call now, but if
not, shouldn't we just add one instead of carping?) I'm not convinced this
handles all the cases that can arise. Notably, what happens if a previous
vacuum died in the middle of the scan?
I think we have a vacuum id which we use already with btrees to address the
same issue. It seems to me this would be more robust if you stamped the xlog
record with that id number and ensured it matched the id of the
lastBlockVacuumed. Then you could start from 0 if the id changes.
4) Why is this necessary?
+ if (IsRecoveryProcessingMode() &&
+ locktag->locktag_type == LOCKTAG_OBJECT &&
+ lockmode > AccessShareLock)
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("cannot acquire lockmode %s on database objects while recovery is in progress",
+ lockMethodTable->lockModeNames[lockmode]),
+ errhint("Only AccessShareLock can be acquired on database objects during recovery.")));
+
Obviously we can't lock records (SELECT FOR UPDATE) since that requires write
access. But shouldn't we be able to do manual LOCK TABLE calls?
I hope this isn't the only interlock we have against trying to do write i/o or
DDL against tables...?
5) The code for killing off conflicting transactions looks odd to me but I
haven't really traced through it to see what it's doing. It seems to kill off
just one process? What if there are several holding the lock in question?
Also, it doesn't seem to take into account how long the various transactions
have held the lock?
Is there a risk of, for example, if you have a long report running against a
table and lots of OLTP requests against the same table it seems the conflict
resolution code would fire randomly into the crowd and hit mostly innocent
OLTP transactions until eventually it found the culprit report?
Also, it kills off backends using SIGINT which I *think* Tom went through and
made safe earlier this release, right?
In any case if the backend doesn't die off promptly we wait forever with no
further warnings or log messages. I would think we should at least print some
sort of message here, even if it's a "can't happen" elog. The doubling thing
is probably unnecessary too in this case.
+ if (!XLogRecPtrIsValid(conflict_LSN))
+ {
+ /* wait awhile for it to die */
+ pg_usleep(wontDieWait * 5000L);
+ wontDieWait *= 2;
+ }
+ }
6) I still don't understand why you need unobserved_xids. We don't need this
in normal running, an xid we don't know for certain is committed is exactly
the same as a transaction we know is currently running or aborted. So why do
you need it during HS?
The comment says:
+ * This is very important because if we leave
+ * those xids out of the snapshot then they will appear to be already complete.
+ * Later, when they have actually completed this could lead to confusion as to
+ * whether those xids are visible or not, blowing a huge hole in MVCC.
+ * We need 'em.
But that doesn't sound rational to me. I'm not sure what "confusion" this
would cause. If they later actually complete then any existing snapshots would
still not see them. And any later snapshots wouldn't be confused by the
earlier conclusions.
--
Gregory Stark
EnterpriseDB http://www.enterprisedb.com
Ask me about EnterpriseDB's 24x7 Postgres support!
On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:
I skimmed through the Hot Standby patch for a preliminary review. I noted the
following things, some minor tweaks, some just questions. None of the things I
noted are big issues unless some of the questions uncover issues.
Thanks for your attention.
1) This code is obviously a cut-pasto:
+ else if (strcmp(tok1, "max_standby_delay") == 0)
+ {
+     errno = 0;
+     maxStandbyDelay = (TransactionId) strtoul(tok2, NULL, 0);
+     if (errno == EINVAL || errno == ERANGE)
+         ereport(FATAL,
+             (errmsg("max_standby_delay is not a valid number: \"%s\"",
+                     tok2)));

Have you ever tested the behaviour with max_standby_delay = -1 ?
Also, the default max_standby_delay is currently 0 -- ie, kill queries
immediately upon detecting any conflicts at all -- which I don't really think
anyone would be happy with.
Agreed. As explained when I published that patch it is deliberately
severe to allow testing of conflict resolution and feedback on it.
I still *strongly* feel the default has to be the
non-destructive conservative -1.
I don't. Primarily, we must support high availability. It is much better
if we get people saying "I get my queries cancelled" and we say RTFM and
change parameter X, than if people say "my failover was 12 hours behind
when I needed it to be 10 seconds behind and I lost a $1 million because
of downtime of Postgres" and we say RTFM and change parameter X.
2) This hard coded constant of 500ms seems arbitrary to me. If your use case
is a heavily-used reporting engine you might get this message all the time. I
think this either has to be configurable (warn_standby_delay?) or based on
max_standby_delay or some other parameter somehow.

+ /*
+  * log that we have been waiting for a while now...
+  */
+ if (!logged && standbyWait_ms > 500)
Greg, that's a DEBUG5 message.
3) These two blocks of code seem unsatisfactory:
! /*
! * Keep track of the block number of the lastBlockVacuumed, so
! * we can scan those blocks as well during WAL replay. This then
! * provides concurrency protection and allows btrees to be used
! * while in recovery.
! */
! if (lastBlockVacuumed > vstate->lastBlockVacuumed)
! vstate->lastBlockVacuumed = lastBlockVacuumed;
!
+ /*
+  * XXXHS we don't actually need to read the block, we
+  * just need to confirm it is unpinned. If we had a special call
+  * into the buffer manager we could optimise this so that
+  * if the block is not in shared_buffers we confirm it as unpinned.
+  */
+ buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno, RBM_NORMAL);
+ if (BufferIsValid(buffer))
+ {
+     LockBufferForCleanup(buffer);
+     UnlockReleaseBuffer(buffer);
+ }

Aside from the XXX comment (I thought we actually had such a call now, but if
not shouldn't we just add one instead of carping?)
We should add many things. The equivalent optimisation of VACUUM in
normal running has not been done either. Considering we have both HOT
and visibility map enhanced VACUUM, it's not a priority.
I'm not convinced this
handles all the cases that can arise. Notably, what happens if a previous
vacuum died in the middle of the scan?
Nothing, because it won't then have removed any heap rows.
I think we have a vacuum id which we use already with btrees to address the
same issue. It seems to me this would be more robust if you stamped the xlog
record with that id number and ensured it matched the id of the
lastBlockVacuumed. Then you could start from 0 if the id changes.
4) Why is this necessary?
+ if (IsRecoveryProcessingMode() &&
+     locktag->locktag_type == LOCKTAG_OBJECT &&
+     lockmode > AccessShareLock)
+     ereport(ERROR,
+         (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+          errmsg("cannot acquire lockmode %s on database objects while recovery is in progress",
+                 lockMethodTable->lockModeNames[lockmode]),
+          errhint("Only AccessShareLock can be acquired on database objects during recovery.")));
+

Obviously we can't lock records (SELECT FOR UPDATE) since that requires write
access. But shouldn't we be able to do manual LOCK TABLE calls?
Read the rest of the comments on the locking code section.
I hope this isn't the only interlock we have against trying to do write i/o or
DDL against tables...?
No it's not.
5) The code for killing off conflicting transactions looks odd to me but I
haven't really traced through it to see what it's doing. It seems to kill off
just one process?
No, why do you think that?
What if there are several holding the lock in question?
Covered.
Also, it doesn't seem to take into account how long the various transactions
have held the lock?
True. Why would that matter?
Is there a risk of, for example, if you have a long report running against a
table and lots of OLTP requests against the same table
it seems the conflict resolution code would fire randomly into the
crowd
OK, that's where I stop bothering to respond.
--
Simon Riggs www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Wed, 2009-01-28 at 19:27 +0000, Simon Riggs wrote:
On Wed, 2009-01-28 at 18:55 +0000, Gregory Stark wrote:
Agreed. As explained when I published that patch it is deliberately
severe to allow testing of conflict resolution and feedback on it.
I still *strongly* feel the default has to be the
non-destructive conservative -1.
I don't. Primarily, we must support high availability. It is much better
if we get people saying "I get my queries cancelled" and we say RTFM and
change parameter X, than if people say "my failover was 12 hours behind
when I needed it to be 10 seconds behind and I lost a $1 million because
of downtime of Postgres" and we say RTFM and change parameter X.
If the person was stupid enough to configure it for such a thing they
deserve to lose the money. Not to mention we have already lost them
as a user because they will blame postgresql regardless of reality as
evidenced by their inability to RTFM (or have a vendor that RTFMs) in
the first place.
I've got to vote with Greg on this one.
Sincerely,
Joshua D. Drake
--
PostgreSQL - XMPP: jdrake@jabber.postgresql.org
Consulting, Development, Support, Training
503-667-4564 - http://www.commandprompt.com/
The PostgreSQL Company, serving since 1997