pgsql: Augment WAL records for btree delete with GetOldestXmin() to
Log Message:
-----------
Augment WAL records for btree delete with GetOldestXmin() to reduce
false positives during Hot Standby conflict processing. Simple
patch to enhance conflict processing, following previous discussions.
Controlled by parameter minimize_standby_conflicts = on | off, with
default off allows measurement of performance impact to see whether
it should be set on all the time.
Modified Files:
--------------
pgsql/doc/src/sgml:
config.sgml (r1.246 -> r1.247)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/doc/src/sgml/config.sgml?r1=1.246&r2=1.247)
pgsql/src/backend/access/nbtree:
nbtpage.c (r1.115 -> r1.116)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/nbtree/nbtpage.c?r1=1.115&r2=1.116)
pgsql/src/backend/access/transam:
xlog.c (r1.364 -> r1.365)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/access/transam/xlog.c?r1=1.364&r2=1.365)
pgsql/src/backend/utils/misc:
guc.c (r1.536 -> r1.537)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/misc/guc.c?r1=1.536&r2=1.537)
postgresql.conf.sample (r1.274 -> r1.275)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/backend/utils/misc/postgresql.conf.sample?r1=1.274&r2=1.275)
pgsql/src/include/access:
xlog.h (r1.99 -> r1.100)
(http://anoncvs.postgresql.org/cvsweb.cgi/pgsql/src/include/access/xlog.h?r1=1.99&r2=1.100)
sriggs@postgresql.org (Simon Riggs) writes:
Log Message:
-----------
Augment WAL records for btree delete with GetOldestXmin() to reduce
false positives during Hot Standby conflict processing. Simple
patch to enhance conflict processing, following previous discussions.
Controlled by parameter minimize_standby_conflicts = on | off, with
default off allows measurement of performance impact to see whether
it should be set on all the time.
WTF? Simon, this seems to be getting way way beyond anything the
community has agreed to. Particularly, inventing GUCs is not something
to be doing without consensus.
regards, tom lane
Tom Lane wrote:
sriggs@postgresql.org (Simon Riggs) writes:
Log Message:
-----------
Augment WAL records for btree delete with GetOldestXmin() to reduce
false positives during Hot Standby conflict processing. Simple
patch to enhance conflict processing, following previous discussions.
Controlled by parameter minimize_standby_conflicts = on | off, with
default off allows measurement of performance impact to see whether
it should be set on all the time.WTF? Simon, this seems to be getting way way beyond anything the
community has agreed to. Particularly, inventing GUCs is not something
to be doing without consensus.
yeah actually both of the last commits seem rather strange given the
current discussion on this list I must say...
Stefan
On Fri, 2010-01-29 at 14:04 -0500, Tom Lane wrote:
sriggs@postgresql.org (Simon Riggs) writes:
Log Message:
-----------
Augment WAL records for btree delete with GetOldestXmin() to reduce
false positives during Hot Standby conflict processing. Simple
patch to enhance conflict processing, following previous discussions.
Controlled by parameter minimize_standby_conflicts = on | off, with
default off allows measurement of performance impact to see whether
it should be set on all the time.WTF? Simon, this seems to be getting way way beyond anything the
community has agreed to. Particularly, inventing GUCs is not something
to be doing without consensus.
If you or anybody else would like me to revoke it then I am happy to do
that, with no problem or argument. I agree it hasn't had discussion,
though am happy to have such a discussion.
The commit is a one line change, with parameter to control it, discussed
by Heikki and myself in December 2008. I stand by the accuracy of the
change; the parameter is really to ensure we can test during beta.
I imagined such a minor addition would pass without comment. This is not
the same patch not-being-discussed on the other thread, which is too
complex for a commit without review. It is a separate change altogether,
although it does relate to conflict processing. In any case I would
never commit a patch while discussion on it continues.
--
Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes:
On Fri, 2010-01-29 at 14:04 -0500, Tom Lane wrote:
WTF? Simon, this seems to be getting way way beyond anything the
community has agreed to. Particularly, inventing GUCs is not something
to be doing without consensus.
If you or anybody else would like me to revoke it then I am happy to do
that, with no problem or argument. I agree it hasn't had discussion,
though am happy to have such a discussion.
The commit is a one line change, with parameter to control it, discussed
by Heikki and myself in December 2008. I stand by the accuracy of the
change; the parameter is really to ensure we can test during beta.
Well, I was waiting to see if anyone else had an opinion, but: my
opinion is that a GUC is not appropriate here. Either test it yourself
enough to be sure it's a win, or don't put it in.
regards, tom lane
On Sun, 2010-01-31 at 14:07 -0500, Tom Lane wrote:
The commit is a one line change, with parameter to control it, discussed
by Heikki and myself in December 2008. I stand by the accuracy of the
change; the parameter is really to ensure we can test during beta.Well, I was waiting to see if anyone else had an opinion, but: my
opinion is that a GUC is not appropriate here. Either test it yourself
enough to be sure it's a win, or don't put it in.
I will remove the parameter then, keeping the augmentation. That OK?
--
Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote:
On Sun, 2010-01-31 at 14:07 -0500, Tom Lane wrote:
The commit is a one line change, with parameter to control it, discussed
by Heikki and myself in December 2008. I stand by the accuracy of the
change; the parameter is really to ensure we can test during beta.Well, I was waiting to see if anyone else had an opinion, but: my
opinion is that a GUC is not appropriate here. Either test it yourself
enough to be sure it's a win, or don't put it in.I will remove the parameter then, keeping the augmentation. That OK?
Well how much is the actual hit with this on the master for different
workloads do we have realistic numbers on that? Also how much of an
actual win is it in the other direction - as in under what circumstances
and workloads does it help in avoiding superflous cancelations on the
standby?
Stefan
On Sun, 2010-01-31 at 20:48 +0100, Stefan Kaltenbrunner wrote:
Simon Riggs wrote:
On Sun, 2010-01-31 at 14:07 -0500, Tom Lane wrote:
The commit is a one line change, with parameter to control it, discussed
by Heikki and myself in December 2008. I stand by the accuracy of the
change; the parameter is really to ensure we can test during beta.Well, I was waiting to see if anyone else had an opinion, but: my
opinion is that a GUC is not appropriate here. Either test it yourself
enough to be sure it's a win, or don't put it in.I will remove the parameter then, keeping the augmentation. That OK?
Well how much is the actual hit with this on the master for different
workloads do we have realistic numbers on that? Also how much of an
actual win is it in the other direction - as in under what circumstances
and workloads does it help in avoiding superflous cancelations on the
standby?
At the moment a btree delete record will cancel every request
1. no matter how long they have been running
2. no matter if they haven't accessed the index being cleaned (they
might later, is the thinking...)
This change improves case (1) in that it will only remove queries that
are older than the oldest snapshot on the primary, should
max_standby_delay be exceeded. Case (2) would have been improved by my
other proposed patch should max_standby_delay be exceeded.
The cost of improving case (1) is that we do the equivalent of taking a
snapshot of the procarray while holding locks on the btree block being
cleaned. That will increase index contention somewhat in applications
that do btree deletes, i.e. non-hot index updates or deletes.
Greg Stark's comments have been that he wants to see max_standby_delay =
-1 put back, which would effecively prevent both (1) and (2) from
occurring. I am working on that now.
--
Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes:
At the moment a btree delete record will cancel every request
1. no matter how long they have been running
2. no matter if they haven't accessed the index being cleaned (they
might later, is the thinking...)
That seems seriously horrid. What is the rationale for #2 in
particular? I would hope that at worst this would affect sessions
that are actively competing for the index being cleaned.
regards, tom lane
On Sun, 2010-01-31 at 15:41 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
At the moment a btree delete record will cancel every request
1. no matter how long they have been running
2. no matter if they haven't accessed the index being cleaned (they
might later, is the thinking...)That seems seriously horrid. What is the rationale for #2 in
particular? I would hope that at worst this would affect sessions
that are actively competing for the index being cleaned.
That is exactly the feedback I received from many other people and why I
prioritised the relation-specific conflict patch.
It's worse that that because point 2 effects WAL cleanup records for the
heap also.
The rationale is that a session *might* in the future access a table,
and if it did so it would receive the wrong answer *potentially*. This
is the issue I have been discussing for a long time now, in various
forms, starting on-list in Aug 2008.
--
Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote:
On Sun, 2010-01-31 at 20:48 +0100, Stefan Kaltenbrunner wrote:
Simon Riggs wrote:
On Sun, 2010-01-31 at 14:07 -0500, Tom Lane wrote:
The commit is a one line change, with parameter to control it, discussed
by Heikki and myself in December 2008. I stand by the accuracy of the
change; the parameter is really to ensure we can test during beta.Well, I was waiting to see if anyone else had an opinion, but: my
opinion is that a GUC is not appropriate here. Either test it yourself
enough to be sure it's a win, or don't put it in.I will remove the parameter then, keeping the augmentation. That OK?
Well how much is the actual hit with this on the master for different
workloads do we have realistic numbers on that? Also how much of an
actual win is it in the other direction - as in under what circumstances
and workloads does it help in avoiding superflous cancelations on the
standby?At the moment a btree delete record will cancel every request
1. no matter how long they have been running
2. no matter if they haven't accessed the index being cleaned (they
might later, is the thinking...)This change improves case (1) in that it will only remove queries that
are older than the oldest snapshot on the primary, should
max_standby_delay be exceeded. Case (2) would have been improved by my
other proposed patch should max_standby_delay be exceeded.The cost of improving case (1) is that we do the equivalent of taking a
snapshot of the procarray while holding locks on the btree block being
cleaned. That will increase index contention somewhat in applications
that do btree deletes, i.e. non-hot index updates or deletes.
hmm makes sense from the HS side - but I think to make a case for a GUC
it would help a lot to see actual numbers(as in benchmarks) showing how
much of a hit it is to the master.
Stefan
On Sun, 2010-01-31 at 22:05 +0100, Stefan Kaltenbrunner wrote:
hmm makes sense from the HS side - but I think to make a case for a GUC
it would help a lot to see actual numbers(as in benchmarks) showing how
much of a hit it is to the master.
Agreed, though my approach to benchmarking was to provide the mechanism
by which it was possible to benchmark. I didn't presume to be able to
cover wide area with benchmarking tests just for this.
--
Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote:
On Sun, 2010-01-31 at 22:05 +0100, Stefan Kaltenbrunner wrote:
hmm makes sense from the HS side - but I think to make a case for a GUC
it would help a lot to see actual numbers(as in benchmarks) showing how
much of a hit it is to the master.Agreed, though my approach to benchmarking was to provide the mechanism
by which it was possible to benchmark. I didn't presume to be able to
cover wide area with benchmarking tests just for this.
I don't think this patch helps much though. It's setting lastRemovedXid
to GetOldestXmin(), which is still a very pessimistic estimate. In fact,
if there's only one transaction running in the master, it's not any
better than just setting it to InvalidTransactionId and killing all
active queries in the standby.
What we'd want to set it to is the xmin/xmax of the oldest heap tuple
whose pointer was removed from the index page. That could be much much
older than GetOldestXmin(), allowing many more read-only queries to live
in the standby.
IIRC it was Greg Stark who suggested last time this was discussed that
we could calculate the exact value for latestRemovedXid in the standby.
When replaying the deletion record, the standby could look at all the
heap tuples whose index pointers are being removed, to see which one was
newest. That can be pretty expensive, involving random I/O, but it gives
an exact value, and doesn't stress the master at all. And we could skip
it if there's no potentially conflicting read-only queries running.
That seems like the most user-friendly approach to me. Even though
random I/O is expensive, surely it's better than killing queries.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
IIRC it was Greg Stark who suggested last time this was discussed that
we could calculate the exact value for latestRemovedXid in the standby.
When replaying the deletion record, the standby could look at all the
heap tuples whose index pointers are being removed, to see which one was
newest.
This assumes that the standby can tell which heap tuples are being
removed, which I'm not sure is true --- consider the case where the
deletion record is carrying a page image instead of a list of deleted
tuples. But maybe we could rejigger the representation to make sure
that the info is always available. In general it sounds like a much
nicer answer.
regards, tom lane
On Sun, 2010-01-31 at 23:43 +0200, Heikki Linnakangas wrote:
IIRC it was Greg Stark who suggested last time this was discussed that
we could calculate the exact value for latestRemovedXid in the
standby. When replaying the deletion record, the standby could look at
all the heap tuples whose index pointers are being removed, to see
which one was newest. That can be pretty expensive, involving random
I/O, but it gives an exact value, and doesn't stress the master at
all. And we could skip it if there's no potentially conflicting
read-only queries running.That seems like the most user-friendly approach to me. Even though
random I/O is expensive, surely it's better than killing queries.
Best solution, no time to do it.
Should I revoke this change?
--
Simon Riggs www.2ndQuadrant.com
Simon Riggs <simon@2ndQuadrant.com> writes:
Should I revoke this change?
Please. We can always put it back later if nothing better gets
implemented.
regards, tom lane
On Sun, 2010-01-31 at 17:10 -0500, Tom Lane wrote:
Simon Riggs <simon@2ndQuadrant.com> writes:
Should I revoke this change?
Please.
Will do, in morning.
--
Simon Riggs www.2ndQuadrant.com
On Sun, 2010-01-31 at 16:53 -0500, Tom Lane wrote:
Heikki Linnakangas <heikki.linnakangas@enterprisedb.com> writes:
IIRC it was Greg Stark who suggested last time this was discussed that
we could calculate the exact value for latestRemovedXid in the standby.
When replaying the deletion record, the standby could look at all the
heap tuples whose index pointers are being removed, to see which one was
newest.This assumes that the standby can tell which heap tuples are being
removed, which I'm not sure is true --- consider the case where the
deletion record is carrying a page image instead of a list of deleted
tuples. But maybe we could rejigger the representation to make sure
that the info is always available. In general it sounds like a much
nicer answer.
I'm looking at this now since we definitely need either this or the
previous patch.
XLOG_BTREE_DELETE records would not carry latestRemovedXid, this would
be calculated by inspection of heap tuples. XLOG_BTREE_DELETE would have
a heap relid, to allow direct access to the heap without a Relation. We
would check xmin, xmax and xvac. Code will look a little like
heap_fetch() though disimilar enough not to use directly. We would only
look at the tuple header and info.
--
Simon Riggs www.2ndQuadrant.com
Simon Riggs wrote:
XLOG_BTREE_DELETE records would not carry latestRemovedXid, this would
be calculated by inspection of heap tuples.
You might still want to keep the conservative latestRemovedXid value in
the WAL record to avoid the extra work when latestRemovedXid alone allows.
--
Heikki Linnakangas
EnterpriseDB http://www.enterprisedb.com
On Tue, 2010-03-09 at 11:20 +0200, Heikki Linnakangas wrote:
Simon Riggs wrote:
XLOG_BTREE_DELETE records would not carry latestRemovedXid, this would
be calculated by inspection of heap tuples.You might still want to keep the conservative latestRemovedXid value in
the WAL record to avoid the extra work when latestRemovedXid alone allows.
OK will do.
--
Simon Riggs www.2ndQuadrant.com