Avoiding pin scan during btree vacuum
During VACUUM of btrees, we need to pin all pages, even those where tuples
are not removed, which I am calling here the "pin scan". This is especially
a problem during redo, where removing one tuple from a 100GB btree can take
a minute or more. That causes replication lags. Bad Thing.
Previously, I have suggested ways of optimizing that, and code comments
reflect those suggestions. I would like to look at removing the pin scan
entirely, on standbys only.
In
www.postgresql.org/message-id/flat/721615179.3351449.1423959585771.JavaMail.yahoo@mail.yahoo.com,
the focus was on removing pins from btrees.
In the commit message for that thread/patch,
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=2ed5b87f96d473962ec5230fd820abfeaccb2069
we see that there are some preconditions to altering the locking.
"This patch leaves behavior unchanged for some cases, which can be
addressed separately so that each case can be evaluated on its own
merits. These unchanged cases are when a scan uses a non-MVCC
snapshot, an index-only scan, and a scan of a btree index for which
modifications are not WAL-logged. If later patches allow all of
these cases to drop the buffer pin after reading a leaf page, then
the btree vacuum process can be simplified; it will no longer need
the "super-exclusive" lock to delete tuples from a page."
The cases for master and standby are different. The standby is simpler, yet
more important to change, since the pin scan happens in the foreground.
Looking just at the case for standbys, we see there are 3 cases:
* non-WAL logged indexes - does not apply on a standby, so ignore
* non-MVCC snapshot - looks like only TOAST scans are a problem on standbys
* index only scans (IOS) - the analysis of which looks wrong to me...
IOSs always check the visibility before using the tuple. If a tuple is
about to be removed by a VACUUM then the tuple will already be dead and the
visibility map will always be set to not-all-visible. So any tuple being
removed by vacuum can never cause a problem to an IOS. Hence the locking
interactions are not required, at least on standbys, for normal tables.
So it looks like we can skip the "pin scan" during redo unless we are
vacuuming a toast index.
Patch attached.
Notice that the patch does not slacken the requirement to
super-exclusive-lock the block from which tuples are being removed. The
only thing it does is skip past the requirement to pin each of the
intervening blocks where nothing has happened.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
avoid_standby_pin_scan.v1.patch (application/octet-stream)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cf4a6dc..3ae74d1 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -22,6 +22,7 @@
#include "access/relscan.h"
#include "access/xlog.h"
#include "catalog/index.h"
+#include "catalog/pg_namespace.h"
#include "commands/vacuum.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
@@ -836,6 +837,14 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
vstate.lastBlockVacuumed < vstate.lastBlockLocked)
{
Buffer buf;
+ BlockNumber lastBlockVacuumed = InvalidBlockNumber;
+
+ /*
+ * If we are vacuuming a Toast Index, then set lastBlockVacuumed
+ * to enable pin scan during Hot Standby.
+ */
+ if (rel->rd_rel->relnamespace == PG_TOAST_NAMESPACE)
+ lastBlockVacuumed = vstate.lastBlockVacuumed;
/*
* The page should be valid, but we can't use _bt_getbuf() because we
@@ -847,7 +856,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
RBM_NORMAL, info->strategy);
LockBufferForCleanup(buf);
_bt_checkpage(rel, buf);
- _bt_delitems_vacuum(rel, buf, NULL, 0, vstate.lastBlockVacuumed);
+ _bt_delitems_vacuum(rel, buf, NULL, 0, lastBlockVacuumed);
_bt_relbuf(rel, buf);
}
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index da28e21..f28680c 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -412,7 +412,7 @@ btree_xlog_vacuum(XLogReaderState *record)
* isn't yet consistent; so we need not fear reading still-corrupt blocks
* here during crash recovery.
*/
- if (HotStandbyActiveInReplay())
+ if (HotStandbyActiveInReplay() && BlockNumberIsValid(xlrec->lastBlockVacuumed))
{
RelFileNode thisrnode;
BlockNumber thisblkno;
Simon Riggs wrote:
During VACUUM of btrees, we need to pin all pages, even those where tuples
are not removed, which I am calling here the "pin scan". This is especially
a problem during redo, where removing one tuple from a 100GB btree can take
a minute or more. That causes replication lags. Bad Thing.
Agreed on there being a problem here.
As a reminder, this "pin scan" is implemented in the
HotStandbyActiveInReplay() block in btree_xlog_vacuum; this routine is
in charge of replaying an action recorded by _bt_delitems_vacuum. That
replay works by acquiring cleanup lock on all btree blocks from
lastBlockVacuumed to "this one" (i.e. the one in which elements are
being removed). In essence, this means pinning *all* the blocks in the
index, which is bad.
The new code sets lastBlockVacuumed to Invalid; a new test in the replay
code makes that value not pin anything. This is supposed to optimize
the case in question.
One thing that this patch should update is the comment above struct
xl_btree_vacuum, which currently states that EnsureBlockUnpinned() is one
of the options to apply to each block, but that routine doesn't exist.
One possible naive optimization is to avoid locking pages that aren't
leaf pages, but that would still require fetching the block from disk,
so it wouldn't actually optimize anything other than the cleanup lock
acquisition. (And probably not too useful, since leaf pages are said to
be 95% - 99% of index pages anyway.)
Also of interest is the comment above the call to _bt_delitems_vacuum in
btvacuumpage(); that needs an update too.
It appears to me that your patch changes the call in btvacuumscan() but
not the one in btvacuumpage(). I assume you should be changing both.
I think the new comment that talks about Toast Index should explain
*why* we can skip the pinning in all cases except that one, instead of
just saying we can do it.
--
Álvaro Herrera http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
I think the new comment that talks about Toast Index should explain
*why* we can skip the pinning in all cases except that one, instead of
just saying we can do it.
I've not looked at that code in a long while, but my recollection is
that it's designed that way to protect non-MVCC catalog scans, which
are gone now ... except for SnapshotToast. Maybe the right way to
approach this is to adjust HeapTupleSatisfiesToast (or maybe just
convince ourselves that no one could be dereferencing a stale toast
pointer in the first place?) and then redesign btree vacuuming without
the requirement to support non-MVCC scans, period.
regards, tom lane
On 21 December 2015 at 21:28, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Simon Riggs wrote:
During VACUUM of btrees, we need to pin all pages, even those where tuples
are not removed, which I am calling here the "pin scan". This is especially
a problem during redo, where removing one tuple from a 100GB btree can take
a minute or more. That causes replication lags. Bad Thing.
Agreed on there being a problem here.
As a reminder, this "pin scan" is implemented in the
HotStandbyActiveInReplay() block in btree_xlog_vacuum; this routine is
in charge of replaying an action recorded by _bt_delitems_vacuum. That
replay works by acquiring cleanup lock on all btree blocks from
lastBlockVacuumed to "this one" (i.e. the one in which elements are
being removed). In essence, this means pinning *all* the blocks in the
index, which is bad.
The new code sets lastBlockVacuumed to Invalid; a new test in the replay
code makes that value not pin anything. This is supposed to optimize
the case in question.
Nice summary, thanks.
One thing that this patch should update is the comment above struct
xl_btree_vacuum, which currently states that EnsureBlockUnpinned() is one
of the options to apply to each block, but that routine doesn't exist.
Updated
One possible naive optimization is to avoid locking pages that aren't
leaf pages, but that would still require fetching the block from disk,
so it wouldn't actually optimize anything other than the cleanup lock
acquisition. (And probably not too useful, since leaf pages are said to
be 95% - 99% of index pages anyway.)
Agreed, not worth it.
Also of interest is the comment above the call to _bt_delitems_vacuum in
btvacuumpage(); that needs an update too.
Updated
It appears to me that your patch changes the call in btvacuumscan() but
not the one in btvacuumpage(). I assume you should be changing both.
Yes, that's correct. Updated.
I think the new comment that talks about Toast Index should explain
*why* we can skip the pinning in all cases except that one, instead of
just saying we can do it.
Greatly expanded comments.
Thanks for the review.
New patch attached.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
avoid_standby_pin_scan.v2.patch (application/octet-stream)
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 712385b..752e3b5 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -22,6 +22,7 @@
#include "access/relscan.h"
#include "access/xlog.h"
#include "catalog/index.h"
+#include "catalog/pg_namespace.h"
#include "commands/vacuum.h"
#include "storage/indexfsm.h"
#include "storage/ipc.h"
@@ -823,6 +824,11 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
}
/*
+ * Check to see if we need to issue one final WAL record for this index,
+ * which may be needed for correctness on a hot standby node when
+ * non-MVCC index scans could take place. This now only occurs when we
+ * perform a TOAST scan, so only occurs for TOAST indexes.
+ *
* If the WAL is replayed in hot standby, the replay process needs to get
* cleanup locks on all index leaf pages, just as we've been doing here.
* However, we won't issue any WAL records about pages that have no items
@@ -833,6 +839,7 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
* against the last leaf page in the index, if that one wasn't vacuumed.
*/
if (XLogStandbyInfoActive() &&
+ rel->rd_rel->relnamespace == PG_TOAST_NAMESPACE &&
vstate.lastBlockVacuumed < vstate.lastBlockLocked)
{
Buffer buf;
@@ -1031,6 +1038,20 @@ restart:
*/
if (ndeletable > 0)
{
+ BlockNumber lastBlockVacuumed = InvalidBlockNumber;
+
+ /*
+ * We may need to record the lastBlockVacuumed for use when
+ * non-MVCC scans might be performed on the index on a
+ * hot standby. See explanation in btree_xlog_vacuum().
+ *
+ * On a hot standby, a non-MVCC scan can only take place
+ * when we access a Toast Index, so we need only record
+ * the lastBlockVacuumed if we are vacuuming a Toast Index.
+ */
+ if (rel->rd_rel->relnamespace == PG_TOAST_NAMESPACE)
+ lastBlockVacuumed = vstate->lastBlockVacuumed;
+
/*
* Notice that the issued XLOG_BTREE_VACUUM WAL record includes an
* instruction to the replay code to get cleanup lock on all pages
@@ -1043,7 +1064,7 @@ restart:
* that.
*/
_bt_delitems_vacuum(rel, buf, deletable, ndeletable,
- vstate->lastBlockVacuumed);
+ lastBlockVacuumed);
/*
* Remember highest leaf page number we've issued a
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index bba4840..0d094ca 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -391,6 +391,19 @@ btree_xlog_vacuum(XLogReaderState *record)
BTPageOpaque opaque;
/*
+ * If we are running non-MVCC scans using this index we need to do some
+ * additional work to ensure correctness, which is known as a "pin scan"
+ * described in more detail in next paragraphs. We used to do the extra
+ * work in all cases, whereas we now avoid that work except when the index
+ * is a toast index, since toast scans aren't fully MVCC compliant.
+ * If lastBlockVacuumed is set to InvalidBlockNumber then we skip the
+ * additional work required for the pin scan.
+ *
+ * Avoiding this extra work is important since it requires us to touch
+ * every page in the index, so is an O(N) operation. Worse, it is an
+ * operation performed in the foreground during redo, so it delays
+ * replication directly.
+ *
* If queries might be active then we need to ensure every leaf page is
* unpinned between the lastBlockVacuumed and the current block, if there
* are any. This prevents replay of the VACUUM from reaching the stage of
@@ -412,7 +425,7 @@ btree_xlog_vacuum(XLogReaderState *record)
* isn't yet consistent; so we need not fear reading still-corrupt blocks
* here during crash recovery.
*/
- if (HotStandbyActiveInReplay())
+ if (HotStandbyActiveInReplay() && BlockNumberIsValid(xlrec->lastBlockVacuumed))
{
RelFileNode thisrnode;
BlockNumber thisblkno;
@@ -433,7 +446,8 @@ btree_xlog_vacuum(XLogReaderState *record)
* XXX we don't actually need to read the block, we just need to
* confirm it is unpinned. If we had a special call into the
* buffer manager we could optimise this so that if the block is
- * not in shared_buffers we confirm it as unpinned.
+ * not in shared_buffers we confirm it as unpinned. Optimizing
+ * this is now moot, since in most cases we avoid the scan.
*/
buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
RBM_NORMAL_NO_LOG);
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 9ebf446..b760833 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -331,8 +331,10 @@ typedef struct xl_btree_reuse_page
* The WAL record can represent deletion of any number of index tuples on a
* single index page when executed by VACUUM.
*
- * The correctness requirement for applying these changes during recovery is
- * that we must do one of these two things for every block in the index:
+ * For MVCC scans, lastBlockVacuumed will be set to InvalidBlockNumber.
+ * For a non-MVCC index scans there is an additional correctness requirement
+ * for applying these changes during recovery, which is that we must do one
+ * of these two things for every block in the index:
* * lock the block for cleanup and apply any required changes
* * EnsureBlockUnpinned()
* The purpose of this is to ensure that no index scans started before we
On 21 December 2015 at 21:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
I think the new comment that talks about Toast Index should explain
*why* we can skip the pinning in all cases except that one, instead of
just saying we can do it.
I've not looked at that code in a long while, but my recollection is
that it's designed that way to protect non-MVCC catalog scans, which
are gone now ... except for SnapshotToast.
Yes, that's the principle in use here. Which means we can optimize away the
scan on a Hot Standby in all cases except for Toast Index vacuums.
Maybe the right way to
approach this is to adjust HeapTupleSatisfiesToast (or maybe just
convince ourselves that no one could be dereferencing a stale toast
pointer in the first place?) and then redesign btree vacuuming without
the requirement to support non-MVCC scans, period.
We could, but it's likely to be a much more complex patch.
I'm happy with this being a simple patch now, not least because I would
like to backpatch this to 9.4 where catalog scans became MVCC.
A backpatch is warranted because it is a severe performance issue with
replication and we can fix that before 9.5 hits the streets.
I'll be doing some more testing and checking, so not in a rush.
--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2016-01-03 15:40:01 +0000, Simon Riggs wrote:
I'm happy with this being a simple patch now, not least because I would
like to backpatch this to 9.4 where catalog scans became MVCC.
A backpatch is warranted because it is a severe performance issue with
replication and we can fix that before 9.5 hits the streets.
I'll be doing some more testing and checking, so not in a rush.
This seems like a mighty subtle thing to backpatch. If we really want to
go there, ISTM that the relevant code should stew in an unreleased
branch for a while, before being backpatched.
Andres
Andres Freund <andres@anarazel.de> writes:
On 2016-01-03 15:40:01 +0000, Simon Riggs wrote:
I'm happy with this being a simple patch now, not least because I would
like to backpatch this to 9.4 where catalog scans became MVCC.
A backpatch is warranted because it is a severe performance issue with
replication and we can fix that before 9.5 hits the streets.
I'll be doing some more testing and checking, so not in a rush.
This seems like a mighty subtle thing to backpatch. If we really want to
go there, ISTM that the relevant code should stew in an unreleased
branch for a while, before being backpatched.
I'm definitely -1 on back-patching such a thing. Put it in HEAD for
awhile. If it survives six months or so then we could discuss it again.
regards, tom lane
On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This seems like a mighty subtle thing to backpatch. If we really want to
go there, ISTM that the relevant code should stew in an unreleased
branch for a while, before being backpatched.
I'm definitely -1 on back-patching such a thing. Put it in HEAD for
awhile. If it survives six months or so then we could discuss it again.
I agree with Tom.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This seems like a mighty subtle thing to backpatch. If we really want to
go there, ISTM that the relevant code should stew in an unreleased
branch for a while, before being backpatched.
I'm definitely -1 on back-patching such a thing. Put it in HEAD for
awhile. If it survives six months or so then we could discuss it again.
I agree with Tom.
Okay, several months have passed with this in the development branch and
now seems a good time to backpatch this all the way back to 9.4.
Any objections?
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Oct 20, 2016 at 4:00 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Robert Haas wrote:
On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This seems like a mighty subtle thing to backpatch. If we really want to
go there, ISTM that the relevant code should stew in an unreleased
branch for a while, before being backpatched.
I'm definitely -1 on back-patching such a thing. Put it in HEAD for
awhile. If it survives six months or so then we could discuss it again.
I agree with Tom.
Okay, several months have passed with this in the development branch and
now seems a good time to backpatch this all the way back to 9.4.
Are you talking about commit 3e4b7d87988f0835f137f15f5c1a40598dd21f3d?
If yes, do we want to retain this code in its current form under #define
UNUSED? Is there any advantage to doing so? Another point: won't this
commit make the information in xl_btree_vacuum redundant, so shouldn't we
remove its usage during WAL writing as well?
--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com
On Wed, Oct 19, 2016 at 6:30 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Robert Haas wrote:
On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This seems like a mighty subtle thing to backpatch. If we really want to
go there, ISTM that the relevant code should stew in an unreleased
branch for a while, before being backpatched.
I'm definitely -1 on back-patching such a thing. Put it in HEAD for
awhile. If it survives six months or so then we could discuss it again.
I agree with Tom.
Okay, several months have passed with this in the development branch and
now seems a good time to backpatch this all the way back to 9.4.
Any objections?
Although the code has now been in the tree for six months, it's only
been in a released branch for three weeks, which is something about
which to think.
I guess what's really bothering me is that I don't think this is a bug
fix. It seems like a performance optimization. And I think that we
generally have a policy that we don't back-patch performance
optimizations. Of course, there is some fuzziness because when the
performance gets sufficiently bad, we sometimes decide that it amounts
to a bug, as in 73cc7d3b240e1d46b1996382e5735a820f8bc3f7. Maybe this
case qualifies for similar treatment, but I'm not sure.
Why are you talking about back-patching this to 9.4 rather than all
supported branches?
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas wrote:
On Wed, Oct 19, 2016 at 6:30 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
Robert Haas wrote:
On Mon, Jan 4, 2016 at 10:30 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
This seems like a mighty subtle thing to backpatch. If we really want to
go there, ISTM that the relevant code should stew in an unreleased
branch for a while, before being backpatched.
I'm definitely -1 on back-patching such a thing. Put it in HEAD for
awhile. If it survives six months or so then we could discuss it again.
I agree with Tom.
Okay, several months have passed with this in the development branch and
now seems a good time to backpatch this all the way back to 9.4.
Any objections?
Although the code has now been in the tree for six months, it's only
been in a released branch for three weeks, which is something about
which to think.
The objection above was "stew in an unreleased branch", which it has.
I guess what's really bothering me is that I don't think this is a bug
fix. It seems like a performance optimization. And I think that we
generally have a policy that we don't back-patch performance
optimizations. Of course, there is some fuzziness because when the
performance gets sufficiently bad, we sometimes decide that it amounts
to a bug, as in 73cc7d3b240e1d46b1996382e5735a820f8bc3f7. Maybe this
case qualifies for similar treatment, but I'm not sure.
I have seen a number of situations in which the standby strangely lags
behind seemingly randomly sometimes for very long periods of time,
without any apparent cause, without any possible remedy, and it makes
the standby unusable. If the user happens to monitor the lag they may
notice it and some may decide not to run certain queries. In other
cases the I/O load is so bad that queries that otherwise run without a
hitch are stuck for a long time.
In my opinion, this is a serious problem.
Why are you talking about back-patching this to 9.4 rather than all
supported branches?
As far as I understand this is dependent on catalog scans being MVCC,
so it cannot be backpatched any further than that.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
I am ready now to backpatch this to 9.4 and 9.5; here are the patches.
They are pretty similar, but some adjustments were needed due to XLog
format changes in 9.5. (I kept most of Simon's original commit
message.)
This patch has survived in master for a long time, and in released 9.6
for a couple of months now. We have a couple of customers running with
this patch installed also (who make heavy use of their standbys),
without reported problems. Moreover, the next minors for 9.4 and 9.5
are scheduled for February 2017, so unless some major security problem
pops up that prompts a more urgent update, we have three months to go
before this is released to the masses running 9.4 and 9.5; if a bug
crops up in the meantime, we have plenty of time to get it fixed.
While reviewing this I noticed a small thinko in the README, which I'm
going to patch in 9.6 and master thusly (with the same commit message):
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 067d15c..a3f11da 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -521,11 +521,12 @@ because it allows running applications to continue while the standby
changes state into a normally running server.
The interlocking required to avoid returning incorrect results from
-MVCC scans is not required on standby nodes. That is because
+non-MVCC scans is not required on standby nodes. That is because
HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only
ever used during write transactions, which cannot exist on the standby.
-This leaves HeapTupleSatisfiesMVCC() and HeapTupleSatisfiesToast().
+MVCC scans are already protected by definition, so HeapTupleSatisfiesMVCC()
+is not a problem. That leaves concern only for HeapTupleSatisfiesToast().
HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
because it doesn't need to - if the main heap row is visible then the
toast rows will also be visible. So as long as we follow a toast
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachments:
REL9_4_STABLE-0001-Avoid-pin-scan-for-replay-of-XLOG_BTREE_VACUUM-in-al.patch (text/plain; charset=utf-8)
From caaba7a1db9d9981c8f93b6dda7d33979164845d Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Tue, 15 Nov 2016 22:26:19 -0300
Subject: [PATCH] Avoid pin scan for replay of XLOG_BTREE_VACUUM in all cases
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to
require complex interlocking that matched the requirements on the
master. This required an O(N) operation that became a significant
problem with large indexes, causing replication delays of seconds or in
some cases minutes while the XLOG_BTREE_VACUUM was replayed.
This commit skips the "pin scan" that was previously required, by
observing in detail when and how it is safe to do so, with full
documentation. The pin scan is skipped only in replay; the VACUUM code
path on master is not touched here.
No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.
This is a back-patch of commits 687f2cd7a015, 3e4b7d87988f, b60284261375
by Simon Riggs, to branches 9.4 and 9.5. No further backpatch is
possible because this depends on catalog scans being MVCC. I (Álvaro)
additionally updated a slight problem in the README, which explains why
this touches the 9.6 and master branches.
---
src/backend/access/nbtree/README | 22 ++++++++++++++++++++++
src/backend/access/nbtree/nbtree.c | 15 +++++++++++----
src/backend/access/nbtree/nbtxlog.c | 23 +++++++++++++++++++++--
src/include/access/nbtree.h | 6 ++++--
4 files changed, 58 insertions(+), 8 deletions(-)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 4820f76..997d272 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -486,6 +486,28 @@ normal running after recovery has completed. This is a key capability
because it allows running applications to continue while the standby
changes state into a normally running server.
+The interlocking required to avoid returning incorrect results from
+non-MVCC scans is not required on standby nodes. That is because
+HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
+HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only
+ever used during write transactions, which cannot exist on the standby.
+MVCC scans are already protected by definition, so HeapTupleSatisfiesMVCC()
+is not a problem. That leaves concern only for HeapTupleSatisfiesToast().
+HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
+because it doesn't need to - if the main heap row is visible then the
+toast rows will also be visible. So as long as we follow a toast
+pointer from a visible (live) tuple the corresponding toast rows
+will also be visible, so we do not need to recheck MVCC on them.
+There is one minor exception, which is that the optimizer sometimes
+looks at the boundaries of value ranges using SnapshotDirty, which
+could result in returning a newer value for query statistics; this
+would affect the query plan in rare cases, but not the correctness.
+The risk window is small since the stats look at the min and max values
+in the index, so the scan retrieves a tid then immediately uses it
+to look in the heap. It is unlikely that the tid could have been
+deleted, vacuumed and re-inserted in the time taken to look in the heap
+via direct tid access. So we ignore that scan type as a problem.
+
Other Things That Are Handy to Know
-----------------------------------
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index 36dc6c2..469c2ed 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -811,6 +811,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
}
/*
+ * Check to see if we need to issue one final WAL record for this index,
+ * which may be needed for correctness on a hot standby node when non-MVCC
+ * index scans could take place.
+ *
* If the WAL is replayed in hot standby, the replay process needs to get
* cleanup locks on all index leaf pages, just as we've been doing here.
* However, we won't issue any WAL records about pages that have no items
@@ -1018,10 +1022,13 @@ restart:
if (ndeletable > 0)
{
/*
- * Notice that the issued XLOG_BTREE_VACUUM WAL record includes an
- * instruction to the replay code to get cleanup lock on all pages
- * between the previous lastBlockVacuumed and this page. This
- * ensures that WAL replay locks all leaf pages at some point.
+ * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
+ * all information to the replay code to allow it to get a cleanup
+ * lock on all pages between the previous lastBlockVacuumed and
+ * this page. This ensures that WAL replay locks all leaf pages at
+ * some point, which is important should non-MVCC scans be
+ * requested. This is currently unused on standby, but we record
+ * it anyway, so that the WAL contains the required information.
*
* Since we can visit leaf pages out-of-order when recursing,
* replay might end up locking such pages an extra time, but it
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index 2debb87..42d65fa 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -479,7 +479,24 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
Page page;
BTPageOpaque opaque;
+#ifdef UNUSED
/*
+ * This section of code is thought to be no longer needed, after analysis
+ * of the calling paths. It is retained to allow the code to be reinstated
+ * if a flaw is revealed in that thinking.
+ *
+ * If we are running non-MVCC scans using this index we need to do some
+ * additional work to ensure correctness, which is known as a "pin scan"
+ * described in more detail in the next paragraphs. We used to do the extra
+ * work in all cases, whereas we now avoid that work in most cases. If
+ * lastBlockVacuumed is set to InvalidBlockNumber then we skip the
+ * additional work required for the pin scan.
+ *
+ * Avoiding this extra work is important since it requires us to touch
+ * every page in the index, so is an O(N) operation. Worse, it is an
+ * operation performed in the foreground during redo, so it delays
+ * replication directly.
+ *
* If queries might be active then we need to ensure every leaf page is
* unpinned between the lastBlockVacuumed and the current block, if there
* are any. This prevents replay of the VACUUM from reaching the stage of
@@ -500,7 +517,7 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
* isn't yet consistent; so we need not fear reading still-corrupt blocks
* here during crash recovery.
*/
- if (HotStandbyActiveInReplay())
+ if (HotStandbyActiveInReplay() && BlockNumberIsValid(xlrec->lastBlockVacuumed))
{
BlockNumber blkno;
@@ -517,7 +534,8 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
* XXX we don't actually need to read the block, we just need to
* confirm it is unpinned. If we had a special call into the
* buffer manager we could optimise this so that if the block is
- * not in shared_buffers we confirm it as unpinned.
+ * not in shared_buffers we confirm it as unpinned. Optimizing
+ * this is now moot, since in most cases we avoid the scan.
*/
buffer = XLogReadBufferExtended(xlrec->node, MAIN_FORKNUM, blkno,
RBM_NORMAL_NO_LOG);
@@ -528,6 +546,7 @@ btree_xlog_vacuum(XLogRecPtr lsn, XLogRecord *record)
}
}
}
+#endif
/*
* If we have a full-page image, restore it (using a cleanup lock) and
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index f281759..09708c0 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -342,8 +342,10 @@ typedef struct xl_btree_reuse_page
* The WAL record can represent deletion of any number of index tuples on a
* single index page when executed by VACUUM.
*
- * The correctness requirement for applying these changes during recovery is
- * that we must do one of these two things for every block in the index:
+ * For MVCC scans, lastBlockVacuumed will be set to InvalidBlockNumber.
+ * For non-MVCC index scans there is an additional correctness requirement
+ * for applying these changes during recovery, which is that we must do one
+ * of these two things for every block in the index:
* * lock the block for cleanup and apply any required changes
* * EnsureBlockUnpinned()
* The purpose of this is to ensure that no index scans started before we
--
2.1.4
REL9_5_STABLE-0001-Avoid-pin-scan-for-replay-of-XLOG_BTREE_VACUUM-in-al.patch
From f10f8b1c039463b1688a9ef3a13b69ca279014ce Mon Sep 17 00:00:00 2001
From: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date: Tue, 15 Nov 2016 22:32:12 -0300
Subject: [PATCH] Avoid pin scan for replay of XLOG_BTREE_VACUUM in all cases
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to
require complex interlocking that matched the requirements on the
master. This required an O(N) operation that became a significant
problem with large indexes, causing replication delays of seconds or in
some cases minutes while the XLOG_BTREE_VACUUM was replayed.
This commit skips the "pin scan" that was previously required, by
observing in detail when and how it is safe to do so, with full
documentation. The pin scan is skipped only in replay; the VACUUM code
path on master is not touched here.
No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.
This is a back-patch of commits 687f2cd7a015, 3e4b7d87988f, b60284261375
by Simon Riggs, to branches 9.4 and 9.5. No further backpatch is
possible because this depends on catalog scans being MVCC. I (Álvaro)
additionally updated a slight problem in the README, which explains why
this touches the 9.6 and master branches.
---
src/backend/access/nbtree/README | 22 ++++++++++++++++++++++
src/backend/access/nbtree/nbtree.c | 15 +++++++++++----
src/backend/access/nbtree/nbtxlog.c | 25 ++++++++++++++++++++++---
src/include/access/nbtree.h | 6 ++++--
4 files changed, 59 insertions(+), 9 deletions(-)
diff --git a/src/backend/access/nbtree/README b/src/backend/access/nbtree/README
index 7055c24..a3f11da 100644
--- a/src/backend/access/nbtree/README
+++ b/src/backend/access/nbtree/README
@@ -520,6 +520,28 @@ normal running after recovery has completed. This is a key capability
because it allows running applications to continue while the standby
changes state into a normally running server.
+The interlocking required to avoid returning incorrect results from
+non-MVCC scans is not required on standby nodes. That is because
+HeapTupleSatisfiesUpdate(), HeapTupleSatisfiesSelf(),
+HeapTupleSatisfiesDirty() and HeapTupleSatisfiesVacuum() are only
+ever used during write transactions, which cannot exist on the standby.
+MVCC scans are already protected by definition, so HeapTupleSatisfiesMVCC()
+is not a problem. That leaves concern only for HeapTupleSatisfiesToast().
+HeapTupleSatisfiesToast() doesn't use MVCC semantics, though that's
+because it doesn't need to - if the main heap row is visible then the
+toast rows will also be visible. So as long as we follow a toast
+pointer from a visible (live) tuple the corresponding toast rows
+will also be visible, so we do not need to recheck MVCC on them.
+There is one minor exception, which is that the optimizer sometimes
+looks at the boundaries of value ranges using SnapshotDirty, which
+could result in returning a newer value for query statistics; this
+would affect the query plan in rare cases, but not the correctness.
+The risk window is small since the stats look at the min and max values
+in the index, so the scan retrieves a tid then immediately uses it
+to look in the heap. It is unlikely that the tid could have been
+deleted, vacuumed and re-inserted in the time taken to look in the heap
+via direct tid access. So we ignore that scan type as a problem.
+
Other Things That Are Handy to Know
-----------------------------------
diff --git a/src/backend/access/nbtree/nbtree.c b/src/backend/access/nbtree/nbtree.c
index cd2d4a6..fc31fe4 100644
--- a/src/backend/access/nbtree/nbtree.c
+++ b/src/backend/access/nbtree/nbtree.c
@@ -804,6 +804,10 @@ btvacuumscan(IndexVacuumInfo *info, IndexBulkDeleteResult *stats,
}
/*
+ * Check to see if we need to issue one final WAL record for this index,
+ * which may be needed for correctness on a hot standby node when non-MVCC
+ * index scans could take place.
+ *
* If the WAL is replayed in hot standby, the replay process needs to get
* cleanup locks on all index leaf pages, just as we've been doing here.
* However, we won't issue any WAL records about pages that have no items
@@ -1013,10 +1017,13 @@ restart:
if (ndeletable > 0)
{
/*
- * Notice that the issued XLOG_BTREE_VACUUM WAL record includes an
- * instruction to the replay code to get cleanup lock on all pages
- * between the previous lastBlockVacuumed and this page. This
- * ensures that WAL replay locks all leaf pages at some point.
+ * Notice that the issued XLOG_BTREE_VACUUM WAL record includes
+ * all information to the replay code to allow it to get a cleanup
+ * lock on all pages between the previous lastBlockVacuumed and
+ * this page. This ensures that WAL replay locks all leaf pages at
+ * some point, which is important should non-MVCC scans be
+ * requested. This is currently unused on standby, but we record
+ * it anyway, so that the WAL contains the required information.
*
* Since we can visit leaf pages out-of-order when recursing,
* replay might end up locking such pages an extra time, but it
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index da28e21..d569ae1 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -385,12 +385,29 @@ static void
btree_xlog_vacuum(XLogReaderState *record)
{
XLogRecPtr lsn = record->EndRecPtr;
- xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
Buffer buffer;
Page page;
BTPageOpaque opaque;
+#ifdef UNUSED
+ xl_btree_vacuum *xlrec = (xl_btree_vacuum *) XLogRecGetData(record);
/*
+ * This section of code is thought to be no longer needed, after analysis
+ * of the calling paths. It is retained to allow the code to be reinstated
+ * if a flaw is revealed in that thinking.
+ *
+ * If we are running non-MVCC scans using this index we need to do some
+ * additional work to ensure correctness, which is known as a "pin scan"
+ * described in more detail in the next paragraphs. We used to do the extra
+ * work in all cases, whereas we now avoid that work in most cases. If
+ * lastBlockVacuumed is set to InvalidBlockNumber then we skip the
+ * additional work required for the pin scan.
+ *
+ * Avoiding this extra work is important since it requires us to touch
+ * every page in the index, so is an O(N) operation. Worse, it is an
+ * operation performed in the foreground during redo, so it delays
+ * replication directly.
+ *
* If queries might be active then we need to ensure every leaf page is
* unpinned between the lastBlockVacuumed and the current block, if there
* are any. This prevents replay of the VACUUM from reaching the stage of
@@ -412,7 +429,7 @@ btree_xlog_vacuum(XLogReaderState *record)
* isn't yet consistent; so we need not fear reading still-corrupt blocks
* here during crash recovery.
*/
- if (HotStandbyActiveInReplay())
+ if (HotStandbyActiveInReplay() && BlockNumberIsValid(xlrec->lastBlockVacuumed))
{
RelFileNode thisrnode;
BlockNumber thisblkno;
@@ -433,7 +450,8 @@ btree_xlog_vacuum(XLogReaderState *record)
* XXX we don't actually need to read the block, we just need to
* confirm it is unpinned. If we had a special call into the
* buffer manager we could optimise this so that if the block is
- * not in shared_buffers we confirm it as unpinned.
+ * not in shared_buffers we confirm it as unpinned. Optimizing
+ * this is now moot, since in most cases we avoid the scan.
*/
buffer = XLogReadBufferExtended(thisrnode, MAIN_FORKNUM, blkno,
RBM_NORMAL_NO_LOG);
@@ -444,6 +462,7 @@ btree_xlog_vacuum(XLogReaderState *record)
}
}
}
+#endif
/*
* Like in btvacuumpage(), we need to take a cleanup lock on every leaf
diff --git a/src/include/access/nbtree.h b/src/include/access/nbtree.h
index 1b0c649..42928d4 100644
--- a/src/include/access/nbtree.h
+++ b/src/include/access/nbtree.h
@@ -331,8 +331,10 @@ typedef struct xl_btree_reuse_page
* The WAL record can represent deletion of any number of index tuples on a
* single index page when executed by VACUUM.
*
- * The correctness requirement for applying these changes during recovery is
- * that we must do one of these two things for every block in the index:
+ * For MVCC scans, lastBlockVacuumed will be set to InvalidBlockNumber.
+ * For non-MVCC index scans there is an additional correctness requirement
+ * for applying these changes during recovery, which is that we must do one
+ * of these two things for every block in the index:
* * lock the block for cleanup and apply any required changes
* * EnsureBlockUnpinned()
* The purpose of this is to ensure that no index scans started before we
--
2.1.4
Alvaro Herrera wrote:
I am ready now to backpatch this to 9.4 and 9.5; here are the patches.
They are pretty similar, but some adjustments were needed due to XLog
format changes in 9.5. (I kept most of Simon's original commit
message.)
Finally done.
--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers