pgsql: Avoid pin scan for replay of XLOG_BTREE_VACUUM

Started by Simon Riggsabout 10 years ago9 messages
#1Simon Riggs
simon@2ndQuadrant.com

Avoid pin scan for replay of XLOG_BTREE_VACUUM
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require
complex interlocking that matched the requirements on the master. This required
an O(N) operation that became a significant problem with large indexes, causing
replication delays of seconds or in some cases minutes while the
XLOG_BTREE_VACUUM was replayed.

This commit skips the “pin scan” that was previously required, by observing in
detail when and how it is safe to do so, with full documentation. The pin scan
is skipped only in replay; the VACUUM code path on master is not touched here.

The current commit still performs the pin scan for toast indexes, though this
can also be avoided if we recheck scans on toast indexes. Later patch will
address this.

No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.

Branch
------
master

Details
-------
http://git.postgresql.org/pg/commitdiff/687f2cd7a0150647794efe432ae0397cb41b60ff

Modified Files
--------------
src/backend/access/nbtree/README | 24 ++++++++++++++++++++++++
src/backend/access/nbtree/nbtree.c | 23 ++++++++++++++++++++++-
src/backend/access/nbtree/nbtxlog.c | 18 ++++++++++++++++--
src/backend/access/rmgrdesc/nbtdesc.c | 2 +-
src/include/access/nbtree.h | 6 ++++--
5 files changed, 67 insertions(+), 6 deletions(-)

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

#2Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#1)
Re: pgsql: Avoid pin scan for replay of XLOG_BTREE_VACUUM

Hi,

On 2016-01-09 10:13:27 +0000, Simon Riggs wrote:

src/backend/access/rmgrdesc/nbtdesc.c | 2 +-

I've not reviewed the patch, but a very quick glance during a rebase
with conflicts showed:

@@ -48,7 +48,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;

-               appendStringInfo(buf, "lastBlockVacuumed %u",
+               appendStringInfo(buf, "lastBlockVacuumed %d",
                                 xlrec->lastBlockVacuumed);
                break;
            }

which doesn't look right?

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

#3Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#2)
Re: pgsql: Avoid pin scan for replay of XLOG_BTREE_VACUUM

On 9 January 2016 at 12:23, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2016-01-09 10:13:27 +0000, Simon Riggs wrote:

src/backend/access/rmgrdesc/nbtdesc.c | 2 +-

I've not reviewed the patch, but a very quick glance during a rebase
with conflicts showed:

@@ -48,7 +48,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;

-               appendStringInfo(buf, "lastBlockVacuumed %u",
+               appendStringInfo(buf, "lastBlockVacuumed %d",
xlrec->lastBlockVacuumed);
break;
}

which doesn't look right?

It's right. New value of -1 allowed in that field, so change required to
allow it to display properly for debug.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#4Andres Freund
andres@anarazel.de
In reply to: Simon Riggs (#3)
Re: pgsql: Avoid pin scan for replay of XLOG_BTREE_VACUUM

On 2016-01-09 17:58:01 +0000, Simon Riggs wrote:

On 9 January 2016 at 12:23, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2016-01-09 10:13:27 +0000, Simon Riggs wrote:

src/backend/access/rmgrdesc/nbtdesc.c | 2 +-

I've not reviewed the patch, but a very quick glance during a rebase
with conflicts showed:

@@ -48,7 +48,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;

-               appendStringInfo(buf, "lastBlockVacuumed %u",
+               appendStringInfo(buf, "lastBlockVacuumed %d",
xlrec->lastBlockVacuumed);
break;
}

which doesn't look right?

It's right. New value of -1 allowed in that field, so change required to
allow it to display properly for debug.

Uh. xl_btree_vacuum->lastBlockVacuumed is of type BlockNumber, which in
turn is of type uint32. So no, this isn't correct as is.

--
Sent via pgsql-committers mailing list (pgsql-committers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-committers

#5Simon Riggs
simon@2ndQuadrant.com
In reply to: Andres Freund (#4)
Re: pgsql: Avoid pin scan for replay of XLOG_BTREE_VACUUM

On 9 January 2016 at 18:08, Andres Freund <andres@anarazel.de> wrote:

On 2016-01-09 17:58:01 +0000, Simon Riggs wrote:

On 9 January 2016 at 12:23, Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2016-01-09 10:13:27 +0000, Simon Riggs wrote:

src/backend/access/rmgrdesc/nbtdesc.c | 2 +-

I've not reviewed the patch, but a very quick glance during a rebase
with conflicts showed:

@@ -48,7 +48,7 @@ btree_desc(StringInfo buf, XLogReaderState *record)
{
xl_btree_vacuum *xlrec = (xl_btree_vacuum *) rec;

-               appendStringInfo(buf, "lastBlockVacuumed %u",
+               appendStringInfo(buf, "lastBlockVacuumed %d",
xlrec->lastBlockVacuumed);
break;
}

which doesn't look right?

It's right. New value of -1 allowed in that field, so change required to
allow it to display properly for debug.

Uh. xl_btree_vacuum->lastBlockVacuumed is of type BlockNumber, which in
turn is of type uint32. So no, this isn't correct as is.

OK, agreed. Will fix.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#6Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#1)
Re: [COMMITTERS] pgsql: Avoid pin scan for replay of XLOG_BTREE_VACUUM

On Sat, Jan 9, 2016 at 5:13 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Avoid pin scan for replay of XLOG_BTREE_VACUUM
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to require
complex interlocking that matched the requirements on the master. This required
an O(N) operation that became a significant problem with large indexes, causing
replication delays of seconds or in some cases minutes while the
XLOG_BTREE_VACUUM was replayed.

This commit skips the “pin scan” that was previously required, by observing in
detail when and how it is safe to do so, with full documentation. The pin scan
is skipped only in replay; the VACUUM code path on master is not touched here.

The current commit still performs the pin scan for toast indexes, though this
can also be avoided if we recheck scans on toast indexes. Later patch will
address this.

No tests included. Manual tests using an additional patch to view WAL records
and their timing have shown the change in WAL records and their handling has
successfully reduced replication delay.

I suspect I might be missing something here, but I don't see how a
test against rel->rd_rel->relnamespace can work during recovery.
Won't the relcache entry we're looking at here be one created by
CreateFakeRelcacheEntry(), and thus that field won't be valid?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#6)
Re: [COMMITTERS] pgsql: Avoid pin scan for replay of XLOG_BTREE_VACUUM

On 10 January 2016 at 16:32, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Jan 9, 2016 at 5:13 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Avoid pin scan for replay of XLOG_BTREE_VACUUM
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to

require

complex interlocking that matched the requirements on the master. This

required

an O(N) operation that became a significant problem with large indexes,

causing

replication delays of seconds or in some cases minutes while the
XLOG_BTREE_VACUUM was replayed.

This commit skips the “pin scan” that was previously required, by

observing in

detail when and how it is safe to do so, with full documentation. The

pin scan

is skipped only in replay; the VACUUM code path on master is not touched

here.

The current commit still performs the pin scan for toast indexes, though

this

can also be avoided if we recheck scans on toast indexes. Later patch

will

address this.

No tests included. Manual tests using an additional patch to view WAL

records

and their timing have shown the change in WAL records and their handling

has

successfully reduced replication delay.

I suspect I might be missing something here, but I don't see how a
test against rel->rd_rel->relnamespace can work during recovery.
Won't the relcache entry we're looking at here be one created by
CreateFakeRelcacheEntry(), and thus that field won't be valid?

The test isn't made during recovery, its made on the master.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#8Robert Haas
robertmhaas@gmail.com
In reply to: Simon Riggs (#7)
Re: [COMMITTERS] pgsql: Avoid pin scan for replay of XLOG_BTREE_VACUUM

On Sun, Jan 10, 2016 at 3:50 PM, Simon Riggs <simon@2ndquadrant.com> wrote:

On 10 January 2016 at 16:32, Robert Haas <robertmhaas@gmail.com> wrote:

On Sat, Jan 9, 2016 at 5:13 AM, Simon Riggs <simon@2ndquadrant.com> wrote:

Avoid pin scan for replay of XLOG_BTREE_VACUUM
Replay of XLOG_BTREE_VACUUM during Hot Standby was previously thought to
require
complex interlocking that matched the requirements on the master. This
required
an O(N) operation that became a significant problem with large indexes,
causing
replication delays of seconds or in some cases minutes while the
XLOG_BTREE_VACUUM was replayed.

This commit skips the “pin scan” that was previously required, by
observing in
detail when and how it is safe to do so, with full documentation. The
pin scan
is skipped only in replay; the VACUUM code path on master is not touched
here.

The current commit still performs the pin scan for toast indexes, though
this
can also be avoided if we recheck scans on toast indexes. Later patch
will
address this.

No tests included. Manual tests using an additional patch to view WAL
records
and their timing have shown the change in WAL records and their handling
has
successfully reduced replication delay.

I suspect I might be missing something here, but I don't see how a
test against rel->rd_rel->relnamespace can work during recovery.
Won't the relcache entry we're looking at here be one created by
CreateFakeRelcacheEntry(), and thus that field won't be valid?

The test isn't made during recovery, its made on the master.

/me crawls into a hole.

Thanks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Simon Riggs
simon@2ndQuadrant.com
In reply to: Robert Haas (#8)
Re: [COMMITTERS] pgsql: Avoid pin scan for replay of XLOG_BTREE_VACUUM

On 11 January 2016 at 01:46, Robert Haas <robertmhaas@gmail.com> wrote:

/me crawls into a hole.

Thanks.

Far from it, thanks very much for thinking about it.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services