pgsql: Allow users to limit storage reserved by replication slots

Started by Alvaro Herreraabout 6 years ago6 messageshackers
Jump to latest
#1Alvaro Herrera
alvherre@2ndquadrant.com

Allow users to limit storage reserved by replication slots

Replication slots are useful to retain data that may be needed by a
replication system. But experience has shown that allowing them to
retain excessive data can lead to the primary failing because of running
out of space. This new feature allows the user to configure a maximum
amount of space to be reserved using the new option
max_slot_wal_keep_size. Slots that overrun that space are invalidated
at checkpoint time, enabling the storage to be released.

Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: /messages/by-id/20170228.122736.123383594.horiguchi.kyotaro@lab.ntt.co.jp

Branch
------
master

Details
-------
https://git.postgresql.org/pg/commitdiff/c6550776394e25c1620bc8258427c8f1d448080d

Modified Files
--------------
doc/src/sgml/catalogs.sgml | 38 +++++
doc/src/sgml/config.sgml | 23 +++
doc/src/sgml/high-availability.sgml | 8 +-
src/backend/access/transam/xlog.c | 145 ++++++++++++++---
src/backend/catalog/system_views.sql | 4 +-
src/backend/replication/logical/logicalfuncs.c | 2 +-
src/backend/replication/slot.c | 100 +++++++++++-
src/backend/replication/slotfuncs.c | 44 ++++-
src/backend/replication/walsender.c | 4 +-
src/backend/utils/misc/guc.c | 13 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/xlog.h | 14 ++
src/include/catalog/catversion.h | 2 +-
src/include/catalog/pg_proc.dat | 6 +-
src/include/replication/slot.h | 11 +-
src/test/recovery/t/019_replslot_limit.pl | 217 +++++++++++++++++++++++++
src/test/regress/expected/rules.out | 6 +-
17 files changed, 595 insertions(+), 43 deletions(-)

#2Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Alvaro Herrera (#1)
Re: pgsql: Allow users to limit storage reserved by replication slots

On 2020-Apr-07, Alvaro Herrera wrote:

src/test/recovery/t/019_replslot_limit.pl | 217 +++++++++++++++++++++++++

I fixed the perlcritic complaint from buildfarm member crake, but
there's a new one in francolin:

# Failed test 'check that the slot state changes to "reserved"'
# at t/019_replslot_limit.pl line 125.
# got: '0/15000D8|reserved|216 bytes'
# expected: '0/1500000|reserved|216 bytes'

# Failed test 'check that the slot state changes to "lost"'
# at t/019_replslot_limit.pl line 135.
# got: '0/15000D8|lost|t'
# expected: '0/1500000|lost|t'
# Looks like you failed 2 tests of 13.
[23:07:28] t/019_replslot_limit.pl ..............

where the Perl code is:

$start_lsn = $node_master->lsn('write');
$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn);
$node_standby->stop;

# Advance WAL again without checkpoint, reducing remain by 6 MB.
advance_wal($node_master, 6);

# Slot gets into 'reserved' state
$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(restart_lsn - min_safe_lsn) as remain FROM pg_replication_slots WHERE slot_name = 'rep1'");
is($result, "$start_lsn|reserved|216 bytes", 'check that the slot state changes to "reserved"');

0xD8 is 216, so this seems to be saying that the checkpoint record was
skipped by the restart_lsn. I'm not clear exactly why that happened ...
is this saying that a checkpoint occurred?

One easy fix would be to remove the "restart_lsn" output column from the
query, but do we lose test specificity? (I think the answer is no.)

However, even with that change, we're still testing that a checkpoint is
216 bytes ... in other words, whenever someone changes the definition of
struct CheckPoint, this test will fail. That seems unnecessary and
unfriendly. I'm not sure how to improve that without also removing that
column.

--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

#3Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#2)
Re: pgsql: Allow users to limit storage reserved by replication slots

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I fixed the perlcritic complaint from buildfarm member crake, but
there's a new one in francolin:

Other buildfarm members are showing related-but-different failures.
I think this test is just plain unstable.

regards, tom lane

#4Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#3)
Re: pgsql: Allow users to limit storage reserved by replication slots

Hi,

On April 7, 2020 6:13:51 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

I fixed the perlcritic complaint from buildfarm member crake, but
there's a new one in francolin:

Other buildfarm members are showing related-but-different failures.
I think this test is just plain unstable.

I have not looked at the source, but the error messages show LSNs and bytes. I can't really imagine how that could be made stable.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#5Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#4)
Re: pgsql: Allow users to limit storage reserved by replication slots

On Tue, Apr 07, 2020 at 07:10:07PM -0700, Andres Freund wrote:

I have not looked at the source, but the error messages show LSNs
and bytes. I can't really imagine how that could be made stable.

Another bad news is that this is page-size dependent. What if you
removed pg_size_pretty() and replaced it with a condition that returns
a boolean status in the result itself?
--
Michael

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Alvaro Herrera (#2)
Re: pgsql: Allow users to limit storage reserved by replication slots

Alvaro Herrera <alvherre@2ndquadrant.com> writes:

However, even with that change, we're still testing that a checkpoint is
216 bytes ... in other words, whenever someone changes the definition of
struct CheckPoint, this test will fail. That seems unnecessary and
unfriendly. I'm not sure how to improve that without also removing that
column.

I read florican's results as showing that sizeof(CheckPoint) is already
different on 32-bit machines than 64-bit; it's repeatably getting this:

# Failed test 'check that the slot state changes to "reserved"'
# at t/019_replslot_limit.pl line 125.
# got: '0/15000C0|reserved|192 bytes'
# expected: '0/15000C0|reserved|216 bytes'

This test case was *not* well thought out.

regards, tom lane