pgsql: Allow users to limit storage reserved by replication slots
Allow users to limit storage reserved by replication slots
Replication slots are useful to retain data that may be needed by a
replication system. But experience has shown that allowing them to
retain excessive data can lead to the primary failing because of running
out of space. This new feature allows the user to configure a maximum
amount of space to be reserved using the new option
max_slot_wal_keep_size. Slots that overrun that space are invalidated
at checkpoint time, enabling the storage to be released.
Author: Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp>
Reviewed-by: Masahiko Sawada <sawada.mshk@gmail.com>
Reviewed-by: Jehan-Guillaume de Rorthais <jgdr@dalibo.com>
Reviewed-by: Álvaro Herrera <alvherre@alvh.no-ip.org>
Discussion: /messages/by-id/20170228.122736.123383594.horiguchi.kyotaro@lab.ntt.co.jp
Branch
------
master
Details
-------
https://git.postgresql.org/pg/commitdiff/c6550776394e25c1620bc8258427c8f1d448080d
Modified Files
--------------
doc/src/sgml/catalogs.sgml | 38 +++++
doc/src/sgml/config.sgml | 23 +++
doc/src/sgml/high-availability.sgml | 8 +-
src/backend/access/transam/xlog.c | 145 ++++++++++++++---
src/backend/catalog/system_views.sql | 4 +-
src/backend/replication/logical/logicalfuncs.c | 2 +-
src/backend/replication/slot.c | 100 +++++++++++-
src/backend/replication/slotfuncs.c | 44 ++++-
src/backend/replication/walsender.c | 4 +-
src/backend/utils/misc/guc.c | 13 ++
src/backend/utils/misc/postgresql.conf.sample | 1 +
src/include/access/xlog.h | 14 ++
src/include/catalog/catversion.h | 2 +-
src/include/catalog/pg_proc.dat | 6 +-
src/include/replication/slot.h | 11 +-
src/test/recovery/t/019_replslot_limit.pl | 217 +++++++++++++++++++++++++
src/test/regress/expected/rules.out | 6 +-
17 files changed, 595 insertions(+), 43 deletions(-)
On 2020-Apr-07, Alvaro Herrera wrote:
src/test/recovery/t/019_replslot_limit.pl | 217 +++++++++++++++++++++++++
I fixed the perlcritic complaint from buildfarm member crake, but
there's a new one in francolin:
# Failed test 'check that the slot state changes to "reserved"'
# at t/019_replslot_limit.pl line 125.
# got: '0/15000D8|reserved|216 bytes'
# expected: '0/1500000|reserved|216 bytes'
# Failed test 'check that the slot state changes to "lost"'
# at t/019_replslot_limit.pl line 135.
# got: '0/15000D8|lost|t'
# expected: '0/1500000|lost|t'
# Looks like you failed 2 tests of 13.
[23:07:28] t/019_replslot_limit.pl ..............
where the Perl code is:
$start_lsn = $node_master->lsn('write');
$node_master->wait_for_catchup($node_standby, 'replay', $start_lsn);
$node_standby->stop;
# Advance WAL again without checkpoint, reducing remain by 6 MB.
advance_wal($node_master, 6);
# Slot gets into 'reserved' state
$result = $node_master->safe_psql('postgres', "SELECT restart_lsn, wal_status, pg_size_pretty(restart_lsn - min_safe_lsn) as remain FROM pg_replication_slots WHERE slot_name = 'rep1'");
is($result, "$start_lsn|reserved|216 bytes", 'check that the slot state changes to "reserved"');
0xD8 is 216, so this seems to be saying that the checkpoint record was
skipped by the restart_lsn. I'm not clear exactly why that happened ...
is this saying that a checkpoint occurred?
One easy fix would be to remove the "restart_lsn" output column from the
query, but do we lose test specificity? (I think the answer is no.)
However, even with that change, we're still testing that a checkpoint is
216 bytes ... in other words, whenever someone changes the definition of
struct CheckPoint, this test will fail. That seems unnecessary and
unfriendly. I'm not sure how to improve that without also removing that
column.
--
�lvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
I fixed the perlcritic complaint from buildfarm member crake, but
there's a new one in francolin:
Other buildfarm members are showing related-but-different failures.
I think this test is just plain unstable.
regards, tom lane
Hi,
On April 7, 2020 6:13:51 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
I fixed the perlcritic complaint from buildfarm member crake, but
there's a new one in francolin:Other buildfarm members are showing related-but-different failures.
I think this test is just plain unstable.
I have not looked at the source, but the error messages show LSNs and bytes. I can't really imagine how that could be made stable.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Tue, Apr 07, 2020 at 07:10:07PM -0700, Andres Freund wrote:
I have not looked at the source, but the error messages show LSNs
and bytes. I can't really imagine how that could be made stable.
Another bad news is that this is page-size dependent. What if you
removed pg_size_pretty() and replaced it with a condition that returns
a boolean status in the result itself?
--
Michael
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
However, even with that change, we're still testing that a checkpoint is
216 bytes ... in other words, whenever someone changes the definition of
struct CheckPoint, this test will fail. That seems unnecessary and
unfriendly. I'm not sure how to improve that without also removing that
column.
I read florican's results as showing that sizeof(CheckPoint) is already
different on 32-bit machines than 64-bit; it's repeatably getting this:
# Failed test 'check that the slot state changes to "reserved"'
# at t/019_replslot_limit.pl line 125.
# got: '0/15000C0|reserved|192 bytes'
# expected: '0/15000C0|reserved|216 bytes'
This test case was *not* well thought out.
regards, tom lane