Restricting maximum keep segments by repslots

Started by Kyotaro Horiguchi, over 9 years ago, 143 messages, hackers

horikyota.ntt@gmail.com

over 9 years ago

Hello.

Although a replication slot is helpful for avoiding unwanted WAL
deletion, it can also cause a disastrous situation by retaining
WAL segments without limit. Removing the offending slot would
resolve the situation, but that is not possible while the standby
is active. We have to take rather complex and forcible steps to
relieve the situation, especially if it must be done
automatically. (In my case, specifically in an HA cluster.)

This patch adds a GUC that limits the number of segments that
replication slots can keep. When the limit is hit during a
checkpoint, a warning is shown and the segments older than the
limit are removed.

WARNING: restart LSN of replication slots is ignored by checkpoint
DETAIL: Some replication slots lose required WAL segments to continue.
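
The rule described above can be sketched roughly as follows. This is only an illustration of the idea, not the code in the patch: the function names, the parameter name max_keep_segments, the 16 MB segment size, and the convention that 0 means "no limit" are all assumptions made for the example.

```python
from typing import Optional, Tuple

# Assumed default WAL segment size (16 MB); real clusters may differ.
WAL_SEGMENT_SIZE = 16 * 1024 * 1024


def lsn_to_segno(lsn: int, seg_size: int = WAL_SEGMENT_SIZE) -> int:
    """Map a byte position in the WAL stream to its segment number."""
    return lsn // seg_size


def oldest_segment_to_keep(
    current_lsn: int,
    slot_restart_lsn: Optional[int],
    max_keep_segments: int,
    seg_size: int = WAL_SEGMENT_SIZE,
) -> Tuple[int, bool]:
    """Return (oldest segment number to keep, whether a slot loses WAL).

    max_keep_segments is a hypothetical setting standing in for the
    proposed GUC; 0 is treated here as "no limit".
    """
    current_seg = lsn_to_segno(current_lsn, seg_size)
    if slot_restart_lsn is None:
        # No slot is holding WAL back.
        slot_seg = current_seg
    else:
        slot_seg = min(lsn_to_segno(slot_restart_lsn, seg_size), current_seg)

    if max_keep_segments <= 0:
        return slot_seg, False

    limit_seg = max(current_seg - max_keep_segments, 0)
    if slot_seg < limit_seg:
        # The slot wants more than the limit allows: the checkpoint
        # would warn and remove the segments older than limit_seg.
        return limit_seg, True
    return slot_seg, False
```

For example, with the current position in segment 100, a slot stuck at segment 10 and a limit of 32 segments, the sketch keeps segments from 68 onward and reports that the slot has lost required WAL.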

Another measure would be automatic deletion or deactivation of
the culprit slot, but that seems too complex for the problem.

As we have already postponed some patches in the triage for the
last commit fest, this should perhaps be postponed to PG11.

regards,

--
Kyotaro Horiguchi
NTT Open Source Software Center

Michael Paquier

michael@paquier.xyz

over 9 years ago

In reply to: Kyotaro Horiguchi (#1)

Re: Restricting maximum keep segments by repslots

On Tue, Feb 28, 2017 at 12:27 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:

> Although a replication slot is helpful for avoiding unwanted WAL
> deletion, it can also cause a disastrous situation by retaining
> WAL segments without limit. Removing the offending slot would
> resolve the situation, but that is not possible while the standby
> is active. We have to take rather complex and forcible steps to
> relieve the situation, especially if it must be done
> automatically. (In my case, specifically in an HA cluster.)
>
> This patch adds a GUC that limits the number of segments that
> replication slots can keep. When the limit is hit during a
> checkpoint, a warning is shown and the segments older than the
> limit are removed.
>
> WARNING: restart LSN of replication slots is ignored by checkpoint
> DETAIL: Some replication slots lose required WAL segments to continue.
>
> Another measure would be automatic deletion or deactivation of
> the culprit slot, but that seems too complex for the problem.
>
> As we have already postponed some patches in the triage for the
> last commit fest, this should perhaps be postponed to PG11.

Please no. Replication slots are designed the current way because we
don't want to have to use something like wal_keep_segments, which is a
wart, and in my opinion this applies to replication slots as well. If
a slot is bloating WAL and you care about your Postgres instance, I
would instead recommend a background worker that monitors the
situation, based on max_wal_size for example: it kills the WAL sender
associated with the slot if something is connected but frozen or
unable to keep up with the pace of WAL generation, and then drops the
slot. You may also want to issue a checkpoint in this case to ensure
that segments get recycled. But anyway, if you reach this point of WAL
bloat, perhaps that's for the best, as users would know about it
because backups would be in danger. For some applications that is
acceptable, but you could always rely on monitoring slots and killing
them on sight if needed. That is also more flexible than having a
parameter that is basically just a synonym of max_wal_size.
--
Michael
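
The monitoring approach suggested in this reply could look roughly like the sketch below. It only plans the SQL statements and leaves execution to whatever client is in use. The SlotStatus type, the plan_actions function, and the retained_bytes figure are hypothetical names for this example; retained_bytes is assumed to have been computed elsewhere as the distance between the current WAL position and the slot's restart_lsn. pg_terminate_backend(), pg_drop_replication_slot() and CHECKPOINT are existing PostgreSQL facilities.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SlotStatus:
    """One row of monitoring data for a replication slot (hypothetical type)."""
    name: str
    active_pid: Optional[int]  # PID of the connected WAL sender, if any
    retained_bytes: int        # WAL held back by this slot, computed elsewhere


def plan_actions(slots: List[SlotStatus], max_retained_bytes: int) -> List[str]:
    """Return the SQL statements a monitor could run for bloating slots.

    max_retained_bytes is an assumed threshold, for instance derived
    from max_wal_size as the reply suggests.
    """
    actions: List[str] = []
    for slot in slots:
        if slot.retained_bytes <= max_retained_bytes:
            continue
        if slot.active_pid is not None:
            # Kill the WAL sender first; an active slot cannot be dropped.
            actions.append(f"SELECT pg_terminate_backend({slot.active_pid})")
        # Double any single quote so the name is a valid SQL literal.
        quoted = slot.name.replace("'", "''")
        actions.append(f"SELECT pg_drop_replication_slot('{quoted}')")
    if actions:
        # Ask for a checkpoint so the released segments get recycled.
        actions.append("CHECKPOINT")
    return actions
```

A real monitor would also have to wait for the terminated WAL sender to exit before dropping the slot, since the drop fails while the slot is still marked active.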
