BUG #17327: Postgres server does not correctly emit error for max_slot_wal_keep_size being breached

Started by PG Bug reporting formover 4 years ago7 messagesbugs
Jump to latest
#1PG Bug reporting form
noreply@postgresql.org

The following bug has been logged on the website:

Bug reference: 17327
Logged by: Alex E
Email address: alex@altmetric.com
PostgreSQL version: 13.5
Operating system: Ubuntu 18.04
Description:

We have recently run into a situation where our pg_basebackup-based backups
started failing unexpectedly. These use WAL streaming to keep up with
changes (which uses a temporary replication slot server side).

The only errors logged on the client side were as listed below:

pg_basebackup: error: could not receive data from WAL stream: SSL connection
has been closed unexpectedly
pg_basebackup: error: could not read COPY data: server closed the connection
unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
pg_basebackup: removing contents of data directory "/backups/some/path/"

whilst on the server side we only got:

2021-12-03 16:21:54 UTC [29724-2647] LOG: terminating process 42601 to
release replication slot "pg_basebackup_42601"
2021-12-03 16:21:54 UTC [42601-1] replicator@[unknown] FATAL: terminating
connection due to administrator command
2021-12-03 16:21:54 UTC [42601-2] replicator@[unknown] STATEMENT:
START_REPLICATION SLOT "pg_basebackup_42601" 4721F/45000000 TIMELINE 3

The above was very unhelpful as it made us believe we might be dealing with
either a network interruption or another type of mysterious hardware
error.

We then proceeded to try several things to try and determine the root cause
of the problem and eventually realized (by trial and error and monitoring
various statistics) that we were breaching our max_slot_wal_keep_size limit
for the temporary replication slot whilst taking the pg_basebackup. The only
way we realized this was by using a permanent physical replication slot to
take the backup instead of a temporary one, and when doing this a relevant
error related to max_slot_wal_keep_size being breached was issued.

The core issue here then in our opinion is that Postgres server should log
an error when the max_slot_wal_keep_size limit is reached for temporary
replication slots as well as for permanent ones as otherwise
users/administrators are presented only with non-descript connection
termination errors which do not point to the actual cause of the problem.

#2Masahiko Sawada
sawada.mshk@gmail.com
In reply to: PG Bug reporting form (#1)
Re: BUG #17327: Postgres server does not correctly emit error for max_slot_wal_keep_size being breached

On Wed, Dec 8, 2021 at 9:58 PM PG Bug reporting form
<noreply@postgresql.org> wrote:

The following bug has been logged on the website:

Bug reference: 17327
Logged by: Alex E
Email address: alex@altmetric.com
PostgreSQL version: 13.5
Operating system: Ubuntu 18.04
Description:

We have recently run into a situation where our pg_basebackup-based backups
started failing unexpectedly. These use WAL streaming to keep up with
changes (which uses a temporary replication slot server side).

The only errors logged on the client side were as listed below:

pg_basebackup: error: could not receive data from WAL stream: SSL connection
has been closed unexpectedly
pg_basebackup: error: could not read COPY data: server closed the connection
unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
pg_basebackup: removing contents of data directory "/backups/some/path/"

whilst on the server side we only got:

2021-12-03 16:21:54 UTC [29724-2647] LOG: terminating process 42601 to
release replication slot "pg_basebackup_42601"
2021-12-03 16:21:54 UTC [42601-1] replicator@[unknown] FATAL: terminating
connection due to administrator command
2021-12-03 16:21:54 UTC [42601-2] replicator@[unknown] STATEMENT:
START_REPLICATION SLOT "pg_basebackup_42601" 4721F/45000000 TIMELINE 3

The above was very unhelpful as it made us believe we might be dealing with
either a network interruption or another type of mysterious hardware
error.

We then proceeded to try several things to try and determine the root cause
of the problem and eventually realized (by trial and error and monitoring
various statistics) that we were breaching our max_slot_wal_keep_size limit
for the temporary replication slot whilst taking the pg_basebackup. The only
way we realized this was by using a permanent physical replication slot to
take the backup instead of a temporary one, and when doing this a relevant
error related to max_slot_wal_keep_size being breached was issued.

The core issue here then in our opinion is that Postgres server should log
an error when the max_slot_wal_keep_size limit is reached for temporary
replication slots as well as for permanent ones as otherwise
users/administrators are presented only with non-descript connection
termination errors which do not point to the actual cause of the problem.

IIUC the LOG message you got, "terminating process 42601 to release
replication slot "pg_basebackup_42601"", is the message you want. That
is, it's emitted when the replication slot reached
max_slot_wal_keep_size and was invalidated. I don't think such
behavior varies depending on persistent slots and temporary slots. If
the message "terminating process XXXX to release replication slot
"YYY"" is not clear to indicate that the replication slot is
invalidated, probably we can improve the message or add details.

Regards,

--
Masahiko Sawada
EDB: https://www.enterprisedb.com/

#3Alex Enachioaie
alex@altmetric.com
In reply to: Masahiko Sawada (#2)
Re: BUG #17327: Postgres server does not correctly emit error for max_slot_wal_keep_size being breached

Hello Masahiko,

I believe that's inaccurate as the way we actually tracked down the
root cause in our situation was that we tried using a persistent slot to
check if that would make any difference to the process and although
that failed as well (for the same reason) we did get an additional
error logged in that case. This specifically stated that the
max_slot_wal_keep_size limit had been reached. So it seems like the
behavior
does differ between the two unless I'm missing something.

Please see relevant server-side logs from running the exact same
pg_basebackup command with a persistent replication slot below:

2021-12-03 00:57:16 UTC [29724-2645] LOG: terminating process 3899 to
release replication slot "backup"
2021-12-03 00:57:16 UTC [3899-1] replicator@[unknown] FATAL:
terminating connection due to administrator command
2021-12-03 00:57:16 UTC [3899-2] replicator@[unknown] STATEMENT:
START_REPLICATION SLOT "backup" 47189/75000000 TIMELINE 3
2021-12-03 00:57:16 UTC [29724-2646] LOG: invalidating slot "backup"
because its restart_lsn 47198/1E000000 exceeds max_slot_wal_keep_size

So, essentially the server side log emmitted on a temporary replication
breaching max_slot_wal_keep_size limit is only:

2021-12-03 16:21:54 UTC [29724-2647] LOG: terminating process 42601 to
release replication slot "pg_basebackup_42601"

whereas for a persistent replication slot we get an additional line
that clearly states _why_ the replication process was terminated:

2021-12-03 00:57:16 UTC [29724-2645] LOG: terminating process 3899 to
release replication slot "backup"
2021-12-03 00:57:16 UTC [29724-2646] LOG: invalidating slot "backup"
because its restart_lsn 47198/1E000000 exceeds max_slot_wal_keep_size

I'm not sure if this means that in the case of a temporary slot it does
not get invalidated at all (I've not looked at the code), or it's
simply that we don't emit a log message when it does because the slot
would be discarded anyway, but such a message would be very useful for
diagnostic purposes imo.

Thank you for looking at this

Kind regards,

Alex Enachioaie
Senior Site Reliability Engineer
Altmetric

On Fri, Dec 10 2021 at 17:41:56 +0900, Masahiko Sawada
<sawada.mshk@gmail.com> wrote:

Show quoted text

On Wed, Dec 8, 2021 at 9:58 PM PG Bug reporting form
<noreply@postgresql.org <mailto:noreply@postgresql.org>> wrote:

The following bug has been logged on the website:

Bug reference: 17327
Logged by: Alex E
Email address: alex@altmetric.com <mailto:alex@altmetric.com>
PostgreSQL version: 13.5
Operating system: Ubuntu 18.04
Description:

We have recently run into a situation where our pg_basebackup-based
backups
started failing unexpectedly. These use WAL streaming to keep up
with
changes (which uses a temporary replication slot server side).

The only errors logged on the client side were as listed below:

pg_basebackup: error: could not receive data from WAL stream: SSL
connection
has been closed unexpectedly
pg_basebackup: error: could not read COPY data: server closed the
connection
unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
pg_basebackup: removing contents of data directory
"/backups/some/path/"

whilst on the server side we only got:

2021-12-03 16:21:54 UTC [29724-2647] LOG: terminating process
42601 to
release replication slot "pg_basebackup_42601"
2021-12-03 16:21:54 UTC [42601-1] replicator@[unknown] FATAL:
terminating
connection due to administrator command
2021-12-03 16:21:54 UTC [42601-2] replicator@[unknown] STATEMENT:
START_REPLICATION SLOT "pg_basebackup_42601" 4721F/45000000
TIMELINE 3

The above was very unhelpful as it made us believe we might be
dealing with
either a network interruption or another type of mysterious hardware
error.

We then proceeded to try several things to try and determine the
root cause
of the problem and eventually realized (by trial and error and
monitoring
various statistics) that we were breaching our
max_slot_wal_keep_size limit
for the temporary replication slot whilst taking the pg_basebackup.
The only
way we realized this was by using a permanent physical replication
slot to
take the backup instead of a temporary one, and when doing this a
relevant
error related to max_slot_wal_keep_size being breached was issued.

The core issue here then in our opinion is that Postgres server
should log
an error when the max_slot_wal_keep_size limit is reached for
temporary
replication slots as well as for permanent ones as otherwise
users/administrators are presented only with non-descript connection
termination errors which do not point to the actual cause of the
problem.

IIUC the LOG message you got, "terminating process 42601 to release
replication slot "pg_basebackup_42601"", is the message you want. That
is, it's emitted when the replication slot reached
max_slot_wal_keep_size and was invalidated. I don't think such
behavior varies depending on persistent slots and temporary slots. If
the message "terminating process XXXX to release replication slot
"YYY"" is not clear to indicate that the replication slot is
invalidated, probably we can improve the message or add details.

Regards,

--
Masahiko Sawada
EDB: <https://www.enterprisedb.com/&gt;

#4Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: PG Bug reporting form (#1)
Re: BUG #17327: Postgres server does not correctly emit error for max_slot_wal_keep_size being breached

At Fri, 10 Dec 2021 15:46:11 +0000, Alex Enachioaie <alex@altmetric.com> wrote in

So, essentially the server side log emmitted on a temporary
replication breaching max_slot_wal_keep_size limit is only:

2021-12-03 16:21:54 UTC [29724-2647] LOG: terminating process 42601 to
release replication slot "pg_basebackup_42601"

whereas for a persistent replication slot we get an additional line
that clearly states _why_ the replication process was terminated:

2021-12-03 00:57:16 UTC [29724-2645] LOG: terminating process 3899 to
release replication slot "backup"
2021-12-03 00:57:16 UTC [29724-2646] LOG: invalidating slot "backup"
because its restart_lsn 47198/1E000000 exceeds max_slot_wal_keep_size

I'm not sure if this means that in the case of a temporary slot it
does not get invalidated at all (I've not looked at the code), or it's
simply that we don't emit a log message when it does because the slot
would be discarded anyway, but such a message would be very useful for
diagnostic purposes imo.

The "invalidating slot" message is emitted when the slot needs to be
invalidated, that is, when the slot persists after the user process is
terminated. Thus the message cannot be seen for temporary slots since
they are removed at process termination and no longer exist after
that.

At Wed, 08 Dec 2021 11:23:35 +0000, PG Bug reporting form <noreply@postgresql.org> wrote in

The core issue here then in our opinion is that Postgres server should log
an error when the max_slot_wal_keep_size limit is reached for temporary
replication slots as well as for permanent ones as otherwise
users/administrators are presented only with non-descript connection
termination errors which do not point to the actual cause of the problem.

If you mean the "invalidating slot" message by "an error", that
wouldn't happen since invalidation is actually doesn't happen. Or, we
could change the message like this. Does this make sense for you?

LOG: terminating process 42601 to release temporary replication slot "pg_basebackup_42601"
DETAIL: The slot will be dropped by the process termination.

LOG: terminating process 3899 to release persistent replication slot "backup"

...

LOG: invalidating slot "backup" because its restart_lsn 47198/1E000000 exceeds max_slot_wal_keep_size

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5Alex Enachioaie
alex@altmetric.com
In reply to: Kyotaro Horiguchi (#4)
Re: BUG #17327: Postgres server does not correctly emit error for max_slot_wal_keep_size being breached

Hello Kyotaro,

Understood, that makes sense re: invalidation and what I assumed might
be happening.

I think I'm happy to leave the method of resolution up to you, I think
the main point for me would be that when a
replication process gets terminated as a consequence of the underlying
temporary replication slot reaching max_slot_wal_keep_size
that we log a specific message to indicate to the user the cause of the
termination rather than leave it ambiguous.

Thank you

King regards

Alex E
Senior Site Reliability Engineer
Altmetric

On Mon, Dec 13 2021 at 14:44:42 +0900, Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Show quoted text

At Fri, 10 Dec 2021 15:46:11 +0000, Alex Enachioaie
<alex@altmetric.com <mailto:alex@altmetric.com>> wrote in

So, essentially the server side log emmitted on a temporary
replication breaching max_slot_wal_keep_size limit is only:

2021-12-03 16:21:54 UTC [29724-2647] LOG: terminating process 42601
to
release replication slot "pg_basebackup_42601"

whereas for a persistent replication slot we get an additional line
that clearly states _why_ the replication process was terminated:

2021-12-03 00:57:16 UTC [29724-2645] LOG: terminating process 3899
to
release replication slot "backup"
2021-12-03 00:57:16 UTC [29724-2646] LOG: invalidating slot "backup"
because its restart_lsn 47198/1E000000 exceeds
max_slot_wal_keep_size

I'm not sure if this means that in the case of a temporary slot it
does not get invalidated at all (I've not looked at the code), or
it's
simply that we don't emit a log message when it does because the
slot
would be discarded anyway, but such a message would be very useful
for
diagnostic purposes imo.

The "invalidating slot" message is emitted when the slot needs to be
invalidated, that is, when the slot persists after the user process is
terminated. Thus the message cannot be seen for temporary slots since
they are removed at process termination and no longer exist after
that.

At Wed, 08 Dec 2021 11:23:35 +0000, PG Bug reporting form
<noreply@postgresql.org <mailto:noreply@postgresql.org>> wrote in

The core issue here then in our opinion is that Postgres server
should log
an error when the max_slot_wal_keep_size limit is reached for
temporary
replication slots as well as for permanent ones as otherwise
users/administrators are presented only with non-descript connection
termination errors which do not point to the actual cause of the
problem.

If you mean the "invalidating slot" message by "an error", that
wouldn't happen since invalidation is actually doesn't happen. Or, we
could change the message like this. Does this make sense for you?

LOG: terminating process 42601 to release temporary replication
slot "pg_basebackup_42601"
DETAIL: The slot will be dropped by the process termination.

LOG: terminating process 3899 to release persistent replication
slot "backup"

...

LOG: invalidating slot "backup" because its restart_lsn
47198/1E000000 exceeds max_slot_wal_keep_size

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#6Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Alex Enachioaie (#5)
Re: BUG #17327: Postgres server does not correctly emit error for max_slot_wal_keep_size being breached

Hello, Alex.

At Mon, 13 Dec 2021 09:45:51 +0000, Alex Enachioaie <alex@altmetric.com> wrote in

I think I'm happy to leave the method of resolution up to you, I think
the main point for me would be that when a
replication process gets terminated as a consequence of the underlying
temporary replication slot reaching max_slot_wal_keep_size
that we log a specific message to indicate to the user the cause of
the termination rather than leave it ambiguous.

Ah.. I think I got your point. I thought that the message
"terminating process to release replication slot X" is showing the
cause for process termination but actually it doesn't mention the
reason for releasing replication slot. For persistent slots the
following message "invalidating slot" does that but that message is
absent for temporary slots. However, in that sense, the root reason
is not clearly stated in any case.

So I'm going to propose something like the following.

LOG: terminating process nnnnn to release replication slot "slot"
+ DETAIL: The WAL files retained by the slot is going to exceed max_slot_wal_keep_size.

Thanks!

--
Kyotaro Horiguchi
NTT Open Source Software Center

#7Alex Enachioaie
alex@altmetric.com
In reply to: Kyotaro Horiguchi (#6)
Re: BUG #17327: Postgres server does not correctly emit error for max_slot_wal_keep_size being breached

Hello Kyotaro,

Exactly and that sounds like an excellent solution.

Thanks!

Alex Enachioaie
Senior Site Reliability Engineer
Altmetric

On Tue, Dec 14 2021 at 10:11:37 +0900, Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Show quoted text

Hello, Alex.

At Mon, 13 Dec 2021 09:45:51 +0000, Alex Enachioaie
<alex@altmetric.com <mailto:alex@altmetric.com>> wrote in

I think I'm happy to leave the method of resolution up to you, I
think
the main point for me would be that when a
replication process gets terminated as a consequence of the
underlying
temporary replication slot reaching max_slot_wal_keep_size
that we log a specific message to indicate to the user the cause of
the termination rather than leave it ambiguous.

Ah.. I think I got your point. I thought that the message
"terminating process to release replication slot X" is showing the
cause for process termination but actually it doesn't mention the
reason for releasing replication slot. For persistent slots the
following message "invalidating slot" does that but that message is
absent for temporary slots. However, in that sense, the root reason
is not clearly stated in any case.

So I'm going to propose something like the following.

LOG: terminating process nnnnn to release replication slot "slot"
+ DETAIL: The WAL files retained by the slot is going to exceed
max_slot_wal_keep_size.

Thanks!

--
Kyotaro Horiguchi
NTT Open Source Software Center