Fix GetWALAvailability function code comments for WALAVAIL_REMOVED return value
Hi Hackers,
The current code comment says that the replication stream on a slot with
the given targetLSN can't continue after a restart but even without a
restart the stream cannot continue. The slot is invalidated and the
walsender process is terminated by the checkpoint process. Attaching a
small patch to fix the comment.
2022-10-19 06:26:22.387 UTC [144482] STATEMENT: START_REPLICATION SLOT
"s2" LOGICAL 0/0
2022-10-19 06:27:41.998 UTC [2553755] LOG: checkpoint starting: time
2022-10-19 06:28:04.974 UTC [2553755] LOG: terminating process 144482 to
release replication slot "s2"
2022-10-19 06:28:04.974 UTC [144482] FATAL: terminating connection due to
administrator command
2022-10-19 06:28:04.974 UTC [144482] CONTEXT: slot "s2", output plugin
"test_decoding", in the change callback, associated LSN 0/1E23AB68
2022-10-19 06:28:04.974 UTC [144482] STATEMENT: START_REPLICATION SLOT
"s2" LOGICAL 0/0
Thanks,
Sirisha
Attachments:
0001-Fix-GetWALAvailability-function-code-comments.patchapplication/octet-stream; name=0001-Fix-GetWALAvailability-function-code-comments.patchDownload
From 4e9a7d9874639c3af9646266eb19a85debe3c85a Mon Sep 17 00:00:00 2001
From: root <root@pgvm.rlsumirojk0etd4qpjbaa2afce.tx.internal.cloudapp.net>
Date: Wed, 19 Oct 2022 06:47:38 +0000
Subject: [PATCH] Fix GetWALAvailability function code comments.
When checkpoint remove the WAL required for the replication stream, it invalidates the slot and walsender is terminated.
---
src/backend/access/transam/xlog.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 81d339d57d..5300cdfcd2 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7363,7 +7363,7 @@ CreateRestartPoint(int flags)
* above.
*
* * WALAVAIL_REMOVED means it has been removed. A replication stream on
- * a slot with this LSN cannot continue after a restart.
+ * a slot with this LSN cannot continue.
*
* * WALAVAIL_INVALID_LSN means the slot hasn't been set to reserve WAL.
*/
--
2.25.1
On Wed, Oct 19, 2022 at 12:39 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:
Hi Hackers,
The current code comment says that the replication stream on a slot with the given targetLSN can't continue after a restart but even without a restart the stream cannot continue. The slot is invalidated and the walsender process is terminated by the checkpoint process. Attaching a small patch to fix the comment.
2022-10-19 06:26:22.387 UTC [144482] STATEMENT: START_REPLICATION SLOT "s2" LOGICAL 0/0
2022-10-19 06:27:41.998 UTC [2553755] LOG: checkpoint starting: time
2022-10-19 06:28:04.974 UTC [2553755] LOG: terminating process 144482 to release replication slot "s2"
2022-10-19 06:28:04.974 UTC [144482] FATAL: terminating connection due to administrator command
2022-10-19 06:28:04.974 UTC [144482] CONTEXT: slot "s2", output plugin "test_decoding", in the change callback, associated LSN 0/1E23AB68
2022-10-19 06:28:04.974 UTC [144482] STATEMENT: START_REPLICATION SLOT "s2" LOGICAL 0/0
I think the walsender/replication stream can still continue even
before the checkpointer signals it to terminate, there's an
illuminating comment (see [1]case WALAVAIL_REMOVED:) specifying when it can happen. It means
that the GetWALAvailability() can return WALAVAIL_REMOVED but the
checkpointer hasn't yet signalled/in the process of signalling the
walsender to terminate.
* * WALAVAIL_REMOVED means it has been removed. A replication stream on
* a slot with this LSN cannot continue after a restart.
The above existing comment, says that the slot isn't usable if
"someone" (either checkpoitner or walsender or entire server itself)
got restarted. It looks fine, no?
[1]: case WALAVAIL_REMOVED:
case WALAVAIL_REMOVED:
/*
* If we read the restart_lsn long enough ago, maybe that file
* has been removed by now. However, the walsender could have
* moved forward enough that it jumped to another file after
* we looked. If checkpointer signalled the process to
* termination, then it's definitely lost; but if a process is
* still alive, then "unreserved" seems more appropriate.
*
* If we do change it, save the state for safe_wal_size below.
*/
if (!XLogRecPtrIsInvalid(slot_contents.data.restart_lsn))
{
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
At Wed, 19 Oct 2022 13:06:08 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
On Wed, Oct 19, 2022 at 12:39 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:The current code comment says that the replication stream on a slot with the given targetLSN can't continue after a restart but even without a restart the stream cannot continue. The slot is invalidated and the walsender process is terminated by the checkpoint process. Attaching a small patch to fix the comment.
The description was correct at the early stage of the development of
max_slot_wal_keep_size_mb. At that time the affected processes
(walsenders) are not actively killed. On the other hand, the current
implement actively kills the affected walsenders before removing
segments. Thus considering the whole current mechanism of this
feature, WALAVAIL_REMOVED represents that "the segment has been
removed, along with the affected processeses having been already
killed, too".
In short, the proposed fix alone seems fine to me. If we want to show
further details, I would add a bit as follows.
| * * WALAVAIL_REMOVED means it has been removed. A replication stream on
| * a slot with this LSN cannot continue. Note that the affected
| * processes have been terminated by checkpointer, too.
I think the walsender/replication stream can still continue even
before the checkpointer signals it to terminate, there's an
illuminating comment (see [1]) specifying when it can happen. It means
It is a description about the possible advancement of restart_lsn
after the function reads it. So the point is a bit off the said
proposal.
that the GetWALAvailability() can return WALAVAIL_REMOVED but the
checkpointer hasn't yet signalled/in the process of signalling the
walsender to terminate.* * WALAVAIL_REMOVED means it has been removed. A replication stream on
* a slot with this LSN cannot continue after a restart.The above existing comment, says that the slot isn't usable if
"someone" (either checkpoitner or walsender or entire server itself)
got restarted. It looks fine, no?[1]
case WALAVAIL_REMOVED:/*
* If we read the restart_lsn long enough ago, maybe that file
* has been removed by now. However, the walsender could have
* moved forward enough that it jumped to another file after
* we looked. If checkpointer signalled the process to
* termination, then it's definitely lost; but if a process is
* still alive, then "unreserved" seems more appropriate.
*
* If we do change it, save the state for safe_wal_size below.
*/
if (!XLogRecPtrIsInvalid(slot_contents.data.restart_lsn))
{
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Wed, Oct 19, 2022 at 7:59 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:
At Wed, 19 Oct 2022 13:06:08 +0530, Bharath Rupireddy <
bharath.rupireddyforpostgres@gmail.com> wrote inOn Wed, Oct 19, 2022 at 12:39 PM sirisha chamarthi
<sirichamarthi22@gmail.com> wrote:The current code comment says that the replication stream on a slot
with the given targetLSN can't continue after a restart but even without a
restart the stream cannot continue. The slot is invalidated and the
walsender process is terminated by the checkpoint process. Attaching a
small patch to fix the comment.In short, the proposed fix alone seems fine to me. If we want to show
further details, I would add a bit as follows.| * * WALAVAIL_REMOVED means it has been removed. A replication stream on
| * a slot with this LSN cannot continue. Note that the affected
| * processes have been terminated by checkpointer, too.
Thanks for your comments! Attached the patch with your suggestions.
Thanks,
Sirisha
Attachments:
v2-0001-Fix-GetWALAvailability-function-code-comments.patchapplication/octet-stream; name=v2-0001-Fix-GetWALAvailability-function-code-comments.patchDownload
From 6a04a2ab692c7e91ad39964fce2652f71356f028 Mon Sep 17 00:00:00 2001
From: root <root@pgvm.rlsumirojk0etd4qpjbaa2afce.tx.internal.cloudapp.net>
Date: Thu, 20 Oct 2022 16:28:58 +0000
Subject: [PATCH] Fix GetWALAvailability function code comments
When checkpoint remove the WAL required for the replication stream, it invalidates the slot and walsender is terminated.
---
src/backend/access/transam/xlog.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 81d339d57d..4ac46de508 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -7363,7 +7363,8 @@ CreateRestartPoint(int flags)
* above.
*
* * WALAVAIL_REMOVED means it has been removed. A replication stream on
- * a slot with this LSN cannot continue after a restart.
+ * a slot with this LSN cannot continue. Note that the affected processes
+ * have been terminated by checkpointer, too.
*
* * WALAVAIL_INVALID_LSN means the slot hasn't been set to reserve WAL.
*/
--
2.25.1
sirisha chamarthi <sirichamarthi22@gmail.com> writes:
On Wed, Oct 19, 2022 at 7:59 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:In short, the proposed fix alone seems fine to me. If we want to show
further details, I would add a bit as follows.| * * WALAVAIL_REMOVED means it has been removed. A replication stream on
| * a slot with this LSN cannot continue. Note that the affected
| * processes have been terminated by checkpointer, too.
Thanks for your comments! Attached the patch with your suggestions.
Pushed with a bit of additional wordsmithing. I thought "have been"
was a bit too strong of an assertion considering that this function
does not pay any attention to the actual state of any processes,
so I made it say "should have been".
regards, tom lane
At Thu, 19 Jan 2023 18:43:52 -0500, Tom Lane <tgl@sss.pgh.pa.us> wrote in
sirisha chamarthi <sirichamarthi22@gmail.com> writes:
On Wed, Oct 19, 2022 at 7:59 PM Kyotaro Horiguchi <horikyota.ntt@gmail.com>
wrote:In short, the proposed fix alone seems fine to me. If we want to show
further details, I would add a bit as follows.| * * WALAVAIL_REMOVED means it has been removed. A replication stream on
| * a slot with this LSN cannot continue. Note that the affected
| * processes have been terminated by checkpointer, too.Thanks for your comments! Attached the patch with your suggestions.
Pushed with a bit of additional wordsmithing. I thought "have been"
Thanks!
was a bit too strong of an assertion considering that this function
does not pay any attention to the actual state of any processes,
so I made it say "should have been".
I think you're correct here.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center