pg_recvlogical prints bogus error when interrupted

Started by Andres Freundabout 3 years ago20 messages
#1Andres Freund
andres@anarazel.de

Hi,

While reviewing
/messages/by-id/CAD21AoBe2o2D=xyycsxw2bQOD=zPj7ETuJ5VYGN=dpoTiCMRJQ@mail.gmail.com
I noticed that pg_recvlogical prints
"pg_recvlogical: error: unexpected termination of replication stream: "

when signalled with SIGINT/TERM.

Oddly enough, that looks to have "always" been the case, even though clearly
the code tried to make provisions for a different outcome.

It looks to me like all that's needed is to gate the block printing the
message with an !time_to_abort.

I also then noticed that we don't fsync the output file in cases of errors -
that seems wrong to me? Looks to me like that block should be moved till after
the error:?

Greetings,

Andres Freund

#2Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#1)
1 attachment(s)
Re: pg_recvlogical prints bogus error when interrupted

On Thu, Oct 20, 2022 at 3:10 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

While reviewing
/messages/by-id/CAD21AoBe2o2D=xyycsxw2bQOD=zPj7ETuJ5VYGN=dpoTiCMRJQ@mail.gmail.com
I noticed that pg_recvlogical prints
"pg_recvlogical: error: unexpected termination of replication stream: "

when signalled with SIGINT/TERM.

Oddly enough, that looks to have "always" been the case, even though clearly
the code tried to make provisions for a different outcome.

It looks to me like all that's needed is to gate the block printing the
message with an !time_to_abort.

+1. How about emitting a message like its friend pg_receivewal, like
the attached patch?

I also then noticed that we don't fsync the output file in cases of errors -
that seems wrong to me? Looks to me like that block should be moved till after
the error:?

How about something like the attached patch?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-pg_recvlogical-fixes.patchapplication/octet-stream; name=v1-0001-pg_recvlogical-fixes.patchDownload
From bfbc76e6a85b40e3b9000e22d1c53d06e8799a2a Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 20 Oct 2022 07:57:04 +0000
Subject: [PATCH v1] pg_recvlogical fixes

---
 src/bin/pg_basebackup/pg_recvlogical.c | 17 ++++++++++++++++-
 1 file changed, 16 insertions(+), 1 deletion(-)

diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 5f2e6af445..4ce4ee648c 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -614,7 +614,14 @@ StreamLogicalLog(void)
 
 		res = PQgetResult(conn);
 	}
-	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+
+	if (time_to_abort)
+	{
+		if (verbose)
+			pg_log_info("received interrupt signal, exiting");
+		goto error;
+	}
+	else if (PQresultStatus(res) != PGRES_COMMAND_OK)
 	{
 		pg_log_error("unexpected termination of replication stream: %s",
 					 PQresultErrorMessage(res));
@@ -634,6 +641,14 @@ StreamLogicalLog(void)
 	}
 	outfd = -1;
 error:
+	if (outfd != -1)
+	{
+		TimestampTz t;
+
+		t = feGetCurrentTimestamp();
+		/* no need to jump to error on failure here, we're finishing anyway */
+		OutputFsync(t);
+	}
 	if (copybuf != NULL)
 	{
 		PQfreemem(copybuf);
-- 
2.34.1

#3Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Bharath Rupireddy (#2)
1 attachment(s)
Re: pg_recvlogical prints bogus error when interrupted

At Thu, 20 Oct 2022 13:28:45 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

On Thu, Oct 20, 2022 at 3:10 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

While reviewing
/messages/by-id/CAD21AoBe2o2D=xyycsxw2bQOD=zPj7ETuJ5VYGN=dpoTiCMRJQ@mail.gmail.com
I noticed that pg_recvlogical prints
"pg_recvlogical: error: unexpected termination of replication stream: "

when signalled with SIGINT/TERM.

Oddly enough, that looks to have "always" been the case, even though clearly
the code tried to make provisions for a different outcome.

It looks to me like all that's needed is to gate the block printing the
message with an !time_to_abort.

+1

+1. How about emitting a message like its friend pg_receivewal, like
the attached patch?

I'm not a fan of treating SIGINT as an error in this case. It calls
prepareToTerminate() when time_to_abort and everything goes fine after
then. So I think we should do the same thing after receiving an
interrupt. This also does file-sync naturally as a part of normal
shutdown. I'm also not a fan of doing fsync at error.

I also then noticed that we don't fsync the output file in cases of errors -
that seems wrong to me? Looks to me like that block should be moved till after
the error:?

How about something like the attached patch?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

pg_recvlogical_graceful_interrupt.txttext/plain; charset=us-asciiDownload
diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 5f2e6af445..e33c204df0 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -55,6 +55,7 @@ static const char *plugin = "test_decoding";
 /* Global State */
 static int	outfd = -1;
 static volatile sig_atomic_t time_to_abort = false;
+static volatile sig_atomic_t interrupted = false;
 static volatile sig_atomic_t output_reopen = false;
 static bool output_isfile;
 static TimestampTz output_last_fsync = -1;
@@ -206,6 +207,7 @@ StreamLogicalLog(void)
 	char	   *copybuf = NULL;
 	TimestampTz last_status = -1;
 	int			i;
+	XLogRecPtr	cur_record_lsn = InvalidXLogRecPtr;
 	PQExpBuffer query;
 
 	output_written_lsn = InvalidXLogRecPtr;
@@ -275,7 +277,6 @@ StreamLogicalLog(void)
 		int			bytes_written;
 		TimestampTz now;
 		int			hdr_len;
-		XLogRecPtr	cur_record_lsn = InvalidXLogRecPtr;
 
 		if (copybuf != NULL)
 		{
@@ -487,7 +488,7 @@ StreamLogicalLog(void)
 
 			if (endposReached)
 			{
-				prepareToTerminate(conn, endpos, true, InvalidXLogRecPtr);
+				cur_record_lsn = InvalidXLogRecPtr;
 				time_to_abort = true;
 				break;
 			}
@@ -527,7 +528,6 @@ StreamLogicalLog(void)
 			 */
 			if (!flushAndSendFeedback(conn, &now))
 				goto error;
-			prepareToTerminate(conn, endpos, false, cur_record_lsn);
 			time_to_abort = true;
 			break;
 		}
@@ -572,12 +572,14 @@ StreamLogicalLog(void)
 			/* endpos was exactly the record we just processed, we're done */
 			if (!flushAndSendFeedback(conn, &now))
 				goto error;
-			prepareToTerminate(conn, endpos, false, cur_record_lsn);
 			time_to_abort = true;
 			break;
 		}
 	}
 
+	if (time_to_abort)
+		prepareToTerminate(conn, endpos, false, cur_record_lsn);
+
 	res = PQgetResult(conn);
 	if (PQresultStatus(res) == PGRES_COPY_OUT)
 	{
@@ -657,6 +659,7 @@ static void
 sigexit_handler(SIGNAL_ARGS)
 {
 	time_to_abort = true;
+	interrupted = true;
 }
 
 /*
@@ -1031,6 +1034,8 @@ prepareToTerminate(PGconn *conn, XLogRecPtr endpos, bool keepalive, XLogRecPtr l
 		if (keepalive)
 			pg_log_info("end position %X/%X reached by keepalive",
 						LSN_FORMAT_ARGS(endpos));
+		else if (interrupted)
+			pg_log_info("interrupted after %X/%X", LSN_FORMAT_ARGS(lsn));
 		else
 			pg_log_info("end position %X/%X reached by WAL record at %X/%X",
 						LSN_FORMAT_ARGS(endpos), LSN_FORMAT_ARGS(lsn));
#4Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Kyotaro Horiguchi (#3)
1 attachment(s)
Re: pg_recvlogical prints bogus error when interrupted

On Fri, Oct 21, 2022 at 7:52 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

+1. How about emitting a message like its friend pg_receivewal, like
the attached patch?

I'm not a fan of treating SIGINT as an error in this case. It calls
prepareToTerminate() when time_to_abort and everything goes fine after
then. So I think we should do the same thing after receiving an
interrupt. This also does file-sync naturally as a part of normal
shutdown. I'm also not a fan of doing fsync at error.

I think the pg_recvlogical can gracefully exit on both SIGINT and
SIGTERM to keep things simple.

I also then noticed that we don't fsync the output file in cases of errors -
that seems wrong to me? Looks to me like that block should be moved till after
the error:?

How about something like the attached patch?

The attached patch (pg_recvlogical_graceful_interrupt.text) has a
couple of problems, I believe. We're losing prepareToTerminate() with
keepalive true and we're not skipping pg_log_error("unexpected
termination of replication stream: %s" upon interrupt, after all we're
here discussing how to avoid it.

I came up with the attached v2 patch, please have a look.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v2-0001-pg_recvlogical-fixes.patchapplication/x-patch; name=v2-0001-pg_recvlogical-fixes.patchDownload
From 56e25373796b114254f5995701b07b05381f28ef Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 22 Oct 2022 08:35:16 +0000
Subject: [PATCH v2] pg_recvlogical fixes

---
 src/bin/pg_basebackup/pg_recvlogical.c | 25 ++++++++++++++++++++++---
 1 file changed, 22 insertions(+), 3 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 5f2e6af445..849e9d9071 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -54,7 +54,8 @@ static const char *plugin = "test_decoding";
 
 /* Global State */
 static int	outfd = -1;
-static volatile sig_atomic_t time_to_abort = false;
+static bool	time_to_abort = false;
+static volatile sig_atomic_t ready_to_exit = false;
 static volatile sig_atomic_t output_reopen = false;
 static bool output_isfile;
 static TimestampTz output_last_fsync = -1;
@@ -283,6 +284,23 @@ StreamLogicalLog(void)
 			copybuf = NULL;
 		}
 
+		/* When we get SIGINT/SIGTERM, we exit */
+		if (ready_to_exit)
+		{
+			/*
+			 * Try informing the server about our exit, but don't wait around
+			 * or retry on failure.
+			 */
+			(void) PQputCopyEnd(conn, NULL);
+			(void) PQflush(conn);
+			time_to_abort = ready_to_exit;
+
+			if (verbose)
+				pg_log_info("received interrupt signal, exiting");
+
+			break;
+		}
+
 		/*
 		 * Potentially send a status message to the primary.
 		 */
@@ -614,7 +632,8 @@ StreamLogicalLog(void)
 
 		res = PQgetResult(conn);
 	}
-	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+	if (!ready_to_exit &&
+		PQresultStatus(res) != PGRES_COMMAND_OK)
 	{
 		pg_log_error("unexpected termination of replication stream: %s",
 					 PQresultErrorMessage(res));
@@ -656,7 +675,7 @@ error:
 static void
 sigexit_handler(SIGNAL_ARGS)
 {
-	time_to_abort = true;
+	ready_to_exit = true;
 }
 
 /*
-- 
2.34.1

#5Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#4)
Re: pg_recvlogical prints bogus error when interrupted

On Mon, Oct 24, 2022 at 8:15 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Fri, Oct 21, 2022 at 7:52 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

+1. How about emitting a message like its friend pg_receivewal, like
the attached patch?

I'm not a fan of treating SIGINT as an error in this case. It calls
prepareToTerminate() when time_to_abort and everything goes fine after
then. So I think we should do the same thing after receiving an
interrupt. This also does file-sync naturally as a part of normal
shutdown. I'm also not a fan of doing fsync at error.

I think the pg_recvlogical can gracefully exit on both SIGINT and
SIGTERM to keep things simple.

I also then noticed that we don't fsync the output file in cases of errors -
that seems wrong to me? Looks to me like that block should be moved till after
the error:?

How about something like the attached patch?

The attached patch (pg_recvlogical_graceful_interrupt.text) has a
couple of problems, I believe. We're losing prepareToTerminate() with
keepalive true and we're not skipping pg_log_error("unexpected
termination of replication stream: %s" upon interrupt, after all we're
here discussing how to avoid it.

I came up with the attached v2 patch, please have a look.

FWIW, I added it to CF - https://commitfest.postgresql.org/40/3966/.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#6Andres Freund
andres@anarazel.de
In reply to: Bharath Rupireddy (#4)
Re: pg_recvlogical prints bogus error when interrupted

Hi,

On 2022-10-24 08:15:11 +0530, Bharath Rupireddy wrote:

I came up with the attached v2 patch, please have a look.

Thanks for working on this!

+		/* When we get SIGINT/SIGTERM, we exit */
+		if (ready_to_exit)
+		{
+			/*
+			 * Try informing the server about our exit, but don't wait around
+			 * or retry on failure.
+			 */
+			(void) PQputCopyEnd(conn, NULL);
+			(void) PQflush(conn);
+			time_to_abort = ready_to_exit;

This doesn't strike me as great - because the ready_to_exit isn't checked in
the loop around StreamLogicalLog(), we'll reconnect if something else causes
StreamLogicalLog() to return.

Why do we need both time_to_abort and ready_to_exit? Perhaps worth noting that
time_to_abort is still an sig_atomic_t, but isn't modified in a signal
handler, which seems a bit unnecessarily confusing.

Greetings,

Andres Freund

#7Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#6)
1 attachment(s)
Re: pg_recvlogical prints bogus error when interrupted

On Fri, Oct 28, 2022 at 4:41 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-10-24 08:15:11 +0530, Bharath Rupireddy wrote:

+             /* When we get SIGINT/SIGTERM, we exit */
+             if (ready_to_exit)
+             {
+                     /*
+                      * Try informing the server about our exit, but don't wait around
+                      * or retry on failure.
+                      */
+                     (void) PQputCopyEnd(conn, NULL);
+                     (void) PQflush(conn);
+                     time_to_abort = ready_to_exit;

This doesn't strike me as great - because the ready_to_exit isn't checked in
the loop around StreamLogicalLog(), we'll reconnect if something else causes
StreamLogicalLog() to return.

Fixed.

Why do we need both time_to_abort and ready_to_exit?

Intention to have ready_to_exit is to be able to distinguish between
SIGINT/SIGTERM and aborting when endpos is reached so that necessary
code is skipped/executed and proper logs are printed.

Perhaps worth noting that
time_to_abort is still an sig_atomic_t, but isn't modified in a signal
handler, which seems a bit unnecessarily confusing.

time_to_abort is just a static variable, no?

+static bool    time_to_abort = false;
+static volatile sig_atomic_t ready_to_exit = false;

Please see the attached v3 patch.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v3-0001-Fix-pg_recvlogical-error-message-upon-SIGINT-SIGT.patchapplication/octet-stream; name=v3-0001-Fix-pg_recvlogical-error-message-upon-SIGINT-SIGT.patchDownload
From b5f940a6af46d487e164f3de17055be1289ae41b Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Fri, 28 Oct 2022 03:09:46 +0000
Subject: [PATCH v3] Fix pg_recvlogical error message upon SIGINT/SIGTERM

---
 src/bin/pg_basebackup/pg_recvlogical.c | 29 ++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index 5f2e6af445..8f1d4d581f 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -54,7 +54,8 @@ static const char *plugin = "test_decoding";
 
 /* Global State */
 static int	outfd = -1;
-static volatile sig_atomic_t time_to_abort = false;
+static bool	time_to_abort = false;
+static volatile sig_atomic_t ready_to_exit = false;
 static volatile sig_atomic_t output_reopen = false;
 static bool output_isfile;
 static TimestampTz output_last_fsync = -1;
@@ -283,6 +284,23 @@ StreamLogicalLog(void)
 			copybuf = NULL;
 		}
 
+		/* When we get SIGINT/SIGTERM, we exit */
+		if (ready_to_exit)
+		{
+			/*
+			 * Try informing the server about our exit, but don't wait around
+			 * or retry on failure.
+			 */
+			(void) PQputCopyEnd(conn, NULL);
+			(void) PQflush(conn);
+			time_to_abort = true;
+
+			if (verbose)
+				pg_log_info("received interrupt signal, exiting");
+
+			break;
+		}
+
 		/*
 		 * Potentially send a status message to the primary.
 		 */
@@ -614,7 +632,10 @@ StreamLogicalLog(void)
 
 		res = PQgetResult(conn);
 	}
-	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+
+	/* It is not unexepected termination error when Ctrl-C'ed. */
+	if (!ready_to_exit &&
+		PQresultStatus(res) != PGRES_COMMAND_OK)
 	{
 		pg_log_error("unexpected termination of replication stream: %s",
 					 PQresultErrorMessage(res));
@@ -656,7 +677,7 @@ error:
 static void
 sigexit_handler(SIGNAL_ARGS)
 {
-	time_to_abort = true;
+	ready_to_exit = true;
 }
 
 /*
@@ -976,7 +997,7 @@ main(int argc, char **argv)
 	while (true)
 	{
 		StreamLogicalLog();
-		if (time_to_abort)
+		if (ready_to_exit || time_to_abort)
 		{
 			/*
 			 * We've been Ctrl-C'ed or reached an exit limit condition. That's
-- 
2.34.1

#8Cary Huang
cary.huang@highgo.ca
In reply to: Bharath Rupireddy (#7)
Re: pg_recvlogical prints bogus error when interrupted

The following review has been posted through the commitfest application:
make installcheck-world: tested, passed
Implements feature: tested, passed
Spec compliant: tested, passed
Documentation: not tested

Hello

The patch applies and tests fine. I like the way to have both ready_to_exit and time_to_abort variables to control the exit sequence. I think the (void) cast can be removed in front of PQputCopyEnd(), PQflush for consistency purposes as it does not give warnings and everywhere else does not have those casts.

thank you
Cary

#9Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#4)
Re: pg_recvlogical prints bogus error when interrupted

On Mon, Oct 24, 2022 at 08:15:11AM +0530, Bharath Rupireddy wrote:

The attached patch (pg_recvlogical_graceful_interrupt.text) has a
couple of problems, I believe. We're losing prepareToTerminate() with
keepalive true and we're not skipping pg_log_error("unexpected
termination of replication stream: %s" upon interrupt, after all we're
here discussing how to avoid it.

I came up with the attached v2 patch, please have a look.

This thread has slipped through the feature freeze deadline. Would
people be OK to do something now on HEAD? A backpatch is also in
order, IMO, as the current behavior looks confusing under SIGINT and
SIGTERM.
--
Michael

#10Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#9)
1 attachment(s)
Re: pg_recvlogical prints bogus error when interrupted

On Tue, Apr 11, 2023 at 11:42 AM Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Oct 24, 2022 at 08:15:11AM +0530, Bharath Rupireddy wrote:

The attached patch (pg_recvlogical_graceful_interrupt.text) has a
couple of problems, I believe. We're losing prepareToTerminate() with
keepalive true and we're not skipping pg_log_error("unexpected
termination of replication stream: %s" upon interrupt, after all we're
here discussing how to avoid it.

I came up with the attached v2 patch, please have a look.

This thread has slipped through the feature freeze deadline. Would
people be OK to do something now on HEAD? A backpatch is also in
order, IMO, as the current behavior looks confusing under SIGINT and
SIGTERM.

IMO, +1 for HEAD/PG16 and +0.5 for backpatching as it may not be so
critical to backpatch all the way down. What may happen without this
patch is that the output file isn't fsync-ed upon SIGINT/SIGTERM.
Well, is it a critical issue on production servers?

On Fri, Apr 7, 2023 at 5:12 AM Cary Huang <cary.huang@highgo.ca> wrote:

The following review has been posted through the commitfest application:

The patch applies and tests fine. I like the way to have both ready_to_exit and time_to_abort variables to control the exit sequence. I think the (void) cast can be removed in front of PQputCopyEnd(), PQflush for consistency purposes as it does not give warnings and everywhere else does not have those casts.

Thanks for reviewing. I removed the (void) casts like elsewhere in the
code, however, I didn't change such casts in prepareToTerminate() to
not create a diff.

I'm attaching the v4 patch for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v4-0001-Fix-pg_recvlogical-error-message-upon-SIGINT-SIGT.patchapplication/x-patch; name=v4-0001-Fix-pg_recvlogical-error-message-upon-SIGINT-SIGT.patchDownload
From 22195f136440fdadae0c6a0bf04c23fa16b4031c Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 26 Apr 2023 15:25:03 +0000
Subject: [PATCH v4] Fix pg_recvlogical error message upon SIGINT/SIGTERM

When pg_recvlogical gets SIGINT/SIGTERM, it emits
"unexpected termination of replication stream" error, which is
meant for really unexpected termination or a crash, but not for
SIGINT/SIGTERM. Upon SIGINT/SIGTERM, we want pg_recvlogical to
fsync the output file before exiting cleanly. This commit changes
pg_recvlogical to that.

Reported-by: Andres Freund
Author: Bharath Rupireddy
Reviewed-by: Kyotaro Horiguchi, Andres Freund
Reviewed-by: Cary Huang
Discussion: https://www.postgresql.org/message-id/20221019213953.htdtzikf4f45ywil%40awork3.anarazel.de
---
 src/bin/pg_basebackup/pg_recvlogical.c | 29 ++++++++++++++++++++++----
 1 file changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index f3c7937a1d..337076647b 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -54,7 +54,8 @@ static const char *plugin = "test_decoding";
 
 /* Global State */
 static int	outfd = -1;
-static volatile sig_atomic_t time_to_abort = false;
+static bool	time_to_abort = false;
+static volatile sig_atomic_t ready_to_exit = false;
 static volatile sig_atomic_t output_reopen = false;
 static bool output_isfile;
 static TimestampTz output_last_fsync = -1;
@@ -283,6 +284,23 @@ StreamLogicalLog(void)
 			copybuf = NULL;
 		}
 
+		/* When we get SIGINT/SIGTERM, we exit */
+		if (ready_to_exit)
+		{
+			/*
+			 * Try informing the server about our exit, but don't wait around
+			 * or retry on failure.
+			 */
+			PQputCopyEnd(conn, NULL);
+			PQflush(conn);
+			time_to_abort = true;
+
+			if (verbose)
+				pg_log_info("received interrupt signal, exiting");
+
+			break;
+		}
+
 		/*
 		 * Potentially send a status message to the primary.
 		 */
@@ -614,7 +632,10 @@ StreamLogicalLog(void)
 
 		res = PQgetResult(conn);
 	}
-	if (PQresultStatus(res) != PGRES_COMMAND_OK)
+
+	/* It is not unexepected termination error when Ctrl-C'ed. */
+	if (!ready_to_exit &&
+		PQresultStatus(res) != PGRES_COMMAND_OK)
 	{
 		pg_log_error("unexpected termination of replication stream: %s",
 					 PQresultErrorMessage(res));
@@ -656,7 +677,7 @@ error:
 static void
 sigexit_handler(SIGNAL_ARGS)
 {
-	time_to_abort = true;
+	ready_to_exit = true;
 }
 
 /*
@@ -976,7 +997,7 @@ main(int argc, char **argv)
 	while (true)
 	{
 		StreamLogicalLog();
-		if (time_to_abort)
+		if (ready_to_exit || time_to_abort)
 		{
 			/*
 			 * We've been Ctrl-C'ed or reached an exit limit condition. That's
-- 
2.34.1

#11Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#10)
Re: pg_recvlogical prints bogus error when interrupted

On Thu, Apr 27, 2023 at 11:24:52AM +0530, Bharath Rupireddy wrote:

IMO, +1 for HEAD/PG16 and +0.5 for backpatching as it may not be so
critical to backpatch all the way down. What may happen without this
patch is that the output file isn't fsync-ed upon SIGINT/SIGTERM.
Well, is it a critical issue on production servers?

It is also true that it's been this way for years with basically
nobody complaining outside this thread. So there is also an argument
about leaving v16 out of the picture, and do that only in 17~ to avoid
playing with the branch stability more than necessary? I see 7 open
items as of today, and there are TAP tests linked to pg_recvlogical.
That should be OK, because none of these tests rely on specific
signals, but the buildfarm has some weird setups, as well..
--
Michael

#12Tristan Partin
tristan@neon.tech
In reply to: Bharath Rupireddy (#10)
Re: pg_recvlogical prints bogus error when interrupted

On Thu Apr 27, 2023 at 12:54 AM CDT, Bharath Rupireddy wrote:

Thanks for reviewing. I removed the (void) casts like elsewhere in the
code, however, I didn't change such casts in prepareToTerminate() to
not create a diff.

I'm attaching the v4 patch for further review.

Bharath,

I signed up to review the patch for the commitfest. The patch looks
pretty good to me, but I would like to come to a conclusion on what
Andres posted earlier in the thread.

Why do we need both time_to_abort and ready_to_exit?

I am trying to understand why we need both as well. Maybe I am missing
something important :).

/* It is not unexepected termination error when Ctrl-C'ed. */

My only other comment is that it would be nice to have the word "an"
before unexpected.

--
Tristan Partin
Neon (https://neon.tech)

#13Michael Paquier
michael@paquier.xyz
In reply to: Tristan Partin (#12)
Re: pg_recvlogical prints bogus error when interrupted

On Thu, Jul 06, 2023 at 10:29:10AM -0500, Tristan Partin wrote:

On Thu Apr 27, 2023 at 12:54 AM CDT, Bharath Rupireddy wrote:

Why do we need both time_to_abort and ready_to_exit?

I am trying to understand why we need both as well. Maybe I am missing
something important :).

As StreamLogicalLog() states once it leaves its main loop because
time_to_abort has been switched to true, we want a clean exit. I
think that this patch is just a more complicated way to avoid doing
twice the operations done by prepareToTerminate(). So how about
moving the prepareToTerminate() call outside the main streaming loop
and call it when time_to_abort is true? Then, I would suggest to
change the keepalive argument of prepareToTerminate() to an enum able
to handle three values to log the reason why the tool is stopping: the
end of WAL, an interruption or a keepalive when logging. There are
two of them now, but we want a third mode for the signals.

/* It is not unexepected termination error when Ctrl-C'ed. */

My only other comment is that it would be nice to have the word "an"
before unexpected.

s/unexepected/unexpected/. Still, it seems to me that we don't need
this comment.
--
Michael

#14Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#13)
1 attachment(s)
Re: pg_recvlogical prints bogus error when interrupted

On Mon, Jul 10, 2023 at 01:44:45PM +0900, Michael Paquier wrote:

As StreamLogicalLog() states once it leaves its main loop because
time_to_abort has been switched to true, we want a clean exit. I
think that this patch is just a more complicated way to avoid doing
twice the operations done by prepareToTerminate(). So how about
moving the prepareToTerminate() call outside the main streaming loop
and call it when time_to_abort is true? Then, I would suggest to
change the keepalive argument of prepareToTerminate() to an enum able
to handle three values to log the reason why the tool is stopping: the
end of WAL, an interruption or a keepalive when logging. There are
two of them now, but we want a third mode for the signals.

It took me some time to come back to this one, but attached is what I
had in mind. This stuff has three reasons to stop: keepalive, end LSN
or signal. This makes the code easier to follow.

Thoughts or comments?
--
Michael

Attachments:

v5-0001-Fix-pg_recvlogical-error-message-upon-SIGINT-SIGT.patchtext/x-diff; charset=us-asciiDownload
From efb4f15d7b1f3f97da370cf4b7db7562eec08cb3 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 19 Jul 2023 11:29:02 +0900
Subject: [PATCH v5] Fix pg_recvlogical error message upon SIGINT/SIGTERM

When pg_recvlogical needs to abort on a signal like SIGINT/SIGTERM, it
is expected to exit cleanly.  However, the code forgot to clean up the
state of the connection befor leaving.  This would cause the tool to
emit messages like "unexpected termination of replication stream" error,
which is meant for really unexpected termination or a crash.

The code is refactored to apply the same termination abort operations for
signals, end LSN and keepalive cases.

Reported-by: Andres Freund
Author: Bharath Rupireddy
Reviewed-by: Kyotaro Horiguchi, Andres Freund, Cary Huang
Discussion: https://www.postgresql.org/message-id/20221019213953.htdtzikf4f45ywil%40awork3.anarazel.de
---
 src/bin/pg_basebackup/pg_recvlogical.c | 52 ++++++++++++++++++++------
 1 file changed, 41 insertions(+), 11 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index f3c7937a1d..3bd83deee7 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -32,6 +32,13 @@
 /* Time to sleep between reconnection attempts */
 #define RECONNECT_SLEEP_TIME 5
 
+typedef enum
+{
+	STREAM_STOP_END_OF_WAL,
+	STREAM_STOP_KEEPALIVE,
+	STREAM_STOP_SIGNAL
+}			StreamStopReason;
+
 /* Global Options */
 static char *outfile = NULL;
 static int	verbose = 0;
@@ -66,7 +73,7 @@ static void usage(void);
 static void StreamLogicalLog(void);
 static bool flushAndSendFeedback(PGconn *conn, TimestampTz *now);
 static void prepareToTerminate(PGconn *conn, XLogRecPtr endpos,
-							   bool keepalive, XLogRecPtr lsn);
+							   StreamStopReason reason, XLogRecPtr lsn);
 
 static void
 usage(void)
@@ -207,6 +214,8 @@ StreamLogicalLog(void)
 	TimestampTz last_status = -1;
 	int			i;
 	PQExpBuffer query;
+	XLogRecPtr	stop_lsn = InvalidXLogRecPtr;
+	StreamStopReason stop_reason = STREAM_STOP_SIGNAL;
 
 	output_written_lsn = InvalidXLogRecPtr;
 	output_fsync_lsn = InvalidXLogRecPtr;
@@ -487,7 +496,7 @@ StreamLogicalLog(void)
 
 			if (endposReached)
 			{
-				prepareToTerminate(conn, endpos, true, InvalidXLogRecPtr);
+				stop_reason = STREAM_STOP_KEEPALIVE;
 				time_to_abort = true;
 				break;
 			}
@@ -519,6 +528,12 @@ StreamLogicalLog(void)
 		/* Extract WAL location for this block */
 		cur_record_lsn = fe_recvint64(&copybuf[1]);
 
+		/*
+		 * If this loop is aborted, like on signal, saving this information
+		 * here gives a correct feedback.
+		 */
+		stop_lsn = cur_record_lsn;
+
 		if (endpos != InvalidXLogRecPtr && cur_record_lsn > endpos)
 		{
 			/*
@@ -527,7 +542,7 @@ StreamLogicalLog(void)
 			 */
 			if (!flushAndSendFeedback(conn, &now))
 				goto error;
-			prepareToTerminate(conn, endpos, false, cur_record_lsn);
+			stop_reason = STREAM_STOP_END_OF_WAL;
 			time_to_abort = true;
 			break;
 		}
@@ -572,12 +587,16 @@ StreamLogicalLog(void)
 			/* endpos was exactly the record we just processed, we're done */
 			if (!flushAndSendFeedback(conn, &now))
 				goto error;
-			prepareToTerminate(conn, endpos, false, cur_record_lsn);
+			stop_reason = STREAM_STOP_END_OF_WAL;
 			time_to_abort = true;
 			break;
 		}
 	}
 
+	/* Clean up connection state if stream has been aborted */
+	if (time_to_abort)
+		prepareToTerminate(conn, endpos, stop_reason, stop_lsn);
+
 	res = PQgetResult(conn);
 	if (PQresultStatus(res) == PGRES_COPY_OUT)
 	{
@@ -1021,18 +1040,29 @@ flushAndSendFeedback(PGconn *conn, TimestampTz *now)
  * retry on failure.
  */
 static void
-prepareToTerminate(PGconn *conn, XLogRecPtr endpos, bool keepalive, XLogRecPtr lsn)
+prepareToTerminate(PGconn *conn, XLogRecPtr endpos, StreamStopReason reason,
+				   XLogRecPtr lsn)
 {
 	(void) PQputCopyEnd(conn, NULL);
 	(void) PQflush(conn);
 
 	if (verbose)
 	{
-		if (keepalive)
-			pg_log_info("end position %X/%X reached by keepalive",
-						LSN_FORMAT_ARGS(endpos));
-		else
-			pg_log_info("end position %X/%X reached by WAL record at %X/%X",
-						LSN_FORMAT_ARGS(endpos), LSN_FORMAT_ARGS(lsn));
+		switch (reason)
+		{
+			case STREAM_STOP_SIGNAL:
+				pg_log_info("end position %X/%X reached on signal",
+							LSN_FORMAT_ARGS(lsn));
+				break;
+			case STREAM_STOP_KEEPALIVE:
+				pg_log_info("end position %X/%X reached by keepalive",
+							LSN_FORMAT_ARGS(endpos));
+				break;
+			case STREAM_STOP_END_OF_WAL:
+				Assert(!XLogRecPtrIsInvalid(lsn));
+				pg_log_info("end position %X/%X reached by WAL record at %X/%X",
+							LSN_FORMAT_ARGS(endpos), LSN_FORMAT_ARGS(lsn));
+				break;
+		}
 	}
 }
-- 
2.40.1

#15Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#14)
1 attachment(s)
Re: pg_recvlogical prints bogus error when interrupted

On Wed, Jul 19, 2023 at 8:04 AM Michael Paquier <michael@paquier.xyz> wrote:

It took me some time to come back to this one, but attached is what I
had in mind. This stuff has three reasons to stop: keepalive, end LSN
or signal. This makes the code easier to follow.

Thoughts or comments?

Thanks. I have some comments on v5:

1. I don't think we need a stop_lsn variable, the cur_record_lsn can
help if it's defined outside the loop. With this, the patch can
further be simplified as attached v6.

2. And, I'd prefer not to assume the stop_reason as signal by default.
While it works for now because the remaining place where time_to_abort
is set to true is only in the signal handler, it is not extensible, if
there's another exit condition comes in future that sets time_to_abort
to true.

3. pg_log_info("end position %X/%X reached on signal", .... For
signal, end position is a bit vague wording and I think we can just
say pg_log_info("received interrupt signal, exiting"); like
pg_receivewal. We really can't have a valid stop_lsn for signal exit
because that depends on when signal arrives in the while loop. If at
all, someone wants to know the last received LSN - they can look at
the other messages that pg_recvlogical emits - pg_recvlogical:
confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0 (slot myslot).

4. With v5, it was taking a while to exit after the first CTRL+C, see
multiple CTRL+Cs at the end:
ubuntu::~/postgres/inst/bin$ ./pg_recvlogical --slot=lsub1_repl_slot
--file=lsub1.data --start --verbose
pg_recvlogical: starting log streaming at 0/0 (slot lsub1_repl_slot)
pg_recvlogical: streaming initiated
pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
^Cpg_recvlogical: end position 0/2867D70 reached on signal
^C^C^C^C^C^C^C^C^C^C^C^C
^C^C^C^C^C^C^C^C^C^C^C^C

5. FWIW, on HEAD we'd get the following and the patch emits a better messaging:
ubuntu:~/postgres/inst/bin$ ./pg_recvlogical --slot=lsub1_repl_slot
--file=lsub1.data --start --dbname='host=localhost port=5432
dbname=postgres user=ubuntu' --verbose
pg_recvlogical: starting log streaming at 0/0 (slot lsub1_repl_slot)
pg_recvlogical: streaming initiated
pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
^Cpg_recvlogical: error: unexpected termination of replication stream:

Attaching v6 patch with the above changes to v6. Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v6-0001-Fix-pg_recvlogical-error-message-upon-SIGINT-SIGT.patchapplication/octet-stream; name=v6-0001-Fix-pg_recvlogical-error-message-upon-SIGINT-SIGT.patchDownload
From 9d0820124bca27549a50165b39a3c3f8a9ce6815 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 19 Jul 2023 06:34:40 +0000
Subject: [PATCH v6] Fix pg_recvlogical error message upon SIGINT/SIGTERM

When pg_recvlogical needs to abort on a signal like SIGINT/SIGTERM, it
is expected to exit cleanly.  However, the code forgot to clean up the
state of the connection befor leaving.  This would cause the tool to
emit messages like "unexpected termination of replication stream" error,
which is meant for really unexpected termination or a crash.

The code is refactored to apply the same termination abort operations for
signals, end LSN and keepalive cases.

Reported-by: Andres Freund
Author: Bharath Rupireddy
Reviewed-by: Kyotaro Horiguchi, Andres Freund, Cary Huang
Discussion: https://www.postgresql.org/message-id/20221019213953.htdtzikf4f45ywil%40awork3.anarazel.de
---
 src/bin/pg_basebackup/pg_recvlogical.c | 51 ++++++++++++++++++++------
 1 file changed, 39 insertions(+), 12 deletions(-)

diff --git a/src/bin/pg_basebackup/pg_recvlogical.c b/src/bin/pg_basebackup/pg_recvlogical.c
index f3c7937a1d..973c2e6a5b 100644
--- a/src/bin/pg_basebackup/pg_recvlogical.c
+++ b/src/bin/pg_basebackup/pg_recvlogical.c
@@ -32,6 +32,14 @@
 /* Time to sleep between reconnection attempts */
 #define RECONNECT_SLEEP_TIME 5
 
+typedef enum
+{
+	STREAM_STOP_NONE,
+	STREAM_STOP_END_OF_WAL,
+	STREAM_STOP_KEEPALIVE,
+	STREAM_STOP_SIGNAL
+}			StreamStopReason;
+
 /* Global Options */
 static char *outfile = NULL;
 static int	verbose = 0;
@@ -55,6 +63,7 @@ static const char *plugin = "test_decoding";
 /* Global State */
 static int	outfd = -1;
 static volatile sig_atomic_t time_to_abort = false;
+static volatile sig_atomic_t stop_reason = STREAM_STOP_NONE;
 static volatile sig_atomic_t output_reopen = false;
 static bool output_isfile;
 static TimestampTz output_last_fsync = -1;
@@ -66,7 +75,7 @@ static void usage(void);
 static void StreamLogicalLog(void);
 static bool flushAndSendFeedback(PGconn *conn, TimestampTz *now);
 static void prepareToTerminate(PGconn *conn, XLogRecPtr endpos,
-							   bool keepalive, XLogRecPtr lsn);
+							   XLogRecPtr lsn);
 
 static void
 usage(void)
@@ -207,6 +216,7 @@ StreamLogicalLog(void)
 	TimestampTz last_status = -1;
 	int			i;
 	PQExpBuffer query;
+	XLogRecPtr	cur_record_lsn;
 
 	output_written_lsn = InvalidXLogRecPtr;
 	output_fsync_lsn = InvalidXLogRecPtr;
@@ -275,7 +285,8 @@ StreamLogicalLog(void)
 		int			bytes_written;
 		TimestampTz now;
 		int			hdr_len;
-		XLogRecPtr	cur_record_lsn = InvalidXLogRecPtr;
+
+		cur_record_lsn = InvalidXLogRecPtr;
 
 		if (copybuf != NULL)
 		{
@@ -487,7 +498,7 @@ StreamLogicalLog(void)
 
 			if (endposReached)
 			{
-				prepareToTerminate(conn, endpos, true, InvalidXLogRecPtr);
+				stop_reason = STREAM_STOP_KEEPALIVE;
 				time_to_abort = true;
 				break;
 			}
@@ -527,7 +538,7 @@ StreamLogicalLog(void)
 			 */
 			if (!flushAndSendFeedback(conn, &now))
 				goto error;
-			prepareToTerminate(conn, endpos, false, cur_record_lsn);
+			stop_reason = STREAM_STOP_END_OF_WAL;
 			time_to_abort = true;
 			break;
 		}
@@ -572,12 +583,16 @@ StreamLogicalLog(void)
 			/* endpos was exactly the record we just processed, we're done */
 			if (!flushAndSendFeedback(conn, &now))
 				goto error;
-			prepareToTerminate(conn, endpos, false, cur_record_lsn);
+			stop_reason = STREAM_STOP_END_OF_WAL;
 			time_to_abort = true;
 			break;
 		}
 	}
 
+	/* Clean up connection state if stream has been aborted */
+	if (time_to_abort)
+		prepareToTerminate(conn, endpos, cur_record_lsn);
+
 	res = PQgetResult(conn);
 	if (PQresultStatus(res) == PGRES_COPY_OUT)
 	{
@@ -656,6 +671,7 @@ error:
 static void
 sigexit_handler(SIGNAL_ARGS)
 {
+	stop_reason = STREAM_STOP_SIGNAL;
 	time_to_abort = true;
 }
 
@@ -1021,18 +1037,29 @@ flushAndSendFeedback(PGconn *conn, TimestampTz *now)
  * retry on failure.
  */
 static void
-prepareToTerminate(PGconn *conn, XLogRecPtr endpos, bool keepalive, XLogRecPtr lsn)
+prepareToTerminate(PGconn *conn, XLogRecPtr endpos, XLogRecPtr lsn)
 {
 	(void) PQputCopyEnd(conn, NULL);
 	(void) PQflush(conn);
 
+	Assert(stop_reason != STREAM_STOP_NONE);
+
 	if (verbose)
 	{
-		if (keepalive)
-			pg_log_info("end position %X/%X reached by keepalive",
-						LSN_FORMAT_ARGS(endpos));
-		else
-			pg_log_info("end position %X/%X reached by WAL record at %X/%X",
-						LSN_FORMAT_ARGS(endpos), LSN_FORMAT_ARGS(lsn));
+		switch (stop_reason)
+		{
+			case STREAM_STOP_SIGNAL:
+				pg_log_info("received interrupt signal, exiting");
+				break;
+			case STREAM_STOP_KEEPALIVE:
+				pg_log_info("end position %X/%X reached by keepalive",
+							LSN_FORMAT_ARGS(endpos));
+				break;
+			case STREAM_STOP_END_OF_WAL:
+				Assert(!XLogRecPtrIsInvalid(lsn));
+				pg_log_info("end position %X/%X reached by WAL record at %X/%X",
+							LSN_FORMAT_ARGS(endpos), LSN_FORMAT_ARGS(lsn));
+				break;
+		}
 	}
 }
-- 
2.34.1

#16Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#15)
Re: pg_recvlogical prints bogus error when interrupted

On Wed, Jul 19, 2023 at 12:07 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

4. With v5, it was taking a while to exit after the first CTRL+C, see
multiple CTRL+Cs at the end:
ubuntu::~/postgres/inst/bin$ ./pg_recvlogical --slot=lsub1_repl_slot
--file=lsub1.data --start --verbose
pg_recvlogical: starting log streaming at 0/0 (slot lsub1_repl_slot)
pg_recvlogical: streaming initiated
pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
^Cpg_recvlogical: end position 0/2867D70 reached on signal
^C^C^C^C^C^C^C^C^C^C^C^C
^C^C^C^C^C^C^C^C^C^C^C^C

I think the delay is expected for the reason specified below and is
not because of any of the changes in v5. As far as CTRL+C is
concerned, it is a clean exit and hence we can't escape the while(1)
loop.

/*
* We're doing a client-initiated clean exit and have sent CopyDone to
* the server. Drain any messages, so we don't miss a last-minute
* ErrorResponse. The walsender stops generating XLogData records once
* it sees CopyDone, so expect this to finish quickly. After CopyDone,
* it's too late for sendFeedback(), even if this were to take a long
* time. Hence, use synchronous-mode PQgetCopyData().
*/
while (1)
{

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#17Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#15)
Re: pg_recvlogical prints bogus error when interrupted

On Wed, Jul 19, 2023 at 12:07:21PM +0530, Bharath Rupireddy wrote:

1. I don't think we need a stop_lsn variable, the cur_record_lsn can
help if it's defined outside the loop. With this, the patch can
further be simplified as attached v6.

Okay by me.

2. And, I'd prefer not to assume the stop_reason as signal by default.
While it works for now because the remaining place where time_to_abort
is set to true is only in the signal handler, it is not extensible, if
there's another exit condition comes in future that sets time_to_abort
to true.

Yeah, I was also wondering whether this should be set by the signal
handler in this case while storing the reason statically on a specific
sig_atomic_t.

3. pg_log_info("end position %X/%X reached on signal", .... For
signal, end position is a bit vague wording and I think we can just
say pg_log_info("received interrupt signal, exiting"); like
pg_receivewal. We really can't have a valid stop_lsn for signal exit
because that depends on when signal arrives in the while loop. If at
all, someone wants to know the last received LSN - they can look at
the other messages that pg_recvlogical emits - pg_recvlogical:
confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0 (slot myslot).

+    case STREAM_STOP_SIGNAL:
+            pg_log_info("received interrupt signal, exiting");
+            break;

Still it is useful to report the location we have finished with when
stopping on a signal, no? Why couldn't we use "lsn" here, aka
cur_record_lsn?

4. With v5, it was taking a while to exit after the first CTRL+C, see
multiple CTRL+Cs at the end:
ubuntu::~/postgres/inst/bin$ ./pg_recvlogical --slot=lsub1_repl_slot
--file=lsub1.data --start --verbose
pg_recvlogical: starting log streaming at 0/0 (slot lsub1_repl_slot)
pg_recvlogical: streaming initiated
pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
pg_recvlogical: confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0
(slot lsub1_repl_slot)
^Cpg_recvlogical: end position 0/2867D70 reached on signal
^C^C^C^C^C^C^C^C^C^C^C^C
^C^C^C^C^C^C^C^C^C^C^C^C

Isn't that where we'd need to look at a long change but we need to
stop cleanly? That sounds expected to me.
--
Michael

#18Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#17)
Re: pg_recvlogical prints bogus error when interrupted

On Wed, Jul 19, 2023 at 1:41 PM Michael Paquier <michael@paquier.xyz> wrote:

3. pg_log_info("end position %X/%X reached on signal", .... For
signal, end position is a bit vague wording and I think we can just
say pg_log_info("received interrupt signal, exiting"); like
pg_receivewal. We really can't have a valid stop_lsn for signal exit
because that depends on when signal arrives in the while loop. If at
all, someone wants to know the last received LSN - they can look at
the other messages that pg_recvlogical emits - pg_recvlogical:
confirming write up to 0/2BFFFD0, flush to 0/2BFFFD0 (slot myslot).

+    case STREAM_STOP_SIGNAL:
+            pg_log_info("received interrupt signal, exiting");
+            break;

Still it is useful to report the location we have finished with when
stopping on a signal, no? Why couldn't we use "lsn" here, aka
cur_record_lsn?

Printing LSN on signal exit won't be correct - if signal is received
before cur_record_lsn gets assigned, we will be showing an old LSN if
it was previously assigned or invalid LSN if it wasn't assigned
previously. Signal arrival and processing are indeterministic, so we
can't always show the right info. Instead, we can just be simple in
the messaging without an lsn like pg_receivewal.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#19Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#16)
Re: pg_recvlogical prints bogus error when interrupted

On Wed, Jul 19, 2023 at 01:33:15PM +0530, Bharath Rupireddy wrote:

I think the delay is expected for the reason specified below and is
not because of any of the changes in v5. As far as CTRL+C is
concerned, it is a clean exit and hence we can't escape the while(1)
loop.

Yes, that's also what I am expecting. That's costly when replaying a
large change chunk, but we also want a clean exit on a signal as the
code comments document, so..
--
Michael

#20Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#18)
Re: pg_recvlogical prints bogus error when interrupted

On Wed, Jul 19, 2023 at 01:46:02PM +0530, Bharath Rupireddy wrote:

Printing LSN on signal exit won't be correct - if signal is received
before cur_record_lsn gets assigned, we will be showing an old LSN if
it was previously assigned or invalid LSN if it wasn't assigned
previously. Signal arrival and processing are indeterministic, so we
can't always show the right info.

I think that there's an argument to be made because cur_record_lsn
will be set before coming back to the beginning of the replay loop
when a stop is triggered by a signal.

Instead, we can just be simple in the messaging without an lsn like
pg_receivewal.

Anyway, I'm OK with simple for now as it looks that you don't feel
about that either, and the patch is enough to fix the report of this
thread. And one would get periodic information in --verbose mode
depending the sync message frequency, as well.

So, I have applied v6 after fixing two issues with it:
- I have kept the reason as an argument of prepareToTerminate(), to be
able to take advantage of the switch structure where compilers would
generate a warning if adding a new value to StreamStopReason.
- cur_record_lsn was not initialized at the beginning of
StreamLogicalLog(), which is where the compiler complains about the
case of receiving a signal before entering in the replay loop, after
establising a connection.
--
Michael