Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

Started by Michael Paquierabout 7 years ago30 messages
#1Michael Paquier
michael@paquier.xyz
1 attachment(s)

Hi all,

Since 2dedf4d, recovery.conf is dead and all recovery parameters are now
GUCs. While looking at a patch to switch primary_conninfo and
primary_slot_name to be reloadable, Sergei Kornilov had a very good
point that the WAL receiver uses a connection string and a physical slot
name based on what the startup process wants the WAL receiver to use:
/messages/by-id/20181212043208.GI17695@paquier.xyz

It seems to me that doing so is now strange as the WAL receiver knows
about the GUC context, and actually it knows the parameters it should
use, so there is no point to pass down the values when requesting the
WAL receiver to start.

What do you think about the attached to simplify the logic? Even if
primary_conninfo and primary_slot_name are not switched to SIGHUP this
cleanup looks like a good thing to me.

Thoughts?
--
Michael

Attachments:

walreceiver-gucs-v1.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c80b14ed97..9d664559e8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11848,8 +11848,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 									 tli, curFileTLI);
 						}
 						curFileTLI = tli;
-						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
-											 PrimarySlotName);
+						RequestXLogStreaming(tli, ptr);
 						receivedUpto = 0;
 					}
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9643c2ed7b..3e4449583e 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -188,9 +188,7 @@ DisableWalRcvImmediateExit(void)
 void
 WalReceiverMain(void)
 {
-	char		conninfo[MAXCONNINFO];
 	char	   *tmp_conninfo;
-	char		slotname[NAMEDATALEN];
 	XLogRecPtr	startpoint;
 	TimeLineID	startpointTLI;
 	TimeLineID	primaryTLI;
@@ -250,8 +248,6 @@ WalReceiverMain(void)
 
 	/* Fetch information required to start streaming */
 	walrcv->ready_to_display = false;
-	strlcpy(conninfo, (char *) walrcv->conninfo, MAXCONNINFO);
-	strlcpy(slotname, (char *) walrcv->slotname, NAMEDATALEN);
 	startpoint = walrcv->receiveStart;
 	startpointTLI = walrcv->receiveStartTLI;
 
@@ -291,9 +287,14 @@ WalReceiverMain(void)
 	/* Unblock signals (they were blocked when the postmaster forked us) */
 	PG_SETMASK(&UnBlockSig);
 
+	if (PrimaryConnInfo == NULL || PrimaryConnInfo[0] == '\0')
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("cannot connect to the primary server as \"primary_conninfo\" is not defined")));
+
 	/* Establish the connection to the primary for XLOG streaming */
 	EnableWalRcvImmediateExit();
-	wrconn = walrcv_connect(conninfo, false, "walreceiver", &err);
+	wrconn = walrcv_connect(PrimaryConnInfo, false, "walreceiver", &err);
 	if (!wrconn)
 		ereport(ERROR,
 				(errmsg("could not connect to the primary server: %s", err)));
@@ -311,6 +312,10 @@ WalReceiverMain(void)
 	if (tmp_conninfo)
 		strlcpy((char *) walrcv->conninfo, tmp_conninfo, MAXCONNINFO);
 
+	memset(walrcv->slotname, 0, NAMEDATALEN);
+	if (PrimarySlotName)
+		strlcpy((char *) walrcv->slotname, PrimarySlotName, NAMEDATALEN);
+
 	memset(walrcv->sender_host, 0, NI_MAXHOST);
 	if (sender_host)
 		strlcpy((char *) walrcv->sender_host, sender_host, NI_MAXHOST);
@@ -387,7 +392,8 @@ WalReceiverMain(void)
 		 */
 		options.logical = false;
 		options.startpoint = startpoint;
-		options.slotname = slotname[0] != '\0' ? slotname : NULL;
+		options.slotname = (PrimarySlotName && PrimarySlotName[0] != '\0') ?
+			PrimarySlotName : NULL;
 		options.proto.physical.startpointTLI = startpointTLI;
 		ThisTimeLineID = startpointTLI;
 		if (walrcv_startstreaming(wrconn, &options))
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 67b1a074cc..82dc128d10 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -220,8 +220,7 @@ ShutdownWalRcv(void)
  * of a replication slot to acquire.
  */
 void
-RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
-					 const char *slotname)
+RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr)
 {
 	WalRcvData *walrcv = WalRcv;
 	bool		launch = false;
@@ -243,15 +242,8 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 	Assert(walrcv->walRcvState == WALRCV_STOPPED ||
 		   walrcv->walRcvState == WALRCV_WAITING);
 
-	if (conninfo != NULL)
-		strlcpy((char *) walrcv->conninfo, conninfo, MAXCONNINFO);
-	else
-		walrcv->conninfo[0] = '\0';
-
-	if (slotname != NULL)
-		strlcpy((char *) walrcv->slotname, slotname, NAMEDATALEN);
-	else
-		walrcv->slotname[0] = '\0';
+	walrcv->conninfo[0] = '\0';
+	walrcv->slotname[0] = '\0';
 
 	if (walrcv->walRcvState == WALRCV_STOPPED)
 	{
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 5913b580c2..ba18b1f93f 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -306,8 +306,7 @@ extern void WalRcvShmemInit(void);
 extern void ShutdownWalRcv(void);
 extern bool WalRcvStreaming(void);
 extern bool WalRcvRunning(void);
-extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
-					 const char *conninfo, const char *slotname);
+extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr);
 extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
#2Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#1)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On December 11, 2018 9:30:42 PM PST, Michael Paquier <michael@paquier.xyz> wrote:

Hi all,

Since 2dedf4d, recovery.conf is dead and all recovery parameters are
now
GUCs. While looking at a patch to switch primary_conninfo and
primary_slot_name to be reloadable, Sergei Kornilov had a very good
point that the WAL receiver uses a connection string and a physical
slot
name based on what the startup process wants the WAL receiver to use:
/messages/by-id/20181212043208.GI17695@paquier.xyz

It seems to me that doing so is now strange as the WAL receiver knows
about the GUC context, and actually it knows the parameters it should
use, so there is no point to pass down the values when requesting the
WAL receiver to start.

What do you think about the attached to simplify the logic? Even if
primary_conninfo and primary_slot_name are not switched to SIGHUP this
cleanup looks like a good thing to me.

I am not convinced this is a good idea. This allows the state of walrcv and startup to diverge, they could e.g. have different configuration, due to differently time config reloads. And they need to communicate via shmem anyway, so there's not a lot of complexity avoided. And I think it's good to be able to show the active connection via functions, rather than the one currently in pg.conf, which might be different.

Andres

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.

#3Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#2)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Tue, Dec 11, 2018 at 09:34:58PM -0800, Andres Freund wrote:

I am not convinced this is a good idea. This allows the state of
walrcv and startup to diverge, they could e.g. have different
configuration, due to differently time config reloads. And they need
to communicate via shmem anyway, so there's not a lot of complexity
avoided. And I think it's good to be able to show the active
connection via functions, rather than the one currently in pg.conf,
which might be different.

Well, the conninfo and slot name accessible to the user are the values
available only once the information of the WAL receiver in shared memory
is ready to be displayed. Relying more on the GUC context has the
advantage to never have in shared memory the original string and only
store the clobbered one, which actually simplifies what's stored in
shared memory because you can entirely remove ready_to_display (I forgot
to remove that in the patch actually). If those parameters become
reloadable, you actually rely only on what the param context has, and
not on what the shared memory context may have or not.

Removing entirely ready_to_display is quite appealing actually..
--
Michael

#4Simon Riggs
simon@2ndquadrant.com
In reply to: Andres Freund (#2)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Wed, 12 Dec 2018 at 05:35, Andres Freund <andres@anarazel.de> wrote:

What do you think about the attached to simplify the logic? Even if
primary_conninfo and primary_slot_name are not switched to SIGHUP this
cleanup looks like a good thing to me.

I am not convinced this is a good idea. This allows the state of walrcv
and startup to diverge, they could e.g. have different configuration, due
to differently time config reloads.

That sounds bad, but most parameters apply to one or the other, not both.

If there are some that apply to both, then yes, coordination would be
important.

It does seem likely that the new scheme will require us to look carefully
at when parameters are reloaded, since the timing of reloads was never
taken into account in the previous coding.

--
Simon Riggs http://www.2ndQuadrant.com/
<http://www.2ndquadrant.com/&gt;
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

In reply to: Andres Freund (#2)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

Hello

This allows the state of walrcv and startup to diverge, they could e.g. have different configuration, due to differently time config reloads.

So we have exactly same problem with, for example, hot_standby_feedback, right?
We change hot_standby_feedback value, reload it and we can have 'show hot_standby_feedback' different to currently running walreceiver.
But why we have nothing about hot_standby_feedback in pg_stat_get_wal_receiver()?
Where is difference?

regards, Sergei

#6Michael Paquier
michael@paquier.xyz
In reply to: Sergei Kornilov (#5)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Wed, Dec 12, 2018 at 06:55:04PM +0300, Sergei Kornilov wrote:

This allows the state of walrcv and startup to diverge, they could
e.g. have different configuration, due to differently time config
reloads.

So we have exactly same problem with, for example, hot_standby_feedback, right?
We change hot_standby_feedback value, reload it and we can have 'show
hot_standby_feedback' different to currently running walreceiver. But
why we have nothing about hot_standby_feedback in pg_stat_get_wal_receiver()?
Where is difference?

The difference is in the fact that a WAL receiver uses the connection
string and the slot name when establishing the connection when using
START_STREAMING through the replication protocol, and
hot_standby_feedback gets used depending on the context of the messages
exchanged.
--
Michael

#7Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#3)
1 attachment(s)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Wed, Dec 12, 2018 at 02:55:17PM +0900, Michael Paquier wrote:

Well, the conninfo and slot name accessible to the user are the values
available only once the information of the WAL receiver in shared memory
is ready to be displayed. Relying more on the GUC context has the
advantage to never have in shared memory the original string and only
store the clobbered one, which actually simplifies what's stored in
shared memory because you can entirely remove ready_to_display (I forgot
to remove that in the patch actually). If those parameters become
reloadable, you actually rely only on what the param context has, and
not on what the shared memory context may have or not.

Removing entirely ready_to_display is quite appealing actually..

I have been looking at that stuff this morning, and indeed
ready_for_display can be entirely removed. I think that's quite an
advantage as WalRcv->conninfo only stores the connection string
cloberred to hide security-sensitive fields and it never has to save the
initial state which could have sensitive data, simplifying how the WAL
receiver data is filled. I am adding that to the next commit fest.
--
Michael

Attachments:

walreceiver-gucs-v2.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index c80b14ed97..9d664559e8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11848,8 +11848,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 									 tli, curFileTLI);
 						}
 						curFileTLI = tli;
-						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
-											 PrimarySlotName);
+						RequestXLogStreaming(tli, ptr);
 						receivedUpto = 0;
 					}
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 9643c2ed7b..1b077c9a3d 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -188,9 +188,7 @@ DisableWalRcvImmediateExit(void)
 void
 WalReceiverMain(void)
 {
-	char		conninfo[MAXCONNINFO];
 	char	   *tmp_conninfo;
-	char		slotname[NAMEDATALEN];
 	XLogRecPtr	startpoint;
 	TimeLineID	startpointTLI;
 	TimeLineID	primaryTLI;
@@ -248,10 +246,6 @@ WalReceiverMain(void)
 	walrcv->pid = MyProcPid;
 	walrcv->walRcvState = WALRCV_STREAMING;
 
-	/* Fetch information required to start streaming */
-	walrcv->ready_to_display = false;
-	strlcpy(conninfo, (char *) walrcv->conninfo, MAXCONNINFO);
-	strlcpy(slotname, (char *) walrcv->slotname, NAMEDATALEN);
 	startpoint = walrcv->receiveStart;
 	startpointTLI = walrcv->receiveStartTLI;
 
@@ -291,9 +285,14 @@ WalReceiverMain(void)
 	/* Unblock signals (they were blocked when the postmaster forked us) */
 	PG_SETMASK(&UnBlockSig);
 
+	if (PrimaryConnInfo == NULL || PrimaryConnInfo[0] == '\0')
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("cannot connect to the primary server as \"primary_conninfo\" is not defined")));
+
 	/* Establish the connection to the primary for XLOG streaming */
 	EnableWalRcvImmediateExit();
-	wrconn = walrcv_connect(conninfo, false, "walreceiver", &err);
+	wrconn = walrcv_connect(PrimaryConnInfo, false, "walreceiver", &err);
 	if (!wrconn)
 		ereport(ERROR,
 				(errmsg("could not connect to the primary server: %s", err)));
@@ -311,12 +310,15 @@ WalReceiverMain(void)
 	if (tmp_conninfo)
 		strlcpy((char *) walrcv->conninfo, tmp_conninfo, MAXCONNINFO);
 
+	memset(walrcv->slotname, 0, NAMEDATALEN);
+	if (PrimarySlotName)
+		strlcpy((char *) walrcv->slotname, PrimarySlotName, NAMEDATALEN);
+
 	memset(walrcv->sender_host, 0, NI_MAXHOST);
 	if (sender_host)
 		strlcpy((char *) walrcv->sender_host, sender_host, NI_MAXHOST);
 
 	walrcv->sender_port = sender_port;
-	walrcv->ready_to_display = true;
 	SpinLockRelease(&walrcv->mutex);
 
 	if (tmp_conninfo)
@@ -387,7 +389,8 @@ WalReceiverMain(void)
 		 */
 		options.logical = false;
 		options.startpoint = startpoint;
-		options.slotname = slotname[0] != '\0' ? slotname : NULL;
+		options.slotname = (PrimarySlotName && PrimarySlotName[0] != '\0') ?
+			PrimarySlotName : NULL;
 		options.proto.physical.startpointTLI = startpointTLI;
 		ThisTimeLineID = startpointTLI;
 		if (walrcv_startstreaming(wrconn, &options))
@@ -776,7 +779,6 @@ WalRcvDie(int code, Datum arg)
 	Assert(walrcv->pid == MyProcPid);
 	walrcv->walRcvState = WALRCV_STOPPED;
 	walrcv->pid = 0;
-	walrcv->ready_to_display = false;
 	walrcv->latch = NULL;
 	SpinLockRelease(&walrcv->mutex);
 
@@ -1374,7 +1376,6 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	Datum	   *values;
 	bool	   *nulls;
 	int			pid;
-	bool		ready_to_display;
 	WalRcvState state;
 	XLogRecPtr	receive_start_lsn;
 	TimeLineID	receive_start_tli;
@@ -1392,7 +1393,6 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	/* Take a lock to ensure value consistency */
 	SpinLockAcquire(&WalRcv->mutex);
 	pid = (int) WalRcv->pid;
-	ready_to_display = WalRcv->ready_to_display;
 	state = WalRcv->walRcvState;
 	receive_start_lsn = WalRcv->receiveStart;
 	receive_start_tli = WalRcv->receiveStartTLI;
@@ -1412,7 +1412,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	 * No WAL receiver (or not ready yet), just return a tuple with NULL
 	 * values
 	 */
-	if (pid == 0 || !ready_to_display)
+	if (pid == 0)
 		PG_RETURN_NULL();
 
 	/* determine result type */
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 67b1a074cc..82dc128d10 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -220,8 +220,7 @@ ShutdownWalRcv(void)
  * of a replication slot to acquire.
  */
 void
-RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
-					 const char *slotname)
+RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr)
 {
 	WalRcvData *walrcv = WalRcv;
 	bool		launch = false;
@@ -243,15 +242,8 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 	Assert(walrcv->walRcvState == WALRCV_STOPPED ||
 		   walrcv->walRcvState == WALRCV_WAITING);
 
-	if (conninfo != NULL)
-		strlcpy((char *) walrcv->conninfo, conninfo, MAXCONNINFO);
-	else
-		walrcv->conninfo[0] = '\0';
-
-	if (slotname != NULL)
-		strlcpy((char *) walrcv->slotname, slotname, NAMEDATALEN);
-	else
-		walrcv->slotname[0] = '\0';
+	walrcv->conninfo[0] = '\0';
+	walrcv->slotname[0] = '\0';
 
 	if (walrcv->walRcvState == WALRCV_STOPPED)
 	{
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index 5913b580c2..71a55903d9 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -104,8 +104,7 @@ typedef struct
 	TimestampTz latestWalEndTime;
 
 	/*
-	 * connection string; initially set to connect to the primary, and later
-	 * clobbered to hide security-sensitive fields.
+	 * Connection string clobbered to hide security-sensitive fields.
 	 */
 	char		conninfo[MAXCONNINFO];
 
@@ -122,9 +121,6 @@ typedef struct
 	 */
 	char		slotname[NAMEDATALEN];
 
-	/* set true once conninfo is ready to display (obfuscated pwds etc) */
-	bool		ready_to_display;
-
 	/*
 	 * Latch used by startup process to wake up walreceiver after telling it
 	 * where to start streaming (after setting receiveStart and
@@ -306,8 +302,7 @@ extern void WalRcvShmemInit(void);
 extern void ShutdownWalRcv(void);
 extern bool WalRcvStreaming(void);
 extern bool WalRcvRunning(void);
-extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
-					 const char *conninfo, const char *slotname);
+extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr);
 extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
In reply to: Michael Paquier (#7)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

Hi

Not sure i can be reviewer since this was initially my proposal.
I vote +1 for this patch. Code looks good and i think we have no reason to leave RequestXLogStreaming as-is.

regards, Sergei

#9Michael Paquier
michael@paquier.xyz
In reply to: Sergei Kornilov (#8)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Fri, Dec 14, 2018 at 12:23:09PM +0300, Sergei Kornilov wrote:

Not sure i can be reviewer since this was initially my proposal.

No issues from me if you wish to review the code. In my opinion, it is
not a problem if you are a reviewer as I wrote the code.

I vote +1 for this patch. Code looks good and i think we have no
reason to leave RequestXLogStreaming as-is.

Thanks for the input. The take on my side is to be able to remove
ready_to_display from WalRcv so let's see what others think.
--
Michael

#10Donald Dong
xdong@csumb.edu
In reply to: Michael Paquier (#3)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

Hi Michael,

Thank you for the information!

On Dec 11, 2018, at 9:55 PM, Michael Paquier <michael@paquier.xyz> wrote:

Well, the conninfo and slot name accessible to the user are the values
available only once the information of the WAL receiver in shared memory
is ready to be displayed. Relying more on the GUC context has the
advantage to never have in shared memory the original string and only
store the clobbered one, which actually simplifies what's stored in
shared memory because you can entirely remove ready_to_display (I forgot
to remove that in the patch actually). If those parameters become
reloadable, you actually rely only on what the param context has, and
not on what the shared memory context may have or not.

I wonder why do we need to have this addition?

src/backend/replication/walreceiver.c
+       memset(walrcv->slotname, 0, NAMEDATALEN);
+       if (PrimarySlotName)
+               strlcpy((char *) walrcv->slotname, PrimarySlotName, NAMEDATALEN);

src/backend/replication/walreceiver.c
288: PrimaryConnInfo == NULL || PrimaryConnInfo[0] == '\0'
291: errmsg("cannot connect to the primary server as \"primary_conninfo\" is not defined")));
392: PrimarySlotName && PrimarySlotName[0] != '\0'

src/backend/access/transam/xlog.c
11824: * If primary_conninfo is set, launch walreceiver to try
11833: PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0

I notice these patterns appear in different places. Maybe it will be better to have a common method such as pg_str_empty()?

Regards,
Donald Dong

#11Michael Paquier
michael@paquier.xyz
In reply to: Donald Dong (#10)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Wed, Jan 09, 2019 at 06:00:23AM -0800, Donald Dong wrote:

I wonder why do we need to have this addition?

src/backend/replication/walreceiver.c
+       memset(walrcv->slotname, 0, NAMEDATALEN);
+       if (PrimarySlotName)
+               strlcpy((char *) walrcv->slotname, PrimarySlotName, NAMEDATALEN);

That's much cleaner to me to clean up the field for safety before
starting the process. When requesting a WAL receiver to start,
slotname and conninfo gets zeroed before doing anything, you are right
that we could do without it actually.

I notice these patterns appear in different places. Maybe it will be
better to have a common method such as pg_str_empty()?

That's a pattern used quite a lot for many GUCs, so that's quite
separate from this patch. Perhaps it would make sense to have a macro
for that purpose for GUCs, I am not sure if that's worth it though.
--
Michael

#12Donald Dong
xdong@csumb.edu
In reply to: Michael Paquier (#11)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Jan 9, 2019, at 4:47 PM, Michael Paquier <michael@paquier.xyz> wrote:

That's much cleaner to me to clean up the field for safety before
starting the process. When requesting a WAL receiver to start,
slotname and conninfo gets zeroed before doing anything, you are right
that we could do without it actually.

Cool! I agree that cleaning up the field would make the logic cleaner.

That's a pattern used quite a lot for many GUCs, so that's quite
separate from this patch. Perhaps it would make sense to have a macro
for that purpose for GUCs, I am not sure if that's worth it though.

Yeah. I think having such macro would make the code a bit cleaner. Tho the
duplication of logic is not too significant.

#13Michael Paquier
michael@paquier.xyz
In reply to: Donald Dong (#12)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Wed, Jan 09, 2019 at 05:38:53PM -0800, Donald Dong wrote:

On Jan 9, 2019, at 4:47 PM, Michael Paquier <michael@paquier.xyz> wrote:

That's much cleaner to me to clean up the field for safety before
starting the process. When requesting a WAL receiver to start,
slotname and conninfo gets zeroed before doing anything, you are right
that we could do without it actually.

Cool! I agree that cleaning up the field would make the logic cleaner.

I was just double-checking the code, and it is possible to remove the
part from RequestXLogStreaming() which sets slotname and conninfo to
'\0', so I removed that part from my local branch to simplify the
code.
--
Michael

In reply to: Michael Paquier (#13)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

Hello

I was just double-checking the code, and it is possible to remove the
part from RequestXLogStreaming() which sets slotname and conninfo to
'\0', so I removed that part from my local branch to simplify the
code.

Without both ready_to_display and cleaning in RequestXLogStreaming we do not show outdated PrimaryConnInfo in case walreceiver just started, but does not connected yet? While waiting walrcv_connect for example.

regards, Sergei

#15Michael Paquier
michael@paquier.xyz
In reply to: Sergei Kornilov (#14)
1 attachment(s)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Thu, Jan 10, 2019 at 01:10:16PM +0300, Sergei Kornilov wrote:

Without both ready_to_display and cleaning in RequestXLogStreaming
we do not show outdated PrimaryConnInfo in case walreceiver just
started, but does not connected yet? While waiting walrcv_connect
for example.

Good point. There is a gap between the moment the WAL receiver PID is
set when the WAL receiver starts up and the moment where the different
fields are reset to 0 which is not good as it could be possible that
the user sees the information from the previous WAL receiver, so we
should move the memset calls when the PID is set, reaching a state
where the PID is alive but where there is no connection yet. The port
number needs a similar treatment.

Looking closer at the code, it seems to me that it would be a good
thing if we update the shared memory state when the WAL receiver dies,
so as not only the PID is set to 0, but all connection-related
information gets the call.

With all those comments I am finishing with the updated attached.
Thoughts?
--
Michael

Attachments:

walreceiver-gucs-v3.patchtext/x-diff; charset=us-asciiDownload
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9823b75767..4191b23621 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -11857,8 +11857,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 									 tli, curFileTLI);
 						}
 						curFileTLI = tli;
-						RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
-											 PrimarySlotName);
+						RequestXLogStreaming(tli, ptr);
 						receivedUpto = 0;
 					}
 
diff --git a/src/backend/replication/walreceiver.c b/src/backend/replication/walreceiver.c
index 2e90944ad5..7e3ff63a44 100644
--- a/src/backend/replication/walreceiver.c
+++ b/src/backend/replication/walreceiver.c
@@ -188,9 +188,7 @@ DisableWalRcvImmediateExit(void)
 void
 WalReceiverMain(void)
 {
-	char		conninfo[MAXCONNINFO];
 	char	   *tmp_conninfo;
-	char		slotname[NAMEDATALEN];
 	XLogRecPtr	startpoint;
 	TimeLineID	startpointTLI;
 	TimeLineID	primaryTLI;
@@ -248,10 +246,6 @@ WalReceiverMain(void)
 	walrcv->pid = MyProcPid;
 	walrcv->walRcvState = WALRCV_STREAMING;
 
-	/* Fetch information required to start streaming */
-	walrcv->ready_to_display = false;
-	strlcpy(conninfo, (char *) walrcv->conninfo, MAXCONNINFO);
-	strlcpy(slotname, (char *) walrcv->slotname, NAMEDATALEN);
 	startpoint = walrcv->receiveStart;
 	startpointTLI = walrcv->receiveStartTLI;
 
@@ -262,6 +256,16 @@ WalReceiverMain(void)
 	/* Report the latch to use to awaken this process */
 	walrcv->latch = &MyProc->procLatch;
 
+	/*
+	 * Reset all connection information when the PID is set, which makes
+	 * the information visible at SQL level, still we are not connected
+	 * yet.
+	 */
+	memset(walrcv->conninfo, 0, MAXCONNINFO);
+	memset(walrcv->slotname, 0, NAMEDATALEN);
+	memset(walrcv->sender_host, 0, NI_MAXHOST);
+	walrcv->sender_port = 0;
+
 	SpinLockRelease(&walrcv->mutex);
 
 	/* Arrange to clean up at walreceiver exit */
@@ -291,32 +295,36 @@ WalReceiverMain(void)
 	/* Unblock signals (they were blocked when the postmaster forked us) */
 	PG_SETMASK(&UnBlockSig);
 
+	if (PrimaryConnInfo == NULL || PrimaryConnInfo[0] == '\0')
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("cannot connect to the primary server as \"primary_conninfo\" is not defined")));
+
 	/* Establish the connection to the primary for XLOG streaming */
 	EnableWalRcvImmediateExit();
-	wrconn = walrcv_connect(conninfo, false, "walreceiver", &err);
+	wrconn = walrcv_connect(PrimaryConnInfo, false, "walreceiver", &err);
 	if (!wrconn)
 		ereport(ERROR,
 				(errmsg("could not connect to the primary server: %s", err)));
 	DisableWalRcvImmediateExit();
 
 	/*
-	 * Save user-visible connection string.  This clobbers the original
-	 * conninfo, for security. Also save host and port of the sender server
-	 * this walreceiver is connected to.
+	 * Save user-visible connection string.  Also save host and port of the
+	 * sender server this walreceiver is connected to.
 	 */
 	tmp_conninfo = walrcv_get_conninfo(wrconn);
 	walrcv_get_senderinfo(wrconn, &sender_host, &sender_port);
 	SpinLockAcquire(&walrcv->mutex);
-	memset(walrcv->conninfo, 0, MAXCONNINFO);
 	if (tmp_conninfo)
 		strlcpy((char *) walrcv->conninfo, tmp_conninfo, MAXCONNINFO);
 
-	memset(walrcv->sender_host, 0, NI_MAXHOST);
+	if (PrimarySlotName)
+		strlcpy((char *) walrcv->slotname, PrimarySlotName, NAMEDATALEN);
+
 	if (sender_host)
 		strlcpy((char *) walrcv->sender_host, sender_host, NI_MAXHOST);
 
 	walrcv->sender_port = sender_port;
-	walrcv->ready_to_display = true;
 	SpinLockRelease(&walrcv->mutex);
 
 	if (tmp_conninfo)
@@ -387,7 +395,8 @@ WalReceiverMain(void)
 		 */
 		options.logical = false;
 		options.startpoint = startpoint;
-		options.slotname = slotname[0] != '\0' ? slotname : NULL;
+		options.slotname = (PrimarySlotName && PrimarySlotName[0] != '\0') ?
+			PrimarySlotName : NULL;
 		options.proto.physical.startpointTLI = startpointTLI;
 		ThisTimeLineID = startpointTLI;
 		if (walrcv_startstreaming(wrconn, &options))
@@ -776,8 +785,17 @@ WalRcvDie(int code, Datum arg)
 	Assert(walrcv->pid == MyProcPid);
 	walrcv->walRcvState = WALRCV_STOPPED;
 	walrcv->pid = 0;
-	walrcv->ready_to_display = false;
 	walrcv->latch = NULL;
+
+	/*
+	 * Reset all connection information when the PID is reset, which makes
+	 * the shared memory state consistent with the rest, even if this it not
+	 * visible at SQL level.
+	 */
+	memset(walrcv->conninfo, 0, MAXCONNINFO);
+	memset(walrcv->slotname, 0, NAMEDATALEN);
+	memset(walrcv->sender_host, 0, NI_MAXHOST);
+	walrcv->sender_port = 0;
 	SpinLockRelease(&walrcv->mutex);
 
 	/* Terminate the connection gracefully. */
@@ -1374,7 +1392,6 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	Datum	   *values;
 	bool	   *nulls;
 	int			pid;
-	bool		ready_to_display;
 	WalRcvState state;
 	XLogRecPtr	receive_start_lsn;
 	TimeLineID	receive_start_tli;
@@ -1392,7 +1409,6 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	/* Take a lock to ensure value consistency */
 	SpinLockAcquire(&WalRcv->mutex);
 	pid = (int) WalRcv->pid;
-	ready_to_display = WalRcv->ready_to_display;
 	state = WalRcv->walRcvState;
 	receive_start_lsn = WalRcv->receiveStart;
 	receive_start_tli = WalRcv->receiveStartTLI;
@@ -1412,7 +1428,7 @@ pg_stat_get_wal_receiver(PG_FUNCTION_ARGS)
 	 * No WAL receiver (or not ready yet), just return a tuple with NULL
 	 * values
 	 */
-	if (pid == 0 || !ready_to_display)
+	if (pid == 0)
 		PG_RETURN_NULL();
 
 	/* determine result type */
diff --git a/src/backend/replication/walreceiverfuncs.c b/src/backend/replication/walreceiverfuncs.c
index 2d6cdfe0a2..68d37eb2c8 100644
--- a/src/backend/replication/walreceiverfuncs.c
+++ b/src/backend/replication/walreceiverfuncs.c
@@ -220,8 +220,7 @@ ShutdownWalRcv(void)
  * of a replication slot to acquire.
  */
 void
-RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
-					 const char *slotname)
+RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr)
 {
 	WalRcvData *walrcv = WalRcv;
 	bool		launch = false;
@@ -243,16 +242,6 @@ RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr, const char *conninfo,
 	Assert(walrcv->walRcvState == WALRCV_STOPPED ||
 		   walrcv->walRcvState == WALRCV_WAITING);
 
-	if (conninfo != NULL)
-		strlcpy((char *) walrcv->conninfo, conninfo, MAXCONNINFO);
-	else
-		walrcv->conninfo[0] = '\0';
-
-	if (slotname != NULL)
-		strlcpy((char *) walrcv->slotname, slotname, NAMEDATALEN);
-	else
-		walrcv->slotname[0] = '\0';
-
 	if (walrcv->walRcvState == WALRCV_STOPPED)
 	{
 		launch = true;
diff --git a/src/include/replication/walreceiver.h b/src/include/replication/walreceiver.h
index e04d725ff5..cb62f40c07 100644
--- a/src/include/replication/walreceiver.h
+++ b/src/include/replication/walreceiver.h
@@ -104,8 +104,7 @@ typedef struct
 	TimestampTz latestWalEndTime;
 
 	/*
-	 * connection string; initially set to connect to the primary, and later
-	 * clobbered to hide security-sensitive fields.
+	 * Connection string clobbered to hide security-sensitive fields.
 	 */
 	char		conninfo[MAXCONNINFO];
 
@@ -122,9 +121,6 @@ typedef struct
 	 */
 	char		slotname[NAMEDATALEN];
 
-	/* set true once conninfo is ready to display (obfuscated pwds etc) */
-	bool		ready_to_display;
-
 	/*
 	 * Latch used by startup process to wake up walreceiver after telling it
 	 * where to start streaming (after setting receiveStart and
@@ -306,8 +302,7 @@ extern void WalRcvShmemInit(void);
 extern void ShutdownWalRcv(void);
 extern bool WalRcvStreaming(void);
 extern bool WalRcvRunning(void);
-extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr,
-					 const char *conninfo, const char *slotname);
+extern void RequestXLogStreaming(TimeLineID tli, XLogRecPtr recptr);
 extern XLogRecPtr GetWalRcvWriteRecPtr(XLogRecPtr *latestChunkStart, TimeLineID *receiveTLI);
 extern int	GetReplicationApplyDelay(void);
 extern int	GetReplicationTransferLatency(void);
In reply to: Michael Paquier (#15)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

Hi

Thank you, patch looks good for me.

regards, Sergei

#17Michael Paquier
michael@paquier.xyz
In reply to: Sergei Kornilov (#16)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Thu, Jan 10, 2019 at 02:20:27PM +0300, Sergei Kornilov wrote:

Thank you, patch looks good for me.

Thanks Sergei for the review. I have been looking at the patch again
this morning with a fresh set of eyes and the thing looks in a
committable shape (the GUC values for NULL checks are a bit
inconsistent in the last patch by the way, so I have fixed them
locally).

I'll do an extra pass on it in the next couple of days and commit.
Then let's move on with the reloadable portions.
--
Michael

#18Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#17)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

Hi,

On 2019-01-11 09:34:28 +0900, Michael Paquier wrote:

On Thu, Jan 10, 2019 at 02:20:27PM +0300, Sergei Kornilov wrote:

Thank you, patch looks good for me.

Thanks Sergei for the review. I have been looking at the patch again
this morning with a fresh set of eyes and the thing looks in a
committable shape (the GUC values for NULL checks are a bit
inconsistent in the last patch by the way, so I have fixed them
locally).

I'll do an extra pass on it in the next couple of days and commit.
Then let's move on with the reloadable portions.

I still think this whole direction of accessing the GUC in walreceiver
is a bad idea and shouldn't be pursued further. There's definite
potential for startup process and WAL receiver having different states
of GUCs, the code doesn't get meaningfully simpler, the GUC value checks
in walreceiver make for horrible reporting up the chain.

Greetings,

Andres Freund

#19Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#18)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Thu, Jan 10, 2019 at 04:41:47PM -0800, Andres Freund wrote:

I still think this whole direction of accessing the GUC in walreceiver
is a bad idea and shouldn't be pursued further. There's definite
potential for startup process and WAL receiver having different states
of GUCs, the code doesn't get meaningfully simpler, the GUC value checks
in walreceiver make for horrible reporting up the chain.

Did you notice the set of messages from upthread? The code *gets*
simpler by removing ready_to_display and the need to manipulate the
non-clobbered connection string sent directly from the startup
process. In my opinion that's a clear gain. We gain also the
possibility to track down that a WAL receiver is started but not
connected yet for monitoring tools.
--
Michael

#20Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#19)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On 2019-01-11 09:50:49 +0900, Michael Paquier wrote:

On Thu, Jan 10, 2019 at 04:41:47PM -0800, Andres Freund wrote:

I still think this whole direction of accessing the GUC in walreceiver
is a bad idea and shouldn't be pursued further. There's definite
potential for startup process and WAL receiver having different states
of GUCs, the code doesn't get meaningfully simpler, the GUC value checks
in walreceiver make for horrible reporting up the chain.

Did you notice the set of messages from upthread? The code *gets*
simpler by removing ready_to_display and the need to manipulate the
non-clobbered connection string sent directly from the startup
process.

It's a minor simplification for code that's existed for quite a
while. Not worth it.

#21Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#20)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Thu, Jan 10, 2019 at 04:52:53PM -0800, Andres Freund wrote:

It's a minor simplification for code that's existed for quite a
while. Not worth it.

Meh, I disagree for the ready_to_display part as it has been around
for a long time:
commit: 9ed551e0a4fdb046d8e5ea779adfd3e184d5dc0d
author: Alvaro Herrera <alvherre@alvh.no-ip.org>
date: Wed, 29 Jun 2016 16:57:17 -0400
Add conninfo to pg_stat_wal_receiver

I was honestly unhappy from the start with it because there was no
actual way to have the WAL receiver *not* save the original connection
string. In my opinion, getting rid of it is worth it because this
really simplifies the way the WAL receiver data visibility is handled
at SQL level and we don't finish by using again and again the same
field for different purposes, consolidating the code.

For the reloading part, we also make the WAL receiver rely on the GUC
context and stop it if there is a change in primary_conninfo and
primary_slot_name. It would be inconsistent to rely on the GUC
context for one thing, and the startup process for the other one.
Adding a safeguard to fail WAL receiver startup is actually more of
sanity check that anything else (that could be an assertion but for
forks a failure is of better benefit).

At this stage, I would like to hear more opinions before doing or not
doing something. Alvaro, you got involved in the introduction of
ready_to_siplay. Peter, you have committed the merge of the recovery
params. Perhaps you have an opinion to share?
--
Michael

#22Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#21)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Fri, Jan 11, 2019 at 10:09:04AM +0900, Michael Paquier wrote:

On Thu, Jan 10, 2019 at 04:52:53PM -0800, Andres Freund wrote:

It's a minor simplification for code that's existed for quite a
while. Not worth it.

Meh, I disagree for the ready_to_display part as it has been around
for a long time:
commit: 9ed551e0a4fdb046d8e5ea779adfd3e184d5dc0d
author: Alvaro Herrera <alvherre@alvh.no-ip.org>
date: Wed, 29 Jun 2016 16:57:17 -0400
Add conninfo to pg_stat_wal_receiver

This should read "ready_to_display has not been around for a long
time" :)
--
Michael

#23Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#18)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Thu, Jan 10, 2019 at 7:42 PM Andres Freund <andres@anarazel.de> wrote:

I still think this whole direction of accessing the GUC in walreceiver
is a bad idea and shouldn't be pursued further. There's definite
potential for startup process and WAL receiver having different states
of GUCs, the code doesn't get meaningfully simpler, the GUC value checks
in walreceiver make for horrible reporting up the chain.

I'm trying to assess the validity of this complaint. It seems to me
that it might advance the discussion to be more clear about what the
problem is here. When pg_ctl reload is executed (or the equivalent),
different backends reload the file at different times. Currently,
because the values are explicitly passed down from the startup process
to the walreceiver, it's guaranteed that they will both use the same
value. With this change, one might see an older value and the other
the current value. So there's room to be nervous about code like
this:

if (PrimaryConnInfo && strcmp(PrimaryConnInfo, "") != 0)
{
... assorted other code ...
RequestXLogStreaming(tli, ptr, PrimaryConnInfo,
PrimarySlotName);
receivedUpto = 0;
}

With the patch, the PrimaryConnInfo and PrimarySlotName arguments are
removed from RequestXLogStreaming. That means that the new
walreceiver could come to a different conclusion about the values of
those arguments than the startup process. In particular, it could end
up thinking that primary_conninfo is empty when, if the startup
process had thought that, the walreceiver would never have been
launched in the first place. But it's not obvious that you've added
any logic in WALReceiverMain or elsewhere to compensate for this
possibility -- what would happen in that case? Would we crash?
Connect to the wrong server?

I might be wrong here, but I'm inclined to think that this scenario
hasn't really been contemplated carefully by the patch authors. There
are no added TAP tests for the scenario where the values differ
between the two processes, no code comments which explain why it's OK
if that happens, really no mention of it in the patch at all. And on
that basis I'm inclined to think that Andres is really quite correct
to be worried about this. The problem he's talking about here is very
low-probability because the race condition is narrow, but it's real,
and it surely needs to be handled somehow.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#24Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#23)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

Hi,

Thanks for chiming in.

On 2019-01-11 12:52:08 -0500, Robert Haas wrote:

And on that basis I'm inclined to think that Andres is really quite
correct to be worried about this. The problem he's talking about here
is very low-probability because the race condition is narrow, but it's
real, and it surely needs to be handled somehow.

I think it's also going to get more likely with projects we really got
to tackle like providing more builtin support for failing over to
different systems and such.

Greetings,

Andres Freund

In reply to: Robert Haas (#23)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

<div>Hello</div><div><br /></div>&gt; pg_ctl reload is executed (or the equivalent),<div>&gt; different backends reload the file at different times.<br />&gt; There are no added TAP tests for the scenario where the values differ between the two processes, no code comments which explain why it's OK</div><div><br /></div><div>But wait, connection string can not be changed via reload, only during cluster start. How it can differs between processes? </div><div><br />regards, Sergei</div>

#26Andres Freund
andres@anarazel.de
In reply to: Sergei Kornilov (#25)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

Hi,

On 2019-01-12 01:55:23 +0300, Sergei Kornilov wrote:

pg_ctl reload is executed (or the equivalent),
different backends reload the file at different times.
There are no added TAP tests for the scenario where the values differ between

the two processes, no code comments which explain why it's OK

But wait, connection string can not be changed via reload, only during cluster
start. How it can differs between processes?

One of the major reasons for moving recovery parameters from
recovery.conf to GUCs was to make them changable at runtime in the
future.

Greetings,

Andres Freund

#27Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#23)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Fri, Jan 11, 2019 at 12:52:08PM -0500, Robert Haas wrote:

With the patch, the PrimaryConnInfo and PrimarySlotName arguments are
removed from RequestXLogStreaming. That means that the new
walreceiver could come to a different conclusion about the values of
those arguments than the startup process. In particular, it could end
up thinking that primary_conninfo is empty when, if the startup
process had thought that, the walreceiver would never have been
launched in the first place. But it's not obvious that you've added
any logic in WALReceiverMain or elsewhere to compensate for this
possibility -- what would happen in that case? Would we crash?
Connect to the wrong server?

If I contemplate the patch this morning there is this bit:
@@ -291,32 +295,40 @@ WalReceiverMain(void)
/* Unblock signals (they were blocked when the postmaster forked
us) */
PG_SETMASK(&UnBlockSig);

+   /*
+    * Fail immediately if primary_conninfo goes missing, better safe than
+    * sorry.
+    */
+   if (PrimaryConnInfo == NULL || strcmp(PrimaryConnInfo, "") == 0)
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("cannot connect to the primary server as \"primary_conninfo\" is not defined")));

So the answer to your question is: the WAL receiver fails to start.

I might be wrong here, but I'm inclined to think that this scenario
hasn't really been contemplated carefully by the patch authors. There
are no added TAP tests for the scenario where the values differ
between the two processes, no code comments which explain why it's OK
if that happens, really no mention of it in the patch at all. And on
that basis I'm inclined to think that Andres is really quite correct
to be worried about this. The problem he's talking about here is very
low-probability because the race condition is narrow, but it's real,
and it surely needs to be handled somehow.

primary_conninfo and primary_slot_name are PGC_POSTMASTER now, so
adding tests now don't really make sense.
--
Michael

#28Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#27)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Sat, Jan 12, 2019 at 08:10:07AM +0900, Michael Paquier wrote:

On Fri, Jan 11, 2019 at 12:52:08PM -0500, Robert Haas wrote:

With the patch, the PrimaryConnInfo and PrimarySlotName arguments are
removed from RequestXLogStreaming. That means that the new
walreceiver could come to a different conclusion about the values of
those arguments than the startup process. In particular, it could end
up thinking that primary_conninfo is empty when, if the startup
process had thought that, the walreceiver would never have been
launched in the first place. But it's not obvious that you've added
any logic in WALReceiverMain or elsewhere to compensate for this
possibility -- what would happen in that case? Would we crash?
Connect to the wrong server?

If I contemplate the patch this morning there is this bit:
@@ -291,32 +295,40 @@ WalReceiverMain(void)
/* Unblock signals (they were blocked when the postmaster forked
us) */
PG_SETMASK(&UnBlockSig);

+   /*
+    * Fail immediately if primary_conninfo goes missing, better safe than
+    * sorry.
+    */
+   if (PrimaryConnInfo == NULL || strcmp(PrimaryConnInfo, "") == 0)
+       ereport(ERROR,
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("cannot connect to the primary server as \"primary_conninfo\" is not defined")));

So the answer to your question is: the WAL receiver fails to start.

Robert, does this answer your question? I agree that depending on the
timing, we could finish by having the startup process spawning a WAL
receiver with a given connection string, and that the WAL receiver
could use a different one, but as long as we fail the WAL receiver
startup this does not seem like an issue to me, so I disagree with the
upthread statement that the patch author has not thought about such
cases :)

It seems to me that making the WAL receiver relying more on the GUC
context makes it more robust when it comes to reloading the parameters
which would be an upcoming change as we need to rely on the WAL
receiver check the GUC context itself and FATAL (or we would need to
have the startup process check for a change in
primary_conninfo/primary_slot_name, and then have the startup process
kill the WAL receiver by itself). In short, handling all that in the
WAL receiver would be more robust in the long term in my opinion as we
reduce interactions between the WAL receiver and the startup process.
And on top of it we get rid of ready_to_display, which is something I
am unhappy with since we had to introduce it.
--
Michael

#29Robert Haas
robertmhaas@gmail.com
In reply to: Michael Paquier (#28)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Tue, Jan 15, 2019 at 8:48 PM Michael Paquier <michael@paquier.xyz> wrote:

So the answer to your question is: the WAL receiver fails to start.

Robert, does this answer your question? I agree that depending on the
timing, we could finish by having the startup process spawning a WAL
receiver with a given connection string, and that the WAL receiver
could use a different one, but as long as we fail the WAL receiver
startup this does not seem like an issue to me, so I disagree with the
upthread statement that the patch author has not thought about such
cases :)

OK, if that was in the patch, I dunno why I didn't see it. I really
didn't think it was there, but if I'm wrong, then I am.

It seems to me that making the WAL receiver relying more on the GUC
context makes it more robust when it comes to reloading the parameters
which would be an upcoming change as we need to rely on the WAL
receiver check the GUC context itself and FATAL (or we would need to
have the startup process check for a change in
primary_conninfo/primary_slot_name, and then have the startup process
kill the WAL receiver by itself). In short, handling all that in the
WAL receiver would be more robust in the long term in my opinion as we
reduce interactions between the WAL receiver and the startup process.
And on top of it we get rid of ready_to_display, which is something I
am unhappy with since we had to introduce it.

I think you have some valid points here, but I also think that the
patch as written doesn't really seem to have any user-visible
benefits. Sure, ready_to_display is a crock, but getting rid of it
doesn't immediately help anybody. And on the flip side your patch
might have bugs, in which case we'll be worse off. I'm not going to
stand on my soapbox and say that committing this patch is a terrible
idea, because as far as I can tell, it isn't. But it feels like it
might be a commit for the sake of making a commit, and I can't get too
excited about that either. Since the logic will have to be further
revised anyway to make primary_conninfo PGC_SIGHUP, why not just wait
until that's done and then commit all the changes together instead of
guessing that this might make that easier?

If this does get committed now, and count me as about -0.25 for that,
then I at least think it should have some comments clearly addressing
the synchronization issues. Unless those are also in the patch and I
missed them, too, but I hope I'm not that dense.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#30Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#29)
Re: Making WAL receiver startup rely on GUC context for primary_conninfo and primary_slot_name

On Wed, Jan 16, 2019 at 11:02:48AM -0500, Robert Haas wrote:

I think you have some valid points here, but I also think that the
patch as written doesn't really seem to have any user-visible
benefits. Sure, ready_to_display is a crock, but getting rid of it
doesn't immediately help anybody. And on the flip side your patch
might have bugs, in which case we'll be worse off. I'm not going to
stand on my soapbox and say that committing this patch is a terrible
idea, because as far as I can tell, it isn't. But it feels like it
might be a commit for the sake of making a commit, and I can't get too
excited about that either. Since the logic will have to be further
revised anyway to make primary_conninfo PGC_SIGHUP, why not just wait
until that's done and then commit all the changes together instead of
guessing that this might make that easier?

You can say that about any patch which gets committed, even
refactoring patches have a risk of introducing bugs, and even subtle
ones understood only after release.

I was justifying the existence of this thread exactly for opposite
reasons. After reviewing the other patch switch to reload the
parameters we could do more, and spawning a new thread to attract the
correct audience looked rather right (it still does):
/messages/by-id/20181212043208.GI17695@paquier.xyz

And this refactoring seemed to be justified as part of a different
thread. I don't mind dropping this patch and this thread and just go
back there, but that seems like taking steps backward, and what's
proposed here is a bit different than just making the parameters
reloadable. Still the refactoring looks justified to me for
correctness with the parameter handling.

Anyway, based on the opinions gathered until now, I don't mind just
dropping this patch and move on to the other thread, though we will
likely finish with the same discussion as what's done here :)

A patch on which two committers are expressing concerns about is as
good as going to the waste bin. That's of course fine by me.
--
Michael