Outdated replication protocol error?

Started by Jeff Davisabout 5 years ago8 messages
#1Jeff Davis
pgsql@j-davis.com

Commit 5ee2197767 (about 4 years old) introduced the error:

"IDENTIFY_SYSTEM has not been run before START_REPLICATION"

But it seems like running START_REPLICATION first works (at least in
the simple case).

We should either:

1. Document that IDENTIFY_SYSTEM must always be run before
START_REPLICATION, and always issue a WARNING if that's not done (an
ERROR might break existing applications); or

2. If the commit is out of date and no longer needed, or if it's easy
enough to fix, just remove the error (and Assert a valid
ThisTimeLineID).

Regards,
Jeff Davis

#2Fujii Masao
masao.fujii@oss.nttdata.com
In reply to: Jeff Davis (#1)
Re: Outdated replication protocol error?

On 2021/01/12 9:06, Jeff Davis wrote:

Commit 5ee2197767 (about 4 years old) introduced the error:

"IDENTIFY_SYSTEM has not been run before START_REPLICATION"

But it seems like running START_REPLICATION first works (at least in
the simple case).

We should either:

1. Document that IDENTIFY_SYSTEM must always be run before
START_REPLICATION, and always issue a WARNING if that's not done (an
ERROR might break existing applications); or

2. If the commit is out of date and no longer needed, or if it's easy
enough to fix, just remove the error (and Assert a valid
ThisTimeLineID).

+1 to remove the error if START_REPLICATION can always work fine without
IDENTIFY_SYSTEM. I found that the error happens when we connect to the standby
and just run START_REPLICATION without IDENTIFY_SYSTEM. But I'm not sure
if IDENTIFY_SYSTEM is actually necessary even in that case.

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#3Andres Freund
andres@anarazel.de
In reply to: Fujii Masao (#2)
Re: Outdated replication protocol error?

Hi,

On 2021-01-14 16:40:26 +0900, Fujii Masao wrote:

On 2021/01/12 9:06, Jeff Davis wrote:

Commit 5ee2197767 (about 4 years old) introduced the error:

"IDENTIFY_SYSTEM has not been run before START_REPLICATION"

But it seems like running START_REPLICATION first works (at least in
the simple case).

We should either:

1. Document that IDENTIFY_SYSTEM must always be run before
START_REPLICATION, and always issue a WARNING if that's not done (an
ERROR might break existing applications); or

2. If the commit is out of date and no longer needed, or if it's easy
enough to fix, just remove the error (and Assert a valid
ThisTimeLineID).

+1 to remove the error if START_REPLICATION can always work fine without
IDENTIFY_SYSTEM. I found that the error happens when we connect to the standby
and just run START_REPLICATION without IDENTIFY_SYSTEM. But I'm not sure
if IDENTIFY_SYSTEM is actually necessary even in that case.

The current approach seems quite bad to me too. By that point the value
could be just about arbitrarily out of date - but that doesn't really
matter because GetStandbyFlushRecPtr() updates it. And for the
!am_cascading_walsender it's trivial to compute.

Has anybody dug out the thread leading to the commit?

Greetings,

Andres Freund

#4Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#3)
Re: Outdated replication protocol error?

On Fri, 2021-01-15 at 13:55 -0800, Andres Freund wrote:

Has anybody dug out the thread leading to the commit?

/messages/by-id/CAMsr+YEN04ztb+SDb_UfD72Kg5M3F+Cpad31QBKT2rRjysmxRg@mail.gmail.com

Regards,
Jeff Davis

#5Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#3)
Re: Outdated replication protocol error?

On Fri, 2021-01-15 at 13:55 -0800, Andres Freund wrote:

We should either:

1. Document that IDENTIFY_SYSTEM must always be run before
START_REPLICATION, and always issue a WARNING if that's not done
(an
ERROR might break existing applications); or

2. If the commit is out of date and no longer needed, or if it's
easy
enough to fix, just remove the error (and Assert a valid
ThisTimeLineID).

+1 to remove the error if START_REPLICATION can always work fine
without
IDENTIFY_SYSTEM. I found that the error happens when we connect to
the standby
and just run START_REPLICATION without IDENTIFY_SYSTEM. But I'm not
sure
if IDENTIFY_SYSTEM is actually necessary even in that case.

The current approach seems quite bad to me too. By that point the
value
could be just about arbitrarily out of date - but that doesn't really
matter because GetStandbyFlushRecPtr() updates it. And for the
!am_cascading_walsender it's trivial to compute.

[ digging up old thread ]

It seems everyone agrees that the current behavior is strange. Any
ideas on a solution here?

Has anybody dug out the thread leading to the commit?

/messages/by-id/CAMsr+YEN04ztb+SDb_UfD72Kg5M3F+Cpad31QBKT2rRjysmxRg@mail.gmail.com

Regards,
Jeff Davis

#6Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#5)
Re: Outdated replication protocol error?

Hi,

On 2021-06-16 15:59:11 -0700, Jeff Davis wrote:

[ digging up old thread ]

It seems everyone agrees that the current behavior is strange.

Yea. I don't remember the details, but I've also hit this problem since
in some odd circumstance while reviewing the "logical decoding on
standbys" patchset.

Any ideas on a solution here?

I think we should explicitly compute the current timeline before using
ThisTimelineID. E.g. in StartReplication() call a new version of
GetFlushRecPtr() that also returns the current timeline id.

Greetings,

Andres Freund

#7Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#6)
1 attachment(s)
Re: Outdated replication protocol error?

On Wed, 2021-06-16 at 16:17 -0700, Andres Freund wrote:

I think we should explicitly compute the current timeline before
using
ThisTimelineID. E.g. in StartReplication() call a new version of
GetFlushRecPtr() that also returns the current timeline id.

I think all we need to do is follow the pattern in IdentifySystem() by
calling:

am_cascading_walsender = RecoveryInProgress();

first. There are three cases:

1. If the server was a primary the last time RecoveryInProgress() was
called, and it's still a primary, then it returns false immediately.
ThisTimeLineID should be set properly already.

2. If the server was a secondary the last time RecoveryInProgress() was
called, and now it's a primary, then it updates ThisTimeLineID in
InitXLOGAccess() and returns false.

3. If the server was a secondary the last time, and is still a
secondary, then it returns true. Then, StartReplication() will call
GetStandbyFlushRecPtr(), which will update ThisTimeLineID.

It was confusing to me for a while because I was trying to sort out
whether am_cascading_walsender and/or ThisTimeLineID could be out of
date, and how that would result in not updating ThisTimeLineID; and
also why there was a difference between IdentifySystem() and
StartReplication().

Simple patch attached. I didn't test it yet, but wanted to post my
analysis.

Regards,
Jeff Davis

Attachments:

fix-start-replication-identify-system.difftext/x-patch; charset=UTF-8; name=fix-start-replication-identify-system.diffDownload
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 32245363561..17cb21220e0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -573,11 +573,6 @@ StartReplication(StartReplicationCmd *cmd)
 	StringInfoData buf;
 	XLogRecPtr	FlushPtr;
 
-	if (ThisTimeLineID == 0)
-		ereport(ERROR,
-				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
-				 errmsg("IDENTIFY_SYSTEM has not been run before START_REPLICATION")));
-
 	/* create xlogreader for physical replication */
 	xlogreader =
 		XLogReaderAllocate(wal_segment_size, NULL,
@@ -619,6 +614,7 @@ StartReplication(StartReplicationCmd *cmd)
 	 * that. Otherwise use the timeline of the last replayed record, which is
 	 * kept in ThisTimeLineID.
 	 */
+	am_cascading_walsender = RecoveryInProgress();
 	if (am_cascading_walsender)
 	{
 		/* this also updates ThisTimeLineID */
#8Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#7)
Re: Outdated replication protocol error?

Hi,

On 2021-06-17 18:13:57 -0700, Jeff Davis wrote:

On Wed, 2021-06-16 at 16:17 -0700, Andres Freund wrote:

I think we should explicitly compute the current timeline before
using
ThisTimelineID. E.g. in StartReplication() call a new version of
GetFlushRecPtr() that also returns the current timeline id.

I think all we need to do is follow the pattern in IdentifySystem() by
calling:

am_cascading_walsender = RecoveryInProgress();

first.

Yea, that sounds reasonable.

I'm not a fan of hiding the timeline determination inside
RecoveryInProgress(), particularly not when communicated via global
variable. But that's not the fault of this patch.

Greetings,

Andres Freund