Outdated replication protocol error?
Commit 5ee2197767 (about 4 years old) introduced the error:
"IDENTIFY_SYSTEM has not been run before START_REPLICATION"
But it seems like running START_REPLICATION first works (at least in
the simple case).
We should either:
1. Document that IDENTIFY_SYSTEM must always be run before
START_REPLICATION, and always issue a WARNING if that's not done (an
ERROR might break existing applications); or
2. If the commit is out of date and no longer needed, or if it's easy
enough to fix, just remove the error (and Assert a valid
ThisTimeLineID).
Regards,
Jeff Davis
On 2021/01/12 9:06, Jeff Davis wrote:
Commit 5ee2197767 (about 4 years old) introduced the error:
"IDENTIFY_SYSTEM has not been run before START_REPLICATION"
But it seems like running START_REPLICATION first works (at least in
the simple case).We should either:
1. Document that IDENTIFY_SYSTEM must always be run before
START_REPLICATION, and always issue a WARNING if that's not done (an
ERROR might break existing applications); or2. If the commit is out of date and no longer needed, or if it's easy
enough to fix, just remove the error (and Assert a valid
ThisTimeLineID).
+1 to remove the error if START_REPLICATION can always work fine without
IDENTIFY_SYSTEM. I found that the error happens when we connect to the standby
and just run START_REPLICATION without IDENTIFY_SYSTEM. But I'm not sure
if IDENTIFY_SYSTEM is actually necessary even in that case.
Regards,
--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION
Hi,
On 2021-01-14 16:40:26 +0900, Fujii Masao wrote:
On 2021/01/12 9:06, Jeff Davis wrote:
Commit 5ee2197767 (about 4 years old) introduced the error:
"IDENTIFY_SYSTEM has not been run before START_REPLICATION"
But it seems like running START_REPLICATION first works (at least in
the simple case).We should either:
1. Document that IDENTIFY_SYSTEM must always be run before
START_REPLICATION, and always issue a WARNING if that's not done (an
ERROR might break existing applications); or2. If the commit is out of date and no longer needed, or if it's easy
enough to fix, just remove the error (and Assert a valid
ThisTimeLineID).+1 to remove the error if START_REPLICATION can always work fine without
IDENTIFY_SYSTEM. I found that the error happens when we connect to the standby
and just run START_REPLICATION without IDENTIFY_SYSTEM. But I'm not sure
if IDENTIFY_SYSTEM is actually necessary even in that case.
The current approach seems quite bad to me too. By that point the value
could be just about arbitrarily out of date - but that doesn't really
matter because GetStandbyFlushRecPtr() updates it. And for the
!am_cascading_walsender it's trivial to compute.
Has anybody dug out the thread leading to the commit?
Greetings,
Andres Freund
On Fri, 2021-01-15 at 13:55 -0800, Andres Freund wrote:
Has anybody dug out the thread leading to the commit?
/messages/by-id/CAMsr+YEN04ztb+SDb_UfD72Kg5M3F+Cpad31QBKT2rRjysmxRg@mail.gmail.com
Regards,
Jeff Davis
On Fri, 2021-01-15 at 13:55 -0800, Andres Freund wrote:
We should either:
1. Document that IDENTIFY_SYSTEM must always be run before
START_REPLICATION, and always issue a WARNING if that's not done
(an
ERROR might break existing applications); or2. If the commit is out of date and no longer needed, or if it's
easy
enough to fix, just remove the error (and Assert a valid
ThisTimeLineID).+1 to remove the error if START_REPLICATION can always work fine
without
IDENTIFY_SYSTEM. I found that the error happens when we connect to
the standby
and just run START_REPLICATION without IDENTIFY_SYSTEM. But I'm not
sure
if IDENTIFY_SYSTEM is actually necessary even in that case.The current approach seems quite bad to me too. By that point the
value
could be just about arbitrarily out of date - but that doesn't really
matter because GetStandbyFlushRecPtr() updates it. And for the
!am_cascading_walsender it's trivial to compute.
[ digging up old thread ]
It seems everyone agrees that the current behavior is strange. Any
ideas on a solution here?
Has anybody dug out the thread leading to the commit?
/messages/by-id/CAMsr+YEN04ztb+SDb_UfD72Kg5M3F+Cpad31QBKT2rRjysmxRg@mail.gmail.com
Regards,
Jeff Davis
Hi,
On 2021-06-16 15:59:11 -0700, Jeff Davis wrote:
[ digging up old thread ]
It seems everyone agrees that the current behavior is strange.
Yea. I don't remember the details, but I've also hit this problem since
in some odd circumstance while reviewing the "logical decoding on
standbys" patchset.
Any ideas on a solution here?
I think we should explicitly compute the current timeline before using
ThisTimelineID. E.g. in StartReplication() call a new version of
GetFlushRecPtr() that also returns the current timeline id.
Greetings,
Andres Freund
On Wed, 2021-06-16 at 16:17 -0700, Andres Freund wrote:
I think we should explicitly compute the current timeline before
using
ThisTimelineID. E.g. in StartReplication() call a new version of
GetFlushRecPtr() that also returns the current timeline id.
I think all we need to do is follow the pattern in IdentifySystem() by
calling:
am_cascading_walsender = RecoveryInProgress();
first. There are three cases:
1. If the server was a primary the last time RecoveryInProgress() was
called, and it's still a primary, then it returns false immediately.
ThisTimeLineID should be set properly already.
2. If the server was a secondary the last time RecoveryInProgress() was
called, and now it's a primary, then it updates ThisTimeLineID in
InitXLOGAccess() and returns false.
3. If the server was a secondary the last time, and is still a
secondary, then it returns true. Then, StartReplication() will call
GetStandbyFlushRecPtr(), which will update ThisTimeLineID.
It was confusing to me for a while because I was trying to sort out
whether am_cascading_walsender and/or ThisTimeLineID could be out of
date, and how that would result in not updating ThisTimeLineID; and
also why there was a difference between IdentifySystem() and
StartReplication().
Simple patch attached. I didn't test it yet, but wanted to post my
analysis.
Regards,
Jeff Davis
Attachments:
fix-start-replication-identify-system.difftext/x-patch; charset=UTF-8; name=fix-start-replication-identify-system.diffDownload
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 32245363561..17cb21220e0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -573,11 +573,6 @@ StartReplication(StartReplicationCmd *cmd)
StringInfoData buf;
XLogRecPtr FlushPtr;
- if (ThisTimeLineID == 0)
- ereport(ERROR,
- (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
- errmsg("IDENTIFY_SYSTEM has not been run before START_REPLICATION")));
-
/* create xlogreader for physical replication */
xlogreader =
XLogReaderAllocate(wal_segment_size, NULL,
@@ -619,6 +614,7 @@ StartReplication(StartReplicationCmd *cmd)
* that. Otherwise use the timeline of the last replayed record, which is
* kept in ThisTimeLineID.
*/
+ am_cascading_walsender = RecoveryInProgress();
if (am_cascading_walsender)
{
/* this also updates ThisTimeLineID */
Hi,
On 2021-06-17 18:13:57 -0700, Jeff Davis wrote:
On Wed, 2021-06-16 at 16:17 -0700, Andres Freund wrote:
I think we should explicitly compute the current timeline before
using
ThisTimelineID. E.g. in StartReplication() call a new version of
GetFlushRecPtr() that also returns the current timeline id.I think all we need to do is follow the pattern in IdentifySystem() by
calling:am_cascading_walsender = RecoveryInProgress();
first.
Yea, that sounds reasonable.
I'm not a fan of hiding the timeline determination inside
RecoveryInProgress(), particularly not when communicated via global
variable. But that's not the fault of this patch.
Greetings,
Andres Freund