pg_rewind does not rewind diverging timelines

Started by Mats Kindahlabout 1 month ago11 messageshackers
Jump to latest
#1Mats Kindahl
mats.kindahl@gmail.com

Hi all,

I have been playing around with various promotion scenarios to check if it
is possible to lose writes in more complicated scenarios involving
promotions and uses of synchronous_standby_names and decided to create a
TLA+ model for streaming replication involving promotions and check those
with TLC. You can find the models at [1]https://github.com/mkindahl/tla-postgres if you're interested.

There is one scenario that I assume is known that TLC found, but does not
seem to be fixed. It is a relatively rare case, but since the fix is quite
easy, I thought I'd share it with you and get feedback.

The scenario can occur if you're unlucky and have more than one crash when
promoting standbys to be primaries, and goes like this:

You have three servers, S1, S2, and S3. S1 is primary and S2 and S3 are
standbys. All are on timeline (TLI) 1.

1. S1 crashes
2. S1 recovers and starts promotion. It writes XLOG_END_OF_RECOVERY (EOR)
for TLI 2 to the WAL.
3. S1 It manages to write some records W1 to the WAL.
4. Before the EOR is replicated to any standby, S1 crashes again. It is now
on TLI 2 and has some changes that are not elsewhere.
5. S2 is promoted. It writes an EOR for TLI 2 (since it is not aware of any
other timeline) to the WAL.
6. S2 writes some records W2 to WAL and now S1 has a record of TLI 2
version 1 (TLI 2.1) and S2 is on TLI 2.2.
7. S1 recovers and wants to join as a standby. You run pg_rewind to get rid
of the extra data, but since S2 is also on TLI 2, pg_rewind will happily
assume that both are on the same timeline.
8. S2 is now a standby but has that extra record for W2 both in the WAL and
in the database.

The fix (see attached draft) is quite simple: add a UUID to the EOR and to
the history file. When comparing timelines, don't only check the TLI, also
check the UUID. If not both match, go back further until you find a
timeline where both the TLI and the timeline UUID matches and do the usual
fandango to find the good LSN to rewind to.

[1]: https://github.com/mkindahl/tla-postgres

Attachments:

0001-pg_rewind-use-UUIDs-to-detect-independent-same-TLI-p.patch.v1application/octet-stream; name=0001-pg_rewind-use-UUIDs-to-detect-independent-same-TLI-p.patch.v1Download+349-25
#2Mats Kindahl
mats.kindahl@gmail.com
In reply to: Mats Kindahl (#1)
Re: pg_rewind does not rewind diverging timelines

On Thu, Apr 30, 2026 at 10:19 AM Mats Kindahl <mats.kindahl@gmail.com>
wrote:

Hi all,

I have been playing around with various promotion scenarios to check if it
is possible to lose writes in more complicated scenarios involving
promotions and uses of synchronous_standby_names and decided to create a
TLA+ model for streaming replication involving promotions and check those
with TLC. You can find the models at [1] if you're interested.

There is one scenario that I assume is known that TLC found, but does not
seem to be fixed. It is a relatively rare case, but since the fix is quite
easy, I thought I'd share it with you and get feedback.

The scenario can occur if you're unlucky and have more than one crash when
promoting standbys to be primaries, and goes like this:

You have three servers, S1, S2, and S3. S1 is primary and S2 and S3 are
standbys. All are on timeline (TLI) 1.

1. S1 crashes
2. S1 recovers and starts promotion. It writes XLOG_END_OF_RECOVERY (EOR)
for TLI 2 to the WAL.
3. S1 It manages to write some records W1 to the WAL.
4. Before the EOR is replicated to any standby, S1 crashes again. It is
now on TLI 2 and has some changes that are not elsewhere.
5. S2 is promoted. It writes an EOR for TLI 2 (since it is not aware of
any other timeline) to the WAL.
6. S2 writes some records W2 to WAL and now S1 has a record of TLI 2
version 1 (TLI 2.1) and S2 is on TLI 2.2.
7. S1 recovers and wants to join as a standby. You run pg_rewind to get
rid of the extra data, but since S2 is also on TLI 2, pg_rewind will
happily assume that both are on the same timeline.
8. S2 is now a standby but has that extra record for W2 both in the WAL
and in the database.

The fix (see attached draft) is quite simple: add a UUID to the EOR and to
the history file. When comparing timelines, don't only check the TLI, also
check the UUID. If not both match, go back further until you find a
timeline where both the TLI and the timeline UUID matches and do the usual
fandango to find the good LSN to rewind to.

[1]: https://github.com/mkindahl/tla-postgres

Here is an updated version of the patch. It seems like it is not necessary
to extend the XLOG_END_OF_RECOVERY record with the UUID, just the history
files. The scenario is still the same though, and can trigger diverging
servers, possibly silent. I have an additional test case using a divergence
going back three promotions.
--
Best wishes,
Mats Kindahl, Multigres Developer, Supabase

Attachments:

v2.0002-pg_rewind-test-rewind-across-UUID-mismatched-TLI.patchtext/x-patch; charset=US-ASCII; name=v2.0002-pg_rewind-test-rewind-across-UUID-mismatched-TLI.patchDownload+121-1
v2.0001-pg_rewind-use-UUIDs-to-detect-independent-same-TLI-p.patchtext/x-patch; charset=US-ASCII; name=v2.0001-pg_rewind-use-UUIDs-to-detect-independent-same-TLI-p.patchDownload+349-25
#3surya poondla
suryapoondla4@gmail.com
In reply to: Mats Kindahl (#1)
Re: pg_rewind does not rewind diverging timelines

Hi Mats,

Thanks for picking this up -- the scenario is a real one and I think the
UUID-tagging approach is a clean way to solve it. v2 applies and builds
without trouble, and the core algorithm reads well to me.
I have a handful of observations that I'd love your thoughts.

Regarding Correctness I have the below thoughts

1. UUIDv7 timestamp epoch.
In StartupXLOG():
TimestampTz now = GetCurrentTimestamp();
generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000),
(uint32)(now % 1000) * 1000);

I think there might be a small mismatch here: GetCurrentTimestamp() returns
microseconds since the Postgres epoch (2000-01-01),
whereas generate_uuidv7_r describes its first argument as milliseconds
since the Unix epoch.
As written that 30-year offset would land in the UUID's timestamp field, so
the resulting UUID wouldn't be a conformant UUIDv7 and wouldn't
time-order against UUIDv7s generated through the SQL functions.

Uniqueness is preserved either way, so the rewind logic still works as
intended but it seemed worth flagging.

I see conversion that's used elsewhere as:
us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE)
* SECS_PER_DAY * USECS_PER_SEC;

Or, since promotion isn't on a hot path, gettimeofday() / time(NULL)
directly would also be fine.

2. EOR-record path, the intent is unclear.

The comment above generate_uuidv7_r() at says:

"The same UUID is written into the history file and later into the
XLOG_END_OF_RECOVERY record so that pg_rewind can distinguish two
servers..."

But from what I can see only the history-file part actually lands.
xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't add
the UUID, and XLogCtl->ThisTimeLineUUID is written under info_lck without a
reader (I couldn't grep it).

The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads like
preparation for an EOR-struct extension that ended up not being part of the
patch.

Was the EOR-record piece something you intended to keep for a follow-up, or
has it been superseded by the history-file approach?

3. Malformed UUID handling in readTimeLineHistory().

The optional field-4 path is:

if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN)
{
Datum datum = DirectFunctionCall1(uuid_in,
CStringGetDatum(uuid_str));
...
}

uuid_in() raises ereport(ERROR) on a malformed input, while the surrounding
syntax-error paths in readTimeLineHistory() use FATAL deliberately.
In practice an ERROR during startup ends up being fatal too, so this isn't
strictly a bug but it would be nicer to stay consistent.

Regarding the Tests I have the following thoughts

The two new cases are nice, a few extensions that I think would strengthen
them:
1. A mixed-version case where one side has a zero UUID. That's the path
we're claiming is graceful, but nothing currently exercises it
2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that
findCommonAncestorTimeline's loop walks past matching entries
before hitting the mismatch. The 0002 test puts the divergence at
depth 1.
3. A small assertion against the on-disk 00000002.history contents, to pin
down the file format.
4. On 0002 the dependency on restore_command pointing at node_x's pg_wal is
the kind of thing that tends to break under
environment changes. A CHECKPOINT on node_x before the backup, or
wal_keep_size as in 0001, would let the test stand on its own.

I'm happy to keep reviewing/contributing, thanks again for working on it.

Regards,
Surya Poondla

Show quoted text
#4Mats Kindahl
mats.kindahl@gmail.com
In reply to: surya poondla (#3)
Re: pg_rewind does not rewind diverging timelines

On Fri, May 22, 2026 at 12:09 AM surya poondla <suryapoondla4@gmail.com>
wrote:

Hi Mats,

Thanks for picking this up -- the scenario is a real one and I think the
UUID-tagging approach is a clean way to solve it. v2 applies and builds
without trouble, and the core algorithm reads well to me.
I have a handful of observations that I'd love your thoughts.

Hi Surya,

Thank you for the review. It is a quite rare scenario, but it is real and
the fix is simple.

Regarding Correctness I have the below thoughts

1. UUIDv7 timestamp epoch.
In StartupXLOG():
TimestampTz now = GetCurrentTimestamp();
generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000),
(uint32)(now % 1000) * 1000);

I think there might be a small mismatch here: GetCurrentTimestamp()
returns microseconds since the Postgres epoch (2000-01-01),
whereas generate_uuidv7_r describes its first argument as milliseconds
since the Unix epoch.
As written that 30-year offset would land in the UUID's timestamp field,
so the resulting UUID wouldn't be a conformant UUIDv7 and wouldn't
time-order against UUIDv7s generated through the SQL functions.

Uniqueness is preserved either way, so the rewind logic still works as
intended but it seemed worth flagging.

I see conversion that's used elsewhere as:
us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE)
* SECS_PER_DAY * USECS_PER_SEC;

Or, since promotion isn't on a hot path, gettimeofday() / time(NULL)
directly would also be fine.

Yes, the intention was to use a proper timestamp to allow debugging servers
if necessary. Switched to gettimeofday() and used 0 for sub-ms since this
is not going to be critical. (We could use ns here as well, but that would
only solve a race if you have two servers being promoted in the same ms,
which I find unlikely, and there is a random number added for that
situation.)

2. EOR-record path, the intent is unclear.

The comment above generate_uuidv7_r() at says:

"The same UUID is written into the history file and later into the
XLOG_END_OF_RECOVERY record so that pg_rewind can distinguish two
servers..."

But from what I can see only the history-file part actually lands.
xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't add
the UUID, and XLogCtl->ThisTimeLineUUID is written under info_lck without a
reader (I couldn't grep it).

The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads like
preparation for an EOR-struct extension that ended up not being part of the
patch.

Was the EOR-record piece something you intended to keep for a follow-up,
or has it been superseded by the history-file approach?

No, the EOR changes are not needed for the promotion, contrary to what I
originally thought. Cleaned up the comment and the code and removed all
traces of changes to the EOR (I hope).

3. Malformed UUID handling in readTimeLineHistory().

The optional field-4 path is:

if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN)
{
Datum datum = DirectFunctionCall1(uuid_in,
CStringGetDatum(uuid_str));
...
}

uuid_in() raises ereport(ERROR) on a malformed input, while the
surrounding syntax-error paths in readTimeLineHistory() use FATAL
deliberately.
In practice an ERROR during startup ends up being fatal too, so this isn't
strictly a bug but it would be nicer to stay consistent.

Agree. I added code to capture the error and raise a FATAL instead (with
the error message from the uuid_in, in case it is modified it makes sense
to show this).

Regarding the Tests I have the following thoughts

The two new cases are nice, a few extensions that I think would strengthen
them:
1. A mixed-version case where one side has a zero UUID. That's the path
we're claiming is graceful, but nothing currently exercises it

Yes, that should work regardless of whether the source or the target has
the zero UUID.

I realized one thing: if two timelines have identical TLI but one has zero
UUID and one has not, it seems they could not come from the same promotion
(one promotion happened on an old server and the other one on a new
server), that is, they should be treated as different. Does that make
sense? I made the necessary changes in the attached patches for testing.
Please have a look.

2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that
findCommonAncestorTimeline's loop walks past matching entries
before hitting the mismatch. The 0002 test puts the divergence at
depth 1.

I was unsure if this test was necessary or interesting, hence a separate
commit. Since you thought it was useful, it's now rolled into the patch and
I extended the tests with the scenarios you suggested.

I also did some refactorings of the tests to avoid duplication. More below.

3. A small assertion against the on-disk 00000002.history contents, to pin
down the file format.
4. On 0002 the dependency on restore_command pointing at node_x's pg_wal
is the kind of thing that tends to break under
environment changes. A CHECKPOINT on node_x before the backup, or
wal_keep_size as in 0001, would let the test stand on its own.

Good point.

I refactored the code to avoid some duplication and make the test flow
self-explanatory and as part of that I set the wal_keep_size for all nodes.

In the process I noticed that many of the functions in RewindTest.pm do the
same job as the primitives I wrote, but have hard-coded variable names. I
could rewrite them to take parameters, but that would be quite a big patch
to add additional changes to each call site, so I did not do that and
rather added small wrappers specific for the tests in 005_same_timeline.pl.

Attached a new version of the now single patch.

I'm happy to keep reviewing/contributing, thanks again for working on it.

Thank you for reviewing it.
--
Best wishes,
Mats Kindahl, Multigres Developer, Supabase

Attachments:

v3.0001-pg_rewind-use-UUIDs-to-detect-independent-same-TLI-p.patchtext/x-patch; charset=US-ASCII; name=v3.0001-pg_rewind-use-UUIDs-to-detect-independent-same-TLI-p.patchDownload+613-24
#5Japin Li
japinli@hotmail.com
In reply to: Mats Kindahl (#4)
Re: pg_rewind does not rewind diverging timelines

Hi, Mats

On Sun, 24 May 2026 at 20:30, Mats Kindahl <mats.kindahl@gmail.com> wrote:

On Fri, May 22, 2026 at 12:09 AM surya poondla <suryapoondla4@gmail.com> wrote:

Hi Mats,

Thanks for picking this up -- the scenario is a real one and I think the UUID-tagging approach is a clean way to
solve it. v2 applies and builds without trouble, and the core algorithm reads well to me.
I have a handful of observations that I'd love your thoughts.

Hi Surya,

Thank you for the review. It is a quite rare scenario, but it is real and the fix is simple.

Regarding Correctness I have the below thoughts

1. UUIDv7 timestamp epoch.
In StartupXLOG():
TimestampTz now = GetCurrentTimestamp();
generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000),
(uint32)(now % 1000) * 1000);

I think there might be a small mismatch here: GetCurrentTimestamp() returns microseconds since the Postgres epoch
(2000-01-01),
whereas generate_uuidv7_r describes its first argument as milliseconds since the Unix epoch.
As written that 30-year offset would land in the UUID's timestamp field, so the resulting UUID wouldn't be a
conformant UUIDv7 and wouldn't
time-order against UUIDv7s generated through the SQL functions.

Uniqueness is preserved either way, so the rewind logic still works as intended but it seemed worth flagging.

I see conversion that's used elsewhere as:
us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE)
* SECS_PER_DAY * USECS_PER_SEC;

Or, since promotion isn't on a hot path, gettimeofday() / time(NULL) directly would also be fine.

Yes, the intention was to use a proper timestamp to allow debugging servers if necessary. Switched to gettimeofday() and
used 0 for sub-ms since this is not going to be critical. (We could use ns here as well, but that would only solve a race
if you have two servers being promoted in the same ms, which I find unlikely, and there is a random number added for that
situation.)

2. EOR-record path, the intent is unclear.

The comment above generate_uuidv7_r() at says:

"The same UUID is written into the history file and later into the XLOG_END_OF_RECOVERY record so that pg_rewind can
distinguish two servers..."

But from what I can see only the history-file part actually lands.
xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't add the UUID, and XLogCtl->ThisTimeLineUUID is
written under info_lck without a
reader (I couldn't grep it).

The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads like preparation for an EOR-struct extension that
ended up not being part of the patch.

Was the EOR-record piece something you intended to keep for a follow-up, or has it been superseded by the
history-file approach?

No, the EOR changes are not needed for the promotion, contrary to what I originally thought. Cleaned up the comment and
the code and removed all traces of changes to the EOR (I hope).

3. Malformed UUID handling in readTimeLineHistory().

The optional field-4 path is:

if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN)
{
Datum datum = DirectFunctionCall1(uuid_in,
CStringGetDatum(uuid_str));
...
}

uuid_in() raises ereport(ERROR) on a malformed input, while the surrounding syntax-error paths in readTimeLineHistory
() use FATAL deliberately.
In practice an ERROR during startup ends up being fatal too, so this isn't strictly a bug but it would be nicer to
stay consistent.

Agree. I added code to capture the error and raise a FATAL instead (with the error message from the uuid_in, in case it
is modified it makes sense to show this).

Regarding the Tests I have the following thoughts

The two new cases are nice, a few extensions that I think would strengthen them:
1. A mixed-version case where one side has a zero UUID. That's the path we're claiming is graceful, but nothing
currently exercises it

Yes, that should work regardless of whether the source or the target has the zero UUID.

I realized one thing: if two timelines have identical TLI but one has zero UUID and one has not, it seems they could not
come from the same promotion (one promotion happened on an old server and the other one on a new server), that is, they
should be treated as different. Does that make sense? I made the necessary changes in the attached patches for testing.
Please have a look.

2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that findCommonAncestorTimeline's loop walks past
matching entries
before hitting the mismatch. The 0002 test puts the divergence at depth 1.

I was unsure if this test was necessary or interesting, hence a separate commit. Since you thought it was useful, it's
now rolled into the patch and I extended the tests with the scenarios you suggested.

I also did some refactorings of the tests to avoid duplication. More below.

3. A small assertion against the on-disk 00000002.history contents, to pin down the file format.
4. On 0002 the dependency on restore_command pointing at node_x's pg_wal is the kind of thing that tends to break
under
environment changes. A CHECKPOINT on node_x before the backup, or wal_keep_size as in 0001, would let the test
stand on its own.

Good point.

I refactored the code to avoid some duplication and make the test flow self-explanatory and as part of that I set the
wal_keep_size for all nodes.

In the process I noticed that many of the functions in RewindTest.pm do the same job as the primitives I wrote, but have
hard-coded variable names. I could rewrite them to take parameters, but that would be quite a big patch to add additional
changes to each call site, so I did not do that and rather added small wrappers specific for the tests in
005_same_timeline.pl⚠️.

Attached a new version of the now single patch.

I'm happy to keep reviewing/contributing, thanks again for working on it.

Thank you for reviewing it.

Thank you for your work. I have one comment.

+	a = &tlh->source[tlh->sourceNentries - 2].tluuid;
+	b = &tlh->target[tlh->targetNentries - 2].tluuid;
+
+	if (memcmp(a, &zero, UUID_LEN) == 0 && memcmp(b, &zero, UUID_LEN) == 0)
+		return true;
+
+	return memcmp(a, b, UUID_LEN) == 0;

Since we already have matchingTimelineUUID(), the above code can be simplified
using it.

--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.

#6Mats Kindahl
mats.kindahl@gmail.com
In reply to: Japin Li (#5)
Re: pg_rewind does not rewind diverging timelines

Hi Japin,

On Mon, May 25, 2026 at 7:21 AM Japin Li <japinli@hotmail.com> wrote:

Hi, Mats

On Sun, 24 May 2026 at 20:30, Mats Kindahl <mats.kindahl@gmail.com> wrote:

On Fri, May 22, 2026 at 12:09 AM surya poondla <suryapoondla4@gmail.com>

wrote:

Hi Mats,

Thanks for picking this up -- the scenario is a real one and I think

the UUID-tagging approach is a clean way to

solve it. v2 applies and builds without trouble, and the core algorithm

reads well to me.

I have a handful of observations that I'd love your thoughts.

Hi Surya,

Thank you for the review. It is a quite rare scenario, but it is real

and the fix is simple.

Regarding Correctness I have the below thoughts

1. UUIDv7 timestamp epoch.
In StartupXLOG():
TimestampTz now = GetCurrentTimestamp();
generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000),
(uint32)(now % 1000) * 1000);

I think there might be a small mismatch here: GetCurrentTimestamp()

returns microseconds since the Postgres epoch

(2000-01-01),
whereas generate_uuidv7_r describes its first argument as milliseconds

since the Unix epoch.

As written that 30-year offset would land in the UUID's timestamp

field, so the resulting UUID wouldn't be a

conformant UUIDv7 and wouldn't
time-order against UUIDv7s generated through the SQL functions.

Uniqueness is preserved either way, so the rewind logic still works as

intended but it seemed worth flagging.

I see conversion that's used elsewhere as:
us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE)
* SECS_PER_DAY * USECS_PER_SEC;

Or, since promotion isn't on a hot path, gettimeofday() / time(NULL)

directly would also be fine.

Yes, the intention was to use a proper timestamp to allow debugging

servers if necessary. Switched to gettimeofday() and

used 0 for sub-ms since this is not going to be critical. (We could use

ns here as well, but that would only solve a race

if you have two servers being promoted in the same ms, which I find

unlikely, and there is a random number added for that

situation.)

2. EOR-record path, the intent is unclear.

The comment above generate_uuidv7_r() at says:

"The same UUID is written into the history file and later into the

XLOG_END_OF_RECOVERY record so that pg_rewind can

distinguish two servers..."

But from what I can see only the history-file part actually lands.
xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't

add the UUID, and XLogCtl->ThisTimeLineUUID is

written under info_lck without a
reader (I couldn't grep it).

The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads like

preparation for an EOR-struct extension that

ended up not being part of the patch.

Was the EOR-record piece something you intended to keep for a

follow-up, or has it been superseded by the

history-file approach?

No, the EOR changes are not needed for the promotion, contrary to what I

originally thought. Cleaned up the comment and

the code and removed all traces of changes to the EOR (I hope).

3. Malformed UUID handling in readTimeLineHistory().

The optional field-4 path is:

if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN)
{
Datum datum = DirectFunctionCall1(uuid_in,

CStringGetDatum(uuid_str));

...
}

uuid_in() raises ereport(ERROR) on a malformed input, while the

surrounding syntax-error paths in readTimeLineHistory

() use FATAL deliberately.
In practice an ERROR during startup ends up being fatal too, so this

isn't strictly a bug but it would be nicer to

stay consistent.

Agree. I added code to capture the error and raise a FATAL instead (with

the error message from the uuid_in, in case it

is modified it makes sense to show this).

Regarding the Tests I have the following thoughts

The two new cases are nice, a few extensions that I think would

strengthen them:

1. A mixed-version case where one side has a zero UUID. That's the path

we're claiming is graceful, but nothing

currently exercises it

Yes, that should work regardless of whether the source or the target has

the zero UUID.

I realized one thing: if two timelines have identical TLI but one has

zero UUID and one has not, it seems they could not

come from the same promotion (one promotion happened on an old server

and the other one on a new server), that is, they

should be treated as different. Does that make sense? I made the

necessary changes in the attached patches for testing.

Please have a look.

2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that

findCommonAncestorTimeline's loop walks past

matching entries
before hitting the mismatch. The 0002 test puts the divergence at

depth 1.

I was unsure if this test was necessary or interesting, hence a separate

commit. Since you thought it was useful, it's

now rolled into the patch and I extended the tests with the scenarios

you suggested.

I also did some refactorings of the tests to avoid duplication. More

below.

3. A small assertion against the on-disk 00000002.history contents, to

pin down the file format.

4. On 0002 the dependency on restore_command pointing at node_x's

pg_wal is the kind of thing that tends to break

under
environment changes. A CHECKPOINT on node_x before the backup, or

wal_keep_size as in 0001, would let the test

stand on its own.

Good point.

I refactored the code to avoid some duplication and make the test flow

self-explanatory and as part of that I set the

wal_keep_size for all nodes.

In the process I noticed that many of the functions in RewindTest.pm do

the same job as the primitives I wrote, but have

hard-coded variable names. I could rewrite them to take parameters, but

that would be quite a big patch to add additional

changes to each call site, so I did not do that and rather added small

wrappers specific for the tests in

005_same_timeline.pl⚠️.

Attached a new version of the now single patch.

I'm happy to keep reviewing/contributing, thanks again for working on

it.

Thank you for reviewing it.

Thank you for your work. I have one comment.

+       a = &tlh->source[tlh->sourceNentries - 2].tluuid;
+       b = &tlh->target[tlh->targetNentries - 2].tluuid;
+
+       if (memcmp(a, &zero, UUID_LEN) == 0 && memcmp(b, &zero, UUID_LEN)
== 0)
+               return true;
+
+       return memcmp(a, b, UUID_LEN) == 0;

Since we already have matchingTimelineUUID(), the above code can be
simplified
using it.

Thank you for the review. I switched to using the matchingTimelineUUID()
for this part of the code and made some other minor improvements as well.
--
Best wishes,
Mats Kindahl, Multigres Developer, Supabase

Attachments:

v4.0001-pg_rewind-use-UUIDs-to-detect-independent-same-TLI-p.patchtext/x-patch; charset=US-ASCII; name=v4.0001-pg_rewind-use-UUIDs-to-detect-independent-same-TLI-p.patchDownload+606-24
#7Japin Li
japinli@hotmail.com
In reply to: Mats Kindahl (#6)
Re: pg_rewind does not rewind diverging timelines

Hi, Mats

Thanks for updating the patch.

On Mon, 25 May 2026 at 20:59, Mats Kindahl <mats.kindahl@gmail.com> wrote:

Hi Japin,

On Mon, May 25, 2026 at 7:21 AM Japin Li <japinli@hotmail.com> wrote:

Hi, Mats

On Sun, 24 May 2026 at 20:30, Mats Kindahl <mats.kindahl@gmail.com> wrote:

On Fri, May 22, 2026 at 12:09 AM surya poondla <suryapoondla4@gmail.com> wrote:

Hi Mats,

Thanks for picking this up -- the scenario is a real one and I think the UUID-tagging approach is a clean way to
solve it. v2 applies and builds without trouble, and the core algorithm reads well to me.
I have a handful of observations that I'd love your thoughts.

Hi Surya,

Thank you for the review. It is a quite rare scenario, but it is real and the fix is simple.

Regarding Correctness I have the below thoughts

1. UUIDv7 timestamp epoch.
In StartupXLOG():
TimestampTz now = GetCurrentTimestamp();
generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000),
(uint32)(now % 1000) * 1000);

I think there might be a small mismatch here: GetCurrentTimestamp() returns microseconds since the Postgres epoch
(2000-01-01),
whereas generate_uuidv7_r describes its first argument as milliseconds since the Unix epoch.
As written that 30-year offset would land in the UUID's timestamp field, so the resulting UUID wouldn't be a
conformant UUIDv7 and wouldn't
time-order against UUIDv7s generated through the SQL functions.

Uniqueness is preserved either way, so the rewind logic still works as intended but it seemed worth flagging.

I see conversion that's used elsewhere as:
us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE)
* SECS_PER_DAY * USECS_PER_SEC;

Or, since promotion isn't on a hot path, gettimeofday() / time(NULL) directly would also be fine.

Yes, the intention was to use a proper timestamp to allow debugging servers if necessary. Switched to gettimeofday

() and

used 0 for sub-ms since this is not going to be critical. (We could use ns here as well, but that would only solve

a race

if you have two servers being promoted in the same ms, which I find unlikely, and there is a random number added

for that

situation.)

2. EOR-record path, the intent is unclear.

The comment above generate_uuidv7_r() at says:

"The same UUID is written into the history file and later into the XLOG_END_OF_RECOVERY record so that pg_rewind

can

distinguish two servers..."

But from what I can see only the history-file part actually lands.
xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't add the UUID, and XLogCtl->ThisTimeLineUUID

is

written under info_lck without a
reader (I couldn't grep it).

The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads like preparation for an EOR-struct extension

that

ended up not being part of the patch.

Was the EOR-record piece something you intended to keep for a follow-up, or has it been superseded by the
history-file approach?

No, the EOR changes are not needed for the promotion, contrary to what I originally thought. Cleaned up the comment

and

the code and removed all traces of changes to the EOR (I hope).

3. Malformed UUID handling in readTimeLineHistory().

The optional field-4 path is:

if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN)
{
Datum datum = DirectFunctionCall1(uuid_in,
CStringGetDatum(uuid_str));
...
}

uuid_in() raises ereport(ERROR) on a malformed input, while the surrounding syntax-error paths in

readTimeLineHistory

() use FATAL deliberately.
In practice an ERROR during startup ends up being fatal too, so this isn't strictly a bug but it would be nicer to
stay consistent.

Agree. I added code to capture the error and raise a FATAL instead (with the error message from the uuid_in, in

case it

is modified it makes sense to show this).

Regarding the Tests I have the following thoughts

The two new cases are nice, a few extensions that I think would strengthen them:
1. A mixed-version case where one side has a zero UUID. That's the path we're claiming is graceful, but nothing
currently exercises it

Yes, that should work regardless of whether the source or the target has the zero UUID.

I realized one thing: if two timelines have identical TLI but one has zero UUID and one has not, it seems they

could not

come from the same promotion (one promotion happened on an old server and the other one on a new server), that is,

they

should be treated as different. Does that make sense? I made the necessary changes in the attached patches for

testing.

Please have a look.

2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that findCommonAncestorTimeline's loop walks past
matching entries
before hitting the mismatch. The 0002 test puts the divergence at depth 1.

I was unsure if this test was necessary or interesting, hence a separate commit. Since you thought it was useful,

it's

now rolled into the patch and I extended the tests with the scenarios you suggested.

I also did some refactorings of the tests to avoid duplication. More below.

3. A small assertion against the on-disk 00000002.history contents, to pin down the file format.
4. On 0002 the dependency on restore_command pointing at node_x's pg_wal is the kind of thing that tends to break
under
environment changes. A CHECKPOINT on node_x before the backup, or wal_keep_size as in 0001, would let the

test

stand on its own.

Good point.

I refactored the code to avoid some duplication and make the test flow self-explanatory and as part of that I set

the

wal_keep_size for all nodes.

In the process I noticed that many of the functions in RewindTest.pm do the same job as the primitives I wrote, but

have

hard-coded variable names. I could rewrite them to take parameters, but that would be quite a big patch to add

additional

changes to each call site, so I did not do that and rather added small wrappers specific for the tests in
005_same_timeline.pl⚠️⚠️.

Attached a new version of the now single patch.

I'm happy to keep reviewing/contributing, thanks again for working on it.

Thank you for reviewing it.

Thank you for your work. I have one comment.

+       a = &tlh->source[tlh->sourceNentries - 2].tluuid;
+       b = &tlh->target[tlh->targetNentries - 2].tluuid;
+
+       if (memcmp(a, &zero, UUID_LEN) == 0 && memcmp(b, &zero, UUID_LEN) == 0)
+               return true;
+
+       return memcmp(a, b, UUID_LEN) == 0;

Since we already have matchingTimelineUUID(), the above code can be simplified
using it.

Thank you for the review. I switched to using the matchingTimelineUUID() for this part of the code and made some other
minor improvements as well.

Here are some comments on v4.

1.
+/*
+ * Timeline histories for both clusters, populated by timelines_match().
+ */

I don't see a timelines_match() function. Does this refer to
matchAndFetchTimelines()?

2.
+typedef struct TimelineHistoriesData
+{
+       TimeLineHistoryEntry *source,
+                          *target;
+       int                     sourceNentries,
+                               targetNentries;
+}                      TimelineHistoriesData;

I'd prefer to use TimeLineHistoriesData to stay consistent with
TimeLineHistoryEntry. Anyway I'm not instant on it.

3.
+typedef TimelineHistoriesData * TimelineHistories;

The space between * and TimelineHistories is unnecessary — see
StringInfoData and other typedefs.

4.
+# node_x and node_b both start from the same TLI 1 baseline.
+my ($node_x, $node_b2) =
+  setup_standbys_from_origin($node_origin2, 'node_x', 'node_b2');

There appears to be a typo in the comment. The node_b should be node_b2.

Everything else looks good. Thank you again for updating the patch!

--
Best wishes,
Mats Kindahl, Multigres Developer, Supabase

--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.

#8Mats Kindahl
mats.kindahl@gmail.com
In reply to: Japin Li (#7)
Re: pg_rewind does not rewind diverging timelines

Hi Japin,

Thanks for reviewing the patch.

On Tue, May 26, 2026 at 8:56 AM Japin Li <japinli@hotmail.com> wrote:

Hi, Mats

Thanks for updating the patch.

On Mon, 25 May 2026 at 20:59, Mats Kindahl <mats.kindahl@gmail.com> wrote:

Hi Japin,

On Mon, May 25, 2026 at 7:21 AM Japin Li <japinli@hotmail.com> wrote:

Hi, Mats

On Sun, 24 May 2026 at 20:30, Mats Kindahl <mats.kindahl@gmail.com>

wrote:

On Fri, May 22, 2026 at 12:09 AM surya poondla <

suryapoondla4@gmail.com> wrote:

Hi Mats,

Thanks for picking this up -- the scenario is a real one and I think

the UUID-tagging approach is a clean way to

solve it. v2 applies and builds without trouble, and the core

algorithm reads well to me.

I have a handful of observations that I'd love your thoughts.

Hi Surya,

Thank you for the review. It is a quite rare scenario, but it is real

and the fix is simple.

Regarding Correctness I have the below thoughts

1. UUIDv7 timestamp epoch.
In StartupXLOG():
TimestampTz now = GetCurrentTimestamp();
generate_uuidv7_r(&uuid_buf, (uint64)(now / 1000),
(uint32)(now % 1000) * 1000);

I think there might be a small mismatch here: GetCurrentTimestamp()

returns microseconds since the Postgres epoch

(2000-01-01),
whereas generate_uuidv7_r describes its first argument as

milliseconds since the Unix epoch.

As written that 30-year offset would land in the UUID's timestamp

field, so the resulting UUID wouldn't be a

conformant UUIDv7 and wouldn't
time-order against UUIDv7s generated through the SQL functions.

Uniqueness is preserved either way, so the rewind logic still works

as intended but it seemed worth flagging.

I see conversion that's used elsewhere as:
us = ts + (POSTGRES_EPOCH_JDATE - UNIX_EPOCH_JDATE)
* SECS_PER_DAY * USECS_PER_SEC;

Or, since promotion isn't on a hot path, gettimeofday() / time(NULL)

directly would also be fine.

Yes, the intention was to use a proper timestamp to allow debugging

servers if necessary. Switched to gettimeofday

() and

used 0 for sub-ms since this is not going to be critical. (We could

use ns here as well, but that would only solve

a race

if you have two servers being promoted in the same ms, which I find

unlikely, and there is a random number added

for that

situation.)

2. EOR-record path, the intent is unclear.

The comment above generate_uuidv7_r() at says:

"The same UUID is written into the history file and later into the

XLOG_END_OF_RECOVERY record so that pg_rewind

can

distinguish two servers..."

But from what I can see only the history-file part actually lands.
xl_end_of_recovery is unchanged, CreateEndOfRecoveryRecord() doesn't

add the UUID, and XLogCtl->ThisTimeLineUUID

is

written under info_lck without a
reader (I couldn't grep it).

The xlog_redo() memset() + Min(rec_len, sizeof(...)) change reads

like preparation for an EOR-struct extension

that

ended up not being part of the patch.

Was the EOR-record piece something you intended to keep for a

follow-up, or has it been superseded by the

history-file approach?

No, the EOR changes are not needed for the promotion, contrary to

what I originally thought. Cleaned up the comment

and

the code and removed all traces of changes to the EOR (I hope).

3. Malformed UUID handling in readTimeLineHistory().

The optional field-4 path is:

if (nfields == 4 && strlen(uuid_str) == UUID_STR_LEN)
{
Datum datum = DirectFunctionCall1(uuid_in,

CStringGetDatum(uuid_str));

...
}

uuid_in() raises ereport(ERROR) on a malformed input, while the

surrounding syntax-error paths in

readTimeLineHistory

() use FATAL deliberately.
In practice an ERROR during startup ends up being fatal too, so this

isn't strictly a bug but it would be nicer to

stay consistent.

Agree. I added code to capture the error and raise a FATAL instead

(with the error message from the uuid_in, in

case it

is modified it makes sense to show this).

Regarding the Tests I have the following thoughts

The two new cases are nice, a few extensions that I think would

strengthen them:

1. A mixed-version case where one side has a zero UUID. That's the

path we're claiming is graceful, but nothing

currently exercises it

Yes, that should work regardless of whether the source or the target

has the zero UUID.

I realized one thing: if two timelines have identical TLI but one has

zero UUID and one has not, it seems they

could not

come from the same promotion (one promotion happened on an old server

and the other one on a new server), that is,

they

should be treated as different. Does that make sense? I made the

necessary changes in the attached patches for

testing.

Please have a look.

2. A deeper-divergence case (e.g. TLI1->2->3 vs TLI1->2->3') so that

findCommonAncestorTimeline's loop walks past

matching entries
before hitting the mismatch. The 0002 test puts the divergence

at depth 1.

I was unsure if this test was necessary or interesting, hence a

separate commit. Since you thought it was useful,

it's

now rolled into the patch and I extended the tests with the scenarios

you suggested.

I also did some refactorings of the tests to avoid duplication. More

below.

3. A small assertion against the on-disk 00000002.history contents,

to pin down the file format.

4. On 0002 the dependency on restore_command pointing at node_x's

pg_wal is the kind of thing that tends to break

under
environment changes. A CHECKPOINT on node_x before the backup,

or wal_keep_size as in 0001, would let the

test

stand on its own.

Good point.

I refactored the code to avoid some duplication and make the test

flow self-explanatory and as part of that I set

the

wal_keep_size for all nodes.

In the process I noticed that many of the functions in RewindTest.pm

do the same job as the primitives I wrote, but

have

hard-coded variable names. I could rewrite them to take parameters,

but that would be quite a big patch to add

additional

changes to each call site, so I did not do that and rather added

small wrappers specific for the tests in

005_same_timeline.pl⚠️⚠️.

Attached a new version of the now single patch.

I'm happy to keep reviewing/contributing, thanks again for working

on it.

Thank you for reviewing it.

Thank you for your work. I have one comment.

+       a = &tlh->source[tlh->sourceNentries - 2].tluuid;
+       b = &tlh->target[tlh->targetNentries - 2].tluuid;
+
+       if (memcmp(a, &zero, UUID_LEN) == 0 && memcmp(b, &zero,

UUID_LEN) == 0)

+               return true;
+
+       return memcmp(a, b, UUID_LEN) == 0;

Since we already have matchingTimelineUUID(), the above code can be

simplified

using it.

Thank you for the review. I switched to using the matchingTimelineUUID()

for this part of the code and made some other

minor improvements as well.

Here are some comments on v4.

1.
+/*
+ * Timeline histories for both clusters, populated by timelines_match().
+ */

I don't see a timelines_match() function. Does this refer to
matchAndFetchTimelines()?

Correct. Updated.

2.
+typedef struct TimelineHistoriesData
+{
+       TimeLineHistoryEntry *source,
+                          *target;
+       int                     sourceNentries,
+                               targetNentries;
+}                      TimelineHistoriesData;

I'd prefer to use TimeLineHistoriesData to stay consistent with
TimeLineHistoryEntry. Anyway I'm not instant on it.

Makes sense to be consistent. Updated.

3.
+typedef TimelineHistoriesData * TimelineHistories;

The space between * and TimelineHistories is unnecessary — see
StringInfoData and other typedefs.

My mistake. FIxed.

4.
+# node_x and node_b both start from the same TLI 1 baseline.
+my ($node_x, $node_b2) =
+  setup_standbys_from_origin($node_origin2, 'node_x', 'node_b2');

There appears to be a typo in the comment. The node_b should be node_b2.

Right. Fixed.

Everything else looks good. Thank you again for updating the patch!

Thank you again for reviewing the patch. :)

Attached a new version of the patch with the changes you suggested.

--
Best wishes,
Mats Kindahl, Multigres Developer, Supabase

Attachments:

v5.0001-pg_rewind-use-UUIDs-to-detect-independent-same-TLI-p.patchtext/x-patch; charset=US-ASCII; name=v5.0001-pg_rewind-use-UUIDs-to-detect-independent-same-TLI-p.patchDownload+606-24
#9Japin Li
japinli@hotmail.com
In reply to: Mats Kindahl (#8)
Re: pg_rewind does not rewind diverging timelines

Hi, Mats

On Tue, 26 May 2026 at 18:03, Mats Kindahl <mats.kindahl@gmail.com> wrote:

Attached a new version of the patch with the changes you suggested.

I found an error on the Windows platform [1]https://cirrus-ci.com/task/6228217159221248.

[07:08:28.538] >>> MALLOC_PERTURB_=168 PG_REGRESS=C:\cirrus\build\src/test\regress\pg_regress.exe REGRESS_SHLIB=C:\cirrus\build\src/test\regress\regress.dll MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 top_builddir=C:\cirrus\build UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MESON_TEST_ITERATION=1 PATH=C:\cirrus\build\tmp_install\usr\local\pgsql\bin;C:\cirrus\build\src\bin\pg_rewind;C:/cirrus/build/src/bin/pg_rewind/test;C:\VS_2019\VC\Tools\MSVC\14.29.30133\bin\HostX64\x64;C:\VS_2019\MSBuild\Current\bin\Roslyn;C:\Program Files (x86)\Windows Kits\10\bin\10.0.22621.0\x64;C:\Program Files (x86)\Windows Kits\10\bin\x64;C:\VS_2019\\MSBuild\Current\Bin;C:\Windows\Microsoft.NET\Framework64\v4.0.30319;C:\VS_2019\Common7\IDE\;C:\VS_2019\Common7\Tools\;C:\VS_2019\VC\Auxiliary\Build;C:\zstd\zstd-v1.5.2-win64;C:\zlib;C:\lz4;C:\icu;C:\winflexbison;C:\strawberry\5.42.0.1\perl\bin;C:\python\Scripts\;C:\python\;C:\Windows Kits\10\Debuggers\x64;C:\Program Files\Git\usr\bin;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\ProgramData\GooGet;C:\Program Files\Google\Compute Engine\metadata_scripts;C:\Program Files (x86)\Google\Cloud SDK\google-cloud-sdk\bin;C:\Program Files\PowerShell\7\;C:\Program Files\Google\Compute Engine\sysprep;C:\ProgramData\chocolatey\bin;C:\Program Files\Git\cmd;C:\Program Files\Git\mingw64\bin;C:\Program Files\Git\usr\bin;C:\Windows\system32\config\systemprofile\AppData\Local\Microsoft\WindowsApps INITDB_TEMPLATE=C:/cirrus/build/tmp_install/initdb-template ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 share_contrib_dir=C:/cirrus/build/tmp_install//usr/local/pgsql/share/contrib C:\python\python3.EXE C:\cirrus\build\..\src/tools/testwrap --basedir C:\cirrus\build --srcdir C:\cirrus\src\bin\pg_rewind --pg-test-extra --testgroup pg_rewind --testname 005_same_timeline -- C:\strawberry\5.42.0.1\perl\bin\perl.EXE -I C:/cirrus/src/test/perl -I C:\cirrus\src\bin\pg_rewind C:/cirrus/src/bin/pg_rewind/t/005_same_timeline.pl
[07:08:28.538] ------------------------------------- 8< -------------------------------------
[07:08:28.538] stderr:
[07:08:28.538] # Failed test 'pg_rewind rewinds across mismatched TLI 2 / TLI 2-prime to TLI 1'
[07:08:28.538] # at C:/cirrus/src/bin/pg_rewind/t/005_same_timeline.pl line 45.
[07:08:28.538] # ---------- command failed ----------
[07:08:28.538] # pg_rewind --debug --source-pgdata C:\cirrus\build/testrun/pg_rewind/005_same_timeline\data/t_005_same_timeline_node_b2_data/pgdata --target-pgdata C:\cirrus\build/testrun/pg_rewind/005_same_timeline\data/t_005_same_timeline_node_a2_data/pgdata --no-sync --config-file C:\cirrus\build\testrun\pg_rewind\005_same_timeline\data\tmp_test_ZCeZ/target-postgresql.conf.tmp --restore-target-wal
[07:08:28.538] # -------------- stderr --------------
[07:08:28.538] # pg_rewind: using for rewind "restore_command = 'cp "C:cirrusuild/testrun/pg_rewind/005_same_timelinedata/t_005_same_timeline_node_x_data/pgdata/pg_wal/%f" "%p"'"
[07:08:28.538] # pg_rewind: Source timeline history:
[07:08:28.538] # pg_rewind: 1: 0/00000000 - 0/040000E0
[07:08:28.538] # pg_rewind: 2: 0/040000E0 - 0/00000000
[07:08:28.538] # pg_rewind: Target timeline history:
[07:08:28.538] # pg_rewind: 1: 0/00000000 - 0/040000E0
[07:08:28.538] # pg_rewind: 2: 0/040000E0 - 0/060000E0
[07:08:28.538] # pg_rewind: 3: 0/060000E0 - 0/00000000
[07:08:28.538] # pg_rewind: servers diverged at WAL location 0/040000E0 on timeline 1
[07:08:28.538] # cp: cannot stat 'C:cirrus'$'\b''uild/testrun/pg_rewind/005_same_timelinedata/t_005_same_timeline_node_x_data/pgdata/pg_wal/000000020000000000000004': No such file or directory
[07:08:28.538] # pg_rewind: error: could not restore file "000000020000000000000004" from archive
[07:08:28.538] # pg_rewind: error: could not find previous WAL record at 0/040000E0
[07:08:28.538] # ------------------------------------
[07:08:28.538] # Failed test 'rewound node reflects source history, not target TLI 2/TLI 3 data'
[07:08:28.538] # at C:/cirrus/src/bin/pg_rewind/t/005_same_timeline.pl line 260.
[07:08:28.538] # got: 'origin2
[07:08:28.538] # x'
[07:08:28.538] # expected: 'b
[07:08:28.538] # origin2'
[07:08:28.538] # Looks like you failed 2 tests of 11.
[07:08:28.538]
[07:08:28.538] (test program exited with status code 2)
[07:08:28.538] ------------------------------------------------------------------------------
[07:08:28.538]

[1]: https://cirrus-ci.com/task/6228217159221248

--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.

#10Mats Kindahl
mats.kindahl@gmail.com
In reply to: Japin Li (#9)
Re: pg_rewind does not rewind diverging timelines

Hi Japin,

On 5/29/26 04:01, Japin Li wrote:

Hi, Mats

On Tue, 26 May 2026 at 18:03, Mats Kindahl <mats.kindahl@gmail.com> wrote:

Attached a new version of the patch with the changes you suggested.

I found an error on the Windows platform [1].

[07:08:28.538] >>> MALLOC_PERTURB_=168 PG_REGRESS=C:\cirrus\build\src/test\regress\pg_regress.exe REGRESS_SHLIB=C:\cirrus\build\src/test\regress\regress.dll MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 top_builddir=C:\cirrus\build UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1 MESON_TEST_ITERATION=1 PATH=C:\cirrus\build\tmp_install\usr\local\pgsql\bin;C:\cirrus\build\src\bin\pg_rewind;C:/cirrus/build/src/bin/pg_rewind/test;C:\VS_2019\VC\Tools\MSVC\14.29.30133\bin\HostX64\x64;C:\VS_2019\MSBuild\Current\bin\Roslyn;C:\Program Files (x86)\Windows Kits\10\bin\10.0.22621.0\x64;C:\Program Files (x86)\Windows Kits\10\bin\x64;C:\VS_2019\\MSBuild\Current\Bin;C:\Windows\Microsoft.NET\Framework64\v4.0.30319;C:\VS_2019\Common7\IDE\;C:\VS_2019\Common7\Tools\;C:\VS_2019\VC\Auxiliary\Build;C:\zstd\zstd-v1.5.2-win64;C:\zlib;C:\lz4;C:\icu;C:\winflexbison;C:\strawberry\5.42.0.1\perl\bin;C:\python\Scripts\;C:\python\;C:\Windows Kits\10\Debuggers\x64;C:\Program Files\Git\usr\bin;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\ProgramData\GooGet;C:\Program Files\Google\Compute Engine\metadata_scripts;C:\Program Files (x86)\Google\Cloud SDK\google-cloud-sdk\bin;C:\Program Files\PowerShell\7\;C:\Program Files\Google\Compute Engine\sysprep;C:\ProgramData\chocolatey\bin;C:\Program Files\Git\cmd;C:\Program Files\Git\mingw64\bin;C:\Program Files\Git\usr\bin;C:\Windows\system32\config\systemprofile\AppData\Local\Microsoft\WindowsApps INITDB_TEMPLATE=C:/cirrus/build/tmp_install/initdb-template ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1 share_contrib_dir=C:/cirrus/build/tmp_install//usr/local/pgsql/share/contrib C:\python\python3.EXE C:\cirrus\build\..\src/tools/testwrap --basedir C:\cirrus\build --srcdir C:\cirrus\src\bin\pg_rewind --pg-test-extra --testgroup pg_rewind --testname 005_same_timeline -- C:\strawberry\5.42.0.1\perl\bin\perl.EXE -I C:/cirrus/src/test/perl -I C:\cirrus\src\bin\pg_rewind C:/cirrus/src/bin/pg_rewind/t/005_same_timeline.pl
[07:08:28.538] ------------------------------------- 8< -------------------------------------
[07:08:28.538] stderr:
[07:08:28.538] # Failed test 'pg_rewind rewinds across mismatched TLI 2 / TLI 2-prime to TLI 1'
[07:08:28.538] # at C:/cirrus/src/bin/pg_rewind/t/005_same_timeline.pl line 45.
[07:08:28.538] # ---------- command failed ----------
[07:08:28.538] # pg_rewind --debug --source-pgdata C:\cirrus\build/testrun/pg_rewind/005_same_timeline\data/t_005_same_timeline_node_b2_data/pgdata --target-pgdata C:\cirrus\build/testrun/pg_rewind/005_same_timeline\data/t_005_same_timeline_node_a2_data/pgdata --no-sync --config-file C:\cirrus\build\testrun\pg_rewind\005_same_timeline\data\tmp_test_ZCeZ/target-postgresql.conf.tmp --restore-target-wal
[07:08:28.538] # -------------- stderr --------------
[07:08:28.538] # pg_rewind: using for rewind "restore_command = 'cp "C:cirrusuild/testrun/pg_rewind/005_same_timelinedata/t_005_same_timeline_node_x_data/pgdata/pg_wal/%f" "%p"'"
[07:08:28.538] # pg_rewind: Source timeline history:
[07:08:28.538] # pg_rewind: 1: 0/00000000 - 0/040000E0
[07:08:28.538] # pg_rewind: 2: 0/040000E0 - 0/00000000
[07:08:28.538] # pg_rewind: Target timeline history:
[07:08:28.538] # pg_rewind: 1: 0/00000000 - 0/040000E0
[07:08:28.538] # pg_rewind: 2: 0/040000E0 - 0/060000E0
[07:08:28.538] # pg_rewind: 3: 0/060000E0 - 0/00000000
[07:08:28.538] # pg_rewind: servers diverged at WAL location 0/040000E0 on timeline 1
[07:08:28.538] # cp: cannot stat 'C:cirrus'$'\b''uild/testrun/pg_rewind/005_same_timelinedata/t_005_same_timeline_node_x_data/pgdata/pg_wal/000000020000000000000004': No such file or directory
[07:08:28.538] # pg_rewind: error: could not restore file "000000020000000000000004" from archive
[07:08:28.538] # pg_rewind: error: could not find previous WAL record at 0/040000E0
[07:08:28.538] # ------------------------------------
[07:08:28.538] # Failed test 'rewound node reflects source history, not target TLI 2/TLI 3 data'
[07:08:28.538] # at C:/cirrus/src/bin/pg_rewind/t/005_same_timeline.pl line 260.
[07:08:28.538] # got: 'origin2
[07:08:28.538] # x'
[07:08:28.538] # expected: 'b
[07:08:28.538] # origin2'
[07:08:28.538] # Looks like you failed 2 tests of 11.
[07:08:28.538]
[07:08:28.538] (test program exited with status code 2)
[07:08:28.538] ------------------------------------------------------------------------------
[07:08:28.538]

[1] https://cirrus-ci.com/task/6228217159221248

Thanks for testing it on Windows.

It seems like the path needs to be cleaned on Windows. I checked
Cluster.pm and created a version of that code and added that to the test
that should work. See attached patch.

I noted that many of the paths are not platform-agnostic. It an idea to
switch to use something like File::Spec instead and build paths using
that, but it's out of scope for this patch.

Best wishes,
Mats Kindahl, Multigres Engineer, Supabase

Attachments:

v6.0001-pg_rewind-use-UUIDs-to-detect-independent-same-TLI-p.patchtext/x-patch; charset=UTF-8; name=v6.0001-pg_rewind-use-UUIDs-to-detect-independent-same-TLI-p.patchDownload+615-24
#11Japin Li
japinli@hotmail.com
In reply to: Mats Kindahl (#10)
Re: pg_rewind does not rewind diverging timelines

Hi, Mats

On Sat, 30 May 2026 at 22:26, Mats Kindahl <mats.kindahl@gmail.com> wrote:

Hi Japin,

On 5/29/26 04:01, Japin Li wrote:

Hi, Mats

On Tue, 26 May 2026 at 18:03, Mats Kindahl <mats.kindahl@gmail.com> wrote:

Attached a new version of the patch with the changes you suggested.

I found an error on the Windows platform [1].

[07:08:28.538] >>> MALLOC_PERTURB_=168
PG_REGRESS=C:\cirrus\build\src/test\regress\pg_regress.exe
REGRESS_SHLIB=C:\cirrus\build\src/test\regress\regress.dll
MSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1
top_builddir=C:\cirrus\build
UBSAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1:print_stacktrace=1
MESON_TEST_ITERATION=1
PATH=C:\cirrus\build\tmp_install\usr\local\pgsql\bin;C:\cirrus\build\src\bin\pg_rewind;C:/cirrus/build/src/bin/pg_rewind/test;C:\VS_2019\VC\Tools\MSVC\14.29.30133\bin\HostX64\x64;C:\VS_2019\MSBuild\Current\bin\Roslyn;C:\Program
Files (x86)\Windows Kits\10\bin\10.0.22621.0\x64;C:\Program Files
(x86)\Windows
Kits\10\bin\x64;C:\VS_2019\\MSBuild\Current\Bin;C:\Windows\Microsoft.NET\Framework64\v4.0.30319;C:\VS_2019\Common7\IDE\;C:\VS_2019\Common7\Tools\;C:\VS_2019\VC\Auxiliary\Build;C:\zstd\zstd-v1.5.2-win64;C:\zlib;C:\lz4;C:\icu;C:\winflexbison;C:\strawberry\5.42.0.1\perl\bin;C:\python\Scripts\;C:\python\;C:\Windows
Kits\10\Debuggers\x64;C:\Program
Files\Git\usr\bin;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem;C:\Windows\System32\WindowsPowerShell\v1.0\;C:\Windows\System32\OpenSSH\;C:\ProgramData\GooGet;C:\Program
Files\Google\Compute Engine\metadata_scripts;C:\Program Files
(x86)\Google\Cloud SDK\google-cloud-sdk\bin;C:\Program
Files\PowerShell\7\;C:\Program Files\Google\Compute
Engine\sysprep;C:\ProgramData\chocolatey\bin;C:\Program
Files\Git\cmd;C:\Program Files\Git\mingw64\bin;C:\Program
Files\Git\usr\bin;C:\Windows\system32\config\systemprofile\AppData\Local\Microsoft\WindowsApps
INITDB_TEMPLATE=C:/cirrus/build/tmp_install/initdb-template
ASAN_OPTIONS=halt_on_error=1:abort_on_error=1:print_summary=1
share_contrib_dir=C:/cirrus/build/tmp_install//usr/local/pgsql/share/contrib
C:\python\python3.EXE C:\cirrus\build\..\src/tools/testwrap
--basedir C:\cirrus\build --srcdir C:\cirrus\src\bin\pg_rewind
--pg-test-extra --testgroup pg_rewind --testname 005_same_timeline
-- C:\strawberry\5.42.0.1\perl\bin\perl.EXE -I
C:/cirrus/src/test/perl -I C:\cirrus\src\bin\pg_rewind
C:/cirrus/src/bin/pg_rewind/t/005_same_timeline.pl
[07:08:28.538] ------------------------------------- 8< -------------------------------------
[07:08:28.538] stderr:
[07:08:28.538] # Failed test 'pg_rewind rewinds across mismatched TLI 2 / TLI 2-prime to TLI 1'
[07:08:28.538] # at C:/cirrus/src/bin/pg_rewind/t/005_same_timeline.pl line 45.
[07:08:28.538] # ---------- command failed ----------
[07:08:28.538] # pg_rewind --debug --source-pgdata
C:\cirrus\build/testrun/pg_rewind/005_same_timeline\data/t_005_same_timeline_node_b2_data/pgdata
--target-pgdata
C:\cirrus\build/testrun/pg_rewind/005_same_timeline\data/t_005_same_timeline_node_a2_data/pgdata
--no-sync --config-file
C:\cirrus\build\testrun\pg_rewind\005_same_timeline\data\tmp_test_ZCeZ/target-postgresql.conf.tmp
--restore-target-wal
[07:08:28.538] # -------------- stderr --------------
[07:08:28.538] # pg_rewind: using for rewind "restore_command = 'cp "C:cirrusuild/testrun/pg_rewind/005_same_timelinedata/t_005_same_timeline_node_x_data/pgdata/pg_wal/%f" "%p"'"
[07:08:28.538] # pg_rewind: Source timeline history:
[07:08:28.538] # pg_rewind: 1: 0/00000000 - 0/040000E0
[07:08:28.538] # pg_rewind: 2: 0/040000E0 - 0/00000000
[07:08:28.538] # pg_rewind: Target timeline history:
[07:08:28.538] # pg_rewind: 1: 0/00000000 - 0/040000E0
[07:08:28.538] # pg_rewind: 2: 0/040000E0 - 0/060000E0
[07:08:28.538] # pg_rewind: 3: 0/060000E0 - 0/00000000
[07:08:28.538] # pg_rewind: servers diverged at WAL location 0/040000E0 on timeline 1
[07:08:28.538] # cp: cannot stat 'C:cirrus'$'\b''uild/testrun/pg_rewind/005_same_timelinedata/t_005_same_timeline_node_x_data/pgdata/pg_wal/000000020000000000000004': No such file or directory
[07:08:28.538] # pg_rewind: error: could not restore file "000000020000000000000004" from archive
[07:08:28.538] # pg_rewind: error: could not find previous WAL record at 0/040000E0
[07:08:28.538] # ------------------------------------
[07:08:28.538] # Failed test 'rewound node reflects source history, not target TLI 2/TLI 3 data'
[07:08:28.538] # at C:/cirrus/src/bin/pg_rewind/t/005_same_timeline.pl line 260.
[07:08:28.538] # got: 'origin2
[07:08:28.538] # x'
[07:08:28.538] # expected: 'b
[07:08:28.538] # origin2'
[07:08:28.538] # Looks like you failed 2 tests of 11.
[07:08:28.538]
[07:08:28.538] (test program exited with status code 2)
[07:08:28.538] ------------------------------------------------------------------------------
[07:08:28.538]

[1] https://cirrus-ci.com/task/6228217159221248

Thanks for testing it on Windows.

It seems like the path needs to be cleaned on Windows. I checked
Cluster.pm and created a version of that code and added that to the
test that should work. See attached patch.

I noted that many of the paths are not platform-agnostic. It an idea
to switch to use something like File::Spec instead and build paths
using that, but it's out of scope for this patch.

Thanks for updating the patch. LGTM.

--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.