Can we change pg_rewind used without wal_log_hints and data_checksums

Started by lchch1990@sina.cn, 3 months ago, 12 messages, hackers
#1lchch1990@sina.cn
lchch1990@sina.cn

Hi hackers,

I have been wondering why pg_rewind needs wal_log_hints or data_checksums, a requirement
that significantly limits its usability. I searched around and could only find a code
comment saying that it guards against data corruption.

I came up with a case that may depend on page consistency after running pg_rewind:
1. We have primary A and standby B.
2. A transaction xact1 is in progress and has modified some pages.
3. A checkpoint runs on A.
4. Standby B is promoted.
5. xact1 commits on A, and a query reads all the data modified by xact1.
6. pg_rewind is run on A.

Without page consistency (neither setting enabled), if we hack the pg_rewind code to
force a rewind, we may see xact1 on A but not on B, which is an inconsistency.

My guess is that pg_rewind cannot handle this case, so we must set wal_log_hints to on
to avoid it. If so, we could modify pg_rewind to handle this case. If not, I would like
to know the reason, or be pointed to a mail thread that discusses it. Thanks.

Here I want to introduce a way to solve the case above:
We need to record all transaction IDs committed after the divergence record and search
further back in the WAL, before the divergence record, to find a start LSN (lsn_s) that
is older than all of those transactions. Then we read from lsn_s to the divergence LSN to
collect the pages affected by those transactions, so that we can copy them in the rewind phase.
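
The proposal above can be sketched on a toy WAL. This is only an illustration under stated assumptions: the record layout (`lsn`, `xid`, `kind`, `page`) and the function name `plan_page_copy` are hypothetical, and real WAL records and pg_rewind internals look nothing like this.

```python
def plan_page_copy(wal, diverge_lsn):
    """Return (lsn_s, pages) for a toy WAL given as a list of dict records."""
    # Transactions whose commit record lies after the divergence point.
    committed = {r["xid"] for r in wal
                 if r["kind"] == "commit" and r["lsn"] > diverge_lsn}
    # lsn_s: the earliest record written by any of those transactions.
    lsn_s = min((r["lsn"] for r in wal if r["xid"] in committed),
                default=diverge_lsn)
    # Pages those transactions changed between lsn_s and the divergence point.
    pages = {r["page"] for r in wal
             if r["kind"] == "change" and r["xid"] in committed
             and lsn_s <= r["lsn"] <= diverge_lsn}
    return lsn_s, pages


# Hypothetical sample: xid 100 changes page 7 early and commits after LSN 35.
wal = [
    {"lsn": 10, "xid": 100, "kind": "change", "page": 7},
    {"lsn": 20, "xid": 101, "kind": "change", "page": 8},
    {"lsn": 30, "xid": 101, "kind": "commit", "page": None},
    {"lsn": 40, "xid": 100, "kind": "commit", "page": None},
]
lsn_s, pages = plan_page_copy(wal, diverge_lsn=35)
```

With the divergence point at LSN 35, only xid 100 commits afterwards, so the scan has to reach back to LSN 10 and page 7 is the one that needs copying.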

Best Regards,
Movead Li

#2Michael Paquier
michael@paquier.xyz
In reply to: lchch1990@sina.cn (#1)
Re: Can we change pg_rewind used without wal_log_hints and data_checksums

On Thu, Jan 15, 2026 at 09:47:39AM +0800, lchch1990@sina.cn wrote:

I have been wondering why pg_rewind needs wal_log_hints or data_checksums, a requirement
that significantly limits its usability. I searched around and could only find a code
comment saying that it guards against data corruption.

Hint bits can be set on a page, and we *have to* WAL-log these pages
so that pg_rewind can track the modified blocks, or we would corrupt the
data folder after a rewind. See this thread for the original
description of the issue:
/messages/by-id/519E5493.5060800@vmware.com

Here I want to introduce a way to solve the case above:
We need to record all transaction IDs committed after the divergence record and search
further back in the WAL, before the divergence record, to find a start LSN (lsn_s) that
is older than all of those transactions. Then we read from lsn_s to the divergence LSN to
collect the pages affected by those transactions, so that we can copy them in the rewind phase.

How is a method based on the tracking of transaction IDs and the
modified blocks not going to be more costly than the current method
where we are able to track the modified blocks directly in the WAL
records?
--
Michael

#3lchch1990@sina.cn
lchch1990@sina.cn
Re: Re: Can we change pg_rewind used without wal_log_hints and data_checksums

On Thu, Jan 15, 2026 at 09:47:39AM +0800, lchch1990@sina.cn wrote:

Hint bits can be set on a page, and we *have to* WAL-log these pages
so that pg_rewind can track the modified blocks, or we would corrupt the
data folder after a rewind. See this thread for the original
description of the issue:
/messages/by-id/519E5493.5060800@vmware.com

Thanks, I will read the thread.

How is a method based on the tracking of transaction IDs and the
modified blocks not going to be more costly than the current method
where we are able to track the modified blocks directly in the WAL
records?

My purpose is to remove pg_rewind's dependency on wal_log_hints and data_checksums,
so we need to do more work than it currently does. The method of tracking transaction IDs
is meant to handle a case that pg_rewind currently cannot handle.

--
Movead Li

#4Laurenz Albe
laurenz.albe@cybertec.at
In reply to: lchch1990@sina.cn (#1)
Re: Can we change pg_rewind used without wal_log_hints and data_checksums

On Thu, 2026-01-15 at 09:47 +0800, lchch1990@sina.cn wrote:

I have been wondering why pg_rewind needs wal_log_hints or data_checksums, a requirement
that significantly limits its usability. I searched around and could only find a code
comment saying that it guards against data corruption.

See /messages/by-id/CA+TgmoY4j+p7JY69ry8GpOSMMdZNYqU6dtiONPrcxaVG+SPByg@mail.gmail.com

In more detail:

1. there is a transaction open on the primary server (server A)

2. the transaction inserts a row

3. a checkpoint happens

4. the transaction commits

5. the session reads the row it just inserted, which sets hint bits on the row
that mark it as generally visible

Now the standby (server B) promoted between steps 3 and 4, which means that on server B
(the new primary), the transaction didn't commit and the row is invisible.

Now if we run pg_rewind on server A, it examines the local WAL to find all the blocks
that were modified after the last common checkpoint (which happened in step 3 above).
If neither wal_log_hints = on nor checksums are enabled (which effectively forces
WAL-logging hint bit changes), there is no track of step 5 in the WAL, and pg_rewind
fails to copy that block from server B. The consequence is that after pg_rewind, the
row is *still* visible on server A because of the hint bits. That is data corruption.

Therefore, the requirement cannot be relaxed.
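
The five steps above can be replayed on a toy model to show why the block is missed. This is a minimal sketch, not pg_rewind's code: the WAL is a plain list of `(lsn, block)` pairs, block number 7 is arbitrary, and `run_scenario` and `blocks_to_copy` are hypothetical names.

```python
def blocks_to_copy(wal, checkpoint_lsn):
    """Blocks referenced by WAL records written after the last common checkpoint."""
    return {blk for lsn, blk in wal if lsn > checkpoint_lsn and blk is not None}


def run_scenario(hints_are_wal_logged):
    wal = []

    def log(block=None):
        # Append a record; the LSN is simply its 1-based position in the list.
        wal.append((len(wal) + 1, block))
        return len(wal)

    log(block=7)               # step 2: the transaction inserts a row into block 7
    checkpoint_lsn = log()     # step 3: checkpoint (references no data block)
    log()                      # step 4: commit record (references no data block)
    if hints_are_wal_logged:   # step 5: the read sets hint bits on block 7;
        log(block=7)           # this reaches the WAL only if hints are logged
    return blocks_to_copy(wal, checkpoint_lsn)
```

In this model `run_scenario(False)` finds no block to copy, so block 7 keeps its hint bits after the rewind, while `run_scenario(True)` finds block 7.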

Yours,
Laurenz Albe

#5lchch1990@sina.cn
lchch1990@sina.cn
In reply to: lchch1990@sina.cn (#1)
Re: Can we change pg_rewind used without wal_log_hints and data_checksums

Here I want to introduce a way to solve the case above:
We need to record all transaction IDs committed after the divergence record and search
further back in the WAL, before the divergence record, to find a start LSN (lsn_s) that
is older than all of those transactions. Then we read from lsn_s to the divergence LSN to
collect the pages affected by those transactions, so that we can copy them in the rewind phase.

Hi hackers,

I am sorry; I am still trying to work out how to send mail that threads in order.

----
Best Regards,
Movead Li

#6lchch1990@sina.cn
lchch1990@sina.cn
In reply to: lchch1990@sina.cn (#1)
Re: Can we change pg_rewind used without wal_log_hints and data_checksums

On Thu, 2026-01-15 at 13:47 +0800, laurenz.albe@cybertec.at wrote:

Now if we run pg_rewind on server A, it examines the local WAL to find all the blocks
that were modified after the last common checkpoint (which happened in step 3 above).
If neither wal_log_hints = on nor checksums are enabled (which effectively forces
WAL-logging hint bit changes), there is no track of step 5 in the WAL, and pg_rewind
fails to copy that block from server B.  The consequence is that after pg_rewind, the
row is *still* visible on server A because of the hint bits.  That is data corruption.
Therefore, the requirement cannot be relaxed.

Yes, I know these steps and I have checked the linked mail. As described in my first mail,
we can find a way to solve the problem so that pg_rewind can run without wal_log_hints and
data_checksums.

Currently pg_rewind searches the WAL starting at the checkpoint LSN or redo LSN. I mean to
search more WAL so that it covers all of the related transactions; then every related page
will be copied, and we never have to worry about the hint bits issue.

Anyway, I hope this mail arrives in order.

----
Best Regards,

Movead Li

#7lchch1990@sina.cn
lchch1990@sina.cn
In reply to: lchch1990@sina.cn (#1)
Re: Can we change pg_rewind used without wal_log_hints and data_checksums

On Thu, 2026-01-15 at 14:14 +0800, laurenz.albe@cybertec.at wrote:
Now if we run pg_rewind on server A, it examines the local WAL to find all the blocks
that were modified after the last common checkpoint (which happened in step 3 above).
If neither wal_log_hints = on nor checksums are enabled (which effectively forces
WAL-logging hint bit changes), there is no track of step 5 in the WAL, and pg_rewind
fails to copy that block from server B.  The consequence is that after pg_rewind, the
row is *still* visible on server A because of the hint bits.  That is data corruption.
Therefore, the requirement cannot be relaxed.

Currently pg_rewind searches the WAL starting at the checkpoint LSN or redo LSN. I mean to
search more WAL so that it covers all of the related transactions; then every related page
will be copied, and we never have to worry about the hint bits issue.

Based on the discussion, I wrote a patch. Here is how it works:

Currently, pg_rewind first walks backward from divergerec to find the checkpoint. Then it
reads forward from the checkpoint to divergerec and collects the changed pages.

We modify the second step so that it also collects the smallest committed transaction ID,
named min_commited_xid, and the 'first appeared' transaction ID from the XLOG_RUNNING_XACTS
WAL record, named base_xid. If base_xid <= min_commited_xid, we can do a safe
rewind.

However, if 'base_xid <= min_commited_xid' is not met, we read the WAL backward from the
checkpoint until the condition is met, collecting changed pages along the way; this is
the third step. If the condition still cannot be met, we report an error saying the rewind
cannot be completed.

The third step may be slow, so I added an option (-d or --deep-dig). By default, pg_rewind
stops if the condition is not met in the second step; the user must pass -d to run the third step.
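
The decision logic described above might look roughly like the following. This is a sketch under assumptions, not the patch itself: `choose_scan_start`, `RewindError`, and the `(lsn, base_xid)` list of older XLOG_RUNNING_XACTS records are hypothetical, and transaction IDs are compared as plain integers, ignoring XID wraparound.

```python
class RewindError(Exception):
    """Raised when a safe rewind cannot be planned (hypothetical error type)."""


def choose_scan_start(step2_base_xid, min_commited_xid, checkpoint_lsn,
                      older_records, deep_dig=False):
    """Return the LSN from which changed pages must be collected.

    older_records: (lsn, base_xid) pairs located before the checkpoint,
    ordered newest first, as a backward walk would encounter them.
    """
    # Step two: the range from checkpoint to divergerec already suffices.
    if step2_base_xid <= min_commited_xid:
        return checkpoint_lsn
    # Step three runs only when the user asked for it (-d / --deep-dig).
    if not deep_dig:
        raise RewindError("condition not met; retry with --deep-dig")
    for lsn, base_xid in older_records:
        if base_xid <= min_commited_xid:
            return lsn
    raise RewindError("ran out of WAL before the condition was met")
```

For example, `choose_scan_start(110, 105, 5000, [(4000, 108), (3000, 104)], deep_dig=True)` walks back past LSN 4000 and settles on LSN 3000.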

Patch attached.

----
Best Regards,
Movead Li

Attachments:

0001-Enable-pg_rewind-without-page-consistence.patch (application/octet-stream, +204 -20)

#8lchch1990@sina.cn
lchch1990@sina.cn
In reply to: lchch1990@sina.cn (#1)
Re: Can we change pg_rewind used without wal_log_hints and data_checksums

As discussed on the other mail thread:
/messages/by-id/696885c3db9be6.68269280.fb0eaf68@m0.mail.sina.com.cn

I need to make two changes to the patch.
1. Find some way to handle the static variables.
2. Do not collect useless pages in the third step.

For the first one, the static variables are needed in several places, and passing them
as function parameters seems worse, so I kept them.

For the second one, I tried to keep a list or bitmapset of committed transaction IDs, but
those data structures seem hard to use in src/bin code. So I use min_commited_xid and
max_commited_xid instead to filter out the useless pages.

In fact, min_commited_xid and max_commited_xid are the boundary transactions committed
after the divergence record, so this is enough.
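
To make the trade-off concrete, here is a small comparison of filtering by the [min_commited_xid, max_commited_xid] range against filtering by an exact set of committed IDs. It is an illustration only: the `(xid, page)` records and both function names are hypothetical, and XID wraparound is ignored.

```python
def pages_in_xid_range(records, min_xid, max_xid):
    """Pages touched by any transaction whose ID falls inside the range."""
    return {page for xid, page in records if min_xid <= xid <= max_xid}


def pages_for_xid_set(records, committed):
    """Pages touched only by transactions in the exact committed set."""
    return {page for xid, page in records if xid in committed}


# Hypothetical (xid, page) records; xids 100 and 150 committed after divergence,
# xid 101 lies inside the range but is not one of them.
records = [(100, "A"), (101, "B"), (150, "C")]
by_range = pages_in_xid_range(records, 100, 150)
by_set = pages_for_xid_set(records, {100, 150})
```

The range filter always returns a superset of what the exact set returns, so it never misses a page, but it may copy extra pages (page "B" here) when unrelated transaction IDs fall between the two boundaries.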

Updated patch attached.

----
Best Regards,
Movead Li

Attachments:

0001-pg_rewind_enhance.patch (application/octet-stream, +259 -19)

#9lchch1990@sina.cn
lchch1990@sina.cn
In reply to: lchch1990@sina.cn (#1)
Re: Can we change pg_rewind used without wal_log_hints and data_checksums

I updated the patch to fix a ninja build error found by CI.

----
Best Regards,

Movead Li

Attachments:

0001-pg_rewind_enhance_03.patch (application/octet-stream, +261 -19)

#10Neil Chen
carpenter.nail.cz@gmail.com
In reply to: lchch1990@sina.cn (#8)
Re: Can we change pg_rewind used without wal_log_hints and data_checksums

Hi Movead,

It's an interesting idea.

While it’s impossible to predict exactly how much WAL we’ll need to backtrack through --
I assume it mainly depends on the duration of long-running transactions -- this approach
seems to offer an opportunity to use pg_rewind without enabling wal_log_hints.

On Fri, Jan 16, 2026 at 9:28 PM Movead <lchch1990@sina.cn> wrote:

In fact, min_commited_xid and max_commited_xid are the boundary transactions committed
after the divergence record, so this is enough.

Given the potentially large gap between transaction IDs (especially when long-running
transactions are involved), maintaining a list/bitmap struct would be worthwhile.

A minor suggestion: for an operation that may fail, I suggest retrieving the first
XLOG_RUNNING_XACTS record to obtain its base_xid before doing the deep-dig process.
If the task cannot be completed (i.e., the base_xid <= min_commited_xid condition
isn’t met), we can throw an error immediately instead of waiting for all WAL records
to be parsed.
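
The fail-fast idea could be sketched as below. Everything here is hypothetical: `deep_dig` and the `(lsn, base_xid)` list stand in for the locally available XLOG_RUNNING_XACTS records, and the sketch assumes that base_xid never grows when walking backward, so the oldest record is the best case. It does not model WAL fetched on demand through restore_command.

```python
def deep_dig(records_newest_first, min_commited_xid, precheck=True):
    """Return (start_lsn or None, number of records examined)."""
    examined = 0
    if precheck and records_newest_first:
        # Look at the oldest available record first: if even that one does
        # not satisfy the condition, no newer record will (by assumption).
        examined += 1
        _, oldest_base_xid = records_newest_first[-1]
        if oldest_base_xid > min_commited_xid:
            return None, examined
    for lsn, base_xid in records_newest_first:
        examined += 1
        if base_xid <= min_commited_xid:
            return lsn, examined
    return None, examined


# Hypothetical records, newest first.
records = [(4000, 108), (3000, 107), (2000, 106)]
```

With `min_commited_xid = 105` the pre-check gives up after examining one record instead of three; when the dig can succeed, the pre-check costs one extra lookup.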

#11lchch1990@sina.cn
lchch1990@sina.cn
In reply to: lchch1990@sina.cn (#1)
Re: Can we change pg_rewind used without wal_log_hints and data_checksums

It's an interesting idea.
While it’s impossible to predict exactly how much WAL we’ll need to backtrack through --
I assume it mainly depends on the duration of long-running transactions -- this approach
seems to offer an opportunity to use pg_rewind without enabling wal_log_hints.

Hi Neil, thanks. I think it is meaningful too.

Given the potentially large gap between transaction IDs (especially when long-running
transactions are involved), maintaining a list/bitmap struct would be worthwhile.

Yes, I intended to do that, but bitmapset cannot be used in src/bin, and it does not seem
necessary to implement one. Note that min and max cover only what was committed after the
divergence record, so the gap is normally small. That is why I gave up on it.

A minor suggestion: for an operation that may fail, I suggest retrieving the first
XLOG_RUNNING_XACTS record to obtain its base_xid before doing the deep-dig process.
If the task cannot be completed (i.e., the base_xid <= min_commited_xid condition
isn’t met), we can throw an error immediately instead of waiting for all WAL records
to be parsed.

The main problem is that we cannot get the first WAL segment: when there is not enough WAL
locally, pg_rewind fetches WAL segments with restore_command. Your suggestion is meaningful
only when there is no restore_command. Anyway, let's see the hackers' opinions.

----
Best Regards,
Movead Li

#12wenhui qiu
qiuwenhuifx@gmail.com
In reply to: lchch1990@sina.cn (#11)
Re: Can we change pg_rewind used without wal_log_hints and data_checksums

Hi Movead and Heikki,
I have reviewed this patch and find the approach reasonable. Heikki, given that
you are one of the contributors to this tool, I would be grateful if you
could review it and provide your feedback. Thank you very much.

Thanks

On Fri, Jan 16, 2026 at 11:07 PM Movead <lchch1990@sina.cn> wrote:

It's an interesting idea.
While it’s impossible to predict exactly how much WAL we’ll need to backtrack through --
I assume it mainly depends on the duration of long-running transactions -- this approach
seems to offer an opportunity to use pg_rewind without enabling wal_log_hints.

Hi Neil, thanks. I think it is meaningful too.

Given the potentially large gap between transaction IDs (especially when long-running
transactions are involved), maintaining a list/bitmap struct would be worthwhile.

Yes, I intended to do that, but bitmapset cannot be used in src/bin, and it does not seem
necessary to implement one. Note that min and max cover only what was committed after the
divergence record, so the gap is normally small. That is why I gave up on it.

A minor suggestion: for an operation that may fail, I suggest retrieving the first
XLOG_RUNNING_XACTS record to obtain its base_xid before doing the deep-dig process.
If the task cannot be completed (i.e., the base_xid <= min_commited_xid condition
isn’t met), we can throw an error immediately instead of waiting for all WAL records
to be parsed.

The main problem is that we cannot get the first WAL segment: when there is not enough WAL
locally, pg_rewind fetches WAL segments with restore_command. Your suggestion is meaningful
only when there is no restore_command. Anyway, let's see the hackers' opinions.

----
Best Regards,
Movead Li