Return value of XLogInsertRecord() for XLOG_SWITCH record

Started by cca55072 months ago11 messageshackers
Jump to latest
#1cca5507
cca5507@qq.com

Hi,

I find that the return value of XLogInsertRecord() for XLOG_SWITCH record
is inconsistent with other records.

For XLOG_SWITCH record:

```
/*
* Even though we reserved the rest of the segment for us, which is
* reflected in EndPos, we return a pointer to just the end of the
* xlog-switch record.
*/
if (inserted)
{
EndPos = StartPos + SizeOfXLogRecord;
if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
{
uint64 offset = XLogSegmentOffset(EndPos, wal_segment_size);

if (offset == EndPos % XLOG_BLCKSZ)
EndPos += SizeOfXLogLongPHD;
else
EndPos += SizeOfXLogShortPHD;
}
}
```

It is equivalent to XLogBytePosToRecPtr(), but all other records use XLogBytePosToEndRecPtr().

No actual problem found yet, but I think it's better to keep them consistent. Thoughts?

--
Regards,
ChangAo Chen

Attachments:

v1-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchapplication/octet-stream; charset=utf-8; name=v1-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchDownload+2-13
#2반지현
rring0727@gmail.com
In reply to: cca5507 (#1)
Re: Return value of XLogInsertRecord() for XLOG_SWITCH record

Hi ChangAo,

I tested v1 of this patch.

Environment:
- PostgreSQL master (commit 9b43e6793b0)
- Ubuntu 24.04, gcc 13.3.0, x86_64
- Configured with --enable-cassert --enable-debug --enable-tap-tests

The patch applies cleanly and builds without new warnings.

Testing performed:
- make check: PASS
- make check-world: PASS
- make -C src/test/recovery check: PASS (599 tests)
In particular, t/043_no_contrecord_switch.pl and t/039_end_of_wal.pl
both pass, which exercise the XLOG_SWITCH boundary handling.

Manual verification:
I ran pg_switch_wal() twice with some activity in between and observed
that the returned LSN and subsequent pg_current_wal_lsn() values are
consistent with segment boundaries and page headers:

SELECT pg_switch_wal(); -- 0/0178FE38
SELECT pg_current_wal_lsn(); -- 0/02000000 (new segment)

-- ... some DDL and INSERT ...

SELECT pg_switch_wal(); -- 0/02026528
SELECT pg_current_wal_lsn(); -- 0/03000060 (new segment + 0x60
-- == SizeOfXLogLongPHD)

The 0x60 offset matches the long page header size, which is what the
original code was computing via the explicit SizeOfXLogLongPHD /
SizeOfXLogShortPHD branches. The refactored version using
XLogBytePosToEndRecPtr(XLogRecPtrToBytePos(StartPos) +
MAXALIGN(SizeOfXLogRecord)) produces the same result while matching
the pattern used elsewhere in XLogInsertRecord().

One question:
The original code did not apply MAXALIGN() to SizeOfXLogRecord before
adding it. In practice SizeOfXLogRecord is likely already MAXALIGN'd
(given typical record header layout), but could you confirm whether
MAXALIGN() here is a correctness fix, a defensive no-op, or something
that requires a separate note in the commit message?

Otherwise the change looks good to me, and I think it's a reasonable
cleanup.

Tested-by: Jihyun Bahn <rring0727@gmail.com>

Regards,
Jihyun Bahn

#3cca5507
cca5507@qq.com
In reply to: 반지현 (#2)
Re: Return value of XLogInsertRecord() for XLOG_SWITCH record

Hi, thanks for the test!

One question:
The original code did not apply MAXALIGN() to SizeOfXLogRecord before
adding it. In practice SizeOfXLogRecord is likely already MAXALIGN'd
(given typical record header layout), but could you confirm whether
MAXALIGN() here is a correctness fix, a defensive no-op, or something
that requires a separate note in the commit message?

Otherwise the change looks good to me, and I think it's a reasonable
cleanup.

I apply the MAXALIGN() to keep it consistent with ReserveXLogSwitch(), the
value seems unchanged though.

Attach v2 which is more efficient.

--
Regards,
ChangAo Chen

Attachments:

v2-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchapplication/octet-stream; charset=utf-8; name=v2-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchDownload+9-3
#4반지현
rring0727@gmail.com
In reply to: cca5507 (#3)
Re: Return value of XLogInsertRecord() for XLOG_SWITCH record

Hi ChangAo,

Thanks for the explanation and v2. I re-tested the new version and
also ran a comparison against unpatched master to better understand
the behavior change.

## Test environment

Commit: 64b2b421248 (today's master)
OS: Ubuntu 24.04 LTS on AWS EC2 (t3.xlarge, 4 vCPU, 16 GB)
Build: --enable-cassert --enable-debug --enable-tap-tests
CFLAGS="-O0 -ggdb -fno-omit-frame-pointer"

## Test results for v2

make check: All 248 tests passed
make -C src/test/recovery check: All 599 tests passed (262s)

Nothing new from the build side — clean compile, no new warnings.

## Manual verification

A few observations from exercising pg_switch_wal() repeatedly:

-- After some activity that lands mid-segment:
SELECT pg_switch_wal(); -> 0/03000078
SELECT pg_current_wal_lsn(); -> 0/04000000

-- Immediately calling switch again (now exactly on segment boundary):
SELECT pg_switch_wal(); -> 0/04000000 (no-op, same LSN)
SELECT pg_switch_wal(); -> 0/04000000 (still no-op)

So pg_switch_wal() is idempotent at segment boundaries — which is
what one would hope for.

## Comparison with unpatched master

I stashed v2, rebuilt, and re-ran the same sequence on current
master. The externally observable LSN behavior is identical:
repeated pg_switch_wal() calls at a segment boundary all return
the same LSN with no further WAL written.

This makes sense: when ReserveXLogSwitch() determines there's no
space worth switching, XLogInsertRecord() enters the branch with
inserted == false, so the EndPos adjustment logic on the
`if (inserted)` path isn't actually reached at segment boundaries
in current code.

## My take on v2

The new `EndPos % XLOG_BLCKSZ != 0` guard doesn't fix an
externally observable bug in current master — ReserveXLogSwitch()
already prevents the problematic case from reaching this code path.
But the guard makes the invariant local to XLogInsertRecord()
rather than depending on the caller's contract, and the added
comment referencing XLogBytePosToEndRecPtr() ties the two paths
together nicely. I think that's a worthwhile defensive improvement.

Going back to the structure of the original code (rather than the
v1 helper-based rewrite) also seems reasonable — no function-call
overhead, and the arithmetic is clear enough with the new comment.

The MAXALIGN() consistency argument with ReserveXLogSwitch() makes
sense to me.

Tested-by: Jihyun Bahn <rring0727@gmail.com>

Regards,
Jihyun Bahn

2026년 4월 21일 (화) 오후 8:36, cca5507 <cca5507@qq.com>님이 작성:

Show quoted text

Hi, thanks for the test!

One question:
The original code did not apply MAXALIGN() to SizeOfXLogRecord before
adding it. In practice SizeOfXLogRecord is likely already MAXALIGN'd
(given typical record header layout), but could you confirm whether
MAXALIGN() here is a correctness fix, a defensive no-op, or something
that requires a separate note in the commit message?

Otherwise the change looks good to me, and I think it's a reasonable
cleanup.

I apply the MAXALIGN() to keep it consistent with ReserveXLogSwitch(), the
value seems unchanged though.

Attach v2 which is more efficient.

--
Regards,
ChangAo Chen

#5ZizhuanLiu X-MAN
44973863@qq.com
In reply to: cca5507 (#3)
Re: Return value of XLogInsertRecord() for XLOG_SWITCH record

Original
From: cca5507 <cca5507@qq.com>
Date: 2026-04-21 19:35
To:반지현<rring0727@gmail.com>, pgsql-hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Return value of XLogInsertRecord() for XLOG_SWITCH record

Hi, thanks for the test!

One question:
The original code did not apply MAXALIGN() to SizeOfXLogRecord before
adding it. In practice SizeOfXLogRecord is likely already MAXALIGN'd
(given typical record header layout), but could you confirm whether
MAXALIGN() here is a correctness fix, a defensive no-op, or something
that requires a separate note in the commit message

Otherwise the change looks good to me, and I think it's a reasonable
cleanup.

I apply the MAXALIGN() to keep it consistent with ReserveXLogSwitch(), the
value seems unchanged though.

Attach v2 which is more efficient.

--
Regards,
ChangAo Chen

Hi, cca5507,반지현,

Thanks for your patch and test— it’s sparked my deeper dive into the WAL switch logic.

I’ve done a preliminary survey of all top-level callers of XLogInsertRecord()
that generate XLOG_SWITCH records, and grouped them into three categories:
1.do_pg_backup_start(), do_pg_backup_stop(), ShutdownXLOG()
These callers completely ignore the return value of XLogInsertRecord(),
so a semantic change here would have zero internal impact on them.

2.CheckArchiveTimeout()
It stores the returned LSN into switchpoint only for debug logging:
```c
if (XLogSegmentOffset(switchpoint, wal_segment_size) != 0)
elog(DEBUG1, "write-ahead log switch forced (\"archive_timeout\"=%d)",
XLogArchiveTimeout);
```c
The offset check here is just a trivial debug print, and would not
introduce functional defects even if the returned LSN changes its meaning.

3.pg_switch_wal(PG_FUNCTION_ARGS) — the critical public entry point
This function directly returns the LSN from XLogInsertRecord() as
its SQL-level return value exposed to end users, scripts, and external tooling.

Risk analysis:
From the above classification, internal PostgreSQL core logic suffers
no functional breakage if we unify the return value semantics of XLogInsertRecord().

However, there are compatibility risks for external consumers relying on pg_switch_wal()’s output:
Custom applications, backup scripts, and monitoring tools may take
the returned LSN as the exact position of the XLOG_SWITCH record to
run space / time range analysis between the switch record and segment end.

Since we cannot guarantee how existing third-party systems interpret this return
LSN, altering its definition would break established external workflows and trigger
unpredictable side effects for legacy deployments.

That is the main backward-compatibility risk I can identify for this change.

Happy to hear your thoughts or any corrections to my analysis.

regards,
--
ZizhuanLiu (X-MAN) 
44973863@qq.com

#6반지현
rring0727@gmail.com
In reply to: ZizhuanLiu X-MAN (#5)
Re: Return value of XLogInsertRecord() for XLOG_SWITCH record

Hi ZizhuanLiu,

Thanks for the caller survey — classifying the call sites by how they
consume the return value is a useful way to frame the risk, and I agree
pg_switch_wal() is the one path where the LSN escapes to external
consumers.

To help bound that risk, I re-ran my April comparison on a second
platform (Windows 11, MSYS2/gcc 16.1.0, commit 9d141466ff, 19beta1),
building unpatched master and v2 side by side and running the same
sequence on identically-initialized clusters:

before_switch: 0/017CF958 (both builds)
switch_1: 0/017CF970 (both builds)
after_1: 0/02000000 (both builds)
switch_2/3: 0/02000000, no-op (both builds)

The two builds returned byte-for-byte identical LSNs at every step,
and "make check" passes on the patched build (All 245 tests passed).
Incidentally this also confirms numerically that the MAXALIGN() change
is a no-op: 0/017CF958 + 24 = 0/017CF970, since SizeOfXLogRecord is
already 8-byte aligned.

As far as I can tell, the only input where v2 can return a different
value than current master is when the XLOG_SWITCH record ends exactly
on a page boundary (StartPos at page offset XLOG_BLCKSZ - 24): old code
takes the cross-page branch and adds a page header size even though no
part of the record lies on the next page, while v2 returns the boundary
itself. The v2 value is the one consistent with the end-pointer
convention of XLogBytePosToEndRecPtr(). So any external tool that
observed a different value in that rare alignment was depending on a
value inconsistent with PostgreSQL's own end-pointer semantics — I'd
read v2 as correcting that, rather than breaking compatibility.

So in summary: identical SQL-visible behavior on all common paths
(verified on Linux in April and Windows now), with the single divergent
case being a consistency fix. If it would help, I could try to craft a
reproduction of the exact page-boundary case (padding WAL position via
pg_logical_emit_message), or write a small TAP test pinning down the
boundary behavior of pg_switch_wal().

Regards,
Jihyun Bahn

Attachments:

comparison-results.txt.txttext/plain; name=comparison-results.txt.txt; x-apple-part-url=B0512A15-F6BD-4030-ACB6-F56B3DB550A0Download
#7ZizhuanLiu X-MAN
44973863@qq.com
In reply to: 반지현 (#6)
Re: Return value of XLogInsertRecord() for XLOG_SWITCH record

To: ZizhuanLiu X-MAN <44973863(at)qq(dot)com>, cca5507 <cca5507(at)qq(dot)com>
Cc: pgsql-hackers <pgsql-hackers(at)lists(dot)postgresql(dot)org>
Subject: Re: Return value of XLogInsertRecord() for XLOG_SWITCH record
Date: 2026-06-11 06:53:19
Message-ID: 3CAC439F-E1F0-4F37-BE58-CC9D7CDE889C@gmail.com
Views: Whole Thread | Raw Message | Download mbox | Resend email
Lists: pgsql-hackers
Hi ZizhuanLiu,

Thanks for the caller survey — classifying the call sites by how they
consume the return value is a useful way to frame the risk, and I agree
pg_switch_wal() is the one path where the LSN escapes to external
consumers.

To help bound that risk, I re-ran my April comparison on a second
platform (Windows 11, MSYS2/gcc 16.1.0, commit 9d141466ff, 19beta1),
building unpatched master and v2 side by side and running the same
sequence on identically-initialized clusters:

before_switch: 0/017CF958 (both builds)
switch_1: 0/017CF970 (both builds)
after_1: 0/02000000 (both builds)
switch_2/3: 0/02000000, no-op (both builds)

The two builds returned byte-for-byte identical LSNs at every step,
and "make check" passes on the patched build (All 245 tests passed).
Incidentally this also confirms numerically that the MAXALIGN() change
is a no-op: 0/017CF958 + 24 = 0/017CF970, since SizeOfXLogRecord is
already 8-byte aligned.

As far as I can tell, the only input where v2 can return a different
value than current master is when the XLOG_SWITCH record ends exactly
on a page boundary (StartPos at page offset XLOG_BLCKSZ - 24): old code
takes the cross-page branch and adds a page header size even though no
part of the record lies on the next page, while v2 returns the boundary
itself. The v2 value is the one consistent with the end-pointer
convention of XLogBytePosToEndRecPtr(). So any external tool that
observed a different value in that rare alignment was depending on a
value inconsistent with PostgreSQL's own end-pointer semantics — I'd
read v2 as correcting that, rather than breaking compatibility.

So in summary: identical SQL-visible behavior on all common paths
(verified on Linux in April and Windows now), with the single divergent
case being a consistency fix. If it would help, I could try to craft a
reproduction of the exact page-boundary case (padding WAL position via
pg_logical_emit_message), or write a small TAP test pinning down the
boundary behavior of pg_switch_wal().

Regards,
Jihyun Bahn

Hi, 반지현

Thank you very much for your professional and comprehensive testing and analysis.
I’ve learned a great deal more about WAL internals through this discussion, including
details related to pg_logical_emit_message.

After further review and testing on my side, I fully agree with your reasoning.
Patch v2 corrects the inconsistent end-pointer calculation and does not break
compatibility with external tools.

Upon closer reflection, the original condition
```c
if (StartPos / XLOG_BLCKSZ != EndPos / XLOG_BLCKSZ)
```c
is intended to detect whether the XLOG_SWITCH record crosses a page boundary.
This branch then decides whether to add SizeOfXLogLongPHD (crossing WAL segment)
or SizeOfXLogShortPHD (crossing page only, same segment).

As we all know, EndPos points to the start of the next WAL record rather than the end
of the current one. The byte range occupied by any WAL record is [StartPos, EndPos).
Therefore, to correctly judge page crossing for the switch record, the condition should be adjusted to:
```c
if (StartPos / XLOG_BLCKSZ != (EndPos - 1) / XLOG_BLCKSZ)
```c
This expression accurately preserves the original design intent. I have prepared a v3 patch
based on v2, with only this single conditional expression modified.

After additional consideration: the helper functions that reserve space for WAL records
already compute the [StartPos, EndPos) range, including the *EndPos = XLogBytePosToEndRecPtr()
logic we discussed earlier. The XLOG_SWITCH record is a special case because it intentionally
consumes the remaining space in the current segment, forcing us to recalculate EndPos
a second time in this code block.

To avoid duplicate calculations, we can extend the parameters of the local function
ReserveXLogSwitch() by adding an output argument XLogRecPtr *actual_EndPos
to store the value computed via *EndPos = XLogBytePosToEndRecPtr(). With this change,
the logic here becomes cleaner: after inserted the XLOG_SWITCH WAL, we simply restore
EndPos to the precomputed actual_EndPos. All end-pointer calculations are unified in
one place — this is what v4 implements.

Both v3 and v4 compile successfully and pass tests covering 5 scenarios (including multiple corner cases).
The patch files and detailed test logs are attached for your review:
1.An XLOG_SWITCH record fits entirely within a WAL block with ample trailing free space;
2.An XLOG_SWITCH record exactly fills a WAL block with zero remaining free space;
3.Multiple consecutive calls to pg_switch_wal() do not trigger an actual WAL segment rotation;
4.The XLOG_SWITCH record crosses block boundaries within a single WAL segment;
5.The XLOG_SWITCH record crosses both block boundaries and WAL segment files.

Feel free to share your thoughts and point out any mistakes.

regards,
--
ZizhuanLiu (X-MAN) 
44973863@qq.com

Attachments:

v3-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchapplication/octet-stream; charset=utf-8; name=v3-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchDownload+8-3
v4-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchapplication/octet-stream; charset=utf-8; name=v4-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchDownload+12-17
test-result-v3.txtapplication/octet-stream; charset=utf-8; name=test-result-v3.txtDownload
test-result-v4.txtapplication/octet-stream; charset=utf-8; name=test-result-v4.txtDownload
#8ZizhuanLiu X-MAN
44973863@qq.com
In reply to: ZizhuanLiu X-MAN (#7)
Re: Return value of XLogInsertRecord() for XLOG_SWITCH record

rebase v4.

Early v2/v3 only fixed the page-cross judgment condition, but still retained duplicated
EndPos calculation logic. v4 refactors ReserveXLogSwitch to compute standard EndPos
once and reuse it, eliminating redundant code fundamentally.

All behavior remains identical to v2/v3 after fix; v4 just unifies the calculation
point without changing runtime logic.

ZizhuanLiu X-MAN
44973863@qq.com

Attachments:

v4-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchapplication/octet-stream; charset=utf-8; name=v4-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchDownload+12-24
#9ZizhuanLiu X-MAN
44973863@qq.com
In reply to: ZizhuanLiu X-MAN (#8)
Re: Return value of XLogInsertRecord() for XLOG_SWITCH record

I am so sorry. rebase v3 and v4.

regards,
--
ZizhuanLiu (X-MAN)
44973863@qq.com

Show quoted text

Original
From: ZizhuanLiu X-MAN <44973863@qq.com>
Date: 2026-06-14 22:37
To: rring0727 <rring0727@gmail.com>, cca5507 <cca5507@qq.com>
Cc: pgsql-hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Return value of XLogInsertRecord() for XLOG_SWITCH record

rebase v4.

Early v2/v3 only fixed the page-cross judgment condition, but still retained duplicated
EndPos calculation logic. v4 refactors ReserveXLogSwitch to compute standard EndPos
once and reuse it, eliminating redundant code fundamentally.

All behavior remains identical to v2/v3 after fix; v4 just unifies the calculation
point without changing runtime logic.

ZizhuanLiu X-MAN
44973863@qq.com

Attachments:

v4-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchapplication/octet-stream; charset=utf-8; name=v4-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchDownload+12-23
v3-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchapplication/octet-stream; charset=utf-8; name=v3-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchDownload+1-3
#10ZizhuanLiu X-MAN
44973863@qq.com
In reply to: ZizhuanLiu X-MAN (#8)
Re: Return value of XLogInsertRecord() for XLOG_SWITCH record

Let us rebase v3 first .

regards,
--
ZizhuanLiu (X-MAN)
44973863@qq.com

Show quoted text

Original
From: ZizhuanLiu X-MAN <44973863@qq.com>
Date: 2026-06-14 22:37
To: rring0727 <rring0727@gmail.com>, cca5507 <cca5507@qq.com>
Cc: pgsql-hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Return value of XLogInsertRecord() for XLOG_SWITCH record

Early v2/v3 only fixed the page-cross judgment condition, but still retained duplicated
EndPos calculation logic. v4 refactors ReserveXLogSwitch to compute standard EndPos
once and reuse it, eliminating redundant code fundamentally.

All behavior remains identical to v2/v3 after fix; v4 just unifies the calculation
point without changing runtime logic.

ZizhuanLiu X-MAN
44973863@qq.com

Attachments:

v3-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchapplication/octet-stream; charset=utf-8; name=v3-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchDownload+8-3
#11ZizhuanLiu X-MAN
44973863@qq.com
In reply to: ZizhuanLiu X-MAN (#10)
Re: Return value of XLogInsertRecord() for XLOG_SWITCH record

Rebase v4 .

regards,
--
ZizhuanLiu (X-MAN) 
44973863@qq.com

Show quoted text

Original
From: ZizhuanLiu X-MAN <44973863@qq.com>
Date: 2026-06-14 22:37
To: rring0727 <rring0727@gmail.com>, cca5507 <cca5507@qq.com>
Cc: pgsql-hackers <pgsql-hackers@lists.postgresql.org>
Subject: Re: Return value of XLogInsertRecord() for XLOG_SWITCH record

Early v2/v3 only fixed the page-cross judgment condition, but still retained duplicated
EndPos calculation logic. v4 refactors ReserveXLogSwitch to compute standard EndPos 
once and reuse it, eliminating redundant code fundamentally.

All behavior remains identical to v2/v3 after fix; v4 just unifies the calculation
point without changing runtime logic.

ZizhuanLiu X-MAN
44973863@qq.com

Attachments:

v4-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchapplication/octet-stream; charset=utf-8; name=v4-0001-Keep-the-return-value-of-XLogInsertRecord-for-XLO.patchDownload+12-17