Propagate XLogFindNextRecord error to callers

Started by Anthonin Bonnefoy4 months ago17 messageshackers
Jump to latest
#1Anthonin Bonnefoy
anthonin.bonnefoy@datadoghq.com

Hi,

Currently, XLogFindNextRecord errormsg is ignored and callers will only
output a generic 'could not find a valid record' message without details.
Additionally, invalid page header won't go through XLogReadRecord, leaving
the error in state->errormsg_buf.

This patch propagates XLogFindNextRecord's error message to the callers and
displays it. In case of an invalid page header, the errormsg is filled with
errormsg_buf content.

With this patch, pg_waldump will now have the following output when reading
a file with an invalid header:
pg_waldump: error: could not find a valid record after D80/5C000000:
invalid magic number D116 in WAL segment 0000001400000D8000000017, LSN
D80/5C000000, offset 0

Regards,
Anthonin Bonnefoy

Attachments:

v1-0001-Propage-errormsg-to-XLogFindNextRecord-caller.patchapplication/octet-stream; name=v1-0001-Propage-errormsg-to-XLogFindNextRecord-caller.patchDownload+47-17
#2Mircea Cadariu
cadariu.mircea@gmail.com
In reply to: Anthonin Bonnefoy (#1)
Re: Propagate XLogFindNextRecord error to callers

Hi Anthonin,

On 12/12/2025 10:39, Anthonin Bonnefoy wrote:

With this patch, pg_waldump will now have the following output when
reading a file with an invalid header:
pg_waldump: error: could not find a valid record after D80/5C000000:
invalid magic number D116 in WAL segment 0000001400000D8000000017, LSN
D80/5C000000, offset 0

I've picked up the review for your patch.

Attached is a failing test that reproduces the issue. Have I got it
right? We can consider using it to validate your patch then.

--
Thanks,
Mircea Cadariu

Attachments:

v1-0001-Add-pg_waldump-test-for-invalid-WAL-page-magic-er.patchtext/plain; charset=UTF-8; name=v1-0001-Add-pg_waldump-test-for-invalid-WAL-page-magic-er.patchDownload+22-1
#3Anthonin Bonnefoy
anthonin.bonnefoy@datadoghq.com
In reply to: Mircea Cadariu (#2)
Re: Propagate XLogFindNextRecord error to callers

On Tue, Feb 3, 2026 at 5:05 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:

I've picked up the review for your patch.

Thanks for picking it!

Attached is a failing test that reproduces the issue. Have I got it
right? We can consider using it to validate your patch then.

Yeah, that's the gist of it. However, the test you've written will
only work on little endian architectures. Also, I think the xlog page
header size won't have the 4 bytes padding on 32 bits systems.

I've added a similar test in 001_basic.pl, but it relies on copying an
existing WAL file and setting the WAL magic to 0000. This way, the
result will be the same independent of the endianness and memory
padding.

Regards,
Anthonin Bonnefoy

Attachments:

v2-0001-Propage-errormsg-to-XLogFindNextRecord-caller.patchapplication/octet-stream; name=v2-0001-Propage-errormsg-to-XLogFindNextRecord-caller.patchDownload+92-17
#4Japin Li
japinli@hotmail.com
In reply to: Anthonin Bonnefoy (#3)
Re: Propagate XLogFindNextRecord error to callers

On Mon, 09 Feb 2026 at 10:21, Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> wrote:

On Tue, Feb 3, 2026 at 5:05 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:

I've picked up the review for your patch.

Thanks for picking it!

Attached is a failing test that reproduces the issue. Have I got it
right? We can consider using it to validate your patch then.

Yeah, that's the gist of it. However, the test you've written will
only work on little endian architectures. Also, I think the xlog page
header size won't have the 4 bytes padding on 32 bits systems.

I've added a similar test in 001_basic.pl, but it relies on copying an
existing WAL file and setting the WAL magic to 0000. This way, the
result will be the same independent of the endianness and memory
padding.

+	{
+		if (errormsg)
+			pg_fatal("could not find a valid record after %X/%X: %s",
+					 LSN_FORMAT_ARGS(private.startptr), errormsg);
+		else
+			pg_fatal("could not find a valid record after %X/%X",
+					 LSN_FORMAT_ARGS(private.startptr));

For consistency, this should use the %X/%08X format as elsewhere.

Regards,
Anthonin Bonnefoy

[2. text/x-diff; v2-0001-Propage-errormsg-to-XLogFindNextRecord-caller.patch]...

--
Regards,
Japin Li
ChengDu WenWu Information Technology Co., Ltd.

#5Anthonin Bonnefoy
anthonin.bonnefoy@datadoghq.com
In reply to: Japin Li (#4)
Re: Propagate XLogFindNextRecord error to callers

On Mon, Feb 9, 2026 at 10:36 AM Japin Li <japinli@hotmail.com> wrote:

For consistency, this should use the %X/%08X format as elsewhere.

Good catch, I've updated the patch.

Attachments:

v3-0001-Propage-errormsg-to-XLogFindNextRecord-caller.patchapplication/octet-stream; name=v3-0001-Propage-errormsg-to-XLogFindNextRecord-caller.patchDownload+70-17
#6Mircea Cadariu
cadariu.mircea@gmail.com
In reply to: Anthonin Bonnefoy (#5)
Re: Propagate XLogFindNextRecord error to callers

Hi Anthonin,

Thanks for the updated patch.

I have noticed this code added in XLogFindNextRecord in the patch,
appears also in XLogNextRecord (line 334).

+	if (state->errormsg_deferred)
+	{
+		if (state->errormsg_buf[0] != '\0')
+			*errormsg = state->errormsg_buf;
+		state->errormsg_deferred = false;
+	}
+

In XLogNextRecord, right before the above code, we do *errormsg = NULL.
Should this be done also in XLogFindNextRecord in the patch?

If so, what about even extracting a helper method which will be called
from both places?

A nit for the commit message: Propage -> Propagate

--
Thanks,
Mircea Cadariu

#7Anthonin Bonnefoy
anthonin.bonnefoy@datadoghq.com
In reply to: Mircea Cadariu (#6)
Re: Propagate XLogFindNextRecord error to callers

On Mon, Feb 9, 2026 at 1:15 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:

In XLogNextRecord, right before the above code, we do *errormsg = NULL. Should this be done also in XLogFindNextRecord in the patch?

If so, what about even extracting a helper method which will be called from both places?

XLogReadRecord may already have consumed errormsg_deferred and set
errormsg. We can't set it to NULL, or that would erase a valid error
message.

A simplification would have been to remove processing the deferred
error processing from XLogNextRecord and leave it to
XLogFindNextRecord, but there are some calls to XLogNextRecord (like
in xlogprefetcher.c), so being able to get the error message when
calling XLogNextRecord is necessary.

A nit for the commit message: Propage -> Propagate

Fixed

Attachments:

v4-0001-Propagate-errormsg-to-XLogFindNextRecord-caller.patchapplication/octet-stream; name=v4-0001-Propagate-errormsg-to-XLogFindNextRecord-caller.patchDownload+70-17
#8Mircea Cadariu
cadariu.mircea@gmail.com
In reply to: Anthonin Bonnefoy (#7)
Re: Propagate XLogFindNextRecord error to callers

On 11/02/2026 10:43, Anthonin Bonnefoy wrote:

XLogReadRecord may already have consumed errormsg_deferred and set
errormsg. We can't set it to NULL, or that would erase a valid error

Indeed. Should we place this initialisation at the top of
XLogFindNextRecord, before any processing? At that point, there's
nothing to erase.

+	/*
+	 * we may have reported errors due to invalid WAL header, propagate the
+	 * error message to the caller.
+	 */

You can consider capitalising.

--
Thanks,
Mircea Cadariu

#9Anthonin Bonnefoy
anthonin.bonnefoy@datadoghq.com
In reply to: Mircea Cadariu (#8)
Re: Propagate XLogFindNextRecord error to callers

On Wed, Feb 11, 2026 at 1:16 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:

Indeed. Should we place this initialisation at the top of
XLogFindNextRecord, before any processing? At that point, there's
nothing to erase.

That makes sense, I've added the '*errormsg = NULL' at the top of the function.

You can consider capitalising.

Done

Attachments:

v5-0001-Propagate-errormsg-to-XLogFindNextRecord-caller.patchapplication/octet-stream; name=v5-0001-Propagate-errormsg-to-XLogFindNextRecord-caller.patchDownload+72-17
#10Mircea Cadariu
cadariu.mircea@gmail.com
In reply to: Anthonin Bonnefoy (#9)
Re: Propagate XLogFindNextRecord error to callers

Hi,

On 12/02/2026 09:43, Anthonin Bonnefoy wrote:

Done

Thanks for the updated patch.

It applies cleanly on HEAD and the tests pass locally, as well as in CI.

I've set the patch entry to 'Ready for Committer'.

--
Thanks,
Mircea Cadariu

#11Chao Li
li.evan.chao@gmail.com
In reply to: Anthonin Bonnefoy (#9)
Re: Propagate XLogFindNextRecord error to callers

On Feb 12, 2026, at 15:43, Anthonin Bonnefoy <anthonin.bonnefoy@datadoghq.com> wrote:

On Wed, Feb 11, 2026 at 1:16 PM Mircea Cadariu <cadariu.mircea@gmail.com> wrote:

Indeed. Should we place this initialisation at the top of
XLogFindNextRecord, before any processing? At that point, there's
nothing to erase.

That makes sense, I've added the '*errormsg = NULL' at the top of the function.

You can consider capitalising.

Done
<v5-0001-Propagate-errormsg-to-XLogFindNextRecord-caller.patch>

Hi Anthonin,

Thanks for the patch. I agree it’s useful to print a more detailed error message instead of the generic one.

From a design perspective, I’m not sure we need to add a new errormsg parameter to XLogFindNextRecord(). The new parameter ultimately just exposes state->errormsg_buf, so the returned errormsg implicitly depends on the lifetime of state, and we also need extra handling for cases like errormsg == NULL.

Instead, perhaps we could add a helper function, say XLogReaderGetLastError(XLogReaderState *state). which internally pstrdup()s state->errormsg_buf (after checking errormsg_deferred, etc.). That way the caller owns the returned string explicitly, and there’s no hidden dependency on the reader state’s lifetime.

This would also avoid changing the XLogFindNextRecord() signature while making the ownership semantics clearer.

What do you think?

Best regards,
--
Chao Li (Evan)
HighGo Software Co., Ltd.
https://www.highgo.com/

#12Anthonin Bonnefoy
anthonin.bonnefoy@datadoghq.com
In reply to: Chao Li (#11)
Re: Propagate XLogFindNextRecord error to callers

Thanks for the comments!

On Tue, Feb 24, 2026 at 5:00 AM Chao Li <li.evan.chao@gmail.com> wrote:

From a design perspective, I’m not sure we need to add a new errormsg parameter to XLogFindNextRecord(). The new parameter ultimately just exposes state->errormsg_buf, so the returned errormsg implicitly depends on the lifetime of state, and we also need extra handling for cases like errormsg == NULL.

Instead, perhaps we could add a helper function, say XLogReaderGetLastError(XLogReaderState *state). which internally pstrdup()s state->errormsg_buf (after checking errormsg_deferred, etc.). That way the caller owns the returned string explicitly, and there’s no hidden dependency on the reader state’s lifetime.

This would also avoid changing the XLogFindNextRecord() signature while making the ownership semantics clearer.

One issue I see is that it introduces another way to get the error,
with XLogReadRecord and XLogNextRecord using an errormsg parameter,
and XLogFindNextRecord using the helper function. Maybe the solution
would be to change both XLogReadRecord and XLogNextRecord to use this
new function to stay consistent, but that means changing their
signatures.

Also, I see the errormsg parameter as a way to signal the caller that
"this function can fail, the detailed error will be available here".
With the XLogReaderGetLastError, it becomes the caller's
responsibility to know which function may fill the error message and
check it accordingly.

The error message is likely printed shortly after the function's call,
so I suspect the risk of using the errormsg after its intended
lifetime is low.

You bring up a good point about the errormsg's lifetime, which is
definitely something to mention in the function's comments. I've
updated the patch with the additional comments.

Regards,
Anthonin Bonnefoy

Attachments:

v6-0001-Propagate-errormsg-to-XLogFindNextRecord-caller.patchapplication/octet-stream; name=v6-0001-Propagate-errormsg-to-XLogFindNextRecord-caller.patchDownload+79-17
#13Fujii Masao
masao.fujii@gmail.com
In reply to: Anthonin Bonnefoy (#12)
Re: Propagate XLogFindNextRecord error to callers

On Thu, Feb 26, 2026 at 5:19 PM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:

Thanks for the comments!

On Tue, Feb 24, 2026 at 5:00 AM Chao Li <li.evan.chao@gmail.com> wrote:

From a design perspective, I’m not sure we need to add a new errormsg parameter to XLogFindNextRecord(). The new parameter ultimately just exposes state->errormsg_buf, so the returned errormsg implicitly depends on the lifetime of state, and we also need extra handling for cases like errormsg == NULL.

Instead, perhaps we could add a helper function, say XLogReaderGetLastError(XLogReaderState *state). which internally pstrdup()s state->errormsg_buf (after checking errormsg_deferred, etc.). That way the caller owns the returned string explicitly, and there’s no hidden dependency on the reader state’s lifetime.

This would also avoid changing the XLogFindNextRecord() signature while making the ownership semantics clearer.

One issue I see is that it introduces another way to get the error,
with XLogReadRecord and XLogNextRecord using an errormsg parameter,
and XLogFindNextRecord using the helper function. Maybe the solution
would be to change both XLogReadRecord and XLogNextRecord to use this
new function to stay consistent, but that means changing their
signatures.

Also, I see the errormsg parameter as a way to signal the caller that
"this function can fail, the detailed error will be available here".
With the XLogReaderGetLastError, it becomes the caller's
responsibility to know which function may fill the error message and
check it accordingly.

The error message is likely printed shortly after the function's call,
so I suspect the risk of using the errormsg after its intended
lifetime is low.

You bring up a good point about the errormsg's lifetime, which is
definitely something to mention in the function's comments. I've
updated the patch with the additional comments.

Since this patch is marked Ready for Committer, I've started reviewing it.

+ * When set, *errormsg points to an internal buffer that's valid until the next
+ * call to XLogReadRecord.

Could that buffer also be invalidated by other functions that modify
"XLogReaderState *state", such as XLogBeginRead()?

+# Wrong WAL version. We copy an existing wal file and set the
+# page's magic value to 0000.

Would it be better to describe the purpose of this test at the top?
For example:

# Test that pg_waldump reports a detailed error message when dumping
# a WAL file with an invalid magic number (0000).
{
# The broken WAL file is created by copying a valid WAL file and
# overwriting its magic number with 0000.
my $broken_wal_dir = PostgreSQL::Test::Utils::tempdir_short();

+ open($fh, '+<', $broken_wal);
+ close($fh);

Should we add error handling like:

open(my $fh, '+<', $broken_wal)
or BAIL_OUT("open($broken_wal) failed: $!");
close($fh)
or BAIL_OUT("close failed: $!");

Also, other similar tests seem to call binmode. Is it really unnecessary
in this case?

Regards,

--
Fujii Masao

#14Anthonin Bonnefoy
anthonin.bonnefoy@datadoghq.com
In reply to: Fujii Masao (#13)
Re: Propagate XLogFindNextRecord error to callers

On Tue, Mar 17, 2026 at 10:49 AM Fujii Masao <masao.fujii@gmail.com> wrote:

Since this patch is marked Ready for Committer, I've started reviewing it.

Thanks for the review!

On Tue, Mar 17, 2026 at 10:49 AM Fujii Masao <masao.fujii@gmail.com> wrote:

+ * When set, *errormsg points to an internal buffer that's valid until the next
+ * call to XLogReadRecord.

Could that buffer also be invalidated by other functions that modify
"XLogReaderState *state", such as XLogBeginRead()?

Yes, XLogBeginRead() will clear the buffer, but I don't think there's
a pattern where XLogBeginRead() is called after XLogFindNextRecord()
on the same reader?
The comment was more focused on how the function would be used, with
XLogReadRecord() being called after the reader state is initialized,
and talking about XLogBeginRead() here could be confusing.

+# Wrong WAL version. We copy an existing wal file and set the
+# page's magic value to 0000.

Would it be better to describe the purpose of this test at the top?
For example:

# Test that pg_waldump reports a detailed error message when dumping
# a WAL file with an invalid magic number (0000).
{
# The broken WAL file is created by copying a valid WAL file and
# overwriting its magic number with 0000.
my $broken_wal_dir = PostgreSQL::Test::Utils::tempdir_short();

Sounds good, I've updated the patch.

+ open($fh, '+<', $broken_wal);
+ close($fh);

Should we add error handling like:

open(my $fh, '+<', $broken_wal)
or BAIL_OUT("open($broken_wal) failed: $!");
close($fh)
or BAIL_OUT("close failed: $!");

Added.

Also, other similar tests seem to call binmode. Is it really unnecessary
in this case?

Ha right, I've missed those calls. I'm not sure if it's strictly
necessary for the functions used there, but it's probably a lot saner
to read and write WAL using binary mode. I've added a binmode call.

Regards,
Anthonin Bonnefoy

Attachments:

v7-0001-Propagate-errormsg-to-XLogFindNextRecord-caller.patchapplication/octet-stream; name=v7-0001-Propagate-errormsg-to-XLogFindNextRecord-caller.patchDownload+85-17
#15Fujii Masao
masao.fujii@gmail.com
In reply to: Anthonin Bonnefoy (#14)
Re: Propagate XLogFindNextRecord error to callers

On Mon, Mar 23, 2026 at 5:15 PM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:

+# Wrong WAL version. We copy an existing wal file and set the
+# page's magic value to 0000.

Would it be better to describe the purpose of this test at the top?
For example:

# Test that pg_waldump reports a detailed error message when dumping
# a WAL file with an invalid magic number (0000).
{
# The broken WAL file is created by copying a valid WAL file and
# overwriting its magic number with 0000.
my $broken_wal_dir = PostgreSQL::Test::Utils::tempdir_short();

Sounds good, I've updated the patch.

Thanks for updating the patch!

It looks like the first part of the comment above wasn't included,
so I've added it. I also made a few cosmetic changes and updated
the commit message. Attached is the updated version of the patch.

Unless there are objections, I'll commit this version.

Regards,

--
Fujii Masao

Attachments:

v8-0001-Report-detailed-errors-from-XLogFindNextRecord-fa.patchapplication/octet-stream; name=v8-0001-Report-detailed-errors-from-XLogFindNextRecord-fa.patchDownload+84-17
#16Anthonin Bonnefoy
anthonin.bonnefoy@datadoghq.com
In reply to: Fujii Masao (#15)
Re: Propagate XLogFindNextRecord error to callers

On Mon, Mar 23, 2026 at 1:00 PM Fujii Masao <masao.fujii@gmail.com> wrote:

It looks like the first part of the comment above wasn't included,
so I've added it. I also made a few cosmetic changes and updated
the commit message. Attached is the updated version of the patch.

Right, I've missed the first part... Thanks for the fix!

#17Fujii Masao
masao.fujii@gmail.com
In reply to: Anthonin Bonnefoy (#16)
Re: Propagate XLogFindNextRecord error to callers

On Mon, Mar 23, 2026 at 9:07 PM Anthonin Bonnefoy
<anthonin.bonnefoy@datadoghq.com> wrote:

On Mon, Mar 23, 2026 at 1:00 PM Fujii Masao <masao.fujii@gmail.com> wrote:

It looks like the first part of the comment above wasn't included,
so I've added it. I also made a few cosmetic changes and updated
the commit message. Attached is the updated version of the patch.

Right, I've missed the first part... Thanks for the fix!

I've pushed the patch. Thanks!

Regards,

--
Fujii Masao