faulty error handling around pgstat_count_io_op_time()

Started by Peter Eisentraut11 days ago6 messageshackers

peter_e@gmx.net

11 days ago

There are several places where the return value of pg_pread() or
pg_pwrite() is passed directly as the byte count to
pgstat_count_io_op_time(). The bytes argument of
pgstat_count_io_op_time() is of type uint64, and so error returns of -1
are going to passed as UINT64_MAX and added as such to the internal
statistics.

In the attached patch, I have marked up those places.

I think the correction here would be to move the
pgstat_count_io_op_time() calls to after the error returns are handled.
This is effectively how most other code already behaves. For example,
most smgr calls don't return on error, so you don't get a chance to make
any pgstat calls afterwards. It's only the open-coded places where we
can even do that.

However, XLogPageRead() even goes out of its way to make an explicit
pgstat_count_io_op_time() call in the error branch. I suppose this
could be useful to record short reads, but a) this particular instance
is still faulty regarding -1, and b) other places don't do that. So
it's a bit unclear what the preferred behavior on error should be.

An alternative would be to call pgstat_count_io_op_time() with like
Max(byteswritten, 0), but that seems kind of ugly.

Another alternative would be to change the bytes argument of
pgstat_count_io_op_time() to ssize_t. POSIX file system operations
can't operate on sizes larger than ssize_t, so this type should be
sufficient. And then error returns could be handled centrally in
pgstat_count_io_op_time(). (Record them, don't record them, or even
count errors separately, etc.)

Thoughts?

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

9 days ago

In reply to: Peter Eisentraut (#1)

Re: faulty error handling around pgstat_count_io_op_time()

Hi,

On Fri, Jun 12, 2026 at 06:01:45PM +0200, Peter Eisentraut wrote:

There are several places where the return value of pg_pread() or pg_pwrite()
is passed directly as the byte count to pgstat_count_io_op_time(). The
bytes argument of pgstat_count_io_op_time() is of type uint64, and so error
returns of -1 are going to passed as UINT64_MAX and added as such to the
internal statistics.

Nice catch!

In the attached patch, I have marked up those places.

I agree with those places and did not find others.

I think the correction here would be to move the pgstat_count_io_op_time()
calls to after the error returns are handled. This is effectively how most
other code already behaves. For example, most smgr calls don't return on
error, so you don't get a chance to make any pgstat calls afterwards. It's
only the open-coded places where we can even do that.

Sounds reasonable to me and done that way in the attached.

However, XLogPageRead() even goes out of its way to make an explicit
pgstat_count_io_op_time() call in the error branch. I suppose this could be
useful to record short reads, but a) this particular instance is still
faulty regarding -1, and b) other places don't do that. So it's a bit
unclear what the preferred behavior on error should be.

What about keeping the intent (record short reads) by discarding r <= 0? (done
in the attached).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Michael Paquier

michael@paquier.xyz

7 days ago

In reply to: Bertrand Drouvot (#2)

Re: faulty error handling around pgstat_count_io_op_time()

On Mon, Jun 15, 2026 at 06:26:00AM +0000, Bertrand Drouvot wrote:

On Fri, Jun 12, 2026 at 06:01:45PM +0200, Peter Eisentraut wrote:

There are several places where the return value of pg_pread() or pg_pwrite()
is passed directly as the byte count to pgstat_count_io_op_time(). The
bytes argument of pgstat_count_io_op_time() is of type uint64, and so error
returns of -1 are going to passed as UINT64_MAX and added as such to the
internal statistics.

Nice catch!

This thread has slipped through, and it looks like I'm involved as of
a051e71e28a1. (Please feel free to add me in CC in such cases.)

I think the correction here would be to move the pgstat_count_io_op_time()
calls to after the error returns are handled. This is effectively how most
other code already behaves. For example, most smgr calls don't return on
error, so you don't get a chance to make any pgstat calls afterwards. It's
only the open-coded places where we can even do that.

Sounds reasonable to me and done that way in the attached.

In the xlogrecovery.c case, we should care about the short read case.
What you are doing here looks OK for this path.

In XLogFileInitInternal(), the first pgstat_count_io_op_time() is not
completely right, no? pg_pwrite_zeros() or pg_pwrite() could fail,
and it does not make sense to me to count data if we have a
save_errno, and the files are unlinked in the error path. I'd propose
to delay the count() call to happen after the error check is done.

This leads me to the v2 attached. This is your v1 plus the extra
change for XLogFileInitInternal() when the segments are initialized.
--
Michael

Ayush Tiwari

ayushtiwari.slg01@gmail.com

7 days ago

In reply to: Michael Paquier (#3)

Re: faulty error handling around pgstat_count_io_op_time()

Hi,

On Wed, 17 Jun 2026 at 05:56, Michael Paquier <michael@paquier.xyz> wrote:

On Mon, Jun 15, 2026 at 06:26:00AM +0000, Bertrand Drouvot wrote:

On Fri, Jun 12, 2026 at 06:01:45PM +0200, Peter Eisentraut wrote:

I think the correction here would be to move the

pgstat_count_io_op_time()

calls to after the error returns are handled. This is effectively how

most

other code already behaves. For example, most smgr calls don't return

on

error, so you don't get a chance to make any pgstat calls afterwards.

It's

only the open-coded places where we can even do that.

Sounds reasonable to me and done that way in the attached.

The "if (r > 0)" guard to keep counting short reads in xlogrecovery.c,
looks correct.

In the xlogrecovery.c case, we should care about the short read case.

What you are doing here looks OK for this path.

In XLogFileInitInternal(), the first pgstat_count_io_op_time() is not
completely right, no? pg_pwrite_zeros() or pg_pwrite() could fail,
and it does not make sense to me to count data if we have a
save_errno, and the files are unlinked in the error path. I'd propose
to delay the count() call to happen after the error check is done.

I agree on this change, but the original placement was also recording
the I/O timing of the attempted write, not just byte count, so moving
it post save_errno drops that. (But even the ordinary write/fsync
paths ereport before reaching their pgstat_count_io_op_time() call).

Regards,
Ayush

Bertrand Drouvot

bertranddrouvot.pg@gmail.com

7 days ago

In reply to: Michael Paquier (#3)

Re: faulty error handling around pgstat_count_io_op_time()

Hi,

On Wed, Jun 17, 2026 at 09:26:22AM +0900, Michael Paquier wrote:

In XLogFileInitInternal(), the first pgstat_count_io_op_time() is not
completely right, no? pg_pwrite_zeros() or pg_pwrite() could fail,
and it does not make sense to me to count data if we have a
save_errno, and the files are unlinked in the error path. I'd propose
to delay the count() call to happen after the error check is done.

I think you are right. This one was not handled because it's not a type
conversion bug but we should not count I/O on a failed operation.

This leads me to the v2 attached. This is your v1 plus the extra
change for XLogFileInitInternal() when the segments are initialized.

LGTM.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Michael Paquier

michael@paquier.xyz

7 days ago

In reply to: Ayush Tiwari (#4)

Re: faulty error handling around pgstat_count_io_op_time()

On Wed, Jun 17, 2026 at 10:29:36AM +0530, Ayush Tiwari wrote:

I agree on this change, but the original placement was also recording
the I/O timing of the attempted write, not just byte count, so moving
it post save_errno drops that. (But even the ordinary write/fsync
paths ereport before reaching their pgstat_count_io_op_time() call).

FWIW, I don't understand why registering the time taken for a failure
would make sense for WAL segment inits. :)
--
Michael

faulty error handling around pgstat_count_io_op_time()

Attachments:

Attachments:

Attachments: