Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

Started by Bharath Rupireddyover 3 years ago80 messageshackers
Jump to latest
#1Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com

Hi,

I noticed that dir_open_for_write() in walmethods.c uses write() for
WAL file initialization (note that this code is used by pg_receivewal
and pg_basebackup) as opposed to core using pg_pwritev_with_retry() in
XLogFileInitInternal() to avoid partial writes. Do we need to fix
this?

Thoughts?

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

#2Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#1)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Fri, Aug 05, 2022 at 03:55:26PM +0530, Bharath Rupireddy wrote:

I noticed that dir_open_for_write() in walmethods.c uses write() for
WAL file initialization (note that this code is used by pg_receivewal
and pg_basebackup) as opposed to core using pg_pwritev_with_retry() in
XLogFileInitInternal() to avoid partial writes. Do we need to fix
this?

0d56acfb has moved pg_pwritev_with_retry to be backend-only in fd.c :/

Thoughts?

Makes sense to me for the WAL segment pre-padding initialization, as
we still want to point to the beginning of the segment after we are
done with the pre-padding, and the code has an extra lseek().
--
Michael

#3Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#2)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Sat, Aug 6, 2022 at 12:11 PM Michael Paquier <michael@paquier.xyz> wrote:

On Fri, Aug 05, 2022 at 03:55:26PM +0530, Bharath Rupireddy wrote:

I noticed that dir_open_for_write() in walmethods.c uses write() for
WAL file initialization (note that this code is used by pg_receivewal
and pg_basebackup) as opposed to core using pg_pwritev_with_retry() in
XLogFileInitInternal() to avoid partial writes. Do we need to fix
this?

0d56acfb has moved pg_pwritev_with_retry to be backend-only in fd.c :/

Yeah. pg_pwritev_with_retry can also be part of common/file_utils.c/.h
so that everyone can use it.

Thoughts?

Makes sense to me for the WAL segment pre-padding initialization, as
we still want to point to the beginning of the segment after we are
done with the pre-padding, and the code has an extra lseek().

Thanks. I attached the v1 patch, please review it.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Attachments:

v1-0001-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patchapplication/octet-stream; name=v1-0001-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patchDownload+129-115
#4Thomas Munro
thomas.munro@gmail.com
In reply to: Bharath Rupireddy (#3)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Sun, Aug 7, 2022 at 1:12 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Sat, Aug 6, 2022 at 12:11 PM Michael Paquier <michael@paquier.xyz> wrote:
Yeah. pg_pwritev_with_retry can also be part of common/file_utils.c/.h
so that everyone can use it.

Thoughts?

Makes sense to me for the WAL segment pre-padding initialization, as
we still want to point to the beginning of the segment after we are
done with the pre-padding, and the code has an extra lseek().

Thanks. I attached the v1 patch, please review it.

Hi Bharath,

+1 for moving pg_pwritev_with_retry() to file_utils.c, but I think the
commit message should say that is happening. Maybe the move should
even be in a separate patch (IMHO it's nice to separate mechanical
change patches from new logic/work patches).

+/*
+ * A convenience wrapper for pg_pwritev_with_retry() that zero-fills the given
+ * file of size total_sz in batches of size block_sz.
+ */
+ssize_t
+pg_pwritev_with_retry_and_init(int fd, int total_sz, int block_sz)

Hmm, why not give it a proper name that says it writes zeroes?

Size arguments around syscall-like facilities are usually size_t.

+ memset(zbuffer.data, 0, block_sz);

I believe the modern fashion as of a couple of weeks ago is to tell
the compiler to zero-initialise. Since it's a union you'd need
designated initialiser syntax, something like zbuffer = { .data = {0}
}?

+ iov[i].iov_base = zbuffer.data;
+ iov[i].iov_len = block_sz;

How can it be OK to use caller supplied block_sz, when
sizeof(zbuffer.data) is statically determined? What is the point of
this argument?

-            dir_data->lasterrno = errno;
+            /* If errno isn't set, assume problem is no disk space */
+            dir_data->lasterrno = errno ? errno : ENOSPC;

This coding pattern is used in places where we blame short writes on
lack of disk space without bothering to retry. But the code used in
this patch already handles short writes: it always retries, until
eventually, if you really are out of disk space, you should get an
actual ENOSPC from the operating system. So I don't think the
guess-it-must-be-ENOSPC technique applies here.

#5Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Thomas Munro (#4)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Sun, Aug 7, 2022 at 7:43 AM Thomas Munro <thomas.munro@gmail.com> wrote:

On Sun, Aug 7, 2022 at 1:12 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Sat, Aug 6, 2022 at 12:11 PM Michael Paquier <michael@paquier.xyz> wrote:
Yeah. pg_pwritev_with_retry can also be part of common/file_utils.c/.h
so that everyone can use it.

Thoughts?

Makes sense to me for the WAL segment pre-padding initialization, as
we still want to point to the beginning of the segment after we are
done with the pre-padding, and the code has an extra lseek().

Thanks. I attached the v1 patch, please review it.

Hi Bharath,

+1 for moving pg_pwritev_with_retry() to file_utils.c, but I think the
commit message should say that is happening. Maybe the move should
even be in a separate patch (IMHO it's nice to separate mechanical
change patches from new logic/work patches).

Agree. I separated out the changes.

+/*
+ * A convenience wrapper for pg_pwritev_with_retry() that zero-fills the given
+ * file of size total_sz in batches of size block_sz.
+ */
+ssize_t
+pg_pwritev_with_retry_and_init(int fd, int total_sz, int block_sz)

Hmm, why not give it a proper name that says it writes zeroes?

Done.

Size arguments around syscall-like facilities are usually size_t.

+ memset(zbuffer.data, 0, block_sz);

I believe the modern fashion as of a couple of weeks ago is to tell
the compiler to zero-initialise. Since it's a union you'd need
designated initialiser syntax, something like zbuffer = { .data = {0}
}?

Hm, but we have many places still using memset(). If we were to change
these syntaxes, IMO, it must be done separately.

+ iov[i].iov_base = zbuffer.data;
+ iov[i].iov_len = block_sz;

How can it be OK to use caller supplied block_sz, when
sizeof(zbuffer.data) is statically determined? What is the point of
this argument?

Yes, removed block_sz function parameter.

-            dir_data->lasterrno = errno;
+            /* If errno isn't set, assume problem is no disk space */
+            dir_data->lasterrno = errno ? errno : ENOSPC;

This coding pattern is used in places where we blame short writes on
lack of disk space without bothering to retry. But the code used in
this patch already handles short writes: it always retries, until
eventually, if you really are out of disk space, you should get an
actual ENOSPC from the operating system. So I don't think the
guess-it-must-be-ENOSPC technique applies here.

Done.

Thanks for reviewing. PSA v2 patch-set. Also,I added a CF entry
https://commitfest.postgresql.org/39/3803/ to give the patches a
chance to get tested.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Attachments:

v2-0001-Move-pg_pwritev_with_retry-to-file_utils.c.patchapplication/x-patch; name=v2-0001-Move-pg_pwritev_with_retry-to-file_utils.c.patchDownload+72-72
v2-0002-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patchapplication/x-patch; name=v2-0002-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patchDownload+57-43
#6Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#5)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Sun, Aug 07, 2022 at 10:41:49AM +0530, Bharath Rupireddy wrote:

Agree. I separated out the changes.

+
+/*
+ * A convenience wrapper for pwritev() that retries on partial write.  If an
+ * error is returned, it is unspecified how much has been written.
+ */
+ssize_t
+pg_pwritev_with_retry(int fd, const struct iovec *iov, int iovcnt, off_t offset)

If moving this routine, this could use a more explicit description,
especially on errno, for example, that could be consumed by the caller
on failure to know what's happening.

+/*
+ * A convenience wrapper for pg_pwritev_with_retry() that zero-fills the given
+ * file of size total_sz in batches of size block_sz.
+ */
+ssize_t
+pg_pwritev_with_retry_and_init(int fd, int total_sz, int block_sz)

Hmm, why not give it a proper name that says it writes zeroes?

Done.

FWIW, when it comes to that we have a couple of routines that just use
'0' to mean such a thing, aka palloc0(). I find 0002 confusing, as it
introduces in fe_utils.c a new wrapper
(pg_pwritev_with_retry_and_write_zeros) on what's already a wrapper
(pg_pwritev_with_retry) for pwrite().

A second thing is that pg_pwritev_with_retry_and_write_zeros() is
designed to work on WAL segments initialization and it uses
XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing
in its name that tells us so. This makes me question whether
file_utils.c is a good location for this second thing. Could a new
file be a better location? We have a xlogutils.c in the backend, and
a name similar to that in src/common/ would be one possibility.
--
Michael

#7Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#6)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Sun, Aug 7, 2022 at 7:56 PM Michael Paquier <michael@paquier.xyz> wrote:

FWIW, when it comes to that we have a couple of routines that just use
'0' to mean such a thing, aka palloc0(). I find 0002 confusing, as it
introduces in fe_utils.c a new wrapper
(pg_pwritev_with_retry_and_write_zeros) on what's already a wrapper
(pg_pwritev_with_retry) for pwrite().

What about pg_write_initial_zeros(fd, size)? How it writes zeros (ie
using vector I/O, and retrying) seems to be an internal detail, no?
"initial" to make it clearer that it's at offset 0, or maybe
pg_pwrite_zeros(fd, size, offset).

A second thing is that pg_pwritev_with_retry_and_write_zeros() is
designed to work on WAL segments initialization and it uses
XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing
in its name that tells us so. This makes me question whether
file_utils.c is a good location for this second thing. Could a new
file be a better location? We have a xlogutils.c in the backend, and
a name similar to that in src/common/ would be one possibility.

Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or
maybe it's OK to use BLCKSZ with a comment to say that it's a bit
arbitrary, or maybe it's better to define a new zero buffer of some
arbitrary size just in this code if that is too strange. We could
experiment with different size buffers to see how it performs, bearing
in mind that every time we double it you halve the number of system
calls, but also bearing in mind that at some point it's too much for
the stack. I can tell you that the way that code works today was not
really written with performance in mind (unlike, say, the code
reverted from 9.4 that tried to do this with posix_fallocate()), it
was just finding an excuse to call pwritev(), to exercise new fallback
code being committed for use by later AIO stuff (more patches coming
soon). The retry support was added because it seemed plausible that
some system out there would start to do short writes as we cranked up
the sizes for some implementation reason other than ENOSPC, so we
should make a reusable retry routine.

I think this should also handle the remainder after processing whole
blocks, just for completeness. If I call the code as presented with size
8193, I think this code will only write 8192 bytes.

I think if this ever needs to work on O_DIRECT files there would be an
alignment constraint on the buffer and size, but I don't think we have
to worry about that for now.

#8Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Thomas Munro (#7)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Sun, Aug 7, 2022 at 3:19 PM Thomas Munro <thomas.munro@gmail.com> wrote:

A second thing is that pg_pwritev_with_retry_and_write_zeros() is
designed to work on WAL segments initialization and it uses
XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing
in its name that tells us so. This makes me question whether
file_utils.c is a good location for this second thing. Could a new
file be a better location? We have a xlogutils.c in the backend, and
a name similar to that in src/common/ would be one possibility.

Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or
maybe it's OK to use BLCKSZ with a comment to say that it's a bit
arbitrary, or maybe it's better to define a new zero buffer of some
arbitrary size just in this code if that is too strange. We could
experiment with different size buffers to see how it performs, bearing
in mind that every time we double it you halve the number of system
calls, but also bearing in mind that at some point it's too much for
the stack. I can tell you that the way that code works today was not
really written with performance in mind (unlike, say, the code
reverted from 9.4 that tried to do this with posix_fallocate()), it
was just finding an excuse to call pwritev(), to exercise new fallback
code being committed for use by later AIO stuff (more patches coming
soon). The retry support was added because it seemed plausible that
some system out there would start to do short writes as we cranked up
the sizes for some implementation reason other than ENOSPC, so we
should make a reusable retry routine.

Yes, doubling the zerobuffer size to say 2 * XLOG_BLCKSZ or 2 * BLCKSZ
reduces the system calls to half (right now, pg_pwritev_with_retry()
gets called 64 times per 16MB WAL file, it writes in the batches of 32
blocks per call).

Is there a ready-to-use tool or script or specific settings for
pgbench (pgbench command line options or GUC settings) that I can play
with to measure the performance?

I think this should also handle the remainder after processing whole
blocks, just for completeness. If I call the code as presented with size
8193, I think this code will only write 8192 bytes.

Hm, I will fix it.

I think if this ever needs to work on O_DIRECT files there would be an
alignment constraint on the buffer and size, but I don't think we have
to worry about that for now.

We can add a comment about the above limitation, if required.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

#9Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#8)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Sun, Aug 7, 2022 at 9:22 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Sun, Aug 7, 2022 at 3:19 PM Thomas Munro <thomas.munro@gmail.com> wrote:

A second thing is that pg_pwritev_with_retry_and_write_zeros() is
designed to work on WAL segments initialization and it uses
XLOG_BLCKSZ and PGAlignedXLogBlock for the job, but there is nothing
in its name that tells us so. This makes me question whether
file_utils.c is a good location for this second thing. Could a new
file be a better location? We have a xlogutils.c in the backend, and
a name similar to that in src/common/ would be one possibility.

Yeah, I think it should probably be disconnected from XLOG_BLCKSZ, or
maybe it's OK to use BLCKSZ with a comment to say that it's a bit
arbitrary, or maybe it's better to define a new zero buffer of some
arbitrary size just in this code if that is too strange. We could
experiment with different size buffers to see how it performs, bearing
in mind that every time we double it you halve the number of system
calls, but also bearing in mind that at some point it's too much for
the stack. I can tell you that the way that code works today was not
really written with performance in mind (unlike, say, the code
reverted from 9.4 that tried to do this with posix_fallocate()), it
was just finding an excuse to call pwritev(), to exercise new fallback
code being committed for use by later AIO stuff (more patches coming
soon). The retry support was added because it seemed plausible that
some system out there would start to do short writes as we cranked up
the sizes for some implementation reason other than ENOSPC, so we
should make a reusable retry routine.

Yes, doubling the zerobuffer size to say 2 * XLOG_BLCKSZ or 2 * BLCKSZ
reduces the system calls to half (right now, pg_pwritev_with_retry()
gets called 64 times per 16MB WAL file, it writes in the batches of 32
blocks per call).

Is there a ready-to-use tool or script or specific settings for
pgbench (pgbench command line options or GUC settings) that I can play
with to measure the performance?

I played with a simple insert use-case [1] that generates ~380 WAL
files, with different block sizes. To my surprise, I have not seen any
improvement with larger block sizes. I may be doing something wrong
here, suggestions on to test and see the benefits are welcome.

I think this should also handle the remainder after processing whole
blocks, just for completeness. If I call the code as presented with size
8193, I think this code will only write 8192 bytes.

Hm, I will fix it.

Fixed.

I'm attaching v5 patch-set. I've addressed review comments received so
far and fixed a compiler warning that CF bot complained about.

Please review it further.

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

Attachments:

v3-0001-Move-pg_pwritev_with_retry-to-file_utils.c.patchapplication/octet-stream; name=v3-0001-Move-pg_pwritev_with_retry-to-file_utils.c.patchDownload+72-72
v3-0002-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patchapplication/octet-stream; name=v3-0002-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patchDownload+91-43
#10Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#9)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Mon, Aug 8, 2022 at 6:10 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I played with a simple insert use-case [1] that generates ~380 WAL
files, with different block sizes. To my surprise, I have not seen any
improvement with larger block sizes. I may be doing something wrong
here, suggestions on to test and see the benefits are welcome.

I think this should also handle the remainder after processing whole
blocks, just for completeness. If I call the code as presented with size
8193, I think this code will only write 8192 bytes.

Hm, I will fix it.

Fixed.

I'm attaching v5 patch-set. I've addressed review comments received so
far and fixed a compiler warning that CF bot complained about.

Please review it further.

I tried to vary the zero buffer size to see if there's any huge
benefit for the WAL-generating queries. Unfortunately, I didn't see
any benefit on my dev system (16 vcore, 512GB SSD, 32GB RAM) . The use
case I've tried is at [1]/* built source code with release flags */ ./configure --with-zlib --enable-depend --prefix=$PWD/inst/ --with-openssl --with-readline --with-perl --with-libxml CFLAGS='-O2' and the results are at [2]HEAD Time: 84249.535 ms (01:24.250).

Having said that, the use of pg_pwritev_with_retry() in walmethods.c
will definitely reduce number of system calls - on HEAD the
dir_open_for_write() makes pad_to_size/XLOG_BLCKSZ i.e. 16MB/8KB =
2,048 write() calls and with patch it makes only 64
pg_pwritev_with_retry() calls with XLOG_BLCKSZ zero buffer size. The
proposed patches will provide straight 32x reduction in system calls
(for pg_receivewal and pg_basebackup) apart from the safety against
partial writes.

[1]: /* built source code with release flags */ ./configure --with-zlib --enable-depend --prefix=$PWD/inst/ --with-openssl --with-readline --with-perl --with-libxml CFLAGS='-O2'
/* built source code with release flags */
./configure --with-zlib --enable-depend --prefix=$PWD/inst/
--with-openssl --with-readline --with-perl --with-libxml CFLAGS='-O2'

install.log && make -j 8 install > install.log 2>&1 &

\q
./pg_ctl -D data -l logfile stop
rm -rf data

/* ensured that nothing exists in OS page cache */
free -m
sudo su
sync; echo 3 > /proc/sys/vm/drop_caches
exit
free -m

./initdb -D data
./pg_ctl -D data -l logfile start
./psql -d postgres -c 'ALTER SYSTEM SET max_wal_size = "64GB";'
./psql -d postgres -c 'ALTER SYSTEM SET shared_buffers = "8GB";'
./psql -d postgres -c 'ALTER SYSTEM SET work_mem = "16MB";'
./psql -d postgres -c 'ALTER SYSTEM SET checkpoint_timeout = "1d";'
./pg_ctl -D data -l logfile restart
./psql -d postgres -c 'create table foo(bar int);'
./psql -d postgres
\timing
insert into foo select * from generate_series(1, 100000000); /* this
query generates about 385 WAL files, no checkpoint hence no recycle of
old WAL files, all new WAL files */

[2]: HEAD Time: 84249.535 ms (01:24.250)
HEAD
Time: 84249.535 ms (01:24.250)

HEAD with wal_init_zero off
Time: 75086.300 ms (01:15.086)

#define PWRITEV_BLCKSZ XLOG_BLCKSZ
Time: 85254.302 ms (01:25.254)

#define PWRITEV_BLCKSZ (4 * XLOG_BLCKSZ)
Time: 83542.885 ms (01:23.543)

#define PWRITEV_BLCKSZ (16 * XLOG_BLCKSZ)
Time: 84035.770 ms (01:24.036)

#define PWRITEV_BLCKSZ (64 * XLOG_BLCKSZ)
Time: 84749.021 ms (01:24.749)

#define PWRITEV_BLCKSZ (256 * XLOG_BLCKSZ)
Time: 84273.466 ms (01:24.273)

#define PWRITEV_BLCKSZ (512 * XLOG_BLCKSZ)
Time: 84233.576 ms (01:24.234)

--
Bharath Rupireddy
RDS Open Source Databases: https://aws.amazon.com/rds/postgresql/

#11Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#9)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Mon, Aug 08, 2022 at 06:10:23PM +0530, Bharath Rupireddy wrote:

I'm attaching v5 patch-set. I've addressed review comments received so
far and fixed a compiler warning that CF bot complained about.

Please review it further.

0001 looks reasonable to me.

+        errno = 0;
+        rc = pg_pwritev_zeros(fd, pad_to_size);

Do we need to reset errno? pg_pwritev_zeros() claims to set errno
appropriately.

+/*
+ * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if
+ * writing more bytes per pg_pwritev_with_retry() call is proven to be more
+ * performant.
+ */
+#define PWRITEV_BLCKSZ  XLOG_BLCKSZ

This seems like something we should sort out now instead of leaving as
future work. Given your recent note, I think we should just use
XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance
findings with different buffer sizes.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#12Nathan Bossart
nathandbossart@gmail.com
In reply to: Nathan Bossart (#11)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Tue, Sep 20, 2022 at 04:00:26PM -0700, Nathan Bossart wrote:

On Mon, Aug 08, 2022 at 06:10:23PM +0530, Bharath Rupireddy wrote:

I'm attaching v5 patch-set. I've addressed review comments received so
far and fixed a compiler warning that CF bot complained about.

Please review it further.

0001 looks reasonable to me.

+        errno = 0;
+        rc = pg_pwritev_zeros(fd, pad_to_size);

Do we need to reset errno? pg_pwritev_zeros() claims to set errno
appropriately.

+/*
+ * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if
+ * writing more bytes per pg_pwritev_with_retry() call is proven to be more
+ * performant.
+ */
+#define PWRITEV_BLCKSZ  XLOG_BLCKSZ

This seems like something we should sort out now instead of leaving as
future work. Given your recent note, I think we should just use
XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance
findings with different buffer sizes.

I also noticed that the latest patch set no longer applies, so I've marked
the commitfest entry as waiting-on-author.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#13Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#11)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Wed, Sep 21, 2022 at 4:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

0001 looks reasonable to me.

Thanks for reviewing.

+        errno = 0;
+        rc = pg_pwritev_zeros(fd, pad_to_size);

Do we need to reset errno? pg_pwritev_zeros() claims to set errno
appropriately.

Right, pg_pwritev_zeros(), (rather pg_pwritev_with_retry() ensures
that pwritev() or pwrite()) sets the correct errno, please see
Thomas's comments [1]/messages/by-id/CA+hUKGJKwUrpP0igOFAv5khj3dbHvfWqLzFeo7vtNzDYXv_YZw@mail.gmail.com as well. Removed it.

+/*
+ * PWRITEV_BLCKSZ is same as XLOG_BLCKSZ for now, however it may change if
+ * writing more bytes per pg_pwritev_with_retry() call is proven to be more
+ * performant.
+ */
+#define PWRITEV_BLCKSZ  XLOG_BLCKSZ

This seems like something we should sort out now instead of leaving as
future work. Given your recent note, I think we should just use
XLOG_BLCKSZ and PGAlignedXLogBlock and add a comment about the performance
findings with different buffer sizes.

Agreed. Removed the new structure and added a comment.

Another change that I had to do was to allow lseek() even after
pwrite() (via pg_pwritev_zeros()) on Windows in walmethods.c. Without
this, the regression tests start to fail on Windows. And I figured out
that it's not an issue with this patch, it looks like an issue with
pwrite() implementation in win32pwrite.c, see the failure here [2]https://github.com/BRupireddy/postgres/tree/use_pwrite_without_lseek_on_windiws, I
plan to start a separate thread to discuss this.

Please review the attached v4 patch set further.

[1]: /messages/by-id/CA+hUKGJKwUrpP0igOFAv5khj3dbHvfWqLzFeo7vtNzDYXv_YZw@mail.gmail.com
[2]: https://github.com/BRupireddy/postgres/tree/use_pwrite_without_lseek_on_windiws

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v4-0001-Move-pg_pwritev_with_retry-to-file_utils.c.patchapplication/x-patch; name=v4-0001-Move-pg_pwritev_with_retry-to-file_utils.c.patchDownload+72-72
v4-0002-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patchapplication/x-patch; name=v4-0002-Use-pg_pwritev_with_retry-instead-of-write-in-wal.patchDownload+101-40
#14Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#13)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?
+        PGAlignedXLogBlock zbuffer;
+
+        memset(zbuffer.data, 0, XLOG_BLCKSZ);

This seems excessive for only writing a single byte.

+#ifdef WIN32
+        /*
+         * XXX: It looks like on Windows, we need an explicit lseek() call here
+         * despite using pwrite() implementation from win32pwrite.c. Otherwise
+         * an error occurs.
+         */

I think this comment is too vague. Can we describe the error in more
detail? Or better yet, can we fix it as a prerequisite to this patch set?

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#15Thomas Munro
thomas.munro@gmail.com
In reply to: Nathan Bossart (#14)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Sat, Sep 24, 2022 at 8:24 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

+#ifdef WIN32
+        /*
+         * XXX: It looks like on Windows, we need an explicit lseek() call here
+         * despite using pwrite() implementation from win32pwrite.c. Otherwise
+         * an error occurs.
+         */

I think this comment is too vague. Can we describe the error in more
detail? Or better yet, can we fix it as a prerequisite to this patch set?

Although WriteFile() with a synchronous file handle and an explicit
offset doesn't use the current file position, it appears that it still
changes it. :-(

You'd think from the documentation[1]https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile#synchronization-and-file-position that that isn't the case, because it says:

"Considerations for working with synchronous file handles:

* If lpOverlapped is NULL, the write operation starts at the current
file position and WriteFile does not return until the operation is
complete, and the system updates the file pointer before WriteFile
returns.

* If lpOverlapped is not NULL, the write operation starts at the
offset that is specified in the OVERLAPPED structure and WriteFile
does not return until the write operation is complete. The system
updates the OVERLAPPED Internal and InternalHigh fields before
WriteFile returns."

So it's explicitly saying the file pointer is updated in the first
bullet point and not the second, but src/port/win32pwrite.c is doing
the second thing. Yet the following assertion added to Bharath's code
fails[2]https://api.cirrus-ci.com/v1/artifact/task/6201051266154496/testrun/build/testrun/pg_basebackup/010_pg_basebackup/log/regress_log_010_pg_basebackup:

+++ b/src/bin/pg_basebackup/walmethods.c
@@ -221,6 +221,10 @@ dir_open_for_write(WalWriteMethod *wwmethod,
const char *pathname,
        if (pad_to_size && wwmethod->compression_algorithm ==
PG_COMPRESSION_NONE)
        {
                ssize_t rc;
+               off_t before_offset;
+               off_t after_offset;
+
+               before_offset = lseek(fd, 0, SEEK_CUR);

rc = pg_pwritev_zeros(fd, pad_to_size);

@@ -231,6 +235,9 @@ dir_open_for_write(WalWriteMethod *wwmethod, const
char *pathname,
return NULL;
}

+               after_offset = lseek(fd, 0, SEEK_CUR);
+               Assert(before_offset == after_offset);
+

[1]: https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile#synchronization-and-file-position
[2]: https://api.cirrus-ci.com/v1/artifact/task/6201051266154496/testrun/build/testrun/pg_basebackup/010_pg_basebackup/log/regress_log_010_pg_basebackup

#16Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Thomas Munro (#15)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Sat, Sep 24, 2022 at 9:44 AM Thomas Munro <thomas.munro@gmail.com> wrote:

Although WriteFile() with a synchronous file handle and an explicit
offset doesn't use the current file position, it appears that it still
changes it. :-(

You'd think from the documentation[1] that that isn't the case, because it says:

"Considerations for working with synchronous file handles:

* If lpOverlapped is NULL, the write operation starts at the current
file position and WriteFile does not return until the operation is
complete, and the system updates the file pointer before WriteFile
returns.

* If lpOverlapped is not NULL, the write operation starts at the
offset that is specified in the OVERLAPPED structure and WriteFile
does not return until the write operation is complete. The system
updates the OVERLAPPED Internal and InternalHigh fields before
WriteFile returns."

So it's explicitly saying the file pointer is updated in the first
bullet point and not the second, but src/port/win32pwrite.c is doing
the second thing.

The WriteFile() and pwrite() implementation in win32pwrite.c are
changing the file pointer on Windows, see the following and [4]https://api.cirrus-ci.com/v1/artifact/task/5294264635621376/testrun/build/testrun/pg_basebackup/010_pg_basebackup/log/regress_log_010_pg_basebackup for
more details.

# Running: pg_basebackup --no-sync -cfast -D
C:\cirrus\build\testrun\pg_basebackup\010_pg_basebackup\data\tmp_test_sV4r/backup2
--no-manifest --waldir
C:\cirrus\build\testrun\pg_basebackup\010_pg_basebackup\data\tmp_test_sV4r/xlog2
pg_basebackup: offset_before 0 and offset_after 16777216 aren't the same
Assertion failed: offset_before == offset_after, file
../src/bin/pg_basebackup/walmethods.c, line 254

Irrespective of what Windows does with file pointers in WriteFile(),
should we add lseek(SEEK_SET) in our own pwrite()'s implementation,
something like [5]diff --git a/src/port/win32pwrite.c b/src/port/win32pwrite.c index 7f2e62e8a7..542b548279 100644 --- a/src/port/win32pwrite.c +++ b/src/port/win32pwrite.c @@ -37,5 +37,16 @@ pwrite(int fd, const void *buf, size_t size, off_t offset) return -1; }? This is rather hackish without fully knowing what
Windows does internally in WriteFile(), but this does fix inherent
issues that our pwrite() callers (there are quite a number of places
that use pwrite() and presumes file pointer doesn't change on Windows)
may have on Windows. See the regression tests passing [6]https://github.com/BRupireddy/postgres/tree/add_pwrite_and_offset_checks_in_walmethods_v3 with the fix
[5]: diff --git a/src/port/win32pwrite.c b/src/port/win32pwrite.c index 7f2e62e8a7..542b548279 100644 --- a/src/port/win32pwrite.c +++ b/src/port/win32pwrite.c @@ -37,5 +37,16 @@ pwrite(int fd, const void *buf, size_t size, off_t offset) return -1; }

Yet the following assertion added to Bharath's code
fails[2]:

That was a very quick patch though, here's another version adding
offset checks and an assertion [3]https://github.com/BRupireddy/postgres/tree/add_pwrite_and_offset_checks_in_walmethods_v2, see the assertion failures here
[4]: https://api.cirrus-ci.com/v1/artifact/task/5294264635621376/testrun/build/testrun/pg_basebackup/010_pg_basebackup/log/regress_log_010_pg_basebackup

I also think that we need to discuss this issue separately.

Thoughts?

[1] https://learn.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-writefile#synchronization-and-file-position
[2] https://api.cirrus-ci.com/v1/artifact/task/6201051266154496/testrun/build/testrun/pg_basebackup/010_pg_basebackup/log/regress_log_010_pg_basebackup

[3] - https://github.com/BRupireddy/postgres/tree/add_pwrite_and_offset_checks_in_walmethods_v2
[4] - https://api.cirrus-ci.com/v1/artifact/task/5294264635621376/testrun/build/testrun/pg_basebackup/010_pg_basebackup/log/regress_log_010_pg_basebackup
[5]
diff --git a/src/port/win32pwrite.c b/src/port/win32pwrite.c
index 7f2e62e8a7..542b548279 100644
--- a/src/port/win32pwrite.c
+++ b/src/port/win32pwrite.c
@@ -37,5 +37,16 @@ pwrite(int fd, const void *buf, size_t size, off_t offset)
                return -1;
        }
+       /*
+        * On Windows, it is found that WriteFile() changes the file
pointer and we
+        * want pwrite() to not change. Hence, we explicitly reset the
file pointer
+        * to beginning of the file.
+        */
+       if (lseek(fd, 0, SEEK_SET) != 0)
+       {
+               _dosmaperr(GetLastError());
+               return -1;
+       }
+
        return result;
 }
[6] - https://github.com/BRupireddy/postgres/tree/add_pwrite_and_offset_checks_in_walmethods_v3

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#17Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#16)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Mon, Sep 26, 2022 at 08:33:53PM +0530, Bharath Rupireddy wrote:

Irrespective of what Windows does with file pointers in WriteFile(),
should we add lseek(SEEK_SET) in our own pwrite()'s implementation,
something like [5]? This is rather hackish without fully knowing what
Windows does internally in WriteFile(), but this does fix inherent
issues that our pwrite() callers (there are quite a number of places
that use pwrite() and presumes file pointer doesn't change on Windows)
may have on Windows. See the regression tests passing [6] with the fix
[5].

I think so. I don't see why we would rather have each caller ensure
pwrite() behaves as documented.

+       /*
+        * On Windows, it is found that WriteFile() changes the file
pointer and we
+        * want pwrite() to not change. Hence, we explicitly reset the
file pointer
+        * to beginning of the file.
+        */
+       if (lseek(fd, 0, SEEK_SET) != 0)
+       {
+               _dosmaperr(GetLastError());
+               return -1;
+       }
+
return result;
}

Why reset to the beginning of the file? Shouldn't we reset it to what it
was before the call to pwrite()?

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#18Thomas Munro
thomas.munro@gmail.com
In reply to: Nathan Bossart (#17)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Tue, Sep 27, 2022 at 10:27 AM Nathan Bossart
<nathandbossart@gmail.com> wrote:

On Mon, Sep 26, 2022 at 08:33:53PM +0530, Bharath Rupireddy wrote:

Irrespective of what Windows does with file pointers in WriteFile(),
should we add lseek(SEEK_SET) in our own pwrite()'s implementation,
something like [5]? This is rather hackish without fully knowing what
Windows does internally in WriteFile(), but this does fix inherent
issues that our pwrite() callers (there are quite a number of places
that use pwrite() and presumes file pointer doesn't change on Windows)
may have on Windows. See the regression tests passing [6] with the fix
[5].

I think so. I don't see why we would rather have each caller ensure
pwrite() behaves as documented.

I don't think so, that's an extra kernel call. I think I'll just have
to revert part of my recent change that removed the pg_ prefix from
those function names in our code, and restore the comment that warns
you about the portability hazard (I thought it went away with HP-UX
10, where we were literally calling lseek() before every write()).
The majority of users of these functions don't intermix them with
calls to read()/write(), so they don't care about the file position,
so I think it's just something we'll have to continue to be mindful of
in the places that do.

Unless, that is, I can find a way to stop it from doing that... I've
added this to my Windows to-do list. I am going to have a round of
Windows hacking quite soon.

#19Nathan Bossart
nathandbossart@gmail.com
In reply to: Thomas Munro (#18)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Tue, Sep 27, 2022 at 10:37:38AM +1300, Thomas Munro wrote:

I don't think so, that's an extra kernel call. I think I'll just have
to revert part of my recent change that removed the pg_ prefix from
those function names in our code, and restore the comment that warns
you about the portability hazard (I thought it went away with HP-UX
10, where we were literally calling lseek() before every write()).
The majority of users of these functions don't intermix them with
calls to read()/write(), so they don't care about the file position,
so I think it's just something we'll have to continue to be mindful of
in the places that do.

Ah, you're right, it's probably best to avoid the extra system call for the
majority of callers that don't care about the file position. I retract my
previous message.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#20Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Thomas Munro (#18)
Re: Use pg_pwritev_with_retry() instead of write() in dir_open_for_write() to avoid partial writes?

On Tue, Sep 27, 2022 at 3:08 AM Thomas Munro <thomas.munro@gmail.com> wrote:

I don't think so, that's an extra kernel call. I think I'll just have
to revert part of my recent change that removed the pg_ prefix from
those function names in our code, and restore the comment that warns
you about the portability hazard (I thought it went away with HP-UX
10, where we were literally calling lseek() before every write()).
The majority of users of these functions don't intermix them with
calls to read()/write(), so they don't care about the file position,
so I think it's just something we'll have to continue to be mindful of
in the places that do.

Yes, all of the existing pwrite() callers don't care about the file
position, but the new callers such as the actual idea and patch
proposed here in this thread cares.

Is this the commit cf112c122060568aa06efe4e6e6fb9b2dd4f1090 part of
which [1]https://github.com/postgres/postgres/blob/REL_14_STABLE/src/port/pwrite.c#L11 you're planning to revert? If so, will we have the
pg_{pwrite, pread} back? IIUC, we don't have lseek(SEEK_SET) in
pg_{pwrite, pread} right? It is the callers responsibility to set the
file position correctly if they wish to, isn't it? Oftentimes,
developers miss the notes in the function comments and use these
functions expecting them to not change file position which works well
on non-Windows platforms but fails on Windows.

This makes me think that we can have pwrite(), pread() introduced by
cf112c122060568aa06efe4e6e6fb9b2dd4f1090 as-is and re-introduce
pg_{pwrite, pread} with pwrite()/pread()+lseek(SEEK_SET) in
win32pwrite.c and win32pread.c. These functions reduce the caller's
efforts and reduce the duplicate code. If okay, I'm happy to lend my
hands on this patch.

Thoughts?

Unless, that is, I can find a way to stop it from doing that... I've
added this to my Windows to-do list. I am going to have a round of
Windows hacking quite soon.

Thanks! I think it's time for me to start a new thread just for this
to get more attention and opinion from other hackers and not to
sidetrack the original idea and patch proposed in this thread.

[1]: https://github.com/postgres/postgres/blob/REL_14_STABLE/src/port/pwrite.c#L11

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#21Thomas Munro
thomas.munro@gmail.com
In reply to: Bharath Rupireddy (#20)
#22Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Thomas Munro (#21)
#23Thomas Munro
thomas.munro@gmail.com
In reply to: Bharath Rupireddy (#22)
#24Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Thomas Munro (#23)
#25Thomas Munro
thomas.munro@gmail.com
In reply to: Bharath Rupireddy (#24)
#26Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#14)
#27Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#26)
#28Thomas Munro
thomas.munro@gmail.com
In reply to: Nathan Bossart (#27)
#29Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Thomas Munro (#28)
#30Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#29)
#31Michael Paquier
michael@paquier.xyz
In reply to: Nathan Bossart (#30)
#32Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#31)
#33Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#31)
#34Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#33)
#35Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#34)
#36Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#35)
#37Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#34)
#38Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#37)
#39Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Michael Paquier (#31)
#40Michael Paquier
michael@paquier.xyz
In reply to: Alvaro Herrera (#39)
#41Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#40)
#42Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#41)
#43Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#42)
#44John Naylor
john.naylor@enterprisedb.com
In reply to: Michael Paquier (#43)
#45Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: John Naylor (#44)
#46Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#45)
#47John Naylor
john.naylor@enterprisedb.com
In reply to: Michael Paquier (#46)
#48Andres Freund
andres@anarazel.de
In reply to: Bharath Rupireddy (#42)
#49Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#48)
#50Andres Freund
andres@anarazel.de
In reply to: Bharath Rupireddy (#49)
#51Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#50)
#52Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#50)
#53Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Bharath Rupireddy (#52)
#54Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#53)
#55Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#50)
#56Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#55)
#57Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#55)
#58Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#56)
#59Andres Freund
andres@anarazel.de
In reply to: Bharath Rupireddy (#57)
#60Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#58)
#61Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#59)
#62Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#61)
#63Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#59)
#64Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#63)
#65Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#64)
#66Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#65)
#67Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#66)
#68Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#67)
#69Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#68)
#70Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#69)
#71Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#70)
#72Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#70)
#73Thomas Munro
thomas.munro@gmail.com
In reply to: Thomas Munro (#72)
#74Michael Paquier
michael@paquier.xyz
In reply to: Thomas Munro (#73)
#75Michael Paquier
michael@paquier.xyz
In reply to: Andres Freund (#71)
#76Thomas Munro
thomas.munro@gmail.com
In reply to: Michael Paquier (#74)
#77Julien Rouhaud
rjuju123@gmail.com
In reply to: Thomas Munro (#76)
#78Thomas Munro
thomas.munro@gmail.com
In reply to: Julien Rouhaud (#77)
#79Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Julien Rouhaud (#77)
#80Andres Freund
andres@anarazel.de
In reply to: Thomas Munro (#73)