Improve WALRead() to suck data directly from WAL buffers when possible

Started by Bharath Rupireddy over 3 years ago · 95 messages
#1 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com

Hi,

WALRead() currently reads WAL from the WAL file on disk, which
means the walsenders serving streaming and logical replication
(callers of WALRead()) have to hit the disk/OS page cache to
read the WAL. This may increase the amount of read IO required for
all the walsenders put together as one typically maintains many
standbys/subscribers on production servers for high availability,
disaster recovery, read-replicas and so on. Also, it may increase
replication lag if all the WAL reads are always hitting the disk.

It may happen that WAL buffers contain the requested WAL, if so, the
WALRead() can attempt to read from the WAL buffers first before
reading from the file. If the read hits the WAL buffers, then reading
from the file on disk is avoided. This mainly reduces the read IO/read
system calls. It also enables us to do other features specified
elsewhere [1].

I'm attaching a patch that implements the idea, which is also noted
elsewhere [2]. I've run some tests [3]. The WAL buffers hit ratio with
the patch stood at 95%, in other words, the walsenders avoided 95% of
the time reading from the file. The benefit, if measured in terms of
the amount of data - 79% (13.5GB out of total 17GB) of the requested
WAL is read from the WAL buffers as opposed to 21% from the file. Note
that the WAL buffers hit ratio can be very low for write-heavy
workloads, in which case, file reads are inevitable.

The patch introduces concurrent readers for the WAL buffers; so far
there are only concurrent writers. In the patch, WALRead() takes just
one lock (WALBufMappingLock) in shared mode to enable concurrent
readers and does minimal work - it checks if the requested WAL page is
present in the WAL buffers and, if so, copies the page and releases the
lock. I think taking just WALBufMappingLock is enough here, as the
concurrent writers depend on it to initialize and replace a page in the
WAL buffers.

I'll add this to the next commitfest.

Thoughts?

[1]: /messages/by-id/CALj2ACXCSM+sTR=5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w@mail.gmail.com

[2]:
* XXX probably this should be improved to suck data directly from the
* WAL buffers when possible.
*/
bool
WALRead(XLogReaderState *state,

[3]:
1 primary, 1 sync standby, 1 async standby
./pgbench --initialize --scale=300 postgres
./pgbench --jobs=16 --progress=300 --client=32 --time=900
--username=ubuntu postgres

PATCHED:
-[ RECORD 1 ]----------+----------------
application_name | assb1
wal_read | 31005
wal_read_bytes | 3800607104
wal_read_time | 779.402
wal_read_buffers | 610611
wal_read_bytes_buffers | 14493226440
wal_read_time_buffers | 3033.309
sync_state | async
-[ RECORD 2 ]----------+----------------
application_name | ssb1
wal_read | 31027
wal_read_bytes | 3800932712
wal_read_time | 696.365
wal_read_buffers | 610580
wal_read_bytes_buffers | 14492900832
wal_read_time_buffers | 2989.507
sync_state | sync

HEAD:
-[ RECORD 1 ]----+----------------
application_name | assb1
wal_read | 705627
wal_read_bytes | 18343480640
wal_read_time | 7607.783
sync_state | async
-[ RECORD 2 ]----+------------
application_name | ssb1
wal_read | 705625
wal_read_bytes | 18343480640
wal_read_time | 4539.058
sync_state | sync

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch (application/octet-stream, +249 -3)
#2 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Bharath Rupireddy (#1)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

The patch introduces concurrent readers for the WAL buffers, so far
only there are concurrent writers. In the patch, WALRead() takes just
one lock (WALBufMappingLock) in shared mode to enable concurrent
readers and does minimal things - checks if the requested WAL page is
present in WAL buffers, if so, copies the page and releases the lock.
I think taking just WALBufMappingLock is enough here as the concurrent
writers depend on it to initialize and replace a page in WAL buffers.

I'll add this to the next commitfest.

Thoughts?

This adds copying of the whole page (at least) at every WAL *record*
read, fighting all WAL writers by taking WALBufMappingLock on a very
busy page while copying. I'm a bit doubtful that it results in an
overall improvement. It seems to me almost all pread()s here happen
on the file buffer, so it is unclear to me that copying a whole WAL page
(then copying the target record again) wins over a pread() call that
copies only the record to read. Do you have actual numbers for how
frequently WAL reads go to disk, or for the performance gain or real
I/O reduction this patch offers?

This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertions
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.

I remember that one of the advantages of reading the on-memory WAL
records is that it allows walsender to presend the unwritten
records. So perhaps we should manage how far the buffer is filled with
valid content (or how far we can presend) in this feature.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#2)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

At Mon, 12 Dec 2022 11:57:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertion
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.

Mmm. I'm a bit dim. Recovery doesn't read concurrently-written
records. Please forget about recovery.

I remember that the one of the advantage of reading the on-memory WAL
records is that that allows walsender to presend the unwritten
records. So perhaps we should manage how far the buffer is filled with
valid content (or how far we can presend) in this feature.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#4 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#3)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Sorry for the confusion.

At Mon, 12 Dec 2022 12:06:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Mon, 12 Dec 2022 11:57:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertion
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.

Mmm. I'm a bit dim. Recovery doesn't read concurrently-written
records. Please forget about recovery.

NO... Logical walsenders do that. So, please forget about this...

I remember that the one of the advantage of reading the on-memory WAL
records is that that allows walsender to presend the unwritten
records. So perhaps we should manage how far the buffer is filled with
valid content (or how far we can presend) in this feature.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Kyotaro Horiguchi (#2)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Mon, Dec 12, 2022 at 8:27 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Thanks for providing thoughts.

At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

The patch introduces concurrent readers for the WAL buffers, so far
only there are concurrent writers. In the patch, WALRead() takes just
one lock (WALBufMappingLock) in shared mode to enable concurrent
readers and does minimal things - checks if the requested WAL page is
present in WAL buffers, if so, copies the page and releases the lock.
I think taking just WALBufMappingLock is enough here as the concurrent
writers depend on it to initialize and replace a page in WAL buffers.

I'll add this to the next commitfest.

Thoughts?

This adds copying of the whole page (at least) at every WAL *record*
read,

In the worst case yes, but that may not always be true. On a typical
production server with decent write traffic, it happens that the
callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
MAX_SEND_SIZE bytes.

fighting all WAL writers by taking WALBufMappingLock on a very
busy page while the copying. I'm a bit doubtful that it results in an
overall improvement.

Well, the tests don't reflect that [1]. I've run an insert workload
[2] and the results look
pretty cool. If you have any use-cases in mind, please share them
and/or feel free to run at your end.

It seems to me almost all pread()s here happens
on file buffer so it is unclear to me that copying a whole WAL page
(then copying the target record again) wins over a pread() call that
copies only the record to read.

That's not always guaranteed. Imagine a typical production server with
decent write traffic and heavy analytical queries (which fills OS page
cache with the table pages accessed for the queries), the WAL pread()
calls turn to IOPS. Despite the WAL being present in WAL buffers,
customers will be paying unnecessarily for these IOPS too. With the
patch, we are basically avoiding the pread() system calls which may
turn into IOPS on production servers (99% of the time for the insert
use case [1][2], 95% of the time for the pgbench use case specified
upthread). With the patch, WAL buffers can act as an L1 cache, if one
treats the OS page cache as an L2 cache (of course, this illustration is
not related to the typical processor L1/L2/... caches).

Do you have an actual number of how
frequent WAL reads go to disk, or the actual number of performance
gain or real I/O reduction this patch offers?

It might be a bit tough to generate such heavy traffic. An idea is to
ensure the WAL page/file goes out of the OS page cache before
WALRead() - these might help here - 0002 patch from
/messages/by-id/CA+hUKGLmeyrDcUYAty90V_YTcoo5kAFfQjRQ-_1joS_=X7HztA@mail.gmail.com
and tool https://github.com/klando/pgfincore.

This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertion
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.

WALRead() callers are smart enough to take the flushed bytes only.
Although they read the whole WAL page, they calculate the valid bytes.

I remember that the one of the advantage of reading the on-memory WAL
records is that that allows walsender to presend the unwritten
records. So perhaps we should manage how far the buffer is filled with
valid content (or how far we can presend) in this feature.

Yes, the non-flushed WAL can be read and sent across if one wishes to
make replication faster and to parallelize flushing on the primary and
standbys, at the cost of a bit of extra crash handling; that's
mentioned here /messages/by-id/CALj2ACXCSM+sTR=5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w@mail.gmail.com.
However, this can be a separate discussion.

I also want to reiterate that the patch implemented a TODO item:

* XXX probably this should be improved to suck data directly from the
* WAL buffers when possible.
*/
bool
WALRead(XLogReaderState *state,

[1]:
PATCHED:
1 1470.329907
2 1437.096329
4 2966.096948
8 5978.441040
16 11405.538255
32 22933.546058
64 43341.870038
128 73623.837068
256 104754.248661
512 115746.359530
768 106106.691455
1024 91900.079086
2048 84134.278589
4096 62580.875507

-[ RECORD 1 ]----------+-----------
application_name | assb1
sent_lsn | 0/1B8106A8
write_lsn | 0/1B8106A8
flush_lsn | 0/1B8106A8
replay_lsn | 0/1B8106A8
write_lag |
flush_lag |
replay_lag |
wal_read | 104
wal_read_bytes | 10733008
wal_read_time | 1.845
wal_read_buffers | 76662
wal_read_bytes_buffers | 383598808
wal_read_time_buffers | 205.418
sync_state | async

HEAD:
1 1312.054496
2 1449.429321
4 2717.496207
8 5913.361540
16 10762.978907
32 19653.449728
64 41086.124269
128 68548.061171
256 104468.415361
512 114328.943598
768 91751.279309
1024 96403.736757
2048 82155.140270
4096 66160.659511

-[ RECORD 1 ]----+-----------
application_name | assb1
sent_lsn | 0/1AB5BCB8
write_lsn | 0/1AB5BCB8
flush_lsn | 0/1AB5BCB8
replay_lsn | 0/1AB5BCB8
write_lag |
flush_lag |
replay_lag |
wal_read | 71967
wal_read_bytes | 381009080
wal_read_time | 243.616
sync_state | async

[2]: Test details:
./configure --prefix=$PWD/inst/ CFLAGS="-O3" > install.log && make -j
8 install > install.log 2>&1 &
1 primary, 1 async standby
cd inst/bin
./pg_ctl -D data -l logfile stop
./pg_ctl -D assbdata -l logfile1 stop
rm -rf data assbdata
rm logfile logfile1
free -m
sudo su -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -m
./initdb -D data
rm -rf /home/ubuntu/archived_wal
mkdir /home/ubuntu/archived_wal
cat << EOF >> data/postgresql.conf
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/archived_wal/%f'
track_wal_io_timing = 'on'
EOF
./pg_ctl -D data -l logfile start
./psql -c "select
pg_create_physical_replication_slot('assb1_repl_slot', true, false)"
postgres
./pg_ctl -D data -l logfile restart
./pg_basebackup -D assbdata
./pg_ctl -D data -l logfile stop
cat << EOF >> assbdata/postgresql.conf
port=5433
primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu
application_name=assb1'
primary_slot_name='assb1_repl_slot'
restore_command='cp /home/ubuntu/archived_wal/%f %p'
EOF
touch assbdata/standby.signal
./pg_ctl -D data -l logfile start
./pg_ctl -D assbdata -l logfile1 start
./pgbench -i -s 1 -d postgres
./psql -d postgres -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT
pgbench_accounts_pkey;"
cat << EOF >> insert.sql
\set aid random(1, 10 * :scale)
\set delta random(1, 100000 * :scale)
INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (:aid, :aid, :delta);
EOF
ulimit -S -n 5000
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -f insert.sql -c$c
-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#6 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Bharath Rupireddy (#5)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, Dec 23, 2022 at 3:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Mon, Dec 12, 2022 at 8:27 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Thanks for providing thoughts.

At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

The patch introduces concurrent readers for the WAL buffers, so far
only there are concurrent writers. In the patch, WALRead() takes just
one lock (WALBufMappingLock) in shared mode to enable concurrent
readers and does minimal things - checks if the requested WAL page is
present in WAL buffers, if so, copies the page and releases the lock.
I think taking just WALBufMappingLock is enough here as the concurrent
writers depend on it to initialize and replace a page in WAL buffers.

I'll add this to the next commitfest.

Thoughts?

This adds copying of the whole page (at least) at every WAL *record*
read,

In the worst case yes, but that may not always be true. On a typical
production server with decent write traffic, it happens that the
callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
MAX_SEND_SIZE bytes.

I agree with this.

This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertion
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.

WALRead() callers are smart enough to take the flushed bytes only.
Although they read the whole WAL page, they calculate the valid bytes.

Right

On first read the patch looks good, although it needs some more
thoughts on 'XXX' comments in the patch.

And also I do not like that XLogReadFromBuffers() is using 3 bools
hit/partial hit/miss, instead of this we can use an enum or some
tristate variable, I think that will be cleaner.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#7 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Dilip Kumar (#6)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Sun, Dec 25, 2022 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

This adds copying of the whole page (at least) at every WAL *record*
read,

In the worst case yes, but that may not always be true. On a typical
production server with decent write traffic, it happens that the
callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
MAX_SEND_SIZE bytes.

I agree with this.

This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertion
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.

WALRead() callers are smart enough to take the flushed bytes only.
Although they read the whole WAL page, they calculate the valid bytes.

Right

On first read the patch looks good, although it needs some more
thoughts on 'XXX' comments in the patch.

Thanks a lot for reviewing.

Here are some open points that I mentioned in v1 patch:

1.
+     * XXX: Perhaps, measuring the immediate lock availability and its impact
+     * on concurrent WAL writers is a good idea here.

It was shown in my testing upthread [1] that the patch does no harm in
this regard. It will be great if other members try testing in their
respective environments and use cases.

2.
+     * XXX: Perhaps, returning if lock is not immediately available a good idea
+     * here. The caller can then go ahead with reading WAL from WAL file.

After thinking a bit more on this, ISTM that doing the above is the
right thing, so as not to cause any contention when the lock is busy.
I've done so in the v2 patch.

3.
+     * XXX: Perhaps, quickly finding if the given WAL record is in WAL buffers
+     * a good idea here. This avoids unnecessary lock acquire-release cycles.
+     * One way to do that is by maintaining oldest WAL record that's currently
+     * present in WAL buffers.

I think by doing the above we might end up creating a new point of
contention, because shared variables tracking the min and max LSNs
available in the WAL buffers would need to be protected against all the
concurrent writers. Also, with the change done in (2) above, that is,
quickly exiting if the lock is busy, this concern seems less pressing.
Hence, I decided to leave the comment there.

4.
+         * XXX: Perhaps, we can further go and validate the found page header,
+         * record header and record at least in assert builds, something like
+         * the xlogreader.c does and return if any of those validity checks
+         * fail. Having said that, we stick to the minimal checks for now.

I was being over-cautious initially. The fact that we acquire
WALBufMappingLock while reading the needed WAL buffer page itself
guarantees that no one else initializes it/makes it ready for next use
in AdvanceXLInsertBuffer(). The checks that we have for the page header
(xlp_magic, xlp_pageaddr and xlp_tli) in the patch are enough to
ensure that we're not reading a page that has just been initialized. The
callers will anyway perform extensive checks on the page and record in
XLogReaderValidatePageHeader() and ValidXLogRecordHeader()
respectively. If any such failures occur after reading WAL from WAL
buffers, then that must be treated as a bug, IMO. Hence, I don't think
we need to do the above.

And also I do not like that XLogReadFromBuffers() is using 3 bools
hit/partial hit/miss, instead of this we can use an enum or some
tristate variable, I think that will be cleaner.

Yeah, that seems more verbose, all that information can be deduced
from requested bytes and read bytes, I've done so in the v2 patch.

Please review the attached v2 patch further.

I'm also attaching two helper patches (as .txt files) herewith for
testing that basically adds WAL read stats -
USE-ON-HEAD-Collect-WAL-read-from-file-stats.txt - apply on HEAD and
monitor pg_stat_replication for per-walsender WAL read from WAL file
stats. USE-ON-PATCH-Collect-WAL-read-from-buffers-and-file-stats.txt -
apply on v2 patch and monitor pg_stat_replication for per-walsender
WAL read from WAL buffers and WAL file stats.

[1]: /messages/by-id/CALj2ACXUbvON86vgwTkum8ab3bf1=HkMxQ5hZJZS3ZcJn8NEXQ@mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

USE-ON-HEAD-Collect-WAL-read-from-file-stats.txt (text/plain, +151 -19)
v2-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch (application/octet-stream, +205 -3)
USE-ON-PATCH-Collect-WAL-read-from-buffers-and-file-stats.txt (text/plain, +246 -19)
#8 Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#7)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:

Please review the attached v2 patch further.

I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?

Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#9 Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#8)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hi,

On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:

On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:

Please review the attached v2 patch further.

I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?

Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?

One benefit would be that it'd make it more realistic to use direct IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a win.

Greetings,

Andres Freund

#10 SATYANARAYANA NARLAPURAM
satyanarlapuram@gmail.com
In reply to: Andres Freund (#9)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Sat, Jan 14, 2023 at 12:34 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:

On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:

Please review the attached v2 patch further.

I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?

Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?

One benefit would be that it'd make it more realistic to use direct IO for
WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a
win.

+1. Archive modules rely on reading the WAL files for PITR. Direct IO for
WAL requires reading these files from disk anyway for archival. However,
archiving using utilities like pg_receivewal can take advantage of this
patch together with direct IO for WAL.

Thanks,
Satya

#11 Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#9)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hi,

On 2023-01-14 12:34:03 -0800, Andres Freund wrote:

On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:

On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:

Please review the attached v2 patch further.

I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?

Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?

One benefit would be that it'd make it more realistic to use direct IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a win.

Satya's email just now reminded me of another important reason:

Eventually we should add the ability to stream out WAL *before* it has locally
been written out and flushed. Obviously the relevant positions would have to
be noted in the relevant message in the streaming protocol, and we couldn't
generally allow standbys to apply that data yet.

That'd allow us to significantly reduce the overhead of synchronous
replication, because instead of commonly needing to send out all the pending
WAL at commit, we'd just need to send out the updated flush position. The
reason this would lower the overhead is that:

a) The reduced amount of data to be transferred reduces latency - it's easy to
accumulate a few TCP packets worth of data even in a single small OLTP
transaction
b) The remote side can start to write out data earlier

Of course this would require additional infrastructure on the receiver
side. E.g. some persistent state indicating up to where WAL is allowed to be
applied, to avoid the standby getting ahead of the primary, in case the
primary crash-restarts (or has more severe issues).

With a bit of work we could perform WAL replay on standby without waiting for
the fdatasync of the received WAL - that only needs to happen when a) we need
to confirm a flush position to the primary b) when we need to write back pages
from the buffer pool (and some other things).

Greetings,

Andres Freund

#12Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#11)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Thu, Jan 26, 2023 at 2:45 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-01-14 12:34:03 -0800, Andres Freund wrote:

On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:

On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:

Please review the attached v2 patch further.

I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?

Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?

One benefit would be that it'd make it more realistic to use direct IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a win.

Satya's email just now reminded me of another important reason:

Eventually we should add the ability to stream out WAL *before* it has locally
been written out and flushed. Obviously the relevant positions would have to
be noted in the relevant message in the streaming protocol, and we couldn't
generally allow standbys to apply that data yet.

That'd allow us to significantly reduce the overhead of synchronous
replication, because instead of commonly needing to send out all the pending
WAL at commit, we'd just need to send out the updated flush position. The
reason this would lower the overhead is that:

a) The reduced amount of data to be transferred reduces latency - it's easy to
accumulate a few TCP packets worth of data even in a single small OLTP
transaction
b) The remote side can start to write out data earlier

Of course this would require additional infrastructure on the receiver
side. E.g. some persistent state indicating up to where WAL is allowed to be
applied, to avoid the standby getting ahead of the primary, in case the
primary crash-restarts (or has more severe issues).

With a bit of work we could perform WAL replay on standby without waiting for
the fdatasync of the received WAL - that only needs to happen when a) we need
to confirm a flush position to the primary b) when we need to write back pages
from the buffer pool (and some other things).

Thanks Andres, Jeff and Satya for taking a look at the thread. Andres
is right, the eventual plan is to do a bunch of other stuff as
described above and we've discussed this in another thread (see
below). I would like to once again clarify motivation behind this
feature:

1. It enables WAL readers (callers of WALRead() - wal senders,
pg_walinspect etc.) to use WAL buffers as first level cache which
might reduce number of IOPS at a peak load especially when the pread()
results in a disk read (WAL isn't available in OS page cache). I had
earlier presented the buffer hit ratio/amount of pread() system calls
reduced with wal senders in the first email of this thread (95% of the
time wal senders are able to read from WAL buffers without impacting
anybody). Now, here are the results with the WAL DIO patch [1] - where
WAL pread() turns into a disk read, see the results [2] and attached
graph.

2. As Andres rightly mentioned, it helps WAL DIO; since there's no OS
page cache, using WAL buffers as read cache helps a lot. It is clearly
evident from my experiment with WAL DIO patch [1], see the results [2]
and attached graph. As expected, WAL DIO brings down the TPS, whereas
WAL buffers read i.e. this patch brings it up.

3. As Andres rightly mentioned above, it enables flushing WAL in
parallel on primary and all standbys [3]. I haven't yet started work
on this, I will aim for PG 17.

4. It will make the work on - disallow async standbys or subscribers
getting ahead of the sync standbys [3] possible. I haven't yet started
work on this, I will aim for PG 17.

5. It implements the following TODO item specified near WALRead():
* XXX probably this should be improved to suck data directly from the
* WAL buffers when possible.
*/
bool
WALRead(XLogReaderState *state,

That said, this feature is separately reviewable and perhaps can go
separately as it has its own benefits.

[1]: /messages/by-id/CA+hUKGLmeyrDcUYAty90V_YTcoo5kAFfQjRQ-_1joS_=X7HztA@mail.gmail.com

[2]: Test case is an insert pgbench workload.
clients HEAD WAL DIO WAL DIO & WAL BUFFERS READ WAL BUFFERS READ
1 1404 1070 1424 1375
2 1487 796 1454 1517
4 3064 1743 3011 3019
8 6114 3556 6026 5954
16 11560 7051 12216 12132
32 23181 13079 23449 23561
64 43607 26983 43997 45636
128 80723 45169 81515 81911
256 110925 90185 107332 114046
512 119354 109817 110287 117506
768 112435 105795 106853 111605
1024 107554 105541 105942 109370
2048 88552 79024 80699 90555
4096 61323 54814 58704 61743

[3]:
/messages/by-id/20220309020123.sneaoijlg3rszvst@alap3.anarazel.de
/messages/by-id/CALj2ACXCSM+sTR=5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w@mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

WALDIO&WALBUFFERSREADCOMPARISON.png (image/png)
#13Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Bharath Rupireddy (#12)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Thu, Jan 26, 2023 at 2:33 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Thu, Jan 26, 2023 at 2:45 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-01-14 12:34:03 -0800, Andres Freund wrote:

On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:

On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:

Please review the attached v2 patch further.

I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?

Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?

One benefit would be that it'd make it more realistic to use direct IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a win.

Satya's email just now reminded me of another important reason:

Eventually we should add the ability to stream out WAL *before* it has locally
been written out and flushed. Obviously the relevant positions would have to
be noted in the relevant message in the streaming protocol, and we couldn't
generally allow standbys to apply that data yet.

That'd allow us to significantly reduce the overhead of synchronous
replication, because instead of commonly needing to send out all the pending
WAL at commit, we'd just need to send out the updated flush position. The
reason this would lower the overhead is that:

a) The reduced amount of data to be transferred reduces latency - it's easy to
accumulate a few TCP packets worth of data even in a single small OLTP
transaction
b) The remote side can start to write out data earlier

Of course this would require additional infrastructure on the receiver
side. E.g. some persistent state indicating up to where WAL is allowed to be
applied, to avoid the standby getting ahead of the primary, in case the
primary crash-restarts (or has more severe issues).

With a bit of work we could perform WAL replay on standby without waiting for
the fdatasync of the received WAL - that only needs to happen when a) we need
to confirm a flush position to the primary b) when we need to write back pages
from the buffer pool (and some other things).

Thanks Andres, Jeff and Satya for taking a look at the thread. Andres
is right, the eventual plan is to do a bunch of other stuff as
described above and we've discussed this in another thread (see
below). I would like to once again clarify motivation behind this
feature:

1. It enables WAL readers (callers of WALRead() - wal senders,
pg_walinspect etc.) to use WAL buffers as first level cache which
might reduce number of IOPS at a peak load especially when the pread()
results in a disk read (WAL isn't available in OS page cache). I had
earlier presented the buffer hit ratio/amount of pread() system calls
reduced with wal senders in the first email of this thread (95% of the
time wal senders are able to read from WAL buffers without impacting
anybody). Now, here are the results with the WAL DIO patch [1] - where
WAL pread() turns into a disk read, see the results [2] and attached
graph.

2. As Andres rightly mentioned, it helps WAL DIO; since there's no OS
page cache, using WAL buffers as read cache helps a lot. It is clearly
evident from my experiment with WAL DIO patch [1], see the results [2]
and attached graph. As expected, WAL DIO brings down the TPS, whereas
WAL buffers read i.e. this patch brings it up.

[2] Test case is an insert pgbench workload.
clients HEAD WAL DIO WAL DIO & WAL BUFFERS READ WAL BUFFERS READ
1 1404 1070 1424 1375
2 1487 796 1454 1517
4 3064 1743 3011 3019
8 6114 3556 6026 5954
16 11560 7051 12216 12132
32 23181 13079 23449 23561
64 43607 26983 43997 45636
128 80723 45169 81515 81911
256 110925 90185 107332 114046
512 119354 109817 110287 117506
768 112435 105795 106853 111605
1024 107554 105541 105942 109370
2048 88552 79024 80699 90555
4096 61323 54814 58704 61743

If I'm understanding this result correctly, it seems to me that your
patch works well with the WAL DIO patch (WALDIO vs. WAL DIO & WAL
BUFFERS READ), but there seems no visible performance gain with only
your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your
patch should be included in the WAL DIO patch rather than applying it
alone. Am I missing something?

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#14Andres Freund
andres@anarazel.de
In reply to: Masahiko Sawada (#13)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hi,

On 2023-01-27 14:24:51 +0900, Masahiko Sawada wrote:

If I'm understanding this result correctly, it seems to me that your
patch works well with the WAL DIO patch (WALDIO vs. WAL DIO & WAL
BUFFERS READ), but there seems no visible performance gain with only
your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your
patch should be included in the WAL DIO patch rather than applying it
alone. Am I missing something?

We already support using DIO for WAL - it's just restricted in a way that
makes it practically not usable. And the reason for that is precisely that
walsenders need to read the WAL. See get_sync_bit():

/*
* Optimize writes by bypassing kernel cache with O_DIRECT when using
* O_SYNC and O_DSYNC. But only if archiving and streaming are disabled,
* otherwise the archive command or walsender process will read the WAL
* soon after writing it, which is guaranteed to cause a physical read if
* we bypassed the kernel cache. We also skip the
* posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
* reason.
*
* Never use O_DIRECT in walreceiver process for similar reasons; the WAL
* written by walreceiver is normally read by the startup process soon
* after it's written. Also, walreceiver performs unaligned writes, which
* don't work with O_DIRECT, so it is required for correctness too.
*/
if (!XLogIsNeeded() && !AmWalReceiverProcess())
o_direct_flag = PG_O_DIRECT;

Even if that weren't the case, splitting up bigger commits in incrementally
committable chunks is a good idea.

Greetings,

Andres Freund

#15Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#14)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, Jan 27, 2023 at 3:17 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-01-27 14:24:51 +0900, Masahiko Sawada wrote:

If I'm understanding this result correctly, it seems to me that your
patch works well with the WAL DIO patch (WALDIO vs. WAL DIO & WAL
BUFFERS READ), but there seems no visible performance gain with only
your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your
patch should be included in the WAL DIO patch rather than applying it
alone. Am I missing something?

We already support using DIO for WAL - it's just restricted in a way that
makes it practically not usable. And the reason for that is precisely that
walsenders need to read the WAL. See get_sync_bit():

/*
* Optimize writes by bypassing kernel cache with O_DIRECT when using
* O_SYNC and O_DSYNC. But only if archiving and streaming are disabled,
* otherwise the archive command or walsender process will read the WAL
* soon after writing it, which is guaranteed to cause a physical read if
* we bypassed the kernel cache. We also skip the
* posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
* reason.
*
* Never use O_DIRECT in walreceiver process for similar reasons; the WAL
* written by walreceiver is normally read by the startup process soon
* after it's written. Also, walreceiver performs unaligned writes, which
* don't work with O_DIRECT, so it is required for correctness too.
*/
if (!XLogIsNeeded() && !AmWalReceiverProcess())
o_direct_flag = PG_O_DIRECT;

Even if that weren't the case, splitting up bigger commits in incrementally
committable chunks is a good idea.

Agreed. I was wondering because the test results don't demonstrate the
first motivation of this patch, which is to improve performance by
reducing disk I/O and system calls regardless of the DIO patch. But it
makes sense to me that this patch is part of the DIO patch series.

I'd like to confirm whether there is any performance regression caused
by this patch in some cases, especially when not using DIO.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#16Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Masahiko Sawada (#15)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, Jan 27, 2023 at 12:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I'd like to confirm whether there is any performance regression caused
by this patch in some cases, especially when not using DIO.

Thanks. I ran some insert tests with primary and 1 async standby.
Please see the numbers below and attached graphs. I've not noticed a
regression as such, in fact, with patch, there's a slight improvement.
Note that there's no WAL DIO involved here.

test-case 1:
clients HEAD PATCHED
1 139 156
2 624 599
4 3113 3410
8 6194 6433
16 11255 11722
32 22455 21658
64 46072 47103
128 80255 85970
256 110067 111488
512 114043 118094
768 109588 111892
1024 106144 109361
2048 85808 90745
4096 55911 53755

test-case 2:
clients HEAD PATCHED
1 177 128
2 186 425
4 2114 2946
8 5835 5840
16 10654 11199
32 14071 13959
64 18092 17519
128 27298 28274
256 24600 24843
512 17139 19450
768 16778 20473
1024 18294 20209
2048 12898 13920
4096 6399 6815

test-case 3:
clients HEAD PATCHED
1 148 191
2 302 317
4 3415 3243
8 5864 6193
16 9573 10267
32 14069 15819
64 17424 18453
128 24493 29192
256 33180 38250
512 35568 36551
768 29731 30317
1024 32291 32124
2048 27964 28933
4096 13702 15034

[1]:
cat << EOF >> data/postgresql.conf
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/archived_wal/%f'
EOF

test-case 1:
./pgbench -i -s 300 -d postgres
./psql -d postgres -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT
pgbench_accounts_pkey;"
cat << EOF >> insert.sql
\set aid random(1, 10 * :scale)
\set delta random(1, 100000 * :scale)
INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (:aid, :aid, :delta);
EOF
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -f insert.sql -c$c
-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

test-case 2:
./pgbench --initialize --scale=300 postgres
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -b tpcb-like -c$c
-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

test-case 3:
./pgbench --initialize --scale=300 postgres
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -b simple-update
-c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

test_case_1.png (image/png)
test_case_2.png (image/png)
test_case_3.png (image/png)
#17Dilip Kumar
dilipbalaut@gmail.com
In reply to: Bharath Rupireddy (#7)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Mon, Dec 26, 2022 at 2:20 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I have gone through this patch, and I have some comments (mostly
cosmetic, and about code comments)

1.
+ /*
+ * We have found the WAL buffer page holding the given LSN. Read from a
+ * pointer to the right offset within the page.
+ */
+ memcpy(page, (XLogCtl->pages + idx * (Size) XLOG_BLCKSZ),
+    (Size) XLOG_BLCKSZ);

From the above comment, it appears that we are reading from the exact
pointer we are interested in, but actually we are reading the complete
page. I think this comment needs to be fixed, and we can also explain
why we read the complete page here.

2.
+static char *
+GetXLogBufferForRead(XLogRecPtr ptr, TimeLineID tli, char *page)
+{
+ XLogRecPtr expectedEndPtr;
+ XLogRecPtr endptr;
+ int idx;
+ char    *recptr = NULL;

Generally, we use the name 'recptr' to represent XLogRecPtr type of
variable, but in your case, it is actually data at that recptr, so
better use some other name like 'buf' or 'buffer'.

3.
+ if ((recptr + nbytes) <= (page + XLOG_BLCKSZ))
+ {
+ /* All the bytes are in one page. */
+ memcpy(dst, recptr, nbytes);
+ dst += nbytes;
+ *read_bytes += nbytes;
+ ptr += nbytes;
+ nbytes = 0;
+ }
+ else if ((recptr + nbytes) > (page + XLOG_BLCKSZ))
+ {
+ /* All the bytes are not in one page. */
+ Size bytes_remaining;

Why do you have this 'else if ((recptr + nbytes) > (page +
XLOG_BLCKSZ))' check in the else part? why it is not directly else
without a condition in 'if'?
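
The restructuring suggested here can be sketched standalone (BLKSZ and
the helper name are illustrative only, not the patch's actual code): a
plain else suffices, because the two conditions are exact complements.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BLKSZ 8                 /* stand-in for XLOG_BLCKSZ, shrunk for illustration */

/*
 * Copy up to nbytes starting at 'src' (a pointer inside the BLKSZ-sized
 * page starting at 'page') into 'dst'.  Returns how many bytes were
 * copied from this page; any remainder lives on the next page.  Note
 * the plain 'else': the opposite condition need not be re-tested.
 */
static size_t
copy_from_page(char *dst, const char *page, const char *src, size_t nbytes)
{
    if (src + nbytes <= page + BLKSZ)
    {
        /* All the bytes are in one page. */
        memcpy(dst, src, nbytes);
        return nbytes;
    }
    else
    {
        /* The bytes span the page boundary; copy only what this page holds. */
        size_t  avail = (size_t) (page + BLKSZ - src);

        memcpy(dst, src, avail);
        return avail;
    }
}
```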

4.
+XLogReadFromBuffers(XLogRecPtr startptr,
+ TimeLineID tli,
+ Size count,
+ char *buf,
+ Size *read_bytes)

I think we do not need 2 separate variables 'count' and '*read_bytes',
just one variable for input/output is sufficient. The original value
can always be stored in some temp variable
instead of the function argument.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#18Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Dilip Kumar (#17)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Tue, Feb 7, 2023 at 4:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Dec 26, 2022 at 2:20 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I have gone through this patch, I have some comments (mostly cosmetic
and comments)

Thanks a lot for reviewing.

From the above comments, it appears that we are reading from the exact
pointer we are interested to read, but actually, we are reading
the complete page. I think this comment needs to be fixed and we can
also explain why we read the complete page here.

I modified it. Please see the attached v3 patch.

Generally, we use the name 'recptr' to represent XLogRecPtr type of
variable, but in your case, it is actually data at that recptr, so
better use some other name like 'buf' or 'buffer'.

Changed it to use 'data' as it seemed more appropriate than just a
buffer to not confuse with the WAL buffer page.

3.
+ if ((recptr + nbytes) <= (page + XLOG_BLCKSZ))
+ {
+ }
+ else if ((recptr + nbytes) > (page + XLOG_BLCKSZ))
+ {

Why do you have this 'else if ((recptr + nbytes) > (page +
XLOG_BLCKSZ))' check in the else part? why it is not directly else
without a condition in 'if'?

Changed.

I think we do not need 2 separate variables 'count' and '*read_bytes',
just one variable for input/output is sufficient. The original value
can always be stored in some temp variable
instead of the function argument.

We could do that, but for the sake of readability and not cluttering
the API, I kept it as-is.

Besides addressing the above review comments, I've made some more
changes - 1) I optimized the patch a bit by removing an extra memcpy.
Up until v2 patch, the entire WAL buffer page is returned and the
caller takes what is wanted from it. This adds an extra memcpy, so I
changed it to avoid extra memcpy and just copy what is wanted. 2) I
improved the comments.

I can also do a few other things, but before working on them, I'd like
to hear from others:
1. A separate wait event (WAIT_EVENT_WAL_READ_FROM_BUFFERS) for
reading from WAL buffers - right now, WAIT_EVENT_WAL_READ is being
used both for reading from WAL buffers and WAL files. Given the fact
that we won't wait for a lock or do a time-taking task while reading
from buffers, it seems unnecessary.
2. A separate TAP test for verifying that the WAL is actually read
from WAL buffers - right now, existing tests for recovery,
subscription, pg_walinspect already cover the code, see [1]. However,
if needed, I can add a separate TAP test.
3. Use the oldest initialized WAL buffer page to quickly tell if the
given LSN is present in WAL buffers without taking any lock - right
now, WALBufMappingLock is acquired to do so. While this doesn't seem
to impact much, it's good to optimize it away. But, the oldest
initialized WAL buffer page isn't tracked, so I've put up a patch and
sent in another thread [2]. Irrespective of [2], we are still good
with what we have in this patch.
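
For reference, the shape of that buffer-page mapping check can be
mocked standalone. The sizes, names, and the missing locking below are
all illustrative; the real code consults XLogCtl->xlblocks under
WALBufMappingLock.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ      8192   /* stand-in for XLOG_BLCKSZ */
#define NBUFFERS    16     /* stand-in for wal_buffers */

/* Mock of XLogCtl->xlblocks: end LSN of the page each buffer slot holds. */
static uint64_t xlblocks[NBUFFERS];

/*
 * Roughly how one can tell whether a given LSN is in WAL buffers: map
 * the page to its slot, then check that the slot's recorded end pointer
 * matches the end of the page we want.  (This toy version ignores
 * concurrency entirely.)
 */
static bool
lsn_in_buffers(uint64_t lsn)
{
    uint64_t page_start = lsn - (lsn % BLCKSZ);
    uint64_t expected_end = page_start + BLCKSZ;
    int      idx = (int) ((lsn / BLCKSZ) % NBUFFERS);

    return xlblocks[idx] == expected_end;
}
```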

[1]:
recovery tests:
PATCHED: WAL buffers hit - 14759, misses - 3371

subscription tests:
PATCHED: WAL buffers hit - 1972, misses - 32616

pg_walinspect tests:
PATCHED: WAL buffers hit - 8, misses - 8

[2]: /messages/by-id/CALj2ACVgi6LirgLDZh=FdfdvGvKAD==WTOSWcQy=AtNgPDVnKw@mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v3-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch (application/x-patch, +229 -3)
#19Dilip Kumar
dilipbalaut@gmail.com
In reply to: Bharath Rupireddy (#18)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Wed, Feb 8, 2023 at 9:57 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I can also do a few other things, but before working on them, I'd like
to hear from others:
1. A separate wait event (WAIT_EVENT_WAL_READ_FROM_BUFFERS) for
reading from WAL buffers - right now, WAIT_EVENT_WAL_READ is being
used both for reading from WAL buffers and WAL files. Given the fact
that we won't wait for a lock or do a time-taking task while reading
from buffers, it seems unnecessary.

Yes, we do not need this separate wait event, and we also don't need
the WAIT_EVENT_WAL_READ wait event while reading from the buffer.
Because we are not performing any IO, no specific wait event is needed;
and for reading from the WAL buffer we acquire WALBufMappingLock, so
that lwlock wait will be tracked under that lock.

2. A separate TAP test for verifying that the WAL is actually read
from WAL buffers - right now, existing tests for recovery,
subscription, pg_walinspect already cover the code, see [1]. However,
if needed, I can add a separate TAP test.

Can we write a test that can actually validate that we have read from
a WAL buffer? If so, it would be good to have such a test to avoid
any future breakage in that logic. But if it just hits the code, with
no way for the test to validate whether it has hit the WAL buffer or
not, then I think the existing cases are sufficient.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#20Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Dilip Kumar (#19)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Wed, Feb 8, 2023 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Feb 8, 2023 at 9:57 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I can also do a few other things, but before working on them, I'd like
to hear from others:
1. A separate wait event (WAIT_EVENT_WAL_READ_FROM_BUFFERS) for
reading from WAL buffers - right now, WAIT_EVENT_WAL_READ is being
used both for reading from WAL buffers and WAL files. Given the fact
that we won't wait for a lock or do a time-taking task while reading
from buffers, it seems unnecessary.

Yes, we do not need this separate wait event and we also don't need
WAIT_EVENT_WAL_READ wait event while reading from the buffer. Because
we are not performing any IO so no specific wait event is needed and
for reading from the WAL buffer we are acquiring WALBufMappingLock so
that lwlock event will be tracked under that lock.

Nope, LWLockConditionalAcquire doesn't wait, so no lock wait event (no
LWLockReportWaitStart) there. I agree to not have any wait event for
reading from WAL buffers as no IO is involved there. I removed it in
the attached v4 patch.
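
The non-waiting behaviour described here, roughly: the reader only ever
*tries* the lock and falls back to the file when it would otherwise
block. A toy single-threaded mock (the real code uses
LWLockConditionalAcquire on WALBufMappingLock; the flag below is
invented):

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical mock of the lock: 0 = free, 1 = held by someone else. */
static int wal_buf_mapping_lock = 0;

static bool
try_lock(void)
{
    if (wal_buf_mapping_lock)
        return false;           /* would contend; never wait */
    wal_buf_mapping_lock = 1;
    return true;
}

static void
unlock(void)
{
    wal_buf_mapping_lock = 0;
}

/*
 * Shape of the read path: attempt the buffer read without blocking;
 * on contention, fall back to reading from the WAL file.  Since no
 * waiting ever happens on the buffer side, no wait event is needed.
 */
static const char *
read_wal(void)
{
    if (try_lock())
    {
        /* ... copy from WAL buffers here ... */
        unlock();
        return "buffers";
    }
    return "file";
}
```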

2. A separate TAP test for verifying that the WAL is actually read
from WAL buffers - right now, existing tests for recovery,
subscription, pg_walinspect already cover the code, see [1]. However,
if needed, I can add a separate TAP test.

Can we write a test that can actually validate that we have read from
a WAL Buffer? If so then it would be good to have such a test to avoid
any future breakage in that logic. But if it is just for hitting the
code but no guarantee that whether we can validate as part of the test
whether it has hit the WAL buffer or not then I think the existing
cases are sufficient.

We could set up a standby or a logical replication subscriber or
pg_walinspect extension and verify if the code got hit with the help
of the server log (DEBUG1) message added by the patch. However, this
can make the test volatile.

Therefore, I came up with a simple and small test module/extension
named test_wal_read_from_buffers under src/test/modules. It basically
exposes a SQL function that, given an LSN, calls XLogReadFromBuffers()
and returns true if it hits WAL buffers, otherwise false. And the
simple TAP test of this module verifies if the function returns true.
I attached the test module as v4-0002 here. The test module looks
specific and also helps as demonstration of how one can possibly use
the new XLogReadFromBuffers().
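
For illustration, the module's hit/miss decision can be mocked
standalone. The stub below stands in for the real XLogReadFromBuffers()
and its 64-byte "cache" is invented; only the final comparison reflects
the check the SQL function performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical stand-in for XLogReadFromBuffers(): pretend the WAL
 * buffers hold 'cached' bytes of WAL starting at LSN 0, and report how
 * many of the requested bytes could be served from them.
 */
static size_t cached = 64;

static void
mock_read_from_buffers(size_t startptr, size_t count, size_t *read_bytes)
{
    if (startptr >= cached)
        *read_bytes = 0;
    else
        *read_bytes = (count < cached - startptr) ? count : cached - startptr;
}

/*
 * Shape of the test module's check: a read "hits" the WAL buffers only
 * if every requested byte came from them.
 */
static bool
wal_read_hits_buffers(size_t startptr, size_t count)
{
    size_t read_bytes;

    mock_read_from_buffers(startptr, count, &read_bytes);
    return read_bytes == count;
}
```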

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v4-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch (application/octet-stream, +227 -3)
v4-0002-Add-test-module-for-verifying-WAL-read-from-WAL-b.patch (application/octet-stream, +180 -1)