Improve WALRead() to suck data directly from WAL buffers when possible
Hi,
WALRead() currently reads WAL from the WAL file on disk, which means
the walsenders serving streaming and logical replication (callers of
WALRead()) have to go through the OS page cache or the disk to read
the WAL. This can increase the amount of read IO required by all the
walsenders put together, as one typically maintains many
standbys/subscribers on production servers for high availability,
disaster recovery, read replicas and so on. It can also increase
replication lag if the WAL reads keep hitting the disk.
The WAL buffers may already contain the requested WAL; if so,
WALRead() can attempt to read from the WAL buffers before reading from
the file. If the read hits the WAL buffers, reading the file on disk
is avoided. This mainly reduces read IO/read system calls. It also
enables other features described elsewhere [1].
I'm attaching a patch that implements the idea, which is also noted in
an existing TODO comment [2]. I've run some tests [3]. The WAL buffers
hit ratio with the patch stood at 95%; in other words, the walsenders
avoided reading from the file 95% of the time. Measured in terms of
the amount of data, 79% (13.5GB out of 17GB total) of the requested
WAL was read from the WAL buffers, as opposed to 21% from the file.
Note that the WAL buffers hit ratio can be very low for write-heavy
workloads, in which case file reads are inevitable.
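For reference, the hit ratio above can be computed from the
per-walsender counters shown below; a sketch of the query, assuming
the test-only instrumentation patch shared downthread that adds the
wal_read* columns to pg_stat_replication:

SELECT application_name,
       round(100.0 * wal_read_buffers
                   / nullif(wal_read + wal_read_buffers, 0), 1)
           AS buffer_hit_ratio_pct,   -- ~95% in the run below
       round(100.0 * wal_read_bytes_buffers
                   / nullif(wal_read_bytes + wal_read_bytes_buffers, 0), 1)
           AS bytes_from_buffers_pct  -- ~79% in the run below
FROM pg_stat_replication;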
The patch introduces concurrent readers for the WAL buffers; so far
there are only concurrent writers. In the patch, WALRead() takes just
one lock (WALBufMappingLock) in shared mode to enable concurrent
readers and does minimal things - checks if the requested WAL page is
present in WAL buffers, if so, copies the page and releases the lock.
I think taking just WALBufMappingLock is enough here as the concurrent
writers depend on it to initialize and replace a page in WAL buffers.
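In outline, the read path looks like this; a condensed sketch of
GetXLogBufferForRead() from the attached patch (it uses
backend-internal symbols, so it is not standalone):

static char *
GetXLogBufferForRead(XLogRecPtr ptr, TimeLineID tli, char *page)
{
    int         idx = XLogRecPtrToBufIdx(ptr);
    XLogRecPtr  expectedEndPtr = ptr + (XLOG_BLCKSZ - ptr % XLOG_BLCKSZ);
    XLogPageHeader phdr;

    LWLockAcquire(WALBufMappingLock, LW_SHARED);
    if (XLogCtl->xlblocks[idx] != expectedEndPtr)
    {
        /* The buffer page doesn't (or no longer does) hold this LSN. */
        LWLockRelease(WALBufMappingLock);
        return NULL;
    }
    /* Copy the whole page out and release the lock as soon as possible. */
    memcpy(page, XLogCtl->pages + idx * (Size) XLOG_BLCKSZ, XLOG_BLCKSZ);
    LWLockRelease(WALBufMappingLock);

    /* Sanity-check the copied page before handing it to the caller. */
    phdr = (XLogPageHeader) page;
    if (phdr->xlp_magic != XLOG_PAGE_MAGIC ||
        phdr->xlp_pageaddr != ptr - (ptr % XLOG_BLCKSZ) ||
        phdr->xlp_tli != tli)
        return NULL;

    return page + ptr % XLOG_BLCKSZ;
}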
I'll add this to the next commitfest.
Thoughts?
[1]: /messages/by-id/CALj2ACXCSM+sTR=5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w@mail.gmail.com
[2]:
* XXX probably this should be improved to suck data directly from the
* WAL buffers when possible.
*/
bool
WALRead(XLogReaderState *state,
[3]:
1 primary, 1 sync standby, 1 async standby
./pgbench --initialize --scale=300 postgres
./pgbench --jobs=16 --progress=300 --client=32 --time=900 --username=ubuntu postgres
PATCHED:
-[ RECORD 1 ]----------+----------------
application_name | assb1
wal_read | 31005
wal_read_bytes | 3800607104
wal_read_time | 779.402
wal_read_buffers | 610611
wal_read_bytes_buffers | 14493226440
wal_read_time_buffers | 3033.309
sync_state | async
-[ RECORD 2 ]----------+----------------
application_name | ssb1
wal_read | 31027
wal_read_bytes | 3800932712
wal_read_time | 696.365
wal_read_buffers | 610580
wal_read_bytes_buffers | 14492900832
wal_read_time_buffers | 2989.507
sync_state | sync
HEAD:
-[ RECORD 1 ]----+----------------
application_name | assb1
wal_read | 705627
wal_read_bytes | 18343480640
wal_read_time | 7607.783
sync_state | async
-[ RECORD 2 ]----+------------
application_name | ssb1
wal_read | 705625
wal_read_bytes | 18343480640
wal_read_time | 4539.058
sync_state | sync
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
v1-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch (application/octet-stream)
From d93a6c97bad19d3718f0e4f06caeac5ce9937b37 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 8 Dec 2022 09:37:01 +0000
Subject: [PATCH v1] Improve WALRead() to suck data directly from WAL buffers
when possible
---
src/backend/access/transam/xlog.c | 184 ++++++++++++++++++++++++
src/backend/access/transam/xlogreader.c | 58 +++++++-
src/include/access/xlog.h | 9 ++
3 files changed, 249 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a31fbbff78..196be98591 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -689,6 +689,7 @@ static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
XLogRecPtr *PrevPtr);
static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
+static char *GetXLogBufferForRead(XLogRecPtr ptr, TimeLineID tli, char *page);
static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
@@ -1639,6 +1640,189 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
return cachedPos + ptr % XLOG_BLCKSZ;
}
+/*
+ * Get the WAL buffer page containing the passed-in WAL record and return the
+ * record's location within that buffer page.
+ */
+static char *
+GetXLogBufferForRead(XLogRecPtr ptr, TimeLineID tli, char *page)
+{
+ XLogRecPtr expectedEndPtr;
+ XLogRecPtr endptr;
+ int idx;
+ char *recptr = NULL;
+
+ idx = XLogRecPtrToBufIdx(ptr);
+ expectedEndPtr = ptr;
+ expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+ /*
+ * Hold WALBufMappingLock in shared mode so that the other concurrent WAL
+ * readers are also allowed. We try to do as little work as possible while
+ * holding the lock as it might impact concurrent WAL writers.
+ *
+ * XXX: Perhaps, measuring the immediate lock availability and its impact
+ * on concurrent WAL writers is a good idea here.
+ *
+ * XXX: Perhaps, returning if the lock is not immediately available is a good
+ * idea here. The caller can then go ahead with reading WAL from the WAL file.
+ *
+ * XXX: Perhaps, quickly finding if the given WAL record is in WAL buffers is
+ * a good idea here. This avoids unnecessary lock acquire-release cycles.
+ * One way to do that is by maintaining the oldest WAL record that's currently
+ * present in WAL buffers.
+ */
+ LWLockAcquire(WALBufMappingLock, LW_SHARED);
+
+ /*
+ * Holding WALBufMappingLock ensures inserters don't overwrite this value
+ * while we are reading it.
+ */
+ endptr = XLogCtl->xlblocks[idx];
+
+ if (expectedEndPtr == endptr)
+ {
+ XLogPageHeader phdr;
+
+ /*
+ * We have found the WAL buffer page holding the given LSN. Read from a pointer
+ * to the right offset within the page.
+ */
+ memcpy(page, (XLogCtl->pages + idx * (Size) XLOG_BLCKSZ),
+ (Size) XLOG_BLCKSZ);
+
+ /*
+ * Release the lock as early as possible to avoid any possible
+ * contention.
+ */
+ LWLockRelease(WALBufMappingLock);
+
+ /*
+ * Even though we read the WAL buffer page while holding all necessary locks,
+ * we still want to be extra cautious here and serve the valid WAL
+ * buffer page.
+ *
+ * XXX: Perhaps, we can further go and validate the found page header,
+ * record header and record at least in assert builds, something like
+ * what xlogreader.c does, and return if any of those validity checks
+ * fail. Having said that, we stick to the minimal checks for now.
+ */
+ phdr = (XLogPageHeader) page;
+
+ if (phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+ phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+ phdr->xlp_tli == tli)
+ {
+ /*
+ * Page looks valid, so return the page and the requested record's
+ * LSN.
+ */
+ recptr = page + ptr % XLOG_BLCKSZ;
+ }
+ }
+ else
+ {
+ /* We have not found anything. */
+ LWLockRelease(WALBufMappingLock);
+ }
+
+ return recptr;
+}
+
+/*
+ * When possible, read WAL starting at 'startptr' of size 'count' bytes from
+ * WAL buffers into buffer passed in by the caller 'buf'. Read as much WAL as
+ * possible from the WAL buffers; the caller will take care of reading the
+ * remaining WAL, if any, from the WAL files directly.
+ *
+ * This function sets read bytes to 'read_bytes' and sets 'hit', 'partial_hit'
+ * and 'miss' accordingly.
+ */
+void
+XLogReadFromBuffers(XLogRecPtr startptr,
+ TimeLineID tli,
+ Size count,
+ char *buf,
+ Size *read_bytes,
+ bool *hit,
+ bool *partial_hit,
+ bool *miss)
+{
+ XLogRecPtr ptr;
+ char *dst;
+ Size nbytes;
+
+ Assert(!XLogRecPtrIsInvalid(startptr));
+ Assert(count > 0);
+ Assert(startptr <= GetFlushRecPtr(NULL));
+ Assert(!RecoveryInProgress());
+
+ ptr = startptr;
+ nbytes = count;
+ dst = buf;
+ *read_bytes = 0;
+ *hit = false;
+ *partial_hit = false;
+ *miss = false;
+
+ while (nbytes > 0)
+ {
+ char page[XLOG_BLCKSZ] = {0};
+ char *recptr;
+
+ recptr = GetXLogBufferForRead(ptr, tli, page);
+
+ if (recptr == NULL)
+ break;
+
+ if ((recptr + nbytes) <= (page + XLOG_BLCKSZ))
+ {
+ /* All the bytes are in one page. */
+ memcpy(dst, recptr, nbytes);
+ dst += nbytes;
+ *read_bytes += nbytes;
+ ptr += nbytes;
+ nbytes = 0;
+ }
+ else if ((recptr + nbytes) > (page + XLOG_BLCKSZ))
+ {
+ /* Not all the bytes are in one page. */
+ Size bytes_remaining;
+
+ /*
+ * Compute the remaining bytes on the current page, copy them over
+ * to output buffer and move forward to read further.
+ */
+ bytes_remaining = XLOG_BLCKSZ - (recptr - page);
+ memcpy(dst, recptr, bytes_remaining);
+ dst += bytes_remaining;
+ nbytes -= bytes_remaining;
+ *read_bytes += bytes_remaining;
+ ptr += bytes_remaining;
+ }
+ }
+
+ if (*read_bytes == count)
+ {
+ /* It's a buffer hit. */
+ *hit = true;
+ }
+ else if (*read_bytes > 0 &&
+ *read_bytes < count)
+ {
+ /* It's a buffer partial hit. */
+ *partial_hit = true;
+ }
+ else if (*read_bytes == 0)
+ {
+ /* It's a buffer miss. */
+ *miss = true;
+ }
+
+ elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X",
+ *read_bytes, count, LSN_FORMAT_ARGS(startptr));
+}
+
/*
* Converts a "usable byte position" to XLogRecPtr. A usable byte position
* is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a38a80e049..7ec94a0535 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1485,8 +1485,7 @@ err:
* Returns true if succeeded, false if an error occurs, in which case
* 'errinfo' receives error details.
*
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers.
*/
bool
WALRead(XLogReaderState *state,
@@ -1497,6 +1496,61 @@ WALRead(XLogReaderState *state,
XLogRecPtr recptr;
Size nbytes;
+#ifndef FRONTEND
+ /* Frontend tools have no idea of WAL buffers. */
+ Size read_bytes;
+ bool hit;
+ bool partial_hit;
+ bool miss;
+
+ /*
+ * When possible, read WAL from WAL buffers. We skip this step and continue
+ * the usual way, that is, reading from the WAL file, either when the server
+ * is in recovery (standby mode, archive or crash recovery), in which case
+ * the WAL buffers are not used, or when the server is inserting into a
+ * different timeline from the one that we're trying to read WAL from.
+ */
+ if (!RecoveryInProgress() &&
+ tli == GetWALInsertionTimeLine())
+ {
+ pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+ XLogReadFromBuffers(startptr, tli, count, buf, &read_bytes,
+ &hit, &partial_hit, &miss);
+ pgstat_report_wait_end();
+
+ if (hit)
+ {
+ /*
+ * We have fully read the requested WAL from WAL buffers, so
+ * return.
+ */
+ Assert(count == read_bytes);
+ return true;
+ }
+ else if (partial_hit)
+ {
+ /*
+ * We have partially read from WAL buffers, so reset the state and
+ * read the remaining bytes the usual way.
+ */
+ Assert(read_bytes > 0 && count > read_bytes);
+ buf += read_bytes;
+ startptr += read_bytes;
+ count -= read_bytes;
+ }
+#ifdef USE_ASSERT_CHECKING
+ else if (miss)
+ {
+ /*
+ * We have not read anything from WAL buffers, so read the usual way,
+ * that is to read from WAL file.
+ */
+ Assert(read_bytes == 0);
+ }
+#endif /* USE_ASSERT_CHECKING */
+ }
+#endif /* FRONTEND */
+
p = buf;
recptr = startptr;
nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1fbd48fbda..968608353e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -247,6 +247,15 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
extern void SetWalWriterSleeping(bool sleeping);
+extern void XLogReadFromBuffers(XLogRecPtr startptr,
+ TimeLineID tli,
+ Size count,
+ char *buf,
+ Size *read_bytes,
+ bool *hit,
+ bool *partial_hit,
+ bool *miss);
+
/*
* Routines used by xlogrecovery.c to call back into xlog.c during recovery.
*/
--
2.34.1
At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
The patch introduces concurrent readers for the WAL buffers; so far
there are only concurrent writers. In the patch, WALRead() takes just
one lock (WALBufMappingLock) in shared mode to enable concurrent
readers and does minimal things - checks if the requested WAL page is
present in WAL buffers, if so, copies the page and releases the lock.
I think taking just WALBufMappingLock is enough here as the concurrent
writers depend on it to initialize and replace a page in WAL buffers.
I'll add this to the next commitfest.
Thoughts?
This adds copying of the whole page (at least) at every WAL *record*
read, fighting all WAL writers by taking WALBufMappingLock on a very
busy page while copying. I'm a bit doubtful that it results in an
overall improvement. It seems to me almost all pread()s here happen
on the file buffer, so it is unclear to me that copying a whole WAL page
(then copying the target record again) wins over a pread() call that
copies only the record to read. Do you have an actual number of how
frequently WAL reads go to disk, or the actual number of performance
gain or real I/O reduction this patch offers?
This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertions
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.
I remember that one of the advantages of reading the on-memory WAL
records is that it allows the walsender to pre-send the unwritten
records. So perhaps we should manage how far the buffer is filled with
valid content (or how far we can pre-send) in this feature.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
At Mon, 12 Dec 2022 11:57:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertions
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.
Mmm. I'm a bit dim. Recovery doesn't read concurrently-written
records. Please forget about recovery.
I remember that one of the advantages of reading the on-memory WAL
records is that it allows the walsender to pre-send the unwritten
records. So perhaps we should manage how far the buffer is filled with
valid content (or how far we can pre-send) in this feature.
--
Kyotaro Horiguchi
NTT Open Source Software Center
Sorry for the confusion.
At Mon, 12 Dec 2022 12:06:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
At Mon, 12 Dec 2022 11:57:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in
This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertions
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.
Mmm. I'm a bit dim. Recovery doesn't read concurrently-written
records. Please forget about recovery.
NO... Logical walsenders do that. So, please forget about this...
I remember that one of the advantages of reading the on-memory WAL
records is that it allows the walsender to pre-send the unwritten
records. So perhaps we should manage how far the buffer is filled with
valid content (or how far we can pre-send) in this feature.
regards.
--
Kyotaro Horiguchi
NTT Open Source Software Center
On Mon, Dec 12, 2022 at 8:27 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:
Thanks for providing thoughts.
At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
The patch introduces concurrent readers for the WAL buffers; so far
there are only concurrent writers. In the patch, WALRead() takes just
one lock (WALBufMappingLock) in shared mode to enable concurrent
readers and does minimal things - checks if the requested WAL page is
present in WAL buffers, if so, copies the page and releases the lock.
I think taking just WALBufMappingLock is enough here as the concurrent
writers depend on it to initialize and replace a page in WAL buffers.
I'll add this to the next commitfest.
Thoughts?
This adds copying of the whole page (at least) at every WAL *record*
read,
In the worst case yes, but that may not always be true. On a typical
production server with decent write traffic, it happens that the
callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
MAX_SEND_SIZE bytes.
fighting all WAL writers by taking WALBufMappingLock on a very
busy page while copying. I'm a bit doubtful that it results in an
overall improvement.
Well, the tests don't reflect that [1]. I've run an insert workload
[2]; the results look pretty cool. If you have any use cases in mind,
please share them and/or feel free to run the tests at your end.
It seems to me almost all pread()s here happen
on the file buffer, so it is unclear to me that copying a whole WAL page
(then copying the target record again) wins over a pread() call that
copies only the record to read.
That's not always guaranteed. Imagine a typical production server with
decent write traffic and heavy analytical queries (which fill the OS
page cache with the table pages accessed by those queries); the WAL
pread() calls then turn into IOPS. Despite the WAL being present in
the WAL buffers, customers will be paying unnecessarily for these IOPS
too. With the patch, we basically avoid the pread() system calls that
may turn into IOPS on production servers (99% of the time for the
insert use case [1][2], 95% of the time for the pgbench use case
specified upthread). With the patch, the WAL buffers can act as an L1
cache, if one treats the OS page cache as an L2 cache (of course this
illustration is not related to the typical processor L1 and L2
caches).
Do you have an actual number of how
frequently WAL reads go to disk, or the actual number of performance
gain or real I/O reduction this patch offers?
It might be a bit tough to generate such heavy traffic. An idea is to
ensure the WAL page/file goes out of the OS page cache before
WALRead(); these might help here: the 0002 patch from
/messages/by-id/CA+hUKGLmeyrDcUYAty90V_YTcoo5kAFfQjRQ-_1joS_=X7HztA@mail.gmail.com
and the pgfincore tool (https://github.com/klando/pgfincore).
This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertions
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.
WALRead() callers are smart enough to take the flushed bytes only.
Although they read the whole WAL page, they calculate the valid bytes.
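For instance, logical_read_xlog_page() clamps the request to the flush
pointer before ever calling WALRead(); roughly (abridged from
walsender.c):

    flushptr = WalSndWaitForWal(targetPagePtr + reqLen);

    if (targetPagePtr + XLOG_BLCKSZ <= flushptr)
        count = XLOG_BLCKSZ;              /* the full page is flushed */
    else
        count = flushptr - targetPagePtr; /* only part of the page is
                                           * available */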
I remember that one of the advantages of reading the on-memory WAL
records is that it allows the walsender to pre-send the unwritten
records. So perhaps we should manage how far the buffer is filled with
valid content (or how far we can pre-send) in this feature.
Yes, the non-flushed WAL can be read and sent across if one wishes to
make replication faster and to parallelize flushing on the primary and
standbys, at the cost of a bit of extra crash handling; that's
mentioned here /messages/by-id/CALj2ACXCSM+sTR=5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w@mail.gmail.com.
However, this can be a separate discussion.
I also want to reiterate that the patch implements an existing TODO item:
* XXX probably this should be improved to suck data directly from the
* WAL buffers when possible.
*/
bool
WALRead(XLogReaderState *state,
[1]:
PATCHED:
1 1470.329907
2 1437.096329
4 2966.096948
8 5978.441040
16 11405.538255
32 22933.546058
64 43341.870038
128 73623.837068
256 104754.248661
512 115746.359530
768 106106.691455
1024 91900.079086
2048 84134.278589
4096 62580.875507
-[ RECORD 1 ]----------+-----------
application_name | assb1
sent_lsn | 0/1B8106A8
write_lsn | 0/1B8106A8
flush_lsn | 0/1B8106A8
replay_lsn | 0/1B8106A8
write_lag |
flush_lag |
replay_lag |
wal_read | 104
wal_read_bytes | 10733008
wal_read_time | 1.845
wal_read_buffers | 76662
wal_read_bytes_buffers | 383598808
wal_read_time_buffers | 205.418
sync_state | async
HEAD:
1 1312.054496
2 1449.429321
4 2717.496207
8 5913.361540
16 10762.978907
32 19653.449728
64 41086.124269
128 68548.061171
256 104468.415361
512 114328.943598
768 91751.279309
1024 96403.736757
2048 82155.140270
4096 66160.659511
-[ RECORD 1 ]----+-----------
application_name | assb1
sent_lsn | 0/1AB5BCB8
write_lsn | 0/1AB5BCB8
flush_lsn | 0/1AB5BCB8
replay_lsn | 0/1AB5BCB8
write_lag |
flush_lag |
replay_lag |
wal_read | 71967
wal_read_bytes | 381009080
wal_read_time | 243.616
sync_state | async
[2]: Test details:
./configure --prefix=$PWD/inst/ CFLAGS="-O3" > install.log && make -j 8 install > install.log 2>&1 &
1 primary, 1 async standby
cd inst/bin
./pg_ctl -D data -l logfile stop
./pg_ctl -D assbdata -l logfile1 stop
rm -rf data assbdata
rm logfile logfile1
free -m
sudo su -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -m
./initdb -D data
rm -rf /home/ubuntu/archived_wal
mkdir /home/ubuntu/archived_wal
cat << EOF >> data/postgresql.conf
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/archived_wal/%f'
track_wal_io_timing = 'on'
EOF
./pg_ctl -D data -l logfile start
./psql -c "select
pg_create_physical_replication_slot('assb1_repl_slot', true, false)"
postgres
./pg_ctl -D data -l logfile restart
./pg_basebackup -D assbdata
./pg_ctl -D data -l logfile stop
cat << EOF >> assbdata/postgresql.conf
port=5433
primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu application_name=assb1'
primary_slot_name='assb1_repl_slot'
restore_command='cp /home/ubuntu/archived_wal/%f %p'
EOF
touch assbdata/standby.signal
./pg_ctl -D data -l logfile start
./pg_ctl -D assbdata -l logfile1 start
./pgbench -i -s 1 -d postgres
./psql -d postgres -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;"
cat << EOF >> insert.sql
\set aid random(1, 10 * :scale)
\set delta random(1, 100000 * :scale)
INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (:aid, :aid, :delta);
EOF
ulimit -S -n 5000
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n "$c "; ./pgbench -n -M prepared -U ubuntu postgres -f insert.sql -c$c -j$c -T5 2>&1 | grep '^tps' | awk '{print $3}'; done
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Fri, Dec 23, 2022 at 3:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
On Mon, Dec 12, 2022 at 8:27 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:Thanks for providing thoughts.
At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in
The patch introduces concurrent readers for the WAL buffers; so far
there are only concurrent writers. In the patch, WALRead() takes just
one lock (WALBufMappingLock) in shared mode to enable concurrent
readers and does minimal things - checks if the requested WAL page is
present in WAL buffers, if so, copies the page and releases the lock.
I think taking just WALBufMappingLock is enough here as the concurrent
writers depend on it to initialize and replace a page in WAL buffers.
I'll add this to the next commitfest.
Thoughts?
This adds copying of the whole page (at least) at every WAL *record*
read,
In the worst case yes, but that may not always be true. On a typical
production server with decent write traffic, it happens that the
callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
MAX_SEND_SIZE bytes.
I agree with this.
This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertions
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.
WALRead() callers are smart enough to take the flushed bytes only.
Although they read the whole WAL page, they calculate the valid bytes.
Right
On first read the patch looks good, although it needs some more
thought on the 'XXX' comments in the patch.
Also, I do not like that XLogReadFromBuffers() uses 3 bools
(hit/partial hit/miss); instead we can use an enum or some tristate
variable, which I think will be cleaner.
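Something like this, perhaps (names are just illustrative):

typedef enum XLogReadFromBuffersResult
{
    XLOGREAD_BUF_HIT,           /* all requested bytes read from buffers */
    XLOGREAD_BUF_PARTIAL_HIT,   /* only a prefix read from buffers */
    XLOGREAD_BUF_MISS           /* nothing read from buffers */
} XLogReadFromBuffersResult;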
--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
On Sun, Dec 25, 2022 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
This adds copying of the whole page (at least) at every WAL *record*
read,
In the worst case yes, but that may not always be true. On a typical
production server with decent write traffic, it happens that the
callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
MAX_SEND_SIZE bytes.
I agree with this.
This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertions
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.
WALRead() callers are smart enough to take the flushed bytes only.
Although they read the whole WAL page, they calculate the valid bytes.
Right
On first read the patch looks good, although it needs some more
thought on the 'XXX' comments in the patch.
Thanks a lot for reviewing.
Here are some open points that I mentioned in the v1 patch:
1.
+ * XXX: Perhaps, measuring the immediate lock availability and its impact
+ * on concurrent WAL writers is a good idea here.
It was shown in my testing upthread [1] that the patch does no harm in
this regard. It will be great if other members try testing in their
respective environments and use cases.
2.
+ * XXX: Perhaps, returning if the lock is not immediately available is a good
+ * idea here. The caller can then go ahead with reading WAL from the WAL file.
After thinking a bit more on this, ISTM that doing the above is right
to not cause any contention when the lock is busy. I've done so in the
v2 patch.
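That is, the function now bails out instead of waiting (from the
attached v2 patch):

    /*
     * Give up immediately if the lock is busy; the caller then falls
     * back to reading from the WAL file.
     */
    if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
        return recptr;          /* recptr is still NULL here */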
3.
+ * XXX: Perhaps, quickly finding if the given WAL record is in WAL buffers is
+ * a good idea here. This avoids unnecessary lock acquire-release cycles.
+ * One way to do that is by maintaining the oldest WAL record that's currently
+ * present in WAL buffers.
I think by doing the above we might end up creating a new point of
contention, because shared variables to track the min and max available
LSNs in the WAL buffers would need to be protected against all the
concurrent writers. Also, with the change done in (2) above, that is,
quickly exiting if the lock is busy, this comment seems unnecessary to
worry about. Hence, I decided to leave it there.
4.
+ * XXX: Perhaps, we can further go and validate the found page header,
+ * record header and record at least in assert builds, something like
+ * what xlogreader.c does, and return if any of those validity checks
+ * fail. Having said that, we stick to the minimal checks for now.
I was being over-cautious initially. The fact that we acquire
WALBufMappingLock while reading the needed WAL buffer page itself
guarantees that no one else initializes it/makes it ready for next use
in AdvanceXLInsertBuffer(). The checks that we have for page header
(xlp_magic, xlp_pageaddr and xlp_tli) in the patch are enough for us
to ensure that we're not reading a page that got just initialized. The
callers will anyway perform extensive checks on page and record in
XLogReaderValidatePageHeader() and ValidXLogRecordHeader()
respectively. If any such failures occur after reading WAL from WAL
buffers, then that must be treated as a bug IMO. Hence, I don't think
we need to do the above.
Also, I do not like that XLogReadFromBuffers() uses 3 bools
(hit/partial hit/miss); instead we can use an enum or some tristate
variable, which I think will be cleaner.
Yeah, that seems more verbose; all that information can be deduced
from the requested bytes and the read bytes. I've done so in the v2
patch.
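The caller side then reduces to something like this (as in the v2
patch):

    if (count == read_bytes)
        return true;            /* full hit: everything came from buffers */
    else if (read_bytes > 0)
    {
        buf += read_bytes;      /* partial hit: advance the state and */
        startptr += read_bytes; /* read the rest from the WAL file */
        count -= read_bytes;
    }
    /* read_bytes == 0 is a miss: fall through to the file read. */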
Please review the attached v2 patch further.
I'm also attaching two helper patches (as .txt files) for testing;
they basically add WAL read stats:
USE-ON-HEAD-Collect-WAL-read-from-file-stats.txt - apply on HEAD and
monitor pg_stat_replication for per-walsender WAL-read-from-file
stats.
USE-ON-PATCH-Collect-WAL-read-from-buffers-and-file-stats.txt - apply
on the v2 patch and monitor pg_stat_replication for per-walsender
WAL-read-from-buffers and WAL-read-from-file stats.
[1]: /messages/by-id/CALj2ACXUbvON86vgwTkum8ab3bf1=HkMxQ5hZJZS3ZcJn8NEXQ@mail.gmail.com
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
USE-ON-HEAD-Collect-WAL-read-from-file-stats.txt (text/plain)
From 6517e50f482f88ea5185609ff4dcf0e0256475d5 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Mon, 26 Dec 2022 08:14:11 +0000
Subject: [PATCH v2] Collect WAL read from file stats for WAL senders
---
doc/src/sgml/monitoring.sgml | 31 +++++++++++
src/backend/access/transam/xlogreader.c | 33 ++++++++++--
src/backend/access/transam/xlogutils.c | 2 +-
src/backend/catalog/system_views.sql | 5 +-
src/backend/replication/walsender.c | 58 +++++++++++++++++++--
src/bin/pg_waldump/pg_waldump.c | 2 +-
src/include/access/xlogreader.h | 21 ++++++--
src/include/catalog/pg_proc.dat | 6 +--
src/include/replication/walsender_private.h | 4 ++
src/test/regress/expected/rules.out | 7 ++-
10 files changed, 151 insertions(+), 18 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 363b183e5f..fdf4c7d774 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2615,6 +2615,37 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
Send time of last reply message received from standby server
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>wal_read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times WAL data is read from disk
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>wal_read_bytes</structfield> <type>numeric</type>
+ </para>
+ <para>
+ Total amount of WAL read from disk in bytes
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>wal_read_time</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total amount of time spent reading WAL from disk via
+ <function>WALRead</function> request, in milliseconds
+ (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
+ otherwise zero).
+ </para></entry>
+ </row>
+
</tbody>
</tgroup>
</table>
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a38a80e049..7453724a07 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -31,6 +31,7 @@
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
#include "common/pg_lzcompress.h"
+#include "portability/instr_time.h"
#include "replication/origin.h"
#ifndef FRONTEND
@@ -1489,9 +1490,9 @@ err:
* WAL buffers when possible.
*/
bool
-WALRead(XLogReaderState *state,
- char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
- WALReadError *errinfo)
+WALRead(XLogReaderState *state, char *buf, XLogRecPtr startptr, Size count,
+ TimeLineID tli, WALReadError *errinfo, WALReadStats *stats,
+ bool capture_wal_io_timing)
{
char *p;
XLogRecPtr recptr;
@@ -1506,6 +1507,7 @@ WALRead(XLogReaderState *state,
uint32 startoff;
int segbytes;
int readbytes;
+ instr_time start;
startoff = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
@@ -1540,6 +1542,10 @@ WALRead(XLogReaderState *state,
else
segbytes = nbytes;
+ /* Measure I/O timing to read WAL data if requested by the caller. */
+ if (stats != NULL && capture_wal_io_timing)
+ INSTR_TIME_SET_CURRENT(start);
+
#ifndef FRONTEND
pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
#endif
@@ -1552,6 +1558,27 @@ WALRead(XLogReaderState *state,
pgstat_report_wait_end();
#endif
+ /* Collect I/O stats if requested by the caller. */
+ if (stats != NULL)
+ {
+ /* Increment the number of times WAL is read from disk. */
+ stats->wal_read++;
+
+ /* Collect bytes read. */
+ if (readbytes > 0)
+ stats->wal_read_bytes += readbytes;
+
+ /* Increment the I/O timing. */
+ if (capture_wal_io_timing)
+ {
+ instr_time duration;
+
+ INSTR_TIME_SET_CURRENT(duration);
+ INSTR_TIME_SUBTRACT(duration, start);
+ stats->wal_read_time += INSTR_TIME_GET_MICROSEC(duration);
+ }
+ }
+
if (readbytes <= 0)
{
errinfo->wre_errno = errno;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 563cba258d..372de2c7d8 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -1027,7 +1027,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
* zero-padded up to the page boundary if it's incomplete.
*/
if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
- &errinfo))
+ &errinfo, NULL, false))
WALReadRaiseError(&errinfo);
/* number of valid bytes in the buffer */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..b47f44a852 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -892,7 +892,10 @@ CREATE VIEW pg_stat_replication AS
W.replay_lag,
W.sync_priority,
W.sync_state,
- W.reply_time
+ W.reply_time,
+ W.wal_read,
+ W.wal_read_bytes,
+ W.wal_read_time
FROM pg_stat_get_activity(NULL) AS S
JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c11bb3716f..fa02e327f2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -259,7 +259,7 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
TimeLineID *tli_p);
-
+static void WalSndAccumulateWalReadStats(WALReadStats *stats);
/* Initialize walsender process before entering the main command loop */
void
@@ -907,6 +907,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
WALReadError errinfo;
XLogSegNo segno;
TimeLineID currTLI = GetWALInsertionTimeLine();
+ WALReadStats stats;
/*
* Since logical decoding is only permitted on a primary server, we know
@@ -932,6 +933,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
else
count = flushptr - targetPagePtr; /* part of the page available */
+ MemSet(&stats, 0, sizeof(WALReadStats));
+
/* now actually read the data, we know it's there */
if (!WALRead(state,
cur_page,
@@ -940,9 +943,13 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
state->seg.ws_tli, /* Pass the current TLI because only
* WalSndSegmentOpen controls whether new
* TLI is needed. */
- &errinfo))
+ &errinfo,
+ &stats,
+ track_wal_io_timing))
WALReadRaiseError(&errinfo);
+ WalSndAccumulateWalReadStats(&stats);
+
/*
* After reading into the buffer, check that what we read was valid. We do
* this after reading, because even though the segment was present when we
@@ -2610,6 +2617,9 @@ InitWalSenderSlot(void)
walsnd->sync_standby_priority = 0;
walsnd->latch = &MyProc->procLatch;
walsnd->replyTime = 0;
+ walsnd->wal_read_stats.wal_read = 0;
+ walsnd->wal_read_stats.wal_read_bytes = 0;
+ walsnd->wal_read_stats.wal_read_time = 0;
SpinLockRelease(&walsnd->mutex);
/* don't need the lock anymore */
MyWalSnd = (WalSnd *) walsnd;
@@ -2730,6 +2740,7 @@ XLogSendPhysical(void)
Size nbytes;
XLogSegNo segno;
WALReadError errinfo;
+ WALReadStats stats;
/* If requested switch the WAL sender to the stopping state. */
if (got_STOPPING)
@@ -2945,6 +2956,8 @@ XLogSendPhysical(void)
enlargeStringInfo(&output_message, nbytes);
retry:
+ MemSet(&stats, 0, sizeof(WALReadStats));
+
if (!WALRead(xlogreader,
&output_message.data[output_message.len],
startptr,
@@ -2952,9 +2965,13 @@ retry:
xlogreader->seg.ws_tli, /* Pass the current TLI because
* only WalSndSegmentOpen controls
* whether new TLI is needed. */
- &errinfo))
+ &errinfo,
+ &stats,
+ track_wal_io_timing))
WALReadRaiseError(&errinfo);
+ WalSndAccumulateWalReadStats(&stats);
+
/* See logical_read_xlog_page(). */
XLByteToSeg(startptr, segno, xlogreader->segcxt.ws_segsize);
CheckXLogRemoved(segno, xlogreader->seg.ws_tli);
@@ -3458,7 +3475,7 @@ offset_to_interval(TimeOffset offset)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 12
+#define PG_STAT_GET_WAL_SENDERS_COLS 15
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
SyncRepStandbyData *sync_standbys;
int num_standbys;
@@ -3487,9 +3504,13 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
WalSndState state;
TimestampTz replyTime;
bool is_sync_standby;
+ int64 wal_read;
+ uint64 wal_read_bytes;
+ int64 wal_read_time;
Datum values[PG_STAT_GET_WAL_SENDERS_COLS];
bool nulls[PG_STAT_GET_WAL_SENDERS_COLS] = {0};
int j;
+ char buf[256];
/* Collect data from shared memory */
SpinLockAcquire(&walsnd->mutex);
@@ -3509,6 +3530,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
applyLag = walsnd->applyLag;
priority = walsnd->sync_standby_priority;
replyTime = walsnd->replyTime;
+ wal_read = walsnd->wal_read_stats.wal_read;
+ wal_read_bytes = walsnd->wal_read_stats.wal_read_bytes;
+ wal_read_time = walsnd->wal_read_stats.wal_read_time;
SpinLockRelease(&walsnd->mutex);
/*
@@ -3605,6 +3629,18 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
nulls[11] = true;
else
values[11] = TimestampTzGetDatum(replyTime);
+
+ values[12] = Int64GetDatum(wal_read);
+
+ /* Convert to numeric. */
+ snprintf(buf, sizeof buf, UINT64_FORMAT, wal_read_bytes);
+ values[13] = DirectFunctionCall3(numeric_in,
+ CStringGetDatum(buf),
+ ObjectIdGetDatum(0),
+ Int32GetDatum(-1));
+
+ /* Convert counter from microsec to millisec for display. */
+ values[14] = Float8GetDatum(((double) wal_read_time) / 1000.0);
}
tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
@@ -3849,3 +3885,17 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
Assert(time != 0);
return now - time;
}
+
+/*
+ * Function to accumulate WAL Read stats for WAL sender.
+ */
+static void
+WalSndAccumulateWalReadStats(WALReadStats *stats)
+{
+ /* Collect I/O stats for walsender. */
+ SpinLockAcquire(&MyWalSnd->mutex);
+ MyWalSnd->wal_read_stats.wal_read += stats->wal_read;
+ MyWalSnd->wal_read_stats.wal_read_bytes += stats->wal_read_bytes;
+ MyWalSnd->wal_read_stats.wal_read_time += stats->wal_read_time;
+ SpinLockRelease(&MyWalSnd->mutex);
+}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 9993378ca5..698ce1e9f7 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -364,7 +364,7 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
}
if (!WALRead(state, readBuff, targetPagePtr, count, private->timeline,
- &errinfo))
+ &errinfo, NULL, false))
{
WALOpenSegment *seg = &errinfo.wre_seg;
char fname[MAXPGPATH];
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index e87f91316a..26a2c975de 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -389,9 +389,24 @@ typedef struct WALReadError
WALOpenSegment wre_seg; /* Segment we tried to read from. */
} WALReadError;
-extern bool WALRead(XLogReaderState *state,
- char *buf, XLogRecPtr startptr, Size count,
- TimeLineID tli, WALReadError *errinfo);
+/*
+ * WAL read stats from WALRead that the callers can use.
+ */
+typedef struct WALReadStats
+{
+ /* Number of times WAL read from disk. */
+ int64 wal_read;
+
+ /* Total amount of WAL read from disk in bytes. */
+ uint64 wal_read_bytes;
+
+ /* Total amount of time spent reading WAL from disk. */
+ int64 wal_read_time;
+} WALReadStats;
+
+extern bool WALRead(XLogReaderState *state, char *buf, XLogRecPtr startptr,
+ Size count, TimeLineID tli, WALReadError *errinfo,
+ WALReadStats *stats, bool capture_wal_io_timing);
/* Functions for decoding an XLogRecord */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7056c95371..18320cf846 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5391,9 +5391,9 @@
proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
proretset => 't', provolatile => 's', proparallel => 'r',
prorettype => 'record', proargtypes => '',
- proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz}',
- proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
- proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time}',
+ proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,numeric,float8}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,wal_read,wal_read_bytes,wal_read_time}',
prosrc => 'pg_stat_get_wal_senders' },
{ oid => '3317', descr => 'statistics: information about WAL receiver',
proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 7897c74589..35413ea0d2 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -13,6 +13,7 @@
#define _WALSENDER_PRIVATE_H
#include "access/xlog.h"
+#include "access/xlogreader.h"
#include "nodes/nodes.h"
#include "replication/syncrep.h"
#include "storage/latch.h"
@@ -78,6 +79,9 @@ typedef struct WalSnd
* Timestamp of the last message received from standby.
*/
TimestampTz replyTime;
+
+ /* WAL read stats for walsender. */
+ WALReadStats wal_read_stats;
} WalSnd;
extern PGDLLIMPORT WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..fd9d298e79 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2054,9 +2054,12 @@ pg_stat_replication| SELECT s.pid,
w.replay_lag,
w.sync_priority,
w.sync_state,
- w.reply_time
+ w.reply_time,
+ w.wal_read,
+ w.wal_read_bytes,
+ w.wal_read_time
FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
- JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
+ JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, wal_read, wal_read_bytes, wal_read_time) ON ((s.pid = w.pid)))
LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
pg_stat_replication_slots| SELECT s.slot_name,
s.spill_txns,
--
2.34.1
v2-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch (application/octet-stream)
From 047c58df6aaae586efe58b6a4068b17f25976b0a Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Mon, 26 Dec 2022 08:36:28 +0000
Subject: [PATCH v2] Improve WALRead() to suck data directly from WAL buffers
when possible
---
src/backend/access/transam/xlog.c | 154 ++++++++++++++++++++++++
src/backend/access/transam/xlogreader.c | 47 +++++++-
src/include/access/xlog.h | 6 +
3 files changed, 205 insertions(+), 2 deletions(-)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 91473b00d9..c3138493be 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -689,6 +689,7 @@ static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
XLogRecPtr *PrevPtr);
static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
+static char *GetXLogBufferForRead(XLogRecPtr ptr, TimeLineID tli, char *page);
static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
@@ -1639,6 +1640,159 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
return cachedPos + ptr % XLOG_BLCKSZ;
}
+/*
+ * Get the WAL buffer page containing the passed-in WAL record and return the
+ * record's location within that buffer page.
+ */
+static char *
+GetXLogBufferForRead(XLogRecPtr ptr, TimeLineID tli, char *page)
+{
+ XLogRecPtr expectedEndPtr;
+ XLogRecPtr endptr;
+ int idx;
+ char *recptr = NULL;
+
+ idx = XLogRecPtrToBufIdx(ptr);
+ expectedEndPtr = ptr;
+ expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+ /*
+ * Try to acquire WALBufMappingLock in shared mode so that the other
+ * concurrent WAL readers are also allowed. We try to do as little work as
+ * possible while holding the lock as it might impact concurrent WAL
+ * writers.
+ *
+ * If we cannot immediately acquire the lock, meaning the lock was busy,
+ * then exit quickly to not cause any contention. The caller can then
+ * fall back to reading WAL from the WAL file.
+ */
+ if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+ return recptr;
+
+ /*
+ * Holding WALBufMappingLock ensures inserters don't overwrite this value
+ * while we are reading it.
+ */
+ endptr = XLogCtl->xlblocks[idx];
+
+ if (expectedEndPtr == endptr)
+ {
+ XLogPageHeader phdr;
+
+ /*
+ * We have found the WAL buffer page holding the given LSN. Read from a
+ * pointer to the right offset within the page.
+ */
+ memcpy(page, (XLogCtl->pages + idx * (Size) XLOG_BLCKSZ),
+ (Size) XLOG_BLCKSZ);
+
+ /*
+ * Release the lock as early as possible to avoid creating any possible
+ * contention.
+ */
+ LWLockRelease(WALBufMappingLock);
+
+ /*
+ * The fact that we acquire WALBufMappingLock while reading the WAL
+ * buffer page itself guarantees that no one else initializes it or
+ * makes it ready for next use in AdvanceXLInsertBuffer().
+ *
+ * However, we perform basic page header checks for ensuring that we
+ * are not reading a page that got just initialized. The callers will
+ * anyway perform extensive page-level and record-level checks.
+ */
+ phdr = (XLogPageHeader) page;
+
+ if (phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+ phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+ phdr->xlp_tli == tli)
+ {
+ /*
+ * Page looks valid, so return the page and the requested record's
+ * LSN.
+ */
+ recptr = page + ptr % XLOG_BLCKSZ;
+ }
+ }
+ else
+ {
+ /* We have found nothing. */
+ LWLockRelease(WALBufMappingLock);
+ }
+
+ return recptr;
+}
+
+/*
+ * When possible, read WAL starting at 'startptr' of size 'count' bytes from
+ * WAL buffers into buffer passed in by the caller 'buf'. Read as much WAL as
+ * possible from the WAL buffers; the caller will take care of reading the
+ * remaining WAL, if any, from the WAL files directly.
+ *
+ * This function sets read bytes to 'read_bytes'.
+ */
+void
+XLogReadFromBuffers(XLogRecPtr startptr,
+ TimeLineID tli,
+ Size count,
+ char *buf,
+ Size *read_bytes)
+{
+ XLogRecPtr ptr;
+ char *dst;
+ Size nbytes;
+
+ Assert(!XLogRecPtrIsInvalid(startptr));
+ Assert(count > 0);
+ Assert(startptr <= GetFlushRecPtr(NULL));
+ Assert(!RecoveryInProgress());
+
+ ptr = startptr;
+ nbytes = count;
+ dst = buf;
+ *read_bytes = 0;
+
+ while (nbytes > 0)
+ {
+ char page[XLOG_BLCKSZ] = {0};
+ char *recptr;
+
+ recptr = GetXLogBufferForRead(ptr, tli, page);
+
+ if (recptr == NULL)
+ break;
+
+ if ((recptr + nbytes) <= (page + XLOG_BLCKSZ))
+ {
+ /* All the bytes are in one page. */
+ memcpy(dst, recptr, nbytes);
+ dst += nbytes;
+ *read_bytes += nbytes;
+ ptr += nbytes;
+ nbytes = 0;
+ }
+ else if ((recptr + nbytes) > (page + XLOG_BLCKSZ))
+ {
+ /* Not all the bytes are in one page. */
+ Size bytes_remaining;
+
+ /*
+ * Compute the remaining bytes on the current page, copy them over
+ * to output buffer and move forward to read further.
+ */
+ bytes_remaining = XLOG_BLCKSZ - (recptr - page);
+ memcpy(dst, recptr, bytes_remaining);
+ dst += bytes_remaining;
+ nbytes -= bytes_remaining;
+ *read_bytes += bytes_remaining;
+ ptr += bytes_remaining;
+ }
+ }
+
+ elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+ *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
+}
+
/*
* Converts a "usable byte position" to XLogRecPtr. A usable byte position
* is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a38a80e049..4a2e7af169 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1485,8 +1485,7 @@ err:
* Returns true if succeeded, false if an error occurs, in which case
* 'errinfo' receives error details.
*
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers.
*/
bool
WALRead(XLogReaderState *state,
@@ -1497,6 +1496,50 @@ WALRead(XLogReaderState *state,
XLogRecPtr recptr;
Size nbytes;
+#ifndef FRONTEND
+ /* Frontend tools cannot read from WAL buffers. */
+ Size read_bytes;
+
+ /*
+ * When possible, read WAL from the WAL buffers. We skip this step and
+ * read from the WAL file the usual way either when the server is in
+ * recovery (standby mode, archive or crash recovery), in which case the
+ * WAL buffers are not used, or when the server is inserting on a
+ * timeline different from the one we are trying to read WAL from.
+ */
+ if (!RecoveryInProgress() &&
+ tli == GetWALInsertionTimeLine())
+ {
+ pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+ XLogReadFromBuffers(startptr, tli, count, buf, &read_bytes);
+ pgstat_report_wait_end();
+
+ /*
+ * Check if we have read fully (hit), partially (partial hit) or
+ * nothing (miss) from WAL buffers. If we have read either partially or
+ * nothing, then continue to read the remaining bytes the usual way,
+ * that is, read from WAL file.
+ */
+ if (count == read_bytes)
+ {
+ /* Buffer hit, so return. */
+ return true;
+ }
+ else if (read_bytes > 0 && count > read_bytes)
+ {
+ /*
+ * Partial buffer hit, so advance the state past the bytes already
+ * read and continue with the usual file read for the rest.
+ */
+ buf += read_bytes;
+ startptr += read_bytes;
+ count -= read_bytes;
+ }
+
+ /* Buffer miss, i.e., read_bytes == 0; continue with the file read. */
+ }
+#endif /* FRONTEND */
+
p = buf;
recptr = startptr;
nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1fbd48fbda..f4e1c46b23 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -247,6 +247,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
extern void SetWalWriterSleeping(bool sleeping);
+extern void XLogReadFromBuffers(XLogRecPtr startptr,
+ TimeLineID tli,
+ Size count,
+ char *buf,
+ Size *read_bytes);
+
/*
* Routines used by xlogrecovery.c to call back into xlog.c during recovery.
*/
--
2.34.1
Attachment: USE-ON-PATCH-Collect-WAL-read-from-buffers-and-file-stats.txt
From f90dfcbd1968280feec6d116568697225854ac40 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Mon, 26 Dec 2022 08:13:07 +0000
Subject: [PATCH v2] Collect WAL read from buffers and file stats for WAL
senders
---
doc/src/sgml/monitoring.sgml | 61 +++++++++++++++
src/backend/access/transam/xlogreader.c | 56 +++++++++++++-
src/backend/access/transam/xlogutils.c | 2 +-
src/backend/catalog/system_views.sql | 8 +-
src/backend/replication/walsender.c | 85 ++++++++++++++++++++-
src/bin/pg_waldump/pg_waldump.c | 2 +-
src/include/access/xlogreader.h | 30 +++++++-
src/include/catalog/pg_proc.dat | 6 +-
src/include/replication/walsender_private.h | 4 +
src/test/regress/expected/rules.out | 10 ++-
10 files changed, 246 insertions(+), 18 deletions(-)
diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 363b183e5f..239e0b0db9 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2615,6 +2615,67 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
Send time of last reply message received from standby server
</para></entry>
</row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>wal_read</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times WAL data is read from disk
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>wal_read_bytes</structfield> <type>numeric</type>
+ </para>
+ <para>
+ Total amount of WAL read from disk in bytes
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>wal_read_time</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total amount of time spent reading WAL from disk via
+ <function>WALRead</function> request, in milliseconds
+ (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
+ otherwise zero).
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>wal_read_buffers</structfield> <type>bigint</type>
+ </para>
+ <para>
+ Number of times WAL data is read from WAL buffers
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>wal_read_bytes_buffers</structfield> <type>numeric</type>
+ </para>
+ <para>
+ Total amount of WAL read from WAL buffers in bytes
+ </para></entry>
+ </row>
+
+ <row>
+ <entry role="catalog_table_entry"><para role="column_definition">
+ <structfield>wal_read_time_buffers</structfield> <type>double precision</type>
+ </para>
+ <para>
+ Total amount of time spent reading WAL from WAL buffers via
+ <function>WALRead</function> request, in milliseconds
+ (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
+ otherwise zero).
+ </para></entry>
+ </row>
+
</tbody>
</tgroup>
</table>
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 4a2e7af169..b9dfd4fde7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -31,6 +31,7 @@
#include "access/xlogrecord.h"
#include "catalog/pg_control.h"
#include "common/pg_lzcompress.h"
+#include "portability/instr_time.h"
#include "replication/origin.h"
#ifndef FRONTEND
@@ -1488,9 +1489,9 @@ err:
* When possible, this function reads data directly from WAL buffers.
*/
bool
-WALRead(XLogReaderState *state,
- char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
- WALReadError *errinfo)
+WALRead(XLogReaderState *state, char *buf, XLogRecPtr startptr, Size count,
+ TimeLineID tli, WALReadError *errinfo, WALReadStats *stats,
+ bool capture_wal_io_timing)
{
char *p;
XLogRecPtr recptr;
@@ -1510,10 +1511,33 @@ WALRead(XLogReaderState *state,
if (!RecoveryInProgress() &&
tli == GetWALInsertionTimeLine())
{
+ instr_time start;
+
+ /* Measure I/O timing to read WAL data if requested by the caller. */
+ if (stats != NULL && capture_wal_io_timing)
+ INSTR_TIME_SET_CURRENT(start);
+
pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
XLogReadFromBuffers(startptr, tli, count, buf, &read_bytes);
pgstat_report_wait_end();
+ /* Collect I/O stats if requested by the caller. */
+ if (stats != NULL && read_bytes > 0)
+ {
+ stats->wal_read_buffers++;
+ stats->wal_read_bytes_buffers += read_bytes;
+
+ /* Increment the I/O timing. */
+ if (capture_wal_io_timing)
+ {
+ instr_time duration;
+
+ INSTR_TIME_SET_CURRENT(duration);
+ INSTR_TIME_SUBTRACT(duration, start);
+ stats->wal_read_time_buffers += INSTR_TIME_GET_MICROSEC(duration);
+ }
+ }
+
/*
* Check if we have read fully (hit), partially (partial hit) or
* nothing (miss) from WAL buffers. If we have read either partially or
@@ -1549,6 +1573,7 @@ WALRead(XLogReaderState *state,
uint32 startoff;
int segbytes;
int readbytes;
+ instr_time start;
startoff = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
@@ -1583,6 +1608,10 @@ WALRead(XLogReaderState *state,
else
segbytes = nbytes;
+ /* Measure I/O timing to read WAL data if requested by the caller. */
+ if (stats != NULL && capture_wal_io_timing)
+ INSTR_TIME_SET_CURRENT(start);
+
#ifndef FRONTEND
pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
#endif
@@ -1595,6 +1624,27 @@ WALRead(XLogReaderState *state,
pgstat_report_wait_end();
#endif
+ /* Collect I/O stats if requested by the caller. */
+ if (stats != NULL)
+ {
+ /* Increment the number of times WAL is read from disk. */
+ stats->wal_read++;
+
+ /* Collect bytes read. */
+ if (readbytes > 0)
+ stats->wal_read_bytes += readbytes;
+
+ /* Increment the I/O timing. */
+ if (capture_wal_io_timing)
+ {
+ instr_time duration;
+
+ INSTR_TIME_SET_CURRENT(duration);
+ INSTR_TIME_SUBTRACT(duration, start);
+ stats->wal_read_time += INSTR_TIME_GET_MICROSEC(duration);
+ }
+ }
+
if (readbytes <= 0)
{
errinfo->wre_errno = errno;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 563cba258d..372de2c7d8 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -1027,7 +1027,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
* zero-padded up to the page boundary if it's incomplete.
*/
if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
- &errinfo))
+ &errinfo, NULL, false))
WALReadRaiseError(&errinfo);
/* number of valid bytes in the buffer */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..bf6315df27 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -892,7 +892,13 @@ CREATE VIEW pg_stat_replication AS
W.replay_lag,
W.sync_priority,
W.sync_state,
- W.reply_time
+ W.reply_time,
+ W.wal_read,
+ W.wal_read_bytes,
+ W.wal_read_time,
+ W.wal_read_buffers,
+ W.wal_read_bytes_buffers,
+ W.wal_read_time_buffers
FROM pg_stat_get_activity(NULL) AS S
JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c11bb3716f..d3393b2b63 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -259,7 +259,7 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
TimeLineID *tli_p);
-
+static void WalSndAccumulateWalReadStats(WALReadStats *stats);
/* Initialize walsender process before entering the main command loop */
void
@@ -907,6 +907,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
WALReadError errinfo;
XLogSegNo segno;
TimeLineID currTLI = GetWALInsertionTimeLine();
+ WALReadStats stats;
/*
* Since logical decoding is only permitted on a primary server, we know
@@ -932,6 +933,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
else
count = flushptr - targetPagePtr; /* part of the page available */
+ MemSet(&stats, 0, sizeof(WALReadStats));
+
/* now actually read the data, we know it's there */
if (!WALRead(state,
cur_page,
@@ -940,9 +943,13 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
state->seg.ws_tli, /* Pass the current TLI because only
* WalSndSegmentOpen controls whether new
* TLI is needed. */
- &errinfo))
+ &errinfo,
+ &stats,
+ track_wal_io_timing))
WALReadRaiseError(&errinfo);
+ WalSndAccumulateWalReadStats(&stats);
+
/*
* After reading into the buffer, check that what we read was valid. We do
* this after reading, because even though the segment was present when we
@@ -2610,6 +2617,12 @@ InitWalSenderSlot(void)
walsnd->sync_standby_priority = 0;
walsnd->latch = &MyProc->procLatch;
walsnd->replyTime = 0;
+ walsnd->wal_read_stats.wal_read = 0;
+ walsnd->wal_read_stats.wal_read_bytes = 0;
+ walsnd->wal_read_stats.wal_read_time = 0;
+ walsnd->wal_read_stats.wal_read_buffers = 0;
+ walsnd->wal_read_stats.wal_read_bytes_buffers = 0;
+ walsnd->wal_read_stats.wal_read_time_buffers = 0;
SpinLockRelease(&walsnd->mutex);
/* don't need the lock anymore */
MyWalSnd = (WalSnd *) walsnd;
@@ -2730,6 +2743,7 @@ XLogSendPhysical(void)
Size nbytes;
XLogSegNo segno;
WALReadError errinfo;
+ WALReadStats stats;
/* If requested switch the WAL sender to the stopping state. */
if (got_STOPPING)
@@ -2945,6 +2959,8 @@ XLogSendPhysical(void)
enlargeStringInfo(&output_message, nbytes);
retry:
+ MemSet(&stats, 0, sizeof(WALReadStats));
+
if (!WALRead(xlogreader,
&output_message.data[output_message.len],
startptr,
@@ -2952,9 +2968,13 @@ retry:
xlogreader->seg.ws_tli, /* Pass the current TLI because
* only WalSndSegmentOpen controls
* whether new TLI is needed. */
- &errinfo))
+ &errinfo,
+ &stats,
+ track_wal_io_timing))
WALReadRaiseError(&errinfo);
+ WalSndAccumulateWalReadStats(&stats);
+
/* See logical_read_xlog_page(). */
XLByteToSeg(startptr, segno, xlogreader->segcxt.ws_segsize);
CheckXLogRemoved(segno, xlogreader->seg.ws_tli);
@@ -3458,7 +3478,7 @@ offset_to_interval(TimeOffset offset)
Datum
pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
{
-#define PG_STAT_GET_WAL_SENDERS_COLS 12
+#define PG_STAT_GET_WAL_SENDERS_COLS 18
ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
SyncRepStandbyData *sync_standbys;
int num_standbys;
@@ -3487,9 +3507,16 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
WalSndState state;
TimestampTz replyTime;
bool is_sync_standby;
+ int64 wal_read;
+ uint64 wal_read_bytes;
+ int64 wal_read_time;
+ int64 wal_read_buffers;
+ uint64 wal_read_bytes_buffers;
+ int64 wal_read_time_buffers;
Datum values[PG_STAT_GET_WAL_SENDERS_COLS];
bool nulls[PG_STAT_GET_WAL_SENDERS_COLS] = {0};
int j;
+ char buf[256];
/* Collect data from shared memory */
SpinLockAcquire(&walsnd->mutex);
@@ -3509,6 +3536,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
applyLag = walsnd->applyLag;
priority = walsnd->sync_standby_priority;
replyTime = walsnd->replyTime;
+ wal_read = walsnd->wal_read_stats.wal_read;
+ wal_read_bytes = walsnd->wal_read_stats.wal_read_bytes;
+ wal_read_time = walsnd->wal_read_stats.wal_read_time;
+ wal_read_buffers = walsnd->wal_read_stats.wal_read_buffers;
+ wal_read_bytes_buffers = walsnd->wal_read_stats.wal_read_bytes_buffers;
+ wal_read_time_buffers = walsnd->wal_read_stats.wal_read_time_buffers;
SpinLockRelease(&walsnd->mutex);
/*
@@ -3605,6 +3638,31 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
nulls[11] = true;
else
values[11] = TimestampTzGetDatum(replyTime);
+
+ values[12] = Int64GetDatum(wal_read);
+
+ /* Convert to numeric. */
+ snprintf(buf, sizeof buf, UINT64_FORMAT, wal_read_bytes);
+ values[13] = DirectFunctionCall3(numeric_in,
+ CStringGetDatum(buf),
+ ObjectIdGetDatum(0),
+ Int32GetDatum(-1));
+
+ /* Convert counter from microsec to millisec for display. */
+ values[14] = Float8GetDatum(((double) wal_read_time) / 1000.0);
+
+ values[15] = Int64GetDatum(wal_read_buffers);
+
+ /* Convert to numeric. */
+ MemSet(buf, '\0', sizeof buf);
+ snprintf(buf, sizeof buf, UINT64_FORMAT, wal_read_bytes_buffers);
+ values[16] = DirectFunctionCall3(numeric_in,
+ CStringGetDatum(buf),
+ ObjectIdGetDatum(0),
+ Int32GetDatum(-1));
+
+ /* Convert counter from microsec to millisec for display. */
+ values[17] = Float8GetDatum(((double) wal_read_time_buffers) / 1000.0);
}
tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
@@ -3849,3 +3907,22 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
Assert(time != 0);
return now - time;
}
+
+/*
+ * Accumulate WAL read stats for the WAL sender.
+ */
+static void
+WalSndAccumulateWalReadStats(WALReadStats *stats)
+{
+ /* Collect I/O stats for walsender. */
+ SpinLockAcquire(&MyWalSnd->mutex);
+ MyWalSnd->wal_read_stats.wal_read += stats->wal_read;
+ MyWalSnd->wal_read_stats.wal_read_bytes += stats->wal_read_bytes;
+ MyWalSnd->wal_read_stats.wal_read_time += stats->wal_read_time;
+ MyWalSnd->wal_read_stats.wal_read_buffers += stats->wal_read_buffers;
+ MyWalSnd->wal_read_stats.wal_read_bytes_buffers +=
+ stats->wal_read_bytes_buffers;
+ MyWalSnd->wal_read_stats.wal_read_time_buffers +=
+ stats->wal_read_time_buffers;
+ SpinLockRelease(&MyWalSnd->mutex);
+}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 9993378ca5..698ce1e9f7 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -364,7 +364,7 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
}
if (!WALRead(state, readBuff, targetPagePtr, count, private->timeline,
- &errinfo))
+ &errinfo, NULL, false))
{
WALOpenSegment *seg = &errinfo.wre_seg;
char fname[MAXPGPATH];
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index e87f91316a..9287114779 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -389,9 +389,33 @@ typedef struct WALReadError
WALOpenSegment wre_seg; /* Segment we tried to read from. */
} WALReadError;
-extern bool WALRead(XLogReaderState *state,
- char *buf, XLogRecPtr startptr, Size count,
- TimeLineID tli, WALReadError *errinfo);
+/*
+ * WAL read stats from WALRead that the callers can use.
+ */
+typedef struct WALReadStats
+{
+ /* Number of times WAL read from disk. */
+ int64 wal_read;
+
+ /* Total amount of WAL read from disk in bytes. */
+ uint64 wal_read_bytes;
+
+ /* Total amount of time spent reading WAL from disk. */
+ int64 wal_read_time;
+
+ /* Number of times WAL read from WAL buffers. */
+ int64 wal_read_buffers;
+
+ /* Total amount of WAL read from WAL buffers in bytes. */
+ uint64 wal_read_bytes_buffers;
+
+ /* Total amount of time spent reading WAL from WAL buffers. */
+ int64 wal_read_time_buffers;
+} WALReadStats;
+
+extern bool WALRead(XLogReaderState *state, char *buf, XLogRecPtr startptr,
+ Size count, TimeLineID tli, WALReadError *errinfo,
+ WALReadStats *stats, bool capture_wal_io_timing);
/* Functions for decoding an XLogRecord */
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7056c95371..706a005c2b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5391,9 +5391,9 @@
proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
proretset => 't', provolatile => 's', proparallel => 'r',
prorettype => 'record', proargtypes => '',
- proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz}',
- proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
- proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time}',
+ proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,numeric,float8,int8,numeric,float8}',
+ proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+ proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,wal_read,wal_read_bytes,wal_read_time,wal_read_buffers,wal_read_bytes_buffers,wal_read_time_buffers}',
prosrc => 'pg_stat_get_wal_senders' },
{ oid => '3317', descr => 'statistics: information about WAL receiver',
proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 7897c74589..35413ea0d2 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -13,6 +13,7 @@
#define _WALSENDER_PRIVATE_H
#include "access/xlog.h"
+#include "access/xlogreader.h"
#include "nodes/nodes.h"
#include "replication/syncrep.h"
#include "storage/latch.h"
@@ -78,6 +79,9 @@ typedef struct WalSnd
* Timestamp of the last message received from standby.
*/
TimestampTz replyTime;
+
+ /* WAL read stats for walsender. */
+ WALReadStats wal_read_stats;
} WalSnd;
extern PGDLLIMPORT WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..6ae65981c2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2054,9 +2054,15 @@ pg_stat_replication| SELECT s.pid,
w.replay_lag,
w.sync_priority,
w.sync_state,
- w.reply_time
+ w.reply_time,
+ w.wal_read,
+ w.wal_read_bytes,
+ w.wal_read_time,
+ w.wal_read_buffers,
+ w.wal_read_bytes_buffers,
+ w.wal_read_time_buffers
FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
- JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
+ JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, wal_read, wal_read_bytes, wal_read_time, wal_read_buffers, wal_read_bytes_buffers, wal_read_time_buffers) ON ((s.pid = w.pid)))
LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
pg_stat_replication_slots| SELECT s.slot_name,
s.spill_txns,
--
2.34.1
On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
Please review the attached v2 patch further.
I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?
Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?
--
Jeff Davis
PostgreSQL Contributor Team - AWS
Hi,
On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
Please review the attached v2 patch further.
I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?
Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?
One benefit would be that it'd make it more realistic to use direct IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a win.
Greetings,
Andres Freund
On Sat, Jan 14, 2023 at 12:34 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
Please review the attached v2 patch further.
I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?
Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?
One benefit would be that it'd make it more realistic to use direct IO for
WAL - for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a
win.
+1. Archive modules rely on reading the WAL files for PITR. Direct IO for
WAL requires reading these files from disk anyway for archival. However,
archiving using utilities like pg_receivewal can take advantage of this
patch together with direct IO for WAL.
Thanks,
Satya
Hi,
On 2023-01-14 12:34:03 -0800, Andres Freund wrote:
On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
Please review the attached v2 patch further.
I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?
Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?
One benefit would be that it'd make it more realistic to use direct IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a win.
Satya's email just now reminded me of another important reason:
Eventually we should add the ability to stream out WAL *before* it has locally
been written out and flushed. Obviously the relevant positions would have to
be noted in the relevant message in the streaming protocol, and we couldn't
generally allow standbys to apply that data yet.
That'd allow us to significantly reduce the overhead of synchronous
replication, because instead of commonly needing to send out all the pending
WAL at commit, we'd just need to send out the updated flush position. The
reason this would lower the overhead is that:
a) The reduced amount of data to be transferred reduces latency - it's easy to
accumulate a few TCP packets worth of data even in a single small OLTP
transaction
b) The remote side can start to write out data earlier
Of course this would require additional infrastructure on the receiver
side. E.g. some persistent state indicating up to where WAL is allowed to be
applied, to avoid the standby getting ahead of the primary, in case the
primary crash-restarts (or has more severe issues).
With a bit of work we could perform WAL replay on standby without waiting for
the fdatasync of the received WAL - that only needs to happen when a) we need
to confirm a flush position to the primary b) when we need to write back pages
from the buffer pool (and some other things).
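To sketch what that receiver-side guard might look like (all names here
are invented, nothing like this exists in any posted patch):

/* Hypothetical: a durably stored limit that the primary advances via a
 * new streaming-protocol message; replay must never go past it. */
static XLogRecPtr applyAllowedUpto = InvalidXLogRecPtr;

/* Called by the (hypothetical) protocol message handler. The new limit
 * would have to be persisted before replay is allowed to rely on it. */
static void
SetApplyAllowedUpto(XLogRecPtr limit)
{
    applyAllowedUpto = Max(applyAllowedUpto, limit);
}

/* Startup process: may we apply a record ending at endptr? */
static bool
MayApplyUpTo(XLogRecPtr endptr)
{
    return endptr <= applyAllowedUpto;
}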
Greetings,
Andres Freund
On Thu, Jan 26, 2023 at 2:45 AM Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2023-01-14 12:34:03 -0800, Andres Freund wrote:
On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:
On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:
Please review the attached v2 patch further.
I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?
Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?
One benefit would be that it'd make it more realistic to use direct IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a win.
Satya's email just now reminded me of another important reason:
Eventually we should add the ability to stream out WAL *before* it has locally
been written out and flushed. Obviously the relevant positions would have to
be noted in the relevant message in the streaming protocol, and we couldn't
generally allow standbys to apply that data yet.
That'd allow us to significantly reduce the overhead of synchronous
replication, because instead of commonly needing to send out all the pending
WAL at commit, we'd just need to send out the updated flush position. The
reason this would lower the overhead is that:
a) The reduced amount of data to be transferred reduces latency - it's easy to
accumulate a few TCP packets worth of data even in a single small OLTP
transaction
b) The remote side can start to write out data earlier
Of course this would require additional infrastructure on the receiver
side. E.g. some persistent state indicating up to where WAL is allowed to be
applied, to avoid the standby getting ahead of the primary, in case the
primary crash-restarts (or has more severe issues).
With a bit of work we could perform WAL replay on standby without waiting for
the fdatasync of the received WAL - that only needs to happen when a) we need
to confirm a flush position to the primary b) when we need to write back pages
from the buffer pool (and some other things).
Thanks Andres, Jeff and Satya for taking a look at the thread. Andres
is right, the eventual plan is to do a bunch of other stuff as
described above and we've discussed this in another thread (see
below). I would like to once again clarify motivation behind this
feature:
1. It enables WAL readers (callers of WALRead() - wal senders,
pg_walinspect etc.) to use WAL buffers as a first-level cache, which
might reduce the number of IOPS at peak load, especially when the pread()
results in a disk read (WAL isn't available in OS page cache). I had
earlier presented the buffer hit ratio/amount of pread() system calls
reduced with wal senders in the first email of this thread (95% of the
time wal senders are able to read from WAL buffers without impacting
anybody). Now, here are the results with the WAL DIO patch [1] - where
WAL pread() turns into a disk read, see the results [2] and attached
graph.
2. As Andres rightly mentioned, it helps WAL DIO; since there's no OS
page cache, using WAL buffers as read cache helps a lot. It is clearly
evident from my experiment with WAL DIO patch [1], see the results [2]
and attached graph. As expected, WAL DIO brings down the TPS, whereas
reading from WAL buffers, i.e. this patch, brings it back up.
3. As Andres rightly mentioned above, it enables flushing WAL in
parallel on primary and all standbys [3]. I haven't yet started work
on this, I will aim for PG 17.
4. It will make the work on - disallow async standbys or subscribers
getting ahead of the sync standbys [3] possible. I haven't yet started
work on this, I will aim for PG 17.
5. It implements the following TODO item specified near WALRead():
* XXX probably this should be improved to suck data directly from the
* WAL buffers when possible.
*/
bool
WALRead(XLogReaderState *state,
That said, this feature is separately reviewable and perhaps can go in
separately, as it has its own benefits.
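To make the new read path easy to review, here is the call pattern the
patch follows inside WALRead(), condensed (backend-only code, shown for
illustration rather than as a drop-in snippet):

Size        read_bytes = 0;

/* First, try the WAL buffers. */
XLogReadFromBuffers(startptr, tli, count, buf, &read_bytes);

if (read_bytes == count)
    return true;            /* full hit - no file read at all */

/* Partial hit or miss - read the remainder from the WAL file. */
buf += read_bytes;
startptr += read_bytes;
count -= read_bytes;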
[1]: /messages/by-id/CA+hUKGLmeyrDcUYAty90V_YTcoo5kAFfQjRQ-_1joS_=X7HztA@mail.gmail.com
[2]: Test case is an insert pgbench workload.
clients HEAD WAL DIO WAL DIO & WAL BUFFERS READ WAL BUFFERS READ
1 1404 1070 1424 1375
2 1487 796 1454 1517
4 3064 1743 3011 3019
8 6114 3556 6026 5954
16 11560 7051 12216 12132
32 23181 13079 23449 23561
64 43607 26983 43997 45636
128 80723 45169 81515 81911
256 110925 90185 107332 114046
512 119354 109817 110287 117506
768 112435 105795 106853 111605
1024 107554 105541 105942 109370
2048 88552 79024 80699 90555
4096 61323 54814 58704 61743
[3]: /messages/by-id/20220309020123.sneaoijlg3rszvst@alap3.anarazel.de
/messages/by-id/CALj2ACXCSM+sTR=5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w@mail.gmail.com
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
On Thu, Jan 26, 2023 at 2:33 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:
1. It enables WAL readers (callers of WALRead() - wal senders,
pg_walinspect etc.) to use WAL buffers as a first-level cache, which
might reduce the number of IOPS at peak load, especially when the pread()
results in a disk read (WAL isn't available in OS page cache). I had
earlier presented the buffer hit ratio/amount of pread() system calls
reduced with wal senders in the first email of this thread (95% of the
time wal senders are able to read from WAL buffers without impacting
anybody). Now, here are the results with the WAL DIO patch [1] - where
WAL pread() turns into a disk read, see the results [2] and attached
graph.
2. As Andres rightly mentioned, it helps WAL DIO; since there's no OS
page cache, using WAL buffers as read cache helps a lot. It is clearly
evident from my experiment with WAL DIO patch [1], see the results [2]
and attached graph. As expected, WAL DIO brings down the TPS, whereas
reading from WAL buffers, i.e. this patch, brings it back up.
[2] Test case is an insert pgbench workload.
clients HEAD WAL DIO WAL DIO & WAL BUFFERS READ WAL BUFFERS READ
1 1404 1070 1424 1375
2 1487 796 1454 1517
4 3064 1743 3011 3019
8 6114 3556 6026 5954
16 11560 7051 12216 12132
32 23181 13079 23449 23561
64 43607 26983 43997 45636
128 80723 45169 81515 81911
256 110925 90185 107332 114046
512 119354 109817 110287 117506
768 112435 105795 106853 111605
1024 107554 105541 105942 109370
2048 88552 79024 80699 90555
4096 61323 54814 58704 61743
If I'm understanding this result correctly, it seems to me that your
patch works well with the WAL DIO patch (WAL DIO vs. WAL DIO & WAL
BUFFERS READ), but there seems to be no visible performance gain with only
your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your
patch should be included in the WAL DIO patch rather than applying it
alone. Am I missing something?
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
Hi,
On 2023-01-27 14:24:51 +0900, Masahiko Sawada wrote:
If I'm understanding this result correctly, it seems to me that your
patch works well with the WAL DIO patch (WAL DIO vs. WAL DIO & WAL
BUFFERS READ), but there seems to be no visible performance gain with only
your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your
patch should be included in the WAL DIO patch rather than applying it
alone. Am I missing something?
We already support using DIO for WAL - it's just restricted in a way that
makes it practically not usable. And the reason for that is precisely that
walsenders need to read the WAL. See get_sync_bit():
/*
* Optimize writes by bypassing kernel cache with O_DIRECT when using
* O_SYNC and O_DSYNC. But only if archiving and streaming are disabled,
* otherwise the archive command or walsender process will read the WAL
* soon after writing it, which is guaranteed to cause a physical read if
* we bypassed the kernel cache. We also skip the
* posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
* reason.
*
* Never use O_DIRECT in walreceiver process for similar reasons; the WAL
* written by walreceiver is normally read by the startup process soon
* after it's written. Also, walreceiver performs unaligned writes, which
* don't work with O_DIRECT, so it is required for correctness too.
*/
if (!XLogIsNeeded() && !AmWalReceiverProcess())
o_direct_flag = PG_O_DIRECT;
Even if that weren't the case, splitting up bigger commits in incrementally
committable chunks is a good idea.
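Ignoring archiving for a moment: once walsenders can read recent WAL
from the WAL buffers, one could imagine narrowing that condition to
something like the following (hypothetical, not part of any posted
patch):

if (!AmWalReceiverProcess())
    o_direct_flag = PG_O_DIRECT;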
Greetings,
Andres Freund
On Fri, Jan 27, 2023 at 3:17 PM Andres Freund <andres@anarazel.de> wrote:
Hi,
We already support using DIO for WAL - it's just restricted in a way that
makes it practically not usable. And the reason for that is precisely that
walsenders need to read the WAL. See get_sync_bit().
Even if that weren't the case, splitting up bigger commits in incrementally
committable chunks is a good idea.
Agreed. I was wondering because the test results don't demonstrate the
first motivation of this patch, which is to improve performance by
reducing disk I/O and system calls even without the DIO patch. But it
makes sense to me that this patch is part of the DIO patch series.
I'd like to confirm whether there is any performance regression caused
by this patch in some cases, especially when not using DIO.
Regards,
--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com
On Fri, Jan 27, 2023 at 12:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
I'd like to confirm whether there is any performance regression caused
by this patch in some cases, especially when not using DIO.
Thanks. I ran some insert tests with a primary and one async standby.
Please see the numbers below and the attached graphs. I've not noticed
any regression; in fact, with the patch there's a slight improvement.
Note that there's no WAL DIO involved here.
test-case 1:
clients HEAD PATCHED
1 139 156
2 624 599
4 3113 3410
8 6194 6433
16 11255 11722
32 22455 21658
64 46072 47103
128 80255 85970
256 110067 111488
512 114043 118094
768 109588 111892
1024 106144 109361
2048 85808 90745
4096 55911 53755
test-case 2:
clients HEAD PATCHED
1 177 128
2 186 425
4 2114 2946
8 5835 5840
16 10654 11199
32 14071 13959
64 18092 17519
128 27298 28274
256 24600 24843
512 17139 19450
768 16778 20473
1024 18294 20209
2048 12898 13920
4096 6399 6815
test-case 3:
clients HEAD PATCHED
1 148 191
2 302 317
4 3415 3243
8 5864 6193
16 9573 10267
32 14069 15819
64 17424 18453
128 24493 29192
256 33180 38250
512 35568 36551
768 29731 30317
1024 32291 32124
2048 27964 28933
4096 13702 15034
[1]:
cat << EOF >> data/postgresql.conf
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/archived_wal/%f'
EOF
test-case 1:
./pgbench -i -s 300 -d postgres
./psql -d postgres -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT
pgbench_accounts_pkey;"
cat << EOF >> insert.sql
\set aid random(1, 10 * :scale)
\set delta random(1, 100000 * :scale)
INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (:aid, :aid, :delta);
EOF
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -f insert.sql -c$c
-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done
test-case 2:
./pgbench --initialize --scale=300 postgres
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -b tpcb-like -c$c
-j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done
test-case 3:
./pgbench --initialize --scale=300 postgres
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n
"$c ";./pgbench -n -M prepared -U ubuntu postgres -b simple-update
-c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done
--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
Attachments:
test_case_3.png