Improve WALRead() to suck data directly from WAL buffers when possible

Started by Bharath Rupireddy about 3 years ago, 95 messages
#1 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
1 attachment(s)

Hi,

WALRead() currently reads WAL from the WAL file on the disk, which
means, the walsenders serving streaming and logical replication
(callers of WALRead()) will have to hit the disk/OS's page cache for
reading the WAL. This may increase the amount of read IO required for
all the walsenders put together as one typically maintains many
standbys/subscribers on production servers for high availability,
disaster recovery, read-replicas and so on. Also, it may increase
replication lag if all the WAL reads are always hitting the disk.

It may happen that WAL buffers contain the requested WAL, if so, the
WALRead() can attempt to read from the WAL buffers first before
reading from the file. If the read hits the WAL buffers, then reading
from the file on disk is avoided. This mainly reduces the read IO/read
system calls. It also enables us to do other features specified
elsewhere [1].

I'm attaching a patch that implements the idea which is also noted
elsewhere [2]. I've run some tests [3]. The WAL buffers hit ratio with
the patch stood at 95%; in other words, the walsenders avoided reading
from the file 95% of the time. The benefit, if measured in terms of
the amount of data - 79% (13.5GB out of total 17GB) of the requested
WAL is read from the WAL buffers as opposed to 21% from the file. Note
that the WAL buffers hit ratio can be very low for write-heavy
workloads, in which case, file reads are inevitable.
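
The two percentages above can be re-derived from the per-walsender counters listed under [3]. The following is a minimal sketch of that arithmetic; the function name buffer_share_pct is hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Share of a total that was served from WAL buffers, as a percentage.
 * Works for both read counts and byte counts. Hypothetical helper, used
 * only to illustrate the arithmetic behind the quoted ratios.
 */
double
buffer_share_pct(uint64_t from_buffers, uint64_t from_file)
{
	uint64_t	total = from_buffers + from_file;

	assert(total > 0);
	return 100.0 * (double) from_buffers / (double) total;
}
```

With the PATCHED figures for assb1, buffer_share_pct(610611, 31005) gives about 95.2 (reads served from WAL buffers) and buffer_share_pct(14493226440, 3800607104) gives about 79.2 (bytes served from WAL buffers).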

The patch introduces concurrent readers for the WAL buffers; so far
there have only been concurrent writers. In the patch, WALRead() takes just
one lock (WALBufMappingLock) in shared mode to enable concurrent
readers and does minimal things - checks if the requested WAL page is
present in WAL buffers, if so, copies the page and releases the lock.
I think taking just WALBufMappingLock is enough here as the concurrent
writers depend on it to initialize and replace a page in WAL buffers.
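
The "is the requested page present" check described above comes down to a slot lookup plus a tag comparison. The following is a simplified sketch of that arithmetic only, assuming the default 8 kB WAL page and a plain array standing in for the shared slot tags. All names are hypothetical, and the locking is left to the caller.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192u		/* assumed WAL page size (the default) */

/* Slot in a ring of nbuffers pages that would hold the page containing lsn. */
uint64_t
sketch_buf_idx(uint64_t lsn, uint64_t nbuffers)
{
	assert(nbuffers > 0);
	return (lsn / SKETCH_BLCKSZ) % nbuffers;
}

/* End LSN of the page containing lsn; the slot's tag must equal this on a hit. */
uint64_t
sketch_expected_end(uint64_t lsn)
{
	return lsn + (SKETCH_BLCKSZ - lsn % SKETCH_BLCKSZ);
}

/*
 * Returns 1 if the slot currently holds the wanted page, 0 if the slot is
 * empty or has been recycled for another page. The caller is assumed to
 * hold whatever lock protects slot_end_lsn (shared mode is enough to read).
 */
int
sketch_page_is_cached(const uint64_t *slot_end_lsn, uint64_t nbuffers,
					  uint64_t lsn)
{
	return slot_end_lsn[sketch_buf_idx(lsn, nbuffers)] == sketch_expected_end(lsn);
}
```

Because the ring is indexed modulo its size, the same slot serves many pages over time, which is why comparing the tag against the expected end LSN is needed before copying anything.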

I'll add this to the next commitfest.

Thoughts?

[1]: /messages/by-id/CALj2ACXCSM+sTR=5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w@mail.gmail.com

[2]:
* XXX probably this should be improved to suck data directly from the
* WAL buffers when possible.
*/
bool
WALRead(XLogReaderState *state,

[3]:
1 primary, 1 sync standby, 1 async standby
./pgbench --initialize --scale=300 postgres
./pgbench --jobs=16 --progress=300 --client=32 --time=900 --username=ubuntu postgres

PATCHED:
-[ RECORD 1 ]----------+----------------
application_name | assb1
wal_read | 31005
wal_read_bytes | 3800607104
wal_read_time | 779.402
wal_read_buffers | 610611
wal_read_bytes_buffers | 14493226440
wal_read_time_buffers | 3033.309
sync_state | async
-[ RECORD 2 ]----------+----------------
application_name | ssb1
wal_read | 31027
wal_read_bytes | 3800932712
wal_read_time | 696.365
wal_read_buffers | 610580
wal_read_bytes_buffers | 14492900832
wal_read_time_buffers | 2989.507
sync_state | sync

HEAD:
-[ RECORD 1 ]----+----------------
application_name | assb1
wal_read | 705627
wal_read_bytes | 18343480640
wal_read_time | 7607.783
sync_state | async
-[ RECORD 2 ]----+------------
application_name | ssb1
wal_read | 705625
wal_read_bytes | 18343480640
wal_read_time | 4539.058
sync_state | sync

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v1-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch (application/octet-stream)
From d93a6c97bad19d3718f0e4f06caeac5ce9937b37 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 8 Dec 2022 09:37:01 +0000
Subject: [PATCH v1] Improve WALRead() to suck data directly from WAL buffers
 when possible

---
 src/backend/access/transam/xlog.c       | 184 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  58 +++++++-
 src/include/access/xlog.h               |   9 ++
 3 files changed, 249 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index a31fbbff78..196be98591 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -689,6 +689,7 @@ static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 							  XLogRecPtr *PrevPtr);
 static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
+static char *GetXLogBufferForRead(XLogRecPtr ptr, TimeLineID tli, char *page);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
@@ -1639,6 +1640,189 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Get the WAL buffer page containing passed in WAL record and also return the
+ * record's location within that buffer page.
+ */
+static char *
+GetXLogBufferForRead(XLogRecPtr ptr, TimeLineID tli, char *page)
+{
+	XLogRecPtr	expectedEndPtr;
+	XLogRecPtr	endptr;
+	int 	idx;
+	char    *recptr = NULL;
+
+	idx = XLogRecPtrToBufIdx(ptr);
+	expectedEndPtr = ptr;
+	expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+	/*
+	 * Hold WALBufMappingLock in shared mode so that the other concurrent WAL
+	 * readers are also allowed. We try to do as less work as possible while
+	 * holding the lock as it might impact concurrent WAL writers.
+	 *
+	 * XXX: Perhaps, measuring the immediate lock availability and its impact
+	 * on concurrent WAL writers is a good idea here.
+	 *
+	 * XXX: Perhaps, returning if lock is not immediately available a good idea
+	 * here. The caller can then go ahead with reading WAL from WAL file.
+	 *
+	 * XXX: Perhaps, quickly finding if the given WAL record is in WAL buffers
+	 * a good idea here. This avoids unnecessary lock acquire-release cycles.
+	 * One way to do that is by maintaining oldest WAL record that's currently
+	 * present in WAL buffers.
+	 */
+	LWLockAcquire(WALBufMappingLock, LW_SHARED);
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it.
+	 */
+	endptr = XLogCtl->xlblocks[idx];
+
+	if (expectedEndPtr == endptr)
+	{
+		XLogPageHeader phdr;
+
+		/*
+		 * We have found the WAL buffer page holding the given LSN. Read from a pointer
+		 * to the right offset within the page.
+		 */
+		memcpy(page, (XLogCtl->pages + idx * (Size) XLOG_BLCKSZ),
+			   (Size) XLOG_BLCKSZ);
+
+		/*
+		 * Release the lock as early as possible to avoid any possible
+		 * contention.
+		 */
+		LWLockRelease(WALBufMappingLock);
+
+		/*
+		 * Despite we read the WAL buffer page by holding all necessary locks,
+		 * we still want to be extra cautious here and serve the valid WAL
+		 * buffer page.
+		 *
+		 * XXX: Perhaps, we can further go and validate the found page header,
+		 * record header and record at least in assert builds, something like
+		 * the xlogreader.c does and return if any of those validity checks
+		 * fail. Having said that, we stick to the minimal checks for now.
+		 */
+		phdr = (XLogPageHeader) page;
+
+		if (phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			phdr->xlp_tli == tli)
+		{
+			/*
+			 * Page looks valid, so return the page and the requested record's
+			 * LSN.
+			 */
+			recptr = page + ptr % XLOG_BLCKSZ;
+		}
+	}
+	else
+	{
+		/* We have not found anything. */
+		LWLockRelease(WALBufMappingLock);
+	}
+
+	return recptr;
+}
+
+/*
+ * When possible, read WAL starting at 'startptr' of size 'count' bytes from
+ * WAL buffers into buffer passed in by the caller 'buf'. Read as much WAL as
+ * possible from the WAL buffers, remaining WAL, if any, the caller will take
+ * care of reading from WAL files directly.
+ *
+ * This function sets read bytes to 'read_bytes' and sets 'hit', 'partial_hit'
+ * and 'miss' accordingly.
+ */
+void
+XLogReadFromBuffers(XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf,
+					Size *read_bytes,
+					bool *hit,
+					bool *partial_hit,
+					bool *miss)
+{
+	XLogRecPtr	ptr;
+	char    *dst;
+	Size    nbytes;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+	Assert(count > 0);
+	Assert(startptr <= GetFlushRecPtr(NULL));
+	Assert(!RecoveryInProgress());
+
+	ptr = startptr;
+	nbytes = count;
+	dst = buf;
+	*read_bytes = 0;
+	*hit = false;
+	*partial_hit = false;
+	*miss = false;
+
+	while (nbytes > 0)
+	{
+		char 	page[XLOG_BLCKSZ] = {0};
+		char    *recptr;
+
+		recptr = GetXLogBufferForRead(ptr, tli, page);
+
+		if (recptr == NULL)
+			break;
+
+		if ((recptr + nbytes) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			memcpy(dst, recptr, nbytes);
+			dst += nbytes;
+			*read_bytes += nbytes;
+			ptr += nbytes;
+			nbytes = 0;
+		}
+		else if ((recptr + nbytes) > (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are not in one page. */
+			Size bytes_remaining;
+
+			/*
+			 * Compute the remaining bytes on the current page, copy them over
+			 * to output buffer and move forward to read further.
+			 */
+			bytes_remaining = XLOG_BLCKSZ - (recptr - page);
+			memcpy(dst, recptr, bytes_remaining);
+			dst += bytes_remaining;
+			nbytes -= bytes_remaining;
+			*read_bytes += bytes_remaining;
+			ptr += bytes_remaining;
+		}
+	}
+
+	if (*read_bytes == count)
+	{
+		/* It's a buffer hit. */
+		*hit = true;
+	}
+	else if (*read_bytes > 0 &&
+			 *read_bytes < count)
+	{
+		/* It's a buffer partial hit. */
+		*partial_hit = true;
+	}
+	else if (*read_bytes == 0)
+	{
+		/* It's a buffer miss. */
+		*miss = true;
+	}
+
+	elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X",
+		 *read_bytes, count, LSN_FORMAT_ARGS(startptr));
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a38a80e049..7ec94a0535 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1485,8 +1485,7 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1497,6 +1496,61 @@ WALRead(XLogReaderState *state,
 	XLogRecPtr	recptr;
 	Size		nbytes;
 
+#ifndef FRONTEND
+	/* Frontend tools have no idea of WAL buffers. */
+	Size        read_bytes;
+	bool		hit;
+	bool		partial_hit;
+	bool		miss;
+
+	/*
+	 * When possible, read WAL from WAL buffers. We skip this step and continue
+	 * the usual way, that is to read from WAL file, either when the server is
+	 * in recovery (standby mode, archive or crash recovery), in which case the
+	 * WAL buffers are not used or when the server is inserting in a different
+	 * timeline from that of the timeline that we're trying to read WAL from.
+	 */
+	if (!RecoveryInProgress() &&
+		tli == GetWALInsertionTimeLine())
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		XLogReadFromBuffers(startptr, tli, count, buf, &read_bytes,
+							&hit, &partial_hit, &miss);
+		pgstat_report_wait_end();
+
+		if (hit)
+		{
+			/*
+			 * We have fully read the requested WAL from WAL buffers, so
+			 * return.
+			 */
+			Assert(count == read_bytes);
+			return true;
+		}
+		else if (partial_hit)
+		{
+			/*
+			 * We have partially read from WAL buffers, so reset the state and
+			 * read the remaining bytes the usual way.
+			 */
+			Assert(read_bytes > 0 && count > read_bytes);
+			buf += read_bytes;
+			startptr += read_bytes;
+			count -= read_bytes;
+		}
+#ifdef USE_ASSERT_CHECKING
+		else if (miss)
+		{
+			/*
+			 * We have not read anything from WAL buffers, so read the usual way,
+			 * that is to read from WAL file.
+			 */
+			Assert(read_bytes == 0);
+		}
+#endif	/* USE_ASSERT_CHECKING */
+	}
+#endif	/* FRONTEND */
+
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1fbd48fbda..968608353e 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -247,6 +247,15 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern void XLogReadFromBuffers(XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf,
+								Size *read_bytes,
+								bool *hit,
+								bool *partial_hit,
+								bool *miss);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

#2 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Bharath Rupireddy (#1)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

The patch introduces concurrent readers for the WAL buffers, so far
only there are concurrent writers. In the patch, WALRead() takes just
one lock (WALBufMappingLock) in shared mode to enable concurrent
readers and does minimal things - checks if the requested WAL page is
present in WAL buffers, if so, copies the page and releases the lock.
I think taking just WALBufMappingLock is enough here as the concurrent
writers depend on it to initialize and replace a page in WAL buffers.

I'll add this to the next commitfest.

Thoughts?

This adds copying of the whole page (at least) at every WAL *record*
read, fighting all WAL writers by taking WALBufMappingLock on a very
busy page during the copy. I'm a bit doubtful that it results in an
overall improvement. It seems to me that almost all pread()s here hit
the file buffer, so it is unclear to me that copying a whole WAL page
(then copying the target record again) wins over a pread() call that
copies only the record to read. Do you have an actual number of how
frequently WAL reads go to disk, or the actual performance
gain or real I/O reduction this patch offers?

This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertion
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.

I remember that one of the advantages of reading the in-memory WAL
records is that it allows walsender to presend the unwritten
records. So perhaps we should manage how far the buffer is filled with
valid content (or how far we can presend) in this feature.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#3 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#2)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

At Mon, 12 Dec 2022 11:57:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertion
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.

Mmm. I'm a bit dim. Recovery doesn't read concurrently-written
records. Please forget about recovery.

I remember that the one of the advantage of reading the on-memory WAL
records is that that allows walsender to presend the unwritten
records. So perhaps we should manage how far the buffer is filled with
valid content (or how far we can presend) in this feature.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#4 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#3)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Sorry for the confusion.

At Mon, 12 Dec 2022 12:06:36 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Mon, 12 Dec 2022 11:57:17 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertion
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.

Mmm. I'm a bit dim. Recovery doesn't read concurrently-written
records. Please forget about recovery.

NO... Logical walsenders do that. So, please forget about this...

I remember that the one of the advantage of reading the on-memory WAL
records is that that allows walsender to presend the unwritten
records. So perhaps we should manage how far the buffer is filled with
valid content (or how far we can presend) in this feature.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#5 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Kyotaro Horiguchi (#2)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Mon, Dec 12, 2022 at 8:27 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Thanks for providing thoughts.

At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

The patch introduces concurrent readers for the WAL buffers, so far
only there are concurrent writers. In the patch, WALRead() takes just
one lock (WALBufMappingLock) in shared mode to enable concurrent
readers and does minimal things - checks if the requested WAL page is
present in WAL buffers, if so, copies the page and releases the lock.
I think taking just WALBufMappingLock is enough here as the concurrent
writers depend on it to initialize and replace a page in WAL buffers.

I'll add this to the next commitfest.

Thoughts?

This adds copying of the whole page (at least) at every WAL *record*
read,

In the worst case yes, but that may not always be true. On a typical
production server with decent write traffic, it happens that the
callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
MAX_SEND_SIZE bytes.

fighting all WAL writers by taking WALBufMappingLock on a very
busy page while the copying. I'm a bit doubtful that it results in an
overall improvement.

Well, the tests don't reflect that [1], I've run an insert work load
[2], pretty cool. If you have any use-cases in mind, please share them
and/or feel free to run at your end.

It seems to me almost all pread()s here happens
on file buffer so it is unclear to me that copying a whole WAL page
(then copying the target record again) wins over a pread() call that
copies only the record to read.

That's not always guaranteed. Imagine a typical production server with
decent write traffic and heavy analytical queries (which fills OS page
cache with the table pages accessed for the queries), the WAL pread()
calls turn to IOPS. Despite the WAL being present in WAL buffers,
customers will be paying unnecessarily for these IOPS too. With the
patch, we are basically avoiding the pread() system calls which may
turn into IOPS on production servers (99% of the time for the insert
use case [1][2], 95% of the time for pgbench use case specified
upthread). With the patch, WAL buffers can act as L1 cache, if one
calls OS page cache as L2 cache (of course this illustration is not
related to the typical processor L1 and L2 ... caches).
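
The "try WAL buffers first, then read the rest the usual way" flow can be modelled with two toy in-memory tiers. This is only a sketch under simplifying assumptions: each tier holds one contiguous LSN range, the fast tier is consulted only for the leading part of the request (as in the patch), and every name here is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A toy in-memory tier holding the bytes for LSNs in [start, start + len). */
typedef struct SketchTier
{
	uint64_t	start;
	size_t		len;
	const char *data;
} SketchTier;

/* Copy as many leading bytes of [lsn, lsn + count) as the tier holds. */
size_t
sketch_tier_read(const SketchTier *t, uint64_t lsn, size_t count, char *dst)
{
	size_t		avail;

	if (lsn < t->start || lsn >= t->start + t->len)
		return 0;				/* miss: the first byte is not present */
	avail = (size_t) (t->start + t->len - lsn);
	if (avail > count)
		avail = count;
	memcpy(dst, t->data + (lsn - t->start), avail);
	return avail;
}

/*
 * Try the fast tier first, then fetch whatever remains from the slow tier.
 * Returns the total bytes read and reports the fast-tier share in
 * *from_fast, so a caller can tell a hit, partial hit or miss apart.
 */
size_t
sketch_read(const SketchTier *fast, const SketchTier *slow,
			uint64_t lsn, size_t count, char *dst, size_t *from_fast)
{
	size_t		got = sketch_tier_read(fast, lsn, count, dst);

	assert(count > 0);
	*from_fast = got;
	if (got < count)
		got += sketch_tier_read(slow, lsn + got, count - got, dst + got);
	return got;
}
```

On a partial hit the slow tier is asked only for the tail, which corresponds to advancing buf and startptr and shrinking count before falling through to the file read.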

Do you have an actual number of how
frequent WAL reads go to disk, or the actual number of performance
gain or real I/O reduction this patch offers?

It might be a bit tough to generate such heavy traffic. An idea is to
ensure the WAL page/file goes out of the OS page cache before
WALRead() - these might help here - 0002 patch from
/messages/by-id/CA+hUKGLmeyrDcUYAty90V_YTcoo5kAFfQjRQ-_1joS_=X7HztA@mail.gmail.com
and tool https://github.com/klando/pgfincore.

This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertion
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.

WALRead() callers are smart enough to take the flushed bytes only.
Although they read the whole WAL page, they calculate the valid bytes.
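
As a rough illustration of "take the flushed bytes only", a caller can clamp the range it consumes to the flush pointer. This is a sketch of the idea, not the walsender code; sketch_valid_bytes and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Number of bytes of [start, start + want) that lie below flush_lsn, that
 * is, the part of a requested range that is already flushed and therefore
 * safe to consume even if the rest of the page is still being filled.
 */
size_t
sketch_valid_bytes(uint64_t start, size_t want, uint64_t flush_lsn)
{
	assert(want > 0);
	if (start >= flush_lsn)
		return 0;
	if (flush_lsn - start < want)
		return (size_t) (flush_lsn - start);
	return want;
}
```

For example, a full 8192-byte page starting at LSN 8192 with the flush pointer at 10000 yields 1808 usable bytes.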

I remember that the one of the advantage of reading the on-memory WAL
records is that that allows walsender to presend the unwritten
records. So perhaps we should manage how far the buffer is filled with
valid content (or how far we can presend) in this feature.

Yes, the non-flushed WAL can be read and sent across if one wishes to
make replication faster and to flush in parallel on primary and
standbys at the cost of a bit of extra crash handling; that's
mentioned here /messages/by-id/CALj2ACXCSM+sTR=5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w@mail.gmail.com.
However, this can be a separate discussion.

I also want to reiterate that the patch implemented a TODO item:

* XXX probably this should be improved to suck data directly from the
* WAL buffers when possible.
*/
bool
WALRead(XLogReaderState *state,

[1]:
PATCHED:
1 1470.329907
2 1437.096329
4 2966.096948
8 5978.441040
16 11405.538255
32 22933.546058
64 43341.870038
128 73623.837068
256 104754.248661
512 115746.359530
768 106106.691455
1024 91900.079086
2048 84134.278589
4096 62580.875507

-[ RECORD 1 ]----------+-----------
application_name | assb1
sent_lsn | 0/1B8106A8
write_lsn | 0/1B8106A8
flush_lsn | 0/1B8106A8
replay_lsn | 0/1B8106A8
write_lag |
flush_lag |
replay_lag |
wal_read | 104
wal_read_bytes | 10733008
wal_read_time | 1.845
wal_read_buffers | 76662
wal_read_bytes_buffers | 383598808
wal_read_time_buffers | 205.418
sync_state | async

HEAD:
1 1312.054496
2 1449.429321
4 2717.496207
8 5913.361540
16 10762.978907
32 19653.449728
64 41086.124269
128 68548.061171
256 104468.415361
512 114328.943598
768 91751.279309
1024 96403.736757
2048 82155.140270
4096 66160.659511

-[ RECORD 1 ]----+-----------
application_name | assb1
sent_lsn | 0/1AB5BCB8
write_lsn | 0/1AB5BCB8
flush_lsn | 0/1AB5BCB8
replay_lsn | 0/1AB5BCB8
write_lag |
flush_lag |
replay_lag |
wal_read | 71967
wal_read_bytes | 381009080
wal_read_time | 243.616
sync_state | async

[2]: Test details:
./configure --prefix=$PWD/inst/ CFLAGS="-O3" > install.log && make -j 8 install > install.log 2>&1 &
1 primary, 1 async standby
cd inst/bin
./pg_ctl -D data -l logfile stop
./pg_ctl -D assbdata -l logfile1 stop
rm -rf data assbdata
rm logfile logfile1
free -m
sudo su -c 'sync; echo 3 > /proc/sys/vm/drop_caches'
free -m
./initdb -D data
rm -rf /home/ubuntu/archived_wal
mkdir /home/ubuntu/archived_wal
cat << EOF >> data/postgresql.conf
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/archived_wal/%f'
track_wal_io_timing = 'on'
EOF
./pg_ctl -D data -l logfile start
./psql -c "select pg_create_physical_replication_slot('assb1_repl_slot', true, false)" postgres
./pg_ctl -D data -l logfile restart
./pg_basebackup -D assbdata
./pg_ctl -D data -l logfile stop
cat << EOF >> assbdata/postgresql.conf
port=5433
primary_conninfo='host=localhost port=5432 dbname=postgres user=ubuntu application_name=assb1'
primary_slot_name='assb1_repl_slot'
restore_command='cp /home/ubuntu/archived_wal/%f %p'
EOF
touch assbdata/standby.signal
./pg_ctl -D data -l logfile start
./pg_ctl -D assbdata -l logfile1 start
./pgbench -i -s 1 -d postgres
./psql -d postgres -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;"
cat << EOF >> insert.sql
\set aid random(1, 10 * :scale)
\set delta random(1, 100000 * :scale)
INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (:aid, :aid, :delta);
EOF
ulimit -S -n 5000
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do echo -n "$c ";./pgbench -n -M prepared -U ubuntu postgres -f insert.sql -c$c -j$c -T5 2>&1|grep '^tps'|awk '{print $3}';done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#6 Dilip Kumar
dilipbalaut@gmail.com
In reply to: Bharath Rupireddy (#5)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, Dec 23, 2022 at 3:46 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Mon, Dec 12, 2022 at 8:27 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

Thanks for providing thoughts.

At Fri, 9 Dec 2022 14:33:39 +0530, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote in

The patch introduces concurrent readers for the WAL buffers, so far
only there are concurrent writers. In the patch, WALRead() takes just
one lock (WALBufMappingLock) in shared mode to enable concurrent
readers and does minimal things - checks if the requested WAL page is
present in WAL buffers, if so, copies the page and releases the lock.
I think taking just WALBufMappingLock is enough here as the concurrent
writers depend on it to initialize and replace a page in WAL buffers.
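
To illustrate the lookup described above, here is a minimal standalone
sketch. It is not code from the patch: the Toy* names, the 4-slot ring
and the page size are assumptions made for the example, and the locking
is only indicated in a comment. It shows how a slot index and an expected
page end LSN can be derived from the requested LSN and compared with the
end LSN recorded for the slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ   8192u      /* stand-in for XLOG_BLCKSZ */
#define TOY_NBUFFERS 4u         /* hypothetical number of WAL buffer pages */

typedef uint64_t toy_lsn;

/* Toy model of the shared WAL buffer state (all names are hypothetical). */
typedef struct ToyWalBuffers
{
    toy_lsn end_lsn[TOY_NBUFFERS];  /* end LSN of the page held in each slot */
    char    pages[TOY_NBUFFERS][TOY_BLCKSZ];
} ToyWalBuffers;

/* Slot that would hold the page containing 'ptr'. */
static unsigned
toy_buf_idx(toy_lsn ptr)
{
    return (unsigned) ((ptr / TOY_BLCKSZ) % TOY_NBUFFERS);
}

/*
 * Copy the page containing 'ptr' into 'out' if the slot still holds it.
 * In a real server this would run while holding the mapping lock in
 * shared mode; the toy version has no locking at all.
 */
static bool
toy_copy_page_if_present(const ToyWalBuffers *bufs, toy_lsn ptr, char *out)
{
    unsigned idx = toy_buf_idx(ptr);
    toy_lsn  expected_end = ptr + (TOY_BLCKSZ - ptr % TOY_BLCKSZ);

    assert(bufs != NULL && out != NULL);

    if (bufs->end_lsn[idx] != expected_end)
        return false;           /* slot was recycled or never filled */

    memcpy(out, bufs->pages[idx], TOY_BLCKSZ);
    return true;
}
```

A caller would treat a false return as "not in WAL buffers" and read the
page from the WAL file instead.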

I'll add this to the next commitfest.

Thoughts?

This adds copying of the whole page (at least) at every WAL *record*
read,

In the worst case yes, but that may not always be true. On a typical
production server with decent write traffic, it happens that the
callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
MAX_SEND_SIZE bytes.

I agree with this.

This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertion
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.

WALRead() callers are smart enough to take the flushed bytes only.
Although they read the whole WAL page, they calculate the valid bytes.

Right
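
As a small illustration of "calculate the valid bytes", the following
standalone sketch (hypothetical names, not code from the patch or from
walsender.c) computes how much of a page is already flushed, given the
page start and the flush pointer. Only that many bytes of the copied page
would be used by the caller.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_BLCKSZ 8192u        /* stand-in for XLOG_BLCKSZ */

typedef uint64_t toy_lsn;

/*
 * Number of bytes on the page starting at 'page_start' that lie below
 * 'flush_ptr', i.e. the part of the page a reader may safely consume.
 */
static uint32_t
toy_valid_bytes(toy_lsn page_start, toy_lsn flush_ptr)
{
    assert(page_start % TOY_BLCKSZ == 0);

    if (flush_ptr <= page_start)
        return 0;                           /* nothing flushed on this page */
    if (page_start + TOY_BLCKSZ <= flush_ptr)
        return TOY_BLCKSZ;                  /* whole page is flushed */
    return (uint32_t) (flush_ptr - page_start); /* part of the page */
}
```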

On first read the patch looks good, although it needs some more
thoughts on 'XXX' comments in the patch.

Also, I do not like that XLogReadFromBuffers() uses 3 bools
(hit/partial hit/miss). Instead, we could use an enum or some
tristate variable; I think that would be cleaner.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#7Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Dilip Kumar (#6)
3 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Sun, Dec 25, 2022 at 4:55 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

This adds copying of the whole page (at least) at every WAL *record*
read,

In the worst case yes, but that may not always be true. On a typical
production server with decent write traffic, it happens that the
callers of WALRead() read a full WAL page of size XLOG_BLCKSZ bytes or
MAX_SEND_SIZE bytes.

I agree with this.

This patch copies the bleeding edge WAL page without recording the
(next) insertion point nor checking whether all in-progress insertion
behind the target LSN have finished. Thus the copied page may have
holes. That being said, the sequential-reading nature and the fact
that WAL buffers are zero-initialized may make it work for recovery,
but I don't think this also works for replication.

WALRead() callers are smart enough to take the flushed bytes only.
Although they read the whole WAL page, they calculate the valid bytes.

Right

On first read the patch looks good, although it needs some more
thoughts on 'XXX' comments in the patch.

Thanks a lot for reviewing.

Here are some open points that I mentioned in v1 patch:

1.
+     * XXX: Perhaps, measuring the immediate lock availability and its impact
+     * on concurrent WAL writers is a good idea here.

It was shown in my testing upthread [1] that the patch does no harm in
this regard. It would be great if other members tried testing in their
respective environments and use cases.

2.
+     * XXX: Perhaps, returning if lock is not immediately available a good idea
+     * here. The caller can then go ahead with reading WAL from WAL file.

After thinking a bit more on this, ISTM that doing the above is right
to not cause any contention when the lock is busy. I've done so in the
v2 patch.
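
The "try the lock once, otherwise fall back to the file" idea can be
sketched in isolation as below. This is only a toy model under stated
assumptions: ToyLock is a hypothetical shared/exclusive lock built on C11
atomics, not PostgreSQL's LWLock, and the actual page copy is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock: state < 0 means held exclusively, otherwise it counts readers. */
typedef struct ToyLock
{
    atomic_int state;
} ToyLock;

/* Try to take the lock in shared mode; never wait for a writer. */
static bool
toy_try_lock_shared(ToyLock *lock)
{
    int cur = atomic_load(&lock->state);

    while (cur >= 0)
    {
        /* On failure 'cur' is refreshed and the loop condition re-checked. */
        if (atomic_compare_exchange_weak(&lock->state, &cur, cur + 1))
            return true;
    }
    return false;               /* a writer holds the lock exclusively */
}

static void
toy_unlock_shared(ToyLock *lock)
{
    int prev = atomic_fetch_sub(&lock->state, 1);

    assert(prev > 0);
    (void) prev;
}

typedef enum { TOY_FROM_BUFFERS, TOY_FROM_FILE } ToySource;

/* Decide where a read is served from, without ever blocking on the lock. */
static ToySource
toy_choose_source(ToyLock *lock)
{
    if (!toy_try_lock_shared(lock))
        return TOY_FROM_FILE;   /* lock busy: caller reads the WAL file */

    /* ... the page would be copied here while the lock is held ... */
    toy_unlock_shared(lock);
    return TOY_FROM_BUFFERS;
}
```

The point of the design is that a reader never queues behind a writer; a
busy lock simply turns into a buffer miss.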

3.
+     * XXX: Perhaps, quickly finding if the given WAL record is in WAL buffers
+     * a good idea here. This avoids unnecessary lock acquire-release cycles.
+     * One way to do that is by maintaining oldest WAL record that's currently
+     * present in WAL buffers.

I think by doing the above we might end up creating a new point of
contention, because shared variables to track the min and max available
LSNs in the WAL buffers would need to be protected against all the
concurrent writers. Also, with the change that's done in (2) above,
that is, quickly exiting if the lock was busy, this comment seems
unnecessary to worry about. Hence, I decided to leave it there.

4.
+         * XXX: Perhaps, we can further go and validate the found page header,
+         * record header and record at least in assert builds, something like
+         * the xlogreader.c does and return if any of those validity checks
+         * fail. Having said that, we stick to the minimal checks for now.

I was being over-cautious initially. The fact that we acquire
WALBufMappingLock while reading the needed WAL buffer page itself
guarantees that no one else initializes it/makes it ready for next use
in AdvanceXLInsertBuffer(). The checks that we have for page header
(xlp_magic, xlp_pageaddr and xlp_tli) in the patch are enough for us
to ensure that we're not reading a page that just got initialized. The
callers will anyway perform extensive checks on page and record in
XLogReaderValidatePageHeader() and ValidXLogRecordHeader()
respectively. If any such failures occur after reading WAL from WAL
buffers, then that must be treated as a bug IMO. Hence, I don't think
we need to do the above.
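
For readers who want to see the three header checks spelled out, here is
a standalone sketch. The struct layout and the magic value are
assumptions made for the example (they are not the real XLogPageHeaderData
or XLOG_PAGE_MAGIC); only the shape of the comparison follows the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BLCKSZ     8192u
#define TOY_PAGE_MAGIC 0xABCDu  /* hypothetical, not the real XLOG_PAGE_MAGIC */

/* Toy page header carrying the three fields the check looks at. */
typedef struct ToyPageHeader
{
    uint16_t magic;
    uint32_t tli;
    uint64_t pageaddr;
} ToyPageHeader;

/*
 * Return true if the copied page looks like the page that contains 'ptr'
 * on timeline 'tli'. A zero-filled (freshly initialized) page fails the
 * magic check.
 */
static bool
toy_page_header_ok(const ToyPageHeader *hdr, uint64_t ptr, uint32_t tli)
{
    uint64_t page_start = ptr - (ptr % TOY_BLCKSZ);

    assert(hdr != NULL);

    return hdr->magic == TOY_PAGE_MAGIC &&
           hdr->pageaddr == page_start &&
           hdr->tli == tli;
}
```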

Also, I do not like that XLogReadFromBuffers() uses 3 bools
(hit/partial hit/miss). Instead, we could use an enum or some
tristate variable; I think that would be cleaner.

Yeah, that seems more verbose; all that information can be deduced
from the requested bytes and the read bytes. I've done so in the v2 patch.
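
A minimal sketch of that deduction, with hypothetical names that do not
appear in the patch: given the number of bytes requested and the number
actually read from WAL buffers, the hit, partial hit and miss cases fall
out directly, so no separate flags are needed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical result type; the v2 patch itself only compares two sizes. */
typedef enum ToyBufReadResult
{
    TOY_BUF_MISS,               /* nothing came from WAL buffers */
    TOY_BUF_PARTIAL_HIT,        /* a prefix of the request came from buffers */
    TOY_BUF_HIT                 /* the whole request came from buffers */
} ToyBufReadResult;

static ToyBufReadResult
toy_classify_read(size_t requested, size_t read_bytes)
{
    assert(requested > 0 && read_bytes <= requested);

    if (read_bytes == requested)
        return TOY_BUF_HIT;
    if (read_bytes > 0)
        return TOY_BUF_PARTIAL_HIT;
    return TOY_BUF_MISS;
}
```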

Please review the attached v2 patch further.

I'm also attaching two helper patches (as .txt files) for testing;
they add WAL read stats.

USE-ON-HEAD-Collect-WAL-read-from-file-stats.txt: apply on HEAD and
monitor pg_stat_replication for per-walsender WAL read from WAL file
stats.

USE-ON-PATCH-Collect-WAL-read-from-buffers-and-file-stats.txt: apply
on the v2 patch and monitor pg_stat_replication for per-walsender WAL
read from WAL buffers and WAL file stats.

[1]: /messages/by-id/CALj2ACXUbvON86vgwTkum8ab3bf1=HkMxQ5hZJZS3ZcJn8NEXQ@mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

USE-ON-HEAD-Collect-WAL-read-from-file-stats.txt (text/plain; charset=US-ASCII)
From 6517e50f482f88ea5185609ff4dcf0e0256475d5 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Mon, 26 Dec 2022 08:14:11 +0000
Subject: [PATCH v2] Collect WAL read from file stats for WAL senders

---
 doc/src/sgml/monitoring.sgml                | 31 +++++++++++
 src/backend/access/transam/xlogreader.c     | 33 ++++++++++--
 src/backend/access/transam/xlogutils.c      |  2 +-
 src/backend/catalog/system_views.sql        |  5 +-
 src/backend/replication/walsender.c         | 58 +++++++++++++++++++--
 src/bin/pg_waldump/pg_waldump.c             |  2 +-
 src/include/access/xlogreader.h             | 21 ++++++--
 src/include/catalog/pg_proc.dat             |  6 +--
 src/include/replication/walsender_private.h |  4 ++
 src/test/regress/expected/rules.out         |  7 ++-
 10 files changed, 151 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 363b183e5f..fdf4c7d774 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2615,6 +2615,37 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Send time of last reply message received from standby server
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_read</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL data is read from disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_read_bytes</structfield> <type>numeric</type>
+      </para>
+      <para>
+       Total amount of WAL read from disk in bytes
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_read_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent reading WAL from disk via
+       <function>WALRead</function> request, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
+       otherwise zero).
+      </para></entry>
+     </row>
+
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a38a80e049..7453724a07 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -31,6 +31,7 @@
 #include "access/xlogrecord.h"
 #include "catalog/pg_control.h"
 #include "common/pg_lzcompress.h"
+#include "portability/instr_time.h"
 #include "replication/origin.h"
 
 #ifndef FRONTEND
@@ -1489,9 +1490,9 @@ err:
  * WAL buffers when possible.
  */
 bool
-WALRead(XLogReaderState *state,
-		char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
-		WALReadError *errinfo)
+WALRead(XLogReaderState *state, char *buf, XLogRecPtr startptr, Size count,
+		TimeLineID tli, WALReadError *errinfo, WALReadStats *stats,
+		bool capture_wal_io_timing)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -1506,6 +1507,7 @@ WALRead(XLogReaderState *state,
 		uint32		startoff;
 		int			segbytes;
 		int			readbytes;
+		instr_time	start;
 
 		startoff = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
 
@@ -1540,6 +1542,10 @@ WALRead(XLogReaderState *state,
 		else
 			segbytes = nbytes;
 
+		/* Measure I/O timing to read WAL data if requested by the caller. */
+		if (stats != NULL && capture_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
 #ifndef FRONTEND
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 #endif
@@ -1552,6 +1558,27 @@ WALRead(XLogReaderState *state,
 		pgstat_report_wait_end();
 #endif
 
+		/* Collect I/O stats if requested by the caller. */
+		if (stats != NULL)
+		{
+			/* Increment the number of times WAL is read from disk. */
+			stats->wal_read++;
+
+			/* Collect bytes read. */
+			if (readbytes > 0)
+				stats->wal_read_bytes += readbytes;
+
+			/* Increment the I/O timing. */
+			if (capture_wal_io_timing)
+			{
+				instr_time	duration;
+
+				INSTR_TIME_SET_CURRENT(duration);
+				INSTR_TIME_SUBTRACT(duration, start);
+				stats->wal_read_time += INSTR_TIME_GET_MICROSEC(duration);
+			}
+		}
+
 		if (readbytes <= 0)
 		{
 			errinfo->wre_errno = errno;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 563cba258d..372de2c7d8 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -1027,7 +1027,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
 	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
-				 &errinfo))
+				 &errinfo, NULL, false))
 		WALReadRaiseError(&errinfo);
 
 	/* number of valid bytes in the buffer */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..b47f44a852 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -892,7 +892,10 @@ CREATE VIEW pg_stat_replication AS
             W.replay_lag,
             W.sync_priority,
             W.sync_state,
-            W.reply_time
+            W.reply_time,
+            W.wal_read,
+            W.wal_read_bytes,
+            W.wal_read_time
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c11bb3716f..fa02e327f2 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -259,7 +259,7 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
 							  TimeLineID *tli_p);
-
+static void WalSndAccumulateWalReadStats(WALReadStats *stats);
 
 /* Initialize walsender process before entering the main command loop */
 void
@@ -907,6 +907,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	WALReadError errinfo;
 	XLogSegNo	segno;
 	TimeLineID	currTLI = GetWALInsertionTimeLine();
+	WALReadStats	stats;
 
 	/*
 	 * Since logical decoding is only permitted on a primary server, we know
@@ -932,6 +933,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	else
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
+	MemSet(&stats, 0, sizeof(WALReadStats));
+
 	/* now actually read the data, we know it's there */
 	if (!WALRead(state,
 				 cur_page,
@@ -940,9 +943,13 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 				 state->seg.ws_tli, /* Pass the current TLI because only
 									 * WalSndSegmentOpen controls whether new
 									 * TLI is needed. */
-				 &errinfo))
+				 &errinfo,
+				 &stats,
+				 track_wal_io_timing))
 		WALReadRaiseError(&errinfo);
 
+	WalSndAccumulateWalReadStats(&stats);
+
 	/*
 	 * After reading into the buffer, check that what we read was valid. We do
 	 * this after reading, because even though the segment was present when we
@@ -2610,6 +2617,9 @@ InitWalSenderSlot(void)
 			walsnd->sync_standby_priority = 0;
 			walsnd->latch = &MyProc->procLatch;
 			walsnd->replyTime = 0;
+			walsnd->wal_read_stats.wal_read = 0;
+			walsnd->wal_read_stats.wal_read_bytes = 0;
+			walsnd->wal_read_stats.wal_read_time = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -2730,6 +2740,7 @@ XLogSendPhysical(void)
 	Size		nbytes;
 	XLogSegNo	segno;
 	WALReadError errinfo;
+	WALReadStats stats;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -2945,6 +2956,8 @@ XLogSendPhysical(void)
 	enlargeStringInfo(&output_message, nbytes);
 
 retry:
+	MemSet(&stats, 0, sizeof(WALReadStats));
+
 	if (!WALRead(xlogreader,
 				 &output_message.data[output_message.len],
 				 startptr,
@@ -2952,9 +2965,13 @@ retry:
 				 xlogreader->seg.ws_tli,	/* Pass the current TLI because
 											 * only WalSndSegmentOpen controls
 											 * whether new TLI is needed. */
-				 &errinfo))
+				 &errinfo,
+				 &stats,
+				 track_wal_io_timing))
 		WALReadRaiseError(&errinfo);
 
+	WalSndAccumulateWalReadStats(&stats);
+
 	/* See logical_read_xlog_page(). */
 	XLByteToSeg(startptr, segno, xlogreader->segcxt.ws_segsize);
 	CheckXLogRemoved(segno, xlogreader->seg.ws_tli);
@@ -3458,7 +3475,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	12
+#define PG_STAT_GET_WAL_SENDERS_COLS	15
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	SyncRepStandbyData *sync_standbys;
 	int			num_standbys;
@@ -3487,9 +3504,13 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		WalSndState state;
 		TimestampTz replyTime;
 		bool		is_sync_standby;
+		int64		wal_read;
+		uint64		wal_read_bytes;
+		int64		wal_read_time;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS] = {0};
 		int			j;
+		char		buf[256];
 
 		/* Collect data from shared memory */
 		SpinLockAcquire(&walsnd->mutex);
@@ -3509,6 +3530,9 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		replyTime = walsnd->replyTime;
+		wal_read = walsnd->wal_read_stats.wal_read;
+		wal_read_bytes = walsnd->wal_read_stats.wal_read_bytes;
+		wal_read_time = walsnd->wal_read_stats.wal_read_time;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3605,6 +3629,18 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 				nulls[11] = true;
 			else
 				values[11] = TimestampTzGetDatum(replyTime);
+
+			values[12] = Int64GetDatum(wal_read);
+
+			/* Convert to numeric. */
+			snprintf(buf, sizeof buf, UINT64_FORMAT, wal_read_bytes);
+			values[13] = DirectFunctionCall3(numeric_in,
+											 CStringGetDatum(buf),
+											 ObjectIdGetDatum(0),
+											 Int32GetDatum(-1));
+
+			/* Convert counter from microsec to millisec for display. */
+			values[14] = Float8GetDatum(((double) wal_read_time) / 1000.0);
 		}
 
 		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
@@ -3849,3 +3885,17 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+/*
+ * Function to accumulate WAL Read stats for WAL sender.
+ */
+static void
+WalSndAccumulateWalReadStats(WALReadStats *stats)
+{
+	/* Collect I/O stats for walsender. */
+	SpinLockAcquire(&MyWalSnd->mutex);
+	MyWalSnd->wal_read_stats.wal_read += stats->wal_read;
+	MyWalSnd->wal_read_stats.wal_read_bytes += stats->wal_read_bytes;
+	MyWalSnd->wal_read_stats.wal_read_time += stats->wal_read_time;
+	SpinLockRelease(&MyWalSnd->mutex);
+}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 9993378ca5..698ce1e9f7 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -364,7 +364,7 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 	}
 
 	if (!WALRead(state, readBuff, targetPagePtr, count, private->timeline,
-				 &errinfo))
+				 &errinfo, NULL, false))
 	{
 		WALOpenSegment *seg = &errinfo.wre_seg;
 		char		fname[MAXPGPATH];
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index e87f91316a..26a2c975de 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -389,9 +389,24 @@ typedef struct WALReadError
 	WALOpenSegment wre_seg;		/* Segment we tried to read from. */
 } WALReadError;
 
-extern bool WALRead(XLogReaderState *state,
-					char *buf, XLogRecPtr startptr, Size count,
-					TimeLineID tli, WALReadError *errinfo);
+/*
+ * WAL read stats from WALRead that the callers can use.
+ */
+typedef struct WALReadStats
+{
+	/* Number of times WAL read from disk. */
+	int64	wal_read;
+
+	/* Total amount of WAL read from disk in bytes. */
+	uint64	wal_read_bytes;
+
+	/* Total amount of time spent reading WAL from disk. */
+	int64 	wal_read_time;
+} WALReadStats;
+
+extern bool WALRead(XLogReaderState *state, char *buf, XLogRecPtr startptr,
+					Size count, TimeLineID tli, WALReadError *errinfo,
+					WALReadStats *stats, bool capture_wal_io_timing);
 
 /* Functions for decoding an XLogRecord */
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7056c95371..18320cf846 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5391,9 +5391,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,numeric,float8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,wal_read,wal_read_bytes,wal_read_time}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 7897c74589..35413ea0d2 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -13,6 +13,7 @@
 #define _WALSENDER_PRIVATE_H
 
 #include "access/xlog.h"
+#include "access/xlogreader.h"
 #include "nodes/nodes.h"
 #include "replication/syncrep.h"
 #include "storage/latch.h"
@@ -78,6 +79,9 @@ typedef struct WalSnd
 	 * Timestamp of the last message received from standby.
 	 */
 	TimestampTz replyTime;
+
+	/* WAL read stats for walsender. */
+	WALReadStats wal_read_stats;
 } WalSnd;
 
 extern PGDLLIMPORT WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..fd9d298e79 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2054,9 +2054,12 @@ pg_stat_replication| SELECT s.pid,
     w.replay_lag,
     w.sync_priority,
     w.sync_state,
-    w.reply_time
+    w.reply_time,
+    w.wal_read,
+    w.wal_read_bytes,
+    w.wal_read_time
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, wal_read, wal_read_bytes, wal_read_time) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_replication_slots| SELECT s.slot_name,
     s.spill_txns,
-- 
2.34.1
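
The bookkeeping done by the helper patch above (count every read, add the
bytes only when the read succeeded, accumulate elapsed time in
microseconds and convert to milliseconds only for display) can be tried
outside the server with the sketch below. It is an illustration under
assumptions: the Toy* names are hypothetical, and C11 timespec_get() is
used as a portable stand-in for PostgreSQL's instr_time macros, so the
clock here is wall-clock time rather than a monotonic one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Toy counterpart of the per-walsender read counters. */
typedef struct ToyReadStats
{
    int64_t  reads;             /* number of read attempts */
    uint64_t read_bytes;        /* bytes returned by successful reads */
    int64_t  read_time_us;      /* accumulated elapsed time, microseconds */
} ToyReadStats;

/* Current wall-clock time in microseconds. */
static int64_t
toy_now_us(void)
{
    struct timespec ts;

    timespec_get(&ts, TIME_UTC);
    return (int64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* Record one read that started at 'start_us' and returned 'nread' bytes. */
static void
toy_record_read(ToyReadStats *stats, int64_t start_us, long nread)
{
    assert(stats != NULL);

    stats->reads++;
    if (nread > 0)
        stats->read_bytes += (uint64_t) nread;
    stats->read_time_us += toy_now_us() - start_us;
}

/* Convert the accumulated time to milliseconds for display. */
static double
toy_read_time_ms(const ToyReadStats *stats)
{
    return (double) stats->read_time_us / 1000.0;
}
```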

v2-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch (application/octet-stream)
From 047c58df6aaae586efe58b6a4068b17f25976b0a Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Mon, 26 Dec 2022 08:36:28 +0000
Subject: [PATCH v2] Improve WALRead() to suck data directly from WAL buffers
 when possible

---
 src/backend/access/transam/xlog.c       | 154 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  47 +++++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 205 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 91473b00d9..c3138493be 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -689,6 +689,7 @@ static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 							  XLogRecPtr *PrevPtr);
 static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
+static char *GetXLogBufferForRead(XLogRecPtr ptr, TimeLineID tli, char *page);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
@@ -1639,6 +1640,159 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Get the WAL buffer page containing passed in WAL record and also return the
+ * record's location within that buffer page.
+ */
+static char *
+GetXLogBufferForRead(XLogRecPtr ptr, TimeLineID tli, char *page)
+{
+	XLogRecPtr	expectedEndPtr;
+	XLogRecPtr	endptr;
+	int 	idx;
+	char    *recptr = NULL;
+
+	idx = XLogRecPtrToBufIdx(ptr);
+	expectedEndPtr = ptr;
+	expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+	/*
+	 * Try to acquire WALBufMappingLock in shared mode so that the other
+	 * concurrent WAL readers are also allowed. We try to do as less work as
+	 * possible while holding the lock as it might impact concurrent WAL
+	 * writers.
+	 *
+	 * If we cannot immediately acquire the lock, meaning the lock was busy,
+	 * then exit quickly to not cause any contention. The caller can then
+	 * fallback to reading WAL from WAL file.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return recptr;
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it.
+	 */
+	endptr = XLogCtl->xlblocks[idx];
+
+	if (expectedEndPtr == endptr)
+	{
+		XLogPageHeader phdr;
+
+		/*
+		 * We have found the WAL buffer page holding the given LSN. Read from a
+		 * pointer to the right offset within the page.
+		 */
+		memcpy(page, (XLogCtl->pages + idx * (Size) XLOG_BLCKSZ),
+			   (Size) XLOG_BLCKSZ);
+
+		/*
+		 * Release the lock as early as possible to avoid creating any possible
+		 * contention.
+		 */
+		LWLockRelease(WALBufMappingLock);
+
+		/*
+		 * The fact that we acquire WALBufMappingLock while reading the WAL
+		 * buffer page itself guarantees that no one else initializes it or
+		 * makes it ready for next use in AdvanceXLInsertBuffer().
+		 *
+		 * However, we perform basic page header checks for ensuring that we
+		 * are not reading a page that got just initialized. The callers will
+		 * anyway perform extensive page-level and record-level checks.
+		 */
+		phdr = (XLogPageHeader) page;
+
+		if (phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			phdr->xlp_tli == tli)
+		{
+			/*
+			 * Page looks valid, so return the page and the requested record's
+			 * LSN.
+			 */
+			recptr = page + ptr % XLOG_BLCKSZ;
+		}
+	}
+	else
+	{
+		/* We have found nothing. */
+		LWLockRelease(WALBufMappingLock);
+	}
+
+	return recptr;
+}
+
+/*
+ * When possible, read WAL starting at 'startptr' of size 'count' bytes from
+ * WAL buffers into buffer passed in by the caller 'buf'. Read as much WAL as
+ * possible from the WAL buffers, remaining WAL, if any, the caller will take
+ * care of reading from WAL files directly.
+ *
+ * This function sets read bytes to 'read_bytes'.
+ */
+void
+XLogReadFromBuffers(XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf,
+					Size *read_bytes)
+{
+	XLogRecPtr	ptr;
+	char    *dst;
+	Size    nbytes;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+	Assert(count > 0);
+	Assert(startptr <= GetFlushRecPtr(NULL));
+	Assert(!RecoveryInProgress());
+
+	ptr = startptr;
+	nbytes = count;
+	dst = buf;
+	*read_bytes = 0;
+
+	while (nbytes > 0)
+	{
+		char 	page[XLOG_BLCKSZ] = {0};
+		char    *recptr;
+
+		recptr = GetXLogBufferForRead(ptr, tli, page);
+
+		if (recptr == NULL)
+			break;
+
+		if ((recptr + nbytes) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			memcpy(dst, recptr, nbytes);
+			dst += nbytes;
+			*read_bytes += nbytes;
+			ptr += nbytes;
+			nbytes = 0;
+		}
+		else if ((recptr + nbytes) > (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are not in one page. */
+			Size bytes_remaining;
+
+			/*
+			 * Compute the remaining bytes on the current page, copy them over
+			 * to output buffer and move forward to read further.
+			 */
+			bytes_remaining = XLOG_BLCKSZ - (recptr - page);
+			memcpy(dst, recptr, bytes_remaining);
+			dst += bytes_remaining;
+			nbytes -= bytes_remaining;
+			*read_bytes += bytes_remaining;
+			ptr += bytes_remaining;
+		}
+	}
+
+	elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+		 *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index a38a80e049..4a2e7af169 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1485,8 +1485,7 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1497,6 +1496,50 @@ WALRead(XLogReaderState *state,
 	XLogRecPtr	recptr;
 	Size		nbytes;
 
+#ifndef FRONTEND
+	/* Frontend tools have no idea of WAL buffers. */
+	Size        read_bytes;
+
+	/*
+	 * When possible, read WAL from WAL buffers. We skip this step and continue
+	 * the usual way, that is to read from WAL file, either when the server is
+	 * in recovery (standby mode, archive or crash recovery), in which case the
+	 * WAL buffers are not used or when the server is inserting in a different
+	 * timeline from that of the timeline that we're trying to read WAL from.
+	 */
+	if (!RecoveryInProgress() &&
+		tli == GetWALInsertionTimeLine())
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		XLogReadFromBuffers(startptr, tli, count, buf, &read_bytes);
+		pgstat_report_wait_end();
+
+		/*
+		 * Check if we have read fully (hit), partially (partial hit) or
+		 * nothing (miss) from WAL buffers. If we have read either partially or
+		 * nothing, then continue to read the remaining bytes the usual way,
+		 * that is, read from WAL file.
+		 */
+		if (count == read_bytes)
+		{
+			/* Buffer hit, so return. */
+			return true;
+		}
+		else if (read_bytes > 0 && count > read_bytes)
+		{
+			/*
+			 * Buffer partial hit, so reset the state to count the read bytes
+			 * and continue.
+			 */
+			buf += read_bytes;
+			startptr += read_bytes;
+			count -= read_bytes;
+		}
+
+		/* Buffer miss i.e., read_bytes = 0, so continue */
+	}
+#endif	/* FRONTEND */
+
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 1fbd48fbda..f4e1c46b23 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -247,6 +247,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern void XLogReadFromBuffers(XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf,
+								Size *read_bytes);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1
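
The page-by-page copy that XLogReadFromBuffers() performs in the patch
above can be modelled outside the server as follows. This is a simplified
sketch, not the patch's code: the page source is a hypothetical callback
(toy_page_fetch_fn), the page size is shrunk to 16 bytes so the behaviour
is easy to follow, and the two branches of the original loop are folded
into a single "copy at most what is left on this page" step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 16u          /* tiny page size to keep the example small */

/* Hypothetical page source: fills 'page' with the page holding 'ptr'. */
typedef bool (*toy_page_fetch_fn) (uint64_t ptr, char *page, void *arg);

/*
 * Copy up to 'count' bytes starting at 'ptr' into 'dst', one page at a
 * time, and stop at the first page the source cannot provide. Returns the
 * number of bytes copied; the caller reads any remainder elsewhere.
 */
static size_t
toy_read_from_buffers(uint64_t ptr, size_t count, char *dst,
                      toy_page_fetch_fn fetch, void *arg)
{
    size_t total = 0;

    assert(dst != NULL && fetch != NULL);

    while (count > 0)
    {
        char   page[TOY_BLCKSZ];
        size_t off = (size_t) (ptr % TOY_BLCKSZ);
        size_t chunk = TOY_BLCKSZ - off;    /* bytes left on this page */

        if (!fetch(ptr, page, arg))
            break;

        if (chunk > count)
            chunk = count;

        memcpy(dst, page + off, chunk);
        dst += chunk;
        ptr += chunk;
        count -= chunk;
        total += chunk;
    }
    return total;
}

/* Example source: only the first *limit pages exist; page n holds 'a' + n. */
static bool
toy_fetch_first_pages(uint64_t ptr, char *page, void *arg)
{
    uint64_t pageno = ptr / TOY_BLCKSZ;
    unsigned limit = *(const unsigned *) arg;

    if (pageno >= limit)
        return false;
    memset(page, 'a' + (int) pageno, TOY_BLCKSZ);
    return true;
}
```

With two pages available, a 30-byte request starting at offset 10 copies
6 bytes from the first page and 16 from the second, then stops; the
caller would fetch the remaining 8 bytes from the file.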

USE-ON-PATCH-Collect-WAL-read-from-buffers-and-file-stats.txt (text/plain; charset=US-ASCII)
From f90dfcbd1968280feec6d116568697225854ac40 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Mon, 26 Dec 2022 08:13:07 +0000
Subject: [PATCH v2] Collect WAL read from buffers and file stats for WAL
 senders

---
 doc/src/sgml/monitoring.sgml                | 61 +++++++++++++++
 src/backend/access/transam/xlogreader.c     | 56 +++++++++++++-
 src/backend/access/transam/xlogutils.c      |  2 +-
 src/backend/catalog/system_views.sql        |  8 +-
 src/backend/replication/walsender.c         | 85 ++++++++++++++++++++-
 src/bin/pg_waldump/pg_waldump.c             |  2 +-
 src/include/access/xlogreader.h             | 30 +++++++-
 src/include/catalog/pg_proc.dat             |  6 +-
 src/include/replication/walsender_private.h |  4 +
 src/test/regress/expected/rules.out         | 10 ++-
 10 files changed, 246 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 363b183e5f..239e0b0db9 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -2615,6 +2615,67 @@ SELECT pid, wait_event_type, wait_event FROM pg_stat_activity WHERE wait_event i
        Send time of last reply message received from standby server
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_read</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL data is read from disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_read_bytes</structfield> <type>numeric</type>
+      </para>
+      <para>
+       Total amount of WAL read from disk in bytes
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_read_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent reading WAL from disk via
+       <function>WALRead</function> request, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
+       otherwise zero).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_read_buffers</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL data is read from WAL buffers
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_read_bytes_buffers</structfield> <type>numeric</type>
+      </para>
+      <para>
+       Total amount of WAL read from WAL buffers in bytes
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_read_time_buffers</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent reading WAL from WAL buffers via
+       <function>WALRead</function> request, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
+       otherwise zero).
+      </para></entry>
+     </row>
+
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 4a2e7af169..b9dfd4fde7 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -31,6 +31,7 @@
 #include "access/xlogrecord.h"
 #include "catalog/pg_control.h"
 #include "common/pg_lzcompress.h"
+#include "portability/instr_time.h"
 #include "replication/origin.h"
 
 #ifndef FRONTEND
@@ -1488,9 +1489,9 @@ err:
  * When possible, this function reads data directly from WAL buffers.
  */
 bool
-WALRead(XLogReaderState *state,
-		char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
-		WALReadError *errinfo)
+WALRead(XLogReaderState *state, char *buf, XLogRecPtr startptr, Size count,
+		TimeLineID tli, WALReadError *errinfo, WALReadStats *stats,
+		bool capture_wal_io_timing)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -1510,10 +1511,33 @@ WALRead(XLogReaderState *state,
 	if (!RecoveryInProgress() &&
 		tli == GetWALInsertionTimeLine())
 	{
+		instr_time      start;
+
+		/* Measure I/O timing to read WAL data if requested by the caller. */
+		if (stats != NULL && capture_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 		XLogReadFromBuffers(startptr, tli, count, buf, &read_bytes);
 		pgstat_report_wait_end();
 
+		/* Collect I/O stats if requested by the caller. */
+		if (stats != NULL && read_bytes > 0)
+		{
+			stats->wal_read_buffers++;
+			stats->wal_read_bytes_buffers += read_bytes;
+
+			/* Increment the I/O timing. */
+			if (capture_wal_io_timing)
+			{
+				instr_time      duration;
+
+				INSTR_TIME_SET_CURRENT(duration);
+				INSTR_TIME_SUBTRACT(duration, start);
+				stats->wal_read_time_buffers += INSTR_TIME_GET_MICROSEC(duration);
+			}
+		}
+
 		/*
 		 * Check if we have read fully (hit), partially (partial hit) or
 		 * nothing (miss) from WAL buffers. If we have read either partially or
@@ -1549,6 +1573,7 @@ WALRead(XLogReaderState *state,
 		uint32		startoff;
 		int			segbytes;
 		int			readbytes;
+		instr_time	start;
 
 		startoff = XLogSegmentOffset(recptr, state->segcxt.ws_segsize);
 
@@ -1583,6 +1608,10 @@ WALRead(XLogReaderState *state,
 		else
 			segbytes = nbytes;
 
+		/* Measure I/O timing to read WAL data if requested by the caller. */
+		if (stats != NULL && capture_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
 #ifndef FRONTEND
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 #endif
@@ -1595,6 +1624,27 @@ WALRead(XLogReaderState *state,
 		pgstat_report_wait_end();
 #endif
 
+		/* Collect I/O stats if requested by the caller. */
+		if (stats != NULL)
+		{
+			/* Increment the number of times WAL is read from disk. */
+			stats->wal_read++;
+
+			/* Collect bytes read. */
+			if (readbytes > 0)
+				stats->wal_read_bytes += readbytes;
+
+			/* Increment the I/O timing. */
+			if (capture_wal_io_timing)
+			{
+				instr_time	duration;
+
+				INSTR_TIME_SET_CURRENT(duration);
+				INSTR_TIME_SUBTRACT(duration, start);
+				stats->wal_read_time += INSTR_TIME_GET_MICROSEC(duration);
+			}
+		}
+
 		if (readbytes <= 0)
 		{
 			errinfo->wre_errno = errno;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 563cba258d..372de2c7d8 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -1027,7 +1027,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
 	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
-				 &errinfo))
+				 &errinfo, NULL, false))
 		WALReadRaiseError(&errinfo);
 
 	/* number of valid bytes in the buffer */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 2d8104b090..bf6315df27 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -892,7 +892,13 @@ CREATE VIEW pg_stat_replication AS
             W.replay_lag,
             W.sync_priority,
             W.sync_state,
-            W.reply_time
+            W.reply_time,
+            W.wal_read,
+            W.wal_read_bytes,
+            W.wal_read_time,
+            W.wal_read_buffers,
+            W.wal_read_bytes_buffers,
+            W.wal_read_time_buffers
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index c11bb3716f..d3393b2b63 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -259,7 +259,7 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
 							  TimeLineID *tli_p);
-
+static void WalSndAccumulateWalReadStats(WALReadStats *stats);
 
 /* Initialize walsender process before entering the main command loop */
 void
@@ -907,6 +907,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	WALReadError errinfo;
 	XLogSegNo	segno;
 	TimeLineID	currTLI = GetWALInsertionTimeLine();
+	WALReadStats	stats;
 
 	/*
 	 * Since logical decoding is only permitted on a primary server, we know
@@ -932,6 +933,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	else
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
+	MemSet(&stats, 0, sizeof(WALReadStats));
+
 	/* now actually read the data, we know it's there */
 	if (!WALRead(state,
 				 cur_page,
@@ -940,9 +943,13 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 				 state->seg.ws_tli, /* Pass the current TLI because only
 									 * WalSndSegmentOpen controls whether new
 									 * TLI is needed. */
-				 &errinfo))
+				 &errinfo,
+				 &stats,
+				 track_wal_io_timing))
 		WALReadRaiseError(&errinfo);
 
+	WalSndAccumulateWalReadStats(&stats);
+
 	/*
 	 * After reading into the buffer, check that what we read was valid. We do
 	 * this after reading, because even though the segment was present when we
@@ -2610,6 +2617,12 @@ InitWalSenderSlot(void)
 			walsnd->sync_standby_priority = 0;
 			walsnd->latch = &MyProc->procLatch;
 			walsnd->replyTime = 0;
+			walsnd->wal_read_stats.wal_read = 0;
+			walsnd->wal_read_stats.wal_read_bytes = 0;
+			walsnd->wal_read_stats.wal_read_time = 0;
+			walsnd->wal_read_stats.wal_read_buffers = 0;
+			walsnd->wal_read_stats.wal_read_bytes_buffers = 0;
+			walsnd->wal_read_stats.wal_read_time_buffers = 0;
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -2730,6 +2743,7 @@ XLogSendPhysical(void)
 	Size		nbytes;
 	XLogSegNo	segno;
 	WALReadError errinfo;
+	WALReadStats stats;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -2945,6 +2959,8 @@ XLogSendPhysical(void)
 	enlargeStringInfo(&output_message, nbytes);
 
 retry:
+	MemSet(&stats, 0, sizeof(WALReadStats));
+
 	if (!WALRead(xlogreader,
 				 &output_message.data[output_message.len],
 				 startptr,
@@ -2952,9 +2968,13 @@ retry:
 				 xlogreader->seg.ws_tli,	/* Pass the current TLI because
 											 * only WalSndSegmentOpen controls
 											 * whether new TLI is needed. */
-				 &errinfo))
+				 &errinfo,
+				 &stats,
+				 track_wal_io_timing))
 		WALReadRaiseError(&errinfo);
 
+	WalSndAccumulateWalReadStats(&stats);
+
 	/* See logical_read_xlog_page(). */
 	XLByteToSeg(startptr, segno, xlogreader->segcxt.ws_segsize);
 	CheckXLogRemoved(segno, xlogreader->seg.ws_tli);
@@ -3458,7 +3478,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	12
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	SyncRepStandbyData *sync_standbys;
 	int			num_standbys;
@@ -3487,9 +3507,16 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		WalSndState state;
 		TimestampTz replyTime;
 		bool		is_sync_standby;
+		int64		wal_read;
+		uint64		wal_read_bytes;
+		int64		wal_read_time;
+		int64		wal_read_buffers;
+		uint64		wal_read_bytes_buffers;
+		int64		wal_read_time_buffers;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS] = {0};
 		int			j;
+		char		buf[256];
 
 		/* Collect data from shared memory */
 		SpinLockAcquire(&walsnd->mutex);
@@ -3509,6 +3536,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		replyTime = walsnd->replyTime;
+		wal_read = walsnd->wal_read_stats.wal_read;
+		wal_read_bytes = walsnd->wal_read_stats.wal_read_bytes;
+		wal_read_time = walsnd->wal_read_stats.wal_read_time;
+		wal_read_buffers = walsnd->wal_read_stats.wal_read_buffers;
+		wal_read_bytes_buffers = walsnd->wal_read_stats.wal_read_bytes_buffers;
+		wal_read_time_buffers = walsnd->wal_read_stats.wal_read_time_buffers;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3605,6 +3638,31 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 				nulls[11] = true;
 			else
 				values[11] = TimestampTzGetDatum(replyTime);
+
+			values[12] = Int64GetDatum(wal_read);
+
+			/* Convert to numeric. */
+			snprintf(buf, sizeof buf, UINT64_FORMAT, wal_read_bytes);
+			values[13] = DirectFunctionCall3(numeric_in,
+											 CStringGetDatum(buf),
+											 ObjectIdGetDatum(0),
+											 Int32GetDatum(-1));
+
+			/* Convert counter from microsec to millisec for display. */
+			values[14] = Float8GetDatum(((double) wal_read_time) / 1000.0);
+
+			values[15] = Int64GetDatum(wal_read_buffers);
+
+			/* Convert to numeric. */
+			MemSet(buf, '\0', sizeof buf);
+			snprintf(buf, sizeof buf, UINT64_FORMAT, wal_read_bytes_buffers);
+			values[16] = DirectFunctionCall3(numeric_in,
+											 CStringGetDatum(buf),
+											 ObjectIdGetDatum(0),
+											 Int32GetDatum(-1));
+
+			/* Convert counter from microsec to millisec for display. */
+			values[17] = Float8GetDatum(((double) wal_read_time_buffers) / 1000.0);
 		}
 
 		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
@@ -3849,3 +3907,22 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+/*
+ * Function to accumulate WAL Read stats for WAL sender.
+ */
+static void
+WalSndAccumulateWalReadStats(WALReadStats *stats)
+{
+	/* Collect I/O stats for walsender. */
+	SpinLockAcquire(&MyWalSnd->mutex);
+	MyWalSnd->wal_read_stats.wal_read += stats->wal_read;
+	MyWalSnd->wal_read_stats.wal_read_bytes += stats->wal_read_bytes;
+	MyWalSnd->wal_read_stats.wal_read_time += stats->wal_read_time;
+	MyWalSnd->wal_read_stats.wal_read_buffers += stats->wal_read_buffers;
+	MyWalSnd->wal_read_stats.wal_read_bytes_buffers +=
+									stats->wal_read_bytes_buffers;
+	MyWalSnd->wal_read_stats.wal_read_time_buffers +=
+									stats->wal_read_time_buffers;
+	SpinLockRelease(&MyWalSnd->mutex);
+}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index 9993378ca5..698ce1e9f7 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -364,7 +364,7 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 	}
 
 	if (!WALRead(state, readBuff, targetPagePtr, count, private->timeline,
-				 &errinfo))
+				 &errinfo, NULL, false))
 	{
 		WALOpenSegment *seg = &errinfo.wre_seg;
 		char		fname[MAXPGPATH];
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index e87f91316a..9287114779 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -389,9 +389,33 @@ typedef struct WALReadError
 	WALOpenSegment wre_seg;		/* Segment we tried to read from. */
 } WALReadError;
 
-extern bool WALRead(XLogReaderState *state,
-					char *buf, XLogRecPtr startptr, Size count,
-					TimeLineID tli, WALReadError *errinfo);
+/*
+ * WAL read stats from WALRead that the callers can use.
+ */
+typedef struct WALReadStats
+{
+	/* Number of times WAL read from disk. */
+	int64	wal_read;
+
+	/* Total amount of WAL read from disk in bytes. */
+	uint64	wal_read_bytes;
+
+	/* Total amount of time spent reading WAL from disk. */
+	int64 	wal_read_time;
+
+	/* Number of times WAL read from WAL buffers. */
+	int64   wal_read_buffers;
+
+	/* Total amount of WAL read from WAL buffers in bytes. */
+	uint64  wal_read_bytes_buffers;
+
+	/* Total amount of time spent reading WAL from WAL buffers. */
+	int64   wal_read_time_buffers;
+} WALReadStats;
+
+extern bool WALRead(XLogReaderState *state, char *buf, XLogRecPtr startptr,
+					Size count, TimeLineID tli, WALReadError *errinfo,
+					WALReadStats *stats, bool capture_wal_io_timing);
 
 /* Functions for decoding an XLogRecord */
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index 7056c95371..706a005c2b 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5391,9 +5391,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,numeric,float8,int8,numeric,float8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,wal_read,wal_read_bytes,wal_read_time,wal_read_buffers,wal_read_bytes_buffers,wal_read_time_buffers}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 7897c74589..35413ea0d2 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -13,6 +13,7 @@
 #define _WALSENDER_PRIVATE_H
 
 #include "access/xlog.h"
+#include "access/xlogreader.h"
 #include "nodes/nodes.h"
 #include "replication/syncrep.h"
 #include "storage/latch.h"
@@ -78,6 +79,9 @@ typedef struct WalSnd
 	 * Timestamp of the last message received from standby.
 	 */
 	TimestampTz replyTime;
+
+	/* WAL read stats for walsender. */
+	WALReadStats wal_read_stats;
 } WalSnd;
 
 extern PGDLLIMPORT WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index fb9f936d43..6ae65981c2 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2054,9 +2054,15 @@ pg_stat_replication| SELECT s.pid,
     w.replay_lag,
     w.sync_priority,
     w.sync_state,
-    w.reply_time
+    w.reply_time,
+    w.wal_read,
+    w.wal_read_bytes,
+    w.wal_read_time,
+    w.wal_read_buffers,
+    w.wal_read_bytes_buffers,
+    w.wal_read_time_buffers
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, leader_pid, query_id)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, wal_read, wal_read_bytes, wal_read_time, wal_read_buffers, wal_read_bytes_buffers, wal_read_time_buffers) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_replication_slots| SELECT s.slot_name,
     s.spill_txns,
-- 
2.34.1

#8Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#7)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:

Please review the attached v2 patch further.

I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?

Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?

--
Jeff Davis
PostgreSQL Contributor Team - AWS

#9Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#8)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hi,

On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:

On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:

Please review the attached v2 patch further.

I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?

Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?

One benefit would be that it'd make it more realistic to use direct IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a win.

Greetings,

Andres Freund

#10SATYANARAYANA NARLAPURAM
satyanarlapuram@gmail.com
In reply to: Andres Freund (#9)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Sat, Jan 14, 2023 at 12:34 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:

On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:

Please review the attached v2 patch further.

I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?

Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?

One benefit would be that it'd make it more realistic to use direct IO for
WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a
win.

+1. Archive modules rely on reading the WAL files for PITR. Direct IO for
WAL requires reading these files from disk anyway for archival. However,
archiving using utilities like pg_receivewal can take advantage of this
patch together with direct IO for WAL.

Thanks,
Satya

#11Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#9)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hi,

On 2023-01-14 12:34:03 -0800, Andres Freund wrote:

On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:

On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:

Please review the attached v2 patch further.

I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?

Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?

One benefit would be that it'd make it more realistic to use direct IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a win.

Satya's email just now reminded me of another important reason:

Eventually we should add the ability to stream out WAL *before* it has locally
been written out and flushed. Obviously the relevant positions would have to
be noted in the relevant message in the streaming protocol, and we couldn't
generally allow standbys to apply that data yet.

That'd allow us to significantly reduce the overhead of synchronous
replication, because instead of commonly needing to send out all the pending
WAL at commit, we'd just need to send out the updated flush position. The
reason this would lower the overhead is that:

a) The reduced amount of data to be transferred reduces latency - it's easy to
accumulate a few TCP packets worth of data even in a single small OLTP
transaction
b) The remote side can start to write out data earlier

Of course this would require additional infrastructure on the receiver
side. E.g. some persistent state indicating up to where WAL is allowed to be
applied, to avoid the standby getting ahead of the primary, in case the
primary crash-restarts (or has more severe issues).

With a bit of work we could perform WAL replay on standby without waiting for
the fdatasync of the received WAL - that only needs to happen when a) we need
to confirm a flush position to the primary b) when we need to write back pages
from the buffer pool (and some other things).

Greetings,

Andres Freund

#12Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#11)
1 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Thu, Jan 26, 2023 at 2:45 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-01-14 12:34:03 -0800, Andres Freund wrote:

On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:

On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:

Please review the attached v2 patch further.

I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?

Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?

One benefit would be that it'd make it more realistic to use direct IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a win.

Satya's email just now reminded me of another important reason:

Eventually we should add the ability to stream out WAL *before* it has locally
been written out and flushed. Obviously the relevant positions would have to
be noted in the relevant message in the streaming protocol, and we couldn't
generally allow standbys to apply that data yet.

That'd allow us to significantly reduce the overhead of synchronous
replication, because instead of commonly needing to send out all the pending
WAL at commit, we'd just need to send out the updated flush position. The
reason this would lower the overhead is that:

a) The reduced amount of data to be transferred reduces latency - it's easy to
accumulate a few TCP packets worth of data even in a single small OLTP
transaction
b) The remote side can start to write out data earlier

Of course this would require additional infrastructure on the receiver
side. E.g. some persistent state indicating up to where WAL is allowed to be
applied, to avoid the standby getting ahead of the primary, in case the
primary crash-restarts (or has more severe issues).

With a bit of work we could perform WAL replay on standby without waiting for
the fdatasync of the received WAL - that only needs to happen when a) we need
to confirm a flush position to the primary b) when we need to write back pages
from the buffer pool (and some other things).

Thanks Andres, Jeff and Satya for taking a look at the thread. Andres
is right, the eventual plan is to do a bunch of other stuff as
described above and we've discussed this in another thread (see
below). I would like to once again clarify the motivation behind this
feature:

1. It enables WAL readers (callers of WALRead() - wal senders,
pg_walinspect etc.) to use WAL buffers as first level cache which
might reduce number of IOPS at a peak load especially when the pread()
results in a disk read (WAL isn't available in OS page cache). I had
earlier presented the buffer hit ratio/amount of pread() system calls
reduced with wal senders in the first email of this thread (95% of the
time wal senders are able to read from WAL buffers without impacting
anybody). Now, here are the results with the WAL DIO patch [1] - where
WAL pread() turns into a disk read, see the results [2] and attached
graph.

2. As Andres rightly mentioned, it helps WAL DIO; since there's no OS
page cache, using WAL buffers as read cache helps a lot. It is clearly
evident from my experiment with WAL DIO patch [1], see the results [2]
and attached graph. As expected, WAL DIO brings down the TPS, whereas
WAL buffers read i.e. this patch brings it up.

3. As Andres rightly mentioned above, it enables flushing WAL in
parallel on primary and all standbys [3]. I haven't yet started work
on this, I will aim for PG 17.

4. It will make the work on - disallow async standbys or subscribers
getting ahead of the sync standbys [3] possible. I haven't yet started
work on this, I will aim for PG 17.

5. It implements the following TODO item specified near WALRead():
* XXX probably this should be improved to suck data directly from the
* WAL buffers when possible.
*/
bool
WALRead(XLogReaderState *state,

That said, this feature is separately reviewable and perhaps can go
separately as it has its own benefits.
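
The "first level cache" idea in point 1 can be pictured with a small standalone model. This is a simplified sketch and not the patch's XLogReadFromBuffers(): LogCache, read_from_cache() and read_log() are hypothetical names, the cache is assumed to hold one contiguous window of the log, and the "file" is a plain in-memory array. It only shows the hit / partial hit / miss handling: whatever prefix of the request the cache can serve is copied from it, and the remainder falls back to the file.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical in-memory cache covering log positions
 * [cache_start, cache_start + cache_len).
 */
typedef struct LogCache
{
	size_t		cache_start;
	size_t		cache_len;
	const char *data;
} LogCache;

/*
 * Copy as much of [startptr, startptr + count) as the cache holds,
 * beginning at startptr.  Returns the number of bytes copied:
 * count means hit, 0 means miss, anything in between is a partial hit.
 */
size_t
read_from_cache(const LogCache *cache, size_t startptr, size_t count,
				char *buf)
{
	size_t		cache_end;
	size_t		avail;

	assert(cache != NULL && buf != NULL);

	cache_end = cache->cache_start + cache->cache_len;
	if (startptr < cache->cache_start || startptr >= cache_end)
		return 0;

	avail = cache_end - startptr;
	if (avail > count)
		avail = count;
	memcpy(buf, cache->data + (startptr - cache->cache_start), avail);
	return avail;
}

/*
 * Read count bytes at startptr: try the cache first, then take the rest
 * from the file image.  Reports how many bytes came from each source.
 */
void
read_log(const LogCache *cache, const char *file_image, size_t startptr,
		 size_t count, char *buf, size_t *from_cache, size_t *from_file)
{
	size_t		n = read_from_cache(cache, startptr, count, buf);

	*from_cache = n;
	*from_file = count - n;
	if (n < count)
		memcpy(buf + n, file_image + startptr + n, count - n);
}
```

The per-source byte counts returned here correspond loosely to the buffer and file counters discussed earlier in the thread.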

[1]: /messages/by-id/CA+hUKGLmeyrDcUYAty90V_YTcoo5kAFfQjRQ-_1joS_=X7HztA@mail.gmail.com

[2]: Test case is an insert pgbench workload (values are TPS).
clients  HEAD    WAL DIO  WAL DIO & WAL BUFFERS READ  WAL BUFFERS READ
1        1404    1070     1424                        1375
2        1487    796      1454                        1517
4        3064    1743     3011                        3019
8        6114    3556     6026                        5954
16       11560   7051     12216                       12132
32       23181   13079    23449                       23561
64       43607   26983    43997                       45636
128      80723   45169    81515                       81911
256      110925  90185    107332                      114046
512      119354  109817   110287                      117506
768      112435  105795   106853                      111605
1024     107554  105541   105942                      109370
2048     88552   79024    80699                       90555
4096     61323   54814    58704                       61743

[3]: /messages/by-id/20220309020123.sneaoijlg3rszvst@alap3.anarazel.de
/messages/by-id/CALj2ACXCSM+sTR=5NNRtmSQr3g1Vnr-yR91azzkZCaCJ7u4d4w@mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

WALDIO&WALBUFFERSREADCOMPARISON.png (image/png)

#13Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Bharath Rupireddy (#12)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Thu, Jan 26, 2023 at 2:33 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Thu, Jan 26, 2023 at 2:45 AM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-01-14 12:34:03 -0800, Andres Freund wrote:

On 2023-01-14 00:48:52 -0800, Jeff Davis wrote:

On Mon, 2022-12-26 at 14:20 +0530, Bharath Rupireddy wrote:

Please review the attached v2 patch further.

I'm still unclear on the performance goals of this patch. I see that it
will reduce syscalls, which sounds good, but to what end?

Does it allow a greater number of walsenders? Lower replication
latency? Less IO bandwidth? All of the above?

One benefit would be that it'd make it more realistic to use direct IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a win.

Satya's email just now reminded me of another important reason:

Eventually we should add the ability to stream out WAL *before* it has locally
been written out and flushed. Obviously the relevant positions would have to
be noted in the relevant message in the streaming protocol, and we couldn't
generally allow standbys to apply that data yet.

That'd allow us to significantly reduce the overhead of synchronous
replication, because instead of commonly needing to send out all the pending
WAL at commit, we'd just need to send out the updated flush position. The
reason this would lower the overhead is that:

a) The reduced amount of data to be transferred reduces latency - it's easy to
accumulate a few TCP packets worth of data even in a single small OLTP
transaction
b) The remote side can start to write out data earlier

Of course this would require additional infrastructure on the receiver
side. E.g. some persistent state indicating up to where WAL is allowed to be
applied, to avoid the standby getting ahead of the primary, in case the
primary crash-restarts (or has more severe issues).

With a bit of work we could perform WAL replay on standby without waiting for
the fdatasync of the received WAL - that only needs to happen when a) we need
to confirm a flush position to the primary b) when we need to write back pages
from the buffer pool (and some other things).

Thanks Andres, Jeff and Satya for taking a look at the thread. Andres
is right, the eventual plan is to do a bunch of other stuff as
described above and we've discussed this in another thread (see
below). I would like to once again clarify motivation behind this
feature:

1. It enables WAL readers (callers of WALRead() - wal senders,
pg_walinspect etc.) to use WAL buffers as first level cache which
might reduce number of IOPS at a peak load especially when the pread()
results in a disk read (WAL isn't available in OS page cache). I had
earlier presented the buffer hit ratio/amount of pread() system calls
reduced with wal senders in the first email of this thread (95% of the
time wal senders are able to read from WAL buffers without impacting
anybody). Now, here are the results with the WAL DIO patch [1] - where
WAL pread() turns into a disk read, see the results [2] and attached
graph.

2. As Andres rightly mentioned, it helps WAL DIO; since there's no OS
page cache, using WAL buffers as read cache helps a lot. It is clearly
evident from my experiment with WAL DIO patch [1], see the results [2]
and attached graph. As expected, WAL DIO brings down the TPS, whereas
WAL buffers read i.e. this patch brings it up.
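
The "WAL buffers as first level cache" idea in point 1 can be sketched as a read-through lookup: try the in-memory buffers first, and go to the file only for whatever they could not serve. This is a toy model under stated assumptions, not the patch's code. All `toy_*` names, the fixed-size arrays and the read counter are hypothetical stand-ins for the WAL buffers, the WAL segment files and pread().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_WAL_SIZE 64             /* bytes of "WAL" in this toy model */

static char     toy_file[TOY_WAL_SIZE];     /* stand-in for WAL segment files */
static char     toy_buffers[TOY_WAL_SIZE];  /* stand-in for WAL buffers */
static uint64_t toy_buf_start;      /* first position still buffered */
static uint64_t toy_buf_end;        /* one past the last buffered position */
static int      toy_file_reads;     /* number of simulated file reads */

/* Fill the toy WAL and declare [buf_start, buf_end) as still buffered. */
static void
toy_setup(uint64_t buf_start, uint64_t buf_end)
{
    assert(buf_start <= buf_end && buf_end <= TOY_WAL_SIZE);
    for (size_t i = 0; i < TOY_WAL_SIZE; i++)
        toy_file[i] = (char) ('a' + i % 26);
    memcpy(toy_buffers, toy_file, TOY_WAL_SIZE);
    toy_buf_start = buf_start;
    toy_buf_end = buf_end;
    toy_file_reads = 0;
}

/* Copy as much as the buffers hold, starting at 'start'; 0 means a miss. */
static size_t
toy_read_from_buffers(uint64_t start, size_t count, char *dst)
{
    size_t      n;

    if (start < toy_buf_start || start >= toy_buf_end)
        return 0;               /* first requested byte is not buffered */

    n = (size_t) (toy_buf_end - start);
    if (n > count)
        n = count;
    memcpy(dst, toy_buffers + start, n);
    return n;
}

/* Simulated file read; counted so hits and misses can be compared. */
static void
toy_read_from_file(uint64_t start, size_t count, char *dst)
{
    memcpy(dst, toy_file + start, count);
    toy_file_reads++;
}

/* Read-through: buffers first, file only for the remainder. */
static void
toy_wal_read(uint64_t start, size_t count, char *dst)
{
    size_t      hit;

    assert(start + count <= TOY_WAL_SIZE);
    hit = toy_read_from_buffers(start, count, dst);
    if (hit < count)
        toy_read_from_file(start + hit, count - hit, dst + hit);
}
```

With the buffered range set to positions 32..47, a read fully inside that range performs no file read, while a read starting before it goes entirely to the file.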

[2] Test case is an insert pgbench workload (values are TPS).
clients  HEAD    WAL DIO  WAL DIO & WAL BUFFERS READ  WAL BUFFERS READ
1        1404    1070     1424                        1375
2        1487    796      1454                        1517
4        3064    1743     3011                        3019
8        6114    3556     6026                        5954
16       11560   7051     12216                       12132
32       23181   13079    23449                       23561
64       43607   26983    43997                       45636
128      80723   45169    81515                       81911
256      110925  90185    107332                      114046
512      119354  109817   110287                      117506
768      112435  105795   106853                      111605
1024     107554  105541   105942                      109370
2048     88552   79024    80699                       90555
4096     61323   54814    58704                       61743

If I'm understanding this result correctly, it seems to me that your
patch works well with the WAL DIO patch (WALDIO vs. WAL DIO & WAL
BUFFERS READ), but there seems no visible performance gain with only
your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your
patch should be included in the WAL DIO patch rather than applying it
alone. Am I missing something?
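
One way to put numbers on that comparison is to compute the relative TPS change between two columns of the table. The sketch below is illustrative only: the `TpsRow` type and `pct_change()` helper are hypothetical names, and the two rows are copied from the table above (64 and 512 clients).

```c
#include <assert.h>

/* One row of the results table: TPS per configuration at a client count. */
typedef struct
{
    int         clients;
    double      head;                   /* HEAD */
    double      wal_dio;                /* WAL DIO */
    double      wal_dio_buffers_read;   /* WAL DIO & WAL BUFFERS READ */
    double      buffers_read;           /* WAL BUFFERS READ */
} TpsRow;

/* Rows copied from the table above. */
static const TpsRow tps_rows[] = {
    {64, 43607, 26983, 43997, 45636},
    {512, 119354, 109817, 110287, 117506},
};

/* Relative change from 'base' to 'with_patch', in percent. */
static double
pct_change(double base, double with_patch)
{
    assert(base > 0.0);
    return (with_patch - base) / base * 100.0;
}
```

Calling `pct_change(row.wal_dio, row.wal_dio_buffers_read)` gives the effect of the patch on top of WAL DIO, and `pct_change(row.head, row.buffers_read)` gives its effect on its own, so the two comparisons can be read side by side for any row.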

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#14Andres Freund
andres@anarazel.de
In reply to: Masahiko Sawada (#13)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hi,

On 2023-01-27 14:24:51 +0900, Masahiko Sawada wrote:

If I'm understanding this result correctly, it seems to me that your
patch works well with the WAL DIO patch (WALDIO vs. WAL DIO & WAL
BUFFERS READ), but there seems no visible performance gain with only
your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your
patch should be included in the WAL DIO patch rather than applying it
alone. Am I missing something?

We already support using DIO for WAL - it's just restricted in a way that
makes it practically not usable. And the reason for that is precisely that
walsenders need to read the WAL. See get_sync_bit():

    /*
     * Optimize writes by bypassing kernel cache with O_DIRECT when using
     * O_SYNC and O_DSYNC. But only if archiving and streaming are disabled,
     * otherwise the archive command or walsender process will read the WAL
     * soon after writing it, which is guaranteed to cause a physical read if
     * we bypassed the kernel cache. We also skip the
     * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
     * reason.
     *
     * Never use O_DIRECT in walreceiver process for similar reasons; the WAL
     * written by walreceiver is normally read by the startup process soon
     * after it's written. Also, walreceiver performs unaligned writes, which
     * don't work with O_DIRECT, so it is required for correctness too.
     */
    if (!XLogIsNeeded() && !AmWalReceiverProcess())
        o_direct_flag = PG_O_DIRECT;

Even if that weren't the case, splitting up bigger commits in incrementally
committable chunks is a good idea.

Greetings,

Andres Freund

#15Masahiko Sawada
sawada.mshk@gmail.com
In reply to: Andres Freund (#14)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, Jan 27, 2023 at 3:17 PM Andres Freund <andres@anarazel.de> wrote:

Hi,

On 2023-01-27 14:24:51 +0900, Masahiko Sawada wrote:

If I'm understanding this result correctly, it seems to me that your
patch works well with the WAL DIO patch (WALDIO vs. WAL DIO & WAL
BUFFERS READ), but there seems no visible performance gain with only
your patch (HEAD vs. WAL BUFFERS READ). So it seems to me that your
patch should be included in the WAL DIO patch rather than applying it
alone. Am I missing something?

We already support using DIO for WAL - it's just restricted in a way that
makes it practically not usable. And the reason for that is precisely that
walsenders need to read the WAL. See get_sync_bit():

    /*
     * Optimize writes by bypassing kernel cache with O_DIRECT when using
     * O_SYNC and O_DSYNC. But only if archiving and streaming are disabled,
     * otherwise the archive command or walsender process will read the WAL
     * soon after writing it, which is guaranteed to cause a physical read if
     * we bypassed the kernel cache. We also skip the
     * posix_fadvise(POSIX_FADV_DONTNEED) call in XLogFileClose() for the same
     * reason.
     *
     * Never use O_DIRECT in walreceiver process for similar reasons; the WAL
     * written by walreceiver is normally read by the startup process soon
     * after it's written. Also, walreceiver performs unaligned writes, which
     * don't work with O_DIRECT, so it is required for correctness too.
     */
    if (!XLogIsNeeded() && !AmWalReceiverProcess())
        o_direct_flag = PG_O_DIRECT;

Even if that weren't the case, splitting up bigger commits in incrementally
committable chunks is a good idea.

Agreed. I was wondering about the fact that the test result doesn't
show things to satisfy the first motivation of this patch, which is to
improve performance by reducing disk I/O and system calls regardless
of the DIO patch. But it makes sense to me that this patch is a part
of the DIO patch series.

I'd like to confirm whether there is any performance regression caused
by this patch in some cases, especially when not using DIO.

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

#16Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Masahiko Sawada (#15)
3 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, Jan 27, 2023 at 12:16 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:

I'd like to confirm whether there is any performance regression caused
by this patch in some cases, especially when not using DIO.

Thanks. I ran some tests with a primary and 1 async standby; the
scripts are in [1]. Please see the numbers below and the attached
graphs. I've not noticed any regression; in fact, with the patch,
there's a slight improvement. Note that there's no WAL DIO involved here.

test-case 1:
clients  HEAD    PATCHED
1        139     156
2        624     599
4        3113    3410
8        6194    6433
16       11255   11722
32       22455   21658
64       46072   47103
128      80255   85970
256      110067  111488
512      114043  118094
768      109588  111892
1024     106144  109361
2048     85808   90745
4096     55911   53755

test-case 2:
clients  HEAD    PATCHED
1        177     128
2        186     425
4        2114    2946
8        5835    5840
16       10654   11199
32       14071   13959
64       18092   17519
128      27298   28274
256      24600   24843
512      17139   19450
768      16778   20473
1024     18294   20209
2048     12898   13920
4096     6399    6815

test-case 3:
clients  HEAD    PATCHED
1        148     191
2        302     317
4        3415    3243
8        5864    6193
16       9573    10267
32       14069   15819
64       17424   18453
128      24493   29192
256      33180   38250
512      35568   36551
768      29731   30317
1024     32291   32124
2048     27964   28933
4096     13702   15034

[1]:
cat << EOF >> data/postgresql.conf
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/archived_wal/%f'
EOF

test-case 1:
./pgbench -i -s 300 -d postgres
./psql -d postgres -c "ALTER TABLE pgbench_accounts DROP CONSTRAINT pgbench_accounts_pkey;"
cat << EOF >> insert.sql
\set aid random(1, 10 * :scale)
\set delta random(1, 100000 * :scale)
INSERT INTO pgbench_accounts (aid, bid, abalance) VALUES (:aid, :aid, :delta);
EOF
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do
  echo -n "$c "
  ./pgbench -n -M prepared -U ubuntu postgres -f insert.sql -c$c -j$c -T5 2>&1 | grep '^tps' | awk '{print $3}'
done

test-case 2:
./pgbench --initialize --scale=300 postgres
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do
  echo -n "$c "
  ./pgbench -n -M prepared -U ubuntu postgres -b tpcb-like -c$c -j$c -T5 2>&1 | grep '^tps' | awk '{print $3}'
done

test-case 3:
./pgbench --initialize --scale=300 postgres
for c in 1 2 4 8 16 32 64 128 256 512 768 1024 2048 4096; do
  echo -n "$c "
  ./pgbench -n -M prepared -U ubuntu postgres -b simple-update -c$c -j$c -T5 2>&1 | grep '^tps' | awk '{print $3}'
done

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

test_case_1.png (image/png)
test_case_2.png (image/png)
test_case_3.png (image/png)

#17Dilip Kumar
dilipbalaut@gmail.com
In reply to: Bharath Rupireddy (#7)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Mon, Dec 26, 2022 at 2:20 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I have gone through this patch and have some comments (mostly cosmetic,
plus a few about the code comments).

1.
+ /*
+ * We have found the WAL buffer page holding the given LSN. Read from a
+ * pointer to the right offset within the page.
+ */
+ memcpy(page, (XLogCtl->pages + idx * (Size) XLOG_BLCKSZ),
+    (Size) XLOG_BLCKSZ);

From the above comment, it appears that we read from the exact
pointer we are interested in, but we actually read the complete
page. I think this comment needs to be fixed, and we can also
explain why we read the complete page here.

2.
+static char *
+GetXLogBufferForRead(XLogRecPtr ptr, TimeLineID tli, char *page)
+{
+ XLogRecPtr expectedEndPtr;
+ XLogRecPtr endptr;
+ int idx;
+ char    *recptr = NULL;

Generally, we use the name 'recptr' for a variable of type XLogRecPtr,
but in your case it is actually the data at that recptr, so it would be
better to use some other name like 'buf' or 'buffer'.

3.
+ if ((recptr + nbytes) <= (page + XLOG_BLCKSZ))
+ {
+ /* All the bytes are in one page. */
+ memcpy(dst, recptr, nbytes);
+ dst += nbytes;
+ *read_bytes += nbytes;
+ ptr += nbytes;
+ nbytes = 0;
+ }
+ else if ((recptr + nbytes) > (page + XLOG_BLCKSZ))
+ {
+ /* All the bytes are not in one page. */
+ Size bytes_remaining;

Why do you have this 'else if ((recptr + nbytes) > (page +
XLOG_BLCKSZ))' check in the else part? It is simply the negation of
the 'if' condition, so why is it not a plain 'else'?

4.
+XLogReadFromBuffers(XLogRecPtr startptr,
+ TimeLineID tli,
+ Size count,
+ char *buf,
+ Size *read_bytes)

I think we do not need 2 separate variables 'count' and '*read_bytes';
just one variable for input/output is sufficient. The original value
can always be stored in some temp variable instead of the function
argument.
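
The branching in comment 3 can be sketched as follows. This is only an illustration under stated assumptions, not code from the patch: `toy_copy_across_pages` and `TOY_BLCKSZ` are hypothetical names, and the pages are laid out contiguously so the example stays self-contained. The point is that the "crosses the page end" case is the exact complement of the "fits in one page" case, so a plain `else` is enough.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLCKSZ 8        /* toy page size, far smaller than XLOG_BLCKSZ */

/*
 * Copy 'nbytes' starting at byte offset 'off' of a page-structured buffer
 * into 'dst', one page at a time.  Returns the number of bytes copied.
 */
static size_t
toy_copy_across_pages(const char *pages, size_t npages,
                      size_t off, size_t nbytes, char *dst)
{
    size_t      copied = 0;

    assert(off + nbytes <= npages * TOY_BLCKSZ);

    while (nbytes > 0)
    {
        size_t      in_page = off % TOY_BLCKSZ;     /* offset within the page */
        size_t      room = TOY_BLCKSZ - in_page;    /* bytes left on the page */
        size_t      chunk;

        if (nbytes <= room)
            chunk = nbytes;     /* all remaining bytes are in one page */
        else
            chunk = room;       /* complement: the request crosses the page end */

        memcpy(dst + copied, pages + off, chunk);
        copied += chunk;
        off += chunk;
        nbytes -= chunk;
    }

    return copied;
}
```

The return value doubles as the "bytes read" result; whether to report it through the return value, a separate output argument, or a single in/out argument is the API question raised in comment 4.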

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#18Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Dilip Kumar (#17)
1 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Tue, Feb 7, 2023 at 4:12 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Mon, Dec 26, 2022 at 2:20 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I have gone through this patch, I have some comments (mostly cosmetic
and comments)

Thanks a lot for reviewing.

From the above comments, it appears that we are reading from the exact
pointer we are interested to read, but actually, we are reading
the complete page. I think this comment needs to be fixed and we can
also explain why we read the complete page here.

I modified it. Please see the attached v3 patch.

Generally, we use the name 'recptr' to represent XLogRecPtr type of
variable, but in your case, it is actually data at that recptr, so
better use some other name like 'buf' or 'buffer'.

Changed it to 'data', which seemed more appropriate than 'buffer' and
avoids confusion with the WAL buffer page.

3.
+ if ((recptr + nbytes) <= (page + XLOG_BLCKSZ))
+ {
+ }
+ else if ((recptr + nbytes) > (page + XLOG_BLCKSZ))
+ {

Why do you have this 'else if ((recptr + nbytes) > (page +
XLOG_BLCKSZ))' check in the else part? why it is not directly else
without a condition in 'if'?

Changed.

I think we do not need 2 separate variables 'count' and '*read_bytes',
just one variable for input/output is sufficient. The original value
can always be stored in some temp variable
instead of the function argument.

We could do that, but for the sake of readability and not cluttering
the API, I kept it as-is.

Besides addressing the above review comments, I've made some more
changes - 1) I optimized the patch a bit by removing an extra memcpy.
Up until v2 patch, the entire WAL buffer page is returned and the
caller takes what is wanted from it. This adds an extra memcpy, so I
changed it to avoid extra memcpy and just copy what is wanted. 2) I
improved the comments.
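
The memcpy change described above can be illustrated with a small sketch. It is my reading of the description, not the patch itself: `toy_read_via_page_copy` models the v2 approach (return the whole page, let the caller pick out what it wants) and `toy_read_direct` models the v3 approach (copy only the wanted bytes). Both names and `TOY_BLCKSZ` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_BLCKSZ 16       /* toy page size, far smaller than XLOG_BLCKSZ */

/* v2-style: copy the whole page out, then take the wanted bytes from the copy. */
static void
toy_read_via_page_copy(const char *buffer_page, size_t off, size_t nbytes,
                       char *dst)
{
    char        page_copy[TOY_BLCKSZ];

    assert(off + nbytes <= TOY_BLCKSZ);
    memcpy(page_copy, buffer_page, TOY_BLCKSZ);     /* first memcpy: full page */
    memcpy(dst, page_copy + off, nbytes);           /* second memcpy: wanted bytes */
}

/* v3-style: copy only the wanted bytes, straight from the buffer page. */
static void
toy_read_direct(const char *buffer_page, size_t off, size_t nbytes, char *dst)
{
    assert(off + nbytes <= TOY_BLCKSZ);
    memcpy(dst, buffer_page + off, nbytes);         /* single memcpy */
}
```

Both functions produce the same bytes in `dst`; the second simply does so without moving a full page first.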

I can also do a few other things, but before working on them, I'd like
to hear from others:
1. A separate wait event (WAIT_EVENT_WAL_READ_FROM_BUFFERS) for
reading from WAL buffers - right now, WAIT_EVENT_WAL_READ is being
used both for reading from WAL buffers and WAL files. Given the fact
that we won't wait for a lock or do a time-taking task while reading
from buffers, it seems unnecessary.
2. A separate TAP test for verifying that the WAL is actually read
from WAL buffers - right now, existing tests for recovery,
subscription, pg_walinspect already cover the code, see [1]. However,
if needed, I can add a separate TAP test.
3. Use the oldest initialized WAL buffer page to quickly tell if the
given LSN is present in WAL buffers without taking any lock - right
now, WALBufMappingLock is acquired to do so. While this doesn't seem
to impact much, it's good to optimize it away. But, the oldest
initialized WAL buffer page isn't tracked, so I've put up a patch and
sent in another thread [2]. Irrespective of [2], we are still good
with what we have in this patch.
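
The lock-free check in item 3 might look roughly like the sketch below. It is purely hypothetical: the patch in this thread takes WALBufMappingLock, and nothing here is taken from the other thread's patch. The `toy_*` names are invented, and C11 atomics are used only to keep the example self-contained (PostgreSQL has its own atomics API).

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical shared state: the start position of the oldest WAL buffer
 * page that is still initialized, and the end of the WAL copied into the
 * buffers so far.
 */
static _Atomic uint64_t toy_oldest_initialized_page;
static _Atomic uint64_t toy_buffered_end;

/* Writer side: publish the currently buffered range. */
static void
toy_publish_range(uint64_t oldest_page_start, uint64_t buffered_end)
{
    assert(oldest_page_start <= buffered_end);
    atomic_store(&toy_oldest_initialized_page, oldest_page_start);
    atomic_store(&toy_buffered_end, buffered_end);
}

/*
 * Reader side: cheap, lock-free hint.  A 'true' result can become stale
 * immediately, so a real reader would presumably still have to re-validate
 * the page after copying from it.
 */
static bool
toy_maybe_in_buffers(uint64_t lsn)
{
    uint64_t    oldest = atomic_load(&toy_oldest_initialized_page);
    uint64_t    end = atomic_load(&toy_buffered_end);

    return lsn >= oldest && lsn < end;
}
```

A 'false' result lets the reader go straight to the file without touching any lock; a 'true' result only says that an attempt to read from the buffers is worthwhile.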

[1]:
recovery tests:
PATCHED: WAL buffers hit - 14759, misses - 3371

subscription tests:
PATCHED: WAL buffers hit - 1972, misses - 32616

pg_walinspect tests:
PATCHED: WAL buffers hit - 8, misses - 8

[2]: /messages/by-id/CALj2ACVgi6LirgLDZh=FdfdvGvKAD==WTOSWcQy=AtNgPDVnKw@mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v3-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch (application/x-patch)
From e6670c4098930f87e9b4dce2f21e73e0c0ce9361 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 8 Feb 2023 03:48:32 +0000
Subject: [PATCH v3] Improve WALRead() to suck data directly from WAL buffers

---
 src/backend/access/transam/xlog.c       | 178 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  47 ++++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 229 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..e4679d42f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -689,6 +689,10 @@ static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 							  XLogRecPtr *PrevPtr);
 static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
+static Size XLogReadFromBuffersGuts(XLogRecPtr ptr,
+									TimeLineID tli,
+									Size count,
+									char *page);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
@@ -1639,6 +1643,180 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Guts of XLogReadFromBuffers().
+ *
+ * Read up to 'count' bytes into 'buf', starting at location 'ptr', from WAL
+ * buffers on timeline 'tli', and return the number of bytes read.
+ */
+static Size
+XLogReadFromBuffersGuts(XLogRecPtr ptr,
+						TimeLineID tli,
+						Size count,
+						char *buf)
+{
+	XLogRecPtr	expectedEndPtr;
+	XLogRecPtr	endptr;
+	int 	idx;
+	Size	nread = 0;
+
+	idx = XLogRecPtrToBufIdx(ptr);
+	expectedEndPtr = ptr;
+	expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it. We try to acquire it in shared mode so that the
+	 * concurrent WAL readers are also allowed. We try to do as less work as
+	 * possible while holding the lock to not impact concurrent WAL writers
+	 * much. We quickly exit to not cause any contention, if the lock isn't
+	 * immediately available.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return nread;
+
+	endptr = XLogCtl->xlblocks[idx];
+
+	if (expectedEndPtr == endptr)
+	{
+		char	*page;
+		char    *data;
+		XLogPageHeader	phdr;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/* Read what is wanted, not the whole page. */
+		if ((data + count) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			memcpy(buf, data, count);
+			nread = count;
+		}
+		else
+		{
+			Size nremaining;
+
+			/*
+			 * All the bytes are not in one page. Compute remaining bytes on
+			 * the current page, copy them over to output buffer.
+			 */
+			nremaining = XLOG_BLCKSZ - (data - page);
+			memcpy(buf, data, nremaining);
+			nread = nremaining;
+		}
+
+		/*
+		 * Release the lock as early as possible to avoid creating any possible
+		 * contention.
+		 */
+		LWLockRelease(WALBufMappingLock);
+
+		/*
+		 * The fact that we acquire WALBufMappingLock while reading the WAL
+		 * buffer page itself guarantees that no one else initializes it or
+		 * makes it ready for next use in AdvanceXLInsertBuffer().
+		 *
+		 * However, we perform basic page header checks for ensuring that we
+		 * are not reading a page that just got initialized. Callers will
+		 * anyway perform extensive page-level and record-level checks.
+		 */
+		phdr = (XLogPageHeader) page;
+
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+		{
+			/*
+			 * WAL buffer page doesn't look valid, so return as if nothing was
+			 * read.
+			 */
+			nread = 0;
+		}
+	}
+	else
+	{
+		/* We have found nothing. */
+		LWLockRelease(WALBufMappingLock);
+	}
+
+	/* We never read more than what the caller has asked for. */
+	Assert(nread <= count);
+
+	return nread;
+}
+
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes into 'buf', starting at location 'startptr', from WAL
+ * buffers on timeline 'tli', and store the number of bytes read in
+ * 'read_bytes'.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. The caller must be aware of
+ * this and deal with it.
+ */
+void
+XLogReadFromBuffers(XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf,
+					Size *read_bytes)
+{
+	XLogRecPtr	ptr;
+	char    *dst;
+	Size    nbytes;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+	Assert(count > 0);
+	Assert(startptr <= GetFlushRecPtr(NULL));
+	Assert(!RecoveryInProgress());
+
+	ptr = startptr;
+	nbytes = count;
+	dst = buf;
+	*read_bytes = 0;
+
+	while (nbytes > 0)
+	{
+		Size	nread;
+
+		nread = XLogReadFromBuffersGuts(ptr, tli, nbytes, dst);
+
+		if (nread == 0)
+		{
+			/* We read nothing. */
+			break;
+		}
+		else if (nread == nbytes)
+		{
+			/* We read all the requested bytes. */
+			*read_bytes += nread;
+			break;
+		}
+		else if (nread < nbytes)
+		{
+			/*
+			 * We read some of the requested bytes. Continue to read remaining
+			 * bytes.
+			 */
+			ptr += nread;
+			nbytes -= nread;
+			dst += nread;
+			*read_bytes += nread;
+		}
+	}
+
+	elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+		 *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index aa6c929477..244e92908c 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1485,8 +1485,7 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1497,6 +1496,50 @@ WALRead(XLogReaderState *state,
 	XLogRecPtr	recptr;
 	Size		nbytes;
 
+#ifndef FRONTEND
+	/* Frontend tools have no idea of WAL buffers. */
+	Size        read_bytes;
+
+	/*
+	 * When possible, read WAL from WAL buffers. We skip this step and continue
+	 * the usual way, that is to read from WAL file, either when server is in
+	 * recovery (standby mode, archive or crash recovery), in which case the
+	 * WAL buffers are not used or when the server is inserting in a different
+	 * timeline from that of the timeline that we're trying to read WAL from.
+	 */
+	if (!RecoveryInProgress() &&
+		tli == GetWALInsertionTimeLine())
+	{
+		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
+		XLogReadFromBuffers(startptr, tli, count, buf, &read_bytes);
+		pgstat_report_wait_end();
+
+		/*
+		 * Check if we have read fully (hit), partially (partial hit) or
+		 * nothing (miss) from WAL buffers. If we have read either partially or
+		 * nothing, then continue to read the remaining bytes the usual way,
+		 * that is, read from WAL file.
+		 */
+		if (count == read_bytes)
+		{
+			/* Buffer hit, so return. */
+			return true;
+		}
+		else if (read_bytes > 0 && count > read_bytes)
+		{
+			/*
+			 * Buffer partial hit, so reset the state to count the read bytes
+			 * and continue.
+			 */
+			buf += read_bytes;
+			startptr += read_bytes;
+			count -= read_bytes;
+		}
+
+		/* Buffer miss i.e., read_bytes = 0, so continue */
+	}
+#endif	/* FRONTEND */
+
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index cfe5409738..c9941aa001 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -247,6 +247,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern void XLogReadFromBuffers(XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf,
+								Size *read_bytes);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

#19Dilip Kumar
dilipbalaut@gmail.com
In reply to: Bharath Rupireddy (#18)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Wed, Feb 8, 2023 at 9:57 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I can also do a few other things, but before working on them, I'd like
to hear from others:
1. A separate wait event (WAIT_EVENT_WAL_READ_FROM_BUFFERS) for
reading from WAL buffers - right now, WAIT_EVENT_WAL_READ is being
used both for reading from WAL buffers and WAL files. Given the fact
that we won't wait for a lock or do a time-taking task while reading
from buffers, it seems unnecessary.

Yes, we do not need this separate wait event, and we also don't need
the WAIT_EVENT_WAL_READ wait event while reading from the buffer. We
are not performing any IO, so no specific wait event is needed, and
since we acquire WALBufMappingLock to read from the WAL buffer, any
wait will be tracked as an lwlock wait event on that lock.

2. A separate TAP test for verifying that the WAL is actually read
from WAL buffers - right now, existing tests for recovery,
subscription, pg_walinspect already cover the code, see [1]. However,
if needed, I can add a separate TAP test.

Can we write a test that can actually validate that we have read from
a WAL buffer? If so, then it would be good to have such a test to avoid
any future breakage in that logic. But if the test would only exercise
the code, with no way to validate within the test whether the read
actually hit the WAL buffers, then I think the existing cases are
sufficient.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

#20Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Dilip Kumar (#19)
2 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Wed, Feb 8, 2023 at 10:33 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:

On Wed, Feb 8, 2023 at 9:57 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I can also do a few other things, but before working on them, I'd like
to hear from others:
1. A separate wait event (WAIT_EVENT_WAL_READ_FROM_BUFFERS) for
reading from WAL buffers - right now, WAIT_EVENT_WAL_READ is being
used both for reading from WAL buffers and WAL files. Given the fact
that we won't wait for a lock or do a time-taking task while reading
from buffers, it seems unnecessary.

Yes, we do not need this separate wait event, and we also don't need
the WAIT_EVENT_WAL_READ wait event while reading from the buffer. We
are not performing any IO, so no specific wait event is needed, and
since we acquire WALBufMappingLock to read from the WAL buffer, any
wait will be tracked as an lwlock wait event on that lock.

Nope, LWLockConditionalAcquire() doesn't wait, so there is no lock wait
event (no LWLockReportWaitStart()) there. I agree that we don't need any
wait event for reading from WAL buffers, as no IO is involved there. I
removed it in the attached v4 patch.
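
To make the "doesn't wait" point concrete, here is a standalone analogy
of a conditional shared acquire built on C11 atomics. It is not how
LWLocks are implemented; TryLock, try_lock_shared() and
read_if_uncontended() are names invented for this sketch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* state >= 0: number of shared holders; state == -1: held exclusively. */
typedef atomic_int TryLock;

/* Take the lock in shared mode only if no exclusive holder exists. */
static bool
try_lock_shared(TryLock *lock)
{
	int			cur = atomic_load(lock);

	while (cur >= 0)
	{
		/* On failure, 'cur' is refreshed with the current state. */
		if (atomic_compare_exchange_weak(lock, &cur, cur + 1))
			return true;
	}
	return false;				/* exclusive holder present: give up, never sleep */
}

static void
unlock_shared(TryLock *lock)
{
	atomic_fetch_sub(lock, 1);
}

/*
 * Copy 'n' bytes only if the lock is available right now. A return value
 * of 0 is a miss, and the caller falls back to its slow path (the WAL
 * file, in the case discussed here).
 */
static size_t
read_if_uncontended(TryLock *lock, const char *src, size_t n, char *dst)
{
	if (!try_lock_shared(lock))
		return 0;

	memcpy(dst, src, n);
	unlock_shared(lock);
	return n;
}
```

Since the failing path returns immediately instead of sleeping, there is
no wait to report.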

2. A separate TAP test for verifying that the WAL is actually read
from WAL buffers - right now, existing tests for recovery,
subscription, pg_walinspect already cover the code, see [1]. However,
if needed, I can add a separate TAP test.

Can we write a test that can actually validate that we have read from
a WAL buffer? If so, then it would be good to have such a test to avoid
any future breakage in that logic. But if the test would only exercise
the code, with no way to validate within the test whether the read
actually hit the WAL buffers, then I think the existing cases are
sufficient.

We could set up a standby, a logical replication subscriber or the
pg_walinspect extension and verify that the code got hit with the help
of the server log (DEBUG1) message added by the patch. However, this
can make the test volatile.

Therefore, I came up with a simple and small test module/extension
named test_wal_read_from_buffers under src/test/modules. It exposes a
SQL function that, given an LSN, calls XLogReadFromBuffers() and
returns true if the read hits WAL buffers, otherwise false. The
module's simple TAP test verifies that the function returns true.
I attached the test module as v4-0002 here. The test module is specific
to this feature and also serves as a demonstration of how one can use
the new XLogReadFromBuffers().

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v4-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch (application/octet-stream)
From 9c411b406c04c59b0b08e530ba187903f8385ed6 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 8 Feb 2023 07:45:12 +0000
Subject: [PATCH v4] Improve WALRead() to suck data directly from WAL buffers

---
 src/backend/access/transam/xlog.c       | 178 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  45 +++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 227 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..e4679d42f5 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -689,6 +689,10 @@ static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 							  XLogRecPtr *PrevPtr);
 static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
+static Size XLogReadFromBuffersGuts(XLogRecPtr ptr,
+									TimeLineID tli,
+									Size count,
+									char *page);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
@@ -1639,6 +1643,180 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Guts of XLogReadFromBuffers().
+ *
+ * Read up to 'count' bytes into 'buf', starting at location 'ptr', from WAL
+ * buffers on timeline 'tli', and return the number of bytes read.
+ */
+static Size
+XLogReadFromBuffersGuts(XLogRecPtr ptr,
+						TimeLineID tli,
+						Size count,
+						char *buf)
+{
+	XLogRecPtr	expectedEndPtr;
+	XLogRecPtr	endptr;
+	int 	idx;
+	Size	nread = 0;
+
+	idx = XLogRecPtrToBufIdx(ptr);
+	expectedEndPtr = ptr;
+	expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it. We try to acquire it in shared mode so that the
+	 * concurrent WAL readers are also allowed. We try to do as less work as
+	 * possible while holding the lock to not impact concurrent WAL writers
+	 * much. We quickly exit to not cause any contention, if the lock isn't
+	 * immediately available.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return nread;
+
+	endptr = XLogCtl->xlblocks[idx];
+
+	if (expectedEndPtr == endptr)
+	{
+		char	*page;
+		char    *data;
+		XLogPageHeader	phdr;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/* Read what is wanted, not the whole page. */
+		if ((data + count) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			memcpy(buf, data, count);
+			nread = count;
+		}
+		else
+		{
+			Size nremaining;
+
+			/*
+			 * All the bytes are not in one page. Compute remaining bytes on
+			 * the current page, copy them over to output buffer.
+			 */
+			nremaining = XLOG_BLCKSZ - (data - page);
+			memcpy(buf, data, nremaining);
+			nread = nremaining;
+		}
+
+		/*
+		 * Release the lock as early as possible to avoid creating any possible
+		 * contention.
+		 */
+		LWLockRelease(WALBufMappingLock);
+
+		/*
+		 * The fact that we acquire WALBufMappingLock while reading the WAL
+		 * buffer page itself guarantees that no one else initializes it or
+		 * makes it ready for next use in AdvanceXLInsertBuffer().
+		 *
+		 * However, we perform basic page header checks for ensuring that we
+		 * are not reading a page that just got initialized. Callers will
+		 * anyway perform extensive page-level and record-level checks.
+		 */
+		phdr = (XLogPageHeader) page;
+
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+		{
+			/*
+			 * WAL buffer page doesn't look valid, so return as if nothing was
+			 * read.
+			 */
+			nread = 0;
+		}
+	}
+	else
+	{
+		/* We have found nothing. */
+		LWLockRelease(WALBufMappingLock);
+	}
+
+	/* We never read more than what the caller has asked for. */
+	Assert(nread <= count);
+
+	return nread;
+}
+
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes into 'buf', starting at location 'startptr', from WAL
+ * buffers on timeline 'tli', and store the number of bytes read in
+ * 'read_bytes'.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. The caller must be aware of
+ * this and deal with it.
+ */
+void
+XLogReadFromBuffers(XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf,
+					Size *read_bytes)
+{
+	XLogRecPtr	ptr;
+	char    *dst;
+	Size    nbytes;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+	Assert(count > 0);
+	Assert(startptr <= GetFlushRecPtr(NULL));
+	Assert(!RecoveryInProgress());
+
+	ptr = startptr;
+	nbytes = count;
+	dst = buf;
+	*read_bytes = 0;
+
+	while (nbytes > 0)
+	{
+		Size	nread;
+
+		nread = XLogReadFromBuffersGuts(ptr, tli, nbytes, dst);
+
+		if (nread == 0)
+		{
+			/* We read nothing. */
+			break;
+		}
+		else if (nread == nbytes)
+		{
+			/* We read all the requested bytes. */
+			*read_bytes += nread;
+			break;
+		}
+		else if (nread < nbytes)
+		{
+			/*
+			 * We read some of the requested bytes. Continue to read remaining
+			 * bytes.
+			 */
+			ptr += nread;
+			nbytes -= nread;
+			dst += nread;
+			*read_bytes += nread;
+		}
+	}
+
+	elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+		 *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index aa6c929477..723379b7d9 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1485,8 +1485,7 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1497,6 +1496,48 @@ WALRead(XLogReaderState *state,
 	XLogRecPtr	recptr;
 	Size		nbytes;
 
+#ifndef FRONTEND
+	/* Frontend tools have no idea of WAL buffers. */
+	Size        read_bytes;
+
+	/*
+	 * When possible, read WAL from WAL buffers. We skip this step and continue
+	 * the usual way, that is to read from WAL file, either when server is in
+	 * recovery (standby mode, archive or crash recovery), in which case the
+	 * WAL buffers are not used or when the server is inserting in a different
+	 * timeline from that of the timeline that we're trying to read WAL from.
+	 */
+	if (!RecoveryInProgress() &&
+		tli == GetWALInsertionTimeLine())
+	{
+		XLogReadFromBuffers(startptr, tli, count, buf, &read_bytes);
+
+		/*
+		 * Check if we have read fully (hit), partially (partial hit) or
+		 * nothing (miss) from WAL buffers. If we have read either partially or
+		 * nothing, then continue to read the remaining bytes the usual way,
+		 * that is, read from WAL file.
+		 */
+		if (count == read_bytes)
+		{
+			/* Buffer hit, so return. */
+			return true;
+		}
+		else if (read_bytes > 0 && count > read_bytes)
+		{
+			/*
+			 * Buffer partial hit, so reset the state to count the read bytes
+			 * and continue.
+			 */
+			buf += read_bytes;
+			startptr += read_bytes;
+			count -= read_bytes;
+		}
+
+		/* Buffer miss i.e., read_bytes = 0, so continue */
+	}
+#endif	/* FRONTEND */
+
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index cfe5409738..c9941aa001 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -247,6 +247,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern void XLogReadFromBuffers(XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf,
+								Size *read_bytes);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

v4-0002-Add-test-module-for-verifying-WAL-read-from-WAL-b.patch (application/octet-stream)
From 28da1ad6723e2543a0dc9e808d9a2911a10e3dbc Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 8 Feb 2023 10:22:05 +0000
Subject: [PATCH v4] Add test module for verifying WAL read from WAL buffers

---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 +++++++++
 .../test_wal_read_from_buffers/meson.build    | 36 +++++++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 44 ++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 ++++++
 .../test_wal_read_from_buffers.c              | 51 +++++++++++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 180 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..ea33361f69 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -33,6 +33,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..e3ffd3538d 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -29,5 +29,6 @@ subdir('test_regex')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7a09533ec7
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to verify that WAL can be read from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40a36edc07
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,36 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+# FIXME: prevent install during main install, but not during test :/
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to verify that WAL can be read from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_mod_args,
+)
+testprep_targets += test_wal_read_from_buffers
+
+install_data(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..3448e0bed6
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,44 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn')});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..8e89910133
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    OUT read_from_buffers bool
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT PARALLEL UNSAFE;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..ca8645101a
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,51 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test code for verifying WAL read from WAL buffers.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	lsn;
+	Size	read_bytes;
+	TimeLineID	tli;
+	char	data[XLOG_BLCKSZ] = {0};
+
+	lsn = PG_GETARG_LSN(0);
+
+	if (XLogRecPtrIsInvalid(lsn))
+		PG_RETURN_BOOL(false);
+
+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("WAL control functions cannot be executed during recovery.")));
+
+	tli = GetWALInsertionTimeLine();
+
+	XLogReadFromBuffers(lsn, tli, XLOG_BLCKSZ, data, &read_bytes);
+
+	PG_RETURN_BOOL(read_bytes > 0);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..7852b3e331
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test code for verifying WAL read from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

#21Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#20)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Wed, Feb 08, 2023 at 08:00:00PM +0530, Bharath Rupireddy wrote:

+			/*
+			 * We read some of the requested bytes. Continue to read remaining
+			 * bytes.
+			 */
+			ptr += nread;
+			nbytes -= nread;
+			dst += nread;
+			*read_bytes += nread;

Why do we only read a page at a time in XLogReadFromBuffersGuts()? What is
preventing us from copying all the data we need in one go?

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#22Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#21)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Tue, Feb 28, 2023 at 6:14 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Wed, Feb 08, 2023 at 08:00:00PM +0530, Bharath Rupireddy wrote:

+                     /*
+                      * We read some of the requested bytes. Continue to read remaining
+                      * bytes.
+                      */
+                     ptr += nread;
+                     nbytes -= nread;
+                     dst += nread;
+                     *read_bytes += nread;

Why do we only read a page at a time in XLogReadFromBuffersGuts()? What is
preventing us from copying all the data we need in one go?

Note that most of the WALRead() callers request a single page of
XLOG_BLCKSZ bytes even if the server has fewer or more WAL pages
available. It's the streaming replication walsender that can request
less than XLOG_BLCKSZ bytes and up to MAX_SEND_SIZE (16 * XLOG_BLCKSZ).
And, if we read, say, MAX_SEND_SIZE at once while holding
WALBufMappingLock, that might impact concurrent inserters (at least
in theory) - one of the main intentions of this patch is not to impact
inserters much.

Therefore, I feel reading one WAL buffer page at a time, which works
for most of the cases, without impacting concurrent inserters much is
better - /messages/by-id/CALj2ACWXHP6Ha1BfDB14txm=XP272wCbOV00mcPg9c6EXbnp5A@mail.gmail.com.
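
A minimal standalone sketch of the page-at-a-time splitting described above. It is not code from the patch: SKETCH_BLCKSZ stands in for XLOG_BLCKSZ (8192 is an assumed value), and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192		/* assumed page size, stands in for XLOG_BLCKSZ */

/*
 * Bytes that can be copied starting at 'ptr' without crossing into the
 * next page, capped at the 'nbytes' still wanted.
 */
static size_t
sketch_bytes_on_page(uint64_t ptr, size_t nbytes)
{
	size_t		left_on_page = SKETCH_BLCKSZ - (size_t) (ptr % SKETCH_BLCKSZ);

	return nbytes < left_on_page ? nbytes : left_on_page;
}

/*
 * Number of per-page copies needed to serve a request of 'count' bytes
 * starting at 'startptr' when only one page is copied per step.
 */
static int
sketch_count_page_steps(uint64_t startptr, size_t count)
{
	int			steps = 0;

	while (count > 0)
	{
		size_t		n = sketch_bytes_on_page(startptr, count);

		startptr += n;
		count -= n;
		steps++;
	}

	return steps;
}
```

A page-aligned single-page request takes one step, while an unaligned request of 16 pages takes 17 steps because it straddles one extra page boundary.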

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#23Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Bharath Rupireddy (#22)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Tue, Feb 28, 2023 at 10:38 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

+/*
+ * Guts of XLogReadFromBuffers().
+ *
+ * Read 'count' bytes into 'buf', starting at location 'ptr', from WAL
+ * fetched WAL buffers on timeline 'tli' and return the read bytes.
+ */

s/fetched WAL buffers/fetched from WAL buffers

+ else if (nread < nbytes)
+ {
+ /*
+ * We read some of the requested bytes. Continue to read remaining
+ * bytes.
+ */
+ ptr += nread;
+ nbytes -= nread;
+ dst += nread;
+ *read_bytes += nread;
+ }

The 'else if' condition is always true at this point, so you can
replace it with a plain 'else' and an assertion.

s/Continue to read remaining/Continue to read the remaining

The good thing about this patch is that it reduces read IO calls
without impacting the write performance (at least not noticeably).
It also takes us one step forward towards the enhancements mentioned
in the thread. If performance is a concern, we can introduce a GUC to
enable/disable this feature.

--
Thanks & Regards,
Kuntal Ghosh

#24Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#22)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Tue, Feb 28, 2023 at 10:38:31AM +0530, Bharath Rupireddy wrote:

On Tue, Feb 28, 2023 at 6:14 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

Why do we only read a page at a time in XLogReadFromBuffersGuts()? What is
preventing us from copying all the data we need in one go?

Note that most of the WALRead() callers request a single page of
XLOG_BLCKSZ bytes even if the server has fewer or more WAL pages
available. It's the streaming replication walsender that can request
less than XLOG_BLCKSZ bytes and up to MAX_SEND_SIZE (16 * XLOG_BLCKSZ).
And, if we read, say, MAX_SEND_SIZE at once while holding
WALBufMappingLock, that might impact concurrent inserters (at least
in theory) - one of the main intentions of this patch is not to impact
inserters much.

Perhaps we should test both approaches to see if there is a noticeable
difference. It might not be great for concurrent inserts to repeatedly
take the lock, either. If there's no real difference, we might be able to
simplify the code a bit.
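
To make the two approaches concrete, here is a rough counting sketch. It only tallies lock acquisitions and pages copied per lock hold; it says nothing about real contention, which would still need the benchmark suggested above. The struct, the function names, and the 8192-byte page size are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192		/* assumed page size, stands in for XLOG_BLCKSZ */

typedef struct SketchLockCost
{
	int			lock_acquisitions;	/* times the mapping lock is taken */
	int			max_pages_per_hold; /* pages copied during one lock hold */
} SketchLockCost;

/* Number of distinct pages covered by [startptr, startptr + count), count > 0. */
static int
sketch_pages_touched(uint64_t startptr, size_t count)
{
	uint64_t	first = startptr / SKETCH_BLCKSZ;
	uint64_t	last = (startptr + count - 1) / SKETCH_BLCKSZ;

	return (int) (last - first + 1);
}

/* One lock acquisition per page, one page copied per hold. */
static SketchLockCost
sketch_cost_page_at_a_time(uint64_t startptr, size_t count)
{
	SketchLockCost c;

	c.lock_acquisitions = sketch_pages_touched(startptr, count);
	c.max_pages_per_hold = 1;
	return c;
}

/* One lock acquisition for the whole request, all pages copied under it. */
static SketchLockCost
sketch_cost_single_hold(uint64_t startptr, size_t count)
{
	SketchLockCost c;

	c.lock_acquisitions = 1;
	c.max_pages_per_hold = sketch_pages_touched(startptr, count);
	return c;
}
```

For an unaligned request of 16 pages, the first approach takes the lock 17 times for one page each, and the second takes it once for 17 pages.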

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#25Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Kuntal Ghosh (#23)
2 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Wed, Mar 1, 2023 at 12:06 AM Kuntal Ghosh <kuntalghosh.2007@gmail.com> wrote:

On Tue, Feb 28, 2023 at 10:38 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

+/*
+ * Guts of XLogReadFromBuffers().
+ *
+ * Read 'count' bytes into 'buf', starting at location 'ptr', from WAL
+ * fetched WAL buffers on timeline 'tli' and return the read bytes.
+ */
s/fetched WAL buffers/fetched from WAL buffers

Modified that comment a bit and moved it to the XLogReadFromBuffers.

+ else if (nread < nbytes)
+ {
+ /*
+ * We read some of the requested bytes. Continue to read remaining
+ * bytes.
+ */
+ ptr += nread;
+ nbytes -= nread;
+ dst += nread;
+ *read_bytes += nread;
+ }

The 'if' condition should always be true. You can replace the same
with an assertion instead.

Yeah, added an assert and changed the else if (nread < nbytes) to a
plain else.
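
A self-contained sketch of the resulting loop shape, using a toy reader instead of the real WAL buffers. sketch_read_one_page() is a hypothetical stand-in for XLogReadFromBuffersGuts(), and the 8-byte page is chosen only to keep the example small.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE 8			/* tiny page size, for illustration only */

/*
 * Hypothetical stand-in for the per-page reader: copies from 'src' starting
 * at 'pos', never crossing a page boundary and never going past 'avail'.
 * Returns the number of bytes copied, 0 if nothing is available.
 */
static size_t
sketch_read_one_page(const char *src, size_t avail, size_t pos,
					 size_t nbytes, char *dst)
{
	size_t		left_on_page;
	size_t		n;

	if (pos >= avail)
		return 0;

	left_on_page = SKETCH_PAGE - pos % SKETCH_PAGE;
	n = nbytes < left_on_page ? nbytes : left_on_page;
	if (n > avail - pos)
		n = avail - pos;

	memcpy(dst, src + pos, n);
	return n;
}

/* Outer loop: miss, full read, or partial read guarded by an assertion. */
static size_t
sketch_read_all(const char *src, size_t avail, size_t pos,
				size_t count, char *dst)
{
	size_t		nbytes = count;
	size_t		total = 0;

	while (nbytes > 0)
	{
		size_t		nread = sketch_read_one_page(src, avail, pos, nbytes, dst);

		if (nread == 0)
			break;				/* read nothing */
		else if (nread == nbytes)
		{
			total += nread;		/* read everything that was asked for */
			break;
		}
		else
		{
			/* Only a partial read can reach this branch. */
			assert(nread < nbytes);
			pos += nread;
			nbytes -= nread;
			dst += nread;
			total += nread;
		}
	}

	return total;
}
```

The assertion documents that the reader never returns more than requested, which is what makes the explicit nread < nbytes test redundant.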

s/Continue to read remaining/Continue to read the remaining

Done.

The good thing about this patch is that it reduces read IO calls
without impacting the write performance (at least not noticeably).
It also takes us one step forward towards the enhancements mentioned
in the thread.

Right.

If performance is a concern, we
can introduce a GUC to enable/disable this feature.

I didn't see any performance issues from my testing so far with 3
different pgbench cases
/messages/by-id/CALj2ACWXHP6Ha1BfDB14txm=XP272wCbOV00mcPg9c6EXbnp5A@mail.gmail.com.

While a GUC to enable/disable a feature sounds useful, IMHO it isn't
good for the user: we already have too many GUCs, and we may not want
every feature to be defensive and add its own. If any bugs arise from
a corner case we failed to account for, we can surely help fix them.
Having said this, I'm open to suggestions here.

Please find the attached v5 patch set for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v5-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch (application/x-patch)
From 121f98633541c340a89dc628cc3f02711abdee02 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 1 Mar 2023 08:40:30 +0000
Subject: [PATCH v5] Improve WALRead() to suck data directly from WAL buffers

---
 src/backend/access/transam/xlog.c       | 175 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  45 +++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 224 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..98bd588ea0 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -689,6 +689,10 @@ static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 							  XLogRecPtr *PrevPtr);
 static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
+static Size XLogReadFromBuffersGuts(XLogRecPtr ptr,
+									TimeLineID tli,
+									Size count,
+									char *page);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
 static uint64 XLogRecPtrToBytePos(XLogRecPtr ptr);
@@ -1639,6 +1643,177 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Guts of XLogReadFromBuffers().
+ */
+static Size
+XLogReadFromBuffersGuts(XLogRecPtr ptr,
+						TimeLineID tli,
+						Size count,
+						char *buf)
+{
+	XLogRecPtr	expectedEndPtr;
+	XLogRecPtr	endptr;
+	int 	idx;
+	Size	nread = 0;
+
+	idx = XLogRecPtrToBufIdx(ptr);
+	expectedEndPtr = ptr;
+	expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it. We try to acquire it in shared mode so that the
+	 * concurrent WAL readers are also allowed. We try to do as little work as
+	 * possible while holding the lock to not impact concurrent WAL writers
+	 * much. We quickly exit to not cause any contention, if the lock isn't
+	 * immediately available.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return nread;
+
+	endptr = XLogCtl->xlblocks[idx];
+
+	if (expectedEndPtr == endptr)
+	{
+		char	*page;
+		char    *data;
+		XLogPageHeader	phdr;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/* Read what is wanted, not the whole page. */
+		if ((data + count) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			memcpy(buf, data, count);
+			nread = count;
+		}
+		else
+		{
+			Size nremaining;
+
+			/*
+			 * Not all the bytes are in one page. Compute remaining bytes on
+			 * the current page, copy them over to output buffer.
+			 */
+			nremaining = XLOG_BLCKSZ - (data - page);
+			memcpy(buf, data, nremaining);
+			nread = nremaining;
+		}
+
+		/*
+		 * Release the lock as early as possible to avoid creating any possible
+		 * contention.
+		 */
+		LWLockRelease(WALBufMappingLock);
+
+		/*
+		 * The fact that we acquire WALBufMappingLock while reading the WAL
+		 * buffer page itself guarantees that no one else initializes it or
+		 * makes it ready for next use in AdvanceXLInsertBuffer().
+		 *
+		 * However, we perform basic page header checks for ensuring that we
+		 * are not reading a page that just got initialized. Callers will
+		 * anyway perform extensive page-level and record-level checks.
+		 */
+		phdr = (XLogPageHeader) page;
+
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+		{
+			/*
+			 * WAL buffer page doesn't look valid, so return as if nothing was
+			 * read.
+			 */
+			nread = 0;
+		}
+	}
+	else
+	{
+		/* Requested WAL isn't available in WAL buffers. */
+		LWLockRelease(WALBufMappingLock);
+	}
+
+	/* We never read more than what the caller has asked for. */
+	Assert(nread <= count);
+
+	return nread;
+}
+
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and set the read bytes to 'read_bytes'.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. The caller must be aware of
+ * this and deal with it.
+ */
+void
+XLogReadFromBuffers(XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf,
+					Size *read_bytes)
+{
+	XLogRecPtr	ptr;
+	char    *dst;
+	Size    nbytes;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+	Assert(count > 0);
+	Assert(startptr <= GetFlushRecPtr(NULL));
+	Assert(!RecoveryInProgress());
+
+	ptr = startptr;
+	nbytes = count;
+	dst = buf;
+	*read_bytes = 0;
+
+	while (nbytes > 0)
+	{
+		Size	nread;
+
+		nread = XLogReadFromBuffersGuts(ptr, tli, nbytes, dst);
+
+		if (nread <= 0)
+		{
+			/* We read nothing. */
+			break;
+		}
+		else if (nread == nbytes)
+		{
+			/* We read all the requested bytes. */
+			*read_bytes += nread;
+			break;
+		}
+		else
+		{
+			/*
+			 * We read some of the requested bytes. Continue to read the
+			 * remaining bytes.
+			 */
+			Assert(nread < nbytes);
+			ptr += nread;
+			nbytes -= nread;
+			dst += nread;
+			*read_bytes += nread;
+		}
+	}
+
+	elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+		 *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index aa6c929477..723379b7d9 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1485,8 +1485,7 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1497,6 +1496,48 @@ WALRead(XLogReaderState *state,
 	XLogRecPtr	recptr;
 	Size		nbytes;
 
+#ifndef FRONTEND
+	/* Frontend tools have no idea of WAL buffers. */
+	Size        read_bytes;
+
+	/*
+	 * When possible, read WAL from WAL buffers. We skip this step and continue
+	 * the usual way, that is to read from WAL file, either when server is in
+	 * recovery (standby mode, archive or crash recovery), in which case the
+	 * WAL buffers are not used or when the server is inserting in a different
+	 * timeline from that of the timeline that we're trying to read WAL from.
+	 */
+	if (!RecoveryInProgress() &&
+		tli == GetWALInsertionTimeLine())
+	{
+		XLogReadFromBuffers(startptr, tli, count, buf, &read_bytes);
+
+		/*
+		 * Check if we have read fully (hit), partially (partial hit) or
+		 * nothing (miss) from WAL buffers. If we have read either partially or
+		 * nothing, then continue to read the remaining bytes the usual way,
+		 * that is, read from WAL file.
+		 */
+		if (count == read_bytes)
+		{
+			/* Buffer hit, so return. */
+			return true;
+		}
+		else if (read_bytes > 0 && count > read_bytes)
+		{
+			/*
+			 * Buffer partial hit, so reset the state to count the read bytes
+			 * and continue.
+			 */
+			buf += read_bytes;
+			startptr += read_bytes;
+			count -= read_bytes;
+		}
+
+		/* Buffer miss i.e., read_bytes = 0, so continue */
+	}
+#endif	/* FRONTEND */
+
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index cfe5409738..c9941aa001 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -247,6 +247,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern void XLogReadFromBuffers(XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf,
+								Size *read_bytes);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

v5-0002-Add-test-module-for-verifying-WAL-read-from-WAL-b.patch (application/x-patch)
From a889a11b814a902c0b4292f00a360acdab2c7c1f Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 1 Mar 2023 08:40:57 +0000
Subject: [PATCH v5] Add test module for verifying WAL read from WAL buffers

---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 +++++++++
 .../test_wal_read_from_buffers/meson.build    | 36 +++++++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 44 ++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 ++++++
 .../test_wal_read_from_buffers.c              | 51 +++++++++++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 180 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..ea33361f69 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -33,6 +33,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..e3ffd3538d 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -29,5 +29,6 @@ subdir('test_regex')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7a09533ec7
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to verify that WAL can be read from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40a36edc07
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,36 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+# FIXME: prevent install during main install, but not during test :/
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to verify that WAL can be read from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_mod_args,
+)
+testprep_targets += test_wal_read_from_buffers
+
+install_data(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..3448e0bed6
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,44 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL-generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn')});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..8e89910133
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    OUT read_from_buffers bool
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT PARALLEL UNSAFE;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..ca8645101a
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,51 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test code for verifying WAL read from WAL buffers.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	lsn;
+	Size	read_bytes;
+	TimeLineID	tli;
+	char	data[XLOG_BLCKSZ] = {0};
+
+	lsn = PG_GETARG_LSN(0);
+
+	if (XLogRecPtrIsInvalid(lsn))
+		PG_RETURN_BOOL(false);
+
+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("WAL control functions cannot be executed during recovery.")));
+
+	tli = GetWALInsertionTimeLine();
+
+	XLogReadFromBuffers(lsn, tli, XLOG_BLCKSZ, data, &read_bytes);
+
+	PG_RETURN_BOOL(read_bytes > 0);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..7852b3e331
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test code for verifying WAL read from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

#26Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#25)
2 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Wed, Mar 1, 2023 at 2:39 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Please find the attached v5 patch set for further review.

I simplified the code largely by moving the logic of reading the WAL
buffer page from a separate function to the main while loop. This
enabled me to get rid of XLogReadFromBuffersGuts() that v5 and other
previous patches have.
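
For readers following along, the per-page lookup that each loop iteration performs can be sketched in isolation as below. This is an illustrative model, not the patch code: the slot array stands in for XLogCtl->xlblocks, the index computation mirrors XLogRecPtrToBufIdx() as I understand it, and the page size and buffer count are assumed values.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLCKSZ	8192	/* assumed page size, stands in for XLOG_BLCKSZ */
#define SKETCH_NBUFFERS 4		/* assumed number of WAL buffer pages */

/* Slot in the circular buffer array that would hold the page containing 'ptr'. */
static int
sketch_buf_idx(uint64_t ptr)
{
	return (int) ((ptr / SKETCH_BLCKSZ) % SKETCH_NBUFFERS);
}

/* End position (exclusive) of the page containing 'ptr'. */
static uint64_t
sketch_expected_end(uint64_t ptr)
{
	return ptr + (SKETCH_BLCKSZ - ptr % SKETCH_BLCKSZ);
}

/*
 * 'xlblocks[i]' records the end position of the page currently held in
 * slot i. The requested page is in the buffers only if its slot still
 * records the expected end position; otherwise the slot has been reused.
 */
static int
sketch_page_in_buffers(const uint64_t *xlblocks, uint64_t ptr)
{
	return xlblocks[sketch_buf_idx(ptr)] == sketch_expected_end(ptr);
}
```

With four slots, page 4 reuses the slot that page 0 once occupied, so a lookup for an LSN on page 0 misses while a lookup on page 4 hits.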

Please find the attached v6 patch set for further review. Meanwhile,
I'll continue to work on the review comment raised upthread -
/messages/by-id/20230301041523.GA1453450@nathanxps13.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v6-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch (application/x-patch)
From afb57cd61955f7fc0dba315f3f7c81604a0b47c9 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 2 Mar 2023 07:54:06 +0000
Subject: [PATCH v6] Improve WALRead() to suck data directly from WAL buffers

---
 src/backend/access/transam/xlog.c       | 144 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  45 +++++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 193 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f9f0f6db8d..b29dc67c38 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1639,6 +1639,150 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and set the read bytes to 'read_bytes'.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. The caller must be aware of
+ * this and deal with it.
+ */
+void
+XLogReadFromBuffers(XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf,
+					Size *read_bytes)
+{
+	XLogRecPtr	ptr;
+	char    *dst;
+	Size    nbytes;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+	Assert(count > 0);
+	Assert(startptr <= GetFlushRecPtr(NULL));
+	Assert(!RecoveryInProgress());
+	Assert(tli == GetWALInsertionTimeLine());
+
+	ptr = startptr;
+	nbytes = count;
+	dst = buf;
+	*read_bytes = 0;
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr origptr;
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int 	idx;
+
+		origptr = ptr;
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+		/*
+		 * Holding WALBufMappingLock ensures inserters don't overwrite this
+		 * value while we are reading it. We try to acquire it in shared mode
+		 * so that the concurrent WAL readers are also allowed. We try to do as
+		 * little work as possible while holding the lock to not impact
+		 * concurrent WAL writers much. We quickly exit to not cause any
+		 * contention, if the lock isn't immediately available.
+		 */
+		if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+			return;
+
+		endptr = XLogCtl->xlblocks[idx];
+
+		if (expectedEndPtr == endptr)
+		{
+			char	*page;
+			char    *data;
+			XLogPageHeader	phdr;
+
+			/*
+			 * We found WAL buffer page containing given XLogRecPtr. Get
+			 * starting address of the page and a pointer to the right location
+			 * of given XLogRecPtr in that page.
+			 */
+			page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+			data = page + ptr % XLOG_BLCKSZ;
+
+			/* Read what is wanted, not the whole page. */
+			if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+			{
+				/* All the bytes are in one page. */
+				memcpy(dst, data, nbytes);
+				*read_bytes += nbytes;
+				nbytes = 0;
+			}
+			else
+			{
+				Size	nread;
+
+				/*
+				 * Not all the bytes are in one page. Read available bytes on
+				 * the current page, copy them over to output buffer and
+				 * continue to read remaining bytes.
+				 */
+				nread = XLOG_BLCKSZ - (data - page);
+				Assert(nread > 0 && nread <= nbytes);
+				memcpy(dst, data, nread);
+				ptr += nread;
+				nbytes -= nread;
+				dst += nread;
+				*read_bytes += nread;
+			}
+
+			/*
+			 * Release the lock as early as possible to avoid creating any
+			 * possible contention.
+			 */
+			LWLockRelease(WALBufMappingLock);
+
+			/*
+			 * The fact that we acquire WALBufMappingLock while reading the WAL
+			 * buffer page itself guarantees that no one else initializes it or
+			 * makes it ready for next use in AdvanceXLInsertBuffer().
+			 *
+			 * However, we perform basic page header checks for ensuring that
+			 * we are not reading a page that just got initialized. Callers
+			 * will anyway perform extensive page-level and record-level
+			 * checks.
+			 */
+			phdr = (XLogPageHeader) page;
+
+			if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+				  phdr->xlp_pageaddr == (origptr - (origptr % XLOG_BLCKSZ)) &&
+				  phdr->xlp_tli == tli))
+			{
+				/*
+				 * WAL buffer page doesn't look valid, so return with what we
+				 * have read so far.
+				 */
+				break;
+			}
+		}
+		else
+		{
+			/*
+			 * Requested WAL isn't available in WAL buffers, so return with
+			 * what we have read so far.
+			 */
+			LWLockRelease(WALBufMappingLock);
+			break;
+		}
+	}
+
+	/* We never read more than what the caller has asked for. */
+	Assert(*read_bytes <= count);
+
+	elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+		 *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index aa6c929477..723379b7d9 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1485,8 +1485,7 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1497,6 +1496,48 @@ WALRead(XLogReaderState *state,
 	XLogRecPtr	recptr;
 	Size		nbytes;
 
+#ifndef FRONTEND
+	/* Frontend tools have no idea of WAL buffers. */
+	Size        read_bytes;
+
+	/*
+	 * When possible, read WAL from WAL buffers. We skip this step and continue
+	 * the usual way, that is to read from WAL file, either when server is in
+	 * recovery (standby mode, archive or crash recovery), in which case the
+	 * WAL buffers are not used or when the server is inserting in a different
+	 * timeline from that of the timeline that we're trying to read WAL from.
+	 */
+	if (!RecoveryInProgress() &&
+		tli == GetWALInsertionTimeLine())
+	{
+		XLogReadFromBuffers(startptr, tli, count, buf, &read_bytes);
+
+		/*
+		 * Check if we have read fully (hit), partially (partial hit) or
+		 * nothing (miss) from WAL buffers. If we have read either partially or
+		 * nothing, then continue to read the remaining bytes the usual way,
+		 * that is, read from WAL file.
+		 */
+		if (count == read_bytes)
+		{
+			/* Buffer hit, so return. */
+			return true;
+		}
+		else if (read_bytes > 0 && count > read_bytes)
+		{
+			/*
+			 * Buffer partial hit, so reset the state to count the read bytes
+			 * and continue.
+			 */
+			buf += read_bytes;
+			startptr += read_bytes;
+			count -= read_bytes;
+		}
+
+		/* Buffer miss i.e., read_bytes = 0, so continue */
+	}
+#endif	/* FRONTEND */
+
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index cfe5409738..c9941aa001 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -247,6 +247,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern void XLogReadFromBuffers(XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf,
+								Size *read_bytes);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

v6-0002-Add-test-module-for-verifying-WAL-read-from-WAL-b.patch (application/x-patch)
From 934af3ea6475c3293156e20b62901ab8dda5e65a Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 2 Mar 2023 07:54:36 +0000
Subject: [PATCH v6] Add test module for verifying WAL read from WAL buffers

---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 +++++++++
 .../test_wal_read_from_buffers/meson.build    | 36 +++++++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 44 ++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 ++++++
 .../test_wal_read_from_buffers.c              | 51 +++++++++++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 180 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..ea33361f69 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -33,6 +33,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..e3ffd3538d 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -29,5 +29,6 @@ subdir('test_regex')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7a09533ec7
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to verify that WAL can be read from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40a36edc07
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,36 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+# FIXME: prevent install during main install, but not during test :/
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to verify that WAL can be read from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_mod_args,
+)
+testprep_targets += test_wal_read_from_buffers
+
+install_data(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..3448e0bed6
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,44 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn')});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..8e89910133
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    OUT read_from_buffers bool
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT PARALLEL UNSAFE;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..ca8645101a
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,51 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test code for verifying WAL read from WAL buffers.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	lsn;
+	Size	read_bytes;
+	TimeLineID	tli;
+	char	data[XLOG_BLCKSZ] = {0};
+
+	lsn = PG_GETARG_LSN(0);
+
+	if (XLogRecPtrIsInvalid(lsn))
+		PG_RETURN_BOOL(false);
+
+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("WAL control functions cannot be executed during recovery.")));
+
+	tli = GetWALInsertionTimeLine();
+
+	XLogReadFromBuffers(lsn, tli, XLOG_BLCKSZ, data, &read_bytes);
+
+	PG_RETURN_BOOL(read_bytes > 0);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..7852b3e331
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test code for verifying WAL read from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

#27Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#24)
2 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Wed, Mar 1, 2023 at 9:45 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Tue, Feb 28, 2023 at 10:38:31AM +0530, Bharath Rupireddy wrote:

On Tue, Feb 28, 2023 at 6:14 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

Why do we only read a page at a time in XLogReadFromBuffersGuts()? What is
preventing us from copying all the data we need in one go?

Note that most of the WALRead() callers request a single page of
XLOG_BLCKSZ bytes even if the server has less or more available WAL
pages. It's the streaming replication wal sender that can request less
than XLOG_BLCKSZ bytes and up to MAX_SEND_SIZE (16 * XLOG_BLCKSZ). And,
if we read, say, MAX_SEND_SIZE at once while holding
WALBufMappingLock, that might impact concurrent inserters (at least, I
can say it in theory) - one of the main intentions of this patch is
not to impact inserters much.

Perhaps we should test both approaches to see if there is a noticeable
difference. It might not be great for concurrent inserts to repeatedly
take the lock, either. If there's no real difference, we might be able to
simplify the code a bit.

I took a stab at this, comparing two strategies: acquiring
WALBufMappingLock separately for each requested WAL buffer page vs
acquiring it once for all requested WAL buffer pages. I chose the
pgbench tpcb-like benchmark that has 3 UPDATE statements and 1 INSERT
statement. I ran pgbench for 30min with scale factor 100 and 4096
clients with a primary and 1 async standby, see [1]. I captured
wait_events to see the contention on WALBufMappingLock. I didn't
notice any contention on the lock, and there was no meaningful
difference in TPS either: see [2] for results on HEAD, [3] for results
on the v6 patch, which acquires WALBufMappingLock separately for each
requested WAL buffer page, and [4] for results on the v7 patch
(attached herewith), which acquires WALBufMappingLock once for all
requested WAL buffer pages. Another thing to note from the test
results is the reduction in WALRead IO wait events from 136 on HEAD to
1 on both the v6 and v7 patches. So, reading from WAL buffers is
really helping here.

With these observations, I'd like to use the approach that acquires
WALBufMappingLock once for all requested WAL buffer pages unlike v6
and the previous patches.
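
To make the chosen strategy concrete, here is a minimal standalone
sketch of "take the lock once, then copy page by page until the first
miss". It is not the patch's code: the WalBuffers struct, PAGE_SZ,
NPAGES, load_page() and the toy try_lock_shared()/unlock_shared()
helpers are all hypothetical stand-ins (for XLogCtl, XLOG_BLCKSZ, the
number of WAL buffer pages and WALBufMappingLock), and the toy lock is
not thread-safe. It also omits the page header checks that the patch
performs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 8192u	/* hypothetical stand-in for XLOG_BLCKSZ */
#define NPAGES  4u		/* hypothetical number of WAL buffer pages */

typedef uint64_t Lsn;

typedef struct WalBuffers
{
	bool	exclusive_held;		/* toy stand-in for WALBufMappingLock */
	int		shared_holders;
	Lsn		page_end[NPAGES];	/* end LSN of the page in each slot */
	char	pages[NPAGES][PAGE_SZ];
} WalBuffers;

/* Toy conditional shared acquire: fails if an exclusive holder exists. */
static bool
try_lock_shared(WalBuffers *wb)
{
	if (wb->exclusive_held)
		return false;
	wb->shared_holders++;
	return true;
}

static void
unlock_shared(WalBuffers *wb)
{
	wb->shared_holders--;
}

/* Test helper: place a page starting at 'page_start' into its slot. */
static void
load_page(WalBuffers *wb, Lsn page_start, char fill)
{
	size_t	idx = (size_t) ((page_start / PAGE_SZ) % NPAGES);

	assert(page_start % PAGE_SZ == 0);
	memset(wb->pages[idx], fill, PAGE_SZ);
	wb->page_end[idx] = page_start + PAGE_SZ;
}

/*
 * Copy up to 'count' bytes starting at 'start' into 'dst'. Returns the
 * number of bytes copied, which may be less than 'count' (partial hit)
 * or zero (miss, or lock not immediately available).
 */
static size_t
read_from_buffers(WalBuffers *wb, Lsn start, size_t count, char *dst)
{
	size_t	total = 0;
	Lsn		ptr = start;

	/* Acquire the lock once for the whole request. */
	if (!try_lock_shared(wb))
		return 0;

	while (count > 0)
	{
		size_t	idx = (size_t) ((ptr / PAGE_SZ) % NPAGES);
		size_t	off = (size_t) (ptr % PAGE_SZ);
		Lsn		expected_end = ptr - off + PAGE_SZ;
		size_t	n;

		/* The slot holds some other page: stop with what we have. */
		if (wb->page_end[idx] != expected_end)
			break;

		/* Copy only what is wanted, never past the end of the page. */
		n = PAGE_SZ - off;
		if (n > count)
			n = count;
		memcpy(dst, wb->pages[idx] + off, n);

		ptr += n;
		dst += n;
		count -= n;
		total += n;
	}

	unlock_shared(wb);
	return total;
}
```

A caller would treat a return value equal to the requested count as a
hit, a smaller non-zero value as a partial hit (advance the buffer and
start position, then read the rest from the WAL file), and zero as a
miss, which mirrors what WALRead() does in the patch.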

I'm attaching the v7 patch set with this change for further review.

[1]:
shared_buffers = '8GB'
wal_buffers = '1GB'
max_wal_size = '16GB'
max_connections = '5000'
archive_mode = 'on'
archive_command='cp %p /home/ubuntu/archived_wal/%f'
./pgbench --initialize --scale=100 postgres
./pgbench -n -M prepared -U ubuntu postgres -b tpcb-like -c4096 -j4096 -T1800

[2]:
HEAD:
done in 20.03 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 15.53 s, vacuum 0.19 s, primary keys 4.30 s).
tps = 11654.475345 (without initial connection time)

50950253 Lock | transactionid
16472447 Lock | tuple
3869523 LWLock | LockManager
739283 IPC | ProcArrayGroupUpdate
718549 |
439877 LWLock | WALWrite
130737 Client | ClientRead
121113 LWLock | BufferContent
70778 LWLock | WALInsert
43346 IPC | XactGroupUpdate
18547
18546 Activity | LogicalLauncherMain
18545 Activity | AutoVacuumMain
18272 Activity | ArchiverMain
17627 Activity | WalSenderMain
17207 Activity | WalWriterMain
15455 IO | WALSync
14963 LWLock | ProcArray
14747 LWLock | XactSLRU
13943 Timeout | CheckpointWriteDelay
10519 Activity | BgWriterHibernate
8022 Activity | BgWriterMain
4486 Timeout | SpinDelay
4443 Activity | CheckpointerMain
1435 Lock | extend
670 LWLock | XidGen
373 IO | WALWrite
283 Timeout | VacuumDelay
268 IPC | ArchiveCommand
249 Timeout | VacuumTruncate
136 IO | WALRead
115 IO | WALInitSync
74 IO | DataFileWrite
67 IO | WALInitWrite
36 IO | DataFileFlush
35 IO | DataFileExtend
17 IO | DataFileRead
4 IO | SLRUWrite
3 IO | BufFileWrite
2 IO | DataFileImmediateSync
1 Tuples only is on.
1 LWLock | SInvalWrite
1 LWLock | LockFastPath
1 IO | ControlFileSyncUpdate

[3]:
done in 19.99 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 15.52 s, vacuum 0.18 s, primary keys 4.28 s).
tps = 11689.584538 (without initial connection time)

50678977 Lock | transactionid
16252048 Lock | tuple
4146827 LWLock | LockManager
768256 |
719923 IPC | ProcArrayGroupUpdate
432836 LWLock | WALWrite
140354 Client | ClientRead
124203 LWLock | BufferContent
74355 LWLock | WALInsert
39852 IPC | XactGroupUpdate
30728
30727 Activity | LogicalLauncherMain
30726 Activity | AutoVacuumMain
30420 Activity | ArchiverMain
29881 Activity | WalSenderMain
29418 Activity | WalWriterMain
23428 Activity | BgWriterHibernate
15960 Timeout | CheckpointWriteDelay
15840 IO | WALSync
15066 LWLock | ProcArray
14577 Activity | CheckpointerMain
14377 LWLock | XactSLRU
7291 Activity | BgWriterMain
4336 Timeout | SpinDelay
1707 Lock | extend
720 LWLock | XidGen
362 Timeout | VacuumTruncate
360 IO | WALWrite
304 Timeout | VacuumDelay
301 IPC | ArchiveCommand
106 IO | WALInitSync
82 IO | DataFileWrite
66 IO | WALInitWrite
45 IO | DataFileFlush
25 IO | DataFileExtend
18 IO | DataFileRead
5 LWLock | LockFastPath
2 IO | DataFileSync
2 IO | DataFileImmediateSync
1 Tuples only is on.
1 LWLock | BufferMapping
1 IO | WALRead
1 IO | SLRUWrite
1 IO | SLRURead
1 IO | ReplicationSlotSync
1 IO | BufFileRead

[4]:
done in 19.92 s (drop tables 0.00 s, create tables 0.01 s, client-side
generate 15.53 s, vacuum 0.23 s, primary keys 4.16 s).
tps = 11671.869074 (without initial connection time)

50614021 Lock | transactionid
16482561 Lock | tuple
4086451 LWLock | LockManager
777507 |
714329 IPC | ProcArrayGroupUpdate
420593 LWLock | WALWrite
138142 Client | ClientRead
125381 LWLock | BufferContent
75283 LWLock | WALInsert
38759 IPC | XactGroupUpdate
20283
20282 Activity | LogicalLauncherMain
20281 Activity | AutoVacuumMain
20002 Activity | ArchiverMain
19467 Activity | WalSenderMain
19036 Activity | WalWriterMain
15836 IO | WALSync
15708 Timeout | CheckpointWriteDelay
15346 LWLock | ProcArray
15095 LWLock | XactSLRU
11852 Activity | BgWriterHibernate
8424 Activity | BgWriterMain
4636 Timeout | SpinDelay
4415 Activity | CheckpointerMain
2048 Lock | extend
1457 Timeout | VacuumTruncate
646 LWLock | XidGen
402 IO | WALWrite
306 Timeout | VacuumDelay
278 IPC | ArchiveCommand
117 IO | WALInitSync
74 IO | DataFileWrite
66 IO | WALInitWrite
35 IO | DataFileFlush
29 IO | DataFileExtend
24 LWLock | LockFastPath
14 IO | DataFileRead
2 IO | SLRUWrite
2 IO | DataFileImmediateSync
2 IO | BufFileWrite
1 Tuples only is on.
1 LWLock | BufferMapping
1 IO | WALRead
1 IO | SLRURead
1 IO | BufFileRead

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v7-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch (application/x-patch)
From 2c46ebcb95954580da3ece4bd8ce5d5b1d824694 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Fri, 3 Mar 2023 10:33:06 +0000
Subject: [PATCH v7] Improve WALRead() to suck data directly from WAL buffers

---
 src/backend/access/transam/xlog.c       | 140 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  45 +++++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 189 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 87af608d15..51dd101d12 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1639,6 +1639,146 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and set the read bytes to 'read_bytes'.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. The caller must be aware of
+ * this and deal with it.
+ */
+void
+XLogReadFromBuffers(XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf,
+					Size *read_bytes)
+{
+	XLogRecPtr	ptr;
+	char    *dst;
+	Size    nbytes;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+	Assert(count > 0);
+	Assert(startptr <= GetFlushRecPtr(NULL));
+	Assert(!RecoveryInProgress());
+	Assert(tli == GetWALInsertionTimeLine());
+
+	ptr = startptr;
+	nbytes = count;
+	dst = buf;
+	*read_bytes = 0;
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it. We try to acquire it in shared mode so that the
+	 * concurrent WAL readers are also allowed. We try to do as little work as
+	 * possible while holding the lock to not impact concurrent WAL writers
+	 * much. We quickly exit to not cause any contention, if the lock isn't
+	 * immediately available.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return;
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr origptr;
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int 	idx;
+
+		origptr = ptr;
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+		endptr = XLogCtl->xlblocks[idx];
+
+		if (expectedEndPtr == endptr)
+		{
+			char	*page;
+			char    *data;
+			XLogPageHeader	phdr;
+
+			/*
+			 * We found WAL buffer page containing given XLogRecPtr. Get
+			 * starting address of the page and a pointer to the right location
+			 * of given XLogRecPtr in that page.
+			 */
+			page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+			data = page + ptr % XLOG_BLCKSZ;
+
+			/* Read what is wanted, not the whole page. */
+			if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+			{
+				/* All the bytes are in one page. */
+				memcpy(dst, data, nbytes);
+				*read_bytes += nbytes;
+				nbytes = 0;
+			}
+			else
+			{
+				Size	nread;
+
+				/*
+				 * All the bytes are not in one page. Read available bytes on
+				 * the current page, copy them over to output buffer and
+				 * continue to read remaining bytes.
+				 */
+				nread = XLOG_BLCKSZ - (data - page);
+				Assert(nread > 0 && nread <= nbytes);
+				memcpy(dst, data, nread);
+				ptr += nread;
+				nbytes -= nread;
+				dst += nread;
+				*read_bytes += nread;
+			}
+
+
+			/*
+			 * The fact that we acquire WALBufMappingLock while reading the WAL
+			 * buffer page itself guarantees that no one else initializes it or
+			 * makes it ready for next use in AdvanceXLInsertBuffer().
+			 *
+			 * However, we perform basic page header checks for ensuring that
+			 * we are not reading a page that just got initialized. Callers
+			 * will anyway perform extensive page-level and record-level
+			 * checks.
+			 */
+			phdr = (XLogPageHeader) page;
+
+			if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+				  phdr->xlp_pageaddr == (origptr - (origptr % XLOG_BLCKSZ)) &&
+				  phdr->xlp_tli == tli))
+			{
+				/*
+				 * WAL buffer page doesn't look valid, so return with what we
+				 * have read so far.
+				 */
+				break;
+			}
+		}
+		else
+		{
+			/*
+			 * Requested WAL isn't available in WAL buffers, so return with
+			 * what we have read so far.
+			 */
+			break;
+		}
+	}
+
+	LWLockRelease(WALBufMappingLock);
+
+	/* We never read more than what the caller has asked for. */
+	Assert(*read_bytes <= count);
+
+	elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+		 *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cadea21b37..bd11df448a 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1486,8 +1486,7 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1498,6 +1497,48 @@ WALRead(XLogReaderState *state,
 	XLogRecPtr	recptr;
 	Size		nbytes;
 
+#ifndef FRONTEND
+	/* Frontend tools have no idea of WAL buffers. */
+	Size        read_bytes;
+
+	/*
+	 * When possible, read WAL from WAL buffers. We skip this step and continue
+	 * the usual way, that is to read from WAL file, either when server is in
+	 * recovery (standby mode, archive or crash recovery), in which case the
+	 * WAL buffers are not used or when the server is inserting in a different
+	 * timeline from that of the timeline that we're trying to read WAL from.
+	 */
+	if (!RecoveryInProgress() &&
+		tli == GetWALInsertionTimeLine())
+	{
+		XLogReadFromBuffers(startptr, tli, count, buf, &read_bytes);
+
+		/*
+		 * Check if we have read fully (hit), partially (partial hit) or
+		 * nothing (miss) from WAL buffers. If we have read either partially or
+		 * nothing, then continue to read the remaining bytes the usual way,
+		 * that is, read from WAL file.
+		 */
+		if (count == read_bytes)
+		{
+			/* Buffer hit, so return. */
+			return true;
+		}
+		else if (read_bytes > 0 && count > read_bytes)
+		{
+			/*
+			 * Buffer partial hit, so reset the state to count the read bytes
+			 * and continue.
+			 */
+			buf += read_bytes;
+			startptr += read_bytes;
+			count -= read_bytes;
+		}
+
+		/* Buffer miss i.e., read_bytes = 0, so continue */
+	}
+#endif	/* FRONTEND */
+
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index cfe5409738..c9941aa001 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -247,6 +247,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern void XLogReadFromBuffers(XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf,
+								Size *read_bytes);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

v7-0002-Add-test-module-for-verifying-WAL-read-from-WAL-b.patch (application/x-patch)
From ad65a3c413720462c6eae0d5ea4c08ce656e582f Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Fri, 3 Mar 2023 10:33:36 +0000
Subject: [PATCH v7] Add test module for verifying WAL read from WAL buffers

---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 +++++++++
 .../test_wal_read_from_buffers/meson.build    | 36 +++++++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 44 ++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 ++++++
 .../test_wal_read_from_buffers.c              | 51 +++++++++++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 180 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..ea33361f69 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -33,6 +33,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..e3ffd3538d 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -29,5 +29,6 @@ subdir('test_regex')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7a09533ec7
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to verify that WAL can be read from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40a36edc07
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,36 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+# FIXME: prevent install during main install, but not during test :/
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to verify that WAL can be read from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_mod_args,
+)
+testprep_targets += test_wal_read_from_buffers
+
+install_data(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..3448e0bed6
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,44 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn')});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..8e89910133
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    OUT read_from_buffers bool
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT PARALLEL UNSAFE;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..ca8645101a
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,51 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test code for verifying WAL read from WAL buffers.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	lsn;
+	Size	read_bytes;
+	TimeLineID	tli;
+	char	data[XLOG_BLCKSZ] = {0};
+
+	lsn = PG_GETARG_LSN(0);
+
+	if (XLogRecPtrIsInvalid(lsn))
+		PG_RETURN_BOOL(false);
+
+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("WAL control functions cannot be executed during recovery.")));
+
+	tli = GetWALInsertionTimeLine();
+
+	XLogReadFromBuffers(lsn, tli, XLOG_BLCKSZ, data, &read_bytes);
+
+	PG_RETURN_BOOL(read_bytes > 0);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..7852b3e331
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test code for verifying WAL read from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

#28 Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#27)
Re: Improve WALRead() to suck data directly from WAL buffers when possible
+void
+XLogReadFromBuffers(XLogRecPtr startptr,
+                    TimeLineID tli,
+                    Size count,
+                    char *buf,
+                    Size *read_bytes)

Since this function presently doesn't return anything, can we have it
return the number of bytes read instead of storing it in a pointer
variable?

+    ptr = startptr;
+    nbytes = count;
+    dst = buf;

These variables seem superfluous.

+            /*
+             * Requested WAL isn't available in WAL buffers, so return with
+             * what we have read so far.
+             */
+            break;

nitpick: I'd move this to the top so that you can save a level of
indentation.

    if (expectedEndPtr != endptr)
        break;

    ... logic for when the data is found in the WAL buffers ...

+                /*
+                 * All the bytes are not in one page. Read available bytes on
+                 * the current page, copy them over to output buffer and
+                 * continue to read remaining bytes.
+                 */

Is it possible to memcpy more than a page at a time?

+            /*
+             * The fact that we acquire WALBufMappingLock while reading the WAL
+             * buffer page itself guarantees that no one else initializes it or
+             * makes it ready for next use in AdvanceXLInsertBuffer().
+             *
+             * However, we perform basic page header checks for ensuring that
+             * we are not reading a page that just got initialized. Callers
+             * will anyway perform extensive page-level and record-level
+             * checks.
+             */

Hm. I wonder if we should make these assertions instead.

+    elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+         *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);

I definitely don't think we should put an elog() in this code path.
Perhaps this should be guarded behind WAL_DEBUG.

+        /*
+         * Check if we have read fully (hit), partially (partial hit) or
+         * nothing (miss) from WAL buffers. If we have read either partially or
+         * nothing, then continue to read the remaining bytes the usual way,
+         * that is, read from WAL file.
+         */
+        if (count == read_bytes)
+        {
+            /* Buffer hit, so return. */
+            return true;
+        }
+        else if (read_bytes > 0 && count > read_bytes)
+        {
+            /*
+             * Buffer partial hit, so reset the state to count the read bytes
+             * and continue.
+             */
+            buf += read_bytes;
+            startptr += read_bytes;
+            count -= read_bytes;
+        }
+
+        /* Buffer miss i.e., read_bytes = 0, so continue */

I think we can simplify this. We effectively take the same action any time
"count" doesn't equal "read_bytes", so there's no need for the "else if".

    if (count == read_bytes)
        return true;

    buf += read_bytes;
    startptr += read_bytes;
    count -= read_bytes;
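
A minimal sketch of the simplification above, assuming a hypothetical ReadRequest struct that stands in for the buf, startptr and count variables in WALRead(); it is not code from the patch. It illustrates that a miss (read_bytes of 0) goes through the same three updates as a partial hit and leaves the request unchanged.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the caller-side variables in WALRead(). */
typedef struct ReadRequest
{
	char	   *buf;		/* where the next byte goes */
	uint64_t	startptr;	/* next WAL position to read */
	size_t		count;		/* bytes still wanted */
} ReadRequest;

/*
 * Account for 'read_bytes' served from WAL buffers. Returns true when the
 * request is fully satisfied; otherwise advances the request so that the
 * caller can read the remainder from the WAL file.
 */
static bool
consume_buffer_read(ReadRequest *req, size_t read_bytes)
{
	assert(read_bytes <= req->count);

	if (read_bytes == req->count)
		return true;			/* full hit */

	/* Partial hit or miss; for a miss read_bytes is 0, so nothing changes. */
	req->buf += read_bytes;
	req->startptr += read_bytes;
	req->count -= read_bytes;
	return false;
}
```

Whether the three redundant additions on a miss matter is a judgment call; the sketch only shows that the behaviour is the same with or without the "else if".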

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#29 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#28)
2 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

+void
+XLogReadFromBuffers(XLogRecPtr startptr,

Since this function presently doesn't return anything, can we have it
return the number of bytes read instead of storing it in a pointer
variable?

Done.

+    ptr = startptr;
+    nbytes = count;
+    dst = buf;

These variables seem superfluous.

Needed startptr and count for DEBUG1 message and assertion at the end.
Removed dst and used buf in the new patch now.

+            /*
+             * Requested WAL isn't available in WAL buffers, so return with
+             * what we have read so far.
+             */
+            break;

nitpick: I'd move this to the top so that you can save a level of
indentation.

Done.

+                /*
+                 * All the bytes are not in one page. Read available bytes on
+                 * the current page, copy them over to output buffer and
+                 * continue to read remaining bytes.
+                 */

Is it possible to memcpy more than a page at a time?

It would complicate things a lot there; the logic to figure out the
last page bytes that may or may not fit in the whole page gets
complicated. Also, the logic to verify each page's header gets
complicated. We might lose out if we memcpy all the pages at once and
start verifying each page's header in another loop.

I would like to keep it simple - read a single page from WAL buffers,
verify it and continue.
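
To make the page-at-a-time approach concrete, here is a rough, self-contained sketch. MiniRing, ring_put_page() and ring_read() are hypothetical names, the page size is shrunk to 8 bytes so the example is easy to trace, and no locking or page header validation is shown; it only mirrors the slot lookup and per-page copy idea, not the actual patch code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 8		/* tiny page size so the example is easy to trace */
#define NPAGES  4		/* number of slots in the ring */

/* Hypothetical miniature of the WAL buffer ring; not PostgreSQL's layout. */
typedef struct MiniRing
{
	char		pages[NPAGES][PAGE_SZ];
	uint64_t	page_end[NPAGES];	/* end position of the page in each slot */
} MiniRing;

/* Store one page of data that starts at 'pagestart' (a multiple of PAGE_SZ). */
static void
ring_put_page(MiniRing *ring, uint64_t pagestart, const char *data)
{
	size_t		idx = (size_t) ((pagestart / PAGE_SZ) % NPAGES);

	assert(pagestart % PAGE_SZ == 0);
	memcpy(ring->pages[idx], data, PAGE_SZ);
	ring->page_end[idx] = pagestart + PAGE_SZ;
}

/*
 * Copy up to 'count' bytes starting at 'startptr', one page per iteration.
 * Returns the number of bytes copied, which may be less than 'count'.
 */
static size_t
ring_read(const MiniRing *ring, uint64_t startptr, size_t count, char *out)
{
	uint64_t	ptr = startptr;
	size_t		ntotal = 0;

	while (ntotal < count)
	{
		size_t		idx = (size_t) ((ptr / PAGE_SZ) % NPAGES);
		size_t		off = (size_t) (ptr % PAGE_SZ);
		size_t		avail = PAGE_SZ - off;
		size_t		want = count - ntotal;
		size_t		n = want < avail ? want : avail;

		/* The slot holds some other page: stop with what we have so far. */
		if (ring->page_end[idx] != ptr - off + PAGE_SZ)
			break;

		memcpy(out + ntotal, ring->pages[idx] + off, n);
		ptr += n;
		ntotal += n;
	}

	return ntotal;
}
```

Computing the per-iteration length as the smaller of "bytes wanted" and "bytes left on this page" covers both the single-page and the cross-page case with one memcpy call.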

+            /*
+             * The fact that we acquire WALBufMappingLock while reading the WAL
+             * buffer page itself guarantees that no one else initializes it or
+             * makes it ready for next use in AdvanceXLInsertBuffer().
+             *
+             * However, we perform basic page header checks for ensuring that
+             * we are not reading a page that just got initialized. Callers
+             * will anyway perform extensive page-level and record-level
+             * checks.
+             */

Hm. I wonder if we should make these assertions instead.

Okay. I added XLogReaderValidatePageHeader for assert-only builds
which will help catch any issues there. But we can't perform record
level checks here because this function doesn't know where the record
starts from, it knows only pages. This change required us to pass in
XLogReaderState to XLogReadFromBuffers. I marked it as
PG_USED_FOR_ASSERTS_ONLY and did page header checks only when it is
passed as non-null so that someone who doesn't have XLogReaderState
can still read from buffers.
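
As an illustration of the basic (non-assert) page header check, here is a simplified sketch. SketchPageHeader, SKETCH_BLCKSZ and SKETCH_PAGE_MAGIC are made-up stand-ins for XLogPageHeaderData, XLOG_BLCKSZ and XLOG_PAGE_MAGIC, and the magic value is a placeholder; only the three comparisons correspond to what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ      8192		/* assumed page size for the example */
#define SKETCH_PAGE_MAGIC  0xBEEF	/* placeholder, not a real page magic */

/* Simplified page header with only the three fields being compared. */
typedef struct SketchPageHeader
{
	uint16_t	magic;
	uint32_t	tli;
	uint64_t	pageaddr;
} SketchPageHeader;

/*
 * Return true if the header belongs to the page containing 'ptr' on timeline
 * 'tli'. A freshly initialized or recycled page fails at least one test.
 */
static bool
sketch_page_looks_valid(const SketchPageHeader *hdr, uint64_t ptr, uint32_t tli)
{
	uint64_t	pagestart = ptr - (ptr % SKETCH_BLCKSZ);

	assert(hdr != NULL);
	return hdr->magic == SKETCH_PAGE_MAGIC &&
		hdr->pageaddr == pagestart &&
		hdr->tli == tli;
}
```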

+    elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+         *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);

I definitely don't think we should put an elog() in this code path.
Perhaps this should be guarded behind WAL_DEBUG.

Placing it behind WAL_DEBUG doesn't help users/developers. My
intention was to let users know that the WAL read hit the buffers,
it'll help them report if any issue occurs and also help developers to
debug that issue.

On a different note - I was recently looking at the code around the
WAL_DEBUG macro and the wal_debug GUC. The setup seems needlessly
complex: one must both build the source with the WAL_DEBUG macro and
enable the GUC to see the extended logs for WAL. IMO, the best way
there is either:
1) unify all the code under WAL_DEBUG macro and get rid of wal_debug GUC, or
2) unify all the code under wal_debug GUC (it is developer-only and
superuser-only so there shouldn't be a problem even if we ship it out
of the box).

If someone is concerned about the GUC being enabled on production
servers knowingly or unknowingly with option (2), we can go ahead with
option (1). I will discuss this separately to see what others think.

I think we can simplify this. We effectively take the same action any time
"count" doesn't equal "read_bytes", so there's no need for the "else if".

    if (count == read_bytes)
        return true;

    buf += read_bytes;
    startptr += read_bytes;
    count -= read_bytes;

I wanted to avoid setting these unnecessarily for buffer misses.

Thanks a lot for reviewing. I'm attaching the v8 patch for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v8-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch (application/octet-stream)
From 56855e25fc9e21a86c21a38328736ad727797d05 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 7 Mar 2023 06:51:40 +0000
Subject: [PATCH v8] Improve WALRead() to suck data directly from WAL buffers

---
 src/backend/access/transam/xlog.c       | 145 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  42 ++++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 191 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 543d4d897a..9dd97a66d3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1639,6 +1639,151 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and return total read bytes.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. Caller must be aware of
+ * this and deal with it.
+ */
+Size
+XLogReadFromBuffers(XLogReaderState *state PG_USED_FOR_ASSERTS_ONLY,
+					XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf)
+{
+	XLogRecPtr	ptr;
+	Size    nbytes;
+	Size	ntotal;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+	Assert(count > 0);
+	Assert(startptr <= GetFlushRecPtr(NULL));
+	Assert(!RecoveryInProgress());
+	Assert(tli == GetWALInsertionTimeLine());
+
+	ptr = startptr;
+	nbytes = count;
+	ntotal = 0;
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it. We try to acquire it in shared mode so that the
+	 * concurrent WAL readers are also allowed. We try to do as little work as
+	 * possible while holding the lock to not impact concurrent WAL writers
+	 * much. We quickly exit to not cause any contention, if the lock isn't
+	 * immediately available.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return ntotal;
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr origptr;
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int 	idx;
+		char	*page;
+		char    *data;
+		XLogPageHeader	phdr;
+
+		origptr = ptr;
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+		endptr = XLogCtl->xlblocks[idx];
+
+		/*
+		 * Requested WAL isn't available in WAL buffers, so return with what we
+		 * have read so far.
+		 */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/* Read what is wanted, not the whole page. */
+		if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			memcpy(buf, data, nbytes);
+			ntotal += nbytes;
+			nbytes = 0;
+		}
+		else
+		{
+			Size	nread;
+
+			/*
+			 * All the bytes are not in one page. Read available bytes on the
+			 * current page, copy them over to output buffer and continue to
+			 * read remaining bytes.
+			 */
+			nread = XLOG_BLCKSZ - (data - page);
+			Assert(nread > 0 && nread <= nbytes);
+			memcpy(buf, data, nread);
+			ptr += nread;
+			nbytes -= nread;
+			buf += nread;
+			ntotal += nread;
+		}
+
+		/*
+		 * The fact that we acquire WALBufMappingLock while reading the WAL
+		 * buffer page itself guarantees that no one else initializes it or
+		 * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
+		 * need to ensure that we are not reading a page that just got
+	 * initialized. For this, we look at the needed page header.
+		 */
+		phdr = (XLogPageHeader) page;
+
+		/*
+		 * Check if WAL buffer page looks valid. If it doesn't, return with
+		 * what we have read so far.
+		 */
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (origptr - (origptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+			break;
+
+		/*
+		 * Note that we don't perform all page header checks here to avoid
+		 * extra work in production builds, callers will anyway do those checks
+		 * extensively. However, in an assert-enabled build, we perform all the
+		 * checks here and raise an error if failed.
+		 */
+#ifdef USE_ASSERT_CHECKING
+		if (state != NULL &&
+			!XLogReaderValidatePageHeader(state, (endptr - XLOG_BLCKSZ),
+										  (char *) phdr))
+			ereport(ERROR,
+					(errcode(ERRCODE_INTERNAL_ERROR),
+					 errmsg_internal("error while reading WAL from WAL buffers: %s", state->errormsg_buf)));
+#endif
+	}
+
+	LWLockRelease(WALBufMappingLock);
+
+	/* We never read more than what the caller has asked for. */
+	Assert(ntotal <= count);
+
+	ereport(DEBUG1,
+			(errmsg_internal("read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+							 ntotal, count, LSN_FORMAT_ARGS(startptr), tli)));
+
+	return ntotal;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cadea21b37..03f0cca1e6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1486,8 +1486,7 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1498,6 +1497,45 @@ WALRead(XLogReaderState *state,
 	XLogRecPtr	recptr;
 	Size		nbytes;
 
+#ifndef FRONTEND
+	/* Frontend tools have no idea of WAL buffers. */
+	Size        nread;
+
+	/*
+	 * Try reading WAL from WAL buffers. We skip this step and continue the
+	 * usual way, that is to read from WAL file, either when server is in
+	 * recovery (standby mode, archive or crash recovery), in which case the
+	 * WAL buffers are not used or when the server is inserting in a different
+	 * timeline from that of the timeline that we're trying to read WAL from.
+	 */
+	if (!RecoveryInProgress() &&
+		tli == GetWALInsertionTimeLine())
+	{
+		nread = XLogReadFromBuffers(state, startptr, tli, count, buf);
+
+		Assert(nread >= 0);
+
+		/*
+		 * Check if we have read fully (hit), partially (partial hit) or
+		 * nothing (miss) from WAL buffers. If we have read either partially or
+		 * nothing, then continue to read the remaining bytes the usual way,
+		 * that is, read from WAL file.
+		 */
+		if (count == nread)
+			return true;	/* Buffer hit, so return. */
+		else if (count > nread)
+		{
+			/*
+			 * Buffer partial hit, so reset the state to count the read bytes
+			 * and continue.
+			 */
+			buf += nread;
+			startptr += nread;
+			count -= nread;
+		}
+	}
+#endif	/* FRONTEND */
+
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index cfe5409738..4fdd8c8b17 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -247,6 +247,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern Size XLogReadFromBuffers(struct XLogReaderState *state,
+								XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

v8-0002-Add-test-module-for-verifying-WAL-read-from-WAL-b.patch (application/octet-stream)
From aae6ad83b01cd7e6a49c2a11853e4b5b62fe19c8 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 7 Mar 2023 05:11:41 +0000
Subject: [PATCH v8] Add test module for verifying WAL read from WAL buffers

---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 +++++++++
 .../test_wal_read_from_buffers/meson.build    | 36 +++++++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 44 ++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 ++++++
 .../test_wal_read_from_buffers.c              | 51 +++++++++++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 180 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..ea33361f69 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -33,6 +33,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..e3ffd3538d 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -29,5 +29,6 @@ subdir('test_regex')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7a09533ec7
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to verify that WAL can be read from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40a36edc07
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,36 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+# FIXME: prevent install during main install, but not during test :/
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to verify that WAL can be read from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_mod_args,
+)
+testprep_targets += test_wal_read_from_buffers
+
+install_data(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..3448e0bed6
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,44 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn')});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..8e89910133
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    OUT read_from_buffers bool
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT PARALLEL UNSAFE;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..d0942658da
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,51 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test code for verifying WAL read from WAL buffers.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	lsn;
+	Size	nread;
+	TimeLineID	tli;
+	char	data[XLOG_BLCKSZ] = {0};
+
+	lsn = PG_GETARG_LSN(0);
+
+	if (XLogRecPtrIsInvalid(lsn))
+		PG_RETURN_BOOL(false);
+
+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("WAL control functions cannot be executed during recovery.")));
+
+	tli = GetWALInsertionTimeLine();
+
+	nread = XLogReadFromBuffers(NULL, lsn, tli, XLOG_BLCKSZ, data);
+
+	PG_RETURN_BOOL(nread > 0);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..7852b3e331
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test code for verifying WAL read from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

#30 Nathan Bossart
nathandbossart@gmail.com
In reply to: Bharath Rupireddy (#29)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Tue, Mar 07, 2023 at 12:39:13PM +0530, Bharath Rupireddy wrote:

On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

Is it possible to memcpy more than a page at a time?

It would complicate things a lot there; the logic to figure out the
last page bytes that may or may not fit in the whole page gets
complicated. Also, the logic to verify each page's header gets
complicated. We might lose out if we memcpy all the pages at once and
start verifying each page's header in another loop.

Doesn't the complicated logic you describe already exist to some extent in
the patch? You are copying a page at a time, which involves calculating
various addresses and byte counts.

+    elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+         *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);

I definitely don't think we should put an elog() in this code path.
Perhaps this should be guarded behind WAL_DEBUG.

Placing it behind WAL_DEBUG doesn't help users/developers. My
intention was to let users know that the WAL read hit the buffers,
it'll help them report if any issue occurs and also help developers to
debug that issue.

I still think an elog() is mighty expensive for this code path, even when
it doesn't actually produce any messages. And when it does, I think it has
the potential to be incredibly noisy.
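
One way to address this concern, sketched here under assumptions, is a compile-time guard so that builds without the debug macro pay nothing for the message. SKETCH_WAL_DEBUG, SKETCH_DEBUG_LOG and finish_buffer_read() are hypothetical names used only for illustration; PostgreSQL's own switch is the WAL_DEBUG macro mentioned above, and fprintf() stands in for the server's logging.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical macro names; defined away entirely unless the flag is set. */
#ifdef SKETCH_WAL_DEBUG
#define SKETCH_DEBUG_LOG(...) fprintf(stderr, __VA_ARGS__)
#else
#define SKETCH_DEBUG_LOG(...) ((void) 0)
#endif

/* Hypothetical tail of a buffer-read routine: report, then return the count. */
static size_t
finish_buffer_read(size_t ntotal, size_t count)
{
	/* We never read more than what the caller has asked for. */
	assert(ntotal <= count);

	SKETCH_DEBUG_LOG("read %zu of %zu bytes from WAL buffers\n",
					 ntotal, count);
	return ntotal;
}
```

Compiling with -DSKETCH_WAL_DEBUG turns the message on; without it, the macro arguments are not even evaluated.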

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#31 Nitin Jadhav
nitinjadhavpostgres@gmail.com
In reply to: Bharath Rupireddy (#29)
1 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

[1]
subscription tests:
PATCHED: WAL buffers hit - 1972, misses - 32616

Can you share more details about the test here?

I went through the v8 patch. Following are my thoughts to improve the
WAL buffer hit ratio.

Currently the no-longer-needed WAL data present in WAL buffers gets
cleared in XLogBackgroundFlush() which is called based on the
wal_writer_delay config setting. Once the data is flushed to the disk,
it is treated as no-longer-needed and it will be cleared as soon as
possible based on some config settings. I have done some testing by
tweaking the wal_writer_delay config setting to confirm the behaviour.
We can see that the WAL buffer hit ratio is good when the
wal_writer_delay is big enough [2] compared to smaller
wal_writer_delay [1]. So irrespective of the wal_writer_delay
settings, we should keep the WAL data in the WAL buffers as long as
possible so that all the readers (Mainly WAL senders) can take
advantage of this. The WAL page should be evicted from the WAL buffers
only when the WAL buffer is full and we need room for the new page.
The patch attached takes care of this. We can see the improvements in
WAL buffer hit ratio even when the wal_writer_delay is set to a lower
value [3].

Second, in WALRead(), we read the data from disk whenever we
don't find it in the WAL buffers. We don't store this data in the
WAL buffers; we just read it, use it and discard it. If we stored
this data in the WAL buffers, we might avoid a few disk reads.

[1]:
wal_buffers=1GB
wal_writer_delay=1ms
./pgbench --initialize --scale=300 postgres

WAL buffers hit=5046
WAL buffers miss=56767

[2]:
wal_buffers=1GB
wal_writer_delay=10s
./pgbench --initialize --scale=300 postgres

WAL buffers hit=45454
WAL buffers miss=14064

[3]:
wal_buffers=1GB
wal_writer_delay=1ms
./pgbench --initialize --scale=300 postgres

WAL buffers hit=37214
WAL buffers miss=844
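
For easier comparison, the three runs above can be reduced to hit ratios. The following is a minimal sketch in C; the helper name `wal_buffers_hit_ratio` is made up for illustration and is not part of PostgreSQL or of any patch in this thread.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not from the patch): turn hit and miss counters into
 * a hit ratio expressed as a percentage.
 */
static double
wal_buffers_hit_ratio(uint64_t hits, uint64_t misses)
{
    uint64_t    total = hits + misses;

    /* The sum of the two counters must not have wrapped around. */
    assert(total >= hits);

    /* No reads at all: report 0 instead of dividing by zero. */
    if (total == 0)
        return 0.0;

    return 100.0 * (double) hits / (double) total;
}
```

Applied to the counters above, this gives roughly 8.2% for [1] (5046 hits, 56767 misses), 76.4% for [2] (45454 hits, 14064 misses) and 97.8% for [3] (37214 hits, 844 misses).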

Please share your thoughts.

Thanks & Regards,
Nitin Jadhav

On Tue, Mar 7, 2023 at 12:39 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

+void
+XLogReadFromBuffers(XLogRecPtr startptr,

Since this function presently doesn't return anything, can we have it
return the number of bytes read instead of storing it in a pointer
variable?

Done.

+    ptr = startptr;
+    nbytes = count;
+    dst = buf;

These variables seem superfluous.

Needed startptr and count for DEBUG1 message and assertion at the end.
Removed dst and used buf in the new patch now.

+            /*
+             * Requested WAL isn't available in WAL buffers, so return with
+             * what we have read so far.
+             */
+            break;

nitpick: I'd move this to the top so that you can save a level of
indentation.

Done.

+                /*
+                 * All the bytes are not in one page. Read available bytes on
+                 * the current page, copy them over to output buffer and
+                 * continue to read remaining bytes.
+                 */

Is it possible to memcpy more than a page at a time?

It would complicate things a lot there; the logic to figure out the
last page bytes that may or may not fit in the whole page gets
complicated. Also, the logic to verify each page's header gets
complicated. We might lose out if we memcpy all the pages at once and
start verifying each page's header in another loop.

I would like to keep it simple - read a single page from WAL buffers,
verify it and continue.

+            /*
+             * The fact that we acquire WALBufMappingLock while reading the WAL
+             * buffer page itself guarantees that no one else initializes it or
+             * makes it ready for next use in AdvanceXLInsertBuffer().
+             *
+             * However, we perform basic page header checks for ensuring that
+             * we are not reading a page that just got initialized. Callers
+             * will anyway perform extensive page-level and record-level
+             * checks.
+             */

Hm. I wonder if we should make these assertions instead.

Okay. I added XLogReaderValidatePageHeader for assert-only builds
which will help catch any issues there. But we can't perform record
level checks here because this function doesn't know where the record
starts from, it knows only pages. This change required us to pass in
XLogReaderState to XLogReadFromBuffers. I marked it as
PG_USED_FOR_ASSERTS_ONLY and did page header checks only when it is
passed as non-null so that someone who doesn't have XLogReaderState
can still read from buffers.

+    elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+         *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);

I definitely don't think we should put an elog() in this code path.
Perhaps this should be guarded behind WAL_DEBUG.

Placing it behind WAL_DEBUG doesn't help users/developers. My
intention was to let users know that the WAL read hit the buffers,
it'll help them report if any issue occurs and also help developers to
debug that issue.

On a different note - I was recently looking at the code around
WAL_DEBUG macro and the wal_debug GUC. It looks so complex that one
needs to build source code with the WAL_DEBUG macro and enable the GUC
to see the extended logs for WAL. IMO, the best way there is either:
1) unify all the code under WAL_DEBUG macro and get rid of wal_debug GUC, or
2) unify all the code under wal_debug GUC (it is developer-only and
superuser-only so there shouldn't be a problem even if we ship it out
of the box).

If someone is concerned about the GUC being enabled on production
servers knowingly or unknowingly with option (2), we can go ahead with
option (1). I will discuss this separately to see what others think.

I think we can simplify this. We effectively take the same action any time
"count" doesn't equal "read_bytes", so there's no need for the "else if".

if (count == read_bytes)
return true;

buf += read_bytes;
startptr += read_bytes;
count -= read_bytes;

I wanted to avoid setting these unnecessarily for buffer misses.

Thanks a lot for reviewing. I'm attaching the v8 patch for further review.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v8-0003-Don-t-clear-the-WAL-buffers-in-XLogBackgroundFlush.patch (application/octet-stream)
From c9fba2b1525282770825b4082104486b2f475158 Mon Sep 17 00:00:00 2001
From: Nitin Jadhav <nitinjadhav@microsoft.com>
Date: Sat, 11 Mar 2023 23:42:44 +0530
Subject: [PATCH 3/4] Don't clear the WAL buffers in XLogBackgroundFlush()

The no-longer-needed WAL data present in WAL buffers gets
cleared in XLogBackgroundFlush(), which is called based on the
wal_writer_delay config value. As we are trying to read
as much data as possible from WAL buffers instead of fetching it
from disk, the no-longer-needed WAL data is needed now. Hence,
try to keep the WAL data in WAL buffers as long as
possible so that all the readers can take advantage of it.
---
 src/backend/access/transam/xlog.c | 6 ------
 1 file changed, 6 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 9dd97a66d3..6b0974b750 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2964,12 +2964,6 @@ XLogBackgroundFlush(void)
 	/* wake up walsenders now that we've released heavily contended locks */
 	WalSndWakeupProcessRequests();
 
-	/*
-	 * Great, done. To take some work off the critical path, try to initialize
-	 * as many of the no-longer-needed WAL buffers for future use as we can.
-	 */
-	AdvanceXLInsertBuffer(InvalidXLogRecPtr, insertTLI, true);
-
 	/*
 	 * If we determined that we need to write data, but somebody else
 	 * wrote/flushed already, it should be considered as being active, to
-- 
2.34.1

#32Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nitin Jadhav (#31)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Sun, Mar 12, 2023 at 12:52 AM Nitin Jadhav
<nitinjadhavpostgres@gmail.com> wrote:

I went through the v8 patch.

Thanks for looking at it. Please post the responses in-line, not above
the entire previous message for better readability.

Following are my thoughts to improve the
WAL buffer hit ratio.

Note that the motive of this patch is to read WAL from WAL buffers
*when possible* without affecting concurrent WAL writers.

Currently the no-longer-needed WAL data present in WAL buffers gets
cleared in XLogBackgroundFlush() which is called based on the
wal_writer_delay config setting. Once the data is flushed to the disk,
it is treated as no-longer-needed and it will be cleared as soon as
possible based on some config settings.

Opportunistically pre-initializing as many WAL buffer pages as
possible is done for a purpose; there's an illuminating comment [1].
So removing it fully is a no-go IMO. For
instance, it makes WAL buffer pages available for concurrent writers,
so there is less work for writers in GetXLogBuffer. I'm sure
removing the opportunistic pre-initialization of the WAL buffer pages
will hurt performance in a highly concurrent-write workload.

/*
* Great, done. To take some work off the critical path, try to initialize
* as many of the no-longer-needed WAL buffers for future use as we can.
*/
AdvanceXLInsertBuffer(InvalidXLogRecPtr, insertTLI, true);

Second, In WALRead(), we try to read the data from disk whenever we
don't find the data from WAL buffers. We don't store this data in the
WAL buffer. We just read the data, use it and leave it. If we store
this data to the WAL buffer, then we may avoid a few disk reads.

Again, this is going to hurt concurrent writers. Note that wal_buffers
isn't used as a full cache per se; there are multiple writers to it,
and readers try to read from it only *when possible*, without hurting
writers.

The patch attached takes care of this.

Please post the new proposal as a text file (not a .patch file), or as
plain text in the email itself if the change is small, or attach all
the patches if the patch is over-and-above the proposed patches.
Attaching a single over-and-above patch will make CFBot unhappy and
will force authors to repost the original patches. This is what we
typically follow. Having said that, I have some review comments to fix on
v8-0001, so I'll be sending out the v9 patch set soon.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#33Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#32)
2 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Sun, Mar 12, 2023 at 11:00 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

I have some review comments to fix on
v8-0001, so, I'll be sending out v9 patch-set soon.

Please find the attached v9 patch set for further review. I moved the
check for just-initialized WAL buffer pages before reading the page.
Until now it was the other way around: the page was read first and
only then was its header checked for a just-initialized page, which
is wrong. The attached v9 patch set corrects it.
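
The ordering described here (check the page header first, copy only afterwards) can be sketched as below. This is a standalone illustration and not the code in the attached patch: `SketchPageHeader`, `SK_PAGE_MAGIC` and `sketch_copy_from_page` are made-up stand-ins for XLogPageHeader, XLOG_PAGE_MAGIC and the per-page step in XLogReadFromBuffers(), and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_BLCKSZ      8192     /* stand-in for XLOG_BLCKSZ */
#define SK_PAGE_MAGIC  0xABCD   /* arbitrary value, NOT the real XLOG_PAGE_MAGIC */

/* Simplified page header; the real XLogPageHeaderData has more fields. */
typedef struct SketchPageHeader
{
    uint16_t    magic;
    uint32_t    tli;
    uint64_t    pageaddr;       /* WAL position of the first byte of the page */
} SketchPageHeader;

/*
 * Copy up to 'count' bytes starting at WAL position 'ptr' from 'page' into
 * 'dst'. The header is validated before any payload byte is copied, so a
 * just-initialized (or recycled) page is never handed to the caller.
 * Returns the number of bytes copied, or 0 if the page doesn't look valid.
 */
static size_t
sketch_copy_from_page(const char *page, uint64_t ptr, uint32_t tli,
                      size_t count, char *dst)
{
    SketchPageHeader hdr;
    size_t      off = (size_t) (ptr % SK_BLCKSZ);
    size_t      avail = SK_BLCKSZ - off;
    size_t      n = (count < avail) ? count : avail;

    assert(count > 0);

    /* Read the header via memcpy to stay clear of alignment assumptions. */
    memcpy(&hdr, page, sizeof(hdr));

    /* Check first ... */
    if (hdr.magic != SK_PAGE_MAGIC ||
        hdr.pageaddr != ptr - off ||
        hdr.tli != tli)
        return 0;

    /* ... and copy only once the page is known to be the one we want. */
    memcpy(dst, page + off, n);
    return n;
}
```

A return value of 0 corresponds to the "return with what we have read so far" case: the caller falls back to reading the remaining bytes from the WAL file.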

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v9-0001-Improve-WALRead-to-suck-data-directly-from-WAL-bu.patch (application/octet-stream)
From 398545d66aaf59b40658e37813edd445aad519f0 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 14 Mar 2023 03:15:00 +0000
Subject: [PATCH v9] Improve WALRead() to suck data directly from WAL buffers

---
 src/backend/access/transam/xlog.c       | 143 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  42 ++++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 189 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 543d4d897a..8f13551820 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1639,6 +1639,149 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and return total read bytes.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. Caller must be aware of
+ * this and deal with it.
+ */
+Size
+XLogReadFromBuffers(XLogReaderState *state PG_USED_FOR_ASSERTS_ONLY,
+					XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf)
+{
+	XLogRecPtr	ptr;
+	Size    nbytes;
+	Size	ntotal;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+	Assert(count > 0);
+	Assert(startptr <= GetFlushRecPtr(NULL));
+	Assert(!RecoveryInProgress());
+	Assert(tli == GetWALInsertionTimeLine());
+
+	ptr = startptr;
+	nbytes = count;
+	ntotal = 0;
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it. We try to acquire it in shared mode so that the
+	 * concurrent WAL readers are also allowed. We try to do as little work as
+	 * possible while holding the lock to not impact concurrent WAL writers
+	 * much. We quickly exit to not cause any contention, if the lock isn't
+	 * immediately available.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return ntotal;
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int 	idx;
+		char	*page;
+		char    *data;
+		XLogPageHeader	phdr;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+		endptr = XLogCtl->xlblocks[idx];
+
+		/*
+		 * Requested WAL isn't available in WAL buffers, so return with what we
+		 * have read so far.
+		 */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/*
+		 * The fact that we acquire WALBufMappingLock while reading the WAL
+		 * buffer page itself guarantees that no one else initializes it or
+		 * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
+		 * need to ensure that we are not reading a page that just got
+		 * initialized. For this, we look at the needed page header.
+		 */
+		phdr = (XLogPageHeader) page;
+
+		/*
+		 * Check if WAL buffer page looks valid. If it doesn't, return with
+		 * what we have read so far.
+		 */
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+			break;
+
+		/*
+		 * Note that we don't perform all page header checks here to avoid
+		 * extra work in production builds, callers will anyway do those checks
+		 * extensively. However, in an assert-enabled build, we perform all the
+		 * checks here and raise an error if failed.
+		 */
+#ifdef USE_ASSERT_CHECKING
+		if (state != NULL &&
+			!XLogReaderValidatePageHeader(state, (endptr - XLOG_BLCKSZ),
+										  (char *) phdr))
+			ereport(ERROR,
+					(errcode(ERRCODE_INTERNAL_ERROR),
+					 errmsg_internal("error while reading WAL from WAL buffers: %s", state->errormsg_buf)));
+#endif
+
+		/* Read what is wanted, not the whole page. */
+		if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			memcpy(buf, data, nbytes);
+			ntotal += nbytes;
+			nbytes = 0;
+		}
+		else
+		{
+			Size	nread;
+
+			/*
+			 * All the bytes are not in one page. Read available bytes on the
+			 * current page, copy them over to output buffer and continue to
+			 * read remaining bytes.
+			 */
+			nread = XLOG_BLCKSZ - (data - page);
+			Assert(nread > 0 && nread <= nbytes);
+			memcpy(buf, data, nread);
+			ptr += nread;
+			nbytes -= nread;
+			buf += nread;
+			ntotal += nread;
+		}
+	}
+
+	LWLockRelease(WALBufMappingLock);
+
+	/* We never read more than what the caller has asked for. */
+	Assert(ntotal <= count);
+
+	ereport(DEBUG1,
+			(errmsg_internal("read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+							 ntotal, count, LSN_FORMAT_ARGS(startptr), tli)));
+
+	return ntotal;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cadea21b37..03f0cca1e6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1486,8 +1486,7 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1498,6 +1497,45 @@ WALRead(XLogReaderState *state,
 	XLogRecPtr	recptr;
 	Size		nbytes;
 
+#ifndef FRONTEND
+	/* Frontend tools have no idea of WAL buffers. */
+	Size        nread;
+
+	/*
+	 * Try reading WAL from WAL buffers. We skip this step and continue the
+	 * usual way, that is to read from WAL file, either when server is in
+	 * recovery (standby mode, archive or crash recovery), in which case the
+	 * WAL buffers are not used or when the server is inserting in a different
+	 * timeline from that of the timeline that we're trying to read WAL from.
+	 */
+	if (!RecoveryInProgress() &&
+		tli == GetWALInsertionTimeLine())
+	{
+		nread = XLogReadFromBuffers(state, startptr, tli, count, buf);
+
+		Assert(nread >= 0);
+
+		/*
+		 * Check if we have read fully (hit), partially (partial hit) or
+		 * nothing (miss) from WAL buffers. If we have read either partially or
+		 * nothing, then continue to read the remaining bytes the usual way,
+		 * that is, read from WAL file.
+		 */
+		if (count == nread)
+			return true;	/* Buffer hit, so return. */
+		else if (count > nread)
+		{
+			/*
+			 * Buffer partial hit, so reset the state to count the read bytes
+			 * and continue.
+			 */
+			buf += nread;
+			startptr += nread;
+			count -= nread;
+		}
+	}
+#endif	/* FRONTEND */
+
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index cfe5409738..4fdd8c8b17 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -247,6 +247,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern Size XLogReadFromBuffers(struct XLogReaderState *state,
+								XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

v9-0002-Add-test-module-for-verifying-WAL-read-from-WAL-b.patch (application/octet-stream)
From 458805c1340ac92aaf4a7aca59e75c34e7690b3c Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 14 Mar 2023 03:15:33 +0000
Subject: [PATCH v9] Add test module for verifying WAL read from WAL buffers

---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 +++++++++
 .../test_wal_read_from_buffers/meson.build    | 36 +++++++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 44 ++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 ++++++
 .../test_wal_read_from_buffers.c              | 51 +++++++++++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 180 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..ea33361f69 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -33,6 +33,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..e3ffd3538d 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -29,5 +29,6 @@ subdir('test_regex')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7a09533ec7
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to verify that WAL can be read from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40a36edc07
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,36 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+# FIXME: prevent install during main install, but not during test :/
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to verify that WAL can be read from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_mod_args,
+)
+testprep_targets += test_wal_read_from_buffers
+
+install_data(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..3448e0bed6
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,44 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn')});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..8e89910133
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    OUT read_from_buffers bool
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT PARALLEL UNSAFE;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..d0942658da
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,51 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test code for verifying WAL read from WAL buffers.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	lsn;
+	Size	nread;
+	TimeLineID	tli;
+	char	data[XLOG_BLCKSZ] = {0};
+
+	lsn = PG_GETARG_LSN(0);
+
+	if (XLogRecPtrIsInvalid(lsn))
+		PG_RETURN_BOOL(false);
+
+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("WAL control functions cannot be executed during recovery.")));
+
+	tli = GetWALInsertionTimeLine();
+
+	nread = XLogReadFromBuffers(NULL, lsn, tli, XLOG_BLCKSZ, data);
+
+	PG_RETURN_BOOL(nread > 0);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..7852b3e331
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test code for verifying WAL read from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

#34Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Nathan Bossart (#30)
2 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Tue, Mar 7, 2023 at 11:14 PM Nathan Bossart <nathandbossart@gmail.com> wrote:

On Tue, Mar 07, 2023 at 12:39:13PM +0530, Bharath Rupireddy wrote:

On Tue, Mar 7, 2023 at 3:30 AM Nathan Bossart <nathandbossart@gmail.com> wrote:

Is it possible to memcpy more than a page at a time?

It would complicate things a lot there; the logic to figure out the
last page bytes that may or may not fit in the whole page gets
complicated. Also, the logic to verify each page's header gets
complicated. We might lose out if we memcpy all the pages at once and
start verifying each page's header in another loop.

Doesn't the complicated logic you describe already exist to some extent in
the patch? You are copying a page at a time, which involves calculating
various addresses and byte counts.

Okay, here is the v10 patch set, attached, which avoids multiple
memcpy calls. This should benefit callers that want to read more than
one WAL buffer page (the streaming replication WAL sender, for instance).
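
To illustrate the idea of batching, here is a standalone sketch of copying a run of adjacent pages with one memcpy instead of one memcpy per page. It is not the code from the attached patch: the tiny ring buffer, `SketchWalBuffers` and `sketch_read_batched` are assumptions made for this example, and header validation and locking are omitted. The only rule it demonstrates is that a batch must be flushed when the ring wraps around, because the next page is then no longer adjacent in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SK_BLCKSZ 64            /* tiny page size, just for the sketch */
#define SK_NBUFS  4             /* number of pages in the ring */

typedef struct SketchWalBuffers
{
    char        pages[SK_NBUFS * SK_BLCKSZ];
    uint64_t    xlblocks[SK_NBUFS]; /* end position of the page in each slot */
} SketchWalBuffers;

/*
 * Read up to 'count' bytes starting at 'startptr' into 'buf'. Pages that are
 * adjacent in memory are copied with a single memcpy. Returns the bytes read;
 * '*ncopies' receives the number of memcpy calls that were needed.
 */
static size_t
sketch_read_batched(const SketchWalBuffers *wb, uint64_t startptr,
                    size_t count, char *buf, int *ncopies)
{
    uint64_t    ptr = startptr;
    size_t      ntotal = 0;     /* bytes already copied to 'buf' */
    size_t      nbatch = 0;     /* bytes collected in the current batch */
    const char *batchstart = NULL;

    assert(count > 0);
    *ncopies = 0;

    while (ntotal + nbatch < count)
    {
        size_t      idx = (size_t) ((ptr / SK_BLCKSZ) % SK_NBUFS);
        size_t      off = (size_t) (ptr % SK_BLCKSZ);
        size_t      want = count - ntotal - nbatch;
        size_t      avail = SK_BLCKSZ - off;
        size_t      n = (want < avail) ? want : avail;

        /* The slot holds some other page: stop with what we have. */
        if (wb->xlblocks[idx] != ptr - off + SK_BLCKSZ)
            break;

        if (batchstart == NULL)
            batchstart = wb->pages + idx * SK_BLCKSZ + off;

        nbatch += n;
        ptr += n;

        /* Last slot of the ring: the next page is not adjacent, so flush. */
        if (idx == SK_NBUFS - 1)
        {
            memcpy(buf + ntotal, batchstart, nbatch);
            ntotal += nbatch;
            nbatch = 0;
            batchstart = NULL;
            (*ncopies)++;
        }
    }

    /* Flush whatever is left in the final batch. */
    if (nbatch > 0)
    {
        memcpy(buf + ntotal, batchstart, nbatch);
        ntotal += nbatch;
        (*ncopies)++;
    }

    return ntotal;
}
```

In this model a read that spans three pages within the ring needs one copy, and a read that crosses the wraparound point needs two, compared with one copy per page in the page-at-a-time approach.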

+    elog(DEBUG1, "read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+         *read_bytes, count, LSN_FORMAT_ARGS(startptr), tli);

I definitely don't think we should put an elog() in this code path.
Perhaps this should be guarded behind WAL_DEBUG.

Placing it behind WAL_DEBUG doesn't help users/developers. My
intention was to let users know that the WAL read hit the buffers,
it'll help them report if any issue occurs and also help developers to
debug that issue.

I still think an elog() is mighty expensive for this code path, even when
it doesn't actually produce any messages. And when it does, I think it has
the potential to be incredibly noisy.

Well, my motive was to have a way for the user to know WAL buffer hits
and misses to report any found issues. However, I have a plan later to
add WAL buffer stats (hits/misses). I understand that even if someone
enables DEBUG1, this message can bloat server log files and make
recovery slower, especially on a standby. Hence, I agree to keep these
logs behind the WAL_DEBUG macro like others and did so in the attached
v10 patch set.
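
The effect of guarding the message can be sketched as follows. This is a standalone model, not the patch's code: `SKETCH_WAL_DEBUG` plays the role of the compile-time WAL_DEBUG macro, `sketch_wal_debug_enabled` plays the role of the wal_debug GUC, and fprintf stands in for the server's logging call. All three names are made up for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Build with -DSKETCH_WAL_DEBUG to compile the logging code in at all. */
#ifdef SKETCH_WAL_DEBUG
static bool sketch_wal_debug_enabled = false;   /* run-time switch */
#endif

/*
 * Report how much was read from WAL buffers. Returns the number of log
 * messages emitted, which is always 0 unless the code was compiled in AND
 * the run-time switch is on.
 */
static int
sketch_report_buffer_read(size_t nread, size_t count)
{
    int         nmsgs = 0;

    /* We never read more than what the caller has asked for. */
    assert(nread <= count);

#ifdef SKETCH_WAL_DEBUG
    if (sketch_wal_debug_enabled)
    {
        fprintf(stderr, "read %zu bytes out of %zu bytes from WAL buffers\n",
                nread, count);
        nmsgs++;
    }
#endif

    return nmsgs;
}
```

In a default build the whole logging branch is compiled out, so the hot path pays nothing for it; with the macro defined, the cost is a single boolean test unless the switch is turned on.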

Please review the attached v10 patch set further.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v10-0001-Improve-WALRead-to-suck-data-directly-from-WAL-b.patch (application/x-patch)
From aa6454d9abb9a70b728dbba7f40279108486a3e4 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 14 Mar 2023 07:30:09 +0000
Subject: [PATCH v10] Improve WALRead() to suck data directly from WAL buffers

---
 src/backend/access/transam/xlog.c       | 171 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  42 +++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 217 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 543d4d897a..d40b9562e1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1639,6 +1639,177 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and return total read bytes.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. Caller must be aware of
+ * this and deal with it.
+ */
+Size
+XLogReadFromBuffers(XLogReaderState *state PG_USED_FOR_ASSERTS_ONLY,
+					XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf)
+{
+	XLogRecPtr	ptr = startptr;
+	Size	nbytes = count;	/* total bytes requested to be read by caller */
+	Size	ntotal = 0;	/* total bytes read */
+	Size	nbatch = 0;	/* bytes to be read in single batch */
+	char	*batchstart = NULL;	/* location to read from for single batch */
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+	Assert(count > 0);
+	Assert(startptr <= GetFlushRecPtr(NULL));
+	Assert(!RecoveryInProgress());
+	Assert(tli == GetWALInsertionTimeLine());
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it. We try to acquire it in shared mode so that the
+	 * concurrent WAL readers are also allowed. We try to do as little work as
+	 * possible while holding the lock to not impact concurrent WAL writers
+	 * much. We quickly exit to not cause any contention, if the lock isn't
+	 * immediately available.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return ntotal;
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int 	idx;
+		char	*page;
+		char    *data;
+		XLogPageHeader	phdr;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+		endptr = XLogCtl->xlblocks[idx];
+
+		/* Requested WAL isn't available in WAL buffers. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/*
+		 * The fact that we acquire WALBufMappingLock while reading the WAL
+		 * buffer page itself guarantees that no one else initializes it or
+		 * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
+		 * need to ensure that we are not reading a page that just got
+		 * initialized. For this, we look at the needed page header.
+		 */
+		phdr = (XLogPageHeader) page;
+
+		/* Return, if WAL buffer page doesn't look valid. */
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+			break;
+
+		/*
+		 * Note that we don't perform all page header checks here to avoid
+		 * extra work in production builds, callers will anyway do those checks
+		 * extensively. However, in an assert-enabled build, we perform all the
+		 * checks here and raise an error if failed.
+		 */
+#ifdef USE_ASSERT_CHECKING
+		if (state != NULL &&
+			!XLogReaderValidatePageHeader(state, (endptr - XLOG_BLCKSZ),
+										  (char *) phdr))
+			ereport(ERROR,
+					(errcode(ERRCODE_INTERNAL_ERROR),
+					 errmsg_internal("error while reading WAL from WAL buffers: %s", state->errormsg_buf)));
+#endif
+
+		/* Count what is wanted, not the whole page. */
+		if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			nbatch += nbytes;
+			ntotal += nbytes;
+			nbytes = 0;
+		}
+		else
+		{
+			Size	navailable;
+
+			/*
+			 * All the bytes are not in one page. Deduce available bytes on the
+			 * current page, count them and continue to look for remaining
+			 * bytes.
+			 */
+			navailable = XLOG_BLCKSZ - (data - page);
+			Assert(navailable > 0 && navailable <= nbytes);
+			ptr += navailable;
+			nbytes -= navailable;
+			nbatch += navailable;
+			ntotal += navailable;
+		}
+
+		/*
+		 * We avoid multiple memcpy calls while reading WAL. Note that we
+		 * memcpy what we have counted so far whenever we are wrapping around
+		 * WAL buffers (because WAL buffers are organized as circular array of
+		 * pages) and continue to look for remaining WAL.
+		 */
+		if (batchstart == NULL)
+		{
+			/* Mark where the data in WAL buffers starts from. */
+			batchstart = data;
+		}
+
+		/*
+		 * We are wrapping around WAL buffers, so read what we have counted so
+		 * far.
+		 */
+		if (idx == XLogCtl->XLogCacheBlck)
+		{
+			Assert(batchstart != NULL);
+			Assert(nbatch > 0);
+
+			memcpy(buf, batchstart, nbatch);
+			buf += nbatch;
+
+			/* Reset for next batch. */
+			batchstart = NULL;
+			nbatch = 0;
+		}
+	}
+
+	/* Read what we have counted so far. */
+	Assert(nbatch <= ntotal);
+	if (batchstart != NULL && nbatch > 0)
+		memcpy(buf, batchstart, nbatch);
+
+	LWLockRelease(WALBufMappingLock);
+
+	/* We never read more than what the caller has asked for. */
+	Assert(ntotal <= count);
+
+#ifdef WAL_DEBUG
+	if (XLOG_DEBUG)
+		ereport(DEBUG1,
+				(errmsg_internal("read %zu bytes out of %zu bytes from WAL buffers for given LSN %X/%X, Timeline ID %u",
+								  ntotal, count, LSN_FORMAT_ARGS(startptr), tli)));
+#endif
+
+	return ntotal;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index cadea21b37..03f0cca1e6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1486,8 +1486,7 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1498,6 +1497,45 @@ WALRead(XLogReaderState *state,
 	XLogRecPtr	recptr;
 	Size		nbytes;
 
+#ifndef FRONTEND
+	/* Frontend tools have no idea of WAL buffers. */
+	Size        nread;
+
+	/*
+	 * Try reading WAL from WAL buffers. We skip this step and continue the
+	 * usual way, that is to read from WAL file, either when server is in
+	 * recovery (standby mode, archive or crash recovery), in which case the
+	 * WAL buffers are not used or when the server is inserting in a different
+	 * timeline from that of the timeline that we're trying to read WAL from.
+	 */
+	if (!RecoveryInProgress() &&
+		tli == GetWALInsertionTimeLine())
+	{
+		nread = XLogReadFromBuffers(state, startptr, tli, count, buf);
+
+		Assert(nread >= 0);
+
+		/*
+		 * Check if we have read fully (hit), partially (partial hit) or
+		 * nothing (miss) from WAL buffers. If we have read either partially or
+		 * nothing, then continue to read the remaining bytes the usual way,
+		 * that is, read from WAL file.
+		 */
+		if (count == nread)
+			return true;	/* Buffer hit, so return. */
+		else if (count > nread)
+		{
+			/*
+			 * Buffer partial hit, so reset the state to count the read bytes
+			 * and continue.
+			 */
+			buf += nread;
+			startptr += nread;
+			count -= nread;
+		}
+	}
+#endif	/* FRONTEND */
+
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index cfe5409738..4fdd8c8b17 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -247,6 +247,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern Size XLogReadFromBuffers(struct XLogReaderState *state,
+								XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

v10-0002-Add-test-module-for-verifying-WAL-read-from-WAL-.patch (application/x-patch)
From 9a7abe263123266cfc7fccd259570164ecdab047 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 14 Mar 2023 07:14:01 +0000
Subject: [PATCH v10] Add test module for verifying WAL read from WAL buffers

---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 +++++++++
 .../test_wal_read_from_buffers/meson.build    | 36 +++++++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 44 ++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 ++++++
 .../test_wal_read_from_buffers.c              | 51 +++++++++++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 180 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index c629cbe383..ea33361f69 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -33,6 +33,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 1baa6b558d..e3ffd3538d 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -29,5 +29,6 @@ subdir('test_regex')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7a09533ec7
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to verify that WAL can be read from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40a36edc07
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,36 @@
+# Copyright (c) 2022-2023, PostgreSQL Global Development Group
+
+# FIXME: prevent install during main install, but not during test :/
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to verify that WAL can be read from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_mod_args,
+)
+testprep_targets += test_wal_read_from_buffers
+
+install_data(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+  kwargs: contrib_data_args,
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..3448e0bed6
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,44 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('main');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn')});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..8e89910133
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    OUT read_from_buffers bool
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT PARALLEL UNSAFE;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..d0942658da
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,51 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test code for verifying WAL read from WAL buffers.
+ *
+ * Portions Copyright (c) 1996-2023, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	lsn;
+	Size	nread;
+	TimeLineID	tli;
+	char	data[XLOG_BLCKSZ] = {0};
+
+	lsn = PG_GETARG_LSN(0);
+
+	if (XLogRecPtrIsInvalid(lsn))
+		PG_RETURN_BOOL(false);
+
+	if (RecoveryInProgress())
+		ereport(ERROR,
+				(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+				 errmsg("recovery is in progress"),
+				 errhint("WAL control functions cannot be executed during recovery.")));
+
+	tli = GetWALInsertionTimeLine();
+
+	nread = XLogReadFromBuffers(NULL, lsn, tli, XLOG_BLCKSZ, data);
+
+	PG_RETURN_BOOL(nread > 0);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..7852b3e331
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test code for verifying WAL read from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

#35Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#9)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Sat, 2023-01-14 at 12:34 -0800, Andres Freund wrote:

One benefit would be that it'd make it more realistic to use direct
IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less
clearly a win.

Does this patch still look like a good fit for your (or someone else's)
plans for direct IO here? If so, would committing this soon make it
easier to make progress on that, or should we wait until it's actually
needed?

If I recall, this patch does not provide a performance benefit as-is
(correct me if things have changed) and I don't know if a reduction in
syscalls alone is enough to justify it. But if it paves the way for
direct IO for WAL, that does seem worth it.

Regards,
Jeff Davis

#36Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#35)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hi,

On 2023-10-03 16:05:32 -0700, Jeff Davis wrote:

On Sat, 2023-01-14 at 12:34 -0800, Andres Freund wrote:

One benefit would be that it'd make it more realistic to use direct
IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less
clearly a win.

Does this patch still look like a good fit for your (or someone else's)
plans for direct IO here? If so, would committing this soon make it
easier to make progress on that, or should we wait until it's actually
needed?

I think it'd be quite useful to have. Even with the code as of 16, I see
better performance in some workloads with debug_io_direct=wal,
wal_sync_method=open_datasync compared to any other configuration. Except of
course that it makes walsenders more problematic, as they suddenly require
read IO. Thus having support for walsenders to send directly from wal buffers
would be beneficial, even without further AIO infrastructure.

I also think there are other quite desirable features that are made easier by
this patch. One of the primary problems with using synchronous replication is
the latency increase, obviously. We can't send out WAL before it has locally
been written out and flushed to disk. For some workloads, we could
substantially lower synchronous commit latency if we were able to send WAL to
remote nodes *before* WAL has been made durable locally, even if the receiving
systems wouldn't be allowed to write that data to disk yet: It takes less time
to send just "write LSN: %X/%X, flush LSN: %X/%X" than also having to send
all the not-yet-durable WAL.

In many OLTP workloads there won't be WAL flushes between generating WAL for
DML and commit, which means that the amount of WAL that needs to be sent out
at commit can be of nontrivial size.

E.g. for pgbench, normally a transaction is about ~550 bytes (fitting in a
single tcp/ip packet), but a pgbench transaction that needs to emit FPIs for
everything is a lot larger: ~45kB (not fitting in a single packet). Obviously
many real-world OLTP workloads actually do more writes than
pgbench. Making the commit latency of the latter be closer to the commit
latency of the former when using syncrep would obviously be great.
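
The packet arithmetic behind those two figures can be sketched as below. The maximum segment size of 1460 bytes is an assumption (a 1500-byte Ethernet MTU minus 40 bytes of IPv4 and TCP headers without options), "~45kB" is taken as 45,000 bytes, and segments_needed is a hypothetical helper.

```c
#include <stddef.h>

/*
 * Number of TCP segments needed to carry 'payload' bytes when each segment
 * holds at most 'mss' bytes (ceiling division).
 */
size_t
segments_needed(size_t payload, size_t mss)
{
	return (payload + mss - 1) / mss;
}
```

With these assumed figures, segments_needed(550, 1460) is 1 and segments_needed(45000, 1460) is 31, so the FPI-heavy transaction has roughly thirty times as many segments to push out at commit.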

Of course this patch is just a relatively small step towards that: We'd also
need in-memory buffering on the receiving side, the replication protocol would
need to be improved, we'd likely need an option to explicitly opt into
receiving unflushed data. But it's still a pretty much required step.

Greetings,

Andres Freund

#37Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#36)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Thu, Oct 12, 2023 at 4:13 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-10-03 16:05:32 -0700, Jeff Davis wrote:

Does this patch still look like a good fit for your (or someone else's)
plans for direct IO here? If so, would committing this soon make it
easier to make progress on that, or should we wait until it's actually
needed?

I think it'd be quite useful to have. Even with the code as of 16, I see
better performance in some workloads with debug_io_direct=wal,
wal_sync_method=open_datasync compared to any other configuration. Except of
course that it makes walsenders more problematic, as they suddenly require
read IO. Thus having support for walsenders to send directly from wal buffers
would be beneficial, even without further AIO infrastructure.

Right. Tests show the benefit with WAL DIO + this patch -
/messages/by-id/CALj2ACV6rS+7iZx5+oAvyXJaN4AG-djAQeM1mrM=YSDkVrUs7g@mail.gmail.com.

Also, irrespective of WAL DIO, the WAL buffers hit ratio with the
patch stood at 95% for 1 primary, 1 sync standby, 1 async standby,
pgbench --scale=300 --client=32 --time=900. In other words, 95% of the
time the walsenders avoided reading from the file and thus avoided pread
system calls - /messages/by-id/CALj2ACXKKK=wbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54+Na=Q@mail.gmail.com.
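
As a simplified picture of what a hit, partial hit, or miss means for a reader, the sketch below tries an in-memory cache first and falls back to a slower source for whatever is left. It is only an illustration under stated assumptions: DemoCache and the demo_* functions are hypothetical, the "file" is a plain byte array, and no locking or page validation is shown.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical cache holding the bytes at positions [start, start + len). */
typedef struct DemoCache
{
	size_t		start;
	size_t		len;
	const char *bytes;
} DemoCache;

/* Copy as many leading bytes as the cache holds; return the bytes copied. */
size_t
demo_read_from_cache(const DemoCache *c, size_t pos, size_t count, char *buf)
{
	size_t		avail;
	size_t		n;

	if (pos < c->start || pos >= c->start + c->len)
		return 0;				/* miss: the start position is not cached */

	avail = c->start + c->len - pos;
	n = count < avail ? count : avail;
	memcpy(buf, c->bytes + (pos - c->start), n);

	return n;
}

/* Slow path standing in for a file read. */
size_t
demo_read_from_file(const char *file, size_t pos, size_t count, char *buf)
{
	memcpy(buf, file + pos, count);
	return count;
}

/*
 * Read 'count' bytes at 'pos': cache first, then the file for the remainder.
 * Returns how many bytes were served from the cache.
 */
size_t
demo_read(const DemoCache *c, const char *file,
		  size_t pos, size_t count, char *buf)
{
	size_t		nread = demo_read_from_cache(c, pos, count, buf);

	if (nread < count)
		demo_read_from_file(file, pos + nread, count - nread, buf + nread);

	return nread;
}
```

A return value equal to count is a full hit, zero is a miss, and anything in between is a partial hit where only the tail of the request touched the slow path.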

I also think there are other quite desirable features that are made easier by
this patch. One of the primary problems with using synchronous replication is
the latency increase, obviously. We can't send out WAL before it has locally
been written out and flushed to disk. For some workloads, we could
substantially lower synchronous commit latency if we were able to send WAL to
remote nodes *before* WAL has been made durable locally, even if the receiving
systems wouldn't be allowed to write that data to disk yet: It takes less time
to send just "write LSN: %X/%X, flush LSN: %X/%X" than also having to send
all the not-yet-durable WAL.

In many OLTP workloads there won't be WAL flushes between generating WAL for
DML and commit, which means that the amount of WAL that needs to be sent out
at commit can be of nontrivial size.

E.g. for pgbench, normally a transaction is about ~550 bytes (fitting in a
single tcp/ip packet), but a pgbench transaction that needs to emit FPIs for
everything is a lot larger: ~45kB (not fitting in a single packet). Obviously
many real-world OLTP workloads actually do more writes than
pgbench. Making the commit latency of the latter be closer to the commit
latency of the former when using syncrep would obviously be great.

Of course this patch is just a relatively small step towards that: We'd also
need in-memory buffering on the receiving side, the replication protocol would
need to be improved, we'd likely need an option to explicitly opt into
receiving unflushed data. But it's still a pretty much required step.

Yes, this patch can pave the way for all of the above features in
future. However, I'm looking forward to getting this in for now.
Later, I'll come up with more concrete thoughts on the above.

Having said that, the latest v10 patch, which addresses some of the
review comments, is at
/messages/by-id/CALj2ACU3ZYzjOv4vZTR+LFk5PL4ndUnbLS6E1vG2dhDBjQGy2A@mail.gmail.com.
Any further thoughts on the patch are welcome.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#38Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#36)
2 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Thu, Oct 12, 2023 at 4:13 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-10-03 16:05:32 -0700, Jeff Davis wrote:

On Sat, 2023-01-14 at 12:34 -0800, Andres Freund wrote:

One benefit would be that it'd make it more realistic to use direct
IO for WAL
- for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less
clearly a win.

Does this patch still look like a good fit for your (or someone else's)
plans for direct IO here? If so, would committing this soon make it
easier to make progress on that, or should we wait until it's actually
needed?

I think it'd be quite useful to have. Even with the code as of 16, I see
better performance in some workloads with debug_io_direct=wal,
wal_sync_method=open_datasync compared to any other configuration. Except of
course that it makes walsenders more problematic, as they suddenly require
read IO. Thus having support for walsenders to send directly from wal buffers
would be beneficial, even without further AIO infrastructure.

I'm attaching the v11 patch set with the following changes:
- Improved input validation in the function that reads WAL from WAL
buffers in 0001 patch.
- Improved test module's code in 0002 patch.
- Modernized meson build file in 0002 patch.
- Added commit messages for both the patches.
- Ran pgindent on both the patches.

Any thoughts are welcome.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v11-0001-Allow-WAL-reading-from-WAL-buffers.patch (application/x-patch)
From 6615590d795a6068897a8aa348d9e699442bb07b Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Fri, 20 Oct 2023 15:49:23 +0000
Subject: [PATCH v11] Allow WAL reading from WAL buffers

This commit gives WALRead() the capability to read WAL from WAL
buffers when possible. When requested WAL isn't available in WAL
buffers, the WAL is read from the WAL file as usual. It relies on
WALBufMappingLock so that no one replaces the WAL buffer page that
we're reading from. It skips reading from WAL buffers if
WALBufMappingLock can't be acquired immediately. In other words,
it doesn't wait for WALBufMappingLock to be available. This helps
reduce the contention on WALBufMappingLock.

This commit benefits the callers of WALRead(), namely walsenders
and pg_walinspect. They can now avoid reading WAL from the WAL
file (possibly avoiding disk IO). Tests show that the WAL buffers
hit ratio stood at 95% for 1 primary, 1 sync standby, 1 async
standby, with pgbench --scale=300 --client=32 --time=900. In other
words, 95% of the time the walsenders avoided reading from the
file and thus avoided pread system calls:
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com

This commit also benefits when direct IO is enabled for WAL.
Reading WAL from WAL buffers brings performance back close to
that without direct IO for WAL:
https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com

This commit also paves the way for the following features in
future:
- Improves synchronous replication performance by replicating
directly from WAL buffers.
- An opt-in way for the walreceivers to receive unflushed WAL.
More details here:
https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar, Andres Freund
Reviewed-by: Nathan Bossart, Kuntal Ghosh
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/backend/access/transam/xlog.c       | 208 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  45 ++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 257 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cea13e3d58..9553a880f1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1706,6 +1706,214 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and return total read bytes.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. Caller must be aware of
+ * this and deal with it.
+ *
+ * Note that this function is not available for frontend code as WAL buffers is
+ * an internal mechanism to the server.
+ */
+Size
+XLogReadFromBuffers(XLogReaderState *state,
+					XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf)
+{
+	XLogRecPtr	ptr;
+	Size		nbytes;
+	Size		ntotal;
+	Size		nbatch;
+	char	   *batchstart;
+	TimeLineID	current_timeline;
+
+	/*
+	 * Do some input parameter validations to fail quickly with meaningful
+	 * error messages or return immediately.
+	 */
+	if (unlikely(RecoveryInProgress()))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg_internal("reading WAL from WAL buffers is not supported during recovery")));
+
+	if (unlikely(XLogRecPtrIsInvalid(startptr) ||
+				 startptr > GetFlushRecPtr(NULL)))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg_internal("invalid WAL start LSN %X/%X specified for reading from WAL buffers",
+								 LSN_FORMAT_ARGS(startptr))));
+
+	current_timeline = GetWALInsertionTimeLine();
+	if (unlikely(tli != current_timeline))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg_internal("requested WAL timeline ID %u is different from that of current system timeline ID %u",
+								 tli, current_timeline)));
+
+	if (unlikely(count <= 0))
+		return 0;
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it. We try to acquire it in shared mode so that
+	 * the concurrent WAL readers are also allowed. We try to do as little work
+	 * as possible while holding the lock to not impact concurrent WAL writers
+	 * much. We quickly exit to not cause any contention, if the lock isn't
+	 * immediately available.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return 0;
+
+	ptr = startptr;
+	nbytes = count;				/* Total bytes requested to be read by caller. */
+	ntotal = 0;					/* Total bytes read. */
+	nbatch = 0;					/* Bytes to be read in single batch. */
+	batchstart = NULL;			/* Location to read from for single batch. */
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int			idx;
+		char	   *page;
+		char	   *data;
+		XLogPageHeader phdr;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+		endptr = XLogCtl->xlblocks[idx];
+
+		/* Requested WAL isn't available in WAL buffers. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/*
+		 * The fact that we acquire WALBufMappingLock while reading the WAL
+		 * buffer page itself guarantees that no one else initializes it or
+		 * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
+		 * need to ensure that we are not reading a page that just got
+		 * initialized. For this, we look at the needed page header.
+		 */
+		phdr = (XLogPageHeader) page;
+
+		/* Return, if WAL buffer page doesn't look valid. */
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+			break;
+
+		/*
+		 * Note that we don't perform all page header checks here to avoid
+		 * extra work in production builds; callers will anyway do those
+		 * checks extensively. However, in an assert-enabled build, we perform
+		 * all the checks here and raise an error if failed.
+		 */
+#ifdef USE_ASSERT_CHECKING
+		if (unlikely(state != NULL &&
+					 !XLogReaderValidatePageHeader(state, (endptr - XLOG_BLCKSZ),
+												   (char *) phdr)))
+		{
+			if (state->errormsg_buf[0])
+				ereport(ERROR,
+						(errcode(ERRCODE_INTERNAL_ERROR),
+						 errmsg_internal("%s", state->errormsg_buf)));
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INTERNAL_ERROR),
+						 errmsg_internal("could not read WAL from WAL buffers")));
+		}
+#endif
+
+		/* Count what is wanted, not the whole page. */
+		if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			nbatch += nbytes;
+			ntotal += nbytes;
+			nbytes = 0;
+		}
+		else
+		{
+			Size		navailable;
+
+			/*
+			 * Not all the bytes are in one page. Deduce available bytes on
+			 * the current page, count them and continue to look for remaining
+			 * bytes.
+			 */
+			navailable = XLOG_BLCKSZ - (data - page);
+			Assert(navailable > 0 && navailable <= nbytes);
+			ptr += navailable;
+			nbytes -= navailable;
+			nbatch += navailable;
+			ntotal += navailable;
+		}
+
+		/*
+		 * We avoid multiple memcpy calls while reading WAL. Note that we
+		 * memcpy what we have counted so far whenever we are wrapping around
+		 * WAL buffers (because WAL buffers are organized as a circular array of
+		 * pages) and continue to look for remaining WAL.
+		 */
+		if (batchstart == NULL)
+		{
+			/* Mark where the data in WAL buffers starts from. */
+			batchstart = data;
+		}
+
+		/*
+		 * We are wrapping around WAL buffers, so read what we have counted so
+		 * far.
+		 */
+		if (idx == XLogCtl->XLogCacheBlck)
+		{
+			Assert(batchstart != NULL);
+			Assert(nbatch > 0);
+
+			memcpy(buf, batchstart, nbatch);
+			buf += nbatch;
+
+			/* Reset for next batch. */
+			batchstart = NULL;
+			nbatch = 0;
+		}
+	}
+
+	/* Read what we have counted so far. */
+	Assert(nbatch <= ntotal);
+	if (batchstart != NULL && nbatch > 0)
+		memcpy(buf, batchstart, nbatch);
+
+	LWLockRelease(WALBufMappingLock);
+
+	/* We never read more than what the caller has asked for. */
+	Assert(ntotal <= count);
+
+#ifdef WAL_DEBUG
+	if (XLOG_DEBUG)
+		ereport(DEBUG1,
+				(errmsg_internal("read %zu bytes out of %zu bytes from WAL buffers for given start LSN %X/%X, timeline ID %u",
+								 ntotal, count, LSN_FORMAT_ARGS(startptr), tli)));
+#endif
+
+	return ntotal;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e0baa86bd3..9c82172c42 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1473,8 +1473,7 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1485,6 +1484,48 @@ WALRead(XLogReaderState *state,
 	XLogRecPtr	recptr;
 	Size		nbytes;
 
+#ifndef FRONTEND
+	/* Frontend code has no idea of WAL buffers. */
+
+	Size		nread;
+
+	/*
+	 * Try reading WAL from WAL buffers. We skip this step and read from the
+	 * WAL file as usual either when the server is in recovery (standby mode,
+	 * archive or crash recovery), in which case WAL buffers are not used, or
+	 * when the server is inserting on a timeline different from the one we
+	 * are trying to read WAL from.
+	 */
+	if (!RecoveryInProgress() &&
+		tli == GetWALInsertionTimeLine())
+	{
+		nread = XLogReadFromBuffers(state, startptr, tli, count, buf);
+
+		Assert(nread >= 0);
+
+		/*
+		 * Check if we have read fully (hit), partially (partial hit) or
+		 * nothing (miss) from WAL buffers. If we have read either partially
+		 * or nothing, then continue to read the remaining bytes the usual
+		 * way, that is, read from WAL file.
+		 *
+		 * XXX: It might be worth exposing WAL buffer read stats.
+		 */
+		if (count == nread)
+			return true;		/* Buffer hit, so return. */
+		else if (count > nread)
+		{
+			/*
+			 * Buffer partial hit, so reset the state to count the read bytes
+			 * and continue.
+			 */
+			buf += nread;
+			startptr += nread;
+			count -= nread;
+		}
+	}
+#endif
+
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4ad572cb87..74a9cd237a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -251,6 +251,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern Size XLogReadFromBuffers(struct XLogReaderState *state,
+								XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

v11-0002-Add-test-module-for-verifying-WAL-read-from-WAL-.patch (application/x-patch)
From 83814e9ea4a0891b8b446865461c47584f308b35 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Fri, 20 Oct 2023 16:38:25 +0000
Subject: [PATCH v11] Add test module for verifying WAL read from WAL buffers

This commit adds a test module to verify WAL read from WAL
buffers.

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 +++++++
 .../test_wal_read_from_buffers/meson.build    | 33 ++++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 66 +++++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 +++++
 .../test_wal_read_from_buffers.c              | 37 +++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 185 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index e81873cb5a..f5aedb95a4 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index fcd643f6f1..86fd74ab50 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,5 +28,6 @@ subdir('test_regex')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7472494501
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40bd5dcd33
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_wal_read_from_buffers
+
+test_install_data += files(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..b838d8c3ca
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,66 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers;');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL-generating activity happening
+# on the server. We then verify if we can read the WAL from WAL buffers using
+# this LSN.
+my $lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn');});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+# Check WAL read from buffers with some invalid input LSNs.
+$lsn = '0/0';
+
+my ($psql_ret, $psql_stdout, $psql_stderr) = ('','', '');
+
+# Must not use safe_psql since we expect an error here.
+($psql_ret, $psql_stdout, $psql_stderr) = $node->psql(
+    'postgres',
+    qq{SELECT test_wal_read_from_buffers('$lsn');});
+like($psql_stderr,
+	 qr/ERROR: ( [A-Z0-9]+:)? invalid WAL start LSN $lsn specified for reading from WAL buffers/,
+     "WAL read from WAL buffers failed due to invalid WAL start LSN $lsn");
+
+$lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_flush_lsn()+1000;');
+
+# Must not use safe_psql since we expect an error here.
+($psql_ret, $psql_stdout, $psql_stderr) = $node->psql(
+    'postgres',
+    qq{SELECT test_wal_read_from_buffers('$lsn');});
+like($psql_stderr,
+	 qr/ERROR: ( [A-Z0-9]+:)? invalid WAL start LSN $lsn specified for reading from WAL buffers/,
+     "WAL read from WAL buffers failed due to invalid WAL start LSN $lsn");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..c6ffb3fa65
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    read_successful OUT boolean
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..2307cbff7a
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,37 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	char		data[XLOG_BLCKSZ] = {0};
+	Size		nread;
+
+	nread = XLogReadFromBuffers(NULL, PG_GETARG_LSN(0),
+								GetWALInsertionTimeLine(),
+								XLOG_BLCKSZ, data);
+
+	PG_RETURN_BOOL(nread > 0);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..eda8d47954
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

#39 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#38)
2 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, Oct 20, 2023 at 10:19 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Thu, Oct 12, 2023 at 4:13 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-10-03 16:05:32 -0700, Jeff Davis wrote:

On Sat, 2023-01-14 at 12:34 -0800, Andres Freund wrote:

One benefit would be that it'd make it more realistic to use direct IO for
WAL - for which I have seen significant performance benefits. But when we
afterwards have to re-read it from disk to replicate, it's less clearly a
win.

Does this patch still look like a good fit for your (or someone else's)
plans for direct IO here? If so, would committing this soon make it
easier to make progress on that, or should we wait until it's actually
needed?

I think it'd be quite useful to have. Even with the code as of 16, I see
better performance in some workloads with debug_io_direct=wal,
wal_sync_method=open_datasync compared to any other configuration. Except of
course that it makes walsenders more problematic, as they suddenly require
read IO. Thus having support for walsenders to send directly from wal buffers
would be beneficial, even without further AIO infrastructure.

I'm attaching the v11 patch set with the following changes:
- Improved input validation in the function that reads WAL from WAL
buffers in 0001 patch.
- Improved test module's code in 0002 patch.
- Modernized meson build file in 0002 patch.
- Added commit messages for both the patches.
- Ran pgindent on both the patches.

Any thoughts are welcome.
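
For anyone skimming 0001, the following is a minimal standalone sketch of the page lookup and batched copy that XLogReadFromBuffers() performs. It is not the patch code: PAGE_SZ, NPAGES, page_end[], load_page() and read_from_ring() are hypothetical toy names, the ring is a plain static array, and the WALBufMappingLock acquisition and page header checks are left out entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 8				/* toy page size, stands in for XLOG_BLCKSZ */
#define NPAGES  4				/* toy ring size, stands in for the buffer count */

static char pages[NPAGES * PAGE_SZ];	/* ring of pages (hypothetical) */
static uint64_t page_end[NPAGES];	/* end position of the page in each slot */

/* Toy "inserter": place the page starting at 'pos' into its ring slot. */
static void
load_page(uint64_t pos)
{
	int			idx = (int) ((pos / PAGE_SZ) % NPAGES);

	for (int i = 0; i < PAGE_SZ; i++)
		pages[idx * PAGE_SZ + i] = (char) ('a' + (pos + i) % 26);
	page_end[idx] = pos + PAGE_SZ;
}

/* Copy up to 'count' bytes starting at 'start'; return bytes copied. */
static size_t
read_from_ring(uint64_t start, size_t count, char *out)
{
	uint64_t	ptr = start;
	size_t		left = count;
	size_t		total = 0;
	size_t		nbatch = 0;
	const char *batchstart = NULL;

	assert(out != NULL);

	while (left > 0)
	{
		int			idx = (int) ((ptr / PAGE_SZ) % NPAGES);
		uint64_t	expected_end = ptr + (PAGE_SZ - ptr % PAGE_SZ);
		size_t		avail;
		size_t		n;

		/* The slot holds some other page, so the data is not cached. */
		if (page_end[idx] != expected_end)
			break;

		/* Count what is wanted from this page, but do not copy yet. */
		avail = PAGE_SZ - ptr % PAGE_SZ;
		n = left < avail ? left : avail;
		if (batchstart == NULL)
			batchstart = pages + idx * PAGE_SZ + ptr % PAGE_SZ;
		nbatch += n;
		total += n;
		left -= n;
		ptr += n;

		/* Last slot: the next page is not adjacent in memory, so flush. */
		if (idx == NPAGES - 1)
		{
			memcpy(out, batchstart, nbatch);
			out += nbatch;
			batchstart = NULL;
			nbatch = 0;
		}
	}

	/* Copy whatever was counted since the last flush. */
	if (nbatch > 0)
		memcpy(out, batchstart, nbatch);

	return total;
}
```

With pages loaded at positions 16, 24 and 32, a 20-byte read starting at 18 spans three pages and wraps from the last slot to the first, so it takes two memcpy calls. A 40-byte read from the same start stops at the first page that is not cached and returns a short count, which is the partial-hit case the caller has to handle.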

I'm attaching the v12 patch set, with just pgperltidy run on the new TAP
test added in 0002. There are no other changes from the v11 patch set.
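
As a reminder of what the WALRead() side of 0001 does with a hit, partial hit or miss, here is a rough sketch under simplifying assumptions. The names wal_source_fn, toy_buffers(), toy_file() and read_with_fallback() are hypothetical stand-ins, not functions from the patch; the toy buffer source serves at most 5 bytes per call just to force a partial hit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A source copies up to 'count' bytes at 'start' and returns bytes copied. */
typedef size_t (*wal_source_fn) (uint64_t start, size_t count, char *buf);

static size_t bytes_from_file;	/* toy statistic, not part of the patch */

/* Toy buffer source: serves at most the first 5 requested bytes. */
static size_t
toy_buffers(uint64_t start, size_t count, char *buf)
{
	size_t		n = count < 5 ? count : 5;

	for (size_t i = 0; i < n; i++)
		buf[i] = (char) ('A' + (start + i) % 26);
	return n;
}

/* Toy file source: always serves everything that was requested. */
static size_t
toy_file(uint64_t start, size_t count, char *buf)
{
	for (size_t i = 0; i < count; i++)
		buf[i] = (char) ('A' + (start + i) % 26);
	bytes_from_file += count;
	return count;
}

/* Try the buffers first, then read whatever is left from the file. */
static int
read_with_fallback(wal_source_fn from_buffers, wal_source_fn from_file,
				   uint64_t start, size_t count, char *buf)
{
	size_t		nread = from_buffers(start, count, buf);

	assert(nread <= count);

	if (nread == count)
		return 1;				/* full hit, nothing left to read */

	/* Partial hit or miss: skip past the bytes already copied. */
	buf += nread;
	start += nread;
	count -= nread;

	return from_file(start, count, buf) == count;
}
```

In this toy setup a 4-byte request is served entirely by the buffer source, while a 12-byte request gets 5 bytes from it and the remaining 7 from the file source. Advancing buf, start and count together is what keeps the two partial reads contiguous in the caller's buffer.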

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v12-0001-Allow-WAL-reading-from-WAL-buffers.patch (application/x-patch)
From c240d914967261e462290b714ec6ae2803d72442 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 21 Oct 2023 13:50:15 +0000
Subject: [PATCH v12] Allow WAL reading from WAL buffers

This commit gives WALRead() the capability to read WAL from WAL
buffers when possible. When requested WAL isn't available in WAL
buffers, the WAL is read from the WAL file as usual. It relies on
WALBufMappingLock so that no one replaces the WAL buffer page that
we're reading from. It skips reading from WAL buffers if
WALBufMappingLock can't be acquired immediately. In other words,
it doesn't wait for WALBufMappingLock to be available. This helps
reduce the contention on WALBufMappingLock.

This commit benefits the callers of WALRead(), namely walsenders
and pg_walinspect. They can now avoid reading WAL from the WAL
file (possibly avoiding disk IO). Tests show that the WAL buffers
hit ratio stood at 95% for 1 primary, 1 sync standby, 1 async
standby, with pgbench --scale=300 --client=32 --time=900. In other
words, 95% of the time the walsenders avoided reading from the
file, and hence pread system calls:
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com

This commit also helps when direct IO is enabled for WAL.
Reading WAL from WAL buffers brings performance back close to
that without direct IO for WAL:
https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com

This commit also paves the way for the following features in
the future:
- Improving synchronous replication performance by replicating
directly from WAL buffers.
- An opt-in way for the walreceivers to receive unflushed WAL.
More details here:
https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar, Andres Freund
Reviewed-by: Nathan Bossart, Kuntal Ghosh
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/backend/access/transam/xlog.c       | 208 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  45 ++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 257 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cea13e3d58..9553a880f1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1706,6 +1706,214 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and return total read bytes.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. Caller must be aware of
+ * this and deal with it.
+ *
+ * Note that this function is not available for frontend code as WAL buffers
+ * are a mechanism internal to the server.
+ */
+Size
+XLogReadFromBuffers(XLogReaderState *state,
+					XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf)
+{
+	XLogRecPtr	ptr;
+	Size		nbytes;
+	Size		ntotal;
+	Size		nbatch;
+	char	   *batchstart;
+	TimeLineID	current_timeline;
+
+	/*
+	 * Do some input parameter validations to fail quickly with meaningful
+	 * error messages or return immediately.
+	 */
+	if (unlikely(RecoveryInProgress()))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg_internal("reading WAL from WAL buffers is not supported during recovery")));
+
+	if (unlikely(XLogRecPtrIsInvalid(startptr) ||
+				 startptr > GetFlushRecPtr(NULL)))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg_internal("invalid WAL start LSN %X/%X specified for reading from WAL buffers",
+								 LSN_FORMAT_ARGS(startptr))));
+
+	current_timeline = GetWALInsertionTimeLine();
+	if (unlikely(tli != current_timeline))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg_internal("requested WAL timeline ID %u is different from that of current system timeline ID %u",
+								 tli, current_timeline)));
+
+	if (unlikely(count <= 0))
+		return 0;
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't replace the WAL
+	 * buffer pages while we are reading them. We acquire it in shared mode so
+	 * that concurrent WAL readers are also allowed. We do as little work as
+	 * possible while holding the lock so as not to impact concurrent WAL
+	 * writers much. If the lock isn't immediately available, we exit quickly
+	 * to avoid causing any contention.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return 0;
+
+	ptr = startptr;
+	nbytes = count;				/* Total bytes requested to be read by caller. */
+	ntotal = 0;					/* Total bytes read. */
+	nbatch = 0;					/* Bytes to be read in single batch. */
+	batchstart = NULL;			/* Location to read from for single batch. */
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int			idx;
+		char	   *page;
+		char	   *data;
+		XLogPageHeader phdr;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+		endptr = XLogCtl->xlblocks[idx];
+
+		/* Requested WAL isn't available in WAL buffers. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/*
+		 * The fact that we acquire WALBufMappingLock while reading the WAL
+		 * buffer page itself guarantees that no one else initializes it or
+		 * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
+		 * need to ensure that we are not reading a page that just got
+		 * initialized. For this, we look at the needed page header.
+		 */
+		phdr = (XLogPageHeader) page;
+
+		/* Return, if WAL buffer page doesn't look valid. */
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+			break;
+
+		/*
+		 * Note that we don't perform all page header checks here to avoid
+		 * extra work in production builds; callers will anyway do those
+		 * checks extensively. However, in an assert-enabled build, we perform
+		 * all the checks here and raise an error if failed.
+		 */
+#ifdef USE_ASSERT_CHECKING
+		if (unlikely(state != NULL &&
+					 !XLogReaderValidatePageHeader(state, (endptr - XLOG_BLCKSZ),
+												   (char *) phdr)))
+		{
+			if (state->errormsg_buf[0])
+				ereport(ERROR,
+						(errcode(ERRCODE_INTERNAL_ERROR),
+						 errmsg_internal("%s", state->errormsg_buf)));
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INTERNAL_ERROR),
+						 errmsg_internal("could not read WAL from WAL buffers")));
+		}
+#endif
+
+		/* Count what is wanted, not the whole page. */
+		if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			nbatch += nbytes;
+			ntotal += nbytes;
+			nbytes = 0;
+		}
+		else
+		{
+			Size		navailable;
+
+			/*
+			 * Not all the bytes are in one page. Deduce available bytes on
+			 * the current page, count them and continue to look for remaining
+			 * bytes.
+			 */
+			navailable = XLOG_BLCKSZ - (data - page);
+			Assert(navailable > 0 && navailable <= nbytes);
+			ptr += navailable;
+			nbytes -= navailable;
+			nbatch += navailable;
+			ntotal += navailable;
+		}
+
+		/*
+		 * We avoid multiple memcpy calls while reading WAL. Note that we
+		 * memcpy what we have counted so far whenever we are wrapping around
+		 * WAL buffers (because WAL buffers are organized as a circular array of
+		 * pages) and continue to look for remaining WAL.
+		 */
+		if (batchstart == NULL)
+		{
+			/* Mark where the data in WAL buffers starts from. */
+			batchstart = data;
+		}
+
+		/*
+		 * We are wrapping around WAL buffers, so read what we have counted so
+		 * far.
+		 */
+		if (idx == XLogCtl->XLogCacheBlck)
+		{
+			Assert(batchstart != NULL);
+			Assert(nbatch > 0);
+
+			memcpy(buf, batchstart, nbatch);
+			buf += nbatch;
+
+			/* Reset for next batch. */
+			batchstart = NULL;
+			nbatch = 0;
+		}
+	}
+
+	/* Read what we have counted so far. */
+	Assert(nbatch <= ntotal);
+	if (batchstart != NULL && nbatch > 0)
+		memcpy(buf, batchstart, nbatch);
+
+	LWLockRelease(WALBufMappingLock);
+
+	/* We never read more than what the caller has asked for. */
+	Assert(ntotal <= count);
+
+#ifdef WAL_DEBUG
+	if (XLOG_DEBUG)
+		ereport(DEBUG1,
+				(errmsg_internal("read %zu bytes out of %zu bytes from WAL buffers for given start LSN %X/%X, timeline ID %u",
+								 ntotal, count, LSN_FORMAT_ARGS(startptr), tli)));
+#endif
+
+	return ntotal;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e0baa86bd3..9c82172c42 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1473,8 +1473,7 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1485,6 +1484,48 @@ WALRead(XLogReaderState *state,
 	XLogRecPtr	recptr;
 	Size		nbytes;
 
+#ifndef FRONTEND
+	/* Frontend code has no idea of WAL buffers. */
+
+	Size		nread;
+
+	/*
+	 * Try reading WAL from WAL buffers. We skip this step and read from the
+	 * WAL file as usual either when the server is in recovery (standby mode,
+	 * archive or crash recovery), in which case WAL buffers are not used, or
+	 * when the server is inserting on a timeline different from the one we
+	 * are trying to read WAL from.
+	 */
+	if (!RecoveryInProgress() &&
+		tli == GetWALInsertionTimeLine())
+	{
+		nread = XLogReadFromBuffers(state, startptr, tli, count, buf);
+
+		Assert(nread >= 0);
+
+		/*
+		 * Check if we have read fully (hit), partially (partial hit) or
+		 * nothing (miss) from WAL buffers. If we have read either partially
+		 * or nothing, then continue to read the remaining bytes the usual
+		 * way, that is, read from WAL file.
+		 *
+		 * XXX: It might be worth exposing WAL buffer read stats.
+		 */
+		if (count == nread)
+			return true;		/* Buffer hit, so return. */
+		else if (count > nread)
+		{
+			/*
+			 * Buffer partial hit, so reset the state to count the read bytes
+			 * and continue.
+			 */
+			buf += nread;
+			startptr += nread;
+			count -= nread;
+		}
+	}
+#endif
+
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 4ad572cb87..74a9cd237a 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -251,6 +251,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern Size XLogReadFromBuffers(struct XLogReaderState *state,
+								XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

v12-0002-Add-test-module-for-verifying-WAL-read-from-WAL-.patch (application/x-patch)
From cfb22e06d416d07a8cfbb9d8898ee69d3bcd75f2 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 21 Oct 2023 13:52:29 +0000
Subject: [PATCH v12] Add test module for verifying WAL read from WAL buffers

This commit adds a test module to verify WAL read from WAL
buffers.

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 +++++++
 .../test_wal_read_from_buffers/meson.build    | 33 +++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 67 +++++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 +++++
 .../test_wal_read_from_buffers.c              | 37 ++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 186 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index e81873cb5a..f5aedb95a4 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index fcd643f6f1..86fd74ab50 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,5 +28,6 @@ subdir('test_regex')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7472494501
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40bd5dcd33
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_wal_read_from_buffers
+
+test_install_data += files(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..e04f5c85ab
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,67 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers;');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn');});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+# Check WAL read from buffers with some invalid input LSNs.
+$lsn = '0/0';
+
+my ($psql_ret, $psql_stdout, $psql_stderr) = ('', '', '');
+
+# Must not use safe_psql since we expect an error here.
+($psql_ret, $psql_stdout, $psql_stderr) =
+  $node->psql('postgres', qq{SELECT test_wal_read_from_buffers('$lsn');});
+like(
+	$psql_stderr,
+	qr/ERROR: ( [A-Z0-9]+:)? invalid WAL start LSN $lsn specified for reading from WAL buffers/,
+	"WAL read from WAL buffers failed due to invalid WAL start LSN $lsn");
+
+$lsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_flush_lsn()+1000;');
+
+# Must not use safe_psql since we expect an error here.
+($psql_ret, $psql_stdout, $psql_stderr) =
+  $node->psql('postgres', qq{SELECT test_wal_read_from_buffers('$lsn');});
+like(
+	$psql_stderr,
+	qr/ERROR: ( [A-Z0-9]+:)? invalid WAL start LSN $lsn specified for reading from WAL buffers/,
+	"WAL read from WAL buffers failed due to invalid WAL start LSN $lsn");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..c6ffb3fa65
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    read_successful OUT boolean
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..2307cbff7a
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,37 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	char		data[XLOG_BLCKSZ] = {0};
+	Size		nread;
+
+	nread = XLogReadFromBuffers(NULL, PG_GETARG_LSN(0),
+								GetWALInsertionTimeLine(),
+								XLOG_BLCKSZ, data);
+
+	PG_RETURN_BOOL(nread > 0);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..eda8d47954
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

#40Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#39)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Sat, 2023-10-21 at 23:59 +0530, Bharath Rupireddy wrote:

I'm attaching v12 patch set with just pgperltidy ran on the new TAP
test added in 0002. No other changes from that of v11 patch set.

Thank you.

Comments:

* It would be good to document that this is partially an optimization
(read from memory first) and partially an API difference that allows
reading unflushed data. For instance, walsender may benefit
performance-wise (and perhaps later with the ability to read unflushed
data) whereas pg_walinspect benefits primarily from reading unflushed
data.

* Shouldn't there be a new method in XLogReaderRoutine (e.g.
read_unflushed_data), rather than having logic in WALRead()? The
callers can define the method if it makes sense (and that would be a
good place to document why); or leave it NULL if not.

* I'm not clear on the "partial hit" case. Wouldn't that mean you found
the earliest byte in the buffers, but not the latest byte requested? Is
that possible, and if so under what circumstances? I added an
"Assert(nread == 0 || nread == count)" in WALRead() after calling
XLogReadFromBuffers(), and it wasn't hit.

* If the partial hit case is important, wouldn't XLogReadFromBuffers()
fill in the end of the buffer rather than the beginning?

* Other functions that use xlblocks, e.g. GetXLogBuffer(), use more
effort to avoid acquiring WALBufMappingLock. Perhaps you can avoid it,
too? One idea is to check that XLogCtl->xlblocks[idx] is equal to
expectedEndPtr both before and after the memcpy(), with appropriate
barriers. That could mitigate concerns expressed by Kyotaro Horiguchi
and Masahiko Sawada.

* Are you sure that reducing the number of calls to memcpy() is a win?
I would expect that to be true only if the memcpy()s are tiny, but here
they are around XLOG_BLCKSZ. I believe this was done based on a comment
from Nathan Bossart, but I didn't really follow why that's important.
Also, if we try to use one memcpy for all of the data, it might not
interact well with my idea above to avoid taking the lock.

* Style-wise, the use of "unlikely" seems excessive, unless there's a
reason to think it matters.
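
The second comment above (an optional method that callers may leave NULL) can be sketched roughly as follows. This is only an illustration under assumptions: DemoReaderRoutine, demo_read() and the two toy callbacks are hypothetical names, not part of XLogReaderRoutine or of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t DemoLsn;		/* stand-in for XLogRecPtr */

/* Hypothetical routine table; read_unflushed_data may be left NULL. */
typedef struct DemoReaderRoutine
{
	/* Mandatory: read exactly 'count' bytes from WAL files. */
	void		(*read_from_file) (DemoLsn start, size_t count, char *buf);
	/* Optional: read up to 'count' bytes from memory, return bytes read. */
	size_t		(*read_unflushed_data) (DemoLsn start, size_t count, char *buf);
} DemoReaderRoutine;

/* Use the optional method when the caller defined it, then fall back. */
static void
demo_read(const DemoReaderRoutine *routine, DemoLsn start, size_t count,
		  char *buf)
{
	size_t		nread = 0;

	assert(routine->read_from_file != NULL);

	if (routine->read_unflushed_data != NULL)
		nread = routine->read_unflushed_data(start, count, buf);

	assert(nread <= count);
	if (nread < count)
		routine->read_from_file(start + nread, count - nread, buf + nread);
}

/* Toy callbacks: the "file" yields 'F', memory yields at most 4 bytes of 'B'. */
static void
demo_file_read(DemoLsn start, size_t count, char *buf)
{
	(void) start;
	memset(buf, 'F', count);
}

static size_t
demo_memory_read(DemoLsn start, size_t count, char *buf)
{
	size_t		n = count < 4 ? count : 4;

	(void) start;
	memset(buf, 'B', n);
	return n;
}
```

A caller that wants memory reads fills in both members; a caller that does not simply leaves read_unflushed_data NULL and gets file reads only.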

Regards,
Jeff Davis

#41Nathan Bossart
nathandbossart@gmail.com
In reply to: Jeff Davis (#40)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Tue, Oct 24, 2023 at 05:15:19PM -0700, Jeff Davis wrote:

* Are you sure that reducing the number of calls to memcpy() is a win?
I would expect that to be true only if the memcpy()s are tiny, but here
they are around XLOG_BLCKSZ. I believe this was done based on a comment
from Nathan Bossart, but I didn't really follow why that's important.
Also, if we try to use one memcpy for all of the data, it might not
interact well with my idea above to avoid taking the lock.

I don't recall exactly why I suggested this, but if additional memcpy()s
help in some way and don't negatively impact performance, then I retract my
previous comment.

--
Nathan Bossart
Amazon Web Services: https://aws.amazon.com

#42Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Jeff Davis (#40)
2 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Wed, Oct 25, 2023 at 5:45 AM Jeff Davis <pgsql@j-davis.com> wrote:

Comments:

Thanks for reviewing.

* It would be good to document that this is partially an optimization
(read from memory first) and partially an API difference that allows
reading unflushed data. For instance, walsender may benefit
performance-wise (and perhaps later with the ability to read unflushed
data) whereas pg_walinspect benefits primarily from reading unflushed
data.

The commit message covers these things in detail. However, I think
adding some info in the code comments is a good idea, and I have done
so around the WALRead() function in the attached v13 patch set.

* Shouldn't there be a new method in XLogReaderRoutine (e.g.
read_unflushed_data), rather than having logic in WALRead()? The
callers can define the method if it makes sense (and that would be a
good place to document why); or leave it NULL if not.

I've designed the new function XLogReadFromBuffers to read from WAL
buffers in such a way that one can easily embed it in page_read
callbacks if it makes sense. Almost all the available backend
page_read callbacks (read_local_xlog_page_no_wait,
read_local_xlog_page, logical_read_xlog_page), except XLogPageRead
(which is used for recovery when WAL buffers aren't used at all), have
one thing in common, that is, WALRead(). Therefore, WALRead() seemed a
natural place for me to call XLogReadFromBuffers. In other words, I'd
say it's the responsibility of page_read callback implementers to
decide if they want to read from WAL buffers or not and hence I don't
think we need a separate XLogReaderRoutine method.

If someone wants to read unflushed WAL, the typical way to implement
it is to write a new page_read callback
read_local_unflushed_xlog_page/logical_read_unflushed_xlog_page or
similar without WALRead() but just the new function
XLogReadFromBuffers to read from WAL buffers and return.
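
As a rough, standalone sketch of such a callback (an assumption-laden toy, not code from the patch): demo_read_from_buffers() stands in for XLogReadFromBuffers(), a single in-memory page stands in for WAL buffers, and demo_read_unflushed_page() is a hypothetical page_read-style callback that returns the bytes read or -1.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ 16

/* Toy in-memory WAL: one page, of which only a prefix has been written. */
static char		demo_page[DEMO_BLCKSZ];
static uint64_t demo_page_start;	/* LSN of the first byte of demo_page */
static size_t	demo_valid_bytes;	/* bytes of demo_page written so far */

/* Stand-in for XLogReadFromBuffers(): returns bytes copied, possibly 0. */
static size_t
demo_read_from_buffers(uint64_t startptr, size_t count, char *buf)
{
	size_t		off;
	size_t		n;

	if (startptr < demo_page_start ||
		startptr >= demo_page_start + demo_valid_bytes)
		return 0;

	off = (size_t) (startptr - demo_page_start);
	n = demo_valid_bytes - off;
	if (n > count)
		n = count;
	memcpy(buf, demo_page + off, n);
	return n;
}

/*
 * Hypothetical page_read-style callback that never touches WAL files.
 * Returns the number of bytes read, or -1 if fewer than req_len bytes
 * of the page are available in memory.
 */
static int
demo_read_unflushed_page(uint64_t page_ptr, int req_len, char *read_buf)
{
	size_t		nread;

	assert(req_len > 0 && req_len <= DEMO_BLCKSZ);

	nread = demo_read_from_buffers(page_ptr, DEMO_BLCKSZ, read_buf);
	if (nread < (size_t) req_len)
		return -1;

	return (int) nread;
}
```

The point of the sketch is only that the decision to read from memory lives in the callback, with no WALRead() call involved.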

* I'm not clear on the "partial hit" case. Wouldn't that mean you found
the earliest byte in the buffers, but not the latest byte requested? Is
that possible, and if so under what circumstances? I added an
"Assert(nread == 0 || nread == count)" in WALRead() after calling
XLogReadFromBuffers(), and it wasn't hit.

* If the partial hit case is important, wouldn't XLogReadFromBuffers()
fill in the end of the buffer rather than the beginning?

A partial hit was possible when the requested WAL pages were read one
page at a time from WAL buffers, with a WALBufMappingLock
acquisition-release for each page, because the requested page could be
replaced between releasing and reacquiring the lock. This was the
case up until the v6 patch -
https://www.postgresql.org/message-id/CALj2ACWTNneq2EjMDyUeWF-BnwpewuhiNEfjo9bxLwFU9iPF0w%40mail.gmail.com.
Now that the approach has been changed to read multiple pages at once
under one WALBufMappingLock acquisition-release, a partial hit is no
longer expected.
We can either keep the partial hit handling (just to not miss
anything) or turn the following partial hit case into an error or an
Assert(false). I prefer to keep the partial hit handling as-is just
in case:
+        else if (count > nread)
+        {
+            /*
+             * Buffer partial hit, so reset the state to count the read bytes
+             * and continue.
+             */
+            buf += nread;
+            startptr += nread;
+            count -= nread;
+        }
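
To make the three outcomes (hit, partial hit, miss) concrete, here is a small standalone model of the control flow in the snippet above. Everything prefixed demo_ is hypothetical: two flat arrays stand in for the WAL file and WAL buffers, and [demo_lo, demo_hi) is the range assumed to be held in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DEMO_WAL_SIZE 64

static char		demo_file[DEMO_WAL_SIZE];		/* stand-in for WAL on disk */
static char		demo_buffers[DEMO_WAL_SIZE];	/* stand-in for WAL buffers */
static size_t	demo_lo;			/* [demo_lo, demo_hi) is held in memory */
static size_t	demo_hi;
static size_t	demo_file_bytes;	/* bytes served from the "file" so far */

/* Copy as much as memory holds starting at startptr; may return 0. */
static size_t
demo_read_from_buffers(size_t startptr, size_t count, char *buf)
{
	size_t		n;

	if (startptr < demo_lo || startptr >= demo_hi)
		return 0;

	n = demo_hi - startptr;
	if (n > count)
		n = count;
	memcpy(buf, demo_buffers + startptr, n);
	return n;
}

/* Try memory first, then read whatever is left from the "file". */
static void
demo_wal_read(size_t startptr, size_t count, char *buf)
{
	size_t		nread = demo_read_from_buffers(startptr, count, buf);

	if (nread == count)
		return;					/* hit */

	/* Partial hit or miss: skip the bytes already read and continue. */
	buf += nread;
	startptr += nread;
	count -= nread;

	assert(startptr + count <= DEMO_WAL_SIZE);
	memcpy(buf, demo_file + startptr, count);
	demo_file_bytes += count;
}
```

In this model a partial hit means the start of the request is in memory and the tail is not, which is the situation questioned in the review.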

* Other functions that use xlblocks, e.g. GetXLogBuffer(), use more
effort to avoid acquiring WALBufMappingLock. Perhaps you can avoid it,
too? One idea is to check that XLogCtl->xlblocks[idx] is equal to
expectedEndPtr both before and after the memcpy(), with appropriate
barriers. That could mitigate concerns expressed by Kyotaro Horiguchi
and Masahiko Sawada.

Yes, I proposed that idea in another thread -
https://www.postgresql.org/message-id/CALj2ACVFSirOFiABrNVAA6JtPHvA9iu%2Bwp%3DqkM9pdLZ5mwLaFg%40mail.gmail.com.
If that looks okay, I can add it to the next version of this patch
set.
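
For reference, the check-before-and-after idea can be sketched with C11 atomics as below. This is a toy under stated assumptions, not the proposed patch: DemoWalCache and the demo_ functions are hypothetical, 0 plays the role of an invalid end LSN, and the writer is assumed to invalidate a slot before overwriting it. In strict C11 terms the plain memcpy() still races with a concurrent writer, so real code would depend on the platform's barrier guarantees.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ 32
#define DEMO_NPAGES 4

typedef struct DemoWalCache
{
	_Atomic uint64_t end_lsn[DEMO_NPAGES];	/* like xlblocks[]; 0 = invalid */
	char		pages[DEMO_NPAGES][DEMO_BLCKSZ];
} DemoWalCache;

static void
demo_cache_init(DemoWalCache *cache)
{
	for (size_t i = 0; i < DEMO_NPAGES; i++)
		atomic_init(&cache->end_lsn[i], 0);
	memset(cache->pages, 0, sizeof(cache->pages));
}

/* Writer side: invalidate the slot, overwrite it, then publish the new page. */
static void
demo_replace_page(DemoWalCache *cache, uint64_t page_start, const char *src)
{
	size_t		idx = (size_t) (page_start / DEMO_BLCKSZ) % DEMO_NPAGES;

	assert(page_start % DEMO_BLCKSZ == 0);

	atomic_store_explicit(&cache->end_lsn[idx], 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	memcpy(cache->pages[idx], src, DEMO_BLCKSZ);
	atomic_store_explicit(&cache->end_lsn[idx], page_start + DEMO_BLCKSZ,
						  memory_order_release);
}

/* Reader side: copy without a lock, accept only if the slot did not change. */
static bool
demo_read_page(DemoWalCache *cache, uint64_t lsn, char *dst)
{
	uint64_t	page_start = lsn - lsn % DEMO_BLCKSZ;
	uint64_t	expected_end = page_start + DEMO_BLCKSZ;
	size_t		idx = (size_t) (page_start / DEMO_BLCKSZ) % DEMO_NPAGES;

	if (atomic_load_explicit(&cache->end_lsn[idx],
							 memory_order_acquire) != expected_end)
		return false;

	memcpy(dst, cache->pages[idx], DEMO_BLCKSZ);

	/* Order the copy before re-checking that the slot still holds our page. */
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&cache->end_lsn[idx],
								memory_order_relaxed) == expected_end;
}
```

When the second check fails, the copied bytes must be discarded and the caller falls back to reading the WAL file.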

* Are you sure that reducing the number of calls to memcpy() is a win?
I would expect that to be true only if the memcpy()s are tiny, but here
they are around XLOG_BLCKSZ. I believe this was done based on a comment
from Nathan Bossart, but I didn't really follow why that's important.
Also, if we try to use one memcpy for all of the data, it might not
interact well with my idea above to avoid taking the lock.

Up until the v6 patch -
https://www.postgresql.org/message-id/CALj2ACWTNneq2EjMDyUeWF-BnwpewuhiNEfjo9bxLwFU9iPF0w%40mail.gmail.com,
the requested WAL was being read one page at a time from WAL buffers
into the output buffer with one memcpy call for each page. The
approach has since been changed to read multiple pages at once under
one WALBufMappingLock acquisition-release, with comparatively fewer
memcpy calls. I honestly haven't seen any difference between the
two approaches -
https://www.postgresql.org/message-id/CALj2ACUpQGiwQTzmoSMOFk5%3DWiJc06FcYpxzBX0SEej4ProRzg%40mail.gmail.com.

The new approach of reading multiple pages at once under one
WALBufMappingLock acquisition-release clearly wins over reading one
page at a time with multiple lock acquisition-release cycles.
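
A standalone sketch of the batching idea, assuming a tiny circular array of pages: contiguous pages are only counted, and one memcpy() is issued per contiguous run, that is, at the wraparound point and at the end. The demo_ names, page size and page count are illustrative assumptions; locking and page-header checks are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ 8
#define DEMO_NPAGES 4

static char		demo_ring[DEMO_NPAGES * DEMO_BLCKSZ];	/* circular page array */
static uint64_t demo_page_end[DEMO_NPAGES];	/* end LSN of page in each slot */
static int		demo_ncopies;		/* memcpy calls issued so far */

/* Place WAL page number 'pageno', filled with 'fill', into its ring slot. */
static void
demo_load_page(uint64_t pageno, char fill)
{
	size_t		idx = (size_t) (pageno % DEMO_NPAGES);

	memset(demo_ring + idx * DEMO_BLCKSZ, fill, DEMO_BLCKSZ);
	demo_page_end[idx] = (pageno + 1) * DEMO_BLCKSZ;
}

/* Read up to 'count' bytes at 'ptr', one memcpy per contiguous run. */
static size_t
demo_read_batched(uint64_t ptr, size_t count, char *buf)
{
	size_t		ntotal = 0;
	size_t		nbatch = 0;
	const char *batchstart = NULL;

	while (count > 0)
	{
		size_t		off = (size_t) (ptr % DEMO_BLCKSZ);
		size_t		idx = (size_t) (ptr / DEMO_BLCKSZ) % DEMO_NPAGES;
		size_t		n = DEMO_BLCKSZ - off;

		/* Stop at the first page that is not in the ring. */
		if (demo_page_end[idx] != ptr - off + DEMO_BLCKSZ)
			break;

		if (n > count)
			n = count;
		if (batchstart == NULL)
			batchstart = demo_ring + idx * DEMO_BLCKSZ + off;

		nbatch += n;
		ntotal += n;
		ptr += n;
		count -= n;

		/* The next page lives in slot 0, so the run ends here: flush it. */
		if (idx == DEMO_NPAGES - 1)
		{
			memcpy(buf, batchstart, nbatch);
			demo_ncopies++;
			buf += nbatch;
			batchstart = NULL;
			nbatch = 0;
		}
	}

	/* Flush whatever was counted after the last wraparound. */
	if (nbatch > 0)
	{
		memcpy(buf, batchstart, nbatch);
		demo_ncopies++;
	}

	return ntotal;
}
```

For a request spanning three pages with one wraparound, this issues two memcpy() calls where a page-at-a-time loop would issue three.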

* Style-wise, the use of "unlikely" seems excessive, unless there's a
reason to think it matters.

Given the current use of XLogReadFromBuffers, the input parameters are
passed as expected, IOW, these are unlikely events. The comments [1]
say that the unlikely() is to be used in hot code paths; I think
reading WAL from buffers is a hot code path especially when called
from (logical, physical) walsenders. If there's any stronger reason
than the appearance/style-wise, I'm okay to not use them. For now,
I've retained them.

FWIW, I found heapam.c using unlikely() extensively for safety checks.

[1]:
* Hints to the compiler about the likelihood of a branch. Both likely() and
* unlikely() return the boolean value of the contained expression.
*
* These should only be used sparingly, in very hot code paths. It's very easy
* to mis-estimate likelihoods.
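
For readers unfamiliar with these hints, a minimal sketch of how such macros are commonly built on GCC/Clang's __builtin_expect, applied to input checks like the ones discussed above. The demo_ names are hypothetical and the fallback branch is an assumption for other compilers; this is not a copy of the PostgreSQL header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Branch hints: both macros evaluate to the truth value (0 or 1) of x. */
#if defined(__GNUC__) || defined(__clang__)
#define demo_likely(x)	 __builtin_expect((x) != 0, 1)
#define demo_unlikely(x) __builtin_expect((x) != 0, 0)
#else
#define demo_likely(x)	 ((x) != 0)
#define demo_unlikely(x) ((x) != 0)
#endif

/*
 * Validate a read request: -1 for an invalid start LSN, 0 when there is
 * nothing to read, 1 when the request looks fine.
 */
static int
demo_validate_request(uint64_t startptr, uint64_t flushptr, size_t count)
{
	if (demo_unlikely(startptr == 0 || startptr > flushptr))
		return -1;

	if (demo_unlikely(count == 0))
		return 0;

	return 1;
}
```

The hints change only how the compiler lays out the branches, never the result, which is why their value depends entirely on how hot the code path really is.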

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v13-0001-Allow-WAL-reading-from-WAL-buffers.patch (application/octet-stream)
From b7439ab3980e412041c408abe10c2e716b71cabe Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 26 Oct 2023 21:46:37 +0000
Subject: [PATCH v13] Allow WAL reading from WAL buffers

This commit gives WALRead() the capability to read WAL from WAL
buffers when possible. When requested WAL isn't available in WAL
buffers, the WAL is read from the WAL file as usual. It relies on
WALBufMappingLock so that no one replaces the WAL buffer page that
we're reading from. It skips reading from WAL buffers if
WALBufMappingLock can't be acquired immediately. In other words,
it doesn't wait for WALBufMappingLock to be available. This helps
reduce the contention on WALBufMappingLock.

This commit benefits the callers of WALRead(), that are walsenders
and pg_walinspect. They can now avoid reading WAL from the WAL
file (possibly avoiding disk IO). Tests show that the WAL buffers
hit ratio stood at 95% for 1 primary, 1 sync standby, 1 async
standby, with pgbench --scale=300 --client=32 --time=900. In other
words, the walsenders avoided 95% of the time reading from the
file/avoided pread system calls:
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com

This commit also helps when direct IO is enabled for WAL.
Reading WAL from WAL buffers brings performance back close to
what it is without direct IO for WAL:
https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com

This commit also paves the way for the following features in
future:
- Improves synchronous replication performance by replicating
directly from WAL buffers.
- An opt-in way for the walreceivers to receive unflushed WAL.
More details here:
https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar, Andres Freund
Reviewed-by: Nathan Bossart, Kuntal Ghosh
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/backend/access/transam/xlog.c       | 208 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  48 +++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 260 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 40461923ea..5d199c5e47 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1706,6 +1706,214 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and return total read bytes.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. Caller must be aware of
+ * this and deal with it.
+ *
+ * Note that this function is not available for frontend code as WAL buffers
+ * are a mechanism internal to the server.
+ */
+Size
+XLogReadFromBuffers(XLogReaderState *state,
+					XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf)
+{
+	XLogRecPtr	ptr;
+	Size		nbytes;
+	Size		ntotal;
+	Size		nbatch;
+	char	   *batchstart;
+	TimeLineID	current_timeline;
+
+	/*
+	 * Do some input parameter validations to fail quickly with meaningful
+	 * error messages or return immediately.
+	 */
+	if (unlikely(RecoveryInProgress()))
+		ereport(ERROR,
+				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
+				 errmsg_internal("reading WAL from WAL buffers is not supported during recovery")));
+
+	if (unlikely(XLogRecPtrIsInvalid(startptr) ||
+				 startptr > GetFlushRecPtr(NULL)))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg_internal("invalid WAL start LSN %X/%X specified for reading from WAL buffers",
+								 LSN_FORMAT_ARGS(startptr))));
+
+	current_timeline = GetWALInsertionTimeLine();
+	if (unlikely(tli != current_timeline))
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg_internal("requested WAL timeline ID %u is different from that of current system timeline ID %u",
+								 tli, current_timeline)));
+
+	if (unlikely(count <= 0))
+		return 0;
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it. We try to acquire it in shared mode so that
+	 * the concurrent WAL readers are also allowed. We try to do as little work
+	 * as possible while holding the lock to not impact concurrent WAL writers
+	 * much. We quickly exit to not cause any contention, if the lock isn't
+	 * immediately available.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return 0;
+
+	ptr = startptr;
+	nbytes = count;				/* Total bytes requested to be read by caller. */
+	ntotal = 0;					/* Total bytes read. */
+	nbatch = 0;					/* Bytes to be read in single batch. */
+	batchstart = NULL;			/* Location to read from for single batch. */
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int			idx;
+		char	   *page;
+		char	   *data;
+		XLogPageHeader phdr;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+		endptr = XLogCtl->xlblocks[idx];
+
+		/* Requested WAL isn't available in WAL buffers. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/*
+		 * The fact that we acquire WALBufMappingLock while reading the WAL
+		 * buffer page itself guarantees that no one else initializes it or
+		 * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
+		 * need to ensure that we are not reading a page that just got
+		 * initialized. For this, we look at the needed page header.
+		 */
+		phdr = (XLogPageHeader) page;
+
+		/* Return, if WAL buffer page doesn't look valid. */
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+			break;
+
+		/*
+		 * Note that we don't perform all page header checks here to avoid
+		 * extra work in production builds; callers will anyway do those
+		 * checks extensively. However, in an assert-enabled build, we perform
+		 * all the checks here and raise an error if failed.
+		 */
+#ifdef USE_ASSERT_CHECKING
+		if (unlikely(state != NULL &&
+					 !XLogReaderValidatePageHeader(state, (endptr - XLOG_BLCKSZ),
+												   (char *) phdr)))
+		{
+			if (state->errormsg_buf[0])
+				ereport(ERROR,
+						(errcode(ERRCODE_INTERNAL_ERROR),
+						 errmsg_internal("%s", state->errormsg_buf)));
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INTERNAL_ERROR),
+						 errmsg_internal("could not read WAL from WAL buffers")));
+		}
+#endif
+
+		/* Count what is wanted, not the whole page. */
+		if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			nbatch += nbytes;
+			ntotal += nbytes;
+			nbytes = 0;
+		}
+		else
+		{
+			Size		navailable;
+
+			/*
+			 * All the bytes are not in one page. Deduce available bytes on
+			 * the current page, count them and continue to look for remaining
+			 * bytes.
+			 */
+			navailable = XLOG_BLCKSZ - (data - page);
+			Assert(navailable > 0 && navailable <= nbytes);
+			ptr += navailable;
+			nbytes -= navailable;
+			nbatch += navailable;
+			ntotal += navailable;
+		}
+
+		/*
+		 * We avoid multiple memcpy calls while reading WAL. Note that we
+		 * memcpy what we have counted so far whenever we are wrapping around
+		 * WAL buffers (because WAL buffers are organized as circular array
+		 * of pages) and continue to look for remaining WAL.
+		 */
+		if (batchstart == NULL)
+		{
+			/* Mark where the data in WAL buffers starts from. */
+			batchstart = data;
+		}
+
+		/*
+		 * We are wrapping around WAL buffers, so read what we have counted so
+		 * far.
+		 */
+		if (idx == XLogCtl->XLogCacheBlck)
+		{
+			Assert(batchstart != NULL);
+			Assert(nbatch > 0);
+
+			memcpy(buf, batchstart, nbatch);
+			buf += nbatch;
+
+			/* Reset for next batch. */
+			batchstart = NULL;
+			nbatch = 0;
+		}
+	}
+
+	/* Read what we have counted so far. */
+	Assert(nbatch <= ntotal);
+	if (batchstart != NULL && nbatch > 0)
+		memcpy(buf, batchstart, nbatch);
+
+	LWLockRelease(WALBufMappingLock);
+
+	/* We never read more than what the caller has asked for. */
+	Assert(ntotal <= count);
+
+#ifdef WAL_DEBUG
+	if (XLOG_DEBUG)
+		ereport(DEBUG1,
+				(errmsg_internal("read %zu bytes out of %zu bytes from WAL buffers for given start LSN %X/%X, timeline ID %u",
+								 ntotal, count, LSN_FORMAT_ARGS(startptr), tli)));
+#endif
+
+	return ntotal;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e0baa86bd3..727baf02a6 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1473,8 +1473,10 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers. When
+ * requested WAL isn't available in WAL buffers, the WAL is read from the WAL
+ * file as usual. The callers may avoid reading WAL from the WAL file thus
+ * reducing read system calls or even disk IOs.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1485,6 +1487,48 @@ WALRead(XLogReaderState *state,
 	XLogRecPtr	recptr;
 	Size		nbytes;
 
+#ifndef FRONTEND
+	/* Frontend code has no idea of WAL buffers. */
+
+	Size		nread;
+
+	/*
+	 * Try reading WAL from WAL buffers. We skip this step and continue the
+	 * usual way, that is to read from WAL file, either when server is in
+	 * recovery (standby mode, archive or crash recovery), in which case the
+	 * WAL buffers are not used or when the server is inserting in a different
+	 * timeline from that of the timeline that we're trying to read WAL from.
+	 */
+	if (!RecoveryInProgress() &&
+		tli == GetWALInsertionTimeLine())
+	{
+		nread = XLogReadFromBuffers(state, startptr, tli, count, buf);
+
+		Assert(nread >= 0);
+
+		/*
+		 * Check if we have read fully (hit), partially (partial hit) or
+		 * nothing (miss) from WAL buffers. If we have read either partially
+		 * or nothing, then continue to read the remaining bytes the usual
+		 * way, that is, read from WAL file.
+		 *
+		 * XXX: It might be worth exposing WAL buffer read stats.
+		 */
+		if (count == nread)
+			return true;		/* Buffer hit, so return. */
+		else if (count > nread)
+		{
+			/*
+			 * Buffer partial hit, so reset the state to count the read bytes
+			 * and continue.
+			 */
+			buf += nread;
+			startptr += nread;
+			count -= nread;
+		}
+	}
+#endif
+
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..18167c36b4 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -251,6 +251,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern Size XLogReadFromBuffers(struct XLogReaderState *state,
+								XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

v13-0002-Add-test-module-for-verifying-WAL-read-from-WAL-.patch (application/octet-stream)
From 9812304bf6c63d5d2e866e78aae1451ca4abcc6b Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 26 Oct 2023 22:14:34 +0000
Subject: [PATCH v13] Add test module for verifying WAL read from WAL buffers

This commit adds a test module to verify WAL read from WAL
buffers.

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 +++++++
 .../test_wal_read_from_buffers/meson.build    | 33 +++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 67 +++++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 +++++
 .../test_wal_read_from_buffers.c              | 37 ++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 186 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index e81873cb5a..f5aedb95a4 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index fcd643f6f1..86fd74ab50 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,5 +28,6 @@ subdir('test_regex')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7472494501
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40bd5dcd33
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_wal_read_from_buffers
+
+test_install_data += files(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..e04f5c85ab
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,67 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers;');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn');});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+# Check WAL read from buffers with some invalid input LSNs.
+$lsn = '0/0';
+
+my ($psql_ret, $psql_stdout, $psql_stderr) = ('', '', '');
+
+# Must not use safe_psql since we expect an error here.
+($psql_ret, $psql_stdout, $psql_stderr) =
+  $node->psql('postgres', qq{SELECT test_wal_read_from_buffers('$lsn');});
+like(
+	$psql_stderr,
+	qr/ERROR: ( [A-Z0-9]+:)? invalid WAL start LSN $lsn specified for reading from WAL buffers/,
+	"WAL read from WAL buffers failed due to invalid WAL start LSN $lsn");
+
+$lsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_flush_lsn()+1000;');
+
+# Must not use safe_psql since we expect an error here.
+($psql_ret, $psql_stdout, $psql_stderr) =
+  $node->psql('postgres', qq{SELECT test_wal_read_from_buffers('$lsn');});
+like(
+	$psql_stderr,
+	qr/ERROR: ( [A-Z0-9]+:)? invalid WAL start LSN $lsn specified for reading from WAL buffers/,
+	"WAL read from WAL buffers failed due to invalid WAL start LSN $lsn");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..c6ffb3fa65
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    read_successful OUT boolean
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..2307cbff7a
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,37 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	char		data[XLOG_BLCKSZ] = {0};
+	Size		nread;
+
+	nread = XLogReadFromBuffers(NULL, PG_GETARG_LSN(0),
+								GetWALInsertionTimeLine(),
+								XLOG_BLCKSZ, data);
+
+	PG_RETURN_BOOL(nread > 0);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..eda8d47954
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

#43Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#42)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, 2023-10-27 at 03:46 +0530, Bharath Rupireddy wrote:

Almost all the available backend
page_read callbacks read_local_xlog_page_no_wait,
read_local_xlog_page, logical_read_xlog_page except XLogPageRead
(which is used for recovery when WAL buffers aren't used at all) have
one thing in common, that is, WALRead(). Therefore, it seemed a
natural choice for me to call XLogReadFromBuffers. In other words, I'd
say it's the responsibility of page_read callback implementers to
decide if they want to read from WAL buffers or not and hence I don't
think we need a separate XLogReaderRoutine.

I think I see what you are saying: WALRead() is at a lower level than
the XLogReaderRoutine callbacks, because it's used by the .page_read
callbacks.

That makes sense, but my first interpretation was that WALRead() is
above the XLogReaderRoutine callbacks because it calls .segment_open
and .segment_close. To me that sounds like a layering violation, but it
exists today without your patch.

I suppose the question is: should reading from the WAL buffers be an
intentional thing that specific callers do explicitly?
Or is it an optimization that should be hidden from the caller?

I tend toward the former, at least for now. I suspect that when we do
some more interesting things, like replicating unflushed data, we will
want reading from buffers to be a separate step, not combined with
WALRead(). After things in this area settle a bit then we might want to
refactor and combine them again.

If someone wants to read unflushed WAL, the typical way to implement
it is to write a new page_read callback
read_local_unflushed_xlog_page/logical_read_unflushed_xlog_page or
similar without WALRead() but just the new function
XLogReadFromBuffers to read from WAL buffers and return.

Then why is it being called from WALRead() at all?

I prefer to keep the partial hit handling as-is just
in case:

So a "partial hit" is essentially a narrow race condition where one
page is read from buffers, and it's valid; and by the time it gets to
the next page, it has already been evicted (along with the previously
read page)? In other words, I think you are describing a case where
eviction is happening faster than the memcpy()s in a loop, which is
certainly possible due to scheduling or whatever, but doesn't seem like
the common case.

The case I'd expect for a partial read is when the startptr points to
an evicted page, but some later pages in the requested range are still
present in the buffers.

I'm not really sure whether either of these cases matters, but if we
implement one and not the other, there should be some explanation.

Yes, I proposed that idea in another thread -
/messages/by-id/CALj2ACVFSirOFiABrNVAA6JtPHvA9iu+wp=qkM9pdLZ5mwLaFg@mail.gmail.com.
If that looks okay, I can add it to the next version of this patch set.

The code in the email above still shows a call to:

/*
 * Requested WAL is available in WAL buffers, so recheck the existence
 * under the WALBufMappingLock and read if the page still exists, otherwise
 * return.
 */
LWLockAcquire(WALBufMappingLock, LW_SHARED);

and I don't think that's required. How about something like:

endptr1 = XLogCtl->xlblocks[idx];
/* Requested WAL isn't available in WAL buffers. */
if (expectedEndPtr != endptr1)
    break;

pg_read_barrier();
...
memcpy(buf, data, bytes_read_this_loop);
...
pg_read_barrier();
endptr2 = XLogCtl->xlblocks[idx];
if (expectedEndPtr != endptr2)
    break;

ntotal += bytes_read_this_loop;
/* success; move on to next page */

I'm not sure why GetXLogBuffer() doesn't just use pg_atomic_read_u64().
I suppose because xlblocks are not guaranteed to be 64-bit aligned?
Should we just align it to 64 bits so we can use atomics? (I don't
think it matters in this case, but atomics would avoid the need to
think about it.)
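
The check, copy, recheck sequence above can be modelled outside PostgreSQL. The following is a minimal standalone sketch using C11 atomics, not code from the patch: `Slot`, `PAGE_SZ` and `slot_read_optimistic()` are hypothetical names, and `atomic_thread_fence(memory_order_acquire)` is only assumed to play the role of `pg_read_barrier()`.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 64				/* hypothetical page size for the model */

/* One buffer slot; end_ptr plays the role of an xlblocks[] entry. */
typedef struct Slot
{
	_Atomic uint64_t end_ptr;	/* end LSN of the page currently in 'data' */
	char		data[PAGE_SZ];
} Slot;

/*
 * Copy 'len' bytes from the slot into 'dst' if the slot holds the page
 * ending at 'expected_end' both before and after the copy. Returns the
 * number of bytes the caller may trust: 'len' on success, 0 otherwise.
 */
static size_t
slot_read_optimistic(Slot *slot, uint64_t expected_end, char *dst, size_t len)
{
	assert(slot != NULL && dst != NULL);

	if (len > PAGE_SZ)
		return 0;

	/* First check: is the wanted page in the slot at all? */
	if (atomic_load_explicit(&slot->end_ptr, memory_order_relaxed) != expected_end)
		return 0;

	/* Stand-in for pg_read_barrier(): order the check before the copy. */
	atomic_thread_fence(memory_order_acquire);

	memcpy(dst, slot->data, len);

	/* Order the copy before the recheck. */
	atomic_thread_fence(memory_order_acquire);

	/* Second check: if the page was replaced meanwhile, discard the copy. */
	if (atomic_load_explicit(&slot->end_ptr, memory_order_relaxed) != expected_end)
		return 0;

	return len;
}
```

The sketch only works if the writer invalidates or changes end_ptr (followed by a write barrier) before it starts overwriting the page contents; that writer side is not shown. In strict C11 terms, copying `data` while a writer modifies it is also a data race, which the model ignores.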

FWIW, I found heapam.c using unlikely() extensively for safety
checks.

OK, I won't object to the use of unlikely(), though I typically don't
use it without a fairly strong reason to think I should override what
the compiler thinks and/or what branch predictors can handle.

In this case, I think some of those errors are not really necessary
anyway, though:

* XLogReadFromBuffers shouldn't take a timeline argument just to
demand that it's always equal to the wal insertion timeline.
* Why check that startptr is earlier than the flush pointer, but not
startptr+count? Also, given that we intend to support reading unflushed
data, it would be good to comment that the function still works past
the flush pointer, and that it will be safe to remove later (right?).
* An "Assert(!RecoveryInProgress())" would be more appropriate than
an error. Perhaps we will remove even that check in the future to
achieve cascaded replication of unflushed data.

Regards,
Jeff Davis

#44Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Jeff Davis (#43)
3 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Sat, Oct 28, 2023 at 2:22 AM Jeff Davis <pgsql@j-davis.com> wrote:

I think I see what you are saying: WALRead() is at a lower level than
the XLogReaderRoutine callbacks, because it's used by the .page_read
callbacks.

That makes sense, but my first interpretation was that WALRead() is
above the XLogReaderRoutine callbacks because it calls .segment_open
and .segment_close. To me that sounds like a layering violation, but it
exists today without your patch.

Right. WALRead() is a common function used by most if not all
page_read callbacks. Typically, the page_read callbacks code has 2
parts - first determine the target/start LSN and second read WAL (via
WALRead() for instance).

I suppose the question is: should reading from the WAL buffers be an
intentional thing that specific callers do explicitly?
Or is it an optimization that should be hidden from the caller?

I tend toward the former, at least for now.

Yes, it's an optimization that must be hidden from the caller.

I suspect that when we do
some more interesting things, like replicating unflushed data, we will
want reading from buffers to be a separate step, not combined with
WALRead(). After things in this area settle a bit then we might want to
refactor and combine them again.

As said previously, the new XLogReadFromBuffers() function is generic
and extensible in the way that anyone with a target/start LSN
(corresponding to flushed or written-but-not-yet-flushed WAL) and TLI
can call it to read from WAL buffers. It's just that the patch
currently uses it where it makes sense i.e. in WALRead(). But, it's
usable in, say, a page_read callback reading unflushed WAL from WAL
buffers.

If someone wants to read unflushed WAL, the typical way to implement
it is to write a new page_read callback
read_local_unflushed_xlog_page/logical_read_unflushed_xlog_page or
similar without WALRead() but just the new function
XLogReadFromBuffers to read from WAL buffers and return.

Then why is it being called from WALRead() at all?

The patch focuses on reading flushed WAL from WAL buffers if
available, not the unflushed WAL at all; that's why it's in WALRead()
before reading from the WAL file using pg_pread().

I'm trying to make a point that the XLogReadFromBuffers() enables one
to read unflushed WAL from WAL buffers (if really wanted for future
features like replicate from WAL buffers as a new opt-in feature to
improve the replication performance).

I prefer to keep the partial hit handling as-is just
in case:

So a "partial hit" is essentially a narrow race condition where one
page is read from buffers, and it's valid; and by the time it gets to
the next page, it has already been evicted (along with the previously
read page)?

In other words, I think you are describing a case where
eviction is happening faster than the memcpy()s in a loop, which is
certainly possible due to scheduling or whatever, but doesn't seem like
the common case.

The case I'd expect for a partial read is when the startptr points to
an evicted page, but some later pages in the requested range are still
present in the buffers.

I'm not really sure whether either of these cases matters, but if we
implement one and not the other, there should be some explanation.

At any given point of time, WAL buffer pages are maintained as a
circularly sorted array in an ascending order from
OldestInitializedPage to InitializedUpTo (new pages are inserted at
this end). Also, the current patch holds WALBufMappingLock while
reading the buffer pages, meaning, no one can replace the buffer pages
until reading is finished. Therefore, the two partial hit cases described
above can't happen: when reading multiple pages, if the first page
is found in the buffer pages, the other pages must exist too, because
the WAL buffer page array is circular and sorted.

Here's an illustration with WAL buffers circular array (idx, LSN) of
size 10 elements with contents as {(0, 160), (1, 170), (2, 180), (3,
90), (4, 100), (5, 110), (6, 120), (7, 130), (8, 140), (9, 150)} and
current InitializedUpTo pointing to page at LSN 180, idx 2.
- Read 6 pages starting from LSN 80. Nothing is read from WAL buffers
as the page at LSN 80 doesn't exist, even though the other 5 pages
starting from LSN 90 exist.
- Read 6 pages starting from LSN 90. All the pages exist and are read
from WAL buffers.
- Read 6 pages starting from LSN 150. Note that WAL is currently
flushed only up to page at LSN 180 and the callers won't ask for
unflushed WAL read. If a caller asks for an unflushed WAL read
intentionally or unintentionally, XLogReadFromBuffers() reads only 4
pages starting from LSN 150 to LSN 180 and will leave the remaining 2
pages for the caller to deal with. This is the partial hit that can
happen. Therefore, retaining the partial hit code in WALRead() as-is
in the current patch is needed IMV.
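
The three reads in the illustration can be replayed with a small model. This is a sketch under stated assumptions, not the patch's code: `ring`, `find_slot()` and `read_pages_from_ring()` are hypothetical names, page LSNs are 10 apart as in the illustration, and the linear search stands in for the arithmetic LSN-to-index mapping (XLogRecPtrToBufIdx) that the server uses.

```c
#include <assert.h>

#define NBUFFERS	10			/* slots in the illustrative ring */
#define PAGE_LSN	10			/* LSN distance between consecutive pages */

/* Page LSN held by each slot, exactly as listed in the illustration. */
static const int ring[NBUFFERS] =
{160, 170, 180, 90, 100, 110, 120, 130, 140, 150};

/* Return the slot holding the page at 'lsn', or -1 if it isn't buffered. */
static int
find_slot(int lsn)
{
	for (int i = 0; i < NBUFFERS; i++)
	{
		if (ring[i] == lsn)
			return i;
	}
	return -1;
}

/*
 * Count how many of 'npages' consecutive pages starting at 'start_lsn' can
 * be served from the ring. If the first page is absent nothing is read,
 * even when later pages of the range are present.
 */
static int
read_pages_from_ring(int start_lsn, int npages)
{
	int			idx = find_slot(start_lsn);
	int			nread = 0;

	assert(npages >= 0);

	if (idx < 0)
		return 0;				/* miss */

	/* Walk forward until the range ends or the next slot holds another page. */
	while (nread < npages && ring[idx] == start_lsn + nread * PAGE_LSN)
	{
		nread++;
		idx = (idx + 1) % NBUFFERS;
	}

	return nread;				/* npages: hit; fewer: partial hit */
}
```

With this model, reading 6 pages from LSN 80 returns 0, from LSN 90 returns 6, and from LSN 150 returns 4, matching the three cases listed above.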

Yes, I proposed that idea in another thread -
/messages/by-id/CALj2ACVFSirOFiABrNVAA6JtPHvA9iu+wp=qkM9pdLZ5mwLaFg@mail.gmail.com.
If that looks okay, I can add it to the next version of this patch set.

The code in the email above still shows a call to:

/*
 * Requested WAL is available in WAL buffers, so recheck the existence
 * under the WALBufMappingLock and read if the page still exists, otherwise
 * return.
 */
LWLockAcquire(WALBufMappingLock, LW_SHARED);

and I don't think that's required. How about something like:

endptr1 = XLogCtl->xlblocks[idx];
/* Requested WAL isn't available in WAL buffers. */
if (expectedEndPtr != endptr1)
    break;

pg_read_barrier();
...
memcpy(buf, data, bytes_read_this_loop);
...
pg_read_barrier();
endptr2 = XLogCtl->xlblocks[idx];
if (expectedEndPtr != endptr2)
    break;

ntotal += bytes_read_this_loop;
/* success; move on to next page */

I'm not sure why GetXLogBuffer() doesn't just use pg_atomic_read_u64().
I suppose because xlblocks are not guaranteed to be 64-bit aligned?
Should we just align it to 64 bits so we can use atomics? (I don't
think it matters in this case, but atomics would avoid the need to
think about it.)

WALBufMappingLock protects both xlblocks and WAL buffer pages [1][2].
I'm not sure how using the memory barrier, not WALBufMappingLock,
prevents writers from replacing WAL buffer pages while readers are
reading the pages. FWIW, GetXLogBuffer() reads the xlblocks value without
the lock but it confirms the WAL existence under the lock and gets the
WAL buffer page under the lock [3].

I'll reiterate the WALBufMappingLock thing for this patch - the idea
is to know whether or not the WAL at a given LSN exists in WAL buffers
without acquiring WALBufMappingLock; if it exists, acquire the lock in
shared mode, read from WAL buffers and then release it. WAL buffer pages
are organized as a circular array with the InitializedUpTo as the
latest filled WAL buffer page. If there's a way to track the oldest
filled WAL buffer page (OldestInitializedPage), at any given point of
time, the elements of the circular array are sorted in an ascending
order from OldestInitializedPage to InitializedUpTo. With this
approach, no lock is required to know if the WAL at a given LSN exists
in WAL buffers; we can just check lsn >=
XLogCtl->OldestInitializedPage && lsn < XLogCtl->InitializedUpTo. I
proposed this idea here
/messages/by-id/CALj2ACVgi6LirgLDZh=FdfdvGvKAD==WTOSWcQy=AtNgPDVnKw@mail.gmail.com.
I've pulled that patch in here as 0001 to showcase its use for this
feature.
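
The range check described above can be sketched in isolation as follows. This is a standalone model, not the 0001 patch: `WalRing`, `ring_init_next_page()` and `wal_maybe_in_buffers()` are hypothetical names, and the rule used here for advancing the oldest page on wraparound (move it forward by exactly one page) is an assumption of the model that may differ from what AdvanceXLInsertBuffer() does in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BLCKSZ		8192		/* model page size, stands in for XLOG_BLCKSZ */
#define NBUFFERS	4			/* model ring size */
#define INVALID_LSN 0			/* stands in for InvalidXLogRecPtr */

typedef struct WalRing
{
	uint64_t	oldest_initialized_page;	/* start LSN of the oldest page */
	uint64_t	initialized_up_to;			/* end LSN of the newest page */
} WalRing;

/* Slot that holds, or would hold, the page containing 'lsn'. */
static int
ring_lsn_to_idx(uint64_t lsn)
{
	return (int) ((lsn / BLCKSZ) % NBUFFERS);
}

/*
 * Initialize the page starting at 'new_page_begin'. Pages are assumed to be
 * initialized in increasing LSN order, starting at a nonzero LSN because 0
 * means "invalid" in this model.
 */
static void
ring_init_next_page(WalRing *ring, uint64_t new_page_begin)
{
	assert(new_page_begin != INVALID_LSN);
	assert(new_page_begin % BLCKSZ == 0);

	if (ring->oldest_initialized_page == INVALID_LSN)
		ring->oldest_initialized_page = new_page_begin;
	else if (ring_lsn_to_idx(ring->oldest_initialized_page) ==
			 ring_lsn_to_idx(new_page_begin))
		ring->oldest_initialized_page += BLCKSZ;	/* oldest page overwritten */

	ring->initialized_up_to = new_page_begin + BLCKSZ;
}

/*
 * Lock-free membership test. A true result is only a hint: the page may be
 * replaced right after the check, so a real reader must still verify the
 * page when it actually reads it.
 */
static bool
wal_maybe_in_buffers(const WalRing *ring, uint64_t lsn)
{
	return lsn >= ring->oldest_initialized_page &&
		lsn < ring->initialized_up_to;
}
```

With a 4-slot ring, initializing pages 1 through 5 leaves pages 2 through 5 buffered, so the check rejects any LSN on page 1 and accepts LSNs from the start of page 2 up to, but not including, the end of page 5.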

* Why check that startptr is earlier than the flush pointer, but not
startptr+count? Also, given that we intend to support reading unflushed
data, it would be good to comment that the function still works past
the flush pointer, and that it will be safe to remove later (right?).

That won't work, see the comment below. Actual flushed LSN may not
always be greater than startptr+count. GetFlushRecPtr() check in
XLogReadFromBuffers() is similar to what pg_walinspect has in
GetCurrentLSN().

/*
* Even though we just determined how much of the page can be validly read
* as 'count', read the whole page anyway. It's guaranteed to be
* zero-padded up to the page boundary if it's incomplete.
*/
if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
&errinfo))

* An "Assert(!RecoveryInProgress())" would be more appropriate than
an error. Perhaps we will remove even that check in the future to
achieve cascaded replication of unflushed data.

In this case, I think some of those errors are not really necessary
anyway, though:

* XLogReadFromBuffers shouldn't take a timeline argument just to
demand that it's always equal to the wal insertion timeline.

I've changed XLogReadFromBuffers() to return as-if nothing was read
(cache miss) when the server is in recovery or the requested TLI is
not the current server's insertion TLI. It is better than failing with
ERRORs so that the callers don't have to have any checks for recovery
or TLI.
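
The caller-side effect of treating recovery or a TLI mismatch as a miss can be shown with a small model of the "buffers first, file for the rest" flow. This is a sketch under stated assumptions, not the patch itself: `ModelServer`, `model_read_from_buffers()` and `model_wal_read()` are hypothetical names, the whole WAL stream is a string, and a second memcpy() stands in for the file read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the server state the read path looks at. */
typedef struct ModelServer
{
	bool		in_recovery;
	int			insert_tli;
	const char *wal;			/* the whole "WAL stream" of the model */
	size_t		buf_start;		/* [buf_start, buf_end) is held in buffers */
	size_t		buf_end;
	size_t		file_reads;		/* bytes served from the "file" so far */
} ModelServer;

/* Return the number of bytes served from buffers; 0 means a miss. */
static size_t
model_read_from_buffers(const ModelServer *srv, size_t startptr, int tli,
						size_t count, char *dst)
{
	size_t		n;

	/* Recovery or a foreign timeline is a plain miss, not an error. */
	if (srv->in_recovery || tli != srv->insert_tli)
		return 0;

	/* Miss unless the first requested byte is buffered. */
	if (startptr < srv->buf_start || startptr >= srv->buf_end)
		return 0;

	n = srv->buf_end - startptr;
	if (n > count)
		n = count;
	memcpy(dst, srv->wal + startptr, n);
	return n;
}

/* Try the buffers first, then read whatever is left from the "file". */
static void
model_wal_read(ModelServer *srv, size_t startptr, int tli, size_t count,
			   char *dst)
{
	size_t		nread;

	assert(srv != NULL && dst != NULL);

	nread = model_read_from_buffers(srv, startptr, tli, count, dst);
	if (nread == count)
		return;					/* full hit */

	/* Partial hit or miss: skip what we already have. */
	dst += nread;
	startptr += nread;
	count -= nread;

	memcpy(dst, srv->wal + startptr, count);	/* stands in for the file read */
	srv->file_reads += count;
}
```

Because a miss is just a return value of 0, the caller needs no recovery or TLI checks of its own: every case other than a full hit falls through to the same file-read path.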

PSA v14 patch set.

[1]
 * WALBufMappingLock: must be held to replace a page in the WAL buffer cache.

[2]
 * and xlblocks values certainly do. xlblocks values are protected by
 * WALBufMappingLock.
 */
char *pages; /* buffers for unwritten XLOG pages */
XLogRecPtr *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */

[3]
 * However, we don't hold a lock while we read the value. If someone has
 * just initialized the page, it's possible that we get a "torn read" of
 * the XLogRecPtr if 64-bit fetches are not atomic on this platform. In
 * that case we will see a bogus value. That's ok, we'll grab the mapping
 * lock (in AdvanceXLInsertBuffer) and retry if we see anything else than
 * the page we're looking for. But it means that when we do this unlocked
 * read, we might see a value that appears to be ahead of the page we're
 * looking for. Don't PANIC on that, until we've verified the value while
 * holding the lock.
 */
expectedEndPtr = ptr;
expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;

endptr = XLogCtl->xlblocks[idx];

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v14-0001-Track-oldest-initialized-WAL-buffer-page.patch (application/octet-stream)
From 5b5469d7dcd8e98bfcaf14227e67356bbc1f5fe8 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 2 Nov 2023 15:10:51 +0000
Subject: [PATCH v14] Track oldest initialized WAL buffer page

---
 src/backend/access/transam/xlog.c | 170 ++++++++++++++++++++++++++++++
 src/include/access/xlog.h         |   1 +
 2 files changed, 171 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b541be8eec..fdf2ef310b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -504,6 +504,45 @@ typedef struct XLogCtlData
 	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
 	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
 
+	/*
+	 * Start address of oldest initialized page in XLog buffers.
+	 *
+	 * We mainly track oldest initialized page explicitly to quickly tell if a
+	 * given WAL record is available in XLog buffers. It also can be used for
+	 * other purposes, see notes below.
+	 *
+	 * OldestInitializedPage gives XLog buffers following properties:
+	 *
+	 * 1) At any given point of time, pages in XLog buffers array are sorted
+	 * in an ascending order from OldestInitializedPage till InitializedUpTo.
+	 * Note that we verify this property for assert-only builds, see
+	 * IsXLogBuffersArraySorted() for more details.
+	 *
+	 * 2) OldestInitializedPage is monotonically increasing (by virtue of how
+	 * postgres generates WAL records), that is, its value never decreases.
+	 * This property lets someone read its value without a lock. There's no
+	 * problem even if its value is slightly stale i.e. concurrently being
+	 * updated. One can still use it for finding if a given WAL record is
+	 * available in XLog buffers. At worst, one might get false positives
+	 * (i.e. OldestInitializedPage may tell that the WAL record is available
+	 * in XLog buffers, but when one actually looks at it, it isn't really
+	 * available). This is more efficient and performant than acquiring a lock
+	 * for reading. Note that we may not need a lock to read
+	 * OldestInitializedPage but we need to update it holding
+	 * WALBufMappingLock.
+	 *
+	 * 3) One can start traversing XLog buffers from OldestInitializedPage
+	 * till InitializedUpTo to list out all valid WAL records and stats, and
+	 * expose them via SQL-callable functions to users.
+	 *
+	 * 4) XLog buffers array is inherently organized as a circular, sorted and
+	 * rotated array with OldestInitializedPage as pivot with the property
+	 * where LSN of previous buffer page (if valid) is greater than
+	 * OldestInitializedPage and LSN of next buffer page (if valid) is greater
+	 * than OldestInitializedPage.
+	 */
+	XLogRecPtr	OldestInitializedPage;
+
 	/*
 	 * InsertTimeLineID is the timeline into which new WAL is being inserted
 	 * and flushed. It is zero during recovery, and does not change once set.
@@ -590,6 +629,10 @@ static ControlFileData *ControlFile = NULL;
 #define NextBufIdx(idx)		\
 		(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))
 
+/* Macro to retreat to previous buffer index. */
+#define PreviousBufIdx(idx)		\
+		(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))
+
 /*
  * XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
  * would hold if it was in cache, the page containing 'recptr'.
@@ -708,6 +751,10 @@ static void WALInsertLockAcquireExclusive(void);
 static void WALInsertLockRelease(void);
 static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);
 
+#ifdef USE_ASSERT_CHECKING
+static bool IsXLogBuffersArraySorted(void);
+#endif
+
 /*
  * Insert an XLOG record represented by an already-constructed chain of data
  * chunks.  This is a low-level routine; to construct the WAL record header
@@ -1992,6 +2039,52 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 		XLogCtl->InitializedUpTo = NewPageEndPtr;
 
 		npages++;
+
+		/*
+		 * Try updating oldest initialized XLog buffer page.
+		 *
+		 * Update it if we are initializing an XLog buffer page for the first
+		 * time or if XLog buffers are full and we are wrapping around.
+		 */
+		if (XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) ||
+			XLogRecPtrToBufIdx(XLogCtl->OldestInitializedPage) == nextidx)
+		{
+			Assert(XLogCtl->OldestInitializedPage < NewPageBeginPtr);
+
+			XLogCtl->OldestInitializedPage = NewPageBeginPtr;
+		}
+
+		/*
+		 * Check some properties about XLog buffers array. We essentially
+		 * perform these checks as asserts to avoid extra costs.
+		 *
+		 * XXX: Perhaps these extra checks are too much for an assert build,
+		 * so placing them under WAL_DEBUG might be worth trying.
+		 */
+
+		/* OldestInitializedPage must have already been initialized. */
+		Assert(!XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage));
+
+		/*
+		 * OldestInitializedPage is always a starting address of XLog buffer
+		 * page.
+		 */
+		Assert((XLogCtl->OldestInitializedPage % XLOG_BLCKSZ) == 0);
+
+		/*
+		 * OldestInitializedPage and InitializedUpTo are always starting and
+		 * ending addresses of (same or different) XLog buffer page
+		 * respectively. Hence, they can never be same even if there's only
+		 * one initialized page in XLog buffers.
+		 */
+		Assert(XLogCtl->OldestInitializedPage != XLogCtl->InitializedUpTo);
+
+		/*
+		 * At any given point of time, pages in XLog buffers array are sorted
+		 * in an ascending order from OldestInitializedPage till
+		 * InitializedUpTo.
+		 */
+		Assert(IsXLogBuffersArraySorted());
 	}
 	LWLockRelease(WALBufMappingLock);
 
@@ -4711,6 +4804,7 @@ XLOGShmemInit(void)
 	XLogCtl->SharedRecoveryState = RECOVERY_STATE_CRASH;
 	XLogCtl->InstallXLogFileSegmentActive = false;
 	XLogCtl->WalWriterSleeping = false;
+	XLogCtl->OldestInitializedPage = InvalidXLogRecPtr;
 
 	SpinLockInit(&XLogCtl->Insert.insertpos_lck);
 	SpinLockInit(&XLogCtl->info_lck);
@@ -5717,6 +5811,14 @@ StartupXLOG(void)
 
 		XLogCtl->xlblocks[firstIdx] = endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ;
 		XLogCtl->InitializedUpTo = endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ;
+		XLogCtl->OldestInitializedPage = endOfRecoveryInfo->lastPageBeginPtr;
+
+		/*
+		 * OldestInitializedPage is always a starting address of XLog buffer
+		 * page.
+		 */
+		Assert(!XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage));
+		Assert((XLogCtl->OldestInitializedPage % XLOG_BLCKSZ) == 0);
 	}
 	else
 	{
@@ -9109,3 +9211,71 @@ SetWalWriterSleeping(bool sleeping)
 	XLogCtl->WalWriterSleeping = sleeping;
 	SpinLockRelease(&XLogCtl->info_lck);
 }
+
+#ifdef USE_ASSERT_CHECKING
+/*
+ * Returns whether or not XLog buffers array is sorted.
+ *
+ * XXX: Perhaps this function is too much for an assert build, so placing it
+ * under WAL_DEBUG might be worth trying.
+ */
+static bool
+IsXLogBuffersArraySorted(void)
+{
+	int			start;
+	int			end;
+	int			current;
+	int			next;
+	XLogRecPtr	CurrentPage;
+	XLogRecPtr	NextPage;
+
+	start = XLogRecPtrToBufIdx(XLogCtl->OldestInitializedPage);
+	end = XLogRecPtrToBufIdx(XLogCtl->InitializedUpTo - XLOG_BLCKSZ);
+
+	if (start == end)
+		return true;
+
+	current = start;
+
+	while (current != end)
+	{
+		CurrentPage = XLogCtl->xlblocks[current];
+
+		next = NextBufIdx(current);
+		NextPage = XLogCtl->xlblocks[next];
+
+		if (!XLogRecPtrIsInvalid(NextPage) &&
+			CurrentPage > NextPage)
+			return false;
+
+		current = next;
+	}
+
+	Assert(XLogCtl->xlblocks[current] == XLogCtl->xlblocks[end]);
+
+	return true;
+}
+#endif
+
+/*
+ * Returns whether or not a given WAL record is available in XLog buffers.
+ *
+ * Note that we don't read OldestInitializedPage under a lock, see description
+ * near its definition in xlog.c for more details.
+ *
+ * Note that caller needs to pass in an LSN known to the server, not a future
+ * or unwritten or unflushed LSN.
+ */
+bool
+IsWALRecordAvailableInXLogBuffers(XLogRecPtr lsn)
+{
+	if (!XLogRecPtrIsInvalid(lsn) &&
+		!XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) &&
+		lsn >= XLogCtl->OldestInitializedPage &&
+		lsn < XLogCtl->InitializedUpTo)
+	{
+		return true;
+	}
+
+	return false;
+}
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..35235010e6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ extern void ReachedEndOfBackup(XLogRecPtr EndRecPtr, TimeLineID tli);
 extern void SetInstallXLogFileSegmentActive(void);
 extern bool IsInstallXLogFileSegmentActive(void);
 extern void XLogShutdownWalRcv(void);
+extern bool IsWALRecordAvailableInXLogBuffers(XLogRecPtr lsn);
 
 /*
  * Routines to start, stop, and get status of a base backup.
-- 
2.34.1

v14-0002-Allow-WAL-reading-from-WAL-buffers.patch (application/octet-stream)
From db027d8f1dcb53ebceef0135287f120acf67cc21 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 2 Nov 2023 15:36:11 +0000
Subject: [PATCH v14] Allow WAL reading from WAL buffers

This commit gives WALRead() the capability to read WAL from WAL
buffers when possible. When requested WAL isn't available in WAL
buffers, the WAL is read from the WAL file as usual. It relies on
WALBufMappingLock so that no one replaces the WAL buffer page that
we're reading from. It skips reading from WAL buffers if
WALBufMappingLock can't be acquired immediately. In other words,
it doesn't wait for WALBufMappingLock to be available. This helps
reduce the contention on WALBufMappingLock.

This commit benefits the callers of WALRead(), namely walsenders
and pg_walinspect. They can now avoid reading WAL from the WAL
file (possibly avoiding disk IO). Tests show that the WAL buffers
hit ratio stood at 95% for 1 primary, 1 sync standby, 1 async
standby, with pgbench --scale=300 --client=32 --time=900. In other
words, 95% of the time the walsenders avoided reading from the
file and the associated pread system calls:
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com

This commit also helps when direct IO is enabled for WAL.
Reading WAL from WAL buffers brings performance back close to
that without direct IO for WAL:
https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com

This commit also paves the way for the following features in
the future:
- Improved synchronous replication performance by replicating
directly from WAL buffers.
- An opt-in way for the walreceivers to receive unflushed WAL.
More details here:
https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar, Andres Freund
Reviewed-by: Nathan Bossart, Kuntal Ghosh
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/backend/access/transam/xlog.c       | 205 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  41 ++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 250 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fdf2ef310b..ff5dccaaa7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1753,6 +1753,211 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and return total read bytes.
+ *
+ * This function returns quickly in the following cases:
+ * - When passed-in timeline is different than server's current insertion
+ * timeline as WAL is always inserted into WAL buffers on insertion timeline.
+ *
+ * - When server is in recovery as WAL buffers aren't currently used in
+ * recovery.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. Caller must be aware of
+ * this and deal with it.
+ *
+ * Note that this function is not available for frontend code as WAL buffers is
+ * an internal mechanism to the server.
+ */
+Size
+XLogReadFromBuffers(XLogReaderState *state,
+					XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf)
+{
+	XLogRecPtr	ptr;
+	XLogRecPtr	cur_lsn;
+	Size		nbytes;
+	Size		ntotal;
+	Size		nbatch;
+	char	   *batchstart;
+
+	if (RecoveryInProgress())
+		return 0;
+
+	if (tli != GetWALInsertionTimeLine())
+		return 0;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+
+	cur_lsn = GetFlushRecPtr(NULL);
+	if (unlikely(startptr > cur_lsn))
+		elog(ERROR, "WAL start LSN %X/%X specified for reading from WAL buffers must be less than current database system WAL LSN %X/%X",
+			 LSN_FORMAT_ARGS(startptr), LSN_FORMAT_ARGS(cur_lsn));
+
+	if (!IsWALRecordAvailableInXLogBuffers(startptr))
+		return 0;
+
+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it. We try to acquire it in shared mode so that
+	 * the concurrent WAL readers are also allowed. We try to do as little work
+	 * as possible while holding the lock to not impact concurrent WAL writers
+	 * much. We quickly exit to not cause any contention, if the lock isn't
+	 * immediately available.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return 0;
+
+	ptr = startptr;
+	nbytes = count;				/* Total bytes requested to be read by caller. */
+	ntotal = 0;					/* Total bytes read. */
+	nbatch = 0;					/* Bytes to be read in single batch. */
+	batchstart = NULL;			/* Location to read from for single batch. */
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int			idx;
+		char	   *page;
+		char	   *data;
+		XLogPageHeader phdr;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+		endptr = XLogCtl->xlblocks[idx];
+
+		/* Requested WAL isn't available in WAL buffers. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/*
+		 * The fact that we acquire WALBufMappingLock while reading the WAL
+		 * buffer page itself guarantees that no one else initializes it or
+		 * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
+		 * need to ensure that we are not reading a page that just got
+		 * initialized. For this, we look at the needed page header.
+		 */
+		phdr = (XLogPageHeader) page;
+
+		/* Return, if WAL buffer page doesn't look valid. */
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+			break;
+
+		/*
+		 * Note that we don't perform all page header checks here to avoid
+		 * extra work in production builds; callers will anyway do those
+		 * checks extensively. However, in an assert-enabled build, we perform
+		 * all the checks here and raise an error if failed.
+		 */
+#ifdef USE_ASSERT_CHECKING
+		if (unlikely(state != NULL &&
+					 !XLogReaderValidatePageHeader(state, (endptr - XLOG_BLCKSZ),
+												   (char *) phdr)))
+		{
+			if (state->errormsg_buf[0])
+				ereport(ERROR,
+						(errcode(ERRCODE_INTERNAL_ERROR),
+						 errmsg_internal("%s", state->errormsg_buf)));
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INTERNAL_ERROR),
+						 errmsg_internal("could not read WAL from WAL buffers")));
+		}
+#endif
+
+		/* Count what is wanted, not the whole page. */
+		if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			nbatch += nbytes;
+			ntotal += nbytes;
+			nbytes = 0;
+		}
+		else
+		{
+			Size		navailable;
+
+			/*
+			 * All the bytes are not in one page. Deduce available bytes on
+			 * the current page, count them and continue to look for remaining
+			 * bytes.
+			 */
+			navailable = XLOG_BLCKSZ - (data - page);
+			Assert(navailable > 0 && navailable <= nbytes);
+			ptr += navailable;
+			nbytes -= navailable;
+			nbatch += navailable;
+			ntotal += navailable;
+		}
+
+		/*
+		 * We avoid multiple memcpy calls while reading WAL. Note that we
+		 * memcpy what we have counted so far whenever we are wrapping around
+		 * WAL buffers (because WAL buffers are organized as circular array
+		 * of pages) and continue to look for remaining WAL.
+		 */
+		if (batchstart == NULL)
+		{
+			/* Mark where the data in WAL buffers starts from. */
+			batchstart = data;
+		}
+
+		/*
+		 * We are wrapping around WAL buffers, so read what we have counted so
+		 * far.
+		 */
+		if (idx == XLogCtl->XLogCacheBlck)
+		{
+			Assert(batchstart != NULL);
+			Assert(nbatch > 0);
+
+			memcpy(buf, batchstart, nbatch);
+			buf += nbatch;
+
+			/* Reset for next batch. */
+			batchstart = NULL;
+			nbatch = 0;
+		}
+	}
+
+	/* Read what we have counted so far. */
+	Assert(nbatch <= ntotal);
+	if (batchstart != NULL && nbatch > 0)
+		memcpy(buf, batchstart, nbatch);
+
+	LWLockRelease(WALBufMappingLock);
+
+	/* We never read more than what the caller has asked for. */
+	Assert(ntotal <= count);
+
+#ifdef WAL_DEBUG
+	if (XLOG_DEBUG)
+		ereport(DEBUG1,
+				(errmsg_internal("read %zu bytes out of %zu bytes from WAL buffers for given start LSN %X/%X, timeline ID %u",
+								 ntotal, count, LSN_FORMAT_ARGS(startptr), tli)));
+#endif
+
+	return ntotal;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e0baa86bd3..5820c5eedc 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1473,8 +1473,10 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers. When
+ * requested WAL isn't available in WAL buffers, the WAL is read from the WAL
+ * file as usual. The callers may avoid reading WAL from the WAL file thus
+ * reducing read system calls or even disk IOs.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1484,6 +1486,41 @@ WALRead(XLogReaderState *state,
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	Size		nread;
+#endif
+
+#ifndef FRONTEND
+
+	/*
+	 * Try reading WAL from WAL buffers. Frontend code has no idea of WAL
+	 * buffers.
+	 */
+	nread = XLogReadFromBuffers(state, startptr, tli, count, buf);
+
+	Assert(nread >= 0);
+
+	/*
+	 * Check if we have read fully (hit), partially (partial hit) or nothing
+	 * (miss) from WAL buffers. If we have read either partially or nothing,
+	 * then continue to read the remaining bytes the usual way, that is, read
+	 * from WAL file.
+	 *
+	 * XXX: It might be worth exposing WAL buffer read stats.
+	 */
+	if (count == nread)
+		return true;			/* Buffer hit, so return. */
+	else if (count > nread)
+	{
+		/*
+		 * Buffer partial hit, so reset the state to count the read bytes and
+		 * continue.
+		 */
+		buf += nread;
+		startptr += nread;
+		count -= nread;
+	}
+#endif
 
 	p = buf;
 	recptr = startptr;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 35235010e6..0e6a3d4264 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -251,6 +251,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern Size XLogReadFromBuffers(struct XLogReaderState *state,
+								XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

v14-0003-Add-test-module-for-verifying-WAL-read-from-WAL-.patch (application/octet-stream)
From d3e8b16e078b0fb8fbbda17c43fc6c2a77bf145f Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 2 Nov 2023 16:37:08 +0000
Subject: [PATCH v14] Add test module for verifying WAL read from WAL buffers

This commit adds a test module to verify WAL read from WAL
buffers.

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 ++++++++
 .../test_wal_read_from_buffers/meson.build    | 33 +++++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 58 +++++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 +++++
 .../test_wal_read_from_buffers.c              | 37 ++++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 177 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index e81873cb5a..f5aedb95a4 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index fcd643f6f1..86fd74ab50 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,5 +28,6 @@ subdir('test_regex')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7472494501
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40bd5dcd33
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_wal_read_from_buffers
+
+test_install_data += files(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..5d94f8a960
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,58 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers;');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn');});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+my ($psql_ret, $psql_stdout, $psql_stderr) = ('', '', '');
+
+# Check WAL read from buffers with an LSN greater than current database system
+# LSN.
+$lsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_flush_lsn()+1000;');
+
+# Must not use safe_psql since we expect an error here.
+($psql_ret, $psql_stdout, $psql_stderr) =
+  $node->psql('postgres', qq{SELECT test_wal_read_from_buffers('$lsn');});
+like(
+	$psql_stderr,
+	qr/ERROR: ( [A-Z0-9]+:)? WAL start LSN $lsn specified for reading from WAL buffers must be less than current database system WAL LSN *./,
+	"WAL read from WAL buffers failed due to an LSN greater than current database system LSN");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..c6ffb3fa65
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    read_successful OUT boolean
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..2307cbff7a
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,37 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	char		data[XLOG_BLCKSZ] = {0};
+	Size		nread;
+
+	nread = XLogReadFromBuffers(NULL, PG_GETARG_LSN(0),
+								GetWALInsertionTimeLine(),
+								XLOG_BLCKSZ, data);
+
+	PG_RETURN_BOOL(nread > 0);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..eda8d47954
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

#45Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#44)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Thu, 2023-11-02 at 22:38 +0530, Bharath Rupireddy wrote:

I suppose the question is: should reading from the WAL buffers be an
intentional thing that specific callers do explicitly?
Or is it an optimization that should be hidden from the caller?

I tend toward the former, at least for now.

Yes, it's an optimization that must be hidden from the caller.

As I said, I tend toward the opposite: that specific callers should
read from the buffers explicitly in the cases where it makes sense.

I don't think this is the most important point right now though, let's
sort out the other details.

At any given point of time, WAL buffer pages are maintained as a
circularly sorted array in an ascending order from
OldestInitializedPage to InitializedUpTo (new pages are inserted at
this end).

I don't see any reference to OldestInitializedPage or anything like it,
with or without your patch. Am I missing something?

- Read 6 pages starting from LSN 80. Nothing is read from WAL buffers
as the page at LSN 80 doesn't exist, even though the other 5 pages
starting from LSN 90 exist.

This is what I imagined a "partial hit" was: read the 5 pages starting
at 90. The caller would then need to figure out how to read the page at
LSN 80 from the segment files.

I am not saying we should support this case; perhaps it doesn't matter.
I'm just describing why that term was confusing for me.

If a caller asks for an unflushed WAL read
intentionally or unintentionally, XLogReadFromBuffers() reads only 4
pages starting from LSN 150 to LSN 180 and will leave the remaining 2
pages for the caller to deal with. This is the partial hit that can
happen.

To me that's more like an EOF case. "Partial hit" sounds to me like the
data exists but is not available in the cache (i.e. go to the segment
files); whereas if it encountered the end, the data is not available at
all.

WALBufMappingLock protects both xlblocks and WAL buffer pages [1][2].
I'm not sure how using the memory barrier, not WALBufMappingLock,
prevents writers from replacing WAL buffer pages while readers are
reading the pages.

It doesn't *prevent* that case, but it does *detect* that case. We
don't want to prevent writers from replacing WAL buffers, because that
would mean we are slowing down the critical WAL writing path.

Let me explain the potential problem cases, and how the barriers
prevent them:

Potential problem 1: the page is not yet resident in the cache at the
time the memcpy begins. In this case, the first read barrier would
ensure that the page is also not yet resident at the time xlblocks[idx]
is read into endptr1, and we'd break out of the loop.

Potential problem 2: the page is evicted before the memcpy finishes. In
this case, the second read barrier would ensure that the page was also
evicted before xlblocks[idx] is read into endptr2, and again we'd
detect the problem and break out of the loop.
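
A minimal sketch of the reader side of the two cases above, using toy stand-ins rather than the real shared-memory structures. The names toy_xlblocks, toy_pages and toy_read_page are hypothetical, and C11 acquire fences are used only as rough approximations of pg_read_barrier(); this is not the patch's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 8192         /* stand-in for XLOG_BLCKSZ */
#define TOY_NBUFFERS 4          /* hypothetical number of WAL buffer pages */

/* Toy stand-ins for XLogCtl->xlblocks and XLogCtl->pages. */
static _Atomic uint64_t toy_xlblocks[TOY_NBUFFERS];
static char toy_pages[TOY_NBUFFERS][TOY_BLCKSZ];

/*
 * Copy the page ending at 'expected_end' out of slot 'idx' without a lock.
 * Returns false if the slot did not hold that page before or after the copy,
 * in which case the contents of 'dst' must be discarded.
 */
static bool
toy_read_page(int idx, uint64_t expected_end, char *dst)
{
    uint64_t    endptr1;
    uint64_t    endptr2;

    endptr1 = atomic_load_explicit(&toy_xlblocks[idx], memory_order_relaxed);
    if (endptr1 != expected_end)
        return false;           /* problem 1: page not resident */

    /* First read barrier: the copy must not start before the check above. */
    atomic_thread_fence(memory_order_acquire);

    memcpy(dst, toy_pages[idx], TOY_BLCKSZ);

    /* Second read barrier: the re-check must not happen before the copy. */
    atomic_thread_fence(memory_order_acquire);

    endptr2 = atomic_load_explicit(&toy_xlblocks[idx], memory_order_relaxed);
    return endptr2 == expected_end; /* false: problem 2, evicted mid-copy */
}
```

The copy itself is never blocked; a concurrent replacement only shows up as a failed re-check, after which the caller falls back to the segment files.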

I assume here that, if xlblocks[idx] holds the endPtr of the desired
page, all of the bytes for that page are resident at that moment. I
don't think that's true right now: AdvanceXLInsertBuffer() zeroes the
old page before updating xlblocks[nextidx]. I think it needs something
like:

pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], InvalidXLogRecPtr);
pg_write_barrier();

before the MemSet.
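
As a rough sketch of that ordering on the writer side, again with toy stand-ins: toy_recycle_slot and the one-field "header" are hypothetical, and the C11 release fences only approximate pg_write_barrier(). It is meant to show the order of steps, not the real AdvanceXLInsertBuffer().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ 8192                 /* stand-in for XLOG_BLCKSZ */
#define TOY_NBUFFERS 4                  /* hypothetical number of buffer pages */
#define TOY_INVALID_PTR ((uint64_t) 0)  /* stand-in for InvalidXLogRecPtr */

static _Atomic uint64_t toy_xlblocks[TOY_NBUFFERS];
static char toy_pages[TOY_NBUFFERS][TOY_BLCKSZ];

/* Reuse slot 'idx' for the page that ends at 'new_page_end'. */
static void
toy_recycle_slot(int idx, uint64_t new_page_end)
{
    uint64_t    new_page_begin = new_page_end - TOY_BLCKSZ;

    /* Step 1: invalidate the slot so a lock-free reader's re-check fails. */
    atomic_store_explicit(&toy_xlblocks[idx], TOY_INVALID_PTR,
                          memory_order_relaxed);
    atomic_thread_fence(memory_order_release);

    /* Step 2: only now zero the old contents and write a toy page header. */
    memset(toy_pages[idx], 0, TOY_BLCKSZ);
    memcpy(toy_pages[idx], &new_page_begin, sizeof(new_page_begin));

    /* Step 3: make the initialized page visible before publishing it. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&toy_xlblocks[idx], new_page_end,
                          memory_order_relaxed);
}
```

With this order, xlblocks[idx] never names a page whose bytes are being zeroed: it is either the old end pointer (old bytes intact), invalid, or the new end pointer (new page initialized).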

I didn't review your latest v14 patch yet.

Regards,
Jeff Davis

#46Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Jeff Davis (#45)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, Nov 3, 2023 at 12:35 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Thu, 2023-11-02 at 22:38 +0530, Bharath Rupireddy wrote:

I suppose the question is: should reading from the WAL buffers be an
intentional thing that specific callers do explicitly?
Or is it an optimization that should be hidden from the caller?

I tend toward the former, at least for now.

Yes, it's an optimization that must be hidden from the caller.

As I said, I tend toward the opposite: that specific callers should
read from the buffers explicitly in the cases where it makes sense.

How about adding a bool flag (read_from_wal_buffers) to
XLogReaderState so that the callers can set it if they want this
facility via XLogReaderAllocate()?

At any given point of time, WAL buffer pages are maintained as a
circularly sorted array in an ascending order from
OldestInitializedPage to InitializedUpTo (new pages are inserted at
this end).

I don't see any reference to OldestInitializedPage or anything like it,
with or without your patch. Am I missing something?

OldestInitializedPage is introduced in v14-0001 patch. Please have a look.

- Read 6 pages starting from LSN 80. Nothing is read from WAL buffers
as the page at LSN 80 doesn't exist, even though the other 5 pages
starting from LSN 90 exist.

This is what I imagined a "partial hit" was: read the 5 pages starting
at 90. The caller would then need to figure out how to read the page at
LSN 80 from the segment files.

I am not saying we should support this case; perhaps it doesn't matter.
I'm just describing why that term was confusing for me.

Okay. Current patch doesn't support this case.

If a caller asks for an unflushed WAL read
intentionally or unintentionally, XLogReadFromBuffers() reads only 4
pages starting from LSN 150 to LSN 180 and will leave the remaining 2
pages for the caller to deal with. This is the partial hit that can
happen.

To me that's more like an EOF case. "Partial hit" sounds to me like the
data exists but is not available in the cache (i.e. go to the segment
files); whereas if it encountered the end, the data is not available at
all.

Right. We can tweak the comments around "partial hit" if required.
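
To make the two cases above concrete, here is a toy model of what the current patch returns. It assumes 10-byte pages and a buffer holding the pages from LSN 90 through LSN 180 (so OldestInitializedPage = 90 and InitializedUpTo = 190); those numbers are inferred from the examples in this thread, and toy_bytes_from_buffers is a hypothetical helper, not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Toy model: how many of the 'count' requested bytes starting at 'startptr'
 * can be served from buffers covering [oldest_page, initialized_upto).
 * Only a contiguous prefix of the request is ever served.
 */
static size_t
toy_bytes_from_buffers(uint64_t oldest_page, uint64_t initialized_upto,
                       uint64_t startptr, size_t count)
{
    uint64_t    available;

    /* The start of the request must itself be in the buffers. */
    if (startptr < oldest_page || startptr >= initialized_upto)
        return 0;

    /* Serve up to the end of the initialized range, never beyond it. */
    available = initialized_upto - startptr;
    return (count < available) ? count : (size_t) available;
}
```

Under these assumptions, a 6-page read starting at LSN 80 returns 0 bytes even though 5 of the pages are buffered, while a 6-page read starting at LSN 150 returns 4 pages (40 bytes) and stops at the end of the initialized range.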

WALBufMappingLock protects both xlblocks and WAL buffer pages [1][2].
I'm not sure how using the memory barrier, not WALBufMappingLock,
prevents writers from replacing WAL buffer pages while readers are
reading the pages.

It doesn't *prevent* that case, but it does *detect* that case. We
don't want to prevent writers from replacing WAL buffers, because that
would mean we are slowing down the critical WAL writing path.

Let me explain the potential problem cases, and how the barriers
prevent them:

Potential problem 1: the page is not yet resident in the cache at the
time the memcpy begins. In this case, the first read barrier would
ensure that the page is also not yet resident at the time xlblocks[idx]
is read into endptr1, and we'd break out of the loop.

Potential problem 2: the page is evicted before the memcpy finishes. In
this case, the second read barrier would ensure that the page was also
evicted before xlblocks[idx] is read into endptr2, and again we'd
detect the problem and break out of the loop.

Understood.

I assume here that, if xlblocks[idx] holds the endPtr of the desired
page, all of the bytes for that page are resident at that moment. I
don't think that's true right now: AdvanceXLInsertBuffer() zeroes the
old page before updating xlblocks[nextidx].

Right.

I think it needs something like:

pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], InvalidXLogRecPtr);
pg_write_barrier();

before the MemSet.

I think it works. First, xlblocks needs to be turned into an array of
64-bit atomics and then the above change applied. With this, everyone who
reads xlblocks, with or without WALBufMappingLock, also needs to check
whether xlblocks[idx] is InvalidXLogRecPtr and take appropriate action.
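
A minimal sketch of the writer side being proposed here, assuming C11 atomics in place of pg_atomic_write_u64() and pg_write_barrier(). The names page_end[], pages[], INVALID_END and replace_page() are hypothetical stand-ins, and the "header" is reduced to a single start address; the real AdvanceXLInsertBuffer() does considerably more.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ		64			/* small stand-in for XLOG_BLCKSZ */
#define NPAGES		4			/* stand-in for the number of WAL buffers */
#define INVALID_END 0			/* stand-in for InvalidXLogRecPtr */

/* Hypothetical stand-ins for XLogCtl->xlblocks[] and XLogCtl->pages. */
static _Atomic uint64_t page_end[NPAGES];
static char pages[NPAGES][PAGE_SZ];

/* Recycle buffer idx so that it holds the page ending at new_end. */
static void
replace_page(int idx, uint64_t new_end)
{
	uint64_t	page_start = new_end - PAGE_SZ;

	/* 1. Unpublish the old page before touching its contents. */
	atomic_store_explicit(&page_end[idx], INVALID_END, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* write barrier */

	/* 2. Re-zero the buffer and write a (much simplified) page header. */
	memset(pages[idx], 0, PAGE_SZ);
	memcpy(pages[idx], &page_start, sizeof(page_start));

	/* 3. Publish the new page only after its contents are in place. */
	atomic_thread_fence(memory_order_release);	/* write barrier */
	atomic_store_explicit(&page_end[idx], new_end, memory_order_relaxed);
}
```

With this ordering, a lock-free reader that sees the expected end pointer both before and after its copy never observes the zeroing of step 2.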

I'm sure you have seen the following. It looks like I'm leaning
towards the claim that it's safe to read xlblocks without
WALBufMappingLock. I'll put up a patch for these changes separately.

/*
* Make sure the initialization of the page becomes visible to others
* before the xlblocks update. GetXLogBuffer() reads xlblocks without
* holding a lock.
*/
pg_write_barrier();

*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;

I think the 3 things that help read from WAL buffers without
WALBufMappingLock are: 1) a couple of read barriers in
XLogReadFromBuffers(), 2) atomically initializing xlblocks[idx] to
InvalidXLogRecPtr plus a write barrier in AdvanceXLInsertBuffer(), 3)
the following sanity check to see if the read page is valid in
XLogReadFromBuffers(). If it sounds sensible, I'll work towards coding
it up. Thoughts?

+ , we
+         * need to ensure that we are not reading a page that just got
+         * initialized. For this, we look at the needed page header.
+         */
+        phdr = (XLogPageHeader) page;
+
+        /* Return, if WAL buffer page doesn't look valid. */
+        if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+              phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+              phdr->xlp_tli == tli))
+            break;
+
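
For readers who want to try the sanity check in isolation, here is a small standalone sketch. PageHeader is a simplified stand-in that carries only the three fields the check looks at (xlp_magic, xlp_tli, xlp_pageaddr), PAGE_MAGIC is a placeholder value since the real XLOG_PAGE_MAGIC changes between releases, and page_looks_valid() is a hypothetical helper name.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SZ		8192		/* stand-in for XLOG_BLCKSZ */
#define PAGE_MAGIC	0xD113		/* placeholder; real XLOG_PAGE_MAGIC varies */

/* Simplified stand-in for the XLogPageHeaderData fields used by the check. */
typedef struct
{
	uint16_t	magic;			/* xlp_magic */
	uint32_t	tli;			/* xlp_tli */
	uint64_t	pageaddr;		/* xlp_pageaddr */
} PageHeader;

/* Does hdr describe the page that contains ptr on timeline tli? */
static bool
page_looks_valid(const PageHeader *hdr, uint64_t ptr, uint32_t tli)
{
	uint64_t	page_start = ptr - (ptr % PAGE_SZ);

	return hdr->magic == PAGE_MAGIC &&
		hdr->pageaddr == page_start &&
		hdr->tli == tli;
}
```

A freshly zeroed page fails the magic comparison, which is the "just got initialized" case the patch comment refers to.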

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#47Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#46)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, 2023-11-03 at 20:23 +0530, Bharath Rupireddy wrote:

OldestInitializedPage is introduced in v14-0001 patch. Please have a
look.

I don't see why that's necessary if we move to the algorithm I
suggested below that doesn't require a lock.

Okay. Current patch doesn't support this [partial hit of newer pages]
case.

OK, no need to support it until you see a reason.

I think it needs something like:

  pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], InvalidXLogRecPtr);
  pg_write_barrier();

before the MemSet.

I think it works. First, xlblocks needs to be turned to an array of
64-bit atomics and then the above change.

Does anyone see a reason we shouldn't move to atomics here?

        pg_write_barrier();

        *((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;

I am confused why the "volatile" is required on that line (not from
your patch). I sent a separate message about that:

/messages/by-id/784f72ac09061fe5eaa5335cc347340c367c73ac.camel@j-davis.com

I think the 3 things that help read from WAL buffers without
WALBufMappingLock are: 1) a couple of read barriers in
XLogReadFromBuffers(), 2) atomically initializing xlblocks[idx] to
InvalidXLogRecPtr plus a write barrier in AdvanceXLInsertBuffer(), 3)
the following sanity check to see if the read page is valid in
XLogReadFromBuffers(). If it sounds sensible, I'll work towards coding
it up. Thoughts?

I like it. I think it will ultimately be a fairly simple loop. And by
moving to atomics, we won't need the delicate comment in
GetXLogBuffer().

Regards,
Jeff Davis

#48Andres Freund
andres@anarazel.de
In reply to: Bharath Rupireddy (#46)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hi,

On 2023-11-03 20:23:30 +0530, Bharath Rupireddy wrote:

On Fri, Nov 3, 2023 at 12:35 PM Jeff Davis <pgsql@j-davis.com> wrote:

On Thu, 2023-11-02 at 22:38 +0530, Bharath Rupireddy wrote:

I suppose the question is: should reading from the WAL buffers be an
intentional thing that specific callers do explicitly? Or is it an
optimization that should be hidden from the caller?

I tend toward the former, at least for now.

Yes, it's an optimization that must be hidden from the caller.

As I said, I tend toward the opposite: that specific callers should
read from the buffers explicitly in the cases where it makes sense.

How about adding a bool flag (read_from_wal_buffers) to
XLogReaderState so that the callers can set it if they want this
facility via XLogReaderAllocate()?

That seems wrong architecturally - why should xlogreader itself know about any
of this? What would it mean in frontend code if read_from_wal_buffers were
set? IMO this is something that should happen purely within the read function.
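
One way to picture "purely within the read function" is a read routine that tries the buffers first and falls back to the file for whatever is left, so that neither the reader state nor the caller carries a flag. This is only a sketch of that shape: wal_read_fn and read_wal() are hypothetical names, not part of xlogreader, and the two sources are passed in as plain function pointers.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical signature: read up to count bytes at start, return bytes read. */
typedef size_t (*wal_read_fn) (uint64_t start, size_t count, char *buf);

/*
 * Try the in-memory source first.  It may return fewer bytes than requested
 * (a prefix hit, or nothing at all); the remainder comes from the file.
 */
static size_t
read_wal(uint64_t start, size_t count, char *buf,
		 wal_read_fn from_buffers, wal_read_fn from_file)
{
	size_t		nread = from_buffers(start, count, buf);

	if (nread < count)
		nread += from_file(start + nread, count - nread, buf + nread);

	return nread;
}
```

Frontend code would simply never supply a buffer source, which sidesteps the question of what such a flag would mean there.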

Greetings,

Andres Freund

#49Andres Freund
andres@anarazel.de
In reply to: Bharath Rupireddy (#44)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hi,

On 2023-11-02 22:38:38 +0530, Bharath Rupireddy wrote:

From 5b5469d7dcd8e98bfcaf14227e67356bbc1f5fe8 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 2 Nov 2023 15:10:51 +0000
Subject: [PATCH v14] Track oldest initialized WAL buffer page

---
src/backend/access/transam/xlog.c | 170 ++++++++++++++++++++++++++++++
src/include/access/xlog.h | 1 +
2 files changed, 171 insertions(+)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b541be8eec..fdf2ef310b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -504,6 +504,45 @@ typedef struct XLogCtlData
XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
int			XLogCacheBlck;	/* highest allocated xlog buffer index */
+	/*
+	 * Start address of oldest initialized page in XLog buffers.
+	 *
+	 * We mainly track oldest initialized page explicitly to quickly tell if a
+	 * given WAL record is available in XLog buffers. It also can be used for
+	 * other purposes, see notes below.
+	 *
+	 * OldestInitializedPage gives XLog buffers following properties:
+	 *
+	 * 1) At any given point of time, pages in XLog buffers array are sorted
+	 * in an ascending order from OldestInitializedPage till InitializedUpTo.
+	 * Note that we verify this property for assert-only builds, see
+	 * IsXLogBuffersArraySorted() for more details.

This is true - but also not, if you look at it a bit too literally. The
buffers in xlblocks itself obviously aren't ordered when wrapping around
between XLogRecPtrToBufIdx(OldestInitializedPage) and
XLogRecPtrToBufIdx(InitializedUpTo).
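
To make the wrap-around point concrete, here is a small sketch that checks ordering in logical order, starting at the oldest buffer's index, rather than in physical array order. sorted_from_oldest() is a hypothetical helper (not the patch's IsXLogBuffersArraySorted()), and it assumes every buffer holds a valid page.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Walk the circular array starting at oldest_idx and verify that page start
 * addresses strictly increase along the way.
 */
static bool
sorted_from_oldest(const uint64_t *page_start, int nbuffers, int oldest_idx)
{
	int			idx = oldest_idx;

	for (int i = 1; i < nbuffers; i++)
	{
		int			next = (idx + 1 == nbuffers) ? 0 : idx + 1;

		if (page_start[next] <= page_start[idx])
			return false;
		idx = next;
	}

	return true;
}
```

For example, an array holding {40, 50, 20, 30} is not sorted physically, but it is sorted when traversed from index 2.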

+	 * 2) OldestInitializedPage is monotonically increasing (by virtue of how
+	 * postgres generates WAL records), that is, its value never decreases.
+	 * This property lets someone read its value without a lock. There's no
+	 * problem even if its value is slightly stale i.e. concurrently being
+	 * updated. One can still use it for finding if a given WAL record is
+	 * available in XLog buffers. At worst, one might get false positives
+	 * (i.e. OldestInitializedPage may tell that the WAL record is available
+	 * in XLog buffers, but when one actually looks at it, it isn't really
+	 * available). This is more efficient and performant than acquiring a lock
+	 * for reading. Note that we may not need a lock to read
+	 * OldestInitializedPage but we need to update it holding
+	 * WALBufMappingLock.

I'd
s/may not need/do not need/

But perhaps rephrase it a bit more, to something like:

To update OldestInitializedPage, WALBufMappingLock needs to be held
exclusively, for reading no lock is required.

+	 *
+	 * 3) One can start traversing XLog buffers from OldestInitializedPage
+	 * till InitializedUpTo to list out all valid WAL records and stats, and
+	 * expose them via SQL-callable functions to users.
+	 *
+	 * 4) XLog buffers array is inherently organized as a circular, sorted and
+	 * rotated array with OldestInitializedPage as pivot with the property
+	 * where LSN of previous buffer page (if valid) is greater than
+	 * OldestInitializedPage and LSN of next buffer page (if valid) is greater
+	 * than OldestInitializedPage.
+	 */
+	XLogRecPtr	OldestInitializedPage;

It seems a bit odd to name an LSN-containing variable *Page...

/*
* InsertTimeLineID is the timeline into which new WAL is being inserted
* and flushed. It is zero during recovery, and does not change once set.
@@ -590,6 +629,10 @@ static ControlFileData *ControlFile = NULL;
#define NextBufIdx(idx) \
(((idx) == XLogCtl->XLogCacheBlck) ? 0 : ((idx) + 1))

+/* Macro to retreat to previous buffer index. */
+#define PreviousBufIdx(idx)		\
+		(((idx) == 0) ? XLogCtl->XLogCacheBlck : ((idx) - 1))

I think it might be worth making these inline functions and adding
assertions that idx is not bigger than XLogCtl->XLogCacheBlck?
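
A sketch of what that could look like, written as standalone C with assert() in place of PostgreSQL's Assert(). cache_blck is a hypothetical stand-in for XLogCtl->XLogCacheBlck (the highest allocated buffer index), and the function names are illustrative only.

```c
#include <assert.h>

/* Hypothetical stand-in for XLogCtl->XLogCacheBlck. */
static int	cache_blck = 7;

/* Advance to the next buffer index, wrapping around at the end. */
static inline int
next_buf_idx(int idx)
{
	assert(idx >= 0 && idx <= cache_blck);
	return (idx == cache_blck) ? 0 : idx + 1;
}

/* Retreat to the previous buffer index, wrapping around at the start. */
static inline int
previous_buf_idx(int idx)
{
	assert(idx >= 0 && idx <= cache_blck);
	return (idx == 0) ? cache_blck : idx - 1;
}
```

Unlike the macros, these evaluate idx exactly once and catch an out-of-range index in assert-enabled builds.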

/*
* XLogRecPtrToBufIdx returns the index of the WAL buffer that holds, or
* would hold if it was in cache, the page containing 'recptr'.
@@ -708,6 +751,10 @@ static void WALInsertLockAcquireExclusive(void);
static void WALInsertLockRelease(void);
static void WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt);

+#ifdef USE_ASSERT_CHECKING
+static bool IsXLogBuffersArraySorted(void);
+#endif
+
/*
* Insert an XLOG record represented by an already-constructed chain of data
* chunks.  This is a low-level routine; to construct the WAL record header
@@ -1992,6 +2039,52 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
XLogCtl->InitializedUpTo = NewPageEndPtr;
npages++;
+
+		/*
+		 * Try updating oldest initialized XLog buffer page.
+		 *
+		 * Update it if we are initializing an XLog buffer page for the first
+		 * time or if XLog buffers are full and we are wrapping around.
+		 */
+		if (XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) ||
+			XLogRecPtrToBufIdx(XLogCtl->OldestInitializedPage) == nextidx)
+		{
+			Assert(XLogCtl->OldestInitializedPage < NewPageBeginPtr);
+
+			XLogCtl->OldestInitializedPage = NewPageBeginPtr;
+		}

Wait, isn't this too late? At this point the buffer can already be used by
GetXLogBuffer(). I think this needs to happen at the latest just before
*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;

Why is it legal to get here with XLogCtl->OldestInitializedPage being invalid?

+
+/*
+ * Returns whether or not a given WAL record is available in XLog buffers.
+ *
+ * Note that we don't read OldestInitializedPage under a lock, see description
+ * near its definition in xlog.c for more details.
+ *
+ * Note that caller needs to pass in an LSN known to the server, not a future
+ * or unwritten or unflushed LSN.
+ */
+bool
+IsWALRecordAvailableInXLogBuffers(XLogRecPtr lsn)
+{
+	if (!XLogRecPtrIsInvalid(lsn) &&
+		!XLogRecPtrIsInvalid(XLogCtl->OldestInitializedPage) &&
+		lsn >= XLogCtl->OldestInitializedPage &&
+		lsn < XLogCtl->InitializedUpTo)
+	{
+		return true;
+	}
+
+	return false;
+}
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..35235010e6 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -261,6 +261,7 @@ extern void ReachedEndOfBackup(XLogRecPtr EndRecPtr, TimeLineID tli);
extern void SetInstallXLogFileSegmentActive(void);
extern bool IsInstallXLogFileSegmentActive(void);
extern void XLogShutdownWalRcv(void);
+extern bool IsWALRecordAvailableInXLogBuffers(XLogRecPtr lsn);

/*
* Routines to start, stop, and get status of a base backup.
--
2.34.1

From db027d8f1dcb53ebceef0135287f120acf67cc21 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 2 Nov 2023 15:36:11 +0000
Subject: [PATCH v14] Allow WAL reading from WAL buffers

This commit gives WALRead() the capability to read WAL from WAL
buffers when possible. When requested WAL isn't available in WAL
buffers, the WAL is read from the WAL file as usual. It relies on
WALBufMappingLock so that no one replaces the WAL buffer page that
we're reading from. It skips reading from WAL buffers if
WALBufMappingLock can't be acquired immediately. In other words,
it doesn't wait for WALBufMappingLock to be available. This helps
reduce the contention on WALBufMappingLock.

This commit benefits the callers of WALRead(), that are walsenders
and pg_walinspect. They can now avoid reading WAL from the WAL
file (possibly avoiding disk IO). Tests show that the WAL buffers
hit ratio stood at 95% for 1 primary, 1 sync standby, 1 async
standby, with pgbench --scale=300 --client=32 --time=900. In other
words, the walsenders avoided 95% of the time reading from the
file/avoided pread system calls:
/messages/by-id/CALj2ACXKKK=wbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54+Na=Q@mail.gmail.com

This commit also benefits when direct IO is enabled for WAL.
Reading WAL from WAL buffers puts back the performance close to
that of without direct IO for WAL:
/messages/by-id/CALj2ACV6rS+7iZx5+oAvyXJaN4AG-djAQeM1mrM=YSDkVrUs7g@mail.gmail.com

This commit also paves the way for the following features in
future:
- Improves synchronous replication performance by replicating
directly from WAL buffers.
- An opt-in way for the walreceivers to receive unflushed WAL.
More details here:
/messages/by-id/20231011224353.cl7c2s222dw3de4j@awork3.anarazel.de

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar, Andres Freund
Reviewed-by: Nathan Bossart, Kuntal Ghosh
Discussion: /messages/by-id/CALj2ACXKKK=wbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54+Na=Q@mail.gmail.com
---
src/backend/access/transam/xlog.c | 205 ++++++++++++++++++++++++
src/backend/access/transam/xlogreader.c | 41 ++++-
src/include/access/xlog.h | 6 +
3 files changed, 250 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fdf2ef310b..ff5dccaaa7 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1753,6 +1753,211 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
return cachedPos + ptr % XLOG_BLCKSZ;
}
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and return total read bytes.
+ *
+ * This function returns quickly in the following cases:
+ * - When passed-in timeline is different than server's current insertion
+ * timeline as WAL is always inserted into WAL buffers on insertion timeline.
+ * - When server is in recovery as WAL buffers aren't currently used in
+ * recovery.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. Caller must be aware of
+ * this and deal with it.
+ *
+ * Note that this function is not available for frontend code as WAL buffers is
+ * an internal mechanism to the server.
+ */
+Size
+XLogReadFromBuffers(XLogReaderState *state,
+					XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf)
+{
+	XLogRecPtr	ptr;
+	XLogRecPtr	cur_lsn;
+	Size		nbytes;
+	Size		ntotal;
+	Size		nbatch;
+	char	   *batchstart;
+
+	if (RecoveryInProgress())
+		return 0;
+	if (tli != GetWALInsertionTimeLine())
+		return 0;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+
+	cur_lsn = GetFlushRecPtr(NULL);
+	if (unlikely(startptr > cur_lsn))
+		elog(ERROR, "WAL start LSN %X/%X specified for reading from WAL buffers must be less than current database system WAL LSN %X/%X",
+			 LSN_FORMAT_ARGS(startptr), LSN_FORMAT_ARGS(cur_lsn));

Hm, why does this check belong here? For some tools it might be legitimate to
read the WAL before it was fully flushed.

+	/*
+	 * Holding WALBufMappingLock ensures inserters don't overwrite this value
+	 * while we are reading it. We try to acquire it in shared mode so that
+	 * the concurrent WAL readers are also allowed. We try to do as less work
+	 * as possible while holding the lock to not impact concurrent WAL writers
+	 * much. We quickly exit to not cause any contention, if the lock isn't
+	 * immediately available.
+	 */
+	if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+		return 0;

That seems problematic - that lock is often heavily contended. We could
instead check IsWALRecordAvailableInXLogBuffers() once before reading the
page, then read the page contents *without* holding a lock, and then check
IsWALRecordAvailableInXLogBuffers() again - if the page was replaced in the
interim we read bogus data, but that's just a bit of wasted effort.

+	ptr = startptr;
+	nbytes = count;				/* Total bytes requested to be read by caller. */
+	ntotal = 0;					/* Total bytes read. */
+	nbatch = 0;					/* Bytes to be read in single batch. */
+	batchstart = NULL;			/* Location to read from for single batch. */

What does "batch" mean?

+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int			idx;
+		char	   *page;
+		char	   *data;
+		XLogPageHeader phdr;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+		endptr = XLogCtl->xlblocks[idx];
+
+		/* Requested WAL isn't available in WAL buffers. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/*
+		 * The fact that we acquire WALBufMappingLock while reading the WAL
+		 * buffer page itself guarantees that no one else initializes it or
+		 * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
+		 * need to ensure that we are not reading a page that just got
+		 * initialized. For this, we look at the needed page header.
+		 */
+		phdr = (XLogPageHeader) page;
+
+		/* Return, if WAL buffer page doesn't look valid. */
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+			break;

I don't think this code should ever encounter a page where this is not the
case? We particularly shouldn't do so silently, seems that could hide all
kinds of problems.

+		/*
+		 * Note that we don't perform all page header checks here to avoid
+		 * extra work in production builds; callers will anyway do those
+		 * checks extensively. However, in an assert-enabled build, we perform
+		 * all the checks here and raise an error if failed.
+		 */

Why?

+		/* Count what is wanted, not the whole page. */
+		if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			nbatch += nbytes;
+			ntotal += nbytes;
+			nbytes = 0;
+		}
+		else
+		{
+			Size		navailable;
+
+			/*
+			 * All the bytes are not in one page. Deduce available bytes on
+			 * the current page, count them and continue to look for remaining
+			 * bytes.
+			 */

s/Deduce/Deduct/? Perhaps better: subtract?

Greetings,

Andres Freund

#50Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Jeff Davis (#47)
3 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Sat, Nov 4, 2023 at 1:17 AM Jeff Davis <pgsql@j-davis.com> wrote:

I think it needs something like:

pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], InvalidXLogRecPtr);
pg_write_barrier();

before the MemSet.

I think it works. First, xlblocks needs to be turned to an array of
64-bit atomics and then the above change.

Does anyone see a reason we shouldn't move to atomics here?

pg_write_barrier();

*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;

I am confused why the "volatile" is required on that line (not from
your patch). I sent a separate message about that:

/messages/by-id/784f72ac09061fe5eaa5335cc347340c367c73ac.camel@j-davis.com

I think the 3 things that help read from WAL buffers without
WALBufMappingLock are: 1) a couple of read barriers in
XLogReadFromBuffers(), 2) atomically initializing xlblocks[idx] to
InvalidXLogRecPtr plus a write barrier in AdvanceXLInsertBuffer(), 3)
the following sanity check to see if the read page is valid in
XLogReadFromBuffers(). If it sounds sensible, I'll work towards coding
it up. Thoughts?

I like it. I think it will ultimately be a fairly simple loop. And by
moving to atomics, we won't need the delicate comment in
GetXLogBuffer().

I'm attaching the v15 patch set implementing the above ideas. Please
have a look.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v15-0001-Use-64-bit-atomics-for-xlblocks-array-elements.patch (application/octet-stream)
From e3dd35828be4dd665cbfbb6ca153fba0011aa0a8 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 4 Nov 2023 13:48:51 +0000
Subject: [PATCH v15] Use 64-bit atomics for xlblocks array elements

In AdvanceXLInsertBuffer(), xlblocks value of a WAL buffer page is
updated only at the end after the page is initialized with all
zeros. A problem with this approach is that anyone reading
xlblocks and WAL buffer page without holding WALBufMappingLock
will see the wrong page contents if the read happens before the
xlblocks is marked with a new entry in AdvanceXLInsertBuffer() at
the end.

To fix this issue, xlblocks is made to use 64-bit atomics instead
of XLogRecPtr and the xlblocks value is marked with
InvalidXLogRecPtr just before the page initialization begins. Once
the page initialization finishes, only then the actual value of
the newly initialized page is marked in xlblocks. A write barrier
is placed in between xlblocks update with InvalidXLogRecPtr and
the page initialization to not cause any memory ordering problems.

With this fix, one can read xlblocks and WAL buffer page without
WALBufMappingLock in the following manner:

  endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);

  /* Requested WAL isn't available in WAL buffers. */
  if (expectedEndPtr != endptr)
      break;

  page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
  data = page + ptr % XLOG_BLCKSZ;
  ...
  pg_read_barrier();
  ...
  memcpy(buf, data, bytes_to_read);
  ...
  pg_read_barrier();

  /* Recheck if the page still exists in WAL buffers. */
  endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);

  /* Return if the page got initialized while we were reading it */
  if (expectedEndPtr != endptr)
      break;
---
 src/backend/access/transam/xlog.c | 65 +++++++++++++++++++++----------
 1 file changed, 45 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b541be8eec..1a2ad1a475 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -501,7 +501,7 @@ typedef struct XLogCtlData
 	 * WALBufMappingLock.
 	 */
 	char	   *pages;			/* buffers for unwritten XLOG pages */
-	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
+	pg_atomic_uint64 *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */
 	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
 
 	/*
@@ -1634,20 +1634,19 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	 * out to disk and evicted, and the caller is responsible for making sure
 	 * that doesn't happen.
 	 *
-	 * However, we don't hold a lock while we read the value. If someone has
-	 * just initialized the page, it's possible that we get a "torn read" of
-	 * the XLogRecPtr if 64-bit fetches are not atomic on this platform. In
-	 * that case we will see a bogus value. That's ok, we'll grab the mapping
-	 * lock (in AdvanceXLInsertBuffer) and retry if we see anything else than
-	 * the page we're looking for. But it means that when we do this unlocked
-	 * read, we might see a value that appears to be ahead of the page we're
-	 * looking for. Don't PANIC on that, until we've verified the value while
-	 * holding the lock.
+	 * However, we don't hold a lock while we read the value. If someone is
+	 * just about to initialize or has just initialized the page, it's
+	 * possible that we get InvalidXLogRecPtr. That's ok, we'll grab the
+	 * mapping lock (in AdvanceXLInsertBuffer) and retry if we see anything
+	 * else than the page we're looking for. But it means that when we do this
+	 * unlocked read, we might see a value that appears to be ahead of the
+	 * page we're looking for. Don't PANIC on that, until we've verified the
+	 * value while holding the lock.
 	 */
 	expectedEndPtr = ptr;
 	expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
 
-	endptr = XLogCtl->xlblocks[idx];
+	endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
 	if (expectedEndPtr != endptr)
 	{
 		XLogRecPtr	initializedUpto;
@@ -1678,7 +1677,7 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 		WALInsertLockUpdateInsertingAt(initializedUpto);
 
 		AdvanceXLInsertBuffer(ptr, tli, false);
-		endptr = XLogCtl->xlblocks[idx];
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
 
 		if (expectedEndPtr != endptr)
 			elog(PANIC, "could not find WAL buffer for %X/%X",
@@ -1865,7 +1864,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 		 * be zero if the buffer hasn't been used yet).  Fall through if it's
 		 * already written out.
 		 */
-		OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
+		OldPageRqstPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]);
 		if (LogwrtResult.Write < OldPageRqstPtr)
 		{
 			/*
@@ -1934,6 +1933,15 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 
 		NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
 
+		/*
+		 * Make sure to mark the xlblocks with InvalidXLogRecPtr before the
+		 * initialization of the page begins so that others, reading xlblocks
+		 * without holding a lock, will know that the page initialization has
+		 * just begun.
+		 */
+		pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], InvalidXLogRecPtr);
+		pg_write_barrier();
+
 		/*
 		 * Be sure to re-zero the buffer so that bytes beyond what we've
 		 * written will look like zeroes and not valid XLOG records...
@@ -1987,8 +1995,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 		 */
 		pg_write_barrier();
 
-		*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
-
+		pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr);
 		XLogCtl->InitializedUpTo = NewPageEndPtr;
 
 		npages++;
@@ -2187,7 +2194,22 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 		 * if we're passed a bogus WriteRqst.Write that is past the end of the
 		 * last page that's been initialized by AdvanceXLInsertBuffer.
 		 */
-		XLogRecPtr	EndPtr = XLogCtl->xlblocks[curridx];
+		XLogRecPtr	EndPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[curridx]);
+
+		/*
+		 * xlblocks value can be InvalidXLogRecPtr before the new WAL buffer
+		 * page gets initialized in AdvanceXLInsertBuffer. In such a case
+		 * re-read the xlblocks value under the lock to ensure the correct
+		 * value is read.
+		 */
+		if (unlikely(XLogRecPtrIsInvalid(EndPtr)))
+		{
+			LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+			EndPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[curridx]);
+			LWLockRelease(WALBufMappingLock);
+		}
+
+		Assert(!XLogRecPtrIsInvalid(EndPtr));
 
 		if (LogwrtResult.Write >= EndPtr)
 			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
@@ -4675,10 +4697,13 @@ XLOGShmemInit(void)
 	 * needed here.
 	 */
 	allocptr = ((char *) XLogCtl) + sizeof(XLogCtlData);
-	XLogCtl->xlblocks = (XLogRecPtr *) allocptr;
-	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
-	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
+	XLogCtl->xlblocks = (pg_atomic_uint64 *) allocptr;
+	allocptr += sizeof(pg_atomic_uint64) * XLOGbuffers;
 
+	for (i = 0; i < XLOGbuffers; i++)
+	{
+		pg_atomic_init_u64(&XLogCtl->xlblocks[i], InvalidXLogRecPtr);
+	}
 
 	/* WAL insertion locks. Ensure they're aligned to the full padded size */
 	allocptr += sizeof(WALInsertLockPadded) -
@@ -5715,7 +5740,7 @@ StartupXLOG(void)
 		memcpy(page, endOfRecoveryInfo->lastPage, len);
 		memset(page + len, 0, XLOG_BLCKSZ - len);
 
-		XLogCtl->xlblocks[firstIdx] = endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ;
+		pg_atomic_write_u64(&XLogCtl->xlblocks[firstIdx], endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
 		XLogCtl->InitializedUpTo = endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ;
 	}
 	else
-- 
2.34.1

v15-0002-Allow-WAL-reading-from-WAL-buffers.patch (application/octet-stream)
From 35589fbefd488b9070c891f5dcfc7caab1b0a980 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 4 Nov 2023 14:21:43 +0000
Subject: [PATCH v15] Allow WAL reading from WAL buffers

This commit gives WALRead() the capability to read WAL from WAL
buffers when possible. When requested WAL isn't available in WAL
buffers, the WAL is read from the WAL file as usual.

This commit benefits the callers of WALRead(), that are walsenders
and pg_walinspect. They can now avoid reading WAL from the WAL
file (possibly avoiding disk IO). Tests show that the WAL buffers
hit ratio stood at 95% for 1 primary, 1 sync standby, 1 async
standby, with pgbench --scale=300 --client=32 --time=900. In other
words, the walsenders avoided 95% of the time reading from the
file/avoided pread system calls:
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com

This commit also benefits when direct IO is enabled for WAL.
Reading WAL from WAL buffers puts back the performance close to
that of without direct IO for WAL:
https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com

This commit paves the way for the following features in future:
- Improves synchronous replication performance by replicating
directly from WAL buffers.
- An opt-in way for the walreceivers to receive unflushed WAL.
More details here:
https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar, Andres Freund
Reviewed-by: Nathan Bossart, Kuntal Ghosh
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/backend/access/transam/xlog.c       | 170 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  41 +++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 215 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 1a2ad1a475..1df74d8f48 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1705,6 +1705,176 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and return total read bytes.
+ *
+ * This function returns quickly in the following cases:
+ * - When passed-in timeline is different than server's current insertion
+ * timeline as WAL is always inserted into WAL buffers on insertion timeline.
+ *
+ * - When server is in recovery as WAL buffers aren't currently used in
+ * recovery.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. Caller must be aware of
+ * this and deal with it.
+ *
+ * Note that function reads WAL from WAL buffers without holding any lock.
+ * First it reads xlblocks atomically for checking page existence, then it
+ * reads the page contents, validates. Finally, it rechecks the page existence
+ * by rereading xlblocks, if the read page is replaced, it discards read page
+ * and returns.
+ *
+ * Note that this function is not available for frontend code as WAL buffers is
+ * an internal mechanism to the server.
+ */
+Size
+XLogReadFromBuffers(XLogReaderState *state,
+					XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf)
+{
+	XLogRecPtr	ptr;
+	XLogRecPtr	cur_lsn;
+	Size		nbytes;
+	Size		ntotal;
+	char	   *dst;
+
+	if (RecoveryInProgress())
+		return 0;
+
+	if (tli != GetWALInsertionTimeLine())
+		return 0;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+
+	cur_lsn = GetFlushRecPtr(NULL);
+	if (unlikely(startptr > cur_lsn))
+		elog(ERROR, "WAL start LSN %X/%X specified for reading from WAL buffers must be less than current database system WAL LSN %X/%X",
+			 LSN_FORMAT_ARGS(startptr), LSN_FORMAT_ARGS(cur_lsn));
+
+	ptr = startptr;
+	nbytes = count;				/* Total bytes requested to be read by caller. */
+	ntotal = 0;					/* Total bytes read. */
+	dst = buf;
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int			idx;
+		char	   *page;
+		char	   *data;
+		XLogPageHeader phdr;
+		Size		nread;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+
+		/* Requested WAL isn't available in WAL buffers. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/*
+		 * Make sure to not read a page that just got initialized. Return, if
+		 * WAL buffer page doesn't look valid.
+		 */
+		phdr = (XLogPageHeader) page;
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+			break;
+
+		/*
+		 * Note that we don't perform all page header checks here to avoid
+		 * extra work in production builds; callers will anyway do those
+		 * checks extensively. However, in an assert-enabled build, we perform
+		 * all the checks here and raise an error if failed.
+		 */
+#ifdef USE_ASSERT_CHECKING
+		if (unlikely(state != NULL &&
+					 !XLogReaderValidatePageHeader(state, (endptr - XLOG_BLCKSZ),
+												   (char *) phdr)))
+		{
+			if (state->errormsg_buf[0])
+				ereport(ERROR,
+						(errcode(ERRCODE_INTERNAL_ERROR),
+						 errmsg_internal("%s", state->errormsg_buf)));
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INTERNAL_ERROR),
+						 errmsg_internal("could not read WAL from WAL buffers")));
+		}
+#endif
+
+		/* Make sure we don't read the page contents before xlblocks. */
+		pg_read_barrier();
+
+		nread = 0;
+
+		/* Read what is wanted, not the whole page. */
+		if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			nread = nbytes;
+		}
+		else
+		{
+			/*
+			 * All the bytes are not in one page. Read available bytes on the
+			 * current page, copy them over to output buffer and continue to
+			 * read remaining bytes.
+			 */
+			nread = XLOG_BLCKSZ - (data - page);
+			Assert(nread > 0 && nread <= nbytes);
+		}
+
+		Assert(nread > 0);
+		memcpy(dst, data, nread);
+
+		/* Make sure we don't read xlblocks before the page contents. */
+		pg_read_barrier();
+
+		/* Recheck if the read page still exists in WAL buffers. */
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+
+		/* Return if the page got initialized while we were reading it. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		dst += nread;
+		ptr += nread;
+		ntotal += nread;
+		nbytes -= nread;
+	}
+
+	/* We never read more than what the caller has asked for. */
+	Assert(ntotal <= count);
+
+#ifdef WAL_DEBUG
+	if (XLOG_DEBUG)
+		ereport(DEBUG1,
+				(errmsg_internal("read %zu bytes out of %zu bytes from WAL buffers for given start LSN %X/%X, timeline ID %u",
+								 ntotal, count, LSN_FORMAT_ARGS(startptr), tli)));
+#endif
+
+	return ntotal;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e0baa86bd3..5820c5eedc 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1473,8 +1473,10 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers. When
+ * requested WAL isn't available in WAL buffers, the WAL is read from the WAL
+ * file as usual. The callers may avoid reading WAL from the WAL file thus
+ * reducing read system calls or even disk IOs.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1484,6 +1486,41 @@ WALRead(XLogReaderState *state,
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	Size		nread;
+#endif
+
+#ifndef FRONTEND
+
+	/*
+	 * Try reading WAL from WAL buffers. Frontend code has no idea of WAL
+	 * buffers.
+	 */
+	nread = XLogReadFromBuffers(state, startptr, tli, count, buf);
+
+	Assert(nread >= 0);
+
+	/*
+	 * Check if we have read fully (hit), partially (partial hit) or nothing
+	 * (miss) from WAL buffers. If we have read either partially or nothing,
+	 * then continue to read the remaining bytes the usual way, that is, read
+	 * from WAL file.
+	 *
+	 * XXX: It might be worth to expose WAL buffer read stats.
+	 */
+	if (count == nread)
+		return true;			/* Buffer hit, so return. */
+	else if (count > nread)
+	{
+		/*
+		 * Buffer partial hit, so reset the state to count the read bytes and
+		 * continue.
+		 */
+		buf += nread;
+		startptr += nread;
+		count -= nread;
+	}
+#endif
 
 	p = buf;
 	recptr = startptr;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..18167c36b4 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -251,6 +251,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern Size XLogReadFromBuffers(struct XLogReaderState *state,
+								XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

v15-0003-Add-test-module-for-verifying-WAL-read-from-WAL-.patch (application/octet-stream)
From b27001d76f77d8765b675638b939689ad4930185 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 4 Nov 2023 13:52:17 +0000
Subject: [PATCH v15] Add test module for verifying WAL read from WAL buffers

This commit adds a test module to verify WAL read from WAL
buffers.

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 ++++++++
 .../test_wal_read_from_buffers/meson.build    | 33 +++++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 58 +++++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 +++++
 .../test_wal_read_from_buffers.c              | 37 ++++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 177 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index e81873cb5a..f5aedb95a4 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index fcd643f6f1..86fd74ab50 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,5 +28,6 @@ subdir('test_regex')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7472494501
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40bd5dcd33
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_wal_read_from_buffers
+
+test_install_data += files(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..5d94f8a960
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,58 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers;');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn');});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+my ($psql_ret, $psql_stdout, $psql_stderr) = ('', '', '');
+
+# Check WAL read from buffers with an LSN greater than current database system
+# LSN.
+$lsn =
+  $node->safe_psql('postgres', 'SELECT pg_current_wal_flush_lsn()+1000;');
+
+# Must not use safe_psql since we expect an error here.
+($psql_ret, $psql_stdout, $psql_stderr) =
+  $node->psql('postgres', qq{SELECT test_wal_read_from_buffers('$lsn');});
+like(
+	$psql_stderr,
+	qr/ERROR: ( [A-Z0-9]+:)? WAL start LSN $lsn specified for reading from WAL buffers must be less than current database system WAL LSN *./,
+	"WAL read from WAL buffers failed due to an LSN greater than current database system LSN");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..c6ffb3fa65
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    read_successful OUT boolean
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..2307cbff7a
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,37 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	char		data[XLOG_BLCKSZ] = {0};
+	Size		nread;
+
+	nread = XLogReadFromBuffers(NULL, PG_GETARG_LSN(0),
+								GetWALInsertionTimeLine(),
+								XLOG_BLCKSZ, data);
+
+	PG_RETURN_BOOL(nread > 0);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..eda8d47954
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

#51Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#50)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Sat, 2023-11-04 at 20:55 +0530, Bharath Rupireddy wrote:

+		XLogRecPtr	EndPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[curridx]);
+
+		/*
+		 * xlblocks value can be InvalidXLogRecPtr before the new WAL buffer
+		 * page gets initialized in AdvanceXLInsertBuffer. In such a case
+		 * re-read the xlblocks value under the lock to ensure the correct
+		 * value is read.
+		 */
+		if (unlikely(XLogRecPtrIsInvalid(EndPtr)))
+		{
+			LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+			EndPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[curridx]);
+			LWLockRelease(WALBufMappingLock);
+		}
+
+		Assert(!XLogRecPtrIsInvalid(EndPtr));

Can that really happen? If the EndPtr is invalid, that means the page
is in the process of being cleared, so the contents of the page are
undefined at that time, right?

Regards,
Jeff Davis

#52Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Jeff Davis (#51)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Sun, Nov 5, 2023 at 2:57 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Sat, 2023-11-04 at 20:55 +0530, Bharath Rupireddy wrote:

+             XLogRecPtr      EndPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[curridx]);
+
+             /*
+              * xlblocks value can be InvalidXLogRecPtr before the new WAL buffer
+              * page gets initialized in AdvanceXLInsertBuffer. In such a case
+              * re-read the xlblocks value under the lock to ensure the correct
+              * value is read.
+              */
+             if (unlikely(XLogRecPtrIsInvalid(EndPtr)))
+             {
+                     LWLockAcquire(WALBufMappingLock, LW_EXCLUSIVE);
+                     EndPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[curridx]);
+                     LWLockRelease(WALBufMappingLock);
+             }
+
+             Assert(!XLogRecPtrIsInvalid(EndPtr));

Can that really happen? If the EndPtr is invalid, that means the page
is in the process of being cleared, so the contents of the page are
undefined at that time, right?

My initial thinking was this: xlblocks is read without holding
WALBufMappingLock in XLogWrite(), and since we temporarily write
InvalidXLogRecPtr to xlblocks array elements before MemSet-ting the
page in AdvanceXLInsertBuffer(), the PANIC "xlog write request %X/%X
is past end of log %X/%X" might be hit if the EndPtr read from
xlblocks is InvalidXLogRecPtr. FWIW, an Assert(false); within the
if (unlikely(XLogRecPtrIsInvalid(EndPtr))) block was never hit in make
check-world.

That understanding looks incorrect, though: a page that is being
written to the WAL file can never be concurrently initialized in
AdvanceXLInsertBuffer(). I'll remove this piece of code in the next
version of the patch unless there are any other thoughts.
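
The writer-side ordering being discussed can be sketched roughly as follows. This is a simplified stand-alone model using C11 atomics, not the PostgreSQL code: `xlblocks`, `pages`, `replace_page()` and the constants are stand-ins introduced for illustration, and the real AdvanceXLInsertBuffer() does considerably more (locking, writing out dirty pages, filling in the page header). Strictly speaking, concurrent non-atomic access to the page would be a data race under the C11 model; the sketch only shows the intended order of the three steps.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ      8192    /* stand-in for XLOG_BLCKSZ */
#define NBUFFERS    4       /* stand-in for the number of WAL buffer pages */
#define INVALID_PTR 0       /* stand-in for InvalidXLogRecPtr */

/* End LSN of the page currently held in each buffer slot. */
static _Atomic uint64_t xlblocks[NBUFFERS];
static char pages[NBUFFERS][BLCKSZ];

/*
 * Hypothetical model of recycling buffer slot 'idx' for the page that ends
 * at 'new_end': mark the slot invalid, clear it, then publish the new end.
 */
static void
replace_page(int idx, uint64_t new_end)
{
    /* 1. Tell lock-free readers that the slot is in flux. */
    atomic_store_explicit(&xlblocks[idx], INVALID_PTR, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);

    /* 2. Page contents are undefined from here until step 3. */
    memset(pages[idx], 0, BLCKSZ);

    /* 3. Publish the new page only after it has been initialized. */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&xlblocks[idx], new_end, memory_order_relaxed);
}
```

A reader that sees INVALID_PTR, or any value other than the end LSN it expects, has to treat the slot as not containing its page.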

[1]:
	/*
	 * Within the loop, curridx is the cache block index of the page to
	 * consider writing. Begin at the buffer containing the next unwritten
	 * page, or last partially written page.
	 */
	curridx = XLogRecPtrToBufIdx(LogwrtResult.Write);

	while (LogwrtResult.Write < WriteRqst.Write)
	{
		/*
		 * Make sure we're not ahead of the insert process. This could happen
		 * if we're passed a bogus WriteRqst.Write that is past the end of the
		 * last page that's been initialized by AdvanceXLInsertBuffer.
		 */
		XLogRecPtr	EndPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[curridx]);

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#53Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#49)
4 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Sat, Nov 4, 2023 at 6:13 AM Andres Freund <andres@anarazel.de> wrote:

+     cur_lsn = GetFlushRecPtr(NULL);
+     if (unlikely(startptr > cur_lsn))
+             elog(ERROR, "WAL start LSN %X/%X specified for reading from WAL buffers must be less than current database system WAL LSN %X/%X",
+                      LSN_FORMAT_ARGS(startptr), LSN_FORMAT_ARGS(cur_lsn));

Hm, why does this check belong here? For some tools it might be legitimate to
read the WAL before it was fully flushed.

Agreed and removed the check.

+     /*
+      * Holding WALBufMappingLock ensures inserters don't overwrite this value
+      * while we are reading it. We try to acquire it in shared mode so that
+      * the concurrent WAL readers are also allowed. We try to do as less work
+      * as possible while holding the lock to not impact concurrent WAL writers
+      * much. We quickly exit to not cause any contention, if the lock isn't
+      * immediately available.
+      */
+     if (!LWLockConditionalAcquire(WALBufMappingLock, LW_SHARED))
+             return 0;

That seems problematic - that lock is often heavily contended. We could
instead check IsWALRecordAvailableInXLogBuffers() once before reading the
page, then read the page contents *without* holding a lock, and then check
IsWALRecordAvailableInXLogBuffers() again - if the page was replaced in the
interim we read bogus data, but that's a bit of a wasted effort.

In the new approach described upthread at
/messages/by-id/c3455ab9da42e09ca9d059879b5c512b2d1f9681.camel@j-davis.com,
no lock is required for reading from WAL buffers. Please see the
attached patches for more details.
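
As a rough illustration of the check, copy, recheck idea (not the patch itself), here is a stand-alone sketch using C11 atomics. The names `xlblocks`, `pages`, `read_from_buffers()` and the constants are stand-ins chosen for this example; the patch uses pg_atomic_read_u64() and pg_read_barrier() rather than the C11 calls shown here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ   8192   /* stand-in for XLOG_BLCKSZ */
#define NBUFFERS 4      /* stand-in for the number of WAL buffer pages */

/* End LSN of the page currently held in each buffer slot (0 = none). */
static _Atomic uint64_t xlblocks[NBUFFERS];
static char pages[NBUFFERS][BLCKSZ];

/*
 * Copy up to 'count' bytes starting at LSN 'ptr' into 'dst' without taking
 * a lock. Returns the number of bytes copied, which may be less than 'count'.
 */
static size_t
read_from_buffers(uint64_t ptr, size_t count, char *dst)
{
    size_t total = 0;

    while (count > 0)
    {
        int      idx = (int) ((ptr / BLCKSZ) % NBUFFERS);
        size_t   off = (size_t) (ptr % BLCKSZ);
        uint64_t expected_end = ptr + (BLCKSZ - off);
        size_t   n = BLCKSZ - off;      /* bytes left on this page */

        if (n > count)
            n = count;

        /* First check: does the slot hold the page we want? */
        if (atomic_load_explicit(&xlblocks[idx],
                                 memory_order_acquire) != expected_end)
            break;

        memcpy(dst, pages[idx] + off, n);

        /* Keep the recheck from being ordered before the copy. */
        atomic_thread_fence(memory_order_acquire);

        /* Recheck: drop this chunk if the slot was recycled meanwhile. */
        if (atomic_load_explicit(&xlblocks[idx],
                                 memory_order_relaxed) != expected_end)
            break;

        dst += n;
        ptr += n;
        total += n;
        count -= n;
    }

    return total;
}
```

A short return value is not an error in this model: the caller is expected to fetch the remaining bytes some other way, which in the patch means falling back to reading the WAL file.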

+             /*
+              * The fact that we acquire WALBufMappingLock while reading the WAL
+              * buffer page itself guarantees that no one else initializes it or
+              * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
+              * need to ensure that we are not reading a page that just got
+              * initialized. For this, we look at the needed page header.
+              */
+             phdr = (XLogPageHeader) page;
+
+             /* Return, if WAL buffer page doesn't look valid. */
+             if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+                       phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+                       phdr->xlp_tli == tli))
+                     break;

I don't think this code should ever encounter a page where this is not the
case? We particularly shouldn't do so silently, seems that could hide all
kinds of problems.

With the new approach of reading WAL buffer pages without
WALBufMappingLock, I think it's possible to read a "just got
initialized" page: this happens if the page is read right after it is
initialized and xlblocks is filled in by AdvanceXLInsertBuffer(), but
before any actual WAL is written to it.
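
For reference, the minimal three-field check quoted above can be modelled in isolation like this. It is only a sketch: `PageHeader` is a simplified stand-in for the real WAL page header (which has more fields), `PAGE_MAGIC` is a placeholder value rather than the real XLOG_PAGE_MAGIC, and `page_looks_valid()` is a name made up for this example. It shows what the comparison accepts and rejects; it does not settle whether a freshly initialized page can actually fail it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define BLCKSZ     8192     /* stand-in for XLOG_BLCKSZ */
#define PAGE_MAGIC 0xD113   /* placeholder, not the real XLOG_PAGE_MAGIC */

/* Simplified stand-in for the WAL page header. */
typedef struct
{
    uint16_t magic;
    uint32_t tli;
    uint64_t pageaddr;
} PageHeader;

/*
 * Return true if the header at the start of 'page' matches the page that
 * should contain LSN 'ptr' on timeline 'tli'.
 */
static bool
page_looks_valid(const char *page, uint64_t ptr, uint32_t tli)
{
    PageHeader hdr;

    /* Copy the header out to avoid any alignment assumptions. */
    memcpy(&hdr, page, sizeof(hdr));

    return hdr.magic == PAGE_MAGIC &&
           hdr.pageaddr == ptr - (ptr % BLCKSZ) &&
           hdr.tli == tli;
}
```

In this model an all-zero page is rejected because the magic does not match, and a page with a valid header is rejected if it belongs to a different page address or timeline.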

+             /*
+              * Note that we don't perform all page header checks here to avoid
+              * extra work in production builds; callers will anyway do those
+              * checks extensively. However, in an assert-enabled build, we perform
+              * all the checks here and raise an error if failed.
+              */

Why?

Minimal page header checks are performed to ensure we don't read a
page that just got initialized, unlike the extensive checks that
XLogReaderValidatePageHeader() does. Are you suggesting removing the
page header checks with XLogReaderValidatePageHeader() for
assert-enabled builds? Or are you suggesting doing page header checks
with XLogReaderValidatePageHeader() for production builds too?

Please see the attached v16 patch set. Note that the 0004 patch adds
support for WAL read stats (both from WAL file and WAL buffers) to
walsenders and may not necessarily be the best approach to capture WAL
read stats in light of
/messages/by-id/CALj2ACU_f5_c8F+xyNR4HURjG=Jziiz07wCpQc=AqAJUFh7+8w@mail.gmail.com
which adds WAL read/write/fsync stats to pg_stat_io.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v16-0003-Add-test-module-for-verifying-WAL-read-from-WAL-.patch (application/x-patch)
From 3eaf18789e5d0b0a7c95d59ac13d71f3ef51680c Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 7 Nov 2023 21:05:12 +0000
Subject: [PATCH v16] Add test module for verifying WAL read from WAL buffers

This commit adds a test module to verify WAL read from WAL
buffers.

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 ++++++++++
 .../test_wal_read_from_buffers/meson.build    | 33 ++++++++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 43 +++++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 +++++++
 .../test_wal_read_from_buffers.c              | 37 ++++++++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 162 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index e81873cb5a..f5aedb95a4 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -31,6 +31,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi
 
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index fcd643f6f1..86fd74ab50 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -28,5 +28,6 @@ subdir('test_regex')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7472494501
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40bd5dcd33
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_wal_read_from_buffers
+
+test_install_data += files(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..80f6947d1c
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,43 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers;');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn');});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..c6ffb3fa65
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    read_successful OUT boolean
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..2307cbff7a
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,37 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	char		data[XLOG_BLCKSZ] = {0};
+	Size		nread;
+
+	nread = XLogReadFromBuffers(NULL, PG_GETARG_LSN(0),
+								GetWALInsertionTimeLine(),
+								XLOG_BLCKSZ, data);
+
+	PG_RETURN_BOOL(nread > 0);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..eda8d47954
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

v16-0004-Add-support-for-collecting-WAL-read-stats-for-wa.patch (application/octet-stream)
From fa0d8433ed649841416b442a8c9897acf12eaa92 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 8 Nov 2023 07:07:49 +0000
Subject: [PATCH v16] Add support for collecting WAL read stats for walsenders

This commit adds code for collecting WAL read stats for
walsenders, covering reads from both WAL buffers and WAL files,
and exposes them via pg_stat_replication.
---
 doc/src/sgml/monitoring.sgml                | 61 +++++++++++++++
 src/backend/access/transam/xlogreader.c     | 48 +++++++++++-
 src/backend/access/transam/xlogutils.c      |  2 +-
 src/backend/catalog/system_views.sql        |  8 +-
 src/backend/replication/walsender.c         | 86 ++++++++++++++++++++-
 src/bin/pg_waldump/pg_waldump.c             |  2 +-
 src/include/access/xlogreader.h             | 30 ++++++-
 src/include/catalog/pg_proc.dat             |  6 +-
 src/include/replication/walsender_private.h |  3 +
 src/test/regress/expected/rules.out         | 10 ++-
 10 files changed, 238 insertions(+), 18 deletions(-)

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index e068f7e247..a0257fea0c 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -1442,6 +1442,67 @@ description | Waiting for a newly initialized WAL file to reach durable storage
        Send time of last reply message received from standby server
       </para></entry>
      </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_read</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL data is read from disk
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_read_bytes</structfield> <type>numeric</type>
+      </para>
+      <para>
+       Total amount of WAL read from disk in bytes
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_read_time</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent reading WAL from disk via
+       <function>WALRead</function> request, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
+       otherwise zero).
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_read_buffers</structfield> <type>bigint</type>
+      </para>
+      <para>
+       Number of times WAL data is read from WAL buffers
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_read_bytes_buffers</structfield> <type>numeric</type>
+      </para>
+      <para>
+       Total amount of WAL read from WAL buffers in bytes
+      </para></entry>
+     </row>
+
+     <row>
+      <entry role="catalog_table_entry"><para role="column_definition">
+       <structfield>wal_read_time_buffers</structfield> <type>double precision</type>
+      </para>
+      <para>
+       Total amount of time spent reading WAL from WAL buffers via
+       <function>WALRead</function> request, in milliseconds
+       (if <xref linkend="guc-track-wal-io-timing"/> is enabled,
+       otherwise zero).
+      </para></entry>
+     </row>
+
     </tbody>
    </tgroup>
   </table>
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 5820c5eedc..44aee42079 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -31,6 +31,7 @@
 #include "access/xlogrecord.h"
 #include "catalog/pg_control.h"
 #include "common/pg_lzcompress.h"
+#include "portability/instr_time.h"
 #include "replication/origin.h"
 
 #ifndef FRONTEND
@@ -1479,9 +1480,9 @@ err:
  * reducing read system calls or even disk IOs.
  */
 bool
-WALRead(XLogReaderState *state,
-		char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
-		WALReadError *errinfo)
+WALRead(XLogReaderState *state, char *buf, XLogRecPtr startptr, Size count,
+		TimeLineID tli, WALReadError *errinfo, WALReadStats * stats,
+		bool capture_wal_io_timing)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
@@ -1489,9 +1490,13 @@ WALRead(XLogReaderState *state,
 #ifndef FRONTEND
 	Size		nread;
 #endif
+	instr_time	start;
 
 #ifndef FRONTEND
 
+	if (stats != NULL && capture_wal_io_timing)
+		INSTR_TIME_SET_CURRENT(start);
+
 	/*
 	 * Try reading WAL from WAL buffers. Frontend code has no idea of WAL
 	 * buffers.
@@ -1500,6 +1505,23 @@ WALRead(XLogReaderState *state,
 
 	Assert(nread >= 0);
 
+	/* Collect I/O stats if requested by the caller. */
+	if (stats != NULL)
+	{
+		stats->wal_read_buffers++;
+		stats->wal_read_bytes_buffers += nread;
+
+		/* Increment the I/O timing. */
+		if (capture_wal_io_timing)
+		{
+			instr_time	duration;
+
+			INSTR_TIME_SET_CURRENT(duration);
+			INSTR_TIME_SUBTRACT(duration, start);
+			stats->wal_read_time_buffers += INSTR_TIME_GET_MICROSEC(duration);
+		}
+	}
+
 	/*
 	 * Check if we have read fully (hit), partially (partial hit) or nothing
 	 * (miss) from WAL buffers. If we have read either partially or nothing,
@@ -1525,6 +1547,7 @@ WALRead(XLogReaderState *state,
 	p = buf;
 	recptr = startptr;
 	nbytes = count;
+	INSTR_TIME_SET_ZERO(start);
 
 	while (nbytes > 0)
 	{
@@ -1565,6 +1588,10 @@ WALRead(XLogReaderState *state,
 		else
 			segbytes = nbytes;
 
+		/* Measure I/O timing to read WAL data if requested by the caller. */
+		if (stats != NULL && capture_wal_io_timing)
+			INSTR_TIME_SET_CURRENT(start);
+
 #ifndef FRONTEND
 		pgstat_report_wait_start(WAIT_EVENT_WAL_READ);
 #endif
@@ -1587,6 +1614,21 @@ WALRead(XLogReaderState *state,
 			return false;
 		}
 
+		if (stats != NULL)
+		{
+			stats->wal_read++;
+			stats->wal_read_bytes += readbytes;
+
+			if (capture_wal_io_timing)
+			{
+				instr_time	duration;
+
+				INSTR_TIME_SET_CURRENT(duration);
+				INSTR_TIME_SUBTRACT(duration, start);
+				stats->wal_read_time += INSTR_TIME_GET_MICROSEC(duration);
+			}
+		}
+
 		/* Update state for read */
 		recptr += readbytes;
 		nbytes -= readbytes;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 43f7b31205..c88aad35bb 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -1012,7 +1012,7 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
 	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
-				 &errinfo))
+				 &errinfo, NULL, false))
 		WALReadRaiseError(&errinfo);
 
 	/* number of valid bytes in the buffer */
diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index b65f6b5249..5e9fa587e7 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -893,7 +893,13 @@ CREATE VIEW pg_stat_replication AS
             W.replay_lag,
             W.sync_priority,
             W.sync_state,
-            W.reply_time
+            W.reply_time,
+            W.wal_read,
+            W.wal_read_bytes,
+            W.wal_read_time,
+            W.wal_read_buffers,
+            W.wal_read_bytes_buffers,
+            W.wal_read_time_buffers
     FROM pg_stat_get_activity(NULL) AS S
         JOIN pg_stat_get_wal_senders() AS W ON (S.pid = W.pid)
         LEFT JOIN pg_authid AS U ON (S.usesysid = U.oid);
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e250b0567e..2bfee2b002 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -259,7 +259,7 @@ static bool TransactionIdInRecentPast(TransactionId xid, uint32 epoch);
 
 static void WalSndSegmentOpen(XLogReaderState *state, XLogSegNo nextSegNo,
 							  TimeLineID *tli_p);
-
+static void WalSndAccumulateWalReadStats(WALReadStats * stats);
 
 /* Initialize walsender process before entering the main command loop */
 void
@@ -907,6 +907,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	WALReadError errinfo;
 	XLogSegNo	segno;
 	TimeLineID	currTLI;
+	WALReadStats stats;
 
 	/*
 	 * Make sure we have enough WAL available before retrieving the current
@@ -943,6 +944,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	else
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
+	MemSet(&stats, 0, sizeof(WALReadStats));
+
 	/* now actually read the data, we know it's there */
 	if (!WALRead(state,
 				 cur_page,
@@ -951,9 +954,13 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 				 currTLI,		/* Pass the current TLI because only
 								 * WalSndSegmentOpen controls whether new TLI
 								 * is needed. */
-				 &errinfo))
+				 &errinfo,
+				 &stats,
+				 track_wal_io_timing))
 		WALReadRaiseError(&errinfo);
 
+	WalSndAccumulateWalReadStats(&stats);
+
 	/*
 	 * After reading into the buffer, check that what we read was valid. We do
 	 * this after reading, because even though the segment was present when we
@@ -2630,6 +2637,13 @@ InitWalSenderSlot(void)
 			else
 				walsnd->kind = REPLICATION_KIND_LOGICAL;
 
+			walsnd->wal_read_stats.wal_read = 0;
+			walsnd->wal_read_stats.wal_read_bytes = 0;
+			walsnd->wal_read_stats.wal_read_time = 0;
+			walsnd->wal_read_stats.wal_read_buffers = 0;
+			walsnd->wal_read_stats.wal_read_bytes_buffers = 0;
+			walsnd->wal_read_stats.wal_read_time_buffers = 0;
+
 			SpinLockRelease(&walsnd->mutex);
 			/* don't need the lock anymore */
 			MyWalSnd = (WalSnd *) walsnd;
@@ -2750,6 +2764,7 @@ XLogSendPhysical(void)
 	Size		nbytes;
 	XLogSegNo	segno;
 	WALReadError errinfo;
+	WALReadStats stats;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -2965,6 +2980,8 @@ XLogSendPhysical(void)
 	enlargeStringInfo(&output_message, nbytes);
 
 retry:
+	MemSet(&stats, 0, sizeof(WALReadStats));
+
 	if (!WALRead(xlogreader,
 				 &output_message.data[output_message.len],
 				 startptr,
@@ -2972,9 +2989,13 @@ retry:
 				 xlogreader->seg.ws_tli,	/* Pass the current TLI because
 											 * only WalSndSegmentOpen controls
 											 * whether new TLI is needed. */
-				 &errinfo))
+				 &errinfo,
+				 &stats,
+				 track_wal_io_timing))
 		WALReadRaiseError(&errinfo);
 
+	WalSndAccumulateWalReadStats(&stats);
+
 	/* See logical_read_xlog_page(). */
 	XLByteToSeg(startptr, segno, xlogreader->segcxt.ws_segsize);
 	CheckXLogRemoved(segno, xlogreader->seg.ws_tli);
@@ -3523,7 +3544,7 @@ offset_to_interval(TimeOffset offset)
 Datum
 pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 {
-#define PG_STAT_GET_WAL_SENDERS_COLS	12
+#define PG_STAT_GET_WAL_SENDERS_COLS	18
 	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
 	SyncRepStandbyData *sync_standbys;
 	int			num_standbys;
@@ -3552,9 +3573,16 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		WalSndState state;
 		TimestampTz replyTime;
 		bool		is_sync_standby;
+		int64		wal_read;
+		uint64		wal_read_bytes;
+		int64		wal_read_time;
+		int64		wal_read_buffers;
+		uint64		wal_read_bytes_buffers;
+		int64		wal_read_time_buffers;
 		Datum		values[PG_STAT_GET_WAL_SENDERS_COLS];
 		bool		nulls[PG_STAT_GET_WAL_SENDERS_COLS] = {0};
 		int			j;
+		char		buf[256];
 
 		/* Collect data from shared memory */
 		SpinLockAcquire(&walsnd->mutex);
@@ -3574,6 +3602,12 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 		applyLag = walsnd->applyLag;
 		priority = walsnd->sync_standby_priority;
 		replyTime = walsnd->replyTime;
+		wal_read = walsnd->wal_read_stats.wal_read;
+		wal_read_bytes = walsnd->wal_read_stats.wal_read_bytes;
+		wal_read_time = walsnd->wal_read_stats.wal_read_time;
+		wal_read_buffers = walsnd->wal_read_stats.wal_read_buffers;
+		wal_read_bytes_buffers = walsnd->wal_read_stats.wal_read_bytes_buffers;
+		wal_read_time_buffers = walsnd->wal_read_stats.wal_read_time_buffers;
 		SpinLockRelease(&walsnd->mutex);
 
 		/*
@@ -3670,6 +3704,31 @@ pg_stat_get_wal_senders(PG_FUNCTION_ARGS)
 				nulls[11] = true;
 			else
 				values[11] = TimestampTzGetDatum(replyTime);
+
+			values[12] = Int64GetDatum(wal_read);
+
+			/* Convert to numeric. */
+			snprintf(buf, sizeof buf, UINT64_FORMAT, wal_read_bytes);
+			values[13] = DirectFunctionCall3(numeric_in,
+											 CStringGetDatum(buf),
+											 ObjectIdGetDatum(0),
+											 Int32GetDatum(-1));
+
+			/* Convert counter from microsec to millisec for display. */
+			values[14] = Float8GetDatum(((double) wal_read_time) / 1000.0);
+
+			values[15] = Int64GetDatum(wal_read_buffers);
+
+			/* Convert to numeric. */
+			MemSet(buf, '\0', sizeof buf);
+			snprintf(buf, sizeof buf, UINT64_FORMAT, wal_read_bytes_buffers);
+			values[16] = DirectFunctionCall3(numeric_in,
+											 CStringGetDatum(buf),
+											 ObjectIdGetDatum(0),
+											 Int32GetDatum(-1));
+
+			/* Convert counter from microsec to millisec for display. */
+			values[17] = Float8GetDatum(((double) wal_read_time_buffers) / 1000.0);
 		}
 
 		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
@@ -3914,3 +3973,22 @@ LagTrackerRead(int head, XLogRecPtr lsn, TimestampTz now)
 	Assert(time != 0);
 	return now - time;
 }
+
+/*
+ * Function to accumulate WAL Read stats for WAL sender.
+ */
+static void
+WalSndAccumulateWalReadStats(WALReadStats * stats)
+{
+	/* Collect I/O stats for walsender. */
+	SpinLockAcquire(&MyWalSnd->mutex);
+	MyWalSnd->wal_read_stats.wal_read += stats->wal_read;
+	MyWalSnd->wal_read_stats.wal_read_bytes += stats->wal_read_bytes;
+	MyWalSnd->wal_read_stats.wal_read_time += stats->wal_read_time;
+	MyWalSnd->wal_read_stats.wal_read_buffers += stats->wal_read_buffers;
+	MyWalSnd->wal_read_stats.wal_read_bytes_buffers +=
+		stats->wal_read_bytes_buffers;
+	MyWalSnd->wal_read_stats.wal_read_time_buffers +=
+		stats->wal_read_time_buffers;
+	SpinLockRelease(&MyWalSnd->mutex);
+}
diff --git a/src/bin/pg_waldump/pg_waldump.c b/src/bin/pg_waldump/pg_waldump.c
index a3535bdfa9..5e1c14dd2e 100644
--- a/src/bin/pg_waldump/pg_waldump.c
+++ b/src/bin/pg_waldump/pg_waldump.c
@@ -407,7 +407,7 @@ WALDumpReadPage(XLogReaderState *state, XLogRecPtr targetPagePtr, int reqLen,
 	}
 
 	if (!WALRead(state, readBuff, targetPagePtr, count, private->timeline,
-				 &errinfo))
+				 &errinfo, NULL, false))
 	{
 		WALOpenSegment *seg = &errinfo.wre_seg;
 		char		fname[MAXPGPATH];
diff --git a/src/include/access/xlogreader.h b/src/include/access/xlogreader.h
index 0813722715..44a3cd4591 100644
--- a/src/include/access/xlogreader.h
+++ b/src/include/access/xlogreader.h
@@ -388,9 +388,33 @@ typedef struct WALReadError
 	WALOpenSegment wre_seg;		/* Segment we tried to read from. */
 } WALReadError;
 
-extern bool WALRead(XLogReaderState *state,
-					char *buf, XLogRecPtr startptr, Size count,
-					TimeLineID tli, WALReadError *errinfo);
+/*
+ * WAL read stats from WALRead that the callers can use.
+ */
+typedef struct WALReadStats
+{
+	/* Number of times WAL was read from disk. */
+	int64		wal_read;
+
+	/* Total amount of WAL read from disk in bytes. */
+	uint64		wal_read_bytes;
+
+	/* Total amount of time spent reading WAL from disk. */
+	int64		wal_read_time;
+
+	/* Number of times WAL was read from WAL buffers. */
+	int64		wal_read_buffers;
+
+	/* Total amount of WAL read from WAL buffers in bytes. */
+	uint64		wal_read_bytes_buffers;
+
+	/* Total amount of time spent reading WAL from WAL buffers. */
+	int64		wal_read_time_buffers;
+}			WALReadStats;
+
+extern bool WALRead(XLogReaderState *state, char *buf, XLogRecPtr startptr,
+					Size count, TimeLineID tli, WALReadError *errinfo,
+					WALReadStats * stats, bool capture_wal_io_timing);
 
 /* Functions for decoding an XLogRecord */
 
diff --git a/src/include/catalog/pg_proc.dat b/src/include/catalog/pg_proc.dat
index f14aed422a..4af16a0f81 100644
--- a/src/include/catalog/pg_proc.dat
+++ b/src/include/catalog/pg_proc.dat
@@ -5452,9 +5452,9 @@
   proname => 'pg_stat_get_wal_senders', prorows => '10', proisstrict => 'f',
   proretset => 't', provolatile => 's', proparallel => 'r',
   prorettype => 'record', proargtypes => '',
-  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz}',
-  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o}',
-  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time}',
+  proallargtypes => '{int4,text,pg_lsn,pg_lsn,pg_lsn,pg_lsn,interval,interval,interval,int4,text,timestamptz,int8,numeric,float8,int8,numeric,float8}',
+  proargmodes => '{o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o,o}',
+  proargnames => '{pid,state,sent_lsn,write_lsn,flush_lsn,replay_lsn,write_lag,flush_lag,replay_lag,sync_priority,sync_state,reply_time,wal_read,wal_read_bytes,wal_read_time,wal_read_buffers,wal_read_bytes_buffers,wal_read_time_buffers}',
   prosrc => 'pg_stat_get_wal_senders' },
 { oid => '3317', descr => 'statistics: information about WAL receiver',
   proname => 'pg_stat_get_wal_receiver', proisstrict => 'f', provolatile => 's',
diff --git a/src/include/replication/walsender_private.h b/src/include/replication/walsender_private.h
index 13fd5877a6..c21707098f 100644
--- a/src/include/replication/walsender_private.h
+++ b/src/include/replication/walsender_private.h
@@ -13,6 +13,7 @@
 #define _WALSENDER_PRIVATE_H
 
 #include "access/xlog.h"
+#include "access/xlogreader.h"
 #include "lib/ilist.h"
 #include "nodes/nodes.h"
 #include "nodes/replnodes.h"
@@ -83,6 +84,8 @@ typedef struct WalSnd
 	TimestampTz replyTime;
 
 	ReplicationKind kind;
+
+	WALReadStats wal_read_stats;
 } WalSnd;
 
 extern PGDLLIMPORT WalSnd *MyWalSnd;
diff --git a/src/test/regress/expected/rules.out b/src/test/regress/expected/rules.out
index 1442c43d9c..40d7963707 100644
--- a/src/test/regress/expected/rules.out
+++ b/src/test/regress/expected/rules.out
@@ -2078,9 +2078,15 @@ pg_stat_replication| SELECT s.pid,
     w.replay_lag,
     w.sync_priority,
     w.sync_state,
-    w.reply_time
+    w.reply_time,
+    w.wal_read,
+    w.wal_read_bytes,
+    w.wal_read_time,
+    w.wal_read_buffers,
+    w.wal_read_bytes_buffers,
+    w.wal_read_time_buffers
    FROM ((pg_stat_get_activity(NULL::integer) s(datid, pid, usesysid, application_name, state, query, wait_event_type, wait_event, xact_start, query_start, backend_start, state_change, client_addr, client_hostname, client_port, backend_xid, backend_xmin, backend_type, ssl, sslversion, sslcipher, sslbits, ssl_client_dn, ssl_client_serial, ssl_issuer_dn, gss_auth, gss_princ, gss_enc, gss_delegation, leader_pid, query_id)
-     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time) ON ((s.pid = w.pid)))
+     JOIN pg_stat_get_wal_senders() w(pid, state, sent_lsn, write_lsn, flush_lsn, replay_lsn, write_lag, flush_lag, replay_lag, sync_priority, sync_state, reply_time, wal_read, wal_read_bytes, wal_read_time, wal_read_buffers, wal_read_bytes_buffers, wal_read_time_buffers) ON ((s.pid = w.pid)))
      LEFT JOIN pg_authid u ON ((s.usesysid = u.oid)));
 pg_stat_replication_slots| SELECT s.slot_name,
     s.spill_txns,
-- 
2.34.1
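
For anyone who wants to play with the counters outside the server, here is a
minimal standalone sketch of the accumulate-then-report pattern the 0004 patch
uses. It is not the patch code: SketchWalReadStats, sketch_accumulate(),
sketch_usec_to_msec() and sketch_buffer_byte_ratio() are hypothetical names
made up for illustration, there is no spinlock, and the "hit ratio" helper is
my own assumption about how one might summarize the two byte counters.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the patch's WALReadStats; not the server struct. */
typedef struct SketchWalReadStats
{
	int64_t		wal_read;				/* number of reads from WAL files */
	uint64_t	wal_read_bytes;			/* bytes read from WAL files */
	int64_t		wal_read_time;			/* microseconds spent on file reads */
	int64_t		wal_read_buffers;		/* number of reads from WAL buffers */
	uint64_t	wal_read_bytes_buffers; /* bytes read from WAL buffers */
	int64_t		wal_read_time_buffers;	/* microseconds spent on buffer reads */
} SketchWalReadStats;

/*
 * Fold the counters of one read call into a running total. The patch does the
 * equivalent under the walsender's spinlock; this sketch has no locking.
 */
static void
sketch_accumulate(SketchWalReadStats *total, const SketchWalReadStats *delta)
{
	assert(total != NULL && delta != NULL);

	total->wal_read += delta->wal_read;
	total->wal_read_bytes += delta->wal_read_bytes;
	total->wal_read_time += delta->wal_read_time;
	total->wal_read_buffers += delta->wal_read_buffers;
	total->wal_read_bytes_buffers += delta->wal_read_bytes_buffers;
	total->wal_read_time_buffers += delta->wal_read_time_buffers;
}

/* Timings are kept in microseconds and converted to milliseconds for display. */
static double
sketch_usec_to_msec(int64_t usec)
{
	return ((double) usec) / 1000.0;
}

/* Fraction of all bytes that came from WAL buffers; 0.0 if nothing was read. */
static double
sketch_buffer_byte_ratio(const SketchWalReadStats *stats)
{
	uint64_t	total_bytes;

	assert(stats != NULL);

	total_bytes = stats->wal_read_bytes + stats->wal_read_bytes_buffers;
	if (total_bytes == 0)
		return 0.0;

	return (double) stats->wal_read_bytes_buffers / (double) total_bytes;
}
```

A caller would zero a per-call struct, fill it during the read, and then fold
it into the long-lived total, which mirrors the MemSet() followed by
WalSndAccumulateWalReadStats() in the walsender changes above.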

v16-0001-Use-64-bit-atomics-for-xlblocks-array-elements.patch (application/x-patch)
From b678edfccf7ecf490cb792391249cbf85ba0db29 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 7 Nov 2023 19:20:00 +0000
Subject: [PATCH v16] Use 64-bit atomics for xlblocks array elements

In AdvanceXLInsertBuffer(), xlblocks value of a WAL buffer page is
updated only at the end after the page is initialized with all
zeros. A problem with this approach is that anyone reading
xlblocks and WAL buffer page without holding WALBufMappingLock
will see the wrong page contents if the read happens before the
xlblocks is marked with a new entry in AdvanceXLInsertBuffer() at
the end.

To fix this issue, xlblocks is made to use 64-bit atomics instead
of XLogRecPtr and the xlblocks value is marked with
InvalidXLogRecPtr just before the page initialization begins. Once
the page initialization finishes, only then the actual value of
the newly initialized page is marked in xlblocks. A write barrier
is placed in between xlblocks update with InvalidXLogRecPtr and
the page initialization to not cause any memory ordering problems.

With this fix, one can read xlblocks and WAL buffer page without
WALBufMappingLock in the following manner:

  endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);

  /* Requested WAL isn't available in WAL buffers. */
  if (expectedEndPtr != endptr)
      break;

  page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
  data = page + ptr % XLOG_BLCKSZ;
  ...
  pg_read_barrier();
  ...
  memcpy(buf, data, bytes_to_read);
  ...
  pg_read_barrier();

  /* Recheck if the page still exists in WAL buffers. */
  endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);

  /* Return if the page got initialized while we were reading it */
  if (expectedEndPtr != endptr)
      break;
---
 src/backend/access/transam/xlog.c | 55 ++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b541be8eec..5fe4f101e8 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -501,7 +501,7 @@ typedef struct XLogCtlData
 	 * WALBufMappingLock.
 	 */
 	char	   *pages;			/* buffers for unwritten XLOG pages */
-	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
+	pg_atomic_uint64 *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */
 	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
 
 	/*
@@ -1634,20 +1634,19 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	 * out to disk and evicted, and the caller is responsible for making sure
 	 * that doesn't happen.
 	 *
-	 * However, we don't hold a lock while we read the value. If someone has
-	 * just initialized the page, it's possible that we get a "torn read" of
-	 * the XLogRecPtr if 64-bit fetches are not atomic on this platform. In
-	 * that case we will see a bogus value. That's ok, we'll grab the mapping
-	 * lock (in AdvanceXLInsertBuffer) and retry if we see anything else than
-	 * the page we're looking for. But it means that when we do this unlocked
-	 * read, we might see a value that appears to be ahead of the page we're
-	 * looking for. Don't PANIC on that, until we've verified the value while
-	 * holding the lock.
+	 * However, we don't hold a lock while we read the value. If someone is
+	 * just about to initialize or has just initialized the page, it's
+	 * possible that we get InvalidXLogRecPtr. That's ok, we'll grab the
+	 * mapping lock (in AdvanceXLInsertBuffer) and retry if we see anything
+	 * else than the page we're looking for. But it means that when we do this
+	 * unlocked read, we might see a value that appears to be ahead of the
+	 * page we're looking for. Don't PANIC on that, until we've verified the
+	 * value while holding the lock.
 	 */
 	expectedEndPtr = ptr;
 	expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
 
-	endptr = XLogCtl->xlblocks[idx];
+	endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
 	if (expectedEndPtr != endptr)
 	{
 		XLogRecPtr	initializedUpto;
@@ -1678,7 +1677,7 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 		WALInsertLockUpdateInsertingAt(initializedUpto);
 
 		AdvanceXLInsertBuffer(ptr, tli, false);
-		endptr = XLogCtl->xlblocks[idx];
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
 
 		if (expectedEndPtr != endptr)
 			elog(PANIC, "could not find WAL buffer for %X/%X",
@@ -1865,7 +1864,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 		 * be zero if the buffer hasn't been used yet).  Fall through if it's
 		 * already written out.
 		 */
-		OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
+		OldPageRqstPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]);
 		if (LogwrtResult.Write < OldPageRqstPtr)
 		{
 			/*
@@ -1934,6 +1933,20 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 
 		NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
 
+		/*
+		 * Make sure to mark the xlblocks with InvalidXLogRecPtr before the
+		 * initialization of the page begins so that others, reading xlblocks
+		 * without holding a lock, will know that the page initialization has
+		 * just begun.
+		 */
+		pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], InvalidXLogRecPtr);
+
+		/*
+		 * A write barrier here prevents the xlblocks atomic write above from
+		 * being reordered with the page initialization below.
+		 */
+		pg_write_barrier();
+
 		/*
 		 * Be sure to re-zero the buffer so that bytes beyond what we've
 		 * written will look like zeroes and not valid XLOG records...
@@ -1987,8 +2000,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 		 */
 		pg_write_barrier();
 
-		*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
-
+		pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr);
 		XLogCtl->InitializedUpTo = NewPageEndPtr;
 
 		npages++;
@@ -2187,7 +2199,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 		 * if we're passed a bogus WriteRqst.Write that is past the end of the
 		 * last page that's been initialized by AdvanceXLInsertBuffer.
 		 */
-		XLogRecPtr	EndPtr = XLogCtl->xlblocks[curridx];
+		XLogRecPtr	EndPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[curridx]);
 
 		if (LogwrtResult.Write >= EndPtr)
 			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
@@ -4675,10 +4687,13 @@ XLOGShmemInit(void)
 	 * needed here.
 	 */
 	allocptr = ((char *) XLogCtl) + sizeof(XLogCtlData);
-	XLogCtl->xlblocks = (XLogRecPtr *) allocptr;
-	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
-	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
+	XLogCtl->xlblocks = (pg_atomic_uint64 *) allocptr;
+	allocptr += sizeof(pg_atomic_uint64) * XLOGbuffers;
 
+	for (i = 0; i < XLOGbuffers; i++)
+	{
+		pg_atomic_init_u64(&XLogCtl->xlblocks[i], InvalidXLogRecPtr);
+	}
 
 	/* WAL insertion locks. Ensure they're aligned to the full padded size */
 	allocptr += sizeof(WALInsertLockPadded) -
@@ -5715,7 +5730,7 @@ StartupXLOG(void)
 		memcpy(page, endOfRecoveryInfo->lastPage, len);
 		memset(page + len, 0, XLOG_BLCKSZ - len);
 
-		XLogCtl->xlblocks[firstIdx] = endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ;
+		pg_atomic_write_u64(&XLogCtl->xlblocks[firstIdx], endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
 		XLogCtl->InitializedUpTo = endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ;
 	}
 	else
-- 
2.34.1
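
To make the lock-free read protocol from the 0001 commit message easier to
experiment with, here is a minimal standalone sketch using C11 atomics. It is
an illustration under stated assumptions, not the server code: the sketch_*
names and SKETCH_* constants are hypothetical, atomic_thread_fence() stands in
for pg_write_barrier()/pg_read_barrier(), and the value 0 plays the role of
InvalidXLogRecPtr. Note also that a plain memcpy() racing with a concurrent
writer is formally a data race in C11, so treat this as a single-threaded
demonstration of the check, copy, recheck ordering only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_BLCKSZ	8192	/* stand-in for XLOG_BLCKSZ */
#define SKETCH_NBUFFERS 4		/* stand-in for the number of WAL buffers */

/* Per-slot "end pointer" tags; 0 means the slot holds no valid page. */
static _Atomic uint64_t sketch_xlblocks[SKETCH_NBUFFERS];
static char sketch_pages[SKETCH_NBUFFERS * SKETCH_BLCKSZ];

/* Map a WAL position to its buffer slot. */
static size_t
sketch_buf_idx(uint64_t ptr)
{
	return (size_t) ((ptr / SKETCH_BLCKSZ) % SKETCH_NBUFFERS);
}

/*
 * Writer side: invalidate the tag, initialize the page, then publish the new
 * tag. The fences keep those three steps from being reordered.
 */
static void
sketch_init_page(uint64_t page_start, char fill)
{
	size_t		idx = sketch_buf_idx(page_start);

	assert(page_start % SKETCH_BLCKSZ == 0);

	atomic_store_explicit(&sketch_xlblocks[idx], 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	memset(sketch_pages + idx * SKETCH_BLCKSZ, fill, SKETCH_BLCKSZ);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&sketch_xlblocks[idx], page_start + SKETCH_BLCKSZ,
						  memory_order_relaxed);
}

/*
 * Reader side: copy up to 'count' bytes starting at 'ptr' without crossing
 * the page boundary. Returns the number of bytes copied, or 0 if the page was
 * not in the buffers before or after the copy.
 */
static size_t
sketch_read_page(uint64_t ptr, size_t count, char *buf)
{
	size_t		idx = sketch_buf_idx(ptr);
	size_t		offset = (size_t) (ptr % SKETCH_BLCKSZ);
	uint64_t	expected_end = ptr + (SKETCH_BLCKSZ - offset);
	size_t		avail = SKETCH_BLCKSZ - offset;
	size_t		n = (count < avail) ? count : avail;

	assert(buf != NULL);

	/* Requested page isn't in the buffers. */
	if (atomic_load_explicit(&sketch_xlblocks[idx],
							 memory_order_relaxed) != expected_end)
		return 0;

	atomic_thread_fence(memory_order_acquire);
	memcpy(buf, sketch_pages + idx * SKETCH_BLCKSZ + offset, n);
	atomic_thread_fence(memory_order_acquire);

	/* Recheck: discard the copy if the slot was reused while we read it. */
	if (atomic_load_explicit(&sketch_xlblocks[idx],
							 memory_order_relaxed) != expected_end)
		return 0;

	return n;
}
```

The reader never blocks the writer; the price is that a read can fail after
the copy has already been done, in which case the caller has to fall back to
another source (the WAL file, in the patch set).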

v16-0002-Allow-WAL-reading-from-WAL-buffers.patch (application/x-patch)
From 3f7c2ec6ed281c827105d46b12276cfb54da8441 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 7 Nov 2023 21:02:35 +0000
Subject: [PATCH v16] Allow WAL reading from WAL buffers

This commit gives WALRead() the capability to read WAL from WAL
buffers when possible. When the requested WAL isn't available in
WAL buffers, it is read from the WAL file as usual.

This commit benefits the callers of WALRead(), namely walsenders
and pg_walinspect. They can now avoid reading WAL from the WAL
file (possibly avoiding disk IO). Tests show that the WAL buffers
hit ratio stood at 95% for 1 primary, 1 sync standby, 1 async
standby, with pgbench --scale=300 --client=32 --time=900. In other
words, the walsenders avoided reading from the file (and the
associated pread system calls) 95% of the time:
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com

This commit also helps when direct IO is enabled for WAL.
Reading WAL from WAL buffers brings performance back close to
that without direct IO for WAL:
https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com

This commit paves the way for the following features in the future:
- Improved synchronous replication performance by replicating
directly from WAL buffers.
- An opt-in way for the walreceivers to receive unflushed WAL.
More details here:
https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar, Andres Freund
Reviewed-by: Nathan Bossart, Kuntal Ghosh
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/backend/access/transam/xlog.c       | 167 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  41 +++++-
 src/include/access/xlog.h               |   6 +
 3 files changed, 212 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 5fe4f101e8..2c1ddf235f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1705,6 +1705,173 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * Read 'count' bytes of WAL from WAL buffers into 'buf', starting at location
+ * 'startptr', on timeline 'tli' and return total read bytes.
+ *
+ * This function returns quickly in the following cases:
+ * - When the passed-in timeline differs from the server's current insertion
+ * timeline, as WAL is always inserted into WAL buffers on that timeline.
+ *
+ * - When the server is in recovery, as WAL buffers aren't currently used in
+ * recovery.
+ *
+ * Note that this function reads as much as it can from WAL buffers, meaning,
+ * it may not read all the requested 'count' bytes. Caller must be aware of
+ * this and deal with it.
+ *
+ * Note that this function reads WAL from WAL buffers without holding any lock.
+ * First it reads xlblocks atomically to check page existence, then it reads
+ * the page contents and validates them. Finally, it rechecks page existence by
+ * rereading xlblocks; if the page was replaced in the meantime, it discards
+ * the data read from it and returns.
+ *
+ * Note that this function is not available for frontend code as WAL buffers
+ * are a mechanism internal to the server.
+ */
+Size
+XLogReadFromBuffers(XLogReaderState *state,
+					XLogRecPtr startptr,
+					TimeLineID tli,
+					Size count,
+					char *buf)
+{
+	XLogRecPtr	ptr;
+	Size		nbytes;
+	Size		ntotal;
+	char	   *dst;
+
+	if (RecoveryInProgress())
+		return 0;
+
+	if (tli != GetWALInsertionTimeLine())
+		return 0;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+
+	ptr = startptr;
+	nbytes = count;				/* Total bytes requested to be read by caller. */
+	ntotal = 0;					/* Total bytes read. */
+	dst = buf;
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int			idx;
+		char	   *page;
+		char	   *data;
+		XLogPageHeader phdr;
+		Size		nread;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+
+		/* Requested WAL isn't available in WAL buffers. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/* Make sure to not read a page that just got initialized. */
+		phdr = (XLogPageHeader) page;
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+			break;
+
+		/*
+		 * Note that we don't perform all page header checks here to avoid
+		 * extra work in production builds; callers will anyway do those
+		 * checks extensively. However, in an assert-enabled build, we perform
+		 * all the checks here and raise an error if failed.
+		 */
+#ifdef USE_ASSERT_CHECKING
+		if (unlikely(state != NULL &&
+					 !XLogReaderValidatePageHeader(state, (endptr - XLOG_BLCKSZ),
+												   (char *) phdr)))
+		{
+			if (state->errormsg_buf[0])
+				ereport(ERROR,
+						(errcode(ERRCODE_INTERNAL_ERROR),
+						 errmsg_internal("%s", state->errormsg_buf)));
+			else
+				ereport(ERROR,
+						(errcode(ERRCODE_INTERNAL_ERROR),
+						 errmsg_internal("could not read WAL from WAL buffers")));
+		}
+#endif
+
+		/*
+		 * Make sure we don't read xlblocks up above before the page contents
+		 * down below.
+		 */
+		pg_read_barrier();
+
+		nread = 0;
+
+		/* Read what is wanted, not the whole page. */
+		if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			nread = nbytes;
+		}
+		else
+		{
+			/*
+			 * All the bytes are not in one page. Read available bytes on the
+			 * current page, copy them over to output buffer and continue to
+			 * read remaining bytes.
+			 */
+			nread = XLOG_BLCKSZ - (data - page);
+			Assert(nread > 0 && nread <= nbytes);
+		}
+
+		Assert(nread > 0);
+		memcpy(dst, data, nread);
+
+		/*
+		 * Make sure we don't read xlblocks down below before the page
+		 * contents up above.
+		 */
+		pg_read_barrier();
+
+		/* Recheck if the read page still exists in WAL buffers. */
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+
+		/* Return if the page got initialized while we were reading it. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		dst += nread;
+		ptr += nread;
+		ntotal += nread;
+		nbytes -= nread;
+	}
+
+	/* We never read more than what the caller has asked for. */
+	Assert(ntotal <= count);
+
+#ifdef WAL_DEBUG
+	if (XLOG_DEBUG)
+		ereport(DEBUG1,
+				(errmsg_internal("read %zu bytes out of %zu bytes from WAL buffers for given start LSN %X/%X, timeline ID %u",
+								 ntotal, count, LSN_FORMAT_ARGS(startptr), tli)));
+#endif
+
+	return ntotal;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e0baa86bd3..5820c5eedc 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1473,8 +1473,10 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers. When
+ * requested WAL isn't available in WAL buffers, the WAL is read from the WAL
+ * file as usual. The callers may avoid reading WAL from the WAL file thus
+ * reducing read system calls or even disk IOs.
  */
 bool
 WALRead(XLogReaderState *state,
@@ -1484,6 +1486,41 @@ WALRead(XLogReaderState *state,
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	Size		nread;
+#endif
+
+#ifndef FRONTEND
+
+	/*
+	 * Try reading WAL from WAL buffers. Frontend code has no idea of WAL
+	 * buffers.
+	 */
+	nread = XLogReadFromBuffers(state, startptr, tli, count, buf);
+
+	Assert(nread >= 0);
+
+	/*
+	 * Check if we have read fully (hit), partially (partial hit) or nothing
+	 * (miss) from WAL buffers. If we have read either partially or nothing,
+	 * then continue to read the remaining bytes the usual way, that is, read
+	 * from WAL file.
+	 *
+	 * XXX: It might be worth to expose WAL buffer read stats.
+	 */
+	if (count == nread)
+		return true;			/* Buffer hit, so return. */
+	else if (count > nread)
+	{
+		/*
+		 * Buffer partial hit, so reset the state to count the read bytes and
+		 * continue.
+		 */
+		buf += nread;
+		startptr += nread;
+		count -= nread;
+	}
+#endif
 
 	p = buf;
 	recptr = startptr;
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..18167c36b4 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -251,6 +251,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern Size XLogReadFromBuffers(struct XLogReaderState *state,
+								XLogRecPtr startptr,
+								TimeLineID tli,
+								Size count,
+								char *buf);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1
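
For readers following the loop in XLogReadFromBuffers() above, the page arithmetic can be tried in isolation. This is a minimal sketch, not code from the patch: SKETCH_BLCKSZ stands in for XLOG_BLCKSZ (8192 is assumed here), and page_end_ptr() and chunk_on_page() are hypothetical helper names that mirror how expectedEndPtr and nread are computed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch; stands in for XLOG_BLCKSZ. */
#define SKETCH_BLCKSZ 8192

/*
 * End pointer of the page containing 'ptr'. This is the value the loop
 * expects to find in xlblocks[idx] when that page is in WAL buffers.
 */
uint64_t
page_end_ptr(uint64_t ptr)
{
	return ptr + (SKETCH_BLCKSZ - ptr % SKETCH_BLCKSZ);
}

/*
 * Number of bytes to copy from the page containing 'ptr' when 'nbytes'
 * are still wanted: either everything that is left to read, or only the
 * part up to the page boundary.
 */
size_t
chunk_on_page(uint64_t ptr, size_t nbytes)
{
	size_t		left = (size_t) (SKETCH_BLCKSZ - ptr % SKETCH_BLCKSZ);

	assert(nbytes > 0);
	return nbytes <= left ? nbytes : left;
}
```

A request that crosses a page boundary is therefore served in several iterations, each one rechecking xlblocks for its own page.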

#54Andres Freund
andres@anarazel.de
In reply to: Bharath Rupireddy (#53)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hi,

On 2023-11-08 13:10:34 +0530, Bharath Rupireddy wrote:

+             /*
+              * The fact that we acquire WALBufMappingLock while reading the WAL
+              * buffer page itself guarantees that no one else initializes it or
+              * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
+              * need to ensure that we are not reading a page that just got
+              * initialized. For this, we look at the needed page header.
+              */
+             phdr = (XLogPageHeader) page;
+
+             /* Return, if WAL buffer page doesn't look valid. */
+             if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+                       phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+                       phdr->xlp_tli == tli))
+                     break;

I don't think this code should ever encounter a page where this is not the
case? We particularly shouldn't do so silently, seems that could hide all
kinds of problems.

I think it's possible to read a "just got initialized" page with the
new approach to read WAL buffer pages without WALBufMappingLock if the
page is read right after it is initialized and xlblocks is filled in
AdvanceXLInsertBuffer() but before actual WAL is written.

I think the code needs to make sure that *never* happens. That seems unrelated
to holding or not holding WALBufMappingLock. Even if the page header is
already valid, I don't think it's ok to just read/parse WAL data that's
concurrently being modified.

We can never allow WAL being read that's past
XLogBytePosToRecPtr(XLogCtl->Insert->CurrBytePos)
as it does not exist.

And if the to-be-read LSN is between
XLogCtl->LogwrtResult->Write and XLogBytePosToRecPtr(Insert->CurrBytePos)
we need to call WaitXLogInsertionsToFinish() before copying the data.
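
The two rules above split the LSN space into three regions. As a rough illustration only (the type and function names below are hypothetical and do not exist in PostgreSQL), the decision could be sketched like this, with plain 64-bit integers standing in for XLogRecPtr:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical outcome of checking a read request against the two positions. */
typedef enum ReadAction
{
	READ_DIRECTLY,				/* already written out: nobody modifies it */
	READ_AFTER_WAIT,			/* inserted but maybe in flight: wait first */
	READ_REFUSE					/* past the insert position: does not exist */
} ReadAction;

/*
 * Classify a read ending at 'read_end' given the current write position
 * and the current insert (reserved) position.
 */
ReadAction
classify_read(uint64_t read_end, uint64_t write_pos, uint64_t insert_pos)
{
	assert(write_pos <= insert_pos);

	if (read_end > insert_pos)
		return READ_REFUSE;
	if (read_end > write_pos)
		return READ_AFTER_WAIT;
	return READ_DIRECTLY;
}
```

In the real server the READ_AFTER_WAIT case would correspond to calling WaitXLogInsertionsToFinish() before copying; how the two positions are obtained cheaply is left open here.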

Greetings,

Andres Freund

#55Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#54)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, Nov 10, 2023 at 2:28 AM Andres Freund <andres@anarazel.de> wrote:

On 2023-11-08 13:10:34 +0530, Bharath Rupireddy wrote:

+             /*
+              * The fact that we acquire WALBufMappingLock while reading the WAL
+              * buffer page itself guarantees that no one else initializes it or
+              * makes it ready for next use in AdvanceXLInsertBuffer(). However, we
+              * need to ensure that we are not reading a page that just got
+              * initialized. For this, we look at the needed page header.
+              */
+             phdr = (XLogPageHeader) page;
+
+             /* Return, if WAL buffer page doesn't look valid. */
+             if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+                       phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+                       phdr->xlp_tli == tli))
+                     break;

I don't think this code should ever encounter a page where this is not the
case? We particularly shouldn't do so silently, seems that could hide all
kinds of problems.

I think it's possible to read a "just got initialized" page with the
new approach to read WAL buffer pages without WALBufMappingLock if the
page is read right after it is initialized and xlblocks is filled in
AdvanceXLInsertBuffer() but before actual WAL is written.

I think the code needs to make sure that *never* happens. That seems unrelated
to holding or not holding WALBufMappingLock. Even if the page header is
already valid, I don't think it's ok to just read/parse WAL data that's
concurrently being modified.

We can never allow WAL being read that's past
XLogBytePosToRecPtr(XLogCtl->Insert->CurrBytePos)
as it does not exist.

Agreed. Erroring out in XLogReadFromBuffers() if the requested WAL is past
CurrBytePos is an option. Another, cleaner way is to just let the
caller decide what it needs to do (retry or error out): fill an error
message in XLogReadFromBuffers() and return as if nothing was read, or
return a special negative error code like XLogDecodeNextRecord so that
the caller can deal with it.

Also, reading CurrBytePos under the insertpos_lck spinlock can get in the
way of concurrent inserters. A possible way is to turn both
CurrBytePos and PrevBytePos into 64-bit atomics so that
XLogReadFromBuffers() can read CurrBytePos atomically without any lock
and leave it to the caller to deal with non-existing WAL reads.
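
To make the idea concrete, here is a minimal sketch using C11 atomics rather than PostgreSQL's pg_atomic_* API. InsertPos, reserve_bytes() and read_insert_pos() are hypothetical names for illustration. The sketch tracks only a current position; keeping a previous-record position consistent with it, which the server currently gets by updating both under one spinlock, is not addressed here.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-in for the insert position, kept as a 64-bit atomic. */
typedef struct InsertPos
{
	_Atomic uint64_t curr_byte_pos;
} InsertPos;

/*
 * Inserter side: reserve 'size' bytes and return the start of the
 * reserved range. A single atomic add replaces the lock/update/unlock.
 */
uint64_t
reserve_bytes(InsertPos *pos, uint64_t size)
{
	assert(size > 0);
	return atomic_fetch_add(&pos->curr_byte_pos, size);
}

/* Reader side: observe the current insert position without any lock. */
uint64_t
read_insert_pos(InsertPos *pos)
{
	return atomic_load(&pos->curr_byte_pos);
}
```

A reader gets a consistent 64-bit value this way, but the value can be stale by the time it is used, so the caller still has to cope with WAL that does not exist yet.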

And if the to-be-read LSN is between
XLogCtl->LogwrtResult->Write and XLogBytePosToRecPtr(Insert->CurrBytePos)
we need to call WaitXLogInsertionsToFinish() before copying the data.

Agreed on waiting for all in-flight insertions to the pages we're about to
read to finish. But reading XLogCtl->LogwrtRqst.Write requires either the
XLogCtl->info_lck spinlock or WALWriteLock. Maybe turn
XLogCtl->LogwrtRqst.Write into a 64-bit atomic and read it without any
lock, then rely on WaitXLogInsertionsToFinish()'s return value, i.e. if
WaitXLogInsertionsToFinish() returns a value >= Insert->CurrBytePos,
then go read that page from WAL buffers.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#56Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#55)
3 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Mon, Nov 13, 2023 at 7:02 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

On Fri, Nov 10, 2023 at 2:28 AM Andres Freund <andres@anarazel.de> wrote:

I think the code needs to make sure that *never* happens. That seems unrelated
to holding or not holding WALBufMappingLock. Even if the page header is
already valid, I don't think it's ok to just read/parse WAL data that's
concurrently being modified.

We can never allow WAL being read that's past
XLogBytePosToRecPtr(XLogCtl->Insert->CurrBytePos)
as it does not exist.

Agreed. Erroring out in XLogReadFromBuffers() if the requested WAL is past
CurrBytePos is an option. Another, cleaner way is to just let the
caller decide what it needs to do (retry or error out): fill an error
message in XLogReadFromBuffers() and return as if nothing was read, or
return a special negative error code like XLogDecodeNextRecord so that
the caller can deal with it.

In the attached v17 patch, XLogReadFromBuffers() stops reading and
returns when the caller requests WAL that's past the current insert
position.

Also, reading CurrBytePos under the insertpos_lck spinlock can get in the
way of concurrent inserters. A possible way is to turn both
CurrBytePos and PrevBytePos into 64-bit atomics so that
XLogReadFromBuffers() can read CurrBytePos atomically without any lock
and leave it to the caller to deal with non-existing WAL reads.

And if the to-be-read LSN is between
XLogCtl->LogwrtResult->Write and XLogBytePosToRecPtr(Insert->CurrBytePos)
we need to call WaitXLogInsertionsToFinish() before copying the data.

Agreed on waiting for all in-flight insertions to the pages we're about to
read to finish. But reading XLogCtl->LogwrtRqst.Write requires either the
XLogCtl->info_lck spinlock or WALWriteLock. Maybe turn
XLogCtl->LogwrtRqst.Write into a 64-bit atomic and read it without any
lock, then rely on WaitXLogInsertionsToFinish()'s return value, i.e. if
WaitXLogInsertionsToFinish() returns a value >= Insert->CurrBytePos,
then go read that page from WAL buffers.

In the attached v17 patch, XLogReadFromBuffers() waits for all
in-progress insertions to finish when the caller requests WAL that's
past the current write position but not past the current insert
position.

I've also made XLogReadFromBuffers() return special result codes for
various scenarios (when asked to read in recovery, to read on a
different TLI, to read non-existent WAL, and so on) instead of
erroring out. This gives the caller the flexibility to decide what to
do.
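
As an illustration of what "letting the caller decide" might look like, here is a minimal sketch. The enum value names are taken from the v17 patch below, but their order and the whole decide_next_step() policy are assumptions made for this example; in v17 itself WALRead() ignores the result code and only looks at the number of bytes read.

```c
#include <assert.h>
#include <stddef.h>

/* Result codes named as in the v17 patch; the ordering here is assumed. */
typedef enum XLogReadFromBuffersResult
{
	XLREADBUFS_OK,
	XLREADBUFS_IN_RECOVERY,
	XLREADBUFS_NOT_INSERT_TLI,
	XLREADBUFS_INVALID_INPUT,
	XLREADBUFS_NON_EXISTENT_WAL,
	XLREADBUFS_UNINITIALIZED_WAL
} XLogReadFromBuffersResult;

/* Hypothetical caller-side decision, not part of the patch. */
typedef enum CallerAction
{
	CALLER_DONE,				/* everything came from WAL buffers */
	CALLER_READ_REST_FROM_FILE, /* miss or partial hit */
	CALLER_REPORT_ERROR			/* the request itself was bogus */
} CallerAction;

CallerAction
decide_next_step(XLogReadFromBuffersResult result,
				 size_t bytes_read, size_t bytes_requested)
{
	assert(bytes_read <= bytes_requested);

	switch (result)
	{
		case XLREADBUFS_INVALID_INPUT:
		case XLREADBUFS_NON_EXISTENT_WAL:
			return CALLER_REPORT_ERROR;
		case XLREADBUFS_OK:
		case XLREADBUFS_IN_RECOVERY:
		case XLREADBUFS_NOT_INSERT_TLI:
		case XLREADBUFS_UNINITIALIZED_WAL:
			break;
	}

	return (bytes_read == bytes_requested) ?
		CALLER_DONE : CALLER_READ_REST_FROM_FILE;
}
```

A stricter caller could treat XLREADBUFS_UNINITIALIZED_WAL as an error too; the point is only that the policy lives in the caller rather than in XLogReadFromBuffers().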

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v17-0001-Use-64-bit-atomics-for-xlblocks-array-elements.patch (application/octet-stream)
From 9264635eb097f0f2e85f733d358b67e6d038edad Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Fri, 1 Dec 2023 04:53:30 +0000
Subject: [PATCH v17] Use 64-bit atomics for xlblocks array elements

In AdvanceXLInsertBuffer(), xlblocks value of a WAL buffer page is
updated only at the end after the page is initialized with all
zeros. A problem with this approach is that anyone reading
xlblocks and WAL buffer page without holding WALBufMappingLock
will see the wrong page contents if the read happens before the
xlblocks is marked with a new entry in AdvanceXLInsertBuffer() at
the end.

To fix this issue, xlblocks is made to use 64-bit atomics instead
of XLogRecPtr and the xlblocks value is marked with
InvalidXLogRecPtr just before the page initialization begins. Once
the page initialization finishes, only then the actual value of
the newly initialized page is marked in xlblocks. A write barrier
is placed in between xlblocks update with InvalidXLogRecPtr and
the page initialization to not cause any memory ordering problems.

With this fix, one can read xlblocks and WAL buffer page without
WALBufMappingLock in the following manner:

  endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);

  /* Requested WAL isn't available in WAL buffers. */
  if (expectedEndPtr != endptr)
      break;

  page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
  data = page + ptr % XLOG_BLCKSZ;
  ...
  pg_read_barrier();
  ...
  memcpy(buf, data, bytes_to_read);
  ...
  pg_read_barrier();

  /* Recheck if the page still exists in WAL buffers. */
  endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);

  /* Return if the page got initialized while we were reading it */
  if (expectedEndPtr != endptr)
      break;
---
 src/backend/access/transam/xlog.c | 55 ++++++++++++++++++++-----------
 1 file changed, 35 insertions(+), 20 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6526bd4f43..b3c08f3980 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -501,7 +501,7 @@ typedef struct XLogCtlData
 	 * WALBufMappingLock.
 	 */
 	char	   *pages;			/* buffers for unwritten XLOG pages */
-	XLogRecPtr *xlblocks;		/* 1st byte ptr-s + XLOG_BLCKSZ */
+	pg_atomic_uint64 *xlblocks; /* 1st byte ptr-s + XLOG_BLCKSZ */
 	int			XLogCacheBlck;	/* highest allocated xlog buffer index */
 
 	/*
@@ -1634,20 +1634,19 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	 * out to disk and evicted, and the caller is responsible for making sure
 	 * that doesn't happen.
 	 *
-	 * However, we don't hold a lock while we read the value. If someone has
-	 * just initialized the page, it's possible that we get a "torn read" of
-	 * the XLogRecPtr if 64-bit fetches are not atomic on this platform. In
-	 * that case we will see a bogus value. That's ok, we'll grab the mapping
-	 * lock (in AdvanceXLInsertBuffer) and retry if we see anything else than
-	 * the page we're looking for. But it means that when we do this unlocked
-	 * read, we might see a value that appears to be ahead of the page we're
-	 * looking for. Don't PANIC on that, until we've verified the value while
-	 * holding the lock.
+	 * However, we don't hold a lock while we read the value. If someone is
+	 * just about to initialize or has just initialized the page, it's
+	 * possible that we get InvalidXLogRecPtr. That's ok, we'll grab the
+	 * mapping lock (in AdvanceXLInsertBuffer) and retry if we see anything
+	 * else than the page we're looking for. But it means that when we do this
+	 * unlocked read, we might see a value that appears to be ahead of the
+	 * page we're looking for. Don't PANIC on that, until we've verified the
+	 * value while holding the lock.
 	 */
 	expectedEndPtr = ptr;
 	expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
 
-	endptr = XLogCtl->xlblocks[idx];
+	endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
 	if (expectedEndPtr != endptr)
 	{
 		XLogRecPtr	initializedUpto;
@@ -1678,7 +1677,7 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 		WALInsertLockUpdateInsertingAt(initializedUpto);
 
 		AdvanceXLInsertBuffer(ptr, tli, false);
-		endptr = XLogCtl->xlblocks[idx];
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
 
 		if (expectedEndPtr != endptr)
 			elog(PANIC, "could not find WAL buffer for %X/%X",
@@ -1865,7 +1864,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 		 * be zero if the buffer hasn't been used yet).  Fall through if it's
 		 * already written out.
 		 */
-		OldPageRqstPtr = XLogCtl->xlblocks[nextidx];
+		OldPageRqstPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[nextidx]);
 		if (LogwrtResult.Write < OldPageRqstPtr)
 		{
 			/*
@@ -1934,6 +1933,20 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 
 		NewPage = (XLogPageHeader) (XLogCtl->pages + nextidx * (Size) XLOG_BLCKSZ);
 
+		/*
+		 * Make sure to mark the xlblocks with InvalidXLogRecPtr before the
+		 * initialization of the page begins so that others reading xlblocks
+		 * without holding a lock, will know that the page initialization has
+		 * just begun.
+		 */
+		pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], InvalidXLogRecPtr);
+
+		/*
+		 * A write barrier here helps to not reorder the above xlblocks atomic
+		 * write with below page initialization.
+		 */
+		pg_write_barrier();
+
 		/*
 		 * Be sure to re-zero the buffer so that bytes beyond what we've
 		 * written will look like zeroes and not valid XLOG records...
@@ -1987,8 +2000,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 		 */
 		pg_write_barrier();
 
-		*((volatile XLogRecPtr *) &XLogCtl->xlblocks[nextidx]) = NewPageEndPtr;
-
+		pg_atomic_write_u64(&XLogCtl->xlblocks[nextidx], NewPageEndPtr);
 		XLogCtl->InitializedUpTo = NewPageEndPtr;
 
 		npages++;
@@ -2206,7 +2218,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
 		 * if we're passed a bogus WriteRqst.Write that is past the end of the
 		 * last page that's been initialized by AdvanceXLInsertBuffer.
 		 */
-		XLogRecPtr	EndPtr = XLogCtl->xlblocks[curridx];
+		XLogRecPtr	EndPtr = pg_atomic_read_u64(&XLogCtl->xlblocks[curridx]);
 
 		if (LogwrtResult.Write >= EndPtr)
 			elog(PANIC, "xlog write request %X/%X is past end of log %X/%X",
@@ -4708,10 +4720,13 @@ XLOGShmemInit(void)
 	 * needed here.
 	 */
 	allocptr = ((char *) XLogCtl) + sizeof(XLogCtlData);
-	XLogCtl->xlblocks = (XLogRecPtr *) allocptr;
-	memset(XLogCtl->xlblocks, 0, sizeof(XLogRecPtr) * XLOGbuffers);
-	allocptr += sizeof(XLogRecPtr) * XLOGbuffers;
+	XLogCtl->xlblocks = (pg_atomic_uint64 *) allocptr;
+	allocptr += sizeof(pg_atomic_uint64) * XLOGbuffers;
 
+	for (i = 0; i < XLOGbuffers; i++)
+	{
+		pg_atomic_init_u64(&XLogCtl->xlblocks[i], InvalidXLogRecPtr);
+	}
 
 	/* WAL insertion locks. Ensure they're aligned to the full padded size */
 	allocptr += sizeof(WALInsertLockPadded) -
@@ -5748,7 +5763,7 @@ StartupXLOG(void)
 		memcpy(page, endOfRecoveryInfo->lastPage, len);
 		memset(page + len, 0, XLOG_BLCKSZ - len);
 
-		XLogCtl->xlblocks[firstIdx] = endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ;
+		pg_atomic_write_u64(&XLogCtl->xlblocks[firstIdx], endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ);
 		XLogCtl->InitializedUpTo = endOfRecoveryInfo->lastPageBeginPtr + XLOG_BLCKSZ;
 	}
 	else
-- 
2.34.1
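
The invalidate/initialize/publish sequence on the writer side and the check/copy/recheck sequence on the reader side can be tried outside the server. The following is a minimal sketch using C11 atomics and fences as rough equivalents of pg_atomic_write_u64(), pg_write_barrier() and pg_read_barrier(); Slot, slot_replace() and slot_read() are hypothetical names, and 0 plays the role of InvalidXLogRecPtr. Strictly speaking, copying the plain char array while another thread rewrites it is a data race under the C11 memory model, so treat this as an illustration of the ordering only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SLOT_SIZE 64			/* tiny stand-in for XLOG_BLCKSZ */

typedef struct Slot
{
	_Atomic uint64_t end_ptr;	/* 0 means "being (re)initialized" */
	char		data[SLOT_SIZE];
} Slot;

/* Writer: invalidate the tag, rewrite the contents, publish the new tag. */
void
slot_replace(Slot *slot, uint64_t new_end_ptr, const char *contents, size_t len)
{
	assert(len <= SLOT_SIZE && new_end_ptr != 0);

	atomic_store_explicit(&slot->end_ptr, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);	/* tag reset before init */

	memset(slot->data, 0, SLOT_SIZE);
	memcpy(slot->data, contents, len);

	atomic_thread_fence(memory_order_release);	/* init before new tag */
	atomic_store_explicit(&slot->end_ptr, new_end_ptr, memory_order_relaxed);
}

/* Reader: check the tag, copy, recheck. Returns false if 'out' is unusable. */
bool
slot_read(Slot *slot, uint64_t expected_end_ptr, char *out, size_t len)
{
	assert(len <= SLOT_SIZE);

	if (atomic_load_explicit(&slot->end_ptr, memory_order_relaxed) != expected_end_ptr)
		return false;
	atomic_thread_fence(memory_order_acquire);	/* tag read before copy */

	memcpy(out, slot->data, len);

	atomic_thread_fence(memory_order_acquire);	/* copy before recheck */
	return atomic_load_explicit(&slot->end_ptr, memory_order_relaxed) == expected_end_ptr;
}
```

If the slot is recycled at any point during the copy, the reader sees either 0 or a different end pointer on the recheck and discards what it copied.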

v17-0002-Allow-WAL-reading-from-WAL-buffers.patch (application/octet-stream)
From d855d4e01ef552c79267a2c06904200dfc637224 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 7 Dec 2023 08:17:06 +0000
Subject: [PATCH v17] Allow WAL reading from WAL buffers

This commit gives WALRead() the capability to read WAL from WAL
buffers when possible. When requested WAL isn't available in WAL
buffers, the WAL is read from the WAL file as usual.

This commit benefits the callers of WALRead(), namely walsenders
and pg_walinspect. They can now avoid reading WAL from the WAL
file (possibly avoiding disk IO). Tests show that the WAL buffers
hit ratio stood at 95% for 1 primary, 1 sync standby, 1 async
standby, with pgbench --scale=300 --client=32 --time=900. In other
words, the walsenders avoided reading from the file, and hence the
pread system calls, 95% of the time:
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com

This commit also helps when direct IO is enabled for WAL.
Reading WAL from WAL buffers brings performance back close to
that without direct IO for WAL:
https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com

This commit paves the way for the following features in future:
- Improves synchronous replication performance by replicating
directly from WAL buffers.
- An opt-in way for the walreceivers to receive unflushed WAL.
More details here:
https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar, Andres Freund
Reviewed-by: Nathan Bossart, Kuntal Ghosh
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/backend/access/transam/xlog.c       | 189 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  47 +++++-
 src/backend/access/transam/xlogutils.c  |  11 +-
 src/backend/replication/walsender.c     |  10 +-
 src/include/access/xlog.h               |  23 +++
 5 files changed, 268 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4ebd918198..b0eb6d5d56 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1707,6 +1707,195 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * This function reads 'bytes_to_read' bytes of WAL from WAL buffers into
+ * 'buf' starting at location 'startptr' on timeline 'tli' and returns
+ * appropriate result code and fills total read bytes if any into
+ * 'bytes_read'.
+ *
+ * Points to note:
+ *
+ * - This function reads as much as it can from WAL buffers, meaning, it may
+ * not read all the requested 'bytes_to_read' bytes. Caller must be aware of
+ * this and deal with it.
+ *
+ * - This function reads WAL from WAL buffers without holding any lock. First
+ * it reads xlblocks atomically for checking page existence, then it reads the
+ * page contents, validates. Finally, it rechecks the page existence by
+ * rereading xlblocks, if the read page is replaced, it discards read page and
+ * returns.
+ *
+ * - This function is not available for frontend code as WAL buffers is an
+ * internal mechanism to the server.
+ *
+ * - Caller must look at the result code to take appropriate action such as
+ * error out on failure or emit warning or continue.
+ *
+ * - This function waits for any in-progress WAL insertions to WAL buffers to
+ * finish.
+ */
+XLogReadFromBuffersResult
+XLogReadFromBuffers(XLogRecPtr startptr,
+					TimeLineID tli,
+					Size bytes_to_read,
+					char *buf,
+					Size *bytes_read)
+{
+	XLogRecPtr	ptr;
+	Size		nbytes;
+	char	   *dst;
+	uint64		bytepos;
+	XLogReadFromBuffersResult result = XLREADBUFS_OK;
+
+	*bytes_read = 0;
+
+	/* WAL buffers aren't in use when server is in recovery. */
+	if (RecoveryInProgress())
+		return XLREADBUFS_IN_RECOVERY;
+
+	/* WAL is inserted into WAL buffers on current server's insertion TLI. */
+	if (tli != GetWALInsertionTimeLine())
+		return XLREADBUFS_NOT_INSERT_TLI;
+
+	if (XLogRecPtrIsInvalid(startptr))
+		return XLREADBUFS_INVALID_INPUT;
+
+	ptr = startptr;
+	nbytes = bytes_to_read;
+	dst = buf;
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int			idx;
+		char	   *page;
+		char	   *data;
+		XLogPageHeader phdr;
+		Size		nread;
+		XLogRecPtr	reservedUpto;
+		XLogwrtResult LogwrtResult;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+
+		/* Requested WAL isn't available in WAL buffers. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/*
+		 * Make sure we don't read xlblocks up above before the page contents
+		 * down below.
+		 */
+		pg_read_barrier();
+
+		nread = 0;
+
+		/* Read what is wanted, not the whole page. */
+		if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			nread = nbytes;
+		}
+		else
+		{
+			/*
+			 * All the bytes are not in one page. Read available bytes on the
+			 * current page, copy them over to output buffer and continue to
+			 * read remaining bytes.
+			 */
+			nread = XLOG_BLCKSZ - (data - page);
+			Assert(nread > 0 && nread <= nbytes);
+		}
+
+		Assert(nread > 0);
+		memcpy(dst, data, nread);
+
+		/*
+		 * Make sure we don't read xlblocks down below before the page
+		 * contents up above.
+		 */
+		pg_read_barrier();
+
+		/* Recheck if the read page still exists in WAL buffers. */
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+
+		/* Return if the page got initialized while we were reading it. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/* Read the current insert position */
+		SpinLockAcquire(&XLogCtl->Insert.insertpos_lck);
+		bytepos = XLogCtl->Insert.CurrBytePos;
+		SpinLockRelease(&XLogCtl->Insert.insertpos_lck);
+
+		reservedUpto = XLogBytePosToEndRecPtr(bytepos);
+
+		/*
+		 * We can't allow reading WAL that is past the current insert
+		 * position, as it does not yet exist.
+		 */
+		if ((ptr + nread) > reservedUpto)
+		{
+			result = XLREADBUFS_NON_EXISTENT_WAL;
+			break;
+		}
+
+		SpinLockAcquire(&XLogCtl->info_lck);
+		LogwrtResult = XLogCtl->LogwrtResult;
+		SpinLockRelease(&XLogCtl->info_lck);
+
+		/* Wait for any in-progress WAL insertions to WAL buffers to finish. */
+		if ((ptr + nread) > LogwrtResult.Write &&
+			(ptr + nread) <= reservedUpto)
+			WaitXLogInsertionsToFinish(ptr + nread);
+
+		/*
+		 * Typically, we must not read a WAL buffer page that just got
+		 * initialized, because we waited enough for the in-progress WAL
+		 * insertions to finish above. However, there can exist a slight
+		 * window after the above wait finishes in which the read buffer page
+		 * can get replaced, especially under high WAL generation rates. So,
+		 * let's not count such a buffer page.
+		 */
+		phdr = (XLogPageHeader) page;
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+		{
+			result = XLREADBUFS_UNINITIALIZED_WAL;
+			break;
+		}
+
+		dst += nread;
+		ptr += nread;
+		*bytes_read += nread;
+		nbytes -= nread;
+	}
+
+	/* We never read more than what the caller has asked for. */
+	Assert(*bytes_read <= bytes_to_read);
+
+	ereport(DEBUG1,
+			errmsg_internal("read %zu bytes out of %zu bytes from WAL buffers for given start LSN %X/%X, timeline ID %u",
+							*bytes_read, bytes_to_read,
+							LSN_FORMAT_ARGS(startptr), tli));
+
+	return result;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index e0baa86bd3..bb8871b671 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1473,17 +1473,54 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers. When
+ * requested WAL isn't available in WAL buffers, the WAL is read from the WAL
+ * file as usual. The callers may avoid reading WAL from the WAL file thus
+ * reducing read system calls or even disk IOs.
  */
 bool
-WALRead(XLogReaderState *state,
-		char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
-		WALReadError *errinfo)
+WALRead(XLogReaderState *state, char *buf, XLogRecPtr startptr,
+		Size count, TimeLineID tli, WALReadError *errinfo)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	Size		nread;
+#endif
+
+#ifndef FRONTEND
+
+	/*
+	 * Try reading WAL from WAL buffers. Frontend code has no idea of WAL
+	 * buffers.
+	 */
+	(void) XLogReadFromBuffers(startptr, tli, count, buf, &nread);
+
+	if (nread > 0)
+	{
+		/*
+		 * Check if we have read fully (hit), partially (partial hit) or
+		 * nothing (miss) from WAL buffers. If we have read either partially
+		 * or nothing, then continue to read the remaining bytes the usual
+		 * way, that is, read from WAL file.
+		 *
+		 * XXX: It might be worth exposing WAL buffer read stats.
+		 */
+		if (nread == count)
+			return true;		/* Buffer hit, so return. */
+		else if (nread < count)
+		{
+			/*
+			 * Buffer partial hit, so reset the state to count the read bytes
+			 * and continue.
+			 */
+			buf += nread;
+			startptr += nread;
+			count -= nread;
+		}
+	}
+#endif
 
 	p = buf;
 	recptr = startptr;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 43f7b31205..057c9b4ea0 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -1007,12 +1007,13 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	}
 
 	/*
-	 * Even though we just determined how much of the page can be validly read
-	 * as 'count', read the whole page anyway. It's guaranteed to be
-	 * zero-padded up to the page boundary if it's incomplete.
+	 * We determined how much of the page can be validly read as 'count'; read
+	 * only that much, not the entire page. WALRead() may read the page from
+	 * WAL buffers, in which case the page is not guaranteed to be
+	 * zero-padded up to the page boundary because of concurrent
+	 * insertions.
 	 */
-	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
-				 &errinfo))
+	if (!WALRead(state, cur_page, targetPagePtr, count, tli, &errinfo))
 		WALReadRaiseError(&errinfo);
 
 	/* number of valid bytes in the buffer */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3bc9c82389..a00bcd30bf 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -943,11 +943,17 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	else
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
-	/* now actually read the data, we know it's there */
+	/*
+	 * We determined how much of the page can be validly read as 'count'; read
+	 * only that much, not the entire page. WALRead() may read the page from
+	 * WAL buffers, in which case the page is not guaranteed to be
+	 * zero-padded up to the page boundary because of concurrent
+	 * insertions.
+	 */
 	if (!WALRead(state,
 				 cur_page,
 				 targetPagePtr,
-				 XLOG_BLCKSZ,
+				 count,
 				 currTLI,		/* Pass the current TLI because only
 								 * WalSndSegmentOpen controls whether new TLI
 								 * is needed. */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..9035a12a7b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -193,6 +193,23 @@ typedef enum WALAvailability
 	WALAVAIL_REMOVED,			/* WAL segment has been removed */
 } WALAvailability;
 
+/* Return values from XLogReadFromBuffers. */
+typedef enum XLogReadFromBuffersResult
+{
+	XLREADBUFS_OK = 0,			/* no error */
+	XLREADBUFS_INVALID_INPUT = -1,	/* invalid startptr */
+	XLREADBUFS_IN_RECOVERY = -2,	/* read attempted when in recovery */
+
+	/* read attempted with TLI that's different from server insertion TLI */
+	XLREADBUFS_NOT_INSERT_TLI = -3,
+
+	/* read attempted for non-existent WAL */
+	XLREADBUFS_NON_EXISTENT_WAL = -4,
+
+	/* uninitialized WAL buffer page */
+	XLREADBUFS_UNINITIALIZED_WAL = -5
+} XLogReadFromBuffersResult;
+
 struct XLogRecData;
 struct XLogReaderState;
 
@@ -251,6 +268,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern XLogReadFromBuffersResult XLogReadFromBuffers(XLogRecPtr startptr,
+													 TimeLineID tli,
+													 Size bytes_to_read,
+													 char *buf,
+													 Size *bytes_read);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

v17-0003-Add-test-module-for-verifying-WAL-read-from-WAL-.patch (application/octet-stream)
From be13735c7fc641d1160d85ad9069c404b94adc5a Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 7 Dec 2023 08:27:41 +0000
Subject: [PATCH v17] Add test module for verifying WAL read from WAL buffers

This commit adds a test module to verify WAL read from WAL
buffers.

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 ++++++++
 .../test_wal_read_from_buffers/meson.build    | 33 ++++++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 54 +++++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 ++++++
 .../test_wal_read_from_buffers.c              | 46 ++++++++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 182 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 5d33fa6a9a..64a051ce1c 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -33,6 +33,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi \
 		  xid_wraparound
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index b76f588559..52b0cd5812 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,6 +30,7 @@ subdir('test_resowner')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
 subdir('xid_wraparound')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7472494501
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40bd5dcd33
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_wal_read_from_buffers
+
+test_install_data += files(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..1d842bb02e
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,54 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers;');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn');});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+# Check with a WAL that doesn't yet exist.
+$lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_flush_lsn()+8192;');
+$result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn');});
+is($result, 'f', "WAL that doesn't yet exist is not read from WAL buffers");
+
+# Check with invalid input.
+$result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('0/0');});
+is($result, 'f', "WAL is not read from WAL buffers with invalid input");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..c6ffb3fa65
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    read_successful OUT boolean
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..aff609ead7
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,46 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	char	data[XLOG_BLCKSZ] = {0};
+	Size	nread;
+	XLogReadFromBuffersResult	result;
+	bool	is_read;
+
+	result = XLogReadFromBuffers(PG_GETARG_LSN(0),
+								 GetWALInsertionTimeLine(),
+							     XLOG_BLCKSZ,
+							     data,
+							     &nread);
+
+	if (nread > 0)
+		is_read = true;
+	else
+		is_read = false;
+
+	PG_RETURN_BOOL(is_read);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..eda8d47954
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

#57Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#56)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Thu, 2023-12-07 at 15:59 +0530, Bharath Rupireddy wrote:

In the attached v17 patch

0001 could impact performance in a few ways:

* There's one additional write barrier inside
AdvanceXLInsertBuffer()
* AdvanceXLInsertBuffer() already holds WALBufMappingLock, so
the atomic access inside of it is somewhat redundant
* On some platforms, the XLogCtlData structure size will change

The patch has been out for a while and nobody seems concerned about
those things, and they look fine to me, so I assume these are not real
problems. I just wanted to highlight them.

Also, the description and the comments seem off. The patch does two
things: (a) make it possible to read a page without a lock, which means
we need to mark with InvalidXLogRecPtr while it's being initialized;
and (b) use 64-bit atomics to make it safer (or at least more
readable).

(a) feels like the most important thing, and it's a hard requirement
for the rest of the work, right?

(b) seems like an implementation choice, and I agree with it on
readability grounds.

Also:

+  * But it means that when we do this
+  * unlocked read, we might see a value that appears to be ahead of the
+  * page we're looking for. Don't PANIC on that, until we've verified the
+  * value while holding the lock.

Is that still true even without a torn read?

The code for 0001 itself looks good. These are minor concerns and I am
inclined to commit something like it fairly soon.

Regards,
Jeff Davis
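
Point (a) in the message above (reading a page without a lock, with the slot marked InvalidXLogRecPtr while it is being initialized) can be illustrated with a minimal, standalone sketch. This is not the patch's code: DemoWalBuffers, demo_init_page() and demo_read() are hypothetical names, C11 atomics and fences stand in for pg_atomic_read_u64() and pg_read_barrier(), and a strict reading of the C11 memory model would still treat a truly concurrent memcpy() of the page as a data race. It only shows the check, copy, recheck order.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BLCKSZ   8192	/* stand-in for XLOG_BLCKSZ */
#define DEMO_NSLOTS   4		/* hypothetical number of WAL buffer slots */
#define DEMO_INVALID  0		/* stand-in for InvalidXLogRecPtr */

/* Hypothetical, simplified stand-in for the shared WAL buffer state. */
typedef struct DemoWalBuffers
{
	_Atomic uint64_t end_ptr[DEMO_NSLOTS];	/* end position of page in slot */
	char		pages[DEMO_NSLOTS][DEMO_BLCKSZ];
} DemoWalBuffers;

/* Map a WAL position to its buffer slot. */
static size_t
demo_slot(uint64_t ptr)
{
	return (size_t) ((ptr / DEMO_BLCKSZ) % DEMO_NSLOTS);
}

/* Mark every slot as holding no page. */
static void
demo_reset(DemoWalBuffers *b)
{
	for (int i = 0; i < DEMO_NSLOTS; i++)
		atomic_init(&b->end_ptr[i], DEMO_INVALID);
}

/*
 * Writer side: mark the slot invalid, fill the page, then publish the new
 * end position. 'page_start' must be a multiple of DEMO_BLCKSZ.
 */
static void
demo_init_page(DemoWalBuffers *b, uint64_t page_start, char fill)
{
	size_t		idx = demo_slot(page_start);

	atomic_store(&b->end_ptr[idx], DEMO_INVALID);
	atomic_thread_fence(memory_order_release);	/* write barrier */
	memset(b->pages[idx], fill, DEMO_BLCKSZ);
	atomic_thread_fence(memory_order_release);	/* write barrier */
	atomic_store(&b->end_ptr[idx], page_start + DEMO_BLCKSZ);
}

/*
 * Reader side: check the slot, copy the bytes, then recheck the slot.
 * Returns true only if the same page was present before and after the copy.
 * For brevity this handles reads that stay within a single page.
 */
static bool
demo_read(DemoWalBuffers *b, uint64_t ptr, char *dst, size_t len)
{
	size_t		idx = demo_slot(ptr);
	size_t		off = (size_t) (ptr % DEMO_BLCKSZ);
	uint64_t	expected_end = ptr - off + DEMO_BLCKSZ;

	if (off + len > DEMO_BLCKSZ)
		return false;
	if (atomic_load(&b->end_ptr[idx]) != expected_end)
		return false;			/* page not in the buffers */

	atomic_thread_fence(memory_order_acquire);	/* read barrier */
	memcpy(dst, b->pages[idx] + off, len);
	atomic_thread_fence(memory_order_acquire);	/* read barrier */

	/* If the slot changed while copying, the copied bytes are discarded. */
	return atomic_load(&b->end_ptr[idx]) == expected_end;
}
```

A caller would treat a false return as a miss and fall back to reading the WAL file. The invalid marker matters because a reader that saw the old end position, copied during re-initialization, and then saw the new end position would fail the recheck either way; the marker additionally keeps a reader from ever matching a half-initialized page.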

#58Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Jeff Davis (#57)
2 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, Dec 8, 2023 at 6:04 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Thu, 2023-12-07 at 15:59 +0530, Bharath Rupireddy wrote:

In the attached v17 patch

The code for 0001 itself looks good. These are minor concerns and I am
inclined to commit something like it fairly soon.

Thanks. Attaching remaining patches as v18 patch-set after commits
c3a8e2a7cb16 and 766571be1659.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v18-0001-Allow-WAL-reading-from-WAL-buffers.patch (application/octet-stream)
From d41b37f65f5a8266d0e18bdfb320079a40a7999b Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 20 Dec 2023 10:00:19 +0000
Subject: [PATCH v18] Allow WAL reading from WAL buffers

This commit gives WALRead() the capability to read WAL from WAL
buffers when possible. When the requested WAL isn't available in WAL
buffers, the WAL is read from the WAL file as usual.

This commit benefits the callers of WALRead(), namely walsenders
and pg_walinspect. They can now avoid reading WAL from the WAL
file (possibly avoiding disk IO). Tests show that the WAL buffers
hit ratio stood at 95% for 1 primary, 1 sync standby, 1 async
standby, with pgbench --scale=300 --client=32 --time=900. In other
words, 95% of the time the walsenders avoided reading from the
file and the associated pread system calls:
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com

This commit also helps when direct IO is enabled for WAL.
Reading WAL from WAL buffers brings performance back close to
that without direct IO for WAL:
https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com

This commit paves the way for the following features in the future:
- Improving synchronous replication performance by replicating
directly from WAL buffers.
- An opt-in way for the walreceivers to receive unflushed WAL.
More details here:
https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar, Andres Freund
Reviewed-by: Nathan Bossart, Kuntal Ghosh
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/backend/access/transam/xlog.c       | 189 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  47 +++++-
 src/backend/access/transam/xlogutils.c  |  11 +-
 src/backend/replication/walsender.c     |  10 +-
 src/include/access/xlog.h               |  23 +++
 5 files changed, 268 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 56e4d6fb02..86dbea0a26 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1704,6 +1704,195 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * This function reads 'bytes_to_read' bytes of WAL from WAL buffers into
+ * 'buf' starting at location 'startptr' on timeline 'tli' and returns
+ * appropriate result code and fills total read bytes if any into
+ * 'bytes_read'.
+ *
+ * Points to note:
+ *
+ * - This function reads as much as it can from WAL buffers, meaning, it may
+ * not read all the requested 'bytes_to_read' bytes. Caller must be aware of
+ * this and deal with it.
+ *
+ * - This function reads WAL from WAL buffers without holding any lock. First
+ * it reads xlblocks atomically to check page existence, then it copies the
+ * page contents. Next, it rechecks page existence by rereading xlblocks; if
+ * the page was replaced, it discards what it read and returns. Finally, it
+ * validates the page header.
+ *
+ * - This function is not available for frontend code as WAL buffers are a
+ * server-internal mechanism.
+ *
+ * - Caller must look at the result code to take appropriate action such as
+ * error out on failure or emit warning or continue.
+ *
+ * - This function waits for any in-progress WAL insertions to WAL buffers to
+ * finish.
+ */
+XLogReadFromBuffersResult
+XLogReadFromBuffers(XLogRecPtr startptr,
+					TimeLineID tli,
+					Size bytes_to_read,
+					char *buf,
+					Size *bytes_read)
+{
+	XLogRecPtr	ptr;
+	Size		nbytes;
+	char	   *dst;
+	uint64		bytepos;
+	XLogReadFromBuffersResult result = XLREADBUFS_OK;
+
+	*bytes_read = 0;
+
+	/* WAL buffers aren't in use when server is in recovery. */
+	if (RecoveryInProgress())
+		return XLREADBUFS_IN_RECOVERY;
+
+	/* WAL is inserted into WAL buffers on current server's insertion TLI. */
+	if (tli != GetWALInsertionTimeLine())
+		return XLREADBUFS_NOT_INSERT_TLI;
+
+	if (XLogRecPtrIsInvalid(startptr))
+		return XLREADBUFS_INVALID_INPUT;
+
+	ptr = startptr;
+	nbytes = bytes_to_read;
+	dst = buf;
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int			idx;
+		char	   *page;
+		char	   *data;
+		XLogPageHeader phdr;
+		Size		nread;
+		XLogRecPtr	reservedUpto;
+		XLogwrtResult LogwrtResult;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+
+		/* Requested WAL isn't available in WAL buffers. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/*
+		 * Make sure we don't read xlblocks up above before the page contents
+		 * down below.
+		 */
+		pg_read_barrier();
+
+		nread = 0;
+
+		/* Read what is wanted, not the whole page. */
+		if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			nread = nbytes;
+		}
+		else
+		{
+			/*
+			 * All the bytes are not in one page. Read available bytes on the
+			 * current page, copy them over to output buffer and continue to
+			 * read remaining bytes.
+			 */
+			nread = XLOG_BLCKSZ - (data - page);
+			Assert(nread > 0 && nread <= nbytes);
+		}
+
+		Assert(nread > 0);
+		memcpy(dst, data, nread);
+
+		/*
+		 * Make sure we don't read xlblocks down below before the page
+		 * contents up above.
+		 */
+		pg_read_barrier();
+
+		/* Recheck if the read page still exists in WAL buffers. */
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+
+		/* Return if the page got initialized while we were reading it. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/* Read the current insert position */
+		SpinLockAcquire(&XLogCtl->Insert.insertpos_lck);
+		bytepos = XLogCtl->Insert.CurrBytePos;
+		SpinLockRelease(&XLogCtl->Insert.insertpos_lck);
+
+		reservedUpto = XLogBytePosToEndRecPtr(bytepos);
+
+		/*
+		 * We can't allow reading WAL past the current insert position,
+		 * as it does not yet exist.
+		 */
+		if ((ptr + nread) > reservedUpto)
+		{
+			result = XLREADBUFS_NON_EXISTENT_WAL;
+			break;
+		}
+
+		SpinLockAcquire(&XLogCtl->info_lck);
+		LogwrtResult = XLogCtl->LogwrtResult;
+		SpinLockRelease(&XLogCtl->info_lck);
+
+		/* Wait for any in-progress WAL insertions to WAL buffers to finish. */
+		if ((ptr + nread) > LogwrtResult.Write &&
+			(ptr + nread) <= reservedUpto)
+			WaitXLogInsertionsToFinish(ptr + nread);
+
+		/*
+		 * Typically, we must not read a WAL buffer page that just got
+		 * initialized, because we waited enough for the in-progress WAL
+		 * insertions to finish above. However, there can exist a slight
+		 * window after the above wait finishes in which the read buffer page
+		 * can get replaced, especially under high WAL generation rates. So,
+		 * let's not count such a buffer page.
+		 */
+		phdr = (XLogPageHeader) page;
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+		{
+			result = XLREADBUFS_UNINITIALIZED_WAL;
+			break;
+		}
+
+		dst += nread;
+		ptr += nread;
+		*bytes_read += nread;
+		nbytes -= nread;
+	}
+
+	/* We never read more than what the caller has asked for. */
+	Assert(*bytes_read <= bytes_to_read);
+
+	ereport(DEBUG1,
+			errmsg_internal("read %zu bytes out of %zu bytes from WAL buffers for given start LSN %X/%X, timeline ID %u",
+							*bytes_read, bytes_to_read,
+							LSN_FORMAT_ARGS(startptr), tli));
+
+	return result;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 6b404b8169..631bd4fe6b 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1501,17 +1501,54 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads data directly from WAL buffers. When
+ * requested WAL isn't available in WAL buffers, the WAL is read from the WAL
+ * file as usual. The callers may avoid reading WAL from the WAL file thus
+ * reducing read system calls or even disk IOs.
  */
 bool
-WALRead(XLogReaderState *state,
-		char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
-		WALReadError *errinfo)
+WALRead(XLogReaderState *state, char *buf, XLogRecPtr startptr,
+		Size count, TimeLineID tli, WALReadError *errinfo)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	Size		nread;
+#endif
+
+#ifndef FRONTEND
+
+	/*
+	 * Try reading WAL from WAL buffers. Frontend code has no idea of WAL
+	 * buffers.
+	 */
+	(void) XLogReadFromBuffers(startptr, tli, count, buf, &nread);
+
+	if (nread > 0)
+	{
+		/*
+		 * Check if we have read fully (hit), partially (partial hit) or
+		 * nothing (miss) from WAL buffers. If we have read either partially
+		 * or nothing, then continue to read the remaining bytes the usual
+		 * way, that is, read from WAL file.
+		 *
+		 * XXX: It might be worth exposing WAL buffer read stats.
+		 */
+		if (nread == count)
+			return true;		/* Buffer hit, so return. */
+		else if (nread < count)
+		{
+			/*
+			 * Buffer partial hit, so reset the state to count the read bytes
+			 * and continue.
+			 */
+			buf += nread;
+			startptr += nread;
+			count -= nread;
+		}
+	}
+#endif
 
 	p = buf;
 	recptr = startptr;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 43f7b31205..057c9b4ea0 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -1007,12 +1007,13 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	}
 
 	/*
-	 * Even though we just determined how much of the page can be validly read
-	 * as 'count', read the whole page anyway. It's guaranteed to be
-	 * zero-padded up to the page boundary if it's incomplete.
+	 * We determined how much of the page can be validly read as 'count'; read
+	 * only that much, not the entire page. WALRead() may read the page from
+	 * WAL buffers, in which case the page is not guaranteed to be
+	 * zero-padded up to the page boundary because of concurrent
+	 * insertions.
 	 */
-	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
-				 &errinfo))
+	if (!WALRead(state, cur_page, targetPagePtr, count, tli, &errinfo))
 		WALReadRaiseError(&errinfo);
 
 	/* number of valid bytes in the buffer */
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 3bc9c82389..a00bcd30bf 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -943,11 +943,17 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	else
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
-	/* now actually read the data, we know it's there */
+	/*
+	 * We determined how much of the page can be validly read as 'count'; read
+	 * only that much, not the entire page. WALRead() may read the page from
+	 * WAL buffers, in which case the page is not guaranteed to be
+	 * zero-padded up to the page boundary because of concurrent
+	 * insertions.
+	 */
 	if (!WALRead(state,
 				 cur_page,
 				 targetPagePtr,
-				 XLOG_BLCKSZ,
+				 count,
 				 currTLI,		/* Pass the current TLI because only
 								 * WalSndSegmentOpen controls whether new TLI
 								 * is needed. */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index a14126d164..9035a12a7b 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -193,6 +193,23 @@ typedef enum WALAvailability
 	WALAVAIL_REMOVED,			/* WAL segment has been removed */
 } WALAvailability;
 
+/* Return values from XLogReadFromBuffers. */
+typedef enum XLogReadFromBuffersResult
+{
+	XLREADBUFS_OK = 0,			/* no error */
+	XLREADBUFS_INVALID_INPUT = -1,	/* invalid startptr */
+	XLREADBUFS_IN_RECOVERY = -2,	/* read attempted when in recovery */
+
+	/* read attempted with TLI that's different from server insertion TLI */
+	XLREADBUFS_NOT_INSERT_TLI = -3,
+
+	/* read attempted for non-existent WAL */
+	XLREADBUFS_NON_EXISTENT_WAL = -4,
+
+	/* uninitialized WAL buffer page */
+	XLREADBUFS_UNINITIALIZED_WAL = -5
+} XLogReadFromBuffersResult;
+
 struct XLogRecData;
 struct XLogReaderState;
 
@@ -251,6 +268,12 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern XLogReadFromBuffersResult XLogReadFromBuffers(XLogRecPtr startptr,
+													 TimeLineID tli,
+													 Size bytes_to_read,
+													 char *buf,
+													 Size *bytes_read);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1
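
The hit, partial-hit and miss handling that the patch above adds to WALRead() can be sketched in isolation. This is a toy model under stated assumptions, not the patch's code: demo_read_from_buffers() and demo_read_from_file() are hypothetical stand-ins for XLogReadFromBuffers() and the existing file read path, the "buffers" simply cover positions 16 to 31 of a synthetic WAL, and the byte at each position is derived from the position so that results can be checked. As in the patch, the buffer read stops at the first byte it cannot serve, so the bytes it returns always form a prefix of the request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy WAL: the byte stored at position 'lsn' is derived from the position. */
static char
demo_wal_byte(uint64_t lsn)
{
	return (char) ('A' + lsn % 26);
}

/* Hypothetical range of positions currently held in the toy WAL buffers. */
#define DEMO_BUF_LO 16
#define DEMO_BUF_HI 32

/* Bookkeeping so that callers can see what the file path was asked for. */
static int	demo_file_reads = 0;
static uint64_t demo_last_file_start = 0;
static size_t demo_last_file_count = 0;

/* Stand-in for the buffer read: copies a prefix and reports its length. */
static size_t
demo_read_from_buffers(uint64_t start, char *buf, size_t count)
{
	size_t		n = 0;

	while (n < count && start + n >= DEMO_BUF_LO && start + n < DEMO_BUF_HI)
	{
		buf[n] = demo_wal_byte(start + n);
		n++;
	}
	return n;
}

/* Stand-in for the file read path: always succeeds in this toy model. */
static bool
demo_read_from_file(uint64_t start, char *buf, size_t count)
{
	demo_file_reads++;
	demo_last_file_start = start;
	demo_last_file_count = count;
	for (size_t i = 0; i < count; i++)
		buf[i] = demo_wal_byte(start + i);
	return true;
}

/* Try the buffers first; read whatever remains from the file. */
static bool
demo_wal_read(uint64_t start, char *buf, size_t count)
{
	size_t		nread = demo_read_from_buffers(start, buf, count);

	if (nread == count)
		return true;			/* full hit: no file access at all */

	/* Partial hit or miss: skip the bytes already copied and continue. */
	buf += nread;
	start += nread;
	count -= nread;
	return demo_read_from_file(start, buf, count);
}
```

For example, a request for 8 bytes at position 28 is served as 4 bytes from the toy buffers followed by a single file read of 4 bytes starting at position 32, while a request at position 40 goes entirely to the file path.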

v18-0002-Add-test-module-for-verifying-WAL-read-from-WAL-.patch (application/octet-stream)
From 3022d508ecb85242dc5369639082937e8c5596db Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 20 Dec 2023 10:02:53 +0000
Subject: [PATCH v18] Add test module for verifying WAL read from WAL buffers

This commit adds a test module to verify WAL read from WAL
buffers.

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 ++++++++
 .../test_wal_read_from_buffers/meson.build    | 33 ++++++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 54 +++++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 ++++++
 .../test_wal_read_from_buffers.c              | 46 ++++++++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 182 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 5d33fa6a9a..64a051ce1c 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -33,6 +33,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi \
 		  xid_wraparound
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index b76f588559..52b0cd5812 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,6 +30,7 @@ subdir('test_resowner')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
 subdir('xid_wraparound')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7472494501
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40bd5dcd33
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_wal_read_from_buffers
+
+test_install_data += files(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..1d842bb02e
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,54 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers;');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn');});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+# Check with a WAL that doesn't yet exist.
+$lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_flush_lsn()+8192;');
+$result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn');});
+is($result, 'f', "WAL that doesn't yet exist is not read from WAL buffers");
+
+# Check with invalid input.
+$result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('0/0');});
+is($result, 'f', "WAL is not read from WAL buffers with invalid input");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..c6ffb3fa65
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    read_successful OUT boolean
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..eb031f15b1
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,46 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	char		data[XLOG_BLCKSZ] = {0};
+	Size		nread;
+	XLogReadFromBuffersResult result;
+	bool		is_read;
+
+	result = XLogReadFromBuffers(PG_GETARG_LSN(0),
+								 GetWALInsertionTimeLine(),
+								 XLOG_BLCKSZ,
+								 data,
+								 &nread);
+
+	if (nread > 0)
+		is_read = true;
+	else
+		is_read = false;
+
+	PG_RETURN_BOOL(is_read);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..eda8d47954
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

#59Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#58)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Wed, 2023-12-20 at 15:36 +0530, Bharath Rupireddy wrote:

> Thanks. Attaching remaining patches as v18 patch-set after commits
> c3a8e2a7cb16 and 766571be1659.

Comments:

I still think the right thing for this patch is to call
XLogReadFromBuffers() directly from the callers who need it, and not
change WALRead(). I am open to changing this later, but for now that
makes sense to me so that we can clearly identify which callers benefit
and why. I have brought this up a few times before[1][2], so there must
be some reason that I don't understand -- can you explain it?

The XLogReadFromBuffersResult is never used. I can see how it might be
useful for testing or asserts, but it's not used even in the test
module. I don't think we should clutter the API with that kind of thing
-- let's just return the nread.

I also do not like the terminology "partial hit" to be used in this
way. Perhaps "short read" or something about hitting the end of
readable WAL would be better?

I like how the callers of WALRead() are being more precise about the
bytes they are requesting.

You've added several spinlock acquisitions to the loop. Two explicitly,
and one implicitly in WaitXLogInsertionsToFinish(). These may allow you
to read slightly further, but introduce performance risk. Was this
discussed?

The callers are not checking for XLREADBUGS_UNINITIALIZED_WAL, so it
seems like there's a risk of getting partially-written data? And it's
not clear to me the check of the wal page headers is the right one
anyway.

It seems like all of this would be simpler if you checked first how far
you can safely read data, and then just loop and read that far. I'm not
sure that it's worth it to try to mix the validity checks with the
reading of the data.
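
A minimal sketch of that shape, using a toy in-memory ring rather than PostgreSQL's WAL buffers. Every name here (ToyWalBuffers, toy_safe_read_len, toy_read, the 8-byte page size) is hypothetical and only illustrates "compute the safe length first, then copy in a plain loop"; how the safe upper bound would be derived in xlog.c is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SZ 8u			/* toy page size, not XLOG_BLCKSZ */
#define TOY_NPAGES  4u			/* toy number of pages in the ring */

/* Hypothetical model: position p lives at offset p % ring size. */
typedef struct ToyWalBuffers
{
	uint64_t	oldest;			/* first position still held in the ring */
	uint64_t	upto;			/* end of the data that is safe to read */
	char		pages[TOY_NPAGES * TOY_PAGE_SZ];
} ToyWalBuffers;

/* Step 1: decide how many of 'count' bytes from 'start' are safe to read. */
static size_t
toy_safe_read_len(const ToyWalBuffers *wb, uint64_t start, size_t count)
{
	if (start < wb->oldest || start >= wb->upto)
		return 0;
	if (start + count > wb->upto)
		count = (size_t) (wb->upto - start);
	return count;
}

/* Step 2: copy exactly that much, one page-bounded chunk at a time. */
static size_t
toy_read(const ToyWalBuffers *wb, uint64_t start, size_t count, char *dst)
{
	size_t		total = toy_safe_read_len(wb, start, count);
	size_t		left = total;
	uint64_t	ptr = start;

	while (left > 0)
	{
		size_t		off = (size_t) (ptr % (TOY_NPAGES * TOY_PAGE_SZ));
		size_t		chunk = TOY_PAGE_SZ - (size_t) (ptr % TOY_PAGE_SZ);

		if (chunk > left)
			chunk = left;
		memcpy(dst, wb->pages + off, chunk);
		dst += chunk;
		ptr += chunk;
		left -= chunk;
	}

	/* Never more than the caller asked for. */
	assert(total <= count);
	return total;
}
```

The copy loop carries no validity checks of its own; everything it touches was vetted by toy_safe_read_len() before the first memcpy().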

Regards,
Jeff Davis

[1]:  /messages/by-id/4132fe48f831ed6f73a9eb191af5fe475384969c.camel@j-davis.com
[2]: /messages/by-id/2ef04861c0f77e7ae78b703770cc2bbbac3d85e6.camel@j-davis.com

#60Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Jeff Davis (#59)
2 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, Jan 5, 2024 at 7:20 AM Jeff Davis <pgsql@j-davis.com> wrote:

> On Wed, 2023-12-20 at 15:36 +0530, Bharath Rupireddy wrote:
>
> > Thanks. Attaching remaining patches as v18 patch-set after commits
> > c3a8e2a7cb16 and 766571be1659.
>
> Comments:

Thanks for reviewing.

> I still think the right thing for this patch is to call
> XLogReadFromBuffers() directly from the callers who need it, and not
> change WALRead(). I am open to changing this later, but for now that
> makes sense to me so that we can clearly identify which callers benefit
> and why. I have brought this up a few times before[1][2], so there must
> be some reason that I don't understand -- can you explain it?

IMO, WALRead() is the best place to have XLogReadFromBuffers() for 2
reasons: 1) All of the WALRead() callers (except FRONTEND tools) will
benefit if WAL is read from WAL buffers. I don't see any reason for a
caller to skip reading from WAL buffers. If there's a caller (in
future) wanting to skip reading from WAL buffers, I'm open to adding a
flag in XLogReaderState to skip. 2) The amount of code is reduced if
XLogReadFromBuffers() sits in WALRead().

> The XLogReadFromBuffersResult is never used. I can see how it might be
> useful for testing or asserts, but it's not used even in the test
> module. I don't think we should clutter the API with that kind of thing
> -- let's just return the nread.

Removed.

> I also do not like the terminology "partial hit" to be used in this
> way. Perhaps "short read" or something about hitting the end of
> readable WAL would be better?

"short read" seems good. Done that way in the new patch.

> I like how the callers of WALRead() are being more precise about the
> bytes they are requesting.
>
> You've added several spinlock acquisitions to the loop. Two explicitly,
> and one implicitly in WaitXLogInsertionsToFinish(). These may allow you
> to read slightly further, but introduce performance risk. Was this
> discussed?

I opted to read slightly further, thinking that the loops wouldn't get
long enough for the spinlocks to become costly. Basically, I wasn't sure
which approach was the best. Now that there's an opinion to keep them
outside the loop, I agree with it. Done that way in the new patch.

> The callers are not checking for XLREADBUGS_UNINITIALIZED_WAL, so it
> seems like there's a risk of getting partially-written data? And it's
> not clear to me the check of the wal page headers is the right one
> anyway.
>
> It seems like all of this would be simpler if you checked first how far
> you can safely read data, and then just loop and read that far. I'm not
> sure that it's worth it to try to mix the validity checks with the
> reading of the data.

XLogReadFromBuffers needs the page header check after reading the
page from WAL buffers. Typically, we won't read a WAL buffer page
that just got initialized, because we have already waited for the
in-progress WAL insertions to finish. However, there is a slight
window after that wait finishes in which the buffer page being read
can get replaced, especially under high WAL generation rates. After
all, we are reading from WAL buffers without any locks here. So,
let's not count such a page in.
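
To make the lock-free read pattern concrete, here is a rough standalone sketch of "check the tag, copy, recheck the tag" written with C11 atomics. It is not the patch code: ToySlot, toy_replace() and toy_optimistic_read() are made-up names, end_pos stands in for an xlblocks entry, and the data bytes are relaxed atomics only so that the toy stays free of data races under the C11 memory model.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOT_SZ 16u			/* toy slot size, not XLOG_BLCKSZ */

/* Hypothetical stand-in for one buffer page plus its position tag. */
typedef struct ToySlot
{
	_Atomic uint64_t end_pos;	/* end position of the range held; 0 = invalid */
	_Atomic unsigned char data[TOY_SLOT_SZ];
} ToySlot;

/* Writer: invalidate the tag, replace the contents, publish the new tag. */
static void
toy_replace(ToySlot *slot, uint64_t new_end, const char *src)
{
	atomic_store_explicit(&slot->end_pos, 0, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	for (size_t i = 0; i < TOY_SLOT_SZ; i++)
		atomic_store_explicit(&slot->data[i], (unsigned char) src[i],
							  memory_order_relaxed);
	atomic_store_explicit(&slot->end_pos, new_end, memory_order_release);
}

/*
 * Reader: report success only if the slot held the range ending at
 * 'expected_end' both before and after the copy. On failure the bytes
 * already copied into 'dst' must be ignored by the caller.
 */
static bool
toy_optimistic_read(ToySlot *slot, uint64_t expected_end,
					size_t off, size_t len, char *dst)
{
	assert(off + len <= TOY_SLOT_SZ);

	/* 1. Does the slot currently hold the range we want? */
	if (atomic_load_explicit(&slot->end_pos,
							 memory_order_acquire) != expected_end)
		return false;

	/* 2. Copy without any lock; the slot may be replaced under us. */
	for (size_t i = 0; i < len; i++)
		dst[i] = (char) atomic_load_explicit(&slot->data[off + i],
											 memory_order_relaxed);

	/* 3. Order the copy before the recheck, then recheck the tag. */
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&slot->end_pos,
								memory_order_relaxed) == expected_end;
}
```

The recheck catches a slot that was replaced while it was being copied. The page header check described above is a separate, additional test on the contents and is not modelled in this toy.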

I've addressed the above review comments and attached v19 patch-set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v19-0001-Allow-WAL-reading-from-WAL-buffers.patch (application/x-patch)
From e03af5726957437c15361bdb1b373fe8982f5c7c Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 10 Jan 2024 14:12:17 +0000
Subject: [PATCH v19] Allow WAL reading from WAL buffers

This commit gives postgres the capability to read WAL from WAL
buffers. When requested WAL isn't available in WAL buffers, the
WAL is read from the WAL file as usual.

This commit benefits the callers of WALRead(), such as
walsenders and pg_walinspect. They can all now avoid reading WAL
from the WAL file (possibly avoiding disk IO). Tests show that the
WAL buffers hit ratio stood at 95% for 1 primary, 1 sync standby,
1 async standby, with pgbench --scale=300 --client=32 --time=900.
In other words, 95% of the time the walsenders avoided reading from
the file and so avoided pread system calls:
https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com

This commit also helps when direct IO is enabled for WAL.
Reading WAL from WAL buffers brings performance back close to
that without direct IO for WAL:
https://www.postgresql.org/message-id/CALj2ACV6rS%2B7iZx5%2BoAvyXJaN4AG-djAQeM1mrM%3DYSDkVrUs7g%40mail.gmail.com

This commit paves the way for the following features in future:
- Improving synchronous replication performance by replicating
directly from WAL buffers.
- An opt-in way for the walreceivers to receive unflushed WAL.
More details here:
https://www.postgresql.org/message-id/20231011224353.cl7c2s222dw3de4j%40awork3.anarazel.de

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar, Andres Freund
Reviewed-by: Nathan Bossart, Kuntal Ghosh
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/backend/access/transam/xlog.c       | 173 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |  40 +++++-
 src/backend/access/transam/xlogutils.c  |  11 +-
 src/backend/postmaster/walsummarizer.c  |  10 +-
 src/backend/replication/walsender.c     |  10 +-
 src/include/access/xlog.h               |   3 +
 6 files changed, 231 insertions(+), 16 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..886eaf12e3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1705,6 +1705,179 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL from WAL buffers.
+ *
+ * This function reads 'count' bytes of WAL from WAL buffers into 'buf'
+ * starting at location 'startptr' on timeline 'tli' and returns total bytes
+ * read.
+ *
+ * Points to note:
+ *
+ * - This function reads as much as it can from WAL buffers, meaning, it may
+ * not read all the requested 'count' bytes. Caller must be aware of this and
+ * deal with it.
+ *
+ * - This function reads WAL from WAL buffers without holding any lock. First
+ * it reads xlblocks atomically for checking page existence, then it reads the
+ * page contents and validates. Finally, it rechecks the page existence by
+ * re-reading xlblocks; if the read page is replaced, it discards it and
+ * returns.
+ *
+ * - This function is not available for frontend code as WAL buffers are
+ * internal to the server.
+ *
+ * - This function waits for any in-progress WAL insertions to WAL buffers to
+ * finish.
+ */
+Size
+XLogReadFromBuffers(XLogRecPtr startptr, TimeLineID tli, Size count,
+					char *buf)
+{
+	XLogRecPtr	ptr;
+	Size		nbytes;
+	Size		ntotal = 0;
+	char	   *dst;
+	uint64		bytepos;
+	XLogRecPtr	reservedUpto;
+	XLogwrtResult LogwrtResult;
+
+	/*
+	 * Fast paths for the following reasons: 1) WAL buffers aren't in use when
+	 * server is in recovery. 2) WAL is inserted into WAL buffers on current
+	 * server's insertion TLI. 3) Invalid starting WAL location.
+	 */
+	if (RecoveryInProgress() ||
+		tli != GetWALInsertionTimeLine() ||
+		XLogRecPtrIsInvalid(startptr))
+		return ntotal;
+
+	/* Read the current insert position */
+	SpinLockAcquire(&XLogCtl->Insert.insertpos_lck);
+	bytepos = XLogCtl->Insert.CurrBytePos;
+	SpinLockRelease(&XLogCtl->Insert.insertpos_lck);
+
+	reservedUpto = XLogBytePosToEndRecPtr(bytepos);
+
+	/*
+	 * WAL being read doesn't yet exist i.e. past the current insert position.
+	 */
+	if ((startptr + count) > reservedUpto)
+		return ntotal;
+
+	SpinLockAcquire(&XLogCtl->info_lck);
+	LogwrtResult = XLogCtl->LogwrtResult;
+	SpinLockRelease(&XLogCtl->info_lck);
+
+	/* Wait for any in-progress WAL insertions to WAL buffers to finish. */
+	if ((startptr + count) > LogwrtResult.Write &&
+		(startptr + count) <= reservedUpto)
+		WaitXLogInsertionsToFinish(startptr + count);
+
+	ptr = startptr;
+	nbytes = count;
+	dst = buf;
+
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int			idx;
+		char	   *page;
+		char	   *data;
+		Size		nread;
+		XLogPageHeader phdr;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+
+		/* Requested WAL isn't available in WAL buffers. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/*
+		 * Make sure the read of xlblocks up above completes before we read
+		 * the page contents down below.
+		 */
+		pg_read_barrier();
+
+		nread = 0;
+
+		/* Read what is wanted, not the whole page. */
+		if ((data + nbytes) <= (page + XLOG_BLCKSZ))
+		{
+			/* All the bytes are in one page. */
+			nread = nbytes;
+		}
+		else
+		{
+			/*
+			 * All the bytes are not in one page. Read available bytes on the
+			 * current page, copy them over to output buffer and continue to
+			 * read remaining bytes.
+			 */
+			nread = XLOG_BLCKSZ - (data - page);
+			Assert(nread > 0 && nread <= nbytes);
+		}
+
+		Assert(nread > 0);
+		memcpy(dst, data, nread);
+
+		/*
+		 * Make sure we don't read xlblocks down below before the page
+		 * contents up above.
+		 */
+		pg_read_barrier();
+
+		/* Recheck if the read page still exists in WAL buffers. */
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+
+		/* Return if the page got initialized while we were reading it. */
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * Typically, we won't read a WAL buffer page that just got
+		 * initialized, because we waited for the in-progress WAL insertions
+		 * to finish above. However, there is a slight window after that wait
+		 * finishes in which the buffer page being read can get replaced,
+		 * especially under high WAL generation rates. After all, we are
+		 * reading from WAL buffers without any locks here. So, let's not
+		 * count such a page in.
+		 */
+		phdr = (XLogPageHeader) page;
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+			break;
+
+		dst += nread;
+		ptr += nread;
+		ntotal += nread;
+		nbytes -= nread;
+	}
+
+	/* We never read more than what the caller has asked for. */
+	Assert(ntotal <= count);
+
+	ereport(DEBUG1,
+			errmsg_internal("read %zu bytes out of %zu bytes from WAL buffers for given start LSN %X/%X, timeline ID %u",
+							ntotal, count,
+							LSN_FORMAT_ARGS(startptr), tli));
+
+	return ntotal;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7190156f2f..639bba2ad9 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1501,17 +1501,47 @@ err:
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
  *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
+ * When possible, this function reads WAL from WAL buffers. When requested WAL
+ * isn't available in WAL buffers, it is read from the WAL file as usual.
  */
 bool
-WALRead(XLogReaderState *state,
-		char *buf, XLogRecPtr startptr, Size count, TimeLineID tli,
-		WALReadError *errinfo)
+WALRead(XLogReaderState *state, char *buf, XLogRecPtr startptr,
+		Size count, TimeLineID tli, WALReadError *errinfo)
 {
 	char	   *p;
 	XLogRecPtr	recptr;
 	Size		nbytes;
+#ifndef FRONTEND
+	Size		nread;
+#endif
+
+#ifndef FRONTEND
+
+	/*
+	 * Try reading WAL from WAL buffers. Frontend code has no idea of WAL
+	 * buffers.
+	 */
+	nread = XLogReadFromBuffers(startptr, tli, count, buf);
+
+	if (nread > 0)
+	{
+		/*
+		 * Check if it's a full read, short read or no read from WAL buffers.
+		 * For a short read or no read, continue to read the remaining bytes
+		 * from the WAL file.
+		 *
+		 * XXX: It might be worth exposing WAL buffer read stats.
+		 */
+		if (nread == count)		/* full read */
+			return true;
+		else if (nread < count) /* short read */
+		{
+			buf += nread;
+			startptr += nread;
+			count -= nread;
+		}
+	}
+#endif
 
 	p = buf;
 	recptr = startptr;
diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index aa8667abd1..fafab9aa32 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -1007,12 +1007,13 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	}
 
 	/*
-	 * Even though we just determined how much of the page can be validly read
-	 * as 'count', read the whole page anyway. It's guaranteed to be
-	 * zero-padded up to the page boundary if it's incomplete.
+	 * We determined how much of the page can be validly read as 'count', so
+	 * read only that much, not the entire page. WALRead() can read the page
+	 * from WAL buffers, in which case the page is not guaranteed to be
+	 * zero-padded up to the page boundary because of concurrent
+	 * insertions.
 	 */
-	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
-				 &errinfo))
+	if (!WALRead(state, cur_page, targetPagePtr, count, tli, &errinfo))
 		WALReadRaiseError(&errinfo);
 
 	/* number of valid bytes in the buffer */
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index f828cc436a..d465848bc9 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -1254,11 +1254,13 @@ summarizer_read_local_xlog_page(XLogReaderState *state,
 	}
 
 	/*
-	 * Even though we just determined how much of the page can be validly read
-	 * as 'count', read the whole page anyway. It's guaranteed to be
-	 * zero-padded up to the page boundary if it's incomplete.
+	 * We determined how much of the page can be validly read as 'count', so
+	 * read only that much, not the entire page. WALRead() can read the page
+	 * from WAL buffers, in which case the page is not guaranteed to be
+	 * zero-padded up to the page boundary because of concurrent
+	 * insertions.
 	 */
-	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+	if (!WALRead(state, cur_page, targetPagePtr, count,
 				 private_data->tli, &errinfo))
 		WALReadRaiseError(&errinfo);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 087031e9dc..b35406bcdf 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1095,11 +1095,17 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	else
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
-	/* now actually read the data, we know it's there */
+	/*
+	 * We determined how much of the page can be validly read as 'count', so
+	 * read only that much, not the entire page. WALRead() can read the page
+	 * from WAL buffers, in which case the page is not guaranteed to be
+	 * zero-padded up to the page boundary because of concurrent
+	 * insertions.
+	 */
 	if (!WALRead(state,
 				 cur_page,
 				 targetPagePtr,
-				 XLOG_BLCKSZ,
+				 count,
 				 currTLI,		/* Pass the current TLI because only
 								 * WalSndSegmentOpen controls whether new TLI
 								 * is needed. */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 301c5fa11f..fa760a92d5 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -252,6 +252,9 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern Size XLogReadFromBuffers(XLogRecPtr startptr, TimeLineID tli,
+								Size count, char *buf);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1
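
The WALRead() hunk in the patch above tries WAL buffers first and reads only the remainder from the WAL file. A toy version of that control flow is sketched below. The names (toy_read_with_fallback, toy_cache_read, toy_file_read) and the fixed "cached window" are assumptions made for illustration; they are not PostgreSQL functions and do not model how WAL buffers decide what they hold.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sources: a "file" and a window of it that is also cached. */
static const char toy_file[] = "abcdefghijklmnopqrstuvwxyz";

#define TOY_CACHE_START 4u
#define TOY_CACHE_END   10u

/* Fast source: returns how many bytes it could serve, possibly fewer. */
static size_t
toy_cache_read(size_t start, size_t count, char *buf)
{
	if (start < TOY_CACHE_START || start >= TOY_CACHE_END)
		return 0;
	if (start + count > TOY_CACHE_END)
		count = TOY_CACHE_END - start;
	memcpy(buf, toy_file + start, count);
	return count;
}

/* Slow source: all or nothing, like a file read that can fail. */
static bool
toy_file_read(size_t start, size_t count, char *buf)
{
	if (start + count > sizeof(toy_file) - 1)
		return false;
	memcpy(buf, toy_file + start, count);
	return true;
}

/* Try the fast source first, then fetch only what is still missing. */
static bool
toy_read_with_fallback(size_t start, size_t count, char *buf)
{
	size_t		nread = toy_cache_read(start, count, buf);

	assert(nread <= count);
	if (nread == count)
		return true;			/* full read from the fast source */

	/* Short read or no read: advance past what we already have. */
	return toy_file_read(start + nread, count - nread, buf + nread);
}
```

Because the fast path reports a byte count rather than a status code, the caller only has to advance its buffer, start position and remaining count by that amount before falling back.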

v19-0002-Add-test-module-for-verifying-WAL-read-from-WAL-.patch (application/x-patch)
From 35e9c0afe130d79e4f74dfbe3a445cf3d594ec14 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 10 Jan 2024 13:36:15 +0000
Subject: [PATCH v19] Add test module for verifying WAL read from WAL buffers

This commit adds a test module to verify WAL read from WAL
buffers.

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 ++++++++
 .../test_wal_read_from_buffers/meson.build    | 33 ++++++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 54 +++++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 ++++++
 .../test_wal_read_from_buffers.c              | 44 +++++++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 180 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 5d33fa6a9a..64a051ce1c 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -33,6 +33,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi \
 		  xid_wraparound
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 00ff1d77d1..d5ec3bd3a9 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -30,6 +30,7 @@ subdir('test_resowner')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
 subdir('xid_wraparound')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7472494501
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40bd5dcd33
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_wal_read_from_buffers
+
+test_install_data += files(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..1d842bb02e
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,54 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers;');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn');});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+# Check with a WAL that doesn't yet exist.
+$lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_flush_lsn()+8192;');
+$result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn');});
+is($result, 'f', "WAL that doesn't yet exist is not read from WAL buffers");
+
+# Check with invalid input.
+$result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('0/0');});
+is($result, 'f', "WAL is not read from WAL buffers with invalid input");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..c6ffb3fa65
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    read_successful OUT boolean
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..e54c64236d
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,44 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	char		data[XLOG_BLCKSZ] = {0};
+	Size		nread;
+	bool		is_read;
+
+	nread = XLogReadFromBuffers(PG_GETARG_LSN(0),
+								GetWALInsertionTimeLine(),
+								XLOG_BLCKSZ,
+								data);
+
+	if (nread > 0)
+		is_read = true;
+	else
+		is_read = false;
+
+	PG_RETURN_BOOL(is_read);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..eda8d47954
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

#61Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#60)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Wed, 2024-01-10 at 19:59 +0530, Bharath Rupireddy wrote:

I've addressed the above review comments and attached v19 patch-set.

Regarding:

-       if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
-                                &errinfo))
+       if (!WALRead(state, cur_page, targetPagePtr, count, tli, &errinfo))

I'd like to understand the reason it was using XLOG_BLCKSZ before. Was
it a performance optimization? Or was it to zero the remainder of the
caller's buffer (readBuf)? Or something else?

If it was to zero the remainder of the caller's buffer, then we should
explicitly make that the caller's responsibility.
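
If zero-padding is what those callers rely on, making it the caller's responsibility could look roughly like the sketch below. This is only an illustration under stated assumptions: PAGE_SIZE, read_exact() and read_page_zero_padded() are hypothetical stand-ins, not functions from the patch, and read_exact() simply copies from an in-memory array in place of a WALRead() call that is asked for 'count' bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 8192			/* hypothetical stand-in for XLOG_BLCKSZ */

/*
 * Hypothetical reader: copies exactly 'count' bytes from 'src' into 'buf'.
 * It stands in for a WAL read that returns only the valid part of a page.
 */
static bool
read_exact(char *buf, const char *src, size_t count)
{
	assert(buf != NULL && src != NULL);

	if (count > PAGE_SIZE)
		return false;
	memcpy(buf, src, count);
	return true;
}

/*
 * Caller-side wrapper: read only the valid part of the page, then zero the
 * remainder explicitly instead of relying on the source being zero-padded.
 */
static bool
read_page_zero_padded(char *page_buf, const char *src, size_t count)
{
	if (!read_exact(page_buf, src, count))
		return false;
	memset(page_buf + count, 0, PAGE_SIZE - count);
	return true;
}
```

With this shape the reader never has to promise anything about bytes past 'count', so it would not matter whether the bytes came from the file or from WAL buffers.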

Regards,
Jeff Davis

#62Melih Mutlu
m.melihmutlu@gmail.com
In reply to: Bharath Rupireddy (#60)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hi Bharath,

Thanks for working on this. It seems like a nice improvement to have.

Here are some comments on 0001 patch.

1-  xlog.c
+ /*
+ * Fast paths for the following reasons: 1) WAL buffers aren't in use when
+ * server is in recovery. 2) WAL is inserted into WAL buffers on current
+ * server's insertion TLI. 3) Invalid starting WAL location.
+ */

Shouldn't the comment be something like "2) WAL is *not* inserted into WAL
buffers on current server's insertion TLI", since the condition to break is tli
!= GetWALInsertionTimeLine()?

2-
+ /*
+ * WAL being read doesn't yet exist i.e. past the current insert position.
+ */
+ if ((startptr + count) > reservedUpto)
+ return ntotal;

This question may not even make sense, but I wonder whether we can read from
startptr only up to reservedUpto in case startptr+count exceeds
reservedUpto?

3-
+ /* Wait for any in-progress WAL insertions to WAL buffers to finish. */
+ if ((startptr + count) > LogwrtResult.Write &&
+ (startptr + count) <= reservedUpto)
+ WaitXLogInsertionsToFinish(startptr + count);

Do we need to check if (startptr + count) <= reservedUpto as we already
verified this condition a few lines above?

4-
+ Assert(nread > 0);
+ memcpy(dst, data, nread);
+
+ /*
+ * Make sure we don't read xlblocks down below before the page
+ * contents up above.
+ */
+ pg_read_barrier();
+
+ /* Recheck if the read page still exists in WAL buffers. */
+ endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+
+ /* Return if the page got initialized while we were reading it. */
+ if (expectedEndPtr != endptr)
+ break;
+
+ /*
+ * Typically, we must not read a WAL buffer page that just got
+ * initialized. Because we waited enough for the in-progress WAL
+ * insertions to finish above. However, there can exist a slight
+ * window after the above wait finishes in which the read buffer page
+ * can get replaced especially under high WAL generation rates. After
+ * all, we are reading from WAL buffers without any locks here. So,
+ * let's not count such a page in.
+ */
+ phdr = (XLogPageHeader) page;
+ if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+   phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+   phdr->xlp_tli == tli))
+ break;

I see that you recheck if the page still exists and so on at the end. What
would you think about memcpy'ing only after being sure that we will need
and use the recently read data? If we break the loop during the recheck, we
simply discard the data read in the latest attempt. I guess that this may
not be a big deal but the data would be unnecessarily copied into the
destination in such a case.

5- xlogreader.c
+ nread = XLogReadFromBuffers(startptr, tli, count, buf);
+
+ if (nread > 0)
+ {
+ /*
+ * Check if its a full read, short read or no read from WAL buffers.
+ * For short read or no read, continue to read the remaining bytes
+ * from WAL file.
+ *
+ * XXX: It might be worth to expose WAL buffer read stats.
+ */
+ if (nread == count) /* full read */
+ return true;
+ else if (nread < count) /* short read */
+ {
+ buf += nread;
+ startptr += nread;
+ count -= nread;
+ }

Typo in the comment. Should be like "Check if *it's* a full read, short
read or no read from WAL buffers."

Also, I don't think XLogReadFromBuffers() returns anything less than 0 or
more than count. Is verifying nread > 0 necessary? I think if nread is
not equal to count, we can simply assume that it's a short read (or no
read at all if nread is 0, which we don't need to handle specifically).

6-
+ /*
+ * We determined how much of the page can be validly read as 'count', read
+ * that much only, not the entire page. Since WALRead() can read the page
+ * from WAL buffers, in which case, the page is not guaranteed to be
+ * zero-padded up to the page boundary because of the concurrent
+ * insertions.
+ */

I'm not sure about pasting this into most of the places where we call WALRead().
Wouldn't it be better if we mention this somewhere around WALRead() only
once?
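
For point 5, the simplified handling suggested above (treat anything short of count as a short read, with zero bytes as the degenerate case) might look like the following sketch. All names here (Source, read_from_buffers(), read_from_file(), read_wal()) are hypothetical and only model the control flow with an in-memory array; they are not the patch's functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model: one array plays both "file" and "buffers". */
typedef struct
{
	const char *wal;			/* all WAL, as it would appear on disk */
	size_t		wal_len;
	size_t		cached_from;	/* [cached_from, cached_to) is "in buffers" */
	size_t		cached_to;
	size_t		file_bytes;		/* bytes served by the file path */
} Source;

/* Returns the bytes copied from the cached range; never more than 'count'. */
static size_t
read_from_buffers(const Source *s, char *dst, size_t start, size_t count)
{
	size_t		n;

	if (start < s->cached_from || start >= s->cached_to)
		return 0;
	n = s->cached_to - start;
	if (n > count)
		n = count;
	memcpy(dst, s->wal + start, n);
	return n;
}

/* Reads 'count' bytes from the "file"; fails if the range does not exist. */
static bool
read_from_file(Source *s, char *dst, size_t start, size_t count)
{
	if (start > s->wal_len || count > s->wal_len - start)
		return false;
	memcpy(dst, s->wal + start, count);
	s->file_bytes += count;
	return true;
}

static bool
read_wal(Source *s, char *dst, size_t start, size_t count)
{
	size_t		nread = read_from_buffers(s, dst, start, count);

	assert(nread <= count);
	if (nread == count)
		return true;			/* full read from buffers */

	/* Short read or no read: skip nread bytes (possibly 0), use the file. */
	return read_from_file(s, dst + nread, start + nread, count - nread);
}
```

Because nread is unsigned and bounded by count, the nread > 0 test adds nothing in this model: advancing by zero bytes is harmless.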

Best,
--
Melih Mutlu
Microsoft

#63Andres Freund
andres@anarazel.de
In reply to: Bharath Rupireddy (#60)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hi,

On 2024-01-10 19:59:29 +0530, Bharath Rupireddy wrote:

+		/*
+		 * Typically, we must not read a WAL buffer page that just got
+		 * initialized. Because we waited enough for the in-progress WAL
+		 * insertions to finish above. However, there can exist a slight
+		 * window after the above wait finishes in which the read buffer page
+		 * can get replaced especially under high WAL generation rates. After
+		 * all, we are reading from WAL buffers without any locks here. So,
+		 * let's not count such a page in.
+		 */
+		phdr = (XLogPageHeader) page;
+		if (!(phdr->xlp_magic == XLOG_PAGE_MAGIC &&
+			  phdr->xlp_pageaddr == (ptr - (ptr % XLOG_BLCKSZ)) &&
+			  phdr->xlp_tli == tli))
+			break;

I still think that anything that requires such checks shouldn't be
merged. It's completely bogus to check page contents for validity when we
should have metadata telling us which range of the buffers is valid and which
not.

Greetings,

Andres Freund

#64Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#63)
2 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Mon, 2024-01-22 at 12:12 -0800, Andres Freund wrote:

I still think that anything that requires such checks shouldn't be
merged. It's completely bogus to check page contents for validity
when we
should have metadata telling us which range of the buffers is valid
and which
not.

The check seems entirely unnecessary, to me. A leftover from v18?

I have attached a new patch (version "19j") to illustrate some of my
previous suggestions. I didn't spend a lot of time on it so it's not
ready for commit, but I believe my suggestions are easier to understand
in code form.

Note that, right now, it only works for XLogSendPhysical(). I believe
it's best to just make it work for 1-3 callers that we understand well,
and we can generalize later if it makes sense.

I'm still not clear on why some callers are reading XLOG_BLCKSZ
(expecting zeros at the end), and if it's OK to just change them to use
the exact byte count.

Also, if we've detected that the first requested buffer has been
evicted, is there any value in continuing the loop to see if more
recent buffers are available? For example, if the requested LSNs range
over buffers 4, 5, and 6, and 4 has already been evicted, should we try
to return LSN data from 5 and 6 at the proper offset in the dest
buffer? If so, we'd need to adjust the API so the caller knows what
parts of the dest buffer were filled in.
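
To make the question concrete, here is one possible shape for such an API, sketched with a per-page "filled" mask. Everything in it is an assumption for illustration: Ring, ring_put() and ring_read_sparse() are hypothetical names, requests are page-aligned, and the code is single-threaded, so it ignores the barriers and rechecks the real function needs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 16			/* tiny pages keep the example readable */
#define NSLOTS 4				/* number of buffer slots in the ring */

typedef struct
{
	uint64_t	end_ptr[NSLOTS];	/* end position of the page in each slot */
	char		pages[NSLOTS][PAGE_SIZE];
} Ring;

/* Install page 'pageno' into its slot, evicting whatever was there. */
static void
ring_put(Ring *r, uint64_t pageno, char fill)
{
	int			idx = (int) (pageno % NSLOTS);

	memset(r->pages[idx], fill, PAGE_SIZE);
	r->end_ptr[idx] = (pageno + 1) * PAGE_SIZE;
}

/*
 * Copy whichever of pages [first, first + npages) are still present into
 * 'dst' at their proper offsets. filled[i] reports whether page first + i
 * was copied; skipped pages leave 'dst' untouched. Returns the page count.
 */
static size_t
ring_read_sparse(const Ring *r, uint64_t first, size_t npages,
				 char *dst, bool *filled)
{
	size_t		nfilled = 0;

	for (size_t i = 0; i < npages; i++)
	{
		uint64_t	pageno = first + i;
		int			idx = (int) (pageno % NSLOTS);

		/* The slot holds our page only if its end pointer matches. */
		filled[i] = (r->end_ptr[idx] == (pageno + 1) * PAGE_SIZE);
		if (filled[i])
		{
			memcpy(dst + i * PAGE_SIZE, r->pages[idx], PAGE_SIZE);
			nfilled++;
		}
	}
	return nfilled;
}
```

In the example from the paragraph above (pages 4, 5 and 6 requested, 4 already evicted), the mask would come back as false, true, true, and the caller would still need the file for page 4. Whether that extra bookkeeping is worth it compared with stopping at the first miss is exactly the open question.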

Regards,
Jeff Davis

Attachments:

v19j-0001-Add-XLogReadFromBuffers.patch (text/x-patch; charset=UTF-8)
From 34d89d6b869a454fd15097d8a0f0b6d3997a74da Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Mon, 22 Jan 2024 16:21:53 -0800
Subject: [PATCH v19j 1/2] Add XLogReadFromBuffers().

Allows reading directly from WAL buffers without a lock, avoiding the
need to wait for WAL flushing and read from the filesystem.

For now, the only caller is physical replication, but we can consider
expanding it to other callers as needed.

Author: Bharath Rupireddy
---
 src/backend/access/transam/xlog.c   | 145 ++++++++++++++++++++++++++--
 src/backend/replication/walsender.c |  14 +++
 src/include/access/xlog.h           |   2 +
 3 files changed, 152 insertions(+), 9 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..c9619139af 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -698,7 +698,7 @@ static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 							  XLogRecPtr *PrevPtr);
-static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
+static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto, bool emitLog);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
@@ -1494,7 +1494,7 @@ WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
  * to make room for a new one, which in turn requires WALWriteLock.
  */
 static XLogRecPtr
-WaitXLogInsertionsToFinish(XLogRecPtr upto)
+WaitXLogInsertionsToFinish(XLogRecPtr upto, bool emitLog)
 {
 	uint64		bytepos;
 	XLogRecPtr	reservedUpto;
@@ -1521,9 +1521,10 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 	 */
 	if (upto > reservedUpto)
 	{
-		ereport(LOG,
-				(errmsg("request to flush past end of generated WAL; request %X/%X, current position %X/%X",
-						LSN_FORMAT_ARGS(upto), LSN_FORMAT_ARGS(reservedUpto))));
+		if (emitLog)
+			ereport(LOG,
+					(errmsg("request to flush past end of generated WAL; request %X/%X, current position %X/%X",
+							LSN_FORMAT_ARGS(upto), LSN_FORMAT_ARGS(reservedUpto))));
 		upto = reservedUpto;
 	}
 
@@ -1705,6 +1706,132 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL directly from WAL buffers, if available.
+ *
+ * This function reads 'count' bytes of WAL from WAL buffers into 'buf'
+ * starting at location 'startptr' and returns total bytes read.
+ *
+ * The bytes read may be fewer than requested if any of the WAL buffers in the
+ * requested range have been evicted, or if the last requested byte is beyond
+ * the Insert pointer.
+ *
+ * If reading beyond the Write pointer, this function will wait for concurrent
+ * inserters to finish. Otherwise, it does not wait at all.
+ *
+ * The caller must ensure that it's reasonable to read from the WAL buffers,
+ * i.e. that the requested data is from the current timeline, that we're not
+ * in recovery, etc.
+ */
+Size
+XLogReadFromBuffers(char *buf, XLogRecPtr startptr, Size count)
+{
+	XLogRecPtr	 ptr	= startptr;
+	XLogRecPtr	 upto	= startptr + count;
+	Size		 nbytes = count;
+	Size		 ntotal = 0;
+	char		*dst	= buf;
+
+	Assert(!RecoveryInProgress());
+	Assert(!XLogRecPtrIsInvalid(startptr));
+
+	/*
+	 * Caller requested very recent WAL data. Wait for any in-progress WAL
+	 * insertions to WAL buffers to finish.
+	 *
+	 * Most callers will have already updated LogwrtResult when determining
+	 * how far to read, but it's OK if it's out of date. (XXX: is it worth
+	 * taking a spinlock to update LogwrtResult and check again before calling
+	 * WaitXLogInsertionsToFinish()?)
+	 */
+	if (upto > LogwrtResult.Write)
+	{
+		XLogRecPtr writtenUpto = WaitXLogInsertionsToFinish(upto, false);
+
+		upto = Min(upto, writtenUpto);
+		nbytes = upto - startptr;
+	}
+
+	/*
+	 * Loop through the buffers without a lock. For each buffer, atomically
+	 * read and verify the end pointer, then copy the data out, and finally
+	 * re-read and re-verify the end pointer.
+	 *
+	 * Once a page is evicted, it never returns to the WAL buffers, so if the
+	 * end pointer matches the expected end pointer before and after we copy
+	 * the data, then the right page must have been present during the data
+	 * copy. Read barriers are necessary to ensure that the data copy actually
+	 * happens between the two verification steps.
+	 *
+	 * If the verification fails, we simply terminate the loop and return the
+	 * data that had already been copied out successfully.
+	 */
+	while (nbytes > 0)
+	{
+		XLogRecPtr	 expectedEndPtr;
+		XLogRecPtr	 endptr;
+		int			 idx;
+		const char	*page;
+		const char	*data;
+		Size		 nread;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+		/*
+		 * First verification step: check that the correct page is present in
+		 * the WAL buffers
+		 */
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/*
+		 * Ensure that the data copy and the first verification step are not
+		 * reordered
+		 */
+		pg_read_barrier();
+
+		/* how much is available on this page to read? */
+		nread = Min(nbytes, XLOG_BLCKSZ - (data - page));
+
+		/* data copy */
+		memcpy(dst, data, nread);
+
+		/*
+		 * Ensure that the data copy and the second verification step are not
+		 * reordered.
+		 */
+		pg_read_barrier();
+
+		/*
+		 * Second verification step: check that the page we read from wasn't
+		 * evicted while we were copying the data.
+		 */
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+		if (expectedEndPtr != endptr)
+			break;
+
+		dst += nread;
+		ptr += nread;
+		ntotal += nread;
+		nbytes -= nread;
+	}
+
+	Assert(ntotal <= count);
+
+	return ntotal;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
@@ -1895,7 +2022,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 				 */
 				LWLockRelease(WALBufMappingLock);
 
-				WaitXLogInsertionsToFinish(OldPageRqstPtr);
+				WaitXLogInsertionsToFinish(OldPageRqstPtr, true);
 
 				LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
 
@@ -2689,7 +2816,7 @@ XLogFlush(XLogRecPtr record)
 		 * Before actually performing the write, wait for all in-flight
 		 * insertions to the pages we're about to write to finish.
 		 */
-		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr);
+		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr, true);
 
 		/*
 		 * Try to get the write lock. If we can't get it immediately, wait
@@ -2740,7 +2867,7 @@ XLogFlush(XLogRecPtr record)
 			 * We're only calling it again to allow insertpos to be moved
 			 * further forward, not to actually wait for anyone.
 			 */
-			insertpos = WaitXLogInsertionsToFinish(insertpos);
+			insertpos = WaitXLogInsertionsToFinish(insertpos, true);
 		}
 
 		/* try to write/flush later additions to XLOG as well */
@@ -2919,7 +3046,7 @@ XLogBackgroundFlush(void)
 	START_CRIT_SECTION();
 
 	/* now wait for any in-progress insertions to finish and get write lock */
-	WaitXLogInsertionsToFinish(WriteRqst.Write);
+	WaitXLogInsertionsToFinish(WriteRqst.Write, true);
 	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
 	LogwrtResult = XLogCtl->LogwrtResult;
 	if (WriteRqst.Write > LogwrtResult.Write ||
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 087031e9dc..b06f5e75d6 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -3125,6 +3125,20 @@ XLogSendPhysical(void)
 	enlargeStringInfo(&output_message, nbytes);
 
 retry:
+	/*
+	 * Read from WAL buffers, if available.
+	 */
+	if (!RecoveryInProgress() &&
+		xlogreader->seg.ws_tli == GetWALInsertionTimeLine())
+	{
+		Size rbytes = XLogReadFromBuffers(
+			&output_message.data[output_message.len],
+			startptr, nbytes);
+		output_message.len += rbytes;
+		startptr += rbytes;
+		nbytes -= rbytes;
+	}
+
 	if (!WALRead(xlogreader,
 				 &output_message.data[output_message.len],
 				 startptr,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 301c5fa11f..5f2621b0b9 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -252,6 +252,8 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern Size XLogReadFromBuffers(char *buf, XLogRecPtr startptr, Size count);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

v19j-0002-Add-test-module-for-verifying-WAL-read-from-WAL.patch (text/x-patch; charset=UTF-8)
From fc9bbbc28c59f5c0a88d658f76bfc421f9b9e34d Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 10 Jan 2024 13:36:15 +0000
Subject: [PATCH v19j 2/2] Add test module for verifying WAL read from WAL
 buffers

This commit adds a test module to verify WAL read from WAL
buffers.

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../test_wal_read_from_buffers/.gitignore     |  4 ++
 .../test_wal_read_from_buffers/Makefile       | 23 ++++++++
 .../test_wal_read_from_buffers/meson.build    | 33 ++++++++++++
 .../test_wal_read_from_buffers/t/001_basic.pl | 54 +++++++++++++++++++
 .../test_wal_read_from_buffers--1.0.sql       | 16 ++++++
 .../test_wal_read_from_buffers.c              | 37 +++++++++++++
 .../test_wal_read_from_buffers.control        |  4 ++
 9 files changed, 173 insertions(+)
 create mode 100644 src/test/modules/test_wal_read_from_buffers/.gitignore
 create mode 100644 src/test/modules/test_wal_read_from_buffers/Makefile
 create mode 100644 src/test/modules/test_wal_read_from_buffers/meson.build
 create mode 100644 src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
 create mode 100644 src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index e32c8925f6..c6e1f01dca 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -34,6 +34,7 @@ SUBDIRS = \
 		  test_rls_hooks \
 		  test_shm_mq \
 		  test_slru \
+		  test_wal_read_from_buffers \
 		  unsafe_tests \
 		  worker_spi \
 		  xid_wraparound
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 397e0906e6..9595bbc342 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -32,6 +32,7 @@ subdir('test_resowner')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('test_wal_read_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
 subdir('xid_wraparound')
diff --git a/src/test/modules/test_wal_read_from_buffers/.gitignore b/src/test/modules/test_wal_read_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/test_wal_read_from_buffers/Makefile b/src/test/modules/test_wal_read_from_buffers/Makefile
new file mode 100644
index 0000000000..7472494501
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/test_wal_read_from_buffers/Makefile
+
+MODULE_big = test_wal_read_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	test_wal_read_from_buffers.o
+PGFILEDESC = "test_wal_read_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = test_wal_read_from_buffers
+DATA = test_wal_read_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/test_wal_read_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/test_wal_read_from_buffers/meson.build b/src/test/modules/test_wal_read_from_buffers/meson.build
new file mode 100644
index 0000000000..40bd5dcd33
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2023, PostgreSQL Global Development Group
+
+test_wal_read_from_buffers_sources = files(
+  'test_wal_read_from_buffers.c',
+)
+
+if host_system == 'windows'
+  test_wal_read_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'test_wal_read_from_buffers',
+    '--FILEDESC', 'test_wal_read_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+test_wal_read_from_buffers = shared_module('test_wal_read_from_buffers',
+  test_wal_read_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += test_wal_read_from_buffers
+
+test_install_data += files(
+  'test_wal_read_from_buffers.control',
+  'test_wal_read_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'test_wal_read_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..1d842bb02e
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/t/001_basic.pl
@@ -0,0 +1,54 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION test_wal_read_from_buffers;');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn');});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+# Check with a WAL that doesn't yet exist.
+$lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_flush_lsn()+8192;');
+$result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('$lsn');});
+is($result, 'f', "WAL that doesn't yet exist is not read from WAL buffers");
+
+# Check with invalid input.
+$result = $node->safe_psql('postgres',
+	qq{SELECT test_wal_read_from_buffers('0/0');});
+is($result, 'f', "WAL is not read from WAL buffers with invalid input");
+
+done_testing();
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
new file mode 100644
index 0000000000..c6ffb3fa65
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql
@@ -0,0 +1,16 @@
+/* src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION test_wal_read_from_buffers" to load this file. \quit
+
+--
+-- test_wal_read_from_buffers()
+--
+-- Returns true if WAL data at a given LSN can be read from WAL buffers.
+-- Otherwise returns false.
+--
+CREATE FUNCTION test_wal_read_from_buffers(IN lsn pg_lsn,
+    read_successful OUT boolean
+)
+AS 'MODULE_PATHNAME', 'test_wal_read_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
new file mode 100644
index 0000000000..5368f59b16
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
@@ -0,0 +1,37 @@
+/*--------------------------------------------------------------------------
+ *
+ * test_wal_read_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 2023, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function for verifying that WAL data at a given LSN can be read from WAL
+ * buffers. Returns true if read from WAL buffers, otherwise false.
+ */
+PG_FUNCTION_INFO_V1(test_wal_read_from_buffers);
+Datum
+test_wal_read_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	startptr		  = PG_GETARG_LSN(0);
+	char		data[XLOG_BLCKSZ] = {0};
+	Size		nread			  = 0;
+
+	if (!XLogRecPtrIsInvalid(startptr))
+		nread = XLogReadFromBuffers(data, startptr, XLOG_BLCKSZ);
+
+	PG_RETURN_BOOL(nread > 0);
+}
diff --git a/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
new file mode 100644
index 0000000000..eda8d47954
--- /dev/null
+++ b/src/test/modules/test_wal_read_from_buffers/test_wal_read_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/test_wal_read_from_buffers'
+relocatable = true
-- 
2.34.1

#65Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Jeff Davis (#64)
3 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Tue, Jan 23, 2024 at 9:37 AM Jeff Davis <pgsql@j-davis.com> wrote:

On Mon, 2024-01-22 at 12:12 -0800, Andres Freund wrote:

I still think that anything that requires such checks shouldn't be
merged. It's completely bogus to check page contents for validity
when we
should have metadata telling us which range of the buffers is valid
and which
not.

The check seems entirely unnecessary, to me. A leftover from v18?

I have attached a new patch (version "19j") to illustrate some of my
previous suggestions. I didn't spend a lot of time on it so it's not
ready for commit, but I believe my suggestions are easier to understand
in code form.

Note that, right now, it only works for XLogSendPhysical(). I believe
it's best to just make it work for 1-3 callers that we understand well,
and we can generalize later if it makes sense.

+1 to do it for XLogSendPhysical() first. Enabling it for others can
just be done as something like the attached v20-0003.

I'm still not clear on why some callers are reading XLOG_BLCKSZ
(expecting zeros at the end), and if it's OK to just change them to use
the exact byte count.

"expecting zeros at the end" - this can't always be true, as the WAL
can get flushed after the flush ptr is determined but before it is read
from the WAL file. FWIW, here's what I tried previously -
https://github.com/BRupireddy2/postgres/tree/ensure_extra_read_WAL_page_is_zero_padded_at_the_end_WIP,
the tests hit the Assert(false) added there, which means the
zero-padding comment around the WALRead() call sites isn't quite right.

/*
* Even though we just determined how much of the page can be validly read
* as 'count', read the whole page anyway. It's guaranteed to be
* zero-padded up to the page boundary if it's incomplete.
*/
if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,

I think this needs to be discussed separately. If okay, I'll start a new thread.
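
For illustration only: the property that comment asserts, namely that every byte of the page past 'count' is zero, could be checked with a helper along the following lines. This is a hypothetical sketch and not the code from the WIP branch linked above; page_tail_is_zeroed() and its parameters are made-up names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: returns true if every byte of 'page' from offset
 * 'valid_len' up to (but not including) 'page_size' is zero.
 */
static bool
page_tail_is_zeroed(const char *page, size_t valid_len, size_t page_size)
{
    assert(valid_len <= page_size);

    for (size_t i = valid_len; i < page_size; i++)
    {
        if (page[i] != 0)
            return false;
    }

    return true;
}
```

A caller that reads a whole page but trusts only the first 'count' bytes could run such a check in an assert-enabled build to see whether the zero-padding assumption really holds.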

Also, if we've detected that the first requested buffer has been
evicted, is there any value in continuing the loop to see if more
recent buffers are available? For example, if the requested LSNs range
over buffers 4, 5, and 6, and 4 has already been evicted, should we try
to return LSN data from 5 and 6 at the proper offset in the dest
buffer? If so, we'd need to adjust the API so the caller knows what
parts of the dest buffer were filled in.

I'd second this capability for now to keep the API simple and clear,
but we can consider expanding it as needed.

I reviewed the v19j and attached v20 patch set:

1.
* The caller must ensure that it's reasonable to read from the WAL buffers,
* i.e. that the requested data is from the current timeline, that we're not
* in recovery, etc.

I still think XLogReadFromBuffers can simply return in any of the
above cases rather than documenting them as caller requirements. I feel
we must assume that a caller may ask for WAL from a different timeline
and/or while in recovery, and design the API to deal with it. Done that
way in the v20 patch.

2. Fixed some typos, reworded a few comments (i.e. used "current
insert/write position" instead of "Insert/Write pointer" like
elsewhere), ran pgindent.

3.
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.

Removed the above comment before WALRead() since we have that facility
now. Perhaps, we can say the callers can suck data directly from the
WAL buffers using XLogReadFromBuffers. But I have no strong opinion on
this.

4.
+     * Most callers will have already updated LogwrtResult when determining
+     * how far to read, but it's OK if it's out of date. (XXX: is it worth
+     * taking a spinlock to update LogwrtResult and check again before calling
+     * WaitXLogInsertionsToFinish()?)

If the callers use GetFlushRecPtr() to determine how far to read,
LogwrtResult will be *reasonably* up to date; otherwise it may not be.
If LogwrtResult is a bit old, XLogReadFromBuffers will call
WaitXLogInsertionsToFinish, which will just loop over all insertion
locks and return.

As far as the current WAL readers are concerned, we don't need an
explicit spinlock to determine LogwrtResult because all of them use
GetFlushRecPtr() to determine how far to read. If there's any caller
that's not updating LogwrtResult at all, we can consider reading
LogwrtResult ourselves in the future.

5. I think the two requirements specified at
/messages/by-id/20231109205836.zjoawdrn4q77yemv@awork3.anarazel.de
still hold with the v19j.

5.1 Never allow WAL being read that's past
XLogBytePosToRecPtr(XLogCtl->Insert->CurrBytePos) as it does not
exist.
5.2 If the to-be-read LSN is between XLogCtl->LogwrtResult->Write and
XLogBytePosToRecPtr(Insert->CurrBytePos) we need to call
WaitXLogInsertionsToFinish() before copying the data.

+    if (upto > LogwrtResult.Write)
+    {
+        XLogRecPtr writtenUpto = WaitXLogInsertionsToFinish(upto, false);
+
+        upto = Min(upto, writtenUpto);
+        nbytes = upto - startptr;
+    }

XLogReadFromBuffers ensures both of the above by clamping upto to
Min(upto, writtenUpto), since WaitXLogInsertionsToFinish returns the
position of the oldest insertion that is still in progress.

For instance, if the current write LSN is 100, the current insert LSN is
150 and upto is 200, we only read up to 150 if startptr is < 150; we
don't read anything if startptr is > 150.
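
The arithmetic in that example can be written down as a small standalone function. This is only a sketch: clamp_read_length() is a hypothetical name, write_pos and inserted_upto are passed in as plain numbers instead of coming from LogwrtResult.Write and WaitXLogInsertionsToFinish(), and the explicit guard against startptr lying beyond the clamped end is an addition of this sketch, not something taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* stand-in for the PostgreSQL typedef */

/*
 * Hypothetical helper: how many bytes starting at 'startptr' may be copied,
 * given the current write position and the position up to which all
 * insertions are known to have finished.
 */
static size_t
clamp_read_length(XLogRecPtr startptr, size_t count,
                  XLogRecPtr write_pos, XLogRecPtr inserted_upto)
{
    XLogRecPtr  upto = startptr + count;

    assert(write_pos <= inserted_upto);

    /* Entirely at or below the write position: nothing to clamp. */
    if (upto <= write_pos)
        return count;

    /* Never read past the finished insertions. */
    if (upto > inserted_upto)
        upto = inserted_upto;

    /* Guard added in this sketch: avoid unsigned wraparound below. */
    if (upto <= startptr)
        return 0;

    return (size_t) (upto - startptr);
}
```

With write_pos = 100 and inserted_upto = 150, a request for [120, 200) is shortened to 30 bytes, and a request starting at 160 yields nothing.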

6. I've modified the test module in v20-0002 patch as follows:
6.1 Renamed the module to read_wal_from_buffers, stripping the "test_"
prefix to keep the name short. Longer names can cause failures on some
Windows buildfarm (BF) members if the path/file name is too long.
6.2 Tweaked tests to hit WaitXLogInsertionsToFinish() and upto =
Min(upto, writtenUpto); in XLogReadFromBuffers.

PSA v20 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v20-0001-Add-XLogReadFromBuffers.patch (application/octet-stream)
From 76b71019e9067d96559639c299384222abb1651e Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 25 Jan 2024 08:19:01 +0000
Subject: [PATCH v20] Add XLogReadFromBuffers().

Allows reading directly from WAL buffers without a lock, avoiding the
need to wait for WAL flushing and read from the filesystem.

For now, the only caller is physical replication, but we can consider
expanding it to other callers as needed.

Author: Bharath Rupireddy
---
 src/backend/access/transam/xlog.c       | 147 ++++++++++++++++++++++--
 src/backend/access/transam/xlogreader.c |   3 -
 src/backend/replication/walsender.c     |   8 ++
 src/include/access/xlog.h               |   3 +
 4 files changed, 149 insertions(+), 12 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..4940e8ca29 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -698,7 +698,7 @@ static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 							  XLogRecPtr *PrevPtr);
-static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
+static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto, bool emitLog);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
@@ -1494,7 +1494,7 @@ WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
  * to make room for a new one, which in turn requires WALWriteLock.
  */
 static XLogRecPtr
-WaitXLogInsertionsToFinish(XLogRecPtr upto)
+WaitXLogInsertionsToFinish(XLogRecPtr upto, bool emitLog)
 {
 	uint64		bytepos;
 	XLogRecPtr	reservedUpto;
@@ -1521,9 +1521,10 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 	 */
 	if (upto > reservedUpto)
 	{
-		ereport(LOG,
-				(errmsg("request to flush past end of generated WAL; request %X/%X, current position %X/%X",
-						LSN_FORMAT_ARGS(upto), LSN_FORMAT_ARGS(reservedUpto))));
+		if (emitLog)
+			ereport(LOG,
+					(errmsg("request to flush past end of generated WAL; request %X/%X, current position %X/%X",
+							LSN_FORMAT_ARGS(upto), LSN_FORMAT_ARGS(reservedUpto))));
 		upto = reservedUpto;
 	}
 
@@ -1705,6 +1706,134 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL directly from WAL buffers, if available.
+ *
+ * This function reads 'count' bytes of WAL from WAL buffers into 'buf'
+ * starting at location 'startptr' and returns total bytes read.
+ *
+ * The bytes read may be fewer than requested if any of the WAL buffers in the
+ * requested range have been evicted, or if the last requested byte is beyond
+ * the current insert position.
+ *
+ * If reading beyond the current write position, this function will wait for
+ * concurrent inserters to finish. Otherwise, it does not wait at all.
+ *
+ * This function returns immediately if the requested data is not from the
+ * current timeline, or if the server is in recovery.
+ */
+Size
+XLogReadFromBuffers(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli)
+{
+	XLogRecPtr	ptr = startptr;
+	XLogRecPtr	upto = startptr + count;
+	Size		nbytes = count;
+	Size		ntotal = 0;
+	char	   *dst = buf;
+
+	if (RecoveryInProgress() ||
+		tli != GetWALInsertionTimeLine())
+		return ntotal;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+
+	/*
+	 * Caller requested very recent WAL data. Wait for any in-progress WAL
+	 * insertions to WAL buffers to finish.
+	 *
+	 * Most callers will have already updated LogwrtResult when determining
+	 * how far to read, but it's OK if it's out of date. XXX: is it worth
+	 * taking a spinlock to update LogwrtResult and check again before calling
+	 * WaitXLogInsertionsToFinish()?
+	 */
+	if (upto > LogwrtResult.Write)
+	{
+		XLogRecPtr	writtenUpto = WaitXLogInsertionsToFinish(upto, false);
+
+		upto = Min(upto, writtenUpto);
+		nbytes = upto - startptr;
+	}
+
+	/*
+	 * Loop through the buffers without a lock. For each buffer, atomically
+	 * read and verify the end pointer, then copy the data out, and finally
+	 * re-read and re-verify the end pointer.
+	 *
+	 * Once a page is evicted, it never returns to the WAL buffers, so if the
+	 * end pointer matches the expected end pointer before and after we copy
+	 * the data, then the right page must have been present during the data
+	 * copy. Read barriers are necessary to ensure that the data copy actually
+	 * happens between the two verification steps.
+	 *
+	 * If the verification fails, we simply terminate the loop and return with
+	 * the data that had been already copied out successfully.
+	 */
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int			idx;
+		const char *page;
+		const char *data;
+		Size		nread;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+		/*
+		 * First verification step: check that the correct page is present in
+		 * the WAL buffers.
+		 */
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/*
+		 * Ensure that the data copy and the first verification step are not
+		 * reordered.
+		 */
+		pg_read_barrier();
+
+		/* how much is available on this page to read? */
+		nread = Min(nbytes, XLOG_BLCKSZ - (data - page));
+
+		/* data copy */
+		memcpy(dst, data, nread);
+
+		/*
+		 * Ensure that the data copy and the second verification step are not
+		 * reordered.
+		 */
+		pg_read_barrier();
+
+		/*
+		 * Second verification step: check that the page we read from wasn't
+		 * evicted while we were copying the data.
+		 */
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+		if (expectedEndPtr != endptr)
+			break;
+
+		dst += nread;
+		ptr += nread;
+		ntotal += nread;
+		nbytes -= nread;
+	}
+
+	Assert(ntotal <= count);
+
+	return ntotal;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
@@ -1895,7 +2024,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 				 */
 				LWLockRelease(WALBufMappingLock);
 
-				WaitXLogInsertionsToFinish(OldPageRqstPtr);
+				WaitXLogInsertionsToFinish(OldPageRqstPtr, true);
 
 				LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
 
@@ -2689,7 +2818,7 @@ XLogFlush(XLogRecPtr record)
 		 * Before actually performing the write, wait for all in-flight
 		 * insertions to the pages we're about to write to finish.
 		 */
-		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr);
+		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr, true);
 
 		/*
 		 * Try to get the write lock. If we can't get it immediately, wait
@@ -2740,7 +2869,7 @@ XLogFlush(XLogRecPtr record)
 			 * We're only calling it again to allow insertpos to be moved
 			 * further forward, not to actually wait for anyone.
 			 */
-			insertpos = WaitXLogInsertionsToFinish(insertpos);
+			insertpos = WaitXLogInsertionsToFinish(insertpos, true);
 		}
 
 		/* try to write/flush later additions to XLOG as well */
@@ -2919,7 +3048,7 @@ XLogBackgroundFlush(void)
 	START_CRIT_SECTION();
 
 	/* now wait for any in-progress insertions to finish and get write lock */
-	WaitXLogInsertionsToFinish(WriteRqst.Write);
+	WaitXLogInsertionsToFinish(WriteRqst.Write, true);
 	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
 	LogwrtResult = XLogCtl->LogwrtResult;
 	if (WriteRqst.Write > LogwrtResult.Write ||
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7190156f2f..74a6b11866 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1500,9 +1500,6 @@ err:
  *
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
- *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
  */
 bool
 WALRead(XLogReaderState *state,
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 087031e9dc..95ba656a06 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2910,6 +2910,7 @@ XLogSendPhysical(void)
 	Size		nbytes;
 	XLogSegNo	segno;
 	WALReadError errinfo;
+	Size		rbytes;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -3125,6 +3126,13 @@ XLogSendPhysical(void)
 	enlargeStringInfo(&output_message, nbytes);
 
 retry:
+	/* Read from WAL buffers, if available. */
+	rbytes = XLogReadFromBuffers(&output_message.data[output_message.len],
+								 startptr, nbytes, xlogreader->seg.ws_tli);
+	output_message.len += rbytes;
+	startptr += rbytes;
+	nbytes -= rbytes;
+
 	if (!WALRead(xlogreader,
 				 &output_message.data[output_message.len],
 				 startptr,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 301c5fa11f..f8c281c799 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -252,6 +252,9 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern Size XLogReadFromBuffers(char *buf, XLogRecPtr startptr, Size count,
+								TimeLineID tli);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1
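
The verify, copy, re-verify loop in XLogReadFromBuffers() above can be tried out in isolation with a toy ring buffer. The following is a minimal sketch under stated assumptions: the page size, ring size and all names (pages, page_end, install_page, read_from_ring) are invented for the example, C11 acquire fences stand in for pg_read_barrier(), and the toy is single-threaded, so the fences only mark where the barriers would go.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 16u             /* toy page size; PostgreSQL uses XLOG_BLCKSZ */
#define NPAGES  4u              /* toy ring size */

typedef uint64_t RecPtr;

static char pages[NPAGES][PAGE_SZ];
static _Atomic uint64_t page_end[NPAGES];   /* end position of the page in each slot */

/* Writer side of the toy: install the page ending at 'endptr', filled with 'fill'. */
static void
install_page(RecPtr endptr, char fill)
{
    size_t      idx = (size_t) (((endptr - PAGE_SZ) / PAGE_SZ) % NPAGES);

    memset(pages[idx], fill, PAGE_SZ);
    atomic_store(&page_end[idx], endptr);
}

/* Reader side: copy up to 'count' bytes, stopping at the first missing page. */
static size_t
read_from_ring(char *dst, RecPtr startptr, size_t count)
{
    RecPtr      ptr = startptr;
    size_t      ntotal = 0;

    while (count > 0)
    {
        size_t      idx = (size_t) ((ptr / PAGE_SZ) % NPAGES);
        size_t      off = (size_t) (ptr % PAGE_SZ);
        RecPtr      expected_end = ptr + (PAGE_SZ - off);
        size_t      n = PAGE_SZ - off;

        if (n > count)
            n = count;

        /* First check: does this slot hold the page we expect? */
        if (atomic_load_explicit(&page_end[idx], memory_order_relaxed) != expected_end)
            break;
        atomic_thread_fence(memory_order_acquire);  /* stand-in for pg_read_barrier() */

        memcpy(dst + ntotal, pages[idx] + off, n);

        atomic_thread_fence(memory_order_acquire);  /* stand-in for pg_read_barrier() */
        /* Second check: was the slot reused while we were copying? */
        if (atomic_load_explicit(&page_end[idx], memory_order_relaxed) != expected_end)
            break;

        ptr += n;
        ntotal += n;
        count -= n;
    }

    return ntotal;
}
```

Bytes are counted only after the second check passes, so a page that gets replaced mid-copy contributes nothing to the returned length, and the caller is expected to fetch the remainder some other way.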

v20-0002-Add-test-module-for-verifying-WAL-read-from-WAL-.patch (application/octet-stream)
From d348c029ca3db228753b8db9cea214951eced4ee Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 25 Jan 2024 06:54:33 +0000
Subject: [PATCH v20] Add test module for verifying WAL read from WAL buffers

This commit adds a test module to verify WAL read from WAL
buffers.

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../modules/read_wal_from_buffers/.gitignore  |  4 ++
 .../modules/read_wal_from_buffers/Makefile    | 23 ++++++++
 .../modules/read_wal_from_buffers/meson.build | 33 ++++++++++++
 .../read_wal_from_buffers--1.0.sql            | 14 +++++
 .../read_wal_from_buffers.c                   | 41 +++++++++++++++
 .../read_wal_from_buffers.control             |  4 ++
 .../read_wal_from_buffers/t/001_basic.pl      | 52 +++++++++++++++++++
 9 files changed, 173 insertions(+)
 create mode 100644 src/test/modules/read_wal_from_buffers/.gitignore
 create mode 100644 src/test/modules/read_wal_from_buffers/Makefile
 create mode 100644 src/test/modules/read_wal_from_buffers/meson.build
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
 create mode 100644 src/test/modules/read_wal_from_buffers/t/001_basic.pl

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index e32c8925f6..4eba0fa2e2 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -12,6 +12,7 @@ SUBDIRS = \
 		  dummy_seclabel \
 		  libpq_pipeline \
 		  plsample \
+		  read_wal_from_buffers \
 		  spgist_name_ops \
 		  test_bloomfilter \
 		  test_copy_callbacks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 397e0906e6..f0b53eced7 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -32,6 +32,7 @@ subdir('test_resowner')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('read_wal_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
 subdir('xid_wraparound')
diff --git a/src/test/modules/read_wal_from_buffers/.gitignore b/src/test/modules/read_wal_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/read_wal_from_buffers/Makefile b/src/test/modules/read_wal_from_buffers/Makefile
new file mode 100644
index 0000000000..9e57a837f9
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/read_wal_from_buffers/Makefile
+
+MODULE_big = read_wal_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	read_wal_from_buffers.o
+PGFILEDESC = "read_wal_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = read_wal_from_buffers
+DATA = read_wal_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/read_wal_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/read_wal_from_buffers/meson.build b/src/test/modules/read_wal_from_buffers/meson.build
new file mode 100644
index 0000000000..3fac00d616
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+read_wal_from_buffers_sources = files(
+  'read_wal_from_buffers.c',
+)
+
+if host_system == 'windows'
+  read_wal_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'read_wal_from_buffers',
+    '--FILEDESC', 'read_wal_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+read_wal_from_buffers = shared_module('read_wal_from_buffers',
+  read_wal_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += read_wal_from_buffers
+
+test_install_data += files(
+  'read_wal_from_buffers.control',
+  'read_wal_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'read_wal_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
new file mode 100644
index 0000000000..82fa097d10
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
@@ -0,0 +1,14 @@
+/* src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION read_wal_from_buffers" to load this file. \quit
+
+--
+-- read_wal_from_buffers()
+--
+-- SQL function to read WAL from WAL buffers. Returns number of bytes read.
+--
+CREATE FUNCTION read_wal_from_buffers(IN lsn pg_lsn, IN bytes_to_read int,
+    bytes_read OUT int)
+AS 'MODULE_PATHNAME', 'read_wal_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
new file mode 100644
index 0000000000..da841da3f0
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
@@ -0,0 +1,41 @@
+/*--------------------------------------------------------------------------
+ *
+ * read_wal_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function to read WAL from WAL buffers. Returns number of bytes read.
+ */
+PG_FUNCTION_INFO_V1(read_wal_from_buffers);
+Datum
+read_wal_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	startptr = PG_GETARG_LSN(0);
+	int32		bytes_to_read = PG_GETARG_INT32(1);
+	Size		bytes_read = 0;
+	char	   *data = palloc0(bytes_to_read);
+
+	bytes_read = XLogReadFromBuffers(data, startptr,
+									 (Size) bytes_to_read,
+									 GetWALInsertionTimeLine());
+
+	pfree(data);
+
+	PG_RETURN_INT32(bytes_read);
+}
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
new file mode 100644
index 0000000000..b14d24751c
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/read_wal_from_buffers'
+relocatable = true
diff --git a/src/test/modules/read_wal_from_buffers/t/001_basic.pl b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..e1773da2c8
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
@@ -0,0 +1,52 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+my $node = PostgreSQL::Test::Cluster->new('test');
+
+$node->init;
+
+# Ensure nobody interferes with us so that the WAL in WAL buffers doesn't get
+# overwritten while running tests.
+$node->append_conf(
+	'postgresql.conf', qq(
+autovacuum = off
+checkpoint_timeout = 1h
+wal_writer_delay = 10000ms
+wal_writer_flush_after = 1GB
+));
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION read_wal_from_buffers;');
+
+# Get current insert LSN. After this, we generate some WAL which is guaranteed
+# to be in WAL buffers as there is no other WAL generating activity
+# happening on the server. We then verify if we can read the WAL from WAL
+# buffers using this LSN.
+my $lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+# Generate minimal WAL so that WAL buffers don't get overwritten.
+$node->safe_psql('postgres',
+	"CREATE TABLE t (c int); INSERT INTO t VALUES (1);");
+
+# Check if WAL is successfully read from WAL buffers.
+my $to_read = 8192;
+my $result = $node->safe_psql('postgres',
+	qq{SELECT read_wal_from_buffers(lsn := '$lsn', bytes_to_read := $to_read) > 0;});
+is($result, 't', "WAL is successfully read from WAL buffers");
+
+# Check with WAL that doesn't yet exist, i.e., 16MB past the current
+# flush LSN.
+$lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_flush_lsn()+16777216;');
+$to_read = 8192;
+$result = $node->safe_psql('postgres',
+	qq{SELECT read_wal_from_buffers(lsn := '$lsn', bytes_to_read := $to_read) = 0;});
+is($result, 't', "WAL that doesn't yet exist is not read from WAL buffers");
+
+done_testing();
-- 
2.34.1

v20-0003-Use-XLogReadFromBuffers-in-more-places.patch (application/octet-stream)
From 938d80bdaa6d702cb8e415b582555efa76574e55 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 25 Jan 2024 07:13:23 +0000
Subject: [PATCH v20] Use XLogReadFromBuffers in more places

---
 src/backend/access/transam/xlogutils.c | 12 +++++++++++-
 src/backend/postmaster/walsummarizer.c | 12 +++++++++++-
 src/backend/replication/walsender.c    | 12 +++++++++++-
 3 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index aa8667abd1..de526f7da7 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -894,6 +894,8 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 	WALReadError errinfo;
 	TimeLineID	currTLI;
+	Size		nbytes;
+	Size		rbytes;
 
 	loc = targetPagePtr + reqLen;
 
@@ -1006,12 +1008,20 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		count = read_upto - targetPagePtr;
 	}
 
+	/* Read from WAL buffers, if available. */
+	nbytes = XLOG_BLCKSZ;
+	rbytes = XLogReadFromBuffers(cur_page, targetPagePtr,
+								 nbytes, currTLI);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
 	/*
 	 * Even though we just determined how much of the page can be validly read
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
+	if (!WALRead(state, cur_page, targetPagePtr, nbytes, tli,
 				 &errinfo))
 		WALReadRaiseError(&errinfo);
 
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 9b883c21ca..33eb3a4870 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -1221,6 +1221,8 @@ summarizer_read_local_xlog_page(XLogReaderState *state,
 	int			count;
 	WALReadError errinfo;
 	SummarizerReadLocalXLogPrivate *private_data;
+	Size		nbytes;
+	Size		rbytes;
 
 	HandleWalSummarizerInterrupts();
 
@@ -1318,12 +1320,20 @@ summarizer_read_local_xlog_page(XLogReaderState *state,
 		}
 	}
 
+	/* Read from WAL buffers, if available. */
+	nbytes = XLOG_BLCKSZ;
+	rbytes = XLogReadFromBuffers(cur_page, targetPagePtr,
+								 nbytes, private_data->tli);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
 	/*
 	 * Even though we just determined how much of the page can be validly read
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+	if (!WALRead(state, cur_page, targetPagePtr, nbytes,
 				 private_data->tli, &errinfo))
 		WALReadRaiseError(&errinfo);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 95ba656a06..ab119ef29a 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1059,6 +1059,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	WALReadError errinfo;
 	XLogSegNo	segno;
 	TimeLineID	currTLI;
+	Size		nbytes;
+	Size		rbytes;
 
 	/*
 	 * Make sure we have enough WAL available before retrieving the current
@@ -1095,11 +1097,19 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	else
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
+	/* Read from WAL buffers, if available. */
+	nbytes = XLOG_BLCKSZ;
+	rbytes = XLogReadFromBuffers(cur_page, targetPagePtr,
+								 nbytes, currTLI);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
 	/* now actually read the data, we know it's there */
 	if (!WALRead(state,
 				 cur_page,
 				 targetPagePtr,
-				 XLOG_BLCKSZ,
+				 nbytes,
 				 currTLI,		/* Pass the current TLI because only
 								 * WalSndSegmentOpen controls whether new TLI
 								 * is needed. */
-- 
2.34.1
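
The call sites touched by the patch above share one pattern: try the WAL buffers first, then read whatever is left from the file. A minimal sketch of that pattern follows. read_wal_range() and the two callback types are hypothetical stand-ins for XLogReadFromBuffers() and WALRead(), and the toy_* functions exist only to exercise the helper. Unlike the patch, this sketch skips the file read entirely when the buffers served the whole request.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t RecPtr;

/* Hypothetical callbacks standing in for XLogReadFromBuffers() and WALRead(). */
typedef size_t (*buffer_read_fn) (char *buf, RecPtr startptr, size_t count);
typedef bool (*file_read_fn) (char *buf, RecPtr startptr, size_t count);

/* Serve as much as possible from the buffers, the remainder from the file. */
static bool
read_wal_range(char *buf, RecPtr startptr, size_t count,
               buffer_read_fn from_buffers, file_read_fn from_file)
{
    size_t      rbytes = from_buffers(buf, startptr, count);

    assert(rbytes <= count);

    if (rbytes == count)
        return true;            /* fully served from the buffers */

    return from_file(buf + rbytes, startptr + rbytes, count - rbytes);
}

/* Toy stand-ins used only to exercise the helper. */
static size_t toy_buffer_limit = 0; /* bytes the toy "buffers" can serve */
static int  toy_file_calls = 0;

static size_t
toy_from_buffers(char *buf, RecPtr startptr, size_t count)
{
    size_t      n = (count < toy_buffer_limit) ? count : toy_buffer_limit;

    (void) startptr;
    memset(buf, 'B', n);
    return n;
}

static bool
toy_from_file(char *buf, RecPtr startptr, size_t count)
{
    (void) startptr;
    memset(buf, 'F', count);
    toy_file_calls++;
    return true;
}
```

Because the buffer read only ever returns a prefix of the requested range, the caller needs nothing more than the byte count to know where the file read has to resume.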

#66Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#65)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Thu, 2024-01-25 at 14:35 +0530, Bharath Rupireddy wrote:

"expecting zeros at the end" - this can't always be true as the WAL

...

I think this needs to be discussed separately. If okay, I'll start a
new thread.

Thank you for investigating. When the above issue is handled, I'll be
more comfortable expanding the call sites for XLogReadFromBuffers().

Also, if we've detected that the first requested buffer has been
evicted, is there any value in continuing the loop to see if more
recent buffers are available? For example, if the requested LSNs range
over buffers 4, 5, and 6, and 4 has already been evicted, should we try
to return LSN data from 5 and 6 at the proper offset in the dest
buffer? If so, we'd need to adjust the API so the caller knows what
parts of the dest buffer were filled in.

I'd second this capability for now to keep the API simple and clear,
but we can consider expanding it as needed.

Agreed. This case doesn't seem important; I just thought I'd ask about
it.

If the callers use GetFlushRecPtr() to determine how far to read,
LogwrtResult will be *reasonably* up to date

It will be up-to-date enough that we'd never go through
WaitXLogInsertionsToFinish(), which is all we care about.

As far as the current WAL readers are concerned, we don't need an
explicit spinlock to determine LogwrtResult because all of them use
GetFlushRecPtr() to determine how far to read. If there's any caller
that's not updating LogwrtResult at all, we can consider reading
LogwrtResult ourselves in the future.

So we don't actually need that path yet, right?

5. I think the two requirements specified at
/messages/by-id/20231109205836.zjoawdrn4q77yemv@awork3.anarazel.de
still hold with the v19j.

Agreed.

PSA v20 patch set.

0001 is very close. I have the following suggestions:

* Don't just return zero. If the caller is doing something we don't
expect, we want to fix the caller. I understand you'd like this to be
more like a transparent optimization, and we may do that later, but I
don't think it's a good idea to do that now.

* There's currently no use for reading LSNs between Write and Insert,
so remove the WaitXLogInsertionsToFinish() code path. That also means
we don't need the extra emitLog parameter, so we can remove that. When
we have a use case, we can bring it all back.

If you agree, I can just make those adjustments (and do some final
checking) and commit 0001. Otherwise let me know what you think.

0002: How does the test control whether the data requested is before
the Flush pointer, the Write pointer, or the Insert pointer? What if
the walwriter comes in and moves one of those pointers before the next
statement is executed? Also, do you think a test module is required for
the basic functionality in 0001, or only when we start doing more
complex things like reading past the Flush pointer?

0003: can you explain why this is useful for wal summarizer to read
from the buffers?

Regards,
Jeff Davis

#67Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Jeff Davis (#66)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, Jan 26, 2024 at 8:31 AM Jeff Davis <pgsql@j-davis.com> wrote:

PSA v20 patch set.

0001 is very close. I have the following suggestions:

* Don't just return zero. If the caller is doing something we don't
expect, we want to fix the caller. I understand you'd like this to be
more like a transparent optimization, and we may do that later, but I
don't think it's a good idea to do that now.

+    if (RecoveryInProgress() ||
+        tli != GetWALInsertionTimeLine())
+        return ntotal;
+
+    Assert(!XLogRecPtrIsInvalid(startptr));

Are you suggesting erroring out instead of returning 0? If yes, I
disagree with it, because a failure to read due to unmet preconditions
doesn't necessarily have to result in an error. If we error out, the
immediate failure we see is in the src/bin/psql TAP test for calling
XLogReadFromBuffers when the server is in recovery. How about
returning a negative value instead of just 0 or returning true/false
just like WALRead?

* There's currently no use for reading LSNs between Write and Insert,
so remove the WaitXLogInsertionsToFinish() code path. That also means
we don't need the extra emitLog parameter, so we can remove that. When
we have a use case, we can bring it all back.

I disagree with this. I don't see anything wrong with
XLogReadFromBuffers having the capability to wait for in-progress
insertions to finish. In fact, it makes the function near-complete.
Imagine implementing an extension (maybe for fun or learning or
educational or production purposes) to read unflushed WAL directly
from WAL buffers using XLogReadFromBuffers as page_read callback with
xlogreader facility. AFAICT, I don't see a problem with
WaitXLogInsertionsToFinish logic in XLogReadFromBuffers.

FWIW, one important aspect of XLogReadFromBuffers is its ability to
read the unflushed WAL from WAL buffers. Also, see a note from Andres
here /messages/by-id/20231109205836.zjoawdrn4q77yemv@awork3.anarazel.de.

If you agree, I can just make those adjustments (and do some final
checking) and commit 0001. Otherwise let me know what you think.

Thanks. Please see my responses above.

0002: How does the test control whether the data requested is before
the Flush pointer, the Write pointer, or the Insert pointer? What if
the walwriter comes in and moves one of those pointers before the next
statement is executed?

I tried to keep the walwriter quiet with wal_writer_delay = 10000ms and
wal_writer_flush_after = 1GB so that it doesn't flush WAL in the
background. I also disabled autovacuum and set checkpoint_timeout to a
higher value. All of this is done to generate minimal WAL so that WAL
buffers don't get overwritten. Do you see any problems with it?

Also, do you think a test module is required for
the basic functionality in 0001, or only when we start doing more
complex things like reading past the Flush pointer?

With WaitXLogInsertionsToFinish in XLogReadFromBuffers, we already have
that capability. Having a separate test module ensures the code
is tested properly.

As far as the test is concerned, it verifies 2 cases:
1. Check if WAL is successfully read from WAL buffers. For this, the
test generates minimal WAL and reads from WAL buffers from the start
LSN = current insert LSN captured before the WAL generation.
2. Check with a WAL that doesn't yet exist. For this, the test reads
from WAL buffers from the start LSN = current flush LSN+16MB (a
randomly chosen higher value).
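
The two cases can be illustrated on a toy model of the WAL buffers. The
following is only a sketch under stated assumptions: the names
toy_xlblocks, toy_pages, toy_install_page and toy_read_from_buffers, as
well as the tiny page size, are hypothetical and exist only for this
example. It mimics the verify, copy, verify-again loop of
XLogReadFromBuffers in v21-0001, with C11 acquire fences standing in for
pg_read_barrier(). It is single-threaded and is not the patch's code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ   16		/* toy page size (assumption, not XLOG_BLCKSZ) */
#define TOY_NBUFFERS 4		/* number of slots in the toy ring */

/* End position of the page held in each slot; 0 means the slot is empty. */
static _Atomic uint64_t toy_xlblocks[TOY_NBUFFERS];
static char toy_pages[TOY_NBUFFERS * TOY_BLCKSZ];

/* Install the page that ends at 'endptr', filled with the byte 'fill'. */
static void
toy_install_page(uint64_t endptr, char fill)
{
	size_t		idx = (size_t) ((endptr / TOY_BLCKSZ - 1) % TOY_NBUFFERS);

	memset(toy_pages + idx * TOY_BLCKSZ, fill, TOY_BLCKSZ);
	atomic_store(&toy_xlblocks[idx], endptr);
}

/* Copy up to 'count' bytes starting at 'startptr'; return bytes copied. */
static size_t
toy_read_from_buffers(char *buf, uint64_t startptr, size_t count)
{
	uint64_t	ptr = startptr;
	size_t		ntotal = 0;

	while (ntotal < count)
	{
		size_t		idx = (size_t) ((ptr / TOY_BLCKSZ) % TOY_NBUFFERS);
		size_t		off = (size_t) (ptr % TOY_BLCKSZ);
		uint64_t	expected_end = ptr + (TOY_BLCKSZ - off);
		size_t		nread = count - ntotal;

		/* never read past the end of the current page */
		if (nread > TOY_BLCKSZ - off)
			nread = TOY_BLCKSZ - off;

		/* first verification: is the expected page in this slot? */
		if (atomic_load(&toy_xlblocks[idx]) != expected_end)
			break;
		atomic_thread_fence(memory_order_acquire);

		memcpy(buf + ntotal, toy_pages + idx * TOY_BLCKSZ + off, nread);

		/* second verification: was the page replaced during the copy? */
		atomic_thread_fence(memory_order_acquire);
		if (atomic_load(&toy_xlblocks[idx]) != expected_end)
			break;

		ptr += nread;
		ntotal += nread;
	}

	assert(ntotal <= count);
	return ntotal;
}
```

With pages ending at 16 and 32 installed, a read that starts inside the
first page returns data up to the first missing page (case 1), while a
read at a position far beyond anything installed returns 0 (case 2).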

0003: can you explain why this is useful for wal summarizer to read
from the buffers?

Can the WAL summarizer ever read the WAL on current TLI? I'm not so
sure about it, I haven't explored it in detail.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#68Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#67)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, 2024-01-26 at 19:31 +0530, Bharath Rupireddy wrote:

Are you suggesting to error out instead of returning 0?

We'd do neither of those things, because no caller should actually call
it while RecoveryInProgress() or on a different timeline.

How about
returning a negative value instead of just 0 or returning true/false
just like WALRead?

All of these things are functionally equivalent -- the same thing is
happening at the end. This is just a discussion about API style and how
that will interact with hypothetical callers that don't exist today.
And it can also be easily changed later, so we aren't stuck with
whatever decision happens here.

Imagine implementing an extension (maybe for fun or learning or
educational or production purposes) to read unflushed WAL directly
from WAL buffers using XLogReadFromBuffers as page_read callback with
xlogreader facility.

That makes sense, I didn't realize you intended to use this from an
extension. I'm fine considering that as a separate patch that could
potentially be committed soon after this one.

I'd like some more details, but can I please just commit the basic
functionality now-ish?

I tried to keep the walwriter quiet with wal_writer_delay = 10000ms and
wal_writer_flush_after = 1GB so that it doesn't flush WAL in the
background. I also disabled autovacuum and set checkpoint_timeout to a
higher value. All of this is done to generate minimal WAL so that WAL
buffers don't get overwritten. Do you see any problems with it?

Maybe check it against pg_current_wal_lsn(), and see if the Write
pointer moved ahead? Perhaps even have a (limited) loop that tries
again to catch it at the right time?

Can the WAL summarizer ever read the WAL on current TLI? I'm not so
sure about it, I haven't explored it in detail.

Let's just not call XLogReadFromBuffers from there.

Regards,
Jeff Davis

#69Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Jeff Davis (#68)
4 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Sat, Jan 27, 2024 at 1:04 AM Jeff Davis <pgsql@j-davis.com> wrote:

All of these things are functionally equivalent -- the same thing is
happening at the end. This is just a discussion about API style and how
that will interact with hypothetical callers that don't exist today.
And it can also be easily changed later, so we aren't stuck with
whatever decision happens here.

I'll leave that up to you. I'm okay either way: 1) ensure the caller
doesn't use XLogReadFromBuffers, or 2) have XLogReadFromBuffers return
as if nothing was read when in recovery or on a different timeline.

Imagine implementing an extension (maybe for fun or learning or
educational or production purposes) to read unflushed WAL directly
from WAL buffers using XLogReadFromBuffers as page_read callback with
xlogreader facility.

That makes sense, I didn't realize you intended to use this from an
extension. I'm fine considering that as a separate patch that could
potentially be committed soon after this one.

Yes, I've turned that into 0002 patch.

I'd like some more details, but can I please just commit the basic
functionality now-ish?

+1.

I tried to keep the walwriter quiet with wal_writer_delay = 10000ms and
wal_writer_flush_after = 1GB so that it doesn't flush WAL in the
background. I also disabled autovacuum and set checkpoint_timeout to a
higher value. All of this is done to generate minimal WAL so that WAL
buffers don't get overwritten. Do you see any problems with it?

Maybe check it against pg_current_wal_lsn(), and see if the Write
pointer moved ahead? Perhaps even have a (limited) loop that tries
again to catch it at the right time?

Adding a loop seems reasonable here, and I've done that in v21-0003.
I've also added wal_level = minimal, following
src/test/recovery/t/039_end_of_wal.pl introduced by commit bae868caf22,
which also tries to keep WAL activity to a minimum.

Can the WAL summarizer ever read the WAL on current TLI? I'm not so
sure about it, I haven't explored it in detail.

Let's just not call XLogReadFromBuffers from there.

Removed.

PSA v21 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v21-0001-Add-XLogReadFromBuffers.patch (application/octet-stream)
From 95ba60dd3afdc134329f3b017588264103f85985 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 27 Jan 2024 05:48:30 +0000
Subject: [PATCH v21] Add XLogReadFromBuffers().

Allows reading directly from WAL buffers without a lock, avoiding the
need to wait for WAL flushing and read from the filesystem.

For now, the only caller is physical replication, but we can consider
expanding it to other callers as needed.
---
 src/backend/access/transam/xlog.c       | 106 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |   3 -
 src/backend/replication/walsender.c     |   8 ++
 src/include/access/xlog.h               |   3 +
 4 files changed, 117 insertions(+), 3 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..eea50bea3c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1705,6 +1705,112 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL directly from WAL buffers, if available.
+ *
+ * This function reads 'count' bytes of WAL from WAL buffers into 'buf'
+ * starting at location 'startptr' and returns total bytes read.
+ *
+ * The bytes read may be fewer than requested if any of the WAL buffers in the
+ * requested range have been evicted.
+ *
+ * This function returns immediately if the requested data is not from the
+ * current timeline, or if the server is in recovery.
+ */
+Size
+XLogReadFromBuffers(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli)
+{
+	XLogRecPtr	ptr = startptr;
+	Size		nbytes = count;
+	Size		ntotal = 0;
+	char	   *dst = buf;
+
+	if (RecoveryInProgress() ||
+		tli != GetWALInsertionTimeLine())
+		return ntotal;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+
+	/*
+	 * Loop through the buffers without a lock. For each buffer, atomically
+	 * read and verify the end pointer, then copy the data out, and finally
+	 * re-read and re-verify the end pointer.
+	 *
+	 * Once a page is evicted, it never returns to the WAL buffers, so if the
+	 * end pointer matches the expected end pointer before and after we copy
+	 * the data, then the right page must have been present during the data
+	 * copy. Read barriers are necessary to ensure that the data copy actually
+	 * happens between the two verification steps.
+	 *
+	 * If the verification fails, we simply terminate the loop and return with
+	 * the data that had been already copied out successfully.
+	 */
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int			idx;
+		const char *page;
+		const char *data;
+		Size		nread;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+		/*
+		 * First verification step: check that the correct page is present in
+		 * the WAL buffers.
+		 */
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/*
+		 * Ensure that the data copy and the first verification step are not
+		 * reordered.
+		 */
+		pg_read_barrier();
+
+		/* how much is available on this page to read? */
+		nread = Min(nbytes, XLOG_BLCKSZ - (data - page));
+
+		/* data copy */
+		memcpy(dst, data, nread);
+
+		/*
+		 * Ensure that the data copy and the second verification step are not
+		 * reordered.
+		 */
+		pg_read_barrier();
+
+		/*
+		 * Second verification step: check that the page we read from wasn't
+		 * evicted while we were copying the data.
+		 */
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+		if (expectedEndPtr != endptr)
+			break;
+
+		dst += nread;
+		ptr += nread;
+		ntotal += nread;
+		nbytes -= nread;
+	}
+
+	Assert(ntotal <= count);
+
+	return ntotal;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7190156f2f..74a6b11866 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1500,9 +1500,6 @@ err:
  *
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
- *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
  */
 bool
 WALRead(XLogReaderState *state,
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index aa80f3de20..7efe9ad010 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2910,6 +2910,7 @@ XLogSendPhysical(void)
 	Size		nbytes;
 	XLogSegNo	segno;
 	WALReadError errinfo;
+	Size		rbytes;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -3125,6 +3126,13 @@ XLogSendPhysical(void)
 	enlargeStringInfo(&output_message, nbytes);
 
 retry:
+	/* Read from WAL buffers, if available. */
+	rbytes = XLogReadFromBuffers(&output_message.data[output_message.len],
+								 startptr, nbytes, xlogreader->seg.ws_tli);
+	output_message.len += rbytes;
+	startptr += rbytes;
+	nbytes -= rbytes;
+
 	if (!WALRead(xlogreader,
 				 &output_message.data[output_message.len],
 				 startptr,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 301c5fa11f..f8c281c799 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -252,6 +252,9 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern Size XLogReadFromBuffers(char *buf, XLogRecPtr startptr, Size count,
+								TimeLineID tli);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1

v21-0002-Allow-XLogReadFromBuffers-to-wait-for-in-progres.patch (application/octet-stream)
From 1e9dcbfb18c26a8e0fbbf90bac6e7890afcf012d Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 27 Jan 2024 06:23:07 +0000
Subject: [PATCH v21] Allow XLogReadFromBuffers to wait for in-progress
 insertions

---
 src/backend/access/transam/xlog.c | 43 ++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index eea50bea3c..20fc2c6036 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -698,7 +698,7 @@ static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 							  XLogRecPtr *PrevPtr);
-static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
+static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto, bool emitLog);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
@@ -1494,7 +1494,7 @@ WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
  * to make room for a new one, which in turn requires WALWriteLock.
  */
 static XLogRecPtr
-WaitXLogInsertionsToFinish(XLogRecPtr upto)
+WaitXLogInsertionsToFinish(XLogRecPtr upto, bool emitLog)
 {
 	uint64		bytepos;
 	XLogRecPtr	reservedUpto;
@@ -1521,9 +1521,10 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 	 */
 	if (upto > reservedUpto)
 	{
-		ereport(LOG,
-				(errmsg("request to flush past end of generated WAL; request %X/%X, current position %X/%X",
-						LSN_FORMAT_ARGS(upto), LSN_FORMAT_ARGS(reservedUpto))));
+		if (emitLog)
+			ereport(LOG,
+					(errmsg("request to flush past end of generated WAL; request %X/%X, current position %X/%X",
+							LSN_FORMAT_ARGS(upto), LSN_FORMAT_ARGS(reservedUpto))));
 		upto = reservedUpto;
 	}
 
@@ -1712,7 +1713,11 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
  * starting at location 'startptr' and returns total bytes read.
  *
  * The bytes read may be fewer than requested if any of the WAL buffers in the
- * requested range have been evicted.
+ * requested range have been evicted, or if the last requested byte is beyond
+ * the current insert position.
+ *
+ * If reading beyond the current write position, this function will wait for
+ * concurrent inserters to finish. Otherwise, it does not wait at all.
  *
  * This function returns immediately if the requested data is not from the
  * current timeline, or if the server is in recovery.
@@ -1724,6 +1729,7 @@ XLogReadFromBuffers(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli)
 	Size		nbytes = count;
 	Size		ntotal = 0;
 	char	   *dst = buf;
+	XLogRecPtr	upto = startptr + count;
 
 	if (RecoveryInProgress() ||
 		tli != GetWALInsertionTimeLine())
@@ -1731,6 +1737,23 @@ XLogReadFromBuffers(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli)
 
 	Assert(!XLogRecPtrIsInvalid(startptr));
 
+	/*
+	 * Caller requested very recent WAL data. Wait for any in-progress WAL
+	 * insertions to WAL buffers to finish.
+	 *
+	 * Most callers will have already updated LogwrtResult when determining
+	 * how far to read, but it's OK if it's out of date. XXX: is it worth
+	 * taking a spinlock to update LogwrtResult and check again before calling
+	 * WaitXLogInsertionsToFinish()?
+	 */
+	if (upto > LogwrtResult.Write)
+	{
+		XLogRecPtr	writtenUpto = WaitXLogInsertionsToFinish(upto, false);
+
+		upto = Min(upto, writtenUpto);
+		nbytes = upto - startptr;
+	}
+
 	/*
 	 * Loop through the buffers without a lock. For each buffer, atomically
 	 * read and verify the end pointer, then copy the data out, and finally
@@ -2001,7 +2024,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 				 */
 				LWLockRelease(WALBufMappingLock);
 
-				WaitXLogInsertionsToFinish(OldPageRqstPtr);
+				WaitXLogInsertionsToFinish(OldPageRqstPtr, true);
 
 				LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
 
@@ -2795,7 +2818,7 @@ XLogFlush(XLogRecPtr record)
 		 * Before actually performing the write, wait for all in-flight
 		 * insertions to the pages we're about to write to finish.
 		 */
-		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr);
+		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr, true);
 
 		/*
 		 * Try to get the write lock. If we can't get it immediately, wait
@@ -2846,7 +2869,7 @@ XLogFlush(XLogRecPtr record)
 			 * We're only calling it again to allow insertpos to be moved
 			 * further forward, not to actually wait for anyone.
 			 */
-			insertpos = WaitXLogInsertionsToFinish(insertpos);
+			insertpos = WaitXLogInsertionsToFinish(insertpos, true);
 		}
 
 		/* try to write/flush later additions to XLOG as well */
@@ -3025,7 +3048,7 @@ XLogBackgroundFlush(void)
 	START_CRIT_SECTION();
 
 	/* now wait for any in-progress insertions to finish and get write lock */
-	WaitXLogInsertionsToFinish(WriteRqst.Write);
+	WaitXLogInsertionsToFinish(WriteRqst.Write, true);
 	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
 	LogwrtResult = XLogCtl->LogwrtResult;
 	if (WriteRqst.Write > LogwrtResult.Write ||
-- 
2.34.1

v21-0003-Add-test-module-for-verifying-WAL-read-from-WAL-.patch (application/octet-stream)
From e78206889c4ffd5a52033b4e35814bdc74560f7b Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 27 Jan 2024 07:03:48 +0000
Subject: [PATCH v21] Add test module for verifying WAL read from WAL  buffers

This commit adds a test module to verify WAL read from WAL
buffers.

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../modules/read_wal_from_buffers/.gitignore  |  4 ++
 .../modules/read_wal_from_buffers/Makefile    | 23 ++++++
 .../modules/read_wal_from_buffers/meson.build | 33 +++++++++
 .../read_wal_from_buffers--1.0.sql            | 14 ++++
 .../read_wal_from_buffers.c                   | 41 +++++++++++
 .../read_wal_from_buffers.control             |  4 ++
 .../read_wal_from_buffers/t/001_basic.pl      | 71 +++++++++++++++++++
 9 files changed, 192 insertions(+)
 create mode 100644 src/test/modules/read_wal_from_buffers/.gitignore
 create mode 100644 src/test/modules/read_wal_from_buffers/Makefile
 create mode 100644 src/test/modules/read_wal_from_buffers/meson.build
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
 create mode 100644 src/test/modules/read_wal_from_buffers/t/001_basic.pl

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index e32c8925f6..4eba0fa2e2 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -12,6 +12,7 @@ SUBDIRS = \
 		  dummy_seclabel \
 		  libpq_pipeline \
 		  plsample \
+		  read_wal_from_buffers \
 		  spgist_name_ops \
 		  test_bloomfilter \
 		  test_copy_callbacks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 397e0906e6..f0b53eced7 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -32,6 +32,7 @@ subdir('test_resowner')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('read_wal_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
 subdir('xid_wraparound')
diff --git a/src/test/modules/read_wal_from_buffers/.gitignore b/src/test/modules/read_wal_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/read_wal_from_buffers/Makefile b/src/test/modules/read_wal_from_buffers/Makefile
new file mode 100644
index 0000000000..9e57a837f9
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/read_wal_from_buffers/Makefile
+
+MODULE_big = read_wal_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	read_wal_from_buffers.o
+PGFILEDESC = "read_wal_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = read_wal_from_buffers
+DATA = read_wal_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/read_wal_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/read_wal_from_buffers/meson.build b/src/test/modules/read_wal_from_buffers/meson.build
new file mode 100644
index 0000000000..3fac00d616
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+read_wal_from_buffers_sources = files(
+  'read_wal_from_buffers.c',
+)
+
+if host_system == 'windows'
+  read_wal_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'read_wal_from_buffers',
+    '--FILEDESC', 'read_wal_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+read_wal_from_buffers = shared_module('read_wal_from_buffers',
+  read_wal_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += read_wal_from_buffers
+
+test_install_data += files(
+  'read_wal_from_buffers.control',
+  'read_wal_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'read_wal_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
new file mode 100644
index 0000000000..82fa097d10
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
@@ -0,0 +1,14 @@
+/* src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION read_wal_from_buffers" to load this file. \quit
+
+--
+-- read_wal_from_buffers()
+--
+-- SQL function to read WAL from WAL buffers. Returns number of bytes read.
+--
+CREATE FUNCTION read_wal_from_buffers(IN lsn pg_lsn, IN bytes_to_read int,
+    bytes_read OUT int)
+AS 'MODULE_PATHNAME', 'read_wal_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
new file mode 100644
index 0000000000..da841da3f0
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
@@ -0,0 +1,41 @@
+/*--------------------------------------------------------------------------
+ *
+ * read_wal_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function to read WAL from WAL buffers. Returns number of bytes read.
+ */
+PG_FUNCTION_INFO_V1(read_wal_from_buffers);
+Datum
+read_wal_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	startptr = PG_GETARG_LSN(0);
+	int32		bytes_to_read = PG_GETARG_INT32(1);
+	Size		bytes_read = 0;
+	char	   *data = palloc0(bytes_to_read);
+
+	bytes_read = XLogReadFromBuffers(data, startptr,
+									 (Size) bytes_to_read,
+									 GetWALInsertionTimeLine());
+
+	pfree(data);
+
+	PG_RETURN_INT32(bytes_read);
+}
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
new file mode 100644
index 0000000000..b14d24751c
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/read_wal_from_buffers'
+relocatable = true
diff --git a/src/test/modules/read_wal_from_buffers/t/001_basic.pl b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..62ea21e541
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
@@ -0,0 +1,71 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+
+# Setup a new node.  The configuration chosen here minimizes the number
+# of arbitrary records that could get generated in a cluster.  Enlarging
+# checkpoint_timeout avoids noise with checkpoint activity.  wal_level
+# set to "minimal" avoids random standby snapshot records.  Autovacuum
+# could also trigger randomly, generating random WAL activity of its own.
+# Enlarging wal_writer_delay and wal_writer_flush_after avoid background
+# wal flush by walwriter.
+my $node = PostgreSQL::Test::Cluster->new("node");
+$node->init;
+$node->append_conf(
+	'postgresql.conf',
+	q[wal_level = minimal
+	  autovacuum = off
+	  checkpoint_timeout = '30min'
+	  wal_writer_delay = 10000ms
+	  wal_writer_flush_after = 1GB
+]);
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION read_wal_from_buffers;');
+
+$node->safe_psql('postgres', 'CREATE TABLE t (c int);');
+
+my $result = 0;
+my $lsn;
+my $to_read;
+
+# Wait until we read from WAL buffers
+for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+	# Get current insert LSN. After this, we generate some WAL which is guaranteed
+	# to be in WAL buffers as there is no other WAL-generating activity
+	# happening on the server. We then verify if we can read the WAL from WAL
+	# buffers using this LSN.
+	$lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+	# Generate minimal WAL so that WAL buffers don't get overwritten.
+	$node->safe_psql('postgres', "INSERT INTO t VALUES ($i);");
+
+	$to_read = 8192;
+
+	if ($node->safe_psql('postgres',
+		qq{SELECT read_wal_from_buffers(lsn := '$lsn', bytes_to_read := $to_read) > 0;}))
+	{
+		$result = 1;
+		last;
+	}
+
+	usleep(100_000);
+}
+ok($result, 'waited until WAL is successfully read from WAL buffers');
+
+# Check with a WAL that doesn't yet exist i.e., 16MB starting from current
+# flush LSN.
+$lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_flush_lsn()+16777216;');
+$to_read = 8192;
+$result = $node->safe_psql('postgres',
+	qq{SELECT read_wal_from_buffers(lsn := '$lsn', bytes_to_read := $to_read) = 0;});
+is($result, 't', "WAL that doesn't yet exist is not read from WAL buffers");
+
+done_testing();
-- 
2.34.1

v21-0004-Use-XLogReadFromBuffers-in-more-places.patch (application/octet-stream)
From 0a3f3ea0f849e0bef731efe4c66ed17745b44c8d Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 27 Jan 2024 07:04:51 +0000
Subject: [PATCH v21] Use XLogReadFromBuffers in more places

---
 src/backend/access/transam/xlogutils.c | 12 +++++++++++-
 src/backend/replication/walsender.c    | 12 +++++++++++-
 2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index aa8667abd1..de526f7da7 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -894,6 +894,8 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 	WALReadError errinfo;
 	TimeLineID	currTLI;
+	Size		nbytes;
+	Size		rbytes;
 
 	loc = targetPagePtr + reqLen;
 
@@ -1006,12 +1008,20 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		count = read_upto - targetPagePtr;
 	}
 
+	/* Read from WAL buffers, if available. */
+	nbytes = XLOG_BLCKSZ;
+	rbytes = XLogReadFromBuffers(cur_page, targetPagePtr,
+								 nbytes, currTLI);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
 	/*
 	 * Even though we just determined how much of the page can be validly read
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
+	if (!WALRead(state, cur_page, targetPagePtr, nbytes, tli,
 				 &errinfo))
 		WALReadRaiseError(&errinfo);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 7efe9ad010..4bc8d5e320 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1059,6 +1059,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	WALReadError errinfo;
 	XLogSegNo	segno;
 	TimeLineID	currTLI;
+	Size		nbytes;
+	Size		rbytes;
 
 	/*
 	 * Make sure we have enough WAL available before retrieving the current
@@ -1095,11 +1097,19 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	else
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
+	/* Read from WAL buffers, if available. */
+	nbytes = XLOG_BLCKSZ;
+	rbytes = XLogReadFromBuffers(cur_page, targetPagePtr,
+								 nbytes, currTLI);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
 	/* now actually read the data, we know it's there */
 	if (!WALRead(state,
 				 cur_page,
 				 targetPagePtr,
-				 XLOG_BLCKSZ,
+				 nbytes,
 				 currTLI,		/* Pass the current TLI because only
 								 * WalSndSegmentOpen controls whether new TLI
 								 * is needed. */
-- 
2.34.1

#70Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Bharath Rupireddy (#69)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hmm, this looks quite nice and simple. My only comment is that a
sequence like this

/* Read from WAL buffers, if available. */
rbytes = XLogReadFromBuffers(&output_message.data[output_message.len],
startptr, nbytes, xlogreader->seg.ws_tli);
output_message.len += rbytes;
startptr += rbytes;
nbytes -= rbytes;

if (!WALRead(xlogreader,
&output_message.data[output_message.len],
startptr,

leaves you wondering if WALRead() should be called at all or not, in the
case when all bytes were read by XLogReadFromBuffers. I think in many
cases what's going to happen is that nbytes is going to be zero, and
then WALRead is going to return having done nothing in its inner loop.
I think this warrants a comment somewhere. Alternatively, we could
short-circuit the 'if' expression so that WALRead() is not called in
that case (but I'm not sure it's worth the loss of code clarity).

Also, but this is really quite minor, it seems sad to add more functions
with the prefix XLog, when we have renamed things to use the prefix WAL,
and we have kept the old names only to avoid backpatchability issues.
I mean, if we have WALRead() already, wouldn't it make perfect sense to
name the new routine WALReadFromBuffers?

--
Álvaro Herrera 48°01'N 7°57'E — https://www.EnterpriseDB.com/
"Tiene valor aquel que admite que es un cobarde" (Fernandel)

#71Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Alvaro Herrera (#70)
4 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Tue, Jan 30, 2024 at 11:01 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Hmm, this looks quite nice and simple.

Thanks for looking at it.

My only comment is that a
sequence like this

/* Read from WAL buffers, if available. */
rbytes = XLogReadFromBuffers(&output_message.data[output_message.len],
startptr, nbytes, xlogreader->seg.ws_tli);
output_message.len += rbytes;
startptr += rbytes;
nbytes -= rbytes;

if (!WALRead(xlogreader,
&output_message.data[output_message.len],
startptr,

leaves you wondering if WALRead() should be called at all or not, in the
case when all bytes were read by XLogReadFromBuffers. I think in many
cases what's going to happen is that nbytes is going to be zero, and
then WALRead is going to return having done nothing in its inner loop.
I think this warrants a comment somewhere. Alternatively, we could
short-circuit the 'if' expression so that WALRead() is not called in
that case (but I'm not sure it's worth the loss of code clarity).

Short-circuiting avoids a function call when the read from WAL buffers
satisfies the request fully. And the code isn't that clumsy with the
change; see the following. I've changed it in the attached v22 patch set.

if (nbytes > 0 &&
!WALRead(xlogreader,
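
As a rough illustration of the buffers-first pattern with the short-circuit discussed above, here is a minimal standalone sketch. It is not PostgreSQL code: `read_from_buffers`, `read_from_file`, `read_wal`, `wal_image` and the `cached_*` variables are hypothetical stand-ins for WALReadFromBuffers(), WALRead() and the WAL itself, assumed only for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for this sketch; none of these are PostgreSQL APIs. */
static const char wal_image[] = "0123456789abcdef"; /* pretend WAL contents */
static size_t cached_from = 0;  /* bytes [cached_from, cached_upto) are "in buffers" */
static size_t cached_upto = 0;
static int file_reads = 0;      /* counts how often the file fallback ran */

/* Copy as much of [start, start + count) as is cached; return bytes copied. */
static size_t
read_from_buffers(char *dst, size_t start, size_t count)
{
    size_t n;

    if (start < cached_from || start >= cached_upto)
        return 0;
    n = cached_upto - start;
    if (n > count)
        n = count;
    memcpy(dst, wal_image + start, n);
    return n;
}

/* File fallback; always succeeds in this sketch. */
static int
read_from_file(char *dst, size_t start, size_t count)
{
    file_reads++;
    memcpy(dst, wal_image + start, count);
    return 1;
}

/* Buffers first, then the file only for whatever is still missing. */
static int
read_wal(char *dst, size_t start, size_t count)
{
    size_t rbytes = read_from_buffers(dst, start, count);

    dst += rbytes;
    start += rbytes;
    count -= rbytes;

    /* Short-circuit: skip the file read when the buffers satisfied everything. */
    if (count > 0 && !read_from_file(dst, start, count))
        return 0;
    return 1;
}
```

With this shape, a fully cached request never reaches the fallback, and a partially cached one asks the file only for the tail.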

Also, but this is really quite minor, it seems sad to add more functions
with the prefix XLog, when we have renamed things to use the prefix WAL,
and we have kept the old names only to avoid backpatchability issues.
I mean, if we have WALRead() already, wouldn't it make perfect sense to
name the new routine WALReadFromBuffers?

WALReadFromBuffers looks better. Used that in v22 patch.

Please see the attached v22 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v22-0004-Use-WALReadFromBuffers-in-more-places.patch (application/octet-stream)
From 73d640cbac5d33c151c24c71613fc89f99a78c97 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 31 Jan 2024 07:48:57 +0000
Subject: [PATCH v22 4/4] Use WALReadFromBuffers() in more places

---
 src/backend/access/transam/xlogutils.c | 14 +++++++++++++-
 src/backend/replication/walsender.c    | 16 +++++++++++++---
 2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index aa8667abd1..1740ac3160 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -894,6 +894,8 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 	WALReadError errinfo;
 	TimeLineID	currTLI;
+	Size		nbytes;
+	Size		rbytes;
 
 	loc = targetPagePtr + reqLen;
 
@@ -1006,12 +1008,22 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		count = read_upto - targetPagePtr;
 	}
 
+	/* Attempt to read WAL from WAL buffers first. */
+	nbytes = XLOG_BLCKSZ;
+	rbytes = WALReadFromBuffers(cur_page, targetPagePtr, nbytes, currTLI);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
 	/*
+	 * Now read the remaining WAL from WAL file.
+	 *
 	 * Even though we just determined how much of the page can be validly read
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
+	if (nbytes > 0 &&
+		!WALRead(state, cur_page, targetPagePtr, nbytes, tli,
 				 &errinfo))
 		WALReadRaiseError(&errinfo);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 0551f0f2d8..3f515bbf18 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1059,6 +1059,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	WALReadError errinfo;
 	XLogSegNo	segno;
 	TimeLineID	currTLI;
+	Size		nbytes;
+	Size		rbytes;
 
 	/*
 	 * Make sure we have enough WAL available before retrieving the current
@@ -1095,11 +1097,19 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	else
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
-	/* now actually read the data, we know it's there */
-	if (!WALRead(state,
+	/* Attempt to read WAL from WAL buffers first. */
+	nbytes = XLOG_BLCKSZ;
+	rbytes = WALReadFromBuffers(cur_page, targetPagePtr, nbytes, currTLI);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
+	/* Now read the remaining WAL from WAL file. */
+	if (nbytes > 0 &&
+		!WALRead(state,
 				 cur_page,
 				 targetPagePtr,
-				 XLOG_BLCKSZ,
+				 nbytes,
 				 currTLI,		/* Pass the current TLI because only
 								 * WalSndSegmentOpen controls whether new TLI
 								 * is needed. */
-- 
2.34.1

v22-0002-Allow-WALReadFromBuffers-to-wait-for-in-progress.patch (application/octet-stream)
From 5b6b2ebc60100d6d062bd837aa30f5943d4212cc Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 31 Jan 2024 07:27:11 +0000
Subject: [PATCH v22 2/4] Allow WALReadFromBuffers() to wait for in-progress
 insertions

---
 src/backend/access/transam/xlog.c | 43 ++++++++++++++++++++++++-------
 1 file changed, 33 insertions(+), 10 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0d87a66c59..d82557886e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -698,7 +698,7 @@ static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 							  XLogRecPtr *PrevPtr);
-static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
+static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto, bool emitLog);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
@@ -1494,7 +1494,7 @@ WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
  * to make room for a new one, which in turn requires WALWriteLock.
  */
 static XLogRecPtr
-WaitXLogInsertionsToFinish(XLogRecPtr upto)
+WaitXLogInsertionsToFinish(XLogRecPtr upto, bool emitLog)
 {
 	uint64		bytepos;
 	XLogRecPtr	reservedUpto;
@@ -1521,9 +1521,10 @@ WaitXLogInsertionsToFinish(XLogRecPtr upto)
 	 */
 	if (upto > reservedUpto)
 	{
-		ereport(LOG,
-				(errmsg("request to flush past end of generated WAL; request %X/%X, current position %X/%X",
-						LSN_FORMAT_ARGS(upto), LSN_FORMAT_ARGS(reservedUpto))));
+		if (emitLog)
+			ereport(LOG,
+					(errmsg("request to flush past end of generated WAL; request %X/%X, current position %X/%X",
+							LSN_FORMAT_ARGS(upto), LSN_FORMAT_ARGS(reservedUpto))));
 		upto = reservedUpto;
 	}
 
@@ -1712,7 +1713,11 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
  * starting at location 'startptr' and returns total bytes read.
  *
  * The bytes read may be fewer than requested if any of the WAL buffers in the
- * requested range have been evicted.
+ * requested range have been evicted, or if the last requested byte is beyond
+ * the current insert position.
+ *
+ * If reading beyond the current write position, this function will wait for
+ * concurrent inserters to finish. Otherwise, it does not wait at all.
  *
  * This function returns immediately if the requested data is not from the
  * current timeline, or if the server is in recovery.
@@ -1724,6 +1729,7 @@ WALReadFromBuffers(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli)
 	Size		nbytes = count;
 	Size		ntotal = 0;
 	char	   *dst = buf;
+	XLogRecPtr	upto = startptr + count;
 
 	if (RecoveryInProgress() ||
 		tli != GetWALInsertionTimeLine())
@@ -1731,6 +1737,23 @@ WALReadFromBuffers(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli)
 
 	Assert(!XLogRecPtrIsInvalid(startptr));
 
+	/*
+	 * Caller requested very recent WAL data. Wait for any in-progress WAL
+	 * insertions to WAL buffers to finish.
+	 *
+	 * Most callers will have already updated LogwrtResult when determining
+	 * how far to read, but it's OK if it's out of date. XXX: is it worth
+	 * taking a spinlock to update LogwrtResult and check again before calling
+	 * WaitXLogInsertionsToFinish()?
+	 */
+	if (upto > LogwrtResult.Write)
+	{
+		XLogRecPtr	writtenUpto = WaitXLogInsertionsToFinish(upto, false);
+
+		upto = Min(upto, writtenUpto);
+		nbytes = upto - startptr;
+	}
+
 	/*
 	 * Loop through the buffers without a lock. For each buffer, atomically
 	 * read and verify the end pointer, then copy the data out, and finally
@@ -2001,7 +2024,7 @@ AdvanceXLInsertBuffer(XLogRecPtr upto, TimeLineID tli, bool opportunistic)
 				 */
 				LWLockRelease(WALBufMappingLock);
 
-				WaitXLogInsertionsToFinish(OldPageRqstPtr);
+				WaitXLogInsertionsToFinish(OldPageRqstPtr, true);
 
 				LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
 
@@ -2795,7 +2818,7 @@ XLogFlush(XLogRecPtr record)
 		 * Before actually performing the write, wait for all in-flight
 		 * insertions to the pages we're about to write to finish.
 		 */
-		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr);
+		insertpos = WaitXLogInsertionsToFinish(WriteRqstPtr, true);
 
 		/*
 		 * Try to get the write lock. If we can't get it immediately, wait
@@ -2846,7 +2869,7 @@ XLogFlush(XLogRecPtr record)
 			 * We're only calling it again to allow insertpos to be moved
 			 * further forward, not to actually wait for anyone.
 			 */
-			insertpos = WaitXLogInsertionsToFinish(insertpos);
+			insertpos = WaitXLogInsertionsToFinish(insertpos, true);
 		}
 
 		/* try to write/flush later additions to XLOG as well */
@@ -3025,7 +3048,7 @@ XLogBackgroundFlush(void)
 	START_CRIT_SECTION();
 
 	/* now wait for any in-progress insertions to finish and get write lock */
-	WaitXLogInsertionsToFinish(WriteRqst.Write);
+	WaitXLogInsertionsToFinish(WriteRqst.Write, true);
 	LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);
 	LogwrtResult = XLogCtl->LogwrtResult;
 	if (WriteRqst.Write > LogwrtResult.Write ||
-- 
2.34.1

v22-0003-Add-test-module-for-verifying-WAL-read-from-WAL-.patch (application/octet-stream)
From 7d867ca6fb918f9b0ed5145f1c6992ce984bf2eb Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 31 Jan 2024 07:29:04 +0000
Subject: [PATCH v22 3/4] Add test module for verifying WAL read from WAL 
 buffers

This commit adds a test module to verify WAL read from WAL
buffers.

Author: Bharath Rupireddy
Reviewed-by: Dilip Kumar
Discussion: https://www.postgresql.org/message-id/CALj2ACXKKK%3DwbiG5_t6dGao5GoecMwRkhr7GjVBM_jg54%2BNa%3DQ%40mail.gmail.com
---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../modules/read_wal_from_buffers/.gitignore  |  4 ++
 .../modules/read_wal_from_buffers/Makefile    | 23 ++++++
 .../modules/read_wal_from_buffers/meson.build | 33 +++++++++
 .../read_wal_from_buffers--1.0.sql            | 14 ++++
 .../read_wal_from_buffers.c                   | 41 +++++++++++
 .../read_wal_from_buffers.control             |  4 ++
 .../read_wal_from_buffers/t/001_basic.pl      | 71 +++++++++++++++++++
 9 files changed, 192 insertions(+)
 create mode 100644 src/test/modules/read_wal_from_buffers/.gitignore
 create mode 100644 src/test/modules/read_wal_from_buffers/Makefile
 create mode 100644 src/test/modules/read_wal_from_buffers/meson.build
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
 create mode 100644 src/test/modules/read_wal_from_buffers/t/001_basic.pl

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 89aa41b5e3..864a3dd72b 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -12,6 +12,7 @@ SUBDIRS = \
 		  dummy_seclabel \
 		  libpq_pipeline \
 		  plsample \
+		  read_wal_from_buffers \
 		  spgist_name_ops \
 		  test_bloomfilter \
 		  test_copy_callbacks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 8fbe742d38..4f3dd69e58 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -33,6 +33,7 @@ subdir('test_resowner')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('read_wal_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
 subdir('xid_wraparound')
diff --git a/src/test/modules/read_wal_from_buffers/.gitignore b/src/test/modules/read_wal_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/read_wal_from_buffers/Makefile b/src/test/modules/read_wal_from_buffers/Makefile
new file mode 100644
index 0000000000..9e57a837f9
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/read_wal_from_buffers/Makefile
+
+MODULE_big = read_wal_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	read_wal_from_buffers.o
+PGFILEDESC = "read_wal_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = read_wal_from_buffers
+DATA = read_wal_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/read_wal_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/read_wal_from_buffers/meson.build b/src/test/modules/read_wal_from_buffers/meson.build
new file mode 100644
index 0000000000..3fac00d616
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+read_wal_from_buffers_sources = files(
+  'read_wal_from_buffers.c',
+)
+
+if host_system == 'windows'
+  read_wal_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'read_wal_from_buffers',
+    '--FILEDESC', 'read_wal_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+read_wal_from_buffers = shared_module('read_wal_from_buffers',
+  read_wal_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += read_wal_from_buffers
+
+test_install_data += files(
+  'read_wal_from_buffers.control',
+  'read_wal_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'read_wal_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
new file mode 100644
index 0000000000..82fa097d10
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
@@ -0,0 +1,14 @@
+/* src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION read_wal_from_buffers" to load this file. \quit
+
+--
+-- read_wal_from_buffers()
+--
+-- SQL function to read WAL from WAL buffers. Returns number of bytes read.
+--
+CREATE FUNCTION read_wal_from_buffers(IN lsn pg_lsn, IN bytes_to_read int,
+    bytes_read OUT int)
+AS 'MODULE_PATHNAME', 'read_wal_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
new file mode 100644
index 0000000000..9fad86a962
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
@@ -0,0 +1,41 @@
+/*--------------------------------------------------------------------------
+ *
+ * read_wal_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function to read WAL from WAL buffers. Returns number of bytes read.
+ */
+PG_FUNCTION_INFO_V1(read_wal_from_buffers);
+Datum
+read_wal_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	startptr = PG_GETARG_LSN(0);
+	int32		bytes_to_read = PG_GETARG_INT32(1);
+	Size		bytes_read = 0;
+	char	   *data = palloc0(bytes_to_read);
+
+	bytes_read = WALReadFromBuffers(data, startptr,
+									(Size) bytes_to_read,
+									GetWALInsertionTimeLine());
+
+	pfree(data);
+
+	PG_RETURN_INT32(bytes_read);
+}
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
new file mode 100644
index 0000000000..b14d24751c
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/read_wal_from_buffers'
+relocatable = true
diff --git a/src/test/modules/read_wal_from_buffers/t/001_basic.pl b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..62ea21e541
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
@@ -0,0 +1,71 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(usleep);
+# Setup a new node.  The configuration chosen here minimizes the number
+# of arbitrary records that could get generated in a cluster.  Enlarging
+# checkpoint_timeout avoids noise with checkpoint activity.  wal_level
+# set to "minimal" avoids random standby snapshot records.  Autovacuum
+# could also trigger randomly, generating random WAL activity of its own.
+# Enlarging wal_writer_delay and wal_writer_flush_after avoid background
+# wal flush by walwriter.
+my $node = PostgreSQL::Test::Cluster->new("node");
+$node->init;
+$node->append_conf(
+	'postgresql.conf',
+	q[wal_level = minimal
+	  autovacuum = off
+	  checkpoint_timeout = '30min'
+	  wal_writer_delay = 10000ms
+	  wal_writer_flush_after = 1GB
+]);
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION read_wal_from_buffers;');
+
+$node->safe_psql('postgres', 'CREATE TABLE t (c int);');
+
+my $result = 0;
+my $lsn;
+my $to_read;
+
+# Wait until we read from WAL buffers
+for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+	# Get current insert LSN. After this, we generate some WAL which is guaranteed
+	# to be in WAL buffers as no other WAL-generating activity is
+	# happening on the server. We then verify if we can read the WAL from WAL
+	# buffers using this LSN.
+	$lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+	# Generate minimal WAL so that WAL buffers don't get overwritten.
+	$node->safe_psql('postgres', "INSERT INTO t VALUES ($i);");
+
+	$to_read = 8192;
+
+	if ($node->safe_psql('postgres',
+		qq{SELECT read_wal_from_buffers(lsn := '$lsn', bytes_to_read := $to_read) > 0;}))
+	{
+		$result = 1;
+		last;
+	}
+
+	usleep(100_000);
+}
+ok($result, 'waited until WAL is successfully read from WAL buffers');
+
+# Check with a WAL that doesn't yet exist i.e., 16MB starting from current
+# flush LSN.
+$lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_flush_lsn()+16777216;');
+$to_read = 8192;
+$result = $node->safe_psql('postgres',
+	qq{SELECT read_wal_from_buffers(lsn := '$lsn', bytes_to_read := $to_read) = 0;});
+is($result, 't', "WAL that doesn't yet exist is not read from WAL buffers");
+
+done_testing();
-- 
2.34.1

v22-0001-Add-WALReadFromBuffers.patch (application/octet-stream)
From 5826ad95c980f4543517cb7014af21bd9a1aa917 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Wed, 31 Jan 2024 07:25:22 +0000
Subject: [PATCH v22 1/4] Add WALReadFromBuffers().

Allows reading directly from WAL buffers without a lock, avoiding the
need to wait for WAL flushing and read from the filesystem.

For now, the only caller is physical replication, but we can consider
expanding it to other callers as needed.
---
 src/backend/access/transam/xlog.c       | 106 ++++++++++++++++++++++++
 src/backend/access/transam/xlogreader.c |   3 -
 src/backend/replication/walsender.c     |  12 ++-
 src/include/access/xlog.h               |   3 +
 4 files changed, 120 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 478377c4a2..0d87a66c59 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1705,6 +1705,112 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 	return cachedPos + ptr % XLOG_BLCKSZ;
 }
 
+/*
+ * Read WAL directly from WAL buffers, if available.
+ *
+ * This function reads 'count' bytes of WAL from WAL buffers into 'buf'
+ * starting at location 'startptr' and returns total bytes read.
+ *
+ * The bytes read may be fewer than requested if any of the WAL buffers in the
+ * requested range have been evicted.
+ *
+ * This function returns immediately if the requested data is not from the
+ * current timeline, or if the server is in recovery.
+ */
+Size
+WALReadFromBuffers(char *buf, XLogRecPtr startptr, Size count, TimeLineID tli)
+{
+	XLogRecPtr	ptr = startptr;
+	Size		nbytes = count;
+	Size		ntotal = 0;
+	char	   *dst = buf;
+
+	if (RecoveryInProgress() ||
+		tli != GetWALInsertionTimeLine())
+		return ntotal;
+
+	Assert(!XLogRecPtrIsInvalid(startptr));
+
+	/*
+	 * Loop through the buffers without a lock. For each buffer, atomically
+	 * read and verify the end pointer, then copy the data out, and finally
+	 * re-read and re-verify the end pointer.
+	 *
+	 * Once a page is evicted, it never returns to the WAL buffers, so if the
+	 * end pointer matches the expected end pointer before and after we copy
+	 * the data, then the right page must have been present during the data
+	 * copy. Read barriers are necessary to ensure that the data copy actually
+	 * happens between the two verification steps.
+	 *
+	 * If the verification fails, we simply terminate the loop and return with
+	 * the data that had been already copied out successfully.
+	 */
+	while (nbytes > 0)
+	{
+		XLogRecPtr	expectedEndPtr;
+		XLogRecPtr	endptr;
+		int			idx;
+		const char *page;
+		const char *data;
+		Size		nread;
+
+		idx = XLogRecPtrToBufIdx(ptr);
+		expectedEndPtr = ptr;
+		expectedEndPtr += XLOG_BLCKSZ - ptr % XLOG_BLCKSZ;
+
+		/*
+		 * First verification step: check that the correct page is present in
+		 * the WAL buffers.
+		 */
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+		if (expectedEndPtr != endptr)
+			break;
+
+		/*
+		 * We found WAL buffer page containing given XLogRecPtr. Get starting
+		 * address of the page and a pointer to the right location of given
+		 * XLogRecPtr in that page.
+		 */
+		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+		data = page + ptr % XLOG_BLCKSZ;
+
+		/*
+		 * Ensure that the data copy and the first verification step are not
+		 * reordered.
+		 */
+		pg_read_barrier();
+
+		/* how much is available on this page to read? */
+		nread = Min(nbytes, XLOG_BLCKSZ - (data - page));
+
+		/* data copy */
+		memcpy(dst, data, nread);
+
+		/*
+		 * Ensure that the data copy and the second verification step are not
+		 * reordered.
+		 */
+		pg_read_barrier();
+
+		/*
+		 * Second verification step: check that the page we read from wasn't
+		 * evicted while we were copying the data.
+		 */
+		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+		if (expectedEndPtr != endptr)
+			break;
+
+		dst += nread;
+		ptr += nread;
+		ntotal += nread;
+		nbytes -= nread;
+	}
+
+	Assert(ntotal <= count);
+
+	return ntotal;
+}
+
 /*
  * Converts a "usable byte position" to XLogRecPtr. A usable byte position
  * is the position starting from the beginning of WAL, excluding all WAL
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 7190156f2f..74a6b11866 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1500,9 +1500,6 @@ err:
  *
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
- *
- * XXX probably this should be improved to suck data directly from the
- * WAL buffers when possible.
  */
 bool
 WALRead(XLogReaderState *state,
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 77c8baa32a..0551f0f2d8 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2966,6 +2966,7 @@ XLogSendPhysical(void)
 	Size		nbytes;
 	XLogSegNo	segno;
 	WALReadError errinfo;
+	Size		rbytes;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -3181,7 +3182,16 @@ XLogSendPhysical(void)
 	enlargeStringInfo(&output_message, nbytes);
 
 retry:
-	if (!WALRead(xlogreader,
+	/* Attempt to read WAL from WAL buffers first. */
+	rbytes = WALReadFromBuffers(&output_message.data[output_message.len],
+								startptr, nbytes, xlogreader->seg.ws_tli);
+	output_message.len += rbytes;
+	startptr += rbytes;
+	nbytes -= rbytes;
+
+	/* Now read the remaining WAL from WAL file. */
+	if (nbytes > 0 &&
+		!WALRead(xlogreader,
 				 &output_message.data[output_message.len],
 				 startptr,
 				 nbytes,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 301c5fa11f..6d5de9812c 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -252,6 +252,9 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern Size WALReadFromBuffers(char *buf, XLogRecPtr startptr, Size count,
+							   TimeLineID tli);
+
 /*
  * Routines used by xlogrecovery.c to call back into xlog.c during recovery.
  */
-- 
2.34.1
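
The verify, copy, re-verify loop described in the 0001 patch comments can be sketched outside PostgreSQL roughly as follows. This is a toy model under stated assumptions: `PAGE_SZ`, `NPAGES`, `pages`, `page_end`, `install_page` and `read_from_buffers` are hypothetical names, C11 `<stdatomic.h>` stands in for the `pg_atomic_*` and `pg_read_barrier()` primitives, and the example is exercised single-threaded only. A real concurrent reader also has to deal with the plain `memcpy` racing with writers, which this sketch does not address.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 8u  /* toy page size; the patch uses XLOG_BLCKSZ */
#define NPAGES  4u  /* toy number of buffer slots */

static char pages[NPAGES][PAGE_SZ];
/* End position of the page currently held in each slot; 0 means empty. */
static _Atomic uint64_t page_end[NPAGES];

/* Writer-side helper for the sketch: install the page ending at 'endpos'. */
static void
install_page(uint64_t endpos, const char *data)
{
    size_t idx = (size_t) ((endpos / PAGE_SZ - 1) % NPAGES);

    atomic_store(&page_end[idx], 0);    /* invalidate the slot while it changes */
    memcpy(pages[idx], data, PAGE_SZ);
    atomic_store(&page_end[idx], endpos);
}

/* Copy up to 'count' bytes starting at 'ptr'; stop at the first missing page. */
static size_t
read_from_buffers(char *dst, uint64_t ptr, size_t count)
{
    size_t ntotal = 0;

    while (count > 0)
    {
        size_t   idx = (size_t) ((ptr / PAGE_SZ) % NPAGES);
        size_t   off = (size_t) (ptr % PAGE_SZ);
        uint64_t expected_end = ptr + (PAGE_SZ - off);
        size_t   nread = PAGE_SZ - off;

        if (nread > count)
            nread = count;

        /* First verification: is the expected page in this slot? */
        if (atomic_load(&page_end[idx]) != expected_end)
            break;

        /* Keep the copy from being reordered before the first check. */
        atomic_thread_fence(memory_order_acquire);

        memcpy(dst, pages[idx] + off, nread);

        /* Keep the copy from being reordered after the second check. */
        atomic_thread_fence(memory_order_acquire);

        /* Second verification: the slot was not replaced during the copy. */
        if (atomic_load(&page_end[idx]) != expected_end)
            break;

        dst += nread;
        ptr += nread;
        ntotal += nread;
        count -= nread;
    }

    return ntotal;
}
```

The reader returns a short count rather than an error when a page is missing, which is what lets callers fall back to reading the remainder from the WAL file.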

#72Alvaro Herrera
alvherre@alvh.no-ip.org
In reply to: Bharath Rupireddy (#71)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Looking at 0003, where an XXX comment is added about taking a spinlock
to read LogwrtResult, I suspect the answer is probably not, because it
is likely to slow down the other uses of LogwrtResult. But I wonder if
a better path forward would be to base further work on my older
uncommitted patch to make LogwrtResult use atomics. With that, you
wouldn't have to block others in order to read the value. I last posted
that patch in [1] in case you're curious.

[1]: /messages/by-id/20220728065920.oleu2jzsatchakfj@alvherre.pgsql

The reason I abandoned that patch is that the performance problem that I
was fixing no longer existed -- it was fixed in a different way.

--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"In fact, the basic problem with Perl 5's subroutines is that they're not
crufty enough, so the cruft leaks out into user-defined code instead, by
the Conservation of Cruft Principle." (Larry Wall, Apocalypse 6)

#73Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Alvaro Herrera (#72)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Wed, Jan 31, 2024 at 3:01 PM Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:

Looking at 0003, where an XXX comment is added about taking a spinlock
to read LogwrtResult, I suspect the answer is probably not, because it
is likely to slow down the other uses of LogwrtResult.

We avoided keeping LogwrtResult up to date because the current callers
of WALReadFromBuffers() all determine the flush LSN using
GetFlushRecPtr(); see comment #4 in
/messages/by-id/CALj2ACV=C1GZT9XQRm4iN1NV1T=hLA_hsGWNx2Y5-G+mSwdhNg@mail.gmail.com.

But I wonder if
a better path forward would be to base further work on my older
uncommitted patch to make LogwrtResult use atomics. With that, you
wouldn't have to block others in order to read the value. I last posted
that patch in [1] in case you're curious.

[1] /messages/by-id/20220728065920.oleu2jzsatchakfj@alvherre.pgsql

The reason I abandoned that patch is that the performance problem that I
was fixing no longer existed -- it was fixed in a different way.

Nice. I'll respond in that thread. FWIW, there's been a recent
attempt at turning unloggedLSN into a 64-bit atomic -
https://commitfest.postgresql.org/46/4330/ and that might need
pg_atomic_monotonic_advance_u64. I guess we would have to bring your
patch and the unloggedLSN into a single thread to have a better
discussion.
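
For readers unfamiliar with the idea, a monotonic advance on a 64-bit atomic can be sketched with a compare-and-swap loop. This is only an illustration using C11 `<stdatomic.h>`, not the PostgreSQL atomics API and not the contents of either patch; `write_pos` and `monotonic_advance_u64` are hypothetical names for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical shared position, e.g. "WAL written up to here". */
static _Atomic uint64_t write_pos;

/*
 * Move *ptr forward to 'target' unless it is already at or past it.
 * Returns the value in effect afterwards: the larger of the observed
 * value and 'target'. Readers need only an atomic load and never block.
 */
static uint64_t
monotonic_advance_u64(_Atomic uint64_t *ptr, uint64_t target)
{
    uint64_t cur = atomic_load(ptr);

    while (cur < target)
    {
        /* On failure, 'cur' is refreshed with the value found in *ptr. */
        if (atomic_compare_exchange_weak(ptr, &cur, target))
            return target;
    }

    return cur;
}
```

The value can only move forward, so a caller that loses the race to a larger update simply observes that larger value.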

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#74Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#71)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Wed, 2024-01-31 at 14:30 +0530, Bharath Rupireddy wrote:

Please see the attached v22 patch set.

Committed 0001.

For 0002 & 0003, I'd like more clarity on how they will actually be
used by an extension.

For 0004, we need to resolve why callers are using XLOG_BLCKSZ and we
can fix that independently, as discussed here:

/messages/by-id/CALj2ACV=C1GZT9XQRm4iN1NV1T=hLA_hsGWNx2Y5-G+mSwdhNg@mail.gmail.com

Regards,
Jeff Davis

#75Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#74)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hi,

On 2024-02-12 11:33:24 -0800, Jeff Davis wrote:

On Wed, 2024-01-31 at 14:30 +0530, Bharath Rupireddy wrote:

Please see the attached v22 patch set.

Committed 0001.

Yay, I think this is very cool. There are plenty of other improvements that can
be based on this...

One thing I'm a bit confused about in the code is the following:

+    /*
+     * Don't read past the available WAL data.
+     *
+     * Check using local copy of LogwrtResult. Ordinarily it's been updated by
+     * the caller when determining how far to read; but if not, it just means
+     * we'll read less data.
+     *
+     * XXX: the available WAL could be extended to the WAL insert pointer by
+     * calling WaitXLogInsertionsToFinish().
+     */
+    upto = Min(startptr + count, LogwrtResult.Write);
+    nbytes = upto - startptr;

Shouldn't it pretty much be a bug to ever encounter this? There aren't
equivalent checks in WALRead(), so any user of WALReadFromBuffers() that then
falls back to WALRead() is just going to send unwritten data.

ISTM that this should be an assertion or error.
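
To make the two options concrete, here is a small standalone sketch. It is not the committed code; `clamp_to_written` and `require_written` are hypothetical names, and the `XLogRecPtr` typedef is a local stand-in. The first function silently shortens the request (with an extra guard for a start pointer beyond the Write pointer, which is an assumption of this sketch), while the second makes the range check the caller's responsibility.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* local stand-in for the PostgreSQL typedef */

/* Clamping variant: shorten the request to what has been written. */
static size_t
clamp_to_written(XLogRecPtr startptr, size_t count, XLogRecPtr write_ptr)
{
    XLogRecPtr  upto = startptr + count;

    if (upto > write_ptr)
        upto = write_ptr;

    /* Guard added for this sketch: nothing readable if we start past Write. */
    return (upto > startptr) ? (size_t) (upto - startptr) : 0;
}

/* Asserting variant: a request past the Write pointer is a caller bug. */
static size_t
require_written(XLogRecPtr startptr, size_t count, XLogRecPtr write_ptr)
{
    assert(startptr + count <= write_ptr);
    return count;
}
```

With the clamping variant a short read is indistinguishable from an eviction, so a caller that then falls back to the file read would request the unwritten remainder; the asserting variant surfaces that mistake immediately in assert-enabled builds.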

Greetings,

Andres Freund

#76Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#75)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Mon, 2024-02-12 at 12:18 -0800, Andres Freund wrote:

+    upto = Min(startptr + count, LogwrtResult.Write);
+    nbytes = upto - startptr;

Shouldn't it pretty much be a bug to ever encounter this?

In the current code it's impossible, though Bharath hinted at an
extension which could reach that path.

What I committed was a bit of a compromise -- earlier versions of the
patch supported reading right up to the Insert pointer (which requires
a call to WaitXLogInsertionsToFinish()). I wasn't ready to commit that
code without seeing more about how that would be used, but I thought
it was reasonable to have some simple code in there to allow reading up
to the Write pointer.

It seems closer to the structure that we will ultimately need to
replicate unflushed data, right?

Regards,
Jeff Davis

[1]: /messages/by-id/CALj2ACW65mqn6Ukv57SqDTMzAJgd1N_AdQtDgy+gMDqu6v618Q@mail.gmail.com

#77Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#76)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hi,

On 2024-02-12 12:46:00 -0800, Jeff Davis wrote:

On Mon, 2024-02-12 at 12:18 -0800, Andres Freund wrote:

+    upto = Min(startptr + count, LogwrtResult.Write);
+    nbytes = upto - startptr;

Shouldn't it pretty much be a bug to ever encounter this?

In the current code it's impossible, though Bharath hinted at an
extension which could reach that path.

What I committed was a bit of a compromise -- earlier versions of the
patch supported reading right up to the Insert pointer (which requires
a call to WaitXLogInsertionsToFinish()). I wasn't ready to commit that
code without seeing more about how that would be used, but I thought
it was reasonable to have some simple code in there to allow reading up
to the Write pointer.

I doubt there's a sane way to use WALRead() without *first* ensuring that the
range of data is valid. I think we're better off moving that responsibility
explicitly to the caller and adding an assertion verifying that.

It seems closer to the structure that we will ultimately need to
replicate unflushed data, right?

It doesn't really seem like a necessary, or even particularly useful,
part. You couldn't just call WALRead() for that, since the caller would need
to know the range up to which WAL is valid but not yet flushed as well. Thus
the caller would need to first use WaitXLogInsertionsToFinish() or something
like it anyway - and then there's no point in doing the WALRead() anymore.

Note that for replicating unflushed data, we *still* might need to fall back
to reading WAL data from disk. In which case not asserting in WALRead() would
just make it hard to find bugs, because not using WaitXLogInsertionsToFinish()
would appear to work as long as data is in wal buffers, but as soon as we'd
fall back to on-disk (but unflushed) data, we'd send bogus WAL.

Greetings,

Andres Freund

#78Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#74)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Mon, 2024-02-12 at 11:33 -0800, Jeff Davis wrote:

For 0002 & 0003, I'd like more clarity on how they will actually be
used by an extension.

In patch 0002, I'm concerned about calling
WaitXLogInsertionsToFinish(). It loops through all the locks, but
doesn't have any early return path or advance any state.

So if it's repeatedly called with the same or similar values it seems
like it would be doing a lot of extra work.

I'm not sure of the best fix. We could add something to LogwrtResult to
track a new LSN that represents the highest known point where all
inserters are finished (in other words, the latest return value of
WaitXLogInsertionsToFinish()). That seems invasive, though.
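
As a rough illustration of the caching idea, here is a standalone sketch. Everything in it is hypothetical: `scan_insert_locks` is a placeholder for the expensive scan over all insertion locks, and the cached value is a plain static variable here, whereas the proposal above would keep it alongside LogwrtResult.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;    /* local stand-in for the PostgreSQL typedef */

/* Hypothetical cache: highest LSN below which all inserters are finished. */
static XLogRecPtr known_finished_upto = 0;

/* Counts how often the expensive path runs, for demonstration only. */
static int  slow_scans = 0;

/* Placeholder for the scan over every insertion lock. */
static XLogRecPtr
scan_insert_locks(XLogRecPtr upto)
{
    slow_scans++;
    return upto;    /* pretend every insertion before 'upto' has finished */
}

/* Early-return wrapper: skip the scan when the cache already covers 'upto'. */
static XLogRecPtr
wait_insertions_cached(XLogRecPtr upto)
{
    XLogRecPtr  finished;

    if (upto <= known_finished_upto)
        return known_finished_upto;

    finished = scan_insert_locks(upto);
    if (finished > known_finished_upto)
        known_finished_upto = finished;

    return finished;
}
```

Repeated calls with the same or a lower LSN then cost a single comparison, which is the "early return path" the paragraph above says is missing.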

Regards,
Jeff Davis

#79Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#78)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hi,

On 2024-02-12 15:56:19 -0800, Jeff Davis wrote:

On Mon, 2024-02-12 at 11:33 -0800, Jeff Davis wrote:

For 0002 & 0003, I'd like more clarity on how they will actually be
used by an extension.

In patch 0002, I'm concerned about calling
WaitXLogInsertionsToFinish(). It loops through all the locks, but
doesn't have any early return path or advance any state.

I doubt it'd be too bad - we call that at much much higher frequency during
write heavy OLTP workloads (c.f. XLogFlush()). It can be a performance issue
there, but only after increasing NUM_XLOGINSERT_LOCKS - before that the
limited number of writers is the limit. Compared to that walsender shouldn't
be a significant factor.

However, I think it's a very bad idea to call WaitXLogInsertionsToFinish() from
WALReadFromBuffers(). This needs to be at the caller, not down in
WALReadFromBuffers().

I don't see why we would want to weaken the error condition in
WaitXLogInsertionsToFinish() - I suspect it'd not work correctly to wait for
insertions that aren't yet in progress and it just seems like an API misuse.

So if it's repeatedly called with the same or similar values it seems like
it would be doing a lot of extra work.

I'm not sure of the best fix. We could add something to LogwrtResult to
track a new LSN that represents the highest known point where all
inserters are finished (in other words, the latest return value of
WaitXLogInsertionsToFinish()). That seems invasive, though.

FWIW, I think LogwrtResult is an anti-pattern, perhaps introduced due to
misunderstanding how cache coherency works. It's not fundamentally faster to
access non-shared memory. It'd make far more sense to allow lock-free access
to the shared LogwrtResult and

Greetings,

Andres Freund

#80Jeff Davis
pgsql@j-davis.com
In reply to: Andres Freund (#77)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Mon, 2024-02-12 at 15:36 -0800, Andres Freund wrote:

It doesn't really seem like a necessary, or even particularly useful,
part. You couldn't just call WALRead() for that, since the caller would need
to know the range up to which WAL is valid but not yet flushed as well. Thus
the caller would need to first use WaitXLogInsertionsToFinish() or something
like it anyway - and then there's no point in doing the WALRead() anymore.

I follow until the last part. Did you mean "and then there's no point
in doing the WaitXLogInsertionsToFinish() in WALReadFromBuffers()
anymore"?

For now, should I assert that the requested WAL data is before the
Flush pointer or assert that it's before the Write pointer?

Note that for replicating unflushed data, we *still* might need to fall back
to reading WAL data from disk. In which case not asserting in WALRead() would
just make it hard to find bugs, because not using WaitXLogInsertionsToFinish()
would appear to work as long as data is in wal buffers, but as soon as we'd
fall back to on-disk (but unflushed) data, we'd send bogus WAL.

That makes me wonder whether my previous idea[1] might matter: when
some buffers have been evicted, should WALReadFromBuffers() keep going
through the loop and return the end portion of the requested data
rather than the beginning?

We can sort that out when we get closer to replicating unflushed WAL.

Regards,
Jeff Davis

[1]: /messages/by-id/2b36bf99e762e65db0dafbf8d338756cf5fa6ece.camel@j-davis.com

#81Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Jeff Davis (#74)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Tue, Feb 13, 2024 at 1:03 AM Jeff Davis <pgsql@j-davis.com> wrote:

For 0004, we need to resolve why callers are using XLOG_BLCKSZ and we
can fix that independently, as discussed here:

/messages/by-id/CALj2ACV=C1GZT9XQRm4iN1NV1T=hLA_hsGWNx2Y5-G+mSwdhNg@mail.gmail.com

Thanks. I started a new thread for this -
/messages/by-id/CALj2ACWBRFac2TingD3PE3w2EBHXUHY3=AEEZPJmqhpEOBGExg@mail.gmail.com.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#82Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Andres Freund (#77)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Tue, Feb 13, 2024 at 5:06 AM Andres Freund <andres@anarazel.de> wrote:

I doubt there's a sane way to use WALRead() without *first* ensuring that the
range of data is valid. I think we're better off moving that responsibility
explicitly to the caller and adding an assertion verifying that.

It doesn't really seem like a necessary, or even particularly useful,
part. You couldn't just call WALRead() for that, since the caller would need
to know the range up to which WAL is valid but not yet flushed as well. Thus
the caller would need to first use WaitXLogInsertionsToFinish() or something
like it anyway - and then there's no point in doing the WALRead() anymore.

Note that for replicating unflushed data, we *still* might need to fall back
to reading WAL data from disk. In which case not asserting in WALRead() would
just make it hard to find bugs, because not using WaitXLogInsertionsToFinish()
would appear to work as long as data is in wal buffers, but as soon as we'd
fall back to on-disk (but unflushed) data, we'd send bogus WAL.

Callers of WALRead() do a good amount of work to figure out what's
been flushed out, but they still read unflushed and/or invalid data; see
the comment [1] around WALRead() call sites as well as a recent thread
[2].

IIUC, here's the summary of the discussion that has happened so far:
a) If only replicating flushed data, then ensure all the WALRead()
callers read however much is valid out of startptr+count. The fix
provided in [2] can help do that.
b) If only replicating flushed data, then ensure all the
WALReadFromBuffers() callers read however much is valid out of
startptr+count. Current and expected WALReadFromBuffers() callers will
anyway determine how much of it is flushed and can validly be read.
c) If planning to replicate unflushed data, then ensure all the
WALRead() callers wait until startptr+count is past the current insert
position with WaitXLogInsertionsToFinish().
d) If planning to replicate unflushed data, then ensure all the
WALReadFromBuffers() callers wait until startptr+count is past the
current insert position with WaitXLogInsertionsToFinish().

Adding an assertion or error in WALReadFromBuffers() for ensuring the
callers do follow the above set of rules is easy. We can just do
Assert(startptr+count <= LogwrtResult.Flush).

However, adding a similar assertion or error in WALRead() gets
trickier as it's being called from many places - walsenders, backends,
external tools etc. even when the server is in recovery. Therefore,
determining the actual valid LSN is a bit of a challenge.

What I think is the best way:
- Try and get the fix provided for (a) at [2].
- Implement both (c) and (d).
- Have the assertion in WALReadFromBuffers() ensuring the callers wait
until startptr+count is past the current insert position with
WaitXLogInsertionsToFinish().
- Have a comment around WALRead() to ensure the callers are requesting
the WAL that's written to the disk because it's hard to determine
what's written to disk as this gets called in many scenarios - when
server is in recovery, for walsummarizer etc.
- In the new test module, demonstrate how one can implement reading
unflushed data with WALReadFromBuffers() and/or WALRead() +
WaitXLogInsertionsToFinish().

Thoughts?

[1]:
/*
* Even though we just determined how much of the page can be validly read
* as 'count', read the whole page anyway. It's guaranteed to be
* zero-padded up to the page boundary if it's incomplete.
*/
if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
&errinfo))

[2]: /messages/by-id/CALj2ACWBRFac2TingD3PE3w2EBHXUHY3=AEEZPJmqhpEOBGExg@mail.gmail.com

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#83Jeff Davis
pgsql@j-davis.com
In reply to: Jeff Davis (#80)
2 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Attached 2 patches.

Per Andres's suggestion, 0001 adds an:
Assert(startptr + count <= LogwrtResult.Write)

Though if we want to allow the caller (e.g. in an extension) to
determine the valid range, perhaps using WaitXLogInsertionsToFinish(),
then the check is wrong. Maybe we should just get rid of that code
entirely and trust the caller to request a reasonable range?

On Mon, 2024-02-12 at 17:33 -0800, Jeff Davis wrote:

That makes me wonder whether my previous idea[1] might matter: when
some buffers have been evicted, should WALReadFromBuffers() keep going
through the loop and return the end portion of the requested data
rather than the beginning?

[1] /messages/by-id/2b36bf99e762e65db0dafbf8d338756cf5fa6ece.camel@j-davis.com

0002 is to illustrate the above idea. It's a strange API so I don't
intend to commit it in this form, but I think we will ultimately need
to do something like it when we want to replicate unflushed data.

The idea is that data past the Write pointer is always (and only)
available in the WAL buffers, so WALReadFromBuffers() should always
return it. That way we can always safely fall through to ordinary
WALRead(), which can only see before the Write pointer. There's also
data before the Write pointer that could be in the WAL buffers, and we
might as well copy that, too, if it's not evicted.

If some buffers are evicted, it will fill in the *end* of the buffer,
leaving a gap at the beginning. The nice thing is that if there is any
gap, it will be before the Write pointer, so we can always fall back to
WALRead() to fill the gap and it should always succeed.
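
The "fill the end, leave a gap at the beginning" behaviour can be seen in a toy model. This is only a single-threaded sketch with made-up sizes and names (`PAGE_SZ`, `NPAGES`, `load_page`, `read_tail_from_buffers` are all hypothetical); it leaves out the read barriers and the second verification step that the real loop needs for concurrent eviction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SZ 8               /* toy page size, not XLOG_BLCKSZ */
#define NPAGES  4               /* toy number of WAL buffer slots */

static char     pages[NPAGES * PAGE_SZ];
static uint64_t page_end[NPAGES];   /* end LSN of the page held in each slot */

/* Put the page containing 'lsn' into its slot, filled with byte 'fill'. */
static void
load_page(uint64_t lsn, char fill)
{
    uint64_t    pageno = lsn / PAGE_SZ;
    int         idx = (int) (pageno % NPAGES);

    memset(pages + idx * PAGE_SZ, fill, PAGE_SZ);
    page_end[idx] = (pageno + 1) * PAGE_SZ;
}

/*
 * Copy whatever trailing part of [startptr, startptr + *count) is present.
 * On return, *count is the number of leading bytes that were NOT filled in
 * and must be read from the file instead.
 */
static void
read_tail_from_buffers(char *dst, uint64_t startptr, size_t *count)
{
    uint64_t    recptr = startptr;
    uint64_t    valid = startptr;
    size_t      nbytes = *count;
    char       *pdst = dst;

    while (nbytes > 0)
    {
        size_t      offset = (size_t) (recptr % PAGE_SZ);
        int         idx = (int) ((recptr / PAGE_SZ) % NPAGES);
        uint64_t    expected_end = recptr + (PAGE_SZ - offset);
        size_t      n = nbytes < PAGE_SZ - offset ? nbytes : PAGE_SZ - offset;

        if (page_end[idx] == expected_end)
            memcpy(pdst, pages + idx * PAGE_SZ + offset, n);
        else
            valid = recptr + n; /* page missing: discard everything so far */

        pdst += n;
        recptr += n;
        nbytes -= n;
    }

    *count = (size_t) (valid - startptr);
}
```

For a request covering two pages where only the second is still buffered, the second half of the destination is filled and `*count` comes back as the size of the first page, which is exactly the prefix the caller would hand to the file-based read.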

Regards,
Jeff Davis

Attachments:

0001-Add-assert-to-WALReadFromBuffers.patch (text/x-patch; charset=UTF-8)
From f890362e9f5cefd04ee3f9406c7fcafb6a277e45 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 13 Feb 2024 11:17:08 -0800
Subject: [PATCH 1/2] Add assert to WALReadFromBuffers().

Per suggestion from Andres.
---
 src/backend/access/transam/xlog.c | 24 ++++++------------------
 1 file changed, 6 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4e14c242b1..50c347a679 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1710,12 +1710,13 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
  * of bytes read successfully.
  *
  * Fewer than 'count' bytes may be read if some of the requested WAL data has
- * already been evicted from the WAL buffers, or if the caller requests data
- * that is not yet available.
+ * already been evicted.
  *
  * No locks are taken.
  *
- * The 'tli' argument is only used as a convenient safety check so that
+ * Caller should ensure that it reads no further than LogwrtResult.Write
+ * (which should have been updated by the caller when determining how far to
+ * read). The 'tli' argument is only used as a convenient safety check so that
  * callers do not read from WAL buffers on a historical timeline.
  */
 Size
@@ -1724,26 +1725,13 @@ WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
 {
 	char	   *pdst = dstbuf;
 	XLogRecPtr	recptr = startptr;
-	XLogRecPtr	upto;
-	Size		nbytes;
+	Size		nbytes = count;
 
 	if (RecoveryInProgress() || tli != GetWALInsertionTimeLine())
 		return 0;
 
 	Assert(!XLogRecPtrIsInvalid(startptr));
-
-	/*
-	 * Don't read past the available WAL data.
-	 *
-	 * Check using local copy of LogwrtResult. Ordinarily it's been updated by
-	 * the caller when determining how far to read; but if not, it just means
-	 * we'll read less data.
-	 *
-	 * XXX: the available WAL could be extended to the WAL insert pointer by
-	 * calling WaitXLogInsertionsToFinish().
-	 */
-	upto = Min(startptr + count, LogwrtResult.Write);
-	nbytes = upto - startptr;
+	Assert(startptr + count <= LogwrtResult.Write);
 
 	/*
 	 * Loop through the buffers without a lock. For each buffer, atomically
-- 
2.34.1

0002-WALReadFromBuffers-read-end-of-the-requested-range.patch (text/x-patch; charset=UTF-8)
From 6fc92fa74a033c881624177365afbbffc37ed873 Mon Sep 17 00:00:00 2001
From: Jeff Davis <jeff@j-davis.com>
Date: Tue, 13 Feb 2024 16:56:46 -0800
Subject: [PATCH 2/2] WALReadFromBuffers: read end of the requested range

---
 src/backend/access/transam/xlog.c   | 109 ++++++++++++++++------------
 src/backend/replication/walsender.c |  14 ++--
 src/include/access/xlog.h           |   2 +-
 3 files changed, 70 insertions(+), 55 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 50c347a679..ea55e1b77b 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -1706,11 +1706,21 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
 }
 
 /*
- * Read WAL data directly from WAL buffers, if available. Returns the number
- * of bytes read successfully.
+ * Read WAL data directly from WAL buffers, if available.
  *
- * Fewer than 'count' bytes may be read if some of the requested WAL data has
- * already been evicted.
+ * Some pages in the requested range may already be evicted from the WAL
+ * buffers, in which case this function continues on and reads the *end* of
+ * the range requested (filling in the end of the buffer rather than the
+ * beginning).
+ *
+ * 'count' is an in/out parameter. On return, it's updated to represent the
+ * range of bytes that haven't been filled in. For example:
+ *   remaining = nbytes;
+ *   WALReadFromBuffers(buf, startptr, &remaining, ...);
+ *   WALRead(xlogreader, buf, startptr, remaining, ...);
+ *
+ * On return, startptr + *count <= LogwrtResult.Write, so it will always be
+ * safe to fall back to WALRead().
  *
  * No locks are taken.
  *
@@ -1719,19 +1729,20 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
  * read). The 'tli' argument is only used as a convenient safety check so that
  * callers do not read from WAL buffers on a historical timeline.
  */
-Size
-WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
+void
+WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size *count,
 				   TimeLineID tli)
 {
 	char	   *pdst = dstbuf;
 	XLogRecPtr	recptr = startptr;
-	Size		nbytes = count;
+	XLogRecPtr	valid = startptr;
+	Size		nbytes = *count;
 
 	if (RecoveryInProgress() || tli != GetWALInsertionTimeLine())
-		return 0;
+		return;
 
 	Assert(!XLogRecPtrIsInvalid(startptr));
-	Assert(startptr + count <= LogwrtResult.Write);
+	Assert(startptr + *count <= LogwrtResult.Write);
 
 	/*
 	 * Loop through the buffers without a lock. For each buffer, atomically
@@ -1744,8 +1755,8 @@ WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
 	 * copy. Read barriers are necessary to ensure that the data copy actually
 	 * happens between the two verification steps.
 	 *
-	 * If either verification fails, we simply terminate the loop and return
-	 * with the data that had been already copied out successfully.
+	 * If either verification fails, we advance 'valid' to the next page
+	 * boundary and continue the loop.
 	 */
 	while (nbytes > 0)
 	{
@@ -1753,8 +1764,6 @@ WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
 		int			idx = XLogRecPtrToBufIdx(recptr);
 		XLogRecPtr	expectedEndPtr;
 		XLogRecPtr	endptr;
-		const char *page;
-		const char *psrc;
 		Size		npagebytes;
 
 		/*
@@ -1763,54 +1772,62 @@ WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
 		 */
 		expectedEndPtr = recptr + (XLOG_BLCKSZ - offset);
 
+		/* determine how much data we intend to read from this page */
+		npagebytes = Min(nbytes, XLOG_BLCKSZ - offset);
+
 		/*
 		 * First verification step: check that the correct page is present in
 		 * the WAL buffers.
 		 */
 		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
-		if (expectedEndPtr != endptr)
-			break;
-
-		/*
-		 * The correct page is present (or was at the time the endptr was
-		 * read; must re-verify later). Calculate pointer to source data and
-		 * determine how much data to read from this page.
-		 */
-		page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
-		psrc = page + offset;
-		npagebytes = Min(nbytes, XLOG_BLCKSZ - offset);
+		if (expectedEndPtr == endptr)
+		{
+			/*
+			 * The correct page is present (or was at the time the endptr was
+			 * read; must re-verify later). Calculate pointer to source data.
+			 */
+			const char *page = XLogCtl->pages + idx * (Size) XLOG_BLCKSZ;
+			const char *psrc = page + offset;
 
-		/*
-		 * Ensure that the data copy and the first verification step are not
-		 * reordered.
-		 */
-		pg_read_barrier();
+			/*
+			 * Ensure that the data copy and the first verification step are
+			 * not reordered.
+			 */
+			pg_read_barrier();
 
-		/* data copy */
-		memcpy(pdst, psrc, npagebytes);
+			/* data copy */
+			memcpy(pdst, psrc, npagebytes);
 
-		/*
-		 * Ensure that the data copy and the second verification step are not
-		 * reordered.
-		 */
-		pg_read_barrier();
+			/*
+			 * Ensure that the data copy and the second verification step are
+			 * not reordered.
+			 */
+			pg_read_barrier();
 
-		/*
-		 * Second verification step: check that the page we read from wasn't
-		 * evicted while we were copying the data.
-		 */
-		endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
-		if (expectedEndPtr != endptr)
-			break;
+			/*
+			 * Second verification step: check that the page we read from wasn't
+			 * evicted while we were copying the data.
+			 */
+			endptr = pg_atomic_read_u64(&XLogCtl->xlblocks[idx]);
+			if (expectedEndPtr != endptr)
+			{
+				/* discard previously-copied data but keep going */
+				valid = recptr + npagebytes;
+			}
+		}
+		else
+		{
+			/* discard previously-copied data but keep going */
+			valid = recptr + npagebytes;
+		}
 
 		pdst += npagebytes;
 		recptr += npagebytes;
 		nbytes -= npagebytes;
 	}
 
-	Assert(pdst - dstbuf <= count);
-
-	return pdst - dstbuf;
+	*count = valid - startptr;
+	Assert(startptr + *count <= LogwrtResult.Write);
 }
 
 /*
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 146826d5db..f50328c01d 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -2966,7 +2966,7 @@ XLogSendPhysical(void)
 	Size		nbytes;
 	XLogSegNo	segno;
 	WALReadError errinfo;
-	Size		rbytes;
+	Size		bytes_remaining;
 
 	/* If requested switch the WAL sender to the stopping state. */
 	if (got_STOPPING)
@@ -3182,19 +3182,17 @@ XLogSendPhysical(void)
 	enlargeStringInfo(&output_message, nbytes);
 
 retry:
+	bytes_remaining = nbytes;
 	/* attempt to read WAL from WAL buffers first */
-	rbytes = WALReadFromBuffers(&output_message.data[output_message.len],
-								startptr, nbytes, xlogreader->seg.ws_tli);
-	output_message.len += rbytes;
-	startptr += rbytes;
-	nbytes -= rbytes;
+	WALReadFromBuffers(&output_message.data[output_message.len],
+					   startptr, &bytes_remaining, xlogreader->seg.ws_tli);
 
 	/* now read the remaining WAL from WAL file */
-	if (nbytes > 0 &&
+	if (bytes_remaining > 0 &&
 		!WALRead(xlogreader,
 				 &output_message.data[output_message.len],
 				 startptr,
-				 nbytes,
+				 bytes_remaining,
 				 xlogreader->seg.ws_tli,	/* Pass the current TLI because
 											 * only WalSndSegmentOpen controls
 											 * whether new TLI is needed. */
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 76787a8267..8709e97be0 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -252,7 +252,7 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
-extern Size WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
+extern void WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size *count,
 							   TimeLineID tli);
 
 /*
-- 
2.34.1
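
The verify, copy, verify-again pattern that the loop comments in the patch describe can be sketched on its own with C11 atomics. This is not the PostgreSQL code: `copy_if_stable` is a hypothetical helper, and the acquire load plus acquire fence are assumed here to play the role of the two pg_read_barrier() calls. Note also that copying with a plain memcpy while a writer may be replacing the page is formally a data race under the C11 model; the sketch relies on the second check to reject such copies, as the patch does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Copy 'n' bytes from a shared page only if the page's tag equals
 * 'expected' both before and after the copy. Returns false when the page
 * was absent or was replaced during the copy; 'dst' must then be treated
 * as garbage.
 */
static bool
copy_if_stable(_Atomic uint64_t *tag, uint64_t expected,
               const char *src, char *dst, size_t n)
{
    assert(tag != NULL && src != NULL && dst != NULL);

    /* First verification; acquire keeps the copy from moving above it. */
    if (atomic_load_explicit(tag, memory_order_acquire) != expected)
        return false;

    memcpy(dst, src, n);

    /* Keep the copy from moving below the second verification. */
    atomic_thread_fence(memory_order_acquire);

    /* Second verification: was the page replaced while we were copying? */
    return atomic_load_explicit(tag, memory_order_relaxed) == expected;
}
```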

#84Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#82)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Tue, 2024-02-13 at 22:47 +0530, Bharath Rupireddy wrote:

c) If planning to replicate unflushed data, then ensure all the
WALRead() callers wait until startptr+count is past the current insert
position with WaitXLogInsertionsToFinish().

WALRead() can't read past the Write pointer, so there's no point in
calling WaitXLogInsertionsToFinish(), right?

Regards,
Jeff Davis

#85Andres Freund
andres@anarazel.de
In reply to: Jeff Davis (#80)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

Hi,

On 2024-02-12 17:33:24 -0800, Jeff Davis wrote:

On Mon, 2024-02-12 at 15:36 -0800, Andres Freund wrote:

It doesn't really seem like a necessary, or even particularly useful,
part. You couldn't just call WALRead() for that, since the caller would need
to know the range up to which WAL is valid but not yet flushed as well. Thus
the caller would need to first use WaitXLogInsertionsToFinish() or something
like it anyway - and then there's no point in doing the WALRead() anymore.

I follow until the last part. Did you mean "and then there's no point
in doing the WaitXLogInsertionsToFinish() in WALReadFromBuffers()
anymore"?

Yes, not sure what happened in my brain there.

For now, should I assert that the requested WAL data is before the
Flush pointer or assert that it's before the Write pointer?

Yes, I think that'd be good.

Note that for replicating unflushed data, we *still* might need to fall back
to reading WAL data from disk. In which case not asserting in WALRead() would
just make it hard to find bugs, because not using WaitXLogInsertionsToFinish()
would appear to work as long as data is in wal buffers, but as soon as we'd
fall back to on-disk (but unflushed) data, we'd send bogus WAL.

That makes me wonder whether my previous idea[1] might matter: when
some buffers have been evicted, should WALReadFromBuffers() keep going
through the loop and return the end portion of the requested data
rather than the beginning?

I still doubt that that will help very often, but it'll take some
experimentation to figure it out, I guess.

We can sort that out when we get closer to replicating unflushed WAL.

+1

Greetings,

Andres Freund

#86Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Jeff Davis (#83)
5 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Wed, Feb 14, 2024 at 6:59 AM Jeff Davis <pgsql@j-davis.com> wrote:

Attached 2 patches.

Per Andres's suggestion, 0001 adds an:
Assert(startptr + count <= LogwrtResult.Write)

Though if we want to allow the caller (e.g. in an extension) to
determine the valid range, perhaps using WaitXLogInsertionsToFinish(),
then the check is wrong.

Right.

Maybe we should just get rid of that code
entirely and trust the caller to request a reasonable range?

I'd suggest we strike a balance here - error out in assert builds if
startptr+count is past the current insert position and trust the
callers for production builds. It has a couple of advantages over
doing just Assert(startptr + count <= LogwrtResult.Write):
1) It allows the caller to read unflushed WAL directly from WAL
buffers, see the attached 0005 for an example.
2) All the existing callers where WALReadFromBuffers() is thought to
be used are ensuring WAL availability by reading up to the flush
position, so there is no problem with it.

Also, we could add a note before WALRead() stating that the caller must
request only WAL that has at least been written out (up to
LogwrtResult.Write). I'm not so sure about this; perhaps we don't need
this comment at all.

Here's the v23 patch set:

0001 - Adds assertion in WALReadFromBuffers() to ensure the requested
WAL isn't beyond the current insert position.
0002 - Adds a new test module to demonstrate how one can use
WALReadFromBuffers() ensuring WaitXLogInsertionsToFinish() if need be.
0003 - Uses WALReadFromBuffers in more places like logical walsenders
and backends.
0004 - Removes zero-padding related stuff as discussed in
/messages/by-id/CALj2ACWBRFac2TingD3PE3w2EBHXUHY3=AEEZPJmqhpEOBGExg@mail.gmail.com.
This is needed in this patch set otherwise the assertion added in 0001
fails after 0003.
0005 - Adds a page_read callback for reading from WAL buffers in the
new test module added in 0002. Also, adds tests.

Thoughts?

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v23-0002-Add-test-module-for-verifying-read-from-WAL-buff.patch (application/x-patch)
From 5c0f95acda494904d02593e6ba305717b61c44b5 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Fri, 16 Feb 2024 06:54:53 +0000
Subject: [PATCH v23 2/5] Add test module for verifying read from WAL buffers

---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../modules/read_wal_from_buffers/.gitignore  |  4 ++
 .../modules/read_wal_from_buffers/Makefile    | 23 ++++++
 .../modules/read_wal_from_buffers/meson.build | 33 +++++++++
 .../read_wal_from_buffers--1.0.sql            | 14 ++++
 .../read_wal_from_buffers.c                   | 54 ++++++++++++++
 .../read_wal_from_buffers.control             |  4 ++
 .../read_wal_from_buffers/t/001_basic.pl      | 72 +++++++++++++++++++
 9 files changed, 206 insertions(+)
 create mode 100644 src/test/modules/read_wal_from_buffers/.gitignore
 create mode 100644 src/test/modules/read_wal_from_buffers/Makefile
 create mode 100644 src/test/modules/read_wal_from_buffers/meson.build
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
 create mode 100644 src/test/modules/read_wal_from_buffers/t/001_basic.pl

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 89aa41b5e3..864a3dd72b 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -12,6 +12,7 @@ SUBDIRS = \
 		  dummy_seclabel \
 		  libpq_pipeline \
 		  plsample \
+		  read_wal_from_buffers \
 		  spgist_name_ops \
 		  test_bloomfilter \
 		  test_copy_callbacks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 8fbe742d38..4f3dd69e58 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -33,6 +33,7 @@ subdir('test_resowner')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('read_wal_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
 subdir('xid_wraparound')
diff --git a/src/test/modules/read_wal_from_buffers/.gitignore b/src/test/modules/read_wal_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/read_wal_from_buffers/Makefile b/src/test/modules/read_wal_from_buffers/Makefile
new file mode 100644
index 0000000000..9e57a837f9
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/read_wal_from_buffers/Makefile
+
+MODULE_big = read_wal_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	read_wal_from_buffers.o
+PGFILEDESC = "read_wal_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = read_wal_from_buffers
+DATA = read_wal_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/read_wal_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/read_wal_from_buffers/meson.build b/src/test/modules/read_wal_from_buffers/meson.build
new file mode 100644
index 0000000000..3fac00d616
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+read_wal_from_buffers_sources = files(
+  'read_wal_from_buffers.c',
+)
+
+if host_system == 'windows'
+  read_wal_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'read_wal_from_buffers',
+    '--FILEDESC', 'read_wal_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+read_wal_from_buffers = shared_module('read_wal_from_buffers',
+  read_wal_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += read_wal_from_buffers
+
+test_install_data += files(
+  'read_wal_from_buffers.control',
+  'read_wal_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'read_wal_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
new file mode 100644
index 0000000000..82fa097d10
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
@@ -0,0 +1,14 @@
+/* src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION read_wal_from_buffers" to load this file. \quit
+
+--
+-- read_wal_from_buffers()
+--
+-- SQL function to read WAL from WAL buffers. Returns number of bytes read.
+--
+CREATE FUNCTION read_wal_from_buffers(IN lsn pg_lsn, IN bytes_to_read int,
+    bytes_read OUT int)
+AS 'MODULE_PATHNAME', 'read_wal_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
new file mode 100644
index 0000000000..9df5c07b4b
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
@@ -0,0 +1,54 @@
+/*--------------------------------------------------------------------------
+ *
+ * read_wal_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function to read WAL from WAL buffers. Returns number of bytes read.
+ */
+PG_FUNCTION_INFO_V1(read_wal_from_buffers);
+Datum
+read_wal_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	startptr = PG_GETARG_LSN(0);
+	int32		count = PG_GETARG_INT32(1);
+	Size		read;
+	char	   *data = palloc0(count);
+	XLogRecPtr	upto = startptr + count;
+	XLogRecPtr	insert_pos = GetXLogInsertRecPtr();
+	TimeLineID	tli = GetWALInsertionTimeLine();
+
+	/*
+	 * The requested WAL may be very recent, so wait for any in-progress WAL
+	 * insertions to WAL buffers to finish.
+	 */
+	if (upto > insert_pos)
+	{
+		XLogRecPtr	writtenUpto = WaitXLogInsertionsToFinish(upto);
+
+		upto = Min(upto, writtenUpto);
+		count = upto - startptr;
+	}
+
+	read = WALReadFromBuffers(data, startptr, count, tli);
+
+	pfree(data);
+
+	PG_RETURN_INT32(read);
+}
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
new file mode 100644
index 0000000000..b14d24751c
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/read_wal_from_buffers'
+relocatable = true
diff --git a/src/test/modules/read_wal_from_buffers/t/001_basic.pl b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..f985e49a27
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
@@ -0,0 +1,72 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(usleep);
+
+# Setup a new node.  The configuration chosen here minimizes the number
+# of arbitrary records that could get generated in a cluster.  Enlarging
+# checkpoint_timeout avoids noise with checkpoint activity.  wal_level
+# set to "minimal" avoids random standby snapshot records.  Autovacuum
+# could also trigger randomly, generating random WAL activity of its own.
+# Enlarging wal_writer_delay and wal_writer_flush_after avoids background
+# WAL flushes by walwriter.
+my $node = PostgreSQL::Test::Cluster->new("node");
+$node->init;
+$node->append_conf(
+	'postgresql.conf',
+	q[wal_level = minimal
+	  autovacuum = off
+	  checkpoint_timeout = '30min'
+	  wal_writer_delay = 10000ms
+	  wal_writer_flush_after = 1GB
+]);
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION read_wal_from_buffers;');
+
+$node->safe_psql('postgres', 'CREATE TABLE t (c int);');
+
+my $result = 0;
+my $lsn;
+my $to_read;
+
+# Wait until we read from WAL buffers
+for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+	# Get current insert LSN. After this, we generate some WAL which is guaranteed
+	# to be in WAL buffers as there is no other WAL-generating activity
+	# happening on the server. We then verify that we can read the WAL from WAL
+	# buffers using this LSN.
+	$lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+	my $logstart = -s $node->logfile;
+
+	# Generate minimal WAL so that WAL buffers don't get overwritten.
+	$node->safe_psql('postgres', "INSERT INTO t VALUES ($i);");
+
+	$to_read = 8192;
+
+	my $res = $node->safe_psql('postgres',
+				qq{SELECT read_wal_from_buffers(lsn := '$lsn', bytes_to_read := $to_read) > 0;});
+
+	my $log = $node->log_contains(
+				"request to flush past end of generated WAL; request .*, current position .*",
+				$logstart);
+
+	if ($res eq 't' && $log > 0)
+	{
+		$result = 1;
+		last;
+	}
+
+	usleep(100_000);
+}
+ok($result, 'waited until WAL is successfully read from WAL buffers');
+
+done_testing();
-- 
2.34.1

v23-0004-Do-away-with-zero-padding-assumption-before-WALR.patch (application/x-patch)
From c43e78ec2738c92bd7c73e9ec96ba72acc594bc1 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Fri, 16 Feb 2024 06:56:23 +0000
Subject: [PATCH v23 4/5] Do away with zero-padding assumption before WALRead

---
 src/backend/access/transam/xlogutils.c | 10 ++--------
 src/backend/postmaster/walsummarizer.c |  7 +------
 src/backend/replication/walsender.c    |  2 +-
 3 files changed, 4 insertions(+), 15 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index d4872ec170..8fb2e68e85 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -1010,19 +1010,13 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	}
 
 	/* attempt to read WAL from WAL buffers first */
-	nbytes = XLOG_BLCKSZ;
+	nbytes = count;
 	rbytes = WALReadFromBuffers(cur_page, targetPagePtr, nbytes, currTLI);
 	cur_page += rbytes;
 	targetPagePtr += rbytes;
 	nbytes -= rbytes;
 
-	/*
-	 * Now read the remaining WAL from WAL file.
-	 *
-	 * Even though we just determined how much of the page can be validly read
-	 * as 'count', read the whole page anyway. It's guaranteed to be
-	 * zero-padded up to the page boundary if it's incomplete.
-	 */
+	/* now read the remaining WAL from WAL file */
 	if (nbytes > 0 &&
 		!WALRead(state, cur_page, targetPagePtr, nbytes, tli,
 				 &errinfo))
diff --git a/src/backend/postmaster/walsummarizer.c b/src/backend/postmaster/walsummarizer.c
index 3e1b146538..e85d497034 100644
--- a/src/backend/postmaster/walsummarizer.c
+++ b/src/backend/postmaster/walsummarizer.c
@@ -1318,12 +1318,7 @@ summarizer_read_local_xlog_page(XLogReaderState *state,
 		}
 	}
 
-	/*
-	 * Even though we just determined how much of the page can be validly read
-	 * as 'count', read the whole page anyway. It's guaranteed to be
-	 * zero-padded up to the page boundary if it's incomplete.
-	 */
-	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ,
+	if (!WALRead(state, cur_page, targetPagePtr, count,
 				 private_data->tli, &errinfo))
 		WALReadRaiseError(&errinfo);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 24687dab28..7ecc7174a0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1098,7 +1098,7 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
 	/* attempt to read WAL from WAL buffers first */
-	nbytes = XLOG_BLCKSZ;
+	nbytes = count;
 	rbytes = WALReadFromBuffers(cur_page, targetPagePtr, nbytes, currTLI);
 	cur_page += rbytes;
 	targetPagePtr += rbytes;
-- 
2.34.1

v23-0003-Use-WALReadFromBuffers-in-more-places.patch (application/x-patch)
From ffa3b4e2bc95cdf69ac0feae0bf2ad268a9115b5 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Fri, 16 Feb 2024 06:55:53 +0000
Subject: [PATCH v23 3/5] Use WALReadFromBuffers in more places

---
 src/backend/access/transam/xlogutils.c | 14 +++++++++++++-
 src/backend/replication/walsender.c    | 16 +++++++++++++---
 2 files changed, 26 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 945f1f790d..d4872ec170 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -895,6 +895,8 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 	WALReadError errinfo;
 	TimeLineID	currTLI;
+	Size		nbytes;
+	Size		rbytes;
 
 	loc = targetPagePtr + reqLen;
 
@@ -1007,12 +1009,22 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		count = read_upto - targetPagePtr;
 	}
 
+	/* attempt to read WAL from WAL buffers first */
+	nbytes = XLOG_BLCKSZ;
+	rbytes = WALReadFromBuffers(cur_page, targetPagePtr, nbytes, currTLI);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
 	/*
+	 * Now read the remaining WAL from WAL file.
+	 *
 	 * Even though we just determined how much of the page can be validly read
 	 * as 'count', read the whole page anyway. It's guaranteed to be
 	 * zero-padded up to the page boundary if it's incomplete.
 	 */
-	if (!WALRead(state, cur_page, targetPagePtr, XLOG_BLCKSZ, tli,
+	if (nbytes > 0 &&
+		!WALRead(state, cur_page, targetPagePtr, nbytes, tli,
 				 &errinfo))
 		WALReadRaiseError(&errinfo);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index e5477c1de1..24687dab28 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1059,6 +1059,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	WALReadError errinfo;
 	XLogSegNo	segno;
 	TimeLineID	currTLI;
+	Size		nbytes;
+	Size		rbytes;
 
 	/*
 	 * Make sure we have enough WAL available before retrieving the current
@@ -1095,11 +1097,19 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	else
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
-	/* now actually read the data, we know it's there */
-	if (!WALRead(state,
+	/* attempt to read WAL from WAL buffers first */
+	nbytes = XLOG_BLCKSZ;
+	rbytes = WALReadFromBuffers(cur_page, targetPagePtr, nbytes, currTLI);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
+	/* now read the remaining WAL from WAL file */
+	if (nbytes > 0 &&
+		!WALRead(state,
 				 cur_page,
 				 targetPagePtr,
-				 XLOG_BLCKSZ,
+				 nbytes,
 				 currTLI,		/* Pass the current TLI because only
 								 * WalSndSegmentOpen controls whether new TLI
 								 * is needed. */
-- 
2.34.1

v23-0001-Add-check-in-WALReadFromBuffers-against-requeste.patch (application/x-patch)
From 09e76fd9336352d6a34cc320c0b369e158b50b54 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Fri, 16 Feb 2024 06:54:10 +0000
Subject: [PATCH v23 1/5] Add check in WALReadFromBuffers against requested WAL

---
 src/backend/access/transam/xlog.c       | 36 ++++++++++++-------------
 src/backend/access/transam/xlogreader.c |  3 +++
 src/include/access/xlog.h               |  1 +
 3 files changed, 22 insertions(+), 18 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 4e14c242b1..884be9c805 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -698,7 +698,6 @@ static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 							  XLogRecPtr *PrevPtr);
-static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
@@ -1493,7 +1492,7 @@ WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
  * uninitialized page), and the inserter might need to evict an old WAL buffer
  * to make room for a new one, which in turn requires WALWriteLock.
  */
-static XLogRecPtr
+XLogRecPtr
 WaitXLogInsertionsToFinish(XLogRecPtr upto)
 {
 	uint64		bytepos;
@@ -1710,13 +1709,15 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
  * of bytes read successfully.
  *
  * Fewer than 'count' bytes may be read if some of the requested WAL data has
- * already been evicted from the WAL buffers, or if the caller requests data
- * that is not yet available.
+ * already been evicted from the WAL buffers.
  *
  * No locks are taken.
  *
  * The 'tli' argument is only used as a convenient safety check so that
  * callers do not read from WAL buffers on a historical timeline.
+ *
+ * Note: It is the caller's responsibility to ensure requested WAL up to
+ * 'startptr'+'count' is available by using WaitXLogInsertionsToFinish().
  */
 Size
 WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
@@ -1724,26 +1725,25 @@ WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
 {
 	char	   *pdst = dstbuf;
 	XLogRecPtr	recptr = startptr;
-	XLogRecPtr	upto;
-	Size		nbytes;
+	Size		nbytes = count;
 
 	if (RecoveryInProgress() || tli != GetWALInsertionTimeLine())
 		return 0;
 
 	Assert(!XLogRecPtrIsInvalid(startptr));
 
-	/*
-	 * Don't read past the available WAL data.
-	 *
-	 * Check using local copy of LogwrtResult. Ordinarily it's been updated by
-	 * the caller when determining how far to read; but if not, it just means
-	 * we'll read less data.
-	 *
-	 * XXX: the available WAL could be extended to the WAL insert pointer by
-	 * calling WaitXLogInsertionsToFinish().
-	 */
-	upto = Min(startptr + count, LogwrtResult.Write);
-	nbytes = upto - startptr;
+#ifdef USE_ASSERT_CHECKING
+	{
+		XLogRecPtr	upto = startptr + count;
+		XLogRecPtr	insert_pos = GetXLogInsertRecPtr();
+
+		if (upto > insert_pos)
+			ereport(ERROR,
+					(errmsg("cannot read past end of current insert position; request %X/%X, insert position %X/%X",
+							LSN_FORMAT_ARGS(upto),
+							LSN_FORMAT_ARGS(insert_pos))));
+	}
+#endif
 
 	/*
 	 * Loop through the buffers without a lock. For each buffer, atomically
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 74a6b11866..ae9904e7e4 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1500,6 +1500,9 @@ err:
  *
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
+ *
+ * Note: It is the caller's responsibility to ensure requested WAL is written
+ * to disk, that is 'startptr'+'count' <= LogwrtResult.Write.
  */
 bool
 WALRead(XLogReaderState *state,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 76787a8267..74606a6846 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -252,6 +252,7 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 extern Size WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
 							   TimeLineID tli);
 
-- 
2.34.1

v23-0005-Demonstrate-page_read-callback-for-reading-from-.patch (application/x-patch)
From fe60c94d7f1cd6b58bd0b2f6434e3d8365ebd0fc Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Fri, 16 Feb 2024 06:57:16 +0000
Subject: [PATCH v23 5/5] Demonstrate page_read callback for reading from WAL
 buffers

---
 src/backend/access/transam/xlogreader.c       |   3 +-
 .../read_wal_from_buffers--1.0.sql            |  23 ++
 .../read_wal_from_buffers.c                   | 266 +++++++++++++++++-
 .../read_wal_from_buffers/t/001_basic.pl      |  35 +++
 4 files changed, 325 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index ae9904e7e4..4658a86997 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1035,7 +1035,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 * record is.  This is so that we can check the additional identification
 	 * info that is present in the first page's "long" header.
 	 */
-	if (targetSegNo != state->seg.ws_segno && targetPageOff != 0)
+	if (state->seg.ws_segno != 0 &&
+		targetSegNo != state->seg.ws_segno && targetPageOff != 0)
 	{
 		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
 
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
index 82fa097d10..72d05522fc 100644
--- a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
@@ -12,3 +12,26 @@ CREATE FUNCTION read_wal_from_buffers(IN lsn pg_lsn, IN bytes_to_read int,
     bytes_read OUT int)
 AS 'MODULE_PATHNAME', 'read_wal_from_buffers'
 LANGUAGE C STRICT;
+
+--
+-- get_wal_records_info_from_buffers()
+--
+-- SQL function to get info of WAL records available in WAL buffers.
+--
+CREATE FUNCTION get_wal_records_info_from_buffers(IN start_lsn pg_lsn,
+    IN end_lsn pg_lsn,
+    OUT start_lsn pg_lsn,
+    OUT end_lsn pg_lsn,
+    OUT prev_lsn pg_lsn,
+    OUT xid xid,
+    OUT resource_manager text,
+    OUT record_type text,
+    OUT record_length int4,
+    OUT main_data_length int4,
+    OUT fpi_length int4,
+    OUT description text,
+    OUT block_ref text
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'get_wal_records_info_from_buffers'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
index 9df5c07b4b..ed33a14127 100644
--- a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
@@ -14,11 +14,27 @@
 #include "postgres.h"
 
 #include "access/xlog.h"
-#include "fmgr.h"
+#include "access/xlog_internal.h"
+#include "access/xlogreader.h"
+#include "access/xlogrecovery.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/builtins.h"
 #include "utils/pg_lsn.h"
 
 PG_MODULE_MAGIC;
 
+static int	read_from_wal_buffers(XLogReaderState *state, XLogRecPtr targetPagePtr,
+								  int reqLen, XLogRecPtr targetRecPtr,
+								  char *cur_page);
+
+static XLogRecord *ReadNextXLogRecord(XLogReaderState *xlogreader);
+static void GetWALRecordInfo(XLogReaderState *record, Datum *values,
+							 bool *nulls, uint32 ncols);
+static void GetWALRecordsInfo(FunctionCallInfo fcinfo,
+							  XLogRecPtr start_lsn,
+							  XLogRecPtr end_lsn);
+
 /*
  * SQL function to read WAL from WAL buffers. Returns number of bytes read.
  */
@@ -52,3 +68,251 @@ read_wal_from_buffers(PG_FUNCTION_ARGS)
 
 	PG_RETURN_INT32(read);
 }
+
+/*
+ * XLogReaderRoutine->page_read callback for reading WAL from WAL buffers.
+ */
+static int
+read_from_wal_buffers(XLogReaderState *state, XLogRecPtr targetPagePtr,
+					  int reqLen, XLogRecPtr targetRecPtr,
+					  char *cur_page)
+{
+	XLogRecPtr	read_upto,
+				loc;
+	TimeLineID	tli = GetWALInsertionTimeLine();
+	Size		count;
+	Size		read = 0;
+
+	loc = targetPagePtr + reqLen;
+
+	/* Loop waiting for xlog to be available if necessary */
+	while (1)
+	{
+		read_upto = GetXLogInsertRecPtr();
+
+		if (loc <= read_upto)
+			break;
+
+		WaitXLogInsertionsToFinish(loc);
+
+		CHECK_FOR_INTERRUPTS();
+		pg_usleep(1000L);
+	}
+
+	if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
+	{
+		/*
+		 * more than one block available; read only that block, have caller
+		 * come back if they need more.
+		 */
+		count = XLOG_BLCKSZ;
+	}
+	else if (targetPagePtr + reqLen > read_upto)
+	{
+		/* not enough data there */
+		return -1;
+	}
+	else
+	{
+		/* enough bytes available to satisfy the request */
+		count = read_upto - targetPagePtr;
+	}
+
+	/* read WAL from WAL buffers */
+	read = WALReadFromBuffers(cur_page, targetPagePtr, count, tli);
+
+	if (read != count)
+		ereport(ERROR,
+				errmsg("could not read fully from WAL buffers; expected %zu, read %zu",
+					   count, read));
+
+	return count;
+}
+
+/*
+ * Get info of all WAL records between start LSN and end LSN.
+ *
+ * This function and its helpers below are similar to pg_walinspect's
+ * pg_get_wal_records_info() except that it will get info of WAL records
+ * available in WAL buffers.
+ */
+PG_FUNCTION_INFO_V1(get_wal_records_info_from_buffers);
+Datum
+get_wal_records_info_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	start_lsn = PG_GETARG_LSN(0);
+	XLogRecPtr	end_lsn = PG_GETARG_LSN(1);
+
+	/*
+	 * Validate start and end LSNs coming from the function inputs.
+	 *
+	 * Reading WAL below the first page of the first segment isn't allowed.
+	 * This is a bootstrap WAL page and the page_read callback fails to read
+	 * it.
+	 */
+	if (start_lsn < XLOG_BLCKSZ)
+		ereport(ERROR,
+				(errmsg("could not read WAL at LSN %X/%X",
+						LSN_FORMAT_ARGS(start_lsn))));
+
+	if (start_lsn > end_lsn)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("WAL start LSN must be less than end LSN")));
+
+	GetWALRecordsInfo(fcinfo, start_lsn, end_lsn);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Read next WAL record.
+ */
+static XLogRecord *
+ReadNextXLogRecord(XLogReaderState *xlogreader)
+{
+	XLogRecord *record;
+	char	   *errormsg;
+
+	record = XLogReadRecord(xlogreader, &errormsg);
+
+	if (record == NULL)
+	{
+		if (errormsg)
+			ereport(ERROR,
+					errmsg("could not read WAL at %X/%X: %s",
+						   LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg));
+		else
+			ereport(ERROR,
+					errmsg("could not read WAL at %X/%X",
+						   LSN_FORMAT_ARGS(xlogreader->EndRecPtr)));
+	}
+
+	return record;
+}
+
+/*
+ * Output values that make up a row describing caller's WAL record.
+ */
+static void
+GetWALRecordInfo(XLogReaderState *record, Datum *values,
+				 bool *nulls, uint32 ncols)
+{
+	const char *record_type;
+	RmgrData	desc;
+	uint32		fpi_len = 0;
+	StringInfoData rec_desc;
+	StringInfoData rec_blk_ref;
+	int			i = 0;
+
+	desc = GetRmgr(XLogRecGetRmid(record));
+	record_type = desc.rm_identify(XLogRecGetInfo(record));
+
+	if (record_type == NULL)
+		record_type = psprintf("UNKNOWN (%x)", XLogRecGetInfo(record) & ~XLR_INFO_MASK);
+
+	initStringInfo(&rec_desc);
+	desc.rm_desc(&rec_desc, record);
+
+	if (XLogRecHasAnyBlockRefs(record))
+	{
+		initStringInfo(&rec_blk_ref);
+		XLogRecGetBlockRefInfo(record, false, true, &rec_blk_ref, &fpi_len);
+	}
+
+	values[i++] = LSNGetDatum(record->ReadRecPtr);
+	values[i++] = LSNGetDatum(record->EndRecPtr);
+	values[i++] = LSNGetDatum(XLogRecGetPrev(record));
+	values[i++] = TransactionIdGetDatum(XLogRecGetXid(record));
+	values[i++] = CStringGetTextDatum(desc.rm_name);
+	values[i++] = CStringGetTextDatum(record_type);
+	values[i++] = UInt32GetDatum(XLogRecGetTotalLen(record));
+	values[i++] = UInt32GetDatum(XLogRecGetDataLen(record));
+	values[i++] = UInt32GetDatum(fpi_len);
+
+	if (rec_desc.len > 0)
+		values[i++] = CStringGetTextDatum(rec_desc.data);
+	else
+		nulls[i++] = true;
+
+	if (XLogRecHasAnyBlockRefs(record))
+		values[i++] = CStringGetTextDatum(rec_blk_ref.data);
+	else
+		nulls[i++] = true;
+
+	Assert(i == ncols);
+}
+
+/*
+ * Get info of all WAL records between start LSN and end LSN.
+ */
+static void
+GetWALRecordsInfo(FunctionCallInfo fcinfo, XLogRecPtr start_lsn,
+				  XLogRecPtr end_lsn)
+{
+#define GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS 11
+	XLogReaderState *xlogreader;
+	XLogRecPtr	first_valid_record;
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	MemoryContext old_cxt;
+	MemoryContext tmp_cxt;
+
+	Assert(start_lsn <= end_lsn);
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+									XL_ROUTINE(.page_read = &read_from_wal_buffers,
+											   .segment_open = NULL,
+											   .segment_close = NULL),
+									NULL);
+
+	if (xlogreader == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+				 errdetail("Failed while allocating a WAL reading processor.")));
+
+	/* first find a valid recptr to start from */
+	first_valid_record = XLogFindNextRecord(xlogreader, start_lsn);
+
+	if (XLogRecPtrIsInvalid(first_valid_record))
+	{
+		ereport(LOG,
+				(errmsg("could not find a valid record after %X/%X",
+						LSN_FORMAT_ARGS(start_lsn))));
+
+		return;
+	}
+
+	tmp_cxt = AllocSetContextCreate(CurrentMemoryContext,
+									"GetWALRecordsInfo temporary cxt",
+									ALLOCSET_DEFAULT_SIZES);
+
+	while (ReadNextXLogRecord(xlogreader) &&
+		   xlogreader->EndRecPtr <= end_lsn)
+	{
+		Datum		values[GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS] = {0};
+		bool		nulls[GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS] = {0};
+
+		/* Use the tmp context so we can clean up after each tuple is done */
+		old_cxt = MemoryContextSwitchTo(tmp_cxt);
+
+		GetWALRecordInfo(xlogreader, values, nulls,
+						 GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS);
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+
+		/* clean up and switch back */
+		MemoryContextSwitchTo(old_cxt);
+		MemoryContextReset(tmp_cxt);
+
+		CHECK_FOR_INTERRUPTS();
+	}
+
+	MemoryContextDelete(tmp_cxt);
+	XLogReaderFree(xlogreader);
+
+#undef GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS
+}
diff --git a/src/test/modules/read_wal_from_buffers/t/001_basic.pl b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
index f985e49a27..fcdcdb001e 100644
--- a/src/test/modules/read_wal_from_buffers/t/001_basic.pl
+++ b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
@@ -69,4 +69,39 @@ for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
 }
 ok($result, 'waited until WAL is successfully read from WAL buffers');
 
+$result = 0;
+
+# Wait until we get info of WAL records available in WAL buffers.
+for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+	$node->safe_psql('postgres', "DROP TABLE IF EXISTS foo, bar;");
+	$node->safe_psql('postgres',
+		"CREATE TABLE foo AS SELECT * FROM generate_series(1, 2);");
+	my $start_lsn = $node->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn();");
+	my $tbl_oid = $node->safe_psql('postgres',
+		"SELECT oid FROM pg_class WHERE relname = 'foo';");
+	$node->safe_psql('postgres',
+		"INSERT INTO foo SELECT * FROM generate_series(1, 10);");
+	my $end_lsn = $node->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn();");
+	$node->safe_psql('postgres',
+		"CREATE TABLE bar AS SELECT * FROM generate_series(1, 2);");
+
+	my $res = $node->safe_psql('postgres',
+				"SELECT count(*) FROM get_wal_records_info_from_buffers('$start_lsn', '$end_lsn')
+					WHERE block_ref LIKE concat('%', '$tbl_oid', '%') AND
+						resource_manager = 'Heap' AND
+						record_type = 'INSERT';");
+
+	if ($res eq 10)
+	{
+		$result = 1;
+		last;
+	}
+
+	usleep(100_000);
+}
+ok($result, 'waited until we get info of WAL records available in WAL buffers.');
+
 done_testing();
-- 
2.34.1

#87Jeff Davis
pgsql@j-davis.com
In reply to: Bharath Rupireddy (#86)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, 2024-02-16 at 13:08 +0530, Bharath Rupireddy wrote:

> I'd suggest we strike a balance here - error out in assert builds if
> startptr+count is past the current insert position and trust the
> callers for production builds.

It's not reasonable to have divergent behavior between assert-enabled
builds and production. I think for now I will just commit the Assert as
Andres suggested until we work out a few more details.

One idea is to use Álvaro's work to eliminate the spinlock, and then
add a variable to represent the last known point returned by
WaitXLogInsertionsToFinish(). Then we can cheaply Assert that the
caller requested something before that point.
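
A rough, standalone sketch of that idea, for illustration only. It uses C11
<stdatomic.h> as a stand-in for PostgreSQL's own atomics API, and the names
known_inserted_upto, advance_known_inserted() and read_request_is_safe() are
hypothetical, not taken from any patch in this thread. The writer side
publishes the value returned by WaitXLogInsertionsToFinish(); the reader side
needs a single atomic load, which is what would keep the Assert cheap:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;	/* same width as PostgreSQL's XLogRecPtr */

/*
 * Hypothetical shared variable: the highest LSN up to which all insertions
 * are known to have finished.  Static storage, so it starts out as 0.
 */
_Atomic XLogRecPtr known_inserted_upto;

/*
 * Publish a new "finished up to" position.  The value only ever moves
 * forward, so a stale caller cannot move it backwards.
 */
void
advance_known_inserted(XLogRecPtr finished_upto)
{
	XLogRecPtr	cur = atomic_load(&known_inserted_upto);

	while (cur < finished_upto &&
		   !atomic_compare_exchange_weak(&known_inserted_upto, &cur,
										 finished_upto))
	{
		/* a failed CAS reloads 'cur'; the loop condition re-checks it */
	}
}

/*
 * The cheap check: one atomic load, no spinlock.  Returns nonzero if the
 * requested range ends at or before the published position.
 */
int
read_request_is_safe(XLogRecPtr startptr, uint64_t count)
{
	return startptr + count <= atomic_load(&known_inserted_upto);
}
```

A caller would invoke advance_known_inserted() with whatever
WaitXLogInsertionsToFinish() returned, and a read path could then wrap
read_request_is_safe() in an Assert().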

> Here, I'm with v23 patch set:

Thank you, I'll look at these.

Regards,
Jeff Davis

#88Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Jeff Davis (#87)
4 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Fri, Feb 16, 2024 at 11:01 PM Jeff Davis <pgsql@j-davis.com> wrote:

>> Here, I'm with v23 patch set:

> Thank you, I'll look at these.

Thanks. Here's the v24 patch set after rebasing.
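
For anyone skimming the attachments, the read path that 0003 adds to
read_local_xlog_page_guts() and logical_read_xlog_page() can be modeled
roughly as below. This is a toy, standalone sketch: toy_read_from_buffers()
and toy_read_from_file() are hypothetical stand-ins for WALReadFromBuffers()
and WALRead(), the "WAL" is fake (the byte at LSN n is n & 0xFF), and the
buffered range is a simple [toy_buf_start, toy_buf_end) window rather than
the real WAL buffer lookup:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t XLogRecPtr;

/* Toy state: which LSN range is "still in WAL buffers". */
XLogRecPtr	toy_buf_start = 0;
XLogRecPtr	toy_buf_end = 0;
size_t		toy_file_bytes = 0;	/* bytes that had to come from the "file" */

/*
 * Stand-in for WALReadFromBuffers(): returns how many bytes it could serve
 * starting at startptr, possibly fewer than requested, possibly zero.
 */
size_t
toy_read_from_buffers(char *dst, XLogRecPtr startptr, size_t count)
{
	size_t		avail;
	size_t		n;

	if (startptr < toy_buf_start || startptr >= toy_buf_end)
		return 0;				/* already evicted, or not in the window */

	avail = (size_t) (toy_buf_end - startptr);
	n = count < avail ? count : avail;
	for (size_t i = 0; i < n; i++)
		dst[i] = (char) ((startptr + i) & 0xFF);
	return n;
}

/* Stand-in for WALRead(): always succeeds and counts the bytes it served. */
int
toy_read_from_file(char *dst, XLogRecPtr startptr, size_t count)
{
	for (size_t i = 0; i < count; i++)
		dst[i] = (char) ((startptr + i) & 0xFF);
	toy_file_bytes += count;
	return 1;
}

/* Buffers first, then the file for whatever is left. */
int
read_wal_page(char *cur_page, XLogRecPtr targetPagePtr, size_t count)
{
	size_t		nbytes = count;
	size_t		rbytes = toy_read_from_buffers(cur_page, targetPagePtr, nbytes);

	/* skip past the part already served from buffers */
	cur_page += rbytes;
	targetPagePtr += rbytes;
	nbytes -= rbytes;

	if (nbytes > 0 && !toy_read_from_file(cur_page, targetPagePtr, nbytes))
		return 0;
	return 1;
}
```

The point of the shape is that a partial hit still saves work: only the
remainder after rbytes goes to the file read.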

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v24-0004-Demonstrate-reading-unflushed-WAL-directly-from-.patch (application/x-patch)
From d0317ed91b1483a5556c87388e0186462711e022 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 17 Feb 2024 04:41:29 +0000
Subject: [PATCH v24 4/4] Demonstrate reading unflushed WAL directly from WAL
 buffers

---
 src/backend/access/transam/xlogreader.c       |   3 +-
 .../read_wal_from_buffers--1.0.sql            |  23 ++
 .../read_wal_from_buffers.c                   | 266 +++++++++++++++++-
 .../read_wal_from_buffers/t/001_basic.pl      |  35 +++
 4 files changed, 325 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index ae9904e7e4..4658a86997 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1035,7 +1035,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 * record is.  This is so that we can check the additional identification
 	 * info that is present in the first page's "long" header.
 	 */
-	if (targetSegNo != state->seg.ws_segno && targetPageOff != 0)
+	if (state->seg.ws_segno != 0 &&
+		targetSegNo != state->seg.ws_segno && targetPageOff != 0)
 	{
 		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
 
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
index 82fa097d10..72d05522fc 100644
--- a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
@@ -12,3 +12,26 @@ CREATE FUNCTION read_wal_from_buffers(IN lsn pg_lsn, IN bytes_to_read int,
     bytes_read OUT int)
 AS 'MODULE_PATHNAME', 'read_wal_from_buffers'
 LANGUAGE C STRICT;
+
+--
+-- get_wal_records_info_from_buffers()
+--
+-- SQL function to get info of WAL records available in WAL buffers.
+--
+CREATE FUNCTION get_wal_records_info_from_buffers(IN start_lsn pg_lsn,
+    IN end_lsn pg_lsn,
+    OUT start_lsn pg_lsn,
+    OUT end_lsn pg_lsn,
+    OUT prev_lsn pg_lsn,
+    OUT xid xid,
+    OUT resource_manager text,
+    OUT record_type text,
+    OUT record_length int4,
+    OUT main_data_length int4,
+    OUT fpi_length int4,
+    OUT description text,
+    OUT block_ref text
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'get_wal_records_info_from_buffers'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
index 9df5c07b4b..ed33a14127 100644
--- a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
@@ -14,11 +14,27 @@
 #include "postgres.h"
 
 #include "access/xlog.h"
-#include "fmgr.h"
+#include "access/xlog_internal.h"
+#include "access/xlogreader.h"
+#include "access/xlogrecovery.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/builtins.h"
 #include "utils/pg_lsn.h"
 
 PG_MODULE_MAGIC;
 
+static int	read_from_wal_buffers(XLogReaderState *state, XLogRecPtr targetPagePtr,
+								  int reqLen, XLogRecPtr targetRecPtr,
+								  char *cur_page);
+
+static XLogRecord *ReadNextXLogRecord(XLogReaderState *xlogreader);
+static void GetWALRecordInfo(XLogReaderState *record, Datum *values,
+							 bool *nulls, uint32 ncols);
+static void GetWALRecordsInfo(FunctionCallInfo fcinfo,
+							  XLogRecPtr start_lsn,
+							  XLogRecPtr end_lsn);
+
 /*
  * SQL function to read WAL from WAL buffers. Returns number of bytes read.
  */
@@ -52,3 +68,251 @@ read_wal_from_buffers(PG_FUNCTION_ARGS)
 
 	PG_RETURN_INT32(read);
 }
+
+/*
+ * XLogReaderRoutine->page_read callback for reading WAL from WAL buffers.
+ */
+static int
+read_from_wal_buffers(XLogReaderState *state, XLogRecPtr targetPagePtr,
+					  int reqLen, XLogRecPtr targetRecPtr,
+					  char *cur_page)
+{
+	XLogRecPtr	read_upto,
+				loc;
+	TimeLineID	tli = GetWALInsertionTimeLine();
+	Size		count;
+	Size		read = 0;
+
+	loc = targetPagePtr + reqLen;
+
+	/* Loop waiting for xlog to be available if necessary */
+	while (1)
+	{
+		read_upto = GetXLogInsertRecPtr();
+
+		if (loc <= read_upto)
+			break;
+
+		WaitXLogInsertionsToFinish(loc);
+
+		CHECK_FOR_INTERRUPTS();
+		pg_usleep(1000L);
+	}
+
+	if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
+	{
+		/*
+		 * more than one block available; read only that block, have caller
+		 * come back if they need more.
+		 */
+		count = XLOG_BLCKSZ;
+	}
+	else if (targetPagePtr + reqLen > read_upto)
+	{
+		/* not enough data there */
+		return -1;
+	}
+	else
+	{
+		/* enough bytes available to satisfy the request */
+		count = read_upto - targetPagePtr;
+	}
+
+	/* read WAL from WAL buffers */
+	read = WALReadFromBuffers(cur_page, targetPagePtr, count, tli);
+
+	if (read != count)
+		ereport(ERROR,
+				errmsg("could not read fully from WAL buffers; expected %zu, read %zu",
+					   count, read));
+
+	return count;
+}
+
+/*
+ * Get info of all WAL records between start LSN and end LSN.
+ *
+ * This function and its helpers below are similar to pg_walinspect's
+ * pg_get_wal_records_info() except that it will get info of WAL records
+ * available in WAL buffers.
+ */
+PG_FUNCTION_INFO_V1(get_wal_records_info_from_buffers);
+Datum
+get_wal_records_info_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	start_lsn = PG_GETARG_LSN(0);
+	XLogRecPtr	end_lsn = PG_GETARG_LSN(1);
+
+	/*
+	 * Validate start and end LSNs coming from the function inputs.
+	 *
+	 * Reading WAL below the first page of the first segment isn't allowed.
+	 * This is a bootstrap WAL page and the page_read callback fails to read
+	 * it.
+	 */
+	if (start_lsn < XLOG_BLCKSZ)
+		ereport(ERROR,
+				(errmsg("could not read WAL at LSN %X/%X",
+						LSN_FORMAT_ARGS(start_lsn))));
+
+	if (start_lsn > end_lsn)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("WAL start LSN must be less than end LSN")));
+
+	GetWALRecordsInfo(fcinfo, start_lsn, end_lsn);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Read next WAL record.
+ */
+static XLogRecord *
+ReadNextXLogRecord(XLogReaderState *xlogreader)
+{
+	XLogRecord *record;
+	char	   *errormsg;
+
+	record = XLogReadRecord(xlogreader, &errormsg);
+
+	if (record == NULL)
+	{
+		if (errormsg)
+			ereport(ERROR,
+					errmsg("could not read WAL at %X/%X: %s",
+						   LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg));
+		else
+			ereport(ERROR,
+					errmsg("could not read WAL at %X/%X",
+						   LSN_FORMAT_ARGS(xlogreader->EndRecPtr)));
+	}
+
+	return record;
+}
+
+/*
+ * Output values that make up a row describing caller's WAL record.
+ */
+static void
+GetWALRecordInfo(XLogReaderState *record, Datum *values,
+				 bool *nulls, uint32 ncols)
+{
+	const char *record_type;
+	RmgrData	desc;
+	uint32		fpi_len = 0;
+	StringInfoData rec_desc;
+	StringInfoData rec_blk_ref;
+	int			i = 0;
+
+	desc = GetRmgr(XLogRecGetRmid(record));
+	record_type = desc.rm_identify(XLogRecGetInfo(record));
+
+	if (record_type == NULL)
+		record_type = psprintf("UNKNOWN (%x)", XLogRecGetInfo(record) & ~XLR_INFO_MASK);
+
+	initStringInfo(&rec_desc);
+	desc.rm_desc(&rec_desc, record);
+
+	if (XLogRecHasAnyBlockRefs(record))
+	{
+		initStringInfo(&rec_blk_ref);
+		XLogRecGetBlockRefInfo(record, false, true, &rec_blk_ref, &fpi_len);
+	}
+
+	values[i++] = LSNGetDatum(record->ReadRecPtr);
+	values[i++] = LSNGetDatum(record->EndRecPtr);
+	values[i++] = LSNGetDatum(XLogRecGetPrev(record));
+	values[i++] = TransactionIdGetDatum(XLogRecGetXid(record));
+	values[i++] = CStringGetTextDatum(desc.rm_name);
+	values[i++] = CStringGetTextDatum(record_type);
+	values[i++] = UInt32GetDatum(XLogRecGetTotalLen(record));
+	values[i++] = UInt32GetDatum(XLogRecGetDataLen(record));
+	values[i++] = UInt32GetDatum(fpi_len);
+
+	if (rec_desc.len > 0)
+		values[i++] = CStringGetTextDatum(rec_desc.data);
+	else
+		nulls[i++] = true;
+
+	if (XLogRecHasAnyBlockRefs(record))
+		values[i++] = CStringGetTextDatum(rec_blk_ref.data);
+	else
+		nulls[i++] = true;
+
+	Assert(i == ncols);
+}
+
+/*
+ * Get info of all WAL records between start LSN and end LSN.
+ */
+static void
+GetWALRecordsInfo(FunctionCallInfo fcinfo, XLogRecPtr start_lsn,
+				  XLogRecPtr end_lsn)
+{
+#define GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS 11
+	XLogReaderState *xlogreader;
+	XLogRecPtr	first_valid_record;
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	MemoryContext old_cxt;
+	MemoryContext tmp_cxt;
+
+	Assert(start_lsn <= end_lsn);
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+									XL_ROUTINE(.page_read = &read_from_wal_buffers,
+											   .segment_open = NULL,
+											   .segment_close = NULL),
+									NULL);
+
+	if (xlogreader == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+				 errdetail("Failed while allocating a WAL reading processor.")));
+
+	/* first find a valid recptr to start from */
+	first_valid_record = XLogFindNextRecord(xlogreader, start_lsn);
+
+	if (XLogRecPtrIsInvalid(first_valid_record))
+	{
+		ereport(LOG,
+				(errmsg("could not find a valid record after %X/%X",
+						LSN_FORMAT_ARGS(start_lsn))));
+
+		return;
+	}
+
+	tmp_cxt = AllocSetContextCreate(CurrentMemoryContext,
+									"GetWALRecordsInfo temporary cxt",
+									ALLOCSET_DEFAULT_SIZES);
+
+	while (ReadNextXLogRecord(xlogreader) &&
+		   xlogreader->EndRecPtr <= end_lsn)
+	{
+		Datum		values[GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS] = {0};
+		bool		nulls[GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS] = {0};
+
+		/* Use the tmp context so we can clean up after each tuple is done */
+		old_cxt = MemoryContextSwitchTo(tmp_cxt);
+
+		GetWALRecordInfo(xlogreader, values, nulls,
+						 GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS);
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+
+		/* clean up and switch back */
+		MemoryContextSwitchTo(old_cxt);
+		MemoryContextReset(tmp_cxt);
+
+		CHECK_FOR_INTERRUPTS();
+	}
+
+	MemoryContextDelete(tmp_cxt);
+	XLogReaderFree(xlogreader);
+
+#undef GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS
+}
diff --git a/src/test/modules/read_wal_from_buffers/t/001_basic.pl b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
index f985e49a27..fcdcdb001e 100644
--- a/src/test/modules/read_wal_from_buffers/t/001_basic.pl
+++ b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
@@ -69,4 +69,39 @@ for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
 }
 ok($result, 'waited until WAL is successfully read from WAL buffers');
 
+$result = 0;
+
+# Wait until we get info of WAL records available in WAL buffers.
+for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+	$node->safe_psql('postgres', "DROP TABLE IF EXISTS foo, bar;");
+	$node->safe_psql('postgres',
+		"CREATE TABLE foo AS SELECT * FROM generate_series(1, 2);");
+	my $start_lsn = $node->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn();");
+	my $tbl_oid = $node->safe_psql('postgres',
+		"SELECT oid FROM pg_class WHERE relname = 'foo';");
+	$node->safe_psql('postgres',
+		"INSERT INTO foo SELECT * FROM generate_series(1, 10);");
+	my $end_lsn = $node->safe_psql('postgres',
+		"SELECT pg_current_wal_insert_lsn();");
+	$node->safe_psql('postgres',
+		"CREATE TABLE bar AS SELECT * FROM generate_series(1, 2);");
+
+	my $res = $node->safe_psql('postgres',
+				"SELECT count(*) FROM get_wal_records_info_from_buffers('$start_lsn', '$end_lsn')
+					WHERE block_ref LIKE concat('%', '$tbl_oid', '%') AND
+						resource_manager = 'Heap' AND
+						record_type = 'INSERT';");
+
+	if ($res eq 10)
+	{
+		$result = 1;
+		last;
+	}
+
+	usleep(100_000);
+}
+ok($result, 'waited until we get info of WAL records available in WAL buffers.');
+
 done_testing();
-- 
2.34.1

v24-0001-Add-check-in-WALReadFromBuffers-against-requeste.patch (application/x-patch)
From 648e261505ac819c85112276e7b6054105f22e13 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 17 Feb 2024 04:32:09 +0000
Subject: [PATCH v24 1/4] Add check in WALReadFromBuffers against requested WAL

---
 src/backend/access/transam/xlog.c       | 26 ++++++++++++++++++-------
 src/backend/access/transam/xlogreader.c |  3 +++
 src/include/access/xlog.h               |  1 +
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 50c347a679..b01a3b4ed1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -698,7 +698,6 @@ static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 							  XLogRecPtr *PrevPtr);
-static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
@@ -1493,7 +1492,7 @@ WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
  * uninitialized page), and the inserter might need to evict an old WAL buffer
  * to make room for a new one, which in turn requires WALWriteLock.
  */
-static XLogRecPtr
+XLogRecPtr
 WaitXLogInsertionsToFinish(XLogRecPtr upto)
 {
 	uint64		bytepos;
@@ -1710,13 +1709,14 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
  * of bytes read successfully.
  *
  * Fewer than 'count' bytes may be read if some of the requested WAL data has
- * already been evicted.
+ * already been evicted from the WAL buffers.
  *
  * No locks are taken.
  *
- * Caller should ensure that it reads no further than LogwrtResult.Write
- * (which should have been updated by the caller when determining how far to
- * read). The 'tli' argument is only used as a convenient safety check so that
+ * Caller should ensure that it reads no further than the current insert
+ * position with the help of WaitXLogInsertionsToFinish().
+ *
+ * The 'tli' argument is only used as a convenient safety check so that
  * callers do not read from WAL buffers on a historical timeline.
  */
 Size
@@ -1731,7 +1731,19 @@ WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
 		return 0;
 
 	Assert(!XLogRecPtrIsInvalid(startptr));
-	Assert(startptr + count <= LogwrtResult.Write);
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		XLogRecPtr	upto = startptr + count;
+		XLogRecPtr	insert_pos = GetXLogInsertRecPtr();
+
+		if (upto > insert_pos)
+			ereport(ERROR,
+					(errmsg("cannot read past end of current insert position; request %X/%X, insert position %X/%X",
+							LSN_FORMAT_ARGS(upto),
+							LSN_FORMAT_ARGS(insert_pos))));
+	}
+#endif
 
 	/*
 	 * Loop through the buffers without a lock. For each buffer, atomically
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 74a6b11866..ae9904e7e4 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1500,6 +1500,9 @@ err:
  *
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
+ *
+ * Note: It is the caller's responsibility to ensure requested WAL is written
+ * to disk, that is 'startptr'+'count' <= LogwrtResult.Write.
  */
 bool
 WALRead(XLogReaderState *state,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 76787a8267..74606a6846 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -252,6 +252,7 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 extern Size WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
 							   TimeLineID tli);
 
-- 
2.34.1

v24-0002-Add-test-module-for-verifying-read-from-WAL-buff.patch (application/x-patch)
From 6f3b49ef70f0cab1f9191423db0d419ff3771211 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 17 Feb 2024 04:32:33 +0000
Subject: [PATCH v24 2/4] Add test module for verifying read from WAL buffers

---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../modules/read_wal_from_buffers/.gitignore  |  4 ++
 .../modules/read_wal_from_buffers/Makefile    | 23 ++++++
 .../modules/read_wal_from_buffers/meson.build | 33 +++++++++
 .../read_wal_from_buffers--1.0.sql            | 14 ++++
 .../read_wal_from_buffers.c                   | 54 ++++++++++++++
 .../read_wal_from_buffers.control             |  4 ++
 .../read_wal_from_buffers/t/001_basic.pl      | 72 +++++++++++++++++++
 9 files changed, 206 insertions(+)
 create mode 100644 src/test/modules/read_wal_from_buffers/.gitignore
 create mode 100644 src/test/modules/read_wal_from_buffers/Makefile
 create mode 100644 src/test/modules/read_wal_from_buffers/meson.build
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
 create mode 100644 src/test/modules/read_wal_from_buffers/t/001_basic.pl

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 89aa41b5e3..864a3dd72b 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -12,6 +12,7 @@ SUBDIRS = \
 		  dummy_seclabel \
 		  libpq_pipeline \
 		  plsample \
+		  read_wal_from_buffers \
 		  spgist_name_ops \
 		  test_bloomfilter \
 		  test_copy_callbacks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 8fbe742d38..4f3dd69e58 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -33,6 +33,7 @@ subdir('test_resowner')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('read_wal_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
 subdir('xid_wraparound')
diff --git a/src/test/modules/read_wal_from_buffers/.gitignore b/src/test/modules/read_wal_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/read_wal_from_buffers/Makefile b/src/test/modules/read_wal_from_buffers/Makefile
new file mode 100644
index 0000000000..9e57a837f9
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/read_wal_from_buffers/Makefile
+
+MODULE_big = read_wal_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	read_wal_from_buffers.o
+PGFILEDESC = "read_wal_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = read_wal_from_buffers
+DATA = read_wal_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/read_wal_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/read_wal_from_buffers/meson.build b/src/test/modules/read_wal_from_buffers/meson.build
new file mode 100644
index 0000000000..3fac00d616
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+read_wal_from_buffers_sources = files(
+  'read_wal_from_buffers.c',
+)
+
+if host_system == 'windows'
+  read_wal_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'read_wal_from_buffers',
+    '--FILEDESC', 'read_wal_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+read_wal_from_buffers = shared_module('read_wal_from_buffers',
+  read_wal_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += read_wal_from_buffers
+
+test_install_data += files(
+  'read_wal_from_buffers.control',
+  'read_wal_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'read_wal_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
new file mode 100644
index 0000000000..82fa097d10
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
@@ -0,0 +1,14 @@
+/* src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION read_wal_from_buffers" to load this file. \quit
+
+--
+-- read_wal_from_buffers()
+--
+-- SQL function to read WAL from WAL buffers. Returns number of bytes read.
+--
+CREATE FUNCTION read_wal_from_buffers(IN lsn pg_lsn, IN bytes_to_read int,
+    bytes_read OUT int)
+AS 'MODULE_PATHNAME', 'read_wal_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
new file mode 100644
index 0000000000..9df5c07b4b
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
@@ -0,0 +1,54 @@
+/*--------------------------------------------------------------------------
+ *
+ * read_wal_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function to read WAL from WAL buffers. Returns number of bytes read.
+ */
+PG_FUNCTION_INFO_V1(read_wal_from_buffers);
+Datum
+read_wal_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	startptr = PG_GETARG_LSN(0);
+	int32		count = PG_GETARG_INT32(1);
+	Size		read;
+	char	   *data = palloc0(count);
+	XLogRecPtr	upto = startptr + count;
+	XLogRecPtr	insert_pos = GetXLogInsertRecPtr();
+	TimeLineID	tli = GetWALInsertionTimeLine();
+
+	/*
+	 * The requested WAL may be very recent, so wait for any in-progress WAL
+	 * insertions to WAL buffers to finish.
+	 */
+	if (upto > insert_pos)
+	{
+		XLogRecPtr	writtenUpto = WaitXLogInsertionsToFinish(upto);
+
+		upto = Min(upto, writtenUpto);
+		count = upto - startptr;
+	}
+
+	read = WALReadFromBuffers(data, startptr, count, tli);
+
+	pfree(data);
+
+	PG_RETURN_INT32(read);
+}
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
new file mode 100644
index 0000000000..b14d24751c
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/read_wal_from_buffers'
+relocatable = true
diff --git a/src/test/modules/read_wal_from_buffers/t/001_basic.pl b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..f985e49a27
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
@@ -0,0 +1,72 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(usleep);
+
+# Setup a new node.  The configuration chosen here minimizes the number
+# of arbitrary records that could get generated in a cluster.  Enlarging
+# checkpoint_timeout avoids noise with checkpoint activity.  wal_level
+# set to "minimal" avoids random standby snapshot records.  Autovacuum
+# could also trigger randomly, generating random WAL activity of its own.
+# Enlarging wal_writer_delay and wal_writer_flush_after avoids background
+# WAL flushes by walwriter.
+my $node = PostgreSQL::Test::Cluster->new("node");
+$node->init;
+$node->append_conf(
+	'postgresql.conf',
+	q[wal_level = minimal
+	  autovacuum = off
+	  checkpoint_timeout = '30min'
+	  wal_writer_delay = 10000ms
+	  wal_writer_flush_after = 1GB
+]);
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION read_wal_from_buffers;');
+
+$node->safe_psql('postgres', 'CREATE TABLE t (c int);');
+
+my $result = 0;
+my $lsn;
+my $to_read;
+
+# Wait until we read from WAL buffers
+for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+	# Get current insert LSN. After this, we generate some WAL which is guaranteed
+	# to be in WAL buffers, as no other WAL-generating activity is happening on
+	# the server. We then verify that we can read the WAL from WAL buffers using
+	# this LSN.
+	$lsn = $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+	my $logstart = -s $node->logfile;
+
+	# Generate minimal WAL so that WAL buffers don't get overwritten.
+	$node->safe_psql('postgres', "INSERT INTO t VALUES ($i);");
+
+	$to_read = 8192;
+
+	my $res = $node->safe_psql('postgres',
+				qq{SELECT read_wal_from_buffers(lsn := '$lsn', bytes_to_read := $to_read) > 0;});
+
+	my $log = $node->log_contains(
+				"request to flush past end of generated WAL; request .*, current position .*",
+				$logstart);
+
+	if ($res eq 't' && $log > 0)
+	{
+		$result = 1;
+		last;
+	}
+
+	usleep(100_000);
+}
+ok($result, 'waited until WAL is successfully read from WAL buffers');
+
+done_testing();
-- 
2.34.1

v24-0003-Use-WALReadFromBuffers-in-more-places.patch (application/x-patch)
From c590fc71291514d761e1d2e8d865fa348d0c206a Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Sat, 17 Feb 2024 04:40:05 +0000
Subject: [PATCH v24 3/4] Use WALReadFromBuffers in more places

---
 src/backend/access/transam/xlogutils.c | 13 ++++++++++++-
 src/backend/replication/walsender.c    | 16 +++++++++++++---
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index ad93035d50..8fb2e68e85 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -895,6 +895,8 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 	WALReadError errinfo;
 	TimeLineID	currTLI;
+	Size		nbytes;
+	Size		rbytes;
 
 	loc = targetPagePtr + reqLen;
 
@@ -1007,7 +1009,16 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		count = read_upto - targetPagePtr;
 	}
 
-	if (!WALRead(state, cur_page, targetPagePtr, count, tli,
+	/* attempt to read WAL from WAL buffers first */
+	nbytes = count;
+	rbytes = WALReadFromBuffers(cur_page, targetPagePtr, nbytes, currTLI);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
+	/* now read the remaining WAL from WAL file */
+	if (nbytes > 0 &&
+		!WALRead(state, cur_page, targetPagePtr, nbytes, tli,
 				 &errinfo))
 		WALReadRaiseError(&errinfo);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 631d1e0c9f..7ecc7174a0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1059,6 +1059,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	WALReadError errinfo;
 	XLogSegNo	segno;
 	TimeLineID	currTLI;
+	Size		nbytes;
+	Size		rbytes;
 
 	/*
 	 * Make sure we have enough WAL available before retrieving the current
@@ -1095,11 +1097,19 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	else
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
-	/* now actually read the data, we know it's there */
-	if (!WALRead(state,
+	/* attempt to read WAL from WAL buffers first */
+	nbytes = count;
+	rbytes = WALReadFromBuffers(cur_page, targetPagePtr, nbytes, currTLI);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
+	/* now read the remaining WAL from WAL file */
+	if (nbytes > 0 &&
+		!WALRead(state,
 				 cur_page,
 				 targetPagePtr,
-				 count,
+				 nbytes,
 				 currTLI,		/* Pass the current TLI because only
 								 * WalSndSegmentOpen controls whether new TLI
 								 * is needed. */
-- 
2.34.1

#89Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#88)
4 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Sat, Feb 17, 2024 at 10:27 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

> On Fri, Feb 16, 2024 at 11:01 PM Jeff Davis <pgsql@j-davis.com> wrote:

>>> Here, I'm with v23 patch set:

>> Thank you, I'll look at these.

> Thanks. Here's the v24 patch set after rebasing.

Ran pgperltidy on the new TAP test file added. Please see the attached
v25 patch set.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v25-0001-Add-check-in-WALReadFromBuffers-against-requeste.patch (application/octet-stream)
From 782575067036e710009cf6a5d00a3fe8ac291a90 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 20 Feb 2024 05:52:29 +0000
Subject: [PATCH v25 1/4] Add check in WALReadFromBuffers against requested WAL

---
 src/backend/access/transam/xlog.c       | 26 ++++++++++++++++++-------
 src/backend/access/transam/xlogreader.c |  3 +++
 src/include/access/xlog.h               |  1 +
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 50c347a679..b01a3b4ed1 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -698,7 +698,6 @@ static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 							  XLogRecPtr *PrevPtr);
-static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
@@ -1493,7 +1492,7 @@ WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
  * uninitialized page), and the inserter might need to evict an old WAL buffer
  * to make room for a new one, which in turn requires WALWriteLock.
  */
-static XLogRecPtr
+XLogRecPtr
 WaitXLogInsertionsToFinish(XLogRecPtr upto)
 {
 	uint64		bytepos;
@@ -1710,13 +1709,14 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
  * of bytes read successfully.
  *
  * Fewer than 'count' bytes may be read if some of the requested WAL data has
- * already been evicted.
+ * already been evicted from the WAL buffers.
  *
  * No locks are taken.
  *
- * Caller should ensure that it reads no further than LogwrtResult.Write
- * (which should have been updated by the caller when determining how far to
- * read). The 'tli' argument is only used as a convenient safety check so that
+ * Caller should ensure that it reads no further than current insert position
+ * with the help of WaitXLogInsertionsToFinish().
+ *
+ * The 'tli' argument is only used as a convenient safety check so that
  * callers do not read from WAL buffers on a historical timeline.
  */
 Size
@@ -1731,7 +1731,19 @@ WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
 		return 0;
 
 	Assert(!XLogRecPtrIsInvalid(startptr));
-	Assert(startptr + count <= LogwrtResult.Write);
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		XLogRecPtr	upto = startptr + count;
+		XLogRecPtr	insert_pos = GetXLogInsertRecPtr();
+
+		if (upto > insert_pos)
+			ereport(ERROR,
+					(errmsg("cannot read past end of current insert position; request %X/%X, insert position %X/%X",
+							LSN_FORMAT_ARGS(upto),
+							LSN_FORMAT_ARGS(insert_pos))));
+	}
+#endif
 
 	/*
 	 * Loop through the buffers without a lock. For each buffer, atomically
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 74a6b11866..ae9904e7e4 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1500,6 +1500,9 @@ err:
  *
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
+ *
+ * Note: It is the caller's responsibility to ensure requested WAL is written
+ * to disk, that is 'startptr'+'count' <= LogwrtResult.Write.
  */
 bool
 WALRead(XLogReaderState *state,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 76787a8267..74606a6846 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -252,6 +252,7 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 extern Size WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
 							   TimeLineID tli);
 
-- 
2.34.1
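
For readers following the 0001 patch: the comment on WALReadFromBuffers() says it walks the WAL buffer pages without a lock and may return fewer than 'count' bytes when a page has been evicted. The following is a minimal single-threaded sketch of that behaviour on a toy ring of pages. It is not the code from xlog.c; the names ToyWalBuffers, toy_insert and toy_read_from_buffers, the 16-byte page size and the 4-slot ring are all assumptions made for illustration, and the atomic re-check the real function needs against concurrent eviction is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLCKSZ   16		/* toy page size, far smaller than a real WAL page */
#define TOY_NBUFFERS 4		/* toy number of WAL buffer slots */

typedef uint64_t ToyLsn;

/* Hypothetical model of the WAL buffer ring; not PostgreSQL's XLogCtl. */
typedef struct ToyWalBuffers
{
	ToyLsn		page_end[TOY_NBUFFERS];	/* end LSN of the page in each slot, 0 = empty */
	char		pages[TOY_NBUFFERS][TOY_BLCKSZ];
	ToyLsn		insert_pos;		/* all bytes below this LSN have been inserted */
} ToyWalBuffers;

/* Append bytes at the insert position; starting a new page evicts the slot's old page. */
static void
toy_insert(ToyWalBuffers *wb, const char *data, size_t len)
{
	for (size_t i = 0; i < len; i++)
	{
		ToyLsn		pos = wb->insert_pos;
		size_t		idx = (size_t) ((pos / TOY_BLCKSZ) % TOY_NBUFFERS);

		wb->page_end[idx] = (pos / TOY_BLCKSZ + 1) * TOY_BLCKSZ;
		wb->pages[idx][pos % TOY_BLCKSZ] = data[i];
		wb->insert_pos++;
	}
}

/*
 * Copy up to 'count' bytes starting at 'startptr' out of the ring.  Stops at
 * the first page whose slot no longer holds the expected page and returns the
 * number of bytes copied so far (possibly 0).
 */
static size_t
toy_read_from_buffers(const ToyWalBuffers *wb, char *dst,
					  ToyLsn startptr, size_t count)
{
	size_t		nread = 0;

	/* Mirrors the patch's rule: never read past the current insert position. */
	assert(startptr + count <= wb->insert_pos);

	while (nread < count)
	{
		ToyLsn		ptr = startptr + nread;
		size_t		idx = (size_t) ((ptr / TOY_BLCKSZ) % TOY_NBUFFERS);
		ToyLsn		expected_end = (ptr / TOY_BLCKSZ + 1) * TOY_BLCKSZ;
		size_t		off = (size_t) (ptr % TOY_BLCKSZ);
		size_t		n = TOY_BLCKSZ - off;

		if (wb->page_end[idx] != expected_end)
			break;				/* page was evicted, or never written */

		if (n > count - nread)
			n = count - nread;

		memcpy(dst + nread, wb->pages[idx] + off, n);
		nread += n;
	}

	return nread;
}
```

In this toy the oldest page is always the first to be evicted, so a read either succeeds in full or returns 0. Short non-zero reads would need an inserter running concurrently with the reader, which this sketch does not model.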

v25-0002-Add-test-module-for-verifying-read-from-WAL-buff.patch (application/octet-stream)
From 63e9d8145247d40b11ed438b36b056ea4e02fbf5 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 20 Feb 2024 05:53:09 +0000
Subject: [PATCH v25 2/4] Add test module for verifying read from WAL buffers

---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../modules/read_wal_from_buffers/.gitignore  |  4 +
 .../modules/read_wal_from_buffers/Makefile    | 23 ++++++
 .../modules/read_wal_from_buffers/meson.build | 33 +++++++++
 .../read_wal_from_buffers--1.0.sql            | 14 ++++
 .../read_wal_from_buffers.c                   | 54 ++++++++++++++
 .../read_wal_from_buffers.control             |  4 +
 .../read_wal_from_buffers/t/001_basic.pl      | 74 +++++++++++++++++++
 9 files changed, 208 insertions(+)
 create mode 100644 src/test/modules/read_wal_from_buffers/.gitignore
 create mode 100644 src/test/modules/read_wal_from_buffers/Makefile
 create mode 100644 src/test/modules/read_wal_from_buffers/meson.build
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
 create mode 100644 src/test/modules/read_wal_from_buffers/t/001_basic.pl

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 89aa41b5e3..864a3dd72b 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -12,6 +12,7 @@ SUBDIRS = \
 		  dummy_seclabel \
 		  libpq_pipeline \
 		  plsample \
+		  read_wal_from_buffers \
 		  spgist_name_ops \
 		  test_bloomfilter \
 		  test_copy_callbacks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 8fbe742d38..4f3dd69e58 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -33,6 +33,7 @@ subdir('test_resowner')
 subdir('test_rls_hooks')
 subdir('test_shm_mq')
 subdir('test_slru')
+subdir('read_wal_from_buffers')
 subdir('unsafe_tests')
 subdir('worker_spi')
 subdir('xid_wraparound')
diff --git a/src/test/modules/read_wal_from_buffers/.gitignore b/src/test/modules/read_wal_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/read_wal_from_buffers/Makefile b/src/test/modules/read_wal_from_buffers/Makefile
new file mode 100644
index 0000000000..9e57a837f9
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/read_wal_from_buffers/Makefile
+
+MODULE_big = read_wal_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	read_wal_from_buffers.o
+PGFILEDESC = "read_wal_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = read_wal_from_buffers
+DATA = read_wal_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/read_wal_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/read_wal_from_buffers/meson.build b/src/test/modules/read_wal_from_buffers/meson.build
new file mode 100644
index 0000000000..3fac00d616
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+read_wal_from_buffers_sources = files(
+  'read_wal_from_buffers.c',
+)
+
+if host_system == 'windows'
+  read_wal_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'read_wal_from_buffers',
+    '--FILEDESC', 'read_wal_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+read_wal_from_buffers = shared_module('read_wal_from_buffers',
+  read_wal_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += read_wal_from_buffers
+
+test_install_data += files(
+  'read_wal_from_buffers.control',
+  'read_wal_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'read_wal_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
new file mode 100644
index 0000000000..82fa097d10
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
@@ -0,0 +1,14 @@
+/* src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION read_wal_from_buffers" to load this file. \quit
+
+--
+-- read_wal_from_buffers()
+--
+-- SQL function to read WAL from WAL buffers. Returns number of bytes read.
+--
+CREATE FUNCTION read_wal_from_buffers(IN lsn pg_lsn, IN bytes_to_read int,
+    bytes_read OUT int)
+AS 'MODULE_PATHNAME', 'read_wal_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
new file mode 100644
index 0000000000..9df5c07b4b
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
@@ -0,0 +1,54 @@
+/*--------------------------------------------------------------------------
+ *
+ * read_wal_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function to read WAL from WAL buffers. Returns number of bytes read.
+ */
+PG_FUNCTION_INFO_V1(read_wal_from_buffers);
+Datum
+read_wal_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	startptr = PG_GETARG_LSN(0);
+	int32		count = PG_GETARG_INT32(1);
+	Size		read;
+	char	   *data = palloc0(count);
+	XLogRecPtr	upto = startptr + count;
+	XLogRecPtr	insert_pos = GetXLogInsertRecPtr();
+	TimeLineID	tli = GetWALInsertionTimeLine();
+
+	/*
+	 * The requested WAL may be very recent, so wait for any in-progress WAL
+	 * insertions to WAL buffers to finish.
+	 */
+	if (upto > insert_pos)
+	{
+		XLogRecPtr	writtenUpto = WaitXLogInsertionsToFinish(upto);
+
+		upto = Min(upto, writtenUpto);
+		count = upto - startptr;
+	}
+
+	read = WALReadFromBuffers(data, startptr, count, tli);
+
+	pfree(data);
+
+	PG_RETURN_INT32(read);
+}
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
new file mode 100644
index 0000000000..b14d24751c
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/read_wal_from_buffers'
+relocatable = true
diff --git a/src/test/modules/read_wal_from_buffers/t/001_basic.pl b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..2360ff1171
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
@@ -0,0 +1,74 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(usleep);
+
+# Setup a new node.  The configuration chosen here minimizes the number
+# of arbitrary records that could get generated in a cluster.  Enlarging
+# checkpoint_timeout avoids noise with checkpoint activity.  wal_level
+# set to "minimal" avoids random standby snapshot records.  Autovacuum
+# could also trigger randomly, generating random WAL activity of its own.
+# Enlarging wal_writer_delay and wal_writer_flush_after avoids background
+# WAL flushes by walwriter.
+my $node = PostgreSQL::Test::Cluster->new("node");
+$node->init;
+$node->append_conf(
+	'postgresql.conf',
+	q[wal_level = minimal
+	  autovacuum = off
+	  checkpoint_timeout = '30min'
+	  wal_writer_delay = 10000ms
+	  wal_writer_flush_after = 1GB
+]);
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION read_wal_from_buffers;');
+
+$node->safe_psql('postgres', 'CREATE TABLE t (c int);');
+
+my $result = 0;
+my $lsn;
+my $to_read;
+
+# Wait until we read from WAL buffers
+for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+	# Get current insert LSN. After this, we generate some WAL which is guaranteed
+	# to be in WAL buffers, as no other WAL-generating activity is happening on
+	# the server. We then verify that we can read the WAL from WAL buffers using
+	# this LSN.
+	$lsn =
+	  $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+	my $logstart = -s $node->logfile;
+
+	# Generate minimal WAL so that WAL buffers don't get overwritten.
+	$node->safe_psql('postgres', "INSERT INTO t VALUES ($i);");
+
+	$to_read = 8192;
+
+	my $res = $node->safe_psql('postgres',
+		qq{SELECT read_wal_from_buffers(lsn := '$lsn', bytes_to_read := $to_read) > 0;}
+	);
+
+	my $log = $node->log_contains(
+		"request to flush past end of generated WAL; request .*, current position .*",
+		$logstart);
+
+	if ($res eq 't' && $log > 0)
+	{
+		$result = 1;
+		last;
+	}
+
+	usleep(100_000);
+}
+ok($result, 'waited until WAL is successfully read from WAL buffers');
+
+done_testing();
-- 
2.34.1
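
A note on the 0002 test module: before calling WALReadFromBuffers(), read_wal_from_buffers() shrinks the request so that it ends no later than the position returned by WaitXLogInsertionsToFinish(). The arithmetic can be sketched as a small pure function. The name demo_clamp_read and the DemoLsn type are hypothetical, and the early return for a start position at or beyond the inserted position is an extra guard assumed here for illustration; it is not taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t DemoLsn;		/* stand-in for XLogRecPtr */

/*
 * Return how many bytes of the request [startptr, startptr + count) lie
 * below 'inserted_upto', the position up to which insertions are known to
 * have finished.
 */
static size_t
demo_clamp_read(DemoLsn startptr, size_t count, DemoLsn inserted_upto)
{
	DemoLsn		upto = startptr + count;

	assert(upto >= startptr);	/* the request must not wrap around */

	/* Assumed guard: nothing is readable if the start itself is too new. */
	if (startptr >= inserted_upto)
		return 0;

	/* Same idea as upto = Min(upto, writtenUpto) in the test module. */
	if (upto > inserted_upto)
		upto = inserted_upto;

	return (size_t) (upto - startptr);
}
```

With the guard in place the subtraction cannot go negative, which matters because the result is eventually passed on as an unsigned Size.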

v25-0003-Use-WALReadFromBuffers-in-more-places.patch (application/octet-stream)
From c520e35619ef5093c0cbc6d31943d5b5797aded1 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 20 Feb 2024 05:53:34 +0000
Subject: [PATCH v25 3/4] Use WALReadFromBuffers in more places

---
 src/backend/access/transam/xlogutils.c | 13 ++++++++++++-
 src/backend/replication/walsender.c    | 16 +++++++++++++---
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index ad93035d50..8fb2e68e85 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -895,6 +895,8 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 	WALReadError errinfo;
 	TimeLineID	currTLI;
+	Size		nbytes;
+	Size		rbytes;
 
 	loc = targetPagePtr + reqLen;
 
@@ -1007,7 +1009,16 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		count = read_upto - targetPagePtr;
 	}
 
-	if (!WALRead(state, cur_page, targetPagePtr, count, tli,
+	/* attempt to read WAL from WAL buffers first */
+	nbytes = count;
+	rbytes = WALReadFromBuffers(cur_page, targetPagePtr, nbytes, currTLI);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
+	/* now read the remaining WAL from WAL file */
+	if (nbytes > 0 &&
+		!WALRead(state, cur_page, targetPagePtr, nbytes, tli,
 				 &errinfo))
 		WALReadRaiseError(&errinfo);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index 631d1e0c9f..7ecc7174a0 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1059,6 +1059,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	WALReadError errinfo;
 	XLogSegNo	segno;
 	TimeLineID	currTLI;
+	Size		nbytes;
+	Size		rbytes;
 
 	/*
 	 * Make sure we have enough WAL available before retrieving the current
@@ -1095,11 +1097,19 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	else
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
-	/* now actually read the data, we know it's there */
-	if (!WALRead(state,
+	/* attempt to read WAL from WAL buffers first */
+	nbytes = count;
+	rbytes = WALReadFromBuffers(cur_page, targetPagePtr, nbytes, currTLI);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
+	/* now read the remaining WAL from WAL file */
+	if (nbytes > 0 &&
+		!WALRead(state,
 				 cur_page,
 				 targetPagePtr,
-				 count,
+				 nbytes,
 				 currTLI,		/* Pass the current TLI because only
 								 * WalSndSegmentOpen controls whether new TLI
 								 * is needed. */
-- 
2.34.1
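
The caller-side pattern that 0003 adds to read_local_xlog_page_guts() and logical_read_xlog_page() is: try the WAL buffers, advance the destination and start position by whatever was read, and go to the WAL file only for the remainder. Below is a self-contained sketch of that control flow with stub readers. DemoWal, demo_from_buffers, demo_from_file and demo_read_wal are hypothetical stand-ins, not PostgreSQL functions, and the cached range is arbitrary so that both paths can be exercised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint64_t DemoLsn;

/* Hypothetical WAL source: a flat "file" plus a cached window [from, to). */
typedef struct DemoWal
{
	const char *data;			/* the whole WAL as it exists on disk */
	size_t		len;
	DemoLsn		cached_from;	/* buffers hold [cached_from, cached_to) */
	DemoLsn		cached_to;
	int			file_reads;		/* number of file reads performed */
} DemoWal;

/* Stub for the buffer read: returns the bytes served, possibly 0. */
static size_t
demo_from_buffers(DemoWal *w, char *dst, DemoLsn start, size_t count)
{
	size_t		n;

	if (start < w->cached_from || start >= w->cached_to)
		return 0;
	n = (size_t) (w->cached_to - start);
	if (n > count)
		n = count;
	memcpy(dst, w->data + start, n);
	return n;
}

/* Stub for the file read: all or nothing, like a bool-returning reader. */
static bool
demo_from_file(DemoWal *w, char *dst, DemoLsn start, size_t count)
{
	if (start + count > w->len)
		return false;
	memcpy(dst, w->data + start, count);
	w->file_reads++;
	return true;
}

/* Buffers first, then the file for whatever is left. */
static bool
demo_read_wal(DemoWal *w, char *dst, DemoLsn start, size_t count,
			  size_t *buffer_bytes)
{
	size_t		nbytes = count;
	size_t		rbytes = demo_from_buffers(w, dst, start, nbytes);

	dst += rbytes;
	start += rbytes;
	nbytes -= rbytes;
	*buffer_bytes = rbytes;

	/* The file is touched only when the buffers could not serve everything. */
	if (nbytes > 0 && !demo_from_file(w, dst, start, nbytes))
		return false;
	return true;
}
```

The point of the shape is that a full buffer hit never opens the file at all, while a miss costs one extra cheap check before the usual file read.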

v25-0004-Demonstrate-reading-unflushed-WAL-directly-from-.patch (application/octet-stream)
From 15e29110301ad77291eb0f0322229077adad69e2 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Tue, 20 Feb 2024 05:54:20 +0000
Subject: [PATCH v25 4/4] Demonstrate reading unflushed WAL directly from WAL
 buffers

---
 src/backend/access/transam/xlogreader.c       |   3 +-
 .../read_wal_from_buffers--1.0.sql            |  23 ++
 .../read_wal_from_buffers.c                   | 266 +++++++++++++++++-
 .../read_wal_from_buffers/t/001_basic.pl      |  37 +++
 4 files changed, 327 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index ae9904e7e4..4658a86997 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1035,7 +1035,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 * record is.  This is so that we can check the additional identification
 	 * info that is present in the first page's "long" header.
 	 */
-	if (targetSegNo != state->seg.ws_segno && targetPageOff != 0)
+	if (state->seg.ws_segno != 0 &&
+		targetSegNo != state->seg.ws_segno && targetPageOff != 0)
 	{
 		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
 
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
index 82fa097d10..72d05522fc 100644
--- a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
@@ -12,3 +12,26 @@ CREATE FUNCTION read_wal_from_buffers(IN lsn pg_lsn, IN bytes_to_read int,
     bytes_read OUT int)
 AS 'MODULE_PATHNAME', 'read_wal_from_buffers'
 LANGUAGE C STRICT;
+
+--
+-- get_wal_records_info_from_buffers()
+--
+-- SQL function to get info of WAL records available in WAL buffers.
+--
+CREATE FUNCTION get_wal_records_info_from_buffers(IN start_lsn pg_lsn,
+    IN end_lsn pg_lsn,
+    OUT start_lsn pg_lsn,
+    OUT end_lsn pg_lsn,
+    OUT prev_lsn pg_lsn,
+    OUT xid xid,
+    OUT resource_manager text,
+    OUT record_type text,
+    OUT record_length int4,
+    OUT main_data_length int4,
+    OUT fpi_length int4,
+    OUT description text,
+    OUT block_ref text
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'get_wal_records_info_from_buffers'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
index 9df5c07b4b..ed33a14127 100644
--- a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
@@ -14,11 +14,27 @@
 #include "postgres.h"
 
 #include "access/xlog.h"
-#include "fmgr.h"
+#include "access/xlog_internal.h"
+#include "access/xlogreader.h"
+#include "access/xlogrecovery.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/builtins.h"
 #include "utils/pg_lsn.h"
 
 PG_MODULE_MAGIC;
 
+static int	read_from_wal_buffers(XLogReaderState *state, XLogRecPtr targetPagePtr,
+								  int reqLen, XLogRecPtr targetRecPtr,
+								  char *cur_page);
+
+static XLogRecord *ReadNextXLogRecord(XLogReaderState *xlogreader);
+static void GetWALRecordInfo(XLogReaderState *record, Datum *values,
+							 bool *nulls, uint32 ncols);
+static void GetWALRecordsInfo(FunctionCallInfo fcinfo,
+							  XLogRecPtr start_lsn,
+							  XLogRecPtr end_lsn);
+
 /*
  * SQL function to read WAL from WAL buffers. Returns number of bytes read.
  */
@@ -52,3 +68,251 @@ read_wal_from_buffers(PG_FUNCTION_ARGS)
 
 	PG_RETURN_INT32(read);
 }
+
+/*
+ * XLogReaderRoutine->page_read callback for reading WAL from WAL buffers.
+ */
+static int
+read_from_wal_buffers(XLogReaderState *state, XLogRecPtr targetPagePtr,
+					  int reqLen, XLogRecPtr targetRecPtr,
+					  char *cur_page)
+{
+	XLogRecPtr	read_upto,
+				loc;
+	TimeLineID	tli = GetWALInsertionTimeLine();
+	Size		count;
+	Size		read = 0;
+
+	loc = targetPagePtr + reqLen;
+
+	/* Loop waiting for xlog to be available if necessary */
+	while (1)
+	{
+		read_upto = GetXLogInsertRecPtr();
+
+		if (loc <= read_upto)
+			break;
+
+		WaitXLogInsertionsToFinish(loc);
+
+		CHECK_FOR_INTERRUPTS();
+		pg_usleep(1000L);
+	}
+
+	if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
+	{
+		/*
+		 * more than one block available; read only that block, have caller
+		 * come back if they need more.
+		 */
+		count = XLOG_BLCKSZ;
+	}
+	else if (targetPagePtr + reqLen > read_upto)
+	{
+		/* not enough data there */
+		return -1;
+	}
+	else
+	{
+		/* enough bytes available to satisfy the request */
+		count = read_upto - targetPagePtr;
+	}
+
+	/* read WAL from WAL buffers */
+	read = WALReadFromBuffers(cur_page, targetPagePtr, count, tli);
+
+	if (read != count)
+		ereport(ERROR,
+				errmsg("could not read fully from WAL buffers; expected %zu, read %zu",
+					   count, read));
+
+	return count;
+}
+
+/*
+ * Get info of all WAL records between start LSN and end LSN.
+ *
+ * This function and its helpers below are similar to pg_walinspect's
+ * pg_get_wal_records_info() except that it will get info of WAL records
+ * available in WAL buffers.
+ */
+PG_FUNCTION_INFO_V1(get_wal_records_info_from_buffers);
+Datum
+get_wal_records_info_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	start_lsn = PG_GETARG_LSN(0);
+	XLogRecPtr	end_lsn = PG_GETARG_LSN(1);
+
+	/*
+	 * Validate start and end LSNs coming from the function inputs.
+	 *
+	 * Reading WAL below the first page of the first segment isn't allowed.
+	 * This is a bootstrap WAL page and the page_read callback fails to read
+	 * it.
+	 */
+	if (start_lsn < XLOG_BLCKSZ)
+		ereport(ERROR,
+				(errmsg("could not read WAL at LSN %X/%X",
+						LSN_FORMAT_ARGS(start_lsn))));
+
+	if (start_lsn > end_lsn)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("WAL start LSN must be less than end LSN")));
+
+	GetWALRecordsInfo(fcinfo, start_lsn, end_lsn);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Read next WAL record.
+ */
+static XLogRecord *
+ReadNextXLogRecord(XLogReaderState *xlogreader)
+{
+	XLogRecord *record;
+	char	   *errormsg;
+
+	record = XLogReadRecord(xlogreader, &errormsg);
+
+	if (record == NULL)
+	{
+		if (errormsg)
+			ereport(ERROR,
+					errmsg("could not read WAL at %X/%X: %s",
+						   LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg));
+		else
+			ereport(ERROR,
+					errmsg("could not read WAL at %X/%X",
+						   LSN_FORMAT_ARGS(xlogreader->EndRecPtr)));
+	}
+
+	return record;
+}
+
+/*
+ * Output values that make up a row describing caller's WAL record.
+ */
+static void
+GetWALRecordInfo(XLogReaderState *record, Datum *values,
+				 bool *nulls, uint32 ncols)
+{
+	const char *record_type;
+	RmgrData	desc;
+	uint32		fpi_len = 0;
+	StringInfoData rec_desc;
+	StringInfoData rec_blk_ref;
+	int			i = 0;
+
+	desc = GetRmgr(XLogRecGetRmid(record));
+	record_type = desc.rm_identify(XLogRecGetInfo(record));
+
+	if (record_type == NULL)
+		record_type = psprintf("UNKNOWN (%x)", XLogRecGetInfo(record) & ~XLR_INFO_MASK);
+
+	initStringInfo(&rec_desc);
+	desc.rm_desc(&rec_desc, record);
+
+	if (XLogRecHasAnyBlockRefs(record))
+	{
+		initStringInfo(&rec_blk_ref);
+		XLogRecGetBlockRefInfo(record, false, true, &rec_blk_ref, &fpi_len);
+	}
+
+	values[i++] = LSNGetDatum(record->ReadRecPtr);
+	values[i++] = LSNGetDatum(record->EndRecPtr);
+	values[i++] = LSNGetDatum(XLogRecGetPrev(record));
+	values[i++] = TransactionIdGetDatum(XLogRecGetXid(record));
+	values[i++] = CStringGetTextDatum(desc.rm_name);
+	values[i++] = CStringGetTextDatum(record_type);
+	values[i++] = UInt32GetDatum(XLogRecGetTotalLen(record));
+	values[i++] = UInt32GetDatum(XLogRecGetDataLen(record));
+	values[i++] = UInt32GetDatum(fpi_len);
+
+	if (rec_desc.len > 0)
+		values[i++] = CStringGetTextDatum(rec_desc.data);
+	else
+		nulls[i++] = true;
+
+	if (XLogRecHasAnyBlockRefs(record))
+		values[i++] = CStringGetTextDatum(rec_blk_ref.data);
+	else
+		nulls[i++] = true;
+
+	Assert(i == ncols);
+}
+
+/*
+ * Get info of all WAL records between start LSN and end LSN.
+ */
+static void
+GetWALRecordsInfo(FunctionCallInfo fcinfo, XLogRecPtr start_lsn,
+				  XLogRecPtr end_lsn)
+{
+#define GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS 11
+	XLogReaderState *xlogreader;
+	XLogRecPtr	first_valid_record;
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	MemoryContext old_cxt;
+	MemoryContext tmp_cxt;
+
+	Assert(start_lsn <= end_lsn);
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+									XL_ROUTINE(.page_read = &read_from_wal_buffers,
+											   .segment_open = NULL,
+											   .segment_close = NULL),
+									NULL);
+
+	if (xlogreader == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+				 errdetail("Failed while allocating a WAL reading processor.")));
+
+	/* first find a valid recptr to start from */
+	first_valid_record = XLogFindNextRecord(xlogreader, start_lsn);
+
+	if (XLogRecPtrIsInvalid(first_valid_record))
+	{
+		ereport(LOG,
+				(errmsg("could not find a valid record after %X/%X",
+						LSN_FORMAT_ARGS(start_lsn))));
+
+		return;
+	}
+
+	tmp_cxt = AllocSetContextCreate(CurrentMemoryContext,
+									"GetWALRecordsInfo temporary cxt",
+									ALLOCSET_DEFAULT_SIZES);
+
+	while (ReadNextXLogRecord(xlogreader) &&
+		   xlogreader->EndRecPtr <= end_lsn)
+	{
+		Datum		values[GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS] = {0};
+		bool		nulls[GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS] = {0};
+
+		/* Use the tmp context so we can clean up after each tuple is done */
+		old_cxt = MemoryContextSwitchTo(tmp_cxt);
+
+		GetWALRecordInfo(xlogreader, values, nulls,
+						 GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS);
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+
+		/* clean up and switch back */
+		MemoryContextSwitchTo(old_cxt);
+		MemoryContextReset(tmp_cxt);
+
+		CHECK_FOR_INTERRUPTS();
+	}
+
+	MemoryContextDelete(tmp_cxt);
+	XLogReaderFree(xlogreader);
+
+#undef GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS
+}
diff --git a/src/test/modules/read_wal_from_buffers/t/001_basic.pl b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
index 2360ff1171..15ef550c8c 100644
--- a/src/test/modules/read_wal_from_buffers/t/001_basic.pl
+++ b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
@@ -71,4 +71,41 @@ for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
 }
 ok($result, 'waited until WAL is successfully read from WAL buffers');
 
+$result = 0;
+
+# Wait until we get info of WAL records available in WAL buffers.
+for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+	$node->safe_psql('postgres', "DROP TABLE IF EXISTS foo, bar;");
+	$node->safe_psql('postgres',
+		"CREATE TABLE foo AS SELECT * FROM generate_series(1, 2);");
+	my $start_lsn =
+	  $node->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn();");
+	my $tbl_oid = $node->safe_psql('postgres',
+		"SELECT oid FROM pg_class WHERE relname = 'foo';");
+	$node->safe_psql('postgres',
+		"INSERT INTO foo SELECT * FROM generate_series(1, 10);");
+	my $end_lsn =
+	  $node->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn();");
+	$node->safe_psql('postgres',
+		"CREATE TABLE bar AS SELECT * FROM generate_series(1, 2);");
+
+	my $res = $node->safe_psql(
+		'postgres',
+		"SELECT count(*) FROM get_wal_records_info_from_buffers('$start_lsn', '$end_lsn')
+					WHERE block_ref LIKE concat('%', '$tbl_oid', '%') AND
+						resource_manager = 'Heap' AND
+						record_type = 'INSERT';");
+
+	if ($res eq 10)
+	{
+		$result = 1;
+		last;
+	}
+
+	usleep(100_000);
+}
+ok($result,
+	'waited until we get info of WAL records available in WAL buffers.');
+
 done_testing();
-- 
2.34.1
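
The page_read callback in 0004 decides how many bytes to hand back using the same three-way test as the other page-read callbacks in the tree: a whole block if one is available, -1 if even the requested length is not there yet, otherwise the partial page up to the readable position. A minimal sketch of just that decision follows. demo_page_read_count is a hypothetical helper, and DEMO_BLCKSZ is assumed to be 8192 here.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_BLCKSZ 8192		/* assumed WAL block size for this sketch */

/*
 * Given the start of the target page, the number of bytes the reader needs
 * ('req_len') and the position up to which WAL can be read ('read_upto'),
 * return how many bytes to read from the page, or -1 if the request cannot
 * be satisfied yet.
 */
static int
demo_page_read_count(uint64_t target_page_ptr, int req_len, uint64_t read_upto)
{
	if (target_page_ptr + DEMO_BLCKSZ <= read_upto)
	{
		/* A full block is available; read only that block. */
		return DEMO_BLCKSZ;
	}
	else if (target_page_ptr + (uint64_t) req_len > read_upto)
	{
		/* Not enough data there. */
		return -1;
	}
	else
	{
		/* Enough bytes to satisfy the request, but less than a block. */
		return (int) (read_upto - target_page_ptr);
	}
}
```

For example, with the page starting at 8192 and 100 bytes requested, a readable position of 9000 yields 808 bytes, while a readable position of 8200 yields -1.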

#90 Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#89)
4 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Tue, Feb 20, 2024 at 11:40 AM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

> Ran pgperltidy on the new TAP test file added. Please see the attached
> v25 patch set.

Please find the v26 patches after rebasing.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v26-0001-Add-check-in-WALReadFromBuffers-against-requeste.patch (application/octet-stream)
From 34f35eddcc3c2646e9a669c9660637446a5b4b3f Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 21 Mar 2024 17:58:32 +0000
Subject: [PATCH v26 1/4] Add check in WALReadFromBuffers against requested WAL

---
 src/backend/access/transam/xlog.c       | 26 ++++++++++++++++++-------
 src/backend/access/transam/xlogreader.c |  3 +++
 src/include/access/xlog.h               |  1 +
 3 files changed, 23 insertions(+), 7 deletions(-)

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 20a5f86209..b75b344707 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -693,7 +693,6 @@ static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 							  XLogRecPtr *PrevPtr);
-static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
@@ -1488,7 +1487,7 @@ WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
  * uninitialized page), and the inserter might need to evict an old WAL buffer
  * to make room for a new one, which in turn requires WALWriteLock.
  */
-static XLogRecPtr
+XLogRecPtr
 WaitXLogInsertionsToFinish(XLogRecPtr upto)
 {
 	uint64		bytepos;
@@ -1705,13 +1704,14 @@ GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli)
  * of bytes read successfully.
  *
  * Fewer than 'count' bytes may be read if some of the requested WAL data has
- * already been evicted.
+ * already been evicted from the WAL buffers.
  *
  * No locks are taken.
  *
- * Caller should ensure that it reads no further than LogwrtResult.Write
- * (which should have been updated by the caller when determining how far to
- * read). The 'tli' argument is only used as a convenient safety check so that
+ * Caller should ensure that it reads no further than current insert position
+ * with the help of WaitXLogInsertionsToFinish().
+ *
+ * The 'tli' argument is only used as a convenient safety check so that
  * callers do not read from WAL buffers on a historical timeline.
  */
 Size
@@ -1726,7 +1726,19 @@ WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
 		return 0;
 
 	Assert(!XLogRecPtrIsInvalid(startptr));
-	Assert(startptr + count <= LogwrtResult.Write);
+
+#ifdef USE_ASSERT_CHECKING
+	{
+		XLogRecPtr	upto = startptr + count;
+		XLogRecPtr	insert_pos = GetXLogInsertRecPtr();
+
+		if (upto > insert_pos)
+			ereport(ERROR,
+					(errmsg("cannot read past end of current insert position; request %X/%X, insert position %X/%X",
+							LSN_FORMAT_ARGS(upto),
+							LSN_FORMAT_ARGS(insert_pos))));
+	}
+#endif
 
 	/*
 	 * Loop through the buffers without a lock. For each buffer, atomically
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 37d2a57961..75ea36c37f 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1498,6 +1498,9 @@ err:
  *
  * Returns true if succeeded, false if an error occurs, in which case
  * 'errinfo' receives error details.
+ *
+ * Note: It is the caller's responsibility to ensure requested WAL is written
+ * to disk, that is 'startptr'+'count' <= LogwrtResult.Write.
  */
 bool
 WALRead(XLogReaderState *state,
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 76787a8267..74606a6846 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -252,6 +252,7 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 extern Size WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
 							   TimeLineID tli);
 
-- 
2.34.1

v26-0002-Add-test-module-for-verifying-read-from-WAL-buff.patch (application/octet-stream)
From cfe8f47cb1435615f04af46124debfd5a89fedf9 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 21 Mar 2024 18:00:09 +0000
Subject: [PATCH v26 2/4] Add test module for verifying read from WAL buffers

---
 src/test/modules/Makefile                     |  1 +
 src/test/modules/meson.build                  |  1 +
 .../modules/read_wal_from_buffers/.gitignore  |  4 +
 .../modules/read_wal_from_buffers/Makefile    | 23 ++++++
 .../modules/read_wal_from_buffers/meson.build | 33 +++++++++
 .../read_wal_from_buffers--1.0.sql            | 14 ++++
 .../read_wal_from_buffers.c                   | 54 ++++++++++++++
 .../read_wal_from_buffers.control             |  4 +
 .../read_wal_from_buffers/t/001_basic.pl      | 74 +++++++++++++++++++
 9 files changed, 208 insertions(+)
 create mode 100644 src/test/modules/read_wal_from_buffers/.gitignore
 create mode 100644 src/test/modules/read_wal_from_buffers/Makefile
 create mode 100644 src/test/modules/read_wal_from_buffers/meson.build
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
 create mode 100644 src/test/modules/read_wal_from_buffers/t/001_basic.pl

diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 1cbd532156..1922b0ed4a 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -12,6 +12,7 @@ SUBDIRS = \
 		  dummy_seclabel \
 		  libpq_pipeline \
 		  plsample \
+		  read_wal_from_buffers \
 		  spgist_name_ops \
 		  test_bloomfilter \
 		  test_copy_callbacks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index 7c11fb97f2..437fb39ddf 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -10,6 +10,7 @@ subdir('injection_points')
 subdir('ldap_password_func')
 subdir('libpq_pipeline')
 subdir('plsample')
+subdir('read_wal_from_buffers')
 subdir('spgist_name_ops')
 subdir('ssl_passphrase_callback')
 subdir('test_bloomfilter')
diff --git a/src/test/modules/read_wal_from_buffers/.gitignore b/src/test/modules/read_wal_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/read_wal_from_buffers/Makefile b/src/test/modules/read_wal_from_buffers/Makefile
new file mode 100644
index 0000000000..9e57a837f9
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/read_wal_from_buffers/Makefile
+
+MODULE_big = read_wal_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	read_wal_from_buffers.o
+PGFILEDESC = "read_wal_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = read_wal_from_buffers
+DATA = read_wal_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/read_wal_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/read_wal_from_buffers/meson.build b/src/test/modules/read_wal_from_buffers/meson.build
new file mode 100644
index 0000000000..3fac00d616
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+read_wal_from_buffers_sources = files(
+  'read_wal_from_buffers.c',
+)
+
+if host_system == 'windows'
+  read_wal_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'read_wal_from_buffers',
+    '--FILEDESC', 'read_wal_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+read_wal_from_buffers = shared_module('read_wal_from_buffers',
+  read_wal_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += read_wal_from_buffers
+
+test_install_data += files(
+  'read_wal_from_buffers.control',
+  'read_wal_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'read_wal_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
new file mode 100644
index 0000000000..82fa097d10
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
@@ -0,0 +1,14 @@
+/* src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION read_wal_from_buffers" to load this file. \quit
+
+--
+-- read_wal_from_buffers()
+--
+-- SQL function to read WAL from WAL buffers. Returns number of bytes read.
+--
+CREATE FUNCTION read_wal_from_buffers(IN lsn pg_lsn, IN bytes_to_read int,
+    bytes_read OUT int)
+AS 'MODULE_PATHNAME', 'read_wal_from_buffers'
+LANGUAGE C STRICT;
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
new file mode 100644
index 0000000000..9df5c07b4b
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
@@ -0,0 +1,54 @@
+/*--------------------------------------------------------------------------
+ *
+ * read_wal_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "fmgr.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+/*
+ * SQL function to read WAL from WAL buffers. Returns number of bytes read.
+ */
+PG_FUNCTION_INFO_V1(read_wal_from_buffers);
+Datum
+read_wal_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	startptr = PG_GETARG_LSN(0);
+	int32		count = PG_GETARG_INT32(1);
+	Size		read;
+	char	   *data = palloc0(count);
+	XLogRecPtr	upto = startptr + count;
+	XLogRecPtr	insert_pos = GetXLogInsertRecPtr();
+	TimeLineID	tli = GetWALInsertionTimeLine();
+
+	/*
+	 * The requested WAL may be very recent, so wait for any in-progress WAL
+	 * insertions to WAL buffers to finish.
+	 */
+	if (upto > insert_pos)
+	{
+		XLogRecPtr	writtenUpto = WaitXLogInsertionsToFinish(upto);
+
+		upto = Min(upto, writtenUpto);
+		count = upto - startptr;
+	}
+
+	read = WALReadFromBuffers(data, startptr, count, tli);
+
+	pfree(data);
+
+	PG_RETURN_INT32(read);
+}
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
new file mode 100644
index 0000000000..b14d24751c
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/read_wal_from_buffers'
+relocatable = true
diff --git a/src/test/modules/read_wal_from_buffers/t/001_basic.pl b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..2360ff1171
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
@@ -0,0 +1,74 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(usleep);
+
+# Setup a new node.  The configuration chosen here minimizes the number
+# of arbitrary records that could get generated in a cluster.  Enlarging
+# checkpoint_timeout avoids noise with checkpoint activity.  wal_level
+# set to "minimal" avoids random standby snapshot records.  Autovacuum
+# could also trigger randomly, generating random WAL activity of its own.
+# Enlarging wal_writer_delay and wal_writer_flush_after avoid background
+# wal flush by walwriter.
+my $node = PostgreSQL::Test::Cluster->new("node");
+$node->init;
+$node->append_conf(
+	'postgresql.conf',
+	q[wal_level = minimal
+	  autovacuum = off
+	  checkpoint_timeout = '30min'
+	  wal_writer_delay = 10000ms
+	  wal_writer_flush_after = 1GB
+]);
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION read_wal_from_buffers;');
+
+$node->safe_psql('postgres', 'CREATE TABLE t (c int);');
+
+my $result = 0;
+my $lsn;
+my $to_read;
+
+# Wait until we read from WAL buffers
+for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+	# Get current insert LSN. After this, we generate some WAL which is guaranteed
+	# to be in WAL buffers as no other WAL-generating activity is happening on
+	# the server. We then verify that we can read the WAL from WAL buffers
+	# using this LSN.
+	$lsn =
+	  $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+	my $logstart = -s $node->logfile;
+
+	# Generate minimal WAL so that WAL buffers don't get overwritten.
+	$node->safe_psql('postgres', "INSERT INTO t VALUES ($i);");
+
+	$to_read = 8192;
+
+	my $res = $node->safe_psql('postgres',
+		qq{SELECT read_wal_from_buffers(lsn := '$lsn', bytes_to_read := $to_read) > 0;}
+	);
+
+	my $log = $node->log_contains(
+		"request to flush past end of generated WAL; request .*, current position .*",
+		$logstart);
+
+	if ($res eq 't' && $log > 0)
+	{
+		$result = 1;
+		last;
+	}
+
+	usleep(100_000);
+}
+ok($result, 'waited until WAL is successfully read from WAL buffers');
+
+done_testing();
-- 
2.34.1

v26-0003-Use-WALReadFromBuffers-in-more-places.patch (application/octet-stream)
From 0cd37ae40b85e3c02fdd1f486cd8d1071b1a12cf Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 21 Mar 2024 18:00:29 +0000
Subject: [PATCH v26 3/4] Use WALReadFromBuffers in more places

---
 src/backend/access/transam/xlogutils.c | 13 ++++++++++++-
 src/backend/replication/walsender.c    | 16 +++++++++++++---
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5295b85fe0..1e1f5b5306 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -892,6 +892,8 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 	WALReadError errinfo;
 	TimeLineID	currTLI;
+	Size		nbytes;
+	Size		rbytes;
 
 	loc = targetPagePtr + reqLen;
 
@@ -1004,7 +1006,16 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		count = read_upto - targetPagePtr;
 	}
 
-	if (!WALRead(state, cur_page, targetPagePtr, count, tli,
+	/* attempt to read WAL from WAL buffers first */
+	nbytes = count;
+	rbytes = WALReadFromBuffers(cur_page, targetPagePtr, nbytes, currTLI);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
+	/* now read the remaining WAL from WAL file */
+	if (nbytes > 0 &&
+		!WALRead(state, cur_page, targetPagePtr, nbytes, tli,
 				 &errinfo))
 		WALReadRaiseError(&errinfo);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index bc40c454de..19cf1d5ce7 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1056,6 +1056,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	WALReadError errinfo;
 	XLogSegNo	segno;
 	TimeLineID	currTLI;
+	Size		nbytes;
+	Size		rbytes;
 
 	/*
 	 * Make sure we have enough WAL available before retrieving the current
@@ -1092,11 +1094,19 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	else
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
-	/* now actually read the data, we know it's there */
-	if (!WALRead(state,
+	/* attempt to read WAL from WAL buffers first */
+	nbytes = count;
+	rbytes = WALReadFromBuffers(cur_page, targetPagePtr, nbytes, currTLI);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
+	/* now read the remaining WAL from WAL file */
+	if (nbytes > 0 &&
+		!WALRead(state,
 				 cur_page,
 				 targetPagePtr,
-				 count,
+				 nbytes,
 				 currTLI,		/* Pass the current TLI because only
 								 * WalSndSegmentOpen controls whether new TLI
 								 * is needed. */
-- 
2.34.1

v26-0004-Demonstrate-reading-unflushed-WAL-directly-from-.patch (application/octet-stream)
From f77c06f60857b03086169aac5b0dd479fe046129 Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Thu, 21 Mar 2024 18:00:47 +0000
Subject: [PATCH v26 4/4] Demonstrate reading unflushed WAL directly from WAL
 buffers

---
 src/backend/access/transam/xlogreader.c       |   3 +-
 .../read_wal_from_buffers--1.0.sql            |  23 ++
 .../read_wal_from_buffers.c                   | 266 +++++++++++++++++-
 .../read_wal_from_buffers/t/001_basic.pl      |  37 +++
 4 files changed, 327 insertions(+), 2 deletions(-)

diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 75ea36c37f..3e1b814d54 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1033,7 +1033,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 * record is.  This is so that we can check the additional identification
 	 * info that is present in the first page's "long" header.
 	 */
-	if (targetSegNo != state->seg.ws_segno && targetPageOff != 0)
+	if (state->seg.ws_segno != 0 &&
+		targetSegNo != state->seg.ws_segno && targetPageOff != 0)
 	{
 		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
 
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
index 82fa097d10..72d05522fc 100644
--- a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
@@ -12,3 +12,26 @@ CREATE FUNCTION read_wal_from_buffers(IN lsn pg_lsn, IN bytes_to_read int,
     bytes_read OUT int)
 AS 'MODULE_PATHNAME', 'read_wal_from_buffers'
 LANGUAGE C STRICT;
+
+--
+-- get_wal_records_info_from_buffers()
+--
+-- SQL function to get info of WAL records available in WAL buffers.
+--
+CREATE FUNCTION get_wal_records_info_from_buffers(IN start_lsn pg_lsn,
+    IN end_lsn pg_lsn,
+    OUT start_lsn pg_lsn,
+    OUT end_lsn pg_lsn,
+    OUT prev_lsn pg_lsn,
+    OUT xid xid,
+    OUT resource_manager text,
+    OUT record_type text,
+    OUT record_length int4,
+    OUT main_data_length int4,
+    OUT fpi_length int4,
+    OUT description text,
+    OUT block_ref text
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'get_wal_records_info_from_buffers'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
index 9df5c07b4b..ed33a14127 100644
--- a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
@@ -14,11 +14,27 @@
 #include "postgres.h"
 
 #include "access/xlog.h"
-#include "fmgr.h"
+#include "access/xlog_internal.h"
+#include "access/xlogreader.h"
+#include "access/xlogrecovery.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/builtins.h"
 #include "utils/pg_lsn.h"
 
 PG_MODULE_MAGIC;
 
+static int	read_from_wal_buffers(XLogReaderState *state, XLogRecPtr targetPagePtr,
+								  int reqLen, XLogRecPtr targetRecPtr,
+								  char *cur_page);
+
+static XLogRecord *ReadNextXLogRecord(XLogReaderState *xlogreader);
+static void GetWALRecordInfo(XLogReaderState *record, Datum *values,
+							 bool *nulls, uint32 ncols);
+static void GetWALRecordsInfo(FunctionCallInfo fcinfo,
+							  XLogRecPtr start_lsn,
+							  XLogRecPtr end_lsn);
+
 /*
  * SQL function to read WAL from WAL buffers. Returns number of bytes read.
  */
@@ -52,3 +68,251 @@ read_wal_from_buffers(PG_FUNCTION_ARGS)
 
 	PG_RETURN_INT32(read);
 }
+
+/*
+ * XLogReaderRoutine->page_read callback for reading WAL from WAL buffers.
+ */
+static int
+read_from_wal_buffers(XLogReaderState *state, XLogRecPtr targetPagePtr,
+					  int reqLen, XLogRecPtr targetRecPtr,
+					  char *cur_page)
+{
+	XLogRecPtr	read_upto,
+				loc;
+	TimeLineID	tli = GetWALInsertionTimeLine();
+	Size		count;
+	Size		read = 0;
+
+	loc = targetPagePtr + reqLen;
+
+	/* Loop waiting for xlog to be available if necessary */
+	while (1)
+	{
+		read_upto = GetXLogInsertRecPtr();
+
+		if (loc <= read_upto)
+			break;
+
+		WaitXLogInsertionsToFinish(loc);
+
+		CHECK_FOR_INTERRUPTS();
+		pg_usleep(1000L);
+	}
+
+	if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
+	{
+		/*
+		 * more than one block available; read only that block, have caller
+		 * come back if they need more.
+		 */
+		count = XLOG_BLCKSZ;
+	}
+	else if (targetPagePtr + reqLen > read_upto)
+	{
+		/* not enough data there */
+		return -1;
+	}
+	else
+	{
+		/* enough bytes available to satisfy the request */
+		count = read_upto - targetPagePtr;
+	}
+
+	/* read WAL from WAL buffers */
+	read = WALReadFromBuffers(cur_page, targetPagePtr, count, tli);
+
+	if (read != count)
+		ereport(ERROR,
+				errmsg("could not read fully from WAL buffers; expected %zu, read %zu",
+					   count, read));
+
+	return count;
+}
+
+/*
+ * Get info of all WAL records between start LSN and end LSN.
+ *
+ * This function and its helpers below are similar to pg_walinspect's
+ * pg_get_wal_records_info() except that it will get info of WAL records
+ * available in WAL buffers.
+ */
+PG_FUNCTION_INFO_V1(get_wal_records_info_from_buffers);
+Datum
+get_wal_records_info_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	start_lsn = PG_GETARG_LSN(0);
+	XLogRecPtr	end_lsn = PG_GETARG_LSN(1);
+
+	/*
+	 * Validate start and end LSNs coming from the function inputs.
+	 *
+	 * Reading WAL below the first page of the first segments isn't allowed.
+	 * This is a bootstrap WAL page and the page_read callback fails to read
+	 * it.
+	 */
+	if (start_lsn < XLOG_BLCKSZ)
+		ereport(ERROR,
+				(errmsg("could not read WAL at LSN %X/%X",
+						LSN_FORMAT_ARGS(start_lsn))));
+
+	if (start_lsn > end_lsn)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("WAL start LSN must be less than end LSN")));
+
+	GetWALRecordsInfo(fcinfo, start_lsn, end_lsn);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Read next WAL record.
+ */
+static XLogRecord *
+ReadNextXLogRecord(XLogReaderState *xlogreader)
+{
+	XLogRecord *record;
+	char	   *errormsg;
+
+	record = XLogReadRecord(xlogreader, &errormsg);
+
+	if (record == NULL)
+	{
+		if (errormsg)
+			ereport(ERROR,
+					errmsg("could not read WAL at %X/%X: %s",
+						   LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg));
+		else
+			ereport(ERROR,
+					errmsg("could not read WAL at %X/%X",
+						   LSN_FORMAT_ARGS(xlogreader->EndRecPtr)));
+	}
+
+	return record;
+}
+
+/*
+ * Output values that make up a row describing caller's WAL record.
+ */
+static void
+GetWALRecordInfo(XLogReaderState *record, Datum *values,
+				 bool *nulls, uint32 ncols)
+{
+	const char *record_type;
+	RmgrData	desc;
+	uint32		fpi_len = 0;
+	StringInfoData rec_desc;
+	StringInfoData rec_blk_ref;
+	int			i = 0;
+
+	desc = GetRmgr(XLogRecGetRmid(record));
+	record_type = desc.rm_identify(XLogRecGetInfo(record));
+
+	if (record_type == NULL)
+		record_type = psprintf("UNKNOWN (%x)", XLogRecGetInfo(record) & ~XLR_INFO_MASK);
+
+	initStringInfo(&rec_desc);
+	desc.rm_desc(&rec_desc, record);
+
+	if (XLogRecHasAnyBlockRefs(record))
+	{
+		initStringInfo(&rec_blk_ref);
+		XLogRecGetBlockRefInfo(record, false, true, &rec_blk_ref, &fpi_len);
+	}
+
+	values[i++] = LSNGetDatum(record->ReadRecPtr);
+	values[i++] = LSNGetDatum(record->EndRecPtr);
+	values[i++] = LSNGetDatum(XLogRecGetPrev(record));
+	values[i++] = TransactionIdGetDatum(XLogRecGetXid(record));
+	values[i++] = CStringGetTextDatum(desc.rm_name);
+	values[i++] = CStringGetTextDatum(record_type);
+	values[i++] = UInt32GetDatum(XLogRecGetTotalLen(record));
+	values[i++] = UInt32GetDatum(XLogRecGetDataLen(record));
+	values[i++] = UInt32GetDatum(fpi_len);
+
+	if (rec_desc.len > 0)
+		values[i++] = CStringGetTextDatum(rec_desc.data);
+	else
+		nulls[i++] = true;
+
+	if (XLogRecHasAnyBlockRefs(record))
+		values[i++] = CStringGetTextDatum(rec_blk_ref.data);
+	else
+		nulls[i++] = true;
+
+	Assert(i == ncols);
+}
+
+/*
+ * Get info of all WAL records between start LSN and end LSN.
+ */
+static void
+GetWALRecordsInfo(FunctionCallInfo fcinfo, XLogRecPtr start_lsn,
+				  XLogRecPtr end_lsn)
+{
+#define GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS 11
+	XLogReaderState *xlogreader;
+	XLogRecPtr	first_valid_record;
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	MemoryContext old_cxt;
+	MemoryContext tmp_cxt;
+
+	Assert(start_lsn <= end_lsn);
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+									XL_ROUTINE(.page_read = &read_from_wal_buffers,
+											   .segment_open = NULL,
+											   .segment_close = NULL),
+									NULL);
+
+	if (xlogreader == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+				 errdetail("Failed while allocating a WAL reading processor.")));
+
+	/* first find a valid recptr to start from */
+	first_valid_record = XLogFindNextRecord(xlogreader, start_lsn);
+
+	if (XLogRecPtrIsInvalid(first_valid_record))
+	{
+		ereport(LOG,
+				(errmsg("could not find a valid record after %X/%X",
+						LSN_FORMAT_ARGS(start_lsn))));
+
+		return;
+	}
+
+	tmp_cxt = AllocSetContextCreate(CurrentMemoryContext,
+									"GetWALRecordsInfo temporary cxt",
+									ALLOCSET_DEFAULT_SIZES);
+
+	while (ReadNextXLogRecord(xlogreader) &&
+		   xlogreader->EndRecPtr <= end_lsn)
+	{
+		Datum		values[GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS] = {0};
+		bool		nulls[GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS] = {0};
+
+		/* Use the tmp context so we can clean up after each tuple is done */
+		old_cxt = MemoryContextSwitchTo(tmp_cxt);
+
+		GetWALRecordInfo(xlogreader, values, nulls,
+						 GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS);
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+
+		/* clean up and switch back */
+		MemoryContextSwitchTo(old_cxt);
+		MemoryContextReset(tmp_cxt);
+
+		CHECK_FOR_INTERRUPTS();
+	}
+
+	MemoryContextDelete(tmp_cxt);
+	XLogReaderFree(xlogreader);
+
+#undef GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS
+}
diff --git a/src/test/modules/read_wal_from_buffers/t/001_basic.pl b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
index 2360ff1171..15ef550c8c 100644
--- a/src/test/modules/read_wal_from_buffers/t/001_basic.pl
+++ b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
@@ -71,4 +71,41 @@ for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
 }
 ok($result, 'waited until WAL is successfully read from WAL buffers');
 
+$result = 0;
+
+# Wait until we get info of WAL records available in WAL buffers.
+for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+	$node->safe_psql('postgres', "DROP TABLE IF EXISTS foo, bar;");
+	$node->safe_psql('postgres',
+		"CREATE TABLE foo AS SELECT * FROM generate_series(1, 2);");
+	my $start_lsn =
+	  $node->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn();");
+	my $tbl_oid = $node->safe_psql('postgres',
+		"SELECT oid FROM pg_class WHERE relname = 'foo';");
+	$node->safe_psql('postgres',
+		"INSERT INTO foo SELECT * FROM generate_series(1, 10);");
+	my $end_lsn =
+	  $node->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn();");
+	$node->safe_psql('postgres',
+		"CREATE TABLE bar AS SELECT * FROM generate_series(1, 2);");
+
+	my $res = $node->safe_psql(
+		'postgres',
+		"SELECT count(*) FROM get_wal_records_info_from_buffers('$start_lsn', '$end_lsn')
+					WHERE block_ref LIKE concat('%', '$tbl_oid', '%') AND
+						resource_manager = 'Heap' AND
+						record_type = 'INSERT';");
+
+	if ($res eq 10)
+	{
+		$result = 1;
+		last;
+	}
+
+	usleep(100_000);
+}
+ok($result,
+	'waited until we get info of WAL records available in WAL buffers.');
+
 done_testing();
-- 
2.34.1

#91Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Bharath Rupireddy (#90)
2 attachment(s)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Thu, Mar 21, 2024 at 11:33 PM Bharath Rupireddy
<bharath.rupireddyforpostgres@gmail.com> wrote:

Please find the v26 patches after rebasing.

Commit f3ff7bf83b added a check in WALReadFromBuffers to ensure the
requested WAL is not past the WAL that's copied to WAL buffers. So,
I've dropped v26-0001 patch.

I've attached v27 patches for further review.

0001 adds a test module that demonstrates two patterns for reading from
WAL buffers: the caller ensuring that the requested WAL is fully copied
to WAL buffers using WaitXLogInsertionsToFinish, and an implementation
of the xlogreader page_read callback that reads not-yet-flushed WAL
directly from WAL buffers.
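
As a rough illustration of the first pattern, here is a minimal
standalone sketch of how a caller might trim a read request to the WAL
that has actually been copied to WAL buffers. It does not use the
server's APIs: Lsn and clamp_read_count are hypothetical names made up
for this example, and copied_upto merely stands in for the value that
WaitXLogInsertionsToFinish would return. The guard for a start position
at or beyond copied_upto is an assumption of this sketch, not something
taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for XLogRecPtr; not the server's type. */
typedef uint64_t Lsn;

/*
 * Return how many bytes of the request [startptr, startptr + count) can be
 * read, given that WAL is known to be fully copied only up to copied_upto.
 */
static size_t
clamp_read_count(Lsn startptr, size_t count, Lsn copied_upto)
{
	Lsn			upto = startptr + count;

	/* The request must not wrap around the LSN space. */
	assert(upto >= startptr);

	/* Nothing at or after startptr has been copied yet. */
	if (copied_upto <= startptr)
		return 0;

	/* Trim the end of the request to the copied position. */
	if (upto > copied_upto)
		upto = copied_upto;

	return (size_t) (upto - startptr);
}
```

A caller would pass the trimmed count on to the buffer read, so that it
never asks for bytes that in-progress insertions are still copying.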

0002 uses WALReadFromBuffers in more places: logical walsenders,
logical decoding functions, and backends reading WAL.
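
The call sites in 0002 follow a "buffers first, file for the rest"
shape. The sketch below imitates that shape outside the server, under
loud assumptions: WalSource, source_read and read_buffers_then_file are
hypothetical names invented for this example, and both the WAL buffers
and the WAL file are modelled as flat byte ranges addressed by LSN. It
only shows how the destination pointer, start position and remaining
count advance after a partial read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a WAL source holding the LSN range [start, end). */
typedef struct WalSource
{
	const char *data;			/* bytes for LSNs start .. end - 1 */
	uint64_t	start;
	uint64_t	end;
} WalSource;

/* Copy up to count bytes beginning at lsn; may return fewer, or 0 on a miss. */
static size_t
source_read(const WalSource *src, char *dst, uint64_t lsn, size_t count)
{
	size_t		avail;
	size_t		n;

	assert(dst != NULL);

	if (lsn < src->start || lsn >= src->end)
		return 0;

	avail = (size_t) (src->end - lsn);
	n = count < avail ? count : avail;
	memcpy(dst, src->data + (lsn - src->start), n);
	return n;
}

/* Try the buffers first, then read whatever remains from the file. */
static size_t
read_buffers_then_file(const WalSource *buffers, const WalSource *file,
					   char *dst, uint64_t lsn, size_t count)
{
	size_t		rbytes = source_read(buffers, dst, lsn, count);

	/* Advance past the bytes the buffers already supplied. */
	dst += rbytes;
	lsn += rbytes;
	count -= rbytes;

	if (count > 0)
		rbytes += source_read(file, dst, lsn, count);

	return rbytes;
}
```

In this model a complete miss in the buffers costs one extra range
check before the file read, which mirrors why the callers in 0002 can
try the buffers unconditionally.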

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Attachments:

v27-0001-Add-test-module-to-demonstrate-reading-from-WAL.patch (application/x-patch)
From 77c4cc3320a9c2982b5fdac9bd51a892ca644fae Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Mon, 8 Apr 2024 05:02:07 +0000
Subject: [PATCH v27 1/2] Add test module to demonstrate reading from WAL 
 buffers patterns

This commit adds a test module to demonstrate a few patterns for
reading from WAL buffers using WALReadFromBuffers added by commit
91f2cae7a4e.

1. This module contains a test function to read the WAL that's
fully copied to WAL buffers. Whether or not the WAL is fully
copied to WAL buffers is ensured by WaitXLogInsertionsToFinish
before WALReadFromBuffers.

2. This module contains an implementation of xlogreader page_read
callback to read unflushed/not-yet-flushed WAL directly from WAL
buffers.
---
 src/backend/access/transam/xlog.c             |   3 +-
 src/backend/access/transam/xlogreader.c       |   3 +-
 src/include/access/xlog.h                     |   1 +
 src/test/modules/Makefile                     |   1 +
 src/test/modules/meson.build                  |   1 +
 .../modules/read_wal_from_buffers/.gitignore  |   4 +
 .../modules/read_wal_from_buffers/Makefile    |  23 ++
 .../modules/read_wal_from_buffers/meson.build |  33 ++
 .../read_wal_from_buffers--1.0.sql            |  37 ++
 .../read_wal_from_buffers.c                   | 318 ++++++++++++++++++
 .../read_wal_from_buffers.control             |   4 +
 .../read_wal_from_buffers/t/001_basic.pl      | 111 ++++++
 12 files changed, 536 insertions(+), 3 deletions(-)
 create mode 100644 src/test/modules/read_wal_from_buffers/.gitignore
 create mode 100644 src/test/modules/read_wal_from_buffers/Makefile
 create mode 100644 src/test/modules/read_wal_from_buffers/meson.build
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
 create mode 100644 src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
 create mode 100644 src/test/modules/read_wal_from_buffers/t/001_basic.pl

diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index e3fb26f5ab..4d4a840c8e 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -700,7 +700,6 @@ static void ReserveXLogInsertLocation(int size, XLogRecPtr *StartPos,
 									  XLogRecPtr *EndPos, XLogRecPtr *PrevPtr);
 static bool ReserveXLogSwitch(XLogRecPtr *StartPos, XLogRecPtr *EndPos,
 							  XLogRecPtr *PrevPtr);
-static XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 static char *GetXLogBuffer(XLogRecPtr ptr, TimeLineID tli);
 static XLogRecPtr XLogBytePosToRecPtr(uint64 bytepos);
 static XLogRecPtr XLogBytePosToEndRecPtr(uint64 bytepos);
@@ -1496,7 +1495,7 @@ WALInsertLockUpdateInsertingAt(XLogRecPtr insertingAt)
  * uninitialized page), and the inserter might need to evict an old WAL buffer
  * to make room for a new one, which in turn requires WALWriteLock.
  */
-static XLogRecPtr
+XLogRecPtr
 WaitXLogInsertionsToFinish(XLogRecPtr upto)
 {
 	uint64		bytepos;
diff --git a/src/backend/access/transam/xlogreader.c b/src/backend/access/transam/xlogreader.c
index 37d2a57961..12dddf64cc 100644
--- a/src/backend/access/transam/xlogreader.c
+++ b/src/backend/access/transam/xlogreader.c
@@ -1033,7 +1033,8 @@ ReadPageInternal(XLogReaderState *state, XLogRecPtr pageptr, int reqLen)
 	 * record is.  This is so that we can check the additional identification
 	 * info that is present in the first page's "long" header.
 	 */
-	if (targetSegNo != state->seg.ws_segno && targetPageOff != 0)
+	if (state->seg.ws_segno != 0 &&
+		targetSegNo != state->seg.ws_segno && targetPageOff != 0)
 	{
 		XLogRecPtr	targetSegmentPtr = pageptr - targetPageOff;
 
diff --git a/src/include/access/xlog.h b/src/include/access/xlog.h
index 76787a8267..74606a6846 100644
--- a/src/include/access/xlog.h
+++ b/src/include/access/xlog.h
@@ -252,6 +252,7 @@ extern XLogRecPtr GetLastImportantRecPtr(void);
 
 extern void SetWalWriterSleeping(bool sleeping);
 
+extern XLogRecPtr WaitXLogInsertionsToFinish(XLogRecPtr upto);
 extern Size WALReadFromBuffers(char *dstbuf, XLogRecPtr startptr, Size count,
 							   TimeLineID tli);
 
diff --git a/src/test/modules/Makefile b/src/test/modules/Makefile
index 256799f520..c39b407e5b 100644
--- a/src/test/modules/Makefile
+++ b/src/test/modules/Makefile
@@ -12,6 +12,7 @@ SUBDIRS = \
 		  dummy_seclabel \
 		  libpq_pipeline \
 		  plsample \
+		  read_wal_from_buffers \
 		  spgist_name_ops \
 		  test_bloomfilter \
 		  test_copy_callbacks \
diff --git a/src/test/modules/meson.build b/src/test/modules/meson.build
index d8fe059d23..222fa1cd72 100644
--- a/src/test/modules/meson.build
+++ b/src/test/modules/meson.build
@@ -10,6 +10,7 @@ subdir('injection_points')
 subdir('ldap_password_func')
 subdir('libpq_pipeline')
 subdir('plsample')
+subdir('read_wal_from_buffers')
 subdir('spgist_name_ops')
 subdir('ssl_passphrase_callback')
 subdir('test_bloomfilter')
diff --git a/src/test/modules/read_wal_from_buffers/.gitignore b/src/test/modules/read_wal_from_buffers/.gitignore
new file mode 100644
index 0000000000..5dcb3ff972
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/.gitignore
@@ -0,0 +1,4 @@
+# Generated subdirectories
+/log/
+/results/
+/tmp_check/
diff --git a/src/test/modules/read_wal_from_buffers/Makefile b/src/test/modules/read_wal_from_buffers/Makefile
new file mode 100644
index 0000000000..9e57a837f9
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/Makefile
@@ -0,0 +1,23 @@
+# src/test/modules/read_wal_from_buffers/Makefile
+
+MODULE_big = read_wal_from_buffers
+OBJS = \
+	$(WIN32RES) \
+	read_wal_from_buffers.o
+PGFILEDESC = "read_wal_from_buffers - test module to read WAL from WAL buffers"
+
+EXTENSION = read_wal_from_buffers
+DATA = read_wal_from_buffers--1.0.sql
+
+TAP_TESTS = 1
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = src/test/modules/read_wal_from_buffers
+top_builddir = ../../../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
diff --git a/src/test/modules/read_wal_from_buffers/meson.build b/src/test/modules/read_wal_from_buffers/meson.build
new file mode 100644
index 0000000000..3fac00d616
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/meson.build
@@ -0,0 +1,33 @@
+# Copyright (c) 2024, PostgreSQL Global Development Group
+
+read_wal_from_buffers_sources = files(
+  'read_wal_from_buffers.c',
+)
+
+if host_system == 'windows'
+  read_wal_from_buffers_sources += rc_lib_gen.process(win32ver_rc, extra_args: [
+    '--NAME', 'read_wal_from_buffers',
+    '--FILEDESC', 'read_wal_from_buffers - test module to read WAL from WAL buffers',])
+endif
+
+read_wal_from_buffers = shared_module('read_wal_from_buffers',
+  read_wal_from_buffers_sources,
+  kwargs: pg_test_mod_args,
+)
+test_install_libs += read_wal_from_buffers
+
+test_install_data += files(
+  'read_wal_from_buffers.control',
+  'read_wal_from_buffers--1.0.sql',
+)
+
+tests += {
+  'name': 'read_wal_from_buffers',
+  'sd': meson.current_source_dir(),
+  'bd': meson.current_build_dir(),
+  'tap': {
+    'tests': [
+      't/001_basic.pl',
+    ],
+  },
+}
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
new file mode 100644
index 0000000000..72d05522fc
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql
@@ -0,0 +1,37 @@
+/* src/test/modules/read_wal_from_buffers/read_wal_from_buffers--1.0.sql */
+
+-- complain if script is sourced in psql, rather than via CREATE EXTENSION
+\echo Use "CREATE EXTENSION read_wal_from_buffers" to load this file. \quit
+
+--
+-- read_wal_from_buffers()
+--
+-- SQL function to read WAL from WAL buffers. Returns number of bytes read.
+--
+CREATE FUNCTION read_wal_from_buffers(IN lsn pg_lsn, IN bytes_to_read int,
+    bytes_read OUT int)
+AS 'MODULE_PATHNAME', 'read_wal_from_buffers'
+LANGUAGE C STRICT;
+
+--
+-- get_wal_records_info_from_buffers()
+--
+-- SQL function to get info of WAL records available in WAL buffers.
+--
+CREATE FUNCTION get_wal_records_info_from_buffers(IN start_lsn pg_lsn,
+    IN end_lsn pg_lsn,
+    OUT start_lsn pg_lsn,
+    OUT end_lsn pg_lsn,
+    OUT prev_lsn pg_lsn,
+    OUT xid xid,
+    OUT resource_manager text,
+    OUT record_type text,
+    OUT record_length int4,
+    OUT main_data_length int4,
+    OUT fpi_length int4,
+    OUT description text,
+    OUT block_ref text
+)
+RETURNS SETOF record
+AS 'MODULE_PATHNAME', 'get_wal_records_info_from_buffers'
+LANGUAGE C STRICT PARALLEL SAFE;
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
new file mode 100644
index 0000000000..ed33a14127
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
@@ -0,0 +1,318 @@
+/*--------------------------------------------------------------------------
+ *
+ * read_wal_from_buffers.c
+ *		Test module to read WAL from WAL buffers.
+ *
+ * Portions Copyright (c) 1996-2024, PostgreSQL Global Development Group
+ * Portions Copyright (c) 1994, Regents of the University of California
+ *
+ * IDENTIFICATION
+ *	src/test/modules/read_wal_from_buffers/read_wal_from_buffers.c
+ * -------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "access/xlog.h"
+#include "access/xlog_internal.h"
+#include "access/xlogreader.h"
+#include "access/xlogrecovery.h"
+#include "funcapi.h"
+#include "miscadmin.h"
+#include "utils/builtins.h"
+#include "utils/pg_lsn.h"
+
+PG_MODULE_MAGIC;
+
+static int	read_from_wal_buffers(XLogReaderState *state, XLogRecPtr targetPagePtr,
+								  int reqLen, XLogRecPtr targetRecPtr,
+								  char *cur_page);
+
+static XLogRecord *ReadNextXLogRecord(XLogReaderState *xlogreader);
+static void GetWALRecordInfo(XLogReaderState *record, Datum *values,
+							 bool *nulls, uint32 ncols);
+static void GetWALRecordsInfo(FunctionCallInfo fcinfo,
+							  XLogRecPtr start_lsn,
+							  XLogRecPtr end_lsn);
+
+/*
+ * SQL function to read WAL from WAL buffers. Returns number of bytes read.
+ */
+PG_FUNCTION_INFO_V1(read_wal_from_buffers);
+Datum
+read_wal_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	startptr = PG_GETARG_LSN(0);
+	int32		count = PG_GETARG_INT32(1);
+	Size		read;
+	char	   *data = palloc0(count);
+	XLogRecPtr	upto = startptr + count;
+	XLogRecPtr	insert_pos = GetXLogInsertRecPtr();
+	TimeLineID	tli = GetWALInsertionTimeLine();
+
+	/*
+	 * The requested WAL may be very recent, so wait for any in-progress WAL
+	 * insertions to WAL buffers to finish.
+	 */
+	if (upto > insert_pos)
+	{
+		XLogRecPtr	writtenUpto = WaitXLogInsertionsToFinish(upto);
+
+		upto = Min(upto, writtenUpto);
+		count = upto - startptr;
+	}
+
+	read = WALReadFromBuffers(data, startptr, count, tli);
+
+	pfree(data);
+
+	PG_RETURN_INT32(read);
+}
+
+/*
+ * XLogReaderRoutine->page_read callback for reading WAL from WAL buffers.
+ */
+static int
+read_from_wal_buffers(XLogReaderState *state, XLogRecPtr targetPagePtr,
+					  int reqLen, XLogRecPtr targetRecPtr,
+					  char *cur_page)
+{
+	XLogRecPtr	read_upto,
+				loc;
+	TimeLineID	tli = GetWALInsertionTimeLine();
+	Size		count;
+	Size		read = 0;
+
+	loc = targetPagePtr + reqLen;
+
+	/* Loop waiting for xlog to be available if necessary */
+	while (1)
+	{
+		read_upto = GetXLogInsertRecPtr();
+
+		if (loc <= read_upto)
+			break;
+
+		WaitXLogInsertionsToFinish(loc);
+
+		CHECK_FOR_INTERRUPTS();
+		pg_usleep(1000L);
+	}
+
+	if (targetPagePtr + XLOG_BLCKSZ <= read_upto)
+	{
+		/*
+		 * more than one block available; read only that block, have caller
+		 * come back if they need more.
+		 */
+		count = XLOG_BLCKSZ;
+	}
+	else if (targetPagePtr + reqLen > read_upto)
+	{
+		/* not enough data there */
+		return -1;
+	}
+	else
+	{
+		/* enough bytes available to satisfy the request */
+		count = read_upto - targetPagePtr;
+	}
+
+	/* read WAL from WAL buffers */
+	read = WALReadFromBuffers(cur_page, targetPagePtr, count, tli);
+
+	if (read != count)
+		ereport(ERROR,
+				errmsg("could not read fully from WAL buffers; expected %zu, read %zu",
+					   count, read));
+
+	return count;
+}
+
+/*
+ * Get info of all WAL records between start LSN and end LSN.
+ *
+ * This function and its helpers below are similar to pg_walinspect's
+ * pg_get_wal_records_info() except that it will get info of WAL records
+ * available in WAL buffers.
+ */
+PG_FUNCTION_INFO_V1(get_wal_records_info_from_buffers);
+Datum
+get_wal_records_info_from_buffers(PG_FUNCTION_ARGS)
+{
+	XLogRecPtr	start_lsn = PG_GETARG_LSN(0);
+	XLogRecPtr	end_lsn = PG_GETARG_LSN(1);
+
+	/*
+	 * Validate start and end LSNs coming from the function inputs.
+	 *
+	 * Reading WAL below the first page of the first segment isn't allowed.
+	 * That page is the bootstrap WAL page, and the page_read callback fails
+	 * to read it.
+	 */
+	if (start_lsn < XLOG_BLCKSZ)
+		ereport(ERROR,
+				(errmsg("could not read WAL at LSN %X/%X",
+						LSN_FORMAT_ARGS(start_lsn))));
+
+	if (start_lsn > end_lsn)
+		ereport(ERROR,
+				(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+				 errmsg("WAL start LSN must be less than end LSN")));
+
+	GetWALRecordsInfo(fcinfo, start_lsn, end_lsn);
+
+	PG_RETURN_VOID();
+}
+
+/*
+ * Read next WAL record.
+ */
+static XLogRecord *
+ReadNextXLogRecord(XLogReaderState *xlogreader)
+{
+	XLogRecord *record;
+	char	   *errormsg;
+
+	record = XLogReadRecord(xlogreader, &errormsg);
+
+	if (record == NULL)
+	{
+		if (errormsg)
+			ereport(ERROR,
+					errmsg("could not read WAL at %X/%X: %s",
+						   LSN_FORMAT_ARGS(xlogreader->EndRecPtr), errormsg));
+		else
+			ereport(ERROR,
+					errmsg("could not read WAL at %X/%X",
+						   LSN_FORMAT_ARGS(xlogreader->EndRecPtr)));
+	}
+
+	return record;
+}
+
+/*
+ * Output values that make up a row describing caller's WAL record.
+ */
+static void
+GetWALRecordInfo(XLogReaderState *record, Datum *values,
+				 bool *nulls, uint32 ncols)
+{
+	const char *record_type;
+	RmgrData	desc;
+	uint32		fpi_len = 0;
+	StringInfoData rec_desc;
+	StringInfoData rec_blk_ref;
+	int			i = 0;
+
+	desc = GetRmgr(XLogRecGetRmid(record));
+	record_type = desc.rm_identify(XLogRecGetInfo(record));
+
+	if (record_type == NULL)
+		record_type = psprintf("UNKNOWN (%x)", XLogRecGetInfo(record) & ~XLR_INFO_MASK);
+
+	initStringInfo(&rec_desc);
+	desc.rm_desc(&rec_desc, record);
+
+	if (XLogRecHasAnyBlockRefs(record))
+	{
+		initStringInfo(&rec_blk_ref);
+		XLogRecGetBlockRefInfo(record, false, true, &rec_blk_ref, &fpi_len);
+	}
+
+	values[i++] = LSNGetDatum(record->ReadRecPtr);
+	values[i++] = LSNGetDatum(record->EndRecPtr);
+	values[i++] = LSNGetDatum(XLogRecGetPrev(record));
+	values[i++] = TransactionIdGetDatum(XLogRecGetXid(record));
+	values[i++] = CStringGetTextDatum(desc.rm_name);
+	values[i++] = CStringGetTextDatum(record_type);
+	values[i++] = UInt32GetDatum(XLogRecGetTotalLen(record));
+	values[i++] = UInt32GetDatum(XLogRecGetDataLen(record));
+	values[i++] = UInt32GetDatum(fpi_len);
+
+	if (rec_desc.len > 0)
+		values[i++] = CStringGetTextDatum(rec_desc.data);
+	else
+		nulls[i++] = true;
+
+	if (XLogRecHasAnyBlockRefs(record))
+		values[i++] = CStringGetTextDatum(rec_blk_ref.data);
+	else
+		nulls[i++] = true;
+
+	Assert(i == ncols);
+}
+
+/*
+ * Get info of all WAL records between start LSN and end LSN.
+ */
+static void
+GetWALRecordsInfo(FunctionCallInfo fcinfo, XLogRecPtr start_lsn,
+				  XLogRecPtr end_lsn)
+{
+#define GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS 11
+	XLogReaderState *xlogreader;
+	XLogRecPtr	first_valid_record;
+	ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;
+	MemoryContext old_cxt;
+	MemoryContext tmp_cxt;
+
+	Assert(start_lsn <= end_lsn);
+
+	InitMaterializedSRF(fcinfo, 0);
+
+	xlogreader = XLogReaderAllocate(wal_segment_size, NULL,
+									XL_ROUTINE(.page_read = &read_from_wal_buffers,
+											   .segment_open = NULL,
+											   .segment_close = NULL),
+									NULL);
+
+	if (xlogreader == NULL)
+		ereport(ERROR,
+				(errcode(ERRCODE_OUT_OF_MEMORY),
+				 errmsg("out of memory"),
+				 errdetail("Failed while allocating a WAL reading processor.")));
+
+	/* first find a valid recptr to start from */
+	first_valid_record = XLogFindNextRecord(xlogreader, start_lsn);
+
+	if (XLogRecPtrIsInvalid(first_valid_record))
+	{
+		ereport(LOG,
+				(errmsg("could not find a valid record after %X/%X",
+						LSN_FORMAT_ARGS(start_lsn))));
+
+		return;
+	}
+
+	tmp_cxt = AllocSetContextCreate(CurrentMemoryContext,
+									"GetWALRecordsInfo temporary cxt",
+									ALLOCSET_DEFAULT_SIZES);
+
+	while (ReadNextXLogRecord(xlogreader) &&
+		   xlogreader->EndRecPtr <= end_lsn)
+	{
+		Datum		values[GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS] = {0};
+		bool		nulls[GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS] = {0};
+
+		/* Use the tmp context so we can clean up after each tuple is done */
+		old_cxt = MemoryContextSwitchTo(tmp_cxt);
+
+		GetWALRecordInfo(xlogreader, values, nulls,
+						 GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS);
+
+		tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
+							 values, nulls);
+
+		/* clean up and switch back */
+		MemoryContextSwitchTo(old_cxt);
+		MemoryContextReset(tmp_cxt);
+
+		CHECK_FOR_INTERRUPTS();
+	}
+
+	MemoryContextDelete(tmp_cxt);
+	XLogReaderFree(xlogreader);
+
+#undef GET_WAL_RECORDS_INFO_FROM_BUFFERS_COLS
+}
diff --git a/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
new file mode 100644
index 0000000000..b14d24751c
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/read_wal_from_buffers.control
@@ -0,0 +1,4 @@
+comment = 'Test module to read WAL from WAL buffers'
+default_version = '1.0'
+module_pathname = '$libdir/read_wal_from_buffers'
+relocatable = true
diff --git a/src/test/modules/read_wal_from_buffers/t/001_basic.pl b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
new file mode 100644
index 0000000000..15ef550c8c
--- /dev/null
+++ b/src/test/modules/read_wal_from_buffers/t/001_basic.pl
@@ -0,0 +1,111 @@
+# Copyright (c) 2021-2023, PostgreSQL Global Development Group
+
+use strict;
+use warnings;
+
+use PostgreSQL::Test::Cluster;
+use PostgreSQL::Test::Utils;
+use Test::More;
+use Time::HiRes qw(usleep);
+
+# Setup a new node.  The configuration chosen here minimizes the number
+# of arbitrary records that could get generated in a cluster.  Enlarging
+# checkpoint_timeout avoids noise with checkpoint activity.  wal_level
+# set to "minimal" avoids random standby snapshot records.  Autovacuum
+# could also trigger randomly, generating random WAL activity of its own.
+# Enlarging wal_writer_delay and wal_writer_flush_after avoids background
+# WAL flushes by walwriter.
+my $node = PostgreSQL::Test::Cluster->new("node");
+$node->init;
+$node->append_conf(
+	'postgresql.conf',
+	q[wal_level = minimal
+	  autovacuum = off
+	  checkpoint_timeout = '30min'
+	  wal_writer_delay = 10000ms
+	  wal_writer_flush_after = 1GB
+]);
+$node->start;
+
+# Setup.
+$node->safe_psql('postgres', 'CREATE EXTENSION read_wal_from_buffers;');
+
+$node->safe_psql('postgres', 'CREATE TABLE t (c int);');
+
+my $result = 0;
+my $lsn;
+my $to_read;
+
+# Wait until we read from WAL buffers
+for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+	# Get current insert LSN. After this, we generate some WAL which is guaranteed
+	# to be in WAL buffers as no other WAL-generating activity is happening on
+	# the server. We then verify that we can read the WAL from WAL buffers using
+	# this LSN.
+	$lsn =
+	  $node->safe_psql('postgres', 'SELECT pg_current_wal_insert_lsn();');
+
+	my $logstart = -s $node->logfile;
+
+	# Generate minimal WAL so that WAL buffers don't get overwritten.
+	$node->safe_psql('postgres', "INSERT INTO t VALUES ($i);");
+
+	$to_read = 8192;
+
+	my $res = $node->safe_psql('postgres',
+		qq{SELECT read_wal_from_buffers(lsn := '$lsn', bytes_to_read := $to_read) > 0;}
+	);
+
+	my $log = $node->log_contains(
+		"request to flush past end of generated WAL; request .*, current position .*",
+		$logstart);
+
+	if ($res eq 't' && $log > 0)
+	{
+		$result = 1;
+		last;
+	}
+
+	usleep(100_000);
+}
+ok($result, 'waited until WAL is successfully read from WAL buffers');
+
+$result = 0;
+
+# Wait until we get info of WAL records available in WAL buffers.
+for (my $i = 0; $i < 10 * $PostgreSQL::Test::Utils::timeout_default; $i++)
+{
+	$node->safe_psql('postgres', "DROP TABLE IF EXISTS foo, bar;");
+	$node->safe_psql('postgres',
+		"CREATE TABLE foo AS SELECT * FROM generate_series(1, 2);");
+	my $start_lsn =
+	  $node->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn();");
+	my $tbl_oid = $node->safe_psql('postgres',
+		"SELECT oid FROM pg_class WHERE relname = 'foo';");
+	$node->safe_psql('postgres',
+		"INSERT INTO foo SELECT * FROM generate_series(1, 10);");
+	my $end_lsn =
+	  $node->safe_psql('postgres', "SELECT pg_current_wal_insert_lsn();");
+	$node->safe_psql('postgres',
+		"CREATE TABLE bar AS SELECT * FROM generate_series(1, 2);");
+
+	my $res = $node->safe_psql(
+		'postgres',
+		"SELECT count(*) FROM get_wal_records_info_from_buffers('$start_lsn', '$end_lsn')
+					WHERE block_ref LIKE concat('%', '$tbl_oid', '%') AND
+						resource_manager = 'Heap' AND
+						record_type = 'INSERT';");
+
+	if ($res eq 10)
+	{
+		$result = 1;
+		last;
+	}
+
+	usleep(100_000);
+}
+ok($result,
+	'waited until we get info of WAL records available in WAL buffers.');
+
+done_testing();
-- 
2.34.1
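
For readers following the page_read callback in the test module above, its three-way choice of how many bytes to request can be sketched in isolation. This is a standalone illustration, not code from the patch: SKETCH_BLCKSZ, SketchRecPtr and sketch_page_read_count() are hypothetical stand-ins for XLOG_BLCKSZ, XLogRecPtr and the inline logic in read_from_wal_buffers(), and the block size of 8192 is assumed to be the default build setting.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192		/* hypothetical stand-in for XLOG_BLCKSZ */

typedef uint64_t SketchRecPtr;	/* hypothetical stand-in for XLogRecPtr */

/*
 * Decide how many bytes of the page starting at page_ptr to read, given
 * that WAL is known to be inserted up to read_upto.  Returns -1 when fewer
 * than req_len bytes of the page are available.
 */
static int
sketch_page_read_count(SketchRecPtr page_ptr, int req_len,
					   SketchRecPtr read_upto)
{
	assert(req_len >= 0 && req_len <= SKETCH_BLCKSZ);

	/* The whole page is available: read exactly one block. */
	if (page_ptr + SKETCH_BLCKSZ <= read_upto)
		return SKETCH_BLCKSZ;

	/* Not even the requested length is available. */
	if (page_ptr + (SketchRecPtr) req_len > read_upto)
		return -1;

	/* Partial page: at least req_len bytes, but less than a full block. */
	return (int) (read_upto - page_ptr);
}
```

With these stand-ins, a page at offset 8192 with WAL inserted up to 8692 yields 500 bytes for a request of 100 bytes, and -1 if only 50 bytes have been inserted past the page start.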

Attachment: v27-0002-Use-WALReadFromBuffers-in-more-places.patch (application/x-patch)
From 620695a36afdc30d27227a4864131c0fec80f3ba Mon Sep 17 00:00:00 2001
From: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com>
Date: Mon, 8 Apr 2024 05:02:46 +0000
Subject: [PATCH v27 2/2] Use WALReadFromBuffers in more places.

---
 src/backend/access/transam/xlogutils.c | 13 ++++++++++++-
 src/backend/replication/walsender.c    | 16 +++++++++++++---
 2 files changed, 25 insertions(+), 4 deletions(-)

diff --git a/src/backend/access/transam/xlogutils.c b/src/backend/access/transam/xlogutils.c
index 5295b85fe0..1e1f5b5306 100644
--- a/src/backend/access/transam/xlogutils.c
+++ b/src/backend/access/transam/xlogutils.c
@@ -892,6 +892,8 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 	int			count;
 	WALReadError errinfo;
 	TimeLineID	currTLI;
+	Size		nbytes;
+	Size		rbytes;
 
 	loc = targetPagePtr + reqLen;
 
@@ -1004,7 +1006,16 @@ read_local_xlog_page_guts(XLogReaderState *state, XLogRecPtr targetPagePtr,
 		count = read_upto - targetPagePtr;
 	}
 
-	if (!WALRead(state, cur_page, targetPagePtr, count, tli,
+	/* attempt to read WAL from WAL buffers first */
+	nbytes = count;
+	rbytes = WALReadFromBuffers(cur_page, targetPagePtr, nbytes, currTLI);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
+	/* now read the remaining WAL from WAL file */
+	if (nbytes > 0 &&
+		!WALRead(state, cur_page, targetPagePtr, nbytes, tli,
 				 &errinfo))
 		WALReadRaiseError(&errinfo);
 
diff --git a/src/backend/replication/walsender.c b/src/backend/replication/walsender.c
index bc40c454de..19cf1d5ce7 100644
--- a/src/backend/replication/walsender.c
+++ b/src/backend/replication/walsender.c
@@ -1056,6 +1056,8 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	WALReadError errinfo;
 	XLogSegNo	segno;
 	TimeLineID	currTLI;
+	Size		nbytes;
+	Size		rbytes;
 
 	/*
 	 * Make sure we have enough WAL available before retrieving the current
@@ -1092,11 +1094,19 @@ logical_read_xlog_page(XLogReaderState *state, XLogRecPtr targetPagePtr, int req
 	else
 		count = flushptr - targetPagePtr;	/* part of the page available */
 
-	/* now actually read the data, we know it's there */
-	if (!WALRead(state,
+	/* attempt to read WAL from WAL buffers first */
+	nbytes = count;
+	rbytes = WALReadFromBuffers(cur_page, targetPagePtr, nbytes, currTLI);
+	cur_page += rbytes;
+	targetPagePtr += rbytes;
+	nbytes -= rbytes;
+
+	/* now read the remaining WAL from WAL file */
+	if (nbytes > 0 &&
+		!WALRead(state,
 				 cur_page,
 				 targetPagePtr,
-				 count,
+				 nbytes,
 				 currTLI,		/* Pass the current TLI because only
 								 * WalSndSegmentOpen controls whether new TLI
 								 * is needed. */
-- 
2.34.1
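
Both call sites changed in v27-0002 follow the same pattern: ask the WAL buffers first, advance the destination pointer and start position by however many bytes were served, and go to the file only for the remainder. The standalone sketch below illustrates that pattern with an in-memory string standing in for WAL. Every sketch_* name is hypothetical and nothing here is PostgreSQL API; it also assumes that the buffer read returns a prefix of the request, which is how the patch treats the result of WALReadFromBuffers().

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for WALReadFromBuffers(): serves at most `avail`
 * bytes from the front of the request and reports how many it copied.
 */
static size_t
sketch_read_from_buffers(char *dst, const char *src, size_t count,
						 size_t avail)
{
	size_t		n = count < avail ? count : avail;

	memcpy(dst, src, n);
	return n;
}

/* Hypothetical stand-in for WALRead(): always succeeds in this sketch. */
static int
sketch_read_from_file(char *dst, const char *src, size_t count)
{
	memcpy(dst, src, count);
	return 1;
}

/*
 * Buffers first, file for the rest.  Returns the number of bytes that had
 * to come from the file, or -1 if the file read failed.
 */
static long
sketch_read_wal(char *dst, const char *wal, size_t count, size_t in_buffers)
{
	size_t		nbytes = count;
	size_t		rbytes;

	/* attempt to read from the buffers first */
	rbytes = sketch_read_from_buffers(dst, wal, nbytes, in_buffers);
	dst += rbytes;
	wal += rbytes;
	nbytes -= rbytes;

	/* read only what is still missing from the file */
	if (nbytes > 0 && !sketch_read_from_file(dst, wal, nbytes))
		return -1;

	return (long) nbytes;
}
```

When the buffers hold the whole request, the file read is skipped entirely, which is the saving the thread is after; when they hold nothing, the code degenerates to a plain file read.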

#92Andrey M. Borodin
x4mmm@yandex-team.ru
In reply to: Bharath Rupireddy (#91)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On 8 Apr 2024, at 08:17, Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> wrote:

Hi Bharath!

As far as I understand CF entry [0] is committed? I understand that there are some open followups, but I just want to determine correct CF item status...

Thanks!

Best regards, Andrey Borodin.

[0]: https://commitfest.postgresql.org/47/4060/

#93Michael Paquier
michael@paquier.xyz
In reply to: Andrey M. Borodin (#92)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Tue, Apr 09, 2024 at 09:33:49AM +0300, Andrey M. Borodin wrote:

As far as I understand CF entry [0] is committed? I understand that
there are some open followups, but I just want to determine correct
CF item status...

So much work has happened on this thread with things that have been
committed, so switching the entry to committed makes sense to me. I
have just done that.

Bharath, could you create a new thread with the new things you are
proposing? All that should be v18 work, particularly v27-0002:
/messages/by-id/CALj2ACWCibnX2jcnRreBHFesFeQ6vbKiFstML=w-JVTvUKD_EA@mail.gmail.com
--
Michael

#94Bharath Rupireddy
bharath.rupireddyforpostgres@gmail.com
In reply to: Michael Paquier (#93)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Thu, Apr 11, 2024 at 6:31 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Apr 09, 2024 at 09:33:49AM +0300, Andrey M. Borodin wrote:

As far as I understand CF entry [0] is committed? I understand that
there are some open followups, but I just want to determine correct
CF item status...

So much work has happened on this thread with things that have been
committed, so switching the entry to committed makes sense to me. I
have just done that.

Bharath, could you create a new thread with the new things you are
proposing? All that should be v18 work, particularly v27-0002:
/messages/by-id/CALj2ACWCibnX2jcnRreBHFesFeQ6vbKiFstML=w-JVTvUKD_EA@mail.gmail.com

Thanks. I started a new thread
/messages/by-id/CALj2ACVfF2Uj9NoFy-5m98HNtjHpuD17EDE9twVeJng-jTAe7A@mail.gmail.com.

--
Bharath Rupireddy
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#95Michael Paquier
michael@paquier.xyz
In reply to: Bharath Rupireddy (#94)
Re: Improve WALRead() to suck data directly from WAL buffers when possible

On Wed, Apr 24, 2024 at 09:46:20PM +0530, Bharath Rupireddy wrote:

Thanks. I started a new thread
/messages/by-id/CALj2ACVfF2Uj9NoFy-5m98HNtjHpuD17EDE9twVeJng-jTAe7A@mail.gmail.com.

Cool, thanks.
--
Michael