Call rm_redo in a temporary memory context

Started by Heikki Linnakangasover 1 year ago3 messages
#1Heikki Linnakangas
hlinnaka@iki.fi
3 attachment(s)

Many resource managers set up a temporary memory context which is reset
after replaying the record. It seems a bit silly for each rmgr to do
that on their own, so I propose that we do it in a centralized fashion.
The attached patch creates one new temporary context and switches to it
for each rm_redo() call.

I was afraid of the overhead of the MemoryContextReset between each WAL
record since this is a very hot codepath, but it doesn't seem to be
noticeable. I used the attached scripts to benchmark it.
redobench-setup.sh sets up a base backup and about 5 GB of WAL. The WAL
consists of just tiny logical decoding messages, no real page
modifications. The idea is that replaying that WAL should make any
per-record overhead stand out as much as possible, since there's no real
work to do. Use redobench.sh to perform the replay. I am not seeing any
measurable difference this patch, so I think we're good. But if we
needed to optimize, we could e.g. have an inlined fastpath version of
MemoryContextReset for the common case that the context is empty, or
only reset it every 100 records or something.

This leaves no built-in rmgrs with any rm_startup or rm_clenaup hooks.
Extensions might still use them, and they seem like they might be
useful, so I kept them.

There was no natural place to document this, so I added a brief
explanation of rm_redo in the RmgrData comment, and then tacked the
information about the memory context there too. I also added a note in
"Custom WAL Resource Managers" section of the docs to point out that
this changed in v18.

(Why am I doing this now? I was browsing through all the global
variables for the multithreading work, and these "opCtx"s caught my eye.
This is in no way critical for multithreading though.)

--
Heikki Linnakangas
Neon (https://neon.tech)

Attachments:

v1-0001-Run-WAL-redo-functions-in-a-temporary-memory-cont.patchtext/x-patch; charset=UTF-8; name=v1-0001-Run-WAL-redo-functions-in-a-temporary-memory-cont.patchDownload
From 9bfc0af26dcfc0c0061b98e170bc6fa02acaf709 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakangas@iki.fi>
Date: Wed, 7 Aug 2024 14:03:08 +0300
Subject: [PATCH v1 1/1] Run WAL redo functions in a temporary memory context

Some resource managers already used such a temporary memory context on
their own. Let's consolidate that and run all WAL redo functions in a
temporary context that is reset between each record.

This leaves the rm_startup and rm_cleanup hooks unused in core code,
but I kept them because they might still be used by extensions, and it
might come handy for builtin rmgrs again in the future too.
---
 doc/src/sgml/custom-rmgr.sgml             | 17 +++++++++++++++++
 src/backend/access/gin/ginxlog.c          | 21 ---------------------
 src/backend/access/gist/gistxlog.c        | 19 -------------------
 src/backend/access/nbtree/nbtxlog.c       | 21 ---------------------
 src/backend/access/spgist/spgxlog.c       | 23 -----------------------
 src/backend/access/transam/xlogrecovery.c | 12 ++++++++++++
 src/include/access/ginxlog.h              |  2 --
 src/include/access/gistxlog.h             |  2 --
 src/include/access/nbtxlog.h              |  2 --
 src/include/access/rmgrlist.h             |  8 ++++----
 src/include/access/spgxlog.h              |  2 --
 src/include/access/xlog_internal.h        |  5 +++++
 12 files changed, 38 insertions(+), 96 deletions(-)

diff --git a/doc/src/sgml/custom-rmgr.sgml b/doc/src/sgml/custom-rmgr.sgml
index 3032b2dc0d..bfa15e33d6 100644
--- a/doc/src/sgml/custom-rmgr.sgml
+++ b/doc/src/sgml/custom-rmgr.sgml
@@ -29,6 +29,11 @@
  * This struct must be kept in sync with the PG_RMGR definition in
  * rmgr.c.
  *
+ * rm_redo is the core function that replays one WAL record.  During WAL
+ * recovery, it is called once for every WAL record belonging to the resource
+ * manager.  It's called in a temporary memory context that is reset between
+ * every record.
+ *
  * rm_identify must return a name for the record based on xl_info (without
  * reference to the rmid). For example, XLOG_BTREE_VACUUM would be named
  * "VACUUM". rm_desc can then be called to obtain additional detail for the
@@ -55,6 +60,18 @@ typedef struct RmgrData
 </programlisting>
  </para>
 
+ <note>
+   <para>
+     Prior to <productname>PostgreSQL</productname> version 18.0,
+     the <varname>rm_redo</varname> function was called in a long-lived memory
+     context, and if you needed a short-lived context you had to create one
+     yourself e.g. in <varname>rm_startup</varname>. Starting with version
+     18.0, it's called in a temporary memory context instead, and if you need
+     to retain data over calls, you need to explicitly allocate them in
+     <literal>TopMemoryContext</literal>.
+   </para>
+ </note>
+
   <para>
    The <filename>src/test/modules/test_custom_rmgrs</filename> module
    contains a working example, which demonstrates usage of custom WAL
diff --git a/src/backend/access/gin/ginxlog.c b/src/backend/access/gin/ginxlog.c
index 07ba0b559e..59afd5e4b4 100644
--- a/src/backend/access/gin/ginxlog.c
+++ b/src/backend/access/gin/ginxlog.c
@@ -19,8 +19,6 @@
 #include "access/xlogutils.h"
 #include "utils/memutils.h"
 
-static MemoryContext opCtx;		/* working memory for operations */
-
 static void
 ginRedoClearIncompleteSplit(XLogReaderState *record, uint8 block_id)
 {
@@ -726,7 +724,6 @@ void
 gin_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
-	MemoryContext oldCtx;
 
 	/*
 	 * GIN indexes do not require any conflict processing. NB: If we ever
@@ -734,7 +731,6 @@ gin_redo(XLogReaderState *record)
 	 * killed tuples outside VACUUM, we'll need to handle that here.
 	 */
 
-	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_GIN_CREATE_PTREE:
@@ -767,23 +763,6 @@ gin_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "gin_redo: unknown op code %u", info);
 	}
-	MemoryContextSwitchTo(oldCtx);
-	MemoryContextReset(opCtx);
-}
-
-void
-gin_xlog_startup(void)
-{
-	opCtx = AllocSetContextCreate(CurrentMemoryContext,
-								  "GIN recovery temporary context",
-								  ALLOCSET_DEFAULT_SIZES);
-}
-
-void
-gin_xlog_cleanup(void)
-{
-	MemoryContextDelete(opCtx);
-	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/gist/gistxlog.c b/src/backend/access/gist/gistxlog.c
index 451e8d8d98..3127cffc03 100644
--- a/src/backend/access/gist/gistxlog.c
+++ b/src/backend/access/gist/gistxlog.c
@@ -23,8 +23,6 @@
 #include "utils/memutils.h"
 #include "utils/rel.h"
 
-static MemoryContext opCtx;		/* working memory for operations */
-
 /*
  * Replay the clearing of F_FOLLOW_RIGHT flag on a child page.
  *
@@ -397,7 +395,6 @@ void
 gist_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
-	MemoryContext oldCxt;
 
 	/*
 	 * GiST indexes do not require any conflict processing. NB: If we ever
@@ -405,7 +402,6 @@ gist_redo(XLogReaderState *record)
 	 * tuples outside VACUUM, we'll need to handle that here.
 	 */
 
-	oldCxt = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_GIST_PAGE_UPDATE:
@@ -429,21 +425,6 @@ gist_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "gist_redo: unknown op code %u", info);
 	}
-
-	MemoryContextSwitchTo(oldCxt);
-	MemoryContextReset(opCtx);
-}
-
-void
-gist_xlog_startup(void)
-{
-	opCtx = createTempGistContext();
-}
-
-void
-gist_xlog_cleanup(void)
-{
-	MemoryContextDelete(opCtx);
 }
 
 /*
diff --git a/src/backend/access/nbtree/nbtxlog.c b/src/backend/access/nbtree/nbtxlog.c
index b5b0e22447..0073e8bcdb 100644
--- a/src/backend/access/nbtree/nbtxlog.c
+++ b/src/backend/access/nbtree/nbtxlog.c
@@ -22,8 +22,6 @@
 #include "storage/standby.h"
 #include "utils/memutils.h"
 
-static MemoryContext opCtx;		/* working memory for operations */
-
 /*
  * _bt_restore_page -- re-enter all the index tuples on a page
  *
@@ -1014,9 +1012,7 @@ void
 btree_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
-	MemoryContext oldCtx;
 
-	oldCtx = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_BTREE_INSERT_LEAF:
@@ -1065,23 +1061,6 @@ btree_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "btree_redo: unknown op code %u", info);
 	}
-	MemoryContextSwitchTo(oldCtx);
-	MemoryContextReset(opCtx);
-}
-
-void
-btree_xlog_startup(void)
-{
-	opCtx = AllocSetContextCreate(CurrentMemoryContext,
-								  "Btree recovery temporary context",
-								  ALLOCSET_DEFAULT_SIZES);
-}
-
-void
-btree_xlog_cleanup(void)
-{
-	MemoryContextDelete(opCtx);
-	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/spgist/spgxlog.c b/src/backend/access/spgist/spgxlog.c
index 4e31d33e70..b422049920 100644
--- a/src/backend/access/spgist/spgxlog.c
+++ b/src/backend/access/spgist/spgxlog.c
@@ -22,9 +22,6 @@
 #include "utils/memutils.h"
 
 
-static MemoryContext opCtx;		/* working memory for operations */
-
-
 /*
  * Prepare a dummy SpGistState, with just the minimum info needed for replay.
  *
@@ -935,9 +932,7 @@ void
 spg_redo(XLogReaderState *record)
 {
 	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
-	MemoryContext oldCxt;
 
-	oldCxt = MemoryContextSwitchTo(opCtx);
 	switch (info)
 	{
 		case XLOG_SPGIST_ADD_LEAF:
@@ -967,24 +962,6 @@ spg_redo(XLogReaderState *record)
 		default:
 			elog(PANIC, "spg_redo: unknown op code %u", info);
 	}
-
-	MemoryContextSwitchTo(oldCxt);
-	MemoryContextReset(opCtx);
-}
-
-void
-spg_xlog_startup(void)
-{
-	opCtx = AllocSetContextCreate(CurrentMemoryContext,
-								  "SP-GiST temporary context",
-								  ALLOCSET_DEFAULT_SIZES);
-}
-
-void
-spg_xlog_cleanup(void)
-{
-	MemoryContextDelete(opCtx);
-	opCtx = NULL;
 }
 
 /*
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index ad817fbca6..b884f84974 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -61,6 +61,7 @@
 #include "utils/datetime.h"
 #include "utils/fmgrprotos.h"
 #include "utils/guc_hooks.h"
+#include "utils/memutils.h"
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
@@ -1728,6 +1729,8 @@ PerformWalRecovery(void)
 
 	if (record != NULL)
 	{
+		MemoryContext oldCtx;
+		MemoryContext redoCtx;
 		TimestampTz xtime;
 		PGRUsage	ru0;
 
@@ -1745,6 +1748,10 @@ PerformWalRecovery(void)
 		if (!StandbyMode)
 			begin_startup_progress_phase();
 
+		redoCtx = AllocSetContextCreate(CurrentMemoryContext,
+										"recovery temporary context",
+										ALLOCSET_DEFAULT_SIZES);
+
 		/*
 		 * main redo apply loop
 		 */
@@ -1820,7 +1827,10 @@ PerformWalRecovery(void)
 			/*
 			 * Apply the record
 			 */
+			oldCtx = MemoryContextSwitchTo(redoCtx);
 			ApplyWalRecord(xlogreader, record, &replayTLI);
+			MemoryContextSwitchTo(oldCtx);
+			MemoryContextReset(redoCtx);
 
 			/* Exit loop if we reached inclusive recovery target */
 			if (recoveryStopsAfter(xlogreader))
@@ -1847,6 +1857,8 @@ PerformWalRecovery(void)
 		 * end of main redo apply loop
 		 */
 
+		MemoryContextSwitchTo(oldCtx);
+
 		if (reachedRecoveryTarget)
 		{
 			if (!reachedConsistency)
diff --git a/src/include/access/ginxlog.h b/src/include/access/ginxlog.h
index 0c2ddbd82a..3e5185fbdc 100644
--- a/src/include/access/ginxlog.h
+++ b/src/include/access/ginxlog.h
@@ -209,8 +209,6 @@ typedef struct ginxlogDeleteListPages
 extern void gin_redo(XLogReaderState *record);
 extern void gin_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gin_identify(uint8 info);
-extern void gin_xlog_startup(void);
-extern void gin_xlog_cleanup(void);
 extern void gin_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* GINXLOG_H */
diff --git a/src/include/access/gistxlog.h b/src/include/access/gistxlog.h
index a0bdc5d7d7..809512acd9 100644
--- a/src/include/access/gistxlog.h
+++ b/src/include/access/gistxlog.h
@@ -110,8 +110,6 @@ typedef struct gistxlogPageReuse
 extern void gist_redo(XLogReaderState *record);
 extern void gist_desc(StringInfo buf, XLogReaderState *record);
 extern const char *gist_identify(uint8 info);
-extern void gist_xlog_startup(void);
-extern void gist_xlog_cleanup(void);
 extern void gist_mask(char *pagedata, BlockNumber blkno);
 
 #endif
diff --git a/src/include/access/nbtxlog.h b/src/include/access/nbtxlog.h
index e42ac1611c..3922687858 100644
--- a/src/include/access/nbtxlog.h
+++ b/src/include/access/nbtxlog.h
@@ -354,8 +354,6 @@ typedef struct xl_btree_newroot
  * prototypes for functions in nbtxlog.c
  */
 extern void btree_redo(XLogReaderState *record);
-extern void btree_xlog_startup(void);
-extern void btree_xlog_cleanup(void);
 extern void btree_mask(char *pagedata, BlockNumber blkno);
 
 /*
diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 78e6b908c6..8bcc82cb54 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -36,12 +36,12 @@ PG_RMGR(RM_RELMAP_ID, "RelMap", relmap_redo, relmap_desc, relmap_identify, NULL,
 PG_RMGR(RM_STANDBY_ID, "Standby", standby_redo, standby_desc, standby_identify, NULL, NULL, NULL, standby_decode)
 PG_RMGR(RM_HEAP2_ID, "Heap2", heap2_redo, heap2_desc, heap2_identify, NULL, NULL, heap_mask, heap2_decode)
 PG_RMGR(RM_HEAP_ID, "Heap", heap_redo, heap_desc, heap_identify, NULL, NULL, heap_mask, heap_decode)
-PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, btree_xlog_startup, btree_xlog_cleanup, btree_mask, NULL)
+PG_RMGR(RM_BTREE_ID, "Btree", btree_redo, btree_desc, btree_identify, NULL, NULL, btree_mask, NULL)
 PG_RMGR(RM_HASH_ID, "Hash", hash_redo, hash_desc, hash_identify, NULL, NULL, hash_mask, NULL)
-PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, gin_xlog_startup, gin_xlog_cleanup, gin_mask, NULL)
-PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, gist_xlog_startup, gist_xlog_cleanup, gist_mask, NULL)
+PG_RMGR(RM_GIN_ID, "Gin", gin_redo, gin_desc, gin_identify, NULL, NULL, gin_mask, NULL)
+PG_RMGR(RM_GIST_ID, "Gist", gist_redo, gist_desc, gist_identify, NULL, NULL, gist_mask, NULL)
 PG_RMGR(RM_SEQ_ID, "Sequence", seq_redo, seq_desc, seq_identify, NULL, NULL, seq_mask, NULL)
-PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, spg_xlog_startup, spg_xlog_cleanup, spg_mask, NULL)
+PG_RMGR(RM_SPGIST_ID, "SPGist", spg_redo, spg_desc, spg_identify, NULL, NULL, spg_mask, NULL)
 PG_RMGR(RM_BRIN_ID, "BRIN", brin_redo, brin_desc, brin_identify, NULL, NULL, brin_mask, NULL)
 PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_identify, NULL, NULL, NULL, NULL)
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL)
diff --git a/src/include/access/spgxlog.h b/src/include/access/spgxlog.h
index 16cc735001..fcd768a1c1 100644
--- a/src/include/access/spgxlog.h
+++ b/src/include/access/spgxlog.h
@@ -252,8 +252,6 @@ typedef struct spgxlogVacuumRedirect
 extern void spg_redo(XLogReaderState *record);
 extern void spg_desc(StringInfo buf, XLogReaderState *record);
 extern const char *spg_identify(uint8 info);
-extern void spg_xlog_startup(void);
-extern void spg_xlog_cleanup(void);
 extern void spg_mask(char *pagedata, BlockNumber blkno);
 
 #endif							/* SPGXLOG_H */
diff --git a/src/include/access/xlog_internal.h b/src/include/access/xlog_internal.h
index c6a91fb456..4df224bb16 100644
--- a/src/include/access/xlog_internal.h
+++ b/src/include/access/xlog_internal.h
@@ -335,6 +335,11 @@ struct XLogRecordBuffer;
  * This struct must be kept in sync with the PG_RMGR definition in
  * rmgr.c.
  *
+ * rm_redo is the core function that replays one WAL record.  During WAL
+ * recovery, it is called once for every WAL record belonging to the resource
+ * manager.  It's called in a temporary memory context that is reset between
+ * every record.
+ *
  * rm_identify must return a name for the record based on xl_info (without
  * reference to the rmid). For example, XLOG_BTREE_VACUUM would be named
  * "VACUUM". rm_desc can then be called to obtain additional detail for the
-- 
2.39.2

redobench.shapplication/x-shellscript; name=redobench.shDownload
redobench-setup.shapplication/x-shellscript; name=redobench-setup.shDownload
#2Kirill Reshke
reshkekirill@gmail.com
In reply to: Heikki Linnakangas (#1)
Re: Call rm_redo in a temporary memory context

Hi!

On Wed, 7 Aug 2024 at 16:24, Heikki Linnakangas <hlinnaka@iki.fi> wrote:

Many resource managers set up a temporary memory context which is reset
after replaying the record. It seems a bit silly for each rmgr to do
that on their own, so I propose that we do it in a centralized fashion.
The attached patch creates one new temporary context and switches to it
for each rm_redo() call.

I was afraid of the overhead of the MemoryContextReset between each WAL
record since this is a very hot codepath, but it doesn't seem to be
noticeable. I used the attached scripts to benchmark it.
redobench-setup.sh sets up a base backup and about 5 GB of WAL. The WAL
consists of just tiny logical decoding messages, no real page
modifications. The idea is that replaying that WAL should make any
per-record overhead stand out as much as possible, since there's no real
work to do. Use redobench.sh to perform the replay. I am not seeing any
measurable difference this patch, so I think we're good. But if we
needed to optimize, we could e.g. have an inlined fastpath version of
MemoryContextReset for the common case that the context is empty, or
only reset it every 100 records or something.

This leaves no built-in rmgrs with any rm_startup or rm_clenaup hooks.
Extensions might still use them, and they seem like they might be
useful, so I kept them.

There was no natural place to document this, so I added a brief
explanation of rm_redo in the RmgrData comment, and then tacked the
information about the memory context there too. I also added a note in
"Custom WAL Resource Managers" section of the docs to point out that
this changed in v18.

(Why am I doing this now? I was browsing through all the global
variables for the multithreading work, and these "opCtx"s caught my eye.
This is in no way critical for multithreading though.)

--
Heikki Linnakangas
Neon (https://neon.tech)

+1 on the idea, since this simplifies RMGR API for extension developers.

Compiler warns about `src/backend/access/transam/xlogrecovery.c:1860`,
where we switch to maybe-uninitialized memory context. Lets assign
this to something.

--
Best regards,
Kirill Reshke

#3Michael Paquier
michael@paquier.xyz
In reply to: Kirill Reshke (#2)
Re: Call rm_redo in a temporary memory context

On Tue, Oct 01, 2024 at 03:29:19PM +0500, Kirill Reshke wrote:

Compiler warns about `src/backend/access/transam/xlogrecovery.c:1860`,
where we switch to maybe-uninitialized memory context. Lets assign
this to something.

+     yourself e.g. in <varname>rm_startup</varname>. Starting with version
+     18.0, it's called in a temporary memory context instead, and if you need
+     to retain data over calls, you need to explicitly allocate them in
+     <literal>TopMemoryContext</literal>.

Hmm. Is it a good idea to encourage that? This would be allocated in
the memory context where RmgrStartup() is called. For HEAD, it does
not matter because this is called only at the beginning of recovery in
the startup process. However, could it be a problem for out-of-core
code that wants to do a full SGMR startup and may want to rely on
something else than TopMemoryContext because the current code makes
memory context cleanup easier in the event if an error?
--
Michael