Autovacuum giving up on tables after crash because of lack of stats

Started by Michael Paquierabout 1 year ago7 messages
#1Michael Paquier
michael@paquier.xyz

Hi all,

While digging again into the support for pgstats flushes across
checkpoints, mainly to avoid autovacuum giving up on tables after a
crash if stats entries cannot be found, I've been reminded about this
point raised by Heikki:
/messages/by-id/54fdfc3e-74f6-4ea4-b844-05aa66b39490@iki.fi

The point is quite simple: depending on the timing of the stats file
flush we may finish with orphaned entries because some objects may be
dropped after a checkpoint, leaving stats entries when these should be
removed. This is an issue dependent on the pgstats file flush at
checkpoint time, and we cannot by design reach this state currently as
stats are flushed only at shutdown, once all the other backends have
exited after the shutdown checkpoint is finished. There is a sanity
check for that in pgstat_write_statsfile():
Assert(!ps->dropped);

Up to v14 and 5891c7a8ed, there was a mechanism in place able to do
garbage collection of stats entries when vacuuming a database, where
OIDs of various object types are collected and messages were sent to
the stats collector to clean up things.

With pgstats now in shared memory, we could do something else if we
want to have some stats for autovacuum post-crash to take some
decisions rather than giving up. Some ideas I can think of, some of
them linked to the timing of the pgstats file flush, but not all:

1) Reimplement the same thing as ~14 in autovacuum workers with a
flush of the pgstats file at each checkpoint, with a loop in all the
stats kinds that would trigger a periodic cleanup of potential
garbage entries, based on a new callback. One option would be a
callback to return a set of (stat_kind,objid) on the database
vacuumed, then loop through these to clean up everything in a last
step once all the objects are known. We could also let that up to
each stats kind, dropping objects they don't know about. This is far
from optimal as there would be a window where we still have orphaned
entries: these would not be marked as dropped, still their OIDs could
get reused later. As a whole it makes the existing pgstats facility
*less* robust compared to what we have now. For tables this could be
made efficient so as we cross check the list of table OIDs with the
contents of the central dshash once we're done with the initial scan
of pg_class, but that's O(N^2) for the pgstats dshash based on the
number of entries. I'm not really comfortable reintroducing that
because it does not scale well with many objects and I know of some
users with a lot of.. Objects. It adds more work to autovacuum and I
don't want any of that because we want autovacuum to be cheaper.

2) The main issue I am trying to tackle is autovacuum giving up on
tables if there are no stats entries, so we could add *some* WAL
logging of the relation stats that are relevant for autovacuum, then
replay them. I think that the correct approach here is to introduce
one new RMGR for pgstats, giving to each stats kind the possibility
to call a routine able to do WAL logging of *some* of its data (custom
structure size, custom data size), and insert records associated to
their stats kind. We are going to need a new optional callback
defined by a stats kind to be able to force some actions at replay, so
as stats kinds can decide what to do with the data in the record.
Generation of WAL records has to happen pgstat_report_stat() through
the flush callback of each stats kind when the stats stored locally
are synced with shared memory. There is a different reason for that:
stats are flushed when backends shut down, and we are still able to
generate some WAL at this stage. An advantage of this design is to be
able to decide which portions of which stats kind is worth
replicating, and we can take a step-by-step approach we what data and
how much data we want to replay (for example for tables we should not
care about replicating the number scans). Another benefit of this
design is for custom stats kind: these can call the pgstats RMGR to
pass down some data and define their own callback to use at replay.
If we do that, flushing the stats file at each checkpoint is not
actually mandatory: the most critical stats could be in WAL.

3) Do nothing, and accept that that we could finish with orphaned
entries as a possible scenario? This still requires more thinking
about how to deal with orphaned entries as these would not be marked
as dropped, still their OIDs could be reused after a wraparound. This
makes the current assumptions in the pgstats machinery more brittle.

Among all these ideas, 2) is by far the most relevant approach to me,
because even if we do not flush pgstats at checkpoint, we can still
keep around relevant stats when performing crash recovery, while
copying around some stats on standbys. It should be possible for a
given stats kind to do a different action depending on if we're in
standby mode or just in crash recovery. And that would take care of
this autovacuum problem post-crash: we could have stats to help in the
decision of if a table should be vacuum or not. Note that the
implementation can be done in multiple steps, like:
- Adding the whole WAL pgstats facility and some tests related to it
(WAL logging with injection points for variable and fixed-numbered
stats in a custom stats kind).
- Deal about the autovacuum and relation stats part.
- Open the door for more replication of stats data, whatever that may
be.

Comments, thoughts or tomatoes?
--
Michael

#2Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#1)
Re: Autovacuum giving up on tables after crash because of lack of stats

Hi,

On Wed, Dec 25, 2024 at 10:10:44AM +0900, Michael Paquier wrote:

2) The main issue I am trying to tackle is autovacuum giving up on
tables if there are no stats entries, so we could add *some* WAL
logging of the relation stats that are relevant for autovacuum, then
replay them. I think that the correct approach here is to introduce
one new RMGR for pgstats, giving to each stats kind the possibility
to call a routine able to do WAL logging of *some* of its data (custom
structure size, custom data size), and insert records associated to
their stats kind. We are going to need a new optional callback
defined by a stats kind to be able to force some actions at replay, so
as stats kinds can decide what to do with the data in the record.
Generation of WAL records has to happen pgstat_report_stat() through
the flush callback of each stats kind when the stats stored locally
are synced with shared memory. There is a different reason for that:
stats are flushed when backends shut down, and we are still able to
generate some WAL at this stage. An advantage of this design is to be
able to decide which portions of which stats kind is worth
replicating, and we can take a step-by-step approach we what data and
how much data we want to replay (for example for tables we should not
care about replicating the number scans).

I think that's a good idea. As you said that would give the ability to discard
some stats from the replication and replicate some of them (n_dead_tup,...).

Another benefit of this
design is for custom stats kind: these can call the pgstats RMGR to
pass down some data and define their own callback to use at replay.
If we do that, flushing the stats file at each checkpoint is not
actually mandatory: the most critical stats could be in WAL.

Among all these ideas, 2) is by far the most relevant approach to me,

+1

because even if we do not flush pgstats at checkpoint, we can still
keep around relevant stats when performing crash recovery, while
copying around some stats on standbys. It should be possible for a
given stats kind to do a different action depending on if we're in
standby mode or just in crash recovery. And that would take care of
this autovacuum problem post-crash: we could have stats to help in the
decision of if a table should be vacuum or not. Note that the
implementation can be done in multiple steps, like:
- Adding the whole WAL pgstats facility and some tests related to it
(WAL logging with injection points for variable and fixed-numbered
stats in a custom stats kind).
- Deal about the autovacuum and relation stats part.
- Open the door for more replication of stats data, whatever that may
be.

Comments, thoughts or tomatoes?

I think that replicating stats that are used by autovacuum would be an additional
benefit, so +1 for the idea number 2). This is an "issue" that has been raised
multiple times (like in [1]/messages/by-id/20240607033806.6gwgolihss72cj6r@awork3.anarazel.de).

[1]: /messages/by-id/20240607033806.6gwgolihss72cj6r@awork3.anarazel.de

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#3Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#2)
3 attachment(s)
Re: Autovacuum giving up on tables after crash because of lack of stats

On Mon, Dec 30, 2024 at 09:44:45AM +0000, Bertrand Drouvot wrote:

I think that replicating stats that are used by autovacuum would be an additional
benefit, so +1 for the idea number 2). This is an "issue" that has been raised
multiple times (like in [1]).

[1]: /messages/by-id/20240607033806.6gwgolihss72cj6r@awork3.anarazel.de

Yeah, I recall reading this one. A colleague has poked me about doing
something here a few weeks ago, as well, and only flushing the stats
at redo was not enough while being full of holes.

Attached is a rebase, following b757abefc041, for the moment.

And Happy New Year to all! In advance.
--
Michael

Attachments:

v2-0001-Move-information-about-pgstats-kinds-into-its-own.patchtext/x-diff; charset=us-asciiDownload
From 55a3c446c44f32204a070ca33c6ac01a16a2da3d Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 27 Dec 2024 11:54:45 +0900
Subject: [PATCH v2 1/3] Move information about pgstats kinds into its own
 header

This information is split into its own header as it will be used by a
follow-up patch, to be used in some frontend and backend code.
---
 src/include/pgstat.h            | 57 +-------------------------
 src/include/utils/pgstat_kind.h | 72 +++++++++++++++++++++++++++++++++
 2 files changed, 73 insertions(+), 56 deletions(-)
 create mode 100644 src/include/utils/pgstat_kind.h

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 75a41e8ff3..601b8075d5 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -18,6 +18,7 @@
 #include "replication/conflict.h"
 #include "utils/backend_progress.h" /* for backward compatibility */
 #include "utils/backend_status.h"	/* for backward compatibility */
+#include "utils/pgstat_kind.h"
 #include "utils/relcache.h"
 #include "utils/wait_event.h"	/* for backward compatibility */
 
@@ -33,62 +34,6 @@
 /* Default directory to store temporary statistics data in */
 #define PG_STAT_TMP_DIR		"pg_stat_tmp"
 
-/* The types of statistics entries */
-#define PgStat_Kind uint32
-
-/* Range of IDs allowed, for built-in and custom kinds */
-#define PGSTAT_KIND_MIN	1		/* Minimum ID allowed */
-#define PGSTAT_KIND_MAX	256		/* Maximum ID allowed */
-
-/* use 0 for INVALID, to catch zero-initialized data */
-#define PGSTAT_KIND_INVALID 0
-
-/* stats for variable-numbered objects */
-#define PGSTAT_KIND_DATABASE	1	/* database-wide statistics */
-#define PGSTAT_KIND_RELATION	2	/* per-table statistics */
-#define PGSTAT_KIND_FUNCTION	3	/* per-function statistics */
-#define PGSTAT_KIND_REPLSLOT	4	/* per-slot statistics */
-#define PGSTAT_KIND_SUBSCRIPTION	5	/* per-subscription statistics */
-#define PGSTAT_KIND_BACKEND	6	/* per-backend statistics */
-
-/* stats for fixed-numbered objects */
-#define PGSTAT_KIND_ARCHIVER	7
-#define PGSTAT_KIND_BGWRITER	8
-#define PGSTAT_KIND_CHECKPOINTER	9
-#define PGSTAT_KIND_IO	10
-#define PGSTAT_KIND_SLRU	11
-#define PGSTAT_KIND_WAL	12
-
-#define PGSTAT_KIND_BUILTIN_MIN PGSTAT_KIND_DATABASE
-#define PGSTAT_KIND_BUILTIN_MAX PGSTAT_KIND_WAL
-#define PGSTAT_KIND_BUILTIN_SIZE (PGSTAT_KIND_BUILTIN_MAX + 1)
-
-/* Custom stats kinds */
-
-/* Range of IDs allowed for custom stats kinds */
-#define PGSTAT_KIND_CUSTOM_MIN	128
-#define PGSTAT_KIND_CUSTOM_MAX	PGSTAT_KIND_MAX
-#define PGSTAT_KIND_CUSTOM_SIZE	(PGSTAT_KIND_CUSTOM_MAX - PGSTAT_KIND_CUSTOM_MIN + 1)
-
-/*
- * PgStat_Kind to use for extensions that require an ID, but are still in
- * development and have not reserved their own unique kind ID yet. See:
- * https://wiki.postgresql.org/wiki/CustomCumulativeStats
- */
-#define PGSTAT_KIND_EXPERIMENTAL	128
-
-static inline bool
-pgstat_is_kind_builtin(PgStat_Kind kind)
-{
-	return kind >= PGSTAT_KIND_BUILTIN_MIN && kind <= PGSTAT_KIND_BUILTIN_MAX;
-}
-
-static inline bool
-pgstat_is_kind_custom(PgStat_Kind kind)
-{
-	return kind >= PGSTAT_KIND_CUSTOM_MIN && kind <= PGSTAT_KIND_CUSTOM_MAX;
-}
-
 /* Values for track_functions GUC variable --- order is significant! */
 typedef enum TrackFunctionsLevel
 {
diff --git a/src/include/utils/pgstat_kind.h b/src/include/utils/pgstat_kind.h
new file mode 100644
index 0000000000..526fa04e8f
--- /dev/null
+++ b/src/include/utils/pgstat_kind.h
@@ -0,0 +1,72 @@
+/* ----------
+ *	pgstat_kind.h
+ *
+ *	Definitions related to the statistics kinds for the PostgreSQL
+ *	cumulative statistics system.  Can be included in backend or
+ *	frontend code.
+ *
+ *	Copyright (c) 2001-2024, PostgreSQL Global Development Group
+ *
+ *	src/include/utils/pgstat_kind.h
+ * ----------
+ */
+#ifndef PGSTAT_KIND_H
+#define PGSTAT_KIND_H
+
+/* The types of statistics entries */
+#define PgStat_Kind uint32
+
+/* Range of IDs allowed, for built-in and custom kinds */
+#define PGSTAT_KIND_MIN	1		/* Minimum ID allowed */
+#define PGSTAT_KIND_MAX	256		/* Maximum ID allowed */
+
+/* use 0 for INVALID, to catch zero-initialized data */
+#define PGSTAT_KIND_INVALID 0
+
+/* stats for variable-numbered objects */
+#define PGSTAT_KIND_DATABASE	1	/* database-wide statistics */
+#define PGSTAT_KIND_RELATION	2	/* per-table statistics */
+#define PGSTAT_KIND_FUNCTION	3	/* per-function statistics */
+#define PGSTAT_KIND_REPLSLOT	4	/* per-slot statistics */
+#define PGSTAT_KIND_SUBSCRIPTION	5	/* per-subscription statistics */
+#define PGSTAT_KIND_BACKEND	6	/* per-backend statistics */
+
+/* stats for fixed-numbered objects */
+#define PGSTAT_KIND_ARCHIVER	7
+#define PGSTAT_KIND_BGWRITER	8
+#define PGSTAT_KIND_CHECKPOINTER	9
+#define PGSTAT_KIND_IO	10
+#define PGSTAT_KIND_SLRU	11
+#define PGSTAT_KIND_WAL	12
+
+#define PGSTAT_KIND_BUILTIN_MIN PGSTAT_KIND_DATABASE
+#define PGSTAT_KIND_BUILTIN_MAX PGSTAT_KIND_WAL
+#define PGSTAT_KIND_BUILTIN_SIZE (PGSTAT_KIND_BUILTIN_MAX + 1)
+
+/* Custom stats kinds */
+
+/* Range of IDs allowed for custom stats kinds */
+#define PGSTAT_KIND_CUSTOM_MIN	128
+#define PGSTAT_KIND_CUSTOM_MAX	PGSTAT_KIND_MAX
+#define PGSTAT_KIND_CUSTOM_SIZE	(PGSTAT_KIND_CUSTOM_MAX - PGSTAT_KIND_CUSTOM_MIN + 1)
+
+/*
+ * PgStat_Kind to use for extensions that require an ID, but are still in
+ * development and have not reserved their own unique kind ID yet. See:
+ * https://wiki.postgresql.org/wiki/CustomCumulativeStats
+ */
+#define PGSTAT_KIND_EXPERIMENTAL	128
+
+static inline bool
+pgstat_is_kind_builtin(PgStat_Kind kind)
+{
+	return kind >= PGSTAT_KIND_BUILTIN_MIN && kind <= PGSTAT_KIND_BUILTIN_MAX;
+}
+
+static inline bool
+pgstat_is_kind_custom(PgStat_Kind kind)
+{
+	return kind >= PGSTAT_KIND_CUSTOM_MIN && kind <= PGSTAT_KIND_CUSTOM_MAX;
+}
+
+#endif							/* PGSTAT_KIND_H */
-- 
2.45.2

v2-0002-Add-RMGR-and-WAL-logging-API-for-pgstats.patchtext/x-diff; charset=us-asciiDownload
From 2fd2b76238618e62e37eb011c0fe145162602a9d Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 27 Dec 2024 12:03:34 +0900
Subject: [PATCH v2 2/3] Add RMGR and WAL-logging API for pgstats

This commit adds a new facility in the backend that offers to pgstats
kinds the possibility to WAL-log data that need to be persisted across
restarts:
- A new callback called redo_cb can be defined by a pgstats_kind.
- pgstat_xlog.c includes one API able to register some payload data and
its size.  Stats kinds doing WAL-logging require a definition of
redo_cb.
- pg_waldump is able to show the data logged, as hexadedimal data.

This facility has applications for in-core and custom stats kinds, with
one primary case being the possibility to WAL-log some statistics
related to relations so as these are not lost post-crash.  This is left
as future work.
---
 src/include/access/rmgrlist.h            |  1 +
 src/include/utils/pgstat_internal.h      |  8 +++
 src/include/utils/pgstat_xlog.h          | 41 +++++++++++++
 src/backend/access/rmgrdesc/Makefile     |  1 +
 src/backend/access/rmgrdesc/meson.build  |  1 +
 src/backend/access/rmgrdesc/pgstatdesc.c | 48 +++++++++++++++
 src/backend/access/transam/rmgr.c        |  1 +
 src/backend/utils/activity/Makefile      |  1 +
 src/backend/utils/activity/meson.build   |  1 +
 src/backend/utils/activity/pgstat_xlog.c | 77 ++++++++++++++++++++++++
 src/bin/pg_waldump/.gitignore            |  1 +
 src/bin/pg_waldump/rmgrdesc.c            |  1 +
 src/bin/pg_waldump/t/001_basic.pl        |  3 +-
 src/tools/pgindent/typedefs.list         |  1 +
 14 files changed, 185 insertions(+), 1 deletion(-)
 create mode 100644 src/include/utils/pgstat_xlog.h
 create mode 100644 src/backend/access/rmgrdesc/pgstatdesc.c
 create mode 100644 src/backend/utils/activity/pgstat_xlog.c

diff --git a/src/include/access/rmgrlist.h b/src/include/access/rmgrlist.h
index 78e6b908c6..abf9393bbe 100644
--- a/src/include/access/rmgrlist.h
+++ b/src/include/access/rmgrlist.h
@@ -47,3 +47,4 @@ PG_RMGR(RM_COMMIT_TS_ID, "CommitTs", commit_ts_redo, commit_ts_desc, commit_ts_i
 PG_RMGR(RM_REPLORIGIN_ID, "ReplicationOrigin", replorigin_redo, replorigin_desc, replorigin_identify, NULL, NULL, NULL, NULL)
 PG_RMGR(RM_GENERIC_ID, "Generic", generic_redo, generic_desc, generic_identify, NULL, NULL, generic_mask, NULL)
 PG_RMGR(RM_LOGICALMSG_ID, "LogicalMessage", logicalmsg_redo, logicalmsg_desc, logicalmsg_identify, NULL, NULL, NULL, logicalmsg_decode)
+PG_RMGR(RM_PGSTAT_ID, "PgStat", pgstat_redo, pgstat_desc, pgstat_identify, NULL, NULL, NULL, NULL)
diff --git a/src/include/utils/pgstat_internal.h b/src/include/utils/pgstat_internal.h
index 811ed9b005..47d7f8988e 100644
--- a/src/include/utils/pgstat_internal.h
+++ b/src/include/utils/pgstat_internal.h
@@ -258,6 +258,14 @@ typedef struct PgStat_KindInfo
 	 */
 	void		(*init_backend_cb) (void);
 
+	/*
+	 * Perform custom actions when replaying WAL related to a statistics kind.
+	 *
+	 * "data" is a pointer to the stats information that can be used by the
+	 * stats kind at redo.
+	 */
+	void		(*redo_cb) (void *data, size_t data_size);
+
 	/*
 	 * For variable-numbered stats: flush pending stats. Required if pending
 	 * data is used.  See flush_fixed_cb for fixed-numbered stats.
diff --git a/src/include/utils/pgstat_xlog.h b/src/include/utils/pgstat_xlog.h
new file mode 100644
index 0000000000..25b7a78dd5
--- /dev/null
+++ b/src/include/utils/pgstat_xlog.h
@@ -0,0 +1,41 @@
+/* ----------
+ * pgstat_xlog.h
+ *	  Exports from utils/activity/pgstat_xlog.c
+ *
+ * Copyright (c) 2001-2024, PostgreSQL Global Development Group
+ *
+ * src/include/utils/pgstat_xlog.h
+ * ----------
+ */
+#ifndef PGSTAT_XLOG_H
+#define PGSTAT_XLOG_H
+
+#include "access/xlog.h"
+#include "access/xlogdefs.h"
+#include "access/xlogreader.h"
+#include "utils/pgstat_kind.h"
+
+/*
+ * Generic WAL record for pgstat data.
+ */
+typedef struct xl_pgstat_data
+{
+	PgStat_Kind stats_kind;
+	size_t		data_size;		/* size of the data */
+	/* data payload, worth data_size */
+	char		data[FLEXIBLE_ARRAY_MEMBER];
+} xl_pgstat_data;
+
+#define SizeOfPgStatData    (offsetof(xl_pgstat_data, data))
+
+extern XLogRecPtr pgstat_xlog_data(PgStat_Kind stats_kind,
+								   const void *data,
+								   size_t data_size);
+
+/* RMGR API */
+#define XLOG_PGSTAT_DATA	0x00
+extern void pgstat_redo(XLogReaderState *record);
+extern void pgstat_desc(StringInfo buf, XLogReaderState *record);
+extern const char *pgstat_identify(uint8 info);
+
+#endif							/* PGSTAT_XLOG_H */
diff --git a/src/backend/access/rmgrdesc/Makefile b/src/backend/access/rmgrdesc/Makefile
index cd95eec37f..3979d22946 100644
--- a/src/backend/access/rmgrdesc/Makefile
+++ b/src/backend/access/rmgrdesc/Makefile
@@ -21,6 +21,7 @@ OBJS = \
 	logicalmsgdesc.o \
 	mxactdesc.o \
 	nbtdesc.o \
+	pgstatdesc.o \
 	relmapdesc.o \
 	replorigindesc.o \
 	rmgrdesc_utils.o \
diff --git a/src/backend/access/rmgrdesc/meson.build b/src/backend/access/rmgrdesc/meson.build
index e8b7a65fc7..951dacbd72 100644
--- a/src/backend/access/rmgrdesc/meson.build
+++ b/src/backend/access/rmgrdesc/meson.build
@@ -14,6 +14,7 @@ rmgr_desc_sources = files(
   'logicalmsgdesc.c',
   'mxactdesc.c',
   'nbtdesc.c',
+  'pgstatdesc.c',
   'relmapdesc.c',
   'replorigindesc.c',
   'rmgrdesc_utils.c',
diff --git a/src/backend/access/rmgrdesc/pgstatdesc.c b/src/backend/access/rmgrdesc/pgstatdesc.c
new file mode 100644
index 0000000000..0fc19d1e56
--- /dev/null
+++ b/src/backend/access/rmgrdesc/pgstatdesc.c
@@ -0,0 +1,48 @@
+/*-------------------------------------------------------------------------
+ *
+ * pgstatdesc.c
+ *	  rmgr descriptor routines for utils/activity/pgstat_xlog.c
+ *
+ * Portions Copyright (c) 2001-2024, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ *	  src/backend/access/rmgrdesc/pgstatdesc.c
+ *
+ *-------------------------------------------------------------------------
+ */
+
+#include "postgres.h"
+
+#include "utils/pgstat_xlog.h"
+
+void
+pgstat_desc(StringInfo buf, XLogReaderState *record)
+{
+	char	   *rec = XLogRecGetData(record);
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+
+	if (info == XLOG_PGSTAT_DATA)
+	{
+		xl_pgstat_data *xlrec = (xl_pgstat_data *) rec;
+		char	   *sep = "";
+
+		appendStringInfo(buf, "stats kind \"%u\"; with data (%zu bytes): ",
+						 xlrec->stats_kind, xlrec->data_size);
+		/* Write data payload as a series of hex bytes */
+		for (int cnt = 0; cnt < xlrec->data_size; cnt++)
+		{
+			appendStringInfo(buf, "%s%02X", sep, (unsigned char) xlrec->data[cnt]);
+			sep = " ";
+		}
+	}
+}
+
+const char *
+pgstat_identify(uint8 info)
+{
+	if ((info & ~XLR_INFO_MASK) == XLOG_PGSTAT_DATA)
+		return "PGSTAT_DATA";
+
+	return NULL;
+}
diff --git a/src/backend/access/transam/rmgr.c b/src/backend/access/transam/rmgr.c
index 1b7499726e..945843a78e 100644
--- a/src/backend/access/transam/rmgr.c
+++ b/src/backend/access/transam/rmgr.c
@@ -39,6 +39,7 @@
 #include "replication/message.h"
 #include "replication/origin.h"
 #include "storage/standby.h"
+#include "utils/pgstat_xlog.h"
 #include "utils/relmapper.h"
 /* IWYU pragma: end_keep */
 
diff --git a/src/backend/utils/activity/Makefile b/src/backend/utils/activity/Makefile
index 24b64a2742..186108c60a 100644
--- a/src/backend/utils/activity/Makefile
+++ b/src/backend/utils/activity/Makefile
@@ -33,6 +33,7 @@ OBJS = \
 	pgstat_subscription.o \
 	pgstat_wal.o \
 	pgstat_xact.o \
+	pgstat_xlog.o \
 	wait_event.o \
 	wait_event_funcs.o
 
diff --git a/src/backend/utils/activity/meson.build b/src/backend/utils/activity/meson.build
index 380d3dd70c..a79e2d3731 100644
--- a/src/backend/utils/activity/meson.build
+++ b/src/backend/utils/activity/meson.build
@@ -18,6 +18,7 @@ backend_sources += files(
   'pgstat_subscription.c',
   'pgstat_wal.c',
   'pgstat_xact.c',
+  'pgstat_xlog.c',
 )
 
 # this includes a .c file with contents generated in ../../../include/activity,
diff --git a/src/backend/utils/activity/pgstat_xlog.c b/src/backend/utils/activity/pgstat_xlog.c
new file mode 100644
index 0000000000..397a49a33a
--- /dev/null
+++ b/src/backend/utils/activity/pgstat_xlog.c
@@ -0,0 +1,77 @@
+/* ----------
+ * pgstat_xlog.c
+ *	   WAL replay logic for cumulative statistics.
+ *
+ *
+ * Copyright (c) 2001-2024, PostgreSQL Global Development Group
+ *
+ * IDENTIFICATION
+ *	  src/backend/utils/activity/pgstat_xlog.c
+ * ----------
+ */
+
+#include "postgres.h"
+
+#include <unistd.h>
+
+#include "access/xlog.h"
+#include "access/xloginsert.h"
+#include "utils/pgstat_internal.h"
+#include "utils/pgstat_xlog.h"
+
+/*
+ * Write pgstats record data into WAL.
+ */
+XLogRecPtr
+pgstat_xlog_data(PgStat_Kind stats_kind, const void *data,
+				 size_t data_size)
+{
+	xl_pgstat_data xlrec;
+	XLogRecPtr	lsn;
+
+	xlrec.stats_kind = stats_kind;
+	xlrec.data_size = data_size;
+
+	XLogBeginInsert();
+	XLogRegisterData((char *) &xlrec, SizeOfPgStatData);
+	XLogRegisterData(data, data_size);
+
+	lsn = XLogInsert(RM_PGSTAT_ID, XLOG_PGSTAT_DATA);
+	return lsn;
+}
+
+/*
+ * Redo just passes down to the stats kind expected by this record
+ * data included in it.
+ */
+void
+pgstat_redo(XLogReaderState *record)
+{
+	uint8		info = XLogRecGetInfo(record) & ~XLR_INFO_MASK;
+	xl_pgstat_data *xlrec;
+	PgStat_Kind stats_kind;
+	size_t		data_size;
+	void	   *data;
+	const PgStat_KindInfo *kind_info;
+
+	if (info != XLOG_PGSTAT_DATA)
+		elog(PANIC, "pgstat_redo: unknown op code %u", info);
+
+	xlrec = (xl_pgstat_data *) XLogRecGetData(record);
+
+	stats_kind = xlrec->stats_kind;
+	data_size = xlrec->data_size;
+	data = xlrec->data;
+
+	kind_info = pgstat_get_kind_info(stats_kind);
+
+	if (kind_info == NULL)
+		elog(FATAL, "pgstat_redo: invalid stats data found: kind=%u",
+			 stats_kind);
+
+	if (kind_info->redo_cb == NULL)
+		elog(FATAL, "pgstat_redo: no redo callback found: kind=%s",
+			 kind_info->name);
+
+	kind_info->redo_cb(data, data_size);
+}
diff --git a/src/bin/pg_waldump/.gitignore b/src/bin/pg_waldump/.gitignore
index ec51f41c76..e5089b322d 100644
--- a/src/bin/pg_waldump/.gitignore
+++ b/src/bin/pg_waldump/.gitignore
@@ -13,6 +13,7 @@
 /logicalmsgdesc.c
 /mxactdesc.c
 /nbtdesc.c
+/pgstatdesc.c
 /relmapdesc.c
 /replorigindesc.c
 /rmgrdesc_utils.c
diff --git a/src/bin/pg_waldump/rmgrdesc.c b/src/bin/pg_waldump/rmgrdesc.c
index 6b8c17bb4c..05b4f2acd1 100644
--- a/src/bin/pg_waldump/rmgrdesc.c
+++ b/src/bin/pg_waldump/rmgrdesc.c
@@ -30,6 +30,7 @@
 #include "replication/origin.h"
 #include "rmgrdesc.h"
 #include "storage/standbydefs.h"
+#include "utils/pgstat_xlog.h"
 #include "utils/relmapper.h"
 
 #define PG_RMGR(symname,name,redo,desc,identify,startup,cleanup,mask,decode) \
diff --git a/src/bin/pg_waldump/t/001_basic.pl b/src/bin/pg_waldump/t/001_basic.pl
index 578e473139..367ed6ab8c 100644
--- a/src/bin/pg_waldump/t/001_basic.pl
+++ b/src/bin/pg_waldump/t/001_basic.pl
@@ -73,7 +73,8 @@ BRIN
 CommitTs
 ReplicationOrigin
 Generic
-LogicalMessage$/,
+LogicalMessage
+PgStat$/,
 	'rmgr list');
 
 
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index e1c4f913f8..a5b1a2c998 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -4134,6 +4134,7 @@ xl_multixact_create
 xl_multixact_truncate
 xl_overwrite_contrecord
 xl_parameter_change
+xl_pgstat_data
 xl_relmap_update
 xl_replorigin_drop
 xl_replorigin_set
-- 
2.45.2

v2-0003-injection_points-Add-option-and-tests-for-WAL-log.patchtext/x-diff; charset=us-asciiDownload
From 54537bbceab106b7a40aebdc815e6cd1ccd481cc Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 27 Dec 2024 12:05:33 +0900
Subject: [PATCH v2 3/3] injection_points: Add option and tests for WAL-logging

This serves as a template for WAL-logging related to pgstats, based on
the variable-numbered stats the module includes.  Some TAP tests are
added to check the backend facility.
---
 .../injection_points/injection_points.c       | 19 ++++++-
 .../injection_points/injection_stats.c        | 52 +++++++++++++++++++
 .../injection_points/injection_stats.h        |  3 +-
 .../modules/injection_points/t/001_stats.pl   | 22 ++++++++
 src/tools/pgindent/typedefs.list              |  1 +
 5 files changed, 95 insertions(+), 2 deletions(-)

diff --git a/src/test/modules/injection_points/injection_points.c b/src/test/modules/injection_points/injection_points.c
index 6bcde7b34e..a5180e97f5 100644
--- a/src/test/modules/injection_points/injection_points.c
+++ b/src/test/modules/injection_points/injection_points.c
@@ -104,7 +104,7 @@ extern PGDLLEXPORT void injection_wait(const char *name,
 static bool injection_point_local = false;
 
 /*
- * GUC variable
+ * GUC variables
  *
  * This GUC is useful to control if statistics should be enabled or not
  * during a test with injection points, like for example if a test relies
@@ -112,6 +112,12 @@ static bool injection_point_local = false;
  */
 bool		inj_stats_enabled = false;
 
+/*
+ * This GUC controls if statistics of injection points should be logged
+ * into WAL or not.
+ */
+bool		inj_stats_wal_enabled = false;
+
 /* Shared memory init callbacks */
 static shmem_request_hook_type prev_shmem_request_hook = NULL;
 static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
@@ -534,6 +540,17 @@ _PG_init(void)
 							 NULL,
 							 NULL);
 
+	DefineCustomBoolVariable("injection_points.log_stats",
+							 "Enables WAL-logging for injection points statistics.",
+							 NULL,
+							 &inj_stats_wal_enabled,
+							 false,
+							 PGC_POSTMASTER,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	MarkGUCPrefixReserved("injection_points");
 
 	/* Shared memory initialization */
diff --git a/src/test/modules/injection_points/injection_stats.c b/src/test/modules/injection_points/injection_stats.c
index 21d5c10f39..c14655f70e 100644
--- a/src/test/modules/injection_points/injection_stats.c
+++ b/src/test/modules/injection_points/injection_stats.c
@@ -21,6 +21,7 @@
 #include "pgstat.h"
 #include "utils/builtins.h"
 #include "utils/pgstat_internal.h"
+#include "utils/pgstat_xlog.h"
 
 /* Structures for statistics of injection points */
 typedef struct PgStat_StatInjEntry
@@ -34,6 +35,14 @@ typedef struct PgStatShared_InjectionPoint
 	PgStat_StatInjEntry stats;
 } PgStatShared_InjectionPoint;
 
+/* Structure for data of injection points logged in WAL */
+typedef struct PgStat_StatInjRecord
+{
+	uint64		objid;			/* hash of the point name */
+	PgStat_StatInjEntry entry;	/* stats data */
+} PgStat_StatInjRecord;
+
+static void injection_stats_redo_cb(void *data, size_t data_size);
 static bool injection_stats_flush_cb(PgStat_EntryRef *entry_ref, bool nowait);
 
 static const PgStat_KindInfo injection_stats = {
@@ -48,6 +57,7 @@ static const PgStat_KindInfo injection_stats = {
 	.shared_data_off = offsetof(PgStatShared_InjectionPoint, stats),
 	.shared_data_len = sizeof(((PgStatShared_InjectionPoint *) 0)->stats),
 	.pending_size = sizeof(PgStat_StatInjEntry),
+	.redo_cb = injection_stats_redo_cb,
 	.flush_pending_cb = injection_stats_flush_cb,
 };
 
@@ -64,6 +74,31 @@ static const PgStat_KindInfo injection_stats = {
 /* Track if stats are loaded */
 static bool inj_stats_loaded = false;
 
+/*
+ * REDO callback for injection point stats.
+ */
+static void
+injection_stats_redo_cb(void *data, size_t data_size)
+{
+	PgStat_StatInjRecord *record_data = (PgStat_StatInjRecord *) data;
+	PgStat_StatInjEntry record_entry = record_data->entry;
+	PgStat_StatInjEntry *stat_entry;
+	PgStatShared_InjectionPoint *shstatent;
+	PgStat_EntryRef *entry_ref;
+
+	Assert(data_size == sizeof(PgStat_StatInjRecord));
+
+	/* create or fetch existing entry */
+	entry_ref = pgstat_prep_pending_entry(PGSTAT_KIND_INJECTION, InvalidOid,
+										  record_data->objid, NULL);
+
+	shstatent = (PgStatShared_InjectionPoint *) entry_ref->shared_stats;
+	stat_entry = &shstatent->stats;
+
+	/* Update the injection point statistics, overwriting any existing data */
+	*stat_entry = record_entry;
+}
+
 /*
  * Callback for stats handling
  */
@@ -72,6 +107,7 @@ injection_stats_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 {
 	PgStat_StatInjEntry *localent;
 	PgStatShared_InjectionPoint *shfuncent;
+	PgStat_StatInjRecord record_data = {0};
 
 	localent = (PgStat_StatInjEntry *) entry_ref->pending;
 	shfuncent = (PgStatShared_InjectionPoint *) entry_ref->shared_stats;
@@ -81,8 +117,24 @@ injection_stats_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 
 	shfuncent->stats.numcalls += localent->numcalls;
 
+	record_data.objid = entry_ref->shared_entry->key.objid;
+	record_data.entry = shfuncent->stats;
+
 	pgstat_unlock_entry(entry_ref);
 
+	/*
+	 * Store the data in WAL, if not in recovery and if the option is enabled.
+	 */
+	if (!RecoveryInProgress() && inj_stats_wal_enabled)
+	{
+		XLogRecPtr	lsn;
+
+		lsn = pgstat_xlog_data(PGSTAT_KIND_INJECTION, &record_data,
+							   sizeof(PgStat_StatInjRecord));
+
+		/* Force a flush, to ensure persistency */
+		XLogFlush(lsn);
+	}
 	return true;
 }
 
diff --git a/src/test/modules/injection_points/injection_stats.h b/src/test/modules/injection_points/injection_stats.h
index c48d533b4b..2583f3fe9c 100644
--- a/src/test/modules/injection_points/injection_stats.h
+++ b/src/test/modules/injection_points/injection_stats.h
@@ -15,8 +15,9 @@
 #ifndef INJECTION_STATS
 #define INJECTION_STATS
 
-/* GUC variable */
+/* GUC variables */
 extern bool inj_stats_enabled;
+extern bool inj_stats_wal_enabled;
 
 /* injection_stats.c */
 extern void pgstat_register_inj(void);
diff --git a/src/test/modules/injection_points/t/001_stats.pl b/src/test/modules/injection_points/t/001_stats.pl
index 36728f16fc..395c2e7cdb 100644
--- a/src/test/modules/injection_points/t/001_stats.pl
+++ b/src/test/modules/injection_points/t/001_stats.pl
@@ -75,4 +75,26 @@ $node->stop;
 $node->adjust_conf('postgresql.conf', 'shared_preload_libraries', "''");
 $node->start;
 
+# Stop the server and enable WAL, the stats are preserved after recovery.
+$node->stop;
+$node->append_conf(
+	'postgresql.conf', qq(
+shared_preload_libraries = 'injection_points'
+injection_points.log_stats = true
+));
+$node->start;
+
+# Two calls, both WAL-logged.
+$node->safe_psql('postgres',
+	"SELECT injection_points_attach('stats-wal-notice', 'notice');");
+$node->safe_psql('postgres',
+	"SELECT injection_points_run('stats-wal-notice');");
+$node->safe_psql('postgres',
+	"SELECT injection_points_run('stats-wal-notice');");
+$node->stop('immediate');
+$node->start;
+$numcalls = $node->safe_psql('postgres',
+	"SELECT injection_points_stats_numcalls('stats-wal-notice');");
+is($numcalls, '2', 'number of stats after crash with WAL-logging enabled');
+
 done_testing();
diff --git a/src/tools/pgindent/typedefs.list b/src/tools/pgindent/typedefs.list
index a5b1a2c998..47e3e07f5e 100644
--- a/src/tools/pgindent/typedefs.list
+++ b/src/tools/pgindent/typedefs.list
@@ -2166,6 +2166,7 @@ PgStat_StatDBEntry
 PgStat_StatFuncEntry
 PgStat_StatInjEntry
 PgStat_StatInjFixedEntry
+PgStat_StatInjRecord
 PgStat_StatReplSlotEntry
 PgStat_StatSubEntry
 PgStat_StatTabEntry
-- 
2.45.2

#4Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#3)
Re: Autovacuum giving up on tables after crash because of lack of stats

Hi,

On Mon, Dec 30, 2024 at 07:12:54PM +0900, Michael Paquier wrote:

On Mon, Dec 30, 2024 at 09:44:45AM +0000, Bertrand Drouvot wrote:

I think that replicating stats that are used by autovacuum would be an additional
benefit, so +1 for the idea number 2). This is an "issue" that has been raised
multiple times (like in [1]).

[1]: /messages/by-id/20240607033806.6gwgolihss72cj6r@awork3.anarazel.de

Yeah, I recall reading this one. A colleague has poked me about doing
something here a few weeks ago, as well, and only flushing the stats
at redo was not enough while being full of holes.

Attached is a rebase, following b757abefc041, for the moment.

Thanks, will look at it!

And Happy New Year to all! In advance.

Happy New Year to you too!

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#5Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#1)
Re: Autovacuum giving up on tables after crash because of lack of stats

Hi,

On 2024-12-25 10:10:44 +0900, Michael Paquier wrote:

While digging again into the support for pgstats flushes across
checkpoints, mainly to avoid autovacuum giving up on tables after a
crash if stats entries cannot be found, I've been reminded about this
point raised by Heikki:
/messages/by-id/54fdfc3e-74f6-4ea4-b844-05aa66b39490@iki.fi

The point is quite simple: depending on the timing of the stats file
flush we may finish with orphaned entries because some objects may be
dropped after a checkpoint, leaving stats entries when these should be
removed. This is an issue dependent on the pgstats file flush at
checkpoint time, and we cannot by design reach this state currently as
stats are flushed only at shutdown, once all the other backends have
exited after the shutdown checkpoint is finished. There is a sanity
check for that in pgstat_write_statsfile():
Assert(!ps->dropped);

I don't think this has to be an issue. The right thing imo would be to make a
durable copy of the stats at the time of the *redo* pointer
determination. That way the WAL records for dropped stats objects are going to
be replayed, leaving us with no orphaned entries by the time the checkpoint
has been replayed.

Doesn't this solve the issue?

Up to v14 and 5891c7a8ed, there was a mechanism in place able to do
garbage collection of stats entries when vacuuming a database, where
OIDs of various object types are collected and messages were sent to
the stats collector to clean up things.

With pgstats now in shared memory, we could do something else if we
want to have some stats for autovacuum post-crash to take some
decisions rather than giving up. Some ideas I can think of, some of
them linked to the timing of the pgstats file flush, but not all:

1) Reimplement the same thing as ~14 in autovacuum workers with a
flush of the pgstats file at each checkpoint, with a loop in all the
stats kinds that would trigger a periodic cleanup of potential
garbage entries, based on a new callback.

I am *vehemently* opposed to this approach.

2) The main issue I am trying to tackle is autovacuum giving up on
tables if there are no stats entries, so we could add *some* WAL
logging of the relation stats that are relevant for autovacuum, then
replay them. I think that the correct approach here is to introduce
one new RMGR for pgstats, giving to each stats kind the possibility
to call a routine able to do WAL logging of *some* of its data (custom
structure size, custom data size), and insert records associated to
their stats kind.

As mentioned in the other thread, I don't think we need this - nearly all the
information is already and in the WAL for the actual action that the stats
track. We should re-emit the stats changes during recovery.

Greetings,

Andres Freund

#6Andres Freund
andres@anarazel.de
In reply to: Michael Paquier (#3)
Re: Autovacuum giving up on tables after crash because of lack of stats

On 2024-12-30 19:12:54 +0900, Michael Paquier wrote:

On Mon, Dec 30, 2024 at 09:44:45AM +0000, Bertrand Drouvot wrote:

I think that replicating stats that are used by autovacuum would be an additional
benefit, so +1 for the idea number 2). This is an "issue" that has been raised
multiple times (like in [1]).

[1]: /messages/by-id/20240607033806.6gwgolihss72cj6r@awork3.anarazel.de

Yeah, I recall reading this one. A colleague has poked me about doing
something here a few weeks ago, as well, and only flushing the stats
at redo was not enough while being full of holes.

Good then that I was not at all proposing to only flush the stats at redo?

Greetings,

Andres Freund

#7Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Andres Freund (#6)
Re: Autovacuum giving up on tables after crash because of lack of stats

On Fri, Jan 03, 2025 at 09:47:58AM -0500, Andres Freund wrote:

On 2024-12-30 19:12:54 +0900, Michael Paquier wrote:

On Mon, Dec 30, 2024 at 09:44:45AM +0000, Bertrand Drouvot wrote:

I think that replicating stats that are used by autovacuum would be an additional
benefit, so +1 for the idea number 2). This is an "issue" that has been raised
multiple times (like in [1]).

[1]: /messages/by-id/20240607033806.6gwgolihss72cj6r@awork3.anarazel.de

Yeah, I recall reading this one. A colleague has poked me about doing
something here a few weeks ago, as well, and only flushing the stats
at redo was not enough while being full of holes.

Good then that I was not at all proposing to only flush the stats at redo?

Yeah, reading your comment in [1]/messages/by-id/xvetwjsnkhx2gp6np225g2h64f4mfmg6oopkuaiivrpzd2futj@pflk55su36ho I also think that re-assembling the stats during
WAL replay based on the relfilenode activity is the way to go.

[1]: /messages/by-id/xvetwjsnkhx2gp6np225g2h64f4mfmg6oopkuaiivrpzd2futj@pflk55su36ho

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com