Flush pgstats file during checkpoints

Started by Michael Paquierover 1 year ago21 messages
#1Michael Paquier
michael@paquier.xyz
4 attachment(s)

Hi all,

On HEAD, xlog.c has the following comment, which has been on my own
TODO list for a couple of weeks now:

* TODO: With a bit of extra work we could just start with a pgstat file
* associated with the checkpoint redo location we're starting from.

Please find a patch series to implement that, giving the possibility
to keep statistics after a crash rather than discard them. I have
been looking at the code for a while, before settling down on:
- Forcing the flush of the pgstats file to happen during non-shutdown
checkpoint and restart points, after updating the control file's redo
LSN and the critical sections in the area.
- Leaving the current before_shmem_exit() callback around, as a matter
of delaying the flush of the stats for as long as possible in a
shutdown sequence. This also makes the single-user mode shutdown
simpler.
- Adding the redo LSN to the pgstats file, with a bump of
PGSTAT_FILE_FORMAT_ID, cross-checked with the redo LSN. This change
is independently useful on its own when loading stats after a clean
startup, as well.
- The crash recovery case is simplified, as there is no more need for
the "discard" code path.
- Not using a logic where I would need to stick a LSN into the stats
file name, implying that we would need a potential lookup at the
contents of pg_stat/ to clean up past files at crash recovery. These
should not be costly, but I'd rather not add more of these.

7ff23c6d277d, that has removed the last call of CreateCheckPoint()
from the startup process, is older than 5891c7a8ed8f, still it seems
to me that pgstats relies on some areas of the code that don't make
sense on HEAD (see locking mentioned at the top of the write routine
for example). The logic gets straight-forward to think about as
restart points and checkpoints always run from the checkpointer,
implying that pgstat_write_statsfile() is already called only from the
postmaster in single-user mode or the checkpointer itself, at
shutdown.

Attached is a patch set, with the one being the actual feature, with
some stuff prior to that:
- 0001 adds the redo LSN to the pgstats file flushed.
- 0002 adds an assertion in pgstat_write_statsfile(), to check from
where it is called.
- 0003 with more debugging.
- 0004 is the meat of the thread.

I am adding that to the next CF. Thoughts and comments are welcome.
Thanks,
--
Michael

Attachments:

0001-Add-redo-LSN-to-pgstats-file.patchtext/x-diff; charset=us-asciiDownload
From a39ffddc1c525fa4e39fd08657c6dae95a20043f Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 18 Jun 2024 11:00:36 +0900
Subject: [PATCH 1/4] Add redo LSN to pgstats file

This is used in the startup process to check that the file we are
loading is the one referring to the latest checkpoint.

Bump PGSTAT_FILE_FORMAT_ID.
---
 src/include/pgstat.h                |  5 +++--
 src/backend/access/transam/xlog.c   |  2 +-
 src/backend/utils/activity/pgstat.c | 26 +++++++++++++++++++-------
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 2136239710..dbc3430b18 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
@@ -235,7 +236,7 @@ typedef struct PgStat_TableXactStatus
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BCAC
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BCAD
 
 typedef struct PgStat_ArchiverStats
 {
@@ -466,7 +467,7 @@ extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
 /* Functions called during server startup / shutdown */
-extern void pgstat_restore_stats(void);
+extern void pgstat_restore_stats(XLogRecPtr redo);
 extern void pgstat_discard_stats(void);
 extern void pgstat_before_server_shutdown(int code, Datum arg);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 330e058c5f..96b458da42 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5641,7 +5641,7 @@ StartupXLOG(void)
 	if (didCrash)
 		pgstat_discard_stats();
 	else
-		pgstat_restore_stats();
+		pgstat_restore_stats(checkPoint.redo);
 
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index dcc2ad8d95..824f742bde 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -94,6 +94,7 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "access/xlog.h"
 #include "lib/dshash.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -162,8 +163,8 @@ typedef struct PgStat_SnapshotEntry
  * ----------
  */
 
-static void pgstat_write_statsfile(void);
-static void pgstat_read_statsfile(void);
+static void pgstat_write_statsfile(XLogRecPtr redo);
+static void pgstat_read_statsfile(XLogRecPtr redo);
 
 static void pgstat_reset_after_failure(void);
 
@@ -404,9 +405,9 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
  * Should only be called by the startup process or in single user mode.
  */
 void
-pgstat_restore_stats(void)
+pgstat_restore_stats(XLogRecPtr redo)
 {
-	pgstat_read_statsfile();
+	pgstat_read_statsfile(redo);
 }
 
 /*
@@ -482,7 +483,7 @@ pgstat_before_server_shutdown(int code, Datum arg)
 	if (code == 0)
 	{
 		pgStatLocal.shmem->is_shutdown = true;
-		pgstat_write_statsfile();
+		pgstat_write_statsfile(GetRedoRecPtr());
 	}
 }
 
@@ -1305,7 +1306,7 @@ write_chunk(FILE *fpout, void *ptr, size_t len)
  * stats so locking is not required.
  */
 static void
-pgstat_write_statsfile(void)
+pgstat_write_statsfile(XLogRecPtr redo)
 {
 	FILE	   *fpout;
 	int32		format_id;
@@ -1340,6 +1341,9 @@ pgstat_write_statsfile(void)
 	format_id = PGSTAT_FILE_FORMAT_ID;
 	write_chunk_s(fpout, &format_id);
 
+	/* Write the redo LSN, used to cross check the file loaded */
+	write_chunk_s(fpout, &redo);
+
 	/*
 	 * XXX: The following could now be generalized to just iterate over
 	 * pgstat_kind_infos instead of knowing about the different kinds of
@@ -1480,13 +1484,14 @@ read_chunk(FILE *fpin, void *ptr, size_t len)
  * stats so locking is not required.
  */
 static void
-pgstat_read_statsfile(void)
+pgstat_read_statsfile(XLogRecPtr redo)
 {
 	FILE	   *fpin;
 	int32		format_id;
 	bool		found;
 	const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 	PgStat_ShmemControl *shmem = pgStatLocal.shmem;
+	XLogRecPtr	file_redo;
 
 	/* shouldn't be called from postmaster */
 	Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
@@ -1520,6 +1525,13 @@ pgstat_read_statsfile(void)
 		format_id != PGSTAT_FILE_FORMAT_ID)
 		goto error;
 
+	/*
+	 * Read the redo LSN stored in the file.
+	 */
+	if (!read_chunk_s(fpin, &file_redo) ||
+		file_redo != redo)
+		goto error;
+
 	/*
 	 * XXX: The following could now be generalized to just iterate over
 	 * pgstat_kind_infos instead of knowing about the different kinds of
-- 
2.45.1

0002-Add-assertion-in-pgstat_write_statsfile.patchtext/x-diff; charset=us-asciiDownload
From 9b252f56d1cad3e12001829bbb61c51cc742e9e0 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 18 Jun 2024 10:59:31 +0900
Subject: [PATCH 2/4] Add assertion in pgstat_write_statsfile()

This routine can currently only be called from the postmaster in
single-user mode or the checkpointer, so make sure that this is always
the case.
---
 src/backend/utils/activity/pgstat.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 824f742bde..9cdd986582 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -1317,6 +1317,9 @@ pgstat_write_statsfile(XLogRecPtr redo)
 
 	pgstat_assert_is_up();
 
+	/* should be called only by the checkpointer or single user mode */
+	Assert(!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
+
 	/* we're shutting down, so it's ok to just override this */
 	pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
 
-- 
2.45.1

0003-Add-some-DEBUG2-information-about-the-redo-LSN-of-th.patchtext/x-diff; charset=us-asciiDownload
From ed7fe49cf147b09def2b78554a989d592e02fe07 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 18 Jun 2024 11:10:19 +0900
Subject: [PATCH 3/4] Add some DEBUG2 information about the redo LSN of the
 stats file

This is useful for.. Debugging.  How surprising.
---
 src/backend/utils/activity/pgstat.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 9cdd986582..855dec9e52 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -1323,7 +1323,8 @@ pgstat_write_statsfile(XLogRecPtr redo)
 	/* we're shutting down, so it's ok to just override this */
 	pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
 
-	elog(DEBUG2, "writing stats file \"%s\"", statfile);
+	elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile,
+		 LSN_FORMAT_ARGS(redo));
 
 	/*
 	 * Open the statistics temp file to write out the current values.
@@ -1499,7 +1500,8 @@ pgstat_read_statsfile(XLogRecPtr redo)
 	/* shouldn't be called from postmaster */
 	Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
 
-	elog(DEBUG2, "reading stats file \"%s\"", statfile);
+	elog(DEBUG2, "reading stats file \"%s\" with redo %X/%X", statfile,
+		LSN_FORMAT_ARGS(redo));
 
 	/*
 	 * Try to open the stats file. If it doesn't exist, the backends simply
-- 
2.45.1

0004-Flush-pgstats-file-during-checkpoints.patchtext/x-diff; charset=us-asciiDownload
From d206073d3cc6a0a5ed9228cdd3089c24d88f51f6 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 18 Jun 2024 14:49:33 +0900
Subject: [PATCH 4/4] Flush pgstats file during checkpoints

The flushes are done for non-shutdown checkpoints and restart points, so
as it is possible to keep some statistics around in the event of a
crash.  Keeping the before_shmem_exit() callback to flush the pgstats
file in a shutdown sequence is a bit easier than doing so at checkpoint,
as they may be stats to flush before we are completely done with the
shutdown checkpoint.  Keeping the callback also keeps the shutdown of
single-user mode simpler, where the stats also need to be flushed.

At the beginning of recovery, the stats file is loaded when the redo LSN
it stores matches with the point we are replaying from.
---
 src/include/pgstat.h                     |  4 +-
 src/backend/access/transam/xlog.c        | 27 ++++++++---
 src/backend/utils/activity/pgstat.c      | 62 ++++++------------------
 src/test/recovery/t/029_stats_restart.pl | 22 +++++++--
 4 files changed, 54 insertions(+), 61 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index dbc3430b18..9860bb6a99 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -466,9 +466,9 @@ typedef struct PgStat_PendingWalStats
 extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
-/* Functions called during server startup / shutdown */
+/* Functions called during server startup / checkpoint / shutdown */
 extern void pgstat_restore_stats(XLogRecPtr redo);
-extern void pgstat_discard_stats(void);
+extern void pgstat_flush_stats(XLogRecPtr redo);
 extern void pgstat_before_server_shutdown(int code, Datum arg);
 
 /* Functions for backend initialization */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 96b458da42..7ce304c227 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5390,7 +5390,6 @@ StartupXLOG(void)
 	XLogCtlInsert *Insert;
 	CheckPoint	checkPoint;
 	bool		wasShutdown;
-	bool		didCrash;
 	bool		haveTblspcMap;
 	bool		haveBackupLabel;
 	XLogRecPtr	EndOfLog;
@@ -5510,10 +5509,7 @@ StartupXLOG(void)
 	{
 		RemoveTempXlogFiles();
 		SyncDataDirectory();
-		didCrash = true;
 	}
-	else
-		didCrash = false;
 
 	/*
 	 * Prepare for WAL recovery if needed.
@@ -5638,10 +5634,7 @@ StartupXLOG(void)
 	 * TODO: With a bit of extra work we could just start with a pgstat file
 	 * associated with the checkpoint redo location we're starting from.
 	 */
-	if (didCrash)
-		pgstat_discard_stats();
-	else
-		pgstat_restore_stats(checkPoint.redo);
+	pgstat_restore_stats(checkPoint.redo);
 
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
@@ -7203,6 +7196,15 @@ CreateCheckPoint(int flags)
 	 */
 	END_CRIT_SECTION();
 
+	/*
+	 * The control file update is done, hence it is time to write some fresh
+	 * statistics.  Stats are flushed at shutdown by the checkpointer in a
+	 * dedicated before_shmem_exit callback, combined with sanity checks
+	 * related to the shutdown of pgstats.
+	 */
+	if (!shutdown)
+		pgstat_flush_stats(RedoRecPtr);
+
 	/*
 	 * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
 	 * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
@@ -7671,6 +7673,15 @@ CreateRestartPoint(int flags)
 	}
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * The control file update is done, hence it is time to write some fresh
+	 * statistics.  Stats are flushed at shutdown by the checkpointer in a
+	 * dedicated before_shmem_exit callback, combined with sanity checks
+	 * related to the shutdown of pgstats.
+	 */
+	if ((flags & CHECKPOINT_IS_SHUTDOWN) == 0)
+		pgstat_flush_stats(RedoRecPtr);
+
 	/*
 	 * Update the average distance between checkpoints/restartpoints if the
 	 * prior checkpoint exists.
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 855dec9e52..a0d891aefc 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -10,9 +10,10 @@
  * statistics.
  *
  * Statistics are loaded from the filesystem during startup (by the startup
- * process), unless preceded by a crash, in which case all stats are
- * discarded. They are written out by the checkpointer process just before
- * shutting down, except when shutting down in immediate mode.
+ * process) if the stats file has a redo LSN that matches with the . They
+ * are written out by the checkpointer during checkpoints and restart points,
+ * as well as before shutting down, except when shutting down in immediate
+ * mode.
  *
  * Fixed-numbered stats are stored in plain (non-dynamic) shared memory.
  *
@@ -411,53 +412,19 @@ pgstat_restore_stats(XLogRecPtr redo)
 }
 
 /*
- * Remove the stats file.  This is currently used only if WAL recovery is
- * needed after a crash.
+ * Write stats in memory to disk.
  *
- * Should only be called by the startup process or in single user mode.
+ * This is called by the checkpointer or in single-user mode.
  */
 void
-pgstat_discard_stats(void)
+pgstat_flush_stats(XLogRecPtr redo)
 {
-	int			ret;
-
-	/* NB: this needs to be done even in single user mode */
-
-	ret = unlink(PGSTAT_STAT_PERMANENT_FILENAME);
-	if (ret != 0)
-	{
-		if (errno == ENOENT)
-			elog(DEBUG2,
-				 "didn't need to unlink permanent stats file \"%s\" - didn't exist",
-				 PGSTAT_STAT_PERMANENT_FILENAME);
-		else
-			ereport(LOG,
-					(errcode_for_file_access(),
-					 errmsg("could not unlink permanent statistics file \"%s\": %m",
-							PGSTAT_STAT_PERMANENT_FILENAME)));
-	}
-	else
-	{
-		ereport(DEBUG2,
-				(errcode_for_file_access(),
-				 errmsg_internal("unlinked permanent statistics file \"%s\"",
-								 PGSTAT_STAT_PERMANENT_FILENAME)));
-	}
-
-	/*
-	 * Reset stats contents. This will set reset timestamps of fixed-numbered
-	 * stats to the current time (no variable stats exist).
-	 */
-	pgstat_reset_after_failure();
+	pgstat_write_statsfile(redo);
 }
 
 /*
  * pgstat_before_server_shutdown() needs to be called by exactly one process
- * during regular server shutdowns. Otherwise all stats will be lost.
- *
- * We currently only write out stats for proc_exit(0). We might want to change
- * that at some point... But right now pgstat_discard_stats() would be called
- * during the start after a disorderly shutdown, anyway.
+ * during regular server shutdowns.
  */
 void
 pgstat_before_server_shutdown(int code, Datum arg)
@@ -482,6 +449,9 @@ pgstat_before_server_shutdown(int code, Datum arg)
 	 */
 	if (code == 0)
 	{
+		/* we're shutting down, so it's ok to just override this */
+		pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
+
 		pgStatLocal.shmem->is_shutdown = true;
 		pgstat_write_statsfile(GetRedoRecPtr());
 	}
@@ -1302,8 +1272,8 @@ write_chunk(FILE *fpout, void *ptr, size_t len)
 #define write_chunk_s(fpout, ptr) write_chunk(fpout, ptr, sizeof(*ptr))
 
 /*
- * This function is called in the last process that is accessing the shared
- * stats so locking is not required.
+ * This function is called in the checkpointer or in single-user mode,
+ * so locking is not required.
  */
 static void
 pgstat_write_statsfile(XLogRecPtr redo)
@@ -1320,9 +1290,6 @@ pgstat_write_statsfile(XLogRecPtr redo)
 	/* should be called only by the checkpointer or single user mode */
 	Assert(!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
 
-	/* we're shutting down, so it's ok to just override this */
-	pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
-
 	elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile,
 		 LSN_FORMAT_ARGS(redo));
 
@@ -1402,7 +1369,6 @@ pgstat_write_statsfile(XLogRecPtr redo)
 		CHECK_FOR_INTERRUPTS();
 
 		/* we may have some "dropped" entries not yet removed, skip them */
-		Assert(!ps->dropped);
 		if (ps->dropped)
 			continue;
 
diff --git a/src/test/recovery/t/029_stats_restart.pl b/src/test/recovery/t/029_stats_restart.pl
index 93a7209f69..9956a6ab82 100644
--- a/src/test/recovery/t/029_stats_restart.pl
+++ b/src/test/recovery/t/029_stats_restart.pl
@@ -59,7 +59,7 @@ ok(-f "$og_stats", "origin stats file must exist");
 copy($og_stats, $statsfile) or die "Copy failed: $!";
 
 
-## test discarding of stats file after crash etc
+## test retention of stats file after crash etc
 
 $node->start;
 
@@ -69,12 +69,28 @@ is(have_stats('function', $dboid, $funcoid),
 	't', "$sect: function stats do exist");
 is(have_stats('relation', $dboid, $tableoid),
 	't', "$sect: relation stats do exist");
+# Flush the stats to disk.
+$node->psql('postgres', 'checkpoint');
 
 $node->stop('immediate');
 
-ok(!-f "$og_stats", "no stats file should exist after immediate shutdown");
+ok(-f "$og_stats", "stats file should exist after immediate shutdown");
 
-# copy the old stats back to test we discard stats after crash restart
+# Start once, there should be stats restored from the previous checkpoint.
+$node->start;
+
+$sect = "crashrecovery";
+is(have_stats('database', $dboid, 0), 't', "$sect: db stats do exist");
+is(have_stats('function', $dboid, $funcoid),
+	't', "$sect: function stats do exist");
+is(have_stats('relation', $dboid, $tableoid),
+	't', "$sect: relation stats do exist");
+
+$node->stop('immediate');
+
+# Copy the old stats back, and test that we discard stats after crash restart
+# because these don't match with the redo LSN stored in the stats file
+# generated by the previous checkpoint.
 copy($statsfile, $og_stats) or die "Copy failed: $!";
 
 $node->start;
-- 
2.45.1

#2Konstantin Knizhnik
knizhnik@garret.ru
In reply to: Michael Paquier (#1)
Re: Flush pgstats file during checkpoints

On 18/06/2024 9:01 am, Michael Paquier wrote:

Hi all,

On HEAD, xlog.c has the following comment, which has been on my own
TODO list for a couple of weeks now:

* TODO: With a bit of extra work we could just start with a pgstat file
* associated with the checkpoint redo location we're starting from.

Please find a patch series to implement that, giving the possibility
to keep statistics after a crash rather than discard them. I have
been looking at the code for a while, before settling down on:
- Forcing the flush of the pgstats file to happen during non-shutdown
checkpoint and restart points, after updating the control file's redo
LSN and the critical sections in the area.
- Leaving the current before_shmem_exit() callback around, as a matter
of delaying the flush of the stats for as long as possible in a
shutdown sequence. This also makes the single-user mode shutdown
simpler.
- Adding the redo LSN to the pgstats file, with a bump of
PGSTAT_FILE_FORMAT_ID, cross-checked with the redo LSN. This change
is independently useful on its own when loading stats after a clean
startup, as well.
- The crash recovery case is simplified, as there is no more need for
the "discard" code path.
- Not using a logic where I would need to stick a LSN into the stats
file name, implying that we would need a potential lookup at the
contents of pg_stat/ to clean up past files at crash recovery. These
should not be costly, but I'd rather not add more of these.

7ff23c6d277d, that has removed the last call of CreateCheckPoint()
from the startup process, is older than 5891c7a8ed8f, still it seems
to me that pgstats relies on some areas of the code that don't make
sense on HEAD (see locking mentioned at the top of the write routine
for example). The logic gets straight-forward to think about as
restart points and checkpoints always run from the checkpointer,
implying that pgstat_write_statsfile() is already called only from the
postmaster in single-user mode or the checkpointer itself, at
shutdown.

Attached is a patch set, with the one being the actual feature, with
some stuff prior to that:
- 0001 adds the redo LSN to the pgstats file flushed.
- 0002 adds an assertion in pgstat_write_statsfile(), to check from
where it is called.
- 0003 with more debugging.
- 0004 is the meat of the thread.

I am adding that to the next CF. Thoughts and comments are welcome.
Thanks,
--
Michael

Hi Michael.

I am working mostly on the same problem - persisting pgstat state in
Neon (because of separation of storage and compute it has no local files).
I have two questions concerning this PR and the whole strategy for
saving pgstat state between sessions.

1. Right now pgstat.stat is discarded after abnormal Postgres
termination. And in your PR we are storing LSN in pgstat.staty to check
that it matches checkpoint redo LSN. I wonder if having outdated version
of pgstat.stat  is worse than not having it at all? Comment in xlog.c
says: "When starting with crash recovery, reset pgstat data - it might
not be valid." But why it may be invalid? We are writing it first to
temp file and then rename. May be fsync() should be added here and
durable_rename() should be used instead of rename(). But it seems to be
better than loosing statistics. And should not add some significant
overhead (because it happens only at shutdown). In your case we are
checking LSN of pgstat.stat file. But once again - why it is better to
discard file than load version from previous checkpoint?

2. Second question is also related with 1). So we saved pgstat.stat on
checkpoint, then did some updates and then Postgres crashed. As far as I
understand with your patch we will successfully restore pgstats.stat
file. But it is not actually up-to-date: it doesn't reflect information
about recent updates. If it was so critical to keep pgstat.stat
up-to-date, then why do we allow to restore state on most recent checkpoint?

Thanks,
Konstantin

#3Tomas Vondra
tomas.vondra@enterprisedb.com
In reply to: Konstantin Knizhnik (#2)
Re: Flush pgstats file during checkpoints

On 6/28/24 09:32, Konstantin Knizhnik wrote:

On 18/06/2024 9:01 am, Michael Paquier wrote:

Hi all,

On HEAD, xlog.c has the following comment, which has been on my own
TODO list for a couple of weeks now:

      * TODO: With a bit of extra work we could just start with a
pgstat file
      * associated with the checkpoint redo location we're starting from.

Please find a patch series to implement that, giving the possibility
to keep statistics after a crash rather than discard them.  I have
been looking at the code for a while, before settling down on:
- Forcing the flush of the pgstats file to happen during non-shutdown
checkpoint and restart points, after updating the control file's redo
LSN and the critical sections in the area.
- Leaving the current before_shmem_exit() callback around, as a matter
of delaying the flush of the stats for as long as possible in a
shutdown sequence.  This also makes the single-user mode shutdown
simpler.
- Adding the redo LSN to the pgstats file, with a bump of
PGSTAT_FILE_FORMAT_ID, cross-checked with the redo LSN.  This change
is independently useful on its own when loading stats after a clean
startup, as well.
- The crash recovery case is simplified, as there is no more need for
the "discard" code path.
- Not using a logic where I would need to stick a LSN into the stats
file name, implying that we would need a potential lookup at the
contents of pg_stat/ to clean up past files at crash recovery.  These
should not be costly, but I'd rather not add more of these.

7ff23c6d277d, that has removed the last call of CreateCheckPoint()
from the startup process, is older than 5891c7a8ed8f, still it seems
to me that pgstats relies on some areas of the code that don't make
sense on HEAD (see locking mentioned at the top of the write routine
for example).  The logic gets straight-forward to think about as
restart points and checkpoints always run from the checkpointer,
implying that pgstat_write_statsfile() is already called only from the
postmaster in single-user mode or the checkpointer itself, at
shutdown.

Attached is a patch set, with the one being the actual feature, with
some stuff prior to that:
- 0001 adds the redo LSN to the pgstats file flushed.
- 0002 adds an assertion in pgstat_write_statsfile(), to check from
where it is called.
- 0003 with more debugging.
- 0004 is the meat of the thread.

I am adding that to the next CF.  Thoughts and comments are welcome.
Thanks,
--
Michael

Hi Michael.

I am working mostly on the same problem - persisting pgstat state in
Neon (because of separation of storage and compute it has no local files).
I have two questions concerning this PR and the whole strategy for
saving pgstat state between sessions.

1. Right now pgstat.stat is discarded after abnormal Postgres
termination. And in your PR we are storing LSN in pgstat.staty to check
that it matches checkpoint redo LSN. I wonder if having outdated version
of pgstat.stat  is worse than not having it at all? Comment in xlog.c
says: "When starting with crash recovery, reset pgstat data - it might
not be valid." But why it may be invalid? We are writing it first to
temp file and then rename. May be fsync() should be added here and
durable_rename() should be used instead of rename(). But it seems to be
better than loosing statistics. And should not add some significant
overhead (because it happens only at shutdown). In your case we are
checking LSN of pgstat.stat file. But once again - why it is better to
discard file than load version from previous checkpoint?

I think those are two independent issues - knowing that the snapshot is
from the last checkpoint, and knowing that it's correct (not corrupted).
And yeah, we should be careful about fsync/durable_rename.

2. Second question is also related with 1). So we saved pgstat.stat on
checkpoint, then did some updates and then Postgres crashed. As far as I
understand with your patch we will successfully restore pgstats.stat
file. But it is not actually up-to-date: it doesn't reflect information
about recent updates. If it was so critical to keep pgstat.stat
up-to-date, then why do we allow to restore state on most recent
checkpoint?

Yeah, I was wondering about the same thing - can't this mean we fail to
start autovacuum? Let's say we delete a significant fraction of a huge
table, but then we crash before the next checkpoint. With this patch we
restore the last stats snapshot, which can easily have

n_dead_tup | 0
n_mod_since_analyze | 0

for the table. And thus we fail to notice the table needs autovacuum.
AFAIK we run autovacuum on all tables with missing stats (or am I
wrong?). That's what's happening on replicas after switchover/failover
too, right?

It'd not be such an issue if we updated stats during recovery, but I
think think we're doing that. Perhaps we should, which might also help
on replicas - no idea if it's feasible, though.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#4Michael Paquier
michael@paquier.xyz
In reply to: Tomas Vondra (#3)
Re: Flush pgstats file during checkpoints

On Sat, Jun 29, 2024 at 11:13:04PM +0200, Tomas Vondra wrote:

I think those are two independent issues - knowing that the snapshot is
from the last checkpoint, and knowing that it's correct (not corrupted).
And yeah, we should be careful about fsync/durable_rename.

Yeah, that's bugging me as well. I don't really get why we would not
want durability at shutdown for this data. So how about switching the
end of pgstat_write_statsfile() to use durable_rename()? Sounds like
an independent change, worth on its own.

Yeah, I was wondering about the same thing - can't this mean we fail to
start autovacuum? Let's say we delete a significant fraction of a huge
table, but then we crash before the next checkpoint. With this patch we
restore the last stats snapshot, which can easily have

n_dead_tup | 0
n_mod_since_analyze | 0

for the table. And thus we fail to notice the table needs autovacuum.
AFAIK we run autovacuum on all tables with missing stats (or am I
wrong?). That's what's happening on replicas after switchover/failover
too, right?

That's the opposite, please look at relation_needs_vacanalyze(). If a
table does not have any stats found in pgstats, like on HEAD after a
crash, then autoanalyze is skipped and autovacuum happens only for the
anti-wrap case.

For the database case, rebuild_database_list() uses
pgstat_fetch_stat_dbentry() three times, discards entirely databases
that have no stats. Again, there should be a stats entry post a
crash upon a reconnection.

So there's an argument in saying that the situation does not get worse
here and that we actually may improve odds by keeping a trace of the
previous stats, no? In most cases, there would be a stats entry
post-crash as an INSERT or a scan would have re-created it, but the
stats would reflect the state registered since the last crash recovery
(even on HEAD, a bulk delete bumping the dead tuple counter would not
be detected post-crash). The case of spiky workloads may impact the
decision-making, of course, but at least we'd allow autovacuum to take
some decision over giving up entirely based on some previous state of
the stats. That sounds like a win for me with steady workloads, less
for bulby workloads..

It'd not be such an issue if we updated stats during recovery, but I
think think we're doing that. Perhaps we should, which might also help
on replicas - no idea if it's feasible, though.

Stats on replicas are considered an independent thing AFAIU (scans are
counted for example in read-only queries). If we were to do that we
may want to split stats handling between nodes in standby state and
crash recovery. Not sure if that's worth the complication. First,
the stats exist at node-level.
--
Michael

#5Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#4)
4 attachment(s)
Re: Flush pgstats file during checkpoints

On Fri, Jul 05, 2024 at 01:52:31PM +0900, Michael Paquier wrote:

On Sat, Jun 29, 2024 at 11:13:04PM +0200, Tomas Vondra wrote:

I think those are two independent issues - knowing that the snapshot is
from the last checkpoint, and knowing that it's correct (not corrupted).
And yeah, we should be careful about fsync/durable_rename.

Yeah, that's bugging me as well. I don't really get why we would not
want durability at shutdown for this data. So how about switching the
end of pgstat_write_statsfile() to use durable_rename()? Sounds like
an independent change, worth on its own.

Please find attached a rebased patch set with the durability point
addressed in 0001. There were also some conflicts.

Note that I have applied the previous 0002 adding an assert in
pgstat_write_statsfile() as 734c057a8935, as I've managed to break
again this assumption while hacking more on this area..
--
Michael

Attachments:

v2-0001-Make-write-of-pgstats-file-durable.patchtext/x-diff; charset=us-asciiDownload
From b03087f1746233bce0d1fbbb69789795296e779b Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 12 Jul 2024 15:31:27 +0900
Subject: [PATCH v2 1/4] Make write of pgstats file durable

This switches the pgstats code to use durable_rename() rather than
rename(), to make sure that the stats file is not lost should the host
by plugged off after PostgreSQL has been stopped.
---
 src/backend/utils/activity/pgstat.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index ed7baa6e04..66cda798fb 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -1478,12 +1478,9 @@ pgstat_write_statsfile(void)
 						tmpfile)));
 		unlink(tmpfile);
 	}
-	else if (rename(tmpfile, statfile) < 0)
+	else if (durable_rename(tmpfile, statfile, LOG) < 0)
 	{
-		ereport(LOG,
-				(errcode_for_file_access(),
-				 errmsg("could not rename temporary statistics file \"%s\" to \"%s\": %m",
-						tmpfile, statfile)));
+		/* error logged already */
 		unlink(tmpfile);
 	}
 }
-- 
2.45.2

v2-0002-Add-redo-LSN-to-pgstats-file.patchtext/x-diff; charset=us-asciiDownload
From 4340ebd842049b5c706ada324bba36a5c5a78bb9 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 12 Jul 2024 15:22:12 +0900
Subject: [PATCH v2 2/4] Add redo LSN to pgstats file

This is used in the startup process to check that the file we are
loading is the one referring to the latest checkpoint.

Bump PGSTAT_FILE_FORMAT_ID.
---
 src/include/pgstat.h                |  5 +++--
 src/backend/access/transam/xlog.c   |  2 +-
 src/backend/utils/activity/pgstat.c | 26 +++++++++++++++++++-------
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6b99bb8aad..043d39a565 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
@@ -235,7 +236,7 @@ typedef struct PgStat_TableXactStatus
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BCAD
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BCAE
 
 typedef struct PgStat_ArchiverStats
 {
@@ -466,7 +467,7 @@ extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
 /* Functions called during server startup / shutdown */
-extern void pgstat_restore_stats(void);
+extern void pgstat_restore_stats(XLogRecPtr redo);
 extern void pgstat_discard_stats(void);
 extern void pgstat_before_server_shutdown(int code, Datum arg);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 33e27a6e72..f137551e8d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5641,7 +5641,7 @@ StartupXLOG(void)
 	if (didCrash)
 		pgstat_discard_stats();
 	else
-		pgstat_restore_stats();
+		pgstat_restore_stats(checkPoint.redo);
 
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 66cda798fb..2952604a51 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -94,6 +94,7 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "access/xlog.h"
 #include "lib/dshash.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -171,8 +172,8 @@ typedef struct PgStat_SnapshotEntry
  * ----------
  */
 
-static void pgstat_write_statsfile(void);
-static void pgstat_read_statsfile(void);
+static void pgstat_write_statsfile(XLogRecPtr redo);
+static void pgstat_read_statsfile(XLogRecPtr redo);
 
 static void pgstat_reset_after_failure(void);
 
@@ -448,9 +449,9 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
  * Should only be called by the startup process or in single user mode.
  */
 void
-pgstat_restore_stats(void)
+pgstat_restore_stats(XLogRecPtr redo)
 {
-	pgstat_read_statsfile();
+	pgstat_read_statsfile(redo);
 }
 
 /*
@@ -526,7 +527,7 @@ pgstat_before_server_shutdown(int code, Datum arg)
 	if (code == 0)
 	{
 		pgStatLocal.shmem->is_shutdown = true;
-		pgstat_write_statsfile();
+		pgstat_write_statsfile(GetRedoRecPtr());
 	}
 }
 
@@ -1349,7 +1350,7 @@ write_chunk(FILE *fpout, void *ptr, size_t len)
  * stats so locking is not required.
  */
 static void
-pgstat_write_statsfile(void)
+pgstat_write_statsfile(XLogRecPtr redo)
 {
 	FILE	   *fpout;
 	int32		format_id;
@@ -1387,6 +1388,9 @@ pgstat_write_statsfile(void)
 	format_id = PGSTAT_FILE_FORMAT_ID;
 	write_chunk_s(fpout, &format_id);
 
+	/* Write the redo LSN, used to cross check the file loaded */
+	write_chunk_s(fpout, &redo);
+
 	/* Write various stats structs for fixed number of objects */
 	for (int kind = PGSTAT_KIND_FIRST_VALID; kind <= PGSTAT_KIND_LAST; kind++)
 	{
@@ -1501,13 +1505,14 @@ read_chunk(FILE *fpin, void *ptr, size_t len)
  * stats so locking is not required.
  */
 static void
-pgstat_read_statsfile(void)
+pgstat_read_statsfile(XLogRecPtr redo)
 {
 	FILE	   *fpin;
 	int32		format_id;
 	bool		found;
 	const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 	PgStat_ShmemControl *shmem = pgStatLocal.shmem;
+	XLogRecPtr	file_redo;
 
 	/* shouldn't be called from postmaster */
 	Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
@@ -1541,6 +1546,13 @@ pgstat_read_statsfile(void)
 		format_id != PGSTAT_FILE_FORMAT_ID)
 		goto error;
 
+	/*
+	 * Read the redo LSN stored in the file.
+	 */
+	if (!read_chunk_s(fpin, &file_redo) ||
+		file_redo != redo)
+		goto error;
+
 	/*
 	 * We found an existing statistics file. Read it and put all the stats
 	 * data into place.
-- 
2.45.2

v2-0003-Add-some-DEBUG2-information-about-the-redo-LSN-of.patchtext/x-diff; charset=us-asciiDownload
From 54e5253ee94229ea7c0c39960c9f10d305c1ef21 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 18 Jun 2024 11:10:19 +0900
Subject: [PATCH v2 3/4] Add some DEBUG2 information about the redo LSN of the
 stats file

This is useful for.. Debugging.  How surprising.
---
 src/backend/utils/activity/pgstat.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 2952604a51..148dbf2c4a 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -1367,7 +1367,8 @@ pgstat_write_statsfile(XLogRecPtr redo)
 	/* we're shutting down, so it's ok to just override this */
 	pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
 
-	elog(DEBUG2, "writing stats file \"%s\"", statfile);
+	elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile,
+		 LSN_FORMAT_ARGS(redo));
 
 	/*
 	 * Open the statistics temp file to write out the current values.
@@ -1517,7 +1518,8 @@ pgstat_read_statsfile(XLogRecPtr redo)
 	/* shouldn't be called from postmaster */
 	Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
 
-	elog(DEBUG2, "reading stats file \"%s\"", statfile);
+	elog(DEBUG2, "reading stats file \"%s\" with redo %X/%X", statfile,
+		LSN_FORMAT_ARGS(redo));
 
 	/*
 	 * Try to open the stats file. If it doesn't exist, the backends simply
-- 
2.45.2

v2-0004-Flush-pgstats-file-during-checkpoints.patchtext/x-diff; charset=us-asciiDownload
From beda72b0ee48dd62ed60205a9cb4ca1496b45da5 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 18 Jun 2024 14:49:33 +0900
Subject: [PATCH v2 4/4] Flush pgstats file during checkpoints

The flushes are done for non-shutdown checkpoints and restart points, so
as it is possible to keep some statistics around in the event of a
crash.  Keeping the before_shmem_exit() callback to flush the pgstats
file in a shutdown sequence is a bit easier than doing so at checkpoint,
as they may be stats to flush before we are completely done with the
shutdown checkpoint.  Keeping the callback also keeps the shutdown of
single-user mode simpler, where the stats also need to be flushed.

At the beginning of recovery, the stats file is loaded when the redo LSN
it stores matches with the point we are replaying from.
---
 src/include/pgstat.h                     |  4 +-
 src/backend/access/transam/xlog.c        | 27 ++++++++---
 src/backend/utils/activity/pgstat.c      | 62 ++++++------------------
 src/test/recovery/t/029_stats_restart.pl | 22 +++++++--
 4 files changed, 54 insertions(+), 61 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 043d39a565..f780b56cc3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -466,9 +466,9 @@ typedef struct PgStat_PendingWalStats
 extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
-/* Functions called during server startup / shutdown */
+/* Functions called during server startup / checkpoint / shutdown */
 extern void pgstat_restore_stats(XLogRecPtr redo);
-extern void pgstat_discard_stats(void);
+extern void pgstat_flush_stats(XLogRecPtr redo);
 extern void pgstat_before_server_shutdown(int code, Datum arg);
 
 /* Functions for backend initialization */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f137551e8d..9727b0f20c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5390,7 +5390,6 @@ StartupXLOG(void)
 	XLogCtlInsert *Insert;
 	CheckPoint	checkPoint;
 	bool		wasShutdown;
-	bool		didCrash;
 	bool		haveTblspcMap;
 	bool		haveBackupLabel;
 	XLogRecPtr	EndOfLog;
@@ -5510,10 +5509,7 @@ StartupXLOG(void)
 	{
 		RemoveTempXlogFiles();
 		SyncDataDirectory();
-		didCrash = true;
 	}
-	else
-		didCrash = false;
 
 	/*
 	 * Prepare for WAL recovery if needed.
@@ -5638,10 +5634,7 @@ StartupXLOG(void)
 	 * TODO: With a bit of extra work we could just start with a pgstat file
 	 * associated with the checkpoint redo location we're starting from.
 	 */
-	if (didCrash)
-		pgstat_discard_stats();
-	else
-		pgstat_restore_stats(checkPoint.redo);
+	pgstat_restore_stats(checkPoint.redo);
 
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
@@ -7213,6 +7206,15 @@ CreateCheckPoint(int flags)
 	 */
 	END_CRIT_SECTION();
 
+	/*
+	 * The control file update is done, hence it is time to write some fresh
+	 * statistics.  Stats are flushed at shutdown by the checkpointer in a
+	 * dedicated before_shmem_exit callback, combined with sanity checks
+	 * related to the shutdown of pgstats.
+	 */
+	if (!shutdown)
+		pgstat_flush_stats(RedoRecPtr);
+
 	/*
 	 * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
 	 * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
@@ -7681,6 +7683,15 @@ CreateRestartPoint(int flags)
 	}
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * The control file update is done, hence it is time to write some fresh
+	 * statistics.  Stats are flushed at shutdown by the checkpointer in a
+	 * dedicated before_shmem_exit callback, combined with sanity checks
+	 * related to the shutdown of pgstats.
+	 */
+	if ((flags & CHECKPOINT_IS_SHUTDOWN) == 0)
+		pgstat_flush_stats(RedoRecPtr);
+
 	/*
 	 * Update the average distance between checkpoints/restartpoints if the
 	 * prior checkpoint exists.
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 148dbf2c4a..d5ca4936f1 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -10,9 +10,10 @@
  * statistics.
  *
  * Statistics are loaded from the filesystem during startup (by the startup
- * process), unless preceded by a crash, in which case all stats are
- * discarded. They are written out by the checkpointer process just before
- * shutting down, except when shutting down in immediate mode.
+ * process) if the stats file has a redo LSN that matches with the . They
+ * are written out by the checkpointer during checkpoints and restart points,
+ * as well as before shutting down, except when shutting down in immediate
+ * mode.
  *
  * Fixed-numbered stats are stored in plain (non-dynamic) shared memory.
  *
@@ -455,53 +456,19 @@ pgstat_restore_stats(XLogRecPtr redo)
 }
 
 /*
- * Remove the stats file.  This is currently used only if WAL recovery is
- * needed after a crash.
+ * Write stats in memory to disk.
  *
- * Should only be called by the startup process or in single user mode.
+ * This is called by the checkpointer or in single-user mode.
  */
 void
-pgstat_discard_stats(void)
+pgstat_flush_stats(XLogRecPtr redo)
 {
-	int			ret;
-
-	/* NB: this needs to be done even in single user mode */
-
-	ret = unlink(PGSTAT_STAT_PERMANENT_FILENAME);
-	if (ret != 0)
-	{
-		if (errno == ENOENT)
-			elog(DEBUG2,
-				 "didn't need to unlink permanent stats file \"%s\" - didn't exist",
-				 PGSTAT_STAT_PERMANENT_FILENAME);
-		else
-			ereport(LOG,
-					(errcode_for_file_access(),
-					 errmsg("could not unlink permanent statistics file \"%s\": %m",
-							PGSTAT_STAT_PERMANENT_FILENAME)));
-	}
-	else
-	{
-		ereport(DEBUG2,
-				(errcode_for_file_access(),
-				 errmsg_internal("unlinked permanent statistics file \"%s\"",
-								 PGSTAT_STAT_PERMANENT_FILENAME)));
-	}
-
-	/*
-	 * Reset stats contents. This will set reset timestamps of fixed-numbered
-	 * stats to the current time (no variable stats exist).
-	 */
-	pgstat_reset_after_failure();
+	pgstat_write_statsfile(redo);
 }
 
 /*
  * pgstat_before_server_shutdown() needs to be called by exactly one process
- * during regular server shutdowns. Otherwise all stats will be lost.
- *
- * We currently only write out stats for proc_exit(0). We might want to change
- * that at some point... But right now pgstat_discard_stats() would be called
- * during the start after a disorderly shutdown, anyway.
+ * during regular server shutdowns.
  */
 void
 pgstat_before_server_shutdown(int code, Datum arg)
@@ -526,6 +493,9 @@ pgstat_before_server_shutdown(int code, Datum arg)
 	 */
 	if (code == 0)
 	{
+		/* we're shutting down, so it's ok to just override this */
+		pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
+
 		pgStatLocal.shmem->is_shutdown = true;
 		pgstat_write_statsfile(GetRedoRecPtr());
 	}
@@ -1346,8 +1316,8 @@ write_chunk(FILE *fpout, void *ptr, size_t len)
 #define write_chunk_s(fpout, ptr) write_chunk(fpout, ptr, sizeof(*ptr))
 
 /*
- * This function is called in the last process that is accessing the shared
- * stats so locking is not required.
+ * This function is called in the checkpointer or in single-user mode,
+ * so locking is not required.
  */
 static void
 pgstat_write_statsfile(XLogRecPtr redo)
@@ -1364,9 +1334,6 @@ pgstat_write_statsfile(XLogRecPtr redo)
 	/* should be called only by the checkpointer or single user mode */
 	Assert(!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
 
-	/* we're shutting down, so it's ok to just override this */
-	pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
-
 	elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile,
 		 LSN_FORMAT_ARGS(redo));
 
@@ -1423,7 +1390,6 @@ pgstat_write_statsfile(XLogRecPtr redo)
 		CHECK_FOR_INTERRUPTS();
 
 		/* we may have some "dropped" entries not yet removed, skip them */
-		Assert(!ps->dropped);
 		if (ps->dropped)
 			continue;
 
diff --git a/src/test/recovery/t/029_stats_restart.pl b/src/test/recovery/t/029_stats_restart.pl
index 93a7209f69..9956a6ab82 100644
--- a/src/test/recovery/t/029_stats_restart.pl
+++ b/src/test/recovery/t/029_stats_restart.pl
@@ -59,7 +59,7 @@ ok(-f "$og_stats", "origin stats file must exist");
 copy($og_stats, $statsfile) or die "Copy failed: $!";
 
 
-## test discarding of stats file after crash etc
+## test retention of stats file after crash etc
 
 $node->start;
 
@@ -69,12 +69,28 @@ is(have_stats('function', $dboid, $funcoid),
 	't', "$sect: function stats do exist");
 is(have_stats('relation', $dboid, $tableoid),
 	't', "$sect: relation stats do exist");
+# Flush the stats to disk.
+$node->psql('postgres', 'checkpoint');
 
 $node->stop('immediate');
 
-ok(!-f "$og_stats", "no stats file should exist after immediate shutdown");
+ok(-f "$og_stats", "stats file should exist after immediate shutdown");
 
-# copy the old stats back to test we discard stats after crash restart
+# Start once, there should be stats restored from the previous checkpoint.
+$node->start;
+
+$sect = "crashrecovery";
+is(have_stats('database', $dboid, 0), 't', "$sect: db stats do exist");
+is(have_stats('function', $dboid, $funcoid),
+	't', "$sect: function stats do exist");
+is(have_stats('relation', $dboid, $tableoid),
+	't', "$sect: relation stats do exist");
+
+$node->stop('immediate');
+
+# Copy the old stats back, and test that we discard stats after crash restart
+# because these don't match with the redo LSN stored in the stats file
+# generated by the previous checkpoint.
 copy($statsfile, $og_stats) or die "Copy failed: $!";
 
 $node->start;
-- 
2.45.2

#6Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#4)
Re: Flush pgstats file during checkpoints

Hi,

On Fri, Jul 05, 2024 at 01:52:31PM +0900, Michael Paquier wrote:

On Sat, Jun 29, 2024 at 11:13:04PM +0200, Tomas Vondra wrote:

I think those are two independent issues - knowing that the snapshot is
from the last checkpoint, and knowing that it's correct (not corrupted).
And yeah, we should be careful about fsync/durable_rename.

Yeah, that's bugging me as well. I don't really get why we would not
want durability at shutdown for this data. So how about switching the
end of pgstat_write_statsfile() to use durable_rename()? Sounds like
an independent change, worth on its own.

Yeah, I was wondering about the same thing - can't this mean we fail to
start autovacuum? Let's say we delete a significant fraction of a huge
table, but then we crash before the next checkpoint. With this patch we
restore the last stats snapshot, which can easily have

n_dead_tup | 0
n_mod_since_analyze | 0

for the table. And thus we fail to notice the table needs autovacuum.
AFAIK we run autovacuum on all tables with missing stats (or am I
wrong?). That's what's happening on replicas after switchover/failover
too, right?

That's the opposite, please look at relation_needs_vacanalyze(). If a
table does not have any stats found in pgstats, like on HEAD after a
crash, then autoanalyze is skipped and autovacuum happens only for the
anti-wrap case.

For the database case, rebuild_database_list() uses
pgstat_fetch_stat_dbentry() three times, discards entirely databases
that have no stats. Again, there should be a stats entry post a
crash upon a reconnection.

So there's an argument in saying that the situation does not get worse
here and that we actually may improve odds by keeping a trace of the
previous stats, no?

I agree, we still could get autoanalyze/autovacuum skipped due to obsolete/stales
stats being restored from the last checkpoint but that's better than having them
always skipped (current HEAD).

In most cases, there would be a stats entry
post-crash as an INSERT or a scan would have re-created it,

Yeap.

but the
stats would reflect the state registered since the last crash recovery
(even on HEAD, a bulk delete bumping the dead tuple counter would not
be detected post-crash).

Right.

The case of spiky workloads may impact the
decision-making, of course, but at least we'd allow autovacuum to take
some decision over giving up entirely based on some previous state of
the stats. That sounds like a win for me with steady workloads, less
for bulby workloads..

I agree and it is not worst (though not ideal) that currently on HEAD.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#7Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#5)
Re: Flush pgstats file during checkpoints

Hi,

On Fri, Jul 12, 2024 at 03:42:21PM +0900, Michael Paquier wrote:

On Fri, Jul 05, 2024 at 01:52:31PM +0900, Michael Paquier wrote:

On Sat, Jun 29, 2024 at 11:13:04PM +0200, Tomas Vondra wrote:

I think those are two independent issues - knowing that the snapshot is
from the last checkpoint, and knowing that it's correct (not corrupted).
And yeah, we should be careful about fsync/durable_rename.

Yeah, that's bugging me as well. I don't really get why we would not
want durability at shutdown for this data. So how about switching the
end of pgstat_write_statsfile() to use durable_rename()? Sounds like
an independent change, worth on its own.

Please find attached a rebased patch set with the durability point
addressed in 0001. There were also some conflicts.

Thanks!

Looking at 0001:

+ /* error logged already */

Maybe mention it's already logged by durable_rename() (like it's done in
InstallXLogFileSegment(), BaseBackup() for example).

Except this nit, 0001 LGTM.

Need to spend more time and thoughts on 0002+.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#8Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Bertrand Drouvot (#7)
Re: Flush pgstats file during checkpoints

Hi,

On Fri, Jul 12, 2024 at 12:10:26PM +0000, Bertrand Drouvot wrote:

Need to spend more time and thoughts on 0002+.

I think there is a corner case, say:

1. shutdown checkpoint at LSN1
2. startup->reads the stat file (contains LSN1)->all good->read stat file and
remove it
3. crash (no checkpoint occured between 2. and 3.)
4. startup (last checkpoint is still LSN1)->no stat file (as removed in 2.)

In that case we start with empty stats.

Instead of removing the stat file, should we keep it around until the first call
to pgstat_write_statsfile()?

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#9Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#8)
Re: Flush pgstats file during checkpoints

On Fri, Jul 12, 2024 at 01:01:19PM +0000, Bertrand Drouvot wrote:

Instead of removing the stat file, should we keep it around until the first call
to pgstat_write_statsfile()?

Oops. You are right, I have somewhat missed the unlink() once we are
done reading the stats file with a correct redo location.
--
Michael

#10Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#7)
Re: Flush pgstats file during checkpoints

On Fri, Jul 12, 2024 at 12:10:26PM +0000, Bertrand Drouvot wrote:

Looking at 0001:

+ /* error logged already */

Maybe mention it's already logged by durable_rename() (like it's done in
InstallXLogFileSegment(), BaseBackup() for example).

Except this nit, 0001 LGTM.

Tweaked the comment, and applied 0001 for durable_rename(). Thanks
for the review.
--
Michael

#11Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#9)
3 attachment(s)
Re: Flush pgstats file during checkpoints

On Tue, Jul 16, 2024 at 10:37:39AM +0900, Michael Paquier wrote:

On Fri, Jul 12, 2024 at 01:01:19PM +0000, Bertrand Drouvot wrote:

Instead of removing the stat file, should we keep it around until the first call
to pgstat_write_statsfile()?

Oops. You are right, I have somewhat missed the unlink() once we are
done reading the stats file with a correct redo location.

The durable_rename() part has been applied. Please find attached a
rebase of the patch set with all the other comments addressed.
--
Michael

Attachments:

v3-0001-Add-redo-LSN-to-pgstats-file.patchtext/x-diff; charset=us-asciiDownload
From 05700c0f0a6145c6fb607ecd4add4e81d0509d8f Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Fri, 12 Jul 2024 15:22:12 +0900
Subject: [PATCH v3 1/3] Add redo LSN to pgstats file

This is used in the startup process to check that the file we are
loading is the one referring to the latest checkpoint.

Bump PGSTAT_FILE_FORMAT_ID.
---
 src/include/pgstat.h                |  5 +++--
 src/backend/access/transam/xlog.c   |  2 +-
 src/backend/utils/activity/pgstat.c | 26 +++++++++++++++++++-------
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6b99bb8aad..043d39a565 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
@@ -235,7 +236,7 @@ typedef struct PgStat_TableXactStatus
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BCAD
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BCAE
 
 typedef struct PgStat_ArchiverStats
 {
@@ -466,7 +467,7 @@ extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
 /* Functions called during server startup / shutdown */
-extern void pgstat_restore_stats(void);
+extern void pgstat_restore_stats(XLogRecPtr redo);
 extern void pgstat_discard_stats(void);
 extern void pgstat_before_server_shutdown(int code, Datum arg);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 33e27a6e72..f137551e8d 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5641,7 +5641,7 @@ StartupXLOG(void)
 	if (didCrash)
 		pgstat_discard_stats();
 	else
-		pgstat_restore_stats();
+		pgstat_restore_stats(checkPoint.redo);
 
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 2e22bf2707..1b0e64bf82 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -94,6 +94,7 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "access/xlog.h"
 #include "lib/dshash.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -171,8 +172,8 @@ typedef struct PgStat_SnapshotEntry
  * ----------
  */
 
-static void pgstat_write_statsfile(void);
-static void pgstat_read_statsfile(void);
+static void pgstat_write_statsfile(XLogRecPtr redo);
+static void pgstat_read_statsfile(XLogRecPtr redo);
 
 static void pgstat_reset_after_failure(void);
 
@@ -448,9 +449,9 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
  * Should only be called by the startup process or in single user mode.
  */
 void
-pgstat_restore_stats(void)
+pgstat_restore_stats(XLogRecPtr redo)
 {
-	pgstat_read_statsfile();
+	pgstat_read_statsfile(redo);
 }
 
 /*
@@ -526,7 +527,7 @@ pgstat_before_server_shutdown(int code, Datum arg)
 	if (code == 0)
 	{
 		pgStatLocal.shmem->is_shutdown = true;
-		pgstat_write_statsfile();
+		pgstat_write_statsfile(GetRedoRecPtr());
 	}
 }
 
@@ -1349,7 +1350,7 @@ write_chunk(FILE *fpout, void *ptr, size_t len)
  * stats so locking is not required.
  */
 static void
-pgstat_write_statsfile(void)
+pgstat_write_statsfile(XLogRecPtr redo)
 {
 	FILE	   *fpout;
 	int32		format_id;
@@ -1387,6 +1388,9 @@ pgstat_write_statsfile(void)
 	format_id = PGSTAT_FILE_FORMAT_ID;
 	write_chunk_s(fpout, &format_id);
 
+	/* Write the redo LSN, used to cross check the file loaded */
+	write_chunk_s(fpout, &redo);
+
 	/* Write various stats structs for fixed number of objects */
 	for (int kind = PGSTAT_KIND_FIRST_VALID; kind <= PGSTAT_KIND_LAST; kind++)
 	{
@@ -1501,13 +1505,14 @@ read_chunk(FILE *fpin, void *ptr, size_t len)
  * stats so locking is not required.
  */
 static void
-pgstat_read_statsfile(void)
+pgstat_read_statsfile(XLogRecPtr redo)
 {
 	FILE	   *fpin;
 	int32		format_id;
 	bool		found;
 	const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 	PgStat_ShmemControl *shmem = pgStatLocal.shmem;
+	XLogRecPtr	file_redo;
 
 	/* shouldn't be called from postmaster */
 	Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
@@ -1541,6 +1546,13 @@ pgstat_read_statsfile(void)
 		format_id != PGSTAT_FILE_FORMAT_ID)
 		goto error;
 
+	/*
+	 * Read the redo LSN stored in the file.
+	 */
+	if (!read_chunk_s(fpin, &file_redo) ||
+		file_redo != redo)
+		goto error;
+
 	/*
 	 * We found an existing statistics file. Read it and put all the stats
 	 * data into place.
-- 
2.45.2

v3-0002-Add-some-DEBUG2-information-about-the-redo-LSN-of.patchtext/x-diff; charset=us-asciiDownload
From 511205ae260d02dc2542a8594daf16fbbe7184d6 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 18 Jun 2024 11:10:19 +0900
Subject: [PATCH v3 2/3] Add some DEBUG2 information about the redo LSN of the
 stats file

This is useful for.. Debugging.  How surprising.
---
 src/backend/utils/activity/pgstat.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 1b0e64bf82..e739c9823b 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -1367,7 +1367,8 @@ pgstat_write_statsfile(XLogRecPtr redo)
 	/* we're shutting down, so it's ok to just override this */
 	pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
 
-	elog(DEBUG2, "writing stats file \"%s\"", statfile);
+	elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile,
+		 LSN_FORMAT_ARGS(redo));
 
 	/*
 	 * Open the statistics temp file to write out the current values.
@@ -1517,7 +1518,8 @@ pgstat_read_statsfile(XLogRecPtr redo)
 	/* shouldn't be called from postmaster */
 	Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
 
-	elog(DEBUG2, "reading stats file \"%s\"", statfile);
+	elog(DEBUG2, "reading stats file \"%s\" with redo %X/%X", statfile,
+		LSN_FORMAT_ARGS(redo));
 
 	/*
 	 * Try to open the stats file. If it doesn't exist, the backends simply
-- 
2.45.2

v3-0003-Flush-pgstats-file-during-checkpoints.patchtext/x-diff; charset=us-asciiDownload
From 52c6b52a5b795f25f7f638fabaa67875f0b08e04 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 17 Jul 2024 12:42:43 +0900
Subject: [PATCH v3 3/3] Flush pgstats file during checkpoints

The flushes are done for non-shutdown checkpoints and restart points, so
as it is possible to keep some statistics around in the event of a
crash.  Keeping the before_shmem_exit() callback to flush the pgstats
file in a shutdown sequence is a bit easier than doing so at checkpoint,
as they may be stats to flush before we are completely done with the
shutdown checkpoint.  Keeping the callback also keeps the shutdown of
single-user mode simpler, where the stats also need to be flushed.

At the beginning of recovery, the stats file is loaded when the redo LSN
it stores matches with the point we are replaying from.  The file is not
removed anymore after being read, to ensure that there is still a source
of stats to feed on should the system crash until the next checkpoint
that would update the stats.
---
 src/include/pgstat.h                          |  4 +-
 src/backend/access/transam/xlog.c             | 27 +++++---
 src/backend/utils/activity/pgstat.c           | 65 ++++---------------
 src/test/recovery/t/029_stats_restart.pl      | 26 ++++++--
 .../recovery/t/030_stats_cleanup_replica.pl   |  2 +-
 5 files changed, 57 insertions(+), 67 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 043d39a565..f780b56cc3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -466,9 +466,9 @@ typedef struct PgStat_PendingWalStats
 extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
-/* Functions called during server startup / shutdown */
+/* Functions called during server startup / checkpoint / shutdown */
 extern void pgstat_restore_stats(XLogRecPtr redo);
-extern void pgstat_discard_stats(void);
+extern void pgstat_flush_stats(XLogRecPtr redo);
 extern void pgstat_before_server_shutdown(int code, Datum arg);
 
 /* Functions for backend initialization */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f137551e8d..9727b0f20c 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5390,7 +5390,6 @@ StartupXLOG(void)
 	XLogCtlInsert *Insert;
 	CheckPoint	checkPoint;
 	bool		wasShutdown;
-	bool		didCrash;
 	bool		haveTblspcMap;
 	bool		haveBackupLabel;
 	XLogRecPtr	EndOfLog;
@@ -5510,10 +5509,7 @@ StartupXLOG(void)
 	{
 		RemoveTempXlogFiles();
 		SyncDataDirectory();
-		didCrash = true;
 	}
-	else
-		didCrash = false;
 
 	/*
 	 * Prepare for WAL recovery if needed.
@@ -5638,10 +5634,7 @@ StartupXLOG(void)
 	 * TODO: With a bit of extra work we could just start with a pgstat file
 	 * associated with the checkpoint redo location we're starting from.
 	 */
-	if (didCrash)
-		pgstat_discard_stats();
-	else
-		pgstat_restore_stats(checkPoint.redo);
+	pgstat_restore_stats(checkPoint.redo);
 
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
@@ -7213,6 +7206,15 @@ CreateCheckPoint(int flags)
 	 */
 	END_CRIT_SECTION();
 
+	/*
+	 * The control file update is done, hence it is time to write some fresh
+	 * statistics.  Stats are flushed at shutdown by the checkpointer in a
+	 * dedicated before_shmem_exit callback, combined with sanity checks
+	 * related to the shutdown of pgstats.
+	 */
+	if (!shutdown)
+		pgstat_flush_stats(RedoRecPtr);
+
 	/*
 	 * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
 	 * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
@@ -7681,6 +7683,15 @@ CreateRestartPoint(int flags)
 	}
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * The control file update is done, hence it is time to write some fresh
+	 * statistics.  Stats are flushed at shutdown by the checkpointer in a
+	 * dedicated before_shmem_exit callback, combined with sanity checks
+	 * related to the shutdown of pgstats.
+	 */
+	if ((flags & CHECKPOINT_IS_SHUTDOWN) == 0)
+		pgstat_flush_stats(RedoRecPtr);
+
 	/*
 	 * Update the average distance between checkpoints/restartpoints if the
 	 * prior checkpoint exists.
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index e739c9823b..c19e65e50f 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -10,9 +10,10 @@
  * statistics.
  *
  * Statistics are loaded from the filesystem during startup (by the startup
- * process), unless preceded by a crash, in which case all stats are
- * discarded. They are written out by the checkpointer process just before
- * shutting down, except when shutting down in immediate mode.
+ * process) if the stats file has a redo LSN that matches with the . They
+ * are written out by the checkpointer during checkpoints and restart points,
+ * as well as before shutting down, except when shutting down in immediate
+ * mode.
  *
  * Fixed-numbered stats are stored in plain (non-dynamic) shared memory.
  *
@@ -455,53 +456,19 @@ pgstat_restore_stats(XLogRecPtr redo)
 }
 
 /*
- * Remove the stats file.  This is currently used only if WAL recovery is
- * needed after a crash.
+ * Write stats in memory to disk.
  *
- * Should only be called by the startup process or in single user mode.
+ * This is called by the checkpointer or in single-user mode.
  */
 void
-pgstat_discard_stats(void)
+pgstat_flush_stats(XLogRecPtr redo)
 {
-	int			ret;
-
-	/* NB: this needs to be done even in single user mode */
-
-	ret = unlink(PGSTAT_STAT_PERMANENT_FILENAME);
-	if (ret != 0)
-	{
-		if (errno == ENOENT)
-			elog(DEBUG2,
-				 "didn't need to unlink permanent stats file \"%s\" - didn't exist",
-				 PGSTAT_STAT_PERMANENT_FILENAME);
-		else
-			ereport(LOG,
-					(errcode_for_file_access(),
-					 errmsg("could not unlink permanent statistics file \"%s\": %m",
-							PGSTAT_STAT_PERMANENT_FILENAME)));
-	}
-	else
-	{
-		ereport(DEBUG2,
-				(errcode_for_file_access(),
-				 errmsg_internal("unlinked permanent statistics file \"%s\"",
-								 PGSTAT_STAT_PERMANENT_FILENAME)));
-	}
-
-	/*
-	 * Reset stats contents. This will set reset timestamps of fixed-numbered
-	 * stats to the current time (no variable stats exist).
-	 */
-	pgstat_reset_after_failure();
+	pgstat_write_statsfile(redo);
 }
 
 /*
  * pgstat_before_server_shutdown() needs to be called by exactly one process
- * during regular server shutdowns. Otherwise all stats will be lost.
- *
- * We currently only write out stats for proc_exit(0). We might want to change
- * that at some point... But right now pgstat_discard_stats() would be called
- * during the start after a disorderly shutdown, anyway.
+ * during regular server shutdowns.
  */
 void
 pgstat_before_server_shutdown(int code, Datum arg)
@@ -526,6 +493,9 @@ pgstat_before_server_shutdown(int code, Datum arg)
 	 */
 	if (code == 0)
 	{
+		/* we're shutting down, so it's ok to just override this */
+		pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
+
 		pgStatLocal.shmem->is_shutdown = true;
 		pgstat_write_statsfile(GetRedoRecPtr());
 	}
@@ -1346,8 +1316,8 @@ write_chunk(FILE *fpout, void *ptr, size_t len)
 #define write_chunk_s(fpout, ptr) write_chunk(fpout, ptr, sizeof(*ptr))
 
 /*
- * This function is called in the last process that is accessing the shared
- * stats so locking is not required.
+ * This function is called in the checkpointer or in single-user mode,
+ * so locking is not required.
  */
 static void
 pgstat_write_statsfile(XLogRecPtr redo)
@@ -1364,9 +1334,6 @@ pgstat_write_statsfile(XLogRecPtr redo)
 	/* should be called only by the checkpointer or single user mode */
 	Assert(!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
 
-	/* we're shutting down, so it's ok to just override this */
-	pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
-
 	elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile,
 		 LSN_FORMAT_ARGS(redo));
 
@@ -1423,7 +1390,6 @@ pgstat_write_statsfile(XLogRecPtr redo)
 		CHECK_FOR_INTERRUPTS();
 
 		/* we may have some "dropped" entries not yet removed, skip them */
-		Assert(!ps->dropped);
 		if (ps->dropped)
 			continue;
 
@@ -1686,9 +1652,6 @@ pgstat_read_statsfile(XLogRecPtr redo)
 done:
 	FreeFile(fpin);
 
-	elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-	unlink(statfile);
-
 	return;
 
 error:
diff --git a/src/test/recovery/t/029_stats_restart.pl b/src/test/recovery/t/029_stats_restart.pl
index 93a7209f69..73d059b9f9 100644
--- a/src/test/recovery/t/029_stats_restart.pl
+++ b/src/test/recovery/t/029_stats_restart.pl
@@ -59,7 +59,7 @@ ok(-f "$og_stats", "origin stats file must exist");
 copy($og_stats, $statsfile) or die "Copy failed: $!";
 
 
-## test discarding of stats file after crash etc
+## test retention of stats file after crash etc
 
 $node->start;
 
@@ -69,12 +69,28 @@ is(have_stats('function', $dboid, $funcoid),
 	't', "$sect: function stats do exist");
 is(have_stats('relation', $dboid, $tableoid),
 	't', "$sect: relation stats do exist");
+# Flush the stats to disk.
+$node->psql('postgres', 'checkpoint');
 
 $node->stop('immediate');
 
-ok(!-f "$og_stats", "no stats file should exist after immediate shutdown");
+ok(-f "$og_stats", "stats file should exist after immediate shutdown");
 
-# copy the old stats back to test we discard stats after crash restart
+# Start once, there should be stats restored from the previous checkpoint.
+$node->start;
+
+$sect = "crashrecovery";
+is(have_stats('database', $dboid, 0), 't', "$sect: db stats do exist");
+is(have_stats('function', $dboid, $funcoid),
+	't', "$sect: function stats do exist");
+is(have_stats('relation', $dboid, $tableoid),
+	't', "$sect: relation stats do exist");
+
+$node->stop('immediate');
+
+# Copy the old stats back, and test that we discard stats after crash restart
+# because these don't match with the redo LSN stored in the stats file
+# generated by the previous checkpoint.
 copy($statsfile, $og_stats) or die "Copy failed: $!";
 
 $node->start;
@@ -274,9 +290,9 @@ my $wal_restart_immediate = wal_stats();
 
 cmp_ok(
 	$wal_reset_restart->{reset},
-	'lt',
+	'eq',
 	$wal_restart_immediate->{reset},
-	"$sect: reset timestamp is new");
+	"$sect: reset timestamp is the same");
 
 $node->stop;
 done_testing();
diff --git a/src/test/recovery/t/030_stats_cleanup_replica.pl b/src/test/recovery/t/030_stats_cleanup_replica.pl
index 74b516cc7c..5735921c99 100644
--- a/src/test/recovery/t/030_stats_cleanup_replica.pl
+++ b/src/test/recovery/t/030_stats_cleanup_replica.pl
@@ -115,7 +115,7 @@ $node_standby->start();
 $sect = "post immediate restart";
 
 test_standby_func_tab_stats_status('postgres',
-	$dboid, $tableoid, $funcoid, 'f');
+	$dboid, $tableoid, $funcoid, 't');
 
 
 done_testing();
-- 
2.45.2

#12Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#11)
Re: Flush pgstats file during checkpoints

Hi,

On Wed, Jul 17, 2024 at 12:52:12PM +0900, Michael Paquier wrote:

On Tue, Jul 16, 2024 at 10:37:39AM +0900, Michael Paquier wrote:

On Fri, Jul 12, 2024 at 01:01:19PM +0000, Bertrand Drouvot wrote:

Instead of removing the stat file, should we keep it around until the first call
to pgstat_write_statsfile()?

Oops. You are right, I have somewhat missed the unlink() once we are
done reading the stats file with a correct redo location.

The durable_rename() part has been applied. Please find attached a
rebase of the patch set with all the other comments addressed.

Thanks!

Looking at 0001:

1 ===

-               pgstat_write_statsfile();
+               pgstat_write_statsfile(GetRedoRecPtr());

Not related with your patch but this comment in the GetRedoRecPtr() function:

* grabbed a WAL insertion lock to read the authoritative value in
* Insert->RedoRecPtr

sounds weird. Should'nt that be s/Insert/XLogCtl/?

2 ===

+ /* Write the redo LSN, used to cross check the file loaded */

Nit: s/loaded/read/?

3 ===

+       /*
+        * Read the redo LSN stored in the file.
+        */
+       if (!read_chunk_s(fpin, &file_redo) ||
+               file_redo != redo)
+               goto error;

I wonder if it would make sense to have dedicated error messages for
"file_redo != redo" and for "format_id != PGSTAT_FILE_FORMAT_ID". That would
ease to diagnose as to why the stat file is discarded.

Looking at 0002:

LGTM

Looking at 0003:

4 ===

@@ -5638,10 +5634,7 @@ StartupXLOG(void)
         * TODO: With a bit of extra work we could just start with a pgstat file
         * associated with the checkpoint redo location we're starting from.
         */
-       if (didCrash)
-               pgstat_discard_stats();
-       else
-               pgstat_restore_stats(checkPoint.redo);
+       pgstat_restore_stats(checkPoint.redo)

remove the TODO comment?

5 ===

+ * process) if the stats file has a redo LSN that matches with the .

unfinished sentence?

6 ===

- * Should only be called by the startup process or in single user mode.
+ * This is called by the checkpointer or in single-user mode.
  */
 void
-pgstat_discard_stats(void)
+pgstat_flush_stats(XLogRecPtr redo)
 {

Would that make sense to add an Assert in pgstat_flush_stats()? (checking what
the above comment states).

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#13Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#12)
4 attachment(s)
Re: Flush pgstats file during checkpoints

On Mon, Jul 22, 2024 at 07:01:41AM +0000, Bertrand Drouvot wrote:

1 ===
Not related with your patch but this comment in the GetRedoRecPtr() function:

* grabbed a WAL insertion lock to read the authoritative value in
* Insert->RedoRecPtr

sounds weird. Should'nt that be s/Insert/XLogCtl/?

No, the comment is right. We are retrieving a copy of
Insert->RedoRecPtr here.

2 ===

+ /* Write the redo LSN, used to cross check the file loaded */

Nit: s/loaded/read/?

WFM.

3 ===

+       /*
+        * Read the redo LSN stored in the file.
+        */
+       if (!read_chunk_s(fpin, &file_redo) ||
+               file_redo != redo)
+               goto error;

I wonder if it would make sense to have dedicated error messages for
"file_redo != redo" and for "format_id != PGSTAT_FILE_FORMAT_ID". That would
ease to diagnose as to why the stat file is discarded.

Yep. This has been itching me quite a bit, and that's a bit more than
just the format ID or the redo LSN: it relates to all the read_chunk()
callers. I've taken a shot at this with patch 0001, implemented on
top of the rest. Adjusted as well the redo LSN read to have more
error context, now in 0002.

Looking at 0003:

4 ===

@@ -5638,10 +5634,7 @@ StartupXLOG(void)
* TODO: With a bit of extra work we could just start with a pgstat file
* associated with the checkpoint redo location we're starting from.
*/
-       if (didCrash)
-               pgstat_discard_stats();
-       else
-               pgstat_restore_stats(checkPoint.redo);
+       pgstat_restore_stats(checkPoint.redo)

remove the TODO comment?

Pretty sure I've removed that more than one time already, and that
this is a rebase accident. Thanks for noticing.

5 ===

+ * process) if the stats file has a redo LSN that matches with the .

unfinished sentence?

This is missing a reference to the control file.

6 ===

- * Should only be called by the startup process or in single user mode.
+ * This is called by the checkpointer or in single-user mode.
*/
void
-pgstat_discard_stats(void)
+pgstat_flush_stats(XLogRecPtr redo)
{

Would that make sense to add an Assert in pgstat_flush_stats()? (checking what
the above comment states).

There is one in pgstat_write_statsfile(), not sure there is a point in
duplicating the assertion in both.

Attaching a new v4 series, with all these comments addressed.
--
Michael

Attachments:

v4-0001-Revert-Test-that-vacuum-removes-tuples-older-than.patchtext/x-diff; charset=us-asciiDownload
From efcbb76efe406d59c2ba8b4a09e04c01158ba575 Mon Sep 17 00:00:00 2001
From: Melanie Plageman <melanieplageman@gmail.com>
Date: Mon, 22 Jul 2024 16:13:56 -0400
Subject: [PATCH v4 1/4] Revert "Test that vacuum removes tuples older than
 OldestXmin"

This reverts commit aa607980aee08416211f003ab41aa750f5559712.

This test proved to be unstable on the buildfarm, timing out before the
standby could catch up on 32-bit machines where more rows were required
and failing to reliably trigger multiple index vacuum rounds on 64-bit
machines where fewer rows should be required.

Because the instability is only known to be present on versions of
Postgres with TIDStore used for dead TID storage by vacuum, this is only
being reverted on master and REL_17_STABLE.

As having this coverage may be valuable, there is a discussion on the
thread of possible ways to stabilize the test. If that happens, a fixed
test can be committed again.

Backpatch-through: 17
Reported-by: Tom Lane

Discussion: https://postgr.es/m/614152.1721580711%40sss.pgh.pa.us
---
 src/test/recovery/meson.build                 |   1 -
 .../recovery/t/043_vacuum_horizon_floor.pl    | 268 ------------------
 2 files changed, 269 deletions(-)
 delete mode 100644 src/test/recovery/t/043_vacuum_horizon_floor.pl

diff --git a/src/test/recovery/meson.build b/src/test/recovery/meson.build
index 1d55d6bf56..b1eb77b1ec 100644
--- a/src/test/recovery/meson.build
+++ b/src/test/recovery/meson.build
@@ -51,7 +51,6 @@ tests += {
       't/040_standby_failover_slots_sync.pl',
       't/041_checkpoint_at_promote.pl',
       't/042_low_level_backup.pl',
-      't/043_vacuum_horizon_floor.pl',
     ],
   },
 }
diff --git a/src/test/recovery/t/043_vacuum_horizon_floor.pl b/src/test/recovery/t/043_vacuum_horizon_floor.pl
deleted file mode 100644
index 7ccf5ca850..0000000000
--- a/src/test/recovery/t/043_vacuum_horizon_floor.pl
+++ /dev/null
@@ -1,268 +0,0 @@
-use strict;
-use warnings;
-use PostgreSQL::Test::Cluster;
-use Test::More;
-
-# Test that vacuum prunes away all dead tuples killed before OldestXmin
-#
-# This test creates a table on a primary, updates the table to generate dead
-# tuples for vacuum, and then, during the vacuum, uses the replica to force
-# GlobalVisState->maybe_needed on the primary to move backwards and precede
-# the value of OldestXmin set at the beginning of vacuuming the table.
-
-# Set up nodes
-my $node_primary = PostgreSQL::Test::Cluster->new('primary');
-$node_primary->init(allows_streaming => 'physical');
-
-$node_primary->append_conf(
-	'postgresql.conf', qq[
-hot_standby_feedback = on
-log_recovery_conflict_waits = true
-autovacuum = off
-log_min_messages = INFO
-maintenance_work_mem = 1024
-]);
-$node_primary->start;
-
-my $node_replica = PostgreSQL::Test::Cluster->new('standby');
-
-$node_primary->backup('my_backup');
-$node_replica->init_from_backup($node_primary, 'my_backup',
-	has_streaming => 1);
-
-$node_replica->start;
-
-my $test_db = "test_db";
-$node_primary->safe_psql('postgres', "CREATE DATABASE $test_db");
-
-# Save the original connection info for later use
-my $orig_conninfo = $node_primary->connstr();
-
-my $table1 = "vac_horizon_floor_table";
-
-# Long-running Primary Session A
-my $psql_primaryA =
-  $node_primary->background_psql($test_db, on_error_stop => 1);
-
-# Long-running Primary Session B
-my $psql_primaryB  =
-  $node_primary->background_psql($test_db, on_error_stop => 1);
-
-# The TIDStore vacuum uses to store dead items is optimized for its target
-# system. On a 32-bit system, our example requires twice as many pages with
-# the same number of dead items per page to fill the TIDStore and trigger a
-# second round of index vacuuming.
-my $is_64bit = $node_primary->safe_psql($test_db,
-	qq[SELECT typbyval FROM pg_type WHERE typname = 'int8';]);
-
-my $nrows = $is_64bit eq 't' ? 400000 : 800000;
-
-# Because vacuum's first pass, pruning, is where we use the GlobalVisState to
-# check tuple visibility, GlobalVisState->maybe_needed must move backwards
-# during pruning before checking the visibility for a tuple which would have
-# been considered HEAPTUPLE_DEAD prior to maybe_needed moving backwards but
-# HEAPTUPLE_RECENTLY_DEAD compared to the new, older value of maybe_needed.
-#
-# We must not only force the horizon on the primary to move backwards but also
-# force the vacuuming backend's GlobalVisState to be updated. GlobalVisState
-# is forced to update during index vacuuming.
-#
-# _bt_pendingfsm_finalize() calls GetOldestNonRemovableTransactionId() at the
-# end of a round of index vacuuming, updating the backend's GlobalVisState
-# and, in our case, moving maybe_needed backwards.
-#
-# Then vacuum's first (pruning) pass will continue and pruning will find our
-# later inserted and updated tuple HEAPTUPLE_RECENTLY_DEAD when compared to
-# maybe_needed but HEAPTUPLE_DEAD when compared to OldestXmin.
-#
-# Thus, we must force at least two rounds of index vacuuming to ensure that
-# some tuple visibility checks will happen after a round of index vacuuming.
-# To accomplish this, we set maintenance_work_mem to its minimum value and
-# insert and update enough rows that we force at least one round of index
-# vacuuming before getting to a dead tuple which was killed after the standby
-# is disconnected.
-$node_primary->safe_psql($test_db, qq[
-	CREATE TABLE ${table1}(col1 int)
-		WITH (autovacuum_enabled=false, fillfactor=10);
-	INSERT INTO $table1 VALUES(7);
-	INSERT INTO $table1 SELECT generate_series(1, $nrows) % 3;
-	CREATE INDEX on ${table1}(col1);
-	UPDATE $table1 SET col1 = 3 WHERE col1 = 0;
-	INSERT INTO $table1 VALUES(7);
-]);
-
-# We will later move the primary forward while the standby is disconnected.
-# For now, however, there is no reason not to wait for the standby to catch
-# up.
-my $primary_lsn = $node_primary->lsn('flush');
-$node_primary->wait_for_catchup($node_replica, 'replay', $primary_lsn);
-
-# Test that the WAL receiver is up and running.
-$node_replica->poll_query_until($test_db, qq[
-	select exists (select * from pg_stat_wal_receiver);] , 't');
-
-# Set primary_conninfo to something invalid on the replica and reload the
-# config. Once the config is reloaded, the startup process will force the WAL
-# receiver to restart and it will be unable to reconnect because of the
-# invalid connection information.
-$node_replica->safe_psql($test_db, qq[
-		ALTER SYSTEM SET primary_conninfo = '';
-		SELECT pg_reload_conf();
-	]);
-
-# Wait until the WAL receiver has shut down and been unable to start up again.
-$node_replica->poll_query_until($test_db, qq[
-	select exists (select * from pg_stat_wal_receiver);] , 'f');
-
-# Now insert and update a tuple which will be visible to the vacuum on the
-# primary but which will have xmax newer than the oldest xmin on the standby
-# that was recently disconnected.
-my $res = $psql_primaryA->query_safe(
-	qq[
-		INSERT INTO $table1 VALUES (99);
-		UPDATE $table1 SET col1 = 100 WHERE col1 = 99;
-		SELECT 'after_update';
-        ]
-	);
-
-# Make sure the UPDATE finished
-like($res, qr/^after_update$/m, "UPDATE occurred on primary session A");
-
-# Open a cursor on the primary whose pin will keep VACUUM from getting a
-# cleanup lock on the first page of the relation. We want VACUUM to be able to
-# start, calculate initial values for OldestXmin and GlobalVisState and then
-# be unable to proceed with pruning our dead tuples. This will allow us to
-# reconnect the standby and push the horizon back before we start actual
-# pruning and vacuuming.
-my $primary_cursor1 = "vac_horizon_floor_cursor1";
-
-# The first value inserted into the table was a 7, so FETCH FORWARD should
-# return a 7. That's how we know the cursor has a pin.
-$res = $psql_primaryB->query_safe(
-	qq[
-	BEGIN;
-	DECLARE $primary_cursor1 CURSOR FOR SELECT * FROM $table1 WHERE col1 = 7;
-	FETCH $primary_cursor1;
-	]
-	);
-
-is($res, 7, qq[Cursor query returned $res. Expected value 7.]);
-
-# Get the PID of the session which will run the VACUUM FREEZE so that we can
-# use it to filter pg_stat_activity later.
-my $vacuum_pid = $psql_primaryA->query_safe("SELECT pg_backend_pid();");
-
-# Now start a VACUUM FREEZE on the primary. It will call vacuum_get_cutoffs()
-# and establish values of OldestXmin and GlobalVisState which are newer than
-# all of our dead tuples. Then it will be unable to get a cleanup lock to
-# start pruning, so it will hang.
-#
-# We use VACUUM FREEZE because it will wait for a cleanup lock instead of
-# skipping the page pinned by the cursor. Note that works because the target
-# tuple's xmax precedes OldestXmin which ensures that lazy_scan_noprune() will
-# return false and we will wait for the cleanup lock.
-$psql_primaryA->{stdin} .= qq[
-		VACUUM (VERBOSE, FREEZE) $table1;
-		\\echo VACUUM
-        ];
-
-# Make sure the VACUUM command makes it to the server.
-$psql_primaryA->{run}->pump_nb();
-
-# Make sure that the VACUUM has already called vacuum_get_cutoffs() and is
-# just waiting on the lock to start vacuuming. We don't want the standby to
-# re-establish a connection to the primary and push the horizon back until
-# we've saved initial values in GlobalVisState and calculated OldestXmin.
-$node_primary->poll_query_until($test_db,
-	qq[
-	SELECT count(*) >= 1 FROM pg_stat_activity
-		WHERE pid = $vacuum_pid
-		AND wait_event = 'BufferPin';
-	],
-	't');
-
-# Ensure the WAL receiver is still not active on the replica.
-$node_replica->poll_query_until($test_db, qq[
-	SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);] , 'f');
-
-# Allow the WAL receiver connection to re-establish.
-$node_replica->safe_psql(
-	$test_db, qq[
-		ALTER SYSTEM SET primary_conninfo = '$orig_conninfo';
-		SELECT pg_reload_conf();
-	]);
-
-# Ensure the new WAL receiver has connected.
-$node_replica->poll_query_until($test_db, qq[
-	SELECT EXISTS (SELECT * FROM pg_stat_wal_receiver);] , 't');
-
-# Once the WAL sender is shown on the primary, the replica should have
-# connected with the primary and pushed the horizon backward. Primary Session
-# A won't see that until the VACUUM FREEZE proceeds and does its first round
-# of index vacuuming.
-$node_primary->poll_query_until($test_db, qq[
-	SELECT EXISTS (SELECT * FROm pg_stat_replication);] , 't');
-
-# Move the cursor forward to the next 7. We inserted the 7 much later, so
-# advancing the cursor should allow vacuum to proceed vacuuming most pages of
-# the relation. Because we set maintanence_work_mem sufficiently low, we
-# expect that a round of index vacuuming has happened and that the vacuum is
-# now waiting for the cursor to release its pin on the last page of the
-# relation.
-$res = $psql_primaryB->query_safe("FETCH $primary_cursor1");
-is($res, 7,
-	qq[Cursor query returned $res from second fetch. Expected value 7.]);
-
-# Prevent the test from incorrectly passing by confirming that we did indeed
-# do a pass of index vacuuming.
-$node_primary->poll_query_until($test_db, qq[
-	SELECT index_vacuum_count > 0
-	FROM pg_stat_progress_vacuum
-	WHERE datname='$test_db' AND relid::regclass = '$table1'::regclass;
-	] , 't');
-
-# Commit the transaction with the open cursor so that the VACUUM can finish.
-$psql_primaryB->query_until(
-		qr/^commit$/m,
-		qq[
-			COMMIT;
-			\\echo commit
-        ]
-	);
-
-# VACUUM proceeds with pruning and does a visibility check on each tuple. In
-# older versions of Postgres, pruning found our final dead tuple
-# non-removable (HEAPTUPLE_RECENTLY_DEAD) since its xmax is after the new
-# value of maybe_needed. Then heap_prepare_freeze_tuple() would decide the
-# tuple xmax should be frozen because it precedes OldestXmin. Vacuum would
-# then error out in heap_pre_freeze_checks() with "cannot freeze committed
-# xmax". This was fixed by changing pruning to find all
-# HEAPTUPLE_RECENTLY_DEAD tuples with xmaxes preceding OldestXmin
-# HEAPTUPLE_DEAD and removing them.
-
-# With the fix, VACUUM should finish successfully, incrementing the table
-# vacuum_count.
-$node_primary->poll_query_until($test_db,
-	qq[
-	SELECT vacuum_count > 0
-	FROM pg_stat_all_tables WHERE relname = '${table1}';
-	]
-	, 't');
-
-$primary_lsn = $node_primary->lsn('flush');
-
-# Make sure something causes us to flush
-$node_primary->safe_psql($test_db, "INSERT INTO $table1 VALUES (1);");
-
-# Nothing on the replica should cause a recovery conflict, so this should
-# finish successfully.
-$node_primary->wait_for_catchup($node_replica, 'replay', $primary_lsn);
-
-## Shut down psqls
-$psql_primaryA->quit;
-$psql_primaryB->quit;
-
-$node_replica->stop();
-$node_primary->stop();
-
-done_testing();
-- 
2.45.2

v4-0002-Add-more-debugging-information-when-reading-stats.patchtext/x-diff; charset=us-asciiDownload
From cb1f44150c0f047facd0ad7fa9a782fa5101d769 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 23 Jul 2024 12:31:08 +0900
Subject: [PATCH v4 2/4] Add more debugging information when reading stats
 files

This is useful to know which part of a stats file is corrupted when
loading it.
---
 src/backend/utils/activity/pgstat.c | 65 +++++++++++++++++++++++++++--
 1 file changed, 61 insertions(+), 4 deletions(-)

diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 2e22bf2707..4a1724d342 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -1537,9 +1537,18 @@ pgstat_read_statsfile(void)
 	/*
 	 * Verify it's of the expected format.
 	 */
-	if (!read_chunk_s(fpin, &format_id) ||
-		format_id != PGSTAT_FILE_FORMAT_ID)
+	if (!read_chunk_s(fpin, &format_id))
+	{
+		elog(WARNING, "could not read format ID");
 		goto error;
+	}
+
+	if (format_id != PGSTAT_FILE_FORMAT_ID)
+	{
+		elog(WARNING, "found incorrect format ID %d (expected %d)",
+			 format_id, PGSTAT_FILE_FORMAT_ID);
+		goto error;
+	}
 
 	/*
 	 * We found an existing statistics file. Read it and put all the stats
@@ -1559,22 +1568,37 @@ pgstat_read_statsfile(void)
 
 					/* entry for fixed-numbered stats */
 					if (!read_chunk_s(fpin, &kind))
+					{
+						elog(WARNING, "could not read stats kind for entry of type %c", t);
 						goto error;
+					}
 
 					if (!pgstat_is_kind_valid(kind))
+					{
+						elog(WARNING, "invalid stats kind %d for entry of type %c",
+							 kind, t);
 						goto error;
+					}
 
 					info = pgstat_get_kind_info(kind);
 
 					if (!info->fixed_amount)
+					{
+						elog(WARNING, "invalid fixed_amount in stats kind %d for entry of type %c",
+							 kind, t);
 						goto error;
+					}
 
 					/* Load back stats into shared memory */
 					ptr = ((char *) shmem) + info->shared_ctl_off +
 						info->shared_data_off;
 
 					if (!read_chunk(fpin, ptr, info->shared_data_len))
+					{
+						elog(WARNING, "could not read data of stats kind %d for entry of type %c",
+							 kind, t);
 						goto error;
+					}
 
 					break;
 				}
@@ -1591,10 +1615,17 @@ pgstat_read_statsfile(void)
 					{
 						/* normal stats entry, identified by PgStat_HashKey */
 						if (!read_chunk_s(fpin, &key))
+						{
+							elog(WARNING, "could not read key for entry of type %c", t);
 							goto error;
+						}
 
 						if (!pgstat_is_kind_valid(key.kind))
+						{
+							elog(WARNING, "invalid stats kind for entry %d/%u/%u of type %c",
+								 key.kind, key.dboid, key.objoid, t);
 							goto error;
+						}
 					}
 					else
 					{
@@ -1604,22 +1635,41 @@ pgstat_read_statsfile(void)
 						NameData	name;
 
 						if (!read_chunk_s(fpin, &kind))
+						{
+							elog(WARNING, "could not read stats kind for entry of type %c", t);
 							goto error;
+						}
 						if (!read_chunk_s(fpin, &name))
+						{
+							elog(WARNING, "could not read name of stats kind %d for entry of type %c",
+								 kind, t);
 							goto error;
+						}
 						if (!pgstat_is_kind_valid(kind))
+						{
+							elog(WARNING, "invalid stats kind %d for entry of type %c",
+								 kind, t);
 							goto error;
+						}
 
 						kind_info = pgstat_get_kind_info(kind);
 
 						if (!kind_info->from_serialized_name)
+						{
+							elog(WARNING, "invalid from_serialized_name in stats kind %d for entry of type %c",
+								 kind, t);
 							goto error;
+						}
 
 						if (!kind_info->from_serialized_name(&name, &key))
 						{
 							/* skip over data for entry we don't care about */
 							if (fseek(fpin, pgstat_get_entry_len(kind), SEEK_CUR) != 0)
+							{
+								elog(WARNING, "could not seek \"%s\" of stats kind %d for entry of type %c",
+									 NameStr(name), kind, t);
 								goto error;
+							}
 
 							continue;
 						}
@@ -1638,8 +1688,8 @@ pgstat_read_statsfile(void)
 					if (found)
 					{
 						dshash_release_lock(pgStatLocal.shared_hash, p);
-						elog(WARNING, "found duplicate stats entry %d/%u/%u",
-							 key.kind, key.dboid, key.objoid);
+						elog(WARNING, "found duplicate stats entry %d/%u/%u of type %c",
+							 key.kind, key.dboid, key.objoid, t);
 						goto error;
 					}
 
@@ -1649,7 +1699,11 @@ pgstat_read_statsfile(void)
 					if (!read_chunk(fpin,
 									pgstat_get_entry_data(key.kind, header),
 									pgstat_get_entry_len(key.kind)))
+					{
+						elog(WARNING, "could not read data for entry %d/%u/%u of type %c",
+							 key.kind, key.dboid, key.objoid, t);
 						goto error;
+					}
 
 					break;
 				}
@@ -1660,7 +1714,10 @@ pgstat_read_statsfile(void)
 				 * file
 				 */
 				if (fgetc(fpin) != EOF)
+				{
+					elog(WARNING, "could not read end-of-file");
 					goto error;
+				}
 
 				goto done;
 
-- 
2.45.2

v4-0003-Add-redo-LSN-to-pgstats-file.patchtext/x-diff; charset=us-asciiDownload
From 5a4108f1c23619558f205033964a4257545af6a4 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 23 Jul 2024 12:40:08 +0900
Subject: [PATCH v4 3/4] Add redo LSN to pgstats file

This is used in the startup process to check that the file we are
loading is the one referring to the latest checkpoint.

Bump PGSTAT_FILE_FORMAT_ID.
---
 src/include/pgstat.h                |  5 +++--
 src/backend/access/transam/xlog.c   |  2 +-
 src/backend/utils/activity/pgstat.c | 35 +++++++++++++++++++++++------
 3 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6b99bb8aad..043d39a565 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
@@ -235,7 +236,7 @@ typedef struct PgStat_TableXactStatus
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BCAD
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BCAE
 
 typedef struct PgStat_ArchiverStats
 {
@@ -466,7 +467,7 @@ extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
 /* Functions called during server startup / shutdown */
-extern void pgstat_restore_stats(void);
+extern void pgstat_restore_stats(XLogRecPtr redo);
 extern void pgstat_discard_stats(void);
 extern void pgstat_before_server_shutdown(int code, Datum arg);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index d70ba67bac..fa076050f4 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5642,7 +5642,7 @@ StartupXLOG(void)
 	if (didCrash)
 		pgstat_discard_stats();
 	else
-		pgstat_restore_stats();
+		pgstat_restore_stats(checkPoint.redo);
 
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 4a1724d342..593763eb4c 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -94,6 +94,7 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "access/xlog.h"
 #include "lib/dshash.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -171,8 +172,8 @@ typedef struct PgStat_SnapshotEntry
  * ----------
  */
 
-static void pgstat_write_statsfile(void);
-static void pgstat_read_statsfile(void);
+static void pgstat_write_statsfile(XLogRecPtr redo);
+static void pgstat_read_statsfile(XLogRecPtr redo);
 
 static void pgstat_reset_after_failure(void);
 
@@ -448,9 +449,9 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
  * Should only be called by the startup process or in single user mode.
  */
 void
-pgstat_restore_stats(void)
+pgstat_restore_stats(XLogRecPtr redo)
 {
-	pgstat_read_statsfile();
+	pgstat_read_statsfile(redo);
 }
 
 /*
@@ -526,7 +527,7 @@ pgstat_before_server_shutdown(int code, Datum arg)
 	if (code == 0)
 	{
 		pgStatLocal.shmem->is_shutdown = true;
-		pgstat_write_statsfile();
+		pgstat_write_statsfile(GetRedoRecPtr());
 	}
 }
 
@@ -1349,7 +1350,7 @@ write_chunk(FILE *fpout, void *ptr, size_t len)
  * stats so locking is not required.
  */
 static void
-pgstat_write_statsfile(void)
+pgstat_write_statsfile(XLogRecPtr redo)
 {
 	FILE	   *fpout;
 	int32		format_id;
@@ -1387,6 +1388,9 @@ pgstat_write_statsfile(void)
 	format_id = PGSTAT_FILE_FORMAT_ID;
 	write_chunk_s(fpout, &format_id);
 
+	/* Write the redo LSN, used to cross check the file read */
+	write_chunk_s(fpout, &redo);
+
 	/* Write various stats structs for fixed number of objects */
 	for (int kind = PGSTAT_KIND_FIRST_VALID; kind <= PGSTAT_KIND_LAST; kind++)
 	{
@@ -1501,13 +1505,14 @@ read_chunk(FILE *fpin, void *ptr, size_t len)
  * stats so locking is not required.
  */
 static void
-pgstat_read_statsfile(void)
+pgstat_read_statsfile(XLogRecPtr redo)
 {
 	FILE	   *fpin;
 	int32		format_id;
 	bool		found;
 	const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 	PgStat_ShmemControl *shmem = pgStatLocal.shmem;
+	XLogRecPtr	file_redo;
 
 	/* shouldn't be called from postmaster */
 	Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
@@ -1550,6 +1555,22 @@ pgstat_read_statsfile(void)
 		goto error;
 	}
 
+	/*
+	 * Read the redo LSN stored in the file.
+	 */
+	if (!read_chunk_s(fpin, &file_redo))
+	{
+		elog(WARNING, "could not read redo LSN");
+		goto error;
+	}
+
+	if (file_redo != redo)
+	{
+		elog(WARNING, "found incorrect redo LSN %X/%X (expected %X/%X)",
+			 LSN_FORMAT_ARGS(file_redo), LSN_FORMAT_ARGS(redo));
+		goto error;
+	}
+
 	/*
 	 * We found an existing statistics file. Read it and put all the stats
 	 * data into place.
-- 
2.45.2

v4-0004-Flush-pgstats-file-during-checkpoints.patchtext/x-diff; charset=us-asciiDownload
From 287dd0dd23b734a9f452a08ff959db7bb080e074 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 17 Jul 2024 12:42:43 +0900
Subject: [PATCH v4 4/4] Flush pgstats file during checkpoints

The flushes are done for non-shutdown checkpoints and restart points, so
as it is possible to keep some statistics around in the event of a
crash.  Keeping the before_shmem_exit() callback to flush the pgstats
file in a shutdown sequence is a bit easier than doing so at checkpoint,
as they may be stats to flush before we are completely done with the
shutdown checkpoint.  Keeping the callback also keeps the shutdown of
single-user mode simpler, where the stats also need to be flushed.

At the beginning of recovery, the stats file is loaded when the redo LSN
it stores matches with the point we are replaying from.  The file is not
removed anymore after being read, to ensure that there is still a source
of stats to feed on should the system crash until the next checkpoint
that would update the stats.
---
 src/include/pgstat.h                          |  4 +-
 src/backend/access/transam/xlog.c             | 30 +++++---
 src/backend/utils/activity/pgstat.c           | 68 +++++--------------
 src/test/recovery/t/029_stats_restart.pl      | 26 +++++--
 .../recovery/t/030_stats_cleanup_replica.pl   |  2 +-
 5 files changed, 59 insertions(+), 71 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 043d39a565..f780b56cc3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -466,9 +466,9 @@ typedef struct PgStat_PendingWalStats
 extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
-/* Functions called during server startup / shutdown */
+/* Functions called during server startup / checkpoint / shutdown */
 extern void pgstat_restore_stats(XLogRecPtr redo);
-extern void pgstat_discard_stats(void);
+extern void pgstat_flush_stats(XLogRecPtr redo);
 extern void pgstat_before_server_shutdown(int code, Datum arg);
 
 /* Functions for backend initialization */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fa076050f4..1b52b03841 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5391,7 +5391,6 @@ StartupXLOG(void)
 	XLogCtlInsert *Insert;
 	CheckPoint	checkPoint;
 	bool		wasShutdown;
-	bool		didCrash;
 	bool		haveTblspcMap;
 	bool		haveBackupLabel;
 	XLogRecPtr	EndOfLog;
@@ -5511,10 +5510,7 @@ StartupXLOG(void)
 	{
 		RemoveTempXlogFiles();
 		SyncDataDirectory();
-		didCrash = true;
 	}
-	else
-		didCrash = false;
 
 	/*
 	 * Prepare for WAL recovery if needed.
@@ -5635,14 +5631,8 @@ StartupXLOG(void)
 	 *
 	 * NB: Restoring replication slot stats relies on slot state to have
 	 * already been restored from disk.
-	 *
-	 * TODO: With a bit of extra work we could just start with a pgstat file
-	 * associated with the checkpoint redo location we're starting from.
 	 */
-	if (didCrash)
-		pgstat_discard_stats();
-	else
-		pgstat_restore_stats(checkPoint.redo);
+	pgstat_restore_stats(checkPoint.redo);
 
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
@@ -7213,6 +7203,15 @@ CreateCheckPoint(int flags)
 	 */
 	END_CRIT_SECTION();
 
+	/*
+	 * The control file update is done, hence it is time to write some fresh
+	 * statistics.  Stats are flushed at shutdown by the checkpointer in a
+	 * dedicated before_shmem_exit callback, combined with sanity checks
+	 * related to the shutdown of pgstats.
+	 */
+	if (!shutdown)
+		pgstat_flush_stats(RedoRecPtr);
+
 	/*
 	 * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
 	 * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
@@ -7682,6 +7681,15 @@ CreateRestartPoint(int flags)
 	}
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * The control file update is done, hence it is time to write some fresh
+	 * statistics.  Stats are flushed at shutdown by the checkpointer in a
+	 * dedicated before_shmem_exit callback, combined with sanity checks
+	 * related to the shutdown of pgstats.
+	 */
+	if ((flags & CHECKPOINT_IS_SHUTDOWN) == 0)
+		pgstat_flush_stats(RedoRecPtr);
+
 	/*
 	 * Update the average distance between checkpoints/restartpoints if the
 	 * prior checkpoint exists.
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 593763eb4c..cda21b2814 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -10,9 +10,10 @@
  * statistics.
  *
  * Statistics are loaded from the filesystem during startup (by the startup
- * process), unless preceded by a crash, in which case all stats are
- * discarded. They are written out by the checkpointer process just before
- * shutting down, except when shutting down in immediate mode.
+ * process) if the stats file has a redo LSN that matches with the one stored
+ * in the control file. They are written out by the checkpointer during
+ * checkpoints and restart points, as well as before shutting down, except
+ * when shutting down in immediate mode.
  *
  * Fixed-numbered stats are stored in plain (non-dynamic) shared memory.
  *
@@ -455,53 +456,19 @@ pgstat_restore_stats(XLogRecPtr redo)
 }
 
 /*
- * Remove the stats file.  This is currently used only if WAL recovery is
- * needed after a crash.
+ * Write stats in memory to disk.
  *
- * Should only be called by the startup process or in single user mode.
+ * This is called by the checkpointer or in single-user mode.
  */
 void
-pgstat_discard_stats(void)
+pgstat_flush_stats(XLogRecPtr redo)
 {
-	int			ret;
-
-	/* NB: this needs to be done even in single user mode */
-
-	ret = unlink(PGSTAT_STAT_PERMANENT_FILENAME);
-	if (ret != 0)
-	{
-		if (errno == ENOENT)
-			elog(DEBUG2,
-				 "didn't need to unlink permanent stats file \"%s\" - didn't exist",
-				 PGSTAT_STAT_PERMANENT_FILENAME);
-		else
-			ereport(LOG,
-					(errcode_for_file_access(),
-					 errmsg("could not unlink permanent statistics file \"%s\": %m",
-							PGSTAT_STAT_PERMANENT_FILENAME)));
-	}
-	else
-	{
-		ereport(DEBUG2,
-				(errcode_for_file_access(),
-				 errmsg_internal("unlinked permanent statistics file \"%s\"",
-								 PGSTAT_STAT_PERMANENT_FILENAME)));
-	}
-
-	/*
-	 * Reset stats contents. This will set reset timestamps of fixed-numbered
-	 * stats to the current time (no variable stats exist).
-	 */
-	pgstat_reset_after_failure();
+	pgstat_write_statsfile(redo);
 }
 
 /*
  * pgstat_before_server_shutdown() needs to be called by exactly one process
- * during regular server shutdowns. Otherwise all stats will be lost.
- *
- * We currently only write out stats for proc_exit(0). We might want to change
- * that at some point... But right now pgstat_discard_stats() would be called
- * during the start after a disorderly shutdown, anyway.
+ * during regular server shutdowns.
  */
 void
 pgstat_before_server_shutdown(int code, Datum arg)
@@ -526,6 +493,9 @@ pgstat_before_server_shutdown(int code, Datum arg)
 	 */
 	if (code == 0)
 	{
+		/* we're shutting down, so it's ok to just override this */
+		pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
+
 		pgStatLocal.shmem->is_shutdown = true;
 		pgstat_write_statsfile(GetRedoRecPtr());
 	}
@@ -1346,8 +1316,8 @@ write_chunk(FILE *fpout, void *ptr, size_t len)
 #define write_chunk_s(fpout, ptr) write_chunk(fpout, ptr, sizeof(*ptr))
 
 /*
- * This function is called in the last process that is accessing the shared
- * stats so locking is not required.
+ * This function is called in the checkpointer or in single-user mode,
+ * so locking is not required.
  */
 static void
 pgstat_write_statsfile(XLogRecPtr redo)
@@ -1364,10 +1334,8 @@ pgstat_write_statsfile(XLogRecPtr redo)
 	/* should be called only by the checkpointer or single user mode */
 	Assert(!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
 
-	/* we're shutting down, so it's ok to just override this */
-	pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
-
-	elog(DEBUG2, "writing stats file \"%s\"", statfile);
+	elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile,
+		 LSN_FORMAT_ARGS(redo));
 
 	/*
 	 * Open the statistics temp file to write out the current values.
@@ -1422,7 +1390,6 @@ pgstat_write_statsfile(XLogRecPtr redo)
 		CHECK_FOR_INTERRUPTS();
 
 		/* we may have some "dropped" entries not yet removed, skip them */
-		Assert(!ps->dropped);
 		if (ps->dropped)
 			continue;
 
@@ -1750,9 +1717,6 @@ pgstat_read_statsfile(XLogRecPtr redo)
 done:
 	FreeFile(fpin);
 
-	elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-	unlink(statfile);
-
 	return;
 
 error:
diff --git a/src/test/recovery/t/029_stats_restart.pl b/src/test/recovery/t/029_stats_restart.pl
index 93a7209f69..73d059b9f9 100644
--- a/src/test/recovery/t/029_stats_restart.pl
+++ b/src/test/recovery/t/029_stats_restart.pl
@@ -59,7 +59,7 @@ ok(-f "$og_stats", "origin stats file must exist");
 copy($og_stats, $statsfile) or die "Copy failed: $!";
 
 
-## test discarding of stats file after crash etc
+## test retention of stats file after crash etc
 
 $node->start;
 
@@ -69,12 +69,28 @@ is(have_stats('function', $dboid, $funcoid),
 	't', "$sect: function stats do exist");
 is(have_stats('relation', $dboid, $tableoid),
 	't', "$sect: relation stats do exist");
+# Flush the stats to disk.
+$node->psql('postgres', 'checkpoint');
 
 $node->stop('immediate');
 
-ok(!-f "$og_stats", "no stats file should exist after immediate shutdown");
+ok(-f "$og_stats", "stats file should exist after immediate shutdown");
 
-# copy the old stats back to test we discard stats after crash restart
+# Start once, there should be stats restored from the previous checkpoint.
+$node->start;
+
+$sect = "crashrecovery";
+is(have_stats('database', $dboid, 0), 't', "$sect: db stats do exist");
+is(have_stats('function', $dboid, $funcoid),
+	't', "$sect: function stats do exist");
+is(have_stats('relation', $dboid, $tableoid),
+	't', "$sect: relation stats do exist");
+
+$node->stop('immediate');
+
+# Copy the old stats back, and test that we discard stats after crash restart
+# because these don't match with the redo LSN stored in the stats file
+# generated by the previous checkpoint.
 copy($statsfile, $og_stats) or die "Copy failed: $!";
 
 $node->start;
@@ -274,9 +290,9 @@ my $wal_restart_immediate = wal_stats();
 
 cmp_ok(
 	$wal_reset_restart->{reset},
-	'lt',
+	'eq',
 	$wal_restart_immediate->{reset},
-	"$sect: reset timestamp is new");
+	"$sect: reset timestamp is the same");
 
 $node->stop;
 done_testing();
diff --git a/src/test/recovery/t/030_stats_cleanup_replica.pl b/src/test/recovery/t/030_stats_cleanup_replica.pl
index 74b516cc7c..5735921c99 100644
--- a/src/test/recovery/t/030_stats_cleanup_replica.pl
+++ b/src/test/recovery/t/030_stats_cleanup_replica.pl
@@ -115,7 +115,7 @@ $node_standby->start();
 $sect = "post immediate restart";
 
 test_standby_func_tab_stats_status('postgres',
-	$dboid, $tableoid, $funcoid, 'f');
+	$dboid, $tableoid, $funcoid, 't');
 
 
 done_testing();
-- 
2.45.2

#14Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#13)
Re: Flush pgstats file during checkpoints

Hi,

On Tue, Jul 23, 2024 at 12:52:11PM +0900, Michael Paquier wrote:

On Mon, Jul 22, 2024 at 07:01:41AM +0000, Bertrand Drouvot wrote:

3 ===

+       /*
+        * Read the redo LSN stored in the file.
+        */
+       if (!read_chunk_s(fpin, &file_redo) ||
+               file_redo != redo)
+               goto error;

I wonder if it would make sense to have dedicated error messages for
"file_redo != redo" and for "format_id != PGSTAT_FILE_FORMAT_ID". That would
ease to diagnose as to why the stat file is discarded.

Yep. This has been itching me quite a bit, and that's a bit more than
just the format ID or the redo LSN: it relates to all the read_chunk()
callers. I've taken a shot at this with patch 0001, implemented on
top of the rest.

Thanks! 0001 attached is v4-0001-Revert-Test-that-vacuum-removes-tuples-older-than.patch
so I guess you did not attached the right one.

Attaching a new v4 series, with all these comments addressed.

Thanks!

Looking at 0002:

1 ===

   if (!read_chunk(fpin, ptr, info->shared_data_len))
+  {
+		elog(WARNING, "could not read data of stats kind %d for entry of type %c",
+    			kind, t);

Nit: what about to include the "info->shared_data_len" value in the WARNING?

2 ===

   if (!read_chunk_s(fpin, &name))
+  {
+ 		elog(WARNING, "could not read name of stats kind %d for entry of type %c",
+                kind, t);
        goto error;
+  }
   if (!pgstat_is_kind_valid(kind))
+  {
+       elog(WARNING, "invalid stats kind %d for entry of type %c",
+                kind, t);
        goto error;
+  }

Shouldn't we swap those 2 tests so that we check that the kind is valid right
after this one?

  if (!read_chunk_s(fpin, &kind))
+ {
+      elog(WARNING, "could not read stats kind for entry of type %c", t);
       goto error;
+ }

Looking at 0003: LGTM

Looking at 0004: LGTM

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#15Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#14)
2 attachment(s)
Re: Flush pgstats file during checkpoints

On Mon, Jul 29, 2024 at 04:46:17AM +0000, Bertrand Drouvot wrote:

Thanks! 0001 attached is v4-0001-Revert-Test-that-vacuum-removes-tuples-older-than.patch
so I guess you did not attached the right one.

I did attach the right set of patches, please ignore 0001 entirely:
the patch series is made of three patches, beginning with 0002 :)

Looking at 0002:

if (!read_chunk(fpin, ptr, info->shared_data_len))
+  {
+		elog(WARNING, "could not read data of stats kind %d for entry of type %c",
+    			kind, t);

Nit: what about to include the "info->shared_data_len" value in the WARNING?

Good idea, so added.

if (!read_chunk_s(fpin, &name))
+  {
+ 		elog(WARNING, "could not read name of stats kind %d for entry of type %c",
+                kind, t);
goto error;
+  }
if (!pgstat_is_kind_valid(kind))
+  {
+       elog(WARNING, "invalid stats kind %d for entry of type %c",
+                kind, t);
goto error;
+  }

Shouldn't we swap those 2 tests so that we check that the kind is valid right
after this one?

Hmm. We could, but this order is not buggy either. So I've let it
as-is for now, just adding the WARNINGs.

By the way, I have noticed an extra path where a WARNING would not be
logged while playing with corrupted pgstats files: when the entry type
itself is incorrect. I have added an extra elog() in this case, and
applied 0001. Err.. 0002, sorry ;)

Looking at 0003: LGTM
Looking at 0004: LGTM

Thanks. Attached are the two remaining patches, for now. I'm
planning to get back to these in a few days.
--
Michael

Attachments:

v5-0001-Add-redo-LSN-to-pgstats-file.patchtext/x-diff; charset=us-asciiDownload
From c0c2a3211bea705b49185a63975c86589b46871a Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Tue, 23 Jul 2024 12:40:08 +0900
Subject: [PATCH v5 1/2] Add redo LSN to pgstats file

This is used in the startup process to check that the file we are
loading is the one referring to the latest checkpoint.

Bump PGSTAT_FILE_FORMAT_ID.
---
 src/include/pgstat.h                |  5 +++--
 src/backend/access/transam/xlog.c   |  2 +-
 src/backend/utils/activity/pgstat.c | 35 +++++++++++++++++++++++------
 3 files changed, 32 insertions(+), 10 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 6b99bb8aad..043d39a565 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -11,6 +11,7 @@
 #ifndef PGSTAT_H
 #define PGSTAT_H
 
+#include "access/xlogdefs.h"
 #include "datatype/timestamp.h"
 #include "portability/instr_time.h"
 #include "postmaster/pgarch.h"	/* for MAX_XFN_CHARS */
@@ -235,7 +236,7 @@ typedef struct PgStat_TableXactStatus
  * ------------------------------------------------------------
  */
 
-#define PGSTAT_FILE_FORMAT_ID	0x01A5BCAD
+#define PGSTAT_FILE_FORMAT_ID	0x01A5BCAE
 
 typedef struct PgStat_ArchiverStats
 {
@@ -466,7 +467,7 @@ extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
 /* Functions called during server startup / shutdown */
-extern void pgstat_restore_stats(void);
+extern void pgstat_restore_stats(XLogRecPtr redo);
 extern void pgstat_discard_stats(void);
 extern void pgstat_before_server_shutdown(int code, Datum arg);
 
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index f86f4b5c4b..b7bb4f9a31 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5652,7 +5652,7 @@ StartupXLOG(void)
 	if (didCrash)
 		pgstat_discard_stats();
 	else
-		pgstat_restore_stats();
+		pgstat_restore_stats(checkPoint.redo);
 
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 228cdf73f7..65ae488e2a 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -94,6 +94,7 @@
 #include <unistd.h>
 
 #include "access/xact.h"
+#include "access/xlog.h"
 #include "lib/dshash.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -171,8 +172,8 @@ typedef struct PgStat_SnapshotEntry
  * ----------
  */
 
-static void pgstat_write_statsfile(void);
-static void pgstat_read_statsfile(void);
+static void pgstat_write_statsfile(XLogRecPtr redo);
+static void pgstat_read_statsfile(XLogRecPtr redo);
 
 static void pgstat_reset_after_failure(void);
 
@@ -448,9 +449,9 @@ static const PgStat_KindInfo pgstat_kind_infos[PGSTAT_NUM_KINDS] = {
  * Should only be called by the startup process or in single user mode.
  */
 void
-pgstat_restore_stats(void)
+pgstat_restore_stats(XLogRecPtr redo)
 {
-	pgstat_read_statsfile();
+	pgstat_read_statsfile(redo);
 }
 
 /*
@@ -526,7 +527,7 @@ pgstat_before_server_shutdown(int code, Datum arg)
 	if (code == 0)
 	{
 		pgStatLocal.shmem->is_shutdown = true;
-		pgstat_write_statsfile();
+		pgstat_write_statsfile(GetRedoRecPtr());
 	}
 }
 
@@ -1349,7 +1350,7 @@ write_chunk(FILE *fpout, void *ptr, size_t len)
  * stats so locking is not required.
  */
 static void
-pgstat_write_statsfile(void)
+pgstat_write_statsfile(XLogRecPtr redo)
 {
 	FILE	   *fpout;
 	int32		format_id;
@@ -1387,6 +1388,9 @@ pgstat_write_statsfile(void)
 	format_id = PGSTAT_FILE_FORMAT_ID;
 	write_chunk_s(fpout, &format_id);
 
+	/* Write the redo LSN, used to cross check the file read */
+	write_chunk_s(fpout, &redo);
+
 	/* Write various stats structs for fixed number of objects */
 	for (int kind = PGSTAT_KIND_FIRST_VALID; kind <= PGSTAT_KIND_LAST; kind++)
 	{
@@ -1501,13 +1505,14 @@ read_chunk(FILE *fpin, void *ptr, size_t len)
  * stats so locking is not required.
  */
 static void
-pgstat_read_statsfile(void)
+pgstat_read_statsfile(XLogRecPtr redo)
 {
 	FILE	   *fpin;
 	int32		format_id;
 	bool		found;
 	const char *statfile = PGSTAT_STAT_PERMANENT_FILENAME;
 	PgStat_ShmemControl *shmem = pgStatLocal.shmem;
+	XLogRecPtr	file_redo;
 
 	/* shouldn't be called from postmaster */
 	Assert(IsUnderPostmaster || !IsPostmasterEnvironment);
@@ -1550,6 +1555,22 @@ pgstat_read_statsfile(void)
 		goto error;
 	}
 
+	/*
+	 * Read the redo LSN stored in the file.
+	 */
+	if (!read_chunk_s(fpin, &file_redo))
+	{
+		elog(WARNING, "could not read redo LSN");
+		goto error;
+	}
+
+	if (file_redo != redo)
+	{
+		elog(WARNING, "found incorrect redo LSN %X/%X (expected %X/%X)",
+			 LSN_FORMAT_ARGS(file_redo), LSN_FORMAT_ARGS(redo));
+		goto error;
+	}
+
 	/*
 	 * We found an existing statistics file. Read it and put all the stats
 	 * data into place.
-- 
2.45.2

v5-0002-Flush-pgstats-file-during-checkpoints.patchtext/x-diff; charset=us-asciiDownload
From d9dd1206b2895fe96c0766244cd65a1b9861e783 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 17 Jul 2024 12:42:43 +0900
Subject: [PATCH v5 2/2] Flush pgstats file during checkpoints

The flushes are done for non-shutdown checkpoints and restart points, so
as it is possible to keep some statistics around in the event of a
crash.  Keeping the before_shmem_exit() callback to flush the pgstats
file in a shutdown sequence is a bit easier than doing so at checkpoint,
as they may be stats to flush before we are completely done with the
shutdown checkpoint.  Keeping the callback also keeps the shutdown of
single-user mode simpler, where the stats also need to be flushed.

At the beginning of recovery, the stats file is loaded when the redo LSN
it stores matches with the point we are replaying from.  The file is not
removed anymore after being read, to ensure that there is still a source
of stats to feed on should the system crash until the next checkpoint
that would update the stats.
---
 src/include/pgstat.h                          |  4 +-
 src/backend/access/transam/xlog.c             | 30 +++++---
 src/backend/utils/activity/pgstat.c           | 68 +++++--------------
 src/test/recovery/t/029_stats_restart.pl      | 26 +++++--
 .../recovery/t/030_stats_cleanup_replica.pl   |  2 +-
 5 files changed, 59 insertions(+), 71 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 043d39a565..f780b56cc3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -466,9 +466,9 @@ typedef struct PgStat_PendingWalStats
 extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
-/* Functions called during server startup / shutdown */
+/* Functions called during server startup / checkpoint / shutdown */
 extern void pgstat_restore_stats(XLogRecPtr redo);
-extern void pgstat_discard_stats(void);
+extern void pgstat_flush_stats(XLogRecPtr redo);
 extern void pgstat_before_server_shutdown(int code, Datum arg);
 
 /* Functions for backend initialization */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index b7bb4f9a31..40422a3f3f 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5401,7 +5401,6 @@ StartupXLOG(void)
 	XLogCtlInsert *Insert;
 	CheckPoint	checkPoint;
 	bool		wasShutdown;
-	bool		didCrash;
 	bool		haveTblspcMap;
 	bool		haveBackupLabel;
 	XLogRecPtr	EndOfLog;
@@ -5521,10 +5520,7 @@ StartupXLOG(void)
 	{
 		RemoveTempXlogFiles();
 		SyncDataDirectory();
-		didCrash = true;
 	}
-	else
-		didCrash = false;
 
 	/*
 	 * Prepare for WAL recovery if needed.
@@ -5645,14 +5641,8 @@ StartupXLOG(void)
 	 *
 	 * NB: Restoring replication slot stats relies on slot state to have
 	 * already been restored from disk.
-	 *
-	 * TODO: With a bit of extra work we could just start with a pgstat file
-	 * associated with the checkpoint redo location we're starting from.
 	 */
-	if (didCrash)
-		pgstat_discard_stats();
-	else
-		pgstat_restore_stats(checkPoint.redo);
+	pgstat_restore_stats(checkPoint.redo);
 
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
@@ -7244,6 +7234,15 @@ CreateCheckPoint(int flags)
 	 */
 	END_CRIT_SECTION();
 
+	/*
+	 * The control file update is done, hence it is time to write some fresh
+	 * statistics.  Stats are flushed at shutdown by the checkpointer in a
+	 * dedicated before_shmem_exit callback, combined with sanity checks
+	 * related to the shutdown of pgstats.
+	 */
+	if (!shutdown)
+		pgstat_flush_stats(RedoRecPtr);
+
 	/*
 	 * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
 	 * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
@@ -7713,6 +7712,15 @@ CreateRestartPoint(int flags)
 	}
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * The control file update is done, hence it is time to write some fresh
+	 * statistics.  Stats are flushed at shutdown by the checkpointer in a
+	 * dedicated before_shmem_exit callback, combined with sanity checks
+	 * related to the shutdown of pgstats.
+	 */
+	if ((flags & CHECKPOINT_IS_SHUTDOWN) == 0)
+		pgstat_flush_stats(RedoRecPtr);
+
 	/*
 	 * Update the average distance between checkpoints/restartpoints if the
 	 * prior checkpoint exists.
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 65ae488e2a..3f94d436a2 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -10,9 +10,10 @@
  * statistics.
  *
  * Statistics are loaded from the filesystem during startup (by the startup
- * process), unless preceded by a crash, in which case all stats are
- * discarded. They are written out by the checkpointer process just before
- * shutting down, except when shutting down in immediate mode.
+ * process) if the stats file has a redo LSN that matches with the one stored
+ * in the control file. They are written out by the checkpointer during
+ * checkpoints and restart points, as well as before shutting down, except
+ * when shutting down in immediate mode.
  *
  * Fixed-numbered stats are stored in plain (non-dynamic) shared memory.
  *
@@ -455,53 +456,19 @@ pgstat_restore_stats(XLogRecPtr redo)
 }
 
 /*
- * Remove the stats file.  This is currently used only if WAL recovery is
- * needed after a crash.
+ * Write stats in memory to disk.
  *
- * Should only be called by the startup process or in single user mode.
+ * This is called by the checkpointer or in single-user mode.
  */
 void
-pgstat_discard_stats(void)
+pgstat_flush_stats(XLogRecPtr redo)
 {
-	int			ret;
-
-	/* NB: this needs to be done even in single user mode */
-
-	ret = unlink(PGSTAT_STAT_PERMANENT_FILENAME);
-	if (ret != 0)
-	{
-		if (errno == ENOENT)
-			elog(DEBUG2,
-				 "didn't need to unlink permanent stats file \"%s\" - didn't exist",
-				 PGSTAT_STAT_PERMANENT_FILENAME);
-		else
-			ereport(LOG,
-					(errcode_for_file_access(),
-					 errmsg("could not unlink permanent statistics file \"%s\": %m",
-							PGSTAT_STAT_PERMANENT_FILENAME)));
-	}
-	else
-	{
-		ereport(DEBUG2,
-				(errcode_for_file_access(),
-				 errmsg_internal("unlinked permanent statistics file \"%s\"",
-								 PGSTAT_STAT_PERMANENT_FILENAME)));
-	}
-
-	/*
-	 * Reset stats contents. This will set reset timestamps of fixed-numbered
-	 * stats to the current time (no variable stats exist).
-	 */
-	pgstat_reset_after_failure();
+	pgstat_write_statsfile(redo);
 }
 
 /*
  * pgstat_before_server_shutdown() needs to be called by exactly one process
- * during regular server shutdowns. Otherwise all stats will be lost.
- *
- * We currently only write out stats for proc_exit(0). We might want to change
- * that at some point... But right now pgstat_discard_stats() would be called
- * during the start after a disorderly shutdown, anyway.
+ * during regular server shutdowns.
  */
 void
 pgstat_before_server_shutdown(int code, Datum arg)
@@ -526,6 +493,9 @@ pgstat_before_server_shutdown(int code, Datum arg)
 	 */
 	if (code == 0)
 	{
+		/* we're shutting down, so it's ok to just override this */
+		pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
+
 		pgStatLocal.shmem->is_shutdown = true;
 		pgstat_write_statsfile(GetRedoRecPtr());
 	}
@@ -1346,8 +1316,8 @@ write_chunk(FILE *fpout, void *ptr, size_t len)
 #define write_chunk_s(fpout, ptr) write_chunk(fpout, ptr, sizeof(*ptr))
 
 /*
- * This function is called in the last process that is accessing the shared
- * stats so locking is not required.
+ * This function is called in the checkpointer or in single-user mode,
+ * so locking is not required.
  */
 static void
 pgstat_write_statsfile(XLogRecPtr redo)
@@ -1364,10 +1334,8 @@ pgstat_write_statsfile(XLogRecPtr redo)
 	/* should be called only by the checkpointer or single user mode */
 	Assert(!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
 
-	/* we're shutting down, so it's ok to just override this */
-	pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
-
-	elog(DEBUG2, "writing stats file \"%s\"", statfile);
+	elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile,
+		 LSN_FORMAT_ARGS(redo));
 
 	/*
 	 * Open the statistics temp file to write out the current values.
@@ -1422,7 +1390,6 @@ pgstat_write_statsfile(XLogRecPtr redo)
 		CHECK_FOR_INTERRUPTS();
 
 		/* we may have some "dropped" entries not yet removed, skip them */
-		Assert(!ps->dropped);
 		if (ps->dropped)
 			continue;
 
@@ -1751,9 +1718,6 @@ pgstat_read_statsfile(XLogRecPtr redo)
 done:
 	FreeFile(fpin);
 
-	elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-	unlink(statfile);
-
 	return;
 
 error:
diff --git a/src/test/recovery/t/029_stats_restart.pl b/src/test/recovery/t/029_stats_restart.pl
index 93a7209f69..73d059b9f9 100644
--- a/src/test/recovery/t/029_stats_restart.pl
+++ b/src/test/recovery/t/029_stats_restart.pl
@@ -59,7 +59,7 @@ ok(-f "$og_stats", "origin stats file must exist");
 copy($og_stats, $statsfile) or die "Copy failed: $!";
 
 
-## test discarding of stats file after crash etc
+## test retention of stats file after crash etc
 
 $node->start;
 
@@ -69,12 +69,28 @@ is(have_stats('function', $dboid, $funcoid),
 	't', "$sect: function stats do exist");
 is(have_stats('relation', $dboid, $tableoid),
 	't', "$sect: relation stats do exist");
+# Flush the stats to disk.
+$node->psql('postgres', 'checkpoint');
 
 $node->stop('immediate');
 
-ok(!-f "$og_stats", "no stats file should exist after immediate shutdown");
+ok(-f "$og_stats", "stats file should exist after immediate shutdown");
 
-# copy the old stats back to test we discard stats after crash restart
+# Start once, there should be stats restored from the previous checkpoint.
+$node->start;
+
+$sect = "crashrecovery";
+is(have_stats('database', $dboid, 0), 't', "$sect: db stats do exist");
+is(have_stats('function', $dboid, $funcoid),
+	't', "$sect: function stats do exist");
+is(have_stats('relation', $dboid, $tableoid),
+	't', "$sect: relation stats do exist");
+
+$node->stop('immediate');
+
+# Copy the old stats back, and test that we discard stats after crash restart
+# because these don't match with the redo LSN stored in the stats file
+# generated by the previous checkpoint.
 copy($statsfile, $og_stats) or die "Copy failed: $!";
 
 $node->start;
@@ -274,9 +290,9 @@ my $wal_restart_immediate = wal_stats();
 
 cmp_ok(
 	$wal_reset_restart->{reset},
-	'lt',
+	'eq',
 	$wal_restart_immediate->{reset},
-	"$sect: reset timestamp is new");
+	"$sect: reset timestamp is the same");
 
 $node->stop;
 done_testing();
diff --git a/src/test/recovery/t/030_stats_cleanup_replica.pl b/src/test/recovery/t/030_stats_cleanup_replica.pl
index 74b516cc7c..5735921c99 100644
--- a/src/test/recovery/t/030_stats_cleanup_replica.pl
+++ b/src/test/recovery/t/030_stats_cleanup_replica.pl
@@ -115,7 +115,7 @@ $node_standby->start();
 $sect = "post immediate restart";
 
 test_standby_func_tab_stats_status('postgres',
-	$dboid, $tableoid, $funcoid, 'f');
+	$dboid, $tableoid, $funcoid, 't');
 
 
 done_testing();
-- 
2.45.2

#16Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#15)
Re: Flush pgstats file during checkpoints

Hi,

On Tue, Jul 30, 2024 at 03:25:31PM +0900, Michael Paquier wrote:

On Mon, Jul 29, 2024 at 04:46:17AM +0000, Bertrand Drouvot wrote:

Thanks! 0001 attached is v4-0001-Revert-Test-that-vacuum-removes-tuples-older-than.patch
so I guess you did not attached the right one.

I did attach the right set of patches, please ignore 0001 entirely:
the patch series is made of three patches, beginning with 0002 :)

Yeah, saw that ;-)

Looking at 0002:

if (!read_chunk(fpin, ptr, info->shared_data_len))
+  {
+		elog(WARNING, "could not read data of stats kind %d for entry of type %c",
+    			kind, t);

Nit: what about to include the "info->shared_data_len" value in the WARNING?

Good idea, so added.

Thanks!

Looking at 0003: LGTM
Looking at 0004: LGTM

Thanks. Attached are the two remaining patches, for now. I'm
planning to get back to these in a few days.

Did a quick check and still LGTM.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#17Michael Paquier
michael@paquier.xyz
In reply to: Bertrand Drouvot (#16)
1 attachment(s)
Re: Flush pgstats file during checkpoints

On Tue, Jul 30, 2024 at 08:53:48AM +0000, Bertrand Drouvot wrote:

Did a quick check and still LGTM.

Applied 0003 for now to add the redo LSN to the pgstats file, adding
the redo LSN to the two DEBUG2 entries when reading and writing while
on it, that I forgot. (It was not 01:57 where I am now.)

Attached is the last one.
--
Michael

Attachments:

v6-0001-Flush-pgstats-file-during-checkpoints.patchtext/x-diff; charset=us-asciiDownload
From 8f1f7a9ca9c19dae88057a5593b0f9c0cd748a88 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Wed, 17 Jul 2024 12:42:43 +0900
Subject: [PATCH v6] Flush pgstats file during checkpoints

The flushes are done for non-shutdown checkpoints and restart points, so
as it is possible to keep some statistics around in the event of a
crash.  Keeping the before_shmem_exit() callback to flush the pgstats
file in a shutdown sequence is a bit easier than doing so at checkpoint,
as they may be stats to flush before we are completely done with the
shutdown checkpoint.  Keeping the callback also keeps the shutdown of
single-user mode simpler, where the stats also need to be flushed.

At the beginning of recovery, the stats file is loaded when the redo LSN
it stores matches with the point we are replaying from.  The file is not
removed anymore after being read, to ensure that there is still a source
of stats to feed on should the system crash until the next checkpoint
that would update the stats.
---
 src/include/pgstat.h                          |  4 +-
 src/backend/access/transam/xlog.c             | 30 +++++----
 src/backend/utils/activity/pgstat.c           | 65 ++++---------------
 src/test/recovery/t/029_stats_restart.pl      | 26 ++++++--
 .../recovery/t/030_stats_cleanup_replica.pl   |  2 +-
 5 files changed, 57 insertions(+), 70 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 043d39a565..f780b56cc3 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -466,9 +466,9 @@ typedef struct PgStat_PendingWalStats
 extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
-/* Functions called during server startup / shutdown */
+/* Functions called during server startup / checkpoint / shutdown */
 extern void pgstat_restore_stats(XLogRecPtr redo);
-extern void pgstat_discard_stats(void);
+extern void pgstat_flush_stats(XLogRecPtr redo);
 extern void pgstat_before_server_shutdown(int code, Datum arg);
 
 /* Functions for backend initialization */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 6499eabe4d..f058d68fc3 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5401,7 +5401,6 @@ StartupXLOG(void)
 	XLogCtlInsert *Insert;
 	CheckPoint	checkPoint;
 	bool		wasShutdown;
-	bool		didCrash;
 	bool		haveTblspcMap;
 	bool		haveBackupLabel;
 	XLogRecPtr	EndOfLog;
@@ -5521,10 +5520,7 @@ StartupXLOG(void)
 	{
 		RemoveTempXlogFiles();
 		SyncDataDirectory();
-		didCrash = true;
 	}
-	else
-		didCrash = false;
 
 	/*
 	 * Prepare for WAL recovery if needed.
@@ -5645,14 +5641,8 @@ StartupXLOG(void)
 	 *
 	 * NB: Restoring replication slot stats relies on slot state to have
 	 * already been restored from disk.
-	 *
-	 * TODO: With a bit of extra work we could just start with a pgstat file
-	 * associated with the checkpoint redo location we're starting from.
 	 */
-	if (didCrash)
-		pgstat_discard_stats();
-	else
-		pgstat_restore_stats(checkPoint.redo);
+	pgstat_restore_stats(checkPoint.redo);
 
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
@@ -7244,6 +7234,15 @@ CreateCheckPoint(int flags)
 	 */
 	END_CRIT_SECTION();
 
+	/*
+	 * The control file update is done, hence it is time to write some fresh
+	 * statistics.  Stats are flushed at shutdown by the checkpointer in a
+	 * dedicated before_shmem_exit callback, combined with sanity checks
+	 * related to the shutdown of pgstats.
+	 */
+	if (!shutdown)
+		pgstat_flush_stats(RedoRecPtr);
+
 	/*
 	 * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
 	 * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
@@ -7713,6 +7712,15 @@ CreateRestartPoint(int flags)
 	}
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * The control file update is done, hence it is time to write some fresh
+	 * statistics.  Stats are flushed at shutdown by the checkpointer in a
+	 * dedicated before_shmem_exit callback, combined with sanity checks
+	 * related to the shutdown of pgstats.
+	 */
+	if ((flags & CHECKPOINT_IS_SHUTDOWN) == 0)
+		pgstat_flush_stats(RedoRecPtr);
+
 	/*
 	 * Update the average distance between checkpoints/restartpoints if the
 	 * prior checkpoint exists.
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 81484222cf..0b02353359 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -10,9 +10,10 @@
  * statistics.
  *
  * Statistics are loaded from the filesystem during startup (by the startup
- * process), unless preceded by a crash, in which case all stats are
- * discarded. They are written out by the checkpointer process just before
- * shutting down, except when shutting down in immediate mode.
+ * process) if the stats file has a redo LSN that matches with the one stored
+ * in the control file. They are written out by the checkpointer during
+ * checkpoints and restart points, as well as before shutting down, except
+ * when shutting down in immediate mode.
  *
  * Fixed-numbered stats are stored in plain (non-dynamic) shared memory.
  *
@@ -455,53 +456,19 @@ pgstat_restore_stats(XLogRecPtr redo)
 }
 
 /*
- * Remove the stats file.  This is currently used only if WAL recovery is
- * needed after a crash.
+ * Write stats in memory to disk.
  *
- * Should only be called by the startup process or in single user mode.
+ * This is called by the checkpointer or in single-user mode.
  */
 void
-pgstat_discard_stats(void)
+pgstat_flush_stats(XLogRecPtr redo)
 {
-	int			ret;
-
-	/* NB: this needs to be done even in single user mode */
-
-	ret = unlink(PGSTAT_STAT_PERMANENT_FILENAME);
-	if (ret != 0)
-	{
-		if (errno == ENOENT)
-			elog(DEBUG2,
-				 "didn't need to unlink permanent stats file \"%s\" - didn't exist",
-				 PGSTAT_STAT_PERMANENT_FILENAME);
-		else
-			ereport(LOG,
-					(errcode_for_file_access(),
-					 errmsg("could not unlink permanent statistics file \"%s\": %m",
-							PGSTAT_STAT_PERMANENT_FILENAME)));
-	}
-	else
-	{
-		ereport(DEBUG2,
-				(errcode_for_file_access(),
-				 errmsg_internal("unlinked permanent statistics file \"%s\"",
-								 PGSTAT_STAT_PERMANENT_FILENAME)));
-	}
-
-	/*
-	 * Reset stats contents. This will set reset timestamps of fixed-numbered
-	 * stats to the current time (no variable stats exist).
-	 */
-	pgstat_reset_after_failure();
+	pgstat_write_statsfile(redo);
 }
 
 /*
  * pgstat_before_server_shutdown() needs to be called by exactly one process
- * during regular server shutdowns. Otherwise all stats will be lost.
- *
- * We currently only write out stats for proc_exit(0). We might want to change
- * that at some point... But right now pgstat_discard_stats() would be called
- * during the start after a disorderly shutdown, anyway.
+ * during regular server shutdowns.
  */
 void
 pgstat_before_server_shutdown(int code, Datum arg)
@@ -526,6 +493,9 @@ pgstat_before_server_shutdown(int code, Datum arg)
 	 */
 	if (code == 0)
 	{
+		/* we're shutting down, so it's ok to just override this */
+		pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
+
 		pgStatLocal.shmem->is_shutdown = true;
 		pgstat_write_statsfile(GetRedoRecPtr());
 	}
@@ -1346,8 +1316,8 @@ write_chunk(FILE *fpout, void *ptr, size_t len)
 #define write_chunk_s(fpout, ptr) write_chunk(fpout, ptr, sizeof(*ptr))
 
 /*
- * This function is called in the last process that is accessing the shared
- * stats so locking is not required.
+ * This function is called in the checkpointer or in single-user mode,
+ * so locking is not required.
  */
 static void
 pgstat_write_statsfile(XLogRecPtr redo)
@@ -1364,9 +1334,6 @@ pgstat_write_statsfile(XLogRecPtr redo)
 	/* should be called only by the checkpointer or single user mode */
 	Assert(!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
 
-	/* we're shutting down, so it's ok to just override this */
-	pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
-
 	elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile,
 		 LSN_FORMAT_ARGS(redo));
 
@@ -1423,7 +1390,6 @@ pgstat_write_statsfile(XLogRecPtr redo)
 		CHECK_FOR_INTERRUPTS();
 
 		/* we may have some "dropped" entries not yet removed, skip them */
-		Assert(!ps->dropped);
 		if (ps->dropped)
 			continue;
 
@@ -1753,9 +1719,6 @@ pgstat_read_statsfile(XLogRecPtr redo)
 done:
 	FreeFile(fpin);
 
-	elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-	unlink(statfile);
-
 	return;
 
 error:
diff --git a/src/test/recovery/t/029_stats_restart.pl b/src/test/recovery/t/029_stats_restart.pl
index 93a7209f69..73d059b9f9 100644
--- a/src/test/recovery/t/029_stats_restart.pl
+++ b/src/test/recovery/t/029_stats_restart.pl
@@ -59,7 +59,7 @@ ok(-f "$og_stats", "origin stats file must exist");
 copy($og_stats, $statsfile) or die "Copy failed: $!";
 
 
-## test discarding of stats file after crash etc
+## test retention of stats file after crash etc
 
 $node->start;
 
@@ -69,12 +69,28 @@ is(have_stats('function', $dboid, $funcoid),
 	't', "$sect: function stats do exist");
 is(have_stats('relation', $dboid, $tableoid),
 	't', "$sect: relation stats do exist");
+# Flush the stats to disk.
+$node->psql('postgres', 'checkpoint');
 
 $node->stop('immediate');
 
-ok(!-f "$og_stats", "no stats file should exist after immediate shutdown");
+ok(-f "$og_stats", "stats file should exist after immediate shutdown");
 
-# copy the old stats back to test we discard stats after crash restart
+# Start once, there should be stats restored from the previous checkpoint.
+$node->start;
+
+$sect = "crashrecovery";
+is(have_stats('database', $dboid, 0), 't', "$sect: db stats do exist");
+is(have_stats('function', $dboid, $funcoid),
+	't', "$sect: function stats do exist");
+is(have_stats('relation', $dboid, $tableoid),
+	't', "$sect: relation stats do exist");
+
+$node->stop('immediate');
+
+# Copy the old stats back, and test that we discard stats after crash restart
+# because these don't match with the redo LSN stored in the stats file
+# generated by the previous checkpoint.
 copy($statsfile, $og_stats) or die "Copy failed: $!";
 
 $node->start;
@@ -274,9 +290,9 @@ my $wal_restart_immediate = wal_stats();
 
 cmp_ok(
 	$wal_reset_restart->{reset},
-	'lt',
+	'eq',
 	$wal_restart_immediate->{reset},
-	"$sect: reset timestamp is new");
+	"$sect: reset timestamp is the same");
 
 $node->stop;
 done_testing();
diff --git a/src/test/recovery/t/030_stats_cleanup_replica.pl b/src/test/recovery/t/030_stats_cleanup_replica.pl
index 74b516cc7c..5735921c99 100644
--- a/src/test/recovery/t/030_stats_cleanup_replica.pl
+++ b/src/test/recovery/t/030_stats_cleanup_replica.pl
@@ -115,7 +115,7 @@ $node_standby->start();
 $sect = "post immediate restart";
 
 test_standby_func_tab_stats_status('postgres',
-	$dboid, $tableoid, $funcoid, 'f');
+	$dboid, $tableoid, $funcoid, 't');
 
 
 done_testing();
-- 
2.45.2

#18Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#17)
1 attachment(s)
Re: Flush pgstats file during checkpoints

On Fri, Aug 02, 2024 at 02:11:34AM +0900, Michael Paquier wrote:

Applied 0003 for now to add the redo LSN to the pgstats file, adding
the redo LSN to the two DEBUG2 entries when reading and writing while
on it, that I forgot. (It was not 01:57 where I am now.)

Attached is the last one.

The CF bot has been complaining in injection_points as an effect of
the stats remaining after a crash, so rebased to adapt to that.
--
Michael

Attachments:

v7-0001-Flush-pgstats-file-during-checkpoints.patchtext/x-diff; charset=us-asciiDownload
From 2874c96f0325ca8e7495366ea08376de6568f438 Mon Sep 17 00:00:00 2001
From: Michael Paquier <michael@paquier.xyz>
Date: Mon, 26 Aug 2024 13:55:28 +0900
Subject: [PATCH v7] Flush pgstats file during checkpoints

The flushes are done for non-shutdown checkpoints and restart points, so
as it is possible to keep some statistics around in the event of a
crash.  Keeping the before_shmem_exit() callback to flush the pgstats
file in a shutdown sequence is a bit easier than doing so at checkpoint,
as they may be stats to flush before we are completely done with the
shutdown checkpoint.  Keeping the callback also keeps the shutdown of
single-user mode simpler, where the stats also need to be flushed.

At the beginning of recovery, the stats file is loaded when the redo LSN
it stores matches with the point we are replaying from.  The file is not
removed anymore after being read, to ensure that there is still a source
of stats to feed on should the system crash until the next checkpoint
that would update the stats.
---
 src/include/pgstat.h                          |  4 +-
 src/backend/access/transam/xlog.c             | 30 +++++----
 src/backend/utils/activity/pgstat.c           | 65 ++++---------------
 .../modules/injection_points/t/001_stats.pl   |  6 +-
 src/test/recovery/t/029_stats_restart.pl      | 26 ++++++--
 .../recovery/t/030_stats_cleanup_replica.pl   |  2 +-
 6 files changed, 60 insertions(+), 73 deletions(-)

diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index f63159c55c..87a4d2e4f7 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -495,9 +495,9 @@ typedef struct PgStat_PendingWalStats
 extern Size StatsShmemSize(void);
 extern void StatsShmemInit(void);
 
-/* Functions called during server startup / shutdown */
+/* Functions called during server startup / checkpoint / shutdown */
 extern void pgstat_restore_stats(XLogRecPtr redo);
-extern void pgstat_discard_stats(void);
+extern void pgstat_flush_stats(XLogRecPtr redo);
 extern void pgstat_before_server_shutdown(int code, Datum arg);
 
 /* Functions for backend initialization */
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index ee0fb0e28f..52d43436aa 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -5402,7 +5402,6 @@ StartupXLOG(void)
 	XLogCtlInsert *Insert;
 	CheckPoint	checkPoint;
 	bool		wasShutdown;
-	bool		didCrash;
 	bool		haveTblspcMap;
 	bool		haveBackupLabel;
 	XLogRecPtr	EndOfLog;
@@ -5522,10 +5521,7 @@ StartupXLOG(void)
 	{
 		RemoveTempXlogFiles();
 		SyncDataDirectory();
-		didCrash = true;
 	}
-	else
-		didCrash = false;
 
 	/*
 	 * Prepare for WAL recovery if needed.
@@ -5646,14 +5642,8 @@ StartupXLOG(void)
 	 *
 	 * NB: Restoring replication slot stats relies on slot state to have
 	 * already been restored from disk.
-	 *
-	 * TODO: With a bit of extra work we could just start with a pgstat file
-	 * associated with the checkpoint redo location we're starting from.
 	 */
-	if (didCrash)
-		pgstat_discard_stats();
-	else
-		pgstat_restore_stats(checkPoint.redo);
+	pgstat_restore_stats(checkPoint.redo);
 
 	lastFullPageWrites = checkPoint.fullPageWrites;
 
@@ -7251,6 +7241,15 @@ CreateCheckPoint(int flags)
 	 */
 	END_CRIT_SECTION();
 
+	/*
+	 * The control file update is done, hence it is time to write some fresh
+	 * statistics.  Stats are flushed at shutdown by the checkpointer in a
+	 * dedicated before_shmem_exit callback, combined with sanity checks
+	 * related to the shutdown of pgstats.
+	 */
+	if (!shutdown)
+		pgstat_flush_stats(RedoRecPtr);
+
 	/*
 	 * WAL summaries end when the next XLOG_CHECKPOINT_REDO or
 	 * XLOG_CHECKPOINT_SHUTDOWN record is reached. This is the first point
@@ -7720,6 +7719,15 @@ CreateRestartPoint(int flags)
 	}
 	LWLockRelease(ControlFileLock);
 
+	/*
+	 * The control file update is done, hence it is time to write some fresh
+	 * statistics.  Stats are flushed at shutdown by the checkpointer in a
+	 * dedicated before_shmem_exit callback, combined with sanity checks
+	 * related to the shutdown of pgstats.
+	 */
+	if ((flags & CHECKPOINT_IS_SHUTDOWN) == 0)
+		pgstat_flush_stats(RedoRecPtr);
+
 	/*
 	 * Update the average distance between checkpoints/restartpoints if the
 	 * prior checkpoint exists.
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index b2ca3f39b7..1f0fcc39c4 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -10,9 +10,10 @@
  * statistics.
  *
  * Statistics are loaded from the filesystem during startup (by the startup
- * process), unless preceded by a crash, in which case all stats are
- * discarded. They are written out by the checkpointer process just before
- * shutting down, except when shutting down in immediate mode.
+ * process) if the stats file has a redo LSN that matches with the one stored
+ * in the control file. They are written out by the checkpointer during
+ * checkpoints and restart points, as well as before shutting down, except
+ * when shutting down in immediate mode.
  *
  * Fixed-numbered stats are stored in plain (non-dynamic) shared memory.
  *
@@ -474,53 +475,19 @@ pgstat_restore_stats(XLogRecPtr redo)
 }
 
 /*
- * Remove the stats file.  This is currently used only if WAL recovery is
- * needed after a crash.
+ * Write stats in memory to disk.
  *
- * Should only be called by the startup process or in single user mode.
+ * This is called by the checkpointer or in single-user mode.
  */
 void
-pgstat_discard_stats(void)
+pgstat_flush_stats(XLogRecPtr redo)
 {
-	int			ret;
-
-	/* NB: this needs to be done even in single user mode */
-
-	ret = unlink(PGSTAT_STAT_PERMANENT_FILENAME);
-	if (ret != 0)
-	{
-		if (errno == ENOENT)
-			elog(DEBUG2,
-				 "didn't need to unlink permanent stats file \"%s\" - didn't exist",
-				 PGSTAT_STAT_PERMANENT_FILENAME);
-		else
-			ereport(LOG,
-					(errcode_for_file_access(),
-					 errmsg("could not unlink permanent statistics file \"%s\": %m",
-							PGSTAT_STAT_PERMANENT_FILENAME)));
-	}
-	else
-	{
-		ereport(DEBUG2,
-				(errcode_for_file_access(),
-				 errmsg_internal("unlinked permanent statistics file \"%s\"",
-								 PGSTAT_STAT_PERMANENT_FILENAME)));
-	}
-
-	/*
-	 * Reset stats contents. This will set reset timestamps of fixed-numbered
-	 * stats to the current time (no variable stats exist).
-	 */
-	pgstat_reset_after_failure();
+	pgstat_write_statsfile(redo);
 }
 
 /*
  * pgstat_before_server_shutdown() needs to be called by exactly one process
- * during regular server shutdowns. Otherwise all stats will be lost.
- *
- * We currently only write out stats for proc_exit(0). We might want to change
- * that at some point... But right now pgstat_discard_stats() would be called
- * during the start after a disorderly shutdown, anyway.
+ * during regular server shutdowns.
  */
 void
 pgstat_before_server_shutdown(int code, Datum arg)
@@ -545,6 +512,9 @@ pgstat_before_server_shutdown(int code, Datum arg)
 	 */
 	if (code == 0)
 	{
+		/* we're shutting down, so it's ok to just override this */
+		pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
+
 		pgStatLocal.shmem->is_shutdown = true;
 		pgstat_write_statsfile(GetRedoRecPtr());
 	}
@@ -1508,8 +1478,8 @@ write_chunk(FILE *fpout, void *ptr, size_t len)
 #define write_chunk_s(fpout, ptr) write_chunk(fpout, ptr, sizeof(*ptr))
 
 /*
- * This function is called in the last process that is accessing the shared
- * stats so locking is not required.
+ * This function is called in the checkpointer or in single-user mode,
+ * so locking is not required.
  */
 static void
 pgstat_write_statsfile(XLogRecPtr redo)
@@ -1526,9 +1496,6 @@ pgstat_write_statsfile(XLogRecPtr redo)
 	/* should be called only by the checkpointer or single user mode */
 	Assert(!IsUnderPostmaster || MyBackendType == B_CHECKPOINTER);
 
-	/* we're shutting down, so it's ok to just override this */
-	pgstat_fetch_consistency = PGSTAT_FETCH_CONSISTENCY_NONE;
-
 	elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile,
 		 LSN_FORMAT_ARGS(redo));
 
@@ -1589,7 +1556,6 @@ pgstat_write_statsfile(XLogRecPtr redo)
 		CHECK_FOR_INTERRUPTS();
 
 		/* we may have some "dropped" entries not yet removed, skip them */
-		Assert(!ps->dropped);
 		if (ps->dropped)
 			continue;
 
@@ -1938,9 +1904,6 @@ pgstat_read_statsfile(XLogRecPtr redo)
 done:
 	FreeFile(fpin);
 
-	elog(DEBUG2, "removing permanent stats file \"%s\"", statfile);
-	unlink(statfile);
-
 	return;
 
 error:
diff --git a/src/test/modules/injection_points/t/001_stats.pl b/src/test/modules/injection_points/t/001_stats.pl
index 7d6070e713..07ba62fd45 100644
--- a/src/test/modules/injection_points/t/001_stats.pl
+++ b/src/test/modules/injection_points/t/001_stats.pl
@@ -59,14 +59,14 @@ $fixedstats = $node->safe_psql('postgres',
 	"SELECT * FROM injection_points_stats_fixed();");
 is($fixedstats, '1|0|2|1|1', 'fixed stats after clean restart');
 
-# On crash the stats are gone.
+# On crash the stats are still there.
 $node->stop('immediate');
 $node->start;
 $numcalls = $node->safe_psql('postgres',
 	"SELECT injection_points_stats_numcalls('stats-notice');");
-is($numcalls, '', 'number of stats after crash');
+is($numcalls, '3', 'number of stats after crash');
 $fixedstats = $node->safe_psql('postgres',
 	"SELECT * FROM injection_points_stats_fixed();");
-is($fixedstats, '0|0|0|0|0', 'fixed stats after crash');
+is($fixedstats, '1|0|2|1|1', 'fixed stats after crash');
 
 done_testing();
diff --git a/src/test/recovery/t/029_stats_restart.pl b/src/test/recovery/t/029_stats_restart.pl
index 93a7209f69..73d059b9f9 100644
--- a/src/test/recovery/t/029_stats_restart.pl
+++ b/src/test/recovery/t/029_stats_restart.pl
@@ -59,7 +59,7 @@ ok(-f "$og_stats", "origin stats file must exist");
 copy($og_stats, $statsfile) or die "Copy failed: $!";
 
 
-## test discarding of stats file after crash etc
+## test retention of stats file after crash etc
 
 $node->start;
 
@@ -69,12 +69,28 @@ is(have_stats('function', $dboid, $funcoid),
 	't', "$sect: function stats do exist");
 is(have_stats('relation', $dboid, $tableoid),
 	't', "$sect: relation stats do exist");
+# Flush the stats to disk.
+$node->psql('postgres', 'checkpoint');
 
 $node->stop('immediate');
 
-ok(!-f "$og_stats", "no stats file should exist after immediate shutdown");
+ok(-f "$og_stats", "stats file should exist after immediate shutdown");
 
-# copy the old stats back to test we discard stats after crash restart
+# Start once, there should be stats restored from the previous checkpoint.
+$node->start;
+
+$sect = "crashrecovery";
+is(have_stats('database', $dboid, 0), 't', "$sect: db stats do exist");
+is(have_stats('function', $dboid, $funcoid),
+	't', "$sect: function stats do exist");
+is(have_stats('relation', $dboid, $tableoid),
+	't', "$sect: relation stats do exist");
+
+$node->stop('immediate');
+
+# Copy the old stats back, and test that we discard stats after crash restart
+# because these don't match with the redo LSN stored in the stats file
+# generated by the previous checkpoint.
 copy($statsfile, $og_stats) or die "Copy failed: $!";
 
 $node->start;
@@ -274,9 +290,9 @@ my $wal_restart_immediate = wal_stats();
 
 cmp_ok(
 	$wal_reset_restart->{reset},
-	'lt',
+	'eq',
 	$wal_restart_immediate->{reset},
-	"$sect: reset timestamp is new");
+	"$sect: reset timestamp is the same");
 
 $node->stop;
 done_testing();
diff --git a/src/test/recovery/t/030_stats_cleanup_replica.pl b/src/test/recovery/t/030_stats_cleanup_replica.pl
index 74b516cc7c..5735921c99 100644
--- a/src/test/recovery/t/030_stats_cleanup_replica.pl
+++ b/src/test/recovery/t/030_stats_cleanup_replica.pl
@@ -115,7 +115,7 @@ $node_standby->start();
 $sect = "post immediate restart";
 
 test_standby_func_tab_stats_status('postgres',
-	$dboid, $tableoid, $funcoid, 'f');
+	$dboid, $tableoid, $funcoid, 't');
 
 
 done_testing();
-- 
2.45.2

#19Bertrand Drouvot
bertranddrouvot.pg@gmail.com
In reply to: Michael Paquier (#18)
Re: Flush pgstats file during checkpoints

Hi,

On Mon, Aug 26, 2024 at 01:56:40PM +0900, Michael Paquier wrote:

On Fri, Aug 02, 2024 at 02:11:34AM +0900, Michael Paquier wrote:

Applied 0003 for now to add the redo LSN to the pgstats file, adding
the redo LSN to the two DEBUG2 entries when reading and writing while
on it, that I forgot. (It was not 01:57 where I am now.)

Attached is the last one.

The CF bot has been complaining in injection_points as an effect of
the stats remaining after a crash, so rebased to adapt to that.

Thanks!

Checking the V7 diffs as compared to V4:

1. In pgstat_write_statsfile():

-	elog(DEBUG2, "writing stats file \"%s\"", statfile);
+	elog(DEBUG2, "writing stats file \"%s\" with redo %X/%X", statfile,
+		 LSN_FORMAT_ARGS(redo));

2. and the ones in injection_points/t/001_stats.pl:

 +# On crash the stats are still there.
  $node->stop('immediate');
  $node->start;
  $numcalls = $node->safe_psql('postgres',
  	"SELECT injection_points_stats_numcalls('stats-notice');");
 -is($numcalls, '', 'number of stats after crash');
 +is($numcalls, '3', 'number of stats after crash');
  $fixedstats = $node->safe_psql('postgres',
  	"SELECT * FROM injection_points_stats_fixed();");
 -is($fixedstats, '0|0|0|0|0', 'fixed stats after crash');
 +is($fixedstats, '1|0|2|1|1', 'fixed stats after crash');

They both LGTM.

Regards,

--
Bertrand Drouvot
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

#20Heikki Linnakangas
hlinnaka@iki.fi
In reply to: Bertrand Drouvot (#19)
Re: Flush pgstats file during checkpoints

It'd not be such an issue if we updated stats during recovery, but I
think think we're doing that. Perhaps we should, which might also help
on replicas - no idea if it's feasible, though.

Stats on replicas are considered an independent thing AFAIU (scans are
counted for example in read-only queries). If we were to do that we
may want to split stats handling between nodes in standby state and
crash recovery. Not sure if that's worth the complication. First,
the stats exist at node-level.

Hmm, I'm a bit disappointed this doesn't address replication. It makes
sense that scans are counted separately on a standby, but it would be
nice if stats like last_vacuum were propagated from primary to standbys.
I guess that can be handled separately later.

Reviewing v7-0001-Flush-pgstats-file-during-checkpoints.patch:

There are various race conditions where a stats entry can be leaked in
the pgstats file. I.e. relation is dropped, but its stats entry is
retained in the stats file after crash. In the worst case, suck leaked
entries can accumulate until the stats file is manually removed, which
resets all stats again. Perhaps that's acceptable - it's still possible
leak the actual relation file for a new relation on crash, after all,
which is much worse (I'm glad Horiguchi-san is working on that [1]/messages/by-id/20240901.010925.656452225144636594.horikyota.ntt@gmail.com).

For example:
1. BEGIN; CREATE TABLE foo (); ANALYZE foo;
2. CHECKPOINT;
3. pg_ctl restart -m immediate

This is the same scenario where we leak the relfile, but now you can
have it with e.g. function statistics too.

Until 5891c7a8ed, there was a mechanism to garbage collect such orphaned
entries (pgstat_vacuum()). How bad would it be to re-introduce that? Or
can we make it more watertight so that there are no leaks?

If you do this:

pg_ctl start -D data
pg_ctl stop -D data -m immediate
pg_ctl start -D data
pg_ctl stop -D data -m immediate

You get this in the log:

2024-09-02 16:28:37.874 EEST [1397281] WARNING: found incorrect redo
LSN 0/160A3C8 (expected 0/160A440)

I think it's failing to flush the stats file at the end of recovery
checkpoint.

[1]: /messages/by-id/20240901.010925.656452225144636594.horikyota.ntt@gmail.com
/messages/by-id/20240901.010925.656452225144636594.horikyota.ntt@gmail.com

--
Heikki Linnakangas
Neon (https://neon.tech)

#21Michael Paquier
michael@paquier.xyz
In reply to: Heikki Linnakangas (#20)
Re: Flush pgstats file during checkpoints

On Mon, Sep 02, 2024 at 05:08:03PM +0300, Heikki Linnakangas wrote:

Hmm, I'm a bit disappointed this doesn't address replication. It makes sense
that scans are counted separately on a standby, but it would be nice if
stats like last_vacuum were propagated from primary to standbys. I guess
that can be handled separately later.

Yes, it's not something that I'm planning to tackle for this thread.
Speaking of which, the design that I got in mind for this area was not
"that" complicated:
- Add a new RMGR for all the stats.
- Add a first callback for stats kinds for WAL inserts, giving to each
stats the possibility to pass down data inserted to the record, as we
want to replicate a portion of the data depending on the kind dealt
with.
- Add a second callback for recovery, called depending on the kind ID.

I have not looked into the details yet, but stats to replicate should
be grouped in a single record on transaction commit or depending on
the flush timing for fixed-numbered stats. Or we should just add them
in commit records?

Reviewing v7-0001-Flush-pgstats-file-during-checkpoints.patch:

There are various race conditions where a stats entry can be leaked in the
pgstats file. I.e. relation is dropped, but its stats entry is retained in
the stats file after crash. In the worst case, suck leaked entries can
accumulate until the stats file is manually removed, which resets all stats
again. Perhaps that's acceptable - it's still possible leak the actual
relation file for a new relation on crash, after all, which is much worse
(I'm glad Horiguchi-san is working on that [1]).

Yeah, that's not an easy issue. We don't really have a protection
regarding that as well now. Backends can also refer to stats entries
in their shutdown callback that have been dropped concurrently. See
some details about that at https://commitfest.postgresql.org/49/5045/.

Until 5891c7a8ed, there was a mechanism to garbage collect such orphaned
entries (pgstat_vacuum()). How bad would it be to re-introduce that? Or can
we make it more watertight so that there are no leaks?

Not sure about this part, TBH. Doing that again in autovacuum does
not excite me much as it has a cost.

I think it's failing to flush the stats file at the end of recovery
checkpoint.

Missed that, oops. I'll double-check this area.
--
Michael