pg_stat_io for the startup process

Started by Fujii Masao over 2 years ago · 24 messages

#1 Fujii Masao
masao.fujii@oss.nttdata.com

Hi,

Regarding pg_stat_io for the startup process, I noticed that the counters
are only incremented after the startup process exits, not during WAL replay
in standby mode. This is because pgstat_flush_io() is only called when
the startup process exits. Shouldn't it be called during WAL replay as well
to report IO statistics by the startup process even in standby mode?

Also, the pg_stat_io view includes a row with backend_type=startup and
context=vacuum, but it seems that the startup process doesn't perform
any I/O operations with BAS_VACUUM. If this understanding is right,
shouldn't we omit this row from the view? Additionally, I noticed that
the view also includes a row with backend_type=startup and
context=bulkread / bulkwrite. Do these operations actually occur
during the startup process?

Regards,

--
Fujii Masao
Advanced Computing Technology Center
Research and Development Headquarters
NTT DATA CORPORATION

#2 Melanie Plageman
melanieplageman@gmail.com
In reply to: Fujii Masao (#1)
Re: pg_stat_io for the startup process

On Tue, Apr 25, 2023 at 10:51:14PM +0900, Fujii Masao wrote:

Hi,

Regarding pg_stat_io for the startup process, I noticed that the counters
are only incremented after the startup process exits, not during WAL replay
in standby mode. This is because pgstat_flush_io() is only called when
the startup process exits. Shouldn't it be called during WAL replay as well
to report IO statistics by the startup process even in standby mode?

Yes, we definitely want stats from the startup process on the standby.
Elsewhere on the internet where you originally raised this, I mentioned
that I hacked a pgstat_flush_io() into the redo apply loop in
PerformWalRecovery() but that I wasn't sure that this was affordable.
Andres Freund replied saying that it would be too expensive and
suggested that we set up a regular timeout which sets a flag that's
checked by HandleStartupProcInterrupts().
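
For readers following along, that suggestion would look roughly like the
sketch below. This is only a minimal illustration of the shape, not code from
the thread: startup_stats_flush_pending, startup_stats_timeout_handler() and
maybe_flush_startup_stats() are hypothetical names, while pgstat_report_stat()
and HandleStartupProcInterrupts() are real symbols.

#include "postgres.h"

#include "pgstat.h"

/*
 * Minimal sketch, assuming a timeout registered via the usual timeout
 * infrastructure: the handler only sets a flag, and the flag is checked
 * at the next interrupt-processing point.
 */
static volatile sig_atomic_t startup_stats_flush_pending = false;

/* timeout handler: runs in signal context, so only set a flag */
static void
startup_stats_timeout_handler(void)
{
	startup_stats_flush_pending = true;
}

/* would be called from HandleStartupProcInterrupts() */
static void
maybe_flush_startup_stats(void)
{
	if (startup_stats_flush_pending)
	{
		startup_stats_flush_pending = false;
		pgstat_report_stat(false);
	}
}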

I'm wondering if this is something we consider a bug and thus would be
under consideration for 16.

Also, the pg_stat_io view includes a row with backend_type=startup and
context=vacuum, but it seems that the startup process doesn't perform
any I/O operations with BAS_VACUUM. If this understanding is right,
shouldn't we omit this row from the view? Additionally, I noticed that
the view also includes a row with backend_type=startup and
context=bulkread / bulkwrite. Do these operations actually occur
during the startup process?

Hmm. Yes, I remember posing this question on the thread and not getting
an answer. I read some code and did some testing and can't see a way we
would end up with the startup process doing IO in a non-normal context.

Certainly I can't see how the startup process would ever use a BAS_VACUUM
context, given that it executes heap_xlog_vacuum().

I thought at some point I had encountered an assertion failure when I
banned the startup process from tracking io operations in bulkread and
bulkwrite contexts. But, I'm not seeing how that could happen.

- Melanie

#3 Andres Freund
andres@anarazel.de
In reply to: Melanie Plageman (#2)
Re: pg_stat_io for the startup process

Hi,

On 2023-04-25 13:54:43 -0400, Melanie Plageman wrote:

On Tue, Apr 25, 2023 at 10:51:14PM +0900, Fujii Masao wrote:

Regarding pg_stat_io for the startup process, I noticed that the counters
are only incremented after the startup process exits, not during WAL replay
in standby mode. This is because pgstat_flush_io() is only called when
the startup process exits. Shouldn't it be called during WAL replay as well
to report IO statistics by the startup process even in standby mode?

Yes, we definitely want stats from the startup process on the standby.
Elsewhere on the internet where you originally raised this, I mentioned
that I hacked a pgstat_flush_io() into the redo apply loop in
PerformWalRecovery() but that I wasn't sure that this was affordable.
Andres Freund replied saying that it would be too expensive and
suggested that we set up a regular timeout which sets a flag that's
checked by HandleStartupProcInterrupts().

It's tempting to try to reuse the STARTUP_PROGRESS_TIMEOUT timer. But it's
controlled by a GUC, so it's not really suitable.

I'm wondering if this is something we consider a bug and thus would be
under consideration for 16.

I'm mildly inclined to not consider it a bug, given that this looks to have
been true for other stats for quite a while? But it does still seem worth
improving upon - I'd make the decision about when to apply the relevant
patches depend on the complexity. I'm worried we'd need to introduce enough
new infrastructure that 16 doesn't seem like a good idea. Let's come up with a
patch and judge it after?

Also, the pg_stat_io view includes a row with backend_type=startup and
context=vacuum, but it seems that the startup process doesn't perform
any I/O operations with BAS_VACUUM. If this understanding is right,
shouldn't we omit this row from the view? Additionally, I noticed that
the view also includes a row with backend_type=startup and
context=bulkread / bulkwrite. Do these operations actually occur
during the startup process?

Hmm. Yes, I remember posing this question on the thread and not getting
an answer. I read some code and did some testing and can't see a way we
would end up with the startup process doing IO in a non-normal context.

Certainly I can't see how the startup process would ever use a BAS_VACUUM
context, given that it executes heap_xlog_vacuum().

I thought at some point I had encountered an assertion failure when I
banned the startup process from tracking io operations in bulkread and
bulkwrite contexts. But, I'm not seeing how that could happen.

It's possible that we decided to not apply such restrictions because the
startup process can be made to execute more code via the extensible
rmgrs.
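
For context, custom resource managers (PostgreSQL 15+) let an extension run
arbitrary redo code inside the startup process. A bare-bones sketch of what
that looks like; my_rmgr and my_rmgr_redo are hypothetical names, while
RegisterCustomRmgr() and RM_EXPERIMENTAL_ID are real (the latter reserved for
experimentation only):

#include "postgres.h"

#include "access/xlog_internal.h"
#include "fmgr.h"

PG_MODULE_MAGIC;

/* redo runs in the startup process and may perform arbitrary buffer IO */
static void
my_rmgr_redo(XLogReaderState *record)
{
	/*
	 * Nothing stops this code from reading buffers with any
	 * BufferAccessStrategy, which is why hard-coding assumptions about
	 * which IO contexts the startup process can use is risky.
	 */
}

static const RmgrData my_rmgr = {
	.rm_name = "my_rmgr",
	.rm_redo = my_rmgr_redo,
};

void
_PG_init(void)
{
	RegisterCustomRmgr(RM_EXPERIMENTAL_ID, &my_rmgr);
}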

Greetings,

Andres Freund

#4 Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#3)
Re: pg_stat_io for the startup process

On Tue, Apr 25, 2023 at 2:39 PM Andres Freund <andres@anarazel.de> wrote:

I'm mildly inclined to not consider it a bug, given that this looks to have
been true for other stats for quite a while? But it does still seem worth
improving upon - I'd make the decision about when to apply the relevant
patches depend on the complexity. I'm worried we'd need to introduce enough
new infrastructure that 16 doesn't seem like a good idea. Let's come up with a
patch and judge it after?

ISTM that it's pretty desirable to do something about this. If the
process isn't going to report statistics properly, at least remove it
from the view. If it can be made to report properly, that would be
even better. But shipping a new view with information that will nearly
always be zeroes instead of real data seems like a bad call, even if
there are existing cases that have the same problem.

--
Robert Haas
EDB: http://www.enterprisedb.com

#5 Michael Paquier
michael@paquier.xyz
In reply to: Robert Haas (#4)
Re: pg_stat_io for the startup process

On Tue, Apr 25, 2023 at 04:00:24PM -0400, Robert Haas wrote:

ISTM that it's pretty desirable to do something about this. If the
process isn't going to report statistics properly, at least remove it
from the view. If it can be made to report properly, that would be
even better. But shipping a new view with information that will nearly
always be zeroes instead of real data seems like a bad call, even if
there are existing cases that have the same problem.

Agreed that reporting no information may be better than reporting
incorrect information, even if it means an extra qual. As mentioned
upthread, if this requires an extra design piece, adding the correct
information had better be pushed to v17~.

Perhaps an open item should be added?
--
Michael

#6 Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#4)
Re: pg_stat_io for the startup process

Hi,

On 2023-04-25 16:00:24 -0400, Robert Haas wrote:

On Tue, Apr 25, 2023 at 2:39 PM Andres Freund <andres@anarazel.de> wrote:

I'm mildly inclined to not consider it a bug, given that this looks to have
been true for other stats for quite a while? But it does still seem worth
improving upon - I'd make the decision about when to apply the relevant
patches depend on the complexity. I'm worried we'd need to introduce enough
new infrastructure that 16 doesn't seem like a good idea. Let's come up with a
patch and judge it after?

ISTM that it's pretty desirable to do something about this. If the
process isn't going to report statistics properly, at least remove it
from the view.

It's populated after crash recovery, when shutting down, and at the time of
promotion, so that isn't *completely* crazy.

If it can be made to report properly, that would be even better. But
shipping a new view with information that will nearly always be zeroes
instead of real data seems like a bad call, even if there are existing cases
that have the same problem.

I refreshed my memory: The startup process has indeed behaved that way for
much longer than pg_stat_io existed - but it's harder to spot, because the
stats are more coarsely aggregated :/. And it's very oddly inconsistent:

The startup process doesn't report per-relation read/hit (it might when we
create a fake relcache entry, too lazy to see what happens exactly), because we
key those stats by oid. However, it *does* report the read/write time. But
only at process exit, of course. The weird part is that the startup process
does *NOT* increase pg_stat_database.blks_read/blks_hit, because instead of
basing those on pgBufferUsage.shared_blks_read etc, we compute them based on
the relation level stats. pgBufferUsage is just used for EXPLAIN. This isn't
recent, afaict.

TL;DR: Currently the startup process maintains blk_read_time, blk_write_time,
but doesn't maintain blks_read, blks_hit - which doesn't make sense.
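
The relation-level rollup being described is easiest to see in
pgstat_relation_flush_cb() - condensed below; these are the same lines
Horiguchi's patch further down ends up removing:

	/*
	 * Per-relation pending counts, keyed by relation OID, are rolled up
	 * into the database entry. WAL replay never accumulates these (it
	 * works below the relation level), while pgBufferUsage is maintained
	 * during replay but only feeds EXPLAIN (BUFFERS).
	 */
	dbentry->blocks_fetched += lstats->counts.blocks_fetched;
	dbentry->blocks_hit += lstats->counts.blocks_hit;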

Yikes.

Greetings,

Andres Freund

#7 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#6)
1 attachment(s)
Re: pg_stat_io for the startup process

At Tue, 25 Apr 2023 16:04:23 -0700, Andres Freund <andres@anarazel.de> wrote in

I refreshed my memory: The startup process has indeed behaved that way for
much longer than pg_stat_io existed - but it's harder to spot, because the
stats are more coarsely aggregated :/. And it's very oddly inconsistent:

The startup process doesn't report per-relation read/hit (it might when we
create a fake relcache entry, too lazy to see what happens exactly), because we

The key difference lies between relation-level and smgr-level;
recovery doesn't call ReadBufferExtended.

key those stats by oid. However, it *does* report the read/write time. But
only at process exit, of course. The weird part is that the startup process
does *NOT* increase pg_stat_database.blks_read/blks_hit, because instead of
basing those on pgBufferUsage.shared_blks_read etc, we compute them based on
the relation level stats. pgBufferUsage is just used for EXPLAIN. This isn't
recent, afaict.

I see four issues here.

1. The current database stats omit buffer fetches that don't
originate from a relation.

In this case pgstat_relation can't work since recovery isn't conscious
of relids. We might be able to resolve a relfilenode into a relid, but
it may not be that simple. Fortunately we already count fetches and
hits process-wide using pgBufferUsage, so we can use this for database
stats.

2. Even if we wanted to report stats for the startup process,
pgstat_report_stats wouldn't permit it since the transaction-end
timestamp doesn't advance.

I'm not certain if it's the correct approach, but perhaps we could use
GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp()
specifically for the startup process.

3. When should we call pgstat_report_stats on the startup process?

During recovery, I think we can call pgstat_report_stats() (or a
subset of it) right before invoking WaitLatch and at segment
boundaries.

4. In the existing ReadBuffer_common, there's an inconsistency in
counting hits and reads between pgstat_io and pgBufferUsage.

The difference comes from the case of RBM_ZERO pages. We should simply
align them.

TL;DR: Currently the startup process maintains blk_read_time, blk_write_time,
but doesn't maintain blks_read, blks_hit - which doesn't make sense.

As a result, the attached patch, which is meant for discussion, allows
pg_stat_database to show fetches and reads by the startup process as
the counts for the database with id 0.

There's still some difference between pg_stat_io and pg_stat_database,
but I haven't examined it in detail.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

pgstat_io_startup.txt (text/plain; charset=us-ascii)
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 188f6d6f85..6010ae8662 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3248,6 +3248,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		close(readFile);
 		readFile = -1;
 		readSource = XLOG_FROM_ANY;
+		pgstat_report_stat(false);
 	}
 
 	XLByteToSeg(targetPagePtr, readSegNo, wal_segment_size);
@@ -3617,6 +3618,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						/* Do background tasks that might benefit us later. */
 						KnownAssignedTransactionIdsIdleMaintenance();
 
+						pgstat_report_stat(false);
 						(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 										 WL_LATCH_SET | WL_TIMEOUT |
 										 WL_EXIT_ON_PM_DEATH,
@@ -3893,6 +3895,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					 * Wait for more WAL to arrive, when we will be woken
 					 * immediately by the WAL receiver.
 					 */
+					pgstat_report_stat(false);
 					(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 									 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
 									 -1L,
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1fa689052e..520936d0dd 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1046,9 +1046,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
 		if (found)
 			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
@@ -1062,9 +1059,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 							 strategy, &found, io_context);
 		if (found)
 			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
 	/* At this point we do NOT hold any locks. */
@@ -1126,6 +1120,10 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		pgstat_count_io_op_time(io_object, io_context,
 								IOOP_READ, io_start, 1);
+		if (isLocalBuf)
+			pgBufferUsage.local_blks_read++;
+		else
+			pgBufferUsage.shared_blks_read++;
 
 		/* check for garbage data */
 		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index b125802b21..0864b56689 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -608,7 +608,10 @@ pgstat_report_stat(bool force)
 	 */
 	Assert(!pgStatLocal.shmem->is_shutdown);
 
-	now = GetCurrentTransactionStopTimestamp();
+	if (MyBackendType != B_STARTUP)
+		now = GetCurrentTransactionStopTimestamp();
+	else
+		now = GetCurrentTimestamp();
 
 	if (!force)
 	{
diff --git a/src/backend/utils/activity/pgstat_database.c b/src/backend/utils/activity/pgstat_database.c
index 7149f22f72..c93b92771a 100644
--- a/src/backend/utils/activity/pgstat_database.c
+++ b/src/backend/utils/activity/pgstat_database.c
@@ -17,6 +17,7 @@
 
 #include "postgres.h"
 
+#include "executor/instrument.h"
 #include "utils/pgstat_internal.h"
 #include "utils/timestamp.h"
 #include "storage/procsignal.h"
@@ -269,7 +270,10 @@ AtEOXact_PgStat_Database(bool isCommit, bool parallel)
 void
 pgstat_update_dbstats(TimestampTz ts)
 {
+	static BufferUsage lastBufferUsage = {0};
 	PgStat_StatDBEntry *dbentry;
+	PgStat_Counter hit_diff;
+	PgStat_Counter read_diff;
 
 	dbentry = pgstat_prep_database_pending(MyDatabaseId);
 
@@ -282,6 +286,16 @@ pgstat_update_dbstats(TimestampTz ts)
 	dbentry->blk_read_time += pgStatBlockReadTime;
 	dbentry->blk_write_time += pgStatBlockWriteTime;
 
+	/* we need to preserve pgBufferUsage */
+	read_diff =
+		pgBufferUsage.shared_blks_read - lastBufferUsage.shared_blks_read +
+		pgBufferUsage.local_blks_read - lastBufferUsage.local_blks_read;
+	hit_diff = 
+		pgBufferUsage.shared_blks_hit - lastBufferUsage.shared_blks_hit +
+		pgBufferUsage.local_blks_hit - lastBufferUsage.local_blks_hit;
+	dbentry->blocks_fetched += read_diff + hit_diff;
+	dbentry->blocks_hit += hit_diff;
+
 	if (pgstat_should_report_connstat())
 	{
 		long		secs;
@@ -304,6 +318,8 @@ pgstat_update_dbstats(TimestampTz ts)
 	pgStatBlockWriteTime = 0;
 	pgStatActiveTime = 0;
 	pgStatTransactionIdleTime = 0;
+
+	lastBufferUsage = pgBufferUsage;
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 9876e0c1e8..8945b8141a 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -844,8 +844,6 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 	dbentry->tuples_inserted += lstats->counts.tuples_inserted;
 	dbentry->tuples_updated += lstats->counts.tuples_updated;
 	dbentry->tuples_deleted += lstats->counts.tuples_deleted;
-	dbentry->blocks_fetched += lstats->counts.blocks_fetched;
-	dbentry->blocks_hit += lstats->counts.blocks_hit;
 
 	return true;
 }
#8 Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#7)
Re: pg_stat_io for the startup process

On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

3. When should we call pgstat_report_stats on the startup process?

During recovery, I think we can call pgstat_report_stats() (or a
subset of it) right before invoking WaitLatch and at segment
boundaries.

I think this kind of idea is worth exploring. Andres mentioned timers,
but it seems to me that something where we just do it at certain
"convenient" points might be good enough and simpler. I'd much rather
have statistics that were up to date as of the last time we finished a
segment than have nothing at all.

--
Robert Haas
EDB: http://www.enterprisedb.com

#9 Melih Mutlu
m.melihmutlu@gmail.com
In reply to: Robert Haas (#8)
1 attachment(s)
Re: pg_stat_io for the startup process

Hi all,

On Wed, Apr 26, 2023 at 3:34 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

3. When should we call pgstat_report_stats on the startup process?

During recovery, I think we can call pgstat_report_stats() (or a
subset of it) right before invoking WaitLatch and at segment
boundaries.

I think this kind of idea is worth exploring. Andres mentioned timers,
but it seems to me that something where we just do it at certain
"convenient" points might be good enough and simpler. I'd much rather
have statistics that were up to date as of the last time we finished a
segment than have nothing at all.

I created a rough prototype of a timer-based approach for comparison.
Please see attached.

I registered a new timeout named "STARTUP_STAT_FLUSH_TIMEOUT". The timeout's
handler sets a flag indicating that IO stats need to be flushed, and
HandleStartupProcInterrupts checks the flag to flush stats. The timeout is
enabled once any WAL record is read (in the main loop of PerformWalRecovery),
and it's disabled before WaitLatch to avoid unnecessary wakeups if the startup
process decides to sleep.

With those changes, I see that startup-related rows in pg_stat_io are
updated without waiting for the startup process to exit.

postgres=# select context, reads, extends, hits from pg_stat_io
           where backend_type = 'startup';
  context  | reads | extends | hits
-----------+-------+---------+------
 bulkread  |     0 |         |    0
 bulkwrite |     0 |       0 |    0
 normal    |     6 |       1 |   41
 vacuum    |     0 |       0 |    0
(4 rows)

I'm not sure this is the correct way to implement this approach, though. I'd
appreciate any comments.

Also, some questions about this implementation, if you think they're worth
discussing:
1- I set an arbitrary timeout (1 sec) for testing. I don't know what the
correct value should be. Does it make sense to use
PGSTAT_MIN/MAX/IDLE_INTERVAL instead?
2- I'm also not sure whether this timeout should be registered at the
beginning of StartupProcessMain, or whether it even makes any difference. I
tried to do this just before the main redo loop in PerformWalRecovery, but
that made the CI red.

Best,
--
Melih Mutlu
Microsoft

Attachments:

0001-Add-timeout-to-flush-stats-during-startup-s-main-replay-loop.patch (application/octet-stream)
From 73728acc384d51497bc73a0376a88ca5205c906d Mon Sep 17 00:00:00 2001
From: Melih Mutlu <m.melihmutlu@gmail.com>
Date: Wed, 26 Apr 2023 18:21:32 +0300
Subject: [PATCH] Add timeout to flush stats during startup's main replay loop

---
 src/backend/access/transam/xlogrecovery.c | 18 +++++++++
 src/backend/postmaster/startup.c          | 46 +++++++++++++++++++++++
 src/include/postmaster/startup.h          |  4 ++
 src/include/utils/timeout.h               |  1 +
 4 files changed, 69 insertions(+)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 188f6d6f85..eac5ee3e5c 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -63,6 +63,7 @@
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
+#include "utils/timeout.h"
 
 /* Unsupported old recovery command file names (relative to $PGDATA) */
 #define RECOVERY_COMMAND_FILE	"recovery.conf"
@@ -1675,6 +1676,9 @@ PerformWalRecovery(void)
 				ereport_startup_progress("redo in progress, elapsed time: %ld.%02d s, current LSN: %X/%X",
 										 LSN_FORMAT_ARGS(xlogreader->ReadRecPtr));
 
+			/* Is this the right place to enable this? */
+			enable_startup_stat_flush_timeout();
+
 #ifdef WAL_DEBUG
 			if (XLOG_DEBUG ||
 				(record->xl_rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||
@@ -3617,6 +3621,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						/* Do background tasks that might benefit us later. */
 						KnownAssignedTransactionIdsIdleMaintenance();
 
+						/* 
+						 * Need to disable flush timeout to avoid unnecessary
+						 * wakeups. Enable it again after a WAL record is read
+						 * in PerformWalRecovery.
+						 */
+						disable_startup_stat_flush_timeout();
+
 						(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 										 WL_LATCH_SET | WL_TIMEOUT |
 										 WL_EXIT_ON_PM_DEATH,
@@ -3889,6 +3900,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					/* Update pg_stat_recovery_prefetch before sleeping. */
 					XLogPrefetcherComputeStats(xlogprefetcher);
 
+					/* 
+					 * Need to disable flush timeout to avoid unnecessary
+					 * wakeups. Enable it again after a WAL record is read
+					 * in PerformWalRecovery.
+					 */
+					disable_startup_stat_flush_timeout();
+
 					/*
 					 * Wait for more WAL to arrive, when we will be woken
 					 * immediately by the WAL receiver.
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index efc2580536..b250fa95f9 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -72,6 +72,11 @@ static TimestampTz startup_progress_phase_start_time;
  */
 static volatile sig_atomic_t startup_progress_timer_expired = false;
 
+/* Indicates whether flushing stats is needed. */
+static volatile sig_atomic_t startup_stat_need_flush = false;
+
+int			pgstat_stat_flush_timeout = 1000;	/* 1 sec ?? */
+
 /*
  * Time between progress updates for long-running startup operations.
  */
@@ -206,6 +211,13 @@ HandleStartupProcInterrupts(void)
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
 		ProcessLogMemoryContextInterrupt();
+
+	if (startup_stat_need_flush)
+	{
+		/* It's time to flush wal stats. */
+		pgstat_report_wal(true);
+		startup_stat_need_flush = false;
+	}
 }
 
 
@@ -256,6 +268,10 @@ StartupProcessMain(void)
 	RegisterTimeout(STANDBY_TIMEOUT, StandbyTimeoutHandler);
 	RegisterTimeout(STANDBY_LOCK_TIMEOUT, StandbyLockTimeoutHandler);
 
+	/* Register the timeout to flush stats periodically. */
+	RegisterTimeout(STARTUP_STAT_FLUSH_TIMEOUT,
+					startup_stat_flush_timeout_handler);
+
 	/*
 	 * Unblock signals (they were blocked when the postmaster forked us)
 	 */
@@ -385,3 +401,33 @@ has_startup_progress_timeout_expired(long *secs, int *usecs)
 
 	return true;
 }
+
+/* Set a flag indicating that it's time to flush. */
+void
+startup_stat_flush_timeout_handler(void)
+{
+	startup_stat_need_flush = true;
+}
+
+/* Disable the timeout set for startup stat flush. */
+void
+disable_startup_stat_flush_timeout(void)
+{
+	/* one last flush might be needed before disabling. */
+	startup_stat_need_flush = true;
+
+	disable_timeout(STARTUP_STAT_FLUSH_TIMEOUT, false);
+}
+
+/* Enable the timeout set for startup stat flush. */
+void
+enable_startup_stat_flush_timeout(void)
+{
+	TimestampTz fin_time;
+
+	startup_progress_phase_start_time = GetCurrentTimestamp();
+	fin_time = TimestampTzPlusMilliseconds(startup_progress_phase_start_time,
+										   pgstat_stat_flush_timeout);
+	enable_timeout_every(STARTUP_STAT_FLUSH_TIMEOUT, fin_time,
+						 pgstat_stat_flush_timeout);
+}
\ No newline at end of file
diff --git a/src/include/postmaster/startup.h b/src/include/postmaster/startup.h
index 6a2e4c4526..dbc79560b5 100644
--- a/src/include/postmaster/startup.h
+++ b/src/include/postmaster/startup.h
@@ -38,4 +38,8 @@ extern void begin_startup_progress_phase(void);
 extern void startup_progress_timeout_handler(void);
 extern bool has_startup_progress_timeout_expired(long *secs, int *usecs);
 
+extern void enable_startup_stat_flush_timeout(void);
+extern void disable_startup_stat_flush_timeout(void);
+extern void startup_stat_flush_timeout_handler(void);
+
 #endif							/* _STARTUP_H */
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index e561a1cde9..a8d360e255 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -35,6 +35,7 @@ typedef enum TimeoutId
 	IDLE_STATS_UPDATE_TIMEOUT,
 	CLIENT_CONNECTION_CHECK_TIMEOUT,
 	STARTUP_PROGRESS_TIMEOUT,
+	STARTUP_STAT_FLUSH_TIMEOUT,
 	/* First user-definable timeout reason */
 	USER_TIMEOUT,
 	/* Maximum number of timeout reasons */
-- 
2.25.1

#10 Andres Freund
andres@anarazel.de
In reply to: Melih Mutlu (#9)
Re: pg_stat_io for the startup process

Hi,

On 2023-04-26 21:53:16 +0300, Melih Mutlu wrote:

On Wed, Apr 26, 2023 at 3:34 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

3. When should we call pgstat_report_stats on the startup process?

During recovery, I think we can call pgstat_report_stats() (or a
subset of it) right before invoking WaitLatch and at segment
boundaries.

I think this kind of idea is worth exploring. Andres mentioned timers,
but it seems to me that something where we just do it at certain
"convenient" points might be good enough and simpler. I'd much rather
have statistics that were up to date as of the last time we finished a
segment than have nothing at all.

I created a rough prototype of a timer-based approach for comparison.
Please see attached.

Thanks!

2- I'm also not sure if this timeout should be registered at the beginning
of StartupProcessMain, or does it even make any difference. I tried to do
this just before the main redo loop in PerformWalRecovery, but that made
the CI red.

Huh - do you have a link to the failure? That's how I would expect it to be
done.

 /* Unsupported old recovery command file names (relative to $PGDATA) */
 #define RECOVERY_COMMAND_FILE	"recovery.conf"
@@ -1675,6 +1676,9 @@ PerformWalRecovery(void)
 				ereport_startup_progress("redo in progress, elapsed time: %ld.%02d s, current LSN: %X/%X",
 										 LSN_FORMAT_ARGS(xlogreader->ReadRecPtr));
 
+			/* Is this the right place to enable this? */
+			enable_startup_stat_flush_timeout();
+

I think we should try not to have additional code for this inside the loop - we
should enable the timer once, when needed, not repeatedly.

 #ifdef WAL_DEBUG
 			if (XLOG_DEBUG ||
 				(record->xl_rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||
@@ -3617,6 +3621,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						/* Do background tasks that might benefit us later. */
 						KnownAssignedTransactionIdsIdleMaintenance();
 
+						/*
+						 * Need to disable flush timeout to avoid unnecessary
+						 * wakeups. Enable it again after a WAL record is read
+						 * in PerformWalRecovery.
+						 */
+						disable_startup_stat_flush_timeout();
+
 						(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 										 WL_LATCH_SET | WL_TIMEOUT |
 										 WL_EXIT_ON_PM_DEATH,

I think always disabling the timer here isn't quite right - we want to wake up
*once* in WaitForWALToBecomeAvailable(), otherwise we'll not submit pending
stats before waiting - potentially for a long time - for WAL. One way would be
to just explicitly report before the WaitLatch().

Another approach, I think better, would be to not use enable_timeout_every(),
and to explicitly rearm the timer in HandleStartupProcInterrupts(). When
called from WaitForWALToBecomeAvailable(), we'd not rearm, and instead do so
at the end of WaitForWALToBecomeAvailable(). That way we also wouldn't
repeatedly fire between calls to HandleStartupProcInterrupts().
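
A rough sketch of that second shape; rearm_allowed and
STARTUP_STAT_FLUSH_TIMEOUT are placeholders here, while get_timeout_active(),
enable_timeout_after() and pgstat_report_wal() exist - and the v2 patch later
in the thread ends up implementing a close variant:

	static volatile sig_atomic_t startup_stat_need_flush = false;
	static bool rearm_allowed = true;	/* cleared around long WAL waits */

	/* in HandleStartupProcInterrupts(), after the existing checks */
	if (startup_stat_need_flush)
	{
		startup_stat_need_flush = false;
		pgstat_report_wal(true);
	}

	/*
	 * One-shot rearm: because the handler itself never rearms, the timer
	 * cannot fire repeatedly between two interrupt-processing points.
	 * WaitForWALToBecomeAvailable() would suppress this and rearm once at
	 * its end instead.
	 */
	if (rearm_allowed && !get_timeout_active(STARTUP_STAT_FLUSH_TIMEOUT))
		enable_timeout_after(STARTUP_STAT_FLUSH_TIMEOUT, 1000 /* ms */);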

diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index efc2580536..b250fa95f9 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -72,6 +72,11 @@ static TimestampTz startup_progress_phase_start_time;
  */
 static volatile sig_atomic_t startup_progress_timer_expired = false;
 
+/* Indicates whether flushing stats is needed. */
+static volatile sig_atomic_t startup_stat_need_flush = false;
+
+int			pgstat_stat_flush_timeout = 1000;	/* 1 sec ?? */

We probably should move the existing PGSTAT_MIN_INTERVAL constant from
pgstat.c to pgstat.h.

+extern void enable_startup_stat_flush_timeout(void);
+extern void disable_startup_stat_flush_timeout(void);
+extern void startup_stat_flush_timeout_handler(void);
+
#endif							/* _STARTUP_H */
diff --git a/src/include/utils/timeout.h b/src/include/utils/timeout.h
index e561a1cde9..a8d360e255 100644
--- a/src/include/utils/timeout.h
+++ b/src/include/utils/timeout.h
@@ -35,6 +35,7 @@ typedef enum TimeoutId
 	IDLE_STATS_UPDATE_TIMEOUT,
 	CLIENT_CONNECTION_CHECK_TIMEOUT,
 	STARTUP_PROGRESS_TIMEOUT,
+	STARTUP_STAT_FLUSH_TIMEOUT,
 	/* First user-definable timeout reason */
 	USER_TIMEOUT,
 	/* Maximum number of timeout reasons */

I think we could just reuse IDLE_STATS_UPDATE_TIMEOUT?

Greetings,

Andres Freund

#11 Melanie Plageman
melanieplageman@gmail.com
In reply to: Melih Mutlu (#9)
Re: pg_stat_io for the startup process

On Wed, Apr 26, 2023 at 2:53 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:

Hi all,

On Wed, Apr 26, 2023 at 3:34 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

3. When should we call pgstat_report_stats on the startup process?

During recovery, I think we can call pgstat_report_stats() (or a
subset of it) right before invoking WaitLatch and at segment
boundaries.

I think this kind of idea is worth exploring. Andres mentioned timers,
but it seems to me that something where we just do it at certain
"convenient" points might be good enough and simpler. I'd much rather
have statistics that were up to date as of the last time we finished a
segment than have nothing at all.

I created a rough prototype of a timer-based approach for comparison. Please see attached.

I registered a new timeout named "STARTUP_STAT_FLUSH_TIMEOUT". The timeout's
handler sets a flag indicating that IO stats need to be flushed, and
HandleStartupProcInterrupts checks the flag to flush stats. The timeout is
enabled once any WAL record is read (in the main loop of PerformWalRecovery),
and it's disabled before WaitLatch to avoid unnecessary wakeups if the startup
process decides to sleep.

I was reviewing this and found that if I remove the calls to
disable_startup_stat_flush_timeout(), the number of calls to
pgstat_report_wal() on a briefly active and then idle standby is about
8 in 100 seconds, whereas with timer disablement, the calls over the same
period are about 40. I would have thought that disabling the timer would
have caused us to try to flush the stats less often.

With the calls to disable_startup_stat_flush_timeout(), we do have far
fewer calls to setitimer(), of course.

Given the below suggestion by Andres, I tried doing some traces of the
various approaches.

+						disable_startup_stat_flush_timeout();
+
 						(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 										 WL_LATCH_SET | WL_TIMEOUT |
 										 WL_EXIT_ON_PM_DEATH,

I think always disabling the timer here isn't quite right - we want to wake up
*once* in WaitForWALToBecomeAvailable(), otherwise we'll not submit pending
stats before waiting - potentially for a long time - for WAL. One way would be
to just explicitly report before the WaitLatch().

Another approach, I think better, would be to not use enable_timeout_every(),
and to explicitly rearm the timer in HandleStartupProcInterrupts(). When
called from WaitForWALToBecomeAvailable(), we'd not rearm, and instead do so
at the end of WaitForWALToBecomeAvailable(). That way we also wouldn't
repeatedly fire between calls to HandleStartupProcInterrupts().

After a quick example implementation of this, I found that it seemed to
flush the stats less often on an idle standby (good) than when using
enable_timeout_every().

- Melanie

#12 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Melanie Plageman (#11)
Re: pg_stat_io for the startup process

At Thu, 27 Apr 2023 17:30:40 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in

On Wed, Apr 26, 2023 at 2:53 PM Melih Mutlu <m.melihmutlu@gmail.com> wrote:

Hi all,

On Wed, Apr 26, 2023 at 3:34 PM Robert Haas <robertmhaas@gmail.com> wrote:

On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

3. When should we call pgstat_report_stats on the startup process?

During recovery, I think we can call pgstat_report_stats() (or a
subset of it) right before invoking WaitLatch and at segment
boundaries.

I think this kind of idea is worth exploring. Andres mentioned timers,
but it seems to me that something where we just do it at certain
"convenient" points might be good enough and simpler. I'd much rather
have statistics that were up to date as of the last time we finished a
segment than have nothing at all.

I created a rough prototype of a timer-based approach for comparison. Please see attached.

Registered a new timeout named "STARTUP_STAT_FLUSH_TIMEOUT", The timeout's
handler sets a flag indicating that io stats need to be flushed.
HandleStartupProcInterrupts checks the flag to flush stats. It's enabled if
any WAL record is read (in the main loop of PerformWalRecovery). And it's
disabled before WaitLatch to avoid unnecessary wakeups if the startup process
decides to sleep.

I was reviewing this and found that if I remove the calls to
disable_startup_stat_flush_timeout(), the number of calls to
pgstat_report_wal() on a briefly active and then idle standby is about
8 in 100 seconds, whereas with timer disablement, the calls over the same
period are about 40. I would have thought that disabling the timer would
have caused us to try to flush the stats less often.

With the calls to disable_startup_stat_flush_timeout(), we do have far
fewer calls to setitimer(), of course.

Given the below suggestion by Andres, I tried doing some traces of the
various approaches.

+						disable_startup_stat_flush_timeout();
+
 						(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 										 WL_LATCH_SET | WL_TIMEOUT |
 										 WL_EXIT_ON_PM_DEATH,

I think always disabling the timer here isn't quite right - we want to wake up
*once* in WaitForWALToBecomeAvailable(), otherwise we'll not submit pending
stats before waiting - potentially for a long time - for WAL. One way would be
to just explicitly report before the WaitLatch().

Another approach, I think better, would be to not use enable_timeout_every(),
and to explicitly rearm the timer in HandleStartupProcInterrupts(). When
called from WaitForWALToBecomeAvailable(), we'd not rearm, and instead do so
at the end of WaitForWALToBecomeAvailable(). That way we also wouldn't
repeatedly fire between calls to HandleStartupProcInterrupts().

After a quick example implementation of this, I found that it seemed to
flush the stats less often on an idle standby (good) than when using
enable_timeout_every().

Just rearming with the full interval will work that way. Our existing
strategy for this is seen in PostgresMain().

stats_timeout = pgstat_report_stat(false);
if (stats_timeout > 0)
{
	if (!get_timeout_active(BLAH_TIMEOUT))
		enable_timeout_after(BLAH_TIMEOUT, stats_timeout);
}
else
{
	if (get_timeout_active(BLAH_TIMEOUT))
		disable_timeout(BLAH_TIMEOUT, false);
}
WaitLatch();

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#13 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Kyotaro Horiguchi (#12)
1 attachment(s)
Re: pg_stat_io for the startup process

At Fri, 28 Apr 2023 11:15:51 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Thu, 27 Apr 2023 17:30:40 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in

After a quick example implementation of this, I found that it seemed to
flush the stats less often on an idle standby (good) than when using
enable_timeout_every().

Just rearming with the full-interval will work that way. Our existing
strategy for this is seen in PostgresMain().

stats_timeout = pgstat_report_stat(false);
if (stats_timeout > 0)
{
	if (!get_timeout_active(BLAH_TIMEOUT))
		enable_timeout_after(BLAH_TIMEOUT, stats_timeout);
}
else
{
	if (get_timeout_active(BLAH_TIMEOUT))
		disable_timeout(BLAH_TIMEOUT, false);
}
WaitLatch();

In my example, I left out idle-time flushing, but I realized we don't
need the timeout mechanism here since we're already managing it. So
the following should work (assuming the timestamp updates with
GetCurrentTimestamp() as in my last patch).

@@ -3889,13 +3900,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					/* Update pg_stat_recovery_prefetch before sleeping. */
 					XLogPrefetcherComputeStats(xlogprefetcher);
 
+					/*
+					 * Report stats; if not time yet, set next WaitLatch to
+					 * wake up at the next reporting time.
+					 */
+					wait_time = pgstat_report_stat(false);
+
+					/* if no pending stats, sleep forever */
+					if (wait_time == 0)
+						wait_time = -1L;
+
 					/*
 					 * Wait for more WAL to arrive, when we will be woken
 					 * immediately by the WAL receiver.
 					 */
 					(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 									 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
-									 -1L,
+									 wait_time,
 									 WAIT_EVENT_RECOVERY_WAL_STREAM);

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

Attachments:

v2_pgstat_io_startup.txt (text/plain; charset=us-ascii)
diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 188f6d6f85..0dd465c5e3 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -3248,6 +3248,7 @@ XLogPageRead(XLogReaderState *xlogreader, XLogRecPtr targetPagePtr, int reqLen,
 		close(readFile);
 		readFile = -1;
 		readSource = XLOG_FROM_ANY;
+		pgstat_report_stat(false);
 	}
 
 	XLByteToSeg(targetPagePtr, readSegNo, wal_segment_size);
@@ -3607,6 +3608,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 													wal_retrieve_retry_interval))
 					{
 						long		wait_time;
+						long		stats_timeout;
 
 						wait_time = wal_retrieve_retry_interval -
 							TimestampDifferenceMilliseconds(last_fail_time, now);
@@ -3617,6 +3619,14 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						/* Do background tasks that might benefit us later. */
 						KnownAssignedTransactionIdsIdleMaintenance();
 
+						/*
+						 * Report stats; if not time yet, set next WaitLatch to
+					 * wake up at the next reporting time.
+						 */
+						stats_timeout = pgstat_report_stat(false);
+						if (stats_timeout > 0 && stats_timeout < wait_time)
+							wait_time = stats_timeout;
+
 						(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 										 WL_LATCH_SET | WL_TIMEOUT |
 										 WL_EXIT_ON_PM_DEATH,
@@ -3698,6 +3708,7 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 			case XLOG_FROM_STREAM:
 				{
 					bool		havedata;
+					long		wait_time;
 
 					/*
 					 * We should be able to move to XLOG_FROM_STREAM only in
@@ -3889,13 +3900,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					/* Update pg_stat_recovery_prefetch before sleeping. */
 					XLogPrefetcherComputeStats(xlogprefetcher);
 
+					/*
+					 * Report stats; if not time yet, set next WaitLatch to
+					 * wake up at the next reporting time.
+					 */
+					wait_time = pgstat_report_stat(false);
+
+					/* if no pending stats, sleep forever */
+					if (wait_time == 0)
+						wait_time = -1L;
+
 					/*
 					 * Wait for more WAL to arrive, when we will be woken
 					 * immediately by the WAL receiver.
 					 */
 					(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 									 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
-									 -1L,
+									 wait_time,
 									 WAIT_EVENT_RECOVERY_WAL_STREAM);
 					ResetLatch(&XLogRecoveryCtl->recoveryWakeupLatch);
 					break;
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 1fa689052e..520936d0dd 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1046,9 +1046,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 		bufHdr = LocalBufferAlloc(smgr, forkNum, blockNum, &found);
 		if (found)
 			pgBufferUsage.local_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.local_blks_read++;
 	}
 	else
 	{
@@ -1062,9 +1059,6 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 							 strategy, &found, io_context);
 		if (found)
 			pgBufferUsage.shared_blks_hit++;
-		else if (mode == RBM_NORMAL || mode == RBM_NORMAL_NO_LOG ||
-				 mode == RBM_ZERO_ON_ERROR)
-			pgBufferUsage.shared_blks_read++;
 	}
 
 	/* At this point we do NOT hold any locks. */
@@ -1126,6 +1120,10 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
 
 		pgstat_count_io_op_time(io_object, io_context,
 								IOOP_READ, io_start, 1);
+		if (isLocalBuf)
+			pgBufferUsage.local_blks_read++;
+		else
+			pgBufferUsage.shared_blks_read++;
 
 		/* check for garbage data */
 		if (!PageIsVerifiedExtended((Page) bufBlock, blockNum,
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index b125802b21..0864b56689 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -608,7 +608,10 @@ pgstat_report_stat(bool force)
 	 */
 	Assert(!pgStatLocal.shmem->is_shutdown);
 
-	now = GetCurrentTransactionStopTimestamp();
+	if (MyBackendType != B_STARTUP)
+		now = GetCurrentTransactionStopTimestamp();
+	else
+		now = GetCurrentTimestamp();
 
 	if (!force)
 	{
diff --git a/src/backend/utils/activity/pgstat_database.c b/src/backend/utils/activity/pgstat_database.c
index 7149f22f72..c93b92771a 100644
--- a/src/backend/utils/activity/pgstat_database.c
+++ b/src/backend/utils/activity/pgstat_database.c
@@ -17,6 +17,7 @@
 
 #include "postgres.h"
 
+#include "executor/instrument.h"
 #include "utils/pgstat_internal.h"
 #include "utils/timestamp.h"
 #include "storage/procsignal.h"
@@ -269,7 +270,10 @@ AtEOXact_PgStat_Database(bool isCommit, bool parallel)
 void
 pgstat_update_dbstats(TimestampTz ts)
 {
+	static BufferUsage lastBufferUsage = {0};
 	PgStat_StatDBEntry *dbentry;
+	PgStat_Counter hit_diff;
+	PgStat_Counter read_diff;
 
 	dbentry = pgstat_prep_database_pending(MyDatabaseId);
 
@@ -282,6 +286,16 @@ pgstat_update_dbstats(TimestampTz ts)
 	dbentry->blk_read_time += pgStatBlockReadTime;
 	dbentry->blk_write_time += pgStatBlockWriteTime;
 
+	/* we need to preserve pgBufferUsage */
+	read_diff =
+		pgBufferUsage.shared_blks_read - lastBufferUsage.shared_blks_read +
+		pgBufferUsage.local_blks_read - lastBufferUsage.local_blks_read;
+	hit_diff = 
+		pgBufferUsage.shared_blks_hit - lastBufferUsage.shared_blks_hit +
+		pgBufferUsage.local_blks_hit - lastBufferUsage.local_blks_hit;
+	dbentry->blocks_fetched += read_diff + hit_diff;
+	dbentry->blocks_hit += hit_diff;
+
 	if (pgstat_should_report_connstat())
 	{
 		long		secs;
@@ -304,6 +318,8 @@ pgstat_update_dbstats(TimestampTz ts)
 	pgStatBlockWriteTime = 0;
 	pgStatActiveTime = 0;
 	pgStatTransactionIdleTime = 0;
+
+	lastBufferUsage = pgBufferUsage;
 }
 
 /*
diff --git a/src/backend/utils/activity/pgstat_relation.c b/src/backend/utils/activity/pgstat_relation.c
index 9876e0c1e8..8945b8141a 100644
--- a/src/backend/utils/activity/pgstat_relation.c
+++ b/src/backend/utils/activity/pgstat_relation.c
@@ -844,8 +844,6 @@ pgstat_relation_flush_cb(PgStat_EntryRef *entry_ref, bool nowait)
 	dbentry->tuples_inserted += lstats->counts.tuples_inserted;
 	dbentry->tuples_updated += lstats->counts.tuples_updated;
 	dbentry->tuples_deleted += lstats->counts.tuples_deleted;
-	dbentry->blocks_fetched += lstats->counts.blocks_fetched;
-	dbentry->blocks_hit += lstats->counts.blocks_hit;
 
 	return true;
 }
#14 Melih Mutlu
m.melihmutlu@gmail.com
In reply to: Andres Freund (#10)
1 attachment(s)
Re: pg_stat_io for the startup process

Hi,

On Thu, Apr 27, 2023 at 7:27 PM Andres Freund <andres@anarazel.de> wrote:

Huh - do you have a link to the failure? That's how I would expect it to be
done.

I think I should have registered it at the beginning of
PerformWalRecovery() and not just before the main redo loop,
since HandleStartupProcInterrupts is called before the loop too.
Here's the error log otherwise [1].

 #ifdef WAL_DEBUG
 			if (XLOG_DEBUG ||
 				(record->xl_rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||
@@ -3617,6 +3621,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 						/* Do background tasks that might benefit us later. */
 						KnownAssignedTransactionIdsIdleMaintenance();
 
+						/*
+						 * Need to disable flush timeout to avoid unnecessary
+						 * wakeups. Enable it again after a WAL record is read
+						 * in PerformWalRecovery.
+						 */
+						disable_startup_stat_flush_timeout();
+
 						(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 										 WL_LATCH_SET | WL_TIMEOUT |
 										 WL_EXIT_ON_PM_DEATH,

I think always disabling the timer here isn't quite right - we want to wake up
*once* in WaitForWALToBecomeAvailable(), otherwise we'll not submit pending
stats before waiting - potentially for a long time - for WAL. One way would be
to just explicitly report before the WaitLatch().

Another approach, I think better, would be to not use enable_timeout_every(),
and to explicitly rearm the timer in HandleStartupProcInterrupts(). When
called from WaitForWALToBecomeAvailable(), we'd not rearm, and instead do so
at the end of WaitForWALToBecomeAvailable(). That way we also wouldn't
repeatedly fire between calls to HandleStartupProcInterrupts().

The attached patch is probably not doing what you asked. IIUC,
HandleStartupProcInterrupts should arm the timer but shouldn't arm it
if it's called from WaitForWALToBecomeAvailable, and the timer should be
armed again at the very end of WaitForWALToBecomeAvailable. I've been
thinking about how to do this properly. Should HandleStartupProcInterrupts
take a parameter to decide whether the timer needs to be armed? Or do we
need to add an additional global flag to rearm the timer? Any thoughts?

[1] https://api.cirrus-ci.com/v1/artifact/task/5282291971260416/testrun/build/testrun/recovery/010_logical_decoding_timelines/log/010_logical_decoding_timelines_replica.log

Best,
--
Melih Mutlu
Microsoft

Attachments:

v2-0001-Add-timeout-to-flush-stats-during-startup-s-main-replay-loop.patch (application/octet-stream)
From 9be7360e49db424c45c53e85efe8a4f5359b5244 Mon Sep 17 00:00:00 2001
From: Melih Mutlu <m.melihmutlu@gmail.com>
Date: Wed, 26 Apr 2023 18:21:32 +0300
Subject: [PATCH v2] Add timeout to flush stats during startup's main replay
 loop

---
 src/backend/access/transam/xlogrecovery.c |  7 +++++
 src/backend/postmaster/startup.c          | 34 +++++++++++++++++++++++
 src/backend/utils/activity/pgstat.c       | 14 ----------
 src/include/pgstat.h                      | 14 ++++++++++
 src/include/postmaster/startup.h          |  3 ++
 5 files changed, 58 insertions(+), 14 deletions(-)

diff --git a/src/backend/access/transam/xlogrecovery.c b/src/backend/access/transam/xlogrecovery.c
index 188f6d6f85..53a4939be9 100644
--- a/src/backend/access/transam/xlogrecovery.c
+++ b/src/backend/access/transam/xlogrecovery.c
@@ -63,6 +63,7 @@
 #include "utils/pg_lsn.h"
 #include "utils/ps_status.h"
 #include "utils/pg_rusage.h"
+#include "utils/timeout.h"
 
 /* Unsupported old recovery command file names (relative to $PGDATA) */
 #define RECOVERY_COMMAND_FILE	"recovery.conf"
@@ -1628,6 +1629,10 @@ PerformWalRecovery(void)
 	 */
 	CheckRecoveryConsistency();
 
+	/* Register the timeout to flush wal stats when it's needed. */
+	RegisterTimeout(IDLE_STATS_UPDATE_TIMEOUT,
+					idle_stats_update_timeout_handler);
+
 	/*
 	 * Find the first record that logically follows the checkpoint --- it
 	 * might physically precede it, though.
@@ -1666,6 +1671,8 @@ PerformWalRecovery(void)
 		if (!StandbyMode)
 			begin_startup_progress_phase();
 
+		enable_idle_stats_update_timeout();
+
 		/*
 		 * main redo apply loop
 		 */
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index efc2580536..842394bc8f 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -72,6 +72,9 @@ static TimestampTz startup_progress_phase_start_time;
  */
 static volatile sig_atomic_t startup_progress_timer_expired = false;
 
+/* Indicates whether flushing stats is needed. */
+static volatile sig_atomic_t idle_stats_update_pending = false;
+
 /*
  * Time between progress updates for long-running startup operations.
  */
@@ -206,6 +209,18 @@ HandleStartupProcInterrupts(void)
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
 		ProcessLogMemoryContextInterrupt();
+
+	if (idle_stats_update_pending)
+	{
+		/* It's time to report wal stats. */
+		pgstat_report_wal(true);
+		idle_stats_update_pending = false;
+	}
+	else if (!get_timeout_active(IDLE_STATS_UPDATE_TIMEOUT))
+	{
+		/* Set the next timeout. */
+		enable_idle_stats_update_timeout();
+	}
 }
 
 
@@ -385,3 +400,22 @@ has_startup_progress_timeout_expired(long *secs, int *usecs)
 
 	return true;
 }
+
+/* Set a flag indicating that it's time to flush wal stats. */
+void
+idle_stats_update_timeout_handler(void)
+{
+	idle_stats_update_pending = true;
+	WakeupRecovery();
+}
+
+/* Enable the timeout set for wal stat flush. */
+void
+enable_idle_stats_update_timeout(void)
+{
+	TimestampTz fin_time;
+
+	fin_time = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
+										   PGSTAT_MIN_INTERVAL);
+	enable_timeout_at(IDLE_STATS_UPDATE_TIMEOUT, fin_time);
+}
\ No newline at end of file
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index b125802b21..a3ca3e648d 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -109,20 +109,6 @@
 #include "utils/timestamp.h"
 
 
-/* ----------
- * Timer definitions.
- *
- * In milliseconds.
- * ----------
- */
-
-/* minimum interval non-forced stats flushes.*/
-#define PGSTAT_MIN_INTERVAL			1000
-/* how long until to block flushing pending stats updates */
-#define PGSTAT_MAX_INTERVAL			60000
-/* when to call pgstat_report_stat() again, even when idle */
-#define PGSTAT_IDLE_INTERVAL		10000
-
 /* ----------
  * Initial size hints for the hash tables used in statistics.
  * ----------
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index fff4ad5b6d..069e08006c 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -31,6 +31,20 @@
 /* Default directory to store temporary statistics data in */
 #define PG_STAT_TMP_DIR		"pg_stat_tmp"
 
+/* ----------
+ * Timer definitions.
+ *
+ * In milliseconds.
+ * ----------
+ */
+
+/* minimum interval non-forced stats flushes.*/
+#define PGSTAT_MIN_INTERVAL			1000
+/* how long until to block flushing pending stats updates */
+#define PGSTAT_MAX_INTERVAL			60000
+/* when to call pgstat_report_stat() again, even when idle */
+#define PGSTAT_IDLE_INTERVAL		10000
+
 /* The types of statistics entries */
 typedef enum PgStat_Kind
 {
diff --git a/src/include/postmaster/startup.h b/src/include/postmaster/startup.h
index 6a2e4c4526..f1e051811f 100644
--- a/src/include/postmaster/startup.h
+++ b/src/include/postmaster/startup.h
@@ -38,4 +38,7 @@ extern void begin_startup_progress_phase(void);
 extern void startup_progress_timeout_handler(void);
 extern bool has_startup_progress_timeout_expired(long *secs, int *usecs);
 
+extern void enable_idle_stats_update_timeout(void);
+extern void idle_stats_update_timeout_handler(void);
+
 #endif							/* _STARTUP_H */
-- 
2.25.1

#15 Melanie Plageman
melanieplageman@gmail.com
In reply to: Kyotaro Horiguchi (#7)
Re: pg_stat_io for the startup process

I've reviewed both your patch versions and responded to the ideas in
both of them and the associated emails below.

On Wed, Apr 26, 2023 at 5:47 AM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Tue, 25 Apr 2023 16:04:23 -0700, Andres Freund <andres@anarazel.de> wrote in

key those stats by oid. However, it *does* report the read/write time. But
only at process exit, of course. The weird part is that the startup process
does *NOT* increase pg_stat_database.blks_read/blks_hit, because instead of
basing those on pgBufferUsage.shared_blks_read etc, we compute them based on
the relation level stats. pgBufferUsage is just used for EXPLAIN. This isn't
recent, afaict.

I see four issues here.

1. The current database stats omits buffer fetches that don't
originate from a relation.

In this case pgstat_relation can't work since recovery isn't conscious
of relids. We might be able to resolve relfilenode into a relid, but
it may not be that simple. Fortunately we already count fetches and
hits process-wide using pgBufferUsage, so we can use this for database
stats.

...

TL;DR: Currently the startup process maintains blk_read_time, blk_write_time,
but doesn't maintain blks_read, blks_hit - which doesn't make sense.

As a result, the attached patch, which is meant for discussion, allows
pg_stat_database to show fetches and reads by the startup process as
the counts for the database with id 0.

I would put this in its own patch in a patchset. Of course it relies on
having pgstat_report_stat() called at appropriate times by the startup
process, but having pg_stat_database show read/hit counts is a separate
issue from having pg_stat_io do so. I'm not suggesting we do this, but
you could argue that, if we fix the startup process stats reporting,
pg_stat_database not showing reads and hits for the startup process is a
bug that also exists in 15.

Not directly related, but I do not get why the existing stats counters
for pg_stat_database count "fetches" and "hits" and then use subtraction
to get the number of reads. I find it confusing, and it seems like it
could lead to subtle inconsistencies compared with counting reads closer
to where they actually happen.

2. Even if we wanted to report stats for the startup process,
pgstat_report_stats wouldn't permit it since transaction-end
timestamp doesn't advance.

I'm not certain if it's the correct approach, but perhaps we could use
GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp()
specifically for the startup process.

In theory, since all of the approaches proposed in this thread would
exercise rigid control over how often we flush stats in the startup
process, I think it is okay to use GetCurrentTimestamp() when
pgstat_report_stat() is called by the startup process (i.e. we don't
have to worry about the overhead of doing it). But looking at it
implemented in the patch made me feel unsettled for some reason.
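For concreteness, a minimal sketch of what that suggestion would amount
to inside pgstat_report_stat() -- hypothetical, not what was ultimately
committed; AmStartupProcess() is the existing check from miscadmin.h:

/* Sketch: the startup process has no advancing transaction stop
 * timestamp, so fall back to the wall clock for it. */
TimestampTz now;

if (AmStartupProcess())
	now = GetCurrentTimestamp();
else
	now = GetCurrentTransactionStopTimestamp();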

3. When should we call pgstat_report_stats on the startup process?

During recovery, I think we can call pgstat_report_stats() (or a
subset of it) right before invoking WaitLatch and at segment
boundaries.

I see in the patch you call pgstat_report_stat() in XLogPageRead(). Will
this only be called on segment boundaries?
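For illustration, such a guard in XLogPageRead() might look like the
sketch below; XLogSegmentOffset() is the existing macro from
xlog_internal.h, but the placement and condition are hypothetical:

/* Only attempt a flush when targetPagePtr starts a new WAL segment. */
if (XLogSegmentOffset(targetPagePtr, wal_segment_size) == 0)
	(void) pgstat_report_stat(false);

If that is the intent, reporting frequency would scale with
wal_segment_size, which is the concern raised later in this thread.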

4. In the existing ReadBuffer_common, there's an inconsistency in
counting hits and reads between pgstat_io and pgBufferUsage.

The difference comes from the case of RBM_ZERO pages. We should simply
align them.

I would definitely make this a separate patch and probably a separate
thread. It isn't related to the startup process and is worth a separate
discussion.

On Thu, Apr 27, 2023 at 10:43 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Fri, 28 Apr 2023 11:15:51 +0900 (JST), Kyotaro Horiguchi <horikyota.ntt@gmail.com> wrote in

At Thu, 27 Apr 2023 17:30:40 -0400, Melanie Plageman <melanieplageman@gmail.com> wrote in

After a quick example implementation of this, I found that it seemed to
flush the stats less often on an idle standby (good) than using
enable_timeout_every().

Just rearming with the full-interval will work that way. Our existing
strategy for this is seen in PostgresMain().

stats_timeout = pgstat_report_stat(false);
if (stats_timeout > 0)
{
	if (!get_timeout_active(BLAH_TIMEOUT))
		enable_timeout_after(BLAH_TIMEOUT, stats_timeout);
}
else
{
	if (get_timeout_active(BLAH_TIMEOUT))
		disable_timeout(BLAH_TIMEOUT, false);
}
WaitLatch();

In my example, I left out idle-time flushing, but I realized we don't
need the timeout mechanism here since we're already managing it. So
the following should work (assuming the timestamp is updated with
GetCurrentTimestamp() as in my last patch).

@@ -3889,13 +3900,23 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					/* Update pg_stat_recovery_prefetch before sleeping. */
 					XLogPrefetcherComputeStats(xlogprefetcher);
 
+					/*
+					 * Report stats; if not time yet, set the next WaitLatch
+					 * to wake up at the next reporting time.
+					 */
+					wait_time = pgstat_report_stat(false);
+
+					/* if no pending stats, sleep forever */
+					if (wait_time == 0)
+						wait_time = -1L;
+
 					/*
 					 * Wait for more WAL to arrive, when we will be woken
 					 * immediately by the WAL receiver.
 					 */
 					(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 									 WL_LATCH_SET | WL_EXIT_ON_PM_DEATH,
-									 -1L,
+									 wait_time,
 									 WAIT_EVENT_RECOVERY_WAL_STREAM);

The idle-time flushing is a great point I did not think of. I think
Andres did have some concern with unconditionally calling
pgstat_report_stat() in WaitForWalToBecomeAvailable() before WaitLatch()
-- I believe because it would be called too often and attempt flushing
multiple times between calls to HandleStartupProcInterrupts().

- Melanie

#16Melanie Plageman
melanieplageman@gmail.com
In reply to: Melih Mutlu (#14)
Re: pg_stat_io for the startup process

On Wed, May 03, 2023 at 04:11:33PM +0300, Melih Mutlu wrote:

On Thu, Apr 27, 2023 at 7:27 PM Andres Freund <andres@anarazel.de> wrote:

#ifdef WAL_DEBUG
	if (XLOG_DEBUG ||
		(record->xl_rmid == RM_XACT_ID && trace_recovery_messages <= DEBUG2) ||

@@ -3617,6 +3621,13 @@ WaitForWALToBecomeAvailable(XLogRecPtr RecPtr, bool randAccess,
 					/* Do background tasks that might benefit us later. */
 					KnownAssignedTransactionIdsIdleMaintenance();
 
+					/*
+					 * Need to disable flush timeout to avoid unnecessary
+					 * wakeups. Enable it again after a WAL record is read
+					 * in PerformWalRecovery.
+					 */
+					disable_startup_stat_flush_timeout();
+
 					(void) WaitLatch(&XLogRecoveryCtl->recoveryWakeupLatch,
 									 WL_LATCH_SET | WL_TIMEOUT |
 									 WL_EXIT_ON_PM_DEATH,
I think always disabling the timer here isn't quite right - we want to
wake up *once* in WaitForWALToBecomeAvailable(), otherwise we'll not
submit pending stats before waiting - potentially for a long time - for
WAL. One way would be to just explicitly report before the WaitLatch().

Another approach, I think better, would be to not use
enable_timeout_every(), and to explicitly rearm the timer in
HandleStartupProcInterrupts(). When called from
WaitForWALToBecomeAvailable(), we'd not rearm, and instead do so at the
end of WaitForWALToBecomeAvailable(). That way we also wouldn't
repeatedly fire between calls to HandleStartupProcInterrupts().

The attached patch is probably not doing what you asked. IIUC,
HandleStartupProcInterrupts should arm the timer, but shouldn't arm it
if it's called from WaitForWALToBecomeAvailable; the timer should
instead be armed again at the very end of WaitForWALToBecomeAvailable.
I've been thinking about how to do this properly. Should
HandleStartupProcInterrupts take a parameter to decide whether the timer
needs to be armed? Or do we need to add an additional global flag to
rearm the timer? Any thoughts?

I had the same question about how to determine whether or not to rearm.
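For concreteness, a minimal sketch of the parameter-based variant --
hypothetical; the rearm_stats_timeout parameter does not exist in the
attached patch, the rest of the names come from it:

/* Sketch: let the caller decide whether the timer gets rearmed. */
void
HandleStartupProcInterrupts(bool rearm_stats_timeout)
{
	/* ... existing interrupt handling ... */

	if (idle_stats_update_pending)
	{
		/* It's time to report wal stats. */
		pgstat_report_wal(true);
		idle_stats_update_pending = false;
	}
	else if (rearm_stats_timeout &&
			 !get_timeout_active(IDLE_STATS_UPDATE_TIMEOUT))
	{
		/* Set the next timeout. */
		enable_idle_stats_update_timeout();
	}
}

WaitForWALToBecomeAvailable() would then pass false and rearm explicitly
once at its end, per Andres' suggestion above.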

From 9be7360e49db424c45c53e85efe8a4f5359b5244 Mon Sep 17 00:00:00 2001
From: Melih Mutlu <m.melihmutlu@gmail.com>
Date: Wed, 26 Apr 2023 18:21:32 +0300
Subject: [PATCH v2] Add timeout to flush stats during startup's main replay
loop
diff --git a/src/backend/postmaster/startup.c b/src/backend/postmaster/startup.c
index efc2580536..842394bc8f 100644
--- a/src/backend/postmaster/startup.c
+++ b/src/backend/postmaster/startup.c
@@ -72,6 +72,9 @@ static TimestampTz startup_progress_phase_start_time;
  */
 static volatile sig_atomic_t startup_progress_timer_expired = false;
 
+/* Indicates whether flushing stats is needed. */
+static volatile sig_atomic_t idle_stats_update_pending = false;
+
 /*
  * Time between progress updates for long-running startup operations.
  */
@@ -206,6 +209,18 @@ HandleStartupProcInterrupts(void)
 	/* Perform logging of memory contexts of this process */
 	if (LogMemoryContextPending)
 		ProcessLogMemoryContextInterrupt();
+
+	if (idle_stats_update_pending)
+	{
+		/* It's time to report wal stats. */

If we want dbstats to be updated, we'll probably have to call
pgstat_report_stat() here and fix the timestamp issue Horiguchi-san
points out upthread. Perhaps you could review those changes and consider
adding them as preliminary patches before this one in a patchset.

I think you will then need to handle the issue he mentions with partial
flushes coming from pgstat_report_stat(), remembering to try to flush
stats again in case of a partial flush. Though, perhaps we can just pass
force=true.

+		pgstat_report_wal(true);
+		idle_stats_update_pending = false;
+	}

Good that you thought to check if the timeout was already active.

+	else if (!get_timeout_active(IDLE_STATS_UPDATE_TIMEOUT))
+	{
+		/* Set the next timeout. */
+		enable_idle_stats_update_timeout();
+	}
}

@@ -385,3 +400,22 @@ has_startup_progress_timeout_expired(long *secs, int *usecs)
 
 	return true;
 }
+
+/* Set a flag indicating that it's time to flush wal stats. */
+void
+idle_stats_update_timeout_handler(void)
+{
+	idle_stats_update_pending = true;

Is WakeupRecovery() needed when the timer goes off and the startup
process is waiting on a latch? Does this happen in
WaitForWalToBecomeAvailable()?

+	WakeupRecovery();
+}
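For reference, WakeupRecovery() just sets the startup process's recovery
wakeup latch -- roughly the following, paraphrased from xlogrecovery.c --
so it only has an effect while the process is blocked in WaitLatch() on
that latch, as in WaitForWALToBecomeAvailable():

/* Wake up the startup process if it is sleeping on its recovery latch. */
void
WakeupRecovery(void)
{
	SetLatch(&XLogRecoveryCtl->recoveryWakeupLatch);
}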
+
+/* Enable the timeout set for wal stat flush. */
+void
+enable_idle_stats_update_timeout(void)
+{
+	TimestampTz fin_time;
+
+	fin_time = TimestampTzPlusMilliseconds(GetCurrentTimestamp(),
+										   PGSTAT_MIN_INTERVAL);

It is a shame we have to end up calling GetCurrentTimestamp() since we
are using enable_timeout_at(). Couldn't we use enable_timeout_after()?

+	enable_timeout_at(IDLE_STATS_UPDATE_TIMEOUT, fin_time);
+}

- Melanie

#17Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#7)
Re: pg_stat_io for the startup process

Hi,

On 2023-04-26 18:47:14 +0900, Kyotaro Horiguchi wrote:

I see four issues here.

1. The current database stats omits buffer fetches that don't
originate from a relation.

In this case pgstat_relation can't work since recovery isn't conscious
of relids. We might be able to resolve relfilenode into a relid, but
it may not be that simple. Fortunately we already count fetches and
hits process-wide using pgBufferUsage, so we can use this for database
stats.

I don't think we need to do anything about that for 16 - they aren't updated
at process end either.

I think the fix here is to do the architectural change of maintaining most
stats keyed by relfilenode as we've discussed in some other threads. Then we
also can have relation level write stats etc.

2. Even if we wanted to report stats for the startup process,
pgstat_report_stats wouldn't permit it since transaction-end
timestamp doesn't advance.

I'm not certain if it's the correct approach, but perhaps we could use
GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp()
specifically for the startup process.

What about using GetCurrentTimestamp() when force == true? That'd make sense
for other users as well, I think?

3. When should we call pgstat_report_stats on the startup process?

During recovery, I think we can call pgstat_report_stats() (or a
subset of it) right before invoking WaitLatch and at segment
boundaries.

I've pondered that as well. But I don't think it's great - it's not exactly
intuitive that stats reporting gets far less common if you use a 1GB
wal_segment_size.
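(To put rough, purely illustrative numbers on that: at a sustained
replay rate of 100 MB/s, a 16 MB segment boundary comes up several times
per second, while a 1 GB boundary comes up only about every ten seconds;
on a mostly idle standby, both can be arbitrarily far apart.)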

Greetings,

Andres Freund

#18Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#17)
Re: pg_stat_io for the startup process

Thanks for clarifying!

At Mon, 8 May 2023 14:46:43 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2023-04-26 18:47:14 +0900, Kyotaro Horiguchi wrote:

I see four issues here.

1. The current database stats omits buffer fetches that don't
originate from a relation.

In this case pgstat_relation can't work since recovery isn't conscious
of relids. We might be able to resolve relfilenode into a relid, but
it may not be that simple. Fortunately we already count fetches and
hits process-wide using pgBufferUsage, so we can use this for database
stats.

I don't think we need to do anything about that for 16 - they aren't updated
at process end either.

I think the fix here is to do the architectural change of maintaining most
stats keyed by relfilenode as we've discussed in some other threads. Then we
also can have relation level write stats etc.

I think so.

2. Even if we wanted to report stats for the startup process,
pgstat_report_stats wouldn't permit it since transaction-end
timestamp doesn't advance.

I'm not certain if it's the correct approach, but perhaps we could use
GetCurrentTimestamp() instead of GetCurrentTransactionStopTimestamp()
specifically for the startup process.

What about using GetCurrentTimestamp() when force == true? That'd make sense
for other users as well, I think?

I'm not sure if I got you right, but when force==true, it allows
pgstat_report_stats to flush without considering whether the interval
has elapsed. In that case, there's no need to keep track of the last
flush time and the caller should handle the interval instead.

3. When should we call pgstat_report_stats on the startup process?

During recovery, I think we can call pgstat_report_stats() (or a
subset of it) right before invoking WaitLatch and at segment
boundaries.

I've pondered that as well. But I don't think it's great - it's not exactly
intuitive that stats reporting gets far less common if you use a 1GB
wal_segment_size.

If the segment size gets larger, the archive intervals become longer.
So I have a vague feeling that users wouldn't go for such a large
segment size. I don't have a clear idea of the ideal stats reporting
interval in this case, but I think reporting every few minutes or so
would not be so bad for the startup process when recovery gets busy.
Also, I think recovery will often wait for new data once it catches up
to the primary.

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#19Andres Freund
andres@anarazel.de
In reply to: Melih Mutlu (#14)
Re: pg_stat_io for the startup process

Hi,

I don't really feel confident we're going to get the timeout approach solid
enough for something going into the tree around beta 1.

How about this, imo a lot simpler, approach: We flush stats whenever replaying
an XLOG_RUNNING_XACTS record. Unless the primary is idle, it will log those at
a regular interval. If the primary is idle, we don't need to flush stats in
the startup process, because we'll not have done any io.

We only log XLOG_RUNNING_XACTS when wal_level >= replica, so stats wouldn't
get regularly flushed if wal_level = minimal - but in that case the stats are
also not accessible, so that's not a problem.

It's not the prettiest solution, but I think the simplicity is worth a lot.

Greetings,

Andres Freund

#20Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#19)
Re: pg_stat_io for the startup process

At Sun, 21 May 2023 15:14:23 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

I don't really feel confident we're going to get the timeout approach solid
enough for something going into the tree around beta 1.

How about this, imo a lot simpler, approach: We flush stats whenever replaying
an XLOG_RUNNING_XACTS record. Unless the primary is idle, it will log those at
a regular interval. If the primary is idle, we don't need to flush stats in
the startup process, because we'll not have done any io.

We only log XLOG_RUNNING_XACTS when wal_level >= replica, so stats wouldn't
get regularly flushed if wal_level = minimal - but in that case the stats are
also not accessible, so that's not a problem.

I concur on all three aspects: interval regularity, reliability, and
the case of minimal wal_level. While I can't confidently claim it is
the best approach, its simplicity gives us a solid reason to proceed
with it. If necessary, we can switch to alternative timing sources in
the future without causing major disruptions for users, I believe.

It's not the prettiest solution, but I think the simplicity is worth a lot.

Greetings,

+1.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#21Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#20)
1 attachment(s)
Re: pg_stat_io for the startup process

Hi,

On 2023-05-22 13:36:42 +0900, Kyotaro Horiguchi wrote:

At Sun, 21 May 2023 15:14:23 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

I don't really feel confident we're going to get the timeout approach solid
enough for something going into the tree around beta 1.

How about this, imo a lot simpler, approach: We flush stats whenever replaying
an XLOG_RUNNING_XACTS record. Unless the primary is idle, it will log those at
a regular interval. If the primary is idle, we don't need to flush stats in
the startup process, because we'll not have done any io.

We only log XLOG_RUNNING_XACTS when wal_level >= replica, so stats wouldn't
get regularly flushed if wal_level = minimal - but in that case the stats are
also not accessible, so that's not a problem.

I concur on all three aspects: interval regularity, reliability, and
the case of minimal wal_level. While I can't confidently claim it is
the best approach, its simplicity gives us a solid reason to proceed
with it. If necessary, we can switch to alternative timing sources in
the future without causing major disruptions for users, I believe.

It's not the prettiest solution, but I think the simplicity is worth a lot.

Greetings,

+1.

Attached a patch implementing this approach.

It's imo always a separate bug that we were using
GetCurrentTransactionStopTimestamp() when force was passed in - that timestamp
can be quite out of date in some cases. But I don't immediately see a
noticeable consequence, so ...

Greetings,

Andres Freund

Attachments:

v1-0001-Report-stats-when-replaying-XLOG_RUNNING_XACTS.patch (text/x-diff; charset=us-ascii)
From 249c811da9d257c5a71e42016584fc3db0c9a99b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 9 Jun 2023 21:19:36 -0700
Subject: [PATCH v1] Report stats when replaying XLOG_RUNNING_XACTS

Previously stats in the startup process would only get reported during
shutdown of the startup process. It has been that way for a long time, but
became a lot more noticeable with the new pg_stat_io view, which separates out
IO done by different backend types...

While reporting stats after replaying every XLOG_RUNNING_XACTS record
isn't the prettiest approach, it has the advantage of being quite easy.
Given that we're well past feature freeze...

It's not a problem that we don't report stats more frequently with
wal_level=minimal; in that case stats can't be read before the startup
process has shut down.

Reported-by: Fujii Masao <masao.fujii@oss.nttdata.com>
Discussion: https://postgr.es/m/5315aedc-fbca-1556-c5de-dc2e00b23a14@oss.nttdata.com
---
 src/backend/storage/ipc/standby.c   |  9 +++++++++
 src/backend/utils/activity/pgstat.c | 17 ++++++++++++++---
 2 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/src/backend/storage/ipc/standby.c b/src/backend/storage/ipc/standby.c
index ffe5e1563f5..70b0da50fc1 100644
--- a/src/backend/storage/ipc/standby.c
+++ b/src/backend/storage/ipc/standby.c
@@ -1193,6 +1193,15 @@ standby_redo(XLogReaderState *record)
 		running.xids = xlrec->xids;
 
 		ProcArrayApplyRecoveryInfo(&running);
+
+		/*
+		 * The startup process currently has no convenient way to schedule
+		 * stats to be reported. XLOG_RUNNING_XACTS records are issued at a
+		 * regular cadence, making this a convenient location to report
+		 * stats. While these records aren't generated with wal_level=minimal,
+		 * stats also cannot be accessed during WAL replay.
+		 */
+		pgstat_report_stat(true);
 	}
 	else if (info == XLOG_INVALIDATIONS)
 	{
diff --git a/src/backend/utils/activity/pgstat.c b/src/backend/utils/activity/pgstat.c
index 0cdb552631e..d743fc0b289 100644
--- a/src/backend/utils/activity/pgstat.c
+++ b/src/backend/utils/activity/pgstat.c
@@ -615,10 +615,21 @@ pgstat_report_stat(bool force)
 	 */
 	Assert(!pgStatLocal.shmem->is_shutdown);
 
-	now = GetCurrentTransactionStopTimestamp();
-
-	if (!force)
+	if (force)
 	{
+		/*
+		 * Stats reports are forced either when it's been too long since stats
+		 * have been reported or in processes that force stats reporting to
+		 * happen at specific points (including shutdown). In the former case
+		 * the transaction stop time might be quite old, in the latter it
+		 * would never get cleared.
+		 */
+		now = GetCurrentTimestamp();
+	}
+	else
+	{
+		now = GetCurrentTransactionStopTimestamp();
+
 		if (pending_since > 0 &&
 			TimestampDifferenceExceeds(pending_since, now, PGSTAT_MAX_INTERVAL))
 		{
-- 
2.38.0

#22Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#21)
Re: pg_stat_io for the startup process

At Fri, 9 Jun 2023 21:28:23 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

On 2023-05-22 13:36:42 +0900, Kyotaro Horiguchi wrote:

At Sun, 21 May 2023 15:14:23 -0700, Andres Freund <andres@anarazel.de> wrote in

Hi,

I don't really feel confident we're going to get the timeout approach solid
enough for something going into the tree around beta 1.

How about this, imo a lot simpler, approach: We flush stats whenever replaying
an XLOG_RUNNING_XACTS record. Unless the primary is idle, it will log those at
a regular interval. If the primary is idle, we don't need to flush stats in
the startup process, because we'll not have done any io.

We only log XLOG_RUNNING_XACTS when wal_level >= replica, so stats wouldn't
get regularly flushed if wal_level = minimal - but in that case the stats are
also not accessible, so that's not a problem.

I concur on all three aspects: interval regularity, reliability, and
the case of minimal wal_level. While I can't confidently claim it is
the best approach, its simplicity gives us a solid reason to proceed
with it. If necessary, we can switch to alternative timing sources in
the future without causing major disruptions for users, I believe.

It's not the prettiest solution, but I think the simplicity is worth a lot.

Greetings,

+1.

Attached a patch implementing this approach.

It's imo always a separate bug that we were using
GetCurrentTransactionStopTimestamp() when force was passed in - that timestamp
can be quite out of date in some cases. But I don't immediately see a
noticeable consequence, so ...

Thanks!

I think that the reason is that we only pass true when we're in a sort
of idle time. In addition, XLOG_RUNNING_XACTS comes too infrequently
to cause noticeable degradation. If it causes sufficiently frequent
updates, we will be satisfied with it.

The patch is quite simple and looks good to me. (At least way better
than always using GetCurrentTimestamp():)

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#23Andres Freund
andres@anarazel.de
In reply to: Kyotaro Horiguchi (#22)
Re: pg_stat_io for the startup process

Hi,

On 2023-06-13 12:54:19 +0900, Kyotaro Horiguchi wrote:

I think that the reason is that we only pass true when we're in a sort
of idle time. In addition, XLOG_RUNNING_XACTS comes too infrequently
to cause noticeable degradation. If it causes sufficiently frequent
updates, we will be satisfied with it.

The patch is quite simple and looks good to me. (At least way better
than always using GetCurrentTimestamp():)

Thanks for looking - I already had pushed the commit by the time I read your
email, otherwise I'd have mentioned you reviewing it...

Greetings,

Andres Freund

#24Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Andres Freund (#23)
Re: pg_stat_io for the startup process

At Mon, 12 Jun 2023 21:13:35 -0700, Andres Freund <andres@anarazel.de> wrote in

Thanks for looking - I already had pushed the commit by the time I read your
email, otherwise I'd have mentioned you reviewing it...

Oops! But I anticipated that, and that's no problem (I should have
checked for the commit). Thanks for the reply!

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center