[PATCH] fix wait_event of pg_stat_activity in case of high amount of connections

Started by Michael Zhilin over 3 years ago, 8 messages
#1 Michael Zhilin
m.zhilin@postgrespro.ru
1 attachment(s)

Hi,

I would like to submit a patch for the wait_event column of the
pg_stat_activity view. Please find it attached.

The pg_stat_activity view provides a snapshot of the current backends' state,
including the following columns:
- wait_event: the name of the event the backend is currently waiting for;
- state: the backend state, for instance active or idle.

We observe that under high load the value of the wait_event column of
pg_stat_activity can be incorrect. For instance, it can show "ClientRead"
for active backends. According to the source code, the wait event
"ClientRead" is possible only for backends that are idle or 'idle in
transaction', so if a backend is active its wait_event can't be 'ClientRead'.

The key parameter to reproduce the defect is a high value of max_connections
(more than 1000). Let's do the following:
- create a new database cluster (initdb)
- set max_connections to 10000
- run pgbench in select-only mode (200 connections, 1 job)
- fetch data from pg_stat_activity

The following query can be used to detect the problem:

select state, wait_event, backend_type, count(1)
from pg_stat_activity
group by 1,2,3;

The pgbench command is the following:

pgbench -n -c 200 -j 1 -T 60 -P 1 -M prepared -S postgres

Before the patch, the output looks like this:

postgres=# select state, wait_event, backend_type, count(1) from
pg_stat_activity group by 1,2,3;
state | wait_event | backend_type | count
--------+---------------------+------------------------------+-------
idle | | client backend | 3
active | | client backend | 1
| WalWriterMain | walwriter | 1
| CheckpointerMain | checkpointer | 1
| LogicalLauncherMain | logical replication launcher | 1
| AutoVacuumMain | autovacuum launcher | 1
| BgWriterHibernate | background writer | 1
active | ClientRead | client backend | 4
idle | ClientRead | client backend | 193
(9 rows)

Time: 4.406 ms

Please pay attention to the rows with state 'active' and wait_event
'ClientRead'. According to the output above, there are 4 backends with this
combination of state and wait_event.

After the patch, the output is better:

postgres=# select state, wait_event, backend_type, count(1) from
pg_stat_activity group by 1,2,3;
state | wait_event | backend_type | count
--------+---------------------+------------------------------+-------
| | walwriter | 1
active | | client backend | 5
| LogicalLauncherMain | logical replication launcher | 1
| AutoVacuumMain | autovacuum launcher | 1
| | background writer | 1
idle | ClientRead | client backend | 196
| | checkpointer | 1
(7 rows)

Time: 1.520 ms

The rows with active/ClientRead and idle/no-wait have disappeared, and the
output looks as expected: 5 active backends with no wait and 196 idle
connections waiting on ClientRead.

The output is incorrect because the state and wait information are gathered
at different times. First the view gathers the backends' information into
local structures, and then it iterates over the backends again to enrich the
data with wait events. Reading a wait event requires taking an LWLock per
backend, so iterating over the backends takes some time (a few milliseconds).
As a result, the wait events of backends running quick queries may have
changed in the meantime.
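
In the current code, pg_stat_get_activity() roughly does the following for
each row, after the snapshot has already been taken (see the removed hunk in
the attached patch):

    beentry = &local_beentry->backendStatus;        /* copied earlier, in one pass */
    ...
    proc = BackendPidGetProc(beentry->st_procpid);  /* looked up only now */
    if (proc != NULL)
    {
        raw_wait_event = UINT32_ACCESS_ONCE(proc->wait_event_info);
        wait_event_type = pgstat_get_wait_event_type(raw_wait_event);
        wait_event = pgstat_get_wait_event(raw_wait_event);
    }

so the wait event is read noticeably later than the state captured in the
snapshot.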

The idea of the patch is to avoid the second iteration over the backends and
to gather all the information at once.
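
In simplified form (a sketch of the approach, not the exact code of the
attachment), the wait event is captured inside the same loop that copies the
shared status entries:

    for (i = 1; i <= NumBackendStatSlots; i++)
    {
        /* ... copy the shared PgBackendStatus entry into the local snapshot ... */

        if (pid > 0)                    /* only entries that are in use */
        {
            PGPROC     *proc = BackendIdGetProc(i);

            /* capture the wait event in the same pass as the status copy */
            localentry->wait_event_info =
                proc ? UINT32_ACCESS_ONCE(proc->wait_event_info) : 0;
        }
    }

pg_stat_get_activity() then reads wait_event_info from the local snapshot
instead of looking up the PGPROC again.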

The patch also changes the way memory is allocated for the local structures.
Previously, the maximum required size was estimated and allocated all at
once, which could mean allocating dozens or hundreds of megabytes for
nothing. Now memory is allocated in chunks, which reduces both the overall
amount of allocated memory and the time spent on allocation.
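
For illustration, the chunked allocation looks roughly like this (the patch
wraps it in the PGSTAT_LOCALDATA_ALLOC macro and applies it to every buffer):

    /* allocate the appname buffer lazily, PGSTAT_LOCALDATA_CHUNK entries at a time */
    if (localappname == NULL || localappname > localappname_thr)
    {
        localappname = MemoryContextAlloc(backendStatusSnapContext,
                                          NAMEDATALEN * PGSTAT_LOCALDATA_CHUNK);
        /* refill threshold: the last slot that still leaves room for one entry */
        localappname_thr = localappname + NAMEDATALEN * (PGSTAT_LOCALDATA_CHUNK - 1);
    }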

In the example above, the timing is reduced from 4.4 ms to 1.5 ms (about 3 times).

The patch is for PostgreSQL version 15. If the fix is OK and is also needed
for previous versions, please let me know.
It's worth mentioning Yury Sokolov as a co-author of the patch.

Please feel free to ask any questions.

Best regards,
--
Michael Zhilin
Postgres Professional
+7(925)3366270
https://www.postgrespro.ru

Attachments:

v1-0001-fix-column-wait_event-of-view-pg_stat_activity.patch (text/x-patch)
From c4c91a8579364987038ffa6e4cf0ea6314266c42 Mon Sep 17 00:00:00 2001
From: Michael Zhilin <m.zhilin@postgrespro.ru>
Date: Thu, 10 Mar 2022 17:23:11 +0300
Subject: [PATCH v1] fix column wait_event of view pg_stat_activity

In case of high amount of connections, we have observed value 'ClientRead' of
column wait_event for active connections. Actually backend sets wait_event as
'ClientRead' only in case of idle or idle in transaction state.
The reason of discrepancy is different time moments of gathering information
from internal structures. At first it gathers state information for all backends.
Then it iterates over connections to gather wait information for each backend.

Before this patch the columns 'state' and 'wait_event' of view pg_stat_activity
may contain contradictory values.

The aim of change is to make wait information correct. To achieve it the state
and wait information should be gathered at same time.

Also the patch changes the way to allocate memory for temporary (aka local state)
structures. In case of huge amount of connections (1000 or more) it results in
increase of latency and waste of allocated memory. Instead of one-shot
allocation of a huge portion, it triggers memory allocation on demand by chunks.

Patch makes no change of system behaviour, but makes pg_stat_activity view more
consistent and a bit faster.

Tested on CentOS 7.9 / Debian 11 / FreeBSD 14 with 8-32 cores.

Co-authored-by: Yura Sokolov <y.sokolov@postgrespro.ru>
---
 src/backend/utils/activity/backend_status.c | 129 +++++++++++++++-----
 src/backend/utils/adt/pgstatfuncs.c         |  49 ++------
 src/include/utils/backend_status.h          |  13 +-
 3 files changed, 120 insertions(+), 71 deletions(-)

diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index c7ed1e6d7ac..5cfe35398fb 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -25,6 +25,7 @@
 #include "utils/guc.h"			/* for application_name */
 #include "utils/memutils.h"
 
+#define UINT32_ACCESS_ONCE(var)           ((uint32)(*((volatile uint32 *)&(var))))
 
 /* ----------
  * Total number of backends including auxiliary
@@ -716,6 +717,24 @@ pgstat_report_xact_timestamp(TimestampTz tstamp)
 	PGSTAT_END_WRITE_ACTIVITY(beentry);
 }
 
+#define PGSTAT_LOCALDATA_CHUNK	128
+#ifdef HAVE_TYPEOF
+#define PGSTAT_LOCALDATA_ALLOC(x,cnt) \
+	if (x == NULL || x > x##_thr) {	\
+		x = (typeof(x)) \
+			MemoryContextAlloc(backendStatusSnapContext, \
+							   cnt * sizeof(x[0]) * PGSTAT_LOCALDATA_CHUNK); \
+		x##_thr = x + cnt * (PGSTAT_LOCALDATA_CHUNK - 1); \
+	}
+#else
+#define PGSTAT_LOCALDATA_ALLOC(x,cnt) \
+	if (x == NULL || x > x##_thr) {	\
+		x = MemoryContextAlloc(backendStatusSnapContext, \
+							   cnt * sizeof(x[0]) * PGSTAT_LOCALDATA_CHUNK); \
+		x##_thr = x + cnt * (PGSTAT_LOCALDATA_CHUNK - 1); \
+	}
+#endif
+
 /* ----------
  * pgstat_read_current_status() -
  *
@@ -729,16 +748,25 @@ pgstat_read_current_status(void)
 	volatile PgBackendStatus *beentry;
 	LocalPgBackendStatus *localtable;
 	LocalPgBackendStatus *localentry;
+	PgBackendStatus *localbeentry,
+			   *localbeentry_thr;
+	PGPROC	   *proc;
 	char	   *localappname,
+			   *localappname_thr,
 			   *localclienthostname,
-			   *localactivity;
+			   *localclienthostname_thr,
+			   *localactivity,
+			   *localactivity_thr;
 #ifdef USE_SSL
-	PgBackendSSLStatus *localsslstatus;
+	PgBackendSSLStatus *localsslstatus,
+			   *localsslstatus_thr;
 #endif
 #ifdef ENABLE_GSS
-	PgBackendGSSStatus *localgssstatus;
+	PgBackendGSSStatus *localgssstatus,
+			   *localgssstatus_thr;
 #endif
 	int			i;
+	int			pid;
 
 	if (localBackendStatusTable)
 		return;					/* already done */
@@ -756,32 +784,39 @@ pgstat_read_current_status(void)
 	localtable = (LocalPgBackendStatus *)
 		MemoryContextAlloc(backendStatusSnapContext,
 						   sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
-	localappname = (char *)
-		MemoryContextAlloc(backendStatusSnapContext,
-						   NAMEDATALEN * NumBackendStatSlots);
-	localclienthostname = (char *)
-		MemoryContextAlloc(backendStatusSnapContext,
-						   NAMEDATALEN * NumBackendStatSlots);
-	localactivity = (char *)
-		MemoryContextAllocHuge(backendStatusSnapContext,
-							   pgstat_track_activity_query_size * NumBackendStatSlots);
+
 #ifdef USE_SSL
-	localsslstatus = (PgBackendSSLStatus *)
-		MemoryContextAlloc(backendStatusSnapContext,
-						   sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
+	localsslstatus = NULL;
 #endif
 #ifdef ENABLE_GSS
-	localgssstatus = (PgBackendGSSStatus *)
-		MemoryContextAlloc(backendStatusSnapContext,
-						   sizeof(PgBackendGSSStatus) * NumBackendStatSlots);
+	localgssstatus = NULL;
 #endif
 
 	localNumBackends = 0;
+	localbeentry = NULL;
+	localappname = NULL;
+	localclienthostname = NULL;
+	localactivity = NULL;
 
 	beentry = BackendStatusArray;
 	localentry = localtable;
 	for (i = 1; i <= NumBackendStatSlots; i++)
 	{
+		/*
+		 * Check available space in buffers for local state data
+		 */
+		PGSTAT_LOCALDATA_ALLOC(localbeentry, 1);
+		PGSTAT_LOCALDATA_ALLOC(localappname, NAMEDATALEN);
+		PGSTAT_LOCALDATA_ALLOC(localclienthostname, NAMEDATALEN);
+		PGSTAT_LOCALDATA_ALLOC(localactivity, pgstat_track_activity_query_size);
+#ifdef USE_SSL
+		PGSTAT_LOCALDATA_ALLOC(localsslstatus, 1);
+#endif
+#ifdef ENABLE_GSS
+		PGSTAT_LOCALDATA_ALLOC(localgssstatus, 1);
+#endif
+
+		localentry->backendStatus = localbeentry;
 		/*
 		 * Follow the protocol of retrying if st_changecount changes while we
 		 * copy the entry, or if it's odd.  (The check for odd is needed to
@@ -796,11 +831,11 @@ pgstat_read_current_status(void)
 
 			pgstat_begin_read_activity(beentry, before_changecount);
 
-			localentry->backendStatus.st_procpid = beentry->st_procpid;
+			localentry->backendStatus->st_procpid = beentry->st_procpid;
 			/* Skip all the data-copying work if entry is not in use */
-			if (localentry->backendStatus.st_procpid > 0)
+			if (localentry->backendStatus->st_procpid > 0)
 			{
-				memcpy(&localentry->backendStatus, unvolatize(PgBackendStatus *, beentry), sizeof(PgBackendStatus));
+				memcpy(localentry->backendStatus, unvolatize(PgBackendStatus *, beentry), sizeof(PgBackendStatus));
 
 				/*
 				 * For each PgBackendStatus field that is a pointer, copy the
@@ -811,23 +846,23 @@ pgstat_read_current_status(void)
 				 * because there's always a \0 at the end of the buffer.
 				 */
 				strcpy(localappname, (char *) beentry->st_appname);
-				localentry->backendStatus.st_appname = localappname;
+				localentry->backendStatus->st_appname = localappname;
 				strcpy(localclienthostname, (char *) beentry->st_clienthostname);
-				localentry->backendStatus.st_clienthostname = localclienthostname;
+				localentry->backendStatus->st_clienthostname = localclienthostname;
 				strcpy(localactivity, (char *) beentry->st_activity_raw);
-				localentry->backendStatus.st_activity_raw = localactivity;
+				localentry->backendStatus->st_activity_raw = localactivity;
 #ifdef USE_SSL
 				if (beentry->st_ssl)
 				{
 					memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
-					localentry->backendStatus.st_sslstatus = localsslstatus;
+					localentry->backendStatus->st_sslstatus = localsslstatus;
 				}
 #endif
 #ifdef ENABLE_GSS
 				if (beentry->st_gss)
 				{
 					memcpy(localgssstatus, beentry->st_gssstatus, sizeof(PgBackendGSSStatus));
-					localentry->backendStatus.st_gssstatus = localgssstatus;
+					localentry->backendStatus->st_gssstatus = localgssstatus;
 				}
 #endif
 			}
@@ -843,17 +878,49 @@ pgstat_read_current_status(void)
 		}
 
 		beentry++;
+		pid = localentry->backendStatus->st_procpid;
 		/* Only valid entries get included into the local array */
-		if (localentry->backendStatus.st_procpid > 0)
+		if (pid > 0)
 		{
 			BackendIdGetTransactionIds(i,
 									   &localentry->backend_xid,
 									   &localentry->backend_xmin);
+			proc = BackendIdGetProc(i);
+
+			/*
+			 * If a PGPROC entry was retrieved, display wait events and lock
+			 * group leader information if any.  To avoid extra overhead, no
+			 * extra lock is being held, so there is no guarantee of
+			 * consistency across multiple rows.
+			 */
+			if (proc != NULL)
+			{
+				PGPROC	   *leader;
+
+				localentry->wait_event_info = UINT32_ACCESS_ONCE(proc->wait_event_info);
+				leader = proc->lockGroupLeader;
+
+				/*
+				 * Show the leader only for active parallel workers.  This
+				 * leaves the field as NULL for the leader of a parallel
+				 * group.
+				 */
+				if (leader && leader->pid != pid)
+					localentry->leader_pid = leader->pid;
+				else
+					localentry->leader_pid = 0;
+			}
+			else
+			{
+				localentry->wait_event_info = 0;
+				localentry->leader_pid = 0;
+			}
 
 			localentry++;
-			localappname += NAMEDATALEN;
-			localclienthostname += NAMEDATALEN;
-			localactivity += pgstat_track_activity_query_size;
+			localbeentry++;
+			localappname += strlen(localappname) + 1;
+			localclienthostname += strlen(localclienthostname) + 1;
+			localactivity += strlen(localactivity) + 1;
 #ifdef USE_SSL
 			localsslstatus++;
 #endif
@@ -1064,7 +1131,7 @@ pgstat_fetch_stat_beentry(int beid)
 	if (beid < 1 || beid > localNumBackends)
 		return NULL;
 
-	return &localBackendStatusTable[beid - 1].backendStatus;
+	return localBackendStatusTable[beid - 1].backendStatus;
 }
 
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 893690dad52..c1595ec1a9c 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -500,7 +500,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 		if (!local_beentry)
 			continue;
 
-		beentry = &local_beentry->backendStatus;
+		beentry = local_beentry->backendStatus;
 
 		/*
 		 * Report values for only those backends which are running the given
@@ -555,7 +555,6 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 		bool		nulls[PG_STAT_GET_ACTIVITY_COLS];
 		LocalPgBackendStatus *local_beentry;
 		PgBackendStatus *beentry;
-		PGPROC	   *proc;
 		const char *wait_event_type = NULL;
 		const char *wait_event = NULL;
 
@@ -582,7 +581,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 			continue;
 		}
 
-		beentry = &local_beentry->backendStatus;
+		beentry = local_beentry->backendStatus;
 
 		/* If looking for specific PID, ignore all the others */
 		if (pid != -1 && beentry->st_procpid != pid)
@@ -652,46 +651,18 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 			pfree(clipped_activity);
 
 			/* leader_pid */
-			nulls[28] = true;
-
-			proc = BackendPidGetProc(beentry->st_procpid);
-
-			if (proc == NULL && (beentry->st_backendType != B_BACKEND))
+			if (local_beentry->leader_pid)
 			{
-				/*
-				 * For an auxiliary process, retrieve process info from
-				 * AuxiliaryProcs stored in shared-memory.
-				 */
-				proc = AuxiliaryPidGetProc(beentry->st_procpid);
+				values[28] = Int32GetDatum(local_beentry->leader_pid);
+				nulls[28] = false;
 			}
+			else
+				nulls[28] = true;
 
-			/*
-			 * If a PGPROC entry was retrieved, display wait events and lock
-			 * group leader information if any.  To avoid extra overhead, no
-			 * extra lock is being held, so there is no guarantee of
-			 * consistency across multiple rows.
-			 */
-			if (proc != NULL)
+			if (local_beentry->wait_event_info)
 			{
-				uint32		raw_wait_event;
-				PGPROC	   *leader;
-
-				raw_wait_event = UINT32_ACCESS_ONCE(proc->wait_event_info);
-				wait_event_type = pgstat_get_wait_event_type(raw_wait_event);
-				wait_event = pgstat_get_wait_event(raw_wait_event);
-
-				leader = proc->lockGroupLeader;
-
-				/*
-				 * Show the leader only for active parallel workers.  This
-				 * leaves the field as NULL for the leader of a parallel
-				 * group.
-				 */
-				if (leader && leader->pid != beentry->st_procpid)
-				{
-					values[28] = Int32GetDatum(leader->pid);
-					nulls[28] = false;
-				}
+				wait_event_type = pgstat_get_wait_event_type(local_beentry->wait_event_info);
+				wait_event = pgstat_get_wait_event(local_beentry->wait_event_info);
 			}
 
 			if (wait_event_type)
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 7403bca25ed..63eae799978 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -245,7 +245,7 @@ typedef struct LocalPgBackendStatus
 	/*
 	 * Local version of the backend status entry.
 	 */
-	PgBackendStatus backendStatus;
+	PgBackendStatus *backendStatus;
 
 	/*
 	 * The xid of the current transaction if available, InvalidTransactionId
@@ -258,6 +258,17 @@ typedef struct LocalPgBackendStatus
 	 * not.
 	 */
 	TransactionId backend_xmin;
+
+	/*
+	 * The process wait information
+	 */
+	uint32 wait_event_info;
+
+	/*
+	 * The leader process ID
+	 */
+	int leader_pid;
+
 } LocalPgBackendStatus;
 
 
-- 
2.34.1

#2 Justin Pryzby
pryzby@telsasoft.com
In reply to: Michael Zhilin (#1)
2 attachment(s)
Re: [PATCH] fix wait_event of pg_stat_activity in case of high amount of connections

On Thu, Jun 30, 2022 at 11:05:28PM +0300, Michael Zhilin wrote:

I would like to submit a patch for the wait_event column of the
pg_stat_activity view. Please find it attached.

Thanks. I had this thread on my list to look at later but then, by chance, I
got an erroneous nagios notification about a long-running transaction, which
seems to have been caused by this bug.

You can reproduce the problem by running a couple handfuls of these:

psql -h /tmp postgres -Atc "SELECT * FROM pg_stat_activity WHERE state='active' AND wait_event='ClientRead' LIMIT 1"&

As a result, the wait events of backends running quick queries may have changed in the meantime.

Yep. I realize now that I've been seeing this at one of our customers for a
pkey lookup query. We don't use high max_connections, but if you have a lot of
connections, you might be more likely to hit this.

I agree that this is a bug, since it can (and did) cause false positives in a
monitoring system.

The patch also changes the way memory is allocated for the local structures.
Previously, the maximum required size was estimated and allocated all at
once, which could mean allocating dozens or hundreds of megabytes for
nothing. Now memory is allocated in chunks, which reduces both the overall
amount of allocated memory and the time spent on allocation.

I suggest presenting this as two patches: a 0001 patch to fix the bug,
proposed for backpatching, and a 0002 patch for master to improve memory usage,
as attached. Actually, once 0001 is resolved, it may be good to start a
separate thread for 0002. I plan to add this to the next CF.

Did you really experience latency before the patch? It seems like
max_connections=1000 would only allocate a bit more than 1MB.

For me, max_connections=10000 makes pg_stat_activity about 10% slower than 100
(11.8s vs 13.2s). With the patch, 10k is only about 3% slower than 100. So
there is an improvement, but people here may not be excited about improving
performance for huge values of max_connections.

time for a in `seq 1 99`; do psql -h /tmp postgres -Atc "SELECT FROM pg_stat_activity -- WHERE state='active' AND wait_event='ClientRead' LIMIT 1"; done

Is there a reason why you made separate allocations for appname, hostname,
activity, ssl, gss (I know that's how it's done originally, too)? Couldn't it
be a single allocation? You could do a single (re)allocation when an active
backend is found if the amount of free space is less than
2*NAMEDATALEN + track_activity_query_size + sizeof(PgBackendSSLStatus) + sizeof(PgBackendGSSStatus).
I have a patch for that, but I'll share it after 0001 is resolved.
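
Roughly the shape I have in mind (a hypothetical sketch with made-up names,
not the actual patch):

    char       *bufpos = NULL,
               *bufend = NULL;      /* hypothetical shared buffer pointers */
    Size        worst_entry = 2 * NAMEDATALEN
                            + pgstat_track_activity_query_size
                            + sizeof(PgBackendSSLStatus)
                            + sizeof(PgBackendGSSStatus);

    /*
     * In the per-backend loop: allocate a new chunk only when a worst-case
     * entry might not fit in the remaining space.
     */
    if (bufpos == NULL || (Size) (bufend - bufpos) < worst_entry)
    {
        Size        chunk = worst_entry * 128;  /* e.g. 128 entries at a time */

        bufpos = MemoryContextAlloc(backendStatusSnapContext, chunk);
        bufend = bufpos + chunk;
    }

The pieces for each active entry would then be carved out of that one buffer
instead of five separate arrays.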

Why did you change backendStatus to a pointer? Is it to minimize the size of
the structure to allow for huge values of max_connections?

Note that there were warnings from your 0002:
backend_status.c:723:21: warning: ‘localactivity_thr’ may be used uninitialized in this function [-Wmaybe-uninitialized]
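
Presumably initializing the threshold pointers alongside their buffers would
silence it, since the x == NULL check in PGSTAT_LOCALDATA_ALLOC short-circuits
before the threshold is read on the first iteration anyway:

    localappname_thr = NULL;
    localclienthostname_thr = NULL;
    localactivity_thr = NULL;
    /* likewise for localbeentry_thr and the ssl/gss threshold pointers */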

Thanks,
--
Justin

Attachments:

0001-fix-column-wait_event-of-view-pg_stat_activity.patch (text/x-diff)
From 590d9b545805f99b80103bfe900d175ad1d0668c Mon Sep 17 00:00:00 2001
From: Michael Zhilin <m.zhilin@postgrespro.ru>
Date: Thu, 10 Mar 2022 17:23:11 +0300
Subject: [PATCH 1/2] fix column wait_event of view pg_stat_activity

In case of high amount of connections, we have observed value 'ClientRead' of
column wait_event for active connections. Actually backend sets wait_event as
'ClientRead' only in case of idle or idle in transaction state.
The reason of discrepancy is different time moments of gathering information
from internal structures. At first it gathers state information for all backends.
Then it iterates over connections to gather wait information for each backend.

Before this patch the columns 'state' and 'wait_event' of view pg_stat_activity
may contain contradictory values.

The aim of change is to make wait information correct. To achieve it the state
and wait information should be gathered at same time.

Patch makes no change of system behaviour, but makes pg_stat_activity view more
consistent.

Tested on CentOS 7.9 / Debian 11 / FreeBSD 14 with 8-32 cores.

Co-authored-by: Yura Sokolov <y.sokolov@postgrespro.ru>
---
 src/backend/utils/activity/backend_status.c | 18 +++++++++
 src/backend/utils/adt/pgstatfuncs.c         | 45 ++++-----------------
 src/include/utils/backend_status.h          | 10 +++++
 3 files changed, 36 insertions(+), 37 deletions(-)

diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 415604db322..64567ff5cd7 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -845,6 +845,24 @@ pgstat_read_current_status(void)
 		/* Only valid entries get included into the local array */
 		if (localentry->backendStatus.st_procpid > 0)
 		{
+			PGPROC	*proc = BackendIdGetProc(i);
+			if (proc != NULL)
+			{
+				PGPROC	*leader = proc->lockGroupLeader;
+
+				/*
+				 * Show the leader only for active parallel workers.  This
+				 * leaves the field as NULL for the leader of a parallel
+				 * group.
+				 */
+				if (leader && leader->pid != localentry->backendStatus.st_procpid)
+					localentry->leader_pid = leader->pid;
+				else
+					localentry->leader_pid = 0;
+
+				localentry->wait_event_info = proc->wait_event_info;
+			}
+
 			BackendIdGetTransactionIds(i,
 									   &localentry->backend_xid,
 									   &localentry->backend_xmin);
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index db061754c8b..82f169a2d8b 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -555,7 +555,6 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 		bool		nulls[PG_STAT_GET_ACTIVITY_COLS];
 		LocalPgBackendStatus *local_beentry;
 		PgBackendStatus *beentry;
-		PGPROC	   *proc;
 		const char *wait_event_type = NULL;
 		const char *wait_event = NULL;
 
@@ -652,46 +651,18 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 			pfree(clipped_activity);
 
 			/* leader_pid */
-			nulls[28] = true;
-
-			proc = BackendPidGetProc(beentry->st_procpid);
-
-			if (proc == NULL && (beentry->st_backendType != B_BACKEND))
+			if (local_beentry->leader_pid)
 			{
-				/*
-				 * For an auxiliary process, retrieve process info from
-				 * AuxiliaryProcs stored in shared-memory.
-				 */
-				proc = AuxiliaryPidGetProc(beentry->st_procpid);
+				values[28] = Int32GetDatum(local_beentry->leader_pid);
+				nulls[28] = false;
 			}
+			else
+				nulls[28] = true;
 
-			/*
-			 * If a PGPROC entry was retrieved, display wait events and lock
-			 * group leader information if any.  To avoid extra overhead, no
-			 * extra lock is being held, so there is no guarantee of
-			 * consistency across multiple rows.
-			 */
-			if (proc != NULL)
+			if (local_beentry->wait_event_info)
 			{
-				uint32		raw_wait_event;
-				PGPROC	   *leader;
-
-				raw_wait_event = UINT32_ACCESS_ONCE(proc->wait_event_info);
-				wait_event_type = pgstat_get_wait_event_type(raw_wait_event);
-				wait_event = pgstat_get_wait_event(raw_wait_event);
-
-				leader = proc->lockGroupLeader;
-
-				/*
-				 * Show the leader only for active parallel workers.  This
-				 * leaves the field as NULL for the leader of a parallel
-				 * group.
-				 */
-				if (leader && leader->pid != beentry->st_procpid)
-				{
-					values[28] = Int32GetDatum(leader->pid);
-					nulls[28] = false;
-				}
+				wait_event_type = pgstat_get_wait_event_type(local_beentry->wait_event_info);
+				wait_event = pgstat_get_wait_event(local_beentry->wait_event_info);
 			}
 
 			if (wait_event_type)
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 7403bca25ed..8f890911486 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -258,6 +258,16 @@ typedef struct LocalPgBackendStatus
 	 * not.
 	 */
 	TransactionId backend_xmin;
+
+	/*
+	 * The process wait information
+	 */
+	uint32 wait_event_info;
+
+	/*
+	 * The leader process ID
+	 */
+	int leader_pid;
 } LocalPgBackendStatus;
 
 
-- 
2.17.1

0002-wip-pg_stat_activity-allocate-memory-differently.patch (text/x-diff)
From d122170f942932f9d747f8437a8089990665457a Mon Sep 17 00:00:00 2001
From: Michael Zhilin <m.zhilin@postgrespro.ru>
Date: Thu, 10 Mar 2022 17:23:11 +0300
Subject: [PATCH 2/2] wip: pg_stat_activity: allocate memory differently

This patch changes the way memory is allocated for temporary (aka local state)
structures.  In the case of large values of max_connections (over 1000), memory
was allocated to accommodate the worst-case number of active backends which may
be wasteful and cause increased latency.

The patch makes pg_stat_activity view a bit more efficient by allocating memory
as needed for active backends.  It's still allocated in chunks.

Co-authored-by: Yura Sokolov <y.sokolov@postgrespro.ru>
---
 src/backend/utils/activity/backend_status.c | 107 ++++++++++++++------
 src/backend/utils/adt/pgstatfuncs.c         |   4 +-
 src/include/utils/backend_status.h          |   2 +-
 3 files changed, 77 insertions(+), 36 deletions(-)

diff --git a/src/backend/utils/activity/backend_status.c b/src/backend/utils/activity/backend_status.c
index 64567ff5cd7..9fbe8a601db 100644
--- a/src/backend/utils/activity/backend_status.c
+++ b/src/backend/utils/activity/backend_status.c
@@ -25,6 +25,7 @@
 #include "utils/guc.h"			/* for application_name */
 #include "utils/memutils.h"
 
+#define UINT32_ACCESS_ONCE(var)           ((uint32)(*((volatile uint32 *)&(var))))
 
 /* ----------
  * Total number of backends including auxiliary
@@ -716,6 +717,24 @@ pgstat_report_xact_timestamp(TimestampTz tstamp)
 	PGSTAT_END_WRITE_ACTIVITY(beentry);
 }
 
+#define PGSTAT_LOCALDATA_CHUNK	128
+#ifdef HAVE_TYPEOF
+#define PGSTAT_LOCALDATA_ALLOC(x,cnt) \
+	if (x == NULL || x > x##_thr) {	\
+		x = (typeof(x)) \
+			MemoryContextAlloc(backendStatusSnapContext, \
+							   cnt * sizeof(x[0]) * PGSTAT_LOCALDATA_CHUNK); \
+		x##_thr = x + cnt * (PGSTAT_LOCALDATA_CHUNK - 1); \
+	}
+#else
+#define PGSTAT_LOCALDATA_ALLOC(x,cnt) \
+	if (x == NULL || x > x##_thr) {	\
+		x = MemoryContextAlloc(backendStatusSnapContext, \
+							   cnt * sizeof(x[0]) * PGSTAT_LOCALDATA_CHUNK); \
+		x##_thr = x + cnt * (PGSTAT_LOCALDATA_CHUNK - 1); \
+	}
+#endif
+
 /* ----------
  * pgstat_read_current_status() -
  *
@@ -729,16 +748,24 @@ pgstat_read_current_status(void)
 	volatile PgBackendStatus *beentry;
 	LocalPgBackendStatus *localtable;
 	LocalPgBackendStatus *localentry;
+	PgBackendStatus *localbeentry,
+			   *localbeentry_thr;
 	char	   *localappname,
+			   *localappname_thr,
 			   *localclienthostname,
-			   *localactivity;
+			   *localclienthostname_thr,
+			   *localactivity,
+			   *localactivity_thr;
 #ifdef USE_SSL
-	PgBackendSSLStatus *localsslstatus;
+	PgBackendSSLStatus *localsslstatus,
+			   *localsslstatus_thr;
 #endif
 #ifdef ENABLE_GSS
-	PgBackendGSSStatus *localgssstatus;
+	PgBackendGSSStatus *localgssstatus,
+			   *localgssstatus_thr;
 #endif
 	int			i;
+	int			pid;
 
 	if (localBackendStatusTable)
 		return;					/* already done */
@@ -756,32 +783,39 @@ pgstat_read_current_status(void)
 	localtable = (LocalPgBackendStatus *)
 		MemoryContextAlloc(backendStatusSnapContext,
 						   sizeof(LocalPgBackendStatus) * NumBackendStatSlots);
-	localappname = (char *)
-		MemoryContextAlloc(backendStatusSnapContext,
-						   NAMEDATALEN * NumBackendStatSlots);
-	localclienthostname = (char *)
-		MemoryContextAlloc(backendStatusSnapContext,
-						   NAMEDATALEN * NumBackendStatSlots);
-	localactivity = (char *)
-		MemoryContextAllocHuge(backendStatusSnapContext,
-							   pgstat_track_activity_query_size * NumBackendStatSlots);
+
 #ifdef USE_SSL
-	localsslstatus = (PgBackendSSLStatus *)
-		MemoryContextAlloc(backendStatusSnapContext,
-						   sizeof(PgBackendSSLStatus) * NumBackendStatSlots);
+	localsslstatus = NULL;
 #endif
 #ifdef ENABLE_GSS
-	localgssstatus = (PgBackendGSSStatus *)
-		MemoryContextAlloc(backendStatusSnapContext,
-						   sizeof(PgBackendGSSStatus) * NumBackendStatSlots);
+	localgssstatus = NULL;
 #endif
 
 	localNumBackends = 0;
+	localbeentry = NULL;
+	localappname = NULL;
+	localclienthostname = NULL;
+	localactivity = NULL;
 
 	beentry = BackendStatusArray;
 	localentry = localtable;
 	for (i = 1; i <= NumBackendStatSlots; i++)
 	{
+		/*
+		 * Check available space in buffers for local state data
+		 */
+		PGSTAT_LOCALDATA_ALLOC(localbeentry, 1);
+		PGSTAT_LOCALDATA_ALLOC(localappname, NAMEDATALEN);
+		PGSTAT_LOCALDATA_ALLOC(localclienthostname, NAMEDATALEN);
+		PGSTAT_LOCALDATA_ALLOC(localactivity, pgstat_track_activity_query_size);
+#ifdef USE_SSL
+		PGSTAT_LOCALDATA_ALLOC(localsslstatus, 1);
+#endif
+#ifdef ENABLE_GSS
+		PGSTAT_LOCALDATA_ALLOC(localgssstatus, 1);
+#endif
+
+		localentry->backendStatus = localbeentry;
 		/*
 		 * Follow the protocol of retrying if st_changecount changes while we
 		 * copy the entry, or if it's odd.  (The check for odd is needed to
@@ -796,11 +830,11 @@ pgstat_read_current_status(void)
 
 			pgstat_begin_read_activity(beentry, before_changecount);
 
-			localentry->backendStatus.st_procpid = beentry->st_procpid;
+			localentry->backendStatus->st_procpid = beentry->st_procpid;
 			/* Skip all the data-copying work if entry is not in use */
-			if (localentry->backendStatus.st_procpid > 0)
+			if (localentry->backendStatus->st_procpid > 0)
 			{
-				memcpy(&localentry->backendStatus, unvolatize(PgBackendStatus *, beentry), sizeof(PgBackendStatus));
+				memcpy(localentry->backendStatus, unvolatize(PgBackendStatus *, beentry), sizeof(PgBackendStatus));
 
 				/*
 				 * For each PgBackendStatus field that is a pointer, copy the
@@ -811,23 +845,23 @@ pgstat_read_current_status(void)
 				 * because there's always a \0 at the end of the buffer.
 				 */
 				strcpy(localappname, (char *) beentry->st_appname);
-				localentry->backendStatus.st_appname = localappname;
+				localentry->backendStatus->st_appname = localappname;
 				strcpy(localclienthostname, (char *) beentry->st_clienthostname);
-				localentry->backendStatus.st_clienthostname = localclienthostname;
+				localentry->backendStatus->st_clienthostname = localclienthostname;
 				strcpy(localactivity, (char *) beentry->st_activity_raw);
-				localentry->backendStatus.st_activity_raw = localactivity;
+				localentry->backendStatus->st_activity_raw = localactivity;
 #ifdef USE_SSL
 				if (beentry->st_ssl)
 				{
 					memcpy(localsslstatus, beentry->st_sslstatus, sizeof(PgBackendSSLStatus));
-					localentry->backendStatus.st_sslstatus = localsslstatus;
+					localentry->backendStatus->st_sslstatus = localsslstatus;
 				}
 #endif
 #ifdef ENABLE_GSS
 				if (beentry->st_gss)
 				{
 					memcpy(localgssstatus, beentry->st_gssstatus, sizeof(PgBackendGSSStatus));
-					localentry->backendStatus.st_gssstatus = localgssstatus;
+					localentry->backendStatus->st_gssstatus = localgssstatus;
 				}
 #endif
 			}
@@ -842,8 +876,9 @@ pgstat_read_current_status(void)
 			CHECK_FOR_INTERRUPTS();
 		}
 
+		pid = localentry->backendStatus->st_procpid;
 		/* Only valid entries get included into the local array */
-		if (localentry->backendStatus.st_procpid > 0)
+		if (pid > 0)
 		{
 			PGPROC	*proc = BackendIdGetProc(i);
 			if (proc != NULL)
@@ -855,12 +890,17 @@ pgstat_read_current_status(void)
 				 * leaves the field as NULL for the leader of a parallel
 				 * group.
 				 */
-				if (leader && leader->pid != localentry->backendStatus.st_procpid)
+				if (leader && leader->pid != localentry->backendStatus->st_procpid)
 					localentry->leader_pid = leader->pid;
 				else
 					localentry->leader_pid = 0;
 
-				localentry->wait_event_info = proc->wait_event_info;
+				localentry->wait_event_info = UINT32_ACCESS_ONCE(proc->wait_event_info);
+			}
+			else
+			{
+				localentry->leader_pid = 0;
+				localentry->wait_event_info = 0;
 			}
 
 			BackendIdGetTransactionIds(i,
@@ -868,9 +908,10 @@ pgstat_read_current_status(void)
 									   &localentry->backend_xmin);
 
 			localentry++;
-			localappname += NAMEDATALEN;
-			localclienthostname += NAMEDATALEN;
-			localactivity += pgstat_track_activity_query_size;
+			localbeentry++;
+			localappname += strlen(localappname) + 1;
+			localclienthostname += strlen(localclienthostname) + 1;
+			localactivity += strlen(localactivity) + 1;
 #ifdef USE_SSL
 			localsslstatus++;
 #endif
@@ -1083,7 +1124,7 @@ pgstat_fetch_stat_beentry(int beid)
 	if (beid < 1 || beid > localNumBackends)
 		return NULL;
 
-	return &localBackendStatusTable[beid - 1].backendStatus;
+	return localBackendStatusTable[beid - 1].backendStatus;
 }
 
 
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 82f169a2d8b..d5f2448912b 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -500,7 +500,7 @@ pg_stat_get_progress_info(PG_FUNCTION_ARGS)
 		if (!local_beentry)
 			continue;
 
-		beentry = &local_beentry->backendStatus;
+		beentry = local_beentry->backendStatus;
 
 		/*
 		 * Report values for only those backends which are running the given
@@ -581,7 +581,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 			continue;
 		}
 
-		beentry = &local_beentry->backendStatus;
+		beentry = local_beentry->backendStatus;
 
 		/* If looking for specific PID, ignore all the others */
 		if (pid != -1 && beentry->st_procpid != pid)
diff --git a/src/include/utils/backend_status.h b/src/include/utils/backend_status.h
index 8f890911486..0b4b426e3aa 100644
--- a/src/include/utils/backend_status.h
+++ b/src/include/utils/backend_status.h
@@ -245,7 +245,7 @@ typedef struct LocalPgBackendStatus
 	/*
 	 * Local version of the backend status entry.
 	 */
-	PgBackendStatus backendStatus;
+	PgBackendStatus *backendStatus;
 
 	/*
 	 * The xid of the current transaction if available, InvalidTransactionId
-- 
2.17.1

#3 Kyotaro Horiguchi
horikyota.ntt@gmail.com
In reply to: Justin Pryzby (#2)
Re: [PATCH] fix wait_event of pg_stat_activity in case of high amount of connections

At Thu, 7 Jul 2022 13:58:06 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in

I agree that this is a bug, since it can (and did) cause false positives in a
monitoring system.

I'm not sure this is undoubtedly a bug, but I agree about the rest.

The patch also changes the way memory is allocated for the local structures.
Previously, the maximum required size was estimated and allocated all at
once, which could mean allocating dozens or hundreds of megabytes for
nothing. Now memory is allocated in chunks, which reduces both the overall
amount of allocated memory and the time spent on allocation.

I suggest presenting this as two patches: a 0001 patch to fix the bug,
proposed for backpatching, and a 0002 patch for master to improve memory usage,
as attached. Actually, once 0001 is resolved, it may be good to start a
separate thread for 0002. I plan to add this to the next CF.

Looking at patch 0001, I wonder whether we can move wait_event_info from
PGPROC to the backend status. If I'm not missing anything, I don't see a
plausible requirement for it to be in PGPROC; rather, I see a reason to
move it to backend_status.

void
pgstat_report_activity(BackendState state, const char *cmd_str)
{
...
beentry->st_xact_start_timestamp = 0;
beentry->st_query_id = UINT64CONST(0);
proc->wait_event_info = 0;
PGSTAT_END_WRITE_ACTIVITY(beentry);
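
Roughly, the idea would be something like this (just a sketch to show the
shape; the field name and function body are assumptions, not a concrete
proposal):

    /* in PgBackendStatus (the shared per-backend status entry) */
    uint32      st_wait_event_info;     /* would replace PGPROC.wait_event_info */

    /* pgstat_report_wait_start()/_end() would then write to the beentry */
    static inline void
    pgstat_report_wait_start(uint32 wait_event_info)
    {
        volatile PgBackendStatus *beentry = MyBEEntry;

        if (beentry)
            beentry->st_wait_event_info = wait_event_info;
    }

so that pgstat_read_current_status() could copy the wait event together with
the rest of the entry.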

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center

#4 Robert Haas
robertmhaas@gmail.com
In reply to: Kyotaro Horiguchi (#3)
Re: [PATCH] fix wait_event of pg_stat_activity in case of high amount of connections

On Thu, Jul 7, 2022 at 10:39 PM Kyotaro Horiguchi
<horikyota.ntt@gmail.com> wrote:

At Thu, 7 Jul 2022 13:58:06 -0500, Justin Pryzby <pryzby@telsasoft.com> wrote in

I agree that this is a bug, since it can (and did) cause false positives in a
monitoring system.

I'm not sure this is undoubtedly a bug, but I agree about the rest.

I don't agree that this is a bug, and even if it were, I don't think
this patch can fix it.

Let's start with the second point first: pgstat_report_wait_start()
and pgstat_report_wait_end() change the advertised wait event for a
process, while the backend state is changed by
pgstat_report_activity(). Since those function calls are in different
places, those changes are bound to happen at different times, and
therefore you can observe drift between the two values. Now perhaps
there are some one-directional guarantees: I think we probably always
set the state to idle before we start reading from the client, and
always finish reading from the client before the state ceases to be
idle. But I don't really see how that helps anything, because when you
read those values, you must read one and then the other. If you read
the activity before the wait event, you might see the state before it
goes idle and then the wait event after it's reached ClientRead. If
you read the wait event before the activity, you might see the wait
event as ClientRead, and then by the time you check the activity the
backend might have gotten some data from the client and no longer be
idle. The very best a patch like this can hope to do is narrow the
race condition enough that the discrepancies are observed less
frequently in practice.

And that's why I think this is not a bug fix, or even a good idea.
It's just encouraging people to rely on something which can never be
fully reliable in the way that the original poster is hoping. There
was never any intention of having wait events synchronized with the
pgstat_report_activity() stuff, and I think that's perfectly fine.
Both systems are trying to provide visibility into states that can
change very quickly, and therefore they need to be low-overhead, and
therefore they use very lightweight synchronization, which means that
ephemeral discrepancies are possible by nature. There are plenty of
other examples of that as well. You can't for example query pg_locks
and pg_stat_activity in the same query and expect that all and only
those backends that are apparently waiting for a lock in
pg_stat_activity will have an ungranted lock in pg_locks. It just
doesn't work like that, and there's a very good reason for that:
trying to make all of these introspection facilities behave in
MVCC-like ways would be painful to code and probably end up slowing
the system down substantially.

I think the right fix here is to change nothing in the code, and stop
expecting these things to be perfectly consistent with each other.

--
Robert Haas
EDB: http://www.enterprisedb.com

#5 Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Robert Haas (#4)
Re: [PATCH] fix wait_event of pg_stat_activity in case of high amount of connections

I see an analogy with a bus stop:
- there is a bus stop
- there is a schedule of buses arriving at this stop
- there are passengers who travel with this bus every day

The bus occasionally comes late... Well, it comes late quite often...

How should the mayor (or other responsible person) act?

First possibility: do everything possible so the bus arrives on schedule.
Although there will be no 100% guarantee, reliability will rise from 90%
to 99%.

Second possibility: tell the passengers "you should not rely on the bus
schedule, and we will not do anything to make it more reliable".

If I were a passenger, I'd prefer the first choice.

regards

Yura

#6 Robert Haas
robertmhaas@gmail.com
In reply to: Yura Sokolov (#5)
Re: [PATCH] fix wait_event of pg_stat_activity in case of high amount of connections

On Fri, Jul 8, 2022 at 10:11 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

I see an analogy with a bus stop:
- there is a bus stop
- there is a schedule of buses arriving at this stop
- there are passengers who travel with this bus every day

The bus occasionally comes late... Well, it comes late quite often...

How should the mayor (or other responsible person) act?

I do not think that is a good analogy, because a bus schedule is an
implicit promise - or at least a strong suggestion - that the bus will
arrive at the scheduled time.

In this case, who made such a promise? The original post presents it
as fact that these systems should give compatible answers at all
times, but there's nothing in the code or documentation to suggest
that this is true.

IMHO, a better analogy would be if you noticed that the 7:03am bus was
normally blue and you took that one because you have a small child who
likes the color blue and it makes them happy to take a blue bus. And
then one day the bus at that time is a red bus and your child is upset
and you call the mayor (or other responsible person) to complain.
They're probably not going to handle that situation by trying to send
a blue bus at 7:03am as often as possible. They're going to tell you
that they only promised you a bus at 7:03am, not what color it would
be.

Perhaps that's not an ideal analogy either, because the reported wait
event and the reported activity are more closely related than the time
of a bus is to the color of the bus. But I think it's still true that
nobody ever promised that those values would be compatible with each
other, and that's not really fixable, and that there are lots of other
cases just like this one which can't be fixed either.

I think that the more we try to pretend like it is possible to make
these values seem like they are synchronized, the more unhappy people
will be in the unavoidable cases where they aren't, and the more
pressure there will be to try to tighten it up even further. That's
likely to result in code that is more complex and slower, which I do
not want, and especially not for the sake of avoiding a harmless
reporting discrepancy.

--
Robert Haas
EDB: http://www.enterprisedb.com

#7 Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Robert Haas (#6)
Re: [PATCH] fix wait_event of pg_stat_activity in case of high amount of connections

On Fri, 08/07/2022 at 11:04 -0400, Robert Haas wrote:

On Fri, Jul 8, 2022 at 10:11 AM Yura Sokolov <y.sokolov@postgrespro.ru> wrote:

I see an analogy with a bus stop:
- there is a bus stop
- there is a schedule of buses arriving at this stop
- there are passengers who travel with this bus every day

The bus occasionally comes late... Well, it comes late quite often...

How should the mayor (or other responsible person) act?

I do not think that is a good analogy, because a bus schedule is an
implicit promise - or at least a strong suggestion - that the bus will
arrive at the scheduled time.

There is an implicit promise: those data are presented in a single row.
If you want to signal that they are NOT related to each other, return them
in different rows or even in different views.

In this case, who made such a promise? The original post presents it
as fact that these systems should give compatible answers at all
times, but there's nothing in the code or documentation to suggest
that this is true.

IMHO, a better analogy would be if you noticed that the 7:03am bus was
normally blue and you took that one because you have a small child who
likes the color blue and it makes them happy to take a blue bus. And
then one day the bus at that time is a red bus and your child is upset
and you call the mayor (or other responsible person) to complain.
They're probably not going to handle that situation by trying to send
a blue bus at 7:03am as often as possible. They're going to tell you
that they only promised you a bus at 7:03am, not what color it would
be.

Perhaps that's not an ideal analogy either, because the reported wait
event and the reported activity are more closely related than the time
of a bus is to the color of the bus. But I think it's still true that
nobody ever promised that those values would be compatible with each
other, and that's not really fixable, and that there are lots of other
cases just like this one which can't be fixed either.

I think that the more we try to pretend like it is possible to make
these values seem like they are synchronized, the more unhappy people
will be in the unavoidable cases where they aren't, and the more
pressure there will be to try to tighten it up even further. That's
likely to result in code that is more complex and slower, which I do
not want, and especially not for the sake of avoiding a harmless
reporting discrepancy.

Then just don't return them together, right?

#8 Yura Sokolov
y.sokolov@postgrespro.ru
In reply to: Yura Sokolov (#7)
Re: [PATCH] fix wait_event of pg_stat_activity in case of high amount of connections

Well, I got a bit more heated than necessary. I apologize for that.

Let's look at the situation from a compromise point of view:
- We are saying: we could make this view more synchronous (and faster).
- You are saying: it will never be totally synchronous, and it is a mistake
that we didn't mention the issue in the documentation.

Why not do both?
Why can't we make it more synchronous (and faster) AND mention in the
documentation that it is not totally synchronous and never will be?

--------

regards

Yura