More efficient truncation of pg_stat_activity query strings

Started by Andres Freundover 8 years ago14 messages
#1Andres Freund
andres@anarazel.de

Hi,

I've recently seen a benchmark in which pg_mbcliplen() showed up
prominently. Which it will basically in any benchmark with longer query
strings, but fast queries. That's not that uncommon.

I wonder if we could avoid the cost of pg_mbcliplen() from within
pgstat_report_activity(), by moving some of the cost to the read
side. pgstat values are obviously read far less frequently in nearly all
cases that are performance relevant.

Therefore I wonder if we couldn't just store a querystring that's
essentially just a memcpy()ed prefix, and do a pg_mbcliplen() on the
read side. I think that should work because all *server side* encodings
store character lengths in the *first* byte of a multibyte character
(at least one clientside encoding, gb18030, doesn't behave that way).

That'd necessitate an added memory copy in pg_stat_get_activity(), but
that seems fairly harmless.

Faults in my thinking?

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#2Tatsuo Ishii
ishii@sraoss.co.jp
In reply to: Andres Freund (#1)
Re: More efficient truncation of pg_stat_activity query strings

read side. I think that should work because all *server side* encodings
store character lengths in the *first* byte of a multibyte character

What do you mean? I don't see such data in a multibyte string.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#3Andres Freund
andres@anarazel.de
In reply to: Tatsuo Ishii (#2)
Re: More efficient truncation of pg_stat_activity query strings

On 2017-09-12 16:53:49 +0900, Tatsuo Ishii wrote:

read side. I think that should work because all *server side* encodings
store character lengths in the *first* byte of a multibyte character

What do you mean? I don't see such data in a multibyte string.

Check the information the pg_*_mblen use / how the relevant encodings
work. Will be something like
int
pg_utf_mblen(const unsigned char *s)
{
int len;

if ((*s & 0x80) == 0)
len = 1;
else if ((*s & 0xe0) == 0xc0)
len = 2;
else if ((*s & 0xf0) == 0xe0)
len = 3;
else if ((*s & 0xf8) == 0xf0)
len = 4;
#ifdef NOT_USED
else if ((*s & 0xfc) == 0xf8)
len = 5;
else if ((*s & 0xfe) == 0xfc)
len = 6;
#endif
else
len = 1;
return len;
}

As you can see, only the first character (*s) is accessed to determine
the length/width of the multibyte-character. That's afaict the case for
all server-side encodings.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#4Tatsuo Ishii
ishii@sraoss.co.jp
In reply to: Andres Freund (#3)
Re: More efficient truncation of pg_stat_activity query strings

Check the information the pg_*_mblen use / how the relevant encodings
work. Will be something like
int
pg_utf_mblen(const unsigned char *s)
{
int len;

if ((*s & 0x80) == 0)
len = 1;
else if ((*s & 0xe0) == 0xc0)
len = 2;
else if ((*s & 0xf0) == 0xe0)
len = 3;
else if ((*s & 0xf8) == 0xf0)
len = 4;
#ifdef NOT_USED
else if ((*s & 0xfc) == 0xf8)
len = 5;
else if ((*s & 0xfe) == 0xfc)
len = 6;
#endif
else
len = 1;
return len;
}

As you can see, only the first character (*s) is accessed to determine
the length/width of the multibyte-character. That's afaict the case for
all server-side encodings.

So your idea is just storing cmd_str into st_activity, which might be
clipped in the middle of multibyte character. And the reader of the
string will call pg_mbclipen() when it needs to read the string. Yes,
I think it works except for gb18030 by the reason you said.

However, if we could have variants of mblen functions with additional
parameter: input string length, then we could remove this inconstancy.
I don't know if this is overkill or not, though.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#5Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#1)
Re: More efficient truncation of pg_stat_activity query strings

On Tue, Sep 12, 2017 at 3:19 AM, Andres Freund <andres@anarazel.de> wrote:

Therefore I wonder if we couldn't just store a querystring that's
essentially just a memcpy()ed prefix, and do a pg_mbcliplen() on the
read side. I think that should work because all *server side* encodings
store character lengths in the *first* byte of a multibyte character
(at least one clientside encoding, gb18030, doesn't behave that way).

That'd necessitate an added memory copy in pg_stat_get_activity(), but
that seems fairly harmless.

Interesting idea. I was (ha, ha, what a coincidence) also thinking
about this problem and was wondering if we couldn't be a lot smarter
about pg_mbcliplen(). I mean, pg_mbcliplen is basically just being
used here to trim away any partial character that would have to get
chopped off to fit within the length limit. But right now it's
scanning the whole string to do this, which is unnecessary. At least
for UTF-8, we could do that much more directly: if the string is short
enough, stop, else, look at cmd_str[pgstat_track_activity_query_size].
If that character has (c & 0xc0) != 0x80, write a '\0' and stop; else,
back up until you find a character that for which that continuation
holds, write a '\0', and stop. This kind of approach only works if we
have a definitive test for whether something is a "continuation
character" which probably isn't true in all encodings, but maybe it's
still worth considering.

Your idea is probably a lot simpler to implement, though, and I
definitely agree that shifting the work from the write side to the
read side makes sense. Updating pg_stat_activity is a lot more common
than reading it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#6Tom Lane
tgl@sss.pgh.pa.us
In reply to: Robert Haas (#5)
Re: More efficient truncation of pg_stat_activity query strings

Robert Haas <robertmhaas@gmail.com> writes:

On Tue, Sep 12, 2017 at 3:19 AM, Andres Freund <andres@anarazel.de> wrote:

Therefore I wonder if we couldn't just store a querystring that's
essentially just a memcpy()ed prefix, and do a pg_mbcliplen() on the
read side.

Interesting idea. I was (ha, ha, what a coincidence) also thinking
about this problem and was wondering if we couldn't be a lot smarter
about pg_mbcliplen(). ...

for UTF-8, we could do that much more directly: if the string is short
enough, stop, else, look at cmd_str[pgstat_track_activity_query_size].
If that character has (c & 0xc0) != 0x80, write a '\0' and stop; else,
back up until you find a character that for which that continuation
holds, write a '\0', and stop. This kind of approach only works if we
have a definitive test for whether something is a "continuation
character" which probably isn't true in all encodings, but maybe it's
still worth considering.

Offhand I think it doesn't work in anything but UTF8. A way that would
work in all encodings is to back up to the first non-high-bit-set
byte and then mbcliplen from there. Probably, there's enough ASCII
in typical SQL commands that this would often be a win.

Your idea is probably a lot simpler to implement, though, and I
definitely agree that shifting the work from the write side to the
read side makes sense. Updating pg_stat_activity is a lot more common
than reading it.

+1. I kinda doubt that it is worth optimizing further than that,
really.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#7Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#1)
1 attachment(s)
Re: More efficient truncation of pg_stat_activity query strings

On 2017-09-12 00:19:48 -0700, Andres Freund wrote:

Hi,

I've recently seen a benchmark in which pg_mbcliplen() showed up
prominently. Which it will basically in any benchmark with longer query
strings, but fast queries. That's not that uncommon.

I wonder if we could avoid the cost of pg_mbcliplen() from within
pgstat_report_activity(), by moving some of the cost to the read
side. pgstat values are obviously read far less frequently in nearly all
cases that are performance relevant.

Therefore I wonder if we couldn't just store a querystring that's
essentially just a memcpy()ed prefix, and do a pg_mbcliplen() on the
read side. I think that should work because all *server side* encodings
store character lengths in the *first* byte of a multibyte character
(at least one clientside encoding, gb18030, doesn't behave that way).

That'd necessitate an added memory copy in pg_stat_get_activity(), but
that seems fairly harmless.

Faults in my thinking?

Here's a patch that implements that idea. Seems to work well. I'm a
bit loathe to add proper regression tests for this, seems awfully
dependent on specific track_activity_query_size settings. I did confirm
using gdb that I see incomplete characters before
pgstat_clip_activity(), but not after.

I've renamed st_activity to st_activity_raw to increase the likelihood
that potential external users of st_activity notice and adapt. Increases
the noise, but imo to a very bareable amount. Don't feel strongly
though.

Greetings,

Andres Freund

Attachments:

0001-Speedup-pgstat_report_activity-by-moving-mb-aware-tr.patchtext/x-diff; charset=us-asciiDownload
From 9c9503f0dfe1babb21e81c1955e996ad06c93608 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 13 Sep 2017 19:25:34 -0700
Subject: [PATCH 1/8] Speedup pgstat_report_activity by moving mb-aware
 truncation to read side.

Previously multi-byte aware truncation was done on every
pgstat_report_activity() call - proving to be a bottleneck for
workloads with long query strings that execute quickly.

Instead move the truncation to the read side, which is commonly
executed far less frequently. That's possible because all server
encodings allow to determine the length of a multi-byte string from
the first byte.

Author: Andres Freund
Discussion: https://postgr.es/m/20170912071948.pa7igbpkkkviecpz@alap3.anarazel.de
---
 src/backend/postmaster/pgstat.c     | 63 ++++++++++++++++++++++++++++---------
 src/backend/utils/adt/pgstatfuncs.c | 17 +++++++---
 src/include/pgstat.h                | 12 +++++--
 3 files changed, 72 insertions(+), 20 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index accf302cf7..ccb528e627 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2701,7 +2701,7 @@ CreateSharedBackendStatus(void)
 		buffer = BackendActivityBuffer;
 		for (i = 0; i < NumBackendStatSlots; i++)
 		{
-			BackendStatusArray[i].st_activity = buffer;
+			BackendStatusArray[i].st_activity_raw = buffer;
 			buffer += pgstat_track_activity_query_size;
 		}
 	}
@@ -2922,11 +2922,11 @@ pgstat_bestart(void)
 #endif
 	beentry->st_state = STATE_UNDEFINED;
 	beentry->st_appname[0] = '\0';
-	beentry->st_activity[0] = '\0';
+	beentry->st_activity_raw[0] = '\0';
 	/* Also make sure the last byte in each string area is always 0 */
 	beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
 	beentry->st_appname[NAMEDATALEN - 1] = '\0';
-	beentry->st_activity[pgstat_track_activity_query_size - 1] = '\0';
+	beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
 	beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
 	beentry->st_progress_command_target = InvalidOid;
 
@@ -3017,7 +3017,7 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
 			pgstat_increment_changecount_before(beentry);
 			beentry->st_state = STATE_DISABLED;
 			beentry->st_state_start_timestamp = 0;
-			beentry->st_activity[0] = '\0';
+			beentry->st_activity_raw[0] = '\0';
 			beentry->st_activity_start_timestamp = 0;
 			/* st_xact_start_timestamp and wait_event_info are also disabled */
 			beentry->st_xact_start_timestamp = 0;
@@ -3034,8 +3034,12 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
 	start_timestamp = GetCurrentStatementStartTimestamp();
 	if (cmd_str != NULL)
 	{
-		len = pg_mbcliplen(cmd_str, strlen(cmd_str),
-						   pgstat_track_activity_query_size - 1);
+		/*
+		 * Compute length of to-be-stored string unaware of multi-byte
+		 * characters. For speed reasons that'll get corrected on read, rather
+		 * than computed every write.
+		 */
+		len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
 	}
 	current_timestamp = GetCurrentTimestamp();
 
@@ -3049,8 +3053,8 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
 
 	if (cmd_str != NULL)
 	{
-		memcpy((char *) beentry->st_activity, cmd_str, len);
-		beentry->st_activity[len] = '\0';
+		memcpy((char *) beentry->st_activity_raw, cmd_str, len);
+		beentry->st_activity_raw[len] = '\0';
 		beentry->st_activity_start_timestamp = start_timestamp;
 	}
 
@@ -3278,8 +3282,8 @@ pgstat_read_current_status(void)
 				 */
 				strcpy(localappname, (char *) beentry->st_appname);
 				localentry->backendStatus.st_appname = localappname;
-				strcpy(localactivity, (char *) beentry->st_activity);
-				localentry->backendStatus.st_activity = localactivity;
+				strcpy(localactivity, (char *) beentry->st_activity_raw);
+				localentry->backendStatus.st_activity_raw = localactivity;
 				localentry->backendStatus.st_ssl = beentry->st_ssl;
 #ifdef USE_SSL
 				if (beentry->st_ssl)
@@ -3945,10 +3949,13 @@ pgstat_get_backend_current_activity(int pid, bool checkUser)
 			/* Now it is safe to use the non-volatile pointer */
 			if (checkUser && !superuser() && beentry->st_userid != GetUserId())
 				return "<insufficient privilege>";
-			else if (*(beentry->st_activity) == '\0')
+			else if (*(beentry->st_activity_raw) == '\0')
 				return "<command string not enabled>";
 			else
-				return beentry->st_activity;
+			{
+				/* this'll leak a bit of memory, but that seems acceptable */
+				return pgstat_clip_activity(beentry->st_activity_raw);
+			}
 		}
 
 		beentry++;
@@ -3994,7 +4001,7 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 		if (beentry->st_procpid == pid)
 		{
 			/* Read pointer just once, so it can't change after validation */
-			const char *activity = beentry->st_activity;
+			const char *activity = beentry->st_activity_raw;
 			const char *activity_last;
 
 			/*
@@ -4017,7 +4024,8 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 			/*
 			 * Copy only ASCII-safe characters so we don't run into encoding
 			 * problems when reporting the message; and be sure not to run off
-			 * the end of memory.
+			 * the end of memory.  As only ASCII characters are reported, it
+			 * doesn't seem necessary to perform multibyte aware clipping.
 			 */
 			ascii_safe_strlcpy(buffer, activity,
 							   Min(buflen, pgstat_track_activity_query_size));
@@ -6270,3 +6278,30 @@ pgstat_db_requested(Oid databaseid)
 
 	return false;
 }
+
+/*
+ * Convert a potentially unsafely truncated activity string (see
+ * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
+ * one.
+ *
+ * The returned string is allocated in the caller's memory context and may be
+ * freed.
+ */
+char *
+pgstat_clip_activity(const char *activity)
+{
+	int rawlen = strnlen(activity, pgstat_track_activity_query_size - 1);
+	int cliplen;
+
+	/*
+	 * All supported server-encodings allow to determine the length of a
+	 * multi-byte character from it's first byte (not the case for client
+	 * encodings, see GB18030). As st_activity always is stored using server
+	 * encoding, this allows us to perform multi-byte aware truncation, even
+	 * if the string earlier was truncated in the middle of a multi-byte
+	 * character.
+	 */
+	cliplen = pg_mbcliplen(activity, rawlen,
+						   pgstat_track_activity_query_size - 1);
+	return pnstrdup(activity, cliplen);
+}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 20ce48b2d8..5a968e3758 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -664,6 +664,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 			is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS))
 		{
 			SockAddr	zero_clientaddr;
+			char	   *clipped_activity;
 
 			switch (beentry->st_state)
 			{
@@ -690,7 +691,9 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 					break;
 			}
 
-			values[5] = CStringGetTextDatum(beentry->st_activity);
+			clipped_activity = pgstat_clip_activity(beentry->st_activity_raw);
+			values[5] = CStringGetTextDatum(clipped_activity);
+			pfree(clipped_activity);
 
 			proc = BackendPidGetProc(beentry->st_procpid);
 			if (proc != NULL)
@@ -906,17 +909,23 @@ pg_stat_get_backend_activity(PG_FUNCTION_ARGS)
 	int32		beid = PG_GETARG_INT32(0);
 	PgBackendStatus *beentry;
 	const char *activity;
+	char *clipped_activity;
+	text *ret;
 
 	if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
 		activity = "<backend information not available>";
 	else if (!has_privs_of_role(GetUserId(), beentry->st_userid))
 		activity = "<insufficient privilege>";
-	else if (*(beentry->st_activity) == '\0')
+	else if (*(beentry->st_activity_raw) == '\0')
 		activity = "<command string not enabled>";
 	else
-		activity = beentry->st_activity;
+		activity = beentry->st_activity_raw;
 
-	PG_RETURN_TEXT_P(cstring_to_text(activity));
+	clipped_activity = pgstat_clip_activity(activity);
+	ret = cstring_to_text(activity);
+	pfree(clipped_activity);
+
+	PG_RETURN_TEXT_P(ret);
 }
 
 Datum
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 57ac5d41e4..3690780236 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1003,8 +1003,14 @@ typedef struct PgBackendStatus
 	/* application name; MUST be null-terminated */
 	char	   *st_appname;
 
-	/* current command string; MUST be null-terminated */
-	char	   *st_activity;
+	/*
+	 * Current command string; MUST be null-terminated. Note that this string
+	 * possibly is truncated in the middle of a multi-byte character. As
+	 * activity strings are stored more frequently than read, that allows to
+	 * move the cost of correct truncation to the display side. Use
+	 * pgstat_clip_activity() to trunctae correctly.
+	 */
+	char	   *st_activity_raw;
 
 	/*
 	 * Command progress reporting.  Any command which wishes can advertise
@@ -1193,6 +1199,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
 
+extern char *pgstat_clip_activity(const char *activity);
+
 /* ----------
  * pgstat_report_wait_start() -
  *
-- 
2.14.1.536.g6867272d5b.dirty

#8Kuntal Ghosh
kuntalghosh.2007@gmail.com
In reply to: Andres Freund (#7)
Re: More efficient truncation of pg_stat_activity query strings

On Thu, Sep 14, 2017 at 11:33 AM, Andres Freund <andres@anarazel.de> wrote:

On 2017-09-12 00:19:48 -0700, Andres Freund wrote:

Hi,

I've recently seen a benchmark in which pg_mbcliplen() showed up
prominently. Which it will basically in any benchmark with longer query
strings, but fast queries. That's not that uncommon.

I wonder if we could avoid the cost of pg_mbcliplen() from within
pgstat_report_activity(), by moving some of the cost to the read
side. pgstat values are obviously read far less frequently in nearly all
cases that are performance relevant.

Therefore I wonder if we couldn't just store a querystring that's
essentially just a memcpy()ed prefix, and do a pg_mbcliplen() on the
read side. I think that should work because all *server side* encodings
store character lengths in the *first* byte of a multibyte character
(at least one clientside encoding, gb18030, doesn't behave that way).

That'd necessitate an added memory copy in pg_stat_get_activity(), but
that seems fairly harmless.

Faults in my thinking?

Here's a patch that implements that idea. Seems to work well. I'm a
bit loathe to add proper regression tests for this, seems awfully
dependent on specific track_activity_query_size settings. I did confirm
using gdb that I see incomplete characters before
pgstat_clip_activity(), but not after.

I've renamed st_activity to st_activity_raw to increase the likelihood
that potential external users of st_activity notice and adapt. Increases
the noise, but imo to a very bareable amount. Don't feel strongly
though.

Hello,

The patch looks good to me. I've done some regression testing with a
custom script on my local system. The script contains the following
statement:
SELECT 'aaa..<repeated 600 times>' as col;

Test 1
-----------------------------------
duration: 300 seconds
clients/threads: 1

On HEAD TPS: 13181
+    9.30%     0.27%  postgres  postgres             [.] pgstat_report_activity
+    8.88%     0.02%  postgres  postgres             [.] pg_mbcliplen
+    7.76%     4.77%  postgres  postgres             [.] pg_encoding_mbcliplen
+    4.06%     4.06%  postgres  postgres             [.] pg_utf_mblen

With the patch TPS:13628 (+3.39%)
+ 0.36% 0.21% postgres postgres [.] pgstat_report_activity

Test 2
-----------------------------------
duration: 300 seconds
clients/threads: 8

On HEAD TPS: 53084
+   12.17%     0.20%  postgres  postgres             [.]
pgstat_report_activity
+   11.83%     0.02%  postgres  postgres             [.] pg_mbcliplen
+   11.19%     8.03%  postgres  postgres             [.] pg_encoding_mbcliplen
+    3.74%     3.73%  postgres  postgres             [.] pg_utf_mblen

With the patch TPS: 63949 (+20.4%)
+ 0.40% 0.25% postgres postgres [.] pgstat_report_activity

This shows the significance of this patch in terms of performance
improvement of pgstat_report_activity. Is there any other tests I
should do for the same?

--
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#9Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#7)
Re: More efficient truncation of pg_stat_activity query strings

On Thu, Sep 14, 2017 at 2:03 AM, Andres Freund <andres@anarazel.de> wrote:

Here's a patch that implements that idea.

From the department of cosmetic nitpicking:

+ * All supported server-encodings allow to determine the length of a

make it possible to determine

+ * multi-byte character from it's first byte (not the case for client

its

this is not the case

+ * encodings, see GB18030). As st_activity always is stored using server

is always stored using a server

+ * pgstat_clip_activity() to trunctae correctly.

truncate

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#10Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#9)
1 attachment(s)
Re: More efficient truncation of pg_stat_activity query strings

On 2017-09-15 08:25:09 -0400, Robert Haas wrote:

On Thu, Sep 14, 2017 at 2:03 AM, Andres Freund <andres@anarazel.de> wrote:

Here's a patch that implements that idea.

From the department of cosmetic nitpicking:

+ * All supported server-encodings allow to determine the length of a

make it possible to determine

+ * multi-byte character from it's first byte (not the case for client

its

this is not the case

+ * encodings, see GB18030). As st_activity always is stored using server

is always stored using a server

+ * pgstat_clip_activity() to trunctae correctly.

Version correcting these is attached. Thanks!

Greetings,

Andres Freund

Attachments:

0001-Speedup-pgstat_report_activity-by-moving-mb-aware-tr.v2.patchtext/x-diff; charset=us-asciiDownload
From a2325a2649da403dc562b8e93df972839823d924 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Wed, 13 Sep 2017 19:25:34 -0700
Subject: [PATCH] Speedup pgstat_report_activity by moving mb-aware truncation
 to read side.

Previously multi-byte aware truncation was done on every
pgstat_report_activity() call - proving to be a bottleneck for
workloads with long query strings that execute quickly.

Instead move the truncation to the read side, which is commonly
executed far less frequently. That's possible because all server
encodings allow to determine the length of a multi-byte string from
the first byte.

Author: Andres Freund
Discussion: https://postgr.es/m/20170912071948.pa7igbpkkkviecpz@alap3.anarazel.de
---
 src/backend/postmaster/pgstat.c     | 63 ++++++++++++++++++++++++++++---------
 src/backend/utils/adt/pgstatfuncs.c | 17 +++++++---
 src/include/pgstat.h                | 12 +++++--
 3 files changed, 72 insertions(+), 20 deletions(-)

diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index accf302cf7..1ffdac5448 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -2701,7 +2701,7 @@ CreateSharedBackendStatus(void)
 		buffer = BackendActivityBuffer;
 		for (i = 0; i < NumBackendStatSlots; i++)
 		{
-			BackendStatusArray[i].st_activity = buffer;
+			BackendStatusArray[i].st_activity_raw = buffer;
 			buffer += pgstat_track_activity_query_size;
 		}
 	}
@@ -2922,11 +2922,11 @@ pgstat_bestart(void)
 #endif
 	beentry->st_state = STATE_UNDEFINED;
 	beentry->st_appname[0] = '\0';
-	beentry->st_activity[0] = '\0';
+	beentry->st_activity_raw[0] = '\0';
 	/* Also make sure the last byte in each string area is always 0 */
 	beentry->st_clienthostname[NAMEDATALEN - 1] = '\0';
 	beentry->st_appname[NAMEDATALEN - 1] = '\0';
-	beentry->st_activity[pgstat_track_activity_query_size - 1] = '\0';
+	beentry->st_activity_raw[pgstat_track_activity_query_size - 1] = '\0';
 	beentry->st_progress_command = PROGRESS_COMMAND_INVALID;
 	beentry->st_progress_command_target = InvalidOid;
 
@@ -3017,7 +3017,7 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
 			pgstat_increment_changecount_before(beentry);
 			beentry->st_state = STATE_DISABLED;
 			beentry->st_state_start_timestamp = 0;
-			beentry->st_activity[0] = '\0';
+			beentry->st_activity_raw[0] = '\0';
 			beentry->st_activity_start_timestamp = 0;
 			/* st_xact_start_timestamp and wait_event_info are also disabled */
 			beentry->st_xact_start_timestamp = 0;
@@ -3034,8 +3034,12 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
 	start_timestamp = GetCurrentStatementStartTimestamp();
 	if (cmd_str != NULL)
 	{
-		len = pg_mbcliplen(cmd_str, strlen(cmd_str),
-						   pgstat_track_activity_query_size - 1);
+		/*
+		 * Compute length of to-be-stored string unaware of multi-byte
+		 * characters. For speed reasons that'll get corrected on read, rather
+		 * than computed every write.
+		 */
+		len = Min(strlen(cmd_str), pgstat_track_activity_query_size - 1);
 	}
 	current_timestamp = GetCurrentTimestamp();
 
@@ -3049,8 +3053,8 @@ pgstat_report_activity(BackendState state, const char *cmd_str)
 
 	if (cmd_str != NULL)
 	{
-		memcpy((char *) beentry->st_activity, cmd_str, len);
-		beentry->st_activity[len] = '\0';
+		memcpy((char *) beentry->st_activity_raw, cmd_str, len);
+		beentry->st_activity_raw[len] = '\0';
 		beentry->st_activity_start_timestamp = start_timestamp;
 	}
 
@@ -3278,8 +3282,8 @@ pgstat_read_current_status(void)
 				 */
 				strcpy(localappname, (char *) beentry->st_appname);
 				localentry->backendStatus.st_appname = localappname;
-				strcpy(localactivity, (char *) beentry->st_activity);
-				localentry->backendStatus.st_activity = localactivity;
+				strcpy(localactivity, (char *) beentry->st_activity_raw);
+				localentry->backendStatus.st_activity_raw = localactivity;
 				localentry->backendStatus.st_ssl = beentry->st_ssl;
 #ifdef USE_SSL
 				if (beentry->st_ssl)
@@ -3945,10 +3949,13 @@ pgstat_get_backend_current_activity(int pid, bool checkUser)
 			/* Now it is safe to use the non-volatile pointer */
 			if (checkUser && !superuser() && beentry->st_userid != GetUserId())
 				return "<insufficient privilege>";
-			else if (*(beentry->st_activity) == '\0')
+			else if (*(beentry->st_activity_raw) == '\0')
 				return "<command string not enabled>";
 			else
-				return beentry->st_activity;
+			{
+				/* this'll leak a bit of memory, but that seems acceptable */
+				return pgstat_clip_activity(beentry->st_activity_raw);
+			}
 		}
 
 		beentry++;
@@ -3994,7 +4001,7 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 		if (beentry->st_procpid == pid)
 		{
 			/* Read pointer just once, so it can't change after validation */
-			const char *activity = beentry->st_activity;
+			const char *activity = beentry->st_activity_raw;
 			const char *activity_last;
 
 			/*
@@ -4017,7 +4024,8 @@ pgstat_get_crashed_backend_activity(int pid, char *buffer, int buflen)
 			/*
 			 * Copy only ASCII-safe characters so we don't run into encoding
 			 * problems when reporting the message; and be sure not to run off
-			 * the end of memory.
+			 * the end of memory.  As only ASCII characters are reported, it
+			 * doesn't seem necessary to perform multibyte aware clipping.
 			 */
 			ascii_safe_strlcpy(buffer, activity,
 							   Min(buflen, pgstat_track_activity_query_size));
@@ -6270,3 +6278,30 @@ pgstat_db_requested(Oid databaseid)
 
 	return false;
 }
+
+/*
+ * Convert a potentially unsafely truncated activity string (see
+ * PgBackendStatus.st_activity_raw's documentation) into a correctly truncated
+ * one.
+ *
+ * The returned string is allocated in the caller's memory context and may be
+ * freed.
+ */
+char *
+pgstat_clip_activity(const char *activity)
+{
+	int rawlen = strnlen(activity, pgstat_track_activity_query_size - 1);
+	int cliplen;
+
+	/*
+	 * All supported server-encodings make it possible to determine the length
+	 * of a multi-byte character from its first byte (this is not the case for
+	 * client encodings, see GB18030). As st_activity is always stored using
+	 * server encoding, this allows us to perform multi-byte aware truncation,
+	 * even if the string earlier was truncated in the middle of a multi-byte
+	 * character.
+	 */
+	cliplen = pg_mbcliplen(activity, rawlen,
+						   pgstat_track_activity_query_size - 1);
+	return pnstrdup(activity, cliplen);
+}
diff --git a/src/backend/utils/adt/pgstatfuncs.c b/src/backend/utils/adt/pgstatfuncs.c
index 20ce48b2d8..5a968e3758 100644
--- a/src/backend/utils/adt/pgstatfuncs.c
+++ b/src/backend/utils/adt/pgstatfuncs.c
@@ -664,6 +664,7 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 			is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_ALL_STATS))
 		{
 			SockAddr	zero_clientaddr;
+			char	   *clipped_activity;
 
 			switch (beentry->st_state)
 			{
@@ -690,7 +691,9 @@ pg_stat_get_activity(PG_FUNCTION_ARGS)
 					break;
 			}
 
-			values[5] = CStringGetTextDatum(beentry->st_activity);
+			clipped_activity = pgstat_clip_activity(beentry->st_activity_raw);
+			values[5] = CStringGetTextDatum(clipped_activity);
+			pfree(clipped_activity);
 
 			proc = BackendPidGetProc(beentry->st_procpid);
 			if (proc != NULL)
@@ -906,17 +909,23 @@ pg_stat_get_backend_activity(PG_FUNCTION_ARGS)
 	int32		beid = PG_GETARG_INT32(0);
 	PgBackendStatus *beentry;
 	const char *activity;
+	char *clipped_activity;
+	text *ret;
 
 	if ((beentry = pgstat_fetch_stat_beentry(beid)) == NULL)
 		activity = "<backend information not available>";
 	else if (!has_privs_of_role(GetUserId(), beentry->st_userid))
 		activity = "<insufficient privilege>";
-	else if (*(beentry->st_activity) == '\0')
+	else if (*(beentry->st_activity_raw) == '\0')
 		activity = "<command string not enabled>";
 	else
-		activity = beentry->st_activity;
+		activity = beentry->st_activity_raw;
 
-	PG_RETURN_TEXT_P(cstring_to_text(activity));
+	clipped_activity = pgstat_clip_activity(activity);
+	ret = cstring_to_text(activity);
+	pfree(clipped_activity);
+
+	PG_RETURN_TEXT_P(ret);
 }
 
 Datum
diff --git a/src/include/pgstat.h b/src/include/pgstat.h
index 57ac5d41e4..52af0aa541 100644
--- a/src/include/pgstat.h
+++ b/src/include/pgstat.h
@@ -1003,8 +1003,14 @@ typedef struct PgBackendStatus
 	/* application name; MUST be null-terminated */
 	char	   *st_appname;
 
-	/* current command string; MUST be null-terminated */
-	char	   *st_activity;
+	/*
+	 * Current command string; MUST be null-terminated. Note that this string
+	 * possibly is truncated in the middle of a multi-byte character. As
+	 * activity strings are stored more frequently than read, that allows to
+	 * move the cost of correct truncation to the display side. Use
+	 * pgstat_clip_activity() to truncate correctly.
+	 */
+	char	   *st_activity_raw;
 
 	/*
 	 * Command progress reporting.  Any command which wishes can advertise
@@ -1193,6 +1199,8 @@ extern PgStat_BackendFunctionEntry *find_funcstat_entry(Oid func_id);
 
 extern void pgstat_initstats(Relation rel);
 
+extern char *pgstat_clip_activity(const char *activity);
+
 /* ----------
  * pgstat_report_wait_start() -
  *
-- 
2.14.1.536.g6867272d5b.dirty

#11Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#10)
Re: More efficient truncation of pg_stat_activity query strings

Andres Freund <andres@anarazel.de> writes:

Version correcting these is attached. Thanks!

I'd vote for undoing the s/st_activity/st_activity_raw/g business.
That may have been useful while writing the patch, to ensure you
found all the references; but it's just creating a lot of unnecessary
delta in the final code, with the attendant hazards for back-patches.

regards, tom lane

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#12Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#11)
Re: More efficient truncation of pg_stat_activity query strings

On 2017-09-15 16:45:47 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Version correcting these is attached. Thanks!

I'd vote for undoing the s/st_activity/st_activity_raw/g business.
That may have been useful while writing the patch, to ensure you
found all the references; but it's just creating a lot of unnecessary
delta in the final code, with the attendant hazards for back-patches.

I was wondering about that too (see [1]http://archives.postgresql.org/message-id/20170914060329.j7lt7oyyzquxiut6%40alap3.anarazel.de). My only concern is that some
extensions out there might be accessing the string expecting it to be
properly truncated. The rename at least forces them to look for the new
name... I'm slightly in favor of keeping the rename, it doesn't seem
likely to cause a lot of backpatch pain.

Regards,

Andres Freund

[1]: http://archives.postgresql.org/message-id/20170914060329.j7lt7oyyzquxiut6%40alap3.anarazel.de

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#13Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#12)
Re: More efficient truncation of pg_stat_activity query strings

On Fri, Sep 15, 2017 at 4:49 PM, Andres Freund <andres@anarazel.de> wrote:

On 2017-09-15 16:45:47 -0400, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Version correcting these is attached. Thanks!

I'd vote for undoing the s/st_activity/st_activity_raw/g business.
That may have been useful while writing the patch, to ensure you
found all the references; but it's just creating a lot of unnecessary
delta in the final code, with the attendant hazards for back-patches.

I was wondering about that too (see [1]). My only concern is that some
extensions out there might be accessing the string expecting it to be
properly truncated. The rename at least forces them to look for the new
name... I'm slightly in favor of keeping the rename, it doesn't seem
likely to cause a lot of backpatch pain.

I tend to agree with you, but it's not a huge deal either way. Even
if somebody fails to update third-party code that touches this, a lot
of times it'll work anyway. But that very fact is, of course, why it
would be slightly better to break it explicitly.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

#14Andres Freund
andres@anarazel.de
In reply to: Kuntal Ghosh (#8)
Re: More efficient truncation of pg_stat_activity query strings

Hi,

On 2017-09-15 17:43:35 +0530, Kuntal Ghosh wrote:

The patch looks good to me. I've done some regression testing with a
custom script on my local system. The script contains the following
statement:

SELECT 'aaa..<repeated 600 times>' as col;

Test 1
-----------------------------------
duration: 300 seconds
clients/threads: 1

With the patch TPS:13628 (+3.39%)
+ 0.36% 0.21% postgres postgres [.] pgstat_report_activity

Test 2
-----------------------------------
duration: 300 seconds
clients/threads: 8

With the patch TPS: 63949 (+20.4%)
+ 0.40% 0.25% postgres postgres [.] pgstat_report_activity

This shows the significance of this patch in terms of performance
improvement of pgstat_report_activity. Is there any other tests I
should do for the same?

Thanks for the test! I think this looks good, no further tests
necessary.

Greetings,

Andres Freund

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers