Sample rate added to pg_stat_statements

Started by Ilia Evdokimovabout 1 year ago58 messages

ilya.evdokimov@tantorlabs.com

about 1 year ago

1 attachment(s)

Hi hackers,

Under high-load scenarios with a significant number of transactions per
second, pg_stat_statements introduces substantial overhead due to the
collection and storage of statistics. Currently, we are sometimes forced
to disable pg_stat_statements or adjust the size of the statistics using
pg_stat_statements.max, which is not always optimal. One potential
solution to this issue could be query sampling in pg_stat_statements.

A similar approach has been implemented in extensions like auto_explain
and pg_store_plans, and it has proven very useful in high-load systems.
However, this approach has its trade-offs, as it sacrifices statistical
accuracy for improved performance. This patch introduces a new
configuration parameter, pg_stat_statements.sample_rate for the
pg_stat_statements extension. The patch provides the ability to control
the sampling of query statistics in pg_stat_statements.

This patch serves as a proof of concept (POC), and I would like to hear
your thoughts on whether such an approach is viable and applicable.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

0001-PATCH-Allow-setting-sample-ratio-for-pg_stat_stateme.patchtext/x-patch; charset=UTF-8; name=0001-PATCH-Allow-setting-sample-ratio-for-pg_stat_stateme.patchDownload

From b19081bd257014f0c4046570097e4bc7b28a3126 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Date: Mon, 18 Nov 2024 21:05:43 +0300
Subject: [PATCH] [PATCH] Allow setting sample ratio for pg_stat_statements

New configuration parameter pg_stat_statements.sample_ratio makes it
possible to control just a fraction of the queries meeting the configured
threshold, to reduce the amount of controlling.
---
 .../pg_stat_statements/pg_stat_statements.c   | 22 ++++++++++++++++++-
 doc/src/sgml/pgstatstatements.sgml            | 17 ++++++++++++++
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 49c657b3e0..51d1a48252 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -49,6 +49,7 @@
 
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
+#include "common/pg_prng.h"
 #include "common/int.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
@@ -73,6 +74,8 @@
 
 PG_MODULE_MAGIC;
 
+static double pgss_sample_rate = 1;
+
 /* Location of permanent stats file (valid when database is shut down) */
 #define PGSS_DUMP_FILE	PGSTAT_STAT_PERMANENT_DIRECTORY "/pg_stat_statements.stat"
 
@@ -295,6 +298,8 @@ static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
 
+/* Is the current top-level query to be sampled? */
+static bool current_query_sampled = false;
 
 #define pgss_enabled(level) \
 	(!IsParallelWorker() && \
@@ -414,6 +419,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to process.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -989,6 +1007,8 @@ pgss_planner(Query *parse,
 static void
 pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 {
+	current_query_sampled = (pg_prng_double(&pg_global_prng_state) < pgss_sample_rate);
+
 	if (prev_ExecutorStart)
 		prev_ExecutorStart(queryDesc, eflags);
 	else
@@ -999,7 +1019,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	 * counting of optimizable statements that are directly contained in
 	 * utility statements.
 	 */
-	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+	if (current_query_sampled && pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
 	{
 		/*
 		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..1e2533a802 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,23 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      control a fraction of the statements in each session.  The default is 1,
+      meaning control all the queries. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

about 1 year ago

In reply to: Ilia Evdokimov (#1)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

Hi everyone,

I believe we should also include this check in the pgss_ExecutorEnd()
function because sampling in pgss_ExecutorEnd() ensures that a query not
initially sampled in pgss_ExecutorStart() can still be logged if it
meets the pg_stat_statements.sample_rate criteria. This approach adds
flexibility by allowing critical queries to be captured while
maintaining efficient sampling.

I attached new version of patch.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v2-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v2-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchDownload

From c33e1ae07e8eb4f797b47e7deb07e6322b1375a3 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.ru>
Date: Tue, 19 Nov 2024 12:53:52 +0300
Subject: [PATCH] Allow setting sample ratio for pg_stat_statements

New configuration parameter pg_stat_statements.sample_ratio makes it
possible to control just a fraction of the queries meeting the configured
threshold, to reduce the amount of controlling.
---
 .../pg_stat_statements/pg_stat_statements.c   | 25 ++++++++++++++++---
 doc/src/sgml/pgstatstatements.sgml            | 17 +++++++++++++
 2 files changed, 39 insertions(+), 3 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 49c657b3e0..42b3fee815 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -49,6 +49,7 @@
 
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
+#include "common/pg_prng.h"
 #include "common/int.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
@@ -294,7 +295,10 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1;
 
+/* Is the current top-level query to be sampled? */
+static bool current_query_sampled = false;
 
 #define pgss_enabled(level) \
 	(!IsParallelWorker() && \
@@ -414,6 +418,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to process.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -989,6 +1006,8 @@ pgss_planner(Query *parse,
 static void
 pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 {
+	current_query_sampled = (pg_prng_double(&pg_global_prng_state) < pgss_sample_rate);
+
 	if (prev_ExecutorStart)
 		prev_ExecutorStart(queryDesc, eflags);
 	else
@@ -999,7 +1018,7 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	 * counting of optimizable statements that are directly contained in
 	 * utility statements.
 	 */
-	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+	if (current_query_sampled && pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
 	{
 		/*
 		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
@@ -1068,8 +1087,8 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 {
 	uint64		queryId = queryDesc->plannedstmt->queryId;
 
-	if (queryId != UINT64CONST(0) && queryDesc->totaltime &&
-		pgss_enabled(nesting_level))
+	if (current_query_sampled && queryId != UINT64CONST(0) &&
+		queryDesc->totaltime && pgss_enabled(nesting_level))
 	{
 		/*
 		 * Make sure stats accumulation is done.  (Note: it's okay if several
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..1e2533a802 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,23 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      control a fraction of the statements in each session.  The default is 1,
+      meaning control all the queries. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

Andrey M. Borodin

x4mmm@yandex-team.ru

about 1 year ago

In reply to: Ilia Evdokimov (#2)

Re: Sample rate added to pg_stat_statements

On 18 Nov 2024, at 23:33, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:

Hi hackers,

Under high-load scenarios with a significant number of transactions per second, pg_stat_statements introduces substantial overhead due to the collection and storage of statistics. Currently, we are sometimes forced to disable pg_stat_statements or adjust the size of the statistics using pg_stat_statements.max, which is not always optimal. One potential solution to this issue could be query sampling in pg_stat_statements.

A similar approach has been implemented in extensions like auto_explain and pg_store_plans, and it has proven very useful in high-load systems. However, this approach has its trade-offs, as it sacrifices statistical accuracy for improved performance. This patch introduces a new configuration parameter, pg_stat_statements.sample_rate for the pg_stat_statements extension. The patch provides the ability to control the sampling of query statistics in pg_stat_statements.

This patch serves as a proof of concept (POC), and I would like to hear your thoughts on whether such an approach is viable and applicable.

+1 for the idea. I heard a lot of complaints about that pgss is costly. Most of them were using it wrong though. But at least it could give an easy way to rule out performance impact of pgss.

On 19 Nov 2024, at 15:09, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:

I believe we should also include this check in the pgss_ExecutorEnd() function because sampling in pgss_ExecutorEnd() ensures that a query not initially sampled in pgss_ExecutorStart() can still be logged if it meets the pg_stat_statements.sample_rate criteria. This approach adds flexibility by allowing critical queries to be captured while maintaining efficient sampling.

Is there a reason why pgss_ProcessUtility is excluded?

Best regards, Andrey Borodin.

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

about 1 year ago

In reply to: Andrey M. Borodin (#3)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

On 19.11.2024 15:11, Andrey M. Borodin wrote:

On 18 Nov 2024, at 23:33, Ilia Evdokimov<ilya.evdokimov@tantorlabs.com> wrote:

Hi hackers,

Under high-load scenarios with a significant number of transactions per second, pg_stat_statements introduces substantial overhead due to the collection and storage of statistics. Currently, we are sometimes forced to disable pg_stat_statements or adjust the size of the statistics using pg_stat_statements.max, which is not always optimal. One potential solution to this issue could be query sampling in pg_stat_statements.

A similar approach has been implemented in extensions like auto_explain and pg_store_plans, and it has proven very useful in high-load systems. However, this approach has its trade-offs, as it sacrifices statistical accuracy for improved performance. This patch introduces a new configuration parameter, pg_stat_statements.sample_rate for the pg_stat_statements extension. The patch provides the ability to control the sampling of query statistics in pg_stat_statements.

This patch serves as a proof of concept (POC), and I would like to hear your thoughts on whether such an approach is viable and applicable.

+1 for the idea. I heard a lot of complaints about that pgss is costly. Most of them were using it wrong though. But at least it could give an easy way to rule out performance impact of pgss.

Thank you for review.

On 19 Nov 2024, at 15:09, Ilia Evdokimov<ilya.evdokimov@tantorlabs.com> wrote:

I believe we should also include this check in the pgss_ExecutorEnd() function because sampling in pgss_ExecutorEnd() ensures that a query not initially sampled in pgss_ExecutorStart() can still be logged if it meets the pg_stat_statements.sample_rate criteria. This approach adds flexibility by allowing critical queries to be captured while maintaining efficient sampling.

Is there a reason why pgss_ProcessUtility is excluded?

Best regards, Andrey Borodin.

Ah, you’re right! Moreover, this check is needed not only in
pgss_ProcessUtility() but in all places where pgss_enabled() is called.
Therefore, it’s better to move the sampling check directly into
pgss_enabled().

However, another issue arises with the initialization of
'current_query_sample' variable that contains whether query is sampling
or not. Initializing it in pgss_ExecutorStart()||is not sufficient,
because pgss_post_parse_analyze() or pgss_ProcessUtility() might be
called earlier. This could lead to different values of
'current_query_sample' being used in these functions, which is
undesirable. Run the regression tests for pg_stat_statements with
initializing in pgss_ExecutorStart(), and you'll see this.

To avoid this, we need to find a function that is called earlier than
all the others. In my opinion, pgss_post_parse_analyze() is a good
candidate for this purpose. If you have objections or better
suggestions, feel free to share them. I attached the patch with fixes.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v3-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v3-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchDownload

From 725d0a428040d46cfedac0ca707d7d7a27db8721 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Date: Tue, 19 Nov 2024 16:58:28 +0300
Subject: [PATCH] Allow setting sample ratio for pg_stat_statements

New configuration parameter pg_stat_statements.sample_ratio makes it
possible to control just a fraction of the queries meeting the configured
threshold, to reduce the amount of controlling.
---
 .../pg_stat_statements/pg_stat_statements.c   | 22 ++++++++++++++++++-
 doc/src/sgml/pgstatstatements.sgml            | 17 ++++++++++++++
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 49c657b3e0..01586c32b0 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -49,6 +49,7 @@
 
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
+#include "common/pg_prng.h"
 #include "common/int.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
@@ -294,12 +295,16 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1;
 
+/* Is the current top-level query to be sampled? */
+static bool current_query_sampled = false;
 
 #define pgss_enabled(level) \
 	(!IsParallelWorker() && \
 	(pgss_track == PGSS_TRACK_ALL || \
-	(pgss_track == PGSS_TRACK_TOP && (level) == 0)))
+	(pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
+	current_query_sampled)
 
 #define record_gc_qtexts() \
 	do { \
@@ -414,6 +419,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to process.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -835,6 +853,8 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 	if (prev_post_parse_analyze_hook)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
 
+	current_query_sampled = (pg_prng_double(&pg_global_prng_state) < pgss_sample_rate);
+
 	/* Safety check... */
 	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level))
 		return;
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..1e2533a802 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,23 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      control a fraction of the statements in each session.  The default is 1,
+      meaning control all the queries. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

Greg Sabino Mullane

htamfids@gmail.com

about 1 year ago

In reply to: Andrey M. Borodin (#3)

Re: Sample rate added to pg_stat_statements

On Tue, Nov 19, 2024 at 7:12 AM Andrey M. Borodin <x4mmm@yandex-team.ru>
wrote:

+1 for the idea. I heard a lot of complaints about that pgss is costly.
Most of them were using it wrong though.

I'm curious what "using it wrong" means exactly?

Oh, and a +1 in general to the patch, OP, although it would also be nice to
start finding the bottlenecks that cause such performance issues.

Cheers,
Greg

Michael Paquier

michael@paquier.xyz

about 1 year ago

In reply to: Greg Sabino Mullane (#5)

Re: Sample rate added to pg_stat_statements

On Tue, Nov 19, 2024 at 09:39:21AM -0500, Greg Sabino Mullane wrote:

Oh, and a +1 in general to the patch, OP, although it would also be nice to
start finding the bottlenecks that cause such performance issues.

FWIW, I'm not eager to integrate this proposal without looking at this
exact argument in depth.

One piece of it would be to see how much of such "bottlenecks" we
would be able to get rid of by integrating pg_stat_statements into
the central pgstats with the custom APIs, without pushing the module
into core. This means that we would combine the existing hash of pgss
to shrink to 8 bytes for objid rather than 13 bytes now as the current
code relies on (toplevel, userid, queryid) for the entry lookup (entry
removal is sniped with these three values as well, or dshash seq
scans). The odds of conflicts still still play in our favor even if
we have a few million entries, or even ten times that.

This would also get rid of the pgss text file for the queries, which
is a source of one of the bottlenecks, as we could just store query
strings upper-bounded based on a postmaster GUC to control the size of
the entries in the pgstats dshash. More normalization for IN and ANY
clauses would also help a not here, these are a cause of a lot of
bloat.

This integration is not something I will be able to work on for the
PG18 dev cycle as I'm in full review/commit mode for the rest of this
release, but I got some plans for it in PG19 except if somebody beats
me to it.
--
Michael

Greg Sabino Mullane

htamfids@gmail.com

about 1 year ago

In reply to: Michael Paquier (#6)

Re: Sample rate added to pg_stat_statements

On Tue, Nov 19, 2024 at 5:07 PM Michael Paquier <michael@paquier.xyz> wrote:

One piece of it would be to see how much of such "bottlenecks" we
would be able to get rid of by integrating pg_stat_statements into
the central pgstats with the custom APIs, without pushing the module
into core.

Any particular reason these days we cannot push this into core and allow
disabling on startup? To say this extension is widely used would be an
understatement.

Cheers,
Greg

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

about 1 year ago

In reply to: Michael Paquier (#6)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

On 20.11.2024 01:07, Michael Paquier wrote:

On Tue, Nov 19, 2024 at 09:39:21AM -0500, Greg Sabino Mullane wrote:

Oh, and a +1 in general to the patch, OP, although it would also be nice to
start finding the bottlenecks that cause such performance issues.

FWIW, I'm not eager to integrate this proposal without looking at this
exact argument in depth.

One piece of it would be to see how much of such "bottlenecks" we
would be able to get rid of by integrating pg_stat_statements into
the central pgstats with the custom APIs, without pushing the module
into core. This means that we would combine the existing hash of pgss
to shrink to 8 bytes for objid rather than 13 bytes now as the current
code relies on (toplevel, userid, queryid) for the entry lookup (entry
removal is sniped with these three values as well, or dshash seq
scans). The odds of conflicts still still play in our favor even if
we have a few million entries, or even ten times that.

This would also get rid of the pgss text file for the queries, which
is a source of one of the bottlenecks, as we could just store query
strings upper-bounded based on a postmaster GUC to control the size of
the entries in the pgstats dshash. More normalization for IN and ANY
clauses would also help a not here, these are a cause of a lot of
bloat.

This integration is not something I will be able to work on for the
PG18 dev cycle as I'm in full review/commit mode for the rest of this
release, but I got some plans for it in PG19 except if somebody beats
me to it.
--
Michael

I agree. Your proposal can indeed improve performance. Currently, I am
working on these changes and will validate them with benchmarks. Once I
have concrete results, I will open new threads to facilitate further
discussion.

However, in my opinion, the suggested improvements are not enough, and
sampling is essential.

1. I agree with Greg that pgss is widely used. It's quite odd that
sampling exists in 'auto_explain' but not in pgss.

2. If performance issues arise even after these improvements and it
turns out that pgss is the cause, the only painless solution without
restarting the instance is sampling. The current pgss's parameters are
not optimal for achieving this.

BTW, I forgot to include a case of nested statements. Either all will be
tracked or none. I attached new version of patch.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v4-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v4-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchDownload

From 68f5451019b261bf03a222f5a05ac57cd0fb9a24 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Date: Thu, 21 Nov 2024 11:24:03 +0300
Subject: [PATCH] Allow setting sample ratio for pg_stat_statements

New configuration parameter pg_stat_statements.sample_ratio makes it
possible to track just a fraction of the queries meeting the configured
threshold, to reduce the amount of tracking.
---
 .../pg_stat_statements/pg_stat_statements.c   | 23 ++++++++++++++++++-
 doc/src/sgml/pgstatstatements.sgml            | 18 +++++++++++++++
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 49c657b3e0..d06e0d8a44 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -49,6 +49,7 @@
 
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
+#include "common/pg_prng.h"
 #include "common/int.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
@@ -294,12 +295,16 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1;
 
+/* Is the current top-level query to be sampled? */
+static bool current_query_sampled = false;
 
 #define pgss_enabled(level) \
 	(!IsParallelWorker() && \
 	(pgss_track == PGSS_TRACK_ALL || \
-	(pgss_track == PGSS_TRACK_TOP && (level) == 0)))
+	(pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
+	current_query_sampled)
 
 #define record_gc_qtexts() \
 	do { \
@@ -414,6 +419,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to process.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -835,6 +853,9 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 	if (prev_post_parse_analyze_hook)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
 
+	if (nesting_level == 0)
+		current_query_sampled = (pg_prng_double(&pg_global_prng_state) < pgss_sample_rate);
+
 	/* Safety check... */
 	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level))
 		return;
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..d06349d097 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,24 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      track a fraction of the statements in each session.  The default is 1,
+      meaning track all the queries. In case of nested statements,
+      either all will be tracked or none. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

Andrey M. Borodin

x4mmm@yandex-team.ru

about 1 year ago

In reply to: Greg Sabino Mullane (#5)

Re: Sample rate added to pg_stat_statements

On 19 Nov 2024, at 17:39, Greg Sabino Mullane <htamfids@gmail.com> wrote:

I'm curious what "using it wrong" means exactly?

Here's an example. pgSCV is querying pgss for every database separately every minute. It makes sense for the project. But when you have ~1000 databases, you have a lot of traffic to pgss alone. Even if you have only one active database, because all pgss results must be put into tuplestore before filtering. See [0]/messages/by-id/1AEEB240-9B68-44D5-8A29-8F9FDB22C801@yandex-team.ru for details.

Best regards, Andrey Borodin.

[0]: /messages/by-id/1AEEB240-9B68-44D5-8A29-8F9FDB22C801@yandex-team.ru

#10

Alexander Korotkov

aekorotkov@gmail.com

about 1 year ago

In reply to: Michael Paquier (#6)

Re: Sample rate added to pg_stat_statements

On Wed, Nov 20, 2024 at 12:07 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Nov 19, 2024 at 09:39:21AM -0500, Greg Sabino Mullane wrote:

Oh, and a +1 in general to the patch, OP, although it would also be nice to
start finding the bottlenecks that cause such performance issues.

FWIW, I'm not eager to integrate this proposal without looking at this
exact argument in depth.

One piece of it would be to see how much of such "bottlenecks" we
would be able to get rid of by integrating pg_stat_statements into
the central pgstats with the custom APIs, without pushing the module
into core. This means that we would combine the existing hash of pgss
to shrink to 8 bytes for objid rather than 13 bytes now as the current
code relies on (toplevel, userid, queryid) for the entry lookup (entry
removal is sniped with these three values as well, or dshash seq
scans). The odds of conflicts still still play in our favor even if
we have a few million entries, or even ten times that.

If you run "pgbench -S -M prepared" on a pretty large machine with
high concurrency, then spin lock in pgss_store() could become pretty
much of a bottleneck. And I'm not sure switching all counters to
atomics could somehow improve the situation given we already have
pretty many counters.

I'm generally +1 for the approach taken in this thread. But I would
suggest introducing a threshold value for a query execution time, and
sample just everything below that threshold. The slower query
shouldn't be sampled, because it can't be too frequent, and also it
could be more valuable to be counter individually (while very fast
queries probably only matter "in average").

------
Regards,
Alexander Korotkov
Supabase

#11

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

about 1 year ago

In reply to: Alexander Korotkov (#10)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

On 22.11.2024 09:08, Alexander Korotkov wrote:

On Wed, Nov 20, 2024 at 12:07 AM Michael Paquier<michael@paquier.xyz> wrote:

On Tue, Nov 19, 2024 at 09:39:21AM -0500, Greg Sabino Mullane wrote:

Oh, and a +1 in general to the patch, OP, although it would also be nice to
start finding the bottlenecks that cause such performance issues.

FWIW, I'm not eager to integrate this proposal without looking at this
exact argument in depth.

One piece of it would be to see how much of such "bottlenecks" we
would be able to get rid of by integrating pg_stat_statements into
the central pgstats with the custom APIs, without pushing the module
into core. This means that we would combine the existing hash of pgss
to shrink to 8 bytes for objid rather than 13 bytes now as the current
code relies on (toplevel, userid, queryid) for the entry lookup (entry
removal is sniped with these three values as well, or dshash seq
scans). The odds of conflicts still still play in our favor even if
we have a few million entries, or even ten times that.

If you run "pgbench -S -M prepared" on a pretty large machine with
high concurrency, then spin lock in pgss_store() could become pretty
much of a bottleneck. And I'm not sure switching all counters to
atomics could somehow improve the situation given we already have
pretty many counters.

I'm generally +1 for the approach taken in this thread. But I would
suggest introducing a threshold value for a query execution time, and
sample just everything below that threshold. The slower query
shouldn't be sampled, because it can't be too frequent, and also it
could be more valuable to be counter individually (while very fast
queries probably only matter "in average").

------
Regards,
Alexander Korotkov
Supabase

I really liked your idea, and I’d like to propose an enhancement that I
believe improves it further.

Yes, if a query’s execution time exceeds the threshold, it should always
be tracked without sampling. However, for queries with execution times
below the threshold, the sampling logic should prioritize shorter
queries over those closer to the threshold. In my view, the ideal
approach is for shorter queries to have the highest probability of being
sampled, while queries closer to the threshold are less likely to be
sampled.

This behavior can be achieved with the following logic:

pg_stat_statements.sample_exectime_threshold * random(0, 1) < total_time

Here’s how it works:

* As a query’s execution time approaches zero, the probability of it
being sampled approaches one.
* Conversely, as a query’s execution time approaches the threshold,
the probability of it being sampled approaches zero.

In other words, the sampling probability decreases linearly from 1 to 0
as the execution time gets closer to the threshold.

I believe this approach offers an ideal user experience. I have attached
a new patch implementing this logic. Please let me know if you have any
feedback regarding the comments in the code, the naming of variables or
documentation. I’m always open to discussion.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v5-0001-Add-time-based-sampling-to-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v5-0001-Add-time-based-sampling-to-pg_stat_statements.patchDownload

From edba1bcefa85e995ec3e9df6c9e8d30adcd940b9 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Date: Tue, 26 Nov 2024 00:58:24 +0300
Subject: [PATCH] Add time-based sampling to pg_stat_statements

New configuration parameter pg_stat_statements.sample_exectime_threshold,
which allows tracking only queries that meet a specified execution time threshold.
Queries with execution times above the threshold are always tracked,
while shorter queries are sampled probabilistically.
This helps reduce the overhead of tracking frequent, fast queries
while preserving data for longer-running ones.
---
 .../pg_stat_statements/pg_stat_statements.c   | 57 ++++++++++++++-----
 doc/src/sgml/pgstatstatements.sgml            | 20 +++++++
 2 files changed, 64 insertions(+), 13 deletions(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 49c657b3e0..6ec1e3e7ed 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -49,6 +49,7 @@
 
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
+#include "common/pg_prng.h"
 #include "common/int.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
@@ -289,6 +290,7 @@ static const struct config_enum_entry track_options[] =
 };
 
 static int	pgss_max = 5000;	/* max # statements to track */
+static int	pgss_sample_exectime_threshold = 0; /* Threshold for query execution sampling (msec) */
 static int	pgss_track = PGSS_TRACK_TOP;	/* tracking level */
 static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
@@ -414,6 +416,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomIntVariable("pg_stat_statements.sample_exectime_threshold",
+							"Sets the threshold (in msec) for query execution time sampling.",
+							NULL,
+							&pgss_sample_exectime_threshold,
+							0,
+							0,
+							INT_MAX / 2,
+							PGC_SUSET,
+							0,
+							NULL,
+							NULL,
+							NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -1071,25 +1086,41 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 	if (queryId != UINT64CONST(0) && queryDesc->totaltime &&
 		pgss_enabled(nesting_level))
 	{
+		double total_time = 0.0;
 		/*
 		 * Make sure stats accumulation is done.  (Note: it's okay if several
 		 * levels of hook all do this.)
 		 */
 		InstrEndLoop(queryDesc->totaltime);
 
-		pgss_store(queryDesc->sourceText,
-				   queryId,
-				   queryDesc->plannedstmt->stmt_location,
-				   queryDesc->plannedstmt->stmt_len,
-				   PGSS_EXEC,
-				   queryDesc->totaltime->total * 1000.0,	/* convert to msec */
-				   queryDesc->estate->es_total_processed,
-				   &queryDesc->totaltime->bufusage,
-				   &queryDesc->totaltime->walusage,
-				   queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
-				   NULL,
-				   queryDesc->estate->es_parallel_workers_to_launch,
-				   queryDesc->estate->es_parallel_workers_launched);
+		/* convert to msec */
+		total_time = queryDesc->totaltime->total * 1000.0;
+
+		/*
+		* Sampling is implemented by comparing the execution time of the query
+		* to a random threshold derived from the configured sample_execution_time.
+		* This method ensures proportional sampling: queries with execution times
+		* closer to the threshold are more likely to be trackeded, while very short
+		* queries are more likely to be sampled.
+		*/
+		if (total_time >= pgss_sample_exectime_threshold ||
+		   (total_time < pgss_sample_exectime_threshold &&
+		   (pgss_sample_exectime_threshold * pg_prng_double(&pg_global_prng_state) < total_time)))
+		{
+			pgss_store(queryDesc->sourceText,
+				   queryId,
+				   queryDesc->plannedstmt->stmt_location,
+				   queryDesc->plannedstmt->stmt_len,
+				   PGSS_EXEC,
+				   total_time,
+				   queryDesc->estate->es_total_processed,
+				   &queryDesc->totaltime->bufusage,
+				   &queryDesc->totaltime->walusage,
+				   queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
+				   NULL,
+				   queryDesc->estate->es_parallel_workers_to_launch,
+				   queryDesc->estate->es_parallel_workers_launched);
+		}
 	}
 
 	if (prev_ExecutorEnd)
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..7b998a77f3 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -872,6 +872,26 @@
     </listitem>
    </varlistentry>
 
+  <variablelist>
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_exectime_threshold</varname> (<type>integer</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_exectime_threshold</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_exectime_threshold</varname> is the threshold (in msec)
+      for query execution time sampling. Queries with execution times above this threshold are
+      are never sampled. Execution time below threshold are sampled to reduce overhead from frequent short queries.
+      The default value is 0.
+      Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.track</varname> (<type>enum</type>)
-- 
2.34.1

#12

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

about 1 year ago

In reply to: Ilia Evdokimov (#11)

Re: Sample rate added to pg_stat_statements

On 26.11.2024 01:15, Ilia Evdokimov wrote:

On 22.11.2024 09:08, Alexander Korotkov wrote:

On Wed, Nov 20, 2024 at 12:07 AM Michael Paquier<michael@paquier.xyz> wrote:

On Tue, Nov 19, 2024 at 09:39:21AM -0500, Greg Sabino Mullane wrote:

Oh, and a +1 in general to the patch, OP, although it would also be nice to
start finding the bottlenecks that cause such performance issues.

FWIW, I'm not eager to integrate this proposal without looking at this
exact argument in depth.

One piece of it would be to see how much of such "bottlenecks" we
would be able to get rid of by integrating pg_stat_statements into
the central pgstats with the custom APIs, without pushing the module
into core. This means that we would combine the existing hash of pgss
to shrink to 8 bytes for objid rather than 13 bytes now as the current
code relies on (toplevel, userid, queryid) for the entry lookup (entry
removal is sniped with these three values as well, or dshash seq
scans). The odds of conflicts still still play in our favor even if
we have a few million entries, or even ten times that.

If you run "pgbench -S -M prepared" on a pretty large machine with
high concurrency, then spin lock in pgss_store() could become pretty
much of a bottleneck. And I'm not sure switching all counters to
atomics could somehow improve the situation given we already have
pretty many counters.

I'm generally +1 for the approach taken in this thread. But I would
suggest introducing a threshold value for a query execution time, and
sample just everything below that threshold. The slower query
shouldn't be sampled, because it can't be too frequent, and also it
could be more valuable to be counter individually (while very fast
queries probably only matter "in average").

------
Regards,
Alexander Korotkov
Supabase

I really liked your idea, and I’d like to propose an enhancement that
I believe improves it further.

Yes, if a query’s execution time exceeds the threshold, it should
always be tracked without sampling. However, for queries with
execution times below the threshold, the sampling logic should
prioritize shorter queries over those closer to the threshold. In my
view, the ideal approach is for shorter queries to have the highest
probability of being sampled, while queries closer to the threshold
are less likely to be sampled.

This behavior can be achieved with the following logic:

pg_stat_statements.sample_exectime_threshold * random(0, 1) < total_time

Here’s how it works:

* As a query’s execution time approaches zero, the probability of it
being sampled approaches one.
* Conversely, as a query’s execution time approaches the threshold,
the probability of it being sampled approaches zero.

In other words, the sampling probability decreases linearly from 1 to
0 as the execution time gets closer to the threshold.

I believe this approach offers an ideal user experience. I have
attached a new patch implementing this logic. Please let me know if
you have any feedback regarding the comments in the code, the naming
of variables or documentation. I’m always open to discussion.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

I’ve been closely reviewing my last (v5-*.patch) patch on implementing
time-based sampling, and I’ve realized that it might not be the best
approach. Let me explain the reasons.

* We can only perform sampling before the 'pgss_planner()' function.
However, at that point, we don’t yet know the query's execution time
since it only becomes available during 'pgss_ExecutorEnd()' or
'pgss_ProcessUtility()';
* If we wait to sample until the execution completes and we have the
actual execution time, this introduces a problem. By that point, we
might have already recorded the query's statistics into shared
memory from the 'pgss_planner()' making it too late to decide
whether to sample the query;
* Delaying sampling until the execution finishes would require waiting
for the execution time, which could introduce performance overhead.
This defeats the purpose of sampling, which aims to reduce the cost
of tracking query.

I believe we should reconsider the approach and revert to sampling based
on v4-*.patch. If I’ve missed anything or there’s an alternative way to
implement time threshold-based sampling efficiently, I’d be grateful to
hear your insights.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#13

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

about 1 year ago

In reply to: Ilia Evdokimov (#12)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

Hi everyone,

I attached previous sampling patch for pg_stat_statements (v4).
Suggestions like sampling based on execution time remain unfeasible, as
pg_stat_statements can track query during query planning, not execution.

To evaluate the implementation, I ran a benchmark creating 1,000 random
tables and executing randomized JOIN queries on a small machine. When
pg_stat_statements enabled, performance decreases, but reducing the
sampling rate helps mitigate the impact and improves performance.

I’d be interested in hearing your new thoughts. Are there areas where
this patch could be improved, or other ideas worth exploring?

--
Best regards.
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v4-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v4-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchDownload

From 68f5451019b261bf03a222f5a05ac57cd0fb9a24 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Date: Thu, 21 Nov 2024 11:24:03 +0300
Subject: [PATCH] Allow setting sample ratio for pg_stat_statements

New configuration parameter pg_stat_statements.sample_ratio makes it
possible to track just a fraction of the queries meeting the configured
threshold, to reduce the amount of tracking.
---
 .../pg_stat_statements/pg_stat_statements.c   | 23 ++++++++++++++++++-
 doc/src/sgml/pgstatstatements.sgml            | 18 +++++++++++++++
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 49c657b3e0..d06e0d8a44 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -49,6 +49,7 @@
 
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
+#include "common/pg_prng.h"
 #include "common/int.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
@@ -294,12 +295,16 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1;
 
+/* Is the current top-level query to be sampled? */
+static bool current_query_sampled = false;
 
 #define pgss_enabled(level) \
 	(!IsParallelWorker() && \
 	(pgss_track == PGSS_TRACK_ALL || \
-	(pgss_track == PGSS_TRACK_TOP && (level) == 0)))
+	(pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
+	current_query_sampled)
 
 #define record_gc_qtexts() \
 	do { \
@@ -414,6 +419,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to process.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -835,6 +853,9 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 	if (prev_post_parse_analyze_hook)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
 
+	if (nesting_level == 0)
+		current_query_sampled = (pg_prng_double(&pg_global_prng_state) < pgss_sample_rate);
+
 	/* Safety check... */
 	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level))
 		return;
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..d06349d097 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,24 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      track a fraction of the statements in each session.  The default is 1,
+      meaning track all the queries. In case of nested statements,
+      either all will be tracked or none. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

#14

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

about 1 year ago

In reply to: Ilia Evdokimov (#13)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

On 10.12.2024 17:35, Ilia Evdokimov wrote:

Hi everyone,

I attached previous sampling patch for pg_stat_statements (v4).
Suggestions like sampling based on execution time remain unfeasible,
as pg_stat_statements can track query during query planning, not
execution.

To evaluate the implementation, I ran a benchmark creating 1,000
random tables and executing randomized JOIN queries on a small
machine. When pg_stat_statements enabled, performance decreases, but
reducing the sampling rate helps mitigate the impact and improves
performance.

I’d be interested in hearing your new thoughts. Are there areas where
this patch could be improved, or other ideas worth exploring?

--
Best regards.
Ilia Evdokimov,
Tantor Labs LLC.

I've fixed some small typos in the documentation and in the GUC
description in the attached patch. Any suggestions for improvements?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v6-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v6-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchDownload

From e0955b12e6a71efd6fc5d406e3d93bda4ec57bbf Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Date: Mon, 6 Jan 2025 13:38:22 +0300
Subject: [PATCH] Allow setting sample ratio for pg_stat_statements

New configuration parameter pg_stat_statements.sample_ratio makes it
possible to track just a fraction of the queries meeting the configured
threshold, to reduce the amount of tracking.
---
 .../pg_stat_statements/pg_stat_statements.c   | 23 ++++++++++++++++++-
 doc/src/sgml/pgstatstatements.sgml            | 18 +++++++++++++++
 2 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index bebf8134eb..a8e91f9469 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,6 +50,7 @@
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -294,12 +295,16 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1;
 
+/* Is the current top-level query to be sampled? */
+static bool current_query_sampled = false;
 
 #define pgss_enabled(level) \
 	(!IsParallelWorker() && \
 	(pgss_track == PGSS_TRACK_ALL || \
-	(pgss_track == PGSS_TRACK_TOP && (level) == 0)))
+	(pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
+	current_query_sampled)
 
 #define record_gc_qtexts() \
 	do { \
@@ -414,6 +419,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to track.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -835,6 +853,9 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 	if (prev_post_parse_analyze_hook)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
 
+	if (nesting_level == 0)
+		current_query_sampled = (pg_prng_double(&pg_global_prng_state) < pgss_sample_rate);
+
 	/* Safety check... */
 	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level))
 		return;
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..37f6788798 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,24 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      track a fraction of the statements in each session. The default is 1,
+      meaning track all the queries. In case of nested statements,
+      either all will be tracked or none. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

#15

Andrey M. Borodin

x4mmm@yandex-team.ru

about 1 year ago

In reply to: Ilia Evdokimov (#14)

Re: Sample rate added to pg_stat_statements

On 6 Jan 2025, at 15:50, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:

Any suggestions for improvements?

The patch looks good to me, just a few nits.

0. Perhaps, we could have a test for edge values of 0 and 1. I do not insist, just an idea to think about.

1. This code seems a little different from your patch. It is trying to avoid engaging PRNG. I'm not sure it's a good idea, but still. Also, it uses "<=", not "<".

xact_is_sampled = log_xact_sample_rate != 0 &&
(log_xact_sample_rate == 1 ||
pg_prng_double(&pg_global_prng_state) <= log_xact_sample_rate);

Thanks!

Best regards, Andrey Borodin.

#16

Sami Imseih

samimseih@gmail.com

about 1 year ago

In reply to: Andrey M. Borodin (#15)

Re: Sample rate added to pg_stat_statements

Hi,

I was looking at this patch, and I was specifically curious about
how this works with prepared statements. The way the patch
works now is that it determines if the statement is to be sampled
at post_parse_analyze time which could lead to unexpected
behavior.

Let's take an example below in which the
pg_stat_statements.sample_rate is set to 0 ( to mimic
some sampling rate < 1 in which this query does not
get sampled ). At that point, all subsequent executions
of the statement will not get tracked at all. Is this
what is expected for prepared statements? My concern
is we will even lose more stats than what a user
may expect.

This of course will not be an issue for simple query.

postgres=# set pg_stat_statements.sample_rate = 0;
SET
postgres=# select pg_stat_statements_reset();
pg_stat_statements_reset
-------------------------------
2025-01-06 11:45:23.484793-06
(1 row)

postgres=# SELECT $1 \parse stmt

postgres=#
postgres=# \bind_named stmt 1 \g
?column?
----------
1
(1 row)

postgres=# \bind_named stmt 1 \g
?column?
----------
1
(1 row)

postgres=# set pg_stat_statements.sample_rate = 1;
SET
postgres=# \bind_named stmt 1 \g
?column?
----------
1
(1 row)

postgres=# \bind_named stmt 1 \g
?column?
----------
1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
query | calls
-------+-------
(0 rows)

Regards,

Sami Imseih
Amazon Web Services (AWS)

#17

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

about 1 year ago

In reply to: Sami Imseih (#16)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

On 06.01.2025 18:57, Andrey M. Borodin wrote:

On 6 Jan 2025, at 15:50, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:

Any suggestions for improvements?

The patch looks good to me, just a few nits.

0. Perhaps, we could have a test for edge values of 0 and 1. I do not insist, just an idea to think about.

I would add tests, because they are never useless. I've added a simple
test which verifies hash table with queries after setting sample_rate =
0.0 and sample_rate = 1.0.

1. This code seems a little different from your patch. It is trying to avoid engaging PRNG. I'm not sure it's a good idea, but still. Also, it uses "<=", not "<".

xact_is_sampled = log_xact_sample_rate != 0 &&
(log_xact_sample_rate == 1 ||
pg_prng_double(&pg_global_prng_state) <= log_xact_sample_rate);

Thanks!

Best regards, Andrey Borodin.

Are we sure we're discussing the same patch? Because these remarks refer
to the 5 version of the patch, which I abandoned due to your remarks.

On 06.01.2025 20:51, Sami Imseih wrote:

Hi,

I was looking at this patch, and I was specifically curious about
how this works with prepared statements. The way the patch
works now is that it determines if the statement is to be sampled
at post_parse_analyze time which could lead to unexpected
behavior.

Let's take an example below in which the
pg_stat_statements.sample_rate is set to 0 ( to mimic
some sampling rate < 1 in which this query does not
get sampled ). At that point, all subsequent executions
of the statement will not get tracked at all. Is this
what is expected for prepared statements? My concern
is we will even lose more stats than what a user
may expect.

This of course will not be an issue for simple query.

postgres=# set pg_stat_statements.sample_rate = 0;
SET
postgres=# select pg_stat_statements_reset();
pg_stat_statements_reset
-------------------------------
2025-01-06 11:45:23.484793-06
(1 row)

postgres=# SELECT $1 \parse stmt

postgres=#
postgres=# \bind_named stmt 1 \g
?column?
----------
1
(1 row)

postgres=# \bind_named stmt 1 \g
?column?
----------
1
(1 row)

postgres=# set pg_stat_statements.sample_rate = 1;
SET
postgres=# \bind_named stmt 1 \g
?column?
----------
1
(1 row)

postgres=# \bind_named stmt 1 \g
?column?
----------
1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
query | calls
-------+-------
(0 rows)

Regards,

Sami Imseih
Amazon Web Services (AWS)

You are right. This is absolutely unexpected for users. Thank you for
the review.

To fix this, I suggest storing a random number in the [0, 1) range in a
separate variable, which will be used for comparisons in any place. We
will compare 'sample_rate' and random value not only in
pgss_post_parse_analyze(), but also in pgss_ProcessUtility() and
pgss_planner().

I attached patch with test and fixes.

If you have any objections or suggestions on how to improve it, I'm
always open to feedback.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v7-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v7-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchDownload

From 388c12632610cc6abddd6fc750a1b15712d62e63 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Date: Tue, 7 Jan 2025 14:02:32 +0300
Subject: [PATCH] Allow setting sample ratio for pg_stat_statements

New configuration parameter pg_stat_statements.sample_ratio makes it
possible to track just a fraction of the queries meeting the configured
threshold, to reduce the amount of tracking.
---
 .../pg_stat_statements/expected/select.out    | 37 ++++++++++++++++++
 .../pg_stat_statements/pg_stat_statements.c   | 38 ++++++++++++++++++-
 contrib/pg_stat_statements/sql/select.sql     | 11 ++++++
 doc/src/sgml/pgstatstatements.sgml            | 18 +++++++++
 4 files changed, 102 insertions(+), 2 deletions(-)

diff --git a/contrib/pg_stat_statements/expected/select.out b/contrib/pg_stat_statements/expected/select.out
index 37a30af034..724ec9c8ac 100644
--- a/contrib/pg_stat_statements/expected/select.out
+++ b/contrib/pg_stat_statements/expected/select.out
@@ -153,6 +153,43 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
  t
 (1 row)
 
+--
+-- sample statements
+--
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements;
+                       query                        | calls 
+----------------------------------------------------+-------
+ SELECT pg_stat_statements_reset() IS NOT NULL AS t |     1
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements;
+                       query                        | calls 
+----------------------------------------------------+-------
+ SELECT $1 AS "int"                                 |     1
+ SELECT pg_stat_statements_reset() IS NOT NULL AS t |     1
+ SELECT query, calls FROM pg_stat_statements        |     0
+(3 rows)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
 --
 -- queries with locking clauses
 --
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index bebf8134eb..4bab2a74ea 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,6 +50,7 @@
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -255,6 +256,7 @@ typedef struct pgssSharedState
 
 /* Current nesting depth of planner/ExecutorRun/ProcessUtility calls */
 static int	nesting_level = 0;
+static double	random_sample = 0.0;
 
 /* Saved hook values */
 static shmem_request_hook_type prev_shmem_request_hook = NULL;
@@ -294,12 +296,17 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1;
+
+/* Is the current top-level query to be sampled? */
+static bool current_query_sampled = false;
 
 
 #define pgss_enabled(level) \
 	(!IsParallelWorker() && \
 	(pgss_track == PGSS_TRACK_ALL || \
-	(pgss_track == PGSS_TRACK_TOP && (level) == 0)))
+	(pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
+	current_query_sampled)
 
 #define record_gc_qtexts() \
 	do { \
@@ -414,6 +421,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to track.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -835,6 +855,11 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 	if (prev_post_parse_analyze_hook)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
 
+	random_sample = pg_prng_double(&pg_global_prng_state);
+
+	if (nesting_level == 0)
+		current_query_sampled = random_sample < pgss_sample_rate;
+
 	/* Safety check... */
 	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level))
 		return;
@@ -889,6 +914,9 @@ pgss_planner(Query *parse,
 {
 	PlannedStmt *result;
 
+	if (nesting_level == 0)
+		current_query_sampled = random_sample < pgss_sample_rate;
+
 	/*
 	 * We can't process the query if no query_string is provided, as
 	 * pgss_store needs it.  We also ignore query without queryid, as it would
@@ -1111,7 +1139,13 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	uint64		saved_queryId = pstmt->queryId;
 	int			saved_stmt_location = pstmt->stmt_location;
 	int			saved_stmt_len = pstmt->stmt_len;
-	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level);
+	bool		enabled;
+
+	if (nesting_level == 0)
+		current_query_sampled = random_sample < pgss_sample_rate;
+
+	/* current_query_sampled is in pgss_enabled */
+	enabled = pgss_track_utility && pgss_enabled(nesting_level);
 
 	/*
 	 * Force utility statements to get queryId zero.  We do this even in cases
diff --git a/contrib/pg_stat_statements/sql/select.sql b/contrib/pg_stat_statements/sql/select.sql
index e0be58d5e2..0a881c3cbf 100644
--- a/contrib/pg_stat_statements/sql/select.sql
+++ b/contrib/pg_stat_statements/sql/select.sql
@@ -59,6 +59,17 @@ DEALLOCATE pgss_test;
 SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
 SELECT pg_stat_statements_reset() IS NOT NULL AS t;
 
+--
+-- sample statements
+--
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements;
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
 --
 -- queries with locking clauses
 --
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..8ade4e3ced 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,24 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      track a fraction of the statements in each session. The default is 1,
+      meaning track all the queries. In case of nested statements,
+      either all will be tracked or none. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

#18

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

about 1 year ago

In reply to: Ilia Evdokimov (#17)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

I completely forgot about ordering pg_stat_statements in the test, which
is why the test failed in [0]https://cirrus-ci.com/task/5724458477944832. I've added ORDER BY query COLLATE "C" to
avoid any non-deterministic ordering in the table.

[0]: https://cirrus-ci.com/task/5724458477944832

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v8-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v8-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchDownload

From be8a69ae36d0a20c945ff12bb385d634ce50a84e Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Date: Tue, 7 Jan 2025 16:43:34 +0300
Subject: [PATCH] Allow setting sample ratio for pg_stat_statements

New configuration parameter pg_stat_statements.sample_ratio makes it
possible to track just a fraction of the queries meeting the configured
threshold, to reduce the amount of tracking.
---
 .../pg_stat_statements/expected/select.out    | 37 ++++++++++++++++++
 .../pg_stat_statements/pg_stat_statements.c   | 38 ++++++++++++++++++-
 contrib/pg_stat_statements/sql/select.sql     | 11 ++++++
 doc/src/sgml/pgstatstatements.sgml            | 18 +++++++++
 4 files changed, 102 insertions(+), 2 deletions(-)

diff --git a/contrib/pg_stat_statements/expected/select.out b/contrib/pg_stat_statements/expected/select.out
index 37a30af034..8485a7c451 100644
--- a/contrib/pg_stat_statements/expected/select.out
+++ b/contrib/pg_stat_statements/expected/select.out
@@ -153,6 +153,43 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
  t
 (1 row)
 
+--
+-- sample statements
+--
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                       query                        | calls 
+----------------------------------------------------+-------
+ SELECT pg_stat_statements_reset() IS NOT NULL AS t |     1
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                                 query                                  | calls 
+------------------------------------------------------------------------+-------
+ SELECT $1 AS "int"                                                     |     1
+ SELECT pg_stat_statements_reset() IS NOT NULL AS t                     |     1
+ SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" |     0
+(3 rows)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
 --
 -- queries with locking clauses
 --
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index bebf8134eb..4bab2a74ea 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,6 +50,7 @@
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -255,6 +256,7 @@ typedef struct pgssSharedState
 
 /* Current nesting depth of planner/ExecutorRun/ProcessUtility calls */
 static int	nesting_level = 0;
+static double	random_sample = 0.0;
 
 /* Saved hook values */
 static shmem_request_hook_type prev_shmem_request_hook = NULL;
@@ -294,12 +296,17 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1;
+
+/* Is the current top-level query to be sampled? */
+static bool current_query_sampled = false;
 
 
 #define pgss_enabled(level) \
 	(!IsParallelWorker() && \
 	(pgss_track == PGSS_TRACK_ALL || \
-	(pgss_track == PGSS_TRACK_TOP && (level) == 0)))
+	(pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
+	current_query_sampled)
 
 #define record_gc_qtexts() \
 	do { \
@@ -414,6 +421,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to track.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -835,6 +855,11 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 	if (prev_post_parse_analyze_hook)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
 
+	random_sample = pg_prng_double(&pg_global_prng_state);
+
+	if (nesting_level == 0)
+		current_query_sampled = random_sample < pgss_sample_rate;
+
 	/* Safety check... */
 	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level))
 		return;
@@ -889,6 +914,9 @@ pgss_planner(Query *parse,
 {
 	PlannedStmt *result;
 
+	if (nesting_level == 0)
+		current_query_sampled = random_sample < pgss_sample_rate;
+
 	/*
 	 * We can't process the query if no query_string is provided, as
 	 * pgss_store needs it.  We also ignore query without queryid, as it would
@@ -1111,7 +1139,13 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	uint64		saved_queryId = pstmt->queryId;
 	int			saved_stmt_location = pstmt->stmt_location;
 	int			saved_stmt_len = pstmt->stmt_len;
-	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level);
+	bool		enabled;
+
+	if (nesting_level == 0)
+		current_query_sampled = random_sample < pgss_sample_rate;
+
+	/* current_query_sampled is in pgss_enabled */
+	enabled = pgss_track_utility && pgss_enabled(nesting_level);
 
 	/*
 	 * Force utility statements to get queryId zero.  We do this even in cases
diff --git a/contrib/pg_stat_statements/sql/select.sql b/contrib/pg_stat_statements/sql/select.sql
index e0be58d5e2..def00d515d 100644
--- a/contrib/pg_stat_statements/sql/select.sql
+++ b/contrib/pg_stat_statements/sql/select.sql
@@ -59,6 +59,17 @@ DEALLOCATE pgss_test;
 SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
 SELECT pg_stat_statements_reset() IS NOT NULL AS t;
 
+--
+-- sample statements
+--
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
 --
 -- queries with locking clauses
 --
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..8ade4e3ced 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,24 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      track a fraction of the statements in each session. The default is 1,
+      meaning track all the queries. In case of nested statements,
+      either all will be tracked or none. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

#19

Andrey M. Borodin

x4mmm@yandex-team.ru

about 1 year ago

In reply to: Ilia Evdokimov (#17)

Re: Sample rate added to pg_stat_statements

On 7 Jan 2025, at 16:05, Ilia Evdokimov <ilya.evdokimov@tantorlabs.com> wrote:

1. This code seems a little different from your patch. It is trying to avoid engaging PRNG. I'm not sure it's a good idea, but still. Also, it uses "<=", not "<".

xact_is_sampled = log_xact_sample_rate != 0 &&
(log_xact_sample_rate == 1 ||
pg_prng_double(&pg_global_prng_state) <= log_xact_sample_rate);

Are we sure we're discussing the same patch? Because these remarks refer to the 5 version of the patch, which I abandoned due to your remarks.

Yes. v6 has this code
+	if (nesting_level == 0)
+		current_query_sampled = (pg_prng_double(&pg_global_prng_state) < pgss_sample_rate);

while upstream has code that I cited. And logic is slightly different.

Best regards, Andrey Borodin.

#20

Sami Imseih

samimseih@gmail.com

about 1 year ago

In reply to: Andrey M. Borodin (#19)

Re: Sample rate added to pg_stat_statements

You are right. This is absolutely unexpected for users. Thank you for
the review.

To fix this, I suggest storing a random number in the [0, 1) range in a
separate variable, which will be used for comparisons in any place. We
will compare 'sample_rate' and random value not only in
pgss_post_parse_analyze(), but also in pgss_ProcessUtility() and
pgss_planner().

I attached patch with test and fixes.

I still think this is not the correct approach. It seems that post_parse_analyze
should not have to deal with checking for a sample rate. This is because
post_parse_analyze, which is the only hook with access to jstate, is
only responsible
for storing a normalized query text on disk and creating a not-yet
user visible entry
in the hash. i.e. pgss_store will never increment counters when called from
pgss_post_parse_analyze.

/* Increment the counts, except when jstate is not NULL */
if (!jstate)
{

What I think may be a better idea is to add something similar
to auto_explain.c, but it should only be added to the top of
pgss_planner, pgss_ExecutorStart and pgss_ProcessUtility.

if (nesting_level == 0)
{
if (!IsParallelWorker())
current_query_sampled = pg_prng_double(&pg_global_prng_state) <
pgss_sample_rate;
else
current_query_sampled = false;
}

This is needed for ExecutorStart and not ExecutorEnd because
ExecutorStart is where the instrumentation is initialized with
queryDesc->totaltime. The above code block could be
turned into a macro and reused in the routines mentioned.

However, it seems with this change, we can see a much
higher likelihood of non-normalized query texts being stored.
This is because jstate is only available during post_parse_analyze.
Therefore, if the first time you are sampling the statement is in ExecutorEnd,
then you will end up storing a non-normalized version of the query text,
see below example with the attached v8.

postgres=# set pg_stat_statements.sample_rate = 0;
SET
postgres=# select pg_stat_statements_reset();
pg_stat_statements_reset
-------------------------------
2025-01-07 13:02:11.807952-06
(1 row)

postgres=# SELECT 1 \parse stmt

postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
query | calls
-------+-------
(0 rows)

postgres=# set pg_stat_statements.sample_rate = 1;
SET
postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
query | calls
-------+-------
(0 rows)

postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
query | calls
---------------------------------------------+-------
SELECT 1 | 1
SELECT query, calls FROM pg_stat_statements | 1
(2 rows)

postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
query | calls
---------------------------------------------+-------
SELECT 1 | 2
SELECT query, calls FROM pg_stat_statements | 2
(2 rows)

One idea is to make jstate available to all hooks, and completely
remove reliance on post_parse_analyze in pg_stat_statements.
I can't find the thread now, but I know this has come up in past discussions
when troubleshooting gaps in query normalization. My concern is this
feature will greatly increase the likelihood of non-normalized query texts.

Also, with regards to the benchmarks, it seems
sampling will be beneficial for workloads that are touching a small number
of entries with high concurrency. This is why we see benefit for
a standard pgbench workload.
Samping becomes less beneficial when there is a large set of queries
being updated.
I still think this is a good approach for workloads that touch a small set
of entries.

Regards,

Sami

#21

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

about 1 year ago

In reply to: Sami Imseih (#20)

Re: Sample rate added to pg_stat_statements

On 07.01.2025 22:29, Sami Imseih wrote:

You are right. This is absolutely unexpected for users. Thank you for
the review.

To fix this, I suggest storing a random number in the [0, 1) range in a
separate variable, which will be used for comparisons in any place. We
will compare 'sample_rate' and random value not only in
pgss_post_parse_analyze(), but also in pgss_ProcessUtility() and
pgss_planner().

I attached patch with test and fixes.

I still think this is not the correct approach. It seems that post_parse_analyze
should not have to deal with checking for a sample rate. This is because
post_parse_analyze, which is the only hook with access to jstate, is
only responsible
for storing a normalized query text on disk and creating a not-yet
user visible entry
in the hash. i.e. pgss_store will never increment counters when called from
pgss_post_parse_analyze.

/* Increment the counts, except when jstate is not NULL */
if (!jstate)
{

What I think may be a better idea is to add something similar
to auto_explain.c, but it should only be added to the top of
pgss_planner, pgss_ExecutorStart and pgss_ProcessUtility.

if (nesting_level == 0)
{
if (!IsParallelWorker())
current_query_sampled = pg_prng_double(&pg_global_prng_state) <
pgss_sample_rate;
else
current_query_sampled = false;
}

This is needed for ExecutorStart and not ExecutorEnd because
ExecutorStart is where the instrumentation is initialized with
queryDesc->totaltime. The above code block could be
turned into a macro and reused in the routines mentioned.

However, it seems with this change, we can see a much
higher likelihood of non-normalized query texts being stored.
This is because jstate is only available during post_parse_analyze.
Therefore, if the first time you are sampling the statement is in ExecutorEnd,
then you will end up storing a non-normalized version of the query text,
see below example with the attached v8.

postgres=# set pg_stat_statements.sample_rate = 0;
SET
postgres=# select pg_stat_statements_reset();
pg_stat_statements_reset
-------------------------------
2025-01-07 13:02:11.807952-06
(1 row)

postgres=# SELECT 1 \parse stmt

postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
query | calls
-------+-------
(0 rows)

postgres=# set pg_stat_statements.sample_rate = 1;
SET
postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
query | calls
-------+-------
(0 rows)

postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
query | calls
---------------------------------------------+-------
SELECT 1 | 1
SELECT query, calls FROM pg_stat_statements | 1
(2 rows)

postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
query | calls
---------------------------------------------+-------
SELECT 1 | 2
SELECT query, calls FROM pg_stat_statements | 2
(2 rows)

One idea is to make jstate available to all hooks, and completely
remove reliance on post_parse_analyze in pg_stat_statements.
I can't find the thread now, but I know this has come up in past discussions
when troubleshooting gaps in query normalization. My concern is this
feature will greatly increase the likelihood of non-normalized query texts.

Also, with regards to the benchmarks, it seems
sampling will be beneficial for workloads that are touching a small number
of entries with high concurrency. This is why we see benefit for
a standard pgbench workload.
Samping becomes less beneficial when there is a large set of queries
being updated.
I still think this is a good approach for workloads that touch a small set
of entries.

Regards,

Sami

Wow, thank you for pointing this out. Your solution looks interesting,
but I'd like to explore other ways to solve the issue besides making it
available to all hooks. If I don't find anything better, I'll go with yours.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#22

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

about 1 year ago

In reply to: Sami Imseih (#20)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

On 07.01.2025 22:29, Sami Imseih wrote:

You are right. This is absolutely unexpected for users. Thank you for
the review.

To fix this, I suggest storing a random number in the [0, 1) range in a
separate variable, which will be used for comparisons in any place. We
will compare 'sample_rate' and random value not only in
pgss_post_parse_analyze(), but also in pgss_ProcessUtility() and
pgss_planner().

I attached patch with test and fixes.

I still think this is not the correct approach. It seems that post_parse_analyze
should not have to deal with checking for a sample rate. This is because
post_parse_analyze, which is the only hook with access to jstate, is
only responsible
for storing a normalized query text on disk and creating a not-yet
user visible entry
in the hash. i.e. pgss_store will never increment counters when called from
pgss_post_parse_analyze.

/* Increment the counts, except when jstate is not NULL */
if (!jstate)
{

What I think may be a better idea is to add something similar
to auto_explain.c, but it should only be added to the top of
pgss_planner, pgss_ExecutorStart and pgss_ProcessUtility.

if (nesting_level == 0)
{
if (!IsParallelWorker())
current_query_sampled = pg_prng_double(&pg_global_prng_state) <
pgss_sample_rate;
else
current_query_sampled = false;
}

This is needed for ExecutorStart and not ExecutorEnd because
ExecutorStart is where the instrumentation is initialized with
queryDesc->totaltime. The above code block could be
turned into a macro and reused in the routines mentioned.

I added this code in to the top of pgss_planner, pgss_ExecutorStart and
pgss_ProcessUtility. I removed this checking from
pgss_post_parse_analyze. I attached v9-patch.

However, it seems with this change, we can see a much
higher likelihood of non-normalized query texts being stored.
This is because jstate is only available during post_parse_analyze.
Therefore, if the first time you are sampling the statement is in ExecutorEnd,
then you will end up storing a non-normalized version of the query text,
see below example with the attached v8.

postgres=# set pg_stat_statements.sample_rate = 0;
SET
postgres=# select pg_stat_statements_reset();
pg_stat_statements_reset
-------------------------------
2025-01-07 13:02:11.807952-06
(1 row)

postgres=# SELECT 1 \parse stmt

postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
query | calls
-------+-------
(0 rows)

postgres=# set pg_stat_statements.sample_rate = 1;
SET
postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
query | calls
-------+-------
(0 rows)

postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
query | calls
---------------------------------------------+-------
SELECT 1 | 1
SELECT query, calls FROM pg_stat_statements | 1
(2 rows)

postgres=# \bind_named stmt \g
?column?
----------
1
(1 row)

postgres=# SELECT query, calls FROM pg_stat_statements;
query | calls
---------------------------------------------+-------
SELECT 1 | 2
SELECT query, calls FROM pg_stat_statements | 2
(2 rows)

One idea is to make jstate available to all hooks, and completely
remove reliance on post_parse_analyze in pg_stat_statements.
I can't find the thread now, but I know this has come up in past discussions
when troubleshooting gaps in query normalization. My concern is this
feature will greatly increase the likelihood of non-normalized query texts.

Also, with regards to the benchmarks, it seems
sampling will be beneficial for workloads that are touching a small number
of entries with high concurrency. This is why we see benefit for
a standard pgbench workload.
Samping becomes less beneficial when there is a large set of queries
being updated.
I still think this is a good approach for workloads that touch a small set
of entries.

Regards,

Sami

After the changes in v9-patch, won’t all the necessary queries now be
normalized? Since there are no longer any restrictions in the parser
hook, queries will be normalized as usual, and pgss_planner,
pgss_ExecutorStart, and ExecutorEnd will simply fetch them from
'pgss_query_texts.stat' file.

For now, this seems like the simplest approach instead of making
JumbleState available to other hooks. Moreover, if we do proceed with
that, we might be able to remove the 'pgss_query_texts.stat' file
altogether and more other improvements. In my view, if no other options
arise, we should address making JumbleState available to other hooks in
a separate thread. If you have any suggestions, I'm always open to feedback.

I attached v9 patch with changes and test above.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v9-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v9-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchDownload

From b40ab06155cac38e57f0551a3858d0574d96865f Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Date: Wed, 8 Jan 2025 22:15:02 +0300
Subject: [PATCH] Allow setting sample ratio for pg_stat_statements

New configuration parameter pg_stat_statements.sample_ratio makes it
possible to track just a fraction of the queries meeting the configured
threshold, to reduce the amount of tracking.
---
 .../pg_stat_statements/expected/select.out    | 76 +++++++++++++++++++
 .../pg_stat_statements/pg_stat_statements.c   | 54 ++++++++++++-
 contrib/pg_stat_statements/sql/select.sql     | 21 +++++
 doc/src/sgml/pgstatstatements.sgml            | 18 +++++
 4 files changed, 166 insertions(+), 3 deletions(-)

diff --git a/contrib/pg_stat_statements/expected/select.out b/contrib/pg_stat_statements/expected/select.out
index 37a30af034..558d93fb46 100644
--- a/contrib/pg_stat_statements/expected/select.out
+++ b/contrib/pg_stat_statements/expected/select.out
@@ -153,6 +153,82 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
  t
 (1 row)
 
+--
+-- sample statements
+--
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                       query                        | calls 
+----------------------------------------------------+-------
+ SELECT pg_stat_statements_reset() IS NOT NULL AS t |     1
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                                 query                                  | calls 
+------------------------------------------------------------------------+-------
+ SELECT $1 AS "int"                                                     |     1
+ SELECT pg_stat_statements_reset() IS NOT NULL AS t                     |     1
+ SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" |     0
+(3 rows)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+set pg_stat_statements.sample_rate = 0.0;
+select pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 \parse stmt
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+set pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                                 query                                  | calls 
+------------------------------------------------------------------------+-------
+ SELECT $1                                                              |     1
+ SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" |     0
+(2 rows)
+
+select pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
 --
 -- queries with locking clauses
 --
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index bebf8134eb..52957c4628 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,6 +50,7 @@
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -294,6 +295,10 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1;
+
+/* Is the current top-level query to be sampled? */
+static bool current_query_sampled = false;
 
 
 #define pgss_enabled(level) \
@@ -414,6 +419,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to track.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -889,12 +907,21 @@ pgss_planner(Query *parse,
 {
 	PlannedStmt *result;
 
+	if (nesting_level == 0)
+	{
+		if (!IsParallelWorker())
+			current_query_sampled = pg_prng_double(&pg_global_prng_state) < pgss_sample_rate;
+		else
+			current_query_sampled = false;
+
+	}
+
 	/*
 	 * We can't process the query if no query_string is provided, as
 	 * pgss_store needs it.  We also ignore query without queryid, as it would
 	 * be treated as a utility statement, which may not be the case.
 	 */
-	if (pgss_enabled(nesting_level)
+	if (pgss_enabled(nesting_level) && current_query_sampled
 		&& pgss_track_planning && query_string
 		&& parse->queryId != UINT64CONST(0))
 	{
@@ -994,12 +1021,22 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	else
 		standard_ExecutorStart(queryDesc, eflags);
 
+	if (nesting_level == 0)
+	{
+		if (!IsParallelWorker())
+			current_query_sampled = pg_prng_double(&pg_global_prng_state) < pgss_sample_rate;
+		else
+			current_query_sampled = false;
+
+	}
+
 	/*
 	 * If query has queryId zero, don't track it.  This prevents double
 	 * counting of optimizable statements that are directly contained in
 	 * utility statements.
 	 */
-	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+	if (pgss_enabled(nesting_level) && current_query_sampled &&
+		queryDesc->plannedstmt->queryId != UINT64CONST(0))
 	{
 		/*
 		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
@@ -1111,7 +1148,18 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	uint64		saved_queryId = pstmt->queryId;
 	int			saved_stmt_location = pstmt->stmt_location;
 	int			saved_stmt_len = pstmt->stmt_len;
-	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level);
+	bool		enabled;
+
+	if (nesting_level == 0)
+	{
+		if (!IsParallelWorker())
+			current_query_sampled = pg_prng_double(&pg_global_prng_state) < pgss_sample_rate;
+		else
+			current_query_sampled = false;
+	}
+
+	/* current_query_sampled is in pgss_enabled */
+	enabled = pgss_track_utility && pgss_enabled(nesting_level) && current_query_sampled;
 
 	/*
 	 * Force utility statements to get queryId zero.  We do this even in cases
diff --git a/contrib/pg_stat_statements/sql/select.sql b/contrib/pg_stat_statements/sql/select.sql
index e0be58d5e2..21f09ef94a 100644
--- a/contrib/pg_stat_statements/sql/select.sql
+++ b/contrib/pg_stat_statements/sql/select.sql
@@ -59,6 +59,27 @@ DEALLOCATE pgss_test;
 SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
 SELECT pg_stat_statements_reset() IS NOT NULL AS t;
 
+--
+-- sample statements
+--
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+set pg_stat_statements.sample_rate = 0.0;
+select pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 \parse stmt
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+set pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+select pg_stat_statements_reset() IS NOT NULL AS t;
+
 --
 -- queries with locking clauses
 --
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..8ade4e3ced 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,24 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      track a fraction of the statements in each session. The default is 1,
+      meaning track all the queries. In case of nested statements,
+      either all will be tracked or none. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

#23

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

about 1 year ago

In reply to: Ilia Evdokimov (#22)

Re: Sample rate added to pg_stat_statements

On 08.01.2025 22:39, Ilia Evdokimov wrote:

After the changes in v9-patch, won’t all the necessary queries now be
normalized? Since there are no longer any restrictions in the parser
hook, queries will be normalized as usual, and pgss_planner,
pgss_ExecutorStart, and ExecutorEnd will simply fetch them from
'pgss_query_texts.stat' file.

For now, this seems like the simplest approach instead of making
JumbleState available to other hooks. Moreover, if we do proceed with
that, we might be able to remove the 'pgss_query_texts.stat' file
altogether and more other improvements. In my view, if no other
options arise, we should address making JumbleState available to other
hooks in a separate thread. If you have any suggestions, I'm always
open to feedback.

I attached v9 patch with changes and test above.

Unfortunately, these changes do not achieve the intended sampling goal.
I looked into this more deeply: while the sampled-out queries do not
appear in pg_stat_statements, an entry is still allocated in the hash
table after normalization, which, in my view, should not happen when
sampling is in effect. Therefore, patch v9 is unlikely to meet our needs.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#24

Sami Imseih

samimseih@gmail.com

about 1 year ago

In reply to: Ilia Evdokimov (#23)

Re: Sample rate added to pg_stat_statements

Unfortunately, these changes do not achieve the intended sampling goal.
I looked into this more deeply: while the sampled-out queries do not
appear in pg_stat_statements, an entry is still allocated in the hash
table after normalization, which, in my view, should not happen when
sampling is in effect. Therefore, patch v9 is unlikely to meet our needs.

pg_stat_statements creates entries as "sticky" initially to give them
more time to stay in the hash before the first execution completes.
It's not perfect, but it works for the majority of cases. So, what you
are observing is how pg_stat_statements currently works.

If an entry is popular enough, we will need it anyways ( even
with the proposed sampling ). An entry that's not popular will
eventually be aged out.

From my understanding, what the proposed sampling will do is
to reduce the overhead of incrementing counters of popular entries,
because of the spinlock to update the counters. This is particularly
the case with high concurrency on large machines ( high cpu count ),
and especially when there is a small set of popular entries.
IMO, This patch should also have a benchmark that proves
that a user can benefit with sampling in those types of
workloads.

Regards,

Sami

Regards,

Sami

#25

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

about 1 year ago

In reply to: Sami Imseih (#24)

Re: Sample rate added to pg_stat_statements

On 09.01.2025 05:29, Sami Imseih wrote:

Unfortunately, these changes do not achieve the intended sampling goal.
I looked into this more deeply: while the sampled-out queries do not
appear in pg_stat_statements, an entry is still allocated in the hash
table after normalization, which, in my view, should not happen when
sampling is in effect. Therefore, patch v9 is unlikely to meet our needs.

pg_stat_statements creates entries as "sticky" initially to give them
more time to stay in the hash before the first execution completes.
It's not perfect, but it works for the majority of cases. So, what you
are observing is how pg_stat_statements currently works.

If an entry is popular enough, we will need it anyways ( even
with the proposed sampling ). An entry that's not popular will
eventually be aged out.

From my understanding, what the proposed sampling will do is
to reduce the overhead of incrementing counters of popular entries,
because of the spinlock to update the counters. This is particularly
the case with high concurrency on large machines ( high cpu count ),
and especially when there is a small set of popular entries.
IMO, This patch should also have a benchmark that proves
that a user can benefit with sampling in those types of
workloads.

Ah, so patch version 9 might be the best fit to achieve this. I’ll need
to benchmark it on a large, high-concurrency machine then.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#26

Alena Rybakina

a.rybakina@postgrespro.ru

about 1 year ago

In reply to: Ilia Evdokimov (#22)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

Hi! Thank you for the work with this subject.

I looked at your patch and noticed that this part of the code is
repeated several times:

if (nesting_level == 0)
    {
        if (!IsParallelWorker())
            current_query_sampled =
pg_prng_double(&pg_global_prng_state) < pgss_sample_rate;
        else
            current_query_sampled = false;

}

I think you should put this in a function like
update_current_query_sampled. I've attached a diff file with the changes.

--
Regards,
Alena Rybakina
Postgres Professional

Attachments:

Allow_setting_sample_ratio_for_pg_stat_statements.difftext/x-patch; charset=UTF-8; name=Allow_setting_sample_ratio_for_pg_stat_statements.diffDownload

diff --git a/contrib/pg_stat_statements/expected/select.out b/contrib/pg_stat_statements/expected/select.out
index 37a30af034a..558d93fb468 100644
--- a/contrib/pg_stat_statements/expected/select.out
+++ b/contrib/pg_stat_statements/expected/select.out
@@ -153,6 +153,82 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
  t
 (1 row)
 
+--
+-- sample statements
+--
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                       query                        | calls 
+----------------------------------------------------+-------
+ SELECT pg_stat_statements_reset() IS NOT NULL AS t |     1
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                                 query                                  | calls 
+------------------------------------------------------------------------+-------
+ SELECT $1 AS "int"                                                     |     1
+ SELECT pg_stat_statements_reset() IS NOT NULL AS t                     |     1
+ SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" |     0
+(3 rows)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+set pg_stat_statements.sample_rate = 0.0;
+select pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 \parse stmt
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+set pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                                 query                                  | calls 
+------------------------------------------------------------------------+-------
+ SELECT $1                                                              |     1
+ SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" |     0
+(2 rows)
+
+select pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
 --
 -- queries with locking clauses
 --
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index bebf8134eb0..73795458364 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,6 +50,7 @@
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -294,6 +295,10 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1;
+
+/* Is the current top-level query to be sampled? */
+static bool current_query_sampled = false;
 
 
 #define pgss_enabled(level) \
@@ -414,6 +419,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to track.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -877,6 +895,18 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 				   0);
 }
 
+static void
+update_current_query_sampled()
+{
+	if (nesting_level == 0)
+	{
+		if (!IsParallelWorker())
+			current_query_sampled = pg_prng_double(&pg_global_prng_state) < pgss_sample_rate;
+		else
+			current_query_sampled = false;
+
+	}
+}
 /*
  * Planner hook: forward to regular planner, but measure planning time
  * if needed.
@@ -889,12 +919,15 @@ pgss_planner(Query *parse,
 {
 	PlannedStmt *result;
 
+	if (nesting_level == 0)
+		update_current_query_sampled();
+
 	/*
 	 * We can't process the query if no query_string is provided, as
 	 * pgss_store needs it.  We also ignore query without queryid, as it would
 	 * be treated as a utility statement, which may not be the case.
 	 */
-	if (pgss_enabled(nesting_level)
+	if (pgss_enabled(nesting_level) && current_query_sampled
 		&& pgss_track_planning && query_string
 		&& parse->queryId != UINT64CONST(0))
 	{
@@ -994,12 +1027,16 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	else
 		standard_ExecutorStart(queryDesc, eflags);
 
+	if (nesting_level == 0)
+		update_current_query_sampled();
+
 	/*
 	 * If query has queryId zero, don't track it.  This prevents double
 	 * counting of optimizable statements that are directly contained in
 	 * utility statements.
 	 */
-	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+	if (pgss_enabled(nesting_level) && current_query_sampled &&
+		queryDesc->plannedstmt->queryId != UINT64CONST(0))
 	{
 		/*
 		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
@@ -1111,7 +1148,13 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	uint64		saved_queryId = pstmt->queryId;
 	int			saved_stmt_location = pstmt->stmt_location;
 	int			saved_stmt_len = pstmt->stmt_len;
-	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level);
+	bool		enabled;
+
+	if (nesting_level == 0)
+		update_current_query_sampled();
+
+	/* current_query_sampled is in pgss_enabled */
+	enabled = pgss_track_utility && pgss_enabled(nesting_level) && current_query_sampled;
 
 	/*
 	 * Force utility statements to get queryId zero.  We do this even in cases
diff --git a/contrib/pg_stat_statements/sql/select.sql b/contrib/pg_stat_statements/sql/select.sql
index e0be58d5e24..21f09ef94a6 100644
--- a/contrib/pg_stat_statements/sql/select.sql
+++ b/contrib/pg_stat_statements/sql/select.sql
@@ -59,6 +59,27 @@ DEALLOCATE pgss_test;
 SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
 SELECT pg_stat_statements_reset() IS NOT NULL AS t;
 
+--
+-- sample statements
+--
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+set pg_stat_statements.sample_rate = 0.0;
+select pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 \parse stmt
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+set pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+select pg_stat_statements_reset() IS NOT NULL AS t;
+
 --
 -- queries with locking clauses
 --

#27

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

about 1 year ago

In reply to: Alena Rybakina (#26)

Re: Sample rate added to pg_stat_statements

On 09.01.2025 22:13, Alena Rybakina wrote:

Hi! Thank you for the work with this subject.

I looked at your patch and noticed that this part of the code is
repeated several times:

if (nesting_level == 0)
    {
        if (!IsParallelWorker())
            current_query_sampled =
pg_prng_double(&pg_global_prng_state) < pgss_sample_rate;
        else
            current_query_sampled = false;

    }

I think you should put this in a function like
update_current_query_sampled. I've attached a diff file with the changes.

Agree, thanks.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#28

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

about 1 year ago

In reply to: Alexander Korotkov (#10)

Re: Sample rate added to pg_stat_statements

On 22.11.2024 09:08, Alexander Korotkov wrote:

On Wed, Nov 20, 2024 at 12:07 AM Michael Paquier <michael@paquier.xyz> wrote:

On Tue, Nov 19, 2024 at 09:39:21AM -0500, Greg Sabino Mullane wrote:

Oh, and a +1 in general to the patch, OP, although it would also be nice to
start finding the bottlenecks that cause such performance issues.

FWIW, I'm not eager to integrate this proposal without looking at this
exact argument in depth.

One piece of it would be to see how much of such "bottlenecks" we
would be able to get rid of by integrating pg_stat_statements into
the central pgstats with the custom APIs, without pushing the module
into core. This means that we would combine the existing hash of pgss
to shrink to 8 bytes for objid rather than 13 bytes now as the current
code relies on (toplevel, userid, queryid) for the entry lookup (entry
removal is sniped with these three values as well, or dshash seq
scans). The odds of conflicts still still play in our favor even if
we have a few million entries, or even ten times that.

If you run "pgbench -S -M prepared" on a pretty large machine with
high concurrency, then spin lock in pgss_store() could become pretty
much of a bottleneck. And I'm not sure switching all counters to
atomics could somehow improve the situation given we already have
pretty many counters.

I'm generally +1 for the approach taken in this thread. But I would
suggest introducing a threshold value for a query execution time, and
sample just everything below that threshold. The slower query
shouldn't be sampled, because it can't be too frequent, and also it
could be more valuable to be counter individually (while very fast
queries probably only matter "in average").

------
Regards,
Alexander Korotkov
Supabase

BTW, since we're performing sampling outside of
pgss_post_parse_analyze(), I'd like to reconsider Alexander's
proposal—to sample only those queries that have an execution time below
a specified threshold, as defined in the hook where execution time is
recorded. Queries exceeding this threshold would not be sampled.
Certainly I'll fix all Andrey's comments.

What are your thoughts on this approach?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#29

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

12 months ago

In reply to: Alena Rybakina (#26)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

On 09.01.2025 22:13, Alena Rybakina wrote:

Hi! Thank you for the work with this subject.

I looked at your patch and noticed that this part of the code is
repeated several times:

if (nesting_level == 0)
    {
        if (!IsParallelWorker())
            current_query_sampled =
pg_prng_double(&pg_global_prng_state) < pgss_sample_rate;
        else
            current_query_sampled = false;

    }

I think you should put this in a function like
update_current_query_sampled. I've attached a diff file with the changes.

There was redundant checking of nesting_level, so I removed one of them.
The fixed version is included in patch v10.

Alena, Sami – I apologize for not including you in the previous email.
If you're interested in this approach, I'm open to any suggestions.

[0]: /messages/by-id/1b13d748-5e98-479c-9222-3253a734a038@tantorlabs.com
/messages/by-id/1b13d748-5e98-479c-9222-3253a734a038@tantorlabs.com

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v10-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v10-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchDownload

From e8d9911bd7b1384112773977c1a30a476fc77471 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.ru>
Date: Tue, 14 Jan 2025 12:18:52 +0300
Subject: [PATCH v1] Allow setting sample ratio for pg_stat_statements

New configuration parameter pg_stat_statements.sample_ratio makes it
possible to track just a fraction of the queries meeting the configured
threshold, to reduce the amount of tracking.
---
 .../pg_stat_statements/expected/select.out    | 76 +++++++++++++++++++
 .../pg_stat_statements/pg_stat_statements.c   | 46 ++++++++++-
 contrib/pg_stat_statements/sql/select.sql     | 21 +++++
 doc/src/sgml/pgstatstatements.sgml            | 18 +++++
 4 files changed, 158 insertions(+), 3 deletions(-)

diff --git a/contrib/pg_stat_statements/expected/select.out b/contrib/pg_stat_statements/expected/select.out
index 37a30af034..558d93fb46 100644
--- a/contrib/pg_stat_statements/expected/select.out
+++ b/contrib/pg_stat_statements/expected/select.out
@@ -153,6 +153,82 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
  t
 (1 row)
 
+--
+-- sample statements
+--
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                       query                        | calls 
+----------------------------------------------------+-------
+ SELECT pg_stat_statements_reset() IS NOT NULL AS t |     1
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                                 query                                  | calls 
+------------------------------------------------------------------------+-------
+ SELECT $1 AS "int"                                                     |     1
+ SELECT pg_stat_statements_reset() IS NOT NULL AS t                     |     1
+ SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" |     0
+(3 rows)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+set pg_stat_statements.sample_rate = 0.0;
+select pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 \parse stmt
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+set pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                                 query                                  | calls 
+------------------------------------------------------------------------+-------
+ SELECT $1                                                              |     1
+ SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" |     0
+(2 rows)
+
+select pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
 --
 -- queries with locking clauses
 --
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index bebf8134eb..d79402bac4 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,6 +50,7 @@
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -294,6 +295,10 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1;
+
+/* Is the current top-level query to be sampled? */
+static bool current_query_sampled = false;
 
 
 #define pgss_enabled(level) \
@@ -414,6 +419,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to track.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -877,6 +895,19 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 				   0);
 }
 
+static void
+update_current_query_sampled()
+{
+	if (nesting_level == 0)
+	{
+		if (!IsParallelWorker())
+			current_query_sampled = pg_prng_double(&pg_global_prng_state) < pgss_sample_rate;
+		else
+			current_query_sampled = false;
+
+	}
+}
+
 /*
  * Planner hook: forward to regular planner, but measure planning time
  * if needed.
@@ -889,12 +920,14 @@ pgss_planner(Query *parse,
 {
 	PlannedStmt *result;
 
+	update_current_query_sampled();
+
 	/*
 	 * We can't process the query if no query_string is provided, as
 	 * pgss_store needs it.  We also ignore query without queryid, as it would
 	 * be treated as a utility statement, which may not be the case.
 	 */
-	if (pgss_enabled(nesting_level)
+	if (pgss_enabled(nesting_level) && current_query_sampled
 		&& pgss_track_planning && query_string
 		&& parse->queryId != UINT64CONST(0))
 	{
@@ -994,12 +1027,15 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	else
 		standard_ExecutorStart(queryDesc, eflags);
 
+	update_current_query_sampled();
+
 	/*
 	 * If query has queryId zero, don't track it.  This prevents double
 	 * counting of optimizable statements that are directly contained in
 	 * utility statements.
 	 */
-	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+	if (pgss_enabled(nesting_level) && current_query_sampled &&
+		queryDesc->plannedstmt->queryId != UINT64CONST(0))
 	{
 		/*
 		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
@@ -1111,7 +1147,11 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	uint64		saved_queryId = pstmt->queryId;
 	int			saved_stmt_location = pstmt->stmt_location;
 	int			saved_stmt_len = pstmt->stmt_len;
-	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level);
+	bool		enabled;
+
+	update_current_query_sampled();
+
+	enabled = pgss_track_utility && pgss_enabled(nesting_level) && current_query_sampled;
 
 	/*
 	 * Force utility statements to get queryId zero.  We do this even in cases
diff --git a/contrib/pg_stat_statements/sql/select.sql b/contrib/pg_stat_statements/sql/select.sql
index e0be58d5e2..21f09ef94a 100644
--- a/contrib/pg_stat_statements/sql/select.sql
+++ b/contrib/pg_stat_statements/sql/select.sql
@@ -59,6 +59,27 @@ DEALLOCATE pgss_test;
 SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
 SELECT pg_stat_statements_reset() IS NOT NULL AS t;
 
+--
+-- sample statements
+--
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+set pg_stat_statements.sample_rate = 0.0;
+select pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 \parse stmt
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+set pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+select pg_stat_statements_reset() IS NOT NULL AS t;
+
 --
 -- queries with locking clauses
 --
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..8ade4e3ced 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,24 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      track a fraction of the statements in each session. The default is 1,
+      meaning track all the queries. In case of nested statements,
+      either all will be tracked or none. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

#30

Sami Imseih

samimseih@gmail.com

12 months ago

In reply to: Ilia Evdokimov (#29)

Re: Sample rate added to pg_stat_statements

Alena, Sami – I apologize for not including you in the previous email.
If you're interested in this approach, I'm open to any suggestions.

[0]:
/messages/by-id/1b13d748-5e98-479c-9222-3253a734a038@tantorlabs.com

Here are my thoughts on this:

There is good reason to apply sample rate selectively,
but I am not sure if execution time is the way to go. I
would rather apply a sample rate on the most frequently
called queries, since I can gather enough samples
to draw conclusions about performance of the query.
I just don't know if that can be done in a sensible way,
because while we can check the number of calls in the entry,
we will need to do that with a shared lock and spin lock,
which will defeat the purpose of this patch.

This also got me thinking if maybe we should
allow the user to disable sample rate for
utility statements as those are less frequent
in most workloads and a user may want to capture
all such statements, i.e. DROP, CREATE, etc.

Regards,

Sami

#31

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

12 months ago

In reply to: Sami Imseih (#30)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

On 15.01.2025 01:07, Sami Imseih wrote:

Alena, Sami – I apologize for not including you in the previous email.
If you're interested in this approach, I'm open to any suggestions.

[0]:
/messages/by-id/1b13d748-5e98-479c-9222-3253a734a038@tantorlabs.com

Here are my thoughts on this:

There is good reason to apply sample rate selectively,
but I am not sure if execution time is the way to go. I
would rather apply a sample rate on the most frequently
called queries, since I can gather enough samples
to draw conclusions about performance of the query.
I just don't know if that can be done in a sensible way,
because while we can check the number of calls in the entry,
we will need to do that with a shared lock and spin lock,
which will defeat the purpose of this patch.

Agree. Indeed, we should reduce the load on the spin locks, but we can’t
check the most popular called queries without inspecting the hash table
and locking spin locks.

This also got me thinking if maybe we should
allow the user to disable sample rate for
utility statements as those are less frequent
in most workloads and a user may want to capture
all such statements, i.e. DROP, CREATE, etc.

Regards,

Sami

Probably, but first I suggest benchmarking with sampling applied to all
queries. If the results are good, we can later filter certain queries
based on different characteristics.

On 06.01.2025 18:57, Andrey M. Borodin wrote:

1. This code seems a little different from your patch. It is trying to avoid engaging PRNG. I'm not sure it's a good idea, but still. Also, it uses "<=", not "<".

xact_is_sampled = log_xact_sample_rate != 0 &&
(log_xact_sample_rate == 1 ||
pg_prng_double(&pg_global_prng_state) <= log_xact_sample_rate);

Sorry for the delayed reply. Andrey was right about this
suggestion—first, it makes the code more readable for others, and
second, it avoids engaging the PRNG on edge values of 0.0 and 1.0. I’ve
attached patch v11 with these changes.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v11-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v11-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchDownload

From 838c7e591a5cf157e88bc82cd8adf549b60ce212 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.ru>
Date: Wed, 15 Jan 2025 12:37:36 +0300
Subject: [PATCH v1] Allow setting sample ratio for pg_stat_statements

New configuration parameter pg_stat_statements.sample_ratio makes it
possible to track just a fraction of the queries meeting the configured
threshold, to reduce the amount of tracking.
---
 .../pg_stat_statements/expected/select.out    | 76 +++++++++++++++++++
 .../pg_stat_statements/pg_stat_statements.c   | 48 +++++++++++-
 contrib/pg_stat_statements/sql/select.sql     | 21 +++++
 doc/src/sgml/pgstatstatements.sgml            | 18 +++++
 4 files changed, 160 insertions(+), 3 deletions(-)

diff --git a/contrib/pg_stat_statements/expected/select.out b/contrib/pg_stat_statements/expected/select.out
index 37a30af034..558d93fb46 100644
--- a/contrib/pg_stat_statements/expected/select.out
+++ b/contrib/pg_stat_statements/expected/select.out
@@ -153,6 +153,82 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
  t
 (1 row)
 
+--
+-- sample statements
+--
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                       query                        | calls 
+----------------------------------------------------+-------
+ SELECT pg_stat_statements_reset() IS NOT NULL AS t |     1
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                                 query                                  | calls 
+------------------------------------------------------------------------+-------
+ SELECT $1 AS "int"                                                     |     1
+ SELECT pg_stat_statements_reset() IS NOT NULL AS t                     |     1
+ SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" |     0
+(3 rows)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+set pg_stat_statements.sample_rate = 0.0;
+select pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 \parse stmt
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+set pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                                 query                                  | calls 
+------------------------------------------------------------------------+-------
+ SELECT $1                                                              |     1
+ SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" |     0
+(2 rows)
+
+select pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
 --
 -- queries with locking clauses
 --
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index bebf8134eb..ed57269b54 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,6 +50,7 @@
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -294,6 +295,10 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1.0;
+
+/* Is the current top-level query to be sampled? */
+static bool current_query_sampled = false;
 
 
 #define pgss_enabled(level) \
@@ -414,6 +419,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to track.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -877,6 +895,21 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 				   0);
 }
 
+static void
+update_current_query_sampled()
+{
+	if (nesting_level == 0)
+	{
+		if (!IsParallelWorker())
+			current_query_sampled = pgss_sample_rate != 0.0 &&
+				(pgss_sample_rate == 1.0 ||
+				pg_prng_double(&pg_global_prng_state) <= pgss_sample_rate);
+		else
+			current_query_sampled = false;
+
+	}
+}
+
 /*
  * Planner hook: forward to regular planner, but measure planning time
  * if needed.
@@ -889,12 +922,14 @@ pgss_planner(Query *parse,
 {
 	PlannedStmt *result;
 
+	update_current_query_sampled();
+
 	/*
 	 * We can't process the query if no query_string is provided, as
 	 * pgss_store needs it.  We also ignore query without queryid, as it would
 	 * be treated as a utility statement, which may not be the case.
 	 */
-	if (pgss_enabled(nesting_level)
+	if (pgss_enabled(nesting_level) && current_query_sampled
 		&& pgss_track_planning && query_string
 		&& parse->queryId != UINT64CONST(0))
 	{
@@ -994,12 +1029,15 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	else
 		standard_ExecutorStart(queryDesc, eflags);
 
+	update_current_query_sampled();
+
 	/*
 	 * If query has queryId zero, don't track it.  This prevents double
 	 * counting of optimizable statements that are directly contained in
 	 * utility statements.
 	 */
-	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+	if (pgss_enabled(nesting_level) && current_query_sampled &&
+		queryDesc->plannedstmt->queryId != UINT64CONST(0))
 	{
 		/*
 		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
@@ -1111,7 +1149,11 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	uint64		saved_queryId = pstmt->queryId;
 	int			saved_stmt_location = pstmt->stmt_location;
 	int			saved_stmt_len = pstmt->stmt_len;
-	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level);
+	bool		enabled;
+
+	update_current_query_sampled();
+
+	enabled = pgss_track_utility && pgss_enabled(nesting_level) && current_query_sampled;
 
 	/*
 	 * Force utility statements to get queryId zero.  We do this even in cases
diff --git a/contrib/pg_stat_statements/sql/select.sql b/contrib/pg_stat_statements/sql/select.sql
index e0be58d5e2..21f09ef94a 100644
--- a/contrib/pg_stat_statements/sql/select.sql
+++ b/contrib/pg_stat_statements/sql/select.sql
@@ -59,6 +59,27 @@ DEALLOCATE pgss_test;
 SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
 SELECT pg_stat_statements_reset() IS NOT NULL AS t;
 
+--
+-- sample statements
+--
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+set pg_stat_statements.sample_rate = 0.0;
+select pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 \parse stmt
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+set pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+select pg_stat_statements_reset() IS NOT NULL AS t;
+
 --
 -- queries with locking clauses
 --
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..8ade4e3ced 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,24 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      track a fraction of the statements in each session. The default is 1,
+      meaning track all the queries. In case of nested statements,
+      either all will be tracked or none. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

#32

Sami Imseih

samimseih@gmail.com

12 months ago

In reply to: Ilia Evdokimov (#31)

Re: Sample rate added to pg_stat_statements

Probably, but first I suggest benchmarking with sampling applied to all queries. If the results are good, we can later filter certain queries based on different characteristics.

Absolutely. The benchmark numbers to justify this feature are
the next step. Thanks for your work on this!

Regards,

Sami

#33

Alena Rybakina

a.rybakina@postgrespro.ru

12 months ago

In reply to: Ilia Evdokimov (#31)

Re: Sample rate added to pg_stat_statements

On 15.01.2025 12:47, Ilia Evdokimov wrote:

On 06.01.2025 18:57, Andrey M. Borodin wrote:

1. This code seems a little different from your patch. It is trying to avoid engaging PRNG. I'm not sure it's a good idea, but still. Also, it uses "<=", not "<".

xact_is_sampled = log_xact_sample_rate != 0 &&
(log_xact_sample_rate == 1 ||
pg_prng_double(&pg_global_prng_state) <= log_xact_sample_rate);

Sorry for the delayed reply. Andrey was right about this
suggestion—first, it makes the code more readable for others, and
second, it avoids engaging the PRNG on edge values of 0.0 and 1.0.
I’ve attached patch v11 with these changes.

Patch looks fine. Thank you!

On 14.01.2025 15:00, Ilia Evdokimov wrote:

Alena, Sami – I apologize for not including you in the previous email.
If you're interested in this approach, I'm open to any suggestions.

[0]:
/messages/by-id/1b13d748-5e98-479c-9222-3253a734a038@tantorlabs.com

This is a difficult question. I tend to agree with Alexander Korotkov's
proposal to add a filter that saves information about queries whose
statistical information satisfies the conditions of the configured
filter. However, I don’t believe query execution time is a sufficient
metric for this purpose. It is too unstable and influenced by many
external factors, such as system load. For instance, various background
processes like vacuum, checkpointer, or background writer mechanisms
could be running simultaneously.

Additionally, a query may take a long time to execute because another
large query is consuming most of the system's resources at the same
time. In such cases, the long execution time of the triggered query may
not indicate anything remarkable. Furthermore, statistics for
resource-intensive queries may appear normal if the query simply
processed a large volume of data, such as when it involves a Cartesian
product or a full join.

Therefore, I think the idea of using filters is more promising,
especially if multiple filters are implemented. For example, we could
add filters for buffer usage (pages read or modified), differences in
cardinality (predicted vs. actual), username, application name, and
other criteria. These filters would help reduce the volume of queries
tracked by the pg_stat_statements extension. However, there might be
challenges in keeping this state up to date, given the volatile and
unstable nature of the system.

Your proposed parameter could also reduce the volume of queries for
which statistics need to be saved, but it is too unpredictable for
analysis. This unpredictability makes it unclear how the data would be
interpreted in the future. For instance, a query that genuinely impacted
performance might not be processed by pg_stat_statements simply due to
randomness, while a small, insignificant query could be selected
instead. Analyzing such statistical information might lead to misleading
conclusions.

--
Regards,
Alena Rybakina
Postgres Professional

#34

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

12 months ago

In reply to: Sami Imseih (#32)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

On 15.01.2025 20:16, Sami Imseih wrote:

Probably, but first I suggest benchmarking with sampling applied to all queries. If the results are good, we can later filter certain queries based on different characteristics.

Absolutely. The benchmark numbers to justify this feature are
the next step. Thanks for your work on this!

Regards,

Sami

Hi hackers,

I’d like to share the results of my benchmark.

To stress the spinlock in pg_stat_statements while incrementing counters
for popular entries, it's important to use the same query repeatedly. To
avoid overloading pgss with normalization, the queries should not
contain constants. I found that using the query 'SELECT now()' works
best for this purpose.

I ran the benchmark on a machine with 48 CPUs, which may not be
sufficient to fully test sampling, but I was able to achieve around 1.5
million TPS using 'SELECT now()'. To load the CPUs to about 85-90%, I
ran pgbench with -c 45 -j 45, using a custom 'SELECT now()' in
pgbench_script.sql file containing 'SELECT now()'. The benchmark was
conducted on a newly created 'pgbench' database, , with processes like
autovacuum, fsync, and checkpoints disabled(e.g., checkpoint _timeout =
'24h').

I tested various 'sample_rate' values and compared the performance with
pgss both enabled and disabled. The detailed results are attached in the
'sample_rate.txt' file, along with pgbench progress reports every 10
seconds.

pgbench -c 45 -j 45 -T 200 -f pgbench_script.sql --progress=10 pgbench

sample_rate | num of transactions | lat | stddev | tps

        1.0 |    1207127.572580 |0.037| 0.030 | 241400836
        0.8 |    1403551.516338 |0.032| 0.031 | 280673286
        0.5 |    1658596.614064 |0.027| 0.012 | 331679344
        0.2 |    1757747.509476 |0.025| 0.008 | 351507156
        0.0 |    1760055.986259 |0.025| 0.008 | 351965559
   pgss off |    1828743.791205 |0.024| 0.008 | 365703762

If anyone has the capability to run this benchmark on machines with more
CPUs or with different queries, it would be nice. I’d appreciate any
suggestions or feedback.

--.
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#35

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

12 months ago

In reply to: Ilia Evdokimov (#34)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

On 20.01.2025 17:20, Ilia Evdokimov wrote:

On 15.01.2025 20:16, Sami Imseih wrote:

Probably, but first I suggest benchmarking with sampling applied to
all queries. If the results are good, we can later filter certain
queries based on different characteristics.

Absolutely. The benchmark numbers to justify this feature are
the next step. Thanks for your work on this!

Regards,

Sami

Hi hackers,

I’d like to share the results of my benchmark.

To stress the spinlock in pg_stat_statements while incrementing
counters for popular entries, it's important to use the same query
repeatedly. To avoid overloading pgss with normalization, the queries
should not contain constants. I found that using the query 'SELECT
now()' works best for this purpose.

I ran the benchmark on a machine with 48 CPUs, which may not be
sufficient to fully test sampling, but I was able to achieve around
1.5 million TPS using 'SELECT now()'. To load the CPUs to about
85-90%, I ran pgbench with -c 45 -j 45, using a custom 'SELECT now()'
in pgbench_script.sql file containing 'SELECT now()'. The benchmark
was conducted on a newly created 'pgbench' database, , with processes
like autovacuum, fsync, and checkpoints disabled(e.g., checkpoint
_timeout = '24h').

I tested various 'sample_rate' values and compared the performance
with pgss both enabled and disabled. The detailed results are attached
in the 'sample_rate.txt' file, along with pgbench progress reports
every 10 seconds.

pgbench -c 45 -j 45 -T 200 -f pgbench_script.sql --progress=10 pgbench

sample_rate | num of transactions | lat | stddev | tps

        1.0 |    1207127.572580 |0.037| 0.030 | 241400836
        0.8 |    1403551.516338 |0.032| 0.031 | 280673286
        0.5 |    1658596.614064 |0.027| 0.012 | 331679344
        0.2 |    1757747.509476 |0.025| 0.008 | 351507156
        0.0 |    1760055.986259 |0.025| 0.008 | 351965559
   pgss off |    1828743.791205 |0.024| 0.008 | 365703762

If anyone has the capability to run this benchmark on machines with
more CPUs or with different queries, it would be nice. I’d appreciate
any suggestions or feedback.

--.
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Hi hackers,

I'd like to share the results of a new benchmark I conducted using the
pgbench tool with the -M prepared -S flags!

Previously, I shared a benchmark focused on extreme high-load scenarios,
but now I’ve conducted a more realistic test case based on Alexander’s
suggestion [0]/messages/by-id/CAPpHfdsTKAQqC3A48-MGQhrhfEamXZPb64w=utk7thQcOMNr7Q@mail.gmail.com.

Machine Characteristics:

* 48 CPUs
* 374 GB RAM
* 3 TB Disk

To avoid interference from background processes, I disabled autovacuum
and synchronous operations using the following settings in
postgresql.auto.conf:
    autovacuum = off
    fsync = off
    checkpoint_timeout = '24h'
    shared_preload_libraries = 'pg_stat_statements'

Create a test database:
createdb pgbench
pgbench -i -s 3000 pgbench

If pg_stat_statements is enabled, reset its hash table before starting
the benchmark:
psql -c 'SELECT pg_stat_statements_reset()'

Run the benchmark with the following command:
pgbench -c 46 -j 46 -T 200 -M prepared -S --progress=10 pgbench

Here are the results with and without pg_stat_statements, varying the
sample_rate parameter:

sample_rate |       tps | lat | stddev | num of transactions
       1.0 |    1299070.105018 |0.035| 0.055 | 259781895
       0.8 |    1393327.873249 |0.033| 0.039 | 278630669
       0.5 |    1420303.856480 |0.032| 0.037 | 284023396
       0.2 |    1432482.672301 |0.032| 0.037 | 286461834
       0.0 |    1760055.986259 |0.032| 0.036 | 290169584
pgss off |    1460920.805259 |0.031| 0.036 | 292144256

When pg_stat_statements is enabled, TPS decreases by about 10%. However,
by leveraging the new sampling feature (sample_rate), it is possible to
significantly mitigate this overhead. For detailed benchmark results
with the '-M prepared -S' flags, please see the attached file
'sample_rate_prepared.txt'

P.S. Previously, I shared results under extreme high-load conditions
using the query 'SELECT now()'. For clarity, I’m rewriting those results
for better readability:

sample_rate |       tps | lat | stddev | num of transactions
       1.0 |    1207127.572580 |0.037| 0.030 | 241400836
       0.8 |    1403551.516338 |0.032| 0.031 | 280673286
       0.5 |    1658596.614064 |0.027| 0.012 | 331679344
       0.2 |    1757747.509476 |0.025| 0.008 | 351507156
       0.0 |    1760055.986259 |0.025| 0.008 | 351965559
pgss off |    1828743.791205 |0.024| 0.008 | 365703762

Here, we observed a significant ~33% decrease in TPS when pgss was
enabled without sampling.

Any thoughts?

[0]: /messages/by-id/CAPpHfdsTKAQqC3A48-MGQhrhfEamXZPb64w=utk7thQcOMNr7Q@mail.gmail.com
/messages/by-id/CAPpHfdsTKAQqC3A48-MGQhrhfEamXZPb64w=utk7thQcOMNr7Q@mail.gmail.com

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#36

Andrey Borodin

x4mmm@yandex-team.ru

12 months ago

In reply to: Ilia Evdokimov (#31)

Re: Sample rate added to pg_stat_statements

<v11-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patch>

There’s a typo in the commit message (ratio instead of rate). Besides this the patch looks ready for committer.

Best regards, Andrey Borodin.

#37

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

12 months ago

In reply to: Andrey Borodin (#36)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

On 23.01.2025 08:21, Andrey Borodin wrote:

<v11-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patch>

There’s a typo in the commit message (ratio instead of rate). Besides this the patch looks ready for committer.

Best regards, Andrey Borodin.

Fixed. Thank you for review!

I noticed that the code has not enough comments, so I added additional
ones to improve clarity. Additionally, I moved the
update_current_query_sampled() function to the end of the file, as
keeping it between hooks didn’t seem appropriate to me.

All these changes are included in the updated patch (v12), which I have
attached.

The patch is part of the current 2025-01 Commitfest:
https://commitfest.postgresql.org/51/5390/

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v12-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v12-0001-Allow-setting-sample-ratio-for-pg_stat_statements.patchDownload

From ddbc6af2af511ff342b183cb13a9027edadc0ad3 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.ru>
Date: Thu, 23 Jan 2025 11:31:57 +0300
Subject: [PATCH v1] Allow setting sample ratio for pg_stat_statements

New configuration parameter pg_stat_statements.sample_rate makes it
possible to track just a fraction of the queries meeting the configured
threshold, to reduce the amount of tracking.
---
 .../pg_stat_statements/expected/select.out    | 76 +++++++++++++++++++
 .../pg_stat_statements/pg_stat_statements.c   | 73 +++++++++++++++---
 contrib/pg_stat_statements/sql/select.sql     | 21 +++++
 doc/src/sgml/pgstatstatements.sgml            | 18 +++++
 4 files changed, 179 insertions(+), 9 deletions(-)

diff --git a/contrib/pg_stat_statements/expected/select.out b/contrib/pg_stat_statements/expected/select.out
index 37a30af034..558d93fb46 100644
--- a/contrib/pg_stat_statements/expected/select.out
+++ b/contrib/pg_stat_statements/expected/select.out
@@ -153,6 +153,82 @@ SELECT pg_stat_statements_reset() IS NOT NULL AS t;
  t
 (1 row)
 
+--
+-- sample statements
+--
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                       query                        | calls 
+----------------------------------------------------+-------
+ SELECT pg_stat_statements_reset() IS NOT NULL AS t |     1
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                                 query                                  | calls 
+------------------------------------------------------------------------+-------
+ SELECT $1 AS "int"                                                     |     1
+ SELECT pg_stat_statements_reset() IS NOT NULL AS t                     |     1
+ SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" |     0
+(3 rows)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+set pg_stat_statements.sample_rate = 0.0;
+select pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 \parse stmt
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+set pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+                                 query                                  | calls 
+------------------------------------------------------------------------+-------
+ SELECT $1                                                              |     1
+ SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C" |     0
+(2 rows)
+
+select pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
 --
 -- queries with locking clauses
 --
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index bebf8134eb..92ba954d41 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,6 +50,7 @@
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -294,6 +295,10 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1.0; /* fraction of statements to track */
+
+/* Is the current top-level query to be sampled? */
+static bool current_query_sampled = false;
 
 
 #define pgss_enabled(level) \
@@ -373,6 +378,7 @@ static char *generate_normalized_query(JumbleState *jstate, const char *query,
 static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
 									 int query_loc);
 static int	comp_location(const void *a, const void *b);
+static void	update_current_query_sampled(void);
 
 
 /*
@@ -414,6 +420,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to track.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -889,12 +908,16 @@ pgss_planner(Query *parse,
 {
 	PlannedStmt *result;
 
+	/* Whether the current query is sampled based on sample_rate. */
+	update_current_query_sampled();
+
 	/*
-	 * We can't process the query if no query_string is provided, as
-	 * pgss_store needs it.  We also ignore query without queryid, as it would
-	 * be treated as a utility statement, which may not be the case.
+	 * We can't process the query if it is not sampled or
+	 * if no query_string is provided, as pgss_store needs it.
+	 * We also ignore query without queryid, as it would be treated
+	 * as a utility statement, which may not be the case.
 	 */
-	if (pgss_enabled(nesting_level)
+	if (pgss_enabled(nesting_level) && current_query_sampled
 		&& pgss_track_planning && query_string
 		&& parse->queryId != UINT64CONST(0))
 	{
@@ -994,12 +1017,16 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	else
 		standard_ExecutorStart(queryDesc, eflags);
 
+	/* Whether the current query is sampled based on sample_rate. */
+	update_current_query_sampled();
+
 	/*
-	 * If query has queryId zero, don't track it.  This prevents double
-	 * counting of optimizable statements that are directly contained in
-	 * utility statements.
+	 * If query has queryId zero or if query is not sampled, don't track it.
+	 * This prevents double counting of optimizable statements
+	 * that are directly contained in utility statements.
 	 */
-	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+	if (pgss_enabled(nesting_level) && current_query_sampled &&
+		queryDesc->plannedstmt->queryId != UINT64CONST(0))
 	{
 		/*
 		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
@@ -1111,7 +1138,13 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	uint64		saved_queryId = pstmt->queryId;
 	int			saved_stmt_location = pstmt->stmt_location;
 	int			saved_stmt_len = pstmt->stmt_len;
-	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level);
+	bool		enabled;
+
+	/* Whether the current query is sampled based on sample_rate. */
+	update_current_query_sampled();
+
+	enabled = pgss_track_utility && pgss_enabled(nesting_level) &&
+				current_query_sampled;
 
 	/*
 	 * Force utility statements to get queryId zero.  We do this even in cases
@@ -3011,3 +3044,25 @@ comp_location(const void *a, const void *b)
 
 	return pg_cmp_s32(l, r);
 }
+
+/*
+ * At the beginning of each top-level statement, decide whether we'll
+ * sample this statement.  If nested-statement tracking is enabled,
+ * either all nested statements will be tracked or none will.
+ *
+ * When in a parallel worker, we should do nothing, which we can implement
+ * cheaply by pretending we decided not to sample the current statement.
+ */
+static void
+update_current_query_sampled(void)
+{
+	if (nesting_level == 0)
+	{
+		if (!IsParallelWorker())
+			current_query_sampled = pgss_sample_rate != 0.0 &&
+				(pgss_sample_rate == 1.0 ||
+				pg_prng_double(&pg_global_prng_state) <= pgss_sample_rate);
+		else
+			current_query_sampled = false;
+	}
+}
\ No newline at end of file
diff --git a/contrib/pg_stat_statements/sql/select.sql b/contrib/pg_stat_statements/sql/select.sql
index e0be58d5e2..21f09ef94a 100644
--- a/contrib/pg_stat_statements/sql/select.sql
+++ b/contrib/pg_stat_statements/sql/select.sql
@@ -59,6 +59,27 @@ DEALLOCATE pgss_test;
 SELECT calls, rows, query FROM pg_stat_statements ORDER BY query COLLATE "C";
 SELECT pg_stat_statements_reset() IS NOT NULL AS t;
 
+--
+-- sample statements
+--
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+
+set pg_stat_statements.sample_rate = 0.0;
+select pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 \parse stmt
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+set pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+select pg_stat_statements_reset() IS NOT NULL AS t;
+
 --
 -- queries with locking clauses
 --
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..8ade4e3ced 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,24 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      track a fraction of the statements in each session. The default is 1,
+      meaning track all the queries. In case of nested statements,
+      either all will be tracked or none. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

#38

Sami Imseih

samimseih@gmail.com

12 months ago

In reply to: Ilia Evdokimov (#37)

Re: Sample rate added to pg_stat_statements

Hi,

I have a few comments about the patch.

1/
tests are missing for pg_stat_statements.track=all and with a sample
rate < 1 applied.
Can we add tests for this case?

This declaration:

+static bool current_query_sampled = false;

should be moved near the declaration of nesting_level,
the other local variable.

3/
I do not really like the calls to update_current_query_sampled()
the way there are, and I think the pgss_enabled should encapsulate
the new sample rate check as well.

My thoughts are:

1/ change the name of the function from update_current_query_sampled
to current_query_sampled

Also, the static bool should be called is_query_sampled or something like that.

The function should also allow you to skip the sample rate check, such as if
called from pgss_post_parse_analyze.

We can also remove the IsParallelWorker() check in this case, since that
is done in pgss_enabled.

something like this is what I am thinking:

static bool
current_query_sampled(bool skip)
{
if (skip)
return true;

if (nesting_level == 0)
{
is_query_sampled = pgss_sample_rate != 0.0 &&
(pgss_sample_rate == 1.0 ||

pg_prng_double(&pg_global_prng_state) <= pgss_sample_rate);
}

return is_query_sampled;
}

#define pgss_enabled(level, skip_sampling_check) \
(!IsParallelWorker() && \
(pgss_track == PGSS_TRACK_ALL || \
(pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
current_query_sampled(skip_sampling_check))

What do you think?

4/ Now we have pg_stat_statements.track = "off" which is effectively
the same thing as pg_stat_statements.sample_rate = "0". Does this need
to be called out in docs?

Regards,

Sami

#39

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

12 months ago

In reply to: Sami Imseih (#38)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

On 28.01.2025 02:00, Sami Imseih wrote:

Hi,

I have a few comments about the patch.

1/
tests are missing for pg_stat_statements.track=all and with a sample
rate < 1 applied.
Can we add tests for this case?

I think I’ve come up with a good idea for implementing this. We can
create a new file, sampling.sql, for regression tests, where we move
written tests in select.out and check test with setting
"pg_stat_statements.track = 'all'" and "pg_stat_statements.sample_rate =
0.5". Additionally, we can create two files in the 'expected' directory:
sampling.out and sampling_1.out where we can verify either all nested
statements are tracked or none of them.

2/

This declaration:

+static bool current_query_sampled = false;

should be moved near the declaration of nesting_level,
the other local variable.

3/
I do not really like the calls to update_current_query_sampled()
the way there are, and I think the pgss_enabled should encapsulate
the new sample rate check as well.

My thoughts are:

1/ change the name of the function from update_current_query_sampled
to current_query_sampled

Also, the static bool should be called is_query_sampled or something like that.

The function should also allow you to skip the sample rate check, such as if
called from pgss_post_parse_analyze.

We can also remove the IsParallelWorker() check in this case, since that
is done in pgss_enabled.

something like this is what I am thinking:

static bool
current_query_sampled(bool skip)
{
if (skip)
return true;

if (nesting_level == 0)
{
is_query_sampled = pgss_sample_rate != 0.0 &&
(pgss_sample_rate == 1.0 ||

pg_prng_double(&pg_global_prng_state) <= pgss_sample_rate);
}

return is_query_sampled;
}

#define pgss_enabled(level, skip_sampling_check) \
(!IsParallelWorker() && \
(pgss_track == PGSS_TRACK_ALL || \
(pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
current_query_sampled(skip_sampling_check))

What do you think?

I agree with all of these changes; they make the code more readable.
Adding comments to the current_query_sampled() function will make it
even clearer.

4/ Now we have pg_stat_statements.track = "off" which is effectively
the same thing as pg_stat_statements.sample_rate = "0". Does this need
to be called out in docs?

+1, because we have the same thing in log_statement_sample_rate.

All the changes mentioned above are included in the v13 patch. Since the
patch status is 'Ready for Committer,' I believe it is now better for
upstream inclusion, with improved details in tests and documentation. Do
you have any further suggestions?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v13-0001-Allow-setting-sample-rate-for-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v13-0001-Allow-setting-sample-rate-for-pg_stat_statements.patchDownload

From 149688fe3bccce25f8d01dfe8c01444555dce03c Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Date: Tue, 28 Jan 2025 05:31:20 +0300
Subject: [PATCH] Allow setting sample rate for pg_stat_statements

New configuration parameter pg_stat_statements.sample_rate makes it
possible to track just a fraction of the queries meeting the configured
threshold, to reduce the amount of tracking.
---
 contrib/pg_stat_statements/Makefile           |  2 +-
 .../pg_stat_statements/expected/sampling.out  | 99 +++++++++++++++++++
 .../expected/sampling_1.out                   | 97 ++++++++++++++++++
 .../pg_stat_statements/pg_stat_statements.c   | 59 +++++++++--
 contrib/pg_stat_statements/sql/sampling.sql   | 34 +++++++
 doc/src/sgml/pgstatstatements.sgml            | 19 ++++
 6 files changed, 302 insertions(+), 8 deletions(-)
 create mode 100644 contrib/pg_stat_statements/expected/sampling.out
 create mode 100644 contrib/pg_stat_statements/expected/sampling_1.out
 create mode 100644 contrib/pg_stat_statements/sql/sampling.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 241c02587b..b70bdfaf2d 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -20,7 +20,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS))
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf
 REGRESS = select dml cursors utility level_tracking planning \
 	user_activity wal entry_timestamp privileges extended \
-	parallel cleanup oldextversions
+	parallel sampling cleanup oldextversions
 # Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
 # which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
diff --git a/contrib/pg_stat_statements/expected/sampling.out b/contrib/pg_stat_statements/expected/sampling.out
new file mode 100644
index 0000000000..6e73a0a318
--- /dev/null
+++ b/contrib/pg_stat_statements/expected/sampling.out
@@ -0,0 +1,99 @@
+--
+-- sample statements
+--
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+       query        | calls 
+--------------------+-------
+ SELECT $1 AS "int" |     1
+(1 row)
+
+-- sample_rate is (0, 1) and track = "all"
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0.5;
+-- either all nested statements will be tracked (sample.out) or none will (sample_1.out).
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+    ORDER BY query COLLATE "C", toplevel;
+ toplevel | calls |                       query                        
+----------+-------+----------------------------------------------------
+ t        |     1 | EXPLAIN (COSTS OFF) SELECT $1
+ f        |     1 | SELECT $1
+ t        |     1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
+ t        |     1 | SET pg_stat_statements.sample_rate = $1
+(4 rows)
+
+SET pg_stat_statements.track = "top";
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 \parse stmt
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+   query   | calls 
+-----------+-------
+ SELECT $1 |     1
+(1 row)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
diff --git a/contrib/pg_stat_statements/expected/sampling_1.out b/contrib/pg_stat_statements/expected/sampling_1.out
new file mode 100644
index 0000000000..3a390226fa
--- /dev/null
+++ b/contrib/pg_stat_statements/expected/sampling_1.out
@@ -0,0 +1,97 @@
+--
+-- sample statements
+--
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+       query        | calls 
+--------------------+-------
+ SELECT $1 AS "int" |     1
+(1 row)
+
+-- sample_rate is (0, 1) and track = "all"
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0.5;
+-- either all nested statements will be tracked (sample.out) or none will (sample_1.out).
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+    ORDER BY query COLLATE "C", toplevel;
+ toplevel | calls |                       query                        
+----------+-------+----------------------------------------------------
+ t        |     1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
+ t        |     1 | SET pg_stat_statements.sample_rate = $1
+(2 rows)
+
+SET pg_stat_statements.track = "top";
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 \parse stmt
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+   query   | calls 
+-----------+-------
+ SELECT $1 |     1
+(1 row)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index bebf8134eb..afd0537713 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,6 +50,7 @@
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -256,6 +257,9 @@ typedef struct pgssSharedState
 /* Current nesting depth of planner/ExecutorRun/ProcessUtility calls */
 static int	nesting_level = 0;
 
+/* Is the current top-level query to be sampled? */
+static bool is_query_sampled = false;
+
 /* Saved hook values */
 static shmem_request_hook_type prev_shmem_request_hook = NULL;
 static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
@@ -294,12 +298,14 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1.0; /* fraction of statements to track */
 
 
-#define pgss_enabled(level) \
+#define pgss_enabled(level, skip_sampling_check) \
 	(!IsParallelWorker() && \
 	(pgss_track == PGSS_TRACK_ALL || \
-	(pgss_track == PGSS_TRACK_TOP && (level) == 0)))
+	(pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
+	current_query_sampled(skip_sampling_check))
 
 #define record_gc_qtexts() \
 	do { \
@@ -373,6 +379,7 @@ static char *generate_normalized_query(JumbleState *jstate, const char *query,
 static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
 									 int query_loc);
 static int	comp_location(const void *a, const void *b);
+static bool	current_query_sampled(bool skip);
 
 
 /*
@@ -414,6 +421,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to track.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -836,7 +856,7 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
 
 	/* Safety check... */
-	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level))
+	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level, true))
 		return;
 
 	/*
@@ -894,7 +914,7 @@ pgss_planner(Query *parse,
 	 * pgss_store needs it.  We also ignore query without queryid, as it would
 	 * be treated as a utility statement, which may not be the case.
 	 */
-	if (pgss_enabled(nesting_level)
+	if (pgss_enabled(nesting_level, false)
 		&& pgss_track_planning && query_string
 		&& parse->queryId != UINT64CONST(0))
 	{
@@ -999,7 +1019,8 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	 * counting of optimizable statements that are directly contained in
 	 * utility statements.
 	 */
-	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+	if (pgss_enabled(nesting_level, false) &&
+		queryDesc->plannedstmt->queryId != UINT64CONST(0))
 	{
 		/*
 		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
@@ -1068,7 +1089,7 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 	uint64		queryId = queryDesc->plannedstmt->queryId;
 
 	if (queryId != UINT64CONST(0) && queryDesc->totaltime &&
-		pgss_enabled(nesting_level))
+		pgss_enabled(nesting_level, false))
 	{
 		/*
 		 * Make sure stats accumulation is done.  (Note: it's okay if several
@@ -1111,7 +1132,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	uint64		saved_queryId = pstmt->queryId;
 	int			saved_stmt_location = pstmt->stmt_location;
 	int			saved_stmt_len = pstmt->stmt_len;
-	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level);
+	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level, false);
 
 	/*
 	 * Force utility statements to get queryId zero.  We do this even in cases
@@ -3011,3 +3032,27 @@ comp_location(const void *a, const void *b)
 
 	return pg_cmp_s32(l, r);
 }
+
+/*
+ * Determine whether the current query should be sampled.
+ *
+ * In some cases, sampling may need to be skipped entirely, such as during
+ * query parsing for normalization.
+ *
+ * At the beginning of each top-level statement, decide whether we'll
+ * sample this statement. If nested-statement tracking is enabled,
+ * either all nested statements will be tracked or none will.
+ */
+static bool
+current_query_sampled(bool skip)
+{
+	if (skip)
+		return true;
+
+	if (nesting_level == 0)
+		is_query_sampled = pgss_sample_rate != 0.0 &&
+			(pgss_sample_rate == 1.0 ||
+			pg_prng_double(&pg_global_prng_state) <= pgss_sample_rate);
+
+	return is_query_sampled;
+}
\ No newline at end of file
diff --git a/contrib/pg_stat_statements/sql/sampling.sql b/contrib/pg_stat_statements/sql/sampling.sql
new file mode 100644
index 0000000000..4dabe1540c
--- /dev/null
+++ b/contrib/pg_stat_statements/sql/sampling.sql
@@ -0,0 +1,34 @@
+--
+-- sample statements
+--
+
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+
+
+-- sample_rate is (0, 1) and track = "all"
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.sample_rate = 0.5;
+
+-- either all nested statements will be tracked (sample.out) or none will (sample_1.out).
+EXPLAIN (COSTS OFF) SELECT 1;
+SELECT toplevel, calls, query FROM pg_stat_statements
+    ORDER BY query COLLATE "C", toplevel;
+SET pg_stat_statements.track = "top";
+
+
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 \parse stmt
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
\ No newline at end of file
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..cd3fdaeffe 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,25 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      track a fraction of the statements in each session. The default is <literal>1</literal>,
+      meaning track all the queries. Setting this to <literal>0</literal> disables sampled statements
+      tracking, the same as setting <varname>pg_stat_statements.track</varname> to <literal>none</literal>.
+      In case of nested statements, either all will be tracked or none. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

#40

Sami Imseih

samimseih@gmail.com

12 months ago

In reply to: Ilia Evdokimov (#39)

2 attachment(s)

Re: Sample rate added to pg_stat_statements

All the changes mentioned above are included in the v13 patch. Since the
patch status is 'Ready for Committer,' I believe it is now better for
upstream inclusion, with improved details in tests and documentation. Do
you have any further suggestions?

I am not quite clear on the sample_1.out. I do like the idea of separating
the sample tests, but I was thinking of something a bit more simple.
What do you think of my attached, sampling.sql, test? It tests sample
rate in both
simple and extended query protocols and for both top level and
nested levels?

If anyone has the capability to run this benchmark on machines with more
CPUs or with different queries, it would be nice. I’d appreciate any
suggestions or feedback.

I wanted to share some additional benchmarks I ran as well
on a r8g.48xlarge ( 192 vCPUs, 1,536 GiB of memory) configured
with 16GB of shared_buffers. I also attached the benchmark.sh
script used to generate the output.
The benchmark is running the select-only pgbench workload,
so we have a single heavily contentious entry, which is the
worst case.

The test shows that the spinlock (SpinDelay waits)
becomes an issue at high connection counts and will
become worse on larger machines. A sample_rate going from
1 to .75 shows a 60% improvement; but this is on a single
contentious entry. Most workloads will likely not see this type
of improvement. I also could not really observe
this type of difference on smaller machines ( i.e. 32 vCPUs),
as expected.

## init
pgbench -i -s500

### 192 connections
pgbench -c192 -j20 -S -Mprepared -T120 --progress 10

sample_rate = 1
tps = 484338.769799 (without initial connection time)
waits
-----
11107 SpinDelay
9568 CPU
929 ClientRead
13 DataFileRead
3 BufferMapping

sample_rate = .75
tps = 909547.562124 (without initial connection time)
waits
-----
12079 CPU
4781 SpinDelay
2100 ClientRead

sample_rate = .5
tps = 1028594.555273 (without initial connection time)
waits
-----
13253 CPU
3378 ClientRead
174 SpinDelay

sample_rate = .25
tps = 1019507.126313 (without initial connection time)
waits
-----
13397 CPU
3423 ClientRead

sample_rate = 0
tps = 1015425.288538 (without initial connection time)
waits
-----
13106 CPU
3502 ClientRead

### 32 connections
pgbench -c32 -j20 -S -Mprepared -T120 --progress 10

sample_rate = 1
tps = 620667.049565 (without initial connection time)
waits
-----
1782 CPU
560 ClientRead

sample_rate = .75
tps = 620663.131347 (without initial connection time)
waits
-----
1736 CPU
554 ClientRead

sample_rate = .5
tps = 624094.688239 (without initial connection time)
waits
-----
1741 CPU
648 ClientRead

sample_rate = .25
tps = 628638.538204 (without initial connection time)
waits
-----
1702 CPU
576 ClientRead

sample_rate = 0
tps = 630483.464912 (without initial connection time)
waits
-----
1638 CPU
574 ClientRead

Regards,

Sami

#41

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

12 months ago

In reply to: Sami Imseih (#40)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

On 28.01.2025 20:21, Sami Imseih wrote:

All the changes mentioned above are included in the v13 patch. Since the
patch status is 'Ready for Committer,' I believe it is now better for
upstream inclusion, with improved details in tests and documentation. Do
you have any further suggestions?

I am not quite clear on the sample_1.out. I do like the idea of separating
the sample tests, but I was thinking of something a bit more simple.
What do you think of my attached, sampling.sql, test? It tests sample
rate in both
simple and extended query protocols and for both top level and
nested levels?

That sounds great! I've added your sample.sql file to my v14
patch. However, I was focused on testing sample_rate values between 0
and 1. The approach I came up with was using the sample{_1}.out files.
I’ve removed the test involving those files for now, but if the
committer prefers to keep it, I can reintroduce them.

If anyone has the capability to run this benchmark on machines with more
CPUs or with different queries, it would be nice. I’d appreciate any
suggestions or feedback.

I wanted to share some additional benchmarks I ran as well
on a r8g.48xlarge ( 192 vCPUs, 1,536 GiB of memory) configured
with 16GB of shared_buffers. I also attached the benchmark.sh
script used to generate the output.
The benchmark is running the select-only pgbench workload,
so we have a single heavily contentious entry, which is the
worst case.

The test shows that the spinlock (SpinDelay waits)
becomes an issue at high connection counts and will
become worse on larger machines. A sample_rate going from
1 to .75 shows a 60% improvement; but this is on a single
contentious entry. Most workloads will likely not see this type
of improvement. I also could not really observe
this type of difference on smaller machines ( i.e. 32 vCPUs),
as expected.

## init
pgbench -i -s500

### 192 connections
pgbench -c192 -j20 -S -Mprepared -T120 --progress 10

sample_rate = 1
tps = 484338.769799 (without initial connection time)
waits
-----
11107 SpinDelay
9568 CPU
929 ClientRead
13 DataFileRead
3 BufferMapping

sample_rate = .75
tps = 909547.562124 (without initial connection time)
waits
-----
12079 CPU
4781 SpinDelay
2100 ClientRead

sample_rate = .5
tps = 1028594.555273 (without initial connection time)
waits
-----
13253 CPU
3378 ClientRead
174 SpinDelay

sample_rate = .25
tps = 1019507.126313 (without initial connection time)
waits
-----
13397 CPU
3423 ClientRead

sample_rate = 0
tps = 1015425.288538 (without initial connection time)
waits
-----
13106 CPU
3502 ClientRead

### 32 connections
pgbench -c32 -j20 -S -Mprepared -T120 --progress 10

sample_rate = 1
tps = 620667.049565 (without initial connection time)
waits
-----
1782 CPU
560 ClientRead

sample_rate = .75
tps = 620663.131347 (without initial connection time)
waits
-----
1736 CPU
554 ClientRead

sample_rate = .5
tps = 624094.688239 (without initial connection time)
waits
-----
1741 CPU
648 ClientRead

sample_rate = .25
tps = 628638.538204 (without initial connection time)
waits
-----
1702 CPU
576 ClientRead

sample_rate = 0
tps = 630483.464912 (without initial connection time)
waits
-----
1638 CPU
574 ClientRead

Regards,

Sami

Thank you so much for benchmarking this on a pretty large machine with a
large number of CPUs. The results look fantastic, and I truly appreciate
your effort.

BWT, I realized that the 'sampling' test needs to be added not only to
the Makefile but also to meson.build. I've included that in the v14 patch.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v14-0001-Allow-setting-sample-rate-for-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v14-0001-Allow-setting-sample-rate-for-pg_stat_statements.patchDownload

From 7c9d45325e29a65259740d5255d39f9f57ee6fba Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Date: Tue, 28 Jan 2025 23:04:34 +0300
Subject: [PATCH] Allow setting sample rate for pg_stat_statements

New configuration parameter pg_stat_statements.sample_rate makes it
possible to track just a fraction of the queries meeting the configured
threshold, to reduce the amount of tracking.
---
 contrib/pg_stat_statements/Makefile           |   2 +-
 .../pg_stat_statements/expected/sampling.out  | 174 ++++++++++++++++++
 contrib/pg_stat_statements/meson.build        |   1 +
 .../pg_stat_statements/pg_stat_statements.c   |  59 +++++-
 contrib/pg_stat_statements/sql/sampling.sql   |  50 +++++
 doc/src/sgml/pgstatstatements.sgml            |  19 ++
 6 files changed, 297 insertions(+), 8 deletions(-)
 create mode 100644 contrib/pg_stat_statements/expected/sampling.out
 create mode 100644 contrib/pg_stat_statements/sql/sampling.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 241c02587b..b70bdfaf2d 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -20,7 +20,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS))
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf
 REGRESS = select dml cursors utility level_tracking planning \
 	user_activity wal entry_timestamp privileges extended \
-	parallel cleanup oldextversions
+	parallel sampling cleanup oldextversions
 # Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
 # which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
diff --git a/contrib/pg_stat_statements/expected/sampling.out b/contrib/pg_stat_statements/expected/sampling.out
new file mode 100644
index 0000000000..2204215f64
--- /dev/null
+++ b/contrib/pg_stat_statements/expected/sampling.out
@@ -0,0 +1,174 @@
+--
+-- sample statements
+--
+-- top-level tracking - simple query protocol
+SHOW pg_stat_statements.track;
+ pg_stat_statements.track 
+--------------------------
+ top
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+       query        | calls 
+--------------------+-------
+ SELECT $1 AS "int" |     1
+(1 row)
+
+-- top-level tracking - extended query protocol
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 \parse stmt
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+   query   | calls 
+-----------+-------
+ SELECT $1 |     1
+(1 row)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+DEALLOCATE stmt;
+-- nested tracking - simple query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0;
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+ toplevel | calls |                       query                        
+----------+-------+----------------------------------------------------
+ t        |     2 | EXPLAIN (COSTS OFF) SELECT $1
+ f        |     2 | SELECT $1
+ t        |     1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
+ t        |     2 | SET pg_stat_statements.sample_rate = $1
+(4 rows)
+
+-- nested tracking - extended query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1; \parse stmt
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0;
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+ toplevel | calls |                  query                  
+----------+-------+-----------------------------------------
+ t        |     2 | EXPLAIN (COSTS OFF) SELECT $1
+ f        |     3 | SELECT $1
+ t        |     1 | SET pg_stat_statements.sample_rate = $1
+(3 rows)
+
diff --git a/contrib/pg_stat_statements/meson.build b/contrib/pg_stat_statements/meson.build
index 4446af58c5..351354a6a5 100644
--- a/contrib/pg_stat_statements/meson.build
+++ b/contrib/pg_stat_statements/meson.build
@@ -54,6 +54,7 @@ tests += {
       'privileges',
       'extended',
       'parallel',
+      'sampling',
       'cleanup',
       'oldextversions',
     ],
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index bebf8134eb..afd0537713 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,6 +50,7 @@
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -256,6 +257,9 @@ typedef struct pgssSharedState
 /* Current nesting depth of planner/ExecutorRun/ProcessUtility calls */
 static int	nesting_level = 0;
 
+/* Is the current top-level query to be sampled? */
+static bool is_query_sampled = false;
+
 /* Saved hook values */
 static shmem_request_hook_type prev_shmem_request_hook = NULL;
 static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
@@ -294,12 +298,14 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1.0; /* fraction of statements to track */
 
 
-#define pgss_enabled(level) \
+#define pgss_enabled(level, skip_sampling_check) \
 	(!IsParallelWorker() && \
 	(pgss_track == PGSS_TRACK_ALL || \
-	(pgss_track == PGSS_TRACK_TOP && (level) == 0)))
+	(pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
+	current_query_sampled(skip_sampling_check))
 
 #define record_gc_qtexts() \
 	do { \
@@ -373,6 +379,7 @@ static char *generate_normalized_query(JumbleState *jstate, const char *query,
 static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
 									 int query_loc);
 static int	comp_location(const void *a, const void *b);
+static bool	current_query_sampled(bool skip);
 
 
 /*
@@ -414,6 +421,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to track.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -836,7 +856,7 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
 
 	/* Safety check... */
-	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level))
+	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level, true))
 		return;
 
 	/*
@@ -894,7 +914,7 @@ pgss_planner(Query *parse,
 	 * pgss_store needs it.  We also ignore query without queryid, as it would
 	 * be treated as a utility statement, which may not be the case.
 	 */
-	if (pgss_enabled(nesting_level)
+	if (pgss_enabled(nesting_level, false)
 		&& pgss_track_planning && query_string
 		&& parse->queryId != UINT64CONST(0))
 	{
@@ -999,7 +1019,8 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	 * counting of optimizable statements that are directly contained in
 	 * utility statements.
 	 */
-	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+	if (pgss_enabled(nesting_level, false) &&
+		queryDesc->plannedstmt->queryId != UINT64CONST(0))
 	{
 		/*
 		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
@@ -1068,7 +1089,7 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 	uint64		queryId = queryDesc->plannedstmt->queryId;
 
 	if (queryId != UINT64CONST(0) && queryDesc->totaltime &&
-		pgss_enabled(nesting_level))
+		pgss_enabled(nesting_level, false))
 	{
 		/*
 		 * Make sure stats accumulation is done.  (Note: it's okay if several
@@ -1111,7 +1132,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	uint64		saved_queryId = pstmt->queryId;
 	int			saved_stmt_location = pstmt->stmt_location;
 	int			saved_stmt_len = pstmt->stmt_len;
-	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level);
+	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level, false);
 
 	/*
 	 * Force utility statements to get queryId zero.  We do this even in cases
@@ -3011,3 +3032,27 @@ comp_location(const void *a, const void *b)
 
 	return pg_cmp_s32(l, r);
 }
+
+/*
+ * Determine whether the current query should be sampled.
+ *
+ * In some cases, sampling may need to be skipped entirely, such as during
+ * query parsing for normalization.
+ *
+ * At the beginning of each top-level statement, decide whether we'll
+ * sample this statement. If nested-statement tracking is enabled,
+ * either all nested statements will be tracked or none will.
+ */
+static bool
+current_query_sampled(bool skip)
+{
+	if (skip)
+		return true;
+
+	if (nesting_level == 0)
+		is_query_sampled = pgss_sample_rate != 0.0 &&
+			(pgss_sample_rate == 1.0 ||
+			pg_prng_double(&pg_global_prng_state) <= pgss_sample_rate);
+
+	return is_query_sampled;
+}
\ No newline at end of file
diff --git a/contrib/pg_stat_statements/sql/sampling.sql b/contrib/pg_stat_statements/sql/sampling.sql
new file mode 100644
index 0000000000..b09f45991b
--- /dev/null
+++ b/contrib/pg_stat_statements/sql/sampling.sql
@@ -0,0 +1,50 @@
+--
+-- sample statements
+--
+
+-- top-level tracking - simple query protocol
+SHOW pg_stat_statements.track;
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+
+-- top-level tracking - extended query protocol
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 \parse stmt
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+DEALLOCATE stmt;
+
+-- nested tracking - simple query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+SET pg_stat_statements.sample_rate = 0;
+EXPLAIN (COSTS OFF) SELECT 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+
+-- nested tracking - extended query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1; \parse stmt
+\bind_named stmt \g
+\bind_named stmt \g
+SET pg_stat_statements.sample_rate = 0;
+\bind_named stmt \g
+\bind_named stmt \g
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..cd3fdaeffe 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,25 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      track a fraction of the statements in each session. The default is <literal>1</literal>,
+      meaning track all the queries. Setting this to <literal>0</literal> disables sampled statements
+      tracking, the same as setting <varname>pg_stat_statements.track</varname> to <literal>none</literal>.
+      In case of nested statements, either all will be tracked or none. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

#42

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

12 months ago

In reply to: Ilia Evdokimov (#41)

Re: Sample rate added to pg_stat_statements

On 28.01.2025 23:50, Ilia Evdokimov wrote:

If anyone has the capability to run this benchmark on machines with
more
CPUs or with different queries, it would be nice. I’d appreciate any
suggestions or feedback.

I wanted to share some additional benchmarks I ran as well
on a r8g.48xlarge ( 192 vCPUs, 1,536 GiB of memory) configured
with 16GB of shared_buffers. I also attached the benchmark.sh
script used to generate the output.
The benchmark is running the select-only pgbench workload,
so we have a single heavily contentious entry, which is the
worst case.

The test shows that the spinlock (SpinDelay waits)
becomes an issue at high connection counts and will
become worse on larger machines. A sample_rate going from
1 to .75 shows a 60% improvement; but this is on a single
contentious entry. Most workloads will likely not see this type
of improvement. I also could not really observe
this type of difference on smaller machines ( i.e. 32 vCPUs),
as expected.

## init
pgbench -i -s500

### 192 connections
pgbench -c192 -j20 -S -Mprepared -T120 --progress 10

sample_rate = 1
tps = 484338.769799 (without initial connection time)
waits
-----
   11107 SpinDelay
    9568 CPU
     929 ClientRead
      13 DataFileRead
       3 BufferMapping

sample_rate = .75
tps = 909547.562124 (without initial connection time)
waits
-----
   12079 CPU
    4781 SpinDelay
    2100 ClientRead

sample_rate = .5
tps = 1028594.555273 (without initial connection time)
waits
-----
   13253 CPU
    3378 ClientRead
     174 SpinDelay

sample_rate = .25
tps = 1019507.126313 (without initial connection time)
waits
-----
   13397 CPU
    3423 ClientRead

sample_rate = 0
tps = 1015425.288538 (without initial connection time)
waits
-----
   13106 CPU
    3502 ClientRead

### 32 connections
pgbench -c32 -j20 -S -Mprepared -T120 --progress 10

sample_rate = 1
tps = 620667.049565 (without initial connection time)
waits
-----
    1782 CPU
     560 ClientRead

sample_rate = .75
tps = 620663.131347 (without initial connection time)
waits
-----
    1736 CPU
     554 ClientRead

sample_rate = .5
tps = 624094.688239 (without initial connection time)
waits
-----
    1741 CPU
     648 ClientRead

sample_rate = .25
tps = 628638.538204 (without initial connection time)
waits
-----
    1702 CPU
     576 ClientRead

sample_rate = 0
tps = 630483.464912 (without initial connection time)
waits
-----
    1638 CPU
     574 ClientRead

Regards,

Sami

Thank you so much for benchmarking this on a pretty large machine with
a large number of CPUs. The results look fantastic, and I truly
appreciate your effort.

BWT, I realized that the 'sampling' test needs to be added not only to
the Makefile but also to meson.build. I've included that in the v14
patch.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

In my opinion, if we can't observe bottleneck of spinlock on 32 CPUs, we
should determine the CPU count at which it becomes. This will help us
understand the scale of the problem. Does this make sense, or are there
really no real workloads where the same query runs on more than 32 CPUs,
and we've been trying to solve a non-existent problem?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#43

Sami Imseih

samimseih@gmail.com

12 months ago

In reply to: Ilia Evdokimov (#42)

Re: Sample rate added to pg_stat_statements

and we've been trying to solve a non-existent problem?

Anecdotally speaking, Most users will encounter bottlenecks
due to exclusive locks on pgss hash ( dealloc and garbage collection )
more often than they will on the spinlock. The spinlock will likely
become more of an issue when you are not seeing much dealloc/gc
and running on a very large machine. So, I don't think it's a non-existent
problem, but I don't think it's the most common problem.

Regards,

Sami

#44

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

12 months ago

In reply to: Ilia Evdokimov (#42)

Re: Sample rate added to pg_stat_statements

On 29.01.2025 21:52, Ilia Evdokimov wrote:

... I also attached the benchmark.sh
script used to generate the output.

In my opinion, if we can't observe bottleneck of spinlock on 32 CPUs,
we should determine the CPU count at which it becomes. This will help
us understand the scale of the problem. Does this make sense, or are
there really no real workloads where the same query runs on more than
32 CPUs, and we've been trying to solve a non-existent problem?

I ran the same benchmark on 48 CPUs for -c 48 -j 20 for objectivity.

### 48 connections
pgbench -c48 -j20 -S -Mprepared -T120 --progress 10

sample_rate = 1
tps = 643251.640175 (without initial connection time)
waits
-----
    932 ClientRead
    911 CPU
     44 SpinDelay

sample_rate = .75
tps = 653946.777122 (without initial connection time)
waits
-----
    939 CPU
    875 ClientRead
      3 SpinDelay

sample_rate = .5
tps = 651654.348463 (without initial connection time)
waits
-----
932 ClientRead
841 CPU

sample_rate = .25
tps = 652668.807245 (without initial connection time)
waits
-----
910 ClientRead
860 CPU

sample_rate = 0
tps = 659111.347019 (without initial connection time)
waits
-----
882 ClientRead
849 CPU

There is a small amount ofSpinDelay, as the user mentioned. However, we
can identify the threshold where the problem appears.

To summarize the results of all benchmarks, I compiled them into a table:

CPUs | sample_rate | tps | CPU waits | ClientRead wait | SpinDelay wait
192 | 1.0 | 484338|      9568 |             929 | 11107
192 |    0.75 | 909547|      12079 |            2100 | 4781
192 | 0.5 |1028594|     13253 |            3378 | 174
192 |    0.25 |1019507|     13397 |            3423 | -
192 | 0.0 |1015425|     13106 |            3502 | -

   48 | 1.0 | 643251|       911 |             932 | 44
   48 |    0.75 | 653946|       939 |             939 | 3
   48 | 0.5 | 651654|       841 |             932 | -
   48 |    0.25 | 652668|       860 |             910 | -
   48 | 0.0 | 659111|       849 |             882 | -

   32 | 1.0 | 620667|      1782 |             560 | -
   32 |    0.75 | 620667|      1736 |             554 | -
   32 | 0.5 | 624094|      1741 |             648 | -
   32 |    0.25 | 628638|      1702 |             576 | -
   32 | 0.0 | 630483|      1638 |             574 | -

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#45

Sami Imseih

samimseih@gmail.com

11 months ago

In reply to: Ilia Evdokimov (#44)

Re: Sample rate added to pg_stat_statements

To summarize the results of all benchmarks, I compiled them into a table:

Thanks for compiling the benchmark data above.

The main benefit of this patch will be to give the user
a toggle if they are observing high spinlock contention
due to pg_stat_statements which will likely occur
on larger machines.

I also did not see this thread [1]/messages/by-id/2895b53b033c47ccb22972b589050dd9@EX13D05UWC001.ant.amazon.com mentioned in the thread above,
but one of the reasons pg_stat_statements.track_planning
was turned off by default was due to the additional spinlock
acquire to track the planning stats. Bringing this up as sample_rate
might also be beneficial as well if a user decides to track planning.

Regards,

Sami

[1]: /messages/by-id/2895b53b033c47ccb22972b589050dd9@EX13D05UWC001.ant.amazon.com

#46

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

11 months ago

In reply to: Sami Imseih (#45)

Re: Sample rate added to pg_stat_statements

On 04.02.2025 20:59, Sami Imseih wrote:

To summarize the results of all benchmarks, I compiled them into a table:

Thanks for compiling the benchmark data above.

The main benefit of this patch will be to give the user
a toggle if they are observing high spinlock contention
due to pg_stat_statements which will likely occur
on larger machines.

I also did not see this thread [1] mentioned in the thread above,
but one of the reasons pg_stat_statements.track_planning
was turned off by default was due to the additional spinlock
acquire to track the planning stats. Bringing this up as sample_rate
might also be beneficial as well if a user decides to track planning.

Regards,

Sami

[1] /messages/by-id/2895b53b033c47ccb22972b589050dd9@EX13D05UWC001.ant.amazon.com

Thanks for the thread. As we can see, simply enabling or disabling not
only track_planning but also other pg-stat_statements parameters is too
radical a measure for managing performance. With the introduction of
sample_rate, users now have more flexibility in controlling spinlock
contention. This allows them to balance overhead and statistics
collection rather than completely disabling the feature.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#47

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

11 months ago

In reply to: Ilia Evdokimov (#46)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

Hi hackers,

Since current patch is in the commitfest with the status 'Ready for
committer', I’d like to summarize what it does, the problems it
addresses, and answer the key questions raised in the discussion thread.

Enabling pg_stat_statements can cause a performance drop due to two main
reasons:

* Contention on the exclusive lock, which blocks the hash table when
allocating or deallocating an entry.
* Contention on the spinlock, which blocks access to an existing entry.

The exclusive lock issue can be mitigated by normalizing similar queries
to the same queryid. There is an ongoing thread discussing this approach
for IN and ARRAY queries: [0]https://commitfest.postgresql.org/51/2837/. Another solution is integrating
pg_stat_statements into central pgstats with the custom APIs, without
pushing the module into core [1]/messages/by-id/Zz0MFPq1s4WFxxpL@paquier.xyz.

However, even if we do all that, the spinlock contention problem will
persist for very frequent queries. This is exactly what I aim to solve
with sampling. Other approaches, such as replacing the spinlock with
atomic variables or an lwlock, have not yielded good results [2]https://commitfest.postgresql.org/29/2634/. Even
track_planning was disabled by default to reduce contention, but on
high-core systems, the issue still remains.So far, I see no alternative
solution to this problem.

The benchmark results on different number of CPUs demonstrating the
impact of this patch can be found here:

pgbench -c{CPUs} -j20 -S -Mprepared -T120
CPUs | sample_rate | tps | CPU waits | ClientRead wait | SpinDelay wait
   192 | 1.0 | 484338|      9568 |             929 | 11107
   192 |    0.75 | 909547|     12079 |            2100 | 4781
   192 | 0.5 |1028594|     13253 |            3378 | 174
   192 |    0.25 |1019507|     13397 |            3423 | -
   192 | 0.0 |1015425|     13106 |            3502 | -

    48 | 1.0 | 643251|       911 |             932 | 44
    48 |    0.75 | 653946|       939 |             939 | 3
    48 | 0.5 | 651654|       841 |             932 | -
    48 |    0.25 | 652668|       860 |             910 | -
    48 | 0.0 | 659111|       849 |             882 | -

    32 | 1.0 | 620667|      1782 |             560 | -
    32 |    0.75 | 620663|      1736 |             554 | -
    32 | 0.5 | 624094|      1741 |             648 | -
    32 |    0.25 | 628638|      1702 |             576 | -
    32 | 0.0 | 630483|      1638 |             574 | -

Some suggested sampling based on execution time instead of a random
generator. While this is a good idea, it has a critical limitation.
Execution time can only be checked in pgss_ExecutorEnd. However,
pgss_planner already interacts with the spinlock, and we can not sample
based on execution time at that stage. Finding a criterion that works
for both pgss_planner and pgss_ExecutorEnd is difficult. This is why
random sampling remains the most viable option.

BTW, Instead of using a bool flag to check if we are inside
pgss_post_parse_analyze, we now pass a pgssStoreKind to pgss_enabled.
This makes the logic clearer and more readable, explicitly indicating
where we need to check whether a query should be sampled. I made it in
new v15 patch.

Are there any remaining concerns or alternative suggestions regarding
this approach? I'm happy to discuss further.

[0]: https://commitfest.postgresql.org/51/2837/
[1]: /messages/by-id/Zz0MFPq1s4WFxxpL@paquier.xyz
[2]: https://commitfest.postgresql.org/29/2634/

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v15-0001-Allow-setting-sample-rate-for-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v15-0001-Allow-setting-sample-rate-for-pg_stat_statements.patchDownload

From c9540fac9f6d6ad6d15acdaad844ca6c4ecba455 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.ru>
Date: Mon, 10 Feb 2025 11:00:46 +0300
Subject: [PATCH] Allow setting sample rate for pg_stat_statements

New configuration parameter pg_stat_statements.sample_rate makes it
possible to track just a fraction of the queries meeting the configured
threshold, to reduce the amount of tracking.
---
 contrib/pg_stat_statements/Makefile           |   2 +-
 .../pg_stat_statements/expected/sampling.out  | 174 ++++++++++++++++++
 contrib/pg_stat_statements/meson.build        |   1 +
 .../pg_stat_statements/pg_stat_statements.c   |  53 +++++-
 contrib/pg_stat_statements/sql/sampling.sql   |  50 +++++
 doc/src/sgml/pgstatstatements.sgml            |  19 ++
 6 files changed, 291 insertions(+), 8 deletions(-)
 create mode 100644 contrib/pg_stat_statements/expected/sampling.out
 create mode 100644 contrib/pg_stat_statements/sql/sampling.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 241c02587b..b70bdfaf2d 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -20,7 +20,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS))
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf
 REGRESS = select dml cursors utility level_tracking planning \
 	user_activity wal entry_timestamp privileges extended \
-	parallel cleanup oldextversions
+	parallel sampling cleanup oldextversions
 # Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
 # which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
diff --git a/contrib/pg_stat_statements/expected/sampling.out b/contrib/pg_stat_statements/expected/sampling.out
new file mode 100644
index 0000000000..2204215f64
--- /dev/null
+++ b/contrib/pg_stat_statements/expected/sampling.out
@@ -0,0 +1,174 @@
+--
+-- sample statements
+--
+-- top-level tracking - simple query protocol
+SHOW pg_stat_statements.track;
+ pg_stat_statements.track 
+--------------------------
+ top
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+       query        | calls 
+--------------------+-------
+ SELECT $1 AS "int" |     1
+(1 row)
+
+-- top-level tracking - extended query protocol
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 \parse stmt
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+   query   | calls 
+-----------+-------
+ SELECT $1 |     1
+(1 row)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+DEALLOCATE stmt;
+-- nested tracking - simple query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0;
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+ toplevel | calls |                       query                        
+----------+-------+----------------------------------------------------
+ t        |     2 | EXPLAIN (COSTS OFF) SELECT $1
+ f        |     2 | SELECT $1
+ t        |     1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
+ t        |     2 | SET pg_stat_statements.sample_rate = $1
+(4 rows)
+
+-- nested tracking - extended query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1; \parse stmt
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0;
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+ toplevel | calls |                  query                  
+----------+-------+-----------------------------------------
+ t        |     2 | EXPLAIN (COSTS OFF) SELECT $1
+ f        |     3 | SELECT $1
+ t        |     1 | SET pg_stat_statements.sample_rate = $1
+(3 rows)
+
diff --git a/contrib/pg_stat_statements/meson.build b/contrib/pg_stat_statements/meson.build
index 4446af58c5..351354a6a5 100644
--- a/contrib/pg_stat_statements/meson.build
+++ b/contrib/pg_stat_statements/meson.build
@@ -54,6 +54,7 @@ tests += {
       'privileges',
       'extended',
       'parallel',
+      'sampling',
       'cleanup',
       'oldextversions',
     ],
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index bebf8134eb..9e79e6ac33 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,6 +50,7 @@
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -256,6 +257,9 @@ typedef struct pgssSharedState
 /* Current nesting depth of planner/ExecutorRun/ProcessUtility calls */
 static int	nesting_level = 0;
 
+/* Is the current top-level query to be sampled? */
+static bool is_query_sampled = false;
+
 /* Saved hook values */
 static shmem_request_hook_type prev_shmem_request_hook = NULL;
 static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
@@ -294,12 +298,14 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1.0; /* fraction of statements to track */
 
 
-#define pgss_enabled(level) \
+#define pgss_enabled(level, skip_sampling_check) \
 	(!IsParallelWorker() && \
 	(pgss_track == PGSS_TRACK_ALL || \
-	(pgss_track == PGSS_TRACK_TOP && (level) == 0)))
+	(pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
+	(skip_sampling_check == PGSS_INVALID || current_query_sampled()))
 
 #define record_gc_qtexts() \
 	do { \
@@ -373,6 +379,7 @@ static char *generate_normalized_query(JumbleState *jstate, const char *query,
 static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
 									 int query_loc);
 static int	comp_location(const void *a, const void *b);
+static bool	current_query_sampled(void);
 
 
 /*
@@ -414,6 +421,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to track.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -836,7 +856,7 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
 
 	/* Safety check... */
-	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level))
+	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level, PGSS_INVALID))
 		return;
 
 	/*
@@ -894,7 +914,7 @@ pgss_planner(Query *parse,
 	 * pgss_store needs it.  We also ignore query without queryid, as it would
 	 * be treated as a utility statement, which may not be the case.
 	 */
-	if (pgss_enabled(nesting_level)
+	if (pgss_enabled(nesting_level, PGSS_PLAN)
 		&& pgss_track_planning && query_string
 		&& parse->queryId != UINT64CONST(0))
 	{
@@ -999,7 +1019,8 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	 * counting of optimizable statements that are directly contained in
 	 * utility statements.
 	 */
-	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+	if (pgss_enabled(nesting_level, PGSS_EXEC) &&
+		queryDesc->plannedstmt->queryId != UINT64CONST(0))
 	{
 		/*
 		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
@@ -1068,7 +1089,7 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 	uint64		queryId = queryDesc->plannedstmt->queryId;
 
 	if (queryId != UINT64CONST(0) && queryDesc->totaltime &&
-		pgss_enabled(nesting_level))
+		pgss_enabled(nesting_level, PGSS_EXEC))
 	{
 		/*
 		 * Make sure stats accumulation is done.  (Note: it's okay if several
@@ -1111,7 +1132,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	uint64		saved_queryId = pstmt->queryId;
 	int			saved_stmt_location = pstmt->stmt_location;
 	int			saved_stmt_len = pstmt->stmt_len;
-	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level);
+	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level, PGSS_EXEC);
 
 	/*
 	 * Force utility statements to get queryId zero.  We do this even in cases
@@ -3011,3 +3032,21 @@ comp_location(const void *a, const void *b)
 
 	return pg_cmp_s32(l, r);
 }
+
+/*
+ * Determine whether the current query should be sampled.
+ *
+ * At the beginning of each top-level statement, decide whether we'll
+ * sample this statement. If nested-statement tracking is enabled,
+ * either all nested statements will be tracked or none will.
+ */
+static bool
+current_query_sampled(void)
+{
+	if (nesting_level == 0)
+		is_query_sampled = pgss_sample_rate != 0.0 &&
+			(pgss_sample_rate == 1.0 ||
+			pg_prng_double(&pg_global_prng_state) <= pgss_sample_rate);
+
+	return is_query_sampled;
+}
\ No newline at end of file
diff --git a/contrib/pg_stat_statements/sql/sampling.sql b/contrib/pg_stat_statements/sql/sampling.sql
new file mode 100644
index 0000000000..b09f45991b
--- /dev/null
+++ b/contrib/pg_stat_statements/sql/sampling.sql
@@ -0,0 +1,50 @@
+--
+-- sample statements
+--
+
+-- top-level tracking - simple query protocol
+SHOW pg_stat_statements.track;
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+
+-- top-level tracking - extended query protocol
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 \parse stmt
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+DEALLOCATE stmt;
+
+-- nested tracking - simple query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+SET pg_stat_statements.sample_rate = 0;
+EXPLAIN (COSTS OFF) SELECT 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+
+-- nested tracking - extended query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1; \parse stmt
+\bind_named stmt \g
+\bind_named stmt \g
+SET pg_stat_statements.sample_rate = 0;
+\bind_named stmt \g
+\bind_named stmt \g
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..cd3fdaeffe 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,25 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      track a fraction of the statements in each session. The default is <literal>1</literal>,
+      meaning track all the queries. Setting this to <literal>0</literal> disables sampled statements
+      tracking, the same as setting <varname>pg_stat_statements.track</varname> to <literal>none</literal>.
+      In case of nested statements, either all will be tracked or none. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

#48

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

11 months ago

In reply to: Ilia Evdokimov (#47)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

Hi hackers,

I've decided to explore a slightly different approach to reducing
spinlock contention—by introducing a simple execution time threshold. If
a query’s execution time exceeds this threshold, it is recorded in
pg_stat_statements; otherwise, it is ignored. As Alexander [0]/messages/by-id/CAPpHfdsTKAQqC3A48-MGQhrhfEamXZPb64w=utk7thQcOMNr7Q@mail.gmail.com pointed
out, this helps retain valuable queries for further analysis. A similar
mechanism is already present in auto_explain and pg_store_plans. When
pg_stat_statements.track_min_duration = -1, disable tracking. If
pg_stat_statements.track_min_duration = -1, all statements are tracked.

I benchmarked this approach using -M prepared -S on my machine with 48
CPUs. However, I couldn’t reproduce spinlock contention because the
machine isn’t large enough to create sufficient concurrency.
Nevertheless, I’m sharing my results for reference and checking correct
results of threshold.

Here’s the benchmarking procedure I followed:
    createdb pgbench
    pgbench -i -s 3000 pgbench
    psql -c 'SELECT pg_stat_statements_reset()'
    pgbench -c 46 -j 46 -T 120 -M prepared -S --progress=10 pgbench

select query, calls, min_exec_time, max_exec_time, mean_exec_time,
stddev_exec_time from pg_stat_statements where query = 'SELECT abalance
FROM pgbench_accounts WHERE aid = $1';

track_min_duration | calls | min_exec_time | max_exec_time |
mean_exec_time | stddev_exec_time
                 0 | 111282955 |       0.00365 |    15.56946 |
0.015042374707317802 | 0.06067634978916631
                 5 |       458 |       5.00627 |    15.699129 |
5.962879746724887 | 1.1432124887616204
                10 |        14 |      10.538461 |    16.113204 |
12.415218999999999 | 1.5598854455354354
                20 |         - |      - |    - |
   - | -
                -1 |         - |      - |    - |
   - | -

I’d greatly appreciate any feedback on this alternative approach, as
well as benchmarking on a pretty large machine to see its impact at scale.

[0]: /messages/by-id/CAPpHfdsTKAQqC3A48-MGQhrhfEamXZPb64w=utk7thQcOMNr7Q@mail.gmail.com
/messages/by-id/CAPpHfdsTKAQqC3A48-MGQhrhfEamXZPb64w=utk7thQcOMNr7Q@mail.gmail.com

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v16-0001-Allow-setting-execution-time-threshold-for-pgss.patchtext/x-patch; charset=UTF-8; name=v16-0001-Allow-setting-execution-time-threshold-for-pgss.patchDownload

From aaf82d481aeb40458dfd7debfb3bcd254f33cdee Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.ru>
Date: Fri, 14 Feb 2025 15:40:36 +0300
Subject: [PATCH v16] Allow setting execution time threshold for
 pg_stat_statements

New configuration parameter pg_stat_statements.track_min_duration makes it
possible to track statements whose execution time is greater than threshold,
to reduce the amount of tracking.
---
 contrib/pg_stat_statements/Makefile           |   2 +-
 .../pg_stat_statements/expected/duration.out  | 201 ++++++++++++++++++
 contrib/pg_stat_statements/meson.build        |   1 +
 .../pg_stat_statements/pg_stat_statements.c   |  79 ++++---
 contrib/pg_stat_statements/sql/duration.sql   |  58 +++++
 doc/src/sgml/pgstatstatements.sgml            |  21 ++
 6 files changed, 332 insertions(+), 30 deletions(-)
 create mode 100644 contrib/pg_stat_statements/expected/duration.out
 create mode 100644 contrib/pg_stat_statements/sql/duration.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 241c02587b..4b2a59d7c3 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -20,7 +20,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS))
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf
 REGRESS = select dml cursors utility level_tracking planning \
 	user_activity wal entry_timestamp privileges extended \
-	parallel cleanup oldextversions
+	parallel duration cleanup oldextversions
 # Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
 # which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
diff --git a/contrib/pg_stat_statements/expected/duration.out b/contrib/pg_stat_statements/expected/duration.out
new file mode 100644
index 0000000000..8bbd4b0183
--- /dev/null
+++ b/contrib/pg_stat_statements/expected/duration.out
@@ -0,0 +1,201 @@
+--
+-- track_min_duration
+--
+-- top-level tracking - simple query protocol
+SHOW pg_stat_statements.track;
+ pg_stat_statements.track 
+--------------------------
+ top
+(1 row)
+
+SET pg_stat_statements.track_min_duration = -1;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.track_min_duration = 0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+       query        | calls 
+--------------------+-------
+ SELECT $1 AS "int" |     1
+(1 row)
+
+-- top-level tracking - extended query protocol
+SET pg_stat_statements.track_min_duration = -1;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 \parse stmt
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.track_min_duration = 0;
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+   query   | calls 
+-----------+-------
+ SELECT $1 |     1
+(1 row)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+DEALLOCATE stmt;
+-- nested tracking - simple query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.track_min_duration = 0;
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SET pg_stat_statements.track_min_duration = -1;
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+ toplevel | calls |                       query                        
+----------+-------+----------------------------------------------------
+ t        |     2 | EXPLAIN (COSTS OFF) SELECT $1
+ f        |     2 | SELECT $1
+ t        |     1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
+ t        |     2 | SET pg_stat_statements.track_min_duration = $1
+(4 rows)
+
+-- nested tracking - extended query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.track_min_duration = 0;
+EXPLAIN (COSTS OFF) SELECT 1; \parse stmt
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SET pg_stat_statements.track_min_duration = -1;
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+ toplevel | calls |                     query                      
+----------+-------+------------------------------------------------
+ t        |     2 | EXPLAIN (COSTS OFF) SELECT $1
+ f        |     3 | SELECT $1
+ t        |     1 | SET pg_stat_statements.track_min_duration = $1
+(3 rows)
+
+-- threshold check
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.track_min_duration = 100;
+SELECT pg_sleep(0.001);	-- 1   ms < pg_stat_statements.track_min_duration
+ pg_sleep 
+----------
+ 
+(1 row)
+
+SELECT pg_sleep(0.5);	-- 500 ms > pg_stat_statements.track_min_duration
+ pg_sleep 
+----------
+ 
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+ toplevel | calls |        query        
+----------+-------+---------------------
+ t        |     1 | SELECT pg_sleep($1)
+(1 row)
+
diff --git a/contrib/pg_stat_statements/meson.build b/contrib/pg_stat_statements/meson.build
index 4446af58c5..2bc918bcd3 100644
--- a/contrib/pg_stat_statements/meson.build
+++ b/contrib/pg_stat_statements/meson.build
@@ -54,6 +54,7 @@ tests += {
       'privileges',
       'extended',
       'parallel',
+      'duration',
       'cleanup',
       'oldextversions',
     ],
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index bebf8134eb..0eb509d7ca 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -290,6 +290,7 @@ static const struct config_enum_entry track_options[] =
 
 static int	pgss_max = 5000;	/* max # statements to track */
 static int	pgss_track = PGSS_TRACK_TOP;	/* tracking level */
+static int	pgss_track_min_duration = 0; /* msec or -1 */
 static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
@@ -414,6 +415,18 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomIntVariable("pg_stat_statements.track_min_duration",
+							"Sets the minimum execution time above which plans will be tracked.",
+							"Zero tracks all plans. -1 turns this feature off.",
+							&pgss_track_min_duration,
+							0,
+							-1, INT_MAX,
+							PGC_SUSET,
+							GUC_UNIT_MS,
+							NULL,
+							NULL,
+							NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -999,7 +1012,8 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	 * counting of optimizable statements that are directly contained in
 	 * utility statements.
 	 */
-	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+	if (pgss_enabled(nesting_level) && pgss_track_min_duration >= 0 &&
+		queryDesc->plannedstmt->queryId != UINT64CONST(0))
 	{
 		/*
 		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
@@ -1068,27 +1082,30 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 	uint64		queryId = queryDesc->plannedstmt->queryId;
 
 	if (queryId != UINT64CONST(0) && queryDesc->totaltime &&
-		pgss_enabled(nesting_level))
+		pgss_enabled(nesting_level) && pgss_track_min_duration >= 0)
 	{
+		double		msec;
 		/*
 		 * Make sure stats accumulation is done.  (Note: it's okay if several
 		 * levels of hook all do this.)
 		 */
 		InstrEndLoop(queryDesc->totaltime);
 
-		pgss_store(queryDesc->sourceText,
-				   queryId,
-				   queryDesc->plannedstmt->stmt_location,
-				   queryDesc->plannedstmt->stmt_len,
-				   PGSS_EXEC,
-				   queryDesc->totaltime->total * 1000.0,	/* convert to msec */
-				   queryDesc->estate->es_total_processed,
-				   &queryDesc->totaltime->bufusage,
-				   &queryDesc->totaltime->walusage,
-				   queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
-				   NULL,
-				   queryDesc->estate->es_parallel_workers_to_launch,
-				   queryDesc->estate->es_parallel_workers_launched);
+		msec = queryDesc->totaltime->total * 1000.0;
+		if (msec >= pgss_track_min_duration)
+			pgss_store(queryDesc->sourceText,
+						queryId,
+						queryDesc->plannedstmt->stmt_location,
+						queryDesc->plannedstmt->stmt_len,
+						PGSS_EXEC,
+						msec,
+						queryDesc->estate->es_total_processed,
+						&queryDesc->totaltime->bufusage,
+						&queryDesc->totaltime->walusage,
+						queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
+						NULL,
+						queryDesc->estate->es_parallel_workers_to_launch,
+						queryDesc->estate->es_parallel_workers_launched);
 	}
 
 	if (prev_ExecutorEnd)
@@ -1111,7 +1128,8 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	uint64		saved_queryId = pstmt->queryId;
 	int			saved_stmt_location = pstmt->stmt_location;
 	int			saved_stmt_len = pstmt->stmt_len;
-	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level);
+	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level) &&
+						  pgss_track_min_duration >= 0;
 
 	/*
 	 * Force utility statements to get queryId zero.  We do this even in cases
@@ -1154,6 +1172,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 					bufusage;
 		WalUsage	walusage_start,
 					walusage;
+		double		msec;
 
 		bufusage_start = pgBufferUsage;
 		walusage_start = pgWalUsage;
@@ -1209,19 +1228,21 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 		memset(&walusage, 0, sizeof(WalUsage));
 		WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
 
-		pgss_store(queryString,
-				   saved_queryId,
-				   saved_stmt_location,
-				   saved_stmt_len,
-				   PGSS_EXEC,
-				   INSTR_TIME_GET_MILLISEC(duration),
-				   rows,
-				   &bufusage,
-				   &walusage,
-				   NULL,
-				   NULL,
-				   0,
-				   0);
+		msec = INSTR_TIME_GET_MILLISEC(duration);
+		if (msec >= pgss_track_min_duration)
+			pgss_store(queryString,
+						saved_queryId,
+						saved_stmt_location,
+						saved_stmt_len,
+						PGSS_EXEC,
+						msec,
+						rows,
+						&bufusage,
+						&walusage,
+						NULL,
+						NULL,
+						0,
+						0);
 	}
 	else
 	{
diff --git a/contrib/pg_stat_statements/sql/duration.sql b/contrib/pg_stat_statements/sql/duration.sql
new file mode 100644
index 0000000000..b99c8288bb
--- /dev/null
+++ b/contrib/pg_stat_statements/sql/duration.sql
@@ -0,0 +1,58 @@
+--
+-- track_min_duration
+--
+
+-- top-level tracking - simple query protocol
+SHOW pg_stat_statements.track;
+SET pg_stat_statements.track_min_duration = -1;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.track_min_duration = 0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+
+-- top-level tracking - extended query protocol
+SET pg_stat_statements.track_min_duration = -1;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 \parse stmt
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.track_min_duration = 0;
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+DEALLOCATE stmt;
+
+-- nested tracking - simple query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.track_min_duration = 0;
+EXPLAIN (COSTS OFF) SELECT 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+SET pg_stat_statements.track_min_duration = -1;
+EXPLAIN (COSTS OFF) SELECT 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+
+-- nested tracking - extended query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.track_min_duration = 0;
+EXPLAIN (COSTS OFF) SELECT 1; \parse stmt
+\bind_named stmt \g
+\bind_named stmt \g
+SET pg_stat_statements.track_min_duration = -1;
+\bind_named stmt \g
+\bind_named stmt \g
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+
+-- threshold check
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.track_min_duration = 100;
+SELECT pg_sleep(0.001);	-- 1   ms < pg_stat_statements.track_min_duration
+SELECT pg_sleep(0.5);	-- 500 ms > pg_stat_statements.track_min_duration
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index 501b468e9a..21047f6a7d 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -936,6 +936,27 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.track_min_duration</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.track_min_duration</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.track_min_duration</varname> is the minimum statement
+      execution time, in milliseconds, that will cause the statement to
+      be tracked. Setting this to <literal>0</literal> tracks all statements.
+      <literal>-1</literal> (the default) disables tracking of statements. For
+      example, if you set it to <literal>250ms</literal> then all statements
+      that run 250ms or longer will be tracked. Only superusers can change this
+      setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

#49

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

11 months ago

In reply to: Ilia Evdokimov (#48)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

On 14.02.2025 16:17, Ilia Evdokimov wrote:

Hi hackers,

I've decided to explore a slightly different approach to reducing
spinlock contention—by introducing a simple execution time threshold.
If a query’s execution time exceeds this threshold, it is recorded in
pg_stat_statements; otherwise, it is ignored. As Alexander [0] pointed
out, this helps retain valuable queries for further analysis. A
similar mechanism is already present in auto_explain and
pg_store_plans. When pg_stat_statements.track_min_duration = -1,
disable tracking. If pg_stat_statements.track_min_duration = -1, all
statements are tracked.

I benchmarked this approach using -M prepared -S on my machine with 48
CPUs. However, I couldn’t reproduce spinlock contention because the
machine isn’t large enough to create sufficient concurrency.
Nevertheless, I’m sharing my results for reference and checking
correct results of threshold.

Here’s the benchmarking procedure I followed:
    createdb pgbench
    pgbench -i -s 3000 pgbench
    psql -c 'SELECT pg_stat_statements_reset()'
    pgbench -c 46 -j 46 -T 120 -M prepared -S --progress=10 pgbench

select query, calls, min_exec_time, max_exec_time, mean_exec_time,
stddev_exec_time from pg_stat_statements where query = 'SELECT
abalance FROM pgbench_accounts WHERE aid = $1';

track_min_duration | calls | min_exec_time | max_exec_time |
mean_exec_time | stddev_exec_time
                 0 | 111282955 |       0.00365 |    15.56946 |
0.015042374707317802 | 0.06067634978916631
                 5 |       458 |       5.00627 |    15.699129 |
5.962879746724887 | 1.1432124887616204
                10 |        14 |      10.538461 |    16.113204 |
12.415218999999999 | 1.5598854455354354
                20 |         - |      - |    - |
   - | -
                -1 |         - |      - |    - |
   - | -

I’d greatly appreciate any feedback on this alternative approach, as
well as benchmarking on a pretty large machine to see its impact at
scale.

[0]:
/messages/by-id/CAPpHfdsTKAQqC3A48-MGQhrhfEamXZPb64w=utk7thQcOMNr7Q@mail.gmail.com

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

I have rebased the v16 patch to v17. I updated the extra description in
accordance with commit 977d865 [0]https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=977d865c36cc6ed4e813b8545b09d06f51534011 and removed the additional test with
pg_sleep, as it may produce different results on different platforms.

[0]: https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=977d865c36cc6ed4e813b8545b09d06f51534011
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=977d865c36cc6ed4e813b8545b09d06f51534011

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v17-0001-Allow-setting-execution-time-threshold-for-pgss.patchtext/x-patch; charset=UTF-8; name=v17-0001-Allow-setting-execution-time-threshold-for-pgss.patchDownload

From 6b12d65a1a1570d2e6f24c491c25f0b0c432b88e Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.ru>
Date: Mon, 17 Feb 2025 12:18:06 +0300
Subject: [PATCH v17] Allow setting execution time threshold for
 pg_stat_statements

New configuration parameter pg_stat_statements.track_min_duration makes it
possible to track statements whose execution time is greater than threshold,
to reduce the amount of tracking.
---
 contrib/pg_stat_statements/Makefile           |   2 +-
 .../expected/min_duration.out                 | 174 ++++++++++++++++++
 contrib/pg_stat_statements/meson.build        |   1 +
 .../pg_stat_statements/pg_stat_statements.c   |  79 +++++---
 .../pg_stat_statements/sql/min_duration.sql   |  50 +++++
 doc/src/sgml/pgstatstatements.sgml            |  21 +++
 6 files changed, 297 insertions(+), 30 deletions(-)
 create mode 100644 contrib/pg_stat_statements/expected/min_duration.out
 create mode 100644 contrib/pg_stat_statements/sql/min_duration.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 241c02587b..df512e428e 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -20,7 +20,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS))
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf
 REGRESS = select dml cursors utility level_tracking planning \
 	user_activity wal entry_timestamp privileges extended \
-	parallel cleanup oldextversions
+	parallel min_duration cleanup oldextversions
 # Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
 # which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
diff --git a/contrib/pg_stat_statements/expected/min_duration.out b/contrib/pg_stat_statements/expected/min_duration.out
new file mode 100644
index 0000000000..ef6f2a4437
--- /dev/null
+++ b/contrib/pg_stat_statements/expected/min_duration.out
@@ -0,0 +1,174 @@
+--
+-- track_min_duration
+--
+-- top-level tracking - simple query protocol
+SHOW pg_stat_statements.track;
+ pg_stat_statements.track 
+--------------------------
+ top
+(1 row)
+
+SET pg_stat_statements.track_min_duration = -1;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.track_min_duration = 0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+       query        | calls 
+--------------------+-------
+ SELECT $1 AS "int" |     1
+(1 row)
+
+-- top-level tracking - extended query protocol
+SET pg_stat_statements.track_min_duration = -1;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 \parse stmt
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.track_min_duration = 0;
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+   query   | calls 
+-----------+-------
+ SELECT $1 |     1
+(1 row)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+DEALLOCATE stmt;
+-- nested tracking - simple query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.track_min_duration = 0;
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SET pg_stat_statements.track_min_duration = -1;
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+ toplevel | calls |                       query                        
+----------+-------+----------------------------------------------------
+ t        |     2 | EXPLAIN (COSTS OFF) SELECT $1
+ f        |     2 | SELECT $1
+ t        |     1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
+ t        |     2 | SET pg_stat_statements.track_min_duration = $1
+(4 rows)
+
+-- nested tracking - extended query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.track_min_duration = 0;
+EXPLAIN (COSTS OFF) SELECT 1; \parse stmt
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SET pg_stat_statements.track_min_duration = -1;
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+ toplevel | calls |                     query                      
+----------+-------+------------------------------------------------
+ t        |     2 | EXPLAIN (COSTS OFF) SELECT $1
+ f        |     3 | SELECT $1
+ t        |     1 | SET pg_stat_statements.track_min_duration = $1
+(3 rows)
+
diff --git a/contrib/pg_stat_statements/meson.build b/contrib/pg_stat_statements/meson.build
index 4446af58c5..6427665ee1 100644
--- a/contrib/pg_stat_statements/meson.build
+++ b/contrib/pg_stat_statements/meson.build
@@ -54,6 +54,7 @@ tests += {
       'privileges',
       'extended',
       'parallel',
+      'min_duration',
       'cleanup',
       'oldextversions',
     ],
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 4a3e855ce9..e90cf50fdb 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -291,6 +291,7 @@ static const struct config_enum_entry track_options[] =
 
 static int	pgss_max = 5000;	/* max # statements to track */
 static int	pgss_track = PGSS_TRACK_TOP;	/* tracking level */
+static int	pgss_track_min_duration = 0; /* msec or -1 */
 static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
@@ -415,6 +416,18 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomIntVariable("pg_stat_statements.track_min_duration",
+							"Sets the minimum execution time above which statements will be tracked.",
+							"0 tracks all statements. -1 turns this feature off.",
+							&pgss_track_min_duration,
+							0,
+							-1, INT_MAX,
+							PGC_SUSET,
+							GUC_UNIT_MS,
+							NULL,
+							NULL,
+							NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -1000,7 +1013,8 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	 * counting of optimizable statements that are directly contained in
 	 * utility statements.
 	 */
-	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+	if (pgss_enabled(nesting_level) && pgss_track_min_duration >= 0 &&
+		queryDesc->plannedstmt->queryId != UINT64CONST(0))
 	{
 		/*
 		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
@@ -1069,27 +1083,30 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 	uint64		queryId = queryDesc->plannedstmt->queryId;
 
 	if (queryId != UINT64CONST(0) && queryDesc->totaltime &&
-		pgss_enabled(nesting_level))
+		pgss_enabled(nesting_level) && pgss_track_min_duration >= 0)
 	{
+		double		msec;
 		/*
 		 * Make sure stats accumulation is done.  (Note: it's okay if several
 		 * levels of hook all do this.)
 		 */
 		InstrEndLoop(queryDesc->totaltime);
 
-		pgss_store(queryDesc->sourceText,
-				   queryId,
-				   queryDesc->plannedstmt->stmt_location,
-				   queryDesc->plannedstmt->stmt_len,
-				   PGSS_EXEC,
-				   queryDesc->totaltime->total * 1000.0,	/* convert to msec */
-				   queryDesc->estate->es_total_processed,
-				   &queryDesc->totaltime->bufusage,
-				   &queryDesc->totaltime->walusage,
-				   queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
-				   NULL,
-				   queryDesc->estate->es_parallel_workers_to_launch,
-				   queryDesc->estate->es_parallel_workers_launched);
+		msec = queryDesc->totaltime->total * 1000.0;
+		if (msec >= pgss_track_min_duration)
+			pgss_store(queryDesc->sourceText,
+						queryId,
+						queryDesc->plannedstmt->stmt_location,
+						queryDesc->plannedstmt->stmt_len,
+						PGSS_EXEC,
+						msec,
+						queryDesc->estate->es_total_processed,
+						&queryDesc->totaltime->bufusage,
+						&queryDesc->totaltime->walusage,
+						queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
+						NULL,
+						queryDesc->estate->es_parallel_workers_to_launch,
+						queryDesc->estate->es_parallel_workers_launched);
 	}
 
 	if (prev_ExecutorEnd)
@@ -1112,7 +1129,8 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	uint64		saved_queryId = pstmt->queryId;
 	int			saved_stmt_location = pstmt->stmt_location;
 	int			saved_stmt_len = pstmt->stmt_len;
-	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level);
+	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level) &&
+						  pgss_track_min_duration >= 0;
 
 	/*
 	 * Force utility statements to get queryId zero.  We do this even in cases
@@ -1155,6 +1173,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 					bufusage;
 		WalUsage	walusage_start,
 					walusage;
+		double		msec;
 
 		bufusage_start = pgBufferUsage;
 		walusage_start = pgWalUsage;
@@ -1210,19 +1229,21 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 		memset(&walusage, 0, sizeof(WalUsage));
 		WalUsageAccumDiff(&walusage, &pgWalUsage, &walusage_start);
 
-		pgss_store(queryString,
-				   saved_queryId,
-				   saved_stmt_location,
-				   saved_stmt_len,
-				   PGSS_EXEC,
-				   INSTR_TIME_GET_MILLISEC(duration),
-				   rows,
-				   &bufusage,
-				   &walusage,
-				   NULL,
-				   NULL,
-				   0,
-				   0);
+		msec = INSTR_TIME_GET_MILLISEC(duration);
+		if (msec >= pgss_track_min_duration)
+			pgss_store(queryString,
+						saved_queryId,
+						saved_stmt_location,
+						saved_stmt_len,
+						PGSS_EXEC,
+						msec,
+						rows,
+						&bufusage,
+						&walusage,
+						NULL,
+						NULL,
+						0,
+						0);
 	}
 	else
 	{
diff --git a/contrib/pg_stat_statements/sql/min_duration.sql b/contrib/pg_stat_statements/sql/min_duration.sql
new file mode 100644
index 0000000000..63903cc388
--- /dev/null
+++ b/contrib/pg_stat_statements/sql/min_duration.sql
@@ -0,0 +1,50 @@
+--
+-- track_min_duration
+--
+
+-- top-level tracking - simple query protocol
+SHOW pg_stat_statements.track;
+SET pg_stat_statements.track_min_duration = -1;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.track_min_duration = 0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+
+-- top-level tracking - extended query protocol
+SET pg_stat_statements.track_min_duration = -1;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 \parse stmt
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.track_min_duration = 0;
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+DEALLOCATE stmt;
+
+-- nested tracking - simple query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.track_min_duration = 0;
+EXPLAIN (COSTS OFF) SELECT 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+SET pg_stat_statements.track_min_duration = -1;
+EXPLAIN (COSTS OFF) SELECT 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+
+-- nested tracking - extended query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.track_min_duration = 0;
+EXPLAIN (COSTS OFF) SELECT 1; \parse stmt
+\bind_named stmt \g
+\bind_named stmt \g
+SET pg_stat_statements.track_min_duration = -1;
+\bind_named stmt \g
+\bind_named stmt \g
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
\ No newline at end of file
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index e2ac1c2d50..cae56ae153 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -945,6 +945,27 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.track_min_duration</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.track_min_duration</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.track_min_duration</varname> is the minimum statement
+      execution time, in milliseconds, that will cause the statement to
+      be tracked. Setting this to <literal>0</literal> tracks all statements.
+      <literal>-1</literal> (the default) disables tracking of statements. For
+      example, if you set it to <literal>250ms</literal> then all statements
+      that run 250ms or longer will be tracked. Only superusers can change this
+      setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

#50

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

11 months ago

In reply to: Ilia Evdokimov (#48)

Re: Sample rate added to pg_stat_statements

On 14.02.2025 16:17, Ilia Evdokimov wrote:

Hi hackers,

I've decided to explore a slightly different approach to reducing
spinlock contention—by introducing a simple execution time threshold.
If a query’s execution time exceeds this threshold, it is recorded in
pg_stat_statements; otherwise, it is ignored. As Alexander [0] pointed
out, this helps retain valuable queries for further analysis. A
similar mechanism is already present in auto_explain and
pg_store_plans. When pg_stat_statements.track_min_duration = -1,
disable tracking. If pg_stat_statements.track_min_duration = -1, all
statements are tracked.

I benchmarked this approach using -M prepared -S on my machine with 48
CPUs. However, I couldn’t reproduce spinlock contention because the
machine isn’t large enough to create sufficient concurrency.
Nevertheless, I’m sharing my results for reference and checking
correct results of threshold.

Here’s the benchmarking procedure I followed:
    createdb pgbench
    pgbench -i -s 3000 pgbench
    psql -c 'SELECT pg_stat_statements_reset()'
    pgbench -c 46 -j 46 -T 120 -M prepared -S --progress=10 pgbench

select query, calls, min_exec_time, max_exec_time, mean_exec_time,
stddev_exec_time from pg_stat_statements where query = 'SELECT
abalance FROM pgbench_accounts WHERE aid = $1';

track_min_duration | calls | min_exec_time | max_exec_time |
mean_exec_time | stddev_exec_time
                 0 | 111282955 |       0.00365 |    15.56946 |
0.015042374707317802 | 0.06067634978916631
                 5 |       458 |       5.00627 |    15.699129 |
5.962879746724887 | 1.1432124887616204
                10 |        14 |      10.538461 |    16.113204 |
12.415218999999999 | 1.5598854455354354
                20 |         - |      - |    - |
   - | -
                -1 |         - |      - |    - |
   - | -

I’d greatly appreciate any feedback on this alternative approach, as
well as benchmarking on a pretty large machine to see its impact at
scale.

[0]:
/messages/by-id/CAPpHfdsTKAQqC3A48-MGQhrhfEamXZPb64w=utk7thQcOMNr7Q@mail.gmail.com

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Hi hackers,

I rebased this patch to v18 to fix an inaccurate additional description
of the GUC parameter. The -1 value should be described first, followed by 0.

Does anyone have other suggestions on how we could sample queries in
pg_stat_statements to reduce spin-lock contention on entries?

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#51

Sami Imseih

samimseih@gmail.com

11 months ago

In reply to: Ilia Evdokimov (#50)

Re: Sample rate added to pg_stat_statements

I don't see v18 attached.

But, I think it's wrong that you changed the design of this patch
(sample rate to duration based ) after it was ready for committer.

This CF [0]https://commitfest.postgresql.org/patch/5390/ should go back to "Needs Review" if this is the case.

Sami

[0]: https://commitfest.postgresql.org/patch/5390/

#52

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

11 months ago

In reply to: Sami Imseih (#51)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

On 19.02.2025 22:11, Sami Imseih wrote:

I don't see v18 attached.

Ah, sorry, I attached v18.

But, I think it's wrong that you changed the design of this patch
(sample rate to duration based ) after it was ready for committer.

I've decided to reconsider how to better reduce the load on the
spin-lock in pg_stat_statements.

If we look at the benchmark with 192 CPUs [0]/messages/by-id/CAA5RZ0vxn_UiUkUK5SdngObLZzw40RhafB7SHydxzBw2_xjjiA@mail.gmail.com, we can see that even a
small reduction in the frequency of incrementing counter of entries
gives a significant performance boost. For example, at sample_rate =
0.75, performance almost doubles, while at sample_rate < 0.5, the
spin-lock bottleneck is already eliminated. This suggests that even
small adjustments to the frequency can significantly improve the situation.

But instead of blindly reducing the frequency via PRNG, we can take a
more thoughtful approach with threshold by execute time:

* Find the most frequent query by column 'calls' in pg_stat_statements;
* In this query look at info about execution time: min_exec_time,
max_exec_time, etc;
* Gradually increase the threshold from min_exec_time to
max_exec_time, limiting the tracking of this query.
* Monitor performance: once the bottleneck is resolved, stop at the
current threshold value.

This approach allows us to:

* Eliminate the spin-lock bottleneck;
* Preserve data about slow queries, which may be critical for
performance analysis;
* Reduce the load on the most frequent queries causing contention,
instead of uniformly reducing the frequency for all queries.

BTW, I track queries in pgss_ProcessUtility without checking exec time
because these queries unlikely can lead to bottleneck.

Any other thoughts or objections?

This CF [0] should go back to "Needs Review" if this is the case.

Done!

P.S. During the recent review, I noticed an interesting behavior: when
we introduced \parse, changing the 'track' from 'none' to 'all' might
lead to an undefined behavior. The steps are:

postgres=# SELECT pg_stat_statements_reset() IS NOT NULL AS t; t --- t
(1 row) postgres=# SET pg_stat_statements.track = "none"; SET postgres=#
SELECT 1 \parse stmt ?column? ---------- 1 (1 row) postgres=#
\bind_named stmt \g ?column? ---------- 1 (1 row) postgres=# SELECT
query, calls FROM pg_stat_statements; query | calls -------+------- (0
rows) postgres=# SET pg_stat_statements.track = "all"; SET postgres=#
\bind_named stmt \g ?column? ---------- 1 (1 row) postgres=# SELECT
query, calls FROM pg_stat_statements; query | calls ----------+-------
SELECT 1 | 1 (1 row)

As you can see, the query has not been normalized. Is it a bug, expected
or undefined behavior?

[0]: /messages/by-id/CAA5RZ0vxn_UiUkUK5SdngObLZzw40RhafB7SHydxzBw2_xjjiA@mail.gmail.com
/messages/by-id/CAA5RZ0vxn_UiUkUK5SdngObLZzw40RhafB7SHydxzBw2_xjjiA@mail.gmail.com

-- Best regards, Ilia Evdokimov, Tantor Labs LLC.

Attachments:

v18-0001-Allow-setting-execution-time-threshold-for-pgss.patchtext/x-patch; charset=UTF-8; name=v18-0001-Allow-setting-execution-time-threshold-for-pgss.patchDownload

From b8f88dcd4d44a11301776455cc133bb3e859d000 Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Date: Thu, 20 Feb 2025 01:29:55 +0300
Subject: [PATCH v18] Allow setting execution time threshold for
 pg_stat_statements

New configuration parameter pg_stat_statements.track_min_duration makes it
possible to track statements whose execution time is greater than threshold,
to reduce the amount of tracking.
---
 contrib/pg_stat_statements/Makefile           |   2 +-
 .../expected/min_duration.out                 | 174 ++++++++++++++++++
 contrib/pg_stat_statements/meson.build        |   1 +
 .../pg_stat_statements/pg_stat_statements.c   |  59 ++++--
 .../pg_stat_statements/sql/min_duration.sql   |  50 +++++
 doc/src/sgml/pgstatstatements.sgml            |  22 +++
 6 files changed, 287 insertions(+), 21 deletions(-)
 create mode 100644 contrib/pg_stat_statements/expected/min_duration.out
 create mode 100644 contrib/pg_stat_statements/sql/min_duration.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 241c02587b..df512e428e 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -20,7 +20,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS))
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf
 REGRESS = select dml cursors utility level_tracking planning \
 	user_activity wal entry_timestamp privileges extended \
-	parallel cleanup oldextversions
+	parallel min_duration cleanup oldextversions
 # Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
 # which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
diff --git a/contrib/pg_stat_statements/expected/min_duration.out b/contrib/pg_stat_statements/expected/min_duration.out
new file mode 100644
index 0000000000..ef6f2a4437
--- /dev/null
+++ b/contrib/pg_stat_statements/expected/min_duration.out
@@ -0,0 +1,174 @@
+--
+-- track_min_duration
+--
+-- top-level tracking - simple query protocol
+SHOW pg_stat_statements.track;
+ pg_stat_statements.track 
+--------------------------
+ top
+(1 row)
+
+SET pg_stat_statements.track_min_duration = -1;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.track_min_duration = 0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+       query        | calls 
+--------------------+-------
+ SELECT $1 AS "int" |     1
+(1 row)
+
+-- top-level tracking - extended query protocol
+SET pg_stat_statements.track_min_duration = -1;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 \parse stmt
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.track_min_duration = 0;
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+   query   | calls 
+-----------+-------
+ SELECT $1 |     1
+(1 row)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+DEALLOCATE stmt;
+-- nested tracking - simple query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.track_min_duration = 0;
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SET pg_stat_statements.track_min_duration = -1;
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+ toplevel | calls |                       query                        
+----------+-------+----------------------------------------------------
+ t        |     2 | EXPLAIN (COSTS OFF) SELECT $1
+ f        |     2 | SELECT $1
+ t        |     1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
+ t        |     2 | SET pg_stat_statements.track_min_duration = $1
+(4 rows)
+
+-- nested tracking - extended query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.track_min_duration = 0;
+EXPLAIN (COSTS OFF) SELECT 1; \parse stmt
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SET pg_stat_statements.track_min_duration = -1;
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+ toplevel | calls |                     query                      
+----------+-------+------------------------------------------------
+ t        |     2 | EXPLAIN (COSTS OFF) SELECT $1
+ f        |     3 | SELECT $1
+ t        |     1 | SET pg_stat_statements.track_min_duration = $1
+(3 rows)
+
diff --git a/contrib/pg_stat_statements/meson.build b/contrib/pg_stat_statements/meson.build
index 4446af58c5..6427665ee1 100644
--- a/contrib/pg_stat_statements/meson.build
+++ b/contrib/pg_stat_statements/meson.build
@@ -54,6 +54,7 @@ tests += {
       'privileges',
       'extended',
       'parallel',
+      'min_duration',
       'cleanup',
       'oldextversions',
     ],
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 4a3e855ce9..00ac11aed0 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -291,16 +291,18 @@ static const struct config_enum_entry track_options[] =
 
 static int	pgss_max = 5000;	/* max # statements to track */
 static int	pgss_track = PGSS_TRACK_TOP;	/* tracking level */
+static int	pgss_track_min_duration = 0; /* msec or -1 */
 static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
 
 
-#define pgss_enabled(level) \
+#define pgss_enabled(level, skip_sample_check) \
 	(!IsParallelWorker() && \
 	(pgss_track == PGSS_TRACK_ALL || \
-	(pgss_track == PGSS_TRACK_TOP && (level) == 0)))
+	(pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
+	(skip_sample_check == PGSS_INVALID || pgss_track_min_duration >= 0))
 
 #define record_gc_qtexts() \
 	do { \
@@ -415,6 +417,18 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomIntVariable("pg_stat_statements.track_min_duration",
+							"Sets the minimum execution time above which statements will be tracked.",
+							" -1 turns this feature off. 0 tracks all statements.",
+							&pgss_track_min_duration,
+							0,
+							-1, INT_MAX,
+							PGC_SUSET,
+							GUC_UNIT_MS,
+							NULL,
+							NULL,
+							NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -837,7 +851,7 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
 
 	/* Safety check... */
-	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level))
+	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level, PGSS_INVALID))
 		return;
 
 	/*
@@ -895,7 +909,7 @@ pgss_planner(Query *parse,
 	 * pgss_store needs it.  We also ignore query without queryid, as it would
 	 * be treated as a utility statement, which may not be the case.
 	 */
-	if (pgss_enabled(nesting_level)
+	if (pgss_enabled(nesting_level, PGSS_PLAN)
 		&& pgss_track_planning && query_string
 		&& parse->queryId != UINT64CONST(0))
 	{
@@ -1000,7 +1014,8 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	 * counting of optimizable statements that are directly contained in
 	 * utility statements.
 	 */
-	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+	if (pgss_enabled(nesting_level, PGSS_EXEC) &&
+		queryDesc->plannedstmt->queryId != UINT64CONST(0))
 	{
 		/*
 		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
@@ -1069,27 +1084,31 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 	uint64		queryId = queryDesc->plannedstmt->queryId;
 
 	if (queryId != UINT64CONST(0) && queryDesc->totaltime &&
-		pgss_enabled(nesting_level))
+		pgss_enabled(nesting_level, PGSS_EXEC))
 	{
+		double		msec;
 		/*
 		 * Make sure stats accumulation is done.  (Note: it's okay if several
 		 * levels of hook all do this.)
 		 */
 		InstrEndLoop(queryDesc->totaltime);
 
-		pgss_store(queryDesc->sourceText,
-				   queryId,
-				   queryDesc->plannedstmt->stmt_location,
-				   queryDesc->plannedstmt->stmt_len,
-				   PGSS_EXEC,
-				   queryDesc->totaltime->total * 1000.0,	/* convert to msec */
-				   queryDesc->estate->es_total_processed,
-				   &queryDesc->totaltime->bufusage,
-				   &queryDesc->totaltime->walusage,
-				   queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
-				   NULL,
-				   queryDesc->estate->es_parallel_workers_to_launch,
-				   queryDesc->estate->es_parallel_workers_launched);
+		/* convert to msec */
+		msec = queryDesc->totaltime->total * 1000.0;
+		if (msec >= pgss_track_min_duration)
+			pgss_store(queryDesc->sourceText,
+						queryId,
+						queryDesc->plannedstmt->stmt_location,
+						queryDesc->plannedstmt->stmt_len,
+						PGSS_EXEC,
+						msec,
+						queryDesc->estate->es_total_processed,
+						&queryDesc->totaltime->bufusage,
+						&queryDesc->totaltime->walusage,
+						queryDesc->estate->es_jit ? &queryDesc->estate->es_jit->instr : NULL,
+						NULL,
+						queryDesc->estate->es_parallel_workers_to_launch,
+						queryDesc->estate->es_parallel_workers_launched);
 	}
 
 	if (prev_ExecutorEnd)
@@ -1112,7 +1131,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	uint64		saved_queryId = pstmt->queryId;
 	int			saved_stmt_location = pstmt->stmt_location;
 	int			saved_stmt_len = pstmt->stmt_len;
-	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level);
+	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level, PGSS_EXEC);
 
 	/*
 	 * Force utility statements to get queryId zero.  We do this even in cases
diff --git a/contrib/pg_stat_statements/sql/min_duration.sql b/contrib/pg_stat_statements/sql/min_duration.sql
new file mode 100644
index 0000000000..63903cc388
--- /dev/null
+++ b/contrib/pg_stat_statements/sql/min_duration.sql
@@ -0,0 +1,50 @@
+--
+-- track_min_duration
+--
+
+-- top-level tracking - simple query protocol
+SHOW pg_stat_statements.track;
+SET pg_stat_statements.track_min_duration = -1;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.track_min_duration = 0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+
+-- top-level tracking - extended query protocol
+SET pg_stat_statements.track_min_duration = -1;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 \parse stmt
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.track_min_duration = 0;
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+DEALLOCATE stmt;
+
+-- nested tracking - simple query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.track_min_duration = 0;
+EXPLAIN (COSTS OFF) SELECT 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+SET pg_stat_statements.track_min_duration = -1;
+EXPLAIN (COSTS OFF) SELECT 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+
+-- nested tracking - extended query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.track_min_duration = 0;
+EXPLAIN (COSTS OFF) SELECT 1; \parse stmt
+\bind_named stmt \g
+\bind_named stmt \g
+SET pg_stat_statements.track_min_duration = -1;
+\bind_named stmt \g
+\bind_named stmt \g
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
\ No newline at end of file
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index e2ac1c2d50..1c5f57425f 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -945,6 +945,28 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.track_min_duration</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.track_min_duration</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.track_min_duration</varname> is the minimum statement
+      execution time, in milliseconds, that will cause the statement to
+      be tracked. Setting this to <literal>0</literal> tracks all statements.
+      <literal>-1</literal> (the default) disables tracking of statements.
+      The same as setting <varname>pg_stat_statements.track</varname> to <literal>none</literal>.
+      For example, if you set it to <literal>250ms</literal> then all statements
+      that run 250ms or longer will be tracked. Only superusers can change this
+      setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

#53

Sami Imseih

samimseih@gmail.com

11 months ago

In reply to: Ilia Evdokimov (#52)

Re: Sample rate added to pg_stat_statements

As you can see, the query has not been normalized. Is it a bug, expected or undefined behavior?

No, this behavior is expected. The ability to normalize a statement is
only available
during post_parse_analyze (pgss_post_parse_analyze), as this is when
we have access to
JumbleState.

In your example, pgss_post_parse_analyze is skipped and during the
subsequent \bind_named,
ExecutorEnd (pgss_ExecutorEnd) does not have access to JumbleState,
leading to the query to be stored without normalization.

It will be nice to make JumbleState available to all hooks, IMO.

--
Sami

#54

Sami Imseih

samimseih@gmail.com

11 months ago

In reply to: Sami Imseih (#53)

Re: Sample rate added to pg_stat_statements

But instead of blindly reducing the frequency via PRNG, we can take a more thoughtful approach with threshold by execute time:

Find the most frequent query by column 'calls' in pg_stat_statements;
In this query look at info about execution time: min_exec_time, max_exec_time, etc;
Gradually increase the threshold from min_exec_time to max_exec_time, limiting the tracking of this query.
Monitor performance: once the bottleneck is resolved, stop at the current threshold value.

This approach allows us to:

Eliminate the spin-lock bottleneck;
Preserve data about slow queries, which may be critical for performance analysis;
Reduce the load on the most frequent queries causing contention, instead of uniformly reducing the frequency for all queries.

In my opinion, sample rate is a better fit for pg_stat_statements,
since the queries that
you care about the most are usually the most frequently executed. Sampling them
will still provide enough good data without the risk of not capturing
statistics about
them at all.

Longer running queries will also likely be the least frequent, so they
are already not likely
contributing to the spinlock contention. Also, the least frequent
queries will likely be aged
out faster, so pg_stat_statements was never really a good candidate to
track those anyways;
slow query logging with log_min_duration_statement is a better way to
ensure you capture
the data.

Maybe others may have a different opinion?

Sami

#55

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

11 months ago

In reply to: Sami Imseih (#53)

Re: Sample rate added to pg_stat_statements

On 20.02.2025 03:32, Sami Imseih wrote:

As you can see, the query has not been normalized. Is it a bug, expected or undefined behavior?

No, this behavior is expected. The ability to normalize a statement is
only available
during post_parse_analyze (pgss_post_parse_analyze), as this is when
we have access to
JumbleState.

In your example, pgss_post_parse_analyze is skipped and during the
subsequent \bind_named,
ExecutorEnd (pgss_ExecutorEnd) does not have access to JumbleState,
leading to the query to be stored without normalization.

It will be nice to make JumbleState available to all hooks, IMO.

--
Sami

Got it. I think it's worth considering, but not within the scope of this
thread.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

#56

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

11 months ago

In reply to: Sami Imseih (#54)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

On 20.02.2025 04:04, Sami Imseih wrote:

In my opinion, sample rate is a better fit for pg_stat_statements,
since the queries that
you care about the most are usually the most frequently executed. Sampling them
will still provide enough good data without the risk of not capturing
statistics about
them at all.

Longer running queries will also likely be the least frequent, so they
are already not likely
contributing to the spinlock contention. Also, the least frequent
queries will likely be aged
out faster, so pg_stat_statements was never really a good candidate to
track those anyways;
slow query logging with log_min_duration_statement is a better way to
ensure you capture
the data.

Maybe others may have a different opinion?

--

Sami

Hi everyone,

Since most people have expressed support for sampling based on frequency
of query, I will revert the rebased patch for version v15. Also, given
that everyone is likely busy with the higher-priority task of IN
normalization in pg_stat_statements, I will leave the rebased v19-patch
here and move it back to Ready for Committer, as the patch has already
been it.

Any suggestions are welcome.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v19-0001-Allow-setting-sample-rate-for-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v19-0001-Allow-setting-sample-rate-for-pg_stat_statements.patchDownload

From 9a27fecdfb0a2e351e8476430d5d5030324a887d Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.com>
Date: Mon, 3 Mar 2025 15:41:48 +0300
Subject: [PATCH v19] Allow setting sample rate for pg_stat_statements

New configuration parameter pg_stat_statements.sample_rate makes it
possible to track just a fraction of the queries meeting the configured
threshold, to reduce the amount of tracking.
---
 contrib/pg_stat_statements/Makefile           |   2 +-
 .../pg_stat_statements/expected/sampling.out  | 174 ++++++++++++++++++
 contrib/pg_stat_statements/meson.build        |   1 +
 .../pg_stat_statements/pg_stat_statements.c   |  53 +++++-
 contrib/pg_stat_statements/sql/sampling.sql   |  50 +++++
 doc/src/sgml/pgstatstatements.sgml            |  19 ++
 6 files changed, 291 insertions(+), 8 deletions(-)
 create mode 100644 contrib/pg_stat_statements/expected/sampling.out
 create mode 100644 contrib/pg_stat_statements/sql/sampling.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index 241c02587b..b70bdfaf2d 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -20,7 +20,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS))
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf
 REGRESS = select dml cursors utility level_tracking planning \
 	user_activity wal entry_timestamp privileges extended \
-	parallel cleanup oldextversions
+	parallel sampling cleanup oldextversions
 # Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
 # which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
diff --git a/contrib/pg_stat_statements/expected/sampling.out b/contrib/pg_stat_statements/expected/sampling.out
new file mode 100644
index 0000000000..2204215f64
--- /dev/null
+++ b/contrib/pg_stat_statements/expected/sampling.out
@@ -0,0 +1,174 @@
+--
+-- sample statements
+--
+-- top-level tracking - simple query protocol
+SHOW pg_stat_statements.track;
+ pg_stat_statements.track 
+--------------------------
+ top
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+       query        | calls 
+--------------------+-------
+ SELECT $1 AS "int" |     1
+(1 row)
+
+-- top-level tracking - extended query protocol
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 \parse stmt
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+   query   | calls 
+-----------+-------
+ SELECT $1 |     1
+(1 row)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+DEALLOCATE stmt;
+-- nested tracking - simple query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0;
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+ toplevel | calls |                       query                        
+----------+-------+----------------------------------------------------
+ t        |     2 | EXPLAIN (COSTS OFF) SELECT $1
+ f        |     2 | SELECT $1
+ t        |     1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
+ t        |     2 | SET pg_stat_statements.sample_rate = $1
+(4 rows)
+
+-- nested tracking - extended query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1; \parse stmt
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0;
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+ toplevel | calls |                  query                  
+----------+-------+-----------------------------------------
+ t        |     2 | EXPLAIN (COSTS OFF) SELECT $1
+ f        |     3 | SELECT $1
+ t        |     1 | SET pg_stat_statements.sample_rate = $1
+(3 rows)
+
diff --git a/contrib/pg_stat_statements/meson.build b/contrib/pg_stat_statements/meson.build
index 4446af58c5..351354a6a5 100644
--- a/contrib/pg_stat_statements/meson.build
+++ b/contrib/pg_stat_statements/meson.build
@@ -54,6 +54,7 @@ tests += {
       'privileges',
       'extended',
       'parallel',
+      'sampling',
       'cleanup',
       'oldextversions',
     ],
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index b245d04097..8ec379cae3 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,6 +50,7 @@
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -257,6 +258,9 @@ typedef struct pgssSharedState
 /* Current nesting depth of planner/ExecutorRun/ProcessUtility calls */
 static int	nesting_level = 0;
 
+/* Is the current top-level query to be sampled? */
+static bool is_query_sampled = false;
+
 /* Saved hook values */
 static shmem_request_hook_type prev_shmem_request_hook = NULL;
 static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
@@ -295,12 +299,14 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1.0; /* fraction of statements to track */
 
 
-#define pgss_enabled(level) \
+#define pgss_enabled(level, skip_sampling_check) \
 	(!IsParallelWorker() && \
 	(pgss_track == PGSS_TRACK_ALL || \
-	(pgss_track == PGSS_TRACK_TOP && (level) == 0)))
+	(pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
+	(skip_sampling_check == PGSS_INVALID || current_query_sampled()))
 
 #define record_gc_qtexts() \
 	do { \
@@ -374,6 +380,7 @@ static char *generate_normalized_query(JumbleState *jstate, const char *query,
 static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
 									 int query_loc);
 static int	comp_location(const void *a, const void *b);
+static bool	current_query_sampled(void);
 
 
 /*
@@ -415,6 +422,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							 "Fraction of queries to track.",
+							 NULL,
+							 &pgss_sample_rate,
+							 1.0,
+							 0.0,
+							 1.0,
+							 PGC_SUSET,
+							 0,
+							 NULL,
+							 NULL,
+							 NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -837,7 +857,7 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
 
 	/* Safety check... */
-	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level))
+	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level, PGSS_INVALID))
 		return;
 
 	/*
@@ -895,7 +915,7 @@ pgss_planner(Query *parse,
 	 * pgss_store needs it.  We also ignore query without queryid, as it would
 	 * be treated as a utility statement, which may not be the case.
 	 */
-	if (pgss_enabled(nesting_level)
+	if (pgss_enabled(nesting_level, PGSS_PLAN)
 		&& pgss_track_planning && query_string
 		&& parse->queryId != UINT64CONST(0))
 	{
@@ -1006,7 +1026,8 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	 * counting of optimizable statements that are directly contained in
 	 * utility statements.
 	 */
-	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+	if (pgss_enabled(nesting_level, PGSS_EXEC) &&
+		queryDesc->plannedstmt->queryId != UINT64CONST(0))
 	{
 		/*
 		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
@@ -1077,7 +1098,7 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 	uint64		queryId = queryDesc->plannedstmt->queryId;
 
 	if (queryId != UINT64CONST(0) && queryDesc->totaltime &&
-		pgss_enabled(nesting_level))
+		pgss_enabled(nesting_level, PGSS_EXEC))
 	{
 		/*
 		 * Make sure stats accumulation is done.  (Note: it's okay if several
@@ -1120,7 +1141,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	uint64		saved_queryId = pstmt->queryId;
 	int			saved_stmt_location = pstmt->stmt_location;
 	int			saved_stmt_len = pstmt->stmt_len;
-	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level);
+	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level, PGSS_EXEC);
 
 	/*
 	 * Force utility statements to get queryId zero.  We do this even in cases
@@ -3025,3 +3046,21 @@ comp_location(const void *a, const void *b)
 
 	return pg_cmp_s32(l, r);
 }
+
+/*
+ * Determine whether the current query should be sampled.
+ *
+ * At the beginning of each top-level statement, decide whether we'll
+ * sample this statement. If nested-statement tracking is enabled,
+ * either all nested statements will be tracked or none will.
+ */
+static bool
+current_query_sampled(void)
+{
+	if (nesting_level == 0)
+		is_query_sampled = pgss_sample_rate != 0.0 &&
+			(pgss_sample_rate == 1.0 ||
+			pg_prng_double(&pg_global_prng_state) <= pgss_sample_rate);
+
+	return is_query_sampled;
+}
\ No newline at end of file
diff --git a/contrib/pg_stat_statements/sql/sampling.sql b/contrib/pg_stat_statements/sql/sampling.sql
new file mode 100644
index 0000000000..b09f45991b
--- /dev/null
+++ b/contrib/pg_stat_statements/sql/sampling.sql
@@ -0,0 +1,50 @@
+--
+-- sample statements
+--
+
+-- top-level tracking - simple query protocol
+SHOW pg_stat_statements.track;
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+
+-- top-level tracking - extended query protocol
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 \parse stmt
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+DEALLOCATE stmt;
+
+-- nested tracking - simple query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+SET pg_stat_statements.sample_rate = 0;
+EXPLAIN (COSTS OFF) SELECT 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+
+-- nested tracking - extended query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1; \parse stmt
+\bind_named stmt \g
+\bind_named stmt \g
+SET pg_stat_statements.sample_rate = 0;
+\bind_named stmt \g
+\bind_named stmt \g
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index e2ac1c2d50..f517dd855b 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -945,6 +945,25 @@
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      track a fraction of the statements in each session. The default is <literal>1</literal>,
+      meaning track all the queries. Setting this to <literal>0</literal> disables sampled statements
+      tracking, the same as setting <varname>pg_stat_statements.track</varname> to <literal>none</literal>.
+      In case of nested statements, either all will be tracked or none. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

#57

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

10 months ago

In reply to: Ilia Evdokimov (#56)

1 attachment(s)

Re: Sample rate added to pg_stat_statements

Hi,

I attached rebased patch v20.

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.

Attachments:

v20-0001-Allow-setting-sample-rate-for-pg_stat_statements.patchtext/x-patch; charset=UTF-8; name=v20-0001-Allow-setting-sample-rate-for-pg_stat_statements.patchDownload

From accee46b077e1debdc3db61555923b2c11e18d5e Mon Sep 17 00:00:00 2001
From: Ilia Evdokimov <ilya.evdokimov@tantorlabs.ru>
Date: Fri, 21 Mar 2025 17:37:08 +0300
Subject: [PATCH v20] Allow setting sample rate for pg_stat_statements

New configuration parameter pg_stat_statements.sample_rate makes it
possible to track just a fraction of the queries meeting the configured
threshold, to reduce the amount of tracking.
---
 contrib/pg_stat_statements/Makefile           |   2 +-
 .../pg_stat_statements/expected/sampling.out  | 174 ++++++++++++++++++
 contrib/pg_stat_statements/meson.build        |   1 +
 .../pg_stat_statements/pg_stat_statements.c   |  53 +++++-
 contrib/pg_stat_statements/sql/sampling.sql   |  50 +++++
 doc/src/sgml/pgstatstatements.sgml            |  19 ++
 6 files changed, 291 insertions(+), 8 deletions(-)
 create mode 100644 contrib/pg_stat_statements/expected/sampling.out
 create mode 100644 contrib/pg_stat_statements/sql/sampling.sql

diff --git a/contrib/pg_stat_statements/Makefile b/contrib/pg_stat_statements/Makefile
index b2bd8794d2a..f2df4f18513 100644
--- a/contrib/pg_stat_statements/Makefile
+++ b/contrib/pg_stat_statements/Makefile
@@ -20,7 +20,7 @@ LDFLAGS_SL += $(filter -lm, $(LIBS))
 REGRESS_OPTS = --temp-config $(top_srcdir)/contrib/pg_stat_statements/pg_stat_statements.conf
 REGRESS = select dml cursors utility level_tracking planning \
 	user_activity wal entry_timestamp privileges extended \
-	parallel cleanup oldextversions squashing
+	parallel sampling cleanup oldextversions squashing
 # Disabled because these tests require "shared_preload_libraries=pg_stat_statements",
 # which typical installcheck users do not have (e.g. buildfarm clients).
 NO_INSTALLCHECK = 1
diff --git a/contrib/pg_stat_statements/expected/sampling.out b/contrib/pg_stat_statements/expected/sampling.out
new file mode 100644
index 00000000000..2204215f64b
--- /dev/null
+++ b/contrib/pg_stat_statements/expected/sampling.out
@@ -0,0 +1,174 @@
+--
+-- sample statements
+--
+-- top-level tracking - simple query protocol
+SHOW pg_stat_statements.track;
+ pg_stat_statements.track 
+--------------------------
+ top
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+ int 
+-----
+   1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+       query        | calls 
+--------------------+-------
+ SELECT $1 AS "int" |     1
+(1 row)
+
+-- top-level tracking - extended query protocol
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SELECT 1 \parse stmt
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+ query | calls 
+-------+-------
+(0 rows)
+
+SET pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+ ?column? 
+----------
+        1
+(1 row)
+
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+   query   | calls 
+-----------+-------
+ SELECT $1 |     1
+(1 row)
+
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+DEALLOCATE stmt;
+-- nested tracking - simple query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0;
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+EXPLAIN (COSTS OFF) SELECT 1;
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+ toplevel | calls |                       query                        
+----------+-------+----------------------------------------------------
+ t        |     2 | EXPLAIN (COSTS OFF) SELECT $1
+ f        |     2 | SELECT $1
+ t        |     1 | SELECT pg_stat_statements_reset() IS NOT NULL AS t
+ t        |     2 | SET pg_stat_statements.sample_rate = $1
+(4 rows)
+
+-- nested tracking - extended query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+ t 
+---
+ t
+(1 row)
+
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1; \parse stmt
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SET pg_stat_statements.sample_rate = 0;
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+\bind_named stmt \g
+ QUERY PLAN 
+------------
+ Result
+(1 row)
+
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+ toplevel | calls |                  query                  
+----------+-------+-----------------------------------------
+ t        |     2 | EXPLAIN (COSTS OFF) SELECT $1
+ f        |     3 | SELECT $1
+ t        |     1 | SET pg_stat_statements.sample_rate = $1
+(3 rows)
+
diff --git a/contrib/pg_stat_statements/meson.build b/contrib/pg_stat_statements/meson.build
index 01a6cbdcf61..6fa9b7dba50 100644
--- a/contrib/pg_stat_statements/meson.build
+++ b/contrib/pg_stat_statements/meson.build
@@ -54,6 +54,7 @@ tests += {
       'privileges',
       'extended',
       'parallel',
+      'sampling',
       'cleanup',
       'oldextversions',
       'squashing',
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index 8ab9ad58e1c..b721b8331f9 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -50,6 +50,7 @@
 #include "access/parallel.h"
 #include "catalog/pg_authid.h"
 #include "common/int.h"
+#include "common/pg_prng.h"
 #include "executor/instrument.h"
 #include "funcapi.h"
 #include "jit/jit.h"
@@ -257,6 +258,9 @@ typedef struct pgssSharedState
 /* Current nesting depth of planner/ExecutorRun/ProcessUtility calls */
 static int	nesting_level = 0;
 
+/* Is the current top-level query to be sampled? */
+static bool is_query_sampled = false;
+
 /* Saved hook values */
 static shmem_request_hook_type prev_shmem_request_hook = NULL;
 static shmem_startup_hook_type prev_shmem_startup_hook = NULL;
@@ -295,11 +299,13 @@ static bool pgss_track_utility = true;	/* whether to track utility commands */
 static bool pgss_track_planning = false;	/* whether to track planning
 											 * duration */
 static bool pgss_save = true;	/* whether to save stats across shutdown */
+static double pgss_sample_rate = 1.0; /* fraction of statements to track */
 
-#define pgss_enabled(level) \
+#define pgss_enabled(level, skip_sampling_check) \
 	(!IsParallelWorker() && \
 	(pgss_track == PGSS_TRACK_ALL || \
-	(pgss_track == PGSS_TRACK_TOP && (level) == 0)))
+	(pgss_track == PGSS_TRACK_TOP && (level) == 0)) && \
+	(skip_sampling_check == PGSS_INVALID || current_query_sampled()))
 
 #define record_gc_qtexts() \
 	do { \
@@ -373,6 +379,7 @@ static char *generate_normalized_query(JumbleState *jstate, const char *query,
 static void fill_in_constant_lengths(JumbleState *jstate, const char *query,
 									 int query_loc);
 static int	comp_location(const void *a, const void *b);
+static bool	current_query_sampled(void);
 
 
 /*
@@ -414,6 +421,19 @@ _PG_init(void)
 							NULL,
 							NULL);
 
+	DefineCustomRealVariable("pg_stat_statements.sample_rate",
+							"Fraction of queries to track.",
+							NULL,
+							&pgss_sample_rate,
+							1.0,
+							0.0,
+							1.0,
+							PGC_SUSET,
+							0,
+							NULL,
+							NULL,
+							NULL);
+
 	DefineCustomEnumVariable("pg_stat_statements.track",
 							 "Selects which statements are tracked by pg_stat_statements.",
 							 NULL,
@@ -836,7 +856,7 @@ pgss_post_parse_analyze(ParseState *pstate, Query *query, JumbleState *jstate)
 		prev_post_parse_analyze_hook(pstate, query, jstate);
 
 	/* Safety check... */
-	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level))
+	if (!pgss || !pgss_hash || !pgss_enabled(nesting_level, PGSS_INVALID))
 		return;
 
 	/*
@@ -894,7 +914,7 @@ pgss_planner(Query *parse,
 	 * pgss_store needs it.  We also ignore query without queryid, as it would
 	 * be treated as a utility statement, which may not be the case.
 	 */
-	if (pgss_enabled(nesting_level)
+	if (pgss_enabled(nesting_level, PGSS_PLAN)
 		&& pgss_track_planning && query_string
 		&& parse->queryId != UINT64CONST(0))
 	{
@@ -1005,7 +1025,8 @@ pgss_ExecutorStart(QueryDesc *queryDesc, int eflags)
 	 * counting of optimizable statements that are directly contained in
 	 * utility statements.
 	 */
-	if (pgss_enabled(nesting_level) && queryDesc->plannedstmt->queryId != UINT64CONST(0))
+	if (pgss_enabled(nesting_level, PGSS_EXEC) &&
+		queryDesc->plannedstmt->queryId != UINT64CONST(0))
 	{
 		/*
 		 * Set up to track total elapsed time in ExecutorRun.  Make sure the
@@ -1076,7 +1097,7 @@ pgss_ExecutorEnd(QueryDesc *queryDesc)
 	uint64		queryId = queryDesc->plannedstmt->queryId;
 
 	if (queryId != UINT64CONST(0) && queryDesc->totaltime &&
-		pgss_enabled(nesting_level))
+		pgss_enabled(nesting_level, PGSS_EXEC))
 	{
 		/*
 		 * Make sure stats accumulation is done.  (Note: it's okay if several
@@ -1119,7 +1140,7 @@ pgss_ProcessUtility(PlannedStmt *pstmt, const char *queryString,
 	uint64		saved_queryId = pstmt->queryId;
 	int			saved_stmt_location = pstmt->stmt_location;
 	int			saved_stmt_len = pstmt->stmt_len;
-	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level);
+	bool		enabled = pgss_track_utility && pgss_enabled(nesting_level, PGSS_EXEC);
 
 	/*
 	 * Force utility statements to get queryId zero.  We do this even in cases
@@ -3081,3 +3102,21 @@ comp_location(const void *a, const void *b)
 
 	return pg_cmp_s32(l, r);
 }
+
+/*
+ * Determine whether the current query should be sampled.
+ *
+ * At the beginning of each top-level statement, decide whether we'll
+ * sample this statement. If nested-statement tracking is enabled,
+ * either all nested statements will be tracked or none will.
+ */
+static bool
+current_query_sampled(void)
+{
+	if (nesting_level == 0)
+		is_query_sampled = pgss_sample_rate != 0.0 &&
+			(pgss_sample_rate == 1.0 ||
+			pg_prng_double(&pg_global_prng_state) <= pgss_sample_rate);
+
+	return is_query_sampled;
+}
\ No newline at end of file
diff --git a/contrib/pg_stat_statements/sql/sampling.sql b/contrib/pg_stat_statements/sql/sampling.sql
new file mode 100644
index 00000000000..b09f45991ba
--- /dev/null
+++ b/contrib/pg_stat_statements/sql/sampling.sql
@@ -0,0 +1,50 @@
+--
+-- sample statements
+--
+
+-- top-level tracking - simple query protocol
+SHOW pg_stat_statements.track;
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+SELECT 1 AS "int";
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+
+-- top-level tracking - extended query protocol
+SET pg_stat_statements.sample_rate = 0.0;
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SELECT 1 \parse stmt
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SET pg_stat_statements.sample_rate = 1.0;
+\bind_named stmt \g
+SELECT query, calls FROM pg_stat_statements ORDER BY query COLLATE "C";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+DEALLOCATE stmt;
+
+-- nested tracking - simple query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+SET pg_stat_statements.sample_rate = 0;
+EXPLAIN (COSTS OFF) SELECT 1;
+EXPLAIN (COSTS OFF) SELECT 1;
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
+
+-- nested tracking - extended query protocol
+SET pg_stat_statements.track = "all";
+SELECT pg_stat_statements_reset() IS NOT NULL AS t;
+SET pg_stat_statements.sample_rate = 1;
+EXPLAIN (COSTS OFF) SELECT 1; \parse stmt
+\bind_named stmt \g
+\bind_named stmt \g
+SET pg_stat_statements.sample_rate = 0;
+\bind_named stmt \g
+\bind_named stmt \g
+SELECT toplevel, calls, query FROM pg_stat_statements
+  ORDER BY query COLLATE "C";
diff --git a/doc/src/sgml/pgstatstatements.sgml b/doc/src/sgml/pgstatstatements.sgml
index f4e384e95ae..25097b9ff02 100644
--- a/doc/src/sgml/pgstatstatements.sgml
+++ b/doc/src/sgml/pgstatstatements.sgml
@@ -963,6 +963,25 @@ calls | 2
     </listitem>
    </varlistentry>
 
+   <varlistentry>
+    <term>
+     <varname>pg_stat_statements.sample_rate</varname> (<type>real</type>)
+     <indexterm>
+      <primary><varname>pg_stat_statements.sample_rate</varname> configuration parameter</primary>
+     </indexterm>
+    </term>
+
+    <listitem>
+     <para>
+      <varname>pg_stat_statements.sample_rate</varname> causes pg_stat_statements to only
+      track a fraction of the statements in each session. The default is <literal>1</literal>,
+      meaning track all the queries. Setting this to <literal>0</literal> disables sampled statements
+      tracking, the same as setting <varname>pg_stat_statements.track</varname> to <literal>none</literal>.
+      In case of nested statements, either all will be tracked or none. Only superusers can change this setting.
+     </para>
+    </listitem>
+   </varlistentry>
+
    <varlistentry>
     <term>
      <varname>pg_stat_statements.save</varname> (<type>boolean</type>)
-- 
2.34.1

#58

Ilia Evdokimov

ilya.evdokimov@tantorlabs.com

9 months ago

In reply to: Ilia Evdokimov (#57)

Re: Sample rate added to pg_stat_statements

Since at the moment these changes are not a priority, and it's more
important to address [0]/messages/by-id/Zz0MFPq1s4WFxxpL@paquier.xyz for pg_stat_statements, I'm marking this patch
as "Returned with Feedback".

[0]: /messages/by-id/Zz0MFPq1s4WFxxpL@paquier.xyz

--
Best regards,
Ilia Evdokimov,
Tantor Labs LLC.