New GUC to sample log queries
Hello hackers,
In case of OLTP trafic it is hard to catch fast queries in logs (for example,
you want to know parameters for only few queries).
You have to put log_min_duration_statement to 0, do a reload, wait a few
seconds/minutes, back log_min_duration_statement to a previous value and reload
again.
In this time, you can cross your fingers impact will not be important and keep
an eye on log size.
I suggest to sample logs, like sample_rate for auto_explain [1]. Attached patch
introduce a new GUC, log_sample_rate, 1 means all queries will be logged (same
behavior as now).
Here is a naive SELECT only bench with a dataset which fit in ram (scale factor
= 100) and PGDATA and log on a ramdisk:
shared_buffers = 4GB
seq_page_cost = random_page_cost = 1.0
logging_collector = on (no rotation)
pgbench -c 4 -S -T 60 bench
master :
log_min_duration_statement = 0
TPS: 22562
log size: 1353746 lines (172MB)
log_min_duration_statement = -1
TPS: 25654
log size: 0 lines
patched:
log_min_duration_statement = 0
log_sample_rate = 1
TPS: 22548
log size: 1352873 lines (171MB)
log_min_duration_statement = 0
log_sample_rate = 0.1
TPS: 24802
log size: 148709 lines (19MB)
log_min_duration_statement = 0
log_sample_rate = 0.01
TPS: 25245
log size: 15344 lines (2MB)
log_min_duration_statement = 0
log_sample_rate = 0
TPS: 25858
log size: 0 lines
log_min_duration_statement = -1
log_sample_rate = 1
TPS: 25599
log size: 0 lines
I don't know the cost of random() call?
With log_sample_rate = 0.01 we got 15K lines of logs and you are close to
log_min_duration_statement = -1. Difference between log_min_duration_statement =
0 and -1 is about 12% performance drop on my laptop.
I will update documentation and postgresql.conf.sample later.
Thanks,
1: https://www.postgresql.org/docs/current/static/auto-explain.html
--
Adrien NAYRAT
Attachments:
sample_rate-1.patchtext/x-patch; name=sample_rate-1.patchDownload+17-2
On 31 May 2018 at 06:44, Adrien Nayrat <adrien.nayrat@anayrat.info> wrote:
Here is a naive SELECT only bench with a dataset which fit in ram (scale factor
= 100) and PGDATA and log on a ramdisk:
shared_buffers = 4GB
seq_page_cost = random_page_cost = 1.0
logging_collector = on (no rotation)
It would be better to just: SELECT 1; to try to get the true overhead
of the additional logging code.
I don't know the cost of random() call?
It's probably best to test in Postgres to see if there's an overhead
to the new code. It may be worth special casing the 0 and 1 case so
random() is not called.
+ (random() < log_sample_rate * MAX_RANDOM_VALUE);
this should be <=, or you'll randomly miss logging a query when
log_sample_rate is 1.0 every 4 billion or so queries.
Of course, it would be better if we had a proper profiler, but I can
see your need for this. Enabling logging of all queries in production
is currently reserved for people with low traffic servers and the
insane.
--
David Rowley http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 05/31/2018 03:34 AM, David Rowley wrote:
On 31 May 2018 at 06:44, Adrien Nayrat <adrien.nayrat@anayrat.info> wrote:
Here is a naive SELECT only bench with a dataset which fit in ram (scale factor
= 100) and PGDATA and log on a ramdisk:
shared_buffers = 4GB
seq_page_cost = random_page_cost = 1.0
logging_collector = on (no rotation)It would be better to just: SELECT 1; to try to get the true overhead
of the additional logging code.
Right, I wonder why I didn't have the idea :)
I don't know the cost of random() call?
It's probably best to test in Postgres to see if there's an overhead
to the new code. It may be worth special casing the 0 and 1 case so
random() is not called.
Thanks, I based on auto_explain's code. I will send a patch for auto_explain
with same changes.
+ (random() < log_sample_rate * MAX_RANDOM_VALUE);
this should be <=, or you'll randomly miss logging a query when
log_sample_rate is 1.0 every 4 billion or so queries.
Right, it is the same in auto_explain.
Of course, it would be better if we had a proper profiler, but I can
see your need for this. Enabling logging of all queries in production
is currently reserved for people with low traffic servers and the
insane.
Yes, here are results with "SELECT 1" and changes to avoid random() call with 0
and 1.
master :
log_min_duration_statement = 0
TPS: 41294
log size: 2477589 lines (199MB)
log_min_duration_statement = -1
TPS: 49478
log size: 0 lines
patched:
log_min_duration_statement = 0
log_sample_rate = 1
TPS: 41635
log size: 2497994 lines (201MB)
log_min_duration_statement = 0
log_sample_rate = 0.1
TPS: 46915
log size: 282303 lines (23MB)
log_min_duration_statement = 0
log_sample_rate = 0.01
TPS: 48867
log size: 29193 lines (2.4MB)
log_min_duration_statement = 0
log_sample_rate = 0
TPS: 48912
log size: 0 lines
log_min_duration_statement = -1
log_sample_rate = 1
TPS: 49061
log size: 0 lines
Difference between l_m_d_s = 0 and -1 is about 16% and logger process eat around
30% CPU on my laptop.
Revised patch attached.
Thanks,
--
Adrien NAYRAT
Attachments:
sample_rate-2.patchtext/x-patch; name=sample_rate-2.patchDownload+20-2
On Thu, May 31, 2018 at 02:37:07PM +0200, Adrien Nayrat wrote:
On 05/31/2018 03:34 AM, David Rowley wrote:
On 31 May 2018 at 06:44, Adrien Nayrat <adrien.nayrat@anayrat.info> wrote:
Here is a naive SELECT only bench with a dataset which fit in ram (scale factor
= 100) and PGDATA and log on a ramdisk:
shared_buffers = 4GB
seq_page_cost = random_page_cost = 1.0
logging_collector = on (no rotation)It would be better to just: SELECT 1; to try to get the true overhead
of the additional logging code.Right, I wonder why I didn't have the idea :)
Using prepared statements help also in reducing the load with
unnecessary planning.
--
Michael
On 05/31/2018 03:22 PM, Michael Paquier wrote:
On Thu, May 31, 2018 at 02:37:07PM +0200, Adrien Nayrat wrote:
On 05/31/2018 03:34 AM, David Rowley wrote:
On 31 May 2018 at 06:44, Adrien Nayrat <adrien.nayrat@anayrat.info> wrote:
Here is a naive SELECT only bench with a dataset which fit in ram (scale factor
= 100) and PGDATA and log on a ramdisk:
shared_buffers = 4GB
seq_page_cost = random_page_cost = 1.0
logging_collector = on (no rotation)It would be better to just: SELECT 1; to try to get the true overhead
of the additional logging code.Right, I wonder why I didn't have the idea :)
Using prepared statements help also in reducing the load with
unnecessary planning.
--
Michael
Thanks, but the winner is Julien Rouhaud with only ";", I reach 115K TPS on my
laptop :)
Attached third version of the patch with documentation.
--
Adrien NAYRAT
Attachments:
sample_rate-3.patchtext/x-patch; name=sample_rate-3.patchDownload+42-2
On 24/06/18 13:22, Adrien Nayrat wrote:
Attached third version of the patch with documentation.
Hi. I'm reviewing this.
exceeded = (log_min_duration_statement == 0 || (log_min_duration_statement > 0 && (secs > log_min_duration_statement / 1000 || - secs * 1000 + msecs >= log_min_duration_statement))); + secs * 1000 + msecs >= log_min_duration_statement))) && + log_sample_rate != 0 && (log_sample_rate == 1 || + random() <= log_sample_rate * MAX_RANDOM_VALUE);
A few notes on this part, which is the crux of the patch.
1) Is there an off-by-one error here? drandom() has the formula
(double) random() / ((double) MAX_RANDOM_VALUE + 1)
but yours doesn't look like that.
2) I think the whole thing should be separated into its own expression
instead of tagging along on an already hard to read expression.
3) Is it intentional to only sample with log_min_duration_statement and
not also with log_duration? It seems like it should affect both. In
both cases, the name is too generic. Something called "log_sample_rate"
should sample **everything**.
Otherwise I think it's good. Attached is a revised patch that fixes 2)
as well as some wordsmithing throughout. It does not deal with the
other issues I raised.
Marked "Waiting on Author".
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
Attachments:
sample_rate-3-reviewed.patchtext/x-patch; name=sample_rate-3-reviewed.patchDownload+47-2
On 06/24/2018 08:41 PM, Vik Fearing wrote:
On 24/06/18 13:22, Adrien Nayrat wrote:
Attached third version of the patch with documentation.
Hi. I'm reviewing this.
Hi, thanks for your review.
exceeded = (log_min_duration_statement == 0 || (log_min_duration_statement > 0 && (secs > log_min_duration_statement / 1000 || - secs * 1000 + msecs >= log_min_duration_statement))); + secs * 1000 + msecs >= log_min_duration_statement))) && + log_sample_rate != 0 && (log_sample_rate == 1 || + random() <= log_sample_rate * MAX_RANDOM_VALUE);A few notes on this part, which is the crux of the patch.
1) Is there an off-by-one error here? drandom() has the formula
(double) random() / ((double) MAX_RANDOM_VALUE + 1)
but yours doesn't look like that.
I don't think there is an error. random() function return a long int between 0
and MAX_RANDOM_VALUE.
2) I think the whole thing should be separated into its own expression
instead of tagging along on an already hard to read expression.
+1
3) Is it intentional to only sample with log_min_duration_statement and
not also with log_duration? It seems like it should affect both. In
both cases, the name is too generic. Something called "log_sample_rate"
should sample **everything**.
I do not think it could be useful to sample other case such as log_duration.
But yes, the GUC is confusing and I am not comfortable to introduce a new GUC in
my initial patch.
Maybe we should adapt current GUC with something like :
log_min_duration_statement = <time>,<sample rate>
This give :
log_min_duration_statement = 0,0.1
Equivalent to :
log_min_duration_statement = 0
log_sample_rate = 0.1
Thought?
Otherwise I think it's good. Attached is a revised patch that fixes 2)
as well as some wordsmithing throughout. It does not deal with the
other issues I raised.
Thanks for your corrections.
--
Adrien
On 06/27/2018 11:13 PM, Adrien Nayrat wrote:
3) Is it intentional to only sample with log_min_duration_statement and
not also with log_duration? It seems like it should affect both. In
both cases, the name is too generic. Something called "log_sample_rate"
should sample **everything**.I do not think it could be useful to sample other case such as log_duration.
But yes, the GUC is confusing and I am not comfortable to introduce a new GUC in
my initial patch.Maybe we should adapt current GUC with something like :
log_min_duration_statement = <time>,<sample rate>>
This give :log_min_duration_statement = 0,0.1
Equivalent to :
log_min_duration_statement = 0
log_sample_rate = 0.1Thought?
After reflection it seems a bad idea :
* it breaks compatibility with external tools
* it introduce a kind of "composite" GUC which may add complexity to use. For
example in pg_settings view.
What do you think of : log_min_duration_statement_sample ? Is it too long?
I saw a few days ago this error on http://commitfest.cputube.org
postgres.sgml:5202: element xref: validity error : IDREF attribute linkend
references an unknown ID "log_min_duration_statement"
Patch attached with fix on linkend marker
Regards,
--
Adrien
Attachments:
sample_rate-4.patchtext/x-patch; name=sample_rate-4.patchDownload+47-2
On 10/07/18 20:34, Adrien Nayrat wrote:
On 06/27/2018 11:13 PM, Adrien Nayrat wrote:
3) Is it intentional to only sample with log_min_duration_statement and
not also with log_duration? It seems like it should affect both. In
both cases, the name is too generic. Something called "log_sample_rate"
should sample **everything**.I do not think it could be useful to sample other case such as log_duration.
But yes, the GUC is confusing and I am not comfortable to introduce a new GUC in
my initial patch.Maybe we should adapt current GUC with something like :
log_min_duration_statement = <time>,<sample rate>>
This give :log_min_duration_statement = 0,0.1
Equivalent to :
log_min_duration_statement = 0
log_sample_rate = 0.1Thought?
After reflection it seems a bad idea :
* it breaks compatibility with external tools
* it introduce a kind of "composite" GUC which may add complexity to use. For
example in pg_settings view.
Yes, I think that was a very bad idea.
What do you think of : log_min_duration_statement_sample ?
Hmm. Not sure if that last word should be _sample, _sampling, _rate, or
a combination of those.
Is it too long?
I introduced idle_in_transaction_session_timeout, so I'm not one to say
a setting name is too long :)
--
Vik Fearing +33 6 46 75 15 36
http://2ndQuadrant.fr PostgreSQL : Expertise, Formation et Support
On Sun, Jul 15, 2018 at 6:53 AM, Vik Fearing
<vik.fearing@2ndquadrant.com> wrote:
Hmm. Not sure if that last word should be _sample, _sampling, _rate, or
a combination of those.
+1 for rate or sample_rate. I think "sample" or "sampling" without
"rate" will not be very clear.
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 07/16/2018 05:24 PM, Robert Haas wrote:
On Sun, Jul 15, 2018 at 6:53 AM, Vik Fearing
<vik.fearing@2ndquadrant.com> wrote:Hmm. Not sure if that last word should be _sample, _sampling, _rate, or
a combination of those.+1 for rate or sample_rate. I think "sample" or "sampling" without
"rate" will not be very clear.
Thanks,
After reading this mail :
/messages/by-id/20180710183828.GB3890@telsasoft.com
I wonder if this GUC could be used for theses logging method?
log_statement_stats
log_parser_stats
log_planner_stats
log_executor_stats
--
Adrien
On 07/16/2018 05:24 PM, Robert Haas wrote:
On Sun, Jul 15, 2018 at 6:53 AM, Vik Fearing
<vik.fearing@2ndquadrant.com> wrote:Hmm. Not sure if that last word should be _sample, _sampling, _rate, or
a combination of those.+1 for rate or sample_rate. I think "sample" or "sampling" without
"rate" will not be very clear.
1+ to sample_rate
It's what auto_explain and pgbench uses, so let's keep the naming
consistent.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, 16 Jul 2018 at 23:07, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 07/16/2018 05:24 PM, Robert Haas wrote:
On Sun, Jul 15, 2018 at 6:53 AM, Vik Fearing
<vik.fearing@2ndquadrant.com> wrote:Hmm. Not sure if that last word should be _sample, _sampling, _rate, or
a combination of those.+1 for rate or sample_rate. I think "sample" or "sampling" without
"rate" will not be very clear.1+ to sample_rate
It's what auto_explain and pgbench uses, so let's keep the naming
consistent.
This patch went through last few commitfests without any activity, but cfbot
says it still applies and doesn't break any tests. From what I see there was
some agreement about naming, so the patch is indeed needs more review. Could
anyone from the reviewers (Vik?) confirm that it's in a good shape?
On 11/18/18 12:47 AM, Dmitry Dolgov wrote:
This patch went through last few commitfests without any activity, but cfbot
says it still applies and doesn't break any tests. From what I see there was
some agreement about naming, so the patch is indeed needs more review. Could
anyone from the reviewers (Vik?) confirm that it's in a good shape?
Thanks Dmitry to bringing this up.
For me, the question is how to name this GUC?
Is "log_sample_rate" is enough? I am not sure because it is only related to
queries logged by log_min_duration_statements.
Alors, I wonder if we should use the same logic for other parameters, such as
log_statement_stats
log_parser_stats
log_planner_stats
log_executor_stats
It was mentioned in this thread
/messages/by-id/20180710183828.GB3890@telsasoft.com
On Sun, 18 Nov 2018 at 10:52, Adrien Nayrat <adrien.nayrat@anayrat.info>
wrote:For me, the question is how to name this GUC? Is "log_sample_rate" is enough?
I am not sure because it is only related to queries logged by
log_min_duration_statements.
Since it's hard to come up with a concise name that will mention sampling rate
in the context of min_duration_statement, I think it's fine to name this
configuration "log_sample_rate", as long as it's dependency from
log_min_duration_statements is clearly explained in the documentation.
On Sun, Nov 18, 2018 at 12:18:33PM +0100, Dmitry Dolgov wrote:
Since it's hard to come up with a concise name that will mention sampling rate
in the context of min_duration_statement, I think it's fine to name this
configuration "log_sample_rate", as long as it's dependency from
log_min_duration_statements is clearly explained in the documentation.
log_sample_rate looks fine to me as a name.
--
Michael
On 11/19/18 2:57 AM, Michael Paquier wrote:
On Sun, Nov 18, 2018 at 12:18:33PM +0100, Dmitry Dolgov wrote:
Since it's hard to come up with a concise name that will mention sampling rate
in the context of min_duration_statement, I think it's fine to name this
configuration "log_sample_rate", as long as it's dependency from
log_min_duration_statements is clearly explained in the documentation.log_sample_rate looks fine to me as a name.
That seems far too short to me - the name should indicate it applies to
statement logging. I'd say log_statement_sample_rate is better.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 11/18/18 10:52 AM, Adrien Nayrat wrote:
...
Alors, I wonder if we should use the same logic for other parameters, such as
log_statement_stats
log_parser_stats
log_planner_stats
log_executor_statsIt was mentioned in this thread
/messages/by-id/20180710183828.GB3890@telsasoft.com
You mean apply sampling to those logging options too? I doubt that would
be very useful, as the those options are IMHO meant for development.
Maybe tracking some of the info would be useful, but integrating it into
pg_stat_statements seems like a better solution than just dumping it
into the server log.
regards
--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Nov 19, 2018 at 2:40 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 11/19/18 2:57 AM, Michael Paquier wrote:
On Sun, Nov 18, 2018 at 12:18:33PM +0100, Dmitry Dolgov wrote:
Since it's hard to come up with a concise name that will mention sampling rate
in the context of min_duration_statement, I think it's fine to name this
configuration "log_sample_rate", as long as it's dependency from
log_min_duration_statements is clearly explained in the documentation.log_sample_rate looks fine to me as a name.
That seems far too short to me - the name should indicate it applies to
statement logging. I'd say log_statement_sample_rate is better.
I agree, sounds reasonable.
On 11/19/18 2:52 PM, Dmitry Dolgov wrote:
On Mon, Nov 19, 2018 at 2:40 PM Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 11/19/18 2:57 AM, Michael Paquier wrote:
On Sun, Nov 18, 2018 at 12:18:33PM +0100, Dmitry Dolgov wrote:
Since it's hard to come up with a concise name that will mention sampling rate
in the context of min_duration_statement, I think it's fine to name this
configuration "log_sample_rate", as long as it's dependency from
log_min_duration_statements is clearly explained in the documentation.log_sample_rate looks fine to me as a name.
That seems far too short to me - the name should indicate it applies to
statement logging. I'd say log_statement_sample_rate is better.I agree, sounds reasonable.
Thanks for your comments. Here is the updated patch. I fixed a warning for
missing parentheses in this expression:
if ((exceeded && in_sample) || log_duration)
It passed make check_world and make docs