Sampling-based timing for EXPLAIN ANALYZE

Started by Lukas Fittl over 3 years ago · 12 messages · pgsql-hackers
#1 Lukas Fittl
lukas@fittl.com

Hi,

Since EXPLAIN ANALYZE with TIMING ON still carries noticeable overhead on
modern hardware (despite time sources being faster), I'd like to propose a
new setting for EXPLAIN ANALYZE, called "TIMING SAMPLING", as an
alternative to TIMING ON.

This new timing mode uses a timer on a fixed recurring frequency (e.g. 100
or 1000 Hz) to gather a sampled timestamp on a predefined schedule, instead
of getting the time on-demand when InstrStartNode/InstrStopNode is called.
To implement the timer, we can use the existing timeout infrastructure,
which is backed by a wall clock timer (ITIMER_REAL).

Conceptually this is inspired by how sampling profilers work (e.g. "perf"),
but it ties into the existing per-plan node instrumentation done by EXPLAIN
ANALYZE, and simply provides a lower accuracy version of the total time for
each plan node.

In EXPLAIN output this is marked as "sampled time", and scaled to the total
wall clock time (to adjust for the sampling undercounting):

=# EXPLAIN (ANALYZE, BUFFERS, TIMING SAMPLING, SAMPLEFREQ 100) SELECT ...;
                                                              QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=201747.90..201748.00 rows=10 width=12) (actual sampled time=5490.974 rows=9 loops=1)
   ...
   ->  Hash Join  (cost=0.23..199247.90 rows=499999 width=4) (actual sampled time=3738.619 rows=9000000 loops=1)
         ...
         ->  Seq Scan on large  (cost=0.00..144247.79 rows=9999979 width=4) (actual sampled time=1004.671 rows=10000001 loops=1)
         ...
         ->  Hash  (cost=0.10..0.10 rows=10 width=4) (actual sampled time=0.000 rows=10 loops=1)
         ...
 Execution Time: 5491.475 ms
---

In simple query tests like this on my local machine, this shows a
consistent benefit over TIMING ON (and behaves close to ANALYZE with TIMING
OFF), whilst providing "good enough" accuracy to identify which part of the
query was problematic.

Attached is a prototype patch for early feedback on the concept, with tests
and documentation to come in a follow-up. Since the January commitfest is
still marked as open I'll register it there, but note that my assumption is
this is *not* Postgres 16 material.

As an open item, note that in the patch the requested sampling frequency is
not yet passed to parallel workers (it always defaults to 1000 Hz when
sampling is enabled). Also, note the timing frequency is limited to a
maximum of 1000 Hz (1ms) due to current limitations of the timeout
infrastructure.

With thanks to Andres Freund for help on refining the idea, collaborating
on early code and finding the approach to hook into the timeout API.

Thanks,
Lukas

--
Lukas Fittl

Attachments:

v1-0001-Add-TIMING-SAMPLING-option-for-EXPLAIN-ANALYZE.patch (application/octet-stream, +320/-10)
#2 Jelte Fennema-Nio
postgres@jeltef.nl
In reply to: Lukas Fittl (#1)
Re: Sampling-based timing for EXPLAIN ANALYZE

Nice addition! And the code looks pretty straightforward.

The current patch triggers warnings:
https://cirrus-ci.com/task/6016013976731648 Looks like you need to add void
as the argument.

Do you have some performance comparison between TIMING ON and TIMING
SAMPLING?

In InstrStartSampling there's logic to increase/decrease the frequency of
an already existing timer. It's not clear to me when this can occur. I'd
expect sampling frequency to remain constant throughout an explain plan. If
it's indeed needed, I think a code comment would be useful to explain why
this edge case is necessary.

On Fri, 6 Jan 2023 at 09:41, Lukas Fittl <lukas@fittl.com> wrote:


#3 David Geier
geidav.pg@gmail.com
In reply to: Jelte Fennema-Nio (#2)
Re: Sampling-based timing for EXPLAIN ANALYZE

Nice idea.

On 1/6/23 10:19, Jelte Fennema wrote:

Do you have some performance comparison between TIMING ON and TIMING
SAMPLING?

+1 to see some numbers compared to TIMING ON.

Mostly I'm wondering if the sampling-based approach gains us enough to
be worth it, once the patch to use RDTSC has hopefully landed (see [1]). I
believe that with the RDTSC patch the overhead of TIMING ON is lower
than the overhead of using ANALYZE with TIMING OFF in the first place.
Hence, to be really useful, it would be great if we could on top of
TIMING SAMPLING also lower the overhead of ANALYZE itself further (e.g.
by using a fast path for the default EXPLAIN (ANALYZE, TIMING ON /
SAMPLING)). Currently, InstrStartNode() and InstrStopNode() have a ton
of branches and without all the typically deactivated code the
implementation would be very small and could be placed in an inlinable
function.

[1]: /messages/by-id/20200612232810.f46nbqkdhbutzqdg@alap3.anarazel.de

--
David Geier
(ServiceNow)

#4 Andres Freund
andres@anarazel.de
In reply to: David Geier (#3)
Re: Sampling-based timing for EXPLAIN ANALYZE

Hi,

On 2023-01-13 09:11:06 +0100, David Geier wrote:

Mostly I'm wondering if the sampling based approach gains us enough to be
worth it, once the patch to use RDTSC hopefully landed (see [1]).

Well, I'm not sure we have a path forward on it. There's portability and
accuracy concerns. But more importantly:

I believe that with the RDTSC patch the overhead of TIMING ON is lower than
the overhead of using ANALYZE with TIMING OFF in the first place. Hence, to
be really useful, it would be great if we could on top of TIMING SAMPLING
also lower the overhead of ANALYZE itself further (e.g. by using a fast path
for the default EXPLAIN (ANALYZE, TIMING ON / SAMPLING)). Currently,
InstrStartNode() and InstrStopNode() have a ton of branches and without all
the typically deactivated code the implementation would be very small and
could be placed in an inlinable function.

Yes, I think SAMPLING could get rid of most of the instrumentation overhead -
at the cost of a bit less detail in the explain, of course. Which could make
it a lot more feasible to enable something like auto_explain.log_timing in
busy workloads.

For the sampling mode we don't really need something like
InstrStart/StopNode. We just need a pointer to the node currently executing - not
free to set, but still a heck of a lot cheaper than InstrStopNode(), even
without ->need_timer etc. Then the timer just needs to do
instr->sampled_total += (now - last_sample)
last_sample = now

I've been thinking that we should consider making more of the instrumentation
code work like that. The amount of work we're doing in InstrStart/StopNode()
has steadily risen. When buffer usage and WAL usage are enabled, we're
executing over 1000 instructions! And each single Instrumentation node is ~450
bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
embedded.

If we instead have InstrStartNode() set up a global pointer to the
Instrumentation node, we can make the instrumentation code modify both the
"global" counters (pgBufferUsage, pgWalUsage) and, if set,
current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
changes - right now nodes "automatically" include the IO/WAL incurred in child
nodes, but that's just a small bit of additional summing up to be done during
EXPLAIN.

Separately, I think we should consider re-ordering Instrumentation so that
bufusage_start, walusage_start are after the much more commonly used
elements. We're forcing ntuples, nloops, .. onto separate cachelines, even
though they're accounted for unconditionally.

Greetings,

Andres Freund

#5 Lukas Fittl
lukas@fittl.com
In reply to: Jelte Fennema-Nio (#2)
Re: Sampling-based timing for EXPLAIN ANALYZE

On Fri, Jan 6, 2023 at 1:19 AM Jelte Fennema <me@jeltef.nl> wrote:

Nice addition! And the code looks pretty straight forward.

Thanks for reviewing!

The current patch triggers warnings:
https://cirrus-ci.com/task/6016013976731648 Looks like you need to add
void as the argument.

Fixed in v2 attached. This also adds a simple regression test, as well as
fixes the parallel worker handling.

Do you have some performance comparison between TIMING ON and TIMING
SAMPLING?

Here are some benchmarks of auto_explain overhead on my ARM-based M1
Macbook for the following query run with pgbench on a scale factor 100 data
set:

SELECT COUNT(*) FROM pgbench_branches JOIN pgbench_accounts USING (bid)
JOIN pgbench_tellers USING (bid) WHERE bid = 42;

(the motivation is to use a query that is more complex than the standard
pgbench select-only test query)

avg latency (best of 3), -T 300, -c 4, -s 100, shared_buffers 2GB, fsync
off, max_parallel_workers_per_gather 0:

master, log_timing = off: 871 ms (878 / 877 / 871)
patch, log_timing = off: 869 ms (882 / 880 / 869)
patch, log_timing = on: 890 ms (917 / 930 / 890)
patch, log_timing = sampling, samplefreq = 1000: 869 ms (887 / 869 / 894)

Additionally, here is Andres' benchmark from [1], with the sampling option
added:

% psql -Xc 'DROP TABLE IF EXISTS t; CREATE TABLE t AS SELECT * FROM
generate_series(1, 100000) g(i);' postgres && pgbench -n -r -t 100 -f
<(echo -e "SELECT COUNT(*) FROM t;EXPLAIN (ANALYZE, TIMING OFF) SELECT
COUNT(*) FROM t;EXPLAIN (ANALYZE, TIMING SAMPLING) SELECT COUNT(*) FROM
t;EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM t;") postgres |grep '^ '
DROP TABLE
SELECT 100000
3.507 0 SELECT COUNT(*) FROM t;
3.476 0 EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*)
FROM t;
3.576 0 EXPLAIN (ANALYZE, TIMING SAMPLING) SELECT
COUNT(*) FROM t;
5.096 0 EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*)
FROM t;

My pg_test_timing data for reference:

% pg_test_timing
Testing timing overhead for 3 seconds.
Per loop time including overhead: 23.65 ns
Histogram of timing durations:
  < us   % of total      count
     1     97.64472  123876325
     2      2.35421    2986658
     4      0.00022        277
     8      0.00016        202
    16      0.00064        815
    32      0.00005         64

In InstrStartSampling there's logic to increase/decrease the frequency of
an already existing timer. It's not clear to me when this can occur. I'd
expect sampling frequency to remain constant throughout an explain plan. If
it's indeed needed, I think a code comment would be useful to explain why
this edge case is necessary.

Clarified in a code comment in v2. This is needed for handling nested
statements, which could have different sampling frequencies for each nesting
level, i.e. a function might want to sample its queries at a higher
frequency than its caller.

Thanks,
Lukas

[1]: /messages/by-id/20230116213913.4oseovlzvc2674z7@awork3.anarazel.de

--
Lukas Fittl

Attachments:

v2-0001-Add-TIMING-SAMPLING-option-for-EXPLAIN-ANALYZE.patch (application/octet-stream, +419/-10)
#6 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#4)
Re: Sampling-based timing for EXPLAIN ANALYZE

On 1/15/23 21:22, Andres Freund wrote:

Hi,

On 2023-01-13 09:11:06 +0100, David Geier wrote:

Mostly I'm wondering if the sampling based approach gains us enough to be
worth it, once the patch to use RDTSC hopefully landed (see [1]).

Well, I'm not sure we have a path forward on it. There's portability and
accuracy concerns. But more importantly:

I believe that with the RDTSC patch the overhead of TIMING ON is lower than
the overhead of using ANALYZE with TIMING OFF in the first place. Hence, to
be really useful, it would be great if we could on top of TIMING SAMPLING
also lower the overhead of ANALYZE itself further (e.g. by using a fast path
for the default EXPLAIN (ANALYZE, TIMING ON / SAMPLING)). Currently,
InstrStartNode() and InstrStopNode() have a ton of branches and without all
the typically deactivated code the implementation would be very small and
could be placed in an inlinable function.

Yes, I think SAMPLING could get rid of most of the instrumentation overhead -
at the cost of a bit less detail in the explain, of course. Which could make
it a lot more feasible to enable something like auto_explain.log_timing in
busy workloads.

For the sampling mode we don't really need something like
InstrStart/StopNode. We just need a pointer to node currently executing - not
free to set, but still a heck of a lot cheaper than InstrStopNode(), even
without ->need_timer etc. Then the timer just needs to do
instr->sampled_total += (now - last_sample)
last_sample = now

I don't understand why we would even use timestamps, in this case? AFAIK
"sampling profilers" simply increment a counter for the executing node,
and then approximate the time as proportional to the count.

That also does not have issues with timestamp "rounding" - considering
e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
possible the node completes within 1ms, in which case

(now - last_sample)

ends up being 0 (assuming I correctly understand the code).

And I don't think there's any particularly good way to correct this.

It seems ExplainNode() attempts to do some correction, but I doubt
that's very reliable, as these fast nodes will have sampled_total=0, so
no matter what you multiply this with, it'll still be 0.

I've been thinking that we should consider making more of the instrumentation
code work like that. The amount of work we're doing in InstrStart/StopNode()
has steadily risen. When buffer usage and WAL usage are enabled, we're
executing over 1000 instructions! And each single Instrumentation node is ~450
bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
embedded.

If we instead have InstrStartNode() set up a global pointer to the
Instrumentation node, we can make the instrumentation code modify both the
"global" counters (pgBufferUsage, pgWalUsage) and, if set,
current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
changes - right now nodes "automatically" include the IO/WAL incurred in child
nodes, but that's just a small bit of additional summing up to be done during
EXPLAIN.

That's certainly one way to implement that. I wonder if we could make
that work without the global pointer, but I can't think of any.

Separately, I think we should consider re-ordering Instrumentation so that
bufusage_start, walusage_start are after the much more commonly used
elements. We're forcing ntuples, nloops, .. onto separate cachelines, even
though they're accounted for unconditionally.

+1 to that

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#7 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#6)
Re: Sampling-based timing for EXPLAIN ANALYZE

Hi,

On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:

I don't understand why we would even use timestamps, in this case? AFAIK
"sampling profilers" simply increment a counter for the executing node,
and then approximate the time as proportional to the count.

The timer interrupt distances aren't all that evenly spaced, particularly
under load, and are easily distorted by having to wait for IO, an lwlock ...

That also does not have issues with timestamp "rounding" - considering
e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
possible the node completes within 1ms, in which case

(now - last_sample)

ends up being 0 (assuming I correctly understand the code).

That part should be counting in nanoseconds, I think? Unless I misunderstand
something?

We already compute the timestamp inside timeout.c, but don't yet pass that to
timeout handlers. I think there's others re-computing timestamps.

And I don't think there's any particularly good way to correct this.

It seems ExplainNode() attempts to do some correction, but I doubt
that's very reliable, as these fast nodes will have sampled_total=0, so
no matter what you multiply this with, it'll still be 0.

That's just the scaling to the "actual time" that you're talking about above,
no?

I've been thinking that we should consider making more of the instrumentation
code work like that. The amount of work we're doing in InstrStart/StopNode()
has steadily risen. When buffer usage and WAL usage are enabled, we're
executing over 1000 instructions! And each single Instrumentation node is ~450
bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
embedded.

If we instead have InstrStartNode() set up a global pointer to the
Instrumentation node, we can make the instrumentation code modify both the
"global" counters (pgBufferUsage, pgWalUsage) and, if set,
current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
changes - right now nodes "automatically" include the IO/WAL incurred in child
nodes, but that's just a small bit of additional summing up to be done during
EXPLAIN.

That's certainly one way to implement that. I wonder if we could make
that work without the global pointer, but I can't think of any.

I don't see a realistic way at least. We could pass down an
"InstrumentationContext" through everything that needs to do IO and WAL. But
that seems infeasible at this point.

Greetings,

Andres Freund

#8 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#7)
Re: Sampling-based timing for EXPLAIN ANALYZE

On 1/17/23 18:02, Andres Freund wrote:

Hi,

On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:

I don't understand why we would even use timestamps, in this case? AFAIK
"sampling profilers" simply increment a counter for the executing node,
and then approximate the time as proportional to the count.

The timer interrupt distances aren't all that evenly spaced, particularly
under load, and are easily distorted by having to wait for IO, an lwlock ...

OK, so the difference is that these events (I/O, lwlocks) may block
signals, and after signals get unblocked we only get a single event for
each signal. Yeah, the timestamp handles that case better.

That also does not have issues with timestamp "rounding" - considering
e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
possible the node completes within 1ms, in which case

(now - last_sample)

ends up being 0 (assuming I correctly understand the code).

That part should be counting in nanoseconds, I think? Unless I misunderstand
something?

The higher precision does not help, because both values come from the
*sampled* timestamp (i.e. the one updated from the signal handler). So
if the node happens to execute between two signals, the values are going
to be the same, and the difference is 0.

Perhaps for many executions it works out, because some executions will
cross the boundary, and the average will converge to the right value.

We already compute the timestamp inside timeout.c, but don't yet pass that to
timeout handlers. I think there's others re-computing timestamps.

And I don't think there's any particularly good way to correct this.

It seems ExplainNode() attempts to do some correction, but I doubt
that's very reliable, as these fast nodes will have sampled_total=0, so
no matter what you multiply this with, it'll still be 0.

That's just the scaling to the "actual time" that you're talking about above,
no?

Maybe, not sure.

I've been thinking that we should consider making more of the instrumentation
code work like that. The amount of work we're doing in InstrStart/StopNode()
has steadily risen. When buffer usage and WAL usage are enabled, we're
executing over 1000 instructions! And each single Instrumentation node is ~450
bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
embedded.

If we instead have InstrStartNode() set up a global pointer to the
Instrumentation node, we can make the instrumentation code modify both the
"global" counters (pgBufferUsage, pgWalUsage) and, if set,
current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
changes - right now nodes "automatically" include the IO/WAL incurred in child
nodes, but that's just a small bit of additional summing up to be done during
EXPLAIN.

That's certainly one way to implement that. I wonder if we could make
that work without the global pointer, but I can't think of any.

I don't see a realistic way at least. We could pass down an
"InstrumentationContext" through everything that needs to do IO and WAL. But
that seems infeasible at this point.

Why infeasible?

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#9 Andres Freund
andres@anarazel.de
In reply to: Tomas Vondra (#8)
Re: Sampling-based timing for EXPLAIN ANALYZE

Hi,

On 2023-01-17 19:00:02 +0100, Tomas Vondra wrote:

On 1/17/23 18:02, Andres Freund wrote:

On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:

That also does not have issues with timestamp "rounding" - considering
e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
possible the node completes within 1ms, in which case

(now - last_sample)

ends up being 0 (assuming I correctly understand the code).

That part should be counting in nanoseconds, I think? Unless I misunderstand
something?

The higher precision does not help, because both values come from the
*sampled* timestamp (i.e. the one updated from the signal handler). So
if the node happens to execute between two signals, the values are going
to be the same, and the difference is 0.

In that case there simply wasn't any sample for the node, and a non-timestamp
based sample counter wouldn't do anything different?

If you're worried about the case where a timer does fire during execution of
the node, but exactly once, that should provide a difference between the last
sampled timestamp and the current time. It'll attribute a bit too much to the
in-progress nodes, but well, that's sampling for you.

I think a "hybrid" explain mode might be worth thinking about. Use the
"current" sampling method for the first execution of a node, and for the first
few milliseconds of a query (or perhaps the first few timestamp
acquisitions). That provides an accurate explain analyze for short queries,
without a significant slowdown. Then switch to sampling, which provides decent
attribution for a bit longer running queries.

I've been thinking that we should consider making more of the instrumentation
code work like that. The amount of work we're doing in InstrStart/StopNode()
has steadily risen. When buffer usage and WAL usage are enabled, we're
executing over 1000 instructions! And each single Instrumentation node is ~450
bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
embedded.

If we instead have InstrStartNode() set up a global pointer to the
Instrumentation node, we can make the instrumentation code modify both the
"global" counters (pgBufferUsage, pgWalUsage) and, if set,
current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
changes - right now nodes "automatically" include the IO/WAL incurred in child
nodes, but that's just a small bit of additional summing up to be done during
EXPLAIN.

That's certainly one way to implement that. I wonder if we could make
that work without the global pointer, but I can't think of any.

I don't see a realistic way at least. We could pass down an
"InstrumentationContext" through everything that needs to do IO and WAL. But
that seems infeasible at this point.

Why infeasible?

Primarily the scale of the change. We'd have to pass down the context into all
table/index AM functions. And into a lot of bufmgr.c, xlog.c functions,
which'd require their callers to have access to the context. That's hundreds
if not thousands places.

Adding that many function parameters might turn out to be noticeable runtime
wise, due to increased register movement. I think for a number of the places
where we currently don't, we ought to use by-reference struct for the
not-always-used parameters, that then also could contain this context.

Greetings,

Andres Freund

#10 Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: Andres Freund (#9)
Re: Sampling-based timing for EXPLAIN ANALYZE

On 1/17/23 19:46, Andres Freund wrote:

Hi,

On 2023-01-17 19:00:02 +0100, Tomas Vondra wrote:

On 1/17/23 18:02, Andres Freund wrote:

On 2023-01-17 15:52:07 +0100, Tomas Vondra wrote:

That also does not have issues with timestamp "rounding" - considering
e.g. sample rate 1000Hz, that's 1ms between samples. And it's quite
possible the node completes within 1ms, in which case

(now - last_sample)

ends up being 0 (assuming I correctly understand the code).

That part should be counting in nanoseconds, I think? Unless I misunderstand
something?

The higher precision does not help, because both values come from the
*sampled* timestamp (i.e. the one updated from the signal handler). So
if the node happens to execute between two signals, the values are going
to be the same, and the difference is 0.

In that case there simply wasn't any sample for the node, and a non-timestamp
based sample counter wouldn't do anything different?

Yeah, you're right.

If you're worried about the case where a timer does fire during execution of
the node, but exactly once, that should provide a difference between the last
sampled timestamp and the current time. It'll attribute a bit too much to the
in-progress nodes, but well, that's sampling for you.

I think a "hybrid" explain mode might be worth thinking about. Use the
"current" sampling method for the first execution of a node, and for the first
few milliseconds of a query (or perhaps the first few timestamp
acquisitions). That provides an accurate explain analyze for short queries,
without a significant slowdown. Then switch to sampling, which provides decent
attribution for a bit longer running queries.

Yeah, this is essentially the sampling I imagined when I first read the
subject of this thread. It samples which node executions to measure (and
then measures those accurately), while these patches sample timestamps.

I've been thinking that we should consider making more of the instrumentation
code work like that. The amount of work we're doing in InstrStart/StopNode()
has steadily risen. When buffer usage and WAL usage are enabled, we're
executing over 1000 instructions! And each single Instrumentation node is ~450
bytes, to a good degree due to having 2 BufUsage and 2 WalUsage structs
embedded.

If we instead have InstrStartNode() set up a global pointer to the
Instrumentation node, we can make the instrumentation code modify both the
"global" counters (pgBufferUsage, pgWalUsage) and, if set,
current_instr->{pgBufferUsage, pgWalUsage}. That'll require some larger
changes - right now nodes "automatically" include the IO/WAL incurred in child
nodes, but that's just a small bit of additional summing up to be done during
EXPLAIN.

That's certainly one way to implement that. I wonder if we could make
that work without the global pointer, but I can't think of any.

I don't see a realistic way at least. We could pass down an
"InstrumentationContext" through everything that needs to do IO and WAL. But
that seems infeasible at this point.

Why infeasible?

Primarily the scale of the change. We'd have to pass down the context into all
table/index AM functions. And into a lot of bufmgr.c, xlog.c functions,
which'd require their callers to have access to the context. That's hundreds
if not thousands places.

Adding that many function parameters might turn out to be noticeable runtime
wise, due to increased register movement. I think for a number of the places
where we currently don't, we ought to use by-reference struct for the
not-always-used parameters, that then also could contain this context.

OK, I haven't realized we'd have to pass it to that many places.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

#11 Bruce Momjian
bruce@momjian.us
In reply to: Tomas Vondra (#10)
Re: Sampling-based timing for EXPLAIN ANALYZE

On Tue, 17 Jan 2023 at 14:52, Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:

On 1/17/23 19:46, Andres Freund wrote:

I think a "hybrid" explain mode might be worth thinking about. Use the
"current" sampling method for the first execution of a node, and for the first
few milliseconds of a query (or perhaps the first few timestamp
acquisitions). That provides an accurate explain analyze for short queries,
without a significant slowdown. Then switch to sampling, which provides decent
attribution for a bit longer running queries.

Yeah, this is essentially the sampling I imagined when I first read the
subject of this thread. It samples which node executions to measure (and
then measures those accurately), while these patches sample timestamps.

That sounds interesting. Fwiw my first thought would be to implement
it a bit differently. Always have a timer running sampling right
from the start, but also if there are less than, say, 1000 samples for
a node then measure the actual start/finish time.

So for any given node once you've hit enough samples to get a decent
estimate you stop checking the time. That way any fast or rarely
called nodes still have accurate measurements even if they get few
samples and any long or frequently called nodes stop getting
timestamps and just use timer counts.

--
greg

#12 Daniel Gustafsson
daniel@yesql.se
In reply to: Bruce Momjian (#11)
Re: Sampling-based timing for EXPLAIN ANALYZE

This thread has been stale since January with no movement at all during the
March CF, and according to the CFBot it stopped building ~14 weeks ago.

I'm marking this returned with feedback; it can be resubmitted for a future CF
if someone decides to pick it up.

--
Daniel Gustafsson