Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Started by Andres Freund, over 5 years ago · 112 messages
#1Andres Freund
andres@anarazel.de

Hi,

Currently using EXPLAIN (ANALYZE) without TIMING OFF regularly changes
the resulting timing enough that the times aren't meaningful. E.g.

CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;

-- best of three:
SELECT count(*) FROM lotsarows;
Time: 1923.394 ms (00:01.923)

-- best of three:
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 2319.830 ms (00:02.320)

-- best of three:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4202.649 ms (00:04.203)

That's nearly *double* the execution time of the TIMING OFF case.

Looking at a profile of this shows that we spend a good bit of cycles
"normalizing" timestamps etc. That seems pretty unnecessary, just forced
on us due to struct timespec. So the first attached patch just turns
instr_time into a 64-bit integer, counting nanoseconds.

That helps, a tiny bit:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4179.302 ms (00:04.179)

but obviously doesn't move the needle.

Looking at a profile it's easy to confirm that we spend a lot of time
acquiring time:
-   95.49%     0.00%  postgres     postgres                 [.] agg_retrieve_direct (inlined)
   - agg_retrieve_direct (inlined)
      - 79.27% fetch_input_tuple
         - ExecProcNode (inlined)
            - 75.72% ExecProcNodeInstr
               + 25.22% SeqNext
               - 21.74% InstrStopNode
                  + 17.80% __GI___clock_gettime (inlined)
               - 21.44% InstrStartNode
                  + 19.23% __GI___clock_gettime (inlined)
               + 4.06% ExecScan
      + 13.09% advance_aggregates (inlined)
        1.06% MemoryContextReset

And that's even though Linux avoids a syscall (in most cases) to acquire
the time. Unless the kernel detects there's a reason not to do so, Linux
does this by using 'rdtscp' and multiplying it by kernel-provided
factors to turn the cycles into time.

Some of the time is spent doing function calls, dividing into struct
timespec, etc. But most of it is just the rdtscp instruction itself:
65.30 │1 63: rdtscp

The reason for that is largely that rdtscp waits until all prior
instructions have finished (though it allows later instructions to start
early). And we pay that cost multiple times for each tuple.
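To make the distinction concrete, here is a minimal sketch of the two kinds of TSC read, using the GCC/Clang intrinsics; the clock_gettime fallback for non-x86 builds is my assumption, mirroring the prototype's general approach:

```c
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__)
#include <x86intrin.h>
#endif

/* Unordered TSC read: cheap, may be reordered with surrounding work. */
static inline uint64_t
read_tsc_unordered(void)
{
#if defined(__x86_64__)
	return __rdtsc();
#else
	/* fallback for non-x86 builds */
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t) ts.tv_sec * 1000000000u + (uint64_t) ts.tv_nsec;
#endif
}

/*
 * Partially ordered TSC read: rdtscp waits for all prior instructions
 * to finish before sampling the counter (later instructions may still
 * start early). This is the stall paid on every read via the kernel's
 * clock_gettime path.
 */
static inline uint64_t
read_tsc_ordered(void)
{
#if defined(__x86_64__)
	unsigned int aux;

	return __rdtscp(&aux);
#else
	return read_tsc_unordered();
#endif
}
```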

In the second attached prototype patch I've changed instr_time to count
in CPU cycles instead of nanoseconds, and then just turned the cycles
into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later).

When using rdtsc that results in *vastly* lower overhead:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1) │
│ -> Seq Scan on lotsarows (cost=0.00..721239.16 rows=50000016 width=0) (actual time=0.006..1512.886 rows=50000000 loops=1) │
│ Planning Time: 0.028 ms │
│ Execution Time: 2610.256 ms │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(4 rows)

Time: 2610.589 ms (00:02.611)

And there are still some smaller improvements that could be made on top
of that.

As a comparison, here's the time when using rdtscp directly in
instr_time, instead of going through clock_gettime:
Time: 3481.162 ms (00:03.481)

That shows pretty well how big the cost of the added pipeline stalls
are, and how important out-of-order execution is for decent
performance...

In my opinion, for the use in InstrStartNode(), InstrStopNode() etc, we
do *not* want to wait for prior instructions to finish, since that
actually leads to the timing being less accurate, rather than
more. There are other cases where that'd be different, e.g. measuring
how long an entire query takes or such (but there it's probably
irrelevant which to use).

Above I've skipped a bit over the details of how to turn the cycles
returned by rdtsc into time:

On x86 CPUs of the last ~12 years rdtsc doesn't return the cycles that
have actually been run, but instead returns the number of 'reference
cycles'. That's important because otherwise things like turbo mode and
lower power modes would lead to completely bogus times.

Thus, knowing the "base frequency" of the CPU allows us to turn the
difference between two rdtsc return values into seconds.

In the attached prototype I just determined the base frequency using
cpuid(0x16). That's only available since Skylake (I think). On older
CPUs we'd have to look at /proc/cpuinfo or
/sys/devices/system/cpu/cpu0/cpufreq/base_frequency.
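The conversion itself is a single division. A sketch (assuming, as Intel documents for leaf 0x16, that EAX reports the base frequency in MHz; the function name is mine):

```c
#include <stdint.h>

/*
 * Turn a difference of two rdtsc readings into seconds, given the
 * CPU base frequency in MHz (e.g. as reported by cpuid leaf 0x16).
 */
static double
tsc_cycles_to_seconds(uint64_t cycles, uint32_t base_freq_mhz)
{
	return (double) cycles / ((double) base_freq_mhz * 1000000.0);
}
```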

There's also other issues with using rdtsc directly: On older CPUs, in
particular older multi-socket systems, the tsc will not be synchronized
in detail across cores. There's bits that'd let us check whether tsc is
suitable or not. The more current issue is that things like virtual
machines being migrated can lead to rdtsc suddenly returning a different
value / the frequency differing. But that is supposed to be
solved these days, by having virtualization technologies set frequency
multipliers and offsets which then cause rdtsc[p] to return something
meaningful, even after migration.

The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found the overhead made
it too hard to nail down in which part of a query tree performance
changed. If somebody else wants to continue from here...

I do think it'd be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much, even if it requires
a bunch of low-level code.

Greetings,

Andres Freund

Attachments:

v1-0001-WIP-Change-instr_time-to-just-store-nanoseconds-t.patch (text/x-diff) +26 -37
v1-0002-WIP-Use-cpu-reference-cycles-via-rdtsc-to-measure.patch (text/x-diff) +60 -9
#2Pavel Stehule
pavel.stehule@gmail.com
In reply to: Andres Freund (#1)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

On Sat, Jun 13, 2020 at 1:28 AM Andres Freund <andres@anarazel.de> wrote:

[...]

I do think it'd be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much, even if it requires
a bunch of low-level code.

+1

Pavel

#3Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#1)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

On Sat, Jun 13, 2020 at 11:28 AM Andres Freund <andres@anarazel.de> wrote:

[PATCH v1 1/2] WIP: Change instr_time to just store nanoseconds, that's cheaper.

Makes a lot of sense. If we do this, I'll need to update pgbench,
which just did something similar locally. If I'd been paying
attention to this thread I might not have committed that piece of the
recent pgbench changes, but it's trivial stuff and I'll be happy to
tidy that up when the time comes.

[PATCH v1 2/2] WIP: Use cpu reference cycles, via rdtsc, to measure time for instrumentation.

Some of the time is spent doing function calls, dividing into struct
timespec, etc. But most of it just the rdtscp instruction:
65.30 │1 63: rdtscp

The reason for that is largely that rdtscp waits until all prior
instructions have finished (but it allows later instructions to already
start). Multiple times for each tuple.

Yeah, after reading a bit about this, I agree that there is no reason
to think that the stalling version makes the answer better in any way.
It might make sense if you use it once at the beginning of a large
computation, but it makes no sense if you sprinkle it around inside
blocks that will run multiple times. It destroys your
instructions-per-cycle, turning your fancy superscalar Pentium into a
486. It does raise some interesting questions about what exactly you're
measuring, though: I don't know enough to have a good grip on how far
out of order the TSC could be read!

There's also other issues with using rdtsc directly: On older CPUs, in
particular older multi-socket systems, the tsc will not be synchronized
in detail across cores. There's bits that'd let us check whether tsc is
suitable or not. The more current issue of that is that things like
virtual machines being migrated can lead to rdtsc suddenly returning a
different value / the frequency differening. But that is supposed to be
solved these days, by having virtualization technologies set frequency
multipliers and offsets which then cause rdtsc[p] to return something
meaningful, even after migration.

Googling tells me that Nehalem (2008) introduced "invariant TSC"
(clock rate independent) and also socket synchronisation at the same
time, so systems without it are already pretty long in the tooth.

A quick peek at an AMD manual [1] (page 373) tells me that a similar
change happened in 15H/Bulldozer/Piledriver/Steamroller/Excavator
(2011), identified with the same CPUID test.

My first reaction is that it seems like TSC would be the least of your
worries if you're measuring a VM that's currently migrating between
hosts, but maybe the idea is just that you have to make sure you don't
assume it can't ever go backwards or something like that...

Google Benchmark has some clues about how to spell this on MSVC, what
some instructions might be to research on ARM, etc.

[1]: https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf (page 373)
[2]: https://github.com/google/benchmark/blob/master/src/cycleclock.h

#4Lukas Fittl
lukas@fittl.com
In reply to: Andres Freund (#1)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

On Fri, Jun 12, 2020 at 4:28 PM Andres Freund <andres@anarazel.de> wrote:

The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found it the overhead
made it too hard to nail down in which part of a query tree performance
changed. If somebody else wants to continue from here...

I do think it's be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a
bunch of low-level code.

Based on an off-list conversation with Andres, I decided to dust off this
old patch for using rdtsc directly. The significant EXPLAIN ANALYZE
performance improvements (especially when using rdtsc instead of
rdtsc*p*) seem to warrant giving this a more thorough look.

See attached an updated patch (adding it to the July commitfest), with a
few changes:

- Keep using clock_gettime() as a fallback if we decide to not use rdtsc
- Fall back to /proc/cpuinfo for the clock frequency, if cpuid(0x16)
  doesn't work
- The decision to use rdtsc (or not) is made at runtime, in the new
  INSTR_TIME_INITIALIZE() -- we can't make this decision at compile time
  because it depends on the specific CPU in use, amongst other things
- In an abundance of caution, for now I've decided to only enable this if
  we are on Linux/x86, and the current kernel clocksource is TSC (the
  kernel has quite sophisticated logic around making this decision, see [1])
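The clocksource check can be sketched by reading sysfs; the path below is the standard Linux clocksource ABI, and the function name is mine rather than the patch's:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true only if the kernel itself decided the TSC is trustworthy
 * enough to use as its clocksource. On any error, be conservative and
 * report false, so callers fall back to clock_gettime().
 */
static bool
kernel_clocksource_is_tsc(void)
{
	FILE	   *f;
	char		buf[32] = {0};
	bool		result = false;

	f = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
	if (f == NULL)
		return false;
	if (fgets(buf, sizeof(buf), f) != NULL)
		result = (strncmp(buf, "tsc", 3) == 0);
	fclose(f);
	return result;
}
```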

Note that if we implemented the decision logic ourselves (instead of
relying on the Linux kernel), I'd be most worried about older
virtualization technology. In my understanding getting this right is
notably more complicated than just checking cpuid, see [2].

Known WIP problems with this patch version:

* There appears to be a timing discrepancy I haven't yet worked out, where
  the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
  reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
  higher for \timing than for the EXPLAIN ANALYZE time reported on the
  server side, only when rdtsc measurement is used -- it's likely there is
  a problem somewhere with how we perform the cycles-to-time conversion
* Possibly related, the floating point handling for the cycles_to_sec
  variable is problematic in terms of precision (see FIXME, taken over
  from Andres' POC)

Open questions from me:

1) Do we need to account for different TSC offsets on different CPUs in SMP
   systems? (the Linux kernel certainly has logic to that extent, but [3]
   suggests this is no longer a problem on Nehalem and newer chips, i.e.
   those having an invariant TSC)

2) Should we have a setting "--with-tsc" for configure? (instead of always
   enabling it when on Linux/x86 with a TSC clocksource)

3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for
   current instructions to finish -- the prior discussion seemed to suggest
   we don't want it for node instrumentation measurements, but possibly we
   do want this in other cases?)

4) Should we support using the "mrs" instruction on ARM? (which is similar
   to rdtsc, see [4])

Thanks,
Lukas

[1]: https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c
[2]: http://oliveryang.net/2015/09/pitfalls-of-TSC-usage/
[3]: https://stackoverflow.com/a/11060619/1652607
[4]: https://cpufun.substack.com/p/fun-with-timers-and-cpuid

--
Lukas Fittl

Attachments:

v2-0002-WIP-Use-cpu-reference-cycles-via-rdtsc-to-measure.patch (application/octet-stream) +163 -12
v2-0001-WIP-Change-instr_time-to-just-store-nanoseconds-t.patch (application/octet-stream) +26 -37
#5Andres Freund
andres@anarazel.de
In reply to: Lukas Fittl (#4)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Hi,

On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:

On Fri, Jun 12, 2020 at 4:28 PM Andres Freund <andres@anarazel.de> wrote:

The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found it the overhead
made it too hard to nail down in which part of a query tree performance
changed. If somebody else wants to continue from here...

I do think it's be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a
bunch of low-level code.

Based on an off-list conversation with Andres, I decided to dust off this
old patch for using rdtsc directly. The significant EXPLAIN ANALYZE
performance improvements (especially when using rdtsc instead of
rdtsc*p*) seem to warrant giving this a more thorough look.

See attached an updated patch (adding it to the July commitfest), with a few
changes:

- Keep using clock_gettime() as a fallback if we decide to not use rdtsc

Yep.

- Fallback to /proc/cpuinfo for clock frequency, if cpuid(0x16) doesn't work

I suspect that this might not be needed anymore. Seems like it'd be ok to just
fall back to clock_gettime() in that case.

- In an abundance of caution, for now I've decided to only enable this if
  we are on Linux/x86, and the current kernel clocksource is TSC (the
  kernel has quite sophisticated logic around making this decision, see [1])

I think our requirements are a bit lower than the kernel's - we're not
tracking wall clock over an extended period...

Note that if we implemented the decision logic ourselves (instead of
relying on the Linux kernel), I'd be most worried about older
virtualization technology. In my understanding getting this right is
notably more complicated than just checking cpuid, see [2].

Known WIP problems with this patch version:

* There appears to be a timing discrepancy I haven't yet worked out, where
  the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
  reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
  higher for \timing than for the EXPLAIN ANALYZE time reported on the
  server side, only when rdtsc measurement is used -- it's likely there is
  a problem somewhere with how we perform the cycles-to-time conversion

Could you explain a bit more what you're seeing? I just tested your patches
and didn't see that here.

* Possibly related, the floating point handling for the cycles_to_sec
  variable is problematic in terms of precision (see FIXME, taken over
  from Andres' POC)

And probably also performance...

Open questions from me:

1) Do we need to account for different TSC offsets on different CPUs in SMP
   systems? (the Linux kernel certainly has logic to that extent, but [3]
   suggests this is no longer a problem on Nehalem and newer chips, i.e.
   those having an invariant TSC)

I don't think we should cater to systems where we need that.

2) Should we have a setting "--with-tsc" for configure? (instead of always
enabling it when on Linux/x86 with a TSC clocksource)

Probably not worth it.

3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for
   current instructions to finish -- the prior discussion seemed to suggest
   we don't want it for node instrumentation measurements, but possibly we
   do want this in other cases?)

I was wondering about that too... Perhaps we should add a
INSTR_TIME_SET_CURRENT_BARRIER() or such?

4) Should we support using the "mrs" instruction on ARM? (which is similar
   to rdtsc, see [4])

I'd leave that for later personally.

#define NS_PER_S INT64CONST(1000000000)
#define US_PER_S INT64CONST(1000000)
#define MS_PER_S INT64CONST(1000)
@@ -95,17 +104,37 @@ typedef int64 instr_time;

#define INSTR_TIME_SET_ZERO(t) ((t) = 0)

-static inline instr_time pg_clock_gettime_ns(void)
+extern double cycles_to_sec;
+
+bool use_rdtsc;

This should be extern and inside the ifdef below.

+#if defined(__x86_64__) && defined(__linux__)
+extern void pg_clock_gettime_initialize_rdtsc(void);
+#endif
+
+static inline instr_time pg_clock_gettime_ref_cycles(void)
 {
 	struct timespec tmp;
+#if defined(__x86_64__) && defined(__linux__)
+	if (use_rdtsc)
+		return __rdtsc();
+#endif
+
 	clock_gettime(PG_INSTR_CLOCK, &tmp);

 	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
 }

Greetings,

Andres Freund

#6Maciek Sakrejda
m.sakrejda@gmail.com
In reply to: Andres Freund (#5)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

I ran that original test case with and without the patch. Here are the
numbers I'm seeing:

master (best of three):

postgres=# SELECT count(*) FROM lotsarows;
Time: 582.423 ms

postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 616.102 ms

postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 1068.700 ms (00:01.069)

patched (best of three):

postgres=# SELECT count(*) FROM lotsarows;
Time: 550.822 ms

postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 612.572 ms

postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 690.875 ms

On Fri, Jul 1, 2022 at 10:26 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
...

Known WIP problems with this patch version:

* There appears to be a timing discrepancy I haven't yet worked out, where
the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
higher for \timing than for the EXPLAIN ANALYZE time reported on the
server
side, only when rdtsc measurement is used -- its likely there is a problem
somewhere with how we perform the cycles to time conversion

Could you explain a bit more what you're seeing? I just tested your patches
and didn't see that here.

I did not see this either, but I did see that the execution time
reported by \timing is (for this test case) consistently 0.5-1ms
*lower* than the Execution Time reported by EXPLAIN. I did not see
that on master. Is that expected?

Thanks,
Maciek

#7Ibrar Ahmed
ibrar.ahmad@gmail.com
In reply to: Maciek Sakrejda (#6)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

On Fri, Jul 15, 2022 at 11:22 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:

[...]

I did not see this either, but I did see that the execution time
reported by \timing is (for this test case) consistently 0.5-1ms
*lower* than the Execution Time reported by EXPLAIN. I did not see
that on master. Is that expected?

The patch requires a rebase; please rebase the patch with the latest code.

Hunk #5 succeeded at 147 with fuzz 2 (offset -3 lines).
Hunk #6 FAILED at 170.
Hunk #7 succeeded at 165 (offset -69 lines).
2 out of 7 hunks FAILED -- saving rejects to file
src/include/portability/instr_time.h.rej
patching file src/tools/msvc/Mkvcbuild.pm

--
Ibrar Ahmed

#8Michael Paquier
michael@paquier.xyz
In reply to: Ibrar Ahmed (#7)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

On Tue, Sep 06, 2022 at 11:32:18AM +0500, Ibrar Ahmed wrote:

Hunk #5 succeeded at 147 with fuzz 2 (offset -3 lines).
Hunk #6 FAILED at 170.
Hunk #7 succeeded at 165 (offset -69 lines).
2 out of 7 hunks FAILED -- saving rejects to file
src/include/portability/instr_time.h.rej
patching file src/tools/msvc/Mkvcbuild.pm

No rebased version has been sent since this update, so this patch has
been marked as RwF.
--
Michael

#9David Geier
geidav.pg@gmail.com
In reply to: Michael Paquier (#8)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

I think it would be great to get this patch committed. Beyond the
reasons already mentioned, the significant overhead also tends to skew
the reported runtimes in ways that makes it difficult to compare them.
For example, if two nodes are executed equally often but one needs twice
the time to process the rows: in such a case EXPLAIN ANALYZE should
report timings that are 2x apart. However, currently, the high overhead
of clock_gettime() tends to skew the relative runtimes.

On 10/12/22 10:33, Michael Paquier wrote:

No rebased version has been sent since this update, so this patch has
been marked as RwF.

I've rebased the patch set on latest master and fixed a few compiler
warnings. Beyond that some findings and thoughts:

You're only using RDTSC if the clock source is 'tsc'. Great idea to not
bother caring about a lot of hairy TSC details. Looking at the kernel
code this seems to imply that the TSC is frequency invariant. I don't
think though that this implies that Linux is not running under a
hypervisor; which is good because I assume PostgreSQL is used a lot in
VMs. However, when running under a hypervisor (at least with VMWare)
CPUID leaf 0x16 is not available. In my tests __get_cpuid() indicated
success but the returned values were garbage. Instead of using leaf
0x16, we should then use the hypervisor interface to obtain the TSC
frequency. Checking if a hypervisor is active can be done via:

bool IsHypervisorActive()
{
    uint32 cpuinfo[4] = {0};
    int res = __get_cpuid(0x1, &cpuinfo[0], &cpuinfo[1],
                          &cpuinfo[2], &cpuinfo[3]);

    return res > 0 && (cpuinfo[2] & (1 << 30));
}

Obtaining the TSC frequency via the hypervisor interface can be done
with the following code. See https://lwn.net/Articles/301888/ for more
details.

/*
 * Under hypervisors (tested with VMWare) leaf 0x16 is not available, even
 * though __get_cpuid() succeeds. Hence, if running under a hypervisor, use
 * the hypervisor interface to obtain the TSC frequency.
 */
uint32 cpuinfo[4] = {0};

if (IsHypervisorActive() &&
    __get_cpuid(0x40000001, &cpuinfo[0], &cpuinfo[1],
                &cpuinfo[2], &cpuinfo[3]) > 0)
    cycles_to_sec = 1.0 / ((double) cpuinfo[0] * 1000 * 1000);

Given that we anyway switch between RDTSC and clock_gettime() with a
global variable, what about exposing the clock source as a GUC? That way
the user can switch back to a working clock source in case we miss a
detail around activating or reading the TSC.

I'm happy to update the patches accordingly.

--
David Geier
(ServiceNow)

#10David Geier
geidav.pg@gmail.com
In reply to: David Geier (#9)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

I missed attaching the patches.

--
David Geier
(ServiceNow)

Attachments:

0001-WIP-Change-instr_time-to-just-store-nanoseconds-v3.patch (text/x-patch) +26 -37
0002-WIP-Use-cpu-reference-cycles-via-rdtsc-to-measure-v3.patch (text/x-patch) +157 -11
#11David Geier
geidav.pg@gmail.com
In reply to: David Geier (#10)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Hi,

I re-based again on master and applied the following changes:

I removed the fallback for obtaining the TSC frequency from /proc/cpuinfo
as suggested by Andres. Worst-case we fall back to clock_gettime().

I added code to obtain the TSC frequency via CPUID when under a
hypervisor. I had to use __cpuid() directly instead of __get_cpuid(),
because __get_cpuid() returns an error if the leaf is > 0x80000000
(probably the implementation pre-dates the hypervisor timing leafs).
Unfortunately, while testing my implementation under VMWare, I found
that RDTSC runs awfully slow there (like 30x slower). [1] (page 11
"Virtual TSC") indicates that we cannot generally rely on RDTSC being
actually fast on VMs. However, the same applies to clock_gettime(). It
runs as slow as RDTSC on my VMware setup. Hence, using RDTSC is not at a
disadvantage. I'm not
entirely sure if there aren't cases where e.g. clock_gettime() is
actually faster than RDTSC and it would be advantageous to use
clock_gettime(). We could add a GUC so that the user can decide which
clock source to use. Any thoughts?

I also somewhat improved the accuracy of the cycles to milli- and
microseconds conversion functions by having two more multipliers with
higher precision. For microseconds we could also keep the computation
integer-only. I'm wondering what to best do for seconds and
milliseconds. I'm currently leaning towards just keeping it as is,
because the durations measured and converted are usually long enough
that precision shouldn't be a problem.

In vacuumlazy.c we do INSTR_TIME_GET_MICROSEC(elapsed) / 1000. I
changed that to use INSTR_TIME_GET_MILLISEC() instead. Additionally, I
initialized a few variables of type instr_time which otherwise resulted
in warnings due to use of potentially uninitialized variables.

I also couldn't reproduce the reported timing discrepancy. For me the
runtime reported by \timing is just slightly higher than the time
reported by EXPLAIN ANALYZE, which is expected.

Beyond that:

What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so
that it's consistent with the _MILLISEC() and _MICROSEC() variants?

INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
return double. This seems error prone. What about renaming the function,
or having it return a double and casting where necessary at the call
site?

If no one objects I would also re-register this patch in the commit fest.

[1]: https://vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf (page 11 "Virtual TSC")

--
David Geier
(ServiceNow)

Attachments:

0001-Change-instr_time-to-just-store-nanoseconds-v4.patch (text/x-patch, +26/-37)
0002-Use-CPU-reference-cycles-via-RDTSC-v4.patch (text/x-patch, +175/-12)
0003-Refactor-some-instr_time-related-code-v4.patch (text/x-patch, +9/-11)
#12Lukas Fittl
lukas@fittl.com
In reply to: David Geier (#11)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Hi David,

Thanks for continuing to work on this patch, and my apologies for silence
on the patch.

It's been hard to make time, and especially so because I typically develop
on an ARM-based macOS system where I can't test this directly; hence my
tests with virtualized EC2 instances, where I ran into the timing oddities.

On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:

The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
return double. This seems error prone. What about renaming the function
or also have the function return a double and cast where necessary at
the call site?

Minor note, but in my understanding using a uint64 (where we can) is faster
for any simple arithmetic we do with the values.

If no one objects I would also re-register this patch in the commit fest.

+1, and feel free to carry this patch forward - I'll try to make an effort
to review my earlier testing issues again, as well as your later
improvements to the patch.

Also, FYI, I just posted an alternate idea for speeding up EXPLAIN ANALYZE
with timing over in [0], using a sampling-based approach to reduce the
timing overhead.

[0]: /messages/by-id/CAP53PkxXMk0j-+0=YwRti2pFR5UB2Gu4v2Oyk8hhZS0DRART6g@mail.gmail.com

Thanks,
Lukas

--
Lukas Fittl

#13Maciek Sakrejda
m.sakrejda@gmail.com
In reply to: Maciek Sakrejda (#6)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

On Fri, Jul 15, 2022 at 11:21 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:

On Fri, Jul 1, 2022 at 10:26 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
...

Known WIP problems with this patch version:

* There appears to be a timing discrepancy I haven't yet worked out, where
the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
higher for \timing than for the EXPLAIN ANALYZE time reported on the
server
side, only when rdtsc measurement is used -- it's likely there is a problem
somewhere with how we perform the cycles to time conversion

Could you explain a bit more what you're seeing? I just tested your patches
and didn't see that here.

I did not see this either, but I did see that the execution time
reported by \timing is (for this test case) consistently 0.5-1ms
*lower* than the Execution Time reported by EXPLAIN. I did not see
that on master. Is that expected?

For what it's worth, I can no longer reproduce this. In fact, I went
back to master-as-of-around-then and applied Lukas' v2 patches again,
and I still can't reproduce that. I do remember it happening
consistently across several executions, but now \timing consistently
shows 0.5-1ms slower, as expected. This does not explain the different
timing issue Lukas was seeing in his tests, but I think we can assume
what I reported originally here is not an issue.

#14David Geier
geidav.pg@gmail.com
In reply to: Lukas Fittl (#12)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Hi Lukas,

On 1/2/23 20:50, Lukas Fittl wrote:

Thanks for continuing to work on this patch, and my apologies for
silence on the patch.

It would be great if you could review it.
Please also share your thoughts around exposing the used clock source as
GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS().

I rebased again on master because of [1]. Patches attached.

It's been hard to make time, and especially so because I typically
develop on an ARM-based macOS system where I can't test this directly;
hence my tests with virtualized EC2 instances, where I ran into the
timing oddities.

That's good and bad: bad for doing the development, but good for testing
the implementation on more virtualized setups, given that I also
encountered "interesting" behavior on VMware (see my previous mails).

On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:

The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other
variants
return double. This seems error prone. What about renaming the
function
or also have the function return a double and cast where necessary at
the call site?

Minor note, but in my understanding using a uint64 (where we can) is
faster for any simple arithmetic we do with the values.

That's true. So the argument could be that for seconds and milliseconds
we want the extra precision while microseconds are precise enough.
Still, we could also make the seconds and milliseconds conversion code
integer only and e.g. return two integers with the value before and
after the comma. FWICS, the functions are nowhere used in performance
critical code, so it doesn't really make a difference performance-wise.

+1, and feel free to carry this patch forward - I'll try to make an
effort to review my earlier testing issues again, as well as your
later improvements to the patch.

Moved to the current commit fest. Will you become reviewer?

Also, FYI, I just posted an alternate idea for speeding up EXPLAIN
ANALYZE with timing over in [0], using a sampling-based approach to
reduce the timing overhead.

Interesting idea. I'll reply with some thoughts on the corresponding thread.

[1]: /messages/by-id/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw@mail.gmail.com

--
David Geier
(ServiceNow)

Attachments:

0001-Change-instr_time-to-just-store-nanoseconds-v5.patch (text/x-patch, +26/-37)
0002-Use-CPU-reference-cycles-via-RDTSC-v5.patch (text/x-patch, +175/-12)
0003-Refactor-some-instr_time-related-code-v5.patch (text/x-patch, +9/-11)
#15vignesh C
vignesh21@gmail.com
In reply to: David Geier (#14)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

On Tue, 3 Jan 2023 at 14:08, David Geier <geidav.pg@gmail.com> wrote:

Hi Lukas,

On 1/2/23 20:50, Lukas Fittl wrote:

Thanks for continuing to work on this patch, and my apologies for
silence on the patch.

It would be great if you could review it.
Please also share your thoughts around exposing the used clock source as
GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS().

I rebased again on master because of [1]. Patches attached.

Its been hard to make time, and especially so because I typically
develop on an ARM-based macOS system where I can't test this directly
- hence my tests with virtualized EC2 instances, where I ran into the
timing oddities.

That's good and bad. Bad to do the development and good to test the
implementation on more virtualized setups; given that I also encountered
"interesting" behavior on VMWare (see my previous mails).

On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:

The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other
variants
return double. This seems error prone. What about renaming the
function
or also have the function return a double and cast where necessary at
the call site?

Minor note, but in my understanding using a uint64 (where we can) is
faster for any simple arithmetic we do with the values.

That's true. So the argument could be that for seconds and milliseconds
we want the extra precision while microseconds are precise enough.
Still, we could also make the seconds and milliseconds conversion code
integer only and e.g. return two integers with the value before and
after the comma. FWICS, the functions are nowhere used in performance
critical code, so it doesn't really make a difference performance-wise.

+1, and feel free to carry this patch forward - I'll try to make an
effort to review my earlier testing issues again, as well as your
later improvements to the patch.

Moved to the current commit fest. Will you become reviewer?

Also, FYI, I just posted an alternate idea for speeding up EXPLAIN
ANALYZE with timing over in [0], using a sampling-based approach to
reduce the timing overhead.

Interesting idea. I'll reply with some thoughts on the corresponding thread.

[1]
/messages/by-id/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw@mail.gmail.com

CFBot shows some compilation errors as in [1], please post an updated
version for the same:
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning:
relocation against `cycles_to_sec' in read-only section `.text'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `pg_clock_gettime_ref_cycles':
[09:08:12.525] /tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119:
undefined reference to `use_rdtsc'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `test_timing':
[09:08:12.525] /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135:
undefined reference to `pg_clock_gettime_initialize_rdtsc'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176:
undefined reference to `cycles_to_sec'
[09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE
[09:08:12.525] collect2: error: ld returned 1 exit status

[1]: https://cirrus-ci.com/task/5375312565895168

Regards,
Vignesh

#16David Geier
geidav.pg@gmail.com
In reply to: vignesh C (#15)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Hi,

CFBot shows some compilation errors as in [1], please post an updated
version for the same:
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning:
relocation against `cycles_to_sec' in read-only section `.text'
[...]
[09:08:12.525] collect2: error: ld returned 1 exit status

[1] - https://cirrus-ci.com/task/5375312565895168

Regards,
Vignesh

I fixed the compilation error on CFBot.
I missed adding instr_time.c to the Meson makefile.
New patch set attached.

--
David Geier
(ServiceNow)

Attachments:

0001-Change-instr_time-to-just-store-nanoseconds-v6.patch (text/x-patch, +26/-37)
0002-Use-CPU-reference-cycles-via-RDTSC-to-measure-time-v6.patch (text/x-patch, +179/-15)
0003-Refactor-some-instr_time-related-code-v6.patch (text/x-patch, +9/-11)
#17Andres Freund
andres@anarazel.de
In reply to: David Geier (#16)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Hi,

On 2023-01-04 13:02:05 +0100, David Geier wrote:

From be18633d4735f680c7910fcb4e8ac90c4eada131 Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
cheaper.

Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds. We'd also save memory in BufferUsage (144 -> 122 bytes),
Instrumentation (16 bytes saved in Instrumentation itself, 32 via
BufferUsage).

While the range of instr_time storing nanoseconds wouldn't be good enough for
a generic timestamp facility (hence using microsecs for Timestamp), the range
seems plenty for its use of measuring runtime:

((2 ** 63) - 1) / ((10 ** 9) * 60 * 60 * 24 * 365) = ~292 years

Of course, when using CLOCK_REALTIME, this is relative to 1970-01-01, so just
239 years.

It could theoretically be a different story, if we stored instr_time's on
disk. But we don't, they're ephemeral.

This doesn't buy a whole lot of performance - the bottleneck is the actual
timestamp computation. But in a query with not much else going on, it's
visible and reproducible. It's, unsurprisingly, a lot easier to see when using
BUFFERS.

For both timespec and nanosecond, I measured three server starts, and for each
started server three executions of
pgbench -n -Mprepared -c1 -P5 -T15 -f <(echo "EXPLAIN (ANALYZE, BUFFERS) SELECT generate_series(1, 10000000) OFFSET 10000000;")

the best result is:
timespec: 1073.431
nanosec: 957.532
a ~10% difference

Greetings,

Andres Freund

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#17)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Andres Freund <andres@anarazel.de> writes:

On 2023-01-04 13:02:05 +0100, David Geier wrote:

Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
cheaper.

Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds.

Cheaper, and perhaps more accurate too? Don't recall if we have any code
paths where the input timestamps are likely to be better-than-microsecond,
but surely that's coming someday.

I'm unsure that we want to deal with rdtsc's vagaries in general, but
no objection to changing instr_time.

regards, tom lane

#19Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#18)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Hi,

On 2023-01-13 15:25:16 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds.

Cheaper, and perhaps more accurate too? Don't recall if we have any code
paths where the input timestamps are likely to be better-than-microsecond,
but surely that's coming someday.

instr_time on !WIN32 use struct timespec, so we already should have nanosecond
precision available. IOW, we could add a INSTR_TIME_GET_NANOSEC today. Or am I
misunderstanding what you mean?

I'm unsure that we want to deal with rdtsc's vagaries in general, but
no objection to changing instr_time.

Cool.

Looking at the instr_time.h part of the change, I think it should go further
and basically do the same thing in the WIN32 path. The only part that needs
to be WIN32 specific is INSTR_TIME_SET_CURRENT(). That'd reduce duplication a
good bit.

Greetings,

Andres Freund

#20vignesh C
vignesh21@gmail.com
In reply to: David Geier (#16)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

On Wed, 4 Jan 2023 at 17:32, David Geier <geidav.pg@gmail.com> wrote:

I fixed the compilation error on CFBot.
I missed adding instr_time.c to the Meson makefile.
New patch set attached.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
ff23b592ad6621563d3128b26860bcb41daf9542 ===
=== applying patch
./0002-Use-CPU-reference-cycles-via-RDTSC-to-measure-time-v6.patch
....
patching file src/tools/msvc/Mkvcbuild.pm
Hunk #1 FAILED at 135.
1 out of 1 hunk FAILED -- saving rejects to file src/tools/msvc/Mkvcbuild.pm.rej

[1]: http://cfbot.cputube.org/patch_41_3751.log

Regards,
Vignesh

#21Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#17)
#22Andres Freund
andres@anarazel.de
In reply to: David Geier (#11)
#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Geier (#16)
#24Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tomas Vondra (#23)
#25Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#17)
#26Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#25)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#26)
#28Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#27)
#29David Geier
geidav.pg@gmail.com
In reply to: Pavel Stehule (#24)
#30David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#22)
#31David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#28)
#32David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#26)
#33David Geier
geidav.pg@gmail.com
In reply to: David Geier (#29)
#34Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Geier (#33)
#35Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#28)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#35)
#37Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#36)
#38Andres Freund
andres@anarazel.de
In reply to: David Geier (#32)
#39Andres Freund
andres@anarazel.de
In reply to: David Geier (#33)
#40Andres Freund
andres@anarazel.de
In reply to: David Geier (#31)
#41Andres Freund
andres@anarazel.de
In reply to: David Geier (#30)
#42Justin Pryzby
pryzby@telsasoft.com
In reply to: Andres Freund (#35)
#43Andres Freund
andres@anarazel.de
In reply to: Justin Pryzby (#42)
#44Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#40)
#45Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#44)
#46David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#39)
#47Andres Freund
andres@anarazel.de
In reply to: David Geier (#46)
#48David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#38)
#49David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#44)
#50Andres Freund
andres@anarazel.de
In reply to: David Geier (#48)
#51Andres Freund
andres@anarazel.de
In reply to: David Geier (#49)
#52David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#47)
#53David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#50)
#54David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#51)
#55Andres Freund
andres@anarazel.de
In reply to: David Geier (#52)
#56David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#55)
#57David Geier
geidav.pg@gmail.com
In reply to: David Geier (#56)
#58David Geier
geidav.pg@gmail.com
In reply to: David Geier (#57)
#59vignesh C
vignesh21@gmail.com
In reply to: David Geier (#58)
#60vignesh C
vignesh21@gmail.com
In reply to: vignesh C (#59)
#61Andres Freund
andres@anarazel.de
In reply to: David Geier (#58)
#62Lukas Fittl
lukas@fittl.com
In reply to: Andres Freund (#61)
#63Andres Freund
andres@anarazel.de
In reply to: Lukas Fittl (#62)
#64Lukas Fittl
lukas@fittl.com
In reply to: Andres Freund (#63)
#65Michael Paquier
michael@paquier.xyz
In reply to: Lukas Fittl (#64)
#66Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#65)
#67David Geier
geidav.pg@gmail.com
In reply to: Lukas Fittl (#62)
#68Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Lukas Fittl (#64)
#69Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#68)
#70Andres Freund
andres@anarazel.de
In reply to: David Geier (#67)
#71David Geier
geidav.pg@gmail.com
In reply to: Robert Haas (#69)
#72David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#70)
#73Lukas Fittl
lukas@fittl.com
In reply to: David Geier (#72)
#74Robert Haas
robertmhaas@gmail.com
In reply to: Lukas Fittl (#73)
#75David Geier
geidav.pg@gmail.com
In reply to: David Geier (#71)
#76David Geier
geidav.pg@gmail.com
In reply to: Robert Haas (#74)
#77Robert Haas
robertmhaas@gmail.com
In reply to: David Geier (#76)
#78David Geier
geidav.pg@gmail.com
In reply to: David Geier (#75)
#79Hannu Krosing
hannu@tm.ee
In reply to: Robert Haas (#77)
#80David Geier
geidav.pg@gmail.com
In reply to: Hannu Krosing (#79)
#81David Geier
geidav.pg@gmail.com
In reply to: David Geier (#80)
#82Lukas Fittl
lukas@fittl.com
In reply to: David Geier (#81)
#83Lukas Fittl
lukas@fittl.com
In reply to: Lukas Fittl (#82)
#84Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Lukas Fittl (#83)
#85Lukas Fittl
lukas@fittl.com
In reply to: Jakub Wartak (#84)
#86Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#84)
#87David Geier
geidav.pg@gmail.com
In reply to: Lukas Fittl (#82)
#88David Geier
geidav.pg@gmail.com
In reply to: Lukas Fittl (#85)
#89David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#86)
#90Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#86)
#91Ants Aasma
ants.aasma@cybertec.at
In reply to: Andres Freund (#90)
#92Lukas Fittl
lukas@fittl.com
In reply to: David Geier (#89)
#93Andres Freund
andres@anarazel.de
In reply to: Lukas Fittl (#92)
#94Lukas Fittl
lukas@fittl.com
In reply to: Andres Freund (#93)
#95Hannu Krosing
hannu@tm.ee
In reply to: Lukas Fittl (#94)
#96Lukas Fittl
lukas@fittl.com
In reply to: Hannu Krosing (#95)
#97Andreas Karlsson
andreas.karlsson@percona.com
In reply to: Hannu Krosing (#95)
#98David Geier
geidav.pg@gmail.com
In reply to: Lukas Fittl (#94)
#99Andres Freund
andres@anarazel.de
In reply to: David Geier (#98)
#100John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#99)
#101Lukas Fittl
lukas@fittl.com
In reply to: John Naylor (#100)
#102Lukas Fittl
lukas@fittl.com
In reply to: Andres Freund (#99)
#103John Naylor
john.naylor@enterprisedb.com
In reply to: Lukas Fittl (#101)
#104Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Lukas Fittl (#102)
#105Andres Freund
andres@anarazel.de
In reply to: Lukas Fittl (#102)
#106John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#105)
#107Lukas Fittl
lukas@fittl.com
In reply to: Andres Freund (#105)
#108Andres Freund
andres@anarazel.de
In reply to: Lukas Fittl (#107)
#109Lukas Fittl
lukas@fittl.com
In reply to: Andres Freund (#108)
#110John Naylor
john.naylor@enterprisedb.com
In reply to: Lukas Fittl (#107)
#111Andres Freund
andres@anarazel.de
In reply to: Lukas Fittl (#109)
#112Lukas Fittl
lukas@fittl.com
In reply to: Andres Freund (#111)