Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?
Hi,
Currently using EXPLAIN (ANALYZE) without TIMING OFF regularly changes
the resulting timing enough that the times aren't meaningful. E.g.
CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;
-- best of three:
SELECT count(*) FROM lotsarows;
Time: 1923.394 ms (00:01.923)
-- best of three:
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 2319.830 ms (00:02.320)
-- best of three:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4202.649 ms (00:04.203)
That nearly *double* the execution time without TIMING.
Looking at a profile of this shows that we spend a good bit of cycles
"normalizing" timstamps etc. That seems pretty unnecessary, just forced
on us due to struct timespec. So the first attached patch just turns
instr_time to be a 64bit integer, counting nanoseconds.
That helps, a tiny bit:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4179.302 ms (00:04.179)
but obviously doesn't move the needle.
Looking at a profile it's easy to confirm that we spend a lot of time
acquiring time:
- 95.49% 0.00% postgres postgres [.] agg_retrieve_direct (inlined)
- agg_retrieve_direct (inlined)
- 79.27% fetch_input_tuple
- ExecProcNode (inlined)
- 75.72% ExecProcNodeInstr
+ 25.22% SeqNext
- 21.74% InstrStopNode
+ 17.80% __GI___clock_gettime (inlined)
- 21.44% InstrStartNode
+ 19.23% __GI___clock_gettime (inlined)
+ 4.06% ExecScan
+ 13.09% advance_aggregates (inlined)
1.06% MemoryContextReset
And that's even though linux avoids a syscall (in most cases) etc to
acquire the time. Unless the kernel detects there's a reason not to do
so, linux does this by using 'rdtscp' and multiplying it by kernel
provided factors to turn the cycles into time.
Some of the time is spent doing function calls, dividing into struct
timespec, etc. But most of it just the rdtscp instruction:
65.30 │1 63: rdtscp
The reason for that is largely that rdtscp waits until all prior
instructions have finished (but it allows later instructions to already
start). Multiple times for each tuple.
In the second attached prototype patch I've change instr_time to count
in cpu cycles instead of nanoseconds. And then just turned the cycles
into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later).
When using rdtsc that results in *vastly* lower overhead:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1) │
│ -> Seq Scan on lotsarows (cost=0.00..721239.16 rows=50000016 width=0) (actual time=0.006..1512.886 rows=50000000 loops=1) │
│ Planning Time: 0.028 ms │
│ Execution Time: 2610.256 ms │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(4 rows)
Time: 2610.589 ms (00:02.611)
And there's still some smaller improvements that could be made ontop of
that.
As a comparison, here's the time when using rdtscp directly in
instr_time, instead of going through clock_gettime:
Time: 3481.162 ms (00:03.481)
That shows pretty well how big the cost of the added pipeline stalls
are, and how important out-of-order execution is for decent
performance...
In my opinion, for the use in InstrStartNode(), InstrStopNode() etc, we
do *not* want to wait for prior instructions to finish, since that
actually leads to the timing being less accurate, rather than
more. There are other cases where that'd be different, e.g. measuring
how long an entire query takes or such (but there it's probably
irrelevant which to use).
I've above skipped a bit over the details of how to turn the cycles
returned by rdtsc into time:
On x86 CPUs of the last ~12 years rdtsc doesn't return the cycles that
have actually been run, but instead returns the number of 'reference
cycles'. That's important because otherwise things like turbo mode and
lower power modes would lead to completely bogus times.
Thus, knowing the "base frequency" of the CPU allows to turn the
difference between two rdtsc return values into seconds.
In the attached prototype I just determined the number of cycles using
cpuid(0x16). That's only available since Skylake (I think). On older
CPUs we'd have to look at /proc/cpuinfo or
/sys/devices/system/cpu/cpu0/cpufreq/base_frequency.
There's also other issues with using rdtsc directly: On older CPUs, in
particular older multi-socket systems, the tsc will not be synchronized
in detail across cores. There's bits that'd let us check whether tsc is
suitable or not. The more current issue of that is that things like
virtual machines being migrated can lead to rdtsc suddenly returning a
different value / the frequency differening. But that is supposed to be
solved these days, by having virtualization technologies set frequency
multipliers and offsets which then cause rdtsc[p] to return something
meaningful, even after migration.
The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found it the overhead
made it too hard to nail down in which part of a query tree performance
changed. If somebody else wants to continue from here...
I do think it's be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a
bunch of low-level code.
Greetings,
Andres Freund
Attachments:
v1-0001-WIP-Change-instr_time-to-just-store-nanoseconds-t.patchtext/x-diff; charset=us-asciiDownload
From 325140820e5745bea901bffd685626aa84a16d23 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 28 May 2020 14:52:48 -0700
Subject: [PATCH v1 1/2] WIP: Change instr_time to just store nanoseconds,
that's cheaper.
Author:
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/portability/instr_time.h | 62 ++++++++++++----------------
1 file changed, 26 insertions(+), 36 deletions(-)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index d6459327ccf..fc058d548a8 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -83,63 +83,53 @@
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-typedef struct timespec instr_time;
+typedef int64 instr_time;
+#define NS_PER_S INT64CONST(1000000000)
+#define US_PER_S INT64CONST(1000000)
+#define MS_PER_S INT64CONST(1000)
-#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS INT64CONST(1000000)
+#define NS_PER_US INT64CONST(1000)
-#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t) ((t) == 0)
-#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t) ((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+ struct timespec tmp;
+
+ clock_gettime(PG_INSTR_CLOCK, &tmp);
+
+ return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+ (t) = pg_clock_gettime_ns()
#define INSTR_TIME_ADD(x,y) \
do { \
- (x).tv_sec += (y).tv_sec; \
- (x).tv_nsec += (y).tv_nsec; \
- /* Normalize */ \
- while ((x).tv_nsec >= 1000000000) \
- { \
- (x).tv_nsec -= 1000000000; \
- (x).tv_sec++; \
- } \
+ (x) += (y); \
} while (0)
#define INSTR_TIME_SUBTRACT(x,y) \
do { \
- (x).tv_sec -= (y).tv_sec; \
- (x).tv_nsec -= (y).tv_nsec; \
- /* Normalize */ \
- while ((x).tv_nsec < 0) \
- { \
- (x).tv_nsec += 1000000000; \
- (x).tv_sec--; \
- } \
+ (x) -= (y); \
} while (0)
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
do { \
- (x).tv_sec += (y).tv_sec - (z).tv_sec; \
- (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
- /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
- while ((x).tv_nsec < 0) \
- { \
- (x).tv_nsec += 1000000000; \
- (x).tv_sec--; \
- } \
- while ((x).tv_nsec >= 1000000000) \
- { \
- (x).tv_nsec -= 1000000000; \
- (x).tv_sec++; \
- } \
+ (x) += (y) - (z); \
} while (0)
#define INSTR_TIME_GET_DOUBLE(t) \
- (((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
+ ((double) (t) / NS_PER_S)
#define INSTR_TIME_GET_MILLISEC(t) \
- (((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
+ ((double) (t) / NS_PER_MS)
#define INSTR_TIME_GET_MICROSEC(t) \
- (((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+ ((double) (t) / NS_PER_US)
#else /* !HAVE_CLOCK_GETTIME */
--
2.25.0.114.g5b0ca878e0
v1-0002-WIP-Use-cpu-reference-cycles-via-rdtsc-to-measure.patchtext/x-diff; charset=us-asciiDownload
From 29a12302bab6b2fb8c2475834510b90c4a6197ce Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Thu, 11 Jun 2020 19:38:18 -0700
Subject: [PATCH v1 2/2] WIP: Use cpu reference cycles, via rdtsc, to measure
time for instrumentation.
---
src/include/portability/instr_time.h | 68 ++++++++++++++++++++++++----
1 file changed, 60 insertions(+), 8 deletions(-)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index fc058d548a8..8b2f9a2e707 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -83,7 +83,9 @@
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
+/* time in baseline cpu cycles */
typedef int64 instr_time;
+
#define NS_PER_S INT64CONST(1000000000)
#define US_PER_S INT64CONST(1000000)
#define MS_PER_S INT64CONST(1000)
@@ -95,17 +97,67 @@ typedef int64 instr_time;
#define INSTR_TIME_SET_ZERO(t) ((t) = 0)
-static inline instr_time pg_clock_gettime_ns(void)
+#include <x86intrin.h>
+#include <cpuid.h>
+
+/*
+ * Return what the number of cycles needs to be multiplied with to end up with
+ * seconds.
+ *
+ * FIXME: The cold portion should probably be out-of-line. And it'd be better
+ * to not recompute this in every file that uses this. Best would probably be
+ * to require explicit initialization of cycles_to_sec, because having a
+ * branch really is unnecessary.
+ *
+ * FIXME: We should probably not unnecessarily use floating point math
+ * here. And it's likely that the numbers are small enough that we are running
+ * into floating point inaccuracies already. Probably worthwhile to be a good
+ * bit smarter.
+ *
+ * FIXME: This would need to be conditional, with a fallback to something not
+ * rdtsc based.
+ */
+static inline double __attribute__((const))
+get_cycles_to_sec(void)
{
- struct timespec tmp;
+ static double cycles_to_sec = 0;
- clock_gettime(PG_INSTR_CLOCK, &tmp);
+ /*
+ * Compute baseline cpu peformance, determines speed at which rdtsc advances
+ */
+ if (unlikely(cycles_to_sec == 0))
+ {
+ uint32 cpuinfo[4] = {0};
- return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+ __get_cpuid(0x16, cpuinfo, cpuinfo + 1, cpuinfo + 2, cpuinfo + 3);
+ cycles_to_sec = 1 / ((double) cpuinfo[0] * 1000 * 1000);
+ }
+
+ return cycles_to_sec;
+}
+
+static inline instr_time pg_clock_gettime_ref_cycles(void)
+{
+ /*
+ * The rdtscp waits for all in-flight instructions to finish (but allows
+ * later instructions to start concurrently). That's good for some timing
+ * situations (when the time is supposed to cover all the work), but
+ * terrible for others (when sub-parts of work are measured, because then
+ * the pipeline stall due to the wait change the overall timing).
+ */
+#if 0
+ unsigned int aux;
+ int64 tsc = __rdtscp(&aux);
+
+ return tsc;
+#else
+
+ return __rdtsc();
+#endif
}
#define INSTR_TIME_SET_CURRENT(t) \
- (t) = pg_clock_gettime_ns()
+ (t) = pg_clock_gettime_ref_cycles()
#define INSTR_TIME_ADD(x,y) \
do { \
@@ -123,13 +175,13 @@ static inline instr_time pg_clock_gettime_ns(void)
} while (0)
#define INSTR_TIME_GET_DOUBLE(t) \
- ((double) (t) / NS_PER_S)
+ ((double) (t) * get_cycles_to_sec())
#define INSTR_TIME_GET_MILLISEC(t) \
- ((double) (t) / NS_PER_MS)
+ ((double) (t) * (get_cycles_to_sec() * MS_PER_S))
#define INSTR_TIME_GET_MICROSEC(t) \
- ((double) (t) / NS_PER_US)
+ ((double) (t) * (get_cycles_to_sec() * US_PER_S))
#else /* !HAVE_CLOCK_GETTIME */
--
2.25.0.114.g5b0ca878e0
so 13. 6. 2020 v 1:28 odesílatel Andres Freund <andres@anarazel.de> napsal:
Hi,
Currently using EXPLAIN (ANALYZE) without TIMING OFF regularly changes
the resulting timing enough that the times aren't meaningful. E.g.CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;-- best of three:
SELECT count(*) FROM lotsarows;
Time: 1923.394 ms (00:01.923)-- best of three:
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 2319.830 ms (00:02.320)-- best of three:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4202.649 ms (00:04.203)That nearly *double* the execution time without TIMING.
Looking at a profile of this shows that we spend a good bit of cycles
"normalizing" timstamps etc. That seems pretty unnecessary, just forced
on us due to struct timespec. So the first attached patch just turns
instr_time to be a 64bit integer, counting nanoseconds.That helps, a tiny bit:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4179.302 ms (00:04.179)but obviously doesn't move the needle.
Looking at a profile it's easy to confirm that we spend a lot of time acquiring time: - 95.49% 0.00% postgres postgres [.] agg_retrieve_direct (inlined) - agg_retrieve_direct (inlined) - 79.27% fetch_input_tuple - ExecProcNode (inlined) - 75.72% ExecProcNodeInstr + 25.22% SeqNext - 21.74% InstrStopNode + 17.80% __GI___clock_gettime (inlined) - 21.44% InstrStartNode + 19.23% __GI___clock_gettime (inlined) + 4.06% ExecScan + 13.09% advance_aggregates (inlined) 1.06% MemoryContextResetAnd that's even though linux avoids a syscall (in most cases) etc to
acquire the time. Unless the kernel detects there's a reason not to do
so, linux does this by using 'rdtscp' and multiplying it by kernel
provided factors to turn the cycles into time.Some of the time is spent doing function calls, dividing into struct
timespec, etc. But most of it just the rdtscp instruction:
65.30 │1 63: rdtscpThe reason for that is largely that rdtscp waits until all prior
instructions have finished (but it allows later instructions to already
start). Multiple times for each tuple.In the second attached prototype patch I've change instr_time to count
in cpu cycles instead of nanoseconds. And then just turned the cycles
into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later).When using rdtsc that results in *vastly* lower overhead:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN
│├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (cost=846239.20..846239.21 rows=1 width=8) (actual
time=2610.235..2610.235 rows=1 loops=1) │
│ -> Seq Scan on lotsarows (cost=0.00..721239.16 rows=50000016
width=0) (actual time=0.006..1512.886 rows=50000000 loops=1) │
│ Planning Time: 0.028 ms
│
│ Execution Time: 2610.256 ms
│└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(4 rows)Time: 2610.589 ms (00:02.611)
And there's still some smaller improvements that could be made ontop of
that.As a comparison, here's the time when using rdtscp directly in
instr_time, instead of going through clock_gettime:
Time: 3481.162 ms (00:03.481)That shows pretty well how big the cost of the added pipeline stalls
are, and how important out-of-order execution is for decent
performance...In my opinion, for the use in InstrStartNode(), InstrStopNode() etc, we
do *not* want to wait for prior instructions to finish, since that
actually leads to the timing being less accurate, rather than
more. There are other cases where that'd be different, e.g. measuring
how long an entire query takes or such (but there it's probably
irrelevant which to use).I've above skipped a bit over the details of how to turn the cycles
returned by rdtsc into time:On x86 CPUs of the last ~12 years rdtsc doesn't return the cycles that
have actually been run, but instead returns the number of 'reference
cycles'. That's important because otherwise things like turbo mode and
lower power modes would lead to completely bogus times.Thus, knowing the "base frequency" of the CPU allows to turn the
difference between two rdtsc return values into seconds.In the attached prototype I just determined the number of cycles using
cpuid(0x16). That's only available since Skylake (I think). On older
CPUs we'd have to look at /proc/cpuinfo or
/sys/devices/system/cpu/cpu0/cpufreq/base_frequency.There's also other issues with using rdtsc directly: On older CPUs, in
particular older multi-socket systems, the tsc will not be synchronized
in detail across cores. There's bits that'd let us check whether tsc is
suitable or not. The more current issue of that is that things like
virtual machines being migrated can lead to rdtsc suddenly returning a
different value / the frequency differening. But that is supposed to be
solved these days, by having virtualization technologies set frequency
multipliers and offsets which then cause rdtsc[p] to return something
meaningful, even after migration.The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found it the overhead
made it too hard to nail down in which part of a query tree performance
changed. If somebody else wants to continue from here...I do think it's be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a
bunch of low-level code.
+1
Pavel
Show quoted text
Greetings,
Andres Freund
On Sat, Jun 13, 2020 at 11:28 AM Andres Freund <andres@anarazel.de> wrote:
[PATCH v1 1/2] WIP: Change instr_time to just store nanoseconds, that's cheaper.
Makes a lot of sense. If we do this, I'll need to update pgbench,
which just did something similar locally. If I'd been paying
attention to this thread I might not have committed that piece of the
recent pgbench changes, but it's trivial stuff and I'll be happy to
tidy that up when the time comes.
[PATCH v1 2/2] WIP: Use cpu reference cycles, via rdtsc, to measure time for instrumentation.
Some of the time is spent doing function calls, dividing into struct
timespec, etc. But most of it just the rdtscp instruction:
65.30 │1 63: rdtscp
The reason for that is largely that rdtscp waits until all prior
instructions have finished (but it allows later instructions to already
start). Multiple times for each tuple.
Yeah, after reading a bit about this, I agree that there is no reason
to think that the stalling version makes the answer better in any way.
It might make sense if you use it once at the beginning of a large
computation, but it makes no sense if you sprinkle it around inside
blocks that will run multiple times. It destroys your
instructions-per-cycle while, turning your fancy super scalar Pentium
into a 486. It does raise some interesting questions about what
exactly you're measuring, though: I don't know enough to have a good
grip on how far out of order the TSC could be read!
There's also other issues with using rdtsc directly: On older CPUs, in
particular older multi-socket systems, the tsc will not be synchronized
in detail across cores. There's bits that'd let us check whether tsc is
suitable or not. The more current issue of that is that things like
virtual machines being migrated can lead to rdtsc suddenly returning a
different value / the frequency differening. But that is supposed to be
solved these days, by having virtualization technologies set frequency
multipliers and offsets which then cause rdtsc[p] to return something
meaningful, even after migration.
Googling tells me that Nehalem (2008) introduced "invariant TSC"
(clock rate independent) and also socket synchronisation at the same
time, so systems without it are already pretty long in the tooth.
A quick peek at an AMD manual[1]https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf (page 373) tells me that a similar change
happened in 15H/Bulldozer/Piledriver/Steamroller/Excavator (2011),
identified with the same CPUID test.
My first reaction is that it seems like TSC would be the least of your
worries if you're measuring a VM that's currently migrating between
hosts, but maybe the idea is just that you have to make sure you don't
assume it can't ever go backwards or something like that...
Google Benchmark has some clues about how to spell this on MSVC, what
some instructions might be to research on ARM, etc.
[1]: https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf (page 373)
(page 373)
[2]: https://github.com/google/benchmark/blob/master/src/cycleclock.h
On Fri, Jun 12, 2020 at 4:28 PM Andres Freund <andres@anarazel.de> wrote:
The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found it the overhead
made it too hard to nail down in which part of a query tree performance
changed. If somebody else wants to continue from here...I do think it's be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a
bunch of low-level code.
Based on an off-list conversation with Andres, I decided to dust off this
old
patch for using rdtsc directly. The significant EXPLAIN ANALYZE performance
improvements (especially when using rdtsc instead of rdtsc*p*) seem to
warrant
giving this a more thorough look.
See attached an updated patch (adding it to the July commitfest), with a few
changes:
- Keep using clock_gettime() as a fallback if we decide to not use rdtsc
- Fallback to /proc/cpuinfo for clock frequency, if cpuid(0x16) doesn't work
- The decision to use rdtsc (or not) is made at runtime, in the new
INSTR_TIME_INITIALIZE() -- we can't make this decision at compile time
because this is dependent on the specific CPU in use, amongst other things
- In an abundance of caution, for now I've decided to only enable this if we
are on Linux/x86, and the current kernel clocksource is TSC (the kernel
has
quite sophisticated logic around making this decision, see [1]https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c)
Note that if we implemented the decision logic ourselves (instead of relying
on the Linux kernel), I'd be most worried about older virtualization
technology. In my understanding getting this right is notably more
complicated
than just checking cpuid, see [2]http://oliveryang.net/2015/09/pitfalls-of-TSC-usage/.
Known WIP problems with this patch version:
* There appears to be a timing discrepancy I haven't yet worked out, where
the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
higher for \timing than for the EXPLAIN ANALYZE time reported on the
server
side, only when rdtsc measurement is used -- its likely there is a problem
somewhere with how we perform the cycles to time conversion
* Possibly related, the floating point handling for the cycles_to_sec
variable
is problematic in terms of precision (see FIXME, taken over from Andres'
POC)
Open questions from me:
1) Do we need to account for different TSC offsets on different CPUs in SMP
systems? (the Linux kernel certainly has logic to that extent, but [3]https://stackoverflow.com/a/11060619/1652607
suggests this is no longer a problem on Nehalem and newer chips, i.e.
those
having an invariant TSC)
2) Should we have a setting "--with-tsc" for configure? (instead of always
enabling it when on Linux/x86 with a TSC clocksource)
3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for
current instructions to finish -- the prior discussion seemed to suggest
we don't want it for node instruction measurements, but possibly we do
want
this in other cases?)
4) Should we support using the "mrs" instruction on ARM? (which is similar
to
rdtsc, see [4]https://cpufun.substack.com/p/fun-with-timers-and-cpuid)
Thanks,
Lukas
[1]: https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c
[2]: http://oliveryang.net/2015/09/pitfalls-of-TSC-usage/
[3]: https://stackoverflow.com/a/11060619/1652607
[4]: https://cpufun.substack.com/p/fun-with-timers-and-cpuid
--
Lukas Fittl
Attachments:
v2-0002-WIP-Use-cpu-reference-cycles-via-rdtsc-to-measure.patchapplication/octet-stream; name=v2-0002-WIP-Use-cpu-reference-cycles-via-rdtsc-to-measure.patchDownload
From 8bd31560afe60bed38c3f63305d924d3b629206b Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Thu, 30 Jun 2022 17:57:37 -0700
Subject: [PATCH v2 2/2] WIP: Use cpu reference cycles, via rdtsc, to measure
time for instrumentation.
For now this is only enabled on Linux/x86 when the system clocksource is
marked tsc as well, as determined at runtime. This way we can rely on the
Linux kernel to make a decision whether tsc is invariant and usable on the
current CPU architecture. In all other cases we continue to use the
clock_gettime() implementation like before.
Note that this intentionally uses rdtsc, not rdtscp, as rdtscp waits for
currently running CPU instructions to have finished, and that adds up to
noticable latency for little benefit in the typical InstrStartNode() /
InstrStopNode() use case.
---
src/backend/utils/init/postinit.c | 3 +
src/bin/pg_test_timing/pg_test_timing.c | 1 +
src/bin/pgbench/pgbench.c | 3 +
src/bin/psql/startup.c | 4 +
src/common/Makefile | 1 +
src/common/instr_time.c | 107 ++++++++++++++++++++++++
src/include/portability/instr_time.h | 53 +++++++++---
src/tools/msvc/Mkvcbuild.pm | 2 +-
8 files changed, 163 insertions(+), 11 deletions(-)
create mode 100644 src/common/instr_time.c
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 6b9082604f..6ec0ad3271 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -735,6 +735,9 @@ InitPostgres(const char *in_dbname, Oid dboid, const char *username,
/* Initialize portal manager */
EnablePortalManager();
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* Initialize status reporting */
pgstat_beinit();
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index c29d6f8762..0d667ff5a7 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -132,6 +132,7 @@ test_timing(unsigned int duration)
total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;
+ INSTR_TIME_INITIALIZE();
INSTR_TIME_SET_CURRENT(start_time);
cur = INSTR_TIME_GET_MICROSEC(start_time);
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index fbb74bdc4c..043e2e433d 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7093,6 +7093,9 @@ main(int argc, char **argv)
initRandomState(&state[i].cs_func_rs);
}
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* opening connection... */
con = doConnect();
if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index 7c2f555f15..da930043c0 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
#include "help.h"
#include "input.h"
#include "mainloop.h"
+#include "portability/instr_time.h"
#include "settings.h"
/*
@@ -322,6 +323,9 @@ main(int argc, char *argv[])
PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
SyncVariables();
if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index e9af7346c9..437a018590 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -59,6 +59,7 @@ OBJS_COMMON = \
file_perm.o \
file_utils.o \
hashfn.o \
+ instr_time.o \
ip.o \
jsonapi.o \
keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 0000000000..a63ef19f08
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,107 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ * Non-inline parts of the portable high-precision interval timing
+ * implementation
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "portability/instr_time.h"
+
+#ifdef HAVE_CLOCK_GETTIME
+
+/*
+ * Stores what the number of cycles needs to be multiplied with to end up with
+ * seconds. This indirection exists to support the rtdsc instruction.
+ *
+ * As a default, assume we are using clock_gettime() as a fallback and treat it
+ * as 1 "cycle" per nanosecond (aka 1 GHz).
+ *
+ * When using the rtdsc instruction directly this is filled in during
+ * initialization based on the relevant cpuid fields.
+ */
+double cycles_to_sec = 1.0 / NS_PER_S;
+
+/*
+ * Determines whether rdtsc is used (Linux/x86 only, when OS uses tsc clocksource)
+ */
+bool use_rdtsc = false;
+
+#if defined(__x86_64__) && defined(__linux__)
+/*
+ * Decide whether we use the rdtsc instruction at runtime, for Linux/x86,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ *
+ * Since Linux already does a significant amount of work to determine
+ * whether TSC is a viable clock source, decide based on that.
+ */
+void pg_clock_gettime_initialize_rdtsc(void)
+{
+ FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+ char buf[128];
+
+ if (fp)
+ {
+ fgets(buf, sizeof(buf), fp);
+ if (strcmp(buf, "tsc\n") == 0)
+ {
+ use_rdtsc = true;
+ }
+ fclose(fp);
+ }
+
+ /*
+ * Compute baseline cpu peformance, determines speed at which rdtsc advances
+ */
+ if (use_rdtsc)
+ {
+ uint32 cpuinfo[4] = {0};
+
+ /*
+ * FIXME: We should probably not unnecessarily use floating point math
+ * here. And it's likely that the numbers are small enough that we are running
+ * into floating point inaccuracies already. Probably worthwhile to be a good
+ * bit smarter.
+ */
+
+ __get_cpuid(0x16, cpuinfo, cpuinfo + 1, cpuinfo + 2, cpuinfo + 3);
+
+ if (cpuinfo[0] != 0) {
+ cycles_to_sec = 1 / ((double) cpuinfo[0] * 1000 * 1000);
+ } else {
+ FILE *fp = fopen("/proc/cpuinfo", "r");
+ char buf[128];
+ float cpu_mhz;
+
+ if (fp)
+ {
+ while (fgets(buf, sizeof(buf), fp))
+ {
+ if (sscanf(buf, "cpu MHz : %f", &cpu_mhz) == 1)
+ {
+ cycles_to_sec = 1 / ((double) cpu_mhz * 1000 * 1000);
+ break;
+ }
+ }
+ }
+ fclose(fp);
+ }
+ }
+}
+#endif /* defined(__x86_64__) && defined(__linux__) */
+
+#endif /* HAVE_CLOCK_GETTIME */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index ca260032d1..a23b88d9e6 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,10 +4,12 @@
* portable high-precision interval timing
*
* This file provides an abstraction layer to hide portability issues in
- * interval timing. On Unix we use clock_gettime() if available, else
- * gettimeofday(). On Windows, gettimeofday() gives a low-precision result
- * so we must use QueryPerformanceCounter() instead. These macros also give
- * some breathing room to use other high-precision-timing APIs.
+ * interval timing. On Linux/x86 we use the rdtsc instruction when a TSC
+ * clocksource is also used on the host OS. Otherwise, and on other
+ * Unix-like systems we use clock_gettime() if available, else gettimeofday().
+ * On Windows, gettimeofday() gives a low-precision result so we must use
+ * QueryPerformanceCounter() instead. These macros also give some breathing
+ * room to use other high-precision-timing APIs.
*
* The basic data type is instr_time, which all callers should treat as an
* opaque typedef. instr_time can store either an absolute time (of
@@ -59,10 +61,15 @@
#ifdef HAVE_CLOCK_GETTIME
-/* Use clock_gettime() */
+/* Uses rdtsc on Linux/x86 if available, otherwise clock_gettime() */
#include <time.h>
+#if defined(__x86_64__) && defined(__linux__)
+#include <x86intrin.h>
+#include <cpuid.h>
+#endif
+
/*
* The best clockid to use according to the POSIX spec is CLOCK_MONOTONIC,
* since that will give reliable interval timing even in the face of changes
@@ -83,7 +90,9 @@
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
+/* time in cpu reference cycles (when using rdtsc), otherwise nanoseconds */
typedef int64 instr_time;
+
#define NS_PER_S INT64CONST(1000000000)
#define US_PER_S INT64CONST(1000000)
#define MS_PER_S INT64CONST(1000)
@@ -95,17 +104,37 @@ typedef int64 instr_time;
#define INSTR_TIME_SET_ZERO(t) ((t) = 0)
-static inline instr_time pg_clock_gettime_ns(void)
+extern double cycles_to_sec;
+
+bool use_rdtsc;
+
+#if defined(__x86_64__) && defined(__linux__)
+extern void pg_clock_gettime_initialize_rdtsc(void);
+#endif
+
+static inline instr_time pg_clock_gettime_ref_cycles(void)
{
struct timespec tmp;
+#if defined(__x86_64__) && defined(__linux__)
+ if (use_rdtsc)
+ return __rdtsc();
+#endif
+
clock_gettime(PG_INSTR_CLOCK, &tmp);
return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
}
+#if defined(__x86_64__) && defined(__linux__)
+#define INSTR_TIME_INITIALIZE() \
+ pg_clock_gettime_initialize_rdtsc()
+#else
+#define INSTR_TIME_INITIALIZE()
+#endif
+
#define INSTR_TIME_SET_CURRENT(t) \
- (t) = pg_clock_gettime_ns()
+ (t) = pg_clock_gettime_ref_cycles()
#define INSTR_TIME_ADD(x,y) \
do { \
@@ -123,13 +152,13 @@ static inline instr_time pg_clock_gettime_ns(void)
} while (0)
#define INSTR_TIME_GET_DOUBLE(t) \
- ((double) (t) / NS_PER_S)
+ ((double) (t) * cycles_to_sec)
#define INSTR_TIME_GET_MILLISEC(t) \
- ((double) (t) / NS_PER_MS)
+ ((double) (t) * (cycles_to_sec * MS_PER_S))
#define INSTR_TIME_GET_MICROSEC(t) \
- ((double) (t) / NS_PER_US)
+ ((uint64) (t) * (cycles_to_sec * US_PER_S))
#else /* !HAVE_CLOCK_GETTIME */
@@ -143,6 +172,8 @@ typedef struct timeval instr_time;
#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_usec = 0)
+#define INSTR_TIME_INITIALIZE()
+
#define INSTR_TIME_SET_CURRENT(t) gettimeofday(&(t), NULL)
#define INSTR_TIME_ADD(x,y) \
@@ -207,6 +238,8 @@ typedef LARGE_INTEGER instr_time;
#define INSTR_TIME_SET_ZERO(t) ((t).QuadPart = 0)
+#define INSTR_TIME_INITIALIZE()
+
#define INSTR_TIME_SET_CURRENT(t) QueryPerformanceCounter(&(t))
#define INSTR_TIME_ADD(x,y) \
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index e4feda10fd..a881fcc64e 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -126,7 +126,7 @@ sub mkvcbuild
our @pgcommonallfiles = qw(
archive.c base64.c checksum_helper.c compression.c
config_info.c controldata_utils.c d2s.c encnames.c exec.c
- f2s.c file_perm.c file_utils.c hashfn.c ip.c jsonapi.c
+ f2s.c file_perm.c file_utils.c hashfn.c ip.c instr_time.o jsonapi.c
keywords.c kwlookup.c link-canary.c md5_common.c
pg_get_line.c pg_lzcompress.c pg_prng.c pgfnames.c psprintf.c relpath.c
rmtree.c saslprep.c scram-common.c string.c stringinfo.c unicode_norm.c
--
2.34.0
v2-0001-WIP-Change-instr_time-to-just-store-nanoseconds-t.patchapplication/octet-stream; name=v2-0001-WIP-Change-instr_time-to-just-store-nanoseconds-t.patchDownload
From 0c690c76cb4ccfc995eb98340628f10274d89c81 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Thu, 30 Jun 2022 17:55:32 -0700
Subject: [PATCH v2 1/2] WIP: Change instr_time to just store nanoseconds,
that's cheaper.
Author: Andres Freund
Reviewed-By:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/portability/instr_time.h | 62 ++++++++++++----------------
1 file changed, 26 insertions(+), 36 deletions(-)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index ca09a1608f..ca260032d1 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -83,63 +83,53 @@
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-typedef struct timespec instr_time;
+typedef int64 instr_time;
+#define NS_PER_S INT64CONST(1000000000)
+#define US_PER_S INT64CONST(1000000)
+#define MS_PER_S INT64CONST(1000)
-#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS INT64CONST(1000000)
+#define NS_PER_US INT64CONST(1000)
-#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t) ((t) == 0)
-#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t) ((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+ struct timespec tmp;
+
+ clock_gettime(PG_INSTR_CLOCK, &tmp);
+
+ return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+ (t) = pg_clock_gettime_ns()
#define INSTR_TIME_ADD(x,y) \
do { \
- (x).tv_sec += (y).tv_sec; \
- (x).tv_nsec += (y).tv_nsec; \
- /* Normalize */ \
- while ((x).tv_nsec >= 1000000000) \
- { \
- (x).tv_nsec -= 1000000000; \
- (x).tv_sec++; \
- } \
+ (x) += (y); \
} while (0)
#define INSTR_TIME_SUBTRACT(x,y) \
do { \
- (x).tv_sec -= (y).tv_sec; \
- (x).tv_nsec -= (y).tv_nsec; \
- /* Normalize */ \
- while ((x).tv_nsec < 0) \
- { \
- (x).tv_nsec += 1000000000; \
- (x).tv_sec--; \
- } \
+ (x) -= (y); \
} while (0)
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
do { \
- (x).tv_sec += (y).tv_sec - (z).tv_sec; \
- (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
- /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
- while ((x).tv_nsec < 0) \
- { \
- (x).tv_nsec += 1000000000; \
- (x).tv_sec--; \
- } \
- while ((x).tv_nsec >= 1000000000) \
- { \
- (x).tv_nsec -= 1000000000; \
- (x).tv_sec++; \
- } \
+ (x) += (y) - (z); \
} while (0)
#define INSTR_TIME_GET_DOUBLE(t) \
- (((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
+ ((double) (t) / NS_PER_S)
#define INSTR_TIME_GET_MILLISEC(t) \
- (((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
+ ((double) (t) / NS_PER_MS)
#define INSTR_TIME_GET_MICROSEC(t) \
- (((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+ ((double) (t) / NS_PER_US)
#else /* !HAVE_CLOCK_GETTIME */
--
2.34.0
Hi,
On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
On Fri, Jun 12, 2020 at 4:28 PM Andres Freund <andres@anarazel.de> wrote:
The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found it the overhead
made it too hard to nail down in which part of a query tree performance
changed. If somebody else wants to continue from here...I do think it's be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a
bunch of low-level code.Based on an off-list conversation with Andres, I decided to dust off this
old
patch for using rdtsc directly. The significant EXPLAIN ANALYZE performance
improvements (especially when using rdtsc instead of rdtsc*p*) seem to
warrant
giving this a more thorough look.See attached an updated patch (adding it to the July commitfest), with a few
changes:- Keep using clock_gettime() as a fallback if we decide to not use rdtsc
Yep.
- Fallback to /proc/cpuinfo for clock frequency, if cpuid(0x16) doesn't work
I suspect that this might not be needed anymore. Seems like it'd be ok to just
fall back to clock_gettime() in that case.
- In an abundance of caution, for now I've decided to only enable this if we
are on Linux/x86, and the current kernel clocksource is TSC (the kernel
has
quite sophisticated logic around making this decision, see [1])
I think our requirements are a bit lower than the kernel's - we're not
tracking wall clock over an extended period...
Note that if we implemented the decision logic ourselves (instead of relying
on the Linux kernel), I'd be most worried about older virtualization
technology. In my understanding getting this right is notably more
complicated
than just checking cpuid, see [2].
Known WIP problems with this patch version:
* There appears to be a timing discrepancy I haven't yet worked out, where
the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
higher for \timing than for the EXPLAIN ANALYZE time reported on the
server
side, only when rdtsc measurement is used -- its likely there is a problem
somewhere with how we perform the cycles to time conversion
Could you explain a bit more what you're seeing? I just tested your patches
and didn't see that here.
* Possibly related, the floating point handling for the cycles_to_sec
variable
is problematic in terms of precision (see FIXME, taken over from Andres'
POC)
And probably also performance...
Open questions from me:
1) Do we need to account for different TSC offsets on different CPUs in SMP
systems? (the Linux kernel certainly has logic to that extent, but [3]
suggests this is no longer a problem on Nehalem and newer chips, i.e.
those
having an invariant TSC)
I don't think we should cater to systems where we need that.
2) Should we have a setting "--with-tsc" for configure? (instead of always
enabling it when on Linux/x86 with a TSC clocksource)
Probably not worth it.
3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for
current instructions to finish -- the prior discussion seemed to suggest
we don't want it for node instruction measurements, but possibly we do
want
this in other cases?)
I was wondering about that too... Perhaps we should add a
INSTR_TIME_SET_CURRENT_BARRIER() or such?
4) Should we support using the "mrs" instruction on ARM? (which is similar
to
rdtsc, see [4])
I'd leave that for later personally.
#define NS_PER_S INT64CONST(1000000000)
#define US_PER_S INT64CONST(1000000)
#define MS_PER_S INT64CONST(1000)
@@ -95,17 +104,37 @@ typedef int64 instr_time;#define INSTR_TIME_SET_ZERO(t) ((t) = 0)
-static inline instr_time pg_clock_gettime_ns(void) +extern double cycles_to_sec; + +bool use_rdtsc;
This should be extern and inside the ifdef below.
+#if defined(__x86_64__) && defined(__linux__) +extern void pg_clock_gettime_initialize_rdtsc(void); +#endif + +static inline instr_time pg_clock_gettime_ref_cycles(void) { struct timespec tmp;+#if defined(__x86_64__) && defined(__linux__) + if (use_rdtsc) + return __rdtsc(); +#endif + clock_gettime(PG_INSTR_CLOCK, &tmp);return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
}
Greetings,
Andres Freund
I ran that original test case with and without the patch. Here are the
numbers I'm seeing:
master (best of three):
postgres=# SELECT count(*) FROM lotsarows;
Time: 582.423 ms
postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 616.102 ms
postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 1068.700 ms (00:01.069)
patched (best of three):
postgres=# SELECT count(*) FROM lotsarows;
Time: 550.822 ms
postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 612.572 ms
postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 690.875 ms
On Fri, Jul 1, 2022 at 10:26 AM Andres Freund <andres@anarazel.de> wrote:
On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
...Known WIP problems with this patch version:
* There appears to be a timing discrepancy I haven't yet worked out, where
the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
higher for \timing than for the EXPLAIN ANALYZE time reported on the
server
side, only when rdtsc measurement is used -- its likely there is a problem
somewhere with how we perform the cycles to time conversionCould you explain a bit more what you're seeing? I just tested your patches
and didn't see that here.
I did not see this either, but I did see that the execution time
reported by \timing is (for this test case) consistently 0.5-1ms
*lower* than the Execution Time reported by EXPLAIN. I did not see
that on master. Is that expected?
Thanks,
Maciek
On Fri, Jul 15, 2022 at 11:22 PM Maciek Sakrejda <m.sakrejda@gmail.com>
wrote:
I ran that original test case with and without the patch. Here are the
numbers I'm seeing:master (best of three):
postgres=# SELECT count(*) FROM lotsarows;
Time: 582.423 mspostgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 616.102 mspostgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 1068.700 ms (00:01.069)patched (best of three):
postgres=# SELECT count(*) FROM lotsarows;
Time: 550.822 mspostgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 612.572 mspostgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 690.875 msOn Fri, Jul 1, 2022 at 10:26 AM Andres Freund <andres@anarazel.de> wrote:
On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
...Known WIP problems with this patch version:
* There appears to be a timing discrepancy I haven't yet worked out,
where
the \timing data reported by psql doesn't match what EXPLAIN ANALYZE
is
reporting. With Andres' earlier test case, I'm seeing a consistent
~700ms
higher for \timing than for the EXPLAIN ANALYZE time reported on the
server
side, only when rdtsc measurement is used -- its likely there is aproblem
somewhere with how we perform the cycles to time conversion
Could you explain a bit more what you're seeing? I just tested your
patches
and didn't see that here.
I did not see this either, but I did see that the execution time
reported by \timing is (for this test case) consistently 0.5-1ms
*lower* than the Execution Time reported by EXPLAIN. I did not see
that on master. Is that expected?Thanks,
MaciekThe patch requires a rebase; please rebase the patch with the latest code.
Hunk #5 succeeded at 147 with fuzz 2 (offset -3 lines).
Hunk #6 FAILED at 170.
Hunk #7 succeeded at 165 (offset -69 lines).
2 out of 7 hunks FAILED -- saving rejects to file
src/include/portability/instr_time.h.rej
patching file src/tools/msvc/Mkvcbuild.pm
--
Ibrar Ahmed
On Tue, Sep 06, 2022 at 11:32:18AM +0500, Ibrar Ahmed wrote:
Hunk #5 succeeded at 147 with fuzz 2 (offset -3 lines).
Hunk #6 FAILED at 170.
Hunk #7 succeeded at 165 (offset -69 lines).
2 out of 7 hunks FAILED -- saving rejects to file
src/include/portability/instr_time.h.rej
patching file src/tools/msvc/Mkvcbuild.pm
No rebased version has been sent since this update, so this patch has
been marked as RwF.
--
Michael
I think it would be great to get this patch committed. Beyond the
reasons already mentioned, the significant overhead also tends to skew
the reported runtimes in ways that makes it difficult to compare them.
For example, if two nodes are executed equally often but one needs twice
the time to process the rows: in such a case EXPLAIN ANALYZE should
report timings that are 2x apart. However, currently, the high overhead
of clock_gettime() tends to skew the relative runtimes.
On 10/12/22 10:33, Michael Paquier wrote:
No rebased version has been sent since this update, so this patch has
been marked as RwF.
I've rebased the patch set on latest master and fixed a few compiler
warnings. Beyond that some findings and thoughts:
You're only using RDTSC if the clock source is 'tsc'. Great idea to not
bother caring about a lot of hairy TSC details. Looking at the kernel
code this seems to imply that the TSC is frequency invariant. I don't
think though that this implies that Linux is not running under a
hypervisor; which is good because I assume PostgreSQL is used a lot in
VMs. However, when running under a hypervisor (at least with VMWare)
CPUID leaf 0x16 is not available. In my tests __get_cpuid() indicated
success but the returned values were garbage. Instead of using leaf
0x16, we should then use the hypervisor interface to obtain the TSC
frequency. Checking if a hypervisor is active can be done via:
bool IsHypervisorActive()
{
uint32 cpuinfo[4] = {0};
int res = __get_cpuid(0x1, &cpuinfo[0], &cpuinfo[1], &cpuinfo[2],
&cpuinfo[3]);
return res > 0 && (cpuinfo[2] & (1 << 30));
}
Obtaining the TSC frequency via the hypervisor interface can be done
with the following code. See https://lwn.net/Articles/301888/ for more
details.
// Under hypervisors (tested with VMWare) leaf 0x16 is not available,
even though __get_cpuid() succeeds.
// Hence, if running under a hypervisor, use the hypervisor interface to
obtain TSC frequency.
uint32 cpuinfo[4] = {0};
if (IsHypervisorActive() && __get_cpuid(0x40000001, &cpuinfo[0],
&cpuinfo[1], &cpuinfo[2], &cpuinfo[3]) > 0)
cycles_to_sec = 1.0 / ((double)cpuinfo[0] * 1000 * 1000);
Given that we anyways switch between RDTSC and clock_gettime() with a
global variable, what about exposing the clock source as GUC? That way
the user can switch back to a working clock source in case we miss a
detail around activating or reading the TSC.
I'm happy to update the patches accordingly.
--
David Geier
(ServiceNow)
I missed attaching the patches.
--
David Geier
(ServiceNow)
Attachments:
0001-WIP-Change-instr_time-to-just-store-nanoseconds-v3.patchtext/x-patch; charset=UTF-8; name=0001-WIP-Change-instr_time-to-just-store-nanoseconds-v3.patchDownload
From f4e962729ca605498d0c8bfc97d0f42d68a0df06 Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/2] WIP: Change instr_time to just store nanoseconds, that's
cheaper.
---
src/include/portability/instr_time.h | 62 ++++++++++++----------------
1 file changed, 26 insertions(+), 36 deletions(-)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 22bcf3d288..4bd555113b 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -80,63 +80,53 @@
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-typedef struct timespec instr_time;
+typedef int64 instr_time;
+#define NS_PER_S INT64CONST(1000000000)
+#define US_PER_S INT64CONST(1000000)
+#define MS_PER_S INT64CONST(1000)
-#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS INT64CONST(1000000)
+#define NS_PER_US INT64CONST(1000)
-#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t) ((t) == 0)
-#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t) ((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+ struct timespec tmp;
+
+ clock_gettime(PG_INSTR_CLOCK, &tmp);
+
+ return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+ (t) = pg_clock_gettime_ns()
#define INSTR_TIME_ADD(x,y) \
do { \
- (x).tv_sec += (y).tv_sec; \
- (x).tv_nsec += (y).tv_nsec; \
- /* Normalize */ \
- while ((x).tv_nsec >= 1000000000) \
- { \
- (x).tv_nsec -= 1000000000; \
- (x).tv_sec++; \
- } \
+ (x) += (y); \
} while (0)
#define INSTR_TIME_SUBTRACT(x,y) \
do { \
- (x).tv_sec -= (y).tv_sec; \
- (x).tv_nsec -= (y).tv_nsec; \
- /* Normalize */ \
- while ((x).tv_nsec < 0) \
- { \
- (x).tv_nsec += 1000000000; \
- (x).tv_sec--; \
- } \
+ (x) -= (y); \
} while (0)
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
do { \
- (x).tv_sec += (y).tv_sec - (z).tv_sec; \
- (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
- /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
- while ((x).tv_nsec < 0) \
- { \
- (x).tv_nsec += 1000000000; \
- (x).tv_sec--; \
- } \
- while ((x).tv_nsec >= 1000000000) \
- { \
- (x).tv_nsec -= 1000000000; \
- (x).tv_sec++; \
- } \
+ (x) += (y) - (z); \
} while (0)
#define INSTR_TIME_GET_DOUBLE(t) \
- (((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
+ ((double) (t) / NS_PER_S)
#define INSTR_TIME_GET_MILLISEC(t) \
- (((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
+ ((double) (t) / NS_PER_MS)
#define INSTR_TIME_GET_MICROSEC(t) \
- (((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+ ((double) (t) / NS_PER_US)
#else /* WIN32 */
--
2.34.1
0002-WIP-Use-cpu-reference-cycles-via-rdtsc-to-measure-v3.patchtext/x-patch; charset=UTF-8; name=0002-WIP-Use-cpu-reference-cycles-via-rdtsc-to-measure-v3.patchDownload
From 7a6317fdf5b1d82f120a4fd5535cfe40c8165153 Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Thu, 17 Nov 2022 13:03:59 +0100
Subject: [PATCH 2/2] WIP: Use cpu reference cycles, via rdtsc, to measure time
for instrumentation.
For now this is only enabled on Linux/x86 when the system clocksource is
marked tsc as well, as determined at runtime. This way we can rely on the
Linux kernel to make a decision whether tsc is invariant and usable on the
current CPU architecture. In all other cases we continue to use the
clock_gettime() implementation like before.
Note that this intentionally uses rdtsc, not rdtscp, as rdtscp waits for
currently running CPU instructions to have finished, and that adds up to
noticable latency for little benefit in the typical InstrStartNode() /
InstrStopNode() use case.
---
src/backend/utils/init/postinit.c | 3 +
src/bin/pg_test_timing/pg_test_timing.c | 1 +
src/bin/pgbench/pgbench.c | 3 +
src/bin/psql/startup.c | 4 +
src/common/Makefile | 1 +
src/common/instr_time.c | 103 ++++++++++++++++++++++++
src/include/portability/instr_time.h | 50 +++++++++---
src/tools/msvc/Mkvcbuild.pm | 2 +-
8 files changed, 157 insertions(+), 10 deletions(-)
create mode 100644 src/common/instr_time.c
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a990c833c5..c684725af3 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -804,6 +804,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
/* Initialize portal manager */
EnablePortalManager();
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* Initialize status reporting */
pgstat_beinit();
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index c29d6f8762..0d667ff5a7 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -132,6 +132,7 @@ test_timing(unsigned int duration)
total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;
+ INSTR_TIME_INITIALIZE();
INSTR_TIME_SET_CURRENT(start_time);
cur = INSTR_TIME_GET_MICROSEC(start_time);
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 36905a8968..1c5a265863 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7076,6 +7076,9 @@ main(int argc, char **argv)
initRandomState(&state[i].cs_func_rs);
}
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* opening connection... */
con = doConnect();
if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index f5b9e268f2..14f368b658 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
#include "help.h"
#include "input.h"
#include "mainloop.h"
+#include "portability/instr_time.h"
#include "settings.h"
/*
@@ -322,6 +323,9 @@ main(int argc, char *argv[])
PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
SyncVariables();
if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index e9af7346c9..437a018590 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -59,6 +59,7 @@ OBJS_COMMON = \
file_perm.o \
file_utils.o \
hashfn.o \
+ instr_time.o \
ip.o \
jsonapi.o \
keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 0000000000..27653a8bc3
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,103 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ * Non-inline parts of the portable high-precision interval timing
+ * implementation
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "portability/instr_time.h"
+
+/*
+ * Stores what the number of cycles needs to be multiplied with to end up with
+ * seconds. This indirection exists to support the rtdsc instruction.
+ *
+ * As a default, assume we are using clock_gettime() as a fallback and treat it
+ * as 1 "cycle" per nanosecond (aka 1 GHz).
+ *
+ * When using the rtdsc instruction directly this is filled in during
+ * initialization based on the relevant cpuid fields.
+ */
+double cycles_to_sec = 1.0 / NS_PER_S;
+
+/*
+ * Determines whether rdtsc is used (Linux/x86 only, when OS uses tsc clocksource)
+ */
+bool use_rdtsc = false;
+
+#if defined(__x86_64__) && defined(__linux__)
+/*
+ * Decide whether we use the rdtsc instruction at runtime, for Linux/x86,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ *
+ * Since Linux already does a significant amount of work to determine
+ * whether TSC is a viable clock source, decide based on that.
+ */
+void pg_clock_gettime_initialize_rdtsc(void)
+{
+ FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+ char buf[128];
+
+ if (fp)
+ {
+ if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
+ use_rdtsc = true;
+
+ fclose(fp);
+ }
+
+ /*
+ * Compute baseline cpu peformance, determines speed at which rdtsc advances
+ */
+ if (use_rdtsc)
+ {
+ uint32 cpuinfo[4] = {0};
+
+ /*
+ * FIXME: We should probably not unnecessarily use floating point math
+ * here. And it's likely that the numbers are small enough that we are running
+ * into floating point inaccuracies already. Probably worthwhile to be a good
+ * bit smarter.
+ */
+
+ __get_cpuid(0x16, cpuinfo, cpuinfo + 1, cpuinfo + 2, cpuinfo + 3);
+
+ if (cpuinfo[0] != 0)
+ cycles_to_sec = 1 / ((double) cpuinfo[0] * 1000 * 1000);
+ else
+ {
+ float cpu_mhz;
+
+ fp = fopen("/proc/cpuinfo", "r");
+
+ if (fp)
+ {
+ while (fgets(buf, sizeof(buf), fp))
+ {
+ if (sscanf(buf, "cpu MHz : %f", &cpu_mhz) == 1)
+ {
+ cycles_to_sec = 1 / ((double) cpu_mhz * 1000 * 1000);
+ break;
+ }
+ }
+ }
+
+ fclose(fp);
+ }
+ }
+}
+#endif /* defined(__x86_64__) && defined(__linux__) */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 4bd555113b..330c205924 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,9 +4,11 @@
* portable high-precision interval timing
*
* This file provides an abstraction layer to hide portability issues in
- * interval timing. On Unix we use clock_gettime(), and on Windows we use
- * QueryPerformanceCounter(). These macros also give some breathing room to
- * use other high-precision-timing APIs.
+ * interval timing. On Linux/x86 we use the rdtsc instruction when a TSC
+ * clocksource is also used on the host OS. Otherwise, and on other
+ * Unix-like systems we use clock_gettime() and on Windows we use
+ * QueryPerformanceCounter(). These macros also give some breathing
+ * room to use other high-precision-timing APIs.
*
* The basic data type is instr_time, which all callers should treat as an
* opaque typedef. instr_time can store either an absolute time (of
@@ -56,10 +58,15 @@
#ifndef WIN32
-/* Use clock_gettime() */
+/* Uses rdtsc on Linux/x86 if available, otherwise clock_gettime() */
#include <time.h>
+#if defined(__x86_64__) && defined(__linux__)
+#include <x86intrin.h>
+#include <cpuid.h>
+#endif
+
/*
* The best clockid to use according to the POSIX spec is CLOCK_MONOTONIC,
* since that will give reliable interval timing even in the face of changes
@@ -80,7 +87,9 @@
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
+/* time in cpu reference cycles (when using rdtsc), otherwise nanoseconds */
typedef int64 instr_time;
+
#define NS_PER_S INT64CONST(1000000000)
#define US_PER_S INT64CONST(1000000)
#define MS_PER_S INT64CONST(1000)
@@ -92,17 +101,38 @@ typedef int64 instr_time;
#define INSTR_TIME_SET_ZERO(t) ((t) = 0)
-static inline instr_time pg_clock_gettime_ns(void)
+extern double cycles_to_sec;
+extern bool use_rdtsc;
+
+#if defined(__x86_64__) && defined(__linux__)
+extern void pg_clock_gettime_initialize_rdtsc(void);
+#endif
+
+static inline instr_time pg_clock_gettime_ref_cycles(void)
{
struct timespec tmp;
+#if defined(__x86_64__) && defined(__linux__)
+#ifndef FRONTEND
+ if (use_rdtsc)
+ return __rdtsc();
+#endif
+#endif
+
clock_gettime(PG_INSTR_CLOCK, &tmp);
return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
}
+#if defined(__x86_64__) && defined(__linux__)
+#define INSTR_TIME_INITIALIZE() \
+ pg_clock_gettime_initialize_rdtsc()
+#else
+#define INSTR_TIME_INITIALIZE()
+#endif
+
#define INSTR_TIME_SET_CURRENT(t) \
- (t) = pg_clock_gettime_ns()
+ (t) = pg_clock_gettime_ref_cycles()
#define INSTR_TIME_ADD(x,y) \
do { \
@@ -120,13 +150,13 @@ static inline instr_time pg_clock_gettime_ns(void)
} while (0)
#define INSTR_TIME_GET_DOUBLE(t) \
- ((double) (t) / NS_PER_S)
+ ((double) (t) * cycles_to_sec)
#define INSTR_TIME_GET_MILLISEC(t) \
- ((double) (t) / NS_PER_MS)
+ ((double) (t) * (cycles_to_sec * MS_PER_S))
#define INSTR_TIME_GET_MICROSEC(t) \
- ((double) (t) / NS_PER_US)
+ ((uint64) (t) * (cycles_to_sec * US_PER_S))
#else /* WIN32 */
@@ -138,6 +168,8 @@ typedef LARGE_INTEGER instr_time;
#define INSTR_TIME_SET_ZERO(t) ((t).QuadPart = 0)
+#define INSTR_TIME_INITIALIZE()
+
#define INSTR_TIME_SET_CURRENT(t) QueryPerformanceCounter(&(t))
#define INSTR_TIME_ADD(x,y) \
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 83a3e40425..09ef56c309 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -135,7 +135,7 @@ sub mkvcbuild
our @pgcommonallfiles = qw(
archive.c base64.c checksum_helper.c compression.c
config_info.c controldata_utils.c d2s.c encnames.c exec.c
- f2s.c file_perm.c file_utils.c hashfn.c ip.c jsonapi.c
+ f2s.c file_perm.c file_utils.c hashfn.c ip.c instr_time.o jsonapi.c
keywords.c kwlookup.c link-canary.c md5_common.c
pg_get_line.c pg_lzcompress.c pg_prng.c pgfnames.c psprintf.c relpath.c
rmtree.c saslprep.c scram-common.c string.c stringinfo.c unicode_norm.c
--
2.34.1
Hi,
I re-based again on master and applied the following changes:
I removed the fallback for obtaining the TSC frequency from /proc/cpu as
suggested by Andres. Worst-case we fall back to clock_gettime().
I added code to obtain the TSC frequency via CPUID when under a
hypervisor. I had to use __cpuid() directly instead of __get_cpuid(),
because __get_cpuid() returns an error if the leaf is > 0x80000000
(probably the implementation pre-dates the hypervisor timing leafs).
Unfortunately, while testing my implementation under VMWare, I found
that RDTSC runs awfully slow there (like 30x slower). [1]https://vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf (page 11 "Virtual TSC") indicates that
we cannot generally rely on RDTSC being actually fast on VMs. However,
the same applies to clock_gettime(). It runs as slow as RDTSC on my
VMWare setup. Hence, using RDTSC is not at disadvantage. I'm not
entirely sure if there aren't cases where e.g. clock_gettime() is
actually faster than RDTSC and it would be advantageous to use
clock_gettime(). We could add a GUC so that the user can decide which
clock source to use. Any thoughts?
I also somewhat improved the accuracy of the cycles to milli- and
microseconds conversion functions by having two more multipliers with
higher precision. For microseconds we could also keep the computation
integer-only. I'm wondering what to best do for seconds and
milliseconds. I'm currently leaning towards just keeping it as is,
because the durations measured and converted are usually long enough
that precision shouldn't be a problem.
In vacuum_lazy.c we do if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000). I
changed that to use INSTR_TIME_GET_MILLISECS() instead. Additionally, I
initialized a few variables of type instr_time which otherwise resulted
in warnings due to use of potentially uninitialized variables.
I also couldn't reproduce the reported timing discrepancy. For me the
runtime reported by \timing is just slightly higher than the time
reported by EXPLAIN ANALYZE, which is expected.
Beyond that:
What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so
that it's consistent with the _MILLISEC() and _MICROSEC() variants?
The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
return double. This seems error prone. What about renaming the function
or also have the function return a double and cast where necessary at
the call site?
If no one objects I would also re-register this patch in the commit fest.
[1]: https://vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf (page 11 "Virtual TSC")
https://vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf
(page 11 "Virtual TSC")
--
David Geier
(ServiceNow)
Attachments:
0001-Change-instr_time-to-just-store-nanoseconds-v4.patchtext/x-patch; charset=UTF-8; name=0001-Change-instr_time-to-just-store-nanoseconds-v4.patchDownload
From 321d00ae5dd1bcffc8fbdb39879b7f5c78e3930f Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
cheaper.
---
src/include/portability/instr_time.h | 62 ++++++++++++----------------
1 file changed, 26 insertions(+), 36 deletions(-)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 22bcf3d288..4bd555113b 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -80,63 +80,53 @@
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-typedef struct timespec instr_time;
+typedef int64 instr_time;
+#define NS_PER_S INT64CONST(1000000000)
+#define US_PER_S INT64CONST(1000000)
+#define MS_PER_S INT64CONST(1000)
-#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS INT64CONST(1000000)
+#define NS_PER_US INT64CONST(1000)
-#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t) ((t) == 0)
-#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t) ((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+ struct timespec tmp;
+
+ clock_gettime(PG_INSTR_CLOCK, &tmp);
+
+ return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+ (t) = pg_clock_gettime_ns()
#define INSTR_TIME_ADD(x,y) \
do { \
- (x).tv_sec += (y).tv_sec; \
- (x).tv_nsec += (y).tv_nsec; \
- /* Normalize */ \
- while ((x).tv_nsec >= 1000000000) \
- { \
- (x).tv_nsec -= 1000000000; \
- (x).tv_sec++; \
- } \
+ (x) += (y); \
} while (0)
#define INSTR_TIME_SUBTRACT(x,y) \
do { \
- (x).tv_sec -= (y).tv_sec; \
- (x).tv_nsec -= (y).tv_nsec; \
- /* Normalize */ \
- while ((x).tv_nsec < 0) \
- { \
- (x).tv_nsec += 1000000000; \
- (x).tv_sec--; \
- } \
+ (x) -= (y); \
} while (0)
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
do { \
- (x).tv_sec += (y).tv_sec - (z).tv_sec; \
- (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
- /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
- while ((x).tv_nsec < 0) \
- { \
- (x).tv_nsec += 1000000000; \
- (x).tv_sec--; \
- } \
- while ((x).tv_nsec >= 1000000000) \
- { \
- (x).tv_nsec -= 1000000000; \
- (x).tv_sec++; \
- } \
+ (x) += (y) - (z); \
} while (0)
#define INSTR_TIME_GET_DOUBLE(t) \
- (((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
+ ((double) (t) / NS_PER_S)
#define INSTR_TIME_GET_MILLISEC(t) \
- (((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
+ ((double) (t) / NS_PER_MS)
#define INSTR_TIME_GET_MICROSEC(t) \
- (((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+ ((double) (t) / NS_PER_US)
#else /* WIN32 */
--
2.34.1
0002-Use-CPU-reference-cycles-via-RDTSC-v4.patchtext/x-patch; charset=UTF-8; name=0002-Use-CPU-reference-cycles-via-RDTSC-v4.patchDownload
From 28b3b35a51841642ae1d47b5740c3b1526a0c574 Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Thu, 17 Nov 2022 13:03:59 +0100
Subject: [PATCH 2/3] Use CPU reference cycles, via RDTSC, to measure time for
instrumentation.
For now this is only enabled on Linux/x86 when the system clocksource is
marked tsc as well, as determined at runtime. This way we can rely on the
Linux kernel to make a decision whether tsc is invariant and usable on the
current CPU architecture. In all other cases we continue to use the
clock_gettime() implementation like before.
Note that this intentionally uses rdtsc, not rdtscp, as rdtscp waits for
currently running CPU instructions to have finished, and that adds up to
noticable latency for little benefit in the typical InstrStartNode() /
InstrStopNode() use case.
---
src/backend/utils/init/postinit.c | 3 +
src/bin/pg_test_timing/pg_test_timing.c | 1 +
src/bin/pgbench/pgbench.c | 3 +
src/bin/psql/startup.c | 4 +
src/common/Makefile | 1 +
src/common/instr_time.c | 120 ++++++++++++++++++++++++
src/include/portability/instr_time.h | 52 ++++++++--
src/tools/msvc/Mkvcbuild.pm | 2 +-
8 files changed, 175 insertions(+), 11 deletions(-)
create mode 100644 src/common/instr_time.c
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index a990c833c5..c684725af3 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -804,6 +804,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
/* Initialize portal manager */
EnablePortalManager();
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* Initialize status reporting */
pgstat_beinit();
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index c29d6f8762..0d667ff5a7 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -132,6 +132,7 @@ test_timing(unsigned int duration)
total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;
+ INSTR_TIME_INITIALIZE();
INSTR_TIME_SET_CURRENT(start_time);
cur = INSTR_TIME_GET_MICROSEC(start_time);
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 3182a73ad7..0858678534 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7076,6 +7076,9 @@ main(int argc, char **argv)
initRandomState(&state[i].cs_func_rs);
}
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* opening connection... */
con = doConnect();
if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index f5b9e268f2..14f368b658 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
#include "help.h"
#include "input.h"
#include "mainloop.h"
+#include "portability/instr_time.h"
#include "settings.h"
/*
@@ -322,6 +323,9 @@ main(int argc, char *argv[])
PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
SyncVariables();
if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index 898701fae1..fca4b546c8 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -59,6 +59,7 @@ OBJS_COMMON = \
file_perm.o \
file_utils.o \
hashfn.o \
+ instr_time.o \
ip.o \
jsonapi.o \
keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 0000000000..5f85350d1b
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ * Non-inline parts of the portable high-precision interval timing
+ * implementation
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "portability/instr_time.h"
+
+/*
+ * Stores what the number of cycles needs to be multiplied with to end up with
+ * seconds. This indirection exists to support the rtdsc instruction.
+ *
+ * As a default, assume we are using clock_gettime() as a fallback and treat it
+ * as 1 "cycle" per nanosecond (aka 1 GHz).
+ *
+ * When using the RDTSC instruction directly this is filled in during
+ * initialization based on the relevant cpuid fields.
+ */
+double cycles_to_sec = 1.0 / NS_PER_S;
+double cycles_to_ms = 1.0 / MS_PER_S;
+double cycles_to_us = 1.0 / US_PER_S;
+
+#if defined(__x86_64__) && defined(__linux__)
+/*
+ * Determines whether rdtsc is used (Linux/x86 only, when OS uses tsc clocksource)
+ */
+bool use_rdtsc = false;
+
+static bool under_hypervisor(void)
+{
+ uint32 r0, r1, r2, r3;
+ int result = __get_cpuid(0x1, &r0, &r1, &r2, &r3);
+ return result > 0 && ((r2 >> 31) & 1) == 1;
+}
+
+static bool get_hypervisor_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ uint32 r0, r1, r2;
+
+ if (!under_hypervisor())
+ return false;
+
+ /*
+ * Use __cpuid() because __get_cpuid() bails out for leafs > 0x80000000.
+ * Likely, it's not accounting for recently introduced VM leafs.
+ * Unlike leaf 0x16, the hypervisor timing leafs return the frequency already in KHz.
+ */
+ __cpuid(0x40000010, *tsc_freq, r0, r1, r2);
+ return *tsc_freq > 0;
+}
+
+static bool get_host_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ uint32 r0, r1, r2;
+ int result = __get_cpuid(0x16, tsc_freq, &r0, &r1, &r2);
+ *tsc_freq *= 1000; // Convert from MHz to KHz
+ return result > 0 && *tsc_freq > 0;
+}
+
+static bool get_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ if (get_hypervisor_tsc_frequency_khz(tsc_freq))
+ return true;
+ if (get_host_tsc_frequency_khz(tsc_freq))
+ return true;
+
+ return false;
+}
+
+/*
+ * Decide whether we use the rdtsc instruction at runtime, for Linux/x86,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ *
+ * Since Linux already does a significant amount of work to determine
+ * whether TSC is a viable clock source, decide based on that.
+ */
+void pg_clock_gettime_initialize_rdtsc(void)
+{
+ FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+
+ if (fp)
+ {
+ char buf[128];
+
+ if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
+ {
+ /*
+ * Compute baseline CPU peformance, determines speed at which RDTSC advances.
+ */
+ uint32 tsc_freq;
+
+ if (get_tsc_frequency_khz(&tsc_freq))
+ {
+ cycles_to_sec = 1.0 / (tsc_freq * 1000);
+ cycles_to_ms = 1.0 / tsc_freq;
+ cycles_to_us = 1000.0 / tsc_freq;
+ use_rdtsc = true;
+ }
+ }
+
+ fclose(fp);
+ }
+}
+#endif /* defined(__x86_64__) && defined(__linux__) */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 4bd555113b..6442732f3e 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,9 +4,11 @@
* portable high-precision interval timing
*
* This file provides an abstraction layer to hide portability issues in
- * interval timing. On Unix we use clock_gettime(), and on Windows we use
- * QueryPerformanceCounter(). These macros also give some breathing room to
- * use other high-precision-timing APIs.
+ * interval timing. On Linux/x86 we use the rdtsc instruction when a TSC
+ * clocksource is also used on the host OS. Otherwise, and on other
+ * Unix-like systems we use clock_gettime() and on Windows we use
+ * QueryPerformanceCounter(). These macros also give some breathing
+ * room to use other high-precision-timing APIs.
*
* The basic data type is instr_time, which all callers should treat as an
* opaque typedef. instr_time can store either an absolute time (of
@@ -56,10 +58,15 @@
#ifndef WIN32
-/* Use clock_gettime() */
+/* Uses rdtsc on Linux/x86 if available, otherwise clock_gettime() */
#include <time.h>
+#if defined(__x86_64__) && defined(__linux__)
+#include <x86intrin.h>
+#include <cpuid.h>
+#endif
+
/*
* The best clockid to use according to the POSIX spec is CLOCK_MONOTONIC,
* since that will give reliable interval timing even in the face of changes
@@ -80,7 +87,9 @@
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
+/* time in cpu reference cycles (when using rdtsc), otherwise nanoseconds */
typedef int64 instr_time;
+
#define NS_PER_S INT64CONST(1000000000)
#define US_PER_S INT64CONST(1000000)
#define MS_PER_S INT64CONST(1000)
@@ -92,17 +101,38 @@ typedef int64 instr_time;
#define INSTR_TIME_SET_ZERO(t) ((t) = 0)
-static inline instr_time pg_clock_gettime_ns(void)
+extern double cycles_to_sec;
+extern double cycles_to_ms;
+extern double cycles_to_us;
+
+#if defined(__x86_64__) && defined(__linux__)
+extern bool use_rdtsc;
+
+extern void pg_clock_gettime_initialize_rdtsc(void);
+#endif
+
+static inline instr_time pg_clock_gettime_ref_cycles(void)
{
struct timespec tmp;
+#if defined(__x86_64__) && defined(__linux__)
+ if (use_rdtsc)
+ return __rdtsc();
+#endif
+
clock_gettime(PG_INSTR_CLOCK, &tmp);
-
return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
}
+#if defined(__x86_64__) && defined(__linux__)
+#define INSTR_TIME_INITIALIZE() \
+ pg_clock_gettime_initialize_rdtsc()
+#else
+#define INSTR_TIME_INITIALIZE()
+#endif
+
#define INSTR_TIME_SET_CURRENT(t) \
- (t) = pg_clock_gettime_ns()
+ (t) = pg_clock_gettime_ref_cycles()
#define INSTR_TIME_ADD(x,y) \
do { \
@@ -120,13 +150,13 @@ static inline instr_time pg_clock_gettime_ns(void)
} while (0)
#define INSTR_TIME_GET_DOUBLE(t) \
- ((double) (t) / NS_PER_S)
+ ((double) (t) * cycles_to_sec)
#define INSTR_TIME_GET_MILLISEC(t) \
- ((double) (t) / NS_PER_MS)
+ ((double) (t) * cycles_to_ms)
#define INSTR_TIME_GET_MICROSEC(t) \
- ((double) (t) / NS_PER_US)
+ ((uint64) (t) * cycles_to_us)
#else /* WIN32 */
@@ -138,6 +168,8 @@ typedef LARGE_INTEGER instr_time;
#define INSTR_TIME_SET_ZERO(t) ((t).QuadPart = 0)
+#define INSTR_TIME_INITIALIZE()
+
#define INSTR_TIME_SET_CURRENT(t) QueryPerformanceCounter(&(t))
#define INSTR_TIME_ADD(x,y) \
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index 7b855d51f0..c3b5b780f5 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -135,7 +135,7 @@ sub mkvcbuild
our @pgcommonallfiles = qw(
archive.c base64.c checksum_helper.c compression.c
config_info.c controldata_utils.c d2s.c encnames.c exec.c
- f2s.c file_perm.c file_utils.c hashfn.c ip.c jsonapi.c
+ f2s.c file_perm.c file_utils.c hashfn.c ip.c instr_time.o jsonapi.c
keywords.c kwlookup.c link-canary.c md5_common.c
pg_get_line.c pg_lzcompress.c pg_prng.c pgfnames.c psprintf.c relpath.c
rmtree.c saslprep.c scram-common.c string.c stringinfo.c unicode_norm.c
--
2.34.1
0003-Refactor-some-instr_time-related-code-v4.patchtext/x-patch; charset=UTF-8; name=0003-Refactor-some-instr_time-related-code-v4.patchDownload
From 4360cbf2f9ef11b169a16180ee9aaaf03f3d59fc Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Mon, 2 Jan 2023 13:32:58 +0100
Subject: [PATCH 3/3] Refactor some instr_time related code
---
src/backend/access/heap/vacuumlazy.c | 3 +--
src/backend/access/transam/xlog.c | 4 ++--
src/backend/storage/buffer/bufmgr.c | 4 ++--
src/bin/psql/common.c | 8 ++++----
4 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 9923994b50..49be7c767d 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2992,8 +2992,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
INSTR_TIME_SET_CURRENT(currenttime);
elapsed = currenttime;
INSTR_TIME_SUBTRACT(elapsed, starttime);
- if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000)
- >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
+ if (INSTR_TIME_GET_MILLISEC(elapsed) >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
{
if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
{
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 91473b00d9..1a3ce8d239 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2177,7 +2177,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
Size nbytes;
Size nleft;
int written;
- instr_time start;
+ instr_time start = 0;
/* OK to write the page(s) */
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -8129,7 +8129,7 @@ void
issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
{
char *msg = NULL;
- instr_time start;
+ instr_time start = 0;
Assert(tli != 0);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 73d30bf619..d3365b6a36 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2824,8 +2824,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
- instr_time io_start,
- io_time;
+ instr_time io_start = 0;
+ instr_time io_time = 0;
Block bufBlock;
char *bufToWrite;
uint32 buf_state;
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index b989d792aa..901bb84916 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -1269,8 +1269,8 @@ DescribeQuery(const char *query, double *elapsed_msec)
bool timing = pset.timing;
PGresult *result;
bool OK;
- instr_time before,
- after;
+ instr_time before = 0;
+ instr_time after = 0;
*elapsed_msec = 0;
@@ -1689,8 +1689,8 @@ ExecQueryUsingCursor(const char *query, double *elapsed_msec)
int ntuples;
int fetch_count;
char fetch_cmd[64];
- instr_time before,
- after;
+ instr_time before = 0;
+ instr_time after = 0;
int flush_error;
*elapsed_msec = 0;
--
2.34.1
Hi David,
Thanks for continuing to work on this patch, and my apologies for silence
on the patch.
Its been hard to make time, and especially so because I typically develop
on an ARM-based macOS system where I can't test this directly - hence my
tests with virtualized EC2 instances, where I ran into the timing oddities.
On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:
The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
return double. This seems error prone. What about renaming the function
or also have the function return a double and cast where necessary at
the call site?
Minor note, but in my understanding using a uint64 (where we can) is faster
for any simple arithmetic we do with the values.
If no one objects I would also re-register this patch in the commit fest.
+1, and feel free to carry this patch forward - I'll try to make an effort
to review my earlier testing issues again, as well as your later
improvements to the patch.
Also, FYI, I just posted an alternate idea for speeding up EXPLAIN ANALYZE
with timing over in [0]/messages/by-id/CAP53PkxXMk0j-+0=YwRti2pFR5UB2Gu4v2Oyk8hhZS0DRART6g@mail.gmail.com, using a sampling-based approach to reduce the
timing overhead.
[0]: /messages/by-id/CAP53PkxXMk0j-+0=YwRti2pFR5UB2Gu4v2Oyk8hhZS0DRART6g@mail.gmail.com
/messages/by-id/CAP53PkxXMk0j-+0=YwRti2pFR5UB2Gu4v2Oyk8hhZS0DRART6g@mail.gmail.com
Thanks,
Lukas
--
Lukas Fittl
On Fri, Jul 15, 2022 at 11:21 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:
On Fri, Jul 1, 2022 at 10:26 AM Andres Freund <andres@anarazel.de> wrote:
On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
...Known WIP problems with this patch version:
* There appears to be a timing discrepancy I haven't yet worked out, where
the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
higher for \timing than for the EXPLAIN ANALYZE time reported on the
server
side, only when rdtsc measurement is used -- its likely there is a problem
somewhere with how we perform the cycles to time conversionCould you explain a bit more what you're seeing? I just tested your patches
and didn't see that here.I did not see this either, but I did see that the execution time
reported by \timing is (for this test case) consistently 0.5-1ms
*lower* than the Execution Time reported by EXPLAIN. I did not see
that on master. Is that expected?
For what it's worth, I can no longer reproduce this. In fact, I went
back to master-as-of-around-then and applied Lukas' v2 patches again,
and I still can't reproduce that. I do remember it happening
consistently across several executions, but now \timing consistently
shows 0.5-1ms slower, as expected. This does not explain the different
timing issue Lukas was seeing in his tests, but I think we can assume
what I reported originally here is not an issue.
Hi Lukas,
On 1/2/23 20:50, Lukas Fittl wrote:
Thanks for continuing to work on this patch, and my apologies for
silence on the patch.
It would be great if you could review it.
Please also share your thoughts around exposing the used clock source as
GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS().
I rebased again on master because of [1]/messages/by-id/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw@mail.gmail.com. Patches attached.
Its been hard to make time, and especially so because I typically
develop on an ARM-based macOS system where I can't test this directly
- hence my tests with virtualized EC2 instances, where I ran into the
timing oddities.
That's good and bad. Bad to do the development and good to test the
implementation on more virtualized setups; given that I also encountered
"interesting" behavior on VMWare (see my previous mails).
On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:
The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other
variants
return double. This seems error prone. What about renaming the
function
or also have the function return a double and cast where necessary at
the call site?Minor note, but in my understanding using a uint64 (where we can) is
faster for any simple arithmetic we do with the values.
That's true. So the argument could be that for seconds and milliseconds
we want the extra precision while microseconds are precise enough.
Still, we could also make the seconds and milliseconds conversion code
integer only and e.g. return two integers with the value before and
after the comma. FWICS, the functions are nowhere used in performance
critical code, so it doesn't really make a difference performance-wise.
+1, and feel free to carry this patch forward - I'll try to make an
effort to review my earlier testing issues again, as well as your
later improvements to the patch.
Moved to the current commit fest. Will you become reviewer?
Also, FYI, I just posted an alternate idea for speeding up EXPLAIN
ANALYZE with timing over in [0], using a sampling-based approach to
reduce the timing overhead.
Interesting idea. I'll reply with some thoughts on the corresponding thread.
[1]: /messages/by-id/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw@mail.gmail.com
/messages/by-id/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw@mail.gmail.com
--
David Geier
(ServiceNow)
Attachments:
0001-Change-instr_time-to-just-store-nanoseconds-v5.patchtext/x-patch; charset=UTF-8; name=0001-Change-instr_time-to-just-store-nanoseconds-v5.patchDownload
From f63527c8e4b3b0b71ffacaa111dd93325d973432 Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
cheaper.
---
src/include/portability/instr_time.h | 62 ++++++++++++----------------
1 file changed, 26 insertions(+), 36 deletions(-)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 9ea1a68bd9..c64f21eb53 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -80,63 +80,53 @@
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-typedef struct timespec instr_time;
+typedef int64 instr_time;
+#define NS_PER_S INT64CONST(1000000000)
+#define US_PER_S INT64CONST(1000000)
+#define MS_PER_S INT64CONST(1000)
-#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS INT64CONST(1000000)
+#define NS_PER_US INT64CONST(1000)
-#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t) ((t) == 0)
-#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t) ((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+ struct timespec tmp;
+
+ clock_gettime(PG_INSTR_CLOCK, &tmp);
+
+ return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+ (t) = pg_clock_gettime_ns()
#define INSTR_TIME_ADD(x,y) \
do { \
- (x).tv_sec += (y).tv_sec; \
- (x).tv_nsec += (y).tv_nsec; \
- /* Normalize */ \
- while ((x).tv_nsec >= 1000000000) \
- { \
- (x).tv_nsec -= 1000000000; \
- (x).tv_sec++; \
- } \
+ (x) += (y); \
} while (0)
#define INSTR_TIME_SUBTRACT(x,y) \
do { \
- (x).tv_sec -= (y).tv_sec; \
- (x).tv_nsec -= (y).tv_nsec; \
- /* Normalize */ \
- while ((x).tv_nsec < 0) \
- { \
- (x).tv_nsec += 1000000000; \
- (x).tv_sec--; \
- } \
+ (x) -= (y); \
} while (0)
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
do { \
- (x).tv_sec += (y).tv_sec - (z).tv_sec; \
- (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
- /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
- while ((x).tv_nsec < 0) \
- { \
- (x).tv_nsec += 1000000000; \
- (x).tv_sec--; \
- } \
- while ((x).tv_nsec >= 1000000000) \
- { \
- (x).tv_nsec -= 1000000000; \
- (x).tv_sec++; \
- } \
+ (x) += (y) - (z); \
} while (0)
#define INSTR_TIME_GET_DOUBLE(t) \
- (((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
+ ((double) (t) / NS_PER_S)
#define INSTR_TIME_GET_MILLISEC(t) \
- (((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
+ ((double) (t) / NS_PER_MS)
#define INSTR_TIME_GET_MICROSEC(t) \
- (((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+ ((double) (t) / NS_PER_US)
#else /* WIN32 */
--
2.34.1
0002-Use-CPU-reference-cycles-via-RDTSC-v5.patchtext/x-patch; charset=UTF-8; name=0002-Use-CPU-reference-cycles-via-RDTSC-v5.patchDownload
From 3a268a74696681f2c6f666fd3ecbb020bca53599 Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Thu, 17 Nov 2022 13:03:59 +0100
Subject: [PATCH 2/3] Use CPU reference cycles, via RDTSC, to measure time for
instrumentation.
For now this is only enabled on Linux/x86 when the system clocksource is
marked tsc as well, as determined at runtime. This way we can rely on the
Linux kernel to make a decision whether tsc is invariant and usable on the
current CPU architecture. In all other cases we continue to use the
clock_gettime() implementation like before.
Note that this intentionally uses rdtsc, not rdtscp, as rdtscp waits for
currently running CPU instructions to have finished, and that adds up to
noticable latency for little benefit in the typical InstrStartNode() /
InstrStopNode() use case.
---
src/backend/utils/init/postinit.c | 3 +
src/bin/pg_test_timing/pg_test_timing.c | 1 +
src/bin/pgbench/pgbench.c | 3 +
src/bin/psql/startup.c | 4 +
src/common/Makefile | 1 +
src/common/instr_time.c | 120 ++++++++++++++++++++++++
src/include/portability/instr_time.h | 52 ++++++++--
src/tools/msvc/Mkvcbuild.pm | 2 +-
8 files changed, 175 insertions(+), 11 deletions(-)
create mode 100644 src/common/instr_time.c
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index ae5a85ed65..ec0661accc 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -804,6 +804,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
/* Initialize portal manager */
EnablePortalManager();
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* Initialize status reporting */
pgstat_beinit();
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index c29d6f8762..0d667ff5a7 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -132,6 +132,7 @@ test_timing(unsigned int duration)
total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;
+ INSTR_TIME_INITIALIZE();
INSTR_TIME_SET_CURRENT(start_time);
cur = INSTR_TIME_GET_MICROSEC(start_time);
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 820b1ea09f..2f0a306f7e 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7076,6 +7076,9 @@ main(int argc, char **argv)
initRandomState(&state[i].cs_func_rs);
}
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* opening connection... */
con = doConnect();
if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index 5a28b6f713..3f95e01320 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
#include "help.h"
#include "input.h"
#include "mainloop.h"
+#include "portability/instr_time.h"
#include "settings.h"
/*
@@ -322,6 +323,9 @@ main(int argc, char *argv[])
PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
SyncVariables();
if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index 898701fae1..fca4b546c8 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -59,6 +59,7 @@ OBJS_COMMON = \
file_perm.o \
file_utils.o \
hashfn.o \
+ instr_time.o \
ip.o \
jsonapi.o \
keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 0000000000..5f85350d1b
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ * Non-inline parts of the portable high-precision interval timing
+ * implementation
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "portability/instr_time.h"
+
+/*
+ * Stores what the number of cycles needs to be multiplied with to end up with
+ * seconds. This indirection exists to support the rtdsc instruction.
+ *
+ * As a default, assume we are using clock_gettime() as a fallback and treat it
+ * as 1 "cycle" per nanosecond (aka 1 GHz).
+ *
+ * When using the RDTSC instruction directly this is filled in during
+ * initialization based on the relevant cpuid fields.
+ */
+double cycles_to_sec = 1.0 / NS_PER_S;
+double cycles_to_ms = 1.0 / MS_PER_S;
+double cycles_to_us = 1.0 / US_PER_S;
+
+#if defined(__x86_64__) && defined(__linux__)
+/*
+ * Determines whether rdtsc is used (Linux/x86 only, when OS uses tsc clocksource)
+ */
+bool use_rdtsc = false;
+
+static bool under_hypervisor(void)
+{
+ uint32 r0, r1, r2, r3;
+ int result = __get_cpuid(0x1, &r0, &r1, &r2, &r3);
+ return result > 0 && ((r2 >> 31) & 1) == 1;
+}
+
+static bool get_hypervisor_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ uint32 r0, r1, r2;
+
+ if (!under_hypervisor())
+ return false;
+
+ /*
+ * Use __cpuid() because __get_cpuid() bails out for leafs > 0x80000000.
+ * Likely, it's not accounting for recently introduced VM leafs.
+ * Unlike leaf 0x16, the hypervisor timing leafs return the frequency already in KHz.
+ */
+ __cpuid(0x40000010, *tsc_freq, r0, r1, r2);
+ return *tsc_freq > 0;
+}
+
+static bool get_host_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ uint32 r0, r1, r2;
+ int result = __get_cpuid(0x16, tsc_freq, &r0, &r1, &r2);
+ *tsc_freq *= 1000; // Convert from MHz to KHz
+ return result > 0 && *tsc_freq > 0;
+}
+
+static bool get_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ if (get_hypervisor_tsc_frequency_khz(tsc_freq))
+ return true;
+ if (get_host_tsc_frequency_khz(tsc_freq))
+ return true;
+
+ return false;
+}
+
+/*
+ * Decide whether we use the rdtsc instruction at runtime, for Linux/x86,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ *
+ * Since Linux already does a significant amount of work to determine
+ * whether TSC is a viable clock source, decide based on that.
+ */
+void pg_clock_gettime_initialize_rdtsc(void)
+{
+ FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+
+ if (fp)
+ {
+ char buf[128];
+
+ if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
+ {
+ /*
+ * Compute baseline CPU peformance, determines speed at which RDTSC advances.
+ */
+ uint32 tsc_freq;
+
+ if (get_tsc_frequency_khz(&tsc_freq))
+ {
+ cycles_to_sec = 1.0 / (tsc_freq * 1000);
+ cycles_to_ms = 1.0 / tsc_freq;
+ cycles_to_us = 1000.0 / tsc_freq;
+ use_rdtsc = true;
+ }
+ }
+
+ fclose(fp);
+ }
+}
+#endif /* defined(__x86_64__) && defined(__linux__) */
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index c64f21eb53..9c004fa56e 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,9 +4,11 @@
* portable high-precision interval timing
*
* This file provides an abstraction layer to hide portability issues in
- * interval timing. On Unix we use clock_gettime(), and on Windows we use
- * QueryPerformanceCounter(). These macros also give some breathing room to
- * use other high-precision-timing APIs.
+ * interval timing. On Linux/x86 we use the rdtsc instruction when a TSC
+ * clocksource is also used on the host OS. Otherwise, and on other
+ * Unix-like systems we use clock_gettime() and on Windows we use
+ * QueryPerformanceCounter(). These macros also give some breathing
+ * room to use other high-precision-timing APIs.
*
* The basic data type is instr_time, which all callers should treat as an
* opaque typedef. instr_time can store either an absolute time (of
@@ -56,10 +58,15 @@
#ifndef WIN32
-/* Use clock_gettime() */
+/* Uses rdtsc on Linux/x86 if available, otherwise clock_gettime() */
#include <time.h>
+#if defined(__x86_64__) && defined(__linux__)
+#include <x86intrin.h>
+#include <cpuid.h>
+#endif
+
/*
* The best clockid to use according to the POSIX spec is CLOCK_MONOTONIC,
* since that will give reliable interval timing even in the face of changes
@@ -80,7 +87,9 @@
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
+/* time in cpu reference cycles (when using rdtsc), otherwise nanoseconds */
typedef int64 instr_time;
+
#define NS_PER_S INT64CONST(1000000000)
#define US_PER_S INT64CONST(1000000)
#define MS_PER_S INT64CONST(1000)
@@ -92,17 +101,38 @@ typedef int64 instr_time;
#define INSTR_TIME_SET_ZERO(t) ((t) = 0)
-static inline instr_time pg_clock_gettime_ns(void)
+extern double cycles_to_sec;
+extern double cycles_to_ms;
+extern double cycles_to_us;
+
+#if defined(__x86_64__) && defined(__linux__)
+extern bool use_rdtsc;
+
+extern void pg_clock_gettime_initialize_rdtsc(void);
+#endif
+
+static inline instr_time pg_clock_gettime_ref_cycles(void)
{
struct timespec tmp;
+#if defined(__x86_64__) && defined(__linux__)
+ if (use_rdtsc)
+ return __rdtsc();
+#endif
+
clock_gettime(PG_INSTR_CLOCK, &tmp);
-
return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
}
+#if defined(__x86_64__) && defined(__linux__)
+#define INSTR_TIME_INITIALIZE() \
+ pg_clock_gettime_initialize_rdtsc()
+#else
+#define INSTR_TIME_INITIALIZE()
+#endif
+
#define INSTR_TIME_SET_CURRENT(t) \
- (t) = pg_clock_gettime_ns()
+ (t) = pg_clock_gettime_ref_cycles()
#define INSTR_TIME_ADD(x,y) \
do { \
@@ -120,13 +150,13 @@ static inline instr_time pg_clock_gettime_ns(void)
} while (0)
#define INSTR_TIME_GET_DOUBLE(t) \
- ((double) (t) / NS_PER_S)
+ ((double) (t) * cycles_to_sec)
#define INSTR_TIME_GET_MILLISEC(t) \
- ((double) (t) / NS_PER_MS)
+ ((double) (t) * cycles_to_ms)
#define INSTR_TIME_GET_MICROSEC(t) \
- ((double) (t) / NS_PER_US)
+ ((uint64) (t) * cycles_to_us)
#else /* WIN32 */
@@ -138,6 +168,8 @@ typedef LARGE_INTEGER instr_time;
#define INSTR_TIME_SET_ZERO(t) ((t).QuadPart = 0)
+#define INSTR_TIME_INITIALIZE()
+
#define INSTR_TIME_SET_CURRENT(t) QueryPerformanceCounter(&(t))
#define INSTR_TIME_ADD(x,y) \
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index afa40d885e..40910991e3 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -135,7 +135,7 @@ sub mkvcbuild
our @pgcommonallfiles = qw(
archive.c base64.c checksum_helper.c compression.c
config_info.c controldata_utils.c d2s.c encnames.c exec.c
- f2s.c file_perm.c file_utils.c hashfn.c ip.c jsonapi.c
+ f2s.c file_perm.c file_utils.c hashfn.c ip.c instr_time.o jsonapi.c
keywords.c kwlookup.c link-canary.c md5_common.c
pg_get_line.c pg_lzcompress.c pg_prng.c pgfnames.c psprintf.c relpath.c
rmtree.c saslprep.c scram-common.c string.c stringinfo.c unicode_norm.c
--
2.34.1
0003-Refactor-some-instr_time-related-code-v5.patchtext/x-patch; charset=UTF-8; name=0003-Refactor-some-instr_time-related-code-v5.patchDownload
From 4d6ab38c2de7d0d7ec22a4d016e53b4d82828040 Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Mon, 2 Jan 2023 13:32:58 +0100
Subject: [PATCH 3/3] Refactor some instr_time related code
---
src/backend/access/heap/vacuumlazy.c | 3 +--
src/backend/access/transam/xlog.c | 4 ++--
src/backend/storage/buffer/bufmgr.c | 4 ++--
src/bin/psql/common.c | 8 ++++----
4 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 06fd15405f..c6c69b5547 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2994,8 +2994,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
INSTR_TIME_SET_CURRENT(currenttime);
elapsed = currenttime;
INSTR_TIME_SUBTRACT(elapsed, starttime);
- if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000)
- >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
+ if (INSTR_TIME_GET_MILLISEC(elapsed) >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
{
if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
{
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0070d56b0b..240e5afe35 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2177,7 +2177,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
Size nbytes;
Size nleft;
int written;
- instr_time start;
+ instr_time start = 0;
/* OK to write the page(s) */
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -8129,7 +8129,7 @@ void
issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
{
char *msg = NULL;
- instr_time start;
+ instr_time start = 0;
Assert(tli != 0);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3fb38a25cf..96f6735398 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2824,8 +2824,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
- instr_time io_start,
- io_time;
+ instr_time io_start = 0;
+ instr_time io_time = 0;
Block bufBlock;
char *bufToWrite;
uint32 buf_state;
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 38f9b10b7c..5c835c43ad 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -1269,8 +1269,8 @@ DescribeQuery(const char *query, double *elapsed_msec)
bool timing = pset.timing;
PGresult *result;
bool OK;
- instr_time before,
- after;
+ instr_time before = 0;
+ instr_time after = 0;
*elapsed_msec = 0;
@@ -1689,8 +1689,8 @@ ExecQueryUsingCursor(const char *query, double *elapsed_msec)
int ntuples;
int fetch_count;
char fetch_cmd[64];
- instr_time before,
- after;
+ instr_time before = 0;
+ instr_time after = 0;
int flush_error;
*elapsed_msec = 0;
--
2.34.1
On Tue, 3 Jan 2023 at 14:08, David Geier <geidav.pg@gmail.com> wrote:
Hi Lukas,
On 1/2/23 20:50, Lukas Fittl wrote:
Thanks for continuing to work on this patch, and my apologies for
silence on the patch.It would be great if you could review it.
Please also share your thoughts around exposing the used clock source as
GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS().I rebased again on master because of [1]. Patches attached.
Its been hard to make time, and especially so because I typically
develop on an ARM-based macOS system where I can't test this directly
- hence my tests with virtualized EC2 instances, where I ran into the
timing oddities.That's good and bad. Bad to do the development and good to test the
implementation on more virtualized setups; given that I also encountered
"interesting" behavior on VMWare (see my previous mails).On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:
The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other
variants
return double. This seems error prone. What about renaming the
function
or also have the function return a double and cast where necessary at
the call site?Minor note, but in my understanding using a uint64 (where we can) is
faster for any simple arithmetic we do with the values.That's true. So the argument could be that for seconds and milliseconds
we want the extra precision while microseconds are precise enough.
Still, we could also make the seconds and milliseconds conversion code
integer only and e.g. return two integers with the value before and
after the comma. FWICS, the functions are nowhere used in performance
critical code, so it doesn't really make a difference performance-wise.+1, and feel free to carry this patch forward - I'll try to make an
effort to review my earlier testing issues again, as well as your
later improvements to the patch.Moved to the current commit fest. Will you become reviewer?
Also, FYI, I just posted an alternate idea for speeding up EXPLAIN
ANALYZE with timing over in [0], using a sampling-based approach to
reduce the timing overhead.Interesting idea. I'll reply with some thoughts on the corresponding thread.
[1]
/messages/by-id/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw@mail.gmail.com
CFBot shows some compilation errors as in [1]https://cirrus-ci.com/task/5375312565895168, please post an updated
version for the same:
09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning:
relocation against `cycles_to_sec' in read-only section `.text'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `pg_clock_gettime_ref_cycles':
[09:08:12.525] /tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119:
undefined reference to `use_rdtsc'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `test_timing':
[09:08:12.525] /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135:
undefined reference to `pg_clock_gettime_initialize_rdtsc'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176:
undefined reference to `cycles_to_sec'
[09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE
[09:08:12.525] collect2: error: ld returned 1 exit status
[1]: https://cirrus-ci.com/task/5375312565895168
Regards,
Vignesh
Hi,
CFBot shows some compilation errors as in [1], please post an updated
version for the same:
09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning:
relocation against `cycles_to_sec' in read-only section `.text'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `pg_clock_gettime_ref_cycles':
[09:08:12.525] /tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119:
undefined reference to `use_rdtsc'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `test_timing':
[09:08:12.525] /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135:
undefined reference to `pg_clock_gettime_initialize_rdtsc'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176:
undefined reference to `cycles_to_sec'
[09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE
[09:08:12.525] collect2: error: ld returned 1 exit status[1] - https://cirrus-ci.com/task/5375312565895168
Regards,
Vignesh
I fixed the compilation error on CFBot.
I missed adding instr_time.c to the Meson makefile.
New patch set attached.
--
David Geier
(ServiceNow)
Attachments:
0001-Change-instr_time-to-just-store-nanoseconds-v6.patchtext/x-patch; charset=UTF-8; name=0001-Change-instr_time-to-just-store-nanoseconds-v6.patchDownload
From be18633d4735f680c7910fcb4e8ac90c4eada131 Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
cheaper.
---
src/include/portability/instr_time.h | 62 ++++++++++++----------------
1 file changed, 26 insertions(+), 36 deletions(-)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 9ea1a68bd9..c64f21eb53 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -80,63 +80,53 @@
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-typedef struct timespec instr_time;
+typedef int64 instr_time;
+#define NS_PER_S INT64CONST(1000000000)
+#define US_PER_S INT64CONST(1000000)
+#define MS_PER_S INT64CONST(1000)
-#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0)
+#define NS_PER_MS INT64CONST(1000000)
+#define NS_PER_US INT64CONST(1000)
-#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0)
+#define INSTR_TIME_IS_ZERO(t) ((t) == 0)
-#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_ZERO(t) ((t) = 0)
+
+static inline instr_time pg_clock_gettime_ns(void)
+{
+ struct timespec tmp;
+
+ clock_gettime(PG_INSTR_CLOCK, &tmp);
+
+ return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+ (t) = pg_clock_gettime_ns()
#define INSTR_TIME_ADD(x,y) \
do { \
- (x).tv_sec += (y).tv_sec; \
- (x).tv_nsec += (y).tv_nsec; \
- /* Normalize */ \
- while ((x).tv_nsec >= 1000000000) \
- { \
- (x).tv_nsec -= 1000000000; \
- (x).tv_sec++; \
- } \
+ (x) += (y); \
} while (0)
#define INSTR_TIME_SUBTRACT(x,y) \
do { \
- (x).tv_sec -= (y).tv_sec; \
- (x).tv_nsec -= (y).tv_nsec; \
- /* Normalize */ \
- while ((x).tv_nsec < 0) \
- { \
- (x).tv_nsec += 1000000000; \
- (x).tv_sec--; \
- } \
+ (x) -= (y); \
} while (0)
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
do { \
- (x).tv_sec += (y).tv_sec - (z).tv_sec; \
- (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
- /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
- while ((x).tv_nsec < 0) \
- { \
- (x).tv_nsec += 1000000000; \
- (x).tv_sec--; \
- } \
- while ((x).tv_nsec >= 1000000000) \
- { \
- (x).tv_nsec -= 1000000000; \
- (x).tv_sec++; \
- } \
+ (x) += (y) - (z); \
} while (0)
#define INSTR_TIME_GET_DOUBLE(t) \
- (((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
+ ((double) (t) / NS_PER_S)
#define INSTR_TIME_GET_MILLISEC(t) \
- (((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
+ ((double) (t) / NS_PER_MS)
#define INSTR_TIME_GET_MICROSEC(t) \
- (((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
+ ((double) (t) / NS_PER_US)
#else /* WIN32 */
--
2.34.1
0002-Use-CPU-reference-cycles-via-RDTSC-to-measure-time-v6.patchtext/x-patch; charset=UTF-8; name=0002-Use-CPU-reference-cycles-via-RDTSC-to-measure-time-v6.patchDownload
From 190ca09566eabb017ed25b1512225173ca47fb88 Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Thu, 17 Nov 2022 13:03:59 +0100
Subject: [PATCH 2/3] Use CPU reference cycles, via RDTSC, to measure time for
instrumentation.
For now this is only enabled on Linux/x86 when the system clocksource is
marked tsc as well, as determined at runtime. This way we can rely on the
Linux kernel to make a decision whether tsc is invariant and usable on the
current CPU architecture. In all other cases we continue to use the
clock_gettime() implementation like before.
Note that this intentionally uses rdtsc, not rdtscp, as rdtscp waits for
currently running CPU instructions to have finished, and that adds up to
noticable latency for little benefit in the typical InstrStartNode() /
InstrStopNode() use case.
---
.../pg_stat_statements/pg_stat_statements.c | 6 +-
src/backend/utils/init/postinit.c | 3 +
src/bin/pg_test_timing/pg_test_timing.c | 1 +
src/bin/pgbench/pgbench.c | 3 +
src/bin/psql/startup.c | 4 +
src/common/Makefile | 1 +
src/common/instr_time.c | 120 ++++++++++++++++++
src/common/meson.build | 1 +
src/include/portability/instr_time.h | 52 ++++++--
src/tools/msvc/Mkvcbuild.pm | 2 +-
10 files changed, 179 insertions(+), 14 deletions(-)
create mode 100644 src/common/instr_time.c
diff --git a/contrib/pg_stat_statements/pg_stat_statements.c b/contrib/pg_stat_statements/pg_stat_statements.c
index a7a72783e5..0f2a9998d0 100644
--- a/contrib/pg_stat_statements/pg_stat_statements.c
+++ b/contrib/pg_stat_statements/pg_stat_statements.c
@@ -1400,15 +1400,15 @@ pgss_store(const char *query, uint64 queryId,
e->counters.jit_functions += jitusage->created_functions;
e->counters.jit_generation_time += INSTR_TIME_GET_MILLISEC(jitusage->generation_counter);
- if (INSTR_TIME_GET_MILLISEC(jitusage->inlining_counter))
+ if (INSTR_TIME_GET_MILLISEC(jitusage->inlining_counter) > 0.0)
e->counters.jit_inlining_count++;
e->counters.jit_inlining_time += INSTR_TIME_GET_MILLISEC(jitusage->inlining_counter);
- if (INSTR_TIME_GET_MILLISEC(jitusage->optimization_counter))
+ if (INSTR_TIME_GET_MILLISEC(jitusage->optimization_counter) > 0.0)
e->counters.jit_optimization_count++;
e->counters.jit_optimization_time += INSTR_TIME_GET_MILLISEC(jitusage->optimization_counter);
- if (INSTR_TIME_GET_MILLISEC(jitusage->emission_counter))
+ if (INSTR_TIME_GET_MILLISEC(jitusage->emission_counter) > 0.0)
e->counters.jit_emission_count++;
e->counters.jit_emission_time += INSTR_TIME_GET_MILLISEC(jitusage->emission_counter);
}
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index ae5a85ed65..ec0661accc 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -804,6 +804,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
/* Initialize portal manager */
EnablePortalManager();
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* Initialize status reporting */
pgstat_beinit();
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index c29d6f8762..0d667ff5a7 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -132,6 +132,7 @@ test_timing(unsigned int duration)
total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;
+ INSTR_TIME_INITIALIZE();
INSTR_TIME_SET_CURRENT(start_time);
cur = INSTR_TIME_GET_MICROSEC(start_time);
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 820b1ea09f..2f0a306f7e 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7076,6 +7076,9 @@ main(int argc, char **argv)
initRandomState(&state[i].cs_func_rs);
}
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* opening connection... */
con = doConnect();
if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index 5a28b6f713..3f95e01320 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
#include "help.h"
#include "input.h"
#include "mainloop.h"
+#include "portability/instr_time.h"
#include "settings.h"
/*
@@ -322,6 +323,9 @@ main(int argc, char *argv[])
PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
SyncVariables();
if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index 898701fae1..fca4b546c8 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -59,6 +59,7 @@ OBJS_COMMON = \
file_perm.o \
file_utils.o \
hashfn.o \
+ instr_time.o \
ip.o \
jsonapi.o \
keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 0000000000..5f85350d1b
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,120 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ * Non-inline parts of the portable high-precision interval timing
+ * implementation
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "portability/instr_time.h"
+
+/*
+ * Stores what the number of cycles needs to be multiplied with to end up with
+ * seconds. This indirection exists to support the rtdsc instruction.
+ *
+ * As a default, assume we are using clock_gettime() as a fallback and treat it
+ * as 1 "cycle" per nanosecond (aka 1 GHz).
+ *
+ * When using the RDTSC instruction directly this is filled in during
+ * initialization based on the relevant cpuid fields.
+ */
+double cycles_to_sec = 1.0 / NS_PER_S;
+double cycles_to_ms = 1.0 / MS_PER_S;
+double cycles_to_us = 1.0 / US_PER_S;
+
+#if defined(__x86_64__) && defined(__linux__)
+/*
+ * Determines whether rdtsc is used (Linux/x86 only, when OS uses tsc clocksource)
+ */
+bool use_rdtsc = false;
+
+static bool under_hypervisor(void)
+{
+ uint32 r0, r1, r2, r3;
+ int result = __get_cpuid(0x1, &r0, &r1, &r2, &r3);
+ return result > 0 && ((r2 >> 31) & 1) == 1;
+}
+
+static bool get_hypervisor_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ uint32 r0, r1, r2;
+
+ if (!under_hypervisor())
+ return false;
+
+ /*
+ * Use __cpuid() because __get_cpuid() bails out for leafs > 0x80000000.
+ * Likely, it's not accounting for recently introduced VM leafs.
+ * Unlike leaf 0x16, the hypervisor timing leafs return the frequency already in KHz.
+ */
+ __cpuid(0x40000010, *tsc_freq, r0, r1, r2);
+ return *tsc_freq > 0;
+}
+
+static bool get_host_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ uint32 r0, r1, r2;
+ int result = __get_cpuid(0x16, tsc_freq, &r0, &r1, &r2);
+ *tsc_freq *= 1000; // Convert from MHz to KHz
+ return result > 0 && *tsc_freq > 0;
+}
+
+static bool get_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ if (get_hypervisor_tsc_frequency_khz(tsc_freq))
+ return true;
+ if (get_host_tsc_frequency_khz(tsc_freq))
+ return true;
+
+ return false;
+}
+
+/*
+ * Decide whether we use the rdtsc instruction at runtime, for Linux/x86,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ *
+ * Since Linux already does a significant amount of work to determine
+ * whether TSC is a viable clock source, decide based on that.
+ */
+void pg_clock_gettime_initialize_rdtsc(void)
+{
+ FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+
+ if (fp)
+ {
+ char buf[128];
+
+ if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
+ {
+ /*
+ * Compute baseline CPU peformance, determines speed at which RDTSC advances.
+ */
+ uint32 tsc_freq;
+
+ if (get_tsc_frequency_khz(&tsc_freq))
+ {
+ cycles_to_sec = 1.0 / (tsc_freq * 1000);
+ cycles_to_ms = 1.0 / tsc_freq;
+ cycles_to_us = 1000.0 / tsc_freq;
+ use_rdtsc = true;
+ }
+ }
+
+ fclose(fp);
+ }
+}
+#endif /* defined(__x86_64__) && defined(__linux__) */
diff --git a/src/common/meson.build b/src/common/meson.build
index a1fc398d8e..f0f49d0551 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -11,6 +11,7 @@ common_sources = files(
'file_perm.c',
'file_utils.c',
'hashfn.c',
+ 'instr_time.c',
'ip.c',
'jsonapi.c',
'keywords.c',
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index c64f21eb53..9c004fa56e 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,9 +4,11 @@
* portable high-precision interval timing
*
* This file provides an abstraction layer to hide portability issues in
- * interval timing. On Unix we use clock_gettime(), and on Windows we use
- * QueryPerformanceCounter(). These macros also give some breathing room to
- * use other high-precision-timing APIs.
+ * interval timing. On Linux/x86 we use the rdtsc instruction when a TSC
+ * clocksource is also used on the host OS. Otherwise, and on other
+ * Unix-like systems we use clock_gettime() and on Windows we use
+ * QueryPerformanceCounter(). These macros also give some breathing
+ * room to use other high-precision-timing APIs.
*
* The basic data type is instr_time, which all callers should treat as an
* opaque typedef. instr_time can store either an absolute time (of
@@ -56,10 +58,15 @@
#ifndef WIN32
-/* Use clock_gettime() */
+/* Uses rdtsc on Linux/x86 if available, otherwise clock_gettime() */
#include <time.h>
+#if defined(__x86_64__) && defined(__linux__)
+#include <x86intrin.h>
+#include <cpuid.h>
+#endif
+
/*
* The best clockid to use according to the POSIX spec is CLOCK_MONOTONIC,
* since that will give reliable interval timing even in the face of changes
@@ -80,7 +87,9 @@
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
+/* time in cpu reference cycles (when using rdtsc), otherwise nanoseconds */
typedef int64 instr_time;
+
#define NS_PER_S INT64CONST(1000000000)
#define US_PER_S INT64CONST(1000000)
#define MS_PER_S INT64CONST(1000)
@@ -92,17 +101,38 @@ typedef int64 instr_time;
#define INSTR_TIME_SET_ZERO(t) ((t) = 0)
-static inline instr_time pg_clock_gettime_ns(void)
+extern double cycles_to_sec;
+extern double cycles_to_ms;
+extern double cycles_to_us;
+
+#if defined(__x86_64__) && defined(__linux__)
+extern bool use_rdtsc;
+
+extern void pg_clock_gettime_initialize_rdtsc(void);
+#endif
+
+static inline instr_time pg_clock_gettime_ref_cycles(void)
{
struct timespec tmp;
+#if defined(__x86_64__) && defined(__linux__)
+ if (use_rdtsc)
+ return __rdtsc();
+#endif
+
clock_gettime(PG_INSTR_CLOCK, &tmp);
-
return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
}
+#if defined(__x86_64__) && defined(__linux__)
+#define INSTR_TIME_INITIALIZE() \
+ pg_clock_gettime_initialize_rdtsc()
+#else
+#define INSTR_TIME_INITIALIZE()
+#endif
+
#define INSTR_TIME_SET_CURRENT(t) \
- (t) = pg_clock_gettime_ns()
+ (t) = pg_clock_gettime_ref_cycles()
#define INSTR_TIME_ADD(x,y) \
do { \
@@ -120,13 +150,13 @@ static inline instr_time pg_clock_gettime_ns(void)
} while (0)
#define INSTR_TIME_GET_DOUBLE(t) \
- ((double) (t) / NS_PER_S)
+ ((double) (t) * cycles_to_sec)
#define INSTR_TIME_GET_MILLISEC(t) \
- ((double) (t) / NS_PER_MS)
+ ((double) (t) * cycles_to_ms)
#define INSTR_TIME_GET_MICROSEC(t) \
- ((double) (t) / NS_PER_US)
+ ((uint64) (t) * cycles_to_us)
#else /* WIN32 */
@@ -138,6 +168,8 @@ typedef LARGE_INTEGER instr_time;
#define INSTR_TIME_SET_ZERO(t) ((t).QuadPart = 0)
+#define INSTR_TIME_INITIALIZE()
+
#define INSTR_TIME_SET_CURRENT(t) QueryPerformanceCounter(&(t))
#define INSTR_TIME_ADD(x,y) \
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index afa40d885e..40910991e3 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -135,7 +135,7 @@ sub mkvcbuild
our @pgcommonallfiles = qw(
archive.c base64.c checksum_helper.c compression.c
config_info.c controldata_utils.c d2s.c encnames.c exec.c
- f2s.c file_perm.c file_utils.c hashfn.c ip.c jsonapi.c
+ f2s.c file_perm.c file_utils.c hashfn.c ip.c instr_time.o jsonapi.c
keywords.c kwlookup.c link-canary.c md5_common.c
pg_get_line.c pg_lzcompress.c pg_prng.c pgfnames.c psprintf.c relpath.c
rmtree.c saslprep.c scram-common.c string.c stringinfo.c unicode_norm.c
--
2.34.1
0003-Refactor-some-instr_time-related-code-v6.patchtext/x-patch; charset=UTF-8; name=0003-Refactor-some-instr_time-related-code-v6.patchDownload
From a78113fd9301ba703cb58d7dfadaaa872e04cfb2 Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Wed, 4 Jan 2023 12:57:42 +0100
Subject: [PATCH 3/3] Refactor some instr_time related code
---
src/backend/access/heap/vacuumlazy.c | 3 +--
src/backend/access/transam/xlog.c | 4 ++--
src/backend/storage/buffer/bufmgr.c | 4 ++--
src/bin/psql/common.c | 8 ++++----
4 files changed, 9 insertions(+), 10 deletions(-)
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index a42e881da3..91e4d48894 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -2988,8 +2988,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
INSTR_TIME_SET_CURRENT(currenttime);
elapsed = currenttime;
INSTR_TIME_SUBTRACT(elapsed, starttime);
- if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000)
- >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
+ if (INSTR_TIME_GET_MILLISEC(elapsed) >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
{
if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
{
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 0070d56b0b..240e5afe35 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2177,7 +2177,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
Size nbytes;
Size nleft;
int written;
- instr_time start;
+ instr_time start = 0;
/* OK to write the page(s) */
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -8129,7 +8129,7 @@ void
issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
{
char *msg = NULL;
- instr_time start;
+ instr_time start = 0;
Assert(tli != 0);
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 3fb38a25cf..96f6735398 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -2824,8 +2824,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
- instr_time io_start,
- io_time;
+ instr_time io_start = 0;
+ instr_time io_time = 0;
Block bufBlock;
char *bufToWrite;
uint32 buf_state;
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 38f9b10b7c..5c835c43ad 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -1269,8 +1269,8 @@ DescribeQuery(const char *query, double *elapsed_msec)
bool timing = pset.timing;
PGresult *result;
bool OK;
- instr_time before,
- after;
+ instr_time before = 0;
+ instr_time after = 0;
*elapsed_msec = 0;
@@ -1689,8 +1689,8 @@ ExecQueryUsingCursor(const char *query, double *elapsed_msec)
int ntuples;
int fetch_count;
char fetch_cmd[64];
- instr_time before,
- after;
+ instr_time before = 0;
+ instr_time after = 0;
int flush_error;
*elapsed_msec = 0;
--
2.34.1
Hi,
On 2023-01-04 13:02:05 +0100, David Geier wrote:
From be18633d4735f680c7910fcb4e8ac90c4eada131 Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
cheaper.
Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes),
Instrumentation (16 bytes saved in Instrumentation itself, 32 via
BufferUsage).
While the range of instr_time storing nanoseconds wouldn't be good enough for
a generic timestamp facility (hence using microsecs for Timestamp), the range
seems plenty for its use of measuring runtime:
(2 ** 63) - 1) / ((10 ** 9) * 60 * 60 * 24 * 365) = ~292 years
Of course, when using CLOCK_REALTIME, this is relative to 1970-01-01, so just
239 years.
It could theoretically be a different story, if we stored instr_time's on
disk. But we don't, they're ephemeral.
This doesn't buy a whole lot of performance - the bottlenck is the actual
timestamp computation. But in a query with not much else going on, it's
visible and reproducible. It's, unsurprisingly, a lot easier to see when using
BUFFERS.
For both timespec and nanosecond, I measured three server starts, and for each
started server three executions of
pgbench -n -Mprepared -c1 -P5 -T15 -f <(echo "EXPLAIN (ANALYZE, BUFFERS) SELECT generate_series(1, 10000000) OFFSET 10000000;")
the best result is:
timespec: 1073.431
nanosec: 957.532
a ~10% difference
Greetings,
Andres Freund
Andres Freund <andres@anarazel.de> writes:
On 2023-01-04 13:02:05 +0100, David Geier wrote:
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
cheaper.
Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds.
Cheaper, and perhaps more accurate too? Don't recall if we have any code
paths where the input timestamps are likely to be better-than-microsecond,
but surely that's coming someday.
I'm unsure that we want to deal with rdtsc's vagaries in general, but
no objection to changing instr_time.
regards, tom lane
Hi,
On 2023-01-13 15:25:16 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds.Cheaper, and perhaps more accurate too? Don't recall if we have any code
paths where the input timestamps are likely to be better-than-microsecond,
but surely that's coming someday.
instr_time on !WIN32 use struct timespec, so we already should have nanosecond
precision available. IOW, we could add a INSTR_TIME_GET_NANOSEC today. Or am I
misunderstanding what you mean?
I'm unsure that we want to deal with rdtsc's vagaries in general, but
no objection to changing instr_time.
Cool.
Looking at the instr_time.h part of the change, I think it should go further,
and basically do the same thing in the WIN32 path. The only part that needs to
be be win32 specific is INSTR_TIME_SET_CURRENT(). That'd reduce duplication a
good bit.
Greetings,
Andres Freund
On Wed, 4 Jan 2023 at 17:32, David Geier <geidav.pg@gmail.com> wrote:
I fixed the compilation error on CFBot.
I missed adding instr_time.c to the Meson makefile.
New patch set attached.
The patch does not apply on top of HEAD as in [1]http://cfbot.cputube.org/patch_41_3751.log, please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
ff23b592ad6621563d3128b26860bcb41daf9542 ===
=== applying patch
./0002-Use-CPU-reference-cycles-via-RDTSC-to-measure-time-v6.patch
....
patching file src/tools/msvc/Mkvcbuild.pm
Hunk #1 FAILED at 135.
1 out of 1 hunk FAILED -- saving rejects to file src/tools/msvc/Mkvcbuild.pm.rej
[1]: http://cfbot.cputube.org/patch_41_3751.log
Regards,
Vignesh
Hi,
On 2023-01-13 11:55:47 -0800, Andres Freund wrote:
Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes),
Instrumentation (16 bytes saved in Instrumentation itself, 32 via
BufferUsage).
This actually under-counted the benefits, because we have two BufferUsage and
two WalUsage in Instrumentation.
Before:
/* size: 448, cachelines: 7, members: 20 */
/* sum members: 445, holes: 1, sum holes: 3 */
After
/* size: 368, cachelines: 6, members: 20 */
/* sum members: 365, holes: 1, sum holes: 3 */
The difference in the number of instructions in InstrStopNode is astounding:
1016 instructions with timespec, 96 instructions with nanoseconds. Some of
that is the simpler data structure, some because the compiler now can
auto-vectorize the four INSTR_TIME_ACCUM_DIFF in BufferUsageAccumDiff into
one.
We probably should convert Instrumentation->firsttuple to a instr_time now as
well, no point in having the code for conversion to double in the hot routine,
that can easily happen in explain. But that's for a later patch.
I suggested downthread that we should convert the win32 implementation to be
more similar to the unix-nanoseconds representation. A blind conversion looks
good, and lets us share a number of macros.
I wonder if we should deprecate INSTR_TIME_IS_ZERO()/INSTR_TIME_SET_ZERO() and
allow 0 to be used instead. Not needing INSTR_TIME_SET_ZERO() allows variable
definitions to initialize the value, which does avoid some unnecessarily
awkward code. Alternatively we could introduce INSTR_TIME_ZERO() for that
purpose?
Greetings,
Andres Freund
Hi,
On 2023-01-02 14:28:20 +0100, David Geier wrote:
I also somewhat improved the accuracy of the cycles to milli- and
microseconds conversion functions by having two more multipliers with higher
precision. For microseconds we could also keep the computation integer-only.
I'm wondering what to best do for seconds and milliseconds. I'm currently
leaning towards just keeping it as is, because the durations measured and
converted are usually long enough that precision shouldn't be a problem.
I'm doubtful this is worth the complexity it incurs. By the time we convert
out of the instr_time format, the times shouldn't be small enough that the
accuracy is affected much.
Looking around, most of the existing uses of INSTR_TIME_GET_MICROSEC()
actually accumulate themselves, and should instead keep things in the
instr_time format and convert later. We'd win more accuracy / speed that way.
I don't think the introduction of pg_time_usec_t was a great idea, but oh
well.
Additionally, I initialized a few variables of type instr_time which
otherwise resulted in warnings due to use of potentially uninitialized
variables.
Unless we decide, as I suggested downthread, that we deprecate
INSTR_TIME_SET_ZERO(), that's unfortunately not the right fix. I've a similar
patch that adds all the necesarry INSTR_TIME_SET_ZERO() calls.
What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so that
it's consistent with the _MILLISEC() and _MICROSEC() variants?
The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
return double. This seems error prone. What about renaming the function or
also have the function return a double and cast where necessary at the call
site?
I think those should be a separate discussion / patch.
Greetings,
Andres Freund
Hi,
there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy.
As for the patch, I don't have much comments. I'm wondering if it'd be
useful to indicate which timing source was actually used for EXPLAIN
ANALYZE, say something like:
Planning time: 0.197 ms
Execution time: 0.225 ms
Timing source: clock_gettime (or tsc)
There has been a proposal to expose this as a GUC (or perhaps as explain
option), to allow users to pick what timing source to use. I wouldn't go
that far - AFAICS is this is meant to be universally better when
available. But knowing which source was used seems useful.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra <
tomas.vondra@enterprisedb.com> napsal:
Hi,
there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy.
As for the patch, I don't have much comments. I'm wondering if it'd be
useful to indicate which timing source was actually used for EXPLAIN
ANALYZE, say something like:Planning time: 0.197 ms
Execution time: 0.225 ms
Timing source: clock_gettime (or tsc)There has been a proposal to expose this as a GUC (or perhaps as explain
option), to allow users to pick what timing source to use. I wouldn't go
that far - AFAICS is this is meant to be universally better when
available. But knowing which source was used seems useful.
+1
Pavel
Show quoted text
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Jan 13, 2023 at 2:56 PM Andres Freund <andres@anarazel.de> wrote:
Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes),
Instrumentation (16 bytes saved in Instrumentation itself, 32 via
BufferUsage).
I read through 0001 and it seems basically fine to me. Comments:
1. pg_clock_gettime_ns() doesn't follow pgindent conventions.
2. I'm not entirely sure that the new .?S_PER_.?S macros are
worthwhile but maybe they are, and in any case I don't care very much.
3. I've always found 'struct timespec' to be pretty annoying
notationally, so I like the fact that this patch would reduce use of
it.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2023-01-17 08:46:12 -0500, Robert Haas wrote:
On Fri, Jan 13, 2023 at 2:56 PM Andres Freund <andres@anarazel.de> wrote:
Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds. We'd also save memory in BufferUsage (144-122 bytes),
Instrumentation (16 bytes saved in Instrumentation itself, 32 via
BufferUsage).
Here's an updated version of the move to representing instr_time as
nanoseconds. It's now split into a few patches:
0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to
warn
Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and
just allow to assign 0.
0002) Convert instr_time to uint64
This is the cleaned up version of the prior patch. The main change is
that it deduplicated a lot of the code between the architectures.
0003) Add INSTR_TIME_SET_SECOND()
This is used in 0004. Just allows setting an instr_time to a time in
seconds, allowing for a cheaper loop exit condition in 0004.
0004) report nanoseconds in pg_test_timing
I also couldn't help and hacked a bit on the rdtsc pieces. I did figure out
how to do the cycles->nanosecond conversion with integer shift and multiply in
the common case, which does show a noticable speedup. But that's for another
day.
I fought a bit with myself about whether to send those patches in this thread,
because it'll take over the CF entry. But decided that it's ok, given that
David's patches should be rebased over these anyway?
I read through 0001 and it seems basically fine to me. Comments:
1. pg_clock_gettime_ns() doesn't follow pgindent conventions.
Fixed.
2. I'm not entirely sure that the new .?S_PER_.?S macros are
worthwhile but maybe they are, and in any case I don't care very much.
There's now fewer. But those I'd like to keep. I just end up counting digits
manually way too many times.
3. I've always found 'struct timespec' to be pretty annoying
notationally, so I like the fact that this patch would reduce use of
it.
Same.
Greetings,
Andres Freund
Attachments:
v7-0001-Zero-initialize-instr_time-uses-causing-compiler-.patchtext/x-diff; charset=us-asciiDownload
From c1024a9dfa7f5645200b7fa68e8bce5561c9cee0 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 16 Jan 2023 10:04:42 -0800
Subject: [PATCH v7 1/4] Zero initialize instr_time uses causing compiler
warnings
These are all not necessary from a correctness POV. However, in a subsequent
patch instr_time will be simplified to an int64, at which point gcc would
otherwise start to warn about the changed places.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/20230116023639.rn36vf6ajqmfciua@awork3.anarazel.de
Backpatch:
---
src/backend/access/transam/xlog.c | 4 ++++
src/backend/storage/buffer/bufmgr.c | 4 ++++
src/backend/storage/file/buffile.c | 4 ++++
src/bin/psql/common.c | 6 ++++++
4 files changed, 18 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index 8f47fb75700..7d65b1d4159 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2191,6 +2191,8 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
/* Measure I/O timing to write WAL data */
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
+ else
+ INSTR_TIME_SET_ZERO(start);
pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
written = pg_pwrite(openLogFile, from, nleft, startoffset);
@@ -8150,6 +8152,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
/* Measure I/O timing to sync the WAL file */
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
+ else
+ INSTR_TIME_SET_ZERO(start);
pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
switch (sync_method)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8075828e8a6..800a4248c95 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1017,6 +1017,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
+ else
+ INSTR_TIME_SET_ZERO(io_start);
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
@@ -2902,6 +2904,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
+ else
+ INSTR_TIME_SET_ZERO(io_start);
/*
* bufToWrite is either the shared buffer or a copy, as appropriate.
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c5464b6aa62..0a51624df3b 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -446,6 +446,8 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
+ else
+ INSTR_TIME_SET_ZERO(io_start);
/*
* Read whatever we can get, up to a full bufferload.
@@ -525,6 +527,8 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
+ else
+ INSTR_TIME_SET_ZERO(io_start);
bytestowrite = FileWrite(thisfile,
file->buffer.data + wpos,
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 00627830c47..f907f5d4e8d 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -1276,6 +1276,8 @@ DescribeQuery(const char *query, double *elapsed_msec)
if (timing)
INSTR_TIME_SET_CURRENT(before);
+ else
+ INSTR_TIME_SET_ZERO(before);
/*
* To parse the query but not execute it, we prepare it, using the unnamed
@@ -1406,6 +1408,8 @@ ExecQueryAndProcessResults(const char *query,
if (timing)
INSTR_TIME_SET_CURRENT(before);
+ else
+ INSTR_TIME_SET_ZERO(before);
if (pset.bind_flag)
success = PQsendQueryParams(pset.db, query, pset.bind_nparams, NULL, (const char * const *) pset.bind_params, NULL, NULL, 0);
@@ -1702,6 +1706,8 @@ ExecQueryUsingCursor(const char *query, double *elapsed_msec)
if (timing)
INSTR_TIME_SET_CURRENT(before);
+ else
+ INSTR_TIME_SET_ZERO(before);
/* if we're not in a transaction, start one */
if (PQtransactionStatus(pset.db) == PQTRANS_IDLE)
--
2.38.0
v7-0002-Use-int64-to-represent-instr_time-on-all-platform.patchtext/x-diff; charset=us-asciiDownload
From d76727601cff3885988969080b4372e7449478d7 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 16 Jan 2023 20:38:12 -0800
Subject: [PATCH v7 2/4] Use int64 to represent instr_time on all platforms
Until now we used struct timespec on all platforms but windows. That causes a
fair bit of memory (struct timeval is 16 bytes) and runtime overhead (much
more complicated additions). Instead we can convert the time to nanoseconds in
INSTR_TIME_SET_CURRENT(), making the remaining operations cheaper.
Representing time as int64 nanoseconds provides sufficient range, ~292 years
relative to a starting point (depending on clock source, relative to the unix
epoch or the system's boot time). That'd not be sufficient for calendar time
stored on disk, but is plenty for runtime interval time measurement.
On windows instr_time already is represented as cycles. It might make sense to
represent time as cycles on other platforms as well, as using cycle
acquisition instructions like rdtsc directly can reduce the overhead of time
acquisition substantially. This could be done in a fairly localized manner as
the code stands after this commit.
Because the windows and non-windows paths are now more similar, use a common
set of macros. To make that possible, most of the use of LARGE_INTEGER had to
be removed, which looks nicer anyway.
Author: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Author: David Geier <geidav.pg@gmail.com>
Reviewed-by:
Discussion: https://postgr.es/m/20230113195547.k4nlrmawpijqwlsa@awork3.anarazel.de
---
src/include/portability/instr_time.h | 149 +++++++++++++--------------
1 file changed, 73 insertions(+), 76 deletions(-)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 9ea1a68bd94..185be8d5cad 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -34,6 +34,8 @@
*
* INSTR_TIME_GET_MICROSEC(t) convert t to uint64 (in microseconds)
*
+ * INSTR_TIME_GET_NANOSEC(t) convert t to uint64 (in nanoseconds)
+ *
* Note that INSTR_TIME_SUBTRACT and INSTR_TIME_ACCUM_DIFF convert
* absolute times to intervals. The INSTR_TIME_GET_xxx operations are
* only useful on intervals.
@@ -54,8 +56,26 @@
#ifndef INSTR_TIME_H
#define INSTR_TIME_H
+
+/*
+ * On all platforms we store time using a 64bit integer - they're cheap to
+ * add/subtract, the most common operations. The acquisition of time
+ * and converting to specific units of time is platform specific.
+ */
+
+typedef int64 instr_time;
+
+
+/* helpers macros used in platform specific code below */
+
+#define NS_PER_S INT64CONST(1000000000)
+#define NS_PER_MS INT64CONST(1000000)
+#define NS_PER_US INT64CONST(1000)
+
+
#ifndef WIN32
+
/* Use clock_gettime() */
#include <time.h>
@@ -80,93 +100,39 @@
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-typedef struct timespec instr_time;
+/* helper for INSTR_TIME_SET_CURRENT */
+static inline instr_time
+pg_clock_gettime_ns(void)
+{
+ struct timespec tmp;
-#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0)
+ clock_gettime(PG_INSTR_CLOCK, &tmp);
-#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0)
+ return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
+}
-#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_CURRENT(t) \
+ ((t) = pg_clock_gettime_ns())
-#define INSTR_TIME_ADD(x,y) \
- do { \
- (x).tv_sec += (y).tv_sec; \
- (x).tv_nsec += (y).tv_nsec; \
- /* Normalize */ \
- while ((x).tv_nsec >= 1000000000) \
- { \
- (x).tv_nsec -= 1000000000; \
- (x).tv_sec++; \
- } \
- } while (0)
+#define INSTR_TIME_GET_NANOSEC(t) \
+ ((uint64) (t))
-#define INSTR_TIME_SUBTRACT(x,y) \
- do { \
- (x).tv_sec -= (y).tv_sec; \
- (x).tv_nsec -= (y).tv_nsec; \
- /* Normalize */ \
- while ((x).tv_nsec < 0) \
- { \
- (x).tv_nsec += 1000000000; \
- (x).tv_sec--; \
- } \
- } while (0)
-
-#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
- do { \
- (x).tv_sec += (y).tv_sec - (z).tv_sec; \
- (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
- /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
- while ((x).tv_nsec < 0) \
- { \
- (x).tv_nsec += 1000000000; \
- (x).tv_sec--; \
- } \
- while ((x).tv_nsec >= 1000000000) \
- { \
- (x).tv_nsec -= 1000000000; \
- (x).tv_sec++; \
- } \
- } while (0)
-
-#define INSTR_TIME_GET_DOUBLE(t) \
- (((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
-
-#define INSTR_TIME_GET_MILLISEC(t) \
- (((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
-
-#define INSTR_TIME_GET_MICROSEC(t) \
- (((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
#else /* WIN32 */
+
/* Use QueryPerformanceCounter() */
-typedef LARGE_INTEGER instr_time;
+/* helper for INSTR_TIME_SET_CURRENT */
+static inline instr_time
+pg_query_performance_counter(void)
+{
+ LARGE_INTEGER t;
-#define INSTR_TIME_IS_ZERO(t) ((t).QuadPart == 0)
+ QueryPerformanceCounter(&t);
-#define INSTR_TIME_SET_ZERO(t) ((t).QuadPart = 0)
-
-#define INSTR_TIME_SET_CURRENT(t) QueryPerformanceCounter(&(t))
-
-#define INSTR_TIME_ADD(x,y) \
- ((x).QuadPart += (y).QuadPart)
-
-#define INSTR_TIME_SUBTRACT(x,y) \
- ((x).QuadPart -= (y).QuadPart)
-
-#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
- ((x).QuadPart += (y).QuadPart - (z).QuadPart)
-
-#define INSTR_TIME_GET_DOUBLE(t) \
- (((double) (t).QuadPart) / GetTimerFrequency())
-
-#define INSTR_TIME_GET_MILLISEC(t) \
- (((double) (t).QuadPart * 1000.0) / GetTimerFrequency())
-
-#define INSTR_TIME_GET_MICROSEC(t) \
- ((uint64) (((double) (t).QuadPart * 1000000.0) / GetTimerFrequency()))
+ return t.QuadPart;
+}
static inline double
GetTimerFrequency(void)
@@ -177,11 +143,42 @@ GetTimerFrequency(void)
return (double) f.QuadPart;
}
+#define INSTR_TIME_SET_CURRENT(t) \
+ ((t) = pg_query_performance_counter())
+
+#define INSTR_TIME_GET_NANOSEC(t) \
+ ((uint64) (((t) * NS_PER_S) / GetTimerFrequency()))
+
#endif /* WIN32 */
-/* same macro on all platforms */
+
+/*
+ * Common macros
+ */
+
+#define INSTR_TIME_IS_ZERO(t) ((t) == 0)
+
+#define INSTR_TIME_SET_ZERO(t) ((t) = 0)
#define INSTR_TIME_SET_CURRENT_LAZY(t) \
(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
+#define INSTR_TIME_GET_DOUBLE(t) \
+ ((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
+
+#define INSTR_TIME_GET_MILLISEC(t) \
+ ((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_MS)
+
+#define INSTR_TIME_GET_MICROSEC(t) \
+ (INSTR_TIME_GET_NANOSEC(t) / NS_PER_US)
+
+#define INSTR_TIME_ADD(x,y) \
+ ((x) += (y))
+
+#define INSTR_TIME_SUBTRACT(x,y) \
+ ((x) -= (y))
+
+#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
+ ((x) += (y) - (z))
+
#endif /* INSTR_TIME_H */
--
2.38.0
v7-0003-instr_time-Add-INSTR_TIME_SET_SECOND.patchtext/x-diff; charset=us-asciiDownload
From 3ed50650be12bb4da1baab3936e5e4f6346ef5b3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 16 Jan 2023 11:06:36 -0800
Subject: [PATCH v7 3/4] instr_time: Add INSTR_TIME_SET_SECOND()
Useful to calculate the end of a time-bound loop without having to convert
into time units (which is costly).
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/portability/instr_time.h | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 185be8d5cad..91492c62a39 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -22,6 +22,8 @@
* INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
* evaluates to whether t changed
*
+ * INSTR_TIME_SET_SECOND(t, s) set t to s seconds
+ *
* INSTR_TIME_ADD(x, y) x += y
*
* INSTR_TIME_SUBTRACT(x, y) x -= y
@@ -114,6 +116,9 @@ pg_clock_gettime_ns(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_clock_gettime_ns())
+#define INSTR_TIME_SET_SECONDS(t, s) \
+ ((t) = NS_PER_S * (s))
+
#define INSTR_TIME_GET_NANOSEC(t) \
((uint64) (t))
@@ -146,6 +151,9 @@ GetTimerFrequency(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
+#define INSTR_TIME_SET_SECONDS(t, s) \
+ ((t) = s * GetTimerFrequency())
+
#define INSTR_TIME_GET_NANOSEC(t) \
((uint64) (((t) * NS_PER_S) / GetTimerFrequency()))
--
2.38.0
v7-0004-wip-report-nanoseconds-in-pg_test_timing.patchtext/x-diff; charset=us-asciiDownload
From 8d320cb8d0a6e35280134fa927c0297e407bf444 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 16 Jan 2023 11:19:11 -0800
Subject: [PATCH v7 4/4] wip: report nanoseconds in pg_test_timing
This commit also updates pg_test_timing's documentation:
- compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of
of statement with/without EXPLAIN ANALYZE
- explain the 2x overhead (due to two timestamp acquisitions per row)
- remove old section about old versions of linux - I couldn't update the
numbers, and it's old enough nobody would care
---
src/bin/pg_test_timing/pg_test_timing.c | 72 ++++++++------
doc/src/sgml/ref/pgtesttiming.sgml | 122 +++++++++++-------------
2 files changed, 99 insertions(+), 95 deletions(-)
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index c29d6f87629..bae10ce6fa4 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -19,8 +19,8 @@ static void handle_args(int argc, char *argv[]);
static uint64 test_timing(unsigned int duration);
static void output(uint64 loop_count);
-/* record duration in powers of 2 microseconds */
-long long int histogram[32];
+/* record duration in powers of 2 nanoseconds */
+uint64 histogram[64];
int
main(int argc, char *argv[])
@@ -122,34 +122,47 @@ static uint64
test_timing(unsigned int duration)
{
uint64 total_time;
- int64 time_elapsed = 0;
uint64 loop_count = 0;
- uint64 prev,
- cur;
+ instr_time cur;
instr_time start_time,
- end_time,
- temp;
-
- total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;
+ end_time;
+ instr_time until_time;
INSTR_TIME_SET_CURRENT(start_time);
- cur = INSTR_TIME_GET_MICROSEC(start_time);
- while (time_elapsed < total_time)
+ /*
+ * To reduce loop overhead, check loop condition in instr_time domain.
+ */
+ INSTR_TIME_SET_SECONDS(total_time, duration);
+ until_time = start_time;
+ INSTR_TIME_ADD(until_time, total_time);
+
+ cur = start_time;
+
+ while (cur < until_time)
{
- int32 diff,
- bits = 0;
+ instr_time temp;
+ instr_time prev;
+ int64 diff;
+ int32 bits = 0;
prev = cur;
- INSTR_TIME_SET_CURRENT(temp);
- cur = INSTR_TIME_GET_MICROSEC(temp);
- diff = cur - prev;
+ INSTR_TIME_SET_CURRENT(cur);
+ temp = cur;
+ INSTR_TIME_SUBTRACT(temp, prev);
+ diff = INSTR_TIME_GET_NANOSEC(temp);
/* Did time go backwards? */
- if (diff < 0)
+ if (unlikely(diff <= 0))
{
+ /* can't do anything with that measurement */
+ if (diff == 0)
+ {
+ loop_count++;
+ continue;
+ }
fprintf(stderr, _("Detected clock going backwards in time.\n"));
- fprintf(stderr, _("Time warp: %d ms\n"), diff);
+ fprintf(stderr, _("Time warp: %lld ns\n"), (long long) diff);
exit(1);
}
@@ -164,8 +177,6 @@ test_timing(unsigned int duration)
histogram[bits]++;
loop_count++;
- INSTR_TIME_SUBTRACT(temp, start_time);
- time_elapsed = INSTR_TIME_GET_MICROSEC(temp);
}
INSTR_TIME_SET_CURRENT(end_time);
@@ -173,7 +184,7 @@ test_timing(unsigned int duration)
INSTR_TIME_SUBTRACT(end_time, start_time);
printf(_("Per loop time including overhead: %0.2f ns\n"),
- INSTR_TIME_GET_DOUBLE(end_time) * 1e9 / loop_count);
+ (INSTR_TIME_GET_DOUBLE(end_time) * NS_PER_S) / loop_count);
return loop_count;
}
@@ -181,9 +192,10 @@ test_timing(unsigned int duration)
static void
output(uint64 loop_count)
{
- int64 max_bit = 31,
+ int64 low_bit = 0,
+ max_bit = 63,
i;
- char *header1 = _("< us");
+ char *header1 = _("< ns");
char *header2 = /* xgettext:no-c-format */ _("% of total");
char *header3 = _("count");
int len1 = strlen(header1);
@@ -194,15 +206,19 @@ output(uint64 loop_count)
while (max_bit > 0 && histogram[max_bit] == 0)
max_bit--;
+ /* find lowest bit value */
+ while (low_bit < max_bit && histogram[low_bit] == 0)
+ low_bit++;
+
printf(_("Histogram of timing durations:\n"));
printf("%*s %*s %*s\n",
- Max(6, len1), header1,
+ Max(9, len1), header1,
Max(10, len2), header2,
Max(10, len3), header3);
- for (i = 0; i <= max_bit; i++)
- printf("%*ld %*.5f %*lld\n",
- Max(6, len1), 1l << i,
+ for (i = low_bit; i <= max_bit; i++)
+ printf("%*ld %*.5f %*llu\n",
+ Max(9, len1), 1l << i,
Max(10, len2) - 1, (double) histogram[i] * 100 / loop_count,
- Max(10, len3), histogram[i]);
+ Max(10, len3), (long long unsigned) histogram[i]);
}
diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml
index a5eb3aa25e0..82c625e3a2a 100644
--- a/doc/src/sgml/ref/pgtesttiming.sgml
+++ b/doc/src/sgml/ref/pgtesttiming.sgml
@@ -93,28 +93,34 @@ PostgreSQL documentation
<para>
Good results will show most (>90%) individual timing calls take less than
- one microsecond. Average per loop overhead will be even lower, below 100
- nanoseconds. This example from an Intel i7-860 system using a TSC clock
- source shows excellent performance:
+ one microsecond (1000 nanoseconds). Average per loop overhead will be even
+ lower, below 100 nanoseconds. This example from an Intel i9-9880H system
+ using a TSC clock source shows excellent performance:
<screen><![CDATA[
Testing timing overhead for 3 seconds.
-Per loop time including overhead: 35.96 ns
+Per loop time including overhead: 13.74 ns
Histogram of timing durations:
- < us % of total count
- 1 96.40465 80435604
- 2 3.59518 2999652
- 4 0.00015 126
- 8 0.00002 13
- 16 0.00000 2
+ < ns % of total count
+ 16 97.00221 211857215
+ 32 2.99555 6542412
+ 64 0.00115 2505
+ 128 0.00035 759
+ 256 0.00004 78
+ 512 0.00000 3
+ 1024 0.00000 4
+ 2048 0.00034 732
+ 4096 0.00000 6
+ 8192 0.00000 8
+ 16384 0.00019 409
+ 32768 0.00018 403
+ 65536 0.00000 1
]]></screen>
</para>
<para>
- Note that different units are used for the per loop time than the
- histogram. The loop can have resolution within a few nanoseconds (ns),
- while the individual timing calls can only resolve down to one microsecond
- (us).
+ Note that the accuracy of the histogram entries may be lower than the
+ per loop time.
</para>
</refsect2>
@@ -125,24 +131,30 @@ Histogram of timing durations:
When the query executor is running a statement using
<command>EXPLAIN ANALYZE</command>, individual operations are timed as well
as showing a summary. The overhead of your system can be checked by
- counting rows with the <application>psql</application> program:
+ disabling the per-row timing, using the <literal>TIMING OFF</literal>
+ option:
<screen>
-CREATE TABLE t AS SELECT * FROM generate_series(1,100000);
-\timing
-SELECT COUNT(*) FROM t;
-EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
+CREATE TABLE t AS SELECT * FROM generate_series(1, 100000);
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t;
+EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM t;
</screen>
</para>
<para>
- The i7-860 system measured runs the count query in 9.8 ms while
- the <command>EXPLAIN ANALYZE</command> version takes 16.6 ms, each
- processing just over 100,000 rows. That 6.8 ms difference means the timing
- overhead per row is 68 ns, about twice what pg_test_timing estimated it
- would be. Even that relatively small amount of overhead is making the fully
- timed count statement take almost 70% longer. On more substantial queries,
- the timing overhead would be less problematic.
+ The i9-9880H system measured shows an execution time of 4.116 ms for the
+ <literal>TIMING OFF</literal> query, and 6.965 ms for the
+ <literal>TIMING ON</literal>, each processing 100,000 rows.
+
+ That 2.849 ms difference means the timing overhead per row is 28 ns. As
+ <literal>TIMING ON</literal> measures timestamps twice per row returned by
+ an executor node, the overhead is very close to what pg_test_timing
+ estimated it would be.
+
+ more than what pg_test_timing estimated it would be. Even that relatively
+ small amount of overhead is making the fully timed count statement take
+ about 60% longer. On more substantial queries, the timing overhead would
+ be less problematic.
</para>
</refsect2>
@@ -157,28 +169,31 @@ EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
<screen><![CDATA[
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-tsc hpet acpi_pm
+tsc acpi_pm
# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
# pg_test_timing
-Per loop time including overhead: 722.92 ns
+Testing timing overhead for 3 seconds.
+Per loop time including overhead: 708.58 ns
Histogram of timing durations:
- < us % of total count
- 1 27.84870 1155682
- 2 72.05956 2990371
- 4 0.07810 3241
- 8 0.01357 563
- 16 0.00007 3
+ < ns % of total count
+ 1024 99.79796 4225270
+ 2048 0.15560 6588
+ 4096 0.00035 15
+ 8192 0.01738 736
+ 16384 0.01679 711
+ 32768 0.01190 504
]]></screen>
</para>
<para>
In this configuration, the sample <command>EXPLAIN ANALYZE</command> above
- takes 115.9 ms. That's 1061 ns of timing overhead, again a small multiple
- of what's measured directly by this utility. That much timing overhead
- means the actual query itself is only taking a tiny fraction of the
- accounted for time, most of it is being consumed in overhead instead. In
- this configuration, any <command>EXPLAIN ANALYZE</command> totals involving
- many timed operations would be inflated significantly by timing overhead.
+ shows an execution time of 148.7 ms. That's 1392 ns of timing overhead,
+ again a small multiple of what's measured directly by this utility. That
+ much timing overhead means the actual query itself is only taking a tiny
+ fraction of the accounted for time, most of it is being consumed in
+ overhead instead. In this configuration, any <command>EXPLAIN
+ ANALYZE</command> totals involving many timed operations would be inflated
+ significantly by timing overhead.
</para>
<para>
@@ -196,33 +211,6 @@ kern.timecounter.hardware: ACPI-fast -> TSC
</screen>
</para>
- <para>
- Other systems may only allow setting the time source on boot. On older
- Linux systems the "clock" kernel setting is the only way to make this sort
- of change. And even on some more recent ones, the only option you'll see
- for a clock source is "jiffies". Jiffies are the older Linux software clock
- implementation, which can have good resolution when it's backed by fast
- enough timing hardware, as in this example:
-
-<screen><![CDATA[
-$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-jiffies
-$ dmesg | grep time.c
-time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.
-time.c: Detected 2400.153 MHz processor.
-$ pg_test_timing
-Testing timing overhead for 3 seconds.
-Per timing duration including loop overhead: 97.75 ns
-Histogram of timing durations:
- < us % of total count
- 1 90.23734 27694571
- 2 9.75277 2993204
- 4 0.00981 3010
- 8 0.00007 22
- 16 0.00000 1
- 32 0.00000 1
-]]></screen></para>
-
</refsect2>
<refsect2>
--
2.38.0
Andres Freund <andres@anarazel.de> writes:
Here's an updated version of the move to representing instr_time as
nanoseconds. It's now split into a few patches:
I took a quick look through this.
0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to
warn
Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and
just allow to assign 0.
I think it's probably wise to keep the macro. If we ever rethink this
again, we'll be glad we kept it. Similarly, IS_ZERO is a good idea
even if it would work with just compare-to-zero. I'm almost tempted
to suggest you define instr_time as a struct with a uint64 field,
just to help keep us honest about that.
0003) Add INSTR_TIME_SET_SECOND()
This is used in 0004. Just allows setting an instr_time to a time in
seconds, allowing for a cheaper loop exit condition in 0004.
Code and comments are inconsistent about whether it's SET_SECOND or
SET_SECONDS. I think I prefer the latter, but don't care that much.
0004) report nanoseconds in pg_test_timing
Didn't examine 0004 in any detail, but the others look good to go
other than these nits.
regards, tom lane
Hi,
On 2023-01-17 12:26:57 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Here's an updated version of the move to representing instr_time as
nanoseconds. It's now split into a few patches:I took a quick look through this.
Thanks!
0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to
warn
Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and
just allow to assign 0.I think it's probably wise to keep the macro. If we ever rethink this
again, we'll be glad we kept it. Similarly, IS_ZERO is a good idea
even if it would work with just compare-to-zero.
Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions
could give us the best of both worlds?
I'm almost tempted to suggest you define instr_time as a struct with a
uint64 field, just to help keep us honest about that.
I can see that making sense. Unless somebody pipes up with opposition to that
plan soon, I'll see how it goes.
0003) Add INSTR_TIME_SET_SECOND()
This is used in 0004. Just allows setting an instr_time to a time in
seconds, allowing for a cheaper loop exit condition in 0004.Code and comments are inconsistent about whether it's SET_SECOND or
SET_SECONDS. I think I prefer the latter, but don't care that much.
That's probably because I couldn't decide... So I'll go with your preference.
0004) report nanoseconds in pg_test_timing
Didn't examine 0004 in any detail, but the others look good to go
other than these nits.
Thanks for looking!
Greetings,
Andres Freund
On 1/16/23 21:39, Pavel Stehule wrote:
po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra
<tomas.vondra@enterprisedb.com> napsal:Hi,
there's minor bitrot in the Mkvcbuild.pm change, making cfbot unhappy.
As for the patch, I don't have much comments. I'm wondering if it'd be
useful to indicate which timing source was actually used for EXPLAIN
ANALYZE, say something like:Planning time: 0.197 ms
Execution time: 0.225 ms
Timing source: clock_gettime (or tsc)There has been a proposal to expose this as a GUC (or perhaps as
explain
option), to allow users to pick what timing source to use. I
wouldn't go
that far - AFAICS is this is meant to be universally better when
available. But knowing which source was used seems useful.+1
Thanks for looking at the patch.
I'll fix the merge conflict.
I like the idea of exposing the timing source in the EXPLAIN ANALYZE output.
It's a good tradeoff between inspectability and effort, given that RDTSC
should always be better to use.
If there are no objections I go this way.
--
David Geier
(ServiceNow)
On 1/16/23 18:37, Andres Freund wrote:
Hi,
On 2023-01-02 14:28:20 +0100, David Geier wrote:
I'm doubtful this is worth the complexity it incurs. By the time we convert
out of the instr_time format, the times shouldn't be small enough that the
accuracy is affected much.
I don't feel strong about it and you have a point that we most likely
only convert ones we've accumulated a fair amount of cycles.
Looking around, most of the existing uses of INSTR_TIME_GET_MICROSEC()
actually accumulate themselves, and should instead keep things in the
instr_time format and convert later. We'd win more accuracy / speed that way.I don't think the introduction of pg_time_usec_t was a great idea, but oh
well.
Fully agreed. Why not replacing pg_time_usec_t with instr_time in a
separate patch? I haven't looked carefully enough if all occurrences
could sanely replaced but at least the ones that accumulate time seem
good starting points.
Additionally, I initialized a few variables of type instr_time which
otherwise resulted in warnings due to use of potentially uninitialized
variables.Unless we decide, as I suggested downthread, that we deprecate
INSTR_TIME_SET_ZERO(), that's unfortunately not the right fix. I've a similar
patch that adds all the necesarry INSTR_TIME_SET_ZERO() calls.
I don't feel strong about it, but like Tom tend towards keeping the
initialization macro.
Thanks that you have improved on the first patch and fixed these issues
in a better way.
What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so that
it's consistent with the _MILLISEC() and _MICROSEC() variants?
The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
return double. This seems error prone. What about renaming the function or
also have the function return a double and cast where necessary at the call
site?I think those should be a separate discussion / patch.
OK. I'll propose follow-on patches ones we're done with the ones at hand.
I'll then rebase the RDTSC patches on your patch set.
--
David Geier
(ServiceNow)
Hi,
@Andres: will you take care of these changes and provide me with an
updated patch set so I can rebase the RDTSC changes?
Otherwise, I can also apply Tom suggestions to your patch set and send
out the complete patch set.
--
David Geier
(ServiceNow)
Hi Andres,
I also couldn't help and hacked a bit on the rdtsc pieces. I did figure out
how to do the cycles->nanosecond conversion with integer shift and multiply in
the common case, which does show a noticable speedup. But that's for another
day.
I also have code for that here. I decided against integrating it because
we don't convert frequently enough to make it matter. Or am I missing
something?
I fought a bit with myself about whether to send those patches in this thread,
because it'll take over the CF entry. But decided that it's ok, given that
David's patches should be rebased over these anyway?
That's alright.
Though, I would hope we attempt to bring your patch set as well as the
RDTSC patch set in.
--
David Geier
(ServiceNow)
On 1/18/23 13:52, David Geier wrote:
On 1/16/23 21:39, Pavel Stehule wrote:
po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra
<tomas.vondra@enterprisedb.com> napsal:Hi,
there's minor bitrot in the Mkvcbuild.pm change, making cfbot
unhappy.As for the patch, I don't have much comments. I'm wondering if
it'd be
useful to indicate which timing source was actually used for EXPLAIN
ANALYZE, say something like:Planning time: 0.197 ms
Execution time: 0.225 ms
Timing source: clock_gettime (or tsc)+1
I like the idea of exposing the timing source in the EXPLAIN ANALYZE
output.
It's a good tradeoff between inspectability and effort, given that
RDTSC should always be better to use.
If there are no objections I go this way.
Thinking about this a little more made me realize that this will cause
different pg_regress output depending on the platform. So if we go this
route we would at least need an option for EXPLAIN ANALYZE to disable
it. Or rather have it disabled by default and allow for enabling it.
Thoughts?
--
David Geier
(ServiceNow)
On 1/20/23 07:43, David Geier wrote:
On 1/18/23 13:52, David Geier wrote:
On 1/16/23 21:39, Pavel Stehule wrote:
po 16. 1. 2023 v 21:34 odesílatel Tomas Vondra
<tomas.vondra@enterprisedb.com> napsal:Hi,
there's minor bitrot in the Mkvcbuild.pm change, making cfbot
unhappy.As for the patch, I don't have much comments. I'm wondering if
it'd be
useful to indicate which timing source was actually used for EXPLAIN
ANALYZE, say something like:Planning time: 0.197 ms
Execution time: 0.225 ms
Timing source: clock_gettime (or tsc)+1
I like the idea of exposing the timing source in the EXPLAIN ANALYZE
output.
It's a good tradeoff between inspectability and effort, given that
RDTSC should always be better to use.
If there are no objections I go this way.Thinking about this a little more made me realize that this will cause
different pg_regress output depending on the platform. So if we go this
route we would at least need an option for EXPLAIN ANALYZE to disable
it. Or rather have it disabled by default and allow for enabling it.
Thoughts?
What about only showing it for VERBOSE mode? I don't think there are
very many tests doing EXPLAIN (ANALYZE, VERBOSE) - a quick grep found
one such place in partition_prune.sql.
regards
--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,
On 2023-01-17 10:50:53 -0800, Andres Freund wrote:
On 2023-01-17 12:26:57 -0500, Tom Lane wrote:
0001) Add INSTR_TIME_SET_ZERO() calls where otherwise 0002 causes gcc to
warn
Alternatively we can decide to deprecate INSTR_TIME_SET_ZERO() and
just allow to assign 0.I think it's probably wise to keep the macro. If we ever rethink this
again, we'll be glad we kept it. Similarly, IS_ZERO is a good idea
even if it would work with just compare-to-zero.Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions
could give us the best of both worlds?
I tried that in the attached 0005. I found that it reads better if I also add
INSTR_TIME_CURRENT(). If we decide to go for this, I'd roll it into 0001
instead, but I wanted to get agreement on it first.
Comments?
I'm almost tempted to suggest you define instr_time as a struct with a
uint64 field, just to help keep us honest about that.I can see that making sense. Unless somebody pipes up with opposition to that
plan soon, I'll see how it goes.
Done in the attached. I think it looks good. Actually found a type confusion
buglet in 0004, so the type safety benefit is noticable.
It does require a new INSTR_TIME_IS_LT() for the loop exit condition in 0004,
but that seems fine.
Besides cosmetic stuff I also added back the cast to double in window's
INSTR_TIME_GET_NANOSEC() - I think there's an overflow danger without it.
We should make this faster by pre-computing
(double) NS_PER_S / GetTimerFrequency()
once, as that'd avoid doing the the slow division on every conversion. But
that's an old issue and thus better tackled separately.
Greetings,
Andres Freund
Attachments:
v8-0001-Zero-initialize-instr_time-uses-causing-compiler-.patchtext/x-diff; charset=us-asciiDownload
From dfea7f159b79aaf1fb6f4c8617614f6fe5c1176f Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 16 Jan 2023 10:04:42 -0800
Subject: [PATCH v8 1/5] Zero initialize instr_time uses causing compiler
warnings
These are all not necessary from a correctness POV. However, in a subsequent
patch instr_time will be simplified to an int64, at which point gcc would
otherwise start to warn about the changed places.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/20230116023639.rn36vf6ajqmfciua@awork3.anarazel.de
Backpatch:
---
src/backend/access/transam/xlog.c | 4 ++++
src/backend/storage/buffer/bufmgr.c | 4 ++++
src/backend/storage/file/buffile.c | 4 ++++
src/backend/storage/ipc/latch.c | 2 ++
src/bin/psql/common.c | 6 ++++++
5 files changed, 20 insertions(+)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index cc0d9a05d9f..fb4c860bdea 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2191,6 +2191,8 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
/* Measure I/O timing to write WAL data */
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
+ else
+ INSTR_TIME_SET_ZERO(start);
pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
written = pg_pwrite(openLogFile, from, nleft, startoffset);
@@ -8151,6 +8153,8 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
/* Measure I/O timing to sync the WAL file */
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
+ else
+ INSTR_TIME_SET_ZERO(start);
pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
switch (sync_method)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 8075828e8a6..800a4248c95 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1017,6 +1017,8 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
+ else
+ INSTR_TIME_SET_ZERO(io_start);
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
@@ -2902,6 +2904,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
+ else
+ INSTR_TIME_SET_ZERO(io_start);
/*
* bufToWrite is either the shared buffer or a copy, as appropriate.
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index c5464b6aa62..0a51624df3b 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -446,6 +446,8 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
+ else
+ INSTR_TIME_SET_ZERO(io_start);
/*
* Read whatever we can get, up to a full bufferload.
@@ -525,6 +527,8 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
+ else
+ INSTR_TIME_SET_ZERO(io_start);
bytestowrite = FileWrite(thisfile,
file->buffer.data + wpos,
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index d79d71a8515..f4123e7de7e 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1401,6 +1401,8 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
Assert(timeout >= 0 && timeout <= INT_MAX);
cur_timeout = timeout;
}
+ else
+ INSTR_TIME_SET_ZERO(start_time);
pgstat_report_wait_start(wait_event_info);
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index 00627830c47..f907f5d4e8d 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -1276,6 +1276,8 @@ DescribeQuery(const char *query, double *elapsed_msec)
if (timing)
INSTR_TIME_SET_CURRENT(before);
+ else
+ INSTR_TIME_SET_ZERO(before);
/*
* To parse the query but not execute it, we prepare it, using the unnamed
@@ -1406,6 +1408,8 @@ ExecQueryAndProcessResults(const char *query,
if (timing)
INSTR_TIME_SET_CURRENT(before);
+ else
+ INSTR_TIME_SET_ZERO(before);
if (pset.bind_flag)
success = PQsendQueryParams(pset.db, query, pset.bind_nparams, NULL, (const char * const *) pset.bind_params, NULL, NULL, 0);
@@ -1702,6 +1706,8 @@ ExecQueryUsingCursor(const char *query, double *elapsed_msec)
if (timing)
INSTR_TIME_SET_CURRENT(before);
+ else
+ INSTR_TIME_SET_ZERO(before);
/* if we're not in a transaction, start one */
if (PQtransactionStatus(pset.db) == PQTRANS_IDLE)
--
2.38.0
v8-0002-instr_time-Represent-time-as-an-int64-on-all-plat.patchtext/x-diff; charset=us-asciiDownload
From 8c6dc714d2854c279f73399ac11fafa2182d0012 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 16 Jan 2023 20:38:12 -0800
Subject: [PATCH v8 2/5] instr_time: Represent time as an int64 on all
platforms
Until now we used struct timespec on all platforms but windows. Using struct
timespe causes a fair bit of memory (struct timeval is 16 bytes) and runtime
overhead (much more complicated additions). Instead we can convert the time to
nanoseconds in INSTR_TIME_SET_CURRENT(), making the remaining operations
cheaper.
Representing time as int64 nanoseconds provides sufficient range, ~292 years
relative to a starting point (depending on clock source, relative to the unix
epoch or the system's boot time). That'd not be sufficient for calendar time
stored on disk, but is plenty for runtime interval time measurement.
On windows instr_time already is represented as cycles. It might make sense to
represent time as cycles on other platforms as well, as using cycle
acquisition instructions like rdtsc directly can reduce the overhead of time
acquisition substantially. This could be done in a fairly localized manner as
the code stands after this commit.
Because the windows and non-windows paths are now more similar, use a common
set of macros. To make that possible, most of the use of LARGE_INTEGER had to
be removed, which looks nicer anyway.
To avoid users of the API relying on the integer representation, we wrap the
64bit integer inside struct struct instr_time.
Author: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Author: David Geier <geidav.pg@gmail.com>
Reviewed-by: Tom Lane <tgl@sss.pgh.pa.us>
Discussion: https://postgr.es/m/20230113195547.k4nlrmawpijqwlsa@awork3.anarazel.de
---
src/include/portability/instr_time.h | 162 ++++++++++++++-------------
1 file changed, 86 insertions(+), 76 deletions(-)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 9ea1a68bd94..c0ed491395d 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -34,6 +34,8 @@
*
* INSTR_TIME_GET_MICROSEC(t) convert t to uint64 (in microseconds)
*
+ * INSTR_TIME_GET_NANOSEC(t) convert t to uint64 (in nanoseconds)
+ *
* Note that INSTR_TIME_SUBTRACT and INSTR_TIME_ACCUM_DIFF convert
* absolute times to intervals. The INSTR_TIME_GET_xxx operations are
* only useful on intervals.
@@ -54,8 +56,32 @@
#ifndef INSTR_TIME_H
#define INSTR_TIME_H
+
+/*
+ * We store interval times as an int64 integer on all platforms, as int64 is
+ * cheap to add/subtract, the most common operation for instr_time. The
+ * acquisition of time and converting to specific units of time is platform
+ * specific.
+ *
+ * To avoid users of the API relying on the integer representation, we wrap
+ * the 64bit integer in a struct.
+ */
+typedef struct instr_time
+{
+ int64 ticks; /* in platforms specific unit */
+} instr_time;
+
+
+/* helpers macros used in platform specific code below */
+
+#define NS_PER_S INT64CONST(1000000000)
+#define NS_PER_MS INT64CONST(1000000)
+#define NS_PER_US INT64CONST(1000)
+
+
#ifndef WIN32
+
/* Use clock_gettime() */
#include <time.h>
@@ -80,93 +106,43 @@
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-typedef struct timespec instr_time;
+/* helper for INSTR_TIME_SET_CURRENT */
+static inline instr_time
+pg_clock_gettime_ns(void)
+{
+ instr_time now;
+ struct timespec tmp;
-#define INSTR_TIME_IS_ZERO(t) ((t).tv_nsec == 0 && (t).tv_sec == 0)
+ clock_gettime(PG_INSTR_CLOCK, &tmp);
+ now.ticks = tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
-#define INSTR_TIME_SET_ZERO(t) ((t).tv_sec = 0, (t).tv_nsec = 0)
+ return now;
+}
-#define INSTR_TIME_SET_CURRENT(t) ((void) clock_gettime(PG_INSTR_CLOCK, &(t)))
+#define INSTR_TIME_SET_CURRENT(t) \
+ ((t) = pg_clock_gettime_ns())
-#define INSTR_TIME_ADD(x,y) \
- do { \
- (x).tv_sec += (y).tv_sec; \
- (x).tv_nsec += (y).tv_nsec; \
- /* Normalize */ \
- while ((x).tv_nsec >= 1000000000) \
- { \
- (x).tv_nsec -= 1000000000; \
- (x).tv_sec++; \
- } \
- } while (0)
+#define INSTR_TIME_GET_NANOSEC(t) \
+ ((int64) (t).ticks)
-#define INSTR_TIME_SUBTRACT(x,y) \
- do { \
- (x).tv_sec -= (y).tv_sec; \
- (x).tv_nsec -= (y).tv_nsec; \
- /* Normalize */ \
- while ((x).tv_nsec < 0) \
- { \
- (x).tv_nsec += 1000000000; \
- (x).tv_sec--; \
- } \
- } while (0)
-
-#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
- do { \
- (x).tv_sec += (y).tv_sec - (z).tv_sec; \
- (x).tv_nsec += (y).tv_nsec - (z).tv_nsec; \
- /* Normalize after each add to avoid overflow/underflow of tv_nsec */ \
- while ((x).tv_nsec < 0) \
- { \
- (x).tv_nsec += 1000000000; \
- (x).tv_sec--; \
- } \
- while ((x).tv_nsec >= 1000000000) \
- { \
- (x).tv_nsec -= 1000000000; \
- (x).tv_sec++; \
- } \
- } while (0)
-
-#define INSTR_TIME_GET_DOUBLE(t) \
- (((double) (t).tv_sec) + ((double) (t).tv_nsec) / 1000000000.0)
-
-#define INSTR_TIME_GET_MILLISEC(t) \
- (((double) (t).tv_sec * 1000.0) + ((double) (t).tv_nsec) / 1000000.0)
-
-#define INSTR_TIME_GET_MICROSEC(t) \
- (((uint64) (t).tv_sec * (uint64) 1000000) + (uint64) ((t).tv_nsec / 1000))
#else /* WIN32 */
+
/* Use QueryPerformanceCounter() */
-typedef LARGE_INTEGER instr_time;
+/* helper for INSTR_TIME_SET_CURRENT */
+static inline instr_time
+pg_query_performance_counter(void)
+{
+ instr_time now;
+ LARGE_INTEGER tmp;
-#define INSTR_TIME_IS_ZERO(t) ((t).QuadPart == 0)
+ QueryPerformanceCounter(&tmp);
+ now.ticks = tmp.QuadPart;
-#define INSTR_TIME_SET_ZERO(t) ((t).QuadPart = 0)
-
-#define INSTR_TIME_SET_CURRENT(t) QueryPerformanceCounter(&(t))
-
-#define INSTR_TIME_ADD(x,y) \
- ((x).QuadPart += (y).QuadPart)
-
-#define INSTR_TIME_SUBTRACT(x,y) \
- ((x).QuadPart -= (y).QuadPart)
-
-#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
- ((x).QuadPart += (y).QuadPart - (z).QuadPart)
-
-#define INSTR_TIME_GET_DOUBLE(t) \
- (((double) (t).QuadPart) / GetTimerFrequency())
-
-#define INSTR_TIME_GET_MILLISEC(t) \
- (((double) (t).QuadPart * 1000.0) / GetTimerFrequency())
-
-#define INSTR_TIME_GET_MICROSEC(t) \
- ((uint64) (((double) (t).QuadPart * 1000000.0) / GetTimerFrequency()))
+ return now;
+}
static inline double
GetTimerFrequency(void)
@@ -177,11 +153,45 @@ GetTimerFrequency(void)
return (double) f.QuadPart;
}
+#define INSTR_TIME_SET_CURRENT(t) \
+ ((t) = pg_query_performance_counter())
+
+#define INSTR_TIME_GET_NANOSEC(t) \
+ ((int64) (((double) (t).ticks * NS_PER_S) / GetTimerFrequency()))
+
#endif /* WIN32 */
-/* same macro on all platforms */
+
+/*
+ * Common macros
+ */
+
+#define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0)
+
+
+#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
#define INSTR_TIME_SET_CURRENT_LAZY(t) \
(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
+
+#define INSTR_TIME_ADD(x,y) \
+ ((x).ticks += (y).ticks)
+
+#define INSTR_TIME_SUBTRACT(x,y) \
+ ((x).ticks -= (y).ticks)
+
+#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
+ ((x).ticks += (y).ticks - (z).ticks)
+
+
+#define INSTR_TIME_GET_DOUBLE(t) \
+ ((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
+
+#define INSTR_TIME_GET_MILLISEC(t) \
+ ((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_MS)
+
+#define INSTR_TIME_GET_MICROSEC(t) \
+ (INSTR_TIME_GET_NANOSEC(t) / NS_PER_US)
+
#endif /* INSTR_TIME_H */
--
2.38.0
v8-0003-instr_time-Add-INSTR_TIME_SET_SECONDS-INSTR_TIME_.patchtext/x-diff; charset=us-asciiDownload
From 0dcdbb6b1ad1121d2bdb2ba370060ef0519640b8 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 20 Jan 2023 15:31:54 -0800
Subject: [PATCH v8 3/5] instr_time: Add INSTR_TIME_SET_SECONDS(),
INSTR_TIME_IS_LT()
INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop
without having to convert into time units (which is
costly). INSTR_TIME_IS_LT() can be used to check the loop condition.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/portability/instr_time.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index c0ed491395d..af2ab6ec887 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -15,6 +15,8 @@
*
* INSTR_TIME_IS_ZERO(t) is t equal to zero?
*
+ * INSTR_TIME_IS_LT(x, y) x < y
+ *
* INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
*
* INSTR_TIME_SET_CURRENT(t) set t to current time
@@ -22,6 +24,8 @@
* INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
* evaluates to whether t changed
*
+ * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds
+ *
* INSTR_TIME_ADD(x, y) x += y
*
* INSTR_TIME_SUBTRACT(x, y) x -= y
@@ -122,6 +126,9 @@ pg_clock_gettime_ns(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_clock_gettime_ns())
+#define INSTR_TIME_SET_SECONDS(t, s) \
+ ((t).ticks = NS_PER_S * (s))
+
#define INSTR_TIME_GET_NANOSEC(t) \
((int64) (t).ticks)
@@ -156,6 +163,9 @@ GetTimerFrequency(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
+#define INSTR_TIME_SET_SECONDS(t, s) \
+ ((t).ticks = s * GetTimerFrequency())
+
#define INSTR_TIME_GET_NANOSEC(t) \
((int64) (((double) (t).ticks * NS_PER_S) / GetTimerFrequency()))
@@ -168,6 +178,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0)
+#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks)
+
#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
--
2.38.0
v8-0004-wip-report-nanoseconds-in-pg_test_timing.patchtext/x-diff; charset=us-asciiDownload
From 5a458d4584961dedd3f80a07d8faea66e57c5d94 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 16 Jan 2023 11:19:11 -0800
Subject: [PATCH v8 4/5] wip: report nanoseconds in pg_test_timing
This commit also updates pg_test_timing's documentation:
- compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of
of statement with/without EXPLAIN ANALYZE
- explain the 2x overhead (due to two timestamp acquisitions per row)
- remove old section about old versions of linux - I couldn't update the
numbers, and it's old enough nobody would care
---
src/bin/pg_test_timing/pg_test_timing.c | 74 ++++++++------
doc/src/sgml/ref/pgtesttiming.sgml | 122 +++++++++++-------------
2 files changed, 100 insertions(+), 96 deletions(-)
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index c29d6f87629..e20718669a5 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -19,8 +19,8 @@ static void handle_args(int argc, char *argv[]);
static uint64 test_timing(unsigned int duration);
static void output(uint64 loop_count);
-/* record duration in powers of 2 microseconds */
-long long int histogram[32];
+/* record duration in powers of 2 nanoseconds */
+uint64 histogram[64];
int
main(int argc, char *argv[])
@@ -121,35 +121,48 @@ handle_args(int argc, char *argv[])
static uint64
test_timing(unsigned int duration)
{
- uint64 total_time;
- int64 time_elapsed = 0;
uint64 loop_count = 0;
- uint64 prev,
- cur;
+ instr_time until_time,
+ total_time;
instr_time start_time,
- end_time,
- temp;
-
- total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;
+ end_time;
+ instr_time cur;
INSTR_TIME_SET_CURRENT(start_time);
- cur = INSTR_TIME_GET_MICROSEC(start_time);
- while (time_elapsed < total_time)
+ /*
+ * To reduce loop overhead, check loop condition in instr_time domain.
+ */
+ INSTR_TIME_SET_SECONDS(total_time, duration);
+ until_time = start_time;
+ INSTR_TIME_ADD(until_time, total_time);
+
+ cur = start_time;
+
+ while (INSTR_TIME_IS_LT(cur, until_time))
{
- int32 diff,
- bits = 0;
+ instr_time temp;
+ instr_time prev;
+ int64 diff;
+ int32 bits = 0;
prev = cur;
- INSTR_TIME_SET_CURRENT(temp);
- cur = INSTR_TIME_GET_MICROSEC(temp);
- diff = cur - prev;
+ INSTR_TIME_SET_CURRENT(cur);
+ temp = cur;
+ INSTR_TIME_SUBTRACT(temp, prev);
+ diff = INSTR_TIME_GET_NANOSEC(temp);
/* Did time go backwards? */
- if (diff < 0)
+ if (unlikely(diff <= 0))
{
+ /* can't do anything with that measurement */
+ if (diff == 0)
+ {
+ loop_count++;
+ continue;
+ }
fprintf(stderr, _("Detected clock going backwards in time.\n"));
- fprintf(stderr, _("Time warp: %d ms\n"), diff);
+ fprintf(stderr, _("Time warp: %lld ns\n"), (long long) diff);
exit(1);
}
@@ -164,8 +177,6 @@ test_timing(unsigned int duration)
histogram[bits]++;
loop_count++;
- INSTR_TIME_SUBTRACT(temp, start_time);
- time_elapsed = INSTR_TIME_GET_MICROSEC(temp);
}
INSTR_TIME_SET_CURRENT(end_time);
@@ -173,7 +184,7 @@ test_timing(unsigned int duration)
INSTR_TIME_SUBTRACT(end_time, start_time);
printf(_("Per loop time including overhead: %0.2f ns\n"),
- INSTR_TIME_GET_DOUBLE(end_time) * 1e9 / loop_count);
+ (INSTR_TIME_GET_DOUBLE(end_time) * NS_PER_S) / loop_count);
return loop_count;
}
@@ -181,9 +192,10 @@ test_timing(unsigned int duration)
static void
output(uint64 loop_count)
{
- int64 max_bit = 31,
+ int64 low_bit = 0,
+ max_bit = 63,
i;
- char *header1 = _("< us");
+ char *header1 = _("< ns");
char *header2 = /* xgettext:no-c-format */ _("% of total");
char *header3 = _("count");
int len1 = strlen(header1);
@@ -194,15 +206,19 @@ output(uint64 loop_count)
while (max_bit > 0 && histogram[max_bit] == 0)
max_bit--;
+ /* find lowest bit value */
+ while (low_bit < max_bit && histogram[low_bit] == 0)
+ low_bit++;
+
printf(_("Histogram of timing durations:\n"));
printf("%*s %*s %*s\n",
- Max(6, len1), header1,
+ Max(9, len1), header1,
Max(10, len2), header2,
Max(10, len3), header3);
- for (i = 0; i <= max_bit; i++)
- printf("%*ld %*.5f %*lld\n",
- Max(6, len1), 1l << i,
+ for (i = low_bit; i <= max_bit; i++)
+ printf("%*ld %*.5f %*llu\n",
+ Max(9, len1), 1l << i,
Max(10, len2) - 1, (double) histogram[i] * 100 / loop_count,
- Max(10, len3), histogram[i]);
+ Max(10, len3), (long long unsigned) histogram[i]);
}
diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml
index a5eb3aa25e0..82c625e3a2a 100644
--- a/doc/src/sgml/ref/pgtesttiming.sgml
+++ b/doc/src/sgml/ref/pgtesttiming.sgml
@@ -93,28 +93,34 @@ PostgreSQL documentation
<para>
Good results will show most (>90%) individual timing calls take less than
- one microsecond. Average per loop overhead will be even lower, below 100
- nanoseconds. This example from an Intel i7-860 system using a TSC clock
- source shows excellent performance:
+ one microsecond (1000 nanoseconds). Average per loop overhead will be even
+ lower, below 100 nanoseconds. This example from an Intel i9-9880H system
+ using a TSC clock source shows excellent performance:
<screen><![CDATA[
Testing timing overhead for 3 seconds.
-Per loop time including overhead: 35.96 ns
+Per loop time including overhead: 13.74 ns
Histogram of timing durations:
- < us % of total count
- 1 96.40465 80435604
- 2 3.59518 2999652
- 4 0.00015 126
- 8 0.00002 13
- 16 0.00000 2
+ < ns % of total count
+ 16 97.00221 211857215
+ 32 2.99555 6542412
+ 64 0.00115 2505
+ 128 0.00035 759
+ 256 0.00004 78
+ 512 0.00000 3
+ 1024 0.00000 4
+ 2048 0.00034 732
+ 4096 0.00000 6
+ 8192 0.00000 8
+ 16384 0.00019 409
+ 32768 0.00018 403
+ 65536 0.00000 1
]]></screen>
</para>
<para>
- Note that different units are used for the per loop time than the
- histogram. The loop can have resolution within a few nanoseconds (ns),
- while the individual timing calls can only resolve down to one microsecond
- (us).
+ Note that the accuracy of the histogram entries may be lower than the
+ per loop time.
</para>
</refsect2>
@@ -125,24 +131,30 @@ Histogram of timing durations:
When the query executor is running a statement using
<command>EXPLAIN ANALYZE</command>, individual operations are timed as well
as showing a summary. The overhead of your system can be checked by
- counting rows with the <application>psql</application> program:
+ disabling the per-row timing, using the <literal>TIMING OFF</literal>
+ option:
<screen>
-CREATE TABLE t AS SELECT * FROM generate_series(1,100000);
-\timing
-SELECT COUNT(*) FROM t;
-EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
+CREATE TABLE t AS SELECT * FROM generate_series(1, 100000);
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t;
+EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM t;
</screen>
</para>
<para>
- The i7-860 system measured runs the count query in 9.8 ms while
- the <command>EXPLAIN ANALYZE</command> version takes 16.6 ms, each
- processing just over 100,000 rows. That 6.8 ms difference means the timing
- overhead per row is 68 ns, about twice what pg_test_timing estimated it
- would be. Even that relatively small amount of overhead is making the fully
- timed count statement take almost 70% longer. On more substantial queries,
- the timing overhead would be less problematic.
+ The i9-9880H system measured shows an execution time of 4.116 ms for the
+ <literal>TIMING OFF</literal> query, and 6.965 ms for the
+ <literal>TIMING ON</literal>, each processing 100,000 rows.
+
+ That 2.849 ms difference means the timing overhead per row is 28 ns. As
+ <literal>TIMING ON</literal> measures timestamps twice per row returned by
+ an executor node, the overhead is very close to what pg_test_timing
+ estimated it would be.
+
+ more than what pg_test_timing estimated it would be. Even that relatively
+ small amount of overhead is making the fully timed count statement take
+ about 60% longer. On more substantial queries, the timing overhead would
+ be less problematic.
</para>
</refsect2>
@@ -157,28 +169,31 @@ EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
<screen><![CDATA[
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-tsc hpet acpi_pm
+tsc acpi_pm
# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
# pg_test_timing
-Per loop time including overhead: 722.92 ns
+Testing timing overhead for 3 seconds.
+Per loop time including overhead: 708.58 ns
Histogram of timing durations:
- < us % of total count
- 1 27.84870 1155682
- 2 72.05956 2990371
- 4 0.07810 3241
- 8 0.01357 563
- 16 0.00007 3
+ < ns % of total count
+ 1024 99.79796 4225270
+ 2048 0.15560 6588
+ 4096 0.00035 15
+ 8192 0.01738 736
+ 16384 0.01679 711
+ 32768 0.01190 504
]]></screen>
</para>
<para>
In this configuration, the sample <command>EXPLAIN ANALYZE</command> above
- takes 115.9 ms. That's 1061 ns of timing overhead, again a small multiple
- of what's measured directly by this utility. That much timing overhead
- means the actual query itself is only taking a tiny fraction of the
- accounted for time, most of it is being consumed in overhead instead. In
- this configuration, any <command>EXPLAIN ANALYZE</command> totals involving
- many timed operations would be inflated significantly by timing overhead.
+ shows an execution time of 148.7 ms. That's 1392 ns of timing overhead,
+ again a small multiple of what's measured directly by this utility. That
+ much timing overhead means the actual query itself is only taking a tiny
+ fraction of the accounted for time, most of it is being consumed in
+ overhead instead. In this configuration, any <command>EXPLAIN
+ ANALYZE</command> totals involving many timed operations would be inflated
+ significantly by timing overhead.
</para>
<para>
@@ -196,33 +211,6 @@ kern.timecounter.hardware: ACPI-fast -> TSC
</screen>
</para>
- <para>
- Other systems may only allow setting the time source on boot. On older
- Linux systems the "clock" kernel setting is the only way to make this sort
- of change. And even on some more recent ones, the only option you'll see
- for a clock source is "jiffies". Jiffies are the older Linux software clock
- implementation, which can have good resolution when it's backed by fast
- enough timing hardware, as in this example:
-
-<screen><![CDATA[
-$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-jiffies
-$ dmesg | grep time.c
-time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.
-time.c: Detected 2400.153 MHz processor.
-$ pg_test_timing
-Testing timing overhead for 3 seconds.
-Per timing duration including loop overhead: 97.75 ns
-Histogram of timing durations:
- < us % of total count
- 1 90.23734 27694571
- 2 9.75277 2993204
- 4 0.00981 3010
- 8 0.00007 22
- 16 0.00000 1
- 32 0.00000 1
-]]></screen></para>
-
</refsect2>
<refsect2>
--
2.38.0
v8-0005-wip-instr_time-Add-and-use-INSTR_TIME_ZERO-INSTR_.patchtext/x-diff; charset=us-asciiDownload
From ceca2ab6b2df034fde9be61b05ae6ce9ccf668b3 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 20 Jan 2023 16:09:37 -0800
Subject: [PATCH v8 5/5] wip: instr_time: Add and use INSTR_TIME_ZERO(),
INSTR_TIME_CURRENT()
This just updates the places that "Zero initialize instr_time uses causing
compiler warnings" changed, to see whether this is a nicer approach.
---
src/include/portability/instr_time.h | 24 ++++++++++++++------
src/backend/access/transam/xlog.c | 11 +++-------
src/backend/storage/buffer/bufmgr.c | 16 +++++---------
src/backend/storage/file/buffile.c | 16 +++++---------
src/backend/storage/ipc/latch.c | 8 +++----
src/bin/psql/common.c | 33 +++++++++++++---------------
6 files changed, 50 insertions(+), 58 deletions(-)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index af2ab6ec887..2d1ff4f7f82 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -17,6 +17,10 @@
*
* INSTR_TIME_IS_LT(x, y) x < y
*
+ * INSTR_TIME_ZERO() an instr_time set to 0
+ *
+ * INSTR_TIME_CURRENT() an instr_time set to current time
+ *
* INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
*
* INSTR_TIME_SET_CURRENT(t) set t to current time
@@ -110,7 +114,7 @@ typedef struct instr_time
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-/* helper for INSTR_TIME_SET_CURRENT */
+/* helper for INSTR_TIME_CURRENT */
static inline instr_time
pg_clock_gettime_ns(void)
{
@@ -123,8 +127,8 @@ pg_clock_gettime_ns(void)
return now;
}
-#define INSTR_TIME_SET_CURRENT(t) \
- ((t) = pg_clock_gettime_ns())
+#define INSTR_TIME_CURRENT(t) \
+ pg_clock_gettime_ns()
#define INSTR_TIME_SET_SECONDS(t, s) \
((t).ticks = NS_PER_S * (s))
@@ -138,7 +142,7 @@ pg_clock_gettime_ns(void)
/* Use QueryPerformanceCounter() */
-/* helper for INSTR_TIME_SET_CURRENT */
+/* helper for INSTR_TIME_CURRENT */
static inline instr_time
pg_query_performance_counter(void)
{
@@ -160,8 +164,8 @@ GetTimerFrequency(void)
return (double) f.QuadPart;
}
-#define INSTR_TIME_SET_CURRENT(t) \
- ((t) = pg_query_performance_counter())
+#define INSTR_TIME_CURRENT(t) \
+ pg_query_performance_counter()
#define INSTR_TIME_SET_SECONDS(t, s) \
((t).ticks = s * GetTimerFrequency())
@@ -181,7 +185,13 @@ GetTimerFrequency(void)
#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks)
-#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
+#define INSTR_TIME_ZERO(t) (instr_time){0}
+
+#define INSTR_TIME_SET_CURRENT(t) \
+ (t) = INSTR_TIME_CURRENT()
+
+#define INSTR_TIME_SET_ZERO(t) \
+ ((t) = INSTR_TIME_ZERO())
#define INSTR_TIME_SET_CURRENT_LAZY(t) \
(INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c
index fb4c860bdea..f563800c8ab 100644
--- a/src/backend/access/transam/xlog.c
+++ b/src/backend/access/transam/xlog.c
@@ -2178,7 +2178,7 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
Size nbytes;
Size nleft;
int written;
- instr_time start;
+ instr_time start = INSTR_TIME_ZERO();
/* OK to write the page(s) */
from = XLogCtl->pages + startidx * (Size) XLOG_BLCKSZ;
@@ -2191,8 +2191,6 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
/* Measure I/O timing to write WAL data */
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
- else
- INSTR_TIME_SET_ZERO(start);
pgstat_report_wait_start(WAIT_EVENT_WAL_WRITE);
written = pg_pwrite(openLogFile, from, nleft, startoffset);
@@ -2204,9 +2202,8 @@ XLogWrite(XLogwrtRqst WriteRqst, TimeLineID tli, bool flexible)
*/
if (track_wal_io_timing)
{
- instr_time duration;
+ instr_time duration = INSTR_TIME_CURRENT();
- INSTR_TIME_SET_CURRENT(duration);
INSTR_TIME_SUBTRACT(duration, start);
PendingWalStats.wal_write_time += INSTR_TIME_GET_MICROSEC(duration);
}
@@ -8137,7 +8134,7 @@ void
issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
{
char *msg = NULL;
- instr_time start;
+ instr_time start = INSTR_TIME_ZERO();
Assert(tli != 0);
@@ -8153,8 +8150,6 @@ issue_xlog_fsync(int fd, XLogSegNo segno, TimeLineID tli)
/* Measure I/O timing to sync the WAL file */
if (track_wal_io_timing)
INSTR_TIME_SET_CURRENT(start);
- else
- INSTR_TIME_SET_ZERO(start);
pgstat_report_wait_start(WAIT_EVENT_WAL_SYNC);
switch (sync_method)
diff --git a/src/backend/storage/buffer/bufmgr.c b/src/backend/storage/buffer/bufmgr.c
index 800a4248c95..d8baf80e650 100644
--- a/src/backend/storage/buffer/bufmgr.c
+++ b/src/backend/storage/buffer/bufmgr.c
@@ -1012,19 +1012,17 @@ ReadBuffer_common(SMgrRelation smgr, char relpersistence, ForkNumber forkNum,
MemSet((char *) bufBlock, 0, BLCKSZ);
else
{
- instr_time io_start,
- io_time;
+ instr_time io_start = INSTR_TIME_ZERO();
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
- else
- INSTR_TIME_SET_ZERO(io_start);
smgrread(smgr, forkNum, blockNum, (char *) bufBlock);
if (track_io_timing)
{
- INSTR_TIME_SET_CURRENT(io_time);
+ instr_time io_time = INSTR_TIME_CURRENT();
+
INSTR_TIME_SUBTRACT(io_time, io_start);
pgstat_count_buffer_read_time(INSTR_TIME_GET_MICROSEC(io_time));
INSTR_TIME_ADD(pgBufferUsage.blk_read_time, io_time);
@@ -2826,8 +2824,7 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
{
XLogRecPtr recptr;
ErrorContextCallback errcallback;
- instr_time io_start,
- io_time;
+ instr_time io_start = INSTR_TIME_ZERO();
Block bufBlock;
char *bufToWrite;
uint32 buf_state;
@@ -2904,8 +2901,6 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
- else
- INSTR_TIME_SET_ZERO(io_start);
/*
* bufToWrite is either the shared buffer or a copy, as appropriate.
@@ -2918,7 +2913,8 @@ FlushBuffer(BufferDesc *buf, SMgrRelation reln)
if (track_io_timing)
{
- INSTR_TIME_SET_CURRENT(io_time);
+ instr_time io_time = INSTR_TIME_CURRENT();
+
INSTR_TIME_SUBTRACT(io_time, io_start);
pgstat_count_buffer_write_time(INSTR_TIME_GET_MICROSEC(io_time));
INSTR_TIME_ADD(pgBufferUsage.blk_write_time, io_time);
diff --git a/src/backend/storage/file/buffile.c b/src/backend/storage/file/buffile.c
index 0a51624df3b..6f813279690 100644
--- a/src/backend/storage/file/buffile.c
+++ b/src/backend/storage/file/buffile.c
@@ -429,8 +429,7 @@ static void
BufFileLoadBuffer(BufFile *file)
{
File thisfile;
- instr_time io_start;
- instr_time io_time;
+ instr_time io_start = INSTR_TIME_ZERO();
/*
* Advance to next component file if necessary and possible.
@@ -446,8 +445,6 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
- else
- INSTR_TIME_SET_ZERO(io_start);
/*
* Read whatever we can get, up to a full bufferload.
@@ -468,7 +465,8 @@ BufFileLoadBuffer(BufFile *file)
if (track_io_timing)
{
- INSTR_TIME_SET_CURRENT(io_time);
+ instr_time io_time = INSTR_TIME_CURRENT();
+
INSTR_TIME_SUBTRACT(io_time, io_start);
INSTR_TIME_ADD(pgBufferUsage.temp_blk_read_time, io_time);
}
@@ -500,8 +498,7 @@ BufFileDumpBuffer(BufFile *file)
while (wpos < file->nbytes)
{
off_t availbytes;
- instr_time io_start;
- instr_time io_time;
+ instr_time io_start = INSTR_TIME_ZERO();
/*
* Advance to next component file if necessary and possible.
@@ -527,8 +524,6 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
INSTR_TIME_SET_CURRENT(io_start);
- else
- INSTR_TIME_SET_ZERO(io_start);
bytestowrite = FileWrite(thisfile,
file->buffer.data + wpos,
@@ -543,7 +538,8 @@ BufFileDumpBuffer(BufFile *file)
if (track_io_timing)
{
- INSTR_TIME_SET_CURRENT(io_time);
+ instr_time io_time = INSTR_TIME_CURRENT();
+
INSTR_TIME_SUBTRACT(io_time, io_start);
INSTR_TIME_ADD(pgBufferUsage.temp_blk_write_time, io_time);
}
diff --git a/src/backend/storage/ipc/latch.c b/src/backend/storage/ipc/latch.c
index f4123e7de7e..8092ff4a984 100644
--- a/src/backend/storage/ipc/latch.c
+++ b/src/backend/storage/ipc/latch.c
@@ -1385,8 +1385,7 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
uint32 wait_event_info)
{
int returned_events = 0;
- instr_time start_time;
- instr_time cur_time;
+ instr_time start_time = INSTR_TIME_ZERO();
long cur_timeout = -1;
Assert(nevents > 0);
@@ -1401,8 +1400,6 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
Assert(timeout >= 0 && timeout <= INT_MAX);
cur_timeout = timeout;
}
- else
- INSTR_TIME_SET_ZERO(start_time);
pgstat_report_wait_start(wait_event_info);
@@ -1489,7 +1486,8 @@ WaitEventSetWait(WaitEventSet *set, long timeout,
/* If we're not done, update cur_timeout for next iteration */
if (returned_events == 0 && timeout >= 0)
{
- INSTR_TIME_SET_CURRENT(cur_time);
+ instr_time cur_time = INSTR_TIME_CURRENT();
+
INSTR_TIME_SUBTRACT(cur_time, start_time);
cur_timeout = timeout - (long) INSTR_TIME_GET_MILLISEC(cur_time);
if (cur_timeout <= 0)
diff --git a/src/bin/psql/common.c b/src/bin/psql/common.c
index f907f5d4e8d..5badb029e83 100644
--- a/src/bin/psql/common.c
+++ b/src/bin/psql/common.c
@@ -1269,15 +1269,12 @@ DescribeQuery(const char *query, double *elapsed_msec)
bool timing = pset.timing;
PGresult *result;
bool OK;
- instr_time before,
- after;
+ instr_time before = INSTR_TIME_ZERO();
*elapsed_msec = 0;
if (timing)
INSTR_TIME_SET_CURRENT(before);
- else
- INSTR_TIME_SET_ZERO(before);
/*
* To parse the query but not execute it, we prepare it, using the unnamed
@@ -1350,7 +1347,8 @@ DescribeQuery(const char *query, double *elapsed_msec)
if (timing)
{
- INSTR_TIME_SET_CURRENT(after);
+ instr_time after = INSTR_TIME_CURRENT();
+
INSTR_TIME_SUBTRACT(after, before);
*elapsed_msec += INSTR_TIME_GET_MILLISEC(after);
}
@@ -1400,16 +1398,13 @@ ExecQueryAndProcessResults(const char *query,
{
bool timing = pset.timing;
bool success;
- instr_time before,
- after;
+ instr_time before = INSTR_TIME_ZERO();
PGresult *result;
FILE *gfile_fout = NULL;
bool gfile_is_pipe = false;
if (timing)
INSTR_TIME_SET_CURRENT(before);
- else
- INSTR_TIME_SET_ZERO(before);
if (pset.bind_flag)
success = PQsendQueryParams(pset.db, query, pset.bind_nparams, NULL, (const char * const *) pset.bind_params, NULL, NULL, 0);
@@ -1490,7 +1485,8 @@ ExecQueryAndProcessResults(const char *query,
*/
if (timing)
{
- INSTR_TIME_SET_CURRENT(after);
+ instr_time after = INSTR_TIME_CURRENT();
+
INSTR_TIME_SUBTRACT(after, before);
*elapsed_msec = INSTR_TIME_GET_MILLISEC(after);
}
@@ -1595,7 +1591,8 @@ ExecQueryAndProcessResults(const char *query,
*/
if (timing)
{
- INSTR_TIME_SET_CURRENT(after);
+ instr_time after = INSTR_TIME_CURRENT();
+
INSTR_TIME_SUBTRACT(after, before);
*elapsed_msec = INSTR_TIME_GET_MILLISEC(after);
}
@@ -1693,8 +1690,7 @@ ExecQueryUsingCursor(const char *query, double *elapsed_msec)
int ntuples;
int fetch_count;
char fetch_cmd[64];
- instr_time before,
- after;
+ instr_time before = INSTR_TIME_ZERO();
int flush_error;
*elapsed_msec = 0;
@@ -1706,8 +1702,6 @@ ExecQueryUsingCursor(const char *query, double *elapsed_msec)
if (timing)
INSTR_TIME_SET_CURRENT(before);
- else
- INSTR_TIME_SET_ZERO(before);
/* if we're not in a transaction, start one */
if (PQtransactionStatus(pset.db) == PQTRANS_IDLE)
@@ -1738,7 +1732,8 @@ ExecQueryUsingCursor(const char *query, double *elapsed_msec)
if (timing)
{
- INSTR_TIME_SET_CURRENT(after);
+ instr_time after = INSTR_TIME_CURRENT();
+
INSTR_TIME_SUBTRACT(after, before);
*elapsed_msec += INSTR_TIME_GET_MILLISEC(after);
}
@@ -1786,7 +1781,8 @@ ExecQueryUsingCursor(const char *query, double *elapsed_msec)
if (timing)
{
- INSTR_TIME_SET_CURRENT(after);
+ instr_time after = INSTR_TIME_CURRENT();
+
INSTR_TIME_SUBTRACT(after, before);
*elapsed_msec += INSTR_TIME_GET_MILLISEC(after);
}
@@ -1926,7 +1922,8 @@ cleanup:
if (timing)
{
- INSTR_TIME_SET_CURRENT(after);
+ instr_time after = INSTR_TIME_CURRENT();
+
INSTR_TIME_SUBTRACT(after, before);
*elapsed_msec += INSTR_TIME_GET_MILLISEC(after);
}
--
2.38.0
Andres Freund <andres@anarazel.de> writes:
Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions
could give us the best of both worlds?
I tried that in the attached 0005. I found that it reads better if I also add
INSTR_TIME_CURRENT(). If we decide to go for this, I'd roll it into 0001
instead, but I wanted to get agreement on it first.
-1 from here. This forecloses the possibility that it's best to use more
than one assignment to initialize the value, and the code doesn't read
any better than it did before.
regards, tom lane
Hi,
On 2023-01-20 22:27:07 -0500, Tom Lane wrote:
Andres Freund <andres@anarazel.de> writes:
Perhaps an INSTR_TIME_ZERO() that could be assigned in variable definitions
could give us the best of both worlds?I tried that in the attached 0005. I found that it reads better if I also add
INSTR_TIME_CURRENT(). If we decide to go for this, I'd roll it into 0001
instead, but I wanted to get agreement on it first.-1 from here. This forecloses the possibility that it's best to use more
than one assignment to initialize the value, and the code doesn't read
any better than it did before.
I think it does read a bit better, but it's a pretty small improvement. So
I'll leave this aspect be for now.
Thanks for checking.
Greetings,
Andres Freund
Hi,
On 2023-01-19 11:47:49 +0100, David Geier wrote:
I also couldn't help and hacked a bit on the rdtsc pieces. I did figure out
how to do the cycles->nanosecond conversion with integer shift and multiply in
the common case, which does show a noticable speedup. But that's for another
day.I also have code for that here. I decided against integrating it because we
don't convert frequently enough to make it matter. Or am I missing
something?
We do currently do the conversion quite frequently. Admittedly I was
partially motivated by trying to get the per-loop overhead in pg_test_timing
down ;)
But I think it's a real issue. Places where we do, but shouldn't, convert:
- ExecReScan() - quite painful, we can end up with a lot of those
- InstrStopNode() - adds a good bit of overhead to simple
- PendingWalStats.wal_write_time - this is particularly bad because it happens
within very contended code
- calls to pgstat_count_buffer_read_time(), pgstat_count_buffer_write_time() -
they can be very frequent
- pgbench.c, as we already discussed
- pg_stat_statements.c
- ...
These all will get a bit slower when moving to a "variable" frequency.
What was your approach for avoiding the costly operation? I ended up with a
integer multiplication + shift approximation for the floating point
multiplication (which in turn uses the inverse of the division by the
frequency). To allow for sufficient precision while also avoiding overflows, I
had to make that branch conditional, with a slow path for large numbers of
nanoseconds.
I fought a bit with myself about whether to send those patches in this thread,
because it'll take over the CF entry. But decided that it's ok, given that
David's patches should be rebased over these anyway?That's alright.
Though, I would hope we attempt to bring your patch set as well as the RDTSC
patch set in.
I think it'd be great - but I'm not sure we're there yet, reliability and
code-complexity wise.
I think it might be worth makign the rdts aspect somewhat
measurable. E.g. allowing pg_test_timing to use both at the same time, and
have it compare elapsed time with both sources of counters.
Greetings,
Andres Freund
Hi,
On 2023-01-20 07:43:00 +0100, David Geier wrote:
On 1/18/23 13:52, David Geier wrote:
On 1/16/23 21:39, Pavel Stehule wrote:
po 16. 1. 2023 v�21:34 odes�latel Tomas Vondra
<tomas.vondra@enterprisedb.com> napsal:��� Hi,
��� there's minor bitrot in the Mkvcbuild.pm change, making cfbot
unhappy.��� As for the patch, I don't have much comments. I'm wondering if
it'd be
��� useful to indicate which timing source was actually used for EXPLAIN
��� ANALYZE, say something like:��� �Planning time: 0.197 ms
��� �Execution time: 0.225 ms
��� �Timing source: clock_gettime (or tsc)+1
I like the idea of exposing the timing source in the EXPLAIN ANALYZE
output.
It's a good tradeoff between inspectability and effort, given that RDTSC
should always be better to use.
If there are no objections I go this way.Thinking about this a little more made me realize that this will cause
different pg_regress output depending on the platform. So if we go this
route we would at least need an option for EXPLAIN ANALYZE to disable it. Or
rather have it disabled by default and allow for enabling it. Thoughts?
The elapsed time is already inherently unstable, so we shouldn't have any test
output showing the time.
But I doubt showing it in every explain is a good idea - we use instr_time in
plenty of other places. Why show it in explain, but not in all those other
places?
Greetings,
Andres Freund
Hi,
On 2023-01-18 14:05:35 +0100, David Geier wrote:
@Andres: will you take care of these changes and provide me with an updated
patch set so I can rebase the RDTSC changes?
Otherwise, I can also apply Tom suggestions to your patch set and send out
the complete patch set.
I'm planning to push most of my changes soon, had hoped to get to it a bit
sooner, but ...
If you have time to look at the pg_test_timing part, it'd be
appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit
hesitant to push it.
Greetings,
Andres Freund
Hi,
On 2023-01-18 14:02:48 +0100, David Geier wrote:
On 1/16/23 18:37, Andres Freund wrote:
I'm doubtful this is worth the complexity it incurs. By the time we convert
out of the instr_time format, the times shouldn't be small enough that the
accuracy is affected much.I don't feel strong about it and you have a point that we most likely only
convert ones we've accumulated a fair amount of cycles.
I think we can avoid the issue another way. The inaccuracy comes from the
cycles_to_sec ending up very small, right? Right now your patch has (and
probably my old version similarly had):
cycles_to_sec = 1.0 / (tsc_freq * 1000);
I think it's better if we have one multiplier to convert cycles to nanoseconds
- that'll be a double comparatively close to 1. We can use that to implement
INSTR_TIME_GET_NANOSECONDS(). The conversion to microseconds then is just a
division by 1000 (which most compilers convert into a multiplication/shift
combo), and the conversions to milliseconds and seconds will be similar.
Because we'll never "wrongly" go into the "huge number" or "very small number"
ranges, that should provide sufficient precision? We'll of course still end up
with a very small number when converting a few nanoseconds to seconds, but
that's ok because it's the precision being asked for, instead of loosing
precision in some intermediate representation.
Looking around, most of the existing uses of INSTR_TIME_GET_MICROSEC()
actually accumulate themselves, and should instead keep things in the
instr_time format and convert later. We'd win more accuracy / speed that way.I don't think the introduction of pg_time_usec_t was a great idea, but oh
well.Fully agreed. Why not replacing pg_time_usec_t with instr_time in a separate
patch?
pgbench used to use instr_time, but it was replaced by somebody thinking the
API is too cumbersome. Which I can't quite deny, even though I think the
specific change isn't great.
But yes, this should definitely be a separate patch.
Greetings,
Andres Freund
On Fri, Jan 20, 2023 at 04:40:32PM -0800, Andres Freund wrote:
From 5a458d4584961dedd3f80a07d8faea66e57c5d94 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 16 Jan 2023 11:19:11 -0800
Subject: [PATCH v8 4/5] wip: report nanoseconds in pg_test_timing
<para> - The i7-860 system measured runs the count query in 9.8 ms while - the <command>EXPLAIN ANALYZE</command> version takes 16.6 ms, each - processing just over 100,000 rows. That 6.8 ms difference means the timing - overhead per row is 68 ns, about twice what pg_test_timing estimated it - would be. Even that relatively small amount of overhead is making the fully - timed count statement take almost 70% longer. On more substantial queries, - the timing overhead would be less problematic. + The i9-9880H system measured shows an execution time of 4.116 ms for the + <literal>TIMING OFF</literal> query, and 6.965 ms for the + <literal>TIMING ON</literal>, each processing 100,000 rows. + + That 2.849 ms difference means the timing overhead per row is 28 ns. As + <literal>TIMING ON</literal> measures timestamps twice per row returned by + an executor node, the overhead is very close to what pg_test_timing + estimated it would be. + + more than what pg_test_timing estimated it would be. Even that relatively + small amount of overhead is making the fully timed count statement take + about 60% longer. On more substantial queries, the timing overhead would + be less problematic.
I guess you intend to merge these two paragraphs ?
On 2023-01-20 22:50:37 -0600, Justin Pryzby wrote:
On Fri, Jan 20, 2023 at 04:40:32PM -0800, Andres Freund wrote:
From 5a458d4584961dedd3f80a07d8faea66e57c5d94 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 16 Jan 2023 11:19:11 -0800
Subject: [PATCH v8 4/5] wip: report nanoseconds in pg_test_timing<para> - The i7-860 system measured runs the count query in 9.8 ms while - the <command>EXPLAIN ANALYZE</command> version takes 16.6 ms, each - processing just over 100,000 rows. That 6.8 ms difference means the timing - overhead per row is 68 ns, about twice what pg_test_timing estimated it - would be. Even that relatively small amount of overhead is making the fully - timed count statement take almost 70% longer. On more substantial queries, - the timing overhead would be less problematic. + The i9-9880H system measured shows an execution time of 4.116 ms for the + <literal>TIMING OFF</literal> query, and 6.965 ms for the + <literal>TIMING ON</literal>, each processing 100,000 rows. + + That 2.849 ms difference means the timing overhead per row is 28 ns. As + <literal>TIMING ON</literal> measures timestamps twice per row returned by + an executor node, the overhead is very close to what pg_test_timing + estimated it would be. + + more than what pg_test_timing estimated it would be. Even that relatively + small amount of overhead is making the fully timed count statement take + about 60% longer. On more substantial queries, the timing overhead would + be less problematic.I guess you intend to merge these two paragraphs ?
Oops. I was intending to drop the last paragraph.
Looking at the docs again I noticed that I needed to rephrase the 'acpi_pm'
section further, as I'd left the "a small multiple of what's measured directly
by this utility" language in there.
Do the changes otherwise make sense?
The "small multiple" stuff was just due to a) comparing "raw statement" with
explain analyze b) not accounting for two timestamps being taken per row.
I think it makes sense to remove the "jiffies" section - the output shown is
way outdated. And I don't think the jiffies time counter is one something
still sees in the wild, outside of bringing up a new cpu architecture or such.
Greetings,
Andres Freund
Hi,
On 2023-01-20 20:16:13 -0800, Andres Freund wrote:
On 2023-01-18 14:05:35 +0100, David Geier wrote:
@Andres: will you take care of these changes and provide me with an updated
patch set so I can rebase the RDTSC changes?
Otherwise, I can also apply Tom suggestions to your patch set and send out
the complete patch set.I'm planning to push most of my changes soon, had hoped to get to it a bit
sooner, but ...
I pushed the int64-ification commits.
If you have time to look at the pg_test_timing part, it'd be
appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit
hesitant to push it.
I haven't yet pushed the pg_test_timing (nor it's small prerequisite)
patch.
Thanks to Justin I've polished the pg_test_timing docs some.
I've attached those two patches. Feel free to include them in your series if
you want, then the CF entry (and thus cfbot) makes sense again...
Greetings,
Andres Freund
Attachments:
v9-0001-instr_time-Add-INSTR_TIME_SET_SECONDS-INSTR_TIME_.patchtext/x-diff; charset=us-asciiDownload
From 2546f3000455a7086ea930986b294fd79024ea59 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 20 Jan 2023 15:31:54 -0800
Subject: [PATCH v9 1/2] instr_time: Add INSTR_TIME_SET_SECONDS(),
INSTR_TIME_IS_LT()
INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop
without having to convert into time units (which is
costly). INSTR_TIME_IS_LT() can be used to check the loop condition.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/portability/instr_time.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index cc85138e21f..aab80effb00 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -15,6 +15,8 @@
*
* INSTR_TIME_IS_ZERO(t) is t equal to zero?
*
+ * INSTR_TIME_IS_LT(x, y) x < y
+ *
* INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
*
* INSTR_TIME_SET_CURRENT(t) set t to current time
@@ -22,6 +24,8 @@
* INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
* evaluates to whether t changed
*
+ * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds
+ *
* INSTR_TIME_ADD(x, y) x += y
*
* INSTR_TIME_SUBTRACT(x, y) x -= y
@@ -122,6 +126,9 @@ pg_clock_gettime_ns(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_clock_gettime_ns())
+#define INSTR_TIME_SET_SECONDS(t, s) \
+ ((t).ticks = NS_PER_S * (s))
+
#define INSTR_TIME_GET_NANOSEC(t) \
((int64) (t).ticks)
@@ -156,6 +163,9 @@ GetTimerFrequency(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
+#define INSTR_TIME_SET_SECONDS(t, s) \
+ ((t).ticks = s * GetTimerFrequency())
+
#define INSTR_TIME_GET_NANOSEC(t) \
((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency())))
@@ -168,6 +178,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0)
+#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks)
+
#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
--
2.38.0
v9-0002-wip-report-nanoseconds-in-pg_test_timing.patchtext/x-diff; charset=us-asciiDownload
From a3149e3430c9594ca2ceb3039c954e729bd7c46e Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 16 Jan 2023 11:19:11 -0800
Subject: [PATCH v9 2/2] wip: report nanoseconds in pg_test_timing
This commit also updates pg_test_timing's documentation:
- compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of
of statement with/without EXPLAIN ANALYZE
- explain the 2x overhead (due to two timestamp acquisitions per row)
- remove old section about old versions of linux - I couldn't update the
numbers, and it's old enough nobody would care
---
src/bin/pg_test_timing/pg_test_timing.c | 74 +++++++++------
doc/src/sgml/ref/pgtesttiming.sgml | 117 ++++++++++--------------
2 files changed, 95 insertions(+), 96 deletions(-)
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index c29d6f87629..e20718669a5 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -19,8 +19,8 @@ static void handle_args(int argc, char *argv[]);
static uint64 test_timing(unsigned int duration);
static void output(uint64 loop_count);
-/* record duration in powers of 2 microseconds */
-long long int histogram[32];
+/* record duration in powers of 2 nanoseconds */
+uint64 histogram[64];
int
main(int argc, char *argv[])
@@ -121,35 +121,48 @@ handle_args(int argc, char *argv[])
static uint64
test_timing(unsigned int duration)
{
- uint64 total_time;
- int64 time_elapsed = 0;
uint64 loop_count = 0;
- uint64 prev,
- cur;
+ instr_time until_time,
+ total_time;
instr_time start_time,
- end_time,
- temp;
-
- total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;
+ end_time;
+ instr_time cur;
INSTR_TIME_SET_CURRENT(start_time);
- cur = INSTR_TIME_GET_MICROSEC(start_time);
- while (time_elapsed < total_time)
+ /*
+ * To reduce loop overhead, check loop condition in instr_time domain.
+ */
+ INSTR_TIME_SET_SECONDS(total_time, duration);
+ until_time = start_time;
+ INSTR_TIME_ADD(until_time, total_time);
+
+ cur = start_time;
+
+ while (INSTR_TIME_IS_LT(cur, until_time))
{
- int32 diff,
- bits = 0;
+ instr_time temp;
+ instr_time prev;
+ int64 diff;
+ int32 bits = 0;
prev = cur;
- INSTR_TIME_SET_CURRENT(temp);
- cur = INSTR_TIME_GET_MICROSEC(temp);
- diff = cur - prev;
+ INSTR_TIME_SET_CURRENT(cur);
+ temp = cur;
+ INSTR_TIME_SUBTRACT(temp, prev);
+ diff = INSTR_TIME_GET_NANOSEC(temp);
/* Did time go backwards? */
- if (diff < 0)
+ if (unlikely(diff <= 0))
{
+ /* can't do anything with that measurement */
+ if (diff == 0)
+ {
+ loop_count++;
+ continue;
+ }
fprintf(stderr, _("Detected clock going backwards in time.\n"));
- fprintf(stderr, _("Time warp: %d ms\n"), diff);
+ fprintf(stderr, _("Time warp: %lld ns\n"), (long long) diff);
exit(1);
}
@@ -164,8 +177,6 @@ test_timing(unsigned int duration)
histogram[bits]++;
loop_count++;
- INSTR_TIME_SUBTRACT(temp, start_time);
- time_elapsed = INSTR_TIME_GET_MICROSEC(temp);
}
INSTR_TIME_SET_CURRENT(end_time);
@@ -173,7 +184,7 @@ test_timing(unsigned int duration)
INSTR_TIME_SUBTRACT(end_time, start_time);
printf(_("Per loop time including overhead: %0.2f ns\n"),
- INSTR_TIME_GET_DOUBLE(end_time) * 1e9 / loop_count);
+ (INSTR_TIME_GET_DOUBLE(end_time) * NS_PER_S) / loop_count);
return loop_count;
}
@@ -181,9 +192,10 @@ test_timing(unsigned int duration)
static void
output(uint64 loop_count)
{
- int64 max_bit = 31,
+ int64 low_bit = 0,
+ max_bit = 63,
i;
- char *header1 = _("< us");
+ char *header1 = _("< ns");
char *header2 = /* xgettext:no-c-format */ _("% of total");
char *header3 = _("count");
int len1 = strlen(header1);
@@ -194,15 +206,19 @@ output(uint64 loop_count)
while (max_bit > 0 && histogram[max_bit] == 0)
max_bit--;
+ /* find lowest bit value */
+ while (low_bit < max_bit && histogram[low_bit] == 0)
+ low_bit++;
+
printf(_("Histogram of timing durations:\n"));
printf("%*s %*s %*s\n",
- Max(6, len1), header1,
+ Max(9, len1), header1,
Max(10, len2), header2,
Max(10, len3), header3);
- for (i = 0; i <= max_bit; i++)
- printf("%*ld %*.5f %*lld\n",
- Max(6, len1), 1l << i,
+ for (i = low_bit; i <= max_bit; i++)
+ printf("%*ld %*.5f %*llu\n",
+ Max(9, len1), 1l << i,
Max(10, len2) - 1, (double) histogram[i] * 100 / loop_count,
- Max(10, len3), histogram[i]);
+ Max(10, len3), (long long unsigned) histogram[i]);
}
diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml
index a5eb3aa25e0..7e0266cf58b 100644
--- a/doc/src/sgml/ref/pgtesttiming.sgml
+++ b/doc/src/sgml/ref/pgtesttiming.sgml
@@ -93,28 +93,34 @@ PostgreSQL documentation
<para>
Good results will show most (>90%) individual timing calls take less than
- one microsecond. Average per loop overhead will be even lower, below 100
- nanoseconds. This example from an Intel i7-860 system using a TSC clock
- source shows excellent performance:
+ one microsecond (1000 nanoseconds). Average per loop overhead will be even
+ lower, below 100 nanoseconds. This example from an Intel i9-9880H system
+ using a TSC clock source shows excellent performance:
<screen><![CDATA[
Testing timing overhead for 3 seconds.
-Per loop time including overhead: 35.96 ns
+Per loop time including overhead: 13.74 ns
Histogram of timing durations:
- < us % of total count
- 1 96.40465 80435604
- 2 3.59518 2999652
- 4 0.00015 126
- 8 0.00002 13
- 16 0.00000 2
+ < ns % of total count
+ 16 97.00221 211857215
+ 32 2.99555 6542412
+ 64 0.00115 2505
+ 128 0.00035 759
+ 256 0.00004 78
+ 512 0.00000 3
+ 1024 0.00000 4
+ 2048 0.00034 732
+ 4096 0.00000 6
+ 8192 0.00000 8
+ 16384 0.00019 409
+ 32768 0.00018 403
+ 65536 0.00000 1
]]></screen>
</para>
<para>
- Note that different units are used for the per loop time than the
- histogram. The loop can have resolution within a few nanoseconds (ns),
- while the individual timing calls can only resolve down to one microsecond
- (us).
+ Note that the accuracy of the histogram entries may be lower than the
+ per loop time.
</para>
</refsect2>
@@ -125,24 +131,25 @@ Histogram of timing durations:
When the query executor is running a statement using
<command>EXPLAIN ANALYZE</command>, individual operations are timed as well
as showing a summary. The overhead of your system can be checked by
- counting rows with the <application>psql</application> program:
+ disabling the per-row timing, using the <literal>TIMING OFF</literal>
+ option:
<screen>
-CREATE TABLE t AS SELECT * FROM generate_series(1,100000);
-\timing
-SELECT COUNT(*) FROM t;
-EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
+CREATE TABLE t AS SELECT * FROM generate_series(1, 100000);
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t;
+EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM t;
</screen>
</para>
<para>
- The i7-860 system measured runs the count query in 9.8 ms while
- the <command>EXPLAIN ANALYZE</command> version takes 16.6 ms, each
- processing just over 100,000 rows. That 6.8 ms difference means the timing
- overhead per row is 68 ns, about twice what pg_test_timing estimated it
- would be. Even that relatively small amount of overhead is making the fully
- timed count statement take almost 70% longer. On more substantial queries,
- the timing overhead would be less problematic.
+ The i9-9880H system measured shows an execution time of 4.116 ms for the
+ <literal>TIMING OFF</literal> query, and 6.965 ms for the
+ <literal>TIMING ON</literal>, each processing 100,000 rows.
+
+ That 2.849 ms difference means the timing overhead per row is 28 ns. As
+ <literal>TIMING ON</literal> measures timestamps twice per row returned by
+ an executor node, the overhead is very close to what pg_test_timing
+ estimated it would be.
</para>
</refsect2>
@@ -157,28 +164,31 @@ EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
<screen><![CDATA[
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-tsc hpet acpi_pm
+tsc acpi_pm
# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
# pg_test_timing
-Per loop time including overhead: 722.92 ns
+Testing timing overhead for 3 seconds.
+Per loop time including overhead: 708.58 ns
Histogram of timing durations:
- < us % of total count
- 1 27.84870 1155682
- 2 72.05956 2990371
- 4 0.07810 3241
- 8 0.01357 563
- 16 0.00007 3
+ < ns % of total count
+ 1024 99.79796 4225270
+ 2048 0.15560 6588
+ 4096 0.00035 15
+ 8192 0.01738 736
+ 16384 0.01679 711
+ 32768 0.01190 504
]]></screen>
</para>
<para>
In this configuration, the sample <command>EXPLAIN ANALYZE</command> above
- takes 115.9 ms. That's 1061 ns of timing overhead, again a small multiple
- of what's measured directly by this utility. That much timing overhead
- means the actual query itself is only taking a tiny fraction of the
- accounted for time, most of it is being consumed in overhead instead. In
- this configuration, any <command>EXPLAIN ANALYZE</command> totals involving
- many timed operations would be inflated significantly by timing overhead.
+ shows an execution time of 148.7 ms. That's 1392 ns of per-row timing
+ overhead. Taking the two timestamps per row into account, that's again
+ close to what pg_test_timing estimated. That much timing overhead means
+ the actual query itself is only taking a tiny fraction of the accounted for
+ time, most of it is being consumed in overhead instead. In this
+ configuration, any <command>EXPLAIN ANALYZE</command> totals involving many
+ timed operations would be inflated significantly by timing overhead.
</para>
<para>
@@ -196,33 +206,6 @@ kern.timecounter.hardware: ACPI-fast -> TSC
</screen>
</para>
- <para>
- Other systems may only allow setting the time source on boot. On older
- Linux systems the "clock" kernel setting is the only way to make this sort
- of change. And even on some more recent ones, the only option you'll see
- for a clock source is "jiffies". Jiffies are the older Linux software clock
- implementation, which can have good resolution when it's backed by fast
- enough timing hardware, as in this example:
-
-<screen><![CDATA[
-$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-jiffies
-$ dmesg | grep time.c
-time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.
-time.c: Detected 2400.153 MHz processor.
-$ pg_test_timing
-Testing timing overhead for 3 seconds.
-Per timing duration including loop overhead: 97.75 ns
-Histogram of timing durations:
- < us % of total count
- 1 90.23734 27694571
- 2 9.75277 2993204
- 4 0.00981 3010
- 8 0.00007 22
- 16 0.00000 1
- 32 0.00000 1
-]]></screen></para>
-
</refsect2>
<refsect2>
--
2.38.0
Hi,
On 2023-01-20 21:31:57 -0800, Andres Freund wrote:
On 2023-01-20 20:16:13 -0800, Andres Freund wrote:
I'm planning to push most of my changes soon, had hoped to get to it a bit
sooner, but ...I pushed the int64-ification commits.
There's an odd compilation failure on AIX.
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=hoverfly&dt=2023-01-21%2007%3A01%3A42
/opt/IBM/xlc/16.1.0/bin/xlc_r -D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY -qnoansialias -g -O2 -qmaxmem=33554432 -qsuppress=1500-010:1506-995 -qsuppress=1506-010:1506-416:1506-450:1506-480:1506-481:1506-492:1506-944:1506-1264 -qinfo=all:nocnd:noeff:noext:nogot:noini:noord:nopar:noppc:norea:nouni:nouse -qinfo=nounset -qvisibility=hidden -I. -I. -I/opt/freeware/include/python3.5m -I../../../src/include -I/home/nm/sw/nopath/icu58.3-64/include -I/home/nm/sw/nopath/libxml2-64/include/libxml2 -I/home/nm/sw/nopath/uuid-64/include -I/home/nm/sw/nopath/openldap-64/include -I/home/nm/sw/nopath/icu58.3-64/include -I/home/nm/sw/nopath/libxml2-64/include -c -o plpy_cursorobject.o plpy_cursorobject.c
"../../../src/include/portability/instr_time.h", line 116.9: 1506-304 (I) No function prototype given for "clock_gettime".
"../../../src/include/portability/instr_time.h", line 116.23: 1506-045 (S) Undeclared identifier CLOCK_REALTIME.
<builtin>: recipe for target 'plpy_cursorobject.o' failed
but files including instr_time.h *do* build successfully, e.g. instrument.c:
/opt/IBM/xlc/16.1.0/bin/xlc_r -D_LARGE_FILES=1 -DRANDOMIZE_ALLOCATED_MEMORY -qnoansialias -g -O2 -qmaxmem=33554432 -qsuppress=1500-010:1506-995 -qsuppress=1506-010:1506-416:1506-450:1506-480:1506-481:1506-492:1506-944:1506-1264 -qinfo=all:nocnd:noeff:noext:nogot:noini:noord:nopar:noppc:norea:nouni:nouse -qinfo=nounset -I../../../src/include -I/home/nm/sw/nopath/icu58.3-64/include -I/home/nm/sw/nopath/libxml2-64/include/libxml2 -I/home/nm/sw/nopath/uuid-64/include -I/home/nm/sw/nopath/openldap-64/include -I/home/nm/sw/nopath/icu58.3-64/include -I/home/nm/sw/nopath/libxml2-64/include -c -o instrument.o instrument.c
Before the change the clock_gettime() call was in a macro and thus could be
referenced even without a prior declaration, as long as places using
INSTR_TIME_SET_CURRENT() had all the necessary includes and defines.
Argh:
There's nice bit in plpython.h:
/*
* Include order should be: postgres.h, other postgres headers, plpython.h,
* other plpython headers. (In practice, other plpython headers will also
* include this file, so that they can compile standalone.)
*/
#ifndef POSTGRES_H
#error postgres.h must be included before plpython.h
#endif
/*
* Undefine some things that get (re)defined in the Python headers. They aren't
* used by the PL/Python code, and all PostgreSQL headers should be included
* earlier, so this should be pretty safe.
*/
#undef _POSIX_C_SOURCE
#undef _XOPEN_SOURCE
the relevant stuff in time.h is indeed guarded by
#if _XOPEN_SOURCE>=500
I don't think the plpython actually code follows the rule about including all
postgres headers earlier.
plpy_typeio.h:
#include "access/htup.h"
#include "fmgr.h"
#include "plpython.h"
#include "utils/typcache.h"
plpy_curserobject.c:
#include "access/xact.h"
#include "catalog/pg_type.h"
#include "mb/pg_wchar.h"
#include "plpy_cursorobject.h"
#include "plpy_elog.h"
#include "plpy_main.h"
#include "plpy_planobject.h"
#include "plpy_procedure.h"
#include "plpy_resultobject.h"
#include "plpy_spi.h"
#include "plpython.h"
#include "utils/memutils.h"
It strikes me as a uh, not good idea to undefine _POSIX_C_SOURCE,
_XOPEN_SOURCE.
The include order aspect was perhaps feasible when there just was plpython.c,
but with the split into many different C files and many headers, it seems hard
to maintain. There's a lot of violations afaics.
The undefines were added in a11cf433413, the split in 147c2482542.
Greetings,
Andres Freund
Hi,
On 1/21/23 05:14, Andres Freund wrote:
The elapsed time is already inherently unstable, so we shouldn't have any test
output showing the time.But I doubt showing it in every explain is a good idea - we use instr_time in
plenty of other places. Why show it in explain, but not in all those other
places?
Yeah. I thought it would only be an issue if we showed it
unconditionally in EXPLAIN ANALYZE. If we only show it with TIMING ON,
we're likely fine with pretty much all regression tests.
But given the different opinions, I'll leave it out in the new patch set
for the moment being.
--
David Geier
(ServiceNow)
Hi,
On 2023-01-23 18:23:17 +0100, David Geier wrote:
On 1/21/23 05:14, Andres Freund wrote:
The elapsed time is already inherently unstable, so we shouldn't have any test
output showing the time.But I doubt showing it in every explain is a good idea - we use instr_time in
plenty of other places. Why show it in explain, but not in all those other
places?Yeah. I thought it would only be an issue if we showed it unconditionally in
EXPLAIN ANALYZE. If we only show it with TIMING ON, we're likely fine with
pretty much all regression tests.
If we add it, it probably shouldn't depend on TIMING, but on
SUMMARY. Regression test queries showing EXPLAIN ANALYZE output all do
something like
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)
the SUMMARY OFF gets rid of the "top-level" "Planning Time" and "Execution
Time", whereas the TIMING OFF gets rid of the per-node timing. Those are
separate options because per-node timing is problematic performance-wise
(right now), but whole-query timing rarely is.
But given the different opinions, I'll leave it out in the new patch set for
the moment being.
Makes sense.
Another, independent, thing worth thinking about: I think we might want to
expose both rdtsc and rdtscp. For something like
InstrStartNode()/InstrStopNode(), avoiding the "one-way barrier" of rdtscp is
quite important to avoid changing the query performance. But for measuring
whole-query time, we likely want to measure the actual time.
It probably won't matter hugely for the whole query time - the out of order
window of modern CPUs is large, but not *that* large - but I don't think we
can generally assume that.
I'm thinking of something like INSTR_TIME_SET_CURRENT() and
INSTR_TIME_SET_CURRENT_FAST() or _NOBARRIER().
Greetings,
Andres Freund
Hi,
On 1/21/23 05:12, Andres Freund wrote:
We do currently do the conversion quite frequently. Admittedly I was
partially motivated by trying to get the per-loop overhead in pg_test_timing
down ;)But I think it's a real issue. Places where we do, but shouldn't, convert:
- ExecReScan() - quite painful, we can end up with a lot of those
- InstrStopNode() - adds a good bit of overhead to simple
InstrStopNode() doesn't convert in the general case but only for the
first tuple or when async. So it goes somewhat hand in hand with
ExecReScan().
- PendingWalStats.wal_write_time - this is particularly bad because it happens
within very contended code
- calls to pgstat_count_buffer_read_time(), pgstat_count_buffer_write_time() -
they can be very frequent
- pgbench.c, as we already discussed
- pg_stat_statements.c
- ...These all will get a bit slower when moving to a "variable" frequency.
I wonder if we will be able to measure any of them easily. But given
that it's many more places than I had realized and given that the
optimized code is not too involved, let's give it a try.
What was your approach for avoiding the costly operation? I ended up with a
integer multiplication + shift approximation for the floating point
multiplication (which in turn uses the inverse of the division by the
frequency). To allow for sufficient precision while also avoiding overflows, I
had to make that branch conditional, with a slow path for large numbers of
nanoseconds.
It seems like we ended up with the same. I do:
sec = ticks / frequency_hz
ns = ticks / frequency_hz * 1,000,000,000
ns = ticks * (1,000,000,000 / frequency_hz)
ns = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz
Now, the constant scaling factor in parentheses is typically a floating
point number. For example for a frequency of 2.5 GHz it would be 2.5. To
work around that we can do something like:
ns = ticks * (1,000,000 * scaler / frequency_khz) / scaler
Where scaler is a power-of-2, big enough to maintain enough precision
while allowing for a shift to implement the division.
The additional multiplication with scaler makes that the maximum range
go down, because we must ensure we never overflow. I'm wondering if we
cannot pick scaler in such a way that remaining range of cycles is large
enough for our use case and we can therefore live without bothering for
the overflow case. What would be "enough"? 1 year? 10 years? ...
Otherwise, we indeed need code that cares for the potential overflow. My
hunch is that it can be done branchless, but it for sure adds dependent
instructions. Maybe in that case a branch is better that almost
certainly will never be taken?
I'll include the code in the new patch set which I'll latest submit
tomorrow.
I think it'd be great - but I'm not sure we're there yet, reliability and
code-complexity wise.
Thanks to your commits, the diff of the new patch set will be already
much smaller and easier to review. What's your biggest concern in terms
of reliability?
I think it might be worth makign the rdts aspect somewhat
measurable. E.g. allowing pg_test_timing to use both at the same time, and
have it compare elapsed time with both sources of counters.
I haven't yet looked into pg_test_timing. I'll do that while including
your patches into the new patch set.
--
David Geier
(ServiceNow)
Hi,
On 1/21/23 06:31, Andres Freund wrote:
I pushed the int64-ification commits.
Great. I started rebasing.
One thing I was wondering about: why did you chose to use a signed
instead of an unsigned 64-bit integer for the ticks?
If you have time to look at the pg_test_timing part, it'd be
appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit
hesitant to push it.I haven't yet pushed the pg_test_timing (nor it's small prerequisite)
patch.I've attached those two patches. Feel free to include them in your series if
you want, then the CF entry (and thus cfbot) makes sense again...
I'll include them in my new patch set and also have a careful look at them.
--
David Geier
(ServiceNow)
Hi,
On 2023-01-23 18:49:37 +0100, David Geier wrote:
On 1/21/23 05:12, Andres Freund wrote:
We do currently do the conversion quite frequently. Admittedly I was
partially motivated by trying to get the per-loop overhead in pg_test_timing
down ;)But I think it's a real issue. Places where we do, but shouldn't, convert:
- ExecReScan() - quite painful, we can end up with a lot of those
- InstrStopNode() - adds a good bit of overhead to simpleInstrStopNode() doesn't convert in the general case but only for the first
tuple or when async. So it goes somewhat hand in hand with ExecReScan().
I think even the first-scan portion is likely noticable for quick queries -
you can quickly end up with 5-10 nodes, even for queries processed in the <
0.1ms range.
Of course it's way worse with rescans / loops.
- PendingWalStats.wal_write_time - this is particularly bad because it happens
within very contended code
- calls to pgstat_count_buffer_read_time(), pgstat_count_buffer_write_time() -
they can be very frequent
- pgbench.c, as we already discussed
- pg_stat_statements.c
- ...These all will get a bit slower when moving to a "variable" frequency.
I wonder if we will be able to measure any of them easily. But given that
it's many more places than I had realized and given that the optimized code
is not too involved, let's give it a try.
I think at least some should be converted to just accumulate in an
instr_time...
What was your approach for avoiding the costly operation? I ended up with a
integer multiplication + shift approximation for the floating point
multiplication (which in turn uses the inverse of the division by the
frequency). To allow for sufficient precision while also avoiding overflows, I
had to make that branch conditional, with a slow path for large numbers of
nanoseconds.It seems like we ended up with the same. I do:
sec = ticks / frequency_hz
ns� = ticks / frequency_hz * 1,000,000,000
ns� = ticks * (1,000,000,000 / frequency_hz)
ns� = ticks * (1,000,000 / frequency_khz) <-- now in kilohertzNow, the constant scaling factor in parentheses is typically a floating
point number. For example for a frequency of 2.5 GHz it would be 2.5. To
work around that we can do something like:ns� = ticks * (1,000,000 * scaler / frequency_khz) / scaler
Where scaler is a power-of-2, big enough to maintain enough precision while
allowing for a shift to implement the division.
Yep, at least quite similar.
The additional multiplication with scaler makes that the maximum range go
down, because we must ensure we never overflow. I'm wondering if we cannot
pick scaler in such a way that remaining range of cycles is large enough for
our use case and we can therefore live without bothering for the overflow
case. What would be "enough"? 1 year? 10 years? ...
Depending on how low we want to keep the error, I don't think we can:
If I set the allowed deviation to 10**-9, we end up requiring a shift by 29
for common ghz ranges. Clearly 33bits isn't an interesting range.
But even if you accept a higher error - we don't have *that* much range
available. Assuming an uint64, the range is ~584 years. If we want 10 years
range, we end up
math.log(((2**64)-1) / (10 * 365 * 60 * 60 * 24 * 10**9), 2)
~= 5.87
So 5 bits available that we could "use" for multiply/shift. For something like
2.5ghz, that'd be ~2% error, clearly not acceptable. And even just a year of
range, ends up allowing a failure of 30796s = 8min over a year, still too
high.
But I don't think it's really an issue - normally that branch will never be
taken (at least within the memory of the branch predictor), which on modern
CPUs means it'll just be predicted as not taken. So as long as we tell the
compiler what's the likely branch, it should be fine. At least as long as the
branch compares with a hardcoded number.
I think it'd be great - but I'm not sure we're there yet, reliability and
code-complexity wise.
Thanks to your commits, the diff of the new patch set will be already much
smaller and easier to review. What's your biggest concern in terms of
reliability?
- the restriction just to linux, that'll make testing harder for some, and
ends up encoding too much OS dependency
- I think we need both the barrier and non-barrier variant, otherwise I
suspect we'll end up with inccuracies we don't want
- needs lots more documentation about why certain cpuid registers are used
- cpu microarch dependencies - isn't there, e.g., the case that the scale on
nehalem has to be different than on later architectures?
- lack of facility to evaluate how well the different time sources work
I think it might be worth makign the rdts aspect somewhat
measurable. E.g. allowing pg_test_timing to use both at the same time, and
have it compare elapsed time with both sources of counters.I haven't yet looked into pg_test_timing. I'll do that while including your
patches into the new patch set.
Cool.
Greetings,
Andres Freund
Hi,
On 2023-01-23 18:52:44 +0100, David Geier wrote:
One thing I was wondering about: why did you chose to use a signed instead
of an unsigned 64-bit integer for the ticks?
That's been the case since my first post in the thread :). Mainly, it seems
easier to detect underflow cases during subtraction that way. And the factor
of 2 in range doesn't change a whole lot.
If you have time to look at the pg_test_timing part, it'd be
appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit
hesitant to push it.I haven't yet pushed the pg_test_timing (nor it's small prerequisite)
patch.I've attached those two patches. Feel free to include them in your series if
you want, then the CF entry (and thus cfbot) makes sense again...I'll include them in my new patch set and also have a careful look at them.
Thanks.
Greetings,
Andres Freund
Hi,
On 1/23/23 18:41, Andres Freund wrote:
If we add it, it probably shouldn't depend on TIMING, but on
SUMMARY. Regression test queries showing EXPLAIN ANALYZE output all do
something like
EXPLAIN (ANALYZE, COSTS OFF, SUMMARY OFF, TIMING OFF)the SUMMARY OFF gets rid of the "top-level" "Planning Time" and "Execution
Time", whereas the TIMING OFF gets rid of the per-node timing. Those are
separate options because per-node timing is problematic performance-wise
(right now), but whole-query timing rarely is.
Makes sense. I wasn't aware of SUMMARY. Let's keep this option in mind,
in case we'll revisit exposing the clock source in the future.
Another, independent, thing worth thinking about: I think we might want to
expose both rdtsc and rdtscp. For something like
InstrStartNode()/InstrStopNode(), avoiding the "one-way barrier" of rdtscp is
quite important to avoid changing the query performance. But for measuring
whole-query time, we likely want to measure the actual time.It probably won't matter hugely for the whole query time - the out of order
window of modern CPUs is large, but not *that* large - but I don't think we
can generally assume that.
That's what I thought as well. I added INSTR_TIME_SET_CURRENT_FAST() and
for now call that variant from InstrStartNode(), InstrEndNode() and
pg_test_timing. To do so in InstrEndNode(), I removed
INSTR_TIME_SET_CURRENT_LAZY(). Otherwise, two variants of that macro
would be needed. INSTR_TIME_SET_CURRENT_LAZY() was only used in a single
place and the code is more readable that way. INSTR_TIME_SET_CURRENT()
is called from a bunch of places. I still have to go through all of them
and see which should be changed to call the _FAST() variant.
Attached is v7 of the patch:
- Rebased on latest master (most importantly on top of the int64
instr_time commits). - Includes two commits from Andres which introduce
INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() and WIP to report
pg_test_timing output in nanoseconds. - Converts ticks to nanoseconds
only with integer math, while accounting for overflow. - Supports RDTSCP
via INSTR_TIME_SET_CURRENT() and introduced
INSTR_TIME_SET_CURRENT_FAST() which uses RDTSC.
I haven't gotten to the following:
- Looking through all calls to INSTR_TIME_SET_CURRENT() and check if
they should be replaced by INSTR_TIME_SET_CURRENT_FAST(). - Reviewing
Andres commits. Potentially improving on pg_test_timing's output. -
Looking at enabling RDTSC on more platforms. Is there a minimum set of
platforms we would like support for? Windows should be easy. That would
also allow to unify the code a little more. - Add more documentation and
do more testing around the calls to CPUID. - Profiling and optimizing
the code. A quick test showed about 10% improvement over master with
TIMING ON vs TIMING OFF, when using the test-case from Andres' e-mail
that started this thread.
I hope I'll find time to work on these points during the next days.
--
David Geier
(ServiceNow)
Attachments:
v7-0001-instr_time-Add-INSTR_TIME_SET_SECONDS-INSTR_TIME_.patchtext/x-patch; charset=UTF-8; name=v7-0001-instr_time-Add-INSTR_TIME_SET_SECONDS-INSTR_TIME_.patchDownload
From 0b5ce706bed13c0c242e6ace809d3c37a8064029 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 20 Jan 2023 15:31:54 -0800
Subject: [PATCH v7 1/3] instr_time: Add INSTR_TIME_SET_SECONDS(),
INSTR_TIME_IS_LT()
INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop
without having to convert into time units (which is
costly). INSTR_TIME_IS_LT() can be used to check the loop condition.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/portability/instr_time.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index cc85138e21..aab80effb0 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -15,6 +15,8 @@
*
* INSTR_TIME_IS_ZERO(t) is t equal to zero?
*
+ * INSTR_TIME_IS_LT(x, y) x < y
+ *
* INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
*
* INSTR_TIME_SET_CURRENT(t) set t to current time
@@ -22,6 +24,8 @@
* INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
* evaluates to whether t changed
*
+ * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds
+ *
* INSTR_TIME_ADD(x, y) x += y
*
* INSTR_TIME_SUBTRACT(x, y) x -= y
@@ -122,6 +126,9 @@ pg_clock_gettime_ns(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_clock_gettime_ns())
+#define INSTR_TIME_SET_SECONDS(t, s) \
+ ((t).ticks = NS_PER_S * (s))
+
#define INSTR_TIME_GET_NANOSEC(t) \
((int64) (t).ticks)
@@ -156,6 +163,9 @@ GetTimerFrequency(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
+#define INSTR_TIME_SET_SECONDS(t, s) \
+ ((t).ticks = s * GetTimerFrequency())
+
#define INSTR_TIME_GET_NANOSEC(t) \
((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency())))
@@ -168,6 +178,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0)
+#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks)
+
#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
--
2.34.1
v7-0002-wip-report-nanoseconds-in-pg_test_timing.patchtext/x-patch; charset=UTF-8; name=v7-0002-wip-report-nanoseconds-in-pg_test_timing.patchDownload
From c666455dc526607e2cc70239a2dbd1d4c33ed8ba Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 16 Jan 2023 11:19:11 -0800
Subject: [PATCH v7 2/3] wip: report nanoseconds in pg_test_timing
This commit also updates pg_test_timing's documentation:
- compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of
of statement with/without EXPLAIN ANALYZE
- explain the 2x overhead (due to two timestamp acquisitions per row)
- remove old section about old versions of linux - I couldn't update the
numbers, and it's old enough nobody would care
---
doc/src/sgml/ref/pgtesttiming.sgml | 117 ++++++++++--------------
src/bin/pg_test_timing/pg_test_timing.c | 74 +++++++++------
2 files changed, 95 insertions(+), 96 deletions(-)
diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml
index a5eb3aa25e..7e0266cf58 100644
--- a/doc/src/sgml/ref/pgtesttiming.sgml
+++ b/doc/src/sgml/ref/pgtesttiming.sgml
@@ -93,28 +93,34 @@ PostgreSQL documentation
<para>
Good results will show most (>90%) individual timing calls take less than
- one microsecond. Average per loop overhead will be even lower, below 100
- nanoseconds. This example from an Intel i7-860 system using a TSC clock
- source shows excellent performance:
+ one microsecond (1000 nanoseconds). Average per loop overhead will be even
+ lower, below 100 nanoseconds. This example from an Intel i9-9880H system
+ using a TSC clock source shows excellent performance:
<screen><![CDATA[
Testing timing overhead for 3 seconds.
-Per loop time including overhead: 35.96 ns
+Per loop time including overhead: 13.74 ns
Histogram of timing durations:
- < us % of total count
- 1 96.40465 80435604
- 2 3.59518 2999652
- 4 0.00015 126
- 8 0.00002 13
- 16 0.00000 2
+ < ns % of total count
+ 16 97.00221 211857215
+ 32 2.99555 6542412
+ 64 0.00115 2505
+ 128 0.00035 759
+ 256 0.00004 78
+ 512 0.00000 3
+ 1024 0.00000 4
+ 2048 0.00034 732
+ 4096 0.00000 6
+ 8192 0.00000 8
+ 16384 0.00019 409
+ 32768 0.00018 403
+ 65536 0.00000 1
]]></screen>
</para>
<para>
- Note that different units are used for the per loop time than the
- histogram. The loop can have resolution within a few nanoseconds (ns),
- while the individual timing calls can only resolve down to one microsecond
- (us).
+ Note that the accuracy of the histogram entries may be lower than the
+ per loop time.
</para>
</refsect2>
@@ -125,24 +131,25 @@ Histogram of timing durations:
When the query executor is running a statement using
<command>EXPLAIN ANALYZE</command>, individual operations are timed as well
as showing a summary. The overhead of your system can be checked by
- counting rows with the <application>psql</application> program:
+ disabling the per-row timing, using the <literal>TIMING OFF</literal>
+ option:
<screen>
-CREATE TABLE t AS SELECT * FROM generate_series(1,100000);
-\timing
-SELECT COUNT(*) FROM t;
-EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
+CREATE TABLE t AS SELECT * FROM generate_series(1, 100000);
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t;
+EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM t;
</screen>
</para>
<para>
- The i7-860 system measured runs the count query in 9.8 ms while
- the <command>EXPLAIN ANALYZE</command> version takes 16.6 ms, each
- processing just over 100,000 rows. That 6.8 ms difference means the timing
- overhead per row is 68 ns, about twice what pg_test_timing estimated it
- would be. Even that relatively small amount of overhead is making the fully
- timed count statement take almost 70% longer. On more substantial queries,
- the timing overhead would be less problematic.
+ The i9-9880H system measured shows an execution time of 4.116 ms for the
+ <literal>TIMING OFF</literal> query, and 6.965 ms for the
+ <literal>TIMING ON</literal>, each processing 100,000 rows.
+
+ That 2.849 ms difference means the timing overhead per row is 28 ns. As
+ <literal>TIMING ON</literal> measures timestamps twice per row returned by
+ an executor node, the overhead is very close to what pg_test_timing
+ estimated it would be.
</para>
</refsect2>
@@ -157,28 +164,31 @@ EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
<screen><![CDATA[
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-tsc hpet acpi_pm
+tsc acpi_pm
# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
# pg_test_timing
-Per loop time including overhead: 722.92 ns
+Testing timing overhead for 3 seconds.
+Per loop time including overhead: 708.58 ns
Histogram of timing durations:
- < us % of total count
- 1 27.84870 1155682
- 2 72.05956 2990371
- 4 0.07810 3241
- 8 0.01357 563
- 16 0.00007 3
+ < ns % of total count
+ 1024 99.79796 4225270
+ 2048 0.15560 6588
+ 4096 0.00035 15
+ 8192 0.01738 736
+ 16384 0.01679 711
+ 32768 0.01190 504
]]></screen>
</para>
<para>
In this configuration, the sample <command>EXPLAIN ANALYZE</command> above
- takes 115.9 ms. That's 1061 ns of timing overhead, again a small multiple
- of what's measured directly by this utility. That much timing overhead
- means the actual query itself is only taking a tiny fraction of the
- accounted for time, most of it is being consumed in overhead instead. In
- this configuration, any <command>EXPLAIN ANALYZE</command> totals involving
- many timed operations would be inflated significantly by timing overhead.
+ shows an execution time of 148.7 ms. That's 1392 ns of per-row timing
+ overhead. Taking the two timestamps per row into account, that's again
+ close to what pg_test_timing estimated. That much timing overhead means
+ the actual query itself is only taking a tiny fraction of the accounted for
+ time, most of it is being consumed in overhead instead. In this
+ configuration, any <command>EXPLAIN ANALYZE</command> totals involving many
+ timed operations would be inflated significantly by timing overhead.
</para>
<para>
@@ -196,33 +206,6 @@ kern.timecounter.hardware: ACPI-fast -> TSC
</screen>
</para>
- <para>
- Other systems may only allow setting the time source on boot. On older
- Linux systems the "clock" kernel setting is the only way to make this sort
- of change. And even on some more recent ones, the only option you'll see
- for a clock source is "jiffies". Jiffies are the older Linux software clock
- implementation, which can have good resolution when it's backed by fast
- enough timing hardware, as in this example:
-
-<screen><![CDATA[
-$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-jiffies
-$ dmesg | grep time.c
-time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.
-time.c: Detected 2400.153 MHz processor.
-$ pg_test_timing
-Testing timing overhead for 3 seconds.
-Per timing duration including loop overhead: 97.75 ns
-Histogram of timing durations:
- < us % of total count
- 1 90.23734 27694571
- 2 9.75277 2993204
- 4 0.00981 3010
- 8 0.00007 22
- 16 0.00000 1
- 32 0.00000 1
-]]></screen></para>
-
</refsect2>
<refsect2>
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index c29d6f8762..e20718669a 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -19,8 +19,8 @@ static void handle_args(int argc, char *argv[]);
static uint64 test_timing(unsigned int duration);
static void output(uint64 loop_count);
-/* record duration in powers of 2 microseconds */
-long long int histogram[32];
+/* record duration in powers of 2 nanoseconds */
+uint64 histogram[64];
int
main(int argc, char *argv[])
@@ -121,35 +121,48 @@ handle_args(int argc, char *argv[])
static uint64
test_timing(unsigned int duration)
{
- uint64 total_time;
- int64 time_elapsed = 0;
uint64 loop_count = 0;
- uint64 prev,
- cur;
+ instr_time until_time,
+ total_time;
instr_time start_time,
- end_time,
- temp;
-
- total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;
+ end_time;
+ instr_time cur;
INSTR_TIME_SET_CURRENT(start_time);
- cur = INSTR_TIME_GET_MICROSEC(start_time);
- while (time_elapsed < total_time)
+ /*
+ * To reduce loop overhead, check loop condition in instr_time domain.
+ */
+ INSTR_TIME_SET_SECONDS(total_time, duration);
+ until_time = start_time;
+ INSTR_TIME_ADD(until_time, total_time);
+
+ cur = start_time;
+
+ while (INSTR_TIME_IS_LT(cur, until_time))
{
- int32 diff,
- bits = 0;
+ instr_time temp;
+ instr_time prev;
+ int64 diff;
+ int32 bits = 0;
prev = cur;
- INSTR_TIME_SET_CURRENT(temp);
- cur = INSTR_TIME_GET_MICROSEC(temp);
- diff = cur - prev;
+ INSTR_TIME_SET_CURRENT(cur);
+ temp = cur;
+ INSTR_TIME_SUBTRACT(temp, prev);
+ diff = INSTR_TIME_GET_NANOSEC(temp);
/* Did time go backwards? */
- if (diff < 0)
+ if (unlikely(diff <= 0))
{
+ /* can't do anything with that measurement */
+ if (diff == 0)
+ {
+ loop_count++;
+ continue;
+ }
fprintf(stderr, _("Detected clock going backwards in time.\n"));
- fprintf(stderr, _("Time warp: %d ms\n"), diff);
+ fprintf(stderr, _("Time warp: %lld ns\n"), (long long) diff);
exit(1);
}
@@ -164,8 +177,6 @@ test_timing(unsigned int duration)
histogram[bits]++;
loop_count++;
- INSTR_TIME_SUBTRACT(temp, start_time);
- time_elapsed = INSTR_TIME_GET_MICROSEC(temp);
}
INSTR_TIME_SET_CURRENT(end_time);
@@ -173,7 +184,7 @@ test_timing(unsigned int duration)
INSTR_TIME_SUBTRACT(end_time, start_time);
printf(_("Per loop time including overhead: %0.2f ns\n"),
- INSTR_TIME_GET_DOUBLE(end_time) * 1e9 / loop_count);
+ (INSTR_TIME_GET_DOUBLE(end_time) * NS_PER_S) / loop_count);
return loop_count;
}
@@ -181,9 +192,10 @@ test_timing(unsigned int duration)
static void
output(uint64 loop_count)
{
- int64 max_bit = 31,
+ int64 low_bit = 0,
+ max_bit = 63,
i;
- char *header1 = _("< us");
+ char *header1 = _("< ns");
char *header2 = /* xgettext:no-c-format */ _("% of total");
char *header3 = _("count");
int len1 = strlen(header1);
@@ -194,15 +206,19 @@ output(uint64 loop_count)
while (max_bit > 0 && histogram[max_bit] == 0)
max_bit--;
+ /* find lowest bit value */
+ while (low_bit < max_bit && histogram[low_bit] == 0)
+ low_bit++;
+
printf(_("Histogram of timing durations:\n"));
printf("%*s %*s %*s\n",
- Max(6, len1), header1,
+ Max(9, len1), header1,
Max(10, len2), header2,
Max(10, len3), header3);
- for (i = 0; i <= max_bit; i++)
- printf("%*ld %*.5f %*lld\n",
- Max(6, len1), 1l << i,
+ for (i = low_bit; i <= max_bit; i++)
+ printf("%*ld %*.5f %*llu\n",
+ Max(9, len1), 1l << i,
Max(10, len2) - 1, (double) histogram[i] * 100 / loop_count,
- Max(10, len3), histogram[i]);
+ Max(10, len3), (long long unsigned) histogram[i]);
}
--
2.34.1
v7-0003-Use-RDTSC-P-instructions-to-measure-time-on-x86-L.patchtext/x-patch; charset=UTF-8; name=v7-0003-Use-RDTSC-P-instructions-to-measure-time-on-x86-L.patchDownload
From 673d4355c068172ec1a04523239d26ffbd41eef5 Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Mon, 23 Jan 2023 18:13:55 +0100
Subject: [PATCH v7 3/3] Use RDTSC(P) instructions to measure time on x86 Linux
For now this is only enabled on Linux/x86 when the system clocksource is
marked tsc as well, as determined at runtime. This way we can rely on the
Linux kernel to make a decision whether tsc is invariant and usable on the
current CPU architecture. In all other cases we continue to use the
clock_gettime() implementation like before.
Note that this intentionally uses rdtsc, not rdtscp, as rdtscp waits for
currently running CPU instructions to have finished, and that adds up to
noticable latency for little benefit in the typical InstrStartNode() /
InstrStopNode() use case.
---
src/backend/access/heap/vacuumlazy.c | 3 +-
src/backend/commands/explain.c | 2 +-
src/backend/executor/instrument.c | 12 +-
src/backend/utils/init/postinit.c | 3 +
src/bin/pg_test_timing/pg_test_timing.c | 7 +-
src/bin/pgbench/pgbench.c | 3 +
src/bin/psql/startup.c | 4 +
src/common/Makefile | 1 +
src/common/instr_time.c | 157 ++++++++++++++++++++++++
src/common/meson.build | 1 +
src/include/portability/instr_time.h | 130 ++++++++++++++++----
src/tools/msvc/Mkvcbuild.pm | 2 +-
12 files changed, 293 insertions(+), 32 deletions(-)
create mode 100644 src/common/instr_time.c
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..212093568b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3002,8 +3002,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
INSTR_TIME_SET_CURRENT(currenttime);
elapsed = currenttime;
INSTR_TIME_SUBTRACT(elapsed, starttime);
- if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000)
- >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
+ if (INSTR_TIME_GET_MILLISEC(elapsed) >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
{
if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
{
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index 5212a64b1e..c4c4ccbebe 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -526,7 +526,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
double totaltime = 0;
int eflags;
int instrument_option = 0;
-
+
Assert(plannedstmt->commandType != CMD_UTILITY);
if (es->analyze && es->timing)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ee78a5749d..1ce3d62cff 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -67,9 +67,13 @@ InstrInit(Instrumentation *instr, int instrument_options)
void
InstrStartNode(Instrumentation *instr)
{
- if (instr->need_timer &&
- !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ if (instr->need_timer)
+ {
+ if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStartNode called twice in a row");
+ else
+ INSTR_TIME_SET_CURRENT_FAST(instr->starttime);
+ }
/* save buffer usage totals at node entry, if needed */
if (instr->need_bufusage)
@@ -95,7 +99,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (INSTR_TIME_IS_ZERO(instr->starttime))
elog(ERROR, "InstrStopNode called without start");
- INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_SET_CURRENT_FAST(endtime);
INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 2f07ca7a0e..2eef902483 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -805,6 +805,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
/* Initialize portal manager */
EnablePortalManager();
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* Initialize status reporting */
pgstat_beinit();
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index e20718669a..d185c8fe2e 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -128,7 +128,8 @@ test_timing(unsigned int duration)
end_time;
instr_time cur;
- INSTR_TIME_SET_CURRENT(start_time);
+ INSTR_TIME_INITIALIZE();
+ INSTR_TIME_SET_CURRENT_FAST(start_time);
/*
* To reduce loop overhead, check loop condition in instr_time domain.
@@ -147,7 +148,7 @@ test_timing(unsigned int duration)
int32 bits = 0;
prev = cur;
- INSTR_TIME_SET_CURRENT(cur);
+ INSTR_TIME_SET_CURRENT_FAST(cur);
temp = cur;
INSTR_TIME_SUBTRACT(temp, prev);
diff = INSTR_TIME_GET_NANOSEC(temp);
@@ -179,7 +180,7 @@ test_timing(unsigned int duration)
loop_count++;
}
- INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_SET_CURRENT_FAST(end_time);
INSTR_TIME_SUBTRACT(end_time, start_time);
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 9c12ffaea9..b84dbaf478 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7059,6 +7059,9 @@ main(int argc, char **argv)
initRandomState(&state[i].cs_func_rs);
}
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* opening connection... */
con = doConnect();
if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index 5a28b6f713..3f95e01320 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
#include "help.h"
#include "input.h"
#include "mainloop.h"
+#include "portability/instr_time.h"
#include "settings.h"
/*
@@ -322,6 +323,9 @@ main(int argc, char *argv[])
PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
SyncVariables();
if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index 2f424a5735..ac79261556 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -58,6 +58,7 @@ OBJS_COMMON = \
file_perm.o \
file_utils.o \
hashfn.o \
+ instr_time.o \
ip.o \
jsonapi.o \
keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 0000000000..281416da89
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,157 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ * Non-inline parts of the portable high-precision interval timing
+ * implementation
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "portability/instr_time.h"
+
+#if defined(__x86_64__) && defined(__linux__)
+/*
+ * Indicates if RDTSC can be used (Linux/x86 only, when OS uses TSC clocksource)
+ */
+bool has_rdtsc = false;
+
+/*
+ * Indicates if RDTSCP can be used. True if RDTSC can be used and RDTSCP is available.
+ */
+bool has_rdtscp = false;
+
+/*
+ * Stores what the number of cycles needs to be multiplied with to end up
+ * with nanoseconds using integer math. See comment in pg_initialize_rdtsc()
+ * for more details.
+ *
+ * By default assume we are using clock_gettime() as a fallback which uses
+ * nanoseconds as ticks. Hence, we set the multiplier to the precision scalar
+ * so that the division in INSTR_TIME_GET_NANOSEC() won't change the nanoseconds.
+ *
+ * When using the RDTSC instruction directly this is filled in during initialization
+ * based on the relevant CPUID fields.
+ */
+int64 ticks_per_ns_scaled = TICKS_TO_NS_PRECISION;
+int64 ticks_per_sec = NS_PER_S;
+int64 max_ticks_no_overflow = PG_INT64_MAX / TICKS_TO_NS_PRECISION;
+
+static bool is_under_hypervisor(void)
+{
+ uint32 r0, r1, r2, r3;
+ int result = __get_cpuid(0x1, &r0, &r1, &r2, &r3);
+ return result > 0 && ((r2 >> 31) & 1) == 1;
+}
+
+static bool get_hypervisor_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ uint32 r0, r1, r2;
+
+ if (!is_under_hypervisor())
+ return false;
+
+ /*
+ * Use __cpuid() directly because __get_cpuid() incorrectly bails out for
+ * VM leafs. The problem is that leaf 0x0 returns the maximum leaf and
+ * __get_cpuid() for any input leaf simple ANDs with 0x80000000 to get the
+ * info leaf 0x0 to obtain the maximum. For the VM timing leaf the ANDing
+ * also results in 0x0 while it should result in 0x40000000 to get the
+ * maximum VM leaf.
+ */
+ __cpuid(0x40000010, *tsc_freq, r0, r1, r2);
+ return *tsc_freq > 0;
+}
+
+static bool get_host_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ uint32 r0, r1, r2;
+ int result = __get_cpuid(0x16, tsc_freq, &r0, &r1, &r2);
+ *tsc_freq *= 1000; // Convert from MHz to KHz
+ return result > 0 && *tsc_freq > 0;
+}
+
+static bool get_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ if (get_hypervisor_tsc_frequency_khz(tsc_freq))
+ return true;
+ if (get_host_tsc_frequency_khz(tsc_freq))
+ return true;
+
+ return false;
+}
+
+static bool is_rdtscp_available()
+{
+ uint32 r0, r1, r2, r3;
+ return __get_cpuid(0x80000001, &r0, &r1, &r2, &r3) > 0 && (r3 & (1 << 27)) != 0;
+}
+
+/*
+ * Decide whether we use the RDTSC instruction at runtime, for Linux/x86,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ *
+ * Since Linux already does a significant amount of work to determine
+ * whether TSC is a viable clock source, decide based on that.
+ */
+void pg_initialize_rdtsc(void)
+{
+ FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+
+ if (fp)
+ {
+ char buf[128];
+
+ if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
+ {
+ /*
+ * Compute baseline CPU peformance, determines speed at which RDTSC advances.
+ */
+ uint32 tsc_freq;
+
+ if (get_tsc_frequency_khz(&tsc_freq))
+ {
+ /*
+ * Ticks to nanoseconds conversion requires floating point math
+ * because because:
+ *
+ * sec = ticks / frequency_hz
+ * ns = ticks / frequency_hz * 1,000,000,000
+ * ns = ticks * (1,000,000,000 / frequency_hz)
+ * ns = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz
+ *
+ * Here, 'ns' is usually a floating number. For example for a 2.5 GHz
+ * CPU the scaling factor becomes 1,000,000 / 2,500,000 = 1.2.
+ *
+ * To be able to use integer math we work around the lack of precision.
+ * We first scale the integer up and after the multiplication by the
+ * number of ticks in INSTR_TIME_GET_NANOSEC() we divide again by the
+ * same value. We picked the scaler such that it provides enough precision
+ * and is a power-of-two which allows for shifting instead of doing an
+ * integer division.
+ */
+ ticks_per_ns_scaled = INT64CONST(1000000) * TICKS_TO_NS_PRECISION / tsc_freq;
+ ticks_per_sec = tsc_freq * 1000; // KHz -> Hz
+ max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
+
+ has_rdtsc = true;
+ has_rdtscp = is_rdtscp_available();
+ }
+ }
+
+ fclose(fp);
+ }
+}
+#endif /* defined(__x86_64__) && defined(__linux__) */
diff --git a/src/common/meson.build b/src/common/meson.build
index 1caa1fed04..5a7941b9e3 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -10,6 +10,7 @@ common_sources = files(
'file_perm.c',
'file_utils.c',
'hashfn.c',
+ 'instr_time.c',
'ip.c',
'jsonapi.c',
'keywords.c',
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index aab80effb0..6fda565366 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,9 +4,11 @@
* portable high-precision interval timing
*
* This file provides an abstraction layer to hide portability issues in
- * interval timing. On Unix we use clock_gettime(), and on Windows we use
- * QueryPerformanceCounter(). These macros also give some breathing room to
- * use other high-precision-timing APIs.
+ * interval timing. On Linux/x86 we use the rdtsc instruction when a TSC
+ * clocksource is also used on the host OS. Otherwise, and on other
+ * Unix-like systems we use clock_gettime() and on Windows we use
+ * QueryPerformanceCounter(). These macros also give some breathing
+ * room to use other high-precision-timing APIs.
*
* The basic data type is instr_time, which all callers should treat as an
* opaque typedef. instr_time can store either an absolute time (of
@@ -19,10 +21,11 @@
*
* INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
*
- * INSTR_TIME_SET_CURRENT(t) set t to current time
- *
- * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
- * evaluates to whether t changed
+ * INSTR_TIME_SET_CURRENT_FAST(t) set t to current time without waiting
+ * for instructions in out-of-order window
+ *
+ * INSTR_TIME_SET_CURRENT(t) set t to current time while waiting for
+ * instructions in OOO to retire
*
* INSTR_TIME_SET_SECONDS(t, s) set t to s seconds
*
@@ -84,7 +87,15 @@ typedef struct instr_time
#ifndef WIN32
+/*
+ * Make sure this is a power-of-two, so that the compiler can turn the
+ * multiplications and divisions into shifts.
+ */
+#define TICKS_TO_NS_PRECISION (1<<14)
+extern int64 ticks_per_ns_scaled;
+extern int64 ticks_per_sec;
+extern int64 max_ticks_no_overflow;
/* Use clock_gettime() */
@@ -110,35 +121,113 @@ typedef struct instr_time
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-/* helper for INSTR_TIME_SET_CURRENT */
+#if defined(__x86_64__) && defined(__linux__)
+#include <x86intrin.h>
+#include <cpuid.h>
+
+extern bool has_rdtsc;
+extern bool has_rdtscp;
+#endif
+
static inline instr_time
-pg_clock_gettime_ns(void)
+pg_clock_gettime()
{
instr_time now;
struct timespec tmp;
clock_gettime(PG_INSTR_CLOCK, &tmp);
now.ticks = tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
-
return now;
}
+static inline instr_time
+pg_get_ticks_fast(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtsc)
+ {
+ instr_time now;
+ uint32 a;
+ now.ticks = __rdtscp(&a);
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline instr_time
+pg_get_ticks(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtscp)
+ {
+ instr_time now;
+ uint32 unused;
+ now.ticks = __rdtscp(&unused);
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline int64_t
+pg_ticks_to_ns(instr_time t)
+{
+ /*
+ * Would multiplication overflow? If so perform computation in two parts.
+ * Check overflow without actually overflowing via:
+ * a * b > max <=> a > max / b
+ */
+ int64 ns = 0;
+
+ if (unlikely(t.ticks > max_ticks_no_overflow))
+ {
+ /*
+ * Compute how often the maximum number of ticks fits completely into the
+ * number of elapsed ticks and convert that number into nanoseconds.
+ * Then multiply by the count to arrive at the final value. In a 2nd step
+ * we adjust the number of elapsed ticks and convert the remaining ticks.
+ */
+ int64 count = t.ticks / max_ticks_no_overflow;
+ int64 max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ ns = max_ns * count;
+
+ /*
+ * Subtract the ticks that we now already accounted for, so that they don't
+ * get counted twice.
+ */
+ t.ticks -= count * max_ticks_no_overflow;
+ Assert(t.ticks >= 0);
+ }
+
+ ns += t.ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ return ns;
+}
+
+extern void pg_initialize_rdtsc(void);
+
+#define INSTR_TIME_INITIALIZE() \
+ pg_initialize_rdtsc()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_get_ticks_fast())
+
#define INSTR_TIME_SET_CURRENT(t) \
- ((t) = pg_clock_gettime_ns())
+ ((t) = pg_get_ticks())
#define INSTR_TIME_SET_SECONDS(t, s) \
- ((t).ticks = NS_PER_S * (s))
+ ((t).ticks = (s) * ticks_per_sec)
#define INSTR_TIME_GET_NANOSEC(t) \
- ((int64) (t).ticks)
-
+ pg_ticks_to_ns(t)
#else /* WIN32 */
/* Use QueryPerformanceCounter() */
-/* helper for INSTR_TIME_SET_CURRENT */
static inline instr_time
pg_query_performance_counter(void)
{
@@ -160,6 +249,11 @@ GetTimerFrequency(void)
return (double) f.QuadPart;
}
+#define INSTR_TIME_INITIALIZE()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_query_performance_counter())
+
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
@@ -180,13 +274,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks)
-
#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
-#define INSTR_TIME_SET_CURRENT_LAZY(t) \
- (INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
-
-
#define INSTR_TIME_ADD(x,y) \
((x).ticks += (y).ticks)
@@ -196,7 +285,6 @@ GetTimerFrequency(void)
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
((x).ticks += (y).ticks - (z).ticks)
-
#define INSTR_TIME_GET_DOUBLE(t) \
((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index ee49424d6f..021525f3d8 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -135,7 +135,7 @@ sub mkvcbuild
our @pgcommonallfiles = qw(
base64.c checksum_helper.c compression.c
config_info.c controldata_utils.c d2s.c encnames.c exec.c
- f2s.c file_perm.c file_utils.c hashfn.c ip.c jsonapi.c
+ f2s.c file_perm.c file_utils.c hashfn.c ip.c instr_time.o jsonapi.c
keywords.c kwlookup.c link-canary.c md5_common.c percentrepl.c
pg_get_line.c pg_lzcompress.c pg_prng.c pgfnames.c psprintf.c relpath.c
rmtree.c saslprep.c scram-common.c string.c stringinfo.c unicode_norm.c
--
2.34.1
Hi
I think at least some should be converted to just accumulate in an
instr_time...
I think that's for a later patch though?
Yep, at least quite similar.
OK. I coded it up in the latest version of the patch.
Depending on how low we want to keep the error, I don't think we can:
If I set the allowed deviation to 10**-9, we end up requiring a shift by 29
for common ghz ranges. Clearly 33bits isn't an interesting range.But even if you accept a higher error - we don't have *that* much range
available. Assuming an uint64, the range is ~584 years. If we want 10 years
range, we end upmath.log(((2**64)-1) / (10 * 365 * 60 * 60 * 24 * 10**9), 2)
~= 5.87So 5 bits available that we could "use" for multiply/shift. For something like
2.5ghz, that'd be ~2% error, clearly not acceptable. And even just a year of
range, ends up allowing a failure of 30796s = 8min over a year, still too
high.
Thanks for doing the math. Agreed. The latest patch detects overflow and
correctly handles it.
But I don't think it's really an issue - normally that branch will never be
taken (at least within the memory of the branch predictor), which on modern
CPUs means it'll just be predicted as not taken. So as long as we tell the
compiler what's the likely branch, it should be fine. At least as long as the
branch compares with a hardcoded number.
Yeah. The overflow detection just compares two int64. The "overflow
threshold" is pre-computed.
- the restriction just to linux, that'll make testing harder for some, and
ends up encoding too much OS dependency
- I think we need both the barrier and non-barrier variant, otherwise I
suspect we'll end up with inccuracies we don't want
- needs lots more documentation about why certain cpuid registers are used
- cpu microarch dependencies - isn't there, e.g., the case that the scale on
nehalem has to be different than on later architectures?
- lack of facility to evaluate how well the different time sources work
Makes sense. I carried that list over to my latest e-mail which also
includes the patch to have some sort of summary of where we are in a
single place.
--
David Geier
(ServiceNow)
Hi,
On 1/23/23 21:30, Andres Freund wrote:
That's been the case since my first post in the thread :). Mainly, it seems
easier to detect underflow cases during subtraction that way. And the factor
of 2 in range doesn't change a whole lot.
I just realized it the other day :).
If you have time to look at the pg_test_timing part, it'd be
appreciated. That's a it larger, and nobody looked at it yet. So I'm a bit
hesitant to push it.I haven't yet pushed the pg_test_timing (nor it's small prerequisite)
patch.I've attached those two patches. Feel free to include them in your series if
you want, then the CF entry (and thus cfbot) makes sense again...I'll include them in my new patch set and also have a careful look at them.
I reviewed the prerequisite patch which introduces
INSTR_TIME_SET_SECONDS(), as well as the pg_test_timing patch. Here my
comments:
- The prerequisite patch looks good me.
- By default, the test query in the pg_test_timing doc runs serially.
What about adding SET max_parallel_workers_per_gather = 0 to make sure
it really always does (e.g. on a system with different settings for
parallel_tuple_cost / parallel_setup_cost)? Otherwise, the numbers will
be much more flaky.
- Why have you added a case distinction for diff == 0? Have you
encountered this case? If so, how? Maybe add a comment.
- To further reduce overhead we could call INSTR_TIME_SET_CURRENT()
multiple times. But then again: why do we actually care about the
per-loop time? Why not instead sum up diff and divide by the number of
iterations to exclude all the overhead in the first place?
- In the computation of the per-loop time in nanoseconds you can now use
INSTR_TIME_GET_NANOSEC() instead of INSTR_TIME_GET_DOUBLE() * NS_PER_S.
The rest looks good to me. The rebased patches are part of the patch set
I sent out yesterday in reply to another mail in this thread.
--
David Geier
(ServiceNow)
Hi,
On 2023-01-24 14:30:34 +0100, David Geier wrote:
Attached is v7 of the patch:
- Rebased on latest master (most importantly on top of the int64 instr_time
commits). - Includes two commits from Andres which introduce
INSTR_TIME_SET_SECONDS(), INSTR_TIME_IS_LT() and WIP to report
pg_test_timing output in nanoseconds. - Converts ticks to nanoseconds only
with integer math, while accounting for overflow. - Supports RDTSCP via
INSTR_TIME_SET_CURRENT() and introduced INSTR_TIME_SET_CURRENT_FAST() which
uses RDTSC.I haven't gotten to the following:
- Looking through all calls to INSTR_TIME_SET_CURRENT() and check if they
should be replaced by INSTR_TIME_SET_CURRENT_FAST(). - Reviewing Andres
commits. Potentially improving on pg_test_timing's output. - Looking at
enabling RDTSC on more platforms. Is there a minimum set of platforms we
would like support for? Windows should be easy. That would also allow to
unify the code a little more. - Add more documentation and do more testing
around the calls to CPUID. - Profiling and optimizing the code. A quick test
showed about 10% improvement over master with TIMING ON vs TIMING OFF, when
using the test-case from Andres' e-mail that started this thread.I hope I'll find time to work on these points during the next days.
This fails to build on several platforms:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3751
Greetings,
Andres Freund
Hi,
On 2/7/23 19:12, Andres Freund wrote:
This fails to build on several platforms:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3751
I think I fixed the compilation errors. It was due to a few variables
being declared under
#if defined(__x86_64__) && defined(__linux__)
while being used also under non x86 Linux.
I also removed again the code to obtain the TSC frequency under
hypervisors because the TSC is usually emulated and therefore no faster
than clock_gettime() anyways. So we now simply fallback to
clock_gettime() on hypervisors when we cannot obtain the frequency via
leaf 0x16.
Beyond that I reviewed the first two patches a while ago in [1]/messages/by-id/3ac157f7-085d-e071-45fc-b87cd306360c@gmail.com. I hope
we can progress with them to further reduce the size of this patch set.
[1]: /messages/by-id/3ac157f7-085d-e071-45fc-b87cd306360c@gmail.com
/messages/by-id/3ac157f7-085d-e071-45fc-b87cd306360c@gmail.com
--
David Geier
(ServiceNow)
Attachments:
v8-0001-instr_time-Add-INSTR_TIME_SET_SECONDS-INSTR_TIME_.patchtext/x-patch; charset=UTF-8; name=v8-0001-instr_time-Add-INSTR_TIME_SET_SECONDS-INSTR_TIME_.patchDownload
From 36ff7f7ee14bf42ef0fb775cec428180251c3ff9 Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 20 Jan 2023 15:31:54 -0800
Subject: [PATCH v8 1/3] instr_time: Add INSTR_TIME_SET_SECONDS(),
INSTR_TIME_IS_LT()
INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop
without having to convert into time units (which is
costly). INSTR_TIME_IS_LT() can be used to check the loop condition.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/portability/instr_time.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index cc85138e21..aab80effb0 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -15,6 +15,8 @@
*
* INSTR_TIME_IS_ZERO(t) is t equal to zero?
*
+ * INSTR_TIME_IS_LT(x, y) x < y
+ *
* INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
*
* INSTR_TIME_SET_CURRENT(t) set t to current time
@@ -22,6 +24,8 @@
* INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
* evaluates to whether t changed
*
+ * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds
+ *
* INSTR_TIME_ADD(x, y) x += y
*
* INSTR_TIME_SUBTRACT(x, y) x -= y
@@ -122,6 +126,9 @@ pg_clock_gettime_ns(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_clock_gettime_ns())
+#define INSTR_TIME_SET_SECONDS(t, s) \
+ ((t).ticks = NS_PER_S * (s))
+
#define INSTR_TIME_GET_NANOSEC(t) \
((int64) (t).ticks)
@@ -156,6 +163,9 @@ GetTimerFrequency(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
+#define INSTR_TIME_SET_SECONDS(t, s) \
+ ((t).ticks = s * GetTimerFrequency())
+
#define INSTR_TIME_GET_NANOSEC(t) \
((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency())))
@@ -168,6 +178,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0)
+#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks)
+
#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
--
2.34.1
v8-0002-wip-report-nanoseconds-in-pg_test_timing.patchtext/x-patch; charset=UTF-8; name=v8-0002-wip-report-nanoseconds-in-pg_test_timing.patchDownload
From 1200bfee5b8ebf7c68d8cefb3771a4c3523c2cea Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 16 Jan 2023 11:19:11 -0800
Subject: [PATCH v8 2/3] wip: report nanoseconds in pg_test_timing
This commit also updates pg_test_timing's documentation:
- compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of
of statement with/without EXPLAIN ANALYZE
- explain the 2x overhead (due to two timestamp acquisitions per row)
- remove old section about old versions of linux - I couldn't update the
numbers, and it's old enough nobody would care
---
doc/src/sgml/ref/pgtesttiming.sgml | 117 ++++++++++--------------
src/bin/pg_test_timing/pg_test_timing.c | 74 +++++++++------
2 files changed, 95 insertions(+), 96 deletions(-)
diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml
index a5eb3aa25e..7e0266cf58 100644
--- a/doc/src/sgml/ref/pgtesttiming.sgml
+++ b/doc/src/sgml/ref/pgtesttiming.sgml
@@ -93,28 +93,34 @@ PostgreSQL documentation
<para>
Good results will show most (>90%) individual timing calls take less than
- one microsecond. Average per loop overhead will be even lower, below 100
- nanoseconds. This example from an Intel i7-860 system using a TSC clock
- source shows excellent performance:
+ one microsecond (1000 nanoseconds). Average per loop overhead will be even
+ lower, below 100 nanoseconds. This example from an Intel i9-9880H system
+ using a TSC clock source shows excellent performance:
<screen><![CDATA[
Testing timing overhead for 3 seconds.
-Per loop time including overhead: 35.96 ns
+Per loop time including overhead: 13.74 ns
Histogram of timing durations:
- < us % of total count
- 1 96.40465 80435604
- 2 3.59518 2999652
- 4 0.00015 126
- 8 0.00002 13
- 16 0.00000 2
+ < ns % of total count
+ 16 97.00221 211857215
+ 32 2.99555 6542412
+ 64 0.00115 2505
+ 128 0.00035 759
+ 256 0.00004 78
+ 512 0.00000 3
+ 1024 0.00000 4
+ 2048 0.00034 732
+ 4096 0.00000 6
+ 8192 0.00000 8
+ 16384 0.00019 409
+ 32768 0.00018 403
+ 65536 0.00000 1
]]></screen>
</para>
<para>
- Note that different units are used for the per loop time than the
- histogram. The loop can have resolution within a few nanoseconds (ns),
- while the individual timing calls can only resolve down to one microsecond
- (us).
+ Note that the accuracy of the histogram entries may be lower than the
+ per loop time.
</para>
</refsect2>
@@ -125,24 +131,25 @@ Histogram of timing durations:
When the query executor is running a statement using
<command>EXPLAIN ANALYZE</command>, individual operations are timed as well
as showing a summary. The overhead of your system can be checked by
- counting rows with the <application>psql</application> program:
+ disabling the per-row timing, using the <literal>TIMING OFF</literal>
+ option:
<screen>
-CREATE TABLE t AS SELECT * FROM generate_series(1,100000);
-\timing
-SELECT COUNT(*) FROM t;
-EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
+CREATE TABLE t AS SELECT * FROM generate_series(1, 100000);
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t;
+EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM t;
</screen>
</para>
<para>
- The i7-860 system measured runs the count query in 9.8 ms while
- the <command>EXPLAIN ANALYZE</command> version takes 16.6 ms, each
- processing just over 100,000 rows. That 6.8 ms difference means the timing
- overhead per row is 68 ns, about twice what pg_test_timing estimated it
- would be. Even that relatively small amount of overhead is making the fully
- timed count statement take almost 70% longer. On more substantial queries,
- the timing overhead would be less problematic.
+ The i9-9880H system measured shows an execution time of 4.116 ms for the
+ <literal>TIMING OFF</literal> query, and 6.965 ms for the
+ <literal>TIMING ON</literal>, each processing 100,000 rows.
+
+ That 2.849 ms difference means the timing overhead per row is 28 ns. As
+ <literal>TIMING ON</literal> measures timestamps twice per row returned by
+ an executor node, the overhead is very close to what pg_test_timing
+ estimated it would be.
</para>
</refsect2>
@@ -157,28 +164,31 @@ EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
<screen><![CDATA[
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-tsc hpet acpi_pm
+tsc acpi_pm
# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
# pg_test_timing
-Per loop time including overhead: 722.92 ns
+Testing timing overhead for 3 seconds.
+Per loop time including overhead: 708.58 ns
Histogram of timing durations:
- < us % of total count
- 1 27.84870 1155682
- 2 72.05956 2990371
- 4 0.07810 3241
- 8 0.01357 563
- 16 0.00007 3
+ < ns % of total count
+ 1024 99.79796 4225270
+ 2048 0.15560 6588
+ 4096 0.00035 15
+ 8192 0.01738 736
+ 16384 0.01679 711
+ 32768 0.01190 504
]]></screen>
</para>
<para>
In this configuration, the sample <command>EXPLAIN ANALYZE</command> above
- takes 115.9 ms. That's 1061 ns of timing overhead, again a small multiple
- of what's measured directly by this utility. That much timing overhead
- means the actual query itself is only taking a tiny fraction of the
- accounted for time, most of it is being consumed in overhead instead. In
- this configuration, any <command>EXPLAIN ANALYZE</command> totals involving
- many timed operations would be inflated significantly by timing overhead.
+ shows an execution time of 148.7 ms. That's 1392 ns of per-row timing
+ overhead. Taking the two timestamps per row into account, that's again
+ close to what pg_test_timing estimated. That much timing overhead means
+ the actual query itself is only taking a tiny fraction of the accounted for
+ time, most of it is being consumed in overhead instead. In this
+ configuration, any <command>EXPLAIN ANALYZE</command> totals involving many
+ timed operations would be inflated significantly by timing overhead.
</para>
<para>
@@ -196,33 +206,6 @@ kern.timecounter.hardware: ACPI-fast -> TSC
</screen>
</para>
- <para>
- Other systems may only allow setting the time source on boot. On older
- Linux systems the "clock" kernel setting is the only way to make this sort
- of change. And even on some more recent ones, the only option you'll see
- for a clock source is "jiffies". Jiffies are the older Linux software clock
- implementation, which can have good resolution when it's backed by fast
- enough timing hardware, as in this example:
-
-<screen><![CDATA[
-$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-jiffies
-$ dmesg | grep time.c
-time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.
-time.c: Detected 2400.153 MHz processor.
-$ pg_test_timing
-Testing timing overhead for 3 seconds.
-Per timing duration including loop overhead: 97.75 ns
-Histogram of timing durations:
- < us % of total count
- 1 90.23734 27694571
- 2 9.75277 2993204
- 4 0.00981 3010
- 8 0.00007 22
- 16 0.00000 1
- 32 0.00000 1
-]]></screen></para>
-
</refsect2>
<refsect2>
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index c29d6f8762..e20718669a 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -19,8 +19,8 @@ static void handle_args(int argc, char *argv[]);
static uint64 test_timing(unsigned int duration);
static void output(uint64 loop_count);
-/* record duration in powers of 2 microseconds */
-long long int histogram[32];
+/* record duration in powers of 2 nanoseconds */
+uint64 histogram[64];
int
main(int argc, char *argv[])
@@ -121,35 +121,48 @@ handle_args(int argc, char *argv[])
static uint64
test_timing(unsigned int duration)
{
- uint64 total_time;
- int64 time_elapsed = 0;
uint64 loop_count = 0;
- uint64 prev,
- cur;
+ instr_time until_time,
+ total_time;
instr_time start_time,
- end_time,
- temp;
-
- total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;
+ end_time;
+ instr_time cur;
INSTR_TIME_SET_CURRENT(start_time);
- cur = INSTR_TIME_GET_MICROSEC(start_time);
- while (time_elapsed < total_time)
+ /*
+ * To reduce loop overhead, check loop condition in instr_time domain.
+ */
+ INSTR_TIME_SET_SECONDS(total_time, duration);
+ until_time = start_time;
+ INSTR_TIME_ADD(until_time, total_time);
+
+ cur = start_time;
+
+ while (INSTR_TIME_IS_LT(cur, until_time))
{
- int32 diff,
- bits = 0;
+ instr_time temp;
+ instr_time prev;
+ int64 diff;
+ int32 bits = 0;
prev = cur;
- INSTR_TIME_SET_CURRENT(temp);
- cur = INSTR_TIME_GET_MICROSEC(temp);
- diff = cur - prev;
+ INSTR_TIME_SET_CURRENT(cur);
+ temp = cur;
+ INSTR_TIME_SUBTRACT(temp, prev);
+ diff = INSTR_TIME_GET_NANOSEC(temp);
/* Did time go backwards? */
- if (diff < 0)
+ if (unlikely(diff <= 0))
{
+ /* can't do anything with that measurement */
+ if (diff == 0)
+ {
+ loop_count++;
+ continue;
+ }
fprintf(stderr, _("Detected clock going backwards in time.\n"));
- fprintf(stderr, _("Time warp: %d ms\n"), diff);
+ fprintf(stderr, _("Time warp: %lld ns\n"), (long long) diff);
exit(1);
}
@@ -164,8 +177,6 @@ test_timing(unsigned int duration)
histogram[bits]++;
loop_count++;
- INSTR_TIME_SUBTRACT(temp, start_time);
- time_elapsed = INSTR_TIME_GET_MICROSEC(temp);
}
INSTR_TIME_SET_CURRENT(end_time);
@@ -173,7 +184,7 @@ test_timing(unsigned int duration)
INSTR_TIME_SUBTRACT(end_time, start_time);
printf(_("Per loop time including overhead: %0.2f ns\n"),
- INSTR_TIME_GET_DOUBLE(end_time) * 1e9 / loop_count);
+ (INSTR_TIME_GET_DOUBLE(end_time) * NS_PER_S) / loop_count);
return loop_count;
}
@@ -181,9 +192,10 @@ test_timing(unsigned int duration)
static void
output(uint64 loop_count)
{
- int64 max_bit = 31,
+ int64 low_bit = 0,
+ max_bit = 63,
i;
- char *header1 = _("< us");
+ char *header1 = _("< ns");
char *header2 = /* xgettext:no-c-format */ _("% of total");
char *header3 = _("count");
int len1 = strlen(header1);
@@ -194,15 +206,19 @@ output(uint64 loop_count)
while (max_bit > 0 && histogram[max_bit] == 0)
max_bit--;
+ /* find lowest bit value */
+ while (low_bit < max_bit && histogram[low_bit] == 0)
+ low_bit++;
+
printf(_("Histogram of timing durations:\n"));
printf("%*s %*s %*s\n",
- Max(6, len1), header1,
+ Max(9, len1), header1,
Max(10, len2), header2,
Max(10, len3), header3);
- for (i = 0; i <= max_bit; i++)
- printf("%*ld %*.5f %*lld\n",
- Max(6, len1), 1l << i,
+ for (i = low_bit; i <= max_bit; i++)
+ printf("%*ld %*.5f %*llu\n",
+ Max(9, len1), 1l << i,
Max(10, len2) - 1, (double) histogram[i] * 100 / loop_count,
- Max(10, len3), histogram[i]);
+ Max(10, len3), (long long unsigned) histogram[i]);
}
--
2.34.1
v8-0003-Use-RDTSC-P-instructions-to-measure-time-on-x86-L.patchtext/x-patch; charset=UTF-8; name=v8-0003-Use-RDTSC-P-instructions-to-measure-time-on-x86-L.patchDownload
From c3c90049579af7cb9c12d7baa0301f9759d20abf Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Mon, 23 Jan 2023 18:13:55 +0100
Subject: [PATCH v8 3/3] Use RDTSC(P) instructions to measure time on x86 Linux
For now this is only enabled on Linux/x86 when the system clocksource is
marked TSC as well, as determined at runtime. This way we can rely on the
Linux kernel to make a decision whether TSC is invariant and usable on the
current CPU architecture. In all other cases we continue to use the
clock_gettime() implementation like before.
Note that we intentionally use RDTSC, not RDTSCP, as RDTSCP waits for
currently running CPU instructions to have retired. This adds up to
noticable latency for little benefit in the typical InstrStartNode() /
InstrStopNode() use case. The macro to be used for these cases is
called INSTR_TIME_SET_CURRENT_FAST(). The original macro
INSTR_TIME_SET_CURRENT() uses RDTSCP and is supposed to be used when
precision is more important than performance.
---
src/backend/access/heap/vacuumlazy.c | 3 +-
src/backend/commands/explain.c | 2 +-
src/backend/executor/instrument.c | 12 ++-
src/backend/utils/init/postinit.c | 3 +
src/bin/pg_test_timing/pg_test_timing.c | 7 +-
src/bin/pgbench/pgbench.c | 3 +
src/bin/psql/startup.c | 4 +
src/common/Makefile | 1 +
src/common/instr_time.c | 130 ++++++++++++++++++++++++
src/common/meson.build | 1 +
src/include/portability/instr_time.h | 130 ++++++++++++++++++++----
src/tools/msvc/Mkvcbuild.pm | 2 +-
12 files changed, 266 insertions(+), 32 deletions(-)
create mode 100644 src/common/instr_time.c
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..212093568b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3002,8 +3002,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
INSTR_TIME_SET_CURRENT(currenttime);
elapsed = currenttime;
INSTR_TIME_SUBTRACT(elapsed, starttime);
- if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000)
- >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
+ if (INSTR_TIME_GET_MILLISEC(elapsed) >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
{
if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
{
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index fbbf28cf06..c4243bc668 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -526,7 +526,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
double totaltime = 0;
int eflags;
int instrument_option = 0;
-
+
Assert(plannedstmt->commandType != CMD_UTILITY);
if (es->analyze && es->timing)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ee78a5749d..1ce3d62cff 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -67,9 +67,13 @@ InstrInit(Instrumentation *instr, int instrument_options)
void
InstrStartNode(Instrumentation *instr)
{
- if (instr->need_timer &&
- !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ if (instr->need_timer)
+ {
+ if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStartNode called twice in a row");
+ else
+ INSTR_TIME_SET_CURRENT_FAST(instr->starttime);
+ }
/* save buffer usage totals at node entry, if needed */
if (instr->need_bufusage)
@@ -95,7 +99,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (INSTR_TIME_IS_ZERO(instr->starttime))
elog(ERROR, "InstrStopNode called without start");
- INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_SET_CURRENT_FAST(endtime);
INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 2f07ca7a0e..2eef902483 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -805,6 +805,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
/* Initialize portal manager */
EnablePortalManager();
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* Initialize status reporting */
pgstat_beinit();
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index e20718669a..d185c8fe2e 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -128,7 +128,8 @@ test_timing(unsigned int duration)
end_time;
instr_time cur;
- INSTR_TIME_SET_CURRENT(start_time);
+ INSTR_TIME_INITIALIZE();
+ INSTR_TIME_SET_CURRENT_FAST(start_time);
/*
* To reduce loop overhead, check loop condition in instr_time domain.
@@ -147,7 +148,7 @@ test_timing(unsigned int duration)
int32 bits = 0;
prev = cur;
- INSTR_TIME_SET_CURRENT(cur);
+ INSTR_TIME_SET_CURRENT_FAST(cur);
temp = cur;
INSTR_TIME_SUBTRACT(temp, prev);
diff = INSTR_TIME_GET_NANOSEC(temp);
@@ -179,7 +180,7 @@ test_timing(unsigned int duration)
loop_count++;
}
- INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_SET_CURRENT_FAST(end_time);
INSTR_TIME_SUBTRACT(end_time, start_time);
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 508ed218e8..4125bc8cb7 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7059,6 +7059,9 @@ main(int argc, char **argv)
initRandomState(&state[i].cs_func_rs);
}
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* opening connection... */
con = doConnect();
if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index 5a28b6f713..3f95e01320 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
#include "help.h"
#include "input.h"
#include "mainloop.h"
+#include "portability/instr_time.h"
#include "settings.h"
/*
@@ -322,6 +323,9 @@ main(int argc, char *argv[])
PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
SyncVariables();
if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..23a5c68705 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -59,6 +59,7 @@ OBJS_COMMON = \
file_perm.o \
file_utils.o \
hashfn.o \
+ instr_time.o \
ip.o \
jsonapi.o \
keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 0000000000..998fb30e2f
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,130 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ * Non-inline parts of the portable high-precision interval timing
+ * implementation
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "portability/instr_time.h"
+
+#ifndef WIN32
+/*
+ * Stores what the number of cycles needs to be multiplied with to end up
+ * with nanoseconds using integer math. See comment in pg_initialize_rdtsc()
+ * for more details.
+ *
+ * By default assume we are using clock_gettime() as a fallback which uses
+ * nanoseconds as ticks. Hence, we set the multiplier to the precision scalar
+ * so that the division in INSTR_TIME_GET_NANOSEC() won't change the nanoseconds.
+ *
+ * When using the RDTSC instruction directly this is filled in during initialization
+ * based on the relevant CPUID fields.
+ */
+int64 ticks_per_ns_scaled = TICKS_TO_NS_PRECISION;
+int64 ticks_per_sec = NS_PER_S;
+int64 max_ticks_no_overflow = PG_INT64_MAX / TICKS_TO_NS_PRECISION;
+
+#if defined(__x86_64__) && defined(__linux__)
+/*
+ * Indicates if RDTSC can be used (Linux/x86 only, when OS uses TSC clocksource)
+ */
+bool has_rdtsc = false;
+
+/*
+ * Indicates if RDTSCP can be used. True if RDTSC can be used and RDTSCP is available.
+ */
+bool has_rdtscp = false;
+
+/*
+ * This fails on hypervisors. On hypervisors a different CPUID leaf could be used
+ * to obtain the TSC frequency. We don't do that because the TSC on hypervisors
+ * is usually emulated and the performance of emulated an TSC is usually as bad as
+ * the performance of the hypervisor's clock_gettime() implementation.
+ */
+static bool get_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ uint32 r0, r1, r2;
+ int result = __get_cpuid(0x16, tsc_freq, &r0, &r1, &r2);
+ *tsc_freq *= 1000; // Convert from MHz to KHz
+ return result > 0 && *tsc_freq > 0;
+}
+
+static bool is_rdtscp_available()
+{
+ uint32 r0, r1, r2, r3;
+ return __get_cpuid(0x80000001, &r0, &r1, &r2, &r3) > 0 && (r3 & (1 << 27)) != 0;
+}
+
+/*
+ * Decide whether we use the RDTSC instruction at runtime, for Linux/x86,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ *
+ * Since Linux already does a significant amount of work to determine
+ * whether TSC is a viable clock source, decide based on that.
+ */
+void pg_initialize_rdtsc(void)
+{
+ FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+
+ if (fp)
+ {
+ char buf[128];
+
+ if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
+ {
+ /*
+ * Compute baseline CPU peformance, determines speed at which RDTSC advances.
+ */
+ uint32 tsc_freq;
+
+ if (get_tsc_frequency_khz(&tsc_freq))
+ {
+ /*
+ * Ticks to nanoseconds conversion requires floating point math
+ * because because:
+ *
+ * sec = ticks / frequency_hz
+ * ns = ticks / frequency_hz * 1,000,000,000
+ * ns = ticks * (1,000,000,000 / frequency_hz)
+ * ns = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz
+ *
+ * Here, 'ns' is usually a floating number. For example for a 2.5 GHz
+ * CPU the scaling factor becomes 1,000,000 / 2,500,000 = 1.2.
+ *
+ * To be able to use integer math we work around the lack of precision.
+ * We first scale the integer up and after the multiplication by the
+ * number of ticks in INSTR_TIME_GET_NANOSEC() we divide again by the
+ * same value. We picked the scaler such that it provides enough precision
+ * and is a power-of-two which allows for shifting instead of doing an
+ * integer division.
+ */
+ ticks_per_ns_scaled = INT64CONST(1000000) * TICKS_TO_NS_PRECISION / tsc_freq;
+ ticks_per_sec = tsc_freq * 1000; // KHz -> Hz
+ max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
+
+ has_rdtsc = true;
+ has_rdtscp = is_rdtscp_available();
+ }
+ }
+
+ fclose(fp);
+ }
+}
+#endif /* defined(__x86_64__) && defined(__linux__) */
+
+#endif /* WIN32 */
diff --git a/src/common/meson.build b/src/common/meson.build
index 41bd58ebdf..000983c6bf 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -11,6 +11,7 @@ common_sources = files(
'file_perm.c',
'file_utils.c',
'hashfn.c',
+ 'instr_time.c',
'ip.c',
'jsonapi.c',
'keywords.c',
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index aab80effb0..6fda565366 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,9 +4,11 @@
* portable high-precision interval timing
*
* This file provides an abstraction layer to hide portability issues in
- * interval timing. On Unix we use clock_gettime(), and on Windows we use
- * QueryPerformanceCounter(). These macros also give some breathing room to
- * use other high-precision-timing APIs.
+ * interval timing. On Linux/x86 we use the rdtsc instruction when a TSC
+ * clocksource is also used on the host OS. Otherwise, and on other
+ * Unix-like systems we use clock_gettime() and on Windows we use
+ * QueryPerformanceCounter(). These macros also give some breathing
+ * room to use other high-precision-timing APIs.
*
* The basic data type is instr_time, which all callers should treat as an
* opaque typedef. instr_time can store either an absolute time (of
@@ -19,10 +21,11 @@
*
* INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
*
- * INSTR_TIME_SET_CURRENT(t) set t to current time
- *
- * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
- * evaluates to whether t changed
+ * INSTR_TIME_SET_CURRENT_FAST(t) set t to current time without waiting
+ * for instructions in out-of-order window
+ *
+ * INSTR_TIME_SET_CURRENT(t) set t to current time while waiting for
+ * instructions in OOO to retire
*
* INSTR_TIME_SET_SECONDS(t, s) set t to s seconds
*
@@ -84,7 +87,15 @@ typedef struct instr_time
#ifndef WIN32
+/*
+ * Make sure this is a power-of-two, so that the compiler can turn the
+ * multiplications and divisions into shifts.
+ */
+#define TICKS_TO_NS_PRECISION (1<<14)
+extern int64 ticks_per_ns_scaled;
+extern int64 ticks_per_sec;
+extern int64 max_ticks_no_overflow;
/* Use clock_gettime() */
@@ -110,35 +121,113 @@ typedef struct instr_time
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-/* helper for INSTR_TIME_SET_CURRENT */
+#if defined(__x86_64__) && defined(__linux__)
+#include <x86intrin.h>
+#include <cpuid.h>
+
+extern bool has_rdtsc;
+extern bool has_rdtscp;
+#endif
+
static inline instr_time
-pg_clock_gettime_ns(void)
+pg_clock_gettime()
{
instr_time now;
struct timespec tmp;
clock_gettime(PG_INSTR_CLOCK, &tmp);
now.ticks = tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
-
return now;
}
+static inline instr_time
+pg_get_ticks_fast(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtsc)
+ {
+ instr_time now;
+ uint32 a;
+ now.ticks = __rdtscp(&a);
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline instr_time
+pg_get_ticks(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtscp)
+ {
+ instr_time now;
+ uint32 unused;
+ now.ticks = __rdtscp(&unused);
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline int64_t
+pg_ticks_to_ns(instr_time t)
+{
+ /*
+ * Would multiplication overflow? If so perform computation in two parts.
+ * Check overflow without actually overflowing via:
+ * a * b > max <=> a > max / b
+ */
+ int64 ns = 0;
+
+ if (unlikely(t.ticks > max_ticks_no_overflow))
+ {
+ /*
+ * Compute how often the maximum number of ticks fits completely into the
+ * number of elapsed ticks and convert that number into nanoseconds.
+ * Then multiply by the count to arrive at the final value. In a 2nd step
+ * we adjust the number of elapsed ticks and convert the remaining ticks.
+ */
+ int64 count = t.ticks / max_ticks_no_overflow;
+ int64 max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ ns = max_ns * count;
+
+ /*
+ * Subtract the ticks that we now already accounted for, so that they don't
+ * get counted twice.
+ */
+ t.ticks -= count * max_ticks_no_overflow;
+ Assert(t.ticks >= 0);
+ }
+
+ ns += t.ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ return ns;
+}
+
+extern void pg_initialize_rdtsc(void);
+
+#define INSTR_TIME_INITIALIZE() \
+ pg_initialize_rdtsc()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_get_ticks_fast())
+
#define INSTR_TIME_SET_CURRENT(t) \
- ((t) = pg_clock_gettime_ns())
+ ((t) = pg_get_ticks())
#define INSTR_TIME_SET_SECONDS(t, s) \
- ((t).ticks = NS_PER_S * (s))
+ ((t).ticks = (s) * ticks_per_sec)
#define INSTR_TIME_GET_NANOSEC(t) \
- ((int64) (t).ticks)
-
+ pg_ticks_to_ns(t)
#else /* WIN32 */
/* Use QueryPerformanceCounter() */
-/* helper for INSTR_TIME_SET_CURRENT */
static inline instr_time
pg_query_performance_counter(void)
{
@@ -160,6 +249,11 @@ GetTimerFrequency(void)
return (double) f.QuadPart;
}
+#define INSTR_TIME_INITIALIZE()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_query_performance_counter())
+
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
@@ -180,13 +274,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks)
-
#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
-#define INSTR_TIME_SET_CURRENT_LAZY(t) \
- (INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
-
-
#define INSTR_TIME_ADD(x,y) \
((x).ticks += (y).ticks)
@@ -196,7 +285,6 @@ GetTimerFrequency(void)
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
((x).ticks += (y).ticks - (z).ticks)
-
#define INSTR_TIME_GET_DOUBLE(t) \
((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index f1c9ddf4a0..bec74b52ae 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -135,7 +135,7 @@ sub mkvcbuild
our @pgcommonallfiles = qw(
archive.c base64.c checksum_helper.c compression.c
config_info.c controldata_utils.c d2s.c encnames.c exec.c
- f2s.c file_perm.c file_utils.c hashfn.c ip.c jsonapi.c
+ f2s.c file_perm.c file_utils.c hashfn.c ip.c instr_time.o jsonapi.c
keywords.c kwlookup.c link-canary.c md5_common.c percentrepl.c
pg_get_line.c pg_lzcompress.c pg_prng.c pgfnames.c psprintf.c relpath.c
rmtree.c saslprep.c scram-common.c string.c stringinfo.c unicode_norm.c
--
2.34.1
Hi!
On 2/14/23 12:11, David Geier wrote:
Hi,
I think I fixed the compilation errors. It was due to a few variables
being declared under#if defined(__x86_64__) && defined(__linux__)
while being used also under non x86 Linux.
I also removed again the code to obtain the TSC frequency under
hypervisors because the TSC is usually emulated and therefore no
faster than clock_gettime() anyways. So we now simply fallback to
clock_gettime() on hypervisors when we cannot obtain the frequency via
leaf 0x16.Beyond that I reviewed the first two patches a while ago in [1]. I
hope we can progress with them to further reduce the size of this
patch set.[1]
/messages/by-id/3ac157f7-085d-e071-45fc-b87cd306360c@gmail.com
It still fails.
I'll get Cirrus-CI working on my own Github fork so I can make sure it
really compiles on all platforms before I submit a new version.
--
David Geier
(ServiceNow)
Hi!
On 2/14/23 13:48, David Geier wrote:
It still fails.
I'll get Cirrus-CI working on my own Github fork so I can make sure it
really compiles on all platforms before I submit a new version.
It took some time until Cirrus CI allowed me to run tests against my new
GitHub account (there's a 3 days freeze to avoid people from getting
Cirrus CI nodes to mine bitcoins :-D). Attached now the latest patch
which passes builds, rebased on latest master.
I also reviewed the first two patches a while ago in [1]/messages/by-id/3ac157f7-085d-e071-45fc-b87cd306360c@gmail.com. I hope we can
progress with them to further reduce the size of this patch set.
Beyond that: I could work on support for more OSs (e.g. starting with
Windows). Is there appetite for that or do we rather want to instead
start with a smaller patch?
[1]: /messages/by-id/3ac157f7-085d-e071-45fc-b87cd306360c@gmail.com
/messages/by-id/3ac157f7-085d-e071-45fc-b87cd306360c@gmail.com
--
David Geier
(ServiceNow)
Attachments:
v9-0001-instr_time-Add-INSTR_TIME_SET_SECONDS-INSTR_TIME_.patchtext/x-patch; charset=UTF-8; name=v9-0001-instr_time-Add-INSTR_TIME_SET_SECONDS-INSTR_TIME_.patchDownload
From d03a9be2522b0ef22fd58cbcfc95eb19ca8b2bea Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 20 Jan 2023 15:31:54 -0800
Subject: [PATCH v9 1/3] instr_time: Add INSTR_TIME_SET_SECONDS(),
INSTR_TIME_IS_LT()
INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop
without having to convert into time units (which is
costly). INSTR_TIME_IS_LT() can be used to check the loop condition.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/portability/instr_time.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index cc85138e21..aab80effb0 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -15,6 +15,8 @@
*
* INSTR_TIME_IS_ZERO(t) is t equal to zero?
*
+ * INSTR_TIME_IS_LT(x, y) x < y
+ *
* INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
*
* INSTR_TIME_SET_CURRENT(t) set t to current time
@@ -22,6 +24,8 @@
* INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
* evaluates to whether t changed
*
+ * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds
+ *
* INSTR_TIME_ADD(x, y) x += y
*
* INSTR_TIME_SUBTRACT(x, y) x -= y
@@ -122,6 +126,9 @@ pg_clock_gettime_ns(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_clock_gettime_ns())
+#define INSTR_TIME_SET_SECONDS(t, s) \
+ ((t).ticks = NS_PER_S * (s))
+
#define INSTR_TIME_GET_NANOSEC(t) \
((int64) (t).ticks)
@@ -156,6 +163,9 @@ GetTimerFrequency(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
+#define INSTR_TIME_SET_SECONDS(t, s) \
+ ((t).ticks = s * GetTimerFrequency())
+
#define INSTR_TIME_GET_NANOSEC(t) \
((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency())))
@@ -168,6 +178,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0)
+#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks)
+
#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
--
2.34.1
v9-0002-wip-report-nanoseconds-in-pg_test_timing.patchtext/x-patch; charset=UTF-8; name=v9-0002-wip-report-nanoseconds-in-pg_test_timing.patchDownload
From 639213b01102a7320a62ba5ed68a8e3d6a05514b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 16 Jan 2023 11:19:11 -0800
Subject: [PATCH v9 2/3] wip: report nanoseconds in pg_test_timing
This commit also updates pg_test_timing's documentation:
- compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of
of statement with/without EXPLAIN ANALYZE
- explain the 2x overhead (due to two timestamp acquisitions per row)
- remove old section about old versions of linux - I couldn't update the
numbers, and it's old enough nobody would care
---
doc/src/sgml/ref/pgtesttiming.sgml | 117 ++++++++++--------------
src/bin/pg_test_timing/pg_test_timing.c | 74 +++++++++------
2 files changed, 95 insertions(+), 96 deletions(-)
diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml
index a5eb3aa25e..7e0266cf58 100644
--- a/doc/src/sgml/ref/pgtesttiming.sgml
+++ b/doc/src/sgml/ref/pgtesttiming.sgml
@@ -93,28 +93,34 @@ PostgreSQL documentation
<para>
Good results will show most (>90%) individual timing calls take less than
- one microsecond. Average per loop overhead will be even lower, below 100
- nanoseconds. This example from an Intel i7-860 system using a TSC clock
- source shows excellent performance:
+ one microsecond (1000 nanoseconds). Average per loop overhead will be even
+ lower, below 100 nanoseconds. This example from an Intel i9-9880H system
+ using a TSC clock source shows excellent performance:
<screen><![CDATA[
Testing timing overhead for 3 seconds.
-Per loop time including overhead: 35.96 ns
+Per loop time including overhead: 13.74 ns
Histogram of timing durations:
- < us % of total count
- 1 96.40465 80435604
- 2 3.59518 2999652
- 4 0.00015 126
- 8 0.00002 13
- 16 0.00000 2
+ < ns % of total count
+ 16 97.00221 211857215
+ 32 2.99555 6542412
+ 64 0.00115 2505
+ 128 0.00035 759
+ 256 0.00004 78
+ 512 0.00000 3
+ 1024 0.00000 4
+ 2048 0.00034 732
+ 4096 0.00000 6
+ 8192 0.00000 8
+ 16384 0.00019 409
+ 32768 0.00018 403
+ 65536 0.00000 1
]]></screen>
</para>
<para>
- Note that different units are used for the per loop time than the
- histogram. The loop can have resolution within a few nanoseconds (ns),
- while the individual timing calls can only resolve down to one microsecond
- (us).
+ Note that the accuracy of the histogram entries may be lower than the
+ per loop time.
</para>
</refsect2>
@@ -125,24 +131,25 @@ Histogram of timing durations:
When the query executor is running a statement using
<command>EXPLAIN ANALYZE</command>, individual operations are timed as well
as showing a summary. The overhead of your system can be checked by
- counting rows with the <application>psql</application> program:
+ disabling the per-row timing, using the <literal>TIMING OFF</literal>
+ option:
<screen>
-CREATE TABLE t AS SELECT * FROM generate_series(1,100000);
-\timing
-SELECT COUNT(*) FROM t;
-EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
+CREATE TABLE t AS SELECT * FROM generate_series(1, 100000);
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t;
+EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM t;
</screen>
</para>
<para>
- The i7-860 system measured runs the count query in 9.8 ms while
- the <command>EXPLAIN ANALYZE</command> version takes 16.6 ms, each
- processing just over 100,000 rows. That 6.8 ms difference means the timing
- overhead per row is 68 ns, about twice what pg_test_timing estimated it
- would be. Even that relatively small amount of overhead is making the fully
- timed count statement take almost 70% longer. On more substantial queries,
- the timing overhead would be less problematic.
+ The i9-9880H system measured shows an execution time of 4.116 ms for the
+ <literal>TIMING OFF</literal> query, and 6.965 ms for the
+ <literal>TIMING ON</literal>, each processing 100,000 rows.
+
+ That 2.849 ms difference means the timing overhead per row is 28 ns. As
+ <literal>TIMING ON</literal> measures timestamps twice per row returned by
+ an executor node, the overhead is very close to what pg_test_timing
+ estimated it would be.
</para>
</refsect2>
@@ -157,28 +164,31 @@ EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
<screen><![CDATA[
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-tsc hpet acpi_pm
+tsc acpi_pm
# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
# pg_test_timing
-Per loop time including overhead: 722.92 ns
+Testing timing overhead for 3 seconds.
+Per loop time including overhead: 708.58 ns
Histogram of timing durations:
- < us % of total count
- 1 27.84870 1155682
- 2 72.05956 2990371
- 4 0.07810 3241
- 8 0.01357 563
- 16 0.00007 3
+ < ns % of total count
+ 1024 99.79796 4225270
+ 2048 0.15560 6588
+ 4096 0.00035 15
+ 8192 0.01738 736
+ 16384 0.01679 711
+ 32768 0.01190 504
]]></screen>
</para>
<para>
In this configuration, the sample <command>EXPLAIN ANALYZE</command> above
- takes 115.9 ms. That's 1061 ns of timing overhead, again a small multiple
- of what's measured directly by this utility. That much timing overhead
- means the actual query itself is only taking a tiny fraction of the
- accounted for time, most of it is being consumed in overhead instead. In
- this configuration, any <command>EXPLAIN ANALYZE</command> totals involving
- many timed operations would be inflated significantly by timing overhead.
+ shows an execution time of 148.7 ms. That's 1392 ns of per-row timing
+ overhead. Taking the two timestamps per row into account, that's again
+ close to what pg_test_timing estimated. That much timing overhead means
+ the actual query itself is only taking a tiny fraction of the accounted for
+ time, most of it is being consumed in overhead instead. In this
+ configuration, any <command>EXPLAIN ANALYZE</command> totals involving many
+ timed operations would be inflated significantly by timing overhead.
</para>
<para>
@@ -196,33 +206,6 @@ kern.timecounter.hardware: ACPI-fast -> TSC
</screen>
</para>
- <para>
- Other systems may only allow setting the time source on boot. On older
- Linux systems the "clock" kernel setting is the only way to make this sort
- of change. And even on some more recent ones, the only option you'll see
- for a clock source is "jiffies". Jiffies are the older Linux software clock
- implementation, which can have good resolution when it's backed by fast
- enough timing hardware, as in this example:
-
-<screen><![CDATA[
-$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-jiffies
-$ dmesg | grep time.c
-time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.
-time.c: Detected 2400.153 MHz processor.
-$ pg_test_timing
-Testing timing overhead for 3 seconds.
-Per timing duration including loop overhead: 97.75 ns
-Histogram of timing durations:
- < us % of total count
- 1 90.23734 27694571
- 2 9.75277 2993204
- 4 0.00981 3010
- 8 0.00007 22
- 16 0.00000 1
- 32 0.00000 1
-]]></screen></para>
-
</refsect2>
<refsect2>
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index c29d6f8762..e20718669a 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -19,8 +19,8 @@ static void handle_args(int argc, char *argv[]);
static uint64 test_timing(unsigned int duration);
static void output(uint64 loop_count);
-/* record duration in powers of 2 microseconds */
-long long int histogram[32];
+/* record duration in powers of 2 nanoseconds */
+uint64 histogram[64];
int
main(int argc, char *argv[])
@@ -121,35 +121,48 @@ handle_args(int argc, char *argv[])
static uint64
test_timing(unsigned int duration)
{
- uint64 total_time;
- int64 time_elapsed = 0;
uint64 loop_count = 0;
- uint64 prev,
- cur;
+ instr_time until_time,
+ total_time;
instr_time start_time,
- end_time,
- temp;
-
- total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;
+ end_time;
+ instr_time cur;
INSTR_TIME_SET_CURRENT(start_time);
- cur = INSTR_TIME_GET_MICROSEC(start_time);
- while (time_elapsed < total_time)
+ /*
+ * To reduce loop overhead, check loop condition in instr_time domain.
+ */
+ INSTR_TIME_SET_SECONDS(total_time, duration);
+ until_time = start_time;
+ INSTR_TIME_ADD(until_time, total_time);
+
+ cur = start_time;
+
+ while (INSTR_TIME_IS_LT(cur, until_time))
{
- int32 diff,
- bits = 0;
+ instr_time temp;
+ instr_time prev;
+ int64 diff;
+ int32 bits = 0;
prev = cur;
- INSTR_TIME_SET_CURRENT(temp);
- cur = INSTR_TIME_GET_MICROSEC(temp);
- diff = cur - prev;
+ INSTR_TIME_SET_CURRENT(cur);
+ temp = cur;
+ INSTR_TIME_SUBTRACT(temp, prev);
+ diff = INSTR_TIME_GET_NANOSEC(temp);
/* Did time go backwards? */
- if (diff < 0)
+ if (unlikely(diff <= 0))
{
+ /* can't do anything with that measurement */
+ if (diff == 0)
+ {
+ loop_count++;
+ continue;
+ }
fprintf(stderr, _("Detected clock going backwards in time.\n"));
- fprintf(stderr, _("Time warp: %d ms\n"), diff);
+ fprintf(stderr, _("Time warp: %lld ns\n"), (long long) diff);
exit(1);
}
@@ -164,8 +177,6 @@ test_timing(unsigned int duration)
histogram[bits]++;
loop_count++;
- INSTR_TIME_SUBTRACT(temp, start_time);
- time_elapsed = INSTR_TIME_GET_MICROSEC(temp);
}
INSTR_TIME_SET_CURRENT(end_time);
@@ -173,7 +184,7 @@ test_timing(unsigned int duration)
INSTR_TIME_SUBTRACT(end_time, start_time);
printf(_("Per loop time including overhead: %0.2f ns\n"),
- INSTR_TIME_GET_DOUBLE(end_time) * 1e9 / loop_count);
+ (INSTR_TIME_GET_DOUBLE(end_time) * NS_PER_S) / loop_count);
return loop_count;
}
@@ -181,9 +192,10 @@ test_timing(unsigned int duration)
static void
output(uint64 loop_count)
{
- int64 max_bit = 31,
+ int64 low_bit = 0,
+ max_bit = 63,
i;
- char *header1 = _("< us");
+ char *header1 = _("< ns");
char *header2 = /* xgettext:no-c-format */ _("% of total");
char *header3 = _("count");
int len1 = strlen(header1);
@@ -194,15 +206,19 @@ output(uint64 loop_count)
while (max_bit > 0 && histogram[max_bit] == 0)
max_bit--;
+ /* find lowest bit value */
+ while (low_bit < max_bit && histogram[low_bit] == 0)
+ low_bit++;
+
printf(_("Histogram of timing durations:\n"));
printf("%*s %*s %*s\n",
- Max(6, len1), header1,
+ Max(9, len1), header1,
Max(10, len2), header2,
Max(10, len3), header3);
- for (i = 0; i <= max_bit; i++)
- printf("%*ld %*.5f %*lld\n",
- Max(6, len1), 1l << i,
+ for (i = low_bit; i <= max_bit; i++)
+ printf("%*ld %*.5f %*llu\n",
+ Max(9, len1), 1l << i,
Max(10, len2) - 1, (double) histogram[i] * 100 / loop_count,
- Max(10, len3), histogram[i]);
+ Max(10, len3), (long long unsigned) histogram[i]);
}
--
2.34.1
v9-0003-Use-time-stamp-counter-to-measure-time-on-Linux-x.patchtext/x-patch; charset=UTF-8; name=v9-0003-Use-time-stamp-counter-to-measure-time-on-Linux-x.patchDownload
From 1e0f93a8973449bf8b5e6e01dbcbf10e7b065b26 Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Mon, 23 Jan 2023 18:13:55 +0100
Subject: [PATCH v9 3/3] Use time stamp counter to measure time on Linux/x86
We switch to using the time stamp counter (TSC) instead of clock_gettime()
to reduce overhead of EXPLAIN (ANALYZE, TIME ON). Tests showed that runtime
is reduced by around 10% for queries moving lots of rows through the plan.
For now this is only enabled on Linux/x86, in case the system clocksource is
reported as TSC. Relying on the Linux kernel simplifies the logic to detect
if the present TSC is usable (frequency invariant, synchronized between
sockets, etc.). In all other cases we fallback to clock_gettime().
Note, that we intentionally use RDTSC in the fast paths, rather than RDTSCP.
RDTSCP waits for outstanding instructions to retire on out-of-order CPUs.
This adds noticably for little benefit in the typical InstrStartNode() /
InstrStopNode() use case. The macro to be used in such cases is called
INSTR_TIME_SET_CURRENT_FAST(). The original macro INSTR_TIME_SET_CURRENT()
uses RDTSCP and is supposed to be used when precision is more important
than performance.
Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
---
src/backend/access/heap/vacuumlazy.c | 3 +-
src/backend/commands/explain.c | 2 +-
src/backend/executor/instrument.c | 12 ++-
src/backend/utils/init/postinit.c | 3 +
src/bin/pg_test_timing/pg_test_timing.c | 7 +-
src/bin/pgbench/pgbench.c | 3 +
src/bin/psql/startup.c | 4 +
src/common/Makefile | 1 +
src/common/instr_time.c | 130 ++++++++++++++++++++++
src/common/meson.build | 1 +
src/include/portability/instr_time.h | 137 ++++++++++++++++++++----
src/tools/msvc/Mkvcbuild.pm | 2 +-
12 files changed, 273 insertions(+), 32 deletions(-)
create mode 100644 src/common/instr_time.c
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 8f14cf85f3..212093568b 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3002,8 +3002,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
INSTR_TIME_SET_CURRENT(currenttime);
elapsed = currenttime;
INSTR_TIME_SUBTRACT(elapsed, starttime);
- if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000)
- >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
+ if (INSTR_TIME_GET_MILLISEC(elapsed) >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
{
if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
{
diff --git a/src/backend/commands/explain.c b/src/backend/commands/explain.c
index e57bda7b62..9eca5901f0 100644
--- a/src/backend/commands/explain.c
+++ b/src/backend/commands/explain.c
@@ -526,7 +526,7 @@ ExplainOnePlan(PlannedStmt *plannedstmt, IntoClause *into, ExplainState *es,
double totaltime = 0;
int eflags;
int instrument_option = 0;
-
+
Assert(plannedstmt->commandType != CMD_UTILITY);
if (es->analyze && es->timing)
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index ee78a5749d..1ce3d62cff 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -67,9 +67,13 @@ InstrInit(Instrumentation *instr, int instrument_options)
void
InstrStartNode(Instrumentation *instr)
{
- if (instr->need_timer &&
- !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ if (instr->need_timer)
+ {
+ if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStartNode called twice in a row");
+ else
+ INSTR_TIME_SET_CURRENT_FAST(instr->starttime);
+ }
/* save buffer usage totals at node entry, if needed */
if (instr->need_bufusage)
@@ -95,7 +99,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (INSTR_TIME_IS_ZERO(instr->starttime))
elog(ERROR, "InstrStopNode called without start");
- INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_SET_CURRENT_FAST(endtime);
INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 2f07ca7a0e..2eef902483 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -805,6 +805,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
/* Initialize portal manager */
EnablePortalManager();
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* Initialize status reporting */
pgstat_beinit();
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index e20718669a..d185c8fe2e 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -128,7 +128,8 @@ test_timing(unsigned int duration)
end_time;
instr_time cur;
- INSTR_TIME_SET_CURRENT(start_time);
+ INSTR_TIME_INITIALIZE();
+ INSTR_TIME_SET_CURRENT_FAST(start_time);
/*
* To reduce loop overhead, check loop condition in instr_time domain.
@@ -147,7 +148,7 @@ test_timing(unsigned int duration)
int32 bits = 0;
prev = cur;
- INSTR_TIME_SET_CURRENT(cur);
+ INSTR_TIME_SET_CURRENT_FAST(cur);
temp = cur;
INSTR_TIME_SUBTRACT(temp, prev);
diff = INSTR_TIME_GET_NANOSEC(temp);
@@ -179,7 +180,7 @@ test_timing(unsigned int duration)
loop_count++;
}
- INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_SET_CURRENT_FAST(end_time);
INSTR_TIME_SUBTRACT(end_time, start_time);
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 508ed218e8..4125bc8cb7 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7059,6 +7059,9 @@ main(int argc, char **argv)
initRandomState(&state[i].cs_func_rs);
}
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* opening connection... */
con = doConnect();
if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index 5a28b6f713..3f95e01320 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
#include "help.h"
#include "input.h"
#include "mainloop.h"
+#include "portability/instr_time.h"
#include "settings.h"
/*
@@ -322,6 +323,9 @@ main(int argc, char *argv[])
PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
SyncVariables();
if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index 113029bf7b..23a5c68705 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -59,6 +59,7 @@ OBJS_COMMON = \
file_perm.o \
file_utils.o \
hashfn.o \
+ instr_time.o \
ip.o \
jsonapi.o \
keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 0000000000..c73956d0df
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,130 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ * Non-inline parts of the portable high-precision interval timing
+ * implementation
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "portability/instr_time.h"
+
+#ifndef WIN32
+/*
+ * Stores what the number of cycles needs to be multiplied with to end up
+ * with nanoseconds using integer math. See comment in pg_initialize_rdtsc()
+ * for more details.
+ *
+ * By default assume we are using clock_gettime() as a fallback which uses
+ * nanoseconds as ticks. Hence, we set the multiplier to the precision scalar
+ * so that the division in INSTR_TIME_GET_NANOSEC() won't change the nanoseconds.
+ *
+ * When using the RDTSC instruction directly this is filled in during initialization
+ * based on the relevant CPUID fields.
+ */
+int64 ticks_per_ns_scaled = TICKS_TO_NS_PRECISION;
+int64 ticks_per_sec = NS_PER_S;
+int64 max_ticks_no_overflow = PG_INT64_MAX / TICKS_TO_NS_PRECISION;
+
+#if defined(__x86_64__) && defined(__linux__)
+/*
+ * Indicates if RDTSC can be used (Linux/x86 only, when OS uses TSC clocksource)
+ */
+bool has_rdtsc = false;
+
+/*
+ * Indicates if RDTSCP can be used. True if RDTSC can be used and RDTSCP is available.
+ */
+bool has_rdtscp = false;
+
+/*
+ * This fails on hypervisors. On hypervisors a different CPUID leaf could be used
+ * to obtain the TSC frequency. We don't do that because the TSC on hypervisors
+ * is usually emulated and the performance of emulated an TSC is usually as bad as
+ * the performance of the hypervisor's clock_gettime() implementation.
+ */
+static bool get_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ uint32 r0, r1, r2;
+ int result = __get_cpuid(0x16, tsc_freq, &r0, &r1, &r2);
+ *tsc_freq *= 1000; // Convert from MHz to KHz
+ return result > 0 && *tsc_freq > 0;
+}
+
+static bool is_rdtscp_available()
+{
+ uint32 r0, r1, r2, r3;
+ return __get_cpuid(0x80000001, &r0, &r1, &r2, &r3) > 0 && (r3 & (1 << 27)) != 0;
+}
+
+/*
+ * Decide whether we use the RDTSC instruction at runtime, for Linux/x86,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ *
+ * Since Linux already does a significant amount of work to determine
+ * whether TSC is a viable clock source, decide based on that.
+ */
+void pg_initialize_rdtsc(void)
+{
+ FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+
+ if (fp)
+ {
+ char buf[128];
+
+ if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
+ {
+ /*
+ * Compute baseline CPU peformance, determines speed at which RDTSC advances.
+ */
+ uint32 tsc_freq;
+
+ if (get_tsc_frequency_khz(&tsc_freq))
+ {
+ /*
+ * Ticks to nanoseconds conversion requires floating point math
+ * because because:
+ *
+ * sec = ticks / frequency_hz
+ * ns = ticks / frequency_hz * 1,000,000,000
+ * ns = ticks * (1,000,000,000 / frequency_hz)
+ * ns = ticks * (1,000,000 / frequency_khz) <-- now in kilohertz
+ *
+ * Here, 'ns' is usually a floating number. For example for a 2.5 GHz
+ * CPU the scaling factor becomes 1,000,000 / 2,500,000 = 1.2.
+ *
+ * To be able to use integer math we work around the lack of precision.
+ * We first scale the integer up and after the multiplication by the
+ * number of ticks in INSTR_TIME_GET_NANOSEC() we divide again by the
+ * same value. We picked the scaler such that it provides enough precision
+ * and is a power-of-two which allows for shifting instead of doing an
+ * integer division.
+ */
+ ticks_per_ns_scaled = INT64CONST(1000000) * TICKS_TO_NS_PRECISION / tsc_freq;
+ ticks_per_sec = tsc_freq * 1000; // KHz -> Hz
+ max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
+
+ has_rdtsc = true;
+ has_rdtscp = is_rdtscp_available();
+ }
+ }
+
+ fclose(fp);
+ }
+}
+#endif /* defined(__x86_64__) && defined(__linux__) */
+
+#endif /* WIN32 */
diff --git a/src/common/meson.build b/src/common/meson.build
index 41bd58ebdf..000983c6bf 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -11,6 +11,7 @@ common_sources = files(
'file_perm.c',
'file_utils.c',
'hashfn.c',
+ 'instr_time.c',
'ip.c',
'jsonapi.c',
'keywords.c',
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index aab80effb0..3f17cd2102 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,9 +4,11 @@
* portable high-precision interval timing
*
* This file provides an abstraction layer to hide portability issues in
- * interval timing. On Unix we use clock_gettime(), and on Windows we use
- * QueryPerformanceCounter(). These macros also give some breathing room to
- * use other high-precision-timing APIs.
+ * interval timing. On Linux/x86 we use the rdtsc instruction when a TSC
+ * clocksource is also used on the host OS. Otherwise, and on other
+ * Unix-like systems we use clock_gettime() and on Windows we use
+ * QueryPerformanceCounter(). These macros also give some breathing
+ * room to use other high-precision-timing APIs.
*
* The basic data type is instr_time, which all callers should treat as an
* opaque typedef. instr_time can store either an absolute time (of
@@ -19,10 +21,11 @@
*
* INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
*
- * INSTR_TIME_SET_CURRENT(t) set t to current time
- *
- * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
- * evaluates to whether t changed
+ * INSTR_TIME_SET_CURRENT_FAST(t) set t to current time without waiting
+ * for instructions in out-of-order window
+ *
+ * INSTR_TIME_SET_CURRENT(t) set t to current time while waiting for
+ * instructions in OOO to retire
*
* INSTR_TIME_SET_SECONDS(t, s) set t to s seconds
*
@@ -84,7 +87,15 @@ typedef struct instr_time
#ifndef WIN32
+/*
+ * Make sure this is a power-of-two, so that the compiler can turn the
+ * multiplications and divisions into shifts.
+ */
+#define TICKS_TO_NS_PRECISION (1<<14)
+extern int64 ticks_per_ns_scaled;
+extern int64 ticks_per_sec;
+extern int64 max_ticks_no_overflow;
/* Use clock_gettime() */
@@ -110,35 +121,120 @@ typedef struct instr_time
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-/* helper for INSTR_TIME_SET_CURRENT */
+#if defined(__x86_64__) && defined(__linux__)
+#include <x86intrin.h>
+#include <cpuid.h>
+
+extern bool has_rdtsc;
+extern bool has_rdtscp;
+
+extern void pg_initialize_rdtsc(void);
+#endif
+
static inline instr_time
-pg_clock_gettime_ns(void)
+pg_clock_gettime()
{
instr_time now;
struct timespec tmp;
clock_gettime(PG_INSTR_CLOCK, &tmp);
now.ticks = tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
-
return now;
}
+static inline instr_time
+pg_get_ticks_fast(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtsc)
+ {
+ instr_time now;
+ uint32 a;
+ now.ticks = __rdtscp(&a);
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline instr_time
+pg_get_ticks(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtscp)
+ {
+ instr_time now;
+ uint32 unused;
+ now.ticks = __rdtscp(&unused);
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline int64_t
+pg_ticks_to_ns(instr_time t)
+{
+ /*
+ * Would multiplication overflow? If so perform computation in two parts.
+ * Check overflow without actually overflowing via:
+ * a * b > max <=> a > max / b
+ */
+ int64 ns = 0;
+
+ if (unlikely(t.ticks > max_ticks_no_overflow))
+ {
+ /*
+ * Compute how often the maximum number of ticks fits completely into the
+ * number of elapsed ticks and convert that number into nanoseconds.
+ * Then multiply by the count to arrive at the final value. In a 2nd step
+ * we adjust the number of elapsed ticks and convert the remaining ticks.
+ */
+ int64 count = t.ticks / max_ticks_no_overflow;
+ int64 max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ ns = max_ns * count;
+
+ /*
+ * Subtract the ticks that we now already accounted for, so that they don't
+ * get counted twice.
+ */
+ t.ticks -= count * max_ticks_no_overflow;
+ Assert(t.ticks >= 0);
+ }
+
+ ns += t.ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ return ns;
+}
+
+static inline void pg_initialize_get_ticks()
+{
+#if defined(__x86_64__) && defined(__linux__)
+ pg_initialize_rdtsc();
+#endif
+}
+
+#define INSTR_TIME_INITIALIZE() \
+ pg_initialize_get_ticks()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_get_ticks_fast())
+
#define INSTR_TIME_SET_CURRENT(t) \
- ((t) = pg_clock_gettime_ns())
+ ((t) = pg_get_ticks())
#define INSTR_TIME_SET_SECONDS(t, s) \
- ((t).ticks = NS_PER_S * (s))
+ ((t).ticks = (s) * ticks_per_sec)
#define INSTR_TIME_GET_NANOSEC(t) \
- ((int64) (t).ticks)
-
+ pg_ticks_to_ns(t)
#else /* WIN32 */
/* Use QueryPerformanceCounter() */
-/* helper for INSTR_TIME_SET_CURRENT */
static inline instr_time
pg_query_performance_counter(void)
{
@@ -160,6 +256,11 @@ GetTimerFrequency(void)
return (double) f.QuadPart;
}
+#define INSTR_TIME_INITIALIZE()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_query_performance_counter())
+
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
@@ -180,13 +281,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks)
-
#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
-#define INSTR_TIME_SET_CURRENT_LAZY(t) \
- (INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
-
-
#define INSTR_TIME_ADD(x,y) \
((x).ticks += (y).ticks)
@@ -196,7 +292,6 @@ GetTimerFrequency(void)
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
((x).ticks += (y).ticks - (z).ticks)
-
#define INSTR_TIME_GET_DOUBLE(t) \
((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
diff --git a/src/tools/msvc/Mkvcbuild.pm b/src/tools/msvc/Mkvcbuild.pm
index f1c9ddf4a0..bec74b52ae 100644
--- a/src/tools/msvc/Mkvcbuild.pm
+++ b/src/tools/msvc/Mkvcbuild.pm
@@ -135,7 +135,7 @@ sub mkvcbuild
our @pgcommonallfiles = qw(
archive.c base64.c checksum_helper.c compression.c
config_info.c controldata_utils.c d2s.c encnames.c exec.c
- f2s.c file_perm.c file_utils.c hashfn.c ip.c jsonapi.c
+ f2s.c file_perm.c file_utils.c hashfn.c ip.c instr_time.o jsonapi.c
keywords.c kwlookup.c link-canary.c md5_common.c percentrepl.c
pg_get_line.c pg_lzcompress.c pg_prng.c pgfnames.c psprintf.c relpath.c
rmtree.c saslprep.c scram-common.c string.c stringinfo.c unicode_norm.c
--
2.34.1
On Mon, 20 Feb 2023 at 16:06, David Geier <geidav.pg@gmail.com> wrote:
Hi!
On 2/14/23 13:48, David Geier wrote:
It still fails.
I'll get Cirrus-CI working on my own Github fork so I can make sure it
really compiles on all platforms before I submit a new version.It took some time until Cirrus CI allowed me to run tests against my new
GitHub account (there's a 3 days freeze to avoid people from getting
Cirrus CI nodes to mine bitcoins :-D). Attached now the latest patch
which passes builds, rebased on latest master.I also reviewed the first two patches a while ago in [1]. I hope we can
progress with them to further reduce the size of this patch set.Beyond that: I could work on support for more OSs (e.g. starting with
Windows). Is there appetite for that or do we rather want to instead
start with a smaller patch?
Are we planning to continue on this and take it further?
I'm seeing that there has been no activity in this thread for nearly 1
year now, I'm planning to close this in the current commitfest unless
someone is planning to take it forward.
Regards,
Vignesh
On Sat, 20 Jan 2024 at 09:03, vignesh C <vignesh21@gmail.com> wrote:
On Mon, 20 Feb 2023 at 16:06, David Geier <geidav.pg@gmail.com> wrote:
Hi!
On 2/14/23 13:48, David Geier wrote:
It still fails.
I'll get Cirrus-CI working on my own Github fork so I can make sure it
really compiles on all platforms before I submit a new version.It took some time until Cirrus CI allowed me to run tests against my new
GitHub account (there's a 3 days freeze to avoid people from getting
Cirrus CI nodes to mine bitcoins :-D). Attached now the latest patch
which passes builds, rebased on latest master.I also reviewed the first two patches a while ago in [1]. I hope we can
progress with them to further reduce the size of this patch set.Beyond that: I could work on support for more OSs (e.g. starting with
Windows). Is there appetite for that or do we rather want to instead
start with a smaller patch?Are we planning to continue on this and take it further?
I'm seeing that there has been no activity in this thread for nearly 1
year now, I'm planning to close this in the current commitfest unless
someone is planning to take it forward.
Since the author or no one else showed interest in taking it forward
and the patch had no activity for more than 1 year, I have changed the
status to RWF. Feel free to add a new CF entry when someone is
planning to resume work more actively.
Regards,
Vignesh
Hi,
At some point this patch switched from rdtsc to rdtscp, which imo largely
negates the point of it. What lead to that?
Greetings,
Andres Freund
On Sun, Jun 2, 2024 at 1:08 AM Andres Freund <andres@anarazel.de> wrote:
At some point this patch switched from rdtsc to rdtscp, which imo largely
negates the point of it. What lead to that?
From what I can gather, it appears this was an oversight when David first
reapplied the work on the instr_time changes that were committed.
I've come back to this and rebased this, as well as:
- Corrected the use of RDTSCP to RDTSC in pg_get_ticks_fast
- Check 16H register if 15H register does not contain frequency information
(per research, relevant for some CPUs)
- Fixed incorrect reporting in pg_test_timing due to too small histogram
(32 => 64 bits)
- Fixed indentation per pgindent
- Added support for VMs running under KVM/VMware Hypervisors
On that last item, this does indeed make a difference on VMs, contrary to
the code comment in earlier versions (and I've not seen any odd behaviors
again, FWIW):
On a c5.xlarge (Skylake-SP or Cascade Lake) on AWS, with the same test as
done initially in this thread:
SELECT COUNT(*) FROM lotsarows;
Time: 974.423 ms
EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM lotsarows;
Time: 1336.196 ms (00:01.336)
Without patch:
EXPLAIN (ANALYZE) SELECT COUNT(*) FROM lotsarows;
Time: 2165.069 ms (00:02.165)
Per loop time including overhead: 22.15 ns
With patch:
EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM lotsarows;
Time: 1654.289 ms (00:01.654)
Per loop time including overhead: 9.81 ns
I'm registering this again in the current commitfest to help reviews.
Open questions I have:
- Could we rely on checking whether the TSC timesource is invariant (via
CPUID), instead of relying on Linux choosing it as a clocksource?
- For the Hypervisor CPUID checks I had to rely on __cpuidex which is only
available on newer GCC versions (
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95973), how do we best check
for its presence? (compiler version, or rather configure check?) -- note
this is also the reason the patch fails the clang compiler warning check in
CI, despite clang having support in recent versions (
https://reviews.llvm.org/D121653)
Thanks,
Lukas
--
Lukas Fittl
Attachments:
v10-0001-instr_time-Add-INSTR_TIME_SET_SECONDS-INSTR_TIME.patchapplication/octet-stream; name=v10-0001-instr_time-Add-INSTR_TIME_SET_SECONDS-INSTR_TIME.patchDownload
From 96884c75f39a7d9e0aefc8586c4254fdfedf692b Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Fri, 20 Jan 2023 15:31:54 -0800
Subject: [PATCH v10 1/3] instr_time: Add INSTR_TIME_SET_SECONDS(),
INSTR_TIME_IS_LT()
INSTR_TIME_SET_SECONDS() is useful to calculate the end of a time-bound loop
without having to convert into time units (which is
costly). INSTR_TIME_IS_LT() can be used to check the loop condition.
Author:
Reviewed-by:
Discussion: https://postgr.es/m/
Backpatch:
---
src/include/portability/instr_time.h | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index f71a851b18d..6ad411c6866 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -15,6 +15,8 @@
*
* INSTR_TIME_IS_ZERO(t) is t equal to zero?
*
+ * INSTR_TIME_IS_LT(x, y) x < y
+ *
* INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
*
* INSTR_TIME_SET_CURRENT(t) set t to current time
@@ -22,6 +24,8 @@
* INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
* evaluates to whether t changed
*
+ * INSTR_TIME_SET_SECONDS(t, s) set t to s seconds
+ *
* INSTR_TIME_ADD(x, y) x += y
*
* INSTR_TIME_SUBTRACT(x, y) x -= y
@@ -122,6 +126,9 @@ pg_clock_gettime_ns(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_clock_gettime_ns())
+#define INSTR_TIME_SET_SECONDS(t, s) \
+ ((t).ticks = NS_PER_S * (s))
+
#define INSTR_TIME_GET_NANOSEC(t) \
((int64) (t).ticks)
@@ -156,6 +163,9 @@ GetTimerFrequency(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
+#define INSTR_TIME_SET_SECONDS(t, s) \
+ ((t).ticks = s * GetTimerFrequency())
+
#define INSTR_TIME_GET_NANOSEC(t) \
((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency())))
@@ -168,6 +178,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0)
+#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks)
+
#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
--
2.47.1
v10-0003-Use-time-stamp-counter-to-measure-time-on-Linux-.patchapplication/octet-stream; name=v10-0003-Use-time-stamp-counter-to-measure-time-on-Linux-.patchDownload
From 7ab1d696483f4bd6a74a4e657708609af7b49c07 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Sun, 23 Feb 2025 20:44:10 -0800
Subject: [PATCH v10 3/3] Use time stamp counter to measure time on Linux/x86
We switch to using the time stamp counter (TSC) instead of clock_gettime()
to reduce overhead of EXPLAIN (ANALYZE, TIME ON). Tests showed that runtime
is reduced by around 10% for queries moving lots of rows through the plan.
For now this is only enabled on Linux/x86, in case the system clocksource is
reported as TSC. Relying on the Linux kernel simplifies the logic to detect
if the present TSC is usable (frequency invariant, synchronized between
sockets, etc.). In all other cases we fallback to clock_gettime().
Note, that we intentionally use RDTSC in the fast paths, rather than RDTSCP.
RDTSCP waits for outstanding instructions to retire on out-of-order CPUs.
This adds noticably for little benefit in the typical InstrStartNode() /
InstrStopNode() use case. The macro to be used in such cases is called
INSTR_TIME_SET_CURRENT_FAST(). The original macro INSTR_TIME_SET_CURRENT()
uses RDTSCP and is supposed to be used when precision is more important
than performance.
Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
---
src/backend/access/heap/vacuumlazy.c | 3 +-
src/backend/executor/instrument.c | 12 +-
src/backend/utils/init/postinit.c | 3 +
src/bin/pg_test_timing/pg_test_timing.c | 7 +-
src/bin/pgbench/pgbench.c | 3 +
src/bin/psql/startup.c | 4 +
src/common/Makefile | 1 +
src/common/instr_time.c | 170 ++++++++++++++++++++++++
src/common/meson.build | 1 +
src/include/portability/instr_time.h | 139 ++++++++++++++++---
10 files changed, 314 insertions(+), 29 deletions(-)
create mode 100644 src/common/instr_time.c
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 1af18a78a2b..158868729ab 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3353,8 +3353,7 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
INSTR_TIME_SET_CURRENT(currenttime);
elapsed = currenttime;
INSTR_TIME_SUBTRACT(elapsed, starttime);
- if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000)
- >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
+ if (INSTR_TIME_GET_MILLISEC(elapsed) >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
{
if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
{
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 56e635f4700..01f67c5d972 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -67,9 +67,13 @@ InstrInit(Instrumentation *instr, int instrument_options)
void
InstrStartNode(Instrumentation *instr)
{
- if (instr->need_timer &&
- !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ if (instr->need_timer)
+ {
+ if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStartNode called twice in a row");
+ else
+ INSTR_TIME_SET_CURRENT_FAST(instr->starttime);
+ }
/* save buffer usage totals at node entry, if needed */
if (instr->need_bufusage)
@@ -95,7 +99,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (INSTR_TIME_IS_ZERO(instr->starttime))
elog(ERROR, "InstrStopNode called without start");
- INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_SET_CURRENT_FAST(endtime);
INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 01bb6a410cb..f1111f917d1 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -785,6 +785,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
/* Initialize portal manager */
EnablePortalManager();
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* Initialize status reporting */
pgstat_beinit();
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index 29b0db6d619..62a15465f7c 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -128,7 +128,8 @@ test_timing(unsigned int duration)
end_time;
instr_time cur;
- INSTR_TIME_SET_CURRENT(start_time);
+ INSTR_TIME_INITIALIZE();
+ INSTR_TIME_SET_CURRENT_FAST(start_time);
/*
* To reduce loop overhead, check loop condition in instr_time domain.
@@ -147,7 +148,7 @@ test_timing(unsigned int duration)
int32 bits = 0;
prev = cur;
- INSTR_TIME_SET_CURRENT(cur);
+ INSTR_TIME_SET_CURRENT_FAST(cur);
temp = cur;
INSTR_TIME_SUBTRACT(temp, prev);
diff = INSTR_TIME_GET_NANOSEC(temp);
@@ -179,7 +180,7 @@ test_timing(unsigned int duration)
loop_count++;
}
- INSTR_TIME_SET_CURRENT(end_time);
+ INSTR_TIME_SET_CURRENT_FAST(end_time);
INSTR_TIME_SUBTRACT(end_time, start_time);
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index fdc957fa34d..88dc3b04725 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7268,6 +7268,9 @@ main(int argc, char **argv)
initRandomState(&state[i].cs_func_rs);
}
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* opening connection... */
con = doConnect();
if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index 703f3f582c1..96bb72c1263 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
#include "help.h"
#include "input.h"
#include "mainloop.h"
+#include "portability/instr_time.h"
#include "settings.h"
/*
@@ -322,6 +323,9 @@ main(int argc, char *argv[])
PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
SyncVariables();
if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index 1e2b91c83c4..194e45e7ae8 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -59,6 +59,7 @@ OBJS_COMMON = \
file_perm.o \
file_utils.o \
hashfn.o \
+ instr_time.o \
ip.o \
jsonapi.o \
keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 00000000000..9b18e6e3588
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,170 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ * Non-inline parts of the portable high-precision interval timing
+ * implementation
+ *
+ * Portions Copyright (c) 2022, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#include "portability/instr_time.h"
+
+#ifndef WIN32
+/*
+ * Stores what the number of cycles needs to be multiplied with to end up
+ * with nanoseconds using integer math. See comment in pg_initialize_rdtsc()
+ * for more details.
+ *
+ * By default assume we are using clock_gettime() as a fallback which uses
+ * nanoseconds as ticks. Hence, we set the multiplier to the precision scalar
+ * so that the division in INSTR_TIME_GET_NANOSEC() won't change the nanoseconds.
+ *
+ * When using the RDTSC instruction directly this is filled in during initialization
+ * based on the relevant CPUID fields.
+ */
+int64 ticks_per_ns_scaled = TICKS_TO_NS_PRECISION;
+int64 ticks_per_sec = NS_PER_S;
+int64 max_ticks_no_overflow = PG_INT64_MAX / TICKS_TO_NS_PRECISION;
+
+#if defined(__x86_64__) && defined(__linux__)
+/*
+ * Indicates if RDTSC can be used (Linux/x86 only, when OS uses TSC clocksource)
+ */
+bool has_rdtsc = false;
+
+/*
+ * Indicates if RDTSCP can be used. True if RDTSC can be used and RDTSCP is available.
+ */
+bool has_rdtscp = false;
+
+#define CPUID_HYPERVISOR_VMWARE(words) (words[1] == 0x61774d56 && words[2] == 0x4d566572 && words[3] == 0x65726177) /* VMwareVMware */
+#define CPUID_HYPERVISOR_KVM(words) (words[1] == 0x4b4d564b && words[2] == 0x564b4d56 && words[3] == 0x0000004d) /* KVMKVMKVM */
+
+static bool
+get_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ uint32 r[4];
+
+ if (__get_cpuid(0x15, &r[0] /* denominator */ , &r[1] /* numerator */ , &r[2] /* hz */ , &r[3]) && r[2] > 0)
+ {
+ if (r[0] == 0 || r[1] == 0)
+ return false;
+
+ *tsc_freq = r[2] / 1000 * r[1] / r[0];
+ return true;
+ }
+
+ /* Some CPUs only report frequency in 16H */
+ if (__get_cpuid(0x16, &r[0] /* base_mhz */ , &r[1], &r[2], &r[3]))
+ {
+ *tsc_freq = r[0] * 1000;
+ return true;
+ }
+
+ /*
+ * Check if we have a KVM or VMware Hypervisor passing down TSC frequency
+ * to us in a guest VM
+ *
+ * Note that accessing the 0x40000000 leaf for Hypervisor info requires
+ * use of __cpuidex to set ECX to 0.
+ *
+ * TODO: We need to check whether our compiler is new enough
+ * (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95973)
+ */
+ __cpuidex((int32 *) r, 0x40000000, 0);
+ if (r[0] >= 0x40000010 && (CPUID_HYPERVISOR_VMWARE(r) || CPUID_HYPERVISOR_KVM(r)))
+ {
+ __cpuidex((int32 *) r, 0x40000010, 0);
+ if (r[0] > 0)
+ {
+ *tsc_freq = r[0];
+ return true;
+ }
+ }
+
+ return false;
+}
+
+static bool
+is_rdtscp_available()
+{
+ uint32 r[4];
+
+ return __get_cpuid(0x80000001, &r[0], &r[1], &r[2], &r[3]) > 0 && (r[3] & (1 << 27)) != 0;
+}
+
+/*
+ * Decide whether we use the RDTSC instruction at runtime, for Linux/x86,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ *
+ * Since Linux already does a significant amount of work to determine
+ * whether TSC is a viable clock source, decide based on that.
+ */
+void
+pg_initialize_rdtsc(void)
+{
+ FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+
+ if (fp)
+ {
+ char buf[128];
+
+ if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
+ {
+ /*
+ * Compute baseline CPU peformance, determines speed at which
+ * RDTSC advances.
+ */
+ uint32 tsc_freq;
+
+ if (get_tsc_frequency_khz(&tsc_freq))
+ {
+ /*
+ * Ticks to nanoseconds conversion requires floating point
+ * math because because:
+ *
+ * sec = ticks / frequency_hz ns = ticks / frequency_hz *
+ * 1,000,000,000 ns = ticks * (1,000,000,000 / frequency_hz)
+ * ns = ticks * (1,000,000 / frequency_khz) <-- now in
+ * kilohertz
+ *
+ * Here, 'ns' is usually a floating number. For example for a
+ * 2.5 GHz CPU the scaling factor becomes 1,000,000 /
+ * 2,500,000 = 1.2.
+ *
+ * To be able to use integer math we work around the lack of
+ * precision. We first scale the integer up and after the
+ * multiplication by the number of ticks in
+ * INSTR_TIME_GET_NANOSEC() we divide again by the same value.
+ * We picked the scaler such that it provides enough precision
+ * and is a power-of-two which allows for shifting instead of
+ * doing an integer division.
+ */
+ ticks_per_ns_scaled = INT64CONST(1000000) * TICKS_TO_NS_PRECISION / tsc_freq;
+ ticks_per_sec = tsc_freq * 1000; /* KHz->Hz */
+ max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
+
+ has_rdtsc = true;
+ has_rdtscp = is_rdtscp_available();
+ }
+ }
+
+ fclose(fp);
+ }
+}
+#endif /* defined(__x86_64__) && defined(__linux__) */
+
+#endif /* WIN32 */
diff --git a/src/common/meson.build b/src/common/meson.build
index 1540ba67cca..62b90b3e609 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -13,6 +13,7 @@ common_sources = files(
'file_perm.c',
'file_utils.c',
'hashfn.c',
+ 'instr_time.c',
'ip.c',
'jsonapi.c',
'keywords.c',
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 6ad411c6866..fd8ca122001 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,9 +4,11 @@
* portable high-precision interval timing
*
* This file provides an abstraction layer to hide portability issues in
- * interval timing. On Unix we use clock_gettime(), and on Windows we use
- * QueryPerformanceCounter(). These macros also give some breathing room to
- * use other high-precision-timing APIs.
+ * interval timing. On Linux/x86 we use the rdtsc instruction when a TSC
+ * clocksource is also used on the host OS. Otherwise, and on other
+ * Unix-like systems we use clock_gettime() and on Windows we use
+ * QueryPerformanceCounter(). These macros also give some breathing
+ * room to use other high-precision-timing APIs.
*
* The basic data type is instr_time, which all callers should treat as an
* opaque typedef. instr_time can store either an absolute time (of
@@ -19,10 +21,11 @@
*
* INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
*
- * INSTR_TIME_SET_CURRENT(t) set t to current time
+ * INSTR_TIME_SET_CURRENT_FAST(t) set t to current time without waiting
+ * for instructions in out-of-order window
*
- * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
- * evaluates to whether t changed
+ * INSTR_TIME_SET_CURRENT(t) set t to current time while waiting for
+ * instructions in OOO to retire
*
* INSTR_TIME_SET_SECONDS(t, s) set t to s seconds
*
@@ -84,7 +87,15 @@ typedef struct instr_time
#ifndef WIN32
+/*
+ * Make sure this is a power-of-two, so that the compiler can turn the
+ * multiplications and divisions into shifts.
+ */
+#define TICKS_TO_NS_PRECISION (1<<14)
+extern int64 ticks_per_ns_scaled;
+extern int64 ticks_per_sec;
+extern int64 max_ticks_no_overflow;
/* Use clock_gettime() */
@@ -110,35 +121,124 @@ typedef struct instr_time
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-/* helper for INSTR_TIME_SET_CURRENT */
+#if defined(__x86_64__) && defined(__linux__)
+#include <x86intrin.h>
+#include <cpuid.h>
+
+extern bool has_rdtsc;
+extern bool has_rdtscp;
+
+extern void pg_initialize_rdtsc(void);
+#endif
+
static inline instr_time
-pg_clock_gettime_ns(void)
+pg_clock_gettime()
{
instr_time now;
struct timespec tmp;
clock_gettime(PG_INSTR_CLOCK, &tmp);
now.ticks = tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
-
return now;
}
+static inline instr_time
+pg_get_ticks_fast(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtsc)
+ {
+ instr_time now;
+
+ now.ticks = __rdtsc();
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline instr_time
+pg_get_ticks(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtscp)
+ {
+ instr_time now;
+ uint32 unused;
+
+ now.ticks = __rdtscp(&unused);
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline int64_t
+pg_ticks_to_ns(instr_time t)
+{
+ /*
+ * Would multiplication overflow? If so perform computation in two parts.
+ * Check overflow without actually overflowing via: a * b > max <=> a >
+ * max / b
+ */
+ int64 ns = 0;
+
+ if (unlikely(t.ticks > max_ticks_no_overflow))
+ {
+ /*
+ * Compute how often the maximum number of ticks fits completely into
+ * the number of elapsed ticks and convert that number into
+ * nanoseconds. Then multiply by the count to arrive at the final
+ * value. In a 2nd step we adjust the number of elapsed ticks and
+ * convert the remaining ticks.
+ */
+ int64 count = t.ticks / max_ticks_no_overflow;
+ int64 max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+
+ ns = max_ns * count;
+
+ /*
+ * Subtract the ticks that we now already accounted for, so that they
+ * don't get counted twice.
+ */
+ t.ticks -= count * max_ticks_no_overflow;
+ Assert(t.ticks >= 0);
+ }
+
+ ns += t.ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ return ns;
+}
+
+static inline void
+pg_initialize_get_ticks()
+{
+#if defined(__x86_64__) && defined(__linux__)
+ pg_initialize_rdtsc();
+#endif
+}
+
+#define INSTR_TIME_INITIALIZE() \
+ pg_initialize_get_ticks()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_get_ticks_fast())
+
#define INSTR_TIME_SET_CURRENT(t) \
- ((t) = pg_clock_gettime_ns())
+ ((t) = pg_get_ticks())
#define INSTR_TIME_SET_SECONDS(t, s) \
- ((t).ticks = NS_PER_S * (s))
+ ((t).ticks = (s) * ticks_per_sec)
#define INSTR_TIME_GET_NANOSEC(t) \
- ((int64) (t).ticks)
-
+ pg_ticks_to_ns(t)
#else /* WIN32 */
/* Use QueryPerformanceCounter() */
-/* helper for INSTR_TIME_SET_CURRENT */
static inline instr_time
pg_query_performance_counter(void)
{
@@ -160,6 +260,11 @@ GetTimerFrequency(void)
return (double) f.QuadPart;
}
+#define INSTR_TIME_INITIALIZE()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_query_performance_counter())
+
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
@@ -180,13 +285,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_IS_LT(x, y) ((x).ticks < (y).ticks)
-
#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
-#define INSTR_TIME_SET_CURRENT_LAZY(t) \
- (INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
-
-
#define INSTR_TIME_ADD(x,y) \
((x).ticks += (y).ticks)
@@ -196,7 +296,6 @@ GetTimerFrequency(void)
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
((x).ticks += (y).ticks - (z).ticks)
-
#define INSTR_TIME_GET_DOUBLE(t) \
((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
--
2.47.1
v10-0002-wip-report-nanoseconds-in-pg_test_timing.patchapplication/octet-stream; name=v10-0002-wip-report-nanoseconds-in-pg_test_timing.patchDownload
From b60b6a90aed5ca543b04881f11d8bfac741ec3fe Mon Sep 17 00:00:00 2001
From: Andres Freund <andres@anarazel.de>
Date: Mon, 16 Jan 2023 11:19:11 -0800
Subject: [PATCH v10 2/3] wip: report nanoseconds in pg_test_timing
This commit also updates pg_test_timing's documentation:
- compare EXPLAIN (ANALYZE, TIMING ON/OFF) instead of comparing performance of
of statement with/without EXPLAIN ANALYZE
- explain the 2x overhead (due to two timestamp acquisitions per row)
- remove old section about old versions of linux - I couldn't update the
numbers, and it's old enough nobody would care
---
doc/src/sgml/ref/pgtesttiming.sgml | 117 ++++++++++--------------
src/bin/pg_test_timing/pg_test_timing.c | 74 +++++++++------
2 files changed, 95 insertions(+), 96 deletions(-)
diff --git a/doc/src/sgml/ref/pgtesttiming.sgml b/doc/src/sgml/ref/pgtesttiming.sgml
index a5eb3aa25e0..7e0266cf58b 100644
--- a/doc/src/sgml/ref/pgtesttiming.sgml
+++ b/doc/src/sgml/ref/pgtesttiming.sgml
@@ -93,28 +93,34 @@ PostgreSQL documentation
<para>
Good results will show most (>90%) individual timing calls take less than
- one microsecond. Average per loop overhead will be even lower, below 100
- nanoseconds. This example from an Intel i7-860 system using a TSC clock
- source shows excellent performance:
+ one microsecond (1000 nanoseconds). Average per loop overhead will be even
+ lower, below 100 nanoseconds. This example from an Intel i9-9880H system
+ using a TSC clock source shows excellent performance:
<screen><![CDATA[
Testing timing overhead for 3 seconds.
-Per loop time including overhead: 35.96 ns
+Per loop time including overhead: 13.74 ns
Histogram of timing durations:
- < us % of total count
- 1 96.40465 80435604
- 2 3.59518 2999652
- 4 0.00015 126
- 8 0.00002 13
- 16 0.00000 2
+ < ns % of total count
+ 16 97.00221 211857215
+ 32 2.99555 6542412
+ 64 0.00115 2505
+ 128 0.00035 759
+ 256 0.00004 78
+ 512 0.00000 3
+ 1024 0.00000 4
+ 2048 0.00034 732
+ 4096 0.00000 6
+ 8192 0.00000 8
+ 16384 0.00019 409
+ 32768 0.00018 403
+ 65536 0.00000 1
]]></screen>
</para>
<para>
- Note that different units are used for the per loop time than the
- histogram. The loop can have resolution within a few nanoseconds (ns),
- while the individual timing calls can only resolve down to one microsecond
- (us).
+ Note that the accuracy of the histogram entries may be lower than the
+ per loop time.
</para>
</refsect2>
@@ -125,24 +131,25 @@ Histogram of timing durations:
When the query executor is running a statement using
<command>EXPLAIN ANALYZE</command>, individual operations are timed as well
as showing a summary. The overhead of your system can be checked by
- counting rows with the <application>psql</application> program:
+ disabling the per-row timing, using the <literal>TIMING OFF</literal>
+ option:
<screen>
-CREATE TABLE t AS SELECT * FROM generate_series(1,100000);
-\timing
-SELECT COUNT(*) FROM t;
-EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
+CREATE TABLE t AS SELECT * FROM generate_series(1, 100000);
+EXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM t;
+EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM t;
</screen>
</para>
<para>
- The i7-860 system measured runs the count query in 9.8 ms while
- the <command>EXPLAIN ANALYZE</command> version takes 16.6 ms, each
- processing just over 100,000 rows. That 6.8 ms difference means the timing
- overhead per row is 68 ns, about twice what pg_test_timing estimated it
- would be. Even that relatively small amount of overhead is making the fully
- timed count statement take almost 70% longer. On more substantial queries,
- the timing overhead would be less problematic.
+ The i9-9880H system measured shows an execution time of 4.116 ms for the
+ <literal>TIMING OFF</literal> query, and 6.965 ms for the
+ <literal>TIMING ON</literal>, each processing 100,000 rows.
+
+ That 2.849 ms difference means the timing overhead per row is 28 ns. As
+ <literal>TIMING ON</literal> measures timestamps twice per row returned by
+ an executor node, the overhead is very close to what pg_test_timing
+ estimated it would be.
</para>
</refsect2>
@@ -157,28 +164,31 @@ EXPLAIN ANALYZE SELECT COUNT(*) FROM t;
<screen><![CDATA[
# cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-tsc hpet acpi_pm
+tsc acpi_pm
# echo acpi_pm > /sys/devices/system/clocksource/clocksource0/current_clocksource
# pg_test_timing
-Per loop time including overhead: 722.92 ns
+Testing timing overhead for 3 seconds.
+Per loop time including overhead: 708.58 ns
Histogram of timing durations:
- < us % of total count
- 1 27.84870 1155682
- 2 72.05956 2990371
- 4 0.07810 3241
- 8 0.01357 563
- 16 0.00007 3
+ < ns % of total count
+ 1024 99.79796 4225270
+ 2048 0.15560 6588
+ 4096 0.00035 15
+ 8192 0.01738 736
+ 16384 0.01679 711
+ 32768 0.01190 504
]]></screen>
</para>
<para>
In this configuration, the sample <command>EXPLAIN ANALYZE</command> above
- takes 115.9 ms. That's 1061 ns of timing overhead, again a small multiple
- of what's measured directly by this utility. That much timing overhead
- means the actual query itself is only taking a tiny fraction of the
- accounted for time, most of it is being consumed in overhead instead. In
- this configuration, any <command>EXPLAIN ANALYZE</command> totals involving
- many timed operations would be inflated significantly by timing overhead.
+ shows an execution time of 148.7 ms. That's 1392 ns of per-row timing
+ overhead. Taking the two timestamps per row into account, that's again
+ close to what pg_test_timing estimated. That much timing overhead means
+ the actual query itself is only taking a tiny fraction of the accounted for
+ time, most of it is being consumed in overhead instead. In this
+ configuration, any <command>EXPLAIN ANALYZE</command> totals involving many
+ timed operations would be inflated significantly by timing overhead.
</para>
<para>
@@ -196,33 +206,6 @@ kern.timecounter.hardware: ACPI-fast -> TSC
</screen>
</para>
- <para>
- Other systems may only allow setting the time source on boot. On older
- Linux systems the "clock" kernel setting is the only way to make this sort
- of change. And even on some more recent ones, the only option you'll see
- for a clock source is "jiffies". Jiffies are the older Linux software clock
- implementation, which can have good resolution when it's backed by fast
- enough timing hardware, as in this example:
-
-<screen><![CDATA[
-$ cat /sys/devices/system/clocksource/clocksource0/available_clocksource
-jiffies
-$ dmesg | grep time.c
-time.c: Using 3.579545 MHz WALL PM GTOD PIT/TSC timer.
-time.c: Detected 2400.153 MHz processor.
-$ pg_test_timing
-Testing timing overhead for 3 seconds.
-Per timing duration including loop overhead: 97.75 ns
-Histogram of timing durations:
- < us % of total count
- 1 90.23734 27694571
- 2 9.75277 2993204
- 4 0.00981 3010
- 8 0.00007 22
- 16 0.00000 1
- 32 0.00000 1
-]]></screen></para>
-
</refsect2>
<refsect2>
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index ce7aad4b25a..29b0db6d619 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -19,8 +19,8 @@ static void handle_args(int argc, char *argv[]);
static uint64 test_timing(unsigned int duration);
static void output(uint64 loop_count);
-/* record duration in powers of 2 microseconds */
-static long long int histogram[32];
+/* record duration in powers of 2 nanoseconds */
+static uint64 histogram[64];
int
main(int argc, char *argv[])
@@ -121,35 +121,48 @@ handle_args(int argc, char *argv[])
static uint64
test_timing(unsigned int duration)
{
- uint64 total_time;
- int64 time_elapsed = 0;
uint64 loop_count = 0;
- uint64 prev,
- cur;
+ instr_time until_time,
+ total_time;
instr_time start_time,
- end_time,
- temp;
-
- total_time = duration > 0 ? duration * INT64CONST(1000000) : 0;
+ end_time;
+ instr_time cur;
INSTR_TIME_SET_CURRENT(start_time);
- cur = INSTR_TIME_GET_MICROSEC(start_time);
- while (time_elapsed < total_time)
+ /*
+ * To reduce loop overhead, check loop condition in instr_time domain.
+ */
+ INSTR_TIME_SET_SECONDS(total_time, duration);
+ until_time = start_time;
+ INSTR_TIME_ADD(until_time, total_time);
+
+ cur = start_time;
+
+ while (INSTR_TIME_IS_LT(cur, until_time))
{
- int32 diff,
- bits = 0;
+ instr_time temp;
+ instr_time prev;
+ int64 diff;
+ int32 bits = 0;
prev = cur;
- INSTR_TIME_SET_CURRENT(temp);
- cur = INSTR_TIME_GET_MICROSEC(temp);
- diff = cur - prev;
+ INSTR_TIME_SET_CURRENT(cur);
+ temp = cur;
+ INSTR_TIME_SUBTRACT(temp, prev);
+ diff = INSTR_TIME_GET_NANOSEC(temp);
/* Did time go backwards? */
- if (diff < 0)
+ if (unlikely(diff <= 0))
{
+ /* can't do anything with that measurement */
+ if (diff == 0)
+ {
+ loop_count++;
+ continue;
+ }
fprintf(stderr, _("Detected clock going backwards in time.\n"));
- fprintf(stderr, _("Time warp: %d ms\n"), diff);
+ fprintf(stderr, _("Time warp: %lld ns\n"), (long long) diff);
exit(1);
}
@@ -164,8 +177,6 @@ test_timing(unsigned int duration)
histogram[bits]++;
loop_count++;
- INSTR_TIME_SUBTRACT(temp, start_time);
- time_elapsed = INSTR_TIME_GET_MICROSEC(temp);
}
INSTR_TIME_SET_CURRENT(end_time);
@@ -173,7 +184,7 @@ test_timing(unsigned int duration)
INSTR_TIME_SUBTRACT(end_time, start_time);
printf(_("Per loop time including overhead: %0.2f ns\n"),
- INSTR_TIME_GET_DOUBLE(end_time) * 1e9 / loop_count);
+ (INSTR_TIME_GET_DOUBLE(end_time) * NS_PER_S) / loop_count);
return loop_count;
}
@@ -181,9 +192,10 @@ test_timing(unsigned int duration)
static void
output(uint64 loop_count)
{
- int64 max_bit = 31,
+ int64 low_bit = 0,
+ max_bit = 63,
i;
- char *header1 = _("< us");
+ char *header1 = _("< ns");
char *header2 = /* xgettext:no-c-format */ _("% of total");
char *header3 = _("count");
int len1 = strlen(header1);
@@ -194,15 +206,19 @@ output(uint64 loop_count)
while (max_bit > 0 && histogram[max_bit] == 0)
max_bit--;
+ /* find lowest bit value */
+ while (low_bit < max_bit && histogram[low_bit] == 0)
+ low_bit++;
+
printf(_("Histogram of timing durations:\n"));
printf("%*s %*s %*s\n",
- Max(6, len1), header1,
+ Max(9, len1), header1,
Max(10, len2), header2,
Max(10, len3), header3);
- for (i = 0; i <= max_bit; i++)
- printf("%*ld %*.5f %*lld\n",
- Max(6, len1), 1l << i,
+ for (i = low_bit; i <= max_bit; i++)
+ printf("%*ld %*.5f %*llu\n",
+ Max(9, len1), 1l << i,
Max(10, len2) - 1, (double) histogram[i] * 100 / loop_count,
- Max(10, len3), histogram[i]);
+ Max(10, len3), (long long unsigned) histogram[i]);
}
--
2.47.1
Hi,
On 2025-02-28 23:45:58 -0800, Lukas Fittl wrote:
From what I can gather, it appears this was an oversight when David first
reapplied the work on the instr_time changes that were committed.
Heh, glad that that's now fixed. Unfortunately the patch needs an update,
primarily because of the recent pg_test_timing changes.
Applying just patch 2 results in a performance *regression* in pg_test_timing
on my machine, which is due to always hitting the unlikely() path in
INSTR_TIME_GET_NANOSEC() when INSTR_TIME_GET_NANOSEC() is used for an
"absolute" timestamp, rather than a differential timestamp. Which in turn
means hitting a division instruction every time, which on my slightly older
hardware is slower. That may be because my workstation has been up for 40
days, but clearly that can't lead us down to the slow-path
On a c5.xlarge (Skylake-SP or Cascade Lake) on AWS, with the same test as
done initially in this thread:SELECT COUNT(*) FROM lotsarows;
Time: 974.423 msEXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM lotsarows;
Time: 1336.196 ms (00:01.336)Without patch:
EXPLAIN (ANALYZE) SELECT COUNT(*) FROM lotsarows;
Time: 2165.069 ms (00:02.165)Per loop time including overhead: 22.15 ns
With patch:
EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM lotsarows;
Time: 1654.289 ms (00:01.654)Per loop time including overhead: 9.81 ns
I still think this would be a rather awesome improvement.
Open questions I have:
- Could we rely on checking whether the TSC timesource is invariant (via
CPUID), instead of relying on Linux choosing it as a clocksource?
I don't see why not?
Greetings,
Andres Freund
Hi,
See attached v11 (and moved to the PG19-2 commitfest), split into a new set
of patches:
0001 - Improve the __cpuidex check added for a different purpose
in 792752af4eb5 to:
- Fix a typo (configure was incorrectly checking for "__get_cpuidex", vs
meson.build was doing it correctly)
- Adds support for non-MSVC compilers as well (e.g. GCC 11+), where
__cpuidex is defined in cpuid.h, not intrin.h
This change should be independently committable, though we wouldn't use
cpuidex on non-MSVC compilers today in practice, I believe.
0002 - The core patch rebased, which, as before:
- Adds INSTR_TIME_SET_CURRENT_FAST (which uses RDTSC if available) and uses
it for InstrStartNode/InstrStopNode
- Changes INSTR_TIME_SET_CURRENT to directly use RDTSCP if available
(instead of pg_clock_gettime)
- Keeps utilizing pg_clock_gettime for both unless we're on Linux x86 and
the clocksource is set to "tsc" (see note below re: that aspect)
0003 - Changes to pg_test_timing utility:
- Show the used time source (clock_gettime + clock type / RDTSC / RDTSCP)
- Allows checking the latency of the "fast" time source (RDTSC) with the
new "--fast" option, and warn if its not available
- Avoids the INSTR_TIME_GET_NANOSEC slowness that Andres reported by
diffing the ticks first and then calculating nanosecs
Note the other pg_test_timing changes regarding nanoseconds should all have
been addressed by 0b096e379e6f I believe.
On Wed, Jul 16, 2025 at 5:48 PM Andres Freund <andres@anarazel.de> wrote:
Applying just patch 2 results in a performance *regression* in
pg_test_timing
on my machine, which is due to always hitting the unlikely() path in
INSTR_TIME_GET_NANOSEC() when INSTR_TIME_GET_NANOSEC() is used for an
"absolute" timestamp, rather than a differential timestamp. Which in turn
means hitting a division instruction every time, which on my slightly older
hardware is slower. That may be because my workstation has been up for 40
days, but clearly that can't lead us down to the slow-path
Assuming you didn't restart your workstation, can you retest with this
patch set?
I believe the pg_test_timing changes should address this problem, by
avoiding calculations with the absolute (very large) ticks value.
Open questions I have:
- Could we rely on checking whether the TSC timesource is invariant (via
CPUID), instead of relying on Linux choosing it as a clocksource?I don't see why not?
Thinking this through again, my worry would be that our detection logic for
whether the TSC is safe to use directly, is much less sophisticated than
that of the Linux Kernel - and the Linux Kernel also allows configuring the
clock source explicitly, if the detection goes wrong.
For example, David had previously brought up the worry that accessing the
TSC directly in a VM can be very slow when the TSC is emulated. The Linux
Kernel indeed has checks for this, e.g. in the context of Xen:
https://github.com/torvalds/linux/blob/b711733e89a3f84c8e1e56e2328f9a0fa5facc7c/arch/x86/xen/time.c#L490
Maybe introducing a GUC for this is the way to go, with an OS-dependent
"auto" setting?
Thanks,
Lukas
--
Lukas Fittl
Attachments:
v11-0003-pg_test_timing-Add-fast-flag-to-test-fast-timing.patchapplication/octet-stream; name=v11-0003-pg_test_timing-Add-fast-flag-to-test-fast-timing.patchDownload
From 446cb3a93bcea1ceb76056dc461cf733f361a59a Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Sun, 27 Jul 2025 08:48:48 -0700
Subject: [PATCH v11 3/3] pg_test_timing: Add --fast flag to test fast timing,
report time source
In passing also reduce the per-loop overhead caused by repeated divisions
in INSTR_TIME_GET_NANOSEC when the ticks variable has become very large,
instead diff first and then turn it into nanosecs.
---
src/bin/pg_test_timing/pg_test_timing.c | 59 +++++++++++++++++++------
src/include/portability/instr_time.h | 30 ++++++++-----
2 files changed, 65 insertions(+), 24 deletions(-)
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index a5621251afc..b77ef2063b6 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -16,6 +16,7 @@ static const char *progname;
static unsigned int test_duration = 3;
static double max_rprct = 99.99;
+static bool fast_timing = false;
/* record duration in powers of 2 nanoseconds */
static long long int histogram[32];
@@ -56,6 +57,7 @@ handle_args(int argc, char *argv[])
static struct option long_options[] = {
{"duration", required_argument, NULL, 'd'},
{"cutoff", required_argument, NULL, 'c'},
+ {"fast", no_argument, NULL, 'f'},
{NULL, 0, NULL, 0}
};
@@ -68,7 +70,7 @@ handle_args(int argc, char *argv[])
{
if (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-?") == 0)
{
- printf(_("Usage: %s [-d DURATION] [-c CUTOFF]\n"), progname);
+ printf(_("Usage: %s [-d DURATION] [-c CUTOFF] [--fast]\n"), progname);
exit(0);
}
if (strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-V") == 0)
@@ -78,7 +80,7 @@ handle_args(int argc, char *argv[])
}
}
- while ((option = getopt_long(argc, argv, "d:c:",
+ while ((option = getopt_long(argc, argv, "d:c:f:",
long_options, &optindex)) != -1)
{
switch (option)
@@ -125,6 +127,10 @@ handle_args(int argc, char *argv[])
}
break;
+ case 'f':
+ fast_timing = true;
+ break;
+
default:
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
progname);
@@ -155,11 +161,31 @@ test_timing(unsigned int duration)
uint64 total_time;
int64 time_elapsed = 0;
uint64 loop_count = 0;
- uint64 prev,
- cur;
instr_time start_time,
end_time,
- temp;
+ prev,
+ cur;
+ char *time_source = NULL;
+ bool fast_timing_used = false;
+
+ INSTR_TIME_INITIALIZE();
+
+#if !defined(WIN32) && defined(__x86_64__) && defined(__linux__)
+ if (fast_timing && has_rdtsc)
+ {
+ time_source = "RDTSC";
+ fast_timing_used = true;
+ }
+ else if (has_rdtscp)
+ time_source = "RDTSCP";
+ else
+ time_source = PG_INSTR_CLOCK_NAME;
+#else
+ time_source = PG_INSTR_CLOCK_NAME;
+#endif
+ if (fast_timing && !fast_timing_used)
+ printf(_("Warning: Fast timing requested, but not available - regular timing source will be used\n"));
+ printf(_("Time source: %s\n"), time_source);
/*
* Pre-zero the statistics data structures. They're already zero by
@@ -173,8 +199,11 @@ test_timing(unsigned int duration)
total_time = duration > 0 ? duration * INT64CONST(1000000000) : 0;
- INSTR_TIME_SET_CURRENT(start_time);
- cur = INSTR_TIME_GET_NANOSEC(start_time);
+ if (fast_timing)
+ INSTR_TIME_SET_CURRENT_FAST(start_time);
+ else
+ INSTR_TIME_SET_CURRENT(start_time);
+ cur = start_time;
while (time_elapsed < total_time)
{
@@ -182,9 +211,11 @@ test_timing(unsigned int duration)
bits;
prev = cur;
- INSTR_TIME_SET_CURRENT(temp);
- cur = INSTR_TIME_GET_NANOSEC(temp);
- diff = cur - prev;
+ if (fast_timing)
+ INSTR_TIME_SET_CURRENT_FAST(cur);
+ else
+ INSTR_TIME_SET_CURRENT(cur);
+ diff = INSTR_TIME_DIFF_NANOSEC(cur, prev);
/* Did time go backwards? */
if (unlikely(diff < 0))
@@ -217,11 +248,13 @@ test_timing(unsigned int duration)
largest_diff_count++;
loop_count++;
- INSTR_TIME_SUBTRACT(temp, start_time);
- time_elapsed = INSTR_TIME_GET_NANOSEC(temp);
+ time_elapsed = INSTR_TIME_DIFF_NANOSEC(cur, start_time);
}
- INSTR_TIME_SET_CURRENT(end_time);
+ if (fast_timing)
+ INSTR_TIME_SET_CURRENT_FAST(end_time);
+ else
+ INSTR_TIME_SET_CURRENT(end_time);
INSTR_TIME_SUBTRACT(end_time, start_time);
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index e2e339a0c4f..f02296f1026 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -112,10 +112,13 @@ extern int64 max_ticks_no_overflow;
*/
#if defined(__darwin__) && defined(CLOCK_MONOTONIC_RAW)
#define PG_INSTR_CLOCK CLOCK_MONOTONIC_RAW
+#define PG_INSTR_CLOCK_NAME "clock_gettime (CLOCK_MONOTONIC_RAW)"
#elif defined(CLOCK_MONOTONIC)
#define PG_INSTR_CLOCK CLOCK_MONOTONIC
+#define PG_INSTR_CLOCK_NAME "clock_gettime (CLOCK_MONOTONIC)"
#else
#define PG_INSTR_CLOCK CLOCK_REALTIME
+#define PG_INSTR_CLOCK_NAME "clock_gettime (CLOCK_REALTIME)"
#endif
#if defined(__x86_64__) && defined(__linux__)
@@ -174,7 +177,7 @@ pg_get_ticks(void)
}
static inline int64_t
-pg_ticks_to_ns(instr_time t)
+pg_ticks_to_ns(int64 ticks)
{
/*
* Would multiplication overflow? If so perform computation in two parts.
@@ -183,7 +186,7 @@ pg_ticks_to_ns(instr_time t)
*/
int64 ns = 0;
- if (unlikely(t.ticks > max_ticks_no_overflow))
+ if (unlikely(ticks > max_ticks_no_overflow))
{
/*
* Compute how often the maximum number of ticks fits completely into
@@ -192,7 +195,7 @@ pg_ticks_to_ns(instr_time t)
* value. In a 2nd step we adjust the number of elapsed ticks and
* convert the remaining ticks.
*/
- int64 count = t.ticks / max_ticks_no_overflow;
+ int64 count = ticks / max_ticks_no_overflow;
int64 max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
ns = max_ns * count;
@@ -201,11 +204,11 @@ pg_ticks_to_ns(instr_time t)
* Subtract the ticks that we now already accounted for, so that they
* don't get counted twice.
*/
- t.ticks -= count * max_ticks_no_overflow;
- Assert(t.ticks >= 0);
+ ticks -= count * max_ticks_no_overflow;
+ Assert(ticks >= 0);
}
- ns += t.ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ ns += ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
return ns;
}
@@ -226,14 +229,14 @@ pg_initialize_get_ticks()
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_get_ticks())
-#define INSTR_TIME_GET_NANOSEC(t) \
- pg_ticks_to_ns(t)
-
+#define INSTR_TIME_TICKS_TO_NANOSEC(ticks) \
+ (pg_ticks_to_ns(ticks))
#else /* WIN32 */
/* Use QueryPerformanceCounter() */
+#define PG_INSTR_CLOCK_NAME "QueryPerformanceCounter"
/* helper for INSTR_TIME_SET_CURRENT / INSTR_TIME_SET_CURRENT_FAST */
static inline instr_time
@@ -265,8 +268,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
-#define INSTR_TIME_GET_NANOSEC(t) \
- ((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency())))
+#define INSTR_TIME_TICKS_TO_NANOSEC(ticks) \
+ ((int64) ((ticks) * ((double) NS_PER_S / GetTimerFrequency())))
#endif /* WIN32 */
@@ -285,9 +288,14 @@ GetTimerFrequency(void)
#define INSTR_TIME_SUBTRACT(x,y) \
((x).ticks -= (y).ticks)
+#define INSTR_TIME_DIFF_NANOSEC(x,y) \
+ (INSTR_TIME_TICKS_TO_NANOSEC((x).ticks - (y).ticks))
+
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
((x).ticks += (y).ticks - (z).ticks)
+#define INSTR_TIME_GET_NANOSEC(t) \
+ (INSTR_TIME_TICKS_TO_NANOSEC((t).ticks))
#define INSTR_TIME_GET_DOUBLE(t) \
((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
--
2.47.1
v11-0002-Use-time-stamp-counter-to-measure-time-on-Linux-.patchapplication/octet-stream; name=v11-0002-Use-time-stamp-counter-to-measure-time-on-Linux-.patchDownload
From 407a6a055d3a48db1c7dd9e5e7a0472e7bc98802 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Fri, 25 Jul 2025 17:57:20 -0700
Subject: [PATCH v11 2/3] Use time stamp counter to measure time on Linux/x86
We switch to using the time stamp counter (TSC) instead of clock_gettime()
to reduce overhead of EXPLAIN (ANALYZE, TIME ON). Tests showed that runtime
is reduced by around 10% for queries moving lots of rows through the plan.
For now this is only enabled on Linux/x86, in case the system clocksource is
reported as TSC. Relying on the Linux kernel simplifies the logic to detect
if the present TSC is usable (frequency invariant, synchronized between
sockets, etc.). In all other cases we fallback to clock_gettime().
Note, that we intentionally use RDTSC in the fast paths, rather than RDTSCP.
RDTSCP waits for outstanding instructions to retire on out-of-order CPUs.
This adds noticably for little benefit in the typical InstrStartNode() /
InstrStopNode() use case. The macro to be used in such cases is called
INSTR_TIME_SET_CURRENT_FAST(). The original macro INSTR_TIME_SET_CURRENT()
uses RDTSCP and is supposed to be used when precision is more important
than performance.
Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
---
src/backend/access/heap/vacuumlazy.c | 4 +-
src/backend/executor/instrument.c | 12 +-
src/backend/utils/init/postinit.c | 3 +
src/bin/pgbench/pgbench.c | 3 +
src/bin/psql/startup.c | 4 +
src/common/Makefile | 1 +
src/common/instr_time.c | 206 +++++++++++++++++++++++++++
src/common/meson.build | 1 +
src/include/portability/instr_time.h | 136 +++++++++++++++---
9 files changed, 348 insertions(+), 22 deletions(-)
create mode 100644 src/common/instr_time.c
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 14036c27e87..0165a10b17c 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3371,8 +3371,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
INSTR_TIME_SET_CURRENT(currenttime);
elapsed = currenttime;
INSTR_TIME_SUBTRACT(elapsed, starttime);
- if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000)
- >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
+ if (INSTR_TIME_GET_MILLISEC(elapsed) >=
+ VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
{
if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
{
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 56e635f4700..01f67c5d972 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -67,9 +67,13 @@ InstrInit(Instrumentation *instr, int instrument_options)
void
InstrStartNode(Instrumentation *instr)
{
- if (instr->need_timer &&
- !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ if (instr->need_timer)
+ {
+ if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStartNode called twice in a row");
+ else
+ INSTR_TIME_SET_CURRENT_FAST(instr->starttime);
+ }
/* save buffer usage totals at node entry, if needed */
if (instr->need_bufusage)
@@ -95,7 +99,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (INSTR_TIME_IS_ZERO(instr->starttime))
elog(ERROR, "InstrStopNode called without start");
- INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_SET_CURRENT_FAST(endtime);
INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 641e535a73c..d573409903b 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -810,6 +810,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
/* Initialize portal manager */
EnablePortalManager();
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/*
* Load relcache entries for the shared system catalogs. This must create
* at least entries for pg_database and catalogs used for authentication.
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 497a936c141..a283ce7a3be 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7256,6 +7256,9 @@ main(int argc, char **argv)
initRandomState(&state[i].cs_func_rs);
}
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* opening connection... */
con = doConnect();
if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index 249b6aa5169..d615df593c7 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
#include "help.h"
#include "input.h"
#include "mainloop.h"
+#include "portability/instr_time.h"
#include "settings.h"
/*
@@ -327,6 +328,9 @@ main(int argc, char *argv[])
PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
SyncVariables();
if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index 1e2b91c83c4..194e45e7ae8 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -59,6 +59,7 @@ OBJS_COMMON = \
file_perm.o \
file_utils.o \
hashfn.o \
+ instr_time.o \
ip.o \
jsonapi.o \
keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 00000000000..fdf47699f20
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,206 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ * Non-inline parts of the portable high-precision interval timing
+ * implementation
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#if defined(HAVE__GET_CPUID) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
+#include <cpuid.h>
+#endif
+
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
+#include <intrin.h>
+#endif
+
+#include "portability/instr_time.h"
+
+#ifndef WIN32
+/*
+ * Stores what the number of cycles needs to be multiplied with to end up
+ * with nanoseconds using integer math. See comment in pg_initialize_rdtsc()
+ * for more details.
+ *
+ * By default assume we are using clock_gettime() as a fallback which uses
+ * nanoseconds as ticks. Hence, we set the multiplier to the precision scalar
+ * so that the division in INSTR_TIME_GET_NANOSEC() won't change the nanoseconds.
+ *
+ * When using the RDTSC instruction directly this is filled in during initialization
+ * based on the relevant CPUID fields.
+ */
+int64 ticks_per_ns_scaled = TICKS_TO_NS_PRECISION;
+int64 ticks_per_sec = NS_PER_S;
+int64 max_ticks_no_overflow = PG_INT64_MAX / TICKS_TO_NS_PRECISION;
+
+#if defined(__x86_64__) && defined(__linux__)
+/*
+ * Indicates if RDTSC can be used (Linux/x86 only, when OS uses TSC clocksource)
+ */
+bool has_rdtsc = false;
+
+/*
+ * Indicates if RDTSCP can be used. True if RDTSC can be used and RDTSCP is available.
+ */
+bool has_rdtscp = false;
+
+#define CPUID_HYPERVISOR_VMWARE(words) (words[1] == 0x61774d56 && words[2] == 0x4d566572 && words[3] == 0x65726177) /* VMwareVMware */
+#define CPUID_HYPERVISOR_KVM(words) (words[1] == 0x4b4d564b && words[2] == 0x564b4d56 && words[3] == 0x0000004d) /* KVMKVMKVM */
+
+static bool
+get_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ uint32 r[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(0x15, &r[0] /* denominator */ , &r[1] /* numerator */ , &r[2] /* hz */ , &r[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(r, 0x15);
+#else
+#error cpuid instruction not available
+#endif
+
+ if (r[2] > 0)
+ {
+ if (r[0] == 0 || r[1] == 0)
+ return false;
+
+ *tsc_freq = r[2] / 1000 * r[1] / r[0];
+ return true;
+ }
+
+ /* Some CPUs only report frequency in 16H */
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(0x16, &r[0] /* base_mhz */ , &r[1], &r[2], &r[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(r, 0x16);
+#else
+#error cpuid instruction not available
+#endif
+
+ if (r[0] > 0)
+ {
+ *tsc_freq = r[0] * 1000;
+ return true;
+ }
+
+ /*
+ * Check if we have a KVM or VMware Hypervisor passing down TSC frequency
+ * to us in a guest VM
+ *
+ * Note that accessing the 0x40000000 leaf for Hypervisor info requires
+ * use of __cpuidex to set ECX to 0. The similar __get_cpuid_count
+ * function does not work as expected since it contains a check for
+ * __get_cpuid_max, which has been observed to be lower than the special
+ * Hypervisor leaf.
+ */
+#if defined(HAVE__CPUIDEX)
+ __cpuidex((int32 *) r, 0x40000000, 0);
+ if (r[0] >= 0x40000010 && (CPUID_HYPERVISOR_VMWARE(r) || CPUID_HYPERVISOR_KVM(r)))
+ {
+ __cpuidex((int32 *) r, 0x40000010, 0);
+ if (r[0] > 0)
+ {
+ *tsc_freq = r[0];
+ return true;
+ }
+ }
+#endif
+
+ return false;
+}
+
+static bool
+is_rdtscp_available()
+{
+ uint32 r[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ if (!__get_cpuid(0x80000001, &r[0], &r[1], &r[2], &r[3]))
+ return false;
+#elif defined(HAVE__CPUID)
+ __cpuid(r, 0x80000001);
+#else
+#error cpuid instruction not available
+#endif
+
+ return (r[3] & (1 << 27)) != 0;
+}
+
+/*
+ * Decide whether we use the RDTSC instruction at runtime, for Linux/x86,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ *
+ * Since Linux already does a significant amount of work to determine
+ * whether TSC is a viable clock source, decide based on that.
+ */
+void
+pg_initialize_rdtsc(void)
+{
+ FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+
+ if (fp)
+ {
+ char buf[128];
+
+ if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
+ {
+ /*
+ * Compute baseline CPU peformance, determines speed at which
+ * RDTSC advances.
+ */
+ uint32 tsc_freq;
+
+ if (get_tsc_frequency_khz(&tsc_freq))
+ {
+ /*
+ * Ticks to nanoseconds conversion requires floating point
+ * math because because:
+ *
+ * sec = ticks / frequency_hz ns = ticks / frequency_hz *
+ * 1,000,000,000 ns = ticks * (1,000,000,000 / frequency_hz)
+ * ns = ticks * (1,000,000 / frequency_khz) <-- now in
+ * kilohertz
+ *
+ * Here, 'ns' is usually a floating number. For example for a
+ * 2.5 GHz CPU the scaling factor becomes 1,000,000 /
+ * 2,500,000 = 1.2.
+ *
+ * To be able to use integer math we work around the lack of
+ * precision. We first scale the integer up and after the
+ * multiplication by the number of ticks in
+ * INSTR_TIME_GET_NANOSEC() we divide again by the same value.
+ * We picked the scaler such that it provides enough precision
+ * and is a power-of-two which allows for shifting instead of
+ * doing an integer division.
+ */
+ ticks_per_ns_scaled = INT64CONST(1000000) * TICKS_TO_NS_PRECISION / tsc_freq;
+ ticks_per_sec = tsc_freq * 1000; /* KHz->Hz */
+ max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
+
+ has_rdtsc = true;
+ has_rdtscp = is_rdtscp_available();
+ }
+ }
+
+ fclose(fp);
+ }
+}
+#endif /* defined(__x86_64__) && defined(__linux__) */
+
+#endif /* WIN32 */
diff --git a/src/common/meson.build b/src/common/meson.build
index 1540ba67cca..62b90b3e609 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -13,6 +13,7 @@ common_sources = files(
'file_perm.c',
'file_utils.c',
'hashfn.c',
+ 'instr_time.c',
'ip.c',
'jsonapi.c',
'keywords.c',
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index f71a851b18d..e2e339a0c4f 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,9 +4,11 @@
* portable high-precision interval timing
*
* This file provides an abstraction layer to hide portability issues in
- * interval timing. On Unix we use clock_gettime(), and on Windows we use
- * QueryPerformanceCounter(). These macros also give some breathing room to
- * use other high-precision-timing APIs.
+ * interval timing. On Linux/x86 we use the rdtsc instruction when a TSC
+ * clocksource is also used on the host OS. Otherwise, and on other
+ * Unix-like systems we use clock_gettime() and on Windows we use
+ * QueryPerformanceCounter(). These macros also give some breathing
+ * room to use other high-precision-timing APIs.
*
* The basic data type is instr_time, which all callers should treat as an
* opaque typedef. instr_time can store either an absolute time (of
@@ -17,10 +19,11 @@
*
* INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
*
- * INSTR_TIME_SET_CURRENT(t) set t to current time
+ * INSTR_TIME_SET_CURRENT_FAST(t) set t to current time without waiting
+ * for instructions in out-of-order window
*
- * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
- * evaluates to whether t changed
+ * INSTR_TIME_SET_CURRENT(t) set t to current time while waiting for
+ * instructions in OOO to retire
*
* INSTR_TIME_ADD(x, y) x += y
*
@@ -81,6 +84,15 @@ typedef struct instr_time
#ifndef WIN32
+/*
+ * Make sure this is a power-of-two, so that the compiler can turn the
+ * multiplications and divisions into shifts.
+ */
+#define TICKS_TO_NS_PRECISION (1<<14)
+
+extern int64 ticks_per_ns_scaled;
+extern int64 ticks_per_sec;
+extern int64 max_ticks_no_overflow;
/* Use clock_gettime() */
@@ -106,9 +118,18 @@ typedef struct instr_time
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-/* helper for INSTR_TIME_SET_CURRENT */
+#if defined(__x86_64__) && defined(__linux__)
+#include <x86intrin.h>
+#include <cpuid.h>
+
+extern bool has_rdtsc;
+extern bool has_rdtscp;
+
+extern void pg_initialize_rdtsc(void);
+#endif
+
static inline instr_time
-pg_clock_gettime_ns(void)
+pg_clock_gettime(void)
{
instr_time now;
struct timespec tmp;
@@ -119,11 +140,94 @@ pg_clock_gettime_ns(void)
return now;
}
+static inline instr_time
+pg_get_ticks_fast(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtsc)
+ {
+ instr_time now;
+
+ now.ticks = __rdtsc();
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline instr_time
+pg_get_ticks(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtscp)
+ {
+ instr_time now;
+ uint32 unused;
+
+ now.ticks = __rdtscp(&unused);
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline int64_t
+pg_ticks_to_ns(instr_time t)
+{
+ /*
+ * Would multiplication overflow? If so perform computation in two parts.
+ * Check overflow without actually overflowing via: a * b > max <=> a >
+ * max / b
+ */
+ int64 ns = 0;
+
+ if (unlikely(t.ticks > max_ticks_no_overflow))
+ {
+ /*
+ * Compute how often the maximum number of ticks fits completely into
+ * the number of elapsed ticks and convert that number into
+ * nanoseconds. Then multiply by the count to arrive at the final
+ * value. In a 2nd step we adjust the number of elapsed ticks and
+ * convert the remaining ticks.
+ */
+ int64 count = t.ticks / max_ticks_no_overflow;
+ int64 max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+
+ ns = max_ns * count;
+
+ /*
+ * Subtract the ticks that we now already accounted for, so that they
+ * don't get counted twice.
+ */
+ t.ticks -= count * max_ticks_no_overflow;
+ Assert(t.ticks >= 0);
+ }
+
+ ns += t.ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ return ns;
+}
+
+static inline void
+pg_initialize_get_ticks()
+{
+#if defined(__x86_64__) && defined(__linux__)
+ pg_initialize_rdtsc();
+#endif
+}
+
+#define INSTR_TIME_INITIALIZE() \
+ pg_initialize_get_ticks()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_get_ticks_fast())
+
#define INSTR_TIME_SET_CURRENT(t) \
- ((t) = pg_clock_gettime_ns())
+ ((t) = pg_get_ticks())
#define INSTR_TIME_GET_NANOSEC(t) \
- ((int64) (t).ticks)
+ pg_ticks_to_ns(t)
#else /* WIN32 */
@@ -131,7 +235,7 @@ pg_clock_gettime_ns(void)
/* Use QueryPerformanceCounter() */
-/* helper for INSTR_TIME_SET_CURRENT */
+/* helper for INSTR_TIME_SET_CURRENT / INSTR_TIME_SET_CURRENT_FAST */
static inline instr_time
pg_query_performance_counter(void)
{
@@ -153,6 +257,11 @@ GetTimerFrequency(void)
return (double) f.QuadPart;
}
+#define INSTR_TIME_INITIALIZE()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_query_performance_counter())
+
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
@@ -168,13 +277,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0)
-
#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
-#define INSTR_TIME_SET_CURRENT_LAZY(t) \
- (INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
-
-
#define INSTR_TIME_ADD(x,y) \
((x).ticks += (y).ticks)
--
2.47.1
v11-0001-cpuidex-check-Support-detecting-newer-GCC-versio.patchapplication/octet-stream; name=v11-0001-cpuidex-check-Support-detecting-newer-GCC-versio.patchDownload
From 86e5123425162a5bc3f6c4d85076fb2419e5d9ed Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Sun, 27 Jul 2025 10:45:49 -0700
Subject: [PATCH v11 1/3] cpuidex check: Support detecting newer GCC versions
defining it in cpuid.h
In passing fix a typo in the ./configure check - the function is named
"__cpuidex", not "__get_cpuidex".
---
configure | 6 +++++-
meson.build | 4 ++++
src/port/pg_crc32c_sse42_choose.c | 4 ++--
src/port/pg_popcount_avx512.c | 4 ++--
4 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/configure b/configure
index 6d7c22e153f..1d5d47f44d3 100755
--- a/configure
+++ b/configure
@@ -17675,12 +17675,16 @@ if ${pgac_cv__cpuidex+:} false; then :
else
cat confdefs.h - <<_ACEOF >conftest.$ac_ext
/* end confdefs.h. */
+#if defined(_MSC_VER)
#include <intrin.h>
+#else
+#include <cpuid.h>
+#endif
int
main ()
{
unsigned int exx[4] = {0, 0, 0, 0};
- __get_cpuidex(exx[0], 7, 0);
+ __cpuidex(exx, 7, 0);
;
return 0;
diff --git a/meson.build b/meson.build
index 5365aaf95e6..77702591421 100644
--- a/meson.build
+++ b/meson.build
@@ -2035,7 +2035,11 @@ if cc.links('''
args: test_c_args)
cdata.set('HAVE__GET_CPUID_COUNT', 1)
elif cc.links('''
+ #if defined(_MSC_VER)
#include <intrin.h>
+ #else
+ #include <cpuid.h>
+ #endif
int main(int arg, char **argv)
{
unsigned int exx[4] = {0, 0, 0, 0};
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 74d2421ba2b..750f390bfdf 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -20,11 +20,11 @@
#include "c.h"
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
#include <cpuid.h>
#endif
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
#include <intrin.h>
#endif
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index 80c0aee3e73..80d9a372dd7 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -14,13 +14,13 @@
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
#include <cpuid.h>
#endif
#include <immintrin.h>
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
#include <intrin.h>
#endif
--
2.47.1
On Sun, Jul 27, 2025 at 12:50:54PM -0700, Lukas Fittl wrote:
0001 - Improve the __cpuidex check added for a different purpose
in 792752af4eb5 to:- Fix a typo (configure was incorrectly checking for "__get_cpuidex", vs
meson.build was doing it correctly)
It seems to me that this is an independent issue that had better be
backpatched down to where this configure check has been introduced,
no?
- Adds support for non-MSVC compilers as well (e.g. GCC 11+), where
__cpuidex is defined in cpuid.h, not intrin.h
This one should be a HEAD-only change.
--
Michael
On Mon, Jul 28, 2025 at 03:38:20PM +0900, Michael Paquier wrote:
On Sun, Jul 27, 2025 at 12:50:54PM -0700, Lukas Fittl wrote:
- Fix a typo (configure was incorrectly checking for "__get_cpuidex", vs
meson.build was doing it correctly)It seems to me that this is an independent issue that had better be
backpatched down to where this configure check has been introduced,
no?
Please note that updates of ./configure should never be manual, these
are done as follows:
- Update of ./autoconf.ac
- run of autoreconf -i or equivalent to update ./configure. (I just
use the former, just my no-brainer to handle things. Committers are
usually responsible for that, but it may matter to keep the CI happy.)
And I have noticed a second inconsistency with __cpuid(), both
introduced by 3dc2d62d0486 as far as I can see, so we have never
checked for these routines. This is an independent issue for
something that should be backpatched, so I've spawned a new thread
(don't worry you have author credits):
/messages/by-id/aIgwNYGVt5aRAqTJ@paquier.xyz
--
Michael
Hi Lukas!
On 01.03.2025 08:45, Lukas Fittl wrote:
On Sun, Jun 2, 2024 at 1:08 AM Andres Freund <andres@anarazel.de> wrote:
At some point this patch switched from rdtsc to rdtscp, which imo largely
negates the point of it. What lead to that?From what I can gather, it appears this was an oversight when David first
reapplied the work on the instr_time changes that were committed.
Yes, that was by accident.
I've come back to this and rebased this, as well as:
Thanks for moving this forward.
- Added support for VMs running under KVM/VMware Hypervisors
On that last item, this does indeed make a difference on VMs, contrary to
the code comment in earlier versions (and I've not seen any odd behaviors
again, FWIW):
How can we be sure we've actually covered all hypervisors? How much
coverage do we have in the build farm? Are we good if passes on all
build animals?
On a c5.xlarge (Skylake-SP or Cascade Lake) on AWS, with the same test as
done initially in this thread:SELECT COUNT(*) FROM lotsarows;
Time: 974.423 msEXPLAIN (ANALYZE, TIMING OFF) SELECT COUNT(*) FROM lotsarows;
Time: 1336.196 ms (00:01.336)Without patch:
EXPLAIN (ANALYZE) SELECT COUNT(*) FROM lotsarows;
Time: 2165.069 ms (00:02.165)Per loop time including overhead: 22.15 ns
With patch:
EXPLAIN (ANALYZE, TIMING ON) SELECT COUNT(*) FROM lotsarows;
Time: 1654.289 ms (00:01.654)Per loop time including overhead: 9.81 ns
I'm registering this again in the current commitfest to help reviews.
Open questions I have:
- Could we rely on checking whether the TSC timesource is invariant (via
CPUID), instead of relying on Linux choosing it as a clocksource?
Why do you want to do that? Are you concerned that Linux might pick a
different clock source even though invariant TSC is available?
We could code our own check but looking at the Linux kernel code, this
is a bit more involved if we want to do it completely right. They check
e.g. if the TSC is also synchronized across different CPUs, which is not
the case if they're on different chassis (see unsynchronized_tsc() ->
apic_is_clustered_box()).
I think it's safer to start with relying on the kernel. Some research
suggests that the TSC is the preferred clock source if available.
- For the Hypervisor CPUID checks I had to rely on __cpuidex which is only
available on newer GCC versions (
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95973), how do we best check
for its presence? (compiler version, or rather configure check?) -- note
this is also the reason the patch fails the clang compiler warning check in
CI, despite clang having support in recent versions (
https://reviews.llvm.org/D121653)
What about instead using #if !__has_builtin(_cpuidex) ... #endif to
define the built-in ourselves as a function in case it doesn't exist?
--
David Geier
On 2025-Jul-27, Lukas Fittl wrote:
See attached v11 (and moved to the PG19-2 commitfest), split into a new set
of patches:
I rebased (but not reviewed) this patchset now that Michael committed
part of 0001, as seen in another thread.
Quickly looking at 0003, I wonder if adding a separate --fast switch to
pg_test_timing is really what we want. Why not report both the fast and
legacy measurements in platforms that support both, instead? If I were
a consultant trying to understand a customer's system, I would have to
ask them to run it twice just in case 'fast' is supported, and I don't
think that's very helpful. Also, were the doc updates lost somehow, or
were they made irrelevant by other concurrent pg_test_timing
development?
Thanks
--
Álvaro Herrera Breisgau, Deutschland — https://www.EnterpriseDB.com/
"Ninguna manada de bestias tiene una voz tan horrible como la humana" (Orual)
Attachments:
v12-0001-cpuidex-check-Support-detecting-newer-GCC-versio.patchtext/x-diff; charset=utf-8Download
From 3844daeee1f8eac0263f1421929812b4b04fad38 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Sun, 27 Jul 2025 10:45:49 -0700
Subject: [PATCH v12 1/3] cpuidex check: Support detecting newer GCC versions
defining it in cpuid.h
Author: Lukas Fittl <lukas@fittl.com>
Discussion: https://postgr.es/m/CAP53Pky-BN0Ui+A9no3TsU=GoMTFpxYSWYtp_LVaDH=y69BxNg@mail.gmail.com
---
meson.build | 4 ++++
src/port/pg_crc32c_sse42_choose.c | 4 ++--
src/port/pg_popcount_avx512.c | 4 ++--
3 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/meson.build b/meson.build
index 395416a6060..007ec30800f 100644
--- a/meson.build
+++ b/meson.build
@@ -2015,7 +2015,11 @@ if cc.links('''
args: test_c_args)
cdata.set('HAVE__GET_CPUID_COUNT', 1)
elif cc.links('''
+ #if defined(_MSC_VER)
#include <intrin.h>
+ #else
+ #include <cpuid.h>
+ #endif
int main(int arg, char **argv)
{
unsigned int exx[4] = {0, 0, 0, 0};
diff --git a/src/port/pg_crc32c_sse42_choose.c b/src/port/pg_crc32c_sse42_choose.c
index 74d2421ba2b..750f390bfdf 100644
--- a/src/port/pg_crc32c_sse42_choose.c
+++ b/src/port/pg_crc32c_sse42_choose.c
@@ -20,11 +20,11 @@
#include "c.h"
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
#include <cpuid.h>
#endif
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
#include <intrin.h>
#endif
diff --git a/src/port/pg_popcount_avx512.c b/src/port/pg_popcount_avx512.c
index 80c0aee3e73..80d9a372dd7 100644
--- a/src/port/pg_popcount_avx512.c
+++ b/src/port/pg_popcount_avx512.c
@@ -14,13 +14,13 @@
#ifdef USE_AVX512_POPCNT_WITH_RUNTIME_CHECK
-#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT)
+#if defined(HAVE__GET_CPUID) || defined(HAVE__GET_CPUID_COUNT) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
#include <cpuid.h>
#endif
#include <immintrin.h>
-#if defined(HAVE__CPUID) || defined(HAVE__CPUIDEX)
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
#include <intrin.h>
#endif
--
2.47.3
v12-0002-Use-time-stamp-counter-to-measure-time-on-Linux-.patchtext/x-diff; charset=utf-8Download
From d613599d09fe7841c3c6b86a4500e78b77cc3dd2 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Fri, 25 Jul 2025 17:57:20 -0700
Subject: [PATCH v12 2/3] Use time stamp counter to measure time on Linux/x86
We switch to using the time stamp counter (TSC) instead of clock_gettime()
to reduce overhead of EXPLAIN (ANALYZE, TIME ON). Tests showed that runtime
is reduced by around 10% for queries moving lots of rows through the plan.
For now this is only enabled on Linux/x86, in case the system clocksource is
reported as TSC. Relying on the Linux kernel simplifies the logic to detect
if the present TSC is usable (frequency invariant, synchronized between
sockets, etc.). In all other cases we fallback to clock_gettime().
Note, that we intentionally use RDTSC in the fast paths, rather than RDTSCP.
RDTSCP waits for outstanding instructions to retire on out-of-order CPUs.
This adds noticably for little benefit in the typical InstrStartNode() /
InstrStopNode() use case. The macro to be used in such cases is called
INSTR_TIME_SET_CURRENT_FAST(). The original macro INSTR_TIME_SET_CURRENT()
uses RDTSCP and is supposed to be used when precision is more important
than performance.
Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
---
src/backend/access/heap/vacuumlazy.c | 4 +-
src/backend/executor/instrument.c | 12 +-
src/backend/utils/init/postinit.c | 3 +
src/bin/pgbench/pgbench.c | 3 +
src/bin/psql/startup.c | 4 +
src/common/Makefile | 1 +
src/common/instr_time.c | 206 +++++++++++++++++++++++++++
src/common/meson.build | 1 +
src/include/portability/instr_time.h | 136 +++++++++++++++---
9 files changed, 348 insertions(+), 22 deletions(-)
create mode 100644 src/common/instr_time.c
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index d2b031fdd06..5027048cac4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3409,8 +3409,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
INSTR_TIME_SET_CURRENT(currenttime);
elapsed = currenttime;
INSTR_TIME_SUBTRACT(elapsed, starttime);
- if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000)
- >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
+ if (INSTR_TIME_GET_MILLISEC(elapsed) >=
+ VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
{
if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
{
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 56e635f4700..01f67c5d972 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -67,9 +67,13 @@ InstrInit(Instrumentation *instr, int instrument_options)
void
InstrStartNode(Instrumentation *instr)
{
- if (instr->need_timer &&
- !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ if (instr->need_timer)
+ {
+ if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStartNode called twice in a row");
+ else
+ INSTR_TIME_SET_CURRENT_FAST(instr->starttime);
+ }
/* save buffer usage totals at node entry, if needed */
if (instr->need_bufusage)
@@ -95,7 +99,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (INSTR_TIME_IS_ZERO(instr->starttime))
elog(ERROR, "InstrStopNode called without start");
- INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_SET_CURRENT_FAST(endtime);
INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 641e535a73c..d573409903b 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -810,6 +810,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
/* Initialize portal manager */
EnablePortalManager();
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/*
* Load relcache entries for the shared system catalogs. This must create
* at least entries for pg_database and catalogs used for authentication.
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 1515ed405ba..79bef2d2aec 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7290,6 +7290,9 @@ main(int argc, char **argv)
initRandomState(&state[i].cs_func_rs);
}
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* opening connection... */
con = doConnect();
if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index 249b6aa5169..d615df593c7 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
#include "help.h"
#include "input.h"
#include "mainloop.h"
+#include "portability/instr_time.h"
#include "settings.h"
/*
@@ -327,6 +328,9 @@ main(int argc, char *argv[])
PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
SyncVariables();
if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index 2c720caa509..1a2fbbe887f 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -59,6 +59,7 @@ OBJS_COMMON = \
file_perm.o \
file_utils.o \
hashfn.o \
+ instr_time.o \
ip.o \
jsonapi.o \
keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 00000000000..fdf47699f20
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,206 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ * Non-inline parts of the portable high-precision interval timing
+ * implementation
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#if defined(HAVE__GET_CPUID) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
+#include <cpuid.h>
+#endif
+
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
+#include <intrin.h>
+#endif
+
+#include "portability/instr_time.h"
+
+#ifndef WIN32
+/*
+ * Stores what the number of cycles needs to be multiplied with to end up
+ * with nanoseconds using integer math. See comment in pg_initialize_rdtsc()
+ * for more details.
+ *
+ * By default assume we are using clock_gettime() as a fallback which uses
+ * nanoseconds as ticks. Hence, we set the multiplier to the precision scalar
+ * so that the division in INSTR_TIME_GET_NANOSEC() won't change the nanoseconds.
+ *
+ * When using the RDTSC instruction directly this is filled in during initialization
+ * based on the relevant CPUID fields.
+ */
+int64 ticks_per_ns_scaled = TICKS_TO_NS_PRECISION;
+int64 ticks_per_sec = NS_PER_S;
+int64 max_ticks_no_overflow = PG_INT64_MAX / TICKS_TO_NS_PRECISION;
+
+#if defined(__x86_64__) && defined(__linux__)
+/*
+ * Indicates if RDTSC can be used (Linux/x86 only, when OS uses TSC clocksource)
+ */
+bool has_rdtsc = false;
+
+/*
+ * Indicates if RDTSCP can be used. True if RDTSC can be used and RDTSCP is available.
+ */
+bool has_rdtscp = false;
+
+#define CPUID_HYPERVISOR_VMWARE(words) (words[1] == 0x61774d56 && words[2] == 0x4d566572 && words[3] == 0x65726177) /* VMwareVMware */
+#define CPUID_HYPERVISOR_KVM(words) (words[1] == 0x4b4d564b && words[2] == 0x564b4d56 && words[3] == 0x0000004d) /* KVMKVMKVM */
+
+static bool
+get_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ uint32 r[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(0x15, &r[0] /* denominator */ , &r[1] /* numerator */ , &r[2] /* hz */ , &r[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(r, 0x15);
+#else
+#error cpuid instruction not available
+#endif
+
+ if (r[2] > 0)
+ {
+ if (r[0] == 0 || r[1] == 0)
+ return false;
+
+ *tsc_freq = r[2] / 1000 * r[1] / r[0];
+ return true;
+ }
+
+ /* Some CPUs only report frequency in 16H */
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(0x16, &r[0] /* base_mhz */ , &r[1], &r[2], &r[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(r, 0x16);
+#else
+#error cpuid instruction not available
+#endif
+
+ if (r[0] > 0)
+ {
+ *tsc_freq = r[0] * 1000;
+ return true;
+ }
+
+ /*
+ * Check if we have a KVM or VMware Hypervisor passing down TSC frequency
+ * to us in a guest VM
+ *
+ * Note that accessing the 0x40000000 leaf for Hypervisor info requires
+ * use of __cpuidex to set ECX to 0. The similar __get_cpuid_count
+ * function does not work as expected since it contains a check for
+ * __get_cpuid_max, which has been observed to be lower than the special
+ * Hypervisor leaf.
+ */
+#if defined(HAVE__CPUIDEX)
+ __cpuidex((int32 *) r, 0x40000000, 0);
+ if (r[0] >= 0x40000010 && (CPUID_HYPERVISOR_VMWARE(r) || CPUID_HYPERVISOR_KVM(r)))
+ {
+ __cpuidex((int32 *) r, 0x40000010, 0);
+ if (r[0] > 0)
+ {
+ *tsc_freq = r[0];
+ return true;
+ }
+ }
+#endif
+
+ return false;
+}
+
+static bool
+is_rdtscp_available()
+{
+ uint32 r[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ if (!__get_cpuid(0x80000001, &r[0], &r[1], &r[2], &r[3]))
+ return false;
+#elif defined(HAVE__CPUID)
+ __cpuid(r, 0x80000001);
+#else
+#error cpuid instruction not available
+#endif
+
+ return (r[3] & (1 << 27)) != 0;
+}
+
+/*
+ * Decide whether we use the RDTSC instruction at runtime, for Linux/x86,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ *
+ * Since Linux already does a significant amount of work to determine
+ * whether TSC is a viable clock source, decide based on that.
+ */
+void
+pg_initialize_rdtsc(void)
+{
+ FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+
+ if (fp)
+ {
+ char buf[128];
+
+ if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
+ {
+ /*
+ * Compute baseline CPU peformance, determines speed at which
+ * RDTSC advances.
+ */
+ uint32 tsc_freq;
+
+ if (get_tsc_frequency_khz(&tsc_freq))
+ {
+ /*
+ * Ticks to nanoseconds conversion requires floating point
+ * math because because:
+ *
+ * sec = ticks / frequency_hz ns = ticks / frequency_hz *
+ * 1,000,000,000 ns = ticks * (1,000,000,000 / frequency_hz)
+ * ns = ticks * (1,000,000 / frequency_khz) <-- now in
+ * kilohertz
+ *
+ * Here, 'ns' is usually a floating number. For example for a
+ * 2.5 GHz CPU the scaling factor becomes 1,000,000 /
+ * 2,500,000 = 1.2.
+ *
+ * To be able to use integer math we work around the lack of
+ * precision. We first scale the integer up and after the
+ * multiplication by the number of ticks in
+ * INSTR_TIME_GET_NANOSEC() we divide again by the same value.
+ * We picked the scaler such that it provides enough precision
+ * and is a power-of-two which allows for shifting instead of
+ * doing an integer division.
+ */
+ ticks_per_ns_scaled = INT64CONST(1000000) * TICKS_TO_NS_PRECISION / tsc_freq;
+ ticks_per_sec = tsc_freq * 1000; /* KHz->Hz */
+ max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
+
+ has_rdtsc = true;
+ has_rdtscp = is_rdtscp_available();
+ }
+ }
+
+ fclose(fp);
+ }
+}
+#endif /* defined(__x86_64__) && defined(__linux__) */
+
+#endif /* WIN32 */
diff --git a/src/common/meson.build b/src/common/meson.build
index 1540ba67cca..62b90b3e609 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -13,6 +13,7 @@ common_sources = files(
'file_perm.c',
'file_utils.c',
'hashfn.c',
+ 'instr_time.c',
'ip.c',
'jsonapi.c',
'keywords.c',
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index f71a851b18d..e2e339a0c4f 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,9 +4,11 @@
* portable high-precision interval timing
*
* This file provides an abstraction layer to hide portability issues in
- * interval timing. On Unix we use clock_gettime(), and on Windows we use
- * QueryPerformanceCounter(). These macros also give some breathing room to
- * use other high-precision-timing APIs.
+ * interval timing. On Linux/x86 we use the rdtsc instruction when a TSC
+ * clocksource is also used on the host OS. Otherwise, and on other
+ * Unix-like systems we use clock_gettime() and on Windows we use
+ * QueryPerformanceCounter(). These macros also give some breathing
+ * room to use other high-precision-timing APIs.
*
* The basic data type is instr_time, which all callers should treat as an
* opaque typedef. instr_time can store either an absolute time (of
@@ -17,10 +19,11 @@
*
* INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
*
- * INSTR_TIME_SET_CURRENT(t) set t to current time
+ * INSTR_TIME_SET_CURRENT_FAST(t) set t to current time without waiting
+ * for instructions in out-of-order window
*
- * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
- * evaluates to whether t changed
+ * INSTR_TIME_SET_CURRENT(t) set t to current time while waiting for
+ * instructions in OOO to retire
*
* INSTR_TIME_ADD(x, y) x += y
*
@@ -81,6 +84,15 @@ typedef struct instr_time
#ifndef WIN32
+/*
+ * Make sure this is a power-of-two, so that the compiler can turn the
+ * multiplications and divisions into shifts.
+ */
+#define TICKS_TO_NS_PRECISION (1<<14)
+
+extern int64 ticks_per_ns_scaled;
+extern int64 ticks_per_sec;
+extern int64 max_ticks_no_overflow;
/* Use clock_gettime() */
@@ -106,9 +118,18 @@ typedef struct instr_time
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-/* helper for INSTR_TIME_SET_CURRENT */
+#if defined(__x86_64__) && defined(__linux__)
+#include <x86intrin.h>
+#include <cpuid.h>
+
+extern bool has_rdtsc;
+extern bool has_rdtscp;
+
+extern void pg_initialize_rdtsc(void);
+#endif
+
static inline instr_time
-pg_clock_gettime_ns(void)
+pg_clock_gettime(void)
{
instr_time now;
struct timespec tmp;
@@ -119,11 +140,94 @@ pg_clock_gettime_ns(void)
return now;
}
+static inline instr_time
+pg_get_ticks_fast(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtsc)
+ {
+ instr_time now;
+
+ now.ticks = __rdtsc();
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline instr_time
+pg_get_ticks(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtscp)
+ {
+ instr_time now;
+ uint32 unused;
+
+ now.ticks = __rdtscp(&unused);
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline int64_t
+pg_ticks_to_ns(instr_time t)
+{
+ /*
+ * Would multiplication overflow? If so perform computation in two parts.
+ * Check overflow without actually overflowing via: a * b > max <=> a >
+ * max / b
+ */
+ int64 ns = 0;
+
+ if (unlikely(t.ticks > max_ticks_no_overflow))
+ {
+ /*
+ * Compute how often the maximum number of ticks fits completely into
+ * the number of elapsed ticks and convert that number into
+ * nanoseconds. Then multiply by the count to arrive at the final
+ * value. In a 2nd step we adjust the number of elapsed ticks and
+ * convert the remaining ticks.
+ */
+ int64 count = t.ticks / max_ticks_no_overflow;
+ int64 max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+
+ ns = max_ns * count;
+
+ /*
+ * Subtract the ticks that we now already accounted for, so that they
+ * don't get counted twice.
+ */
+ t.ticks -= count * max_ticks_no_overflow;
+ Assert(t.ticks >= 0);
+ }
+
+ ns += t.ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ return ns;
+}
+
+static inline void
+pg_initialize_get_ticks()
+{
+#if defined(__x86_64__) && defined(__linux__)
+ pg_initialize_rdtsc();
+#endif
+}
+
+#define INSTR_TIME_INITIALIZE() \
+ pg_initialize_get_ticks()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_get_ticks_fast())
+
#define INSTR_TIME_SET_CURRENT(t) \
- ((t) = pg_clock_gettime_ns())
+ ((t) = pg_get_ticks())
#define INSTR_TIME_GET_NANOSEC(t) \
- ((int64) (t).ticks)
+ pg_ticks_to_ns(t)
#else /* WIN32 */
@@ -131,7 +235,7 @@ pg_clock_gettime_ns(void)
/* Use QueryPerformanceCounter() */
-/* helper for INSTR_TIME_SET_CURRENT */
+/* helper for INSTR_TIME_SET_CURRENT / INSTR_TIME_SET_CURRENT_FAST */
static inline instr_time
pg_query_performance_counter(void)
{
@@ -153,6 +257,11 @@ GetTimerFrequency(void)
return (double) f.QuadPart;
}
+#define INSTR_TIME_INITIALIZE()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_query_performance_counter())
+
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
@@ -168,13 +277,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0)
-
#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
-#define INSTR_TIME_SET_CURRENT_LAZY(t) \
- (INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
-
-
#define INSTR_TIME_ADD(x,y) \
((x).ticks += (y).ticks)
--
2.47.3
v12-0003-pg_test_timing-Add-fast-flag-to-test-fast-timing.patchtext/x-diff; charset=utf-8Download
From bcf61f229e360e96bb936ad08d64b6a43b181bb2 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Sun, 27 Jul 2025 08:48:48 -0700
Subject: [PATCH v12 3/3] pg_test_timing: Add --fast flag to test fast timing,
report time source
In passing also reduce the per-loop overhead caused by repeated divisions
in INSTR_TIME_GET_NANOSEC when the ticks variable has become very large,
instead diff first and then turn it into nanosecs.
---
src/bin/pg_test_timing/pg_test_timing.c | 59 +++++++++++++++++++------
src/include/portability/instr_time.h | 30 ++++++++-----
2 files changed, 65 insertions(+), 24 deletions(-)
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index a5621251afc..b77ef2063b6 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -16,6 +16,7 @@ static const char *progname;
static unsigned int test_duration = 3;
static double max_rprct = 99.99;
+static bool fast_timing = false;
/* record duration in powers of 2 nanoseconds */
static long long int histogram[32];
@@ -56,6 +57,7 @@ handle_args(int argc, char *argv[])
static struct option long_options[] = {
{"duration", required_argument, NULL, 'd'},
{"cutoff", required_argument, NULL, 'c'},
+ {"fast", no_argument, NULL, 'f'},
{NULL, 0, NULL, 0}
};
@@ -68,7 +70,7 @@ handle_args(int argc, char *argv[])
{
if (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-?") == 0)
{
- printf(_("Usage: %s [-d DURATION] [-c CUTOFF]\n"), progname);
+ printf(_("Usage: %s [-d DURATION] [-c CUTOFF] [--fast]\n"), progname);
exit(0);
}
if (strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-V") == 0)
@@ -78,7 +80,7 @@ handle_args(int argc, char *argv[])
}
}
- while ((option = getopt_long(argc, argv, "d:c:",
+ while ((option = getopt_long(argc, argv, "d:c:f:",
long_options, &optindex)) != -1)
{
switch (option)
@@ -125,6 +127,10 @@ handle_args(int argc, char *argv[])
}
break;
+ case 'f':
+ fast_timing = true;
+ break;
+
default:
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
progname);
@@ -155,11 +161,31 @@ test_timing(unsigned int duration)
uint64 total_time;
int64 time_elapsed = 0;
uint64 loop_count = 0;
- uint64 prev,
- cur;
instr_time start_time,
end_time,
- temp;
+ prev,
+ cur;
+ char *time_source = NULL;
+ bool fast_timing_used = false;
+
+ INSTR_TIME_INITIALIZE();
+
+#if !defined(WIN32) && defined(__x86_64__) && defined(__linux__)
+ if (fast_timing && has_rdtsc)
+ {
+ time_source = "RDTSC";
+ fast_timing_used = true;
+ }
+ else if (has_rdtscp)
+ time_source = "RDTSCP";
+ else
+ time_source = PG_INSTR_CLOCK_NAME;
+#else
+ time_source = PG_INSTR_CLOCK_NAME;
+#endif
+ if (fast_timing && !fast_timing_used)
+ printf(_("Warning: Fast timing requested, but not available - regular timing source will be used\n"));
+ printf(_("Time source: %s\n"), time_source);
/*
* Pre-zero the statistics data structures. They're already zero by
@@ -173,8 +199,11 @@ test_timing(unsigned int duration)
total_time = duration > 0 ? duration * INT64CONST(1000000000) : 0;
- INSTR_TIME_SET_CURRENT(start_time);
- cur = INSTR_TIME_GET_NANOSEC(start_time);
+ if (fast_timing)
+ INSTR_TIME_SET_CURRENT_FAST(start_time);
+ else
+ INSTR_TIME_SET_CURRENT(start_time);
+ cur = start_time;
while (time_elapsed < total_time)
{
@@ -182,9 +211,11 @@ test_timing(unsigned int duration)
bits;
prev = cur;
- INSTR_TIME_SET_CURRENT(temp);
- cur = INSTR_TIME_GET_NANOSEC(temp);
- diff = cur - prev;
+ if (fast_timing)
+ INSTR_TIME_SET_CURRENT_FAST(cur);
+ else
+ INSTR_TIME_SET_CURRENT(cur);
+ diff = INSTR_TIME_DIFF_NANOSEC(cur, prev);
/* Did time go backwards? */
if (unlikely(diff < 0))
@@ -217,11 +248,13 @@ test_timing(unsigned int duration)
largest_diff_count++;
loop_count++;
- INSTR_TIME_SUBTRACT(temp, start_time);
- time_elapsed = INSTR_TIME_GET_NANOSEC(temp);
+ time_elapsed = INSTR_TIME_DIFF_NANOSEC(cur, start_time);
}
- INSTR_TIME_SET_CURRENT(end_time);
+ if (fast_timing)
+ INSTR_TIME_SET_CURRENT_FAST(end_time);
+ else
+ INSTR_TIME_SET_CURRENT(end_time);
INSTR_TIME_SUBTRACT(end_time, start_time);
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index e2e339a0c4f..f02296f1026 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -112,10 +112,13 @@ extern int64 max_ticks_no_overflow;
*/
#if defined(__darwin__) && defined(CLOCK_MONOTONIC_RAW)
#define PG_INSTR_CLOCK CLOCK_MONOTONIC_RAW
+#define PG_INSTR_CLOCK_NAME "clock_gettime (CLOCK_MONOTONIC_RAW)"
#elif defined(CLOCK_MONOTONIC)
#define PG_INSTR_CLOCK CLOCK_MONOTONIC
+#define PG_INSTR_CLOCK_NAME "clock_gettime (CLOCK_MONOTONIC)"
#else
#define PG_INSTR_CLOCK CLOCK_REALTIME
+#define PG_INSTR_CLOCK_NAME "clock_gettime (CLOCK_REALTIME)"
#endif
#if defined(__x86_64__) && defined(__linux__)
@@ -174,7 +177,7 @@ pg_get_ticks(void)
}
static inline int64_t
-pg_ticks_to_ns(instr_time t)
+pg_ticks_to_ns(int64 ticks)
{
/*
* Would multiplication overflow? If so perform computation in two parts.
@@ -183,7 +186,7 @@ pg_ticks_to_ns(instr_time t)
*/
int64 ns = 0;
- if (unlikely(t.ticks > max_ticks_no_overflow))
+ if (unlikely(ticks > max_ticks_no_overflow))
{
/*
* Compute how often the maximum number of ticks fits completely into
@@ -192,7 +195,7 @@ pg_ticks_to_ns(instr_time t)
* value. In a 2nd step we adjust the number of elapsed ticks and
* convert the remaining ticks.
*/
- int64 count = t.ticks / max_ticks_no_overflow;
+ int64 count = ticks / max_ticks_no_overflow;
int64 max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
ns = max_ns * count;
@@ -201,11 +204,11 @@ pg_ticks_to_ns(instr_time t)
* Subtract the ticks that we now already accounted for, so that they
* don't get counted twice.
*/
- t.ticks -= count * max_ticks_no_overflow;
- Assert(t.ticks >= 0);
+ ticks -= count * max_ticks_no_overflow;
+ Assert(ticks >= 0);
}
- ns += t.ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ ns += ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
return ns;
}
@@ -226,14 +229,14 @@ pg_initialize_get_ticks()
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_get_ticks())
-#define INSTR_TIME_GET_NANOSEC(t) \
- pg_ticks_to_ns(t)
-
+#define INSTR_TIME_TICKS_TO_NANOSEC(ticks) \
+ (pg_ticks_to_ns(ticks))
#else /* WIN32 */
/* Use QueryPerformanceCounter() */
+#define PG_INSTR_CLOCK_NAME "QueryPerformanceCounter"
/* helper for INSTR_TIME_SET_CURRENT / INSTR_TIME_SET_CURRENT_FAST */
static inline instr_time
@@ -265,8 +268,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
-#define INSTR_TIME_GET_NANOSEC(t) \
- ((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency())))
+#define INSTR_TIME_TICKS_TO_NANOSEC(ticks) \
+ ((int64) ((ticks) * ((double) NS_PER_S / GetTimerFrequency())))
#endif /* WIN32 */
@@ -285,9 +288,14 @@ GetTimerFrequency(void)
#define INSTR_TIME_SUBTRACT(x,y) \
((x).ticks -= (y).ticks)
+#define INSTR_TIME_DIFF_NANOSEC(x,y) \
+ (INSTR_TIME_TICKS_TO_NANOSEC((x).ticks - (y).ticks))
+
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
((x).ticks += (y).ticks - (z).ticks)
+#define INSTR_TIME_GET_NANOSEC(t) \
+ (INSTR_TIME_TICKS_TO_NANOSEC((t).ticks))
#define INSTR_TIME_GET_DOUBLE(t) \
((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
--
2.47.3
On Sun, Oct 19, 2025 at 2:16 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
If I were
a consultant trying to understand a customer's system, I would have to
ask them to run it twice just in case 'fast' is supported, and I don't
think that's very helpful.
Big +1 from me.
--
Robert Haas
EDB: http://www.enterprisedb.com
Hi,
On 2025-09-01 12:36:24 +0200, David Geier wrote:
Open questions I have:
- Could we rely on checking whether the TSC timesource is invariant (via
CPUID), instead of relying on Linux choosing it as a clocksource?Why do you want to do that? Are you concerned that Linux might pick a
different clock source even though invariant TSC is available?
Not sure about Lukas, but I'm slightly concerned about making this a linux
specific mechanism unnecessarily.
We could code our own check but looking at the Linux kernel code, this
is a bit more involved if we want to do it completely right. They check
e.g. if the TSC is also synchronized across different CPUs, which is not
the case if they're on different chassis (see unsynchronized_tsc() ->
apic_is_clustered_box()).
I think Linux has higher fidelity requirements than our instrumentation usage
- with linux an inaccurate clock would lead to broken timers, wrong wall clock
etc, whereas for us it's just a skewed instrumentation result.
Greetings,
Andres Freund
On 20.10.2025 21:59, Robert Haas wrote:
On Sun, Oct 19, 2025 at 2:16 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
If I were
a consultant trying to understand a customer's system, I would have to
ask them to run it twice just in case 'fast' is supported, and I don't
think that's very helpful.Big +1 from me.
That makes sense. I'm planning to rebase the patch the next days. Then
I'll also take care of that.
--
David Geier
On 22.10.2025 15:32, Andres Freund wrote:
Hi,
On 2025-09-01 12:36:24 +0200, David Geier wrote:
Open questions I have:
- Could we rely on checking whether the TSC timesource is invariant (via
CPUID), instead of relying on Linux choosing it as a clocksource?Why do you want to do that? Are you concerned that Linux might pick a
different clock source even though invariant TSC is available?Not sure about Lukas, but I'm slightly concerned about making this a linux
specific mechanism unnecessarily.
Considering [1]/messages/by-id/CAP53Pky-BN0Ui+A9no3TsU=GoMTFpxYSWYtp_LVaDH=y69BxNg@mail.gmail.com, Lukas seems to share my concerns that building or own
has the risk of missing cases.
We could code our own check but looking at the Linux kernel code, this
is a bit more involved if we want to do it completely right. They check
e.g. if the TSC is also synchronized across different CPUs, which is not
the case if they're on different chassis (see unsynchronized_tsc() ->
apic_is_clustered_box()).I think Linux has higher fidelity requirements than our instrumentation usage
- with linux an inaccurate clock would lead to broken timers, wrong wall clock
etc, whereas for us it's just a skewed instrumentation result.
That's true. As long as we use the RDTSCP basd code only in places where
it doesn't affect "correctness" it's not the end of the world if they're
skewed.
I'll give it a try to code our own detection mechanism and will share
findings here. Then we can make a call based on the learnings.
--
David Geier
[1]: /messages/by-id/CAP53Pky-BN0Ui+A9no3TsU=GoMTFpxYSWYtp_LVaDH=y69BxNg@mail.gmail.com
/messages/by-id/CAP53Pky-BN0Ui+A9no3TsU=GoMTFpxYSWYtp_LVaDH=y69BxNg@mail.gmail.com
On Tue, Nov 18, 2025 at 11:26 PM David Geier <geidav.pg@gmail.com> wrote:
On 22.10.2025 15:32, Andres Freund wrote:
On 2025-09-01 12:36:24 +0200, David Geier wrote:
Open questions I have:
- Could we rely on checking whether the TSC timesource is invariant(via
CPUID), instead of relying on Linux choosing it as a clocksource?
Why do you want to do that? Are you concerned that Linux might pick a
different clock source even though invariant TSC is available?Not sure about Lukas, but I'm slightly concerned about making this a
linux
specific mechanism unnecessarily.
Considering [1], Lukas seems to share my concerns that building or own
has the risk of missing cases.
I had an off-list discussion with Andres about this at PGConf.EU, and one
idea that was floated is that we could keep the Linux specific mechanism
when on Linux, but not do this check on other platforms, as to not affect
portability.
We could code our own check but looking at the Linux kernel code, this
is a bit more involved if we want to do it completely right. They check
e.g. if the TSC is also synchronized across different CPUs, which is not
the case if they're on different chassis (see unsynchronized_tsc() ->
apic_is_clustered_box()).I think Linux has higher fidelity requirements than our instrumentation
usage
- with linux an inaccurate clock would lead to broken timers, wrong wall
clock
etc, whereas for us it's just a skewed instrumentation result.
That's true. As long as we use the RDTSCP basd code only in places where
it doesn't affect "correctness" it's not the end of the world if they're
skewed.
I think my general worry here is that we basically give the user no escape
hatch - you might end up with a case where Postgres gives you unusable
EXPLAIN timings and you can't do anything to fix that.
Overall, I'm still thinking a GUC might be the way to go, but I don't think
anyone else was enthusiastic about that idea :)
Thanks for working on an updated patch!
Thanks,
Lukas
--
Lukas Fittl
On Wed, Nov 19, 2025 at 11:55 AM Lukas Fittl <lukas@fittl.com> wrote:
Overall, I'm still thinking a GUC might be the way to go, but I don't think anyone else was enthusiastic about that idea :)
Reliable feature auto-detection is the best option, but if that's not
possible, I think the choices are add a GUC or give up on the project
altogether. Using a GUC to deal with platform dependencies is a pretty
reasonable concept -- see, e.g. dynamic_shared_memory_type or
huge_pages or io_method. If we can't autodetect it reliably and we
aren't willing to add a GUC, we're basically saying there's not enough
value here to justify adding a configuration parameter. That's often a
totally reasonable conclusion -- it can easily happen that the
benefits of a platform-specific optimization are too small to make it
worth configuring. But I would have thought that in this case the
benefits might be quite large.
--
Robert Haas
EDB: http://www.enterprisedb.com
On 19.11.2025 08:20, David Geier wrote:
On 20.10.2025 21:59, Robert Haas wrote:
On Sun, Oct 19, 2025 at 2:16 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
If I were
a consultant trying to understand a customer's system, I would have to
ask them to run it twice just in case 'fast' is supported, and I don't
think that's very helpful.Big +1 from me.
That makes sense. I'm planning to rebase the patch the next days. Then
I'll also take care of that.
The attached patched is rebased on latest master and pg_test_timing now
always tests the normal and the fast timing code. If no fast clock
source is available the fast timing code is skipped.
--
David Geier
Attachments:
v12-0002-pg_test_timing-Also-test-fast-timing-and-report-time.patchtext/x-patch; charset=UTF-8; name=v12-0002-pg_test_timing-Also-test-fast-timing-and-report-time.patchDownload
From 663fc003f86e4f1e90c5a98132d3d30da4c777b5 Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Sun, 27 Jul 2025 08:48:48 -0700
Subject: [PATCH 2/2] pg_test_timing: Also test fast timing and report time
source
In passing also reduce the per-loop overhead caused by repeated divisions
in INSTR_TIME_GET_NANOSEC when the ticks variable has become very large,
instead diff first and then turn it into nanosecs.
---
src/bin/pg_test_timing/pg_test_timing.c | 59 +++++++++++++++++++------
src/include/portability/instr_time.h | 30 ++++++++-----
2 files changed, 65 insertions(+), 24 deletions(-)
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index a5621251afc..b77ef2063b6 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -16,6 +16,7 @@ static const char *progname;
static unsigned int test_duration = 3;
static double max_rprct = 99.99;
+static bool fast_timing = false;
/* record duration in powers of 2 nanoseconds */
static long long int histogram[32];
@@ -56,6 +57,7 @@ handle_args(int argc, char *argv[])
static struct option long_options[] = {
{"duration", required_argument, NULL, 'd'},
{"cutoff", required_argument, NULL, 'c'},
+ {"fast", no_argument, NULL, 'f'},
{NULL, 0, NULL, 0}
};
@@ -68,7 +70,7 @@ handle_args(int argc, char *argv[])
{
if (strcmp(argv[1], "--help") == 0 || strcmp(argv[1], "-?") == 0)
{
- printf(_("Usage: %s [-d DURATION] [-c CUTOFF]\n"), progname);
+ printf(_("Usage: %s [-d DURATION] [-c CUTOFF] [--fast]\n"), progname);
exit(0);
}
if (strcmp(argv[1], "--version") == 0 || strcmp(argv[1], "-V") == 0)
@@ -78,7 +80,7 @@ handle_args(int argc, char *argv[])
}
}
- while ((option = getopt_long(argc, argv, "d:c:",
+ while ((option = getopt_long(argc, argv, "d:c:f:",
long_options, &optindex)) != -1)
{
switch (option)
@@ -125,6 +127,10 @@ handle_args(int argc, char *argv[])
}
break;
+ case 'f':
+ fast_timing = true;
+ break;
+
default:
fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
progname);
@@ -155,11 +161,31 @@ test_timing(unsigned int duration)
uint64 total_time;
int64 time_elapsed = 0;
uint64 loop_count = 0;
- uint64 prev,
- cur;
instr_time start_time,
end_time,
- temp;
+ prev,
+ cur;
+ char *time_source = NULL;
+ bool fast_timing_used = false;
+
+ INSTR_TIME_INITIALIZE();
+
+#if !defined(WIN32) && defined(__x86_64__) && defined(__linux__)
+ if (fast_timing && has_rdtsc)
+ {
+ time_source = "RDTSC";
+ fast_timing_used = true;
+ }
+ else if (has_rdtscp)
+ time_source = "RDTSCP";
+ else
+ time_source = PG_INSTR_CLOCK_NAME;
+#else
+ time_source = PG_INSTR_CLOCK_NAME;
+#endif
+ if (fast_timing && !fast_timing_used)
+ printf(_("Warning: Fast timing requested, but not available - regular timing source will be used\n"));
+ printf(_("Time source: %s\n"), time_source);
/*
* Pre-zero the statistics data structures. They're already zero by
@@ -173,8 +199,11 @@ test_timing(unsigned int duration)
total_time = duration > 0 ? duration * INT64CONST(1000000000) : 0;
- INSTR_TIME_SET_CURRENT(start_time);
- cur = INSTR_TIME_GET_NANOSEC(start_time);
+ if (fast_timing)
+ INSTR_TIME_SET_CURRENT_FAST(start_time);
+ else
+ INSTR_TIME_SET_CURRENT(start_time);
+ cur = start_time;
while (time_elapsed < total_time)
{
@@ -182,9 +211,11 @@ test_timing(unsigned int duration)
bits;
prev = cur;
- INSTR_TIME_SET_CURRENT(temp);
- cur = INSTR_TIME_GET_NANOSEC(temp);
- diff = cur - prev;
+ if (fast_timing)
+ INSTR_TIME_SET_CURRENT_FAST(cur);
+ else
+ INSTR_TIME_SET_CURRENT(cur);
+ diff = INSTR_TIME_DIFF_NANOSEC(cur, prev);
/* Did time go backwards? */
if (unlikely(diff < 0))
@@ -217,11 +248,13 @@ test_timing(unsigned int duration)
largest_diff_count++;
loop_count++;
- INSTR_TIME_SUBTRACT(temp, start_time);
- time_elapsed = INSTR_TIME_GET_NANOSEC(temp);
+ time_elapsed = INSTR_TIME_DIFF_NANOSEC(cur, start_time);
}
- INSTR_TIME_SET_CURRENT(end_time);
+ if (fast_timing)
+ INSTR_TIME_SET_CURRENT_FAST(end_time);
+ else
+ INSTR_TIME_SET_CURRENT(end_time);
INSTR_TIME_SUBTRACT(end_time, start_time);
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index e2e339a0c4f..f02296f1026 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -112,10 +112,13 @@ extern int64 max_ticks_no_overflow;
*/
#if defined(__darwin__) && defined(CLOCK_MONOTONIC_RAW)
#define PG_INSTR_CLOCK CLOCK_MONOTONIC_RAW
+#define PG_INSTR_CLOCK_NAME "clock_gettime (CLOCK_MONOTONIC_RAW)"
#elif defined(CLOCK_MONOTONIC)
#define PG_INSTR_CLOCK CLOCK_MONOTONIC
+#define PG_INSTR_CLOCK_NAME "clock_gettime (CLOCK_MONOTONIC)"
#else
#define PG_INSTR_CLOCK CLOCK_REALTIME
+#define PG_INSTR_CLOCK_NAME "clock_gettime (CLOCK_REALTIME)"
#endif
#if defined(__x86_64__) && defined(__linux__)
@@ -174,7 +177,7 @@ pg_get_ticks(void)
}
static inline int64_t
-pg_ticks_to_ns(instr_time t)
+pg_ticks_to_ns(int64 ticks)
{
/*
* Would multiplication overflow? If so perform computation in two parts.
@@ -183,7 +186,7 @@ pg_ticks_to_ns(instr_time t)
*/
int64 ns = 0;
- if (unlikely(t.ticks > max_ticks_no_overflow))
+ if (unlikely(ticks > max_ticks_no_overflow))
{
/*
* Compute how often the maximum number of ticks fits completely into
@@ -192,7 +195,7 @@ pg_ticks_to_ns(instr_time t)
* value. In a 2nd step we adjust the number of elapsed ticks and
* convert the remaining ticks.
*/
- int64 count = t.ticks / max_ticks_no_overflow;
+ int64 count = ticks / max_ticks_no_overflow;
int64 max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
ns = max_ns * count;
@@ -201,11 +204,11 @@ pg_ticks_to_ns(instr_time t)
* Subtract the ticks that we now already accounted for, so that they
* don't get counted twice.
*/
- t.ticks -= count * max_ticks_no_overflow;
- Assert(t.ticks >= 0);
+ ticks -= count * max_ticks_no_overflow;
+ Assert(ticks >= 0);
}
- ns += t.ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ ns += ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
return ns;
}
@@ -226,14 +229,14 @@ pg_initialize_get_ticks()
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_get_ticks())
-#define INSTR_TIME_GET_NANOSEC(t) \
- pg_ticks_to_ns(t)
-
+#define INSTR_TIME_TICKS_TO_NANOSEC(ticks) \
+ (pg_ticks_to_ns(ticks))
#else /* WIN32 */
/* Use QueryPerformanceCounter() */
+#define PG_INSTR_CLOCK_NAME "QueryPerformanceCounter"
/* helper for INSTR_TIME_SET_CURRENT / INSTR_TIME_SET_CURRENT_FAST */
static inline instr_time
@@ -265,8 +268,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
-#define INSTR_TIME_GET_NANOSEC(t) \
- ((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency())))
+#define INSTR_TIME_TICKS_TO_NANOSEC(ticks) \
+ ((int64) ((ticks) * ((double) NS_PER_S / GetTimerFrequency())))
#endif /* WIN32 */
@@ -285,9 +288,14 @@ GetTimerFrequency(void)
#define INSTR_TIME_SUBTRACT(x,y) \
((x).ticks -= (y).ticks)
+#define INSTR_TIME_DIFF_NANOSEC(x,y) \
+ (INSTR_TIME_TICKS_TO_NANOSEC((x).ticks - (y).ticks))
+
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
((x).ticks += (y).ticks - (z).ticks)
+#define INSTR_TIME_GET_NANOSEC(t) \
+ (INSTR_TIME_TICKS_TO_NANOSEC((t).ticks))
#define INSTR_TIME_GET_DOUBLE(t) \
((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
--
2.51.0
v12-0001-Use-time-stamp-counter-to-measure-time-on-Linux-x86.patchtext/x-patch; charset=UTF-8; name=v12-0001-Use-time-stamp-counter-to-measure-time-on-Linux-x86.patchDownload
From 4104480bee3161d1338f253d3e2defe418d5378a Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Fri, 25 Jul 2025 17:57:20 -0700
Subject: [PATCH 1/2] Use time stamp counter to measure time on Linux/x86
We switch to using the time stamp counter (TSC) instead of clock_gettime()
to reduce overhead of EXPLAIN (ANALYZE, TIME ON). Tests showed that runtime
is reduced by around 10% for queries moving lots of rows through the plan.
For now this is only enabled on Linux/x86, in case the system clocksource is
reported as TSC. Relying on the Linux kernel simplifies the logic to detect
if the present TSC is usable (frequency invariant, synchronized between
sockets, etc.). In all other cases we fallback to clock_gettime().
Note, that we intentionally use RDTSC in the fast paths, rather than RDTSCP.
RDTSCP waits for outstanding instructions to retire on out-of-order CPUs.
This adds noticably for little benefit in the typical InstrStartNode() /
InstrStopNode() use case. The macro to be used in such cases is called
INSTR_TIME_SET_CURRENT_FAST(). The original macro INSTR_TIME_SET_CURRENT()
uses RDTSCP and is supposed to be used when precision is more important
than performance.
Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
---
src/backend/access/heap/vacuumlazy.c | 4 +-
src/backend/executor/instrument.c | 12 +-
src/backend/utils/init/postinit.c | 3 +
src/bin/pgbench/pgbench.c | 3 +
src/bin/psql/startup.c | 4 +
src/common/Makefile | 1 +
src/common/instr_time.c | 206 +++++++++++++++++++++++++++
src/common/meson.build | 1 +
src/include/portability/instr_time.h | 136 +++++++++++++++---
9 files changed, 348 insertions(+), 22 deletions(-)
create mode 100644 src/common/instr_time.c
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 65bb0568a86..3f291a315a4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3415,8 +3415,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
INSTR_TIME_SET_CURRENT(currenttime);
elapsed = currenttime;
INSTR_TIME_SUBTRACT(elapsed, starttime);
- if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000)
- >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
+ if (INSTR_TIME_GET_MILLISEC(elapsed) >=
+ VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
{
if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
{
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 9e11c662a7c..3940b59d70c 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -67,9 +67,13 @@ InstrInit(Instrumentation *instr, int instrument_options)
void
InstrStartNode(Instrumentation *instr)
{
- if (instr->need_timer &&
- !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ if (instr->need_timer)
+ {
+ if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStartNode called twice in a row");
+ else
+ INSTR_TIME_SET_CURRENT_FAST(instr->starttime);
+ }
/* save buffer usage totals at node entry, if needed */
if (instr->need_bufusage)
@@ -95,7 +99,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (INSTR_TIME_IS_ZERO(instr->starttime))
elog(ERROR, "InstrStopNode called without start");
- INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_SET_CURRENT_FAST(endtime);
INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 98f9598cd78..845c0d0e4f6 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -806,6 +806,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
/* Initialize portal manager */
EnablePortalManager();
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/*
* Load relcache entries for the shared system catalogs. This must create
* at least entries for pg_database and catalogs used for authentication.
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 68774a59efd..2ff094a093c 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7334,6 +7334,9 @@ main(int argc, char **argv)
initRandomState(&state[i].cs_func_rs);
}
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* opening connection... */
con = doConnect();
if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index 249b6aa5169..d615df593c7 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
#include "help.h"
#include "input.h"
#include "mainloop.h"
+#include "portability/instr_time.h"
#include "settings.h"
/*
@@ -327,6 +328,9 @@ main(int argc, char *argv[])
PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
SyncVariables();
if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index 2c720caa509..1a2fbbe887f 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -59,6 +59,7 @@ OBJS_COMMON = \
file_perm.o \
file_utils.o \
hashfn.o \
+ instr_time.o \
ip.o \
jsonapi.o \
keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 00000000000..fdf47699f20
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,206 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ * Non-inline parts of the portable high-precision interval timing
+ * implementation
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#if defined(HAVE__GET_CPUID) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
+#include <cpuid.h>
+#endif
+
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
+#include <intrin.h>
+#endif
+
+#include "portability/instr_time.h"
+
+#ifndef WIN32
+/*
+ * Stores what the number of cycles needs to be multiplied with to end up
+ * with nanoseconds using integer math. See comment in pg_initialize_rdtsc()
+ * for more details.
+ *
+ * By default assume we are using clock_gettime() as a fallback which uses
+ * nanoseconds as ticks. Hence, we set the multiplier to the precision scalar
+ * so that the division in INSTR_TIME_GET_NANOSEC() won't change the nanoseconds.
+ *
+ * When using the RDTSC instruction directly this is filled in during initialization
+ * based on the relevant CPUID fields.
+ */
+int64 ticks_per_ns_scaled = TICKS_TO_NS_PRECISION;
+int64 ticks_per_sec = NS_PER_S;
+int64 max_ticks_no_overflow = PG_INT64_MAX / TICKS_TO_NS_PRECISION;
+
+#if defined(__x86_64__) && defined(__linux__)
+/*
+ * Indicates if RDTSC can be used (Linux/x86 only, when OS uses TSC clocksource)
+ */
+bool has_rdtsc = false;
+
+/*
+ * Indicates if RDTSCP can be used. True if RDTSC can be used and RDTSCP is available.
+ */
+bool has_rdtscp = false;
+
+#define CPUID_HYPERVISOR_VMWARE(words) (words[1] == 0x61774d56 && words[2] == 0x4d566572 && words[3] == 0x65726177) /* VMwareVMware */
+#define CPUID_HYPERVISOR_KVM(words) (words[1] == 0x4b4d564b && words[2] == 0x564b4d56 && words[3] == 0x0000004d) /* KVMKVMKVM */
+
+static bool
+get_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ uint32 r[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(0x15, &r[0] /* denominator */ , &r[1] /* numerator */ , &r[2] /* hz */ , &r[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(r, 0x15);
+#else
+#error cpuid instruction not available
+#endif
+
+ if (r[2] > 0)
+ {
+ if (r[0] == 0 || r[1] == 0)
+ return false;
+
+ *tsc_freq = r[2] / 1000 * r[1] / r[0];
+ return true;
+ }
+
+ /* Some CPUs only report frequency in 16H */
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(0x16, &r[0] /* base_mhz */ , &r[1], &r[2], &r[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(r, 0x16);
+#else
+#error cpuid instruction not available
+#endif
+
+ if (r[0] > 0)
+ {
+ *tsc_freq = r[0] * 1000;
+ return true;
+ }
+
+ /*
+ * Check if we have a KVM or VMware Hypervisor passing down TSC frequency
+ * to us in a guest VM
+ *
+ * Note that accessing the 0x40000000 leaf for Hypervisor info requires
+ * use of __cpuidex to set ECX to 0. The similar __get_cpuid_count
+ * function does not work as expected since it contains a check for
+ * __get_cpuid_max, which has been observed to be lower than the special
+ * Hypervisor leaf.
+ */
+#if defined(HAVE__CPUIDEX)
+ __cpuidex((int32 *) r, 0x40000000, 0);
+ if (r[0] >= 0x40000010 && (CPUID_HYPERVISOR_VMWARE(r) || CPUID_HYPERVISOR_KVM(r)))
+ {
+ __cpuidex((int32 *) r, 0x40000010, 0);
+ if (r[0] > 0)
+ {
+ *tsc_freq = r[0];
+ return true;
+ }
+ }
+#endif
+
+ return false;
+}
+
+static bool
+is_rdtscp_available()
+{
+ uint32 r[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ if (!__get_cpuid(0x80000001, &r[0], &r[1], &r[2], &r[3]))
+ return false;
+#elif defined(HAVE__CPUID)
+ __cpuid(r, 0x80000001);
+#else
+#error cpuid instruction not available
+#endif
+
+ return (r[3] & (1 << 27)) != 0;
+}
+
+/*
+ * Decide whether we use the RDTSC instruction at runtime, for Linux/x86,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ *
+ * Since Linux already does a significant amount of work to determine
+ * whether TSC is a viable clock source, decide based on that.
+ */
+void
+pg_initialize_rdtsc(void)
+{
+ FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+
+ if (fp)
+ {
+ char buf[128];
+
+ if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
+ {
+ /*
+ * Compute baseline CPU peformance, determines speed at which
+ * RDTSC advances.
+ */
+ uint32 tsc_freq;
+
+ if (get_tsc_frequency_khz(&tsc_freq))
+ {
+ /*
+ * Ticks to nanoseconds conversion requires floating point
+ * math because because:
+ *
+ * sec = ticks / frequency_hz ns = ticks / frequency_hz *
+ * 1,000,000,000 ns = ticks * (1,000,000,000 / frequency_hz)
+ * ns = ticks * (1,000,000 / frequency_khz) <-- now in
+ * kilohertz
+ *
+ * Here, 'ns' is usually a floating number. For example for a
+ * 2.5 GHz CPU the scaling factor becomes 1,000,000 /
+ * 2,500,000 = 1.2.
+ *
+ * To be able to use integer math we work around the lack of
+ * precision. We first scale the integer up and after the
+ * multiplication by the number of ticks in
+ * INSTR_TIME_GET_NANOSEC() we divide again by the same value.
+ * We picked the scaler such that it provides enough precision
+ * and is a power-of-two which allows for shifting instead of
+ * doing an integer division.
+ */
+ ticks_per_ns_scaled = INT64CONST(1000000) * TICKS_TO_NS_PRECISION / tsc_freq;
+ ticks_per_sec = tsc_freq * 1000; /* KHz->Hz */
+ max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
+
+ has_rdtsc = true;
+ has_rdtscp = is_rdtscp_available();
+ }
+ }
+
+ fclose(fp);
+ }
+}
+#endif /* defined(__x86_64__) && defined(__linux__) */
+
+#endif /* WIN32 */
diff --git a/src/common/meson.build b/src/common/meson.build
index 1540ba67cca..62b90b3e609 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -13,6 +13,7 @@ common_sources = files(
'file_perm.c',
'file_utils.c',
'hashfn.c',
+ 'instr_time.c',
'ip.c',
'jsonapi.c',
'keywords.c',
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index f71a851b18d..e2e339a0c4f 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,9 +4,11 @@
* portable high-precision interval timing
*
* This file provides an abstraction layer to hide portability issues in
- * interval timing. On Unix we use clock_gettime(), and on Windows we use
- * QueryPerformanceCounter(). These macros also give some breathing room to
- * use other high-precision-timing APIs.
+ * interval timing. On Linux/x86 we use the rdtsc instruction when a TSC
+ * clocksource is also used on the host OS. Otherwise, and on other
+ * Unix-like systems we use clock_gettime() and on Windows we use
+ * QueryPerformanceCounter(). These macros also give some breathing
+ * room to use other high-precision-timing APIs.
*
* The basic data type is instr_time, which all callers should treat as an
* opaque typedef. instr_time can store either an absolute time (of
@@ -17,10 +19,11 @@
*
* INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
*
- * INSTR_TIME_SET_CURRENT(t) set t to current time
+ * INSTR_TIME_SET_CURRENT_FAST(t) set t to current time without waiting
+ * for instructions in out-of-order window
*
- * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
- * evaluates to whether t changed
+ * INSTR_TIME_SET_CURRENT(t) set t to current time while waiting for
+ * instructions in OOO to retire
*
* INSTR_TIME_ADD(x, y) x += y
*
@@ -81,6 +84,15 @@ typedef struct instr_time
#ifndef WIN32
+/*
+ * Make sure this is a power-of-two, so that the compiler can turn the
+ * multiplications and divisions into shifts.
+ */
+#define TICKS_TO_NS_PRECISION (1<<14)
+
+extern int64 ticks_per_ns_scaled;
+extern int64 ticks_per_sec;
+extern int64 max_ticks_no_overflow;
/* Use clock_gettime() */
@@ -106,9 +118,18 @@ typedef struct instr_time
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-/* helper for INSTR_TIME_SET_CURRENT */
+#if defined(__x86_64__) && defined(__linux__)
+#include <x86intrin.h>
+#include <cpuid.h>
+
+extern bool has_rdtsc;
+extern bool has_rdtscp;
+
+extern void pg_initialize_rdtsc(void);
+#endif
+
static inline instr_time
-pg_clock_gettime_ns(void)
+pg_clock_gettime(void)
{
instr_time now;
struct timespec tmp;
@@ -119,11 +140,94 @@ pg_clock_gettime_ns(void)
return now;
}
+static inline instr_time
+pg_get_ticks_fast(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtsc)
+ {
+ instr_time now;
+
+ now.ticks = __rdtsc();
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline instr_time
+pg_get_ticks(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtscp)
+ {
+ instr_time now;
+ uint32 unused;
+
+ now.ticks = __rdtscp(&unused);
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline int64_t
+pg_ticks_to_ns(instr_time t)
+{
+ /*
+ * Would multiplication overflow? If so perform computation in two parts.
+ * Check overflow without actually overflowing via: a * b > max <=> a >
+ * max / b
+ */
+ int64 ns = 0;
+
+ if (unlikely(t.ticks > max_ticks_no_overflow))
+ {
+ /*
+ * Compute how often the maximum number of ticks fits completely into
+ * the number of elapsed ticks and convert that number into
+ * nanoseconds. Then multiply by the count to arrive at the final
+ * value. In a 2nd step we adjust the number of elapsed ticks and
+ * convert the remaining ticks.
+ */
+ int64 count = t.ticks / max_ticks_no_overflow;
+ int64 max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+
+ ns = max_ns * count;
+
+ /*
+ * Subtract the ticks that we now already accounted for, so that they
+ * don't get counted twice.
+ */
+ t.ticks -= count * max_ticks_no_overflow;
+ Assert(t.ticks >= 0);
+ }
+
+ ns += t.ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ return ns;
+}
+
+static inline void
+pg_initialize_get_ticks()
+{
+#if defined(__x86_64__) && defined(__linux__)
+ pg_initialize_rdtsc();
+#endif
+}
+
+#define INSTR_TIME_INITIALIZE() \
+ pg_initialize_get_ticks()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_get_ticks_fast())
+
#define INSTR_TIME_SET_CURRENT(t) \
- ((t) = pg_clock_gettime_ns())
+ ((t) = pg_get_ticks())
#define INSTR_TIME_GET_NANOSEC(t) \
- ((int64) (t).ticks)
+ pg_ticks_to_ns(t)
#else /* WIN32 */
@@ -131,7 +235,7 @@ pg_clock_gettime_ns(void)
/* Use QueryPerformanceCounter() */
-/* helper for INSTR_TIME_SET_CURRENT */
+/* helper for INSTR_TIME_SET_CURRENT / INSTR_TIME_SET_CURRENT_FAST */
static inline instr_time
pg_query_performance_counter(void)
{
@@ -153,6 +257,11 @@ GetTimerFrequency(void)
return (double) f.QuadPart;
}
+#define INSTR_TIME_INITIALIZE()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_query_performance_counter())
+
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
@@ -168,13 +277,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0)
-
#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
-#define INSTR_TIME_SET_CURRENT_LAZY(t) \
- (INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
-
-
#define INSTR_TIME_ADD(x,y) \
((x).ticks += (y).ticks)
--
2.51.0
On 19.11.2025 20:36, Robert Haas wrote:
On Wed, Nov 19, 2025 at 11:55 AM Lukas Fittl <lukas@fittl.com> wrote:
Overall, I'm still thinking a GUC might be the way to go, but I don't think anyone else was enthusiastic about that idea :)
Reliable feature auto-detection is the best option, but if that's not
possible, I think the choices are add a GUC or give up on the project
altogether. Using a GUC to deal with platform dependencies is a pretty
reasonable concept -- see, e.g. dynamic_shared_memory_type or
huge_pages or io_method. If we can't autodetect it reliably and we
aren't willing to add a GUC, we're basically saying there's not enough
value here to justify adding a configuration parameter. That's often a
totally reasonable conclusion -- it can easily happen that the
benefits of a platform-specific optimization are too small to make it
worth configuring. But I would have thought that in this case the
benefits might be quite large.
I'm also in favor of adding a GUC. Even if we could 100% reliably detect
if using TSC is giving correct results, it could be that it's slow in
some virtualized environment and hence the user wants to disable it.
I'm wondering how to best do a GUC for something that is potentially
unavailable on the system. In that case the GUC would be superfluous.
Maybe a boolean "enable_try_fast_clocksource" GUC or a "clocksource"
enum GUC which can be "default" and "try_rdtsc", where we only include
the "try_rdtsc" enum value on x86 systems?
Any other ideas?
--
David Geier
On Wed, Dec 3, 2025 at 6:03 AM David Geier <geidav.pg@gmail.com> wrote:
I'm wondering how to best do a GUC for something that is potentially
unavailable on the system. In that case the GUC would be superfluous.
Maybe a boolean "enable_try_fast_clocksource" GUC or a "clocksource"
enum GUC which can be "default" and "try_rdtsc", where we only include
the "try_rdtsc" enum value on x86 systems?
huge_pages=on/off/try is one possible precedent. Perhaps for this
case, we might do something like
clock_source=auto/this/that/the_other_thing might be better. If you
set it to any value other than "auto", the named source must be
available, or startup fails. If you set it to auto, it picks what it
believes to be the best option available, and there is some other
read-only GUC (akin to huge_page_status) that tells you what it
picked.
I'm open to other suggestions as to how this should work, but (1) all
of the existing enable_* GUCs are planner GUCs, and (2) I suspect it's
short-sighted to plan for only "fast" and "not fast".
--
Robert Haas
EDB: http://www.enterprisedb.com
On 03.12.2025 10:50, David Geier wrote:
On 19.11.2025 08:20, David Geier wrote:
On 20.10.2025 21:59, Robert Haas wrote:
On Sun, Oct 19, 2025 at 2:16 PM Álvaro Herrera <alvherre@kurilemu.de> wrote:
If I were
a consultant trying to understand a customer's system, I would have to
ask them to run it twice just in case 'fast' is supported, and I don't
think that's very helpful.Big +1 from me.
That makes sense. I'm planning to rebase the patch the next days. Then
I'll also take care of that.The attached patched is rebased on latest master and pg_test_timing now
always tests the normal and the fast timing code. If no fast clock
source is available the fast timing code is skipped.
The last patch I sent was incomplete because I had missed committing my
changes. Attached is now the patch with the changes to pg_test_timing.
--
David Geier
Attachments:
v13-0002-pg_test_timing-Also-test-fast-timing-and-report-time.patchtext/x-patch; charset=UTF-8; name=v13-0002-pg_test_timing-Also-test-fast-timing-and-report-time.patchDownload
From a634451c07208fe99f6893f9ef9a6ee620ccf41f Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Sun, 27 Jul 2025 08:48:48 -0700
Subject: [PATCH 2/2] pg_test_timing: Also test fast timing and report time
source
In passing also reduce the per-loop overhead caused by repeated divisions
in INSTR_TIME_GET_NANOSEC when the ticks variable has become very large,
instead diff first and then turn it into nanosecs.
---
src/bin/pg_test_timing/pg_test_timing.c | 76 +++++++++++++++++++------
src/include/portability/instr_time.h | 30 ++++++----
2 files changed, 77 insertions(+), 29 deletions(-)
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index a5621251afc..62e308dd965 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -30,7 +30,7 @@ static long long int largest_diff_count;
static void handle_args(int argc, char *argv[]);
-static uint64 test_timing(unsigned int duration);
+static uint64 test_timing(unsigned int duration, bool fast_timing);
static void output(uint64 loop_count);
int
@@ -43,10 +43,22 @@ main(int argc, char *argv[])
handle_args(argc, argv);
- loop_count = test_timing(test_duration);
-
+ /*
+ * First, test default (non-fast) timing code. A clock source for that
+ * is always available. Hence, we can unconditionally output the result.
+ */
+ loop_count = test_timing(test_duration, false);
output(loop_count);
+ /*
+ * Second, test the fast timing code. This clock source is not always
+ * available. In that case the loop count will be 0 and we don't print.
+ */
+ printf("\n");
+ loop_count = test_timing(test_duration, true);
+ if (loop_count > 0)
+ output(loop_count);
+
return 0;
}
@@ -78,7 +90,7 @@ handle_args(int argc, char *argv[])
}
}
- while ((option = getopt_long(argc, argv, "d:c:",
+ while ((option = getopt_long(argc, argv, "d:c:f:",
long_options, &optindex)) != -1)
{
switch (option)
@@ -143,23 +155,44 @@ handle_args(int argc, char *argv[])
exit(1);
}
- printf(ngettext("Testing timing overhead for %u second.\n",
- "Testing timing overhead for %u seconds.\n",
+ printf(ngettext("Testing timing overhead for %u second.\n\n",
+ "Testing timing overhead for %u seconds.\n\n",
test_duration),
test_duration);
}
static uint64
-test_timing(unsigned int duration)
+test_timing(unsigned int duration, bool fast_timing)
{
uint64 total_time;
int64 time_elapsed = 0;
uint64 loop_count = 0;
- uint64 prev,
- cur;
instr_time start_time,
end_time,
- temp;
+ prev,
+ cur;
+ char *time_source = NULL;
+ bool fast_timing_available = false;
+
+ INSTR_TIME_INITIALIZE();
+
+#if !defined(WIN32) && defined(__x86_64__) && defined(__linux__)
+ if (fast_timing && has_rdtsc)
+ {
+ time_source = "RDTSC";
+ fast_timing_available = true;
+ }
+ else if (has_rdtscp)
+ time_source = "RDTSCP";
+ else
+ time_source = PG_INSTR_CLOCK_NAME;
+#else
+ time_source = PG_INSTR_CLOCK_NAME;
+#endif
+ if (fast_timing && !fast_timing_available)
+ return 0;
+
+ printf(_("Time source: %s\n"), time_source);
/*
* Pre-zero the statistics data structures. They're already zero by
@@ -173,8 +206,11 @@ test_timing(unsigned int duration)
total_time = duration > 0 ? duration * INT64CONST(1000000000) : 0;
- INSTR_TIME_SET_CURRENT(start_time);
- cur = INSTR_TIME_GET_NANOSEC(start_time);
+ if (fast_timing)
+ INSTR_TIME_SET_CURRENT_FAST(start_time);
+ else
+ INSTR_TIME_SET_CURRENT(start_time);
+ cur = start_time;
while (time_elapsed < total_time)
{
@@ -182,9 +218,11 @@ test_timing(unsigned int duration)
bits;
prev = cur;
- INSTR_TIME_SET_CURRENT(temp);
- cur = INSTR_TIME_GET_NANOSEC(temp);
- diff = cur - prev;
+ if (fast_timing)
+ INSTR_TIME_SET_CURRENT_FAST(cur);
+ else
+ INSTR_TIME_SET_CURRENT(cur);
+ diff = INSTR_TIME_DIFF_NANOSEC(cur, prev);
/* Did time go backwards? */
if (unlikely(diff < 0))
@@ -217,11 +255,13 @@ test_timing(unsigned int duration)
largest_diff_count++;
loop_count++;
- INSTR_TIME_SUBTRACT(temp, start_time);
- time_elapsed = INSTR_TIME_GET_NANOSEC(temp);
+ time_elapsed = INSTR_TIME_DIFF_NANOSEC(cur, start_time);
}
- INSTR_TIME_SET_CURRENT(end_time);
+ if (fast_timing)
+ INSTR_TIME_SET_CURRENT_FAST(end_time);
+ else
+ INSTR_TIME_SET_CURRENT(end_time);
INSTR_TIME_SUBTRACT(end_time, start_time);
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index e2e339a0c4f..f02296f1026 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -112,10 +112,13 @@ extern int64 max_ticks_no_overflow;
*/
#if defined(__darwin__) && defined(CLOCK_MONOTONIC_RAW)
#define PG_INSTR_CLOCK CLOCK_MONOTONIC_RAW
+#define PG_INSTR_CLOCK_NAME "clock_gettime (CLOCK_MONOTONIC_RAW)"
#elif defined(CLOCK_MONOTONIC)
#define PG_INSTR_CLOCK CLOCK_MONOTONIC
+#define PG_INSTR_CLOCK_NAME "clock_gettime (CLOCK_MONOTONIC)"
#else
#define PG_INSTR_CLOCK CLOCK_REALTIME
+#define PG_INSTR_CLOCK_NAME "clock_gettime (CLOCK_REALTIME)"
#endif
#if defined(__x86_64__) && defined(__linux__)
@@ -174,7 +177,7 @@ pg_get_ticks(void)
}
static inline int64_t
-pg_ticks_to_ns(instr_time t)
+pg_ticks_to_ns(int64 ticks)
{
/*
* Would multiplication overflow? If so perform computation in two parts.
@@ -183,7 +186,7 @@ pg_ticks_to_ns(instr_time t)
*/
int64 ns = 0;
- if (unlikely(t.ticks > max_ticks_no_overflow))
+ if (unlikely(ticks > max_ticks_no_overflow))
{
/*
* Compute how often the maximum number of ticks fits completely into
@@ -192,7 +195,7 @@ pg_ticks_to_ns(instr_time t)
* value. In a 2nd step we adjust the number of elapsed ticks and
* convert the remaining ticks.
*/
- int64 count = t.ticks / max_ticks_no_overflow;
+ int64 count = ticks / max_ticks_no_overflow;
int64 max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
ns = max_ns * count;
@@ -201,11 +204,11 @@ pg_ticks_to_ns(instr_time t)
* Subtract the ticks that we now already accounted for, so that they
* don't get counted twice.
*/
- t.ticks -= count * max_ticks_no_overflow;
- Assert(t.ticks >= 0);
+ ticks -= count * max_ticks_no_overflow;
+ Assert(ticks >= 0);
}
- ns += t.ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ ns += ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
return ns;
}
@@ -226,14 +229,14 @@ pg_initialize_get_ticks()
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_get_ticks())
-#define INSTR_TIME_GET_NANOSEC(t) \
- pg_ticks_to_ns(t)
-
+#define INSTR_TIME_TICKS_TO_NANOSEC(ticks) \
+ (pg_ticks_to_ns(ticks))
#else /* WIN32 */
/* Use QueryPerformanceCounter() */
+#define PG_INSTR_CLOCK_NAME "QueryPerformanceCounter"
/* helper for INSTR_TIME_SET_CURRENT / INSTR_TIME_SET_CURRENT_FAST */
static inline instr_time
@@ -265,8 +268,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
-#define INSTR_TIME_GET_NANOSEC(t) \
- ((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency())))
+#define INSTR_TIME_TICKS_TO_NANOSEC(ticks) \
+ ((int64) ((ticks) * ((double) NS_PER_S / GetTimerFrequency())))
#endif /* WIN32 */
@@ -285,9 +288,14 @@ GetTimerFrequency(void)
#define INSTR_TIME_SUBTRACT(x,y) \
((x).ticks -= (y).ticks)
+#define INSTR_TIME_DIFF_NANOSEC(x,y) \
+ (INSTR_TIME_TICKS_TO_NANOSEC((x).ticks - (y).ticks))
+
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
((x).ticks += (y).ticks - (z).ticks)
+#define INSTR_TIME_GET_NANOSEC(t) \
+ (INSTR_TIME_TICKS_TO_NANOSEC((t).ticks))
#define INSTR_TIME_GET_DOUBLE(t) \
((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
--
2.51.0
v13-0001-Use-time-stamp-counter-to-measure-time-on-Linux-x86.patchtext/x-patch; charset=UTF-8; name=v13-0001-Use-time-stamp-counter-to-measure-time-on-Linux-x86.patchDownload
From 4104480bee3161d1338f253d3e2defe418d5378a Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Fri, 25 Jul 2025 17:57:20 -0700
Subject: [PATCH 1/2] Use time stamp counter to measure time on Linux/x86
We switch to using the time stamp counter (TSC) instead of clock_gettime()
to reduce overhead of EXPLAIN (ANALYZE, TIME ON). Tests showed that runtime
is reduced by around 10% for queries moving lots of rows through the plan.
For now this is only enabled on Linux/x86, in case the system clocksource is
reported as TSC. Relying on the Linux kernel simplifies the logic to detect
if the present TSC is usable (frequency invariant, synchronized between
sockets, etc.). In all other cases we fallback to clock_gettime().
Note, that we intentionally use RDTSC in the fast paths, rather than RDTSCP.
RDTSCP waits for outstanding instructions to retire on out-of-order CPUs.
This adds noticably for little benefit in the typical InstrStartNode() /
InstrStopNode() use case. The macro to be used in such cases is called
INSTR_TIME_SET_CURRENT_FAST(). The original macro INSTR_TIME_SET_CURRENT()
uses RDTSCP and is supposed to be used when precision is more important
than performance.
Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
---
src/backend/access/heap/vacuumlazy.c | 4 +-
src/backend/executor/instrument.c | 12 +-
src/backend/utils/init/postinit.c | 3 +
src/bin/pgbench/pgbench.c | 3 +
src/bin/psql/startup.c | 4 +
src/common/Makefile | 1 +
src/common/instr_time.c | 206 +++++++++++++++++++++++++++
src/common/meson.build | 1 +
src/include/portability/instr_time.h | 136 +++++++++++++++---
9 files changed, 348 insertions(+), 22 deletions(-)
create mode 100644 src/common/instr_time.c
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 65bb0568a86..3f291a315a4 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3415,8 +3415,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
INSTR_TIME_SET_CURRENT(currenttime);
elapsed = currenttime;
INSTR_TIME_SUBTRACT(elapsed, starttime);
- if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000)
- >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
+ if (INSTR_TIME_GET_MILLISEC(elapsed) >=
+ VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
{
if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
{
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index 9e11c662a7c..3940b59d70c 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -67,9 +67,13 @@ InstrInit(Instrumentation *instr, int instrument_options)
void
InstrStartNode(Instrumentation *instr)
{
- if (instr->need_timer &&
- !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ if (instr->need_timer)
+ {
+ if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStartNode called twice in a row");
+ else
+ INSTR_TIME_SET_CURRENT_FAST(instr->starttime);
+ }
/* save buffer usage totals at node entry, if needed */
if (instr->need_bufusage)
@@ -95,7 +99,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (INSTR_TIME_IS_ZERO(instr->starttime))
elog(ERROR, "InstrStopNode called without start");
- INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_SET_CURRENT_FAST(endtime);
INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 98f9598cd78..845c0d0e4f6 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -806,6 +806,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
/* Initialize portal manager */
EnablePortalManager();
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/*
* Load relcache entries for the shared system catalogs. This must create
* at least entries for pg_database and catalogs used for authentication.
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 68774a59efd..2ff094a093c 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7334,6 +7334,9 @@ main(int argc, char **argv)
initRandomState(&state[i].cs_func_rs);
}
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* opening connection... */
con = doConnect();
if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index 249b6aa5169..d615df593c7 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
#include "help.h"
#include "input.h"
#include "mainloop.h"
+#include "portability/instr_time.h"
#include "settings.h"
/*
@@ -327,6 +328,9 @@ main(int argc, char *argv[])
PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
SyncVariables();
if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index 2c720caa509..1a2fbbe887f 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -59,6 +59,7 @@ OBJS_COMMON = \
file_perm.o \
file_utils.o \
hashfn.o \
+ instr_time.o \
ip.o \
jsonapi.o \
keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 00000000000..fdf47699f20
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,206 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ * Non-inline parts of the portable high-precision interval timing
+ * implementation
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#if defined(HAVE__GET_CPUID) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
+#include <cpuid.h>
+#endif
+
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
+#include <intrin.h>
+#endif
+
+#include "portability/instr_time.h"
+
+#ifndef WIN32
+/*
+ * Stores what the number of cycles needs to be multiplied with to end up
+ * with nanoseconds using integer math. See comment in pg_initialize_rdtsc()
+ * for more details.
+ *
+ * By default assume we are using clock_gettime() as a fallback which uses
+ * nanoseconds as ticks. Hence, we set the multiplier to the precision scalar
+ * so that the division in INSTR_TIME_GET_NANOSEC() won't change the nanoseconds.
+ *
+ * When using the RDTSC instruction directly this is filled in during initialization
+ * based on the relevant CPUID fields.
+ */
+int64 ticks_per_ns_scaled = TICKS_TO_NS_PRECISION;
+int64 ticks_per_sec = NS_PER_S;
+int64 max_ticks_no_overflow = PG_INT64_MAX / TICKS_TO_NS_PRECISION;
+
+#if defined(__x86_64__) && defined(__linux__)
+/*
+ * Indicates if RDTSC can be used (Linux/x86 only, when OS uses TSC clocksource)
+ */
+bool has_rdtsc = false;
+
+/*
+ * Indicates if RDTSCP can be used. True if RDTSC can be used and RDTSCP is available.
+ */
+bool has_rdtscp = false;
+
+#define CPUID_HYPERVISOR_VMWARE(words) (words[1] == 0x61774d56 && words[2] == 0x4d566572 && words[3] == 0x65726177) /* VMwareVMware */
+#define CPUID_HYPERVISOR_KVM(words) (words[1] == 0x4b4d564b && words[2] == 0x564b4d56 && words[3] == 0x0000004d) /* KVMKVMKVM */
+
+static bool
+get_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ uint32 r[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(0x15, &r[0] /* denominator */ , &r[1] /* numerator */ , &r[2] /* hz */ , &r[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(r, 0x15);
+#else
+#error cpuid instruction not available
+#endif
+
+ if (r[2] > 0)
+ {
+ if (r[0] == 0 || r[1] == 0)
+ return false;
+
+ *tsc_freq = r[2] / 1000 * r[1] / r[0];
+ return true;
+ }
+
+ /* Some CPUs only report frequency in 16H */
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(0x16, &r[0] /* base_mhz */ , &r[1], &r[2], &r[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(r, 0x16);
+#else
+#error cpuid instruction not available
+#endif
+
+ if (r[0] > 0)
+ {
+ *tsc_freq = r[0] * 1000;
+ return true;
+ }
+
+ /*
+ * Check if we have a KVM or VMware Hypervisor passing down TSC frequency
+ * to us in a guest VM
+ *
+ * Note that accessing the 0x40000000 leaf for Hypervisor info requires
+ * use of __cpuidex to set ECX to 0. The similar __get_cpuid_count
+ * function does not work as expected since it contains a check for
+ * __get_cpuid_max, which has been observed to be lower than the special
+ * Hypervisor leaf.
+ */
+#if defined(HAVE__CPUIDEX)
+ __cpuidex((int32 *) r, 0x40000000, 0);
+ if (r[0] >= 0x40000010 && (CPUID_HYPERVISOR_VMWARE(r) || CPUID_HYPERVISOR_KVM(r)))
+ {
+ __cpuidex((int32 *) r, 0x40000010, 0);
+ if (r[0] > 0)
+ {
+ *tsc_freq = r[0];
+ return true;
+ }
+ }
+#endif
+
+ return false;
+}
+
+static bool
+is_rdtscp_available()
+{
+ uint32 r[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ if (!__get_cpuid(0x80000001, &r[0], &r[1], &r[2], &r[3]))
+ return false;
+#elif defined(HAVE__CPUID)
+ __cpuid(r, 0x80000001);
+#else
+#error cpuid instruction not available
+#endif
+
+ return (r[3] & (1 << 27)) != 0;
+}
+
+/*
+ * Decide whether we use the RDTSC instruction at runtime, for Linux/x86,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ *
+ * Since Linux already does a significant amount of work to determine
+ * whether TSC is a viable clock source, decide based on that.
+ */
+void
+pg_initialize_rdtsc(void)
+{
+ FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+
+ if (fp)
+ {
+ char buf[128];
+
+ if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
+ {
+ /*
+ * Compute baseline CPU peformance, determines speed at which
+ * RDTSC advances.
+ */
+ uint32 tsc_freq;
+
+ if (get_tsc_frequency_khz(&tsc_freq))
+ {
+ /*
+ * Ticks to nanoseconds conversion requires floating point
+ * math because because:
+ *
+ * sec = ticks / frequency_hz ns = ticks / frequency_hz *
+ * 1,000,000,000 ns = ticks * (1,000,000,000 / frequency_hz)
+ * ns = ticks * (1,000,000 / frequency_khz) <-- now in
+ * kilohertz
+ *
+ * Here, 'ns' is usually a floating number. For example for a
+ * 2.5 GHz CPU the scaling factor becomes 1,000,000 /
+ * 2,500,000 = 1.2.
+ *
+ * To be able to use integer math we work around the lack of
+ * precision. We first scale the integer up and after the
+ * multiplication by the number of ticks in
+ * INSTR_TIME_GET_NANOSEC() we divide again by the same value.
+ * We picked the scaler such that it provides enough precision
+ * and is a power-of-two which allows for shifting instead of
+ * doing an integer division.
+ */
+ ticks_per_ns_scaled = INT64CONST(1000000) * TICKS_TO_NS_PRECISION / tsc_freq;
+ ticks_per_sec = tsc_freq * 1000; /* KHz->Hz */
+ max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
+
+ has_rdtsc = true;
+ has_rdtscp = is_rdtscp_available();
+ }
+ }
+
+ fclose(fp);
+ }
+}
+#endif /* defined(__x86_64__) && defined(__linux__) */
+
+#endif /* WIN32 */
diff --git a/src/common/meson.build b/src/common/meson.build
index 1540ba67cca..62b90b3e609 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -13,6 +13,7 @@ common_sources = files(
'file_perm.c',
'file_utils.c',
'hashfn.c',
+ 'instr_time.c',
'ip.c',
'jsonapi.c',
'keywords.c',
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index f71a851b18d..e2e339a0c4f 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,9 +4,11 @@
* portable high-precision interval timing
*
* This file provides an abstraction layer to hide portability issues in
- * interval timing. On Unix we use clock_gettime(), and on Windows we use
- * QueryPerformanceCounter(). These macros also give some breathing room to
- * use other high-precision-timing APIs.
+ * interval timing. On Linux/x86 we use the rdtsc instruction when a TSC
+ * clocksource is also used on the host OS. Otherwise, and on other
+ * Unix-like systems we use clock_gettime() and on Windows we use
+ * QueryPerformanceCounter(). These macros also give some breathing
+ * room to use other high-precision-timing APIs.
*
* The basic data type is instr_time, which all callers should treat as an
* opaque typedef. instr_time can store either an absolute time (of
@@ -17,10 +19,11 @@
*
* INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
*
- * INSTR_TIME_SET_CURRENT(t) set t to current time
+ * INSTR_TIME_SET_CURRENT_FAST(t) set t to current time without waiting
+ * for instructions in out-of-order window
*
- * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
- * evaluates to whether t changed
+ * INSTR_TIME_SET_CURRENT(t) set t to current time while waiting for
+ * instructions in OOO to retire
*
* INSTR_TIME_ADD(x, y) x += y
*
@@ -81,6 +84,15 @@ typedef struct instr_time
#ifndef WIN32
+/*
+ * Make sure this is a power-of-two, so that the compiler can turn the
+ * multiplications and divisions into shifts.
+ */
+#define TICKS_TO_NS_PRECISION (1<<14)
+
+extern int64 ticks_per_ns_scaled;
+extern int64 ticks_per_sec;
+extern int64 max_ticks_no_overflow;
/* Use clock_gettime() */
@@ -106,9 +118,18 @@ typedef struct instr_time
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-/* helper for INSTR_TIME_SET_CURRENT */
+#if defined(__x86_64__) && defined(__linux__)
+#include <x86intrin.h>
+#include <cpuid.h>
+
+extern bool has_rdtsc;
+extern bool has_rdtscp;
+
+extern void pg_initialize_rdtsc(void);
+#endif
+
static inline instr_time
-pg_clock_gettime_ns(void)
+pg_clock_gettime(void)
{
instr_time now;
struct timespec tmp;
@@ -119,11 +140,94 @@ pg_clock_gettime_ns(void)
return now;
}
+static inline instr_time
+pg_get_ticks_fast(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtsc)
+ {
+ instr_time now;
+
+ now.ticks = __rdtsc();
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline instr_time
+pg_get_ticks(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtscp)
+ {
+ instr_time now;
+ uint32 unused;
+
+ now.ticks = __rdtscp(&unused);
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline int64_t
+pg_ticks_to_ns(instr_time t)
+{
+ /*
+ * Would multiplication overflow? If so perform computation in two parts.
+ * Check overflow without actually overflowing via: a * b > max <=> a >
+ * max / b
+ */
+ int64 ns = 0;
+
+ if (unlikely(t.ticks > max_ticks_no_overflow))
+ {
+ /*
+ * Compute how often the maximum number of ticks fits completely into
+ * the number of elapsed ticks and convert that number into
+ * nanoseconds. Then multiply by the count to arrive at the final
+ * value. In a 2nd step we adjust the number of elapsed ticks and
+ * convert the remaining ticks.
+ */
+ int64 count = t.ticks / max_ticks_no_overflow;
+ int64 max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+
+ ns = max_ns * count;
+
+ /*
+ * Subtract the ticks that we now already accounted for, so that they
+ * don't get counted twice.
+ */
+ t.ticks -= count * max_ticks_no_overflow;
+ Assert(t.ticks >= 0);
+ }
+
+ ns += t.ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ return ns;
+}
+
+static inline void
+pg_initialize_get_ticks()
+{
+#if defined(__x86_64__) && defined(__linux__)
+ pg_initialize_rdtsc();
+#endif
+}
+
+#define INSTR_TIME_INITIALIZE() \
+ pg_initialize_get_ticks()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_get_ticks_fast())
+
#define INSTR_TIME_SET_CURRENT(t) \
- ((t) = pg_clock_gettime_ns())
+ ((t) = pg_get_ticks())
#define INSTR_TIME_GET_NANOSEC(t) \
- ((int64) (t).ticks)
+ pg_ticks_to_ns(t)
#else /* WIN32 */
@@ -131,7 +235,7 @@ pg_clock_gettime_ns(void)
/* Use QueryPerformanceCounter() */
-/* helper for INSTR_TIME_SET_CURRENT */
+/* helper for INSTR_TIME_SET_CURRENT / INSTR_TIME_SET_CURRENT_FAST */
static inline instr_time
pg_query_performance_counter(void)
{
@@ -153,6 +257,11 @@ GetTimerFrequency(void)
return (double) f.QuadPart;
}
+#define INSTR_TIME_INITIALIZE()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_query_performance_counter())
+
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
@@ -168,13 +277,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0)
-
#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
-#define INSTR_TIME_SET_CURRENT_LAZY(t) \
- (INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
-
-
#define INSTR_TIME_ADD(x,y) \
((x).ticks += (y).ticks)
--
2.51.0
I have not looked at this patch series yet, but when I played around
with using rdtsc (or actually some gcc/clang construct that compiled
to rctsc on c86 and into time register reads on Arm and risc-v) then
any extra step around it had noticeable overhead. I am not sure
putting some if or function call around rdtsc call is a good idea.
Show quoted text
On Wed, Dec 3, 2025 at 3:15 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Dec 3, 2025 at 6:03 AM David Geier <geidav.pg@gmail.com> wrote:
I'm wondering how to best do a GUC for something that is potentially
unavailable on the system. In that case the GUC would be superfluous.
Maybe a boolean "enable_try_fast_clocksource" GUC or a "clocksource"
enum GUC which can be "default" and "try_rdtsc", where we only include
the "try_rdtsc" enum value on x86 systems?huge_pages=on/off/try is one possible precedent. Perhaps for this
case, we might do something like
clock_source=auto/this/that/the_other_thing might be better. If you
set it to any value other than "auto", the named source must be
available, or startup fails. If you set it to auto, it picks what it
believes to be the best option available, and there is some other
read-only GUC (akin to huge_page_status) that tells you what it
picked.I'm open to other suggestions as to how this should work, but (1) all
of the existing enable_* GUCs are planner GUCs, and (2) I suspect it's
short-sighted to plan for only "fast" and "not fast".--
Robert Haas
EDB: http://www.enterprisedb.com
On 04.12.2025 22:21, Hannu Krosing wrote:
I have not looked at this patch series yet, but when I played around
with using rdtsc (or actually some gcc/clang construct that compiled
to rctsc on c86 and into time register reads on Arm and risc-v) then
any extra step around it had noticeable overhead. I am not sure
putting some if or function call around rdtsc call is a good idea.
We have that already. INSTR_TIME_SET_CURRENT_FAST() is currently
implemented as:
static inline instr_time
pg_get_ticks_fast(void)
{
#if defined(__x86_64__) && defined(__linux__)
if (has_rdtsc)
{
instr_time now;
now.ticks = __rdtsc();
return now;
}
#endif
return pg_clock_gettime();
}
Based on Robert's suggestion I wanted to add a "fast_clock_source" enum
GUC which can have the following values "auto", "rdtsc", "try_rdtsc" and
"off". With that, at least no additional checks are needed and
performance will remain as previously benchmarked in this thread.
Beyond that, the condition will always evaluate to the same result, so
there won't be branch mispredictions. Doing completely without any check
is impossible, except if we were to JIT compile InstrStartNode() and
InstrStopNode(). But that's a much bigger project.
I'll still add unlikely() around the if (has_rdtsc).
Any input regarding the proposed GUC is welcome.
--
David Geier
Based on Robert's suggestion I wanted to add a "fast_clock_source" enum
GUC which can have the following values "auto", "rdtsc", "try_rdtsc" and
"off". With that, at least no additional checks are needed and
performance will remain as previously benchmarked in this thread.
The attached patch set is rebased on latest master and contains a commit
which adds a "fast_clock_source" GUC that can be "try", "off" and
"rdtsc" on Linux.
Alternatively, we could call the GUC "clock_source" with "auto",
"clock_gettime" and "rdtsc". Opinions?
I moved the call to INSTR_TIME_INITIALIZE() from InitPostgres() to
PostmasterMain(). In InitPostgres() it kept the database in a recovery
cycle.
I'll still add unlikely() around the if (has_rdtsc).
Done.
--
David Geier
Attachments:
v3-0003-Add-GUC.patchtext/x-patch; charset=UTF-8; name=v3-0003-Add-GUC.patchDownload
From 1b9f9b97d5ca74532620cb51c8f8beb23500537f Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Fri, 5 Dec 2025 08:58:18 +0100
Subject: [PATCH v3 3/3] Add GUC
---
src/backend/postmaster/postmaster.c | 11 +++++++++
src/backend/tcop/postgres.c | 11 +++++++++
src/backend/utils/init/postinit.c | 3 ---
src/backend/utils/misc/guc_parameters.dat | 7 ++++++
src/backend/utils/misc/guc_tables.c | 10 +++++++++
src/bin/pg_test_timing/pg_test_timing.c | 2 +-
src/bin/pgbench/pgbench.c | 2 +-
src/bin/psql/startup.c | 2 +-
src/common/instr_time.c | 11 +++++++++
src/include/portability/instr_time.h | 27 +++++++++++++----------
src/include/utils/guc_tables.h | 1 +
11 files changed, 69 insertions(+), 18 deletions(-)
diff --git a/src/backend/postmaster/postmaster.c b/src/backend/postmaster/postmaster.c
index 921d73226d6..a6dfaf356ee 100644
--- a/src/backend/postmaster/postmaster.c
+++ b/src/backend/postmaster/postmaster.c
@@ -1005,6 +1005,17 @@ PostmasterMain(int argc, char *argv[])
*/
CreateSharedMemoryAndSemaphores();
+ /*
+ * Initialize high-precision interval timing.
+ * Raise FATAL error if RDTSC is requested but not available.
+ */
+ INSTR_TIME_INITIALIZE(true);
+
+#if defined(__x86_64__) && defined(__linux__)
+ if (fast_clock_source == FAST_CLOCK_SOURCE_RDTSC && (!has_rdtsc || !has_rdtscp))
+ ereport(FATAL, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("TSC is not supported as fast clock source")));
+#endif
+
/*
* Estimate number of openable files. This must happen after setting up
* semaphores, because on some platforms semaphores count as open files.
diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index e54bf1e760f..e882220b9a0 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -4152,6 +4152,17 @@ PostgresSingleUserMain(int argc, char *argv[],
*/
CreateSharedMemoryAndSemaphores();
+ /*
+ * Initialize high-precision interval timing.
+ * Raise FATAL error if RDTSC is requested but not available.
+ */
+ INSTR_TIME_INITIALIZE(true);
+
+#if defined(__x86_64__) && defined(__linux__)
+ if (fast_clock_source == FAST_CLOCK_SOURCE_RDTSC && (!has_rdtsc || !has_rdtscp))
+ ereport(FATAL, (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), errmsg("TSC is not supported as fast clock source")));
+#endif
+
/*
* Estimate number of openable files. This must happen after setting up
* semaphores, because on some platforms semaphores count as open files.
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index c339c35b25d..3f401faf3de 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -809,9 +809,6 @@ InitPostgres(const char *in_dbname, Oid dboid,
/* Initialize portal manager */
EnablePortalManager();
- /* initialize high-precision interval timing */
- INSTR_TIME_INITIALIZE();
-
/*
* Load relcache entries for the shared system catalogs. This must create
* at least entries for pg_database and catalogs used for authentication.
diff --git a/src/backend/utils/misc/guc_parameters.dat b/src/backend/utils/misc/guc_parameters.dat
index 7c60b125564..c68b3ff1184 100644
--- a/src/backend/utils/misc/guc_parameters.dat
+++ b/src/backend/utils/misc/guc_parameters.dat
@@ -1041,6 +1041,13 @@
max => '3',
},
+{ name => 'fast_clock_source', type => 'enum', context => 'PGC_POSTMASTER', group => 'RESOURCES_TIME',
+ short_desc => 'Use of fast clock source.',
+ variable => 'fast_clock_source',
+ boot_val => 'FAST_CLOCK_SOURCE_TRY',
+ options => 'fast_clock_source_options',
+},
+
{ name => 'file_copy_method', type => 'enum', context => 'PGC_USERSET', group => 'RESOURCES_DISK',
short_desc => 'Selects the file copy method.',
variable => 'file_copy_method',
diff --git a/src/backend/utils/misc/guc_tables.c b/src/backend/utils/misc/guc_tables.c
index 73ff6ad0a32..3614aaf1876 100644
--- a/src/backend/utils/misc/guc_tables.c
+++ b/src/backend/utils/misc/guc_tables.c
@@ -90,6 +90,7 @@
#include "tcop/tcopprot.h"
#include "tsearch/ts_cache.h"
#include "utils/builtins.h"
+#include "portability/instr_time.h"
#include "utils/bytea.h"
#include "utils/float.h"
#include "utils/guc_hooks.h"
@@ -371,6 +372,15 @@ static const struct config_enum_entry huge_pages_options[] = {
{NULL, 0, false}
};
+static const struct config_enum_entry fast_clock_source_options[] = {
+ {"try", FAST_CLOCK_SOURCE_TRY, false},
+ {"off", FAST_CLOCK_SOURCE_OFF, false},
+#if defined(__x86_64__) && defined(__linux__)
+ {"rdtsc", FAST_CLOCK_SOURCE_RDTSC, false},
+#endif
+ {NULL, 0, false}
+};
+
static const struct config_enum_entry huge_pages_status_options[] = {
{"off", HUGE_PAGES_OFF, false},
{"on", HUGE_PAGES_ON, false},
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index 62e308dd965..3ffa95769a5 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -174,7 +174,7 @@ test_timing(unsigned int duration, bool fast_timing)
char *time_source = NULL;
bool fast_timing_available = false;
- INSTR_TIME_INITIALIZE();
+ INSTR_TIME_INITIALIZE(true);
#if !defined(WIN32) && defined(__x86_64__) && defined(__linux__)
if (fast_timing && has_rdtsc)
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 7b64fdaed34..78e5b40a7bb 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7335,7 +7335,7 @@ main(int argc, char **argv)
}
/* initialize high-precision interval timing */
- INSTR_TIME_INITIALIZE();
+ INSTR_TIME_INITIALIZE(false);
/* opening connection... */
con = doConnect();
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index a0b3bafe20c..ec8748002d8 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -329,7 +329,7 @@ main(int argc, char *argv[])
PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
/* initialize high-precision interval timing */
- INSTR_TIME_INITIALIZE();
+ INSTR_TIME_INITIALIZE(false);
SyncVariables();
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
index fdf47699f20..02b57edfb9b 100644
--- a/src/common/instr_time.c
+++ b/src/common/instr_time.c
@@ -24,6 +24,17 @@
#include "portability/instr_time.h"
+int fast_clock_source = FAST_CLOCK_SOURCE_TRY;
+
+void
+pg_initialize_timing(bool try_fast)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (try_fast)
+ pg_initialize_rdtsc();
+#endif
+}
+
#ifndef WIN32
/*
* Stores what the number of cycles needs to be multiplied with to end up
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 33e506f3526..22ee991d2c9 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -81,6 +81,14 @@ typedef struct instr_time
#define NS_PER_MS INT64CONST(1000000)
#define NS_PER_US INT64CONST(1000)
+typedef enum
+{
+ FAST_CLOCK_SOURCE_TRY,
+ FAST_CLOCK_SOURCE_OFF,
+ FAST_CLOCK_SOURCE_RDTSC
+} FastClockSourceType;
+
+extern int fast_clock_source;
#ifndef WIN32
@@ -147,7 +155,7 @@ static inline instr_time
pg_get_ticks_fast(void)
{
#if defined(__x86_64__) && defined(__linux__)
- if (has_rdtsc)
+ if (likely(has_rdtsc && fast_clock_source != FAST_CLOCK_SOURCE_OFF))
{
instr_time now;
@@ -163,7 +171,7 @@ static inline instr_time
pg_get_ticks(void)
{
#if defined(__x86_64__) && defined(__linux__)
- if (has_rdtscp)
+ if (likely(has_rdtscp && fast_clock_source != FAST_CLOCK_SOURCE_OFF))
{
instr_time now;
uint32 unused;
@@ -212,16 +220,11 @@ pg_ticks_to_ns(int64 ticks)
return ns;
}
-static inline void
-pg_initialize_get_ticks()
-{
-#if defined(__x86_64__) && defined(__linux__)
- pg_initialize_rdtsc();
-#endif
-}
+extern void
+pg_initialize_timing(bool try_fast);
-#define INSTR_TIME_INITIALIZE() \
- pg_initialize_get_ticks()
+#define INSTR_TIME_INITIALIZE(try_fast) \
+ pg_initialize_timing(try_fast)
#define INSTR_TIME_SET_CURRENT_FAST(t) \
((t) = pg_get_ticks_fast())
@@ -260,7 +263,7 @@ GetTimerFrequency(void)
return (double) f.QuadPart;
}
-#define INSTR_TIME_INITIALIZE()
+#define INSTR_TIME_INITIALIZE(try_fast)
#define INSTR_TIME_SET_CURRENT_FAST(t) \
((t) = pg_query_performance_counter())
diff --git a/src/include/utils/guc_tables.h b/src/include/utils/guc_tables.h
index 71a80161961..63440b8e36c 100644
--- a/src/include/utils/guc_tables.h
+++ b/src/include/utils/guc_tables.h
@@ -60,6 +60,7 @@ enum config_group
CONN_AUTH_TCP,
CONN_AUTH_AUTH,
CONN_AUTH_SSL,
+ RESOURCES_TIME,
RESOURCES_MEM,
RESOURCES_DISK,
RESOURCES_KERNEL,
--
2.51.0
v3-0002-pg_test_timing-Also-test-fast-timing-and-report-t.patchtext/x-patch; charset=UTF-8; name=v3-0002-pg_test_timing-Also-test-fast-timing-and-report-t.patchDownload
From 6eaa7f5e3723ad6490448082b470c183cd808a9f Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Sun, 27 Jul 2025 08:48:48 -0700
Subject: [PATCH v3 2/3] pg_test_timing: Also test fast timing and report time
source
In passing also reduce the per-loop overhead caused by repeated divisions
in INSTR_TIME_GET_NANOSEC when the ticks variable has become very large,
instead diff first and then turn it into nanosecs.
---
src/bin/pg_test_timing/pg_test_timing.c | 76 +++++++++++++++++++------
src/include/portability/instr_time.h | 31 ++++++----
2 files changed, 78 insertions(+), 29 deletions(-)
diff --git a/src/bin/pg_test_timing/pg_test_timing.c b/src/bin/pg_test_timing/pg_test_timing.c
index a5621251afc..62e308dd965 100644
--- a/src/bin/pg_test_timing/pg_test_timing.c
+++ b/src/bin/pg_test_timing/pg_test_timing.c
@@ -30,7 +30,7 @@ static long long int largest_diff_count;
static void handle_args(int argc, char *argv[]);
-static uint64 test_timing(unsigned int duration);
+static uint64 test_timing(unsigned int duration, bool fast_timing);
static void output(uint64 loop_count);
int
@@ -43,10 +43,22 @@ main(int argc, char *argv[])
handle_args(argc, argv);
- loop_count = test_timing(test_duration);
-
+ /*
+ * First, test default (non-fast) timing code. A clock source for that
+ * is always available. Hence, we can unconditionally output the result.
+ */
+ loop_count = test_timing(test_duration, false);
output(loop_count);
+ /*
+ * Second, test the fast timing code. This clock source is not always
+ * available. In that case the loop count will be 0 and we don't print.
+ */
+ printf("\n");
+ loop_count = test_timing(test_duration, true);
+ if (loop_count > 0)
+ output(loop_count);
+
return 0;
}
@@ -78,7 +90,7 @@ handle_args(int argc, char *argv[])
}
}
- while ((option = getopt_long(argc, argv, "d:c:",
+ while ((option = getopt_long(argc, argv, "d:c:f:",
long_options, &optindex)) != -1)
{
switch (option)
@@ -143,23 +155,44 @@ handle_args(int argc, char *argv[])
exit(1);
}
- printf(ngettext("Testing timing overhead for %u second.\n",
- "Testing timing overhead for %u seconds.\n",
+ printf(ngettext("Testing timing overhead for %u second.\n\n",
+ "Testing timing overhead for %u seconds.\n\n",
test_duration),
test_duration);
}
static uint64
-test_timing(unsigned int duration)
+test_timing(unsigned int duration, bool fast_timing)
{
uint64 total_time;
int64 time_elapsed = 0;
uint64 loop_count = 0;
- uint64 prev,
- cur;
instr_time start_time,
end_time,
- temp;
+ prev,
+ cur;
+ char *time_source = NULL;
+ bool fast_timing_available = false;
+
+ INSTR_TIME_INITIALIZE();
+
+#if !defined(WIN32) && defined(__x86_64__) && defined(__linux__)
+ if (fast_timing && has_rdtsc)
+ {
+ time_source = "RDTSC";
+ fast_timing_available = true;
+ }
+ else if (has_rdtscp)
+ time_source = "RDTSCP";
+ else
+ time_source = PG_INSTR_CLOCK_NAME;
+#else
+ time_source = PG_INSTR_CLOCK_NAME;
+#endif
+ if (fast_timing && !fast_timing_available)
+ return 0;
+
+ printf(_("Time source: %s\n"), time_source);
/*
* Pre-zero the statistics data structures. They're already zero by
@@ -173,8 +206,11 @@ test_timing(unsigned int duration)
total_time = duration > 0 ? duration * INT64CONST(1000000000) : 0;
- INSTR_TIME_SET_CURRENT(start_time);
- cur = INSTR_TIME_GET_NANOSEC(start_time);
+ if (fast_timing)
+ INSTR_TIME_SET_CURRENT_FAST(start_time);
+ else
+ INSTR_TIME_SET_CURRENT(start_time);
+ cur = start_time;
while (time_elapsed < total_time)
{
@@ -182,9 +218,11 @@ test_timing(unsigned int duration)
bits;
prev = cur;
- INSTR_TIME_SET_CURRENT(temp);
- cur = INSTR_TIME_GET_NANOSEC(temp);
- diff = cur - prev;
+ if (fast_timing)
+ INSTR_TIME_SET_CURRENT_FAST(cur);
+ else
+ INSTR_TIME_SET_CURRENT(cur);
+ diff = INSTR_TIME_DIFF_NANOSEC(cur, prev);
/* Did time go backwards? */
if (unlikely(diff < 0))
@@ -217,11 +255,13 @@ test_timing(unsigned int duration)
largest_diff_count++;
loop_count++;
- INSTR_TIME_SUBTRACT(temp, start_time);
- time_elapsed = INSTR_TIME_GET_NANOSEC(temp);
+ time_elapsed = INSTR_TIME_DIFF_NANOSEC(cur, start_time);
}
- INSTR_TIME_SET_CURRENT(end_time);
+ if (fast_timing)
+ INSTR_TIME_SET_CURRENT_FAST(end_time);
+ else
+ INSTR_TIME_SET_CURRENT(end_time);
INSTR_TIME_SUBTRACT(end_time, start_time);
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index c2cea75b7c8..33e506f3526 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -112,10 +112,13 @@ extern int64 max_ticks_no_overflow;
*/
#if defined(__darwin__) && defined(CLOCK_MONOTONIC_RAW)
#define PG_INSTR_CLOCK CLOCK_MONOTONIC_RAW
+#define PG_INSTR_CLOCK_NAME "clock_gettime (CLOCK_MONOTONIC_RAW)"
#elif defined(CLOCK_MONOTONIC)
#define PG_INSTR_CLOCK CLOCK_MONOTONIC
+#define PG_INSTR_CLOCK_NAME "clock_gettime (CLOCK_MONOTONIC)"
#else
#define PG_INSTR_CLOCK CLOCK_REALTIME
+#define PG_INSTR_CLOCK_NAME "clock_gettime (CLOCK_REALTIME)"
#endif
#if defined(__x86_64__) && defined(__linux__)
@@ -174,7 +177,7 @@ pg_get_ticks(void)
}
static inline int64_t
-pg_ticks_to_ns(instr_time t)
+pg_ticks_to_ns(int64 ticks)
{
/*
* Would multiplication overflow? If so perform computation in two parts.
@@ -183,7 +186,7 @@ pg_ticks_to_ns(instr_time t)
*/
int64 ns = 0;
- if (unlikely(t.ticks > max_ticks_no_overflow))
+ if (unlikely(ticks > max_ticks_no_overflow))
{
/*
* Compute how often the maximum number of ticks fits completely into
@@ -192,7 +195,7 @@ pg_ticks_to_ns(instr_time t)
* value. In a 2nd step we adjust the number of elapsed ticks and
* convert the remaining ticks.
*/
- int64 count = t.ticks / max_ticks_no_overflow;
+ int64 count = ticks / max_ticks_no_overflow;
int64 max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
ns = max_ns * count;
@@ -201,11 +204,11 @@ pg_ticks_to_ns(instr_time t)
* Subtract the ticks that we now already accounted for, so that they
* don't get counted twice.
*/
- t.ticks -= count * max_ticks_no_overflow;
- Assert(t.ticks >= 0);
+ ticks -= count * max_ticks_no_overflow;
+ Assert(ticks >= 0);
}
- ns += t.ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ ns += ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
return ns;
}
@@ -226,14 +229,14 @@ pg_initialize_get_ticks()
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_get_ticks())
-#define INSTR_TIME_GET_NANOSEC(t) \
- pg_ticks_to_ns(t)
-
+#define INSTR_TIME_TICKS_TO_NANOSEC(ticks) \
+ (pg_ticks_to_ns(ticks))
#else /* WIN32 */
/* Use QueryPerformanceCounter() */
+#define PG_INSTR_CLOCK_NAME "QueryPerformanceCounter"
/* helper for INSTR_TIME_SET_CURRENT / INSTR_TIME_SET_CURRENT_FAST */
static inline instr_time
@@ -265,8 +268,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
-#define INSTR_TIME_GET_NANOSEC(t) \
- ((int64) ((t).ticks * ((double) NS_PER_S / GetTimerFrequency())))
+#define INSTR_TIME_TICKS_TO_NANOSEC(ticks) \
+ ((int64) ((ticks) * ((double) NS_PER_S / GetTimerFrequency())))
#endif /* WIN32 */
@@ -285,12 +288,18 @@ GetTimerFrequency(void)
#define INSTR_TIME_SUBTRACT(x,y) \
((x).ticks -= (y).ticks)
+#define INSTR_TIME_DIFF_NANOSEC(x,y) \
+ (INSTR_TIME_TICKS_TO_NANOSEC((x).ticks - (y).ticks))
+
#define INSTR_TIME_ACCUM_DIFF(x,y,z) \
((x).ticks += (y).ticks - (z).ticks)
#define INSTR_TIME_LT(x,y) \
((x).ticks > (y).ticks)
+#define INSTR_TIME_GET_NANOSEC(t) \
+ (INSTR_TIME_TICKS_TO_NANOSEC((t).ticks))
+
#define INSTR_TIME_GET_DOUBLE(t) \
((double) INSTR_TIME_GET_NANOSEC(t) / NS_PER_S)
--
2.51.0
v3-0001-Use-time-stamp-counter-to-measure-time-on-Linux-x.patchtext/x-patch; charset=UTF-8; name=v3-0001-Use-time-stamp-counter-to-measure-time-on-Linux-x.patchDownload
From 375329c11b4b322970a0d8554096f4743f62af1d Mon Sep 17 00:00:00 2001
From: Lukas Fittl <lukas@fittl.com>
Date: Fri, 25 Jul 2025 17:57:20 -0700
Subject: [PATCH v3 1/3] Use time stamp counter to measure time on Linux/x86
We switch to using the time stamp counter (TSC) instead of clock_gettime()
to reduce overhead of EXPLAIN (ANALYZE, TIME ON). Tests showed that runtime
is reduced by around 10% for queries moving lots of rows through the plan.
For now this is only enabled on Linux/x86, in case the system clocksource is
reported as TSC. Relying on the Linux kernel simplifies the logic to detect
if the present TSC is usable (frequency invariant, synchronized between
sockets, etc.). In all other cases we fallback to clock_gettime().
Note, that we intentionally use RDTSC in the fast paths, rather than RDTSCP.
RDTSCP waits for outstanding instructions to retire on out-of-order CPUs.
This adds noticably for little benefit in the typical InstrStartNode() /
InstrStopNode() use case. The macro to be used in such cases is called
INSTR_TIME_SET_CURRENT_FAST(). The original macro INSTR_TIME_SET_CURRENT()
uses RDTSCP and is supposed to be used when precision is more important
than performance.
Author: David Geier <geidav.pg@gmail.com>
Author: Andres Freund <andres@anarazel.de>
Author: Lukas Fittl <lukas@fittl.com>
Reviewed-by:
Discussion: https://www.postgresql.org/message-id/flat/20200612232810.f46nbqkdhbutzqdg%40alap3.anarazel.de
---
src/backend/access/heap/vacuumlazy.c | 4 +-
src/backend/executor/instrument.c | 12 +-
src/backend/utils/init/postinit.c | 3 +
src/bin/pgbench/pgbench.c | 3 +
src/bin/psql/startup.c | 4 +
src/common/Makefile | 1 +
src/common/instr_time.c | 206 +++++++++++++++++++++++++++
src/common/meson.build | 1 +
src/include/portability/instr_time.h | 136 +++++++++++++++---
9 files changed, 348 insertions(+), 22 deletions(-)
create mode 100644 src/common/instr_time.c
diff --git a/src/backend/access/heap/vacuumlazy.c b/src/backend/access/heap/vacuumlazy.c
index 2086a577199..af2201f42bc 100644
--- a/src/backend/access/heap/vacuumlazy.c
+++ b/src/backend/access/heap/vacuumlazy.c
@@ -3458,8 +3458,8 @@ count_nondeletable_pages(LVRelState *vacrel, bool *lock_waiter_detected)
INSTR_TIME_SET_CURRENT(currenttime);
elapsed = currenttime;
INSTR_TIME_SUBTRACT(elapsed, starttime);
- if ((INSTR_TIME_GET_MICROSEC(elapsed) / 1000)
- >= VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
+ if (INSTR_TIME_GET_MILLISEC(elapsed) >=
+ VACUUM_TRUNCATE_LOCK_CHECK_INTERVAL)
{
if (LockHasWaitersRelation(vacrel->rel, AccessExclusiveLock))
{
diff --git a/src/backend/executor/instrument.c b/src/backend/executor/instrument.c
index edab92a0ebe..ebdad31ca3b 100644
--- a/src/backend/executor/instrument.c
+++ b/src/backend/executor/instrument.c
@@ -67,9 +67,13 @@ InstrInit(Instrumentation *instr, int instrument_options)
void
InstrStartNode(Instrumentation *instr)
{
- if (instr->need_timer &&
- !INSTR_TIME_SET_CURRENT_LAZY(instr->starttime))
- elog(ERROR, "InstrStartNode called twice in a row");
+ if (instr->need_timer)
+ {
+ if (!INSTR_TIME_IS_ZERO(instr->starttime))
+ elog(ERROR, "InstrStartNode called twice in a row");
+ else
+ INSTR_TIME_SET_CURRENT_FAST(instr->starttime);
+ }
/* save buffer usage totals at node entry, if needed */
if (instr->need_bufusage)
@@ -95,7 +99,7 @@ InstrStopNode(Instrumentation *instr, double nTuples)
if (INSTR_TIME_IS_ZERO(instr->starttime))
elog(ERROR, "InstrStopNode called without start");
- INSTR_TIME_SET_CURRENT(endtime);
+ INSTR_TIME_SET_CURRENT_FAST(endtime);
INSTR_TIME_ACCUM_DIFF(instr->counter, endtime, instr->starttime);
INSTR_TIME_SET_ZERO(instr->starttime);
diff --git a/src/backend/utils/init/postinit.c b/src/backend/utils/init/postinit.c
index 3f401faf3de..c339c35b25d 100644
--- a/src/backend/utils/init/postinit.c
+++ b/src/backend/utils/init/postinit.c
@@ -809,6 +809,9 @@ InitPostgres(const char *in_dbname, Oid dboid,
/* Initialize portal manager */
EnablePortalManager();
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/*
* Load relcache entries for the shared system catalogs. This must create
* at least entries for pg_database and catalogs used for authentication.
diff --git a/src/bin/pgbench/pgbench.c b/src/bin/pgbench/pgbench.c
index 58735871c17..7b64fdaed34 100644
--- a/src/bin/pgbench/pgbench.c
+++ b/src/bin/pgbench/pgbench.c
@@ -7334,6 +7334,9 @@ main(int argc, char **argv)
initRandomState(&state[i].cs_func_rs);
}
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
/* opening connection... */
con = doConnect();
if (con == NULL)
diff --git a/src/bin/psql/startup.c b/src/bin/psql/startup.c
index 9a397ec87b7..a0b3bafe20c 100644
--- a/src/bin/psql/startup.c
+++ b/src/bin/psql/startup.c
@@ -24,6 +24,7 @@
#include "help.h"
#include "input.h"
#include "mainloop.h"
+#include "portability/instr_time.h"
#include "settings.h"
/*
@@ -327,6 +328,9 @@ main(int argc, char *argv[])
PQsetNoticeProcessor(pset.db, NoticeProcessor, NULL);
+ /* initialize high-precision interval timing */
+ INSTR_TIME_INITIALIZE();
+
SyncVariables();
if (options.list_dbs)
diff --git a/src/common/Makefile b/src/common/Makefile
index 2c720caa509..1a2fbbe887f 100644
--- a/src/common/Makefile
+++ b/src/common/Makefile
@@ -59,6 +59,7 @@ OBJS_COMMON = \
file_perm.o \
file_utils.o \
hashfn.o \
+ instr_time.o \
ip.o \
jsonapi.o \
keywords.o \
diff --git a/src/common/instr_time.c b/src/common/instr_time.c
new file mode 100644
index 00000000000..fdf47699f20
--- /dev/null
+++ b/src/common/instr_time.c
@@ -0,0 +1,206 @@
+/*-------------------------------------------------------------------------
+ *
+ * instr_time.c
+ * Non-inline parts of the portable high-precision interval timing
+ * implementation
+ *
+ * Portions Copyright (c) 2025, PostgreSQL Global Development Group
+ *
+ *
+ * IDENTIFICATION
+ * src/backend/port/instr_time.c
+ *
+ *-------------------------------------------------------------------------
+ */
+#include "postgres.h"
+
+#if defined(HAVE__GET_CPUID) || (defined(HAVE__CPUIDEX) && !defined(_MSC_VER))
+#include <cpuid.h>
+#endif
+
+#if defined(HAVE__CPUID) || (defined(HAVE__CPUIDEX) && defined(_MSC_VER))
+#include <intrin.h>
+#endif
+
+#include "portability/instr_time.h"
+
+#ifndef WIN32
+/*
+ * Stores what the number of cycles needs to be multiplied with to end up
+ * with nanoseconds using integer math. See comment in pg_initialize_rdtsc()
+ * for more details.
+ *
+ * By default assume we are using clock_gettime() as a fallback which uses
+ * nanoseconds as ticks. Hence, we set the multiplier to the precision scalar
+ * so that the division in INSTR_TIME_GET_NANOSEC() won't change the nanoseconds.
+ *
+ * When using the RDTSC instruction directly this is filled in during initialization
+ * based on the relevant CPUID fields.
+ */
+int64 ticks_per_ns_scaled = TICKS_TO_NS_PRECISION;
+int64 ticks_per_sec = NS_PER_S;
+int64 max_ticks_no_overflow = PG_INT64_MAX / TICKS_TO_NS_PRECISION;
+
+#if defined(__x86_64__) && defined(__linux__)
+/*
+ * Indicates if RDTSC can be used (Linux/x86 only, when OS uses TSC clocksource)
+ */
+bool has_rdtsc = false;
+
+/*
+ * Indicates if RDTSCP can be used. True if RDTSC can be used and RDTSCP is available.
+ */
+bool has_rdtscp = false;
+
+#define CPUID_HYPERVISOR_VMWARE(words) (words[1] == 0x61774d56 && words[2] == 0x4d566572 && words[3] == 0x65726177) /* VMwareVMware */
+#define CPUID_HYPERVISOR_KVM(words) (words[1] == 0x4b4d564b && words[2] == 0x564b4d56 && words[3] == 0x0000004d) /* KVMKVMKVM */
+
+static bool
+get_tsc_frequency_khz(uint32 *tsc_freq)
+{
+ uint32 r[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(0x15, &r[0] /* denominator */ , &r[1] /* numerator */ , &r[2] /* hz */ , &r[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(r, 0x15);
+#else
+#error cpuid instruction not available
+#endif
+
+ if (r[2] > 0)
+ {
+ if (r[0] == 0 || r[1] == 0)
+ return false;
+
+ *tsc_freq = r[2] / 1000 * r[1] / r[0];
+ return true;
+ }
+
+ /* Some CPUs only report frequency in 16H */
+
+#if defined(HAVE__GET_CPUID)
+ __get_cpuid(0x16, &r[0] /* base_mhz */ , &r[1], &r[2], &r[3]);
+#elif defined(HAVE__CPUID)
+ __cpuid(r, 0x16);
+#else
+#error cpuid instruction not available
+#endif
+
+ if (r[0] > 0)
+ {
+ *tsc_freq = r[0] * 1000;
+ return true;
+ }
+
+ /*
+ * Check if we have a KVM or VMware Hypervisor passing down TSC frequency
+ * to us in a guest VM
+ *
+ * Note that accessing the 0x40000000 leaf for Hypervisor info requires
+ * use of __cpuidex to set ECX to 0. The similar __get_cpuid_count
+ * function does not work as expected since it contains a check for
+ * __get_cpuid_max, which has been observed to be lower than the special
+ * Hypervisor leaf.
+ */
+#if defined(HAVE__CPUIDEX)
+ __cpuidex((int32 *) r, 0x40000000, 0);
+ if (r[0] >= 0x40000010 && (CPUID_HYPERVISOR_VMWARE(r) || CPUID_HYPERVISOR_KVM(r)))
+ {
+ __cpuidex((int32 *) r, 0x40000010, 0);
+ if (r[0] > 0)
+ {
+ *tsc_freq = r[0];
+ return true;
+ }
+ }
+#endif
+
+ return false;
+}
+
+static bool
+is_rdtscp_available()
+{
+ uint32 r[4] = {0, 0, 0, 0};
+
+#if defined(HAVE__GET_CPUID)
+ if (!__get_cpuid(0x80000001, &r[0], &r[1], &r[2], &r[3]))
+ return false;
+#elif defined(HAVE__CPUID)
+ __cpuid(r, 0x80000001);
+#else
+#error cpuid instruction not available
+#endif
+
+ return (r[3] & (1 << 27)) != 0;
+}
+
+/*
+ * Decide whether we use the RDTSC instruction at runtime, for Linux/x86,
+ * instead of incurring the overhead of a full clock_gettime() call.
+ *
+ * This can't be reliably determined at compile time, since the
+ * availability of an "invariant" TSC (that is not affected by CPU
+ * frequency changes) is dependent on the CPU architecture. Additionally,
+ * there are cases where TSC availability is impacted by virtualization,
+ * where a simple cpuid feature check would not be enough.
+ *
+ * Since Linux already does a significant amount of work to determine
+ * whether TSC is a viable clock source, decide based on that.
+ */
+void
+pg_initialize_rdtsc(void)
+{
+ FILE *fp = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
+
+ if (fp)
+ {
+ char buf[128];
+
+ if (fgets(buf, sizeof(buf), fp) != NULL && strcmp(buf, "tsc\n") == 0)
+ {
+ /*
+ * Compute baseline CPU peformance, determines speed at which
+ * RDTSC advances.
+ */
+ uint32 tsc_freq;
+
+ if (get_tsc_frequency_khz(&tsc_freq))
+ {
+ /*
+ * Ticks to nanoseconds conversion requires floating point
+ * math because because:
+ *
+ * sec = ticks / frequency_hz ns = ticks / frequency_hz *
+ * 1,000,000,000 ns = ticks * (1,000,000,000 / frequency_hz)
+ * ns = ticks * (1,000,000 / frequency_khz) <-- now in
+ * kilohertz
+ *
+ * Here, 'ns' is usually a floating number. For example for a
+ * 2.5 GHz CPU the scaling factor becomes 1,000,000 /
+ * 2,500,000 = 1.2.
+ *
+ * To be able to use integer math we work around the lack of
+ * precision. We first scale the integer up and after the
+ * multiplication by the number of ticks in
+ * INSTR_TIME_GET_NANOSEC() we divide again by the same value.
+ * We picked the scaler such that it provides enough precision
+ * and is a power-of-two which allows for shifting instead of
+ * doing an integer division.
+ */
+ ticks_per_ns_scaled = INT64CONST(1000000) * TICKS_TO_NS_PRECISION / tsc_freq;
+ ticks_per_sec = tsc_freq * 1000; /* KHz->Hz */
+ max_ticks_no_overflow = PG_INT64_MAX / ticks_per_ns_scaled;
+
+ has_rdtsc = true;
+ has_rdtscp = is_rdtscp_available();
+ }
+ }
+
+ fclose(fp);
+ }
+}
+#endif /* defined(__x86_64__) && defined(__linux__) */
+
+#endif /* WIN32 */
diff --git a/src/common/meson.build b/src/common/meson.build
index b757618a9c9..042edb7473a 100644
--- a/src/common/meson.build
+++ b/src/common/meson.build
@@ -13,6 +13,7 @@ common_sources = files(
'file_perm.c',
'file_utils.c',
'hashfn.c',
+ 'instr_time.c',
'ip.c',
'jsonapi.c',
'keywords.c',
diff --git a/src/include/portability/instr_time.h b/src/include/portability/instr_time.h
index 490593d1825..c2cea75b7c8 100644
--- a/src/include/portability/instr_time.h
+++ b/src/include/portability/instr_time.h
@@ -4,9 +4,11 @@
* portable high-precision interval timing
*
* This file provides an abstraction layer to hide portability issues in
- * interval timing. On Unix we use clock_gettime(), and on Windows we use
- * QueryPerformanceCounter(). These macros also give some breathing room to
- * use other high-precision-timing APIs.
+ * interval timing. On Linux/x86 we use the rdtsc instruction when a TSC
+ * clocksource is also used on the host OS. Otherwise, and on other
+ * Unix-like systems we use clock_gettime() and on Windows we use
+ * QueryPerformanceCounter(). These macros also give some breathing
+ * room to use other high-precision-timing APIs.
*
* The basic data type is instr_time, which all callers should treat as an
* opaque typedef. instr_time can store either an absolute time (of
@@ -17,10 +19,11 @@
*
* INSTR_TIME_SET_ZERO(t) set t to zero (memset is acceptable too)
*
- * INSTR_TIME_SET_CURRENT(t) set t to current time
+ * INSTR_TIME_SET_CURRENT_FAST(t) set t to current time without waiting
+ * for instructions in out-of-order window
*
- * INSTR_TIME_SET_CURRENT_LAZY(t) set t to current time if t is zero,
- * evaluates to whether t changed
+ * INSTR_TIME_SET_CURRENT(t) set t to current time while waiting for
+ * instructions in OOO to retire
*
* INSTR_TIME_ADD(x, y) x += y
*
@@ -81,6 +84,15 @@ typedef struct instr_time
#ifndef WIN32
+/*
+ * Make sure this is a power-of-two, so that the compiler can turn the
+ * multiplications and divisions into shifts.
+ */
+#define TICKS_TO_NS_PRECISION (1<<14)
+
+extern int64 ticks_per_ns_scaled;
+extern int64 ticks_per_sec;
+extern int64 max_ticks_no_overflow;
/* Use clock_gettime() */
@@ -106,9 +118,18 @@ typedef struct instr_time
#define PG_INSTR_CLOCK CLOCK_REALTIME
#endif
-/* helper for INSTR_TIME_SET_CURRENT */
+#if defined(__x86_64__) && defined(__linux__)
+#include <x86intrin.h>
+#include <cpuid.h>
+
+extern bool has_rdtsc;
+extern bool has_rdtscp;
+
+extern void pg_initialize_rdtsc(void);
+#endif
+
static inline instr_time
-pg_clock_gettime_ns(void)
+pg_clock_gettime(void)
{
instr_time now;
struct timespec tmp;
@@ -119,11 +140,94 @@ pg_clock_gettime_ns(void)
return now;
}
+static inline instr_time
+pg_get_ticks_fast(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtsc)
+ {
+ instr_time now;
+
+ now.ticks = __rdtsc();
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline instr_time
+pg_get_ticks(void)
+{
+#if defined(__x86_64__) && defined(__linux__)
+ if (has_rdtscp)
+ {
+ instr_time now;
+ uint32 unused;
+
+ now.ticks = __rdtscp(&unused);
+ return now;
+ }
+#endif
+
+ return pg_clock_gettime();
+}
+
+static inline int64_t
+pg_ticks_to_ns(instr_time t)
+{
+ /*
+ * Would multiplication overflow? If so perform computation in two parts.
+ * Check overflow without actually overflowing via: a * b > max <=> a >
+ * max / b
+ */
+ int64 ns = 0;
+
+ if (unlikely(t.ticks > max_ticks_no_overflow))
+ {
+ /*
+ * Compute how often the maximum number of ticks fits completely into
+ * the number of elapsed ticks and convert that number into
+ * nanoseconds. Then multiply by the count to arrive at the final
+ * value. In a 2nd step we adjust the number of elapsed ticks and
+ * convert the remaining ticks.
+ */
+ int64 count = t.ticks / max_ticks_no_overflow;
+ int64 max_ns = max_ticks_no_overflow * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+
+ ns = max_ns * count;
+
+ /*
+ * Subtract the ticks that we now already accounted for, so that they
+ * don't get counted twice.
+ */
+ t.ticks -= count * max_ticks_no_overflow;
+ Assert(t.ticks >= 0);
+ }
+
+ ns += t.ticks * ticks_per_ns_scaled / TICKS_TO_NS_PRECISION;
+ return ns;
+}
+
+static inline void
+pg_initialize_get_ticks()
+{
+#if defined(__x86_64__) && defined(__linux__)
+ pg_initialize_rdtsc();
+#endif
+}
+
+#define INSTR_TIME_INITIALIZE() \
+ pg_initialize_get_ticks()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_get_ticks_fast())
+
#define INSTR_TIME_SET_CURRENT(t) \
- ((t) = pg_clock_gettime_ns())
+ ((t) = pg_get_ticks())
#define INSTR_TIME_GET_NANOSEC(t) \
- ((int64) (t).ticks)
+ pg_ticks_to_ns(t)
#else /* WIN32 */
@@ -131,7 +235,7 @@ pg_clock_gettime_ns(void)
/* Use QueryPerformanceCounter() */
-/* helper for INSTR_TIME_SET_CURRENT */
+/* helper for INSTR_TIME_SET_CURRENT / INSTR_TIME_SET_CURRENT_FAST */
static inline instr_time
pg_query_performance_counter(void)
{
@@ -153,6 +257,11 @@ GetTimerFrequency(void)
return (double) f.QuadPart;
}
+#define INSTR_TIME_INITIALIZE()
+
+#define INSTR_TIME_SET_CURRENT_FAST(t) \
+ ((t) = pg_query_performance_counter())
+
#define INSTR_TIME_SET_CURRENT(t) \
((t) = pg_query_performance_counter())
@@ -168,13 +277,8 @@ GetTimerFrequency(void)
#define INSTR_TIME_IS_ZERO(t) ((t).ticks == 0)
-
#define INSTR_TIME_SET_ZERO(t) ((t).ticks = 0)
-#define INSTR_TIME_SET_CURRENT_LAZY(t) \
- (INSTR_TIME_IS_ZERO(t) ? INSTR_TIME_SET_CURRENT(t), true : false)
-
-
#define INSTR_TIME_ADD(x,y) \
((x).ticks += (y).ticks)
--
2.51.0