Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Started by Andres Freund, over 5 years ago · 112 messages
#1Andres Freund
andres@anarazel.de

Hi,

Currently using EXPLAIN (ANALYZE) without TIMING OFF regularly changes
the resulting timing enough that the times aren't meaningful. E.g.

CREATE TABLE lotsarows(key int not null);
INSERT INTO lotsarows SELECT generate_series(1, 50000000);
VACUUM FREEZE lotsarows;

-- best of three:
SELECT count(*) FROM lotsarows;
Time: 1923.394 ms (00:01.923)

-- best of three:
EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 2319.830 ms (00:02.320)

-- best of three:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4202.649 ms (00:04.203)

That's nearly *double* the execution time of the TIMING OFF case.

Looking at a profile of this shows that we spend a good bit of cycles
"normalizing" timestamps etc. That seems pretty unnecessary, just forced
on us due to struct timespec. So the first attached patch just turns
instr_time into a 64-bit integer, counting nanoseconds.

That helps, a tiny bit:
EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 4179.302 ms (00:04.179)

but obviously doesn't move the needle.

Looking at a profile it's easy to confirm that we spend a lot of time
acquiring time:
-   95.49%     0.00%  postgres     postgres                 [.] agg_retrieve_direct (inlined)
   - agg_retrieve_direct (inlined)
      - 79.27% fetch_input_tuple
         - ExecProcNode (inlined)
            - 75.72% ExecProcNodeInstr
               + 25.22% SeqNext
               - 21.74% InstrStopNode
                  + 17.80% __GI___clock_gettime (inlined)
               - 21.44% InstrStartNode
                  + 19.23% __GI___clock_gettime (inlined)
               + 4.06% ExecScan
      + 13.09% advance_aggregates (inlined)
        1.06% MemoryContextReset

And that's even though Linux avoids a syscall (in most cases) to acquire
the time. Unless the kernel detects there's a reason not to do so, Linux
does this by using 'rdtscp' and multiplying it by kernel-provided
factors to turn the cycles into time.

Some of the time is spent doing function calls, dividing into struct
timespec, etc. But most of it is just the rdtscp instruction itself:
65.30 │1 63: rdtscp

The reason for that is largely that rdtscp waits until all prior
instructions have finished (though it allows later instructions to start
early). And we pay that cost multiple times for each tuple.
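To make the distinction concrete, here is a minimal sketch of the two kinds of TSC read, using the GCC/Clang intrinsics; the clock_gettime fallback for non-x86 builds is my assumption, mirroring the prototype's general approach:

```c
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__)
#include <x86intrin.h>
#endif

/* Unordered TSC read: cheap, may be reordered with surrounding work. */
static inline uint64_t
read_tsc_unordered(void)
{
#if defined(__x86_64__)
	return __rdtsc();
#else
	/* fallback for non-x86 builds */
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t) ts.tv_sec * 1000000000u + (uint64_t) ts.tv_nsec;
#endif
}

/*
 * Partially ordered TSC read: rdtscp waits for all prior instructions
 * to finish before sampling the counter (later instructions may still
 * start early). This is the stall paid on every read via the kernel's
 * clock_gettime path.
 */
static inline uint64_t
read_tsc_ordered(void)
{
#if defined(__x86_64__)
	unsigned int aux;

	return __rdtscp(&aux);
#else
	return read_tsc_unordered();
#endif
}
```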

In the second attached prototype patch I've changed instr_time to count
in CPU cycles instead of nanoseconds, and then just turned the cycles
into seconds in INSTR_TIME_GET_DOUBLE() (more about that part later).

When using rdtsc that results in *vastly* lower overhead:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ QUERY PLAN │
├───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ Aggregate (cost=846239.20..846239.21 rows=1 width=8) (actual time=2610.235..2610.235 rows=1 loops=1) │
│ -> Seq Scan on lotsarows (cost=0.00..721239.16 rows=50000016 width=0) (actual time=0.006..1512.886 rows=50000000 loops=1) │
│ Planning Time: 0.028 ms │
│ Execution Time: 2610.256 ms │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
(4 rows)

Time: 2610.589 ms (00:02.611)

And there are still some smaller improvements that could be made on top
of that.

As a comparison, here's the time when using rdtscp directly in
instr_time, instead of going through clock_gettime:
Time: 3481.162 ms (00:03.481)

That shows pretty well how big the cost of the added pipeline stalls
are, and how important out-of-order execution is for decent
performance...

In my opinion, for the use in InstrStartNode(), InstrStopNode() etc, we
do *not* want to wait for prior instructions to finish, since that
actually leads to the timing being less accurate, rather than
more. There are other cases where that'd be different, e.g. measuring
how long an entire query takes or such (but there it's probably
irrelevant which to use).

Above I've skipped a bit over the details of how to turn the cycles
returned by rdtsc into time:

On x86 CPUs of the last ~12 years rdtsc doesn't return the cycles that
have actually been run, but instead returns the number of 'reference
cycles'. That's important because otherwise things like turbo mode and
lower power modes would lead to completely bogus times.

Thus, knowing the "base frequency" of the CPU allows us to turn the
difference between two rdtsc return values into seconds.

In the attached prototype I just determined the base frequency using
cpuid(0x16). That's only available since Skylake (I think). On older
CPUs we'd have to look at /proc/cpuinfo or
/sys/devices/system/cpu/cpu0/cpufreq/base_frequency.
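The conversion itself is a single division. A sketch (assuming, as Intel documents for leaf 0x16, that EAX reports the base frequency in MHz; the function name is mine):

```c
#include <stdint.h>

/*
 * Turn a difference of two rdtsc readings into seconds, given the
 * CPU base frequency in MHz (e.g. as reported by cpuid leaf 0x16).
 */
static double
tsc_cycles_to_seconds(uint64_t cycles, uint32_t base_freq_mhz)
{
	return (double) cycles / ((double) base_freq_mhz * 1000000.0);
}
```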

There's also other issues with using rdtsc directly: On older CPUs, in
particular older multi-socket systems, the tsc will not be synchronized
in detail across cores. There's bits that'd let us check whether tsc is
suitable or not. The more current issue is that things like virtual
machines being migrated can lead to rdtsc suddenly returning a different
value / the frequency differing. But that is supposed to be
solved these days, by having virtualization technologies set frequency
multipliers and offsets which then cause rdtsc[p] to return something
meaningful, even after migration.

The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found the overhead made
it too hard to nail down in which part of a query tree performance
changed. If somebody else wants to continue from here...

I do think it'd be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much, even if it requires
a bunch of low-level code.

Greetings,

Andres Freund

Attachments:

v1-0001-WIP-Change-instr_time-to-just-store-nanoseconds-t.patch (text/x-diff) +26 -37
v1-0002-WIP-Use-cpu-reference-cycles-via-rdtsc-to-measure.patch (text/x-diff) +60 -9
#2Pavel Stehule
pavel.stehule@gmail.com
In reply to: Andres Freund (#1)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

On Sat, Jun 13, 2020 at 1:28 AM Andres Freund <andres@anarazel.de> wrote:

[...]

I do think it'd be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much, even if it requires
a bunch of low-level code.

+1

Pavel

#3Thomas Munro
thomas.munro@gmail.com
In reply to: Andres Freund (#1)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

On Sat, Jun 13, 2020 at 11:28 AM Andres Freund <andres@anarazel.de> wrote:

[PATCH v1 1/2] WIP: Change instr_time to just store nanoseconds, that's cheaper.

Makes a lot of sense. If we do this, I'll need to update pgbench,
which just did something similar locally. If I'd been paying
attention to this thread I might not have committed that piece of the
recent pgbench changes, but it's trivial stuff and I'll be happy to
tidy that up when the time comes.

[PATCH v1 2/2] WIP: Use cpu reference cycles, via rdtsc, to measure time for instrumentation.

Some of the time is spent doing function calls, dividing into struct
timespec, etc. But most of it just the rdtscp instruction:
65.30 │1 63: rdtscp

The reason for that is largely that rdtscp waits until all prior
instructions have finished (but it allows later instructions to already
start). Multiple times for each tuple.

Yeah, after reading a bit about this, I agree that there is no reason
to think that the stalling version makes the answer better in any way.
It might make sense if you use it once at the beginning of a large
computation, but it makes no sense if you sprinkle it around inside
blocks that will run multiple times. It destroys your
instructions-per-cycle, turning your fancy superscalar Pentium into a
486. It does raise some interesting questions about what exactly you're
measuring, though: I don't know enough to have a good grip on how far
out of order the TSC could be read!

There's also other issues with using rdtsc directly: On older CPUs, in
particular older multi-socket systems, the tsc will not be synchronized
in detail across cores. There's bits that'd let us check whether tsc is
suitable or not. The more current issue of that is that things like
virtual machines being migrated can lead to rdtsc suddenly returning a
different value / the frequency differening. But that is supposed to be
solved these days, by having virtualization technologies set frequency
multipliers and offsets which then cause rdtsc[p] to return something
meaningful, even after migration.

Googling tells me that Nehalem (2008) introduced "invariant TSC"
(clock rate independent) and also socket synchronisation at the same
time, so systems without it are already pretty long in the tooth.

A quick peek at an AMD manual [1] (page 373) tells me that a similar
change happened in 15H/Bulldozer/Piledriver/Steamroller/Excavator
(2011), identified with the same CPUID test.

My first reaction is that it seems like TSC would be the least of your
worries if you're measuring a VM that's currently migrating between
hosts, but maybe the idea is just that you have to make sure you don't
assume it can't ever go backwards or something like that...

Google Benchmark has some clues about how to spell this on MSVC, what
some instructions might be to research on ARM, etc.

[1]: https://www.amd.com/system/files/TechDocs/47414_15h_sw_opt_guide.pdf (page 373)
[2]: https://github.com/google/benchmark/blob/master/src/cycleclock.h

#4Lukas Fittl
lukas@fittl.com
In reply to: Andres Freund (#1)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

On Fri, Jun 12, 2020 at 4:28 PM Andres Freund <andres@anarazel.de> wrote:

The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found it the overhead
made it too hard to nail down in which part of a query tree performance
changed. If somebody else wants to continue from here...

I do think it's be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a
bunch of low-level code.

Based on an off-list conversation with Andres, I decided to dust off this
old patch for using rdtsc directly. The significant EXPLAIN ANALYZE
performance improvements (especially when using rdtsc instead of
rdtsc*p*) seem to warrant giving this a more thorough look.

See attached an updated patch (adding it to the July commitfest), with a
few changes:

- Keep using clock_gettime() as a fallback if we decide to not use rdtsc
- Fall back to /proc/cpuinfo for the clock frequency, if cpuid(0x16)
  doesn't work
- The decision to use rdtsc (or not) is made at runtime, in the new
  INSTR_TIME_INITIALIZE() -- we can't make this decision at compile time
  because it depends on the specific CPU in use, amongst other things
- In an abundance of caution, for now I've decided to only enable this if
  we are on Linux/x86, and the current kernel clocksource is TSC (the
  kernel has quite sophisticated logic around making this decision, see [1])
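The clocksource check can be sketched by reading sysfs; the path below is the standard Linux clocksource ABI, and the function name is mine rather than the patch's:

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/*
 * Return true only if the kernel itself decided the TSC is trustworthy
 * enough to use as its clocksource. On any error, be conservative and
 * report false, so callers fall back to clock_gettime().
 */
static bool
kernel_clocksource_is_tsc(void)
{
	FILE	   *f;
	char		buf[32] = {0};
	bool		result = false;

	f = fopen("/sys/devices/system/clocksource/clocksource0/current_clocksource", "r");
	if (f == NULL)
		return false;
	if (fgets(buf, sizeof(buf), f) != NULL)
		result = (strncmp(buf, "tsc", 3) == 0);
	fclose(f);
	return result;
}
```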

Note that if we implemented the decision logic ourselves (instead of
relying on the Linux kernel), I'd be most worried about older
virtualization technology. In my understanding getting this right is
notably more complicated than just checking cpuid, see [2].

Known WIP problems with this patch version:

* There appears to be a timing discrepancy I haven't yet worked out, where
  the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
  reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
  higher for \timing than for the EXPLAIN ANALYZE time reported on the
  server side, only when rdtsc measurement is used -- it's likely there is
  a problem somewhere with how we perform the cycles-to-time conversion
* Possibly related, the floating point handling for the cycles_to_sec
  variable is problematic in terms of precision (see FIXME, taken over
  from Andres' POC)

Open questions from me:

1) Do we need to account for different TSC offsets on different CPUs in SMP
   systems? (the Linux kernel certainly has logic to that extent, but [3]
   suggests this is no longer a problem on Nehalem and newer chips, i.e.
   those having an invariant TSC)

2) Should we have a setting "--with-tsc" for configure? (instead of always
   enabling it when on Linux/x86 with a TSC clocksource)

3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for
   current instructions to finish -- the prior discussion seemed to suggest
   we don't want it for node instrumentation measurements, but possibly we
   do want this in other cases?)

4) Should we support using the "mrs" instruction on ARM? (which is similar
   to rdtsc, see [4])

Thanks,
Lukas

[1]: https://github.com/torvalds/linux/blob/master/arch/x86/kernel/tsc.c
[2]: http://oliveryang.net/2015/09/pitfalls-of-TSC-usage/
[3]: https://stackoverflow.com/a/11060619/1652607
[4]: https://cpufun.substack.com/p/fun-with-timers-and-cpuid

--
Lukas Fittl

Attachments:

v2-0002-WIP-Use-cpu-reference-cycles-via-rdtsc-to-measure.patch (application/octet-stream) +163 -12
v2-0001-WIP-Change-instr_time-to-just-store-nanoseconds-t.patch (application/octet-stream) +26 -37
#5Andres Freund
andres@anarazel.de
In reply to: Lukas Fittl (#4)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Hi,

On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:

On Fri, Jun 12, 2020 at 4:28 PM Andres Freund <andres@anarazel.de> wrote:

The attached patches are really just a prototype. I'm also not really
planning to work on getting this into a "production ready" patchset
anytime soon. I developed it primarily because I found it the overhead
made it too hard to nail down in which part of a query tree performance
changed. If somebody else wants to continue from here...

I do think it's be a pretty significant improvement if we could reduce
the timing overhead of EXPLAIN ANALYZE by this much. Even if requires a
bunch of low-level code.

Based on an off-list conversation with Andres, I decided to dust off this
old patch for using rdtsc directly. The significant EXPLAIN ANALYZE
performance improvements (especially when using rdtsc instead of
rdtsc*p*) seem to warrant giving this a more thorough look.

See attached an updated patch (adding it to the July commitfest), with a few
changes:

- Keep using clock_gettime() as a fallback if we decide to not use rdtsc

Yep.

- Fallback to /proc/cpuinfo for clock frequency, if cpuid(0x16) doesn't work

I suspect that this might not be needed anymore. Seems like it'd be ok to just
fall back to clock_gettime() in that case.

- In an abundance of caution, for now I've decided to only enable this if
  we are on Linux/x86, and the current kernel clocksource is TSC (the
  kernel has quite sophisticated logic around making this decision, see [1])

I think our requirements are a bit lower than the kernel's - we're not
tracking wall clock over an extended period...

Note that if we implemented the decision logic ourselves (instead of
relying on the Linux kernel), I'd be most worried about older
virtualization technology. In my understanding getting this right is
notably more complicated than just checking cpuid, see [2].

Known WIP problems with this patch version:

* There appears to be a timing discrepancy I haven't yet worked out, where
  the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
  reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
  higher for \timing than for the EXPLAIN ANALYZE time reported on the
  server side, only when rdtsc measurement is used -- it's likely there is
  a problem somewhere with how we perform the cycles-to-time conversion

Could you explain a bit more what you're seeing? I just tested your patches
and didn't see that here.

* Possibly related, the floating point handling for the cycles_to_sec
  variable is problematic in terms of precision (see FIXME, taken over
  from Andres' POC)

And probably also performance...

Open questions from me:

1) Do we need to account for different TSC offsets on different CPUs in SMP
   systems? (the Linux kernel certainly has logic to that extent, but [3]
   suggests this is no longer a problem on Nehalem and newer chips, i.e.
   those having an invariant TSC)

I don't think we should cater to systems where we need that.

2) Should we have a setting "--with-tsc" for configure? (instead of always
enabling it when on Linux/x86 with a TSC clocksource)

Probably not worth it.

3) Are there cases where we actually want to use rdtsc*p*? (i.e. wait for
   current instructions to finish -- the prior discussion seemed to suggest
   we don't want it for node instrumentation measurements, but possibly we
   do want this in other cases?)

I was wondering about that too... Perhaps we should add a
INSTR_TIME_SET_CURRENT_BARRIER() or such?

4) Should we support using the "mrs" instruction on ARM? (which is similar
   to rdtsc, see [4])

I'd leave that for later personally.

#define NS_PER_S INT64CONST(1000000000)
#define US_PER_S INT64CONST(1000000)
#define MS_PER_S INT64CONST(1000)
@@ -95,17 +104,37 @@ typedef int64 instr_time;

#define INSTR_TIME_SET_ZERO(t) ((t) = 0)

-static inline instr_time pg_clock_gettime_ns(void)
+extern double cycles_to_sec;
+
+bool use_rdtsc;

This should be extern and inside the ifdef below.

+#if defined(__x86_64__) && defined(__linux__)
+extern void pg_clock_gettime_initialize_rdtsc(void);
+#endif
+
+static inline instr_time pg_clock_gettime_ref_cycles(void)
 {
 	struct timespec tmp;
+#if defined(__x86_64__) && defined(__linux__)
+	if (use_rdtsc)
+		return __rdtsc();
+#endif
+
 	clock_gettime(PG_INSTR_CLOCK, &tmp);

 	return tmp.tv_sec * NS_PER_S + tmp.tv_nsec;
 }

Greetings,

Andres Freund

#6Maciek Sakrejda
m.sakrejda@gmail.com
In reply to: Andres Freund (#5)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

I ran that original test case with and without the patch. Here are the
numbers I'm seeing:

master (best of three):

postgres=# SELECT count(*) FROM lotsarows;
Time: 582.423 ms

postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 616.102 ms

postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 1068.700 ms (00:01.069)

patched (best of three):

postgres=# SELECT count(*) FROM lotsarows;
Time: 550.822 ms

postgres=# EXPLAIN (ANALYZE, TIMING OFF) SELECT count(*) FROM lotsarows;
Time: 612.572 ms

postgres=# EXPLAIN (ANALYZE, TIMING ON) SELECT count(*) FROM lotsarows;
Time: 690.875 ms

On Fri, Jul 1, 2022 at 10:26 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
...

Known WIP problems with this patch version:

* There appears to be a timing discrepancy I haven't yet worked out, where
the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
higher for \timing than for the EXPLAIN ANALYZE time reported on the
server
side, only when rdtsc measurement is used -- its likely there is a problem
somewhere with how we perform the cycles to time conversion

Could you explain a bit more what you're seeing? I just tested your patches
and didn't see that here.

I did not see this either, but I did see that the execution time
reported by \timing is (for this test case) consistently 0.5-1ms
*lower* than the Execution Time reported by EXPLAIN. I did not see
that on master. Is that expected?

Thanks,
Maciek

#7Ibrar Ahmed
ibrar.ahmad@gmail.com
In reply to: Maciek Sakrejda (#6)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

On Fri, Jul 15, 2022 at 11:22 PM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:

[...]

I did not see this either, but I did see that the execution time
reported by \timing is (for this test case) consistently 0.5-1ms
*lower* than the Execution Time reported by EXPLAIN. I did not see
that on master. Is that expected?

The patch requires a rebase; please rebase the patch with the latest code.

Hunk #5 succeeded at 147 with fuzz 2 (offset -3 lines).
Hunk #6 FAILED at 170.
Hunk #7 succeeded at 165 (offset -69 lines).
2 out of 7 hunks FAILED -- saving rejects to file
src/include/portability/instr_time.h.rej
patching file src/tools/msvc/Mkvcbuild.pm

--
Ibrar Ahmed

#8Michael Paquier
michael@paquier.xyz
In reply to: Ibrar Ahmed (#7)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

On Tue, Sep 06, 2022 at 11:32:18AM +0500, Ibrar Ahmed wrote:

Hunk #5 succeeded at 147 with fuzz 2 (offset -3 lines).
Hunk #6 FAILED at 170.
Hunk #7 succeeded at 165 (offset -69 lines).
2 out of 7 hunks FAILED -- saving rejects to file
src/include/portability/instr_time.h.rej
patching file src/tools/msvc/Mkvcbuild.pm

No rebased version has been sent since this update, so this patch has
been marked as RwF.
--
Michael

#9David Geier
geidav.pg@gmail.com
In reply to: Michael Paquier (#8)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

I think it would be great to get this patch committed. Beyond the
reasons already mentioned, the significant overhead also tends to skew
the reported runtimes in ways that makes it difficult to compare them.
For example, if two nodes are executed equally often but one needs twice
the time to process the rows: in such a case EXPLAIN ANALYZE should
report timings that are 2x apart. However, currently, the high overhead
of clock_gettime() tends to skew the relative runtimes.

On 10/12/22 10:33, Michael Paquier wrote:

No rebased version has been sent since this update, so this patch has
been marked as RwF.

I've rebased the patch set on latest master and fixed a few compiler
warnings. Beyond that some findings and thoughts:

You're only using RDTSC if the clock source is 'tsc'. Great idea to not
bother caring about a lot of hairy TSC details. Looking at the kernel
code this seems to imply that the TSC is frequency invariant. I don't
think though that this implies that Linux is not running under a
hypervisor; which is good because I assume PostgreSQL is used a lot in
VMs. However, when running under a hypervisor (at least with VMWare)
CPUID leaf 0x16 is not available. In my tests __get_cpuid() indicated
success but the returned values were garbage. Instead of using leaf
0x16, we should then use the hypervisor interface to obtain the TSC
frequency. Checking if a hypervisor is active can be done via:

bool IsHypervisorActive()
{
    uint32 cpuinfo[4] = {0};
    int res = __get_cpuid(0x1, &cpuinfo[0], &cpuinfo[1],
                          &cpuinfo[2], &cpuinfo[3]);

    return res > 0 && (cpuinfo[2] & (1 << 30));
}

Obtaining the TSC frequency via the hypervisor interface can be done
with the following code. See https://lwn.net/Articles/301888/ for more
details.

/*
 * Under hypervisors (tested with VMWare) leaf 0x16 is not available, even
 * though __get_cpuid() succeeds. Hence, if running under a hypervisor, use
 * the hypervisor interface to obtain the TSC frequency.
 */
uint32 cpuinfo[4] = {0};

if (IsHypervisorActive() &&
    __get_cpuid(0x40000001, &cpuinfo[0], &cpuinfo[1],
                &cpuinfo[2], &cpuinfo[3]) > 0)
    cycles_to_sec = 1.0 / ((double) cpuinfo[0] * 1000 * 1000);

Given that we anyway switch between RDTSC and clock_gettime() with a
global variable, what about exposing the clock source as a GUC? That way
the user can switch back to a working clock source in case we miss a
detail around activating or reading the TSC.

I'm happy to update the patches accordingly.

--
David Geier
(ServiceNow)

#10David Geier
geidav.pg@gmail.com
In reply to: David Geier (#9)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

I missed attaching the patches.

--
David Geier
(ServiceNow)

Attachments:

0001-WIP-Change-instr_time-to-just-store-nanoseconds-v3.patch (text/x-patch) +26 -37
0002-WIP-Use-cpu-reference-cycles-via-rdtsc-to-measure-v3.patch (text/x-patch) +157 -11
#11David Geier
geidav.pg@gmail.com
In reply to: David Geier (#10)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Hi,

I re-based again on master and applied the following changes:

I removed the fallback for obtaining the TSC frequency from /proc/cpuinfo
as suggested by Andres. Worst-case we fall back to clock_gettime().

I added code to obtain the TSC frequency via CPUID when under a
hypervisor. I had to use __cpuid() directly instead of __get_cpuid(),
because __get_cpuid() returns an error if the leaf is > 0x80000000
(probably the implementation pre-dates the hypervisor timing leafs).
Unfortunately, while testing my implementation under VMWare, I found
that RDTSC runs awfully slow there (like 30x slower). [1] (page 11
"Virtual TSC") indicates that we cannot generally rely on RDTSC being
actually fast on VMs. However, the same applies to clock_gettime(). It
runs as slow as RDTSC on my VMware setup. Hence, using RDTSC is not at a
disadvantage. I'm not
entirely sure if there aren't cases where e.g. clock_gettime() is
actually faster than RDTSC and it would be advantageous to use
clock_gettime(). We could add a GUC so that the user can decide which
clock source to use. Any thoughts?

I also somewhat improved the accuracy of the cycles to milli- and
microseconds conversion functions by having two more multipliers with
higher precision. For microseconds we could also keep the computation
integer-only. I'm wondering what to best do for seconds and
milliseconds. I'm currently leaning towards just keeping it as is,
because the durations measured and converted are usually long enough
that precision shouldn't be a problem.

In vacuumlazy.c we do INSTR_TIME_GET_MICROSEC(elapsed) / 1000. I
changed that to use INSTR_TIME_GET_MILLISEC() instead. Additionally, I
initialized a few variables of type instr_time which otherwise resulted
in warnings due to use of potentially uninitialized variables.

I also couldn't reproduce the reported timing discrepancy. For me the
runtime reported by \timing is just slightly higher than the time
reported by EXPLAIN ANALYZE, which is expected.

Beyond that:

What about renaming INSTR_TIME_GET_DOUBLE() to INSTR_TIME_GET_SECS() so
that it's consistent with the _MILLISEC() and _MICROSEC() variants?

INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
return double. This seems error prone. What about renaming the function,
or having it return a double and casting where necessary at the call
site?

If no one objects I would also re-register this patch in the commit fest.

[1]: https://vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf (page 11 "Virtual TSC")

--
David Geier
(ServiceNow)

Attachments:

0001-Change-instr_time-to-just-store-nanoseconds-v4.patch (text/x-patch, +26/-37)
0002-Use-CPU-reference-cycles-via-RDTSC-v4.patch (text/x-patch, +175/-12)
0003-Refactor-some-instr_time-related-code-v4.patch (text/x-patch, +9/-11)
#12Lukas Fittl
lukas@fittl.com
In reply to: David Geier (#11)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Hi David,

Thanks for continuing to work on this patch, and my apologies for silence
on the patch.

It's been hard to make time, and especially so because I typically develop
on an ARM-based macOS system where I can't test this directly; hence my
tests with virtualized EC2 instances, where I ran into the timing oddities.

On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:

The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other variants
return double. This seems error prone. What about renaming the function
or also have the function return a double and cast where necessary at
the call site?

Minor note, but in my understanding using a uint64 (where we can) is faster
for any simple arithmetic we do with the values.

If no one objects I would also re-register this patch in the commit fest.

+1, and feel free to carry this patch forward - I'll try to make an effort
to review my earlier testing issues again, as well as your later
improvements to the patch.

Also, FYI, I just posted an alternate idea for speeding up EXPLAIN ANALYZE
with timing over in [0], using a sampling-based approach to reduce the
timing overhead.

[0]: /messages/by-id/CAP53PkxXMk0j-+0=YwRti2pFR5UB2Gu4v2Oyk8hhZS0DRART6g@mail.gmail.com

Thanks,
Lukas

--
Lukas Fittl

#13Maciek Sakrejda
m.sakrejda@gmail.com
In reply to: Maciek Sakrejda (#6)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

On Fri, Jul 15, 2022 at 11:21 AM Maciek Sakrejda <m.sakrejda@gmail.com> wrote:

On Fri, Jul 1, 2022 at 10:26 AM Andres Freund <andres@anarazel.de> wrote:

On 2022-07-01 01:23:01 -0700, Lukas Fittl wrote:
...

Known WIP problems with this patch version:

* There appears to be a timing discrepancy I haven't yet worked out, where
the \timing data reported by psql doesn't match what EXPLAIN ANALYZE is
reporting. With Andres' earlier test case, I'm seeing a consistent ~700ms
higher for \timing than for the EXPLAIN ANALYZE time reported on the
server
side, only when rdtsc measurement is used -- it's likely there is a problem
somewhere with how we perform the cycles to time conversion

Could you explain a bit more what you're seeing? I just tested your patches
and didn't see that here.

I did not see this either, but I did see that the execution time
reported by \timing is (for this test case) consistently 0.5-1ms
*lower* than the Execution Time reported by EXPLAIN. I did not see
that on master. Is that expected?

For what it's worth, I can no longer reproduce this. In fact, I went
back to master-as-of-around-then and applied Lukas' v2 patches again,
and I still can't reproduce that. I do remember it happening
consistently across several executions, but now \timing consistently
shows 0.5-1ms slower, as expected. This does not explain the different
timing issue Lukas was seeing in his tests, but I think we can assume
what I reported originally here is not an issue.

#14David Geier
geidav.pg@gmail.com
In reply to: Lukas Fittl (#12)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Hi Lukas,

On 1/2/23 20:50, Lukas Fittl wrote:

Thanks for continuing to work on this patch, and my apologies for
silence on the patch.

It would be great if you could review it.
Please also share your thoughts around exposing the used clock source as
GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS().

I rebased again on master because of [1]. Patches attached.

It's been hard to make time, and especially so because I typically
develop on an ARM-based macOS system where I can't test this directly;
hence my tests with virtualized EC2 instances, where I ran into the
timing oddities.

That's good and bad: bad for doing the development, but good for testing
the implementation on more virtualized setups, given that I also
encountered "interesting" behavior on VMware (see my previous mails).

On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:

The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other
variants
return double. This seems error prone. What about renaming the
function
or also have the function return a double and cast where necessary at
the call site?

Minor note, but in my understanding using a uint64 (where we can) is
faster for any simple arithmetic we do with the values.

That's true. So the argument could be that for seconds and milliseconds
we want the extra precision while microseconds are precise enough.
Still, we could also make the seconds and milliseconds conversion code
integer only and e.g. return two integers with the value before and
after the comma. FWICS, the functions are nowhere used in performance
critical code, so it doesn't really make a difference performance-wise.

+1, and feel free to carry this patch forward - I'll try to make an
effort to review my earlier testing issues again, as well as your
later improvements to the patch.

Moved to the current commit fest. Will you become reviewer?

Also, FYI, I just posted an alternate idea for speeding up EXPLAIN
ANALYZE with timing over in [0], using a sampling-based approach to
reduce the timing overhead.

Interesting idea. I'll reply with some thoughts on the corresponding thread.

[1]: /messages/by-id/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw@mail.gmail.com

--
David Geier
(ServiceNow)

Attachments:

0001-Change-instr_time-to-just-store-nanoseconds-v5.patch (text/x-patch, +26/-37)
0002-Use-CPU-reference-cycles-via-RDTSC-v5.patch (text/x-patch, +175/-12)
0003-Refactor-some-instr_time-related-code-v5.patch (text/x-patch, +9/-11)
#15vignesh C
vignesh21@gmail.com
In reply to: David Geier (#14)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

On Tue, 3 Jan 2023 at 14:08, David Geier <geidav.pg@gmail.com> wrote:

Hi Lukas,

On 1/2/23 20:50, Lukas Fittl wrote:

Thanks for continuing to work on this patch, and my apologies for
silence on the patch.

It would be great if you could review it.
Please also share your thoughts around exposing the used clock source as
GUC and renaming INSTR_TIME_GET_DOUBLE() to _SECS().

I rebased again on master because of [1]. Patches attached.

Its been hard to make time, and especially so because I typically
develop on an ARM-based macOS system where I can't test this directly
- hence my tests with virtualized EC2 instances, where I ran into the
timing oddities.

That's good and bad. Bad to do the development and good to test the
implementation on more virtualized setups; given that I also encountered
"interesting" behavior on VMWare (see my previous mails).

On Mon, Jan 2, 2023 at 5:28 AM David Geier <geidav.pg@gmail.com> wrote:

The INSTR_TIME_GET_MICROSEC() returns a uint64 while the other
variants
return double. This seems error prone. What about renaming the
function
or also have the function return a double and cast where necessary at
the call site?

Minor note, but in my understanding using a uint64 (where we can) is
faster for any simple arithmetic we do with the values.

That's true. So the argument could be that for seconds and milliseconds
we want the extra precision while microseconds are precise enough.
Still, we could also make the seconds and milliseconds conversion code
integer only and e.g. return two integers with the value before and
after the comma. FWICS, the functions are nowhere used in performance
critical code, so it doesn't really make a difference performance-wise.

+1, and feel free to carry this patch forward - I'll try to make an
effort to review my earlier testing issues again, as well as your
later improvements to the patch.

Moved to the current commit fest. Will you become reviewer?

Also, FYI, I just posted an alternate idea for speeding up EXPLAIN
ANALYZE with timing over in [0], using a sampling-based approach to
reduce the timing overhead.

Interesting idea. I'll reply with some thoughts on the corresponding thread.

[1]
/messages/by-id/CALDaNm3kRBGPhndujr9JcjjbDCG3anhj0vW8b9YtbXrBDMSvvw@mail.gmail.com

CFBot shows some compilation errors as in [1], please post an updated
version for the same:
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning:
relocation against `cycles_to_sec' in read-only section `.text'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `pg_clock_gettime_ref_cycles':
[09:08:12.525] /tmp/cirrus-ci-build/build/../src/include/portability/instr_time.h:119:
undefined reference to `use_rdtsc'
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: in
function `test_timing':
[09:08:12.525] /tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:135:
undefined reference to `pg_clock_gettime_initialize_rdtsc'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:137:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:146:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:169:
undefined reference to `cycles_to_us'
[09:08:12.525] /usr/bin/ld:
/tmp/cirrus-ci-build/build/../src/bin/pg_test_timing/pg_test_timing.c:176:
undefined reference to `cycles_to_sec'
[09:08:12.525] /usr/bin/ld: warning: creating DT_TEXTREL in a PIE
[09:08:12.525] collect2: error: ld returned 1 exit status

[1]: https://cirrus-ci.com/task/5375312565895168

Regards,
Vignesh

#16David Geier
geidav.pg@gmail.com
In reply to: vignesh C (#15)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Hi,

CFBot shows some compilation errors as in [1], please post an updated
version for the same:
[09:08:12.525] /usr/bin/ld:
src/bin/pg_test_timing/pg_test_timing.p/pg_test_timing.c.o: warning:
relocation against `cycles_to_sec' in read-only section `.text'
[...]
[09:08:12.525] collect2: error: ld returned 1 exit status

[1] - https://cirrus-ci.com/task/5375312565895168

Regards,
Vignesh

I fixed the compilation error on CFBot.
I missed adding instr_time.c to the Meson makefile.
New patch set attached.

--
David Geier
(ServiceNow)

Attachments:

0001-Change-instr_time-to-just-store-nanoseconds-v6.patch (text/x-patch, +26/-37)
0002-Use-CPU-reference-cycles-via-RDTSC-to-measure-time-v6.patch (text/x-patch, +179/-15)
0003-Refactor-some-instr_time-related-code-v6.patch (text/x-patch, +9/-11)
#17Andres Freund
andres@anarazel.de
In reply to: David Geier (#16)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Hi,

On 2023-01-04 13:02:05 +0100, David Geier wrote:

From be18633d4735f680c7910fcb4e8ac90c4eada131 Mon Sep 17 00:00:00 2001
From: David Geier <geidav.pg@gmail.com>
Date: Thu, 17 Nov 2022 10:22:01 +0100
Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
cheaper.

Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds. We'd also save memory in BufferUsage (144 -> 122 bytes),
Instrumentation (16 bytes saved in Instrumentation itself, 32 via
BufferUsage).

While the range of instr_time storing nanoseconds wouldn't be good enough for
a generic timestamp facility (hence using microsecs for Timestamp), the range
seems plenty for its use of measuring runtime:

((2 ** 63) - 1) / ((10 ** 9) * 60 * 60 * 24 * 365) = ~292 years

Of course, when using CLOCK_REALTIME, this is relative to 1970-01-01, so just
239 years.

It could theoretically be a different story, if we stored instr_time's on
disk. But we don't, they're ephemeral.

This doesn't buy a whole lot of performance - the bottleneck is the actual
timestamp computation. But in a query with not much else going on, it's
visible and reproducible. It's, unsurprisingly, a lot easier to see when using
BUFFERS.

For both timespec and nanosecond, I measured three server starts, and for each
started server three executions of
pgbench -n -Mprepared -c1 -P5 -T15 -f <(echo "EXPLAIN (ANALYZE, BUFFERS) SELECT generate_series(1, 10000000) OFFSET 10000000;")

the best result is:
timespec: 1073.431
nanosec: 957.532
a ~10% difference

Greetings,

Andres Freund

#18Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#17)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Andres Freund <andres@anarazel.de> writes:

On 2023-01-04 13:02:05 +0100, David Geier wrote:

Subject: [PATCH 1/3] Change instr_time to just store nanoseconds, that's
cheaper.

Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds.

Cheaper, and perhaps more accurate too? Don't recall if we have any code
paths where the input timestamps are likely to be better-than-microsecond,
but surely that's coming someday.

I'm unsure that we want to deal with rdtsc's vagaries in general, but
no objection to changing instr_time.

regards, tom lane

#19Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#18)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

Hi,

On 2023-01-13 15:25:16 -0500, Tom Lane wrote:

Andres Freund <andres@anarazel.de> writes:

Does anybody see a reason to not move forward with this aspect? We do a fair
amount of INSTR_TIME_ACCUM_DIFF() etc, and that gets a good bit cheaper by
just using nanoseconds.

Cheaper, and perhaps more accurate too? Don't recall if we have any code
paths where the input timestamps are likely to be better-than-microsecond,
but surely that's coming someday.

instr_time on !WIN32 use struct timespec, so we already should have nanosecond
precision available. IOW, we could add a INSTR_TIME_GET_NANOSEC today. Or am I
misunderstanding what you mean?

I'm unsure that we want to deal with rdtsc's vagaries in general, but
no objection to changing instr_time.

Cool.

Looking at the instr_time.h part of the change, I think it should go further
and basically do the same thing in the WIN32 path. The only part that needs
to be WIN32 specific is INSTR_TIME_SET_CURRENT(). That'd reduce duplication a
good bit.

Greetings,

Andres Freund

#20vignesh C
vignesh21@gmail.com
In reply to: David Geier (#16)
Re: Reduce timing overhead of EXPLAIN ANALYZE using rdtsc?

On Wed, 4 Jan 2023 at 17:32, David Geier <geidav.pg@gmail.com> wrote:

I fixed the compilation error on CFBot.
I missed adding instr_time.c to the Meson makefile.
New patch set attached.

The patch does not apply on top of HEAD as in [1], please post a rebased patch:
=== Applying patches on top of PostgreSQL commit ID
ff23b592ad6621563d3128b26860bcb41daf9542 ===
=== applying patch
./0002-Use-CPU-reference-cycles-via-RDTSC-to-measure-time-v6.patch
....
patching file src/tools/msvc/Mkvcbuild.pm
Hunk #1 FAILED at 135.
1 out of 1 hunk FAILED -- saving rejects to file src/tools/msvc/Mkvcbuild.pm.rej

[1]: http://cfbot.cputube.org/patch_41_3751.log

Regards,
Vignesh

#21Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#17)
#22Andres Freund
andres@anarazel.de
In reply to: David Geier (#11)
#23Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Geier (#16)
#24Pavel Stehule
pavel.stehule@gmail.com
In reply to: Tomas Vondra (#23)
#25Robert Haas
robertmhaas@gmail.com
In reply to: Andres Freund (#17)
#26Andres Freund
andres@anarazel.de
In reply to: Robert Haas (#25)
#27Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#26)
#28Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#27)
#29David Geier
geidav.pg@gmail.com
In reply to: Pavel Stehule (#24)
#30David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#22)
#31David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#28)
#32David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#26)
#33David Geier
geidav.pg@gmail.com
In reply to: David Geier (#29)
#34Tomas Vondra
tomas.vondra@2ndquadrant.com
In reply to: David Geier (#33)
#35Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#28)
#36Tom Lane
tgl@sss.pgh.pa.us
In reply to: Andres Freund (#35)
#37Andres Freund
andres@anarazel.de
In reply to: Tom Lane (#36)
#38Andres Freund
andres@anarazel.de
In reply to: David Geier (#32)
#39Andres Freund
andres@anarazel.de
In reply to: David Geier (#33)
#40Andres Freund
andres@anarazel.de
In reply to: David Geier (#31)
#41Andres Freund
andres@anarazel.de
In reply to: David Geier (#30)
#42Justin Pryzby
pryzby@telsasoft.com
In reply to: Andres Freund (#35)
#43Andres Freund
andres@anarazel.de
In reply to: Justin Pryzby (#42)
#44Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#40)
#45Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#44)
#46David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#39)
#47Andres Freund
andres@anarazel.de
In reply to: David Geier (#46)
#48David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#38)
#49David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#44)
#50Andres Freund
andres@anarazel.de
In reply to: David Geier (#48)
#51Andres Freund
andres@anarazel.de
In reply to: David Geier (#49)
#52David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#47)
#53David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#50)
#54David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#51)
#55Andres Freund
andres@anarazel.de
In reply to: David Geier (#52)
#56David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#55)
#57David Geier
geidav.pg@gmail.com
In reply to: David Geier (#56)
#58David Geier
geidav.pg@gmail.com
In reply to: David Geier (#57)
#59vignesh C
vignesh21@gmail.com
In reply to: David Geier (#58)
#60vignesh C
vignesh21@gmail.com
In reply to: vignesh C (#59)
#61Andres Freund
andres@anarazel.de
In reply to: David Geier (#58)
#62Lukas Fittl
lukas@fittl.com
In reply to: Andres Freund (#61)
#63Andres Freund
andres@anarazel.de
In reply to: Lukas Fittl (#62)
#64Lukas Fittl
lukas@fittl.com
In reply to: Andres Freund (#63)
#65Michael Paquier
michael@paquier.xyz
In reply to: Lukas Fittl (#64)
#66Michael Paquier
michael@paquier.xyz
In reply to: Michael Paquier (#65)
#67David Geier
geidav.pg@gmail.com
In reply to: Lukas Fittl (#62)
#68Alvaro Herrera
alvherre@2ndquadrant.com
In reply to: Lukas Fittl (#64)
#69Robert Haas
robertmhaas@gmail.com
In reply to: Alvaro Herrera (#68)
#70Andres Freund
andres@anarazel.de
In reply to: David Geier (#67)
#71David Geier
geidav.pg@gmail.com
In reply to: Robert Haas (#69)
#72David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#70)
#73Lukas Fittl
lukas@fittl.com
In reply to: David Geier (#72)
#74Robert Haas
robertmhaas@gmail.com
In reply to: Lukas Fittl (#73)
#75David Geier
geidav.pg@gmail.com
In reply to: David Geier (#71)
#76David Geier
geidav.pg@gmail.com
In reply to: Robert Haas (#74)
#77Robert Haas
robertmhaas@gmail.com
In reply to: David Geier (#76)
#78David Geier
geidav.pg@gmail.com
In reply to: David Geier (#75)
#79Hannu Krosing
hannu@tm.ee
In reply to: Robert Haas (#77)
#80David Geier
geidav.pg@gmail.com
In reply to: Hannu Krosing (#79)
#81David Geier
geidav.pg@gmail.com
In reply to: David Geier (#80)
#82Lukas Fittl
lukas@fittl.com
In reply to: David Geier (#81)
#83Lukas Fittl
lukas@fittl.com
In reply to: Lukas Fittl (#82)
#84Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Lukas Fittl (#83)
#85Lukas Fittl
lukas@fittl.com
In reply to: Jakub Wartak (#84)
#86Andres Freund
andres@anarazel.de
In reply to: Jakub Wartak (#84)
#87David Geier
geidav.pg@gmail.com
In reply to: Lukas Fittl (#82)
#88David Geier
geidav.pg@gmail.com
In reply to: Lukas Fittl (#85)
#89David Geier
geidav.pg@gmail.com
In reply to: Andres Freund (#86)
#90Andres Freund
andres@anarazel.de
In reply to: Andres Freund (#86)
#91Ants Aasma
ants.aasma@cybertec.at
In reply to: Andres Freund (#90)
#92Lukas Fittl
lukas@fittl.com
In reply to: David Geier (#89)
#93Andres Freund
andres@anarazel.de
In reply to: Lukas Fittl (#92)
#94Lukas Fittl
lukas@fittl.com
In reply to: Andres Freund (#93)
#95Hannu Krosing
hannu@tm.ee
In reply to: Lukas Fittl (#94)
#96Lukas Fittl
lukas@fittl.com
In reply to: Hannu Krosing (#95)
#97Andreas Karlsson
andreas.karlsson@percona.com
In reply to: Hannu Krosing (#95)
#98David Geier
geidav.pg@gmail.com
In reply to: Lukas Fittl (#94)
#99Andres Freund
andres@anarazel.de
In reply to: David Geier (#98)
#100John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#99)
#101Lukas Fittl
lukas@fittl.com
In reply to: John Naylor (#100)
#102Lukas Fittl
lukas@fittl.com
In reply to: Andres Freund (#99)
#103John Naylor
john.naylor@enterprisedb.com
In reply to: Lukas Fittl (#101)
#104Jakub Wartak
jakub.wartak@enterprisedb.com
In reply to: Lukas Fittl (#102)
#105Andres Freund
andres@anarazel.de
In reply to: Lukas Fittl (#102)
#106John Naylor
john.naylor@enterprisedb.com
In reply to: Andres Freund (#105)
#107Lukas Fittl
lukas@fittl.com
In reply to: Andres Freund (#105)
#108Andres Freund
andres@anarazel.de
In reply to: Lukas Fittl (#107)
#109Lukas Fittl
lukas@fittl.com
In reply to: Andres Freund (#108)
#110John Naylor
john.naylor@enterprisedb.com
In reply to: Lukas Fittl (#107)
#111Andres Freund
andres@anarazel.de
In reply to: Lukas Fittl (#109)
#112Lukas Fittl
lukas@fittl.com
In reply to: Andres Freund (#111)